<a href="https://colab.research.google.com/github/brendanpshea/intro_cs/blob/main/Python_07_Data_Info_Knowledge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data, Information, and Knowledge
### Brendan Shea, PhD (Brendan.Shea@rctc.edu)

![Pandas for data analysis](https://github.com/brendanpshea/intro_cs/raw/main/images/panda_pic.png)


## Introduction


Data, information, and knowledge are fundamental concepts in computer science that describe the transformation of raw facts into useful insights. Let's examine each concept in more detail:


* **Data** refers to raw, unprocessed facts and figures that are typically collected, stored, and processed by computers. Data can come in various forms such as numbers, text, images, and audio. For example, a collection of temperatures in a city over several days is data.

* **Information** is data that has been processed, organized, or structured in a meaningful way. It is easier to understand and interpret than raw data. For instance, if we take the temperature data mentioned earlier and calculate the average temperature over a week, we have turned the data into information.

* **Knowledge**  is the understanding and interpretation of information, often gained through experience or learning. When we draw conclusions or make decisions based on information, we are applying knowledge. Continuing with the temperature example, using the average temperature information to predict suitable clothing for the weather demonstrates the application of knowledge.
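
The temperature example can be sketched in a few lines of Python. This is only an illustration: the temperature values are made up, and the thresholds in the hypothetical `suggest_clothing` function are arbitrary assumptions.

```python
from statistics import mean

# Data: raw daily high temperatures in degrees Celsius (made-up values)
daily_highs_c = [18.5, 21.0, 19.5, 23.0, 24.5, 22.0, 20.5]

# Information: the data summarized into a weekly average
weekly_average_c = mean(daily_highs_c)


# Knowledge: a decision rule applied to the information
# (the 15 and 22 degree thresholds are arbitrary assumptions)
def suggest_clothing(average_c):
    if average_c < 15:
        return "coat"
    if average_c < 22:
        return "light jacket"
    return "t-shirt"


print(f"Average: {weekly_average_c:.1f} C, suggestion: {suggest_clothing(weekly_average_c)}")
```

The list holds the data, the average is the information, and the clothing suggestion is where knowledge is applied.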

### How can computers help us transform data into knowledge?
Computers play a crucial role in transforming data into knowledge through several stages:

*Data Collection*. Computers can collect data from various sources, such as sensors, user input, or the internet. For example, a weather station collects temperature data from sensors, while an online survey collects user responses.

*Data Storage*. Data can be stored in various formats, such as flat files (CSV, TXT), relational databases (SQL), or semi-structured files (JSON, XML). Choosing the right storage format depends on the data's structure and the intended use.

*Data Processing*. Computers can process data using algorithms to extract meaningful information. For example, they can sort, filter, and aggregate data to generate insights. In our temperature example, calculating the average temperature is a data processing step.

*Data Visualization*.  Visualizations, such as graphs, charts, or maps, can help users understand information more easily. Computers can generate visualizations to help us identify patterns or trends in the data. For instance, a line chart can show temperature fluctuations over time.

*Knowledge Acquisition*. By interacting with processed information and visualizations, users can gain knowledge and make informed decisions. For example, analyzing temperature trends can help users decide the best time to visit a particular location.

## Examples
### Example 1: Exam Scores

Data: Individual exam scores of a group of students

Information: The average exam score and the distribution of scores (e.g., number of students in each grade range)

Knowledge: Identifying which students may need extra help based on their exam performance

### Example 2: Social Media Analytics

Data: Timestamps, likes, comments, and shares of social media posts

Information: Engagement metrics such as average likes, comments, and shares per post

Knowledge: Determining the optimal time to post for maximum engagement based on historical data

### Example 3: Sales Data

Data: Product IDs, quantities, prices, and timestamps of individual sales transactions

Information: Total revenue, best-selling products, and seasonal trends

Knowledge: Planning marketing campaigns and inventory management based on sales insights
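
As a rough sketch of Example 3, the snippet below turns a handful of transactions into information. The product IDs, quantities, and prices are hypothetical, and timestamps are left out to keep it short.

```python
from collections import Counter

# Data: individual sales transactions (hypothetical values)
transactions = [
    {"product_id": "A1", "quantity": 2, "price": 4.50},
    {"product_id": "B2", "quantity": 1, "price": 12.00},
    {"product_id": "A1", "quantity": 3, "price": 4.50},
    {"product_id": "C3", "quantity": 1, "price": 7.25},
]

# Information: total revenue and units sold per product
total_revenue = sum(t["quantity"] * t["price"] for t in transactions)
units_sold = Counter()
for t in transactions:
    units_sold[t["product_id"]] += t["quantity"]
best_seller, best_units = units_sold.most_common(1)[0]

# Knowledge: a decision someone might make from the information
print(f"Total revenue: {total_revenue:.2f}")
print(f"Restock {best_seller} first ({best_units} units sold)")
```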

## I. A Short History of Data
Long before the advent of modern computers, humans sought to organize, store, and analyze data to gain knowledge and understanding. The history of data before computers can be traced through the development of writing systems, the invention of counting devices, and the emergence of data models. This section explores the relationship between technologies and data models and how they served the goal of transforming data into knowledge.

### Early Writing Systems and the Birth of Recorded Data
The invention of writing systems around 3200 BCE marked the beginning of recorded data. The earliest writing systems, like Sumerian cuneiform and Egyptian hieroglyphs, were developed to record information on trade, taxes, and religious events. The emergence of these systems allowed for the storage and transmission of data, enabling humans to access information from past events and make better decisions.

### Counting Devices and Quantitative Data Analysis
Counting devices, such as the abacus, were instrumental in the development of quantitative data analysis. The abacus, invented around 2500 BCE, facilitated the representation and manipulation of large numbers, enabling humans to perform complex calculations. This technology laid the foundation for advanced mathematical models, such as those found in ancient Greece and India, which allowed for the analysis of quantitative data and the formulation of mathematical theories.

### The Birth of Data Models: Tables and Lists
As humans accumulated more data, they needed better ways to organize and make sense of it. This led to the invention of tables and lists. Early civilizations, like the ancient Egyptians, used tables to organize tax and inventory data, while lists were used for legal and administrative purposes. These data models provided a more systematic approach to data organization and retrieval, allowing for the efficient processing of information.

### Cartography and Spatial Data
The development of cartography, or map-making, allowed for the visualization and analysis of spatial data. Maps have been used for thousands of years to represent geographic information and to aid in navigation, exploration, and resource management. **Cartographic data models**, like grids and coordinate systems, enabled humans to analyze and interpret spatial relationships, ultimately leading to a deeper understanding of the world around them.

### Libraries and Indexing Systems
As the amount of recorded information grew, libraries emerged as repositories of knowledge. To effectively manage and access this vast wealth of data, librarians developed **indexing** systems, such as the Library of Alexandria's Pinakes and the later Dewey Decimal System. These systems enabled the organization and retrieval of information, allowing scholars to access and build upon the collective knowledge of past generations.

### The Printing Press and the Dissemination of Data
The invention of the printing press in the 15th century revolutionized the dissemination of data. This technology allowed for the mass production and distribution of books, making information more accessible to the general public. The printing press also facilitated the standardization of data formats and the development of scientific journals, promoting the sharing of research and the advancement of knowledge.


## Early Computers and Data Models

The period from the late 1800s to the early 1970s saw the development of early computing technologies and data models, which paved the way for the modern digital age. This section will discuss the key technologies and data models that emerged during this period, with a focus on their contribution to the evolution of computing and data management.

### Punched Card Tabulating Machines and the Birth of Machine-Readable Data

In the late 1800s, Herman Hollerith's invention of the punched card tabulating machine revolutionized data processing. Punched cards, which stored data in the form of perforations, allowed for the efficient storage and retrieval of information. This technology was widely adopted for census data processing, and later, for business applications. The punched card system laid the foundation for the development of machine-readable data and early data processing machines.

### The Turing Machine and the Concept of Universal Computing

In the 1930s, British mathematician Alan Turing proposed the concept of a universal machine, now known as the universal Turing machine. This theoretical model of computation showed that a single machine could carry out any computation that can be expressed as an algorithm, provided it is given a description of that algorithm as input. The idea of treating a program as data the machine reads, together with the machine's ability to read and write symbols, laid the groundwork for the development of programmable digital computers.

### Early Programmable Computers and Data Processing

During and after World War II, several programmable digital computers were developed, such as the Colossus, ENIAC, and UNIVAC. These early computers used technologies like vacuum tubes, magnetic drums, and core memory for data storage and processing. They enabled faster and more efficient calculations, particularly in the fields of cryptography, scientific research, and business data processing. This period also saw the emergence of programming languages like assembly and FORTRAN, which facilitated the development of complex algorithms and data processing tasks.

### Hierarchical and Network Data Models

As computers became more advanced, new data models were needed to better organize and manage large volumes of data. In the 1960s, the hierarchical and network data models emerged as popular solutions for managing data in early mainframe computers. The hierarchical model, as used in IBM's Information Management System (IMS), organized data in a tree-like structure, while the network model, standardized by the CODASYL consortium, allowed for more complex relationships between data elements. These models enabled the efficient storage, retrieval, and manipulation of data in large-scale computing systems.

### Edgar F. Codd and the Foundations of the Relational Model

In 1970, Edgar F. Codd, a computer scientist at IBM, proposed the relational model as an alternative to the hierarchical and network models. Codd's model was based on the concept of a mathematical relation, and it used a simple, tabular format to represent data. The relational model allowed for more flexible data organization and easier querying, setting the stage for the development of relational database management systems (RDBMS) and the SQL language in the following years.

### Summary Table

| Rough Dates | Important Technologies | Data Models | Applications |
| --- | --- | --- | --- |
| Ancient Times | Scrolls, Clay Tablets, Manuscripts | Hieroglyphics, Cuneiform, Alphabets | Record-keeping, Literature, Communication |
| 15th-19th Century | Paper, Printing Press | Tabular Records, Ledgers | Government Records, Accounting, Libraries |
| Mid 20th Century | Magnetic Tapes, Punch Cards | Electronic Flat Files | Early Computing, Business Data Processing |
| 1970s-1990s | Hard Disk Drives, Relational Databases | Relational Model | Enterprise Data Management, Software Development |
| Late 1990s-2000s | Solid-State Drives, NoSQL Databases | Key-Value, Document, Column-Family, Graph | Web Applications, Social Networks, IoT |
| 2010s-Present | Cloud Storage, Big Data Frameworks | Hadoop, Stream Processing, Machine Learning | Big Data Analytics, Real-time Processing, AI |

## Discussion Questions: A History of Data
1. How did early writing systems enable humans to transform data into knowledge? Reflect on how written records have helped societies learn from past events and make informed decisions.

2. Consider how counting devices, such as the abacus, have contributed to the development of advanced mathematical models. How does the ability to perform complex calculations facilitate the transformation of data into knowledge in various fields?

3. Discuss how early data models like tables and lists have helped people make sense of data and gain knowledge. Provide examples from ancient civilizations and reflect on how similar models are used in your daily life.

4. Describe the impact of cartography and map-making on the understanding of spatial data. How has the ability to analyze and interpret spatial relationships contributed to the growth of human knowledge?

5. Based on your own experiences, discuss how the use of modern data models and technologies has helped you transform data into knowledge. Provide examples from your daily life, including school, work, or personal projects.

## Discussion Questions: Your Answers

1. 

2. 

3.

4.

5.


## II. Modern Flat Files
The tabular organization of data in pre-computer times persisted and laid the foundation for modern flat file formats. Early computers used technologies such as punched cards and magnetic tapes to store data in machine-readable formats, which eventually transitioned into digital storage mediums like disks and solid-state drives. The concept of organizing data in tables, a fundamental aspect of the relational model, has remained a central theme in data management.

In contemporary data management, the term **flat files** refers to digital files that store data in a tabular format. These files are often used for simple data storage and exchange, without the need for a full-fledged database system. Common modern flat file formats include:

- Tables: A basic unit of data organization, where data is stored in rows and columns with a consistent schema.
- Spreadsheets: A widely-used data management tool, spreadsheets (such as Microsoft Excel or Google Sheets) provide an interactive interface for organizing and manipulating tabular data.
- CSV (Comma-Separated Values) files: A simple and widely-adopted text-based file format for storing tabular data. CSV files use commas to separate values and newline characters to indicate new rows.

## Data Manipulation with Pandas
Pandas is a popular open-source library for the Python programming language, designed to provide easy-to-use data structures and data analysis tools for handling tabular data. Pandas enables users to load, manipulate, and analyze modern flat files, such as CSV files, in a powerful and efficient manner. By using Pandas, users can perform complex data manipulation tasks, such as filtering, sorting, aggregating, and transforming data, with just a few lines of code.

## Comma-Separated Values (CSV)

CSV files store data in a plain text format, with each line representing a record (row) and individual fields (columns) separated by commas. They are easy to create, read, and write using various tools, including text editors, spreadsheets, and programming languages like Python.

Example of a CSV file:


```
id,name,style,abv
1,Pliny the Elder,Double IPA,8.0
2,Heady Topper,Double IPA,8.0
3,Two Hearted Ale,American IPA,7.0

```
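
To illustrate how a program reads this format, here is a minimal sketch using Python's built-in `csv` module on the same text. The variable names are just for illustration.

```python
import csv
import io

# The same CSV text as in the example above
csv_text = """id,name,style,abv
1,Pliny the Elder,Double IPA,8.0
2,Heady Topper,Double IPA,8.0
3,Two Hearted Ale,American IPA,7.0
"""

# DictReader uses the first line as the column names
reader = csv.DictReader(io.StringIO(csv_text))
beers = list(reader)

# Every field is read as text, so numbers must be converted explicitly
for beer in beers:
    beer["abv"] = float(beer["abv"])

print(beers[0]["name"], beers[0]["abv"])
```

Note that the file itself carries no type information: `abv` only becomes a number because the code converts it.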


## Flat Files: Use Cases

Common use cases for modern flat files include:


1. Storing and exchanging small to medium-sized datasets: Flat files, such as CSV, are suitable for storing and sharing datasets that aren't overly large or complex. They are easy to create, read, and write, making them a popular choice for exchanging data between different users or systems.

2. Sharing data between different software applications or programming languages: Since flat files are essentially plain text files, they can be easily read and written by various software applications (e.g., Excel, R, SAS) and programming languages (e.g., Python, Java, JavaScript). This makes flat files an ideal choice for sharing data between diverse systems and platforms.

3. Importing and exporting data to and from databases: Flat files can serve as an intermediate format when transferring data between databases and other applications. For instance, you can export data from a database to a CSV file, manipulate the data using a spreadsheet application, and then import the modified data back into the database.
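
The third use case can be sketched with Python's built-in `sqlite3` and `csv` modules. This is a simplified illustration that assumes a temporary in-memory database and hypothetical table names (`beers`, `beers_copy`); a real workflow would write the CSV to a file on disk.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE beers (id INTEGER PRIMARY KEY, name TEXT, abv REAL)")
conn.executemany(
    "INSERT INTO beers VALUES (?, ?, ?)",
    [(1, "Pliny the Elder", 8.0), (2, "Two Hearted Ale", 7.0)],
)

# Export: write the query results as CSV text (header row first)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "abv"])
writer.writerows(conn.execute("SELECT id, name, abv FROM beers ORDER BY id"))
exported = buffer.getvalue()

# Import: read the CSV text back into a second table
conn.execute("CREATE TABLE beers_copy (id INTEGER PRIMARY KEY, name TEXT, abv REAL)")
rows = list(csv.DictReader(io.StringIO(exported)))
conn.executemany("INSERT INTO beers_copy VALUES (:id, :name, :abv)", rows)

copied = conn.execute("SELECT COUNT(*) FROM beers_copy").fetchone()[0]
print(exported)
print("Rows imported:", copied)
```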

### Advantages:

1. Human-readable and easy to understand: Flat files store data in a plain text format that is easily understandable by humans. This allows users to quickly inspect and validate the data, making it a convenient choice for data storage and sharing.

2. Simple to create and edit using various tools: Creating and editing flat files is straightforward, as they can be opened and modified using a wide range of tools, from basic text editors to advanced spreadsheet applications. This versatility makes flat files accessible to users with varying levels of technical expertise.

3. Wide support across different platforms and software: Flat files, especially CSV files, enjoy widespread support across different operating systems, software applications, and programming languages. This broad compatibility ensures that flat files can be easily used in diverse environments without requiring additional software or libraries.

### Disadvantages:

1. Limited support for complex data structures and relationships: Flat files are not well-suited for representing complex data structures, such as hierarchical or relational data. They lack features like data typing, constraints, and relationships, which are essential for maintaining data integrity and consistency in more complex datasets.

2. Inefficient for large datasets due to lack of compression and indexing: Flat files can become unwieldy and slow to process when dealing with large datasets. They do not support compression or indexing, which can significantly increase the time required for reading, writing, and searching data. This makes them less suitable for big data or high-performance applications.

3. No standardized way to represent missing or special values: Flat files do not have a consistent, standardized method for representing missing or special values (e.g., null, NaN). This can lead to ambiguity and confusion when interpreting the data, particularly when sharing files between different systems or users. Users must often rely on ad hoc conventions or metadata to convey information about missing or special values.
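
The missing-value problem is easy to reproduce. In the sketch below, a hypothetical file uses three different markers for "missing" (an empty field, the word `unknown`, and `-999`), and the `na_values` parameter of `pd.read_csv` is used to tell pandas about the conventions it would not recognize on its own.

```python
import io

import pandas as pd

# Hypothetical file in which three different markers all mean "missing"
raw = """name,abv
Beer A,8.0
Beer B,
Beer C,unknown
Beer D,-999
"""

# Without hints, "unknown" prevents the abv column from being read as numbers
naive = pd.read_csv(io.StringIO(raw))

# na_values lists extra strings to treat as missing for this file
cleaned = pd.read_csv(io.StringIO(raw), na_values=["unknown", "-999"])

print(naive["abv"].tolist())
print(cleaned["abv"].tolist())
print("Missing values:", cleaned["abv"].isna().sum())
```

Which markers count as missing is a convention of each file, so it has to be communicated separately from the data.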




## Example: Pandas
![Pandas logo](https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg)

In this example, we will use Python and Google Colab to read a CSV file containing information on 39 craft beers. We will use the pandas library, which is a powerful data manipulation and analysis tool. You can begin by "running" the following cell, which will save the craft beer data as a CSV file and then load it into a Pandas **data frame.**

In [None]:
# Please run this cell. Don't alter it!
import pandas as pd # import pandas library

# Save beer data to disk
csv_data = """id,name,brewery,style,abv,ibu,state_country,calories
2,Heady Topper,The Alchemist,IPA,8.0,120,VT,240
3,Two Hearted Ale,Bell's Brewery,IPA,7.0,55,MI,210
4,Zombie Dust,3 Floyds Brewing Co.,IPA,6.2,50,IN,186
5,Sip of Sunshine,Lawson's Finest Liquids,IPA,8.0,65,VT,240
6,Old Rasputin,North Coast Brewing Co.,Stout,9.0,75,CA,270
7,Speedway Stout,AleSmith Brewing Company,Stout,12.0,70,CA,360
8,The Abyss,Deschutes Brewery,Stout,11.0,86,OR,330
9,Péché Mortel,Brasserie Dieu du Ciel!,Stout,9.5,85,Canada,285
10,Founders Breakfast Stout,Founders Brewing Company,Stout,8.3,60,MI,249
11,Weihenstephaner Hefeweissbier,Bayerische Staatsbrauerei Weihenstephan,Wheat Beer,5.4,14,Germany,162
12,Allagash White,Allagash Brewing Company,Wheat Beer,5.2,13,ME,156
13,Blanche de Chambly,Unibroue,Wheat Beer,5.0,10,Canada,150
14,Gumballhead,3 Floyds Brewing Co.,Wheat Beer,5.6,35,IN,168
15,Aventinus,Schneider Weisse G. Schneider & Sohn,Wheat Beer,8.2,16,Germany,246
16,Anchor Steam Beer,Anchor Brewing Company,Lager,4.9,33,CA,147
17,Samuel Adams Boston Lager,Boston Beer Company (Samuel Adams),Lager,5.0,30,MA,150
18,Sierra Nevada Pale Lager,Sierra Nevada Brewing Co.,Lager,5.6,38,CA,168
19,Brooklyn Lager,Brooklyn Brewery,Lager,5.2,33,NY,156
20,Red Stripe,Jamaica Breweries,Lager,4.7,17,Jamaica,141
21,Surly Furious,Surly Brewing Company,IPA,6.6,99,MN,198
22,Toppling Goliath PseudoSue,Toppling Goliath Brewery,IPA,6.8,50,IA,204
23,Summit Extra Pale Ale,Summit Brewing Company,IPA,5.2,45,MN,156
24,New Glarus Moon Man,New Glarus Brewing Company,IPA,5.0,28,WI,150
25,Todd the Axe Man,Surly Brewing Company,IPA,7.2,65,MN,216
26,Darkness,Surly Brewing Company,Stout,12.0,85,MN,360
27,New Glarus Coffee Stout,New Glarus Brewing Company,Stout,5.7,34,WI,171
28,Barrel-Aged Silhouette,Lift Bridge Brewery,Stout,10.0,50,MN,300
29,War & Peace,Fulton Beer,Stout,9.5,85,MN,285
30,CocO Stout,West O Beer,Stout,6.0,28,IA,180
31,Spotted Cow,New Glarus Brewing Company,Wheat Beer,4.8,18,WI,144
32,Boulevard Wheat,Boulevard Brewing Co.,Wheat Beer,4.4,14,MO,132
33,Alluvial Hefeweizen,Alluvial Brewing Company,Wheat Beer,5.1,12,IA,153
34,Two Women Lager,New Glarus Brewing Company,Wheat Beer,5.0,10,WI,150
35,Haywire Hefeweizen,Third Street Brewhouse,Wheat Beer,4.5,15,MN,135
36,Grain Belt Premium,August Schell Brewing Company,Lager,4.6,12,MN,138
37,New Glarus Two Women,New Glarus Brewing Company,Lager,5.0,34,WI,150
38,Hell,Surly Brewing Company,Lager,4.5,20,MN,135
39,Iowa Gold,Exile Brewing Co.,Lager,4.8,15,IA,144
40,Staghorn Octoberfest,New Glarus Brewing Company,Lager,6.2,25,WI,186

"""

with open("craft_beers.csv", "w") as file:
    file.write(csv_data)

beer_data = pd.read_csv('craft_beers.csv', index_col = 0)

### Pandas: Display the First Few Rows
To get a quick glimpse of the data, we can use the `head()` method to display the first few rows.

In [None]:
beer_data.head(5)

### Pandas: Get Summary Statistics
To get a summary of the numerical columns in the dataframe, we can use the `describe()` method.

In [None]:
beer_data.describe()

### Pandas: Select a Single Column
To select a single column from the dataframe, you can use the column name in square brackets.

In [None]:
beer_data['brewery'].head(5)

### Pandas: Filter Rows Based on a Condition
You can filter rows based on a specific condition using boolean indexing. For example, to get all rows with an ABV greater than 7.0:

In [None]:
beer_data[beer_data['abv'] > 7.0].head(5)
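
Conditions can also be combined. The sketch below uses a small hypothetical DataFrame (`sample`) with the same column names as the beer data, so it runs on its own. Note that pandas uses `&` and `|` rather than `and` and `or`, and each condition needs its own parentheses.

```python
import pandas as pd

# A small hypothetical sample with the same column names as the beer data
sample = pd.DataFrame({
    "name": ["Beer A", "Beer B", "Beer C", "Beer D"],
    "style": ["IPA", "Stout", "IPA", "Lager"],
    "abv": [8.0, 9.0, 6.2, 4.9],
})

# Use & for "and", | for "or", and wrap each condition in parentheses
strong_ipas = sample[(sample["style"] == "IPA") & (sample["abv"] > 7.0)]

# isin() is a compact way to match any value from a list
stout_or_lager = sample[sample["style"].isin(["Stout", "Lager"])]

print(strong_ipas)
print(stout_or_lager)
```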


### Pandas: Sort Rows by a Column

To sort the dataframe by a specific column, use the `sort_values()` method. For example, to sort by ABV from highest to lowest:

In [None]:
beer_data.sort_values(by='abv', ascending=False).head(5)


### Pandas: Group By and Aggregate
You can group the data by a specific column and then perform aggregation operations on the grouped data. For example, to get the average ABV for each style:


In [None]:
beer_data.groupby('style')['abv'].mean()

### Pandas: Add a New Column
You can add a new column to the dataframe by assigning values to a new column name. The `abv` column is already stored as a percentage (8.0 means 8%), so as an example we add a column with ABV as a decimal fraction:

In [None]:
beer_data['abv_fraction'] = beer_data['abv'] / 100
beer_data.head()

### Pandas: Make a visualization
Pandas provides lots of great ways to visualize data, including:

* `df[column].hist()` for single numerical variables
* `df.plot.scatter(x=column1, y=column2)` for relationships between two numerical variables
* `df.boxplot(column=column1, by=column2)` for showing relationships between categorical and numerical data

In [None]:
# Example: histogram
beer_data['abv'].hist()


In [None]:
# Example: scatter plot
beer_data.plot.scatter(x="abv", y="ibu")

In [None]:
# Create a grouped boxplot for ABV by style using pandas
beer_data.boxplot(column='abv', by='style')



## Exercises: Pandas
1.  Load the beer data into a Pandas DataFrame

    Task: Load the CSV data into a Pandas DataFrame and display the first 5 rows.

    Hint: Use `pd.read_csv` to load the data and `DataFrame.head()` to display the first 5 rows.

2.  Display the summary statistics of the DataFrame

    Task: Use a method to display the summary statistics for the numeric columns in the DataFrame.

    Hint: Use the `DataFrame.describe()` method to generate the summary statistics.

3.  Find the number of unique breweries in the dataset

    Task: Find the number of unique breweries present in the dataset.

    Hint: Select the 'brewery' column and use the `Series.nunique()` method on it.

4.  Filter the DataFrame to show only IPAs

    Task: Filter the beer_data DataFrame to show only the rows where the 'style' column is 'IPA'.

    Hint: Use Boolean indexing to filter the DataFrame based on the 'style' column.

5.  Sort the DataFrame by ABV (Alcohol by Volume) in descending order

    Task: Sort the DataFrame by the 'abv' column in descending order.

    Hint: Use the `DataFrame.sort_values()` method with the 'abv' column and set `ascending=False`.

6.  Calculate the average ABV for each brewery

    Task: Calculate the average ABV for each brewery in the dataset.

    Hint: Use the `DataFrame.groupby()` method with the 'brewery' column and the `GroupBy.mean()` method on the 'abv' column.

7.  Find the most bitter beer for each style

    Task: Find the most bitter beer (the one with the highest IBU, or International Bitterness Units) for each style in the dataset.

    Hint: Use the `DataFrame.groupby()` method with the 'style' column and the `GroupBy.idxmax()` method on the 'ibu' column.

8.  Add a column for ABV category

    Task: Create a new column called 'abv_category' in the DataFrame, categorizing beers into 'Low', 'Medium', and 'High' ABV categories based on the 'abv' column.

    Hint: Use the `Series.apply()` method on the 'abv' column with a custom function to categorize the ABV values.

9.  Count the number of beers in each ABV category

    Task: Count the number of beers in each ABV category ('Low', 'Medium', 'High').

    Hint: Use the `Series.value_counts()` method on the 'abv_category' column.

10. Create a bar plot of the number of beers per state/country

    Task: Create a bar plot showing the number of beers in the dataset for each state/country.

    Hint: Use the `Series.value_counts()` method on the 'state_country' column and the `Series.plot()` method with `kind='bar'`.

In [None]:
# Exercise 1

In [None]:
# Exercise 2

In [None]:
# Exercise 3

In [None]:
# Exercise 4

In [None]:
# Exercise 5

In [None]:
# Exercise 6

In [None]:
# Exercise 7

In [None]:
# Exercise 8

In [None]:
# Exercise 9

In [None]:
# Exercise 10

## III. The Relational Model and SQL
The **relational model**, introduced by Edgar F. Codd in 1970, revolutionized data management by providing a more intuitive, flexible, and efficient way of organizing and querying data. In this section, we will delve deeper into the relational model, explore its key principles and features, and discuss the technological advances that facilitated its widespread adoption.

###  Principles and Features of the Relational Model

The relational model is based on the concept of mathematical relations, where data is represented as a set of tuples (rows) in a table. Each tuple consists of a fixed number of attributes (columns), and the table enforces a consistent schema across all tuples. The key principles and features of the relational model include:

-   Tables (relations): The basic unit of data organization in the relational model. Tables consist of rows (tuples) and columns (attributes), and each column has a specific data type.

-   Keys: Unique identifiers for tuples within a table. Primary keys are used to uniquely identify each row, while foreign keys establish relationships between tables.

-   Integrity constraints: Rules that enforce data consistency and validity within the database. These include domain constraints, entity integrity constraints, and referential integrity constraints.

-   Relational algebra: A set of mathematical operations (e.g., selection, projection, join) that can be applied to tables to manipulate and query data.

-   Normalization: A process of organizing data within tables to minimize redundancy and improve data integrity.

###  Examples of the Relational Model in Action

Consider a simple example of a library database, which contains information about books and authors. The relational model would represent this data using two tables:

-   Authors table (AuthorID, FirstName, LastName)
-   Books table (BookID, Title, AuthorID, PublicationYear)

The Authors table has a primary key (AuthorID) to uniquely identify each author, and the Books table has a primary key (BookID) to uniquely identify each book. The AuthorID in the Books table serves as a foreign key, establishing a relationship between the Books and Authors tables.

Using relational algebra operations, one could query the database to answer questions such as:

-   Find all books written by a specific author.
-   List all authors who have published books in a specific year.
-   Retrieve the titles and publication years of all books in the library.
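
The library example can be tried out with Python's built-in `sqlite3` module. The sketch below is one possible way to set it up: it assumes a temporary in-memory database, uses SQL rather than relational algebra notation, and fills the tables with a few well-known books as sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, gone when the program ends
conn.executescript("""
CREATE TABLE Authors (
    AuthorID INTEGER PRIMARY KEY,
    FirstName TEXT,
    LastName TEXT
);
CREATE TABLE Books (
    BookID INTEGER PRIMARY KEY,
    Title TEXT,
    AuthorID INTEGER REFERENCES Authors(AuthorID),
    PublicationYear INTEGER
);
INSERT INTO Authors VALUES (1, 'Jane', 'Austen'), (2, 'Mary', 'Shelley');
INSERT INTO Books VALUES
    (1, 'Pride and Prejudice', 1, 1813),
    (2, 'Emma', 1, 1815),
    (3, 'Frankenstein', 2, 1818);
""")

# Find all books written by a specific author (a join plus a selection)
austen_titles = [row[0] for row in conn.execute(
    """SELECT Books.Title FROM Books
       JOIN Authors ON Books.AuthorID = Authors.AuthorID
       WHERE Authors.LastName = ? ORDER BY Books.Title""", ("Austen",))]

# List all authors who published a book in a specific year
authors_1818 = [row[0] for row in conn.execute(
    """SELECT DISTINCT Authors.LastName FROM Authors
       JOIN Books ON Books.AuthorID = Authors.AuthorID
       WHERE Books.PublicationYear = ?""", (1818,))]

# Retrieve the titles and publication years of all books (a projection)
all_books = conn.execute(
    "SELECT Title, PublicationYear FROM Books ORDER BY BookID").fetchall()

print(austen_titles)
print(authors_1818)
print(all_books)
```

The foreign key `AuthorID` is what lets the first two queries connect a book to its author without storing the author's name in every book row.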

### Technological Advances That Enabled the Relational Model

Several technological advances made the adoption of the relational model possible, including:

-   Increased processing power: The development of more powerful CPUs allowed for faster execution of complex relational operations, making the relational model a viable option for large-scale data management.

-   Disk storage advancements: The growth of disk storage capacity and improvements in data retrieval performance enabled databases to store and manage larger volumes of data, which was crucial for the adoption of the relational model.

-   Memory management: Advancements in memory management techniques, such as caching and buffering, improved the performance of relational databases, making them more suitable for real-world applications.

-   Development of relational database management systems (RDBMS): The emergence of RDBMS, such as IBM's System R and Oracle, provided practical implementations of the relational model and facilitated its adoption in various industries.

-   The Structured Query Language (SQL): The development of SQL, a standardized language for querying and manipulating relational databases, made the relational model more accessible and contributed to its widespread acceptance.

## Relational databases: Use Cases

Common use cases for relational databases include:

1. Managing structured data, such as inventory, customer information, or financial transactions, where relationships between entities are crucial
2. Applications requiring complex queries, data analysis, or reporting
3. Handling data with many-to-many or one-to-many relationships, which are difficult to represent in flat files
4. Ensuring data integrity and consistency through transactions, constraints, and normalization, which is not easily achievable with flat files

### Advantages:

1. Robust query capabilities with SQL (Structured Query Language), allowing for powerful and flexible data retrieval and manipulation
2. Data integrity through constraints, transactions, and normalization, which helps maintain consistency and avoid data redundancy
3. Scalable and efficient handling of large datasets, as relational databases are designed for optimized storage and querying
4. Widely adopted and supported across various platforms and programming languages, making it easier to integrate with different tools and technologies


### Disadvantages:

1. Can be more complex to set up and maintain compared to flat files, due to the need for designing a suitable database schema and managing the database server
2. Less flexible in handling unstructured or semi-structured data, as relational databases are optimized for structured data with well-defined relationships
3. May require a dedicated server or significant resources for large-scale deployments, whereas flat files can be easily stored and shared using simple file systems
4. Can be overkill for small projects or simple data storage needs, where the overhead of a relational database may outweigh its benefits

In the next subsections, we will explore SQL, a powerful query language used to interact with relational databases, and work through an SQLite example to understand how to create a table, insert data, retrieve data, and update or delete data within a relational database.

## A Short Guide to SQL for Absolute Beginners
![SQLite Logo](https://upload.wikimedia.org/wikipedia/commons/3/38/SQLite370.svg)

Structured Query Language (SQL) is the standard language for managing and querying relational databases. SQL allows you to perform various operations, such as creating and modifying tables, inserting and updating data, and retrieving information based on specific conditions.

This short guide will provide an overview of basic CRUD (Create, Read, Update, Delete) operations in SQL for absolute beginners.

### CREATE: Creating Tables
To create a new table, use the CREATE TABLE command followed by the table name and a list of columns with their respective data types and constraints.




```
CREATE TABLE table_name (
    column1 datatype constraint,
    column2 datatype constraint,
    ...
);
```

For example, to create a simple users table:

```
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT UNIQUE NOT NULL
);
```

### INSERT: Adding Data
To insert data into a table, use the INSERT INTO command followed by the table name, a list of column names (optional), and the VALUES keyword with a list of values for each column.

```
INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...);
```
For example, to insert a new user into the users table:

```
INSERT INTO users (id, name, email)
VALUES (1, 'John Doe', 'john.doe@example.com');
```
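
The same INSERT can also be issued from Python. The following is a minimal sketch, assuming Python's built-in `sqlite3` module and a throwaway in-memory database. The second user ("Johnny Doe") is a hypothetical row added only to show what the `UNIQUE` constraint on `email` does.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT UNIQUE NOT NULL
    )
""")

# "?" placeholders pass the values separately from the SQL text.
conn.execute(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    (1, "John Doe", "john.doe@example.com"),
)

# Hypothetical second row that reuses the same email address.
# The UNIQUE constraint should reject it with an IntegrityError.
try:
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
        (2, "Johnny Doe", "john.doe@example.com"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(duplicate_rejected, user_count)  # True 1
```

Passing values through placeholders, rather than building the SQL string by hand, avoids quoting mistakes and is the usual way to insert data that comes from user input.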

### SELECT: Querying Data
To retrieve data from a table, use the SELECT command followed by a list of columns (or * for all columns), the FROM keyword, and the table name. You can also add a WHERE clause to filter rows based on specific conditions, and use JOIN clauses to combine data from multiple tables.

```
SELECT column1, column2, ...
FROM table_name
WHERE condition;
```

For example, to select all users with an email containing 'example.com':
```
SELECT * FROM users
WHERE email LIKE '%example.com%';
```

### UPDATE: Modifying Data
To update data in a table, use the UPDATE command followed by the table name, the SET keyword with a list of columns and their new values, and a WHERE clause to specify the rows to update.

```
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
```

For example, to update the name of a user with id = 1:

```
UPDATE users
SET name = 'Jane Doe'
WHERE id = 1;
```

### DELETE: Removing Data
To delete data from a table, use the DELETE FROM command followed by the table name and a WHERE clause to specify the rows to delete.

```
DELETE FROM table_name
WHERE condition;
```

For example, to delete a user with id = 1:

```
DELETE FROM users
WHERE id = 1;
```
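
To see how many rows an UPDATE or DELETE actually touched, one option is to run the statements from Python. This is a sketch under the assumption that the built-in `sqlite3` module is used with an in-memory database. The second user ("Ann Roe") is hypothetical and exists only so that one row is left over at the end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE NOT NULL)"
)
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@example.com"),
        (2, "Ann Roe", "ann.roe@example.com"),  # hypothetical second user
    ],
)

# UPDATE with a WHERE clause changes only the matching row.
updated = conn.execute(
    "UPDATE users SET name = ? WHERE id = ?", ("Jane Doe", 1)
).rowcount

# DELETE with a WHERE clause removes only the matching row.
deleted = conn.execute("DELETE FROM users WHERE id = ?", (1,)).rowcount

remaining = conn.execute("SELECT id, name FROM users").fetchall()
print(updated, deleted, remaining)  # 1 1 [(2, 'Ann Roe')]
```

Leaving out the WHERE clause would apply the UPDATE or DELETE to every row in the table, so checking the affected row count is a cheap safety habit.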

These basic CRUD operations form the foundation of SQL and allow you to manage and query data in relational databases. As you become more familiar with SQL, you can explore more advanced features, such as aggregate functions, subqueries, and transactions, to further enhance your database capabilities.

## SQLite
SQLite is a lightweight, serverless, self-contained relational database management system (RDBMS) that is widely used for its simplicity and ease of use. It is an excellent choice for small-scale projects, development, and testing, as it stores the entire database in a single file and does not require a dedicated server or complex setup. SQLite is also well-supported by Python and can be easily accessed using Jupyter notebooks.

In this example, we will use SQLite to create a two-table database about breweries and the beer they make. We will walk through the process of creating the tables, inserting data, and querying the database using SQL commands.

First, let's load the SQL extension and set up a connection to the SQLite database:

In [None]:
# Connect to SQLite
!pip install SQLAlchemy==1.3.24 # Downgrade to avoid problems with more recent version
%load_ext sql
%sql sqlite://

### Creating the tables

We will create two tables: breweries and beers. The breweries table will contain information about each brewery, and the beers table will contain information about each beer. Each beer will have a foreign key referencing the brewery that produces it.

In [None]:
%%sql
-- Create the breweries table
CREATE TABLE breweries ( 
    id INTEGER PRIMARY KEY, -- a primary key is a unique identifier for each row
    name TEXT NOT NULL, -- not null means that the column cannot be empty
    city TEXT,
    state TEXT
);

In [None]:
%%sql
-- Create the beers table
CREATE TABLE beers (
    id INTEGER PRIMARY KEY, -- a primary key is a unique identifier for each row
    name TEXT NOT NULL,
    style TEXT,
    abv REAL, -- a real number is a floating point number
    ibu INTEGER,
    brewery_id INTEGER,
    -- a foreign key is a reference to another table
    FOREIGN KEY (brewery_id) REFERENCES breweries (id)
);
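
One detail worth knowing: SQLite accepts the `FOREIGN KEY` clause but, by default, does not enforce it unless `PRAGMA foreign_keys = ON` is set on the connection. The sketch below illustrates this with Python's built-in `sqlite3` module rather than the `%%sql` cells used in this notebook. The beer "Orphan Ale" and brewery id 99 are hypothetical values chosen to trigger the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask SQLite to enforce FOREIGN KEY constraints on this connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE breweries ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT, state TEXT)"
)
conn.execute("""
    CREATE TABLE beers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        style TEXT,
        abv REAL,
        ibu INTEGER,
        brewery_id INTEGER,
        FOREIGN KEY (brewery_id) REFERENCES breweries (id)
    )
""")

conn.execute(
    "INSERT INTO breweries (id, name, city, state) "
    "VALUES (1, 'Schell Brewery', 'New Ulm', 'MN')"
)
# brewery_id 1 exists, so this insert is accepted.
conn.execute(
    "INSERT INTO beers (id, name, brewery_id) VALUES (1, 'Schell Firebrick', 1)"
)

# Hypothetical beer pointing at a brewery id that does not exist.
try:
    conn.execute(
        "INSERT INTO beers (id, name, brewery_id) VALUES (2, 'Orphan Ale', 99)"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)  # True
```

Without the pragma, the second insert would normally succeed and leave a beer that references a missing brewery.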


### Inserting Data into the Tables

Now that we have created the breweries and beers tables, let's insert some sample data into them. We will insert the brewery information first and then insert the beer information, referencing the corresponding brewery's ID.

In [None]:
%%sql
-- Insert sample data into the breweries table
INSERT INTO breweries (id, name, city, state)
VALUES (1, 'Schell Brewery', 'New Ulm', 'MN'),
       (2, 'Forager Brewery', 'Rochester', 'MN'),
       (3, 'Surly Brewing', 'Minneapolis', 'MN');


In [None]:
%%sql
-- Insert sample data into the beers table
INSERT INTO beers (id, name, style, abv, ibu, brewery_id)
VALUES (1, 'Schell Firebrick', 'Vienna Lager', 5.0, 20, 1),
       (2, 'Schell Deer Brand', 'American Adjunct Lager', 4.8, 15, 1),
       (3, 'Forager Humble Bumble', 'Brown Ale', 5.5, 25, 2),
       (4, 'Forager Pudding Goggles', 'Imperial Stout', 9.0, 50, 2),
       (5, 'Surly Furious', 'IPA', 6.7, 100, 3),
       (6, 'Surly Bender', 'American Brown Ale', 5.1, 45, 3),
       (7, 'Schell Noble Star', 'Berliner Weisse', 3.9, 5, 1),
       (8, 'Forager Sherpa', 'IPA', 6.8, 75, 2),
       (9, 'Surly Coffee Bender', 'American Brown Ale', 5.1, 45, 3),
       (10, 'Surly Hell', 'German Helles Lager', 4.5, 20, 3);


### Querying the Database

With the data inserted into our tables, we can now query the database using SQL commands. In this section, we will demonstrate some common queries such as selecting all rows from a table, filtering rows based on a condition, and joining tables to retrieve combined data.


In [None]:
%%sql
-- Select all rows from the breweries table
SELECT * FROM breweries;

In [None]:
%%sql
-- Select specific columns from a table
SELECT name, city, state FROM breweries;

In [None]:
%%sql
-- Select unique values from a column
SELECT DISTINCT state FROM breweries;


In [None]:
%%sql
-- Filter rows using a WHERE clause
SELECT * FROM beers WHERE abv > 5.0;

In [None]:
%%sql
--Combine multiple conditions using AND and OR
SELECT * FROM beers WHERE abv > 5.0 AND ibu < 50;

In [None]:
%%sql
--Sort the result using the ORDER BY clause
SELECT * FROM beers ORDER BY abv DESC;

In [None]:
%%sql
--Limit the number of rows returned
SELECT * FROM beers ORDER BY abv DESC LIMIT 3;


In [None]:
%%sql
--Join multiple tables using INNER JOIN
SELECT beers.name, breweries.name as brewery_name
FROM beers
INNER JOIN breweries ON beers.brewery_id = breweries.id;


In [None]:
%%sql
--Group rows and apply aggregate functions
SELECT breweries.name, COUNT(*) as num_beers
FROM beers
INNER JOIN breweries ON beers.brewery_id = breweries.id
GROUP BY  breweries.name;


In [None]:
%%sql
--Filter groups using the HAVING clause
SELECT brewery_id, COUNT(*) as num_beers
FROM beers
GROUP BY brewery_id
HAVING num_beers > 1;


### Table: SELECT in SQL
| S.No | Concept | Example SQL Query |
| --- | --- | --- |
| 1 | Select all columns | SELECT * FROM breweries; |
| 2 | Select specific columns | SELECT name, city, state FROM breweries; |
| 3 | Select unique values | SELECT DISTINCT state FROM breweries; |
| 4 | Filter rows (WHERE) | SELECT * FROM beers WHERE abv > 5.0; |
| 5 | Combine conditions (AND/OR) | SELECT * FROM beers WHERE abv > 5.0 AND ibu < 50; |
| 6 | Sort results (ORDER BY) | SELECT * FROM beers ORDER BY abv DESC; |
| 7 | Limit rows returned (LIMIT) | SELECT * FROM beers ORDER BY abv DESC LIMIT 3; |
| 8 | Join tables (INNER JOIN) | SELECT beers.name, breweries.name as brewery_name FROM beers INNER JOIN breweries ON beers.brewery_id = breweries.id; |
| 9 | Group rows (GROUP BY) | SELECT brewery_id, COUNT(*) as num_beers FROM beers GROUP BY brewery_id; |
| 10 | Filter groups (HAVING) | SELECT brewery_id, COUNT(*) as num_beers FROM beers GROUP BY brewery_id HAVING num_beers > 1; |
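
The difference between WHERE and HAVING is easy to miss: WHERE filters individual rows before they are grouped, while HAVING filters the groups afterwards. The sketch below shows both, assuming Python's built-in `sqlite3` module and a reduced copy of the sample data (five of the ten beers, with only the columns the queries need).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE breweries (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE beers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        abv REAL,
        brewery_id INTEGER,
        FOREIGN KEY (brewery_id) REFERENCES breweries (id)
    );
    INSERT INTO breweries (id, name) VALUES
        (1, 'Schell Brewery'), (2, 'Forager Brewery'), (3, 'Surly Brewing');
    INSERT INTO beers (id, name, abv, brewery_id) VALUES
        (1, 'Schell Firebrick', 5.0, 1),
        (2, 'Schell Deer Brand', 4.8, 1),
        (3, 'Forager Humble Bumble', 5.5, 2),
        (5, 'Surly Furious', 6.7, 3),
        (6, 'Surly Bender', 5.1, 3);
""")

# HAVING only: count every beer, then keep breweries with more than one.
having_only = conn.execute("""
    SELECT breweries.name, COUNT(*) AS num_beers
    FROM beers
    INNER JOIN breweries ON beers.brewery_id = breweries.id
    GROUP BY breweries.name
    HAVING COUNT(*) > 1
    ORDER BY breweries.name
""").fetchall()

# WHERE then HAVING: drop beers at or below 5.0% ABV first, then group.
where_then_having = conn.execute("""
    SELECT breweries.name, COUNT(*) AS num_beers
    FROM beers
    INNER JOIN breweries ON beers.brewery_id = breweries.id
    WHERE beers.abv > 5.0
    GROUP BY breweries.name
    HAVING COUNT(*) > 1
    ORDER BY breweries.name
""").fetchall()

print(having_only)        # [('Schell Brewery', 2), ('Surly Brewing', 2)]
print(where_then_having)  # [('Surly Brewing', 2)]
```

Schell Brewery drops out of the second result because neither of its two beers in this reduced dataset is above 5.0% ABV, so its group is empty before HAVING is ever applied.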

## Exercises: SQL
1.  Select all columns from the breweries table.

    -   Task: Write a SQL query to retrieve all columns and rows from the breweries table.
    -   Hint: Use the SELECT statement followed by an asterisk (*) to select all columns.
2.   Select specific columns from the beers table.

    -   Task: Write a SQL query to retrieve only the columns 'id', 'name', and 'style' from the beers table.
    -   Hint: Instead of using an asterisk (*), list the desired column names after the SELECT statement.
3.  Order the beers by their ABV (Alcohol by Volume).

    -   Task: Write a SQL query to display all columns from the beers table, ordered by the 'abv' column in ascending order.
    -   Hint: Use the ORDER BY clause followed by the column name ('abv').
4.  Filter the beers with an ABV greater than 5.

    -   Task: Write a SQL query to display all columns for beers with an 'abv' value greater than 5.
    -   Hint: Use the WHERE clause followed by the appropriate condition (e.g., abv > 5).
5.  Display beers with an IBU (International Bitterness Units) between 20 and 50.

    -   Task: Write a SQL query to display all columns for beers with an 'ibu' value between 20 and 50.
    -   Hint: Use the WHERE clause with the appropriate conditions (e.g., ibu BETWEEN 20 AND 50).
6.  Count the number of beers in the beers table.

    -   Task: Write a SQL query to count the total number of rows in the beers table.
    -   Hint: Use the COUNT() function along with the SELECT statement (e.g., SELECT COUNT(*) FROM beers).
7.   Display the average ABV for all beers.

    -   Task: Write a SQL query to calculate the average 'abv' value for all beers in the table.
    -   Hint: Use the AVG() function along with the SELECT statement (e.g., SELECT AVG(abv) FROM beers).
8.  Display beers and their corresponding brewery names.

    -   Task: Write a SQL query to display the 'name' column from the beers table and the 'name' column from the breweries table.
    -   Hint: Use the SELECT statement followed by the column names and a JOIN clause to combine the tables on the 'brewery_id' and 'id' columns.

9.  Display beers with a specific style.

    -   Task: Write a SQL query to display all columns for beers with a 'style' of 'IPA'.
    -   Hint: Use the WHERE clause followed by the appropriate condition (e.g., style = 'IPA').

10.  CHALLENGE: Display the number of beers for each brewery.

    -   Task: Write a SQL query to display the 'name' column from the breweries table along with the count of beers associated with each brewery.
    -   Hint: Use the SELECT statement with the COUNT() function, a JOIN clause to combine the tables on the 'brewery_id' and 'id' columns, and a GROUP BY clause to group the results by the brewery 'name' column.

In [None]:
%%sql
--Exercise 1

In [None]:
%%sql
--Exercise 2

In [None]:
%%sql
--Exercise 3

In [None]:
%%sql
--Exercise 4

In [None]:
%%sql
--Exercise 5

In [None]:
%%sql
--Exercise 6

In [None]:
%%sql
--Exercise 7

In [None]:
%%sql
--Exercise 8

In [None]:
%%sql
--Exercise 9

In [None]:
%%sql
--Exercise 10

## IV. Expansions and Alternatives to the Relational Model: XML, JSON, and NoSQL

While the relational model has been the dominant approach to data management for decades, the advent of the internet, the rise of big data, and the increasing complexity of data structures have led to the development of expansions and alternatives to the traditional relational model. This final section will focus on XML, JSON, and NoSQL, explaining their key features and the technological advancements that have made them possible.

### XML: eXtensible Markup Language

XML, or eXtensible Markup Language, is a markup language designed to store and transport data in a self-descriptive, hierarchical format. It emerged in the late 1990s as a response to the growing need for a standard, platform-independent way to exchange data between different systems.

XML example:

```
<brewery>
    <name>Schell Brewery</name>
    <city>New Ulm</city>
    <state>MN</state>
</brewery>
```


Key features of XML:

-   Hierarchical structure: XML data is organized in a tree-like structure, with elements (tags) nested within other elements.
-   Self-descriptive: XML tags are user-defined, allowing for clear, human-readable descriptions of the data being stored.
-   Platform-independent: XML can be easily parsed and generated by a wide variety of programming languages and platforms.

XML gained widespread adoption for its ability to handle complex, nested data structures and for its platform independence. The development of technologies such as web services and APIs has been crucial to the widespread use of XML as a format for exchanging data between systems.
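
To make the "easily parsed" claim concrete, here is a minimal sketch that reads the brewery XML with Python's standard `xml.etree.ElementTree` module and turns the child elements into a dictionary. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_text = """
<brewery>
    <name>Schell Brewery</name>
    <city>New Ulm</city>
    <state>MN</state>
</brewery>
"""

# Parse the text into a tree; the root element is <brewery>.
root = ET.fromstring(xml_text.strip())

# Each child element has a tag (its name) and text (its content).
brewery = {child.tag: child.text for child in root}

print(root.tag)  # brewery
print(brewery)   # {'name': 'Schell Brewery', 'city': 'New Ulm', 'state': 'MN'}
```

Because the tags describe the data, the program can discover the field names from the document itself instead of relying on column positions.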

### JSON: JavaScript Object Notation

JSON, or JavaScript Object Notation, is a lightweight data interchange format that emerged in the early 2000s as an alternative to XML. JSON is based on the object notation used in JavaScript and has since been widely adopted by other programming languages due to its simplicity and readability.

JSON example:

```
{
    "name": "Schell Brewery",
    "city": "New Ulm",
    "state": "MN"
}
```

Key features of JSON:

-   Simple syntax: JSON uses a minimal syntax that is easy to read and write for both humans and machines.
-   Lightweight: JSON is less verbose than XML, resulting in smaller file sizes and faster parsing.
-   Flexible structure: JSON supports a variety of data structures, including objects, arrays, and key-value pairs.

The rise of web applications and the need for more efficient data exchange between client and server have fueled the adoption of JSON. The development of JSON-based APIs and the popularity of JavaScript as a programming language for web development have contributed to JSON's widespread use.
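
As a small illustration of how JSON is exchanged in practice, the sketch below uses Python's standard `json` module to turn JSON text into a dictionary and back again. The nested `beers` list is an addition made for this example, to show that JSON objects can contain arrays of other objects.

```python
import json

json_text = '{"name": "Schell Brewery", "city": "New Ulm", "state": "MN"}'

# json.loads converts JSON text into Python objects (here, a dict).
brewery = json.loads(json_text)
print(brewery["city"])  # New Ulm

# Illustrative addition: a nested list of objects.
brewery["beers"] = [{"name": "Schell Firebrick", "abv": 5.0}]

# json.dumps converts Python objects back into JSON text.
serialized = json.dumps(brewery, indent=4)
print(serialized)

# Reading the text back should give an equal dictionary.
round_trip = json.loads(serialized)
print(round_trip == brewery)  # True
```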

### NoSQL: Non-relational Databases

NoSQL, or "not only SQL," is a term used to describe a wide range of non-relational database systems that offer alternatives to the traditional relational model. NoSQL databases have gained popularity in recent years due to their ability to handle large volumes of unstructured, semi-structured, or dynamically structured data, and their horizontal scalability.

Common types of NoSQL databases include:

-   Document stores (e.g., MongoDB, Couchbase): Store data in document-like structures (e.g., JSON, BSON) and allow for flexible, schema-less data models.
-   Key-value stores (e.g., Redis, Amazon DynamoDB): Store data as key-value pairs, providing simple and efficient data retrieval based on keys.
-   Column-family stores (e.g., Apache Cassandra, HBase): Organize data in columns rather than rows, allowing for efficient write and read operations on large, sparse datasets.
-   Graph databases (e.g., Neo4j, Amazon Neptune): Represent data as nodes and edges in a graph, enabling efficient querying of complex relationships between data elements.

Technological advancements that have contributed to the rise of NoSQL databases include:

-   **Distributed computing:** The ability to distribute data and processing across multiple servers has enabled NoSQL databases to scale horizontally and handle large volumes of data.
-   **Cloud computing:** The growth of cloud computing infrastructure has made it easier and more cost-effective to deploy and manage NoSQL databases, providing flexible solutions for businesses with varying data storage and processing needs.
-   **The rise of big data:** The increasing volume, variety, and velocity of data generated by modern applications have driven the demand for more flexible and scalable data management solutions, which NoSQL databases can provide.

### Conclusion

The expansions and alternatives to the relational model, such as XML, JSON, and NoSQL databases, have emerged in response to the evolving needs of data management in the digital age. These technologies offer different ways to handle complex data structures, enable efficient data exchange between systems, and provide scalable solutions for managing large volumes of data. By understanding the key features and technological advancements that have made these alternatives possible, beginners can gain a better grasp of the data management landscape and make informed decisions about the most suitable data models and technologies for their needs.

## Key-Value Structures: Use Cases
Key-value data structures, like JSON and dictionaries, are often used for storing and exchanging data. They provide a simple and flexible way to store data as a collection of key-value pairs. Let's explore some common use cases, advantages, and disadvantages of key-value data structures in comparison to relational databases and flat files (such as CSV files loaded into a Pandas data frame).

Common Use Cases:

1.  Configuration settings: Key-value data structures are commonly used for storing configuration settings in applications, as they can easily store and retrieve values based on their keys.
2.  Document storage: They are ideal for storing unstructured or semi-structured data like documents, where each document has a unique identifier (key) and the document's contents are stored as a value.
3.  Caching: Key-value stores are often used as caching systems, storing frequently accessed data with an expiration time to improve application performance.
4.  Session storage: They can be used for storing session information in web applications, where each user session has a unique identifier (key) and the session data is stored as a value.

Advantages:

1.  Flexibility: Key-value data structures are schema-less, which means that they can store data without a predefined structure. This allows for more flexibility in storing data compared to relational databases, which require a fixed schema.
2.  Scalability: Key-value stores can be easily scaled horizontally, making them suitable for handling large volumes of data and high traffic loads.
3.  Speed: Key-value stores are often faster than relational databases for simple read and write operations since they only need to perform lookups based on a single key.
4.  Easy data exchange: JSON is a widely supported format for data exchange between different systems and programming languages, making it easier to work with APIs and web services.

Disadvantages:

1.  Limited query capabilities: Key-value data structures are not suitable for complex queries or operations that involve relationships between data entities, as they lack the sophisticated querying capabilities provided by relational databases.
2.  Data redundancy: Key-value data structures may lead to data redundancy, as there is no built-in support for normalizing data, unlike relational databases.
3.  Inconsistency: In some cases, key-value stores may provide eventual consistency rather than strong consistency, meaning that read operations might not always return the most recent write operation's result.

In summary, key-value data structures like JSON and dictionaries offer flexibility, scalability, and speed, making them well-suited for certain use cases such as document storage, caching, and configuration settings. However, they may not be the best choice for more complex data management tasks, where relational databases or flat files analyzed with Pandas provide more advanced querying and data manipulation capabilities.
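
The caching use case (values stored under a key with an expiration time) can be sketched in a few lines of Python. This is a toy illustration built on a plain dictionary, not how production key-value stores such as Redis are implemented. The class name `ExpiringCache`, the key `"session:42"`, and the user `"jdoe"` are all hypothetical.

```python
import time


class ExpiringCache:
    """A tiny in-memory key-value cache whose entries expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so the example can fake time
        self._store = {}        # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            # The entry is too old: remove it and report a miss.
            del self._store[key]
            return default
        return value


# A fake clock lets us "advance time" without actually waiting.
fake_now = [0.0]
cache = ExpiringCache(ttl_seconds=60, clock=lambda: fake_now[0])

cache.set("session:42", {"user": "jdoe"})
fresh = cache.get("session:42")   # still within 60 seconds

fake_now[0] = 61.0                # pretend 61 seconds have passed
stale = cache.get("session:42")   # expired, so the default is returned

print(fresh, stale)  # {'user': 'jdoe'} None
```

Every operation here is a lookup on a single key, which is why this kind of store is fast for simple reads and writes but offers nothing comparable to a SQL query across many records.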

## Example: Working With Python Dictionaries
Now that you've learned about XML, JSON, and NoSQL, you might be wondering how you can start working with key-value pairs in a programming language. In this section, we'll introduce the basics of Python dictionaries, which are a built-in data structure that allows you to store and manipulate key-value pairs.

Python dictionaries are similar to JSON objects in that they store data in **key-value pairs.** **Keys** in a dictionary are unique, while the **values** associated with them can be of any data type, such as numbers, strings, lists, or even other dictionaries.

### Dictionaries: Create a Dictionary

Let's start by creating a simple dictionary in Python representing a single Minnesota brewery:

In [None]:
brewery = {
    "name": "Schell Brewery",
    "city": "New Ulm",
    "state": "MN"
}


In this example, the keys are "name", "city", and "state", and their corresponding values are "Schell Brewery", "New Ulm", and "MN". The dictionary is enclosed in curly braces {} and key-value pairs are separated by commas.

### Dictionaries: Access a Value
To access the value associated with a key, you can use the following syntax:

In [None]:
print(brewery["name"])  # Output: Schell Brewery


You can also add a new key-value pair to the dictionary or update the value of an existing key using the following syntax:

In [None]:
brewery["founded"] = 1860  # Adds a new key-value pair
brewery["state"] = "Minnesota"  # Updates the value of "state"

### Dictionaries: Check if a Key Exists
To check whether a specific key exists in the dictionary, you can use the `in` keyword:

In [None]:
print("city" in brewery)  # Output: True
print("country" in brewery)  # Output: False
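
A related option is the dictionary's built-in `get` method, which returns a fallback value instead of raising an error when the key is missing. The sketch below contrasts the two behaviors. The fallback string `"unknown"` is an arbitrary choice for this example.

```python
brewery = {"name": "Schell Brewery", "city": "New Ulm", "state": "MN"}

# get() returns the value if the key exists...
city = brewery.get("city", "unknown")

# ...and the fallback if it does not.
country = brewery.get("country", "unknown")

# Square brackets raise a KeyError for a missing key.
try:
    brewery["country"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(city, country, raised_key_error)  # New Ulm unknown True
```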

### Dictionaries: Remove Key-Value Pair
To remove a key-value pair from the dictionary, you can use the `del` keyword:

In [None]:
del brewery["founded"]  # Removes the key-value pair with the key "founded"

### Dictionaries: Loop Through Key/Values
Finally, you can loop through the keys and values in a dictionary using a for loop:

In [None]:
for key, value in brewery.items():
    print(key, value)

This brief introduction to Python dictionaries, using the example of a Minnesota brewery, should give you a starting point for working with key-value pairs in Python. As you gain more experience with dictionaries, you'll discover how they can be used in various applications, from storing configuration data to representing complex data structures in your programs.
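
As one illustration of a more complex structure, dictionary values can themselves be lists or other dictionaries. The sketch below nests a list of beer dictionaries inside the brewery dictionary, using the Schell beers from the SQL example. The 4.5 cutoff is an arbitrary threshold picked for the demonstration.

```python
brewery = {
    "name": "Schell Brewery",
    "city": "New Ulm",
    "state": "MN",
    "beers": [
        {"name": "Schell Firebrick", "style": "Vienna Lager", "abv": 5.0},
        {"name": "Schell Deer Brand", "style": "American Adjunct Lager", "abv": 4.8},
        {"name": "Schell Noble Star", "style": "Berliner Weisse", "abv": 3.9},
    ],
}

# Chain keys and indexes to reach nested values.
first_style = brewery["beers"][0]["style"]

# Loop over the nested list to filter and summarize.
stronger = [beer["name"] for beer in brewery["beers"] if beer["abv"] >= 4.5]
average_abv = round(
    sum(beer["abv"] for beer in brewery["beers"]) / len(brewery["beers"]), 2
)

print(first_style)   # Vienna Lager
print(stronger)      # ['Schell Firebrick', 'Schell Deer Brand']
print(average_abv)   # 4.57
```

This nested shape is the dictionary counterpart of the two-table design used earlier: the beers live inside their brewery instead of pointing to it through a foreign key.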

## Exercises: Dictionaries
In this exercise, we will practice working with dictionaries in Python. Before diving into the problems, let's first review some basic dictionary operations and syntax.

### Dictionary Operations and Syntax

1.  Create a dictionary: Use curly braces `{}` to create a dictionary, with keys and values separated by colons. For example, `my_dict = {'key1': 'value1', 'key2': 'value2'}`.

2.  Access a value: Use the key inside square brackets to access a value, e.g., `my_dict['key1']` returns `'value1'`.

3.  Add a key-value pair: Assign a value to a new key, e.g., `my_dict['key3'] = 'value3'`.

4.  Update a value: Assign a new value to an existing key, e.g., `my_dict['key1'] = 'new_value1'`.

5.  Check if a key exists: Use the `in` keyword, e.g., `'key1' in my_dict` returns `True`.

6.  Remove a key-value pair: Use the `del` keyword, e.g., `del my_dict['key1']`.

7.  Iterate over a dictionary: Use a `for` loop, e.g., `for key, value in my_dict.items():`.

Now, let's move on to the exercises. Remember to use the syntax and operations explained above to solve the problems.

### Exercise 1: Create a Dictionary

Create a dictionary called `beer` with the following key-value pairs:

-   'name': 'Hoppy IPA'
-   'style': 'IPA'
-   'abv': 6.5
-   'ibu': 70

### Exercise 2: Access and Update Values

Using the `beer` dictionary created in Exercise 1, perform the following tasks:

1.  Print the value associated with the key 'name'.
2.  Update the value of 'abv' to 7.0.
3.  Print the updated 'abv' value.

### Exercise 3: Add a Key-Value Pair

Add a new key-value pair to the `beer` dictionary:

-   'brewery': 'Craft Brewery Co.'

After adding the new key-value pair, print the entire dictionary.

### Exercise 4: Check for a Key

Write a conditional statement that checks if the key 'ibu' exists in the `beer` dictionary. If it exists, print the value. Otherwise, print 'IBU not found'.

### Exercise 5: Iterate Over a Dictionary

Using a `for` loop, iterate over the `beer` dictionary and print the keys and their corresponding values in the following format:



In [None]:
# Exercise 1 code

In [None]:
# Exercise 2 code

In [None]:
# Exercise 3 code

In [None]:
# Exercise 4 code

In [None]:
# Exercise 5 code

# Summary: Turning Data Into Knowledge
In this lesson, we've reviewed three common ways of storing data. The ultimate goal of storing data (in all three cases) is to help us obtain knowledge. However, each method has its own set of use cases, strengths, and weaknesses.


**Flat Files (CSV/Data Frame):** Flat files, such as CSVs, are great for small to medium-sized datasets that have a tabular structure. They are human-readable, easy to create and edit, and widely supported across platforms and software. However, they have limited support for complex data structures and relationships, and can be inefficient for large datasets due to a lack of compression and indexing. Remember that flat files are best suited for simpler datasets and applications that don't require advanced querying or strict data integrity.

**Relational Databases:** Relational databases are designed for large and complex datasets, providing robust data integrity through constraints and relationships, as well as optimized performance. They use standardized data types and SQL for querying, making them a popular choice for many applications. Keep in mind that relational databases can be more complex to set up and manage, but they offer the scalability, security, and data integrity needed for more demanding applications.

**Dictionaries/JSON (incl. NoSQL):**  Dictionaries and JSON, including NoSQL databases like MongoDB, provide a flexible and language-independent way to store and exchange data. They are ideal for configuration storage, API data, and semi-structured data that doesn't fit neatly into a tabular format. NoSQL databases offer schema-less design and high scalability for simple queries. However, they may lack built-in validation and be less efficient for complex queries compared to relational databases. Remember that dictionaries and JSON are best suited for applications that require flexibility and can tolerate some trade-offs in data integrity and query performance.



| Aspect | Flat Files (CSV/Data Frame) | Relational Databases | Dictionaries/JSON (incl. NoSQL) |
| --- | --- | --- | --- |
| Structure | Tabular data | Tables with relationships | Key-value pairs, documents |
| Data Storage | Text files (CSV) | Database management system | In-memory, JSON files, NoSQL DBs |
| Use Cases | Small to medium-sized datasets | Large and complex datasets | Configuration, API data, semi-structured data |
| Advantages | Human-readable | Data integrity, performance | Flexible data structure, language-independent, schema-less |
| Disadvantages | Limited complex data support | More complex to set up | Limited scalability for complex queries, no built-in validation |
| Query Language | N/A (Python/Pandas) | SQL | N/A (Python), NoSQL query languages |
| Data Types | Limited standardization | Standardized data types | Limited standardization |
| Data Integrity | None | Constraints, relationships | None |
| Performance | Inefficient for large datasets | Optimized for performance | Inefficient for complex queries |
| Scalability | Limited | High | High for simple queries |
| Security | None | Built-in features | Requires additional measures |
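
To tie the three approaches together, the sketch below moves the same brewery records through each form: CSV text (a flat file), a SQLite table (a relational database), and JSON (key-value data). It relies only on Python's standard `csv`, `io`, `sqlite3`, and `json` modules. The CSV text is held in a string instead of a real file, purely to keep the example self-contained.

```python
import csv
import io
import json
import sqlite3

# 1. Flat file: CSV text, read row by row into dictionaries.
csv_text = """id,name,city,state
1,Schell Brewery,New Ulm,MN
2,Forager Brewery,Rochester,MN
3,Surly Brewing,Minneapolis,MN
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
# CSV has no data types: every value arrives as a string.
print(type(rows[0]["id"]).__name__)  # str

# 2. Relational database: typed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE breweries ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO breweries (id, name, city, state) VALUES (?, ?, ?, ?)",
    [(int(r["id"]), r["name"], r["city"], r["state"]) for r in rows],
)
cities = [
    row[0]
    for row in conn.execute(
        "SELECT city FROM breweries WHERE state = 'MN' ORDER BY city"
    )
]
print(cities)  # ['Minneapolis', 'New Ulm', 'Rochester']

# 3. Key-value data: one record serialized as JSON for exchange.
as_json = json.dumps(rows[0])
print(json.loads(as_json)["name"])  # Schell Brewery
```

The data is identical in all three forms. What changes is how much structure and checking the storage format provides, and which tools can query it.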
