# Data Sources

### User input data
User input can be text, images, videos, uploaded files, etc. If it’s even remotely possible for users to
input **wrong data**, they are going to do it. As a result, user input is easily malformed and needs thorough checking and processing.
- Text: misspelled words, wrong grammar, long text, short text, etc.
- Input value type mismatch: a user enters a string where a number is expected.
- Files: wrong file format, wrong file encoding, etc.
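
The type-mismatch case above can be guarded against by validating input before it reaches the rest of the system. The following is a minimal sketch; the `parse_age` function and its 0 to 150 range are hypothetical, chosen only for illustration.

```python
def parse_age(raw: str) -> int:
    """Convert a raw form value to an age, rejecting malformed input."""
    try:
        age = int(raw.strip())
    except ValueError:
        # The user typed something that is not a whole number.
        raise ValueError(f"expected a whole number, got {raw!r}") from None
    if not 0 <= age <= 150:  # hypothetical plausibility range
        raise ValueError(f"age out of range: {age}")
    return age


# Try a mix of well-formed and malformed inputs.
for raw in ["42", " 7 ", "forty-two", "-3"]:
    try:
        print(repr(raw), "->", parse_age(raw))
    except ValueError as err:
        print(repr(raw), "-> rejected:", err)
```

Rejecting bad values at the boundary keeps malformed data out of downstream storage and training pipelines.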

### System-generated data
System-generated data is produced by the components of a system rather than entered by users.
- Logs: memory usage, number of instances, etc. The main purpose of logs is to help in debugging and monitoring.
- User behavior: clicks, views, etc. This data is used to improve the user experience.
- And many more.

# Data Formats
Storing your data isn’t always straightforward and, in some cases, can be costly. It’s important to think about how the data will be used in the future so that the format you choose makes sense. Here are some common data formats used in practice:

![data formats](./screenshots/data-formats.png)

## Row-Major Versus Column-Major Format
- **Row-major format**: In row-major format, the elements of a row are stored in contiguous memory locations. For example, CSV files are row-major format.
- **Column-major format**: In column-major format, the elements of a column are stored in contiguous memory locations. For example, Parquet files are column-major format.
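
To make the two layouts concrete, here is a small pure-Python sketch that flattens the same table both ways. The table contents and helper names are made up for illustration; real formats such as CSV and Parquet add encoding and metadata on top of this idea.

```python
# A tiny table: 3 rows (records) by 3 columns (id, name, score).
table = [
    [1, "alice", 3.5],
    [2, "bob", 4.0],
    [3, "carol", 2.5],
]
n_rows, n_cols = len(table), len(table[0])

# Row-major: the values of each record sit next to each other.
row_major = [value for row in table for value in row]

# Column-major: the values of each column sit next to each other.
column_major = [table[i][j] for j in range(n_cols) for i in range(n_rows)]


def get_row_major(i, j):
    """Look up cell (i, j) in the row-major layout."""
    return row_major[i * n_cols + j]


def get_col_major(i, j):
    """Look up cell (i, j) in the column-major layout."""
    return column_major[j * n_rows + i]


print(row_major)
print(column_major)

# Reading the whole "score" column is one contiguous slice in column-major...
print(column_major[2 * n_rows:3 * n_rows])
# ...but a strided walk in row-major.
print(row_major[2::n_cols])
```

Reading a full record is contiguous in the row-major layout, while reading a full column is contiguous in the column-major layout, which is why the better choice depends on the access pattern.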

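***NumPy Versus pandas***: NumPy arrays are row-major by default (column-major order can be requested), while a pandas DataFrame is organized around columns. Accessing a DataFrame by column is therefore typically much faster than accessing it by row, as the timing below shows.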

```python
import time
import pandas as pd

# Build a DataFrame with 20 columns of 1,000 integers each.
df = pd.DataFrame({
    f'Column_{i}': range(i * 1000 + 1, (i + 1) * 1000 + 1) for i in range(20)
})

# Iterate column by column, which follows the DataFrame's layout.
start = time.time()
for column in df.columns:
    for item in df[column]:
        pass
print(f'column: {time.time() - start}')

# Iterate row by row, which cuts across the columns.
start = time.time()
n_rows = len(df)
for i in range(n_rows):
    for item in df.iloc[i]:
        pass
print(f'row: {time.time() - start}')
```

```
column: 0.0035429000854492188
row: 0.011175870895385742
```


## Text Versus Binary Format
- **Text format**: Text format is human-readable and human-editable. It’s easy to understand and debug. However, it’s less efficient in terms of storage and processing.
- **Binary format**: Binary format is not meant to be read or edited by humans. The file holds raw bytes that only make sense to a program that knows how to interpret them, which makes it more compact.

Consider that you want to store the number 1000000. If you store it in a text file, it’ll require 7 characters, and if each character is 1 byte, it’ll require 7 bytes. If you store it in a binary file as int32, it’ll take only 32 bits or 4 bytes.
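
The arithmetic above can be checked with Python's standard `struct` module. This is a small sketch that assumes ASCII for the text encoding and a little-endian int32 for the binary encoding.

```python
import struct

n = 1000000

# Text: one byte per digit when encoded as ASCII.
as_text = str(n).encode("ascii")

# Binary: a fixed-width, little-endian, signed 32-bit integer.
as_int32 = struct.pack("<i", n)

print(len(as_text))   # 7
print(len(as_int32))  # 4

# The binary form is only meaningful to a reader that knows the layout.
decoded = struct.unpack("<i", as_int32)[0]
print(decoded == n)   # True
```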

# Data Models
Data models describe how the data is represented. There are two main data models:
- **Relational model**: In the relational model, data is stored in tables. Each table has rows and columns. The tables are related to each other by keys.
- **Non-relational model**: In the non-relational model, data is stored in a non-tabular format. There are different types of non-relational databases, such as document-based, key-value, wide-column, and graph databases.
    - **Document-based databases**: A collection of documents. Each document is often a single continuous string, encoded as JSON, XML, or a binary format like BSON. All documents in the database are assumed to be encoded in the same format.
    - **Graph databases**: Collection of nodes and edges. Each node represents an entity and each edge represents a connection or relationship between two nodes.
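
As a rough illustration of the difference, the sketch below stores the same made-up publisher and book data twice: once as related tables using Python's built-in `sqlite3` module, and once as a single JSON document. The table names, field names, and values are all hypothetical.

```python
import json
import sqlite3

# Relational model: two tables related by a key, combined with a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        publisher_id INTEGER REFERENCES publishers(id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press');
    INSERT INTO books VALUES (1, 'First Book', 1);
    INSERT INTO books VALUES (2, 'Second Book', 1);
""")
rows = conn.execute("""
    SELECT books.title, publishers.name
    FROM books JOIN publishers ON books.publisher_id = publishers.id
    ORDER BY books.id
""").fetchall()
print(rows)

# Document model: one self-contained document, encoded here as JSON.
document = json.dumps({
    "publisher": "Example Press",
    "books": [{"title": "First Book"}, {"title": "Second Book"}],
})
titles = [book["title"] for book in json.loads(document)["books"]]
print(titles)
```

The relational version avoids repeating the publisher name but needs a join to read it back, while the document version keeps everything about one publisher together at the cost of making cross-document queries harder.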


### Structured data versus Unstructured data
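Structured data follows a predefined data model, also known as a data schema. The predefined structure makes your
data easier to analyze, query, and organize. Unstructured data, such as free text, images, or audio, does not follow a predefined schema, which makes it easier to store but harder to query.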


### Transactional and Analytical Processing

- ***OLTP (Online Transaction Processing)***: OLTP handles transactions, such as placing an order or making a payment, as they are generated. Because these transactions often involve users, they need to be processed fast so that they don’t keep users waiting, and the processing system needs to be available any time a user wants to make a transaction. Transactional databases are commonly expected to satisfy the ACID properties:
    - Atomicity: All parts of a transaction must be completed successfully, or the transaction is aborted.
    - Consistency: The database must be in a consistent state before and after the transaction.
    - Isolation: Transactions should be isolated from each other until they are completed.
    - Durability: Once a transaction is completed, it should be permanent and not undone.

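- ***OLAP (Online Analytical Processing)***: OLAP is used for complex queries that involve aggregations over large amounts of data. These queries are often used for reporting and business intelligence. The CAP theorem is sometimes mentioned alongside ACID, but it is not specific to OLAP: it describes distributed data stores in general and states that, when a network partition occurs, a store has to give up either consistency or availability. The three properties are:
    - Consistency: Every read sees the most recent write, so all nodes return the same data.
    - Availability: Every request to a working node receives a response.
    - Partition tolerance: The system continues to operate despite network partitions.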
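
The atomicity property from the ACID list can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the point is only that a failed transaction leaves no partial update behind. The final query is a small aggregation of the kind an analytical workload would run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
print(ok, failed, balances, total)
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.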


# ETL (Extract -> Transform -> Load)
In the early days of the relational data model, data was mostly structured. When data is extracted from different sources, it’s first transformed into the desired format before being loaded into the target destination such as a database or a data warehouse.

![ETL](./screenshots/etl.png)

Finding it difficult to keep data structured, some companies had this idea: “Why not just store all data in a data lake so we don’t have to deal with schema changes? Whichever application needs data can just pull out raw data from there and process it.”

# ELT (Extract -> Load -> Transform)
In the ELT process, data is first loaded into the target destination in its raw form and then transformed when it is needed. Because little processing happens before storage, data arrives quickly and the approach stays flexible when schemas change. The trade-off is that, as the volume of raw data grows, searching through it for the data you need becomes inefficient.
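
The difference in ordering between ETL and ELT can be sketched in a few lines of Python. The event records, the `transform` function, and the list-based "warehouse" and "data lake" are stand-ins invented for illustration, not a real pipeline.

```python
import json

# Raw records as they might arrive from a source; one has a bad amount.
raw_events = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "oops"}',
    '{"user": "a", "amount": "4.5"}',
]


def transform(raw):
    """Parse one raw record into a structured row, or None if malformed."""
    try:
        record = json.loads(raw)
        return {"user": record["user"], "amount": float(record["amount"])}
    except (ValueError, KeyError):
        return None


# ETL: transform first, then load only clean, structured rows.
warehouse = [row for row in map(transform, raw_events) if row is not None]

# ELT: load everything as-is, then transform when an application reads it.
data_lake = list(raw_events)
rows_on_read = [row for row in map(transform, data_lake) if row is not None]

print(len(warehouse), len(data_lake), len(rows_on_read))
```

In the ETL path the malformed record never reaches storage, while in the ELT path it is kept in raw form and every reader has to deal with it at query time.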

### Data Warehouse
A data warehouse is a centralized repository that stores structured, processed data from one or more sources. It’s a system used for reporting and data analysis: a type of database designed specifically for query and analysis rather than transaction processing.

### Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it’s needed. Data lakes are often used to store unstructured data, such as web server logs, IoT data, images, and videos. Data lakes are often used for data exploration and analysis.

# Data Flow
Data flow is the movement of data between processes. Data can be passed:
- Through databases
- Through services, using request-driven APIs
- Through a real-time transport (like Apache Kafka)

For example, consider a ride-sharing application with three services. The driver management service predicts how many drivers will be available in a specific area. The ride management service predicts how many rides will be requested. The price optimization service decides how to set the price for a ride. Because the price depends on supply and demand, the price optimization service needs data from both the driver management and ride management services.

The most popular styles of requests used for passing data through networks are REST (representational state transfer) and RPC (remote procedure call).

## Data Passing Through Real-Time Transport
The driver management service also needs to know the number of rides from the ride management service to know how many drivers to mobilize. It also wants to know the predicted prices from the price optimization service to use
them as incentives for potential drivers.

![Broker](./screenshots/broker.png)

Technically, a database can be a broker: each service can write data to a database, and other services that need the data can read from it. However, reading from and writing to a database is often too slow for applications with strict latency requirements, so in-memory storage is used to broker the data instead.

- Pub/sub (publish-subscribe) systems: Any service can publish to different topics in a real-time transport, and any service that subscribes to a topic can read all the events in that topic.
- Message queues: An event often has intended consumers (an event with intended consumers is called a message), and the message queue is responsible for getting the message to the right consumers.
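
The two styles can be contrasted with a toy in-memory sketch. The `PubSub` and `MessageQueue` classes below are hypothetical teaching aids, not the API of Apache Kafka or any other real transport, and the topic and consumer names are made up.

```python
from collections import defaultdict, deque


class PubSub:
    """Toy pub/sub: every subscriber of a topic sees every event."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self._subscribers[topic]:
            callback(event)


class MessageQueue:
    """Toy message queue: each message is routed to one named consumer."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, consumer, message):
        self._queues[consumer].append(message)

    def receive(self, consumer):
        queue = self._queues[consumer]
        return queue.popleft() if queue else None


# Pub/sub: the publisher does not know or care who is listening.
broker = PubSub()
price_inbox, driver_inbox = [], []
broker.subscribe("ride_predictions", price_inbox.append)
broker.subscribe("ride_predictions", driver_inbox.append)
broker.publish("ride_predictions", {"area": "downtown", "rides": 120})

# Message queue: the sender addresses a specific consumer.
queue = MessageQueue()
queue.send("price_optimization", {"area": "downtown", "drivers": 80})
received = queue.receive("price_optimization")
nothing = queue.receive("driver_management")

print(price_inbox, driver_inbox, received, nothing)
```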

# Batch Processing Versus Stream Processing

- Batch: Data is processed in large blocks at scheduled times. Batch processing fits cases where there is a lot of data to process and the results are not needed immediately, or where data is collected over a period of time and then processed all at once. It suits features that do not change frequently.
- Stream: Data is processed as it arrives, in real time or close to it. Stream processing fits cases where results are needed immediately or where data is generated continuously. It suits features that change quickly and need low latency.
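
As a small illustration, the sketch below computes the same feature, an average ride duration, in both styles. The `batch_average` function, the `StreamingAverage` class, and the durations are invented for this example.

```python
def batch_average(values):
    """Batch: wait until all the data is collected, then compute once."""
    return sum(values) / len(values)


class StreamingAverage:
    """Stream: keep a running mean that is updated as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean update, so no past values need to be stored.
        self.mean += (value - self.mean) / self.count
        return self.mean


ride_durations = [12.0, 8.0, 10.0, 14.0]

# Batch result is available only after the whole block has been collected.
batch_result = batch_average(ride_durations)

# Streaming result is available after every single event.
stream = StreamingAverage()
running = [stream.update(duration) for duration in ride_durations]

print(batch_result)
print(running)
```

Both paths end at the same value, but the streaming version has an up-to-date answer after each event, which is what makes it suitable for low-latency features.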

