# Differences between big data and traditional data

Simple: they differ only in ***the type of data handled and the tools used to analyze it.*** That's it!

Traditional analytics usually deals with **structured data**, while big data analytics involves massive amounts of data in various formats, including **structured**, **semi-structured**, and **unstructured** data.

# Five V’s of Big Data

The Five V’s of Big Data are the key characteristics that define the challenges and opportunities of handling large-scale data systems.

| V-Term     | Description                                                                                   |
|------------|-----------------------------------------------------------------------------------------------|
| **Volume**   | Refers to the massive scale of data generated from sources like social media and IoT, requiring scalable and cost-effective storage solutions. |
| **Velocity** | Describes the high speed at which data is created and needs to be processed, often in real-time, using technologies like stream processing. |
| **Variety**  | Highlights the wide range of data types—structured, semi-structured, and unstructured—necessitating flexible data management approaches. |
| **Veracity** | Concerns the reliability and quality of data, emphasizing the importance of validation, cleaning, and anomaly detection. |
| **Value**    | Focuses on extracting meaningful insights from data to support decision-making, innovation, and strategic growth. |



# Stages of big data analytics

The stages are the same as in traditional data analytics; what changes is the strategy used at each stage.

- **Collect data**: This is where the two differ. Big data analytics collects data from many different sources, in both structured and unstructured formats.

- **Process data**: Both require systematic processing to convert raw data into usable formats for analysis.

- **Clean data**: Both necessitate data cleaning to ensure accuracy, remove duplicates, and maintain data quality.

- **Analyze data**: Both apply analytical techniques to identify patterns, trends, and support decision-making.
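
The four stages above can be sketched in a few lines of Python. This is only a toy illustration, not a production pipeline: the two inline sources, the field names (`user`, `country`), and the function names are all assumptions made up for the example.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical raw inputs: one structured source (CSV) and one
# semi-structured source (JSON lines). Both are invented for this sketch.
CSV_SOURCE = "user,country\nana,BR\nbob,US\nana,BR\n"
JSON_SOURCE = '{"user": "cy", "country": "us"}\n{"user": "dee"}\n'


def collect():
    """Collect: read records from every source into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows += [json.loads(line) for line in JSON_SOURCE.splitlines() if line.strip()]
    return rows


def process(rows):
    """Process: convert raw records into one common shape."""
    return [
        {"user": r.get("user"), "country": (r.get("country") or "").upper()}
        for r in rows
    ]


def clean(rows):
    """Clean: drop records with missing fields and exact duplicates."""
    seen, result = set(), []
    for r in rows:
        key = (r["user"], r["country"])
        if r["user"] and r["country"] and key not in seen:
            seen.add(key)
            result.append(r)
    return result


def analyze(rows):
    """Analyze: count distinct users per country."""
    return Counter(r["country"] for r in rows)


users_per_country = analyze(clean(process(collect())))
print(users_per_country)
```

In a real big data setting each stage would typically run on a distributed engine rather than in one Python process, but the order of the stages stays the same.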


# 17 Important Terms You Should Know

This guide breaks down the most common terms in the Big Data field:

| Term                                 | Definition                                                                                                                                                                                                                                                                                    |
|------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Big Data                           | Large and complex datasets that traditional data processing tools cannot handle effectively.                                                                                                                                                                                                  |
| Data Lake                         | A centralized repository that allows you to store all your structured and unstructured data at any scale.                                                                                                                                                                                     |
| NoSQL Databases                   | Non-relational databases such as MongoDB, Cassandra, and Redis designed to handle unstructured data.                                                                                                                                                                                          |
| ETL (Extract, Transform, Load)    | A process that extracts data from various sources, transforms it for analysis, and loads it into a data warehouse.                                                                                                                                                                            |
| Data Mining                      | The process of discovering patterns and knowledge from large datasets.                                                                                                                                                                                                                        |
| Stream Processing                | Real-time processing of data streams, as opposed to batch processing.                                                                                                                                                                                                                         |
| Data Visualization              | The graphical representation of data that makes it easier to understand and interpret.                                                                                                                                                                                                         |
| Web Crawler                     | A bot or automated program that systematically browses the internet to index and collect data from websites.                                                                                                                                                                                  |
| Data Warehouse                 | A central repository that stores processed data for reporting and analysis.                                                                                                                                                                                                                        |
| Data Pipeline                  | A set of automated processes that move data from one system to another, transforming it along the way.                                                                                                                                                                                        |
| Metadata                      | Data that describes other data, providing context or additional information.                                                                                                                                                                                                                  |
| Batch Processing              | A method of processing large volumes of data at once, typically at scheduled intervals.                                                                                                                                                                                                       |
| Data Sharding                | A database architecture pattern where data is partitioned across multiple servers to improve performance and scalability.                                                                                                                                                                     |
| Data Governance            | The set of processes, policies, and standards used to ensure proper management of data assets.                                                                                                                                                                                                |
| Data Cleaning              | The process of identifying and correcting errors, inconsistencies, or inaccuracies in a dataset.                                                                                                                                                                                              |
| OLAP (Online Analytical Processing)  | Software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows querying multi-dimensional data. Widely used for Business Intelligence (BI). Example: Analyzing sales data by region, product, and time.                                                      |
| OLTP (Online Transaction Processing) | A technology focused on managing transaction-oriented applications like order processing or banking systems. Example: A database handling real-time transactions on an e-commerce platform. Ensures efficient and reliable management of day-to-day operations in transactional systems. |
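
To make the OLTP and OLAP rows more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is not a dedicated OLTP or OLAP system, and the `sales` table and its rows are invented for illustration; the point is only the contrast between many small write transactions and one aggregating read query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)

# OLTP-style workload: many small writes, each in its own short transaction.
orders = [("EU", "book", 12.0), ("EU", "pen", 3.0), ("US", "book", 15.0), ("US", "book", 15.0)]
for region, product, amount in orders:
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)",
            (region, product, amount),
        )

# OLAP-style workload: one read-only query that aggregates across dimensions
# (here: region and product).
totals = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()
print(totals)
```

Production systems usually separate the two workloads, for example an OLTP database for orders and a data warehouse for the aggregated reports.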


# What is a Distributed System?

A distributed system is like a team of computers that work together to complete a task, but each computer is separate and may be located far apart.

<img src="https://www.sangam.dev/_next/image?url=%2Fimages%2Fblog%2F2022%2Fdistributed-systems.jpg&w=3840&q=75" width="400">

<a href="https://www.sangam.dev/blog/2022/distributed-systems-key-concepts-and-challenges">Image source</a>

## Why use distributed systems?

- To get work done faster by splitting tasks.

- To handle more data or users than one computer can.

- To keep working even if one computer fails (reliability).

**Example:** Online shopping websites use distributed systems to handle many users at once.

# Relationship Between Big Data and Distributed Systems

Big Data involves extremely large and complex datasets that cannot be processed efficiently on a single machine. Distributed systems address this in several ways:

1. **Data Distribution:**  
   Distributed systems split big datasets into smaller chunks and store them across multiple machines (nodes). 

2. **Parallel Processing:**  
   Analytics tasks are divided into smaller subtasks, each running simultaneously on different nodes. This parallelism drastically reduces the time needed to analyze huge volumes of data.

3. **Data Locality Optimization:**  
   Distributed systems often move computation close to where data is stored (data locality).

4. **Fault Tolerance and Reliability:**  
   Data is replicated across multiple nodes. If one node fails during analysis, the system can continue processing using replicas.

5. **Scalability:**  
   Adding more machines to the system easily increases storage and processing power, allowing analytics to scale smoothly as data grows.

In summary, distributed systems provide the scalable, reliable infrastructure and parallel computation capabilities that Big Data analytics requires to efficiently process and gain insights from massive datasets.
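
As a rough sketch of data distribution and fault tolerance, the snippet below assigns each record to a primary node by hashing its key and keeps a copy on the next node. The node names, the replica count, and the helper functions (`owners`, `store`, `read`) are hypothetical; real systems use more elaborate placement and recovery schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machines
REPLICAS = 2  # assumed number of copies kept for every record


def owners(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes storing `key`: a hashed primary plus the following node(s)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def store(records):
    """Distribute records across the nodes, writing each one to all its owners."""
    cluster = {node: {} for node in NODES}
    for key, value in records.items():
        for node in owners(key):
            cluster[node][key] = value
    return cluster


def read(cluster, key, failed=frozenset()):
    """Read `key` from the first owner that has not failed."""
    for node in owners(key):
        if node not in failed:
            return cluster[node][key]
    raise KeyError(f"all replicas of {key!r} are unavailable")


cluster = store({"user:1": "alice", "user:2": "bob"})
primary = owners("user:1")[0]
# Even with the primary node down, the replica still serves the record.
print(read(cluster, "user:1", failed={primary}))
```

Adding a machine here means adding a name to `NODES`, although with this simple modulo scheme most keys would move to a different node; that is one reason real systems tend to prefer techniques such as consistent hashing.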

# Simple Real-World Example: How Distributed Systems Help Big Data Analytics

Imagine a popular video streaming service like `Netflix`.

- Every second, millions of users watch videos and generate data (which videos they watch, for how long, ratings, etc.).

- This data is too big to store or analyze on one computer.

- So, Netflix uses a **distributed system**, which means the data is split and saved across many computers around the world working together.

- When Netflix wants to find out which shows are most popular, it sends parts of the data to different computers to analyze at the same time.

- Each computer processes its part quickly and sends back the results.

- Finally, Netflix combines all these results to understand viewers' habits and recommend new shows.
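
The split, process, and combine steps above can be sketched as follows. This is not how Netflix actually implements it: the event list is invented, and threads on one machine stand in for separate computers so that the example stays self-contained.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical viewing events: one show title per play.
events = ["Show A", "Show B", "Show A", "Show C", "Show A", "Show B"]


def split(items, parts):
    """Split the data into `parts` roughly equal chunks."""
    return [items[i::parts] for i in range(parts)]


def count_views(chunk):
    """Work done by one 'computer': count plays per show in its own chunk."""
    return Counter(chunk)


def most_popular(items, workers=3):
    """Process all chunks at the same time, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_views, split(items, workers)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common()


ranking = most_popular(events)
print(ranking)
```

The same pattern (often called map and reduce) is what engines such as Apache Spark apply across many machines.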

# Tools that help in Big Data analytics

## 1. Data Storage
- **Google BigQuery**: Fully managed, serverless data warehouse with fast SQL queries and scalable storage.
- **Amazon S3**: Scalable, durable cloud storage widely used for big data storage in enterprise environments.

## 2. Data Processing
- **Apache Spark**: Fast, in-memory processing engine supporting batch and real-time analytics.
- **Databricks**: Unified analytics platform based on Apache Spark, with enhanced collaboration, management, and optimization features.   

## 3. Data Ingestion
- **Apache Kafka**: Distributed messaging system for real-time data streaming and ingestion.  
- **Amazon Kinesis**: AWS streaming service to collect, process, and analyze real-time data.

## 4. Data Querying and Warehousing
- **Apache Hive**: Data warehouse built on Hadoop for SQL-like querying on large datasets.  
- **Snowflake**: Cloud data platform providing scalable data warehousing and analytics.

## 5. Data Visualization and Reporting
- **Tableau**: Industry-leading visualization tool for creating interactive dashboards and reports.  
- **Microsoft Power BI**: Business analytics platform with rich visualization and integration capabilities.

# Sources:
<a href="https://www.ibm.com/think/topics/big-data-analytics">IBM</a>

<a href="https://www.webdevstory.com/big-data-terms-and-definitions/">webdevstory</a>