# Understanding Big Data, Challenges, and the Hadoop Platform

## Evolution of Big Data

- Initially, data was structured and manageable using traditional systems.
- As data formats expanded (JSON, XML, images, videos, documents), traditional RDBMS began to struggle.
- This led to the need for classifying data into:
  - **Structured:** Organized into rows and columns with a fixed schema (e.g., relational databases).
  - **Semi-structured:** Self-describing data organized by tags or key-value pairs, without a fixed tabular schema (e.g., JSON, XML).
  - **Unstructured:** No predefined data model (e.g., PDFs, images, videos, free-form text).
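
The three categories can be illustrated with a small Python sketch. The sample records, field names, and byte values below are invented for illustration; only the standard library is used.

```python
import csv
import io
import json

# Structured: every record has the same columns, so it maps directly onto a table.
csv_text = "id,name,age\n1,Asha,31\n2,Ben,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records describe themselves with keys, but the set of
# fields can differ from record to record and values can nest.
json_text = (
    '[{"id": 1, "name": "Asha", "tags": ["admin"]},'
    ' {"id": 2, "name": "Ben", "address": {"city": "Pune"}}]'
)
docs = json.loads(json_text)
doc_keys = [sorted(d.keys()) for d in docs]

# Unstructured: just bytes. Any meaning has to be extracted by separate
# processing (image decoding, text mining, and so on).
blob = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of a PNG file signature

print(rows[1]["name"])   # same columns for every row
print(doc_keys)          # key sets differ between documents
print(blob[1:4])         # raw bytes, no schema
```

Note how the CSV rows all share one set of columns, while the two JSON documents carry different keys. That difference is what makes semi-structured data awkward to load into a fixed relational table.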

## Problems with Big Data and Their Solutions

### The Emerging Big Data Challenges (The 3 Vs)

- **Variety:** Data arrives in many formats (structured, semi-structured, unstructured).
- **Volume:** The sheer amount of data generated, often more than a single machine can store or process.
- **Velocity:** The high speed at which data is generated and must be collected.

### Solutions to Big Data Problems

#### Monolithic Approach

- One large and robust system (e.g., Teradata).
- Centralized infrastructure with vast resources (CPU, RAM, storage).

**Limitations:**

- **Scalability:** Only vertical scaling is possible (adding more powerful hardware to the same machine), and that has an upper limit.
- **Fault Tolerance:** Single point of failure; if the machine crashes, the entire system goes down.
- **Cost:** High up-front investment, because capacity must be bought before it is fully needed. Such systems are also difficult to maintain.

#### Distributed Approach

- Many smaller systems working together in a cluster.
- Each node contributes to storage and processing.

**Advantages:**

- **Scalability:** Horizontal scaling; add more machines to the cluster as needed.
- **Fault Tolerance:** Failure of a single node does not bring down the whole system, because other nodes can take over its work and, where data is replicated, still serve its data.
- **Cost-Effective:** Start small, use commodity hardware, or rent infrastructure (e.g., cloud services).
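
The fault-tolerance point can be made concrete with a toy simulation. This is a minimal sketch, assuming a simple round-robin placement of data blocks with a fixed replication factor; the node names and the `place_blocks` and `readable_blocks` helpers are hypothetical, and real distributed file systems use more sophisticated placement policies.

```python
from typing import Dict, List, Set


def place_blocks(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """Return the blocks that still have at least one copy on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(num_blocks=8, nodes=nodes, replication=3)

# One node down: every block still has a surviving replica.
after_one_failure = readable_blocks(placement, failed={"node-b"})

# Three nodes down: blocks whose replicas all sat on the failed nodes are lost.
after_three_failures = readable_blocks(placement, failed={"node-a", "node-b", "node-c"})

print(len(after_one_failure), "of", len(placement), "blocks readable")
print(len(after_three_failures), "of", len(placement), "blocks readable")
```

With three copies of each block, losing one node loses no data. Data only becomes unavailable once every node holding a given block has failed, which is the trade-off between storage overhead and resilience that a replication factor controls.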

---

## Hadoop Platform

- An open-source framework for distributed storage and processing of large datasets.
- Designed to address the volume, velocity, and variety of big data.
- Uses commodity hardware and enables horizontal scalability.
- Key components:
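  - **HDFS (Hadoop Distributed File System):** Stores big data across multiple machines by splitting files into blocks and replicating each block on several nodes.
  - **MapReduce:** A programming model for processing data in parallel across the cluster.
  - **YARN (Yet Another Resource Negotiator):** Allocates cluster resources (such as CPU and memory) and schedules the applications that use them.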
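
To give a feel for the MapReduce programming model, here is a minimal single-process sketch of the classic word-count job in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions made for illustration. On a real cluster, the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data needs big clusters", "data moves to clusters"]
counts = word_count(lines)
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on the values for its own key, both phases can be spread over many machines without coordination beyond the shuffle.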

---

# Broad Classes of Software

1. System Software
   - Operating systems and utility programs that manage hardware and software.

2. Programming Languages
   - Tools to write software (e.g., Java, Python, C++).

3. Desktop Applications
   - Standalone software installed on personal computers (e.g., MS Office, Photoshop).

4. Data Processing Technologies
   - Early example: **COBOL**, one of the first serious attempts at data processing.
   - Allowed storing and processing data in files.

5. RDBMS (Relational Database Management Systems)
   - Examples: MySQL, SQL Server, Teradata.
   - Provided features like:
     - SQL for querying.
     - Scripting with PL/SQL, T-SQL.
     - ODBC and JDBC for programming language interaction.

6. Internet-Led Innovations
   - Led to development of websites and web applications.

7. Platform Development
   - Tools and environments for building and running applications (e.g., Java Platform, .NET).

8. Cloud Computing
   - On-demand delivery of computing resources via the internet (e.g., AWS, Azure, Google Cloud).

9. Machine Learning and AI
   - Advanced techniques to make systems intelligent and data-driven.
   - Applications in data analysis, prediction, automation, and more.
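
As an illustration of the RDBMS entry in the list above (SQL for querying, plus a driver interface that lets a programming language talk to the database), here is a minimal sketch using Python's built-in `sqlite3` module. SQLite and Python's DB-API stand in for the RDBMS and the ODBC/JDBC layer; they are not ODBC or JDBC themselves, and the `orders` table and its rows are invented for the example.

```python
import sqlite3

# Open an in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a table with a fixed schema (structured data).
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Insert rows through the driver using parameter placeholders.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("asha", 120.0), ("ben", 80.0), ("asha", 30.0)],
)

# Query with SQL: total order amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The program never touches the storage format directly. It sends SQL text through a driver and gets rows back, which is the same division of labor that ODBC and JDBC provide for other languages and databases.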
