Question 1 : Describe the different types of data sources used in ETL, with suitable examples.


```
The ETL (Extract, Transform, Load) process draws on several types of data sources, depending on business needs. These sources can be broadly classified as follows:

1. Database Sources
Data stored in and managed by database systems, either structured (relational) or semi-structured (NoSQL).

🔹 Examples:
    > Relational Databases: MySQL, PostgreSQL, Oracle, SQL Server.
    > NoSQL Databases: MongoDB, Cassandra.

🔹 Use Case:
    > Customer records stored in a MySQL database.
    > Sales transactions stored in Oracle DB.


2. Flat File Sources
Data stored as standalone files outside a database system.

🔹 Examples:
    > CSV (Comma Separated Values).
    > Excel files (.xls, .xlsx).
    > Text files (.txt).
    > JSON and XML files.

🔹 Use Case:
    > Daily sales report in CSV.
    > Employee details maintained in Excel.

3. Application Sources
Data generated by business applications.

🔹 Examples:
    > ERP systems (SAP).
    > CRM systems (Salesforce).
    > HR systems (Workday).

🔹 Use Case:
    > Customer interaction data from Salesforce.
    > Inventory data from SAP.

4. Web Services / APIs
Data collected from external or internal APIs.

🔹 Examples:
    > REST APIs.
    > SOAP APIs.

🔹 Use Case:
    > Weather data from an external REST API.
    > Payment data from a payment gateway API.

```
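As a rough illustration of the source types above, the sketch below reads one small record set from each of three kinds of source: a database table, a CSV flat file, and a JSON body of the sort a REST API might return. It uses only the Python standard library. The table, column names, and sample values are made up for the example, and an in-memory SQLite database stands in for MySQL or Oracle.

```python
import csv
import io
import json
import sqlite3


def from_database(conn):
    """Database source: rows from a relational table, as dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id")
    return [dict(row) for row in rows]


def from_csv(text):
    """Flat-file source: CSV text parsed into dicts (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text)))


def from_api_payload(body):
    """API source: a JSON response body (the 'items' key is an assumed layout)."""
    return json.loads(body)["items"]


# Hypothetical sample data; SQLite stands in for an enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])

db_rows = from_database(conn)
csv_rows = from_csv("date,amount\n2024-01-01,120.50\n")
api_rows = from_api_payload('{"items": [{"city": "Pune", "temp_c": 31}]}')
```

Note that the database and JSON sources keep their data types, while the CSV source returns every value as text, so type conversion has to happen later in the pipeline.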
Question 2 : What is data extraction? Explain its role in the ETL pipeline.


```
Data Extraction is the process of collecting raw data from various source systems such as databases, files, APIs, or applications for further processing in the ETL pipeline.

Its goal is to pull the required data from the source systems correctly and efficiently.

Role of Data Extraction in the ETL Pipeline
   1. Starting Point of ETL
      Data extraction is the foundation of the ETL pipeline. If incorrect or incomplete data is extracted, every later stage will produce unreliable results.

   2. Collects Data from Multiple Sources
      Extraction gathers data from different systems such as:
      # Relational databases (MySQL, Oracle)
      # Flat files (CSV, Excel)
      # APIs and web services
      # Cloud storage systems

   3. Ensures Data Availability for Transformation
      Extracted data is made available for cleaning, validation, and transformation before loading into the target system.

   4. Supports Full and Incremental Loads
      # Full extraction: The complete dataset is extracted on every run.
      # Incremental extraction: Only records added or modified since the last run are extracted.
      Incremental extraction moves less data, which improves performance and reduces load on the source system.

   5. Maintains Source System Performance
      Efficient extraction minimizes the impact on operational systems by scheduling jobs during low-usage periods or using optimized queries.
```
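The difference between full and incremental extraction can be sketched as follows. This is a minimal example, assuming the source table has an `updated_at` column holding ISO-8601 timestamps (so string comparison matches time order). The `orders` table and the function names are hypothetical, and SQLite stands in for the real source database.

```python
import sqlite3


def extract_full(conn):
    """Full extraction: read every row on every run."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()


def extract_incremental(conn, last_watermark):
    """Incremental extraction: read only rows changed after the watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark to the newest timestamp seen, if any.
    new_watermark = max((row[2] for row in rows), default=last_watermark)
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T09:00:00"),
        (2, 25.0, "2024-01-02T09:00:00"),
        (3, 40.0, "2024-01-03T09:00:00"),
    ],
)

all_rows = extract_full(conn)
changed, watermark = extract_incremental(conn, "2024-01-01T12:00:00")
```

Storing the returned watermark between runs lets the next run pick up where this one stopped.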
Question 3 : Explain the difference between CSV and Excel in terms of extraction and ETL usage.


```
CSV and Excel are commonly used flat file data sources in ETL processes. However, they differ significantly in structure, extraction complexity, and ETL usage.

1. File Format and Structure
    > CSV (Comma-Separated Values):
    A plain text file where data values are separated by commas. It contains only raw data with no formatting.

    > Excel:
    A workbook format (.xls is binary; .xlsx is a ZIP archive of XML files) that can contain multiple sheets, formulas, formatting, and charts.

2. Ease of Data Extraction
    > CSV:
    Easy to extract because of its simple and consistent structure. Most ETL tools can read CSV files directly with minimal configuration.

    > Excel:
    More complex to extract due to multiple sheets, merged cells, formulas, and formatting. ETL tools typically need additional configuration to select specific sheets and cell ranges.

```
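The extraction difference shows up directly in code. In the sketch below, the CSV reader needs nothing beyond the text itself, while the Excel reader has to be told which sheet and header row to use. The Excel function assumes pandas and an Excel engine such as openpyxl are installed, and it is defined here but not run. The file layout and column names are invented for the example.

```python
import csv
import io


def read_csv_rows(text):
    """CSV: plain text, one table, every value arrives as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def read_excel_rows(path, sheet_name, header_row=0):
    """Excel: the caller must choose a sheet and a header row.

    Assumes pandas plus an Excel engine (e.g. openpyxl) are installed.
    """
    import pandas as pd  # imported lazily so the CSV path has no dependency

    frame = pd.read_excel(path, sheet_name=sheet_name, header=header_row)
    return frame.to_dict(orient="records")


rows = read_csv_rows("emp_id,name,salary\n101,Meera,52000\n")
salary = int(rows[0]["salary"])  # CSV carries no types, so convert explicitly
```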
Question 4 : Explain the steps involved in extracting data from a relational database.


```
> Identify the source database
  Select the relational database (e.g., MySQL, Oracle, SQL Server) and understand its schema, tables, and relationships.

> Define data requirements
  Decide which tables, columns, and records are required, and whether the extraction will be full or incremental.

> Establish database connection
  Connect to the database using proper credentials and JDBC/ODBC drivers.

> Write SQL queries
  Create SELECT queries with necessary filters, joins, and conditions to retrieve accurate data.

> Execute data extraction
  Run the queries to extract data. For large datasets, fetch rows in batches to limit memory use and avoid long-running locks on the source.

> Apply incremental extraction (if needed)
  Extract only new or updated records using a last-modified timestamp, a steadily increasing primary key, or change data capture (CDC).

> Validate extracted data
  Verify data completeness, accuracy, and consistency with the source system.

> Store data in a staging area
  Save the extracted data temporarily for further transformation and loading.
```
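Several of the steps above (connect, query with filters, extract in batches, validate, stage) can be sketched in a few lines. This is only an illustration: SQLite replaces the JDBC/ODBC connection, the `sales` table and the `extract_to_staging` helper are hypothetical, and the staging area is a temporary CSV file.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract_to_staging(conn, query, params, staging_path, batch_size=2):
    """Run the query and write the result to a staging CSV in batches."""
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    written = 0
    with open(staging_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        while True:
            batch = cursor.fetchmany(batch_size)  # bounded memory per fetch
            if not batch:
                break
            writer.writerows(batch)
            written += len(batch)
    return written


# Connect to the source (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "north", 100.0), (2, "south", 80.0), (3, "north", 45.5), (4, "north", 10.0)],
)

# Query with a filter, extract in batches, and store in staging.
staging_file = Path(tempfile.mkdtemp()) / "sales_north.csv"
query = "SELECT id, region, amount FROM sales WHERE region = ? ORDER BY id"
written = extract_to_staging(conn, query, ("north",), staging_file)

# Validate: the staged row count should match the source row count.
expected = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE region = ?", ("north",)
).fetchone()[0]
assert written == expected, "row count mismatch between source and staging"
```

A row-count comparison is the simplest validation; checksums or column totals would catch more subtle problems.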
Question 5 : Explain three common challenges faced during data extraction.


```
> Data Quality Issues
  Source data may contain missing values, duplicates, inconsistent formats, or incorrect entries. These issues can lead to inaccurate extraction results and affect the quality of data used in later ETL stages.

> Performance Impact on Source Systems
  Extracting large volumes of data can slow down operational databases, especially during peak business hours. Poorly optimized queries may increase load and affect system performance.

> Data Format and Schema Changes
  Changes in the structure of a source, such as added or modified columns, data type changes, or file format variations, can break extraction processes and cause failures in the ETL pipeline.
```
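Two of these challenges, data quality issues and schema changes, can be detected early with simple checks at extraction time. The sketch below is one possible approach, not a standard one: `EXPECTED_COLUMNS`, the column names, and both helper functions are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["order_id", "customer", "amount"]  # hypothetical schema


def check_schema(header):
    """Return a list of schema problems; an empty list means the header matches."""
    problems = []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    extra = [name for name in header if name not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems


def check_quality(rows, key="order_id"):
    """Count duplicate keys and rows that contain an empty value."""
    seen = set()
    report = {"duplicates": 0, "missing_values": 0}
    for row in rows:
        if row[key] in seen:
            report["duplicates"] += 1
        seen.add(row[key])
        if any(value is None or value.strip() == "" for value in row.values()):
            report["missing_values"] += 1
    return report


sample = "order_id,customer,amount\n1,Asha,120\n2,,80\n2,Ben,80\n"
reader = csv.DictReader(io.StringIO(sample))
schema_problems = check_schema(reader.fieldnames)
quality = check_quality(list(reader))
```

Running the schema check before reading any rows lets the job fail fast instead of loading misaligned data.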
Question 6 : What are APIs? Explain how APIs help in real-time data extraction.

```
APIs (Application Programming Interfaces) are sets of rules and protocols that allow different software applications to communicate with each other. APIs define how requests for data are made and how responses are returned, usually in formats like JSON or XML.

How APIs Help in Real-Time Data Extraction
  > Direct Access to Live Data
    APIs provide direct access to data generated by applications in real time, such as user activities, transactions, or sensor data.

  > Event-Based Data Retrieval
    Many APIs support event-driven or streaming mechanisms (for example, webhooks or WebSockets), allowing systems to receive data as soon as an event occurs.

  > Standardized Data Exchange
    APIs use standard request and response formats, making it easier to extract and integrate real-time data from different systems.

  > Secure Data Access
    APIs support authentication and authorization methods (such as API keys or tokens), ensuring secure real-time data extraction.

  > Scalability and Automation
    APIs enable automated and scalable data extraction without manual intervention, which is essential for continuous real-time ETL processes.
```
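A common way to approximate real-time extraction is to poll an API frequently and request only the records newer than the last one seen. The sketch below shows that pattern with token authentication and simple pagination. Everything about the API is an assumption made for illustration: the `/events?after=` endpoint, the `items` and `has_more` fields, and the `api.example.com` host do not refer to a real service. A fake fetch function replaces the network call so the example runs offline.

```python
import json
import urllib.request


def http_get_json(url, token):
    """Perform an authenticated GET and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def extract_events(base_url, token, since_id, fetch=http_get_json):
    """Pull all events newer than since_id, following pagination."""
    events = []
    cursor = since_id
    while True:
        page = fetch(f"{base_url}/events?after={cursor}", token)
        events.extend(page["items"])
        if not page["items"] or not page.get("has_more"):
            break
        cursor = page["items"][-1]["id"]  # continue after the last item seen
    return events


def fake_fetch(url, token):
    """Offline stand-in for the API: five events, served two per page."""
    after = int(url.rsplit("=", 1)[1])
    data = [{"id": i, "type": "payment"} for i in range(1, 6)]
    items = [event for event in data if event["id"] > after][:2]
    return {"items": items, "has_more": bool(items) and items[-1]["id"] < 5}


events = extract_events(
    "https://api.example.com", "demo-token", since_id=0, fetch=fake_fetch
)
```

Polling gives near real-time results; for truly event-driven delivery the source would push data to the pipeline instead.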
Question 7 : Why are databases preferred for enterprise-level data extraction?


```
> High Data Reliability
  Databases enforce data consistency and integrity through constraints, transactions, and validation rules, which makes the extracted data more trustworthy.

> Efficient Handling of Large Data Volumes
  Enterprise database systems are designed to store and query large volumes of data efficiently.

> Structured Data Organization
  Data is stored in well-defined tables with clear relationships, making extraction easier and more systematic.

> Support for Complex Queries
  Databases allow advanced SQL queries, joins, and aggregations to extract precise and meaningful data.

> Security and Access Control
  Databases provide strong security features such as authentication, authorization, and role-based access control.

> Support for Incremental and Real-Time Extraction
  Features like timestamps, triggers, and Change Data Capture (CDC) enable incremental and near real-time data extraction.

> Scalability and Performance Optimization
  Indexing, partitioning, and query optimization techniques help maintain high performance as data grows.

> Integration with ETL Tools
  Most enterprise ETL tools natively support database connections, making automation and scheduling easier.
```
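A few of these points (constraints, indexing, and complex queries) can be seen together in a short sketch. The schema below is invented for the example, and SQLite stands in for an enterprise database. The join and aggregation run inside the database, so only the summarized result is extracted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL CHECK (amount >= 0)
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
    """
)

# Complex query: join and aggregate in the database, extract only the result.
totals = conn.execute(
    """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

# Reliability: the CHECK constraint rejects an invalid row at the source.
try:
    conn.execute("INSERT INTO orders VALUES (4, 1, -5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```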
Question 8 : What steps should an ETL developer take when extracting data from large CSV files (1GB+)?

```
> Use Streaming or Chunk-Based Processing
  Read the CSV file in chunks instead of loading the entire file into memory, so the job does not run out of memory.

> Validate File Structure Before Processing
  Check delimiters, headers, encoding, and column consistency to prevent extraction failures.

> Apply Filters Early
  Extract only required columns and rows to reduce processing time and resource usage.

> Handle Data Types and Null Values Properly
  Define data types explicitly and manage missing or invalid values during extraction.

> Enable Parallel Processing (If Possible)
  Split the file into parts or use parallel workers to speed up extraction.

> Use Compression and Efficient Storage
  Work with compressed files or convert data into optimized formats to improve I/O performance.

> Log and Monitor the Extraction Process
  Track progress, errors, and performance metrics to quickly identify and fix issues.

> Test with Sample Data First
  Validate extraction logic on smaller samples before processing the full 1GB+ file.
```
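Chunk-based reading, early column filtering, and explicit type handling can be combined in one small generator. The sketch below uses only the standard library; the `iter_chunks` and `to_float_or_none` helpers and the column names are hypothetical, and a tiny in-memory sample replaces the 1GB+ file so the example runs anywhere.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, columns, chunk_size):
    """Yield lists of typed rows, at most chunk_size rows at a time.

    columns maps each required column name to a converter function;
    all other columns are dropped as early as possible.
    """
    reader = csv.DictReader(handle)
    # Validate the file structure before processing any rows.
    missing = set(columns) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    while True:
        chunk = [
            {name: convert(row[name]) for name, convert in columns.items()}
            for row in islice(reader, chunk_size)
        ]
        if not chunk:
            return
        yield chunk


def to_float_or_none(value):
    """Treat empty cells as missing instead of failing the conversion."""
    return float(value) if value not in (None, "") else None


sample = io.StringIO(
    "id,region,amount,notes\n1,north,10.5,a\n2,south,,b\n3,north,4.0,c\n"
)
columns = {"id": int, "amount": to_float_or_none}

chunk_sizes = []
total = 0.0
for chunk in iter_chunks(sample, columns, chunk_size=2):
    chunk_sizes.append(len(chunk))
    total += sum(row["amount"] or 0.0 for row in chunk)
```

For a real file, the same generator would be given `open(path, newline="", encoding="utf-8")` and a much larger chunk size, for example tens of thousands of rows.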







