# **Hands-on Lab: Implementing the ETL (Extract, Transform, Load) Process**

**Objective:**

In this hands-on lab, students will learn how to implement the fundamental steps of the ETL process by extracting data from multiple sources, transforming the data, and loading it into a database. Students will use Python along with libraries such as Pandas for data transformation and PyMongo for loading the data into a MongoDB database.

By the end of this lab, students will be able to:

* Extract data from different sources (CSV and API).
* Clean, transform, and validate the data.
* Load the transformed data into MongoDB.
* Automate the ETL process by building a reusable pipeline.

**Pre-requisites:**

* Basic knowledge of Python.
* MongoDB Atlas account (or a local MongoDB instance).
* Install the required Python libraries: pandas, pymongo, and requests (for example, by running pip install pandas pymongo requests).



**In this Lab:**

You are tasked with creating an ETL pipeline for a fictitious retail company. You will extract product and sales data from different sources (a CSV file and a REST API), transform the data by cleaning and standardizing it, and load the transformed data into MongoDB for further analysis.

**Step 1: Extract Data**

**1.1. Extract Product Data from a CSV File**

Create a CSV file named ***products.csv*** with the following data:

product_id,product_name,category,price
1001,Laptop,Electronics,1200
1002,Smartphone,Electronics,800
1003,Chair,Furniture,150

Use Python and Pandas to extract the product data from this CSV file.

In [2]:
import pandas as pd

# Extract data from the CSV file
products_df = pd.read_csv('products.csv')
print("Extracted Product Data:")
print(products_df)


Extracted Product Data:
   product_id product_name     category  price
0        1001       Laptop  Electronics   1200
1        1002   Smartphone  Electronics    800
2        1003        Chair    Furniture    150
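
Identifier columns such as product_id are often safer stored as text. By default read_csv infers integers for this column, while the sales records in Step 1.2 carry product_id as strings, so a later join needs both sides converted. A minimal sketch of reading the key as text from the start, using an in-memory copy of the CSV so it runs without a file on disk:

```python
import io

import pandas as pd

# Stand-in for products.csv so the sketch runs without a file on disk
csv_text = """product_id,product_name,category,price
1001,Laptop,Electronics,1200
1002,Smartphone,Electronics,800
1003,Chair,Furniture,150
"""

# dtype tells pandas to keep product_id as text instead of inferring int64
products = pd.read_csv(io.StringIO(csv_text), dtype={"product_id": str})
print(products["product_id"].tolist())  # ['1001', '1002', '1003']
```

With a real file, the same dtype argument can be passed to pd.read_csv('products.csv', ...).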


**1.2. Extract Sales Data from a REST API**

For the sales data, we will simulate an API response using a list of dictionaries, one per sale. In a real-world scenario, you would use the requests library to fetch this data from an API.

In [3]:
import requests

# Simulated API response (in a real scenario, use requests.get(URL).json())
sales_data = [
    {"sale_id": "S001", "product_id": "1001", "quantity": 2, "total": 2400},
    {"sale_id": "S002", "product_id": "1002", "quantity": 1, "total": 800},
    {"sale_id": "S003", "product_id": "1003", "quantity": 4, "total": 600}
]

print("Extracted Sales Data:")
print(sales_data)


Extracted Sales Data:
[{'sale_id': 'S001', 'product_id': '1001', 'quantity': 2, 'total': 2400}, {'sale_id': 'S002', 'product_id': '1002', 'quantity': 1, 'total': 800}, {'sale_id': 'S003', 'product_id': '1003', 'quantity': 4, 'total': 600}]
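
In a real pipeline the sales records would come from an HTTP call rather than a hard-coded list. The sketch below shows one way that might look with requests. The URL is a placeholder, fetch_sales is a hypothetical helper name, and the code assumes the endpoint returns a JSON array of sale objects.

```python
import requests


def fetch_sales(url, session=None, timeout=10):
    """Fetch sales records from a JSON endpoint (hypothetical helper)."""
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    records = response.json()
    if not isinstance(records, list):
        # Assumption: the API returns a JSON array of sale objects
        raise ValueError("expected a JSON array of sales records")
    return records


# Placeholder URL, not a real service:
# sales_data = fetch_sales("https://api.example.com/sales")
```

The optional session argument makes it possible to pass a stub object in tests, so the function can be exercised without network access.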


**Step 2: Transform Data**

**2.1. Clean and Standardize the Product Data**

Use Pandas to clean and standardize the product data. In this example, you store product_id as a string so that it matches the IDs in the sales data, ensure the price field is numeric, and keep only products priced below 1000. The filtered result goes into a separate DataFrame so that products_df still contains every product for the join in Step 2.2.

In [6]:
# Store product_id as a string so it matches the sales data
products_df["product_id"] = products_df["product_id"].astype(str)

# Ensure price is numeric; values that cannot be parsed become NaN
products_df["price"] = pd.to_numeric(products_df["price"], errors="coerce")

# Keep only products priced below 1000
transformed_products_df = products_df[products_df["price"] < 1000]
print("Transformed Product Data:")
print(transformed_products_df)


Transformed Product Data:
   product_id product_name     category  price
1        1002   Smartphone  Electronics    800
2        1003        Chair    Furniture    150
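
The errors='coerce' option used for the price conversion in this lab replaces any value that cannot be parsed with NaN instead of raising an error. The lab's sample data is already clean, so the sketch below uses a hypothetical messy input to show what that looks like and how the rejected rows might be separated out.

```python
import pandas as pd

# Hypothetical messy input: one price is not a number
raw = pd.DataFrame({
    "product_id": ["1001", "1002", "1003"],
    "price": ["1200", "n/a", "150"],
})

# Unparseable values become NaN rather than raising an exception
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

rejected = raw[raw["price"].isna()]   # rows to report or inspect
clean = raw.dropna(subset=["price"])  # rows safe to pass on

print(len(clean), "valid rows,", len(rejected), "rejected")  # 2 valid rows, 1 rejected
```

Keeping the rejected rows in their own DataFrame, rather than dropping them silently, makes data-quality problems visible before the load step.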


**2.2. Enrich the Sales Data**

For the sales data, we'll perform a simple enrichment: add the product_name to each sale by joining sales_df with products_df on product_id. A left join keeps every sale, even one whose product_id has no match in the product data.

In [7]:
# Convert sales_data to a DataFrame
sales_df = pd.DataFrame(sales_data)

# Join sales data with product data to add product_name
sales_df = pd.merge(sales_df, products_df[['product_id', 'product_name']], on='product_id', how='left')
print("Enriched Sales Data:")
print(sales_df)


Enriched Sales Data:
  sale_id product_id  quantity  total product_name
0    S001       1001         2   2400       Laptop
1    S002       1002         1    800   Smartphone
2    S003       1003         4    600        Chair
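
The lab objectives mention validating the data. One possible check on the sales records is sketched below: it flags sales that reference an unknown product and sales whose total does not equal quantity times price. That pricing rule holds for the sample data but is an assumption here, and validate_sales is a hypothetical helper name.

```python
import pandas as pd


def validate_sales(sales, products):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    merged = sales.merge(products[["product_id", "price"]], on="product_id", how="left")

    # A missing price after the left join means the product_id was not found
    unknown = merged[merged["price"].isna()]
    for sale_id in unknown["sale_id"]:
        problems.append(f"{sale_id}: unknown product_id")

    # Assumed rule: total should equal quantity * price
    known = merged.dropna(subset=["price"])
    mismatched = known[known["total"] != known["quantity"] * known["price"]]
    for sale_id in mismatched["sale_id"]:
        problems.append(f"{sale_id}: total does not equal quantity * price")
    return problems


products = pd.DataFrame({
    "product_id": ["1001", "1002", "1003"],
    "price": [1200, 800, 150],
})
sales = pd.DataFrame([
    {"sale_id": "S001", "product_id": "1001", "quantity": 2, "total": 2400},
    {"sale_id": "S002", "product_id": "1002", "quantity": 1, "total": 800},
    {"sale_id": "S003", "product_id": "1003", "quantity": 4, "total": 600},
])
print(validate_sales(sales, products))  # []
```

Exact equality is fine for whole-number prices like these; prices with decimals would call for a tolerance-based comparison instead.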


**Step 3: Load Data into MongoDB**

Now that the data is transformed and cleaned, load the product and sales data into MongoDB.

**3.1. Connect to MongoDB**

Ensure you have MongoDB running locally or use MongoDB Atlas. Connect to MongoDB using PyMongo.

In [9]:
from pymongo import MongoClient

# Connect to MongoDB Atlas (replace <username> and <password> with your own credentials;
# do not leave real credentials in a notebook that you share)
connection_string = "mongodb+srv://<username>:<password>@cluster0.xjx2q.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
client = MongoClient(connection_string)

# Access the database named retails_db (MongoDB creates it on the first write)
db = client['retails_db']
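
Credentials are easier to keep out of a notebook if the connection string is assembled from environment variables. The sketch below is one way to do that. The variable names MONGO_USER, MONGO_PASSWORD, and MONGO_HOST are hypothetical, as are the fallback values, and build_mongo_uri is an illustrative helper. The username and password are percent-encoded because characters such as @ and / would otherwise break the URI.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host):
    """Assemble a mongodb+srv URI with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Hypothetical environment variable names; the defaults are placeholders only
uri = build_mongo_uri(
    os.environ.get("MONGO_USER", "demo_user"),
    os.environ.get("MONGO_PASSWORD", "p@ss/word"),
    os.environ.get("MONGO_HOST", "cluster0.example.mongodb.net"),
)

# The result can then be passed to PyMongo:
# client = MongoClient(uri)
```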

**3.2. Load Product Data**

Insert the transformed product data into the MongoDB products collection.

In [11]:
# Convert DataFrame to dictionary and insert into MongoDB
product_records = products_df.to_dict(orient='records')
db.products.insert_many(product_records)
print("Loaded Product Data into MongoDB")


Loaded Product Data into MongoDB


**3.3. Load Sales Data**

Insert the enriched sales data into the MongoDB sales collection.

In [12]:
# Convert DataFrame to dictionary and insert into MongoDB
sales_records = sales_df.to_dict(orient='records')
db.sales.insert_many(sales_records)
print("Loaded Sales Data into MongoDB")


Loaded Sales Data into MongoDB
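
Note that insert_many adds new documents every time it runs, so re-running the load cells duplicates the rows in each collection. A minimal sketch of an alternative that can be re-run safely, using PyMongo's replace_one with upsert=True and matching on a business key such as sale_id or product_id (upsert_records is a hypothetical helper name):

```python
def upsert_records(collection, records, key):
    """Insert or replace each record, matching on `key`, so re-runs do not duplicate rows."""
    for record in records:
        # Drop any _id left over from an earlier insert_many call on the same dicts
        document = {k: v for k, v in record.items() if k != "_id"}
        # upsert=True inserts the document when no existing one matches the filter
        collection.replace_one({key: document[key]}, document, upsert=True)
    return len(records)


# Example calls against the lab's collections:
# upsert_records(db.products, products_df.to_dict(orient='records'), key='product_id')
# upsert_records(db.sales, sales_df.to_dict(orient='records'), key='sale_id')
```

This issues one request per record, which is fine for a handful of rows; for large loads a batched approach would be worth considering.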


**Step 4: Automate the ETL Process**

To make the ETL process reusable, wrap the steps into functions and run the ETL pipeline from start to finish.

In [14]:
def extract_products():
    """Read the product data from products.csv."""
    return pd.read_csv('products.csv')

def extract_sales():
    """Build a DataFrame from the simulated API response defined in Step 1.2."""
    return pd.DataFrame(sales_data)

def transform_products(products_df):
    """Coerce price to a number and keep only products priced below 1000."""
    products_df = products_df.copy()  # leave the caller's DataFrame unchanged
    products_df['price'] = pd.to_numeric(products_df['price'], errors='coerce')
    return products_df[products_df['price'] < 1000]

def transform_sales(sales_df, products_df):
    """Add product_name to each sale by joining on product_id."""
    # Cast both keys to string so the merge does not fail on mismatched dtypes
    products_df = products_df.astype({'product_id': str})
    sales_df = sales_df.astype({'product_id': str})
    return pd.merge(sales_df, products_df[['product_id', 'product_name']], on='product_id', how='left')

def load_data(products_df, sales_df):
    """Insert both DataFrames into MongoDB using the db handle from Step 3.1."""
    db.products.insert_many(products_df.to_dict(orient='records'))
    db.sales.insert_many(sales_df.to_dict(orient='records'))

# Run the ETL pipeline
products_df = extract_products()
sales_df = extract_sales()
transformed_products_df = transform_products(products_df)
# The join uses the unfiltered products so every sale gets a product_name
transformed_sales_df = transform_sales(sales_df, products_df)
load_data(transformed_products_df, transformed_sales_df)
print("ETL Process Completed!")


ETL Process Completed!
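
The functions above read sales_data and db from the notebook's global scope. One possible variation, sketched here, passes every input in explicitly and logs row counts, which makes the pipeline easier to reuse and to test with a stand-in database. The name run_etl and the max_price parameter are illustrative assumptions, not part of the lab.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def run_etl(csv_source, sales_records, db, max_price=1000):
    """Run extract, transform, and load once, using only the arguments passed in."""
    # Extract: read IDs as text so both sides of the join share a dtype
    products = pd.read_csv(csv_source, dtype={"product_id": str})
    sales = pd.DataFrame(sales_records).astype({"product_id": str})
    log.info("extracted %d products and %d sales", len(products), len(sales))

    # Transform: numeric prices, price filter, and product_name enrichment
    products["price"] = pd.to_numeric(products["price"], errors="coerce")
    kept = products[products["price"] < max_price]
    enriched = sales.merge(
        products[["product_id", "product_name"]], on="product_id", how="left"
    )

    # Load: db is any object exposing products and sales collections
    db.products.insert_many(kept.to_dict(orient="records"))
    db.sales.insert_many(enriched.to_dict(orient="records"))
    log.info("loaded %d products and %d sales", len(kept), len(enriched))
    return len(kept), len(enriched)


# Example call with the lab's objects:
# run_etl('products.csv', sales_data, db)
```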


**Conclusion:**
This hands-on lab introduced the ETL process end to end: extracting raw data from multiple sources, transforming it for quality and consistency, and loading it into MongoDB.