# Step 1: Extract

In [1]:
import pandas as pd

## Extract data from CSV

In [3]:
productsDF = pd.read_csv("products_data.csv")
ordersDF = pd.read_csv("orders_data.csv")
usersDF = pd.read_csv("users_data.csv")

## Extract data from XLSX

In [4]:
paymentsDF = pd.read_excel("payments_data.xlsx")
reviewsDF = pd.read_excel("reviews_data.xlsx")

# Step 2: Transform

## Fill missing values

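Replace missing values with explicit defaults, column by column, so that later steps do not have to handle `NaN`.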

In [5]:
productsDF = productsDF.fillna({
    "Description": "No description available",
    "Category": "Miscellaneous",
})

ordersDF = ordersDF.fillna({
    "TotalAmount": 0.0
})

paymentsDF = paymentsDF.fillna({
    "PaymentMethod": "Unknown",
    "Amount": 0.0,
    "PaymentDate": pd.Timestamp("1970-01-01")
})

In [6]:
paymentsDF

| Unnamed: 0 | OrderID | PaymentMethod | PaymentDate | Amount |
|---|---|---|---|---|
| 0 | 1001 | InvalidMethod | 1970-01-01 | 0.0 |
| 1 | 1002 | Credit Card | 2024-08-02 | 753.48 |
| 2 | 1003 | Credit Card | 2024-08-03 | 978.27 |
| 3 | 1004 | InvalidMethod | 2024-08-04 | 564.67 |
| 4 | 1005 | Credit Card | 2024-08-05 | 390.66 |
| 5 | 1006 | Credit Card | 1970-01-01 | 815.67 |
| 6 | 1007 | InvalidMethod | 2024-08-07 | 343.75 |
| 7 | 1008 | Credit Card | 2024-08-08 | 495.07 |
| 8 | 1009 | Credit Card | 2024-08-09 | 170.61 |
| 9 | 1010 | InvalidMethod | 2024-08-10 | 122.82 |


## Handle invalid emails

Replace invalid email addresses in the *Users* DataFrame with generated placeholder addresses.

In [7]:
def generate_unique_email(index):
    return f"unknown{index}@example.com"

In [8]:
usersDF["Email"] = usersDF.apply(lambda row: row["Email"] if "@" in row["Email"] and "." in row["Email"] else generate_unique_email(row.name), axis=1)

- `usersDF.apply(..., axis=1)`:<br>
Applies a function row-wise across `usersDF`; each row is passed to the `lambda` function.<br>
- `lambda row: ...`:<br>
An anonymous function that takes one row as input.<br>
- `row["Email"] if "@" in row["Email"] and "." in row["Email"]`:<br>
Keeps the existing email unchanged when it contains both "@" and ".". This is only a rough plausibility check, not full email validation, and it assumes every value is a string: a missing value (`NaN`) would raise a `TypeError` on the `in` test.<br>
- `else generate_unique_email(row.name)`:<br>
If either character is missing, calls `generate_unique_email()` with the row's index label (`row.name`) to build a placeholder address.<br>
So the logic is essentially:
```
if "@" in email and "." in email:
    keep the email
else:
    generate a new one
```
- `usersDF["Email"] = ...`:<br>
The result of `apply` is assigned back to the *Email* column, replacing invalid emails in place.
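
The same keep-or-replace rule can also be written with vectorized string methods, which avoids the row-wise `apply` and tolerates missing values. This is a minimal sketch, not part of the original notebook: the helper name `clean_emails` and the sample data are assumptions made for illustration.

```python
import pandas as pd


def clean_emails(emails: pd.Series) -> pd.Series:
    """Keep plausible emails; replace the rest with index-based placeholders.

    Hypothetical helper: applies the same "@" and "." rule as the lambda.
    """
    # Treat missing values as empty strings so the string checks cannot fail.
    as_text = emails.fillna("").astype(str)
    looks_valid = (
        as_text.str.contains("@", regex=False)
        & as_text.str.contains(".", regex=False)
    )
    # One placeholder per row, built from the row's index label.
    placeholders = pd.Series(
        [f"unknown{idx}@example.com" for idx in emails.index],
        index=emails.index,
    )
    # Keep the original where the check passes, otherwise use the placeholder.
    return as_text.where(looks_valid, placeholders)


# Made-up sample data covering a valid, two invalid, and a missing email.
sample = pd.DataFrame(
    {"Email": ["alice@example.com", "bob_at_example", None, "carol@site"]}
)
sample["Email"] = clean_emails(sample["Email"])
print(sample["Email"].tolist())
```

Only the first address passes the check here; the other three rows receive `unknown1@example.com`, `unknown2@example.com`, and `unknown3@example.com`.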

## Ensure all prices are numeric

In [9]:
productsDF["Price"] = pd.to_numeric(productsDF["Price"], errors="coerce").fillna(0.0).round(2)

## Ensure all stock quantities are numeric

In [10]:
productsDF["StockQuantity"] = pd.to_numeric(productsDF["StockQuantity"], errors="coerce").fillna(0).astype(int)
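
To see what the two conversions above do to bad input, here is a small sketch on made-up values (the series `raw_prices` and `raw_stock` are illustrative assumptions, not data from the source files). `errors="coerce"` turns unparseable entries into `NaN`, which `fillna` then replaces with the default.

```python
import pandas as pd

# Made-up raw values: a numeric string, a non-numeric string, a missing value,
# and a value that is already a float.
raw_prices = pd.Series(["230.234", "n/a", None, 56.75])
prices = pd.to_numeric(raw_prices, errors="coerce").fillna(0.0).round(2)

# Same idea for stock quantities, with a final cast to integers.
raw_stock = pd.Series(["160", "many", None, "12.7"])
stock = pd.to_numeric(raw_stock, errors="coerce").fillna(0).astype(int)

print(prices.tolist())
print(stock.tolist())
```

Two things are worth noting. Unparseable and missing entries both end up as 0, so they can no longer be told apart from a genuine zero. And `astype(int)` truncates instead of rounding, so a fractional quantity such as "12.7" becomes 12.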

In [11]:
productsDF

| Unnamed: 0 | ProductName | Description | Price | StockQuantity | Category |
|---|---|---|---|---|---|
| 0 | Contraption A | No description available | 0.0 | 0 | Miscellaneous |
| 1 | Tool B | A high-quality tool. | 230.23 | 160 | Apparatus |
| 2 | Machine C | A high-quality apparatus. | 56.75 | 96 | Apparatus |
| 3 | Device D | No description available | 191.7 | 472 | Widgets |
| 4 | Contraption E | A high-quality gadget. | 337.73 | 233 | Instruments |
| 5 | Appliance F | A high-quality appliance. | 336.3 | 180 | Miscellaneous |
| 6 | Gizmo G | No description available | 299.74 | 113 | Apparatus |
| 7 | Contraption H | A high-quality gadget. | 144.61 | 318 | Gizmos |
| 8 | Machine I | A high-quality appliance. | 285.01 | 0 | Tools |
| 9 | Device J | No description available | 197.63 | 497 | Tools |


## Standardize and convert date formats to string

In [12]:
ordersDF["OrderDate"] = pd.to_datetime(ordersDF["OrderDate"],
                                       errors="coerce",
                                       format="%Y-%m-%d").fillna(pd.Timestamp("1970-01-01"))
ordersDF["OrderDate"] = ordersDF["OrderDate"].dt.strftime("%Y-%m-%d")

paymentsDF["PaymentDate"] = pd.to_datetime(paymentsDF["PaymentDate"],
                                       errors="coerce",
                                       format="%Y-%m-%d").fillna(pd.Timestamp("1970-01-01"))
paymentsDF["PaymentDate"] = paymentsDF["PaymentDate"].dt.strftime("%Y-%m-%d")

1. `pd.to_datetime(...)`<br>
Converts strings or other date-like objects into pandas `datetime` values. This matters because:
    - It gives the column a single datetime type instead of free-form strings.
    - It enables datetime operations such as filtering by date, extracting year/month/day, and sorting chronologically.

    Because `format="%Y-%m-%d"` is passed, only strings in that layout are parsed. `errors="coerce"` turns everything else into `NaT` (Not a Time), which `.fillna(...)` then replaces with the placeholder date 1970-01-01.<br>

2. `.dt.strftime("%Y-%m-%d")`<br>
Formats the datetime values back into strings in one fixed format (YYYY-MM-DD). This is useful when:
    - You want to export or display the dates in a human-readable format.
    - **You need to ensure consistent string formatting for downstream systems or reports**.<br>

Why not just use one?<br>
If you skip `to_datetime()` and go straight to formatting, you risk:
- Misinterpreting ambiguous date strings (e.g., '01-02-2023' could be January 2 or February 1, depending on the convention used).
- Losing the ability to handle invalid dates cleanly.
- Not being able to use datetime-specific operations before formatting.
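
The parse, fill, and format sequence can be traced on a few made-up values. This is a sketch under assumptions: the series `raw_dates` and the `was_invalid` mask are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Made-up inputs: one valid date, one in a different layout,
# one non-date string, and one missing value.
raw_dates = pd.Series(["2024-08-01", "08/02/2024", "not a date", None])

# Step 1: parse strictly as YYYY-MM-DD; anything else becomes NaT.
parsed = pd.to_datetime(raw_dates, errors="coerce", format="%Y-%m-%d")

# Remember which rows failed before the placeholder hides them.
was_invalid = parsed.isna()

# Step 2: fill the gaps with the placeholder date, then format as strings.
cleaned = parsed.fillna(pd.Timestamp("1970-01-01")).dt.strftime("%Y-%m-%d")

print(cleaned.tolist())
print(was_invalid.tolist())
```

Only the first value survives parsing; the other three come out as "1970-01-01". Keeping a mask such as `was_invalid` is one way to distinguish placeholder dates from real ones later, since the placeholder itself is a valid date.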

# Step 3: Load

## Connect to MySQL database

In [13]:
import mysql.connector

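# Connection settings for a local MySQL server; avoid committing real credentials.
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="MySql",
    database="ecommerce_db"
)
# The cursor is used to execute SQL statements on this connection.
cursor = conn.cursor()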