# Phase 2: Data Loading & Cleaning

**Objective:**  
Load the three source datasets (online retail transactions, support tickets, and telco customer churn) and clean them so that analysis and modeling in later phases run on consistent data.

**Why This Matters:**  
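Missing identifiers, unparsed dates, and non-numeric charge fields would otherwise distort aggregates and break model training, so reliable ML models and actionable business insights depend on fixing them here.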

**Success Metrics:**  
- No missing customer identifiers (`Customer ID` in retail, `Customer Email` in support, `customerID` in telco)  
- Date columns properly formatted  
- Feature columns created (e.g., TotalAmount)  
- Ready-to-use datasets for SQL analysis and ML modeling
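
The success metrics above can be turned into explicit checks. The following is a minimal sketch, not part of the original notebook: `check_cleaned` is a hypothetical helper, and the toy frame only stands in for the cleaned retail data.

```python
import pandas as pd


def check_cleaned(df, id_col, date_cols=(), required_cols=()):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Metric 1: no missing identifiers
    if df[id_col].isna().any():
        problems.append(f"{id_col} has missing values")
    # Metric 2: date columns are real datetime columns, not strings
    for col in date_cols:
        if not pd.api.types.is_datetime64_any_dtype(df[col]):
            problems.append(f"{col} is not a datetime column")
    # Metric 3: engineered feature columns exist
    for col in required_cols:
        if col not in df.columns:
            problems.append(f"{col} is missing")
    return problems


# Toy frame standing in for the cleaned retail data (illustrative values only)
toy = pd.DataFrame({
    "Customer ID": [13085.0, 13085.0],
    "InvoiceDate": pd.to_datetime(["2009-12-01 07:45:00", "2009-12-01 07:45:00"]),
    "Quantity": [12, 48],
    "Price": [6.95, 2.10],
})
toy["TotalAmount"] = toy["Quantity"] * toy["Price"]

problems = check_cleaned(
    toy, "Customer ID", date_cols=["InvoiceDate"], required_cols=["TotalAmount"]
)
print(problems)
```

Running the same helper on each cleaned frame at the end of the notebook would make a failed metric visible instead of silent.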

### 1️⃣ Import Libraries

In [1]:
import pandas as pd
import numpy as np

### 2️⃣ Load Datasets

In [2]:
retail = pd.read_csv("../data/online_retail_II.csv")       # Retail data
support = pd.read_csv("../data/support_tickets.csv")       # Support tickets
telco = pd.read_csv("../data/telco_customer_churn.csv")    # Telco churn 

### 3️⃣ Check Column Names

In [3]:
print("Retail Columns:", retail.columns)
print("Support Columns:", support.columns)
print("Telco Columns:", telco.columns)

Retail Columns: Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')
Support Columns: Index(['Ticket ID', 'Customer Name', 'Customer Email', 'Customer Age',
       'Customer Gender', 'Product Purchased', 'Date of Purchase',
       'Ticket Type', 'Ticket Subject', 'Ticket Description', 'Ticket Status',
       'Resolution', 'Ticket Priority', 'Ticket Channel',
       'First Response Time', 'Time to Resolution',
       'Customer Satisfaction Rating'],
      dtype='object')
Telco Columns: Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')


### 4️⃣ Clean Retail Data

In [4]:
# Remove rows without Customer ID
retail = retail.dropna(subset=["Customer ID"]).copy()

# Convert InvoiceDate to datetime
retail["InvoiceDate"] = pd.to_datetime(retail["InvoiceDate"], errors='coerce')

# Create TotalAmount column
retail["TotalAmount"] = retail["Quantity"] * retail["Price"]

# Drop rows with a negative or zero TotalAmount (e.g. returned or zero-priced items)
retail = retail[retail["TotalAmount"] > 0].copy()

In [5]:
print("Retail data cleaned")
display(retail.head())

Retail data cleaned


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | TotalAmount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
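
Before discarding rows with a non-positive `TotalAmount`, it can help to see what the filter removes. The sketch below uses made-up rows, and it assumes that invoice numbers beginning with `C` mark cancellations in this dataset; verify that against the dataset's documentation before relying on it.

```python
import pandas as pd

# Made-up rows: one normal sale, one assumed cancellation, one zero-priced line
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489450"],
    "Quantity": [12, -12, 5],
    "Price": [6.95, 2.95, 0.0],
})
toy["TotalAmount"] = toy["Quantity"] * toy["Price"]

# Same condition as the cleaning step, kept as a mask so removed rows can be inspected
keep = toy["TotalAmount"] > 0
removed = toy[~keep]

# Assumption: a leading "C" in Invoice denotes a cancellation
n_cancellations = int(removed["Invoice"].astype(str).str.startswith("C").sum())

print(f"kept {int(keep.sum())} rows, removed {len(removed)} "
      f"({n_cancellations} flagged as cancellations)")
```

Keeping the mask separate from the filter makes it easy to log how many rows were dropped and why.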


### 5️⃣ Clean Support Data

In [6]:
# Customer Email serves as the customer identifier, so drop rows without it
support = support.dropna(subset=["Customer Email"]).copy()

# Convert Date of Purchase to datetime
support["Date of Purchase"] = pd.to_datetime(support["Date of Purchase"], errors='coerce')

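# Fill missing Priority / Status with defaults.
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which is deprecated chained assignment in recent pandas versions.
if "Ticket Priority" in support.columns:
    support["Ticket Priority"] = support["Ticket Priority"].fillna("Medium")
if "Ticket Status" in support.columns:
    support["Ticket Status"] = support["Ticket Status"].fillna("Open")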

In [7]:
print("Support data cleaned")
display(support.head())

Support data cleaned


| | Ticket ID | Customer Name | Customer Email | Customer Age | Customer Gender | Product Purchased | Date of Purchase | Ticket Type | Ticket Subject | Ticket Description | Ticket Status | Resolution | Ticket Priority | Ticket Channel | First Response Time | Time to Resolution | Customer Satisfaction Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Marisa Obrien | carrollallison@example.com | 32 | Other | GoPro Hero | 2021-03-22 | Technical issue | Product setup | I'm having an issue with the {product_purchase... | Pending Customer Response | | Critical | Social media | 2023-06-01 12:15:36 | | |
| 1 | 2 | Jessica Rios | clarkeashley@example.com | 42 | Female | LG Smart TV | 2021-05-22 | Technical issue | Peripheral compatibility | I'm having an issue with the {product_purchase... | Pending Customer Response | | Critical | Chat | 2023-06-01 16:45:38 | | |
| 2 | 3 | Christopher Robbins | gonzalestracy@example.com | 48 | Other | Dell XPS | 2020-07-14 | Technical issue | Network problem | I'm facing a problem with my {product_purchase... | Closed | Case maybe show recently my computer follow. | Low | Social media | 2023-06-01 11:14:38 | 2023-06-01 18:05:38 | 3.0 |
| 3 | 4 | Christina Dillon | bradleyolson@example.org | 27 | Female | Microsoft Office | 2020-11-13 | Billing inquiry | Account access | I'm having an issue with the {product_purchase... | Closed | Try capital clearly never color toward story. | Low | Social media | 2023-06-01 07:29:40 | 2023-06-01 01:57:40 | 3.0 |
| 4 | 5 | Alexander Carroll | bradleymark@example.com | 67 | Female | Autodesk AutoCAD | 2020-02-04 | Billing inquiry | Data loss | I'm having an issue with the {product_purchase... | Closed | West decision evidence bit. | Low | Email | 2023-06-01 00:12:42 | 2023-06-01 19:53:42 | 1.0 |
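
The default-filling step can also be written as a single `DataFrame.fillna` call with a per-column mapping. The sketch below runs on a made-up frame and additionally counts tickets per email, since an email identifies a customer but is not necessarily unique per row; whether emails repeat in the real data is not shown in this notebook.

```python
import pandas as pd

# Made-up tickets; the same email appears twice on purpose
toy = pd.DataFrame({
    "Customer Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Ticket Priority": ["Critical", None, "Low"],
    "Ticket Status": [None, "Closed", "Open"],
})

# Per-column defaults; columns not listed here are left untouched
defaults = {"Ticket Priority": "Medium", "Ticket Status": "Open"}
toy = toy.fillna(value=defaults)

# How many tickets each customer (email) has
tickets_per_customer = toy["Customer Email"].value_counts()

print(toy)
print(tickets_per_customer)
```

If `tickets_per_customer` shows values above 1 on the real data, later joins on `Customer Email` will be one-to-many rather than one-to-one.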


### 6️⃣ Clean Telco Data

In [8]:
# Remove rows without customerID
telco = telco.dropna(subset=["customerID"]).copy()

# Convert TotalCharges to numeric; non-numeric entries become NaN and are dropped
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors='coerce')
telco = telco.dropna(subset=["TotalCharges"]).copy()

# Create a numeric churn flag (1 = churned, 0 = retained)
telco["Churn_flag"] = telco["Churn"].map({"Yes":1, "No":0})

# Convert tenure to numeric
telco["tenure"] = pd.to_numeric(telco["tenure"], errors='coerce')

In [9]:
print("Telco data cleaned")
display(telco.head())

Telco data cleaned


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Churn_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | 0 |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 0 |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 |
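
Because `errors='coerce'` turns every unparseable value into `NaN` without raising, it is worth counting how many values were coerced before dropping them. The following sketch uses made-up rows (the IDs and the blank `TotalCharges` entry are illustrative assumptions, not taken from the real file).

```python
import pandas as pd

# Made-up rows; the second TotalCharges value is a blank string
toy = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "No", "Yes"],
})

converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Values that were present as text but could not be parsed as numbers
n_unparseable = int(converted.isna().sum() - toy["TotalCharges"].isna().sum())
print(f"{n_unparseable} TotalCharges value(s) could not be parsed")

clean = toy.assign(TotalCharges=converted).dropna(subset=["TotalCharges"]).copy()

# map() yields NaN for any label outside the mapping, so check for that too
clean["Churn_flag"] = clean["Churn"].map({"Yes": 1, "No": 0})
n_unmapped = int(clean["Churn_flag"].isna().sum())
print(f"{n_unmapped} Churn label(s) were not 'Yes' or 'No'")
```

The same two counts on the real data would document exactly how many rows the cleaning step removes and confirm that `Churn` only contains the expected labels.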


In [10]:
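import os

# Create the output folder if it doesn't exist.
# Note: "data" is relative to the notebook's working directory,
# unlike the "../data" folder the raw files were read from.
os.makedirs("data", exist_ok=True)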

# Now save cleaned CSVs
retail.to_csv("data/retail_cleaned.csv", index=False)
support.to_csv("data/support_cleaned.csv", index=False)
telco.to_csv("data/telco_cleaned.csv", index=False)

print("Cleaned CSVs saved successfully!")


Cleaned CSVs saved successfully!
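
CSV files do not store dtypes, so a datetime column written with `to_csv` comes back as text unless it is parsed again on load. The sketch below demonstrates this round trip on a one-row toy frame in a temporary directory; the file name mirrors the one used above, while everything else is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["489434"],
    "InvoiceDate": pd.to_datetime(["2009-12-01 07:45:00"]),
    "TotalAmount": [83.4],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    # Equivalent to os.makedirs(..., exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / "retail_cleaned.csv"
    toy.to_csv(path, index=False)

    # Reload once without and once with date parsing
    plain = pd.read_csv(path)
    parsed = pd.read_csv(path, parse_dates=["InvoiceDate"])

plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain["InvoiceDate"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["InvoiceDate"])
print("without parse_dates:", plain_is_datetime)
print("with parse_dates:   ", parsed_is_datetime)
```

Later phases that reload the cleaned CSVs would need `parse_dates` (or a second `pd.to_datetime` call) to keep the "date columns properly formatted" metric true.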


### 7️⃣ Summary Info

In [11]:
print("\n--- Dataset Info ---")
print("Retail:", retail.shape)
print("Support:", support.shape)
print("Telco:", telco.shape)


--- Dataset Info ---
Retail: (805549, 9)
Support: (8469, 17)
Telco: (7032, 22)
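
The shapes above describe only the cleaned data; the raw row counts are not printed anywhere in this notebook. A small helper like the hypothetical `row_loss` below could record both, shown here on a toy frame with made-up values.

```python
import pandas as pd


def row_loss(raw, cleaned):
    """Summarise how many rows a cleaning step removed."""
    dropped = len(raw) - len(cleaned)
    pct = round(100 * dropped / len(raw), 2) if len(raw) else 0.0
    return {
        "raw": len(raw),
        "cleaned": len(cleaned),
        "dropped": dropped,
        "dropped_pct": pct,
    }


# Toy example: two of four rows lack a Customer ID
raw = pd.DataFrame({"Customer ID": [13085.0, None, 13078.0, None]})
cleaned = raw.dropna(subset=["Customer ID"])

report = row_loss(raw, cleaned)
print(report)
```

To use it in this notebook, a copy of each frame's length would have to be captured right after loading, before any rows are dropped.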
