# 🧹 Data Cleaning - Master Script

**Purpose:** Clean and standardize Gamezone orders dataset for analysis  
**Input:** `gamezone_orders_data.csv`  
**Output:** `gamezone_orders_data_cleaned.csv`  
**Date:** 2025-11-06  
**Analyst:** Shaifali

---

## 📋 Cleaning Overview

This notebook performs the following transformations:

1. ✅ Standardize column names (lowercase, remove spaces)
2. ✅ Fix data types (dates, numerics, categories)
3. ✅ Handle missing values (per issue log)
4. ✅ Standardize formats (country codes, product names)
5. ✅ Create derived fields (time components, flags)
6. ✅ Flag data quality issues (invalid ship dates)
7. ✅ Validate output

**All cleaning decisions are documented below with rationale.**

In [36]:
import pandas as pd 
import numpy as np

# Read every column as text so IDs keep their exact form; only empty cells are treated as missing
df = pd.read_csv(r"C:\Users\shaif\OneDrive\Desktop\gamezone_orders_data.csv", sep=",", encoding="utf-8", keep_default_na=False, na_values=[""], dtype=str)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21864 entries, 0 to 21863
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   USER_ID                          21864 non-null  object
 1   ORDER_ID                         21864 non-null  object
 2   PURCHASE_TS                      21864 non-null  object
 3   PURCHASE_TS_CLEANED              21863 non-null  object
 4   PURCHASE_YEAR                    21863 non-null  object
 5   PURCHASE_MONTH                   21863 non-null  object
 6   TIME_TO_SHIP                     21863 non-null  object
 7   SHIP_TS                          21864 non-null  object
 8   PRODUCT_NAME                     21864 non-null  object
 9   PRODUCT_NAME_CLEANED             21864 non-null  object
 10  PRODUCT_ID                       21864 non-null  object
 11   USD_PRICE                       21859 non-null  object
 12  PURCHASE_PLATFORM               

## 🔧 Cleaning Steps

## Initial Issues Explored in MS Excel


### Issue #1️⃣: User ID Column Auto-Converted to Scientific Notation in Excel
**Table:** `orders`  
**Column:** `user_id`  
**Row Count:** 36  
**Magnitude:** 0.16%  
**Solvable?:** Yes  
**Resolution:** Converted column to text format.
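
Issue #1 was resolved in Excel. A similar silent conversion can happen when loading a file into pandas, and reading the column as text avoids it. A minimal sketch with invented ID values (not taken from the dataset):

```python
import io

import pandas as pd

# Invented IDs that happen to look like scientific notation
csv_text = "USER_ID\n1e5\n2e3\n"

# Default type inference turns these IDs into floats
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps each value exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["USER_ID"].dtype)
print(as_text["USER_ID"].tolist())
```

This is the reason the load cell at the top of the notebook passes `dtype=str`.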



### Issue #2️⃣: Inconsistent date formats
**Table:** `orders`  
**Column:** `purchase_ts`  
**Row Count:** 10  
**Magnitude:** 0.05%  
**Solvable?:** Yes  
**Resolution:** Extracted date component from timestamp field.



### Issue #3️⃣: Incomplete data
**Table:** `orders`  
**Column:** `purchase_ts`  
**Row Count:** 1  
**Magnitude:** 0.00%  
**Solvable?:** No  
**Resolution:** Left as is.



### Issue #4️⃣: Inconsistent / Misspelled product name
**Table:** `orders`  
**Column:** `product_name`  
**Row Count:** 61  
**Magnitude:** 0.28%  
**Solvable?:** Yes  
**Resolution:** Renamed the product.



### Issue #5️⃣: $0 transactions
**Table:** `orders`  
**Column:** `usd_price`  
**Row Count:** 29  
**Magnitude:** 0.13%  
**Solvable?:** No  
**Resolution:** No reference available; to be validated with the team.



### Issue #6️⃣: Missing transactions
**Table:** `orders`  
**Column:** `usd_price`  
**Row Count:** 5  
**Magnitude:** 0.02%  
**Solvable?:** No  
**Resolution:** No reference available; to be validated with the team.



### Issue #7️⃣: Missing marketing channels
**Table:** `orders`  
**Column:** `marketing_channel`  
**Row Count:** 83  
**Magnitude:** 0.38%  
**Solvable?:** Yes  
**Resolution:** Replaced missing values with 'Unknown'.



### Issue #8️⃣: Missing account creation method - same count as marketing channel?
**Table:** `orders`  
**Column:** `account_creation_method`  
**Row Count:** 83  
**Magnitude:** 0.38%  
**Solvable?:** Yes  
**Resolution:** Replaced missing values with 'Unknown'.
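
Issues #7 and #8 share the same fix and the same row count. The sketch below shows how that fix might look in pandas, and how to check whether the two fields are missing on the same rows. The sample rows are invented; only the column names come from the issue log.

```python
import pandas as pd

# Invented sample; the second row is missing both fields
sample = pd.DataFrame({
    "marketing_channel": ["direct", None, "email"],
    "account_creation_method": ["desktop", None, "mobile"],
})
cols = ["marketing_channel", "account_creation_method"]

# Count rows where both fields are missing together
both_missing = int(sample[cols].isna().all(axis=1).sum())

# Replace missing values with an explicit 'Unknown' label
sample[cols] = sample[cols].fillna("Unknown")

print(both_missing)
print(sample)
```

If `both_missing` equals the count for each column, the two gaps describe the same set of orders.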



### Issue #9️⃣: Missing countries
**Table:** `orders`  
**Column:** `country_code`  
**Row Count:** 37  
**Magnitude:** 0.17%  
**Solvable?:** No  
**Resolution:** No reference available for validation.



### Issue #üîü: Inconsistent and nonsensical region values
**Table:** `region`  
**Column:** `region`  
**Row Count:** 9  
**Magnitude:** 0.04%  
**Solvable?:** Yes  
**Resolution:** Filled missing region values with respective regions.



### Issue #1️⃣1️⃣: Duplicate values
**Table:** `orders`  
**Column:** `all`  
**Row Count:** 145  
**Magnitude:** 0.66%  
**Solvable?:** No  
**Resolution:** None recorded.
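
The log does not say how the 145 duplicate rows were counted. As a sketch, pandas can count fully duplicated rows in two ways that give different numbers. The sample frame is invented.

```python
import pandas as pd

# Invented sample with one order repeated
orders = pd.DataFrame({
    "order_id": ["a1", "a1", "b2", "c3"],
    "usd_price": ["10", "10", "25", "40"],
})

# keep=False marks every member of a duplicate group
n_rows_in_dupe_groups = int(orders.duplicated(keep=False).sum())

# The default (keep="first") marks only the extra copies
n_extra_copies = int(orders.duplicated().sum())

print(n_rows_in_dupe_groups, n_extra_copies)
```

Stating which of the two counts is reported makes the magnitude figure easier to reproduce.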



### Issue #1️⃣2️⃣: Shipping date < Purchase date
**Table:** `orders`  
**Column:** `ship_ts`  
**Row Count:** 2000  
**Magnitude:** 9.15%  
**Solvable?:** No  
**Resolution:** ~9% of rows have shipping dates earlier than purchase dates; flagged for business validation due to lack of reference data.



### Issue #1️⃣3️⃣: Shipping date is 300+ days from purchase date (very delayed shipping)
**Table:** `orders`  
**Column:** `ship_ts`  
**Row Count:** 2  
**Magnitude:** 0.01%  
**Solvable?:** No  
**Resolution:** Left as is.
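
Issues #12 and #13 are left in the data, so a flag column lets downstream analysis include or exclude them deliberately. A minimal sketch, assuming both timestamps are already datetimes. The flag column names are hypothetical and the sample dates are invented.

```python
import pandas as pd

# Invented sample: one normal order, one shipped "before" purchase, one very late
orders = pd.DataFrame({
    "purchase_ts_cleaned": pd.to_datetime(["2020-01-10", "2020-01-10", "2020-01-10"]),
    "ship_ts": pd.to_datetime(["2020-01-12", "2020-01-05", "2020-12-01"]),
})

# Whole days between purchase and shipment (negative when ship precedes purchase)
days_to_ship = (orders["ship_ts"] - orders["purchase_ts_cleaned"]).dt.days

# Hypothetical flag columns, not part of the original notebook
orders["ship_before_purchase"] = days_to_ship < 0
orders["ship_delay_300_plus"] = days_to_ship >= 300

print(orders)
```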



## Summary Insights

- **Total Issues Identified:** 13  
- **Solvable Issues:** 6  
- **Unsolvable / Pending Validation:** 7  
- **Highest Impact Issue:** Shipping date < purchase date (9.15%)  
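
The magnitude figures in the log are each issue's row count divided by the total number of rows (21,864). A short sketch that reproduces three of them from counts copied out of the log. The dictionary keys are labels chosen here for illustration.

```python
total_rows = 21864  # RangeIndex size reported by df.info()

# Row counts copied from the issue log
issue_rows = {
    "ship_before_purchase": 2000,
    "duplicates": 145,
    "missing_marketing_channel": 83,
}

# Share of all rows affected, as a percentage rounded to two decimals
magnitude_pct = {
    name: round(rows / total_rows * 100, 2) for name, rows in issue_rows.items()
}

print(magnitude_pct)
```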


## Data Cleaning Steps After Importing into Pandas

### 1️⃣ Standardize column names
**Issue:** Column names have inconsistent casing and spacing  
**Solution:** Convert all names to lowercase (the names already use underscores; stray whitespace is stripped in the next step)  
**Impact:** Prevents case-sensitivity errors in analysis

In [37]:
df.columns = [col.lower() for col in df.columns]

### 2️⃣ Remove extra spaces in column names
**Issue:** Some column names have trailing/leading spaces  
**Solution:** Strip all whitespace  
**Impact:** Prevents key errors when referencing columns

In [38]:
df.columns = df.columns.str.strip()
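
Steps 1 and 2 can also be written as a single chain that additionally converts inner spaces to underscores. A sketch on an invented set of headers: `" USD_PRICE"` mirrors the leading space visible in the `df.info()` output, and `"Purchase Platform "` is hypothetical.

```python
import pandas as pd

# Empty frame with messy headers, for illustration only
demo = pd.DataFrame(columns=["USER_ID", " USD_PRICE", "Purchase Platform "])

# Strip outer whitespace, lowercase, then turn inner whitespace into underscores
demo.columns = (
    demo.columns.str.strip()
    .str.lower()
    .str.replace(r"\s+", "_", regex=True)
)

print(list(demo.columns))
```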

### 3️⃣ Convert 'purchase_ts_cleaned' to datetime format
- Ensures that purchase timestamps are recognized as proper datetime objects instead of strings.
- `errors="coerce"` turns invalid dates into NaT (missing values).
- `dayfirst=True` ensures dates like 12/05/2024 are read as 12 May, not 5 Dec.

In [39]:
df['purchase_ts_cleaned'] = pd.to_datetime(df['purchase_ts_cleaned'], errors='coerce', dayfirst=True)
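
The effect of the two options can be seen on a small invented series. The actual format of `purchase_ts_cleaned` is not shown in this notebook, so it is worth confirming that the source dates really are day-first before relying on `dayfirst=True`.

```python
import pandas as pd

# Invented day-first dates plus one value that cannot be parsed
raw = pd.Series(["12/05/2024", "31/01/2024", "not a date"])

parsed = pd.to_datetime(raw, errors="coerce", dayfirst=True)

# Rows that failed to parse become NaT and can be counted
n_unparsed = int(parsed.isna().sum())

print(parsed)
print(n_unparsed)
```

Comparing the count of NaT values before and after conversion shows how many rows `errors="coerce"` silently blanked out.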

### 4️⃣ Convert 'ship_ts' to datetime format
Same logic as above: ensures shipping timestamps are valid datetimes for time calculations.

In [40]:
df['ship_ts'] = pd.to_datetime(df['ship_ts'], errors='coerce', dayfirst=True)

### 5️⃣ Clean the 'revenue' column by removing symbols
Remove '$' and commas from revenue values to make them numeric-compatible.

In [41]:
df['revenue'] = df['revenue'].replace(r'[\$,]', '', regex=True)

### 6️⃣ Convert 'revenue' column to numeric type
After removing symbols, convert all values to numeric for aggregation or calculations.
Invalid values (like text) are coerced into NaN.

In [42]:
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
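
Steps 5 and 6 together, sketched on invented values. The extra count at the end shows how many non-missing entries were coerced to NaN, which helps tell real gaps apart from values that failed to parse.

```python
import pandas as pd

# Invented values: formatted number, plain number, text, and a true gap
raw = pd.Series(["$1,234.50", "99", "n/a", None])

# Remove dollar signs and thousands separators, then convert
stripped = raw.str.replace(r"[\$,]", "", regex=True)
values = pd.to_numeric(stripped, errors="coerce")

# Entries that were present in the raw data but could not be converted
n_unparseable = int(values.isna().sum() - raw.isna().sum())

print(values.tolist())
print(n_unparseable)
```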

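### 7️⃣ Clean the 'region' column
- Convert to string type.
- Strip leading/trailing spaces.
- Replace blank strings ("") with missing values (pd.NA).

This prevents empty or whitespace-only strings from being treated as valid data.
Note that `astype(str)` converts values that are already missing into the literal text 'nan', which `isna()` does not count.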

In [43]:
df['region'] = df['region'].astype(str).str.strip().replace('', pd.NA)

### 8️⃣ Check how many missing region values exist
Helps validate how many nulls are present after cleaning.

In [44]:
df['region'].isna().sum()

0
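
A zero here is only conclusive if missing values were not turned into text along the way. The sketch below uses invented sample values. It shows a variant that applies the `.str` methods directly, so genuine gaps stay missing, and a quick check for the literal text 'nan' in a column that has already been stringified.

```python
import numpy as np
import pandas as pd

# Invented sample: one blank entry and one genuinely missing entry
region = pd.Series(["EMEA", "  ", np.nan, " APAC "], dtype=object)

# .str methods skip missing values, so they remain missing
cleaned = region.str.strip().replace("", pd.NA)
n_missing = int(cleaned.isna().sum())

# Check for stringified gaps in a column that already went through astype(str)
already_stringified = pd.Series(["EMEA", "nan", "APAC"])
n_literal_nan = int(already_stringified.str.lower().eq("nan").sum())

print(n_missing, n_literal_nan)
```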

### 9️⃣ Handle missing numeric columns ('purchase_year', 'purchase_month', 'time_to_ship')
- Fill missing values with 0 to avoid NaNs causing errors in calculations.
- Convert to integer type since these should logically be whole numbers.

In [45]:
df['purchase_year'] = df['purchase_year'].fillna(0).astype(int)

df['purchase_month'] = df['purchase_month'].fillna(0).astype(int)

df['time_to_ship'] = df['time_to_ship'].fillna(0).astype(int)
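
Filling with 0 makes the cast to `int` possible, but it also introduces a year 0 and a month 0 that will show up in minimums, averages, and group-bys. An alternative, sketched on invented values, is pandas' nullable integer type, which keeps whole numbers and leaves gaps as `<NA>`.

```python
import pandas as pd

# Values arrive as text because the file was read with dtype=str
purchase_year = pd.Series(["2020", None, "2021"], dtype=object)

# to_numeric yields floats with NaN; "Int64" (capital I) is the nullable integer dtype
year_nullable = pd.to_numeric(purchase_year, errors="coerce").astype("Int64")

# Aggregations skip the missing entry instead of treating it as year 0
earliest_year = int(year_nullable.min())

print(year_nullable)
print(earliest_year)
```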

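### 🔟 Clean the 'country_code' column
- Convert to string.
- Strip spaces.
- Convert all to uppercase for consistency (e.g., 'us' → 'US').
- Caveat: `astype(str)` turns missing values into text ('NAN' after upper-casing), so any missing country is counted as a code of its own.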

In [46]:
df['country_code'] = df['country_code'].astype(str).str.strip().str.upper()

In [47]:
df['country_code'].nunique()

152
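
When missing values are preserved, `nunique()` and `nunique(dropna=False)` differ by one. Equal counts therefore mean either that nothing is missing or that the gaps were converted to text. A sketch on invented codes that keeps gaps as missing:

```python
import numpy as np
import pandas as pd

# Invented sample with mixed case, stray spaces, and one missing code
codes = pd.Series(["us", " GB", np.nan, "US "], dtype=object)

# .str methods leave the missing entry untouched
normalized = codes.str.strip().str.upper()

n_codes = normalized.nunique()                      # distinct real codes
n_with_missing = normalized.nunique(dropna=False)   # missing counts as one extra value

print(n_codes, n_with_missing)
```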

### 1️⃣1️⃣ Clean the 'product_name_cleaned' column
- Strip extra spaces and convert all product names to lowercase.
- This standardizes the product names for grouping, deduplication, or analysis.

In [48]:
df['product_name_cleaned'] = df['product_name_cleaned'].str.strip().str.lower()

In [56]:
df['country_code'].nunique(dropna=False)

152

### 1️⃣2️⃣ Display final data structure and types
This helps confirm that all conversions were successful and datatypes are now correct.

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21864 entries, 0 to 21863
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   user_id                          21864 non-null  object        
 1   order_id                         21864 non-null  object        
 2   purchase_ts                      21864 non-null  object        
 3   purchase_ts_cleaned              21863 non-null  datetime64[ns]
 4   purchase_year                    21864 non-null  int32         
 5   purchase_month                   21864 non-null  int32         
 6   time_to_ship                     21864 non-null  int32         
 7   ship_ts                          21864 non-null  datetime64[ns]
 8   product_name                     21864 non-null  object        
 9   product_name_cleaned             21864 non-null  object        
 10  product_id                       21864 non-null  object   

In [59]:
# Export
df.to_csv(r"E:\Projects\Gamezone Orders Data\gamezone_orders_data_cleaned.csv", 
          index=False, encoding='utf-8')
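
The overview lists output validation as the final step. A minimal sketch of checks that could run just before the export follows. The function name and the specific checks are assumptions for illustration, and the demo frame is invented.

```python
import pandas as pd


def validate_cleaned(frame, expected_rows):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    untidy = [c for c in frame.columns if c != c.strip().lower()]
    if untidy:
        problems.append(f"column names not standardized: {untidy}")
    for col in ("purchase_ts_cleaned", "ship_ts"):
        if col in frame.columns and not pd.api.types.is_datetime64_any_dtype(frame[col]):
            problems.append(f"{col} is not a datetime column")
    return problems


# Tiny invented frame standing in for the cleaned orders data
demo = pd.DataFrame({
    "order_id": ["a1", "b2"],
    "purchase_ts_cleaned": pd.to_datetime(["2020-01-10", "2020-02-03"]),
    "ship_ts": pd.to_datetime(["2020-01-12", "2020-02-05"]),
})
problems = validate_cleaned(demo, expected_rows=2)
print(problems)
```

For the real data the call would be `validate_cleaned(df, expected_rows=21864)`, since no cleaning step in this notebook drops rows.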