# MLOps Data Transformation Pipeline
This notebook demonstrates key data transformation techniques commonly used in machine learning pipelines. It follows MLOps best practices for data preprocessing and feature engineering.

## Prepare Environment
### Install Dependencies
**Install the pandas library using the command below**  
The following packages are required for this data transformation pipeline:  
- pandas: Data manipulation and analysis

Note: The **!** tells Jupyter to run this as a command in your system's shell


In [203]:
print("Installing pandas")
!pip install pandas

Installing pandas


### Create Dataset
Generate a mock dataset using the `create_dataset.py` script.  
This ensures reproducible data for our transformation pipeline.  

In [173]:
!python3 create_dataset.py

### Import Required Libraries

In [174]:
import pandas as pd
import json

## 1. Data Exploration  
Load the raw dataset and perform initial data profiling. 
This step is crucial for understanding data quality and structure. 

### Step 1: Load the CSV File into the DataFrame

In [204]:
try:
    df = pd.read_csv("data/mock_data.csv")
    print("✅ Dataset loaded successfully!")
    print(f"📏 Dataset shape: {df.shape}")
except FileNotFoundError:
    print("❌ Error: mock_data.csv not found. Please run create_dataset.py first.")
    raise  # re-raise to stop the cell; exit() would shut down the notebook kernel

✅ Dataset loaded successfully!
📏 Dataset shape: (20000, 8)


### Step 2: Analyse the Data  
Perform comprehensive data analysis to understand:
- Data types and memory usage
- Missing values pattern
- Statistical distribution
- Unique values and categories

In [205]:
# Display the first 5 rows from the loaded DataFrame
print("\n📋 First 5 rows:")
df.head()


📋 First 5 rows:


|   | id | name | age | salary | hire_date | profile | department | bonus |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Name_103 | 77.0 | 60000.0 | 2020-10-06 | {"address": "Street 90, City 47", "phone": "29... | Marketing | 2303.0 |
| 1 | 2 | Name_436 | 62.0 | 50000.0 | 2020-06-03 | {"address": "Street 39, City 29", "phone": "45... | Marketing | 4574.0 |
| 2 | 3 | Name_861 | 61.0 | 60000.0 | 2022-11-03 | {"address": "Street 10, City 2", "phone": "809... | HR | 3051.0 |
| 3 | 4 | Name_271 | 36.0 | 70000.0 | 2023-08-17 | {"address": "Street 40, City 7", "phone": "505... | NaN | 5698.0 |
| 4 | 5 | Name_107 | 78.0 | 60000.0 | 2024-11-20 | {"address": "Street 94, City 24", "phone": "72... | IT | 1446.0 |


In [207]:
# Get the summary of the DataFrame
print("\n📊 Data Types & Non-Null Counts:\n")
df.info()


📊 Data Types & Non-Null Counts:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          20000 non-null  int64  
 1   name        20000 non-null  object 
 2   age         19000 non-null  float64
 3   salary      13519 non-null  float64
 4   hire_date   17991 non-null  object 
 5   profile     18026 non-null  object 
 6   department  16003 non-null  object 
 7   bonus       17995 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 1.2+ MB


In [208]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate rows: {duplicates}")


🔄 Duplicate rows: 0


In [178]:
# Check unique values in the department column
df['department'].unique()

array(['Marketing', 'HR', nan, 'IT', 'Finance'], dtype=object)

In [209]:
# View a statistical summary of all columns (numeric and non-numeric)
print("\n📈 Statistical Summary:")
df.describe(include='all')


📈 Statistical Summary:


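|   | id | name | age | salary | hire_date | profile | department | bonus |
|---|---|---|---|---|---|---|---|---|
| count | 20000.0 | 20000 | 19000.0 | 13519.0 | 17991 | 18026 | 16003 | 17995.0 |
| unique | NaN | 999 | NaN | NaN | 3624 | 18026 | 4 | NaN |
| top | NaN | Name_825 | NaN | NaN | 2017-01-15 | {"address": "Street 90, City 47", "phone": "29... | IT | NaN |
| freq | NaN | 37 | NaN | NaN | 15 | 1 | 4058 | NaN |
| mean | 10000.5 | NaN | 48.444684 | 59962.275316 | NaN | NaN | NaN | 5510.359933 |
| std | 5773.647028 | NaN | 17.892848 | 8200.588356 | NaN | NaN | NaN | 2595.005806 |
| min | 1.0 | NaN | 18.0 | 50000.0 | NaN | NaN | NaN | 1000.0 |
| 25% | 5000.75 | NaN | 33.0 | 50000.0 | NaN | NaN | NaN | 3258.5 |
| 50% | 10000.5 | NaN | 48.0 | 60000.0 | NaN | NaN | NaN | 5515.0 |
| 75% | 15000.25 | NaN | 64.0 | 70000.0 | NaN | NaN | NaN | 7764.0 |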


In [180]:
# Check for missing values
print("\n❓ Missing Values Analysis:\n")
df.isnull().sum()

id               0
name             0
age           1000
salary        6481
hire_date     2009
profile       1974
department    3997
bonus         2005
dtype: int64
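
Raw counts are easier to compare when expressed as a share of all rows. For example, the 6481 missing `salary` values out of 20000 rows are about 32.4% of the column. The following is a minimal sketch of such a report on a small made-up frame; the `sample` data and the `missing_report` name are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real notebook would use `df` instead.
sample = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "salary": [np.nan, np.nan, 60000.0, 50000.0],
})

# isnull().sum() counts missing cells; isnull().mean() gives the fraction.
missing_report = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_report)
```

Sorting by percentage puts the columns that need the most attention at the top of the report.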

## 2. Data Cleaning
### Step 1: Handle Missing Values in age and salary

In [181]:
print("Print the missing values in age, and salary\n")
print("Missing Age values")
print(df[df['age'].isnull()][['age', 'salary', 'department']])

Print the missing values in age, and salary

Missing Age values
       age   salary department
44     NaN  60000.0  Marketing
115    NaN  60000.0         IT
127    NaN      NaN  Marketing
147    NaN  60000.0         HR
164    NaN  70000.0         IT
...    ...      ...        ...
19872  NaN  60000.0         HR
19921  NaN      NaN         HR
19940  NaN  70000.0        NaN
19997  NaN  60000.0         IT
19998  NaN  60000.0  Marketing

[1000 rows x 3 columns]


In [182]:
print("Missing Salary values")
print(df[df['salary'].isnull()][['age', 'salary', 'department']])

Missing Salary values
        age  salary department
5      35.0     NaN         IT
11     61.0     NaN         IT
13     46.0     NaN        NaN
14     48.0     NaN         IT
15     61.0     NaN         HR
...     ...     ...        ...
19984  71.0     NaN        NaN
19988  72.0     NaN  Marketing
19992  60.0     NaN        NaN
19993  76.0     NaN  Marketing
19999  47.0     NaN        NaN

[6481 rows x 3 columns]


In [183]:
# Get the median values for age, and salary
age_median = df['age'].median()
salary_median = df['salary'].median()
print("Age Median", age_median)
print("Salary Median", salary_median)

Age Median 48.0
Salary Median 60000.0


In [184]:
# Fill missing values of age with age_median
df['age'] = df['age'].fillna(age_median)
# Fill missing values of salary with salary_median
df['salary'] = df['salary'].fillna(salary_median)

#### Missing values in the age and salary columns are now filled with their respective medians
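
In a pipeline it can be useful to keep the medians that were used, so the same values can be applied to new data later. The sketch below wraps the two `fillna` calls in a small helper; `fill_with_median` and the `sample` frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_median(frame, columns):
    """Return a filled copy of `frame` and the medians used per column."""
    medians = {col: frame[col].median() for col in columns}
    # Passing a dict to fillna only touches the listed columns.
    return frame.fillna(value=medians), medians

# Illustrative data only.
sample = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "salary": [50000.0, 70000.0, np.nan],
    "department": ["IT", None, "HR"],
})

filled, medians = fill_with_median(sample, ["age", "salary"])
print(medians)
print(filled)
```

The helper returns a new DataFrame and leaves the input untouched, so the original missing-value pattern stays available for later checks.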

In [185]:
# Verify the Age & Salary data
df.head()
# Check for missing values
print("Missing values in each column")
df.isnull().sum()

Missing values in each column


id               0
name             0
age              0
salary           0
hire_date     2009
profile       1974
department    3997
bonus         2005
dtype: int64

### Step 2: Handle Missing values of Department

In [186]:
print("Print the missing values for Department\n")
print("Missing Department values")
print(df[df['department'].isnull()][['age', 'salary', 'department']])

Print the missing values for Department

Missing Department values
        age   salary department
3      36.0  70000.0        NaN
13     46.0  60000.0        NaN
49     34.0  50000.0        NaN
53     33.0  60000.0        NaN
57     28.0  70000.0        NaN
...     ...      ...        ...
19973  50.0  60000.0        NaN
19975  29.0  60000.0        NaN
19984  71.0  60000.0        NaN
19992  60.0  60000.0        NaN
19999  47.0  60000.0        NaN

[3997 rows x 3 columns]


In [187]:
# Fill the missing values in department with 'Unknown'
df['department'] = df['department'].fillna('Unknown')

#### Missing values in the department column are filled with 'Unknown'

In [188]:
# Verify the department data
df.head()
# Check for missing values
print("Missing values in each column")
print(df.isnull().sum())
# Check unique values in the department column
df['department'].unique()

Missing values in each column
id               0
name             0
age              0
salary           0
hire_date     2009
profile       1974
department       0
bonus         2005
dtype: int64


array(['Marketing', 'HR', 'Unknown', 'IT', 'Finance'], dtype=object)

### Step 3: Split the Profile Column into 3 Separate Columns: Address, Phone, Email

In [189]:
print("Top rows from profile column \n")
print(df['profile'].head())

# Find the first non-null value in the column
profile_first_value = df['profile'].dropna().iloc[0]
# Print its type
print("\nProfile column values current data type")
print(type(profile_first_value))

# The type check above tells us whether parsing is needed: JSON strings (<class 'str'>)
# must be parsed with json.loads(). If the column already held Python dictionaries,
# this step could be skipped and the data used directly.

# Convert profile JSON strings into dictionaries
df['profile'] = df['profile'].apply(lambda x: json.loads(x) if pd.notnull(x) else {})

Top rows from profile column 

0    {"address": "Street 90, City 47", "phone": "29...
1    {"address": "Street 39, City 29", "phone": "45...
2    {"address": "Street 10, City 2", "phone": "809...
3    {"address": "Street 40, City 7", "phone": "505...
4    {"address": "Street 94, City 24", "phone": "72...
Name: profile, dtype: object

Profile column values current data type
<class 'str'>


In [190]:
# Extract Address Field
print("Extract Address Field....\n")
# Create new 'address' column by extracting from 'profile' dictionaries
df['address'] = df['profile'].apply(lambda x: x.get('address', None))  # Returns None if no address key

print("Top rows from profile column \n")
print(df['profile'].head())
print("\nTop rows from newly created address column \n")
print(df['address'].head())


Extract Address Field....

Top rows from profile column 

0    {'address': 'Street 90, City 47', 'phone': '29...
1    {'address': 'Street 39, City 29', 'phone': '45...
2    {'address': 'Street 10, City 2', 'phone': '809...
3    {'address': 'Street 40, City 7', 'phone': '505...
4    {'address': 'Street 94, City 24', 'phone': '72...
Name: profile, dtype: object

Top rows from newly created address column 

0    Street 90, City 47
1    Street 39, City 29
2     Street 10, City 2
3     Street 40, City 7
4    Street 94, City 24
Name: address, dtype: object


In [191]:
# Extract Phone Field
print("Extract Phone Field....\n")
# Create new 'phone' column by extracting from 'profile' dictionaries
df['phone'] = df['profile'].apply(lambda x: x.get('phone', None))  # Returns None if no phone key

print("Top rows from profile column \n")
print(df['profile'].head())
print("\nTop rows from newly created phone column \n")
print(df['phone'].head())


Extract Phone Field....

Top rows from profile column 

0    {'address': 'Street 90, City 47', 'phone': '29...
1    {'address': 'Street 39, City 29', 'phone': '45...
2    {'address': 'Street 10, City 2', 'phone': '809...
3    {'address': 'Street 40, City 7', 'phone': '505...
4    {'address': 'Street 94, City 24', 'phone': '72...
Name: profile, dtype: object

Top rows from newly created phone column 

0    2911612090
1    4598319667
2    8095777290
3    5051497497
4    7240070070
Name: phone, dtype: object


In [192]:
# Extract Email Field
print("Extract Email Field....\n")
# Create new 'email' column by extracting from 'profile' dictionaries
df['email'] = df['profile'].apply(lambda x: x.get('email', None))  # Returns None if no email key

print("Top rows from profile column \n")
print(df['profile'].head())
print("\nTop rows from newly created email column \n")
print(df['email'].head())



Extract Email Field....

Top rows from profile column 

0    {'address': 'Street 90, City 47', 'phone': '29...
1    {'address': 'Street 39, City 29', 'phone': '45...
2    {'address': 'Street 10, City 2', 'phone': '809...
3    {'address': 'Street 40, City 7', 'phone': '505...
4    {'address': 'Street 94, City 24', 'phone': '72...
Name: profile, dtype: object

Top rows from newly created email column 

0    email_631@example.com
1    email_187@example.com
2    email_567@example.com
3    email_526@example.com
4    email_857@example.com
Name: email, dtype: object
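
The three extraction cells above repeat the same pattern for each key. As an alternative sketch, the parsed dictionaries can be expanded into columns in one step by building a DataFrame from them. The `raw` series, the `parse_profile` helper, and the `fields` name are assumptions made for this example.

```python
import json
import pandas as pd

# Illustrative JSON strings shaped like the 'profile' column, with one missing value.
raw = pd.Series([
    '{"address": "Street 1, City 1", "phone": "1234567890", "email": "a@example.com"}',
    None,
    '{"address": "Street 2, City 2", "phone": "0987654321"}',
], name="profile")

def parse_profile(value):
    """Parse a JSON string into a dict; missing values become an empty dict."""
    if pd.isnull(value):
        return {}
    return json.loads(value)

parsed = raw.apply(parse_profile)

# One row per dict; keys that are absent in a row become NaN.
fields = pd.DataFrame(parsed.tolist(), index=raw.index)
fields = fields.reindex(columns=["address", "phone", "email"])

print(fields)
```

The `reindex` call fixes the column order and guarantees that all three columns exist even if a key never appears in the data.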


In [193]:
# Now drop the profile column
print("\nColumns before dropping profile:")
print(df.columns.tolist())

# Without inplace=True (df remains unchanged)
new_df = df.drop(columns=['profile'])

# With inplace=True (df is modified directly)
#df.drop(columns=['profile'], inplace=True)

print("\nColumns in new DataFrame after dropping profile:")
# print(df.columns.tolist())
print(new_df.columns.tolist())


Columns before dropping profile:
['id', 'name', 'age', 'salary', 'hire_date', 'profile', 'department', 'bonus', 'address', 'phone', 'email']

Columns in new DataFrame after dropping profile:
['id', 'name', 'age', 'salary', 'hire_date', 'department', 'bonus', 'address', 'phone', 'email']


### Step 4: Save cleaned data into new CSV

In [194]:
print("Saving cleaned data csv to: 'data/cleaned_data.csv' ...")
new_df.to_csv("data/cleaned_data.csv", index=False)
print("\nCleaned data csv saved to: 'data/cleaned_data.csv'")

Saving cleaned data csv to: 'data/cleaned_data.csv' ...

Cleaned data csv saved to: 'data/cleaned_data.csv'


## 3. Data Transformation
### Step 1: Load the cleaned dataset into new DataFrame

In [195]:
clean_df = pd.read_csv("data/cleaned_data.csv")
clean_df.head()

|   | id | name | age | salary | hire_date | department | bonus | address | phone | email |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Name_103 | 77.0 | 60000.0 | 2020-10-06 | Marketing | 2303.0 | Street 90, City 47 | 2911612000.0 | email_631@example.com |
| 1 | 2 | Name_436 | 62.0 | 50000.0 | 2020-06-03 | Marketing | 4574.0 | Street 39, City 29 | 4598320000.0 | email_187@example.com |
| 2 | 3 | Name_861 | 61.0 | 60000.0 | 2022-11-03 | HR | 3051.0 | Street 10, City 2 | 8095777000.0 | email_567@example.com |
| 3 | 4 | Name_271 | 36.0 | 70000.0 | 2023-08-17 | Unknown | 5698.0 | Street 40, City 7 | 5051497000.0 | email_526@example.com |
| 4 | 5 | Name_107 | 78.0 | 60000.0 | 2024-11-20 | IT | 1446.0 | Street 94, City 24 | 7240070000.0 | email_857@example.com |
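
Note that the `phone` column in the preview above is displayed as a floating-point number. A likely cause is that `read_csv` infers a numeric type for digit-only text and falls back to `float64` when some cells are empty. The sketch below reproduces this on a small in-memory CSV and shows one way to keep phone numbers as text; the CSV content is made up for illustration.

```python
import io
import pandas as pd

# Illustrative CSV: digit-only phone values, one empty cell, one leading zero.
csv_text = "id,phone\n1,2911612090\n2,\n3,0051497497\n"

# Default type inference: the column becomes float64 and the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["phone"].dtype)

# Forcing a string dtype keeps the digits exactly as written.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})
print(typed_df["phone"].tolist())
```

In the notebook this would correspond to passing `dtype={"phone": str}` when reading `data/cleaned_data.csv`.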


### Step 2: Verify the Address Length  

In [196]:
# Create a new column 'address_length' 
clean_df['address_length'] = clean_df['address'].apply(lambda x: len(str(x)))
print("Address followed by Address Length columns")
clean_df[['address', 'address_length']].head()

Address followed by Address Length columns


|   | address | address_length |
|---|---|---|
| 0 | Street 90, City 47 | 18 |
| 1 | Street 39, City 29 | 18 |
| 2 | Street 10, City 2 | 17 |
| 3 | Street 40, City 7 | 17 |
| 4 | Street 94, City 24 | 18 |
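
One caveat with `len(str(x))`: a missing address is converted to the string `"nan"` and reported as length 3. Rows whose profile was missing would be affected by this. The sketch below compares that approach with the `.str.len()` accessor, which leaves missing values as missing; the `addresses` series is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative addresses, including one missing value.
addresses = pd.Series(["Street 90, City 47", np.nan, "Street 10, City 2"])

# str(NaN) is "nan", so the missing row gets a misleading length of 3.
via_str = addresses.apply(lambda x: len(str(x)))

# The .str accessor propagates the missing value instead.
via_accessor = addresses.str.len()

print(pd.DataFrame({"len(str(x))": via_str, ".str.len()": via_accessor}))
```

Because the accessor result contains a missing value, pandas stores it as a float column; it can be cast with `.astype("Int64")` if nullable integers are preferred.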


### Step 3: Categorise the Salary

In [197]:
# Define the bins and labels
bins = [0, 50000, 70000, 100000]
labels = ['low', 'medium', 'high']

# Create a new column 'salary_category'
clean_df['salary_category'] = pd.cut(clean_df['salary'], bins=bins, labels=labels, include_lowest=True)

# Print sample data after adding the 'salary_category' column
print("Sample data after adding the 'salary_category' column: \n")
clean_df[['salary', 'salary_category']].head()

Sample data after adding the 'salary_category' column: 



|   | salary | salary_category |
|---|---|---|
| 0 | 60000.0 | medium |
| 1 | 50000.0 | low |
| 2 | 60000.0 | medium |
| 3 | 70000.0 | medium |
| 4 | 60000.0 | medium |
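
By default `pd.cut` treats each bin as closed on the right, which is why 50000.0 is labelled `low` and 70000.0 is labelled `medium` in the sample above. The following sketch contrasts that default with `right=False`, which closes bins on the left instead. The three salaries are illustrative values that sit exactly on or between the bin edges.

```python
import pandas as pd

salaries = pd.Series([50000.0, 60000.0, 70000.0])
bins = [0, 50000, 70000, 100000]
labels = ["low", "medium", "high"]

# Default: intervals (0, 50000], (50000, 70000], (70000, 100000];
# include_lowest=True also admits the value 0 itself.
right_closed = pd.cut(salaries, bins=bins, labels=labels, include_lowest=True)

# right=False: intervals [0, 50000), [50000, 70000), [70000, 100000).
left_closed = pd.cut(salaries, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "salary": salaries,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Which convention is appropriate depends on how the categories are defined for the business; the point here is only that values on a bin edge change category between the two.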


### Step 4: Calculate Average Salary and Age by Department

In [198]:
# Group by 'department' and calculate average salary and age
summary_report = clean_df.groupby('department').agg({
    'salary': 'mean',
    'age': 'mean'
}).reset_index()

# rename columns of summary_report for clarity
summary_report.columns = ['Department', 'Average Salary', 'Average Age']

In [199]:
# Print the Summary Report
print("Summary report of average salary and age based on the department:\n")
print(summary_report)

Summary report of average salary and age based on the department:

  Department  Average Salary  Average Age
0    Finance    59830.035515    48.345256
1         HR    60015.155342    48.620106
2         IT    60034.499754    48.650074
3  Marketing    60049.455984    48.419139
4    Unknown    59939.954966    48.075056
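
As an alternative to renaming the columns afterwards, pandas named aggregation sets the output column names inside the `agg` call. This is a minimal sketch on made-up data; the `sample` frame and the extra `headcount` column are assumptions added for illustration.

```python
import pandas as pd

# Illustrative data only.
sample = pd.DataFrame({
    "department": ["IT", "IT", "HR"],
    "salary": [50000.0, 70000.0, 60000.0],
    "age": [30.0, 40.0, 50.0],
})

# Each keyword is an output column: name=(source column, aggregation).
report = (
    sample.groupby("department")
    .agg(
        average_salary=("salary", "mean"),
        average_age=("age", "mean"),
        headcount=("salary", "size"),
    )
    .reset_index()
)

print(report)
```

Including a row count next to the averages helps judge how much data each department's mean is based on.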


### Step 5: Save the transformed DataFrame to a new csv file

In [200]:
print("Saving Transformed data csv to: 'data/transformed_data.csv' ...")
clean_df.to_csv("data/transformed_data.csv", index=False)
print("\nTransformed data csv saved to: 'data/transformed_data.csv'")

Saving Transformed data csv to: 'data/transformed_data.csv' ...

Transformed data csv saved to: 'data/transformed_data.csv'
