# Ungraded Lab: Data Type Transformation Lab

## Overview 
In this hands-on lab, you'll work with the EngageMetrics employee dataset to clean and transform various data types. You'll encounter common data cleaning challenges like inconsistent date formats, mixed case categories, and currency-formatted numbers. This practical experience mirrors real-world data preparation tasks essential for accurate analysis.

As you work through this lab, remember that the lesson screencast is a valuable reference. Having the video readily available in another tab can help you tackle challenging sections more effectively.

## Learning Outcomes 
By the end of this lab, you will be able to:
- Transform date strings into standardized datetime objects
- Normalize categorical variables for consistency
- Convert currency strings to numeric values
- Handle missing values using pandas methods
- Apply data type transformations at scale

## Dataset Information 
We'll use EngageMetrics’s <b>employee_insights.csv</b> dataset, containing employee records with various fields including:
- Dates (last_training_date, last_promotion_date)
- Categorical data (department, work_mode)
- Currency values (salary)

## Activities
### Activity 1: Initial Data Exploration 

Let's first understand our data's current state.

<b>Step 1:</b> Load and examine the data:

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('employee_insights.csv')

# Examine the first few rows and data info
print(df.head())

print("\nDataset Info:")
print(df.info())

  employee_id   age  salary promotion_eligible last_training_date department  \
0       E0001  54.0     NaN                NaN         15/08/2023         HR   
1       E0002   NaN  $64761                  N         15/08/2023        NaN   
2       E0003  54.0     NaN                  N         15/08/2023  Marketing   
3       E0004   NaN     NaN                 No                NaN        NaN   
4       E0005  29.0  $61486                  Y         15/08/2023        NaN   

  work_experience  projects_completed  hours_worked_weekly    work_mode  \
0             NaN                14.0                  NaN  remote work   
1         1 years                 NaN                 53.3       HYBRID   
2               8                 6.0                 32.6       Hybrid   
3              16                 1.0                 37.8       Remote   
4             NaN                 1.0                 53.3       Hybrid   

  last_promotion_date  satisfaction_score  overtime_hours  
0       

<b>Tip:</b> Always check your data types and missing values before starting transformations.

**Step 2:** Handle missing values:

In [2]:
# Check initial missing values
print("\nMissing values before cleaning:")
print(df.isnull().sum())


Missing values before cleaning:
employee_id             0
age                    56
salary                 37
promotion_eligible     16
last_training_date     29
department             15
work_experience        29
projects_completed     52
hours_worked_weekly    33
work_mode              16
last_promotion_date    26
satisfaction_score     39
overtime_hours         30
dtype: int64


**Step 3:** **Try It Yourself:** Handle missing values in numeric and categorical columns:

In [3]:
# YOUR CODE HERE 

**Step 4:** Verify your work

In [4]:
# Verify missing values have been handled
print("\nRemaining missing values after cleaning:")
print(df.isnull().sum())


Remaining missing values after cleaning:
employee_id             0
age                    56
salary                 37
promotion_eligible     16
last_training_date     29
department             15
work_experience        29
projects_completed     52
hours_worked_weekly    33
work_mode              16
last_promotion_date    26
satisfaction_score     39
overtime_hours         30
dtype: int64


**Tip:** Use fillna() with the median for numeric columns and the mode (most frequent value) for categorical columns.
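The tip above can be tried on a small toy frame before you apply it to the lab data. This is a minimal sketch, not the lab's reference solution: the `toy` frame and its values are made up for illustration and are not taken from employee_insights.csv.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from employee_insights.csv)
toy = pd.DataFrame({
    "age": [54.0, np.nan, 29.0, 41.0],
    "department": ["HR", None, "Marketing", "HR"],
})

# Numeric columns: fill gaps with each column's median
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Remaining (text) columns: fill gaps with each column's most frequent value
text_cols = toy.columns.difference(numeric_cols)
for col in text_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
print(toy.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is one reason it is often preferred for columns such as age or hours worked.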

### Activity 2: Date Standardization

<b>Step 1:</b> Examine date formats:

In [5]:
print("Training dates sample:")
print(df['last_training_date'].head())

print("\nPromotion dates sample:")
print(df['last_promotion_date'].head())

Training dates sample:
0    15/08/2023
1    15/08/2023
2    15/08/2023
3           NaN
4    15/08/2023
Name: last_training_date, dtype: object

Promotion dates sample:
0    2022-05-10
1    05-10-2022
2    10/05/2022
3    05-10-2022
4    2022-05-10
Name: last_promotion_date, dtype: object


<b>Step 2: Try It Yourself:</b> Convert both date columns to datetime format: 

In [6]:
# YOUR CODE HERE

<b>Step 3:</b> Verify Your Work:

In [7]:
# Verify date conversion
print(df['last_training_date'].dtype)
print(df['last_promotion_date'].dtype)

object
object


### Activity 3: Salary Data Cleaning
<b>Step 1:</b> Examine salary values:

In [8]:
# Examine current salary format
print("Salary values sample:")
print(df['salary'].head())

Salary values sample:
0       NaN
1    $64761
2       NaN
3       NaN
4    $61486
Name: salary, dtype: object


<b>Step 2: Try It Yourself:</b> Clean the salary column by removing currency symbols and converting to float: 

In [9]:
# YOUR CODE HERE

<b>Step 3:</b> Verify Your Work:

In [10]:
# Verify conversion
print(df['salary'].dtype)
print(df['salary'].head())

object
0       NaN
1    $64761
2       NaN
3       NaN
4    $61486
Name: salary, dtype: object


### Activity 4: Categorical Data Normalization
<b>Step 1:</b> Examine the work_mode column values:

In [11]:
# YOUR CODE HERE 

<b>Step 2:</b> Standardize the work_mode column values so they use consistent case and format: 

In [12]:
# YOUR CODE HERE 

## Success Checklist
- All salary values are numeric (float)
- Dates are in datetime format
- work_mode values are consistent
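
The checklist can also be verified programmatically. The sketch below is one possible way to do it; the helper `check_cleaned`, the `allowed` set, and the `demo` frame are hypothetical names introduced here for illustration, and the allowed work modes are an assumption you should adjust to your own standardized labels.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_float_dtype


def check_cleaned(frame, allowed_modes):
    """Return a dict mapping each checklist item to True or False."""
    return {
        "salary is float": is_float_dtype(frame["salary"]),
        "dates are datetime": all(
            is_datetime64_any_dtype(frame[col])
            for col in ("last_training_date", "last_promotion_date")
        ),
        "work_mode is consistent": set(frame["work_mode"].dropna()) <= set(allowed_modes),
    }


# ASSUMPTION: these are the labels you standardized work_mode to
allowed = {"remote", "hybrid", "in-office"}

# Toy frame (hypothetical values) standing in for the cleaned df
demo = pd.DataFrame({
    "salary": [64761.0, 61486.0],
    "last_training_date": pd.to_datetime(["2023-08-15", "2023-08-15"]),
    "last_promotion_date": pd.to_datetime(["2022-05-10", "2022-05-10"]),
    "work_mode": ["hybrid", "remote"],
})

result = check_cleaned(demo, allowed)
print(result)
```

Once your own cleaning is done, calling `check_cleaned(df, allowed)` on the lab DataFrame should report True for every item.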

## Common Issues & Solutions 

- Problem: ValueError when converting dates
  - Solution: Check for multiple date formats and handle each separately
- Problem: NaN values after conversion
  - Solution: Check whether those values were missing in the source or failed to parse, then handle missing values appropriately
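
Handling each date format separately might look like the sketch below. It is an illustration under stated assumptions rather than the lab's reference approach: the `raw` series is toy data that mirrors the three layouts seen in last_promotion_date, and the choice of which field is the day and which is the month in each layout is a guess that should be confirmed against the data source.

```python
import pandas as pd

# Toy sample mirroring the three layouts seen in last_promotion_date
raw = pd.Series(["2022-05-10", "05-10-2022", "10/05/2022", None])

# ASSUMPTION: the day/month order for each layout is a guess;
# confirm it with whoever owns the data before relying on it.
formats = ["%Y-%m-%d", "%m-%d-%Y", "%d/%m/%Y"]

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising ValueError
attempts = [pd.to_datetime(raw, format=fmt, errors="coerce") for fmt in formats]

# Keep the first successful parse for each row
parsed = attempts[0]
for attempt in attempts[1:]:
    parsed = parsed.fillna(attempt)

# Rows that had a value but matched none of the formats
failed = raw[parsed.isna() & raw.notna()]

print(parsed)
print("Unparsed values:", failed.tolist())
```

Comparing `parsed.isna()` with `raw.notna()` separates values that were missing from the start from values that failed to parse, which helps diagnose unexpected NaN (NaT) entries after conversion.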

## Summary 
Congratulations! You've now practiced essential data type transformation techniques that data scientists use daily, including standardizing dates, normalizing categorical variables, and converting currency strings to numbers. These skills will be invaluable as you work with real-world datasets, which often come with inconsistent formatting and require careful cleaning before analysis can begin.

### Key Points
- Data type consistency is crucial for analysis
- Always validate transformations
- Handle missing values appropriately
- Document your cleaning steps

## Solution Code
Stuck on your code or want to check your solution? Here's a complete reference implementation. It is only one of several workable approaches, so try solving each activity independently first, then use it to get past obstacles or compare techniques. Happy coding!

### Activity 1: Initial Data Exploration - Solution Code

In [13]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('employee_insights.csv')

# Initial examination
print("First few rows:")
print(df.head())

print("\nDataset Info:")
print(df.info())

# Check initial missing values
print("\nMissing values before cleaning:")
print(df.isnull().sum())

# Handle missing numeric values
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Handle missing categorical values
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

# Verify missing values have been handled
print("\nRemaining missing values after cleaning:")
print(df.isnull().sum())

First few rows:
  employee_id   age  salary promotion_eligible last_training_date department  \
0       E0001  54.0     NaN                NaN         15/08/2023         HR   
1       E0002   NaN  $64761                  N         15/08/2023        NaN   
2       E0003  54.0     NaN                  N         15/08/2023  Marketing   
3       E0004   NaN     NaN                 No                NaN        NaN   
4       E0005  29.0  $61486                  Y         15/08/2023        NaN   

  work_experience  projects_completed  hours_worked_weekly    work_mode  \
0             NaN                14.0                  NaN  remote work   
1         1 years                 NaN                 53.3       HYBRID   
2               8                 6.0                 32.6       Hybrid   
3              16                 1.0                 37.8       Remote   
4             NaN                 1.0                 53.3       Hybrid   

  last_promotion_date  satisfaction_score  overtime_

### Activity 2: Date Standardization - Solution Code

In [14]:
# Examine date formats
print("Training dates sample:")
print(df['last_training_date'].head())

print("\nPromotion dates sample:")
print(df['last_promotion_date'].head())

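# Convert dates to datetime. format='mixed' (pandas 2.0+) infers the format of each value;
# ambiguous strings such as 05-10-2022 are read month-first unless dayfirst=True is passed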
df['last_training_date'] = pd.to_datetime(df['last_training_date'], format='mixed')
df['last_promotion_date'] = pd.to_datetime(df['last_promotion_date'], format='mixed')

# Verify conversion
print("\nUpdated data types:")
print("Training date dtype:", df['last_training_date'].dtype)
print("Promotion date dtype:", df['last_promotion_date'].dtype)

Training dates sample:
0    15/08/2023
1    15/08/2023
2    15/08/2023
3    15/08/2023
4    15/08/2023
Name: last_training_date, dtype: object

Promotion dates sample:
0    2022-05-10
1    05-10-2022
2    10/05/2022
3    05-10-2022
4    2022-05-10
Name: last_promotion_date, dtype: object

Updated data types:
Training date dtype: datetime64[ns]
Promotion date dtype: datetime64[ns]


### Activity 3: Salary Data Cleaning - Solution Code

In [15]:
# Examine current salary format
print("Original salary values:")
print(df['salary'].head())

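# Clean salary column: remove '$' and ',' as literal characters, then cast to float
# (regex=False keeps '$' from being treated as a regex end-of-string anchor on older pandas)
df['salary'] = df['salary'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)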

# Verify conversion
print("\nCleaned salary values:")
print(df['salary'].head())
print("Salary dtype:", df['salary'].dtype)

Original salary values:
0    $104328
1     $64761
2    $104328
3    $104328
4     $61486
Name: salary, dtype: object

Cleaned salary values:
0    104328.0
1     64761.0
2    104328.0
3    104328.0
4     61486.0
Name: salary, dtype: float64
Salary dtype: float64
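
If the salary column may still contain missing or malformed entries when you clean it, a more forgiving variant is sketched below. It swaps astype(float) for pd.to_numeric with errors="coerce", which is a different technique from the solution above. The `raw_salary` series is toy data; the comma-formatted value and the "n/a" entry are hypothetical and do not come from the lab dataset.

```python
import pandas as pd

# Toy data: the comma-formatted and "n/a" entries are hypothetical
raw_salary = pd.Series(["$64761", "$61,486", None, "n/a"])

# Strip currency symbols and thousands separators as literal characters
stripped = (
    raw_salary
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" converts anything non-numeric to NaN instead of raising
salary = pd.to_numeric(stripped, errors="coerce")

print(salary)
print("Could not convert:", raw_salary[salary.isna() & raw_salary.notna()].tolist())
```

Printing the values that could not be converted makes it easier to decide whether to fix them by hand or treat them as missing.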


### Activity 4: Categorical Data Normalization - Solution Code 

In [16]:
# Examine work_mode values
print("Original work_mode values:")
print(df['work_mode'].value_counts())

# Standardize work_mode values
df['work_mode'] = df['work_mode'].str.strip().str.lower()

# Group similar values
df['work_mode'] = df['work_mode'].replace({
    'remote': 'remote',
    'remote work': 'remote',
    'on-site': 'in-office',
    'hybrid': 'hybrid'
})

# Verify standardization
print("\nStandardized work_mode values:")
print(df['work_mode'].value_counts())

# Final Verification
print("\nFinal Data Types:")
print(df.dtypes)

Original work_mode values:
work_mode
HYBRID         35
remote work    19
On-site        17
Remote         15
Hybrid         14
Name: count, dtype: int64

Standardized work_mode values:
work_mode
hybrid       49
remote       34
in-office    17
Name: count, dtype: int64

Final Data Types:
employee_id                    object
age                           float64
salary                        float64
promotion_eligible             object
last_training_date     datetime64[ns]
department                     object
work_experience                object
projects_completed            float64
hours_worked_weekly           float64
work_mode                      object
last_promotion_date    datetime64[ns]
satisfaction_score            float64
overtime_hours                float64
dtype: object
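
One design choice worth noting: replace() leaves any value that is not in the mapping unchanged, so an unexpected label can slip through unnoticed. The sketch below uses map() instead, which turns unmapped values into NaN so they can be listed and reviewed. It is an alternative illustration, not the lab's solution; the `modes` series is toy data and the "onsite" spelling is a hypothetical value added to show the check.

```python
import pandas as pd

# Toy data; "onsite" is a hypothetical spelling not covered by the mapping
modes = pd.Series(["HYBRID", " remote work", "On-site", "Remote", "Hybrid", "onsite"])

# Same mapping as the solution above
canonical = {
    "remote": "remote",
    "remote work": "remote",
    "on-site": "in-office",
    "hybrid": "hybrid",
}

# Normalize whitespace and case first, then map to canonical labels
normalized = modes.str.strip().str.lower()
mapped = normalized.map(canonical)  # values missing from the mapping become NaN

# List labels that had a value but no entry in the mapping
unmapped = normalized[mapped.isna() & normalized.notna()].unique()

print(mapped.value_counts(dropna=False))
print("Unmapped labels:", list(unmapped))
```

Any label reported as unmapped can then be added to the mapping or investigated in the source data.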
