# Ungraded Lab: DataFrame Operations

## Overview 
In this hands-on lab, you'll work with EngageMetrics employee and education data using pandas DataFrame operations. You'll learn to merge, filter, and analyze data to gain practical insights about employee educational backgrounds and work metrics. These skills are essential for data scientists working with multiple data sources.

Need help along the way? Revisit this lesson's screencast. Doing so is common practice in professional development and helps reinforce your learning.

## Learning Outcomes 
By the end of this lab, you will be able to:
- Merge multiple datasets using pandas DataFrame operations
- Filter and group data to extract meaningful insights
- Handle common data cleaning challenges in real datasets
- Create analytical summaries using DataFrame aggregations

## Dataset Information 
We'll work with two EngageMetrics datasets:
- <b>education_data.xlsx:</b> Contains employee educational backgrounds including graduation years and fields of study
- <b>employee_insights.csv:</b> Contains employee performance metrics, satisfaction scores, and work-related information

## Activities

### Activity 1: Data Loading and Initial Exploration

<b>Step 1:</b> Import required libraries and load datasets:

In [1]:
import pandas as pd

# Load the datasets
education_df = pd.read_excel('education_data.xlsx')
employee_df = pd.read_csv('employee_insights.csv')

<b>Step 2:</b> Examine the data:

In [2]:
# Display first few rows of each dataset
print("Education Data:")
print(education_df.head())

print("\nEmployee Data:")
print(employee_df.head())

Education Data:
      ID  graduation_year   educational_background
0  E0001             2011               Psychology
1  E0002             1995             Architecture
2  E0003             2007  Business Administration
3  E0004             2000  Business Administration
4  E0005             1991                 Medicine

Employee Data:
  employee_id   age  salary promotion_eligible last_training_date department  \
0       E0001  54.0     NaN                NaN         15/08/2023         HR   
1       E0002   NaN  $64761                  N         15/08/2023        NaN   
2       E0003  54.0     NaN                  N         15/08/2023  Marketing   
3       E0004   NaN     NaN                 No                NaN        NaN   
4       E0005  29.0  $61486                  Y         15/08/2023        NaN   

  work_experience  projects_completed  hours_worked_weekly    work_mode  \
0             NaN                14.0                  NaN  remote work   
1         1 years              

<b>Tip:</b> Always check your data types and missing values after loading datasets.
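
One way to act on this tip is sketched below. Because the lab files are not bundled with this text, the sketch uses a small hand-built frame named `sample_df` (a hypothetical stand-in that copies a few values from the preview above). In the lab you would call the same methods on `employee_df` and `education_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employee_df, using values from the preview above
sample_df = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0003", "E0004", "E0005"],
    "age": [54.0, np.nan, 54.0, np.nan, 29.0],
    "salary": [np.nan, "$64761", np.nan, np.nan, "$61486"],
    "department": ["HR", np.nan, "Marketing", np.nan, np.nan],
})

# Column data types: salary is stored as text because of the "$" prefix
print(sample_df.dtypes)

# Number of missing values per column
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

Columns that look numeric but are reported as text, such as `salary` here, are usually the first candidates for cleaning.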

### Activity 2: Data Cleaning and Preparation
<b>Step 1:</b> Clean the work_mode column:

In [3]:
# Standardize work mode values
employee_df['work_mode'] = employee_df['work_mode'].str.strip().str.lower()
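
To see what this standardization does, here is a minimal sketch on a hypothetical `work_mode` Series. Only the value `remote work` appears in the preview above; the other variants are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw values; only "remote work" is taken from the preview
work_mode = pd.Series(["remote work", " Remote Work", "HYBRID ", None])

# Strip surrounding whitespace, then lowercase; missing values stay missing
cleaned = work_mode.str.strip().str.lower()
print(cleaned.tolist())

# Variants that differed only by case or spacing now share one label
print(cleaned.nunique())
```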

<b>Step 2:</b> Handle missing values:

In [4]:
# Your code here to handle missing values

### Activity 3: Merging Datasets
<b>Step 1:</b> Merge education and employee datasets:

In [5]:
# Merge on the employee identifier (employee_id in employee_df, ID in education_df)
merged_df = pd.merge(
    employee_df,
    education_df,
    left_on='employee_id',
    right_on='ID',
    how='inner'
)
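
Because the key columns have different names in the two files, the merge uses `left_on` and `right_on`, and the result keeps both `employee_id` and `ID`. The sketch below uses two tiny hypothetical frames (`emp` and `edu`, not the lab data) to show how the `indicator` and `validate` parameters of `pd.merge` can help confirm which rows matched.

```python
import pandas as pd

# Hypothetical frames: E0003 has no education record, E0004 has no employee record
emp = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0003"],
    "department": ["HR", "Finance", "Marketing"],
})
edu = pd.DataFrame({
    "ID": ["E0001", "E0002", "E0004"],
    "educational_background": ["Psychology", "Architecture", "Business Administration"],
})

# An outer merge with indicator=True labels where each row came from;
# validate raises an error if either key column contains repeated IDs
check = pd.merge(emp, edu, left_on="employee_id", right_on="ID",
                 how="outer", indicator=True, validate="one_to_one")
print(check["_merge"].value_counts())

# The inner merge used in the lab keeps only IDs present in both frames
inner = pd.merge(emp, edu, left_on="employee_id", right_on="ID", how="inner")
print(inner.shape)
```

Running the outer merge first is a cheap way to see how many rows an inner merge would drop.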

### Activity 4: Analysis and Insights
<b>Step 1:</b> Analyze educational backgrounds by department:

In [6]:
# Your code here to group by department and educational_background

<b>Step 2:</b> Calculate average satisfaction scores by educational background:

In [7]:
# Your code here to analyze satisfaction scores

<b>Test Your Work:</b>
1. Verify the merged dataset contains all expected columns
2. Check for any duplicate records
3. Confirm aggregation results make sense
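
The three checks above could be sketched as follows. `merged_sample` is a hypothetical stand-in for `merged_df` with one deliberately repeated record, and the set of expected columns is an assumption you would adapt to your own analysis.

```python
import pandas as pd

# Hypothetical stand-in for merged_df with one repeated record
merged_sample = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0002"],
    "ID": ["E0001", "E0002", "E0002"],
    "department": ["HR", "Finance", "Finance"],
    "educational_background": ["Psychology", "Architecture", "Architecture"],
})

# 1. Confirm the columns needed for the analysis are present
expected_columns = {"employee_id", "ID", "department", "educational_background"}
missing_columns = expected_columns - set(merged_sample.columns)
print("Missing columns:", missing_columns)

# 2. Count fully duplicated rows
duplicate_rows = merged_sample.duplicated().sum()
print("Duplicate rows:", duplicate_rows)

# 3. Sanity-check an aggregation: group sizes should add up to the row count
group_sizes = merged_sample.groupby("department").size()
print(group_sizes.sum() == len(merged_sample))
```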

## Success Checklist
- Successfully loaded and cleaned both datasets
- Merged datasets correctly
- Created meaningful aggregations
- Generated insights about education and performance

## Common Issues & Solutions 
- Problem: Merge results in an unexpected number of rows
    - Solution: Check for duplicate IDs in either dataset
- Problem: Case-sensitive matching issues
    - Solution: Standardize string values (whitespace and letter case) before merging
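
Both problems can be reproduced on a small scale. The sketch below uses hypothetical frames `left` and `right` (not the lab data) in which one ID is repeated and two keys differ only by case.

```python
import pandas as pd

# Hypothetical frames: two lowercase keys on the left, one repeated ID on the right
left = pd.DataFrame({"employee_id": ["E0001", "e0002", "e0003"]})
right = pd.DataFrame({
    "ID": ["E0001", "E0001", "E0002", "E0003"],
    "educational_background": ["Psychology", "Psychology", "Architecture", "Medicine"],
})

# The repeated ID doubles one match, and the lowercase keys match nothing
naive = left.merge(right, left_on="employee_id", right_on="ID", how="inner")
print(len(naive))

# Find repeated IDs before merging
repeated = right[right["ID"].duplicated(keep=False)]
print(repeated)

# Standardize the key and drop duplicate IDs, then merge again
left["employee_id"] = left["employee_id"].str.strip().str.upper()
right_unique = right.drop_duplicates(subset="ID")
fixed = left.merge(right_unique, left_on="employee_id", right_on="ID", how="inner")
print(len(fixed))
```

Whether dropping duplicates is appropriate depends on why the IDs repeat, so inspect the repeated rows before discarding any of them.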

## Summary
Congratulations! You've successfully mastered essential DataFrame operations in pandas, learning how to merge, clean, and analyze real-world HR data. These skills in data manipulation and analysis are fundamental to data science work, and you've now gained practical experience working with the kinds of messy, multi-source datasets you'll encounter in your career.

### Key Points
- Data cleaning is crucial before analysis
- Choosing the right join keys and join type keeps rows from being dropped or duplicated during a merge
- Grouping and aggregation reveal patterns in data

## Solution Code
Stuck on your code or want to check your solution? Here's a complete reference implementation. It represents just one effective approach, so try solving each activity independently first, then use it to overcome obstacles or compare techniques. Happy coding!

### Activity 1: Data Loading and Initial Exploration - Solution Code

In [8]:
# Step 1: Load the datasets
education_df = pd.read_excel('education_data.xlsx') 
employee_df = pd.read_csv('employee_insights.csv')

# Step 2: Examine the data
print("Education Data Info:") 
print(education_df.info())
print("\nEmployee Data Info:")
print(employee_df.info())

Education Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   ID                      100 non-null    object
 1   graduation_year         100 non-null    int64 
 2   educational_background  100 non-null    object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB
None

Employee Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   employee_id          100 non-null    object 
 1   age                  44 non-null     float64
 2   salary               63 non-null     object 
 3   promotion_eligible   84 non-null     object 
 4   last_training_date   71 non-null     object 
 5   department           85 non-null     object 
 6   work_experience      71 non-nul

### Activity 2: Data Cleaning and Preparation - Solution Code

In [9]:
# Step 1: Clean the work_mode column
employee_df['work_mode'] = employee_df['work_mode'].str.strip().str.lower()

# Step 2: Handle missing values

# Fill missing numeric values with median
numeric_columns = employee_df.select_dtypes(include=['number']).columns
employee_df[numeric_columns] = employee_df[numeric_columns].fillna(employee_df[numeric_columns].median())

# Fill missing categorical values with mode
categorical_columns = employee_df.select_dtypes(include=['object']).columns
employee_df[categorical_columns] = employee_df[categorical_columns].fillna(employee_df[categorical_columns].mode().iloc[0])
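
Note that `salary` is stored as text (for example `$64761`), so `select_dtypes(include=['number'])` does not select it, and its gaps are filled with the most frequent string rather than a median. One possible alternative, shown here on a hypothetical `salary` Series rather than the lab data, is to convert the column to numbers before imputing.

```python
import numpy as np
import pandas as pd

# Hypothetical values in the same "$12345" format as the preview
salary = pd.Series(["$64761", np.nan, "$61486", np.nan])

# Remove the currency symbol and convert to numbers; NaN stays NaN
salary_numeric = pd.to_numeric(salary.str.replace("$", "", regex=False))

# Median imputation is now possible
salary_filled = salary_numeric.fillna(salary_numeric.median())
print(salary_filled.tolist())
```

This is a sketch of one option, not part of the reference solution. Applying it would change which columns the median and mode steps above pick up.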

### Activity 3: Merging Datasets - Solution Code

In [10]:
# Merge on the employee identifier (employee_id in employee_df, ID in education_df)
merged_df = pd.merge(
    employee_df,
    education_df,
    left_on='employee_id',
    right_on='ID',
    how='inner'
)

### Activity 4: Analysis and Insights - Solution Code

In [11]:
# Step 1: Analyze educational backgrounds by department
education_by_dept = merged_df.groupby(['department', 'educational_background']).size().unstack(fill_value=0)
display(education_by_dept)

# Step 2: Calculate average satisfaction score by educational background
avg_satisfaction_by_education = merged_df.groupby('educational_background')['satisfaction_score'].mean()
display(avg_satisfaction_by_education)

Employee counts by department (rows) and educational_background (columns):

| department | Architecture | Biology | Business Administration | Chemistry | Computer Science | Economics | Engineering | Law | Linguistics | Mathematics | Medicine | Philosophy | Physics | Political Science | Psychology | Statistics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Engineering | 1 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 3 |
| Finance | 4 | 0 | 3 | 2 | 2 | 2 | 2 | 0 | 4 | 2 | 4 | 1 | 2 | 2 | 2 | 1 |
| HR | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 2 |
| Marketing | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 2 | 1 |
| engineering | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 |
| finance | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 3 | 1 |


educational_background
Architecture               6.500000
Biology                    6.500000
Business Administration    5.666667
Chemistry                  7.333333
Computer Science           6.500000
Economics                  4.857143
Engineering                5.142857
Law                        5.333333
Linguistics                5.857143
Mathematics                6.571429
Medicine                   6.222222
Philosophy                 6.600000
Physics                    6.142857
Political Science          5.333333
Psychology                 4.888889
Statistics                 3.375000
Name: satisfaction_score, dtype: float64
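
The department output above lists `Engineering` and `engineering` (and `Finance` and `finance`) as separate groups, which suggests the `department` column has inconsistent casing. A possible fix, sketched here on a hypothetical `departments` Series, is to normalize the text before grouping, in the same way the lab cleans `work_mode`.

```python
import pandas as pd

# Hypothetical values mirroring the mixed casing seen in the department output
departments = pd.Series(
    ["Engineering", "engineering", "Finance", "finance ", "HR", "Marketing"]
)

# Before: casing and stray spaces create extra groups
print(departments.nunique())

# Normalize whitespace and case so variants collapse into one label
normalized = departments.str.strip().str.lower()
print(normalized.value_counts())
```

In the lab, the equivalent step would be applied to `employee_df['department']` before the merge and the `groupby`. Note that it also lowercases labels such as `HR`.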