# Ungraded Lab: Guided EDA Exercises

## Overview 
In this hands-on lab, you'll conduct exploratory data analysis (EDA) on the EngageMetrics employee dataset. You'll apply descriptive statistics, identify patterns, and detect anomalies in real-world data. The lab builds foundational skills needed for the final capstone project. The course screencasts are there to support you; keeping them open in another tab for quick reference is a habit shared by professionals and learners alike.

## Learning Outcomes 
By the end of this lab, you will be able to:
- Apply basic descriptive statistics to understand data distributions
- Generate summary statistics for numerical and categorical variables
- Identify patterns and potential anomalies in the dataset
- Document your analysis process and findings effectively

## Dataset Information 
We'll use the <b>employee_insights.csv</b> dataset from EngageMetrics, containing employee information including:
- Demographics (age)
- Work metrics (salary, hours worked, projects completed)
- Performance indicators (satisfaction score, overtime hours)
- Categorical data (department, work mode, promotion eligibility)


## Activities
### Activity 1: Initial Data Exploration 
Let's begin by loading and examining the dataset structure.

<b>Step 1:</b> Import libraries and load data:


In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('employee_insights.csv')

<b>Step 2:</b> Examine the dataset structure:

In [2]:
# View first few rows and basic information
display(df.head())
df.info()

,employee_id,age,salary,promotion_eligible,last_training_date,department,work_experience,projects_completed,hours_worked_weekly,work_mode,last_promotion_date,satisfaction_score,overtime_hours
0,E0001,54.0,,,15/08/2023,HR,,14.0,,remote work,2022-05-10,,8.4
1,E0002,,$64761,N,15/08/2023,,1 years,,53.3,HYBRID,05-10-2022,,8.1
2,E0003,54.0,,N,15/08/2023,Marketing,8,6.0,32.6,Hybrid,10/05/2022,10.0,5.2
3,E0004,,,No,,,16,1.0,37.8,Remote,05-10-2022,5.0,
4,E0005,29.0,$61486,Y,15/08/2023,,,1.0,53.3,Hybrid,2022-05-10,,0.3


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   employee_id          100 non-null    object 
 1   age                  44 non-null     float64
 2   salary               63 non-null     object 
 3   promotion_eligible   84 non-null     object 
 4   last_training_date   71 non-null     object 
 5   department           85 non-null     object 
 6   work_experience      71 non-null     object 
 7   projects_completed   48 non-null     float64
 8   hours_worked_weekly  67 non-null     float64
 9   work_mode            84 non-null     object 
 10  last_promotion_date  74 non-null     object 
 11  satisfaction_score   61 non-null     float64
 12  overtime_hours       70 non-null     float64
dtypes: float64(5), object(8)
memory usage: 10.3+ KB


<b>Tip:</b> Always start with a high-level overview of your data to understand its structure.
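
One way to get that high-level overview in a single table is sketched below. The helper name `overview` and the toy DataFrame are assumptions made for illustration (the toy rows are modeled on the preview above, not read from the CSV); in the lab you would pass `df` instead.

```python
import numpy as np
import pandas as pd

def overview(frame):
    """Return one row per column: dtype, non-null count, and percent missing."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })

# Toy stand-in for the lab data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0003", "E0004"],
    "age": [54.0, np.nan, 54.0, np.nan],
    "salary": [np.nan, "$64761", np.nan, np.nan],
})

summary = overview(toy)
print(summary)
```

Sorting the result by `pct_missing` is a quick way to see which columns will need the most care before computing statistics.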

### Activity 2: Descriptive Statistics 
Now we'll calculate and interpret basic statistics.

<b>Step 1:</b> Generate summary statistics:

In [3]:
# Generate summary statistics for numerical columns
# YOUR CODE HERE 

<b>Step 2:</b> Analyze categorical variables:

In [4]:
# Clean and calculate value counts for work_mode and department
# YOUR CODE HERE 

<b>Tip:</b> Pay attention to missing values and potential outliers in your summary statistics.

### Activity 3: Pattern Identification 
Let's explore relationships between variables.

<b>Step 1:</b> Examine relationships:

In [5]:
# Calculate average satisfaction score by department
# YOUR CODE HERE 

<b>Step 2:</b> Identify potential anomalies:

In [6]:
# Check for outliers in salary and hours_worked_weekly (remember to fill NaN values with appropriate values)
# YOUR CODE HERE 

<b>Test Your Work:</b>
- Summary statistics should show key metrics (mean, median, std) for numerical columns
- Categorical analysis should show counts for each unique value
- Pattern analysis should reveal insights about employee satisfaction
- Outlier detection should identify unusual values in the dataset

## Success Checklist
- Successfully loaded and examined the dataset structure
- Generated and interpreted summary statistics
- Analyzed categorical variable distributions
- Identified patterns and potential outliers
- Documented findings clearly

## Common Issues & Solutions 
- Problem: Missing values affecting calculations 
  - Solution: Use appropriate pandas methods (dropna(), fillna()) based on context
- Problem: Incorrect data types preventing analysis 
  - Solution: Check and convert data types using astype() when needed
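
The two fixes above can be sketched on a few toy rows. This is a minimal illustration, not the only reasonable approach: the toy values are modeled on the dataset preview, the new column names (`salary_num`, `experience_years`) are hypothetical, and whether to drop or fill missing values depends on the question you are asking.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the dataset preview; not read from the real file
toy = pd.DataFrame({
    "salary": ["$64761", np.nan, "$61486"],
    "work_experience": ["1 years", "8", np.nan],
})

# Strip the currency symbol as a literal character, then convert to float
toy["salary_num"] = toy["salary"].str.replace("$", "", regex=False).astype(float)

# Pull the digits out of strings such as "1 years"; missing values stay NaN
toy["experience_years"] = pd.to_numeric(
    toy["work_experience"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# Two ways to handle gaps: drop the missing entries, or fill with a summary value
salary_dropped = toy["salary_num"].dropna()
salary_filled = toy["salary_num"].fillna(toy["salary_num"].median())

print(toy[["salary_num", "experience_years"]])
```

Dropping keeps only observed values, while filling keeps every row at the cost of adding values that were never measured.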
  
## Summary
Congratulations on completing this foundational EDA lab! You've practiced the core techniques for exploring a dataset: inspecting its structure, summarizing numerical and categorical variables, and flagging potential outliers. These skills form the analytical toolkit you'll use on real-world business problems throughout your data science journey.
  
### Key Points
- Always start with basic data exploration
- Consider both numerical and categorical variables
- Document observations and potential issues
- Think critically about what the patterns mean

## Solution Code
Stuck on your code or want to check your solution? Here's a complete reference implementation. It represents just one effective approach, so try solving each activity independently first, then use it to get past obstacles or compare techniques. Happy coding!


### Activity 1: Initial Data Exploration - Solution Code

In [7]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('employee_insights.csv')

# Display the first few rows
print("First few rows of the dataset:")
display(df.head())

# Display basic information about the dataset
print("\nDataset information:")
df.info()

First few rows of the dataset:


,employee_id,age,salary,promotion_eligible,last_training_date,department,work_experience,projects_completed,hours_worked_weekly,work_mode,last_promotion_date,satisfaction_score,overtime_hours
0,E0001,54.0,,,15/08/2023,HR,,14.0,,remote work,2022-05-10,,8.4
1,E0002,,$64761,N,15/08/2023,,1 years,,53.3,HYBRID,05-10-2022,,8.1
2,E0003,54.0,,N,15/08/2023,Marketing,8,6.0,32.6,Hybrid,10/05/2022,10.0,5.2
3,E0004,,,No,,,16,1.0,37.8,Remote,05-10-2022,5.0,
4,E0005,29.0,$61486,Y,15/08/2023,,,1.0,53.3,Hybrid,2022-05-10,,0.3



Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   employee_id          100 non-null    object 
 1   age                  44 non-null     float64
 2   salary               63 non-null     object 
 3   promotion_eligible   84 non-null     object 
 4   last_training_date   71 non-null     object 
 5   department           85 non-null     object 
 6   work_experience      71 non-null     object 
 7   projects_completed   48 non-null     float64
 8   hours_worked_weekly  67 non-null     float64
 9   work_mode            84 non-null     object 
 10  last_promotion_date  74 non-null     object 
 11  satisfaction_score   61 non-null     float64
 12  overtime_hours       70 non-null     float64
dtypes: float64(5), object(8)
memory usage: 10.3+ KB


### Activity 2: Descriptive Statistics - Solution Code

In [8]:
# Generate summary statistics for numerical columns
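# Strip the literal "$" before converting; regex=False keeps pandas from
# treating "$" as an end-of-string anchor on versions where regex is the default
df['salary'] = df['salary'].str.replace("$", "", regex=False).astype(float)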

numerical_stats = df[['age', 'salary', 'hours_worked_weekly', 'satisfaction_score', 'overtime_hours']].describe()
display(numerical_stats)

# Additional analysis of missing values
missing_values = df.isnull().sum()
print("\nMissing values in each column:")
display(missing_values)

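# Calculate value counts for work_mode
# (labels are not cleaned here, so variants such as "Remote" and "remote work" are counted separately)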
work_mode_counts = df['work_mode'].value_counts()
print("Work Mode Distribution:")
display(work_mode_counts)

# Calculate value counts for department
dept_counts = df['department'].value_counts()
print("\nDepartment Distribution:")
display(dept_counts)

# Calculate percentages - Additional analysis
work_mode_pct = df['work_mode'].value_counts(normalize=True) * 100
print("\nWork Mode Percentages:")
display(work_mode_pct)

,age,salary,hours_worked_weekly,satisfaction_score,overtime_hours
count,44.0,63.0,67.0,61.0,70.0
mean,41.659091,81553.111111,43.38209,5.57377,5.618571
std,12.036656,20232.682927,8.520669,2.918099,2.843365
min,22.0,51438.0,30.1,1.0,0.0
25%,30.0,64705.5,35.2,3.0,3.3
50%,41.0,77186.0,43.3,6.0,6.15
75%,54.0,98882.5,50.45,8.0,8.025
max,60.0,118824.0,59.4,10.0,9.8



Missing values in each column:


employee_id             0
age                    56
salary                 37
promotion_eligible     16
last_training_date     29
department             15
work_experience        29
projects_completed     52
hours_worked_weekly    33
work_mode              16
last_promotion_date    26
satisfaction_score     39
overtime_hours         30
dtype: int64

Work Mode Distribution:


work_mode
remote work    19
HYBRID         19
On-site        17
Remote         15
Hybrid         14
Name: count, dtype: int64


Department Distribution:


department
Finance        18
HR             16
Marketing      15
Engineering    13
finance        13
engineering    10
Name: count, dtype: int64


Work Mode Percentages:


work_mode
remote work    22.619048
HYBRID         22.619048
On-site        20.238095
Remote         17.857143
Hybrid         16.666667
Name: proportion, dtype: float64
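
The counts above treat "remote work" and "Remote", or "Finance" and "finance", as different categories. A minimal sketch of one way to merge such variants before counting follows. The helper `normalize` and both mapping dictionaries are assumptions: they encode a guess about which labels mean the same thing, and the toy Series stand in for `df['work_mode']` and `df['department']`.

```python
import numpy as np
import pandas as pd

def normalize(labels, canonical):
    """Trim and lowercase labels, then map known variants to one canonical form.

    Labels missing from the mapping keep their lowercase form; NaN stays NaN.
    """
    key = labels.str.strip().str.lower()
    return key.map(canonical).fillna(key)

# Hypothetical mappings: which variants are treated as the same category
mode_map = {"remote work": "Remote", "remote": "Remote",
            "hybrid": "Hybrid", "on-site": "On-site"}
dept_map = {"finance": "Finance", "engineering": "Engineering",
            "hr": "HR", "marketing": "Marketing"}

# Toy stand-ins for the two columns
work_mode = pd.Series(["remote work", "HYBRID", "Hybrid", "Remote", "On-site", np.nan])
department = pd.Series(["Finance", "finance", "HR", "Engineering", "engineering"])

work_mode_clean = normalize(work_mode, mode_map)
department_clean = normalize(department, dept_map)

print(work_mode_clean.value_counts(dropna=False))
print(department_clean.value_counts())
```

Applying the same normalization before a `groupby('department')` would also give one satisfaction average per department rather than one per spelling.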

### Activity 3: Pattern Identification - Solution Code

In [9]:
# Calculate average satisfaction score by department
# (department labels are used as-is, so "Finance" and "finance" form separate groups)
dept_satisfaction = df.groupby('department')['satisfaction_score'].agg(['mean'])
dept_satisfaction = dept_satisfaction.round(2)
print("Satisfaction Score by Department:")
display(dept_satisfaction)

# Identify potential anomalies
def identify_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Check for salary outliers
df['salary'] = df['salary'].fillna(df['salary'].mean())
salary_outliers, salary_lower, salary_upper = identify_outliers(df, 'salary')
print("Salary Outliers:")
print(f"Number of outliers: {len(salary_outliers)}")
print(f"Range for normal values: {salary_lower:.2f} to {salary_upper:.2f}")
display(salary_outliers[['employee_id', 'salary', 'department']])

# Check for hours worked outliers
df['hours_worked_weekly'] = df['hours_worked_weekly'].fillna(df['hours_worked_weekly'].mean())
hours_outliers, hours_lower, hours_upper = identify_outliers(df, 'hours_worked_weekly')
print("\nHours Worked Outliers:")
print(f"Number of outliers: {len(hours_outliers)}")
print(f"Range for normal values: {hours_lower:.2f} to {hours_upper:.2f}")
display(hours_outliers[['employee_id', 'hours_worked_weekly', 'department']])

Satisfaction Score by Department:


department,mean
Engineering,6.44
Finance,5.08
HR,4.0
Marketing,5.29
engineering,6.88
finance,5.6


Salary Outliers:
Number of outliers: 5
Range for normal values: 50246.25 to 113338.25


,employee_id,salary,department
6,E0007,115377.0,Finance
37,E0038,118824.0,
46,E0047,114025.0,engineering
68,E0069,115365.0,HR
88,E0089,117852.0,Marketing



Hours Worked Outliers:
Number of outliers: 5
Range for normal values: 29.60 to 56.60


,employee_id,hours_worked_weekly,department
9,E0010,57.9,Finance
21,E0022,59.4,Marketing
51,E0052,58.5,HR
56,E0057,57.1,HR
79,E0080,57.3,HR
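
One thing to keep in mind when reading these results: filling many missing values with the mean stacks observations at the center of the distribution, which narrows the IQR and the fences built from it. The salary quartiles reported by describe() in Activity 2 (64705.5 and 98882.5) would give fences of 13440.00 and 150148.00, much wider than the 50246.25 to 113338.25 obtained after filling. The sketch below reproduces the effect on made-up numbers; `iqr_bounds` is a hypothetical helper and the values are not from the dataset.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values):
    """Return the (lower, upper) 1.5 * IQR fences; quantile() skips NaN."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def count_outliers(values, low, high):
    """Count entries strictly outside the fences (NaN never counts)."""
    return int(((values < low) | (values > high)).sum())

# Eight observed values plus eight missing ones (made-up numbers)
observed = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0])
with_gaps = pd.concat([observed, pd.Series([np.nan] * 8)], ignore_index=True)

# Fences from the observed values only
raw_low, raw_high = iqr_bounds(with_gaps)
raw_count = count_outliers(with_gaps, raw_low, raw_high)

# Fences after mean imputation, as in the solution cell
filled = with_gaps.fillna(with_gaps.mean())
filled_low, filled_high = iqr_bounds(filled)
filled_count = count_outliers(filled, filled_low, filled_high)

print(f"Observed only: {raw_low:.2f} to {raw_high:.2f}, outliers: {raw_count}")
print(f"Mean-filled:   {filled_low:.2f} to {filled_high:.2f}, outliers: {filled_count}")
```

Computing the fences on observed values only, or filling with a less aggressive strategy, are both reasonable alternatives; which one is appropriate depends on why the values are missing.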
