## Data Inspection and Cleaning

Using the dataset provided (see assignment), perform the following:

* Use .shape, .describe(), .info(), and .head() to take an initial look at the dataset.  Identify any inconsistencies or missing data.  Point out at least two things about the data that you found important or interesting.


* Handle missing data using .isna(), .dropna(), and .fillna().  Justify your decisions.  For example, if you fill with a value, explain why.


* Identify and remove duplicate rows using .drop_duplicates().  Justify your decisions.


* Use the .str accessor to clean up inconsistent text formatting in the Name and Email columns.


* Reinspect the cleaned dataset to ensure no missing values or duplicates remain and save the cleaned DataFrame as cleaned_data.csv.  Exclude the index from the output.

In [66]:
import pandas as pd
df = pd.read_csv('dirty_data_50_rows.csv')
df

Unnamed: 0,Name,Age,Email,Salary,Department
0,Frank,,frank@example.com,43692.0,Finance
1,Frank,48.0,hannah@example.com,70898.0,Operations
2,Alice,,david@example.com,,Operations
3,Eve,,grace@example.com,,Operations
4,charlie,48.0,alice@example.com,,Marketing
5,Judy,,,,Sales
6,Judy,,grace@example.com,50657.0,HR
7,Charlie,,ivan@example.com,40938.0,HR
8,Grace,,frank@example.com,65595.0,Marketing
9,Bob,,grace@example.com,,IT


In [73]:
import pandas as pd

# Load dataset
df = pd.read_csv('dirty_data_50_rows.csv')

print("Dataset Shape:", df.shape)
print("\n Dataset Information:")
print(df.info())  # info() prints its report directly and returns None, hence the trailing "None"
print("\n Summary Statistics:")
print(df.describe(include='all')) 
print("\n First 5 Rows:")
print(df.head())  

Dataset Shape: (55, 5)

 Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        55 non-null     object 
 1   Age         22 non-null     float64
 2   Email       51 non-null     object 
 3   Salary      25 non-null     float64
 4   Department  55 non-null     object 
dtypes: float64(2), object(3)
memory usage: 2.3+ KB
None

 Summary Statistics:
         Name        Age               Email        Salary  Department
count      55  22.000000                  51     25.000000          55
unique     14        NaN                  15           NaN           6
top     Frank        NaN  hannah@example.com           NaN  Operations
freq        8        NaN                   7           NaN          16
mean      NaN  44.409091                 NaN  56600.680000         NaN
std       NaN  10.349415                 NaN  13335.284501  

## Identifying Inconsistencies and Missing Data:
- The **Age** column has **33 missing values** (only 22 out of 55 rows have age data).
- The **Email** column has **4 missing values**.
- The **Salary** column has **30 missing values** (only 25 out of 55 rows contain salary information).
- There is **inconsistent formatting in the Name column**, where some names are properly capitalized while others are in lowercase  
  (e.g., Charlie vs charlie, Judy vs judy, Hannah vs hannah, Bob vs bob).
- There is also **inconsistent formatting in the Email column**, where some usernames are capitalized while others are in lowercase  
  (e.g., Charlie@example.com vs charlie@example.com, Frank@example.com vs frank@example.com, Judy@example.com vs judy@example.com, Ivan@example.com vs ivan@example.com).
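
One way to confirm casing problems like these, rather than spotting them by eye, is to compare the number of distinct values before and after normalizing case. The snippet below is a minimal sketch on a small hypothetical sample (the `sample` DataFrame is made up for illustration and is not the assignment data).

```python
import pandas as pd

# Hypothetical sample that mimics the casing problems described above
sample = pd.DataFrame({
    "Name": ["Charlie", "charlie", "Judy", "judy", "Bob"],
    "Email": ["Charlie@example.com", "charlie@example.com",
              "Judy@example.com", "judy@example.com", None],
})

# Count distinct values before and after normalizing case;
# a drop in the count means some values differ only by casing
raw_names = sample["Name"].nunique()
normalized_names = sample["Name"].str.lower().nunique()

raw_emails = sample["Email"].nunique()            # missing values are not counted
normalized_emails = sample["Email"].str.lower().nunique()

print(f"Name: {raw_names} raw vs {normalized_names} case-insensitive")
print(f"Email: {raw_emails} raw vs {normalized_emails} case-insensitive")
```

If the two counts match for a column, that column has no values that differ only by capitalization.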



## Notable Observations:
1. **Duplicate Email Addresses**  
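   - Some emails repeat heavily: "hannah@example.com" appears **7 times** in just **55 rows**, which points to either duplicate records or emails that do not match the person named in the row (row 1, for example, pairs Frank with hannah@example.com).
   - This could impact analyses that assume email is a unique identifier.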

2. **Uneven Department Distribution**  
   - The **Operations department** appears **16 times**, while the other five departments share the remaining 39 rows (fewer than 8 each on average).  
   - This imbalance may skew analyses related to salary, age, or workforce composition.
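
Both observations can be checked directly with `.value_counts()`. The sketch below uses a few hypothetical rows (not the assignment data) to show how repeated emails and the share of rows per department might be surfaced.

```python
import pandas as pd

# Hypothetical rows; the real counts come from dirty_data_50_rows.csv
sample = pd.DataFrame({
    "Email": ["hannah@example.com", "hannah@example.com", "frank@example.com",
              "grace@example.com", "hannah@example.com"],
    "Department": ["Operations", "Operations", "HR", "Operations", "IT"],
})

# How often each email appears; anything above 1 is a repeated address
email_counts = sample["Email"].value_counts()
repeated_emails = email_counts[email_counts > 1]

# Share of rows per department, to quantify the imbalance
department_share = sample["Department"].value_counts(normalize=True)

print(repeated_emails)
print(department_share)
```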

In [75]:
print("\n Missing Values in Each Column:")
print(df.isna().sum())


 Missing Values in Each Column:
Name           0
Age           33
Email          4
Salary        30
Department     0
dtype: int64


## Justification for Identifying Missing Data using .isna()

- I used `.isna().sum()` to count the missing values in each column before making any changes.
- These counts guide the choice between dropping and filling for each column, and they help ensure that only incomplete data gets modified.
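
Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch on hypothetical data (the `sample` DataFrame and the `missing_summary` name are made up for illustration), `.isna().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Age and Salary
sample = pd.DataFrame({
    "Name": ["Frank", "Alice", "Eve", "Judy"],
    "Age": [np.nan, 48.0, np.nan, np.nan],
    "Salary": [43692.0, np.nan, np.nan, 50657.0],
})

# isna() marks each missing cell as True; sum() counts them per column
# and mean() gives the fraction of rows that are missing
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})

print(missing_summary)
```

A column that is mostly empty is usually a candidate for filling or for being left out of an analysis, since dropping its incomplete rows would discard most of the dataset.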


In [79]:
df = df.dropna(subset=['Age', 'Email', 'Salary'], how='all')
print("\n Missing Values After Dropping Rows:")
print(df.isna().sum())


 Missing Values After Dropping Rows:
Name           0
Age           32
Email          3
Salary        29
Department     0
dtype: int64


## Justification for Handling Missing Data with `.dropna()`

I used `.dropna()` to remove rows **only where `Age`, `Email`, and `Salary` were all missing**.

- **Rows were removed only if all three columns (`Age`, `Email`, `Salary`) were null**, which retains as much data as possible. Only one row met that condition, so each missing-value count dropped by one (33 to 32, 4 to 3, 30 to 29).
- **Rows with at least one of these values were kept**, so their remaining gaps could be filled using `.fillna()`.
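
The effect of `how='all'` is easiest to see next to the default `how='any'`. The following is a small sketch on hypothetical rows (not the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Frank misses one value, Judy misses all three, Alice misses two
sample = pd.DataFrame({
    "Name": ["Frank", "Judy", "Alice"],
    "Age": [np.nan, np.nan, 48.0],
    "Email": ["frank@example.com", None, None],
    "Salary": [43692.0, np.nan, np.nan],
    "Department": ["Finance", "Sales", "Operations"],
})

cols = ["Age", "Email", "Salary"]

# how='all': drop a row only if every column in `subset` is missing
kept_all = sample.dropna(subset=cols, how="all")

# how='any' (the default): drop a row if any column in `subset` is missing
kept_any = sample.dropna(subset=cols, how="any")

print(len(sample), len(kept_all), len(kept_any))
```

With `how='all'` only Judy's row is removed, while `how='any'` would discard every row in this sample.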

In [97]:
# Fill missing ages with the median age, then store ages as integers
df['Age'] = df['Age'].fillna(df['Age'].median()).astype(int)

# Flag missing emails with a placeholder instead of dropping the row
df['Email'] = df['Email'].fillna("No Email Provided")

# Fill missing salaries with one value per department
department_salary_fill = {
    'HR': 50000,
    'Finance': 56418,
    'Marketing': 55000,
    'Sales': 52000,
    'Operations': 66552,
    'IT': 75151,
}
df['Salary'] = df['Salary'].fillna(df['Department'].map(department_salary_fill))

# Fall back to the overall median for any other department, then store salaries as integers
df['Salary'] = df['Salary'].fillna(df['Salary'].median()).round(0).astype(int)

print("\n Missing Values After Filling:")
print(df.isna().sum())


 Missing Values After Filling:
Name          0
Age           0
Email         0
Salary        0
Department    0
dtype: int64


In [98]:
df

Unnamed: 0,Name,Age,Email,Salary,Department
0,Frank,45,frank@example.com,43692,Finance
1,Frank,48,hannah@example.com,70898,Operations
2,Alice,45,david@example.com,66552,Operations
3,Eve,45,grace@example.com,66552,Operations
4,Charlie,48,alice@example.com,55000,Marketing
6,Judy,45,grace@example.com,50657,HR
7,Charlie,45,ivan@example.com,40938,HR
8,Grace,45,frank@example.com,65595,Marketing
9,Bob,45,grace@example.com,75152,IT
10,Bob,54,frank@example.com,55000,Marketing


## Justification for Handling Missing Data with .fillna()

- I used `.fillna(df['Age'].median())` to replace missing age values.
- The median is a typical value that is not pulled around by unusually young or old employees, so the filled ages stay realistic. Because 32 of the 54 remaining rows had no age, most rows now share the median age, so age-based analysis should be read with that in mind.
- Since `Email` is a text column with no meaningful numeric substitute, I replaced missing values with "No Email Provided".
  - This preserves all employee records, instead of deleting them due to missing emails.
- For `Salary`, I used fixed values for known department averages (HR, Marketing, and Sales) and the calculated mean salary for the Finance, IT, and Operations departments. Any remaining missing salaries were filled using the **overall median salary**.
- This ensured that no missing values remained in the dataset.
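
As an alternative to typing department values by hand, the department means can be computed from the data itself. This is a minimal sketch on hypothetical rows, not the method used in the cell above: `groupby(...).transform("mean")` returns each row's department mean, and the overall median covers departments with no observed salary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; Sales has no observed salary at all
sample = pd.DataFrame({
    "Department": ["IT", "IT", "IT", "HR", "HR", "Sales"],
    "Salary": [70000.0, 80000.0, np.nan, 50000.0, np.nan, np.nan],
})

# Mean salary of each row's department, aligned to the original index
department_mean = sample.groupby("Department")["Salary"].transform("mean")

# Fill from the department mean first
filled = sample["Salary"].fillna(department_mean)

# Sales has no observed salary, so its mean is NaN; fall back to the overall median
filled = filled.fillna(sample["Salary"].median())

sample["Salary"] = filled.round(0).astype(int)
print(sample)
```

Computing the fill values this way keeps them in sync with the data if the file changes, at the cost of not being able to use outside knowledge about a department's typical salary.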


In [99]:
print("\n Duplicate Rows Before Removal:", df.duplicated().sum())


 Duplicate Rows Before Removal: 0


In [100]:
df = df.drop_duplicates()

In [101]:
print("Duplicate Rows After Removal:", df.duplicated().sum())

Duplicate Rows After Removal: 0


## Justification for Handling Duplicates with .drop_duplicates()

- I used `.duplicated().sum()` to count exact duplicate rows and found none, so `.drop_duplicates()` left the dataset unchanged here.
- I still ran `.drop_duplicates()` as a safeguard, because duplicate rows would cause errors in salary calculations, department counts, and reporting.
- The cleaned dataset contains only unique records.
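
`.duplicated()` only flags rows that match exactly, so two records that differ only in capitalization are not counted until the text is normalized. The sketch below illustrates this on hypothetical rows (not the assignment data):

```python
import pandas as pd

# Hypothetical rows: the first two describe the same person with different casing
sample = pd.DataFrame({
    "Name": ["Bob", "bob", "Grace"],
    "Email": ["bob@example.com", "Bob@example.com", "grace@example.com"],
    "Department": ["IT", "IT", "HR"],
})

# Exact comparison: the casing difference hides the duplicate
exact_duplicates = int(sample.duplicated().sum())

# Normalize text first, then compare again
normalized = sample.assign(
    Name=sample["Name"].str.title(),
    Email=sample["Email"].str.lower(),
)
duplicates_after_normalizing = int(normalized.duplicated().sum())

# keep='first' retains the first occurrence of each duplicated row
deduplicated = normalized.drop_duplicates(keep="first")

print(exact_duplicates, duplicates_after_normalizing, len(deduplicated))
```

For that reason it is worth repeating the duplicate check after any text cleanup, as the final check in this notebook does.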


In [103]:
df['Name'] = df['Name'].str.title()

df['Email'] = df['Email'].str.lower()

print("\n The Dataset with Clear Consistent Text Formatting:")
print(df)


 The Dataset with Clear Consistent Text Formatting:
       Name  Age                Email  Salary  Department
0     Frank   45    frank@example.com   43692     Finance
1     Frank   48   hannah@example.com   70898  Operations
2     Alice   45    david@example.com   66552  Operations
3       Eve   45    grace@example.com   66552  Operations
4   Charlie   48    alice@example.com   55000   Marketing
6      Judy   45    grace@example.com   50657          HR
7   Charlie   45     ivan@example.com   40938          HR
8     Grace   45    frank@example.com   65595   Marketing
9       Bob   45    grace@example.com   75152          IT
10      Bob   54    frank@example.com   55000   Marketing
11     Judy   48    david@example.com   67592  Operations
12  Charlie   34    alice@example.com   79268  Operations
13     Ivan   45   hannah@example.com   75152          IT
14    Frank   45      eve@example.com   44590       Sales
15    Frank   45   hannah@example.com   50000          HR
16    Grace   57   

## Justification for Using .str Accessor

- I used `.str.title()` to capitalize the first letter of each word in the `Name` column.
- I used `.str.lower()` to convert all emails to lowercase.
- This removes inconsistencies in name capitalization and standardizes email formatting to prevent potential mismatches.
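
The `.str` methods can be chained, which is handy when values may also carry stray whitespace. The example below is a sketch on hypothetical values; whether the real file contains leading or trailing spaces is an assumption, not something the outputs above show.

```python
import pandas as pd

# Hypothetical messy values, including extra spaces and a placeholder email
sample = pd.DataFrame({
    "Name": ["  charlie ", "JUDY", "mary ann"],
    "Email": [" Charlie@Example.com", "JUDY@example.com ", "No Email Provided"],
})

# strip() removes surrounding whitespace, title() capitalizes each word
sample["Name"] = sample["Name"].str.strip().str.title()

# lower() converts every character to lowercase, including placeholder text
sample["Email"] = sample["Email"].str.strip().str.lower()

print(sample)
```

Note that `.str.lower()` also lowercases the "No Email Provided" placeholder, so any later check for that placeholder should compare against the lowercase form.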

In [104]:
print("\n Final Missing Values Check:")
print(df.isna().sum())


 Final Missing Values Check:
Name          0
Age           0
Email         0
Salary        0
Department    0
dtype: int64


In [105]:
print("\n Final Duplicate Rows Check:", df.duplicated().sum())


 Final Duplicate Rows Check: 0


In [106]:
print("\n Final Dataset Preview:")
print(df.head())


 Final Dataset Preview:
      Name  Age               Email  Salary  Department
0    Frank   45   frank@example.com   43692     Finance
1    Frank   48  hannah@example.com   70898  Operations
2    Alice   45   david@example.com   66552  Operations
3      Eve   45   grace@example.com   66552  Operations
4  Charlie   48   alice@example.com   55000   Marketing


In [107]:
df

Unnamed: 0,Name,Age,Email,Salary,Department
0,Frank,45,frank@example.com,43692,Finance
1,Frank,48,hannah@example.com,70898,Operations
2,Alice,45,david@example.com,66552,Operations
3,Eve,45,grace@example.com,66552,Operations
4,Charlie,48,alice@example.com,55000,Marketing
6,Judy,45,grace@example.com,50657,HR
7,Charlie,45,ivan@example.com,40938,HR
8,Grace,45,frank@example.com,65595,Marketing
9,Bob,45,grace@example.com,75152,IT
10,Bob,54,frank@example.com,55000,Marketing


In [108]:
df.to_csv("cleaned_data.csv", index=False)
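
A quick way to confirm the export is to read the file back and compare it with the DataFrame that was saved. The sketch below uses a small hypothetical DataFrame and a temporary directory so that it does not overwrite the real `cleaned_data.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data standing in for the real DataFrame
cleaned = pd.DataFrame({
    "Name": ["Frank", "Alice"],
    "Age": [45, 48],
    "Email": ["frank@example.com", "alice@example.com"],
    "Salary": [43692, 66552],
    "Department": ["Finance", "Operations"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")
    # index=False keeps the row labels out of the file
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False there is no extra "Unnamed: 0" column on reload
columns_match = list(reloaded.columns) == list(cleaned.columns)
values_match = reloaded.to_dict("list") == cleaned.to_dict("list")

print(columns_match, values_match)
```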