
## Week 2: Clean and Fix a Messy Dataset Using Python (Pandas)

# 1. Introduction
This project demonstrates how to clean and fix a messy dataset using Python’s Pandas library. Real-world datasets often contain inconsistencies such as missing values, duplicate records, and varied data formats. As a data analyst, being able to identify and correct these issues programmatically is an essential skill. The goal is to simulate the everyday tasks analysts perform when preparing data for analysis or reporting.

# 2. Importing Libraries
The only required library for this task was Pandas, which is the industry-standard Python library for data manipulation and analysis.


In [1]:
# Import Required Libraries

import pandas as pd

# 3. Creating and Displaying the Dataset
A small sample dataset was built by hand as a Python dictionary and then converted to a Pandas DataFrame. It reflects common data quality problems: extra spaces and inconsistent capitalization in text fields, inconsistent country naming conventions, a missing value in each column, and a repeated record (the two 'Alice' rows, which differ only in "Country").

In [9]:
# Create a Messy Dataset

data = {
    'Name': [' john ', 'Alice', 'Bob', 'Alice', None],
    'Age': [None, 25, 30, 25, 22],
    'Country': ['us', 'United States', 'USA', None, 'uk']
}
df = pd.DataFrame(data)
df

|   | Name  | Age  | Country       |
|---|-------|------|---------------|
| 0 | john  | NaN  | us            |
| 1 | Alice | 25.0 | United States |
| 2 | Bob   | 30.0 | USA           |
| 3 | Alice | 25.0 | None          |
| 4 | None  | 22.0 | uk            |
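Before cleaning, it can help to quantify the problems listed above. The following is a minimal sketch, not part of the original notebook; the variable names (`raw`, `missing_per_column`, `exact_duplicates`, `repeated_names`) are illustrative.

```python
import pandas as pd

# Same sample data as in the cell above
raw = pd.DataFrame({
    'Name': [' john ', 'Alice', 'Bob', 'Alice', None],
    'Age': [None, 25, 30, 25, 22],
    'Country': ['us', 'United States', 'USA', None, 'uk'],
})

# Count missing values in each column
missing_per_column = raw.isna().sum()

# Count rows that exactly repeat an earlier row (every column equal)
exact_duplicates = int(raw.duplicated().sum())

# Count names that repeat an earlier name, ignoring missing names
repeated_names = int(raw['Name'].dropna().duplicated().sum())

print(missing_per_column)
print(exact_duplicates)  # 0
print(repeated_names)    # 1
```

Each column has one missing value. No row is an exact copy of another, even though the name 'Alice' appears twice.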


# 4. Cleaning Steps
The following data cleaning operations were performed:


- **Trimming and Title-Casing Names:** Extra spaces in the "Name" column were removed using `.str.strip()`, and all names were formatted to title case using `.str.title()`.

In [3]:
# Remove Extra Spaces and Capitalize Names

df['Name'] = df['Name'].str.strip().str.title()
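One detail worth noting is that the `.str` accessor passes missing values through unchanged, so this step tidies existing names but does not fill in absent ones. A small sketch on a stand-alone Series (the name `names` is illustrative):

```python
import pandas as pd

names = pd.Series([' john ', 'Alice', None])

# .str methods skip missing entries and leave them missing
cleaned = names.str.strip().str.title()

print(cleaned.tolist()[:2])     # ['John', 'Alice']
print(cleaned.isna().tolist())  # [False, False, True]
```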

- **Handling Missing Ages:** Missing values in the "Age" column were filled with the median of the non-missing age values.

In [4]:
# Fill Missing Ages with the Median

df['Age'] = df['Age'].fillna(df['Age'].median())
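To make the imputed value concrete, the sketch below repeats the calculation on the "Age" values alone. `median()` ignores missing values by default, so the median is taken over 22, 25, 25, and 30.

```python
import pandas as pd

ages = pd.Series([None, 25, 30, 25, 22], dtype='float64')

# median() skips missing values by default (skipna=True)
median_age = ages.median()

# Replace the missing entry with that median
filled = ages.fillna(median_age)

print(median_age)       # 25.0
print(filled.tolist())  # [25.0, 25.0, 30.0, 25.0, 22.0]
```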

- **Standardizing Country Names:** Variants like 'us', 'USA', and 'uk' were replaced with their full-form equivalents using `.replace()`. Missing values were filled with 'Unknown'.

In [5]:
# Fix Country Names

df['Country'] = df['Country'].replace({
    'us': 'United States',
    'USA': 'United States',
    'uk': 'United Kingdom'
})
df['Country'] = df['Country'].fillna('Unknown')
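A dictionary passed to `.replace()` matches whole values exactly, so variants such as 'UK' or ' USA ' would slip through. One possible alternative, sketched here under the assumption that a lower-cased lookup table is acceptable, is to normalize case and spacing before mapping. `COUNTRY_ALIASES` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical lookup table; keys are trimmed, lower-cased variants
COUNTRY_ALIASES = {
    'us': 'United States',
    'usa': 'United States',
    'united states': 'United States',
    'uk': 'United Kingdom',
}

countries = pd.Series(['us', 'United States', ' USA ', None, 'UK'])

# Normalize first so one alias covers every case/spacing variant
key = countries.str.strip().str.lower()

# map() yields a missing value for keys not in the table; fall back to the
# original value, then label anything still missing as 'Unknown'
standardized = key.map(COUNTRY_ALIASES).fillna(countries).fillna('Unknown')

print(standardized.tolist())
# ['United States', 'United States', 'United States', 'Unknown', 'United Kingdom']
```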

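- **Removing Duplicates:** `drop_duplicates()` was applied to remove rows that are identical in every column. In this dataset the two 'Alice' rows differ in "Country" ('United States' vs. 'Unknown'), so neither is dropped and the row count stays at five.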


In [6]:
# Remove Duplicates

df = df.drop_duplicates()
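By default, `drop_duplicates()` compares all columns, so records that differ in any field are kept. If a subset of columns is taken to identify a record, the `subset` parameter can be used instead. The sketch below assumes, purely for illustration, that "Name" and "Age" together identify a person; the frame `cleaned` reproduces the data as it stands after the earlier steps.

```python
import pandas as pd

# Sample data as it stands after the earlier cleaning steps
cleaned = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Alice', None],
    'Age': [25.0, 25.0, 30.0, 25.0, 22.0],
    'Country': ['United States', 'United States', 'United States',
                'Unknown', 'United Kingdom'],
})

# Default: a row is dropped only if every column matches an earlier row
exact = cleaned.drop_duplicates()

# Assumption: Name and Age together identify a person
by_person = cleaned.drop_duplicates(subset=['Name', 'Age'], keep='first')

print(len(exact))      # 5
print(len(by_person))  # 4
```

With `keep='first'`, the earlier 'Alice' row (Country 'United States') is retained and the later one (Country 'Unknown') is dropped. Whether that is the right record to keep depends on the data.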

# 5. Final Output and Results
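After cleaning, names are consistently formatted, the missing age is filled with the median (25.0), and country names are standardized, with the missing country labeled 'Unknown'. Two issues remain visible in the output below: row 4 still has no name, and both 'Alice' rows are present because they are not exact duplicates. Once these are resolved, the dataset is ready for further analysis or visualization.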


In [8]:
# Final Output
print(df)

    Name   Age         Country
0   John  25.0   United States
1  Alice  25.0   United States
2    Bob  30.0   United States
3  Alice  25.0         Unknown
4   None  22.0  United Kingdom
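A quick check after cleaning can confirm what is still missing. The sketch below rebuilds the printed result as `final` (an illustrative name) and shows one possible follow-up, which is to drop rows without a name. Whether such rows should be dropped or kept is an assumption that depends on the analysis.

```python
import pandas as pd

# The cleaned data as printed above
final = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Alice', None],
    'Age': [25.0, 25.0, 30.0, 25.0, 22.0],
    'Country': ['United States', 'United States', 'United States',
                'Unknown', 'United Kingdom'],
})

# Count what is still missing in each column
remaining_missing = final.isna().sum()
print(remaining_missing)

# Assumption: a record without a name is unusable, so drop it
named_only = final.dropna(subset=['Name'])
print(len(named_only))  # 4
```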


# 6. Conclusion and Learnings
This project emphasized the value of data cleaning in the analytics pipeline. By using core Pandas functions, common data quality issues were addressed programmatically. Techniques like imputing missing values, normalizing categorical data, and removing duplicates are fundamental steps before any form of data analysis or machine learning. Practicing on small-scale tasks like this strengthens understanding and builds confidence for more complex datasets encountered in the industry.

# 7. References
- McKinney, W. (2018). *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython* (2nd ed.). O’Reilly Media.
- The Pandas Development Team. (2024). *Pandas Documentation*. https://pandas.pydata.org/docs/reference/index.html#api
