
# Section 6 - Python Example: Data cleaning techniques

Data cleaning is an essential step in preparing raw data for analysis and modeling. It involves rectifying or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Effective data cleaning can significantly improve the quality of insights derived from data analysis and the performance of predictive models. This section provides a practical guide on implementing various data cleaning techniques using Python, specifically with Pandas and NumPy, two powerful tools for data manipulation.

1. Setting Up the Environment:

Pandas and NumPy must be installed before the examples will run. If either is missing, install both with pip; inside a notebook, the %pip magic installs into the environment of the running kernel:

In [None]:
%pip install pandas numpy

2. Importing Required Libraries:

Begin by importing Pandas and NumPy. These libraries provide comprehensive functions and methods for data manipulation and cleaning:

In [None]:
import pandas as pd
import numpy as np

3. Creating a Sample Dataset:

For demonstration, let’s create a DataFrame that mimics common data issues, including missing values, duplicates, and outliers:

In [None]:
# Create a sample DataFrame
data = pd.DataFrame({
    'Name': ['John Doe', 'Anna Smith', 'Peter Brown', 'Anna Smith', 'John Doe'],
    'Age': [28, np.nan, 30, 22, 28],
    'Salary': [50000, 54000, 50000, None, 50000],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']
})

4. Identifying and Handling Missing Values:

Missing values can distort summary statistics and cause many models to fail or mispredict, so count them first and then decide how to fill or drop them:

In [None]:
# Check for missing values
print(data.isnull().sum())

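# Fill missing 'Age' values with the column median
# (assign the result back: in recent pandas versions, fillna(inplace=True)
# on a selected column raises a warning and may not modify the DataFrame)
data['Age'] = data['Age'].fillna(data['Age'].median())

# Fill missing 'Salary' values with the most frequent value (the mode)
data['Salary'] = data['Salary'].fillna(data['Salary'].mode()[0])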
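
Imputation is only one way to handle gaps. The sketch below, which rebuilds a hypothetical copy of the sample data under the name df so that it runs on its own, measures how much of each column is missing and compares dropping incomplete rows with filling them. Which option suits a real dataset depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the sample data, rebuilt so the sketch is self-contained
df = pd.DataFrame({
    'Name': ['John Doe', 'Anna Smith', 'Peter Brown', 'Anna Smith', 'John Doe'],
    'Age': [28, np.nan, 30, 22, 28],
    'Salary': [50000, 54000, 50000, None, 50000],
})

# Fraction of missing values per column
missing_share = df.isna().mean()

# Option A: drop every row that contains at least one missing value
dropped = df.dropna()

# Option B: fill each numeric column with its own median
filled = df.fillna({'Age': df['Age'].median(), 'Salary': df['Salary'].median()})

print(missing_share)
print(len(dropped), len(filled))
```

Dropping rows is simple but discards the complete fields in those rows; filling keeps every row at the cost of inventing plausible values.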

5. Removing Duplicates:

Duplicate rows give repeated records extra weight in counts and averages, which skews results. Here the two identical 'John Doe' rows collapse into one:

In [None]:
# Drop duplicates
data.drop_duplicates(inplace=True)
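
Before dropping anything, it can help to look at what counts as a duplicate. The sketch below uses a hypothetical, already-filled version of the sample data (named df) to show duplicated() for inspection and the subset parameter for deduplicating on a key column; treating 'Name' as a key is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical, already-filled version of the sample data
df = pd.DataFrame({
    'Name': ['John Doe', 'Anna Smith', 'Peter Brown', 'Anna Smith', 'John Doe'],
    'Age': [28.0, 28.0, 30.0, 22.0, 28.0],
    'Salary': [50000.0, 54000.0, 50000.0, 50000.0, 50000.0],
})

# keep=False flags every member of a duplicate group, not only the repeats
exact_dupes = df[df.duplicated(keep=False)]

# Full-row deduplication keeps the first occurrence by default
deduped = df.drop_duplicates().reset_index(drop=True)

# Deduplicating on 'Name' alone is stricter: it also drops the second 'Anna Smith'
by_name = df.drop_duplicates(subset=['Name'], keep='first')

print(exact_dupes)
print(len(deduped), len(by_name))
```

The two 'Anna Smith' rows differ in age and salary, so whether they are one person recorded twice or two different people is a judgment the data alone cannot settle.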

6. Correcting Data Types:

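Columns stored under the wrong type (for example, numbers read in as strings) break arithmetic and aggregation. In this dataset 'Salary' is already float64, because the missing entry forced a floating-point column, so the cast below acts as a safeguard: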

In [None]:
# Ensure 'Salary' is a float
data['Salary'] = data['Salary'].astype(float)
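
A plain astype(float) raises an error when a column holds text that cannot be parsed. As a minimal sketch, assuming a hypothetical messy salary column (raw) that is not part of the sample dataset, pd.to_numeric with errors='coerce' converts what it can and marks the rest as missing:

```python
import pandas as pd

# Hypothetical messy input: numbers stored as text, with one unparseable entry
raw = pd.Series(['50000', '54,000', 'n/a', '50000.0'])

# Remove thousands separators, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

print(cleaned)
print(cleaned.dtype)
```

The coerced NaN values can then be handled with the same missing-value techniques used earlier in this section.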

7. Handling Outliers:

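Outliers can disproportionately affect the results of data analysis and predictive modeling. The code below applies the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers: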

In [None]:
# Detecting outliers in 'Age'
q1, q3 = np.percentile(data['Age'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

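# Keep only rows inside the bounds; .copy() avoids a SettingWithCopyWarning
# when columns of the filtered DataFrame are modified later
data = data[(data['Age'] >= lower_bound) & (data['Age'] <= upper_bound)].copy()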
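
Removing rows is not the only response to outliers. The sketch below wraps the IQR rule in a hypothetical helper, iqr_bounds, and shows clipping (capping values at the bounds) as an alternative that keeps every row. The ages are the values the sample data holds after filling and deduplication.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences; iqr_bounds is a hypothetical helper."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = pd.Series([28.0, 28.0, 30.0, 22.0])
lower, upper = iqr_bounds(ages)

# Flag values outside the fences without removing them
is_outlier = (ages < lower) | (ages > upper)

# Alternative to dropping rows: cap values at the fences
clipped = ages.clip(lower=lower, upper=upper)

print(lower, upper)
print(is_outlier.tolist())
print(clipped.tolist())
```

With only four observations the quartiles are unstable, so an age of 22 being flagged here says more about the tiny sample than about the value itself.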

8. Normalizing Data:

Many statistical techniques and machine learning models are sensitive to the scale of their inputs. Min-max normalization rescales a column to the range [0, 1]:

In [None]:
# Normalize 'Salary'
data['Salary'] = (data['Salary'] - data['Salary'].min()) / (data['Salary'].max() - data['Salary'].min())
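
The min-max formula divides by the column's range, which is zero when every value is the same. The sketch below adds a guard for that case in a hypothetical helper, min_max_scale, and shows z-score standardization as a common alternative. Returning zeros for a constant column is an assumed convention, not a pandas default.

```python
import pandas as pd

def min_max_scale(series):
    """Scale to [0, 1]; a constant column maps to all zeros (assumed convention)."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

salary = pd.Series([50000.0, 54000.0, 50000.0])

# Min-max scaling: smallest value becomes 0, largest becomes 1
scaled = min_max_scale(salary)

# Z-score standardization: subtract the mean, divide by the population std
z_scores = (salary - salary.mean()) / salary.std(ddof=0)

# A constant column would otherwise produce a division by zero
constant = min_max_scale(pd.Series([5.0, 5.0]))

print(scaled.tolist())
print(z_scores.round(3).tolist())
print(constant.tolist())
```

Min-max scaling preserves the shape of the distribution but is sensitive to extreme values, which is one reason to treat outliers before normalizing.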

9. Conclusion:

This example demonstrates the fundamental techniques of data cleaning with Python's Pandas and NumPy libraries: filling missing values, removing duplicates, correcting data types, filtering outliers with the IQR rule, and min-max normalization. Clean data makes the results of analysis and modeling more reliable, and it lets data professionals spend more time extracting insights and less time troubleshooting data-related issues. These skills are essential for any data scientist or analyst working with real-world data.