<a href="https://colab.research.google.com/github/cloudpedagogy/data-science-programming/blob/main/data-analysis-pandas/02_Data_Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning and Preprocessing


## Overview



Data cleaning and preprocessing are essential steps in the data analysis process. Raw data often contains inconsistencies, errors, missing values, and other imperfections that can adversely affect the accuracy and reliability of the analysis. Data cleaning and preprocessing involve identifying and correcting these issues to ensure the data is suitable for analysis.

Python, with its rich ecosystem of libraries and tools, provides a powerful environment for data cleaning and preprocessing tasks. In this process, you can leverage various Python libraries such as pandas, NumPy, and scikit-learn to handle and manipulate data efficiently.

Data cleaning encompasses a range of techniques and tasks, including:

1. Handling Missing Values: Missing values can occur due to various reasons such as data collection errors or incomplete data. Python provides methods to detect and handle missing values, such as dropping rows or columns with missing values, imputing missing values with statistical measures, or using advanced techniques like interpolation or machine learning-based imputation.

2. Removing Duplicates: Duplicated data can lead to skewed analysis results. Python offers functions to identify and remove duplicate entries in a dataset, ensuring that each observation is unique.

3. Dealing with Outliers: Outliers are extreme values that differ significantly from other observations. These can distort statistical measures and models. Python allows you to detect and handle outliers using techniques such as Z-score, IQR (Interquartile Range), or domain knowledge-based approaches.

4. Standardizing and Scaling: Features measured in different units often need to be brought to a common scale before analysis. Standardization rescales each variable to a mean of zero and a standard deviation of one, while min-max scaling maps the data to a fixed range such as 0 to 1. Python libraries like scikit-learn provide functions for these tasks.

5. Handling Categorical Data: Categorical data, such as gender or product categories, needs to be encoded numerically before most analyses. pandas offers `get_dummies()` for one-hot encoding and categorical codes for label or ordinal encoding, and scikit-learn provides dedicated encoder classes.

6. Data Transformation: Data transformation involves converting variables to better adhere to assumptions of statistical models. This can include log transformations, power transformations, or normalization. Python libraries like NumPy and pandas provide functions to perform these transformations.

7. Feature Engineering: Feature engineering involves creating new features from existing ones to enhance the predictive power of machine learning models. Python provides tools to generate features based on mathematical operations, aggregations, or domain-specific knowledge.

By performing these data cleaning and preprocessing tasks, you can ensure the data is accurate, complete, and well-structured before proceeding with further analysis or modeling. This improves the quality of insights derived from the data and enhances the performance of machine learning algorithms.

In conclusion, data cleaning and preprocessing are crucial steps in the data analysis process, and Python offers a rich set of libraries and tools to perform these tasks efficiently. By mastering these techniques, you can handle real-world datasets effectively and extract valuable insights from your data.
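Items 5 to 7 in the list above (categorical encoding, data transformation, and feature engineering) are not demonstrated later in this notebook, so here is a minimal sketch. The DataFrame and its column names (`product`, `size`, `price`, `quantity`) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from any dataset used in this notebook
df = pd.DataFrame({
    "product": ["book", "pen", "book", "lamp"],
    "size": ["small", "large", "medium", "small"],
    "price": [12.0, 1.5, 15.0, 40.0],
    "quantity": [2, 10, 1, 1],
})

# One-hot encoding: one indicator column per product category
encoded = pd.get_dummies(df, columns=["product"])

# Ordinal encoding: map ordered categories to integer codes (small < medium < large)
size_order = ["small", "medium", "large"]
encoded["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# Log transformation: log1p computes log(1 + x), which compresses large values
encoded["log_price"] = np.log1p(df["price"])

# Feature engineering: derive a new feature from two existing ones
encoded["revenue"] = df["price"] * df["quantity"]

print(encoded)
```

One-hot encoding suits categories without a natural order, while the ordinal codes preserve the ordering given in `size_order`.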

## Handling missing values and duplicates



Handling missing values and duplicates is an important part of working with data in pandas. The examples below show how to do both, using the Pima Indian Diabetes dataset:

1. Handling Missing Values:

Missing values in the dataset can be identified and handled using Pandas functions. Some common methods for handling missing values are:

- **isnull()**: This function identifies missing values in the dataset, returning a Boolean mask where True represents a missing value.
- **fillna()**: This function allows you to fill missing values with a specified value or a statistical measure, such as mean, median, or mode.
- **dropna()**: This function removes rows or columns containing missing values from the dataset.

Here's an example that demonstrates handling missing values in the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Identify missing values
missing_values = dataset.isnull().sum()
print("Missing Values:")
print(missing_values)

# Fill missing values with the mean
dataset_filled = dataset.fillna(dataset.mean())

# Remove rows with missing values
dataset_dropped = dataset.dropna()

# Print the modified datasets
print("\nDataset with Missing Values Filled:")
print(dataset_filled.head())

print("\nDataset with Rows Dropped:")
print(dataset_dropped.head())


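In this example, we load the Pima Indian Diabetes dataset using Pandas. We then use the `isnull()` function to identify missing values in the dataset and count them per column using the `sum()` function. Next, we fill the missing values with each column's mean using the `fillna()` function and store the modified dataset in `dataset_filled`. Finally, we drop the rows with missing values using the `dropna()` function and store the modified dataset in `dataset_dropped`. We print both modified datasets to see the results.

Note that this particular file contains no NaN entries: missing measurements are recorded as zeros instead. As a result, `isnull().sum()` reports 0 for every column, and both `dataset_filled` and `dataset_dropped` are identical to the original data. The "Data inconsistencies" section below treats these zeros as missing values.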
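A minimal sketch of converting placeholder zeros to NaN before imputing. It uses a few hypothetical rows rather than the real file, and the choice of median imputation is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same layout as three of the dataset's columns
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 0.0, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Treat zeros as missing only where zero is not a plausible measurement;
# a zero in 'Outcome' is a valid class label and must be kept
zero_as_missing = ["Glucose", "BMI"]
sample[zero_as_missing] = sample[zero_as_missing].replace(0, np.nan)

# Now isnull() can see the missing entries
print(sample.isnull().sum())

# Impute each column with its own median (less sensitive to extremes than the mean)
sample_filled = sample.fillna(sample.median())
print(sample_filled)
```

Restricting the replacement to selected columns matters: replacing zeros everywhere would also erase legitimate zeros such as `Outcome == 0`.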

2. Handling Duplicates:

Duplicates in a dataset can be identified and handled using Pandas functions. Some common methods for handling duplicates are:

- **duplicated()**: This function identifies duplicate rows in the dataset, returning a Boolean mask where True represents a duplicate row.
- **drop_duplicates()**: This function removes duplicate rows from the dataset.

Here's an example that demonstrates handling duplicates in the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Identify duplicate rows
duplicate_rows = dataset.duplicated()
print("Duplicate Rows:")
print(duplicate_rows)

# Remove duplicate rows
dataset_unique = dataset.drop_duplicates()

# Print the modified dataset
print("\nDataset with Duplicate Rows Removed:")
print(dataset_unique.head())


In this example, we load the Pima Indian Diabetes dataset using Pandas. We use the `duplicated()` function to identify duplicate rows in the dataset and store the Boolean mask in `duplicate_rows`. Next, we use the `drop_duplicates()` function to remove the duplicate rows from the dataset and store the modified dataset in `dataset_unique`. Finally, we print the modified dataset to see the results.
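Printing the full Boolean mask is hard to read for a large dataset, and duplicates are often defined by a key column rather than by the whole row. The sketch below illustrates both points on hypothetical records; the `patient_id` column is an assumption and does not exist in the Pima file.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 are exact duplicates
records = pd.DataFrame({
    "patient_id": [101, 102, 101, 103, 103],
    "Glucose": [148, 85, 148, 89, 95],
})

# Count fully duplicated rows instead of printing the whole mask
print("Exact duplicates:", records.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence (the default)
exact_unique = records.drop_duplicates()

# Treat rows as duplicates when 'patient_id' repeats, keeping the last entry
latest_per_patient = records.drop_duplicates(subset=["patient_id"], keep="last")
print(latest_per_patient)
```

The `subset` and `keep` parameters control which columns define a duplicate and which occurrence survives.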

## Removing irrelevant columns



To remove irrelevant columns in pandas, you can use the `drop()` function. The `drop()` function allows you to remove one or more columns from a DataFrame.

Here's an example using the Pima Indian Diabetes dataset to demonstrate how to remove irrelevant columns:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Remove irrelevant columns (e.g., 'SkinThickness' and 'DiabetesPedigreeFunction')
columns_to_drop = ['SkinThickness', 'DiabetesPedigreeFunction']
dataset = dataset.drop(columns=columns_to_drop)

# Print the modified dataset
print(dataset.head())


In this example, we load the Pima Indian Diabetes dataset with pandas, passing the `column_names` list because the CSV file has no header row. Next, we create a list called `columns_to_drop` that contains the names of the columns we want to remove. 'SkinThickness' and 'DiabetesPedigreeFunction' are treated as irrelevant here for illustration only. We use the `drop()` method with the `columns` parameter to remove them and assign the result back to the `dataset` variable. Finally, we print the first rows of the modified dataset using `head()`.

The output will be the modified dataset with the irrelevant columns removed.
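Which columns count as irrelevant depends on the analysis. As a rough sketch, the snippet below drops columns that never vary or that are mostly missing. The DataFrame, the column names `constant` and `mostly_missing`, and the 0.5 cutoff are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'constant' never varies and 'mostly_missing' is 75% NaN
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "constant": [1, 1, 1, 1],
    "mostly_missing": [np.nan, np.nan, 2.0, np.nan],
})

# Columns with a single distinct value carry no information
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# Columns where more than half the values are missing (0.5 is an assumed cutoff)
missing_fraction = df.isnull().mean()
sparse_cols = missing_fraction[missing_fraction > 0.5].index.tolist()

reduced = df.drop(columns=constant_cols + sparse_cols)
print(reduced.columns.tolist())
```

Rules like these are only a starting point; domain knowledge should have the final say on whether a column is kept.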


# Handling data inconsistencies and outliers


## Data inconsistencies




When working with data in Pandas, it is common to encounter inconsistencies or missing values that need to be handled. Pandas provides several methods to deal with data inconsistencies, such as handling missing values, replacing values, or dropping rows/columns with missing or incorrect data. Here's an example using the Pima Indian Diabetes dataset to demonstrate how to handle data inconsistencies:


In [None]:
import numpy as np
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Columns in which a zero is not a plausible measurement
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace zeros in those columns with NaN
dataset[zero_as_missing] = dataset[zero_as_missing].replace(0, np.nan)

# Drop rows with missing values
dataset.dropna(inplace=True)

# Reset the index
dataset.reset_index(drop=True, inplace=True)

# Print the cleaned dataset
print(dataset)


In this example, we load the Pima Indian Diabetes dataset using the pandas library. We then replace zero values with NaN (Not a Number) using the `replace()` method, but only in 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI', where a zero is not a realistic measurement and stands in for a missing one. Zeros in 'Pregnancies' and 'Outcome' are valid values and are left untouched; replacing them as well would discard every record with an outcome of 0.

Next, we use the `dropna()` method to remove rows with missing values from the dataset, so that only complete records remain. Because many records lack an 'Insulin' or 'SkinThickness' measurement, this step can discard a large share of the data, so imputation may be preferable when sample size matters.

Finally, we reset the index of the DataFrame using the `reset_index()` method with the `drop=True` parameter, which removes the old index and assigns a new sequential index to the rows.

The resulting cleaned dataset will only contain rows with non-zero and non-null values for the specified features, allowing for more accurate analysis and modeling.
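Inconsistencies are not limited to placeholder zeros. Text labels may differ in case or whitespace, and numeric columns may contain unparseable entries. The sketch below shows one way to handle both, using a hypothetical DataFrame whose `gender` and `age` columns are not part of the Pima dataset.

```python
import pandas as pd

# Hypothetical raw data with inconsistent labels and a non-numeric entry
raw = pd.DataFrame({
    "gender": [" Female", "female", "MALE", "male "],
    "age": ["34", "forty", "29", "51"],
})

# Inspect the distinct labels to spot inconsistencies
print(raw["gender"].unique())

# Standardize text: trim whitespace and use one letter case
raw["gender"] = raw["gender"].str.strip().str.lower()

# Convert to numbers; entries that cannot be parsed become NaN
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

print(raw)
print(raw.dtypes)
```

After the conversion, the unparseable entry shows up as NaN and can be handled with the missing-value techniques shown earlier.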


## Outliers

Handling outliers in pandas involves identifying and dealing with data points that significantly deviate from the rest of the dataset. Outliers can adversely affect data analysis and modeling, so it's important to handle them appropriately. Here's an example of handling outliers in the Pima Indian Diabetes dataset using pandas:


In [None]:
import pandas as pd
import numpy as np

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Flag outliers in the 'Glucose' column (200 is an illustrative threshold)
is_outlier = dataset['Glucose'] > 200
print("Number of outliers:", is_outlier.sum())

# Replace outliers with NaN
dataset['Glucose'] = dataset['Glucose'].mask(is_outlier, np.nan)

# Calculate the mean glucose level, ignoring the NaN values
mean_glucose = dataset['Glucose'].mean()

# Fill NaN values with the mean glucose level
dataset['Glucose'] = dataset['Glucose'].fillna(mean_glucose)

# Print the modified dataset
print(dataset)


In this example, we load the Pima Indian Diabetes dataset using Pandas. We flag outliers in the 'Glucose' column by building a Boolean mask for values greater than 200 and print how many were found. The threshold is an illustrative choice; if no value exceeds it, the remaining steps leave the data unchanged. We then replace the flagged values with NaN using the `mask()` method and calculate the mean of the remaining values with `mean()`, which skips NaN. Finally, we fill the NaN values with that mean by assigning the result of `fillna()` back to the column. Assigning back is more reliable than calling `fillna(inplace=True)` on a selected column, which recent pandas versions warn about because it may not modify the original DataFrame. The resulting dataset has any outliers in the 'Glucose' column replaced with the mean value.
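A fixed threshold requires knowing in advance what counts as extreme. The IQR and Z-score rules mentioned in the overview derive the cutoff from the data instead. The sketch below applies both to hypothetical glucose readings; the Z-score cutoff of 2 is an assumption suited to this tiny sample (3 is a more common choice for larger datasets).

```python
import pandas as pd

# Hypothetical glucose readings with one extreme value
glucose = pd.Series([85, 89, 92, 99, 103, 110, 116, 400], name="Glucose")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1 = glucose.quantile(0.25)
q3 = glucose.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = glucose[(glucose < lower) | (glucose > upper)]

# Z-score rule: flag values far from the mean in standard-deviation units
z_scores = (glucose - glucose.mean()) / glucose.std()
z_outliers = glucose[z_scores.abs() > 2]

print(iqr_outliers)
print(z_outliers)

# One way to handle outliers: cap (clip) them to the IQR bounds
glucose_capped = glucose.clip(lower=lower, upper=upper)
print(glucose_capped)
```

The IQR rule is based on quartiles, so a single extreme value affects it less than it affects the mean and standard deviation used by the Z-score rule.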

# Data normalization and scaling


## Overview

Data normalization and scaling are essential techniques in data preprocessing to ensure accurate and reliable analysis. They help transform raw data into a suitable format that can be effectively used by machine learning algorithms. In this introduction, we will explore the concepts of data normalization and scaling and discuss how to implement them in Python programming.

Data normalization, as used in this notebook, refers to rescaling numerical data to a fixed range, typically between 0 and 1 (min-max scaling). It is particularly useful when working with features that have different units or scales, because it prevents variables with larger magnitudes from dominating the analysis.

Data scaling is the broader term for any transformation that changes the range or spread of a feature. Min-max normalization is one scaling technique. Another is Z-score scaling (standardization), which centers each feature on a mean of 0 with a standard deviation of 1 without confining the values to a fixed range. Both can improve the performance of machine learning algorithms that are sensitive to the scale of the input features.

## Data normalization



Data normalization is a form of feature scaling that transforms data into a fixed range. It helps in bringing different features to a common scale, preventing one feature from dominating others due to its larger magnitude. Normalizing data can improve the performance of machine learning algorithms and make comparisons between different features more meaningful.

In Pandas, you can perform data normalization using various methods. One common approach is min-max scaling, where the data is transformed to a specified range (e.g., between 0 and 1). The formula for min-max scaling is:

```
normalized_value = (value - min_value) / (max_value - min_value)
```

Here's an example of data normalization using min-max scaling on the Pima Indian Diabetes dataset:



In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Select the features to normalize
features = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Perform min-max scaling on the selected features
for feature in features:
    min_value = dataset[feature].min()
    max_value = dataset[feature].max()
    dataset[feature] = (dataset[feature] - min_value) / (max_value - min_value)

# Print the normalized dataset
print(dataset.head())


In this example, we load the Pima Indian Diabetes dataset using Pandas. We specify the features we want to normalize, in this case, "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", and "Age". We then iterate over each feature, calculate the minimum and maximum values using the `min()` and `max()` functions, and perform the min-max scaling by subtracting the minimum value and dividing by the range (max-min). Finally, we print the normalized dataset using `dataset.head()` to display the first few rows.

After the normalization process, the selected features will be scaled between 0 and 1, allowing for better comparison and analysis.
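The loop above can also be written as a single vectorized expression. The sketch below does so on hypothetical values and adds a guard for constant columns, whose range of 0 would otherwise cause a division by zero. The guard is an addition for illustration, not part of the original example.

```python
import pandas as pd

# Hypothetical values in the layout of two of the dataset's columns
df = pd.DataFrame({
    "Glucose": [85, 148, 183, 89],
    "Age": [31, 50, 32, 21],
})
features = ["Glucose", "Age"]

# Column-wise minimum and range; pandas aligns them with the columns by name
min_values = df[features].min()
value_range = df[features].max() - min_values

# A constant column has a range of 0, which would cause division by zero
if (value_range == 0).any():
    raise ValueError("Cannot min-max scale a column with a single distinct value")

normalized = (df[features] - min_values) / value_range
print(normalized)
```

scikit-learn's `MinMaxScaler` class offers the same transformation and is convenient when the scaling must be reused on new data.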


## Data scaling




Data scaling is a technique used to transform data by adjusting its range, typically to make it more suitable for machine learning algorithms. Scaling data can help in improving the performance and convergence of certain algorithms, especially those that are sensitive to the scale of the input features.

You can scale the columns of a pandas DataFrame using methods such as standardization (z-score scaling) or normalization (min-max scaling). Here's an example of standardization on the Pima Indian Diabetes dataset, using the `StandardScaler` class from scikit-learn:


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Select the features to scale
features = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Perform standardization on the selected features
scaler = StandardScaler()
dataset[features] = scaler.fit_transform(dataset[features])

# Print the scaled dataset
print(dataset.head())


In this example, we load the Pima Indian Diabetes dataset using Pandas. We specify the features we want to scale, in this case, "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", and "Age". We then create an instance of the `StandardScaler` class from scikit-learn and apply the `fit_transform()` method on the selected features using the scaler. This scales the data using the z-score formula, which subtracts the mean and divides by the standard deviation of each feature. Finally, we print the scaled dataset using `dataset.head()` to display the first few rows.

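After the scaling process, the selected features will have a mean of 0 and a standard deviation of 1. Standardization changes only the location and spread of a feature, not the shape of its distribution, so a skewed feature remains skewed. The main benefit is for algorithms that are sensitive to feature scale, such as those that rely on Euclidean distance calculations or gradient-based optimization.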
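The z-score formula can also be applied directly in pandas, which makes the computation explicit. This is a minimal sketch on hypothetical values. It uses `ddof=0` (the population standard deviation) to match what `StandardScaler` computes, whereas pandas' `std()` defaults to `ddof=1`.

```python
import pandas as pd

# Hypothetical values in the layout of two of the dataset's columns
df = pd.DataFrame({
    "Glucose": [85, 148, 183, 89],
    "Age": [31, 50, 32, 21],
})

# Z-score: subtract the column mean and divide by the column standard deviation.
# ddof=0 selects the population standard deviation, matching StandardScaler.
standardized = (df - df.mean()) / df.std(ddof=0)

print(standardized.mean().round(10))       # approximately 0 for every column
print(standardized.std(ddof=0).round(10))  # 1 for every column
print(standardized.std().round(4))         # default ddof=1, slightly above 1
```

The last line explains a common surprise: checking scikit-learn's output with pandas' default `std()` gives a value slightly different from 1, especially for small samples.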


# Reflection points

1. **Handling Missing Values and Duplicates**:
   - Why is it important to handle missing values and duplicates in a dataset?
   - How can missing values be identified and handled using Python libraries?
   - What are the different strategies for dealing with missing values?
   - How can duplicates be detected and removed from a dataset?

   Example answer: Handling missing values and duplicates is crucial to ensure data quality and prevent biased or inaccurate analysis. Missing values can be identified using functions like `isnull()` or `isna()` in libraries such as Pandas. Strategies for handling missing values include imputation techniques (e.g., mean imputation, forward-fill, or backward-fill) or removing rows/columns with missing values. Duplicates can be detected using functions like `duplicated()` and removed using `drop_duplicates()`.

2. **Removing Irrelevant Columns**:
   - Why is it important to remove irrelevant columns from a dataset?
   - How can you identify and remove irrelevant columns using Python?
   - What criteria can be used to determine the relevance of columns?
   - How can removing irrelevant columns improve data analysis and model performance?

   Example answer: Removing irrelevant columns helps reduce noise, improve computational efficiency, and focus on relevant features. Irrelevant columns can be identified by considering domain knowledge, correlation analysis, or feature importance techniques. Columns with a high proportion of missing values or very low variance can also be considered irrelevant. By removing irrelevant columns, we can simplify the dataset, enhance interpretability, and potentially improve model accuracy and performance.

3. **Handling Data Inconsistencies and Outliers**:
   - Why should data inconsistencies and outliers be addressed in a dataset?
   - How can you detect and handle data inconsistencies using Python?
   - What techniques can be used to identify and deal with outliers?
   - What impact can data inconsistencies and outliers have on analysis and modeling?

   Example answer: Data inconsistencies and outliers can introduce biases and distort analysis results. Inconsistencies can be detected by checking data types, unique values, or using regular expressions. Outliers can be identified through statistical methods like z-score, interquartile range (IQR), or visualization techniques like box plots. Handling inconsistencies involves cleaning and standardizing data, while outliers can be handled by removing, transforming, or imputing them. By addressing inconsistencies and outliers, we can ensure data quality, improve accuracy, and prevent skewed analysis or model performance.

4. **Data Normalization and Scaling**:
   - Why is data normalization or scaling important in data preprocessing?
   - What are the common normalization and scaling techniques?
   - How can you perform data normalization and scaling in Python?
   - In what scenarios would you apply normalization or scaling techniques?

   Example answer: Data normalization and scaling put features on comparable scales so that no feature dominates because of its units or magnitude. Common techniques include min-max scaling, z-score standardization, and robust scaling. In Python, scikit-learn provides classes such as `MinMaxScaler` and `StandardScaler` for these tasks. Min-max normalization suits cases that need a bounded range, such as 0 to 1, and data with few extreme outliers, since a single outlier compresses the remaining values. Standardization is the usual choice for distance-based and gradient-based algorithms, and robust scaling, which uses the median and interquartile range, is preferable when outliers are present.
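Robust scaling, mentioned in the last example answer, is not demonstrated elsewhere in this notebook. A minimal pandas sketch on hypothetical insulin values is shown below: it centers on the median and divides by the interquartile range.

```python
import pandas as pd

# Hypothetical insulin values with one extreme reading
insulin = pd.Series([0, 94, 168, 88, 846], name="Insulin")

# Robust scaling: center on the median and divide by the interquartile range
median = insulin.median()
iqr = insulin.quantile(0.75) - insulin.quantile(0.25)
robust_scaled = (insulin - median) / iqr

print(robust_scaled)
```

Because the median and quartiles barely move when an extreme value is added, the bulk of the data keeps a similar scale even though the extreme reading still stands out.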


# A quiz on Data Cleaning and Preprocessing


1. What is the term used to describe the process of filling in missing values in a dataset?
   <br>a) Data normalization
   <br>b) Data imputation
   <br>c) Data scaling
   <br>d) Data standardization

2. Which Python library provides functions to handle missing values in a DataFrame?
   <br>a) NumPy
   <br>b) Pandas
   <br>c) Matplotlib
   <br>d) SciPy

3. How can you remove duplicate rows from a DataFrame in Pandas?
   <br>a) df.drop_duplicates()
   <br>b) df.remove_duplicates()
   <br>c) df.drop_duplicate_rows()
   <br>d) df.remove_duplicate_rows()

4. Which step involves identifying and fixing inconsistencies or errors in the data?
   <br>a) Handling missing values
   <br>b) Removing irrelevant columns
   <br>c) Data normalization
   <br>d) Data cleaning

5. What is the term used to describe extreme values that significantly deviate from the other data points?
   <br>a) Outliers
   <br>b) Inliers
   <br>c) Anomalies
   <br>d) Normals

6. How can you normalize data to have a range between 0 and 1 in Python?
   <br>a) Min-max scaling
   <br>b) Standard scaling
   <br>c) Z-score normalization
   <br>d) Log transformation

7. Which Python library provides the MinMaxScaler class for data scaling?
   <br>a) NumPy
   <br>b) Pandas
   <br>c) Scikit-learn
   <br>d) Seaborn

---

Answers:

1. b) Data imputation
2. b) Pandas
3. a) df.drop_duplicates()
4. d) Data cleaning
5. a) Outliers
6. a) Min-max scaling
7. c) Scikit-learn

---

