# Data Cleaning and Preprocessing

In this lesson, you will learn essential techniques for cleaning and preprocessing data in AWS SageMaker, using pandas and scikit-learn to identify common data issues, handle missing values and outliers, and scale features for machine learning model training.

## Learning Objectives

By the end of this lesson, you will be able to:
- Identify common data issues that can affect model performance.
- Apply cleaning techniques to handle missing values and outliers.
- Preprocess data for training using feature scaling methods.
- Understand the importance of data quality in machine learning.
- Explore different preprocessing techniques available in SageMaker.

## Why This Matters

Data cleaning and preprocessing are crucial steps in the machine learning pipeline. High-quality data leads to better model performance, while poor-quality data can result in inaccurate predictions and unreliable models. Ensuring that your dataset is clean and well-prepared is essential for building effective machine learning solutions.

## Concept: Data Cleaning

Data cleaning involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and incorrect data types. A model learns whatever patterns its training data contains, including the errors, so an accurate and consistent dataset is a prerequisite for a reliable model.

In [None]:
# Example code to handle missing values
import pandas as pd

df = pd.read_csv('dataset.csv')  # Load dataset
# Identify and fill missing values
# Using forward fill method
print('Before filling missing values:')
print(df.isnull().sum())
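# fillna(method='ffill') is deprecated in recent pandas releases; ffill() is the equivalent call
# Note: a missing value in the first row has nothing before it to copy, so it stays missing
df = df.ffill()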
print('After filling missing values:')
print(df.isnull().sum())
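
Forward fill copies the previous row's value, which suits ordered data such as time series. For unordered tabular data, filling with a column statistic is a common alternative. The following is a minimal sketch of that idea, not a recommended default: the `toy` DataFrame, its column names, and the `impute_simple` helper are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Lima", "Oslo", None, "Oslo"],
})


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            # The median is less sensitive to extreme values than the mean
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            # mode() can return several values; take the first one
            modes = filled[column].mode(dropna=True)
            if not modes.empty:
                filled[column] = filled[column].fillna(modes.iloc[0])
    return filled


imputed = impute_simple(toy)
print(imputed)
```

Whether the median, the mode, or dropping the row is appropriate depends on why the values are missing, so treat the choice as part of the analysis.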

### Micro-Exercise: Identify Common Data Issues

List common data issues that can affect model performance.

In [None]:
# Common data issues
# 1. Missing values
# 2. Outliers
# 3. Incorrect data types
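
Each of the issues listed above can usually be surfaced with a few pandas calls. The sketch below assumes a small hypothetical DataFrame named `raw`, in which a numeric column was read as text because of one unparseable entry.

```python
import pandas as pd

# Hypothetical toy data: 'price' was read as text because of one bad entry.
raw = pd.DataFrame({
    "price": ["10.5", "12.0", "n/a", "11.2"],
    "quantity": [1, 2, 2, 500],
})

# Incorrect data types: 'price' shows up with a non-numeric dtype
print(raw.dtypes)

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Missing values: count them per column (the coerced entry now appears here)
missing_per_column = raw.isnull().sum()
print(missing_per_column)

# Outliers: summary statistics make extreme values such as quantity=500 visible
print(raw.describe())
```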

## Concept: Feature Scaling

Feature scaling normalizes or standardizes the range of the independent variables (features) in a dataset. It matters because many algorithms, such as distance-based methods and models trained with gradient descent, are sensitive to feature magnitude: without scaling, features with large numeric ranges can dominate those with small ranges.

In [None]:
# Example code to apply feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
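# Assume df is the DataFrame with features to scale
# StandardScaler only accepts numeric data, so select the numeric columns first
scaled_features = scaler.fit_transform(df.select_dtypes(include='number'))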
print('Scaled features:')
print(scaled_features)
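
To make the transformation concrete, the sketch below checks `StandardScaler` against the z-score formula `(x - mean) / std` on a hypothetical four-value feature. The array `values` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with four observations
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(values)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for std()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After standardization, the feature has a mean of 0 and a standard deviation of 1.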

### Micro-Exercise: Demonstrate Feature Scaling

Demonstrate how to apply feature scaling to a dataset.

In [None]:
# Example code to apply min-max scaling, which maps each feature to the [0, 1] range by default
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
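# Assume df is the DataFrame with features to scale
# MinMaxScaler only accepts numeric data, so select the numeric columns first
scaled_features = scaler.fit_transform(df.select_dtypes(include='number'))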
print('Scaled features using MinMaxScaler:')
print(scaled_features)
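
Min-max scaling computes `(x - min) / (max - min)`, so a single extreme value sets the range for the whole feature. The sketch below uses a hypothetical feature with one large value to show that effect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: four small values and one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler().fit_transform(values)

# Same result computed directly from the min-max formula
value_range = values.max(axis=0) - values.min(axis=0)
manual = (values - values.min(axis=0)) / value_range

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The four small values end up compressed close to 0 because the extreme value defines the upper end of the range. This is one reason to handle outliers before applying min-max scaling.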

## Examples

### Example 1: Handling Missing Values
This example demonstrates how to identify and fill missing values in a dataset using the forward fill method.

In [None]:
# Example code to handle missing values
import pandas as pd

df = pd.read_csv('dataset.csv')  # Load dataset
# Identify and fill missing values
print('Before filling missing values:')
print(df.isnull().sum())
df = df.ffill()  # replaces the deprecated fillna(method='ffill')
print('After filling missing values:')
print(df.isnull().sum())

### Example 2: Outlier Detection
This example shows how to identify outliers using the IQR method and remove them from the dataset.

In [None]:
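# Example code to remove outliers with the IQR rule
# Quantiles are only defined for numeric columns, so select those first
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], then drop every row that has one
is_outlier = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
df = df[~is_outlier.any(axis=1)]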
print('Data after removing outliers:')
print(df)
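
The IQR rule is easier to follow on a single column with known values. The sketch below assumes a hypothetical `toy` DataFrame with one obviously extreme measurement.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
toy = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95]})

q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1

# Values outside these bounds are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

inside = toy["value"].between(lower, upper)
print(f"bounds: [{lower}, {upper}]")
print(toy[~inside])  # rows flagged as outliers
print(toy[inside])   # rows that are kept
```

Here Q1 is 11.0 and Q3 is 12.5, so the bounds are 8.75 and 14.75, and only the value 95 is flagged.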

## Micro-Exercises

### Exercise 1: List Common Data Issues
List common data issues that can affect model performance.

In [None]:
# Common data issues
# 1. Missing values
# 2. Outliers
# 3. Incorrect data types

### Exercise 2: Handle Outliers
Demonstrate how to handle outliers in a dataset.

In [None]:
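# Example code to handle outliers with the IQR rule
# Quantiles are only defined for numeric columns, so select those first
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
# Drop every row that has a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
is_outlier = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
df = df[~is_outlier.any(axis=1)]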

## Main Exercise: Cleaning and Preprocessing a Dataset
In this exercise, you will load a dataset, identify and handle missing values, detect and remove outliers, and apply feature scaling to prepare the dataset for training a machine learning model.

In [None]:
# Load dataset
import pandas as pd

df = pd.read_csv('dataset.csv')

# Handle missing values
print('Before filling missing values:')
print(df.isnull().sum())
df = df.ffill()  # replaces the deprecated fillna(method='ffill')
print('After filling missing values:')
print(df.isnull().sum())

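# Remove outliers with the IQR rule (numeric columns only)
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
# Drop every row that has a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
is_outlier = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
df = df[~is_outlier.any(axis=1)]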

# Apply feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
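# StandardScaler only accepts numeric data, so select the numeric columns first
scaled_features = scaler.fit_transform(df.select_dtypes(include='number'))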
print('Scaled features:')
print(scaled_features)
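
The same three steps can also be wrapped in one reusable function. The sketch below is one possible arrangement under stated assumptions, not a reference solution: the `clean_and_scale` helper and the `toy` data are hypothetical, and the function works on numeric columns only. Because `fit_transform` returns a NumPy array, the function puts the column names and row index back.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_and_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Forward-fill gaps, drop IQR outliers, and standardize numeric columns."""
    # dropna() removes rows that forward fill could not complete (leading gaps)
    numeric = frame.select_dtypes(include="number").ffill().dropna()

    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outlier = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))
    kept = numeric[~outlier.any(axis=1)]

    scaled = StandardScaler().fit_transform(kept)
    # fit_transform returns a NumPy array, so restore the labels
    return pd.DataFrame(scaled, columns=kept.columns, index=kept.index)


# Hypothetical toy data with two gaps and one extreme height
toy = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0, 165.0, 155.0, 400.0],
    "weight": [50.0, 60.0, 65.0, np.nan, 62.0, 55.0, 58.0],
})

result = clean_and_scale(toy)
print(result)
```

In this toy data, the row with a height of 400 falls outside the IQR bounds and is dropped, leaving six standardized rows.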

## Common Mistakes
- Ignoring missing values, which can skew model results.
- Not scaling features, which can lead to poor performance in scale-sensitive models.

## Recap & Next Steps
In this lesson, we covered the importance of data cleaning and preprocessing in machine learning. You learned how to identify and handle missing values, detect outliers, and apply feature scaling techniques. In the next lesson, we will explore model training and evaluation in AWS SageMaker.