<a href="https://colab.research.google.com/github/cloudpedagogy/data-science-programming/blob/main/machine-learning-scikit-learn/03_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing


## Overview


Data preprocessing is a critical step in the data analysis and machine learning workflow: it turns raw, messy data into a clean, structured form suitable for analysis and modeling. In Python, this typically means using libraries such as pandas, NumPy, and scikit-learn to handle missing values, treat outliers, scale features, and encode categorical variables. Careful preparation improves the quality of the data and its relevance to the problem at hand, which in turn supports more accurate and reliable results.

One of the first tasks in data preprocessing is dealing with missing values, which are gaps or null entries in the dataset. They can arise for various reasons, such as data collection errors or incomplete records. Libraries like Pandas and NumPy make it easy to identify missing values and choose how to handle them. Depending on the data and the analysis goals, the affected rows or columns can be removed, or the gaps can be imputed (filled in with estimated values), so that they do not undermine the accuracy of subsequent analyses.

Another crucial step in data preprocessing is handling outliers. Outliers are extreme values that deviate significantly from the majority of the data points and can distort statistical analyses or machine learning models. Common detection techniques, such as the Interquartile Range (IQR) method or the Z-score method, are straightforward to implement in Python. Once identified, outliers can be removed, capped, or kept, depending on whether they reflect errors or genuine observations; treating them appropriately prevents misleading insights and improves the robustness of the analysis.
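
As a rough illustration of the IQR method mentioned above, the sketch below flags values that fall more than 1.5 × IQR outside the quartiles. The numbers are made up for demonstration, and the 1.5 multiplier is a common convention rather than a fixed rule.

```python
import pandas as pd

# Hypothetical toy measurements; 95 is deliberately extreme
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="measurement")

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values outside the fences
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept is a judgment call that depends on the domain.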

Additionally, data preprocessing involves handling categorical variables, which are variables that represent categories rather than numerical values. Many machine learning algorithms require numerical inputs, and as such, categorical variables need to be encoded into a numeric format. Python libraries like Scikit-learn offer tools like one-hot encoding and label encoding to convert categorical variables into a suitable numerical representation, making them usable in machine learning models.

Furthermore, feature scaling helps ensure that features measured on different scales contribute comparably to the analysis. Without it, features with larger numeric ranges can dominate distance calculations and gradient updates, leading to biased models. Techniques like min-max scaling and standardization bring all features to a similar scale, which often improves the performance of machine learning algorithms and helps optimization converge.

In conclusion, data preprocessing is a fundamental and indispensable step in the data analysis and machine learning process. Python provides a wealth of libraries and tools that enable data scientists and analysts to clean, preprocess, and transform data effectively. By addressing missing values, handling outliers, encoding categorical variables, and scaling features, analysts can prepare high-quality data that forms the foundation for accurate and meaningful insights, leading to better-informed decisions and successful machine learning models.

## Handling missing values



Missing values in a dataset can be handled in several ways using pandas and scikit-learn. Here are some common approaches:

1. Dropping Rows or Columns: This approach involves removing rows or columns that contain missing values. However, it should be used with caution as it may lead to loss of important data.

2. Imputation: Imputation is the process of filling in missing values with estimated or predicted values. Some commonly used imputation techniques include:
   - Mean/Median Imputation: Replace missing values with the mean or median of the non-missing values in the same column.
   - Mode Imputation: Replace missing values with the mode (most frequent value) of the non-missing values in the same column.
   - Regression Imputation: Predict the missing values of a feature from the other features, using a regression model fitted on the rows where that feature is observed.

Here's an example of using imputation techniques with the Pima Indian Diabetes dataset:


In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Replace 0 values with NaN in columns where 0 doesn't make sense
columns_to_check = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# np.nan marks each of these entries as missing
dataset[columns_to_check] = dataset[columns_to_check].replace(0, np.nan)

# Impute missing values using mean imputation
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(dataset)

# Convert the imputed data back to a DataFrame
imputed_dataset = pd.DataFrame(imputed_data, columns=column_names)

# Print the imputed dataset
print(imputed_dataset.head())


In this example, we load the Pima Indian Diabetes dataset using the Pandas library. A value of 0 is not physiologically plausible for measurements such as 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI', so we treat those zeros as placeholders for missing data and replace them with NaN using the `replace()` method.

Next, we use the `SimpleImputer` class from scikit-learn to perform mean imputation. The `SimpleImputer` replaces missing values with the mean of the non-missing values in each column. We fit the imputer on the dataset using the `fit_transform()` method to impute the missing values.

Finally, we convert the imputed data back to a DataFrame and print the imputed dataset using the `head()` method to see the output.
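
The other approaches listed earlier, dropping rows and median imputation, can be sketched in the same way. The small DataFrame below is made up for illustration (it only borrows two column names from the dataset), so the block runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BMI": [33.6, np.nan, 23.3, 28.1],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's median
median_imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(filled)
```

Dropping keeps only two of the four rows here, which shows how quickly data can be lost. For mode imputation, `SimpleImputer` also accepts `strategy="most_frequent"`.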


## Feature scaling and normalization




Feature scaling and normalization are techniques used to transform numeric features in a dataset to a common scale. This process is often necessary in machine learning algorithms to ensure that all features contribute equally to the model's training and avoid biases caused by features with larger magnitudes dominating the learning process. The Scikit-Learn library provides various methods for feature scaling and normalization.

Here are two commonly used techniques for feature scaling and normalization in Scikit-Learn:

1. Standardization:
   Standardization rescales each feature to have a mean of 0 and a standard deviation of 1. It does not require the feature values to follow a Gaussian distribution, but because it relies on the mean and standard deviation it is sensitive to extreme outliers.


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Perform standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Print the scaled features
print(X_scaled)


In this example, we use the StandardScaler class from Scikit-Learn to perform standardization on the feature matrix X. The fit_transform method calculates the mean and standard deviation of each feature and scales the values accordingly. The resulting X_scaled contains the standardized feature values.
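
To make the calculation concrete, the following sketch reproduces standardization by hand with z = (x − mean) / std and compares it with `StandardScaler`. The small array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Manual standardization, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
```

After scaling, both columns have mean 0 and standard deviation 1, even though their original ranges differed by a factor of 100.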

2. Min-Max Scaling:
   Min-Max Scaling, also known as normalization, scales the features to a specific range, usually between 0 and 1.

   Example using the Pima Indian Diabetes dataset:



In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Perform min-max scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Print the scaled features
print(X_scaled)


In this example, we use the MinMaxScaler class from Scikit-Learn to perform min-max scaling on the feature matrix X. The fit_transform method determines the minimum and maximum values for each feature and scales the values accordingly to the range [0, 1]. The resulting X_scaled contains the normalized feature values.

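Both standardization and min-max scaling can be applied to the entire feature matrix or to specific columns, depending on your requirements. When the data will be split for model evaluation, fit the scaler on the training set only and reuse it to transform the test set, so that test-set statistics do not leak into training.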
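
One possible way to scale only specific columns is scikit-learn's `ColumnTransformer`. The sketch below applies `MinMaxScaler` to two columns and passes the third through unchanged; the toy values and the choice of columns are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data borrowing three column names from the dataset
df = pd.DataFrame({
    "Glucose": [80.0, 120.0, 160.0],
    "Age": [20.0, 40.0, 60.0],
    "Pregnancies": [0, 2, 5],
})

# Scale 'Glucose' and 'Age' to [0, 1]; leave 'Pregnancies' untouched
ct = ColumnTransformer(
    transformers=[("minmax", MinMaxScaler(), ["Glucose", "Age"])],
    remainder="passthrough",
)
result = ct.fit_transform(df)

# Transformed columns come first, passthrough columns last
print(result)
```

Note that the output is a NumPy array whose column order follows the transformer list, not the original DataFrame.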


## Handling categorical variables



Handling categorical variables in Scikit-Learn typically involves encoding categorical data into numerical representations so that machine learning algorithms can process them. Scikit-Learn provides several encoders for this, such as Label Encoding and One-Hot Encoding. The Pima Indian Diabetes dataset has no text-valued categorical columns, so the examples below use its binary 'Outcome' column, which is already stored as 0/1, purely to demonstrate how the encoders are called:

1. Label Encoding:
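Label Encoding converts categories into integers by assigning a unique numerical value to each category. Scikit-Learn's `LabelEncoder` is intended for the target variable and numbers the classes in sorted order, which does not necessarily match any meaningful ranking. For input features with an inherent order, `OrdinalEncoder` with an explicit category order is the more suitable tool; for unordered features, integer codes can mislead models into assuming an order that does not exist.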


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Perform label encoding on the 'Outcome' column
label_encoder = LabelEncoder()
dataset['Outcome'] = label_encoder.fit_transform(dataset['Outcome'])

# Print the updated dataset
print(dataset.head())


In this example, we use the LabelEncoder class from Scikit-Learn to encode the 'Outcome' column of the Pima Indian Diabetes dataset. The label_encoder.fit_transform() function fits the encoder on the 'Outcome' column and maps each distinct value to an integer label. Because 'Outcome' already contains only 0 and 1, the values are unchanged here; with string labels such as "yes"/"no", the same call would replace them with integers.
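
For a feature that does have an inherent order, a sketch with `OrdinalEncoder` might look as follows. The `risk_level` column and its categories are hypothetical and do not appear in the Pima dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not part of the Pima dataset)
df = pd.DataFrame({"risk_level": ["low", "high", "medium", "low"]})

# Spell out the intended order so that low < medium < high
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_level_encoded"] = encoder.fit_transform(df[["risk_level"]]).ravel()

print(df)
```

Without the explicit `categories` argument the encoder would sort the labels alphabetically (high, low, medium), which would not reflect the intended ranking.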

2. One-Hot Encoding:
One-Hot Encoding creates binary columns for each category of a categorical variable. Each column represents a category, and a value of 1 indicates that the observation belongs to that category, while 0 indicates it does not.


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Perform one-hot encoding on the 'Outcome' column
# Note: scikit-learn 1.2+ uses `sparse_output`; older versions use `sparse=False` instead
onehot_encoder = OneHotEncoder(sparse_output=False)
encoded_features = onehot_encoder.fit_transform(dataset[['Outcome']])

# Create a new DataFrame with the encoded features, using readable column names
encoded_df = pd.DataFrame(encoded_features, columns=onehot_encoder.get_feature_names_out(['Outcome']))
dataset = pd.concat([dataset, encoded_df], axis=1)

# Drop the original 'Outcome' column
dataset.drop(['Outcome'], axis=1, inplace=True)

# Print the updated dataset
print(dataset.head())


In this example, we use the OneHotEncoder class from Scikit-Learn to encode the 'Outcome' column of the Pima Indian Diabetes dataset. The onehot_encoder.fit_transform() function fits the encoder on the 'Outcome' column and transforms it into a one-hot encoded representation. We then create a new DataFrame with the encoded features and concatenate it with the original dataset. Finally, we drop the original 'Outcome' column from the dataset.



# Reflection Points

1. **Handling Missing Values**:
   - What are some common techniques for identifying missing values in a dataset?
     - Answer: Common techniques include checking for null values, using summary statistics, or visualizing missing value patterns.
   - What are the potential impacts of missing values on data analysis and modeling?
     - Answer: Missing values can lead to biased results, reduced statistical power, or errors in predictive modeling if not handled appropriately.
   - What are some strategies for handling missing values in a dataset?
     - Answer: Strategies include removing rows or columns with missing values, imputing missing values with statistical measures, or using advanced techniques like regression imputation or multiple imputation.

2. **Feature Scaling and Normalization**:
   - What is the purpose of feature scaling and normalization in machine learning?
     - Answer: Feature scaling ensures that all features have a similar scale, preventing some features from dominating others and affecting model performance.
   - What are some common techniques for scaling and normalizing features?
     - Answer: Techniques include min-max scaling, z-score standardization, and robust scaling using interquartile range.
   - When is it appropriate to apply feature scaling or normalization to a dataset?
     - Answer: Feature scaling is typically applied when features have different scales, such as in distance-based algorithms or models that rely on gradient descent optimization.

3. **Handling Categorical Variables**:
   - How do categorical variables differ from numerical variables in data analysis?
     - Answer: Categorical variables represent discrete groups or categories, while numerical variables represent measured or counted quantities, which may be continuous (e.g. BMI) or discrete (e.g. number of pregnancies).
   - What are the potential challenges of using categorical variables in machine learning models?
     - Answer: Challenges include the need to convert categorical variables into numerical representations, handling high-cardinality categories, and avoiding the introduction of bias during encoding.
   - What are some common techniques for handling categorical variables in machine learning?
     - Answer: Techniques include one-hot encoding, label encoding, target encoding, or using embedding layers in deep learning models.
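
As a small illustration of the robust scaling mentioned in the answers above, the sketch below uses scikit-learn's `RobustScaler`, which centers each feature on its median and divides by its interquartile range. The values are made up, with one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the IQR is 4 - 2 = 2, so each value becomes (x - 3) / 2
scaled = RobustScaler().fit_transform(X_demo)

print(scaled.ravel())
```

The outlier remains large after scaling, but it no longer distorts how the typical values are scaled, as it would with the mean and standard deviation.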


# Exercise


1. Load the Pima Indian Diabetes dataset from the URL used in the examples above.
2. Explore the dataset to understand its structure, features, and target variable.
3. Check for any missing values in the dataset.
4. Split the dataset into features (X) and target variable (y).
5. Split the data into training and testing sets.
6. Perform data preprocessing steps like feature scaling, if necessary.
7. Train a simple classification model (e.g., Logistic Regression) on the preprocessed data.
8. Evaluate the model on the test data and calculate the accuracy.
9. (Optional) Try out other preprocessing techniques such as handling missing values, feature selection, etc., to see how they impact the model's performance.


# Sample Solution

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)

# Step 2: Explore the dataset
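print(data.head())
data.info()  # info() prints its summary directly and returns None

# Step 3: Check for missing values
# Note: missing measurements in this dataset appear as 0, so isnull() alone will not flag them
print(data.isnull().sum())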

# Step 4: Split the dataset into features (X) and target variable (y)
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Data preprocessing - Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 7: Train a simple classification model (Logistic Regression) on the preprocessed data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 8: Evaluate the model on the test data and calculate the accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
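
For the optional step 9, one possible approach is to chain imputation, scaling, and the model in a scikit-learn `Pipeline`, so that every preprocessing step is fitted on the training data only. The sketch below uses synthetic data from `make_classification` with artificially injected missing values as a stand-in for the real dataset; the 5% missing rate and the median strategy are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (8 features, binary target)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Inject roughly 5% missing values at random positions
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Imputer and scaler are fitted on the training split only
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```

To try this on the Pima data, replace the synthetic `X` and `y` with the features and target from the solution above, after converting the implausible zeros to NaN.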


# A quiz on Data Preprocessing


1. Question: Filling in missing entries with estimated values (imputation) belongs to which preprocessing task?
   <br>a) Feature scaling
   <br>b) Normalization
   <br>c) Handling missing values
   <br>d) Categorical encoding

2. Question: Which Python library can be used to load and manipulate datasets, including handling missing values?
   <br>a) NumPy
   <br>b) Pandas
   <br>c) Scikit-learn
   <br>d) Matplotlib

3. Question: Which method is commonly used for filling missing numerical values with the mean of the available data?
   <br>a) Zero imputation
   <br>b) Mode imputation
   <br>c) Mean imputation
   <br>d) Median imputation

4. Question: Feature scaling is important for machine learning algorithms that rely on distance calculations or gradient descent. Which of the following scaling techniques transforms data to a range of [0, 1]?
   <br>a) Min-Max scaling
   <br>b) Standardization
   <br>c) Robust scaling
   <br>d) Log transformation

5. Question: In feature scaling, the z-score scaling technique transforms data to have a mean of 0 and a standard deviation of 1.
   <br>a) True
   <br>b) False

6. Question: Which of the following is NOT a method to handle categorical variables?
   <br>a) One-Hot Encoding
   <br>b) Label Encoding
   <br>c) Ordinal Encoding
   <br>d) Median Encoding

7. Question: In One-Hot Encoding, how many new columns are created for a categorical feature with 'n' unique categories?
   <br>a) n
   <br>b) n+1
   <br>c) 2n
   <br>d) n/2

8. Question: Which Python library provides the 'get_dummies()' function to perform One-Hot Encoding?
   <br>a) Pandas
   <br>b) NumPy
   <br>c) Scikit-learn
   <br>d) Statsmodels

---
Answer Key:
1. c) Handling missing values
2. b) Pandas
3. c) Mean imputation
4. a) Min-Max scaling
5. a) True
6. d) Median Encoding
7. a) n
8. a) Pandas
---