# Data Processing

## Data Cleaning Techniques and Handling Missing Values

In the real world, raw data is rarely perfect. It often contains errors, inconsistencies, and missing values. Data cleaning is the process of detecting and correcting these issues to improve the quality of the data. Besides missing values, common issues include:

- *Duplicates*: These are repeated entries in the dataset. Duplicate rows can skew your analysis or model, especially if they represent the same event or entity.
- *Incorrect Data Types*: Data that is incorrectly typed (e.g., numbers stored as strings) can lead to errors in analysis and model training.
- *Outliers*: Extreme values that are far removed from the rest of the data. They can distort the results of statistical analyses or machine learning models.
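
The three issues listed above can be detected and fixed with a few pandas calls. This is a minimal sketch: the toy DataFrame, its column names, and the 1.5 × IQR rule for flagging outliers are assumptions made for illustration and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: a duplicated row, numbers stored as strings, one extreme value
raw = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 5],
    'price': ['10', '12', '12', '11', '13', '500'],
})

# Duplicates: keep the first occurrence of each repeated row
deduped = raw.drop_duplicates().reset_index(drop=True)

# Incorrect data types: convert the string column to numbers
deduped['price'] = pd.to_numeric(deduped['price'])

# Outliers: flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)
q1, q3 = deduped['price'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped['price'] < q1 - 1.5 * iqr) | (deduped['price'] > q3 + 1.5 * iqr)

print(deduped[is_outlier])
```

Whether a flagged value should be removed, capped, or kept depends on whether it is an error or a genuine observation.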

Missing data can occur for various reasons, such as data entry errors, equipment malfunctions, or privacy concerns. It's essential to handle missing values carefully, as improper handling can lead to biased results.

Types of Missing Data:

- *MCAR* (Missing Completely at Random): The probability of data being missing is independent of the data itself.
- *MAR* (Missing at Random): The probability of missing data is related to other observed data, but not to the missing data itself.
- *MNAR* (Missing Not at Random): The missingness is related to the unobserved data itself.
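
The difference between these mechanisms matters because it determines whether the observed values are still representative. The sketch below simulates two of them (MCAR and MNAR) on a hypothetical `income` column; the distribution, the 20% missing rate, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.normal(loc=50_000, scale=10_000, size=1_000))

# MCAR: every value has the same 20% chance of being missing
mcar = income.mask(rng.random(len(income)) < 0.2)

# MNAR: the highest 20% of incomes are the ones that go unreported
mnar = income.mask(income > income.quantile(0.8))

print(f"true mean:       {income.mean():.0f}")
print(f"mean under MCAR: {mcar.mean():.0f}")
print(f"mean under MNAR: {mnar.mean():.0f}")
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is biased downward because the missing values are systematically the large ones.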

**Removing missing data:**

Dropping rows or columns with missing values is the simplest approach, but it can lead to loss of valuable information.

In [None]:
import pandas as pd

df = pd.read_csv('../resources/data.csv')
print(df)
df_cleaned = df.dropna()  # Drops all rows with any missing values
# Drop rows with any missing values in numerical columns
df_feature_cleaned = df.dropna(subset=['feature1', 'feature2'])
print(df_cleaned)
print(df_feature_cleaned)
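
The cell above only drops rows. As a small sketch of the column-wise variant, the example below uses a hypothetical toy DataFrame (not `data.csv`) to show `dropna` with `axis=1` and with the `thresh` parameter, and how much data each choice discards.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 8.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

rows_dropped = toy.dropna()                 # keeps only rows with no missing values
cols_dropped = toy.dropna(axis=1)           # keeps only columns with no missing values
mostly_full = toy.dropna(axis=1, thresh=3)  # keeps columns with at least 3 non-missing values

print(len(toy), '->', len(rows_dropped), 'rows')
print(list(cols_dropped.columns))
print(list(mostly_full.columns))
```

Here dropping rows keeps a single row out of four, whereas dropping only the mostly empty column `b` preserves far more of the data.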

**Mean Imputation**: Replace missing values with the mean of the column.

In [None]:
# Impute missing values with the mean of the respective column
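# Assign the result back: fillna(inplace=True) on a selected column
# triggers a warning in recent pandas versions and may not modify df
df['feature1'] = df['feature1'].fillna(df['feature1'].mean())
df['feature2'] = df['feature2'].fillna(df['feature2'].mean())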
df

**Median Imputation**: Useful when the data is skewed.

In [None]:
# Impute missing values with the median of the column
df['feature1'] = df['feature1'].fillna(df['feature1'].median())
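
To see why the median is preferred for skewed data, consider this small sketch. The values are hypothetical and chosen so that a single large entry dominates the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one very large value and one missing entry
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

print(skewed.mean())    # 22.0, pulled upward by the extreme value
print(skewed.median())  # 3.0, unaffected by it

# Fill the missing entry with the median rather than the mean
filled = skewed.fillna(skewed.median())
print(filled.tolist())
```

Filling with the mean would insert a value larger than four of the five observed entries, which is rarely what is intended.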

**Mode Imputation**: Often used for categorical data.

In [None]:
# mode() returns a Series (there can be ties), so take the first value
df['feature1'] = df['feature1'].fillna(df['feature1'].mode()[0])
df

Advanced Imputation:

**K-Nearest Neighbors (KNN) Imputation**: Estimates missing values based on the values of the nearest neighbors.
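
A minimal sketch of KNN imputation using scikit-learn's `KNNImputer`. The toy data and the choice of `n_neighbors=2` are assumptions for illustration; in practice the number of neighbors is something to tune.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with missing entries
toy = pd.DataFrame({
    'feature1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'feature2': [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Each missing value is replaced by the average of that feature over the
# 2 nearest rows, where distance is measured on the features both rows have
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

Because the method relies on distances, features on very different scales should usually be scaled first.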

**Multivariate Imputation**: Considers the relationships between features to impute missing values.
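
One way to try multivariate imputation is scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features. The sketch below assumes a hypothetical toy DataFrame in which `feature2` is roughly twice `feature1`. The estimator is still marked experimental in scikit-learn, which is why the extra `enable_iterative_imputer` import is required.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: feature2 is about twice feature1
toy = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
    'feature2': [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
})

# Each feature with gaps is regressed on the others, repeated for up to 10 rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed.round(2))
```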

**Creating Indicator Variables**: Another approach is to create a binary indicator variable that flags the presence of missing data.
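
A minimal sketch of an indicator variable, using a hypothetical single-column DataFrame. The column name `feature1_missing` is an assumption; the key point is that the flag is created before the gaps are filled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'feature1': [1.0, np.nan, 3.0, np.nan]})

# Record which rows were missing before filling them in
toy['feature1_missing'] = toy['feature1'].isna().astype(int)

# Then impute as usual (mean imputation here)
toy['feature1'] = toy['feature1'].fillna(toy['feature1'].mean())

print(toy)
```

The indicator lets a model learn from the fact that a value was missing, which can help when the data is not missing completely at random.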


## Data Scaling and Normalization

Data scaling and normalization are crucial preprocessing steps that can significantly affect the performance of your models.

Data scaling ensures that all features contribute equally to the model's learning process. When features have vastly different ranges, those with larger ranges can dominate the distance calculations or the gradient steps, leading to suboptimal models.

*Distance-based Algorithms*: In algorithms like K-Nearest Neighbors (KNN) or Support Vector Machines (SVM), the distance between data points is crucial. If one feature has a much larger range than others, it will disproportionately influence the distance metric, potentially skewing the results.

*Gradient Descent*: Models such as neural networks, and linear or logistic regression when fitted iteratively, use gradient descent to minimize a cost function. If the features are not scaled, the cost surface becomes elongated, so the algorithm takes uneven steps in parameter space and may need many more iterations to converge, or may stop at a suboptimal solution if it runs out of iterations.
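
To make the distance argument concrete, the sketch below computes the Euclidean distance between two hypothetical people described by age and income, and shows how much of the squared distance each feature accounts for. The values are invented for illustration.

```python
import numpy as np

# Two hypothetical people described by (age in years, income in dollars)
a = np.array([25.0, 30_000.0])
b = np.array([55.0, 31_000.0])

squared_diff = (a - b) ** 2
share = squared_diff / squared_diff.sum()

print(np.sqrt(squared_diff.sum()))  # Euclidean distance between a and b
print(share)                        # fraction of the squared distance due to each feature
```

Even though the 30-year age gap is large and the income gap is small in relative terms, income accounts for more than 99% of the squared distance simply because it is measured in larger units.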



**Min-Max Scaling**

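Min-max scaling rescales each feature to a fixed range, usually [0, 1]. It is particularly useful when the data has clear upper and lower bounds. Because the observed minimum and maximum define the scale, a single extreme value can compress all other values into a narrow band.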

$$
X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

- Use Min-Max Scaling when you want all your features to have the same scale (e.g., in algorithms like KNN, SVM).
- Particularly useful when the data is distributed across a known and fixed range.

In [2]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

data = {'feature': [10, 50, 100, 150, 200]}
df = pd.DataFrame(data)

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

print(scaled_data)

[[0.        ]
 [0.21052632]
 [0.47368421]
 [0.73684211]
 [1.        ]]


In [None]:
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('../resources/data.csv')

# Select numerical columns for scaling
numerical_cols = df[['feature1', 'feature2']]

# Keep categorical and target columns separate
categorical_cols = df[['categorical_column', 'target']]

# Initialize the scaler
scaler = MinMaxScaler()

# Fit and transform the numerical columns
numerical_scaled = scaler.fit_transform(numerical_cols)

# Convert the scaled data back to a DataFrame
numerical_scaled_df = pd.DataFrame(numerical_scaled, columns=numerical_cols.columns)

# Combine scaled numerical data with the categorical columns
df_scaled = pd.concat([numerical_scaled_df, categorical_cols.reset_index(drop=True)], axis=1)

# Display the final DataFrame
print(df_scaled)

**Z-Score Normalization (or standardization)**

Z-score normalization rescales the data to have a mean of 0 and a standard deviation of 1. This technique is useful when the data roughly follows a Gaussian (normal) distribution and is not necessarily bounded.

$$
X' = \frac{X - \mu}{\sigma}
$$

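- Use Z-Score Normalization when your features are roughly normally distributed or have no natural bounds.
- It's a common default for many machine learning models. The data does not have to be normal for it to work, but heavy skew or outliers will distort the mean and standard deviation it relies on.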
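
The formula above can be checked by hand against `StandardScaler`. One detail worth knowing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below uses the same toy values as the other scaling cells.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([10.0, 50.0, 100.0, 150.0, 200.0])

# Apply the formula directly, using the population standard deviation (ddof=0)
manual = (x - x.mean()) / x.std(ddof=0)

# StandardScaler expects a 2-D array, hence the reshape
from_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, from_sklearn))
```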


In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

data = {'feature': [10, 50, 100, 150, 200]}
df = pd.DataFrame(data)

scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)

print(standardized_data)

**Robust Scaling**

Robust scaling centers the data on the median and divides by the interquartile range (IQR). Because both statistics ignore the tails of the distribution, Robust Scaling is far less influenced by extreme values than Min-Max Scaling or Z-Score Normalization.


$$X'=\frac{X - \text{median}(X)}{\text{IQR}(X)}$$

where $\text{IQR}(X) = Q3 - Q1$ is the difference between the 75th percentile ($Q3$) and the 25th percentile ($Q1$).

- Use Robust Scaling when your data contains outliers or follows a non-Gaussian distribution.
- Ideal for situations where the data is skewed or contains extreme values that should not heavily influence the scaling.

In [None]:
from sklearn.preprocessing import RobustScaler
import pandas as pd

data = {'feature': [10, 50, 100, 150, 200]}
df = pd.DataFrame(data)

scaler = RobustScaler()
robust_scaled_data = scaler.fit_transform(df)

print(robust_scaled_data)

Choosing a Scaling Technique:

- *Min-Max Scaling*: Use when you know the boundaries of your data or when the algorithm requires data in a specific range.
- *Z-Score Normalization*: Suitable for normally distributed data or when you want to standardize features to have equal importance.
- *Robust Scaling*: Best when dealing with data that has outliers.
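
To see how the three techniques react to an outlier, the sketch below reuses the toy feature from the cells above and appends one hypothetical extreme value (10,000). For each scaler it measures how much of the scaled range is left for the five ordinary values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# The toy feature used above, plus one hypothetical extreme value
x = np.array([10.0, 50.0, 100.0, 150.0, 200.0, 10_000.0]).reshape(-1, 1)

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    inliers = scaled[:-1]  # everything except the extreme value
    spreads[type(scaler).__name__] = inliers.max() - inliers.min()

for name, spread in spreads.items():
    print(f"{name:>14}: spread of the ordinary values = {spread:.3f}")
```

With Min-Max Scaling and Z-Score Normalization the ordinary values are squeezed into a tiny interval, while Robust Scaling keeps them well separated.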

## Introduction to Feature Engineering

*Feature engineering* is a crucial step in the data preprocessing pipeline. It is the process of selecting, modifying, or creating features from raw data so that they capture the underlying patterns of the problem more effectively than the raw data does. Good feature engineering can lead to simpler models, improved accuracy, and reduced training times.

**Creating new features** 

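Creating new features involves deriving new variables from existing ones that better represent the underlying structure of the data. This could involve mathematical transformations, aggregations, or domain-specific operations. The example below derives the body mass index (BMI) from height in meters and weight in kilograms.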

In [None]:
import pandas as pd

data = {
    'height': [1.60, 1.75, 1.82, 1.90],
    'weight': [55, 80, 72, 90]
}
df = pd.DataFrame(data)

# Creating a new feature: BMI
df['BMI'] = df['weight'] / (df['height'] ** 2)

print(df)

**Encoding Categorical Variables**

Machine learning algorithms generally require numerical input. Therefore, categorical variables must be converted into numerical format. There are several techniques to encode categorical variables:

*One-hot* encoding converts each category into a new binary column. This method is suitable when there are a limited number of categories.

In [None]:
data = {'color': ['red', 'blue', 'green']}
df = pd.DataFrame(data)

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['color'])

print(df_encoded)

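*Label encoding* assigns a unique integer to each category. It is simple, but the integers imply an order that many models will treat as meaningful, so it suits ordinal categories better than nominal ones such as colors. Note that scikit-learn's `LabelEncoder` is intended for target labels and assigns integers in sorted (alphabetical) order.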

In [None]:
from sklearn.preprocessing import LabelEncoder

data = {'color': ['red', 'blue', 'green']}
df = pd.DataFrame(data)

# Label Encoding
encoder = LabelEncoder()
df['color_encoded'] = encoder.fit_transform(df['color'])

print(df)
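
When a feature really is ordinal, the order can be stated explicitly so that the integers reflect it. The sketch below uses scikit-learn's `OrdinalEncoder` on a hypothetical `size` column; the column and its categories are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# State the order explicitly so the integers reflect it: small < medium < large
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes['size_encoded'] = encoder.fit_transform(sizes[['size']]).ravel()

print(sizes)
```

Here `small`, `medium`, and `large` map to 0, 1, and 2, rather than to the alphabetical order that a default encoding would produce.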

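*Target encoding* involves replacing each category with the mean of the target variable for that category. This technique is useful when the categorical variable has a large number of levels. Because the encoding is derived from the target, it should be computed on the training data only; otherwise information about the target leaks into the features.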

In [None]:
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'target': [1, 0, 0, 1, 1, 0]
}
df = pd.DataFrame(data)

# Target Encoding
df['color_encoded'] = df.groupby('color')['target'].transform('mean')

print(df)
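
In practice the category means are learned on the training data and then applied to new data. The following is a minimal sketch under that assumption: the `test` rows are hypothetical, and falling back to the global training mean for unseen categories is one common choice rather than the only one.

```python
import pandas as pd

train = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'target': [1, 0, 0, 1, 1, 0],
})
# Hypothetical unseen rows; 'purple' never appears in the training data
test = pd.DataFrame({'color': ['red', 'green', 'purple']})

# Learn the per-category means from the training data only
category_means = train.groupby('color')['target'].mean()
global_mean = train['target'].mean()

# Apply the learned mapping; unseen categories fall back to the global mean
test['color_encoded'] = test['color'].map(category_means).fillna(global_mean)

print(test)
```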

**Interaction Features**

Interaction features are created by combining two or more variables to capture the interactions between them. These features can reveal relationships that are not apparent when considering the variables independently.

*Example*:

Suppose you have age and income as features. Multiplying them gives an interaction term that lets a model capture how the effect of income on the target varies with age.

In [None]:
data = {
    'age': [25, 35, 45, 55],
    'income': [30000, 50000, 70000, 90000]
}
df = pd.DataFrame(data)

# Creating an interaction feature
df['age_income_interaction'] = df['age'] * df['income']

print(df)

**Polynomial features**

Polynomial features are new features formed as polynomial combinations of the existing ones. They can be particularly useful when modeling non-linear relationships with an otherwise linear model.

*Example*:

Suppose you have a single feature $x$. You can create polynomial features like $x^2$, $x^3$, etc., which may help in capturing the non-linear relationships in the data.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

data = {'x': [2, 3, 4]}
df = pd.DataFrame(data)

# Creating polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)

# Convert the polynomial features back to a DataFrame
df_poly = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['x']))

print(df_poly)
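
To illustrate why this helps, the sketch below fits a linear regression to hypothetical data where the target is exactly $x^2$, once with the raw feature and once with the added squared term, and compares the R² scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with an exactly quadratic relationship
X = np.arange(-3, 4, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2

# Linear model on the raw feature
linear = LinearRegression().fit(X, y)
r2_linear = linear.score(X, y)

# Same model on the features x and x^2
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quadratic = LinearRegression().fit(X_poly, y)
r2_quadratic = quadratic.score(X_poly, y)

print(f"R^2 with x only:    {r2_linear:.3f}")
print(f"R^2 with x and x^2: {r2_quadratic:.3f}")
```

The model itself is still linear in its inputs; the added feature is what allows it to follow the curve.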

**Feature Selection**

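Feature selection keeps the features that carry the most information about the target and drops the rest. A simple starting point is to rank the numerical features by their correlation with the target variable, keeping in mind that correlation only measures linear relationships.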

In [None]:
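# Reload the dataset: `df` was overwritten by the toy examples above
df = pd.read_csv('../resources/data.csv')

# Select only numerical columns for correlation
numerical_df = df.select_dtypes(include='number')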

# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Print the correlations with the 'target' variable, sorted in descending order
print(correlation_matrix['target'].sort_values(ascending=False))
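
The same idea as a self-contained sketch that does not depend on `data.csv`. The toy columns `strong` and `weak` and the 0.5 cutoff are assumptions for illustration; a sensible threshold depends on the problem.

```python
import pandas as pd

toy = pd.DataFrame({
    'strong': [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],
    'weak': [5, 1, 4, 2, 6, 3],
    'target': [1, 2, 3, 4, 5, 6],
})

# Correlation of every feature with the target (the target itself is dropped)
correlations = toy.corr()['target'].drop('target')

# Keep features whose absolute correlation exceeds a hypothetical cutoff of 0.5
selected = correlations[correlations.abs() > 0.5].index.tolist()

print(correlations)
print(selected)
```

Using the absolute value ensures that strongly negative correlations are kept as well.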

## A Few Notes on scikit-learn

scikit-learn is one of the most popular and powerful open-source Python libraries for machine learning and data analysis. It provides a comprehensive suite of tools for preprocessing data and for building and evaluating machine learning models, making it an essential tool for data scientists, machine learning engineers, and statisticians.

*Pipelines* in scikit-learn allow you to streamline the process of creating machine learning workflows by combining multiple steps (e.g., preprocessing, model fitting) into a single, cohesive process.

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = [[1, 10], [100, 50], [5, 1], [10, 100], [50, 5]]
y_train = [0, 0, 1, 1, 0]

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
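
A pipeline is equivalent to running its steps by hand: the scaler is fitted on the training data, and the same fitted scaler is reused to transform any data passed to `predict`. The sketch below checks this on the training data from the cell above; `X_new` is a hypothetical pair of unseen rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = [[1, 10], [100, 50], [5, 1], [10, 100], [50, 5]]
y_train = [0, 0, 1, 1, 0]
X_new = [[1, 10], [10, 5]]  # hypothetical unseen rows

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# The same steps carried out by hand
scaler = StandardScaler().fit(X_train)
classifier = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = classifier.predict(scaler.transform(X_new))

pipeline_pred = pipeline.predict(X_new)
print(np.array_equal(pipeline_pred, manual_pred))
```

The pipeline version is shorter and removes the risk of forgetting to scale new data, or of accidentally refitting the scaler on it.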

**Model Evaluation Metrics**:

Scikit-learn includes a rich set of metrics for evaluating model performance, from standard accuracy and confusion matrices to ROC AUC for classification and mean squared error (MSE) and R² for regression.

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix

X_test = [[1, 10], [10, 5], [5, 1], [10, 100], [50, 5]]
y_test = [0, 1, 0, 0, 0]

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(y_pred, '\n')
print(accuracy, '\n')
print(cm, '\n')


[0 0 0 1 0] 

0.6 

[[3 1]
 [1 0]] 
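
Accuracy alone can be misleading here: the confusion matrix shows that the only positive example was missed. As a sketch, precision, recall, and F1 for the positive class make this explicit. The labels and predictions below are copied from the output above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Labels and predictions copied from the output above
y_test = [0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 1, 0]

# zero_division=0 reports 0 instead of warning when a score is undefined
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print(precision, recall, f1)
```

All three scores are 0 despite an accuracy of 0.6, because the single predicted positive was wrong and the single actual positive was not found.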

