# DATA PROCESSING

- Data processing is a crucial step in any machine learning project. It involves transforming raw data into a format that can be used for analysis, modeling, and making predictions. This step can significantly impact the quality of the machine learning model. The following outlines the common steps involved in data processing:

___
## Data Collection

- Description: Gathering raw data from different sources such as databases, APIs, flat files (CSV, Excel), web scraping, or sensors.
- Tools: APIs (e.g., requests), web scraping (e.g., beautifulsoup), databases (e.g., SQL, MongoDB), files (e.g., pandas.read_csv()).

___
## Data Cleaning

- Data cleaning is the process of handling missing, inconsistent, or outlier data that can affect model accuracy. This step ensures that the data is accurate and ready for analysis.

#### 1. Handling Missing Data:

- <b>Imputation</b>: Fill missing values with the mean or median (for numerical data) or the mode, i.e. the most frequent value (for categorical data).
- <b>Deletion</b>: Remove rows or columns with missing values (use with caution, as this can discard important information).
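
The two strategies above can be sketched with pandas as follows. The toy DataFrame and its column names (`age`, `city`) are hypothetical and only serve to illustrate the calls; which statistic to impute with depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numerical and a categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Boston", "Chicago", None, "Boston"],
})

# Imputation: median for the numerical column, mode for the categorical one
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Deletion: drop every row that contains at least one missing value
dropped = df.dropna()

print(imputed)
print(dropped)
```

Here deletion keeps only two of the four rows, which shows how quickly information can be lost on small datasets.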

#### 2. Removing Duplicates:
- Use DataFrame.drop_duplicates() from pandas to remove repeated rows.
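
A minimal sketch of duplicate removal, assuming a small hypothetical DataFrame in which one row is repeated:

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["a", "b", "b", "c"],
})

# Count fully duplicated rows, then keep the first occurrence of each
n_duplicates = int(df.duplicated().sum())
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

By default every column must match for a row to count as a duplicate; the `subset` parameter can restrict the comparison to selected columns.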


#### 3. Handling Outliers:
- Outliers can skew results and make the model less accurate.
- Use statistical methods (e.g., Z-scores, IQR) to identify outliers.
- Tools: pandas, scipy.stats (Z-scores), numpy.percentile (IQR).
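
Both detection rules can be sketched with NumPy alone. The sample values are hypothetical, and the cut-offs (1.5 × IQR and |Z| > 2) are common conventions rather than fixed rules.

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 2 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_mask = np.abs(z_scores) > 2

print(values[iqr_mask])  # [95.]
print(values[z_mask])    # [95.]
```

In very small samples a single point cannot reach a Z-score of 3, because the outlier itself inflates the standard deviation, so the threshold of 2 is used here.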

___
##  Exploratory Data Analysis (EDA)


- EDA is used to understand the structure of the dataset and the relationships between different features.

#### Data Visualization: 

- Plotting graphs to understand the distribution and relationships.
   - matplotlib, seaborn, plotly.
   - Histograms to understand the distribution of a feature.
   - Scatter plots to understand the relationship between two features.
   - Boxplots to visualize the spread and detect outliers.
#### Statistical Summaries: 
- Getting the central tendency (mean, median, mode), spread (variance, standard deviation), and other summary statistics.
   - DataFrame.describe() (pandas), scipy.stats
#### Correlation Analysis:
- Check for correlations between features. Strong correlations may suggest multicollinearity or redundancy.
   - DataFrame.corr() (pandas), seaborn.heatmap().
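
As a rough illustration of the summary and correlation steps, the sketch below uses a hypothetical DataFrame in which `weight_kg` is an exact linear function of `height_cm`, so their correlation is 1. Plotting is left out to keep the example free of display dependencies.

```python
import pandas as pd

# Hypothetical data; weight_kg is exactly linear in height_cm
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 66, 74, 82],
    "shoe_size": [38, 37, 42, 41, 45],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
summary = df.describe()

# Pairwise Pearson correlation between all numerical columns
corr = df.corr()

print(summary.loc[["mean", "std", "min", "max"]])
print(corr.round(2))
```

The resulting `corr` DataFrame can be passed directly to `seaborn.heatmap()` for a visual overview.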

___
##  Feature Engineering
- Feature engineering involves transforming raw data into a set of features that better represent the underlying problem to the model.
   


#### Creating New Features:
- New features can be derived from domain knowledge or from mathematical transformations of existing columns (for example, a ratio of two columns).
#### Categorical Encoding:
- Machine learning models typically require numerical input. Categorical variables must be converted into numerical representations. 
    - One-Hot Encoding: Create one binary column per category. pandas.get_dummies(), sklearn.preprocessing.OneHotEncoder().
    - [Label Encoding](#section1): Assign an integer value to each category (useful for ordinal features, provided the integers follow the intended order).
    - [Target Encoding](#section2): Replace categories with the mean of the target variable for each category.
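
One-hot encoding has no worked example elsewhere in these notes, so here is a minimal sketch with `pandas.get_dummies()`. The `Color` column is hypothetical sample data, and `dtype=int` is used only so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One binary column per category; columns are created in sorted order
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1 across the new columns, which is what distinguishes one-hot encoding from label encoding.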
#### Feature Scaling:
- Standardize or normalize the features to bring them onto the same scale. This is especially important for distance-based models like KNN and for models trained with gradient descent, such as logistic regression.
- [Normalization](#section3): Rescaling features to a range (typically [0, 1]). Tool: sklearn.preprocessing.MinMaxScaler().
- [Standardization](#section4): Scaling features to have zero mean and unit variance. Tool: sklearn.preprocessing.StandardScaler().
#### Feature Selection: 
- Remove irrelevant or redundant features.
- Correlation Thresholding: Remove features that are highly correlated with each other.
- Univariate Feature Selection: Select features based on their statistical significance with the target variable.
- Recursive Feature Elimination (RFE): A method to iteratively remove features and build the model to select the most important ones.
- Tools: sklearn.feature_selection.
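
Univariate selection and RFE can be sketched with `sklearn.feature_selection` as shown below. The synthetic dataset, the choice of `f_classif` as the scoring function, logistic regression as the RFE estimator, and keeping three features are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 8 features, only some of them informative
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

# Univariate selection: keep the 3 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=3)
X_univariate = selector.fit_transform(X, y)

# RFE: repeatedly fit the model and drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices kept by SelectKBest
print(rfe.get_support(indices=True))       # indices kept by RFE
```

The two methods need not agree: univariate scores look at each feature in isolation, while RFE judges features in the context of the fitted model.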

___
## Data Transformation

- Data transformation refers to modifying or converting data in a way that makes it more suitable for the machine learning model.

- <b>Log Transformation:</b> Apply logarithmic transformation to features that have a skewed distribution.
- <b>Polynomial Features:</b> Create interaction terms or higher-degree polynomial features.
- <b>Dimensionality Reduction:</b> Reduce the number of features to a lower-dimensional space (especially useful when there are many features).
   -  PCA (Principal Component Analysis): Projects data onto fewer dimensions while preserving as much variance as possible.
   -  t-SNE (t-Distributed Stochastic Neighbor Embedding): For visualizing high-dimensional data in 2D or 3D.
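
The transformations listed above can be sketched as follows. All input arrays are hypothetical; `np.log1p` is used instead of a plain logarithm so that zeros would not cause errors, and the random data for PCA is only there to show the change in shape.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Log transformation: compress a right-skewed feature with log(1 + x)
skewed = np.array([1.0, 10.0, 100.0, 1000.0])
log_transformed = np.log1p(skewed)

# Polynomial features of degree 2 for two inputs a and b:
# output columns are a, b, a^2, a*b, b^2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# PCA: project 5-dimensional data onto its first 2 principal components
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)

print(log_transformed)
print(X_poly)
print(data_2d.shape)  # (100, 2)
```

`pca.explained_variance_ratio_` reports how much of the total variance each retained component preserves.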

___
## Data Splitting

#### Training Set: 
- Used to train the machine learning model.
#### Test Set:
- Used to evaluate the model's performance on unseen data.
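
A minimal sketch of the split using `sklearn.model_selection.train_test_split`. The arrays are hypothetical, and the 80/20 ratio, the fixed `random_state`, and stratification by label are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```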

___
## Model Training
- Once the data is clean, transformed, and split, you can train a machine learning model using the training dataset.
- Examples of models include regression models, classification models, or clustering algorithms, depending on the problem.
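
As a rough sketch of the training step, the example below fits a classification model on the training portion of a synthetic dataset. The dataset, the choice of logistic regression, and the split ratio are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training set only; the test set stays unseen until evaluation
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict labels for the held-out samples
predictions = model.predict(X_test)
print(predictions[:10])
```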


___
##  Model Evaluation
- After training the model, evaluate its performance using various metrics based on the type of problem (regression, classification, etc.).
- For Regression: Common metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.
- For Classification: Common metrics include Accuracy, Precision, Recall, F1-Score, AUC-ROC.
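
The metrics above can be computed with `sklearn.metrics`. The true and predicted values below are small hypothetical arrays chosen so the results are easy to check by hand. AUC-ROC is omitted because it needs predicted scores or probabilities rather than hard labels.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: squared errors are 1, 0, 4, 0
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 4.0, 7.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)  # 1.25
rmse = np.sqrt(mse)
r2 = r2_score(y_true_reg, y_pred_reg)

# Classification: 2 true positives, 2 true negatives, 1 FP, 1 FN
y_true_clf = [1, 0, 1, 1, 0, 0]
y_pred_clf = [1, 0, 0, 1, 1, 0]
accuracy = accuracy_score(y_true_clf, y_pred_clf)    # 4/6
precision = precision_score(y_true_clf, y_pred_clf)  # 2/3
recall = recall_score(y_true_clf, y_pred_clf)        # 2/3
f1 = f1_score(y_true_clf, y_pred_clf)                # 2/3

print(mse, rmse, r2)
print(accuracy, precision, recall, f1)
```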

___
##  Model Tuning

- Hyperparameter tuning can be done to improve the model performance by selecting the best hyperparameters (e.g., learning rate, number of trees in a random forest, etc.).
  - Grid Search: Exhaustively tries every combination in a user-specified grid of hyperparameter values.
  - Random Search: Samples hyperparameter combinations at random from specified ranges or distributions.
  - Bayesian Optimization: Uses a probabilistic model of past results to choose the hyperparameters most likely to improve the model.
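
Grid search can be sketched with `sklearn.model_selection.GridSearchCV`. The synthetic data, the random forest estimator, the small 2 × 2 grid, and 3-fold cross-validation are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Every combination in this grid is evaluated: 2 x 2 = 4 candidates
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` from the same module has a similar interface and samples a fixed number of candidates instead of trying all of them.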
___
##  Deployment
 - After training and evaluation, the model is deployed into a production environment, where it can be used to make real-time predictions on new data.
 - Tools: Flask, Django (for APIs), Docker (for containerization), cloud platforms (AWS, GCP, Azure).
 ___

<h2 id="section1"><u>Label encoding</u></h2>
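- Blue is encoded as 0 <br>
- Green is encoded as 1<br>
- Red is encoded as 2<br>
- LabelEncoder assigns integers in sorted (alphabetical) order of the class labels, not in order of appearance.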


```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to the 'Color' column
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])

print(df)
```

```text
   Color  Color_encoded
0    Red              2
1  Green              1
2   Blue              0
3  Green              1
4   Blue              0
5    Red              2
```


<h2 id="section2"><u>Target encoding</u></h2>

- Target Encoding (also known as Mean Encoding) is a technique where categorical values are replaced by the mean of the target variable (dependent variable) for each category. It can be useful when dealing with categorical features with a high cardinality (many categories), and is often used in situations where One-Hot Encoding would create a sparse matrix with too many columns.

1. Calculate the mean of the target variable for each category in the feature.
2. Replace each category in the feature with the corresponding mean of the target variable.

- Suppose you have a dataset where you're predicting whether a person will buy a product (target variable Purchase, 1 for Yes and 0 for No) based on the City they live in (categorical feature).

```python
import pandas as pd

# Sample data
data = {
    'City': ['New York', 'Boston', 'Chicago', 'Boston', 'New York', 'Chicago'],
    'Purchase': [1, 0, 1, 1, 0, 0]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate the target mean for each category in 'City'
target_encoding = df.groupby('City')['Purchase'].mean()

# Map the target mean to the 'City' column
df['City_encoded'] = df['City'].map(target_encoding)

print(df)
```

```text
       City  Purchase  City_encoded
0  New York         1           0.5
1    Boston         0           0.5
2   Chicago         1           0.5
3    Boston         1           0.5
4  New York         0           0.5
5   Chicago         0           0.5
```


<h2 id="section3"><u>Normalization</u></h2>

- Normalization, often called Min-Max Scaling, rescales the features to a specific range, usually [0, 1] or [-1, 1]. The key idea is to subtract the minimum value of the feature and then divide by the range (the difference between the maximum and minimum values). This method is useful when you want to ensure that all features have the same scale and are bounded within a specific range.

- Use normalization when you want the values of your data to fall within a specific range (like [0, 1]).

- It is especially useful for machine learning algorithms that rely on the distance between data points, such as:

  - K-Nearest Neighbors (KNN)
  - Neural Networks
  - Support Vector Machines (SVM) with RBF kernel
- Normalization can also help when you're dealing with non-Gaussian distributions or when features have different units or very different ranges (e.g., height in cm and weight in kg).

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply normalization
normalized_data = scaler.fit_transform(data)

print(normalized_data)
```

```text
[[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
```


### Normalization (Min-Max Scaling)

Normalization rescales the data into a fixed range, typically [0, 1]. The formula for normalization is:

$$
X_{\text{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)}
$$

Where:
- $X$ is the original feature value.
- $\min(X)$ is the minimum value of the feature.
- $\max(X)$ is the maximum value of the feature.
- $X_{\text{norm}}$ is the normalized value.
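
To connect the formula to the `MinMaxScaler` output, the sketch below applies it by hand, column by column, to the same sample array used in the scaler example above.

```python
import numpy as np

# Same sample array as in the MinMaxScaler example
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)

# Per-column minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# X_norm = (X - min(X)) / (max(X) - min(X)), applied to each column
normalized = (data - col_min) / (col_max - col_min)

print(normalized)
```

Each column runs from 0 to 1 in steps of 1/3, matching the scaler's result. A constant column would make the denominator zero, a case this manual version does not handle.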


<h2 id="section4"><u>Standardization</u></h2>

- Standardization, also known as Z-score normalization, rescales the features so that they have a mean of 0 and a standard deviation of 1. It is one of the most common methods for scaling features and is often preferred for algorithms that expect centered inputs with comparable variance, including those that rely on Gaussian assumptions.

### Standardization (Z-score Normalization)

Standardization rescales the data so that it has a mean of 0 and a standard deviation of 1. The formula for standardization is:

$$
X_{\text{std}} = \frac{X - \mu}{\sigma}
$$

Where:
- $X$ is the original feature value.
- $\mu$ (mu) is the mean of the feature.
- $\sigma$ (sigma) is the standard deviation of the feature.
- $X_{\text{std}}$ is the standardized value.
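
The other encoding and scaling sections include a worked example, so here is a comparable sketch for standardization. It applies `StandardScaler` to a small hypothetical array and checks the result against the formula computed by hand; both use the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample data: two features on slightly different scales
data = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])

# Standardize with scikit-learn
scaler = StandardScaler()
standardized = scaler.fit_transform(data)

# Same computation by hand: X_std = (X - mu) / sigma, per column
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardized)
print(np.allclose(standardized, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, but unlike min-max scaling the values are not confined to a fixed range.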


## Z score
- In machine learning, the Z-score is a statistical measure that describes how many standard deviations a data point is from the mean of a dataset. It's a way to standardize or normalize data, making it easier to compare values that come from different distributions or have different scales.


### Why Z-scores are used in Machine Learning
- Feature Scaling: In many machine learning algorithms (like k-Nearest Neighbors, Support Vector Machines, or neural networks), the model performance can be sensitive to the scale of the data. Features with different units or vastly different scales can cause some features to dominate others, leading to suboptimal performance. Standardizing features using Z-scores (a process called z-score normalization) ensures that each feature contributes equally to the model.

- Outlier Detection: Z-scores help identify outliers. A Z-score significantly higher or lower than 0 indicates that the data point is far away from the mean. Common thresholds are Z-scores of ±2 or ±3, which correspond to data points that are more than 2 or 3 standard deviations away from the mean, respectively.

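- Assumption for Algorithms: Many machine learning models assume or perform better when the data is approximately normally distributed (Gaussian distribution). Z-score normalization centers the data around 0 and gives it unit variance, but it is a linear rescaling and does not change the shape of the distribution. A skewed feature stays skewed after standardization and may need a separate transformation (such as a log transformation) first.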