# Practice: how to build an end-to-end model given any sort of data

Since my interview will rely entirely on whether I can build a model from start to finish given any set of data, this notebook acts as a guide to making that end-to-end modeling possible.

## Steps for an end-to-end model

Assuming you have a specific dataset:

**Understand the Problem Statement**:

Before diving into building a machine learning model, it's crucial to have a clear understanding of the problem you're trying to solve. Define your problem statement, objectives, and success criteria.

**Collect and Explore the Data**:

Obtain the dataset relevant to your problem statement. Understand the structure of the data, its features, and the target variable.
Perform exploratory data analysis (EDA) to gain insights into the data. This includes visualizations, summary statistics, and identifying patterns or correlations in the data.
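
As a quick illustration of that first look at the data, here is a minimal sketch. The DataFrame `df` and its column names (`age`, `income`, `city`, `target`) are made up for the example; swap in your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; replace with your own DataFrame
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [40_000, 52_000, 88_000, 91_000, 61_000, 58_000],
    "city": ["A", "B", "A", "C", "B", None],
    "target": [0, 0, 1, 1, 0, 1],
})

print(df.shape)    # number of rows and columns
print(df.dtypes)   # type of each column

summary = df.describe()                       # summary statistics for numerical columns
null_counts = df.isnull().sum()               # missing values per column
target_counts = df["target"].value_counts()   # class balance of the target

# Linear correlation of each numerical feature with the target
corr_with_target = df.corr(numeric_only=True)["target"]
print(corr_with_target)
```

These few calls are usually enough to decide what needs cleaning before any plotting.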

**Data Preprocessing**:

Clean the data by handling missing values, outliers, and inconsistencies.
Encode categorical variables if necessary (e.g., one-hot encoding, label encoding).
Scale or normalize numerical features to ensure they're on similar scales.

**Feature Engineering**:

Create new features that might be useful for the model.
Select relevant features based on domain knowledge or feature importance techniques.

**Split the Data**:

Split the dataset into training, validation, and test sets. A common split is 70% for training, 15% for validation, and 15% for testing.
Ensure the splits maintain the distribution of the target variable, especially in cases of imbalanced datasets.
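
One way to get a 70/15/15 split that preserves the target distribution is to call `train_test_split` twice with `stratify`. This is a sketch on synthetic, imbalanced data (20% positives); the arrays `X` and `y` are placeholders for your own features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 200 samples, 20% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = np.array([1] * 40 + [0] * 160)

# First split off 30% as a temporary hold-out, stratified on the target
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Then split the hold-out in half: 15% validation and 15% test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())  # positive rate per split
```

The positive rate should come out close to the same in all three splits, which is the point of stratifying.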

**Select a Model**:

Choose a suitable machine learning algorithm based on the nature of your problem (classification, regression, clustering, etc.), the size of your dataset, and computational resources.
Consider starting with simple models such as linear or logistic regression, then gradually move to more complex ones like decision trees, random forests, support vector machines, or neural networks.

**Train the Model**:

Train the selected model on the training dataset using appropriate training techniques (e.g., gradient descent, backpropagation).
Tune hyperparameters using techniques like grid search, random search, or Bayesian optimization to improve model performance.
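
A minimal grid search sketch, assuming a logistic regression classifier on synthetic data. The grid values for `C` are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=42)

# Illustrative grid over the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,           # 5-fold cross-validation for every candidate
    scoring="f1",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))

# By default the best candidate is refit on the full training set
best_model = search.best_estimator_
```

`RandomizedSearchCV` takes the same shape of arguments and is usually the cheaper choice when the grid gets large.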

**Evaluate the Model**:

Assess the model's performance on the validation set using appropriate evaluation metrics (accuracy, precision, recall, F1-score, RMSE, etc.).
Adjust the model or experiment with different algorithms if performance is not satisfactory.
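
To make the metrics concrete, here is a small sketch with hand-made labels and predictions (the values are invented for the example). It covers the classification metrics listed above plus RMSE for a regression case.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Hypothetical validation labels and predictions: 3 TP, 3 TN, 1 FP, 1 FN
y_valid = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_valid, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_valid, y_pred)  # TP / (TP + FP)
recall = recall_score(y_valid, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_valid, y_pred)                # harmonic mean of precision and recall

# Regression: RMSE is the square root of the mean squared error
y_true_reg = np.array([3.0, 5.0, 2.0])
y_pred_reg = np.array([2.0, 5.0, 4.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

print(accuracy, precision, recall, f1, rmse)
```

On imbalanced data, accuracy alone can look good while recall is poor, so it helps to report at least precision, recall, and F1 together.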

**Fine-tune the Model**:

Refine the model by adjusting parameters or features based on insights gained from the evaluation phase.
Reduce overfitting with regularization techniques (e.g., L1/L2 regularization, dropout).
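
A small sketch of how L1 and L2 regularization behave differently, using synthetic regression data where only 5 of 20 features matter. The `alpha` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, only 5 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)

ols = LinearRegression().fit(X, y)   # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=5.0).fit(X, y)   # L1: can set some coefficients exactly to zero

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(f"OLS coefficient norm:   {ols_norm:.2f}")
print(f"Ridge coefficient norm: {ridge_norm:.2f}")
print(f"Lasso zeroed coefficients: {n_zero_lasso} of {len(lasso.coef_)}")
```

Ridge keeps every feature but with smaller weights, while lasso doubles as a rough feature selector.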

**Validate the Model**:

Once satisfied with the model's performance on the validation set, evaluate it on the test set to get an unbiased estimate of its performance.

**Interpret the Results**:

Interpret the model's predictions and understand how it's making decisions.
Analyze important features and their impact on predictions.
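
One model-agnostic way to analyze feature impact is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. A sketch on synthetic data, where the informative features are known to be columns 0 and 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the 2 informative features are columns 0 and 1
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score drop on the validation set when each feature is shuffled
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)

# Features ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

Unlike `feature_importances_` on tree models, this works for any fitted estimator and is computed on data the model did not train on.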

```python
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from pathlib import Path

# preprocessing imports
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, Normalizer

# feature selection imports
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# feature engineering imports

# model imports
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import PCA

# validation imports
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, accuracy_score, f1_score, recall_score

# utils
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

pd.options.display.max_columns = 300
```

## Preprocessing Data

1. Start by separating the categorical (non-numeric) columns from the numerical ones.

```python
non_num_columns = train_data.select_dtypes(exclude=np.number).columns
num_columns = train_data.select_dtypes(include=np.number).columns
```

2. Count the nulls in every column (categorical and numerical) and plot the percentage of missing values per feature.

```python
null_df = pd.DataFrame(train_data.isnull().sum(), columns=["nulls"]).reset_index()
null_df.rename(columns={"index": "features"}, inplace=True)

# Keep only the features that actually have missing values
null_df = null_df[null_df.nulls > 0].copy()
null_df["percentage_null"] = null_df.nulls / train_data.shape[0] * 100

# catplot creates its own figure, so size it with height rather than plt.figure
sb.catplot(data=null_df, x="percentage_null", y="features", kind="bar", height=10)
```

3. Impute all columns that contain nulls.

    1. If `numerical`, use the `median`; optionally add a bit of noise within the same variance range.
    2. If `categorical`, introduce `"unknown"` as a new category.

> For Numerical

```python
# Step 1: Identify numerical columns with null values
numeric_X = X.select_dtypes(include=np.number)
columns_with_nulls = numeric_X.columns[numeric_X.isnull().any()].tolist()

# Step 2: Calculate column-wise medians for the columns with nulls
column_medians = numeric_X[columns_with_nulls].median()

# Step 3: Replace null values with the column medians
for col in columns_with_nulls:
    X[col] = X[col].fillna(column_medians[col])
```

> For Categorical

```python
# Use "unknown" as the fill value, which adds a new category
fill_value = "unknown"

# Replace null values in the categorical columns with "unknown"
for col in non_num_columns_new:
    X[col] = X[col].fillna(fill_value)
```

4. **CLEAR DUPLICATES**

```python
# Replace "<column>" with the column(s) that identify a duplicate row
df = df.drop_duplicates(subset=["<column>"])
```

## Do minor EDA to see what the data looks like

1. Start by checking the distribution of the target. If it is categorical, use a `catplot` with `kind="count"`; if it is numerical, use a KDE density plot.

```python
# Categorical target: number of samples per class
sb.catplot(data=train_data, x="target", kind="count", height=5)

# Numerical target: density of the target values
plt.figure(figsize=(5, 5))
sb.kdeplot(data=rejoined_xy, x="SalePrice")
```

2. If you have a mix of a categorical variable and numerical features, plot each numerical feature against the category:

```python
fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(25, 25))

# Numerical features to plot
features_lines = ["BsmtFinSF1", "GarageArea", "GrLivArea", "1stFlrSF"]

# One subplot per feature, each against the OverallQual category
for ax, feature in zip(axs.flat, features_lines):
    sb.lineplot(data=rejoined_xy, x="OverallQual", y=feature, ax=ax)
```

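3. To check the overall pattern between two numerical variables, use a `regplot` or a scatter plot.

```python
# mpg is seaborn's example dataset, loaded with sb.load_dataset("mpg")
# Fit the regression against log(x)
sb.regplot(data=mpg, x="displacement", y="mpg", logx=True)

# On a new figure: second-order polynomial fit, showing the mean of y at each discrete x
plt.figure()
sb.regplot(data=mpg, x="cylinders", y="acceleration", x_estimator=np.mean, order=2)
```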

4. Check for outliers in the data

```python
sb.boxplot(data=rejoined_xy, x="SalePrice")
```

## Feature Engineering

Steps:
1. Remove outliers

```python
def remove_outlier(df, column_name):
    column_data = df[column_name]

    # Calculate the first quartile (Q1) and third quartile (Q3) of the column
    Q1 = column_data.quantile(0.25)
    Q3 = column_data.quantile(0.75)

    # Calculate the IQR (Interquartile Range)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outlier detection
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Boolean mask of rows whose value lies within the bounds
    within_bounds = (column_data >= lower_bound) & (column_data <= upper_bound)

    # Keep only the rows without outliers
    return df[within_bounds]
```

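2. One-hot encode categorical data

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_data = encoder.fit_transform(X[cat_columns])

# Keep X's index so the concat below aligns row by row
encoded_df = pd.DataFrame(
    encoded_data,
    columns=encoder.get_feature_names_out(cat_columns),
    index=X.index,
)

# Swap the original categorical columns for their encoded versions
X = pd.concat([X.drop(columns=cat_columns), encoded_df], axis=1)
```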

3. Consider binning your data:

```python
from sklearn.preprocessing import KBinsDiscretizer

# Split GrLivArea into 7 equal-width bins, encoded as ordinal integers
binner = KBinsDiscretizer(n_bins=7, encode="ordinal", strategy="uniform")
binned_grlivarea = binner.fit_transform(X[["GrLivArea"]])

# fit_transform returns a 2D array, so flatten it before assigning
X["GrLivBinned"] = binned_grlivarea.ravel()
X["GrLivBinned"]
```

4. After feature importance, consider scaling your data.


```python
scaler = StandardScaler()
scaled_features_x = scaler.fit_transform(resampled_train_x)
```
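
One detail worth showing when scaling: fit the scaler on the training data only, then reuse it on validation or test data, so no information leaks from the held-out rows. A minimal sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The test row is expressed in units of the training distribution, which is what the model will expect at prediction time.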

##  Feature Selection

Three main methods for feature selection:

1. Correlation with the target

```python
# Join the encoded features X with the target so they can be correlated together
to_corr_df = pd.concat([X, y], axis=1)
corr_df = to_corr_df.corr(method="pearson", numeric_only=True)

plt.figure(figsize=(10, 25))
sb.heatmap(corr_df[["SalePrice"]], cbar=True, cmap="Blues")
```

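2. Mutual information (information gain)

```python
from sklearn.feature_selection import mutual_info_regression

# Use mutual_info_classif instead when the target is categorical
info_gain_df = pd.DataFrame(
    mutual_info_regression(X, y), columns=["info_gain"], index=X.columns
).reset_index().rename(columns={"index": "features"})

sb.catplot(data=info_gain_df, x="info_gain", y="features", kind="bar", height=10)
```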

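3. Tree-based feature importance

```python
# Assumes X_train is a DataFrame, so its columns are the feature names
features = X_train.columns
features_to_keep = pd.DataFrame()  # collects importances, e.g. one batch per fold

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict on the test data
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

# Record the importance the forest assigned to each feature
fold_importance_df = pd.DataFrame()
fold_importance_df["Feature"] = features
fold_importance_df["importance"] = rf.feature_importances_
features_to_keep = pd.concat([features_to_keep, fold_importance_df], axis=0)
```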

## Train Test Split with cross validation


> Train test split:

```python
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(
    x_pca, resampled_train_y, test_size=0.10, shuffle=True, random_state=42
)
```


> Cross-validate:

```python
from sklearn.model_selection import cross_val_score

# Regression: negative MSE (or scoring="r2")
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_squared_error")

# Classification: use scoring="roc_auc" or scoring="f1" instead
```
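
For the classification case, a sketch of the same idea with stratified folds and the `roc_auc` and `f1` scorers. The data is synthetic and imbalanced (roughly 80/20), and logistic regression is only a stand-in model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for a real training set
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
print(f"F1:      {f1_scores.mean():.3f} +/- {f1_scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a better sense of stability than a single split.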

## Set model pipeline


Show the interviewer that you can package all the previous steps as a custom preprocessor by extending scikit-learn's base classes.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, new_feature_value):
        self.new_feature_value = new_feature_value

    def fit(self, X, y=None):
        # No fitting necessary, so we return self
        return self

    def transform(self, X):
        # Add a new feature with the specified value to each sample
        new_feature = [self.new_feature_value] * len(X)
        return np.c_[X, new_feature]
```
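
A sketch of how such a transformer slots into a `Pipeline` next to built-in steps. The class is repeated so the snippet runs on its own; the toy arrays and step names are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, new_feature_value):
        self.new_feature_value = new_feature_value

    def fit(self, X, y=None):
        # Nothing to learn, so return self
        return self

    def transform(self, X):
        # Append a constant column to every sample
        new_feature = [self.new_feature_value] * len(X)
        return np.c_[X, new_feature]


pipe = Pipeline([
    ("scale", StandardScaler()),
    ("add_feature", CustomTransformer(new_feature_value=1.0)),
    ("model", LinearRegression()),
])

# Toy data where y = x0 + x1 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 8.0, 7.0])

pipe.fit(X, y)

# Output of every step except the final model
transformed = pipe[:-1].transform(X)
print(transformed.shape)
print(pipe.predict(X))
```

Because the transformer follows the `fit`/`transform` contract, it also works inside `GridSearchCV` and `cross_val_score` without extra code.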



1. Regression

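Start with plain linear regression.

If you have already selected features, use ridge regression; otherwise use lasso, since its L1 penalty can shrink the coefficients of uninformative features to zero.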

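```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression", Ridge()),
    ("Lasso Regression", Lasso()),
    ("Decision Tree Regression", DecisionTreeRegressor()),
]

# Hyperparameter grids per model, e.g. for GridSearchCV; the values are illustrative only
params = {
    "Linear Regression": {},
    "Ridge Regression": {"alpha": [0.1, 1.0, 10.0]},
    "Lasso Regression": {"alpha": [0.1, 1.0, 10.0]},
    "Decision Tree Regression": {"max_depth": [3, 5, None]},
}

for name, model in models:
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name}: Mean Squared Error = {mse:.2f}")
```

2. Classification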

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Create and train various classification models
models = [
    ("Logistic Regression", LogisticRegression()),
    ("Decision Tree Classifier", DecisionTreeClassifier()),
    ("Random Forest Classifier", RandomForestClassifier()),
    ("Gradient Boosting Classifier", GradientBoostingClassifier()),
    ("Support Vector Classifier", SVC()),
    ("K-Nearest Neighbors Classifier", KNeighborsClassifier()),
]

for name, model in models:
    # Fit the model
    model.fit(X_train_scaled, y_train)

    # Make predictions
    y_pred = model.predict(X_test_scaled)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy = {accuracy:.2f}")
```

3. Pipeline
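
A possible sketch of tying the earlier steps (median imputation, `"unknown"` category, one-hot encoding, scaling, model) into a single `Pipeline` with a `ColumnTransformer`. The toy DataFrame, its column names (`area`, `rooms`, `zone`), and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types
X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0, 110.0, 70.0],
    "rooms": [2, 3, 3, 5, 2, 4, 4, 3],
    "zone": pd.Series(["a", "b", "a", np.nan, "b", "a", "b", "a"], dtype=object),
})
y = np.array([0, 0, 0, 1, 0, 1, 1, 0])

num_columns = ["area", "rooms"]
cat_columns = ["zone"]

# Numerical columns: median imputation, then scaling
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill nulls with "unknown", then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, num_columns),
    ("cat", categorical_steps, cat_columns),
])

clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(X, y)
preds = clf.predict(X)
n_features_out = clf.named_steps["preprocess"].transform(X).shape[1]
print(preds, n_features_out)
```

Keeping preprocessing inside the pipeline means every cross-validation fold refits the imputers, encoder, and scaler on its own training portion, which avoids leakage.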