# **DATA SCIENCE CYCLE PHASES:**
## The typical Data Science Cycle is divided into the following phases:
* ### PHASE 1: Problem Definition
* ### PHASE 2: Data Collection
* ### PHASE 3: Data Preparation
* ### PHASE 4: Data Exploration and Visualization
* ### PHASE 5: Feature Engineering and Preprocessing
* ### PHASE 6: Modeling
* ### PHASE 7: Model Evaluation
* ### PHASE 8: Deployment
* ### PHASE 9: Monitoring and Maintenance
* ### PHASE 10: Results Communication

 #########################################################################################################################

## PHASE 1: Problem Definition

**Description:** In this phase, the data scientist works closely with the stakeholders to understand the business problem and define the research question.

**Inputs:**

* Stakeholder requirements: This input consists of the business objectives, scope, constraints, and other requirements provided by the stakeholders.
* Domain knowledge: This input includes the knowledge of the industry, market trends, and other domain-specific information relevant to the problem.

**Outputs:**

* Research question: The output of this phase is a clear and specific research question that defines the problem statement, the data that will be collected, and the outcome that will be achieved.

 #########################################################################################################################
## PHASE 2: Data Collection

**Description:** In this phase, the data scientist collects the data required to answer the research question.

**Inputs:**

* Research question: This input defines the data that needs to be collected to answer the research question.
* Data sources: This input includes the sources from which the data will be collected, such as databases, APIs, files, or surveys.

**Outputs:**

* Raw data: The output of this phase is the raw data collected from the data sources.

 #########################################################################################################################
## PHASE 3: Data Preparation

**Description:** In this phase, the data scientist prepares the raw data for further analysis and modeling.

**Inputs:**

* Raw data: This input is the data collected in the previous phase.
* Data cleaning requirements: This input consists of the specific requirements for data cleaning, such as missing value imputation, outlier detection and removal, and data normalization.

**Outputs:**

* Cleaned data: The output of this phase is the cleaned data, ready for exploration and modeling.
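
The three cleaning requirements named above could be sketched roughly as follows. This is a minimal illustration with pandas, not a prescribed recipe: the `raw` DataFrame, its column names, the median imputation, the 1.5 × IQR rule, and min-max scaling are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one obvious outlier
raw = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 38.0],
    "income": [40_000.0, 52_000.0, 48_000.0, 1_000_000.0, 45_000.0, 50_000.0],
})

# 1. Missing value imputation: fill gaps with the column median
cleaned = raw.fillna(raw.median())

# 2. Outlier removal: keep rows whose income lies within 1.5 * IQR of the quartiles
q1, q3 = cleaned["income"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cleaned["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range].reset_index(drop=True)

# 3. Normalization: rescale every column to the [0, 1] range (min-max scaling)
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())
print(normalized)
```

Which imputation and outlier rules are appropriate depends on the data and the research question; the ones shown here are only common defaults.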

 #########################################################################################################################
## PHASE 4: Data Exploration and Visualization

**Description:** In this phase, the data scientist explores the cleaned data to gain a deeper understanding of the relationships between the variables and the characteristics of the data.

**Inputs:**

* Cleaned data: This input is the data prepared in the previous phase.

**Outputs:**

* Insights and visualizations: The outputs of this phase are the insights gained from exploring the data and the visualizations that illustrate the patterns and relationships in the data.
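
A first pass at exploration often starts with summary statistics and pairwise correlations. The sketch below assumes a small hypothetical `cleaned` DataFrame (the column names are invented for illustration) and leaves plotting out to stay dependency-free.

```python
import pandas as pd

# Hypothetical cleaned data (assumption for illustration)
cleaned = pd.DataFrame({
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [52.0, 58.0, 65.0, 70.0, 78.0],
    "absences": [6, 4, 5, 2, 1],
})

# Characteristics of the data: count, mean, spread, and quartiles per column
summary = cleaned.describe()

# Relationships between variables: pairwise Pearson correlations
correlations = cleaned.corr()

# Variable most strongly (positively or negatively) correlated with the target
strongest = correlations["score"].drop("score").abs().idxmax()
print(summary)
print(correlations)
print("Strongest relationship with score:", strongest)
```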

 #########################################################################################################################
## PHASE 5: Feature Engineering and Preprocessing

**Description:** In this phase, the data scientist creates new features from the existing data or preprocesses the data to make it suitable for modeling.

**Inputs:**

* Cleaned data: This input is the data prepared in the previous phase.

**Outputs:**

* Engineered or preprocessed data: The output of this phase is the data that has been transformed or engineered to make it suitable for modeling.
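
As a rough sketch of this phase, the example below derives one new feature and then scales the numeric columns and one-hot encodes a categorical column with scikit-learn's `ColumnTransformer`. The DataFrame, its column names, and the choice of transformers are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical cleaned data (assumption for illustration)
cleaned = pd.DataFrame({
    "area_m2": [50.0, 80.0, 120.0, 65.0],
    "rooms": [2, 3, 4, 2],
    "city": ["Lisbon", "Porto", "Lisbon", "Porto"],
})

# Feature engineering: derive a new feature from existing columns
cleaned["area_per_room"] = cleaned["area_m2"] / cleaned["rooms"]

numeric_cols = ["area_m2", "rooms", "area_per_room"]
categorical_cols = ["city"]

# Preprocessing: standardize numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
features = preprocessor.fit_transform(cleaned)
print(features.shape)
```

Fitting the transformer on the training data only, and reusing it unchanged on validation and test data, avoids leaking information from the held-out sets.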

 #########################################################################################################################
## PHASE 6: Modeling

**Description:** In this phase, the data scientist selects a suitable model, trains it on the data, and tests its performance.

**Inputs:**

* Engineered or preprocessed data: This input is the data prepared in the previous phase.
* Model selection criteria: This input consists of the criteria used to select a suitable model, such as accuracy, interpretability, and scalability.

**Outputs:**

* Trained model: The output of this phase is the trained model that can be used to make predictions or generate insights.
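
The code cells later in this notebook assume that training, validation, and test sets already exist. One possible way to produce them is sketched below, using synthetic data from `make_regression` as a stand-in for the engineered data and a 60/20/20 split; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered data (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Hold out 20% for testing, then 25% of the remainder (20% overall) for validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Train a simple baseline and check it on the validation set
baseline = LinearRegression().fit(X_train, y_train)
valid_r2 = baseline.score(X_valid, y_valid)
print(f"Validation R^2: {valid_r2:.3f}")
```

A simple baseline like this gives a reference point against which the tuned models can be compared.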

 #########################################################################################################################
## PHASE 7: Model Evaluation

**Description:** In this phase, the data scientist evaluates the performance of the model and fine-tunes its parameters to improve its accuracy and robustness.

**Inputs:**

* Trained model: This input is the model trained in the previous phase.
* Evaluation criteria: This input consists of the criteria used to evaluate the performance of the model. Classification models use metrics such as accuracy, precision, recall, F1 score, and ROC-AUC. Regression models use metrics such as mean squared error, mean absolute error, negative mean absolute error, mean absolute percentage error, R², explained variance score, etc.

**Outputs:**

* Improved model: The output of this phase is the model that has been fine-tuned to achieve better performance.
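
The metrics listed above are available in `sklearn.metrics`. The sketch below computes several of them on small hand-made arrays; the values are invented for illustration and do not come from any model in this notebook.

```python
from sklearn.metrics import (accuracy_score, explained_variance_score, f1_score,
                             mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Regression: hypothetical true values and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)

# Classification: hypothetical labels, hard predictions, and predicted scores
labels = [0, 0, 1, 1, 1, 0]
preds = [0, 1, 1, 1, 0, 0]
scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]
accuracy = accuracy_score(labels, preds)
precision = precision_score(labels, preds)
recall = recall_score(labels, preds)
f1 = f1_score(labels, preds)
auc = roc_auc_score(labels, scores)  # ROC-AUC needs scores, not hard labels

print(f"MSE={mse:.4f} MAE={mae:.4f} MAPE={mape:.4f} R2={r2:.4f} EVS={evs:.4f}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} F1={f1:.4f} ROC-AUC={auc:.4f}")
```

scikit-learn's model selection tools treat higher scores as better, which is why error metrics appear there in negated form, for example `scoring='neg_mean_squared_error'`.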

 #########################################################################################################################
## PHASE 8: Deployment

**Description:** In this phase, the data scientist deploys the trained model into the production environment to make predictions or generate insights.

**Inputs:**

* Improved model: This input is the model that has been fine-tuned to achieve better performance.
* Production environment: This input consists of the hardware and software infrastructure used to deploy and run the model in production.
* Deployment requirements: This input consists of the specific requirements for deploying the model, such as scalability, reliability, and security.

**Outputs:**

* Deployed model: The output of this phase is the model that has been deployed in the production environment and is ready to be used for prediction or generating insights.
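
A common first step towards deployment is persisting the fitted model so that a separate process can load it and make predictions. The sketch below uses `joblib` for this; the `Ridge` model, the synthetic data, and the file name `model.joblib` are assumptions for illustration, and a real deployment would add the serving infrastructure described above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in for the improved model (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
model = Ridge(alpha=1.0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"  # hypothetical artifact name
    joblib.dump(model, model_path)               # persist the fitted model
    served_model = joblib.load(model_path)       # load it as a serving process would

# The reloaded model should reproduce the original predictions
same_predictions = bool(np.allclose(model.predict(X), served_model.predict(X)))
print("Reloaded model matches original:", same_predictions)
```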

 #########################################################################################################################
## PHASE 9: Monitoring and Maintenance

**Description:** In this phase, the data scientist monitors the performance of the deployed model and performs regular maintenance to ensure that it continues to perform well.

**Inputs:**

* Deployed model: This input is the model that has been deployed in the production environment.
* Performance metrics: This input consists of the metrics used to monitor the performance of the model. Classification models use metrics such as accuracy, precision, recall, F1 score, and ROC-AUC. Regression models use metrics such as mean squared error, mean absolute error, negative mean absolute error, mean absolute percentage error, R², explained variance score, etc.
* Maintenance requirements: This input consists of the specific requirements for maintaining the model, such as updating the model parameters or retraining the model with new data.

**Outputs:**

* Maintained model: The output of this phase is the model that has been regularly maintained to ensure that it continues to perform well in the production environment.
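
One simple way to monitor a deployed regression model is to compare its error on recent data with the error measured at deployment time. The helper below, `needs_retraining`, is hypothetical, as are the `tolerance` factor and the sample values.

```python
import numpy as np


def needs_retraining(y_true, y_pred, baseline_mse, tolerance=1.5):
    """Return (current_mse, flag); flag is True when MSE exceeds tolerance * baseline_mse."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    current_mse = float(np.mean(errors ** 2))
    return current_mse, current_mse > tolerance * baseline_mse


baseline_mse = 1.0  # hypothetical test MSE recorded when the model was deployed

# A recent batch where predictions are still close to the observed values
stable = needs_retraining([10.0, 12.0, 9.0], [10.5, 11.0, 9.5], baseline_mse)

# A recent batch where predictions have drifted away from the observed values
drifted = needs_retraining([10.0, 12.0, 9.0], [13.0, 8.0, 12.0], baseline_mse)

print("stable batch:", stable)
print("drifted batch:", drifted)
```

When the flag is raised, the maintenance step would typically retrain the model with new data, as described in the inputs above.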

 #########################################################################################################################
## PHASE 10: Results Communication

**Description:** This final phase focuses on presenting the insights gained from the data science project to stakeholders. The results are communicated clearly and in a form that non-technical audiences, such as executives or clients, can easily understand.

It involves creating visualizations, reports, and presentations that effectively convey the results of the project and any insights that were gained from the data. The goal is to ensure that the stakeholders can understand the impact and implications of the project and make informed decisions based on the results.

This phase also involves identifying areas for future improvement and considering the implications of the results for the organization as a whole.

**Inputs:**

* Maintained model: This input is the model that has been regularly maintained to ensure that it continues to perform well in the production environment.
* Insights and results: This input consists of the insights and results generated by the model, such as predictions, recommendations, or visualizations.
* Storytelling requirements: This input consists of the specific requirements for communicating the insights and results, such as audience, format, and tone.

**Outputs:**

* Data insights and story: The output of this phase is the communication of the insights and results generated by the model to the stakeholders, in a way that is clear, concise, and engaging. The data scientist tells the story of how the research question was answered, and how the insights and results can be used to inform decision-making or drive business value.


In [None]:
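from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Candidate regressors and the hyperparameter grid to search for each one
models = {
    'Linear Regression': {
        'model': LinearRegression(),
        'params': {
            # 'normalize' was removed in scikit-learn 1.2; scale features during preprocessing instead
            'fit_intercept': [True, False]
        }
    },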
    'Lasso Regression': {
        'model': Lasso(),
        'params': {
            'alpha': [0.01, 0.1, 1, 10]
        }
    },
    'Ridge Regression': {
        'model': Ridge(),
        'params': {
            'alpha': [0.01, 0.1, 1, 10]
        }
    },
    'Elastic Net Regression': {
        'model': ElasticNet(),
        'params': {
            'alpha': [0.01, 0.1, 1, 10],
            'l1_ratio': [0.25, 0.5, 0.75]
        }
    },
    'Support Vector Regression': {
        'model': SVR(),
        'params': {
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': [0.1, 1, 10],
            'epsilon': [0.01, 0.1, 1]
        }
    },
    'Gradient Boosting Regression': {
        'model': GradientBoostingRegressor(),
        'params': {
            'n_estimators': [50, 100, 150],
            'learning_rate': [0.01, 0.1, 1],
            'max_depth': [3, 5, 7],
            'subsample': [0.5, 0.75, 1]
        }
    },
    'AdaBoost Regression': {
        'model': AdaBoostRegressor(),
        'params': {
            'n_estimators': [50, 100, 150],
            'learning_rate': [0.01, 0.1, 1],
            'loss': ['linear', 'square', 'exponential']
        }
    },
    'XGBoost Regression': {
        'model': XGBRegressor(),
        'params': {
            'max_depth': [3, 5, 7],
            'learning_rate': [0.1, 0.01, 0.001],
            'n_estimators': [50, 100, 150]
        }
    },
    'Random Forest Regression': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [50, 100, 150],
            'max_depth': [3, 5, 7],
            'max_features': ['sqrt', 'log2']
        }
    },
    'Extra Trees Regression': {
        'model': ExtraTreesRegressor(),
        'params': {
            'n_estimators': [50, 100, 150],
            'max_depth': [3, 5, 7],
            'max_features': ['sqrt', 'log2']
        }
    }
}

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error


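# Assume X_train, y_train, X_valid, y_valid, X_test, y_test are pandas DataFrames/Series
preprocessing_pipeline = Pipeline([
    # Placeholder step that leaves the data unchanged (an empty step list raises an error);
    # replace it with your own preprocessing steps, e.g. ('scaler', StandardScaler())
    ('identity', 'passthrough')
])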

X_train_transformed = preprocessing_pipeline.fit_transform(X_train, y_train)
X_valid_transformed = preprocessing_pipeline.transform(X_valid)
X_test_transformed = preprocessing_pipeline.transform(X_test)

## Define models dictionary

In [2]:
models = {}

# Linear Regression
models['Linear Regression'] = {
    'model': LinearRegression(),
    'params': {
        'fit_intercept': [True, False]  # 'normalize' was removed in scikit-learn 1.2
    }
}

# Lasso Regression
models['Lasso Regression'] = {
    'model': Lasso(),
    'params': {
        'alpha': [0.1, 0.5, 1, 2, 5, 10],
        # 'normalize' was removed in scikit-learn 1.2; scale features during preprocessing instead
        'max_iter': [1000, 5000, 10000]
    }
}

# Ridge Regression
models['Ridge Regression'] = {
    'model': Ridge(),
    'params': {
        'alpha': [0.1, 0.5, 1, 2, 5, 10],
        # 'normalize' was removed in scikit-learn 1.2; scale features during preprocessing instead
        'max_iter': [1000, 5000, 10000]
    }
}

# Support Vector Regression
models['SVR'] = {
    'model': SVR(),
    'params': {
        'C': [0.1, 0.5, 1, 2, 5, 10],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'gamma': ['scale', 'auto']
    }
}

# Random Forest Regression
models['Random Forest Regression'] = {
    'model': RandomForestRegressor(),
    'params': {
        'n_estimators': [50, 100, 150, 200],
        'max_depth': [3, 5, 7, 9],
        'min_samples_split': [2, 3, 4],
        'min_samples_leaf': [1, 2, 3],
        'bootstrap': [True, False]
    }
}

# XGBoost Regression
models['XGBoost Regression'] = {
    'model': XGBRegressor(),
    'params': {
        'learning_rate': [0.05, 0.1, 0.15],
        'max_depth': [3, 5, 7, 9],
        'n_estimators': [50, 100, 150, 200],
        'objective': ['reg:squarederror']
    }
}


## Let's first try a models dictionary with 1 model and only 4 possible combinations of hyperparameters

In [None]:
models = {}
# XGBoost Regression
models['XGBoost Regression'] = {
    'model': XGBRegressor(),
    'params': {
        'learning_rate': [0.05],
        'max_depth': [3, 5, 7, 9],
        'n_estimators': [200],
        'objective': ['reg:squarederror']
    }
}
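
To check how many candidates a grid contains before launching a search, scikit-learn's `ParameterGrid` can enumerate it. The sketch below copies the reduced XGBoost grid from the cell above and assumes 5-fold cross-validation with a final refit.

```python
from sklearn.model_selection import ParameterGrid

# Same reduced grid as in the cell above
params = {
    'learning_rate': [0.05],
    'max_depth': [3, 5, 7, 9],
    'n_estimators': [200],
    'objective': ['reg:squarederror'],
}

grid = ParameterGrid(params)
n_combinations = len(grid)  # 1 * 4 * 1 * 1 candidates

# With 5-fold cross-validation, each candidate is fitted 5 times,
# plus one final refit of the best candidate on the full training set
cv_folds = 5
n_fits = n_combinations * cv_folds + 1

for candidate in grid:
    print(candidate)
print(f"{n_combinations} combinations, {n_fits} model fits in total")
```

Running the same count on the larger grids gives a quick estimate of how long a full search is likely to take.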

## Loop over each model to train, tune, and retrain the final model with the best parameters on the joined training+validation datasets

In [None]:
import numpy as np
from sklearn.base import clone

# Join the transformed training and validation sets once, for the final retraining step
X_train_valid = np.vstack([X_train_transformed, X_valid_transformed])
y_train_valid = np.concatenate([np.ravel(y_train), np.ravel(y_valid)])

for model_name, model in models.items():
    # Tune hyperparameters with 5-fold cross-validation on the training set.
    # No XGBoost-specific fit arguments (eval_set, early_stopping_rounds) are passed,
    # so the same loop works for every estimator in the dictionary.
    gs = GridSearchCV(model['model'], model['params'], cv=5,
                      scoring='neg_mean_squared_error', return_train_score=True,
                      n_jobs=-1, verbose=1, refit=True)
    gs.fit(np.asarray(X_train_transformed), np.ravel(y_train))

    # Store the best estimator (already refit on the full training set) and its parameters
    model['best_model'] = gs.best_estimator_
    model['best_params'] = gs.best_params_

    # Retrain a fresh copy with the best hyperparameters on the joined training+validation data
    final_model = clone(gs.best_estimator_)
    final_model.fit(X_train_valid, y_train_valid)
    model['final_best_model'] = final_model

    # Evaluate the final model on the transformed test set and store the predictions and metric score
    y_test_pred = final_model.predict(np.asarray(X_test_transformed))
    model['test_predictions'] = y_test_pred
    model['test_mse'] = mean_squared_error(y_test, y_test_pred)

## Compare the models by test mean squared error

In [None]:
# Print the test mean squared error for each model (not ordered)
for model_name, model in models.items():
    print(f'{model_name} test mean squared error: {model["test_mse"]:.2f}')

# Print the top models with the lowest test mean squared error (in ascending order)
sorted_models = sorted(models.items(), key=lambda x: x[1]['test_mse'])
print('\nTop Models:')
for rank, (model_name, model) in enumerate(sorted_models, start=1):
    print(f'{rank}: {model_name} test mean squared error: {model["test_mse"]:.2f}')