## Ways to improve model performance
1. Data. Get better data and more of it.
2. Feature engineering. Create or transform features that expose useful signal.
3. Clean up the data. Remove redundant columns (features) to reduce multicollinearity.
4. Parameter tuning.
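
One common way to reduce multicollinearity is to drop one feature from each highly correlated pair. The sketch below is only an illustration: the helper name `drop_correlated_features`, the 0.9 threshold, and the synthetic columns are all assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr().abs()
    # keep only the upper triangle so each pair of columns is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# synthetic data: 'a_copy' is almost a linear function of 'a'
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'a_copy': a * 2 + rng.normal(scale=0.01, size=200),
    'b': b,
})

reduced, dropped = drop_correlated_features(demo, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Which column of a correlated pair gets dropped depends on column order here; in practice you may prefer to keep the feature that is easier to interpret.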

## Parameter Tuning
- Ways to tune parameters:
1. GridSearchCV


In [None]:
#using GridSearchCV to find the best parameters for the DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

#load the data
data = pd.read_csv('data.csv')

#split the data into features and target
X = data.drop('target', axis=1)
y = data['target']

#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#create the DecisionTreeClassifier
clf = DecisionTreeClassifier()

#set the parameters to search
param_grid ={
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],

}

#use GridSearchCV to find the best parameters
#clf is the model, param_grid is the parameters, cv is the number of folds, scoring is the metric to use
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
#fit the model
grid_search.fit(X_train, y_train)
#find the best parameters
print(grid_search.best_params_)

In [None]:
#GridSearchCV  for LogisticRegression
from sklearn.linear_model import LogisticRegression

#load the data
data = pd.read_csv('data.csv')

#split the data into features and target
X = data.drop('target', axis=1)
y = data['target']

#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#create the LogisticRegression
clf = LogisticRegression(max_iter=1000)  # a higher max_iter gives the solver more iterations to converge and avoids convergence warnings

#set the parameters to search
#a list of grids keeps every combination valid: lbfgs supports only the l2 penalty, liblinear supports both l1 and l2
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]  # C is the inverse of the regularization strength, smaller values specify stronger regularization
param_grid = [
    # l1 is Lasso-style: it can shrink the coefficients of unneeded columns to exactly zero
    # l2 is Ridge-style: it shrinks the coefficients of unneeded columns toward zero without removing them
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},  # solver is the algorithm used in the optimization problem
]

#use GridSearchCV to find the best parameters
grid_search = GridSearchCV(clf,param_grid,cv=5, scoring='accuracy')
#fit the model
grid_search.fit(X_train, y_train)
#find the best parameters
print(grid_search.best_params_)



## Pipelines
- A way of chaining more than one step into a single estimator,
e.g. applying the StandardScaler and then fitting a LogisticRegression model.
- Typically we want to clean our data, transform it, potentially use feature selection, and then run a machine learning algorithm. Using pipelines, you can do all these steps in one go!
- Very useful if you want to deploy your model. It is easier to change one pipeline than to change code scattered across several steps.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler',StandardScaler()),
    ('clf',DecisionTreeClassifier(random_state=42))
])

param_grid= {
    'clf__criterion': ['gini', 'entropy'], #criteria to split the data either gini or entropy.
    'clf__max_depth': [1, 2, 5, 10],# maximum depth of the tree. 
    'clf__min_samples_split': [2, 5, 10], #minimum number of samples required to split an internal node
    'clf__min_samples_leaf': [1, 2, 5, 10], #minimum number of samples required to be at a leaf node

}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)


#or
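pipe = Pipeline([
    ('scaler',StandardScaler()),
    ('clf',LogisticRegression(random_state=42, max_iter=1000))
])

#inside a pipeline, parameter names need the step name and a double underscore as prefix ('clf__')
#lbfgs supports only the l2 penalty, so each solver gets its own grid
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000] #inverse of the regularization strength; the smaller the value the stronger the regularization
grid = [
    {'clf__solver': ['liblinear'], 'clf__penalty': ['l1', 'l2'], 'clf__C': C_values},
    {'clf__solver': ['lbfgs'], 'clf__penalty': ['l2'], 'clf__C': C_values},
]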

grid_search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
#Get the test score
print(grid_search.score(X_test, y_test))
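
The pipelines above only scale and classify. As a rough sketch of the fuller "clean, transform, select features, then model" flow described in the notes, the example below chains an imputer, a scaler, a feature selector, and a classifier. The synthetic data, the injected missing values, the step names, and the grid values are all assumptions made so the example runs without `data.csv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for data.csv (assumption), with about 5% of values set to NaN
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

full_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),   # clean: fill missing values
    ('scaler', StandardScaler()),                    # transform: standardize features
    ('select', SelectKBest(score_func=f_classif)),   # feature selection
    ('clf', LogisticRegression(max_iter=1000)),      # model
])

# any step can be tuned through the '<step name>__<parameter>' naming rule
demo_grid = {
    'select__k': [3, 5, 10],
    'clf__C': [0.1, 1, 10],
}

demo_search = GridSearchCV(full_pipe, demo_grid, cv=5, scoring='accuracy')
demo_search.fit(Xd_train, yd_train)
print(demo_search.best_params_)

demo_test_score = demo_search.score(Xd_test, yd_test)
print(demo_test_score)
```

Because the imputer and scaler live inside the pipeline, they are refit on the training part of each fold, so nothing from the validation fold leaks into preprocessing.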

## Ensemble methods
- Machine learning algorithms
- **Ensemble:** an algorithm that makes use of more than one model to make a prediction.
- Models used in ensembles:
1. Random Forest: an ensemble method for decision trees that takes advantage of bagging and the subspace sampling method to create a "forest" of decision trees that usually predicts better than any single decision tree.
2. Gradient Boosted Trees.
- Ensemble methods are powerful techniques in machine learning that combine multiple models to produce a more robust and accurate prediction than a single model typically achieves on its own. The core idea is that by aggregating the predictions of several models, the ensemble can reduce variance, reduce bias, or otherwise improve the overall predictive performance.

**Key Concepts of Ensemble Methods:**

1. Diversity of Models: Ensemble methods rely on the principle that different models may capture different patterns in the data. By combining these models, the ensemble can mitigate the weaknesses of individual models.

2. Aggregation: The outputs of the individual models are combined using methods like voting, averaging, or stacking, to produce the final prediction.

**Types of Ensemble Methods:**

1. **Bagging (Bootstrap Aggregating)** - Bagging involves training multiple models on different random subsets of the original training data, sampled with replacement. Each model is trained independently, and the final prediction is made by averaging the predictions (for regression) or by majority voting (for classification).

- Example: Random Forest is a popular bagging method where multiple decision trees are trained, and their predictions are aggregated.

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
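
To make the bagging idea concrete, here is a minimal hand-rolled sketch: bootstrap samples, one tree per sample, then a majority vote. It is not how `RandomForestClassifier` is implemented (a random forest also samples a subset of features at each split), and the synthetic data, the 25 trees, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (assumption)
Xb, yb = make_classification(n_samples=400, n_features=10, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a binary majority vote cannot tie
trees = []
for _ in range(n_models):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(Xb_train), size=len(Xb_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(Xb_train[idx], yb_train[idx])
    trees.append(tree)

# one row of predictions per tree: shape (n_models, n_test_samples)
all_preds = np.array([tree.predict(Xb_test) for tree in trees])

# majority vote for labels 0/1: predict 1 when more than half the trees say 1
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == yb_test).mean()
print(bagged_accuracy)
```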


2. **Boosting** - Boosting trains models sequentially, where each new model tries to correct the errors made by the previous ones. The models are combined to form a strong learner, with each model focusing on the mistakes made by its predecessors.
- Gradient Boosting Machines (GBM), AdaBoost, XGBoost, and LightGBM are popular boosting algorithms.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

3. **Stacking** - Stacking involves training multiple models (often different types of models) and then using another model (meta-learner) to combine the predictions of these base models. The meta-learner is trained on the predictions of the base models as input features.
- You might use a logistic regression model as a meta-learner to combine the outputs of a decision tree, a random forest, and a support vector machine.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

estimators = [
    ('dt', DecisionTreeClassifier()),
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('svc', SVC(probability=True))
]

model = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)


4. **VotingClassifier** - A voting classifier is an ensemble that combines several different models and predicts the class label by majority vote (hard voting) or by averaging the predicted class probabilities (soft voting). The models can be of different types. For regression, scikit-learn provides VotingRegressor, which averages the predictions of the individual models.

- Example: Combining a decision tree, a logistic regression, and a k-nearest neighbors (KNN) classifier into a voting classifier.

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = DecisionTreeClassifier()

voting_clf = VotingClassifier(estimators=[
    ('lr', model1),
    ('knn', model2),
    ('dt', model3)
], voting='hard')  # 'hard' for majority vote, 'soft' for averaging the predicted class probabilities
voting_clf.fit(X_train, y_train)


## Difference between Boosting and Bagging
- Compared here as Bagging (Random Forest) vs Boosting (Gradient Boosting).
1. **Independent vs Iterative:** In bagging, a random forest trains each tree independently, so the trees can be trained in parallel.
- Boosting trains each tree iteratively.
- In a random forest model, how well or poorly a given tree does has no effect on any of the other trees since they are all trained independently. Boosting, on the other hand, trains trees one at a time, identifies the weak points for those trees, and then purposefully creates the next round of trees in such a way as to specialize in those weak points.
2. **Weak vs Strong:** In a random forest, each tree is a strong learner -- it would do just fine as a decision tree on its own. In boosting algorithms, trees are deliberately limited to a shallow depth (scikit-learn's AdaBoostClassifier defaults to a single split, GradientBoostingClassifier to max_depth=3), so that each model is a weak learner that is only somewhat better than random chance.
3. **Aggregate Predictions:** In a random forest, each tree gets an equal vote for the final result, while a boosting algorithm employs a system of weights to determine how important the output of each tree is.
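
The contrast can be inspected directly on fitted models. The sketch below is illustrative only: the synthetic data, the 50 estimators, and `max_depth=1` for the boosted trees are assumptions chosen to make the difference visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic binary classification data (assumption)
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=1)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.25, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(Xc_train, yc_train)
gb = GradientBoostingClassifier(n_estimators=50, max_depth=1,
                                random_state=1).fit(Xc_train, yc_train)

# bagging: each tree is grown independently and, by default, to full depth
rf_depths = [tree.get_depth() for tree in rf.estimators_]

# boosting: estimators_ holds one shallow regression tree per round (binary case)
gb_depths = [round_trees[0].get_depth() for round_trees in gb.estimators_]

# staged_predict yields the ensemble's predictions after each boosting round,
# which shows how later trees build on earlier ones
staged_acc = [accuracy_score(yc_test, pred) for pred in gb.staged_predict(Xc_test)]

print('deepest random forest tree:', max(rf_depths))
print('deepest boosted tree:', max(gb_depths))
print('accuracy after round 1:', staged_acc[0])
print('accuracy after round 50:', staged_acc[-1])
```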


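## AdaBoost Algorithm and Gradient Boosted Trees
- **Adaptive Boosting Algorithm (AdaBoost):** combines the predictions of several weak learners to create a strong learner. The idea behind AdaBoost is to improve accuracy by focusing on the instances that are hardest to classify: after each round, misclassified instances get a higher weight, so the next weak learner concentrates on them.
- **Gradient Boosted Trees:** also build the ensemble one tree at a time, but each new tree is fit to the errors (the gradient of the loss) left by the trees built so far.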


In [None]:
# adaboostClassifier and GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split,GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


#adaBoostClassifier
ada = AdaBoostClassifier(random_state=42)

#set the parameters to search
param_grid ={
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1],
      
}

#use GridSearchCV to find the best parameters
grid_search = GridSearchCV(ada, param_grid, cv=5, scoring='accuracy')
#fit on the training set only, so that the test set stays unseen for the evaluation below
grid_search.fit(X_train, y_train)
#find the best parameters
print(grid_search.best_params_)

#to get the classification report
ada_test_pred = grid_search.predict(X_test)
print(classification_report(y_test, ada_test_pred))

#to get the confusion matrix
print(confusion_matrix(y_test, ada_test_pred))


#GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=42)


#set the parameters to search
param_grid={
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
    'subsample': [0.8, 0.9, 1]
}

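#use GridSearchCV to find the best parameters
#note: this grid has 1,296 combinations, i.e. 6,480 fits with cv=5, so it can take a long time
grid_search = GridSearchCV(gbc, param_grid, cv=5, scoring='accuracy')

#fit on the training set only, so that the test set stays unseen for the evaluation below
grid_search.fit(X_train, y_train)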

#find the best parameters
print(grid_search.best_params_)
#to get the classification report
gbc_test_pred = grid_search.predict(X_test)
print(classification_report(y_test, gbc_test_pred))

#to get the confusion matrix
print(confusion_matrix(y_test, gbc_test_pred))


## One of the most popular gradient boosting implementations is XGBoost (eXtreme Gradient Boosting)
- When using XGBoost, the y variable needs to be label encoded as integers starting from 0.
- So instantiate the LabelEncoder
- fit_transform the y_train
- transform the y_test

- Use the transformed y_train and y_test to fit the model, get the predicted values, and calculate the accuracy. This also applies when using GridSearchCV for parameter tuning.

In [None]:
#XGBoost
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

#instantiate the model
clf = XGBClassifier()

#label encode the target
encoder = LabelEncoder()
fitted_y_train = encoder.fit_transform(y_train)
fitted_y_test = encoder.transform(y_test)

#set the parameters to search
param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}

#use GridSearchCV to find the best parameters
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
#fit 
grid_search.fit(X_train, fitted_y_train)
#find the best parameters
print(grid_search.best_params_)

#to get the classification report
xgb_test_pred = grid_search.predict(X_test)
print(classification_report(fitted_y_test, xgb_test_pred))
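
The predictions above come back as encoded integers. The small sketch below shows what `LabelEncoder` does to a target and how to map predictions back to the original labels with `inverse_transform`. The toy labels are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# toy string labels standing in for y_train / y_test (assumption)
toy_y_train = ['cat', 'dog', 'bird', 'dog']
toy_y_test = ['dog', 'bird']

toy_encoder = LabelEncoder()
encoded_train = toy_encoder.fit_transform(toy_y_train)  # learn the classes, then encode
encoded_test = toy_encoder.transform(toy_y_test)        # reuse the same mapping

# classes are sorted, and each label is replaced by its position in that order
print(list(toy_encoder.classes_))   # ['bird', 'cat', 'dog']
print(encoded_train.tolist())       # [1, 2, 0, 2]
print(encoded_test.tolist())        # [2, 0]

# map encoded values (for example model predictions) back to the original labels
decoded_test = toy_encoder.inverse_transform(encoded_test)
print(decoded_test.tolist())        # ['dog', 'bird']
```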


## When to use Ensemble methods
1. Complex datasets with many features or non-linear relationships.
2. When a single model overfits: methods like bagging (e.g., Random Forest) can help reduce overfitting by averaging out the variance from different models.
3. When a single model underfits: boosting techniques reduce bias by sequentially focusing on hard-to-predict samples, while bagging methods reduce variance by averaging across multiple models.

## Advantages
1. Improved Accuracy: Ensembles typically perform better than single models.
2. Robustness: Less likely to be affected by outliers or noise in the data.
3. Versatility: Can be applied to various types of models and problems.

## Disadvantages
1. Complexity: Ensembles are more complex to understand and interpret than individual models.
2. Computationally Intensive: Training multiple models can require more computational resources and time.

## GridSearchCV
- Used to try every combination of the parameter values you supply, cross-validate each combination, and choose the best one to train with.


In [None]:
#implementing the GridSearch
from sklearn.model_selection import GridSearchCV

#initialize the model
clf = RandomForestClassifier(random_state=42)

#set the parameters to search
param_grid = {
    'n_estimators':[100,200,300,400],
    'criterion':['gini', 'entropy'],
    'max_depth': [1,2,5,10],
    'min_samples_split':[2, 5, 10],  # None is not a valid value here; the minimum is 2
    'min_samples_leaf':[1,2,5,10],
}

#use GridSearchCV to find the best parameters
#return_train_score=True is needed for cv_results_ to contain 'mean_train_score'
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', return_train_score=True)

#fit the model 
grid_search.fit(X_train, y_train)

#To find the best parameters. This is the parameter combination that gave the best score
grid_search.best_params_
# to find the best score. This score is the mean cross-validated score of the best_estimator
grid_search.best_score_
# to find the best estimator. This is the model refit on the whole training set using the best parameters.
grid_search.best_estimator_

"""to find the mean training score of the best estimator across the folds
comparing it with the cross-validation score helps to spot overfitting
"""
grid_search.cv_results_['mean_train_score'][grid_search.best_index_]#mean_train_score is an array with one mean training score per parameter combination
#thus to get a single value we index it with best_index_
"""
to find the test score of the best estimator
to find out how the model performs on unseen data
"""
grid_search.score(X_test, y_test)
#or, equivalently here
grid_search.best_estimator_.score(X_test, y_test)

#to find the mean cross validation score of the best estimator (the same value as best_score_)
grid_search.cv_results_['mean_test_score'][grid_search.best_index_]#mean_test_score is an array with one mean validation score per parameter combination

"""
grid_search.score uses the metric passed as scoring (accuracy here).
best_estimator_.score uses the estimator's default metric: accuracy for classification and R2 for regression.
"""
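
Instead of indexing `cv_results_` one value at a time, it can be convenient to load it into a DataFrame and compare all parameter combinations side by side. The sketch below uses a deliberately tiny grid and synthetic data (both assumptions) so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic data and a tiny grid, for illustration only (assumptions)
Xg, yg = make_classification(n_samples=200, n_features=8, random_state=42)
small_grid = {'n_estimators': [10, 30], 'max_depth': [2, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid,
                      cv=3, scoring='accuracy', return_train_score=True)
search.fit(Xg, yg)

# one row per parameter combination, best rank first
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_train_score', 'mean_test_score',
                   'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a combination is a hint that it overfits.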



## Choosing Between the Models:

1. **Problem Complexity:**
    - Random Forest is a strong baseline for many problems. If you need a fast, reliable classifier with minimal tuning, Random Forest is a good choice.
    - Gradient Boosting (especially with libraries like XGBoost, LightGBM, or CatBoost) can achieve higher performance on more complex tasks but at the cost of more tuning and slower training.
    - AdaBoost can be useful when you have simple models or weak learners and want to boost their performance without excessive complexity.

2. **Speed and Size of Dataset:**
    - Random Forest can be faster to train and test on larger datasets.
    - Gradient Boosting is typically slower but can achieve better results on complex datasets.
    - AdaBoost is lightweight and can be faster than Gradient Boosting but may not perform as well on complex tasks.

3. **Model Tuning:**
    - Random Forest requires minimal tuning (number of trees, max depth).
    - Gradient Boosting can require extensive hyperparameter tuning (learning rate, number of estimators, max depth, etc.) for optimal results.
    - AdaBoost has fewer hyperparameters to tune but can still benefit from parameter adjustments.

- It is good practice to instantiate all three ensemble models and use cross_val_score to see which one has a higher score. Pick the model with the highest mean cross_val_score.
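
A minimal sketch of that comparison is shown below. The synthetic data, the 50 estimators per model, and the dictionary keys are assumptions. With real data you would pass your own training features and target instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training data (assumption)
Xm, ym = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'adaboost': AdaBoostClassifier(n_estimators=50, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# mean 5-fold cross-validated accuracy for each candidate
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, Xm, ym, cv=5, scoring='accuracy')
    mean_scores[name] = scores.mean()

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('highest mean cross_val_score:', best_name)
```

When the mean scores are very close, the spread across folds and the training time are also worth weighing before picking a model.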