# Life Cycle of a Data Science Project

1) Feature Engineering.

2) Feature Selection.

3) Model Creation.

4) Hyperparameter Tuning (a Random Forest Classifier is used in this session).

Why is hyperparameter tuning required?

Ans) Hyperparameters should be chosen to suit the dataset at hand rather than left at the default values provided by the regressor or classifier (the model), so that we get the best model for that dataset.

# All Techniques of Hyperparameter Optimization

1) GridSearchCV.

2) RandomizedSearchCV.

3) Bayesian Optimization - Automate Hyperparameter Tuning (Hyperopt).

4) Sequential Model Based Optimization (tuning a scikit-learn estimator with skopt).

5) Optuna - Automate Hyperparameter Tuning.

6) Genetic Algorithms (TPOT Classifier).

References.
https://github.com/fmfn/BayesianOptimization

https://github.com/hyperopt/hyperopt

https://www.jeremyjordan.me/hyperparameter-tuning/

https://optuna.org/

https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d(By Pier Paolo I

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
df=pd.read_csv('diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
# A few Glucose, Insulin and SkinThickness values are 0, which cannot be real measurements,
# so replace them with the column median (computed over the whole column, zeros included)
import numpy as np
df["Glucose"]=np.where(df["Glucose"]==0,df["Glucose"].median(),df["Glucose"])
df["Insulin"]=np.where(df["Insulin"]==0,df["Insulin"].median(),df["Insulin"])
df["SkinThickness"]=np.where(df["SkinThickness"]==0,df["SkinThickness"].median(),df["SkinThickness"])
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35.0,30.5,33.6,0.627,50,1
1,1,85.0,66,29.0,30.5,26.6,0.351,31,0
2,8,183.0,64,23.0,30.5,23.3,0.672,32,1
3,1,89.0,66,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40,35.0,168.0,43.1,2.288,33,1
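Because the median above is taken over the whole column, the zeros themselves pull it down, which is why Insulin is filled with 30.5. One possible variant, sketched below on a small made-up frame (the values are illustrative and not taken from diabetes.csv), masks the zeros first so that the median comes only from observed values. Whether this is preferable depends on the data; it is shown only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), with 0 standing in for "missing"
toy = pd.DataFrame({"Insulin": [0, 0, 94, 168, 88, 0]})

# Median over the whole column, zeros included (the approach used in the cell above)
median_with_zeros = toy["Insulin"].median()

# Median over non-zero entries only: zeros become NaN, which median() skips
median_without_zeros = toy["Insulin"].replace(0, np.nan).median()

# Fill the zeros with the non-zero median
toy["Insulin_filled"] = toy["Insulin"].replace(0, median_without_zeros)

print("median with zeros:   ", median_with_zeros)
print("median without zeros:", median_without_zeros)
print(toy)
```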


In [None]:
# Independent and Dependent features

X=df.drop("Outcome",axis=1)
Y=df["Outcome"]

In [None]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.20,random_state=33)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Baseline model with 10 decision trees (n_estimators=10)
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)

In [None]:
Y.value_counts()

0    500
1    268
Name: Outcome, dtype: int64
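The counts above show that the two classes are not balanced. As a rough reference point when reading accuracy figures, the short sketch below computes the accuracy a model would get on the full dataset by always predicting the majority class. The counts are copied from the output above; nothing else is assumed.

```python
# Class counts copied from the value_counts() output above
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_share = max(counts.values()) / total

# A model that always predicts the majority class (0) would reach this accuracy
# on the full dataset, so it serves as a floor for comparison
print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_share:.3f}")
```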

In [None]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

# Accuracy here is 0.6948051948051948 (no random_state is set on the classifier, so reruns may differ slightly)

[[85 14]
 [33 22]]
0.6948051948051948
              precision    recall  f1-score   support

           0       0.72      0.86      0.78        99
           1       0.61      0.40      0.48        55

    accuracy                           0.69       154
   macro avg       0.67      0.63      0.63       154
weighted avg       0.68      0.69      0.68       154
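To see where the numbers in the report come from, the sketch below recomputes accuracy and the class 1 precision, recall and F1 directly from the confusion matrix printed above. It relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[85, 14],
               [33, 22]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # share of predicted positives that are truly positive
recall_1 = tp / (tp + fn)      # share of true positives that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.4f}")
print(f"precision_1 = {precision_1:.2f}")
print(f"recall_1    = {recall_1:.2f}")
print(f"f1_1        = {f1_1:.2f}")
```

The low recall for class 1 (33 of 55 diabetic cases missed) is the weakest part of this baseline, which accuracy alone does not reveal.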



The main parameters used by a Random Forest Classifier are:

1) criterion = the function used to evaluate the quality of a split.

2) max_depth = maximum number of levels allowed in each tree.

3) max_features = maximum number of features considered when splitting a node.

4) min_samples_leaf = minimum number of samples required to be at a leaf node.

5) min_samples_split = minimum number of samples necessary in a node to cause node splitting.

6) n_estimators = number of trees in the ensemble.
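The current values of these six parameters can be read off any estimator with `get_params()`. The sketch below prints them for a freshly constructed classifier, so the values shown are whatever defaults the installed scikit-learn version ships with (some defaults, such as `max_features`, have changed between releases).

```python
from sklearn.ensemble import RandomForestClassifier

# The six hyperparameters listed above
names = ["criterion", "max_depth", "max_features",
         "min_samples_leaf", "min_samples_split", "n_estimators"]

# get_params() returns every constructor argument with its current value;
# on a fresh estimator these are the defaults of the installed version
defaults = RandomForestClassifier().get_params()

for name in names:
    print(f"{name:18s} {defaults[name]!r}")
```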

In [None]:
# Manual hyperparameter tuning: pick parameter values by hand and check how the model performs
model=RandomForestClassifier(n_estimators=300,criterion='entropy',
                             max_features='sqrt',min_samples_leaf=10,random_state=100).fit(X_train,y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))
print(classification_report(y_test,predictions))

[[87 12]
 [28 27]]
0.7402597402597403
              precision    recall  f1-score   support

           0       0.76      0.88      0.81        99
           1       0.69      0.49      0.57        55

    accuracy                           0.74       154
   macro avg       0.72      0.68      0.69       154
weighted avg       0.73      0.74      0.73       154



Which hyperparameter tuning technique should we use first? Randomized Search CV, because it cheaply narrows down the range of promising values; after that, apply Grid Search to check the narrowed region thoroughly.
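A minimal sketch of that two-stage idea follows. It uses a small synthetic dataset from `make_classification` instead of the diabetes data, and deliberately tiny search spaces so that it runs in seconds; the specific values in `wide_space` and the way `narrow_grid` is built around the winner are illustrative assumptions, not a recommended recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Stage 1: sample a handful of combinations from a wide space
wide_space = {
    "n_estimators": [20, 50, 100],
    "max_depth": [2, 5, 10, None],
    "min_samples_leaf": [1, 2, 4, 8],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=wide_space,
    n_iter=5, cv=3, random_state=0,
)
random_search.fit(X_demo, y_demo)
best = random_search.best_params_

# Stage 2: exhaustively check a small neighbourhood around the stage-1 winner
leaf = best["min_samples_leaf"]
narrow_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_leaf": sorted({max(1, leaf - 1), leaf, leaf + 1}),
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=narrow_grid, cv=3
)
grid_search.fit(X_demo, y_demo)

print("randomized search best:", best, round(random_search.best_score_, 3))
print("grid search best:      ", grid_search.best_params_, round(grid_search.best_score_, 3))
```

Because the narrow grid contains the stage-1 winner and the estimator seed and folds are fixed, the grid search score here cannot come out lower than the randomized search score.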

*Technique 1 : Randomized Search CV*

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
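# Number of features to consider at every split
# Note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3; drop it on newer versions
max_features = ['auto', 'sqrt','log2']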
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
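To get a feel for why a randomized search is the cheaper first step, the sketch below counts how many combinations the printed grid contains and compares the number of model fits an exhaustive search would need with the number needed when only 100 candidates are sampled and scored with 3-fold cross-validation (the `n_iter=100` and `cv=3` settings used for the randomized search in this notebook).

```python
from math import prod

# Number of candidate values per hyperparameter, read off the printed random_grid
sizes = {
    "n_estimators": 10,
    "max_features": 3,
    "max_depth": 10,
    "min_samples_split": 4,
    "min_samples_leaf": 5,
    "criterion": 2,
}

# Every combination an exhaustive grid search would have to try
total_combinations = prod(sizes.values())

# Settings used for the randomized search in this notebook
n_iter, cv = 100, 3

print("full grid combinations:        ", total_combinations)
print("exhaustive fits with 3-fold CV:", total_combinations * cv)
print("randomized search fits:        ", n_iter * cv)
```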


In [None]:
rf=RandomForestClassifier()
rf_randomcv=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=100,cv=3,verbose=2,
                               random_state=100,n_jobs=-1)
### fit the randomized model
rf_randomcv.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  4.3min
