### Introduction to cross validation
- What is cross validation?
    - A technique used to assess how well a machine learning model generalizes to an independent dataset
- Types of cross validation
    - K-fold cross validation
        - Splits the dataset into k folds of approximately equal size
        - The model is trained on k-1 folds and validated on the remaining fold
        - This process is repeated k times, so each fold serves as the validation set once, and the average performance is computed
    - Stratified K-fold
        - Ensures that each fold keeps approximately the same class distribution as the original dataset
        - Useful for imbalanced datasets
    - Leave-one-out cross validation (LOOCV)
        - Uses a single data point for validation and the rest for training
        - Repeats this process for every data point
        - Computationally expensive (one model fit per data point); the estimate has low bias but can have high variance
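
The three strategies above can be compared side by side in scikit-learn. This is a minimal sketch, not part of the exercise: the synthetic imbalanced dataset and the names `X_demo`, `y_demo`, and `cv_results` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

# Synthetic, imbalanced toy data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=5, weights=[0.8, 0.2], random_state=0
)
model = LogisticRegression(max_iter=1000)

# One splitter per strategy described above
splitters = {
    "k-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified k-fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

cv_results = {}
for name, splitter in splitters.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=splitter, scoring="accuracy")
    # Store the number of model fits and the mean accuracy for each strategy
    cv_results[name] = (len(scores), scores.mean())
    print(f"{name}: {len(scores)} fits, mean accuracy {scores.mean():.3f}")
```

The number of fits shows the cost difference: 5 for both k-fold variants, but one per sample for leave-one-out.
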
### Hyperparameter tuning
- What is hyperparameter tuning?
    - Hyperparameters are settings that are not learned by the model but are set before training
    - Tuning these hyperparameters is crucial for optimizing model performance
- Techniques for hyperparameter tuning
    - Grid search
        - Exhaustively searches over a predefined hyperparameter space
        - Example: testing all combinations of values for max_depth and learning_rate
    - Random search
        - Randomly samples combinations of hyperparameters from the predefined space
        - Usually more efficient than grid search when the parameter space is large
- Importance of hyperparameter tuning
    - Helps reduce overfitting and underfitting by selecting the best configuration
    - Enhances model performance by optimizing critical settings
    - Without tuning, the model might not reach its optimal performance, leading to:
        - Underfitting
        - Overfitting
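
The exercise below uses grid search only, so here is a minimal sketch of random search for comparison. It uses `RandomizedSearchCV` from scikit-learn on a synthetic dataset; the data, the parameter ranges, and the names `X_demo`, `y_demo`, and `random_search` are assumptions chosen for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic toy data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions or lists to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(20, 101),      # integers from 20 to 100
    "max_depth": [None, 5, 10],
    "min_samples_split": randint(2, 11),   # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled configurations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X_demo, y_demo)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.3f}")
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is why random search scales better to large spaces.
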

In [None]:
# Exercise
# - task1: Perform feature engineering
# - task2: Train and evaluate models
# - task3: Apply grid search for hyperparameter tuning


In [1]:
import pandas as pd   
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [8]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

df = pd.read_csv(url)

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
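# select relevant features (copy so later assignments do not touch a view of the original frame)
df = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Survived']].copy()

# handle missing values: median for Age, most frequent value for Embarked
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])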


In [10]:
# define features and target
x = df.drop(columns=['Survived'])
y = df['Survived']

In [11]:
# Apply feature scaling and encoding 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Fare']),
        ('cat', OneHotEncoder(), ['Pclass', 'Sex', 'Embarked'])
    ]
)
x_preprocessed = preprocessor.fit_transform(x)
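
Fitting the preprocessor on the full dataset before cross-validation lets the scaler see the validation folds. The effect is often small, but a common way to avoid it is to wrap preprocessing and model in a `Pipeline`, so the transformers are refit on each training fold. The sketch below assumes a small synthetic stand-in for the Titanic columns; `toy`, `toy_target`, and `pipeline` are illustrative names, and `handle_unknown="ignore"` is an added safeguard in case a category is missing from a training fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic stand-in for the Titanic columns (illustrative values only)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 100, size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
})
toy_target = rng.integers(0, 2, size=n)

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex", "Embarked"]),
])

# Preprocessing and model are fit together inside each cross-validation fold
pipeline = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline_scores = cross_val_score(pipeline, toy, toy_target, cv=5, scoring="accuracy")
print(f"Pipeline accuracy: {pipeline_scores.mean():.3f}")
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `model__C`).
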

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


In [14]:
# train and evaluate Logistic Regression model with 5-fold cross validation
log_reg = LogisticRegression()
log_reg_scores = cross_val_score(log_reg, x_preprocessed, y, cv=5, scoring='accuracy')
print(f"LogisticRegression accuracy: {log_reg_scores.mean():.3f}")

LogisticRegression accuracy: 0.789


In [15]:
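# train and evaluate Random Forest model with 5-fold cross validation
# note: no random_state is set, so the score varies slightly between runs
rf = RandomForestClassifier()
rf_scores = cross_val_score(rf, x_preprocessed, y, cv=5, scoring='accuracy')
print(f"RandomForest accuracy: {rf_scores.mean():.3f}")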


RandomForest accuracy: 0.804


In [16]:
from sklearn.model_selection import GridSearchCV

In [17]:
# define hyperparameter grid for Random Forest (3 x 3 x 3 = 27 combinations)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
# evaluate every combination with 5-fold cross validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_preprocessed, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")


Best parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 200}
Best cross-validation score: 0.836
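
The best cross-validation score is chosen as the maximum over all configurations, so it tends to be slightly optimistic. One common check is to hold out a test set before tuning and score the refit best model on it. The sketch below assumes synthetic data and a deliberately small grid; `X_demo`, `y_demo`, `small_grid`, and `search` are illustrative names, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic toy data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 20% of the data; it is never seen during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

small_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# score() uses the best estimator, refit on the whole training split
test_accuracy = search.score(X_test, y_test)
print(f"Best cross-validation score: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```
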
