## Part 2 - Hands-on - scikit-learn

The objective of this notebook is to get hands-on with scikit-learn and demonstrate its use in preprocessing and modeling tasks on the Titanic dataset.

#### Reminder from 1_Pandas

The Titanic dataset contains information about passengers aboard the Titanic, including features such as age, sex, ticket class, and whether they survived the disaster.

In [9]:
# Importing necessary libraries
import pandas as pd  
from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.impute import SimpleImputer  
from sklearn.preprocessing import OneHotEncoder  
from sklearn.compose import ColumnTransformer  
from sklearn.pipeline import Pipeline  
from sklearn.ensemble import RandomForestClassifier  
from sklearn.metrics import accuracy_score, classification_report  



1. **Data Loading and Exploration:**
   - Load the Titanic dataset into a pandas DataFrame.
   - Display basic information about the dataset (e.g., data types, missing values, summary statistics).
   - Explore the distribution of the target variable (`Survived`).
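
As a rough illustration of the exploration steps above, the sketch below runs them on a tiny synthetic DataFrame rather than the real file. The name `toy_df`, its values, and every column name other than `Survived` are assumptions made for this example.

```python
import pandas as pd

# Tiny synthetic stand-in for the Titanic data (made-up values, assumed columns).
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, None, 14.0],
})

# Data types and non-null counts per column.
toy_df.info()

# Number of missing values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Summary statistics for the numerical columns.
print(toy_df.describe())

# Distribution of the target, as counts and as proportions.
survived_counts = toy_df["Survived"].value_counts()
survived_share = toy_df["Survived"].value_counts(normalize=True)
print(survived_counts)
print(survived_share)
```

Looking at the proportions as well as the raw counts makes it easier to judge whether the classes are balanced, which matters later when interpreting accuracy.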


In [None]:
# Task 1: Data Loading and Exploration
# Load the Titanic dataset
titanic_df = ...  # TODO: load the CSV file with pd.read_csv

In [None]:
# Display basic information


In [None]:
# Explore the distribution of the target variable (Survived)


2. **Data Preprocessing:**
   - Handle missing values in the dataset (e.g., drop unnecessary columns, impute missing values).
   - Encode categorical variables using one-hot encoding.
   - Split the dataset into features (X) and target variable (y).
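
The following sketch shows, under stated assumptions, what the individual preprocessing steps above do when applied one at a time. The DataFrame `toy_df`, its values, and the `Age` and `Embarked` columns are illustrative stand-ins, and the imputation strategies are one possible choice rather than the required one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (made-up values, assumed column names).
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype="object"),
})

# Separate the features (X) from the target variable (y).
X = toy_df.drop(columns=["Survived"])
y = toy_df["Survived"]

# Numerical column: replace missing values with the column median.
age_imputed = SimpleImputer(strategy="median").fit_transform(X[["Age"]])

# Categorical column: fill missing values with the most frequent category...
embarked_filled = SimpleImputer(strategy="most_frequent").fit_transform(X[["Embarked"]])

# ...then turn each category into its own 0/1 column.
encoder = OneHotEncoder(handle_unknown="ignore")
embarked_encoded = encoder.fit_transform(embarked_filled)

print(age_imputed.ravel())
print(encoder.categories_[0])
print(embarked_encoded.shape)
```

Setting `handle_unknown="ignore"` is a defensive choice here: categories that appear only in the validation data are encoded as all zeros instead of raising an error.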

In [None]:
# Drop unnecessary columns


In [None]:
# Display the first rows
titanic_df.head()

In [10]:
# Define columns that you want to impute and encode  
num_features = ...  # TODO: list of numerical column names
cat_features = ...  # TODO: list of categorical column names

In [11]:
# Create the transformer for the numerical features
num_transformer = ...  # TODO: SimpleImputer

# Create the transformer for the categorical features
cat_transformer = Pipeline(steps=[
    ('imputer', ...),  # TODO: SimpleImputer
    ('onehot', ...),  # TODO: OneHotEncoder
])

In [None]:
# Combine transformers into a preprocessor with ColumnTransformer  
preprocessor = ColumnTransformer(  
    transformers=[  
        ('num', num_transformer, num_features),  
        ('cat', cat_transformer, cat_features)  
    ])

In [None]:
# Split the dataset into features (X) and target variable (y)


In [None]:
# Split the features (X) and target (y) into training and validation sets


3. **Building a Pipeline:**
   - Create a scikit-learn pipeline that chains the preprocessing steps (imputation, encoding) with a machine learning model.
   - Choose a machine learning model (e.g., Logistic Regression, Decision Tree, Random Forest Classifier) and include it in the pipeline.
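
One way the two steps above could fit together is sketched below. The step names `preprocessor` and `model` match the keys used in the parameter grid later in this notebook; the synthetic data, the names prefixed with `toy_`, and the choice of `RandomForestClassifier` with `n_estimators=10` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (made-up values, assumed column names).
X_toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0, 4.0, 58.0, np.nan, 27.0],
    "Sex": pd.Series(["male", "female", "female", "male",
                      "female", "male", "male", "female"], dtype="object"),
})
y_toy = pd.Series([0, 1, 1, 0, 1, 0, 0, 1])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Sex"]),
])

# Preprocessing and model live in one object, so fit/predict run both in order.
toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])

toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)
print(toy_predictions)
```

Keeping the preprocessing inside the pipeline means the imputation values and categories are learned only from the data passed to `fit`, which helps avoid leaking information from the validation set.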


In [None]:
# Create a Scikit-learn pipeline


In [None]:
# Fit the pipeline on the training data


4. **Training and Evaluation:**
   - Split the dataset into training and validation sets (preprocessing is applied inside the pipeline).
   - Fit the pipeline on the training data.
   - Evaluate the pipeline on the validation data using accuracy score and classification report.
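
The split, fit, and evaluate sequence above might look like the following sketch. It runs on 40 rows of synthetic data with a made-up target, so the printed scores say nothing about the real Titanic data; the `test_size`, `stratify`, and `random_state` settings are assumptions, as are all variable and column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data: 40 rows, made-up values, assumed column names.
rng = np.random.default_rng(0)
sex = ["male", "female"] * 20
age = rng.uniform(1, 80, size=40)
age[::7] = np.nan  # inject some missing ages
X_toy = pd.DataFrame({"Age": age, "Sex": pd.Series(sex, dtype="object")})

# Made-up target: mostly tied to Sex, with a few labels flipped.
labels = [1 if s == "female" else 0 for s in sex]
for i in (0, 5, 10):
    labels[i] = 1 - labels[i]
y_toy = pd.Series(labels)

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex"]),
    ])),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Hold out 20% of the rows, keeping the class proportions similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

toy_pipeline.fit(X_train, y_train)
y_pred = toy_pipeline.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred, zero_division=0)
print(f"Accuracy Score: {accuracy:.3f}")
print("Classification Report:")
print(report)
```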

In [None]:
# Evaluate the pipeline on the validation data

print(f"Accuracy Score: ")
print("Classification Report:")

5. **Parameter Tuning (Optional):**
   - Experiment with different parameters of the pipeline components (e.g., model hyperparameters, imputation strategy).
   - Use techniques like GridSearchCV to find the best combination of parameters.
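
A compact sketch of such a search is shown below. It uses a deliberately small grid, 3-fold cross-validation, and synthetic data, all of which are assumptions chosen to keep the example fast; the resulting scores are not meaningful for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data: 40 rows, made-up values, assumed column names.
rng = np.random.default_rng(0)
sex = ["male", "female"] * 20
age = rng.uniform(1, 80, size=40)
age[::7] = np.nan
X_toy = pd.DataFrame({"Age": age, "Sex": pd.Series(sex, dtype="object")})
y_toy = pd.Series([1 if s == "female" else 0 for s in sex])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex"]),
    ])),
    ("model", RandomForestClassifier(random_state=42)),
])

# Small illustrative grid: 2 x 2 = 4 parameter combinations.
toy_param_grid = {
    "model__n_estimators": [10, 20],
    "preprocessor__num__strategy": ["mean", "median"],
}

# Every combination is scored with 3-fold cross-validation.
grid_search = GridSearchCV(toy_pipeline, toy_param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_toy, y_toy)

print("Best parameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

# cv_results_ holds one row per combination; rank 1 is the best mean score.
results = pd.DataFrame(grid_search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "rank_test_score"]])
```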

In [None]:
# Define parameter grid for GridSearchCV
param_grid = {  
    'model__n_estimators': [50, 100, 200],  
    'model__max_depth': [None, 5, 10, 20],  
    'preprocessor__num__strategy': ['mean', 'median'],
    # Add other parameters if desired
}
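
The keys in this grid follow the `step__parameter` naming convention, with one double underscore per level of nesting. A quick way to check that every key is valid is to compare it against `get_params()`, as in this sketch; the pipeline below is a hypothetical stand-in that assumes steps named `preprocessor` and `model`, a `num` transformer that is a bare `SimpleImputer`, and made-up column names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical pipeline; no data is needed to inspect parameter names.
toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", SimpleImputer(), ["Age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex"]),
    ])),
    ("model", RandomForestClassifier()),
])

grid_to_check = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [None, 5, 10, 20],
    "preprocessor__num__strategy": ["mean", "median"],
}

# get_params() lists every tunable name, including nested ones.
valid_names = toy_pipeline.get_params().keys()
unknown_names = [name for name in grid_to_check if name not in valid_names]
print("Unknown parameter names:", unknown_names)
```

If the numerical transformer were itself a `Pipeline` with a step called `imputer`, the key would need one more level, for example `preprocessor__num__imputer__strategy`.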

In [None]:
# Perform GridSearchCV


In [None]:
# Print the best parameters and best score


In [None]:
# Print all of the results, sorted by rank_test_score