# Assignment 2

In this assignment, we will work with the *Adult* data set. Please download the data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). Extract the data files into the subdirectory: `../05_src/data/adult/` (relative to `./05_src/`).

# Load the data

Assuming that the files `adult.data` and `adult.test` are in `../05_src/data/adult/`, then you can use the code below to load them.

In [1]:
# Write your code below.
%load_ext dotenv
%dotenv

In [2]:
import pandas as pd
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'income'
]
adult_dt = (pd.read_csv('../../05_src/data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))


In [3]:
adult_dt

||age|workclass|fnlwgt|education|education-num|marital-status|occupation|relationship|race|sex|capital-gain|capital-loss|hours-per-week|native-country|income|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|39|State-gov|77516|Bachelors|13|Never-married|Adm-clerical|Not-in-family|White|Male|2174|0|40|United-States|0|
|1|50|Self-emp-not-inc|83311|Bachelors|13|Married-civ-spouse|Exec-managerial|Husband|White|Male|0|0|13|United-States|0|
|2|38|Private|215646|HS-grad|9|Divorced|Handlers-cleaners|Not-in-family|White|Male|0|0|40|United-States|0|
|3|53|Private|234721|11th|7|Married-civ-spouse|Handlers-cleaners|Husband|Black|Male|0|0|40|United-States|0|
|4|28|Private|338409|Bachelors|13|Married-civ-spouse|Prof-specialty|Wife|Black|Female|0|0|40|Cuba|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|32556|27|Private|257302|Assoc-acdm|12|Married-civ-spouse|Tech-support|Wife|White|Female|0|0|38|United-States|0|
|32557|40|Private|154374|HS-grad|9|Married-civ-spouse|Machine-op-inspct|Husband|White|Male|0|0|40|United-States|1|
|32558|58|Private|151910|HS-grad|9|Widowed|Adm-clerical|Unmarried|White|Female|0|0|40|United-States|0|
|32559|22|Private|201490|HS-grad|9|Never-married|Adm-clerical|Own-child|White|Male|0|0|20|United-States|0|


# Get X and Y

Create the features data frame and target data:

+ Create a dataframe `X` that holds the features (all columns that are not `income`).
+ Create a dataframe `Y` that holds the target data (`income`).
+ From `X` and `Y`, obtain the training and testing data sets:

    - Use a train-test split of 70-30%. 
    - Set the random state of the splitting function to 42.

In [4]:
# Only the splitting function is needed in this cell; later cells import what they use.
from sklearn.model_selection import train_test_split

# Split the data into features and target
X = adult_dt.drop('income', axis=1)
Y = adult_dt['income']

# Perform train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [5]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

## Random States

Please comment: 

+ What is the [random state](https://scikit-learn.org/stable/glossary.html#term-random_state) of the [splitting function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)? 

    
+ Why is it [useful](https://en.wikipedia.org/wiki/Reproducibility)?


*(Comment here.)*
- The random_state parameter in the train_test_split function is used to seed the random number generator. This ensures that the split of the data into training and testing sets is reproducible. By setting the random_state to a specific value (e.g., 42), you can guarantee that the same split will be produced every time the code is run.

- **Why is random_state useful?**
    - **Reproducibility:** Setting a random_state allows others to replicate your results exactly. This is crucial in scientific research and data analysis, where reproducibility is a key principle.

    - **Consistency:** During model development, having a consistent train-test split allows you to compare different models or parameter settings on the same data partition. This ensures that differences in model performance are due to the models themselves, not to variations in the data splits.

    - **Debugging:** When debugging code or verifying results, having a fixed random_state makes it easier to trace and reproduce specific issues or behaviors.
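
The reproducibility point can be checked with a minimal sketch. It uses a toy array rather than the Adult data, and the names `toy_data`, `first_*`, and `second_*` are illustrative assumptions, not part of the assignment. Two calls with the same `random_state` should return the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features (an assumption for illustration).
toy_data = np.arange(20)

# Two independent calls that share the same seed.
first_train, first_test = train_test_split(toy_data, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(toy_data, test_size=0.3, random_state=42)

# The partitions match element by element.
same_split = (
    np.array_equal(first_train, second_train)
    and np.array_equal(first_test, second_test)
)
print(same_split)
```

Changing the seed in one of the two calls will generally produce a different partition, which is why comparisons between models are only fair when the seed is held fixed.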

# Preprocessing

Create a [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) that treats the features as follows:

- Numerical variables

    * Apply [KNN-based imputation for completing missing values](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html):
        
        + Consider the 7 nearest neighbours.
        + Weight each neighbour by the inverse of its distance, causing closer neighbours to have more influence than more distant ones.
    * [Scale features using statistics that are robust to outliers](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).

- Categorical variables: 
    
    * Apply a [simple imputation strategy](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer):

        + Use the most frequent value to complete missing values, also called the *mode*.

    * Apply [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):
        
        + Handle unknown labels if they exist.
        + Drop one column for binary variables.
    
    
The column transformer should look like this:

![](./images/assignment_2__column_transformer.png)

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Preprocessing for numerical data
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=7, weights='distance')),
    ('scaler', RobustScaler())
])

# Preprocessing for categorical data
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='if_binary'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num_transforms', numeric_transformer, numeric_features),
        ('cat_transforms', categorical_transformer, categorical_features)
    ])

# Wrap the preprocessor in a pipeline so that it can be displayed and inspected
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)])



In [7]:
pipeline

## Model Pipeline

Create a [model pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html): 

+ Add a step labelled `preprocessing` and assign the Column Transformer from the previous section.
+ Add a step labelled `classifier` and assign a [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to it.

The pipeline looks like this:

![](./images/assignment_2__pipeline.png)

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create the pipeline with preprocessor and RandomForestClassifier
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

In [9]:
pipeline

# Cross-Validation

Evaluate the model pipeline using [`cross_validate()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html):

+ Measure the following [performance metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values): negative log loss, ROC AUC, accuracy, and balanced accuracy.
+ Report the training and validation results. 
+ Use five folds.


In [10]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, log_loss, roc_auc_score, accuracy_score, balanced_accuracy_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

scoring = {
    'neg_log_loss': 'neg_log_loss',
    'roc_auc': 'roc_auc',
    'accuracy': 'accuracy',
    'balanced_accuracy': make_scorer(balanced_accuracy_score)
}

# Define parameter grid for GridSearchCV
param_grid = {
    'classifier__n_estimators': [10, 50, 100],
    'classifier__max_depth': [None, 10, 20, 30]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')

# Fit the GridSearchCV object
grid_search.fit(X_train, Y_train)

# Perform cross-validation on the best estimator
cv_results = cross_validate(
    grid_search.best_estimator_,
    X_train,
    Y_train,
    cv=5,
    scoring=scoring,
    return_train_score=True
)



Display the fold-level results as a pandas data frame, sorted by the negative log loss of the test (validation) set.

In [11]:
# Create a DataFrame to show results
cv_df = pd.DataFrame(cv_results)
cv_df_sorted = cv_df.sort_values(by='test_neg_log_loss')

cv_df_sorted.head()

||fit_time|score_time|test_neg_log_loss|train_neg_log_loss|test_roc_auc|train_roc_auc|test_accuracy|train_accuracy|test_balanced_accuracy|train_balanced_accuracy|
|---|---|---|---|---|---|---|---|---|---|---|
|2|8.773143|0.231532|-0.30897|-0.192357|0.914482|0.984437|0.859588|0.929911|0.765483|0.87379|
|0|8.100509|0.268514|-0.306424|-0.194047|0.915259|0.983403|0.85896|0.927878|0.772305|0.872271|
|1|9.639491|0.282101|-0.306274|-0.190452|0.91532|0.984512|0.857206|0.931717|0.764079|0.87542|
|4|7.893525|0.290724|-0.305749|-0.192267|0.915196|0.984007|0.863756|0.931502|0.77685|0.881144|
|3|7.887537|0.216075|-0.29856|-0.193226|0.920589|0.983915|0.867925|0.929582|0.783601|0.875879|


Calculate the mean of each metric. 

In [12]:
# Calculate mean of each metric
train_metrics = cv_df.filter(like='train').mean().to_dict()
val_metrics = cv_df.filter(like='test').mean().to_dict()

print("Training Metrics Mean: ", train_metrics)
print("Validation Metrics Mean: ", val_metrics)


Training Metrics Mean:  {'train_neg_log_loss': -0.19246969849631487, 'train_roc_auc': 0.9840545984710081, 'train_accuracy': 0.9301180168393046, 'train_balanced_accuracy': 0.8757007949913845}
Validation Metrics Mean:  {'test_neg_log_loss': -0.3051952278774005, 'test_roc_auc': 0.9161691917632073, 'test_accuracy': 0.8614867851765757, 'test_balanced_accuracy': 0.7724635206915111}
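
For readers who want to try the same workflow without the Adult files, the sketch below runs it on synthetic data. The data from `make_classification`, the small forest, and all `toy_*` names are assumptions for illustration. The scores it produces are unrelated to the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (stand-in for the real features and target).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Five-fold cross-validation with the four metrics, keeping training scores too.
toy_results = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_toy,
    y_toy,
    cv=5,
    scoring=['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy'],
    return_train_score=True,
)

# One row per fold, sorted by validation negative log loss (worst fold first).
toy_df = pd.DataFrame(toy_results).sort_values('test_neg_log_loss')

# Mean of every train_* and test_* column, leaving out the timing columns.
toy_means = toy_df.filter(regex='^(train|test)_').mean()
print(toy_means)
```

Passing the metric names as strings gives result keys of the form `test_<name>` and `train_<name>`, which is what makes the `filter` call work.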


Calculate the same performance metrics (negative log loss, ROC AUC, accuracy, and balanced accuracy) using the testing data `X_test` and `Y_test`. Display results as a dictionary.

*Tip*: both `roc_auc()` and `neg_log_loss()` require prediction scores from `pipe.predict_proba()`. For `roc_auc()`, pass only the last column, `Y_pred_proba[:, 1]`; for `neg_log_loss()`, pass the full `Y_pred_proba`.

In [13]:
# Fit the best estimator on the training data
best_estimator = grid_search.best_estimator_
best_estimator.fit(X_train, Y_train)

# Make predictions
Y_pred_proba = best_estimator.predict_proba(X_test)
Y_pred = best_estimator.predict(X_test)

# Calculate performance metrics
test_metrics = {
    # log_loss() returns the (positive) loss; negate it to match the 'neg_log_loss' scorer
    'neg_log_loss': -log_loss(Y_test, Y_pred_proba),
    'roc_auc': roc_auc_score(Y_test, Y_pred_proba[:, 1]),
    'accuracy': accuracy_score(Y_test, Y_pred),
    'balanced_accuracy': balanced_accuracy_score(Y_test, Y_pred)
}

print("Test Set Performance Metrics: ", test_metrics)


Test Set Performance Metrics:  {'neg_log_loss': -0.309725793742636, 'roc_auc': 0.9132349846703384, 'accuracy': 0.8625243115979118, 'balanced_accuracy': 0.7688166162054435}
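
The tip about which probabilities to pass can be illustrated on four hand-made predictions. This is a sketch with invented numbers; `toy_true`, `toy_proba`, and the other names are hypothetical. It also shows that negative log loss is the mean log-probability assigned to the true class.

```python
import math
from sklearn.metrics import log_loss, roc_auc_score

# Four toy observations and their predicted class probabilities [P(0), P(1)].
toy_true = [0, 0, 1, 1]
toy_proba = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.35, 0.65],
    [0.2, 0.8],
]

# ROC AUC only needs the score of the positive class (the last column).
positive_scores = [row[1] for row in toy_proba]
toy_auc = roc_auc_score(toy_true, positive_scores)

# Log loss uses the full probability matrix; negating it gives "negative log loss".
toy_neg_log_loss = -log_loss(toy_true, toy_proba)

# The same quantity by hand: the average log-probability of the true class.
by_hand = sum(math.log(row[label]) for row, label in zip(toy_proba, toy_true)) / len(toy_true)

print(toy_auc, toy_neg_log_loss, by_hand)
```

Every positive example here scores higher than every negative one, so the ROC AUC is 1.0, while the negative log loss stays below zero because no prediction is fully confident.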


# Target Recoding

In the first code chunk of this document, we loaded the data and immediately recoded the target variable `income`. Why is this [convenient](https://scikit-learn.org/stable/modules/model_evaluation.html#binary-case)?

The specific line was:

```
adult_dt = (pd.read_csv('../05_src/data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))
```

(Answer here.)
Recoding the target variable income as a binary variable immediately upon loading the data is convenient for several reasons, particularly in the context of using scikit-learn for machine learning tasks. Here's why:

**Binary Classification:**

- Many machine learning models and evaluation metrics in scikit-learn are specifically designed for binary classification tasks. Recoding income to 0 and 1 ensures that it is in the correct format for these models and metrics.

**Consistency and Convenience:**

- By recoding income as a binary variable right away, we ensure consistency in how this target variable is handled throughout the analysis. This reduces the risk of errors and inconsistencies later in the pipeline.

**Simplification of Model Evaluation:**

- Binary labels (0 and 1) make it easier to calculate and interpret various evaluation metrics such as accuracy, precision, recall, F1-score, ROC AUC, and log loss. These metrics are designed to work with binary labels, making the evaluation process more straightforward.

**Compatibility with Scikit-learn:**

- Scikit-learn classifiers also accept string labels, but metrics such as precision, recall, and F1-score treat the label `1` as the positive class by default (`pos_label=1`). Coding `>50K` as 1 therefore makes it the positive class without extra configuration and avoids label-related errors during evaluation.

**Code Readability and Maintainability:**

- Recoding the target variable in the data loading step makes the code more readable and maintainable. It centralizes the transformation logic, making it clear from the beginning how the target variable is defined and used.

Here's a breakdown of the specific line of code:

    adult_dt = (pd.read_csv('../05_src/data/adult/adult.data', header=None, names=columns)
                  .assign(income=lambda x: (x.income.str.strip() == '>50K')*1))

- `pd.read_csv('../05_src/data/adult/adult.data', header=None, names=columns)`:
  - This reads the data from the specified CSV file and assigns column names to the DataFrame.

- `.assign(income=lambda x: (x.income.str.strip() == '>50K')*1)`:
  - This uses the `assign` method to overwrite the `income` column. The lambda function strips surrounding whitespace from the income values, checks whether they equal `'>50K'`, and converts the boolean result (True/False) to integers (1/0) by multiplying by 1.

- This line of code ensures that the income column is immediately available as a binary variable, ready for use in binary classification models and evaluation metrics.
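
The recoding step can be tried on its own with a few hand-written values. This is a minimal sketch; `raw_income` and its contents are invented for illustration, including a leading space on each value like the one the `str.strip()` call in the original line removes.

```python
import pandas as pd

# Toy income labels with a leading space (invented values for illustration).
raw_income = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K'])

# Strip whitespace, compare with '>50K', and turn True/False into 1/0.
recoded = (raw_income.str.strip() == '>50K') * 1

# With 0/1 coding, the mean is the share of the positive class.
positive_rate = recoded.mean()

print(recoded.tolist(), positive_rate)
```

Without the `str.strip()` call, the comparison against `'>50K'` would be false for every row in this toy series, and the target would contain only zeros.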

## Criteria

|Criteria|Complete|Incomplete|
|---------------------|----|----|
|Evaluation of model pipeline |Model pipeline was evaluated correctly.|Model pipeline was not evaluated correctly.|
|Explanation of answer|Answer was concise and explained the learner's reasoning in depth.|Answer was not concise and did not explain the learner's reasoning in depth.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.

# Reference

Becker, Barry and Kohavi, Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.