For this assignment you will be working with the [Adult](https://archive.ics.uci.edu/dataset/2/adult) dataset from the UC Irvine Machine Learning Repository.

This will be a fairly open-ended assignment in which you apply the concepts you learned in the past few weeks to build a model that predicts whether an adult's annual income is greater than $50,000 or at most $50,000.

There are 17 open-ended questions; make sure to answer them!

The following code will load the dataset into this notebook for you. Make sure to read through the description of the dataset variables below:

**Target Variable**:

- Income: >50K, <=50K

**Features**:
- Age: `continuous`
- Workclass: `Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked`
- fnlwgt (Final Weight): `continuous`
- Education: `Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool`
- Education-num (Education Number): `continuous`
- Marital-status: `Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse`
- Occupation: `Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces`
- Relationship: `Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried`
- Race: `White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black`
- Sex: `Female, Male`
- Capital-gain: `continuous`
- Capital-loss: `continuous`
- Hours-per-week: `continuous`
- Native-country: `United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands`

In [1]:
! pip install ucimlrepo



In [2]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
pd.set_option('display.max_rows', None)

In [3]:
adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

In [4]:
adult.variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | | | no |
| 1 | workclass | Feature | Categorical | Income | Private, Self-emp-not-inc, Self-emp-inc, Feder... | | yes |
| 2 | fnlwgt | Feature | Integer | | | | no |
| 3 | education | Feature | Categorical | Education Level | Bachelors, Some-college, 11th, HS-grad, Prof-... | | no |
| 4 | education-num | Feature | Integer | Education Level | | | no |
| 5 | marital-status | Feature | Categorical | Other | Married-civ-spouse, Divorced, Never-married, S... | | no |
| 6 | occupation | Feature | Categorical | Other | Tech-support, Craft-repair, Other-service, Sal... | | yes |
| 7 | relationship | Feature | Categorical | Other | Wife, Own-child, Husband, Not-in-family, Other... | | no |
| 8 | race | Feature | Categorical | Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo,... | | no |
| 9 | sex | Feature | Binary | Sex | Female, Male. | | no |


In [5]:
df = pd.concat([X, y], axis=1)

In [6]:
df['income'] = df['income'].str.replace('.', '', regex=False)

In [7]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


The above code will load the dataset into the variable `df`. Now it's up to you to predict an adult's income. Here are a few pointers and TODOs for approaching this problem. As you go through this assignment, make sure to look at the following and answer the questions **in this same markdown cell** you see here:

1. **Does the data need to be cleaned?**
   - Do some initial analysis, look at the data, and see if we need to use some pandas functions to reformat data cells (hint: you will have to do this, as this is somewhat of a messy dataset).
   - Do we really need all the columns? Do a bit of research and ask questions in the Discord to see whether some of the columns in this dataset can be dropped!
     - ***Question 1: When do we actually drop a column?***
     - Answer: A column should be dropped if it does not contribute significantly to the prediction of the target variable or if it contains redundant or irrelevant information, such as ID columns or constant columns.


2. **Deal with missing values**
   - A few columns have missing values; should you drop them? Impute these missing values as you see fit, such that the end model has good performance (note that this is a trial and error process as explained in class).
   - There are several methods for this; look at the 10/30 class notebook.
     - ***Question 2: Which methods of missing value imputation worked the best for you in the end? Do you think there is a specific reason why?***
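     - Answer: In this dataset only three categorical columns have missing values (`workclass`, `occupation`, and `native-country`), so I imputed each with its mode; for numerical features the equivalent choice would be the mean or median. This worked well because it keeps every row and fills the gaps with the most common value, so little information is lost.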

3. **Encode the categorical data**
   - There are several categorical features in this data. We discussed several ways of encoding these features and when to use some over the others, so use these methods to encode the data!
     - ***Question 3: What is the difference between label encoding, one-hot encoding, and target encoding? For each one, list the pros and cons.***
     - Answer:
       - Label Encoding: Converts categories into integers (e.g., A=0, B=1).
         - Pros: Simple and efficient for ordinal data.
         - Cons: Can mislead models for non-ordinal data by introducing a false sense of order.
       - One-Hot Encoding: Creates binary columns for each category.
         - Pros: Removes ordinal relationships and works well for small categorical sets.
         - Cons: Can create high-dimensional data if there are many categories.
       - Target Encoding: Replaces categories with the mean of the target variable for each category.
         - Pros: Handles high-cardinality features well.
         - Cons: Risk of data leakage if not handled properly.
     - ***Question 4: Do some research on your own; what are some other ways of encoding categorical data?***
     - Answer:
       - Binary Encoding: Maps each category to an integer and spreads that integer's binary digits across a small number of columns.
       - Frequency Encoding: Replaces each category with how often it occurs in the dataset.
       - Hash Encoding: Uses a hash function to map categories to a fixed number of columns.

4. **Split your data into a train and test dataset**
   - When splitting your data into a train and test dataset, there are several parameters you have to pass in/consider (many of which can be optimized with a trial and error process):
     - ***Question 5: What proportion of your dataset is the training dataset? Why did you choose this ratio?***
     - Answer: 70% for training and 30% for testing. This provides sufficient data for model learning while keeping enough data to evaluate model performance.

     - ***Question 6: Explain the difference between the train and test dataset. Why do we need them? Why can't we just train the model on the entire training dataset?***
     - Answer: The train dataset is used to fit the model, while the test dataset evaluates its performance on unseen data. Training on the entire dataset would prevent us from knowing how well the model generalizes to new data.
     - ***Question 7: Note that the dataset is very imbalanced. Why does this matter for splitting your dataset?***
     - Answer: Imbalanced datasets can lead to biased models. Stratified splitting ensures the target class distribution remains consistent between the train and test datasets.

5. **Scale your dataset**
   - Note that when you scale your dataset, you fit & transform the training data and only transform the test data.
     - ***Question 8: Why do you only fit to the training data and not the test data?***
     - Answer: Fitting on the test data introduces data leakage, which inflates performance metrics and gives an unrealistic measure of how the model performs on new, unseen data.
     - ***Question 9: In class, I talked about the standard scaler, but there are other methods (MinMax, Normalization, etc.). Do some research into a few other methods and explain each one. Make sure to mention when to use one over the other.***
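     - Answer:
       - MinMax Scaler: Scales each feature to the range [0, 1]. Suitable when features have known bounds and few outliers.
       - Robust Scaler: Centers on the median and scales by the IQR, so it is robust to outliers.
       - Normalization: Scales each sample (row) to unit norm (e.g., L2). Useful when the direction of a sample matters more than its magnitude, as in some distance-based algorithms.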
     - ***Question 10: What is the purpose of scaling your data in the first place?***
     - Answer: Scaling puts all features on a comparable range, which prevents features with larger ranges from dominating the model.

6. **Modeling + Evaluation**
   - For the assignment, build the following models to predict the binary "income" target and output the accuracy, precision, recall, and F1 score. Just use the default parameters:
     - Linear Regression
       - Note: Linear Regression does not make sense for this type of problem, so make it so that any values above 0.5 are predicted as a 1 and 0 otherwise.
       - ***Question 11: Why does linear regression not make sense to do here?***
       - Answer: Linear regression assumes a continuous target and produces unbounded outputs, which can fall below 0 or above 1 and cannot be read as probabilities. The target here is binary, so logistic regression is more appropriate.
     - Logistic Regression
       - ***Question 12: Why does it make more sense to use logistic regression than linear?***
       - Answer: Logistic regression models the probability of a binary outcome and outputs a value between 0 and 1, aligning with the binary classification task.
     - Decision Trees
     - Random Forest
     - AdaBoost
       - ***Question 13: We did not talk about AdaBoost in class, but do some research on this type of model and first explain how it works (briefly).***
       - Answer: AdaBoost trains multiple weak learners (e.g., decision stumps) sequentially, with each learner focusing on the errors of the previous one. The final model combines all weak learners' predictions.
       - ***Question 14: What is the idea of a stump?***
       - Answer: A stump is a decision tree with a maximum depth of 1, typically used as a weak learner in ensemble methods like AdaBoost.
       - ***Question 15: What is boosting and how does it apply to tree-based models?***
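       - Answer: Boosting is an ensemble technique that combines multiple weak learners, trained one after another, into a strong learner. With tree-based models, each new tree is fit to correct the errors of the current ensemble: AdaBoost increases the weights of misclassified samples, while gradient boosting fits the next tree to the remaining residuals.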

7. **For the best model from step 6, use GridSearchCV and RandomizedSearchCV for tuning the model parameters; do some research into the hyperparameters:**
   - ***Question 16: Which parameters did you choose to tune?***
   - Answer: I tuned `n_estimators`, `max_depth`, and `min_samples_split` on the Random Forest model, since they control the size of the ensemble and the complexity of each tree and therefore tend to have a large effect on tree-based model performance.

   - ***Question 17: Did your GridSearch and RandomSearch output the same values for the hyperparameters? Why or why not?***
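   - Answer: No. GridSearchCV evaluates all 27 combinations in the grid, while RandomizedSearchCV with `n_iter=10` samples only 10 of them, so it can miss the combination the grid search picks. Here the two results differed only in `n_estimators` (200 vs. 100). The Random Forest itself is also randomized, so scores for nearby settings can vary from run to run.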
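
For the AdaBoost answers in step 6 (Questions 13 to 15), the following is a minimal sketch on synthetic data, not on the Adult dataset. It relies on scikit-learn's default weak learner for `AdaBoostClassifier`, which is a depth-1 decision tree, and uses `staged_score` to show how accuracy changes as stumps are added. The names `X_demo`, `y_demo`, and `ada` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data (assumption: stands in for real features).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# With no estimator specified, scikit-learn boosts depth-1 decision trees (stumps).
ada = AdaBoostClassifier(n_estimators=25, random_state=0)
ada.fit(X_demo, y_demo)

# Each fitted weak learner is a single split: depth 1.
stump_depths = [est.get_depth() for est in ada.estimators_]

# Training accuracy of the ensemble after 1, 2, ..., n boosting rounds.
staged_acc = list(ada.staged_score(X_demo, y_demo))

print("max stump depth:", max(stump_depths))
print("accuracy after first stump:", round(staged_acc[0], 3))
print("accuracy after all stumps:", round(staged_acc[-1], 3))
```

Each individual stump is a weak classifier; the ensemble's accuracy comes from combining many of them, with later stumps trained on reweighted data.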


# Data Cleaning

In [8]:
# Do steps 1 - 7 here!

# step 1
print(df.info())
print(df.describe())
print(df.head())

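# fnlwgt is a census sampling weight, not a property of the person, so drop it.
df.drop(columns=['fnlwgt'], inplace=True)

# Strip stray whitespace from the column names and from the string columns
# that contain missing values.
df.columns = df.columns.str.strip()
df['workclass'] = df['workclass'].str.strip()
df['occupation'] = df['occupation'].str.strip()
df['native-country'] = df['native-country'].str.strip()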

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
None
                age        fnlwgt  education-num  capital-gain  capital-loss  \
count  4
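
The answer to Question 1 names constant columns, ID columns, and redundant columns as candidates for dropping. The sketch below shows one way to check for each on a small toy frame with made-up values; the column names `row_id` and `country` and the helper `is_one_to_one` are hypothetical. In the Adult data, `education-num` looks like a numeric version of `education`, which could be verified the same way.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring a few Adult-style columns.
toy = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5, 6],
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad", "Masters"],
    "education-num": [13, 9, 13, 14, 9, 14],
    "country": ["US"] * 6,
    "hours-per-week": [40, 35, 50, 40, 20, 45],
})

# Constant columns: a single value on every row carries no signal.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

# ID-like columns: a distinct value on every row. This is only a heuristic;
# a truly continuous feature can also be unique per row.
id_like_cols = [c for c in toy.columns if toy[c].nunique() == len(toy)]


def is_one_to_one(frame, a, b):
    """True when each value of `a` maps to exactly one value of `b` and vice versa."""
    return bool(
        (frame.groupby(a)[b].nunique() == 1).all()
        and (frame.groupby(b)[a].nunique() == 1).all()
    )


# Redundant pair: one column is just a recoding of the other.
redundant = is_one_to_one(toy, "education", "education-num")

print(constant_cols, id_like_cols, redundant)
```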

# Handle Missing Values

In [9]:
missing_values = df.isnull().sum()
print(missing_values)

# Impute missing categorical values with each column's mode.
# Assigning the result back avoids the chained `inplace=True` call that pandas warns about.
for col in ['workclass', 'occupation', 'native-country']:
    df[col] = df[col].fillna(df[col].mode()[0])


age                 0
workclass         963
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64
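
An alternative to filling with the mode of the whole DataFrame is to learn the mode from the training rows only and reuse it on the test rows, in the same spirit as the scaling rule in step 5. The sketch below uses scikit-learn's `SimpleImputer` with `strategy="most_frequent"` on two tiny made-up frames; `train` and `test` here are illustrative and are not the notebook's actual split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frames standing in for a train/test split.
train = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "occupation": ["Sales", np.nan, "Sales", "Tech-support"],
})
test = pd.DataFrame({
    "workclass": [np.nan, "Local-gov"],
    "occupation": ["Sales", np.nan],
})

# Learn each column's most frequent value from the training rows only.
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training modes on the test rows (no refitting).
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)
print(test_filled)
```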

# Encode Class Data

In [10]:
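from sklearn.preprocessing import LabelEncoder

# One-hot encode the nominal features; drop_first=True drops one dummy per
# feature so the remaining columns are not perfectly collinear.
one_hot_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
df = pd.get_dummies(df, columns=one_hot_features, drop_first=True)

# Encode the target as integers: '<=50K' -> 0, '>50K' -> 1.
label_encoder = LabelEncoder()
df['income'] = label_encoder.fit_transform(df['income'])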
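
To make the encodings from Questions 3 and 4 concrete, here is a small sketch that applies label, one-hot, frequency, and target encoding to one toy column using plain pandas. The data is made up, and in practice the target means would be computed on the training split only to avoid leakage.

```python
import pandas as pd

# Toy data (made-up values): one categorical feature and a binary target.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc", "Private", "State-gov"],
    "income": [0, 1, 0, 1, 1, 0],
})

# Label encoding: one integer per category (categories sorted alphabetically).
label_encoded = toy["workclass"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(toy["workclass"], prefix="workclass")

# Frequency encoding: share of rows that belong to each category.
frequency_encoded = toy["workclass"].map(toy["workclass"].value_counts(normalize=True))

# Target encoding: mean of the target within each category.
target_means = toy.groupby("workclass")["income"].mean()
target_encoded = toy["workclass"].map(target_means)

print(pd.DataFrame({
    "workclass": toy["workclass"],
    "label": label_encoded,
    "frequency": frequency_encoded,
    "target": target_encoded,
}))
print(one_hot)
```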

# Split Train and Test

In [11]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['income'])
y = df['income']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
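
The effect of `stratify=y` can be checked by comparing class proportions with and without it. The sketch below uses a synthetic target with 24% positives, which is an assumed ratio chosen for illustration rather than a figure measured in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target (assumed 24% positives).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 240 + [0] * 760)

# Plain random split: class proportions can drift between train and test.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Stratified split: both parts keep (almost exactly) the original proportions.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print("plain      train/test positive rate:", y_tr_plain.mean(), y_te_plain.mean())
print("stratified train/test positive rate:", y_tr_strat.mean(), y_te_strat.mean())
```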

# Data Scale

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
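
The scalers discussed in Question 9 can be compared side by side. The sketch below fits each one on a tiny made-up training array that contains an outlier, then transforms a single test row using only the training statistics. The arrays `train_demo` and `test_demo` are illustrative values, not Adult data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# Made-up training rows: [age, capital-gain], with one extreme capital-gain value.
train_demo = np.array([[20.0, 0.0], [30.0, 0.0], [40.0, 0.0], [50.0, 99999.0]])
test_demo = np.array([[60.0, 5000.0]])

scaled = {}
for name, scaler in [
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
    ("robust", RobustScaler()),
]:
    scaler.fit(train_demo)                       # statistics come from training rows only
    scaled[name] = scaler.transform(test_demo)   # test rows reuse those statistics

# Normalizer works per row, not per column: each sample is rescaled to unit L2 norm.
row_normalized = Normalizer(norm="l2").fit_transform(train_demo)

for name, values in scaled.items():
    print(name, values)
print("row norms:", np.linalg.norm(row_normalized, axis=1))
```

Because the scalers are fit on the training rows only, a test value can land outside the training range; for example, the min-max scaled age of the test row exceeds 1.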


# Modeling

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier()
}

def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred, average='binary'):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred, average='binary'):.4f}")
    print(f"F1 Score: {f1_score(y_test, y_pred, average='binary'):.4f}")


for name, model in models.items():
    print(f"--- {name} ---")
    evaluate_model(model, X_train_scaled, X_test_scaled, y_train, y_test)


--- Logistic Regression ---
Accuracy: 0.8507
Precision: 0.7312
Recall: 0.5944
F1 Score: 0.6558
--- Decision Tree ---
Accuracy: 0.8232
Precision: 0.6337
Recall: 0.6192
F1 Score: 0.6264
--- Random Forest ---
Accuracy: 0.8456
Precision: 0.6999
Recall: 0.6212
F1 Score: 0.6582
--- AdaBoost ---
Accuracy: 0.8632
Precision: 0.7735
Recall: 0.6058
F1 Score: 0.6795
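
Step 6 also asks for a Linear Regression model whose outputs are thresholded at 0.5, which the `models` dictionary above does not include. The following is a minimal sketch of that idea on synthetic data generated with `make_classification`; the variable names and the class weights are assumptions for illustration. In the notebook, the same pattern would be applied to `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (assumption: stands in for the encoded Adult features).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.76, 0.24], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit an ordinary least-squares model on the 0/1 labels.
lin_reg = LinearRegression()
lin_reg.fit(X_tr, y_tr)

# The raw outputs are unbounded real numbers, so threshold them at 0.5.
raw_scores = lin_reg.predict(X_te)
y_pred = (raw_scores > 0.5).astype(int)

print(f"Accuracy: {accuracy_score(y_te, y_pred):.4f}")
print(f"Precision: {precision_score(y_te, y_pred, zero_division=0):.4f}")
print(f"Recall: {recall_score(y_te, y_pred, zero_division=0):.4f}")
print(f"F1 Score: {f1_score(y_te, y_pred, zero_division=0):.4f}")
```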


# Parameter Search

In [14]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

random_search = RandomizedSearchCV(RandomForestClassifier(), param_grid, cv=5, n_iter=10, random_state=42)
random_search.fit(X_train_scaled, y_train)

print("Best parameters (GridSearchCV):", grid_search.best_params_)
print("Best parameters (RandomizedSearchCV):", random_search.best_params_)


Best parameters (GridSearchCV): {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}
Best parameters (RandomizedSearchCV): {'n_estimators': 100, 'min_samples_split': 10, 'max_depth': 20}
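
One way to see why the two searches can disagree is to count the candidates each one evaluates. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` with the same grid, `n_iter`, and `random_state` as above. It only enumerates the candidates and does not train any model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

# Every combination a grid search would evaluate: 3 * 3 * 3.
all_combos = list(ParameterGrid(param_grid))

# The subset a randomized search draws when limited to 10 iterations.
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=42))

# Check whether the grid search's best combination was among the sampled ones.
grid_best = {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}

print("grid candidates:", len(all_combos))
print("sampled candidates:", len(sampled))
print("grid's best combination was sampled:", grid_best in sampled)
```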
