# Lab instructions
- https://www.kaggle.com/datasets/lodetomasi1995/income-classification?sort=published  
In this lab we'll be using a dataset from kaggle yet again...it's just so fun and rich! We're using the following income dataset where we want to use the other features to predict whether someone is making over $50,000 per year or not.

# Primary Goals:
Predict income.

# Assignment Specs:

- You need to use Naive Bayes and neural networks in your work to answer the question above, but you should explore at least two other models in order to answer the above questions as best you can. You may use multiple neural network models if you like, but I'd encourage you to consider past model types we've discussed.
- This dataset has variables of multiple types. So, this should give you an opportunity to explore how neural networks can (or can't) handle data of different types. You may need to one-hot encode the character variables...
- Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

# Import Data

In [11]:
import pandas as pd 
df = pd.read_csv("/Users/dan/calpoly/BusinessAnalytics/GSB545ADML/Week4/income_evaluation.csv")
df = df.dropna()
df.head() 

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [20]:
# Check distribution
df.value_counts(" income")

 income
<=50K    24720
>50K      7841
Name: count, dtype: int64
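
As a quick point of reference for the counts above, the short sketch below works out what share of the data each class holds and how accurate a "model" would be if it always guessed the most common class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts output above
counts = {"<=50K": 24720, ">50K": 7841}

total = sum(counts.values())
# Share of each class in the raw data
shares = {label: n / total for label, n in counts.items()}
# Accuracy of a "model" that always guesses the most common class
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained on the raw, unbalanced data would need to beat this baseline to be useful, which is one reason to look at precision and recall rather than accuracy alone.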

# Clean Data
Cleaning Process:
- Balance the dataset so that both income levels have the same number of rows (by oversampling the smaller class).
- Drop *education* because *education-num* already encodes the same information as a number.
- Drop *fnlwgt* since it is a metric for grouping similar types of people in a population. In our model we only care about classifying individuals based on the other features.
- Set up modeling with a train/test split.
- One-hot encode (dummify) the categorical variables, then scale the features.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Balance:
# Separate the two classes
low_income = df[df[' income'] == ' <=50K']
high_income = df[df[' income'] == ' >50K']
# Oversample the minority class
high_income_oversampled = high_income.sample(n=len(low_income), replace=True, random_state=42)
# Combine them back together
df_balanced = pd.concat([low_income, high_income_oversampled])
# Shuffle the resulting DataFrame
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Drop:
income_df = df_balanced.drop(columns=[" education", " fnlwgt"])
# Encode the target: 1 = <=50K, 0 = >50K
income_df[" income"] = income_df[" income"].map({' <=50K': 1, ' >50K': 0})

# Set up for modeling:
# Separate the features (X) from the target (y)
X = income_df.drop(columns=[" income"])
y = income_df[" income"]
# One-hot encode the categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
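
One thing worth checking with the cell above: because the oversampling happens before the train/test split, copies of the same high-income row can end up in both the training and the test set, which may make the test accuracy look better than it would on truly unseen data. A possible alternative ordering is sketched below on hypothetical toy data (the `toy` DataFrame and its `row_id` and `label` columns are made up for illustration and are not part of the income dataset): split first, then oversample only the training portion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 rows of the majority class, 20 of the minority
toy = pd.DataFrame({
    "row_id": range(100),
    "label": [0] * 80 + [1] * 20,
})

# 1. Split first, so the test set only contains original, unduplicated rows
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

# 2. Oversample the minority class inside the training split only
train_major = train[train["label"] == 0]
train_minor = train[train["label"] == 1]
train_minor_up = train_minor.sample(n=len(train_major), replace=True, random_state=42)
train_balanced = pd.concat([train_major, train_minor_up]).sample(frac=1, random_state=42)

# 3. Confirm that no original row appears in both sets
overlap = set(train_balanced["row_id"]) & set(test["row_id"])
print("Balanced training counts:", train_balanced["label"].value_counts().to_dict())
print("Rows shared between train and test:", len(overlap))
```

With this ordering the test set keeps the original class imbalance, so its accuracy is best read alongside the per-class precision and recall.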

In [24]:
# Recheck distribution
income_df.value_counts(" income")

 income
0    24720
1    24720
Name: count, dtype: int64

# Create Neural Network Model
We use scikit-learn's *MLPClassifier* to build a neural network with a single hidden layer of 64 neurons. The hidden layer uses the *sigmoid* (logistic) activation function. After fitting the model on the training data, we predict on the test data to see our results.
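
To give a feel for what "one hidden layer with a sigmoid function" means, here is a minimal sketch of a single forward pass written with NumPy. The sizes (3 inputs, 4 hidden neurons), the input values, and the random, untrained weights are all assumptions chosen for illustration; *MLPClassifier* does this internally with learned weights, and this is not its actual implementation.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Hypothetical sizes: 3 input features, 4 hidden neurons, 1 output
x = np.array([0.5, -1.2, 2.0])   # one scaled example
W1 = rng.normal(size=(3, 4))     # input-to-hidden weights (random, untrained)
b1 = np.zeros(4)                 # hidden-layer biases
W2 = rng.normal(size=(4, 1))     # hidden-to-output weights (random, untrained)
b2 = np.zeros(1)                 # output bias

hidden = sigmoid(x @ W1 + b1)        # hidden-layer activations, each between 0 and 1
prob = sigmoid(hidden @ W2 + b2)[0]  # output squashed into a probability

print("Hidden activations:", hidden.round(3))
print("Output probability:", round(float(prob), 3))
```

Training adjusts the weights so that the output probability is close to 1 for one class and close to 0 for the other.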

In [23]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create the model
nn_classifier = MLPClassifier(
    hidden_layer_sizes=(64,),      # One hidden layer with 64 neurons
    activation='logistic',         # Hidden-layer activation: sigmoid (alternatives: 'relu', 'tanh')
    solver='adam',                 # Optimizer: 'adam' is scikit-learn's default
    alpha=0.001,                   # L2 regularization strength
    learning_rate='adaptive',      # Only used when solver='sgd'; has no effect with 'adam'
    max_iter=500,                  # Max training epochs
    early_stopping=True,           # Hold out part of the training data and stop when its score stalls
    n_iter_no_change=50,           # Number of epochs with no improvement before stopping
    random_state=42                # Makes the weight initialization and validation split reproducible
)

# Fit the model
nn_classifier.fit(X_train_scaled, y_train)

# Predict
y_pred_class = nn_classifier.predict(X_test_scaled)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred_class))
print("\nClassification Report:\n", classification_report(y_test, y_pred_class))


Accuracy: 0.8537621359223301

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      4928
           1       0.88      0.82      0.85      4960

    accuracy                           0.85      9888
   macro avg       0.86      0.85      0.85      9888
weighted avg       0.86      0.85      0.85      9888



# Results
Our neural network classifier achieved an overall accuracy of 85.4% on the test data after balancing the classes through oversampling. The model shows balanced performance across both classes. Note that the target was encoded as 0 for >50K and 1 for <=50K. For Class 0 (>50K), it achieved a precision of 83%, recall of 89%, and an F1-score of 86%. For Class 1 (<=50K), it achieved a precision of 88%, recall of 82%, and an F1-score of 85%. The macro and weighted averages for precision, recall, and F1-score are all around 85-86%, indicating that the model treats both classes relatively evenly.
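
For readers unfamiliar with the F1-score, it is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall values in the classification report above; the helper name `f1_score_from` is only illustrative.

```python
def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall values from the classification report above
reported = {
    "class 0": (0.83, 0.89),
    "class 1": (0.88, 0.82),
}

for label, (precision, recall) in reported.items():
    f1 = f1_score_from(precision, recall)
    print(f"{label}: F1 = {f1:.2f}")
```

The recomputed values round to the same F1-scores shown in the report (0.86 and 0.85).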