# Predicting Income

In this lab we'll be using a dataset from Kaggle yet again; it's just so fun and rich! We're using the following income dataset, where we want to use the other features to predict whether someone makes over $50,000 per year.

Primary Goals:

Predict income.

Assignment Specs:

You need to use Naive Bayes and neural networks to answer the question above, and you should explore at least two other models so that you can answer it as well as you can. You may use multiple neural network models if you like, but I'd encourage you to consider past model types we've discussed.
This dataset has variables of multiple types, so it should give you an opportunity to explore how neural networks can (or can't) handle data of different types. You may need to one-hot encode the character (string) variables.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML file, an .ipynb file, or a Colab link. Be sure to rerun your entire notebook fresh before submitting!

In [83]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
import numpy as np

In [20]:
# load in data
income = pd.read_csv("income_evaluation.csv")
# strip stray whitespace from the column names
income.columns = income.columns.str.strip()
# drop rows with a missing income value
income = income.dropna(subset = ["income"])
# encode 1 as over 50K and 0 as 50K or less
income['income_binary'] = income['income'].apply(lambda x: 0 if x.strip() == '<=50K' else 1)
income = income.drop(columns = ["income"])
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [None]:
# determine x and y
X = income.drop(columns = ["income_binary"])
y = income[["income_binary"]]

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:


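def evaluate_model(X, y, model_type="Neural Network (10)", test_size=0.2, random_state=42):
    # model_type must be one of the keys of the classifiers dict below
    # (Naive Bayes is handled in its own cell, not here)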
    # Split data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Identify categorical columns
    cat_cols = X.select_dtypes(include='object').columns.tolist()

    # Column transformer for preprocessing
    ct = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('standardize', StandardScaler(), make_column_selector(dtype_include=np.number))
    ])
    
    # List of classifiers
    classifiers = {
        "Neural Network (10)": MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=500, random_state=random_state),
        "Neural Network (50)": MLPClassifier(hidden_layer_sizes=(50,), activation='relu', max_iter=500, random_state=random_state),
        "Logistic Regression + Bagging": BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=100),
        "KNN + Bagging": BaggingClassifier(estimator=KNeighborsClassifier(n_neighbors=5), n_estimators=100),
        "Decision Tree + Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100),
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "Stacking (LR, DT, KNN)": StackingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)), 
                                                              ('dt', DecisionTreeClassifier()), 
                                                              ('knn', KNeighborsClassifier())]),
    }

    # Select classifier based on model_type
    clf = classifiers[model_type]
    pipeline = Pipeline([("preprocess", ct), ("model", clf)])
    
    # Fit the model
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # Calculate accuracy and F1 score
    # (average='weighted' averages the per-class F1 scores, weighted by class size)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Print the results for the selected model
    print(f"Results for {model_type}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")



# Neural Network 10

In [67]:
evaluate_model(X, y, model_type="Neural Network (10)")

  y = column_or_1d(y, warn=True)


Results for Neural Network (10):
Accuracy: 0.8627
F1 Score: 0.8588
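
The `y = column_or_1d(y, warn=True)` line in the output is the tail of scikit-learn's `DataConversionWarning`, which appears when `y` is passed as a one-column DataFrame (as with `income[["income_binary"]]`) rather than as a 1-D Series or array. A minimal sketch of the difference, using a small made-up frame rather than the real data:

```python
import pandas as pd

# Toy frame (hypothetical values) with the same target column name
df = pd.DataFrame({"income_binary": [0, 1, 0, 1]})

y_frame = df[["income_binary"]]   # double brackets: DataFrame with one column
y_series = df["income_binary"]    # single brackets: 1-D Series

print(y_frame.shape)   # two-dimensional
print(y_series.shape)  # one-dimensional

# An existing one-column frame can also be flattened to a 1-D array
y_flat = y_frame.to_numpy().ravel()
print(y_flat.shape)
```

Passing the 1-D form to `fit` should avoid the warning; the fitted model is the same either way, since scikit-learn flattens the column itself.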


# Neural Network 50

In [68]:
evaluate_model(X, y, model_type="Neural Network (50)")

  y = column_or_1d(y, warn=True)


Results for Neural Network (50):
Accuracy: 0.8402
F1 Score: 0.8409


# Naive Bayes

In [89]:
X = income.drop(columns=["income_binary"])  # Features (excluding target)
y = income['income_binary']  # Target variable

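# CategoricalNB works on integer-coded categories, so only the string columns are kept here;
# the numeric columns are dropped by ColumnTransformer's default remainder='drop'.
cat_cols = X.select_dtypes(include='object').columns.tolist()
ct = ColumnTransformer([
    # Categories unseen during fit are encoded as -1, which CategoricalNB rejects (it needs non-negative integers)
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols)
])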

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_transformed = ct.fit_transform(X_train)
X_test_transformed = ct.transform(X_test)

nb_model = CategoricalNB()
nb_model.fit(X_train_transformed, y_train)

y_pred = nb_model.predict(X_test_transformed)
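print("Results for Naive Bayes:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
# F1 here uses the default average='binary' (positive class only), unlike the weighted F1 in evaluate_model
print(f"F1 Score: {f1_score(y_test, y_pred):.2f}")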

Results for Naive Bayes:
Accuracy: 0.80
F1 Score: 0.63
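
A note on comparing this F1 score with the others: the Naive Bayes cell calls `f1_score` with its default `average='binary'`, which scores only the positive class (income over 50K), while `evaluate_model` uses `average='weighted'`, which averages the F1 of both classes by class size. On imbalanced data the weighted version is usually higher, so the two numbers are not directly comparable. The sketch below shows the gap on made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 8 negatives, 2 positives, one positive missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Default average='binary': F1 of the positive class only
binary_f1 = f1_score(y_true, y_pred)

# average='weighted': per-class F1 scores averaged by class frequency
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"Binary F1:   {binary_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

For a like-for-like comparison, the same `average` setting would need to be used for every model.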


# Bagging

## Logistic Regression

In [69]:
evaluate_model(X, y, model_type="Logistic Regression + Bagging")

  y = column_or_1d(y, warn=True)


Results for Logistic Regression + Bagging:
Accuracy: 0.8581
F1 Score: 0.8531


## KNN

In [74]:
evaluate_model(X, y, model_type="KNN + Bagging")

  y = column_or_1d(y, warn=True)


KeyboardInterrupt: 

## Decision Tree

In [71]:
evaluate_model(X, y, model_type="Decision Tree + Bagging")

  y = column_or_1d(y, warn=True)


Results for Decision Tree + Bagging:
Accuracy: 0.8614
F1 Score: 0.8578


# Random Forest

In [72]:
evaluate_model(X, y, model_type="Random Forest")

  return fit_method(estimator, *args, **kwargs)


Results for Random Forest:
Accuracy: 0.8592
F1 Score: 0.8553


# Stacking

In [73]:
evaluate_model(X, y, model_type="Stacking (LR, DT, KNN)")

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, dtype=self.classes_.dtype, warn=True)


Results for Stacking (LR, DT, KNN):
Accuracy: 0.8614
F1 Score: 0.8566
