# Section 3: Classifiers

In this section, we train three classifiers:

(1) Logistic Regression
(2) Random Forest
(3) XGBoost

Logistic Regression serves as a baseline model, Random Forest provides strong performance along with a built-in feature ranking, and XGBoost offers a gradient-boosting approach for comparison with Random Forest.

In [34]:
import pandas as pd

df = pd.read_csv("../data/preprocessed/cleaned_raw_encoded.csv")
df.head()

| | age | class of worker | detailed industry recode | detailed occupation recode | education | wage per hour | enroll in edu inst last wk | marital stat | major industry code | major occupation code | ... | country of birth father | country of birth mother | country of birth self | citizenship | own business or self employed | fill inc questionnaire for veteran's admin | veterans benefits | weeks worked in year | year | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | 3.0 | 0 | 0 | 12.0 | 0 | 2.0 | 6.0 | 14.0 | 6.0 | ... | 39.0 | 39.0 | 39.0 | 4.0 | 0 | 1.0 | 2 | 0 | 95 | 0 |
| 1 | 58 | 6.0 | 4 | 34 | 16.0 | 0 | 2.0 | 0.0 | 4.0 | 8.0 | ... | 39.0 | 39.0 | 39.0 | 4.0 | 0 | 1.0 | 2 | 52 | 94 | 0 |
| 2 | 18 | 3.0 | 0 | 0 | 0.0 | 0 | 1.0 | 4.0 | 14.0 | 6.0 | ... | 41.0 | 41.0 | 41.0 | 0.0 | 0 | 1.0 | 2 | 0 | 95 | 0 |
| 3 | 9 | 3.0 | 0 | 0 | 10.0 | 0 | 2.0 | 4.0 | 14.0 | 6.0 | ... | 39.0 | 39.0 | 39.0 | 4.0 | 0 | 1.0 | 0 | 0 | 94 | 0 |
| 4 | 10 | 3.0 | 0 | 0 | 10.0 | 0 | 2.0 | 4.0 | 14.0 | 6.0 | ... | 39.0 | 39.0 | 39.0 | 4.0 | 0 | 1.0 | 0 | 0 | 94 | 0 |


In [35]:
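# Separate the features, the label, and the sample weights.
# Note: only "label" is dropped here, so "weight" remains a column of X.
X = df.drop(columns=["label"])
y = df["label"]     # 0 = <50K, 1 = >50K
w = df["weight"]    # sample weights, passed to fit() and to the metrics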


In [36]:
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

# Identify column types: any remaining string columns are treated as categorical
categorical_cols = df.select_dtypes(include="object").columns.tolist()

numeric_cols = [
    'age', 'wage per hour', 'capital gains', 'capital losses',
    'dividends from stocks', 'weight', 'num persons worked for employer',
    'own business or self employed', 'veterans benefits',
    'weeks worked in year', 'year'
]

# Ordinal-encode the categorical columns
encoder = OrdinalEncoder()
X[categorical_cols] = encoder.fit_transform(X[categorical_cols])

# Standardize the numeric columns. Scaling matters mainly for logistic regression;
# the tree-based models are largely insensitive to it, but all three share this X.
scaler = StandardScaler()
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])


# Stratified train/test split; the sample weights are split alongside X and y
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w,
    test_size=0.2,
    random_state=42,
    stratify=y
)
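
The cell above fits the scaler on the full feature matrix before splitting, so statistics from the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training fold only. The sketch below illustrates that order of operations on synthetic data; the toy frame and the column names `x1` and `x2` are hypothetical stand-ins, not the census data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical toy data: two numeric features, an imbalanced label, a sample weight
n = 1000
toy = pd.DataFrame({
    "x1": rng.normal(50, 10, n),
    "x2": rng.exponential(5, n),
    "weight": rng.uniform(500, 3000, n),
    "label": (rng.random(n) < 0.1).astype(int),
})

X_toy = toy[["x1", "x2"]]
y_toy = toy["label"]
w_toy = toy["weight"]

# Split first; stratify keeps the class ratio (nearly) identical in both folds
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X_toy, y_toy, w_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training fold only, then apply it to both folds
toy_scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    toy_scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    toy_scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print("positive share (train, test):", y_tr.mean(), y_te.mean())
print("train means after scaling:", X_tr_scaled.mean().round(6).tolist())
```

The training columns end up with mean zero by construction, while the test columns are only approximately centered, which is the expected behavior when the test fold is treated as unseen data.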






### Logistic regression

Train a logistic regression model, predict on the test set, and report the weighted accuracy.

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Fit the baseline model using the sample weights
log_clf = LogisticRegression(max_iter=500)
log_clf.fit(X_train, y_train, sample_weight=w_train)

# Predict on the test set and compute the weighted accuracy
y_pred = log_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred, sample_weight=w_test)
acc




STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9461917997765783
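
Because `sample_weight=w_test` is passed to `accuracy_score`, the figure above is a weighted accuracy: the total weight of the correctly classified rows divided by the total weight of all test rows. The small example below, using made-up labels and weights, shows that calculation by hand next to the unweighted version.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, predictions and weights, for illustration only
y_true_toy = np.array([0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 0])
w_toy = np.array([1.0, 1.0, 1.0, 5.0, 2.0])

# True where the prediction matches the label
correct = (y_true_toy == y_pred_toy)

# Weighted accuracy by hand: weight of the correct rows / total weight
manual_weighted = (w_toy * correct).sum() / w_toy.sum()

weighted = accuracy_score(y_true_toy, y_pred_toy, sample_weight=w_toy)
unweighted = accuracy_score(y_true_toy, y_pred_toy)

print("manual weighted:", manual_weighted)
print("sklearn weighted:", weighted)
print("unweighted:", unweighted)
```

In this toy case the weighted accuracy is 0.7 and the unweighted accuracy is 0.6, because the heavily weighted fourth row is classified correctly.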

### Random Forest
Train a random forest and predict on the test set.

In [41]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)
import seaborn as sns
import matplotlib.pyplot as plt

# Build the model
rf = RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    n_jobs=-1
)

# Train (IMPORTANT: include sample_weight)
rf.fit(X_train, y_train, sample_weight=w_train)

# Predictions
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]





### Feature ranking from the random forest

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Feature names, taken from the columns of the training frame
feature_names = X_train.columns

# After RandomForest training
importances = rf.feature_importances_

# Create DataFrame with names + importance
importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values(by="importance", ascending=False)

importance_df.head(40)



| | feature | importance |
|---|---|---|
| 24 | weight | 0.097009 |
| 3 | detailed occupation recode | 0.091985 |
| 0 | age | 0.089934 |
| 18 | dividends from stocks | 0.088754 |
| 16 | capital gains | 0.08345 |
| 2 | detailed industry recode | 0.053279 |
| 4 | education | 0.050533 |
| 30 | num persons worked for employer | 0.040114 |
| 39 | weeks worked in year | 0.038072 |
| 8 | major industry code | 0.036633 |
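
Impurity-based importances such as `feature_importances_` are computed from the training data and tend to favour continuous or high-cardinality columns, so a ranking like the one above is often cross-checked with permutation importance on held-out data. The sketch below does this on synthetic data; the feature names `signal` and `noise_id` are hypothetical, and nothing here reproduces the census results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: `signal` drives the label, `noise_id` is unrelated to it
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_id": rng.uniform(0, 1, size=n),
})
demo_y = (demo["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.25, random_state=0, stratify=demo_y
)

demo_rf = RandomForestClassifier(n_estimators=100, random_state=0)
demo_rf.fit(Xd_train, yd_train)

# Shuffle one column at a time on the held-out fold and record the drop in score
perm = permutation_importance(
    demo_rf, Xd_test, yd_test, n_repeats=5, random_state=0
)

comparison = pd.DataFrame({
    "feature": demo.columns,
    "impurity_importance": demo_rf.feature_importances_,
    "permutation_importance": perm.importances_mean,
}).sort_values("permutation_importance", ascending=False)

print(comparison)
```

`permutation_importance` also accepts a `sample_weight` argument, so on weighted data the held-out weights can be passed in the same way as for the metrics.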


### XGBoost
Train an XGBoost classifier and predict on the test set.

In [None]:
from xgboost import XGBClassifier

# Build the gradient-boosted tree model
xgb_clf = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss"
)

# Train with the sample weights
xgb_clf.fit(X_train, y_train, sample_weight=w_train)

# Class predictions and positive-class probabilities on the test set
y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]


For the model evaluation code, see src/evaluation.py.
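
The contents of `src/evaluation.py` are not shown in this section. As a rough illustration only, a weighted evaluation helper could look like the sketch below; the function name `evaluate_weighted`, its signature, and its return format are assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
)

def evaluate_weighted(y_true, y_pred, y_proba, sample_weight=None):
    """Hypothetical helper: binary-classification metrics with optional weights."""
    return {
        "accuracy": accuracy_score(y_true, y_pred, sample_weight=sample_weight),
        "precision": precision_score(
            y_true, y_pred, sample_weight=sample_weight, zero_division=0
        ),
        "recall": recall_score(
            y_true, y_pred, sample_weight=sample_weight, zero_division=0
        ),
        "f1": f1_score(y_true, y_pred, sample_weight=sample_weight, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_proba, sample_weight=sample_weight),
        "confusion_matrix": confusion_matrix(
            y_true, y_pred, sample_weight=sample_weight
        ),
    }

# Made-up example values, for illustration only
y_true_ex = np.array([0, 0, 0, 1, 1, 1])
y_proba_ex = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.9])
y_pred_ex = (y_proba_ex >= 0.5).astype(int)
w_ex = np.array([1.0, 2.0, 1.0, 1.0, 2.0, 1.0])

metrics = evaluate_weighted(y_true_ex, y_pred_ex, y_proba_ex, sample_weight=w_ex)
for name, value in metrics.items():
    print(name, value)
```

With the variables defined in this notebook, a call for the random forest would take the form `evaluate_weighted(y_test, y_pred_rf, y_proba_rf, sample_weight=w_test)`.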