# 📚 Introduction

In this project, we use the AutoGluon library to build a classification model on the Adult Income dataset.
The objective is to predict whether an individual's income exceeds $50K/year based on attributes such as age, education, occupation, and more.

AutoGluon simplifies machine learning by automatically preprocessing the data, training a range of candidate models, and ensembling them to improve predictive performance.


## 🧾 Data Definitions

| Column Name        | Description                                          | Data Type |
|--------------------|-------------------------------------------------------|-----------|
| age                | Age of the individual                                | int64     |
| workclass          | Type of employment (e.g., Private, Self-emp, Govt)    | object    |
| fnlwgt             | Final weight — census sampling weight                | int64     |
| education          | Level of education attained                          | object    |
| educational-num    | Numerical representation of education                | int64     |
| marital-status     | Marital status (e.g., Married, Never-married)         | object    |
| occupation         | Type of occupation (e.g., Tech-support, Craft-repair) | object    |
| relationship       | Relationship status (e.g., Husband, Wife, Own-child)  | object    |
| race               | Race of the individual                               | object    |
| gender             | Gender (Male/Female)                                 | object    |
| capital-gain       | Income from investment sources                       | int64     |
| capital-loss       | Losses from investment sources                       | int64     |
| hours-per-week     | Average hours worked per week                        | int64     |
| native-country     | Country of origin                                    | object    |
| income             | Target Variable (<=50K or >50K)                      | object    |


## Importing Necessary Libraries

In [None]:
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

## 📂 Data Loading and Exploration

In [None]:
#Loading dataset
df = pd.read_csv('adult.csv')
df.head()

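|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|-----------------|----------------|------------|--------------|------|--------|--------------|--------------|----------------|----------------|--------|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |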


- A ydata profiling report for the dataset is also available: [Exploratory Data Analysis Report - Adult Dataset](https://aayushsingh2708.github.io/EDA_Reports/ydata/adult.html)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
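
`df.info()` reports no nulls in any column, yet the preview shows `?` in `workclass` and `occupation`, so missing values appear to be stored as a literal `?` string rather than as `NaN`. A minimal sketch of counting such placeholders, assuming `?` is the only missing-value marker. The small `sample` frame is a hypothetical stand-in built from the five preview rows; on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the five rows shown by df.head().
sample = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "Protective-serv",
                   "Machine-op-inspct", "?"],
    "age": [25, 38, 28, 44, 18],
})

# Count cells equal to the placeholder in each column.
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Optionally convert the placeholders into real missing values
# (assumption: "?" never appears as a legitimate category).
cleaned = sample.replace("?", pd.NA)
print(cleaned.isna().sum())
```

Passing `na_values='?'` to `pd.read_csv` would have the same effect at load time.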


In [6]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,48842.0,38.643585,13.71051,17.0,28.0,37.0,48.0,90.0
fnlwgt,48842.0,189664.134597,105604.025423,12285.0,117550.5,178144.5,237642.0,1490400.0
educational-num,48842.0,10.078089,2.570973,1.0,9.0,10.0,12.0,16.0
capital-gain,48842.0,1079.067626,7452.019058,0.0,0.0,0.0,0.0,99999.0
capital-loss,48842.0,87.502314,403.004552,0.0,0.0,0.0,0.0,4356.0
hours-per-week,48842.0,40.422382,12.391444,1.0,40.0,40.0,45.0,99.0


## 🤖 Model Building using AutoGluon

In [7]:
#Setting Target Column and Splitting Data
target = 'income'
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [8]:
#Initialize and Train AutoGluon Predictor
predictor = TabularPredictor(label=target).fit(train_df)

No path specified. Models will be saved in: "AutogluonModels\ag-20250411_185921"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.4
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          8
Memory Avail:       1.07 GB / 7.65 GB (13.9%)
Disk Space Avail:   271.01 GB / 475.83 GB (57.0%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets=
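
The log notes that no preset was specified, so AutoGluon fell back to its medium preset. A hedged sketch of passing a preset and a time budget explicitly follows. `fit_with_preset` is a hypothetical helper, and the `best_quality` preset, the 600-second `time_limit`, and the `roc_auc` metric are illustrative choices, not settings used in this notebook.

```python
def fit_with_preset(train_df, label, preset="best_quality",
                    time_limit=600, eval_metric="roc_auc"):
    """Train a TabularPredictor with an explicit preset and time budget.

    Hypothetical helper. The import lives inside the function so the
    definition can be loaded even where AutoGluon is not installed.
    """
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric=eval_metric)
    # presets trades accuracy against training and inference cost;
    # time_limit caps the total training time in seconds.
    return predictor.fit(train_df, presets=preset, time_limit=time_limit)
```

In this notebook it would be called as `predictor = fit_with_preset(train_df, target)`. Higher-quality presets typically train for much longer and use more memory, which may matter on a machine with as little free memory as the log reports.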

In [17]:
#Displaying the Model's Leaderboard
predictor.leaderboard(test_df, silent=True)

  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


|   | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|-------|------------|-----------|-------------|----------------|---------------|----------|-------------------------|------------------------|-------------------|-------------|-----------|-----------|
| 0 | CatBoost | 0.880848 | 0.8616 | accuracy | 0.092102 | 0.009907 | 37.454617 | 0.092102 | 0.009907 | 37.454617 | 1 | True | 7 |
| 1 | LightGBM | 0.878596 | 0.8636 | accuracy | 0.025871 | 0.008022 | 0.841387 | 0.025871 | 0.008022 | 0.841387 | 1 | True | 4 |
| 2 | LightGBMLarge | 0.878084 | 0.8604 | accuracy | 0.046343 | 0.015066 | 1.128707 | 0.046343 | 0.015066 | 1.128707 | 1 | True | 12 |
| 3 | XGBoost | 0.877981 | 0.8644 | accuracy | 0.095685 | 0.014506 | 1.257365 | 0.095685 | 0.014506 | 1.257365 | 1 | True | 11 |
| 4 | WeightedEnsemble_L2 | 0.877981 | 0.8644 | accuracy | 0.107695 | 0.014506 | 1.383228 | 0.01201 | 0.0 | 0.125863 | 2 | True | 13 |
| 5 | LightGBMXT | 0.875422 | 0.8584 | accuracy | 0.073954 | 0.025672 | 1.856234 | 0.073954 | 0.025672 | 1.856234 | 1 | True | 3 |
| 6 | RandomForestEntr | 0.864879 | 0.8444 | accuracy | 0.192376 | 0.075082 | 1.791065 | 0.192376 | 0.075082 | 1.791065 | 1 | True | 6 |
| 7 | RandomForestGini | 0.863753 | 0.8452 | accuracy | 0.214176 | 0.044295 | 1.657547 | 0.214176 | 0.044295 | 1.657547 | 1 | True | 5 |
| 8 | NeuralNetFastAI | 0.862934 | 0.8452 | accuracy | 0.794719 | 0.044738 | 37.803961 | 0.794719 | 0.044738 | 37.803961 | 1 | True | 10 |
| 9 | ExtraTreesGini | 0.854233 | 0.8404 | accuracy | 0.352024 | 0.050764 | 1.027913 | 0.352024 | 0.050764 | 1.027913 | 1 | True | 8 |
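
`WeightedEnsemble_L2` sits at stack level 2 and combines the predictions of the level-1 models. Its test and validation scores match `XGBoost` exactly, which would be consistent with the ensemble placing all of its weight on that one model, though the leaderboard alone does not confirm this. A minimal sketch of the general idea, a weighted average of per-model class probabilities, is shown here. The probabilities and weights are hypothetical, and this is not AutoGluon's internal weighting procedure.

```python
def weighted_ensemble(prob_by_model, weights):
    """Blend per-model probabilities of the positive class with fixed weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(next(iter(prob_by_model.values())))
    blended = [0.0] * n_rows
    for name, probs in prob_by_model.items():
        w = weights.get(name, 0.0) / total  # normalise so the weights sum to 1
        for i, p in enumerate(probs):
            blended[i] += w * p
    return blended

# Hypothetical P(>50K) from two models for three rows.
prob_by_model = {
    "XGBoost": [0.80, 0.30, 0.55],
    "LightGBM": [0.60, 0.10, 0.45],
}
weights = {"XGBoost": 0.75, "LightGBM": 0.25}

blended = weighted_ensemble(prob_by_model, weights)
predicted = [">50K" if p >= 0.5 else "<=50K" for p in blended]
print(blended, predicted)
```

Setting one weight to 1 and the rest to 0 reproduces that single model's predictions, which is the degenerate case the matching leaderboard rows hint at.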


## 📈 Model Evaluation

In [None]:
#Evaluating performance
performance = predictor.evaluate(test_df)

In [None]:
#Displaying Performance Metrics
performance_df = pd.DataFrame(list(performance.items()), columns=['Metric', 'Value'])
performance_df

|   | Metric | Value |
|---|--------|-------|
| 0 | accuracy | 0.877981 |
| 1 | balanced_accuracy | 0.798666 |
| 2 | mcc | 0.642419 |
| 3 | roc_auc | 0.930562 |
| 4 | f1 | 0.713874 |
| 5 | precision | 0.792644 |
| 6 | recall | 0.649345 |
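
The `roc_auc` value can be read as the probability that a randomly chosen `>50K` individual receives a higher predicted score than a randomly chosen `<=50K` individual. A minimal pure-Python sketch of that pairwise definition follows. The labels and scores are made-up illustrative values, not model output.

```python
def roc_auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n^2) form is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = >50K) and predicted probabilities of >50K.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
auc = roc_auc_pairwise(labels, scores)
print(round(auc, 4))
```

Here 11 of the 12 positive/negative pairs are ordered correctly, so the sketch returns 11/12.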


In [None]:
#Feature Importance
feature_importance = predictor.feature_importance(test_df)
print(feature_importance)

Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	3.16s	= Expected runtime (0.63s per shuffle set)
	2.52s	= Actual runtime (Completed 5 of 5 shuffle sets)


                 importance    stddev   p_value  n  p99_high   p99_low
marital-status      0.05500  0.004844  0.000007  5  0.064973  0.045027
capital-gain        0.05240  0.002898  0.000001  5  0.058368  0.046432
educational-num     0.03164  0.005098  0.000078  5  0.042137  0.021143
age                 0.01588  0.001973  0.000028  5  0.019942  0.011818
occupation          0.01328  0.002715  0.000198  5  0.018871  0.007689
capital-loss        0.01180  0.001517  0.000032  5  0.014923  0.008677
hours-per-week      0.00956  0.002985  0.001006  5  0.015705  0.003415
relationship        0.00312  0.000996  0.001093  5  0.005171  0.001069
workclass           0.00120  0.001349  0.058794  5  0.003978 -0.001578
gender              0.00076  0.001203  0.115366  5  0.003238 -0.001718
fnlwgt              0.00068  0.000672  0.043261  5  0.002064 -0.000704
native-country      0.00060  0.000894  0.104000  5  0.002442 -0.001242
race                0.00012  0.000438  0.286696  5  0.001022 -0.000782
educat

## 🔍 Prediction Example

In [15]:
#Predictions Sample Output
predictions = predictor.predict(test_df.drop(columns=[target]))
print(predictions.head())

7762     <=50K
23881    <=50K
30507     >50K
28911    <=50K
19484     >50K
Name: income, dtype: object


# 📈 Conclusion

- ✅ We successfully built a **classification model** using **AutoGluon** on the **Adult Income dataset**.
- ✅ The model achieved an overall **accuracy of 87.80%** on the test set.
- ✅ Other key performance metrics:
  - **Balanced Accuracy**: 79.87%
  - **ROC AUC**: 93.06%
  - **F1 Score**: 71.39%
  - **Precision**: 79.26%
  - **Recall**: 64.93%
  - **MCC (Matthews Correlation Coefficient)**: 0.642

- **Balanced Accuracy**:  
  Measures the average of sensitivity (true positive rate) and specificity (true negative rate).  
  It is especially useful when classes are imbalanced.  
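  ➔ A balanced accuracy of **79.87%** is the average of a recall of 64.93% on `>50K` and a specificity of about 94.8% on `<=50K`, so the model handles the `<=50K` class noticeably better than the `>50K` class.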

- **ROC AUC (Receiver Operating Characteristic - Area Under Curve)**:  
  Measures how well the model distinguishes between classes.  
  A score of **1.0** is perfect; **0.5** is random guessing.  
  ➔ A ROC AUC of **93.06%** means the model ranks a randomly chosen high-income individual above a randomly chosen low-income one about 93% of the time.

- **F1 Score**:  
  Harmonic mean of **Precision** and **Recall**.  
  It balances the trade-off between false positives and false negatives.  
  ➔ An F1 score of **71.39%** lies between the precision (79.26%) and the recall (64.93%) and is pulled toward the lower of the two, the recall.

- **Precision**:  
  Out of all people predicted as high-income (>50K), how many were actually high-income?  
  ➔ A Precision of **79.26%** means roughly four out of five individuals the model predicts as earning more than 50K actually do.

- **Recall (Sensitivity)**:  
  Out of all actual high-income individuals, how many did the model correctly identify?  
  ➔ A Recall of **64.93%** means the model identified roughly two-thirds of high-income individuals and missed about 35% of them.
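
The threshold-based metrics above all derive from the four confusion-matrix counts. A minimal sketch of those formulas follows, with `>50K` as the positive class. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not this model's confusion matrix. The sketch assumes every denominator is nonzero.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
    }

# Hypothetical counts, with >50K as the positive class.
metrics = binary_metrics(tp=60, fp=20, fn=40, tn=280)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 while the balanced accuracy is about 0.767, the same kind of gap between the two that appears in the results reported here.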


## 📊 Feature Importance Insights
The most important features influencing the model predictions were:
- `marital-status`
- `capital-gain`
- `educational-num`
- `age`
- `occupation`

These features had the highest impact based on permutation feature importance analysis.
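
Permutation importance measures how much a score drops when a single feature column is shuffled, which breaks its relationship with the target. A minimal pure-Python sketch of the idea follows, using accuracy as the score and 5 shuffles per feature to mirror the "5 shuffle sets" in the log. The toy data and the `model` function are hypothetical stand-ins, not the fitted AutoGluon predictor.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(500)]
labels = [int(x0 > 0.5) for x0, _ in rows]

def model(row):
    # Stand-in for a trained model that only looks at feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

Shuffling the feature the model ignores leaves accuracy unchanged, so its importance is exactly zero, while shuffling the informative feature causes a large drop. Features near the bottom of the table above, whose `p99_low` is negative, are in a similar position: their measured importance is not clearly distinguishable from zero.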

---