# Problem Statement
---

Predicting an individual's income level has applications in marketing, taxation, and public policy, where it can help target interventions, advertisements, and policies at the right audience. More specifically, predicting whether an individual's annual income exceeds $50,000 or is at or below that threshold can inform resource allocation, financial services, and employment-related decision-making.

In this scenario, we are given a dataset containing demographic and employment-related features of individuals. The goal is to use a **Random Forest**, an ensemble of decision trees, to predict whether an individual's income exceeds $50,000 annually. This is a **binary classification** problem where the target variable is income, represented as either `<=50K` or `>50K`.

# Objective

1.  **Classification Task**:
    
    *   Predict whether an individual earns more than 50K (**>50K**) or less than or equal to 50K (**<=50K**) based on demographic and employment-related features.

2.  **Feature Analysis**:
    
    *   Analyze the significance and contribution of various features in predicting income, identifying which ones are the most influential in determining the target outcome.

3.  **Model Evaluation**:
    
    *   Assess the performance of the Random Forest model using key evaluation metrics such as:
        
        *   **Accuracy** – Measures the overall correctness of the model.
        
        *   **Precision, Recall, and F1-Score** – Evaluate the model's performance in terms of both false positives and false negatives, providing a more nuanced understanding of its effectiveness.
        
        *   **Confusion Matrix** – Visualize the true positives, false positives, true negatives, and false negatives to better understand the model's predictions.

4.  **Handling Missing Values**:
    
    *   Address missing data (represented by **?**) in features such as **Workclass** and **Occupation** through appropriate imputation techniques, ensuring the model's robustness and reliability.


# Understanding the Dataset

---

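This dataset contains demographic and employment information about individuals, with the columns listed below. The CSV files loaded in the solution name several of them differently (for example `Education_Num`, `Martial_Status`, `Country`, and `Target` in place of `education.num`, `marital.status`, `native.country`, and `income`):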

1. **age**: The individual's age.

2. **workclass**: The type of employer or employment status (e.g., Private, Self-employed, Government).

3. **fnlwgt**: The final weight, representing the population size that the individual’s record is meant to reflect.

4. **education**: The highest level of education attained by the individual.

5. **education.num**: A numeric code corresponding to the individual’s education level.

6. **marital.status**: The individual's marital status.

7. **occupation**: The individual's job type or occupation.

8. **relationship**: The family relationship status of the individual (e.g., spouse, child, etc.).

9. **race**: The individual's race.

10. **sex**: The gender of the individual.

11. **capital.gain**: The income earned from capital gains (investment earnings).

12. **capital.loss**: The income lost from capital losses (investment losses).

13. **hours.per.week**: The average number of hours the individual works per week.

14. **native.country**: The individual's country of origin.

15. **income**: The target variable, indicating whether the individual’s income is greater than 50K (">50K") or less than or equal to 50K ("<=50K").


# Step-by-Step Guide

---

The steps below explain how to write the Python code that builds and evaluates a Random Forest model for income prediction.


### Step 1: Import Required Libraries
First, we need to import the necessary libraries for data processing, machine learning, and model evaluation. These libraries help with loading, preprocessing, training, and evaluating the model.

- **pandas**: For data manipulation, such as loading CSV files and handling missing values.
- **numpy**: For numerical operations, particularly for handling missing data.
- **scikit-learn**: For splitting the dataset, preprocessing features, training the Random Forest model, and evaluating its performance.

### Step 2: Load and Preprocess the Data
Before training a model, we need to load the dataset and perform preprocessing to handle missing values and categorical data. This involves:

- Loading the dataset using `pandas.read_csv()`.
- Grouping rare categories in the `Country` feature into a category labeled "Other".
- Splitting the dataset into features (X) and target (y). Here the target is the income column, named `Target` in the CSV, and the features are all other columns.
- Preprocessing the categorical and numerical data by:
  - Using `SimpleImputer` to replace missing values in both categorical and numerical features.
  - Applying one-hot encoding to categorical features to convert them into a numerical format suitable for the model.
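
The rare-category grouping described above can be illustrated on a toy column. This is a minimal sketch, not the notebook's data: the series `country` and the threshold of 2 are hypothetical stand-ins for the `Country` feature and the threshold of 50 used in the solution. It also shows that `value_counts()` skips missing values by default, so they stay missing and are left for the imputer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for the notebook's `Country` feature
country = pd.Series(["US", "US", "US", "Cuba", "Peru", np.nan, "US", "Cuba"])

threshold = 2  # toy value; the solution uses 50
counts = country.value_counts()            # NaN is excluded by default
rare = counts[counts < threshold].index    # categories seen fewer than `threshold` times

# Map every rare category to a single "Other" label; NaN is left untouched
grouped = country.replace(list(rare), "Other")
print(grouped.tolist())
```

Grouping before one-hot encoding keeps the number of encoded columns small, which is the main reason to do it at this stage.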

### Step 3: Train the Random Forest Model
After preprocessing the data, the next step is to train the machine learning model. This involves:

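- Creating a `RandomForestClassifier` model from `scikit-learn`. A Random Forest is an ensemble learning method that trains many decision trees on bootstrap samples of the data and combines their outputs; scikit-learn does this by averaging the trees' predicted class probabilities.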
- Using the `.fit()` method to train the model with the training data.
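
To make "combines their outputs" concrete, the sketch below fits a small forest on synthetic data from `make_classification` (an assumption for illustration, not the income dataset) and checks that averaging the per-tree class probabilities by hand reproduces `predict_proba` of the forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; sizes and seeds are arbitrary choices for the demo
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=11, random_state=0).fit(X_demo, y_demo)

# Collect each tree's class probabilities, then average across trees
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

same_proba = np.allclose(averaged, forest.predict_proba(X_demo))
print("hand-averaged probabilities match the forest:", same_proba)
```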

### Step 4: Evaluate the Model
Once the model is trained, we need to evaluate its performance on unseen test data. This involves:

- Predicting the income variable (`Target`) using the trained model on the test data.
- Generating a classification report to evaluate metrics like precision, recall, and F1-score.
- Computing the accuracy score, which measures the proportion of correct predictions.
- Displaying the confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives.


# Solution

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.impute import SimpleImputer


In [2]:
df = pd.read_csv('archive/adult_train.csv')

df.head(5)


Unnamed: 0,Age,Workclass,fnlwgt,Education,Education_Num,Martial_Status,Occupation,Relationship,Race,Sex,Capital_Gain,Capital_Loss,Hours_per_week,Country,Target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.isnull().sum()

Age                  0
Workclass         1836
fnlwgt               0
Education            0
Education_Num        0
Martial_Status       0
Occupation        1843
Relationship         0
Race                 0
Sex                  0
Capital_Gain         0
Capital_Loss         0
Hours_per_week       0
Country            583
Target               0
dtype: int64

In [6]:
# Group rare categories in 'Country' into 'Other'
threshold = 50
country_counts = df['Country'].value_counts()
rare_countries = country_counts[country_counts < threshold].index  # Countries with fewer than 50 occurrences
df['Country'] = df['Country'].replace(rare_countries, 'Other')  # Replace rare countries with 'Other'


In [8]:
X = df.drop(columns=['Target'])
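# Strip surrounding whitespace first so the labels map correctly with or without leading spaces
y = df['Target'].str.strip().map({'<=50K': 0, '>50K': 1})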


In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

numerical_columns = X.select_dtypes(exclude="object").columns  # Select numerical columns (numbers)

# Numerical preprocessing: fill missing values with the mean
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

categorical_columns = X.select_dtypes(include="object").columns  # Select categorical columns (strings)
categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),  # Replace missing categorical values with the most frequent category
        ("onehot", OneHotEncoder(handle_unknown="ignore"))  # Convert categorical values into one-hot encoded variables
    ])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_columns),
        ("cat", categorical_transformer, categorical_columns)
    ]
)
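# Caveat: this fits the imputers and the encoder on every row, including rows that are later held out for validation
X_processed = preprocessor.fit_transform(X)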


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
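
Because the preprocessor above is fitted before the split, the held-out rows contribute to the imputation statistics and the encoder's categories. One common alternative, sketched here under assumptions, is to split first and wrap preprocessing and model in a single `Pipeline` so that fitting only ever sees training rows. The toy frame, its column values, and the use of `stratify` are illustrative additions, not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column names only mimic the notebook's
rng = np.random.default_rng(0)
n = 200
workclass_values = np.array(["Private", "State-gov", np.nan], dtype=object)
toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n).astype(float),
    "Workclass": rng.choice(workclass_values, size=n),  # includes missing values
})
toy_y = (toy["Age"] > 40).astype(int)  # artificial target for the demo

toy_preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Age"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Workclass"]),
])
model = Pipeline([
    ("prep", toy_preprocessor),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Split the raw rows first, then fit preprocessing and model on the training part only
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
model.fit(X_tr, y_tr)
held_out_accuracy = model.score(X_te, y_te)
print(f"held-out accuracy on toy data: {held_out_accuracy:.2f}")
```

With mean and most-frequent imputation the effect of the leak is usually small, but the pipeline form removes the question entirely and can be passed directly to cross-validation utilities.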

In [13]:
RFC = RandomForestClassifier(random_state=42, n_estimators=37, max_depth=32)
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
print("\nclassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(f"{accuracy_score(y_test, y_pred):.2f}")
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
# Rows are the actual classes, columns are the predicted classes
print(pd.DataFrame(cm, index=["<=50K", ">50K"], columns=["<=50K", ">50K"]))


classification Report:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91      4942
           1       0.77      0.63      0.69      1571

    accuracy                           0.87      6513
   macro avg       0.83      0.79      0.80      6513
weighted avg       0.86      0.87      0.86      6513


Accuracy Score:
0.87

Confusion Matrix:
       <=50K  >50K
<=50K   4640   302
>50K     577   994


In [126]:
test_df = pd.read_csv("archive/adult_test.csv")
test_df

Unnamed: 0,Age,Workclass,fnlwgt,Education,Education_Num,Martial_Status,Occupation,Relationship,Race,Sex,Capital_Gain,Capital_Loss,Hours_per_week,Country,Target
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
16277,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [130]:
# Load the test dataset
test_df = pd.read_csv("archive/adult_test.csv")

# Fix target column: strip spaces and remove trailing period
test_df["Target"] = test_df["Target"].str.strip().str.replace(".", "", regex=False)
print(test_df["Target"])

0        <=50K
1        <=50K
2         >50K
3         >50K
4        <=50K
         ...  
16276    <=50K
16277    <=50K
16278    <=50K
16279    <=50K
16280     >50K
Name: Target, Length: 16281, dtype: object


In [133]:
# Encode target
y_test = test_df["Target"].map({"<=50K": 0, ">50K": 1})
y_test

0        0
1        0
2        1
3        1
4        0
        ..
16276    0
16277    0
16278    0
16279    0
16280    1
Name: Target, Length: 16281, dtype: int64

In [134]:
# Features
X_test = test_df.drop(columns=["Target"])

# Preprocess test features using the SAME preprocessor fitted on train data
X_test_processed = preprocessor.transform(X_test)

In [135]:
# Predict with your trained RandomForest model
y_predict = RFC.predict(X_test_processed)

In [137]:
# Check accuracy
acc = accuracy_score(y_test, y_predict)
print("Test Accuracy:", acc)

Test Accuracy: 0.8539401756648854
