# Introduction

## Project Description
The primary objective of this project is to apply key **Machine Learning** concepts: data cleaning, feature engineering, model selection, hyperparameter tuning, and performance evaluation. By exploring various models and evaluation metrics, we expect to gain a deeper understanding of how to build and optimize predictive models.

The aim is to perform a classification task using the well-known **Adult Income Dataset**. The target variable is whether a person earns more or less than 50K USD per year.

## Dataset Description
The [Adult Income Dataset](https://archive.ics.uci.edu/dataset/2/adult) was originally extracted from the 1994 US Census Database and is commonly used for binary classification tasks in ML. It contains demographic and economic features that may influence a person’s income level.

It consists of 6 numerical and 9 categorical columns (counting the target), grouped as follows:
* Demographic Information: `age`, `sex`, `race`, `marital-status`, `native-country`, `relationship`.
* Education & Occupation: `education`, `education-num`, `workclass`, `occupation`.
* Financial Attributes: `capital-gain`, `capital-loss`, `hours-per-week`, `fnlwgt`.
* The Target Variable: `class` (<=50K or >50K)
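
As a quick sanity check, the columns listed above can be written down as plain Python lists and counted. The split into `numerical` and `categorical` below is our reading of the list (with the target `class` counted among the categorical columns), not something read from the dataset files.

```python
# Column names as listed above; the numerical/categorical split is an assumption
numerical = [
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
]
categorical = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country", "class",
]

print(len(numerical), "numerical,", len(categorical), "categorical")
```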

## Project Overview
1. **Getting Started**: Import the libraries and access the data.
2. **Exploratory Data Analysis**: Explore the dataset, visualize feature distributions, identify correlations.
3. **Data Preprocessing**: Handle missing values, encode categorical text, scale numerical data, split into train and test sets.
4. **Model Training**: Train different classification models and perform hyperparameter tuning.
5. **Model Evaluation**: Evaluate model performance using various metrics.

# Getting Started

## Import necessary packages
You can download all of the necessary packages using: `pip install pandas numpy seaborn matplotlib scikit-learn`

In [None]:
# File and data manipulation tools
import os
import tarfile
import urllib.request

import numpy as np
import pandas as pd

# Data visualization tools
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing tools
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Data splitting and hyperparameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Predictors
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

## Data Access
The dataset is hosted on our GitHub repository. There is also a description from the original authors in `adult.names`.

In [None]:
DOWNLOAD_URL = "https://github.com/fatimaezzahra-creator/Projet-ML/raw/refs/heads/main/datasets/adult.tgz"
DATASET_PATH = "datasets"

def fetch_data(data_url, data_path):
    # Create the target directory if needed, then download and unpack the archive
    os.makedirs(data_path, exist_ok=True)
    tgz_path = os.path.join(data_path, "adult.tgz")
    urllib.request.urlretrieve(data_url, tgz_path)
    with tarfile.open(tgz_path) as tgz_file:
        tgz_file.extractall(path=data_path)

def load_data():
    csv_path = os.path.join(DATASET_PATH, "adult.data")
    return pd.read_csv(csv_path, skipinitialspace=True)

fetch_data(DOWNLOAD_URL, DATASET_PATH)
data = load_data()

# Exploratory Data Analysis

## Analysis of Form and Content

In [None]:
data.info()

In [None]:
# Visualize Missing Data
sns.heatmap(data.isna(), cbar=False)

The heatmap is uniformly dark, which means the data contains no `NaN` values.
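
Note that `isna()` only detects real `NaN` values. The UCI documentation for the Adult dataset states that unknown values were converted to `?`, so if the hosted copy follows that convention, missing entries would show up as ordinary strings. The sketch below counts such placeholders per column; the helper name `count_placeholders` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def count_placeholders(frame, placeholder="?"):
    """Count, per non-numeric column, the cells equal to the placeholder string."""
    text_columns = frame.select_dtypes(exclude="number")
    return text_columns.eq(placeholder).sum()

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "?", "Sales"],
    "age": [39, 50, 38],
})

placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

On the notebook's frame, the same check would be `count_placeholders(data)`. Another option is to pass `na_values="?"` to `pd.read_csv`, which turns the placeholders into `NaN` at load time.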

In [None]:
# Distribution of the Target class
target_name = "class"
data[target_name].value_counts(normalize=True)

In [None]:
# Distribution of Numerical Features
for col in data.select_dtypes("int64"):
    sns.displot(data[col])

In [None]:
# Distribution of Categorical Attributes
for col in data.select_dtypes("object"):
    sns.displot(data=data, x=col)
    plt.title(f"Distribution of '{col}'", fontsize=16)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Number of observations", fontsize=12)
    plt.xticks(rotation=90)
    plt.show()

## Identifying Correlations

In [None]:
# Create a copy of the data
df = data.copy()

# Creating subsets based on the target variable
class_0 = df[df[target_name] == "<=50K"]
class_1 = df[df[target_name] == ">50K"]
combined_df = (pd.concat([class_0, class_1]))

def visualize_correlation(dataframe, feature_name, target_name, rotate=False):
    sns.histplot(
        data=dataframe,
        x=feature_name,
        hue=target_name,
        stat="density",
        common_norm=False,
        palette="muted")
    if rotate:
        plt.xticks(rotation=90)
    plt.show()

def visualize_correspondance(dataframe, feature1_name, feature2_name):
    # For each value of feature1, collect the distinct values of feature2 that occur with it
    mapping = dataframe.groupby(feature1_name)[feature2_name].unique()
    # Show the mapping to verify the correspondence
    for feature1, feature2 in mapping.items():
        print(f"{feature1_name}: {feature1}, {feature2_name}: {feature2}")

In [None]:
# Correlation between target and the 'age' feature
visualize_correlation(combined_df, "age", target_name)

In [None]:
# Correlation between target and the 'sex' feature
visualize_correlation(combined_df, "sex", target_name)

In [None]:
# Correlation between target and the 'workclass' feature
visualize_correlation(combined_df, "workclass", target_name, rotate=True)

In [None]:
# Correlation between target and the 'fnlwgt' feature
visualize_correlation(combined_df, "fnlwgt", target_name)

In [None]:
# Correlation between target and the 'occupation' feature
visualize_correlation(combined_df, "occupation", target_name, rotate=True)

In [None]:
# Correlation between target and the 'hours-per-week' feature
visualize_correlation(combined_df, "hours-per-week", target_name)

In [None]:
# Correlation between target and the 'education-num' feature
visualize_correlation(combined_df, "education-num", target_name)

In [None]:
# Correlation between target and the 'education' feature
visualize_correlation(combined_df, "education", target_name, rotate=True)

The features `education` (categorical) and `education-num` (numerical) may convey similar information, as they both represent the education level of an individual.

In [None]:
# Correspondence between 'education' and 'education-num'
visualize_correspondance(data, "education", "education-num")

The output shows that each `education` category corresponds to exactly one value of `education-num`, and that the values are ordinal: 1 is the lowest level of education (Preschool) and 16 the highest (Doctorate). This confirms that the two features encode the same information.
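
Rather than reading the printed mapping by eye, the one-to-one relationship can also be checked programmatically. The helper `is_one_to_one` and the toy frame below are illustrative assumptions; on the notebook's data the call would be `is_one_to_one(data, "education", "education-num")`.

```python
import pandas as pd

def is_one_to_one(frame, left, right):
    """True if each value of `left` maps to exactly one value of `right`, and vice versa."""
    left_to_right = frame.groupby(left)[right].nunique().eq(1).all()
    right_to_left = frame.groupby(right)[left].nunique().eq(1).all()
    return bool(left_to_right and right_to_left)

# Toy frame with a few education levels (hypothetical rows)
toy = pd.DataFrame({
    "education": ["Preschool", "Bachelors", "Doctorate", "Bachelors"],
    "education-num": [1, 13, 16, 13],
})

education_is_redundant = is_one_to_one(toy, "education", "education-num")
print(education_is_redundant)
```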

In [None]:
# Correlation between target and the 'marital-status' feature
visualize_correlation(combined_df, "marital-status", target_name, rotate=True)

In [None]:
# Correlation between target and the 'relationship' feature
visualize_correlation(combined_df, "relationship", target_name)

The features `relationship` and `marital-status` might also convey similar information because a person's relationship type often depends on their marital status.

In [None]:
# Correspondence between 'marital-status' and 'relationship'
visualize_correspondance(data, "marital-status", "relationship")

The output reveals that for each value of marital-status, there are multiple possible values for relationship.
This variability indicates that a person's relationship cannot be uniquely determined based on their marital-status.
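
A contingency table is another way to look at this relationship, since it also shows how often each pair occurs. The sketch below uses `pd.crosstab` on a small toy frame with hypothetical rows; with the notebook's data, the two arguments would be `data["marital-status"]` and `data["relationship"]`.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical rows)
toy = pd.DataFrame({
    "marital-status": ["Married-civ-spouse", "Married-civ-spouse",
                       "Never-married", "Never-married", "Divorced"],
    "relationship": ["Husband", "Wife", "Own-child", "Not-in-family", "Unmarried"],
})

# Rows: marital-status, columns: relationship, cells: number of co-occurrences
table = pd.crosstab(toy["marital-status"], toy["relationship"])

# Number of distinct relationship values observed for each marital status
relationships_per_status = (table > 0).sum(axis=1)

print(table)
print(relationships_per_status)
```

Any marital status with a count above 1 maps to several relationship values, which is the situation described above.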

In [None]:
# Correlation between target and the 'race' feature
visualize_correlation(combined_df, "race", target_name)

In [None]:
# Correlation between target and the 'native-country' feature
visualize_correlation(combined_df, "native-country", target_name, rotate=True)

The `native-country` column contains many unique values, and both `native-country` and `race` include categories with very low frequencies. Keeping all these rare categories can hurt the model in two ways:

* Overfitting: The model may place undue importance on rare categories, learning patterns that don't generalize well to new data.
* Increased complexity: High cardinality increases the dimensionality during encoding (e.g., in one-hot encoding), which can slow down training and complicate the model unnecessarily.

In [None]:
# Correlation between target and the 'capital-gain' feature
visualize_correlation(combined_df, "capital-gain", target_name)

In [None]:
# Correlation between target and the 'capital-loss' feature
visualize_correlation(combined_df, "capital-loss", target_name)

Combining `capital-gain` and `capital-loss` into a single derived feature could correlate more strongly with the target (`class`) than either feature alone, potentially improving the model's predictive power.

In [None]:
# Capital Features Combination
dff = df.copy()
dff["capital-net"] = dff["capital-gain"] - dff["capital-loss"]
# The small constant avoids division by zero when capital-loss is 0
dff["ratio"] = dff["capital-gain"] / (dff["capital-loss"] + 1e-8)
# The weights are each feature's correlation with the target
dff["capital-weighted"] = dff["capital-gain"] * 0.223329 + dff["capital-loss"] * 0.150526

# Label-encode every categorical column
label_encoders = {}
for column in dff.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    dff[column] = encoder.fit_transform(dff[column])
    label_encoders[column] = encoder  # Keep each column's encoder (optional, useful for inverse_transform)

# Compute the correlation matrix
corr_matrix = dff.corr()
corr_matrix[target_name].sort_values(ascending=False)

**capital-net: the net difference between gains and losses**

Correlation: 0.214, lower than capital-gain. Relevance: While intuitive (netting gains and losses), this feature does not add much value compared to capital-gain alone. Consider dropping it unless it improves model performance.

**ratio: relative proportion of gains to losses**

Correlation: 0.223, identical to capital-gain. Relevance: This feature does not improve upon capital-gain’s correlation. Its usefulness might depend on the model's capacity to interpret non-linear relationships, but it seems redundant for linear models.

**capital-weighted: weighted sum of the two based on their importance**

Correlation: 0.229, slightly higher than `capital-gain` (0.223). This feature combines the effects of both gains and losses, weighted by their individual correlations with `class`. The improvement is small, but it suggests the feature may capture some additional information, so we keep it for modeling.
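
The weights used for `capital-weighted` are hard-coded correlation values. A minimal sketch of how such weights could be computed from the data instead is shown below. The helper `correlation_weights`, the `class_binary` column, and the toy values are assumptions made for illustration; in practice one would probably compute the weights on the training portion only, so that no information from the test set enters the feature.

```python
import pandas as pd

def correlation_weights(frame, features, target):
    """Pearson correlation of each feature with a 0/1 target column."""
    return frame[features].corrwith(frame[target])

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "capital-gain": [0, 0, 5000, 7000, 0, 15000],
    "capital-loss": [0, 1500, 0, 0, 0, 2000],
    "class": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
toy["class_binary"] = (toy["class"] == ">50K").astype(int)

capital_features = ["capital-gain", "capital-loss"]
weights = correlation_weights(toy, capital_features, "class_binary")

# Weighted sum of the two features; the Series index aligns with the column names
toy["capital-weighted"] = (toy[capital_features] * weights).sum(axis=1)

print(weights)
```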

**Synthesis**

1. Retain `education-num` (numerical feature) and remove `education` to avoid redundancy and simplify the dataset.
2. Retain both features `relationship` and `marital-status` as they capture different aspects of an individual's social situation.
3. Focus on the most impactful features: `education-num`, `age`, `hours-per-week`, `capital-weighted`, and categorical variables such as `relationship` and `marital-status`.
4. The feature `fnlwgt` adds minimal value to the predictive power of the model.
5. To reduce noise, we can group all rare categories (those with a frequency below a certain threshold, e.g., 500 occurrences) into a single category called `"Other"`.

# Data Preprocessing

## Data Cleaning

**Removing Low-Impact Features** 

In [None]:
# Combine capital features in capital_weighted
df["capital-weighted"] = df["capital-gain"] * 0.223329 + df["capital-loss"] * 0.150526

# Delete the redundant or irrelevant columns
df.drop(["education", "fnlwgt", "capital-gain", "capital-loss"], axis=1, inplace=True)


**Grouping Rare Categories into 'Other' to Simplify Data**

In [None]:
# Count the frequency of each category
native_country_counts = df['native-country'].value_counts()
race_counts = df['race'].value_counts()

# Identify the categories to keep (at least 500 occurrences)
to_keep = native_country_counts[native_country_counts >= 500].index
to_keep_race = race_counts[race_counts >= 500].index

# Replace rare categories with "Other"
df['native-country'] = df['native-country'].where(df['native-country'].isin(to_keep), 'Other')
df['race'] = df['race'].where(df['race'].isin(to_keep_race), 'Other')

# Verify the updated frequencies
print(df['native-country'].value_counts())
print(df['race'].value_counts())

## Encoding and Scaling

In [None]:
# One-hot encode the categorical features and standardize the numerical ones
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
categorical_features.remove(target_name)

preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ("numerical", StandardScaler(), numerical_features)
])

## Train/Test Split

We saw earlier that the target class is not balanced, so we create the train and test sets with `StratifiedShuffleSplit`, which shuffles the instances while preserving the class proportions of the original dataset.

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_indexes, test_indexes in split.split(df, df[target_name]):
    train_set = df.iloc[train_indexes]
    test_set = df.iloc[test_indexes]

print("Proportions in the original dataset:", df[target_name].value_counts(normalize=True))
print("Proportions in the train set:", train_set[target_name].value_counts(normalize=True))
print("Proportions in the test set:", test_set[target_name].value_counts(normalize=True))
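
For a single split, `train_test_split` with its `stratify` argument is a more compact alternative that should give an equivalent result. The sketch below uses a toy frame with hypothetical values to show that the class proportions are preserved; it is not the code used in the rest of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced frame (hypothetical values): 75% "<=50K", 25% ">50K"
toy = pd.DataFrame({
    "age": range(20, 60),
    "class": ["<=50K"] * 30 + [">50K"] * 10,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["class"], random_state=42
)

train_proportions = toy_train["class"].value_counts(normalize=True)
test_proportions = toy_test["class"].value_counts(normalize=True)
print(train_proportions)
print(test_proportions)
```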

Finally, we can separate the target from the features.

In [None]:
train_data = train_set.drop(target_name, axis=1)
target = train_set[target_name].copy()

# Model Selection
In this section, we train five different models, then evaluate and compare them to find the best one:
* `LogisticRegression`
* `SGDClassifier`
* `RandomForestClassifier`
* `GradientBoostingClassifier`
* `KNeighborsClassifier`

First let's build each pipeline, with our previously defined preprocessor.

In [None]:
LR_clf = LogisticRegression(solver='liblinear', random_state=42)
SGD_clf = SGDClassifier(loss='hinge', random_state=42)
RF_clf = RandomForestClassifier(random_state=42)
GB_clf = GradientBoostingClassifier(random_state=42)
KNN_clf = KNeighborsClassifier()

LR_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LR_clf)
])
SGD_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGD_clf)
])
RF_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RF_clf)
])
GB_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GB_clf)
])
KNN_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', KNN_clf)
])

models = {
    'Logistic': LR_pipeline,
    'SGD': SGD_pipeline,
    'RandomForest': RF_pipeline,
    'GradientBoosting': GB_pipeline,
    'KNeighbours': KNN_pipeline
}

Now we can define the parameter grid for each model.

In [None]:
LR_param_grid = {
    'classifier__penalty': ['l1','l2'],
    'classifier__C': [0.01, 0.1, 1],
    'classifier__max_iter' : [100, 500, 1000]
}

SGD_param_grid = {
    'classifier__learning_rate': ['constant', 'invscaling'],
    'classifier__eta0': [0.01, 0.1, 1],
    'classifier__penalty': ['l2', 'l1'],
    'classifier__alpha': [0.001, 0.01, 0.1],
    'classifier__max_iter': [100, 500, 1000]
}

RF_param_grid = {
    'classifier__n_estimators': [10, 50, 100],
    'classifier__max_depth': [10, 50, 100]
}

GB_param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5, 6],
    'classifier__subsample': [0.7, 0.8, 0.9, 1.0]
}

KNN_param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11, 15],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'classifier__leaf_size': [10, 20, 30, 40, 50],
    'classifier__p': [1, 2]
}

# Keys must match the keys of the `models` dictionary
param_grid = {
    'Logistic': LR_param_grid,
    'SGD': SGD_param_grid,
    'RandomForest': RF_param_grid,
    'GradientBoosting': GB_param_grid,
    'KNeighbours': KNN_param_grid
}

Finally, we run a `GridSearchCV` for each model. Since the parameter grids are already fairly extensive, we use only 3 folds.
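
To get a feel for how extensive the grids are, the number of candidate combinations can be counted with `ParameterGrid` before launching the search. The sketch below repeats two of the grids defined above so that it runs on its own; the names `example_grids` and `fit_counts` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copies of two grids defined above, repeated so that this snippet is self-contained
example_grids = {
    "GradientBoosting": {
        "classifier__n_estimators": [100, 200, 300],
        "classifier__learning_rate": [0.01, 0.05, 0.1, 0.2],
        "classifier__max_depth": [3, 4, 5, 6],
        "classifier__subsample": [0.7, 0.8, 0.9, 1.0],
    },
    "KNeighbours": {
        "classifier__n_neighbors": [3, 5, 7, 9, 11, 15],
        "classifier__weights": ["uniform", "distance"],
        "classifier__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "classifier__leaf_size": [10, 20, 30, 40, 50],
        "classifier__p": [1, 2],
    },
}

n_folds = 3
fit_counts = {}
for name, grid in example_grids.items():
    n_candidates = len(ParameterGrid(grid))
    # Each candidate is fitted once per cross-validation fold
    fit_counts[name] = n_candidates * n_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} cross-validation fits")
```

With 3 folds, the gradient boosting grid alone requires 576 fits and the k-nearest-neighbours grid 1440, which is why adding more folds would quickly become expensive.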

In [None]:
for name, model in models.items():
    print(f"For the {name} model:")
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid[name],
        cv=3,                   # 3-fold cross-validation
        scoring='accuracy',     # Evaluation metric
        n_jobs=-1,              # Use all available cores
    )
    grid_search.fit(train_data, target)
    print("    Best Hyperparams:", grid_search.best_params_)
    print("    Best Accuracy:", grid_search.best_score_)
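
The scores printed above are cross-validation accuracies measured on the training set. A possible next step is to score the best estimator on held-out data. The sketch below does this on synthetic data from `make_classification` rather than on the notebook's frames, with a deliberately small grid, so the dataset and numbers are assumptions and not results for the Adult data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the preprocessed data
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", random_state=42),
    param_grid={"C": [0.01, 0.1, 1]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set (refit=True is the default)
test_predictions = search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
test_f1 = f1_score(y_test, test_predictions)

print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
print("Test F1 (positive class):", test_f1)
```

In the notebook, the held-out features would come from `test_set.drop(target_name, axis=1)` and the labels from `test_set[target_name]`. Because those labels are strings, `f1_score` would need `pos_label=">50K"`. On an imbalanced target, F1 is a useful complement to accuracy.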