# Project: Adult Dataset

- **Project Name:** Adult Classification Project
- **Project Type:** Binary Classification
- **Author:** Dr. Saad Laouadi

### Project Overview:
This project uses the well-known **Adult Dataset**, also known as the **Census Income Dataset**, for a **binary classification** problem. The objective is to predict whether a person earns more than $50,000 a year from demographic and employment-related features.

The primary focus of this notebook is **data preprocessing**, which includes handling missing values, encoding categorical variables, and feature scaling to prepare the data for machine learning algorithms.

### Dataset Details:
- **Source**: The Adult Dataset is derived from the 1994 U.S. Census database.
- **Classes**: Two income classes, `<=50K` and `>50K`.
- **Number of Samples**: 48,842 (training and test splits combined)
- **Number of Features**: 14 (age, education, occupation, race, and others), plus the income target

### Key Features:
- **Preprocessing Tasks**:
  - Handle missing or incomplete data
  - Encode categorical variables
  - Scale features (e.g., standardization, normalization)

### Objectives:
1. **Preprocess the dataset**:
   - Handle missing values
   - Convert categorical data into numeric form using encoding techniques
   - Scale or normalize numeric features so they are on comparable ranges for later machine learning models
2. **Prepare the dataset** for modeling and evaluation in the next notebook.
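
The three preprocessing objectives above can be sketched on a toy table. This is a minimal illustration rather than the notebook's actual pipeline: the rows are made up, the column subset is arbitrary, and mode imputation, one-hot encoding, and standardization are only one reasonable choice for each step. The one dataset-specific fact it relies on is that the raw Adult files mark missing values with `?`.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics a few Adult columns (not real census rows).
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "hours-per-week": [40, 13, 40, 40],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],
})
cat_cols = ["workclass"]
num_cols = ["age", "hours-per-week"]

# 1. Missing values: turn "?" into NaN, then impute with the most frequent value.
df = df.replace("?", np.nan)
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# 2. Encoding: one-hot encode categorical features and map the target to 0/1.
X = pd.get_dummies(df[cat_cols + num_cols], columns=cat_cols, dtype=int)
y = (df["income"] == ">50K").astype(int)

# 3. Scaling: standardize each numeric column to zero mean and unit variance.
for col in num_cols:
    X[col] = (X[col] - X[col].mean()) / X[col].std(ddof=0)

print(X.columns.tolist())
```

In a real workflow, the imputation values and scaling statistics would be computed on the training split only and then reused on the test split to avoid leakage.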

---

**Copyright © Dr. Saad Laouadi**  
**All Rights Reserved** 🛡️

In [None]:
# Import necessary modules
import os
import re
import requests
import json 
from io import StringIO
import warnings

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns

# Configuration Variables
PRINT_INFO = True

with open('config.json', 'r') as file:
    config = json.load(file)

train_data_url = config['TRAIN_DATA']
test_data_url = config['TEST_DATA']
info_data_url = config['INFO_DATA']

processed_train_data = config['PROCESSED_TRAIN_DATA']
model_save_path = config['MODEL_SAVE_PATH']
metrics_save_path = config['METRICS_SAVE_PATH']
random_seed = config['RANDOM_SEED']

if PRINT_INFO:
    print("Train Data URL:", train_data_url)
    print("Test Data URL:", test_data_url)
    print("Info Data URL:", info_data_url)
    
    print("Processed Train Data Path:", processed_train_data)
    print("Model Save Path:", model_save_path)
    print("Metrics Save Path:", metrics_save_path)
    print("Random Seed:", random_seed)


%load_ext autoreload
%autoreload 2

from utils import *

%load_ext watermark
%watermark -iv -v  
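
The setup cell above reads seven keys from `config.json` and fails with a bare `KeyError` on the first one that is missing. As a hedged sketch, the loading step could be wrapped in a small helper that reports every missing key at once. The name `load_config` and the `REQUIRED_KEYS` list are illustrative and not part of the original notebook or its `utils` module.

```python
import json

# Keys read by the setup cell; the list itself is an illustrative assumption.
REQUIRED_KEYS = [
    "TRAIN_DATA", "TEST_DATA", "INFO_DATA",
    "PROCESSED_TRAIN_DATA", "MODEL_SAVE_PATH",
    "METRICS_SAVE_PATH", "RANDOM_SEED",
]

def load_config(path, required_keys=REQUIRED_KEYS):
    """Load a JSON config file and fail early if any expected key is missing."""
    with open(path, "r", encoding="utf-8") as file:
        config = json.load(file)
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config
```

With this helper, `config = load_config('config.json')` would replace the `with open(...)` block, and the rest of the cell would stay the same.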

### Check the Data Information

In [None]:
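# Fetch the content of the data info file from info_data_url
response = requests.get(info_data_url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors

# Print the content of the text file
print(response.text)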

In [None]:
# Optionally download the dataset description file for local inspection
# !wget https://raw.githubusercontent.com/qcversity/ml-datasets/refs/heads/main/data/adult.info.txt

### Extracting Feature Names



In [None]:
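# Inspect the info file in a text editor (locally or online) to find where
# the attribute list begins; the values below come from that inspection.

START_LINE = 94
END_LINE = None
PATTERN = r"([a-zA-Z0-9\-]+):"  # captures the name before the colon, e.g. "age" in "age: continuous."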

# extract_feature_names is a user-defined helper from the utils module
feature_names = extract_feature_names(info_data_url,
                                      start_line=START_LINE,
                                      end_line=END_LINE,
                                      pattern=PATTERN
                                     )

print("*"*72)
print(f"The extracted feature names:\n{feature_names}")
print(f"The number of features: {len(feature_names)}")
print("*"*72)

# Add the target to the list of feature names
target = ['income']
col_names = feature_names + target
print(f"The column names are:\n{col_names}")
print("*"*72)
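
The source of `extract_feature_names` is not shown in this notebook, so the following is only a sketch of how such a helper could work. It operates on already-downloaded text instead of a URL, and the function name `extract_names_from_text`, the 0-based `start_line`, and the sample text are all assumptions made for illustration. The idea is to slice the lines of interest and keep the first regex group from every line that begins with a `name:` prefix.

```python
import re

def extract_names_from_text(text, start_line=0, end_line=None,
                            pattern=r"([a-zA-Z0-9\-]+):"):
    """Collect the first regex group from each line in lines[start_line:end_line]."""
    names = []
    for line in text.splitlines()[start_line:end_line]:
        # re.match anchors at the start of the line, so only a leading
        # "name:" prefix counts; colons later in the line are ignored.
        match = re.match(pattern, line.strip())
        if match:
            names.append(match.group(1))
    return names

# Made-up excerpt in the style of the attribute list in the info file.
sample = """Some header text
age: continuous.
workclass: Private, Self-emp-not-inc, Federal-gov.
education-num: continuous.
"""
print(extract_names_from_text(sample, start_line=1))
# → ['age', 'workclass', 'education-num']
```

Passing `end_line=None` makes the slice run to the end of the text, which mirrors how `END_LINE = None` is used in the cell above if the real helper follows the same convention.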