
## Lab Exercise Week #1: Mastering Data Preprocessing in Machine Learning

### Objective:
The goal of this lab exercise is to explore advanced data preprocessing techniques used in machine learning. Preprocessing is an important step: without it you fall into the "garbage in, garbage out" trap, meaning that if you feed a machine learning algorithm bad data, you can expect bad performance. You will work with real-world data, handling missing values, encoding categorical variables, scaling features, and more.

### Dataset:
For this exercise, we will use the "<a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data">Adult Income</a>" dataset from the UCI Machine Learning Repository. The dataset contains various features describing individuals, and the task is to predict whether a person earns more than $50,000 per year.

### Tasks:

#### 1). Data Loading and Initial Exploration:
<ul>
    <li>Load the "Adult Income" dataset. You can either download it and read it locally, or read it directly from the URL provided above.</li>
    <li>Explore the dataset to understand its structure and features. Check the feature structure from the <a href="https://archive.ics.uci.edu/dataset/2/adult">UCI Repository</a>.</li>
    <li>Identify the target variable and its distribution.</li>
</ul>




In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Task 1: Data Loading and Initial Exploration
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
                'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

# The separator is a regular expression, so use a raw string and the Python parsing engine
data = pd.read_csv(url, names=column_names, sep=r',\s', engine='python', na_values=["?"])

# Explore the dataset
# data.info() prints its summary directly and returns None, so call it on its own
print("Dataset Info:")
data.info()
print("\nTarget Variable Distribution:\n", data['income'].value_counts())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Target Variable Distribution:
 income
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [2]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


#### 2). Handling Missing Data:
<ul>
    <li>Identify missing values in the dataset.</li>
    <li>Implement a strategy to handle missing data (e.g., imputation or removal). Impute missing values with the median for numerical columns and the most frequent value for categorical columns.</li>
    <li>Justify the chosen strategy.</li>
</ul>
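The first bullet (identifying missing values) can be sketched on a small frame before touching the real data. This is a minimal illustration, assuming a made-up `toy` DataFrame with one numerical and one categorical column; the values are invented and only stand in for the Adult data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Adult data (values are made up)
toy = pd.DataFrame({
    "age": [39.0, np.nan, 38.0, 53.0],
    "workclass": ["Private", "Private", np.nan, "State-gov"],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Median for the numerical column, most frequent value for the categorical one
toy_imputed = toy.copy()
toy_imputed["age"] = toy["age"].fillna(toy["age"].median())
toy_imputed["workclass"] = toy["workclass"].fillna(toy["workclass"].mode()[0])
print(toy_imputed)
```

The median is less sensitive to extreme values than the mean, which is one common argument for using it on skewed numerical columns.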

In [19]:
# Task 2: Handling Missing Data
# Impute missing values with the median for numerical columns and the most frequent value for categorical columns
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
categorical_features = data.select_dtypes(include=['object']).columns
# Exclude the income column: it is the target variable, not an input feature
categorical_features = categorical_features[~categorical_features.isin(['income'])]
numerical_transformer = SimpleImputer(strategy='median')
categorical_transformer = SimpleImputer(strategy='most_frequent')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [20]:
numerical_features

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'capital-diff'],
      dtype='object')

In [21]:
categorical_features

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country'],
      dtype='object')

#### 3). Encoding Categorical Variables:
<ul>
    <li>Identify categorical variables in the dataset.</li>
    <li>Apply appropriate encoding techniques (e.g., one-hot encoding or label encoding). Use one-hot encoding for categorical variables.</li>
    <li>Discuss the impact of encoding choices on model performance.</li>
</ul>
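To see what one-hot encoding produces, and what happens to a category that was not present when the encoder was fitted, here is a small sketch. The arrays `train` and `new` are hypothetical and contain only a few `workclass`-like values.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, fitted on two distinct categories
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
# New data containing one known and one previously unseen category
new = np.array([["State-gov"], ["Never-worked"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The transform returns a sparse matrix; convert it to a dense array for display
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)  # [[0. 1.] [0. 0.]]: the unseen category becomes an all-zero row
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category.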

In [22]:
# Task 3: Encoding Categorical Variables
# Use one-hot encoding for categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        # handle_unknown='ignore': a category not seen during fitting is encoded as all zeros instead of raising an error
        # One-hot encoding converts each categorical feature into binary numerical features
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

#### 4). Feature Scaling:
<ul>
    <li>Identify features that require scaling.</li>
    <li>Apply feature scaling using techniques such as Min-Max scaling or Standardization.</li>
    <li>Discuss the importance of feature scaling in different machine learning algorithms.</li>
</ul>

In [24]:
# Task 4: Feature Scaling
# Apply Standardization to numerical features
preprocessor = ColumnTransformer(
    transformers=[
        # StandardScaler rescales each feature to zero mean and unit variance
        # (values are not confined to [0, 1]; MinMaxScaler does that)
        # Example: if age ranges from 10 to 18 and payment from 10K to 15K, payment dominates the model
        # simply because its numbers are larger, so standardize so that each feature contributes comparably
        # Standardization preserves the shape of the distribution and keeps negative values
        # Scaling is needed for: linear regression, logistic regression, SVM, k-NN, PCA, neural networks
        # Scaling is not needed for: decision trees, random forests, gradient boosting (XGBoost, LightGBM)
        # Never fit the scaler on test data: fit it on the training set, then transform both the
        # training and test sets with the same scaler; otherwise you get data leakage
        # Available scalers: StandardScaler, MinMaxScaler, RobustScaler
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
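The rule about fitting the scaler only on training data can be shown in isolation. The sketch below assumes two tiny made-up arrays, `X_train_toy` and `X_test_toy`, with an age-like and a payment-like column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 is age-like, column 1 is payment-like
X_train_toy = np.array([[20.0, 10_000.0], [30.0, 12_000.0], [40.0, 14_000.0]])
X_test_toy = np.array([[50.0, 16_000.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train_scaled = scaler.fit_transform(X_train_toy)
# Reuse the training statistics for the test set (no refitting)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates purely because of its units. When the scaler sits inside a `Pipeline`, calling `fit` on the training data applies this rule automatically.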

#### 5). Feature Engineering:
<ul>
    <li>Create new meaningful features based on existing ones.</li>
    <li>Discuss how feature engineering can enhance model performance.</li>
</ul>

In [8]:
# Task 5: Feature Engineering
# Create a new feature "capital-diff" as the difference between capital-gain and capital-loss
data['capital-diff'] = data['capital-gain'] - data['capital-loss']

#### 6). Outlier Detection and Handling:
<ul>
    <li>Identify potential outliers in numerical features.</li>
    <li>Implement a strategy to handle outliers (e.g., removal or transformation).</li>
    <li>Discuss the impact of outliers on model training.</li>
</ul>

In [9]:
# Task 6: Outlier Detection and Handling
# Identify and handle outliers using a suitable method (e.g., Z-score or IQR)
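One possible way to fill in this task is the IQR rule: flag values lying more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below is an illustration, not the required solution; it uses a hypothetical `hours` Series with made-up values, and it clips the outliers rather than removing them.

```python
import pandas as pd

# Hypothetical sample of an hours-per-week style column (values are made up)
hours = pd.Series([40, 38, 45, 40, 42, 99, 40, 1], name="hours-per-week")

# Compute the interquartile range and the usual 1.5 * IQR fences
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (hours < lower) | (hours > upper)
print(hours[is_outlier])

# One handling strategy: clip extreme values to the fences instead of dropping rows
clipped = hours.clip(lower=lower, upper=upper)
print(clipped)
```

Clipping keeps every row, which matters when the dataset is small or when the outlying rows carry useful information in other columns. Removing the rows is the alternative when the values are clearly recording errors.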

#### 7). Normalization and Transformation:
<ul>
    <li>Apply normalization to achieve a normal distribution in numerical features.</li>
    <li>Discuss the benefits of normalization in specific machine learning algorithms.</li>
</ul>

In [10]:
# Task 7: Normalization and Transformation (you can use StandardScaler, MinMaxScaler or RobustScaler)
# Apply normalization to achieve a normal distribution in numerical features
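A point worth checking when working on this task: rescaling a feature (standardization or min-max scaling) changes its location and spread but not the shape of its distribution, so a skewed feature stays skewed. A nonlinear transform is needed to reduce skew. The sketch below compares the two on a hypothetical `gain` Series with made-up, heavily right-skewed values; the log transform is one assumed choice here, and scikit-learn's `PowerTransformer` is another option.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature, similar in spirit to capital-gain (values are made up)
gain = pd.Series([0, 0, 0, 0, 2174, 5178, 14084, 99999], dtype=float)

# Standardization: subtract the mean and divide by the standard deviation
standardized = (gain - gain.mean()) / gain.std()

# Log transform: log(1 + x) handles zeros and compresses large values
logged = np.log1p(gain)

print("skew of raw values:         ", gain.skew())
print("skew after standardization: ", standardized.skew())
print("skew after log1p:           ", logged.skew())
```

The first two skewness values are equal, while the log-transformed feature is much less skewed.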


#### 8). Data Splitting:
<ul>
    <li>Split the dataset into training and testing sets.</li>
    <li>Justify the chosen ratio and the importance of a proper train-test split.</li>
</ul>

In [None]:
# Task 8: Data Splitting
X = data.drop('income',axis=1)
y = data['income']
print("input columns", X.columns)
print("output columns", y.name)

# Split the data: 80% for training and 20% for testing
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("shapes of training and testing")

print('X_train shape:', X_train.shape)
# Target values for the training set, used to fit the model
print('y_train shape:', y_train.shape)
# Feature values for the test set
print('X_test shape:', X_test.shape)
# Target values for the test set, used to check the accuracy of the classifier
print('y_test shape:', y_test.shape)

input columns Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'capital-diff'],
      dtype='object')
output columns income
shapes of training and testing
X_train shape: (26048, 15)
y_train shape: (26048,)
X_test shape: (6513, 15)
y_test shape: (6513,)
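Because the target classes are imbalanced, one refinement to consider is a stratified split, which keeps the class proportions the same in the training and test sets. The split above does not use it; the sketch below is an optional variation on hypothetical arrays `X_toy` and `y_toy` with a 75/25 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 75 of one class and 25 of the other
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["<=50K"] * 75 + [">50K"] * 25)

# stratify=y_toy preserves the class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_share = np.mean(y_tr == ">50K")
test_share = np.mean(y_te == ">50K")
print("share of >50K in train:", train_share)
print("share of >50K in test: ", test_share)
```

Both shares come out at 0.25, matching the full dataset.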


In [25]:
X_train.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'capital-diff'],
      dtype='object')

In [26]:
y

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: income, Length: 32561, dtype: object

#### 9). Handling Imbalanced Data:
<ul>
    <li>Identify if the target variable has an imbalanced distribution.</li>
    <li>Implement techniques to handle imbalanced data (e.g., oversampling or undersampling).</li>
    <li>Discuss the challenges posed by imbalanced datasets.</li>
</ul>

In [None]:
# Identify whether the target variable has an imbalanced distribution
# Check the y_train distribution to see how many samples each of the two classes has;
# the counts show a clear imbalance
print("Class distribution before oversampling:")
print( y_train.value_counts())

Class distribution before oversampling:
income
<=50K    19778
>50K      6270
Name: count, dtype: int64


In [None]:

# Add more samples for the class that has fewer samples
oversampler = RandomOverSampler(random_state=42)
# Apply the oversampling technique to the training set only
# Once applied, the training data will be balanced
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

In [None]:
print("Class distribution after oversampling:")
# The two classes now have the same number of samples
print( y_train_resampled.value_counts())

Class distribution after oversampling:
income
>50K     19778
<=50K    19778
Name: count, dtype: int64
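Conceptually, random oversampling duplicates randomly chosen minority-class rows until the classes are the same size. The sketch below reproduces that idea by hand with pandas on a hypothetical `train_toy` frame. It is an illustration of the principle, not the implementation used by `RandomOverSampler`.

```python
import pandas as pd

# Hypothetical training frame with a 4-to-2 class imbalance (values are made up)
train_toy = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

counts = train_toy["income"].value_counts()
majority_size = counts.max()

# Start from the original rows and add resampled rows for each smaller class
parts = [train_toy]
for label, size in counts.items():
    n_extra = majority_size - size
    if n_extra > 0:
        minority_rows = train_toy[train_toy["income"] == label]
        # Sample with replacement, so existing rows are duplicated
        parts.append(minority_rows.sample(n=n_extra, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["income"].value_counts())
```

Only the training set is resampled. The test set keeps its original distribution so that the evaluation reflects the data the model will see in practice.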


#### 10). Pipeline Implementation:
<ul>
    <li>Construct a preprocessing pipeline that includes all the steps above.</li>
    <li>Discuss the advantages of using a pipeline in the context of reproducibility and efficiency.</li>
</ul>

In [29]:
# Task 10: Pipeline Implementation
# Construct a preprocessing pipeline with a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Use a random forest classifier
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on the training data:
# all training data passes through the preprocessing steps first, then through the classifier
pipeline.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data to evaluate the model
# y_pred holds the predicted values
y_pred = pipeline.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8549055734684478
Classification Report:
               precision    recall  f1-score   support

       <=50K       0.91      0.90      0.90      4942
        >50K       0.70      0.71      0.70      1571

    accuracy                           0.85      6513
   macro avg       0.80      0.81      0.80      6513
weighted avg       0.86      0.85      0.86      6513
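One practical advantage of a pipeline is that it can be passed to cross-validation as a single estimator, so the preprocessing is refitted on each training fold and never sees the held-out fold. The sketch below shows this on synthetic data from `make_classification`; the `demo_pipeline` name, the logistic regression classifier, and the 5 folds are assumptions for illustration and are not part of the lab solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a preprocessed dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaling and the classifier are bundled so they are always fitted together
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Each fold refits the scaler on that fold's training portion only
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The same object can later be fitted once on the full training set and reused for prediction, which keeps training and inference preprocessing identical.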

