
## Lab Exercise Week #1: Mastering Data Preprocessing in Machine Learning

### Objective:
The goal of this lab exercise is to delve into advanced data preprocessing techniques used in machine learning. Preprocessing is an important step because of the "garbage in, garbage out" principle: if you feed a machine learning algorithm bad data, you can expect bad performance. You will work with a real-world dataset, handling missing data, encoding categorical variables, scaling features, and more.

### Dataset:
For this exercise, we will use the "<a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data">Adult Income</a>" dataset from the UCI Machine Learning Repository. This dataset contains various features related to individuals and aims to predict whether a person earns more than $50,000 per year.

### Tasks:

#### 1). Data Loading and Initial Exploration:
<ul>
    <li>Load the "Adult Income" dataset. You can load it from a local copy after downloading it, or read it directly from the URL provided above.</li>
    <li>Explore the dataset to understand its structure and features. Check the feature structure from the <a href="https://archive.ics.uci.edu/dataset/2/adult">UCI Repository</a>.</li>
    <li>Identify the target variable and its distribution.</li>
</ul>




In [4]:
# Import necessary libraries 
import pandas as pd
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


In [10]:
#import the data
url="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours-per-week', 'native-country', 'income']

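# sep=r',\s' : the separator is a regular expression, a comma followed by one whitespace character
# r'...' denotes a raw string, so the backslash in \s reaches the regex engine unchanged
# engine='python' is needed because the default C engine doesn't support regex separators
# na_values=['?'] makes pandas treat the '?' placeholder as a missing value (NaN)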
df1 = pd.read_csv(url, names=col_names, sep=r',\s', na_values=['?'], engine='python')
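
The parsing options used above can be checked offline on a small in-memory sample. This is only a sketch: the three rows and the reduced column list are hypothetical, chosen to mimic the "comma + space" layout and the `?` placeholder of `adult.data`.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same "comma + space" layout as adult.data
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "50, ?, HS-grad, >50K\n"
    "38, Private, ?, <=50K\n"
)
cols = ["age", "workclass", "education", "income"]  # reduced column list, for illustration only

# Same separator, NA marker, and engine as in the cell above
demo = pd.read_csv(sample, names=cols, sep=r",\s", na_values=["?"], engine="python")

print(demo)
print(demo.isna().sum())  # missing values per column
```

If the separator were a plain `','`, every string field would keep a leading space and `' ?'` would not match the `'?'` NA marker.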

In [27]:
df1.info()

print(f" \n target variable distribution \n {df1['income'].value_counts()} \n ")

df1.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
 
 target variable distribution 
 income
<=50K    24720
>50K      7841
Name: count, dtype: int64

|   | age | workclass | fnlwgt | education | education_num | marital-status | occupation | relationship | race | sex | capital_gain | capital_loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


#### 2). Handling Missing Data:
<ul>
    <li>Identify missing values in the dataset.</li>
    <li>Implement a strategy to handle missing data (e.g., imputation or removal). Impute missing values with the median for numerical columns and the most frequent value for categorical columns.</li>
    <li>Justify the chosen strategy.</li>
</ul>
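
As a starting point, the median / most-frequent strategy described above could look like the sketch below. It runs on a hypothetical toy frame rather than on `df1`, and it uses plain pandas `fillna`; scikit-learn's `SimpleImputer` (already imported in this notebook) offers the same two strategies and is the better fit once you build a pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "workclass": ["Private", "Private", np.nan, "State-gov"],
})

print(toy.isna().sum())  # identify missing values per column

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numerical columns: fill with the column median (robust to skew and outliers)
for col in num_cols:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical columns: fill with the most frequent value (the mode)
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

On the real data, compute the fill values on the training split only and reuse them on the test split, so that no information leaks from the test set.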

In [None]:
# Task 2: Handling Missing Data
# Impute missing values with the median for numerical columns and the most frequent value for categorical columns


#### 3). Encoding Categorical Variables:
<ul>
    <li>Identify categorical variables in the dataset.</li>
    <li>Apply appropriate encoding techniques (e.g., one-hot encoding or label encoding). Use one-hot encoding for categorical variables.</li>
    <li>Discuss the impact of encoding choices on model performance.</li>
</ul>
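
One-hot encoding can be illustrated on a small example before applying it to the full dataset. The sketch below assumes a hypothetical three-row frame and uses `pandas.get_dummies`; it is not the only option, and scikit-learn's `OneHotEncoder` is usually preferable inside a pipeline because it remembers the categories seen during fitting.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

# Each category becomes its own 0/1 indicator column; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=["sex", "race"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Note how the number of columns grows with the number of distinct categories, which matters for high-cardinality features such as `native-country`.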

In [None]:
# Task 3: Encoding Categorical Variables
# Use one-hot encoding for categorical variables


#### 4). Feature Scaling:
<ul>
    <li>Identify features that require scaling.</li>
    <li>Apply feature scaling using techniques such as Min-Max scaling or Standardization.</li>
    <li>Discuss the importance of feature scaling in different machine learning algorithms.</li>
</ul>
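
Standardization rescales each feature to zero mean and unit variance. A minimal sketch on hypothetical values, using the `StandardScaler` imported at the top of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two numeric features on different scales
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "hours-per-week": [40.0, 20.0, 60.0],
})

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation per column, then applies (x - mean) / std
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(scaled)
print(scaled.mean().round(6))       # approximately 0 for every column
print(scaled.std(ddof=0).round(6))  # approximately 1 for every column
```

On the real data, call `fit` on the training split only and `transform` on both splits.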

In [None]:
# Task 4: Feature Scaling
# Apply Standardization to numerical features


#### 5). Feature Engineering:
<ul>
    <li>Create new meaningful features based on existing ones.</li>
    <li>Discuss how feature engineering can enhance model performance.</li>
</ul>
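
One simple engineered feature is the net capital change, i.e. capital gain minus capital loss. The sketch below uses hypothetical values and the column names defined in `col_names`; the name `capital_diff` is just an illustrative choice.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as the loaded dataset
toy = pd.DataFrame({
    "capital_gain": [2174, 0, 0],
    "capital_loss": [0, 0, 1902],
})

# Net capital change: positive for a net gain, negative for a net loss
toy["capital_diff"] = toy["capital_gain"] - toy["capital_loss"]

print(toy)
```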

In [None]:
# Task 5: Feature Engineering
# Create a new feature "capital_diff" as the difference between capital_gain and capital_loss


#### 6). Outlier Detection and Handling:
<ul>
    <li>Identify potential outliers in numerical features.</li>
    <li>Implement a strategy to handle outliers (e.g., removal or transformation).</li>
    <li>Discuss the impact of outliers on model training.</li>
</ul>
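
The IQR rule flags values that lie more than 1.5 interquartile ranges outside the first or third quartile. The sketch below applies it to a hypothetical series and then clips the flagged values to the fences; clipping is one possible handling strategy, not necessarily the best one for every feature.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
hours = pd.Series([38, 40, 40, 42, 45, 99], name="hours-per-week")

q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the IQR "fences"

# Boolean mask: True where the value falls outside the fences
is_outlier = (hours < lower) | (hours > upper)
print(hours[is_outlier])

# Option A: drop the outliers
removed = hours[~is_outlier]

# Option B: keep every row but cap extreme values at the fences
clipped = hours.clip(lower=lower, upper=upper)
print(clipped)
```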

In [None]:
# Task 6: Outlier Detection and Handling
# Identify and handle outliers using a suitable method (e.g., Z-score or IQR)

#### 7). Normalization and Transformation:
<ul>
    <li>Apply a transformation (e.g., a log or power transform) to bring skewed numerical features closer to a normal distribution.</li>
    <li>Discuss the benefits of normalization in specific machine learning algorithms.</li>
</ul>
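
For a heavily right-skewed feature such as capital gain, a log transform is a common first attempt at making the distribution more symmetric. The sketch below uses hypothetical values and `np.log1p` (which computes log(1 + x), so zeros are handled safely); scikit-learn's `PowerTransformer` is an alternative worth comparing.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: many zeros and a few very large entries
gain = pd.Series([0, 0, 0, 500, 2174, 5178, 14084, 99999], name="capital_gain")

# log1p compresses large values much more strongly than small ones
gain_log = np.log1p(gain)

# Skewness close to 0 indicates a roughly symmetric distribution
print("skew before:", gain.skew())
print("skew after: ", gain_log.skew())
```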

In [None]:
# Task 7: Normalization and Transformation
# Apply a transformation (e.g., log or power transform) so skewed numerical features are closer to a normal distribution


#### 8). Data Splitting:
<ul>
    <li>Split the dataset into training and testing sets.</li>
    <li>Justify the chosen ratio and the importance of a proper train-test split.</li>
</ul>
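
A stratified split keeps the class proportions of the target the same in both subsets, which matters when the target is imbalanced. The sketch below uses a hypothetical 20-row frame; the 80/20 ratio and `random_state=42` are illustrative assumptions, not requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 15 rows of the majority class, 5 of the minority class
toy = pd.DataFrame({
    "age": range(20, 40),
    "income": ["<=50K"] * 15 + [">50K"] * 5,
})

X = toy.drop(columns="income")
y = toy["income"]

# stratify=y preserves the 75/25 class ratio in both the train and the test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```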

In [None]:
# Task 8: Data Splitting


In [None]:
# Check the shape of the train and test splits

#### 9). Handling Imbalanced Data:
<ul>
    <li>Identify if the target variable has an imbalanced distribution.</li>
    <li>Implement techniques to handle imbalanced data (e.g., oversampling or undersampling).</li>
    <li>Discuss the challenges posed by imbalanced datasets.</li>
</ul>
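
Random oversampling duplicates minority-class rows until the classes are balanced. The sketch below does this with plain pandas on a hypothetical training frame; it is a stand-in for `RandomOverSampler` from `imblearn` (imported at the top of the notebook), which implements the same idea behind a `fit_resample` interface.

```python
import pandas as pd

# Hypothetical training frame: 8 majority-class rows, 2 minority-class rows
train = pd.DataFrame({
    "age": range(10),
    "income": ["<=50K"] * 8 + [">50K"] * 2,
})

majority_n = train["income"].value_counts().max()

parts = []
for label, group in train.groupby("income"):
    n_extra = majority_n - len(group)
    if n_extra > 0:
        # Draw extra rows with replacement from the under-represented class
        extra = group.sample(n=n_extra, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["income"].value_counts())
```

Resample the training split only; the test split should keep the original class distribution so that the evaluation stays realistic.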

#### 10). Pipeline Implementation:
<ul>
    <li>Construct a preprocessing pipeline that includes all the steps above.</li>
    <li>Discuss the advantages of using a pipeline in the context of reproducibility and efficiency.</li>
</ul>
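
The sketch below shows one way to chain imputation, scaling, encoding, and a classifier into a single object. The toy data, the two feature columns, and the hyperparameters (`n_estimators=50`, `test_size=0.25`, `random_state=42`) are all hypothetical; on the real dataset you would list every numerical and categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: income depends only on age, with two injected missing values
ages = np.arange(20, 60, dtype=float)
toy = pd.DataFrame({
    "age": ages,
    "workclass": ["Private", "State-gov"] * 20,
    "income": np.where(ages >= 40, ">50K", "<=50K"),
})
toy.loc[0, "age"] = np.nan
toy.loc[1, "workclass"] = np.nan

# Numerical branch: median imputation, then standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["workclass"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy[["age", "workclass"]], toy["income"],
    test_size=0.25, stratify=toy["income"], random_state=42,
)

model.fit(X_train, y_train)   # every preprocessing step is fitted on the training data only
pred = model.predict(X_test)  # the same fitted steps are reused on the test data
acc = accuracy_score(y_test, pred)
print("accuracy:", acc)
```

Because all fitted statistics (medians, modes, means, category lists) live inside `model`, the same object can be reused on new data without repeating any preprocessing by hand.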

In [None]:
# Task 10: Pipeline Implementation
# Construct a preprocessing pipeline with a classifier


# Fit the pipeline on training data


# Make predictions on test data


# Evaluate the model (Use accuracy and classification report)
