
## Lab Exercise Week #1: Mastering Data Preprocessing in Machine Learning

### Objective:
The goal of this lab exercise is to delve into advanced data preprocessing techniques used in machine learning. Preprocessing is an essential step: under the "garbage in, garbage out" principle, a machine learning algorithm that is fed bad data can be expected to perform badly. You will work with a real-world dataset, handling missing data, encoding categorical variables, scaling features, and more.

### Dataset:
For this exercise, we will use the "<a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data">Adult Income</a>" dataset from the UCI Machine Learning Repository. This dataset contains various features describing individuals; the task is to predict whether a person earns more than $50,000 per year.

### Tasks:

#### 1). Data Loading and Initial Exploration:
<ul>
    <li>Load the "Adult Income" dataset, either from a local copy after you have downloaded it or directly from the URL provided above.</li>
    <li>Explore the dataset to understand its structure and features. Check the feature structure from the <a href="https://archive.ics.uci.edu/dataset/2/adult">UCI Repository</a>.</li>
    <li>Identify the target variable and its distribution.</li>
</ul>
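
As a starting point, the loading step could be sketched as follows. This is a minimal sketch, not the required solution: the helper `load_adult` and the column label `income` for the target are our own names, the other column names follow the attribute list on the UCI page, and the two rows in `sample` are illustrative rows in the same layout as the real file so that the example runs offline.

```python
import io

import pandas as pd

# Direct-download URL quoted in the Dataset section.
ADULT_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# The raw file has no header row, so column names must be supplied.
# "income" is our own label for the target column (an assumption).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def load_adult(source=ADULT_URL):
    """Read adult.data from a URL, a local path, or a file-like object."""
    return pd.read_csv(
        source,
        header=None,
        names=COLUMNS,
        skipinitialspace=True,  # fields are separated by ", "
        na_values="?",          # the file marks unknown values with "?"
    )


# Offline demo on two illustrative rows; call load_adult() for the real data.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, ?, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
df = load_adult(sample)

print(df.shape)                                   # rows, columns
print(df.dtypes)                                  # numeric vs. text columns
print(df["income"].value_counts(normalize=True))  # target distribution
```

On the full dataset, `df.info()` and `df.describe()` are useful next steps for exploring structure and value ranges.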




In [None]:
# Import necessary libraries


#### 2). Handling Missing Data:
<ul>
    <li>Identify missing values in the dataset.</li>
    <li>Implement a strategy to handle missing data (e.g., imputation or removal). For this exercise, impute missing values with the median for numerical columns and the most frequent value for categorical columns.</li>
    <li>Justify the chosen strategy.</li>
</ul>
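
The imputation strategy described above might look like the sketch below. It uses plain pandas on a small made-up frame (the values are not from the Adult data), and it assumes missing entries are already represented as `NaN`; in the raw Adult file they appear as `?` and need converting first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Adult data (values are made up).
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "hours-per-week": [40.0, 50.0, np.nan, 40.0],
    "workclass": ["Private", np.nan, "Private", "State-gov"],
})

# Step 1: count missing values per column.
print(df.isna().sum())

# Step 2: split columns by type.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Step 3: median for numerical columns, most frequent value for categorical ones.
imputed = df.copy()
for col in num_cols:
    imputed[col] = imputed[col].fillna(imputed[col].median())
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

In a real workflow the median and mode would normally be computed on the training split only and then reused on the test split, so that no information leaks from test data into preprocessing.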

In [None]:
# Task 2: Handling Missing Data
# Impute missing values with the median for numerical columns and the most frequent value for categorical columns


#### 3). Encoding Categorical Variables:
<ul>
    <li>Identify categorical variables in the dataset.</li>
    <li>Apply appropriate encoding techniques (e.g., one-hot encoding or label encoding). For this exercise, use one-hot encoding for categorical variables.</li>
    <li>Discuss the impact of encoding choices on model performance.</li>
</ul>
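
One-hot encoding can be illustrated with `pandas.get_dummies`, as in this minimal sketch on a made-up frame. Each category becomes its own 0/1 column, so no artificial ordering is imposed on the categories.

```python
import pandas as pd

# Toy frame with one numerical and two categorical columns (made-up values).
df = pd.DataFrame({
    "age": [25, 38, 47],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "State-gov", "Private"],
})

# Treat every non-numeric column as categorical.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Replace each categorical column with one indicator column per category.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The trade-off to discuss is dimensionality: a column with many distinct values (such as `native-country`) adds one new column per value, whereas label encoding keeps a single column but implies an order that may not exist.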

In [None]:
# Task 3: Encoding Categorical Variables
# Use one-hot encoding for categorical variables


#### 4). Feature Scaling:
<ul>
    <li>Identify features that require scaling.</li>
    <li>Apply feature scaling using techniques such as Min-Max scaling or Standardization.</li>
    <li>Discuss the importance of feature scaling in different machine learning algorithms.</li>
</ul>
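
The two scaling techniques named above can be compared side by side. The sketch below uses scikit-learn's `StandardScaler` and `MinMaxScaler` on a tiny made-up array with two columns standing in for `age` and `hours-per-week`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up numerical features: age, hours-per-week.
X = np.array([
    [25.0, 40.0],
    [35.0, 20.0],
    [45.0, 60.0],
])

# Standardization: each column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: each column is rescaled to the range [0, 1].
X_mm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_mm)
```

Distance-based and gradient-based models (k-nearest neighbours, SVMs, logistic regression, neural networks) are typically sensitive to feature scale, while tree-based models are largely unaffected.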

In [None]:
# Task 4: Feature Scaling
# Apply Standardization to numerical features


#### 5). Feature Engineering:
<ul>
    <li>Create new meaningful features based on existing ones.</li>
    <li>Discuss how feature engineering can enhance model performance.</li>
</ul>
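
As one possible engineered feature, the net capital change can be derived from two existing columns. The sketch below uses made-up values; the column names `capital-gain` and `capital-loss` follow the UCI attribute list, and `capital-diff` is simply the name chosen for the new feature in this exercise.

```python
import pandas as pd

# Made-up rows with the two source columns.
df = pd.DataFrame({
    "capital-gain": [2174, 0, 0],
    "capital-loss": [0, 1902, 0],
})

# New feature: net capital change (gain minus loss).
df["capital-diff"] = df["capital-gain"] - df["capital-loss"]

print(df)
```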

In [None]:
# Task 5: Feature Engineering
# Create a new feature "capital-diff" as the difference between capital-gain and capital-loss


#### 6). Outlier Detection and Handling:
<ul>
    <li>Identify potential outliers in numerical features.</li>
    <li>Implement a strategy to handle outliers (e.g., removal or transformation).</li>
    <li>Discuss the impact of outliers on model training.</li>
</ul>
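
A common way to flag outliers is the interquartile range (IQR) rule, which marks values more than 1.5 × IQR beyond the quartiles. The sketch below applies it to a made-up `hours-per-week` series and then clips the flagged values to the fences instead of dropping the rows; both the 1.5 multiplier and the choice of clipping are conventions, not requirements.

```python
import pandas as pd

# Made-up working hours with two extreme values.
hours = pd.Series([40, 38, 45, 40, 42, 99, 40, 1], name="hours-per-week")

# Quartiles and interquartile range.
q1, q3 = hours.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (hours < lower) | (hours > upper)
print(hours[is_outlier])

# Handling option: clip extreme values to the fences rather than removing rows.
clipped = hours.clip(lower, upper)
print(clipped)
```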

In [None]:
# Task 6: Outlier Detection and Handling
# Identify and handle outliers using a suitable method (e.g., Z-score or IQR)

#### 7). Normalization and Transformation:
<ul>
    <li>Apply a transformation (e.g., a log or power transform) so that skewed numerical features more closely follow a normal distribution.</li>
    <li>Discuss the benefits of such transformations for specific machine learning algorithms.</li>
</ul>
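
To illustrate how a transformation can bring a skewed feature closer to a normal shape, the sketch below applies `log1p` to synthetic right-skewed data and compares skewness before and after. The data is randomly generated as a stand-in for a column such as `capital-gain`, not taken from the Adult dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), standing in for a skewed feature.
rng = np.random.default_rng(0)
capital_gain = pd.Series(
    rng.lognormal(mean=8.0, sigma=1.0, size=1000), name="capital-gain"
)

# log1p computes log(1 + x), which is also defined for zero values.
log_gain = np.log1p(capital_gain)

# Skewness near 0 indicates a roughly symmetric distribution.
raw_skew = capital_gain.skew()
log_skew = log_gain.skew()
print("skew before:", raw_skew)
print("skew after: ", log_skew)
```

scikit-learn's `PowerTransformer` (Box-Cox or Yeo-Johnson) is an alternative when a plain logarithm is not a good fit, for example when a feature contains negative values.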

In [None]:
# Task 7: Normalization and Transformation
# Apply a transformation (e.g., log or power transform) to reduce skew in numerical features


#### 8). Data Splitting:
<ul>
    <li>Split the dataset into training and testing sets.</li>
    <li>Justify the chosen ratio and the importance of a proper train-test split.</li>
</ul>
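
A stratified split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the random seed, and the synthetic arrays below are illustrative choices only; `stratify=y` keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 25% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 75 + [1] * 25)

# Hold out 20% for testing, preserving the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of positives in each split
```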

In [None]:
# Task 8: Data Splitting


In [None]:
# Check the shapes of the training and test sets

#### 9). Handling Imbalanced Data:
<ul>
    <li>Identify if the target variable has an imbalanced distribution.</li>
    <li>Implement techniques to handle imbalanced data (e.g., oversampling or undersampling).</li>
    <li>Discuss the challenges posed by imbalanced datasets.</li>
</ul>
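
Random oversampling of the minority class is one of the simplest rebalancing techniques and can be sketched with plain pandas. The frame below is made up, and the label `income` for the target column is an assumption. Resampling should be applied to the training split only, so that the test set keeps the real class distribution.

```python
import pandas as pd

# Made-up training data with a 6:2 class imbalance.
train = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 61, 33, 44],
    "income": ["<=50K"] * 6 + [">50K"] * 2,
})

# Step 1: inspect the class distribution.
print(train["income"].value_counts(normalize=True))

# Step 2: separate the classes.
majority = train[train["income"] == "<=50K"]
minority = train[train["income"] == ">50K"]

# Step 3: sample the minority class with replacement up to the majority size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

# Step 4: recombine and shuffle.
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["income"].value_counts())
```

Oversampling duplicates minority rows, which can encourage overfitting; undersampling the majority class or using class weights in the classifier are alternatives worth comparing.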

#### 10). Pipeline Implementation:
<ul>
    <li>Construct a preprocessing pipeline that includes all the steps above.</li>
    <li>Discuss the advantages of using a pipeline in the context of reproducibility and efficiency.</li>
</ul>
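
A preprocessing pipeline with a classifier could be structured as in the sketch below. It is a minimal sketch under stated assumptions: the data is synthetic with a made-up labelling rule, logistic regression is just one possible classifier, and only imputation, scaling, and one-hot encoding are included (outlier handling and resampling are left out for brevity).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Adult data (made-up values).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "hours-per-week": rng.integers(10, 60, size=n).astype(float),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=n),
})
# Made-up rule so the classifier has a signal to learn.
y = np.where(X["age"] + X["hours-per-week"] > 85, ">50K", "<=50K")
X.loc[0, "age"] = np.nan  # one missing value to exercise the imputer

num_cols = ["age", "hours-per-week"]
cat_cols = ["workclass"]

# Numerical columns: median imputation, then standardization.
# Categorical columns: most-frequent imputation, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fitting learns imputation values, scaling statistics, and categories
# from the training data only; predict() reuses them on the test data.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
print(classification_report(y_test, y_pred))
```

Because every preprocessing step is fitted inside the pipeline, the same object can be passed to cross-validation or grid search without leaking test-set statistics into training.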

In [None]:
# Task 10: Pipeline Implementation
# Construct a preprocessing pipeline with a classifier


# Fit the pipeline on training data


# Make predictions on test data


# Evaluate the model (Use accuracy and classification report)
