<a href="https://colab.research.google.com/github/gabitza-tech/ETTI-SummerSchool2025/blob/main/Students_MachineLearning_Intro_FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 PART 1 - Introduction: Binary Classification with the Adult Income Dataset

In this exercise, we will explore the **Adult Income** dataset, a widely-used dataset for **classification tasks**.  

The main goal of this exercise is to:  
- Gain a solid understanding of the dataset  
- Preprocess and prepare it for a **binary classification** task  
- Apply Machine Learning techniques to predict income levels  

---

### 🛠 Tools & Libraries
We will primarily use **Scikit-Learn**, a powerful and versatile library for **Data Science** and **Machine Learning**. Some of its highlights:  

- Ready-to-use datasets for **prototyping and experimentation**  
- Built-in **data preprocessing tools**  
- Wide selection of **Machine Learning algorithms**  
- Easy evaluation with common metrics such as:  
  - ✅ Accuracy  
  - ✅ Precision  
  - ✅ Recall  
  - ✅ F1-score  

Scikit-Learn makes it straightforward to experiment with different models, and while it works well out-of-the-box, understanding the **hyperparameters** is important for improving performance.
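
To make the metrics listed above concrete, here is a minimal sketch on a tiny hand-made example. The labels `y_true` and `y_hat` are hypothetical and are not taken from the Adult dataset; only the four `sklearn.metrics` functions are real library calls.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels (1 = positive class), not real Adult data
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_hat)    # share of correct predictions
precision = precision_score(y_true, y_hat)  # of the predicted positives, how many are right
recall = recall_score(y_true, y_hat)        # of the real positives, how many were found
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Here 5 of 8 predictions are correct (accuracy 0.625), 2 of the 3 predicted positives are right (precision 2/3), and 2 of the 4 real positives are found (recall 0.5).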

---

### 💾 Saving Results
All results, including **screenshots and brief explanations**, will be saved in Google Docs for documentation purposes.

---

### 📥 Step 1: Load the Dataset
Let's start by loading the **Adult Income** dataset and exploring its structure.


In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

# Load Adult dataset
adult = fetch_openml("adult", version=2, as_frame=True)
x, y = adult.data, adult.target


# 📝 Exercise 1: Exploring the Dataset

In this exercise, we will take a closer look at the dataset by examining the features (**X**) and the labels (**y**).  
> **Hint:** The dataset is in **pandas DataFrame** format, so you can leverage all the familiar pandas methods.

### Tasks

1. **Preview the data**: Display the first 5 samples in the dataset.
2. **Dataset size**: How many samples are there in total?
3. **Feature count**: How many features does each sample have?
4. **Number of classes**: How many unique classes are present in the target variable?
5. **Class distribution**: How many samples belong to each class?
6. **Missing values**: Identify the total number of missing values.  
   > **Hint:** Use the `isna()` method in pandas.
7. **Missing value percentages**: Compute what percentage of each feature contains missing data.
8. **Feature types**: Determine the type (categorical or numerical) of the features that contain missing values.

---

Take your time to explore the dataset thoroughly—this step is crucial for **data cleaning** and **preprocessing**, which can significantly impact model performance.
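
As a warm-up, the pandas calls that the tasks above rely on can be sketched on a small stand-in frame. Everything named `toy_*` is hypothetical; the same methods should apply to the real `x` and `y`, but the numbers here say nothing about the Adult dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real features and labels
toy_x = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", None, "Local-gov", "Private"],
    "occupation": ["Sales", None, None, "Tech-support"],
})
toy_y = pd.Series(["<=50K", "<=50K", ">50K", ">50K"], name="class")

n_samples, n_features = toy_x.shape        # rows, columns
n_classes = toy_y.nunique()                # number of distinct labels
class_counts = toy_y.value_counts()        # samples per class
missing_per_column = toy_x.isna().sum()    # missing values per feature
missing_pct = toy_x.isna().mean() * 100    # percentage missing per feature

# Data types of only those features that contain missing values
cols_with_missing = missing_per_column[missing_per_column > 0].index
missing_dtypes = toy_x[cols_with_missing].dtypes

print(toy_x.head())
print(n_samples, n_features, n_classes)
print(missing_pct)
```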


In [None]:
# Missing values in the dataset are marked with '?' => Replace "?" with NaN
print(type(x))
print(type(y))
x = x.replace("?", pd.NA)

# .... Code here
# Print first 5 samples

# Get the number of samples in the dataset (rows) and the number of features (columns)

# How many classes?

# Samples per class

# How many missing values do we have? => isna().sum()

# Type of features


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
   age  workclass  fnlwgt     education  education-num      marital-status  \
0   25    Private  226802          11th              7       Never-married   
1   38    Private   89814       HS-grad              9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm             12  Married-civ-spouse   
3   44    Private  160323  Some-college             10  Married-civ-spouse   
4   18        NaN  103497  Some-college             10       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                NaN    Own-child  White  Female         

# 📝 Exercise 2: Handling Missing Data

Now that we have identified the missing values in the dataset, and since they affect only a **small share of the samples** (about 7% of the rows, as the output of this exercise shows), we can remove them.  

### Tasks

1. Drop the samples with missing values from both **features (X)** and **labels (y)**.  
   > **Hint:** Use `dropna()`.
2. Check the **new size** of the dataset.
3. Calculate **how much data was lost** after removing the missing entries.
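
One detail that is easy to miss is keeping the labels aligned with the rows that survive `dropna()`. A minimal sketch on hypothetical toy data (the `toy_*` names are illustrative only) is:

```python
import pandas as pd

# Hypothetical toy data with one incomplete row
toy_x = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", None, "Local-gov", "Private"],
})
toy_y = pd.Series([0, 0, 1, 1])

# Drop incomplete rows, then keep only the labels whose index survived
toy_x_drop = toy_x.dropna()
toy_y_drop = toy_y.loc[toy_x_drop.index]

# Percentage of rows lost by dropping
lost_pct = 100 * (len(toy_x) - len(toy_x_drop)) / len(toy_x)
print(toy_x.shape, toy_x_drop.shape, lost_pct, "%")
```

Selecting the labels through the surviving index is one way to keep features and labels in sync; other approaches (such as building a boolean mask first) work just as well.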


In [None]:
# CODE HERE
# Drop samples that have missing values (store the results as x_drop and y_drop, which later cells use)

# Check new dataset size


(48842, 14)
(45222, 14)
7.411653904426519 %


# 🔄 Transforming Categorical Data into Numbers

Many Machine Learning algorithms require **numerical inputs**, so we need to convert categorical features into numbers.  

For this, we will use **`LabelEncoder`** from `scikit-learn`.  

> This time, we will handle this step for you. Next time, you will be on your own! 🙂
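
To see what `LabelEncoder` does before it is applied to the whole dataset, here is a small sketch on a hypothetical list of category values. The name `le_demo` is illustrative and separate from the encoder used in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Private", "Local-gov", "Private", "State-gov"])

# Categories are sorted, then numbered from 0
print(le_demo.classes_)  # the learned categories
print(codes)             # one integer per input value

# The mapping can be reversed
decoded = le_demo.inverse_transform(codes)
print(decoded)
```

Note that the integer codes suggest an ordering (`Local-gov` < `Private` < `State-gov`) that the categories do not really have. Part 2 of this notebook looks at One-Hot Encoding, which avoids that implied order.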


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Map each category of a feature to an integer code (0 .. number of unique values - 1)
le = LabelEncoder()

# Separate categorical and numeric columns
cat_cols = x_drop.select_dtypes(include=["object", "category"]).columns
num_cols = x_drop.select_dtypes(include=["int64", "float64"]).columns

# 1. Label Encoder - Individual numerical values for each categorical value in a feature
x_encoded = x_drop.copy()
for col in cat_cols:
  x_encoded[col] = le.fit_transform(x_drop[col])

# x_encoded has the same structure as x, but with categorical columns having numerical values now
x_encoded.head()

# For labels, we can simply transform them into numerical values in the case of binary classification
y_encoded = le.fit_transform(y_drop)
print(set(y_encoded))


{np.int64(0), np.int64(1)}


# 📝 Exercise 3: Train-Test Split

Next, we will split our dataset into a **training set** and a **test set**.  

> In real-world applications, we would usually also create a **validation set**, but for this introductory exercise, we will keep it simple with just two splits.  

### Tasks

1. Split the data into **train (70%)** and **test (30%)** sets using `train_test_split()`.  
   > **Note:** Use the `stratify` parameter to ensure class proportions are preserved.
2. Check the **number of samples** in the train and test sets.
3. Verify that the **class distribution** is similar in both sets.  
   - Example: If the training set has 65% `'>50K'` and 35% `'<=50K'`, the test set should have a **similar distribution**.
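
The effect of `stratify` can be checked on synthetic labels before trying it on the real data. In this sketch the arrays `toy_features` and `toy_labels` are made up (80% class 0, 20% class 1); only `train_test_split` and `np.bincount` are real calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 160 of class 0 and 40 of class 1
toy_features = np.arange(200).reshape(-1, 1)
toy_labels = np.array([0] * 160 + [1] * 40)

f_train, f_test, l_train, l_test = train_test_split(
    toy_features, toy_labels,
    test_size=0.3, stratify=toy_labels, random_state=42,
)

# Share of each class in both splits
train_share = np.bincount(l_train) / len(l_train)
test_share = np.bincount(l_test) / len(l_test)
print(len(l_train), len(l_test))
print(train_share, test_share)
```

With stratification, both splits keep the 80/20 class ratio of the full toy dataset.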


In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# Stratified 70/30 split (look up train_test_split in the docs to see what each parameter does)
x_train, x_test, y_train, y_test = train_test_split(x_encoded, y_encoded, test_size=0.3, stratify=y_encoded, random_state=42)

# CODE HERE - STUDENTS
# Number of samples in the train and test sets
...

# See if train-test balanced
...

31655
13567
y_train counts:
0: 23809
1: 7846

y_test counts:
0: 10205
1: 3362


# ⚡ Training & Inference with KNN and Logistic Regression

Now that our data is **cleaned**, **encoded**, and **split** into training and test sets, we can move on to **making predictions**.  

In this section, we will train a **K-Nearest Neighbors (KNN)** classifier and a **Logistic Regression** model, two of the simplest and most widely used algorithms for **binary classification**, and then evaluate their performance on the test set.

# 📝 Exercise 4: Comparing 2 Machine Learning Classification Methods
### Tasks

1. Which algorithm has the better performance?
2. Which algorithm is faster to train? Do some profiling on training time taken for each.
3. How does changing the `n_neighbors` parameter of KNN affect the results, and how does `max_iter` affect Logistic Regression?
4. Why isn't it ok to simply change the training hyper-parameters of the methods and then evaluate them on the test set?
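
Regarding the last question, one common alternative is to tune hyper-parameters on a separate validation split and touch the test set only once at the end. The sketch below uses synthetic data from `make_classification` and hypothetical names (`demo_x`, `val_scores`, `best_k`); it is one possible setup, not the notebook's official solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the Adult dataset)
demo_x, demo_y = make_classification(n_samples=1000, n_features=8, random_state=0)

# 60% train, 20% validation, 20% test
x_tmp, x_te, y_tmp, y_te = train_test_split(
    demo_x, demo_y, test_size=0.2, stratify=demo_y, random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

# Compare candidate values of n_neighbors on the validation set only
val_scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr)
    val_scores[k] = model.score(x_val, y_val)

# Pick the best value, then evaluate a single time on the test set
best_k = max(val_scores, key=val_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(x_tr, y_tr)
test_accuracy = final_model.score(x_te, y_te)
print(val_scores, best_k, test_accuracy)
```

Because the test set played no part in choosing `best_k`, its accuracy remains an honest estimate of performance on unseen data.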

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

# --- Make predictions ---
y_pred = knn.predict(x_test)

print("\nKNN Classification Report:\n", classification_report(y_test, y_pred))

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, y_train)

# Predict and evaluate
y_pred = clf.predict(x_test)
print("\nLogistic Regression Classification Report:\n", classification_report(y_test, y_pred))


# Alternatively, compute only the accuracy (here for Logistic Regression, the most recent y_pred)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)



KNN Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.91      0.85     10205
           1       0.54      0.32      0.40      3362

    accuracy                           0.76     13567
   macro avg       0.67      0.62      0.63     13567
weighted avg       0.74      0.76      0.74     13567


Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.93      0.89     10205
           1       0.72      0.55      0.62      3362

    accuracy                           0.83     13567
   macro avg       0.79      0.74      0.76     13567
weighted avg       0.83      0.83      0.83     13567

Accuracy: 0.8343775337215302


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# 📏 Scaling Numerical Features

When working with numerical data, it’s important to **scale features** so that they lie in comparable ranges.  
For example, scaling values to a **range between 0 and 1** can improve the performance of algorithms that rely on distances or gradient-based optimization, such as KNN and Logistic Regression.

Scikit-Learn provides built-in scalers, so we don’t have to implement them manually.

---

### Common Scaling Techniques

**1. StandardScaler**  
Scales each feature to have **mean** $\mu = 0$ and **standard deviation** $\sigma = 1$:  

$$
z = \frac{x - \mu}{\sigma}
$$

**2. MinMaxScaler**  
Scales each feature to a specified range, by default $[0, 1]$:  

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

> You can also choose other ranges, e.g., $[-1, 1]$.

---

**⚠️ Important:** Always **FIT** your scaler on the **TRAINING set** only.  
Scaling the train and test set together can lead to **data leakage**, which will make your results **skewed, biased, and unfair**.

For example, centering the values with the mean of the **entire** dataset lets the distribution of the test set (which might be quite different) influence training.
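
A worked example on a single made-up feature shows what "fit on the training set only" means in practice. The arrays `train_col` and `test_col` are hypothetical; the scalers apply exactly the two formulas given above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature: three training values, one test value
train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[40.0]])

# StandardScaler learns mu and sigma from the training data only
std = StandardScaler().fit(train_col)
train_std = std.transform(train_col)
test_std = std.transform(test_col)   # uses the TRAIN mean and std

# MinMaxScaler learns min and max from the training data only
mm = MinMaxScaler().fit(train_col)
test_mm = mm.transform(test_col)     # (40 - 10) / (30 - 10) = 1.5

print(train_std.ravel(), test_std.ravel(), test_mm.ravel())
```

The test value ends up at 1.5, outside $[0, 1]$. That is expected: the scaler has only seen the training range, which is exactly what prevents leakage.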


# 📝 Exercise 5: Scaling and Its Impact

In this exercise, we will explore how **different scaling methods** affect model performance.

### Tasks

1. Scale your data using:  
   - `StandardScaler()`  
   - `MinMaxScaler()`  
   > **Hint:** Use `fit_transform()` on the training data and `transform()` on the test data.
2. Verify that your scaled data **looks different** from the original data.  
   > It’s always good to double-check that scaling was applied correctly.
3. Train **two separate Logistic Regression models**:  
   - One on the StandardScaler data  
   - One on the MinMaxScaler data  
4. Compare the results. Which scaling method gives **better performance** on the test set?
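
If you want to bundle the scaler and the classifier so that the fit/transform order cannot be mixed up, scikit-learn's `make_pipeline` is one option. The sketch below compares both scalers on synthetic data; `demo_x`, `scores` and the exaggerated first feature are assumptions made for illustration, so the numbers will not match the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data with one feature on a much larger scale than the others
demo_x, demo_y = make_classification(n_samples=400, n_features=6, random_state=1)
demo_x[:, 0] *= 1000
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.3, stratify=demo_y, random_state=1)

scores = {}
for name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    # fit() fits the scaler on the training data, then trains the classifier;
    # score() reuses the fitted scaler to transform the test data
    pipe = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    pipe.fit(x_tr, y_tr)
    scores[name] = pipe.score(x_te, y_te)

print(scores)
```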


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# CODE HERE
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(x_train)
X_test_std = scaler_std.transform(x_test)

scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(x_train)
X_test_minmax = scaler_minmax.transform(x_test)

print(X_train_std[:2])
print('---')
print(X_train_minmax[:2])



[[ 3.36395269e-01 -5.16254212e-01  1.51020289e+00 -1.48057352e-01
  -2.18400256e-01  7.47278750e-01 -1.77512502e-01 -2.70817140e-01
   5.99256763e-01 -1.94161628e-01 -3.04870748e-01 -2.13176071e-01
  -2.24878853e-02 -1.65709108e-01 -1.96080747e-01 -1.13128035e-01
  -7.17216861e-02 -9.89610441e-02 -1.38288322e-01 -1.22234059e-01
  -1.85862157e-01 -2.15433723e-01 -4.49508816e-01 -1.10814024e-01
  -6.90250603e-01  4.08949437e+00 -4.13377150e-02 -1.34685923e-01
  -5.29075373e-01  2.46931846e+00 -2.63718836e-02 -9.31601965e-01
  -1.10375014e-01 -6.88256102e-01 -1.81543571e-01 -1.70968035e-01
  -3.71997879e-01 -1.48722229e-02 -3.91177193e-01  2.55463366e+00
  -1.84493434e-01 -2.16235319e-01 -2.66763371e-01 -3.45058394e-01
  -6.87696003e-02 -3.92946824e-01 -1.47384838e-01 -3.71392345e-01
  -1.77133437e-01 -2.31487121e-01 -8.38493911e-01  1.68681937e+00
  -1.74171120e-01 -4.13964000e-01 -3.44542244e-01 -2.17751670e-01
  -9.91236684e-02 -1.72138892e-01 -3.21355261e-01 -8.72214910e-02
   4.02906

In [None]:

# --- Train Logistic Regression StandardScaler---


# --- Make predictions ---
y_pred_std = ...
print("\nStandard Scaling Classification Report:\n", classification_report(...))

# --- Train Logistic Regression MinMaxScaler---


# --- Make predictions ---
y_pred_minmax = ...
print("\nMin-Max Scaling Classification Report:\n", classification_report(...))


Standard Scaling Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.93      0.90     10205
           1       0.73      0.59      0.65      3362

    accuracy                           0.84     13567
   macro avg       0.80      0.76      0.78     13567
weighted avg       0.84      0.84      0.84     13567


Min-Max Scaling Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.93      0.90     10205
           1       0.73      0.58      0.65      3362

    accuracy                           0.84     13567
   macro avg       0.80      0.76      0.77     13567
weighted avg       0.84      0.84      0.84     13567



# 🚀 PART 2 - Transforming Categorical Data with One-Hot Encoding

Many Machine Learning algorithms require **numerical input**, so categorical features must be converted into numbers.  

This time, we will use **`OneHotEncoder`** from `scikit-learn`, which creates a **binary column for each category** in a feature.  

> Good news: The cell below builds the One-Hot Encoded data for you as `x_encoded` and `y_encoded`. (lucky you!)  

However, you will need to **repeat the previous steps** with this new encoding:  
- Split the data into train and test sets  
- Scale the numerical features if necessary  
- Train the model  
- Compare the results with the previous **LabelEncoder** approach  

This will help you understand **how encoding choices impact model performance**.


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# One-Hot Encode Features
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

# Separate categorical and numeric columns
cat_cols = x_drop.select_dtypes(include=["object", "category"]).columns
num_cols = x_drop.select_dtypes(include=["int64", "float64"]).columns

# One-Hot Encode categorical features: each category of each feature gets its own binary column
x_encoded_array = ohe.fit_transform(x_drop[cat_cols])

# Convert encoded array back to DataFrame
x_encoded_df = pd.DataFrame(
    x_encoded_array,
    columns=ohe.get_feature_names_out(cat_cols),
    index=x_drop.index
)

# Combine numeric columns and encoded categorical columns
x_encoded = pd.concat([x_drop[num_cols], x_encoded_df], axis=1)

# x_encoded now has more columns than the original x, but with categorical columns one-hot encoded
x_encoded.head()

# For labels, we can simply transform them into numerical values in the case of binary classification
le = LabelEncoder()  # re-created here so this cell does not depend on the earlier one
y_encoded = le.fit_transform(y_drop)
print(set(y_encoded))


# 📝 Exercise 6: One-Hot Encoding and Its Impact

In this exercise, we will evaluate how **One-Hot Encoding** affects model performance compared to Label Encoding.

### Tasks

1. Determine **how many features** the One-Hot Encoded dataset now contains.
2. Split the data into **train and test sets**, keeping the **same proportions** as in the first case for a fair comparison.
3. Train **two Logistic Regression classifiers**:  
   - One with scaled features  
   - One without scaling
4. Compare the results and analyze:  
   - What are the **performance gains or losses** with One-Hot Encoding compared to Label Encoding?
5. Is training faster or slower than previously? Do some profiling of time taken for training.
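
For the profiling task, a simple wall-clock measurement with `time.perf_counter` is usually enough. The helper `timed_fit` and the synthetic data below are hypothetical and only illustrate the pattern; in a notebook, the `%%time` cell magic is another option.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def timed_fit(model, features, labels):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(features, labels)
    return time.perf_counter() - start


# Synthetic data standing in for the encoded training set
demo_x, demo_y = make_classification(n_samples=500, n_features=10, random_state=0)

fit_times = {
    "knn": timed_fit(KNeighborsClassifier(n_neighbors=5), demo_x, demo_y),
    "logreg": timed_fit(LogisticRegression(max_iter=1000), demo_x, demo_y),
}
print(fit_times)
```

Timings vary between runs and machines, so it is worth repeating the measurement a few times. For KNN it is also worth timing `predict()`, since most of its work happens at prediction time rather than during `fit()`.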

In [None]:
# Code here

# 📝 Exercise 7 (Optional - Homework): Handling Missing Values

For the curious minds: so far, we simply **dropped samples** with missing values.  
But what if we **keep them** and try to fill in the missing information?  

### Tasks

1. Treat the missing values as a **separate category** (for example, labelled `'n/a'`) for categorical features.  
   > Hint: This is very easy to do in pandas.
2. Fill the missing values:  
   - **Categorical features:** use the **most frequent value** in the column  
   - **Numerical features:** use the **mean or median** of the column
3. Train your model again and **compare the results**.  
   - Does filling missing values improve performance, or not?
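
Both strategies can be sketched with plain pandas on a hypothetical toy frame. The column names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "workclass": ["Private", None, "Private", "Local-gov"],
    "hours": [40.0, None, 50.0, 30.0],
})

# Strategy 1: treat "missing" as its own category
as_category = toy["workclass"].fillna("n/a")

# Strategy 2: fill with the most frequent value / the median
most_frequent = toy["workclass"].mode()[0]
filled = toy.assign(
    workclass=toy["workclass"].fillna(most_frequent),
    hours=toy["hours"].fillna(toy["hours"].median()),
)

print(as_category.tolist())
print(filled)
```

Two caveats, stated as assumptions about your setup: if a column has the pandas `category` dtype, a new label such as `'n/a'` has to be registered first (for example with `cat.add_categories`), and in a real experiment the fill values should be computed from the training split only, for the same leakage reason as with scaling.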


In [None]:
# Code here

# 📝 Exercise 8 (Optional - Homework): Exploring Interesting Relationships

At the end of the day, the goal of data analysis and predictions is to **generate insights** that can have a real-world impact.  
This exercise encourages you to explore **interesting patterns and relationships** in the Adult Income dataset.

### Suggested Questions to Explore

1. Where do people earning **>50K/year** come from?  
2. How many hours do they work in each occupation or category?  
3. What is the **average age** in each category?  
4. How is the **gender balance** for each income category?  
5. How many hours do people in each category work?  
6. Any **other observations** that you find interesting or surprising.  

> Feel free to use **groupby**, **pivot tables**, or **visualizations** to uncover meaningful patterns.
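
As a starting point, `groupby` and `pd.crosstab` cover most of the questions above. The frame below is a hypothetical four-row example, so its numbers carry no meaning for the real dataset.

```python
import pandas as pd

# Hypothetical toy records
toy = pd.DataFrame({
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [25, 35, 45, 55],
    "hours-per-week": [40, 30, 50, 60],
})

# Average age and weekly hours per income category
mean_by_income = toy.groupby("income")[["age", "hours-per-week"]].mean()

# Gender balance within each income category (each row sums to 1)
sex_share = pd.crosstab(toy["income"], toy["sex"], normalize="index")

print(mean_by_income)
print(sex_share)
```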


In [None]:
# Code here