# Introduction
A very important aspect of supervised and semi-supervised machine learning is the quality of the labels produced by human labelers. Unfortunately, humans are not perfect and in some cases may even maliciously label things incorrectly. In this assignment, you will evaluate the impact of incorrect labels on a number of different classifiers.

We have provided a number of code snippets you can use during this assignment. Feel free to modify them or replace them.


## Dataset
The dataset you will be using is the [Adult Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult). This dataset was created by Ronny Kohavi and Barry Becker and was used to predict whether a person's income is more/less than 50k USD based on census data.

### Data preprocessing
Start by loading and preprocessing the data. Remove NaN values, convert strings to categorical variables, and encode the target variable (the strings `<=50K` and `>50K` in column index 14).

In [None]:
import pandas as pd
import numpy as np

In [None]:
# This can be used to load the dataset
data = pd.read_csv("adult.csv", na_values='?')
data.head()


##### Check the percentage of missing values in the columns. Rule of thumb: If the percentage of missing values is above 60%, remove the feature.

In [None]:
for column in data.columns:
    nan_count = data[column].isna().sum()/len(data)*100
    print("Percentage of NaN in column " + column + " is " + str(nan_count) + "\n")

Remove all rows that contain NaN values. The columns with missing values are categorical (no numerical values), so mean or median imputation is not applicable.
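
If dropping rows turns out to discard too much data, one possible alternative for categorical columns is most-frequent (mode) imputation. The sketch below is only an illustration on a small made-up frame, not on the Adult data; the values in `toy` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with missing entries in categorical columns
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
})

# strategy="most_frequent" also works for string columns
imputer = SimpleImputer(strategy="most_frequent")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

Whether imputing or dropping is the better choice depends on how many rows are affected, which the percentages printed above should indicate.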

In [None]:
data_before = len(data)
data = data.dropna()
data_after = len(data)
print("Removed " + str(data_before-data_after) + " rows from the " + str(data_before) + " rows")

data = data.drop(columns=["education", "fnlwgt"])
print(data)


Turn string columns into categorical data

In [None]:
string_columns = ['workclass','marital-status','occupation','relationship','race','sex','native-country']
for col in string_columns:
    data[col] = pd.Categorical(data[col])

In [None]:
print(data['salary'].unique())
data['salary'] = data['salary'].str.strip().str.replace(r"\.$", "", regex=True)
data['salary'] = pd.Categorical(data['salary'],categories=["<=50K", ">50K"],ordered=False)

### Data classification
Choose at least 4 different classifiers and evaluate their performance in predicting the target variable.

#### Preprocessing
Think about how you are going to encode the categorical variables, whether and how to normalize, whether you want to use all of the features, whether to apply feature dimensionality reduction, etc. Justify your choices.

A good method to apply preprocessing steps is using a Pipeline. Read more about this [here](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/) and [here](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf).

<!-- #### Data visualization
Calculate the correlation between different features, including the target variable. Visualize the correlations in a heatmap. A good example of how to do this can be found [here](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec).

Select a feature you think will be an important predictor of the target variable and one which is not important. Explain your answers. -->

#### Evaluation
Use a validation technique from the previous lecture to evaluate the performance of the model. Explain and justify which metrics you used to compare the different models.
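
When the classes are imbalanced, accuracy alone can look good even for a model that mostly predicts the majority class, so it may help to report several metrics side by side. The following is a minimal sketch on synthetic data, assuming stratified 5-fold cross-validation; `X_toy`, `y_toy`, and the choice of metrics are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real data (roughly 75% / 25%)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)

# A fixed random_state makes the folds identical for every model compared
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "balanced_accuracy", "f1"]

cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring=scoring
)
for metric in scoring:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```
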

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn import tree
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report, accuracy_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns


# determine categorical and numerical features
numerical_ix = ['age','education-num','capital-gain','capital-loss','hours-per-week']
categorical_ix = ['workclass','marital-status','occupation','relationship','race','sex','native-country']

# Define your preprocessing steps here
steps = [('cat', OneHotEncoder(handle_unknown='ignore'), categorical_ix), ('num', MinMaxScaler() , numerical_ix)]


# Apply your model to feature array X and labels y
def apply_model(model, ct, X, y, feature_reduction = False):
    pipeline_pca = Pipeline(steps=[('t', ct), ('pca', PCA(n_components=60)), ('m', model)])
    pipeline_nopca = Pipeline(steps=[('t', ct) , ('m', model)])

    return evaluate_model(X, y, pipeline_nopca, pipeline_pca, feature_reduction)

# Apply your validation techniques and calculate metrics
def evaluate_model(X, y, pipeline_nopca, pipeline_pca, feature_reduction=False):
    cv = StratifiedKFold(n_splits=5, shuffle=True)

    scores_nopca = cross_val_score(pipeline_nopca, X, y, cv=cv, scoring="accuracy")
    scores_pca = cross_val_score(pipeline_pca, X, y, cv=cv, scoring="accuracy")

    #print("Mean accuracy without PCA:", scores_nopca.mean())
    #print("Mean accuracy with PCA   :", scores_pca.mean())

    y_pred = cross_val_predict(pipeline_pca, X, y, cv=cv)

    print("\nClassification Report:")
    print(classification_report(y, y_pred))

    return scores_nopca.mean(), scores_pca.mean()

### DEPRECATED METHOD AS PCA IS USED INSTEAD OF FEATURE IMPORTANCE

# def show_feature_importance(pipeline, X, y, top_n = 12):
#     model = pipeline.named_steps["m"]
#     feature_names = pipeline.named_steps["t"].get_feature_names_out()

#     importance = None

#     if hasattr(model, "feature_importances_"):
#         importance = model.feature_importances_ * 100
#     elif hasattr(model, "coef_"):
#         importance = abs(model.coef_[0])
#     else:
#         print("Using permutation importance (slower)...")
#         r = permutation_importance(pipeline, X, y, n_repeats=10, random_state=42)
#         importance = r.importances_mean

#     df = pd.DataFrame({"feature": feature_names, "importance": importance})

#     df["base_feature"] = (
#         df["feature"]
#         .str.replace(r"^cat__|^num__", "", regex=True)   # remove prefixes
#         .str.split("_").str[0]                          # keep original feature name
#     )

#     agg_df = df.groupby("base_feature")["importance"].sum().sort_values(ascending=False)

#     print("\nTop Features (aggregated):")
#     print(agg_df.head(top_n))

#     red = agg_df.head(top_n).index.to_list()

#     # --- Plot aggregated importance ---
#     plt.figure(figsize=(10, 6))
#     sns.barplot(x=agg_df.head(top_n), y=agg_df.head(top_n).index, palette="viridis")
#     plt.title(f"Aggregated Feature Importance ({type(model).__name__})")
#     plt.xlabel("Importance")
#     plt.ylabel("Feature")
#     plt.show()

#     return red

# DEPRECATED METHOD AS PCA IS USED INSTEAD OF FEATURE SELECTION

# def compare_and_plot(models, ct, X, y):
#     cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#     results = {}

#     for name, model in models.items():
#         # Full features
#         pipeline_full = Pipeline(steps=[('t', ct), ('m', model)])
#         scores_full = cross_val_score(pipeline_full, X, y, cv=cv, scoring="accuracy")

#         # Reduced features
#         reduced_features = apply_model(model, ct, X, y, feature_reduction=True)
#         X_reduced = X[reduced_features]

#         cat_selected = [c for c in reduced_features if c in categorical_ix]
#         num_selected = [c for c in reduced_features if c in numerical_ix]

#         ct_reduced = ColumnTransformer([
#             ('cat', OneHotEncoder(handle_unknown='ignore'), cat_selected),
#             ('num', MinMaxScaler(), num_selected)
#         ])
#         pipeline_reduced = Pipeline(steps=[('t', ct_reduced), ('m', model)])
#         scores_reduced = cross_val_score(pipeline_reduced, X_reduced, y, cv=cv, scoring="accuracy")

#         # Store mean difference
#         results[name] = scores_reduced.mean() - scores_full.mean()

#     # Plot differences
#     plt.figure(figsize=(8, 5))
#     plt.barh(list(results.keys()), list(results.values()), color="skyblue")
#     plt.axvline(0, color="red", linestyle="--")
#     plt.xlabel("Accuracy Difference (Reduced - Full)")
#     plt.title("Effect of Feature Reduction on Model Accuracy")
#     plt.show()

#     return results

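# sparse_threshold=0 forces dense output; PCA in older scikit-learn versions rejects sparse input
ct = ColumnTransformer(steps, sparse_threshold=0)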

models = {
    "LogReg": LogisticRegression(max_iter=10000),
    "SGD": SGDClassifier(loss="hinge", penalty="l2", max_iter=10000),
    "DecisionTree": tree.DecisionTreeClassifier(),
    "LinearSVC": LinearSVC()
}

results_nopca = []
results_pca = []

# Features and labels do not change between models, so build them once
y = data['salary']
X = data.drop('salary', axis=1)

for model in models.values():
    npca, ypca = apply_model(model, ct, X, y)
    results_nopca.append(npca)
    results_pca.append(ypca)

x = np.arange(len(models))
width = 0.35

plt.figure(figsize=(8,5))
plt.bar(x - width/2, results_nopca, width, label="No PCA", color="steelblue")
plt.bar(x + width/2, results_pca, width, label="With PCA", color="orange")
plt.xticks(x, list(models.keys()))
plt.ylabel("Mean CV Accuracy")
plt.title("Model Performance: With vs Without PCA")
plt.legend()
plt.show()




### Label perturbation
To evaluate the impact of faulty labels in a dataset, we will introduce some errors in the labels of our data.


#### Preparation
Start by creating a method that alters a dataset by randomly selecting a given percentage of rows and swapping their labels (0 -> 1 and 1 -> 0).


In [None]:
def pertubate(y: pd.Series, fraction: float) -> pd.Series:
    """Given a 0/1 label Series, return a copy where a random fraction of the labels has been flipped."""
    copy = y.copy()
    n = len(y)

    # Pick int(fraction * n) distinct positions to flip
    rng = np.random.default_rng()
    flip_idx = rng.choice(n, size=int(fraction*n), replace=False)

    # Positional indexing, because the index has gaps after dropna()
    copy.iloc[flip_idx] = 1 - copy.iloc[flip_idx]

    return copy
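
A quick sanity check can confirm that a flipping helper changes exactly the requested number of labels and leaves its input untouched. The sketch below is self-contained and uses `flip_fraction`, a hypothetical seeded variant of the helper above, on toy labels.

```python
import numpy as np
import pandas as pd

def flip_fraction(y: pd.Series, fraction: float, seed: int = 0) -> pd.Series:
    """Hypothetical seeded variant of the flipping helper, for reproducible checks."""
    rng = np.random.default_rng(seed)
    flipped = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    flipped.iloc[idx] = 1 - flipped.iloc[idx]
    return flipped

y_clean = pd.Series([0, 1] * 50)  # 100 toy labels
y_noisy = flip_fraction(y_clean, 0.2)

# int(0.2 * 100) = 20 distinct positions are flipped, so 20 labels should differ
print((y_clean != y_noisy).sum())
```
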

#### Analysis
Create a number of new datasets with perturbed labels, for fractions ranging from `0` to `0.5` in increments of `0.1`.

Perform the same experiment you did before, comparing the performance of the different models, but this time on the new datasets. Repeat your experiment at least 5 times for each model and perturbation level, and calculate the mean and variance of the scores. Visualize the change in score across perturbation levels for all of the models in a single plot.

State your observations. Is there a change in the performance of the models? Are there some classifiers which are impacted more/less than other classifiers and why is this the case?
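
One possible way to organise the repetitions is sketched below on synthetic data. It assumes that a fresh random perturbation is drawn in every repeat and that the model is trained on noisy labels but scored against clean test labels; both are design choices rather than requirements of the assignment. `noise_experiment` and its parameters are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def noise_experiment(model, X, y, fractions, n_repeats=5, seed=0):
    """Return {fraction: (mean, variance)} of clean-test accuracy over repeats."""
    rng = np.random.default_rng(seed)
    summary = {}
    for fraction in fractions:
        scores = []
        for _ in range(n_repeats):
            split_seed = int(rng.integers(1_000_000))
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.25, stratify=y, random_state=split_seed
            )
            # Flip a fresh random subset of the training labels only
            y_noisy = y_tr.copy()
            idx = rng.choice(len(y_tr), size=int(fraction * len(y_tr)), replace=False)
            y_noisy[idx] = 1 - y_noisy[idx]
            scores.append(model.fit(X_tr, y_noisy).score(X_te, y_te))
        summary[fraction] = (np.mean(scores), np.var(scores))
    return summary

X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)
fractions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
summary = noise_experiment(LogisticRegression(max_iter=1000), X_toy, y_toy, fractions)
for fraction, (mean, var) in summary.items():
    print(f"fraction={fraction:.1f}  mean={mean:.3f}  variance={var:.5f}")
```

Scoring against clean labels isolates the effect of noisy training data; scoring against perturbed labels instead would also measure the noise in the test set itself.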

In [None]:
# Reload the raw data and repeat the preprocessing, this time with numeric 0/1 labels
data = pd.read_csv("adult.csv", na_values='?')
data = data.dropna()

data = data.drop(columns=['education', 'fnlwgt'])
string_columns = ['workclass','marital-status','occupation','relationship','race','sex','native-country']
for col in string_columns:
    data[col] = pd.Categorical(data[col])

data['salary'] = data['salary'].str.strip().str.replace(r"\.$", "", regex=True)
# map() yields integer labels directly, so that 1 - y flips them in pertubate()
data['salary'] = data['salary'].map({"<=50K": 0, ">50K": 1}).astype(int)

og_data = data

In [None]:
salary = og_data['salary']

data_00 = og_data.copy()

data_01 = og_data.copy()
data_01['salary'] = pertubate(salary, 0.1)

data_02 = og_data.copy()
data_02['salary'] = pertubate(salary, 0.2)

data_03 = og_data.copy()
data_03['salary'] = pertubate(salary, 0.3)

data_04 = og_data.copy()
data_04['salary'] = pertubate(salary, 0.4)

data_05 = og_data.copy()
data_05['salary'] = pertubate(salary, 0.5)

# 4 different models
lr = LogisticRegression(max_iter=10000)  # same iteration budget as in the first experiment
sgd = SGDClassifier(loss="hinge", penalty="l2", max_iter=10000)
dt = tree.DecisionTreeClassifier()
svc = LinearSVC()

data = [("data_00", data_00), ("data_01", data_01), ("data_02", data_02), ("data_03" ,data_03), ("data_04", data_04), ("data_05", data_05)]
models = [lr, sgd, dt, svc]

results = {
    f"data_{i:02d}": {
        model: {"mean": None, "variance": None}
        for model in models
    }
    for i, _ in enumerate(data)
}

numerical_ix = ['age' ,'education-num','capital-gain','capital-loss','hours-per-week']
categorical_ix = ['workclass','marital-status','occupation','relationship','race','sex','native-country']

steps = [('cat', OneHotEncoder(handle_unknown='ignore'), categorical_ix), ('num', MinMaxScaler() , numerical_ix)]

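# sparse_threshold=0 forces dense output; PCA in older scikit-learn versions rejects sparse input
ct = ColumnTransformer(steps, sparse_threshold=0)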

for m in models:
    for name, df in data:
        scores = []

        for r in range(0,5):
            y = df['salary']
            X = df.drop('salary', axis=1)

            _, score = apply_model(m, ct, X, y)
            scores.append(score)

        mean = np.mean(scores)
        variance = np.var(scores)

        results[name][m]["mean"] = mean
        results[name][m]["variance"] = variance



In [None]:
data_names = ["data_00", "data_01", "data_02", "data_03", "data_04", "data_05"]
x = range(len(data_names))  # 0..5

plt.figure(figsize=(8,5))

for model in models:
    means = [results[dname][model]["mean"] for dname in data_names]
    variances = [results[dname][model]["variance"] for dname in data_names]
    std_devs = np.sqrt(variances)

    # Use the class name as legend label instead of the full estimator repr
    plt.plot(x, means, marker='o', label=type(model).__name__)
    plt.fill_between(x,
                     np.array(means) - std_devs,
                     np.array(means) + std_devs,
                     alpha=0.2)

plt.xticks(x, data_names)
plt.xlabel("Dataset")
plt.ylabel("Mean Score")
plt.title("Model Performance Across Datasets")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
records = []
for dataset, models_dict in results.items():
    for model_name, stats in models_dict.items():
        records.append({
            "Dataset": dataset,
            "Model": model_name,
            "Mean": stats["mean"],
            "Variance": stats["variance"]
        })

df = pd.DataFrame(records)

print(df)

Observations + explanations: max. 400 words

#### Discussion

1)  Discuss how you could reduce the impact of wrongly labeled data or correct wrong labels. <br />
    max. 400 words
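
As one illustration of the kind of technique that could be discussed here, the sketch below flags samples whose current label receives a low out-of-fold predicted probability. It is a simple heuristic on synthetic data, not a complete method; the 0.2 threshold and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data with 10% of the labels flipped on purpose
X_toy, y_clean = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
flipped_idx = rng.choice(len(y_clean), size=50, replace=False)
y_noisy = y_clean.copy()
y_noisy[flipped_idx] = 1 - y_noisy[flipped_idx]

# Out-of-fold probabilities: each sample is scored by a model that never saw it
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_toy, y_noisy, cv=5, method="predict_proba"
)
# Probability the model assigns to the label each sample currently carries
given_label_proba = proba[np.arange(len(y_noisy)), y_noisy]

# The 0.2 threshold is an arbitrary assumption and would need tuning
suspects = np.where(given_label_proba < 0.2)[0]
hits = np.intersect1d(suspects, flipped_idx)
print(f"flagged {len(suspects)} samples, {len(hits)} of them were really flipped")
```

Flagged samples could then be relabeled by hand, dropped, or down-weighted; which option is appropriate is part of the discussion.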



    Authors: Youri Arkesteijn, Tim van der Horst and Kevin Chong.


## Machine Learning Workflow

In Part 1, you went through the entire machine learning workflow, which consists of the following steps:

1) Data Loading
2) Data Pre-processing
3) Machine Learning Model Training
4) Machine Learning Model Testing

These tasks are strictly sequential and need to be done in order, since a small perturbation in any one step may have a detrimental knock-on effect on the steps that come after it.

In the final section of Part 1, you experienced this first-hand: perturbing the labels used for model training changed the results observed during model testing.

## Part 2 Data Discovery

You will be given a set of datasets, and you are tasked with performing data discovery on them.

<b>The datasets are provided in the group lockers on Brightspace. Let me know if you are having trouble accessing the datasets.</b>

The goal is to find datasets that are related to each other and to identify the relationships between them.

The relationships that we are primarily working with are Join and Union relationships.

Please implement two methods for finding those pesky Join and Union relationships.

Try to do this with the datasets as is, without any preprocessing.
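
As a starting point, the sketch below shows one simple heuristic for each relationship type: value containment between columns for Join candidates, and Jaccard similarity of column names for Union candidates. The function names, the thresholds, the toy tables, and the choice to compare values as strings are all assumptions made for illustration; other similarity measures may work better on the provided datasets.

```python
import pandas as pd

def join_candidates(tables, min_containment=0.8):
    """Column pairs where most distinct values of the smaller column occur in the other."""
    found = []
    names = list(tables)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            for lcol in tables[left].columns:
                lvals = set(tables[left][lcol].dropna().astype(str))
                for rcol in tables[right].columns:
                    rvals = set(tables[right][rcol].dropna().astype(str))
                    smaller = min(len(lvals), len(rvals))
                    if smaller == 0:
                        continue  # an empty column cannot support a join
                    containment = len(lvals & rvals) / smaller
                    if containment >= min_containment:
                        found.append((left, lcol, right, rcol, containment))
    return found

def union_candidates(tables, min_overlap=0.8):
    """Table pairs whose column-name sets overlap strongly (Jaccard similarity)."""
    found = []
    names = list(tables)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            lcols, rcols = set(tables[left].columns), set(tables[right].columns)
            jaccard = len(lcols & rcols) / len(lcols | rcols)
            if jaccard >= min_overlap:
                found.append((left, right, jaccard))
    return found

# Hypothetical toy tables, keyed by dataset name
tables = {
    "customers": pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Delft", "Leiden", "Delft"]}),
    "orders": pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 3]}),
    "customers_2024": pd.DataFrame({"customer_id": [4, 5], "city": ["Gouda", "Delft"]}),
}
print(join_candidates(tables))
print(union_candidates(tables))
```

On these toy tables, `customers` and `orders` share a `customer_id` column with full containment, while `customers` and `customers_2024` have identical column names. Exact matching like this is sensitive to inconsistent spelling or formatting, which is relevant to the cleaning step that follows.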



In [None]:
def discovery_algorithm():
    """Function should be able to perform data discovery to find related datasets
    Possible Input: List of datasets
    Output: List of pairs of related datasets
    """

    pass

You will have noticed that the data has some issues.
Those issues may have been troublesome to deal with.

Please try to do some cleaning on the data.

After cleaning, check whether the results of the data discovery have changed.

Please explain this in your report, and try to match each error with the observation it causes.
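
A minimal cleaning pass might look like the sketch below. The list of missing-value markers, the decision to lower-case string values, and the name `clean_table` are assumptions; the actual issues in the provided datasets may call for different repairs.

```python
import pandas as pd

MISSING_TOKENS = {"", "?", "n/a", "na", "null", "none"}  # assumed markers

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass: tidy headers, trim strings, unify missing markers."""
    cleaned = df.copy()
    # Normalise column names so " Customer ID " and "customer_id" can match
    cleaned.columns = (
        cleaned.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    for col in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            continue
        # Trim and lower-case string entries, leave other values unchanged
        values = cleaned[col].map(lambda v: v.strip().lower() if isinstance(v, str) else v)
        # Replace the assumed missing-value markers with NaN
        cleaned[col] = values.mask(values.isin(MISSING_TOKENS))
    # Rows that became identical after normalisation are dropped
    return cleaned.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({
    " Customer ID ": [1, 2, 2, 3],
    "City": [" Delft", "leiden ", "Leiden", "?"],
})
print(clean_table(raw))
```

Running the discovery methods before and after such a pass makes it possible to attribute each change in the results to a specific repair.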

In [None]:
## Cleaning data, scrubbing, washing, mopping

def cleaningData(data):
    """Function should be able to clean the data
    Possible Input: List of datasets
    Output: List of cleaned datasets
    """

    pass

## Discussions

1)  Different aspects of the data can affect the data discovery process. Write a short report on your findings, such as which data quality issues had the largest effect on data discovery, which data quality problems were repairable, and how you chose to do the repair.

<!-- For the set of considerations that you have outlined for the choice of data discovery methods, choose one and identify under this new constraint, how would you identify and resolve this problem? -->

Max 400 words