# Introduction
A very important aspect of supervised and semi-supervised machine learning is the quality of the labels produced by human labelers. Unfortunately, humans are not perfect and in some cases may even maliciously label things incorrectly. In this assignment, you will evaluate the impact of incorrect labels on a number of different classifiers.

We have provided a number of code snippets you can use during this assignment. Feel free to modify them or replace them.


## Dataset
The dataset you will be using is the [Adult Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult). This dataset was created by Ronny Kohavi and Barry Becker and was used to predict whether a person's income is more/less than 50k USD based on census data.

### Data preprocessing
Start by loading and preprocessing the data. Remove NaN values, convert strings to categorical variables, and encode the target variable (the strings `<=50K` and `>50K` in column index 14).

In [1]:
import json
from collections import defaultdict

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.metrics import jaccard_score
from sklearn.preprocessing import OneHotEncoder

In [2]:
# This can be used to load the dataset
data = pd.read_csv("adult.csv", header=0, na_values='?')
data = data.dropna()

data = data.convert_dtypes()

numericals = data.select_dtypes(include=[np.number]).columns
categoricals = data.select_dtypes(exclude=[np.number]).columns

data[categoricals] = data[categoricals].astype('category')

encoder = ColumnTransformer(transformers=[('cat', OneHotEncoder(), categoricals)], remainder='passthrough')

#dt = encoder.fit_transform(data)

#for c in data.columns:
#    if data[c].dtype == 'object':
#        data[c] = pd.Categorical(data[c])

#print(dt)
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
print(data.dtypes)

age                  Int64
workclass         category
fnlwgt               Int64
education         category
education-num        Int64
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         Int64
capital-loss         Int64
hours-per-week       Int64
native-country    category
salary            category
dtype: object
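
The loading cell above leaves `salary` as a category rather than a numeric label. A minimal sketch of encoding it as 0/1 is shown below, assuming the two labels are exactly `<=50K` and `>50K`. The three-row frame `toy` is made up for illustration and stands in for the loaded `data`.

```python
import pandas as pd

# Hypothetical miniature of the loaded frame; only the target column matters here.
toy = pd.DataFrame({"age": [39, 50, 28], "salary": ["<=50K", ">50K", "<=50K"]})

# Strip stray whitespace, then map the positive class (>50K) to 1 and everything else to 0.
y = (toy["salary"].astype(str).str.strip() == ">50K").astype(int)

# Keep the remaining columns as the feature matrix.
X = toy.drop(columns=["salary"])

print(y.tolist())
```

Encoding the target separately from the features keeps it out of the one-hot encoder, which would otherwise leak the label into `X`.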


### Data classification
Choose at least 4 different classifiers and evaluate their performance in predicting the target variable. 

#### Preprocessing
Think about how you are going to encode the categorical variables, normalization, whether you want to use all of the features, feature dimensionality reduction, etc. Justify your choices.

A good method to apply preprocessing steps is using a Pipeline. Read more about this [here](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/) and [here](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf). 
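
As a rough illustration of the Pipeline idea, the sketch below scales one numeric column and one-hot encodes one categorical column before fitting a classifier. The data is randomly generated, the column names are borrowed from the dataset only for readability, and `LogisticRegression` is just one possible model choice, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one numeric and one categorical feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=40),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=40),
})
toy_y = (toy["age"] > 40).astype(int)  # synthetic target, for illustration only

# Scale numeric columns and one-hot encode categorical columns.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])

# The pipeline fits the preprocessing and the model together.
pipe = Pipeline(steps=[("t", preprocess), ("m", LogisticRegression(max_iter=1000))])
pipe.fit(toy, toy_y)
train_acc = pipe.score(toy, toy_y)
```

Because the preprocessing lives inside the pipeline, it is refitted on the training part of every validation split, which avoids leaking statistics from held-out rows.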

<!-- #### Data visualization
Calculate the correlation between different features, including the target variable. Visualize the correlations in a heatmap. A good example of how to do this can be found [here](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec). 

Select a feature you think will be an important predictor of the target variable and one you think will not be. Explain your answers. -->

#### Evaluation
Use a validation technique from the previous lecture to evaluate the performance of the model. Explain and justify which metrics you used to compare the different models. 
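
When choosing metrics, keep in mind that accuracy alone can look good even when a model never predicts the minority class. The sketch below, in plain Python with made-up predictions, computes accuracy, precision, recall and F1 from the confusion counts. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class.
always_negative = binary_metrics(y_true, [0] * 10)

# A model that finds one of the two positives at the cost of one false alarm.
mixed = binary_metrics(y_true, [0] * 7 + [1] + [1, 0])
```

Both made-up models reach the same accuracy, but only the second one has non-zero recall and F1, which is why a single metric is rarely enough to compare classifiers.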

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define your preprocessing steps here
steps = []

# Combine steps into a ColumnTransformer
ct = ColumnTransformer(steps)

# show the correlation between different features including target variable
def visualize(data, ct):
    pass

# Apply your model to feature array X and labels y
def apply_model(model, X, y):    
    # Wrap the model and steps into a Pipeline
    pipeline = Pipeline(steps=[('t', ct), ('m', model)])
    
    # Evaluate the model and store results
    return evaluate_model(X, y, pipeline)

# Apply your validation techniques and calculate metrics
def evaluate_model(X, y, pipeline):
    pass

### Label perturbation
To evaluate the impact of faulty labels in a dataset, we will introduce some errors in the labels of our data.


#### Preparation
Start by creating a method that alters a dataset by randomly selecting a given percentage of rows and flipping their labels (0 -> 1 and 1 -> 0).


In [5]:
"""Given a label vector, create a new copy where a random fraction of the labels have been flipped."""
def pertubate(y: np.ndarray, fraction: float) -> np.ndarray:
    # Copy the label vector itself (not the global `data` frame) so the input stays untouched
    copy = y.copy()
    # Flip fraction*len(y) of the labels in copy
    return copy
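
One possible way to fill in a stub like the one above is sketched here, assuming the labels are already encoded as 0/1. The function name `flip_labels` and its `seed` parameter are hypothetical additions for reproducibility.

```python
import numpy as np

def flip_labels(y, fraction, seed=None):
    """Return a copy of the 0/1 label vector with a random fraction of entries flipped."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    flipped = y.copy()
    n_flip = int(round(fraction * len(y)))
    # Sample distinct positions so exactly n_flip labels change.
    idx = rng.choice(len(y), size=n_flip, replace=False)
    flipped[idx] = 1 - flipped[idx]
    return flipped

labels = np.array([0, 1] * 50)
noisy = flip_labels(labels, 0.2, seed=0)
n_changed = int((noisy != labels).sum())
```

Sampling without replacement matters here: with replacement, a position could be drawn twice and the realized noise level would fall below the requested fraction.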

#### Analysis
Create a number of new datasets with perturbed labels, for fractions ranging from `0` to `0.5` in increments of `0.1`.

Perform the same experiment as before, comparing the performance of the different models, but now on the perturbed datasets. Repeat your experiment at least 5 times for each model and perturbation level and calculate the mean and variance of the scores. Visualize the change in score across perturbation levels for all of the models in a single plot.
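
The repetition and aggregation described above could be organised roughly as follows. This is a sketch only: `repeat_experiment` and `dummy_run` are hypothetical names, and `dummy_run` returns random numbers rather than real model scores, so it must be replaced by an actual train-and-evaluate call.

```python
import random
from statistics import mean, variance

def repeat_experiment(run_once, fractions, n_repeats=5):
    """Call run_once(fraction, seed) n_repeats times per fraction; return {fraction: (mean, variance)}."""
    summary = {}
    for fraction in fractions:
        scores = [run_once(fraction, seed) for seed in range(n_repeats)]
        summary[fraction] = (mean(scores), variance(scores))
    return summary

def dummy_run(fraction, seed):
    """Stand-in scorer that returns noise only; it is not a real result."""
    return random.Random(seed).random()

# Perturbation levels 0.0, 0.1, ..., 0.5
fractions = [round(0.1 * i, 1) for i in range(6)]
summary = repeat_experiment(dummy_run, fractions)
```

Passing a different seed to each repetition makes every run reproducible while still varying the flipped rows and the data splits between runs.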

State your observations. Is there a change in the performance of the models? Are there some classifiers which are impacted more/less than other classifiers and why is this the case?

In [6]:
# Code

Observations + explanations: max. 400 words

#### Discussion

1)  Discuss how you could reduce the impact of wrongly labeled data or correct wrong labels. <br />
    max. 400 words



    Authors: Youri Arkesteijn, Tim van der Horst and Kevin Chong.


## Machine Learning Workflow

From Part 1, you will have gone through the entire machine learning workflow, which consists of the following steps:

1) Data Loading
2) Data Pre-processing
3) Machine Learning Model Training
4) Machine Learning Model Testing

These tasks are sequential and must be carried out in order, because a small perturbation in one step may have a detrimental knock-on effect on the steps that follow.

In the final part of Part 1, you experienced how perturbing the labels used in the model training step affects the results of the model testing step.

## Part 2 Data Discovery

You will be given a set of datasets and are tasked with performing data discovery on them.

<b>The datasets are provided in the group lockers on Brightspace. Let me know if you are having trouble accessing the datasets.</b>

The goal is to find datasets that are related to each other and to identify the relationships between them.

The relationships that we are primarily working with are Join and Union relationships.

Please implement two methods for finding these Join and Union relationships.

Try to do this with the datasets as they are, without any preprocessing.
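
One common way to make the two relationships concrete is sketched below. It is offered as an assumption rather than the assignment's required definition: two columns are treated as join candidates when most values of one appear in the other, and two tables are treated as union candidates when their column names largely overlap. All names and the toy tables are hypothetical.

```python
def value_containment(col_a, col_b):
    """Fraction of distinct values in col_a that also occur in col_b (join signal)."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a) if a else 0.0

def schema_jaccard(table_a, table_b):
    """Jaccard similarity of the column-name sets of two tables (union signal)."""
    a, b = set(table_a), set(table_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy tables stored as {column name: list of values}.
orders = {"customer_id": [1, 2, 2, 3], "amount": [10, 20, 5, 7]}
customers = {"customer_id": [1, 2, 3, 4], "name": ["Ann", "Bo", "Cy", "Di"]}
orders_2024 = {"customer_id": [5, 6], "amount": [1, 2]}

join_score = value_containment(orders["customer_id"], customers["customer_id"])
union_score = schema_jaccard(orders, orders_2024)
unrelated_schema_score = schema_jaccard(orders, customers)
```

Containment is directional on purpose: a foreign-key column is usually much smaller than the key column it refers to, so a symmetric measure would understate the relationship.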



In [27]:
#import kshingle as ks
import os
import pandas as pd
from sklearn.metrics import jaccard_score
import numpy as np
from datasketch import MinHash, MinHashLSH
from collections import defaultdict
from typing import Dict, List, Set, Tuple
import re
import json
import pickle

Discovery algorithm:
1. I read each table with read_csv.
2. I flatten each table and convert it to a single string.
3. I split each string into character shingles with k=8.
4. I calculate the Jaccard similarity between the shingle sets of every pair of tables.
5. I return the relatedness scores for all pairs of tables.
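
The steps above can be sketched on tiny inputs as follows. The helper names are hypothetical, and k=3 is used instead of 8 only so that the shingle sets are small enough to check by hand.

```python
def flatten(rows):
    """Join all cell values of a table into one space-separated string."""
    return " ".join(str(value) for row in rows for value in row)

def shingles(text, k):
    """Return the set of all character k-grams of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

flat = flatten([[1, "Alice"], [2, "Bob"]])

# "abcdef" -> {abc, bcd, cde, def}; "bcdefg" -> {bcd, cde, def, efg}
similarity = jaccard(shingles("abcdef", 3), shingles("bcdefg", 3))
```

The two example strings share 3 of 5 distinct shingles, so `similarity` is 0.6.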

Comments:
1. The files are small enough to calculate the exact Jaccard similarity directly, without relying on MinHash.
2. Tables 10 and 11 are the same.
3. For k=8, the most similar tables are table_13.csv <> table_7.csv | plain_8=0.1207 | plain_containment_8 = 0.2582 | minhash_8=0.1328
4. The following pairs also have significant Jaccard containment scores:
   table_6.csv <> table_5.csv | plain_8=0.0958 | plain_containment_8 = 0.2043 | minhash_8=0.0703
   table_6.csv <> table_2.csv | plain_8=0.0858 | plain_containment_8 = 0.2544 | minhash_8=0.0859
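
The gap between the plain Jaccard and containment scores above is expected when one table is much smaller than the other. A minimal sketch with made-up strings follows. The containment here divides by the smaller set, which matches taking the maximum of the two directional containments.

```python
def shingles(text, k):
    """Return the set of all character k-grams of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(set_a, set_b):
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def containment(set_a, set_b):
    """Intersection size relative to the smaller of the two sets."""
    smaller = min(len(set_a), len(set_b))
    return len(set_a & set_b) / smaller if smaller else 0.0

small = shingles("abcd", 2)        # 3 shingles
large = shingles("abcdefghij", 2)  # 9 shingles, including all of `small`

j = jaccard(small, large)
c = containment(small, large)
```

Here the small set is fully contained in the large one, so containment is 1.0 while Jaccard is only 1/3, because the union is dominated by the larger set.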

In [8]:
def kshingle_manual(s, k=1):
   for i in range(len(s) - k + 1):
       yield s[i:i + k]


In [16]:
def discovery_algorithm_parse_tables(base_dir):
    datasets = {}
    ds = {}
    if not os.path.isdir(base_dir):
        return datasets  # return empty if folder not found
    for fname in os.listdir(base_dir):
        if fname.lower().endswith(".csv"):
            fpath = os.path.join(base_dir, fname)
            try:
                df = pd.read_csv(fpath, sep=None, engine='python', on_bad_lines='skip')
            except Exception:
                continue
            df = df.dropna().astype(object)
            datasets[fname] = df.to_numpy().flatten()
            string = " ".join(str(x) for x in datasets[fname]).replace("\n", " ")
            # merge consecutive whitespaces (book advice)
            string = re.sub(' +', ' ', string)
            ds[fname] = string
    return ds


In [15]:
def get_shingles(text, k=8):
    return set(kshingle_manual(text, k))

def discovery_algorithm():
    """Function should be able to perform data discovery to find related datasets
    Possible Input: List of datasets
    Output: List of pairs of related datasets
    """

    base_dir = "./lake33"
    ds = discovery_algorithm_parse_tables(base_dir)


    def jaccard_set(set1, set2):
        intersection = len(set1.intersection(set2))
        union = len(set1.union(set2))
        if union == 0:
            return 0.0
        return intersection / union

    def jaccard_containment(set1, set2):
        intersection = len(set1.intersection(set2))
        def jacc_c1(set1, set2):
            if len(set1) == 0:
                return 0.0
            return intersection / len(set1)
        def jacc_c2(set1, set2):
            if len(set2) == 0:
                return 0.0
            return intersection / len(set2)
        return max(jacc_c1(set1, set2), jacc_c2(set1, set2))


    # Cache shingles and MinHash signatures per file and per k to avoid recomputing
    # log20 suggests 7.92 (chapter 3 rule of thumb), so only k=8 is used
    ks_range = range(8, 9)
    shingles_k = {fname: {} for fname in ds}
    minhash_k = {fname: {} for fname in ds}

    for fname, text in ds.items():
        for k in ks_range:
            kshingle = set(kshingle_manual(text, k))
            shingles_k[fname][k] = kshingle
            mh = MinHash(num_perm=128)
            for sh in kshingle:
                mh.update(sh.encode("utf8"))
            minhash_k[fname][k] = mh

    # Print plain Jaccard, Jaccard containment and the MinHash estimate side by side for each k
    filenames = list(ds.keys())
    for i in range(len(filenames)):
        for j in range(i + 1, len(filenames)):
            f1 = filenames[i]
            f2 = filenames[j]
            parts = [f"{f1} <> {f2}"]
            for k in ks_range:
                s1 = shingles_k[f1][k]
                s2 = shingles_k[f2][k]
                plain_j = jaccard_set(s1, s2)
                est_j = minhash_k[f1][k].jaccard(minhash_k[f2][k])
                parts.append(f"{plain_j:.4f}")  # exact Jaccard
                parts.append(f"{jaccard_containment(s1, s2):.4f}")  # containment
                parts.append(f"{est_j:.4f}")  # MinHash estimate
            print(" | ".join(parts))


discovery_algorithm()
print()


table_0.csv <> table_1.csv | 0.0364 | 0.3631 | 0.0469
table_0.csv <> table_10.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_11.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_13.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_14.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_15.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_16.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_17.csv | 0.0002 | 0.0255 | 0.0000
table_0.csv <> table_2.csv | 0.0363 | 0.3631 | 0.0469
table_0.csv <> table_3.csv | 0.0005 | 0.0637 | 0.0000
table_0.csv <> table_4.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_5.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_6.csv | 0.0000 | 0.0127 | 0.0000
table_0.csv <> table_7.csv | 0.0000 | 0.0000 | 0.0000
table_0.csv <> table_8.csv | 0.0000 | 0.0000 | 0.0000
table_1.csv <> table_10.csv | 0.0000 | 0.0000 | 0.0000
table_1.csv <> table_11.csv | 0.0000 | 0.0000 | 0.0000
table_1.csv <> table_13.csv | 0.0000 | 0.0000 | 0.0000
table_1.csv <> tab
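
The MinHash column is an estimate, so it will not match the exact Jaccard column digit for digit. As a rough guide, and assuming the hash functions behave like independent random permutations, the estimate is an average of `num_perm` yes/no matches, which gives a standard error of about sqrt(J(1-J)/num_perm). The sketch below applies this to the table_13/table_7 pair from the comments (exact 0.1207, estimate 0.1328). The function name is hypothetical.

```python
import math

def minhash_std_error(jaccard, num_perm):
    """Approximate standard error of a MinHash Jaccard estimate."""
    return math.sqrt(jaccard * (1.0 - jaccard) / num_perm)

exact, estimate = 0.1207, 0.1328  # values reported for table_13.csv <> table_7.csv

se_128 = minhash_std_error(exact, 128)
se_8 = minhash_std_error(exact, 8)
gap = abs(estimate - exact)

# With 128 permutations the estimate can only be a multiple of 1/128.
matching_slots = round(estimate * 128)
```

The observed gap is smaller than one standard error at `num_perm=128`, so the two columns are consistent with each other. Reducing `num_perm` widens the error considerably.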

Book way

In [20]:

def read_files(dir):
    datasets = {}
    data_strs = {}
    for root, dirs, files in os.walk(dir):
        for file in files:
            if file.endswith(".csv"):
                fpath = os.path.join(root, file)
                try:
                    df = pd.read_csv(fpath, sep=None, engine='python', on_bad_lines='skip')
                except Exception:
                    continue
                df = df.dropna().astype(object)
                datasets[file] = (df.to_numpy().flatten())
                string = " ".join(str(x) for x in datasets[file]).replace("\n", " ")
                # merge consecutive whitespaces (book advice)
                string = re.sub(' +', ' ', string)
                data_strs[file] = string
    return data_strs



In [36]:

def step_1_kshingles(datasets: Dict, k):
    shingle_sets = defaultdict(set)
    for fname, string in datasets.items():
        for shingle in kshingle_manual(string, k):
            shingle_sets[fname].add(shingle)
    return shingle_sets

datasets = read_files("./lake33")
cache_file = "shingle_sets.pkl"

def load_shinglesets(cache_file):
    # Load the pickled shingle sets from cache_file if it exists; otherwise compute and cache them
    shingle_sets = {}
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            shingle_sets = pickle.load(f)
    else:
        shingle_sets = step_1_kshingles(datasets, 8)
        with open(cache_file, "wb") as f:
            try:
                pickle.dump(shingle_sets, f)
            except Exception as e:
                print(e)
                if os.path.exists(cache_file):
                    os.remove(cache_file)
    return shingle_sets


def opt_hash(shingle_set: set):
    minhash = MinHash(num_perm=128)
    for shingle in shingle_set:
        try:
            minhash.update(shingle.encode("utf8"))
        except Exception:
            minhash.update(str(shingle).encode("utf8"))
    return minhash

curr_shinglesets = load_shinglesets(cache_file)
opt_hashes = {k: opt_hash(v) for k, v in curr_shinglesets.items()}
print(opt_hashes)




{'table_0.csv': <datasketch.minhash.MinHash object at 0x00000279A645ABA0>, 'table_1.csv': <datasketch.minhash.MinHash object at 0x00000279A645AB30>, 'table_10.csv': <datasketch.minhash.MinHash object at 0x00000279A645AAC0>, 'table_11.csv': <datasketch.minhash.MinHash object at 0x00000279A645AA50>, 'table_13.csv': <datasketch.minhash.MinHash object at 0x00000279A645A9E0>, 'table_14.csv': <datasketch.minhash.MinHash object at 0x00000279A645A970>, 'table_15.csv': <datasketch.minhash.MinHash object at 0x00000279A645A900>, 'table_16.csv': <datasketch.minhash.MinHash object at 0x00000279A645A890>, 'table_17.csv': <datasketch.minhash.MinHash object at 0x00000279A645A820>, 'table_2.csv': <datasketch.minhash.MinHash object at 0x00000279A645A7B0>, 'table_3.csv': <datasketch.minhash.MinHash object at 0x00000279A645A740>, 'table_5.csv': <datasketch.minhash.MinHash object at 0x00000279A645A6D0>, 'table_6.csv': <datasketch.minhash.MinHash object at 0x00000279A645A660>, 'table_7.csv': <datasketch.min

In [44]:
from datasketch import MinHash, MinHashLSH

def build_minhash_from_shingles(shingle_set, num_perm=128):
    mh = MinHash(num_perm=num_perm)
    for sh in shingle_set:
        mh.update(sh.encode("utf-8"))
    return mh

# Shingle sets are reused from curr_shinglesets, built in the previous cell

# Build all MinHashes once with the same num_perm
num_perm = 8
mh_by_file = {fname: build_minhash_from_shingles(s, num_perm) for fname, s in curr_shinglesets.items()}

# Estimate Jaccard between any two files
def est_jaccard(f1, f2):
    return mh_by_file[f1].jaccard(mh_by_file[f2])

# Optional: LSH for fast candidate search (not for exact jaccard)
lsh = MinHashLSH(threshold=0.5, num_perm=num_perm)
for fname, mh in mh_by_file.items():
    lsh.insert(fname, mh)

# Query the LSH index for candidates similar to each file
for fname in mh_by_file.keys():
    candidates = [c for c in lsh.query(mh_by_file[fname]) if c != fname]
    print(f"Candidates for {fname}: {candidates}")

# Compute true set Jaccard if needed (slower but exact)
def true_jaccard(f1, f2):
    s1, s2 = curr_shinglesets[f1], curr_shinglesets[f2]
    return len(s1 & s2) / len(s1 | s2) if (s1 or s2) else 0.0

Candidates for table_0.csv: []
Candidates for table_1.csv: ['table_2.csv']
Candidates for table_10.csv: ['table_11.csv']
Candidates for table_11.csv: ['table_10.csv']
Candidates for table_13.csv: []
Candidates for table_14.csv: []
Candidates for table_15.csv: []
Candidates for table_16.csv: []
Candidates for table_17.csv: []
Candidates for table_2.csv: ['table_1.csv']
Candidates for table_3.csv: []
Candidates for table_5.csv: []
Candidates for table_6.csv: []
Candidates for table_7.csv: []


The same candidate pairs are returned even with num_perm=8.
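
A possible explanation comes from the standard banding analysis of MinHash LSH: with b bands of r rows each, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. Identical tables have identical signatures, so they collide in every band no matter how few permutations are used. In the sketch below, b=4 and r=2 are illustrative assumptions for 8 permutations. The library picks its own band layout from the threshold and `num_perm`.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with Jaccard similarity s shares at least one LSH band."""
    return 1.0 - (1.0 - s ** r) ** b

# Hypothetical band layout for 8 permutations: 4 bands of 2 rows.
b, r = 4, 2

p_identical = candidate_probability(1.0, b, r)  # e.g. two identical tables
p_half = candidate_probability(0.5, b, r)
p_low = candidate_probability(0.1, b, r)
```

Under these assumptions identical tables are always returned, a pair at similarity 0.5 is returned about two times out of three, and a pair at similarity 0.1 only rarely, which is in line with most tables having no candidates.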

You will have noticed that the data has some quality issues, and those issues may have been troublesome to deal with.

Please try to clean the data.

After cleaning, check whether the results of the data discovery have changed.

Please explain this in your report, and try to match each error with the observation it explains.

In [None]:
## Cleaning data, scrubbing, washing, mopping

def cleaningData(data):
    """Function should be able to clean the data
    Possible Input: List of datasets
    Output: List of cleaned datasets
    """

    pass
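
A cleaning step could start from cell-level normalisation, sketched below in plain Python. Which issues actually occur in the provided tables is not stated here, so the set of missing-value tokens and the choice to lowercase everything are assumptions, and all names are hypothetical.

```python
import re

# Assumed spellings of "missing"; adjust after inspecting the real tables.
MISSING_TOKENS = {"", "?", "-", "na", "n/a", "nan", "null", "none"}

def normalize_cell(value):
    """Collapse whitespace, lowercase, and map missing-value tokens to None."""
    text = re.sub(r"\s+", " ", str(value)).strip().lower()
    return None if text in MISSING_TOKENS else text

def clean_rows(rows):
    """Normalize every cell, then drop fully empty rows and exact duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        norm = tuple(normalize_cell(v) for v in row)
        if all(v is None for v in norm) or norm in seen:
            continue
        seen.add(norm)
        cleaned.append(list(norm))
    return cleaned

raw = [[" Alice ", "NL"], ["alice", "nl"], ["?", ""], ["Bob", "N/A"]]
result = clean_rows(raw)
```

Normalising case and whitespace before shingling means that two tables holding the same values in different spellings produce the same shingles, which should raise their similarity scores.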

## Discussions

1)  Different aspects of the data can affect the data discovery process. Write a short report on your findings, covering for example which data quality issues had the largest effect on data discovery, which data quality problems were repairable, and how you chose to repair them.

<!-- For the set of considerations that you have outlined for the choice of data discovery methods, choose one and identify under this new constraint, how would you identify and resolve this problem? -->

Max 400 words