# Active Learning
You should build an end-to-end machine learning pipeline using an active learning component. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Assume that the training set is not labeled. You have a labeling budget $ B $ to label exactly $ B $ sampled training data points. Compare the random sampling, uncertainty sampling, and clustering-based sampling strategies.
    - The random sampling strategy samples and labels $ B $ random data points from the training set. 
    - The uncertainty sampling strategy is an active learning strategy to sample and label $ B $ training data points. You might use [modAL](https://modal-python.readthedocs.io/en/latest/index.html).
    - The clustering-based sampling strategy clusters data points into $ B $ clusters using a clustering model, such as [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). It then samples and labels one data point per cluster.
- Train 3 separate classification models on these 3 sets of sampled and labeled training data points. 
- Calculate and visualize the test performance curve of these models as the labeling budget $ B $ increases, for each of these sampling strategies.
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
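
The uncertainty sampling rule from the list above can be illustrated without any library. The following is a minimal sketch, assuming the least-confidence score (one minus the highest predicted class probability). The function names `least_confidence` and `most_uncertain` are hypothetical and are not part of modAL or scikit-learn.

```python
def least_confidence(probabilities):
    """Return 1 - max class probability for each row of predicted probabilities."""
    return [1.0 - max(row) for row in probabilities]


def most_uncertain(probabilities, k=1):
    """Return the positions of the k rows with the highest least-confidence score."""
    scores = least_confidence(probabilities)
    # sorted() is stable, so ties keep their original order
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical predicted probabilities for three unlabeled points and two classes
example_probabilities = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
query_positions = most_uncertain(example_probabilities, k=1)
```

Here the second point would be queried first, because its predicted probabilities are the closest to a coin flip.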

Import libraries

In [13]:
import random 
import pandas as pd 
import numpy as np
import sklearn.metrics
import sklearn.cluster
import sklearn.ensemble
import sklearn.neighbors
import sklearn.model_selection
import plotly.graph_objects as go 


In [5]:
# Load data into a pandas dataframe
df = pd.read_csv('mnist.csv')
df=df.set_index('id')
df.head()

id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
31953,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34452,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60897,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36953,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1981,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
x=df.drop(['class'],axis=1)
y=df['class']
x_train,x_test,y_train,y_test = sklearn.model_selection.train_test_split(x,y)
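
The split above uses the defaults of `train_test_split`, which shuffle the rows and hold out 25% of them as the test set. Because no `random_state` is passed, the split (and every accuracy reported below) can change from run to run. The following is a minimal sketch of the idea using only the standard library. It is not scikit-learn's implementation, and `shuffle_split`, `test_fraction`, and `seed` are hypothetical names used for illustration.

```python
import math
import random


def shuffle_split(row_ids, test_fraction=0.25, seed=0):
    """Shuffle row IDs with a fixed seed and split them into train and test IDs."""
    rng = random.Random(seed)          # private generator, so the split is repeatable
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = shuffle_split(range(100))
```

Fixing the seed makes the comparison between sampling strategies repeatable, since all three are then evaluated on the same test rows in every run.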

Comparing sampling strategies

Preliminaries

In [7]:
labeling_budget=30

1-Random Sampling

In [18]:
number_of_labels_random_sampling=[]
accuracies_random_sampling=[]
sampled_indexes=[]
for i in range(labeling_budget):
  #sampling training data points
  unlabeled_x_train=x_train.drop(sampled_indexes)
  sampled_indexes+=random.sample(list(unlabeled_x_train.index),1)

  #updating the labeled training set
  labeled_x_train=x_train.loc[sampled_indexes,:]
  labeled_y_train=y_train.loc[sampled_indexes]

  #train and test the model 
  model=sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
  model.fit(labeled_x_train,labeled_y_train)
  y_predicted=model.predict(x_test)
  accuracy=sklearn.metrics.accuracy_score(y_test,y_predicted)

  #save the results 
  number_of_labels_random_sampling.append(len(sampled_indexes))
  accuracies_random_sampling.append(accuracy)
pd.DataFrame({'number of labels': number_of_labels_random_sampling, 'Accuracy': accuracies_random_sampling})



,number of labels,Accuracy
0,1,0.104
1,2,0.164
2,3,0.217
3,4,0.201
4,5,0.211
5,6,0.254
6,7,0.29
7,8,0.307
8,9,0.37
9,10,0.372
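
The loop above draws one new random point per iteration from the points that are still unlabeled. Drawing all $ B $ points once without replacement and then taking prefixes should give the same kind of nested random label sets. The sketch below assumes this simplification; `random_labeling_order` and `seed` are hypothetical names, and the range of 1000 IDs is a stand-in for the training index.

```python
import random


def random_labeling_order(row_ids, budget, seed=0):
    """Draw `budget` distinct row IDs in a random order."""
    rng = random.Random(seed)
    return rng.sample(list(row_ids), budget)


# Stand-in IDs; in the notebook these would be list(x_train.index)
order = random_labeling_order(range(1000), budget=30)

# The labeled set for a budget of b labels is the first b IDs of the order
labeled_sets = [order[:b] for b in range(1, len(order) + 1)]
```

Each entry of `labeled_sets` extends the previous one by a single point, which mirrors how the loop grows `sampled_indexes`.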


2-Uncertainty sampling

In [17]:
number_of_labels_uncertainty_sampling = []
accuracies_uncertainty_sampling = []
sampled_indexes = []

for i in range(labeling_budget):
    # Sampling training data points
    unlabeled_x_train = x_train.drop(sampled_indexes)
    if len(sampled_indexes) < 3:
        # Seed the labeled set with a few random points before querying the model
        sampled_indexes += random.sample(list(unlabeled_x_train.index), 1)
    else:
        # Least-confidence uncertainty: 1 - probability of the predicted class.
        # The scores are kept in a separate Series, because adding a column to this
        # wide frame triggers the pandas fragmentation warning.
        # Note: with n_neighbors=1, predict_proba returns only 0 or 1, so every
        # uncertainty is 0 and the first unlabeled point wins the tie.
        probabilities = model.predict_proba(unlabeled_x_train)
        uncertainty = pd.Series(1 - probabilities.max(axis=1), index=unlabeled_x_train.index)
        most_uncertain_index = uncertainty.idxmax()
        sampled_indexes.append(most_uncertain_index)

    # Updating the labeled training set
    labeled_x_train = x_train.loc[sampled_indexes, :]
    labeled_y_train = y_train.loc[sampled_indexes]

    # Train and test the model
    model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
    model.fit(labeled_x_train, labeled_y_train)
    y_predicted = model.predict(x_test)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)

    # Save the results
    number_of_labels_uncertainty_sampling.append(len(sampled_indexes))
    accuracies_uncertainty_sampling.append(accuracy)

result_df = pd.DataFrame({'Number of Labels': number_of_labels_uncertainty_sampling, 'Accuracy': accuracies_uncertainty_sampling})
print(result_df)



DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`



    Number of Labels  Accuracy
0                  1     0.093
1                  2     0.093
2                  3     0.192
3                  4     0.219
4                  5     0.220
5                  6     0.238
6                  7     0.280
7                  8     0.277
8                  9     0.352
9                 10     0.370
10                11     0.376
11                12     0.436
12                13     0.445
13                14     0.460
14                15     0.459
15                16     0.469
16                17     0.468
17                18     0.487
18                19     0.495
19                20     0.494
20                21     0.490
21                22     0.491
22                23     0.494
23                24     0.494
24                25     0.511
25                26     0.527
26                27     0.484
27                28     0.483
28                29     0.483
29                30     0.486
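
One caveat about the uncertainty scores used above: a nearest-neighbor classifier with a single neighbor always assigns probability 1 to the class of that neighbor, so `1 - highest_probability` is 0 for every point and the scores cannot rank the unlabeled points. The sketch below illustrates this with a tiny pure-Python vote count. It is an illustration of the idea, not scikit-learn's implementation, and `knn_vote_fractions` and the toy points are hypothetical.

```python
from collections import Counter


def knn_vote_fractions(train_points, train_labels, query, k):
    """Return the fraction of the k nearest training points that carry each label."""
    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))

    nearest = sorted(range(len(train_points)),
                     key=lambda i: squared_distance(train_points[i]))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return {label: count / k for label, count in votes.items()}


points = [(0, 0), (0, 1), (1, 0), (5, 5)]
labels = ['a', 'a', 'b', 'b']
query = (0.4, 0.1)

fractions_k1 = knn_vote_fractions(points, labels, query, k=1)  # a single vote
fractions_k3 = knn_vote_fractions(points, labels, query, k=3)  # graded votes

uncertainty_k1 = 1 - max(fractions_k1.values())  # always 0 with one neighbor
uncertainty_k3 = 1 - max(fractions_k3.values())  # can be greater than 0
```

With more than one neighbor the vote fractions become graded, so a model with a larger `n_neighbors` (or another probabilistic classifier) may give uncertainty sampling more to work with once enough points are labeled.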






3-Clustering-based sampling

In [23]:
number_of_labels_clustering_based_sampling = []
accuracies_clustering_based_sampling = []

# A budget of B labels means B clusters; start at 2 clusters
for budget in range(2, labeling_budget + 1):
    # Cluster the whole (unlabeled) training set into `budget` clusters
    clustering_model = sklearn.cluster.KMeans(n_clusters=budget)
    clustering_model.fit(x_train)

    # Keep the cluster IDs in a separate Series instead of adding a column
    # to the feature frame (which triggers the pandas fragmentation warning)
    cluster_ids = pd.Series(clustering_model.labels_, index=x_train.index)

    # Sample one data point per cluster
    sampled_indexes = list(cluster_ids.groupby(cluster_ids).sample(n=1).index)

    # Updating the labeled training set
    labeled_x_train = x_train.loc[sampled_indexes, :]
    labeled_y_train = y_train.loc[sampled_indexes]

    # Train and test the model
    model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
    model.fit(labeled_x_train, labeled_y_train)

    y_predicted = model.predict(x_test)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)

    # Save the results
    number_of_labels_clustering_based_sampling.append(len(sampled_indexes))
    accuracies_clustering_based_sampling.append(accuracy)

pd.DataFrame({"Number of Labels": number_of_labels_clustering_based_sampling, "Accuracy": accuracies_clustering_based_sampling})




DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`





,Number of Labels,Accuracy
0,2,0.157
1,3,0.175
2,4,0.218
3,5,0.301
4,6,0.322
5,7,0.308
6,8,0.379
7,9,0.364
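
The selection step of clustering-based sampling, picking one random member from each cluster, can be shown on its own. The following is a minimal sketch using only the standard library, assuming the cluster assignments are already available (for example from `KMeans.labels_`). `sample_one_per_cluster`, `seed`, and the toy IDs are hypothetical.

```python
import random
from collections import defaultdict


def sample_one_per_cluster(row_ids, cluster_ids, seed=0):
    """Pick one random row ID from each cluster; returns {cluster_id: row_id}."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for row_id, cluster_id in zip(row_ids, cluster_ids):
        members[cluster_id].append(row_id)
    # Iterate clusters in sorted order so the result is repeatable for a given seed
    return {cluster_id: rng.choice(ids) for cluster_id, ids in sorted(members.items())}


# Five rows assigned to three clusters
toy_row_ids = [10, 11, 12, 13, 14]
toy_cluster_ids = [0, 1, 0, 2, 1]
picked = sample_one_per_cluster(toy_row_ids, toy_cluster_ids)
```

The number of labeled points equals the number of non-empty clusters, which is why the number of clusters plays the role of the labeling budget $ B $ in this strategy.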


Visualization

In [24]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=number_of_labels_random_sampling, y=accuracies_random_sampling,
                         mode='lines', name='Random sampling'))
fig.add_trace(go.Scatter(x=number_of_labels_uncertainty_sampling, y=accuracies_uncertainty_sampling,
                         mode='lines', name='Uncertainty sampling'))
fig.add_trace(go.Scatter(x=number_of_labels_clustering_based_sampling, y=accuracies_clustering_based_sampling,
                         mode='lines', name='Clustering-based sampling'))
# Label the axes so the performance curves can be read on their own
fig.update_layout(xaxis_title='Number of labels', yaxis_title='Test accuracy')
fig.show()
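
To complement the plot, each curve can be summarized by its average accuracy over the budget range, computed as the area under the curve divided by the width of the range. This is one possible summary rather than part of the assignment. The sketch below assumes the trapezoidal rule, and `mean_accuracy_over_budget` is a hypothetical helper name.

```python
def mean_accuracy_over_budget(number_of_labels, accuracies):
    """Average accuracy over the budget range, using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(number_of_labels)):
        width = number_of_labels[i] - number_of_labels[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2
    return area / (number_of_labels[-1] - number_of_labels[0])


# A toy curve that rises linearly from 0.0 to 1.0 averages to 0.5
toy_mean = mean_accuracy_over_budget([1, 2, 3], [0.0, 0.5, 1.0])
```

In the notebook, the same function could be applied to each pair of result lists, for example `mean_accuracy_over_budget(number_of_labels_random_sampling, accuracies_random_sampling)`. The curves should be compared over a common budget range, since the clustering-based curve starts at 2 labels.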