<a href="https://colab.research.google.com/github/adasegroup/ML2021_seminars/blob/master/seminar8/Multiclass_Imbalanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced classification: Imbalanced and Multi-class cases

In this seminar we will learn how to perform classification with multiple classes, both balanced and imbalanced.

The dataset used in this tutorial is a smaller version of the [Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/). The original dataset consists of images of 120 dog breeds. Here we are going to use just 4 of those 120 classes.

![dogs_pic](https://dog.ceo/img/dog-api-fb.jpg)

#### The plan of the seminar:
* A short introduction to the Stanford Dogs Dataset
* Producing the features of the images using a pretrained neural network (we will treat it as a black box)
* Multi-class classification methods: One-vs-One and One-vs-Rest
* Imbalanced datasets: why are they a problem?
* Imbalanced classification methods: over- and under-sampling, SMOTE

Let us start with some library imports.

##### NOTES:
* Class description
* dataframe creation in class or in seminar

In [None]:
!pip install -U imbalanced-learn

In [None]:
!wget https://github.com/adasegroup/ML2021_seminars/raw/main/seminar8/data/dog_breeds.zip

In [None]:
!unzip -oqd "./" "dog_breeds.zip"

In [None]:
!ls .

In [None]:
!rm -rf ./__MACOSX ./sample_data .config ./dog_breeds.zip

In [None]:
import torch
import pandas as pd
import matplotlib.pyplot as plt
import urllib
%matplotlib inline
from PIL import Image
from torchvision import transforms
import os
import sklearn
import os.path
from tqdm.autonotebook import tqdm

In [None]:
paths_doggies = [i for i in os.listdir('./') if '.DS_' not in i] 

In [None]:
#if you load your data from the local directory
#################################
#path_doggies ="dog_breeds/small"
#paths_doggies = [path_doggies +'/'+ i for i in os.listdir(path_doggies) if '.DS_' not in i] 
#################################

<br>

Now let us have a look at the data

In [None]:
def img_show(img, ax, title = None):
    """
    Plots the image on the particular axis

    Parameters
    ----------
    img: PIL.Image, the image to plot.
    ax: matplotlib axis to plot on.
    title: string, the title of the image.
    
    """
    ax.imshow(img)
    ax.axis('off')
    if title:
        ax.set_title(title)

In [None]:
#pick one example image per breed for plotting
img_names = {}
for breed_dir in paths_doggies[:4]:
    breed = breed_dir.split('-')[-1]
    img_names[breed] = os.path.join(breed_dir, os.listdir(breed_dir)[0])


In [None]:
#plot the images from img_names
fig, ax = plt.subplots(1,4, figsize=(20,10))
for i, key in enumerate(img_names):
    img_show(Image.open(img_names[key]), ax[i], title = key)
plt.show()

To make working with the data easier, we are going to create a class that stores the feature extraction model (```feature_generator```), the ```data_path``` and the ```data_list``` containing the feature vectors of all the image samples.

In [None]:
class DogBreedDataset:
    def __init__(self, data_path, feature_generator, num_samples=None):
        """
        A wrapper class for Stanford Dog Breeds dataset.

        Parameters
        ----------
        data_path: string, the path to the dataset.
        feature_generator: torch.nn.Module, the model, that receives the torch.tensor of the preprocessed image 
                           as the input and produces the tensor of features as the output.
        num_samples: integer, the number of samples in each class to load, default: None.
        """
        self.data_path = data_path
        self.model = feature_generator
        self.num_samples = num_samples
        self.data_list = []

    def preprocess_image(self, image):
        """
        Preprocesses an image according to the requirements mentioned at https://pytorch.org/hub/pytorch_vision_vgg/

        Parameters
        ----------
        image: PIL.Image, the image to preprocess.

        Returns
        -------
        input_batch: the tensor of the preprocessed image with an extra leading dim, representing a batch of size 1.
        """

        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(image)
        input_batch = input_tensor.unsqueeze(0)
        return input_batch

    def load_dataset(self):
        """
        Loads the images from the dataset, preprocesses them and extracts their features.

        Returns
        -------
        data_list: the list of (features, label) tuples, one per image
        """
        data_list = []
        for path in tqdm(self.data_path):
            counter = 0
            for filename in tqdm(os.listdir(path)):
                counter += 1
                # input
                with open(os.path.join(path, filename), 'rb') as file:
                    batch = self.preprocess_image(Image.open(file))

                with torch.no_grad():
                    features = self.model(batch).flatten().cpu().numpy()

                # label
                _, label = path.split('-', 1)
                data_list.append((features, label))

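                # stop once enough samples of this class are loaded (None means load all)
                if self.num_samples is not None and counter >= self.num_samples:
                    break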

        # keep a copy on the instance so the features can be reused later
        self.data_list = data_list
        return data_list

The model that we are going to use to get our features from these raw images is a neural network called **VGG-11** (you are going to learn about this type of model later in the course).

Luckily for us, the [```PyTorch```](https://pytorch.org) library provides some of the most popular [pretrained Neural Networks](https://pytorch.org/hub/), so we don't have to design and train VGG-11 from scratch.

In [None]:
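import ssl

# workaround for certificate errors when downloading the model weights;
# note that this disables HTTPS certificate verification for the whole session
ssl._create_default_https_context = ssl._create_unverified_context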

In [None]:
#download the VGG11 model from pytorch hub
model = torch.hub.load('pytorch/vision:v0.4.0', 'vgg11', pretrained=True)

![](https://neurohive.io/wp-content/uploads/2018/11/vgg16-1-e1542731207177.png)

However, we do not need the whole network to produce the image features: we will take only the convolutional part, which ends just before the first __fully connected__ layer.

In [None]:
#take only the convolutional part that outputs the images' features
image_to_feats = model.features

In [None]:
image_to_feats.eval()
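
To get a feel for how long each feature vector will be, the sketch below works out the size of the flattened output of the convolutional part for a 224×224 input. It assumes the standard VGG-11 layout (five 2×2 max-pooling layers that each halve the spatial side, and 512 channels in the last convolutional block). The helper name `vgg_feature_size` is hypothetical and only used for this illustration.

```python
def vgg_feature_size(side=224, n_pools=5, channels=512):
    """Length of the flattened feature vector, assuming each pooling layer halves the side."""
    for _ in range(n_pools):
        side //= 2  # 224 -> 112 -> 56 -> 28 -> 14 -> 7
    return channels * side * side

n_features = vgg_feature_size()
print(n_features)  # 25088
```

Under these assumptions, the dataframe built below should have this many feature columns plus one label column.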

Let us load and preprocess the images and compute their features. Afterwards we will store the features in a pandas dataframe.

In [None]:
dataset_class = DogBreedDataset(paths_doggies, image_to_feats, num_samples = 150)

In [None]:
datalist = dataset_class.load_dataset()

Let's create a pandas dataframe with all the features and labels. 

In [None]:
features, label = datalist[0]

In [None]:
columns = [f"feat_{i+1}" for i in range(len(features))]
df_doggies = pd.DataFrame(
    [feat for feat, lab in datalist],
    columns=columns)

df_doggies["y"] = [lab for feat, lab in datalist]

In [None]:
df_doggies.shape

In [None]:
df_doggies.head()

Convert the labels to the Categorical type and create a dictionary that maps the integer codes back to the original labels, in case we would like to recover them.

In [None]:
df_doggies.y = pd.Categorical(df_doggies.y)

In [None]:
label_map = dict(enumerate(df_doggies.y.cat.categories) )

In [None]:
label_map

In [None]:
df_doggies.y = df_doggies.y.cat.codes

## Plotting the data using dimensionality reduction techniques

DataPlotter is another black box that we are going to use to project our features onto two dimensions, which is more convenient for plotting (later in the course you will learn about PCA and t-SNE).


Let's plot our data!

In [None]:
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.decomposition import PCA

class DataPlotter:
    def __init__(self, data, dim_red = 'pca', X=None, y=None):
        """
        A wrapper class for dimensionality reduction and plotting.

        Parameters
        ----------
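        data: dataframe, the dataset with the feature columns and the label column 'y'.
        dim_red: string, the dimensionality reduction technique to use, either 'tsne' or 'pca'.
        X: optional dataframe of features to use instead of the features in data.
        y: optional series of labels to use instead of data.y.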
        """ 
        self.data = data
        self.dim_red = dim_red
        self.X = X
        self.y = y
        if X is None:
            self.X = self.data.loc[:, self.data.columns!='y']
        if y is None:
            self.y = self.data.y.astype(int)
       
    def shuffle_data(self):
        """
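        Randomly shuffles the data, keeping each feature row aligned with its label.
        """
        order = self.X.reset_index(drop=True).sample(frac=1).index
        self.X = self.X.iloc[order].reset_index(drop=True)
        self.y = self.y.iloc[order].reset_index(drop=True)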

    def reduce_dimension(self):
        """
        Reduce the current dimension of the feature data to 2 dimensions using either pca or tsne.
        """
        if self.dim_red =='tsne':
            self.X_embedded = TSNE(n_components=2, perplexity=30.0).fit_transform(self.X)
        elif self.dim_red == 'pca':
            self.X_embedded = PCA(n_components=2).fit_transform(self.X)

    def plot_data(self):
        plt.figure(figsize=(20,10))
        # x and y are passed by keyword: recent seaborn versions no longer accept them positionally
        sns.scatterplot(x=self.X_embedded[:,0], y=self.X_embedded[:,1], hue=self.y, palette="rainbow", s=100,
                        legend="full")

In [None]:
data_pltr = DataPlotter(df_doggies, dim_red = 'pca')

In [None]:
data_pltr.reduce_dimension()

In [None]:
data_pltr.plot_data()

## Multi-class classification

Finally, let's try some multi-class classification methods.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.base import clone

Train-test split

In [None]:
y = df_doggies.y.astype(int)
X = df_doggies.loc[:, df_doggies.columns!='y']

split = train_test_split(X, y, test_size=0.5,
                         random_state=42, stratify=y)
train_X, test_X, train_y, test_y = split

Unfortunately, most of the binary classification methods discussed in the previous seminars can only distinguish one class from another. In our case, however, we want to classify several dog breeds, so how can we do that?

One way to solve this problem is the **One-vs-All** (also called One-vs-Rest) approach:
![](https://miro.medium.com/max/1574/1*7sz-bpA4r_xSqAG3IJx-7w.jpeg)

In [None]:
from sklearn.svm import LinearSVC

model_SVC = LinearSVC(random_state=0)

In [None]:
from sklearn.multiclass import OneVsRestClassifier

ovr_classifier = OneVsRestClassifier(clone(model_SVC), n_jobs=-1)
ovr_classifier.fit(train_X, train_y)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
predict_y = ovr_classifier.predict(test_X)

cmatrix = confusion_matrix(test_y, predict_y)
pd.DataFrame(cmatrix)

Rows correspond to the true classes, columns to the predicted classes.
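
To make the row and column convention concrete, here is a minimal pure-Python sketch that builds such a table by hand. The helper `confusion_counts` and the toy labels are made up for illustration; they are not part of scikit-learn or of the dog dataset.

```python
def confusion_counts(true_labels, predicted_labels, n_classes):
    """Return an n_classes x n_classes table: rows are true classes, columns are predictions."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        table[t][p] += 1
    return table

# Toy labels, made up for illustration.
toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]
toy_matrix = confusion_counts(toy_true, toy_pred, n_classes=3)
for row in toy_matrix:
    print(row)
```

The diagonal entries count the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.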

In [None]:
print("Accuracy %.3f%%" % (100 * ovr_classifier.score(test_X, test_y)))


### One-vs-One approach to multi-class classification

![](https://ars.els-cdn.com/content/image/1-s2.0-S0950705116301459-gr1.jpg)

In the same manner as we trained and evaluated the One-vs-Rest classifier, train a ```OneVsOneClassifier```.
Have a look at the accuracy and the confusion matrix. Which method performs better?
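
Before running the experiment, it may help to see how many binary problems each strategy has to solve. The sketch below only enumerates the problems for our four class codes and does not train anything. The variable names are made up for illustration.

```python
from itertools import combinations

breed_ids = [0, 1, 2, 3]  # the four class codes used in this seminar

# One-vs-Rest: one binary problem per class (that class against all the others).
ovr_problems = [(c, "rest") for c in breed_ids]

# One-vs-One: one binary problem per unordered pair of classes.
ovo_problems = list(combinations(breed_ids, 2))

print(len(ovr_problems), len(ovo_problems))  # 4 6
```

In general, K classes give K problems for One-vs-Rest and K(K-1)/2 for One-vs-One, but each One-vs-One problem is trained only on the samples of its two classes.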

In [None]:
from sklearn.multiclass import OneVsOneClassifier

In [None]:
###YOUR CODE###

## Imbalanced data

Data imbalance is a very common issue in machine learning. Consider predicting volcano eruptions or plane crashes: there is an abundance of negative examples, when the event does not happen, and very few recorded cases of the events whose occurrence we want to predict.

This is where various class balancing methods are going to help.
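
A small sketch of why imbalance is a problem: with hypothetical labels where only 5 of 100 samples are positive, a "classifier" that always predicts the majority class looks excellent by accuracy while never finding a single rare event. All the numbers below are made up for illustration.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
rare_true = [0] * 95 + [1] * 5
# A "classifier" that always predicts the majority class.
rare_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(rare_true, rare_pred)) / len(rare_true)
minority_hits = sum(t == 1 and p == 1 for t, p in zip(rare_true, rare_pred))
minority_recall = minority_hits / rare_true.count(1)

print(accuracy, minority_recall)  # 0.95 0.0
```

This is why, for imbalanced data, we look at per-class metrics in addition to the overall accuracy.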

In [None]:
X_sub, y_sub = df_doggies.loc[:, df_doggies.columns!='y'], df_doggies.y.astype(int)

In [None]:
data_pltr = DataPlotter(df_doggies, dim_red = 'pca')
data_pltr.reduce_dimension()
data_pltr.plot_data()

In [None]:
from collections import Counter
print('Distribution before imbalancing: {}'.format(Counter(y_sub)))

In [None]:
from imblearn.datasets import make_imbalance
X_res, y_res = make_imbalance(
    X_sub, y_sub, sampling_strategy={0: 150, 1: 150, 2: 30, 3: 150},
    random_state=1)


In [None]:
print('Distribution after imbalancing: {}'.format(Counter(y_res)))

In [None]:
data_pltr = DataPlotter(df_doggies, dim_red = 'pca', X = X_res, y = y_res)
data_pltr.reduce_dimension()
data_pltr.plot_data()

In [None]:
split = train_test_split(X_res, y_res, test_size=0.3,
                         random_state=42, stratify=y_res)
train_X, test_X, train_y, test_y = split

In [None]:
from sklearn.linear_model import RidgeClassifier
model_SVC = LinearSVC(random_state=50)
#model_SVC = RidgeClassifier(random_state=0)
ovr_classifier = OneVsRestClassifier(clone(model_SVC), n_jobs=-1)
ovr_classifier.fit(train_X, train_y)

In [None]:
predictions = ovr_classifier.predict(test_X)
#predictions = model_SVC.predict(test_X[test_y==0])

In [None]:
from imblearn.metrics import classification_report_imbalanced

In [None]:
print("Accuracy %.3f%%" % (100 * ovr_classifier.score(test_X, test_y)))

In [None]:
print(classification_report_imbalanced(test_y, predictions))
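
The report lists per-class metrics such as recall and specificity. As a rough sketch of what these two numbers mean for a single class treated one-vs-rest, they can be computed by hand as below. The helper name and the labels are made up for illustration, and the helper assumes the class occurs at least once in both the positive and the negative role.

```python
def recall_and_specificity(true_labels, predicted_labels, positive):
    """One-vs-rest recall (sensitivity) and specificity for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels: class 2 is rare and half of its samples are missed.
report_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
report_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 2]
rec, spe = recall_and_specificity(report_true, report_pred, positive=2)
geo = (rec * spe) ** 0.5  # geometric mean of the two rates
print(rec, spe, round(geo, 3))  # 0.5 1.0 0.707
```

The geometric mean is one way to summarize both rates in a single number: it stays low whenever either of them is low.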

## Techniques to try when dealing with an imbalanced dataset:
* Under-/over-sampling
* Synthetic minority over-sampling technique (SMOTE) and its variants (ADASYN, BorderlineSMOTE, etc.)
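
The idea behind random under- and over-sampling can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the ```imblearn``` implementation. The helper `random_resample` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_resample(labels, target_per_class, seed=0):
    """Return sample indices such that every class has target_per_class samples.

    Larger classes are sub-sampled without replacement (under-sampling);
    smaller classes are topped up with randomly drawn duplicates (over-sampling).
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    chosen = []
    for lab, idxs in by_class.items():
        if len(idxs) >= target_per_class:
            chosen += rng.sample(idxs, target_per_class)
        else:
            extra = rng.choices(idxs, k=target_per_class - len(idxs))
            chosen += idxs + extra
    return chosen

toy_labels = [0] * 10 + [1] * 3  # made-up imbalanced labels
under_idx = random_resample(toy_labels, target_per_class=3)   # shrink to the minority size
over_idx = random_resample(toy_labels, target_per_class=10)   # grow to the majority size
print(Counter(toy_labels[i] for i in under_idx))
print(Counter(toy_labels[i] for i in over_idx))
```

Under-sampling throws away majority samples, while over-sampling repeats minority samples. Both should be applied to the training set only, so that the test set keeps the original distribution.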

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [None]:
balancer = RandomUnderSampler()

In [None]:
balanced_train_x, balanced_train_y = balancer.fit_resample(train_X, train_y)

In [None]:
print('Distribution before balancing: {}'.format(Counter(train_y)))

In [None]:
print('Distribution after balancing: {}'.format(Counter(balanced_train_y)))

In [None]:
model = LinearSVC(random_state=50)
ovr_classifier = OneVsRestClassifier(clone(model), n_jobs=-1)
ovr_classifier.fit(balanced_train_x, balanced_train_y)

In [None]:
print("Accuracy %.3f%%" % (100 * ovr_classifier.score(test_X, test_y)))

In [None]:
predictions = ovr_classifier.predict(test_X)
#predictions = model.predict(test_X[test_y==0])

In [None]:
print(classification_report_imbalanced(test_y, predictions))

In [None]:
pd.DataFrame(confusion_matrix(test_y, predictions))

Have a look at how ```RandomOverSampler()``` deals with the same task. Is it better or worse?

In [None]:
balancer = RandomOverSampler()

In [None]:
### YOUR CODE ###

## SMOTE

<img src="https://ars.els-cdn.com/content/image/1-s2.0-S0950705119302898-gr1.jpg" alt="smote" width="600"/>
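
The core step of SMOTE is to create a synthetic minority sample somewhere on the segment between an existing minority sample and one of its nearest minority neighbours. Below is a minimal pure-Python sketch of that single step. It is not the ```imblearn``` implementation, and the helper `smote_like_sample` and the toy points are made up for illustration.

```python
import math
import random

def smote_like_sample(point, minority_points, k=2, rng=None):
    """Create one synthetic point between `point` and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    others = [p for p in minority_points if p is not point]
    neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(point, neighbour))

toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # made-up 2-D points
synthetic = smote_like_sample(toy_minority[0], toy_minority, k=2)
print(synthetic)
```

Unlike random over-sampling, which repeats existing samples, this produces new points inside the region already occupied by the minority class.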

In [None]:
rebalancer = SMOTE(sampling_strategy='not majority', k_neighbors=5, random_state = 1)

In [None]:
under_balancer = RandomUnderSampler(sampling_strategy={0:20, 1: 30, 3:50})

In [None]:
balanced_train_x, balanced_train_y = under_balancer.fit_resample(train_X, train_y)

In [None]:
print('Distribution before balancing: {}'.format(Counter(balanced_train_y)))

In [None]:
model = LinearSVC(random_state=0)

ovr_classifier = OneVsRestClassifier(clone(model))
X_SMOTE, y_SMOTE = rebalancer.fit_resample(balanced_train_x, balanced_train_y)
print('Distribution after balancing: {}'.format(Counter(y_SMOTE)))
ovr_classifier = ovr_classifier.fit(X_SMOTE, y_SMOTE)

In [None]:
predict_y_balanced = ovr_classifier.predict(test_X)
pd.DataFrame(confusion_matrix(test_y, predict_y_balanced))

In [None]:
print("Accuracy %.3f%%" % (100 * ovr_classifier.score(test_X, test_y)))

In [None]:
print(classification_report_imbalanced(test_y, predict_y_balanced))

In [None]:
data_pltr = DataPlotter(df_doggies, dim_red = 'pca', X = X_SMOTE, y = y_SMOTE)
data_pltr.reduce_dimension()
data_pltr.plot_data()

There are different variations of the SMOTE method, such as ADASYN, BorderlineSMOTE, etc. Many of them are available in the [```imblearn```](https://imbalanced-learn.readthedocs.io/en/stable/api.html) library.

**Try out those methods yourself, using the mentioned functions, plot and analyze the results.**

In [None]:
### YOUR CODE ####