<a href="https://colab.research.google.com/github/theora/LawAndAI/blob/main/intro_to_ML_1_classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ALAAI HW 3 (Introduction to ML)

## Introduction

This set of exercises is intended to introduce students to fundamental steps that are necessary in carrying out a basic supervised ML experiment. The goal of the exercise is to make students aware of all the steps that go into such an experiment, beginning with acquisition and preprocessing of data and finishing with the evaluation of the experimental results.

It is important to have access to the internet in order to download the required resources and refer to online documentation.

Begin by reading Section 1 (Background and Motivation) and Section 3 (Data Set) of the [Improving Sentence Retrieval from Case Law for Statutory Interpretation](http://savelka.net/docs/2019ICAIL.pdf) paper. Although the paper is about ranking sentences, we will simply treat the problem as a multi-class classification task.

## Get the Data

First, execute the cell below. Note the exclamation mark at the beginning of each line. In Colab this means that the line is not interpreted as Python but is instead passed to the shell and executed as a [bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) command.

The code clones [the repository](https://github.com/jsavelka/statutory_interpretation) from GitHub (git clone). In case you would like to learn more about git and GitHub you can go over [this article](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners).

WARNING: Before you proceed create your own copy of this notebook. Under File click "Save a Copy in Drive". Continue your work on the copy.

In [None]:
!git clone https://github.com/jsavelka/statutory_interpretation.git
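
To make the role of the `!` prefix more concrete, the sketch below starts a child process from plain Python and captures its output, which is roughly what the notebook does when it hands a `!` line to the shell. The command used here (a Python one-liner) is only a portable stand-in and is not part of the homework.

```python
import subprocess
import sys

# Start a child process and capture what it prints.
# The child command is a harmless stand-in chosen for portability.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,
    text=True,
    check=True,
)
output = result.stdout.strip()
print(output)
```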

Executing the cell below lists the files in the repository you just cloned.

In [None]:
!ls statutory_interpretation

Unzip the archive with the documents related to the “common business purpose” term (`common_business_purpose.zip`).

In [None]:
!unzip statutory_interpretation/common_business_purpose.zip

## Load and Explore the Data

Using the `json` module, load the data from the `common_business_purpose-sentence.json` file. Follow the instruction provided in the TODO. Executing the cell before completing the task will result in an error.

In [None]:
import json
from pathlib import Path
from pprint import pprint

PWD = Path()
IEV_FILE = PWD/'...'  # TODO: Replace the three dots with the file name.

Load the json file.

In [None]:
with open(IEV_FILE, 'r') as json_f:
    cbp_data = json.load(json_f)

Explore the object you have just loaded into memory. Start with understanding what kind of object it is.

In [None]:
print(type(cbp_data))

Knowing that the object is a dictionary, determine what keys it has.

In [None]:
print(cbp_data.keys())
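
The same kind of exploration can be tried on a tiny stand-in first. The sketch below assumes a dictionary that maps sentence ids to records with `label` and `text` fields; the ids and values are made up for illustration and are not taken from the real file.

```python
import json

# Hypothetical miniature of the data set; only the 'label' and 'text'
# field names mirror what the notebook uses later on.
raw = """
{
  "s1": {"label": "label_a", "text": "First example sentence."},
  "s2": {"label": "label_b", "text": "Second example sentence."}
}
"""
toy = json.loads(raw)

print(type(toy).__name__)   # kind of object
print(list(toy.keys()))     # available keys

# Peek at one entry without hard-coding a key.
first_key = next(iter(toy))
print(first_key, toy[first_key])
```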

Explore one of the objects under any of the keys. Follow the instruction provided in the TODO. Executing the cell before completing the task will result in an error.

In [None]:
pprint(cbp_data['...'])  # TODO: Replace the three dots with any of the keys.

Finally, determine how many entries (sentences) there are.

In [None]:
print(len(cbp_data))

For each sentence, extract the label and the text. Make sure to keep each label paired with its sentence. For example, you can create one list for the labels and one list for the texts, as long as both lists are in the same order.

In [None]:
labels = []
texts = []
for sent_data in cbp_data.values():
    labels.append(sent_data['label'])
    texts.append(sent_data['text'])
print(f'Len labels: {len(labels)}; Len texts: {len(texts)}')
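
A quick way to convince yourself that two parallel lists stay aligned is to zip them and inspect a few pairs. The lists below are made-up stand-ins, not the homework data.

```python
# Made-up parallel lists standing in for the labels and texts.
toy_labels = ["label_a", "label_b", "label_a"]
toy_texts = ["sentence one", "sentence two", "sentence three"]

# Both lists must have the same length for the pairing to make sense.
assert len(toy_labels) == len(toy_texts)

# zip walks both lists in lockstep, so item i of one list is paired
# with item i of the other.
pairs = list(zip(toy_labels, toy_texts))
for label, text in pairs:
    print(f"{label}: {text}")
```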

## Exercises

This Colab will walk you through several exercises. Follow the instructions and write your answers to the individual questions into a separate file (MS Word docx, plain text, or any other common format). The questions are meant to be answered in the order in which they are asked.

NOTE: Only the underlined items are supposed to be reported in your homework submission.

### Question 1

Generate a bar chart describing the distribution of the labels. <ins>Show the bar chart in your report and comment on the balance of the classes</ins>.

In [None]:
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np

sentence_types, counts = zip(*Counter(labels).items())
indexes = np.arange(len(sentence_types))
plt.title('Label Distribution')
plt.bar(indexes, counts)
plt.xticks(indexes, sentence_types, rotation=45)
plt.show()
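
When commenting on class balance, it can help to put numbers next to the chart. The sketch below computes each class's share and the accuracy of always predicting the most frequent class, using a made-up label list rather than the homework data.

```python
from collections import Counter

# Made-up, deliberately imbalanced label list.
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

counts = Counter(toy_labels)
total = sum(counts.values())

# Share of each class in the data set.
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(majority_label, baseline_accuracy)
```

Any trained classifier should at least beat this majority-class baseline.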

### Question 2

Split your data set into training, validation, and test sets. Use a 50/25/25 split. <ins>Explain the importance of dividing the data set into a training and test set.</ins>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(texts, labels,
                                                    test_size=0.5, shuffle=True)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp,
                                                test_size=0.5, shuffle=True)

print(f'Len train: {len(X_train)} ({len(y_train)})')
print(f'Len valid: {len(X_val)} ({len(y_val)})')
print(f'Len test: {len(X_test)} ({len(y_test)})')
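
To see what a 50/25/25 split amounts to, here is a minimal standard-library sketch. The helper name `split_50_25_25` and the toy data are hypothetical; it is an illustration of the idea, not the method `train_test_split` uses internally.

```python
import random

def split_50_25_25(items, seed=0):
    """Shuffle a copy of items and cut it into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = len(shuffled) // 2
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Made-up (text, label) pairs.
toy_pairs = [(f"text {i}", i % 2) for i in range(100)]
train, val, test = split_50_25_25(toy_pairs)
print(len(train), len(val), len(test))  # 50 25 25
```

With `train_test_split` itself, passing `random_state` makes the split reproducible and passing `stratify` keeps the class proportions similar across the parts.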

### Question 3

Vectorize the data set using the `CountVectorizer` from the `sklearn` library of modules. <ins>Explain the purpose of transforming the text of the documents into vectors.</ins>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_val_counts = count_vect.transform(X_val)
X_test_counts = count_vect.transform(X_test)
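
The following hand-rolled sketch shows the idea behind count vectors: learn a vocabulary from the training texts only, then count how often each vocabulary word occurs in a text. It imitates `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters) on made-up sentences; the helper names are hypothetical and the real class returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of at least two word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Map every word seen in the texts to a column index (alphabetical order)."""
    words = sorted({tok for text in texts for tok in tokenize(text)})
    return {word: idx for idx, word in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a list of counts, one per vocabulary word."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_texts = ["the court held the term", "the term is broad"]
vocab = fit_vocabulary(train_texts)
train_vectors = transform(train_texts, vocab)

# Words not seen during fitting ("statute") are simply ignored.
new_vectors = transform(["the statute is broad"], vocab)

print(list(vocab))
print(train_vectors)
print(new_vectors)
```

This is also why the notebook calls `fit_transform` only on the training set and plain `transform` on the validation and test sets.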

### Question 4

Train a `Nearest Neighbors Classifier` using the `5` closest data points as neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Use the 5 nearest neighbors; n_jobs=-1 runs the neighbor search on all CPU cores.
neigh = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
neigh.fit(X_train_counts, y_train)
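
As a rough illustration of what a nearest neighbors classifier does, the sketch below classifies a query point by majority vote among its `k` closest training points under Euclidean distance (the default metric of `KNeighborsClassifier`). The function name and the two-dimensional toy points are made up, and tie-breaking may differ from scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k):
    """Predict the label of query by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Two made-up clusters of points.
toy_vectors = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(toy_vectors, toy_labels, [1, 1], k=3)
pred_near_b = knn_predict(toy_vectors, toy_labels, [5, 4], k=3)
print(pred_near_a, pred_near_b)  # a b
```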

Evaluate the fitted model on the validation set. <ins>Report per-class Precision, Recall, and F1-measure as well as a confusion matrix for your classifiers.</ins>

In [None]:
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

pred_nn = neigh.predict(X_val_counts)
print(classification_report(y_val, pred_nn))
disp = ConfusionMatrixDisplay.from_estimator(
    neigh, X_val_counts, y_val,
    display_labels=[a.split()[0] for a in neigh.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)
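
To read the classification report with confidence, it may help to compute the three per-class scores by hand once. The sketch below does so for made-up true and predicted labels; the helper name is hypothetical.

```python
def per_class_scores(y_true, y_pred, target):
    """Return (precision, recall, F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted target
    fp = sum(t != target and p == target for t, p in pairs)  # predicted target, was not
    fn = sum(t == target and p != target for t, p in pairs)  # missed target
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels and predictions.
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["a", "a", "b", "b", "a", "c"]

scores_b = per_class_scores(toy_true, toy_pred, "b")
scores_c = per_class_scores(toy_true, toy_pred, "c")
print(scores_b)  # (0.5, 0.5, 0.5)
print(scores_c)  # (1.0, 1.0, 1.0)
```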

### Question 5

Train a `Decision Tree Classifier` using `4` as the maximum tree depth.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Limit the tree to 4 levels; random_state makes the result reproducible.
tree = DecisionTreeClassifier(random_state=42, max_depth=4)
tree.fit(X_train_counts, y_train)
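
A decision tree grows by choosing splits that make the resulting groups purer. The sketch below computes Gini impurity, the default splitting criterion of `DecisionTreeClassifier`, for made-up label groups; the helper names are hypothetical and the real implementation also searches over features and thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split, weighting each side by its size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

parent = ["a", "a", "b", "b"]
parent_impurity = gini(parent)

# A split that separates the classes perfectly versus one that does not help.
good_split = weighted_gini(["a", "a"], ["b", "b"])
bad_split = weighted_gini(["a", "b"], ["a", "b"])
print(parent_impurity, good_split, bad_split)  # 0.5 0.0 0.5
```

The tree prefers the first split because it lowers the impurity the most.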

Evaluate the fitted model on the validation set. <ins>Report per-class Precision, Recall, and F1-measure as well as a confusion matrix for your classifiers.</ins>

In [None]:
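# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

pred_tree = tree.predict(X_val_counts)
print(classification_report(y_val, pred_tree))
disp = ConfusionMatrixDisplay.from_estimator(
    tree, X_val_counts, y_val,
    display_labels=[a.split()[0] for a in tree.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)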
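
The plotted confusion matrix is just a table of counts. The sketch below builds one by hand for made-up labels, with rows for the true class and columns for the predicted class, the same layout scikit-learn uses. The helper name is hypothetical.

```python
from collections import Counter

def confusion_rows(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

# Made-up labels and predictions.
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["a", "a", "b", "b", "a", "c"]

matrix = confusion_rows(toy_true, toy_pred, ["a", "b", "c"])
for row in matrix:
    print(row)
# [2, 1, 0]
# [1, 1, 0]
# [0, 0, 1]
```

Correct predictions sit on the diagonal; every off-diagonal cell shows which class was mistaken for which.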

### Question 6

Train a `Logistic Regression Classifier`.

In [None]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(random_state=42)
lr_clf.fit(X_train_counts, y_train)
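
As intuition for what the fitted model computes, the sketch below shows the two-class case: a weighted sum of the word counts is squashed into a probability by the sigmoid function. The weights and bias are made-up numbers, not learned values, and the homework data has more than two classes, where a softmax over per-class scores plays the same role.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for two vocabulary words: the first word speaks
# for the positive class, the second against it.
weights = [1.5, -2.0]
bias = 0.0

p_positive = predict_proba(weights, bias, [2, 0])  # first word appears twice
p_neutral = predict_proba(weights, bias, [0, 0])   # neither word appears
p_negative = predict_proba(weights, bias, [0, 1])  # second word appears once
print(round(p_positive, 3), p_neutral, round(p_negative, 3))
```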

Evaluate the fitted model on the validation set. <ins>Report per-class Precision, Recall, and F1-measure as well as a confusion matrix for your classifiers.</ins>

In [None]:
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

pred_lr = lr_clf.predict(X_val_counts)
print(classification_report(y_val, pred_lr))
disp = ConfusionMatrixDisplay.from_estimator(
    lr_clf, X_val_counts, y_val,
    display_labels=[a.split()[0] for a in lr_clf.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)

### Question 7

Compare the three classifiers on the test set.

In [None]:
# nearest neighbors
pred_nn = neigh.predict(X_test_counts)
# decision tree
pred_tree = tree.predict(X_test_counts)
# logistic regression
pred_lr = lr_clf.predict(X_test_counts)

<ins>Elaborate on which of them performs best. Explain why you consider that one the best performing.</ins>

In [None]:
# compare using classification reports
print('NEAREST NEIGHBORS')
print(classification_report(y_test, pred_nn))
print('DECISION TREE')
print(classification_report(y_test, pred_tree))
print('LOGISTIC REGRESSION')
print(classification_report(y_test, pred_lr))
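
When deciding which classifier is best, keep in mind that accuracy and the macro-averaged F1 from the report can disagree on imbalanced data. The sketch below uses made-up labels and a classifier that always predicts the majority class; the helper name is hypothetical.

```python
def f1_for_class(y_true, y_pred, target):
    """F1 score of one class, 0.0 when the class is never predicted correctly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up imbalanced data; the "classifier" always answers "a".
toy_true = ["a"] * 8 + ["b"] * 2
toy_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
classes = sorted(set(toy_true))
macro_f1 = sum(f1_for_class(toy_true, toy_pred, c) for c in classes) / len(classes)
print(accuracy, round(macro_f1, 3))
```

Accuracy looks respectable here, while the macro average exposes that the minority class is never recognized.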

In [None]:
# compare using confusion matrices
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0.
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_estimator(
    neigh, X_test_counts, y_test,
    display_labels=[a.split()[0] for a in neigh.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)

disp = ConfusionMatrixDisplay.from_estimator(
    tree, X_test_counts, y_test,
    display_labels=[a.split()[0] for a in tree.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)

disp = ConfusionMatrixDisplay.from_estimator(
    lr_clf, X_test_counts, y_test,
    display_labels=[a.split()[0] for a in lr_clf.classes_],
    values_format='.0f',
    cmap=plt.cm.Blues)