# AUTMI Seminar 2020/2021 Spring

## Introduction to Machine Learning

## February 25, 2021

## Data science

> The nontrivial extraction of implicit, previously unknown, and
> potentially useful information from data.

-   non-trivial

-   relationship between data points

-   (large) dataset

-   make predictions on unknown examples



## Examples and counterexamples: convenience store statistics

-   Number of customers

    -   trivial information

-   Last month’s income

    -   trivial information

-   Items most frequently bought together

    -   *finding frequent itemsets*

-   How many cashiers need to be open on Friday at 4 pm?

    -   customer queuing model

    -   *time series modeling*



## Representations

To run machine learning algorithms, the computer needs numerical representations. For natural text input, this means we need a mapping that converts strings to a numerical representation. **One-hot encoding** is the simplest approach: we map each word to an integer id, then represent the word as a vector that is all zeros except for a single 1 at the position of its id.
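
A minimal sketch of one-hot encoding in plain Python. The toy corpus and the helper names `build_vocab` and `one_hot` are made up for illustration; real pipelines usually rely on a library vectorizer instead.

```python
def build_vocab(sentences):
    """Assign each distinct word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the position of the word's id."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


sentences = ["the cat sat", "the dog barked"]  # toy corpus (assumption)
vocab = build_vocab(sentences)
cat_vector = one_hot("cat", vocab)

print(vocab)       # word -> integer id
print(cat_vector)  # one-hot vector for "cat"
```

Note that the vector length equals the vocabulary size, so one-hot vectors become very long and sparse for realistic vocabularies.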

## Vector Representation


- sample  
    - one data point - <span style="color: darkred">vector</span>

- feature  
    - a property or attribute of a sample - <span style="color: darkred">one element of a vector</span>

    - length of the mail

    -   sender

    -   Does it contain the word *Rolex*?

    -   Does it contain the expression *Trust fund*?

- dataset  
    - collection of all samples - <span style="color: darkred">matrix</span>

- label  
    - correct *answers* for all samples in a dataset - <span style="color: darkred">vector</span>
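
The terms above can be sketched with the spam example. The three e-mails, their labels and the function name `extract_features` are invented for illustration, and the *sender* feature is left out to keep the sketch short.

```python
def extract_features(mail_text):
    """Turn one e-mail (a sample) into a feature vector."""
    lowered = mail_text.lower()
    return [
        len(mail_text),                       # feature: length of the mail
        1 if "rolex" in lowered else 0,       # feature: contains the word "Rolex"?
        1 if "trust fund" in lowered else 0,  # feature: contains "Trust fund"?
    ]


# Toy samples and labels (assumptions, not real data)
mails = [
    "Cheap Rolex watches, buy now!",
    "Meeting moved to 3 pm",
    "Claim your trust fund today",
]
labels = [1, 0, 1]  # label vector: 1 = spam, 0 = not spam

# The dataset is a matrix: one row per sample, one column per feature
dataset = [extract_features(mail) for mail in mails]

for row, label in zip(dataset, labels):
    print(row, label)
```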


## Neural networks and Deep learning
- A neural network is inspired by the way biological nervous systems process information
- It is composed of neurons organized in layers, with each layer connected to the next
- Deep learning uses neural networks consisting of multiple layers
    - the idea is not new
    - it has returned thanks to the rise of GPUs
    - and to good frameworks (PyTorch, TensorFlow)

## Deep Learning

- it has a black-box nature
- interpreting them is hard
- we don't exactly know the reasoning behind a decision
- can we trust Deep learning?
- the latest language model of **OpenAI**, **GPT-3**, has 175B trainable parameters [link](https://news.developer.nvidia.com/openai-presents-gpt-3-a-175-billion-parameters-language-model/)

<img src="img/dl/network.png">

This is a __feed-forward neural network__ with two hidden layers. Each neuron applies an activation function to the weighted sum of its inputs:

$$\mathbf{h_1} = \sigma (\mathbf{W_1 x})$$
$$\mathbf{h_2} = \sigma (\mathbf{W_2 h_1})$$
$$\mathbf{y} = \sigma (\mathbf{W_3 h_2})$$

$\sigma$: activation function, typically non-linear such as the sigmoid
function $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

During training, the weights $\mathbf{W_1}$, $\mathbf{W_2}$ and $\mathbf{W_3}$ are learned so that the network predicts the right output for inputs it has not seen before.
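
The three equations above can be sketched in plain Python. The weight values and layer sizes (2 inputs, 3 and 2 hidden units, 1 output) are arbitrary assumptions chosen only to make the example run; in a real network they would be learned.

```python
import math


def sigmoid(x):
    """The sigmoid activation function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(weights, inputs):
    """Compute sigma(W x) for one layer; `weights` is a list of rows of W."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Arbitrary example weights (assumptions)
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]   # 2 inputs -> 3 hidden units
W2 = [[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]]     # 3 hidden -> 2 hidden units
W3 = [[0.6, -0.1]]                            # 2 hidden -> 1 output

x = [1.0, 2.0]
h1 = layer(W1, x)    # h1 = sigma(W1 x)
h2 = layer(W2, h1)   # h2 = sigma(W2 h1)
y = layer(W3, h2)    # y  = sigma(W3 h2)

print(y)
```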

__What is inside a neuron?__

![perceptron](https://c.mql5.com/2/35/artificialneuron__1.gif)


*image from https://www.mql5.com/en/blogs/post/724245*

In [None]:
!pip install transformers==4.0.0

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("The grade for my homework will be", max_length=50, do_sample=False))

In [None]:
# Allocate a pipeline for question-answering
question_answerer = pipeline('question-answering')
question_answerer({
    'question': 'Who went to the store ?',
    'context': 'Adam went to the store yesterday.'})

In [None]:
from transformers import Conversation

chit_chat = pipeline('conversational')
conversation_1 = Conversation("Hi! How are you?")
conversation_2 = Conversation("Do you like movies?")
conversation_3 = Conversation("What is my favourite movie?")
chit_chat([conversation_1, conversation_2, conversation_3])

## ML vs DL

<img src="img/dl/ai_ml_dl.png" />

## ML vs DL

- Deep learning
    - Automatic feature engineering
    - Scalable with big data
    - Can solve problems that are not linearly separable (many traditional methods struggle with non-linearity)
    - Currently most state-of-the-art methods are based on DL
- Traditional Machine Learning
    - Feature extraction is done manually
    - Can learn relatively well from small data (DL usually needs much more)
    - Scalability is worse with big data
    - It can be enough for small tasks

## Learning types 
-   **Supervised learning**

    -   is a problem where for every input variable (x) there is an output
        variable (y) in the training data

    -   the output variables are usually prepared by human
        annotators (labeled data)

-   **Unsupervised learning**

    -   is a problem where only the input variable(x) is present in the
        training data

    -   still can be very useful since labeling data is very resource
        hungry and expensive

## Learning problems

-   __Classification__ (supervised learning)

    -   assign a label for each sample

    -   labels are predefined and usually not very numerous

    -   e.g. sentiment analysis

-   __Regression__ (supervised learning)

    -   predict a continuous variable

    -   e.g. predict real estate prices from location and amenities, or
        stock prices from their history

-   __Clustering__ (unsupervised learning)

    -   group samples into clusters according to a similarity measure
     
    -   e.g. group similar facebook comments
    
    -   goal: high intra-group similarity (samples in the same cluster
        should be similar to each other), low inter-group similarity
        (samples in different clusters shouldn’t be similar)

### Evaluation - Binary classification

<img src="img/dl/true_false.png" />

**Accuracy**: fraction of correctly guessed labels among all the samples

#### __Precision, recall and F-score:__

**Precision**: fraction of positive samples among those labeled positive
$$\text{Precision}=\frac{tp}{tp+fp}$$

**Recall**: fraction of recovered positive samples of all positive
samples $$\text{Recall}=\frac{tp}{tp+fn}$$

**F-score**: harmonic mean of precision and recall
$$\text{F-score} = 2 \cdot \frac{\text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}$$
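
As a worked example, the metrics above can be computed from two label lists. The labels are invented, and the function name `precision_recall_f1` is only illustrative; in practice `sklearn.metrics` provides equivalent functions.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F-score for binary labels (1 = positive)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Toy labels (assumptions): tp = 2, fn = 2, fp = 1, tn = 3
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f_score = precision_recall_f1(true_labels, predicted_labels)
print(accuracy(true_labels, predicted_labels), precision, recall, f_score)
```

Here precision is 2/3, recall is 1/2 and accuracy is 5/8, which shows that the three numbers can differ noticeably on the same predictions.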

### Evaluation - Regression

#### __Root-mean-square error:__

$$\operatorname{RMSE}=\sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2}{n}},$$

where $\hat y_t$ are the predicted values, $y_t$ are the true values and
$n$ is the number of samples.
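
A short sketch of the RMSE formula; the predicted and true values are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true values."""
    n = len(actual)
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / n)


predicted = [2.0, 4.0, 6.0]  # toy predictions (assumption)
actual = [1.0, 4.0, 8.0]     # toy true values (assumption)

# The errors are 1, 0 and -2, so the mean squared error is 5/3
print(rmse(predicted, actual))
```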

## Train, Validation and Test set

- __training set__
    - part of the dataset used for training


- __validation set__
    - part of the dataset used for cross-validation, early stopping and
    hyperparameter tuning


- __test set__
    - part of the dataset used for testing trained models. Your method
    should only be tested once on the test set.
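
One possible way to produce the three sets is to shuffle the samples once and cut the list in two places. The 60/20/20 proportions, the seed and the function name `train_val_test_split` are assumptions made for this sketch.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=1234):
    """Shuffle the samples and split them into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The three parts are disjoint, so a sample used for training never shows up in validation or testing.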

<img src="img/dl/train_test_val.png" />

## Terminology

-   How do ML/DL algorithms learn?

-   **Loss function**: measures the prediction error of our
    network, which tells us how good or bad our model is.

-   We want to **minimize** the loss/cost function.

-   How?

    -   **Gradient descent** iteratively moves the weights towards a minimum
        of the loss function (in general a local one, not necessarily the global one)

    -   **Backpropagation** propagates the error back through the network to
        compute the gradient of the loss with respect to each weight; the
        optimizer then uses these gradients to update the weights
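
A minimal sketch of gradient descent on a one-parameter toy loss, $L(w) = (w - 3)^2$. The loss, starting point, learning rate and number of steps are all assumptions chosen for illustration; a real network has many weights and obtains its gradients through backpropagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # initial weight
learning_rate = 0.1  # step size (hyperparameter)

for step in range(100):
    # Move against the gradient, i.e. in the direction that decreases the loss
    w -= learning_rate * gradient(w)

print(w, loss(w))
```

Because this toy loss is convex, the minimum found is also the global one; for neural network losses that is not guaranteed.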

## Important concepts in Machine Learning

- __Cost function__: used to measure how badly our model is performing on the data
- __Parameters__: variables that are updated during training
- __Sample__: a single row in our data
- __Batch size__: the number of samples our model works through before updating the weights
- __Epoch__: one epoch means that each sample in the training dataset has passed through the model once
- __Iteration__: one update of the weights. It happens once for each batch.
- __Hyperparameters__: variables that are set before training and are not learned (number of epochs, batch size, learning rate, etc.)

- __Gradient descent__: used to find a minimum of the cost function (not guaranteed to be the global one)
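
The relationship between batch size, iterations and epochs is simple arithmetic. The numbers below (1000 samples, batch size 64, 10 epochs) are made up, and the sketch assumes the last, smaller batch is kept rather than dropped.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_samples = 1000  # assumption
batch_size = 64     # assumption
n_epochs = 10       # assumption

per_epoch = iterations_per_epoch(num_samples, batch_size)
total_iterations = per_epoch * n_epochs

print(per_epoch, total_iterations)
```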

<img src="img/dl/gradient.gif?raw=true" />

## Over- and Underfitting?
![under_over](https://miro.medium.com/max/2400/1*JZbxrdzabrT33Yl-LrmShw.png)
_(image from [link](https://miro.medium.com/max/2400/1*JZbxrdzabrT33Yl-LrmShw.png))_

## Recurrent neural networks

In NLP, recurrent neural networks (RNNs) are commonly used to analyse sequences. An RNN takes in a sequence of words, one at a time, and produces a hidden state ($h$) after each step. The same network is applied recurrently: at every step it is fed the current word and the hidden state from the previous word.

Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear (fully connected) layer, $f$, which maps it to a vector with one entry per label.
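
The recurrence can be sketched in plain Python. This assumes the common update $h_t = \tanh(W_x x_t + W_h h_{t-1})$, which the text above does not spell out; the weights, the 3-dimensional word vectors and the 2-dimensional hidden state are arbitrary choices for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_x, W_h):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev)."""
    pre_activation = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    return [math.tanh(value) for value in pre_activation]


# Arbitrary example weights (assumptions)
W_x = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]  # input -> hidden
W_h = [[0.5, -0.3], [0.2, 0.1]]             # hidden -> hidden
W_out = [[1.0, -1.0], [-0.5, 0.5]]          # linear layer f: hidden -> 2 labels

# A "sentence" of three words, each already converted to a vector
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h = [0.0, 0.0]  # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h)  # the same weights are reused at every step

scores = matvec(W_out, h)  # one score per label, computed from the final h_T
print(h, scores)
```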

![rnn](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment1.png)

_(image from bentrevett)_

![rnn2](https://miro.medium.com/max/1400/1*WMnFSJHzOloFlJHU6fVN-g.gif)

![rnn3](https://miro.medium.com/max/770/1*o-Cq5U8-tfa1_ve2Pf3nfg.gif)

### LSTM

One of the biggest problems of recurrent neural networks is the vanishing gradient problem: the gradient shrinks as it is propagated back through the time steps during backpropagation. If it becomes very small, the network stops learning. This mostly happens with long sentences. LSTM networks address this problem with an inner memory cell that remembers important information and forgets the rest. An LSTM has a similar flow to an RNN: it processes data and passes information on as it propagates forward. The difference is in the operations within the cells.

![lstm](https://miro.medium.com/max/770/1*0f8r3Vd-i4ueYND1CUrhMA.png)



__LSTM__ consists of:

- __Forget gate__
    - Decides what information should be kept or thrown away
    - Information from the previous hidden state and from the current input

![forget](https://miro.medium.com/max/770/1*GjehOa513_BgpDDP6Vkw2Q.gif)


- __Input gate__
    - Decides what information is relevant to add from the current step

![input](https://miro.medium.com/max/770/1*TTmYy7Sy8uUXxUXfzmoKbA.gif)


- __Cell state__

![cell](https://miro.medium.com/max/770/1*S0rXIeO_VoUVOyrYHckUWg.gif)

- __Output gate__
    - Determines what the next hidden state should be

![lstm2](https://miro.medium.com/max/770/1*VOXRGhOShoWWks6ouoDN3Q.gif)

_(images from [link](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21))_
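
The gates can be sketched as a single LSTM step on scalar values, following the standard LSTM equations. Real LSTMs use vectors and weight matrices; the scalar parameters and the `(w_x, w_h, b)` layout of `params` are simplifications made up for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step on scalars; params maps a gate name to (w_x, w_h, b)."""
    def gate_input(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(gate_input("forget"))       # forget gate: how much of c_prev to keep
    i = sigmoid(gate_input("input"))        # input gate: how much new information to add
    g = math.tanh(gate_input("candidate"))  # candidate values for the cell state
    c = f * c_prev + i * g                  # new cell state
    o = sigmoid(gate_input("output"))       # output gate
    h = o * math.tanh(c)                    # new hidden state
    return h, c


# Arbitrary example parameters (assumptions)
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.3, -0.2, 0.0),
    "candidate": (0.8, 0.4, 0.0),
    "output": (0.6, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x_t, h, c, params)

print(h, c)
```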

## Building a classification pipeline

The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the original AG news corpus. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, so there are 120,000 training samples and 7,600 testing samples in total.


In [None]:
!pip install torchtext==0.4
!pip install torch
!pip install pandas
!pip install gensim
!pip install scikit-learn
!pip install eli5

First we are going to download the dataset using [torchtext](https://pytorch.org/text/stable/index.html):

In [None]:
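import os
from torchtext.datasets import text_classification

NGRAMS = 2  # n-gram size passed to torchtext when it builds its vocabulary

# Create the target directory if it does not exist yet
os.makedirs('./data', exist_ok=True)

# Download the AG_NEWS dataset into ./data
text_classification.DATASETS['AG_NEWS'](
    root='./data', ngrams=NGRAMS, vocab=None)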

In [None]:
#Import the needed libraries
from tqdm import tqdm
from sklearn.model_selection import train_test_split as split
import numpy as np
import pandas as pd

Now we use [pandas](https://pandas.pydata.org/) to read the dataset into a DataFrame. To keep the experiments fast, we keep only a 20% sample of each class.

In [None]:
train_data = pd.read_csv("./data/ag_news_csv/train.csv",quotechar='"', names=['label', 'title', 'description'])
test_data = pd.read_csv("./data/ag_news_csv/test.csv",quotechar='"', names=['label', 'title', 'description'])

train_data = train_data.groupby('label').apply(lambda x: x.sample(frac=0.2, random_state=1234)).sample(frac=1.0)
test_data = test_data.groupby('label').apply(lambda x: x.sample(frac=0.2, random_state=1234)).sample(frac=1.0)

In [None]:
train_data

We need a way of converting raw data to features!
![features](https://developers.google.com/machine-learning/crash-course/images/RawDataToFeatureVector.svg)

The easiest way of converting raw data to features is called the [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model.
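
A minimal sketch of the idea: a text is represented by how often each vocabulary word occurs in it, ignoring word order. The four-word vocabulary, the example sentence and the function name `bag_of_words` are made up for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that do not occur in the text
    return [counts[word] for word in vocabulary]


vocabulary = ["oil", "prices", "rise", "fall"]  # toy vocabulary (assumption)
vector = bag_of_words("Oil prices rise as oil supply falls", vocabulary)

print(vector)  # one count per vocabulary word
```

Words outside the vocabulary ("as", "supply") are ignored, and "falls" does not match "fall" because no stemming is applied.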

In [None]:
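# Map every word in the training titles to a unique integer index.
# A plain dict is used so that looking up an unknown word cannot
# silently add it to the vocabulary.
word_to_ix = {}
for sent in train_data.title:
    for word in sent.split():
        if word not in word_to_ix:
            # The next free index equals the current vocabulary size
            word_to_ix[word] = len(word_to_ix)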

We are going to use Python's machine learning library, [scikit-learn](https://scikit-learn.org/stable/), to build a classical ML pipeline.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000)

X = vectorizer.fit(train_data.title)

In [None]:
X.vocabulary_

In [None]:
c = X.transform(["Hello my name is adam"]).toarray()
print(c)

We first build a __featurizer__ that takes raw texts as input and builds features for the whole dataset.

In [None]:
from sklearn.model_selection import train_test_split as split

def vectorize_to_bow(tr_data, tst_data):
    # X is the CountVectorizer fitted above; transform returns sparse count matrices
    tr_vectors = X.transform(tr_data)
    tst_vectors = X.transform(tst_data)
    return tr_vectors, tst_vectors

def get_features_and_labels(data, labels):
    # Hold out 30% of the samples for testing
    tr_data, tst_data, tr_labels, tst_labels = split(data, labels, test_size=0.3, random_state=1234)

    # Convert the raw texts of both parts to bag-of-words vectors
    tr_vecs, tst_vecs = vectorize_to_bow(tr_data, tst_data)
    return tr_vecs, tr_labels, tst_vecs, tst_labels

In [None]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.title, train_data.label)

In [None]:
#Import a bunch of stuff from sklearn
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf  =  RandomForestClassifier(n_estimators=100, verbose=True, n_jobs=-1)
svc = SVC()
lr  = LogisticRegression(n_jobs=-1)

In [None]:
#And then we run the training of the algorithms with the provided vectors and the labels
rf.fit(tr_vecs, tr_labels)
svc.fit(tr_vecs, tr_labels)
lr.fit(tr_vecs, tr_labels)

In [104]:
import eli5
eli5.show_weights(lr, feature_names=X.get_feature_names())

| Weight? | Feature |
|---------|---------|
| +2.608 | arafat |
| +2.514 | iran |
| +2.456 | iraq |
| +2.255 | hostage |
| +2.238 | afp |
| +2.158 | blair |
| +2.136 | darfur |
| +2.122 | iraqi |
| +2.103 | troops |
| +2.079 | fallujah |

| Weight? | Feature |
|---------|---------|
| +2.073 | nba |
| +2.048 | cup |
| +1.942 | coach |
| +1.922 | nhl |
| +1.902 | nascar |
| +1.872 | sox |
| +1.718 | bryant |
| +1.707 | glazer |
| +1.670 | wenger |
| +1.654 | nfl |

| Weight? | Feature |
|---------|---------|
| +2.270 | dollar |
| +2.168 | forbes |
| +2.072 | enron |
| +2.009 | tax |
| +1.862 | economy |
| +1.858 | oil |
| +1.807 | growth |
| +1.780 | yukos |
| +1.774 | opec |
| +1.759 | jobless |

| Weight? | Feature |
|---------|---------|
| +2.805 | nasa |
| +2.540 | space |
| +2.430 | spam |
| +2.203 | linux |
| +2.162 | microsoft |
| +2.089 | yahoo |
| +2.003 | nokia |
| +1.992 | spyware |
| +1.902 | science |
| +1.895 | spammers |


After training we can run the trained models on the test dataset to get the results!

In [None]:
from sklearn.metrics import accuracy_score
print(type(tst_vecs))
rf_pred = rf.predict(tst_vecs)
svc_pred = svc.predict(tst_vecs)
lr_pred = lr.predict(tst_vecs)
print("Random Forest Test accuracy : {}".format(accuracy_score(tst_labels, rf_pred)))
print("SVC Test accuracy : {}".format(accuracy_score(tst_labels, svc_pred)))
print("Logistic Regression Test accuracy : {}".format(accuracy_score(tst_labels, lr_pred)))

Bag of words is the simplest method for featurizing your data. If we want a more sophisticated method, we could use [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which weights each word count by how rare the word is across the documents.


![tfidf](https://miro.medium.com/max/700/1*qQgnyPLDIkUmeZKN2_ZWbQ.png)
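
To make the weighting concrete, here is a sketch of the textbook variant $\text{tf} \cdot \log(N / \text{df})$ on a made-up three-document corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ from this sketch.

```python
import math


def tf_idf(term, document, corpus):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = document.count(term) / len(document)          # how frequent in this document
    df = sum(1 for doc in corpus if term in doc)       # how many documents contain it
    idf = math.log(len(corpus) / df) if df else 0.0    # rare terms get a higher weight
    return tf * idf


# Toy corpus of tokenized titles (assumption)
corpus = [
    "stocks rise on oil news".split(),
    "oil prices fall".split(),
    "team wins cup final".split(),
]

oil_score = tf_idf("oil", corpus[1], corpus)  # "oil" occurs in 2 of 3 documents
cup_score = tf_idf("cup", corpus[2], corpus)  # "cup" occurs in 1 of 3 documents

print(oil_score, cup_score)
```

"cup" gets the higher score because it appears in only one document, which makes it more informative about that document's topic.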

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, use_idf=True)
vectors = vectorizer.fit(train_data.title)

tfidf_vectorizer_vectors = vectors.transform(train_data.title)

In [None]:
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[5] 
 
# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=vectors.get_feature_names(), columns=["tfidf"]) 
df = df.sort_values(by=["tfidf"],ascending=False)
df

Sklearn allows us to build [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) by defining each step of the pipeline, such as:
- Vectorizers
- Classifiers
- Voting strategies
- Optionally merge feature extraction from multiple sources
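
As an illustration of a voting strategy, hard (majority) voting can be sketched in a few lines. The predictions below are invented, and the tie-breaking here (the first label counted wins) is a simplification that may differ from scikit-learn's `VotingClassifier`.

```python
from collections import Counter


def hard_vote(votes):
    """Return the label predicted by the most classifiers."""
    return Counter(votes).most_common(1)[0][0]


# Predictions of three hypothetical classifiers for the same four samples
model_predictions = [
    [1, 2, 3, 4],  # classifier A
    [1, 2, 4, 4],  # classifier B
    [2, 2, 3, 1],  # classifier C
]

# zip(*...) groups the votes that belong to the same sample
ensemble = [hard_vote(votes) for votes in zip(*model_predictions)]
print(ensemble)
```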

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hard voting: the ensemble predicts the class chosen by the majority of the classifiers
eclf = VotingClassifier(estimators=[('lr', lr), ('svc', svc), ('rf', rf)], voting='hard')

tr_data, tst_data, tr_labels, tst_labels = split(train_data.title, train_data.label, test_size=0.3, random_state=1234)

# svc is SVC() with its default RBF kernel, so it is labeled 'SVC' here
for clf, label in zip([lr, svc, rf, eclf], ['Logistic Regression', 'SVC', 'Random Forest', 'Ensemble']):
    # Each pipeline vectorizes the raw titles with TF-IDF, then trains the classifier
    checker_pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(max_features=10000,ngram_range=(1, 3))),
            ('classifier', clf)
        ])
    print("Validation result for {}".format(label))
    checker_pipeline.fit(tr_data, tr_labels)

    tst_pred = checker_pipeline.predict(tst_data)
    print("{} Test accuracy : {}".format(label, accuracy_score(tst_labels, tst_pred)))

## Deep learning model using [PyTorch](https://pytorch.org/)

In [None]:
import torch

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000)

# Despite its name, word_to_ix is the fitted vectorizer; the word -> index
# mapping itself is stored in word_to_ix.vocabulary_
word_to_ix = vectorizer.fit(train_data.title)

VOCAB_SIZE = len(word_to_ix.vocabulary_)
NUM_LABELS = 4

# Hold out 30% of the training data for validation
tr_data, val_data = split(train_data, test_size=0.3, random_state=1234)

In [None]:
tr_data_vecs = torch.FloatTensor(word_to_ix.transform(tr_data.title).toarray())
tr_labels = tr_data.label.tolist()

val_data_vecs = torch.FloatTensor(word_to_ix.transform(val_data.title).toarray())
val_labels = val_data.label.tolist()

tr_data_loader = [(sample, label-1) for sample, label in zip(tr_data_vecs, tr_labels)]
val_data_loader = [(sample, label-1) for sample, label in zip(val_data_vecs, val_labels)]

In [None]:
BATCH_SIZE = 64

from torch.utils.data import DataLoader, TensorDataset


train_iterator = DataLoader(tr_data_loader,
                            batch_size=BATCH_SIZE,
                            shuffle=True,
                            )

valid_iterator = DataLoader(val_data_loader,
                          batch_size=BATCH_SIZE,
                          shuffle=False,
                          )

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
from torch import nn
import torch.nn.functional as F  # needed for F.log_softmax in forward()

class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [None]:
INPUT_DIM = VOCAB_SIZE
OUTPUT_DIM = 4

model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
criterion = nn.NLLLoss()

In [None]:
model = model.to(device)
criterion = criterion.to(device)

In [None]:
from sklearn.metrics import classification_report
def class_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    rounded_preds = preds.argmax(1)
    correct = (rounded_preds == y).float() #convert into float for division 
    #target_names = ['class 0', 'class 1', 'class 2', 'class 3']
    #print(classification_report(rounded_preds.cpu().numpy(), y.cpu().numpy(), target_names=target_names))
    acc = correct.sum() / len(correct)
    return acc

In [None]:
from sklearn.metrics import accuracy_score
import torch.nn.functional as F
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
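    for texts, labels in iterator:
        
        # Move the batch to the same device as the model
        texts, labels = texts.to(device), labels.to(device)
        
        optimizer.zero_grad()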
                
        predictions = model(texts)

        loss = criterion(predictions, labels)
        
        acc = class_accuracy(predictions, labels)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
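        for texts, labels in iterator:
            # Move the batch to the same device as the model
            texts, labels = texts.to(device), labels.to(device)
            predictions = model(texts)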
            loss = criterion(predictions, labels)
            
            acc = class_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 50

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')