<a href="https://colab.research.google.com/github/aidarvaleev1998/AML-DS-2021/blob/main/FinalExam_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <center>Final Exam Lab</center>
```
- Advanced Machine Learning, Innopolis University 
- Professor: Muhammad Fahim 
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>

```
Tasks:
  1. Data Preprocessing (5 points)
  2. Conditional Generative adversarial network definition (5 points)
  3. Conditional Generative adversarial network training (10 points)
  4. Text explainer implementation using LIME or SHAP (5 bonus points)
```

<hr>

## The Dataset

For this task the 20 newsgroups text dataset is used. [LINK](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

import nltk, string, re
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Available device : {device}")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Available device : cuda


## Task 1: Preprocessing of Dataset (5 points)



1.  Loading and cleaning of Text data:
    * Choose 4 categories from the dataset  
    * Implement a function `clean_text` that lowercases the text and removes punctuation, extra whitespace, and stopwords
    * Plot the distribution of classes/categories
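
The plotting step in the list above can be sketched as follows. This is a minimal illustration, not part of the original solution: `class_distribution` and `plot_class_distribution` are hypothetical helper names, the toy labels stand in for `newsgroups_train.target`, and matplotlib is assumed to be installed (it is imported lazily so the counting helper works without it).

```python
from collections import Counter


def class_distribution(labels, class_names):
    """Return {class name: number of documents} for integer labels."""
    counts = Counter(labels)
    return {name: counts.get(idx, 0) for idx, name in enumerate(class_names)}


def plot_class_distribution(distribution, title="Class distribution"):
    """Draw a bar chart of the counts; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.bar(list(distribution.keys()), list(distribution.values()))
    ax.set_xlabel("category")
    ax.set_ylabel("number of documents")
    ax.set_title(title)
    fig.tight_layout()
    return fig


# Toy labels standing in for newsgroups_train.target (assumption for illustration)
example_labels = [0, 1, 1, 2, 3, 3, 3]
example_names = ["alt.atheism", "comp.graphics", "rec.autos", "sci.space"]
example_distribution = class_distribution(example_labels, example_names)
```

In the notebook, the same helpers could be called as `plot_class_distribution(class_distribution(train_labels, newsgroups_train.target_names))` once the labels have been collected.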

In [2]:
categories = ['alt.atheism', 'comp.graphics', 'rec.autos', 'sci.space'] #TODO: Choose 4 categories from the dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_valid = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))


In [3]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """ Function to perform common NLP pre-processing tasks. """
    # make lowercase
    text = text.lower()
    # remove punctuation (replaced by spaces so that words are not glued together)
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # tokenize; this also discards repeated whitespace
    tokens = word_tokenize(text)
    # remove stopwords and short words (3 characters or fewer)
    tokens = [w for w in tokens if w not in STOP_WORDS and len(w) > 3]
    return " ".join(tokens)

In [4]:
train_sentences = []
validation_sentences = []

train_labels = []
validation_labels = []


# Clean training sentences; documents that are empty after cleaning are dropped
for raw_text, label in zip(newsgroups_train.data, newsgroups_train.target):
    text = clean_text(raw_text)
    if text:
        train_sentences.append(text)
        train_labels.append(label)

# Clean validation sentences; documents that are empty after cleaning are dropped
for raw_text, label in zip(newsgroups_valid.data, newsgroups_valid.target):
    text = clean_text(raw_text)
    if text:
        validation_sentences.append(text)
        validation_labels.append(label)

## Create vocabulary

In [5]:
# Create tokenizer
en_tokenizer = get_tokenizer('spacy', language='en')

# Create vocabulary
def build_vocab(sentences, tokenizer):
    counter = Counter()
    for sentence in sentences:
        counter.update(tokenizer(sentence))
    return Vocab(counter, specials=['<unk>', '<pad>'])

vocabulary = build_vocab(train_sentences, en_tokenizer)

In [6]:
# The special tokens '<unk>' and '<pad>' are already counted in len(vocabulary)
VOCAB_SIZE = len(vocabulary)
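
To make the token-to-index mapping concrete, here is a small hand-rolled sketch of a counter-based vocabulary. It is a stand-in for illustration only and not torchtext's implementation: `build_index` and `encode` are hypothetical names, tokens are split on whitespace, and the ordering rule (special tokens first, then by decreasing frequency) is an assumption of this sketch.

```python
from collections import Counter


def build_index(sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer id; special tokens get the first ids."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.split())
    # Most frequent tokens first; ties are broken alphabetically
    ordered = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [token for token, _ in ordered]
    stoi = {token: idx for idx, token in enumerate(itos)}
    return stoi, itos


def encode(sentence, stoi):
    """Convert a sentence to ids; unseen tokens fall back to '<unk>'."""
    unk_id = stoi["<unk>"]
    return [stoi.get(token, unk_id) for token in sentence.split()]


stoi, itos = build_index(["space shuttle launch", "space engine"])
ids = encode("space rocket launch", stoi)  # "rocket" was never seen
```

The size of the mapping here counts the special tokens together with the regular ones, which is the number an embedding layer needs as its first dimension.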

## Add padding 

In [7]:
max_len = 128

# Add Padding 
def create_dataset(sentences, labels, en_tokenizer, vocab, max_len=128):
    res = []
    for sentence in sentences:
        sentence_tokens = [vocab[token] for token in en_tokenizer(sentence)]
        if len(sentence_tokens) <= max_len:
            sentence_tokens = sentence_tokens + [vocab['<pad>']]*(max_len-len(sentence_tokens))
        else:
            sentence_tokens = sentence_tokens[:max_len]
        sentence_tensor = torch.tensor(sentence_tokens,dtype=torch.long)
        res.append(sentence_tensor)
        
    return TensorDataset(torch.stack(res),torch.from_numpy(np.array(labels)))

BATCH_SIZE = 128
PAD_IDX = vocabulary['<pad>']

train_dataset = create_dataset(train_sentences,train_labels, en_tokenizer, vocabulary)
validation_dataset = create_dataset(validation_sentences, validation_labels, en_tokenizer, vocabulary)
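
The pad-or-truncate rule used in `create_dataset` can be seen on plain Python lists. This is a minimal sketch; `pad_or_truncate` is a hypothetical helper name and the ids are made up.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return a list of exactly max_len ids."""
    if len(token_ids) >= max_len:
        # Too long: keep only the first max_len tokens
        return token_ids[:max_len]
    # Too short: fill the tail with the padding id
    return token_ids + [pad_id] * (max_len - len(token_ids))


short_example = pad_or_truncate([5, 9, 4], max_len=5, pad_id=1)
long_example = pad_or_truncate([5, 9, 4, 7, 8, 6, 3], max_len=5, pad_id=1)
```

Every sequence ends up with the same length, which is what allows `torch.stack` to build a single tensor from the list of sentences.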

In [None]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False)

## Task 2: Conditional Generative adversarial network definition (5 points)

1.  Models Definition:
    * Define the Generator & Discriminator networks (architecture of your choice)

In [14]:
# TODO: Implement the Generator & Discriminator class
class Generator(nn.Module):
    # initializers
    def __init__(self, emb_dim, hidden_dim): # noise_dim = emb_dim
        super(Generator, self).__init__()
        self.hidden_dim = hidden_dim
        self.embeddings = nn.Embedding(VOCAB_SIZE, emb_dim, padding_idx=PAD_IDX)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.lstm2out = nn.Linear(hidden_dim, VOCAB_SIZE)
        self.softmax = nn.LogSoftmax(dim=-1)
        self.cond_emb = nn.Embedding(4, emb_dim)

    # forward method. The condition is prepended to the input as an extra time step
    def forward(self, x, c, hidden, need_hidden=False):
        if x.dim() == 1:
            x = x.unsqueeze(1)  # batch_size * 1
        emb = self.embeddings(x)  # batch_size * seq_len * emb_dim
        c = self.cond_emb(c).unsqueeze(1)  # batch_size * 1 * emb_dim
        emb = torch.cat([c, emb], 1)  # batch_size * (1 + seq_len) * emb_dim

        out, hidden = self.lstm(emb, hidden)  # out: batch_size * (1 + seq_len) * hidden_dim
        out = out[:, 1:, :]  # drop the output at the condition position
        out = out.contiguous().view(-1, self.hidden_dim)  # out: (batch_size * seq_len) * hidden_dim
        out = self.lstm2out(out)  # (batch_size * seq_len) * vocab_size
        pred = self.softmax(out)  # log-probabilities over the vocabulary

        if need_hidden:
            return pred, hidden
        else:
            return pred

    def sample(self, c, num_samples, batch_size):
        # c must hold one condition label per sequence in a batch (shape: batch_size)
        num_batch = num_samples // batch_size + 1 if num_samples != batch_size else 1
        samples = torch.zeros(num_batch * batch_size, max_len, dtype=torch.long, device=c.device)

        # Generate sentences with multinomial sampling strategy
        for b in range(num_batch):
            hidden = (
                torch.zeros(1, batch_size, self.hidden_dim, device=c.device),
                torch.zeros(1, batch_size, self.hidden_dim, device=c.device)
            )
            # The vocabulary defines no start token, so '<pad>' is used as the start symbol here
            inp = torch.full((batch_size,), PAD_IDX, dtype=torch.long, device=c.device)

            for i in range(max_len):
                out, hidden = self.forward(inp, c, hidden, need_hidden=True)  # out: batch_size * vocab_size
                next_token = torch.multinomial(torch.exp(out), 1)  # batch_size * 1 (sampling from each row)
                samples[b * batch_size:(b + 1) * batch_size, i] = next_token.view(-1)
                inp = next_token.view(-1)
        samples = samples[:num_samples]

        return samples

class Discriminator(nn.Module):
    # initializers
    def __init__(self, emb_dim, hidden_dim, dropout):
        super(Discriminator, self).__init__()
        self.hidden_dim = hidden_dim
        
        self.embeddings = nn.Embedding(VOCAB_SIZE, emb_dim, padding_idx=PAD_IDX)
        self.cond_emb = nn.Embedding(4, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout)
        self.dense = nn.Linear(2 * 2 * hidden_dim, 1)
        self.dropout = nn.Dropout(dropout)

    # forward method. The condition is appended to the input as an extra time step
    def forward(self, x, c):
        emb = self.embeddings(x)  # batch_size * seq_len * emb_dim
        emb = emb.permute(1, 0, 2)  # seq_len * batch_size * emb_dim

        c = self.cond_emb(c).reshape((1, c.shape[0], -1))  # 1 * batch_size * emb_dim
        out = torch.cat([emb, c], 0)  # (seq_len + 1) * batch_size * emb_dim
        _, hidden = self.gru(out)  # 4 * batch_size * hidden_dim
        hidden = hidden.permute(1, 0, 2).contiguous()  # batch_size * 4 * hidden_dim
        out = self.dense(hidden.view(-1, 4 * self.hidden_dim))
        out = torch.sigmoid(out)  # probability of "real"; nn.BCELoss expects values in [0, 1]
        return out

# define discriminator and generator
# TODO: specify the input and output size

D = Discriminator(emb_dim=32, hidden_dim=32, dropout=0.3).to(device).float()
G = Generator(emb_dim=32, hidden_dim=32).to(device).float()

print(G)
print()
print(D)


## Task 3: Conditional Generative adversarial network training (10 points)

* Implement the Conditional Generative adversarial network training procedure 
* Define the optimizers for Generator and Discriminator network
* Define the loss functions
* Add TensorBoard logging for the Generator and Discriminator losses (for both training and validation). For the discriminator, the losses on fake and real samples should be logged separately

**NOTE:** It is not important that the loss decreases during the training loop for this task. It is important that the training procedure is correctly implemented
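
As a worked illustration of the loss terms listed above, the binary cross-entropy can be evaluated by hand for a single sample. This is a sketch under assumed numbers: the discriminator outputs `0.9` and `0.2` are made up, and `bce` is a hypothetical helper that mirrors the standard binary cross-entropy formula rather than calling PyTorch.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 target."""
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


# Assumed discriminator outputs (probability that the input is real)
d_on_real = 0.9   # D(x, y) for a real sentence
d_on_fake = 0.2   # D(G(z, y), y) for a generated sentence

# Discriminator: real samples are labelled 1, fake samples 0
d_real_loss = bce(d_on_real, 1.0)
d_fake_loss = bce(d_on_fake, 0.0)
d_loss = d_real_loss + d_fake_loss

# Generator: same fake sample, but with the flipped label 1
g_loss = bce(d_on_fake, 1.0)
```

With these numbers the discriminator is doing well, so its two loss terms are small while the generator loss is large; the generator loss shrinks only when the discriminator assigns a higher probability to generated samples.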

In [12]:
from torch.utils.tensorboard import SummaryWriter
import torch.optim as optim


%load_ext tensorboard

writer = SummaryWriter()

# params
learning_rate = 0.0001
n_epochs = 10
# TODO: Create optimizers for the discriminator and generator
d_optimizer = optim.Adam(D.parameters(), learning_rate)
g_optimizer = optim.SGD(G.parameters(), learning_rate)

# fixed noise for validation 
fixed_noise = torch.normal(0,1, (len(validation_dataset), 32), dtype=torch.float, device=device)

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard



In [13]:
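## TODO: Implement the training procedure and log train & validation loss using tensorboard
loss_function = nn.BCELoss()
for epoch in range(n_epochs):
    G.train()
    D.train()
    for x,y in train_loader:
        x = x.to(device)
        y = y.to(device)
        ones = torch.ones(x.shape[0], 1, device=device)    # target for "real"
        zeros = torch.zeros(x.shape[0], 1, device=device)  # target for "fake"

        # TRAIN THE DISCRIMINATOR
        # Step 1: Zero gradients (zero_grad)
        # Step 2: Train with real
        # Step 3: Compute the discriminator losses on real
        d_optimizer.zero_grad()
        D_real = D(x, y)
        d_real_loss = loss_function(D_real, ones)

        # Step 4: Train with fake
        # Step 5: Generate fake and move it to GPU, if available
        # Step 6: Compute the discriminator losses on fake
        # Step 7: add up loss and perform backprop
        fake = G.sample(y, y.shape[0], y.shape[0]).to(device)
        D_fake = D(fake, y)
        d_fake_loss = loss_function(D_fake, zeros)

        d_loss = d_real_loss + d_fake_loss
        d_loss.backward()
        d_optimizer.step()

        # TRAIN THE GENERATOR (Train with fake and flipped labels)
        # Step 1: Zero gradients of the Generator
        # Step 2: Generate fake samples
        # Step 3: Compute the discriminator losses on fake using flipped labels!
        # Step 4: Perform backprop and take optimizer step
        g_optimizer.zero_grad()
        fake = G.sample(y, y.shape[0], y.shape[0]).to(device)
        D_g = D(fake, y)
        g_loss = loss_function(D_g, ones)
        # NOTE: `fake` holds sampled token ids, which are discrete, so this backward
        # pass does not propagate gradients into G. A differentiable relaxation
        # (e.g. Gumbel-softmax) or a policy-gradient estimator is needed for G to learn.
        g_loss.backward()
        g_optimizer.step()

    # validation: the same three losses averaged over the validation set, without updates
    G.eval()
    D.eval()
    val_d_real, val_d_fake, val_g, n_batches = 0.0, 0.0, 0.0, 0
    with torch.no_grad():
        for x, y in validation_loader:
            x = x.to(device)
            y = y.to(device)
            ones = torch.ones(x.shape[0], 1, device=device)
            zeros = torch.zeros(x.shape[0], 1, device=device)
            fake = G.sample(y, y.shape[0], y.shape[0]).to(device)
            D_fake = D(fake, y)
            val_d_real += loss_function(D(x, y), ones).item()
            val_d_fake += loss_function(D_fake, zeros).item()
            val_g += loss_function(D_fake, ones).item()
            n_batches += 1

    # training losses are those of the last batch of the epoch
    writer.add_scalar("Train Loss D Real", d_real_loss.item(), epoch)
    writer.add_scalar("Train Loss D Fake", d_fake_loss.item(), epoch)
    writer.add_scalar("Train Loss G", g_loss.item(), epoch)
    writer.add_scalar("Validation Loss D Real", val_d_real / n_batches, epoch)
    writer.add_scalar("Validation Loss D Fake", val_d_fake / n_batches, epoch)
    writer.add_scalar("Validation Loss G", val_g / n_batches, epoch)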


## Launch Tensorboard

In [None]:
%tensorboard --logdir ./runs

## Task 4 (Optional): Text explainer implementation using LIME or SHAP (5 bonus points)

The [20 newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) text dataset is used.
Create a simple multi-class classifier (e.g. a decision tree or random forest) and explain the classifier's predictions with the help of LIME or SHAP.

**Note:** Use TF-IDF for feature extraction
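
One possible shape for this task is sketched below. It is an illustration under stated assumptions, not the expected solution: the six-sentence corpus is made up so that the example runs without downloading the dataset, the random forest settings are arbitrary, `explain_with_lime` is a hypothetical helper, and the `lime` package is assumed to be installed (it is imported only when the helper is called).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the 20 newsgroups documents
texts = [
    "the engine and brakes of this car need service",
    "new tires improved how the car handles on the road",
    "the shuttle launch put a satellite into orbit",
    "nasa plans a mission to study the moon and mars",
    "rendering polygons with a fast graphics card",
    "the image format stores pixels and color palettes",
]
labels = [0, 0, 1, 1, 2, 2]
class_names = ["rec.autos", "sci.space", "comp.graphics"]

# TF-IDF features followed by a random forest, wrapped in one pipeline so that
# the explainer can pass raw strings straight to predict_proba
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(texts, labels)


def explain_with_lime(text, fitted_pipeline, names, num_features=5):
    """Return the top predicted label and its (word, weight) pairs from LIME."""
    from lime.lime_text import LimeTextExplainer  # requires the lime package

    explainer = LimeTextExplainer(class_names=names)
    explanation = explainer.explain_instance(
        text, fitted_pipeline.predict_proba, num_features=num_features, top_labels=1
    )
    top_label = explanation.available_labels()[0]
    return top_label, explanation.as_list(label=top_label)


probabilities = pipeline.predict_proba(["the rocket reached orbit"])
```

For the actual task, `texts` and `labels` would be replaced by `newsgroups_train.data` and `newsgroups_train.target`, and `explain_with_lime` could be called on a document from `newsgroups_test.data`.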

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

clf = None

## <center>Solution should be pushed to github and link to github submitted to Moodle</center>