# Part 1: Getting and preparing the data

## How to create a spider using Scrapy

[Scrapy](https://scrapy.org/) is a powerful tool to scrape web pages.

Define a simple spider like the one in [```spider.py```](/edit/spider.py) and then simply run:

```
$ scrapy runspider spider.py -o dataset-raw.json
```

This will save all objects created by the spider into a list in the ```dataset-raw.json``` file.
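
Before going further, it can help to check what ended up in that file. The sketch below is a minimal helper, not part of the repository: the function names `load_raw_articles` and `summarize` are made up for illustration, and it only assumes that the file contains a JSON list of objects, as described above.

```python
import json


def load_raw_articles(path):
    """Load the JSON list of scraped items from the given file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(articles):
    """Return the number of items and the sorted set of keys they use."""
    keys = set()
    for article in articles:
        keys.update(article.keys())
    return len(articles), sorted(keys)


# Usage, once the spider has produced the file:
# articles = load_raw_articles("dataset-raw.json")
# print(summarize(articles))
```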

## Taking a look at the data

A sample article looks like this JSON object:
```javascript
{
    "title": "<h1 class=\"short \" itemprop=\"name headline\">L\u2019\u00e9pouvantail de l\u2019\u00abinstabilit\u00e9\u00bb</h1>",
    "body": "<div class=\"article-main-txt\" id=\"StoryDetailBody5b15f54930eeaf4a6daf8d8b\">\n        \n \n \n  <p> <strong>Traversons-nous une \u00abgrande p\u00e9riode d\u2019instabilit\u00e9\u00bb, comme l\u2019a martel\u00e9 Philippe Couillard samedi, au Conseil g\u00e9n\u00e9ral lib\u00e9ral \u00e0 Montr\u00e9al?</strong></p> \n  <p> D\u2019accord, les perspectives \u00e9conomiques sont l\u00e9g\u00e8rement assombries actuellement.</p> \n  <p> Notamment par les gestes de protectionnisme du pr\u00e9sident am\u00e9ricain Donald Trump, sur l\u2019acier et l\u2019aluminium, m\u00eame envers ses alli\u00e9s du Canada ...",
    "keywords": ["papier commercial", "exag\u00e9rer la situation", "d\u2019une progression", "alli\u00e9s du Canada", "campagne lib\u00e9rale", "contexte \u00e9conomique", "r\u00e9confortante face", "pr\u00e9sident am\u00e9ricain", "long cycle", "budget du ministre", "ciel \u00e9conomique", "march\u00e9 immobilier am\u00e9ricain", "marge du Conseil", "adversaire caquiste", "r\u00e9gions du monde", "crise financi\u00e8re", "brandir des \u00e9pouvantails", "vrai changement", "pr\u00e9c\u00e9dant le scrutin", "gestes de protectionnisme", "premier ministre", "promettre des id\u00e9es", "nom du parti", "int\u00e9r\u00eat \u00e9lectoral", "transformation du Qu\u00e9bec", "vieille formule", "grande instabilit\u00e9", "mains sur le volant", "Donald Trump", "Alexandre Taillefer", "Philippe Couillard", "Carlos Leitao", "M\u00e9lanie Joly", "Lehman Brothers", "Montr\u00e9al", "Qu\u00e9bec", "Canada"],
    "author": "Antoine Robitaille"
}
```
The title and body still contain raw HTML, so we need to clean things up a little bit.

## Cleaning up the data

Take a look at the [generate_dataset.py](/edit/generate_dataset.py) file.

After running this script (which may take some time depending on the dataset size) we have these objects:

```javascript
{
  "author": "antoine-robitaille",
  "title": "l\u2019\u00e9pouvantail_de_l\u2019\u00abinstabilit\u00e9\u00bb",
  "keywords": [
   ["papier", "commercial"],
   ...
   ],
  "text": [
   [
    "traversons",
    "-",
    "nous",
    "une",
    "\u00ab",
    "grande",
    "p\u00e9riode",
    "d\u2019",
    "instabilit\u00e9",
    "\u00bb",
    ",",
    "comme",
    "l\u2019",
    "a",
    "martel\u00e9",
    "philippe",
    "couillard",
    "samedi",
    ",",
    "au",
    "conseil",
    ...
   ],
   ...
  ]
}
```
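
To make the transformation concrete, here is a rough sketch of a cleanup that produces objects shaped like the one above. It is not the content of `generate_dataset.py`: the helper names, the tokenization regex and the naive sentence splitting are all assumptions chosen to reproduce the sample output using only the standard library.

```python
import re
from html.parser import HTMLParser

# A word followed by an apostrophe ("d’"), a plain word, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+[’']|\w+|[^\w\s]")


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return TOKEN_PATTERN.findall(text.lower())


def clean_article(raw):
    """Turn one raw scraped item into a cleaned, tokenized article."""
    # Naive sentence split: break after ".", "!" or "?" followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", strip_tags(raw["body"]))
    return {
        "author": raw["author"].lower().replace(" ", "-"),
        "title": strip_tags(raw["title"]).lower().replace(" ", "_"),
        "keywords": [tokenize(keyword) for keyword in raw["keywords"]],
        "text": [tokenize(sentence) for sentence in sentences if sentence],
    }
```

A real pipeline would likely use a proper French tokenizer and sentence splitter; the regex here is only meant to show the kind of output we are after.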

## Splitting our dataset into a train, valid and test set

Our dataset is heavily imbalanced: the number of articles varies a lot from one author to another. We want the train, validation and test sets to reflect this distribution as closely as possible.

Running the script [```generate_train_valid_test.py```](/edit/generate_train_valid_test.py) will do just that. (With the help of [domain/article.py](/edit/domain/article.py) and [repository/articles.py](/edit/repository/articles.py)).
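
The idea behind such a split can be sketched as follows. This is not the code of `generate_train_valid_test.py`: the function name, the 80/10/10 ratios and the use of plain dictionaries with an `"author"` key are assumptions made for the example. Splitting each author's articles separately keeps every author's share roughly the same in the three sets.

```python
import random
from collections import defaultdict


def stratified_split(articles, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split articles into train/valid/test while preserving each author's share."""
    rng = random.Random(seed)

    # Group the articles by author.
    by_author = defaultdict(list)
    for article in articles:
        by_author[article["author"]].append(article)

    train, valid, test = [], [], []
    for author in sorted(by_author):
        group = by_author[author]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_valid = int(len(group) * ratios[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        # Whatever is left goes to the test set.
        test += group[n_train + n_valid:]
    return train, valid, test
```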

Let's take a look at some examples:

In [None]:
from repository.articles import load_splitted_articles

train, valid, test = load_splitted_articles()

In [None]:
train[0].text

# Part 2: Training a simple classifier

## Getting our data ready for a neural network

In [None]:
from domain.article import stopwords

class ArticleSimpleFormatter:
    def format_examples(self, articles):
        examples = list()
        for article in articles:
            body = []
            for sentence in article.get_sentences():
                if len(sentence) > 0:
                    body += [w for w in sentence if w not in stopwords]
            if len(body) > 0:
                examples.append((body[:1000], article.author))
        return examples

article_formatter = ArticleSimpleFormatter()

In [None]:
formatted_train = article_formatter.format_examples(train)
formatted_valid = article_formatter.format_examples(valid)
formatted_test = article_formatter.format_examples(test)

In [None]:
# How many authors do we have?
len(set(a[1] for a in formatted_train))

In [None]:
# What is our vocabulary size?
len(set([w for a in formatted_train for w in a[0]]))

Based on [this tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), we are going to train a text classifier using scikit-learn.

In [None]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

X_train = [x[0] for x in formatted_train]
X_valid = [x[0] for x in formatted_valid]
X_test = [x[0] for x in formatted_test]

authors = set(x[1] for x in formatted_train)
author_to_idx = {author: i for i, author in enumerate(sorted(authors))}

Y_train = np.array([author_to_idx[x[1]] for x in formatted_train])
Y_valid = np.array([author_to_idx[x[1]] for x in formatted_valid])
Y_test = np.array([author_to_idx[x[1]] for x in formatted_test])

text_clf = Pipeline([('vect', CountVectorizer(tokenizer=lambda x: x, lowercase=False)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),
])

text_clf.fit(X_train, Y_train)
predicted = text_clf.predict(X_train)
print(np.mean(predicted == Y_train))

predicted = text_clf.predict(X_test)
print(np.mean(predicted == Y_test))

In [None]:
print(classification_report(Y_test, predicted, target_names=sorted(authors)))
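
Since the dataset is imbalanced, overall accuracy can hide poor results on authors with few articles, which is why the per-class report is worth reading. As a rough illustration of what the precision and recall columns mean, here is a plain-Python sketch; the function name is made up and this is not how scikit-learn implements it.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    true_positives, predicted, actual = Counter(), Counter(), Counter()
    for truth, prediction in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[prediction] += 1
        if truth == prediction:
            true_positives[truth] += 1

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Precision: among the examples predicted as `label`, how many were right.
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        # Recall: among the examples that really are `label`, how many were found.
        recall = true_positives[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report


# A classifier that always answers the majority class gets 80% accuracy here,
# yet it never finds a single example of class 1.
toy_true = [0] * 8 + [1] * 2
toy_pred = [0] * 10
toy_report = per_class_precision_recall(toy_true, toy_pred)
```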

# Part 3: Getting word vectors from our corpus

Let's look at what word vectors are and why they are useful. *(see slides)*

Now that we are more at ease with word vectors, let's generate them!

We first need to format our dataset the way [FastText](https://fasttext.cc/) wants it by running the [```generate_fasttext_data.py```](/edit/generate_fasttext_data.py) script.
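
For unsupervised training, fastText reads a plain UTF-8 text file in which tokens are separated by whitespace. A minimal sketch of a writer for that format could look like the following; the function name and the one-document-per-line layout are assumptions, not a description of what `generate_fasttext_data.py` actually does.

```python
def write_fasttext_corpus(tokenized_documents, path):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized_documents:
            if tokens:  # skip empty documents
                f.write(" ".join(tokens) + "\n")


# Usage with hypothetical data and file name:
# write_fasttext_corpus([["bonjour", "le", "monde"]], "corpus.txt")
```

The resulting file can then be given to the fastText command-line tool, for example with something like `fasttext skipgram -input corpus.txt -output my-vectors` (file names here are placeholders), which writes the vectors in text form to a `.vec` file.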

Let's take a look at these word vectors.

In [None]:
from gensim.models import KeyedVectors

vec_model_path = './data/ml-quebec-2018.vec'
vec_model = KeyedVectors.load_word2vec_format(vec_model_path)

In [None]:
vec_model['bonjour']

A vector doesn't tell us much about a word by itself unless we compare it with the vectors of other words.
See [this visualization](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/ngarneau/fe6db31a00d99c338eeac5dba7cb32b6/raw/133ee3336c07757f12d85eaf116d82baf48cb56a/config.json) for more insights.
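
The usual way to compare two word vectors is cosine similarity. The sketch below shows the computation on made-up three-dimensional vectors (the real ones have `vec_model.vector_size` dimensions); the toy values and the helper names are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(word, vectors):
    """Rank all the other words by decreasing similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors, only to show the mechanics.
toy_vectors = {
    "bonjour": [0.9, 0.1, 0.0],
    "salut": [0.8, 0.2, 0.1],
    "budget": [0.0, 0.1, 0.9],
}
neighbours = rank_neighbours("bonjour", toy_vectors)
```

With the real model, gensim offers the same kind of query directly, for instance `vec_model.most_similar('bonjour')` or `vec_model.similarity(word_a, word_b)`.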

In [None]:
class ArticleFormatter:
    def __init__(self, vec_model):
        self.vec_model = vec_model

    def filter_unfrequent_words(self, sentence):
        new_sentence = list()
        for word in sentence:
            if word in self.vec_model:
                new_sentence.append(word)
            else:
                new_sentence.append('<UNK>')
        return new_sentence

    def format_examples(self, articles):
        examples = list()
        for article in articles:
            body = []
            for sentence in article.get_sentences():
                if len(sentence) > 0:
                    body += self.filter_unfrequent_words([w for w in sentence if w not in stopwords])
            if len(body) > 0:
                examples.append((body[:1000], article.author))
        return examples

article_formatter = ArticleFormatter(vec_model)
formatted_train = article_formatter.format_examples(train)
formatted_valid = article_formatter.format_examples(valid)
formatted_test = article_formatter.format_examples(test)

In [None]:
vocab = set()
authors = set()

for example in formatted_train:
    for word in example[0]:
        vocab.add(word)
    authors.add(example[1])
    
word_to_idx = {
    '<PAD>': 0,
    '<UNK>': 1,
}

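for word in sorted(vocab):
    # '<UNK>' can already be in the vocabulary (the formatter inserts it);
    # skip reserved tokens so they keep their index and no two words share one.
    if word not in word_to_idx:
        word_to_idx[word] = len(word_to_idx)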
    
author_to_idx = {author: i for i, author in enumerate(sorted(authors))}

dataset = {
    'word_to_idx': word_to_idx,
    'author_to_idx': author_to_idx,
    'train': formatted_train,
    'valid': formatted_valid,
    'test': formatted_test,
}

In [None]:
class ArticleVectorizer:
    def __init__(self, word_to_idx, author_to_idx):
        self.word_to_idx = word_to_idx
        self.author_to_idx = author_to_idx

    def vectorize_sequence(self, sequence, idx, remove_if_unk=False):
        if '<UNK>' in idx:
            unknown_index = idx['<UNK>']
            words = [idx.get(tok, unknown_index) for tok in sequence]
            if remove_if_unk:
                return [w for w in words if w != unknown_index]
            else:
                return words

        else:
            return [idx[tok] for tok in sequence]

    def __call__(self, example):
        sentences, author = example
        vectorized_body = self.vectorize_sequence(sentences, self.word_to_idx)
        vectorized_author = self.author_to_idx[author]
        return (
            vectorized_body,
            vectorized_author,
        )

article_vectorizer = ArticleVectorizer(dataset['word_to_idx'], dataset['author_to_idx'])

In [None]:
train_data = [article_vectorizer(article) for article in dataset['train']]
valid_data = [article_vectorizer(article) for article in dataset['valid']]
test_data = [article_vectorizer(article) for article in dataset['test']]
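
To see what vectorizing does on a small scale, here is a toy sketch with a four-entry vocabulary. The vocabulary and the function name are made up; the logic mirrors the lookup with a fallback to `<UNK>` used above.

```python
def vectorize(tokens, word_to_idx, unk_token="<UNK>"):
    """Map each token to its index, falling back to the index of the unknown token."""
    unk_index = word_to_idx[unk_token]
    return [word_to_idx.get(token, unk_index) for token in tokens]


toy_word_to_idx = {"<PAD>": 0, "<UNK>": 1, "bonjour": 2, "monde": 3}

# "le" is not in the toy vocabulary, so it is mapped to index 1.
toy_indices = vectorize(["bonjour", "le", "monde"], toy_word_to_idx)
```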

In [None]:
import random
from torch.utils.data import DataLoader, Dataset

class SentenceDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        sentences, author = self.dataset[index]
        return sentences, author

In [None]:
train_dataset = SentenceDataset(train_data)
valid_dataset = SentenceDataset(valid_data)
test_dataset = SentenceDataset(test_data)

In [None]:
import torch

def pad_sequences(vectorized_seqs, seq_lengths):
    seq_tensor = torch.zeros((len(vectorized_seqs), seq_lengths.max())).long()
    for idx, (seq, seqlen) in enumerate(zip(vectorized_seqs, seq_lengths)):
        seq_tensor[idx, :seqlen] = torch.LongTensor(seq[:seqlen])
    return seq_tensor

def collate_truncate(max_length):
    def collate_examples(samples):
        sentences, authors = list(zip(*samples))
        sentences_lengths = torch.LongTensor([min(len(s), max_length) for s in sentences])
        padded_sentences = pad_sequences(sentences, sentences_lengths)
        authors = torch.LongTensor(authors)
        return padded_sentences, authors
    return collate_examples

In [None]:
batch_size = 32

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=collate_truncate(1000),
    shuffle=True
)

valid_loader = DataLoader(
    valid_dataset,
    batch_size=batch_size,
    collate_fn=collate_truncate(1000),
    shuffle=False
)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=collate_truncate(1000),
    shuffle=False
)

In [None]:
b = next(iter(train_loader))
b[0].shape
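
The second dimension of that shape is the length of the longest article in the batch, capped at the maximum length. The plain-Python sketch below reproduces the truncate-and-pad logic on lists so that the result is easy to read; it is an illustration with a made-up function name, not a replacement for the collate function.

```python
def truncate_and_pad(sequences, max_length, pad_index=0):
    """Cut each sequence to max_length, then pad all of them to the longest one left."""
    lengths = [min(len(sequence), max_length) for sequence in sequences]
    width = max(lengths)
    batch = [
        list(sequence[:length]) + [pad_index] * (width - length)
        for sequence, length in zip(sequences, lengths)
    ]
    return batch, lengths


toy_batch, toy_lengths = truncate_and_pad([[5, 6, 7, 8], [9], [3, 4]], max_length=3)
# The first sequence is truncated to 3 items, the two others are padded with 0.
```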

# Part 4: Creating a neural network to classify textual data

In [None]:
from torch import nn

class MyEmbeddings(nn.Embedding):
    def __init__(self, word_to_idx, embedding_dim):
        super(MyEmbeddings, self).__init__(len(word_to_idx), embedding_dim, padding_idx=0)
        self.embedding_dim = embedding_dim
        self.vocab_size = len(word_to_idx)
        self.word_to_idx = word_to_idx

    def set_item_embedding(self, idx, embedding):
        self.weight.data[idx] = torch.FloatTensor(embedding)

    def load_words_embeddings(self, vec_model):
        # Iterate over our own vocabulary rather than the model's word list,
        # whose attribute name differs between gensim versions.
        for word, idx in self.word_to_idx.items():
            if word in vec_model:
                self.set_item_embedding(idx, vec_model[word])
                
embeddings_layer = MyEmbeddings(dataset['word_to_idx'], vec_model.vector_size)
embeddings_layer.load_words_embeddings(vec_model)

In [None]:
embeddings_layer

In [None]:
from torch.nn import functional as F

class TextClassifier(nn.Module):
    def __init__(self, author_to_idx, embeddings, conv_in_channels=100, conv_out_channels=256):
        super(TextClassifier, self).__init__()
        self.embeddings = embeddings
        # conv_in_channels must match the dimension of the word embeddings
        self.conv = nn.Conv1d(conv_in_channels, conv_out_channels, kernel_size=5, padding=2)
        # Use the mapping passed as an argument, not the global variable
        self.fully_connected = nn.Linear(conv_out_channels, len(author_to_idx))
        self.loss_function = nn.CrossEntropyLoss()
        self.metrics = ['acc']

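    def forward(self, x): # 1: word indices, shape (batch, seq_len)
        embeddings = self.embeddings(x) # 2: (batch, seq_len, embedding_dim)
        embeddings = embeddings.transpose(1,2) # 3: (batch, embedding_dim, seq_len), as Conv1d expects
        convoluted = self.conv(embeddings) # 4: (batch, conv_out_channels, seq_len)
        convoluted = torch.tanh(convoluted) # 5: F.tanh is deprecated in favour of torch.tanh
        pooled = F.max_pool1d(convoluted, convoluted.shape[-1]) # 6: max over time, (batch, conv_out_channels, 1)
        pooled = pooled.squeeze(dim=2) # 7: (batch, conv_out_channels)
        logits = self.fully_connected(pooled) # 8: (batch, number of authors)
        return logits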


loaders = [train_loader, valid_loader, test_loader]
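
Two details of the network above are worth checking by hand: with `kernel_size=5` and `padding=2` the convolution keeps the sequence length unchanged, and the max pooling keeps a single value per channel. The sketch below works this out in plain Python using the output-length formula given in the PyTorch `Conv1d` documentation; the helper names are made up.

```python
def conv1d_output_length(length, kernel_size, padding=0, stride=1, dilation=1):
    """Length of the output of a 1D convolution for an input of the given length."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def max_over_time(feature_map):
    """Keep the largest value of each channel; feature_map is a list of channels."""
    return [max(channel) for channel in feature_map]


same_length = conv1d_output_length(1000, kernel_size=5, padding=2)  # padding compensates the kernel
shorter = conv1d_output_length(1000, kernel_size=5)                 # without padding we lose 4 positions

# Two channels over three positions are reduced to one value per channel.
pooled = max_over_time([[0.1, 0.9, 0.3], [-0.5, -0.2, -0.7]])
```

Because the pooling spans the whole sequence, the classifier receives a vector of fixed size whatever the length of the article.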

In [None]:
from torch import optim
from pytoune.framework import Experiment

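# Pick the device to train on. `use_cpu` and `gpu_index` are placeholders that
# stand in for the `args` object, which is not defined in this notebook.
use_cpu = False
gpu_index = 0
# Select the GPU up front; otherwise memory can be allocated on GPU 0 even when another GPU is requested.
if torch.cuda.is_available() and not use_cpu:
    torch.cuda.set_device(gpu_index)
device = torch.device('cuda:%d' % gpu_index if torch.cuda.is_available() and not use_cpu else 'cpu')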

# Create our embedding layer
embeddings_layer = MyEmbeddings(dataset['word_to_idx'], vec_model.vector_size)
# Load pre-trained word embeddings
embeddings_layer.load_words_embeddings(vec_model)

net = TextClassifier(author_to_idx, embeddings_layer)

embeddings_param_set = set(net.embeddings.parameters())
other_params_list = [p for p in net.parameters() if p not in embeddings_param_set]
optimizer = optim.SGD([{'params': other_params_list, 'lr': 1e-2, 'momentum':0.9, 'weight_decay': 1e-3},
                       {'params': net.embeddings.parameters(), 'lr': 1e-3}])

expt = Experiment('./expt_random_init', net, optimizer=optimizer, monitor_metric='val_acc', monitor_mode='max', device=device)

In [None]:
import numpy as np
from pytoune.framework import ReduceLROnPlateau

def init_and_train(expt,
                   loaders,
                   callbacks=[],
                   reduce_lr_on_plateau=False,
                   epochs=2000,
                   steps_per_epoch=None,
                   logging=True,
                   seed=42):
    train_loader, valid_loader, test_loader = loaders

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    lr_schedulers = []
    if reduce_lr_on_plateau:
        reduce_lr = ReduceLROnPlateau(monitor='loss', mode='min', patience=20, factor=0.5, threshold_mode='abs', threshold=1e-3, verbose=True)
        lr_schedulers.append(reduce_lr)

    expt.train(train_loader, valid_loader,
               epochs=epochs,
               steps_per_epoch=steps_per_epoch,
               validation_steps=steps_per_epoch,
               callbacks=callbacks,
               lr_schedulers=lr_schedulers)
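
The scheduler configured above halves the learning rate when the monitored loss stops improving. The following plain-Python simulation is a simplified model of that behaviour under stated assumptions (absolute threshold, minimization, no cooldown); it is not the library's implementation and the function name is made up.

```python
def simulate_reduce_on_plateau(losses, lr, patience=20, factor=0.5, threshold=1e-3):
    """Return the learning rate used after each epoch, given the loss of each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best - threshold:
            # Real improvement: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # No improvement for more than `patience` epochs: shrink the learning rate.
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# With a patience of 2, the rate is halved at the third epoch without improvement.
toy_history = simulate_reduce_on_plateau([1.0, 0.5, 0.5, 0.5, 0.5], lr=0.01, patience=2)
```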

In [None]:
init_and_train(expt, loaders, reduce_lr_on_plateau=True, epochs=30)

In [None]:
expt.test(loaders[2])

In [None]:
all_y = list()
all_pred = list()

for x, y in loaders[2]:
    loss, metric, pred = expt.model.evaluate(x, y, return_pred=True)
    all_y.append(y)
    all_pred.append(pred.argmax(axis=1))

In [None]:
print(classification_report(np.concatenate(all_y), np.concatenate(all_pred), target_names=sorted(authors)))

# Part 5: Todos

- What happens if we keep the stopwords when training the neural network?
- The network heavily overfits the data. What can we do about it?