# 20 Newsgroups classification

In this exercise we will build a classifier for the 20 Newsgroups dataset. Your aim is to build a classifier which assigns a given text to one of 20 classes. Let us start by loading the data. We will use the `sklearn` module to download and load it:

In [1]:
import sklearn.datasets as sk_datasets

train_data_raw = sk_datasets.fetch_20newsgroups(
    subset='train',
)
test_data_raw = sk_datasets.fetch_20newsgroups(
    subset='test',
)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


The raw data is a dictionary-like object with a lot of data which is not actually needed for this exercise. We will concentrate only on the texts and targets:

In [4]:
train_texts_raw, train_y = train_data_raw['data'], train_data_raw['target']
test_texts_raw, test_y = test_data_raw['data'], test_data_raw['target']

Let us have a look at an example text from the training set:

In [5]:
print(train_texts_raw[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







As you can see, the text contains a lot of additional material, such as the message header. Now, let us concentrate on cleaning up the data. In practical applications, cleaning up your texts and understanding their potential flaws is often crucial. In the next cell, try to implement a clean-up function. It should do the following three things:
- Remove the header of the data,
- Delete all `\n` characters,
- [Optional] Delete the footer.

Feel free to look for more potential clean-ups and share them on `Slack`. Let us compare how different ways of preprocessing the data affect your models:

In [0]:
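def clean_up_text(text):
    # The header ends at the first blank line; keep everything after it.
    text = ' '.join(text.split('\n\n')[1:])
    # Replace the remaining newline characters with spaces.
    text = text.replace('\n', ' ')
    return text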
  
  
def clean_up_texts(texts):
    return [clean_up_text(text) for text in texts]
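
A minimal sketch of a clean-up function that covers all three points, including the optional footer. The name `clean_up_text_sketch` is hypothetical, and the footer rule is an assumption: it only recognizes the conventional signature delimiter (a line consisting of `--`, optionally followed by a space), so free-form footers like the one in the example post above would not be removed by it.

```python
import re


def clean_up_text_sketch(text):
    """Hypothetical clean-up: drop header, drop signature, flatten whitespace."""
    # The header ends at the first blank line; keep everything after it.
    _, _, body = text.partition('\n\n')
    # Heuristic footer removal (assumption): cut at the first line that
    # consists only of "--" or "-- ".
    body = re.split(r'\n-- ?\n', body, maxsplit=1)[0]
    # Replace newlines and other whitespace runs with a single space.
    return re.sub(r'\s+', ' ', body).strip()


sample = 'From: a@b.c\nSubject: test\n\nFirst line\nsecond line.\n\n-- \nJohn'
print(clean_up_text_sketch(sample))  # First line second line.
```

Whether stripping signatures helps or hurts the classifier is an empirical question, so it is worth comparing both variants.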

Now, let us apply the clean-up function to the data:

In [0]:
train_texts = clean_up_texts(train_texts_raw)
test_texts = clean_up_texts(test_texts_raw)

We are now at an interesting point where a particular representation gap becomes striking. The collections created in the cell above are only lists of character strings. Although the texts stored there make sense to us, that is only because our brains were trained to read. For a computer, this representation is equivalent to a collection of `int`s. This is where the concept of an `embedding` comes into play: we need to represent texts in a form that captures their semantic structure.

We will use the [`GloVe`](https://nlp.stanford.edu/projects/glove/) embeddings. Let us download and unzip them:

In [6]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2019-05-24 08:51:12--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-05-24 08:51:13--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2019-05-24 08:51:16--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [applicati

In [0]:
import os

print(os.listdir())

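# wget saves the archive as glove.6B.zip; if the listing above shows a
# different name (e.g. glove.6B.zip.1 after a repeated download), use that one.
!unzip glove.6B.zip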

Now, we can prepare an `Embedding` dict:

In [0]:
import collections

import numpy as np


PATH_FOR_50_D_EMBEDDING = 'glove.6B.50d.txt'
PATH_FOR_100_D_EMBEDDING = 'glove.6B.100d.txt'
PATH_FOR_200_D_EMBEDDING = 'glove.6B.200d.txt'
PATH_FOR_300_D_EMBEDDING = 'glove.6B.300d.txt'


def get_word_and_vector_from_line(line):
    values = line.split()
    word = values[0]
    vector = np.asarray(values[1:], dtype='float32')
    return word, vector


def get_embedding_dict_from_file(filepath=PATH_FOR_100_D_EMBEDDING):
    embedding_dict = collections.OrderedDict()
    with open(filepath, encoding='utf-8') as glove_file:
        for line in glove_file:
            word, vector = get_word_and_vector_from_line(line)
            embedding_dict[word] = vector
    print('Found %s word vectors.' % len(embedding_dict))
    return embedding_dict
  
  
embedding_dict = get_embedding_dict_from_file()

For now, we will use only the `100D` embedding. Later, one can build embedding dicts for the other dimensionalities (`50D`, `200D`, `300D`). We use an `OrderedDict` so that the ordering of words is preserved and accidental operations on the dictionary do not change it.
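
To make the file format concrete, here is a small sketch that parses two made-up lines laid out like the GloVe text files (a token followed by its vector components) and compares the resulting vectors with cosine similarity. The vectors are toy values invented for illustration, not real GloVe entries, and `cosine_similarity` is a hypothetical helper.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout: a token, then its
# space-separated vector components (3 dimensions here, for brevity).
toy_glove_file = io.StringIO(
    'cat 0.1 0.2 0.3\n'
    'dog 0.1 0.25 0.3\n'
)

toy_embedding_dict = {}
for line in toy_glove_file:
    word, *components = line.split()
    toy_embedding_dict[word] = np.asarray(components, dtype='float32')


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(toy_embedding_dict['cat'].shape)  # (3,)
print(cosine_similarity(toy_embedding_dict['cat'], toy_embedding_dict['dog']))
```

With the real `embedding_dict`, the same helper can serve as a quick sanity check that related words end up close to each other.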

OK, the next important step is to prepare the tokenizer. The tokenizer should:
- remove all unneeded characters (e.g. punctuation, white spaces, etc.),
- lowercase all strings,
- split texts into sequences of words / tokens,
- [Optional] perform some sort of text normalization (e.g. stemming or lemmatization),
- [Optional] remove stop words from the text (one can use NLTK stop words).

Think carefully about how additional normalization might change your model's performance. If you apply any normalization to your data, share your results on Slack. Is stemming a good method of string normalization in this case?

In [0]:
import re

import nltk
from nltk.corpus import stopwords


nltk.download('stopwords')


class Tokenizer:
  
    def tokenize_text(self, text):
        raise NotImplementedError
        
    def tokenize_texts(self, texts):
        return [self.tokenize_text(text) for text in texts]
      
      
class StudentTokenizer(Tokenizer):
    pass 
  
  

tokenizer = StudentTokenizer()
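
One possible starting point, assuming a plain regular expression is good enough for this corpus. `RegexTokenizerSketch` is a hypothetical name, and the stop word set is passed in explicitly so that the sketch runs without NLTK; with NLTK installed one could pass `stopwords.words('english')` instead. Note that this pattern also splits on apostrophes, so `where's` becomes `where` and `s`.

```python
import re


class RegexTokenizerSketch:
    """Hypothetical tokenizer: lowercase, then keep runs of letters and digits."""

    TOKEN_PATTERN = re.compile(r'[a-z0-9]+')

    def __init__(self, stop_words=()):
        self.stop_words = frozenset(stop_words)

    def tokenize_text(self, text):
        # Lowercasing first lets the pattern ignore case entirely.
        tokens = self.TOKEN_PATTERN.findall(text.lower())
        return [token for token in tokens if token not in self.stop_words]

    def tokenize_texts(self, texts):
        return [self.tokenize_text(text) for text in texts]


sketch_tokenizer = RegexTokenizerSketch(stop_words={'the', 'a', 'is'})
print(sketch_tokenizer.tokenize_text('WHAT car is this!?'))
# ['what', 'car', 'this']
```

No stemming is applied here. Stemmed forms are often not valid dictionary words, so it is worth checking how stemming changes the share of tokens found in the embedding dictionary.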

Now, we can tokenize texts:

In [0]:
tokenized_train_texts = tokenizer.tokenize_texts(train_texts)
tokenized_test_texts = tokenizer.tokenize_texts(test_texts)

Now, let us have a look at how many of our tokens are in the embedding dictionary:

In [0]:
def count_tokens_from_dictionary(tokens, dictionary):
    return len(tokens), len([token for token in tokens if token in dictionary])
  
def flatten_list_of_lists(list_of_lists):
    return [el for list_ in list_of_lists for el in list_]
  
all_train_tokens, dictionary_train_tokens = count_tokens_from_dictionary(
    tokens=flatten_list_of_lists(tokenized_train_texts),
    dictionary=embedding_dict)
all_test_tokens, dictionary_test_tokens = count_tokens_from_dictionary(
    tokens=flatten_list_of_lists(tokenized_test_texts),
    dictionary=embedding_dict)

print('Out of %d train tokens, %d are in the dictionary, which is %d percent '
      'of all tokens.' % (
    all_train_tokens,
    dictionary_train_tokens,
    int(dictionary_train_tokens / all_train_tokens * 100),
))
print('Out of %d test tokens, %d are in the dictionary, which is %d percent '
      'of all tokens.' % (
    all_test_tokens,
    dictionary_test_tokens,
    int(dictionary_test_tokens / all_test_tokens * 100),
))
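
The percentages alone do not say which tokens are missing. A small sketch for inspecting them, with a hypothetical helper `most_common_oov_tokens` and toy inputs invented for illustration:

```python
import collections


def most_common_oov_tokens(tokenized_texts, dictionary, n=10):
    """Return the n most frequent tokens that are missing from dictionary."""
    counter = collections.Counter(
        token
        for tokens in tokenized_texts
        for token in tokens
        if token not in dictionary
    )
    return counter.most_common(n)


toy_texts = [['what', 'car', 'bricklin'], ['bricklin', 'lerxst', 'car']]
toy_dictionary = {'what': None, 'car': None}
print(most_common_oov_tokens(toy_texts, toy_dictionary))
# [('bricklin', 2), ('lerxst', 1)]
```

On the real data the call would be `most_common_oov_tokens(tokenized_train_texts, embedding_dict)`. Frequent misses often point at tokenizer problems (leftover punctuation, e-mail addresses) rather than at genuinely rare words.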

Now we are almost ready to start modeling our data using neural networks. Before that, we need a tool which transforms a sequence of words into a sequence of indices and provides the embedding matrix. Your task is to write a more comprehensive implementation of `Embedder` which:

- includes an out-of-vocabulary (OOV) index and vector (an additional index and vector which represent every word outside the vocabulary). Which vector should one choose as the default for OOV words?
- pads sequences to a `max_len` length. If `max_len is None`, no padding should be performed.

No one said that this has to be implemented from scratch. E.g., `keras` has a nice tool which might be useful.

In [0]:
class Embedder:
  
    def get_indices_for_tokenized_texts(self, tokenized_texts):
        return [self.get_tokens_indices(tokens) for tokens in tokenized_texts]
 
    def get_tokens_indices(self, tokens):
        raise NotImplementedError    
      
    @property
    def embedding_dim(self):
        raise NotImplementedError
        
    @property
    def embedding_matrix(self):
        raise NotImplementedError
      
      
class StudentEmbedder(Embedder):
    # TODO
    pass
      
    
        
embedder = StudentEmbedder(embedding_dictionary=embedding_dict)
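
A minimal sketch of one way to meet both requirements, not the reference solution. Everything named here is hypothetical: `EmbedderSketch` reserves index 0 for padding and index 1 for OOV words, uses the mean of all known vectors as the OOV vector (one common choice among several), and also truncates sequences longer than `max_len`, which is an assumption beyond the task description.

```python
import numpy as np


class EmbedderSketch:
    """Hypothetical embedder with padding (0) and out-of-vocabulary (1) rows."""

    PAD_INDEX = 0
    OOV_INDEX = 1

    def __init__(self, embedding_dictionary, max_len=None):
        self.max_len = max_len
        words = list(embedding_dictionary)
        # Real words start at index 2, after the two reserved rows.
        self.word_to_index = {word: i + 2 for i, word in enumerate(words)}
        vectors = np.stack([embedding_dictionary[word] for word in words])
        pad_vector = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
        oov_vector = vectors.mean(axis=0, keepdims=True)
        self._embedding_matrix = np.concatenate(
            [pad_vector, oov_vector, vectors])

    def get_tokens_indices(self, tokens):
        indices = [self.word_to_index.get(token, self.OOV_INDEX)
                   for token in tokens]
        if self.max_len is None:
            return indices
        # Truncate (assumption), then pad on the right up to max_len.
        indices = indices[:self.max_len]
        return indices + [self.PAD_INDEX] * (self.max_len - len(indices))

    def get_indices_for_tokenized_texts(self, tokenized_texts):
        return [self.get_tokens_indices(tokens) for tokens in tokenized_texts]

    @property
    def embedding_dim(self):
        return self._embedding_matrix.shape[1]

    @property
    def embedding_matrix(self):
        return self._embedding_matrix


toy_dict = {
    'car': np.array([1.0, 0.0], dtype='float32'),
    'door': np.array([0.0, 1.0], dtype='float32'),
}
toy_embedder = EmbedderSketch(toy_dict, max_len=4)
print(toy_embedder.get_tokens_indices(['car', 'bricklin', 'door']))
# [2, 1, 3, 0]
```

Keeping the padding row at zeros means padded positions contribute nothing when embeddings are summed, although they still affect a plain mean unless the model ignores them.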

Now, we are ready to prepare the datasets:

In [0]:
import torch
import torch.utils.data as torch_data


train_set = torch_data.TensorDataset(
    torch.LongTensor(embedder.get_indices_for_tokenized_texts(
        tokenizer.tokenize_texts(train_texts),
    )),
    torch.LongTensor(train_y),
)
test_set = torch_data.TensorDataset(
    torch.LongTensor(embedder.get_indices_for_tokenized_texts(
        tokenizer.tokenize_texts(test_texts),
    )),
    torch.LongTensor(test_y),
)

... and implement and train a model which is a multinomial logistic regression over the mean of token embeddings.

Hints:
- use `EmbeddingBag` layer,
- is padding useful in this approach?

Remember to make your model architecture as easy to change as possible.

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanEmbedding(nn.Module):
    pass
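
A minimal sketch of such a model together with a single training step on toy data, assuming a reasonably recent PyTorch version in which `nn.EmbeddingBag` accepts `padding_idx`. The class name `MeanEmbeddingSketch`, the toy matrix, and the toy labels are all hypothetical. According to the PyTorch documentation, entries equal to `padding_idx` are excluded from the reduction, so padded positions should not dilute the mean.

```python
import numpy as np
import torch
import torch.nn as nn


class MeanEmbeddingSketch(nn.Module):
    """Hypothetical model: mean of token embeddings, then a linear layer."""

    def __init__(self, embedding_matrix, n_classes, padding_idx=0):
        super().__init__()
        # torch.tensor copies the data, so the NumPy matrix is not modified.
        weights = torch.tensor(embedding_matrix, dtype=torch.float32)
        # mode='mean' averages the embeddings of each row (bag) of indices.
        self.embedding_bag = nn.EmbeddingBag.from_pretrained(
            weights, freeze=False, mode='mean', padding_idx=padding_idx)
        self.linear = nn.Linear(weights.shape[1], n_classes)

    def forward(self, indices):
        # indices: (batch, seq_len) -> logits: (batch, n_classes)
        return self.linear(self.embedding_bag(indices))


# Toy data: 4 embedding rows of dimension 5 and two padded sequences.
toy_matrix = np.random.rand(4, 5).astype('float32')
toy_indices = torch.LongTensor([[2, 1, 3, 0], [3, 3, 0, 0]])
toy_labels = torch.LongTensor([7, 19])

model = MeanEmbeddingSketch(toy_matrix, n_classes=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step; CrossEntropyLoss applies log-softmax to the logits.
logits = model(toy_indices)
loss = nn.CrossEntropyLoss()(logits, toy_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

The model returns raw logits, which is what makes it a multinomial logistic regression once combined with the cross-entropy loss. Setting `freeze=True` would keep the pretrained vectors fixed and train only the linear layer.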

The final model should be a 1D convolution over the sequence of embedding vectors:

Hints:
- use the `Embedding` layer (but beware of its output dimensions).

Extra:
- how about making it possible for your model to deal with sequences of varying length?


In [0]:
class FullEmbedding(nn.Module):
    pass
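
A minimal sketch of one such architecture, under assumptions of my own: a single convolutional layer, ReLU, and global max pooling over time. `FullEmbeddingSketch`, `n_filters` and `kernel_size` are hypothetical names and defaults. The transpose is needed because `nn.Embedding` returns `(batch, seq_len, embedding_dim)` while `nn.Conv1d` expects `(batch, channels, seq_len)`.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullEmbeddingSketch(nn.Module):
    """Hypothetical model: embedding -> Conv1d -> max over time -> linear."""

    def __init__(self, embedding_matrix, n_classes, n_filters=64,
                 kernel_size=3, padding_idx=0):
        super().__init__()
        weights = torch.tensor(embedding_matrix, dtype=torch.float32)
        self.embedding = nn.Embedding.from_pretrained(
            weights, freeze=False, padding_idx=padding_idx)
        self.conv = nn.Conv1d(weights.shape[1], n_filters, kernel_size)
        self.linear = nn.Linear(n_filters, n_classes)

    def forward(self, indices):
        embedded = self.embedding(indices)       # (batch, seq_len, dim)
        embedded = embedded.transpose(1, 2)      # (batch, dim, seq_len)
        features = F.relu(self.conv(embedded))   # (batch, filters, seq_len-k+1)
        # Global max pooling removes the time axis, whatever its length.
        pooled = features.max(dim=2).values      # (batch, filters)
        return self.linear(pooled)               # (batch, n_classes)


toy_matrix = np.random.rand(4, 5).astype('float32')
conv_model = FullEmbeddingSketch(toy_matrix, n_classes=20)

short_batch = torch.LongTensor([[2, 1, 3, 0], [3, 3, 0, 0]])
long_batch = torch.LongTensor([[2, 1, 3, 2, 1, 3, 2, 0, 0]])
print(conv_model(short_batch).shape, conv_model(long_batch).shape)
```

Because the pooling collapses the sequence axis, the same model accepts batches of different lengths, as long as every sequence is at least `kernel_size` tokens long. Sequences within one batch still have to be padded to a common length.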