## In-class Assignment 3
## [COSC 7336 Advanced Natural Language](https://fagonzalezo.github.io/dl-tau-2017-2/)

## Authorship Attribution with CNNs

The goal is to build a model that identifies the author of a text. We will use the CCAT Reuters 50-50 dataset (https://archive.ics.uci.edu/ml/datasets/Reuter_50_50). The training corpus consists of 2,500 texts (50 per author), and the test corpus contains another 2,500 texts (50 per author) that do not overlap with the training texts. First, download the file `C50.zip` from the UCI repository. If you are working on an Azure VM, you can use `wget` or the following Python code, which downloads the file from `URL` and saves it locally as `filename`:

```python
import urllib.request
urllib.request.urlretrieve('URL', "filename")
```

After you have the file, the following code will unzip it. 

In [None]:
# Unzip the C50.zip file
import zipfile
# The context manager closes the archive once extraction finishes
with zipfile.ZipFile("./C50.zip", "r") as zip_ref:
    zip_ref.extractall(".")

The following code processes the dataset files and loads the text:

In [1]:
import os
import keras
from keras.preprocessing.text import Tokenizer

def get_texts_from_catdir(cat_dir):
    """Load every text under cat_dir, which holds one sub-directory per author.

    Returns the texts, their integer author ids, and the name-to-id mapping.
    """
    texts = []
    categories = []
    category_index = {}
    # Sorting makes the author ids reproducible across runs
    for category_name in sorted(os.listdir(cat_dir)):
        category_id = len(category_index)
        category_index[category_name] = category_id
        category_path = os.path.join(cat_dir, category_name)
        for f_name in sorted(os.listdir(category_path)):
            f_path = os.path.join(category_path, f_name)
            # The context manager closes the file even if reading fails
            with open(f_path, "r") as f:
                texts.append(f.read())
            categories.append(category_id)
    print("%d files loaded from %s" % (len(texts), cat_dir))
    return texts, categories, category_index

# Load the RAW text and Category labels
tr_txt, tr_y, tr_y_ind = get_texts_from_catdir("./C50/C50train")
te_txt, te_y, te_y_ind = get_texts_from_catdir("./C50/C50test")
#print(tr_y)
#print(tr_y_ind)

Using TensorFlow backend.


2500 files loaded from ./C50/C50train
2500 files loaded from ./C50/C50test


Finally, we extract the words from the texts and represent each sample as a sequence of word indices:

In [2]:
# Build Tokenizer and Vocabulary
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(tr_txt)
# Dictionary of the WHOLE extracted vocabulary
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
# Frequencies of words
word_counts=tokenizer.word_counts
print("Frequency of \"%s\" is %s" % ("the", word_counts["the"]))

print("WORD-SEQUENCE OF A TEXT:")
sample_text = keras.preprocessing.text.text_to_word_sequence(tr_txt[0])
print(sample_text[:5])

print("TEXT CONVERTED TO A SEQUENCE OF IDs: ")
X_tr_seq = tokenizer.texts_to_sequences(tr_txt)
X_te_seq = tokenizer.texts_to_sequences(te_txt)
print(X_tr_seq[0][:5])

Found 31090 unique tokens.
Frequency of "the" is 71425
WORD-SEQUENCE OF A TEXT:
['the', 'internet', 'may', 'be', 'overflowing']
TEXT CONVERTED TO A SEQUENCE OF IDs: 
[1, 169, 130, 14, 13]
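
To make the index representation concrete, here is a minimal pure-Python sketch of frequency-ranked indexing on a toy corpus. It is an illustration, not the Keras implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch splits on whitespace only, whereas the Keras `Tokenizer` also strips punctuation. It assumes, as in Keras, that the most frequent word gets index 1 and that `num_words` keeps only words whose index is below that value.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each word by its index, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(word) for word in text.lower().split()]
        sequences.append([i for i in ids
                          if i is not None
                          and (num_words is None or i < num_words)])
    return sequences

# Hypothetical toy corpus, used only for illustration
toy_texts = ["the cat sat on the mat", "the dog sat"]
toy_index = build_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts, toy_index, num_words=3)
print(toy_seqs)  # [[1, 2, 1], [1, 2]]
```

With `num_words=3`, only the two most frequent words ("the" and "sat") survive, which is why the sequences are shorter than the original texts. The same effect applies above: `num_words=1000` restricts the sequences to the most frequent words, even though `word_index` still lists the whole vocabulary.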


## Assignment

Now, it's time for you to work. 

### 1. Small dataset experiment

Adapt the code in https://github.com/fagonzalezo/dl-tau-2017-2/blob/master/Handout-CNN-sentence-classification.ipynb to work with the authorship dataset that we just processed. We will train the CNN-rand model (no word2vec vectors).

We need to shuffle the training dataset, since it is sorted by author. Also, for the first experiment we will use only 500 samples.

In [3]:
from sklearn.utils import shuffle
x_train, y_train = shuffle(X_tr_seq, tr_y, random_state=0)


x_test = X_te_seq
y_test = te_y

x_train = x_train[:500]
y_train = y_train[:500]

num_classes = len(set(y_train))
print("Num Classes: " + str(num_classes))


Num Classes: 50
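
The sequences above have different lengths, while a CNN is usually trained on fixed-length inputs and one-hot class targets. The following is a minimal sketch of both conversions in plain Python; the helper names `pad_sequence` and `one_hot` and the toy values are hypothetical, and the handout may prepare its inputs differently. In practice, Keras provides `pad_sequences` and `to_categorical` for this purpose. Note that `pad_sequences` pads and truncates at the beginning by default, whereas this sketch works at the end.

```python
def pad_sequence(seq, max_len, pad_id=0):
    """Truncate seq to max_len items, then right-pad it with pad_id."""
    trimmed = list(seq[:max_len])
    return trimmed + [pad_id] * (max_len - len(trimmed))

def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# Toy values for illustration; index 0 is free for padding because
# the Tokenizer starts numbering words at 1.
toy_x = [pad_sequence(s, max_len=4) for s in [[1, 169, 130, 14, 13], [7, 2]]]
toy_y = [one_hot(c, num_classes=3) for c in [2, 0]]
print(toy_x)  # [[1, 169, 130, 14], [7, 2, 0, 0]]
print(toy_y)  # [[0, 0, 1], [1, 0, 0]]
```

The choice of `max_len` is a trade-off: longer inputs keep more of each article but increase training time.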


### 2. Small dataset with word2vec

Download the pretrained word2vec model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing. If you are using an Azure VM, you may need the following utility to download files from Google Drive:

In [4]:
import requests

def download_file_from_google_drive(id, destination):
    '''
    id: file id (take it from sharable link)
    destination: destination file in your disk
    '''
    def get_confirm_token(response):
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768

        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    
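
The function expects the file id rather than the full link. As a small sketch, the id can be read from the path segment that follows `/d/` in the sharable link given above. The helper name `extract_drive_file_id` and the destination file name are hypothetical, so adjust them to your setup.

```python
from urllib.parse import urlparse

def extract_drive_file_id(share_url):
    """Return the path segment that follows '/d/' in a Drive share link."""
    parts = urlparse(share_url).path.split("/")
    return parts[parts.index("d") + 1]

share_url = ("https://drive.google.com/file/d/"
             "0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing")
file_id = extract_drive_file_id(share_url)
print(file_id)  # 0B7XkCwpI5KDYNlNUTTlSS21pQmM

# Hypothetical destination name; uncomment to start the (large) download.
# download_file_from_google_drive(file_id, "word2vec_pretrained.bin.gz")
```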

### 3. Full dataset experiment

Repeat the experiment, now using the whole dataset.