## Code for downloading the training and validation datasets
#### train_url: URL of "train.json"
#### valid_url: URL of "valid.json"
    - Each file is a list of dictionaries.
    - Each dictionary has the keys 'user_id' and 'sentence'.
    - There are **eight** Twitter users in total.

#### valid4test_url: URL of "valid.json", but **without** the label data. I save this as "test.json" to simulate the grading process.
    - This is also a list of dictionaries.
    - Here, however, each dictionary has no 'user_id' key; it contains only 'sentence'.
    - For grading, I have my own 'test.json', which I will not provide to students.
    - My 'test.json' has the same format as this file (only 'sentence' as a key).
    - To simulate the grading phase, this Colab notebook saves this file as 'test.json'.
    - In the code you submit, you need to
        (1) load './test.json', which is a list of dictionaries,
        (2) predict the "user_id" for the "sentence" in each dictionary,
        (2-1) which means you need to submit (a) the test code and (b) the trained model,
        (3) add your predicted "user_id" to each dictionary, and
        (4) save the final list of dictionaries as "result.json".

In [1]:
import gdown

train_url = 'https://drive.google.com/uc?id=1QV7r1Gr6Qh8lB-cV5Zui5_2ElQoQgYbb'
valid_url = 'https://drive.google.com/uc?id=1MmDF2k4s7VrlWRqyOtw-KG5pHF9P7u9v'
valid4test_url = 'https://drive.google.com/uc?id=1T5UFbIWq8IA5ox0upGcpxtTRyJwakxwI'

# These are the datasets you need to use for training and validation.
gdown.download(train_url, './train.json')
gdown.download(valid_url, './valid.json')

# Save the label-free validation dataset as 'test.json' to simulate the grading phase.
# You need to submit code that can load './test.json' and save its predictions as 'result.json'.
gdown.download(valid4test_url, './test.json')

Downloading...
From: https://drive.google.com/uc?id=1QV7r1Gr6Qh8lB-cV5Zui5_2ElQoQgYbb
To: /home/elkhan/AI Toolkits/assignment-5/train.json
100%|██████████| 818k/818k [00:00<00:00, 3.40MB/s]
Downloading...
From: https://drive.google.com/uc?id=1MmDF2k4s7VrlWRqyOtw-KG5pHF9P7u9v
To: /home/elkhan/AI Toolkits/assignment-5/valid.json
100%|██████████| 127k/127k [00:00<00:00, 834kB/s]
Downloading...
From: https://drive.google.com/uc?id=1T5UFbIWq8IA5ox0upGcpxtTRyJwakxwI
To: /home/elkhan/AI Toolkits/assignment-5/test.json
100%|██████████| 112k/112k [00:00<00:00, 931kB/s]


'./test.json'

## What you need to submit
1. The trained model (`.pth` file).
2. Code (`test.py` or `test.ipynb`) that can:
    - (a) Load the `./test.json` file.
        - Your code should work when I place `test.json` in the same directory as your `test.py` or `test.ipynb`.
    - (b) Load your trained model.
    - (c) Make predictions: get a `user_id` for each `sentence`.
    - (d) Save your list of dictionaries as `result.json`.
    ```
    # This is just an example to help your understanding.
    d0 = {'user_id':2, 'sentence':'Hi I am Bill'}
    d1 = {'user_id':5, 'sentence':'Hi I am Elon'}
    ...
    res = [d0, d1, d2, ....]
    json.dump(res, open('result.json', 'w'))
    ``` 
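Before submitting, it may help to check that the list you are about to save has the expected shape. The sketch below is only an illustration: `check_result_format` is a hypothetical helper, not part of the assignment, and the valid label range `0..num_users - 1` is an assumption based on the eight users mentioned above.

```python
def check_result_format(result, num_users=8):
    """Return a list of problems found in `result`; an empty list means it looks fine."""
    if not isinstance(result, list):
        return ["top-level object is not a list"]

    problems = []
    for i, entry in enumerate(result):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not a dictionary")
            continue
        if not isinstance(entry.get("sentence"), str):
            problems.append(f"entry {i} has no string 'sentence'")
        user_id = entry.get("user_id")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(user_id, int) or isinstance(user_id, bool):
            problems.append(f"entry {i} has no integer 'user_id'")
        elif not 0 <= user_id < num_users:
            problems.append(f"entry {i} has 'user_id' {user_id} outside 0..{num_users - 1}")
    return problems


example = [
    {"user_id": 2, "sentence": "Hi I am Bill"},
    {"sentence": "Hi I am Elon"},  # 'user_id' left out on purpose
]
print(check_result_format(example))
# ["entry 1 has no integer 'user_id'"]
```

Running the same check on the list loaded back from `result.json` also confirms that the file is valid JSON.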

In [2]:
import json

# load json
input_path = './test.json'
with open(input_path, 'r') as f:
    input_data = json.load(f)
print('Number of test data: ', len(input_data))
print('Example Data: \n', input_data[0])

Number of test data:  800
Example Data: 
 {'sentence': 'i got arrested beaten left bloody and unconscious but i havent given up and you can not give up an inspiring read from civil rights legend'}


In [23]:
import pandas as pd

dataset = pd.read_json("train.json")
num_classes = len(dataset['user_id'].unique())

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenize_pad_sequences(text, max_words, max_len):
    '''
    Tokenize the input text into sequences of integers and then pad each
    sequence to the same length. Returns the padded sequences together with
    the fitted tokenizer.
    '''

    # Text tokenization
    tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
    tokenizer.fit_on_texts(text)
    # Transforms text to a sequence of integers
    X = tokenizer.texts_to_sequences(text)
    # Pad sequences to the same length
    X = pad_sequences(X, padding='post', maxlen=max_len)
    # return sequences
    return X, tokenizer
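To make the padding step concrete, here is a plain-Python illustration of what `pad_sequences(X, padding='post', maxlen=max_len)` produces. It is a sketch, not the Keras implementation: `pad_post` is a hypothetical helper, and it assumes `max_len` is positive. It mirrors the Keras defaults, where sequences that are too long are truncated from the front (`truncating='pre'`) and short ones are filled with zeros.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Pad every sequence at the end so that all have exactly max_len items."""
    padded = []
    for seq in sequences:
        # Keep only the last max_len tokens, as truncating='pre' does
        seq = list(seq)[-max_len:]
        # Fill the remaining positions at the end, as padding='post' does
        padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded


print(pad_post([[4, 7], [1, 2, 3, 4, 5, 6]], max_len=4))
# [[4, 7, 0, 0], [3, 4, 5, 6]]
```

Index 0 is reserved for padding, which is why the Keras tokenizer assigns word indices starting at 1.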

In [15]:
max_len = 50
meta = pd.read_json("meta.json")
X, tokenizer = tokenize_pad_sequences(dataset['sentence'], max_words=len(meta['tokens']), max_len=max_len)

In [24]:
import torch.nn as nn

class RNNmodel(nn.Module):
    def __init__(self, lstm_dim, num_classes, max_len, vocab_size=5000):
        super().__init__()
        self.lstm_dim = lstm_dim
        self.num_classes = num_classes
        self.max_len = max_len
        # The embedding table must cover every token index, so it is sized by
        # the vocabulary rather than by the number of classes. The default of
        # 5000 is an assumption; pass the tokenizer's max_words instead.
        self.char_embedding = nn.Embedding(num_embeddings=vocab_size,
                                           embedding_dim=lstm_dim)
        self.lstm = nn.LSTM(input_size=lstm_dim,
                            hidden_size=lstm_dim,
                            num_layers=1,
                            batch_first=True,
                            )

        self.out_linear = nn.Linear(lstm_dim, num_classes)

    def forward(self, sort_input, sort_output=None, sort_length=None):
        ## torch.nn.utils.rnn.pack_padded_sequence is recommended for variable-length input,
        ## but it is left out here because it can confuse beginners.
        ## sort_output and sort_length are currently unused.
        lstm_input = self.char_embedding(sort_input)  # (batch, max_len, lstm_dim)
        lstm_out, (h, c) = self.lstm(lstm_input)      # (batch, max_len, lstm_dim)
        out = self.out_linear(lstm_out)               # (batch, max_len, num_classes)

        # Normalize over the class dimension; calling softmax without dim is deprecated
        return nn.functional.softmax(out, dim=-1)

### This is just code to simulate the grading process.
- This code just fills in a random integer index as the predicted `user_id`.

In [None]:
vocab_size = 5000
embedding_size = 32
epochs = 20
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
model = RNNmodel(256, num_classes, max_len)
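The notebook does not show how `decay_rate` is used. One common reading of `learning_rate / epochs` is a time-based schedule of the form `lr0 / (1 + decay * step)`; the sketch below assumes that interpretation, and `time_based_lr` is a hypothetical helper rather than anything defined in the assignment.

```python
learning_rate = 0.1
epochs = 20
decay_rate = learning_rate / epochs  # 0.005


def time_based_lr(step, initial_lr=learning_rate, decay=decay_rate):
    """Learning rate after `step` updates under time-based decay (assumed schedule)."""
    return initial_lr / (1.0 + decay * step)


for step in (0, 10, 20):
    print(step, round(time_based_lr(step), 5))
# 0 0.1
# 10 0.09524
# 20 0.09091
```

With these values the learning rate drops by only about 9% over 20 steps, so the decay has a noticeable effect only if `step` counts individual batch updates rather than epochs.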

In [None]:
import numpy as np

def random_answer(input_data, num_users=8):
    # You need to save a "list" of "dictionaries" as "result.json"
    result = list()
    for d in input_data:
        # Each dictionary needs to have "sentence" and "user_id"
        tmp_result = dict()
        tmp_result['sentence'] = d['sentence']

        # This code just fills in a random integer as the answer.
        # Do something for your prediction here.
        # Your TODO is to train a model that can fill in this 'user_id' with its own answer.
        tmp_result['user_id'] = np.random.randint(low=0, high=num_users)  # Replace with your prediction

        result.append(tmp_result)

    return result

In [None]:
random_result = random_answer(input_data)

In [None]:
## save with indent=2
with open('./result.json', 'w') as f:
    json.dump(random_result, f, indent=2)
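Because `test.json` in this notebook is `valid.json` with the labels removed, a simulated submission can be scored locally against `valid.json`. The sketch below is an assumption-laden illustration: `accuracy` is a hypothetical helper, plain accuracy is assumed as the metric (the notebook does not state how grading is scored), and both lists are assumed to be in the same order.

```python
def accuracy(predictions, references):
    """Fraction of entries whose predicted 'user_id' matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("prediction and reference lists differ in length")
    if not references:
        return 0.0

    correct = 0
    for pred, ref in zip(predictions, references):
        # Guard against the two lists being in a different order
        if pred["sentence"] != ref["sentence"]:
            raise ValueError("sentences are not aligned")
        correct += int(pred["user_id"] == ref["user_id"])
    return correct / len(references)


preds = [{"sentence": "a", "user_id": 1}, {"sentence": "b", "user_id": 3}]
refs = [{"sentence": "a", "user_id": 1}, {"sentence": "b", "user_id": 2}]
print(accuracy(preds, refs))
# 0.5
```

To score a real run, load `./result.json` and `./valid.json` with `json.load` and pass the two lists to `accuracy`. With eight users, random guessing should land near 0.125 if the classes are roughly balanced.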