# Lab Assignment Seven: Recurrent Network Architectures

## Dataset Selection

Select a text dataset (or a time-series sequence). For generalization performance, it helps to have a large dataset of similar-sized text documents. Binary or multi-class classification is fine, and the classification can be "many-to-one" or "many-to-many" sequence classification, whichever you feel more comfortable with.

I use the Sentiment140 dataset from [Kaggle](https://www.kaggle.com/kazanova/sentiment140), which contains 1.6 million tweets.

## Preparation (3 points total)
- [1 point] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed). Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). Discuss methods of tokenization in your dataset as well as any decisions to force a specific length of sequence.
- [1 point] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.
- [1 point] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Convince me that your train/test splitting method is a realistic mirroring of how an algorithm would be used in practice.
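
The tokenization and fixed-length decisions mentioned in the first bullet can be sketched without any library. The snippet below is only an illustration under stated assumptions: lowercase whitespace tokenization, index 0 reserved for padding, index 1 reserved for out-of-vocabulary words, and hypothetical `max_words` and `max_len` values. None of these choices come from the assignment, and a Keras tokenizer would normally do this job.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed reserved ids: padding and out-of-vocabulary


def build_vocab(texts, max_words=10000):
    """Map the most frequent lowercase whitespace tokens to integer ids."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # ids start at 2 because 0 and 1 are reserved above
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len=8):
    """Convert one text to a fixed-length id list (truncate, then right-pad)."""
    ids = [vocab.get(tok, OOV) for tok in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


texts = ["I love this phone", "I hate waiting in line"]  # toy examples
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_len=6) for t in texts]
```

Every row of `encoded` has the same length, which is what an embedding layer followed by a recurrent layer expects. Tweets are short, so a small `max_len` is likely to truncate very little text, but the value should be checked against the actual length distribution.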

In [1]:
# Standard library
import copy
import os
import sys
import tempfile

# Numerics, data handling, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit

# scikit-learn
from sklearn import metrics as mt
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.metrics import make_scorer, accuracy_score, roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import KFold, ShuffleSplit
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Input, Dropout
from tensorflow.keras.layers import Embedding, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adagrad, Adam
from tensorflow.keras.utils import plot_model

print('tensorflow version:',tf.__version__)
print('pandas version:',pd.__version__)
print('keras version:',keras.__version__)
print('numpy version:',np.__version__)

%matplotlib inline
%config InlineBackend.figure_format='retina'
plt.style.use('ggplot')

tensorflow version: 2.3.0
pandas version: 1.2.4
keras version: 2.4.0
numpy version: 1.18.5


### The full dataset contains 1.6 million rows, but I use only 10,000 of them for computational efficiency.

In [63]:
%%time
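chunks = pd.read_csv(
    './data/sentiment140.csv',
    names=['target', 'ids', 'date', 'flag', 'user', 'text'],
    usecols=['target', 'text'],
    iterator=True,
    chunksize=5000,
)

# Keep the most recently seen single-class chunk for each label and stop
# as soon as one all-positive and one all-negative chunk are available.
chunka = chunkb = None
for chunk in chunks:
    if (chunk['target'] == 4).all():
        chunka = chunk  # latest chunk containing only positive tweets (4)
    if (chunk['target'] == 0).all():
        chunkb = chunk  # latest chunk containing only negative tweets (0)
    if chunka is not None and chunkb is not None:
        break

df = pd.concat([chunka, chunkb])
df['target'] = df['target'].replace(4, 1)  # relabel the positive class 4 -> 1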
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 800000 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   target  10000 non-null  int64 
 1   text    10000 non-null  object
dtypes: int64(1), object(1)
memory usage: 234.4+ KB
CPU times: user 638 ms, sys: 7.83 ms, total: 646 ms
Wall time: 645 ms


In [64]:
df.target.value_counts()

0    5000
1    5000
Name: target, dtype: int64
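
Because the sample is perfectly balanced, one option for the train/test question is a stratified hold-out split, which keeps the 50/50 class ratio in both parts. The following is a minimal library-free sketch and not necessarily the split used later in this lab; the function name `stratified_split`, the 80/20 ratio, and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class by the same fraction."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = int(round(len(idxs) * test_frac))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# toy labels mirroring the balanced 5000/5000 sample above
labels = [0] * 5000 + [1] * 5000
train_idx, test_idx = stratified_split(labels)
```

In practice `StratifiedShuffleSplit` or `StratifiedKFold` from scikit-learn, both already imported in this notebook, provide the same guarantee with repeated splits.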

## Modeling (6 points total)
- [3 points] Investigate at least two different recurrent network architectures (perhaps LSTM and GRU). Be sure to use an embedding layer (pre-trained, from scratch, OR both). Adjust hyper-parameters of the networks as needed to improve generalization performance (train a total of at least four models). Discuss the performance of each network and compare them.
- [1 point] Using the best RNN parameters and architecture, add a second recurrent chain to your RNN. The input to the second chain should be the output sequence of the first chain. Visualize the performance of training and validation sets versus the training iterations.
- [2 points] Use the method of train/test splitting and evaluation criteria that you argued for at the beginning of the lab. Visualize the results of all the RNNs you trained.  Use proper statistical comparison techniques to determine which method(s) is (are) superior. 
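
For the statistical comparison requested in the last bullet, one commonly used option for two classifiers evaluated on the same test set is McNemar's test. The sketch below is an illustration only: the function name and the toy predictions are hypothetical, and it uses the continuity-corrected chi-square form with one degree of freedom.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test; returns (chi2, p_value)."""
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in rows if a == t and m != t)  # only model A correct
    c = sum(1 for t, a, m in rows if a != t and m == t)  # only model B correct
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square survival function for 1 dof: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# toy predictions (hypothetical): model A is always right, model B misses 12 of 20
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 12 + [1] * 8
chi2, p_value = mcnemar(y_true, pred_a, pred_b)
```

A small `p_value` suggests the two models' error patterns differ by more than chance. When models are compared across cross-validation folds instead of a single test set, a paired test on the per-fold scores is the more natural choice.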

## Exceptional Work (1 point total)
- You have free rein to provide additional analyses.
- One idea (required for 7000 level students to do one of these options):
    - Option 1: Use dimensionality reduction (choose an appropriate method from this list: t-SNE, SVD, PCA, or UMAP) to visualize the word embeddings of a subset of words in your vocabulary that you expect to have an analogy that can be captured by the embedding. Try to interpret if an analogy exists, show the vectors that support/refute the analogy, and interpret your findings.
    - Option 2: Use the ConceptNet Numberbatch embedding and compare to GloVe. Which method is better for your specific application?
- Another idea (NOT required): Try to create an RNN for generating novel text.
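
For Option 1, the analogy question can be phrased as vector arithmetic: if `a` is to `b` as `c` is to `d`, then `b - a + c` should lie close to `d` by cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings such as GloVe have far more dimensions, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def solve_analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding a, b, c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


# toy, hand-made vectors (NOT real embeddings)
toy_emb = {
    "good": [1.0, 0.0, 0.2],
    "best": [1.0, 1.0, 0.2],
    "bad": [-1.0, 0.0, 0.2],
    "worst": [-1.0, 1.0, 0.2],
    "phone": [0.0, 0.0, 1.0],
}
answer = solve_analogy(toy_emb, "good", "best", "bad")
```

With real embeddings the match is rarely exact, so it helps to report the cosine similarity of the top few candidates next to the 2-D projection when arguing for or against an analogy.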