#### Problem statement

Predict the political party from the tweet text and the handle

#### Data description
This dataset has three columns - label (party name), twitter handle, tweet text


#### Problem Description:

Design a feedforward deep neural network to predict the political party using PyTorch or TensorFlow.
Build two models

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp and Adam optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose a fixed number K of words from the tweet text (e.g. the first K words, K randomly selected words, etc.). K can be a hyperparameter.

    - Option 2: You can choose the top N (e.g. N=1000) most frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id, and 0 otherwise.

    - Clearly state your design choices and assumptions. Think about the pros and cons of each option.
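
The two options can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the encoding used later in this notebook: the helper names `build_vocab`, `encode_first_k`, and `encode_top_n` and the toy tweets are made up for the example, and Option 1 is interpreted here as "the first K in-vocabulary words".

```python
from collections import Counter

def build_vocab(tweets, n):
    """Map the n most frequent words to ids 1..n (0 is reserved for absent/padding)."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(n), start=1)}

def encode_first_k(tweet, vocab, k):
    """Option 1: ids of the first k in-vocabulary words, zero-padded to length k."""
    ids = [vocab[w] for w in tweet.lower().split() if w in vocab][:k]
    return ids + [0] * (k - len(ids))

def encode_top_n(tweet, vocab):
    """Option 2: one slot per vocabulary word, holding its id if present, else 0."""
    present = set(tweet.lower().split())
    return [idx if word in present else 0 for word, idx in vocab.items()]

# Toy data (hypothetical): "tax" occurs 3 times, "cut" twice, "now" once
vocab = build_vocab(["tax cut now", "tax cut", "tax"], n=2)
print(vocab)                                          # {'tax': 1, 'cut': 2}
print(encode_first_k("cut the tax today", vocab, 3))  # [2, 1, 0]
print(encode_top_n("cut spending", vocab))            # [0, 2]
```

Option 1 keeps word order but discards everything past K words; Option 2 covers the whole tweet but loses order and ignores words outside the top N.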

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set

 

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
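# import everything from tensorflow.keras; mixing it with standalone keras can cause errors
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
from tensorflow.keras.regularizers import l2
from tensorflow.keras.utils import to_categorical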

In [None]:
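# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) skips malformed rows instead
train_df = pd.read_csv('train.csv', index_col=0, on_bad_lines='skip')
test_df = pd.read_csv('test.csv', index_col=0, on_bad_lines='skip')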
train_df.head()

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [None]:
print(train_df.Party.unique())

['Democrat' nan 'Republican']


In [None]:
print(test_df.Party.unique())

['Democrat' 'Republican']


In [None]:
train_df.dropna(inplace=True)
print(train_df.Party.unique())

['Democrat' 'Republican']


In [None]:
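# encode the labels: Democrat -> 1, Republican -> 0 (assigning back avoids in-place edits on a column view)
party_to_label = {'Democrat': 1, 'Republican': 0}
train_df['Party'] = train_df['Party'].map(party_to_label)
test_df['Party'] = test_df['Party'].map(party_to_label)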
train_df.head()

Unnamed: 0,Party,Handle,Tweet
0,1,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,1,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,1,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,1,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,1,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [None]:
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

In [None]:
X_train = []
#apply preprocess_text to remove HTML tags, punctuations and numbers.
train_sentences = list(train_df.Tweet)
for sen in train_sentences:
    X_train.append(preprocess_text(sen))

y_train = train_df.Party

In [None]:
X_test = []
#apply preprocess_text to remove HTML tags, punctuations and numbers.
test_sentences = list(test_df.Tweet)
for sen in test_sentences:
    X_test.append(preprocess_text(sen))

y_test = test_df.Party

In [None]:
# create a word-to-index dictionary by fitting it on X_train to see all vocab
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

# perform the tokenization (replace each word by its corresponding index)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [None]:
# Maximum sentence length in the training data
print(len(max(X_train, key=len)))

30


In [None]:
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1 #this variable will be used later

# Experimented with 100, 50, 30 and finally 20.
maxlen = 20 # maximum length of the sentence in WORDS

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

In [None]:
pd.DataFrame(X_train).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,21,209,1003,145,4,3022,70,4,77,3023,701,152,64,7,1,39,3,2,30924,424
1,9,15880,1960,1569,2630,15881,5121,1160,11,73,5,992,1316,30,1545,8,90,1160,20320,0
2,9,9819,1545,8526,18,854,2863,41,836,5294,404,7,6951,55,41,9820,31,0,0,0
3,9,20321,109,13,1545,62,8,345,1,53,4,193,13,30925,499,30926,30927,20321,0,0
4,9,30928,854,1004,1663,10,2293,238,969,1121,1935,402,20322,1422,1545,15882,0,0,0,0
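
With `maxlen = 20`, shorter tweets are zero-padded at the end (`padding='post'`), while longer tweets are cut. Because `truncating` is not passed, `pad_sequences` falls back to its default of `'pre'`, which drops tokens from the start of a long tweet. The pure-Python sketch below mimics that behaviour on a single sequence for illustration; `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Mimic the effect of Keras pad_sequences on one list of token ids."""
    if len(seq) > maxlen:
        # 'pre' keeps the last maxlen tokens, 'post' keeps the first maxlen tokens
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

short_seq = [9, 15, 4]
long_seq = [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_seq, 5))                    # [9, 15, 4, 0, 0]
print(pad_or_truncate(long_seq, 5))                     # [2, 3, 4, 5, 6]
print(pad_or_truncate(long_seq, 5, truncating="post"))  # [1, 2, 3, 4, 5]
```

Passing `truncating='post'` to `pad_sequences` would instead keep the first 20 tokens of a long tweet.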


# Model 1
Without using the "Handle" feature.

A) Trying a normal feedforward deep neural network. 

In [None]:
model = Sequential()
embedding_layer = Embedding(vocab_size, 20, input_length=maxlen , trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

# learning_rate replaces the deprecated lr argument
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 20)            2291400   
_________________________________________________________________
flatten (Flatten)            (None, 400)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               205312    
_________________________________________________________________
dense_1 (Dense)              (None, 128)               65664     
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_3 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 3
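
The parameter counts in the summary can be checked by hand: an embedding layer holds `vocab_size * embedding_dim` weights, and a dense layer holds `inputs * units` weights plus `units` biases. The sketch below redoes that arithmetic. The vocabulary size is inferred from the embedding row (2291400 / 20) rather than printed anywhere in the notebook, and `dense_params` is a helper name made up for this example.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

embedding_dim = 20
maxlen = 20
vocab_size = 2291400 // embedding_dim   # inferred from the summary, not printed by the notebook
flat = maxlen * embedding_dim           # width of the Flatten output

print("embedding:", vocab_size * embedding_dim)
layer_sizes = [flat, 512, 128, 64, 32, 1]
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    print(f"dense {n_in} -> {n_out}:", dense_params(n_in, n_out))
```

Because the embedding is created with `trainable=False`, its weights count toward the total but are not updated during training.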

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", score[1])

Test Accuracy: 0.5576278567314148


B) Adding dropout regularization.

In [None]:
model = Sequential()
embedding_layer = Embedding(vocab_size, 32, input_length=maxlen , trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 32)            3666240   
_________________________________________________________________
flatten_1 (Flatten)          (None, 640)               0         
_________________________________________________________________
dropout (Dropout)            (None, 640)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 512)               328192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 128)               65664     
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)              

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", score[1])

Test Accuracy: 0.5597406625747681


C) Adding L2 regularization.

In [None]:
model = Sequential()
embedding_layer = Embedding(vocab_size, 32, input_length=maxlen , trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dense(128, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.0001)))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 32)            3666240   
_________________________________________________________________
flatten_2 (Flatten)          (None, 640)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 512)               328192    
_________________________________________________________________
dense_11 (Dense)             (None, 128)               65664     
_________________________________________________________________
dense_12 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_13 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_14 (Dense)             (None, 1)                

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", score[1])

Test Accuracy: 0.5557336211204529


D) Trying dropout with L2 regularizations.

In [None]:
model = Sequential()
embedding_layer = Embedding(vocab_size, 32, input_length=maxlen , trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.0001)))
model.add(Dropout(0.2))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 32)            3666240   
_________________________________________________________________
flatten_3 (Flatten)          (None, 640)               0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 640)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 512)               328192    
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 128)               65664     
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)              

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", score[1])

Test Accuracy: 0.5523823499679565


# Model 2
Including the Handle feature.

In [None]:
unique_handles = train_df.Handle.unique()
# map each handle to an integer
mapping = {handle: idx for idx, handle in enumerate(unique_handles)}

# integer representation (vectorized; avoids the chained assignment that triggers SettingWithCopyWarning)
train_df['Handle'] = train_df['Handle'].map(mapping)
print(to_categorical(train_df.Handle))


[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]


In [None]:
test_df.reset_index(inplace=True)
unique_handles = test_df.Handle.unique()
# map each handle to an integer (this mapping is built from the test handles only)
mapping = {handle: idx for idx, handle in enumerate(unique_handles)}

# integer representation (vectorized; avoids the chained assignment that triggers SettingWithCopyWarning)
test_df['Handle'] = test_df['Handle'].map(mapping)
print(to_categorical(test_df.Handle))


[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]


In [None]:
X_train = pd.DataFrame(X_train).merge(pd.DataFrame(to_categorical(train_df.Handle)), left_index=True, right_index=True)
X_test = pd.DataFrame(X_test).merge(pd.DataFrame(to_categorical(test_df.Handle)), left_index=True, right_index=True)

In [None]:
print(X_train.shape)
print(X_test.shape)

(72734, 453)
(13726, 441)


In [None]:
X_test.head()

Unnamed: 0,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,8_x,9_x,10_x,11_x,12_x,13_x,14_x,15_x,16_x,17_x,18_x,19_x,0_y,1_y,2_y,3_y,4_y,5_y,6_y,7_y,8_y,9_y,10_y,11_y,12_y,13_y,14_y,15_y,16_y,17_y,18_y,19_y,...,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420
0,89,74,229,680,216,7,20614,8,12,2245,2672,313,16,25,89,3,2,0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,189,501,171,221,21,4,6836,3165,210,24,35,1324,48,615,3,2,0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21,32338,636,47,746,32,38,117,78,1056,5,1726,22,32,24735,3,2,3263,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,34,192,6225,4506,1483,8,1213,1258,797,303,12,1038,48,3771,2985,11982,3,2,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,185,174,19,281,1009,242,263,107,3241,10,403,3,2,0,0,0,0,0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
for x in range(421, 433):
    X_test[str(x)] = 0
print(X_train.shape)
print(X_test.shape)

(72734, 453)
(13726, 453)


Using the unregularized architecture from part 1 (model A), which gave the highest accuracy of the part 1 architectures once the Handle feature was added.

In [None]:
model = Sequential()
embedding_layer = Embedding(vocab_size, 32, input_length=X_train.shape[1] , trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 453, 32)           3666240   
_________________________________________________________________
flatten_7 (Flatten)          (None, 14496)             0         
_________________________________________________________________
dense_35 (Dense)             (None, 512)               7422464   
_________________________________________________________________
dense_36 (Dense)             (None, 128)               65664     
_________________________________________________________________
dense_37 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_38 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_39 (Dense)             (None, 1)                

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=5, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", score[1])

Test Accuracy: 0.9887804388999939


# Documentation

In the preprocessing part:
1. I followed some NLTK tutorials and TensorFlow's tutorial on text data (the IMDB movie reviews dataset) to clean and tokenize the tweets.
2. I used TensorFlow's pad_sequences to generate a fixed-size input. I experimented with maxlen = 100, 50, 30, and 20; 20 showed the best training accuracy.


In part 1 (without using the Handle feature),
1. I first tried a plain feedforward neural network with only 1 hidden layer, then increased the number of layers to add depth and tried different numbers of neurons, arriving at the model used here, which showed the highest training and testing accuracy of these variants.  
**Accuracy for this model: 55.76%.**
2. I added Dropout layers after each hidden layer and experimented with dropout ratios of 0.2 and 0.5. I found that 0.5 works well for layers with a large number of neurons (like the first and second layers) and 0.2 works well for layers with a small number of neurons.  
**Accuracy for this model: 55.97%.**
3. I applied L2 regularization to all hidden layers with lambda = 0.01, 0.001, and 0.0001 using TensorFlow's kernel_regularizer. 0.0001 showed the best results.  
**Accuracy for this model: 55.57%.**
4. I combined model 2 (with dropout) and model 3 (with L2 regularization) to see if they work better together.  
**Accuracy for this model: 55.24%.**

In part 2 (with the Handle feature):
1. I encoded the Handle column using TensorFlow's to_categorical after giving each unique handle an index.
2. I added zero-filled columns at the end of X_test to match the dimensions of X_train, because the model raised an error when they didn't match.
3. I tried the 3 different models used in part 1, and the first model (the one with no regularization) showed the highest **accuracy: 98.9%**. So, clearly, using the Handle feature improves the testing accuracy a lot.
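
An alternative to padding X_test with zero columns is to build the handle index once from the training data and reuse it for the test data, so that both one-hot matrices have the same width and each column refers to the same handle. The sketch below shows the idea in plain Python. It is not what this notebook does: `fit_handle_index`, `one_hot_handles`, and the made-up handles are assumptions for illustration, and encoding unseen handles as an all-zero row is just one possible choice.

```python
def fit_handle_index(train_handles):
    """Assign each distinct training handle a column index, in order of first appearance."""
    index = {}
    for handle in train_handles:
        index.setdefault(handle, len(index))
    return index

def one_hot_handles(handles, index):
    """One-hot encode handles with a fixed width; unseen handles become all-zero rows."""
    rows = []
    for handle in handles:
        row = [0] * len(index)
        if handle in index:
            row[index[handle]] = 1
        rows.append(row)
    return rows

# Hypothetical handles, for illustration only
train_handles = ["RepA", "RepA", "RepB", "RepC"]
test_handles = ["RepC", "RepA", "RepZ"]

index = fit_handle_index(train_handles)
train_onehot = one_hot_handles(train_handles, index)
test_onehot = one_hot_handles(test_handles, index)
print(len(train_onehot[0]), len(test_onehot[0]))  # 3 3
print(test_onehot)                                # [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
```

With a shared index, the train and test feature matrices line up column by column without any manual padding.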

Other hyperparameters:
1. I tried decreasing the learning rate, but the model wasn't learning well (training accuracy was hardly increasing). So, I left it at the default 0.001.
2. I tried batch_size = 32 and 64, but training was slow, so I increased it to 256; the training speed increased noticeably without affecting the accuracy.
3. I started with epochs=20 but noticed that the accuracy of most models doesn't increase much after the tenth epoch. So, I used epochs=10 for all part 1 models and epochs=5 for the part 2 model, as it already reaches 100% training accuracy by epoch 5.