#### Problem statement

Predict the political party from the tweet text and the handle.

#### Data description
The dataset has three columns: label (party name), Twitter handle, and tweet text.


#### Problem Description:

Design a feed-forward deep neural network to predict the political party using PyTorch or TensorFlow.
Build two models:

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp, and Adam optimization techniques



- Creating a fixed-size vocabulary: give a unique id to each word in your selected vocabulary and use it as the input to the network.

    - Option 1: Feedforward networks can only handle fixed-size inputs. You can choose a fixed number K of words from the tweet text (e.g., the first K words or K randomly selected words). K can be a hyperparameter.

    - Option 2: You can choose the top N (e.g., N = 1000) most frequent words in the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id; otherwise pass 0.

    - Clearly state your design choices and assumptions. Think about the pros and cons of each option.

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set

 

In [1]:
# Install TensorFlow with pip (Keras ships with it as tf.keras).
!pip install tensorflow



In [2]:
# Import necessary libraries
import tensorflow as tf
from tensorflow import keras

# Check TensorFlow version
print("TensorFlow version:", tf.__version__)


TensorFlow version: 2.13.0


In [65]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD, RMSprop, Adam

In [66]:
# Load train and test datasets
train_data = pd.read_csv("C:\\Users\\mansi\\Downloads\\train.csv")
test_data = pd.read_csv("C:\\Users\\mansi\\Downloads\\test.csv")

In [67]:
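# Replace NaN values in the "Tweet" column with empty strings
# (plain assignment avoids pandas' chained inplace-modification warning)
train_data['Tweet'] = train_data['Tweet'].fillna('')
test_data['Tweet'] = test_data['Tweet'].fillna('')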

# Preprocess tweet text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data['Tweet'])
X_train = tokenizer.texts_to_sequences(train_data['Tweet'])
X_test = tokenizer.texts_to_sequences(test_data['Tweet'])

# Specify the number of words (K) to select from each tweet
K = 20 # Option 1
X_train = [seq[:K] for seq in X_train]
X_test = [seq[:K] for seq in X_test]

# Pad sequences to a fixed length
max_seq_length = K 
X_train = pad_sequences(X_train, maxlen=max_seq_length)
X_test = pad_sequences(X_test, maxlen=max_seq_length)

# Encode labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['Party'])
y_test = label_encoder.transform(test_data['Party'])
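
The truncate-then-pad step above can be sketched in plain Python for a single tokenized tweet. The helper name `truncate_and_pad` and the word ids are hypothetical; the helper mirrors the default `pad_sequences` behavior (`padding='pre'`, padding value 0), which puts the zeros in front of the word ids.

```python
def truncate_and_pad(seq, k, pad_id=0):
    """Keep the first k ids, then left-pad with pad_id up to length k."""
    kept = list(seq[:k])
    return [pad_id] * (k - len(kept)) + kept


short_tweet = [12, 7, 431]        # made-up word ids, shorter than k
long_tweet = list(range(1, 26))   # 25 made-up word ids, longer than k

print(truncate_and_pad(short_tweet, k=5))  # [0, 0, 12, 7, 431]
print(truncate_and_pad(long_tweet, k=5))   # [1, 2, 3, 4, 5]
```

Because the Keras `Tokenizer` assigns word ids starting at 1, id 0 is free to act as the padding value, which is why the embedding layers use `len(tokenizer.word_index) + 1` as `input_dim`.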

The code uses Option 1 (the first K = 20 word ids of each tweet, zero-padded when the tweet is shorter) to build the fixed-size input, because it is simple to implement and preserves the order of the words in the tweet.

The drawback is that every word after the first K is discarded, so information later in a tweet is lost.
Option 2 is the better choice when coverage of the whole tweet matters: it looks at every word in the tweet, but only at the N most frequent words in the dataset, and it ignores word order.
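
For comparison, Option 2 could look roughly like the sketch below. It is not what this notebook runs: the function names are hypothetical, the tweets are made up, and whitespace splitting stands in for the Keras `Tokenizer`. Ids start at 1 so that 0 can mean "word absent".

```python
from collections import Counter


def build_top_n_vocab(tweets, n):
    """Map the n most frequent words to ids 1..n."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(n), start=1)}


def encode_option2(tweet, vocab):
    """N-sized vector: the word's id if the word occurs in the tweet, else 0."""
    vec = [0] * len(vocab)
    for word in set(tweet.lower().split()):
        if word in vocab:
            vec[vocab[word] - 1] = vocab[word]
    return vec


toy_tweets = ["tax cuts now", "tax reform now", "tax vote"]  # made-up examples
vocab = build_top_n_vocab(toy_tweets, n=2)
print(vocab)                              # {'tax': 1, 'now': 2}
print(encode_option2("vote now", vocab))  # [0, 2]
```

Passing the id as the feature value follows the wording of the assignment; a 0/1 presence indicator is a common alternative, since the id itself carries no numeric meaning.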

In [68]:
# Build the feed-forward model used for all "without handles" experiments.
# The architecture is identical for every optimizer; only compile() differs.
def build_model_without_handles(dropout_rate=0.5):
    model = Sequential()
    # +1 because Tokenizer ids start at 1 and 0 is reserved for padding
    model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=max_seq_length))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(dropout_rate))  # Dropout for regularization
    model.add(Dense(64, activation='relu'))
    model.add(Dense(len(label_encoder.classes_), activation='softmax'))
    return model

# Adam optimizer (Keras default settings)
model_without_handles_adam = build_model_without_handles()
model_without_handles_adam.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# SGD optimizer (Keras default settings)
model_without_handles_sgd = build_model_without_handles()
model_without_handles_sgd.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# RMSProp optimizer (Keras default settings)
model_without_handles_rmsprop = build_model_without_handles()
model_without_handles_rmsprop.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
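
The models above regularize with dropout only; the imported `l2` regularizer is not attached to any layer. In Keras it can be passed through a layer's `kernel_regularizer` argument, for example `Dense(128, activation='relu', kernel_regularizer=l2(1e-4))`, where the factor `1e-4` is an illustrative assumption, not a tuned value. The plain-Python sketch below shows the quantity such a penalty adds to the loss: the factor times the sum of squared weights. The weights and the loss value are made up.

```python
def l2_penalty(weights, factor):
    """L2 term added to the loss: factor * sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)


# Toy 2x2 weight matrix (made-up values, for illustration only)
weights = [[0.5, -1.0], [2.0, 0.0]]
data_loss = 0.70  # hypothetical cross-entropy value

total_loss = data_loss + l2_penalty(weights, factor=0.01)
print(round(total_loss, 4))  # 0.7525
```

Larger weights raise the penalty quadratically, so the optimizer is pushed toward smaller weights; the factor controls how strongly.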

In [69]:
# Train models with different hyperparameters
def train_model(model, X_train, y_train, epochs=5, batch_size=64):
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)

In [70]:
# Train the three models, one per optimizer
train_model(model_without_handles_adam, X_train, y_train, epochs=10, batch_size=64)
train_model(model_without_handles_sgd, X_train, y_train, epochs=10, batch_size=64)
train_model(model_without_handles_rmsprop, X_train, y_train, epochs=10, batch_size=64)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [71]:
# Evaluate models on the test set
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_pred_classes = np.argmax(y_pred, axis=1)
    accuracy = accuracy_score(y_test, y_pred_classes)
    return accuracy

accuracy_without_handles_adam = evaluate_model(model_without_handles_adam, X_test, y_test)
accuracy_without_handles_sgd = evaluate_model(model_without_handles_sgd, X_test, y_test)
accuracy_without_handles_rmsprop = evaluate_model(model_without_handles_rmsprop, X_test, y_test)



In [72]:
print("Accuracy without Handles (Adam):", accuracy_without_handles_adam)
print("Accuracy without Handles (SGD):", accuracy_without_handles_sgd)
print("Accuracy without Handles (RMSProp):", accuracy_without_handles_rmsprop)

Accuracy without Handles (Adam): 0.7550633833600466
Accuracy without Handles (SGD): 0.5801398805187236
Accuracy without Handles (RMSProp): 0.7401282238088299


In [73]:
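# With "Handle" column: replace NaN values, then prepend the handle to the tweet text
# (plain assignment avoids pandas' chained inplace-modification warning)
train_data['Tweet'] = train_data['Tweet'].fillna('')
train_data['Handle'] = train_data['Handle'].fillna('')
test_data['Tweet'] = test_data['Tweet'].fillna('')
test_data['Handle'] = test_data['Handle'].fillna('')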

train_data['Text'] = train_data['Handle'] + ' ' + train_data['Tweet']
test_data['Text'] = test_data['Handle'] + ' ' + test_data['Tweet']

# Preprocess tweet text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data['Text'])
X_train = tokenizer.texts_to_sequences(train_data['Text'])
X_test = tokenizer.texts_to_sequences(test_data['Text'])

K = 20  # Option 1
X_train = [seq[:K] for seq in X_train]
X_test = [seq[:K] for seq in X_test]

# Pad sequences to a fixed length
max_seq_length = K
X_train = pad_sequences(X_train, maxlen=max_seq_length)
X_test = pad_sequences(X_test, maxlen=max_seq_length)

# Encode labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data['Party'])
y_test = label_encoder.transform(test_data['Party'])

In [74]:
# Build the feed-forward model used for all "with handles" experiments.
# The architecture is shared; dropout rate and optimizer vary per experiment.
def build_model_with_handles(dropout_rate):
    model = Sequential()
    # +1 because Tokenizer ids start at 1 and 0 is reserved for padding
    model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=max_seq_length))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(dropout_rate))  # Dropout for regularization
    model.add(Dense(64, activation='relu'))
    model.add(Dense(len(label_encoder.classes_), activation='softmax'))
    return model

# Adam optimizer: dropout 0.5, learning rate 0.15
model_with_handles_adam = build_model_with_handles(dropout_rate=0.5)
adam_optimizer = Adam(learning_rate=0.15)
model_with_handles_adam.compile(loss='sparse_categorical_crossentropy', optimizer=adam_optimizer, metrics=['accuracy'])

# SGD optimizer: dropout 0.2, learning rate 0.01
model_with_handles_sgd = build_model_with_handles(dropout_rate=0.2)
sgd_optimizer = SGD(learning_rate=0.01)
model_with_handles_sgd.compile(loss='sparse_categorical_crossentropy', optimizer=sgd_optimizer, metrics=['accuracy'])

# RMSProp optimizer: dropout 0.75, learning rate 0.22
model_with_handles_rmsprop = build_model_with_handles(dropout_rate=0.75)
rmsprop_optimizer = RMSprop(learning_rate=0.22)
model_with_handles_rmsprop.compile(loss='sparse_categorical_crossentropy', optimizer=rmsprop_optimizer, metrics=['accuracy'])
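
The learning rates 0.15 (Adam) and 0.22 (RMSProp) set above are far larger than the Keras default of 0.001 for both optimizers. One way to see why that matters for Adam: with bias correction, the first update moves each weight by roughly the learning rate, whatever the size of the gradient. The sketch below follows the standard Adam update rule with the usual defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-7`); the helper name `adam_first_step` is hypothetical and the gradients are made up.

```python
import math


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    """Size of the very first Adam update for a single weight."""
    m = (1 - beta1) * grad           # first-moment estimate after one step
    v = (1 - beta2) * grad * grad    # second-moment estimate after one step
    m_hat = m / (1 - beta1)          # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients spanning five orders of magnitude give almost the same step
for grad in (0.001, 1.0, 100.0):
    print(grad, round(adam_first_step(grad, lr=0.15), 4))
```

A step of about 0.15 per weight is large next to typical initial weight values, so training with such a rate may overshoot rather than converge; comparing against the default rate would be needed to confirm this for these models.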


In [75]:
# Train models with different optimizers
train_model(model_with_handles_adam, X_train, y_train, epochs=10, batch_size=64)
train_model(model_with_handles_sgd, X_train, y_train, epochs=5, batch_size=64)
train_model(model_with_handles_rmsprop, X_train, y_train, epochs=12, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


In [76]:
# Evaluate models on the test set
accuracy_with_handles_adam = evaluate_model(model_with_handles_adam, X_test, y_test)
accuracy_with_handles_sgd = evaluate_model(model_with_handles_sgd, X_test, y_test)
accuracy_with_handles_rmsprop = evaluate_model(model_with_handles_rmsprop, X_test, y_test)




In [77]:
print("Accuracy with Handles (Adam):", accuracy_with_handles_adam)
print("Accuracy with Handles (SGD):", accuracy_with_handles_sgd)
print("Accuracy with Handles (RMSProp):", accuracy_with_handles_rmsprop)

Accuracy with Handles (Adam): 0.4939530817426781
Accuracy with Handles (SGD): 0.8902083636893486
Accuracy with Handles (RMSProp): 0.4939530817426781
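
The Adam and RMSProp runs report exactly the same accuracy, which is what one would see if both models predicted a single class for every test tweet. This is a hypothesis, not something the output above confirms. A quick check is to compare the accuracy with the majority-class baseline and to count the distinct predicted classes. In the sketch below the labels are made-up stand-ins for `y_test` and the `np.argmax` predictions, and the helper names are hypothetical.

```python
from collections import Counter


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def prediction_counts(y_pred):
    """How often each class is predicted."""
    return dict(Counter(y_pred))


y_true = [0, 1, 1, 0, 1, 1]   # made-up labels
y_pred = [1, 1, 1, 1, 1, 1]   # a collapsed model: always class 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(prediction_counts(y_pred))               # {1: 6}
print(accuracy == majority_baseline(y_true))   # True
```

If a trained model shows a single key in `prediction_counts` and an accuracy equal to the baseline, it has learned nothing beyond the class prior.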


#### Tabulate the results

In [78]:
result = {
    'Experimental Description': ['Feed forward network without Handles (Adam)', 'Feed forward network with Handles (Adam)', 'Feed forward network without Handles (SGD)',
                                 'Feed forward network with Handles (SGD)', 'Feed forward network without Handles (RMSProp)', 'Feed forward network with Handles (RMSProp)'],
    # Models compiled with optimizer='adam' / 'sgd' / 'rmsprop' use the Keras default
    # learning rates: 0.001 for Adam and RMSProp, 0.01 for SGD.
    'Hyperparameters': ["K = 20, Learning rate = 0.001, Dropout probability = 0.5, epochs = 10",
                        "K = 20, Learning rate = 0.15, Dropout probability = 0.5, epochs = 10",
                        "K = 20, Learning rate = 0.01, Dropout probability = 0.5, epochs = 10",
                        "K = 20, Learning rate = 0.01, Dropout probability = 0.2, epochs = 5",
                        "K = 20, Learning rate = 0.001, Dropout probability = 0.5, epochs = 10",
                        "K = 20, Learning rate = 0.22, Dropout probability = 0.75, epochs = 12"],
    'Performance on the Test Set': [0.755063383, 0.493953082, 0.580139881, 0.890208364, 0.740128224, 0.493953082]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(result)
print(df)

                         Experimental Description  \
0     Feed forward network without Handles (Adam)   
1        Feed forward network with Handles (Adam)   
2      Feed forward network without Handles (SGD)   
3         Feed forward network with Handles (SGD)   
4  Feed forward network without Handles (RMSProp)   
5     Feed forward network with Handles (RMSProp)   

                                     Hyperparameters  \
0  K = 20, Learning rate = 0.001, Dropout probabi...   
1  K = 20, Learning rate = 0.15, Dropout probabil...   
2  K = 20, Learning rate = 0.01, Dropout probabil...   
3  K = 20, Learning rate = 0.01, Dropout probabil...   
4  K = 20, Learning rate = 0.001, Dropout probabi...   
5  K = 20, Learning rate = 0.22, Dropout probabil...   

   Performance on the Test Set  
0                     0.755063  
1                     0.493953  
2                     0.580140  
3                     0.890208  
4                     0.740128  
5                     0.493953  
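
The printed DataFrame cuts off the Hyperparameters column because pandas limits the displayed column width by default. One option is to raise that limit with the `display.max_colwidth` option; another, sketched below with the standard library only, is to render the rows as a Markdown table. The helper `render_markdown_table` is hypothetical, and the two rows are taken from the results above.

```python
def render_markdown_table(headers, rows):
    """Return a Markdown table with one line per row and no truncation."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


headers = ["Experiment", "Hyperparameters", "Test accuracy"]
rows = [
    ("Without handles (Adam)",
     "K = 20, learning rate = 0.001, dropout = 0.5, epochs = 10", 0.755063),
    ("With handles (SGD)",
     "K = 20, learning rate = 0.01, dropout = 0.2, epochs = 5", 0.890208),
]

print(render_markdown_table(headers, rows))
```

The resulting text can be pasted into a Markdown cell or the submission text box, where it renders as a table with the full hyperparameter strings visible.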


![Alt text](image-1.png)