#### Problem statement

Predict the political party from the tweet text and the handle

#### Data description
This dataset has three columns - label (party name), twitter handle, tweet text


#### Problem Description:

Design a feed forward deep neural network to predict the political party using the pytorch or tensorflow. 
Build two models

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models and with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    -L2 and dropout regularization techniques
    -SGD, RMSProp and Adamp optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose to have a fixed-sized K words from the tweet text (e.g. the first K word, randomly selected K word etc.). K can be a hyperparameter. 

    - Option 2: you can choose top N (e.g. N=1000) frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass the id, 0 otherwise
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameter used and their values

3. Performance on the test set

 

In [105]:
#Importing necessary modules
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import nltk

In [106]:
from sklearn.model_selection import train_test_split

In [107]:
train_data = pd.read_csv("C:/Users/Anusha Gadgil/Desktop/Deep learning/train.csv")

Performing EDA

In [108]:
train_data["Tweet"]

0        Today, Senate Dems vote to #SaveTheInternet. P...
1        RT @WinterHavenSun: Winter Haven resident / Al...
2        RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3        RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4        RT @Vegalteno: Hurricane season starts on June...
                               ...                        
72730    Check out my op-ed on need for End Executive O...
72731    Yesterday, Betty &amp; I had a great time lear...
72732    We are forever grateful for the service and sa...
72733    Happy first day of school @CobbSchools! #CobbB...
72734    #Zika fears realized in Florida. House GOP act...
Name: Tweet, Length: 72735, dtype: object

In [109]:
test_data=pd.read_csv("C:/Users/Anusha Gadgil/Desktop/Deep learning/test.csv")

In [110]:
test_data

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet
0,1009,Democrat,RepBarragan,"Join me next Friday, May 18 in #Lynwood for ou..."
1,1025,Democrat,RepBarragan,The administration announced its plan today to...
2,1029,Democrat,RepBarragan,Today’s @SouthGateCAgov’s JAA Opening Day Cere...
3,1031,Democrat,RepBarragan,Great visit @Compton_YB! TY for creating a pos...
4,1035,Democrat,RepBarragan,Tune into my Water Quality Town Hall live feed...
...,...,...,...,...
13721,84986,Republican,michaelcburgess,"Forty-five years ago today, Rep. Sam Johnson r..."
13722,84987,Republican,michaelcburgess,Yesterday we all were deeply saddened by the e...
13723,84990,Republican,michaelcburgess,The White House has released a Statement of Ad...
13724,84992,Republican,michaelcburgess,Today I had a productive meeting with @SecAzar...


In [111]:
unique_values = train_data['Party'].unique()
unique_values

array(['Democrat', nan, 'Republican'], dtype=object)

Data Cleaning

In [112]:
#Removing rows where party is "NaN"
Party_Rows =train_data.dropna()

In [113]:
Party_Rows

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet
0,0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...
...,...,...,...,...
72730,86455,Republican,RepTomPrice,Check out my op-ed on need for End Executive O...
72731,86456,Republican,RepTomPrice,"Yesterday, Betty &amp; I had a great time lear..."
72732,86457,Republican,RepTomPrice,We are forever grateful for the service and sa...
72733,86458,Republican,RepTomPrice,Happy first day of school @CobbSchools! #CobbB...


Basic Data Cleaning

In [217]:
import re
train_data['Tweet'] = train_data['Tweet'].astype(str)
# Function to clean up tweets
def clean_tweet(tweet):
    # Remove URLs
    tweet = re.sub(r'http\S+', '', tweet)
    # Remove mentions (e.g., @username)
    tweet = re.sub(r'@\w+', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#\w+', '', tweet)
    # Remove special characters and punctuation
    tweet = re.sub(r'[^a-zA-Z\s]', '', tweet)
    # Convert to lowercase
    tweet = tweet.lower()
    return tweet

# Apply the cleaning function to the 'Tweet' column
train_data['Cleaned Tweet'] = train_data['Tweet'].apply(clean_tweet)


In [115]:
train_data

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet,Cleaned Tweet
0,0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senate dems vote to proud to support si...
1,1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,rt winter haven resident alta vista teacher ...
2,2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,rt noted that hurricane maria has left appro...
3,3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,rt meeting with thanks for taking the time ...
4,4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,rt hurricane season starts on june st puerto ...
...,...,...,...,...,...
72730,86455,Republican,RepTomPrice,Check out my op-ed on need for End Executive O...,check out my oped on need for end executive ov...
72731,86456,Republican,RepTomPrice,"Yesterday, Betty &amp; I had a great time lear...",yesterday betty amp i had a great time learnin...
72732,86457,Republican,RepTomPrice,We are forever grateful for the service and sa...,we are forever grateful for the service and sa...
72733,86458,Republican,RepTomPrice,Happy first day of school @CobbSchools! #CobbB...,happy first day of school


In [116]:
train_data["Cleaned Tweet"]

0        today senate dems vote to  proud to support si...
1        rt  winter haven resident  alta vista teacher ...
2        rt   noted that hurricane maria has left appro...
3        rt  meeting with   thanks for taking the time ...
4        rt  hurricane season starts on june st puerto ...
                               ...                        
72730    check out my oped on need for end executive ov...
72731    yesterday betty amp i had a great time learnin...
72732    we are forever grateful for the service and sa...
72733                          happy first day of school  
72734     fears realized in florida house gop acted to ...
Name: Cleaned Tweet, Length: 72735, dtype: object

Stop word removal, tokenization,Lemmatization

In [244]:
#import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the stopwords corpus if you haven't already
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Initialize NLTK objects for text processing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_text(text):
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords and perform stemming/lemmatization
    tokens = [stemmer.stem(lemmatizer.lemmatize(token.lower())) for token in tokens if token.lower() not in stop_words]
    return tokens

# Apply the preprocessing function to the 'Cleaned Tweet' column
train_data['Processed Tweet'] = train_data['Cleaned Tweet'].apply(preprocess_text)

# Display the DataFrame with the processed text
print(train_data)

[nltk_data] Downloading package stopwords to C:\Users\Anusha
[nltk_data]     Gadgil\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Anusha
[nltk_data]     Gadgil\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Anusha
[nltk_data]     Gadgil\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


      Unnamed: 0       Party         Handle  \
0              0    Democrat  RepDarrenSoto   
1              1    Democrat  RepDarrenSoto   
2              2    Democrat  RepDarrenSoto   
3              3    Democrat  RepDarrenSoto   
4              4    Democrat  RepDarrenSoto   
...          ...         ...            ...   
72730      86455  Republican    RepTomPrice   
72731      86456  Republican    RepTomPrice   
72732      86457  Republican    RepTomPrice   
72733      86458  Republican    RepTomPrice   
72734      86459  Republican    RepTomPrice   

                                                   Tweet  \
0      Today, Senate Dems vote to #SaveTheInternet. P...   
1      RT @WinterHavenSun: Winter Haven resident / Al...   
2      RT @NBCLatino: .@RepDarrenSoto noted that Hurr...   
3      RT @NALCABPolicy: Meeting with @RepDarrenSoto ...   
4      RT @Vegalteno: Hurricane season starts on June...   
...                                                  ...   
72730  Check ou

In [245]:
from collections import Counter
import string

# Define the preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    return words

# Apply preprocessing and tokenize
data['tokenized_text'] = data['Processed Tweet'].apply(preprocess_text)

# Count word frequency
word_frequencies = Counter()

# Update word frequencies for each row
for _, row in data.iterrows():
    word_frequencies.update(row['tokenized_text'])

# Convert word frequencies to a DataFrame
word_freq_df = pd.DataFrame(word_frequencies.items(), columns=['Word', 'Frequency'])

# Sort by frequency 
word_freq_df = word_freq_df.sort_values(by='Frequency', ascending=False)

# Display the DataFrame with sorted word frequencies
print(word_freq_df)





                   Word  Frequency
9                    rt      16033
169                 amp       7745
0                 today       7336
30                thank       5593
563                work       3841
...                 ...        ...
16092    mikecapuanocom          1
16094            prolaw          1
16095  nationticketmast          1
16098          fiendish          1
27155            barney          1

[27156 rows x 2 columns]


Assigning variables to input features and Target Variables

In [246]:
from sklearn.preprocessing import LabelEncoder
X = train_data['Processed Tweet']
Y = LabelEncoder().fit_transform(train_data['Party'])

Performing train test split

In [247]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [248]:
print (len(X_train))
print (len(Y_train))

58188
58188


In [254]:
X_train

24018    [congratul, shemar, coomb, philadelphia, parti...
5865     [yesterday, rocket, attack, israel, iran, revo...
37609    [today, committe, kick, hear, seri, vital, non...
1059     [trump, administr, launch, anoth, attack, jeop...
48124              [help, inform, thank, tireless, effort]
                               ...                        
37194    [glad, receiv, award, tonight, recognit, stron...
6265     [hope, futur, amp, govern, listen, peopl, nati...
54886    [abl, secur, passag, amend, jessi, law, last, ...
860      [era, isnt, equal, pay, equal, work, much, equ...
15795                          [rt, proud, sport, f, rate]
Name: Processed Tweet, Length: 58188, dtype: object

Preparing the Data

In [249]:

max_words = 5000  # Maximum number of words to keep
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

In [250]:
#Tokenizing and Padding
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a fixed length
max_len = 50  # Maximum length of a sequence
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

In [251]:
print(X_train_pad.shape)
print(X_test_pad.shape)
print(X_train.shape)

(58188, 50)
(14547, 50)
(58188,)


Experimenting with RMS prop

In [252]:
import keras
from tensorflow.keras.layers import Embedding, Dense, Flatten, Dropout

# Assuming max_words and max_len are defined
embedding_dim = 50

model = keras.Sequential([
    Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(1024, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer with a single unit for binary classification
])

model.summary()




Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 50, 50)            250000    
                                                                 
 flatten_6 (Flatten)         (None, 2500)              0         
                                                                 
 dense_27 (Dense)            (None, 128)               320128    
                                                                 
 dense_28 (Dense)            (None, 1024)              132096    
                                                                 
 dense_29 (Dense)            (None, 1)                 1025      
                                                                 
Total params: 703249 (2.68 MB)
Trainable params: 703249 (2.68 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Experimenting with Adam and SGD

In [193]:
from tensorflow.keras.optimizers import Adam
# Compile the model
model.compile(optimizer=SGD(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

In [194]:
print("Shape of Y_train:", Y_train.shape)
print("Shape of Y_test:", Y_test.shape)

Shape of Y_train: (58188,)
Shape of Y_test: (14547,)


In [195]:
batch_size = 64
epochs = 5
model.fit(X_train_pad, Y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_pad, Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x2131ffba490>

experimenting with RMSprop

In [196]:
from tensorflow.keras.optimizers import RMSprop
# Compile the model
model.compile(optimizer=RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

In [197]:
batch_size = 32
epochs = 5
model.fit(X_train_pad, Y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_pad, Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x2132039f1c0>

Experimenting with ADAM

In [198]:
from tensorflow.keras.optimizers import Adam
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
batch_size = 32
epochs = 5
model.fit(X_train_pad, Y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_pad, Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x213203aefa0>

## Experimenting with L2 and dropout regularization

In [202]:
from tensorflow.keras.layers import Dense, Dropout
embedding_dim = 50
from tensorflow.keras.regularizers import l2
model = keras.Sequential([
    Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len),
    keras.layers.Flatten(input_shape=(50, 50)),
    keras.layers.Dense(128, activation='relu'),
    Dropout(0.5),
    keras.layers.Dense(1024, activation='relu'),
    keras.layers.Dense(512, activation='elu',kernel_regularizer=l2(0.00001)),
    Dropout(0.5),
    keras.layers.Dense(256, activation='tanh', kernel_regularizer=l2(0.00001)),
    keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 50, 50)            250000    
                                                                 
 flatten_5 (Flatten)         (None, 2500)              0         
                                                                 
 dense_22 (Dense)            (None, 128)               320128    
                                                                 
 dropout_2 (Dropout)         (None, 128)               0         
                                                                 
 dense_23 (Dense)            (None, 1024)              132096    
                                                                 
 dense_24 (Dense)            (None, 512)               524800    
                                                                 
 dropout_3 (Dropout)         (None, 512)              

In [203]:
from tensorflow.keras.optimizers import Adam
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

In [205]:
batch_size = 16
epochs = 10
model.fit(X_train_pad, Y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_pad, Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10


<keras.src.callbacks.History at 0x2171becadf0>

Preparing the testing data for validation

In [218]:
test_data=pd.read_csv('C:/Users/Anusha Gadgil/Desktop/Deep learning/test.csv')
test_data

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet
0,1009,Democrat,RepBarragan,"Join me next Friday, May 18 in #Lynwood for ou..."
1,1025,Democrat,RepBarragan,The administration announced its plan today to...
2,1029,Democrat,RepBarragan,Today’s @SouthGateCAgov’s JAA Opening Day Cere...
3,1031,Democrat,RepBarragan,Great visit @Compton_YB! TY for creating a pos...
4,1035,Democrat,RepBarragan,Tune into my Water Quality Town Hall live feed...
...,...,...,...,...
13721,84986,Republican,michaelcburgess,"Forty-five years ago today, Rep. Sam Johnson r..."
13722,84987,Republican,michaelcburgess,Yesterday we all were deeply saddened by the e...
13723,84990,Republican,michaelcburgess,The White House has released a Statement of Ad...
13724,84992,Republican,michaelcburgess,Today I had a productive meeting with @SecAzar...


In [220]:
test_data['Tweet'] = test_data['Tweet'].astype(str)
# Function to clean up tweets
def clean_tweet(tweet):
    # Remove URLs
    tweet = re.sub(r'http\S+', '', tweet)
    # Remove mentions (e.g., @username)
    tweet = re.sub(r'@\w+', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#\w+', '', tweet)
    # Remove special characters and punctuation
    tweet = re.sub(r'[^a-zA-Z\s]', '', tweet)
    # Convert to lowercase
    tweet = tweet.lower()
    return tweet

# Apply the cleaning function to the 'Tweet' column
test_data['Cleaned Tweet'] = test_data['Tweet'].apply(clean_tweet)

In [221]:
def preprocess_traintext(text):
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords and perform stemming/lemmatization
    tokens = [stemmer.stem(lemmatizer.lemmatize(token.lower())) for token in tokens if token.lower() not in stop_words]
    return tokens

# Apply the preprocessing function to the 'Cleaned Tweet' column
test_data['Processed Tweet'] = test_data['Cleaned Tweet'].apply(preprocess_text)

# Display the DataFrame with the processed text
print(test_data)

       Unnamed: 0       Party           Handle  \
0            1009    Democrat      RepBarragan   
1            1025    Democrat      RepBarragan   
2            1029    Democrat      RepBarragan   
3            1031    Democrat      RepBarragan   
4            1035    Democrat      RepBarragan   
...           ...         ...              ...   
13721       84986  Republican  michaelcburgess   
13722       84987  Republican  michaelcburgess   
13723       84990  Republican  michaelcburgess   
13724       84992  Republican  michaelcburgess   
13725       84998  Republican  michaelcburgess   

                                                   Tweet  \
0      Join me next Friday, May 18 in #Lynwood for ou...   
1      The administration announced its plan today to...   
2      Today’s @SouthGateCAgov’s JAA Opening Day Cere...   
3      Great visit @Compton_YB! TY for creating a pos...   
4      Tune into my Water Quality Town Hall live feed...   
...                                    

In [224]:
X_testcheck=test_data['Processed Tweet']
X_testcheck
Y_testcheck= LabelEncoder().fit_transform(test_data['Party'])
Y_testcheck

array([0, 0, 0, ..., 1, 1, 1])

In [225]:
X_test_Sequence = tokenizer.texts_to_sequences(X_testcheck)

# Pad sequences to a fixed length
max_len = 50  # Maximum length of a sequence

X_test_padded = pad_sequences(X_test_Sequence, maxlen=max_len)

In [226]:
X_test_padded.shape

(13726, 50)

In [228]:
Y_testcheck.shape

(13726,)

In [229]:
test_loss, test_accuracy = model.evaluate(X_test_padded, Y_testcheck)
print("Test accuracy:", test_accuracy)

Test accuracy: 0.6032347083091736


In [231]:
# Calculate and print classification report and confusion matrix
#checking F1 score after changes
from sklearn.metrics import classification_report, confusion_matrix
# Get model predictions
predictions = model.predict(X_test_padded)
threshold = 0.5  # Threshold for binary classification
y_pred = [(1 if prob >= threshold else 0) for prob in predictions]
y_pred_np=np.array([y_pred])
y_pred_np=y_pred_np.flatten()


print("Classification Report:\n", classification_report(Y_testcheck, y_pred_np))
print("Confusion Matrix:\n", confusion_matrix(Y_testcheck, y_pred_np))

Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.46      0.53      6780
           1       0.58      0.75      0.66      6946

    accuracy                           0.60     13726
   macro avg       0.61      0.60      0.59     13726
weighted avg       0.61      0.60      0.59     13726

Confusion Matrix:
 [[3100 3680]
 [1766 5180]]


F1 score is coming to be 60

2nd Model: Tweets with the Handle

In [255]:
X_test_tweets=test_data['Processed Tweet']
X_test_tweets

0        [join, me, next, friday, may, in, for, our, re...
1        [the, administration, announced, its, plan, to...
2        [todays, s, jaa, opening, day, ceremony, was, ...
3        [great, visit, ty, for, creating, a, positive,...
4        [tune, into, my, water, quality, town, hall, l...
                               ...                        
13721    [fortyfive, years, ago, today, rep, sam, johns...
13722    [yesterday, we, all, were, deeply, saddened, b...
13723    [the, white, house, has, released, a, statemen...
13724    [today, i, had, a, productive, meeting, with, ...
13725    [this, morning, on, the, first, anniversary, o...
Name: Processed Tweet, Length: 13726, dtype: object

In [241]:
X_Handle = pd.get_dummies(X_train['Handle'], prefix='Handle')
X_Handle

Unnamed: 0,Handle_AGBecerra,Handle_AlanGrayson,Handle_AnthonyBrownMD4,Handle_AustinScottGA08,Handle_BennieGThompson,Handle_BettyMcCollum04,Handle_BillPascrell,Handle_BobbyScott,Handle_BradSherman,Handle_Call_Me_Dutch,...,Handle_repjimcooper,Handle_repjoecrowley,Handle_repjohnlewis,Handle_replouiegohmert,Handle_repmarkpocan,Handle_reppittenger,Handle_repsandylevin,Handle_rosadelauro,Handle_sethmoulton,Handle_virginiafoxx
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72730,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
72731,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
72732,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
72733,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [256]:
X_test_tweets=test_data['Processed Tweet']
X_test_tweets

0        [join, me, next, friday, may, in, for, our, re...
1        [the, administration, announced, its, plan, to...
2        [todays, s, jaa, opening, day, ceremony, was, ...
3        [great, visit, ty, for, creating, a, positive,...
4        [tune, into, my, water, quality, town, hall, l...
                               ...                        
13721    [fortyfive, years, ago, today, rep, sam, johns...
13722    [yesterday, we, all, were, deeply, saddened, b...
13723    [the, white, house, has, released, a, statemen...
13724    [today, i, had, a, productive, meeting, with, ...
13725    [this, morning, on, the, first, anniversary, o...
Name: Processed Tweet, Length: 13726, dtype: object

In [257]:
max_words = 5000  # Maximum number of words to keep
tokenizer = Tokenizer(num_words=max_words)
X_test_new_tweet_seq = tokenizer.texts_to_sequences(X_test_tweets)

# Pad sequences to a fixed length
max_len = 50  # Maximum length of a sequence

X_test_new_tweet_seq = pad_sequences(X_test_new_tweet_seq, maxlen=max_len)
X_test_new_tweet_seq

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [258]:
X_test_new_tweet_seq.shape

(13726, 50)

In [259]:
X_handle_np=np.array([])
X_handle_np=X_Handle.values
X_handle_np.shape

(72735, 433)

Model with only handle

In [260]:
X_train_handle, X_test_handle, y_train, y_test = train_test_split(X_handle_np, y, test_size=0.2, random_state=42)

In [261]:
from tensorflow.keras.layers import Dense, Dropout
embedding_dim = 50
from tensorflow.keras.regularizers import l2
model = keras.Sequential([
    keras.layers.Dense(514, activation='relu', input_shape=(433,)),
    keras.layers.Dense(514, activation='relu'),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_30 (Dense)            (None, 514)               223076    
                                                                 
 dense_31 (Dense)            (None, 514)               264710    
                                                                 
 dense_32 (Dense)            (None, 256)               131840    
                                                                 
 dense_33 (Dense)            (None, 1)                 257       
                                                                 
Total params: 619883 (2.36 MB)
Trainable params: 619883 (2.36 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [262]:
from tensorflow.keras.optimizers import Adam
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

In [263]:
batch_size = 16
epochs = 5
model.fit(X_train_handle, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_handle, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x2171a8558b0>

Text Model

In [264]:
from tensorflow.keras.layers import Dense, Dropout
embedding_dim = 50
from tensorflow.keras.regularizers import l2
model2 = keras.Sequential([
    Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len),
    keras.layers.Flatten(input_shape=(50, 50)),
    keras.layers.Dense(128, activation='relu'),
    Dropout(0.5),
    keras.layers.Dense(1024, activation='relu'),
    keras.layers.Dense(512, activation='elu',kernel_regularizer=l2(0.00001)),
    Dropout(0.5),
    keras.layers.Dense(256, activation='tanh', kernel_regularizer=l2(0.00001)),
    keras.layers.Dense(1, activation='sigmoid')
])

model2.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 50, 50)            250000    
                                                                 
 flatten_7 (Flatten)         (None, 2500)              0         
                                                                 
 dense_34 (Dense)            (None, 128)               320128    
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_35 (Dense)            (None, 1024)              132096    
                                                                 
 dense_36 (Dense)            (None, 512)               524800    
                                                                 
 dropout_5 (Dropout)         (None, 512)              

In [265]:
from tensorflow.keras.optimizers import Adam
# Compile the model
model2.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
batch_size = 32
epochs = 5
model2.fit(X_train_pad, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test_pad, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x21305d84f70>

In [267]:
X_test_pad.shape

(14547, 50)

In [268]:
X_test_handle.shape

(14547, 433)

In [269]:
predictions_model_1 = model.predict(X_test_handle)
threshold = 0.5  # Threshold for binary classification
y_pred_model1 = [(1 if prob >= threshold else 0) for prob in predictions_model_1]

predictions_model_2 = model2.predict(X_test_pad)
threshold = 0.5  # Threshold for binary classification
y_pred_model2 = [(1 if prob >= threshold else 0) for prob in predictions_model_2]


y_pred_model1_np=np.array([y_pred_model1])
y_pred_model2_np=np.array([y_pred_model2])


y_pred_np_new=np.logical_or(y_pred_model1_np, y_pred_model2_np).astype(int)
y_pred_np_new=y_pred_np_new.flatten()



In [270]:
y_pred_np_new

array([1, 0, 0, ..., 0, 1, 1])

In [271]:
print(" Report:\n", classification_report(y_test, y_pred_np_new))

 Report:
               precision    recall  f1-score   support

           0       1.00      0.66      0.80      7048
           1       0.76      1.00      0.86      7499

    accuracy                           0.84     14547
   macro avg       0.88      0.83      0.83     14547
weighted avg       0.88      0.84      0.83     14547



Experiment Design:
Built 2 models to classify tweets for 2 political parties

Performed Data Cleaning for both models:

As a part of data cleaning I did the following:
Stopwords removal
Lemmatizations
junk word removal
Tokenisation 




Additionally Experimented with All the techniques listed in the deliverablare given belowere as follows:

In [275]:
import pandas as pd
report = {'Regularization/Optimisation Technique': ['SGD with base parameters', 'RMSprop', 'Adam', 'L2 regularisation and Dropout'], 'Accuracy': [51.5,84,65.7,99]}
reportdf = pd.DataFrame(data=report)
reportdf

Unnamed: 0,Regularization/Optimisation Technique,Accuracy
0,SGD with base parameters,51.5
1,RMSprop,84.0
2,Adam,65.7
3,L2 regularisation and Dropout,99.0


Built 2 models Party prediction:
1. With Handle Accuracy: 84%
2. Without Handle Accuracy:65%