# DSCI 619 Final Project Part 2
## Cameron Lauf

In this project, we use the `Sentiment140` dataset, which contains 1,600,000 tweets extracted with the Twitter API. Each tweet is annotated with a sentiment label (0 = negative, 4 = positive), so the data can be used to train a sentiment classifier.

https://www.kaggle.com/datasets/kazanova/sentiment140

The goal of this project is to use a Recurrent Neural Network (RNN) to perform sentiment analysis on the data.

## Load the data

In [165]:
import pandas as pd
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [105]:
tweets = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1")
tweets.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [106]:
data_columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
tweets = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1", names = data_columns)
tweets.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [107]:
print(len(tweets))

1600000


## Data Cleaning and Preprocessing

For this analysis, only the `target` and `text` columns are relevant. The positive label 4 is also remapped to 1 so that the target is binary (0 = negative, 1 = positive).

In [108]:
# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
tweets = tweets[['target','text']].copy()
tweets['target'] = tweets['target'].replace(4, 1)
tweets.head()


Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


The classes are balanced: there are 800,000 negative and 800,000 positive tweets.

In [109]:
print(tweets['target'].value_counts())

0    800000
1    800000
Name: target, dtype: int64


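To preprocess the text data, the following steps are applied:
* Lowercase the text
* Replace URLs with a `URL` token
* Replace emojis with an `EMOJI`-prefixed label
* Replace usernames with a `USER` token
* Replace non-alphanumeric characters with spaces
* Collapse runs of 3 or more identical characters to 2
* Remove single-character words
* Lemmatize the remaining words

A stop word list is also defined below, but the `preprocess` function as written does not apply it.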
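
To make the regex-based steps concrete, here is a minimal sketch applied to a single made-up tweet. The helper `clean_tweet` and the sample text are hypothetical and only for illustration; the patterns mirror the ones used in this notebook, and the emoji and lemmatization steps are left out so the sketch needs nothing beyond the standard library. Like the `preprocess` function in this notebook, it drops single-character tokens.

```python
import re

# Patterns mirroring the ones used in this notebook (written as raw strings).
URL_PATTERN = r'((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)'
USER_PATTERN = r'@[^\s]+'
NON_ALNUM_PATTERN = r'[^a-zA-Z0-9]'
REPEAT_PATTERN = r'(.)\1\1+'


def clean_tweet(tweet):
    """Hypothetical helper: apply the regex-based steps to one tweet."""
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, ' URL', tweet)        # URLs -> URL token
    tweet = re.sub(USER_PATTERN, ' USER', tweet)      # @mentions -> USER token
    tweet = re.sub(NON_ALNUM_PATTERN, ' ', tweet)     # punctuation -> spaces
    tweet = re.sub(REPEAT_PATTERN, r'\1\1', tweet)    # 'ooooo' -> 'oo'
    # Keep only tokens longer than one character
    return ' '.join(word for word in tweet.split() if len(word) > 1)


sample = "@bob I looooove this!!! http://example.com/x"
cleaned = clean_tweet(sample)
print(cleaned)
```

Because the tweet is lowercased before the `URL` and `USER` tokens are inserted, those tokens stay uppercase and are easy to spot in the processed text.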

In [110]:
text, target = list(tweets['text']), list(tweets['target'])

In [111]:
# Dictionary of emojis
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

# Stop word list
stop_words = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

In [112]:
# Preprocessing function
def preprocess(text_data):
  processed_text = []
  lemmatizer = WordNetLemmatizer()

  # Regex patterns
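  urls = r'((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)'
  users = r'@[^\s]+'        # raw string avoids the invalid '\s' escape warning
  alpha = r'[^a-zA-Z0-9]'   # anything that is not a letter or a digit
  seq = r'(.)\1\1+'         # a character repeated 3 or more times
  seq_replace = r'\1\1'     # collapsed to 2 repetitions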

  for tweet in text_data:
    # Lower casing
    tweet = tweet.lower()

    # Replace URLs
    tweet = re.sub(urls,' URL',tweet)

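    # Replace emojis (the tweet is already lower-cased at this point, so only
    # keys without upper-case letters, such as ';d' or ':)', can match)
    for emoji, meaning in emojis.items():
      tweet = tweet.replace(emoji, 'EMOJI' + meaning)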

    # Replace usernames
    tweet = re.sub(users,' USER', tweet)

    # Replace non-alphanumeric characters with spaces
    tweet = re.sub(alpha, ' ', tweet)

    # Replace consecutive letters (>=3)
    tweet = re.sub(seq, seq_replace, tweet)

    # Drop single-character tokens and lemmatize the rest
    # (the stop_words list defined above is not applied here)
    tweet_words = ''
    for word in tweet.split():
      if len(word) > 1:
        word = lemmatizer.lemmatize(word)
        tweet_words += (word + ' ')

    # Append text to list
    processed_text.append(tweet_words)
  
  return processed_text

In [113]:
processed_text = preprocess(text)

Let's look at a few examples of the now processed text.

In [114]:
processed_text[:3]

['USER URL aww that bummer you shoulda got david carr of third day to do it EMOJIwink ',
 'is upset that he can update his facebook by texting it and might cry a result school today also blah ',
 'USER dived many time for the ball managed to save 50 the rest go out of bound ']
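
The examples above still contain words such as "that", "of" and "to", which appear in the `stop_words` list. If we wanted to filter them out, one possible approach is sketched below. The helper `remove_stop_words` is hypothetical and is not part of the pipeline used in this notebook; `STOP_WORDS` is a small subset of the full list, chosen only to keep the example short.

```python
# A small subset of the notebook's stop_words list, for illustration only.
# A set gives constant-time membership checks, which matters for 1.6M tweets.
STOP_WORDS = {'that', 'you', 'of', 'to', 'do', 'it', 'is', 'he', 'his', 'by', 'and'}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Hypothetical helper: drop every token found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)


example = ('USER URL aww that bummer you shoulda got david carr '
           'of third day to do it EMOJIwink')
filtered = remove_stop_words(example)
print(filtered)
```

Whether removing stop words helps a sequence model is an empirical question, so this is best treated as an optional experiment rather than a required step.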

In [115]:
tweets_df = pd.DataFrame({'target':target,
                          'text':processed_text})
tweets_df.head()

Unnamed: 0,target,text
0,0,USER URL aww that bummer you shoulda got david...
1,0,is upset that he can update his facebook by te...
2,0,USER dived many time for the ball managed to s...
3,0,my whole body feel itchy and like it on fire
4,0,USER no it not behaving at all mad why am here...


## Train Test Split and Vectorization

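We now convert the text to numerical values. A TF-IDF vectorizer is fitted first and reports the vocabulary size; the input to the model itself comes from the Keras `Tokenizer` sequences built in the next cell.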

In [144]:
X = tweets_df['text'].values
y = tweets_df['target'].values
vector = TfidfVectorizer(sublinear_tf=True)
X = vector.fit_transform(X)
print('Vector fitted.')
print('No. of feature_words: ', len(vector.get_feature_names_out()))

Vector fitted.
No. of feature_words:  249739
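
To illustrate what `sublinear_tf=True` changes, the sketch below computes TF-IDF weights by hand for a tiny made-up corpus. It assumes scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and replaces the raw term count `tf` with `1 + ln(tf)` when sublinear scaling is on. The helper `tfidf_row` and the corpus are hypothetical; this is a plain-Python illustration, not the library's implementation.

```python
import math
from collections import Counter

# Tiny made-up corpus, already cleaned and whitespace-tokenizable.
docs = [
    'good good good day',
    'bad day',
    'good movie',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens, sublinear_tf=True):
    """Hypothetical helper: L2-normalized TF-IDF weights for one document."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        if sublinear_tf:
            tf = 1.0 + math.log(tf)          # dampen repeated terms
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


raw = tfidf_row(tokenized[0], sublinear_tf=False)
damped = tfidf_row(tokenized[0], sublinear_tf=True)
print(raw)
print(damped)
```

With sublinear scaling, the word repeated three times receives a smaller share of the document's weight than it does with raw counts.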


In [172]:
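max_words = 5000  # cap on the tokenizer vocabulary (num_words)
max_len = 1       # sequence length after padding/truncation (one token per tweet)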

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(tweets_df.text)
sequences = tokenizer.texts_to_sequences(tweets_df.text)
tweets = pad_sequences(sequences, maxlen=max_len)
print(tweets)

[[ 306]
 [1096]
 [3005]
 ...
 [1800]
 [  44]
 [   1]]
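
With `max_len = 1`, each tweet is reduced to a single token. Because `pad_sequences` truncates from the front by default (`truncating='pre'`) and also pads at the front (`padding='pre'`), the token that survives is the last in-vocabulary token of the tweet. The sketch below mimics that default behaviour in plain Python; `pad_pre` is a hypothetical stand-in, not the Keras function, and the token ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with its default
    padding='pre' and truncating='pre' behaviour."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)   # left-pad with zeros
    return padded


# Made-up token ids for three tweets of different lengths (one is empty).
sequences = [[12, 7, 306], [5, 1096], []]

print(pad_pre(sequences, maxlen=1))
print(pad_pre(sequences, maxlen=4))
```

A larger `max_len`, for example one chosen from a high percentile of the tweet token counts, would let the recurrent layer see more than one token per tweet.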


In [173]:
X_train, X_test, y_train, y_test = train_test_split(tweets, y, stratify = y, test_size = 0.2)

In [174]:
y_train = np.asarray(y_train).astype('float32').reshape((-1,1))
y_test = np.asarray(y_test).astype('float32').reshape((-1,1))

In [175]:
print(f'X_train Size: {X_train.shape}')
print(f'y_train Size: {y_train.shape}')
print('--------------------')
print(f'X_test Size: {X_test.shape}')
print(f'y_test Size: {y_test.shape}')

X_train Size: (1280000, 1)
y_train Size: (1280000, 1)
--------------------
X_test Size: (320000, 1)
y_test Size: (320000, 1)


## Baseline RNN Model with GRU

In [176]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=max_words,
        output_dim=128,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    # Recurrent layer; return only the final state so the model emits
    # one prediction per tweet rather than one per timestep
    tf.keras.layers.GRU(128, return_sequences=False),
    # Binary classifier head (outputs a single logit)
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

In [177]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [178]:
model.summary()

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_18 (Embedding)    (None, None, 128)         640000    
                                                                 
 gru_18 (GRU)                (None, 128)               99072     
                                                                 
 dense_35 (Dense)            (None, 64)                8256      
                                                                 
 dense_36 (Dense)            (None, 1)                 65        
                                                                 
Total params: 747,393
Trainable params: 747,393
Non-trainable params: 0
_________________________________________________________________
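
The final `Dense(1)` layer has no activation and the loss is configured with `from_logits=True`, so the model outputs raw logits rather than probabilities. The sketch below shows, in plain Python, how a logit maps to a probability through the sigmoid and that a numerically stable cross-entropy computed from the logit agrees with the textbook formula on the probability. The function names and the sample values are made up for illustration; this is not the Keras implementation.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logits(label, logit):
    """Numerically stable binary cross-entropy computed from a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, prob):
    """Textbook binary cross-entropy computed from a probability."""
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


logit = 1.5   # made-up model output for one tweet
label = 1.0   # positive sentiment

print(sigmoid(logit))
print(bce_from_logits(label, logit))
print(bce_from_probability(label, sigmoid(logit)))
```

At prediction time, the same sigmoid would be applied to the model output before thresholding at 0.5, which is equivalent to thresholding the logit at 0.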


In [1]:
%%time
history = model.fit(x=X_train, y=y_train, batch_size=32, epochs=5,
                    validation_data=(X_test, y_test), verbose=1)

NameError: ignored