# cs7324 Lab 6 - Sequential Networks

#### Chip Henderson - 48996654 


## Preparation

In this lab, I'll be using a collection of Tweets related to the coronavirus pandemic. My intent is to use Convolutional Neural Network (CNN) and Transformer models to classify the sentiment of each Tweet.

Source: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification

#### Data Import and Tokenization

My dataset comes already split into training and testing files. Since I will need to split out my y values as well, I'm going to combine the two files into a single DataFrame first.

In [3]:
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing import sequence

# I'm using a Latin-1 (ISO-8859-1) encoding because there are some special characters
X_train = pd.read_csv(r'C:\Users\Chip\source\repos\cs7324_code\Data_Sources\Coronavirus_tweets\Corona_NLP_train.csv',encoding='iso-8859-1')
X_test = pd.read_csv(r'C:\Users\Chip\source\repos\cs7324_code\Data_Sources\Coronavirus_tweets\Corona_NLP_test.csv',encoding='iso-8859-1')

X = pd.concat([X_train, X_test])

X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44955 entries, 0 to 3797
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       44955 non-null  int64 
 1   ScreenName     44955 non-null  int64 
 2   Location       35531 non-null  object
 3   TweetAt        44955 non-null  object
 4   OriginalTweet  44955 non-null  object
 5   Sentiment      44955 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.4+ MB
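
One detail in the `info()` output is worth noting: the index runs from 0 to 3797 even though there are 44,955 rows, because `pd.concat` keeps each frame's original index by default, so labels repeat. A minimal sketch with two toy frames (the `train` and `test` names and their values are made up for illustration) shows the effect and how `ignore_index=True` would avoid it:

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are illustrative only)
train = pd.DataFrame({"OriginalTweet": ["a", "b", "c"],
                      "Sentiment": ["Positive", "Neutral", "Negative"]})
test = pd.DataFrame({"OriginalTweet": ["d", "e"],
                     "Sentiment": ["Positive", "Negative"]})

# Default concat keeps each frame's own index, so labels 0 and 1 appear twice
stacked = pd.concat([train, test])
print(stacked.index.tolist())

# ignore_index=True rebuilds a clean 0..n-1 index
combined = pd.concat([train, test], ignore_index=True)
print(combined.index.tolist())
```

Duplicate labels are harmless when the columns are later pulled out with `.values`, but they can surprise any label-based lookup such as `.loc[0]`.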


Because of the memory footprint and compute times for some of these networks, I'm going to make my dataset as simple as possible. I'm also focused on a many-to-one analysis (one sentiment label per Tweet). Therefore, I'll be dropping everything except the `OriginalTweet` and `Sentiment` columns.

In [4]:
features_to_keep = ['OriginalTweet', 'Sentiment']
features_to_drop = [feature for feature in X.columns if feature not in features_to_keep]

# Drop features I won't be using for this lab
X_reduced = X.drop(features_to_drop,axis=1)

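# Preview the reduced frame (value_counts is not called here, so the
# output below is the frame's repr rather than per-class counts)
X_reduced.value_counts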

<bound method DataFrame.value_counts of                                           OriginalTweet           Sentiment
0     @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...             Neutral
1     advice Talk to your neighbours family to excha...            Positive
2     Coronavirus Australia: Woolworths to give elde...            Positive
3     My food stock is not the only one which is emp...            Positive
4     Me, ready to go at supermarket during the #COV...  Extremely Negative
...                                                 ...                 ...
3793  Meanwhile In A Supermarket in Israel -- People...            Positive
3794  Did you panic buy a lot of non-perishable item...            Negative
3795  Asst Prof of Economics @cconces was on @NBCPhi...             Neutral
3796  Gov need to do somethings instead of biar je r...  Extremely Negative
3797  I and @ForestandPaper members are committed to...  Extremely Positive

[44955 rows x 2 columns]>
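
To get actual per-class counts, the method has to be called on the `Sentiment` column. A minimal sketch on a toy frame (the labels below are made up, not drawn from the dataset):

```python
import pandas as pd

# Hypothetical labels standing in for the Sentiment column
toy = pd.DataFrame(
    {"Sentiment": ["Positive", "Neutral", "Positive", "Negative", "Positive"]}
)

# Calling the method (note the parentheses) on one column returns counts per class
counts = toy["Sentiment"].value_counts()
print(counts)

# normalize=True returns class proportions, which helps when checking for imbalance
proportions = toy["Sentiment"].value_counts(normalize=True)
print(proportions)
```

On the real data, the equivalent call would be `X_reduced["Sentiment"].value_counts()`.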

Glancing at the examples above, there are some tokens that I'm not sure I want to include in my embeddings. Some are stop words like "a," "the," "is," and "are." Others are @tags that I don't believe will help my sentiment predictions. When I tokenize these Tweets, I'm going to remove the stop words and examine the difference between leaving the @tags in and removing them.
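
As a rough illustration of that cleaning step, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration rather than the pipeline used later in this lab: the `clean_tweet` helper, the tiny `STOP_WORDS` set, the regular expressions, the sample Tweet, and the choice to strip URLs as well.

```python
import re

# Small illustrative stop-word set; a real run would likely use a fuller list
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

MENTION_RE = re.compile(r"@\w+")            # @tags such as @MeNyrbie
URL_RE = re.compile(r"https?://\S+")        # links (stripping these is an assumption)
TOKEN_RE = re.compile(r"[a-z0-9#@_']+")     # keeps hashtags and, if present, @tags

def clean_tweet(text, drop_mentions=True):
    """Strip URLs (and optionally @tags), lowercase, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", text)
    if drop_mentions:
        text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Made-up sample in the style of the dataset
sample = "The shelves are empty @MeNyrbie https://t.co/abc #COVID19"
print(clean_tweet(sample))                       # @tags removed
print(clean_tweet(sample, drop_mentions=False))  # @tags kept
```

The `drop_mentions` flag makes it easy to build both variants of the corpus and compare them.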

First, I'll go ahead and separate the Tweet text from the target labels so that I'm not tokenizing my target data.

In [9]:
from sklearn.model_selection import train_test_split

X = X_reduced.OriginalTweet.values
y = X_reduced.Sentiment.values

X

array(['@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8',
       'advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order',
       'Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P',
       ...,
       "Asst Prof of Economics @cconces was on @NBCPhiladelphia talking about her recent research on coronavirus' impact on the economy. Watch it here (starting at :33): https://t.co/8tfYNoro5l",
       "Gov need to do somethings instead of biar je rakyat assume 'lockdown' ke or even worst. Harini semua supermarket crowded like hell. Lagi mudah virus tu tersebar ?? #COVID2019",
       'I and @ForestandPaper members are committed to the safety of our employees and our end-