# Sentiment Analysis

## Connect to Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


## Import libraries

In [21]:
import re
import pandas as pd
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Import and explore data

In [3]:
data = pd.read_csv('gdrive/MyDrive/Colab Notebooks/Sentiment-Analysis/data/sentiment.csv', encoding = "latin-1")
data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


The columns need meaningful names, and most of them can be dropped: to build our sentiment analysis model, we only need the text and the sentiment columns.

In [4]:
data.columns = ['Sentiment', 'ids', 'Date', 'Flag', 'User', 'Text']
data.drop(['ids', 'Flag', 'Date', 'User'], axis=1, inplace=True)
data = data.sample(frac=1) # shuffle the data
data.head()

Unnamed: 0,Sentiment,Text
33423,0,I hate Mondays...
80174,0,Super tired I need relaxing time.. Had a very...
765138,0,sittin in 309 soc inequalities....dreading the...
989644,4,First Exam Tommorrow! Ready to suck! XD
1120400,4,@sid88 Happy to know that. Welcome to the fami...


In [5]:
data['Sentiment'].value_counts()

4    800000
0    799999
Name: Sentiment, dtype: int64
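
The class-0 count of 799999 (one short of class 4) is consistent with the first tweet having been read as the header row, as the column names in the first `data.head()` output also suggest. A minimal sketch of loading such a headerless file with explicit column names, assuming the same six-column layout; the two-row `sample_csv` below is a made-up stand-in for `sentiment.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real headerless CSV file.
sample_csv = io.StringIO(
    '0,1,Mon Apr 06,NO_QUERY,user_a,"first tweet"\n'
    '4,2,Mon Apr 06,NO_QUERY,user_b,"second tweet"\n'
)

columns = ['Sentiment', 'ids', 'Date', 'Flag', 'User', 'Text']

# header=None tells pandas that the first line is data, not column names;
# usecols keeps only the two columns the model needs.
data = pd.read_csv(
    sample_csv,
    header=None,
    names=columns,
    usecols=['Sentiment', 'Text'],
)
print(data)
```

With the real file, the path and `encoding="latin-1"` from the cell above would replace `sample_csv`, and the separate rename and drop steps would no longer be needed.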

## Preprocess the data

The Sentiment column takes two values: 0 for negative and 4 for positive. Let's replace them with 'negative' and 'positive' so that the data is easier to read; we'll use the factorize method later to encode the labels for our model.


In [6]:
# Map the numeric labels to readable names in a single assignment.
data['Sentiment'] = data['Sentiment'].map({0: 'negative', 4: 'positive'})
data.head()

Unnamed: 0,Sentiment,Text
33423,negative,I hate Mondays...
80174,negative,Super tired I need relaxing time.. Had a very...
765138,negative,sittin in 309 soc inequalities....dreading the...
989644,positive,First Exam Tommorrow! Ready to suck! XD
1120400,positive,@sid88 Happy to know that. Welcome to the fami...


Now we need to preprocess the text: lowercase it, then remove numbers and usernames.

In [7]:
# Lowercase every tweet.
data['Preprocessed text'] = data['Text'].str.lower()
# Drop any token that contains a digit, then trim surrounding whitespace.
data['Preprocessed text'] = data['Preprocessed text'].apply(lambda x: re.sub(r'\w*\d\w*', '', x).strip())
# Drop a leading @username (up to 15 word characters).
data['Preprocessed text'] = data['Preprocessed text'].apply(lambda x: re.sub(r'^@(?=.*\w)[\w]{1,15}', '', x))
data.head()

Unnamed: 0,Sentiment,Text,Preprocessed text
33423,negative,I hate Mondays...,i hate mondays...
80174,negative,Super tired I need relaxing time.. Had a very...,super tired i need relaxing time.. had a very...
765138,negative,sittin in 309 soc inequalities....dreading the...,sittin in soc inequalities....dreading the en...
989644,positive,First Exam Tommorrow! Ready to suck! XD,first exam tommorrow! ready to suck! xd
1120400,positive,@sid88 Happy to know that. Welcome to the fami...,"@ happy to know that. welcome to the family, kid"
1035332,positive,@mrskutcher What happened? Can you give a link...,what happened? can you give a link please?
794425,negative,I miss NHL hockey ... as if I have to wait an...,i miss nhl hockey ... as if i have to wait an...
1526377,positive,"@LoMo0208 I'll make it easy for you, hire me a...","@ i'll make it easy for you, hire me at $,,/mo..."
622206,negative,I just wanna you here by my side,i just wanna you here by my side
162065,negative,Rain rain go away,rain rain go away
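
In the output above, handles that contain digits (such as `@sid88`) leave a stray `@` behind: the digit pattern runs first and strips `sid88`, so the username pattern no longer finds anything after the `@`. One possible fix, sketched below, removes handles before numbers. The `clean_tweet` helper is hypothetical, and unlike the original pattern it removes mentions anywhere in the tweet, not only at the start.

```python
import re


def clean_tweet(text):
    """Lowercase a tweet, then strip @handles and digit-bearing tokens."""
    text = text.lower()
    # Remove handles first, while their digits are still attached.
    text = re.sub(r'@\w{1,15}', '', text)
    # Then remove any remaining token that contains a digit.
    text = re.sub(r'\w*\d\w*', '', text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()


print(clean_tweet("@sid88 Happy to know that."))
print(clean_tweet("sittin in 309 soc"))
```

It could be applied with `data['Text'].apply(clean_tweet)` in place of the three separate steps.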


## Test on smaller dataset

To run several tests and try different models, we're going to work on a smaller subset of the dataset so that processing doesn't take too long.

In [15]:
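dev_data = data[:50000].copy()  # copy so that later assignments don't touch `data`

In [16]:
# Build the stop-word set once; set membership is a constant-time lookup.
stop_words = set(nltk.corpus.stopwords.words('english'))
# Assign the filtered text back to the column (rows yielded by iterrows are copies).
dev_data['Preprocessed text'] = dev_data['Preprocessed text'].apply(
    lambda text: ' '.join(w for w in text.split() if w not in stop_words)
)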

dev_data.head()

Unnamed: 0,Sentiment,Text,Preprocessed text
33423,negative,I hate Mondays...,hate mondays...
80174,negative,Super tired I need relaxing time.. Had a very...,super tired need relaxing time.. good day yest...
765138,negative,sittin in 309 soc inequalities....dreading the...,sittin soc inequalities....dreading end july....
989644,positive,First Exam Tommorrow! Ready to suck! XD,first exam tommorrow! ready suck! xd
1120400,positive,@sid88 Happy to know that. Welcome to the fami...,"@ happy know that. welcome family, kid"


Now let's tokenize the texts so that we can use them to train our model.

In [19]:
# Set num_words to an integer (e.g. 5000) to cap the vocabulary; None keeps every token.
tokenizer = Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    split=" "
)
tokenizer.fit_on_texts(dev_data['Preprocessed text'].values)
X = tokenizer.texts_to_sequences(dev_data['Preprocessed text'].values)
X = pad_sequences(X)

vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

38864
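
To make the tokenization step concrete, here is a pure-Python illustration of the idea behind `fit_on_texts`, `texts_to_sequences` and `pad_sequences`. It is a simplified sketch, not Keras's implementation: the helper names are hypothetical, and it only mirrors the defaults that matter here (indices start at 1 in descending frequency order, and shorter sequences are left-padded with 0).

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequences(texts, word_index):
    """Turn texts into equal-length integer lists, left-padded with 0."""
    seqs = [[word_index[w] for w in t.split() if w in word_index] for t in texts]
    max_len = max(len(s) for s in seqs)
    return [[0] * (max_len - len(s)) + s for s in seqs]


texts = ["hate mondays", "rain rain go away"]
word_index = build_word_index(texts)
padded = to_padded_sequences(texts, word_index)

# Index 0 is reserved for padding, hence the "+ 1" in the vocabulary size.
vocab_size = len(word_index) + 1
print(word_index, padded, vocab_size)
```

Because index 0 never refers to a real word, the vocabulary size passed to a model is one more than the number of distinct tokens.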


Now that the texts are vectorized, with each text turned into a sequence of integers (each integer being the index of a token in a dictionary), we can create our training data by splitting the data into two parts: a training set and a test set.

In [22]:
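# factorize returns (integer codes, unique labels); codes follow first-seen order.
sentiment_label = dev_data.Sentiment.factorize()
print(sentiment_label[1])
y = sentiment_label[0]
# The data was already shuffled above, so the split keeps row order;
# random_state has no effect when shuffle=False and is omitted.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)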

Index(['negative', 'positive'], dtype='object')
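
Note that `factorize` numbers the labels in order of first appearance, so the code assigned to each class depends on which label happens to come first after shuffling. Here `negative` came first and got code 0. A small sketch of the difference, with an explicit mapping as one way to pin the encoding down; the three-row `sentiments` series is made up for illustration:

```python
import pandas as pd

# Hypothetical labels where 'positive' happens to appear first.
sentiments = pd.Series(['positive', 'negative', 'positive'])

# factorize assigns codes in first-seen order, so 'positive' becomes 0 here.
codes, uniques = sentiments.factorize()
print(list(codes), list(uniques))

# An explicit mapping gives the same encoding regardless of row order.
explicit = sentiments.map({'negative': 0, 'positive': 1}).to_numpy()
print(list(explicit))
```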
