**Sentiment Analysis**: Given a sentence, determine whether it expresses a sad, happy, angry, or other sentiment. For example, _Woohoo! I got selected at Harvard University!_ reflects a happy sentiment, while _I lost my phone in the subway_ reflects a sad one. Tweets are a common data source for sentiment analysis, but, of course, you can use any text data.

We cannot feed raw words to a machine learning model because its inputs must be numerical. To convert text into numbers, we use a _vocabulary_.

**Vocabulary**: The set of unique words that appear in the data. Usually, we build the vocab from the training data, but we can use a predefined vocabulary as well. For this tutorial, we'll build it from scratch. A vocabulary is required to create features from words and sentences to train a model. For example, _I lost my phone in the subway. I am stupid._ yields a vocabulary that contains I, lost, my, phone, in, the, subway, am, stupid.
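
As a small illustration of the definition above, the sketch below builds a vocabulary from the two example sentences. The helper name `build_vocab` and the punctuation-stripping rule are assumptions made for this toy example; the tutorial code later simply splits on spaces.

```python
def build_vocab(text):
    """Return the unique lowercase words of `text`, in order of first appearance."""
    vocab = []
    seen = set()
    for raw in text.lower().split():
        word = raw.strip(".,!?")  # toy rule: drop surrounding punctuation
        if word and word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab


toy_vocab = build_vocab("I lost my phone in the subway. I am stupid.")
print(toy_vocab)
# ['i', 'lost', 'my', 'phone', 'in', 'the', 'subway', 'am', 'stupid']
```

The second _I_ is not added again, which is what makes the vocabulary a set of unique words.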

There are several ways to convert text to numerical features. We will discuss them below.

Watch [this video](https://www.coursera.org/learn/classification-vector-spaces-in-nlp/lecture/gNXI3/vocabulary-feature-extraction) for more details on vocabulary and feature extraction.

To start, let's use a simple tweets dataset from the NLTK library.

In [1]:
# uncomment below line to install dependencies
# !pip install numpy pandas scikit-learn nltk fastprogress

In [2]:
import numpy as np
import nltk
from nltk.corpus import twitter_samples
from collections import Counter
# uncomment below line to download the dataset
# nltk.download('twitter_samples') 

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/virk/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [3]:
# select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

print('Number of positive tweets: ', len(positive_tweets))
print('Number of negative tweets: ', len(negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(positive_tweets))
print('The type of a tweet entry is: ', type(negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>


In [4]:
# Let's look at an example tweet
print("Positive example ->", positive_tweets[0])
print()
print("Negative example ->", negative_tweets[0])

Positive example -> #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Negative example -> hopeless for tmr :(


Let's create a vocabulary by splitting each string on spaces to get a list of words. This process is known as _tokenization_, and each word is known as a _token_. We then take the unique tokens to build the vocabulary, lowercasing the data first for simplicity. Note that the code below builds the vocab from all tweets, before any train/validation split.

In [5]:
# Tokenize
positive_tokens = []
for sent in positive_tweets:
    # strip() removes leading and trailing whitespace from the string.
    positive_tokens += sent.lower().strip().split(" ")
negative_tokens = []
for sent in negative_tweets:
    # strip() removes leading and trailing whitespace from the string.
    negative_tokens += sent.lower().strip().split(" ")
    
# Combine all tokens
tokens = positive_tokens + negative_tokens
print(f"There are total {len(tokens)} tokens.")
# Now let's take unique words only to build a vocab
vocab = list(set(tokens))
print(f"The vocab size is {len(vocab)}.")
print(vocab[:10])

There are total 116244 tokens.
The vocab size is 26233.
['', '@ark_yujin96', '@derekklahn', 'cases', 'experiencing', 'video', 'nae', '#decorating', '@thedemocrats', '@hackadayio']
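
Notice the empty string `''` in the printed vocab. A likely cause is that `split(" ")` splits on every single space, so two consecutive spaces produce an empty token, while newlines are not treated as separators at all. One possible alternative, not used in the rest of this tutorial, is `split()` with no argument, which splits on any run of whitespace. The string below is made up for illustration.

```python
text = "so  happy\ntoday :)"  # made-up string with a double space and a newline

by_space = text.lower().strip().split(" ")
by_whitespace = text.lower().strip().split()

print(by_space)       # ['so', '', 'happy\ntoday', ':)']
print(by_whitespace)  # ['so', 'happy', 'today', ':)']
```

All outputs shown in this tutorial were produced with `split(" ")`, so switching tokenizers would change the token and vocab counts.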


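Now we have a vocab, and we have to figure out how to create features from these words. One way is a one-hot vector built from each word's index in the vocab: a vector of zeros, as long as the vocab, with a 1 at the index of every word that appears in the sentence. (Strictly speaking, a vector with several 1s is a multi-hot vector, but we will keep calling it one-hot here.)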

Let's see an example.

In [6]:
sample = positive_tweets[300]
print(f"Original Sample: {sample}")
words = [vocab.index(word) for word in sample.lower().strip().split(" ")]
print("number of words/tokens in this example:", len(words))
print(words)

Original Sample: Stats for the day have arrived. 2 new followers and NO unfollowers :) via http://t.co/xxlXs6xYwe.
number of words/tokens in this example: 15
[6820, 23233, 22609, 12793, 14574, 23682, 10528, 8235, 12313, 2170, 17006, 888, 14918, 9005, 15061]


Here, our one-hot vector has a size equal to the size of the vocab, i.e. 26233 in this case, and we set the entries at the 15 indices from the example above to 1.

In [7]:
sample_vector = np.zeros(len(vocab), dtype=int)
print("Size of sample vector:", sample_vector.shape)
# Make one-hot
sample_vector[words] = 1.
# Verify by counting
print(Counter(sample_vector))

Size of sample vector: (26233,)
Counter({0: 26218, 1: 15})
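
With 26233 entries, the vector above is too long to inspect by eye. The same steps on a tiny, made-up vocab make the result visible. This sketch uses plain Python lists instead of NumPy, and the names `toy_vocab`, `indices` and `vector` are hypothetical.

```python
toy_vocab = ["am", "i", "in", "lost", "my", "phone", "stupid", "subway", "the"]
sentence = "i lost my phone i am stupid"

# Position of each token in the vocab
indices = [toy_vocab.index(word) for word in sentence.split()]

# Start from all zeros and switch on the entry of every word that occurs
vector = [0] * len(toy_vocab)
for idx in indices:
    vector[idx] = 1  # same effect as sample_vector[words] = 1 with NumPy

print(indices)  # [1, 3, 4, 5, 1, 0, 6]
print(vector)   # [1, 1, 0, 1, 1, 1, 1, 0, 0]
```

The sentence has seven tokens but only six entries are set, because the repeated word _i_ maps to the same index. The vector records whether a word occurs, not how often.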


Now we need to arrange the samples and split the data for training and validation. We will use pandas and scikit-learn here.

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from fastprogress import progress_bar

In [9]:
# Use vocab to convert sentences to one-hot vectors
positive_tweets_vectors = []
for sent in progress_bar(positive_tweets):
    words = [vocab.index(word) for word in sent.lower().strip().split(" ")]
    sent_vec = np.zeros(len(vocab), dtype=int)
    sent_vec[words] = 1.
    positive_tweets_vectors.append(sent_vec)
negative_tweets_vectors = []
for sent in progress_bar(negative_tweets):
    words = [vocab.index(word) for word in sent.lower().strip().split(" ")]
    sent_vec = np.zeros(len(vocab), dtype=int)
    sent_vec[words] = 1.
    negative_tweets_vectors.append(sent_vec)

positive_tweets_vectors = np.array(positive_tweets_vectors)
negative_tweets_vectors = np.array(negative_tweets_vectors)
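
`vocab.index(word)` scans the list from the start on every call, so the loops above get slower as the vocab grows. One common alternative, sketched here on a made-up toy vocab with a hypothetical `word_to_index` dictionary, is to build the word-to-position mapping once and then look words up in roughly constant time.

```python
toy_vocab = ["happy", "sad", ":)", ":(", "phone"]

# Build the mapping once: word -> position in the vocab
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

tokens = "happy phone :)".split(" ")
fast = [word_to_index[word] for word in tokens]    # dictionary lookup
slow = [toy_vocab.index(word) for word in tokens]  # linear scan per word

print(fast)  # [0, 4, 2]
```

Both approaches give the same indices as long as the vocab has no duplicate entries, which holds here because it was built from a set.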

In [10]:
posdf = pd.DataFrame(positive_tweets_vectors)
posdf["target"] = 1
negdf = pd.DataFrame(negative_tweets_vectors)
negdf["target"] = 0
df = pd.concat([posdf, negdf])
print("Size of data and features:", df.shape)

Size of data and features: (10000, 26234)
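
To get a feel for the size of this table, the back-of-the-envelope sketch below estimates the memory of a dense array with this shape, ignoring the extra `target` column. It assumes 8 bytes per entry, which is what `dtype=int` typically gives on 64-bit Linux. This is an estimate of the raw array only, not a measurement of the whole training run.

```python
n_samples, n_features = 10000, 26233  # shape printed above, without the target column
bytes_per_entry = 8                   # assumption: 64-bit integers

dense_bytes = n_samples * n_features * bytes_per_entry
print(f"Dense array: {dense_bytes / 1e9:.2f} GB")

# Every token sets at most one entry to 1, so the number of nonzero entries
# cannot exceed the total token count reported earlier (116244).
max_nonzero = 116244
nonzero_fraction = max_nonzero / (n_samples * n_features)
print(f"At most {nonzero_fraction:.4%} of the entries are nonzero")
```

Almost all entries are zero, which is why features like these are often stored in a sparse format that keeps only the nonzero entries.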


If you are familiar with machine learning algorithms, you already know that this is a massive number of features, and that the computational cost grows further with bigger datasets. Anyway, let's try a logistic regression model to see how it performs.

In [11]:
# Split dataset into training and validation
traindf, valdf = train_test_split(df, shuffle=True, stratify=df.target)
print("Number of samples in training and validation:", traindf.shape, valdf.shape)

Number of samples in training and validation: (7500, 26234) (2500, 26234)


In [12]:
# Verify the classes split
traindf.target.value_counts(), valdf.target.value_counts()

(1    3750
 0    3750
 Name: target, dtype: int64,
 1    1250
 0    1250
 Name: target, dtype: int64)

In [13]:
%%time

# train model
X_train, y_train = traindf.drop("target", axis=1).values, traindf.target.values
X_val, y_val = valdf.drop("target", axis=1).values, valdf.target.values
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))

(7500, 26233) (7500,) (2500, 26233) (2500,)
training accuracy: 1.0
CPU times: user 21.6 s, sys: 2.57 s, total: 24.2 s
Wall time: 10.2 s


In [14]:
print("Validation accuracy:", clf.score(X_val, y_val))

Validation accuracy: 0.9816


Wow! Those are really impressive results for a simple logistic regression. Let's test it with some made-up sentences.

In [15]:
def test(sent):
    # Convert to one-hot vector
    words = [vocab.index(word) for word in sent.lower().strip().split(" ")]
    sent_vec = np.zeros(len(vocab), dtype=int)
    sent_vec[words] = 1.
    sent_vec = sent_vec.reshape(1, -1)
    # predict
    y_pred = clf.predict(sent_vec)
    if y_pred[0] == 1: return "positive"
    elif y_pred[0] == 0: return "negative"
    else: return None

In [16]:
test("I am happy about the results")

'positive'

In [17]:
test("This worked out fine.")

'negative'

In [18]:
test("I lost my phone")

'negative'

In [19]:
test("Julia broke up with John") # What? that's not good

'positive'

In [20]:
test("I forgot my lunch at home")

'negative'

In [21]:
test("I'm sick")

'negative'

In [22]:
test("this is a sick beat")

'positive'

In [23]:
test("Get away from me")

'positive'

So, this worked reasonably well, although a few of the predictions above are wrong (for example, _Julia broke up with John_ came out positive). There are also some issues:
* The model took ~12 GB of RAM while training.
* We need another method to create features, because one-hot vectors get expensive with large datasets.
* We still need to handle words that are not in the vocab, which currently raise the error below.

In [24]:
test("This tutorial worked out well :)") # Damn it :(

ValueError: 'tutorial' is not in list
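
One simple way to avoid this error, sketched here as an assumption rather than as the approach the course takes, is to skip words that are not in the vocab and report them. The helper `encode`, the `word_to_index` dictionary and the toy vocab are all hypothetical.

```python
def encode(sentence, word_to_index):
    """Return a one-hot style vector plus the list of words missing from the vocab."""
    vector = [0] * len(word_to_index)
    unknown = []
    for word in sentence.lower().strip().split(" "):
        idx = word_to_index.get(word)  # None when the word is not in the vocab
        if idx is None:
            unknown.append(word)
        else:
            vector[idx] = 1
    return vector, unknown


toy_vocab = ["this", "worked", "out", "well", ":)"]
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

vector, unknown = encode("This tutorial worked out well :)", word_to_index)
print(vector)   # [1, 1, 1, 1, 1]
print(unknown)  # ['tutorial']
```

Skipping unknown words discards information, so a sentence made up mostly of unseen words would be encoded as nearly all zeros.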

#### References

- Natural Language Processing Specialization - Course 1 - Week 1 - Coursera