# Sentiment analysis using spaCy

## 0. Text processing using spaCy

### 0.1 Lemmatization

Lemmatization reduces a word to its base (dictionary) form.  It is a very common preprocessing step, because you
do not want the model to treat "run" and "running" as unrelated words.

Note:  If you use a very powerful neural network such as a transformer, lemmatization is usually NOT needed.

In [42]:
#running, ran --> run
import spacy

nlp = spacy.load("en_core_web_sm")

In [43]:
doc = nlp("run ran running")

In [44]:
for token in doc:
    print(token.text, token.lemma_)

run run
ran run
running run


### 0.2 Stop words

A common preprocessing step is to remove stop words, e.g., at, in, on, etc.  Removing them helps the model focus on the keywords.

Note: With powerful networks, we usually DON'T remove stop words.

In [46]:
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = list(STOP_WORDS)

In [47]:
doc = nlp("Chaky is going to Disney Land to eat with his best friend Peter.")

In [48]:
clean_tokens = []

for token in doc:
    if token.text not in stopwords:
        clean_tokens.append(token.text)
        
clean_tokens

['Chaky', 'going', 'Disney', 'Land', 'eat', 'best', 'friend', 'Peter', '.']
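
One thing to watch in the loop above: `token.text not in stopwords` is a case-sensitive lookup, and the entries of `STOP_WORDS` are lowercase, so a capitalized stop word such as "The" at the start of a sentence would slip through. Below is a minimal sketch of a case-insensitive check. The tiny stop list and the `str.split` tokenization are stand-ins chosen for illustration (assumptions, not spaCy's list or tokenizer), and `remove_stopwords` is a hypothetical helper name.

```python
# Hypothetical mini stop list for illustration; spaCy's STOP_WORDS is much larger.
STOP = {"the", "is", "to", "with", "his"}


def remove_stopwords(tokens, stop=STOP):
    """Drop tokens whose lowercase form is in the stop set."""
    return [tok for tok in tokens if tok.lower() not in stop]


# Whitespace splitting stands in for the spaCy tokenizer here.
tokens = "The teacher is going to eat with his friend".split()

# Case-sensitive lookup: 'The' survives because only 'the' is in the set.
case_sensitive = [tok for tok in tokens if tok not in STOP]

# Case-insensitive lookup: 'The' is removed as well.
case_insensitive = remove_stopwords(tokens)

print(case_sensitive)
print(case_insensitive)
```

spaCy also exposes a `token.is_stop` attribute, which can replace the manual lookup.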

### 0.3 Removing punctuation

In [50]:
doc = nlp("Chaky , the teacher, $ / @ # AIT !!!???? likes to eat sushi.")

In [52]:
token_no_punct = []

for token in doc:
    if token.pos_ != "PUNCT" and token.pos_ != "SYM":
        token_no_punct.append(token.text)

token_no_punct

['Chaky', 'the', 'teacher', '@', '#', 'AIT', 'likes', 'to', 'eat', 'sushi']

### 0.4 Lowercasing and unnecessary spaces

In [54]:
stripped_lowercase_tokens = []

for token in doc:
    stripped_lowercase_tokens.append(token.text.lower().strip())
    
stripped_lowercase_tokens

['chaky',
 ',',
 'the',
 'teacher',
 ',',
 '$',
 '/',
 '@',
 '#',
 'ait',
 '!',
 '!',
 '!',
 '?',
 '?',
 '?',
 '?',
 'likes',
 'to',
 'eat',
 'sushi',
 '.']

### 0.5 Combine everything

In [55]:
def preprocessing(sentence):
    """Tokenize with spaCy; drop stop words, punctuation, symbols, and whitespace tokens."""
    doc = nlp(sentence)
    clean_tokens = []

    for token in doc:
        # STOP_WORDS is a set, so the (case-sensitive) membership test is fast
        if token.text not in STOP_WORDS and token.pos_ not in ("PUNCT", "SYM", "SPACE"):
            clean_tokens.append(token.text)

    return clean_tokens

## 1. Let's do sentiment analysis with the help of sklearn and spaCy!!!

In [56]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### 1.1 Load data

In [57]:
data_yelp   = pd.read_csv('data/yelp_labelled.txt',   sep='\t', header=None, names=['Review', 'Sentiment'])
data_amazon = pd.read_csv('data/amazon_labelled.txt', sep='\t', header=None, names=['Review', 'Sentiment'])
data_imdb   = pd.read_csv('data/imdb_labelled.txt',   sep='\t', header=None, names=['Review', 'Sentiment'])

In [59]:
# data_yelp.head()

In [60]:
data_yelp.shape, data_amazon.shape, data_imdb.shape

((1000, 2), (1000, 2), (748, 2))

### 1.2 EDA

Check the class balance and look for any null values.

In [61]:
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)

In [62]:
data.shape

(2748, 2)

In [63]:
#check imbalances
data['Sentiment'].value_counts()

1    1386
0    1362
Name: Sentiment, dtype: int64

In [64]:
data.isna().sum()

Review       0
Sentiment    0
dtype: int64

### CountVectorizer

In [65]:
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer(tokenizer=preprocessing)

#examples
corpus = [
    'Chaky is coding python',
    'Deep learning is fun',
    'Spacy is cool and fun',
    'please hashtag #spacy'
]

result = countvec.fit_transform(corpus)

print(countvec.get_feature_names_out()) #list of tokens

print(result.toarray())
#rows are sentences
#columns are unique words

['#' 'chaky' 'coding' 'cool' 'deep' 'fun' 'hashtag' 'learning' 'python'
 'spacy']
[[0 1 1 0 0 0 0 0 1 0]
 [0 0 0 0 1 1 0 1 0 0]
 [0 0 0 1 0 1 0 0 0 1]
 [1 0 0 0 0 0 1 0 0 1]]
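
To make the matrix above less of a black box, here is a minimal pure-Python sketch of what a bag-of-words count does. It assumes the documents are already tokenized; the token lists are copied from the feature names printed above (lowercased, stop words removed), and `count_vector` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

# Tokens per document, assumed to be what the tokenizer produced
# after lowercasing and stop-word removal.
docs = [
    ["chaky", "coding", "python"],
    ["deep", "learning", "fun"],
    ["spacy", "cool", "fun"],
    ["hashtag", "#", "spacy"],
]

# The vocabulary is the sorted set of all unique tokens (one column each).
vocab = sorted({tok for doc in docs for tok in doc})


def count_vector(doc, vocab):
    """Return how many times each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


# One row per document, one column per vocabulary term.
matrix = [count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row has a nonzero entry only in the columns of the words that document contains, which is why the real matrices are stored in sparse form.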




In [66]:
import numpy as np

neg_cond = data.Sentiment == 0
pos_cond = data.Sentiment == 1

neg_df   = data[neg_cond]
pos_df   = data[pos_cond]

In [67]:
neg_result = countvec.fit_transform(neg_df.Review)
neg_vocabs = countvec.get_feature_names_out()

pos_result = countvec.fit_transform(pos_df.Review)
pos_vocabs = countvec.get_feature_names_out()



In [68]:
neg_result.shape, pos_result.shape

((1362, 3158), (1386, 3115))

In [69]:
neg_counts = np.sum(neg_result, axis = 0)
pos_counts = np.sum(pos_result, axis = 0)

In [75]:
df = pd.DataFrame(neg_counts, columns = neg_vocabs).T.sort_values(by=0, ascending=False)

In [76]:
df.head(10)

         0
1      103
bad     96
movie   95
0       92
phone   78
film    72
like    67
food    66
time    62
good    57


### TfidfVectorizer

- usually, in NLP, we don't use `CountVectorizer` on its own
- because it makes very frequent words prominent features, which we don't want
- we want something like a normalized `CountVectorizer` ==> `TfidfVectorizer`

In [77]:
tfidvec = TfidfVectorizer(tokenizer=preprocessing)

neg_result = tfidvec.fit_transform(neg_df.Review)
neg_vocabs = tfidvec.get_feature_names_out()
pos_result = tfidvec.fit_transform(pos_df.Review)
pos_vocabs = tfidvec.get_feature_names_out()

neg_counts = np.sum(neg_result, axis = 0)
pos_counts = np.sum(pos_result, axis = 0)

neg_count_df = pd.DataFrame(neg_counts, columns = neg_vocabs).T.sort_values(by=0, ascending=False)
pos_count_df = pd.DataFrame(pos_counts, columns = pos_vocabs).T.sort_values(by=0, ascending=False)



In [79]:
pos_count_df.head(10)

                   0
great      56.691299
good       47.769436
phone      30.258919
food       22.290479
place      22.060052
service    21.794690
works      21.240647
film       20.164936
movie      19.952642
excellent  19.037113


## 2. Modeling and training

We use a scikit-learn `Pipeline` that feeds TF-IDF features into a linear SVM (`LinearSVC`).

In [82]:
from sklearn.svm import LinearSVC

classifer = LinearSVC()
tfidvec   = TfidfVectorizer()

X = data["Review"]
y = data["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=333)
print(X_test.shape)

(825,)


In [83]:
clf = Pipeline([('tfidf', tfidvec), ('clf', classifer)])

In [85]:
clf.fit(X_train, y_train)

In [86]:
yhat = clf.predict(X_test)

In [87]:
print(classification_report(y_test, yhat))  # argument order is (y_true, y_pred)

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       404
           1       0.85      0.82      0.83       421

    accuracy                           0.83       825
   macro avg       0.83      0.83      0.83       825
weighted avg       0.83      0.83      0.83       825



In [88]:
confusion_matrix(y_test, yhat)  # rows are true labels, columns are predictions

array([[342,  62],
       [ 76, 345]])
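
The report's numbers can be recomputed by hand from the four counts in the confusion matrix. The sketch below hard-codes those counts, reading the matrix with scikit-learn's convention that the first argument indexes the rows and the second the columns: 342 reviews are negative in both labels and predictions, 345 are positive in both, 76 are labeled positive but predicted negative, and 62 are labeled negative but predicted positive. Note that `classification_report` and `confusion_matrix` expect `(y_true, y_pred)`; passing the arguments the other way round swaps the precision and recall columns.

```python
# Counts read off the confusion matrix, treating class 1 as "positive".
tn = 342  # labeled 0, predicted 0
fn = 76   # labeled 1, predicted 0
fp = 62   # labeled 0, predicted 1
tp = 345  # labeled 1, predicted 1

total = tn + fn + fp + tp

accuracy = (tp + tn) / total       # share of correct predictions
precision_pos = tp / (tp + fp)     # of the predicted positives, how many are right
recall_pos = tp / (tp + fn)        # of the actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy  = {accuracy:.2f}")       # 0.83
print(f"precision = {precision_pos:.2f}")  # 0.85
print(f"recall    = {recall_pos:.2f}")     # 0.82
print(f"f1        = {f1_pos:.2f}")         # 0.83
```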

## 3. Real-world example

In [90]:
clf.predict(['Chaky dislikes spiderman game in the PS5.'])

array([0])

## Appendix: TfidfVectorizer

TF-IDF **down-weights terms that appear in many documents, such as "the", "a", and "is", which tend to carry little information.**

In [80]:
from sklearn.feature_extraction.text import TfidfTransformer

#imagine that we already have count features; we can apply TF-IDF weighting
#as a follow-up step
#here we have 3 documents (rows) and 3 terms (columns)
counts = [[3, 0, 1],
          [2, 1, 0],
          [3, 2, 5]]
transformer = TfidfTransformer()
transformer.fit_transform(counts).toarray()

array([[0.91892665, 0.        , 0.39442846],
       [0.84080197, 0.54134281, 0.        ],
       [0.39706158, 0.34085938, 0.85214845]])

Here is how it works under the hood:

The formula is

$$ \text{TF-IDF} =  \text{TF} * \text{IDF} $$

where TF is 

$$ \text{TF}_t = \frac{\text{Count of words t in that document}}{\text{Total count of words in that document}}$$

Thus TF = 

| | 1st word  | 2nd word   | 3rd word |
|---:|:-------------|:-----------|:-----------|
| doc1 | 3/4 = 0.75  | 0     |  1/4 = 0.25 |
| doc2 | 2/3 = 0.6667  | 1/3 = 0.3333    |  0 |
| doc3 | 3/10 = 0.30  | 2/10 = 0.20    |  5/10 = 0.50 |

and 

$$ \text{IDF} = \log\left(\frac{\text{Number of documents}}{\text{Number of documents containing that word}}\right) + 1$$

*Note:  We add one so that words appearing in every document are not ignored entirely.  Here $\log$ is the natural logarithm.*

Thus IDF = 

| | IDF  |    
|---:|:-----------|
| 1st word | $\log$ 3/3 + 1 = 1 |
| 2nd word | $\log$ 3/2 + 1 = 1.4055  |
| 3rd word | $\log$ 3/2 + 1 = 1.4055  | 

*Notice that terms (e.g., the 1st word) that appear in many documents get a low score.  Multiplying the term frequency by this IDF value scales their importance down.*

Thus TF * IDF = 

| | 1st word  | 2nd word | 3rd word|    
|---:|:-----------|:-----------|:-----------|
| doc1 | 0.75 * 1 = 0.75  | 0 * 1.4055 = 0 | 0.25 * 1.4055 = 0.3514 |
| doc2 | 0.6667 * 1 = 0.6667  | 0.3333 * 1.4055 = 0.4685 | 0 * 1.4055 = 0   |
| doc3 | 0.30 * 1 = 0.30  | 0.20 * 1.4055 = 0.2811 | 0.50 * 1.4055 = 0.7027   |


We need to further normalize each document's vector (since documents have unequal numbers of words) by dividing each entry by the vector's Euclidean (L2) norm:

$$ \text{norm}(t_i) = \frac{t_i}{\sqrt{t_1^2 + t_2^2 + \dots + t_n^2}} $$ 

Thus, the normalization factor for each document is

doc1 = $\sqrt{0.75^2 + 0^2 + 0.3514^2} = 0.8282$

doc2 = $\sqrt{0.6667^2 + 0.4685^2 + 0^2} = 0.8149$

doc3 = $\sqrt{0.30^2 + 0.2811^2 + 0.7027^2} = 0.8141$


Thus, normalized(TF * IDF) = 

| | 1st word  | 2nd word | 3rd word|    
|---:|:-----------|:-----------|:-----------|
| doc1 | 0.75 / 0.8282 = 0.9056 | 0 | 0.3514 / 0.8282 = 0.4243 |
| doc2 | 0.6667 / 0.8149 = 0.8181  | 0.4685 / 0.8149 = 0.5749 | 0   |
| doc3 | 0.30 / 0.8141 = 0.3685  | 0.2811 / 0.8141 = 0.3453 | 0.7027 / 0.8141 = 0.8632 |

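**Note**
- My numbers are not exactly the same as the `TfidfTransformer` output above because scikit-learn smooths the IDF by default (`smooth_idf=True`), i.e., it uses $\log\frac{1 + n}{1 + \text{df}} + 1$; with `smooth_idf=False` it uses the formula shown here
- I am using `TfidfTransformer` here.  You may want to use `TfidfVectorizer`, which accepts raw text directly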
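
As a cross-check, the steps in this appendix can be sketched in plain Python. The `tfidf` function below is a hypothetical helper written for illustration, not scikit-learn's implementation. It assumes the natural logarithm and L2 normalization, and it works on raw counts rather than TF fractions, since dividing a row by its document length cancels out in the normalization step. With `smooth_idf=False` it follows the hand calculation; with `smooth_idf=True` it uses the smoothed IDF, $\log\frac{1 + n}{1 + \text{df}} + 1$, which is scikit-learn's default.

```python
import math

counts = [[3, 0, 1],
          [2, 1, 0],
          [3, 2, 5]]


def tfidf(counts, smooth_idf):
    """TF-IDF with L2-normalized rows (illustrative helper, not sklearn's code)."""
    n_docs = len(counts)
    n_terms = len(counts[0])

    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    if smooth_idf:
        # Smoothed variant: acts as if one extra document contained every term.
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        # Unsmoothed variant, as in the hand calculation.
        idf = [math.log(n_docs / d) + 1 for d in df]

    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted])
    return result


smoothed = tfidf(counts, smooth_idf=True)
unsmoothed = tfidf(counts, smooth_idf=False)

for row in smoothed:
    print([round(v, 4) for v in row])
for row in unsmoothed:
    print([round(v, 4) for v in row])
```

The smoothed rows reproduce the `TfidfTransformer` output shown at the top of the appendix (for example, 0.9189 and 0.3944 in the first row).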