#  Yelp reviews + Preprocessing

In this case study, we classify Yelp reviews into two polarities: positive or negative.  Although this may seem a bit basic for our class, I start here to warm you up and to revisit some basics of text preprocessing.

## 0. Basic text preprocessing

Before we do any classification, let's review stopword removal and lemmatization, two common steps for preprocessing text before feeding it to a classifier.

**Note:  If you are using BERT or other pre-trained Hugging Face models, please read the model documentation carefully.  Usually this kind of preprocessing IS NOT required, since these models come with their own tokenizers that handle the text internally.  Extra preprocessing may actually harm the model!**

### Lemmatization

Lemmatization is an essential step in text preprocessing for NLP. It uses the morphological analysis of words to reduce them to their base forms, or "lemmas".  For example, the words walk, walking, walks, and walked all point to the same activity, i.e., walk. Because they are spelled differently, an algorithm would otherwise treat them as four unrelated tokens, so we map them all to a single lemma.

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('run runs running ran')

for token in doc:
    print(token.text, token.lemma_)

run run
runs run
running run
ran run


In [2]:
doc = nlp('Going to eat at Thammasat with my / best      good friends.')

for token in doc:
    print(token.text, token.lemma_)

Going go
to to
eat eat
at at
Thammasat Thammasat
with with
my my
/ /
best good
           
good good
friends friend
. .


### Stopwords

Stopwords are common words that we usually want to remove, such as "of" and "until".

In [3]:
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = list(STOP_WORDS)
print(stopwords[:10])

print(len(stopwords))

['although', 'for', 'could', 'thereupon', 'yet', 'much', "'s", 'is', 'nothing', 'should']
326


Removing stopwords is straightforward.

In [4]:
doc

Going to eat at Thammasat with my / best      good friends.

In [5]:
clean_tokens = []

for token in doc:
    if token.text not in stopwords:  #case-sensitive check; token.is_stop is a case-insensitive alternative
        clean_tokens.append(token.text)
        
clean_tokens

['Going', 'eat', 'Thammasat', '/', 'best', '     ', 'good', 'friends', '.']

### Removing punctuation

Notice that if we feed these tokens into a classifier, tokens such as `.` are unlikely to be useful.  We can therefore also remove punctuation, using the part-of-speech (POS) tags that spaCy assigns.

In [6]:
for token in doc:
    print(token.text, token.pos_)  #notice PUNCT and SPACE

Going VERB
to PART
eat VERB
at ADP
Thammasat PROPN
with ADP
my PRON
/ SYM
best ADJ
      SPACE
good ADJ
friends NOUN
. PUNCT


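To list the labels a pipeline component can assign, query that component.  Note that the `parser` returns dependency labels, not POS tags; the coarse `pos_` values shown above follow the Universal POS tag set, and the fine-grained tags are available from `nlp.get_pipe("tagger").labels`.

In [7]:
nlp.get_pipe("parser").labels[:5]  #limited to 5 to keep the notebook tidy :-)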

('ROOT', 'acl', 'acomp', 'advcl', 'advmod')

Removing punctuation is straightforward as well.

In [8]:
token_no_punct = []

for token in doc:
    if token.pos_ != 'PUNCT' and token.pos_ != 'SPACE' and token.pos_ != 'SYM':
        token_no_punct.append(token.text)
        
token_no_punct

['Going',
 'to',
 'eat',
 'at',
 'Thammasat',
 'with',
 'my',
 'best',
 'good',
 'friends']

### Removing spaces and lowercasing

The last thing to take care of is removing unwanted whitespace using `.strip()` and lowercasing using `.lower()`.

We combine this with the earlier techniques in the next section.

In [9]:
stripped_lowercase_token = []

for token in doc:
    stripped_lowercase_token.append(token.text.lower().strip())
    
stripped_lowercase_token


['going',
 'to',
 'eat',
 'at',
 'thammasat',
 'with',
 'my',
 '/',
 'best',
 '',
 'good',
 'friends',
 '.']

### Combine everything

Let's combine everything into one nice function

In [10]:
def preprocessing(sentence):
    """Drop stopwords, punctuation, spaces, and symbols; return lowercased lemmas."""
    doc = nlp(sentence)

    cleaned_tokens = []
    for token in doc:
        #the stopword check is case-sensitive and runs on the raw text,
        #so a capitalized "We" or an inflected "having" is kept
        if token.text not in STOP_WORDS and token.pos_ not in ('PUNCT', 'SPACE', 'SYM'):
            cleaned_tokens.append(token.lemma_.lower().strip())

    return cleaned_tokens

In [11]:
#let's try
preprocessing("Going to eat at Thammasat with my best      good friends.")

['go', 'eat', 'thammasat', 'good', 'good', 'friend']

In [12]:
#let's try
preprocessing("We are having a lot of fun.")

['we', 'have', 'lot', 'fun']
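
The `'we'` in the output above is worth a second look: "we" is in the stopword list, but the check compares the capitalized text `We`, so it slips through. The sketch below reproduces the effect in plain Python; `toy_stopwords` and `tokens` are made-up stand-ins for illustration, not spaCy objects. With spaCy itself, `token.is_stop` or lowercasing the text before the lookup are possible alternatives.

```python
# Hypothetical mini stopword set standing in for spaCy's STOP_WORDS
toy_stopwords = {"we", "are", "a", "of"}

tokens = ["We", "are", "having", "a", "lot", "of", "fun"]

# Case-sensitive check: the capitalized "We" is not found in the set
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Lowercasing before the lookup removes it
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Whether you want pronouns such as "we" removed depends on the task, so treat this as a design choice rather than a bug to fix blindly.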

## 1. Text Classification 

I will not go into full classification details, since you should already be well versed in this from my CP class.  I will focus on `TfidfVectorizer`, which I did not discuss much there.

### 1.1 Load the data

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [14]:
data_yelp = pd.read_csv('../data/yelp_labelled.txt', sep='\t', header = None, names = ['Review', 'Sentiment'])
data_amazon = pd.read_csv('../data/amazon_labelled.txt', sep='\t', header = None, names = ['Review', 'Sentiment'])
data_imdb = pd.read_csv('../data/imdb_labelled.txt', sep='\t', header = None, names = ['Review', 'Sentiment'])

In [15]:
data_yelp.head()

| | Review | Sentiment |
|---:|:-----------|:-----------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [16]:
data_amazon.head()

| | Review | Sentiment |
|---:|:-----------|:-----------|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [17]:
data_imdb.head()

| | Review | Sentiment |
|---:|:-----------|:-----------|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [18]:
data_yelp.shape, data_amazon.shape, data_imdb.shape

((1000, 2), (1000, 2), (748, 2))
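
The IMDb frame has 748 rows while the other two have 1000. One possible explanation (an assumption; I have not inspected the raw file here) is that some reviews begin with a double quote, which `read_csv` treats as the start of a quoted field, so several lines get merged into one row. The sketch below shows the effect on a small made-up sample (`raw` is hypothetical data, not taken from the dataset) and how `quoting=csv.QUOTE_NONE` turns quote handling off.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same tab-separated layout; two reviews start with a quote
raw = '"Great film\t1\nBad film\t0\n"Loved it\t1\nAwful\t0\n'

cols = ['Review', 'Sentiment']

# Default quoting: everything between the two quotes is read as one field
default_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=cols)

# QUOTE_NONE: quote characters are treated as ordinary text
plain_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=cols,
                       quoting=csv.QUOTE_NONE)

print(len(default_df), len(plain_df))
```

If the row counts also differ on the real files, merged rows would be one way for label digits such as `1` and `0` to end up inside the review text.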

Let's combine all of them into one dataset

In [19]:
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
data.shape

(2748, 2)

### 1.2 EDA

Let's check for class imbalance.  (If there is class imbalance, apply SMOTE INSIDE the cross-validation loop, so that synthetic samples never leak into the validation folds.)

In [20]:
data['Sentiment'].value_counts()

1    1386
0    1362
Name: Sentiment, dtype: int64

In [21]:
data.isnull().sum()

Review       0
Sentiment    0
dtype: int64

### CountVectorizer

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

#`preprocessing` is the function we wrote earlier
#its input is a raw string (one document)
#its output is a list of tokens
countvec = CountVectorizer(tokenizer = preprocessing)

#let's try
corpus = [
    'Chaky is coding python     ',
    'Deep learning is very deep',
    'Are you sure about this?????',
    'please hashtag #ilovepython',
]
result   = countvec.fit_transform(corpus)

#list of tokens
print(countvec.get_feature_names_out())

#counts
#rows are sentences (documents)
#columns are tokens (vocabulary terms)
print(result.toarray())


['chaky' 'code' 'deep' 'hashtag' 'ilovepython' 'learning' 'python' 'sure']
[[1 1 0 0 0 0 1 0]
 [0 0 2 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0]]


Let's look at the top words, split by positive and negative sentiment.

In [23]:
import numpy as np

neg_cond = data.Sentiment == 0
pos_cond = data.Sentiment == 1

#condition the df
neg_df = data[neg_cond]
pos_df = data[pos_cond]

print("Negative", neg_df.shape)
print("Positive", pos_df.shape)

Negative (1362, 2)
Positive (1386, 2)


In [24]:
#count tokens per class (the vectorizer is refit on each subset, so each gets its own vocabulary)
neg_result   = countvec.fit_transform(neg_df.Review)
neg_vocabs   = countvec.get_feature_names_out()
pos_result   = countvec.fit_transform(pos_df.Review)
pos_vocabs   = countvec.get_feature_names_out()

In [25]:
#top 10 tokens
#sum words across all documents
neg_counts = np.sum(neg_result, axis=0)
pos_counts = np.sum(pos_result, axis=0)

print(neg_counts.shape, pos_counts.shape)
print(neg_vocabs.shape, pos_vocabs.shape)

(1, 2682) (1, 2708)
(2682,) (2708,)


In [26]:
#top ten negative terms
df = pd.DataFrame(neg_counts, columns = neg_vocabs).T.sort_values(by=0, ascending=False)
df.head(10)

| | 0 |
|---:|:-----------|
| bad | 146 |
| movie | 109 |
| 1 | 103 |
| 0 | 92 |
| film | 86 |
| phone | 82 |
| time | 77 |
| like | 71 |
| good | 71 |
| food | 67 |


In [27]:
#top ten positive terms
df = pd.DataFrame(pos_counts, columns = pos_vocabs).T.sort_values(by=0, ascending=False)
df.head(10)

| | 0 |
|---:|:-----------|
| good | 230 |
| great | 194 |
| film | 103 |
| movie | 103 |
| phone | 92 |
| work | 84 |
| love | 74 |
| like | 73 |
| place | 63 |
| food | 60 |


**By doing this, you can derive many useful insights about the data.  Please do this kind of EDA whenever possible.**  You can also use the `wordcloud` library.

### TfidfVectorizer

`TfidfVectorizer` goes one step further: after counting the words, it applies a weighting scheme called TF-IDF, which **down-weights words that appear in many documents and therefore tend to carry less information, like "the", "a", "is".**

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer

#imagine that we already have count features; we can apply tf-idf weighting
#as a follow-up step
#here we have 3 documents (rows) and 3 terms (columns)
counts = [[3, 0, 1],
          [2, 1, 0],
          [3, 2, 5]]
transformer = TfidfTransformer()
transformer.fit_transform(counts).toarray()

array([[0.91892665, 0.        , 0.39442846],
       [0.84080197, 0.54134281, 0.        ],
       [0.39706158, 0.34085938, 0.85214845]])

Here is how it works under the hood.

The formula is

$$ \text{TF-IDF} =  \text{TF} \times \text{IDF} $$

where TF is 

$$ \text{TF}_t = \frac{\text{Count of word } t \text{ in that document}}{\text{Total count of words in that document}}$$

Thus TF = 

| | 1st word  | 2nd word   | 3rd word |
|---:|:-------------|:-----------|:-----------|
| doc1 | 3/4 = 0.75  | 0     |  1/4 = 0.25 |
| doc2 | 2/3 = 0.6667  | 1/3 = 0.3333    |  0 |
| doc3 | 3/10 = 0.30  | 2/10 = 0.20    |  5/10 = 0.50 |

and 

$$ \text{IDF}_t = \ln\left(\frac{\text{Number of documents}}{\text{Number of documents containing word } t}\right) + 1$$

*Note:  We add one so that words appearing in every document (whose log term is zero) are not ignored entirely.*

Thus IDF = 

| | IDF  |    
|---:|:-----------|
| 1st word | $\ln(3/3) + 1 = 1$ |
| 2nd word | $\ln(3/2) + 1 \approx 1.4055$  |
| 3rd word | $\ln(3/2) + 1 \approx 1.4055$  | 

*Notice that terms that appear in many documents (i.e., the 1st word) get a low IDF.  Multiplying the frequency by this IDF term scales their importance down relative to rarer terms.*

Thus TF * IDF = 

| | 1st word  | 2nd word | 3rd word|    
|---:|:-----------|:-----------|:-----------|
| doc1 | 0.75 * 1 = 0.75  | 0 * 1.4055 = 0 | 0.25 * 1.4055 = 0.3514 |
| doc2 | 0.6667 * 1 = 0.6667  | 0.3333 * 1.4055 = 0.4685 | 0 * 1.4055 = 0   |
| doc3 | 0.30 * 1 = 0.30  | 0.20 * 1.4055 = 0.2811 | 0.50 * 1.4055 = 0.7027   |


We further normalize each document vector to unit length (L2 norm), since the documents differ in length:

$$ \text{norm}(t_i) = \frac{t_i}{\sqrt{t_1^2 + t_2^2 + \dots + t_n^2}} $$ 

Thus, the normalization factor for each document is

doc1 = $\sqrt{0.75^2 + 0^2 + 0.3514^2} \approx 0.8282$

doc2 = $\sqrt{0.6667^2 + 0.4685^2 + 0^2} \approx 0.8148$

doc3 = $\sqrt{0.30^2 + 0.2811^2 + 0.7027^2} \approx 0.8142$


Thus, normalized(TF * IDF) = 

| | 1st word  | 2nd word | 3rd word|    
|---:|:-----------|:-----------|:-----------|
| doc1 | 0.75 / 0.8282 ≈ 0.906 | 0 | 0.3514 / 0.8282 ≈ 0.424 |
| doc2 | 0.6667 / 0.8148 ≈ 0.818  | 0.4685 / 0.8148 ≈ 0.575 | 0   |
| doc3 | 0.30 / 0.8142 ≈ 0.368  | 0.2811 / 0.8142 ≈ 0.345 | 0.7027 / 0.8142 ≈ 0.863 |

**Note**
- Intermediate values are rounded for display, so the last digit may differ slightly if you recompute from the rounded figures.
- These numbers differ from the `TfidfTransformer` output above, and not because of floating-point precision: by default scikit-learn uses `smooth_idf=True`, i.e., $\text{IDF} = \ln\left(\frac{1 + n}{1 + \text{df}}\right) + 1$, which gives $\ln(4/3) + 1 \approx 1.2877$ for the 2nd and 3rd words.  With `TfidfTransformer(smooth_idf=False)` the result should match the table above.
- scikit-learn uses the raw count as TF rather than dividing by the document length.  After L2 normalization both give the same result, because the per-document scaling cancels.
- I am using `TfidfTransformer` here.  You may prefer `TfidfVectorizer`, which accepts raw text directly.
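
As a cross-check, here is a minimal pure-Python sketch of the same computation with both IDF variants: `smooth=False` follows the IDF formula above, while `smooth=True` uses $\ln\left(\frac{1+n}{1+\text{df}}\right) + 1$, which is what scikit-learn does by default (`smooth_idf=True`). The function `tfidf` is my own illustration, not a library call, and it uses raw counts as TF, which gives the same result after L2 normalization.

```python
import math

counts = [[3, 0, 1],
          [2, 1, 0],
          [3, 2, 5]]

def tfidf(counts, smooth):
    """Return L2-normalized tf-idf rows for a small dense count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of documents that contain each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    if smooth:
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n_docs / d) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted])
    return result

plain = tfidf(counts, smooth=False)     # follows the hand calculation
smoothed = tfidf(counts, smooth=True)   # follows the TfidfTransformer default

for row in plain:
    print([round(v, 4) for v in row])
for row in smoothed:
    print([round(v, 4) for v in row])
```

The `smoothed` rows should agree with the `TfidfTransformer` output shown earlier, and the `plain` rows with the hand calculation.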

So let's see whether we get different top words!

In [29]:
#note that TfidfVectorizer does the counting for you internally
tfidvec = TfidfVectorizer(tokenizer = preprocessing)

#tf-idf weights per class
neg_result   = tfidvec.fit_transform(neg_df.Review)
neg_vocabs   = tfidvec.get_feature_names_out()
pos_result   = tfidvec.fit_transform(pos_df.Review)
pos_vocabs   = tfidvec.get_feature_names_out()

In [30]:
#top 10 tokens
#sum each token's tf-idf weight across all documents (variable names are reused from above)
neg_counts = np.sum(neg_result, axis=0)
pos_counts = np.sum(pos_result, axis=0)

print(neg_counts.shape, pos_counts.shape)
print(neg_vocabs.shape, pos_vocabs.shape)

(1, 2682) (1, 2708)
(2682,) (2708,)


In [31]:
#top ten negative terms
df = pd.DataFrame(neg_counts, columns = neg_vocabs).T.sort_values(by=0, ascending=False)
df.head(10)

| | 0 |
|---:|:-----------|
| bad | 39.847912 |
| phone | 23.267698 |
| service | 22.505866 |
| time | 22.169014 |
| food | 21.564529 |
| movie | 21.019005 |
| place | 19.833446 |
| good | 19.725452 |
| work | 19.674071 |
| waste | 19.027319 |


In [32]:
#top ten positive terms
df = pd.DataFrame(pos_counts, columns = pos_vocabs).T.sort_values(by=0, ascending=False)
df.head(10)

| | 0 |
|---:|:-----------|
| great | 58.746367 |
| good | 57.9869 |
| phone | 32.194893 |
| work | 30.731898 |
| love | 25.858021 |
| place | 23.602051 |
| film | 22.920675 |
| movie | 22.859297 |
| food | 22.801345 |
| service | 22.183989 |


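As you can see, the TF-IDF ranking looks more informative than the raw counts: the stray `1` and `0` tokens drop out of the top of the negative list, and sentiment-bearing terms such as "service" and "waste" move up.  A lesson here is that if we can give good features to the classifier, the model can often work well right away.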

## 2. Modeling

I am going to keep this short.  Please remember the best practices I taught in the CP class.  :-)

In [33]:
from sklearn.svm import LinearSVC

#define model
classifier = LinearSVC()
tfidvec    = TfidfVectorizer()  #default tokenizer here, not our `preprocessing` function

#define data
X = data['Review']
y = data['Sentiment']

#split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print(X_train.shape, X_test.shape)

#make pipeline
clf = Pipeline([('tfidf', tfidvec), ('clf', classifier)])

#train
clf.fit(X_train, y_train)

#predict
y_pred = clf.predict(X_test)

#metrics
print(classification_report(y_test, y_pred))

#confusion matrix
confusion_matrix(y_test, y_pred)

(2198,) (550,)
              precision    recall  f1-score   support

           0       0.82      0.87      0.85       285
           1       0.85      0.80      0.82       265

    accuracy                           0.84       550
   macro avg       0.84      0.84      0.84       550
weighted avg       0.84      0.84      0.84       550



array([[248,  37],
       [ 53, 212]])
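
To connect the two outputs above, the numbers in the classification report can be recomputed by hand from the confusion matrix. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so the sketch below reads the four cells accordingly; the variable names are my own.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
cm = [[248, 37],
      [53, 212]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
print(round(accuracy, 2))
```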

## 3. Real-world inference

In [34]:
clf.predict(['Wow, this is amzing lesson'])

array([1])

In [35]:
clf.predict(['Wow, this sucks'])

array([0])

In [36]:
clf.predict(['Worth of watching it. Please like it'])

array([1])

In [37]:
clf.predict(['This is not bad'])  #still fails on negation ("not bad")!!! a very useful test... :-)
#one possible fix is to use n-gram features that span several tokens (e.g., bigrams), not just single tokens

array([0])