***Reference:***

***Raschka, Sebastian; Liu, Yuxi (Hayden); Mirjalili, Vahid. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing.*** 

# ***<u>Chapter 8</u> - Applying Machine Learning to Sentiment Analysis***

## NLP: Sentiment Analysis using IMDb Dataset

**Sentiment analysis**, sometimes also called **opinion mining**, is a popular subdiscipline of the broader field of NLP; it is concerned with analyzing the sentiment of documents. A popular task in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.


The movie review dataset consists of 50,000 polar movie reviews that are labeled as either positive or negative; here, positive means that a movie was rated with more than six stars on IMDb, and negative means that a movie was rated with fewer than five stars on IMDb.

## Preprocessing the movie dataset 

In [1]:
# !pip install pyprind

In [2]:
# import pyprind
# import pandas as pd
# import os
# import sys
# from packaging import version


# # change the `basepath` to the directory of the
# # unzipped movie dataset

# basepath = 'm_db'

# labels = {'pos': 1, 'neg': 0}

# # if the progress bar does not show, change stream=sys.stdout to stream=2
# pbar = pyprind.ProgBar(50000, stream=2)

# df = pd.DataFrame()
# for s in ('test', 'train'):
#     for l in ('pos', 'neg'):
#         path = os.path.join(basepath, s, l)
#         for file in sorted(os.listdir(path)):
#             with open(os.path.join(path, file), 
#                       'r', encoding='utf-8') as infile:
#                 txt = infile.read()
                
#             if version.parse(pd.__version__) >= version.parse("1.3.2"):
#                 x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
#                 df = pd.concat([df, x], ignore_index=False)

#             else:
#                 df = df.append([[txt, labels[l]]], 
#                                ignore_index=True)
#             pbar.update()
            
# df.columns = ['review', 'sentiment']

In [3]:
# # Shuffling the DataFrame:
# import numpy as np


# if version.parse(pd.__version__) >= version.parse("1.3.2"):
#     df = df.sample(frac=1, random_state=0).reset_index(drop=True)
    
# else:
#     np.random.seed(0)
#     df = df.reindex(np.random.permutation(df.index))

In [4]:
# Saving the dataset
# df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [33]:
import pandas as pd

movies_df = pd.read_csv('movie_data.csv', encoding='utf-8')

# df = df.rename(columns={"0": "review", "1": "sentiment"})

df = movies_df.copy()

df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [6]:
df.shape

(50000, 2)

## Introducing the bag-of-words model

**The bag-of-words model allows us to represent text as numerical feature vectors.**

*The idea behind bag-of-words is quite simple:*

1. We create a vocabulary of unique tokens—for example, words—from the entire set of documents. 

2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document. 

*Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them* **sparse.**
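
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are made up for this example, and the tokenizer is a naive assumption (lowercase, letters only).

```python
import re

def tokenize(doc):
    # Naive tokenizer (assumption): lowercase and keep runs of letters only
    return re.findall(r"[a-z]+", doc.lower())

def build_vocabulary(docs):
    # Step 1: map every unique token to an integer index (alphabetical order)
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def vectorize(doc, vocab):
    # Step 2: count how often each vocabulary word occurs in the document
    vec = [0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = ["The sun is shining",
        "The weather is sweet"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Even in this tiny example, each vector contains zeros for the words that appear only in the other document, which is where the sparsity comes from.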

### Transforming documents into feature vectors

```CountVectorizer``` takes an array of text data, which can be documents or sentences, and constructs the bag-of-words model for us:

In [7]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])

bag = count.fit_transform(docs)

Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [8]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


As we see above, **the vocabulary is stored in a Python dictionary that maps the unique words to integer indices.**

In [9]:
# print the feature vectors

print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


***Each index position in the feature vectors shown above corresponds to the integer values that are stored as dictionary items in the ```CountVectorizer``` vocabulary.*** 

For example, the first feature at index position ```0``` represents the count of the word ```"and"```, which only occurs in the last document, and the word ```"is"``` at index position ```1``` (the 2nd feature in the document vectors) occurs in all three sentences. 

Those values in the feature vectors are also called the 

**raw term frequencies: tf (t,d)—the number of times a term t occurs in a document d.**


In the bag-of-words model, the word or term order in a sentence or document does not matter. The order in which the term frequencies appear in the feature vector is derived from the vocabulary indices, which are usually assigned alphabetically.

### N-gram Models

The sequence of items in the bag-of-words model that we just created is also called the **1-gram or unigram model**—each item or token in the vocabulary represents a single word. 

More generally, the contiguous sequences of items in NLP—words, letters, or symbols—are also called **n-grams**. 

*The choice of the number, n, in the n-gram model depends on the particular application;* for example, n-grams of size 3 and 4 yield good performances in the anti-spam filtering of email messages.

***To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of our first document, “the sun is shining”, would be constructed as follows:***
- 1-gram: “the”,  “sun”,  “is”,  “shining” 
- 2-gram: “the sun”,  “sun is”,  “is shining” 
- Similarly, an n-gram is a contiguous sequence of n words, in the order in which they appear in the document

The ```CountVectorizer``` class allows us to use different n-gram models via its ```ngram_range``` parameter. A 1-gram representation is used by default; for a 2-gram representation, set ```ngram_range=(2,2)```.
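
As a rough sketch of what an n-gram extraction does, the sliding window below reproduces the 1-gram and 2-gram lists for the first document. The function name `ngrams` is made up for this example, and splitting on whitespace is a simplifying assumption.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the sun is shining".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

In ```CountVectorizer```, a range such as ```ngram_range=(1, 2)``` combines both sets into a single vocabulary.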

## Assessing word relevancy via tf-idf

#### **term frequency-inverse document frequency(tf-idf)**

- When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. These frequently occurring words typically don’t contain useful or discriminatory information.

- **tf-idf can be used to downweight these frequently occurring words in the feature vectors.** 

- **The tf-idf can be defined as the product of the term frequency and the inverse document frequency:** 

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

- $\text{tf}(t,d)$ - **term frequency**—***the number of times a term t occurs in a document d.***

- $\text{idf}(t,d)$ - **inverse document frequency**, calculated as

$$\text{idf}(t,d) = \log\frac{n_d}{1 + \text{df}(d,t)}$$

- $n_d$ = ***total # of documents***
- $\text{df}(d,t)$ - ***# of documents, $d$, that contain the term, $t$.***
    

Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in none of the training examples; the log is used to ensure that low document frequencies are not given too much weight.
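
To get a feel for the textbook formula, here is a small sketch that evaluates it for three terms from the example sentences used earlier ("and" occurs in 1 of the 3 documents, "sun" in 2, "is" in all 3). The function name `idf_textbook` is hypothetical, and the natural logarithm is an assumption.

```python
import math

def idf_textbook(n_docs, df):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural log assumed
    return math.log(n_docs / (1 + df))

n_docs = 3
for term, df in [("and", 1), ("sun", 2), ("is", 3)]:
    print(f"{term!r}: df={df}, idf={idf_textbook(n_docs, df):.3f}")
```

The rarer the term, the larger its idf. With the 1 in the denominator, a term that occurs in every document ends up with a slightly negative value under this variant.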

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm = 'l2',
                         smooth_idf = True)

np.set_printoptions(precision=2)

print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw earlier, the word "is" had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.

#### $\rightarrow$ Sklearn ```TfidfTransformer``` calculates tf-idf slightly differently:

**In sklearn inverse document frequency(idf):**

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d,t)}$$

**Similarly tf-idf:**

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d) + 1)$$


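The $+1$ terms in the numerator and denominator of the idf equation are due to setting ```smooth_idf=True```, which assigns zero idf (that is, $\text{idf}(t, d) = \log(1) = 0$) to terms that occur in all documents. The additional $+1$ in the tf-idf equation ensures that such terms are not ignored entirely.


While it is also more typical to normalize the raw term frequencies $\text{tf}(t,d)$ before calculating the tf-idfs, the ```TfidfTransformer``` class normalizes the tf-idfs directly. By default it applies L2-normalization (```norm='l2'```), which returns a vector of length 1 by dividing an unnormalized feature vector $v$ by its L2-norm: $$v_{norm} = \frac{v}{||v||_2}$$ 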

-------------------------------------

##### For example: Let's calculate the tf-idf of the word ```"is"``` in the 3rd document.

3rd doc: ```'The sun is shining, the weather is sweet, and one and one is two'```

$$\text{term frequency of 'is' in 3rd doc.} = \text{tf}("is",d_3) = 3$$

$$\text{df}("is", d_3) = 3$$

$$\implies \text{idf}("is", d_3) = \log \frac{1+3}{1+3} = 0$$

$$\implies \text{tf-idf}("is",d_3) = 3 \times (0+1) = 3$$


The same calculation in code, first for "is" and then for all terms in the 3rd document:

In [11]:
# for "is" in 3rd doc
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print(f'tf-idf of term "is" = {tfidf_is:.2f}')

tf-idf of term "is" = 3.00


In [12]:
tfidf = TfidfTransformer(use_idf=True, norm = None, smooth_idf = True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

For all terms in the third document, the tf-idf vector is: ```[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]```. 

However, notice that the values in this feature vector are different from the values that we obtained from ```TfidfTransformer``` that we used previously. The final step in this tf-idf calculation is the L2-normalization, which can be applied as follows: 

$$\text{tf-idf}(d_3)_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d_3) = 0.45$$

As you can see, the results now match the results returned by scikit-learn’s ```TfidfTransformer```. Thus, this is how sklearn implements tf-idf.
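
The whole calculation for the third document can also be reproduced without scikit-learn. The sketch below uses only the standard library; the term and document frequencies are read off the three example sentences, and the natural logarithm is assumed (as in the `np.log` call above).

```python
import math

# Raw term frequencies of the third document and document frequencies of
# each term, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf = [2, 3, 2, 1, 1, 1, 2, 1, 1]
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# tf * (smoothed idf + 1), following the scikit-learn variant described above
raw = [t * (math.log((1 + n_docs) / (1 + d)) + 1) for t, d in zip(tf, df)]

# L2-normalization: divide by the square root of the sum of squares
l2 = math.sqrt(sum(v * v for v in raw))
normalized = [v / l2 for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in normalized])
```

The first printed list corresponds to the unnormalized vector and the second to the L2-normalized one.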

In [13]:
tfidf = TfidfTransformer(use_idf=True, norm = 'l2', smooth_idf = True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

## Cleaning the text data

Before applying the bag-of-words model, it is important to clean up the text data by stripping all the unwanted characters. 

Below we can see that the text contains HTML markup as well as punctuation and other non-letter characters. HTML markup does not contain many useful semantics, so we will remove it.

Punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will remove all punctuation marks except for emoticon characters, such as :), since those are certainly useful for sentiment analysis.

In [14]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [34]:
import re

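def preprocessor(text):
    # strip HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    
    # collect emoticons such as :), ;-( or =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    
    # lowercase, replace runs of non-word characters with a space, then
    # append the emoticons (without the nose character '-')
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    
    return text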

- Via the first regex, ```<[^>]*>```, in the code, we tried to remove all of the HTML markup from the movie reviews.

- After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as ```emoticons```.

- Next, we removed all non-word characters from the text via the regex ```[\W]+``` and converted the text into lowercase characters.

- Eventually, we added the temporarily stored emoticons to the end of the processed document string. Additionally, we removed the nose character (- in :-)) from the emoticons for consistency.

***Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, we must note that the order of the words doesn’t matter in our bag-of-words model if our vocabulary consists of only one-word tokens.***

In [35]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [36]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [37]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [38]:
# Applying the preprocessor to our dataframe

df['review'] = df['review'].apply(preprocessor)

In [39]:
df.head()

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about w...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0


## Processing documents into tokens

### Tokenization
**Tokenization is the process of splitting a string or text into a list of tokens. A token can be thought of as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.**

In [40]:
def tokenizer(text):
    return text.split()

In [41]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

### Stemming/ Porter stemmer algorithm
In the context of tokenization, another useful technique is **word stemming**, ***which is the process of transforming a word into its root form. It allows us to map related words to the same stem.***


For example, the Porter stemmer maps:
- running -> run
- thus -> thu
- and -> and
- runners -> runner, etc.

The ```nltk``` library implements the Porter stemming algorithm:

In [42]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [43]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

**Using the ```PorterStemmer``` from the ```nltk``` package, we modified our tokenizer function to reduce words to their root form.**

### Lemmatization

**While stemming can create non-real words**, such as ```'thu'``` (from ```'thus'```), as shown in the previous example, a technique called ***lemmatization aims to obtain the canonical (grammatically correct) forms of individual words—the so-called lemmas.*** 

***However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification.***

### Difference between Stemming & Lemmatization

**Stemming identifies the common root form of a word by removing or replacing word suffixes (e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a word and returns its base form (e.g. “better” is lemmatized as “good”).**



<table>
  <tr>
    <td> <img src="Images/stem_lem_1.jpg"  alt="1"></td>
    <td><img src="Images/stem_lem_2.png" alt="2"></td>   
  </tr>
</table>

<table>
  <tr>
    <td> <img src="Images/stem_lem_3.png"  alt="1"></td>
    <td><img src="Images/stem_lem_4.png" alt="2"></td>
  </tr>
</table>
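
The contrast can be made concrete with a deliberately tiny sketch. Everything here is a toy assumption: `toy_stem` strips a few suffixes and is not the Porter algorithm, and `LEMMAS` is a hand-written lookup table standing in for the dictionary a real lemmatizer would consult.

```python
def toy_stem(word):
    # Rule-based: strip a few common suffixes without consulting a dictionary
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table mapping inflected forms to their lemmas
LEMMAS = {"better": "good", "flooding": "flood", "thus": "thus", "was": "be"}

def toy_lemmatize(word):
    # Dictionary-based: return the known base form, else the word unchanged
    return LEMMAS.get(word, word)

for w in ["flooding", "better", "thus"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The suffix rules handle "flooding" but turn "thus" into a non-word and cannot relate "better" to "good"; the lookup gets those right, at the cost of needing a dictionary.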

### Stop Words Removal

**Stop words are simply those words that are extremely common in all sorts of texts** and probably bear no (or only a little) useful information that can be used to distinguish between different classes of documents.

Examples of stop words are *is*, *and*, *has*, and *like*.

Removing stop words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which already downweight the frequently occurring words.


The ```nltk``` library provides a set of English stop words (127 according to the book; the exact number depends on the installed NLTK version):

In [44]:
# import nltk

# nltk.download()

In [45]:
from nltk.corpus import stopwords

stop = stopwords.words("english")

[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification

In [46]:
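# .loc slices by label and includes the end label, so stop at 24999
# to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values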

In [53]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

"""
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
"""


small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l1','l2'],
                     'clf__C': [1.0, 10.0]},
                    ]


lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)

In [54]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


- In the previous code example, we replaced ```CountVectorizer``` and ```TfidfTransformer``` from the previous subsection with ```TfidfVectorizer```, which combines ```CountVectorizer``` with the ```TfidfTransformer```. 


- Our parameter grid consisted of two parameter dictionaries (```small_param_grid``` is a reduced version of the commented-out full ```param_grid```). 

    - In the first dictionary, we used ```TfidfVectorizer``` **with its default settings** (```use_idf=True```, ```smooth_idf=True```, and ```norm='l2'```) to calculate the tf-idfs; 

    - in the second dictionary, we set ```use_idf=False``` and ```norm=None``` **in order to train a model based on raw term frequencies.** 

- Furthermore, for the logistic regression classifier itself, we trained models using L1 and L2 regularization via the ```penalty``` parameter and compared different regularization strengths by defining a range of values for the inverse-regularization parameter ```C```.
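
The line `Fitting 5 folds for each of 12 candidates, totalling 60 fits` printed by the grid search follows from expanding each dictionary into the Cartesian product of its value lists. The sketch below recounts this with `itertools.product`; the tokenizer functions and the stop-word list are replaced by placeholder strings, since only the counts matter here.

```python
from itertools import product

# Same structure as small_param_grid, with placeholder strings for the
# tokenizer functions and the stop-word list
grid = [
    {"vect__ngram_range": [(1, 1)],
     "vect__stop_words": [None],
     "vect__tokenizer": ["tokenizer", "tokenizer_porter"],
     "clf__penalty": ["l2"],
     "clf__C": [1.0, 10.0]},
    {"vect__ngram_range": [(1, 1)],
     "vect__stop_words": ["stop", None],
     "vect__tokenizer": ["tokenizer"],
     "vect__use_idf": [False],
     "vect__norm": [None],
     "clf__penalty": ["l1", "l2"],
     "clf__C": [1.0, 10.0]},
]

# Each dictionary is expanded independently into its Cartesian product
candidates = [dict(zip(d, combo)) for d in grid for combo in product(*d.values())]
n_folds = 5
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

The first dictionary contributes 4 combinations and the second 8, so adding one more value to any list multiplies the cost of that dictionary accordingly.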

In [55]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x00000233B25D03A0>}
CV Accuracy: 0.897


In [56]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.899


### The naïve Bayes classifier

A still very popular classifier for text classification is the naïve Bayes classifier, which gained popularity in applications of email spam filtering. **Naïve Bayes classifiers** are easy to implement, computationally efficient, and tend to **perform particularly well on relatively small datasets** compared to other algorithms.

## Working with bigger data - online algorithms and out-of-core learning

**Out-of-core learning** allows us to work with datasets that are too large to fit into memory by fitting the classifier incrementally on smaller batches of the dataset.


**Stochastic Gradient Descent**: ***it is an optimization algorithm that updates the model’s weights using one example at a time.*** 

We will make use of the ```partial_fit``` function of ```SGDClassifier``` in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small mini-batches of documents. 
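
To illustrate what incremental fitting means, here is a toy logistic regression with a `partial_fit` method that updates the weights one example at a time. It is a sketch under simplifying assumptions (fixed learning rate, no regularization, made-up data) and not the implementation behind ```SGDClassifier```; the class name `TinySGDLogReg` is hypothetical.

```python
import math

class TinySGDLogReg:
    """Toy logistic regression trained with per-example SGD updates."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One pass over the mini-batch, updating the weights after each example
        for x, target in zip(X, y):
            error = self._proba(x) - target  # gradient of the log loss w.r.t. z
            self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * error
        return self

    def predict(self, X):
        return [int(self._proba(x) >= 0.5) for x in X]

# Hypothetical 2-feature data arriving in two mini-batches
batches = [
    ([[2.0, 0.0], [0.0, 2.0]], [1, 0]),
    ([[3.0, 1.0], [1.0, 3.0]], [1, 0]),
]
model = TinySGDLogReg(n_features=2)
for _ in range(20):  # several passes so the toy model converges
    for X_batch, y_batch in batches:
        model.partial_fit(X_batch, y_batch)

print(model.predict([[2.5, 0.5], [0.5, 2.5]]))
```

The model never sees more than one mini-batch at a time, which is the property that makes this style of training usable when the data does not fit in memory.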

In [58]:
# defining a tokenizer func. that cleans the unprocessed text data 

import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words("english")

def tokenizer(text):
    # remove HTML tags (the text between tags is kept)
    text = re.sub(r'<[^>]*>', '', text)
    
    # collect all emoticons; note that because the text is lowercased
    # first, the uppercase 'D' and 'P' alternatives never match here
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', 
                           text.lower())
    
    # replace non-word characters with spaces and append the emoticons
    # at the end, dropping the nose '-' in ':-)' for consistency
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    
    # tokenize on whitespace and drop stop words
    tokenized = [w for w in text.split() if w not in stop]
    
    return tokenized


# define a generator func., stream_docs, that reads in and 
# returns one document at a time:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [60]:
next(stream_docs(path='movie_data.csv'))

# stream_docs works

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [61]:
# define a func., get_minibatch, that will take a document stream
# from the stream_docs func. and return a particular number of 
# documents specified by the size parameter:

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
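
The slicing in `stream_docs` and the batching in `get_minibatch` can be tried on an in-memory file. The sketch below re-implements both under hypothetical names (`stream_rows`, `take_minibatch`) and feeds them a made-up three-row CSV. Note the assumption baked into the slicing: every row must end with a comma, a single-digit label, and a newline.

```python
import io

def stream_rows(handle):
    # Same slicing logic as stream_docs, but reading from any file-like object
    next(handle)  # skip header
    for line in handle:
        # The line ends with ",<label>\n": the label is the second-to-last
        # character and the text is everything before the final comma
        yield line[:-3], int(line[-2])

def take_minibatch(doc_stream, size):
    # Mirrors get_minibatch: an incomplete final batch is discarded
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical three-row CSV held in memory instead of movie_data.csv
fake_csv = io.StringIO(
    'review,sentiment\n'
    'great movie,1\n'
    '"slow, dull plot",0\n'
    'loved it,1\n'
)
stream = stream_rows(fake_csv)
first = take_minibatch(stream, size=2)
second = take_minibatch(stream, size=2)  # only one row left
print(first)
print(second)
```

Two details are worth noticing: CSV quoting characters stay in the text because the line is sliced rather than parsed, and a final batch with fewer than `size` rows comes back as `(None, None)`.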

Unfortunately, we can’t use ```CountVectorizer``` for out-of-core learning since it requires holding the complete vocabulary in memory. Also, ```TfidfVectorizer``` needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. 

However, another useful vectorizer for text processing implemented in scikit-learn is ```HashingVectorizer```. ```HashingVectorizer``` is data-independent and makes use of the hashing trick via the 32-bit ```MurmurHash3``` function.
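
The hashing trick itself fits in a few lines. The sketch below is an illustration only: it uses CRC32 from the standard library as a stand-in for MurmurHash3, a tiny `n_features` so the vectors are printable, and a made-up function name `hashed_counts`.

```python
import zlib

def hashed_counts(tokens, n_features=16):
    # Map each token straight to a column index via a hash; no vocabulary needed
    vec = [0] * n_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[idx] += 1
    return vec

v1 = hashed_counts("the sun is shining".split())
v2 = hashed_counts("the weather is sweet".split())
print(v1)
print(v2)
```

Because the column index depends only on the token, documents can be vectorized one batch at a time without a fitting step. The price is that different tokens may collide in the same column, which is why a large number of features is typically chosen.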

In [62]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

In [65]:
# loss='log_loss' gives logistic regression
# (the option was named 'log' before scikit-learn 1.1)
clf = SGDClassifier(loss='log_loss', random_state=1)


doc_stream = stream_docs(path='movie_data.csv')

In [66]:
# start out-of-core learning 

import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0,1])

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    
    if not X_train:
        break
        
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()
    
    
# we iterated over 45 mini-batches of documents where each mini-batch
# consists of 1,000 documents. 

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18


In [67]:
# Having completed the incremental learning process, we will use the last 
# 5,000 documents to evaluate the performance of our model:

X_test, y_test = get_minibatch(doc_stream, size=5000)

X_test = vect.transform(X_test)

print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868


As you can see, the accuracy of the model is approximately 87 percent, slightly below the accuracy that we achieved in the previous section using the grid search for hyperparameter tuning. However, out-of-core learning is very memory efficient, and it took less than a minute to complete. 


Finally, we can use the last 5,000 documents to update our model:

In [68]:
clf = clf.partial_fit(X_test, y_test)

### The word2vec model 

**A more modern alternative to the bag-of-words model is word2vec**, an algorithm by Google.

The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationship between words.

The idea behind word2vec is to put words that have similar meanings into similar clusters, and via clever vector spacing, the model can reproduce certain words using simple vector math, for example, king - man + woman = queen.
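
The vector arithmetic can be illustrated with hand-crafted vectors. The numbers below are made up for the example and are not learned by word2vec: the first coordinate loosely stands for "royalty" and the second for "gender".

```python
import math

# Hand-crafted 2-D vectors (hypothetical, NOT learned embeddings)
vectors = {
    "king":   (1.0,  1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0,  1.0),
    "woman":  (0.0, -1.0),
    "prince": (0.8,  0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))

# Nearest remaining word by cosine similarity, excluding the query words
candidates = {w: v for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(target, best)
```

In a trained model the vectors have hundreds of dimensions and the match is approximate, but the lookup follows the same pattern: compute the combined vector, then search for its nearest neighbor.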

## Topic modelling with Latent Dirichlet Allocation(LDA)
<a href="https://youtu.be/djxXHg17oTA">Pronunciation</a>

- **Topic modeling :** ***describes the broad task of assigning topics to unlabeled text documents.*** 

For example, a typical application is the categorization of documents in a large text corpus of newspaper articles. In applications of topic modeling, we then aim to assign category labels to those articles, for example, sports, finance, world news, politics, and local news. 

Topic modeling can be considered as a clustering task, a subcategory of unsupervised learning. 

- A popular technique for topic modeling is called **latent Dirichlet allocation (LDA)**. 

However, note that while latent Dirichlet allocation is often abbreviated as LDA, it is not to be confused with linear discriminant analysis, a supervised dimensionality reduction technique.

### Decomposing text documents with LDA

*Since the mathematics behind LDA is quite involved and requires knowledge of Bayesian inference, we will approach this topic from a practitioner’s perspective and interpret LDA using layman’s terms.*


- ***LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics, assuming that each document is a mixture of different words.***

- ***The input to an LDA is the bag-of-words model.*** Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:

    - **A document-to-topic matrix** 
    - **A word-to-topic matrix** 
    

- LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together, we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. 

- *In practice, we are interested in those topics that LDA found in the bag-of-words matrix.* 

- ***The only downside may be that we must define the number of topics beforehand—the number of topics is a hyperparameter of LDA that has to be specified manually.***
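
The shapes involved in this decomposition can be illustrated with two small made-up matrices. The sketch below performs no inference at all; it only shows how a document-to-topic matrix and a topic-by-word matrix multiply into a document-by-word matrix. All numbers are hypothetical.

```python
# Hypothetical numbers for 2 documents, 2 topics and 3 vocabulary words
doc_topic = [[0.9, 0.1],        # document 0 is mostly topic 0
             [0.2, 0.8]]        # document 1 is mostly topic 1
topic_word = [[0.7, 0.3, 0.0],  # topic 0 favours words 0 and 1
              [0.0, 0.2, 0.8]]  # topic 1 favours words 1 and 2

def matmul(a, b):
    # Plain-Python matrix product: (n x k) times (k x m) gives (n x m)
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

doc_word = matmul(doc_topic, topic_word)
for row in doc_word:
    print([round(v, 2) for v in row])
```

Each row of the product describes how strongly each word is expected in that document. LDA works in the opposite direction: it starts from observed word counts and searches for the two factors.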

### <a href="https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2">A Beginner’s Guide to Latent Dirichlet Allocation(LDA)</a>

### LDA with scikit-learn

In [70]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')

df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [71]:
# Using CountVectorizer to create the bag-of-words matrix 
# as input to the LDA.

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english', 
                        max_df=.1, 
                        max_features=5000)

X = count.fit_transform(df['review'].values)

We set the maximum document frequency of words to be considered to 10% (```max_df=.1```) to exclude words that occur too frequently across documents.

The rationale behind the removal of frequently occurring words is that these might be common words appearing across all documents that are, therefore, less likely to be associated with a specific topic category of a given document.

Also, we limited the number of words to be considered to the most frequently occurring 5,000 words (```max_features=5000```), to limit the dimensionality of this dataset to improve the inference performed by LDA. 

However, both ```max_df=.1``` and ```max_features=5000``` are hyperparameter values chosen arbitrarily, and can be tuned.
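
The effect of the two settings can be sketched in plain Python. This is an approximation of the idea, not scikit-learn's code: the function name `select_vocabulary` and the toy documents are made up, and ties between equally frequent terms are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, max_df=0.5, max_features=3):
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Drop terms whose document frequency exceeds the max_df proportion
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]

    # Of the remaining terms, keep the max_features most frequent ones
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

docs = [["movie", "great", "plot"],
        ["movie", "boring", "plot", "plot"],
        ["movie", "great", "cast"],
        ["movie", "awful", "cast", "cast"]]
print(select_vocabulary(docs))
```

Here "movie" is removed because it appears in every document, and the rarest terms are cut by the feature limit, leaving the mid-frequency words that are most likely to separate topics.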

In [72]:
from sklearn.decomposition import LatentDirichletAllocation

# performing LDA with 10 topics
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')

X_topics = lda.fit_transform(X)

By setting ```learning_method='batch'```, we let the lda estimator do its estimation based on all available training data (the bag-of-words matrix) in one iteration, which is slower than the alternative ```'online'``` learning method, but can lead to more accurate results (setting ```learning_method='online'``` is analogous to online or mini-batch learning).

#### Expectation-maximization 
The scikit-learn library’s implementation of LDA uses the expectation-maximization <a href="https://youtu.be/1jSonYih_sM">(EM) algorithm</a> to update its parameter estimates iteratively.

In [74]:
lda.components_.shape

(10, 5000)

To analyze the results, let’s print the five most important words for each of the 10 topics. Note that the word importance values are ranked in increasing order. Thus, to print the top five words, we need to sort the topic array in reverse order:

In [79]:
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
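
The slice `[:-n_top_words - 1:-1]` used in the loop above can look cryptic. The sketch below reproduces it on a plain Python list with made-up weights: `sorted(range(...), key=...)` plays the role of `argsort()`, and the slice then walks that result backwards.

```python
weights = [0.2, 1.5, 0.7, 3.1, 0.1, 2.4]  # hypothetical word weights for one topic
n_top = 3

# Equivalent of argsort(): indices that would sort the values in ascending order
ascending = sorted(range(len(weights)), key=weights.__getitem__)

# Start at the last index (largest weight), step backwards, stop after n_top items
top = ascending[:-n_top - 1:-1]
print(ascending)
print(top)
```

The resulting indices point at the largest weights in decreasing order, which is why the printed words for each topic start with the most important one.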


Based on reading the five most important words for each topic, you may guess that the LDA identified the following topics: 

1. Generally bad movies (not really a topic category) 
2. Movies about families 
3. War movies 
4. Art movies 
5. Crime movies 
6. Horror movies 
7. Comedy movie reviews 
8. Movies somehow related to TV shows 
9. Movies based on books 
10. Action movies


To confirm that the categories make sense based on the reviews, let’s print the beginning of three reviews from the horror movie category (horror movies belong to category 6 at index position ```5```):

In [80]:
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...


Using the preceding code example, we printed the first 300 characters from the top three horror movies. The reviews, even though we don’t know which exact movie they belong to, sound like reviews of horror movies (however, one might argue that ```Horror movie #2``` could also be a good fit for topic category 1: Generally bad movies).

## Summary

- We learned how to use machine learning algorithms to classify text documents based on their polarity, which is a basic task in sentiment analysis in the field of NLP. 

- The bag-of-words model encodes a document as a feature vector.

- tf-idf weights the term frequencies by relevance. 

- Working with text data can be computationally quite expensive due to the large feature vectors that are created during this process; **out-of-core or incremental learning** is used to train a machine learning algorithm without loading the whole dataset into a computer’s memory. 

- Lastly, we used **topic modeling with LDA** to categorize the movie reviews into different categories in an unsupervised fashion. 
