# Statistical Inference

**Statistics** involves drawing inferences about a population from a sample, while **machine learning** identifies predictive patterns that generalize well to new data.

**Statistical inference** constructs mathematical models to describe the data generation process, thereby formalizing understanding or testing hypotheses about the system's behavior.

**Prediction** focuses on forecasting unobserved outcomes or future behaviors, such as determining whether a mouse with a specific gene expression pattern is likely to have a disease.

<div>
   <img src="img/compare.png" width="700">
</div>

### Statsmodels

`Statsmodels` is a Python library designed for statistical modeling, testing, estimation, and performing various statistical analyses. It is built on top of the popular NumPy, SciPy, and pandas libraries, making it a powerful tool for statistical computation and data manipulation within the Python ecosystem.

**Key Features of Statsmodels:**

1. **Statistical Models**: Supports a wide range of statistical models, including linear regression, generalized linear models (GLMs), robust linear models, and many others. Each model can be fitted to data using different statistical methods, such as ordinary least squares (OLS), maximum likelihood estimation, and others.

2. **Time Series Analysis**: Provides extensive tools for time series analysis, which include ARIMA models, vector autoregressive (VAR) models, seasonal decompositions, and tools for dealing with cointegration.

3. **Nonparametric Methods**: Includes nonparametric estimation techniques and kernel density estimation, which are useful for making inferences without assuming a specific parametric model form.

4. **Hypothesis Tests**: Offers a variety of hypothesis tests and procedures for statistical testing, such as t-tests, F-tests, chi-squared tests, and ANOVA.

5. **Regression Analysis**: Facilitates detailed regression analysis, providing extensive output for diagnostic measures, including residuals analysis, influence and leverage diagnostics, and goodness-of-fit metrics.

6. **Plotting Functions**: Integrates with Matplotlib to provide a range of plotting functions for visual analysis of data relationships and model diagnostics.

7. **Robust Statistical Methods**: Includes methods that are robust to outliers and other anomalies in data, which are crucial for practical data analysis.

<div>
   <img src="img/statsmodels.png" width="800">
</div>

In [1]:
import statsmodels.api as sm

In [2]:
#!pip install nltk
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [3]:
# Import useful libraries used for data management
import pandas as pd
import numpy as np
import re
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

import spacy

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to C:\Users\CA.CA-
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Movie Review Sentiment Analysis

IMDB allows users to rate movies on a scale from 1 to 10. For labeling purposes, the curator of the data classified reviews with ratings of 4 stars or less as negative and those with 7 stars or more as positive. Reviews rated 5 or 6 stars were excluded. These labels will serve as the benchmark for comparisons.

In [5]:
dataset = pd.read_csv('IMDB.csv')
dataset

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
dataset['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [7]:
# Sample 1000 reviews from each sentiment
dataset = dataset.groupby('sentiment').apply(lambda x: x.sample(1000)).reset_index(drop = True)

In [8]:
# Convert the 'sentiment' column into a numeric 'label' column; 'negative' as 0, 'positive' as 1
dataset['label'] = dataset['sentiment'].map({'negative':0, 'positive':1})
dataset

Unnamed: 0,review,sentiment,label
0,Martin Lawrence is not a funny man i Runteldat...,negative,0
1,"As soon as it hits a screen, it destroys all i...",negative,0
2,You have to acknowledge Cimino's contribution ...,negative,0
3,"Knights was just a beginning of a series, a pi...",negative,0
4,"Well, I must say, I initially found this short...",negative,0
...,...,...,...
1995,This movie takes the psychological thriller to...,positive,1
1996,"If you find yourself in need of an escape, som...",positive,1
1997,"This eloquent, simple film makes a remarkably ...",positive,1
1998,What a real treat and quite unexpected. This i...,positive,1


# Topic Modeling - Latent Dirichlet Allocation

## Topic Modeling and Latent Dirichlet Allocation (LDA)

**Topic modeling** is a type of statistical modeling for discovering abstract topics that occur in a collection of documents. It is a frequently used text-mining tool for discovering hidden semantic structure in a body of text. Topic models assume that the words in each document are governed by hidden, or "latent," topics that can be inferred from those words.

**Latent Dirichlet Allocation (LDA)** is one of the most popular algorithms for topic modeling. It was introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003. LDA is a generative probabilistic model: it assumes that each document is a mixture of a small number of topics and that each word in the document is drawn from one of those topics. Each topic, in turn, emits words with certain probabilities.

**How LDA Works**

1. **Assumptions**:
    - Documents are represented as random mixtures over latent topics.
    - Each topic is characterized by a distribution over words.

2. **Process**:
    - LDA starts with a fixed number of topics. Each topic is modeled as a distribution over words, and each document is modeled as a distribution over topics.
    - Though the true topic distribution is unknown, LDA attempts to backtrack from the documents to discover what topics would create those documents in the first place.
    - LDA uses two Dirichlet distributions (hence the name): one for the topics in documents and one for the words in topics. Dirichlet distributions are used because they are conjugate to multinomial distributions, which simplifies the computation.
    
3. **Mathematical Model**:
    - Each document is assumed to be generated by first picking a distribution over topics.
    - For each word in the document, a topic is chosen from this distribution, and then the word is chosen from the corresponding distribution over words in the topic.
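The generative story above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two toy topics, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and are not part of Gensim or of this lab's data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document by following the LDA generative story."""
    vocab = sorted(topic_word[0])            # all toy topics share the same vocabulary
    theta = sample_dirichlet(alpha, rng)     # document-specific distribution over topics
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position from the document's topic mixture
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose the word from that topic's distribution over words
        weights = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical toy topics (word -> probability); not estimated from the IMDB data
toy_topics = [
    {"plot": 0.50, "actor": 0.40, "boring": 0.05, "great": 0.05},
    {"plot": 0.05, "actor": 0.05, "boring": 0.45, "great": 0.45},
]

rng = random.Random(0)
theta, toy_doc = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)
print(toy_doc)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.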

## Gensim and LDA

**Gensim** is a popular open-source library in Python designed for unsupervised semantic modeling from plain text. It is particularly useful for applications such as topic modeling and similarity detection and primarily focuses on vector space modeling and topic modeling.

Gensim’s LDA model allows users to model text data and extract topic distribution of documents and word distribution of topics. It provides an efficient implementation of the LDA algorithm that can scale to large datasets and can handle sparse data efficiently.

**Using Gensim for LDA**

Here is how you might typically use Gensim to perform LDA topic modeling:

1. **Preprocessing**:

    - Tokenization: Splitting the text into sentences and sentences into words.
    - Removing stopwords: Frequently occurring words such as 'and', 'the', etc., are removed.
    - Making bigrams or trigrams: Depending on the context, some words might need to be processed together ('New York', 'financial crisis').
    - Lemmatization: Words are reduced to their root form.

2. **Creating the Dictionary and Corpus**:

    - Gensim’s Dictionary object converts words into unique ids.
    - Texts are converted to a bag-of-words format using this dictionary (a list of (word_id, word_frequency) tuples).

3. **Running LDA**:

    - Using `gensim.models.ldamodel.LdaModel` to train the LDA model on the corpus.
    - Parameters such as the number of topics need to be specified.
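To make the bag-of-words format in step 2 concrete, here is a minimal pure-Python sketch of what a dictionary and a `(word_id, word_frequency)` corpus look like. The helpers `build_dictionary` and `doc_to_bow` are hypothetical stand-ins written for illustration; Gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign a unique integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(text, token2id):
    """Convert one tokenized document into sorted (word_id, word_frequency) tuples."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Two made-up tokenized documents
toy_texts = [["good", "movie", "good", "plot"], ["bad", "movie"]]
toy_token2id = build_dictionary(toy_texts)
toy_corpus = [doc_to_bow(text, toy_token2id) for text in toy_texts]

print(toy_token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only the count of each word per document is kept, which is all that LDA uses.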

In [9]:
#!pip install gensim --user

In [10]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

<div>
   <img src="img/lda.png" width="800">
</div>

In [11]:
dataset = dataset.drop(columns=['sentiment'])
dataset

Unnamed: 0,review,label
0,Martin Lawrence is not a funny man i Runteldat...,0
1,"As soon as it hits a screen, it destroys all i...",0
2,You have to acknowledge Cimino's contribution ...,0
3,"Knights was just a beginning of a series, a pi...",0
4,"Well, I must say, I initially found this short...",0
...,...,...
1995,This movie takes the psychological thriller to...,1
1996,"If you find yourself in need of an escape, som...",1
1997,"This eloquent, simple film makes a remarkably ...",1
1998,What a real treat and quite unexpected. This i...,1


## Remove Unnecessary Characters

In [12]:
# Convert to list
review = dataset.review.values.tolist()

# Remove all html tags
review = [re.sub("<.*?>", " ", i) for i in review]

# Remove unnecessary characters
review = [re.sub("[^A-Za-z0-9]+", " ", i) for i in review]

# Change to lower case
review = [i.lower() for i in review]

print(dataset['review'][0])
print('\n')
print(review[:1])

Martin Lawrence is not a funny man i Runteldat. He just has too much on his mind and he is too mad which trips his puns pretty early in the game. He tries to make fun of critics, which boils down to "f*** them". Then he goes on to rather primitive sexual jokes on smokers with throat cancer and it just goes downhill from there. 3/10


['martin lawrence is not a funny man i runteldat he just has too much on his mind and he is too mad which trips his puns pretty early in the game he tries to make fun of critics which boils down to f them then he goes on to rather primitive sexual jokes on smokers with throat cancer and it just goes downhill from there 3 10']


## Tokenization

In [13]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; punctuation is dropped by the tokenizer itself

review_words = list(sent_to_words(review))

print(review_words[:3])

[['martin', 'lawrence', 'is', 'not', 'funny', 'man', 'runteldat', 'he', 'just', 'has', 'too', 'much', 'on', 'his', 'mind', 'and', 'he', 'is', 'too', 'mad', 'which', 'trips', 'his', 'puns', 'pretty', 'early', 'in', 'the', 'game', 'he', 'tries', 'to', 'make', 'fun', 'of', 'critics', 'which', 'boils', 'down', 'to', 'them', 'then', 'he', 'goes', 'on', 'to', 'rather', 'primitive', 'sexual', 'jokes', 'on', 'smokers', 'with', 'throat', 'cancer', 'and', 'it', 'just', 'goes', 'downhill', 'from', 'there'], ['as', 'soon', 'as', 'it', 'hits', 'screen', 'it', 'destroys', 'all', 'intelligent', 'life', 'forms', 'around', 'but', 'on', 'behalf', 'of', 'its', 'producers', 'must', 'say', 'it', 'doesn', 'fall', 'into', 'any', 'known', 'movie', 'category', 'it', 'deserves', 'brand', 'new', 'denomination', 'of', 'its', 'own', 'it', 'neurological', 'drama', 'it', 'saddens', 'and', 'depresses', 'every', 'single', 'neuron', 'inside', 'person', 'brain', 'it', 'the', 'closest', 'thing', 'one', 'will', 'ever', 'g

## Creating Bigram and Trigram Models

<div>
   <img src="img/n-gram.png" width="700">
</div>

In [14]:
?gensim.models.Phrases

In [15]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(review_words, min_count=5, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[review_words], threshold=10)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[review_words[0]]])

['martin', 'lawrence', 'is', 'not', 'funny', 'man', 'runteldat', 'he', 'just', 'has', 'too_much', 'on', 'his', 'mind', 'and', 'he', 'is', 'too', 'mad', 'which', 'trips', 'his', 'puns', 'pretty', 'early', 'in', 'the', 'game', 'he', 'tries_to_make', 'fun', 'of', 'critics', 'which', 'boils', 'down', 'to', 'them', 'then', 'he', 'goes_on', 'to', 'rather', 'primitive', 'sexual', 'jokes', 'on', 'smokers', 'with', 'throat', 'cancer', 'and', 'it', 'just', 'goes', 'downhill', 'from', 'there']


In [16]:
print(trigram_mod[bigram_mod[review_words[9]]])

['this_film', 'is', 'just', 'as', 'bad', 'as', 'the', 'birdman', 'of', 'alcatraz', 'do_not', 'refer', 'to', 'the', 'acting', 'but', 'rather', 'the', 'premise', 'of', 'both', 'films', 'which', 'try_to', 'portray', 'psychopathic', 'criminals', 'as', 'heroic', 'figures', 'moreover', 'it', 'disturbs', 'me', 'when', 'well', 'respected', 'revered', 'actors', 'like', 'alan', 'alda', 'and', 'burt', 'lancaster', 'play', 'such', 'roles', 'because', 'their', 'status', 'tends_to', 'lend', 'credibility', 'to', 'the', 'director', 'intent', 'to', 'elevate', 'the', 'film', 'subject', 'societal', 'outcast', 'was', 'in', 'junior', 'high_school', 'during', 'the', 'last_years', 'of', 'caryl', 'chessman', 'life', 'and', 'his', 'death', 'penalty', 'appeals', 'and', 'books', 'were', 'very_much', 'in', 'the', 'news', 'remember', 'the', 'groundswell', 'of', 'opinion', 'that', 'the', 'death', 'penalty', 'was', 'wrong', 'and', 'chessman', 'was', 'the', 'victim', 'get', 'grip', 'people', 'read', 'the', 'history',

In [17]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

### ***We remove stop words before applying the n-gram model!***

In [18]:
# Remove Stop Words
review_words_nostops = remove_stopwords(review_words)

# Form Bigrams. You can try Trigrams at home!
review_words_bigrams = make_bigrams(review_words_nostops)

In [19]:
# Initialize the spaCy 'en_core_web_sm' pipeline, disabling the parser and NER components for efficiency
# Install the model first with: python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
review_lemmatized = lemmatization(review_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(review_lemmatized[:1])

[['man', 'runteldat', 'much', 'mind', 'mad', 'trip', 'pun', 'pretty', 'early', 'game', 'try', 'make', 'fun', 'critic', 'boil', 'go', 'rather', 'primitive', 'sexual', 'joke', 'smoker', 'throat', 'cancer', 'go', 'downhill']]


## Create the Dictionary and Corpus needed for Topic Modeling

In [20]:
# Create Dictionary
id2word = corpora.Dictionary(review_lemmatized)

# Create Corpus
texts = review_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1]) # Return a mapping of (word_id, word_frequency)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]]


In [21]:
id2word[123]

'contrary'

In [22]:
# Human-readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('boil', 1),
  ('cancer', 1),
  ('critic', 1),
  ('downhill', 1),
  ('early', 1),
  ('fun', 1),
  ('game', 1),
  ('go', 2),
  ('joke', 1),
  ('mad', 1),
  ('make', 1),
  ('man', 1),
  ('mind', 1),
  ('much', 1),
  ('pretty', 1),
  ('primitive', 1),
  ('pun', 1),
  ('rather', 1),
  ('runteldat', 1),
  ('sexual', 1),
  ('smoker', 1),
  ('throat', 1),
  ('trip', 1),
  ('try', 1)]]

## Train an LDA Model

In [23]:
?gensim.models.ldamodel.LdaModel

In [24]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=9, 
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [25]:
# Print the keywords in the 9 topics
lda_model.print_topics()

[(0,
  '0.007*"companion" + 0.004*"salesman" + 0.003*"corman" + 0.001*"jill" + 0.000*"babysit" + 0.000*"cloak" + 0.000*"demme" + 0.000*"blouse" + 0.000*"woodstock" + 0.000*"interval"'),
 (1,
  '0.014*"small" + 0.012*"soon" + 0.011*"catch" + 0.008*"tale" + 0.008*"discover" + 0.008*"game" + 0.007*"class" + 0.007*"wait" + 0.007*"town" + 0.007*"sport"'),
 (2,
  '0.030*"film" + 0.011*"man" + 0.010*"life" + 0.007*"story" + 0.006*"live" + 0.006*"woman" + 0.006*"performance" + 0.005*"work" + 0.005*"role" + 0.005*"family"'),
 (3,
  '0.035*"movie" + 0.019*"film" + 0.016*"see" + 0.015*"make" + 0.014*"good" + 0.012*"get" + 0.010*"great" + 0.010*"well" + 0.010*"watch" + 0.009*"show"'),
 (4,
  '0.011*"western" + 0.010*"land" + 0.010*"battle" + 0.010*"reveal" + 0.009*"fight" + 0.009*"career" + 0.007*"horse" + 0.007*"fly" + 0.007*"destroy" + 0.006*"send"'),
 (5,
  '0.031*"generation" + 0.029*"english" + 0.020*"hitchcock" + 0.018*"tragedy" + 0.007*"sera" + 0.003*"que" + 0.002*"richard_gere" + 0.001*"wh

## Evaluating LDA Models

**Perplexity** is a statistical measure used to evaluate how well a probability model predicts a sample. In the context of LDA, perplexity measures the model's performance by comparing the predicted word distributions across topics with the actual distribution of words in your documents. To use perplexity, you first estimate the LDA model for a given number of topics, then analyze how closely the model's topic distributions align with the observed data.

Although perplexity can provide valuable insights, it is not very informative on its own. Its true utility lies in comparing the perplexity scores across different models, each configured with varying numbers of topics. Generally, the model with the lowest perplexity is considered the best, as a lower score indicates a better fit to the data.

**Coherence** assesses the semantic similarity between high-scoring words within the same topic. For the `c_v` measure used below, scores typically fall between 0 and 1, with higher scores indicating more coherent topics; other coherence measures, such as `u_mass`, use different scales.

Together, these two metrics, perplexity and coherence, are essential for determining the **optimal number of topics** in a dataset. They help balance the statistical fit of the model (perplexity) with the interpretability of the topics (coherence).

In [27]:
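# Compute the per-word log-likelihood bound; log_perplexity() returns this bound, not the perplexity itself (perplexity = 2 ** (-bound))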
print('Perplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=review_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Perplexity:  -8.555970743205263

Coherence Score:  0.45678944834664126
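The negative "Perplexity" value above is the per-word likelihood bound that `log_perplexity` returns, not a perplexity. Gensim's own log messages report the perplexity estimate as 2 raised to the negative of that bound, so the conversion can be sketched as follows (the variable names are illustrative):

```python
# Per-word likelihood bound printed above by lda_model.log_perplexity(corpus)
per_word_bound = -8.555970743205263

# Perplexity estimate: 2 ** (-bound); a lower perplexity means a better fit
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

When comparing models with different numbers of topics, a bound closer to zero corresponds to a lower perplexity.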


## Interactive Visualization

In [28]:
#!pip install pyLDAvis --user

In [29]:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis

In [30]:
list(enumerate(lda_model[corpus]))[0][1][0]

[(0, 0.018127156), (6, 0.91621286), (7, 0.045854088), (8, 0.013718232)]
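The output above is a list of `(topic_id, probability)` pairs for the first document. Its dominant topic is simply the pair with the largest probability, which can be picked with `max` instead of a full sort. A small sketch using the values printed above:

```python
# Document-topic pairs copied from the output above: (topic_id, probability)
doc_topics = [(0, 0.018127156), (6, 0.91621286), (7, 0.045854088), (8, 0.013718232)]

# The dominant topic is the pair with the highest probability
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, dominant_prob)
```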

In [31]:
topic = []

for i, row_list in enumerate(lda_model[corpus]):
    row = sorted(row_list[0], key=lambda x: (x[1]), reverse=True)
    topic.append(row[0][0])
    
dataset['topic'] = topic

In [32]:
dataset

Unnamed: 0,review,label,topic
0,To quote Clark Griswold (in the original Chris...,0,6
1,"Steven buddy, you remember when you said this:...",0,6
2,"""Maximum Risk"" is a step sideways for Van Damm...",0,6
3,I am a Catholic taught in parochial elementary...,0,6
4,To call this anything at all would be an insul...,0,6
...,...,...,...
1995,In September 2003 36-year-old Jonny Kennedy di...,1,6
1996,Great movie in a Trainspotting style... Being ...,1,6
1997,(originally a response to a movie reviewer who...,1,6
1998,Ernest P. Worrell comes through with his third...,1,6


In [33]:
set(list(dataset.topic))

{6}

# Word2vec

In the second part of this lab, we will explore how to use the word embedding model for sentiment analysis.

<div>
   <img src="img/word2vec.png" width="800">
</div>

## Popular Pre-trained Word Vectors

In [34]:
import gensim.downloader

list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [35]:
w2v = gensim.downloader.load('glove-wiki-gigaword-50')



In [37]:
w2v.most_similar('machine')

[('machines', 0.8238813877105713),
 ('device', 0.8175780773162842),
 ('using', 0.7789539694786072),
 ('gun', 0.7508804798126221),
 ('used', 0.7492657899856567),
 ('devices', 0.7369279265403748),
 ('uses', 0.7261233329772949),
 ('portable', 0.7247284650802612),
 ('automatic', 0.722038209438324),
 ('drives', 0.7156105041503906)]
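The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea with plain Python; the three-dimensional vectors are made up for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real GloVe vectors here have 50 dimensions
toy_vectors = {
    "machine": [0.9, 0.1, 0.3],
    "device": [0.8, 0.2, 0.4],
    "banana": [-0.2, 0.9, -0.5],
}

query = "machine"
ranked = sorted(
    ((w, cosine_similarity(toy_vectors[query], vec))
     for w, vec in toy_vectors.items() if w != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Words whose vectors point in nearly the same direction as the query score close to 1, while unrelated words score near 0 or below.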

## Movie Review Sentiment Analysis Using Pre-Trained Word Vectors

In [38]:
dataset

Unnamed: 0,review,label,topic
0,To quote Clark Griswold (in the original Chris...,0,6
1,"Steven buddy, you remember when you said this:...",0,6
2,"""Maximum Risk"" is a step sideways for Van Damm...",0,6
3,I am a Catholic taught in parochial elementary...,0,6
4,To call this anything at all would be an insul...,0,6
...,...,...,...
1995,In September 2003 36-year-old Jonny Kennedy di...,1,6
1996,Great movie in a Trainspotting style... Being ...,1,6
1997,(originally a response to a movie reviewer who...,1,6
1998,Ernest P. Worrell comes through with his third...,1,6


In [39]:
from sklearn.model_selection import train_test_split

reviews = dataset['review'].values
y = dataset['label'].values

reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, y, test_size=0.25, random_state=1000)

### Create Benchmarks

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.5, min_df = 0.02)
vectorizer.fit(reviews_train)

X_train = vectorizer.transform(reviews_train).toarray()
X_test = vectorizer.transform(reviews_test).toarray()

In [41]:
print(X_train.shape)
print(X_test.shape)

(1500, 814)
(500, 814)


### Logistic Regression

In [42]:
from sklearn.linear_model import LogisticRegression

Model_lr = LogisticRegression()
Model_lr.fit(X_train, y_train)
score_lr = Model_lr.score(X_test, y_test)

print("Testing accuracy: {:.4f}".format(score_lr))

Testing accuracy: 0.8060


### Multilayer Perceptron

In [44]:
from keras.api.models import Sequential
from keras.api.layers import Dense
from keras.api.backend import clear_session


text_dim = X_train.shape[1]

Model_mlp = Sequential()
Model_mlp.add(Dense(100, input_dim = text_dim, activation = 'relu'))
Model_mlp.add(Dense(10, activation = 'relu'))
Model_mlp.add(Dense(1, activation = 'sigmoid'))

Model_mlp.summary()  # summary() prints the table itself and returns None, so print() is not needed


In [45]:
Model_mlp.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

clear_session()

Result_mlp = Model_mlp.fit(X_train, y_train, batch_size=10, epochs=100, verbose=False, validation_data=(X_test, y_test))

loss, accuracy = Model_mlp.evaluate(X_train, y_train, verbose=False)
print("Training accuracy: {:.4f}".format(accuracy))
loss, accuracy = Model_mlp.evaluate(X_test, y_test, verbose=False)
print("Testing accuracy:  {:.4f}".format(accuracy))


Training accuracy: 1.0000
Testing accuracy:  0.7820


### Word Embedding Model

We need to tokenize the text data into a format that can be used by the word embeddings. To do so, we:
1. Create a vocabulary based on the training movie reviews.
2. Convert all training and testing movie reviews into lists of integers. Each integer is the index of a word in the vocabulary dictionary.

We can use the parameter ***num_words*** to control the size of the vocabulary. Unknown words are ignored by default; alternatively, you can use the parameter ***oov_token*** to map all unknown words to the same index.
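The two steps above can be sketched without Keras to show what the integer sequences represent. This is a simplified illustration, not the Keras implementation: `fit_word_index`, `texts_to_sequences`, and the `oov_index` argument are hypothetical names, and the real `Tokenizer` also strips punctuation and handles `oov_token` differently.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent word, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None, oov_index=None):
    """Map each text to a list of integers, skipping or replacing unknown words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
            elif oov_index is not None:
                seq.append(oov_index)   # unknown or out-of-range word
        sequences.append(seq)
    return sequences

# Made-up training texts
toy_train = ["a good movie", "a bad movie", "a good plot"]
toy_word_index = fit_word_index(toy_train)
print(toy_word_index)

# "film" was never seen during fitting, so it is dropped by default
print(texts_to_sequences(["a good film"], toy_word_index))
# With a hypothetical out-of-vocabulary index, it is replaced instead
print(texts_to_sequences(["a good film"], toy_word_index, oov_index=len(toy_word_index) + 1))
```

Because the vocabulary is fitted on the training reviews only, words that appear solely in the test reviews are treated as unknown.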

<div>
   <img src="img/procedure.png" width="600">
</div>

In [1]:
from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(reviews_train)

X_train = tokenizer.texts_to_sequences(reviews_train)
X_test = tokenizer.texts_to_sequences(reviews_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print(vocab_size)
print(reviews_train[2])
print(X_train[2])


In [43]:
for word in ['deep', 'learning', 'business', 'analysis']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

deep: 932
learning: 3630
business: 755
analysis: 8103


One issue we face is that documents vary in length. To address this, we will pad the sequence of words with zeros.

Additionally, we will specify a parameter, **maxlen**, to truncate longer texts. This is necessary to enhance training efficiency.
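A minimal sketch of post-padding with truncation is shown below. The helper `pad_post` is hypothetical; it mimics `pad_sequences(padding='post')` and assumes the Keras default `truncating='pre'`, under which sequences longer than `maxlen` lose tokens from the beginning and keep their last `maxlen` tokens.

```python
def pad_post(sequences, maxlen, truncating="pre"):
    """Pad each sequence with trailing zeros and truncate it to maxlen."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start; 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        padded.append(list(seq) + [0] * (maxlen - len(seq)))
    return padded

toy_padded = pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(toy_padded)
```

Every row now has the same length, so the reviews can be stacked into a single integer matrix for the network.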

In [44]:
from keras_preprocessing.sequence import pad_sequences

maxlen = 100

X_train = pad_sequences(X_train, padding = 'post', maxlen = maxlen)
X_test = pad_sequences(X_test, padding = 'post', maxlen = maxlen)

In [45]:
print(X_train[0, :])

[   6  468   28    4  111  106    9   39   47   75  280  302    2    3
  167    4 1066   35    9  193   34    3  692  159  917    1   19    6
 4387 1720  145    2   10  100  172   74    4    1  229   38 2500    2
 1982  331  733    2    3   51  715  647   92    9    3 1469   19    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


In [46]:
print(X_train[1, :])

[ 517  217   11    6    3 4392  271   19    1  150  549   14    1    2
 1098 2503   14    1  261 1525   25   29  365    1  229    6    2  234
    2    1  366  311 1161  169  347  439   42   59  118   91   26   11
   97  475 3164 1722    3  734    5 1983   12   36   25 2352    2   66
  204   35   80 1984    5    1  820   33   25 1471   42   21  186    5
  116    3   19    5  120  271   12    6 3165    8   51 1161   17    6
   84  569  418  164   11   18    9    6    5   24  288 1132   60   14
   67   25]


### Keras Embedding Layer

We can use the built-in Keras embedding layer for the word embedding model. The following parameters are needed:

1. **input_dim**: the size of the vocabulary
2. **output_dim**: the size of the dense vector
3. **input_length**: the length of the text sequence
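Conceptually, an embedding layer is a lookup table with one trainable row per vocabulary index. The sketch below imitates the lookup and the subsequent flattening with plain lists; the small sizes and random initial values are hypothetical and chosen only to keep the output readable.

```python
import random

toy_vocab_size, toy_embedding_dim, toy_maxlen = 6, 3, 4   # hypothetical small sizes
rng = random.Random(0)

# One row of weights per vocabulary index (input_dim rows, output_dim columns)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
                   for _ in range(toy_vocab_size)]

padded_sequence = [2, 5, 1, 0]                             # one padded review of length toy_maxlen
embedded = [embedding_table[i] for i in padded_sequence]   # shape (toy_maxlen, toy_embedding_dim)
flattened = [x for row in embedded for x in row]           # what Flatten passes to the Dense layer

embedding_params = toy_vocab_size * toy_embedding_dim      # number of trainable embedding weights
print(len(embedded), len(embedded[0]), len(flattened), embedding_params)
```

With `maxlen = 100` and `embedding_dim = 50`, the same reasoning gives a flattened vector of 100 × 50 = 5000 values per review.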

In [47]:
from keras.api.layers import Embedding
from keras.api.layers import Flatten

embedding_dim = 50

model_em1 = Sequential()
model_em1.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
model_em1.add(Flatten())
model_em1.add(Dense(10, activation='relu'))
model_em1.add(Dense(1, activation='sigmoid'))

model_em1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 50)           1204100   
                                                                 
 flatten (Flatten)           (None, 5000)              0         
                                                                 
 dense (Dense)               (None, 10)                50010     
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
Total params: 1,254,121
Trainable params: 1,254,121
Non-trainable params: 0
_________________________________________________________________


In [48]:
print(23269*50)
print(vocab_size * embedding_dim)

1163450
1204100


In [49]:
model_em1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

clear_session()

Result_em1 = model_em1.fit(X_train, y_train, batch_size=10, epochs=50, verbose=False, validation_data=(X_test, y_test))

loss, accuracy = model_em1.evaluate(X_train, y_train, verbose=False)
print("Training accuracy: {:.4f}".format(accuracy))
loss, accuracy = model_em1.evaluate(X_test, y_test, verbose=False)
print("Testing accuracy:  {:.4f}".format(accuracy))

Training accuracy: 1.0000
Testing accuracy:  0.7200


### Pooling Layer

In [50]:
from keras.api.layers import GlobalMaxPool1D

embedding_dim = 50

model_em2 = Sequential()
model_em2.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
model_em2.add(GlobalMaxPool1D())
model_em2.add(Dense(10, activation='relu'))
model_em2.add(Dense(1, activation='sigmoid'))

model_em2.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 50)           1204100   
                                                                 
 global_max_pooling1d (Globa  (None, 50)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 10)                510       
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
Total params: 1,204,621
Trainable params: 1,204,621
Non-trainable params: 0
_________________________________________________________________
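Global max pooling replaces `Flatten` by keeping, for each embedding dimension, only the largest value across all word positions. That is why the output shape drops from `(None, 100, 50)` to `(None, 50)` and the first Dense layer needs only 50 × 10 + 10 = 510 parameters. A minimal sketch with made-up numbers (`global_max_pool_1d` is an illustrative helper, not the Keras layer):

```python
def global_max_pool_1d(embedded):
    """Take the maximum over the sequence axis for each embedding dimension."""
    return [max(column) for column in zip(*embedded)]

# Hypothetical embedded review: 4 word positions, 3 embedding dimensions
toy_embedded = [
    [0.1, -0.3, 0.5],
    [0.4, 0.2, -0.1],
    [-0.2, 0.0, 0.3],
    [0.0, 0.0, 0.0],    # a padded position
]
pooled = global_max_pool_1d(toy_embedded)
print(pooled)

# Parameter count of the Dense(10) layer that follows a 50-dimensional pooled vector
dense_params = 50 * 10 + 10
print(dense_params)
```

The pooled vector has a fixed length regardless of the review length, and the model has far fewer Dense weights to fit than the flattened version.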


In [51]:
model_em2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Result_em2 = model_em2.fit(X_train, y_train, batch_size=10, epochs=50, verbose=False, validation_data=(X_test, y_test))

loss, accuracy = model_em2.evaluate(X_train, y_train, verbose=False)
print("Training accuracy: {:.4f}".format(accuracy))
loss, accuracy = model_em2.evaluate(X_test, y_test, verbose=False)
print("Testing accuracy:  {:.4f}".format(accuracy))

Training accuracy: 1.0000
Testing accuracy:  0.7860


### Pre-trained Word Vectors

In the previous example, we learned word embeddings concurrently with the deep learning weights that help predict review sentiment. An alternative approach involves utilizing pre-trained word vectors, such as **word2vec** and **GloVe**.

We will demonstrate this concept using GloVe. You can access the pre-trained word vectors by following this link: https://nlp.stanford.edu/projects/glove/. Specifically, we will use the **glove.6B.50d.txt** version, which is pre-trained on Wikipedia 2014 and Gigaword 5 and features 50-dimensional vectors.

Since we do not need every pre-trained word vector in **GloVe**, we first build a word embedding matrix that contains only the words in our movie review dataset. Words that do not appear in the GloVe file keep an all-zero row.

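A note on the `*` in front of a variable, as in `word, *vector = line.split()`:
* In an assignment, `*name` collects all remaining items into a list, so `word` receives the first token and `vector` receives the rest.
* In a function call, `*` unpacks a list or tuple into positional arguments.
* In a function call, `**` unpacks a dictionary into keyword arguments.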
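
The unpacking rules above can be illustrated with a short, self-contained sketch. The sample line and the `scaled_sum` helper are made up for illustration; only the file layout (a word followed by its vector components) mirrors the GloVe text format.

```python
# A made-up line in the GloVe text layout: a token followed by its vector components.
line = "movie 0.5 -0.25 1.0"

# Starred assignment: the first token goes to `word`, the rest are collected into a list.
word, *vector = line.split()
print(word)    # movie
print(vector)  # ['0.5', '-0.25', '1.0']


# Hypothetical helper, used only to demonstrate unpacking in a function call.
def scaled_sum(a, b, c, scale=1.0):
    return scale * (a + b + c)


values = [float(v) for v in vector]
options = {"scale": 2.0}

# `*values` supplies a, b, c positionally; `**options` supplies scale as a keyword argument.
total = scaled_sum(*values, **options)
print(total)   # 2.5
```

Note that the components arrive as strings, which is why `create_embedding_matrix` converts them with `np.array(vector, dtype=np.float32)`.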

In [52]:
def create_embedding_matrix(filepath, word_index, embedding_dim):
    """Build a (vocab_size, embedding_dim) matrix of pre-trained vectors for the words in word_index."""
    vocab_size = len(word_index) + 1  # Index 0 is reserved for sequence padding
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding='utf-8') as f:
        for line in f:
            # Each line holds a word followed by its vector components
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    # Words missing from the file keep an all-zero row
    return embedding_matrix

In [53]:
embedding_dim = 50

embedding_matrix = create_embedding_matrix('glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

In [54]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.26818001,  0.14346001, -0.27877   , ..., -0.63209999,
        -0.25027999, -0.38097   ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.07190001, -0.4815    , -0.59113002, ...,  0.75568002,
         0.1908    ,  0.79163998]])
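
The all-zero rows in the output belong to index 0 (padding) and to words that were not found in the GloVe file. One way to quantify this is to measure the fraction of vocabulary rows that received a vector. The sketch below is only an illustration: `embedding_coverage` is a hypothetical helper, and the toy matrix stands in for the real `embedding_matrix`.

```python
import numpy as np


def embedding_coverage(embedding_matrix):
    """Fraction of vocabulary rows (excluding padding row 0) that hold a non-zero vector."""
    rows = embedding_matrix[1:]              # skip the padding row
    has_vector = np.any(rows != 0, axis=1)   # True where at least one component is non-zero
    return has_vector.mean()


# Toy matrix: row 0 is padding, row 2 is a word with no pre-trained vector.
toy_matrix = np.array([
    [0.0, 0.0, 0.0],
    [0.4, 0.2, -0.4],
    [0.0, 0.0, 0.0],
    [1.0, -0.5, 0.8],
])

coverage = embedding_coverage(toy_matrix)
print("Coverage: {:.4f}".format(coverage))  # Coverage: 0.6667
```

Calling `embedding_coverage(embedding_matrix)` on the matrix built above would report how much of the review vocabulary is covered by GloVe. A low value would suggest that many words start from zero vectors.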

### Fixed Embedding Weights

Setting `trainable=False` freezes the embedding layer: the GloVe vectors stay fixed, and only the dense layers are trained.

In [55]:
model_em3 = Sequential()
model_em3.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                        weights=[embedding_matrix], input_length=maxlen, 
                        trainable=False))
model_em3.add(GlobalMaxPool1D())
model_em3.add(Dense(10, activation='relu'))
model_em3.add(Dense(1, activation='sigmoid'))

model_em3.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 50)           1204100   
                                                                 
 global_max_pooling1d_1 (Glo  (None, 50)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_2 (Dense)             (None, 10)                510       
                                                                 
 dense_3 (Dense)             (None, 1)                 11        
                                                                 
Total params: 1,204,621
Trainable params: 521
Non-trainable params: 1,204,100
_________________________________________________________________
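
The parameter counts in the summary can be reproduced by hand, which also shows why only 521 parameters remain trainable once the embedding layer is frozen. This is a small arithmetic sketch. The vocabulary size of 24,082 is not stated directly in this section; it is inferred here from the embedding parameter count divided by the embedding dimension.

```python
embedding_dim = 50
embedding_params = 1_204_100                       # from the summary above
vocab_size = embedding_params // embedding_dim     # inferred: 24082 rows, including padding row 0

hidden_units = 10
# A Dense layer has (inputs * units) weights plus one bias per unit.
dense_params = embedding_dim * hidden_units + hidden_units   # 50 * 10 + 10 = 510
output_params = hidden_units * 1 + 1                         # 10 * 1 + 1 = 11

# GlobalMaxPool1D has no weights, so it contributes 0 parameters.
trainable_params = dense_params + output_params              # 521 when the embedding is frozen
total_params = embedding_params + trainable_params           # 1,204,621

print(vocab_size, dense_params, output_params, trainable_params, total_params)
# 24082 510 11 521 1204621
```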


In [56]:
model_em3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

clear_session()  # Resets Keras global state; this is normally called before a model is built

Result_em3 = model_em3.fit(X_train, y_train, batch_size=10, epochs=50, verbose=False, validation_data=(X_test, y_test))

loss, accuracy = model_em3.evaluate(X_train, y_train, verbose=False)
print("Training accuracy: {:.4f}".format(accuracy))
loss, accuracy = model_em3.evaluate(X_test, y_test, verbose=False)
print("Testing accuracy:  {:.4f}".format(accuracy))

Training accuracy: 0.6940
Testing accuracy:  0.7020


### Trained Embedding Weights

With `trainable=True`, the GloVe vectors only initialize the embedding layer and are updated during training along with the other weights.

In [57]:
clear_session()  # Reset Keras global state before building a new model

model_em4 = Sequential()
model_em4.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                        weights=[embedding_matrix], input_length=maxlen, 
                        trainable=True))
model_em4.add(GlobalMaxPool1D())
model_em4.add(Dense(10, activation='relu'))
model_em4.add(Dense(1, activation='sigmoid'))

model_em4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Result_em4 = model_em4.fit(X_train, y_train, batch_size=10, epochs=50, verbose=False, validation_data=(X_test, y_test))

loss, accuracy = model_em4.evaluate(X_train, y_train, verbose=False)
print("Training accuracy: {:.4f}".format(accuracy))
loss, accuracy = model_em4.evaluate(X_test, y_test, verbose=False)
print("Testing accuracy:  {:.4f}".format(accuracy))

Training accuracy: 1.0000
Testing accuracy:  0.7340
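
To compare the three embedding strategies side by side, the accuracies printed above can be collected into a small table. The sketch below only re-tabulates those reported numbers. The descriptive labels and the "gap" column (training minus testing accuracy, a rough indicator of overfitting) are additions for illustration.

```python
# (training accuracy, testing accuracy) as printed in the cells above
results = {
    "model_em2 (embeddings learned from scratch)": (1.0000, 0.7860),
    "model_em3 (GloVe, frozen)": (0.6940, 0.7020),
    "model_em4 (GloVe, fine-tuned)": (1.0000, 0.7340),
}

# A large positive gap means the model fits the training set much better than the test set.
gaps = {name: train - test for name, (train, test) in results.items()}

for name, (train, test) in results.items():
    print("{:<45} train={:.4f}  test={:.4f}  gap={:+.4f}".format(name, train, test, gaps[name]))

best = max(results, key=lambda name: results[name][1])
print("Highest testing accuracy:", best)
```

In this run, the two models with trainable embeddings fit the training set perfectly but show a gap of more than 0.2, while the frozen GloVe model shows almost no gap but a lower testing accuracy.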
