# NLP-Topic-Modeling

Welcome to a new NLP project!

In this project, we are going to cover topic modeling, or the unsupervised discovery of topics present in a corpus of text. There are many different algorithms available to do this, and we will cover five of them:
- Latent Dirichlet Allocation (LDA) topic modeling with sklearn
- LDA topic modeling with gensim
- NMF topic modeling
- K-means with Bidirectional Encoder Representations from Transformers (BERT) embeddings
- Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) for topic modeling of short texts.

## Table of Contents
- [1 - Set up the working directory & Import packages](#1)
- [2 - Load the dataset](#2)
- [3 - Preprocess the dataset](#3)
- [4 - Build the model](#4)
    - [4.1 - Define the model structure](#4-1)
    - [4.2 - Train the top layer](#4-2)
    - [4.3 - Fine-tuning](#4-3)
- [5 - Save the model](#5)


<a name='1'></a>
## 1 - Set up the working directory & Import packages ##

In [21]:
# Get the running time of each cell
# (similar to the ExecuteTime extension for Jupyter Notebook)
!pip install ipython-autotime
%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 191 µs (started: 2021-09-18 04:16:30 +00:00)


In [1]:
# Move to the working directory on Google Drive when running on Google Colab
import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  PROJECT_ROOT ="/content/drive/MyDrive/GitHub/NLP-Topic-Modeling"
else:
  PROJECT_ROOT ="."
os.chdir(PROJECT_ROOT)
!pwd

Running on CoLab
/content/drive/MyDrive/GitHub/NLP-Topic-Modeling


In [2]:
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA


In [3]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('talk')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
params = {'legend.fontsize': 11,
          'figure.figsize': (10, 5),
          'axes.labelsize': 11,
          'axes.titlesize':11,
          'xtick.labelsize':11,
          'ytick.labelsize':11}
plt.rcParams.update(params)

<a name='2'></a>
## 2 - Load the dataset ##


### Get the stopwords

In [5]:
import csv
from nltk.stem.snowball import SnowballStemmer


def read_in_csv(csv_file):
    with open(csv_file, 'r', encoding='utf-8') as fp:
        reader = csv.reader(fp, delimiter=',', quotechar='"')
        data_read = [row for row in reader]
    return data_read


def get_stopwords(path):
    stemmer = SnowballStemmer('english')
    stopwords = read_in_csv(path)
    stopwords = [word[0] for word in stopwords]
    stemmed_stopwords = [stemmer.stem(word) for word in stopwords]
    stopwords = stopwords + stemmed_stopwords
    return stopwords

stopwords_file_path = "datasets/stopwords.csv"
stopwords = get_stopwords(stopwords_file_path)
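
To make the file format explicit: `read_in_csv` returns one list per CSV row, and `get_stopwords` keeps only the first field of each row. The sketch below reproduces that step on a small in-memory CSV (the three sample rows are made up, and the stemming step is left out), then builds a `set`, which should make membership checks such as `t not in stopwords` cheaper than scanning a list.

```python
import csv
import io

# Hypothetical stand-in for datasets/stopwords.csv: one stopword per row.
sample_csv = 'a\nabout\n"can\'t"\n'

reader = csv.reader(io.StringIO(sample_csv), delimiter=',', quotechar='"')
rows = [row for row in reader]            # each row is a list of fields
words = [row[0] for row in rows if row]   # keep the first field, skip empty rows

# A set gives constant-time membership tests during tokenization.
stopword_set = set(words)

print(rows)
print("about" in stopword_set)
```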

In [23]:
stopwords 

["'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'able',
 'about',
 'above',
 'accordance',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterward',
 'afterwards',
 'again',
 'against',
 'ago',
 'ah',
 'all',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anymore',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'are',
 'aren',
 "aren'",
 'arent',
 'around',
 'as',
 'aside',
 'at',
 'away',
 'be',
 'because',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'but',
 'by',
 'ca',
 'can',
 "can'",
 "can't",
 'cannot',
 'cause',
 'co',
 'com',
 'could',
 'couldn',
 "couldn'",
 'couldnt',
 'day',
 'days',
 'despite',
 'did',
 'didn',
 "didn'",
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn'",
 "doesn't",
 'doing',
 'don',
 "don't",
 'done',
 'dont',
 'down',
 'downwards',
 'during',
 'each',
 'ed',
 'edu

time: 16.5 ms (started: 2021-09-18 04:18:32 +00:00)


### Load the BBC dataset into a Pandas dataframe

In [7]:
bbc_dataset = "datasets/bbc-text.csv"
df = pd.read_csv(bbc_dataset)
df.head()

|   | category | text |
|---|----------|------|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [11]:
df.category.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [8]:
def clean_data(df):
    df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
    df['text'] = df['text'].apply(lambda x: re.sub(r'\d', '', x))
    return df

df = clean_data(df)
documents = df['text']
df.head()

|   | category | text |
|---|----------|------|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
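
To see what the two substitutions in `clean_data` do, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not taken from the dataset). Punctuation is replaced by spaces and digits are deleted, so fragments such as a lone `s` or `m` can survive as separate tokens.

```python
import re

sample = "ocean's twelve took $40.8m"

# Step 1: every character that is neither a word character nor whitespace becomes a space.
no_punct = re.sub(r'[^\w\s]', ' ', sample)

# Step 2: every digit is deleted (replaced by the empty string).
no_digits = re.sub(r'\d', '', no_punct)

print(repr(no_punct))   # 'ocean s twelve took  40 8m'
print(repr(no_digits))  # 'ocean s twelve took   m'
```

The leftover single letters are one plausible source of tokens like `m` in the topic word lists further down.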


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string
import re
import nltk
nltk.download('punkt')

stemmer = SnowballStemmer('english')
# Set membership is much faster than scanning the stopword list for every token
stopword_set = set(stopwords)

def tokenize_and_stem(sentence):
    tokens = nltk.word_tokenize(sentence)
    # Keep tokens that are not stopwords, are not punctuation, and contain at least one letter
    filtered_tokens = [t for t in tokens
                       if t not in stopword_set
                       and t not in string.punctuation
                       and re.search('[a-zA-Z]', t)]
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def create_tf_idf_vectorizer(documents):
    tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords,
                                       tokenizer=tokenize_and_stem,
                                       max_df=0.95,
                                       max_features=20000,
                                       use_idf=True)
    data = tfidf_vectorizer.fit_transform(documents)
    return (tfidf_vectorizer, data)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [17]:
(tfidf_vectorizer, data) = create_tf_idf_vectorizer(documents)
data

<2225x18650 sparse matrix of type '<class 'numpy.float64'>'
	with 284079 stored elements in Compressed Sparse Row format>

In [18]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

def create_and_fit_lda(data, num_topics):
    lda = LDA(n_components=num_topics, n_jobs=-1)
    lda.fit(data)
    return lda

number_topics = 5
lda = create_and_fit_lda(data, number_topics)
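
scikit-learn's LDA is usually described as a model over word counts, while TF-IDF weights are more commonly paired with NMF. As a hedged point of comparison (not the pipeline used in this notebook), the sketch below fits LDA on raw counts from `CountVectorizer` and fixes `random_state` so that repeated runs give the same topics. The four toy documents and all variable names are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two sport-like and two business-like documents.
toy_docs = [
    "the team won the match after a late goal",
    "the striker scored a goal in the cup final",
    "the bank raised interest rates again",
    "shares fell as the bank reported losses",
]

# Raw term counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(toy_docs)

# random_state makes the fitted topics reproducible across runs.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)           # (number of documents, number of topics)
print(toy_lda.components_.shape)  # (number of topics, vocabulary size)
```

Each row of `doc_topics` is a distribution over topics for one document, so comparing it with the `category` column is one way to sanity-check the fitted topics.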

In [20]:
def get_most_common_words_for_topics(model, vectorizer, n_top_words):
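    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when available
    if hasattr(vectorizer, "get_feature_names_out"):
        words = vectorizer.get_feature_names_out()
    else:
        words = vectorizer.get_feature_names()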
    word_dict = {}
    for topic_index, topic in enumerate(model.components_):
        this_topic_words = [words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        word_dict[topic_index] = this_topic_words
    return word_dict

def print_topic_words(word_dict):
    for key in word_dict.keys():
        print(f"Topic {key}")
        print("\t", word_dict[key])

topic_words = get_most_common_words_for_topics(lda, tfidf_vectorizer,20)
print_topic_words(topic_words)

Topic 0
	 ['film', 'm', 'year', 'best', 'award', 'game', 'star', 'play', 'win', 'show', 'last', 'time', 'sale', 'first', 'music', 'won', 'world', 'top', 'new', 'rate']
Topic 1
	 ['peopl', 'govern', 'elect', 'use', 'labour', 'say', 'parti', 'bn', 'firm', 'compani', 'year', 'new', 'blair', 'servic', 'minist', 'mobil', 'tax', 'plan', 'tori', 'phone']
Topic 2
	 ['england', 'wale', 'o', 'ireland', 'match', 'rugbi', 'win', 'play', 'seed', 'injuri', 'game', 'robinson', 'franc', 'coach', 'open', 'six', 'player', 'final', 'william', 'half']
Topic 3
	 ['printer', 'cartridg', 'ssl', 'nestl', 'wpp', 'metlif', 'elgindi', 'curbishley', 'pernod', 'carniv', 'inkjet', 'sakhnin', 'domecq', 'bnei', 'murambadoro', 'electrolux', 'jonatan', 'coltran', 'aurora', 'condom']
Topic 4
	 ['commodor', 'qanta', 'mido', 'melcher', 'newri', 'scoggin', 'mukesh', 'tulu', 'yili', 'ambani', 'ead', 'winn', 'dixi', 'camus', 'anil', 'forgeard', 'meldrum', 'hillbilli', 'turkmen', 'turkmenistan']
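
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `get_most_common_words_for_topics` is compact but easy to misread. A minimal sketch with a made-up weight vector and vocabulary (both are assumptions) shows that it walks the ascending `argsort` result backwards and keeps the indices of the `n_top_words` largest weights:

```python
import numpy as np

# Hypothetical weights of one topic over a five-word vocabulary.
words = ["bank", "film", "goal", "award", "match"]
topic = np.array([0.1, 2.5, 0.3, 1.7, 0.9])
n_top_words = 3

ascending = topic.argsort()                # indices from smallest to largest weight
top_idx = ascending[:-n_top_words - 1:-1]  # last n_top_words indices, in reverse order

top_words = [words[i] for i in top_idx]
print(top_words)  # ['film', 'award', 'match']

# The same indices, written as a descending sort followed by a plain slice.
same_idx = np.argsort(-topic)[:n_top_words]
```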


In [22]:
import spacy
nlp = spacy.load("en_core_web_sm")

docs = ["We've been running all day.", "Let's be better."]

for doc in nlp.pipe(docs, batch_size=32, n_process=3, disable=["parser", "ner"]):
    print([tok.lemma_ for tok in doc])

['-PRON-', 'have', 'be', 'run', 'all', 'day', '.']
['let', '-PRON-', 'be', 'well', '.']
time: 1.76 s (started: 2021-09-18 04:17:03 +00:00)


In [24]:
df['text'][0]

'tv future in the hands of viewers with home theatre systems  plasma high definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time   that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes  with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices   one of the most talked about technologies of ces has been digital and personal video recorders  dvr and pvr   these set top boxes  like the us s tivo and the uk s sky  system  allow people to record  store  play  pause and forward wind tv programmes when they want   essentially  the technology allows for much more personalised tv  they are also being built in to high

time: 5.54 ms (started: 2021-09-18 04:33:49 +00:00)


<a name='3'></a>
## 3 - Preprocess the dataset ##

In general, it's good practice to develop models that take raw data as input, as opposed to models that take already-preprocessed data. The reason is that, if the model expects preprocessed data, any time we export the model to use it elsewhere (in a web browser, in a mobile app), we'll need to reimplement the exact same preprocessing pipeline, which gets very tricky. So we should do as little preprocessing as possible before the data reaches the model.

Here, we'll do image resizing in the data pipeline (because a deep neural network can only process contiguous batches of data), and we'll do the input value scaling as part of the model, when we create it.

### Resize the images to 150x150


In [None]:
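import tensorflow as tf

# train_ds, validation_ds and test_ds are assumed to be tf.data.Dataset objects
# yielding (image, label) pairs; they are not created in this notebook excerpt.
size = (150, 150)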
train_ds = train_ds.map(lambda x, y: (tf.image.resize(x, size), y))
validation_ds = validation_ds.map(lambda x, y: (tf.image.resize(x, size), y))
test_ds = test_ds.map(lambda x, y: (tf.image.resize(x, size), y))

Besides, let's batch the data and use caching & prefetching to optimize loading speed.

In [None]:
batch_size = 32
train_ds = train_ds.cache().batch(batch_size).prefetch(buffer_size=10)
validation_ds = validation_ds.cache().batch(batch_size).prefetch(buffer_size=10)
test_ds = test_ds.cache().batch(batch_size).prefetch(buffer_size=10)

<a name='4'></a>
## 4 - Build the model ## 


<a name='4-1'></a>
### 4.1 - Define the model structure

**Note that**:
- We add a `Rescaling` layer to scale input values (initially in the [0, 255] range) to the [-1, 1] range.
- We add image augmentation layers (`RandomFlip` and `RandomRotation`) to expose the model to different aspects of the training data and slow down overfitting.
- We add a `Dropout` layer before the classification layer, for regularization.
- We pass `training=False` when calling the base model so that it runs in inference mode; this way, batchnorm statistics don't get updated even after we unfreeze the base model for fine-tuning.
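
As a quick check of the first note, the `Rescaling` layer computes `(inputs * scale) + offset`. The plain-Python sketch below (the helper name `rescale` is made up) applies the same formula with `scale=1 / 127.5` and `offset=-1` to the two ends and the midpoint of the [0, 255] range:

```python
def rescale(pixel, scale=1 / 127.5, offset=-1):
    """Apply the same formula as the Rescaling layer: (input * scale) + offset."""
    return pixel * scale + offset

# 0 maps to about -1, 127.5 to about 0, and 255 to about +1.
for pixel in (0, 127.5, 255):
    print(pixel, "->", round(rescale(pixel), 6))
```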

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Create the base_model
base_model = keras.applications.Xception(
    weights="imagenet",  # Load weights pre-trained on ImageNet.
    input_shape=(150, 150, 3),
    include_top=False,  # Do not include the ImageNet classifier at the top.
    )  

# Freeze the base_model
base_model.trainable = False

# Create new model on top
inputs = keras.Input(shape=(150, 150, 3))

# Apply random data augmentation
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
    ]
)
x = data_augmentation(inputs)  

# Pre-trained Xception weights require that input be scaled
# from (0, 255) to a range of (-1., +1.), the rescaling layer
# outputs: `(inputs * scale) + offset`
scale_layer = keras.layers.Rescaling(scale=1 / 127.5, offset=-1)
x = scale_layer(x)

# The base model contains batchnorm layers. We want to keep them in inference mode
# when we unfreeze the base model for fine-tuning, so we make sure that the
# base_model is running in inference mode here.
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)  # Convert features to vectors
x = keras.layers.Dropout(0.2)(x)  # Regularize with dropout
outputs = keras.layers.Dense(1)(x)  # Single logit for binary classification
model = keras.Model(inputs, outputs)

model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/xception/xception_weights_tf_dim_ordering_tf_kernels_notop.h5
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 150, 150, 3)]     0         
_________________________________________________________________
sequential (Sequential)      (None, 150, 150, 3)       0         
_________________________________________________________________
rescaling (Rescaling)        (None, 150, 150, 3)       0         
_________________________________________________________________
xception (Functional)        (None, 5, 5, 2048)        20861480  
_________________________________________________________________
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dropout (Dropout)            (None, 2048)       

<a name='4-2'></a>
### 4.2 - Train the top layer



In [None]:
model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()],
    )

epochs = 20
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f1365aee650>

<a name='4-3'></a>
### 4.3 - Fine-tuning

Finally, let's unfreeze the base model and train the entire model end-to-end with a low learning rate.

Importantly, although the base model becomes trainable, it still runs in inference mode because we passed `training=False` when calling it while building the model. This means that the batch normalization layers inside won't update their batch statistics. If they did, they would wreak havoc on the representations learned by the model so far.
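
To illustrate why the `training` flag matters, here is a toy batch normalization in plain Python. It is a simplified sketch, not the Keras `BatchNormalization` layer: the class name, the momentum of 0.9 and the sample batch are all assumptions. In inference mode it normalizes with the stored moving statistics and leaves them untouched; in training mode it normalizes with the current batch and shifts the moving statistics toward it.

```python
class TinyBatchNorm:
    """Toy 1-D batch normalization, for illustration only."""

    def __init__(self, momentum=0.9, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training mode: use batch statistics and update the moving averages.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference mode: use the stored statistics and do not modify them.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / (var + self.epsilon) ** 0.5 for v in batch]


bn = TinyBatchNorm()
batch = [10.0, 12.0, 14.0]

bn(batch, training=False)
stats_after_inference = (bn.moving_mean, bn.moving_var)  # unchanged

bn(batch, training=True)
stats_after_training = (bn.moving_mean, bn.moving_var)   # moved toward the batch

print(stats_after_inference)
print(stats_after_training)
```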

In [None]:
base_model.trainable = True
model.summary()  # summary() prints directly and returns None, so no print() is needed

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Low learning rate
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()],
    )

epochs = 10
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 150, 150, 3)]     0         
_________________________________________________________________
sequential (Sequential)      (None, 150, 150, 3)       0         
_________________________________________________________________
rescaling (Rescaling)        (None, 150, 150, 3)       0         
_________________________________________________________________
xception (Functional)        (None, 5, 5, 2048)        20861480  
_________________________________________________________________
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dropout (Dropout)            (None, 2048)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 2049  

<keras.callbacks.History at 0x7f1363f933d0>

After 10 epochs, fine-tuning gains us a nice improvement here.

<a name='5'></a>
## 5 - Save the model ##

In [None]:
model.save('finetuned_model')

INFO:tensorflow:Assets written to: finetuned_model/assets


