### Topic Modeling: Latent Dirichlet Allocation (LDA) using Gensim

Data: https://www.kaggle.com/yufengdev/bbc-fulltext-and-category/version/2?select=bbc-text.csv

Tutorials: https://www.linkedin.com/pulse/nlp-a-complete-guide-topic-modeling-latent-dirichlet-sahil-m/ and Chapter 7 of Practical Natural Language Processing

In [1]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import spacy  # for lemmatization

[nltk_data] Downloading package stopwords to /home/rachel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv('Data/bbc-text.csv')
df.head()

|   | category | text |
|---|----------|------|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [3]:
print(f'Number of documents: {len(df)}')

Number of documents: 2225


### Data Cleaning

In [22]:
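# Tokenize, lowercase, then drop stopwords and non-alphabetic tokens.
STOPS = set(stopwords.words('english'))  # build the stopword set once

def preprocess(textstring):
    # word_tokenize relies on NLTK's Punkt tokenizer data being installed
    tokens = [token.lower() for token in word_tokenize(textstring)]
    # Lowercase before the stopword check so capitalized stopwords are also removed
    return [token for token in tokens if token.isalpha() and token not in STOPS]

data = [preprocess(s) for s in df['text']]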

In [24]:
# Are there any weird characters?
regexp = re.compile(r'[^a-zA-Z]+')  # matches any run of characters that are not ASCII letters
print(f'Number of weird characters: {len([s for text in data for s in text if regexp.search(s)])}')

Number of weird characters: 0


All tokens are alphabetic, with no special characters or digits, so no further text cleaning is needed.
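
One subtlety worth noting: `str.isalpha()` is Unicode-aware, so it accepts accented letters that the ASCII-only pattern `[^a-zA-Z]` would flag. A count of zero therefore also indicates that no non-ASCII letters survived tokenization. A minimal sketch, using made-up tokens that are not taken from the BBC data:

```python
import re

# Same ASCII-only pattern as in the check above.
non_ascii_letter = re.compile(r'[^a-zA-Z]+')

# Hypothetical tokens, for illustration only.
sample_tokens = ['broadband', 'caf\u00e9', 'na\u00efve', 'tv']

# str.isalpha() is Unicode-aware, so every sample token passes it.
passes_isalpha = [t for t in sample_tokens if t.isalpha()]

# The regex flags any token containing a character outside a-z / A-Z.
flagged = [t for t in passes_isalpha if non_ascii_letter.search(t)]

print(flagged)
```

If such tokens did appear, they could be kept as they are or normalized, depending on whether accented spellings matter for the topics.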

For future reference, I should try spaCy's rule-based matching: "It is like Regular Expressions on steroids." (https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/). The spaCy matcher uses not only text patterns but also properties of each word, such as part-of-speech (POS) tags, dependency tags, and lemmas.
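
As a rough illustration of rule-based matching, the sketch below assumes spaCy v3 (where `Matcher.add` takes a list of patterns) and uses a blank English pipeline, which is enough for token-level attributes such as `LOWER` and `IS_DIGIT`. The pattern name, the pattern itself, and the sample sentence are made up for this example; patterns on `POS`, `DEP`, or `LEMMA` would need a trained pipeline such as `en_core_web_sm`.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes; no trained model is required
# for token-level attributes such as LOWER or IS_DIGIT.
nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

# Hypothetical pattern: the word "version" (any casing) followed by a digit token.
pattern = [{"LOWER": "version"}, {"IS_DIGIT": True}]
matcher.add("VERSION_NUMBER", [pattern])  # spaCy v3 signature (assumed)

doc = nlp_blank("The report covers Version 2 of the dataset.")

# Each match is a (match_id, start, end) triple of token indices into doc.
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Unlike a plain regular expression, each dictionary in the pattern describes one token, so the rule stays readable as more attributes are added.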

#### Lemmatize Tokens Using spaCy

The `en_core_web_sm` model is a small English pipeline trained on written web text. It performs several NLP tasks, including part-of-speech tagging, named entity recognition, and dependency parsing.

The `nlp` object returned by `spacy.load` is a pipeline of processing components that every input string passes through in turn.

In [33]:
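# ! python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatization(data, allowed_postags=('NOUN',)):
    """Lemmatize each token and keep it only if spaCy tags it with an allowed POS."""
    texts_out = []
    for text in data:
        for s in text:
            # Each token is parsed on its own, so the result holds one
            # (possibly empty) list per token, not one list per document.
            doc = nlp(s)
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_lemmatized = lemmatization(data, allowed_postags=['NOUN'])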
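
The cell above passes every token to spaCy separately, so the output contains one list per token. Topic models usually expect one list of tokens per document, so a per-document variant may be more useful downstream. The sketch below is one possible version, not the notebook's original method: `lemmatize_documents` is a hypothetical helper, and the pipeline is passed in as an argument so that any callable returning tokens with `.lemma_` and `.pos_` can be used.

```python
def lemmatize_documents(docs, nlp, allowed_postags=("NOUN",)):
    """Return one list of lemmas per document (hypothetical helper).

    docs: list of token lists, e.g. the `data` list built earlier.
    nlp:  callable mapping a string to an iterable of tokens that expose
          `.lemma_` and `.pos_`, such as spacy.load("en_core_web_sm").
    """
    texts_out = []
    for tokens in docs:
        # Re-join with spaces so the tagger sees each token in context.
        doc = nlp(" ".join(tokens))
        texts_out.append(
            [tok.lemma_ for tok in doc if tok.pos_ in allowed_postags]
        )
    return texts_out
```

With the objects defined in this notebook, `lemmatize_documents(data, nlp)` would return one list per article instead of one list per token. Tagging whole documents also gives the part-of-speech tagger surrounding context, which isolated single words lack.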

In [34]:
data_lemmatized

[['tv'],
 ['future'],
 ['hand'],
 ['viewer'],
 ['home'],
 [],
 ['system'],
 ['plasma'],
 ['tv'],
 [],
 ['video'],
 ['recorder'],
 [],
 [],
 ['room'],
 ['way'],
 ['people'],
 [],
 ['tv'],
 [],
 [],
 [],
 ['year'],
 ['time'],
 [],
 ['expert'],
 ['panel'],
 [],
 [],
 ['consumer'],
 ['electronic'],
 [],
 [],
 [],
 [],
 [],
 ['technology'],
 ['impact'],
 [],
 [],
 ['pastime'],
 [],
 [],
 [],
 ['programme'],
 ['content'],
 [],
 ['viewer'],
 [],
 ['home'],
 ['network'],
 ['cable'],
 ['satellite'],
 ['telecom'],
 ['company'],
 ['broadband'],
 ['service'],
 ['provider'],
 ['front'],
 ['room'],
 [],
 ['device'],
 [],
 ['technology'],
 ['ce'],
 [],
 [],
 ['video'],
 ['recorder'],
 ['dvr'],
 ['pvr'],
 ['box'],
 [],
 [],
 [],
 [],
 ['system'],
 [],
 ['people'],
 ['record'],
 ['store'],
 [],
 ['pause'],
 [],
 ['wind'],
 ['tv'],
 ['programme'],
 [],
 [],
 ['technology'],
 [],
 [],
 [],
 ['tv'],
 [],
 ['tv'],
 ['set'],
 [],
 ['business'],
 [],
 [],
 [],
 [],
 [],
 ['lack'],
 ['programming'],
 ['people