# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that co-occur** frequently across documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
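
To make the two distributions concrete, here is a minimal sketch of the generative story in plain Python. The topic names, vocabularies, and probabilities are made-up illustrative values, not estimates from the newsgroups data, and `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical topic -> word distributions (each sums to 1)
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "shuttle": 0.4, "launch": 0.2},
}

# Hypothetical document -> topic distribution (sums to 1)
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, length, rng):
    """Sample a document: draw a topic for each position, then a word from that topic."""
    topics = list(doc_topic)
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words


rng = random.Random(0)
document = generate_document(doc_topic, topic_word, length=8, rng=rng)
print(document)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the two distributions that most plausibly produced them.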

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), taking all other words and their topic assignments into account

Steps 2 and 3 are repeated until the topic assignments stabilize
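
The steps above can be sketched as a toy sampler in plain Python. This is an illustration only, not gensim's training algorithm (gensim uses online variational Bayes). The function name `toy_lda`, the smoothing values `alpha` and `beta`, and the four tiny documents are assumptions made for the example.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy topic sampler: random init, then repeated reassignment of each word."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is assigned to topic t
    topic_total = [0] * num_topics                              # total words assigned to topic t

    # Step 1: give every word a random initial topic
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts before re-scoring it
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 2: score each topic by P(T | D) * P(W | T), with smoothing.
                # P(T | D) is left unnormalized; its denominator is the same for every topic.
                weights = []
                for k in range(num_topics):
                    p_t_given_d = doc_topic[d][k] + alpha
                    p_w_given_t = (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    weights.append(p_t_given_d * p_w_given_t)

                # Step 3: draw the new topic in proportion to those scores
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return assignments, doc_topic


docs = [
    ["car", "engine", "door", "car"],
    ["engine", "car", "bumper"],
    ["orbit", "shuttle", "launch"],
    ["shuttle", "orbit", "space", "launch"],
]
assignments, doc_topic = toy_lda(docs, num_topics=2)
print(doc_topic)  # per-document topic counts
```

The smoothing terms keep every topic's score above zero, so a word can still move to a topic that currently has no assignments in its document.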

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
pip install pyLDAvis gensim spacy


Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting scipy (from pyLDAvis)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading scipy-1.13.1-cp311-cp311-manylinux_

### Import the libraries

In [2]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download stopwords
nltk.download('stopwords')
nltk.download('punkt')

# Load spacy model
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [3]:
import urllib.request

In [33]:
url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json'
filename = 'newsgroups.json'
urllib.request.urlretrieve(url, filename)

('newsgroups.json', <http.client.HTTPMessage at 0x7f031d551c90>)

### Load the dataset

In [5]:
import json
import pandas as pd

# Load the dataset
with open('newsgroups.json', 'r') as file:
    data = json.load(file)

# Convert to DataFrame
df = pd.DataFrame(data)


In [6]:

# Display the first few rows of the DataFrame
df.head()


Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


### Preprocess the data

### Email Removal

In [7]:
import re

# Function to remove emails
def remove_emails(text):
    return re.sub(r'\S+@\S+', '', text)

# Apply the function to the 'content' column
df['content'] = df['content'].apply(remove_emails)

# Display the first few rows of the DataFrame after email removal
df.head()

Unnamed: 0,content,target,target_names
0,From: (where's my thing)\nSubject: WHAT car i...,7,rec.autos
1,From: (Guy Kuo)\nSubject: SI Clock Poll - Fin...,4,comp.sys.mac.hardware
2,From: (Thomas E Willis)\nSubject: PB question...,4,comp.sys.mac.hardware
3,From: (Joe Green)\nSubject: Re: Weitek P9000 ...,1,comp.graphics
4,From: (Jonathan McDowell)\nSubject: Re: Shutt...,14,sci.space


### Newline Removal

In [8]:
# Function to remove newlines
def remove_newlines(text):
    return text.replace('\n', ' ')

# Apply the function to the 'content' column
df['content'] = df['content'].apply(remove_newlines)

# Display the first few rows of the DataFrame after newline removal
df.head()

Unnamed: 0,content,target,target_names
0,From: (where's my thing) Subject: WHAT car is...,7,rec.autos
1,From: (Guy Kuo) Subject: SI Clock Poll - Fina...,4,comp.sys.mac.hardware
2,From: (Thomas E Willis) Subject: PB questions...,4,comp.sys.mac.hardware
3,From: (Joe Green) Subject: Re: Weitek P9000 ?...,1,comp.graphics
4,From: (Jonathan McDowell) Subject: Re: Shuttl...,14,sci.space


### Single Quotes Removal

In [9]:
# Function to remove single quotes
def remove_single_quotes(text):
    return text.replace("'", "")

# Apply the function to the 'content' column
df['content'] = df['content'].apply(remove_single_quotes)

# Display the first few rows of the DataFrame after single quotes removal
df.head()

Unnamed: 0,content,target,target_names
0,From: (wheres my thing) Subject: WHAT car is ...,7,rec.autos
1,From: (Guy Kuo) Subject: SI Clock Poll - Fina...,4,comp.sys.mac.hardware
2,From: (Thomas E Willis) Subject: PB questions...,4,comp.sys.mac.hardware
3,From: (Joe Green) Subject: Re: Weitek P9000 ?...,1,comp.graphics
4,From: (Jonathan McDowell) Subject: Re: Shuttl...,14,sci.space


### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [10]:
import gensim
from gensim.utils import simple_preprocess

# Generator function to tokenize the text
def sent_to_words(texts):
    for text in texts:
        # simple_preprocess lowercases, tokenizes, and drops punctuation;
        # deacc=True additionally strips accent marks
        yield simple_preprocess(str(text), deacc=True)

# Apply the generator function to the 'content' column
data_words = list(sent_to_words(df['content']))

# Display the first few tokenized texts
data_words[:5]

[['from',
  'wheres',
  'my',
  'thing',
  'subject',
  'what',
  'car',
  'is',
  'this',
  'nntp',
  'posting',
  'host',
  'rac',
  'wam',
  'umd',
  'edu',
  'organization',
  'university',
  'of',
  'maryland',
  'college',
  'park',
  'lines',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'saw',
  'the',
  'other',
  'day',
  'it',
  'was',
  'door',
  'sports',
  'car',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  'early',
  'it',
  'was',
  'called',
  'bricklin',
  'the',
  'doors',
  'were',
  'really',
  'small',
  'in',
  'addition',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  'this',
  'is',
  'all',
  'know',
  'if',
  'anyone',
  'can',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'years',
  'of',
  'production',
  'where',
  'this',
  'car',
  'is',
  'made',
  'history',
  'or',
  'whatever',
  

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [11]:
# Extend the stop words corpus
stop_words = set(stopwords.words('english'))
stop_words.update(['from', 'subject', 're', 'edu', 'use'])

# Function to remove stop words from already-tokenized documents
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Apply the function to the tokenized texts
data_words_nostops = remove_stopwords(data_words)

# Display the first few texts after stop words removal
data_words_nostops[:5]

[['wheres',
  'thing',
  'car',
  'nntp',
  'posting',
  'host',
  'rac',
  'wam',
  'umd',
  'organization',
  'university',
  'maryland',
  'college',
  'park',
  'lines',
  'wondering',
  'anyone',
  'could',
  'enlighten',
  'car',
  'saw',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'really',
  'small',
  'addition',
  'front',
  'bumper',
  'separate',
  'rest',
  'body',
  'know',
  'anyone',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'years',
  'production',
  'car',
  'made',
  'history',
  'whatever',
  'info',
  'funky',
  'looking',
  'car',
  'please',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy',
  'kuo',
  'si',
  'clock',
  'poll',
  'final',
  'call',
  'summary',
  'final',
  'call',
  'si',
  'clock',
  'reports',
  'keywords',
  'si',
  'acceleration',
  'clock',
  'upgrade',
  'article',
  'shelley',
  'qvfo',
  'innc',
  'organization',
  'univer


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [14]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Define the bigram model
bigram = Phrases(data_words_nostops, min_count=5, threshold=100)

# Convert the bigram model to a Phraser for efficiency
bigram_mod = Phraser(bigram)
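
To see what `min_count` and `threshold` control, here is a small worked example of the default phrase score as described in the gensim documentation: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The counts below are made up, and `phrase_score` is a hypothetical helper written for this illustration, not a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; higher means the pair co-occurs more than chance suggests."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


threshold = 100

# Two fairly rare words that almost always appear together
strong = phrase_score(count_a=40, count_b=50, count_ab=30, vocab_size=10000, min_count=5)

# Two common words that appear together just as often in absolute terms
weak = phrase_score(count_a=400, count_b=500, count_ab=30, vocab_size=10000, min_count=5)

print(strong, strong > threshold)  # 125.0 True  -> merged into one token
print(weak, weak > threshold)      # 1.25 False  -> left as separate words
```

A higher threshold therefore yields fewer, more reliable bigrams, which is why only pairs such as `nntp_posting` or `front_bumper` get merged.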


#### make_bigrams( )

In [15]:
# Function to make bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

In [16]:
# Apply the function to the tokenized texts
data_words_bigrams = make_bigrams(data_words_nostops)
data_words_bigrams

[['wheres',
  'thing',
  'car',
  'nntp_posting',
  'host',
  'rac_wam',
  'umd',
  'organization',
  'university',
  'maryland_college',
  'park',
  'lines',
  'wondering',
  'anyone',
  'could',
  'enlighten',
  'car',
  'saw',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'really',
  'small',
  'addition',
  'front_bumper',
  'separate',
  'rest',
  'body',
  'know',
  'anyone',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'years',
  'production',
  'car',
  'made',
  'history',
  'whatever',
  'info',
  'funky',
  'looking',
  'car',
  'please',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy_kuo',
  'si',
  'clock',
  'poll',
  'final',
  'call',
  'summary',
  'final',
  'call',
  'si',
  'clock',
  'reports',
  'keywords',
  'si',
  'acceleration',
  'clock',
  'upgrade',
  'article_shelley',
  'qvfo',
  'innc',
  'organization',
  'university',
  'washington',
  'line

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [17]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [18]:
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

#### lemmatization( )

In [19]:
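def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag each word in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out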

In [20]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [21]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'rac_wam', 'university', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [22]:
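# Create Dictionary
# Note: this is built from data_words_bigrams, so the lemmatized tokens in
# data_lemmatized are not used by the dictionary, corpus, or model below
id2word = corpora.Dictionary(data_words_bigrams)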


### Create Corpus

In [23]:
# Create Corpus
corpus = [id2word.doc2bow(text) for text in data_words_bigrams]
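
As a rough picture of what the dictionary and `doc2bow` produce, the sketch below builds a token-to-id mapping and converts documents to `(token_id, count)` pairs in plain Python. `build_vocab` and `to_bow` are hypothetical helpers; gensim assigns ids in a different order, so only the shape of the result carries over.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["car", "door", "car"], ["door", "engine"]]
token2id = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'car': 0, 'door': 1, 'engine': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each document.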

### Filter low-frequency words

In [24]:
# Filter out words that occur in fewer than 15 documents or in more than 50% of the documents
id2word.filter_extremes(no_below=15, no_above=0.5)

# Create the corpus again after filtering
corpus = [id2word.doc2bow(text) for text in data_words_bigrams]
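
The filtering rule can be illustrated on a toy corpus. This is a simplified sketch of document-frequency filtering, assuming the same meaning for `no_below` and `no_above` as above; `filter_by_doc_frequency` is a hypothetical helper and does not cover gensim's additional `keep_n` cap.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above (a fraction) of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["car", "engine"],
    ["car", "door"],
    ["car", "orbit"],
    ["orbit", "launch"],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5)
print(kept)  # {'orbit'}
```

Here `car` is dropped for appearing in 3 of 4 documents (too common), while `engine`, `door`, and `launch` are dropped for appearing only once (too rare).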

### Create Index 2 word dictionary

In [25]:
# Create index to word dictionary (avoid shadowing the built-in id)
index2word = {idx: word for idx, word in id2word.items()}

In [26]:
# Display the first few entries of the index2word dictionary
print(list(index2word.items())[:10])

[(0, 'addition'), (1, 'anyone'), (2, 'body'), (3, 'brought'), (4, 'called'), (5, 'car'), (6, 'could'), (7, 'day'), (8, 'door'), (9, 'doors')]


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which you need to choose beforehand
- **chunksize**: the number of documents to be used in each training chunk
- **alpha**: the prior on the per-document topic distribution; it controls how sparse each document's topic mixture is (`'auto'` learns it from the corpus)
- **passes**: the total number of training passes over the corpus

In [27]:
from gensim.models import LdaModel

# Define the parameters
num_topics = 10
chunksize = 100
alpha = 'auto'
passes = 10

# Build the LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=num_topics,
                     random_state=100,
                     chunksize=chunksize,
                     passes=passes,
                     alpha=alpha,
                     per_word_topics=True)



### Print the Keywords in the 10 topics

In [28]:
# Print the keywords in the 10 topics
topics = lda_model.print_topics(num_topics=10, num_words=10)
for topic in topics:
    print(topic)

(0, '0.014*"year" + 0.009*"team" + 0.008*"car" + 0.008*"game" + 0.007*"last" + 0.006*"physical" + 0.006*"first" + 0.006*"next" + 0.006*"st" + 0.006*"win"')
(1, '0.014*"people" + 0.012*"said" + 0.011*"government" + 0.011*"gun" + 0.007*"soldiers" + 0.007*"war" + 0.007*"guns" + 0.006*"country" + 0.006*"us" + 0.006*"killed"')
(2, '0.016*"key" + 0.016*"space" + 0.013*"information" + 0.011*"research" + 0.009*"public" + 0.008*"technology" + 0.008*"may" + 0.008*"new" + 0.007*"system" + 0.006*"data"')
(3, '0.014*"mail" + 0.013*"drive" + 0.013*"system" + 0.012*"windows" + 0.010*"software" + 0.009*"computer" + 0.009*"bit" + 0.009*"card" + 0.009*"file" + 0.008*"program"')
(4, '0.020*"god" + 0.013*"evidence" + 0.012*"people" + 0.009*"reason" + 0.008*"believe" + 0.008*"may" + 0.007*"us" + 0.006*"jesus" + 0.006*"faith" + 0.006*"one"')
(5, '0.015*"problem" + 0.012*"using" + 0.009*"set" + 0.008*"work" + 0.007*"used" + 0.007*"problems" + 0.007*"copy" + 0.006*"may" + 0.006*"must" + 0.006*"line"')
(6, '0.

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

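Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity means a better fit. Note that gensim's `log_perplexity` returns the per-word likelihood bound (a negative number) rather than the perplexity itself, which is 2^(-bound)

In [29]:
# Compute the per-word likelihood bound (perplexity = 2 ** -bound)
perplexity = lda_model.log_perplexity(corpus)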
print(f'Model Perplexity: {perplexity}')

Model Perplexity: -7.8001167016181565
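
The value printed above is the per-word likelihood bound that `log_perplexity` returns. A short sketch of converting it to an actual perplexity, using the 2^(-bound) relation that gensim reports in its training log (the variable names are chosen for this example):

```python
# Per-word likelihood bound printed by the cell above
bound = -7.8001167016181565

# gensim's perplexity estimate is 2 raised to the negated bound
true_perplexity = 2 ** (-bound)
print(f"Perplexity estimate: {true_perplexity:.1f}")
```

A bound closer to zero gives a lower perplexity, so when comparing models on the same corpus the less negative bound is the better one.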


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [30]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f'Topic Coherence: {coherence_lda}')

Topic Coherence: 0.5810598930005508
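
To give a feel for what a coherence score rewards, here is a simplified UMass-style calculation in plain Python. This is a different and simpler measure than the `c_v` score computed above, shown only as an illustration; `umass_coherence` is a hypothetical helper and the documents are made up.

```python
import math


def umass_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain every given word
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_count(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            d_ij = doc_count(top_words[i], top_words[j])
            scores.append(math.log((d_ij + 1) / d_j))
    return sum(scores) / len(scores) if scores else 0.0


toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["orbit", "shuttle"],
    ["orbit", "shuttle", "launch"],
]
coherent = umass_coherence(["car", "engine"], toy_docs)
incoherent = umass_coherence(["car", "orbit"], toy_docs)
print(coherent > incoherent)  # True
```

Words that share documents score higher than words that never meet, which is the intuition behind every coherence measure, including `c_v`.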


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [31]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Prepare the visualization
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)

In [32]:
# Display the visualization
pyLDAvis.display(vis)