# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
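
The two-level view above can be sketched as a tiny generative process: pick a topic from the document's topic distribution, then pick a word from that topic's word distribution. The topic names, vocabularies, and probabilities below are made up for illustration and are not taken from the newsgroups data.

```python
import random

rng = random.Random(42)

# Hypothetical topics: each topic is a distribution over words
toy_topics = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "shuttle": 0.4, "launch": 0.2},
}

# Hypothetical document: a distribution over topics
toy_doc_topic_mix = {"autos": 0.7, "space": 0.3}


def generate_document(length):
    """Generate `length` words by sampling a topic, then a word, for each position."""
    words = []
    for _ in range(length):
        topic = rng.choices(
            list(toy_doc_topic_mix), weights=list(toy_doc_topic_mix.values())
        )[0]
        word_dist = toy_topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


toy_doc = generate_document(10)
print(toy_doc)
```

Training LDA runs this story in reverse: given only the words, it estimates the two sets of distributions.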

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 until the assignments stabilize
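
The steps above can be sketched as a small collapsed Gibbs sampler. This is a toy illustration only: it is not how gensim's `LdaModel` is trained (gensim uses online variational Bayes), and the function name `lda_gibbs`, the smoothing constants `alpha` and `beta`, and the toy documents are assumptions introduced here.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns topic assignments and per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to a topic
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments
    doc_topic = [[0] * num_topics for _ in docs]               # words in doc d with topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w has topic k
    topic_total = [0] * num_topics                              # words with topic k overall
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts ("all other words")
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 2: smoothed P(T | D) * P(W | T) for every topic, up to a constant
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Step 3: sample a new topic in proportion to those weights
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return assignments, doc_topic


toy_docs = [
    ["car", "engine", "door", "car"],
    ["orbit", "shuttle", "launch"],
    ["car", "engine", "orbit"],
]
toy_assignments, toy_doc_topic = lda_gibbs(toy_docs, num_topics=2)
print(toy_doc_topic)
```

The smoothing terms keep every weight strictly positive, so a topic that currently has no words in a document can still be sampled.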

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Import the libraries

In [1]:
import pyLDAvis
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim
import spacy


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [2]:
! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json   

--2023-03-22 06:02:15--  https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23237087 (22M) [text/plain]
Saving to: 'newsgroups.json'


### Load the dataset

In [2]:
df = pd.read_json("newsgroups.json")
df.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


### Preprocess the data

#### Email Removal

In [4]:
df["content"] = df["content"].str.split(' ', n=2).str.get(-1)
df.head()

Unnamed: 0,content,target,target_names
0,(where's my thing)Subject: WHAT car is this!?N...,7,rec.autos
1,(Guy Kuo)Subject: SI Clock Poll - Final CallSu...,4,comp.sys.mac.hardware
2,(Thomas E Willis)Subject: PB questions...Organ...,4,comp.sys.mac.hardware
3,(Joe Green)Subject: Re: Weitek P9000 ?Organiza...,1,comp.graphics
4,(Jonathan McDowell)Subject: Re: Shuttle Launch...,14,sci.space
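
What the split above does can be seen on a single string in plain Python. The sample text is a shortened, illustrative version of the first post, and the sketch assumes every post starts with `From: <address> `; a post with a different header layout would lose other text instead.

```python
# Shortened, illustrative version of the first post
raw_post = "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?"

# Split on the first two spaces only, giving at most three pieces
parts = raw_post.split(" ", 2)

# Keep the last piece, which drops "From:" and the email address
without_email = parts[-1]
print(without_email)
```

Only the sender address in the `From:` header is removed this way; addresses that appear later in the body are left in place.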


#### Newline Removal

In [5]:
df['content'] = df['content'].replace('\n','', regex=True)
df.head()

Unnamed: 0,content,target,target_names
0,(where's my thing)Subject: WHAT car is this!?N...,7,rec.autos
1,(Guy Kuo)Subject: SI Clock Poll - Final CallSu...,4,comp.sys.mac.hardware
2,(Thomas E Willis)Subject: PB questions...Organ...,4,comp.sys.mac.hardware
3,(Joe Green)Subject: Re: Weitek P9000 ?Organiza...,1,comp.graphics
4,(Jonathan McDowell)Subject: Re: Shuttle Launch...,14,sci.space


#### Single Quotes Removal

In [6]:
df['content'] = df['content'].replace("'",'', regex=True)
df.head()

Unnamed: 0,content,target,target_names
0,(wheres my thing)Subject: WHAT car is this!?Nn...,7,rec.autos
1,(Guy Kuo)Subject: SI Clock Poll - Final CallSu...,4,comp.sys.mac.hardware
2,(Thomas E Willis)Subject: PB questions...Organ...,4,comp.sys.mac.hardware
3,(Joe Green)Subject: Re: Weitek P9000 ?Organiza...,1,comp.graphics
4,(Jonathan McDowell)Subject: Re: Shuttle Launch...,14,sci.space


#### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**

In [9]:
from gensim.utils import simple_preprocess

In [17]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [10]:
# Convert each row to a list (one element per column), giving a list of lists. Every column is included, so the label columns travel with the text.
data=df.apply(list, axis=1).tolist()

In [26]:
#check how above function works
df.head(1)         

data[0]
len(data[0])


Unnamed: 0,content,target,target_names
0,(wheres my thing)Subject: WHAT car is this!?Nn...,7,rec.autos


['(wheres my thing)Subject: WHAT car is this!?Nntp-Posting-Host: rac3.wam.umd.eduOrganization: University of Maryland, College ParkLines: 15 I was wondering if anyone out there could enlighten me on this car I sawthe other day. It was a 2-door sports car, looked to be from the late 60s/early 70s. It was called a Bricklin. The doors were really small. In addition,the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, yearsof production, where this car is made, history, or whatever info youhave on this funky looking car, please e-mail.Thanks,- IL   ---- brought to you by your neighborhood Lerxst ----',
 7,
 'rec.autos']

3

In [13]:
# Tokenize each row: stringify the whole row (all columns) and split it into lowercase word tokens.
# simple_preprocess also drops very short and very long tokens by default.
def sent_to_words(li_of_lis):
    return [simple_preprocess(str(row)) for row in li_of_lis]

In [38]:
#check how above function works
datanow=sent_to_words(data)    
print(datanow[0])

['wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'eduorganization', 'university', 'of', 'maryland', 'college', 'parklines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'sawthe', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'youhave', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst', 'rec', 'autos']


In [39]:
#double check
len(data)==len(datanow)

data[1]
print(datanow[1])

True

['(Guy Kuo)Subject: SI Clock Poll - Final CallSummary: Final call for SI clock reportsKeywords: SI,acceleration,clock,upgradeArticle-I.D.: shelley.1qvfo9INNc3sOrganization: University of WashingtonLines: 11NNTP-Posting-Host: carson.u.washington.eduA fair number of brave souls who upgraded their SI clock oscillator haveshared their experiences for this poll. Please send a brief message detailingyour experiences with the procedure. Top speed attained, CPU rated speed,add on cards and adapters, heat sinks, hour of usage per day, floppy diskfunctionality with 800 and 1.4 m floppies are especially requested.I will be summarizing in the next two days, so please add to the networkknowledge base if you have done the clock upgrade and havent answered thispoll. Thanks.Guy Kuo <guykuo@u.washington.edu>',
 4,
 'comp.sys.mac.hardware']

['guy', 'kuo', 'subject', 'si', 'clock', 'poll', 'final', 'callsummary', 'final', 'call', 'for', 'si', 'clock', 'reportskeywords', 'si', 'acceleration', 'clock', 'upgradearticle', 'shelley', 'qvfo', 'innc', 'sorganization', 'university', 'of', 'washingtonlines', 'nntp', 'posting', 'host', 'carson', 'washington', 'edua', 'fair', 'number', 'of', 'brave', 'souls', 'who', 'upgraded', 'their', 'si', 'clock', 'oscillator', 'haveshared', 'their', 'experiences', 'for', 'this', 'poll', 'please', 'send', 'brief', 'message', 'detailingyour', 'experiences', 'with', 'the', 'procedure', 'top', 'speed', 'attained', 'cpu', 'rated', 'speed', 'add', 'on', 'cards', 'and', 'adapters', 'heat', 'sinks', 'hour', 'of', 'usage', 'per', 'day', 'floppy', 'with', 'and', 'floppies', 'are', 'especially', 'requested', 'will', 'be', 'summarizing', 'in', 'the', 'next', 'two', 'days', 'so', 'please', 'add', 'to', 'the', 'base', 'if', 'you', 'have', 'done', 'the', 'clock', 'upgrade', 'and', 'havent', 'answered', 'thispo

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [32]:
import nltk

In [33]:
stopping=stopwords.words("english")
new=["from", "subject", "re", "edu", "use"]

stoppings = stopping + new

In [34]:
print(len(stopping),len(stoppings))

179 184


#### remove_stopwords( )

In [35]:
from string import punctuation

In [37]:
print(datanow[0])

['wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'eduorganization', 'university', 'of', 'maryland', 'college', 'parklines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'sawthe', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'youhave', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst', 'rec', 'autos']


In [40]:
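# Remove stop words, digit-only tokens, and punctuation tokens
def remove_stopwords(datanow):
    punct = set(punctuation)
    totals = []

    for tokens in datanow:
        total = []
        for token in tokens:
            # Test each condition on the token itself; a chained "not in ... not in ..." would not do that
            if not token.isdigit() and token not in stoppings and token not in punct:
                total.append(token)
        totals.append(total)

    return totals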

In [41]:
#check if the function works
datathen=remove_stopwords(datanow)     
print(datathen[0])

['wheres', 'thing', 'car', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'eduorganization', 'university', 'maryland', 'college', 'parklines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'sawthe', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'car', 'made', 'history', 'whatever', 'info', 'youhave', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst', 'rec', 'autos']


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [42]:
from gensim.models import Phrases 

In [43]:
bigram=Phrases(datathen, min_count = 5, threshold = 100)

In [44]:
print(bigram[datathen[0]])

['wheres', 'thing', 'car', 'nntp_posting', 'host', 'rac_wam', 'umd', 'eduorganization', 'university', 'maryland_college', 'parklines', 'wondering_anyone', 'could', 'enlighten', 'car', 'sawthe', 'day', 'door', 'sports_car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'car', 'made', 'history', 'whatever', 'info', 'youhave', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst', 'rec_autos']
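
To give a feel for what `threshold=100` means, here is a sketch of the default bigram score as described in the gensim documentation (the scoring from Mikolov et al.): a pair is joined when its score exceeds the threshold. The function name `phrase_score` and all counts below are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram: frequent pairs of otherwise rare words score high."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Hypothetical counts: the two words appear 50 and 40 times in a 10,000-word vocabulary
strong_pair = phrase_score(count_a=50, count_b=40, count_ab=30, vocab_size=10_000)
weak_pair = phrase_score(count_a=50, count_b=40, count_ab=10, vocab_size=10_000)

threshold = 100
print(strong_pair > threshold, weak_pair > threshold)
```

A higher threshold therefore yields fewer, more strongly associated bigrams.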


#### make_bigrams( )

In [45]:
def make_bigrams(texts):
    return [bigram[text] for text in texts]

In [47]:
#check
data_words_bigrams=make_bigrams(datathen)    
print(data_words_bigrams[0])

['wheres', 'thing', 'car', 'nntp_posting', 'host', 'rac_wam', 'umd', 'eduorganization', 'university', 'maryland_college', 'parklines', 'wondering_anyone', 'could', 'enlighten', 'car', 'sawthe', 'day', 'door', 'sports_car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'car', 'made', 'history', 'whatever', 'info', 'youhave', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst', 'rec_autos']


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [48]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

#### lemmatization( )

In [49]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Keep only the lemmas of tokens whose POS tag is allowed; see https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [50]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [57]:
print(data_lemmatized[0])

['where', 's', 'thing', 'car', 'nntp_poste', 'host', 'umd', 'eduorganization', 'university', 'maryland_college', 'parkline', 'wondering_anyone', 'could', 'enlighten', 'car', 'day', 'door', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'youhave', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']


### Create a Dictionary

In [58]:
dictionary = Dictionary(data_lemmatized)

### Create Corpus

In [59]:
texts = data_lemmatized
corpus = [dictionary.doc2bow(text) for text in texts]
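
As a rough picture of what `doc2bow` produces, the sketch below builds a bag-of-words representation by hand: each document becomes a list of `(token_id, count)` pairs. The helper names and toy texts are hypothetical, and the IDs gensim assigns may differ from the ones here.

```python
from collections import Counter


def build_token_ids(texts):
    """Assign an integer ID to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Count the known tokens in one document and return sorted (id, count) pairs."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


toy_texts = [["car", "engine", "car"], ["space", "engine"]]
toy_ids = build_token_ids(toy_texts)
toy_bow = [to_bow(text, toy_ids) for text in toy_texts]
print(toy_ids)
print(toy_bow)
```

Word order is discarded in this representation; only which tokens occur, and how often, is kept.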

### Filter low-frequency words

In [60]:
dictionary.filter_extremes(no_below=10, no_above=0.5)
# rebuild the bag-of-words corpus (document-term matrix) so it uses the filtered dictionary
corpus = [dictionary.doc2bow(text) for text in texts]
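
The effect of `no_below` and `no_above` can be sketched in plain Python: a token is kept only if it appears in at least `no_below` documents and in no more than a `no_above` fraction of all documents. The helper name and toy texts are hypothetical; gensim's `filter_extremes` additionally caps the vocabulary size through its `keep_n` parameter, which this sketch ignores.

```python
from collections import Counter


def filter_by_doc_frequency(texts, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text))  # count each token once per document
    max_docs = no_above * len(texts)
    return {tok for tok, n_docs in doc_freq.items() if no_below <= n_docs <= max_docs}


toy_texts_for_filter = [
    ["car", "engine"],
    ["car", "space"],
    ["car", "engine", "orbit"],
    ["car", "space"],
]
kept_tokens = filter_by_doc_frequency(toy_texts_for_filter, no_below=2, no_above=0.5)
print(sorted(kept_tokens))
```

Here `car` is dropped for being in every document and `orbit` for being in only one, which mirrors the intent of removing both uninformative and very rare terms.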

### Create Index 2 word dictionary

In [61]:
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which must be chosen beforehand
- **chunksize** : the number of documents used in each training chunk
- **alpha** : the prior on the per-document topic distribution, which affects how sparse each document's topic mix is; `'auto'` learns an asymmetric prior from the corpus
- **passes** : the total number of passes over the corpus during training

In [62]:
ldamodel = LdaModel(corpus, num_topics=10, chunksize=100, alpha='auto', id2word = id2word,passes=20)

### Print the keywords in the 10 topics

In [63]:
for idx in range(10):
    print("Topic #%s:" % idx, ldamodel.print_topic(idx, 10))

Topic #0: 0.027*"space" + 0.013*"physical" + 0.011*"ripem" + 0.011*"research" + 0.010*"technology" + 0.008*"datum" + 0.008*"earth" + 0.008*"system" + 0.007*"sci_space" + 0.007*"item"
Topic #1: 0.015*"com" + 0.013*"university" + 0.012*"drive" + 0.012*"system" + 0.011*"organization" + 0.011*"host" + 0.010*"use" + 0.010*"line" + 0.010*"mail" + 0.008*"thank"
Topic #2: 0.030*"do" + 0.029*"would" + 0.027*"be" + 0.024*"get" + 0.019*"go" + 0.015*"know" + 0.014*"think" + 0.013*"good" + 0.012*"com" + 0.011*"s"
Topic #3: 0.029*"game" + 0.027*"team" + 0.022*"hockey" + 0.020*"win" + 0.020*"play" + 0.019*"rec_sport" + 0.009*"season" + 0.009*"year" + 0.009*"fan" + 0.007*"division"
Topic #4: 0.031*"window" + 0.021*"file" + 0.019*"program" + 0.011*"sale" + 0.010*"include" + 0.010*"comp" + 0.009*"image" + 0.008*"comp_graphic" + 0.008*"version" + 0.008*"entry"
Topic #5: 0.027*"year" + 0.011*"new" + 0.010*"money" + 0.009*"week" + 0.009*"baseball" + 0.009*"point" + 0.008*"day" + 0.007*"pay" + 0.007*"increa

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

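Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**. Lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, which is why the printed value is negative.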

In [64]:
print(ldamodel.log_perplexity(corpus)) 

-7.676674689440603
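
To turn the value printed by `log_perplexity` into a perplexity figure, the gensim documentation gives the relation perplexity = 2^(-bound). A minimal sketch using the value printed above:

```python
# Per-word likelihood bound printed by ldamodel.log_perplexity(corpus) above
bound = -7.676674689440603

# gensim's documented relation between the bound and perplexity
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

This was measured on the training corpus, so it says little about generalization; a held-out set would give a more honest number.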


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [65]:
from gensim.models import CoherenceModel

In [66]:
coherence = CoherenceModel(model=ldamodel, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
Coherence = coherence.get_coherence()
print(Coherence)

0.5849240794230908
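
The `c_v` measure used above is fairly involved, so the sketch below illustrates the general idea with a different and simpler measure, a UMass-style score based on document co-occurrence counts. It is not what the cell above computes. The function name and toy texts are hypothetical, and the code assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, texts):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, with w_j ranked above w_i."""
    doc_sets = [set(text) for text in texts]

    def doc_count(*words):
        # Number of documents containing all the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return score


toy_corpus_texts = [
    ["game", "team", "win"],
    ["game", "team"],
    ["space", "orbit"],
    ["space", "game"],
]
coherent_score = umass_coherence(["game", "team"], toy_corpus_texts)
mixed_score = umass_coherence(["game", "orbit"], toy_corpus_texts)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents score closer to zero, while pairs that never co-occur pull the score down. gensim's `CoherenceModel` also accepts `coherence='u_mass'` for a measure of this family.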


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [67]:
import pyLDAvis.gensim_models

In [68]:
pyLDAvis.enable_notebook()

In [69]:
pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)

