# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to a probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
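
To make the two distributions concrete, here is a minimal sketch of the generative story using only the standard library. The topic names, words, and probabilities are made up for illustration and do not come from the newsgroups data.

```python
import random

# Hypothetical topic-word distributions: each topic is a distribution over words.
topics = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "shuttle": 0.4, "launch": 0.2},
}

# Hypothetical document-topic distribution: the document is a mixture of topics.
doc_topic_mix = {"autos": 0.7, "space": 0.3}


def generate_document(num_words, rng):
    """Sample each word by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(num_words):
        topic = rng.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(42)
document = generate_document(10, rng)
print(document)
```

Training an LDA model amounts to running this story in reverse: given only the documents, it estimates the two kinds of distributions.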

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
4. Repeat steps 2 and 3 many times, until the assignments stabilize
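
The steps above can be sketched as a toy sampler in plain Python. This is only an illustration under stated assumptions, not gensim's training procedure (gensim uses online variational Bayes): the function name `toy_lda`, the `smoothing` constant (added so that no probability is ever exactly zero), and the three tiny documents are all hypothetical.

```python
import random


def toy_lda(docs, num_topics, iterations=50, smoothing=0.1, seed=0):
    """Toy sampler for the steps above; `smoothing` is an added assumption."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to a topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d with topic t
    topic_word = [dict() for _ in range(num_topics)]  # occurrences of word w with topic t
    topic_total = [0] * num_topics                    # all words with topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment so only the other words count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Step 2: P(T|D) * P(W|T) for every candidate topic.
                weights = []
                for t in range(num_topics):
                    p_t_d = (doc_topic[d][t] + smoothing) / (len(doc) - 1 + num_topics * smoothing)
                    p_w_t = (topic_word[t].get(w, 0) + smoothing) / (topic_total[t] + vocab_size * smoothing)
                    weights.append(p_t_d * p_w_t)

                # Step 3: reassign the word in proportion to those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] = topic_word[new].get(w, 0) + 1
                topic_total[new] += 1

    return assignments, doc_topic


docs = [
    ["car", "engine", "door", "car"],
    ["orbit", "shuttle", "launch", "orbit"],
    ["car", "door", "engine"],
]
assignments, doc_topic = toy_lda(docs, num_topics=2)
print(doc_topic)  # per-document topic counts after sampling
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, which is the unnormalized form of P(T | D).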

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
#! pip install pyLDAvis gensim spacy

### Import the libraries

In [32]:
import nltk
import re
import spacy
import pyLDAvis
import logging
import json
import gensim
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to C:\Users\Andres Lojan
[nltk_data]     Yepez\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [4]:
! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

--2021-11-13 12:33:47--  https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23237087 (22M) [text/plain]
Saving to: 'newsgroups.json'


### Load the dataset

In [5]:
# Read the file saved by the wget cell above (relative path, so the cell works on any machine)
newsgroups = pd.read_json("newsgroups.json")


In [14]:
newsgroups["content"][0]


"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [15]:
newsgroups.head()


| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


In [12]:
print(newsgroups)

                                                 content  target  \
0      From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1      From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2      From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3      From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4      From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   
...                                                  ...     ...   
11309  From: jim.zisfein@factory.com (Jim Zisfein) \n...      13   
11310  From: ebodin@pearl.tufts.edu\nSubject: Screen ...       4   
11311  From: westes@netcom.com (Will Estes)\nSubject:...       3   
11312  From: steve@hcrlgw (Steven Collins)\nSubject: ...       1   
11313  From: gunning@cco.caltech.edu (Kevin J. Gunnin...       8   

                   target_names  
0                     rec.autos  
1         comp.sys.mac.hardware  
2         comp.sys.mac.hardware  
3                 comp.graphics  
4            


### Preprocess the data

### Email Removal

In [21]:
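# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
newsgroups["clean_content"] = newsgroups["content"].str.replace(r'[A-Za-z0-9]*@[A-Za-z-]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]', '', regex=True).str.strip()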


In [23]:
print(newsgroups["clean_content"].head())

0    From:  (where's my thing)\nSubject: WHAT car i...
1    From:  (Guy Kuo)\nSubject: SI Clock Poll - Fin...
2    From:  (Thomas E Willis)\nSubject: PB question...
3    From:  (Joe Green)\nSubject: Re: Weitek P9000 ...
4    From:  (Jonathan McDowell)\nSubject: Re: Shutt...
Name: clean_content, dtype: object
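
To see what the pattern does on a single string, here is a small standard-library sketch that applies the same expression with `re.sub`. The sample is the first line of the post shown earlier; the variable names are illustrative only.

```python
import re

# The same pattern used in the notebook cell above.
EMAIL_PATTERN = r'[A-Za-z0-9]*@[A-Za-z-]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]*\.?[A-Za-z0-9]'

sample = "From: lerxst@wam.umd.edu (where's my thing)"
cleaned = re.sub(EMAIL_PATTERN, '', sample).strip()
print(cleaned)  # From:  (where's my thing)
```

The address is removed but the surrounding spaces are kept, which is why the cleaned text shows two spaces after `From:`.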



### Newline Removal

In [24]:
newsgroups["clean_content"] = newsgroups["clean_content"].str.replace("\n","")


In [25]:
print(newsgroups["clean_content"].head())

0    From:  (where's my thing)Subject: WHAT car is ...
1    From:  (Guy Kuo)Subject: SI Clock Poll - Final...
2    From:  (Thomas E Willis)Subject: PB questions....
3    From:  (Joe Green)Subject: Re: Weitek P9000 ?O...
4    From:  (Jonathan McDowell)Subject: Re: Shuttle...
Name: clean_content, dtype: object



### Single Quotes Removal

In [26]:
newsgroups["clean_content"] = newsgroups["clean_content"].str.replace("'", "")


In [27]:
print(newsgroups["clean_content"].head())

0    From:  (wheres my thing)Subject: WHAT car is t...
1    From:  (Guy Kuo)Subject: SI Clock Poll - Final...
2    From:  (Thomas E Willis)Subject: PB questions....
3    From:  (Joe Green)Subject: Re: Weitek P9000 ?O...
4    From:  (Jonathan McDowell)Subject: Re: Shuttle...
Name: clean_content, dtype: object



### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [28]:
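def sent_to_words():
    # Series.iteritems() was removed in pandas 2.0, so iterate over the values directly
    for text in newsgroups['clean_content']:
        raw = str(text).lower()
        # simple_preprocess tokenizes the text and drops very short or very long tokens
        yield gensim.utils.simple_preprocess(raw)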


In [29]:
print(sent_to_words())

<generator object sent_to_words at 0x000001EEAF166200>
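
As a rough picture of what happens when the generator is consumed, the sketch below imitates the tokenization with the standard library. `approx_simple_preprocess` is a hypothetical stand-in, not gensim's implementation: it assumes tokens are runs of letters and reuses gensim's default length limits of 2 and 15 characters. The sample posts are made up.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (not the real implementation)."""
    # Lowercase, then keep runs of letters only (digits and punctuation split tokens).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


def iter_tokens(texts):
    """Generator: yields one token list at a time instead of building them all in memory."""
    for text in texts:
        yield approx_simple_preprocess(text)


sample_posts = ["WHAT car is this!?", "It was a 2-door sports car"]
for tokens in iter_tokens(sample_posts):
    print(tokens)
```

Because the generator produces one document at a time, nothing is tokenized until the loop asks for the next item, which keeps memory use low on a corpus of this size.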



In [30]:
for x in sent_to_words():
    print(x)


['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'eduorganization', 'university', 'of', 'maryland', 'college', 'parklines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'sawthe', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'youhave', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
['from', 'guy', 'kuo', 'subject', 'si', 'clock', 'p

['from', 'henry', 'spencer', 'subject', 're', 'public', 'domain', 'circuits', 'in', 'commercial', 'of', 'toronto', 'zoologylines', 'in', 'article', 'ge', 'du', 'michael', 'covington', 'writes', 'patent', 'law', 'says', 'you', 'can', 'build', 'anything', 'you', 'want', 'to', 'for', 'your', 'own', 'personal', 'noncommercial', 'use', 'im', 'not', 'up', 'on', 'the', 'details', 'of', 'us', 'patent', 'law', 'but', 'think', 'this', 'is', 'incorrect', 'there', 'is', 'reasonable', 'use', 'exemption', 'for', 'copyright', 'there', 'is', 'none', 'for', 'patents', 'the', 'exemptions', 'from', 'patent', 'licensing', 'are', 'quite', 'narrow', 'workis', 'exempt', 'but', 'personal', 'use', 'is', 'not', 'that', 'is', 'its', 'okay', 'to', 'experiment', 'witha', 'patented', 'idea', 'but', 'not', 'to', 'put', 'it', 'to', 'practical', 'use', 'to', 'improve', 'yourstereo', 'even', 'if', 'its', 'only', 'your', 'own', 'private', 'practical', 'use', 'of', 'course', 'it', 'is', 'unlikely', 'that', 'discreet', 'p


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [33]:
eng_stop_words = stopwords.words('english')
new_words = ["from", "subject", "re", "edu", "use"]
whole_list = eng_stop_words + new_words


#### remove_stopwords( )

In [34]:
def remove_stopwords(texts):
    tokens = [i for i in texts if i not in whole_list]
    return tokens


In [35]:
new_docs = [remove_stopwords(doc) for doc in sent_to_words()]


### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [36]:
from gensim.models import Phrases

sentence_stream = [doc for doc in new_docs]
bigram = Phrases(sentence_stream, min_count=1, threshold=100)
sent = new_docs[1]
print(bigram[sent])


['guy_kuo', 'si_clock', 'poll', 'final', 'callsummary', 'final', 'call', 'si_clock', 'reportskeywords', 'si', 'acceleration', 'clock', 'upgradearticle', 'shelley', 'qvfo', 'innc', 'sorganization', 'university_washingtonlines', 'nntp_posting', 'host_carson', 'washington', 'edua', 'fair', 'number', 'brave_souls', 'upgraded', 'si_clock', 'oscillator', 'haveshared', 'experiences', 'poll', 'please', 'send', 'brief', 'message', 'detailingyour', 'experiences', 'procedure', 'top', 'speed', 'attained', 'cpu', 'rated', 'speed', 'add', 'cards', 'adapters', 'heat_sinks', 'hour', 'usage', 'per', 'day', 'floppy', 'floppies', 'especially', 'requested', 'summarizing', 'next', 'two', 'days', 'please', 'add', 'base', 'done', 'clock', 'upgrade', 'havent', 'answered', 'thispoll', 'thanks', 'guy_kuo']


['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']



#### make_bigrams( )

In [37]:
def make_bigrams(texts):
    sentence_stream = [doc for doc in texts]
    bigram = Phrases(sentence_stream, min_count=1, threshold=100)

    return bigram[texts]


In [38]:
bigram_doc = make_bigrams(new_docs)


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [42]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [43]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


#### lemmatization( )

In [44]:
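def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # spaCy expects a string, so join the tokens back together first
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out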


In [45]:
data_lemmatized = lemmatization(bigram_doc, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])


In [46]:
print(data_lemmatized[:1])

[['thing', 'car', 'nntp_poste', 'wam_umd', 'parkline', 'enlighten', 'car', 'sawthe', 'day', 'door_sport', 'car', 'look', 'late_early', 'called_bricklin', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'tellme_model', 'name', 'engine', 'spec', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'youhave', 'funky_looke', 'car', 'mail', 'thank']]



### Create a Dictionary

In [47]:
dictionary = Dictionary(data_lemmatized) 


### Filter low-frequency words

In [48]:
dictionary.filter_extremes(no_below=10, no_above=0.5)


### Create Corpus

In [49]:
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
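
To illustrate what the dictionary filtering and bag-of-words steps do, here is a standard-library sketch on a toy corpus. The helpers `build_vocab` and `to_bow` are hypothetical simplifications of `filter_extremes` and `doc2bow`; the toy documents and `no_below=2` are made up, and gensim assigns token ids differently.

```python
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (a fraction) of docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


toy_docs = [
    ["car", "engine", "car", "thank"],
    ["car", "door", "thank"],
    ["orbit", "shuttle", "thank"],
    ["orbit", "launch", "thank"],
]
token2id = build_vocab(toy_docs)
bows = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)  # {'car': 0, 'orbit': 1}
print(bows)      # [[(0, 2)], [(0, 1)], [(1, 1)], [(1, 1)]]
```

Rare words such as `engine` and the ubiquitous `thank` are both dropped, leaving only words that help tell documents apart.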


In [50]:
print(corpus)



### Create Index 2 word dictionary

In [51]:
# Dictionary.id2token is built lazily; looking up any id forces it to be populated
temp = dictionary[0]
id2word = dictionary.id2token


In [52]:
print(id2word)




### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: a hyperparameter (the prior on per-document topic distributions) that affects how sparse each document's topic mixture is
- **passes**: the total number of training passes over the corpus

In [53]:
ldamodel = LdaModel(corpus, num_topics=8, id2word = id2word, passes=20)


### Print the keywords in the 8 topics

In [54]:
pprint(ldamodel.print_topics(num_topics=8, num_words=10))

[(0,
  '0.922*"ax" + 0.003*"max" + 0.002*"qax" + 0.002*"_" + 0.002*"rlk" + '
  '0.002*"ei" + 0.002*"sm" + 0.001*"lk" + 0.001*"tm" + 0.001*"r"'),
 (1,
  '0.017*"get" + 0.015*"organization" + 0.015*"know" + 0.014*"nntp_poste" + '
  '0.014*"m" + 0.012*"host" + 0.011*"line" + 0.010*"article" + 0.010*"thank" + '
  '0.008*"need"'),
 (2,
  '0.015*"go" + 0.013*"get" + 0.011*"say" + 0.009*"people" + 0.008*"make" + '
  '0.008*"think" + 0.008*"know" + 0.008*"time" + 0.008*"car" + 0.007*"take"'),
 (3,
  '0.021*"drive" + 0.014*"use" + 0.012*"system" + 0.010*"bit" + '
  '0.010*"problem" + 0.009*"work" + 0.008*"scsi" + 0.008*"time" + '
  '0.008*"power" + 0.007*"speed"'),
 (4,
  '0.031*"key" + 0.015*"government" + 0.012*"system" + 0.011*"use" + '
  '0.010*"encryption" + 0.009*"security" + 0.008*"chip" + 0.007*"clipper" + '
  '0.007*"public" + 0.007*"make"'),
 (5,
  '0.018*"file" + 0.014*"program" + 0.008*"use" + 0.008*"include" + '
  '0.008*"information" + 0.008*"space" + 0.007*"image" + 0.007*"availa


## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

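Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word log-likelihood bound rather than the perplexity itself.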

In [55]:
pprint(ldamodel.log_perplexity(corpus))


-6.986324162496614
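
gensim documents the value returned by `log_perplexity` as a per-word likelihood bound and reports perplexity as `2 ** (-bound)`. A minimal sketch converting the number printed above (the variable names are illustrative):

```python
# Per-word likelihood bound printed by ldamodel.log_perplexity(corpus) above.
per_word_bound = -6.986324162496614

# gensim logs perplexity as 2 ** (-bound); lower values indicate a better fit.
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

This gives a perplexity of roughly 127. Because it is computed on the training corpus, it is most useful for comparing models trained on the same data, for example with different numbers of topics.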


### Topic Coherence
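Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic. The `u_mass` measure used below is based on how often those words occur together in the same documents; values closer to zero indicate more coherent topics.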

In [56]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=ldamodel, corpus=corpus, coherence='u_mass')
coherence = cm.get_coherence()
pprint(coherence)


-1.6272310405610662
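
To give a feel for where a number like this comes from, the sketch below computes a UMass-style score for one topic from document co-occurrence counts, using `log((D(w_i, w_j) + 1) / D(w_j))` for each pair of top words. It is a simplified illustration and will not reproduce gensim's value: averaging over pairs, the `+ 1` smoothing, and the toy documents are assumptions made here, and every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in ds for w in words) for ds in doc_sets)

    scores = []
    # top_words is ordered from most to least probable, so w_j precedes w_i.
    for w_j, w_i in combinations(top_words, 2):
        scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j)))
    return sum(scores) / len(scores)


toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["orbit", "shuttle"],
    ["car", "orbit"],
]
coherent = umass_coherence(["car", "engine"], toy_docs)
mixed = umass_coherence(["car", "shuttle"], toy_docs)
print(coherent, mixed)
```

Words that share documents (`car`, `engine`) score higher than words that never appear together (`car`, `shuttle`), which is the behaviour the coherence score is meant to capture.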


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [57]:
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()

pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)

