
# Activity 11.13.2

Determining the topics of a text is an interesting—and challenging—problem.  

In this lesson, you'll categorize the topics of about 20,000 posts to 20 different newsgroups. This [20 newsgroups data set](http://qwone.com/~jason/20Newsgroups/) is often used when working with natural language processing models.

#Step 1: Install the Necessary Packages
* Run the following code block to import the necessary libraries and packages. Make sure to type "y" when asked if you want to proceed.

In [1]:
#Step 1

import nltk 
nltk.download('stopwords')
!python3 -m spacy download en_core_web_sm
from nltk.corpus import stopwords

# Helpful packages
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
!pip uninstall gensim
! pip install gensim==4.2.0
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 26.3 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Found existing installation: gensim 3.6.0
Uninstalling gensim-3.6.0:
  Would remove:
    /usr/local/lib/python3.7/dist-packages/gensim-3.6.0.dist-info/*
    /usr/local/lib/python3.7/dist-packages/gensim/*
Proceed (y/n)? y
  Successfully uninstalled gensim-3.6.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==4.2.0
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 54.9 MB/s
Instal



#Step 2: Read in the 20 Newsgroups Data Set
* Run the following code block to read in the 20 Newsgroups data set.
* Although you won't use it in this activity because you're approaching this as an unsupervised learning problem, `target_names` tells you which newsgroup a post came from.
* What is the name of the newsgroup that appears at the top of the data set?


In [2]:
#Step 2

df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
df.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


#Step 3: Remove Extra Characters and Symbols
* As you can see in the printed data above, there are a lot of extra characters and symbols like "@" and "\n" in the text that should be removed.
* Run the following code block to remove or replace each symbol or character. Other punctuation will be removed later using `gensim.utils.simple_preprocess`.
* Preview the text of the first newsgroup post. What is the topic of the post?


In [3]:
#Step 3

# Convert to list
data = df.content.values.tolist()

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]

pprint(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']
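
To see what each substitution does in isolation, the same three patterns can be applied step by step to a short string. This is a minimal sketch; the sample text and the intermediate variable names are made up for illustration and are not part of the data set.

```python
import re

# Hypothetical sample text, not taken from the data set.
raw = "From: someone@example.com (A Poster)\nSubject: What's this?\n\nIt's a test."

no_email = re.sub(r'\S*@\S*\s?', '', raw)   # drop any token containing "@"
one_line = re.sub(r'\s+', ' ', no_email)     # collapse newlines and runs of spaces
cleaned = re.sub(r"\'", "", one_line)        # drop single quotes

print(cleaned)
```

Note that the first pattern removes the whole whitespace-delimited token around the "@", so the address disappears but the display name in parentheses stays.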


**Step 3 Answer:**



#Step 4: Process the Sentences into Lists of Strings
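* Run the following code block to convert each post into a list of lowercase word tokens.
* `simple_preprocess` lowercases the text, drops punctuation and digits, and discards tokens shorter than 2 or longer than 15 characters (its defaults). `deacc=True` additionally strips accent marks.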
* Note the format of the first post. It should match the format used for the text of the literary works preprocessed during the word embedding activity.


In [4]:
#Step 4

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenizer itself drops punctuation and digits
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
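
The output above shows several effects at once: "rac3" became "rac", the single-letter word "I" vanished, and all punctuation is gone. The sketch below is a rough standard-library approximation of that behavior, not gensim's actual implementation. The function name `rough_preprocess` and the sample sentence are hypothetical, and edge cases (such as underscores) may differ from the real function.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess(..., deacc=True); illustrative only."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", no_accents)
    # Discard tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample_tokens = rough_preprocess("I saw a caf\u00e9 near rac3.wam.umd.edu, a 2-door car!")
print(sample_tokens)
```

The length filter, not stop word removal, is what drops "I" and "a" at this stage. Stop words are handled separately in the next step.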


#Step 5: Remove Stop Words, Lemmatize Words, and Create Bigrams
* Run the following code block to preprocess the data by removing stop words, lemmatizing words, and creating bigrams:


In [5]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
#trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Freeze the trained model into a lighter object that applies bigrams faster
bigram_mod = gensim.models.phrases.Phraser(bigram)
#trigram_mod = gensim.models.phrases.Phraser(trigram)

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

#def make_trigrams(texts):
#    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Remove Stop Words
#Define stop words
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])





[['s', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]
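
Tokens such as `nntp_poste` in the output come from the bigram step, which joins word pairs whose score exceeds `threshold`. The sketch below is a simplified stand-in for that scoring. It assumes a score of the form (pair count - min_count) x vocabulary size / (count of first word x count of second word), which is how gensim's documentation describes its default scorer. The function name `bigram_scores` and the toy corpus are hypothetical.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent word pairs; higher means the pair co-occurs more than chance."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

# Hypothetical toy corpus of already-tokenized documents.
toy_docs = [
    ["clipper", "chip", "is", "new"],
    ["the", "clipper", "chip", "is", "old"],
    ["the", "chip", "was", "the", "clipper", "chip"],
]
scores = bigram_scores(toy_docs)
best = max(scores, key=scores.get)
print("_".join(best), round(scores[best], 2))
```

Raising the threshold means fewer pairs qualify, which matches the "higher threshold fewer phrases" comment in the cell above.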


#Step 6: Tokenize Words and Compute Word Frequencies
* Run the following code block to tokenize the words in the texts and calculate word frequencies.
* How many times does the word "addition" appear in the first newsgroup post?


In [6]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Bag-of-words counts: one list of (token id, frequency) pairs per document
corpus = [id2word.doc2bow(text) for text in texts]

# View
# Human readable format of corpus (term-frequency)
[[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]

[[('addition', 1),
  ('body', 1),
  ('bring', 1),
  ('call', 1),
  ('car', 5),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('funky', 1),
  ('history', 1),
  ('host', 1),
  ('info', 1),
  ('know', 1),
  ('late', 1),
  ('lerxst', 1),
  ('line', 1),
  ('look', 2),
  ('mail', 1),
  ('make', 1),
  ('model', 1),
  ('name', 1),
  ('neighborhood', 1),
  ('nntp_poste', 1),
  ('park', 1),
  ('production', 1),
  ('really', 1),
  ('rest', 1),
  ('s', 1),
  ('see', 1),
  ('separate', 1),
  ('small', 1),
  ('spec', 1),
  ('sport', 1),
  ('tellme', 1),
  ('thank', 1),
  ('thing', 1),
  ('wonder', 1),
  ('year', 1)]]
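
The dictionary-plus-counts structure can be imitated with the standard library, which may help when reading the output above. This is only a sketch of the idea: the toy documents are made up, and the integer ids assigned here will not match the ids gensim's `Dictionary` would choose.

```python
from collections import Counter

# Hypothetical lemmatized documents.
toy_texts = [
    ["car", "door", "car", "engine"],
    ["engine", "launch", "orbit"],
]

# Assign each distinct token an integer id (the role played by the Dictionary).
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

# Convert each document to sorted (token_id, count) pairs (the role played by doc2bow).
toy_corpus = [sorted(Counter(token2id[t] for t in text).items()) for text in toy_texts]
print(toy_corpus)
```

Word order is discarded in this representation; only the per-document counts remain, which is all the topic model uses.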

**Step 6 Answer:**



#Step 7: Model the Topics in the Newsgroup Posts
* Run the following code block to find the topics.
* Most of the hyperparameters have been tuned so that the model will run as quickly and efficiently as possible.
* How many topics will the model find in the data? Hint: Look at the hyperparameter `num_topics`.


In [7]:
#Step 7

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

**Step 7 Answer:**



#Step 8: Print the Top 10 Terms Associated with Each Topic
* Run the following code block to print the words associated with each topic.
* What are the words associated with the 20th (last) topic?
* How would you label the last topic?


In [8]:
# Step 8

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.024*"kill" + 0.023*"live" + 0.021*"death" + 0.017*"die" + '
  '0.017*"physical" + 0.015*"center" + 0.014*"bike" + 0.014*"attack" + '
  '0.012*"israeli" + 0.012*"fire"'),
 (1,
  '0.621*"ax" + 0.018*"slow" + 0.014*"brain" + 0.014*"review" + 0.012*"mb" + '
  '0.011*"clipper_chip" + 0.010*"sc" + 0.010*"printer" + 0.009*"box" + '
  '0.008*"mouse"'),
 (2,
  '0.075*"space" + 0.063*"gun" + 0.022*"launch" + 0.021*"earth" + '
  '0.019*"firearm" + 0.017*"orbit" + 0.017*"mission" + 0.017*"series" + '
  '0.015*"vehicle" + 0.015*"year"'),
 (3,
  '0.150*"com" + 0.048*"mount" + 0.046*"apple" + 0.037*"ram" + '
  '0.026*"corporation" + 0.025*"frame" + 0.025*"task" + 0.022*"spring" + '
  '0.020*"locate" + 0.019*"spacecraft"'),
 (4,
  '0.024*"evidence" + 0.019*"believe" + 0.016*"claim" + 0.016*"reason" + '
  '0.014*"man" + 0.014*"exist" + 0.012*"sense" + 0.012*"book" + 0.012*"life" + '
  '0.011*"faith"'),
 (5,
  '0.024*"thank" + 0.024*"line" + 0.019*"program" + 0.018*"file" + '
  '0.017*"mail" +
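
Each topic is printed as a single string of `weight*"term"` pieces joined by `+`. When the terms are needed as data rather than text, the string can be split with a regular expression, as in the sketch below (the string is copied from topic 0 above; the variable names are made up). If memory serves, gensim's `show_topic` method returns similar (term, weight) pairs directly, so this parsing is mainly useful for understanding the printed format.

```python
import re

# One topic string in the printed format (copied from topic 0 above).
topic_str = ('0.024*"kill" + 0.023*"live" + 0.021*"death" + 0.017*"die" + '
             '0.017*"physical" + 0.015*"center" + 0.014*"bike" + 0.014*"attack" + '
             '0.012*"israeli" + 0.012*"fire"')

# Each piece looks like weight*"term"; capture both parts.
pairs = [(term, float(weight))
         for weight, term in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]
print(pairs[:3])
```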

**Step 8 Answers:**



#Step 9: Visualize the Topics and Explore the Keywords
* Run the following code block to create a visualization of the topics and keywords associated with the topics.
* Hover over each topic bubble on the left side to view its associated keywords in the bar chart on the right.
* You can also see how some topics have keywords that overlap (such as topics 1 and 4).
* What is the most relevant keyword for topic 9?
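
"Most relevant" has a specific meaning in LDAvis-style visualizations: a term's relevance to a topic blends how probable the term is within the topic with how much more probable it is there than in the corpus overall, weighted by a parameter λ that the visualization exposes as a slider. The sketch below computes that blend, λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), for made-up probabilities. The function name, the term probabilities, and the choice of λ = 0.6 are assumptions for illustration.

```python
import math

def relevance(p_w_given_t, p_w, lam=0.6):
    """Blend in-topic probability with lift (p(w|t) / p(w)), weighted by lam."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: term -> (p(term | topic), p(term) over the whole corpus)
toy_terms = {
    "line":  (0.030, 0.0250),   # common in the topic, but common everywhere
    "orbit": (0.020, 0.0010),   # slightly rarer in the topic, but concentrated there
}
ranked = sorted(toy_terms, key=lambda w: relevance(*toy_terms[w]), reverse=True)
print(ranked)
```

With λ = 1 the ranking reduces to plain in-topic probability, so "line" would come first. Lowering λ favors terms that are distinctive to the topic.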


In [9]:
#Step 9

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis



**Step 9 Answer:**

