## Assignment 8
by Charlie Mei cm3947

Write a Python program based on the Week 8 class exercises (with Gensim, Scikit-Learn, or Spark MLLib), implementing LDA training and topic modeling on your dataset of deduplicated Webhose feeds.

- You may use LDA from Scikit-Learn, Gensim or Spark packages
- Modify the values of min_df and max_df, max_features and max_iter (sklearn) to achieve best results
- Your final submission should include:
    - Jupyter Notebook or Python (.py) with your implementation
    - Output showing set of n topic clusters with up to 10 keywords per cluster

In [1]:
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk, re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import time

#### Importing my deduplicated dataset

In [2]:
DATA_FILE = 'unique_data.json'

def parse_json_file(json_file):
    with open(json_file) as f:
        json_parsed = f.readlines()
    json_data = [json.loads(row) for row in json_parsed]
    return json_data

mydata = parse_json_file(DATA_FILE)[0]
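
The loader above parses each line as a separate JSON document and keeps only the first one, which suggests the whole dataset sits on the first line as a list of story objects. A minimal sketch of that layout follows; the two sample records and the temporary file are made up for illustration, and only the `text` key comes from this notebook.

```python
import json
import os
import tempfile

# Hypothetical sample records; only the 'text' key is taken from the notebook
sample = [{"text": "first story"}, {"text": "second story"}]

# Write the whole list as one JSON line, the layout the loader appears to expect
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(json.dumps(sample) + "\n")
    path = f.name

def parse_json_file(json_file):
    # One JSON document per non-empty line
    with open(json_file) as f:
        return [json.loads(row) for row in f if row.strip()]

loaded = parse_json_file(path)[0]      # first line holds the list of stories
texts = [story["text"] for story in loaded]
os.remove(path)
print(texts)
```

If the real file instead held one story per line, the `[0]` index would keep only a single story, so it is worth checking `len(mydata)` after loading.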

In [3]:
stories = [story['text'] for story in mydata]

In [4]:
# View some of the text
for story in stories[:2]:
    print(story)
    print()

iOS 14 Will Support All iPhone Models That Run iOS 13, Including iPhone 6s Series: Report iOS 14 Will Support All iPhone Models That Run iOS 13, Including iPhone 6s Series: Report Apple will reportedly provide iOS 14 as the last major update on the iPhone 6s series and the iPhone SE. By Jagmeet Singh | Updated: 3 June 2020 12:28 IST Apple is hosting its WWDC on June 22 where we expect the formal launch of iOS 14 Highlights iOS 14 compatible devices list could include iPhone 6s and iPhone 6s Plus iPhone 7 series will reportedly receive updates until iOS 16 Apple is likely to release a beta version of iOS 14 following its launch
Apple will reportedly support iOS 14 for all iPhone models that can run iOS 13. This suggests that the Cupertino giant wouldn't make any changes in the list of compatible devices for the next iOS version from what it offered last year. The new development is, of course, good news for iPhone 6s and iPhone 6s Plus users as they both were launched with iOS 9 in Sept

#### Building a Word Tokenizer

In [5]:
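# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def tokenize_stories_lemma(story):
    filtered_tokens = []

    for token in nltk.word_tokenize(story):
        # Expand the contraction pieces produced by the tokenizer
        token = token.replace("'s", " ").replace("n't", " not").replace("'ve", " have")
        # Drop punctuation, then lowercase and strip before the stopword check
        token = re.sub(r'[^a-zA-Z0-9 ]', '', token).lower().strip()
        # Skip empty strings left over from punctuation-only tokens
        if token and token not in STOP_WORDS:
            filtered_tokens.append(token)

    # Lemmatize every remaining token as a verb ('v')
    return [LEMMATIZER.lemmatize(t, 'v') for t in filtered_tokens]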
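
To see what the per-token cleanup does in isolation, here is a small sketch without NLTK. The helper name `clean_token` is hypothetical, and the sample tokens are the kind of pieces a word tokenizer typically produces for contractions and punctuation (an assumption, not output taken from the dataset). It mirrors the replace and regex steps of the tokenizer and then lowercases and strips the result, so that whitespace-only leftovers can be discarded.

```python
import re

def clean_token(token):
    # Same replacements and character filter as the tokenizer above
    token = token.replace("'s", " ").replace("n't", " not").replace("'ve", " have")
    token = re.sub(r'[^a-zA-Z0-9 ]', '', token)
    # Lowercase and strip so stopword checks and empty-token checks behave
    return token.lower().strip()

# Hypothetical tokens for illustration
for raw in ["iPhone", "n't", "'s", "6s", "--"]:
    print(repr(raw), '->', repr(clean_token(raw)))
```

Without the final `strip()`, a piece such as `n't` becomes `' not'` with a leading space, which no longer matches the stopword `not`.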

#### Building an LDA creator

The LDA will run on unigrams only.

In [6]:
def clstr_lda(num_topics, stories, min_df, max_df, max_iter, max_features=10, n_top_words=10):
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    tic = time.perf_counter()

    print('Tokenizing...')
    # Create word-count vectors from stories (unigrams only)
    tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, tokenizer=tokenize_stories_lemma, ngram_range=(1, 1))
    tf = tf_vectorizer.fit_transform(stories)

    toc = time.perf_counter()
    print('Tokenization finished in {} mins'.format((toc-tic)/60))

    # Instantiate an LDA model
    # 'batch' uses the full dataset in every update; learning_offset only affects the 'online' method
    lda = LatentDirichletAllocation(n_components=num_topics, max_iter=max_iter, learning_method='batch', learning_offset=10, random_state=1)

    print('Applying LDA model...')
    tic = time.perf_counter()

    lda.fit(tf)
    # get_feature_names() was removed in scikit-learn 1.2
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    topics = dict()
    for topic_idx, topic in enumerate(lda.components_):
        # Indices of the n_top_words largest weights, in descending order
        topics[topic_idx] = [tf_feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        print("Topic " + str(topic_idx))
        print(" | ".join(topics[topic_idx]))

    toc = time.perf_counter()
    print('Time taken to run LDA: {} mins'.format((toc-tic)/60))

    return topics
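
The slice `topic.argsort()[:-n_top_words-1:-1]` used to pick the keywords is compact but easy to misread. The sketch below, on made-up weights and feature names, shows that it returns the indices of the largest weights in descending order, the same as reversing the sorted indices and taking the first `n_top_words`.

```python
import numpy as np

# Hypothetical feature names and topic-word weights for illustration
feature_names = np.array(['stock', 'game', 'hbo', 'music'])
topic = np.array([0.1, 2.0, 0.5, 1.5])
n_top_words = 2

# Form used in clstr_lda: walk the ascending argsort backwards from the end
top_idx = topic.argsort()[:-n_top_words-1:-1]

# Equivalent, more explicit form: reverse, then take the first n_top_words
same_idx = np.argsort(topic)[::-1][:n_top_words]

print(feature_names[top_idx].tolist())   # ['game', 'music']
```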

#### Training an LDA model

In [19]:
clstr_lda(5, stories, min_df=100, max_df=500, max_iter=10, max_features=10, n_top_words=3)

Tokenizing...
Tokenization finished in 38.622762644999966 mins
Applying LDA model...
Topic 0
5g | keyboard | table
Topic 1
hbo | aapl | remote
Topic 2
trump | remote | table
Topic 3
table | ratio | aapl
Topic 4
laptop | keyboard | ratio
Time taken to run LDA: 0.183505718333375 mins


{0: ['5g', 'keyboard', 'table'],
 1: ['hbo', 'aapl', 'remote'],
 2: ['trump', 'remote', 'table'],
 3: ['table', 'ratio', 'aapl'],
 4: ['laptop', 'keyboard', 'ratio']}

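There seem to be too many topics: words such as ```keyboard``` and ```table``` show up in multiple topics. Note also that ```max_features=10``` restricts the vocabulary to 10 terms, so 5 topics with 3 keywords each (15 slots) are forced to share words. The keywords themselves could perhaps be more specific as well.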
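
The vectorizer parameters decide which words can appear in a topic at all. The sketch below uses a made-up four-document corpus (not the Webhose data) and a hypothetical helper `vocabulary` to show how integer `min_df` and `max_df` act as absolute document counts, and how `max_features` keeps only the most frequent terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; 'apple' appears in all four documents
docs = [
    "apple iphone battery camera",
    "apple iphone stock quarter",
    "apple stock quarter earnings",
    "apple hbo film netflix",
]

def vocabulary(**kwargs):
    # Fit a vectorizer with the given settings and return the surviving terms
    vec = CountVectorizer(**kwargs)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

# Keep terms that occur in at least 2 documents
at_least_two = vocabulary(min_df=2)
# Additionally drop terms that occur in more than 3 documents
two_to_three = vocabulary(min_df=2, max_df=3)
# Keep only the single most frequent term across the corpus
top_one = vocabulary(max_features=1)

print(at_least_two)
print(two_to_three)
print(top_one)
```

Float values for `min_df` and `max_df` are read as proportions of documents instead, so `max_df=500` and `max_df=0.5` mean very different things.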

In [7]:
clstr_lda(3, stories, min_df=300, max_df=1000, max_iter=10, n_top_words=3)

Tokenizing...
Tokenization finished in 38.33329671666667 mins
Applying LDA model...
Topic 0
hbo | airpods | max
Topic 1
stock | quarter | rat
Topic 2
trace | game | music
Time taken to run LDA: 0.2743710083333326 mins


{0: ['hbo', 'airpods', 'max'],
 1: ['stock', 'quarter', 'rat'],
 2: ['trace', 'game', 'music']}

The topics are still not distinct; perhaps increasing ```max_iter``` will help.

In [8]:
clstr_lda(3, stories, min_df=300, max_df=1000, max_iter=100, n_top_words=3)

Tokenizing...
Tokenization finished in 24.76275367333333 mins
Applying LDA model...
Topic 0
hbo | airpods | max
Topic 1
stock | quarter | rat
Topic 2
trace | game | music
Time taken to run LDA: 1.025736458333328 mins


{0: ['hbo', 'airpods', 'max'],
 1: ['stock', 'quarter', 'rat'],
 2: ['trace', 'game', 'music']}

Changing ```max_iter``` from 10 to 100 made no difference to the top keywords. How about making the keywords more specific?
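
Identical keywords at 10 and 100 iterations are consistent with the model having settled early, but the keyword lists alone cannot confirm that. One way to check, sketched here on random toy counts rather than the real data, is to set `evaluate_every=1` so that scikit-learn evaluates perplexity after each iteration and stops once the change drops below `perp_tol`, then inspect `n_iter_`. The matrix `X` and the settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical document-term counts: 40 documents, 12 terms
X = rng.poisson(1.0, size=(40, 12))

results = {}
for max_iter in (10, 100):
    lda = LatentDirichletAllocation(
        n_components=3,
        max_iter=max_iter,
        learning_method='batch',
        evaluate_every=1,   # check perplexity after every iteration
        random_state=1,
    )
    lda.fit(X)
    # n_iter_ is the number of passes actually run before stopping
    results[max_iter] = (lda.n_iter_, lda.perplexity(X))
    print(max_iter, results[max_iter])
```

If `n_iter_` comes out well below `max_iter` in both runs, raising `max_iter` further is unlikely to change the topics.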

In [9]:
clstr_lda(3, stories, min_df=500, max_df=1000, max_iter=10, n_top_words=3)

Tokenizing...
Tokenization finished in 21.761033093333328 mins
Applying LDA model...
Topic 0
airpods | max | battery
Topic 1
stock | quarter | rat
Topic 2
trace | game | battery
Time taken to run LDA: 0.17530960000000656 mins


{0: ['airpods', 'max', 'battery'],
 1: ['stock', 'quarter', 'rat'],
 2: ['trace', 'game', 'battery']}

In [10]:
clstr_lda(3, stories, min_df=500, max_df=1000, max_iter=10, max_features=100, n_top_words=3)

Tokenizing...
Tokenization finished in 22.070170145000006 mins
Applying LDA model...
Topic 0
airpods | music | game
Topic 1
max | se | galaxy
Topic 2
stock | quarter | rat
Time taken to run LDA: 0.4987031899999996 mins


{0: ['airpods', 'music', 'game'],
 1: ['max', 'se', 'galaxy'],
 2: ['stock', 'quarter', 'rat']}

The keywords seem better but are still not fully informative about the topics.

In [11]:
clstr_lda(3, stories, min_df=500, max_df=1000, max_iter=10, max_features=500, n_top_words=3)

Tokenizing...
Tokenization finished in 19.706789646666646 mins
Applying LDA model...
Topic 0
game | max | music
Topic 1
stock | quarter | rat
Topic 2
trace | global | china
Time taken to run LDA: 0.8385474150000239 mins


{0: ['game', 'max', 'music'],
 1: ['stock', 'quarter', 'rat'],
 2: ['trace', 'global', 'china']}

In [14]:
clstr_lda(3, stories, min_df=200, max_df=1000, max_iter=10, max_features=100, n_top_words=3)

Tokenizing...
Tokenization finished in 20.28756733333333 mins
Applying LDA model...
Topic 0
stock | quarter | rat
Topic 1
game | macbook | se
Topic 2
hbo | trace | max
Time taken to run LDA: 0.42144841833329943 mins


{0: ['stock', 'quarter', 'rat'],
 1: ['game', 'macbook', 'se'],
 2: ['hbo', 'trace', 'max']}

The model seems better at assigning keywords to topics. Three keywords appear to be too many.

In [15]:
clstr_lda(3, stories, min_df=200, max_df=1000, max_iter=10, max_features=100, n_top_words=2)

Tokenizing...
Tokenization finished in 20.310402481666642 mins
Applying LDA model...
Topic 0
stock | quarter
Topic 1
game | macbook
Topic 2
hbo | trace
Time taken to run LDA: 0.47555513166668484 mins


{0: ['stock', 'quarter'], 1: ['game', 'macbook'], 2: ['hbo', 'trace']}

The tokenizer appears appropriate, but the number of topics and the maximum number of iterations still need some optimizing.

#### Optimizing for number of topics and max. iterations

In [16]:
tf_vectorizer = CountVectorizer(max_df=1000, min_df=200, max_features=100, tokenizer=tokenize_stories_lemma, ngram_range=(1, 1))
tf = tf_vectorizer.fit_transform(stories)

In [26]:
def test_lda_model(tf, tf_vectorizer, num_topics, max_iter, n_top_words):
    print('Applying LDA model...')
    lda = LatentDirichletAllocation(n_components=num_topics, max_iter=max_iter, learning_method='batch', learning_offset=10, random_state=1)
    lda.fit(tf)
    tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

    topics = dict()
    for topic_idx, topic in enumerate(lda.components_):
        topics[topic_idx] = [tf_feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]

    return topics


Using the unigram tokenizer, I will see which combination of number of topics and number of keywords offers the best results.

In [27]:
num_topics = [3, 5, 7]
n_top_words = [3, 5, 7]

for topic in num_topics:
    for keywords in n_top_words:
        topics = test_lda_model(tf=tf, tf_vectorizer=tf_vectorizer, num_topics=topic, max_iter=10, n_top_words=keywords)
        print('Topics for ' + str(topic) + ' topics and ' + str(keywords) + ' keywords.')
        print(topics)
        print('\n')


Applying LDA model...
Topics for 3 topics and 3 keywords.
{0: ['stock', 'quarter', 'rat'], 1: ['game', 'macbook', 'se'], 2: ['hbo', 'trace', 'max']}


Applying LDA model...
Topics for 3 topics and 5 keywords.
{0: ['stock', 'quarter', 'rat', 'airpods', 'inc'], 1: ['game', 'macbook', 'se', 'galaxy', 'battery'], 2: ['hbo', 'trace', 'max', 'music', 'film']}


Applying LDA model...
Topics for 3 topics and 7 keywords.
{0: ['stock', 'quarter', 'rat', 'airpods', 'inc', 'target', 'value'], 1: ['game', 'macbook', 'se', 'galaxy', 'battery', 'smart', 'camera'], 2: ['hbo', 'trace', 'max', 'music', 'film', 'exposure', 'podcast']}


Applying LDA model...
Topics for 5 topics and 3 keywords.
{0: ['trace', 'airpods', 'china'], 1: ['macbook', 'table', 'microsoft'], 2: ['hbo', 'music', 'max'], 3: ['stock', 'quarter', 'rat'], 4: ['game', 'se', 'galaxy']}


Applying LDA model...
Topics for 5 topics and 5 keywords.
{0: ['trace', 'airpods', 'china', 'podcast', 'government'], 1: ['macbook', 'table', 'microsoft

Based on these results, the best LDA model uses 7 topics with 7 keywords each.
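
In the grid output above, the 3-keyword lists are prefixes of the 5- and 7-keyword lists for the same number of topics. That is expected: `n_top_words` only controls how many keywords are displayed and does not affect the fitted model. A possible shortcut, sketched here on random toy counts with made-up feature names and a hypothetical helper `top_words`, is to fit once per number of topics and slice the keyword lists afterwards.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical document-term counts and feature names for illustration
X = rng.poisson(1.0, size=(40, 12))
feature_names = np.array(['w{}'.format(i) for i in range(12)])

def top_words(lda, feature_names, n_top_words):
    # Keywords per topic, highest weight first
    return {
        k: feature_names[np.argsort(topic)[::-1][:n_top_words]].tolist()
        for k, topic in enumerate(lda.components_)
    }

keywords = {}
for num_topics in (3, 5):
    # Fit once per topic count
    lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10,
                                    learning_method='batch', random_state=1).fit(X)
    for n in (3, 5):
        # Reuse the same fitted model for every keyword count
        keywords[(num_topics, n)] = top_words(lda, feature_names, n)
        print(num_topics, n, keywords[(num_topics, n)])
```

This cuts the number of model fits in the grid from nine to three without changing any of the printed keywords.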

#### Applying LDA model on 10 random articles

In [35]:
import random
random_sample = random.sample(range(len(stories)), 10)  # range(1, ...) would never pick story 0

random_10_stories = [stories[i] for i in random_sample]

In [77]:
print('Fitting LDA model...')
lda = LatentDirichletAllocation(n_components=7, max_iter=10, learning_method='batch', learning_offset=10, random_state=1)
lda_model = lda.fit(tf)

Fitting LDA model...


In [87]:
topics = test_lda_model(tf, tf_vectorizer, 7, 10, 7)
print(topics)

Applying LDA model...
{0: ['trace', 'government', 'reopen', 'china', 'exposure', 'virus', 'trump'], 1: ['macbook', 'keyboard', 'tap', 'mac', 'smart', 'laptop', 'message'], 2: ['hbo', 'max', 'film', '135', 'netflix', 'id', 'plus'], 3: ['stock', 'quarter', 'rat', 'inc', 'value', 'own', 'average'], 4: ['game', 'galaxy', 'camera', '5g', 'card', 'tablet', 'brand'], 5: ['table', 'podcast', 'global', 'remote', 'microsoft', 'production', 'employees'], 6: ['airpods', 'music', 'se', 'voice', 'sound', 'audio', 'hear']}


In [90]:
# The model is already fitted, so transform() is enough; fit_transform() would refit it
lda_results = lda.transform(tf)
sample_stories_results = lda_results[random_sample]

In [89]:
import pandas as pd
pd.DataFrame(sample_stories_results, index=random_sample)

| story | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 8710 | 0.003868 | 0.003865 | 0.003867 | 0.003893 | 0.862413 | 0.118218 | 0.003875 |
| 7772 | 0.424807 | 0.362566 | 0.015897 | 0.015925 | 0.148893 | 0.015903 | 0.01601 |
| 7275 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 |
| 7759 | 0.071434 | 0.071429 | 0.071479 | 0.071429 | 0.071437 | 0.571348 | 0.071445 |
| 6351 | 0.002133 | 0.41427 | 0.035311 | 0.002138 | 0.109 | 0.002138 | 0.435009 |
| 26 | 0.001318 | 0.001312 | 0.992106 | 0.001319 | 0.001314 | 0.001314 | 0.001317 |
| 8933 | 0.938714 | 0.010225 | 0.010207 | 0.010206 | 0.010217 | 0.010205 | 0.010226 |
| 2193 | 0.00794 | 0.007967 | 0.008008 | 0.063889 | 0.549055 | 0.007943 | 0.355197 |
| 4715 | 0.010213 | 0.010269 | 0.753245 | 0.010215 | 0.01021 | 0.19564 | 0.010208 |
| 8468 | 0.028719 | 0.230128 | 0.028575 | 0.626747 | 0.028584 | 0.02866 | 0.028587 |
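
Each row above is a probability distribution over the 7 topics. The row for story 7275 is exactly 1/7 in every column, which is what scikit-learn returns for a document whose count vector is all zeros, i.e. one with no tokens in the 100-term vocabulary. That is a likely explanation for this story, though I have not checked the story itself. The sketch below reproduces the effect on random toy counts (the matrix `X` is an assumption for illustration).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical document-term counts: 30 documents, 8 terms
X = rng.poisson(1.0, size=(30, 8))

lda = LatentDirichletAllocation(n_components=7, max_iter=10,
                                learning_method='batch', random_state=1).fit(X)

# Document-topic distributions for the training documents
doc_topic = lda.transform(X)
# A document with no in-vocabulary words: all counts are zero
empty_doc = lda.transform(np.zeros((1, 8)))

print(doc_topic.sum(axis=1)[:3])   # each row sums to 1
print(empty_doc.round(6))          # every topic gets the same share
```

Such stories carry no information for the model, so it may be reasonable to filter them out before interpreting topic assignments.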


The LDA model predicts that:

- story 8710 belongs to topic 4
- story 7772 most likely belongs to topic 0 (0.42), with topic 1 (0.36) close behind
- story 7275 could belong to any of the topics (all seven probabilities are equal)
- story 7759 belongs to topic 5
- story 6351 is split between topic 6 (0.44) and topic 1 (0.41)
- story 26 belongs to topic 2
- story 8933 belongs to topic 0
- story 2193 belongs to topic 4
- story 4715 belongs to topic 2
- story 8468 belongs to topic 3.
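
Reading the dominant topic off the table by eye is error-prone, so here is a sketch that derives it with `argmax` from the probabilities shown above. The margin threshold of 0.1 used to flag ambiguous stories is an arbitrary assumption, not part of the assignment.

```python
import numpy as np

story_ids = [8710, 7772, 7275, 7759, 6351, 26, 8933, 2193, 4715, 8468]
# Document-topic probabilities copied from the table above
doc_topic = np.array([
    [0.003868, 0.003865, 0.003867, 0.003893, 0.862413, 0.118218, 0.003875],
    [0.424807, 0.362566, 0.015897, 0.015925, 0.148893, 0.015903, 0.01601],
    [0.142857, 0.142857, 0.142857, 0.142857, 0.142857, 0.142857, 0.142857],
    [0.071434, 0.071429, 0.071479, 0.071429, 0.071437, 0.571348, 0.071445],
    [0.002133, 0.41427, 0.035311, 0.002138, 0.109, 0.002138, 0.435009],
    [0.001318, 0.001312, 0.992106, 0.001319, 0.001314, 0.001314, 0.001317],
    [0.938714, 0.010225, 0.010207, 0.010206, 0.010217, 0.010205, 0.010226],
    [0.00794, 0.007967, 0.008008, 0.063889, 0.549055, 0.007943, 0.355197],
    [0.010213, 0.010269, 0.753245, 0.010215, 0.01021, 0.19564, 0.010208],
    [0.028719, 0.230128, 0.028575, 0.626747, 0.028584, 0.02866, 0.028587],
])

# Most probable topic per story
dominant = doc_topic.argmax(axis=1)

# Gap between the best and second-best topic per story
ordered = np.sort(doc_topic, axis=1)
margin = ordered[:, -1] - ordered[:, -2]

MARGIN_THRESHOLD = 0.1  # hypothetical cut-off for "ambiguous"
ambiguous = [sid for sid, m in zip(story_ids, margin) if m < MARGIN_THRESHOLD]

for sid, topic, m in zip(story_ids, dominant, margin):
    print(sid, 'topic', topic, 'margin', round(float(m), 3))
print('ambiguous:', ambiguous)
```

For the uniform row (story 7275) `argmax` simply returns the first index, which is why the margin check is needed alongside it.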