## Assignment 8
by Charlie Mei cm3947

Write a Python program based on the Week 8 class exercises (with Gensim, Scikit-Learn, or Spark MLlib), implementing LDA training and topic modeling on your dataset of deduplicated Webhose feeds.

- You may use LDA from Scikit-Learn, Gensim or Spark packages
- Modify the values of min_df and max_df, max_features and max_iter (sklearn) to achieve best results
- Your final submission should include:
    - Jupyter Notebook or Python (.py) with your implementation
    - Output showing set of n topic clusters with up to 10 keywords per cluster
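
To make the effect of the vectorizer parameters concrete, here is a small pure-Python sketch of document-frequency pruning. It only approximates what `CountVectorizer` does with integer `min_df` and `max_df` (absolute document counts) and `max_features` (keep the most frequent remaining terms). The helper `select_vocabulary` and the toy documents are hypothetical, and the alphabetical tie-breaking is my own assumption rather than scikit-learn's behaviour.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, min_df=1, max_df=None, max_features=None):
    """Hypothetical helper: prune a vocabulary by document frequency."""
    n_docs = len(tokenized_docs)
    max_df = n_docs if max_df is None else max_df
    # Document frequency: number of documents that contain the term at least once
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Corpus-wide term counts, used to rank terms when max_features is set
    term_freq = Counter(term for doc in tokenized_docs for term in doc)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df]
    # Most frequent terms first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Toy corpus, already tokenized
docs = [
    ["apple", "iphone", "update"],
    ["apple", "laptop", "keyboard"],
    ["apple", "iphone", "5g"],
    ["apple", "trump", "5g"],
]

# "apple" is in all 4 documents (above max_df=3); terms seen once fall below min_df=2
vocab = select_vocabulary(docs, min_df=2, max_df=3)
print(vocab)
```

A very common word is dropped by `max_df`, rare words are dropped by `min_df`, and `max_features` then caps whatever is left, so a small cap can dominate the other two settings.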

In [1]:
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk, re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import time

#### Importing my deduplicated dataset

In [2]:
DATA_FILE = 'unique_data.json'

def parse_json_file(json_file):
    with open(json_file) as f:
        json_parsed = f.readlines()
    json_data = [json.loads(row) for row in json_parsed]
    return json_data

mydata = parse_json_file(DATA_FILE)[0]

In [3]:
stories = [story['text'] for story in mydata]

In [4]:
# View some of the text
for story in stories[:2]:
    print(story)
    print()

iOS 14 Will Support All iPhone Models That Run iOS 13, Including iPhone 6s Series: Report iOS 14 Will Support All iPhone Models That Run iOS 13, Including iPhone 6s Series: Report Apple will reportedly provide iOS 14 as the last major update on the iPhone 6s series and the iPhone SE. By Jagmeet Singh | Updated: 3 June 2020 12:28 IST Apple is hosting its WWDC on June 22 where we expect the formal launch of iOS 14 Highlights iOS 14 compatible devices list could include iPhone 6s and iPhone 6s Plus iPhone 7 series will reportedly receive updates until iOS 16 Apple is likely to release a beta version of iOS 14 following its launch
Apple will reportedly support iOS 14 for all iPhone models that can run iOS 13. This suggests that the Cupertino giant wouldn't make any changes in the list of compatible devices for the next iOS version from what it offered last year. The new development is, of course, good news for iPhone 6s and iPhone 6s Plus users as they both were launched with iOS 9 in Sept

#### Building a Word Tokenizer

In [5]:
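# Build the stop-word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def tokenize_stories_lemma(story):
    tokens = nltk.word_tokenize(story)
    lmtzr = WordNetLemmatizer()
    filtered_tokens = []

    for token in tokens:
        # Expand common contractions, then strip non-alphanumeric characters
        token = token.replace("'s", " ").replace("n't", " not").replace("'ve", " have")
        token = re.sub(r'[^a-zA-Z0-9 ]', '', token).lower().strip()
        # Lowercase before the stop-word check, since the NLTK list is lowercase;
        # also skip tokens that became empty (e.g. pure punctuation)
        if token and token not in STOP_WORDS:
            filtered_tokens.append(token)

    # Lemmatize every token as a verb ('v')
    lemmas = [lmtzr.lemmatize(t, 'v') for t in filtered_tokens]
    return lemmas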
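
One detail worth checking in a tokenizer like this is whether tokens are lowercased before or after the stop-word test. The sketch below compares the two orders on a toy sentence. `STOP_WORDS` here is a tiny hypothetical stand-in for NLTK's English list (which is all lowercase), and `filter_tokens` is an illustrative helper, not part of the notebook.

```python
# Hypothetical stand-in for NLTK's lowercase English stop-word list
STOP_WORDS = frozenset({"the", "is", "a", "of"})

def filter_tokens(tokens, lowercase_first):
    """Drop stop words, testing either the lowercased or the raw token."""
    kept = []
    for token in tokens:
        candidate = token.lower() if lowercase_first else token
        if candidate and candidate not in STOP_WORDS:
            kept.append(token.lower())
    return kept

tokens = ["The", "iPhone", "is", "the", "update"]

# Testing the raw token lets the capitalized "The" slip through
raw_check = filter_tokens(tokens, lowercase_first=False)
# Testing the lowercased token removes it
lowered_check = filter_tokens(tokens, lowercase_first=True)

print(raw_check)
print(lowered_check)
```

Storing the stop words in a set (or frozenset) also keeps each membership test cheap, which matters when the tokenizer runs over every token in the corpus.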

#### Building an LDA creator

The LDA will run on unigrams only.

In [17]:
def clstr_lda(num_topics, stories, min_df, max_df, max_iter, max_features=10, n_top_words=10):
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it for timing
    tic = time.perf_counter()

    print('Tokenizing...')
    # Create word-count vectors from stories (unigrams only)
    tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, tokenizer=tokenize_stories_lemma, ngram_range=(1, 1))
    tf = tf_vectorizer.fit_transform(stories)

    toc = time.perf_counter()
    print('Tokenization finished in {} mins'.format((toc-tic)/60))

    # Instantiate an LDA model. 'batch' uses the full corpus in every update;
    # learning_offset only affects learning_method='online', so it has no effect here.
    lda = LatentDirichletAllocation(n_components=num_topics, max_iter=max_iter, learning_method='batch', learning_offset=10, random_state=1)

    print('Applying LDA model...')
    tic = time.perf_counter()

    lda.fit(tf)
    # get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    topics = dict()
    for topic_idx, topic in enumerate(lda.components_):
        # Indices of the n_top_words largest weights, highest first
        top_words = [tf_feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        topics[topic_idx] = top_words
        print("Topic " + str(topic_idx))
        print(" | ".join(top_words))

    toc = time.perf_counter()
    print('Time taken to run LDA: {} mins'.format((toc-tic)/60))

    return topics
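
The slice `argsort()[:-n_top_words-1:-1]` in the function above is compact but easy to misread. Here is a pure-Python sketch of the same idea: sort the indices by ascending weight, then walk backwards from the end for `n` steps. The helper `top_n_indices`, the weights, and the feature names are made up for illustration.

```python
def top_n_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    # Ascending order of indices, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Step backwards from the last element and stop after n items
    return order[:-n - 1:-1]

# Made-up topic-word weights and matching feature names
weights = [0.1, 2.5, 0.7, 1.9]
names = ["5g", "keyboard", "table", "laptop"]

top = top_n_indices(weights, 2)
top_words = [names[i] for i in top]
print(top, top_words)
```

Because the slice runs from the end of the ascending order, the first word printed for each topic is the one with the highest weight.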

#### Training an LDA model

In [19]:
clstr_lda(5, stories, min_df=100, max_df=500, max_iter=10, max_features=10, n_top_words=3)

Tokenizing...
Tokenization finished in 38.622762644999966 mins
Applying LDA model...
Topic 0
5g | keyboard | table
Topic 1
hbo | aapl | remote
Topic 2
trump | remote | table
Topic 3
table | ratio | aapl
Topic 4
laptop | keyboard | ratio
Time taken to run LDA: 0.183505718333375 mins


{0: ['5g', 'keyboard', 'table'],
 1: ['hbo', 'aapl', 'remote'],
 2: ['trump', 'remote', 'table'],
 3: ['table', 'ratio', 'aapl'],
 4: ['laptop', 'keyboard', 'ratio']}

There seem to be too many topics, as words such as ```keyboard``` and ```table``` show up in multiple topics. The keywords themselves could perhaps be more specific as well.
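
To quantify the overlap, the sketch below counts how many topics each keyword appears in, using the topic dictionary returned by the run above. It also compares the number of keyword slots with the vocabulary size: with `max_features=10` there are at most 10 distinct terms, while 5 topics with 3 keywords each need 15 slots, so some repetition is unavoidable in this configuration regardless of how well the topics separate. This is a rough diagnostic of my own, not a standard coherence metric.

```python
from collections import Counter

# Topic keywords copied from the output of the run above
topics = {
    0: ["5g", "keyboard", "table"],
    1: ["hbo", "aapl", "remote"],
    2: ["trump", "remote", "table"],
    3: ["table", "ratio", "aapl"],
    4: ["laptop", "keyboard", "ratio"],
}

# Number of topics each keyword appears in
counts = Counter(word for words in topics.values() for word in words)
# Keywords that are shared by more than one topic
shared = {word: n for word, n in counts.items() if n > 1}

slots = sum(len(words) for words in topics.values())
distinct = len(counts)

print("keyword slots:", slots, "distinct keywords:", distinct)
print("shared keywords:", shared)
```

A natural next experiment would be to raise `max_features` well above `num_topics * n_top_words` before lowering the number of topics, so the two effects can be told apart.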