## Assignment 6
by Charlie Mei cm3947


Review Word2Vec project page and download GoogleNews-vectors-negative300.bin.gz (Links to an external site.) pre-trained model to your computer.

Write a Python program based on the provided Class Exercise, which:

- Loads the downloaded pre-trained Google Word2Vec model from your computer
- Loads your previously obtained dataset of Webhose news articles
- For any one selected article title from the dataset, finds 100 most similar titles based on Word2Vec similarity, prints those titles in a descending order of similarity scores.

In [1]:
import nlp_toolkit as ntk
import json
import gensim, operator
from gensim.models import KeyedVectors

In [2]:
# Load pretrained Google Word2Vec Model
w2v_model = ntk.load_wordvec_model('Word2Vec', 'GoogleNews-vectors-negative300.bin.gz', True, 'C:\\Github\\nlp-analytics\\')

Loading Word2Vec model...
Finished loading Word2Vec model...


In [4]:
# Load webhose dataset
feeds = ntk.parse_json_file('webhose_netflix.json')
for feed in feeds[:2]:
    print(feed['title'])

13 Reasons Why: The popular Netflix show's creator teases chance of a hopeful ending
Judge gives control of 'Tiger King' Joe Exotic's zoo to Carole Baskin


In [13]:
# Extract just the titles
titles = [feed['title'] for feed in feeds]
titles[:3]

["13 Reasons Why: The popular Netflix show's creator teases chance of a hopeful ending",
 "Judge gives control of 'Tiger King' Joe Exotic's zoo to Carole Baskin",
 "A TV reboot of Bong Joon-ho's acclaimed film Snowpiercer has landed on Netflix — what's the deal?"]

In [58]:
# Create a function that calculates similar articles
def identify_similar_articles(title, title_list, vsm, topn=100):
    print('Finding similar articles to: \n' + title + '\n')
    # Remove the test title from title-list
    title_list.remove(title)
    vec_sims = []
    # Calculate similarities and skip over errors
    for feed in title_list:
        try:
            vec_sims.append(ntk.calc_similarity(title, feed, vsm))
        except:
            title_list.remove(feed)
            continue
    
    # Create a dictionary representation of feed title and similarity score
    similarity_dict = dict(zip(title_list, vec_sims))
    # Sort the dictionary in descending order
    sorted_sims = {k: v for k, v in sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True)}
    # Return top n keys i.e. titles
    titles = [title for title in sorted_sims.keys()]
    # and their similarities
    sims = [sim for sim in sorted_sims.values()]

    top_sims = dict(zip(titles[1:topn+1], sims[1:topn+1]))

    print('Here are the top ' + str(topn) + ' most similar articles... \n')
    for title in top_sims.keys():
        print('\t' + title + ": " + str(top_sims[title]) + '\n')

    # Return top n similarities in dictionary
    return top_sims

In [61]:
top_sims = identify_similar_articles(titles[5], titles, w2v_model)

Finding similar articles to: 
New: The 'Netflix for Jews'

Here are the top 100 most similar articles... 

	Review: The Willoughbys: 0.76631606

	8: The Willoughbys Spoiler!!!: 0.76631606

	The Reader's stay-at-home chronicles: day 70: 0.75200534

	Review: The Half of It: 0.72341526

	Makin’ It Work: The Lark: 0.72036564

	Last Kingdom: How accurate is The Last Kingdom?: 0.6984327

	‘Avatar: The Last Airbender,’ A 15 Year-Old Cartoon, Is Now Netflix’s Most Popular Show: 0.6693507

	The Half of It – Review: 0.666419

	The Last Showing: 0.6621617

	'Avatar: The Last Airbender' on Netflix: Why a 2005 cartoon is so popular: 0.66123927

	Avatar: The Last Airbender is back on Netflix, but don’t start with the first episode: 0.6490926

	Review: The Lovebirds Are Anything But In This Slight Rom-Com: 0.6483616

	The Director's Cut: Jody Wisternoff - Nightwhisper: 0.6461748

	The Last Kingdom: Who plays Finan in The Last Kingdom?: 0.64197326

	Film Review: The Half of It: 0.6416532

	‘Avatar: Th

Write a Pyspark program based on the other provided Class Exercise, which:

- Loads your previously obtained dataset of Webhose news articles into a Spark dataframe
- Cleans up and tokenizes article bodies using the RegexTokenizer and Stopword remover functions provided in the Class Exercise
- Trains a Word2Vec model based on the output column produced in step 2
- Implements any sample search query, as shown in Class Exercise, and produces matching article titles