## Assignment 7
by Charlie Mei cm2947

Using the Week 7 Class Exercise as a reference, write a Python program that filters out exactly and/or semantically duplicate articles from your Webhose dataset of news articles:

- Use LSH (SimHash or MinHash), separately or along with Word2Vec, to deduplicate your Webhose feeds based on titles
- Make sure to store entire feeds in a JSON, text or CSV file

Your final submission should include the Jupyter Notebook with the original set of titles and a deduplicated subset of titles

In [1]:
import json
import gensim, operator
from gensim.models import KeyedVectors
from simhash import Simhash, SimhashIndex
import numpy as np
import logging

In [2]:
DATA_FILE = 'webhose_apple.json'
WORD_2_VEC_DIR = r'C:\Github\nlp-analytics'
W2V_FILE = r'\GoogleNews-vectors-negative300.bin.gz'

#### Setting up Functions

In [3]:
# Read in JSON object as a list of Python dictionaries
def parse_json_file(json_file):
    with open(json_file) as f:
        json_parsed = f.readlines()
    feeds = [json.loads(feed) for feed in json_parsed]
    return feeds


def load_wordvec_model(modelName, modelFile, flagBin, model_path):
    print('Loading ' + modelName + ' model...')
    model = KeyedVectors.load_word2vec_format(model_path + modelFile, binary=flagBin)
    print('Finished loading ' + modelName + ' model...')
    return model

# function checks whether the input words are present in the vocabulary for the model
def vocab_check(vectors, words):
    
    output = list()
    for word in words:
        if word in vectors.vocab:
            output.append(word.strip())
            
    return output

# function calculates similarity between two strings using a particular word vector model
def calc_similarity(input1, input2, vectors):
    s1words = set(vocab_check(vectors, input1.split()))
    s2words = set(vocab_check(vectors, input2.split()))
    
    output = vectors.n_similarity(s1words, s2words)
    return output

def cleanup(input):
    # remove English stopwords
    input = input.replace("'s", " ").replace("n’t", " not").replace("’ve", " have")
    input = re.sub(r'[^a-zA-Z0-9 ]', '', input)
    return input

#### Load Word2Vec Model

In [4]:
w2v_model = load_wordvec_model('Word2Vec', W2V_FILE, True, WORD_2_VEC_DIR)

Loading Word2Vec model...
Finished loading Word2Vec model...


#### Load in Webhose dataset

In [5]:
webhose_data = parse_json_file(DATA_FILE)

We will base deduplication on article titles only.

In [6]:
# Create a list of news feed titles and include an ID flag for each feed in the original data
feeds = []
index = 0
for feed in webhose_data:
    feeds.append(feed['title'])
    # Add index to original feeds dataset
    feed['id'] = index
    index += 1
    

In [7]:
# Establish a simhash logger
logging.getLogger('simhash').setLevel(logging.CRITICAL)



In [8]:
# Create a simhashed object
obj = [(str(feed['id']), Simhash(str(feed['title']))) for feed in webhose_data]

In [9]:
def generate_simhash_index(obj, dist):
    print('Generating Simhash Index based on maximum hemming distance of {}...'.format(dist))
    index = SimhashIndex(obj, k=dist)
    return index

def length_of_simhash_dupes(selection_index, index, feeds):
    selected_feed = feeds[selection_index]
    feed_hash = Simhash(str(selected_feed['title']))
    dupe_indices = index.get_near_dups(feed_hash)
    return dupe_indices

def length_of_w2v_dupes(selection_index, dupe_indices, feeds, threshold_score=0.7):
    count_dupe = 0
    removed_dupes = []
    
    for dupe in dupe_indices:
        try:
            score = calc_similarity(feeds[selection_index]['title'], feeds[int(dupe)]['title'], w2v_model)
        except:
            score = 0
        
        if score > threshold_score:
            count_dupe += 1
            removed_dupes.append(dupe)
    
    return [count_dupe, removed_dupes]

#### Selecting Hemming distance and threshold scores for deduplication

I will test what the optimal Hemming distance and scoring threhold score to use based on the performance on one randomly selected article in the dataset. I will first change the Hemming distance, then adjust the threshold score such that the resulting list of articles are only the duplicates.

In [10]:
test_feed = feeds[100]

# Function to test dupes on test article
def print_dupes(dist, score):
    # Generate a SimHash based on specified Hemming distance
    simhash_index = generate_simhash_index(obj, dist=dist)

    # Generate list of dupes from SimHash
    dupe_indices = length_of_simhash_dupes(100, simhash_index, webhose_data)

    # Generate filtered SimHash list using Word2Vec
    count_dupe, removed_dupes = length_of_w2v_dupes(100, dupe_indices, webhose_data, threshold_score=score)

    # Print out the resulting duplicates
    for dupe in removed_dupes:
        print(webhose_data[int(dupe)]['title'])

In [11]:
print_dupes(10, 0.7)

Generating Simhash Index based on maximum hemming distance of 10...
The Next Apple Pencil Could Come In A Black Finish


Perhaps restrictive - let's increase Hemming distance.

In [12]:
print_dupes(20, 0.7)

Generating Simhash Index based on maximum hemming distance of 20...
Next-Generation Apple Pencil Rumoured to Come in Black Colour
Apple Pencil Could Be Released in Black
The Next Apple Pencil Could Come In A Black Finish


In [13]:
print_dupes(30, 0.7)

Generating Simhash Index based on maximum hemming distance of 30...
This Year’s Apple MacBook To Come With ARM-based Chip
How To Watch 'The Great' In The UK
Slap An Apple Watch Series 3 Onto Your Wrist For Just $179 If You’re Quick
Apple AirPods Might Get An Ambient Light Sensor, Here's What It Would Do
A $50 Apple Watch is Possible… But Not From Apple
Martin Scorsese’s Next Film Is Coming to Apple TV+
A Full-Length ‘Fraggle Rock’ (Clap Clap) Reboot Is Coming to Apple TV+
How To Sign Up For HBO Max So You Can Watch 'Friends' All Day
The Next Apple Pencil Could Come In A Black Finish
Apple Pencil Could Be Released in Black
‘The Office’ Would Have Been Cancelled After Season 1 If It Weren’t For Apple
Apple iOS 13.5 Update To Fix Face ID Unlock Issue When Wearing A Mask & More
Could Apple Be Worth $2 Trillion in Just 4 Years? | The Motley Fool
Next-Generation Apple Pencil Rumoured to Come in Black Colour
With The New iOS 13.5 You Can Unlock Your iPhone While Wearing a Face Mask
If I Could

Here the Hemming distance is too large. How about changing the threshold?

In [14]:
print_dupes(20, 0.5)

Generating Simhash Index based on maximum hemming distance of 20...
iPhone 12 May Not Come Bundled With Free Headphones
Apple To Reopen 100 Stores This Week
Realme Buds Air vs Buds Air Neo: Which one should you go for? - Technology News
New rumor says that the next generation of Apple Pencil will come in black
Next-Generation Apple Pencil Rumoured to Come in Black Colour
Apple Watch Series 3 38mm GPS Aluminum Smart Watch w/ Sport Band $179 at Amazon
iPhone 12 May Not Come Bundled With Free Headphones
Rumor claims future Apple Pencil will come in black
Three States Will Start Using Google and Apple COVID-19 Contact-Tracing Tech
Apple Pencil Could Be Released in Black
Apple to Reopen Two Apple Store Location in Japan This Week
The Next Apple Pencil Could Come In A Black Finish
Apple Reportedly to Reopen 130 of Its 271 U.S. Stores This Week


Threshold now is too low. My optimal parameters are: Hemming distance of 20 and using threhold score of 0.7.

#### Full-scale deduplication (```hemming_distance=20``` and ```threshold_score=0.7```)

In [15]:
hemming_distance = 20
simhash_index = generate_simhash_index(obj, hemming_distance)

Generating Simhash Index based on maximum hemming distance of 20...


In [16]:
# Function to test dupes on test article
def find_dupes(ind, simhash_index, webhose_data):
    dupe_indices = length_of_simhash_dupes(ind, simhash_index, webhose_data)

    # Generate filtered SimHash list using Word2Vec
    count_dupe, removed_dupes = length_of_w2v_dupes(ind, dupe_indices, webhose_data)

    # Save list of dupe indices
    return removed_dupes

In [18]:
feed_length = len(feeds)
unique_data = webhose_data
for i in range(feed_length):
    if i % 1000 == 0:
        print('Cycling through 1000 articles and removing duplicates...')
    
    try:
        removed_dupes = find_dupes(i, simhash_index, unique_data)
        removed_dupes2 = [int(dupe) for dupe in removed_dupes]
        unique_data = [feed for feed in unique_data if feed['id'] not in removed_dupes2]
    except:
        continue

Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...
Cycling through 1000 articles and removing duplicates...


#### Saving the final dataset to a ```JSON``` file

In [20]:
with open('unique_data.json', 'w') as f:
    json.dump(unique_data, f)

print('There are {} unique articles in this dataset.'.format(len(unique_data)))

There are 9298 unique articles in this dataset.
