# Initial Experiments

### 10/3/2018

These experiments are meant only to get a feel for the document similarity algorithms that are available.

## Common Data Functionality

[Data source](https://webhose.io/datasets/) (English news articles)

In [17]:
import os
import json

In [14]:
# TODO: add progress.py

# get the list of article filenames
doc_labels = [f for f in os.listdir("data/news") if f.endswith('.json')]

# load each file and append its parsed JSON to the data list
data = []
for doc in doc_labels:
    with open(os.path.join("data/news", doc)) as file:
        data.append(json.load(file))

In [16]:
# verify we got all the things
len(data)

70000
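
The loading step above could be wrapped in a helper that also reports progress, in the spirit of the `progress.py` TODO. This is only a sketch: `load_articles` and `report_every` are hypothetical names that do not appear in the notebook, and the demo runs against a temporary directory rather than `data/news`.

```python
import json
import os
import tempfile


def load_articles(directory, report_every=10000):
    """Load every .json file in `directory` and return (labels, articles)."""
    # Sort so that labels and articles line up the same way on every run.
    labels = sorted(f for f in os.listdir(directory) if f.endswith(".json"))
    articles = []
    for count, name in enumerate(labels, start=1):
        with open(os.path.join(directory, name), encoding="utf-8") as fh:
            articles.append(json.load(fh))
        # Print a progress line every `report_every` files.
        if count % report_every == 0:
            print("loaded {0} of {1} files".format(count, len(labels)))
    return labels, articles


# Small self-check against a throwaway directory (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        path = os.path.join(tmp, "news_{0}.json".format(i))
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"text": "article {0}".format(i)}, fh)
    # A non-JSON file that the loader should skip.
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("ignored")
    demo_labels, demo_articles = load_articles(tmp)
```

Sorting the file names is a design choice here: `os.listdir` makes no ordering guarantee, and a stable order keeps the index of a document the same between runs.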

Organize the data a bit so it can be accessed more readily. By default, each article in this dataset comes with a lot of extra metadata, some of which might be useful for testing later on.

In [27]:
print(data[0])
print("-----")
print(data[0].keys())

{'organizations': [], 'uuid': '8b92bef7c1f346826b2782a1c4add77b8643df84', 'thread': {'social': {'gplus': {'shares': 0}, 'pinterest': {'shares': 0}, 'vk': {'shares': 0}, 'linkedin': {'shares': 0}, 'facebook': {'likes': 0, 'shares': 0, 'comments': 0}, 'stumbledupon': {'shares': 0}}, 'site_full': 'www.yahoo.com', 'main_image': 'https://s.yimg.com/os/mit/media/m/social/images/social_default_logo-1481777.png', 'site_section': 'http://news.yahoo.com/rss/world', 'section_title': 'World News Headlines - Yahoo News', 'url': 'https://www.yahoo.com/news/taiwanese-airline-transasia-shuts-down-heavy-losses-072321844--finance.html?ref=gs', 'country': 'US', 'domain_rank': 5, 'title': 'Taiwanese airline TransAsia shuts down after heavy losses', 'performance_score': 0, 'site': 'yahoo.com', 'participants_count': 0, 'title_full': 'Taiwanese airline TransAsia shuts down after heavy losses', 'spam_score': 0.0, 'site_type': 'news', 'published': '2016-11-22T02:00:00.000+02:00', 'replies_count': 0, 'uuid': '8

In [38]:
text_data = [article["text"] for article in data]

In [39]:
text_data[0]

"TAIPEI, Taiwan (AP) — Taiwanese airline TransAsia has announced it is shutting down following financial losses and two fatal crashes. TransAsia chairman Vincent Lin said Tuesday the airline, which served cities in China, Japan and Southeast Asia, was unable to reverse widening losses or raise additional money.\nTransAsia was established in 1951 as Taiwan's first privately owned airline.\nThe carrier suffered two fatal crashes in 2014 and early 2015, both in Taiwan, that killed a total of 92 people."
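
If more than the raw text is needed later, one option is to pull a few fields out of each record into a smaller dict. The sketch below assumes the layout visible in the printed record (`uuid` and `text` at the top level, `title` and `site` nested under `thread`). `summarize_article` is a hypothetical helper and the sample record is made up.

```python
def summarize_article(article):
    """Return a small dict with a few fields that may help in later tests."""
    # .get() keeps the helper from failing on records that lack a field.
    thread = article.get("thread", {})
    return {
        "uuid": article.get("uuid"),
        "title": thread.get("title"),
        "site": thread.get("site"),
        "text": article.get("text", ""),
    }


# Made-up record that mimics the structure of the dataset.
sample = {
    "uuid": "abc123",
    "thread": {"title": "Example headline", "site": "example.com"},
    "text": "Example body.",
}
summary = summarize_article(sample)
```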

## Doc2vec

[Medium tutorial](https://medium.com/@mishra.thedeepak/doc2vec-in-a-simple-way-fa80bfe81104)  
[Another, newer medium tutorial](https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5)

In [53]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import nltk
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/dwl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The first step of any experiment is to clean the data!

In [90]:
tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

# clean data via tokenizer and stopword removal
def nlp_clean(data):
    """Return a tokenized, stopword-free token list for every
    string in `data`, keeping word order and repeated words."""

    clean_data = []
    for d in data:
        tokens = tokenizer.tokenize(d.lower())
        # filter with a list comprehension so word order and repeats are preserved
        clean_data.append([t for t in tokens if t not in stopword_set])
    return clean_data

In [91]:
cleaned_text_data = nlp_clean(text_data)
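
To see what the cleaning step does to a single sentence, here is a small stand-alone sketch that uses only the standard library. It assumes `re.findall(r"\w+", ...)` is a close enough stand-in for the `RegexpTokenizer`, and `DEMO_STOPWORDS` is a tiny hand-picked list, far shorter than NLTK's English list.

```python
import re

# Hand-picked stopwords for illustration only (an assumption, not NLTK's list).
DEMO_STOPWORDS = {"the", "a", "is", "it", "and", "of", "has"}


def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, split into word tokens, and drop stopwords, keeping order."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


cleaned = clean_text("The airline has announced it is shutting down.")
print(cleaned)
```

Tokens stay in their original order, which matters for models that look at a window of neighbouring words.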

Next we create an iterator over all the documents, which the gensim model seems to need in order to work. (The "labels" are just the file names, a way to keep track of which document is which.)

In [59]:
class DocIterator(object):
    """A class acting as an iterator for the gensim models."""
    
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
        
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(doc, [self.labels_list[idx]])   

In [51]:
it = DocIterator(cleaned_text_data, doc_labels)
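
A class with an `__iter__` method is used instead of a plain generator because gensim passes over the corpus more than once (once to build the vocabulary, then once per training epoch), and a generator is exhausted after its first pass. The sketch below shows the difference without needing gensim: `Tagged` is a stand-in for `TaggedDocument`, and the documents and labels are made up.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim.
Tagged = namedtuple("Tagged", ["words", "tags"])


class TaggedCorpus:
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, docs, labels):
        self.docs = docs
        self.labels = labels

    def __iter__(self):
        for label, words in zip(self.labels, self.docs):
            yield Tagged(words, [label])


docs = [["airline", "shuts", "down"], ["markets", "rally"]]
labels = ["a.json", "b.json"]

# The class can be looped over as many times as needed.
corpus = TaggedCorpus(docs, labels)
first_pass = [d.tags[0] for d in corpus]
second_pass = [d.tags[0] for d in corpus]

# A generator expression yields its items only once.
gen = (Tagged(words, [label]) for label, words in zip(labels, docs))
gen_first = [d.tags[0] for d in gen]
gen_second = [d.tags[0] for d in gen]  # nothing left on the second pass
```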

Finally we get to the good stuff: doing all the modely things. `vector_size` is the number of features, `alpha` is the learning rate, and `min_count` is the minimum number of times a word needs to appear in order to be used.
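
The effect of `min_count` can be shown with a plain word count. This is a sketch of the idea only, not gensim's implementation; `filter_vocab` and the sample documents are hypothetical.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_count=1):
    """Return the words whose total count across all docs is >= min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


docs = [
    ["airline", "losses", "crash"],
    ["airline", "losses"],
    ["airline"],
]

vocab_all = filter_vocab(docs, min_count=1)     # every word survives
vocab_common = filter_vocab(docs, min_count=2)  # "crash" appears once, so it is dropped
```

With `min_count=1`, as in the cell below, no word is discarded, so rare words and typos all end up in the vocabulary.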

In [92]:
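#model = Doc2Vec(vector_size=300, min_count=0, alpha=.025, min_alpha=.025)
#model.build_vocab(it)

# tag each document with its integer index; these tags are the lookup keys later on
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(cleaned_text_data)]

# build the vocabulary only, so that all training happens in the loop below
model = Doc2Vec(vector_size=5, min_count=1, workers=4)
model.build_vocab(documents)

# train one pass per iteration, lowering the learning rate linearly
# from start_alpha to end_alpha so that it never drops below zero
n_epochs = 100
start_alpha, end_alpha = .025, .0001
step = (start_alpha - end_alpha) / n_epochs
for epoch in range(n_epochs):
    print('iteration {0}'.format(epoch + 1))
    alpha = start_alpha - epoch * step
    model.train(documents, total_examples=model.corpus_count, epochs=1,
                start_alpha=alpha, end_alpha=alpha - step)

# saving the created model
#model.save('doc2vec.model')

# loading the created model
#model = Doc2Vec.load('doc2vec.model')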


iteration 1
iteration 2
iteration 3
...
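
The "annealing" idea, lowering the learning rate a little on each pass, can be sketched as a linear schedule. The sketch also shows why a fixed decrement needs care: subtracting a hypothetical 0.002 per pass from 0.025 crosses zero long before 100 passes are done. The function names are illustrative, and the end value 0.0001 is an assumption chosen to match gensim's default `min_alpha`.

```python
def linear_alphas(start_alpha, end_alpha, n_epochs):
    """Learning rate at the start of each epoch, decaying linearly."""
    step = (start_alpha - end_alpha) / n_epochs
    return [start_alpha - i * step for i in range(n_epochs)]


def fixed_step_alphas(start_alpha, step, n_epochs):
    """Learning rate when a constant amount is subtracted after each epoch."""
    return [start_alpha - i * step for i in range(n_epochs)]


safe = linear_alphas(0.025, 0.0001, 100)
naive = fixed_step_alphas(0.025, 0.002, 100)

# Zero-based index of the first epoch that would run with a negative rate.
first_negative = next(i for i, a in enumerate(naive) if a < 0)
```

Here `first_negative` is 13, meaning the fourteenth pass would already use a negative learning rate, while every value in `safe` stays positive.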

In [93]:
model.docvecs[1] # vector of the document tagged 1 (the second file)
#model.docvecs["news_0075624"] # NOTE: lookup by name fails: documents are tagged with integer indices, not file names

array([-0.64948785, -0.66313255, -0.41883677,  0.21751669, -0.85845864],
      dtype=float32)
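
Document vectors like this one are compared by cosine similarity, which is the measure `most_similar` ranks by. A minimal pure-Python sketch follows; `cosine_similarity` is an illustrative helper and has nothing to do with gensim's internal code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The five-dimensional vector printed above.
vec = [-0.64948785, -0.66313255, -0.41883677, 0.21751669, -0.85845864]

same_direction = cosine_similarity(vec, [2 * x for x in vec])  # scaling does not matter
opposite = cosine_similarity(vec, [-x for x in vec])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Only the direction of a vector matters, not its length, so a score near 1 means two documents point almost the same way in the embedding space.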

In [94]:
model.docvecs.most_similar(1)

[(13943, 0.9998277425765991),
 (9309, 0.9993250370025635),
 (52976, 0.9989607930183411),
 (50930, 0.9989186525344849),
 (29316, 0.9988681077957153),
 (11447, 0.998862624168396),
 (55467, 0.9987135529518127),
 (30052, 0.9987015128135681),
 (26618, 0.998691201210022),
 (62834, 0.9986816644668579)]
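
Because the documents are tagged with integer positions, each result can be mapped back to a file name through `doc_labels`, assuming the documents were tagged in the same order as that list. In the sketch below, `label_results` is a hypothetical helper and the labels and scores are made up.

```python
def label_results(results, labels):
    """Replace each integer tag in (tag, score) pairs with its file name."""
    return [(labels[idx], score) for idx, score in results]


# Made-up stand-ins for doc_labels and for the output of most_similar.
labels = ["news_a.json", "news_b.json", "news_c.json"]
results = [(2, 0.99), (0, 0.95)]

labelled = label_results(results, labels)
```

With only five dimensions, many documents can end up with nearly parallel vectors, which may be why the scores above are all so close to 1.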