## Practical 2: Text Classification

In [2]:
import sys
sys.path.append("../lib/")
from util import *
import numpy as np
import os
from random import shuffle
import re

""" bokeh """
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

""" data loading """
import urllib.request
import zipfile
import lxml
import lxml.etree

""" Autoload """
%load_ext autoreload
%autoreload 2



### Part 0: Download the TED dataset

In [3]:
ted_zip = '../data/ted_en-20160408.zip'

In [4]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile(ted_zip, 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

In [6]:
sentences = input_text.split("\n")
print("Input len = {}".format(len(sentences)))
sentences[:3]

Input len = 54863


["Here are two reasons companies fail: they only do more of the same, or they only do what's new.",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.',
 "Consider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone."]

### Part 1: Preprocessing

In this part, we attempt to clean up the raw subtitles a bit, so that we get only sentences. The following substring shows examples of what we're trying to get rid of. Since it's hard to define precisely what we want to get rid of, we'll just use some simple heuristics.

In [7]:
i = input_text.find("Hyowon Gweon: See this?")
input_text[i-20:i+150]

' baby does.\n(Video) Hyowon Gweon: See this? (Ball squeaks) Did you see that? (Ball squeaks) Cool. See this one? (Ball squeaks) Wow.\nLaura Schulz: Told you. (Laughs)\n(Vide'

Let's start by removing all parenthesized strings using a regex:

In [8]:
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
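
The behaviour of this regex can be checked on a small string first. The sketch below uses a made-up sample line modelled on the excerpt above; the second call shows a limitation that follows from the pattern itself: nested parentheses are only partly removed, because the match stops at the first closing parenthesis.

```python
import re

# Hypothetical sample line, modelled on the transcript excerpt above.
sample = "(Video) Hyowon Gweon: See this? (Ball squeaks) Cool."

# \( and \) match literal parentheses; [^)]* matches everything up to the
# first closing parenthesis, so each annotation is removed separately.
cleaned = re.sub(r'\([^)]*\)', '', sample)
print(repr(cleaned))

# Limitation: with nested parentheses the match ends at the first ')',
# which leaves the tail of the outer group behind.
nested = re.sub(r'\([^)]*\)', '', "a (b (c) d) e")
print(repr(nested))
```

The doubled spaces left behind are harmless here, since the text is split on whitespace during tokenization.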

We can verify the same location in the text is now clean as follows. We won't worry about the irregular spaces since we'll later split the text into sentences and tokenize it anyway.

In [9]:
i = input_text_noparens.find("Hyowon Gweon: See this?")
input_text_noparens[i-20:i+150]

"hat the baby does.\n Hyowon Gweon: See this?  Did you see that?  Cool. See this one?  Wow.\nLaura Schulz: Told you. \n HG: See this one?  Hey Clara, this one's for you. You "

Now, let's attempt to remove speakers' names that occur at the beginning of a line, by deleting pieces of the form "`<up to 20 characters>:`", as shown in this example. Of course, this is an imperfect heuristic.

In [10]:
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# Uncomment if you need to save some RAM: these strings are about 50MB.
# del input_text, input_text_noparens

# Let's view the first few:
sentences_strings_ted[:5]

["Here are two reasons companies fail: they only do more of the same, or they only do what's new",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation',
 ' Both are necessary, but it can be too much of a good thing',
 'Consider Facit',
 " I'm actually old enough to remember them"]

In [11]:
m

<_sre.SRE_Match object; span=(0, 0), match=''>
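
To see why the speaker-name heuristic is imperfect, it helps to run the same pattern on a few lines. This is a small sketch with made-up sample lines (the name `SPEAKER_RE` is introduced here for illustration only): one line with a real speaker prefix, one without a colon, and one where an ordinary 20-character phrase before a colon is wrongly treated as a speaker name.

```python
import re

# Same pattern as in the cell above: an optional prefix of at most 20
# non-colon characters followed by ':', then the rest of the line.
SPEAKER_RE = re.compile(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$')

# Hypothetical sample lines: a hit, a miss, and a false positive.
lines = [
    "Laura Schulz: Told you.",
    "Both are necessary, but it can be too much of a good thing.",
    "Here are two reasons: more of the same.",
]
for line in lines:
    m = SPEAKER_RE.match(line)
    # precolon is None when no short "<prefix>:" is found at the line start.
    print(repr(m.group('precolon')), '->', repr(m.group('postcolon')))
```

Because every part of the pattern is optional or matches the empty string, the match always succeeds, even on an empty line; that is why the last `m` in the loop above is a match object with an empty span.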

Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is "can't", and where/how do we split it? We'll take the simplest, naive approach of splitting on spaces. Before splitting, we lowercase the text and replace each run of non-alphanumeric characters, such as punctuation, with a single space. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer.

In [12]:
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

The number of sentences, followed by two sample processed sentences:

In [13]:
len(sentences_ted)

266694

In [70]:
print(sentences_ted[0])
print(sentences_ted[1])

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']
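
One way to explore the replace-versus-delete question is to run both variants on the same input. The sketch below uses a made-up sentence; the `deleted` variant is not part of the practical and is shown only for comparison.

```python
import re

# Hypothetical sentence containing a hyphen and an apostrophe.
sent = "Well-known speakers can't always explain"

# Variant used in this notebook: runs of non-alphanumerics become one space.
spaced = re.sub(r"[^a-z0-9]+", " ", sent.lower()).split()

# Comparison variant: delete non-alphanumerics (spaces are kept so that
# the sentence can still be split into words).
deleted = re.sub(r"[^a-z0-9 ]+", "", sent.lower()).split()

print(spaced)
print(deleted)
```

Replacing with a space keeps "well" and "known" as separate, common tokens, at the cost of stray fragments such as "t" and "s". Deleting instead glues the parts into rarer tokens such as "wellknown".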


### Part 2: Word Frequencies

If you store the counts of the top 1000 words in a list called `counts_ted_top1000`, the code below will plot the histogram requested in the writeup.

In [74]:
# ...
c = counter_of_words(sentences_ted)
counts_ted_top1000 = [x[1] for x in c.most_common(1000)]
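
`counter_of_words` comes from the local `util` module, whose source is not shown here. A minimal sketch of what such a helper might look like, assuming it takes a list of token lists and returns a `collections.Counter` (the `_sketch` suffix marks it as a hypothetical stand-in, not the actual implementation):

```python
from collections import Counter
from itertools import chain

def counter_of_words_sketch(tokenized_sentences):
    """Count word occurrences over a list of token lists.

    Hypothetical stand-in for the `counter_of_words` helper in util.
    """
    # chain.from_iterable flattens the sentences into one stream of tokens.
    return Counter(chain.from_iterable(tokenized_sentences))

toy = [["the", "cat", "sat"], ["the", "dog"]]
c_toy = counter_of_words_sketch(toy)

# most_common(n) returns (word, count) pairs sorted by descending count,
# which is what the list comprehensions in this part rely on.
print(c_toy.most_common(1))
```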

### Plot distribution of top-1000 words

In [75]:
# `normed` is deprecated and redundant with `density`; recent NumPy versions reject it.
hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

### Part 3: Train Word2Vec

In [18]:
from gensim.models import Word2Vec

In [35]:
# ...
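# `size` is the gensim < 4.0 name for the embedding dimension; gensim >= 4.0 calls it `vector_size`.
model_ted = Word2Vec(sentences_ted, size=100, min_count=10)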

In [36]:
# `wv.vocab` exists in gensim < 4.0; in gensim >= 4.0 use `wv.key_to_index` instead.
print("model_ted voc: {}".format(len(model_ted.wv.vocab)))

model_ted voc: 14427


### Part 4: TED Learnt Representations

Finding similar words: (see gensim docs for more functionality of `most_similar`)

In [37]:
model_ted.wv.most_similar("man")

[('woman', 0.8560987114906311),
 ('guy', 0.8098085522651672),
 ('lady', 0.7557809352874756),
 ('boy', 0.7379118204116821),
 ('soldier', 0.7317131161689758),
 ('gentleman', 0.7294946908950806),
 ('girl', 0.7148830890655518),
 ('poet', 0.6848183274269104),
 ('surgeon', 0.6828598976135254),
 ('rabbi', 0.6786246299743652)]

In [38]:
model_ted.wv.most_similar("computer")

[('machine', 0.7370160222053528),
 ('software', 0.7127266526222229),
 ('robot', 0.7080278992652893),
 ('device', 0.705635130405426),
 ('program', 0.6421622633934021),
 ('printer', 0.6381993293762207),
 ('chip', 0.6379455327987671),
 ('simulation', 0.6375066041946411),
 ('mechanical', 0.6345296502113342),
 ('camera', 0.6295073628425598)]

In [40]:
# ...
model_ted.wv.most_similar("sex")

[('stress', 0.5612446069717407),
 ('depression', 0.559812068939209),
 ('abuse', 0.552577018737793),
 ('symptoms', 0.5474222302436829),
 ('schizophrenia', 0.5275021195411682),
 ('condom', 0.5252465605735779),
 ('gender', 0.5209177136421204),
 ('parkinson', 0.5189566612243652),
 ('alzheimer', 0.5117388367652893),
 ('disabilities', 0.5069156885147095)]

In [54]:
from numpy.linalg import norm

def show_similarity(w1, w2):
    """Print the cosine similarity between the vectors of two words."""
    v1 = model_ted.wv[w1]
    v2 = model_ted.wv[w2]
    # Cosine similarity: dot product divided by the product of the norms.
    sim = np.dot(v1, v2) / (norm(v1) * norm(v2))
    print("({:10s}, {:10s}): similarity = {:.3f}".format(w1, w2, sim))

In [55]:
show_similarity("king", "queen")
show_similarity("king", "cat")

(king      , queen     ): similarity = 0.732
(king      , cat       ): similarity = 0.594
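
The similarity printed above is the cosine of the angle between the two word vectors. As a sanity check on how to read these numbers, here is a pure-Python sketch of the same formula on toy 2-D vectors (the `cosine` helper and the vectors are illustrative, not taken from the model):

```python
import math

def cosine(v1, v2):
    """Cosine similarity: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

same_direction = cosine([1.0, 0.0], [2.0, 0.0])   # length does not matter
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # unrelated directions
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

The value depends only on direction, not on vector length, and ranges from -1 to 1.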


#### t-SNE visualization
To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable `words_top_ted`. The following code gets the corresponding vectors from the model, assuming it's called `model_ted`:

In [57]:
words_top_ted = [x[0] for x in c.most_common(1000)]

In [58]:
# This assumes words_top_ted is a list of strings, the top 1000 words
words_top_vec_ted = model_ted.wv[words_top_ted]

In [59]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)

In [60]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Part 5: Wiki Learnt Representations

Download dataset

In [62]:
if not os.path.isfile('wikitext-103-raw-v1.zip'):
    urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", filename="wikitext-103-raw-v1.zip")

In [63]:
with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:
    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8') # Thanks Robert Bastian

Preprocess the sentences (note that it's important to remove short sentences for performance).

In [64]:
sentences_wiki = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_wiki.extend(s)
    
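for s_i in range(len(sentences_wiki)):
    # Strip parenthesized text first: once non-letters have been replaced
    # by spaces, there are no parentheses left for this regex to match.
    sentences_wiki[s_i] = re.sub(r'\([^)]*\)', '', sentences_wiki[s_i])
    sentences_wiki[s_i] = re.sub("[^a-z]", " ", sentences_wiki[s_i].lower())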
del input_text

In [65]:
# sample 1/5 of the data
shuffle(sentences_wiki)
print(len(sentences_wiki))
sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]
print(len(sentences_wiki))

4267112
853422
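
Since `shuffle` is called without a seed, the sampled subset differs from run to run. If reproducibility matters, one option is to draw the sample from a seeded generator. This is a sketch only; the `sample_fraction` helper and the seed value are assumptions, not part of the practical.

```python
import random

def sample_fraction(items, denominator=5, seed=0):
    """Return a reproducible random subset of len(items) // denominator items.

    Hypothetical helper; unlike shuffle, it leaves `items` unchanged.
    """
    rng = random.Random(seed)  # private generator, does not affect global state
    return rng.sample(items, len(items) // denominator)

toy = ["s{}".format(i) for i in range(10)]
subset = sample_fraction(toy)
print(len(subset))
```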


Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code.

In [89]:
# ...
tokenized_wiki = tokenize_sentences(sentences_wiki)
c_wiki = counter_of_words(tokenized_wiki)
counts_wiki_top1000 = [x[1] for x in c_wiki.most_common(1000)]
words_top_wiki = [x[0] for x in c_wiki.most_common(1000)]
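
`tokenize_sentences` is another helper from the local `util` module that is not shown here. Because the Wikipedia sentences have already had every non-letter replaced by a space, a whitespace split is probably all that is needed. A minimal sketch under that assumption (the `_sketch` suffix marks it as a hypothetical stand-in):

```python
def tokenize_sentences_sketch(sentences):
    """Split each cleaned sentence string into a list of word tokens.

    Hypothetical stand-in for the `tokenize_sentences` helper in util.
    """
    # str.split() with no argument splits on runs of whitespace and drops
    # leading and trailing whitespace, so irregular spacing is harmless.
    return [s.split() for s in sentences]

toy = ["the  quick brown", "  fox "]
tokens = tokenize_sentences_sketch(toy)
print(tokens)
```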

In [91]:
# gensim < 4.0 API: the `size` parameter is named `vector_size` in gensim >= 4.0.
model_wiki = Word2Vec(tokenized_wiki, size=100, min_count=10)

#### t-SNE visualization

In [92]:
# This assumes words_top_wiki is a list of strings, the top 1000 words
words_top_vec_wiki = model_wiki.wv[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)


In [93]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)