<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson 14 - Latent Variables and Natural Language Processing

---


## Activity: Twitter Lab
In this exercise, we will compare some of the classical NLP tools from the last class with these more modern latent variable techniques.  We will do this by comparing information extraction on Twitter using two different methods.

There is a pre-existing file of captured tweets you can use.  It is located in the datasets folder of the class repo. 

The sample `captured-tweets.txt` dataset in the repo was generated by collecting ~5000 tweets from the Twitter API using the keywords:
- `Google`
- `Microsoft`
- `Goldman Sachs`
- `Citigroup`
- `Tesla`
- `Verizon`
- `Syria`
- `Iran`
- `Israel`
- `Iraq`

In [1]:
# Unicode Handling
from __future__ import unicode_literals
import codecs

import numpy as np
import gensim

# spaCy is used for pre-processing and traditional NLP
import spacy

nlp_toolkit = spacy.load('en')

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

In [2]:
# Loading the tweet data
filename = '../dataset/captured-tweets.txt'
# Read one tweet per line; the context manager closes the file afterwards
with codecs.open(filename, 'r', encoding="utf-8") as f:
    tweets = [tweet for tweet in f]

### Example for nlp_toolkit

In [3]:
doc = nlp_toolkit(u'London is a big city in the United Kingdom.')
for ent in doc.ents:
    print(ent.label_, ent.text)
    # GPE London
    # GPE the United Kingdom

(u'GPE', u'London')
(u'GPE', u'the United Kingdom')


## Exercise 1a

Write a function that takes a sentence parsed by `spacy` and identifies whether it mentions a company named 'Google'. Remember, `spacy` finds entities and labels them `ORG` if they are companies. Look at the slides for class 13 if you need a hint.

### Bonus (1b)

Parameterise the company name so that the function works for any company.

In [4]:
def mentions_company(parsed):
    for entity in parsed.ents:
        if entity.text == "Google" and entity.label_ == 'ORG':
            return True
    return False

# 1b

def mentions_company(parsed, company='Google'):
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    return False

## Exercise 1c

Write a function that takes a sentence parsed by `spacy` and returns the verbs of the sentence (preferably lemmatized).

In [5]:
def get_actions(parsed):
    actions = []
    for el in parsed:
        if el.pos == spacy.parts_of_speech.VERB:
            actions.append(el.text)
    return actions
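
The exercise asks for lemmatized verbs, while the cell above returns the surface form (`el.text`). A minimal sketch of a lemmatized variant follows, assuming each token exposes spaCy's `pos_` and `lemma_` string attributes. The function name `get_action_lemmas` and the `FakeToken` stand-in (used only so the example runs without a spaCy model) are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, so the sketch runs without a model.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

def get_action_lemmas(parsed):
    """Return the lemma of every token tagged as a verb."""
    return [tok.lemma_ for tok in parsed if tok.pos_ == "VERB"]

# A hand-built "parse"; the tags and lemmas are illustrative, not spaCy output.
example = [
    FakeToken("Google", "PROPN", "Google"),
    FakeToken("announces", "VERB", "announce"),
    FakeToken("phone", "NOUN", "phone"),
]
verbs = get_action_lemmas(example)
```

With lemmas, a check such as `'announce' in actions` should also catch inflected forms like "announces" or "announced", provided the tagger and lemmatizer handle the tweet correctly.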

## Exercise 1d
For each tweet, parse it using spacy and print it out if the tweet has 'release' or 'announce' as a verb. You'll need to use your `mentions_company` and `get_actions` functions.

In [6]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    if mentions_company(parsed, 'Google'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print(tweet)




In [7]:
actions

[u'joins', u'spying']

In [8]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    if mentions_company(parsed, 'Google'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions or 'building' in actions:
            print(tweet)

RT @business: Where Amazon, Microsoft, Google, IBM and DigitalOcean are building data centers https://t.co/VX047Jm9tq https://t.co/opTaZzO7…



## Exercise 1e
Write a function that identifies countries. Hint: the entity label for countries is `GPE` (geopolitical entity).



In [9]:
def mentions_country(parsed, country):
    for entity in parsed.ents:
        if entity.text == country and entity.label_ == 'GPE':
            return True
    return False

## Exercise 1f

Re-run (d) to find tweets that mention 'Iran' together with the verbs 'announce' or 'release'.


In [10]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)

    if mentions_country(parsed, 'Iran'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print(tweet)

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

GOBE! Iran warns Nigeria to release Shiite leader El-Zakzaky - SEE https://t.co/TRshnC6sVU

GOBE! Iran warns Nigeria to release Shiite leader El-Zakzaky - SEE https://t.co/SlvcQtk3vE

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

Hhmmm. Iran claiming to have 'warned Nigeria' to release detained Shiite leader.... @afalli

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…



In [11]:
actions

[]

## Exercise 2
Build a word2vec model of the tweets we have collected using gensim.
First take the collection of tweets and tokenize them using spacy.

### Exercise 2a:
* Think about how this should be done. 
* Should you convert everything to upper case or lower case? 
* Should you remove punctuation or symbols? 
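
One way to explore these questions is to normalise the token strings before training. The sketch below is only one possible set of choices, not the intended answer: it lowercases tokens, drops whitespace-only and punctuation-only tokens, and collapses links into a single placeholder. The helper name `normalize_tokens` and the `<URL>` placeholder are assumptions made for illustration.

```python
import re

# Matches a token that is entirely an http(s) link.
URL_RE = re.compile(r"^https?://\S+$")

def normalize_tokens(tokens, lowercase=True, keep_symbols=False):
    """Clean a list of token strings before training word2vec."""
    cleaned = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue  # drop whitespace-only tokens such as '\n'
        if URL_RE.match(tok):
            cleaned.append("<URL>")  # collapse every link into one placeholder
            continue
        if not keep_symbols and not any(ch.isalnum() for ch in tok):
            continue  # drop pure punctuation/symbol tokens such as '!' or '#'
        cleaned.append(tok.lower() if lowercase else tok)
    return cleaned

sample = ["I", "make", "Small", "Tourmaline", "!",
          "https://t.co/cAoW1b6DRc", "#", "Android", "\n"]
result = normalize_tokens(sample)
```

Lowercasing merges variants such as "Rank" and "rank" into one vocabulary entry, which gives each word more training examples in a small corpus. The cost is that case distinctions (for example a company name versus a common noun) are lost, so it is worth trying both settings.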

In [12]:
text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                for x in nlp_toolkit(t)] for t in tweets]

In [13]:
tweets[0]

u'I made a(n) Small Tourmaline in Paradise Island! https://t.co/cAoW1b6DRc #Gameinsight #Androidgames #Android\n'

In [14]:
text_split[:1]

[[u'I',
  u'make',
  u'a(n',
  u')',
  u'Small',
  u'Tourmaline',
  u'in',
  u'Paradise',
  u'Island',
  u'!',
  u'https://t.co/cAoW1b6DRc',
  u'#',
  u'Gameinsight',
  u'#',
  u'Androidgames',
  u'#',
  u'Android',
  u'\n']]

### Exercise 2b:
Build a word2vec model.
Experiment with the window size as well: this is the number of surrounding words used as context when modelling each word. What do you think is appropriate for Twitter? 
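
To make the effect of the window concrete, the sketch below lists the (center, context) pairs that a symmetric window would produce for one short tokenised tweet. It illustrates the idea only: gensim's training loop differs in details (for example, it varies the effective window during training), and the helper name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for a symmetric window of the given size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["Iran", "warns", "Nigeria", "to", "release", "leader"]
narrow = context_pairs(tokens, window=1)  # only immediate neighbours
wide = context_pairs(tokens, window=4)    # most of a short tweet
```

Because tweets are short, a window of 4 or 5 already spans most of a tweet, so larger values add few new pairs. A small window tends to capture syntactic neighbours, while a wider one mixes in more topical context.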

In [15]:
model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)

### Exercise 2c:
Test your word2vec model with a few similarity functions. 
* Find words similar to 'tweet'.
* Find words similar to 'rank'.
* Find words similar to 'Google'.
* Find words similar to 'Syria'. 
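
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that ranking on made-up 3-dimensional vectors so the calculation is visible. The `toy_vectors` values and the helper names are assumptions for illustration, not output from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings"; the numbers are made up for illustration.
toy_vectors = {
    "Syria": [0.9, 0.1, 0.3],
    "Russia": [0.8, 0.2, 0.4],
    "giveaway": [-0.2, 0.9, -0.1],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("Syria", toy_vectors)
```

When nearly every neighbour scores close to 1, this may indicate that the vectors have barely separated from one another, which can happen with a corpus of only a few thousand short texts. In that case the ordering of the neighbours is more informative than the absolute scores.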



In [16]:
model.wv.most_similar(positive=['tweet'],topn=10)

[(u'us', 0.9996129274368286),
 (u'call', 0.999569296836853),
 (u'Your', 0.9995595812797546),
 (u'than', 0.9995377063751221),
 (u'10', 0.9995316863059998),
 (u'--', 0.9995256662368774),
 (u'at', 0.99952232837677),
 (u'ne', 0.9995107650756836),
 (u'set', 0.9995098114013672),
 (u'people', 0.9995038509368896)]

In [17]:
model.wv.most_similar(positive=['rank'],topn=10)

[(u'Times', 0.9908788800239563),
 (u'1,100', 0.9908362030982971),
 (u'100', 0.9907811284065247),
 (u'Non', 0.9906907081604004),
 (u'York', 0.9905609488487244),
 (u'6P', 0.990376353263855),
 (u'kill', 0.9903687238693237),
 (u'Israel', 0.9903035163879395),
 (u'near', 0.9903000593185425),
 (u'tensions', 0.9902849793434143)]

In [18]:
model.wv.most_similar(positive=['Google'],topn=10)

[(u'S', 0.9927763938903809),
 (u'Play', 0.9926636219024658),
 (u'ROYALTY', 0.9924790859222412),
 (u'@BrookingsInst', 0.991484522819519),
 (u'Datastore', 0.9911888241767883),
 (u'Java', 0.9907231330871582),
 (u'famous', 0.9905280470848083),
 (u'miss', 0.9904383420944214),
 (u'Rank', 0.990268349647522),
 (u'hide', 0.9891901016235352)]

In [19]:
model.wv.most_similar(positive=['Syria'],topn=12)

[(u'opposition', 0.998782753944397),
 (u'/', 0.997977614402771),
 (u'Russia', 0.9974946975708008),
 (u'must', 0.997404932975769),
 (u'by', 0.9969280958175659),
 (u'democractic', 0.9967551231384277),
 (u'internet', 0.996660053730011),
 (u'Arab', 0.996646523475647),
 (u'Ads', 0.9966259002685547),
 (u'prison', 0.9964969754219055),
 (u'+', 0.9964832663536072),
 (u'death', 0.9963929057121277)]

### Exercise 2d:

Adjust the choices in (b) and (c) as necessary.


## Exercise 3

Filter tweets to those that mention 'Iran' or similar entities and 'plan' or similar words.
* Do this using just spacy.
* Do this using word2vec similarity scores.
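
For the spaCy-only approach, one option is to hand-pick the related terms and test each parsed tweet against them. The sketch below assumes documents that iterate over tokens and expose `.ents`, as spaCy's do. The `Fake*` stand-ins (so the example runs without a model), the function name `mentions_place_and_action`, and the word sets `IRAN_LIKE` and `PLAN_LIKE` are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy objects so the sketch runs without a model.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

class FakeDoc(object):
    """Hypothetical stand-in for a parsed spaCy document."""
    def __init__(self, tokens, ents):
        self._tokens = tokens
        self.ents = ents

    def __iter__(self):
        return iter(self._tokens)

def mentions_place_and_action(doc, places, verbs):
    """True if doc names one of `places` as a GPE and uses one of `verbs`."""
    has_place = any(e.label_ == "GPE" and e.text in places for e in doc.ents)
    has_verb = any(t.pos_ == "VERB" and t.lemma_ in verbs for t in doc)
    return has_place and has_verb

# Hand-picked related terms; these lists are assumptions, not derived from data.
IRAN_LIKE = {"Iran", "Tehran"}
PLAN_LIKE = {"plan", "intend", "propose"}

doc_hit = FakeDoc(
    [FakeToken("Tehran", "PROPN", "Tehran"), FakeToken("plans", "VERB", "plan")],
    [FakeEntity("Tehran", "GPE")],
)
doc_miss = FakeDoc(
    [FakeToken("Google", "PROPN", "Google"), FakeToken("plans", "VERB", "plan")],
    [FakeEntity("Google", "ORG")],
)
kept = [d for d in (doc_hit, doc_miss)
        if mentions_place_and_action(d, IRAN_LIKE, PLAN_LIKE)]
```

The word2vec approach replaces the hand-written sets with neighbours taken from the trained model, trading precision for coverage of terms you did not think to list.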

In [20]:
# Using word2vec similarity scores
def tweet_sim(word_1, word_2, threshold_1=0.99, threshold_2=0.99):
    """Return [index, [sim_1, sim_2], tweet] for tweets close to both words."""
    tweet_list = []
    i = 0  # running count of tweets above both thresholds
    for tweet in tweets[:200]:  # only the first 200 tweets are scanned
        parsed = nlp_toolkit(tweet)
        # Only tokens the model has a vector for can be scored
        known = [tok.text for tok in parsed if tok.text in model.wv.vocab]
        if not known:
            continue  # max() would raise ValueError on an empty sequence
        similarity_to_1 = max(model.wv.similarity(word_1, w) for w in known)
        similarity_to_2 = max(model.wv.similarity(word_2, w) for w in known)
        if similarity_to_1 > threshold_1 and similarity_to_2 > threshold_2:
            tweet_list.append([i, [similarity_to_1, similarity_to_2], tweet])
            i += 1

    print('tweets above threshold: %d' % i)
    return tweet_list

In [21]:
list_google_find = tweet_sim('Google','find',0.99,0.995)

tweets above threshold: 64


In [22]:
[lg[2] for lg in list_google_find][:10]

[u'Claim your Google Play Gift Card Code... https://t.co/ySYH1x5kQl #amazon #itunes #googl\u2026 https://t.co/ayDI4X1FKO\n',
 u"I've entered to win a Google Nexus 6P from  !    https://t.co/4vFHfhaBey\n",
 u"RT @kamcb29: I've entered to win a Google Nexus 6P from @MakeUseOf ! https://t.co/o30B9xG6Dx #giveaway #competition\n",
 u'I LOVE your Google plus page with the other girls! \U0001f49c\U0001f606\n',
 u"After I've Google &amp; read a ton of articles on the same subject, I remember, I could've just searched YouTube for this shit \U0001f620\n",
 u'RT @ShowerThoughtts: Apple has "air", Amazon has "Fire", Google has "earth", why doesn\'t Microsoft have "water"?\n',
 u"-Looks up on Google 'MikexJeremy' secretly- &lt;33 ;) [@FnafSchimdt,@MikeSchmit10,]#SenpaiBot~\n",
 u'RT @_silentbent_: Go support @dadeputy single "It\'s Okay" feat @jaesongreen on iTunes,Google\u2026 https://t.co/WPc1jwFLs0\n',
 u'Ever wanted to become a Google Small Business Advisor? Now you can! https://t.co/fcOkq6srSX

In [23]:
list_iran_plan = tweet_sim('Iran','plan',0.99,0.995)

tweets above threshold: 61


In [24]:
[lg[2] for lg in list_iran_plan][:10]

[u"RT @kamcb29: I've entered to win a Google Nexus 6P from @MakeUseOf ! https://t.co/o30B9xG6Dx #giveaway #competition\n",
 u"After I've Google &amp; read a ton of articles on the same subject, I remember, I could've just searched YouTube for this shit \U0001f620\n",
 u'RT @ShowerThoughtts: Apple has "air", Amazon has "Fire", Google has "earth", why doesn\'t Microsoft have "water"?\n',
 u'RT @_silentbent_: Go support @dadeputy single "It\'s Okay" feat @jaesongreen on iTunes,Google\u2026 https://t.co/WPc1jwFLs0\n',
 u'Check out @GCloudAndroid. Never worry about losing your #Android device up to 10 GB Free to backup https://t.co/7eaqZEHAnI\n',
 u'Top Android Apps (without all the games) revealed in hidden Google Play Store link -\u2026 https://t.co/7ZSs9MkcLm #Android #India\n',
 u"                    Europe could produce a Facebook ' and the Google of healthcare\n",
 u"                    Europe could produce a Facebook \\' and the Google of healthcare\n",
 u'#USA #israel \n',
 u'The fi