<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson 14 - Latent Variables and Natural Language Processing

---


## Activity: Twitter Lab
In this exercise, we will compare some of the classical NLP tools from the last class with these more modern latent variable techniques.  We will do this by comparing information extraction on Twitter using two different methods.

There is a pre-existing file of captured tweets you can use.  It is located in the datasets folder of the class repo. 

The sample `captured-tweets.txt` dataset in the repo was generated by collecting ~5000 tweets from the Twitter API using the keywords:
- `Google`
- `Microsoft`
- `Goldman Sachs`
- `Citigroup`
- `Tesla`
- `Verizon`
- `Syria`
- `Iran`
- `Israel`
- `Iraq`

In [1]:
# Unicode Handling
from __future__ import unicode_literals
import codecs

import numpy as np
import gensim

# spacy is used for pre-processing and traditional NLP
import spacy

nlp_toolkit = spacy.load('en')

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

In [2]:
# Loading the tweet data
filename = '../dataset/captured-tweets.txt'
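# Read one tweet per line, dropping the trailing newline and any blank lines
with codecs.open(filename, 'r', encoding='utf-8') as tweet_file:
    tweets = [line.strip() for line in tweet_file if line.strip()]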

### Example for nlp_toolkit

In [3]:
doc = nlp_toolkit(u'London is a big city in the United Kingdom.')
for ent in doc.ents:
    print(ent.label_, ent.text)
    # GPE London
    # GPE the United Kingdom

(u'GPE', u'London')
(u'GPE', u'the United Kingdom')


## Exercise 1a

Write a function that can take a sentence parsed by `spacy` and identify if it mentions a company named 'Google'. Remember, `spacy` can find entities and codes them as `ORG` if they are a company. Look at the slides for class 13 if you need a hint.

### Bonus (1b)

Parameterise the company name so that the function works for any company.

In [4]:
def mentions_company(parsed):
    # Return True if the sentence contains an organisation and that organisation is Google
    for entity in parsed.ents:
        pass
        # Fill in code here
    # Otherwise return False
    return False

# 1b

def mentions_company(parsed, company='Google'):
    # Your code here
    pass
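
One possible shape for the parameterised check is sketched below. It assumes only that the parsed object exposes `.ents` and that each entity has `.label_` and `.text`, as in the `nlp_toolkit` example above. The `Entity` and `Parsed` namedtuples and the name `mentions_company_sketch` are hypothetical stand-ins so the sketch runs without a spaCy model, and matching case-insensitively is a design choice rather than something the exercise requires.

```python
from collections import namedtuple

# Hypothetical stand-ins for a spaCy Doc and its entity spans (illustration only).
Entity = namedtuple('Entity', ['label_', 'text'])
Parsed = namedtuple('Parsed', ['ents'])


def mentions_company_sketch(parsed, company='Google'):
    """Return True if any ORG entity in `parsed` matches `company` (ignoring case)."""
    target = company.lower()
    for entity in parsed.ents:
        if entity.label_ == 'ORG' and entity.text.lower() == target:
            return True
    return False


example = Parsed(ents=[Entity('ORG', 'Google'), Entity('GPE', 'London')])
print(mentions_company_sketch(example))               # True
print(mentions_company_sketch(example, 'Microsoft'))  # False
```

With a real parse, `mentions_company_sketch(nlp_toolkit(tweet), 'Tesla')` should behave the same way, provided the model actually tags the company as `ORG` in that tweet.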

## Exercise 1c

Write a function that takes a sentence parsed by `spacy` and returns the verbs of the sentence, preferably lemmatized.

In [5]:
def get_actions(parsed):
    actions = []
    # Your code here
    return actions
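
A minimal sketch of the verb extraction follows, assuming each token exposes the `pos_` and `lemma_` attributes that spaCy tokens provide. The `Token` namedtuple, the hand-assigned tags, and the name `get_actions_sketch` are hypothetical, used only so the example runs without loading a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token (illustration only).
Token = namedtuple('Token', ['text', 'pos_', 'lemma_'])


def get_actions_sketch(parsed):
    """Collect the lemma of every token tagged as a verb."""
    return [token.lemma_ for token in parsed if token.pos_ == 'VERB']


# Hand-tagged toy sentence; a real parse would come from nlp_toolkit(...).
sentence = [
    Token('Google', 'PROPN', 'Google'),
    Token('announced', 'VERB', 'announce'),
    Token('a', 'DET', 'a'),
    Token('phone', 'NOUN', 'phone'),
]
print(get_actions_sketch(sentence))  # ['announce']
```

Returning lemmas rather than surface forms means 'announced', 'announces' and 'announcing' can all be matched against the single string 'announce'.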

## Exercise 1d
For each tweet, parse it using spacy and print it if it mentions the company and has 'release' or 'announce' as a verb. You'll need to use your `mentions_company` and `get_actions` functions.

In [6]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass
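
The filtering logic can be sketched independently of spaCy by passing the two checks in as functions. Everything here is an assumption made for illustration: the helper name `filter_by_action`, its parameters, and the toy dictionaries that stand in for parsed tweets.

```python
def filter_by_action(items, mentions, get_verbs, wanted=('release', 'announce')):
    """Yield items that pass `mentions` and contain at least one wanted verb lemma."""
    wanted = set(wanted)
    for item in items:
        if mentions(item) and wanted.intersection(get_verbs(item)):
            yield item


# Toy, hand-written "parsed tweets" (not taken from the dataset).
toy = [
    {'text': 'Google announces a new phone', 'orgs': ['Google'], 'verbs': ['announce']},
    {'text': 'I love Google', 'orgs': ['Google'], 'verbs': ['love']},
    {'text': 'Tesla releases an update', 'orgs': ['Tesla'], 'verbs': ['release']},
]
hits = list(filter_by_action(
    toy,
    mentions=lambda t: 'Google' in t['orgs'],
    get_verbs=lambda t: t['verbs'],
))
print([t['text'] for t in hits])  # ['Google announces a new phone']
```

In the notebook, `mentions` could wrap `mentions_company` and `get_verbs` could be `get_actions`, applied to the output of `nlp_toolkit(tweet)`.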

## Exercise 1e
Write a function that identifies whether a parsed sentence mentions a given country. Hint: the entity label for countries is `GPE` (geopolitical entity).



In [7]:
def mentions_country(parsed, country):
    pass

## Exercise 1f

Re-run (1d), this time printing tweets that mention the country 'Iran' and have 'announce' or 'release' as a verb.


In [8]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass

## Exercise 2
Build a word2vec model of the tweets we have collected using gensim.
First take the collection of tweets and tokenize them using spacy.

### Exercise 2a:
* Think about how this should be done.
* Should you convert everything to upper case or lower case?
* Should you remove punctuation or symbols?
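
To make these choices concrete, here is a small sketch of a token clean-up step with the decisions exposed as switches. The function name `normalise_tokens`, its flags, and the `<url>` placeholder are all hypothetical; this is one possible answer, not the expected one.

```python
import re

URL_RE = re.compile(r'https?://\S+')


def normalise_tokens(tokens, lowercase=True, drop_punct=True):
    """Apply simple, optional clean-up steps to a list of token strings."""
    cleaned = []
    for tok in tokens:
        if URL_RE.match(tok):
            cleaned.append('<url>')   # collapse every link into one placeholder
            continue
        if drop_punct and not any(ch.isalnum() for ch in tok):
            continue                  # skip tokens with no letters or digits
        cleaned.append(tok.lower() if lowercase else tok)
    return cleaned


print(normalise_tokens(['Google', 'releases', 'Pixel', '!', 'http://t.co/abc']))
# ['google', 'releases', 'pixel', '<url>']
```

Lowercasing merges 'Google' and 'google' into one vocabulary entry, which helps with a small corpus, but any later similarity query then has to use the lowercased form too. Entity recognition tends to rely on capitalisation, so a step like this is safer after parsing than before.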

In [9]:
text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                for x in nlp_toolkit(t)] for t in tweets]

### Exercise 2b:
Build a word2vec model.
Experiment with the window size as well: this is the maximum distance between a word and the surrounding words that are used as its context. What do you think is appropriate for Twitter?

In [10]:
model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)
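
To see what the `window` parameter controls, the sketch below lists the (centre, context) pairs that a symmetric window of a given size produces for one short token list. The helper `context_pairs` is hypothetical and deliberately simplified; real word2vec training adds further details that this ignores.

```python
def context_pairs(tokens, window):
    """Return (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


tokens = ['tesla', 'announces', 'new', 'model']
print(context_pairs(tokens, window=1))
# [('tesla', 'announces'), ('announces', 'tesla'), ('announces', 'new'),
#  ('new', 'announces'), ('new', 'model'), ('model', 'new')]
```

Because tweets are short, a large window quickly covers the whole tweet, so every word becomes context for every other word. A smaller window emphasises immediate neighbours instead.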

### Exercise 2c:
Test your word2vec model with a few similarity functions. 
* Find words similar to 'tweet'.
* Find words similar to 'rank'.
* Find words similar to 'Google'.
* Find words similar to 'Syria'. 



In [11]:
most_similar = model.wv.most_similar(positive=['tweet'], topn=10)
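
The ranking behind a similarity query is based on cosine similarity between word vectors. The sketch below reproduces that idea in plain Python on made-up two-dimensional vectors; the names `cosine`, `most_similar_sketch` and `toy_vectors` are hypothetical, and the numbers are invented for illustration rather than taken from a trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_sketch(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, purely for illustration.
toy_vectors = {
    'tweet': [1.0, 0.0],
    'retweet': [0.9, 0.1],
    'syria': [0.0, 1.0],
}
print(most_similar_sketch(toy_vectors, 'tweet', topn=1)[0][0])  # retweet
```

Scores close to 1 mean the two vectors point in nearly the same direction. With only a few thousand tweets, expect the real model's neighbours to be noisy.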

### Exercise 2d:

Adjust the choices in (b) and (c) as necessary.


## Exercise 3

Filter tweets to those that mention 'Iran' or similar entities and 'plan' or similar words.
* Do this using just spacy.
* Do this using word2vec similarity scores.

In [12]:
# Using spacy
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass

In [13]:
# Using word2vec similarity scores (limit number of tweets initially)
for tweet in tweets[:200]:
    parsed = nlp_toolkit(tweet)
    pass
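
For the word2vec variant, one approach is to accept a token when its similarity to the target word reaches a threshold. The sketch below assumes a similarity function that returns `None` for unknown words. The names `matches_by_similarity`, `lookup` and `toy_scores`, the 0.6 threshold, and the scores themselves are all hypothetical.

```python
def matches_by_similarity(tokens, target, similarity, threshold=0.6):
    """True if any token is the target itself or scores at least `threshold` against it."""
    for tok in tokens:
        if tok == target:
            return True
        score = similarity(tok, target)
        if score is not None and score >= threshold:
            return True
    return False


# Invented similarity scores standing in for a trained model.
toy_scores = {('tehran', 'iran'): 0.8, ('lunch', 'iran'): 0.1}


def lookup(a, b):
    """Return the toy score for a word pair, or None if the pair is unknown."""
    return toy_scores.get((a, b))


print(matches_by_similarity(['talks', 'in', 'tehran'], 'iran', lookup))  # True
print(matches_by_similarity(['lunch', 'plans'], 'iran', lookup))         # False
```

In the notebook, `similarity` could wrap `model.wv.similarity(a, b)` and return `None` when a word is missing from the vocabulary. A tweet then passes the filter when the check succeeds for both 'Iran' and 'plan', using the same casing as the training tokens.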