# spaCy Demo

If you haven't installed spaCy yet, use:
```
conda install spacy
python -m spacy download en
```
This downloads about 500 MB of data.

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
conda install nltk
python -m nltk.downloader all
```

This also downloads a lot of data.

## Load StumbleUpon dataset

In [1]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
# Show full column contents (in pandas >= 1.0, pass None instead of -1)
pd.set_option('display.max_colwidth', -1)
import json

data = pd.read_csv("stumbleupon.tsv", sep='\t', encoding="utf-8")
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
#data.head()

In [2]:
data['title'][0]

'IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries'
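
The two `map` calls above parse every `boilerplate` string twice, and `.get('title', '')` still returns `None` when the key is present but holds JSON `null`. A minimal sketch of a single-pass alternative is shown here. The helper name `extract_title_and_body` and the sample rows are made up for illustration and are not part of the dataset.

```python
import json

def extract_title_and_body(raw_boilerplate):
    """Parse one boilerplate JSON string and return (title, body) as strings."""
    record = json.loads(raw_boilerplate)
    # `or ''` also covers keys that exist but hold JSON null,
    # which `.get(key, '')` alone would return as None.
    title = record.get('title') or ''
    body = record.get('body') or ''
    return title, body

# Hypothetical sample rows, made up for illustration.
samples = [
    '{"title": "IBM sees holographic calls", "body": "Some body text"}',
    '{"title": null, "body": "Body without a title"}',
    '{"url": "example.com"}',
]
parsed_rows = [extract_title_and_body(raw) for raw in samples]
print(parsed_rows)
```

In the notebook, something like `data.boilerplate.map(extract_title_and_body)` would give one tuple per row, which could then be split into the `title` and `body` columns.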

In [3]:
import spacy
nlp_toolkit = spacy.load("en")
nlp_toolkit

<spacy.lang.en.English at 0x10cfcf630>

In [4]:
title = u"IBM sees holographic calls, air breathing batteries"
parsed = nlp_toolkit(title)

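# Walk the parsed tokens and print the attributes spaCy attaches to each one
for word in parsed:
    print("Word: {}".format(word))
    # dep_ is the dependency label linking this token to its head
    print("\t Phrase type: {}".format(word.dep_))
    # head is the parent token in the dependency tree (the root is its own head)
    print("\t Father Node: {}".format(word.head.text))
    print("\t Is the word a known entity type? {}".format(
        word.ent_type_ if word.ent_type_ else "No"))
    print("\t Lemma: {}".format(word.lemma_))
    print("\t Parent of this word: {}".format(word.head.lemma_))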

Word: IBM
	 Phrase type: nsubj
	 Father Node: sees
	 Is the word a known entity type? ORG
	 Lemma: ibm
	 Parent of this word: see
Word: sees
	 Phrase type: ROOT
	 Father Node: sees
	 Is the word a known entity type? No
	 Lemma: see
	 Parent of this word: see
Word: holographic
	 Phrase type: amod
	 Father Node: calls
	 Is the word a known entity type? No
	 Lemma: holographic
	 Parent of this word: call
Word: calls
	 Phrase type: dobj
	 Father Node: sees
	 Is the word a known entity type? No
	 Lemma: call
	 Parent of this word: see
Word: ,
	 Phrase type: punct
	 Father Node: calls
	 Is the word a known entity type? No
	 Lemma: ,
	 Parent of this word: call
Word: air
	 Phrase type: compound
	 Father Node: breathing
	 Is the word a known entity type? No
	 Lemma: air
	 Parent of this word: breathing
Word: breathing
	 Phrase type: compound
	 Father Node: batteries
	 Is the word a known entity type? No
	 Lemma: breathing
	 Parent of this word: battery
Word: batteries
	 Phrase type: conj
	 Fat

In [5]:
# DEMO code to display the dependency tree.
from nltk import Tree

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_


[to_nltk_tree(sent.root).pretty_print() for sent in parsed.sents]

                sees                
  _______________|_____              
 |                   calls          
 |        _____________|_______      
 |       |             |   batteries
 |       |             |       |     
 |       |             |   breathing
 |       |             |       |     
IBM holographic        ,      air   



[None]
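
Each token stores a single head, and the root is its own head, which is all the information needed to rebuild the tree. As a rough illustration that needs neither spaCy nor nltk, the sketch below re-creates the tree from (word, head) pairs read off the output above. Keying by word text is a simplification that only works because no word repeats in this title. With real spaCy tokens, the token index `token.i` would be a safer key.

```python
from collections import defaultdict

# (word, head) pairs read off the parse shown above; the root points to itself.
heads = [
    ("IBM", "sees"), ("sees", "sees"), ("holographic", "calls"),
    ("calls", "sees"), (",", "calls"), ("air", "breathing"),
    ("breathing", "batteries"), ("batteries", "calls"),
]

# Invert the child -> head links into a head -> children map.
children = defaultdict(list)
root = None
for word, head in heads:
    if word == head:
        root = word
    else:
        children[head].append(word)

def render(node, depth=0):
    """Return the subtree under `node` as indented lines, one word per line."""
    lines = ["  " * depth + node]
    for child in children[node]:
        lines.extend(render(child, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```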

## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [6]:
def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# Take a look
data[data['references_organization']][['title']].head()

index|title
-----|-----
0|IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries
1|The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races
3|10 Foolproof Tips for Better Sleep
6|fashion lane American Wild Child
10|Business Financial News Breaking US International News


In [7]:
data['references_organization'].head()

0    True 
1    True 
2    False
3    True 
4    False
Name: references_organization, dtype: bool

## Exercise:

Let's write a function to identify titles that mention both an organization (ORG) and a person (PERSON).

In [8]:
## Exercise solution
def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG' for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data['references_org_person']][['title']].head()

index|title
-----|-----
29|Genevieve Morton Swimsuit by Tyler Rose Swimwear 2011 Sports Illustrated Swimsuit Photo Gallery genevieve morton - model - 2011 sports illustrated swimsuit edition - si.com genevieve morton on si swimsuit
44|Alyssa Miller Swimsuit by Charlie by Matthew Zink 2011 Sports Illustrated Swimsuit Photo Gallery alyssa miller - maui action - 2011 sports illustrated swimsuit edition - si.com alyssa miller on si swimsuit
89|4 Surprising Foods to Cook on the Grill Whisked Foodie 4 surprising foods to cook on the grill \| whisked foodie \| whisk up something delicious.
91|Heidi s Favorite Snacks Heidi Klum on AOL heidi's favorite snacks
105|Chicken and Spinach Casserole Martha Stewart Recipes chicken and spinach casserole
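
The ORG-and-PERSON check generalizes to any set of entity labels. The sketch below is one possible way to write that, not the notebook's own solution. The name `references_all`, the `FakeToken` stand-in, and the hand-assigned labels are hypothetical, and are used only so the example runs without a language model.

```python
from collections import namedtuple

def references_all(tokens, required_labels):
    """Return True if every label in required_labels appears on at least one token."""
    found = {token.ent_type_ for token in tokens}
    return set(required_labels) <= found

# Stand-in for spaCy tokens so the sketch runs without a language model.
# The entity labels are hand-assigned for illustration, not real model output.
FakeToken = namedtuple("FakeToken", ["text", "ent_type_"])
example = [
    FakeToken("Heidi", "PERSON"),
    FakeToken("Klum", "PERSON"),
    FakeToken("on", ""),
    FakeToken("AOL", "ORG"),
]

both = references_all(example, ["ORG", "PERSON"])
with_place = references_all(example, ["ORG", "GPE"])
print(both, with_place)
```

With the real pipeline, the first argument would be the parsed document, for example `references_all(nlp_toolkit(title), ['ORG', 'PERSON'])`.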


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

The columns are described below.

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|Number of links sharing at least 1 word with 1 other link / total number of links
commonlinkratio_2|double|Number of links sharing at least 1 word with 2 other links / total number of links
commonlinkratio_3|double|Number of links sharing at least 1 word with 3 other links / total number of links
commonlinkratio_4|double|Number of links sharing at least 1 word with 4 other links / total number of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Number of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or it has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### Demo: Use of the Count Vectorizer

In [9]:
titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X = vectorizer.transform(titles)
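
To make the resulting matrix more concrete, here is a hand-rolled sketch of a binary bag-of-words with unigrams and bigrams. It is a simplification and not scikit-learn's implementation: it skips stop-word removal and the `max_features` cut-off, and the two sample titles and the helper names `tokenize` and `ngrams` are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# Hypothetical titles, made up for illustration.
docs = ["Easy chocolate cake recipe", "Chocolate recipe ideas"]

# Learning the vocabulary corresponds to `fit`.
doc_terms = [set(ngrams(tokenize(doc))) for doc in docs]
vocabulary = sorted(set().union(*doc_terms))

# Building the 0/1 matrix corresponds to `transform` with binary=True.
matrix = [[1 if term in terms else 0 for term in vocabulary]
          for terms in doc_terms]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one title and each column one word or bigram, so a term such as `chocolate` is set to 1 in both rows while `cake recipe` is set only in the first.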

### Demo: Build a random forest model to predict evergreenness of a website using the title features

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']


scores = cross_val_score(model, X, y, scoring='accuracy')
print('CV Accuracy {}, Average Accuracy {}'.format(scores, scores.mean()))

CV Accuracy [0.7270884  0.73752535 0.73782468], Average Accuracy 0.7341461441883778
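
An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class. The helper name `majority_baseline_accuracy` and the toy labels are assumptions for illustration, so the printed number says nothing about the StumbleUpon data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels, made up for illustration (not the dataset's label column).
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]
baseline = majority_baseline_accuracy(toy_labels)
print("Baseline accuracy: {:.3f}".format(baseline))
```

In the notebook, `majority_baseline_accuracy(y)` would give the number to compare the cross-validated accuracy against. Because the random forest is randomized, scores also vary from run to run unless `random_state` is set on `RandomForestClassifier`.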


### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [11]:
# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X_text_features = vectorizer.transform(titles)

# Identify the features you want from the original dataset
other_features_columns = ['html_ratio', 'image_ratio']
other_features = data[other_features_columns]

# Stack them horizontally together
# This takes all of the word/n-gram columns and appends on two more columns for `html_ratio` and `image_ratio`
from scipy.sparse import hstack
X = hstack((X_text_features, other_features)).toarray()

scores = cross_val_score(model, X, y, scoring='accuracy')
print('CV Accuracy {}, Average Accuracy {}'.format(scores, scores.mean()))

# What features of these are most important?
model.fit(X, y)

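# Note: in scikit-learn >= 1.0 use list(vectorizer.get_feature_names_out()) instead;
# get_feature_names() was removed in 1.2
all_feature_names = vectorizer.get_feature_names() + other_features_columns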
feature_importances = pd.DataFrame({'Features' : all_feature_names, 'Importance Score': model.feature_importances_})
feature_importances.sort_values('Importance Score', ascending=False).head()

CV Accuracy [0.72506083 0.73306288 0.72199675], Average Accuracy 0.7267068202739684


index|Features|Importance Score
-----|--------|----------------
1000|html_ratio|0.156436
1001|image_ratio|0.094733
715|recipe|0.037826
721|recipes|0.021396
192|chocolate|0.012656
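
The indices 1000 and 1001 in the table come from the stacking order: the 1000 text columns come first and the two numeric columns are appended on the right, so the name list has to be concatenated in the same order. A small pure-Python sketch of that alignment follows. All values, including the importance scores, are made up for illustration.

```python
text_feature_names = ["chocolate", "recipe", "news"]
extra_feature_names = ["html_ratio", "image_ratio"]

# Toy rows: binary word indicators, then the two numeric columns (made-up values).
text_rows = [[1, 1, 0], [0, 0, 1]]
extra_rows = [[0.25, 0.10], [0.40, 0.02]]

# Horizontal stacking appends the extra columns to the right of the text columns.
stacked_rows = [text + extra for text, extra in zip(text_rows, extra_rows)]
all_names = text_feature_names + extra_feature_names

# Hypothetical importance scores, one per column, in column order.
importances = [0.05, 0.30, 0.10, 0.35, 0.20]
ranked = sorted(zip(all_names, importances),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

If the names were concatenated in a different order than the columns, the scores would silently be attributed to the wrong features, so checking that `len(all_names)` matches the number of columns is a cheap safeguard.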


### Exercise: Build a random forest model to predict evergreenness of a website using the body features

In [12]:
body_text = data['body'].fillna('')

# Use `fit` to learn the vocabulary
vectorizer.fit(body_text)

# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X = vectorizer.transform(body_text).toarray()

scores = cross_val_score(model, X, y, scoring='accuracy')
print('CV Accuracy {}, Average Accuracy {}'.format(scores, scores.mean()))

CV Accuracy [0.77128954 0.78052738 0.76948052], Average Accuracy 0.7737658135201849


### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [13]:
body_text = data['body'].fillna('')

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english')


# Use `fit` to learn the vocabulary
vectorizer.fit(body_text)

# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X = vectorizer.transform(body_text).toarray()

scores = cross_val_score(model, X, y, scoring='accuracy')
print('CV Accuracy {}, Average Accuracy {}'.format(scores, scores.mean()))

CV Accuracy [0.77696675 0.78864097 0.78530844], Average Accuracy 0.7836387209863136
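
For intuition about what changes when switching vectorizers, the sketch below computes TF-IDF weights by hand. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The three tiny documents are made up, and the code is an illustration rather than the library's implementation.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents, made up for illustration.
docs = [
    ["chocolate", "cake", "recipe"],
    ["chocolate", "news"],
    ["news", "today", "news"],
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the first document, `cake` outweighs `chocolate` because `chocolate` also appears in another document. A binary count vectorizer would have given both terms the same value of 1.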
