# spaCy Demo

If you haven't installed spaCy yet, use:
```
conda install spacy
python -m spacy.en.download
```
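This downloads about 500 MB of data. Note that `python -m spacy.en.download` applies to spaCy 1.x; from spaCy 2 onward, models are downloaded with `python -m spacy download en_core_web_sm` and loaded with `spacy.load("en_core_web_sm")`.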

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
conda install nltk
python -m nltk.downloader all
```

This also downloads a large amount of data.

## Load StumbleUpon dataset

In [2]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../../DS-SF-32/lessons/lesson-11/stumbleupon.tsv", sep='\t',
                  encoding="utf-8")
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
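The `boilerplate` column holds one JSON string per row. Below is a minimal sketch of the extraction step on hand-written records; the sample strings and the helper name `extract_field` are made up for illustration. Note that `dict.get` only falls back to its default when the key is absent, so a JSON `null` still comes through as `None`, which is presumably why the titles are passed through `fillna` later on.

```python
import json

def extract_field(boilerplate, field):
    """Return `field` from a boilerplate JSON string, or '' if absent or null."""
    value = json.loads(boilerplate).get(field, '')
    # .get() returns the default only for a missing key; map an explicit null to '' too
    return value if value is not None else ''

# Hypothetical records shaped like the `boilerplate` column
samples = [
    '{"title": "10 Foolproof Tips for Better Sleep", "body": "There was a period"}',
    '{"title": null, "body": "no title here"}',
    '{"body": "title key missing"}',
]

titles = [extract_field(s, 'title') for s in samples]
print(titles)
```

With the notebook's data this could be used as `data.boilerplate.map(lambda x: extract_field(x, 'title'))`.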


In [4]:
## Load spacy
import spacy
nlp_toolkit = spacy.load("en")


from spacy.en import English
nlp_toolkit = English()
nlp_toolkit

RuntimeError: Model 'en' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

Another way to load `spacy`:
```
import spacy
nlp_toolkit = spacy.load("en")
```

## If you got an error above:

+ Model 'en>=1.1.0,<1.2.0' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

Try running this in terminal:


    conda create -n spacy python

    source activate spacy

    conda install spacy

    python -m spacy.en.download
    
    python -c "import spacy; spacy.load('en')"
    
+ note: this didn't work for me


In [None]:
title = u"IBM sees holographic calls, air breathing batteries"
parsed = nlp_toolkit(title)

for word in parsed:
    print("Word: {}".format(word))
    print("\t Dependency label: {}".format(word.dep_))
    print("\t Is the word a known entity type? {}".format(
        word.ent_type_ if word.ent_type_ else "No"))
    print("\t Lemma: {}".format(word.lemma_))
    print("\t Parent of this word: {}".format(word.head.lemma_))

## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [None]:
def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# Take a look
data[data['references_organization']][['title']].head()

## Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

.
.
.
.
.
.
.
.

In [None]:
## Exercise solution

def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG' for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data['references_org_person']][['title']].head()
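The two functions above differ only in the entity labels they look for. One possible generalization, sketched here with hypothetical names (`references_all`, `FakeToken`, `fake_nlp`), takes the parser and the required labels as arguments. The stand-in parser exists only so the sketch runs without a spaCy model installed; it is not how spaCy recognizes entities.

```python
from collections import namedtuple

def references_all(title, nlp, labels):
    """True if the parsed title has at least one token for every label in `labels`."""
    found = {token.ent_type_ for token in nlp(title)}
    return set(labels) <= found

# Stand-in for a spaCy token: only the attribute used above
FakeToken = namedtuple('FakeToken', ['text', 'ent_type_'])

def fake_nlp(title):
    """Toy parser with a hard-coded lookup; NOT how spaCy finds entities."""
    known = {'IBM': 'ORG', 'Obama': 'PERSON'}
    return [FakeToken(word, known.get(word, '')) for word in title.split()]

has_both = references_all("Obama visits IBM", fake_nlp, ['ORG', 'PERSON'])
org_only = references_all("IBM sees holographic calls", fake_nlp, ['ORG', 'PERSON'])
print(has_both)
print(org_only)
```

In the notebook, the real parser would be passed instead: `data['title'].fillna(u'').map(lambda t: references_all(t, nlp_toolkit, ['ORG', 'PERSON']))`.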


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Number of <embed> tags used
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or it has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

> ### Let's try extracting some of the text content.
> ### Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have recipe in the title?

In [None]:
# Option 1: Create a function to check for this

def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except: 
        return 0
        
data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions

#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)


# Option 3: pandas string methods (case-insensitive; missing titles count as 0)
data['recipe'] = data['title'].str.contains('recipe', case=False, na=False).astype(int)
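To answer the question posed above, one option is to group by the new flag and average the label. This is a sketch on invented toy data (the `toy` frame is not the StumbleUpon data), so the printed rates say nothing about the real result.

```python
import pandas as pd

# Hypothetical toy data standing in for `data[['title', 'label']]`
toy = pd.DataFrame({
    'title': ['Easy Cake Recipe', 'Stock Market News', 'Best RECIPE for Bread',
              None, 'Election Results', 'Soup recipe'],
    'label': [1, 0, 1, 0, 0, 0],
})

# Case-insensitive match; missing titles are treated as "no recipe"
toy['recipe'] = toy['title'].str.contains('recipe', case=False, na=False).astype(int)

# The mean of a 0/1 label is the share of evergreen pages in each group
evergreen_rate = toy.groupby('recipe')['label'].mean()
print(evergreen_rate)
```

On the notebook's data the same comparison would be `data.groupby('recipe')['label'].mean()`.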

### Demo: Compare `CountVectorizer` and `TfidfVectorizer`



In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [127]:
cv = CountVectorizer(
    max_features = 1000,
    ngram_range=(1, 1), 
#     stop_words=None,  # keep every token; no stop words are removed
    stop_words='english',  # remove scikit-learn's built-in English stop words
    binary=False)

vectors = cv.fit_transform(["hey I am reid", "hey hey this is a second sentence"])
print(cv.vocabulary_)
print(vectors.toarray())  # dense array instead of the
                          # default sparse matrix, which is hard to read

# simply counts occurrences of words
# note that not every word is kept: "am", "this" and "is" are stop words,
# and the default tokenizer drops one-character tokens ("I", "a")
# column indices are assigned to the vocabulary in alphabetical order

{u'reid': 1, u'second': 2, u'hey': 0, u'sentence': 3}
[[1 1 0 0]
 [2 0 1 1]]
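To make the output above less of a black box, here is a rough pure-Python sketch of the same counting steps. It is an illustration, not scikit-learn's implementation: the stop list `STOP_WORDS` is a tiny hypothetical stand-in for the much longer built-in English list, and the helper `tokenize` is invented for this example.

```python
import re
from collections import Counter

# Small hypothetical stop list; scikit-learn's built-in English list is much longer
STOP_WORDS = {'am', 'is', 'this', 'the', 'and'}

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters (one-letter tokens are dropped)
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = ["hey I am reid", "hey hey this is a second sentence"]
tokenized = [tokenize(d) for d in docs]

# Column indices follow the alphabetical order of the vocabulary
terms = sorted(set(t for doc in tokenized for t in doc))
vocabulary = {term: i for i, term in enumerate(terms)}

counts = []
for doc in tokenized:
    row = [0] * len(vocabulary)
    for term, n in Counter(doc).items():
        row[vocabulary[term]] = n
    counts.append(row)

print(vocabulary)
print(counts)
```

Setting `binary=True` in `CountVectorizer` corresponds to replacing every nonzero count with 1.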


In [128]:
tfidfv = TfidfVectorizer(
    ngram_range=(1, 1), 
#     stop_words=None,
    stop_words='english',
    binary=False, 
#     norm='l2', # scales each row to unit L2 (Euclidean) length; this is the default
#     use_idf=True # weights terms by inverse document frequency
    use_idf=False
)



tfidfv_matrix = tfidfv.fit_transform(["hey I am reid", "hey hey this is a second sentence"])

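print(tfidfv_matrix.toarray())

tfidfv.vocabulary_

# with use_idf=False, these are raw term counts scaled by the L2 norm of each row
# (norm='l2' is the default); no inverse document frequency weighting is applied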


[[ 0.70710678  0.70710678  0.          0.        ]
 [ 0.81649658  0.          0.40824829  0.40824829]]


{u'hey': 0, u'reid': 1, u'second': 2, u'sentence': 3}
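The numbers in the matrix above can be reproduced by hand. The sketch below L2-normalizes the count rows from the `CountVectorizer` demo, and also shows the smoothed inverse document frequency that scikit-learn documents for `use_idf=True` with the default `smooth_idf=True`. The helper names `l2_normalize` and `smooth_idf` are made up for this illustration.

```python
import math

# Term counts from the CountVectorizer demo (columns: hey, reid, second, sentence)
counts = [[1, 1, 0, 0],
          [2, 0, 1, 1]]

def l2_normalize(row):
    # Divide each entry by the Euclidean length of the row (leave all-zero rows alone)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

tf = [l2_normalize(row) for row in counts]
for row in tf:
    print([round(v, 8) for v in row])

def smooth_idf(n_docs, doc_freq):
    # idf = ln((1 + n) / (1 + df)) + 1, the smoothed variant used when use_idf=True
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

idf_hey = smooth_idf(2, 2)   # "hey" appears in both documents
idf_reid = smooth_idf(2, 1)  # "reid" appears in only one document
print(idf_hey, idf_reid)
```

A term that appears in every document gets the lowest weight, so with `use_idf=True` the count of "hey" would be down-weighted relative to "reid" before the row is normalized.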

### Demo: Use of the Count Vectorizer

In [129]:
titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles)
# print X.A
# print vectorizer.vocabulary_

### Demo: Build a random forest model to predict evergreenness of a website using the title features

In [140]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']


model.fit(X, y)
# `sklearn.cross_validation` was removed in scikit-learn 0.20; use `sklearn.model_selection`
from sklearn.model_selection import cross_val_score

# cv=3 matches the three folds shown in the output below
text_scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(text_scores, text_scores.mean()))

CV AUC [ 0.78846794  0.80015053  0.80488319], Average AUC 0.797833885748
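One caveat about the cell above: the vocabulary is learned from all titles before cross-validation, so each held-out fold has already influenced the features. A possible alternative, sketched here under the assumption that `sklearn.pipeline.make_pipeline` is available, wraps the vectorizer and the model together so the vocabulary is re-learned on each training fold. The toy titles and labels are invented purely to make the sketch runnable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy titles and labels, only to make the sketch runnable
toy_titles = [
    "easy chocolate cake recipe", "best pasta recipe ever",
    "simple bread recipe", "grandma's cookie recipe",
    "healthy salad recipe ideas", "slow cooker chicken recipe",
    "stock market falls today", "election results announced tonight",
    "team wins the championship game", "new phone released this week",
    "storm hits the coast", "company reports quarterly earnings",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is fit inside each fold, on that fold's training titles only
pipeline = make_pipeline(
    CountVectorizer(max_features=1000, ngram_range=(1, 2),
                    stop_words='english', binary=True),
    RandomForestClassifier(n_estimators=20, random_state=0),
)

scores = cross_val_score(pipeline, toy_titles, toy_labels, cv=3, scoring='roc_auc')
print(scores.mean())
```

With the notebook's data the call would be `cross_val_score(pipeline, titles, data['label'], cv=3, scoring='roc_auc')`.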


### Exercise: Build a random forest model to predict evergreenness of a website using both title features and more "quantitative" features

In [131]:
data.head(2)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...


In [132]:
data.columns

Index([u'url', u'urlid', u'boilerplate', u'alchemy_category',
       u'alchemy_category_score', u'avglinksize', u'commonlinkratio_1',
       u'commonlinkratio_2', u'commonlinkratio_3', u'commonlinkratio_4',
       u'compression_ratio', u'embed_ratio', u'framebased', u'frameTagRatio',
       u'hasDomainLink', u'html_ratio', u'image_ratio', u'is_news',
       u'lengthyLinkDomain', u'linkwordscore', u'news_front_page',
       u'non_markup_alphanum_characters', u'numberOfLinks', u'numwords_in_url',
       u'parametrizedLinkRatio', u'spelling_errors_ratio', u'label', u'title',
       u'body'],
      dtype='object')

In [126]:
# from sklearn.ensemble import RandomForestClassifier

# model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
# vectorizer.fit(titles)

# # Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
# X = vectorizer.transform(titles).toarray()
# y = data['label']
print(data.shape)    # number of samples, number of columns
print(X.shape)       # number of samples, number of text features
print(y.shape)       # number of samples

(7395, 29)
(7395, 1000)
(7395,)


In [88]:
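# X may be a dense array (after `.toarray()`) or a sparse matrix; make it dense either way
Xarr = X.toarray() if hasattr(X, 'toarray') else X
print(type(Xarr))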

<type 'numpy.ndarray'>


In [89]:
dfX = pd.DataFrame( data = Xarr )

In [100]:
data.columns.values

array([u'url', u'urlid', u'boilerplate', u'alchemy_category',
       u'alchemy_category_score', u'avglinksize', u'commonlinkratio_1',
       u'commonlinkratio_2', u'commonlinkratio_3', u'commonlinkratio_4',
       u'compression_ratio', u'embed_ratio', u'framebased',
       u'frameTagRatio', u'hasDomainLink', u'html_ratio', u'image_ratio',
       u'is_news', u'lengthyLinkDomain', u'linkwordscore',
       u'news_front_page', u'non_markup_alphanum_characters',
       u'numberOfLinks', u'numwords_in_url', u'parametrizedLinkRatio',
       u'spelling_errors_ratio', u'label', u'title', u'body'], dtype=object)

In [102]:
data.dtypes

url                                object
urlid                               int64
boilerplate                        object
alchemy_category                   object
alchemy_category_score             object
avglinksize                       float64
commonlinkratio_1                 float64
commonlinkratio_2                 float64
commonlinkratio_3                 float64
commonlinkratio_4                 float64
compression_ratio                 float64
embed_ratio                       float64
framebased                          int64
frameTagRatio                     float64
hasDomainLink                       int64
html_ratio                        float64
image_ratio                       float64
is_news                            object
lengthyLinkDomain                   int64
linkwordscore                       int64
news_front_page                    object
non_markup_alphanum_characters      int64
numberOfLinks                       int64
numwords_in_url                   

In [109]:
# choose some features - keep only the float64 columns
float_features = data.loc[:, data.dtypes == 'float64']

In [114]:
float_features.head(2)

Unnamed: 0,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,frameTagRatio,html_ratio,image_ratio,parametrizedLinkRatio,spelling_errors_ratio
0,2.055556,0.676471,0.205882,0.047059,0.023529,0.443783,0.0,0.090774,0.245831,0.003883,0.152941,0.07913
1,3.677966,0.508021,0.28877,0.213904,0.144385,0.468649,0.0,0.098707,0.20349,0.088652,0.181818,0.125448


In [143]:
result = pd.concat([float_features, dfX], axis=1)  
# axis=1 means we're combining by adding new columns
# axis=0 (default) means we're combining by adding new rows
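A detail worth knowing about `pd.concat(..., axis=1)`: rows are matched by index label, not by position. In this notebook both frames carry the default `0..n-1` index, so the rows line up, but that would stop being true after filtering or shuffling one of them. The sketch below uses two tiny hypothetical frames (`numeric`, `text`) to show the failure mode and one way around it. It also gives the text columns string names, since recent scikit-learn releases expect all column names to be strings.

```python
import pandas as pd

# Hypothetical stand-ins for `float_features` and `dfX`
numeric = pd.DataFrame({'avglinksize': [2.06, 3.68]}, index=[10, 11])
text = pd.DataFrame([[0, 1], [1, 0]])  # default index 0, 1 and integer column names

# Indexes differ, so rows are NOT paired up: the result has 4 rows padded with NaN
misaligned = pd.concat([numeric, text], axis=1)
print(misaligned.shape)

# Reset both indexes to pair rows by position, and give the text columns string names
text.columns = ['word_{}'.format(i) for i in text.columns]
aligned = pd.concat([numeric.reset_index(drop=True),
                     text.reset_index(drop=True)], axis=1)
print(aligned.shape)
print(list(aligned.columns))
```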

In [91]:
result.head(2)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,990,991,992,993,994,995,996,997,998,999
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,0,0,0,0,0,0,0,0,0,0
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,0,0,0,0,0,0,0,0,0,0


In [150]:
# from sklearn.ensemble import RandomForestClassifier

model2 = RandomForestClassifier(n_estimators = 20)

# model2.fit(result, y)

In [151]:
# from sklearn.model_selection import cross_val_score

text_and_float_scores = cross_val_score(model2, result, y, cv=3, scoring='roc_auc')

In [152]:
print('CV AUC {}, Average AUC {}'.format(text_and_float_scores, text_and_float_scores.mean()))

CV AUC [ 0.79097354  0.81299045  0.80852192], Average AUC 0.804161970133


In [153]:
print('CV AUC {}, Average AUC {}'.format(text_scores, text_scores.mean()))

CV AUC [ 0.78846794  0.80015053  0.80488319], Average AUC 0.797833885748


In [155]:
def getAUCs(df, text_features_arr, other_features_arr):
    # create combined data frame
    dfText = pd.DataFrame( data = text_features_arr )
    dfOther = df[other_features_arr]
    dfX = pd.concat([dfOther, dfText], axis=1)
    model = RandomForestClassifier(n_estimators = 20)
#     model.fit(dfX, df['label'])
    scores = cross_val_score(model, dfX, df['label'], cv=3, scoring='roc_auc')
    return scores

In [162]:
X3 = data.loc[:, data.dtypes != 'object']

In [163]:
X3.head(2)

Unnamed: 0,urlid,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,framebased,frameTagRatio,...,html_ratio,image_ratio,lengthyLinkDomain,linkwordscore,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,4042,2.055556,0.676471,0.205882,0.047059,0.023529,0.443783,0.0,0,0.090774,...,0.245831,0.003883,1,24,5424,170,8,0.152941,0.07913,0
1,8471,3.677966,0.508021,0.28877,0.213904,0.144385,0.468649,0.0,0,0.098707,...,0.20349,0.088652,1,40,4973,187,9,0.181818,0.125448,1


In [165]:
new_scores = getAUCs(data, X, X3.columns.values)

In [166]:
print('CV AUC {}, Average AUC {}'.format(new_scores, new_scores.mean()))
# oops! we included the label (y) in X

CV AUC [ 0.99930654  0.99978623  0.99961397], Average AUC 0.999568916079
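The near-perfect AUC above is a symptom of target leakage: `label` is itself a non-object column, so it ended up among the features. A minimal sketch of one way to build the column list without it follows, using a made-up miniature frame `toy`. Excluding `urlid` as well is a judgment call on the assumption that an identifier carries no information about the page.

```python
import pandas as pd

# Hypothetical miniature of `data`, with the target and an ID among the numeric columns
toy = pd.DataFrame({
    'url': ['a.com', 'b.com', 'c.com'],
    'urlid': [4042, 8471, 1164],
    'avglinksize': [2.06, 3.68, 2.38],
    'numberOfLinks': [170, 187, 258],
    'label': [0, 1, 1],
})

numeric_cols = toy.select_dtypes(include='number').columns

# Drop the target (leakage) and the row identifier (not a property of the page)
excluded = {'label', 'urlid'}
feature_cols = [c for c in numeric_cols if c not in excluded]
print(feature_cols)
```

With the notebook's objects, the corrected call would then be `getAUCs(data, X, feature_cols)`.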


### Exercise: Build a random forest model to predict evergreenness of a website using the body features

In [None]:
## TODO

### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [None]:
## TODO