


# Predicting Evergreeness of Content with Decision Trees and Random Forests

## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a discovery and advertisement engine (a form of web search engine) that pushed recommendations of web content to its users.

Its features allowed users to discover and rate web pages, photos and videos personalized to their tastes and interests, using peer-sourcing, social-networking and advertising (sponsored pages) principles. The service shut down in June 2018.

A description of the columns within this dataset is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` elements contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?

Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 



In [1]:
import pandas as pd
import json

import seaborn as sb
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv("../assets/data/stumbleupon.tsv", sep='\t')
# The 'boilerplate' column holds a JSON string; parse it to extract the title and body
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

is_news_map = {"?": "No", "1": "Yes"}
data = data.replace({"is_news": is_news_map})

data = data[data.alchemy_category != "?"]


data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


A sample of URLs is below; label = 1 marks 'evergreen' websites.

In [2]:
data[['url', 'label']].head()

Unnamed: 0,url,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,0
1,http://www.popsci.com/technology/article/2012-...,1
2,http://www.menshealth.com/health/flu-fighting-...,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,1
4,http://bleacherreport.com/articles/1205138-the...,0


## Using the StumbleUpon dataset, we want to see if we can use the features to predict whether content is evergreen or not, using random forests and decision trees.

### 1. How many articles (evergreen and not) are there per category?

In [3]:
data.groupby(['alchemy_category', 'label'])[['label']].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,label
alchemy_category,label,Unnamed: 2_level_1
arts_entertainment,0,591
arts_entertainment,1,350
business,0,254
business,1,626
computer_internet,0,223
computer_internet,1,73
culture_politics,0,186
culture_politics,1,157
gaming,0,48
gaming,1,28


### 2. Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?


In [4]:
data['recipe'] = data['title'].str.contains('recipe')

data.groupby(['recipe'])[['label']].mean()

Unnamed: 0_level_0,label
recipe,Unnamed: 1_level_1
False,0.496223
True,0.903226
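
The `str.contains('recipe')` call above is case-sensitive and leaves missing titles as missing values. A minimal sketch of a more forgiving match is shown here on a few hypothetical toy titles (not rows from the dataset); whether it changes the percentages above has not been checked.

```python
import pandas as pd

# Hypothetical toy titles for illustration; the notebook itself uses data['title'].
titles = pd.Series(
    ["Easy Chicken Recipe", "best recipes for fall", "IBM earnings report", None]
)

# Case-sensitive match: misses the capitalized "Recipe".
strict = titles.str.contains("recipe", na=False)

# case=False ignores capitalization; na=False treats missing titles as "no match".
relaxed = titles.str.contains("recipe", case=False, na=False)

print(strict.tolist())
print(relaxed.tolist())
```

With `na=False` the result is a plain boolean column, so no rows are lost to missing titles in a later `dropna()`.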


### 3. Build a decision tree model to predict the "evergreeness" of a given website. Use 'image_ratio', 'html_ratio', 'recipe' as features.

In [5]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
# drop the target column so that only the input features remain in X
X.drop('label', axis=1, inplace=True)


In [6]:
# Fits the model
model.fit(X, y)
  

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [7]:
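# Helper function to visualize a decision tree (writes tree.dot, then renders tree.png)
# Requires the Graphviz `dot` executable to be installed and on the PATH

from sklearn.tree import export_graphviz
from os import system

def build_tree_image(model, feature_names):
    # Write the tree structure in Graphviz DOT format; the file is closed automatically
    with open("tree.dot", "w") as dotfile:
        export_graphviz(model, out_file=dotfile, feature_names=list(feature_names))
    # Convert the DOT file to a PNG image
    system("dot -Tpng tree.dot -o tree.png")

build_tree_image(model, X.columns)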
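
If Graphviz is not installed, scikit-learn's `export_text` can print a tree's rules as plain text instead. The sketch below is an illustration only: it fits a shallow tree on synthetic data from `make_classification`, and the feature names are borrowed from this notebook purely as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption; not the StumbleUpon dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ["image_ratio", "html_ratio", "recipe"]  # labels only

# A shallow tree keeps the printed rules short enough to read.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the split rules as an indented string.
rules = export_text(shallow_tree, feature_names=demo_names)
print(rules)
```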

### 4. Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.

In [8]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.52606179  0.56278469  0.55995011  0.54538054  0.53790385], Average AUC 0.5464161952196066
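
To make the cross-validated AUC less of a black box, the following sketch spells out roughly what `cross_val_score(..., scoring='roc_auc', cv=5)` does for a classifier: split the data into stratified folds, fit a fresh copy of the model on each training part, and score the predicted probabilities on the held-out part. It runs on synthetic data (an assumption for illustration), so the numbers are unrelated to the scores above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the StumbleUpon dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Fit an unfitted copy of the model on the training folds only.
    fold_model = clone(tree).fit(X_demo[train_idx], y_demo[train_idx])
    # AUC is computed from the predicted probability of the positive class.
    proba = fold_model.predict_proba(X_demo[test_idx])[:, 1]
    manual_scores.append(roc_auc_score(y_demo[test_idx], proba))

library_scores = cross_val_score(tree, X_demo, y_demo, scoring="roc_auc", cv=5)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

An AUC of 0.5 corresponds to random ranking, which is a useful baseline when reading the fold scores.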


### 5. Build a random forest model to predict the evergreeness of a website.

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### 6. Extract the importance of features from our random forest.

In [10]:
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

Unnamed: 0,Features,Importance Score
1,html_ratio,0.495749
0,image_ratio,0.468384
2,recipe,0.035867
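
The importance scores reported by a scikit-learn forest are normalized to sum to 1, and they are an average over the individual trees. A rough sketch of how one might also inspect the tree-to-tree spread is given below; it uses synthetic data, and the feature names are only labels borrowed from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the StumbleUpon dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ["image_ratio", "html_ratio", "recipe"]  # labels only

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One row of importances per fitted tree in the forest.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
spread = per_tree.std(axis=0)

for name, score, sd in zip(demo_names, forest.feature_importances_, spread):
    print(f"{name}: {score:.3f} (+/- {sd:.3f} across trees)")

print("sum of importances:", round(float(forest.feature_importances_.sum()), 6))
```

A large spread relative to the mean suggests that the ranking of two close features should not be over-interpreted.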


### 7. Evaluate the Random Forest model using cross-validation; increase the number of estimators and see how that affects predictive performance. Perform this over a range for the number of trees (e.g. up to 100, in steps of 5 or 10)

In [11]:
from sklearn.model_selection import cross_val_score

# cv=3 matches the three folds shown in the output below (newer scikit-learn versions default to 5)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# Re-evaluate with 1, 11, 21, ..., 91 trees
for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators = n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))

CV AUC [ 0.57982973  0.59923481  0.57499277], Average AUC 0.5846857724391424
n trees: 1, CV AUC [ 0.52931781  0.54227739  0.54632872], Average AUC 0.5393079704673157
n trees: 11, CV AUC [ 0.57455984  0.60545356  0.58552579], Average AUC 0.5885130632582932
n trees: 21, CV AUC [ 0.58912152  0.59130581  0.58447823], Average AUC 0.5883018533869735
n trees: 31, CV AUC [ 0.58358293  0.60483693  0.58621117], Average AUC 0.5915436779206481
n trees: 41, CV AUC [ 0.58435578  0.59704897  0.59871458], Average AUC 0.5933731064441742
n trees: 51, CV AUC [ 0.58533084  0.60326204  0.58235334], Average AUC 0.5903154087991935
n trees: 61, CV AUC [ 0.58943546  0.59679877  0.59182959], Average AUC 0.5926879418226618
n trees: 71, CV AUC [ 0.59293765  0.59723325  0.58683702], Average AUC 0.5923359713136226
n trees: 81, CV AUC [ 0.58671956  0.6054089   0.59290479], Average AUC 0.5950110840704244
n trees: 91, CV AUC [ 0.59218743  0.60100674  0.58520047], Average AUC 0.5927982127369172
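
Printed lines like the ones above are hard to compare at a glance. One possible way to organize such a sweep is to collect the mean and standard deviation of the fold scores into a DataFrame. The sketch below does this on synthetic data with a fixed `random_state` (both are assumptions for reproducibility, not part of the original notebook).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the StumbleUpon dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

rows = []
for n_trees in range(1, 100, 20):  # 1, 21, 41, 61, 81
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, scoring="roc_auc", cv=3)
    rows.append({"n_trees": n_trees, "mean_auc": scores.mean(), "std_auc": scores.std()})

results = pd.DataFrame(rows).set_index("n_trees")
print(results.round(3))
```

Comparing the change in `mean_auc` against `std_auc` helps judge whether adding trees gives a real gain or just fold-to-fold noise.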


### 8. Continue adding input variables/text features to the model that you think may be relevant


In [12]:
# Adding in text features

model = RandomForestClassifier(n_estimators=50)

# Check for keywords in the title
data['PhotoInTitle'] = data['title'].fillna('').str.lower().str.contains('photo').astype(int)
X = data[['image_ratio', 'html_ratio', 'recipe', 'PhotoInTitle', 'label']].dropna()
# Recompute the target from the same rows so that X and y stay aligned
y = X['label']
X.drop('label', axis=1, inplace=True)


scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))




CV AUC [ 0.58790392  0.61070483  0.59033196], Average AUC 0.5963135698185463
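
When trying several title keywords, a small helper avoids repeating the same line for each one. The function below, `add_keyword_flags`, is a hypothetical helper written for this illustration (it is not part of pandas or the original notebook), demonstrated on toy titles.

```python
import pandas as pd

def add_keyword_flags(frame, keywords, text_col="title"):
    """Return a copy of frame with one 0/1 column per keyword found in text_col."""
    out = frame.copy()
    # Lowercase once and treat missing text as empty so no NaN flags appear.
    lowered = out[text_col].fillna("").str.lower()
    for word in keywords:
        # regex=False matches the keyword literally.
        out[f"{word}_in_title"] = lowered.str.contains(word, regex=False).astype(int)
    return out

# Toy titles for illustration only.
toy = pd.DataFrame(
    {"title": ["Photo gallery: best recipes", "Election news today", None]}
)
flagged = add_keyword_flags(toy, ["photo", "news", "recipe"])
print(flagged)
```

The resulting flag columns can be appended to the feature list passed to the model in the same way as `PhotoInTitle` above.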
