


# Predicting Evergreeness of Content with Decision Trees and Random Forests

## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a discovery and advertisement engine (a form of web search engine) that pushed recommendations of web content to its users.

Its features allowed users to discover and rate web pages, photos, and videos personalized to their tastes and interests, using peer-sourcing, social-networking, and advertising (sponsored pages) principles. The service shut down in June 2018.

A description of the columns within this dataset is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?

Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 



In [None]:
import pandas as pd
import json

import seaborn as sb
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv("../assets/data/stumbleupon.tsv", sep='\t')
# The 'boilerplate' column holds JSON strings; parse each one to extract the title and body
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

is_news_map = {"?": "No", "1": "Yes"}
data = data.replace({"is_news": is_news_map})

data = data[data.alchemy_category != "?"]


data.head()

A sample of URLs is shown below; rows with label = 1 are 'evergreen' websites.

In [None]:
data[['url', 'label']].head()

## Using the StumbleUpon dataset, we want to see if we can use the features to predict whether content is evergreen or not, using random forests and decision trees.

### 1. How many articles (evergreen and not) are there per category?
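
One possible way to approach this question is to group by category and label and count rows. The sketch below is only an illustration: the helper name `articles_per_category` and the output column names `non_evergreen`, `evergreen`, and `total` are assumptions, not part of the original notebook.

```python
import pandas as pd

def articles_per_category(df):
    """Count articles per alchemy_category, split by evergreen label.

    Hypothetical helper: assumes `df` has 'alchemy_category' and 'label'
    columns, with label coded as 0 (non-evergreen) or 1 (evergreen).
    """
    counts = (
        df.groupby(["alchemy_category", "label"])
          .size()                      # number of rows per (category, label)
          .unstack(fill_value=0)       # one column per label value
          .rename(columns={0: "non_evergreen", 1: "evergreen"})
    )
    counts["total"] = counts.sum(axis=1)
    return counts.sort_values("total", ascending=False)

# In the notebook this could be called as:
# articles_per_category(data)
```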

### 2. Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?
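
A minimal sketch of one way to build this feature, assuming a case-insensitive substring match is acceptable and that missing titles should count as "no recipe". The helper names `add_recipe_flag` and `evergreen_rate_by_recipe` are hypothetical.

```python
import pandas as pd

def add_recipe_flag(df):
    """Return a copy of df with a 0/1 'recipe' column (hypothetical helper).

    Assumes a 'title' column of strings; missing titles are treated as empty.
    """
    out = df.copy()
    out["recipe"] = (
        out["title"]
        .fillna("")
        .str.lower()
        .str.contains("recipe", regex=False)  # plain substring match
        .astype(int)
    )
    return out

def evergreen_rate_by_recipe(df):
    """Share of evergreen pages (mean of the 0/1 label) for recipe = 0 and 1."""
    return df.groupby("recipe")["label"].mean()

# In the notebook this could be called as:
# data = add_recipe_flag(data)
# evergreen_rate_by_recipe(data)
```

Because `label` is coded 0/1, its mean within each group is the fraction of evergreen pages, which answers the "higher or lower" question directly.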


### 3. Build a decision tree model to predict the "evergreeness" of a given website. Use 'image_ratio', 'html_ratio', and 'recipe' as features.
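
A possible starting point is sketched below. It assumes the DataFrame already has a 0/1 `recipe` column and that the three features contain no missing values. The helper name `fit_evergreen_tree` and the choice of `max_depth=3` are assumptions made for readability, not settings taken from the notebook.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_evergreen_tree(df, features=("image_ratio", "html_ratio", "recipe"), max_depth=3):
    """Fit a shallow decision tree on the given features (hypothetical helper).

    Returns the fitted model together with the X and y it was trained on.
    """
    X = df[list(features)]
    y = df["label"]
    # A small max_depth keeps the tree easy to visualize; it is an assumption.
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X, y)
    return model, X, y

# In the notebook this could be called as:
# model, X, y = fit_evergreen_tree(data)
```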

In [None]:
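# Helper function to visualize a decision tree (writes tree.dot, then renders tree.png).
# Requires the Graphviz `dot` executable to be available on the PATH.

from sklearn.tree import export_graphviz
from os import system

def build_tree_image(model, feature_names):
    # Export the fitted tree in Graphviz DOT format
    with open("tree.dot", "w") as dotfile:
        export_graphviz(model,
                        out_file=dotfile,
                        feature_names=list(feature_names))
    # Convert the DOT file to a PNG image
    system("dot -Tpng tree.dot -o tree.png")

# `model` is the fitted decision tree and `X` the feature DataFrame from step 3
build_tree_image(model, X.columns)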

### 4. Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.
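
One way to do this is with scikit-learn's `cross_val_score` and `scoring="roc_auc"`. The sketch below is illustrative only: the helper name `cv_auc` and the default of 5 folds are assumptions.

```python
from sklearn.model_selection import cross_val_score

def cv_auc(model, X, y, folds=5):
    """Return the mean and standard deviation of cross-validated ROC AUC.

    Hypothetical helper: `model` is any unfitted or fitted scikit-learn
    classifier; cross_val_score clones and refits it on each fold.
    """
    scores = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")
    return scores.mean(), scores.std()

# In the notebook this could be called as:
# mean_auc, std_auc = cv_auc(model, X, y)
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.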

### 5. Build a random forest model to predict the evergreeness of a website.
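
A minimal sketch using scikit-learn's `RandomForestClassifier`, assuming `X` and `y` are the feature matrix and label from the decision tree step. The helper name `fit_evergreen_forest` and the default of 20 trees are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

def fit_evergreen_forest(X, y, n_estimators=20):
    """Fit a random forest classifier (hypothetical helper).

    n_estimators=20 is an arbitrary starting value, not a tuned setting.
    """
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(X, y)
    return forest

# In the notebook this could be called as:
# forest = fit_evergreen_forest(X, y)
```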

### 6. Extract the importance of features from our random forest.
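
A fitted scikit-learn forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below pairs them with column names; the helper name `feature_importances` is hypothetical.

```python
import pandas as pd

def feature_importances(forest, feature_names):
    """Return feature importances as a Series, most important first.

    Hypothetical helper: `forest` must already be fitted, and
    `feature_names` must be in the same order as the training columns.
    """
    return (
        pd.Series(forest.feature_importances_, index=list(feature_names))
          .sort_values(ascending=False)
    )

# In the notebook this could be called as:
# feature_importances(forest, X.columns)
```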

### 7. Evaluate the random forest model using cross-validation; increase the number of estimators and view how that improves predictive performance. Perform this over a range for the number of trees (e.g. up to 100, in steps of 5 or 10).
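
The loop described above could be sketched as follows, assuming AUC is still the metric of interest. The helper name `auc_by_n_estimators`, the default range of 5 to 100 in steps of 5, and the 5 folds are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def auc_by_n_estimators(X, y, tree_counts=range(5, 101, 5), folds=5):
    """Mean cross-validated ROC AUC for each forest size (hypothetical helper)."""
    results = {}
    for n in tree_counts:
        forest = RandomForestClassifier(n_estimators=n, random_state=0)
        scores = cross_val_score(forest, X, y, cv=folds, scoring="roc_auc")
        results[n] = scores.mean()
    # Index is the number of trees, values are the mean AUC across folds
    return pd.Series(results, name="mean_auc").rename_axis("n_estimators")

# In the notebook this could be called and plotted as:
# auc_by_n_estimators(X, y).plot()
```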

### 8. Continue adding input variables/text features to the model that you think may be relevant
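
As one illustration of a simple text feature, the sketch below adds word counts for the title and body. Whether these help is an open question to test with cross-validation. The helper name `add_text_length_features` and the new column names are assumptions.

```python
import pandas as pd

def add_text_length_features(df):
    """Return a copy of df with word-count features (hypothetical helper).

    Assumes 'title' and 'body' columns of strings; missing text counts as 0 words.
    """
    out = df.copy()
    for col in ("title", "body"):
        # Split on whitespace and count the resulting tokens
        out[f"{col}_word_count"] = out[col].fillna("").str.split().str.len()
    return out

# In the notebook this could be called as:
# data = add_text_length_features(data)
# X = data[["image_ratio", "html_ratio", "recipe", "title_word_count", "body_word_count"]]
```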
