# Predicting Evergreeness of Content with Decision Trees and Random Forests

In [8]:
import pandas as pd
import numpy as np
import json
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import tree
from IPython.display import Image
from sklearn.tree import export_graphviz
# sklearn.externals.six has been removed from scikit-learn; StringIO lives in the standard library
from io import StringIO
import pydotplus

In [2]:
# Read in the stumbleupon.tsv data - notice the use of the "sep" parameter here.
# This is tab-separated file and not a comma-separated file, so the "sep"
# parameter is set to "\t" to denote tab.
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head(2)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...


In [3]:
# Check the dimensions of your new dataset
data.shape

(7395, 29)

In [4]:
# Check for missing values
data.isnull().sum()[data.isnull().sum() != 0].sort_values(ascending=False)

body     57
title    12
dtype: int64

In [5]:
# Address missing values by dropping rows with missing values
data.dropna(inplace=True)

In [6]:
# Spot check your work
data.isnull().sum()[data.isnull().sum() != 0]

Series([], dtype: int64)

## Predicting "Evergreeness" of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. The columns are described below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are "evergreen" sites?

> #### Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

> #### In the sample of URLs below, rows where label = 1 are "evergreen" websites

In [None]:
data[['url', 'label']].head()

### Exercises to Get Started

 ### Exercise: 1. In a group: Brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.
 ###  Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- If you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?
- If you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?

### Split up and develop 1-3 of those features independently.

### Exercise: 3. Does being a news site affect evergreeness? 
Compute and/or plot the percentage of news related evergreen sites.

In [None]:
# seaborn (as sns) and %matplotlib inline were already set up in the first cell
# data.is_news.value_counts()

# is_news = True (1) if StumbleUpon's news classifier determines that this webpage is news

# Option 1: Find out P ( evergreen | is_news = 1) vs P ( evergreen | is_news = ?)

data.groupby(['is_news'])[['label']].mean()

In [None]:
# A visualized version of the output above
# (factorplot was renamed to catplot in seaborn 0.9 and later removed)
sns.catplot(x='is_news', y='label', kind='bar', data=data);

### Exercise: 4. Does category in general affect evergreeness? 

Compute and/or plot the rate of evergreen sites for all Alchemy categories.

### Exercise: 5. How many articles are there per alchemy category?
Compute and/or plot the count of evergreen sites for all Alchemy categories.
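
One possible way to approach Exercises 4 and 5 is a single `groupby` that returns the evergreen rate, the article count, and the evergreen count per category. The sketch below runs on a small made-up dataframe (`toy`, with invented values) rather than the real data; the same pattern should carry over to `data`, assuming its `alchemy_category` and `label` columns.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are invented for illustration.
toy = pd.DataFrame({
    'alchemy_category': ['business', 'business', 'recreation',
                         'recreation', 'recreation', 'health'],
    'label':            [0, 1, 1, 1, 0, 1],
})

# For a 0/1 label: mean = evergreen rate, count = number of articles,
# sum = number of evergreen articles in that category.
category_summary = (
    toy.groupby('alchemy_category')['label']
       .agg(['mean', 'count', 'sum'])
       .rename(columns={'mean': 'evergreen_rate',
                        'count': 'n_articles',
                        'sum': 'n_evergreen'})
       .sort_values('evergreen_rate', ascending=False)
)
print(category_summary)
```

A bar plot of `category_summary['evergreen_rate']` would then give the visual version of the same answer.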

> #### Let's try extracting some of the text content.
> ### Exercise: 6. Create a feature for the title containing 'recipe'. 
Is the % of evergreen websites higher or lower on pages that have recipe in the title?

In [None]:
# Option 1: Create a function to check for this

# Try/Except: If an error is encountered, the try block code execution
# is stopped and transferred down to the except block.
def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 0

# Map our newly created function to the title column in our dataset
# and assign the output of 1's and 0's to a new data column "recipe"
data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions

#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)

# Option 3: string functions

# Similar to above but the output will be bools
# (true/false) instead of 1's and 0's. case=False keeps the match
# case-insensitive, like options 1 and 2.

#data['recipe'] = data['title'].str.contains('recipe', case=False)

In [None]:
# In this newly created column, check to see how
# many pages had recipe in the title.
data.recipe.value_counts()
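
The counts alone do not answer the exercise question; comparing the mean of `label` for each value of the flag does. Here is a minimal sketch on a made-up dataframe (the titles and labels in `toy` are invented), assuming the same `title` / `label` / `recipe` column names used above.

```python
import pandas as pd

# Invented titles and labels, including one missing title
toy = pd.DataFrame({
    'title': ['Best Chicken Recipe Ever', 'Stock markets fall',
              'Easy cookie recipes', None, 'Election results'],
    'label': [1, 0, 1, 0, 0],
})

# Flag titles containing 'recipe' (case-insensitive, missing titles count as 0)
toy['recipe'] = (toy['title'].fillna('')
                             .str.lower()
                             .str.contains('recipe')
                             .astype(int))

# Evergreen rate for pages with and without 'recipe' in the title
recipe_rate = toy.groupby('recipe')['label'].mean()
print(recipe_rate)
```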

## Decision Trees in scikit-learn

###  Let's Explore Some Decision Trees:

`In essence, decision trees are just a sequence of "if this, then that" conditions.`

`1. They are non-parametric (no assumptions about the distribution(s) of your data).`<br>
`2. No coefficients`
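
To make the "if this, then that" idea concrete, here is a minimal sketch that fits a tiny tree on synthetic data and prints its learned rules with `export_text`. The data and the underlying rule (label is 1 whenever the first feature exceeds 0.5) are made up for illustration; only the feature names are borrowed from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: two uniform features, label depends only on the first one
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 1, size=(100, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text renders the fitted tree as nested if/else style rules
rules = export_text(toy_tree, feature_names=['image_ratio', 'html_ratio'])
print(rules)
```

The printed rules contain thresholds and class assignments only; there are no coefficients to interpret.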

 ### Demo: Build a decision tree model to predict the "evergreeness" of a given website. 

In [None]:
# Import the DecisionTree Classifier from scikit's tree module
from sklearn.tree import DecisionTreeClassifier

# Instantiate your dtc object and assign it to the variable name "model"
model = DecisionTreeClassifier()

# Select your features
X = data[['image_ratio', 'html_ratio', 'recipe']]

# Set your target variable
y = data['label']
    
    
# Fit the model
model.fit(X, y)

# Helper function to visualize Decision Trees (creates a file tree.png).
# Requires the Graphviz "dot" command to be installed and on the PATH.
from sklearn.tree import export_graphviz
from os import system
def build_tree_image(model):
    # The with-block closes tree.dot before "dot" reads it
    with open("tree.dot", 'w') as dotfile:
        export_graphviz(model, out_file=dotfile, feature_names=list(X.columns))
    system("dot -Tpng tree.dot -o tree.png")

build_tree_image(model)

### Now we're going to take a sample of our data and attempt to visualize it right inside the notebook.

`The full X and y would not output in Jupyter notebook. Attempts were a real kernel killer.`

### The way a decision tree works is that it attempts to find the decision that will best segregate the classes.

`For a two-class problem, entropy is 1 when the classes are perfectly balanced and 0 when every example belongs to the same class.`
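
The entropy values that appear in the tree nodes can be reproduced by hand. The helper below is an illustrative sketch (the function name `entropy` is ours, not part of scikit-learn) that computes Shannon entropy in bits from a list of class labels.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

balanced = entropy([0, 1, 0, 1])   # two classes, 50/50
pure = entropy([1, 1, 1, 1])       # a single class
skewed = entropy([0, 0, 0, 1])     # somewhere in between
print(balanced, pure, skewed)
```

A split is good when the weighted entropy of the child nodes is lower than the entropy of the parent; that reduction is the information gain.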

In [None]:
#graph_model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
graph_model = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Setting class_weight to 'balanced' weights samples inversely to their
# class frequency, which has an effect similar to replicating the
# minority class until the two classes have equal representation

# When proportion=True (an export_graphviz option), the output is the
# fraction of records for each class that have reached that node.

graph_X = X[:200]
graph_y = y[:200]

In [None]:
print(graph_X.shape)
print(graph_y.shape)

In [None]:
graph_model.fit(graph_X, graph_y)

In [None]:
dot_data = tree.export_graphviz(graph_model, out_file=None, feature_names=X.columns,
                                filled=True, rounded=True, special_characters=True) 
graph = pydotplus.graphviz.graph_from_dot_data(dot_data)
Image(graph.create_png())

`If we don't put any limits on the decision tree classifier, what do we see? It just continues splitting until entropy is zero (all examples in that node are in the same class).`

`Q: Why might that be a problem?`

`A. Overfitting. A common problem with decision trees.`
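
A quick way to see the overfitting is to compare accuracy on the training data with cross-validated accuracy. The sketch below uses synthetic, deliberately noisy data from `make_classification` (not the StumbleUpon data), so the numbers it prints are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns about 20% of the labels
X_syn, y_syn = make_classification(n_samples=500, n_features=5,
                                   n_informative=2, n_redundant=0,
                                   flip_y=0.2, random_state=42)

# An unrestricted tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=42)
deep_tree.fit(X_syn, y_syn)

train_acc = deep_tree.score(X_syn, y_syn)
cv_acc = cross_val_score(deep_tree, X_syn, y_syn,
                         cv=5, scoring='accuracy').mean()

print('Training accuracy:        {:.3f}'.format(train_acc))
print('Cross-validated accuracy: {:.3f}'.format(cv_acc))
```

The tree memorizes the training set, noise included, so the training score says little about how it will behave on unseen pages.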

 ### Demo: Evaluate the decision tree we just created using cross-validation; use AUC as the evaluation metric.

In [None]:
from sklearn.model_selection import cross_val_score

# model is the estimator, i.e. the object used to fit the data.
# X is your features, y is your target. You then select a scoring method -
# here we choose area under the ROC curve (AUC) as the evaluation metric.
# cv: the number of folds (default is 5 as of scikit-learn 0.22; it was 3 before)
scores = cross_val_score(model, graph_X, graph_y, scoring='roc_auc', cv=5)
print('CV AUC {}, \nAverage AUC {}'.format(scores, scores.mean()))

`How can we address overfitting concerns?`

###  Adjusting Decision Trees to Avoid Overfitting:

`DecisionTreeClassifier()` is scikit-learn's decision tree classifier class. Some important parameters are:

- **criterion**: The function used to measure the quality of a split. Sklearn supports “gini” for Gini impurity and “entropy” for information gain. The default is “gini”.
- **splitter**: The strategy used to choose the split at each node. “best” chooses the best split and “random” chooses the best random split. The default is “best”.
- **max_features**: The number of features to consider when looking for the best split. It accepts an integer, float, string, or None.
  - If an integer, that many features are considered at each split.
  - If a float, it is the fraction of features considered at each split.
  - If “sqrt”, then max_features=sqrt(n_features). Older scikit-learn versions also accepted “auto” for this; recent releases no longer do.
  - If “log2”, then max_features=log2(n_features).
  - If None, then max_features=n_features. The default is None.
- **max_depth**: The maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. The default is None.
- **min_samples_split**: The minimum number of samples required to split an internal node. If an integer, that value is the minimum number. If a float, it is a fraction of the total number of samples. The default is 2.
- **min_samples_leaf**: The minimum number of samples required to be at a leaf node. If an integer, that value is the minimum number. If a float, it is a fraction of the total number of samples. The default is 1.
- **max_leaf_nodes**: The maximum number of leaf nodes. If None, the number of leaf nodes is unlimited. The default is None.
- **min_impurity_split**: A threshold for stopping tree growth early. A node will split if its impurity is above the threshold; otherwise it is a leaf. This parameter was deprecated and then removed in scikit-learn 1.0; `min_impurity_decrease` is its replacement.
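
As a rough check that these parameters do what the list says, the sketch below fits a constrained tree on synthetic data and inspects the result. The data comes from `make_classification` and the specific parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, unrelated to the StumbleUpon dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)

constrained = DecisionTreeClassifier(criterion='entropy',
                                     max_depth=3,
                                     min_samples_leaf=5,
                                     random_state=0)
constrained.fit(X_syn, y_syn)

# In the fitted tree_ structure, leaves are the nodes with no left child (-1)
is_leaf = constrained.tree_.children_left == -1
leaf_sizes = constrained.tree_.n_node_samples[is_leaf]

print('Depth:', constrained.get_depth())
print('Number of leaves:', constrained.get_n_leaves())
print('Smallest leaf size:', leaf_sizes.min())
```

With `max_depth=3` the tree can have at most 8 leaves, and `min_samples_leaf=5` guarantees that no leaf is built on fewer than 5 training rows.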

 ### Demo: Control for overfitting in the decision model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf)

In [None]:
# Instantiating a Decision Tree Classifier with explicit setting of max_depth=2
# (instead of the default which is max_depth=None) and explicit setting of
# min_samples_leaf=5 (instead of the default which is min_samples_leaf=1)
model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5)

# Fit your model
model.fit(X, y)
build_tree_image(model)

 ### Demo: Build a random forest model to predict the evergreeness of a website. 

In [None]:
# Import RandomForestClassifier from scikit's ensemble module
from sklearn.ensemble import RandomForestClassifier

# Instantiate your Random Forest Classifier model object with n_estimators, or
# the number of trees in the forest, set to 20 (default is 100 as of scikit-learn 0.22; it was 10 before)
model = RandomForestClassifier(n_estimators = 20)

# Fit your model on the features (X) and the target (y)
model.fit(X, y)

### Demo: Extracting importance of features

In [None]:
# Set features variable with the names of the features in X
features = X.columns
print("Feature columns:", features)

# Set feature_importances variable using the attribute "feature_importances_".
# The higher the score, the more important the feature in that particular combination.
# If you changed the features in X it would impact the scores.
# Similar to coefficients in that respect.
feature_importances = model.feature_importances_
print("Feature Importance scores:", feature_importances)

# Create a dataframe of the features and their respective importance scores
features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})

# Sort the values by "Importance Score" with ascending in false to ensure
# the score appear from highest to lowest in the new dataframe
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df
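
To build intuition for what the importance scores mean, it can help to run the same steps on data where the answer is known in advance. The sketch below is an illustration on synthetic data: the column names `signal` and `noise` are hypothetical, with `signal` constructed to track the label and `noise` constructed to be unrelated to it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, size=400)
X_toy = pd.DataFrame({
    'signal': y_toy + rng.normal(scale=0.3, size=400),  # closely tied to the label
    'noise': rng.normal(size=400),                      # unrelated to the label
})

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Importances are normalized, so they sum to 1 across the features
importances = (pd.Series(forest.feature_importances_, index=X_toy.columns)
                 .sort_values(ascending=False))
print(importances)
```

Because the scores are relative shares, adding or removing a feature changes every other feature's score, which is why they should be read per feature set rather than as absolute values.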

In [None]:
# Seaborn's catplot (formerly factorplot) draws a categorical plot onto a FacetGrid WITH COLORS!
sns.catplot(x='Features', y='Importance Score', kind='bar', data=features_df);

 ### Demo: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
# cross_val_score(estimator, features, target, chosen scoring method) assigned to the variable "scores."
# scores will be a numpy array
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# A for loop for running a Random Forest Classifier with different values of n_estimators (the number of trees in your forest).
# The values run from 1 to 91 in steps of 10 - range(start, end, step) excludes the end value
for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators = n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc')
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))

##  Independent Practice: Evaluate Random Forest Using Cross-Validation

1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the features and create a quick plot to visually express the _importance_ of those features
3. **Bonus**: Just like the `recipe` feature, add in similar **text** features and evaluate the performance.

### 1. Build a model with more relevant features

- Instantiate your model object
- Define your features (X) and your target (y)
- Fit your model

### 2a. Evaluate predictive performance for the given feature set

`Here's an example from some ship people won't stop talking about:`

`scores = cross_val_score(model, titanic[features] (or X), titanic[target] (or y), cv=10)`

In [None]:
# Evaluate with cross_val_score


### 2b. Evaluating feature importances 

`If unsure how to proceed, review the demo earlier in the notebook regarding extracting the importance of features and the subsequent use of seaborn to visually display those feature importances.`

In [None]:
# Get columns and their importance scores


In [None]:
# Seaborn's catplot (formerly factorplot) STILL draws a categorical plot onto a FacetGrid WITH COLORS!
# What a time to be alive!


### 3. BONUS!!! Add in text features (strings, objects) and do it all over again!

`How many text column options do you have?`

In [None]:
# get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives the same counts
data.dtypes.value_counts()

`What are your text column options?`

In [None]:
#data.loc[:, data.dtypes == object].columns
data.select_dtypes(exclude=[np.number]).columns

### Now create a new model with one or more text features

In [None]:
# Here's an example:

#Check for keywords in the title
data['PhotoInTitle'] = data['title'].fillna('').str.lower().str.contains('photo').astype(int)

# New X with an additional feature


# Still the same target variable "label"


# Fit a model on your data


### Evaluate your new text-inclusive model

In [None]:
# cross_val_score(estimator, features, target, chosen scoring method)
# assigned to the variable "scores"



### Find and visually express the feature importances of your new model

In [None]:
# Get columns and their importance scores



In [None]:
# Seaborn's catplot (formerly factorplot) STILL draws a categorical plot onto a FacetGrid WITH COLORS!
# What a time to be alive!!!


### BUT WAIT, THERE'S MORE BONUSES! 

<a href="https://imgflip.com/i/253hro"><img src="https://i.imgflip.com/253hro.jpg" title="made at imgflip.com"/></a>

### You can also tune your Random Forest Classifier hyperparameters using GridSearchCV!

`Here's an example of a GridSearchCV parameter dictionary from good old kNN:`

`knn_dict = {
    'n_neighbors': [10, 12, 14, 16],
    'p': [1, 2],
    'weights': ['uniform', 'distance']}`


You can find parameter options here: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
# Import GridSearchCV from sklearn's model selection library
from sklearn.model_selection import GridSearchCV

print('Processing GridSearch. Please hold for the next available set of outputs.\n')
# Review the plethora of hyperparameter options for Random Forest Classifiers, and
# then fill in some options for each hyperparameter you selected in a similar format
# as the kNN example.
gd_parameters = {   }

# Instantiate a new model object
rf = RandomForestClassifier(random_state=42)

gd_model = GridSearchCV(rf, gd_parameters, n_jobs = -1, cv=10)

# Fit your model


# Find your best combination of parameters and
# the score associated with that combination
print(gd_model.best_params_)
#print(gd_model.best_estimator_)
print(gd_model.best_score_)
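
If you get stuck filling in the cell above, here is one possible shape for the whole workflow. It is a sketch on synthetic data; the grid values in `example_grid` are arbitrary picks for illustration rather than recommended settings, and `scoring='roc_auc'` is chosen here to match the metric used earlier (GridSearchCV otherwise defaults to the estimator's own score, accuracy for classifiers).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X and y
X_syn, y_syn = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=42)

# Hypothetical grid: 2 * 3 * 2 = 12 parameter combinations
example_grid = {
    'n_estimators': [10, 30],
    'max_depth': [2, 4, None],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      example_grid, scoring='roc_auc', cv=3)

# best_params_ and best_score_ only exist after fit has been called
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```

Keep in mind that the number of model fits is the number of combinations times the number of folds, so large grids with `cv=10` can take a while.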

### Classification not for you? Fancy yourself a regression aficionado? 
### That's just fine! Scikit's got your back.

`The Ames Housing Training and Test .csvs are available for you to explore the Regressor versions of Decision Trees
and Random Forests.`

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
    
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [None]:
# Import libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Load data
ames_train = pd.read_csv("../../assets/dataset/Ames_train.csv")
ames_test = pd.read_csv("../../assets/dataset/Ames_test.csv")
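
The regressors follow the same fit / cross-validate pattern as the classifiers, with a regression metric such as R² in place of AUC. The sketch below uses synthetic data from `make_regression` because the Ames column names are not shown in this notebook; swap in your own feature matrix and target once you have chosen them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the Ames features and sale price
X_reg, y_reg = make_regression(n_samples=300, n_features=5,
                               noise=10.0, random_state=42)

regressors = {
    'decision tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean cross-validated R^2 for each model
r2_by_model = {}
for name, reg in regressors.items():
    scores = cross_val_score(reg, X_reg, y_reg, scoring='r2', cv=5)
    r2_by_model[name] = scores.mean()
    print('{}: mean R^2 = {:.3f}'.format(name, r2_by_model[name]))
```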