<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Predicting "Greenness" Of Content

_Authors: Joseph Nelson (DC), Kiefer Katovich (SF)_

---


This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender, and was made available [here](https://www.kaggle.com/c/stumbleupon/download/train.tsv).

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonLinkRatio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonLinkRatio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` tag usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL that includes a domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300

### 1. Load the data
- Note that it is a `.tsv` file, so fields are separated by tabs instead of commas.
- Clean the `is_news` column.
- Make two new columns, `title` and `body`, from the `boilerplate` column.

> **Note:** The `boilerplate` column is in JSON dictionary format. You can use the `json.loads()` function from the `json` module to convert each value into a Python dictionary.

In [2]:
df = pd.read_table('../data/evergreen_sites.tsv')

In [3]:
# A: 
df['is_news']

0       1
1       1
2       1
3       1
4       1
5       ?
6       1
7       ?
8       1
9       ?
10      1
11      ?
12      1
13      ?
14      ?
15      ?
16      1
17      1
18      1
19      1
20      1
21      ?
22      ?
23      1
24      1
25      1
26      1
27      ?
28      1
29      ?
       ..
7365    ?
7366    ?
7367    ?
7368    1
7369    ?
7370    ?
7371    ?
7372    1
7373    1
7374    1
7375    1
7376    ?
7377    1
7378    ?
7379    1
7380    ?
7381    ?
7382    1
7383    1
7384    ?
7385    ?
7386    ?
7387    1
7388    1
7389    ?
7390    1
7391    1
7392    ?
7393    1
7394    ?
Name: is_news, Length: 7395, dtype: object

In [4]:
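# '?' marks a missing value; regex=False makes pandas treat it as a literal character, not a regex quantifier
df['is_news'] = df['is_news'].str.replace('?', '0', regex=False).astype(int)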
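
The remaining task in this section, building `title` and `body` from `boilerplate`, could be sketched as follows. The three sample rows and the helper name `extract_field` are made up for illustration; the sketch assumes each `boilerplate` value is a JSON object that may or may not contain `title` and `body` keys.

```python
import json
import pandas as pd

# Hypothetical rows standing in for the real `boilerplate` column.
sample = pd.DataFrame({
    'boilerplate': [
        '{"title": "Easy banana bread", "body": "Mash the bananas", "url": "example com"}',
        '{"title": "Election results", "body": null}',
        '{"body": "A page without a title"}',
    ]
})

def extract_field(raw, key):
    """Parse one JSON string and return the value under `key`, or None if absent."""
    return json.loads(raw).get(key)

# Using .get() means a missing key yields a missing value instead of a KeyError.
sample['title'] = sample['boilerplate'].apply(lambda raw: extract_field(raw, 'title'))
sample['body'] = sample['boilerplate'].apply(lambda raw: extract_field(raw, 'body'))

print(sample[['title', 'body']])
```

On the real data the same two `apply` calls would run against `df['boilerplate']`. Parsing each string twice is wasteful on a large frame; parsing once into a temporary column of dictionaries is a reasonable alternative.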

### 2. What are 'evergreen' sites?
- These are websites that are always relevant, like recipes or reviews (as opposed to current events).
- Stored as a binary indicator in the `label` column.
- Look at some examples.

In [6]:
# A:
df['label'].head(10)

0    0
1    1
2    1
3    1
4    0
5    0
6    1
7    0
8    1
9    1
Name: label, dtype: int64

### 3. Does being a news site affect green-ness?

**3.A Investigate with plots/EDA.**

In [7]:
ndf = df[['is_news', 'label']]

In [8]:
# A:
ndf.corr()

| |is_news|label
---|---|---
is_news|1.0|0.009103
label|0.009103|1.0


In [9]:
pd.crosstab(df['is_news'], df['label'], margins=True)

is_news (rows) / label (columns)|0|1|All
---|---|---|---
0|1400|1443|2843
1|2199|2353|4552
All|3599|3796|7395
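
Raw counts are hard to compare because the two groups differ in size. A small sketch that turns the counts above into within-group shares is shown here; the counts are copied from the crosstab, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: is_news, columns: label).
counts = pd.DataFrame(
    {0: [1400, 2199], 1: [1443, 2353]},
    index=pd.Index([0, 1], name='is_news'),
)
counts.columns.name = 'label'

# Divide each row by its total to get the share of each label within a group.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(4))
```

The evergreen share comes out at roughly 0.508 for `is_news == 0` and 0.517 for `is_news == 1`, a difference of under one percentage point. On the full DataFrame, `pd.crosstab(df['is_news'], df['label'], normalize='index')` should give the same shares directly.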


**3.B Test the hypothesis with a logistic regression using statsmodels.**

> **Hint:** The `sm.logit` function from `statsmodels.formula.api` will perform a logistic regression using a formula string.

In [10]:
import statsmodels.formula.api as sm

In [11]:
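from scipy import stats
# Compatibility shim: older statsmodels releases call stats.chisqprob, which newer SciPy versions removed.
stats.chisqprob = lambda chisq, dof: stats.chi2.sf(chisq, dof)

# Fit label ~ is_news with the formula API imported above as `sm`.
model = sm.logit('label ~ is_news', data=df)
result = model.fit()
result.summary()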

Optimization terminated successfully.
         Current function value: 0.692751
         Iterations 3


Statistic|Value|Statistic|Value
---|---|---|---
Dep. Variable:|label|No. Observations:|7395
Model:|Logit|Df Residuals:|7393
Method:|MLE|Df Model:|1
Date:|Thu, 10 May 2018|Pseudo R-squ.:|5.98e-05
Time:|10:40:46|Log-Likelihood:|-5122.9
converged:|True|LL-Null:|-5123.2
| | |LLR p-value:|0.4337

| |coef|std err|z|P>\|z\||[0.025|0.975]
---|---|---|---|---|---|---
Intercept|0.0303|0.038|0.806|0.420|-0.043|0.104
is_news|0.0374|0.048|0.783|0.434|-0.056|0.131


In [12]:
# A:
# Fit a scikit-learn logistic regression and store the class predictions.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()  # create the estimator

X = df[['is_news']]  # double brackets keep X two-dimensional (a DataFrame), as scikit-learn expects
y = df['label']      # target vector

logreg.fit(X, y)          # fit the model
pred = logreg.predict(X)  # class predictions on the training data

logreg.score(X, y)  # mean accuracy on the training data

0.5133198106828939

**3.C Interpret the results of your model.**

In [None]:
# A:
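
One possible reading of the results above, offered as a sketch rather than the definitive answer: the `is_news` coefficient (0.0374) has a p-value of 0.434 and a confidence interval that spans zero, so the model gives no evidence that news status changes the odds of a page being evergreen. The code below converts the reported coefficients into an odds ratio and predicted probabilities; the numbers are copied from the summary and crosstab, and the helper `sigmoid` is defined here for illustration.

```python
import math

# Coefficients copied from the statsmodels summary above.
intercept = 0.0303
coef_is_news = 0.0374

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# exp(coef) is the multiplicative change in the odds when is_news goes from 0 to 1.
odds_ratio = math.exp(coef_is_news)

# Predicted probability of label == 1 for each group.
p_not_news = sigmoid(intercept)
p_news = sigmoid(intercept + coef_is_news)

# Share of evergreen pages overall, from the crosstab totals.
majority_share = 3796 / 7395

print(f"odds ratio for is_news: {odds_ratio:.3f}")
print(f"P(evergreen | is_news=0): {p_not_news:.3f}")
print(f"P(evergreen | is_news=1): {p_news:.3f}")
print(f"share of evergreen pages: {majority_share:.4f}")
```

The odds ratio is about 1.04, and both predicted probabilities sit just above 0.5 (close to the group shares in the crosstab). That is consistent with the scikit-learn accuracy of 0.5133: it equals the overall evergreen share, which is what a model predicting `1` for every row would score.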

### 4. Does the website category affect green-ness?

**4.A Investigate with plots/EDA.**

In [None]:
# A:
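
A minimal sketch of one way to start this EDA: compare the evergreen rate and page count across `alchemy_category` values. The rows below are toy data, not values from the real file, and the sketch assumes missing categories are encoded as `'?'` in the same way as `is_news`.

```python
import pandas as pd

# Toy data for illustration only; category names and labels are made up.
toy = pd.DataFrame({
    'alchemy_category': ['recreation', 'recreation', 'business', 'business',
                         'business', '?', '?', 'health', 'health', 'health'],
    'label':            [1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
})

# Give the assumed missing-value marker a readable name.
toy['alchemy_category'] = toy['alchemy_category'].replace('?', 'unknown')

# Evergreen rate and number of pages per category, highest rate first.
summary = (
    toy.groupby('alchemy_category')['label']
       .agg(evergreen_rate='mean', n_pages='count')
       .sort_values('evergreen_rate', ascending=False)
)
print(summary)
```

Keeping `n_pages` next to the rate helps avoid over-reading categories with very few pages. With seaborn already imported, `sns.barplot(x='label', y='alchemy_category', data=df)` is one way to plot the same comparison.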

**4.B Test the hypothesis with a logistic regression.**

In [None]:
# A:
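
A categorical predictor has to be turned into indicator columns before scikit-learn can use it. The sketch below does this with `pd.get_dummies` on toy data (category names and labels are invented); with statsmodels, a formula such as `'label ~ C(alchemy_category)'` is a comparable route.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only: 6 pages in each of three made-up categories.
toy = pd.DataFrame({
    'alchemy_category': ['recreation'] * 6 + ['business'] * 6 + ['health'] * 6,
    'label': [1, 1, 1, 1, 1, 0,
              0, 0, 0, 0, 1, 0,
              1, 0, 1, 0, 1, 0],
})

# One indicator column per category; drop_first=True drops the alphabetically
# first category ('business'), which becomes the baseline.
X = pd.get_dummies(toy['alchemy_category'], drop_first=True).astype(int)
y = toy['label']

model = LogisticRegression()
model.fit(X, y)

# Each coefficient is the change in log-odds relative to the baseline category.
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

Note that `LogisticRegression` applies L2 regularization by default, so its coefficients are shrunk toward zero and will not match an unregularized statsmodels fit exactly.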

**4.C Interpret the model results.**

In [None]:
# A:

### 5. Does the image ratio affect green-ness?

**5.A Investigate with plots/EDA.**

In [None]:
# A:

**5.B Test the hypothesis using a logistic regression.**

> **Note**: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.

In [None]:
# A:
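
The note above suggests transforming `image_ratio` before modeling. Three common options are sketched here on toy values, assuming the real column is heavily right-skewed (worth confirming with a histogram first); the new column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy values, not from the real file: mostly small ratios plus a few extreme ones.
toy = pd.DataFrame({
    'image_ratio': [0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 25.0, 110.0]
})

# Option 1: percentile rank in (0, 1], which is insensitive to extreme values.
toy['image_ratio_pctl'] = toy['image_ratio'].rank(pct=True)

# Option 2: log transform to compress the long right tail (requires values > -1).
toy['image_ratio_log'] = np.log1p(toy['image_ratio'])

# Option 3: quartile bins, usable as a categorical predictor.
toy['image_ratio_bin'] = pd.qcut(toy['image_ratio'], q=4, labels=['q1', 'q2', 'q3', 'q4'])

print(toy)
```

The percentile version keeps only the ordering of pages, so a one-unit change in it is easy to describe (lowest to highest page), at the cost of discarding the original scale.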

**5.C Interpret the model.**

In [None]:
# A:

### 6. Fit a logistic regression with multiple predictors.
- The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.
- Do any EDA you may need.
- Interpret the coefficients of the model.

> **Tip:** [This pdf is very useful for an overview of interpreting logistic regression coefficients.](https://www.unm.edu/~schrader/biostat/bio2/Spr06/lec11.pdf)

In [None]:
# A:
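
As a starting template, the sketch below fits a logistic regression on several predictors and converts the coefficients to odds ratios. The data is synthetic: the column names are borrowed from the table at the top, but the values and the "true" effects are made up so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for real columns; values are random, not from the dataset.
X = pd.DataFrame({
    'html_ratio': rng.normal(size=n),
    'linkwordscore': rng.normal(size=n),
    'is_news': rng.integers(0, 2, size=n),
})

# Made-up data-generating process: one positive effect, one negative, one null.
true_logit = 1.0 * X['html_ratio'] - 1.0 * X['linkwordscore'] + 0.0 * X['is_news']
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Standardizing puts the coefficients on a comparable per-standard-deviation scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

coefs = pd.Series(pipe.named_steps['logisticregression'].coef_[0], index=X.columns)
odds_ratios = np.exp(coefs)
print(pd.DataFrame({'coef': coefs, 'odds_ratio': odds_ratios}))
```

An odds ratio above 1 means the odds of `label == 1` rise as the (standardized) predictor increases, below 1 means they fall, and a value near 1 suggests little association. Swapping in real columns from `df` only requires replacing `X` and `y`.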