# Project -- Online News Popularity -- Case Study

In [3]:
# Data Set: Online News Popularity Data Set 

Source: https://archive.ics.uci.edu/ml/datasets/online+news+popularity#
<br> Note: Don't use the file from this URL. Rather use the dataset attached with project as column names are changed to remove spaces.


**Attribute Information:**<br>
Number of Attributes: 61 including target column -- shares

0. url: URL of the article 
1. timedelta: Days between the article publication and the dataset acquisition
2. n_tokens_title: Number of words in the title 
3. n_tokens_content: Number of words in the content 
4. n_unique_tokens: Rate of unique words in the content 
5. n_non_stop_words: Rate of non-stop words in the content 
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content 
7. num_hrefs: Number of links 
8. num_self_hrefs: Number of links to other articles published by Mashable 
9. num_imgs: Number of images 
10. num_videos: Number of videos 
11. average_token_length: Average length of the words in the content 
12. num_keywords: Number of keywords in the metadata 
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'? 
14. data_channel_is_entertainment: Is data channel 'Entertainment'? 
15. data_channel_is_bus: Is data channel 'Business'? 
16. data_channel_is_socmed: Is data channel 'Social Media'? 
17. data_channel_is_tech: Is data channel 'Tech'? 
18. data_channel_is_world: Is data channel 'World'? 
19. kw_min_min: Worst keyword (min. shares) 
20. kw_max_min: Worst keyword (max. shares) 
21. kw_avg_min: Worst keyword (avg. shares) 
22. kw_min_max: Best keyword (min. shares) 
23. kw_max_max: Best keyword (max. shares) 
24. kw_avg_max: Best keyword (avg. shares) 
25. kw_min_avg: Avg. keyword (min. shares) 
26. kw_max_avg: Avg. keyword (max. shares) 
27. kw_avg_avg: Avg. keyword (avg. shares) 
28. self_reference_min_shares: Min. shares of referenced articles in Mashable 
29. self_reference_max_shares: Max. shares of referenced articles in Mashable 
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable 
31. weekday_is_monday: Was the article published on a Monday? 
32. weekday_is_tuesday: Was the article published on a Tuesday? 
33. weekday_is_wednesday: Was the article published on a Wednesday? 
34. weekday_is_thursday: Was the article published on a Thursday? 
35. weekday_is_friday: Was the article published on a Friday? 
36. weekday_is_saturday: Was the article published on a Saturday? 
37. weekday_is_sunday: Was the article published on a Sunday? 
38. is_weekend: Was the article published on the weekend? 
39. LDA_00: Closeness to LDA topic 0 
40. LDA_01: Closeness to LDA topic 1 
41. LDA_02: Closeness to LDA topic 2 
42. LDA_03: Closeness to LDA topic 3 
43. LDA_04: Closeness to LDA topic 4 
44. global_subjectivity: Text subjectivity 
45. global_sentiment_polarity: Text sentiment polarity 
46. global_rate_positive_words: Rate of positive words in the content 
47. global_rate_negative_words: Rate of negative words in the content 
48. rate_positive_words: Rate of positive words among non-neutral tokens 
49. rate_negative_words: Rate of negative words among non-neutral tokens 
50. avg_positive_polarity: Avg. polarity of positive words 
51. min_positive_polarity: Min. polarity of positive words 
52. max_positive_polarity: Max. polarity of positive words 
53. avg_negative_polarity: Avg. polarity of negative words 
54. min_negative_polarity: Min. polarity of negative words 
55. max_negative_polarity: Max. polarity of negative words 
56. title_subjectivity: Title subjectivity 
57. title_sentiment_polarity: Title polarity 
58. abs_title_subjectivity: Absolute subjectivity level 
59. abs_title_sentiment_polarity: Absolute polarity level 
60. shares: Number of shares (target)


In [4]:
#### This notebook is divided into two parts
#### Part 1 -- Explore and understand the data and, if required, perform wrangling
#### Part 2 -- Apply various modeling techniques and use Root Mean Square Error (RMSE) to evaluate models

In [8]:
# Import Libraries

## Import the usual libraries

In [12]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

import seaborn as sns
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls


# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import PCA

ModuleNotFoundError: No module named 'plotly'

In [None]:
#### Start of Part I --  Explore, Understand the Data and If required perform Wrangling
#### Get the Data

** Use pandas to read data as a dataframe called df.**

In [None]:
df = pd.read_csv("OnlineNewsPopularity.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.columns.values

In [None]:
df.isnull().any()

** Q1. Looking at the data above what are your first thoughts about quality of data and modeling? ** 

In [None]:
df.drop(["url","timedelta"],axis = 1, inplace = True)
#df.describe()

### Correlation of Features
Let's find the correlation among features (very important for successful modelling).

We will plot the correlation matrix using a Plotly heatmap.

In [None]:
data = [
    go.Heatmap(
        z= df.astype(float).corr().values,
        x=df.columns.values,
        y=df.columns.values,
        colorscale='Viridis',
        reversescale = False,
        text = True ,
        opacity = 1.0
        
    )
]


layout = go.Layout(
    title='Correlation of features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700,
    
)


fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')

** Q2. What inference can be drawn from correlation heatmap? **

Quite a few features are not correlated with each other, which is good for modelling.
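
To go beyond reading the heatmap by eye, a minimal sketch like the following lists the feature pairs whose absolute correlation exceeds a cutoff. The toy DataFrame, its column names and the 0.8 cutoff are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(101)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),                        # independent of "a"
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary cutoff (assumption)
print(high_pairs)
```

The same three lines after `toy.corr()` could be applied to `df.corr()` to shortlist redundant features.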

#### Start Of Part 2 -- Apply various modeling techniques

#### Common imports for modeling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

In [None]:
# To ensure output is same each time code is run
random_state = 101     

Function to split the data

In [None]:
def get_data(df_data,test_size=0.3):
    X = df_data.copy()
    X.drop("shares",axis=1, inplace=True,)
    y = df_data["shares"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X,y,X_train, X_test, y_train, y_test

Function to perform Simple Linear Regression 

In [None]:
def perf_linear_regression(df_data,standscalar=False):
    X,y,X_train, X_test, y_train, y_test = get_data(df_data)
    if(standscalar):
        print ("Regression after applying StandardScaler")
        X_scaler = StandardScaler()
        X_train = X_scaler.fit_transform(X_train)
        X_test = X_scaler.transform(X_test)
        
#         y_scaler = StandardScaler()
# #         y_train = y_scaler.fit_transform(y_train[:, None])[:, 0]
# #         y_test = y_scaler.transform(y_test[:, None])[:, 0]
#         y_train = y_scaler.fit_transform(y_train)
#         y_test = y_scaler.transform(y_test)

       
     # End of If for StandardScaler  
    lm = LinearRegression()
    lm.fit(X_train,y_train)
    y_pred = lm.predict(X_test)
    y_train_pred = lm.predict(X_train)
    print ("RMSE on Train Data is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))) )
    print ("RMSE on Test Data  is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_test, y_pred))) )
    return X,y,X_train, X_test, y_train, y_test

** Approach 1: Linear Regression ** <br>
** Q3: What is the RMSE on Test Data for Linear Regression? **

In [None]:
X,y,X_train, X_test, y_train, y_test = perf_linear_regression(df)
print ("Median Value of Shares:", y.median())

** Q4. Challenge: Why should Linear Regression be used here? **

4. Linear Regression is a good starting point when we need to predict a continuous variable. On the speed vs. accuracy trade-off, it scores on speed. The RMSE value from Linear Regression is a good indication that it is not sufficient on its own.
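
To judge whether an RMSE value is "good", it helps to compare it with a trivial baseline. The sketch below computes RMSE by hand and evaluates a model that always predicts the median; the five share counts are made-up numbers used only for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error: square root of the mean squared difference
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

toy_shares = np.array([900, 1100, 1400, 2000, 3000])         # hypothetical values
baseline = np.full(toy_shares.shape, np.median(toy_shares))  # always predict the median
baseline_rmse = rmse(toy_shares, baseline)
print("Baseline RMSE: {:.2f}".format(baseline_rmse))
```

A regression model is only useful if its test RMSE is clearly below such a baseline.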

 ** Approach 2:  Use Standard Scaler to scale the data ** <br>
StandardScaler removes the mean and scales the data to unit variance. Standardization of a dataset is a common requirement for many machine learning estimators 

Calling Regression with Standard Scaler

** Q5: Is the RMSE better than in Approach 1 (Linear Regression) in this case? If Not Why ? **

In [None]:
X,y,X_train, X_test, y_train, y_test = perf_linear_regression(df,True)

Standard scaling behaves badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with zero mean and unit variance). StandardScaler cannot guarantee balanced feature scales in the presence of outliers. Refer to http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html <br>
** So it looks like standard scaling is a poor choice here **
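
The effect of outliers on standardization can be seen on a tiny example. This sketch applies the same per-feature formula as StandardScaler, (x - mean) / std, to two made-up columns that differ only in one extreme value.

```python
import numpy as np

def standardize(col):
    # Same per-feature transformation as StandardScaler: (x - mean) / std
    return (col - col.mean()) / col.std()

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # one extreme value

clean_scaled = standardize(clean)
outlier_scaled = standardize(with_outlier)

# Spread of the four "ordinary" points after scaling
clean_spread = clean_scaled[:4].max() - clean_scaled[:4].min()
outlier_spread = outlier_scaled[:4].max() - outlier_scaled[:4].min()
print(clean_spread, outlier_spread)
```

In the second column the single outlier inflates the standard deviation, so the ordinary points are squeezed into a very narrow band and the two features are no longer on comparable scales.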

Let's analyse the target value y and see whether there are outliers that we can get rid of.

In [None]:
print ("\nPrinting shares count distribution data")
data = y.value_counts(ascending=False)
print (data)

# #print (y.value_counts(ascending=False))

# # print ("\nPrinting share count")
# # print (y.sort_values(ascending=False) )

** Approach 3: Detect Outlier and remove them from data set ** <br>

In [2]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(y, kde=False, color="green");


In [None]:
data = [go.Bar(
            x=y.value_counts().index.values,
            # Use log of value_count to make the graph easier to read
            #y= np.log2(y.value_counts().values) 
            y= y.value_counts().values
    )]

py.iplot(data, filename='data-basic-bar',image_width=1200, image_height=1800)

** Q6: Which data, based on shares count, should be dropped from the dataset? ** <br>
You can use just a visual cue here; we will take a more formal approach in the Capstone.
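
As a complement to the visual cue, quantiles give a quick numeric feel for where the long tail starts. The sketch below uses a synthetic heavy-tailed sample as a stand-in for the `shares` column (an assumption, not the real data) and a Tukey-style upper fence, which is one common rule of thumb and not necessarily the approach used later.

```python
import numpy as np

rng = np.random.default_rng(101)
# Synthetic heavy-tailed stand-in for the `shares` column (assumption, not real data)
toy_shares = np.round(rng.lognormal(mean=7.5, sigma=1.0, size=10_000))

for q in (0.90, 0.95, 0.99):
    print("{:.0%} of rows have at most {:.0f} shares".format(q, np.quantile(toy_shares, q)))

# Tukey-style upper fence: Q3 + 1.5 * IQR
q1, q3 = np.quantile(toy_shares, [0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
kept_fraction = float(np.mean(toy_shares <= upper_fence))
print("Upper fence: {:.0f}, fraction of rows kept: {:.3f}".format(upper_fence, kept_fraction))
```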

In [None]:
# # Lets get  a scatter plot of shares and respective counts
# y_unique = y.unique()
# sns.regplot(y_unique,y_unique,data=y,scatter=True)

** We can try multiple values, but here I will drop the rows above 100,000 shares and then the rows above 20,000 shares **

Define a generic function to filter data

In [None]:
def filter_threshold_data(df_copy,threshold,column_name):
    df_adjusted = df_copy[df_copy[column_name] <= threshold]
    print (" Original Data Count:", len(df_copy))
    print (" After Adjusting Data Count:", len(df_adjusted))
    return df_adjusted

In [None]:
df_adjusted = filter_threshold_data(df,100000,"shares")

** Q7: What is RMSE for data filtered at 100,000 and 20,000? What inference can you draw?**

In [None]:
X,y,X_train, X_test, y_train, y_test = perf_linear_regression(df_adjusted)

In [None]:
df_adjusted = filter_threshold_data(df,20000,"shares")
X,y,X_train, X_test, y_train, y_test = perf_linear_regression(df_adjusted)

7. RMSE for filter 100,000: 5756.43; RMSE for filter 20,000: 2670.63 <br>
The key point is that outliers skew the model, hence it is important to drop them from the dataset.
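
Because RMSE squares each error, a handful of extreme targets dominates it. The following sketch, on made-up numbers, shows how much the RMSE of a trivial "predict the mean" model drops once a single extreme value is filtered out, which is worth keeping in mind when comparing RMSE values across differently filtered datasets.

```python
import numpy as np

def mean_baseline_rmse(y):
    # RMSE of a model that always predicts the mean of y
    return float(np.sqrt(np.mean((y - y.mean()) ** 2)))

toy_shares = np.array([800.0, 1200.0, 1500.0, 2200.0, 3100.0, 250_000.0])  # hypothetical
filtered = toy_shares[toy_shares <= 20_000]  # same kind of cut as filter_threshold_data

rmse_all = mean_baseline_rmse(toy_shares)
rmse_filtered = mean_baseline_rmse(filtered)
print("RMSE with outlier: {:.2f}".format(rmse_all))
print("RMSE after filter: {:.2f}".format(rmse_filtered))
```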


** Approach 4: Let's try to improve the model by using DecisionTreeRegressor for regression ** <br>

** Q8. Why should we use a Decision Tree? **

Reasons for using Decision Tree
1. Decision trees implicitly perform variable screening or feature selection
2. Nonlinear relationships between parameters do not affect tree performance
3. It is easy to interpret
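
The first two reasons can be illustrated with a small sketch. The synthetic data below is an assumption made for the example: only column 0 drives the target, through a non-linear step, and the fitted tree's `feature_importances_` attribute shows that it picks that column up on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(101)
X_toy = rng.uniform(size=(500, 3))
# Only column 0 drives the target (as a step function); columns 1 and 2 are noise
y_toy = 10.0 * (X_toy[:, 0] > 0.5) + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=3, random_state=101)
tree.fit(X_toy, y_toy)
importances = tree.feature_importances_  # one value per feature, summing to 1
print(importances)
```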

#### Using Decision Tree Regressor
http://scikit-learn.org/stable/modules/tree.html#tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
def decision_tree_regression(max_depth,X_train,y_train,X_test,y_test):
    model = DecisionTreeRegressor(max_depth=max_depth)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    print ("Tree Max Depth is:", max_depth)
    print ("RMSE on Train Data is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))) )
    print ("RMSE on Test Data is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_test, y_pred))) )
    #print (model.decision_path)

    return

** Q9. For what max_depth do you get the lowest RMSE on Test Data? ** <br>

9. The RMSE on test data is lowest (2698.59) at max_depth = 5.

** Q10. Why does RMSE increase after a certain max_depth? Is this a sign of overfitting? **

This is very important to understand. Beyond a certain max_depth the tree starts to overfit: more nodes mean a more complicated model. A sign of overfitting is a training error that is significantly lower than the test error. In this case it starts happening from max_depth 15.
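
The widening gap between training and test error can be reproduced on synthetic data. This is a minimal sketch under assumed data (a noisy sine curve) and assumed depths; it is not a rerun of the experiment on the news dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(101)
X_toy = rng.uniform(-3, 3, size=(600, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.5, size=600)  # signal plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)

gaps = {}
for depth in (1, 3, 20):
    model = DecisionTreeRegressor(max_depth=depth, random_state=101).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    gaps[depth] = test_rmse - train_rmse  # a large positive gap signals overfitting
    print("depth={:2d} train={:.3f} test={:.3f}".format(depth, train_rmse, test_rmse))
```

A very deep tree memorises the noise in the training set, so its training RMSE keeps falling while the test RMSE does not follow.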

In [None]:
df_adjusted = filter_threshold_data(df,20000,"shares")
X,y,X_train, X_test, y_train, y_test = get_data(df_adjusted)
max_depth_count = len(X.columns)

depth_range = [1,2,5,10,15,20,25,30,35,40,max_depth_count]
print ("Starting Decision Tree Regression:")
for depth in depth_range:
    decision_tree_regression(depth,X_train,y_train,X_test,y_test)
    
# End of decision tree testing 

** Approach 5: Combining multiple techniques ** <br>
We will use the following
1. Use MinMaxScaler to Scale the Data
2. Use Principal Component Analysis to restrict the no. of dimensions
3. Use Ensemble technique AdaBoost With DecisionTreeRegressor to further improve the model

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import AdaBoostRegressor

** Q11. Why MinMaxScaler should be used ? **

The MinMaxScaler is one of the most widely used scaling algorithms. For each feature it applies:

$$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

It essentially shrinks the range so that each feature lies between 0 and 1 (scikit-learn's default; a different range such as -1 to 1 can be requested through the `feature_range` parameter).
This scaler works better in cases where the standard scaler might not work so well: if the distribution is not Gaussian or the standard deviation is very small, the min-max scaler works better.
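
The formula can be checked by hand on a small array. The values below are arbitrary; note that with the default settings the result spans 0 to 1 even when the input contains a negative value.

```python
import numpy as np

x = np.array([-4.0, 0.0, 6.0, 16.0])  # includes a negative value on purpose
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```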

Applying MinMaxScaler

** Q12. While applying min max scaling to normalize your features, do you apply min max scaling on the entire dataset before splitting it into training, validation and test data? -- My Favourite Question **

It is very important to understand this point.
Split first, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way. To reiterate: split, fit the scaler on your training data, then apply that same fitted scaling to the test data.
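
A small sketch of the "split, then scale" rule, using made-up one-column arrays: the scaler learns its minimum and maximum from the training rows only, so a test value outside the training range ends up outside 0 to 1, just as unseen real-world data would.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])  # 40 lies outside the training range

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training min/max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```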


In [None]:
df_adjusted = filter_threshold_data(df,20000,"shares")
X,y,X_train, X_test, y_train, y_test = get_data(df_adjusted,)
min_max_norm = MinMaxScaler()
X_train_norm = min_max_norm.fit_transform(X_train)
X_test_norm = min_max_norm.transform(X_test)

#### Applying PCA -- Principal Component Analysis

** Q13. What is the significance of using PCA here ? **

We generally do not want to feed a large number of features directly into a machine learning algorithm, since some features may be irrelevant or the "intrinsic" dimensionality may be smaller than the number of features. PCA reduces the number of dimensions and makes the model leaner.

Get the PCA components that explain up to 95% of the variance <br>
** Q14. How many features are there after variance is limited to 95%? **

In [None]:
pca = PCA()
pca.fit(X_train_norm)
X_train_normreduced =  pd.DataFrame(pca.transform(X_train_norm))
X_train_normreduced = X_train_normreduced.loc[:,pca.explained_variance_ratio_.cumsum()<0.95]

In [None]:
print (X_train_normreduced.shape)
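
An alternative way to pick the component count, sketched here on synthetic data (an assumption, not the real features): either locate the first cumulative ratio that reaches 95%, or pass a float to `PCA(n_components=...)` and let scikit-learn choose. Note that a `cumsum() < 0.95` mask keeps only the components strictly below the threshold, so it stops one component short of actually reaching 95%.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(101)
# Toy data: 3 strong directions plus 7 weak noise directions (assumed)
X_toy = rng.normal(size=(300, 10)) * np.array([10, 8, 6] + [0.5] * 7)

cumulative = PCA().fit(X_toy).explained_variance_ratio_.cumsum()
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)  # first count that reaches 95%

# A float in (0, 1) asks PCA to keep enough components for that share of variance
pca_95 = PCA(n_components=0.95).fit(X_toy)
print(n_needed, pca_95.n_components_)
```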

In [None]:
# We got 22 features for around 95% variance. Let's use this info for further modelling

In [None]:
pca = PCA(n_components=22)
df_adjusted = filter_threshold_data(df,20000,"shares")
X,y,X_train, X_test, y_train, y_test = get_data(df_adjusted,)
min_max_norm = MinMaxScaler()
X_train_norm = min_max_norm.fit_transform(X_train)
X_test_norm = min_max_norm.transform(X_test)

X_train_norm  = pca.fit_transform(X_train_norm)
X_test_norm = pca.transform(X_test_norm)
X_train_norm =  pd.DataFrame(X_train_norm)
X_test_norm = pd.DataFrame(X_test_norm)

In [None]:
# Transforming y . We can choose different log base also such as 2, 10 etc
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
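
A minimal sketch of the round trip behind this transformation, on made-up values: the model is trained on `log(y)`, and predictions are mapped back with `exp` before RMSE is computed in shares. `np.log` needs strictly positive targets; the `log1p`/`expm1` pair shown at the end is a common alternative if zeros could occur, and is an addition for illustration rather than something the notebook uses.

```python
import numpy as np

y_toy = np.array([1.0, 500.0, 1400.0, 20_000.0])  # strictly positive, like share counts
y_toy_log = np.log(y_toy)        # compress the long right tail
y_roundtrip = np.exp(y_toy_log)  # invert before computing RMSE in shares

# If zeros were possible, np.log1p / np.expm1 would be a safer pair
y_with_zero = np.array([0.0, 9.0])
y_zero_roundtrip = np.expm1(np.log1p(y_with_zero))
print(y_roundtrip, y_zero_roundtrip)
```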

** Q15. Why use AdaBoostRegressor ? ** <br>

An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.
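
The round-by-round behaviour described above can be observed with `staged_predict`, which yields the ensemble's prediction after each boosting round. The sketch uses synthetic data and arbitrary settings (tree depth 2, 25 rounds), all of which are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(101)
X_toy = rng.uniform(-3, 3, size=(400, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=400)

# Weak learner passed positionally, as in the notebook's own call
booster = AdaBoostRegressor(DecisionTreeRegressor(max_depth=2), n_estimators=25, random_state=101)
booster.fit(X_toy, y_toy)

# Training RMSE of the ensemble after each boosting round
staged_rmse = [float(np.sqrt(np.mean((y_toy - pred) ** 2)))
               for pred in booster.staged_predict(X_toy)]
print(len(booster.estimators_), staged_rmse[0], staged_rmse[-1])
```

Boosting can stop early, so the number of fitted estimators may be lower than `n_estimators`.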

Generic function to use AdaBoostRegression for different dataset and estimator values

In [None]:
def ada_boost_regression(X_train_norm,y_train_log,X_test_norm,y_test_log,n_estimators):
    AdaDecision = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4, min_samples_leaf= 5, min_samples_split= 5),
                                n_estimators=n_estimators)
    AdaDecision.fit(X_train_norm, y_train_log)
    y_pred = AdaDecision.predict(X_test_norm)
    y_pred = pd.DataFrame(y_pred)
    
    y_train_pred = pd.DataFrame(AdaDecision.predict(X_train_norm) )
    y_train_orig_pred = y_train_pred.apply(lambda x: np.exp(x))
    
    y_orig_pred = y_pred.apply(lambda x: np.exp(x)) # Convert Output Back from Log to Original Value
    print ("Estimator Count in AdaBoostRegressor:", n_estimators)
    print ("RMSE on Train Data is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_train, y_train_orig_pred))) )
    print ("RMSE on Test Data is: {:.2f} :".format( np.sqrt(metrics.mean_squared_error(y_test, y_orig_pred))) )
    return

** Q16. For which n_estimators values do you see lowest RMSE for Test Data ? **

Try changing the values in n_estimators_list and see what RMSE you get

In [None]:
n_estimators_list = [10,20,30,40,50,100,125,150]
for n_estimator in n_estimators_list:
    ada_boost_regression(X_train_norm,y_train_log,X_test_norm,y_test_log,n_estimator)

** Q17. Compare the result above with DecisionTreeRegressor approach. What inference can you draw ? ** 

Unlike DecisionTreeRegressor, AdaBoostRegressor does not overfit the data as the number of estimators increases.

** Q18. What is your final inference ? **

## Welcome to the real world of Machine Learning!
We started with Linear Regression, used scaling, removed outliers, and tried DecisionTree and AdaBoost with PCA. We improved RMSE on test data from 11100 to around 2700 while reducing the number of features from 58 to 22. No mean feat!
Yet something is missing here. In spite of trying various models, our RMSE did not improve much after filtering to records with <= 20,000 shares.
Why? The fundamental question could be: is the data correct and sufficient?
The data is of course correct, but let's explore the sufficiency part.
It is very difficult to evaluate the popularity of an article based on numerical features alone. Popularity could also depend on the sentiments captured by the article and on the timing of publication relative to events occurring at the time, and these are some of the features that seem to be missing from the data.
Remember that you will not always arrive at an optimal model, and that is the point at which you should start looking beyond the presented dataset.


### END