# Exploratory Data Analysis - DSCI 522 Group 8

## Authors

* Linhan Cai

GitHub link to group repository: https://github.com/UBC-MDS/online_news_popularity


## Introduction
A project assessing factors associated with online news popularity in accordance with DSCI 522 (Data Science Workflows) for the Master of Data Science Program at the University of British Columbia.

Initial exploratory data analysis of the OnlineNewsPopularity dataset. Using this Online News Popularity dataset, our aim is to identify any missing data and to find out what analyses the dataset can support.

Creators: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com),
                 Pedro Vinagre (pedro.vinagre.sousa '@' gmail.com) and
                 Pedro Sernadela
Donor: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com)
Date: May, 2015

Past Usage:
       K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
       Support System for Predicting the Popularity of Online News. Proceedings
       of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
       September, Coimbra, Portugal.

       -- Results: 
          -- Binary classification as popular vs unpopular using a decision
             threshold of 1400 social interactions.
          -- Experiments with different models: Random Forest (best model),
             Adaboost, SVM, KNN and Naïve Bayes.
          -- Recorded 67% of accuracy and 0.73 of AUC.
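
The popular vs unpopular target described in the results above can be sketched as follows. This is a minimal illustration, not the paper authors' code: the toy `shares` values are made up, and treating exactly 1400 shares as popular (`>=`) is an assumption, since the summary only names the threshold.

```python
import pandas as pd

# Hypothetical share counts; the real values live in the dataset's shares column.
shares = pd.Series([593, 711, 1500, 1200, 1400, 3600])

THRESHOLD = 1400  # decision threshold quoted in the Past Usage notes

# Assumption: an article counts as popular when shares >= THRESHOLD.
is_popular = (shares >= THRESHOLD).astype(int)

print(is_popular.tolist())
```

Whether the boundary value belongs to the popular class changes the class balance slightly, so it is worth checking against the paper before reusing this rule.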
  

## EDA:

### Import Packages

In [31]:
import numpy as np
import pandas as pd
#from pandas_profiling import ProfileReport
import altair as alt
import matplotlib.pyplot as plt
import mglearn
from imageio import imread
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from pandas_profiling import ProfileReport
import seaborn as sns

In [32]:
# Packages necessary for data splitting
from sklearn.model_selection import train_test_split

### Import the Dataset

In [33]:
ONP_csv = pd.read_csv("../data/raw/OnlineNewsPopularity/OnlineNewsPopularity.csv")

### Explore Dataset Features

In [34]:
ONP_csv.head(10)

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505
5,http://mashable.com/2013/01/07/beewi-smart-toys/,731.0,10.0,370.0,0.559889,1.0,0.698198,2.0,2.0,0.0,...,0.136364,0.6,-0.195,-0.4,-0.1,0.642857,0.214286,0.142857,0.214286,855
6,http://mashable.com/2013/01/07/bodymedia-armba...,731.0,8.0,960.0,0.418163,1.0,0.549834,21.0,20.0,20.0,...,0.1,1.0,-0.224479,-0.5,-0.05,0.0,0.0,0.5,0.0,556
7,http://mashable.com/2013/01/07/canon-poweshot-n/,731.0,12.0,989.0,0.433574,1.0,0.572108,20.0,20.0,20.0,...,0.1,1.0,-0.242778,-0.5,-0.05,1.0,0.5,0.5,0.5,891
8,http://mashable.com/2013/01/07/car-of-the-futu...,731.0,11.0,97.0,0.670103,1.0,0.836735,2.0,0.0,0.0,...,0.4,0.8,-0.125,-0.125,-0.125,0.125,0.0,0.375,0.0,3600
9,http://mashable.com/2013/01/07/chuck-hagel-web...,731.0,10.0,231.0,0.636364,1.0,0.797101,4.0,1.0,1.0,...,0.1,0.5,-0.238095,-0.5,-0.1,0.0,0.0,0.5,0.0,710


In [35]:
ONP_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 61 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   url                             39644 non-null  object 
 1    timedelta                      39644 non-null  float64
 2    n_tokens_title                 39644 non-null  float64
 3    n_tokens_content               39644 non-null  float64
 4    n_unique_tokens                39644 non-null  float64
 5    n_non_stop_words               39644 non-null  float64
 6    n_non_stop_unique_tokens       39644 non-null  float64
 7    num_hrefs                      39644 non-null  float64
 8    num_self_hrefs                 39644 non-null  float64
 9    num_imgs                       39644 non-null  float64
 10   num_videos                     39644 non-null  float64
 11   average_token_length           39644 non-null  float64
 12   num_keywords                   

In [36]:
ONP_csv.describe(include = 'all')

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
count,39644,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,...,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0
unique,39644,,,,,,,,,,...,,,,,,,,,,
top,http://mashable.com/2013/01/07/amazon-instant-...,,,,,,,,,,...,,,,,,,,,,
freq,1,,,,,,,,,,...,,,,,,,,,,
mean,,354.530471,10.398749,546.514731,0.548216,0.996469,0.689175,10.88369,3.293638,4.544143,...,0.095446,0.756728,-0.259524,-0.521944,-0.1075,0.282353,0.071425,0.341843,0.156064,3395.380184
std,,214.163767,2.114037,471.107508,3.520708,5.231231,3.264816,11.332017,3.855141,8.309434,...,0.071315,0.247786,0.127726,0.29029,0.095373,0.324247,0.26545,0.188791,0.226294,11626.950749
min,,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,1.0
25%,,164.0,9.0,246.0,0.47087,1.0,0.625739,4.0,1.0,1.0,...,0.05,0.6,-0.328383,-0.7,-0.125,0.0,0.0,0.166667,0.0,946.0
50%,,339.0,10.0,409.0,0.539226,1.0,0.690476,8.0,3.0,1.0,...,0.1,0.8,-0.253333,-0.5,-0.1,0.15,0.0,0.5,0.0,1400.0
75%,,542.0,12.0,716.0,0.608696,1.0,0.75463,14.0,4.0,4.0,...,0.1,1.0,-0.186905,-0.3,-0.05,0.5,0.15,0.5,0.25,2800.0


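Initial data exploration shows that we have different types of features (binary, numerical, and a text `url` column). Every column displayed in the outputs above reports 39644 non-null values, so no missing values are apparent so far. The dataset is large, with 39644 rows and 61 columns. Note that most column names carry a leading space (for example ` shares`), which the code below has to account for.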
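
Whether any column actually contains missing values can be checked directly rather than read off the summaries. The sketch below uses a small made-up frame (`toy`) in place of `ONP_csv`; the column names mimic the leading spaces seen in the real data, and the single `NaN` is planted for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ONP_csv; note the leading spaces in the column names.
toy = pd.DataFrame({
    "url": ["a", "b", "c"],
    " n_tokens_title": [12.0, 9.0, np.nan],
    " shares": [593, 711, 1500],
})

# Count missing values per column and show only the columns that have any.
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Strip stray whitespace so columns can be addressed as "shares", not " shares".
toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```

Applying the same two steps to `ONP_csv` would confirm the missing-value picture and, if the column names are stripped, later cells would need to use the stripped names.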

### Pandas Profiling
Use Pandas Profiling to create a report about the dataset, including information about each feature and possible correlations between features. The report is written to an interactive HTML file named `pandas_profiling.html`.

In [37]:
'''
profile = ProfileReport(ONP_csv, title='ONP')
#profile.to_notebook_iframe() # create pandas profiling report in notebook
profile.to_file("pandas_profiling.html") # create pandas profiling report in an html file
'''


Running the full report throws `MemoryError: Unable to allocate 30.0 TiB for an array with shape (4117972210849,) and data type float64`. It seems we may need to truncate our dataset.
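
One way to truncate the data before profiling is to take a reproducible random sample. The sketch below is an assumption about how that could look, using a made-up frame (`toy`) instead of `ONP_csv`; the 10% fraction and the seed are arbitrary choices, and whether sampling alone avoids the memory error was not tested here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ONP_csv (hypothetical data, 1000 rows).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "n_tokens_content": rng.integers(0, 2000, size=1000),
    "shares": rng.integers(1, 10000, size=1000),
})

# Keep a reproducible 10% random sample; the fraction is an arbitrary choice.
toy_sample = toy.sample(frac=0.1, random_state=42)

print(toy.shape, "->", toy_sample.shape)
```

The sampled frame could then be passed to `ProfileReport` in place of the full frame.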


In [38]:
from pandas_profiling import ProfileReport

profile = ProfileReport(ONP_csv, title='Pandas Profiling Report', minimal=True)
profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 67/67 [00:01<00:00, 56.47it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:21<00:00, 21.08s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.06s/it]


### Split the data into train and test set

In [39]:
# Drop rows with missing values, then split the cleaned frame
ONP_csv_ = ONP_csv.dropna()
train_df, test_df = train_test_split(ONP_csv_, test_size=0.3, random_state=42)
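
As a sanity check on the split, the expected row counts can be worked out by hand. The sketch assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the training set; this matches our understanding of `train_test_split` but is worth confirming against the installed version.

```python
import math

n_rows = 39644      # rows reported by ONP_csv.info()
test_size = 0.3     # fraction passed to train_test_split

# Assumption: the test count is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

The training count obtained this way, 27750, agrees with the `count` row in the `train_df.describe()` output.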

In [40]:
train_df.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
6286,http://mashable.com/2013/05/02/robot-fly/,616.0,9.0,744.0,0.49797,1.0,0.689655,9.0,0.0,0.0,...,0.1,0.8,-0.249621,-0.6,-0.083333,0.9,0.8,0.4,0.8,3800
36285,http://mashable.com/2014/11/05/magic-tricks-90...,63.0,12.0,623.0,0.503268,1.0,0.675603,5.0,4.0,9.0,...,0.1,1.0,-0.235185,-0.7,-0.1,0.1,0.2,0.4,0.2,1500
12083,http://mashable.com/2013/08/26/nebraska-footba...,500.0,14.0,493.0,0.582485,1.0,0.777778,9.0,0.0,1.0,...,0.1,1.0,-0.3125,-0.5,-0.125,0.727273,0.318182,0.227273,0.318182,974
7859,http://mashable.com/2013/06/04/nba-finals-anim...,583.0,10.0,126.0,0.821138,1.0,0.892857,4.0,3.0,0.0,...,0.5,0.5,-0.433333,-0.433333,-0.433333,0.65,0.35,0.15,0.35,2000
12702,http://mashable.com/2013/09/09/government-data...,486.0,11.0,238.0,0.608696,1.0,0.77305,5.0,3.0,1.0,...,0.1,0.8,-0.316667,-0.5,-0.05,0.066667,0.0,0.433333,0.0,1600


In [41]:
train_df.describe()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
count,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,...,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0,27750.0
mean,354.351856,10.403856,550.164613,0.555211,1.007532,0.69601,10.860505,3.296216,4.550414,1.254991,...,0.094863,0.757122,-0.259783,-0.522282,-0.10776,0.282392,0.070954,0.342277,0.156543,3373.312541
std,213.593922,2.124787,470.52354,4.207166,6.251632,3.90097,11.221066,3.838467,8.320699,4.178646,...,0.070337,0.2481,0.128101,0.289994,0.096598,0.324514,0.266772,0.188409,0.227366,10922.210179
min,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,1.0
25%,166.0,9.0,247.0,0.470296,1.0,0.625726,4.0,1.0,1.0,0.0,...,0.05,0.6,-0.327972,-0.7,-0.125,0.0,0.0,0.166667,0.0,947.0
50%,339.0,10.0,414.0,0.538613,1.0,0.690583,8.0,3.0,1.0,0.0,...,0.1,0.8,-0.253568,-0.5,-0.1,0.141667,0.0,0.5,0.0,1400.0
75%,541.0,12.0,721.0,0.608318,1.0,0.754386,14.0,4.0,4.0,1.0,...,0.1,1.0,-0.1875,-0.3,-0.05,0.5,0.145739,0.5,0.25,2800.0
max,731.0,23.0,7185.0,701.0,1042.0,650.0,304.0,116.0,128.0,91.0,...,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.5,1.0,690400.0


In [42]:
train_df.info()

### Initial EDA

In [43]:
y_train = train_df[' shares']
X_train = train_df.drop(columns=[' shares'])
X_test = test_df.drop(columns=[' shares'])
y_test = test_df[' shares']

In [44]:
numeric_features = [
' timedelta', ' n_tokens_title', ' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos', ' average_token_length', ' num_keywords',
' kw_min_min', ' kw_max_min', ' kw_avg_min', ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg', ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares', ' self_reference_max_shares', ' self_reference_avg_sharess',
' global_subjectivity', ' global_sentiment_polarity', ' global_rate_positive_words', ' global_rate_negative_words', ' rate_positive_words', ' rate_negative_words', ' avg_positive_polarity', ' min_positive_polarity', ' max_positive_polarity', ' avg_negative_polarity', 
    ' min_negative_polarity', ' max_negative_polarity', ' title_subjectivity', ' title_sentiment_polarity', ' abs_title_subjectivity', ' abs_title_sentiment_polarity',' shares'
]

In [45]:
looking_columns = X_train.select_dtypes(include=np.number).columns.tolist()
print(looking_columns)

[' timedelta', ' n_tokens_title', ' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos', ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle', ' data_channel_is_entertainment', ' data_channel_is_bus', ' data_channel_is_socmed', ' data_channel_is_tech', ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min', ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg', ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares', ' self_reference_max_shares', ' self_reference_avg_sharess', ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday', ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday', ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02', ' LDA_03', ' LDA_04', ' global_subjectivity', ' global_sentiment_polarity', ' global_rate_positive_words', ' global_rate_negative_words', ' rate_positive_words', ' rate_n

### Determining Which Numerical Features Are Important for Shares

In [46]:
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

In [47]:
df_cor = ONP_csv[numeric_features]

corrmat = df_cor.corr(method='pearson')
f, ax = plt.subplots(figsize=(60, 60))

sns.heatmap(corrmat, vmax=0.5, square=True, cmap=plt.cm.Blues)
plt.title("Correlation map", fontsize=16)
plt.show()

The correlation heat map for our numeric features shows that `shares` does not have a strong correlation with any of the other numeric features: the correlations are all weak, no larger than about 0.3 in absolute value.
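
Reading individual values off a 60 by 60 inch heat map is hard, so a ranked list of correlations with the target can complement it. The sketch below uses a small made-up frame (`toy`) and unpadded column names; both are assumptions for illustration, and the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the numeric columns.
toy = pd.DataFrame({
    "num_hrefs": [4, 3, 3, 9, 19, 2],
    "num_imgs": [1, 1, 1, 1, 20, 0],
    "n_tokens_title": [12, 9, 9, 9, 13, 10],
    "shares": [593, 711, 1500, 1200, 505, 855],
})

target = "shares"

# Pearson correlation of every feature with the target, ranked by magnitude.
corr_with_target = (
    toy.corr(method="pearson")[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data the same chain would be applied to the numeric feature columns with `' shares'` as the target.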

### Identify numeric, categorical, binary, and other features

In [48]:
numeric_features = [
' timedelta', ' n_tokens_title', ' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos', ' average_token_length', ' num_keywords',
' kw_min_min', ' kw_max_min', ' kw_avg_min', ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg', ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares', ' self_reference_max_shares', ' self_reference_avg_sharess',
' global_subjectivity', ' global_sentiment_polarity', ' global_rate_positive_words', ' global_rate_negative_words', ' rate_positive_words', ' rate_negative_words', ' avg_positive_polarity', ' min_positive_polarity', ' max_positive_polarity', ' avg_negative_polarity', 
    ' LDA_00', ' LDA_01', ' LDA_02', ' LDA_03', ' LDA_04',' min_negative_polarity', ' max_negative_polarity', ' title_subjectivity', ' title_sentiment_polarity', ' abs_title_subjectivity', ' abs_title_sentiment_polarity',' shares'
]
binary_features=[' data_channel_is_lifestyle', ' data_channel_is_entertainment', ' data_channel_is_bus', ' data_channel_is_socmed', ' data_channel_is_tech', ' data_channel_is_world',
                ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday', ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday', ' weekday_is_sunday', ' is_weekend'
                ]
target = ' shares'
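
The binary feature list above was written by hand. A possible cross-check, sketched here on a made-up frame (`toy`), is to flag every column whose values are all 0 or 1. This heuristic is our assumption and not part of the original analysis, and it can produce false positives, as the last toy column shows.

```python
import pandas as pd

# Toy frame (hypothetical values) with unpadded column names.
toy = pd.DataFrame({
    "is_weekend": [0.0, 1.0, 0.0, 0.0],
    "data_channel_is_tech": [1.0, 1.0, 0.0, 0.0],
    "num_imgs": [1.0, 0.0, 20.0, 4.0],
    "n_non_stop_words": [1.0, 1.0, 1.0, 0.0],  # only 0/1 here by coincidence
})

# A column is treated as a binary candidate when all non-missing values are 0 or 1.
binary_candidates = [
    col for col in toy.columns
    if toy[col].dropna().isin([0, 1]).all()
]
print(binary_candidates)
```

Because a continuous column can happen to contain only 0 and 1 in a small sample, the candidates should be reviewed against the dataset's feature descriptions before being treated as binary.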

### Find features with correlation with absolute value greater than 0.7

In [49]:
# Absolute pairwise correlations between the numeric columns (url is non-numeric)
corr_matrix = ONP_csv.select_dtypes(include=np.number).corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation with absolute value greater than 0.7
to_drop = [column for column in upper.columns if any(upper[column] > 0.7)]
# Drop features
ONP_csv.drop(to_drop, axis=1, inplace=True)
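
To make the upper-triangle filter easier to follow, here is the same idea on a three-column toy frame. The data are made up so that `b` is nearly a copy of `a` while `c` is unrelated; the 0.7 cutoff is the one used above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): b is almost a copy of a, c is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0],
    "c": [2.0, -1.0, 0.5, -0.5, 1.0],
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair is examined once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above the cutoff with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)
```

Only the later column of each highly correlated pair is flagged, so the result depends on column order: here `b` is dropped and `a` is kept.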


In [50]:
corr_matrix = ONP_csv.select_dtypes(include=np.number).corr().abs()
corr_matrix

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,num_keywords,...,global_rate_positive_words,global_rate_negative_words,avg_positive_polarity,min_positive_polarity,avg_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,shares
timedelta,1.0,0.24032,0.062867,0.002866,0.000832,0.06453,0.027636,0.000936,0.130465,0.046884,...,0.207604,0.010266,0.126344,0.054772,0.000507,0.063239,0.015919,0.038711,0.011551,0.008662
n_tokens_title,0.24032,1.0,0.01816,0.005318,0.053496,0.014856,0.008858,0.05146,0.071403,0.006077,...,0.064951,0.01553,0.049619,0.025069,0.017096,0.011425,0.077245,0.00024,0.146954,0.008783
n_tokens_content,0.062867,0.01816,1.0,0.004737,0.423065,0.304682,0.3426,0.103699,0.167789,0.072845,...,0.133979,0.125013,0.135123,0.261493,0.130375,0.22587,0.004484,0.023358,0.007136,0.002459
n_unique_tokens,0.002866,0.005318,0.004737,1.0,0.004352,0.00662,0.018802,0.000597,0.026407,0.003679,...,1.4e-05,0.000877,0.000487,0.009193,0.001453,0.007315,0.004678,0.002333,0.009242,0.000806
num_hrefs,0.000832,0.053496,0.423065,0.004352,1.0,0.396452,0.342633,0.114518,0.222588,0.12589,...,0.056428,0.032515,0.188236,0.082168,0.152146,0.054948,0.04395,0.039041,0.009443,0.045404
num_self_hrefs,0.06453,0.014856,0.304682,0.00662,0.396452,1.0,0.238586,0.077458,0.126879,0.099578,...,0.12114,0.011433,0.098062,0.072648,0.058222,0.039153,0.011239,0.026224,0.008961,0.0019
num_imgs,0.027636,0.008858,0.3426,0.018802,0.342633,0.238586,1.0,0.067336,0.033924,0.088432,...,0.041582,0.024772,0.096446,0.024683,0.0725,0.042644,0.056815,0.04631,0.013759,0.039388
num_videos,0.000936,0.05146,0.103699,0.000597,0.114518,0.077458,0.067336,1.0,0.00294,0.022257,...,0.07229,0.179167,0.09744,0.010103,0.115976,0.027251,0.061028,0.02198,0.021982,0.023936
average_token_length,0.130465,0.071403,0.167789,0.026407,0.222588,0.126879,0.033924,0.00294,1.0,0.016814,...,0.322929,0.228655,0.540117,0.222207,0.324529,0.19466,0.040406,0.016718,0.026586,0.022007
num_keywords,0.046884,0.006077,0.072845,0.003679,0.12589,0.099578,0.088432,0.022257,0.016814,1.0,...,0.050466,0.037969,0.0337,0.01589,0.021114,0.028036,0.016014,0.031705,0.010992,0.021818


This is only an experimental classification of the features; more exploration is needed.

### References

Online News Popularity. (2015). UCI Machine Learning Repository. Available at: https://archive-beta.ics.uci.edu/ml/datasets/online+news+popularity.