## Focused Feature Engineering

### Loading Original Clean Dataset

In [1]:
import pandas as pd
import numpy as np

online_df = pd.read_csv(r'https://raw.githubusercontent.com/doryaswi/Data-Science/master/cleaned_onlinepopularity.csv')
online_df.head()

Unnamed: 0.1,Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,data_channel_Technology,data_channel_World,data_channel_missing,day_of_week_Friday,day_of_week_Monday,day_of_week_Saturday,day_of_week_Sunday,day_of_week_Thursday,day_of_week_Tuesday,day_of_week_Wednesday
0,0,731,12,219.0,0.663594,1.0,0.815385,4,2,1,...,0,0,0,0,1,0,0,0,0,0
1,1,731,9,255.0,0.604743,1.0,0.791946,3,1,1,...,0,0,0,0,1,0,0,0,0,0
2,2,731,9,211.0,0.57513,1.0,0.663866,3,1,1,...,0,0,0,0,1,0,0,0,0,0
3,3,731,9,531.0,0.503788,1.0,0.665635,9,0,1,...,0,0,0,0,1,0,0,0,0,0
4,4,731,13,1072.0,0.415646,1.0,0.54089,19,19,20,...,1,0,0,0,1,0,0,0,0,0


Next, we create a DataFrame that keeps only the "best" features selected in the previous step, together with the target column, shares.

In [2]:
features_col = ['n_unique_tokens','num_hrefs','num_imgs','num_videos','kw_avg_max','kw_min_avg','kw_avg_avg',
                'self_reference_min_shares','self_reference_max_shares','LDA_03','avg_negative_polarity',
                'data_channel_missing','shares']
# .copy() gives an independent DataFrame, so columns added later do not touch online_df
select_features_df = online_df.loc[:,features_col].copy()
select_features_df.head()

Unnamed: 0,n_unique_tokens,num_hrefs,num_imgs,num_videos,kw_avg_max,kw_min_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,LDA_03,avg_negative_polarity,data_channel_missing,shares
0,0.663594,4,1,0,0.0,0.0,0.0,496.0,496.0,0.041263,-0.35,0,593
1,0.604743,3,1,0,0.0,0.0,0.0,0.0,0.0,0.050101,-0.11875,0,711
2,0.57513,3,1,0,0.0,0.0,0.0,918.0,918.0,0.033334,-0.466667,0,1500
3,0.503788,9,1,0,0.0,0.0,0.0,0.0,0.0,0.028905,-0.369697,0,1200
4,0.415646,19,20,0,0.0,0.0,0.0,545.0,16000.0,0.028572,-0.220192,0,505


### New Features

Based on the features previously selected by our model, we try to create new ones. Two of the selected variables, self_reference_max_shares and self_reference_min_shares, describe the same quantity at its maximum and at its minimum, so we take their difference and use it as a new feature. (In the code below this column is named diff_kw_avg, although it is derived from the self_reference columns and not from the kw_* columns.)

num_hrefs, num_imgs, and num_videos are all whole-number counts, so we can bin each of them by quartile and label every row with the bin it falls in. This gives us new categorical variables, and we can then see how these engineered variables affect our model.

In [3]:
select_features_df['diff_kw_avg'] = select_features_df['self_reference_max_shares'] - select_features_df['self_reference_min_shares']
select_features_df.head()

Unnamed: 0,n_unique_tokens,num_hrefs,num_imgs,num_videos,kw_avg_max,kw_min_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,LDA_03,avg_negative_polarity,data_channel_missing,shares,diff_kw_avg
0,0.663594,4,1,0,0.0,0.0,0.0,496.0,496.0,0.041263,-0.35,0,593,0.0
1,0.604743,3,1,0,0.0,0.0,0.0,0.0,0.0,0.050101,-0.11875,0,711,0.0
2,0.57513,3,1,0,0.0,0.0,0.0,918.0,918.0,0.033334,-0.466667,0,1500,0.0
3,0.503788,9,1,0,0.0,0.0,0.0,0.0,0.0,0.028905,-0.369697,0,1200,0.0
4,0.415646,19,20,0,0.0,0.0,0.0,545.0,16000.0,0.028572,-0.220192,0,505,15455.0


In [4]:
def cut_cat(df,column):
    return pd.qcut(df[column],4,duplicates='drop')
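
As a minimal sketch of what the quartile binning in cut_cat does on a heavily skewed count column, the toy example below uses made-up values (not the article data). When several quartile edges coincide, duplicates='drop' merges them, so fewer than four bins can come out. This would be consistent with a column such as num_videos ending up with a wide first bin, although the sketch does not check that on the real data.

```python
import pandas as pd

# Made-up counts: most values are 0, mimicking a skewed count column.
counts = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 5, 30])

# Quartile edges (min, 25%, 50%, 75%, max); several of them coincide at 0 here.
edges = counts.quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()

# duplicates='drop' merges repeated edges instead of raising a ValueError.
binned = pd.qcut(counts, 4, duplicates='drop')

n_bins = len(binned.cat.categories)
print(edges)
print(binned.cat.categories)
print(n_bins)
```

In this toy series only two bins survive, so it is worth checking the number of categories before treating the result as true quartiles.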

In [5]:
select_features_df['num_hrefs_cat'] = cut_cat(select_features_df,'num_hrefs')
select_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39543 entries, 0 to 39542
Data columns (total 15 columns):
n_unique_tokens              39543 non-null float64
num_hrefs                    39543 non-null int64
num_imgs                     39543 non-null int64
num_videos                   39543 non-null int64
kw_avg_max                   39543 non-null float64
kw_min_avg                   39543 non-null float64
kw_avg_avg                   39543 non-null float64
self_reference_min_shares    39543 non-null float64
self_reference_max_shares    39543 non-null float64
LDA_03                       39543 non-null float64
avg_negative_polarity        39543 non-null float64
data_channel_missing         39543 non-null int64
shares                       39543 non-null int64
diff_kw_avg                  39543 non-null float64
num_hrefs_cat                39543 non-null category
dtypes: category(1), float64(9), int64(5)
memory usage: 4.3 MB


In [6]:
select_features_df['num_videos_cat'] = cut_cat(select_features_df,'num_videos')
select_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39543 entries, 0 to 39542
Data columns (total 16 columns):
n_unique_tokens              39543 non-null float64
num_hrefs                    39543 non-null int64
num_imgs                     39543 non-null int64
num_videos                   39543 non-null int64
kw_avg_max                   39543 non-null float64
kw_min_avg                   39543 non-null float64
kw_avg_avg                   39543 non-null float64
self_reference_min_shares    39543 non-null float64
self_reference_max_shares    39543 non-null float64
LDA_03                       39543 non-null float64
avg_negative_polarity        39543 non-null float64
data_channel_missing         39543 non-null int64
shares                       39543 non-null int64
diff_kw_avg                  39543 non-null float64
num_hrefs_cat                39543 non-null category
num_videos_cat               39543 non-null category
dtypes: category(2), float64(9), int64(5)
memory usage: 4.3 MB


In [7]:
select_features_df['num_imgs_cat'] = cut_cat(select_features_df,'num_imgs')
select_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39543 entries, 0 to 39542
Data columns (total 17 columns):
n_unique_tokens              39543 non-null float64
num_hrefs                    39543 non-null int64
num_imgs                     39543 non-null int64
num_videos                   39543 non-null int64
kw_avg_max                   39543 non-null float64
kw_min_avg                   39543 non-null float64
kw_avg_avg                   39543 non-null float64
self_reference_min_shares    39543 non-null float64
self_reference_max_shares    39543 non-null float64
LDA_03                       39543 non-null float64
avg_negative_polarity        39543 non-null float64
data_channel_missing         39543 non-null int64
shares                       39543 non-null int64
diff_kw_avg                  39543 non-null float64
num_hrefs_cat                39543 non-null category
num_videos_cat               39543 non-null category
num_imgs_cat                 39543 non-null category
dtypes: category(3), float64(9), int64(5)

In [8]:
def create_cat_col(df,column):
    # Return the integer code (0, 1, 2, ...) of each row's quantile bin
    return df[column].cat.codes
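
To make the mapping from interval categories to integer codes concrete, here is a small sketch on made-up link counts. The bin edges are written out by hand with pd.cut purely for readability; the notebook itself derives them with pd.qcut.

```python
import pandas as pd

# Made-up link counts, binned with explicit (illustrative) edges.
values = pd.Series([3, 9, 19, 4])
binned = pd.cut(values, bins=[0, 4, 8, 14, 304])

# Categories are ordered intervals; .cat.codes gives each row the
# position of its interval in that ordering.
codes = binned.cat.codes
lookup = dict(enumerate(binned.cat.categories))

print(codes.tolist())
print(lookup)
```

The codes are ordinal: a larger code means a higher bin. A missing value would receive the code -1, which is not an issue here because none of the binned columns contain nulls.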

In [9]:
select_features_df['num_hrefs_cc'] = create_cat_col(select_features_df,'num_hrefs_cat')
select_features_df.head()

Unnamed: 0,n_unique_tokens,num_hrefs,num_imgs,num_videos,kw_avg_max,kw_min_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,LDA_03,avg_negative_polarity,data_channel_missing,shares,diff_kw_avg,num_hrefs_cat,num_videos_cat,num_imgs_cat,num_hrefs_cc
0,0.663594,4,1,0,0.0,0.0,0.0,496.0,496.0,0.041263,-0.35,0,593,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0
1,0.604743,3,1,0,0.0,0.0,0.0,0.0,0.0,0.050101,-0.11875,0,711,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0
2,0.57513,3,1,0,0.0,0.0,0.0,918.0,918.0,0.033334,-0.466667,0,1500,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0
3,0.503788,9,1,0,0.0,0.0,0.0,0.0,0.0,0.028905,-0.369697,0,1200,0.0,"(8.0, 14.0]","(-0.001, 1.0]","(-0.001, 1.0]",2
4,0.415646,19,20,0,0.0,0.0,0.0,545.0,16000.0,0.028572,-0.220192,0,505,15455.0,"(14.0, 304.0]","(-0.001, 1.0]","(4.0, 128.0]",3


In [10]:
select_features_df['num_videos_cc'] = create_cat_col(select_features_df,'num_videos_cat')
select_features_df.head()

Unnamed: 0,n_unique_tokens,num_hrefs,num_imgs,num_videos,kw_avg_max,kw_min_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,LDA_03,avg_negative_polarity,data_channel_missing,shares,diff_kw_avg,num_hrefs_cat,num_videos_cat,num_imgs_cat,num_hrefs_cc,num_videos_cc
0,0.663594,4,1,0,0.0,0.0,0.0,496.0,496.0,0.041263,-0.35,0,593,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0
1,0.604743,3,1,0,0.0,0.0,0.0,0.0,0.0,0.050101,-0.11875,0,711,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0
2,0.57513,3,1,0,0.0,0.0,0.0,918.0,918.0,0.033334,-0.466667,0,1500,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0
3,0.503788,9,1,0,0.0,0.0,0.0,0.0,0.0,0.028905,-0.369697,0,1200,0.0,"(8.0, 14.0]","(-0.001, 1.0]","(-0.001, 1.0]",2,0
4,0.415646,19,20,0,0.0,0.0,0.0,545.0,16000.0,0.028572,-0.220192,0,505,15455.0,"(14.0, 304.0]","(-0.001, 1.0]","(4.0, 128.0]",3,0


In [11]:
select_features_df['num_imgs_cc'] = create_cat_col(select_features_df,'num_imgs_cat')
select_features_df.head()

Unnamed: 0,n_unique_tokens,num_hrefs,num_imgs,num_videos,kw_avg_max,kw_min_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,LDA_03,avg_negative_polarity,data_channel_missing,shares,diff_kw_avg,num_hrefs_cat,num_videos_cat,num_imgs_cat,num_hrefs_cc,num_videos_cc,num_imgs_cc
0,0.663594,4,1,0,0.0,0.0,0.0,496.0,496.0,0.041263,-0.35,0,593,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0,0
1,0.604743,3,1,0,0.0,0.0,0.0,0.0,0.0,0.050101,-0.11875,0,711,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0,0
2,0.57513,3,1,0,0.0,0.0,0.0,918.0,918.0,0.033334,-0.466667,0,1500,0.0,"(-0.001, 4.0]","(-0.001, 1.0]","(-0.001, 1.0]",0,0,0
3,0.503788,9,1,0,0.0,0.0,0.0,0.0,0.0,0.028905,-0.369697,0,1200,0.0,"(8.0, 14.0]","(-0.001, 1.0]","(-0.001, 1.0]",2,0,0
4,0.415646,19,20,0,0.0,0.0,0.0,545.0,16000.0,0.028572,-0.220192,0,505,15455.0,"(14.0, 304.0]","(-0.001, 1.0]","(4.0, 128.0]",3,0,2


In [12]:
new_select_df = select_features_df.drop(columns=['num_hrefs','num_videos', 'num_imgs',
                                                 'num_hrefs_cat','num_videos_cat', 'num_imgs_cat',
                                                 'self_reference_min_shares','self_reference_max_shares'])
new_select_df.head()

Unnamed: 0,n_unique_tokens,kw_avg_max,kw_min_avg,kw_avg_avg,LDA_03,avg_negative_polarity,data_channel_missing,shares,diff_kw_avg,num_hrefs_cc,num_videos_cc,num_imgs_cc
0,0.663594,0.0,0.0,0.0,0.041263,-0.35,0,593,0.0,0,0,0
1,0.604743,0.0,0.0,0.0,0.050101,-0.11875,0,711,0.0,0,0,0
2,0.57513,0.0,0.0,0.0,0.033334,-0.466667,0,1500,0.0,0,0,0
3,0.503788,0.0,0.0,0.0,0.028905,-0.369697,0,1200,0.0,2,0,0
4,0.415646,0.0,0.0,0.0,0.028572,-0.220192,0,505,15455.0,3,0,2


### Retrain Model Using Focused Feature Engineering

With our optimized hyperparameters, we retrain the model on the new features created above.

In [13]:
y = new_select_df.loc[:,'shares']
y[:5]

0     593
1     711
2    1500
3    1200
4     505
Name: shares, dtype: int64

In [14]:
X = new_select_df.drop('shares',axis=1)
X[:5]

Unnamed: 0,n_unique_tokens,kw_avg_max,kw_min_avg,kw_avg_avg,LDA_03,avg_negative_polarity,data_channel_missing,diff_kw_avg,num_hrefs_cc,num_videos_cc,num_imgs_cc
0,0.663594,0.0,0.0,0.0,0.041263,-0.35,0,0.0,0,0,0
1,0.604743,0.0,0.0,0.0,0.050101,-0.11875,0,0.0,0,0,0
2,0.57513,0.0,0.0,0.0,0.033334,-0.466667,0,0.0,0,0,0
3,0.503788,0.0,0.0,0.0,0.028905,-0.369697,0,0.0,2,0,0
4,0.415646,0.0,0.0,0.0,0.028572,-0.220192,0,15455.0,3,0,2


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=13)
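
The two calls above hold out 20% of the rows for testing and then 20% of what remains for validation. The arithmetic below is a rough sketch of the resulting split sizes, assuming each held-out share is rounded up to a whole number of rows.

```python
import math

n_rows = 39543  # row count reported by select_features_df.info()

# First split: 20% held out for testing (assumed rounded up to a whole row).
n_test = math.ceil(0.2 * n_rows)
n_train_val = n_rows - n_test

# Second split: 20% of the remaining rows held out for validation.
n_val = math.ceil(0.2 * n_train_val)
n_train = n_train_val - n_val

fractions = {
    "train": round(n_train / n_rows, 2),
    "val": round(n_val / n_rows, 2),
    "test": round(n_test / n_rows, 2),
}
print(n_train, n_val, n_test, fractions)
```

So the model is fitted on roughly 64% of the data, validated on 16%, and the final 20% stays untouched as the test set.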

In [16]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=42, min_samples_leaf=46)
rfr.fit(X_train, y_train)
rfr.score(X_val, y_val)

0.03665579812596731

The validation R² is only about 0.037. Since the score got worse after re-engineering these features, the better decision is to go back to the previously selected features.
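
When comparing two feature sets like this, it helps to keep the split and the forest's random seed identical, so that any change in score comes from the features alone. The sketch below does that on synthetic data (not the article dataset). validation_r2 is a hypothetical helper, and the random_state passed to the forest is an addition that the cell above does not use. It also illustrates one possible, unverified reason for the drop: binning a count into quartile codes discards information.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def validation_r2(X, y, seed=13):
    """Fit the forest on one fixed split and return R^2 on the held-out part."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(
        n_estimators=42, min_samples_leaf=46, random_state=seed
    )
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)


# Synthetic stand-in data: the target depends linearly on a raw count.
rng = np.random.default_rng(0)
raw = pd.DataFrame({"count": rng.poisson(5, size=2000)})
y = 10.0 * raw["count"] + rng.normal(0, 1, size=2000)

# Coarser version of the same feature: quartile codes, as in the cells above.
binned = pd.DataFrame(
    {"count_cc": pd.qcut(raw["count"], 4, duplicates="drop").cat.codes}
)

scores = {"raw": validation_r2(raw, y), "binned": validation_r2(binned, y)}
print(scores)
```

On this synthetic example the raw count scores higher than its binned version. Whether the same effect explains the result on the article data would need to be checked with the same fixed-seed comparison.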