## Feature Engineering

This dataset comes from the University of California, Irvine (UCI) Machine Learning Repository. It consists of articles published by Mashable over a two-year period (2013-2014). It originally has 50 attributes (49 features, 1 target) and a total of 39,644 entries. The goal of this project is to analyze these articles, engineer additional features, and build a model that predicts the number of shares (the target) an online news article will receive.

In [1]:
import pandas as pd
df = pd.read_csv('OnlineNewsPopularity.csv')

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 50 columns):
url                             39644 non-null object
timedelta                       39644 non-null int64
n_tokens_title                  39644 non-null int64
n_tokens_content                39644 non-null int64
n_unique_tokens                 39644 non-null float64
n_non_stop_words                39644 non-null float64
n_non_stop_unique_tokens        39644 non-null float64
num_hrefs                       39644 non-null int64
num_self_hrefs                  39644 non-null int64
num_imgs                        39644 non-null int64
num_videos                      39644 non-null int64
average_token_length            39644 non-null float64
num_keywords                    39644 non-null int64
data_channel                    33510 non-null object
kw_min_min                      39644 non-null int64
kw_max_min                      39644 non-null float64
kw_avg_min                     

In [3]:
pd.set_option('display.max_rows',100)

### Data Cleanup and Missing Values
Some rows in the dataset have n_tokens_content = 0, num_imgs = 0, and num_videos = 0 all at once. These columns are expected to be populated, and having all three equal to zero would mean that the article has no content at all, which is odd. I checked a few of these URLs manually and found that at least one of the columns should in fact have had a value. There were 101 rows in this category. Since this is only about 0.25% of the dataset (101 of 39,644 rows), I decided to drop these rows.

In [4]:
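def drop_noninfo_rows(df, index):
    # Work on a copy so the original DataFrame is left untouched
    copied_df = df.copy()
    copied_df.drop(index, inplace=True)
    # Without drop=True, reset_index keeps the old labels in a new 'index' column
    copied_df.reset_index(inplace=True)
    return copied_df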

In [71]:
index = df[(df['n_tokens_content']==0) & (df['num_imgs']==0) & (df['num_videos']==0)].index
new_df = drop_noninfo_rows(df,index)
new_df[0:2]

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219,0.663594,1.0,0.815385,4,2,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255,0.604743,1.0,0.791946,3,1,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
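
As a rough illustration of the filter used above, the sketch below applies the same three-way mask to a small made-up frame and reports the share of rows it removes. The `toy` frame and its values are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'n_tokens_content': [219, 0, 0, 531],
    'num_imgs':         [1, 0, 3, 1],
    'num_videos':       [0, 0, 0, 2],
})

# A row is treated as uninformative only when all three columns are zero
mask = (
    (toy['n_tokens_content'] == 0)
    & (toy['num_imgs'] == 0)
    & (toy['num_videos'] == 0)
)

dropped_share = mask.mean()  # fraction of rows flagged by the mask

# drop=True discards the old labels instead of adding an 'index' column
kept = toy[~mask].reset_index(drop=True)

print(f"dropped {mask.sum()} of {len(toy)} rows ({dropped_share:.1%})")
```

Passing `drop=True` is a design choice for the sketch; the notebook's own helper keeps the old labels, which is why an `index` column appears in its outputs.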


### Create New Feature for 0 and -1 column values
There were a few more "missing" values that needed to be cleaned up. Some rows have zero values for n_tokens_content and average_token_length even though the actual URL shows that the article does have content. In addition, the kw_min_min column contains values equal to -1, which are also treated as missing. Indicator features are created to flag these rows (for n_tokens_content and kw_min_min), and the "missing" values are then replaced with the mean of the column.

In [6]:
def missing_col_feature(df,column,check_value):
    col_list = list(df[column])
    miss_col_val = [1 if val==check_value else 0 for val in col_list]
    return pd.DataFrame(miss_col_val)

In [7]:
new_df['is_nocontent']=missing_col_feature(new_df,'n_tokens_content',0)

In [8]:
new_df['is_nocontent'].value_counts()

0    38463
1     1080
Name: is_nocontent, dtype: int64

In [9]:
new_df['is_no_kw_min_min']=missing_col_feature(new_df,'kw_min_min',-1)
new_df['is_no_kw_min_min'].value_counts()

1    22963
0    16580
Name: is_no_kw_min_min, dtype: int64
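
The helper above returns a new DataFrame with a fresh RangeIndex, so the column assignment lines up only because `new_df` had its index reset beforehand. A minimal sketch of an index-preserving alternative, on made-up data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Toy column with -1 used as the "missing" sentinel
toy = pd.DataFrame({'kw_min_min': [-1, 4, -1, 217]}, index=[10, 11, 12, 13])

# 1 where the sentinel appears, 0 elsewhere; the result shares toy's index,
# so the assignment aligns row by row even with a non-default index
toy['is_no_kw_min_min'] = (toy['kw_min_min'] == -1).astype(int)

counts = toy['is_no_kw_min_min'].value_counts()
print(counts)
```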

### Impute Means

Impute the column mean for n_tokens_content (zero values), average_token_length (zero values), and kw_min_min (-1 values).

In [10]:
def impute_means_on_missing(df,column,check_value):
    col_mean = df.loc[:,column].mean()
    return df[column].mask(df[column]==check_value,col_mean)

In [11]:
before_n_tokens_content = new_df['n_tokens_content'].describe()
before_n_tokens_content

count    39543.000000
mean       547.910629
std        470.897370
min          0.000000
25%        247.000000
50%        410.000000
75%        717.000000
max       8474.000000
Name: n_tokens_content, dtype: float64

In [12]:
new_df['n_tokens_content'] = impute_means_on_missing(new_df,'n_tokens_content',0)
new_df[0:2]

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,0,0
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711,0,0


In [13]:
after_n_tokens_content = new_df['n_tokens_content'].describe()
after_n_tokens_content

count    39543.000000
mean       562.875186
std        461.866801
min         18.000000
25%        263.000000
50%        435.000000
75%        717.000000
max       8474.000000
Name: n_tokens_content, dtype: float64

In [14]:
before_kw_min_min = new_df['kw_min_min'].describe()
before_kw_min_min

count    39543.000000
mean        26.003819
std         69.516394
min         -1.000000
25%         -1.000000
50%         -1.000000
75%          4.000000
max        377.000000
Name: kw_min_min, dtype: float64

In [15]:
new_df['kw_min_min'] = impute_means_on_missing(new_df,'kw_min_min',-1)
new_df[0:2]

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,0,0
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711,0,0


In [16]:
after_kw_min_min = new_df['kw_min_min'].describe()
after_kw_min_min

count    39543.000000
mean        41.685196
std         64.522472
min          0.000000
25%          4.000000
50%         26.003819
75%         26.003819
max        377.000000
Name: kw_min_min, dtype: float64

In [17]:
before_ave_token_length = new_df['average_token_length'].describe()
before_ave_token_length

count    39543.000000
mean         4.559856
std          0.813553
min          0.000000
25%          4.479769
50%          4.665060
75%          4.855455
max          8.041534
Name: average_token_length, dtype: float64

In [18]:
new_df['average_token_length'] = impute_means_on_missing(new_df,'average_token_length',0)
new_df[0:2]

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,0,0
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711,0,0


In [19]:
after_ave_token_length = new_df['average_token_length'].describe()
after_ave_token_length

count    39543.000000
mean         4.684395
std          0.280114
min          3.600000
25%          4.501685
50%          4.665060
75%          4.855455
max          8.041534
Name: average_token_length, dtype: float64
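
In the outputs above, the value imputed into kw_min_min (26.003819) equals the column mean from before the imputation, which means the -1 placeholders were included when that mean was computed. One possible alternative, sketched below on made-up data, is to compute the mean from the valid values only. The helper name `impute_mean_excluding` is hypothetical, and whether this variant is preferable depends on the modeling goal.

```python
import pandas as pd

def impute_mean_excluding(series, check_value):
    """Replace check_value with the mean of the remaining values (hypothetical helper)."""
    series = series.astype(float)          # the mean is generally not an integer
    is_missing = series == check_value
    valid_mean = series[~is_missing].mean()  # sentinel rows do not affect the mean
    return series.mask(is_missing, valid_mean)

# Toy data, invented for illustration only
toy = pd.Series([-1, 4, -1, 217, 0], name='kw_min_min')
imputed = impute_mean_excluding(toy, -1)
print(imputed)
```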

### Exploring Other Valuable Features
Look at features with datatype = object

In [20]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [21]:
len(find_object_features(new_df))

3

In [22]:
find_object_features(new_df)

['url', 'data_channel', 'day_of_week']

In [23]:
def informative(df):
    non_informative = [column for column in df.columns if len(df[column].unique()) == 1]
    informative_columns = list(set(df.columns.to_list()) - set(non_informative))
    return df[informative_columns]

In [24]:
def percentage_unique(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [25]:
new_df['url'][0:2]

0    http://mashable.com/2013/01/07/amazon-instant-...
1    http://mashable.com/2013/01/07/ap-samsung-spon...
Name: url, dtype: object

The date of publication can be extracted from the URL. It can then serve as a new feature, for example to check whether the month in which an article was published has an effect on its popularity.

In [26]:
def extract_date(df,column):
    col_list = list(df[column])
    new_col = []
    for url in col_list:
        url_string = url.split('/')
        date = url_string[3] + '/' + url_string[4] + '/' + url_string[5]
        new_col.append(date)
    return pd.DataFrame(new_col)

In [27]:
new_df['date']=extract_date(new_df,'url')
new_df[0:2]

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min,date
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,0,0,2013/01/07
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,-0.125,-0.1,0.0,0.0,0.5,0.0,711,0,0,2013/01/07
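
The same extraction can also be sketched with pandas string methods instead of a Python loop. The URLs below are invented placeholders that only mimic the `/YYYY/MM/DD/` path layout seen above; the regular expression is an assumption about that layout.

```python
import pandas as pd

# Placeholder URLs, invented for illustration only
urls = pd.Series([
    'http://example.com/2013/01/07/some-article/',
    'http://example.com/2014/12/27/another-article/',
])

# Capture year, month and day from the path; rows without a match become NaN
parts = urls.str.extract(r'/(\d{4})/(\d{2})/(\d{2})/')
date_str = parts[0] + '/' + parts[1] + '/' + parts[2]

# Parse into real datetimes so the month can be read off directly
dates = pd.to_datetime(date_str, format='%Y/%m/%d')
months = dates.dt.month
print(months.tolist())
```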


### Datetime

In [28]:
def contains_date(column):
    # Cast to str so the .str accessor also works on numeric columns and NaNs
    regex_string = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$'
                    r'|^\d{1,2}\/\d{1,2}\/\d{4}$|^\d{4}\/\d{1,2}\/\d{1,2}$')
    return column.astype(str).str.contains(regex_string).any()

In [29]:
def find_date_features(df):
    series_contains_date = df.apply(contains_date)
    return series_contains_date.index[series_contains_date.values]

In [30]:
def to_dates(df):
    date_features = find_date_features(df)
    return df[date_features].astype('datetime64[ns]')

In [31]:
df_date = to_dates(new_df)

In [32]:
from date_lib import add_datepart
def generate_new_date_columns(dates_df):
    copied_dates_df = dates_df.copy()
    for col in copied_dates_df.columns:
        add_datepart(copied_dates_df, col)
    return copied_dates_df
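
The `add_datepart` helper comes from a local `date_lib` module that is not shown here. As a rough sketch, similar date parts can be derived with the pandas `.dt` accessor. The column names below (Year, Month, Week, Day, Dayofweek, Dayofyear) are an assumption about what the helper produces, and the dates are made up.

```python
import pandas as pd

# Toy dates, invented for illustration only
dates = pd.to_datetime(pd.Series(['2013/01/07', '2014/12/27']), format='%Y/%m/%d')

date_parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Week': dates.dt.isocalendar().week.astype(int),  # ISO week number
    'Day': dates.dt.day,
    'Dayofweek': dates.dt.dayofweek,                  # Monday is 0
    'Dayofyear': dates.dt.dayofyear,
})
print(date_parts)
```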

In [33]:
new_date_col_df = generate_new_date_columns(df_date)
new_date_col_df.head()

| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |
| 1 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |
| 2 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |
| 3 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |
| 4 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |


In [34]:
new_date_col_df = new_date_col_df.loc[:,['Year','Month','Week','Day','Dayofweek','Dayofyear']]
new_date_col_df.head()

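| | Year | Month | Week | Day | Dayofweek | Dayofyear |
|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 2 | 7 | 0 | 7 |
| 1 | 2013 | 1 | 2 | 7 | 0 | 7 |
| 2 | 2013 | 1 | 2 | 7 | 0 | 7 |
| 3 | 2013 | 1 | 2 | 7 | 0 | 7 |
| 4 | 2013 | 1 | 2 | 7 | 0 | 7 |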


In [35]:
def merge_dfs(original_df, new_df):
    copied_original = original_df.copy()
    date_features = find_date_features(original_df)
    copied_dropped = copied_original.drop(columns = date_features)
    copied_dropped[new_df.columns] = new_df
    return copied_dropped

In [36]:
new_date_df = merge_dfs(new_df, new_date_col_df)
new_date_df.head()

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min,Year,Month,Week,Day,Dayofweek,Dayofyear
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,0.1875,593,0,0,2013,1,2,7,0,7
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,0.0,711,0,0,2013,1,2,7,0,7
2,2,http://mashable.com/2013/01/07/apple-40-billio...,731,9,211.0,0.57513,1.0,0.663866,3,1,...,0.0,1500,0,0,2013,1,2,7,0,7
3,3,http://mashable.com/2013/01/07/astronaut-notre...,731,9,531.0,0.503788,1.0,0.665635,9,0,...,0.0,1200,0,0,2013,1,2,7,0,7
4,4,http://mashable.com/2013/01/07/att-u-verse-apps/,731,13,1072.0,0.415646,1.0,0.54089,19,19,...,0.136364,505,0,0,2013,1,2,7,0,7


### Categorical

In [37]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percentage_unique(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 

In [38]:
df_informative = informative(new_date_df)

potential_categorical = find_categorical(df_informative)
# potential_categorical
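
To illustrate the idea behind `find_categorical`, the sketch below computes the share of distinct values per column on a small made-up frame and keeps the columns that fall under the 0.5 threshold. The `toy` frame is invented for the example; `nunique` and `count` are used here as a vectorized stand-in for the per-column helper.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'data_channel': ['World', 'Bus', 'World', None, 'Bus', 'World'],
    'url': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Distinct non-missing values divided by the number of non-missing entries
unique_ratio = toy.nunique(dropna=True) / toy.count()

# Columns with few distinct values relative to their length are candidates
candidate_categoricals = unique_ratio[unique_ratio < 0.5].index.tolist()
print(candidate_categoricals)
```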

In [39]:
import numpy as np

In [40]:
def summarize_counts(df):
    non_empty_columns = df.dropna(axis=1,how='all').columns
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)
    columns = non_empty_columns.to_numpy().reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in non_empty_columns]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [41]:
summary = summarize_counts(potential_categorical)
summary

array([['is_nocontent', 0.9726879599423413, '0'],
       ['is_weekend', 0.8695091419467416, '0'],
       ['kw_max_max', 0.7627645853880586, '843300'],
       ['num_videos', 0.630326480034393, '0'],
       ['kw_min_min', 0.5807096072629795, '26.003818627822877'],
       ['is_no_kw_min_min', 0.5807096072629795, '1'],
       ['Year', 0.5419163948107124, '2014'],
       ['abs_title_subjectivity', 0.5188023164656197, '0.5'],
       ['abs_title_sentiment_polarity', 0.503022026654528, '0.0'],
       ['title_sentiment_polarity', 0.503022026654528, '0.0'],
       ['num_imgs', 0.4582100498191842, '1'],
       ['title_subjectivity', 0.45499835621981133, '0.0'],
       ['kw_min_max', 0.4317578332448221, '0'],
       ['kw_min_avg', 0.4316060996889462, '0.0'],
       ['min_positive_polarity', 0.39038515034266497, '0.1'],
       ['max_positive_polarity', 0.3748071719394077, '1.0'],
       ['max_negative_polarity', 0.25157423564221226, '-0.05'],
       ['data_channel', 0.2513973159577966, 'World'],
  

In [42]:
def selected_summaries(df, not_values = [], lower_bound = .1, upper_bound = 1):
    potential_cols = summarize_counts(df)
    potential_cols = potential_cols[potential_cols[:, 1] > lower_bound]
    potential_cols = potential_cols[potential_cols[:, 1] < upper_bound]
    not_tf = ~np.isin(potential_cols[:, 2], not_values)
    return potential_cols[not_tf]

In [43]:
selected = selected_summaries(new_date_df, not_values = ['t', 'f'], upper_bound = .90)
selected

array([['is_weekend', 0.8695091419467416, '0'],
       ['kw_max_max', 0.7627645853880586, '843300'],
       ['num_videos', 0.630326480034393, '0'],
       ['kw_min_min', 0.5807096072629795, '26.003818627822877'],
       ['is_no_kw_min_min', 0.5807096072629795, '1'],
       ['Year', 0.5419163948107124, '2014'],
       ['abs_title_subjectivity', 0.5188023164656197, '0.5'],
       ['title_sentiment_polarity', 0.503022026654528, '0.0'],
       ['abs_title_sentiment_polarity', 0.503022026654528, '0.0'],
       ['num_imgs', 0.4582100498191842, '1'],
       ['title_subjectivity', 0.45499835621981133, '0.0'],
       ['kw_min_max', 0.4317578332448221, '0'],
       ['kw_min_avg', 0.4316060996889462, '0.0'],
       ['min_positive_polarity', 0.39038515034266497, '0.1'],
       ['max_positive_polarity', 0.3748071719394077, '1.0'],
       ['max_negative_polarity', 0.25157423564221226, '-0.05'],
       ['data_channel', 0.2513973159577966, 'World'],
       ['num_self_hrefs', 0.19280277166628734, '2'],

In [44]:
def num_is_digit(array, str_index = 0):
    return np.array([value[str_index].isdigit() for value in array])

In [45]:
num_is_digit(selected[:, 2], str_index = 0)[0:10]

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

In [46]:
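def remove_digits_from_selected(selected_matrix, col_idx, str_indices = [0, -1]):
    selected_col = selected_matrix
    for idx in str_indices:
        # Filter the already-filtered rows so that every index check is applied,
        # not just the last one
        selected_col = selected_col[~num_is_digit(selected_col[:, col_idx], idx)]
    return selected_col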

In [47]:
selected_sums_no_digits = remove_digits_from_selected(selected, 2, [0, -1])
selected_sums_no_digits

array([['data_channel', 0.2513973159577966, 'World'],
       ['day_of_week', 0.18749209721063145, 'Wednesday']], dtype=object)

In [48]:
def categorical_plus_values(df, threshold = 5):
    categorical_cols = find_categorical(df)
    return [column for column in categorical_cols if len(df[column].value_counts()) > threshold]

In [49]:
selected_cat_cols = selected_sums_no_digits[:, 0]

selected_cat_cols

array(['data_channel', 'day_of_week'], dtype=object)

In [50]:
cat_cols_df = df_informative[selected_cat_cols]
cat_cols_df[:3]

| | data_channel | day_of_week |
|---|---|---|
| 0 | Entertainment | Monday |
| 1 | Bus | Monday |
| 2 | Bus | Monday |


In [51]:
updated_non_digits = categorical_plus_values(cat_cols_df)

In [52]:
len(updated_non_digits)

2

In [53]:
updated_non_digits

['data_channel', 'day_of_week']

In [54]:
new_date_df[updated_non_digits].describe()

| | data_channel | day_of_week |
|---|---|---|
| count | 33457 | 39543 |
| unique | 6 | 7 |
| top | World | Wednesday |
| freq | 8411 | 7414 |


In [55]:
new_date_df['data_channel'] = new_date_df['data_channel'].fillna('missing')
new_date_df['data_channel'].describe()

count     39543
unique        7
top       World
freq       8411
Name: data_channel, dtype: object

In [56]:
copy_df = new_date_df.copy()
copy_df.head()

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,abs_title_sentiment_polarity,shares,is_nocontent,is_no_kw_min_min,Year,Month,Week,Day,Dayofweek,Dayofyear
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,0.1875,593,0,0,2013,1,2,7,0,7
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,0.0,711,0,0,2013,1,2,7,0,7
2,2,http://mashable.com/2013/01/07/apple-40-billio...,731,9,211.0,0.57513,1.0,0.663866,3,1,...,0.0,1500,0,0,2013,1,2,7,0,7
3,3,http://mashable.com/2013/01/07/astronaut-notre...,731,9,531.0,0.503788,1.0,0.665635,9,0,...,0.0,1200,0,0,2013,1,2,7,0,7
4,4,http://mashable.com/2013/01/07/att-u-verse-apps/,731,13,1072.0,0.415646,1.0,0.54089,19,19,...,0.136364,505,0,0,2013,1,2,7,0,7


In [57]:
new_date_df = pd.get_dummies(new_date_df,columns=['data_channel','day_of_week'])
new_date_df.head()

Unnamed: 0,index,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,...,data_channel_Technology,data_channel_World,data_channel_missing,day_of_week_Friday,day_of_week_Monday,day_of_week_Saturday,day_of_week_Sunday,day_of_week_Thursday,day_of_week_Tuesday,day_of_week_Wednesday
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219.0,0.663594,1.0,0.815385,4,2,...,0,0,0,0,1,0,0,0,0,0
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255.0,0.604743,1.0,0.791946,3,1,...,0,0,0,0,1,0,0,0,0,0
2,2,http://mashable.com/2013/01/07/apple-40-billio...,731,9,211.0,0.57513,1.0,0.663866,3,1,...,0,0,0,0,1,0,0,0,0,0
3,3,http://mashable.com/2013/01/07/astronaut-notre...,731,9,531.0,0.503788,1.0,0.665635,9,0,...,0,0,0,0,1,0,0,0,0,0
4,4,http://mashable.com/2013/01/07/att-u-verse-apps/,731,13,1072.0,0.415646,1.0,0.54089,19,19,...,1,0,0,0,1,0,0,0,0,0
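
A small sketch of the fill-then-encode step on made-up data may help show where the `data_channel_missing` column comes from. The `toy` frame is invented for the example, and `dtype=int` is passed on the assumption that 0/1 integer columns are wanted, as in the output above.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'data_channel': ['World', None, 'Bus']})

# Give missing channels their own label so they get a dummy column too
toy['data_channel'] = toy['data_channel'].fillna('missing')

# One 0/1 column per distinct label; the original column is replaced
encoded = pd.get_dummies(toy, columns=['data_channel'], dtype=int)
print(encoded)
```

Recent pandas versions return boolean dummy columns by default, so `dtype=int` is what keeps the 0/1 representation here.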


### Numerical

In [58]:
def contains_numbers(column):
    # matches values that contain a digit but no 'www', '-', '/', letter, or space
    regex_string = (r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*')
    # earlier price/percentage pattern, kept for reference:
    # regex_string = (r'\$\d+.*|\d+.*\%$|^\d+.*$')
    return column.str.contains(regex_string).all()

In [59]:
def find_numeric_features(df):
    series_contains_number = df.apply(contains_numbers)
    return series_contains_number.index[series_contains_number.values]

In [60]:
numeric_features = find_numeric_features(new_date_df)
numeric_features

Index(['index', 'timedelta', 'n_tokens_title', 'n_tokens_content',
       'n_unique_tokens', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'kw_min_min', 'kw_max_min',
       'kw_avg_min', 'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg',
       'kw_max_avg', 'kw_avg_avg', 'self_reference_min_shares',
       'self_reference_max_shares', 'self_reference_avg_sharess', 'is_weekend',
       'LDA_00', 'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'global_rate_negative_words', 'rate_positive_words',
       'rate_negative_words', 'avg_positive_polarity', 'min_positive_polarity',
       'max_positive_polarity', 'avg_negative_polarity',
       'min_negative_polarity', 'max_negative_polarity', 'title_subjectivity',
       'title_sentiment_polarity', 'abs_title_subjectivity',
       'abs_titl

In [61]:
def numeric_to_fix(df):
    numeric_features = find_numeric_features(df)
    return df[numeric_features].select_dtypes(exclude=['int64', 'float64'])[0:2]

In [62]:
numeric_to_fix(new_date_df)

Unnamed: 0,data_channel_Bus,data_channel_Entertainment,data_channel_Lifestyle,data_channel_Social Media,data_channel_Technology,data_channel_World,data_channel_missing,day_of_week_Friday,day_of_week_Monday,day_of_week_Saturday,day_of_week_Sunday,day_of_week_Thursday,day_of_week_Tuesday,day_of_week_Wednesday
0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0,1,0,0,0,0,0


### Boolean

In [63]:
def find_booleans(df):
    columns = df.columns
    boolean_columns = np.array([column for column in columns if len(df[column].value_counts(dropna=True)) == 2])
    boolean_values = np.array([df[column].value_counts(dropna=True).index.to_list() for column in boolean_columns])
    columns_and_values = np.stack((boolean_columns, boolean_values[:, 0], boolean_values[:, 1])).T
    return columns_and_values

In [64]:
boolean_columns = find_booleans(new_date_df)
boolean_columns

array([['is_weekend', '0', '1'],
       ['is_nocontent', '0', '1'],
       ['is_no_kw_min_min', '1', '0'],
       ['Year', '2014', '2013'],
       ['data_channel_Bus', '0', '1'],
       ['data_channel_Entertainment', '0', '1'],
       ['data_channel_Lifestyle', '0', '1'],
       ['data_channel_Social Media', '0', '1'],
       ['data_channel_Technology', '0', '1'],
       ['data_channel_World', '0', '1'],
       ['data_channel_missing', '0', '1'],
       ['day_of_week_Friday', '0', '1'],
       ['day_of_week_Monday', '0', '1'],
       ['day_of_week_Saturday', '0', '1'],
       ['day_of_week_Sunday', '0', '1'],
       ['day_of_week_Thursday', '0', '1'],
       ['day_of_week_Tuesday', '0', '1'],
       ['day_of_week_Wednesday', '0', '1']], dtype='<U26')

In [65]:
def almost_binary(df, threshold = .95):
    return np.array([np.array([cat, top]) for cat, frequency, top in summarize_counts(df) if 1.0 > frequency > threshold])

In [66]:
almost_bin_feats = almost_binary(new_date_df)
almost_bin_feats

array([['is_nocontent', '0']], dtype='<U12')
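
Note that a two-distinct-values check also picks up columns such as Year (2013 and 2014 in the output above), which are not 0/1 flags. A minimal sketch of how the two cases could be told apart, on made-up data (the `toy` frame and the names `two_valued` and `binary_flags` are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'is_weekend': [0, 1, 0, 0],
    'Year': [2013, 2014, 2014, 2013],
    'shares': [593, 711, 1500, 1200],
})

# Columns with exactly two distinct non-missing values
n_distinct = toy.nunique(dropna=True)
two_valued = n_distinct[n_distinct == 2].index.tolist()

# Of those, keep only the columns whose values are 0 and 1
binary_flags = [col for col in two_valued if set(toy[col].unique()) <= {0, 1}]
print(two_valued, binary_flags)
```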

In [67]:
new_date_df.drop(columns=['index','url'],inplace=True)
new_date_df.head()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,data_channel_Technology,data_channel_World,data_channel_missing,day_of_week_Friday,day_of_week_Monday,day_of_week_Saturday,day_of_week_Sunday,day_of_week_Thursday,day_of_week_Tuesday,day_of_week_Wednesday
0,731,12,219.0,0.663594,1.0,0.815385,4,2,1,0,...,0,0,0,0,1,0,0,0,0,0
1,731,9,255.0,0.604743,1.0,0.791946,3,1,1,0,...,0,0,0,0,1,0,0,0,0,0
2,731,9,211.0,0.57513,1.0,0.663866,3,1,1,0,...,0,0,0,0,1,0,0,0,0,0
3,731,9,531.0,0.503788,1.0,0.665635,9,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,731,13,1072.0,0.415646,1.0,0.54089,19,19,20,0,...,1,0,0,0,1,0,0,0,0,0


In [68]:
new_date_df.to_csv('cleaned_onlinepopularity.csv')
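
When the saved file is read back, it helps to know that `to_csv` writes the row index as an extra, unnamed column by default. The sketch below shows the difference on a made-up frame, using an in-memory buffer instead of a file; passing `index=False` is an optional variation, not what the notebook does.

```python
import io
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'timedelta': [731, 731], 'shares': [593, 711]})

# Default: the index is written too and comes back as 'Unnamed: 0'
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```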