# Analysing the 'BlogFeedback Data Set' from the UC Irvine Machine Learning repository

This notebook analyzes the 'BlogFeedback Data Set' from the UC Irvine Machine Learning repository. The data set is available [here](https://archive.ics.uci.edu/ml/datasets/BlogFeedback). **The objective of the notebook is to create a model that predicts the number of comments a blog post will receive in the next 24 hours**.

This data originates from blog posts. The raw HTML-documents of the blog posts were crawled and processed. In the train data, the basetimes were in the years 2010 and 2011. In the test data the basetimes were in February and March 2012.

**The data set has 280 attributes. Therefore, in this notebook we test different techniques to deal with this large number of attributes**. First, we fit models on the whole data set without any kind of adjustment. This will be our reference model. Then, we test some feature selection methods to identify the most relevant attributes to predict the target value. Finally, we test the Principal Component Analysis (PCA) dimensionality reduction method.

The notebook is organized as follows:

1. Data exploration
2. Train ML model
3. Evaluate the ML model
4. Conclusion

----------

## 1. Data exploration

In this section, we explore the data set, including its dimensions and the meaning of its variables.

The data set contains 281 columns and 52397 rows.

The attributes of the data set are the following:

Column:
- 1...50: Average, standard deviation, min, max and median of the Attributes 51...60 for the source of the current blog post. With source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
- 51: Total number of comments before basetime
- 52: Number of comments in the last 24 hours before the basetime
- 53: Let T1 denote the datetime 48 hours before basetime. Let T2 denote the datetime 24 hours before basetime. This attribute is the number of comments in the time period between T1 and T2
- 54: Number of comments in the first 24 hours after the publication of the blog post, but before basetime
- 55: The difference of Attribute 52 and Attribute 53
- 56...60: The same features as the attributes 51...55, but features 56...60 refer to the number of links (trackbacks), while features 51...55 refer to the number of comments.
- 61: The length of time between the publication of the blog post and basetime
- 62: The length of the blog post
- 63...262: The 200 bag of words features for 200 frequent words of the text of the blog post
- 263...269: binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the basetime
- 270...276: binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the date of publication of the blog post
- 277: Number of parent pages: we consider a blog post P as a parent of blog post B, if B is a reply (trackback) to blog post P.
- 278...280: Minimum, maximum, average number of comments that the parents received
- 281: The target: the number of comments in the next 24 hours (relative to basetime)
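As a quick sanity check of the definitions above, attribute 55 should equal attribute 52 minus attribute 53. The sketch below rebuilds attributes 51 to 55 for the first five rows of the training file (the values that `df_data.head()` prints further down), using the short column names this notebook assigns, and verifies the relation. It only illustrates the definition on five rows; it is not a check of the whole file.

```python
import pandas as pd

# Attributes 51-55 of the first five training rows, copied from the head() output
sample = pd.DataFrame(
    {
        "total":      [2.0, 6.0, 6.0, 2.0, 3.0],
        "last24h":    [2.0, 2.0, 2.0, 2.0, 1.0],
        "24-48h":     [0.0, 4.0, 4.0, 0.0, 2.0],
        "first24h":   [2.0, 5.0, 5.0, 2.0, 2.0],
        "difference": [2.0, -2.0, -2.0, 2.0, -1.0],
    }
)

# Attribute 55 is defined as attribute 52 minus attribute 53
recomputed = sample["last24h"] - sample["24-48h"]
difference_matches = bool((recomputed == sample["difference"]).all())
print(difference_matches)
```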

In [38]:
import pandas as pd
import numpy as np
#!pip install -U scikit-learn

----------

### Getting the data

In [39]:
attributes = [*range(1, 282, 1)]

df_data = pd.read_csv('/Users/leuzinger/Dropbox/Data Science/Awari/Regressions/BlogFeedback/blogData_train.csv',names=attributes)
# read_csv already returns a default RangeIndex, so no reset_index call is needed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52397 entries, 0 to 52396
Columns: 281 entries, 1 to 281
dtypes: float64(281)
memory usage: 112.3 MB


In [40]:
att=[]
for i in ['total','last24h','24-48h','first24h','difference',
           'total_tr','last24h_tr','24-48h_tr','first24h_tr','difference_tr']:
    att1 = 'blog_avg_' + str(i)
    att2 = 'blog_std_' + str(i)
    att3 = 'blog_min_' + str(i)
    att4 = 'blog_max_' + str(i)
    att5 = 'blog_median_' + str(i)
    att.extend([att1,att2,att3,att4,att5])

att51_62 = ['total','last24h','24-48h','first24h','difference',
           'total_tr','last24h_tr','24-48h_tr','first24h_tr','difference_tr',
           'time_first_post','lenght_post']
att.extend(att51_62)

for i in range(63,263):
    att_word = 'word' + str(i-62)
    att.extend([att_word])

att263_281 = ['Mon_bl','Tue_bl','Wed_bl','Thu_bl','Fri_bl','Sat_bl','Sun_bl',
             'Mon_post','Tue_post','Wed_post','Thu_post','Fri_post','Sat_post','Sun_post',
             'parent_pages','min_parent','max_parent','avg_parent','target']
att.extend(att263_281)

In [41]:
# set_axis no longer accepts inplace in pandas 2.0+, so the result is assigned back
df_data = df_data.set_axis(att, axis=1)
df_data.head()

Unnamed: 0,blog_avg_total,blog_std_total,blog_min_total,blog_max_total,blog_median_total,blog_avg_last24h,blog_std_last24h,blog_min_last24h,blog_max_last24h,blog_median_last24h,blog_avg_24-48h,blog_std_24-48h,blog_min_24-48h,blog_max_24-48h,blog_median_24-48h,blog_avg_first24h,blog_std_first24h,blog_min_first24h,blog_max_first24h,blog_median_first24h,blog_avg_difference,blog_std_difference,blog_min_difference,blog_max_difference,blog_median_difference,blog_avg_total_tr,blog_std_total_tr,blog_min_total_tr,blog_max_total_tr,blog_median_total_tr,blog_avg_last24h_tr,blog_std_last24h_tr,blog_min_last24h_tr,blog_max_last24h_tr,blog_median_last24h_tr,blog_avg_24-48h_tr,blog_std_24-48h_tr,blog_min_24-48h_tr,blog_max_24-48h_tr,blog_median_24-48h_tr,blog_avg_first24h_tr,blog_std_first24h_tr,blog_min_first24h_tr,blog_max_first24h_tr,blog_median_first24h_tr,blog_avg_difference_tr,blog_std_difference_tr,blog_min_difference_tr,blog_max_difference_tr,blog_median_difference_tr,total,last24h,24-48h,first24h,difference,total_tr,last24h_tr,24-48h_tr,first24h_tr,difference_tr,time_first_post,lenght_post,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19,word20,word21,word22,word23,word24,word25,word26,word27,word28,word29,word30,word31,word32,word33,word34,word35,word36,word37,word38,word39,word40,word41,word42,word43,word44,word45,word46,word47,word48,word49,word50,word51,word52,word53,word54,word55,word56,word57,word58,word59,word60,word61,word62,word63,word64,word65,word66,word67,word68,word69,word70,word71,word72,word73,word74,word75,word76,word77,word78,word79,word80,word81,word82,word83,word84,word85,word86,word87,word88,word89,word90,word91,word92,word93,word94,word95,word96,word97,word98,word99,word100,word101,word102,word103,word104,word105,word106,word107,word108,word109,word110,word111,word112,word113,word114,word115,word116,word117,word118,word119,word120,word121,word122,word123,word124,word125,word126,word127,word128,word129,word130,word131,word132,word133,word134,word135,word136,word137,word138,word139,word140,word141,word142,word143,word144,word145,word146,word147,word148,word149,word150,word151,word152,word153,word154,word155,word156,word157,word158,word159,word160,word161,word162,word163,word164,word165,word166,word167,word168,word169,word170,word171,word172,word173,word174,word175,word176,word177,word178,word179,word180,word181,word182,word183,word184,word185,word186,word187,word188,word189,word190,word191,word192,word193,word194,word195,word196,word197,word198,word199,word200,Mon_bl,Tue_bl,Wed_bl,Thu_bl,Fri_bl,Sat_bl,Sun_bl,Mon_post,Tue_post,Wed_post,Thu_post,Fri_post,Sat_post,Sun_post,parent_pages,min_parent,max_parent,avg_parent,target
0,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,14.044226,32.615417,0.0,377.0,2.0,34.567566,48.475178,0.0,378.0,12.0,1.479934,46.18691,-356.0,377.0,0.0,1.076167,1.795416,0.0,11.0,0.0,0.400491,1.078097,0.0,9.0,0.0,0.377559,1.07421,0.0,9.0,0.0,0.972973,1.704671,0.0,10.0,0.0,0.022932,1.521174,-8.0,9.0,0.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,14.044226,32.615417,0.0,377.0,2.0,34.567566,48.475178,0.0,378.0,12.0,1.479934,46.18691,-356.0,377.0,0.0,1.076167,1.795416,0.0,11.0,0.0,0.400491,1.078097,0.0,9.0,0.0,0.377559,1.07421,0.0,9.0,0.0,0.972973,1.704671,0.0,10.0,0.0,0.022932,1.521174,-8.0,9.0,0.0,6.0,2.0,4.0,5.0,-2.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,14.044226,32.615417,0.0,377.0,2.0,34.567566,48.475178,0.0,378.0,12.0,1.479934,46.18691,-356.0,377.0,0.0,1.076167,1.795416,0.0,11.0,0.0,0.400491,1.078097,0.0,9.0,0.0,0.377559,1.07421,0.0,9.0,0.0,0.972973,1.704671,0.0,10.0,0.0,0.022932,1.521174,-8.0,9.0,0.0,6.0,2.0,4.0,5.0,-2.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,14.044226,32.615417,0.0,377.0,2.0,34.567566,48.475178,0.0,378.0,12.0,1.479934,46.18691,-356.0,377.0,0.0,1.076167,1.795416,0.0,11.0,0.0,0.400491,1.078097,0.0,9.0,0.0,0.377559,1.07421,0.0,9.0,0.0,0.972973,1.704671,0.0,10.0,0.0,0.022932,1.521174,-8.0,9.0,0.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,14.044226,32.615417,0.0,377.0,2.0,34.567566,48.475178,0.0,378.0,12.0,1.479934,46.18691,-356.0,377.0,0.0,1.076167,1.795416,0.0,11.0,0.0,0.400491,1.078097,0.0,9.0,0.0,0.377559,1.07421,0.0,9.0,0.0,0.972973,1.704671,0.0,10.0,0.0,0.022932,1.521174,-8.0,9.0,0.0,3.0,1.0,2.0,2.0,-1.0,0.0,0.0,0.0,0.0,0.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0


----------

### Data correlation

We start our analysis by looking at which attributes have the highest correlation with the target. First, we create a correlation matrix.

We can see that the variables with the strongest positive correlations with the target value are blog_median_last24h and blog_avg_difference. Besides, we see that the total number of comments, the comments in the last 24h, and the comments between 24h and 48h before basetime are strongly correlated with each other.
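To inspect how strongly a handful of features correlate with each other, rather than with the target, one can slice the correlation matrix by column name. The sketch below uses synthetic data: the column names `total` and `last24h` mirror the notebook's, `unrelated` is a made-up control column, and all numbers are generated for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.poisson(lam=5.0, size=500).astype(float)

# Synthetic stand-ins: two columns driven by the same base signal, one independent column
toy = pd.DataFrame(
    {
        "total": base + rng.poisson(lam=1.0, size=500),
        "last24h": 0.5 * base + rng.normal(scale=0.5, size=500),
        "unrelated": rng.normal(size=500),
    }
)

toy_corr = toy.corr()
# Pairwise block of the correlation matrix for the columns of interest
pairwise = toy_corr.loc[["total", "last24h"], ["total", "last24h"]]
print(pairwise.round(2))
```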

In [43]:
# Positions 261-276 are word200, the 14 weekday indicators and parent_pages
corr_matrix = df_data.drop(columns=df_data.columns[261:277]).corr()
corr_matrix.loc['target'].sort_values(ascending=False)

target                       1.000000
blog_median_last24h          0.506540
blog_avg_difference          0.503375
blog_avg_last24h             0.497631
blog_median_total            0.491707
blog_avg_24-48h              0.490111
blog_median_24-48h           0.489674
blog_median_first24h         0.486316
blog_avg_total               0.485464
last24h                      0.472061
blog_avg_first24h            0.471999
blog_median_last24h_tr       0.461627
blog_std_difference          0.440003
blog_std_24-48h              0.439152
blog_std_last24h             0.433578
blog_std_total               0.424616
blog_std_first24h            0.384654
blog_max_total               0.356604
blog_median_total_tr         0.338961
blog_avg_24-48h_tr           0.337775
blog_avg_last24h_tr          0.335829
blog_avg_first24h_tr         0.329670
blog_avg_total_tr            0.328525
blog_median_first24h_tr      0.323661
blog_max_24-48h              0.322775
blog_max_last24h             0.322106
blog_max_dif

----------

### Creating the Train and Test sets

Creating a test set at the beginning of the project avoids *data snooping* bias: if you look at the test set before choosing a model, then "when you estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected" (GÉRON, 2019).

In this data set, the test set is already provided separately. Therefore, we do not need to create a test set; we just separate the target value from the other attributes to create our training set.

In [6]:
blog_X_train = df_data.drop('target',axis=1).copy()
blog_y_train = df_data['target'].copy()

----------

### Preparing the data for ML algorithms

Before creating the ML models, we need to prepare the data so that the ML algorithms will work properly.

First, we need to clean missing values from the dataset. Second, we need to put all the attributes on the same scale because "Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales" [(GÉRON, 2019)](https://www.amazon.com.br/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646).

We verify that there are no missing values in our data set. So, we just prepare a pipeline to do the scaling when necessary.

In [8]:
blog_X_train.isnull().values.any()

False

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def estimator_transf(estimator):
    pipeline = Pipeline(steps=[('m', estimator)])
    return pipeline

def estimator_scaler(estimator):
    pipeline = Pipeline(steps=[('scaler',StandardScaler()),('model', estimator)])
    return pipeline 
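To illustrate what the scaling step does, the sketch below applies `StandardScaler` to two made-up columns with very different ranges. Placing the scaler inside a `Pipeline`, as in `estimator_scaler`, means that during cross-validation it is fitted on the training folds only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

X_scaled = StandardScaler().fit_transform(X_toy)
col_means = X_scaled.mean(axis=0)  # approximately 0 for each column
col_stds = X_scaled.std(axis=0)    # approximately 1 for each column
print(col_means, col_stds)
```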

----------

## 2. Train ML model

After preparing the data set, we are ready to select and train our ML model.

We start with a Linear Regression (LR) model. "A regression model, such as linear regression, models an output value based on a linear combination of input values" [(BROWNLEE, 2020)](https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/).

Then, we try some regularized linear models. This kind of model constrains the weights of the model, which reduces overfitting (GÉRON, 2019). We try three regularized linear models [(BROWNLEE, 2016)](https://machinelearningmastery.com/machine-learning-with-python/):

1. Ridge regression. This model uses L2 regularization. It adds the “squared magnitude” of the coefficients as a penalty term to the loss function [(NAGPAL, 2017)](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c).
2. Lasso regression. This model uses L1 regularization. It adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (NAGPAL, 2017).
3. Elastic Net. This model combines the Ridge and the Lasso models. "It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the L2-norm (sum squared coefficient values) and the L1-norm (sum absolute coefficient values)" (BROWNLEE, 2016).
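The practical difference between the L2 and L1 penalties can be seen on a small synthetic problem. In the sketch below, only the first two of ten features influence the target; the data, the `alpha` values and the variable names are assumptions chosen for illustration and are not taken from the BlogFeedback experiments.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
# Only the first two features influence the synthetic target
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)

# L2 shrinks coefficients towards zero; L1 can set them exactly to zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("coefficients exactly zero -> ridge:", n_zero_ridge, "lasso:", n_zero_lasso)
```

Because Lasso can drop features entirely, it acts as an intrinsic form of feature selection, which is relevant for a data set with 280 attributes.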

Finally, we also try some nonlinear algorithms:

1. Classification and Regression Trees (CART). It uses "the training data to select the best points to split the data in order to minimize a cost metric" (BROWNLEE, 2016).
2. k-Nearest Neighbors (KNN). This model "locates the k most similar instances in the training dataset for a new data instance" (BROWNLEE, 2016).

The models are evaluated using the mean absolute error (MAE), root mean squared error (RMSE), and R². RMSE punishes larger errors more than smaller ones, because the errors are squared before averaging, which inflates the score when a few predictions are far off. MAE weights all errors equally, so the score increases linearly with the error; it is the simplest evaluation metric and the most easily interpreted. R² tells you how much of the variance in the target your model accounts for. For MAE and RMSE, lower is better; for R², the closer the value is to 1, the better ([HALE, 2020](https://towardsdatascience.com/which-evaluation-metric-should-you-use-in-machine-learning-regression-problems-20cdaef258e); [BROWNLEE, 2021](https://machinelearningmastery.com/regression-metrics-for-machine-learning/)).
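A small worked example makes the three metrics concrete. The four target values and predictions below are made up for illustration; the RMSE is computed as the square root of scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions; one prediction is off by 4
y_true = np.array([0.0, 2.0, 4.0, 10.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

mae = mean_absolute_error(y_true, y_pred)           # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt((1 + 0 + 1 + 16) / 4) = sqrt(4.5)
r2 = r2_score(y_true, y_pred)                       # 1 - 18 / 56
print(mae, rmse, r2)
```

The single error of 4 pulls the RMSE (about 2.12) well above the MAE (1.5), which is the behaviour described above.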

Besides, "the key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. You can achieve this by forcing each algorithm to be evaluated on a consistent test harness" (BROWNLEE, 2016). In this project, we do this by using the same split in the cross validation. We use the KFold function from the sklearn library with a random value rs as the random_state parameter. Although the rs value changes every time the notebook is run, once it is set, the same rs value is used in all the models. This guarantees that all the models are evaluated on the same folds.
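The effect of reusing one seed can be checked directly: two `KFold` objects built with the same `random_state` yield identical folds. The sketch below uses 20 dummy samples and a fixed seed `rs_demo`, which is a hypothetical stand-in for the randomly drawn `rs` of this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_idx = np.arange(20).reshape(-1, 1)
rs_demo = 42  # hypothetical fixed seed; the notebook draws rs at random

# Collect the test indices of every fold from two independently created splitters
splits_a = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=rs_demo).split(X_idx)]
splits_b = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=rs_demo).split(X_idx)]

same_folds = all(np.array_equal(a, b) for a, b in zip(splits_a, splits_b))
print(same_folds)
```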

The results of the tests of the models with the training data show that **KNN is the best model**. In the run shown below, it has the lowest RMSE and the highest R², and its MAE is virtually tied with CART's for the lowest.

However, differing scales of the raw data could be negatively impacting the performance of some of the models. Therefore, we test the models again, but this time we standardize the data set.

With standardization, the MAE of Lasso and Elastic Net improved, while Linear Regression, Ridge and CART were essentially unchanged. The performance of KNN, however, degraded with the standardized data. Even so, KNN still had the lowest RMSE and the highest R².

**Therefore, for this initial test, we verify that KNN without standardization is the best model for our data**.

However, **using the data set with all the 280 attributes requires a lot of computing time**. So, let's try some feature selection methods to see if we can reduce the number of attributes to be used in our models.

In [10]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

def estimator_cross_val(model, estimator, pipe, matriz, rs, X, y):
    # Wrap the estimator in the chosen pipeline (with or without scaling)
    pipe_ = pipe(estimator)
    scoring = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2']
    # The same seed is used for every model, so all models see identical folds
    kfold = KFold(n_splits=5, random_state=rs, shuffle=True)
    scores = cross_validate(pipe_, X, y, cv=kfold, scoring=scoring)

    # scikit-learn reports errors as negative scores; flip the sign back
    mae_scores = -scores['test_neg_mean_absolute_error']
    rmse_scores = -scores['test_neg_root_mean_squared_error']
    r2_scores = scores['test_r2']

    results_ = [model, mae_scores.mean(), mae_scores.std(),
                rmse_scores.mean(), rmse_scores.std(),
                r2_scores.mean(), r2_scores.std()]
    results_ = pd.Series(results_, index=matriz.columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    results = pd.concat([matriz, results_.to_frame().T], ignore_index=True)
    return results

In [11]:
from random import randrange
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
import warnings

warnings.filterwarnings("ignore")

rs = randrange(10000)
matriz = pd.DataFrame(columns=['model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])

matriz = estimator_cross_val('Linear Regression',LinearRegression(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz = estimator_cross_val('Ridge Regression',Ridge(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz = estimator_cross_val('Lasso',Lasso(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz = estimator_cross_val('Elastic Net',ElasticNet(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz = estimator_cross_val('KNN',KNeighborsRegressor(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz = estimator_cross_val('CART',DecisionTreeRegressor(),estimator_transf,matriz,rs,blog_X_train,blog_y_train)
matriz

| | model | MAE_mean | MAE_std | RMSE_mean | RMSE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 9.535883 | 0.155004 | 30.383434 | 1.101252 | 0.347938 | 0.030813 |
| 1 | Ridge Regression | 9.530001 | 0.155896 | 30.378057 | 1.104894 | 0.348174 | 0.030823 |
| 2 | Lasso | 9.122793 | 0.183914 | 30.302539 | 1.13734 | 0.351477 | 0.030541 |
| 3 | Elastic Net | 9.135748 | 0.185702 | 30.309116 | 1.139229 | 0.351195 | 0.030631 |
| 4 | KNN | 6.387953 | 0.228165 | 28.589911 | 1.318047 | 0.423054 | 0.028169 |
| 5 | CART | 6.373964 | 0.214273 | 33.118618 | 1.396891 | 0.220945 | 0.091865 |


In [12]:
matriz2 = pd.DataFrame(columns=['model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])

matriz2 = estimator_cross_val('Linear Regression',LinearRegression(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2 = estimator_cross_val('Ridge Regression',Ridge(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2 = estimator_cross_val('Lasso',Lasso(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2 = estimator_cross_val('Elastic Net',ElasticNet(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2 = estimator_cross_val('KNN',KNeighborsRegressor(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2 = estimator_cross_val('CART',DecisionTreeRegressor(),estimator_scaler,matriz2,rs,blog_X_train,blog_y_train)
matriz2

| | model | MAE_mean | MAE_std | RMSE_mean | RMSE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 9.536043 | 0.158151 | 30.382693 | 1.103323 | 0.347973 | 0.030809 |
| 1 | Ridge Regression | 9.53592 | 0.156277 | 30.380541 | 1.10218 | 0.34806 | 0.030874 |
| 2 | Lasso | 8.435559 | 0.157798 | 30.499635 | 1.17034 | 0.343283 | 0.025468 |
| 3 | Elastic Net | 8.448594 | 0.167776 | 30.672138 | 1.181716 | 0.335904 | 0.024106 |
| 4 | KNN | 6.82083 | 0.147086 | 29.504949 | 1.002954 | 0.384447 | 0.038376 |
| 5 | CART | 6.379402 | 0.323984 | 32.991296 | 1.777952 | 0.228157 | 0.086245 |


-----

### Feature selection

"Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in many cases, to improve the performance of the model" (BROWNLEE, 2021).

There are two main techniques of feature selection: supervised and unsupervised. Supervised methods use the target variable, while unsupervised methods do not (BROWNLEE, 2021).

Besides, the supervised techniques can be divided into (BROWNLEE, 2021):

1. Intrinsic: Algorithms that perform automatic feature selection during training.
2. Wrapper: Search subsets of features that perform according to a predictive model.
3. Filter: Select subsets of features based on their relationship with the target.

### Mutual Information Statistics

Some feature selection methods are more appropriate for numerical variables and others for categorical ones. One popular feature selection technique that works for both numerical and categorical variables is Mutual Information Statistics (BROWNLEE, 2021).

"Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable"  (BROWNLEE, 2021).
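As a minimal illustration of the idea, the sketch below scores two synthetic features against a target that depends non-linearly on the first one only. It uses `mutual_info_regression`, the scikit-learn scorer for continuous targets, which is an assumption made for this example and not the `mutual_info_classif` scorer used in the notebook cell; all data and variable names are made up.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
informative = rng.normal(size=1000)
noise = rng.normal(size=1000)
X_mi = np.column_stack([informative, noise])

# The target depends on the first feature only, and non-linearly
y_mi = informative ** 2 + rng.normal(scale=0.1, size=1000)

mi_scores = mutual_info_regression(X_mi, y_mi, random_state=0)
print(mi_scores)
```

The first feature receives a clearly higher score than the noise feature, even though its linear correlation with the target is close to zero. This is why mutual information can complement the correlation analysis done earlier.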

We find that many attributes have negligible information value. In the run shown below, 192 features have a contribution score over 0.0001, 164 over 0.001, 62 over 0.01, and only 33 over 0.1. **These numbers can vary depending on the training set**. Therefore, we will test the 30, 70, 150, and 190 best features and compare them with the results obtained using all features.

**We see that the performance using the 70, 150, and 190 best features is almost the same as when using all features. Using the 30 best features is just slightly worse than using all features**. Moreover, in all cases the KNN model has the best performance.

We could do a grid search to "systematically test a range of different numbers of selected features and discover which results in the best performing model" (BROWNLEE, 2021). **However, a grid search to determine the optimum number of features would require a lot of computing time and the benefit would not be significant in our evaluation**.
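For reference, such a grid search could look roughly like the sketch below. It runs on a small synthetic regression problem so that it finishes quickly; the data, the candidate values of `k`, the step names `select` and `model`, and the use of `mutual_info_regression` with `LinearRegression` are all assumptions for illustration, not the setup evaluated in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Small synthetic problem: 20 features, 5 of them informative
X_gs, y_gs = make_regression(n_samples=200, n_features=20, n_informative=5,
                             noise=10.0, random_state=0)

# Feature selection sits inside the pipeline, so it is re-fitted on each training fold
pipe_gs = Pipeline(steps=[
    ("select", SelectKBest(score_func=mutual_info_regression)),
    ("model", LinearRegression()),
])

param_grid = {"select__k": [2, 5, 10, 20]}
search = GridSearchCV(pipe_gs, param_grid,
                      scoring="neg_mean_absolute_error",
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X_gs, y_gs)

best_k = search.best_params_["select__k"]
print(best_k, -search.best_score_)
```

On the full data set, every candidate `k` multiplies the cross-validation cost, which is the computing-time concern raised above.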

In [13]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

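# feature selection
def select_features(X_train, y_train, k_):
    # keep the k_ highest-scoring features (k_='all' keeps every feature and only computes the scores)
    # note: mutual_info_classif treats each distinct target value as a class;
    # scikit-learn also provides mutual_info_regression for continuous targets
    fs = SelectKBest(score_func=mutual_info_classif, k=k_)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    return X_train_fs, fs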

In [14]:
# feature selection
blog_X_train_mi, mi = select_features(blog_X_train, blog_y_train,'all')

# what are scores for the features
MI = pd.DataFrame(mi.scores_, columns = ['Score'])

print(MI[MI > 0.0001].count())
print(MI[MI > 0.001].count())
print(MI[MI > 0.01].count())
print(MI[MI > 0.1].count())

Score    192
dtype: int64
Score    164
dtype: int64
Score    62
dtype: int64
Score    33
dtype: int64


In [15]:
def estimator_cross_val_fea(k, model, estimator, pipe, matriz, rs, X, y):
    # Same evaluation as estimator_cross_val, with the number of features k recorded
    pipe_ = pipe(estimator)
    scoring = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2']
    kfold = KFold(n_splits=5, random_state=rs, shuffle=True)
    scores = cross_validate(pipe_, X, y, cv=kfold, scoring=scoring)

    # scikit-learn reports errors as negative scores; flip the sign back
    mae_scores = -scores['test_neg_mean_absolute_error']
    rmse_scores = -scores['test_neg_root_mean_squared_error']
    r2_scores = scores['test_r2']

    results_ = [k, model, mae_scores.mean(), mae_scores.std(),
                rmse_scores.mean(), rmse_scores.std(),
                r2_scores.mean(), r2_scores.std()]
    results_ = pd.Series(results_, index=matriz.columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    results = pd.concat([matriz, results_.to_frame().T], ignore_index=True)
    return results

matriz_mi = pd.DataFrame(columns=['features','model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])

for k in [30,70,150,190]:

    best_features_mi = MI.transpose()
    best_features_mi.columns = blog_X_train.columns
    best_features_mi.sort_values('Score',axis=1,ascending=False,inplace=True)
    best_features_mi.drop(best_features_mi.iloc[:,k:],axis=1,inplace=True)
    blog_X_train_mi = blog_X_train[best_features_mi.columns]
  
    matriz_mi = estimator_cross_val_fea(k,'Linear Regression',LinearRegression(),     estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)
    matriz_mi = estimator_cross_val_fea(k,'Ridge Regression', Ridge(),                estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)
    matriz_mi = estimator_cross_val_fea(k,'Lasso',            Lasso(),                estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)
    matriz_mi = estimator_cross_val_fea(k,'Elastic Net',      ElasticNet(),           estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)
    matriz_mi = estimator_cross_val_fea(k,'KNN',              KNeighborsRegressor(),  estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)
    matriz_mi = estimator_cross_val_fea(k,'CART',             DecisionTreeRegressor(),estimator_transf,matriz_mi,rs,blog_X_train_mi,blog_y_train)

matriz_mi

| | features | model | MAE_mean | MAE_std | RMSE_mean | RMSE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|---|
| 0 | 30 | Linear Regression | 8.088844 | 0.191775 | 30.478789 | 1.097408 | 0.343845 | 0.0307 |
| 1 | 30 | Ridge Regression | 8.088783 | 0.190884 | 30.478051 | 1.099127 | 0.343878 | 0.03073 |
| 2 | 30 | Lasso | 8.092827 | 0.17467 | 30.435213 | 1.164739 | 0.34583 | 0.030713 |
| 3 | 30 | Elastic Net | 8.105448 | 0.17662 | 30.441817 | 1.155065 | 0.345539 | 0.030518 |
| 4 | 30 | KNN | 6.475607 | 0.228648 | 28.754414 | 1.411436 | 0.414096 | 0.061922 |
| 5 | 30 | CART | 7.132292 | 0.190374 | 34.518267 | 1.294679 | 0.15852 | 0.038165 |
| 6 | 70 | Linear Regression | 9.247438 | 0.197738 | 30.344834 | 1.113804 | 0.349604 | 0.03098 |
| 7 | 70 | Ridge Regression | 9.246929 | 0.199456 | 30.342582 | 1.115527 | 0.349703 | 0.030987 |
| 8 | 70 | Lasso | 9.120361 | 0.185669 | 30.308663 | 1.147322 | 0.351221 | 0.030775 |
| 9 | 70 | Elastic Net | 9.13371 | 0.187704 | 30.310683 | 1.145248 | 0.351125 | 0.030897 |


### Wrapper feature selection method

One way to handle data sets that combine numerical and categorical variables is to use a wrapper method. Some often used wrapper methods are Tree-Searching Methods, Stochastic Global Search, Step-Wise Models, and Recursive Feature Elimination (BROWNLEE, 2021).

We use the **Recursive Feature Elimination (RFE) method**. This method searches "for a subset of features by starting with all features in the training dataset and successfully removing features until the desired number remains. This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model" (BROWNLEE, 2021).

We use RFE to reduce the number of attributes in the data set, with the same feature counts as in the mutual information selection method. Thus, we select the 190, 150, 70, and 30 most relevant features and evaluate the models again.

**We can see that the best model is the KNN with 30 features**. Moreover, in the results shown below, every model achieved a slightly lower RMSE and a slightly higher R² with the features selected using the RFE than with the ones selected using the mutual information, although the MAE of the linear models with 30 features was lower with mutual information (about 8.09 versus 9.18). The KNN with 30 RFE-selected features even performed better than the one using all features.
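
As a minimal illustration of how RFE behaves, the sketch below runs it on a small synthetic regression problem. The data and the names `X_demo`, `y_demo`, and `rfe_demo` are assumptions made for this example and are not part of the notebook's data. Only two of the six columns drive the target, and RFE is asked to keep two features.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): 6 features, only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=300)

# RFE fits the estimator, drops the least important feature, and repeats
# until n_features_to_select features remain.
rfe_demo = RFE(estimator=DecisionTreeRegressor(random_state=0), n_features_to_select=2)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   # boolean mask of the kept features
print(rfe_demo.ranking_)   # 1 = kept; larger values were eliminated earlier
```

With a DataFrame, the same mask can be used to keep the selected columns, for example `X.loc[:, rfe_demo.support_]`.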

In [16]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

matriz_rfe = pd.DataFrame(columns=['features','model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])
rfe_features = pd.DataFrame()

for k in [30,70,150,190]:
    rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=k)
    rfe.fit(blog_X_train,blog_y_train)
    RF = pd.DataFrame(rfe.support_, columns = ['{} Features'.format(k)])
    rfe_features['{} Features'.format(k)] = RF['{} Features'.format(k)]
    
    best_features_rfe = RF.transpose()
    best_features_rfe.columns = blog_X_train.columns
    best_features_rfe.sort_values('{} Features'.format(k),axis=1,ascending=False,inplace=True)
    best_features_rfe.drop(best_features_rfe.iloc[:,k:],axis=1,inplace=True)
    blog_X_train_rfe = blog_X_train[best_features_rfe.columns]
    blog_X_train_rfe.head()
  
    matriz_rfe = estimator_cross_val_fea(k,'Linear Regression',LinearRegression(),     estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)
    matriz_rfe = estimator_cross_val_fea(k,'Ridge Regression', Ridge(),                estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)
    matriz_rfe = estimator_cross_val_fea(k,'Lasso',            Lasso(),                estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)
    matriz_rfe = estimator_cross_val_fea(k,'Elastic Net',      ElasticNet(),           estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)
    matriz_rfe = estimator_cross_val_fea(k,'KNN',              KNeighborsRegressor(),  estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)
    matriz_rfe = estimator_cross_val_fea(k,'CART',             DecisionTreeRegressor(),estimator_transf,matriz_rfe,rs,blog_X_train_rfe,blog_y_train)

matriz_rfe

Unnamed: 0,features,model,MAE_mean,MAE_std,RMSE_mean,RMSE_std,R2_mean,R2_std
0,30,Linear Regression,9.182049,0.189199,30.346092,1.128043,0.349626,0.029745
1,30,Ridge Regression,9.182005,0.189197,30.346085,1.128039,0.349627,0.029745
2,30,Lasso,9.087821,0.193765,30.345144,1.146574,0.349707,0.029621
3,30,Elastic Net,9.116603,0.192219,30.347987,1.150661,0.349599,0.029488
4,30,KNN,6.382809,0.224521,28.510587,1.247672,0.426304,0.024114
5,30,CART,6.495443,0.336321,34.15465,1.970775,0.174276,0.081794
6,70,Linear Regression,9.303662,0.168855,30.31332,1.137163,0.350973,0.031427
7,70,Ridge Regression,9.303296,0.16887,30.313111,1.13705,0.350982,0.031425
8,70,Lasso,9.101846,0.176443,30.289602,1.146506,0.352024,0.031049
9,70,Elastic Net,9.117699,0.1762,30.288292,1.146147,0.352074,0.031171


### Feature Importance

Another alternative to reduce the number of features is "to score input features using a model and use a filter-based feature selection method. These are called Feature Importance methods" (BROWNLEE, 2021). The most used Feature Importance methods are Classification and Regression Trees (CART), Random Forest, Bagged Decision Trees, and Gradient Boosting.

We use the Random Forest algorithm as our feature importance method. Decision tree algorithms, such as Random Forest, "offer importance scores based on the reduction in the criterion used to select split points, like Gini or entropy" (BROWNLEE, 2021).

We use the same number of features as in the previous selection methods. Thus, we select the 190, 150, 70, and 30 most relevant features and evaluate the models again. We see that the 30 most important features represent almost 80% of all importance. With the 70 most important features we reach 90%, and with the 150 most important, 99%.
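
The cumulative shares quoted above can be obtained by sorting the importance scores and accumulating them. The sketch below shows one way to do this; the helper name `features_needed` and the toy scores are assumptions for illustration, not values from the notebook.

```python
import numpy as np

def features_needed(importance, threshold):
    """Return how many top-ranked features are needed to reach `threshold`
    (a fraction between 0 and 1) of the total importance."""
    ordered = np.sort(np.asarray(importance, dtype=float))[::-1]  # descending
    cumulative = np.cumsum(ordered) / ordered.sum()
    # First position where the cumulative share reaches the threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy importance scores (made up for illustration; they sum to 1).
toy_importance = [0.05, 0.40, 0.10, 0.25, 0.20]
print(features_needed(toy_importance, 0.80))  # -> 3
```

In the notebook, the array returned by `select_features_FI` could be passed to such a helper in place of `toy_importance`.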

**We can see that KNN was the best model and that it performed similarly for the four cases tested**. Besides, KNN performed marginally worse with these features than with the features selected using the RFE method (RMSE of 28.53 versus 28.51 with 30 features).

In [17]:
# random forest for feature importance on a regression problem
from sklearn.ensemble import RandomForestRegressor

# feature importance
def select_features_FI(X_train, y_train):
    # random forest with default hyperparameters
    RFR = RandomForestRegressor()
    # learn relationship from training data
    RFR.fit(X_train, y_train)
    # impurity-based importance score of each feature
    importance = RFR.feature_importances_
    return importance

In [18]:
# feature selection
importance = select_features_FI(blog_X_train, blog_y_train)

# what are scores for the features
FI = pd.DataFrame(importance, columns = ['Importance'])

In [19]:
matriz_fi = pd.DataFrame(columns=['features','model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])

for k in [30,70,150,190]:

    best_features_fi = FI.transpose()
    best_features_fi.columns = blog_X_train.columns
    best_features_fi.sort_values('Importance',axis=1,ascending=False,inplace=True)
    best_features_fi.drop(best_features_fi.iloc[:,k:],axis=1,inplace=True)
    blog_X_train_fi = blog_X_train[best_features_fi.columns]
  
    matriz_fi = estimator_cross_val_fea(k,'Linear Regression',LinearRegression(),     estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)
    matriz_fi = estimator_cross_val_fea(k,'Ridge Regression', Ridge(),                estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)
    matriz_fi = estimator_cross_val_fea(k,'Lasso',            Lasso(),                estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)
    matriz_fi = estimator_cross_val_fea(k,'Elastic Net',      ElasticNet(),           estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)
    matriz_fi = estimator_cross_val_fea(k,'KNN',              KNeighborsRegressor(),  estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)
    matriz_fi = estimator_cross_val_fea(k,'CART',             DecisionTreeRegressor(),estimator_transf,matriz_fi,rs,blog_X_train_fi,blog_y_train)

matriz_fi

Unnamed: 0,features,model,MAE_mean,MAE_std,RMSE_mean,RMSE_std,R2_mean,R2_std
0,30,Linear Regression,9.208629,0.197936,30.305936,1.129168,0.351328,0.030278
1,30,Ridge Regression,9.185319,0.192488,30.311341,1.122996,0.351075,0.030518
2,30,Lasso,9.102784,0.170983,30.285397,1.151886,0.352246,0.030287
3,30,Elastic Net,9.122356,0.172949,30.28839,1.149228,0.352111,0.030341
4,30,KNN,6.385736,0.223781,28.531051,1.311944,0.425339,0.029447
5,30,CART,6.355993,0.219612,33.743042,1.089966,0.194396,0.055561
6,70,Linear Regression,9.295442,0.184356,30.330668,1.131971,0.350254,0.03073
7,70,Ridge Regression,9.288471,0.18628,30.331466,1.134145,0.350223,0.030741
8,70,Lasso,9.113311,0.17645,30.291149,1.156861,0.351977,0.031008
9,70,Elastic Net,9.134904,0.183125,30.301134,1.14882,0.351541,0.030901


### Comparing the features

Finally, we compare the features selected by each method. We see that the percentage of features selected by all methods increases as the number of features used increases:

1. 30 best - 9 shared (30.0%)
2. 70 best - 32 shared (45.7%)
3. 150 best - 92 shared (61.3%)
4. 190 best - 134 shared (70.5%)

Besides, we can argue that the 9 features that were selected by all methods among the 30 most relevant features are the most significant ones for predicting our target variable.
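
The comparison above amounts to intersecting the top-k sets produced by each method. The sketch below expresses that idea with plain Python sets; the helper `shared_features` and the three example rankings are hypothetical (the feature names are borrowed from the data set, but the orderings are made up).

```python
def shared_features(*rankings, k):
    """Return the features that appear in the top-k of every ranking.

    Each ranking is a list of feature names ordered from most to least relevant.
    """
    top_sets = [set(r[:k]) for r in rankings]
    return set.intersection(*top_sets)

# Hypothetical rankings from three selection methods (orderings are made up).
mi_rank = ["total", "last24h", "first24h", "word3", "parent_pages"]
rfe_rank = ["last24h", "total", "difference", "first24h", "word7"]
fi_rank = ["last24h", "difference", "total", "word3", "first24h"]

common = shared_features(mi_rank, rfe_rank, fi_rank, k=3)
print(sorted(common), f"{100 * len(common) / 3:.1f}%")

# The percentages reported above follow the same formula: shared / k.
for k, shared in [(30, 9), (70, 32), (150, 92), (190, 134)]:
    print(k, f"{100 * shared / k:.1f}%")
```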

In [20]:
MI_30 = MI.sort_values(by=['Score'],ascending=False)[:30].reset_index()
FI_30 = FI.sort_values(by=['Importance'],ascending=False)[:30].reset_index()
RF_30 = rfe_features.sort_values(by=['30 Features'],ascending=False)[:30].reset_index()
RF_30 = RF_30.drop(columns=['70 Features','150 Features','190 Features'])
merged_30 = pd.merge(FI_30, MI_30, on=['index'], how='inner')
merged_30 = pd.merge(merged_30, RF_30, on=['index'], how='inner')
len(merged_30)

9

In [21]:
MI_70 = MI.sort_values(by=['Score'],ascending=False)[:70].reset_index()
FI_70 = FI.sort_values(by=['Importance'],ascending=False)[:70].reset_index()
RF_70 = rfe_features.sort_values(by=['70 Features'],ascending=False)[:70].reset_index()
RF_70 = RF_70.drop(columns=['30 Features','150 Features','190 Features'])
merged_70 = pd.merge(FI_70, MI_70, on=['index'], how='inner')
merged_70 = pd.merge(merged_70, RF_70, on=['index'], how='inner')
len(merged_70)

32

In [22]:
MI_150 = MI.sort_values(by=['Score'],ascending=False)[:150].reset_index()
FI_150 = FI.sort_values(by=['Importance'],ascending=False)[:150].reset_index()
RF_150 = rfe_features.sort_values(by=['150 Features'],ascending=False)[:150].reset_index()
RF_150 = RF_150.drop(columns=['30 Features','70 Features','190 Features'])
merged_150 = pd.merge(FI_150, MI_150, on=['index'], how='inner')
merged_150 = pd.merge(merged_150, RF_150, on=['index'], how='inner')
len(merged_150)

92

In [23]:
MI_190 = MI.sort_values(by=['Score'],ascending=False)[:190].reset_index()
FI_190 = FI.sort_values(by=['Importance'],ascending=False)[:190].reset_index()
RF_190 = rfe_features.sort_values(by=['190 Features'],ascending=False)[:190].reset_index()
RF_190 = RF_190.drop(columns=['30 Features','70 Features','150 Features'])
merged_190 = pd.merge(FI_190, MI_190, on=['index'], how='inner')
merged_190 = pd.merge(merged_190, RF_190, on=['index'], how='inner')
len(merged_190)

134

----------

### Dimensionality reduction

"Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data" (BROWNLEE, 2021).

There are several techniques to reduce a data set dimensionality. **In this notebook, we use the Principal Component Analysis (PCA), which is the most used method for dimensionality reduction**. "It can be thought of as a projection method where data with m-columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data" (BROWNLEE, 2021). 

We reduce the dimensionality of the data set using the same number of features used in the feature selection section: 30, 70, 150, and 190. **We verify that the results are quite similar to the ones we obtained in the feature selection and that the level of dimensionality reduction has little influence on the results**. Besides, once more the KNN model is better than the other models tested.
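
One way to check how much information a given number of components retains is to look at the explained variance ratio of the fitted PCA. The sketch below does this on synthetic data; the arrays `latent`, `mixing`, and `X_demo` are assumptions built only for this example and do not come from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (illustrative only): 10 columns built from 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.05, size=(500, 10))

# Project the 10 columns onto the 3 directions of largest variance.
pca_demo = PCA(n_components=3)
X_reduced = pca_demo.fit_transform(X_demo)

print(X_reduced.shape)                            # (500, 3)
print(pca_demo.explained_variance_ratio_.sum())   # share of variance kept
```

Applying the same check to the fitted `pca` step of the notebook's pipeline would show how much variance the 30, 70, 150, and 190 components retain.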

In [24]:
from sklearn.decomposition import PCA

def estimator_pca(estimator,k):
    #imputer = SimpleImputer(strategy='median')
    pipeline = Pipeline(steps=[('pca',PCA(n_components=k)),('model', estimator)])
    return pipeline 

def estimator_cross_val_pca(k,model,estimator,pipe,matriz,rs,X,y):
    pipe_ = pipe(estimator,k)
    scoring = ['neg_mean_absolute_error', 'neg_root_mean_squared_error','r2']
    kfold = KFold(n_splits=5, random_state=rs,shuffle=True)
    scores = cross_validate(pipe_,X,y,cv=kfold,scoring=scoring)
    
    mae_scores = -scores.get('test_neg_mean_absolute_error')
    mae_mean = mae_scores.mean()
    mae_std = mae_scores.std()
    
    rmse_scores = -scores.get('test_neg_root_mean_squared_error')
    rmse_mean = rmse_scores.mean()
    rmse_std = rmse_scores.std()
    
    r2_scores = scores.get('test_r2')
    r2_mean = r2_scores.mean()
    r2_std = r2_scores.std()
    
    results_ = [k,model,mae_mean,mae_std,rmse_mean,rmse_std,r2_mean,r2_std]
    results_ = pd.DataFrame([results_], columns=matriz.columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    results = pd.concat([matriz,results_],ignore_index=True)
    return results

matriz_pca = pd.DataFrame(columns=['dimensionality','model','MAE_mean','MAE_std','RMSE_mean','RMSE_std','R2_mean','R2_std'])

for k in [30,70,150,190]:
  
    matriz_pca = estimator_cross_val_pca(k,'Linear Regression',LinearRegression(),     estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)
    matriz_pca = estimator_cross_val_pca(k,'Ridge Regression', Ridge(),                estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)
    matriz_pca = estimator_cross_val_pca(k,'Lasso',            Lasso(),                estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)
    matriz_pca = estimator_cross_val_pca(k,'Elastic Net',      ElasticNet(),           estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)
    matriz_pca = estimator_cross_val_pca(k,'KNN',              KNeighborsRegressor(),  estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)
    matriz_pca = estimator_cross_val_pca(k,'CART',             DecisionTreeRegressor(),estimator_pca,matriz_pca,rs,blog_X_train,blog_y_train)

matriz_pca

Unnamed: 0,dimensionality,model,MAE_mean,MAE_std,RMSE_mean,RMSE_std,R2_mean,R2_std
0,30,Linear Regression,9.170274,0.19668,30.337866,1.132621,0.349924,0.031216
1,30,Ridge Regression,9.170276,0.196676,30.337863,1.132618,0.349925,0.031216
2,30,Lasso,9.122299,0.180373,30.305504,1.149764,0.351367,0.030623
3,30,Elastic Net,9.145455,0.185224,30.312383,1.147534,0.351061,0.030783
4,30,KNN,6.391033,0.215455,28.586793,1.320662,0.423153,0.028792
5,30,CART,6.667366,0.249153,34.565825,1.622042,0.151668,0.099029
6,70,Linear Regression,9.291957,0.198461,30.338712,1.113666,0.349879,0.030748
7,70,Ridge Regression,9.292803,0.200156,30.337724,1.111993,0.34992,0.030711
8,70,Lasso,9.122299,0.180373,30.305504,1.149764,0.351367,0.030623
9,70,Elastic Net,9.1449,0.18532,30.311839,1.147532,0.351086,0.030759


---------

## 3. Evaluate the ML model

Now we evaluate the performance of our ML model on the test set to see how it performs with unseen data.

We will do two tests. In the first one we use the KNN model and the 30 features selected using the RFE method. For the second test, we use the KNN model and the data set reduced using the PCA method.

First, we import the test set.

After testing the models, we verify that the performance of our model on the test set is similar to its performance on the train set. The MAE and RMSE are actually a little better, but the R² is lower. Besides, the RMSE is considerably higher than the MAE. This result suggests that our data has many outliers and, consequently, our model is making some big errors.

Finally, we see that using the features selected by the RFE method and reducing the dimensionality with the PCA give similar results.
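
The gap between RMSE and MAE noted above can be illustrated with a small worked example. Because RMSE squares each error before averaging, a single large miss raises it much more than it raises MAE. The error values below are made up for illustration.

```python
import math

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up prediction errors: four small misses and one large one.
errors = [1, 1, 1, 1, 46]

print(mae(errors))             # 10.0
print(round(rmse(errors), 2))  # 20.59

# With the same MAE spread evenly, RMSE equals MAE.
even_errors = [10, 10, 10, 10, 10]
print(mae(even_errors), rmse(even_errors))  # 10.0 10.0
```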

### Getting the test set

In [25]:
import os
import glob

os.chdir(r"/Users/leuzinger/Dropbox/Data Science/Awari/Regressions/BlogFeedback/Test/")
filenames = [i for i in glob.glob("*.csv")]
df = [pd.read_csv(file, sep = ",", header=None,) 
      for file in filenames]

In [26]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks all files at once
blog_test = pd.concat(df,ignore_index=True)
blog_test = blog_test.set_axis(att,axis=1)
blog_test.head()

Unnamed: 0,blog_avg_total,blog_std_total,blog_min_total,blog_max_total,blog_median_total,blog_avg_last24h,blog_std_last24h,blog_min_last24h,blog_max_last24h,blog_median_last24h,blog_avg_24-48h,blog_std_24-48h,blog_min_24-48h,blog_max_24-48h,blog_median_24-48h,blog_avg_first24h,blog_std_first24h,blog_min_first24h,blog_max_first24h,blog_median_first24h,blog_avg_difference,blog_std_difference,blog_min_difference,blog_max_difference,blog_median_difference,blog_avg_total_tr,blog_std_total_tr,blog_min_total_tr,blog_max_total_tr,blog_median_total_tr,blog_avg_last24h_tr,blog_std_last24h_tr,blog_min_last24h_tr,blog_max_last24h_tr,blog_median_last24h_tr,blog_avg_24-48h_tr,blog_std_24-48h_tr,blog_min_24-48h_tr,blog_max_24-48h_tr,blog_median_24-48h_tr,blog_avg_first24h_tr,blog_std_first24h_tr,blog_min_first24h_tr,blog_max_first24h_tr,blog_median_first24h_tr,blog_avg_difference_tr,blog_std_difference_tr,blog_min_difference_tr,blog_max_difference_tr,blog_median_difference_tr,total,last24h,24-48h,first24h,difference,total_tr,last24h_tr,24-48h_tr,first24h_tr,difference_tr,time_first_post,lenght_post,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19,word20,word21,word22,word23,word24,word25,word26,word27,word28,word29,word30,word31,word32,word33,word34,word35,word36,word37,word38,word39,word40,word41,word42,word43,word44,word45,word46,word47,word48,word49,word50,word51,word52,word53,word54,word55,word56,word57,word58,word59,word60,word61,word62,word63,word64,word65,word66,word67,word68,word69,word70,word71,word72,word73,word74,word75,word76,word77,word78,word79,word80,word81,word82,word83,word84,word85,word86,word87,word88,word89,word90,word91,word92,word93,word94,word95,word96,word97,word98,word99,word100,word101,word102,word103,word104,word105,word106,word107,word108,word109,word110,word111,word112,word113,word114,word115,word116,word117,word118,word119,word120,word121,word122,word123,word124,word125,word126,word127,word128,word129,word130,word131,word132,word133,word134,word135,word136,word137,word138,word139,word140,word141,word142,word143,word144,word145,word146,word147,word148,word149,word150,word151,word152,word153,word154,word155,word156,word157,word158,word159,word160,word161,word162,word163,word164,word165,word166,word167,word168,word169,word170,word171,word172,word173,word174,word175,word176,word177,word178,word179,word180,word181,word182,word183,word184,word185,word186,word187,word188,word189,word190,word191,word192,word193,word194,word195,word196,word197,word198,word199,word200,Mon_bl,Tue_bl,Wed_bl,Thu_bl,Fri_bl,Sat_bl,Sun_bl,Mon_post,Tue_post,Wed_post,Thu_post,Fri_post,Sat_post,Sun_post,parent_pages,min_parent,max_parent,avg_parent,target
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064516,0.24567,0.0,1.0,0.0,0.032258,0.176685,0.0,1.0,0.0,0.032258,0.176685,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.254,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,1470.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,102.0,91.0,11.0,101.0,80.0,2.0,2.0,0.0,2.0,2.0,27.0,3520.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.056075,0.330159,0.0,2.0,0.0,0.018692,0.192442,0.0,2.0,0.0,0.018692,0.192442,0.0,2.0,0.0,0.056075,0.330159,0.0,2.0,0.0,0.0,0.273434,-2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,800.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064516,0.24567,0.0,1.0,0.0,0.032258,0.176685,0.0,1.0,0.0,0.032258,0.176685,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.254,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,51.0,1468.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,47.776787,93.73747,1.0,598.0,7.5,17.857143,56.888218,0.0,594.0,1.0,17.350447,56.91147,0.0,594.0,1.0,46.38616,91.28414,1.0,595.0,7.0,0.506696,79.06205,-590.0,594.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,5.0,0.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
blog_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7624 entries, 0 to 7623
Columns: 281 entries, blog_avg_total to target
dtypes: float64(281)
memory usage: 16.3 MB


In [28]:
blog_X_test = blog_test.drop('target',axis=1).copy()
blog_y_test = blog_test['target'].copy()

### Evaluating the ML models

In [29]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=30) 
knn = KNeighborsRegressor()
pipe_rfe = Pipeline(steps=[('rfe',rfe),('knn',knn)])

pipe_rfe.fit(blog_X_train,blog_y_train)
blog_y_hat = pipe_rfe.predict(blog_X_test)

final_mae = mean_absolute_error(blog_y_test,blog_y_hat)
final_mse = mean_squared_error(blog_y_test,blog_y_hat)
final_rmse = np.sqrt(final_mse)
final_r2 = r2_score(blog_y_test,blog_y_hat)
print('MAE:  %.2f'%final_mae,'\nRMSE: %.2f'%final_rmse,'\nR2:   %.2f'%final_r2)

MAE:  5.79 
RMSE: 25.12 
R2:   0.32


In [30]:
pipe_pca = Pipeline(steps=[('pca',PCA(n_components=30)),('knn', KNeighborsRegressor())])
pipe_pca.fit(blog_X_train,blog_y_train)
blog_y_hat = pipe_pca.predict(blog_X_test)

final_mae = mean_absolute_error(blog_y_test,blog_y_hat)
final_mse = mean_squared_error(blog_y_test,blog_y_hat)
final_rmse = np.sqrt(final_mse)
final_r2 = r2_score(blog_y_test,blog_y_hat)
print('MAE:  %.2f'%final_mae,'\nRMSE: %.2f'%final_rmse,'\nR2:   %.2f'%final_r2)

MAE:  5.72 
RMSE: 25.10 
R2:   0.32
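
Both evaluation cells above repeat the same metric calculations. A small helper could compute the three metrics in one place; the sketch below is one possible version written with NumPy only, and the name `regression_report` and the toy values are assumptions for illustration. R² is computed as one minus the ratio between the residual sum of squares and the total sum of squares around the mean of the true values.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

# Tiny made-up example, not notebook data.
report = regression_report([1, 2, 3, 4], [1, 2, 3, 8])
print(report)
```

In the notebook, a call such as `regression_report(blog_y_test, blog_y_hat)` would be expected to give the same values as the scikit-learn functions used above.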


----------------------

## 4. Conclusion

In this notebook, we created a model to predict the number of comments a blog post receives in the next 24 hours based on several attributes of the post. First, we tested some regression models:

1. Linear regression
2. Ridge regression
3. Lasso regression
4. Elastic Net
5. Classification and Regression Trees (CART)
6. k-Nearest Neighbors (KNN)

In these first tests, the KNN was the best-performing method.

However, we verified that the large number of features in our data demanded long computing times to run the models. Therefore, we tested some techniques to reduce the number of features:

1. Mutual Information Statistics
2. Recursive Feature Elimination (RFE)
3. Random Forest

The features selected by the RFE were the ones that resulted in the best performance of the KNN model.

Finally, we also used a dimensionality reduction method, the Principal Component Analysis (PCA), to reduce the size of our data set. Our results with the train set showed that both the RFE and the PCA, combined with the KNN model, had similar results.

**Therefore, we tested two models with our test set: (i) KNN + RFE and (ii) KNN + PCA. We verified that the models performed almost identically**.

**However, our models performed modestly at best. All evaluation metrics used are poor**, especially the RMSE and the R². The fact that the RMSE is much higher than the MAE suggests that our data has many outliers and, consequently, our model is making some big errors. **Nonetheless, given that these are quite simple regression methods, we could consider that the results are reasonable**. More complex models could be used to achieve better predictions. However, these models would probably demand more time to build and more computing power to run, which could actually mean a worse cost-benefit ratio.