We are going to use a sample of the [Mashable Online News Dataset](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity). This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. 

The goal is to predict the number of shares in social networks (**target variable is "shares"**).

In [2]:
import pandas as pd
news = pd.read_csv("./data/news.csv")

In [3]:
news.shape

(10000, 61)

In [4]:
news.head()

| Unnamed: 0 | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/09/08/safest-cabbies-... | 121.0 | 12.0 | 1015.0 | 0.422018 | 1.0 | 0.545031 | 10.0 | 6.0 | 33.0 | ... | 0.1 | 0.8 | -0.160714 | -0.5 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 2900 |
| 1 | http://mashable.com/2013/07/25/3d-printed-rifle/ | 532.0 | 9.0 | 503.0 | 0.569697 | 1.0 | 0.737542 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 1.0 | -0.1575 | -0.25 | -0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 1300 |
| 2 | http://mashable.com/2013/10/30/digital-dinosau... | 435.0 | 9.0 | 232.0 | 0.646018 | 1.0 | 0.748428 | 12.0 | 3.0 | 4.0 | ... | 0.375 | 0.5 | -0.4275 | -1.0 | -0.1875 | 0.0 | 0.0 | 0.5 | 0.0 | 17700 |
| 3 | http://mashable.com/2014/08/27/homer-simpson-i... | 134.0 | 12.0 | 171.0 | 0.722892 | 1.0 | 0.867925 | 9.0 | 5.0 | 0.0 | ... | 0.5 | 0.5 | -0.216667 | -0.25 | -0.166667 | 0.4 | -0.25 | 0.1 | 0.25 | 1500 |
| 4 | http://mashable.com/2013/01/10/creepy-robotic-... | 728.0 | 11.0 | 286.0 | 0.652632 | 1.0 | 0.8 | 5.0 | 2.0 | 0.0 | ... | 0.1 | 0.6 | -0.251786 | -0.5 | -0.1 | 0.2 | -0.1 | 0.3 | 0.1 | 1400 |


In [5]:
news.columns

Index(['url', 'timedelta', 'n_tokens_title', 'n_tokens_content',
       'n_unique_tokens', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_avg_min',
       'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg',
       'kw_avg_avg', 'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_sharess', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend', 'LDA_00',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
     

This dataset has a lot of features, so we will look for ways to reduce model complexity and check that we are not overfitting.

## Train a Support Vector Machine and a Random Forest Regressor with the target variable "shares" and evaluate their performance on the train and test sets by using the function `cross_validate`. Do any of them overfit?

**hint**: you can use the test score / train score ratio as a benchmark to check how much a model is overfitting.
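The ratio in the hint can be computed directly from the output of `cross_validate` when `return_train_score=True` is set. The sketch below is only an illustration under stated assumptions: it uses a synthetic dataset from `make_regression` as a stand-in for `news_X` / `news_y`, R² as the score, and arbitrary hyperparameters (`C=100.0`, `n_estimators=50`). The names `X_demo`, `y_demo` and `overfit_ratios` are hypothetical. The ratio is only meaningful when the train score is positive.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in data; replace with news_X / news_y.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, noise=10.0, random_state=0
)

models = {
    # SVR is sensitive to feature scale, so it is wrapped in a scaling pipeline.
    "svr": make_pipeline(StandardScaler(), SVR(C=100.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

overfit_ratios = {}
for name, model in models.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=5, scoring="r2", return_train_score=True
    )
    train_r2 = scores["train_score"].mean()
    test_r2 = scores["test_score"].mean()
    # A ratio well below 1 means the model scores much better on data it has seen.
    overfit_ratios[name] = test_r2 / train_r2
    print(f"{name}: train R2={train_r2:.3f}, test R2={test_r2:.3f}, "
          f"ratio={overfit_ratios[name]:.3f}")
```

A ratio close to 1 suggests similar performance on seen and unseen data, while a ratio far below 1 points to overfitting.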

In [6]:
target = news["shares"]

In [7]:
independent_variables = news[['timedelta', 'n_tokens_title', 'n_tokens_content',
       'n_unique_tokens', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_avg_min',
       'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg',
       'kw_avg_avg', 'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_sharess', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend', 'LDA_00',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'global_rate_negative_words', 'rate_positive_words',
       'rate_negative_words', 'avg_positive_polarity', 'min_positive_polarity',
       'max_positive_polarity', 'avg_negative_polarity',
       'min_negative_polarity', 'max_negative_polarity', 'title_subjectivity',
       'title_sentiment_polarity', 'abs_title_subjectivity',
       'abs_title_sentiment_polarity']]

In [8]:
news1 = pd.DataFrame(news)

In [9]:
news1.head()

| Unnamed: 0 | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/09/08/safest-cabbies-... | 121.0 | 12.0 | 1015.0 | 0.422018 | 1.0 | 0.545031 | 10.0 | 6.0 | 33.0 | ... | 0.1 | 0.8 | -0.160714 | -0.5 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 2900 |
| 1 | http://mashable.com/2013/07/25/3d-printed-rifle/ | 532.0 | 9.0 | 503.0 | 0.569697 | 1.0 | 0.737542 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 1.0 | -0.1575 | -0.25 | -0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 1300 |
| 2 | http://mashable.com/2013/10/30/digital-dinosau... | 435.0 | 9.0 | 232.0 | 0.646018 | 1.0 | 0.748428 | 12.0 | 3.0 | 4.0 | ... | 0.375 | 0.5 | -0.4275 | -1.0 | -0.1875 | 0.0 | 0.0 | 0.5 | 0.0 | 17700 |
| 3 | http://mashable.com/2014/08/27/homer-simpson-i... | 134.0 | 12.0 | 171.0 | 0.722892 | 1.0 | 0.867925 | 9.0 | 5.0 | 0.0 | ... | 0.5 | 0.5 | -0.216667 | -0.25 | -0.166667 | 0.4 | -0.25 | 0.1 | 0.25 | 1500 |
| 4 | http://mashable.com/2013/01/10/creepy-robotic-... | 728.0 | 11.0 | 286.0 | 0.652632 | 1.0 | 0.8 | 5.0 | 2.0 | 0.0 | ... | 0.1 | 0.6 | -0.251786 | -0.5 | -0.1 | 0.2 | -0.1 | 0.3 | 0.1 | 1400 |


In [10]:
news_X = independent_variables
news_y = target

In [11]:
from sklearn.model_selection import train_test_split

news_X_train, news_X_test, news_y_train, news_y_test = train_test_split(
    news_X, news_y, test_size=0.2)

In [12]:
from sklearn.svm import SVC, SVR

In [13]:
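# NOTE: SVC is a classifier and treats each distinct "shares" value as a class; SVR is the regressor.
estimator_svm = SVC()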

In [15]:
estimator_svm.fit(news_X_train, news_y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
estimator_svm.predict(news_X_test)[:10]

array([1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100])

In [18]:
from mlxtend.plotting import plot_decision_regions

In [20]:
X = independent_variables
y = target

In [None]:
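# NOTE: plot_decision_regions expects NumPy arrays with integer class labels and, when X has
# more than two features, filler values for the extra ones; this cell was not executed.
estimator_svm_lineal = SVC(kernel="linear")
estimator_svm_lineal.fit(X, y)

plot_decision_regions(X, y, clf=estimator_svm_lineal);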

## Use Feature Selection to reduce the fit time to train a Support Vector Machine while keeping its performance.


We worked on the project; sorry this is so empty.
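One possible way to approach this exercise is to put a univariate selector in front of the SVR and compare the `fit_time` and test score that `cross_validate` reports with and without it. This is a minimal sketch, not the notebook's own solution: the data is synthetic, and `k=10`, the linear kernel and `C=10.0` are arbitrary assumptions, as are the names `X_demo`, `y_demo` and `results`.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in with many uninformative features.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=60, n_informative=8, noise=10.0, random_state=0
)

# Baseline: SVR on all (scaled) features.
full_svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))

# Same model, but only the 10 features with the highest F-statistic are kept.
# The selector sits inside the pipeline, so it is refit on each training fold.
selected_svr = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=10),
    SVR(kernel="linear", C=10.0),
)

results = {}
for name, model in {"all_features": full_svr, "k_best_10": selected_svr}.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring="r2")
    results[name] = {
        "fit_time": cv["fit_time"].mean(),   # seconds per fold, selector included
        "test_r2": cv["test_score"].mean(),
    }
    print(name, results[name])
```

Keeping the selector inside the pipeline avoids leaking information from the validation folds into the feature choice.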

## Using Nested Cross Validation, find the best estimator that you can, choosing between an SVR and a RandomForestRegressor.
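Nested cross validation can be sketched by passing a `GridSearchCV` object (the inner loop, which tunes hyperparameters) to `cross_val_score` (the outer loop, which estimates generalization). The example below is a hedged illustration on synthetic data; the parameter grids, fold counts and the names `candidates`, `nested_scores` and `best_name` are assumptions chosen to keep it small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in data; replace with news_X / news_y.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # performance estimate

# Each candidate is an (estimator, parameter grid) pair; the grids are illustrative.
candidates = {
    "svr": (
        Pipeline([("scale", StandardScaler()), ("svr", SVR())]),
        {"svr__C": [1.0, 100.0], "svr__kernel": ["linear", "rbf"]},
    ),
    "random_forest": (
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

nested_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=inner_cv, scoring="r2")
    # Each outer fold reruns the whole grid search on its own training part.
    outer = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="r2")
    nested_scores[name] = outer.mean()

best_name = max(nested_scores, key=nested_scores.get)
print(nested_scores, "-> best:", best_name)
```

The outer score is computed on data that the inner search never saw, so it is a less optimistic estimate than the search's own `best_score_`.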

## For the best SVR you find, which points in the dataset are the hardest ones to classify?
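One way to read this question for a regressor is to rank the points by their out-of-fold absolute error: the points the SVR misses by the most are the hardest ones. The sketch below follows that interpretation, which is an assumption rather than the exercise's stated definition. It uses synthetic data and an arbitrary linear SVR, and the names `abs_errors` and `hardest_idx` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in data; replace with news_X / news_y.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Assumption: stands in for the best SVR found in the previous step.
svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=100.0))

# Each prediction comes from a model that did not see that point during training.
predictions = cross_val_predict(svr, X_demo, y_demo, cv=5)
abs_errors = np.abs(y_demo - predictions)

# Indices of the 10 points with the largest absolute error, worst first.
hardest_idx = np.argsort(abs_errors)[::-1][:10]
print(hardest_idx)
print(abs_errors[hardest_idx])
```

With the real data, `news.iloc[hardest_idx]` would show which articles these are.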