## Final Model and Conclusion

### Final Model

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X_train = pd.read_feather('./X_train.feather')
y_train = pd.read_feather('./y_train.feather')['shares']

X_val = pd.read_feather('./X_val.feather')
y_val = pd.read_feather('./y_val.feather')['shares']

rfr = RandomForestRegressor(n_estimators=42, max_features='log2', min_samples_leaf=46)
rfr.fit(X_train, y_train)
rfr.score(X_val, y_val)

0.0424837807992412

In [2]:
X_train.columns

Index(['n_unique_tokens', 'num_hrefs', 'num_imgs', 'num_videos', 'kw_avg_max',
       'kw_min_avg', 'kw_avg_avg', 'self_reference_min_shares',
       'self_reference_max_shares', 'LDA_03', 'avg_negative_polarity',
       'data_channel_missing'],
      dtype='object')

After going through all of the steps to select the best model for predicting the number of shares (hyperparameter tuning, feature importance analysis, correlation analysis, and focused feature engineering), I selected the final model, which uses the 12 features shown above.
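As a rough illustration of the feature importance step mentioned above, the sketch below ranks features by a fitted forest's `feature_importances_`. It runs on synthetic data from `make_regression`, not on the article-shares dataset, and the names `X_syn`, `y_syn`, `rfr_demo`, and the `feature_i` column labels are placeholders. The `random_state` values are only there to make the sketch repeatable.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real training data (assumption: 12 numeric features).
X_arr, y_syn = make_regression(
    n_samples=1000, n_features=12, n_informative=3, noise=10.0, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(12)])

# Same hyperparameters as the final model; random_state added for repeatability.
rfr_demo = RandomForestRegressor(
    n_estimators=42, max_features="log2", min_samples_leaf=46, random_state=0
)
rfr_demo.fit(X_syn, y_syn)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    rfr_demo.feature_importances_, index=X_syn.columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction across the forest.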

### Score the Model

In [3]:
X_test = pd.read_feather('./X_test.feather')
y_test = pd.read_feather('./y_test.feather')['shares']

In [4]:
y_test.shape

(7909,)

In [5]:
X_test = X_test[X_train.columns]

In [6]:
X_test.shape

(7909, 12)

In [7]:
rfr.score(X_test, y_test)

0.013597095072079513

Unfortunately, when I scored my model on the held-out test set, the score dropped noticeably: R² fell from about 0.042 on the validation set to about 0.014 on the test set.
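To put these numbers in context: for a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot, so a value near 0 means the model does only slightly better than always predicting the mean. The sketch below computes R² by hand on made-up data and checks it against `sklearn.metrics.r2_score`. The helper `r_squared` and the arrays `y_demo`, `baseline_pred`, and `noisy_pred` are illustrative names, not part of the analysis above.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up, right-skewed target values (assumption: not the real share counts).
rng = np.random.default_rng(0)
y_demo = rng.lognormal(mean=7.0, sigma=1.0, size=1000)

# Always predicting the mean gives R^2 = 0 by construction.
baseline_pred = np.full_like(y_demo, y_demo.mean())
baseline_r2 = r_squared(y_demo, baseline_pred)

# A noisy prediction, used only to compare the manual formula with scikit-learn.
noisy_pred = y_demo + rng.normal(scale=y_demo.std(), size=y_demo.size)
manual_r2 = r_squared(y_demo, noisy_pred)
sklearn_r2 = r2_score(y_demo, noisy_pred)

print(baseline_r2, manual_r2, sklearn_r2)
```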

### Linear Regression vs. Random Forest

Although my linear regression model scored higher than my random forest model, I think the random forest works better in this instance. The linear regression model needed 50 features and still reached a low score of only 0.13, while the random forest, although its score is lower, uses only 12 features.

My recommendation is to do more feature engineering to create variables that better explain the target, and to explore the relationships between the features and the target further. Those relationships may not be linear at all, which is probably why the linear regression model was not very effective.

With linear regression, it was also harder to arrive at the best model, because I had to check for outliers and try transforming the data to get a better score. This was not needed with the random forest. Lastly, with only a few features it is easier to see how each feature affects the target. The visualizations show that the effect of a variable is not always linear, and they show how the prediction changes as a variable increases from one value to the next. Linear regression, by contrast, always assumes a linear relationship, which may not hold for every variable or dataset.
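One way to see a non-linear effect like the ones described above is a partial dependence curve: fix one feature at a grid value for every row, average the model's predictions, and repeat across the grid. The sketch below does this by hand on synthetic data with a deliberate step effect. It is not the visualization code used in this analysis, and `manual_partial_dependence`, `X_step`, `y_step`, and `rf_step` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target jumps by 3 when feature 0 crosses zero (non-linear).
rng = np.random.default_rng(0)
X_step = rng.normal(size=(2000, 3))
y_step = 3.0 * (X_step[:, 0] > 0) + rng.normal(scale=0.5, size=2000)

rf_step = RandomForestRegressor(n_estimators=42, min_samples_leaf=46, random_state=0)
rf_step.fit(X_step, y_step)


def manual_partial_dependence(estimator, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(estimator.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(-2.0, 2.0, 9)
pd_curve = manual_partial_dependence(rf_step, X_step, 0, grid)

for value, avg in zip(grid, pd_curve):
    print(f"feature_0 = {value:+.1f} -> average prediction {avg:.2f}")
```

On data like this, the curve stays roughly flat on either side of zero and jumps in between, which a single linear coefficient cannot represent.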