### Load the data

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X_train = pd.read_feather('./X_train_1.feather')
y_train = pd.read_feather('./y_train.feather')['prices.amountMax']

X_val = pd.read_feather('./X_val_1.feather')
y_val = pd.read_feather('./y_val.feather')['prices.amountMax']

rfr = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr.fit(X_train, y_train)
rfr.score(X_val, y_val)

0.44860788144562525

In [2]:
X_train.columns

Index(['prices.isSale', 'brand_Dansko', 'brand_Ugg', 'brand_other',
       'manufacturer_Nike', 'manufacturerNumber_other', 'categories_Athletic',
       'categories_Boots', 'categories_Slippers', 'colors_Green'],
      dtype='object')

### A. Evaluate and Draw Conclusions from Random Forest Model
#### Score the model

In [3]:
X_test = pd.read_feather('./X_test.feather')
y_test = pd.read_feather('./y_test.feather')['prices.amountMax']

In [8]:
y_test.shape

(1072,)

In [10]:
X_test = X_test[X_train.columns]

In [11]:
X_test.shape

(1072, 10)

In [12]:
rfr.score(X_test, y_test)

0.42961023864474635

* We get an R² of about .43 (0.4296) on the test set, which is roughly .02 lower than the validation score (0.4486).
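
The number returned by `score` on a scikit-learn regressor is the coefficient of determination (R²), not a classification-style accuracy. The sketch below checks this on synthetic stand-in data; the data and every name ending in `_demo` are assumptions for illustration, not the shoe-price dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data (an assumption, not the notebook's dataset):
# ten boolean-like columns, two of which actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 10)).astype(float)
y_demo = 30 * X_demo[:, 0] + 10 * X_demo[:, 1] + rng.normal(0, 5, size=500)

# Simple holdout split: first 400 rows to fit, last 100 rows to score.
X_fit, X_held = X_demo[:400], X_demo[400:]
y_fit, y_held = y_demo[:400], y_demo[400:]

demo_model = RandomForestRegressor(n_estimators=40, max_features='log2', random_state=0)
demo_model.fit(X_fit, y_fit)
predictions = demo_model.predict(X_held)

# .score() on a regressor returns R^2, so it should match r2_score.
score_value = demo_model.score(X_held, y_held)
r2_value = r2_score(y_held, predictions)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares).
residual_ss = np.sum((y_held - predictions) ** 2)
total_ss = np.sum((y_held - y_held.mean()) ** 2)
manual_r2 = 1 - residual_ss / total_ss

print(score_value, r2_value, manual_r2)
```

Read this way, a test score of about .43 means the model explains roughly 43% of the variance in the test prices.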

#### Conclusion
* For this model, I removed a lot of missing values and outliers, as well as all time-related features.
* All the features in the final model are boolean, which may limit the prediction score, so I was not able to get a more accurate model.
* Among the final features the model kept, isSale was the most significant. The rest are brand names (Dansko, Ugg and other), manufacturer fields (the Nike manufacturer and the "other" manufacturer number), shoe categories (athletic, boots and slippers), and a single color (green).
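
One possible way to check which feature a fitted forest ranks highest is to sort its impurity-based `feature_importances_`. This is a minimal sketch on synthetic data: the target is constructed so that the first column dominates, so the printed values say nothing about the real dataset. The column names are borrowed from the notebook for readability only, and `importance_ranking` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic boolean features (an assumption, not the notebook's data).
rng = np.random.default_rng(1)
demo_columns = ['prices.isSale', 'brand_Ugg', 'categories_Boots', 'colors_Green']
X_demo = pd.DataFrame(rng.integers(0, 2, size=(600, 4)), columns=demo_columns)

# Synthetic target: 'prices.isSale' dominates by construction.
y_demo = 40 * X_demo['prices.isSale'] + 5 * X_demo['brand_Ugg'] + rng.normal(0, 3, size=600)

demo_model = RandomForestRegressor(n_estimators=40, max_features='log2', random_state=0)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances, labeled by column and sorted high to low.
importance_ranking = (
    pd.Series(demo_model.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importance_ranking)
```

With the real model, the same pattern would be `pd.Series(rfr.feature_importances_, index=X_train.columns)`. Impurity-based importances are only one option; permutation importances can rank features differently.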

#### Next step
* My dataset is largely restricted to boolean data: every feature is boolean except the date-time features, which I had to remove in the end. Without numerical features, it seems my model cannot achieve a higher score, such as .7 or .8. If possible, I would like to bring in some numerical features next.
* For shoes, however, size is not a promising candidate: sizes usually range from 4 to 12, and in most cases the price is the same across sizes. The sale amount, on the other hand, would be a good feature.

### B. Compare against linear regression

#### Model Accuracy
* In my mid-term project, which used a linear regression model on the same dataset, the final model had 23 features and a score of .20. On the test set it scored only .10, which is really low: half of the model score.
* In this project, the final model uses 10 features and gets a validation score of about .45 (0.449). The test score is about .43 (0.430), which is a good result by comparison.
* The reason for the difference is that, for linear regression, I trained the model only three times and, guided by correlation analysis, removed too many features at each step. With the random forest regression model, each round removes only the features whose weight is under .005 or .01, which is a much more logical process.

#### Feature Importances
* The linear regression model's important features (23 features):
       'brand_Asics', 'brand_Charles by Charles David', 'brand_Converse',
       'brand_Crocs', 'brand_MICHAEL Michael Kors', 'brand_Propet',
       'brand_Telic', 'brand_Ugg', 'brand_Under Armour',
       'manufacturer_Charles by Charles David', 'manufacturer_Converse',
       'manufacturer_Crocs', 'manufacturer_MICHAEL Michael Kors',
       'manufacturer_Novascarpa Group LLC', 'manufacturer_Propet',
       'manufacturer_Under Armour', 'manufacturer_asics',
       'manufacturerNumber_5825', 'manufacturerNumber_Anna',
       'prices.merchant_Overstock.com', 'prices.merchant_Shoes.com',
       'prices.merchant_Walmart', 'prices.merchant_other'
* The random forest model's important features (10 features):
       'prices.isSale', 'brand_Dansko', 'brand_Ugg', 'brand_other',
       'manufacturer_Nike', 'manufacturerNumber_other', 'categories_Athletic',
       'categories_Boots', 'categories_Slippers', 'colors_Green'

* Most of them are different; only brand_Ugg appears in both lists. Since the linear regression model had poor accuracy, I don't trust its feature importances.
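
The overlap between the two feature lists can be checked with a set intersection. The feature names are copied from the lists above; the variable names are only illustrative.

```python
# Final features of the linear regression model (23, from the list above).
linear_features = {
    'brand_Asics', 'brand_Charles by Charles David', 'brand_Converse',
    'brand_Crocs', 'brand_MICHAEL Michael Kors', 'brand_Propet',
    'brand_Telic', 'brand_Ugg', 'brand_Under Armour',
    'manufacturer_Charles by Charles David', 'manufacturer_Converse',
    'manufacturer_Crocs', 'manufacturer_MICHAEL Michael Kors',
    'manufacturer_Novascarpa Group LLC', 'manufacturer_Propet',
    'manufacturer_Under Armour', 'manufacturer_asics',
    'manufacturerNumber_5825', 'manufacturerNumber_Anna',
    'prices.merchant_Overstock.com', 'prices.merchant_Shoes.com',
    'prices.merchant_Walmart', 'prices.merchant_other',
}

# Final features of the random forest model (10, from the list above).
forest_features = {
    'prices.isSale', 'brand_Dansko', 'brand_Ugg', 'brand_other',
    'manufacturer_Nike', 'manufacturerNumber_other', 'categories_Athletic',
    'categories_Boots', 'categories_Slippers', 'colors_Green',
}

# Features that both models kept.
shared_features = linear_features & forest_features
print(sorted(shared_features))  # ['brand_Ugg']
```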

#### Model selection
* In my case, which includes many categorical features, it is much more difficult for a linear regression model to fit the data and make reasonable predictions. Also, I only used correlation analysis to remove features, which makes it more subjective to decide where to draw the line.
* A random forest regression model is much more suitable for my case, because every feature is boolean, so each split simply goes left or right. Using permutation importances lets you remove a few features at a time, step by step, which is more convenient and effective. More importantly, removing the features with a weight under .01 is a more logical rule.
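
The step-by-step removal described above could be sketched roughly as follows. This is not the exact procedure used in the project: it assumes scikit-learn's `permutation_importance` as the source of the weights, runs on synthetic data, and `prune_by_permutation` and all column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance


def prune_by_permutation(X_fit, y_fit, X_held, y_held, threshold=0.01, max_rounds=10):
    """Refit and drop features whose mean permutation importance is below threshold."""
    kept = list(X_fit.columns)
    for _ in range(max_rounds):
        model = RandomForestRegressor(n_estimators=40, max_features='log2', random_state=0)
        model.fit(X_fit[kept], y_fit)
        # Importance = mean drop in R^2 when a column is shuffled on held-out data.
        result = permutation_importance(model, X_held[kept], y_held,
                                        n_repeats=5, random_state=0)
        weights = pd.Series(result.importances_mean, index=kept)
        weak = weights[weights < threshold].index.tolist()
        # Stop when nothing falls below the threshold, or when everything would go.
        if not weak or len(weak) == len(kept):
            break
        kept = [column for column in kept if column not in weak]
    return kept


# Synthetic demo (an assumption): two informative boolean columns, four noise columns.
rng = np.random.default_rng(0)
demo_columns = ['signal_a', 'signal_b', 'noise_1', 'noise_2', 'noise_3', 'noise_4']
X_demo = pd.DataFrame(rng.integers(0, 2, size=(800, 6)), columns=demo_columns)
y_demo = 30 * X_demo['signal_a'] + 15 * X_demo['signal_b'] + rng.normal(0, 3, size=800)

kept_features = prune_by_permutation(X_demo.iloc[:600], y_demo.iloc[:600],
                                     X_demo.iloc[600:], y_demo.iloc[600:])
print(kept_features)
```

Computing the weights on held-out data rather than the training data keeps a feature from looking important just because the forest memorized it, and the fixed threshold gives the same cut-off rule in every round.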