## Data Fields
- datetime - hourly date + timestamp  
- season 
    -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather 
    - 1: Clear, Few clouds, Partly cloudy 
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals

## Learning:
1. A date value might not always be stored in the correct format. Make sure that you have converted the column to the correct dtype.
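
As a minimal sketch of that check, assuming the data is loaded with pandas and the column is named `datetime` as in the field list above (the two sample rows are made up for illustration):

```python
import pandas as pd

# Hypothetical sample mimicking the `datetime` column; values are illustrative
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})
print(df["datetime"].dtype)  # strings, not yet a datetime dtype

# Convert the column so that date-based operations become available
df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].dtype)
```

Once the column has a datetime dtype, accessors such as `df["datetime"].dt.hour` work directly.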


2. `extract_dateinfo` is a great function created by fast.ai. It creates multiple features from the date, each segmenting it at a different level.
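
The general idea of deriving several features from one date can be sketched with plain pandas. This is not the fast.ai function itself: `add_date_features` is a hypothetical helper, and the set of features it produces is an assumption chosen for illustration.

```python
import pandas as pd

def add_date_features(df, col="datetime"):
    """Hypothetical helper: derive calendar features from a datetime column."""
    out = df.copy()
    dt = pd.to_datetime(out[col])
    out["year"] = dt.dt.year
    out["month"] = dt.dt.month
    out["day"] = dt.dt.day
    out["hour"] = dt.dt.hour
    out["dayofweek"] = dt.dt.dayofweek  # Monday = 0, Sunday = 6
    return out

# Illustrative rows only
sample = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-07-04 17:00:00"]})
features = add_date_features(sample)
print(features)
```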


3. Remember that a normal distribution is not always the best distribution to assume (this applies to the target variables: `count`, `registered`, and `casual`).


4. A log transformation is sometimes preferred. When the target is highly skewed, the model may have difficulty predicting accurately across the full range of target values. If we log-transform the skewed variable, its distribution often becomes closer to normal, which lets the model behave similarly (and thus produce similar residuals) across that range.
    - If we train a model on highly skewed data, the residuals often have a different distribution across different ranges of the target variable.
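
A minimal sketch of the log transform, using synthetic right-skewed data as a stand-in for a target such as `count`. The use of `np.log1p` and `np.expm1` is an assumption made here because rental counts can be zero; the original notes do not say which variant was used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed counts (illustrative, not the real dataset)
counts = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000)).round()

log_counts = np.log1p(counts)    # log(1 + y), defined even when y == 0
restored = np.expm1(log_counts)  # inverse transform, used on predictions

# Skewness before and after the transform
print(counts.skew(), log_counts.skew())
```

A model trained on `log_counts` predicts on the log scale, so its predictions need `np.expm1` applied before they are compared with the raw counts.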


5. Use the `correlation` function located in "/Users/alexguanga/All_Projects/ds-portfolio/DataScience_Code/correlation_abs.py" to plot the correlations.


6. Remember that if you create your own scoring function, you need to wrap it with `make_scorer` from the sklearn library.

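```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def scorer(actual, predicted):
    # Root mean squared logarithmic error (RMSLE)
    sle = np.power(np.log(np.array(actual) + 1) -
                   np.log(np.array(np.abs(predicted)) + 1), 2)
    msle = np.mean(sle)
    return np.sqrt(msle)

# Creating the scorer; greater_is_better=False makes sklearn negate the score
rmse_scorer = make_scorer(scorer, greater_is_better=False)

# 10-fold cross-validation
cv_score = cross_val_score(rf_model, train_on_dum, train_cnt_labels_dum, cv=10, scoring=rmse_scorer)
```
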
7. When you're transforming multiple columns with label encoders (this is useful for models that, unlike regression, do not assign meaning to the magnitude of the encoded value), create a dictionary that holds one encoder per column.

```python
from sklearn import preprocessing
from collections import defaultdict

d = defaultdict(preprocessing.LabelEncoder)
data[encoded_cols] = data[encoded_cols].apply(lambda x: d[x.name].fit_transform(x))
```
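
Keeping the encoders in a dictionary means the same mapping can be reused later. The sketch below assumes two small made-up frames (`train`, `test`) and shows applying the fitted encoders to new data and reversing the encoding:

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative data; column values are made up
train = pd.DataFrame({"season": ["spring", "fall", "spring"],
                      "weather": ["clear", "mist", "clear"]})
test = pd.DataFrame({"season": ["fall", "spring"],
                     "weather": ["mist", "clear"]})
cols = ["season", "weather"]

# One LabelEncoder per column, created on first access
encoders = defaultdict(LabelEncoder)
train_enc = train[cols].apply(lambda x: encoders[x.name].fit_transform(x))

# Reuse the fitted encoders on new data (labels must have been seen in training)
test_enc = test[cols].apply(lambda x: encoders[x.name].transform(x))

# Map the integer codes back to the original labels
decoded = train_enc.apply(lambda x: encoders[x.name].inverse_transform(x))
```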

8. Hyperparameter Tuning:
    - https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search
    - https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e
    - https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
    - https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/
    - **Current hyperparameters**: https://medium.com/@mateini_12893/doing-xgboost-hyper-parameter-tuning-the-smart-way-part-1-of-2-f6d255a45dde
    
    
9. Cyclical Variables:
    - By pure chance, I came upon an article explaining that certain variables are cyclical. For example, if we look at hours or days, our model won't be able to understand that hour 23 and hour 0 are close to one another.
    - http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
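
One common way to encode a cyclical variable is to map it onto a circle with sine and cosine. The sketch below applies that to the hour of day; the column names `hour_sin` and `hour_cos` and the helper `dist` are hypothetical.

```python
import numpy as np
import pandas as pd

hours = pd.DataFrame({"hour": np.arange(24)})

# Place each hour on the unit circle so that 23 and 0 end up as neighbors
hours["hour_sin"] = np.sin(2 * np.pi * hours["hour"] / 24)
hours["hour_cos"] = np.cos(2 * np.pi * hours["hour"] / 24)

def dist(a, b):
    """Euclidean distance between two hours in (sin, cos) space."""
    pa = hours.loc[a, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    pb = hours.loc[b, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(pa - pb))

print(dist(23, 0), dist(0, 1), dist(0, 12))
```

In this encoding, hour 23 is exactly as far from hour 0 as hour 1 is, while hour 12 is the farthest point from hour 0.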
    
    
10. `interpolate`
    - `interpolate` gives you the flexibility to fill missing values using many kinds of interpolation between the surrounding values, such as linear (which `fillna` does not provide).
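
A small sketch of the difference on a made-up pandas Series with a gap in the middle:

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_constant = s.fillna(0)                   # replaces gaps with a constant
filled_linear = s.interpolate(method="linear")  # fills gaps along a straight line

print(filled_constant.tolist())
print(filled_linear.tolist())
```

Other values of `method` exist as well, though some of them depend on SciPy being installed.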
    
11. To plot multiple graphs in the same figure, create the axes with `plt.subplots` and pass one to each plot call:
    ```python
    import matplotlib.pyplot as plt

    # Dropping some variables from the graph
    fig, axs = plt.subplots(3, 1, figsize=(20, 12), sharex=True)
    val_1.drop(PREDICTORS).plot.bar(color='b', ax=axs[0], title="Val 1")
    val_2.drop(PREDICTORS).plot.bar(color='b', ax=axs[1], title="Val 2")
    val_3.drop(PREDICTORS).plot.bar(color='b', ax=axs[2], title="Val 3")
    ```
    
12. To group the columns by their data type, use:
    ```python
    data.columns.to_series().groupby(data.dtypes).groups
    ```

13. `train_test_split` returned good results. While cross-validation is a great methodology, `train_test_split` showed improved results in our example.
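
For reference, the two evaluation styles can be compared side by side. This is only a sketch on synthetic data with a small random forest; the data, model settings, and split sizes are assumptions and do not reproduce the project's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Synthetic, strictly positive target (illustrative only)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.uniform(0, 5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Single hold-out split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
holdout_rmsle = np.sqrt(mean_squared_log_error(y_val, model.predict(X_val)))

# 5-fold cross-validation with the built-in (negated) MSLE scorer
cv_scores = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_log_error")
cv_rmsle = np.sqrt(-cv_scores).mean()

print(holdout_rmsle, cv_rmsle)
```

A single split is cheaper but depends on which rows land in the validation set, so the two numbers can differ noticeably.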


14. To suppress warnings that should not be warnings (after some research on Stack Overflow, you'll find that some reported errors should not be errors either), use the following:
    ```python
    import warnings
    warnings.filterwarnings('ignore')
    ```

15. `run_model` is a function I created that returns two models: one trained on all the variables and another trained on only the top variables.
    - Parameters are also passed into the function.
    - Path `"Users/alexguanga/All_Projects/ds-portfolio/DataScience_Code/run_model.py"`
    - A few other useful functions are in `"Users/alexguanga/All_Projects/ds-portfolio/DataScience_Code/modeling_functions.py"`
    

## Performance
- Performance was strongest for the XGB Regressor model. However, the Gradient Boosting Regressor was also pretty strong!
- The Random Forest Regressor was surprisingly bad: when trained with cross-validation, it was suboptimal relative to the XGB Regressor and the Gradient Boosting Regressor.
- Another interesting feature of this dataset is that it has 3 target variables, where the main target variable (`count`) is the sum of the other two (`casual` and `registered`).
    - When models were trained independently on the two component targets, performance varied, but overall the model performed best when trained on the main target variable.
- The best methodology used in the project was combining models. I combined the XGBoost Regressor and the Gradient Boosting Regressor.
    - There were 4 models that I averaged to calculate my final predictions.
        1. Gradient Boosting model on the main target variable using all the variables.
        2. Gradient Boosting model on the main target variable using the top variables.
        3. XGBoost Regressor model on the main target variable using all the variables.
        4. XGBoost Regressor model on the main target variable using the top variables.
-  **Final RMSLE was 0.41143 (11th percentile)**
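
The four-model average described above can be sketched as a simple mean over prediction arrays. The numbers and the dictionary keys below are made up for illustration; in the project, each array would come from calling `predict` on one of the four fitted models.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error."""
    return np.sqrt(np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2))

# Made-up targets and predictions (illustrative only)
actual = np.array([10.0, 50.0, 200.0])
preds = {
    "gbr_all": np.array([12.0, 45.0, 210.0]),
    "gbr_top": np.array([9.0, 55.0, 190.0]),
    "xgb_all": np.array([11.0, 48.0, 205.0]),
    "xgb_top": np.array([10.0, 52.0, 199.0]),
}

# Unweighted average of the four models, row by row
ensemble = np.mean(np.column_stack(list(preds.values())), axis=1)

for name, p in preds.items():
    print(name, rmsle(actual, p))
print("ensemble", rmsle(actual, ensemble))
```

An unweighted mean is the simplest choice; weighting the models by their validation scores is a possible variation that the notes above do not mention.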