## **IMPORT PACKAGES**

To install the xgboost package: *!conda install -c conda-forge xgboost*

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

## **READING PREPARED DATASET**

In [2]:
%%time
source = '/Users/antoniobravomunoz/Documents/DATA_SCIENCE_MASTER/TFM/Data-Science-Master-project/DATA/traffic_data_complete.csv'
df = pd.read_csv(source,sep=',')

CPU times: user 14.6 s, sys: 1.76 s, total: 16.3 s
Wall time: 16 s


In [3]:
df.head()

Unnamed: 0,id,intensidad,ocupacion,carga,vmed,periodo_integracion,Hora,Lat,Long,M30,URB,Mes,Dia,Minutos
0,1001,204,12,0,73,5,0,40.409729,-3.740786,1,0,1,1,0
1,1002,252,1,0,79,5,0,40.408029,-3.74376,1,0,1,1,0
2,1003,420,2,0,82,5,0,40.406824,-3.746834,1,0,1,1,0
3,1006,288,1,0,75,5,0,40.411894,-3.736324,1,0,1,1,0
4,1009,276,0,0,76,5,0,40.416233,-3.724909,1,0,1,1,0


In [72]:
df.sample(5)

Unnamed: 0,id,intensidad,ocupacion,carga,vmed,periodo_integracion,Hora,Lat,Long,M30,URB,Mes,Dia,Minutos
3171326,10036,0,0,1,0,15,6,40.423722,-3.693921,0,1,6,19,0
2884492,6056,0,0,0,0,15,4,40.444752,-3.645364,0,1,6,12,15
5514103,6741,576,5,36,60,15,8,40.416908,-3.6579,1,0,2,16,0
7208647,7110,0,0,0,0,15,0,40.507305,-3.685617,0,1,8,31,15
4640363,5993,0,0,0,0,15,4,40.417211,-3.623704,0,1,3,26,0


In [4]:
df.dtypes

id                       int64
intensidad               int64
ocupacion                int64
carga                    int64
vmed                     int64
periodo_integracion      int64
Hora                     int64
Lat                    float64
Long                   float64
M30                      int64
URB                      int64
Mes                      int64
Dia                      int64
Minutos                  int64
dtype: object

## **TRAIN: SPLIT DATASET INTO X (OBSERVATIONS) AND Y (TARGET)**

In [5]:
X=df.drop(['carga'],axis=1)

In [6]:
X.shape

(9740545, 13)

In [7]:
y=df['carga']

In [8]:
y.shape

(9740545,)

## **ML REGRESSION ALGORITHMS**

We have chosen decision-tree-based methods because they offer some useful advantages:
- They are easy to understand.
- They require little data preparation.
- They are robust.
- They work well with large datasets.


We use GridSearchCV to search for the hyperparameters that give the best cross-validated score, and then use them as the final model parameters.
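As a minimal sketch of this search pattern, the example below runs GridSearchCV on a small synthetic dataset. The data, grid values, `cv=3` and random seeds are illustrative assumptions, not the traffic data or tuned settings. Because scikit-learn always maximizes the score, `neg_mean_squared_error` is negative, so the RMSE is recovered as the square root of its negation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the real features/target)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.1, size=500)

# Illustrative grid; the values are placeholders, not recommendations
param_grid = {"min_samples_leaf": [5, 10, 20], "max_depth": [2, 4, 6]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

best_mse = -search.best_score_   # flip the sign to obtain the mean squared error
best_rmse = np.sqrt(best_mse)    # same units as the target
print(search.best_params_, best_rmse)
```

Reporting the RMSE rather than the raw score keeps the figure in the units of the target, which makes runs with different grids easier to compare.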

First of all, we are testing with a **Decision Tree Regressor**

In [13]:
%%time
# Decision Tree
reg_DeciTree=GridSearchCV(DecisionTreeRegressor(min_samples_leaf=1,max_depth=4),
                          param_grid={"min_samples_leaf":[10,20,30,40,70,100],"max_depth":range(2,5)},
                          scoring="neg_mean_squared_error"
                          )
reg_DeciTree.fit(X,y)
print(reg_DeciTree.best_score_)
print(np.sqrt(-reg_DeciTree.best_score_))
print(reg_DeciTree.best_params_)

-70.60611753784305
8.402744643141492
{'max_depth': 4, 'min_samples_leaf': 10}
CPU times: user 11min 32s, sys: 1min 19s, total: 12min 51s
Wall time: 12min 9s


In [14]:
%%time
# Decision Tree v2
reg_DeciTree2=GridSearchCV(DecisionTreeRegressor(min_samples_leaf=1,max_depth=4),
                          param_grid={"min_samples_leaf":[10,20,30,40,70,100],"max_depth":range(10,15)},
                          scoring="neg_mean_squared_error"
                          )
reg_DeciTree2.fit(X,y)
print(reg_DeciTree2.best_score_)
print(np.sqrt(-reg_DeciTree2.best_score_))
print(reg_DeciTree2.best_params_)

-14.437166601055841
3.799627166059302
{'max_depth': 14, 'min_samples_leaf': 30}
CPU times: user 1h 4min 24s, sys: 6min 10s, total: 1h 10min 34s
Wall time: 1h 13min 4s


In [15]:
%%time
# Decision Tree v3
reg_DeciTree3=GridSearchCV(DecisionTreeRegressor(min_samples_leaf=9,max_depth=19),
                          param_grid={"min_samples_leaf":[10,20,30,40],"max_depth":range(19,20)},
                          scoring="neg_mean_squared_error"
                          )
reg_DeciTree3.fit(X,y)
print(np.sqrt(-reg_DeciTree3.best_score_))
print(reg_DeciTree3.best_params_)

2.5348422766054957
{'max_depth': 19, 'min_samples_leaf': 10}
CPU times: user 11min 55s, sys: 21.8 s, total: 12min 17s
Wall time: 12min 10s


Testing with a **Random Forest Regressor**

In [20]:
%%time
#Random Forest
reg_RF=GridSearchCV(RandomForestRegressor(n_estimators=50,min_samples_leaf=9,max_depth=14),
                          param_grid={"min_samples_leaf":[20,30,40],"max_depth":range(14,15)},
                          scoring="neg_mean_squared_error"
                          )
reg_RF.fit(X,y)
print(-reg_RF.best_score_)
print(np.sqrt(-reg_RF.best_score_))
print(reg_RF.best_params_)

12.491872273809218
3.534384284965235
{'max_depth': 14, 'min_samples_leaf': 20}
CPU times: user 4h 39min 19s, sys: 1min 22s, total: 4h 40min 42s
Wall time: 4h 39min 4s


To train heavier models such as random forest or XGBoost with other parameters, we decided to use **Google Colab**, which offers more computing power. The procedure can be found in the notebook *Data Modeling Colab-full dataset.ipynb*.

Once a model has been trained, it is saved with the joblib.dump() method. The resulting files are then loaded here to predict on the test dataset.
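The save-and-reload round trip can be sketched as follows. This is only an illustration under stated assumptions: it uses the standalone `joblib` package, a tiny synthetic model, and a hypothetical file name in a temporary directory, rather than the Colab-trained model and its real path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic problem (assumption: stands in for the model trained on Colab)
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 4))
y_demo = X_demo.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Persist the fitted model; the file name is a hypothetical placeholder
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(model, model_path)

# Later, ideally with the same scikit-learn version, load it and predict
restored = joblib.load(model_path)
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled estimators are not guaranteed to load across scikit-learn versions, so it helps to record the version used for training next to the saved file.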

## **SPLIT SEPTEMBER-2018 DATASET INTO X_test AND Y_test**

Before splitting the dataset, we apply the same processing that *Data Acquisition and Preparation.ipynb* applies to the training data. This can be checked in the *Data Acquisition and Preparation_test.ipynb* notebook.
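Since the model is fed the test features in the layout it was trained on, a quick consistency check before predicting can be useful. The sketch below uses two hypothetical miniature frames (the values and the reduced column set are assumptions for illustration) and reorders the test columns to match the training ones.

```python
import pandas as pd

# Hypothetical miniature frames mimicking the train/test layout
train = pd.DataFrame(
    {"id": [1001, 1002], "intensidad": [204, 252], "carga": [0, 0], "Hora": [0, 0]}
)
test = pd.DataFrame(
    {"Hora": [0, 1], "carga": [38.0, 37.0], "id": [6679, 6679], "intensidad": [2468.0, 2488.0]}
)

# Features are every training column except the target
feature_cols = [c for c in train.columns if c != "carga"]

# Fail early if the test file lacks a column the model was trained on
missing = set(feature_cols) - set(test.columns)
assert not missing, f"missing columns: {missing}"

# Same columns, in the same order as in training
X_test_demo = test[feature_cols]
y_test_demo = test["carga"]
print(list(X_test_demo.columns))
```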

Loading the file

In [9]:
%%time
sourceTest='/Users/antoniobravomunoz/Documents/DATA_SCIENCE_MASTER/TFM/Data-Science-Master-project/DATA/traffic_data_test_complete.csv'
df_test=pd.read_csv(sourceTest,sep=',')

CPU times: user 24.8 ms, sys: 45 ms, total: 69.8 ms
Wall time: 69.3 ms


In [10]:
df_test.head()

Unnamed: 0,id,intensidad,ocupacion,carga,vmed,periodo_integracion,Hora,Lat,Long,M30,URB,Mes,Dia,Minutos
0,6679,2468.0,3.0,38.0,87.0,15,0,40.393663,-3.674823,1,0,9,1,0
1,6679,2488.0,4.0,37.0,87.0,15,0,40.393663,-3.674823,1,0,9,1,15
2,6679,2503.0,4.0,38.0,88.0,15,0,40.393663,-3.674823,1,0,9,1,30
3,6679,2140.0,3.0,32.0,88.0,15,0,40.393663,-3.674823,1,0,9,1,45
4,6679,2156.0,4.0,33.0,90.0,15,1,40.393663,-3.674823,1,0,9,1,0


In [11]:
df_test.shape

(10526, 14)

In [12]:
df_test.dtypes

id                       int64
intensidad             float64
ocupacion              float64
carga                  float64
vmed                   float64
periodo_integracion      int64
Hora                     int64
Lat                    float64
Long                   float64
M30                      int64
URB                      int64
Mes                      int64
Dia                      int64
Minutos                  int64
dtype: object

In [13]:
X_test=df_test.drop(['carga'],axis=1)

In [14]:
X_test.shape

(10526, 13)

In [15]:
y_test=df_test['carga']

In [16]:
y_test.shape

(10526,)

## **PREDICTING Y_test WITH RANDOM FOREST**

Loading the trained model with joblib (the file is a pickle saved with joblib.dump()).

In [17]:
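import pickle
# sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23;
# on newer versions use `import joblib` instead
from sklearn.externals import joblib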

In [18]:
pkl_filename_route='/Users/antoniobravomunoz/Documents/DATA_SCIENCE_MASTER/TFM/Data-Science-Master-project/TrainedModels/randomforest2_colab.pkl'

In [19]:
RFreg = joblib.load(pkl_filename_route)

In [67]:
RFreg

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=29,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=20, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [65]:
RFreg.score

<bound method RegressorMixin.score of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=29,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=20, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)>


In [20]:
X_test.describe()

Unnamed: 0,id,intensidad,ocupacion,vmed,periodo_integracion,Hora,Lat,Long,M30,URB,Mes,Dia,Minutos
count,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0
mean,6729.556907,2351.216797,7.346855,79.530401,14.909082,12.917443,40.432354,-3.667283,1.0,0.0,9.0,3.840965,22.34752
std,20.665193,1671.041131,7.427167,14.767808,0.748585,6.604995,0.030781,0.00745,0.0,0.0,0.0,2.032883,16.792269
min,6679.0,5.0,1.0,4.0,1.0,0.0,40.383376,-3.683001,1.0,0.0,9.0,1.0,0.0
25%,6717.0,892.0,3.0,77.0,15.0,8.0,40.407452,-3.673634,1.0,0.0,9.0,2.0,0.0
50%,6729.0,2054.5,6.0,84.0,15.0,13.0,40.435325,-3.66577,1.0,0.0,9.0,4.0,15.0
75%,6746.0,3596.0,9.0,88.0,15.0,19.0,40.461982,-3.660087,1.0,0.0,9.0,6.0,30.0
max,6762.0,7255.0,76.0,100.0,15.0,23.0,40.481905,-3.659082,1.0,0.0,9.0,8.0,45.0


In [21]:
Ypredict=RFreg.predict(X_test)

In [24]:
len(Ypredict)

10526

In [28]:
ErrorScore=RFreg.score(X_test, y_test)
ErrorScore

0.9720606530911907
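For regressors, `score` returns the coefficient of determination R², so the value above is a goodness-of-fit measure rather than an error. The sketch below, on made-up numbers (an assumption for illustration, not the real predictions), shows how R² and an RMSE comparable with the cross-validation figures could be computed from `y_test` and `Ypredict`.

```python
import numpy as np

# Made-up targets and predictions (assumption: illustrative values only)
y_true = np.array([38.0, 37.0, 38.0, 32.0, 33.0])
y_pred = np.array([37.0, 37.5, 39.0, 31.0, 34.0])

residual_ss = np.sum((y_true - y_pred) ** 2)       # sum of squared errors
total_ss = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
r2 = 1.0 - residual_ss / total_ss                  # what score() reports for regressors
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # error in the units of carga
print(r2, rmse)
```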

We had a lot of trouble importing the model from disk correctly. The problem was finally solved by upgrading scikit-learn to version 0.20.2. We were on 0.19.2, which raised the error.

In [82]:
#!pip install -U scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/cb/5f/dfa0a118b8a503e45cd2cf48acb9cf1de8deaf06a3cef1b1c19bd5cbbc45/scikit_learn-0.20.2-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.9MB)
    100% |████████████████████████████████| 7.9MB 2.9MB/s
Requirement not upgraded as not directly required: numpy>=1.8.2 in /Users/antoniobravomunoz/anaconda3/lib/python3.6/site-packages (from scikit-learn) (1.15.1)
Requirement not upgraded as not directly required: scipy>=0.13.3 in /Users/antoniobravomunoz/anaconda3/lib/python3.6/site-packages (from scikit-learn) (1.1.0)
tensorflow 1.10.0 has requirement numpy<=1.14.5,>=1.13.3, but you'll have numpy 1.15.1 which is incompatible.
tensorflow 1.10.0 has requirement setuptools<=39.1.0, but you'll have setuptools 40.2.0 which is incompatible.
Installing collected packages: scikit-learn
  Found existing installation: s

In [80]:
#Random Forest v2
reg_RF2=RandomForestRegressor(n_estimators=100,min_samples_leaf=20,max_depth=29)

In [81]:
%%time
reg_RF2.fit(X,y)

KeyboardInterrupt: 

In [None]:
%%time
#Random Forest v3
reg_RF3=GridSearchCV(RandomForestRegressor(n_estimators=100,min_samples_leaf=9,max_depth=29),
                          param_grid={"min_samples_leaf":[10,20,30,40],"max_depth":range(29,30)},
                          scoring="neg_mean_squared_error"
                          )
reg_RF3.fit(X,y)
print(-reg_RF3.best_score_)
print(np.sqrt(-reg_RF3.best_score_))
print(reg_RF3.best_params_)

**Regression metrics**
- ‘explained_variance’ 
- ‘neg_mean_absolute_error’
- ‘neg_mean_squared_error’
- ‘neg_mean_squared_log_error’
- ‘neg_median_absolute_error’
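These names are passed as the `scoring` argument of the model-selection helpers. As a rough sketch on synthetic data (the data, model settings and `cv=3` are assumptions), the loop below evaluates one model under several of them with `cross_val_score`. `neg_mean_squared_log_error` is left out because it requires non-negative targets, which this synthetic target does not guarantee.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: stands in for the real features/target)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(0, 0.2, size=300)

model = DecisionTreeRegressor(max_depth=4, random_state=0)

scorings = [
    "explained_variance",
    "neg_mean_absolute_error",
    "neg_mean_squared_error",
    "neg_median_absolute_error",
]

# Mean cross-validated score per metric; "neg_" metrics are negated errors
results = {}
for scoring in scorings:
    scores = cross_val_score(model, X_demo, y_demo, scoring=scoring, cv=3)
    results[scoring] = scores.mean()
print(results)
```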

Finally, we test a **Gradient Boosting Regressor**, using the XGBoost implementation (XGBRegressor).

In [None]:
%%time
#XGBoost
# XGBRegressor has no min_samples_leaf parameter; min_child_weight is its closest equivalent
reg_XGB=GridSearchCV(XGBRegressor(n_estimators=100,min_child_weight=9,max_depth=19),
                          param_grid={"min_child_weight":[10,20,30,40],"max_depth":range(19,20)},
                          scoring="neg_mean_squared_error"
                          )
reg_XGB.fit(X,y)
print(reg_XGB.best_score_)
print(np.sqrt(-reg_XGB.best_score_))
print(reg_XGB.best_params_)

In [None]:
%%time
#XGBoost v2
# min_child_weight replaces min_samples_leaf, which XGBRegressor does not define
reg_XGB2=GridSearchCV(XGBRegressor(n_estimators=100,min_child_weight=9,max_depth=29),
                          param_grid={"min_child_weight":[10,20,30,40],"max_depth":range(29,30)},
                          scoring="neg_mean_squared_error"
                          )
reg_XGB2.fit(X,y)
print(reg_XGB2.best_score_)
print(np.sqrt(-reg_XGB2.best_score_))
print(reg_XGB2.best_params_)