### Prediction Model

The data has already been cleaned in SQL.

A table with one timestamp every 5 minutes was created from the station and weather data.
The timestamps were adjusted for the daylight saving time change.
A column for the day was added.
A column containing only the time was added.
The station status during the night was corrected to closed.
The CSV clean_db.csv was created from that. This dataset is the source of the training and test sets for this prediction model.
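
The cleaning itself was done in SQL and is not shown in this notebook. As a rough illustration only, the sketch below reproduces a few of the listed steps in pandas on made-up readings: a 5-minute grid, a day column, a time column, and the night status. The sample data, the name `grid`, and the closing hours (00:30 to 05:00) are assumptions, and the daylight saving adjustment is not covered.

```python
import pandas as pd

# Hypothetical raw readings for one station (irregular timestamps, made-up values).
raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-03-01 00:02", "2023-03-01 00:41", "2023-03-01 05:07", "2023-03-01 12:03"]
        ),
        "station_id": 1,
        "available_bikes": [4, 4, 6, 2],
        "status": "open",
    }
)

# Put the readings on a regular 5-minute grid, carrying the last known value forward.
grid = raw.set_index("timestamp").resample("5min").last().ffill().reset_index()
grid[["station_id", "available_bikes"]] = grid[["station_id", "available_bikes"]].astype(int)

# Add separate day-name and time-of-day columns.
grid["day"] = grid["timestamp"].dt.day_name()
grid["time"] = grid["timestamp"].dt.time

# Assumed closing hours (00:30 to 05:00); the real hours are not stated in the notebook.
minutes = grid["timestamp"].dt.hour * 60 + grid["timestamp"].dt.minute
night = (minutes >= 30) & (minutes < 300)
grid.loc[night, "status"] = "closed"

print(grid.head())
```

Forward-filling is one possible way to populate empty 5-minute slots; the SQL cleaning may have handled gaps differently.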

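For the prediction model, the plan is to fit a random forest regression for each station. The cells in this notebook fit a single `RandomForestClassifier` on all stations together, with `station_id` as one of the features.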
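
A minimal sketch of what "one regression per station" could look like, assuming `RandomForestRegressor` from scikit-learn and a loop over `groupby("station_id")`. The data here is synthetic, and the names `demo`, `features`, `models`, and `scores` are hypothetical; only the column names come from the cleaned table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned table (all values are made up).
rng = np.random.default_rng(42)
n = 600
demo = pd.DataFrame(
    {
        "station_id": rng.integers(1, 4, size=n),  # three hypothetical stations
        "temperature": rng.normal(12, 5, size=n),
        "humidity": rng.uniform(40, 100, size=n),
        "day_flag": rng.integers(0, 2, size=n),
    }
)
demo["available_bikes"] = (
    (10 + 5 * demo["day_flag"] + 0.3 * demo["temperature"] + rng.normal(0, 1, size=n))
    .round()
    .clip(lower=0)
    .astype(int)
)

features = ["temperature", "humidity", "day_flag"]
models, scores = {}, {}

# Fit one regressor per station, using only that station's rows.
for station_id, group in demo.groupby("station_id"):
    X_tr, X_te, y_tr, y_te = train_test_split(
        group[features], group["available_bikes"], train_size=0.7, random_state=42
    )
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X_tr, y_tr)
    models[station_id] = model
    # Mean absolute error in bikes on the held-out rows of this station.
    scores[station_id] = mean_absolute_error(y_te, model.predict(X_te))

print(scores)
```

Keeping the fitted models in a dictionary keyed by `station_id` makes it straightforward to look up the right model at prediction time.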

In [1]:
# Import the required packages
# Import package pandas for data analysis
import pandas as pd

# Import package numpy for numeric computing
import numpy as np

# Import package matplotlib for visualisation/plotting
import matplotlib.pyplot as plt

# Import package seaborn for statistical plotting
import seaborn as sns

# Imports for the random forest model, data splitting and grid search
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# For showing plots directly in the notebook run the command below
%matplotlib inline

# Make the web package importable and load the database helper
import sys
sys.path.append('../web/')
from dbConnection import get_clean_db

In [2]:
# Alternative (disabled): read the cleaned data from the CSV export into a data frame
#df = pd.read_csv('clean_db.csv', keep_default_na=True, dtype={16: str}, delimiter=',', skipinitialspace=True, encoding='Windows-1252') #sep=',\s+',

In [2]:

# Load the cleaned rows from the database into a data frame
columns = [
    'timestamp', 'station_id', 'available_bikes', 'available_bike_stands',
    'status', 'temperature', 'pressure', 'humidity', 'clouds',
    'wind_speed_beaufort', 'wind_direction', 'precipitation_value',
    'precipitation_min', 'precipitation_max', 'precipitation_probability',
    'wind_speed_mps', 'weather_type', 'icon_number',
    'temperature_feels_like', 'day_flag', 'time', 'day',
]
rows = get_clean_db()
df = pd.DataFrame(rows, columns=columns)

In [12]:
# Inspect the size, column names and data types of the data frame
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
print(df.columns)
df.dtypes

Rows: 294205
Columns: 22
Index(['timestamp', 'station_id', 'available_bikes', 'available_bike_stands',
       'status', 'temperature', 'pressure', 'humidity', 'clouds',
       'wind_speed_beaufort', 'wind_direction', 'precipitation_value',
       'precipitation_min', 'precipitation_max', 'precipitation_probability',
       'wind_speed_mps', 'weather_type', 'icon_number',
       'temperature_feels_like', 'day_flag', 'time', 'day'],
      dtype='object')


timestamp                    datetime64[ns]
station_id                            int64
available_bikes                       int64
available_bike_stands                 int64
status                               object
temperature                         float64
pressure                            float64
humidity                            float64
clouds                              float64
wind_speed_beaufort                 float64
wind_direction                      float64
precipitation_value                 float64
precipitation_min                   float64
precipitation_max                   float64
precipitation_probability           float64
wind_speed_mps                      float64
weather_type                         object
icon_number                         float64
temperature_feels_like              float64
day_flag                              int64
time                                 object
day                                  object
dtype: object

In [5]:
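# Features: drop the target, its complement available_bike_stands, the raw timestamp,
# and the categorical or text columns; station_id stays in as a feature
X = df.drop(['timestamp', 'weather_type', 'icon_number', 'status', 'time', 'day', 'available_bikes', 'available_bike_stands'], axis=1)
# Target: number of available bikes
y = df['available_bikes']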

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape

((205943, 14), (88262, 14))

In [25]:
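# Shallow random forest; oob_score=True keeps an out-of-bag performance estimate.
# Note: this is a classifier, so every distinct bike count is treated as its own class.
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)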

In [26]:
%%time
classifier_rf.fit(X_train, y_train)

CPU times: user 44 s, sys: 1.33 s, total: 45.3 s
Wall time: 14.7 s


In [27]:
# Check the out-of-bag score (for a classifier this is the accuracy)
classifier_rf.oob_score_

0.08778608329244711
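
For a classifier, `oob_score_` is the out-of-bag accuracy, so a value near 0.09 means the exact bike count is hit for roughly 9% of the samples. For a regressor, the same attribute reports the out-of-bag R², which is usually a more natural measure for a count target. The sketch below contrasts the two on synthetic data; `X_demo`, `y_demo`, `clf`, and `reg` are hypothetical names, and the numbers it prints say nothing about the bike data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic count-like target driven by two of three features (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = np.round(2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1, size=500)).astype(int)

# Classifier: each distinct count is a separate class; oob_score_ is accuracy.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, oob_score=True, random_state=42)
clf.fit(X_demo, y_demo)

# Regressor: the count is treated as a number; oob_score_ is R^2.
reg = RandomForestRegressor(n_estimators=100, max_depth=5, oob_score=True, random_state=42)
reg.fit(X_demo, y_demo)

print(f"classifier OOB accuracy: {clf.oob_score_:.3f}")
print(f"regressor OOB R^2:       {reg.oob_score_:.3f}")
```

The two scores are on different scales and should not be compared directly; the point is only that exact-match accuracy penalises a prediction that is off by one bike as heavily as one that is off by twenty.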

In [28]:
# Base estimator and hyperparameter grid: 5 x 6 x 6 = 180 combinations
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10,25,30,50,100,200]
}

In [31]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy")

In [32]:
%%time
grid_search.fit(X_train, y_train)

Fitting 4 folds for each of 180 candidates, totalling 720 fits




KeyboardInterrupt: 
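
The full grid needs 720 fits on about 206,000 training rows, and the run above was interrupted before it finished. One common way to cut the cost is to sample a fixed number of parameter combinations instead of trying all of them. The sketch below uses scikit-learn's `RandomizedSearchCV` for that, which is a swapped-in technique and not what this notebook ran. It uses a regressor and synthetic data; `X_demo`, `y_demo`, `param_distributions`, and `search` are hypothetical names, and `n_iter=8` is an arbitrary budget.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic data set so the sketch runs quickly (made-up values).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1, size=400)

# Same candidate values as the grid above, but only a sample of them is tried.
param_distributions = {
    "max_depth": [2, 3, 5, 10, 20],
    "min_samples_leaf": [5, 10, 20, 50, 100, 200],
    "n_estimators": [10, 25, 30, 50, 100, 200],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,                            # 8 sampled combinations instead of 180
    cv=4,
    scoring="neg_mean_absolute_error",   # regression metric; higher (closer to 0) is better
    random_state=42,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean absolute error of the best combination
```

With these settings the search runs 8 x 4 = 32 fits rather than 720. Tuning on a random subsample of the training rows first is another option for narrowing the ranges before a final fit on the full data.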

In [None]:
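# Best mean cross-validated accuracy (available only after the search has completed)
grid_search.best_score_

In [None]:
# Estimator refitted on the full training set with the best parameter combination
rf_best = grid_search.best_estimator_
rf_best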