# Taxi Demand Prediction - New York City

# ML Course Project / Real-World Problem

# Acknowledgments

We would like to thank our supervisor Dr. Hashim Tamimi for providing valuable input and supporting us during this semester.


# Project Supervisor: 
   * Dr. Hashim Tamimi

# Project Team :
    1- Eng. Baha' Abu-Qarandal (ID:176035)
    2- Eng. Alaa  Tamimi       (ID:196206)

# 1 - Introduction:
 

## 1.1 Project Goal:

Taxis are part of the transportation system of most cities and provide a service that takes individuals from one point to another. Predicting taxi demand accurately and supplying the right number of taxis in the right place at the right time is very important and would bring benefits on several levels: customers would experience a lower expected wait time, taxi companies would use their resources more efficiently by regulating the number of taxis, and drivers would receive recommendations on where to look for customers and spend less time roaming and queuing for them.

## 1.2 Elasticsearch:
We used the Elasticsearch engine to store and query the large volume of trip data.

## 1.3 Data Information:
<b>Source of Data:</b> Data can be downloaded from here:<br>
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.<br> 
Here, we have used 2019  data.

## 1.4 Information about Taxis:

* <b>Yellow Taxi:</b> Yellow Medallion Taxicabs<br>
These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.<br><br>

## 1.5 Features in Dataset:
<table>
	<tr>
		<th>Field Name</th>
		<th>Description</th>
	</tr>
	<tr>
		<td>VendorID</td>
		<td>
		A code indicating the TPEP provider that provided the record. 
		<ol>
			<li>Creative Mobile Technologies</li>
			<li>VeriFone Inc.</li>
		</ol>
		</td>
	</tr>
	<tr>
		<td>tpep_pickup_datetime</td>
		<td>The date and time when the meter was engaged.</td>
	</tr>
	<tr>
		<td>tpep_dropoff_datetime</td>
		<td>The date and time when the meter was disengaged.</td>
	</tr>
	<tr>
		<td>Passenger_count</td>
		<td>The number of passengers in the vehicle. This is a driver-entered value.</td>
	</tr>
	<tr>
		<td>Trip_distance</td>
		<td>The elapsed trip distance in miles reported by the taximeter.</td>
	</tr>
	<tr>
		<td>PULocationID</td>
		<td>TLC Taxi Zone in which the taximeter was engaged</td>
	</tr>
	<tr>
		<td>DOLocationID</td>
		<td>TLC Taxi Zone in which the taximeter was disengaged</td>
	</tr>
	<tr>
		<td>RateCodeID</td>
		<td>The final rate code in effect at the end of the trip.
		<ol>
			<li> Standard rate </li>
			<li> JFK </li>
			<li> Newark </li>
			<li> Nassau or Westchester</li>
			<li> Negotiated fare </li>
			<li> Group ride</li>
		</ol>
		</td>
	</tr>
	<tr>
		<td>Store_and_fwd_flag</td>
		<td>This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
		<br>Y = store and forward trip
		<br>N = not a store and forward trip
		</td>
	</tr>
	<tr>
		<td>Payment_type</td>
		<td>A numeric code signifying how the passenger paid for the trip.
		<br>1 = Credit card
		<br>2 = Cash
		<br>3 = No charge
		<br>4 = Dispute
		<br>5 = Unknown
		<br>6 = Voided trip
		</td>
	</tr>
	<tr>
		<td>Fare_amount</td>
		<td>The time-and-distance fare calculated by the meter.</td>
	</tr>
	<tr>
		<td>Extra</td>
		<td>Miscellaneous extras and surcharges. Currently, this only includes the $0.50$ and $1$ rush hour and overnight charges.</td>
	</tr>
	<tr>
		<td>MTA_tax</td>
		<td>$0.50$ MTA tax that is automatically triggered based on the metered rate in use.</td>
	</tr>
	<tr>
		<td>Tip_amount</td>
		<td>Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.</td>
	</tr>
	<tr>
		<td>Tolls_amount</td>
		<td>Total amount of all tolls paid in trip.</td>
	</tr>
	<tr>
		<td>Total_amount</td>
		<td>The total amount charged to passengers. Does not include cash tips.</td>
	</tr>
</table>

## 1.6 Problem Formulation: Time Series Forecasting:

Given a region and a 10-minute interval, we have to predict the number of pickups.

*  The timeline of every region of NYC is divided into 10-minute intervals.<br>

We already know the number of pickups at time 't', and we want to predict the number of pickups at time 't+1' in the same region. Hence, this problem can be treated as a time series prediction problem, which is a special case of regression. In short, we use the data at time 't' to predict the value for time 't+1'.
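The formulation above can be sketched on toy data: count pickups per zone and 10-minute bin, then pair each count with the count of the next bin in the same zone. This is a minimal illustration, not the pipeline used later in this notebook; the toy frame and the names `demand`, `pickups`, `pickups_next`, and `supervised` are assumptions introduced here.

```python
import pandas as pd

# Toy trips: one row per pickup, with a zone ID and a 10-minute bin index.
trips = pd.DataFrame({
    "PULocationID": [151, 151, 151, 230, 230, 151],
    "pickup":       [0,   0,   1,   0,   1,   2],
})

# Demand = number of pickups per (zone, bin).
demand = (
    trips.groupby(["PULocationID", "pickup"])
         .size()
         .rename("pickups")
         .reset_index()
         .sort_values(["PULocationID", "pickup"])
)

# The target for bin t is the demand in the next bin of the same zone.
# Assumption: every bin is present for every zone. Bins with zero pickups
# would have to be filled in first, otherwise shift(-1) skips over them.
demand["pickups_next"] = demand.groupby("PULocationID")["pickups"].shift(-1)

# The last bin of each zone has no known "next" value, so it is dropped.
supervised = demand.dropna(subset=["pickups_next"])
print(supervised)
```

Each remaining row is one supervised example: the features describe time 't' and `pickups_next` is the value to predict for 't+1'.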

## 1.7 Performance Metric:
*  Mean Absolute Error (MAE) 
*  Mean Squared Error (MSE)
*  Root Mean Squared Error (RMSE)
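For reference, the three metrics can be computed directly from the prediction errors: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE. The helper name `regression_errors` and the sample values are illustrative assumptions.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))   # mean absolute error
    mse = np.mean(err ** 2)      # mean squared error
    rmse = np.sqrt(mse)          # root mean squared error
    return mae, mse, rmse

# Errors are [1, 0, -2, 0], so MAE = 3/4 and MSE = 5/4.
mae, mse, rmse = regression_errors([3, 5, 2, 7], [2, 5, 4, 7])
print(mae, mse, rmse)
```

Because MSE squares the errors, it penalizes large misses more heavily than MAE; RMSE brings the value back to the unit of the target.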

# 2 - Project Code:


## 2.1 Start the Elasticsearch and Kibana engines:

* (a): Start Elasticsearch from its local installation folder.
* (b): Start Kibana from its local installation folder.<br>

## 2.2 Import Libraries:

In [1]:
import numpy as np
import pandas as pd
import math
from elasticsearch import Elasticsearch
import json
import warnings    
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('nbagg') 
warnings.simplefilter('ignore')
from datetime import datetime
from datetime import timedelta
import time 
import seaborn as sns
from timeit import default_timer as timer
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn import metrics
from numpy import mean
from sklearn.datasets import make_classification
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier

## 2.3 Connect to Elasticsearch server:

In [2]:
# Include the URL scheme; newer versions of the Python client require it.
es = Elasticsearch('http://localhost:9200')

## 2.4 Feature selection:


In [3]:
def read_data_from_ES (my_index,Start_index,Final_index,step):
    col=['tpep_dropoff_datetime','tpep_pickup_datetime','trip_distance','DOLocationID',
         'PULocationID','passenger_count']

    df= pd.DataFrame(columns=col)
    dict_index_fields = {}
    mapping = es.indices.get_mapping(index=my_index)
    dict_index_fields[my_index] = []
    for field in mapping[my_index]['mappings']['properties']:
        dict_index_fields[my_index].append(field) 
    j=0
    for i in range(Start_index,Final_index,step):
        res = es.get(index =  my_index,  id=i)
        df.loc[j, ['tpep_dropoff_datetime']] = res['_source']['tpep_dropoff_datetime']
        df.loc[j, ['tpep_pickup_datetime']]  = res['_source']['tpep_pickup_datetime']
        df.loc[j, ['trip_distance']]         = res['_source']['trip_distance']
        df.loc[j, ['DOLocationID']]          = res['_source']['DOLocationID']
        df.loc[j, ['PULocationID']]          = res['_source']['PULocationID']
        df.loc[j, ['passenger_count']]       = res['_source']['passenger_count']
        j+=1
    return df
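Assigning one cell at a time with `df.loc` gets slow as the number of rows grows. A possible alternative, sketched below, collects one dictionary per document and builds the DataFrame in a single step. To keep the sketch runnable without a server, `fetch_source` is a hypothetical callable standing in for `es.get(index=my_index, id=i)['_source']`, and `fake_index` is made-up sample data.

```python
import pandas as pd

COLUMNS = ['tpep_dropoff_datetime', 'tpep_pickup_datetime', 'trip_distance',
           'DOLocationID', 'PULocationID', 'passenger_count']

def records_to_frame(fetch_source, ids, columns=COLUMNS):
    """Fetch each document, keep the selected fields, build the frame once."""
    rows = []
    for doc_id in ids:
        source = fetch_source(doc_id)  # dict of field name -> value
        rows.append({c: source[c] for c in columns})
    return pd.DataFrame(rows, columns=columns)

# Stand-in for the Elasticsearch index (values are strings, as in the index).
fake_index = {
    1: {'VendorID': '1', 'tpep_pickup_datetime': '2019-01-01 00:46:40',
        'tpep_dropoff_datetime': '2019-01-01 00:53:20', 'trip_distance': '1.50',
        'PULocationID': '151', 'DOLocationID': '239', 'passenger_count': '1'},
    761: {'VendorID': '2', 'tpep_pickup_datetime': '2019-01-01 00:52:08',
          'tpep_dropoff_datetime': '2019-01-01 01:07:33', 'trip_distance': '1.60',
          'PULocationID': '230', 'DOLocationID': '48', 'passenger_count': '2'},
}

# Same sampling idea as read_data_from_ES: every 760th document ID.
df = records_to_frame(fake_index.__getitem__, range(1, 1000, 760))
print(df)
```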


## 2.5 Read data:
1. Combine all the months needed into a single data frame.
2. Delete rows with a zero trip distance, or with a missing or zero passenger count, from the data.

In [4]:
# Read data
start_time = time.monotonic()
df_m1=   read_data_from_ES('m1year2019',1,7667792,760)
print('done m1')
df_m2=   read_data_from_ES('m2year2019',1,7019375,700)
print('done m2')
df_m3=   read_data_from_ES('m3year2019',1,7832545,780)
print('done m3')
df_m4=   read_data_from_ES('m4year2019',1,7433139,740)
print('done m4')
df_m5=   read_data_from_ES('m5year2019',1,7565261,750)
print('done m5')
df_m6=   read_data_from_ES('m6year2019',1,6941024,690)
print('done m6')
df_m7=   read_data_from_ES('m7year2019',1,6310419,630)
print('done m7')
df_m8=   read_data_from_ES('m8year2019',1,6073357,600)
print('done m8')
df_m9=   read_data_from_ES('m9year2019',1,6416056,640)
print('done m9')
df_m10= read_data_from_ES('m10year2019',1,7213891,720)
print('done m10')
df_m11= read_data_from_ES('m11year2019',1,6878111,680)
print('done m11')
df_m12= read_data_from_ES('m12year2019',1,6896317,680)
print('done m12')

# Combine all monthly frames into one frame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
data = pd.concat([df_m1, df_m2, df_m3, df_m4, df_m5, df_m6,
                  df_m7, df_m8, df_m9, df_m10, df_m11, df_m12],
                 ignore_index=True)
# Remove some wrong data
data.drop(data[data['trip_distance'] == '.00'].index , inplace=True)
data.drop(data[data['passenger_count'] == ''].index , inplace=True)
data.drop(data[data['passenger_count'] == '0'].index , inplace=True)
elastic_df=data.reset_index()
#Convert from string values to Correct value
elastic_df['trip_distance'] = elastic_df['trip_distance'].astype(float)
elastic_df['PULocationID'] = elastic_df['PULocationID'].astype(int)
elastic_df['DOLocationID'] = elastic_df['DOLocationID'].astype(int)
elastic_df['passenger_count'] = elastic_df['passenger_count'].astype(int)
elastic_df.head()

done m1
done m2
done m3
done m4
done m5
done m6
done m7
done m8
done m9
done m10
done m11
done m12


Unnamed: 0,index,tpep_dropoff_datetime,tpep_pickup_datetime,trip_distance,DOLocationID,PULocationID,passenger_count
0,0,2019-01-01 00:53:20,2019-01-01 00:46:40,1.5,239,151,1
1,1,2019-01-01 01:07:33,2019-01-01 00:52:08,1.6,48,230,2
2,2,2019-01-01 00:54:36,2019-01-01 00:45:31,1.0,261,261,1
3,3,2019-01-01 00:35:53,2019-01-01 00:32:05,0.81,75,75,3
4,4,2019-01-01 00:54:07,2019-01-01 00:49:24,0.87,239,142,1


## 2.6 Show some important information:

In [5]:
elastic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117172 entries, 0 to 117171
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   index                  117172 non-null  int64  
 1   tpep_dropoff_datetime  117172 non-null  object 
 2   tpep_pickup_datetime   117172 non-null  object 
 3   trip_distance          117172 non-null  float64
 4   DOLocationID           117172 non-null  int32  
 5   PULocationID           117172 non-null  int32  
 6   passenger_count        117172 non-null  int32  
dtypes: float64(1), int32(3), int64(1), object(2)
memory usage: 4.9+ MB


In [6]:
elastic_df.describe()

Unnamed: 0,index,trip_distance,DOLocationID,PULocationID,passenger_count
count,117172.0,117172.0,117172.0,117172.0,117172.0
mean,60251.085771,3.007385,161.158195,163.197462,1.597831
std,34867.128466,3.902712,70.032603,65.861761,1.20296
min,0.0,0.01,1.0,1.0,1.0
25%,30043.75,1.0,107.0,116.0,1.0
50%,60162.5,1.64,162.0,162.0,1.0
75%,90436.25,3.0325,233.0,233.0,2.0
max,120720.0,73.93,265.0,265.0,6.0


In [7]:
end_time = time.monotonic()
print(timedelta(seconds=end_time - start_time))
start_time = time.monotonic()

0:05:45.875000


### 2.7 Check whether a region exists (there are 265 regions in total, but only 57 Yellow Taxi zones)
For more information, see the taxi zone lookup table on the source site (e.g., 17 is the ID of an NYC zone that is served by another type of taxi operator, the "Boro Taxi").

In [8]:
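# PULocationID was cast to int in 2.5, so compare with the integer 17
# (comparing with the string '17' never matches and always returns an empty frame).
elastic_df.loc[elastic_df['PULocationID'] == 17]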

Unnamed: 0,index,tpep_dropoff_datetime,tpep_pickup_datetime,trip_distance,DOLocationID,PULocationID,passenger_count


### 2.8 Find Trip duration and speed

In [9]:
# 2019-01-01 00:00:00  >>>  1546300800 (when the local timezone is UTC; time.mktime interprets the tuple as local time)
def timeToUnix(t):
    change = datetime.strptime(t,"%Y-%m-%d %H:%M:%S") 
    t_tuple = change.timetuple()
    return time.mktime(t_tuple) 

In [10]:
def Calculate_time_speed():
    pickup_time={}
    dropoff_time={}
    trip_duration={}
    speed={}
    for i in range (len(elastic_df["tpep_pickup_datetime"])):
        pickup_time[i]=timeToUnix(elastic_df["tpep_pickup_datetime"][i])
        dropoff_time[i]=timeToUnix(elastic_df['tpep_dropoff_datetime'][i])
        trip_duration[i]=( dropoff_time[i]-pickup_time[i])/60 # divide by 60 to convert to minutes.
        if trip_duration[i] == 0 :
            trip_duration[i] = 0.00001
        speed[i]=float( elastic_df['trip_distance'][i])/ (trip_duration[i]) /60 # Speed in miles per second (minutes * 60 = seconds); multiply by 3600 for miles/hr.
    pickup_time= (pd.DataFrame.from_dict(pickup_time.items()))
    pickup_time.drop([0], axis=1, inplace=True)
    
    dropoff_time= (pd.DataFrame.from_dict(dropoff_time.items()))
    dropoff_time.drop([0], axis=1, inplace=True)
    
    trip_duration= (pd.DataFrame.from_dict(trip_duration.items()))
    trip_duration.drop([0], axis=1, inplace=True)
    
    speed= (pd.DataFrame.from_dict(speed.items()))
    speed.drop([0], axis=1, inplace=True)
    
    elastic_df['speed'] = np.array(speed)
    elastic_df['trip_duration'] = np.array(trip_duration)
    elastic_df['dropoff_time'] = np.array(dropoff_time)
    elastic_df['pickup_time'] = np.array(pickup_time)
    return  elastic_df

In [11]:
elastic_df=Calculate_time_speed()

In [12]:
elastic_df.head()

Unnamed: 0,index,tpep_dropoff_datetime,tpep_pickup_datetime,trip_distance,DOLocationID,PULocationID,passenger_count,speed,trip_duration,dropoff_time,pickup_time
0,0,2019-01-01 00:53:20,2019-01-01 00:46:40,1.5,239,151,1,0.00375,6.666667,1546297000.0,1546296000.0
1,1,2019-01-01 01:07:33,2019-01-01 00:52:08,1.6,48,230,2,0.00173,15.416667,1546298000.0,1546297000.0
2,2,2019-01-01 00:54:36,2019-01-01 00:45:31,1.0,261,261,1,0.001835,9.083333,1546297000.0,1546296000.0
3,3,2019-01-01 00:35:53,2019-01-01 00:32:05,0.81,75,75,3,0.003553,3.8,1546296000.0,1546296000.0
4,4,2019-01-01 00:54:07,2019-01-01 00:49:24,0.87,239,142,1,0.003074,4.716667,1546297000.0,1546297000.0
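The per-row loop in `Calculate_time_speed` can also be written with vectorized pandas operations. The sketch below is one possible variant on two sample trips, not the code that produced the table above: it parses the timestamps with `pd.to_datetime`, expresses speed in miles per hour, and marks zero-length trips as NaN instead of substituting a tiny duration. The column name `speed_mph` is an assumption.

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime":  ["2019-01-01 00:46:40", "2019-01-01 00:52:08"],
    "tpep_dropoff_datetime": ["2019-01-01 00:53:20", "2019-01-01 01:07:33"],
    "trip_distance": [1.5, 1.6],
})

fmt = "%Y-%m-%d %H:%M:%S"
pickup = pd.to_datetime(trips["tpep_pickup_datetime"], format=fmt)
dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"], format=fmt)

# Trip duration in minutes, computed for the whole column at once.
trips["trip_duration"] = (dropoff - pickup).dt.total_seconds() / 60

# Speed in miles per hour: distance divided by duration in hours.
# where() turns non-positive durations into NaN, so no division by zero occurs.
hours = trips["trip_duration"] / 60
trips["speed_mph"] = trips["trip_distance"] / hours.where(hours > 0)
print(trips[["trip_distance", "trip_duration", "speed_mph"]])
```

Working on differences of parsed timestamps also avoids any dependence on the local timezone of the machine.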


# 3. Data Preparation:

### 3.1 Time binning

In [13]:
# For pickup
# 1546300800 : 2019-01-01 00:00:00   (Equivalent unix time)
# 1577836800 : 2020-01-01 00:00:00   (Equivalent unix time)
def pickup_10min_bins(dataframe,year):
    pickupTime =dataframe['pickup_time']
    unixTime = [1546248880, 1577784880]  # per-year origins; note these differ from the midnight values listed above
    unix_year = unixTime[year-2019]
    #600 = 10 min
    time_10min_bin = [int((i - unix_year)/600) for i in pickupTime]
    dataframe["pickup"] = np.array(time_10min_bin)
    return dataframe

In [14]:
# For dropoff

# 1546300800 : 2019-01-01 00:00:00   (Equivalent unix time)
# 1577836800 : 2020-01-01 00:00:00   (Equivalent unix time)
def dropoff_10min_bins(dataframe,year):
    dropoffTime =dataframe['dropoff_time']
    unixTime = [1546248880, 1577784880]
    unix_year = unixTime[year-2019]
    time_10min_bin = [int((i - unix_year)/600) for i in dropoffTime]
    dataframe["dropoff"] = np.array(time_10min_bin)
    return dataframe

In [15]:
elastic_df=pickup_10min_bins(elastic_df,2019)
elastic_df=dropoff_10min_bins(elastic_df,2019)

In [16]:
elastic_df.head()

Unnamed: 0,index,tpep_dropoff_datetime,tpep_pickup_datetime,trip_distance,DOLocationID,PULocationID,passenger_count,speed,trip_duration,dropoff_time,pickup_time,pickup,dropoff
0,0,2019-01-01 00:53:20,2019-01-01 00:46:40,1.5,239,151,1,0.00375,6.666667,1546297000.0,1546296000.0,79,79
1,1,2019-01-01 01:07:33,2019-01-01 00:52:08,1.6,48,230,2,0.00173,15.416667,1546298000.0,1546297000.0,79,81
2,2,2019-01-01 00:54:36,2019-01-01 00:45:31,1.0,261,261,1,0.001835,9.083333,1546297000.0,1546296000.0,79,79
3,3,2019-01-01 00:35:53,2019-01-01 00:32:05,0.81,75,75,3,0.003553,3.8,1546296000.0,1546296000.0,77,78
4,4,2019-01-01 00:54:07,2019-01-01 00:49:24,0.87,239,142,1,0.003074,4.716667,1546297000.0,1546297000.0,79,79
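The binning step maps a Unix timestamp to the index of the 10-minute interval it falls in, counted from a chosen origin. A minimal vectorized sketch follows; the helper `add_10min_bin` is hypothetical, and the origin used here (midnight UTC on 2019-01-01) is an assumption for illustration, not the constant used in the functions above. It uses floor division, which differs from `int()` truncation only for timestamps before the origin.

```python
import numpy as np
import pandas as pd

BIN_SECONDS = 600  # 10 minutes

def add_10min_bin(frame, time_col, origin, out_col):
    """Return a copy of frame with the 10-minute bin index of time_col."""
    seconds = frame[time_col].to_numpy(dtype=float) - origin
    out = frame.copy()
    out[out_col] = np.floor_divide(seconds, BIN_SECONDS).astype(int)
    return out

origin = 1546300800  # 2019-01-01 00:00:00 UTC
demo = pd.DataFrame({
    "pickup_time": [origin, origin + 599, origin + 600, origin + 3600],
})
demo = add_10min_bin(demo, "pickup_time", origin, "pickup")
print(demo)
```

The first two timestamps fall in bin 0, the third starts bin 1, and one hour after the origin is bin 6.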


# 4. Data In figures:

### 4.1 Pickup time Vs passenger count:

In [17]:
#elastic_df.plot(x='pickup_time', y='passenger_count', kind="bar")
#plt.xlabel("pickup time",fontsize=12)
#plt.ylabel("passenger count",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data1')

### 4.2 Index Vs passenger count:

In [18]:
#elastic_df.plot(x='index', y='passenger_count', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("passenger count",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data2')

### 4.3 Index Vs trip distance:

In [19]:
#elastic_df.plot(x='index', y='trip_distance', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("trip distance",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data3')

### 4.4 Index Vs trip duration:

In [20]:
#fig = plt.figure()
#elastic_df.plot(x='index', y='trip_duration', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("trip duration",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data4')



### 4.5 Index Vs PULocationID:

In [21]:
#elastic_df.plot(x='index', y='PULocationID', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("PULocationID",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data5')

### 4.6 Index Vs DOLocationID:

In [22]:
#elastic_df.plot(x='index', y='DOLocationID', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("DOLocationID",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data6')

### 4.7 Index Vs speed:

In [23]:
#elastic_df.plot(x='index', y='speed', kind="bar")
#plt.xlabel("index",fontsize=12)
#plt.ylabel("Speed",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('data7')

### 4.8 Grid relations between 'PULocationID',  'DOLocationID',   'pickup',  'passenger_count':

In [24]:
new_data2019 = elastic_df[['PULocationID','DOLocationID','pickup','passenger_count']]
new_data2019.head()

Unnamed: 0,PULocationID,DOLocationID,pickup,passenger_count
0,151,239,79,1
1,230,48,79,2
2,261,261,79,1
3,75,75,77,3
4,142,239,79,1


In [25]:
new_data2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117172 entries, 0 to 117171
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   PULocationID     117172 non-null  int32
 1   DOLocationID     117172 non-null  int32
 2   pickup           117172 non-null  int32
 3   passenger_count  117172 non-null  int32
dtypes: int32(4)
memory usage: 1.8 MB


In [26]:
new_data2019.describe()

Unnamed: 0,PULocationID,DOLocationID,pickup,passenger_count
count,117172.0,117172.0,117172.0,117172.0
mean,163.197462,161.158195,26186.62573,1.597831
std,65.861761,70.032603,15174.628253,1.20296
min,1.0,1.0,19.0,1.0
25%,116.0,107.0,12975.75,1.0
50%,162.0,162.0,26013.5,1.0
75%,233.0,233.0,39273.0,2.0
max,265.0,265.0,56519.0,6.0


In [27]:
#sns.pairplot(new_data2019)
#plt.savefig('data8')

In [28]:
#sns.pairplot(new_data2019, vars = ['PULocationID','DOLocationID','pickup','passenger_count'], hue ='passenger_count', palette='Dark2')
#plt.savefig('data9')

In [29]:
#sns.pairplot(new_data2019, vars =  ['PULocationID','DOLocationID','pickup','passenger_count'], hue ='passenger_count', hue_order = [1.0, 0.0])
#plt.savefig('data10')

In [30]:
#sns.pairplot(new_data2019,vars = ['PULocationID','DOLocationID','pickup','passenger_count'], hue ='passenger_count', kind = 'reg')
#plt.savefig('data11')

# 5 - Define Training and Prediction data: 

In [31]:
x = new_data2019[['PULocationID','DOLocationID','pickup']]
y = new_data2019[['passenger_count']]

##  5.1 Splitting data, 20% for Testing and 80% for Training :

In [32]:
# 20% for Final testing 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
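Without a fixed seed, `train_test_split` produces a different partition on every run, so the scores reported later can vary between runs. The sketch below, on synthetic data with the same column names, shows that passing `random_state` makes the 80/20 split repeatable. The seed value 42 and the synthetic frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for new_data2019 (random values, not real trips).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "PULocationID": rng.integers(1, 266, size=100),
    "DOLocationID": rng.integers(1, 266, size=100),
    "pickup": rng.integers(0, 52560, size=100),
    "passenger_count": rng.integers(1, 7, size=100),
})
x_demo = demo[["PULocationID", "DOLocationID", "pickup"]]
y_demo = demo["passenger_count"]

# The same random_state gives the same partition on every call.
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))
```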

#  6 - Cross-validation (Out-of-sample testing):
Cross-validation is a statistical method used to estimate the skill of machine learning models.
It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
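To make the idea concrete, the sketch below uses scikit-learn's `KFold` on ten toy samples and shows that each sample lands in exactly one validation fold while the remaining samples form the training part. The toy array and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
kf = KFold(n_splits=5, shuffle=True, random_state=1)

seen = []        # validation indices collected over all folds
fold_sizes = []  # (training size, validation size) per fold
for train_idx, val_idx in kf.split(X):
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen.extend(val_idx.tolist())

print(fold_sizes)
print(sorted(seen))
```

With 10 samples and 5 folds, every fold trains on 8 samples and validates on 2, and the validation indices together cover all 10 samples once.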



##  6.1 Choose the model:

In [33]:
# retrieve the model to be evaluated
def get_model1():
    model=LogisticRegression()
    return model
def get_model2():
    model=RandomForestClassifier()
    return model
def get_model3():
    model=LinearSVC()
    return model
def get_model4():
    model=MLPClassifier()
    return model
def get_model5():
    model=SGDClassifier()
    return model

##  6.2 Evaluate the model:

### Ideal test condition for  Logistic Regression 

In [34]:
# evaluate the model using a given test condition
def evaluate_model1(cv):
    # get the model
    model = get_model1()
    # evaluate the model
    scores = cross_val_score( model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()
# calculate the ideal test condition
ideal1, _, _ = evaluate_model1(KFold())

### Ideal test condition for  Random Forest Classifier

In [35]:
# evaluate the model using a given test condition
def evaluate_model2(cv):
    # get the model
    model = get_model2()
    # evaluate the model
    scores = cross_val_score( model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()
# calculate the ideal test condition
ideal2, _, _ = evaluate_model2(KFold())

### Ideal test condition for Linear Support vector Classifier

In [36]:
# evaluate the model using a given test condition
def evaluate_model3(cv):
    # get the model
    model = get_model3()
    # evaluate the model
    scores = cross_val_score( model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()
# calculate the ideal test condition
ideal3, _, _ = evaluate_model3(KFold())

### Ideal test condition for Neural Network Classifier

In [37]:
# evaluate the model using a given test condition
def evaluate_model4(cv):
    # get the model
    model = get_model4()
    # evaluate the model
    scores = cross_val_score( model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()
# calculate the ideal test condition
ideal4, _, _ = evaluate_model4(KFold())

### Ideal test condition for Stochastic Gradient Descent  Classifier

In [38]:
# evaluate the model using a given test condition
def evaluate_model5(cv):
    # get the model
    model = get_model5()
    # evaluate the model
    scores = cross_val_score( model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()
# calculate the ideal test condition
ideal5, _, _ = evaluate_model5(KFold())
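The five `evaluate_model` functions differ only in the model they build, so they could be folded into one function driven by a dictionary of models. The following is a sketch of that refactoring on synthetic data from `make_classification`, not a replacement for the results in this notebook; the `models` dictionary, the hyperparameters, and the `evaluate` helper are assumptions, and only three of the five classifiers are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=1),
    "Stochastic Gradient Descent": SGDClassifier(random_state=1),
}

def evaluate(model, cv):
    """Return mean, min, and max accuracy over the folds defined by cv."""
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    return scores.mean(), scores.min(), scores.max()

cv = KFold(n_splits=5, shuffle=True, random_state=1)
summary = {name: evaluate(model, cv) for name, model in models.items()}
for name, (avg, low, high) in summary.items():
    print(f"{name}: mean={avg:.3f} min={low:.3f} max={high:.3f}")
```

Adding another classifier then only requires one more entry in the dictionary.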

##  6.3 K-fold cross validation:

Find the best k from k=2 to k=30 (the code uses `range(2,31)`), where k refers to the number of groups that a given data sample is split into.



### K-Fold for  logistic Regression Classifier

In [39]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs ,min1,max1= list(),list(),list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model1(cv)
    # store mean accuracy
    means.append(k_mean)
    min1.append(k_min)
    max1.append(k_max)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
    df1= pd.DataFrame()
df1['folds']=folds
df1['accuracy_mean']=means
df1['accuracy_min']=min1
df1['accuracy_max']=max1
df1['accuracy_mins']=mins
df1['accuracy_maxs']=maxs
# line plot of k mean values with min/max error bars
#plt.figure()
#pyplot.errorbar(df1['folds'], df1['accuracy_mean'], yerr=[df1['accuracy_mins'],df1['accuracy_maxs']], fmt='o')
# plot the ideal case in a separate color
#pyplot.plot(df1['folds'], [ideal1 for _ in range(len(folds))], color='r')
#plt.xlabel("K-Folds",fontsize=12)
#plt.title("K-Fold for logistic Regression Classifier")
#plt.ylabel("accuracy",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('Kfold1')
CF1=df1.loc[df1['accuracy_max'] == max(df1['accuracy_max'])]
CF1['Ideal']=ideal1 
CF1['Model']='Logistic Regression '
CF=CF1

### K-Fold for  Random Forest Classifier

In [40]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs ,min1,max1= list(),list(),list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model2(cv)
    # store mean accuracy
    means.append(k_mean)
    min1.append(k_min)
    max1.append(k_max)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
    df1= pd.DataFrame()
df1['folds']=folds
df1['accuracy_mean']=means
df1['accuracy_min']=min1
df1['accuracy_max']=max1
df1['accuracy_mins']=mins
df1['accuracy_maxs']=maxs

# line plot of k mean values with min/max error bars
#plt.figure()
#pyplot.errorbar(df1['folds'], df1['accuracy_mean'], yerr=[df1['accuracy_mins'],df1['accuracy_maxs']], fmt='o')
# plot the ideal case in a separate color
#pyplot.plot(df1['folds'], [ideal2 for _ in range(len(folds))], color='r')
#plt.xlabel("K-Folds",fontsize=12)
#plt.title("K-Fold for Random Forest Classifier")
#plt.ylabel("accuracy",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('Kfold2')
CF1=df1.loc[df1['accuracy_max'] == max(df1['accuracy_max'])]
CF1['Ideal']=ideal2 
CF1['Model']='Random Forest '
CF=pd.concat([CF, CF1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

###  K-Fold for  Linear Support Vector Classifier

In [None]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs ,min1,max1= list(),list(),list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model3(cv)
    # store mean accuracy
    means.append(k_mean)
    min1.append(k_min)
    max1.append(k_max)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
    df1= pd.DataFrame()
df1['folds']=folds
df1['accuracy_mean']=means
df1['accuracy_min']=min1
df1['accuracy_max']=max1
df1['accuracy_mins']=mins
df1['accuracy_maxs']=maxs
# line plot of k mean values with min/max error bars
#plt.figure()
#pyplot.errorbar(df1['folds'], df1['accuracy_mean'], yerr=[df1['accuracy_mins'],df1['accuracy_maxs']], fmt='o')
# plot the ideal case in a separate color
#pyplot.plot(df1['folds'], [ideal3 for _ in range(len(folds))], color='r')
#plt.xlabel("K-Folds",fontsize=12)
#plt.title("K-Fold for Linear Support Vector Classifier")
#plt.ylabel("accuracy",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('Kfold3')
CF1=df1.loc[df1['accuracy_max'] == max(df1['accuracy_max'])]
CF1['Ideal']=ideal3
CF1['Model']='Linear Support Vector'
CF=pd.concat([CF, CF1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

### K-Fold for Neural Network Classifier

In [None]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs ,min1,max1= list(),list(),list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model4(cv)
    # store mean accuracy
    means.append(k_mean)
    min1.append(k_min)
    max1.append(k_max)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
    df1= pd.DataFrame()
df1['folds']=folds
df1['accuracy_mean']=means
df1['accuracy_min']=min1
df1['accuracy_max']=max1
df1['accuracy_mins']=mins
df1['accuracy_maxs']=maxs

# line plot of k mean values with min/max error bars
#plt.figure()
#pyplot.errorbar(df1['folds'], df1['accuracy_mean'], yerr=[df1['accuracy_mins'],df1['accuracy_maxs']], fmt='o')
# plot the ideal case in a separate color
#pyplot.plot(df1['folds'], [ideal4 for _ in range(len(folds))], color='r')
#plt.xlabel("K-Folds",fontsize=12)
#plt.title("K-Fold for Neural Network Classifier")
#plt.ylabel("accuracy",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('Kfold4')
CF1=df1.loc[df1['accuracy_max'] == max(df1['accuracy_max'])]
CF1['Ideal']=ideal4
CF1['Model']='Neural Network'
CF=pd.concat([CF, CF1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

### K-Fold for  Stochastic Gradient Descent  Classifier

In [None]:
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs ,min1,max1= list(),list(),list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model5(cv)
    # store mean accuracy
    means.append(k_mean)
    min1.append(k_min)
    max1.append(k_max)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
    df1= pd.DataFrame()
df1['folds']=folds
df1['accuracy_mean']=means
df1['accuracy_min']=min1
df1['accuracy_max']=max1
df1['accuracy_mins']=mins
df1['accuracy_maxs']=maxs
#plt.figure()
# line plot of k mean values with min/max error bars
#pyplot.errorbar(df1['folds'], df1['accuracy_mean'], yerr=[df1['accuracy_mins'],df1['accuracy_maxs']], fmt='o')
# plot the ideal case in a separate color
#pyplot.plot(df1['folds'], [ideal5 for _ in range(len(folds))], color='r')
#plt.xlabel("K-Folds",fontsize=12)
#plt.title("K-Fold for Stochastic Gradient Descent Classifier")
#plt.ylabel("accuracy",fontsize=12)
#plt.autoscale(True, 'both', True)
#plt.savefig('Kfold5')
CF1=df1.loc[df1['accuracy_max'] == max(df1['accuracy_max'])]
CF1['Ideal']=ideal5
CF1['Model']='Stochastic Gradient Descent'
CF=pd.concat([CF, CF1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

### Cross Validation Results 

* Best K value for K-fold for each model
* Best accuracy of each model at that K value
* Result of the ideal test condition for each model

In [None]:
CF.head()

### The Best Select model 

In [None]:
print(CF['Model'][CF.index[CF['Ideal'] == max(CF['Ideal'])]])
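Picking the row with the highest score can also be done with `idxmax`, which returns the index label of the maximum and makes the selection easier to read. The frame below contains made-up placeholder scores and generic model names purely for illustration; it does not reproduce any result from this project.

```python
import pandas as pd

# Hypothetical scores for illustration only; not results from this notebook.
cf_demo = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "folds": [5, 12, 7],
    "accuracy_mean": [0.61, 0.68, 0.64],
})

# idxmax gives the label of the row with the highest mean accuracy.
best_row = cf_demo.loc[cf_demo["accuracy_mean"].idxmax()]
best_model = best_row["Model"]
print(best_model, best_row["folds"])
```

The same pattern works on any score column, for example the `Ideal` column used above.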

### We selected the best model based on the cross-validation results, but we applied all models to see their results, learn how to build each classifier, and check our calculations

# 7 - Prediction models

Machine learning is computing technology that uses artificial intelligence tools to develop systems that learn from data,
rather than simply performing programmed instructions.
Machine learning is now widely used by researchers and industry analysts to build predictive models from a wide variety of data. 
As models are fed new data, they are able to independently adapt. They can learn from historical patterns and computations
to produce reliable predictions and results.
In this part, the techniques we used to build our taxi demand prediction model include:

* Logistic Regression
* Neural Network model
* Support Vector Machines


## 7.1- Logistic Regression

In [None]:
def Logistic_Regression(x_train, x_test, y_train, y_test):
    start_t = timer()
    model = LogisticRegression()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    end_t = timer()
    time_t = (end_t - start_t)
    print ("Total time for Logistic Regression", time_t)
    MAE=metrics.mean_absolute_error(y_test, predictions)
    MSE=metrics.mean_squared_error(y_test, predictions)
    RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
    #print(model.coef_)
    #print(model.intercept_)
    #plt.scatter(y_test, predictions)
    #plt.hist(y_test - predictions)
    #plt.show()
    return MAE, MSE, RMSE, predictions
LRegM=Logistic_Regression(x_train, x_test, y_train, y_test)
LogR=LRegM[0:3]
print (LogR)
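The fit, predict, time, and score steps above are the same for every classifier in this section, so they can be captured once in a generic helper. The sketch below shows one way to do that on synthetic data; the helper name `fit_and_score`, the returned dictionary layout, and the `max_iter` setting are assumptions for illustration.

```python
from timeit import default_timer as timer

import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_score(model, x_train, x_test, y_train, y_test):
    """Fit any scikit-learn estimator, time it, and report MAE, MSE, RMSE."""
    start = timer()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    elapsed = timer() - start
    mse = metrics.mean_squared_error(y_test, predictions)
    report = {
        "seconds": elapsed,
        "MAE": metrics.mean_absolute_error(y_test, predictions),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }
    return report, predictions

# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

report, preds = fit_and_score(LogisticRegression(max_iter=1000),
                              x_tr, x_te, y_tr, y_te)
print(report)
```

Passing a different estimator, such as `RandomForestClassifier()`, reuses the same timing and metric code unchanged.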


## 7.2- Random Forest Classifier

In [None]:
def Random_Forest_Classifier(x_train, x_test, y_train, y_test):
    """Fit a random forest classifier and return (MAE, MSE, RMSE, predictions)."""
    # Time the fit and the prediction together
    start_t = timer()
    model = RandomForestClassifier()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    end_t = timer()
    time_t = end_t - start_t
    print("Total time for Random Forest Classifier", time_t)
    # Error measures between the true and predicted values
    MAE = metrics.mean_absolute_error(y_test, predictions)
    MSE = metrics.mean_squared_error(y_test, predictions)
    RMSE = np.sqrt(MSE)
    # Optional diagnostics (a random forest has no coef_ or intercept_)
    #print(model.feature_importances_)
    #plt.scatter(y_test, predictions)
    #plt.hist(y_test - predictions)
    #plt.show()
    return MAE, MSE, RMSE, predictions
RFC = Random_Forest_Classifier(x_train, x_test, y_train, y_test)
RF = RFC[0:3]  # keep only the three error values
print(RF)


## 7.3- Linear Support Vector Machine

In [None]:
def Linear_SVC(x_train, x_test, y_train, y_test):
    """Fit a linear support vector classifier and return (MAE, MSE, RMSE, predictions)."""
    # Time the fit and the prediction together
    start_t = timer()
    model = LinearSVC()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    end_t = timer()
    time_t = end_t - start_t
    print("Total time for Linear Support Vector Classifier", time_t)
    # Error measures between the true and predicted values
    MAE = metrics.mean_absolute_error(y_test, predictions)
    MSE = metrics.mean_squared_error(y_test, predictions)
    RMSE = np.sqrt(MSE)
    # Optional diagnostics
    #print(model.coef_)
    #print(model.intercept_)
    #plt.scatter(y_test, predictions)
    #plt.hist(y_test - predictions)
    #plt.show()
    return MAE, MSE, RMSE, predictions
LSVC = Linear_SVC(x_train, x_test, y_train, y_test)
LSV = LSVC[0:3]  # keep only the three error values
print(LSV)


## 7.4- Neural Network

In [None]:
def neural_network(x_train, x_test, y_train, y_test):
    """Fit a multi-layer perceptron classifier and return (MAE, MSE, RMSE, predictions)."""
    # Time the fit and the prediction together
    start_t = timer()
    model = MLPClassifier()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    end_t = timer()
    time_t = end_t - start_t
    print("Total time for Neural Network Classifier", time_t)
    # Error measures between the true and predicted values
    MAE = metrics.mean_absolute_error(y_test, predictions)
    MSE = metrics.mean_squared_error(y_test, predictions)
    RMSE = np.sqrt(MSE)
    # Optional diagnostics (MLPClassifier exposes coefs_ and intercepts_)
    #print(model.coefs_)
    #print(model.intercepts_)
    #plt.scatter(y_test, predictions)
    #plt.hist(y_test - predictions)
    #plt.show()
    return MAE, MSE, RMSE, predictions
NNM = neural_network(x_train, x_test, y_train, y_test)
NN = NNM[0:3]  # keep only the three error values
print(NN)


## 7.5- Stochastic Gradient Descent

In [None]:
def SGD_Classifier(x_train, x_test, y_train, y_test):
    """Fit a linear classifier trained with SGD and return (MAE, MSE, RMSE, predictions)."""
    # Time the fit and the prediction together
    start_t = timer()
    model = SGDClassifier()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    end_t = timer()
    time_t = end_t - start_t
    print("Total time for Stochastic Gradient Descent Classifier", time_t)
    # Error measures between the true and predicted values
    MAE = metrics.mean_absolute_error(y_test, predictions)
    MSE = metrics.mean_squared_error(y_test, predictions)
    RMSE = np.sqrt(MSE)
    # Optional diagnostics
    #print(model.coef_)
    #print(model.intercept_)
    #plt.scatter(y_test, predictions)
    #plt.hist(y_test - predictions)
    #plt.show()
    return MAE, MSE, RMSE, predictions
SGDM = SGD_Classifier(x_train, x_test, y_train, y_test)
SGD = SGDM[0:3]  # keep only the three error values
print(SGD)
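
The five functions in this part differ only in the model they create. As a possible simplification, and not the code used in this project, the sketch below shows a single helper that accepts any model with `fit` and `predict` methods. The names `evaluate_model`, `error_measures`, and `MajorityClassifier` are hypothetical, and the error measures are written in plain Python so the example runs without scikit-learn. In the notebook the `metrics` functions would be used instead.

```python
import math
from timeit import default_timer as timer


def error_measures(y_true, y_pred):
    """Return (MAE, MSE, RMSE) computed without external libraries."""
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse, math.sqrt(mse)


def evaluate_model(name, model, x_train, x_test, y_train, y_test):
    """Fit `model`, time fit plus predict, and return (MAE, MSE, RMSE, predictions)."""
    start_t = timer()
    model.fit(x_train, y_train)
    predictions = list(model.predict(x_test))
    elapsed = timer() - start_t
    print("Total time for", name, elapsed)
    return (*error_measures(y_test, predictions), predictions)


class MajorityClassifier:
    """Hypothetical stand-in for an estimator such as LogisticRegression()."""

    def fit(self, x, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


result = evaluate_model("majority baseline", MajorityClassifier(),
                        [[0], [1], [2]], [[3], [4]], [1, 1, 2], [1, 3])
print(result[0:3])
```

With such a helper, each model would need only one call, for example `evaluate_model("Logistic Regression", LogisticRegression(), x_train, x_test, y_train, y_test)`.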


# 8 - Comparison between models:
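
For reference, the three error measures used in this comparison can be written as follows. The symbols are chosen here for illustration: $y_i$ is the true value, $\hat{y}_i$ the predicted value, and $n$ the number of test samples.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

As a small worked example, for true values $(1, 2, 3)$ and predictions $(1, 3, 5)$ the errors are $(0, -1, -2)$, so $\mathrm{MAE} = (0 + 1 + 2)/3 = 1$, $\mathrm{MSE} = (0 + 1 + 4)/3 \approx 1.667$, and $\mathrm{RMSE} \approx 1.291$. Lower values are better for all three measures.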

In [None]:
# Row labels for the three error measures; each model contributes one column
c1=['Mean Absolute Error (MAE)','Mean Squared Error (MSE)','Root Mean Squared Error (RMSE)']
Error = pd.DataFrame({'Error':c1,'Logistic Regression':LogR,'Random Forest Classifier':RF,
                      'Linear Support vector':LSV,'Neural Network':NN,'stochastic gradient descent':SGD})
Error.head()

In [None]:
Error.plot(x='Error', y=['Logistic Regression','Random Forest Classifier','Linear Support vector',
                         'Neural Network','stochastic gradient descent'], kind="barh")
plt.xlabel('Error',fontsize=12)
plt.title("Comparison between the five models",fontsize=14)
plt.ylabel("Error measurement type",fontsize=12)
plt.autoscale(True, 'both', True)
plt.savefig('result1')

In [None]:
end_time = time.monotonic() 
print(timedelta(seconds=end_time-start_time))

# From the results, the best model is the Neural Network model

### After reading and training on data of different sizes, ranging from one thousand to five million samples, we found that the results did not improve once the data size exceeded one and a half million samples. The cross-validation results also agreed with the results of the trained models.
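
A sample-size experiment of this kind can be sketched as a loop over increasing training-set sizes. The example below is illustrative only: it uses synthetic one-dimensional data and a hypothetical `ThresholdClassifier` instead of the taxi data and the models above, and the names `make_data` and `accuracy_by_size` are assumptions.

```python
import random


class ThresholdClassifier:
    """Hypothetical 1-D model: splits two classes at the midpoint of their means."""

    def fit(self, x, y):
        zeros = [v[0] for v, label in zip(x, y) if label == 0]
        ones = [v[0] for v, label in zip(x, y) if label == 1]
        self.threshold_ = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, x):
        return [int(v[0] > self.threshold_) for v in x]


def make_data(n, seed):
    """Synthetic data: labels alternate 0/1, feature is Gaussian around 2 * label."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    x = [[rng.gauss(2.0 * label, 1.0)] for label in y]
    return x, y


def accuracy_by_size(make_model, x_train, y_train, x_test, y_test, sizes):
    """Train on the first `size` samples for each size; return (size, accuracy) pairs."""
    points = []
    for size in sizes:
        model = make_model().fit(x_train[:size], y_train[:size])
        predicted = model.predict(x_test)
        correct = sum(p == t for p, t in zip(predicted, y_test))
        points.append((size, correct / len(y_test)))
    return points


x_tr, y_tr = make_data(2000, seed=1)
x_te, y_te = make_data(500, seed=2)
for size, acc in accuracy_by_size(ThresholdClassifier, x_tr, y_tr, x_te, y_te,
                                  sizes=(10, 100, 1000, 2000)):
    print(size, round(acc, 3))
```

Listing or plotting the accuracy against the training size shows where adding more samples stops helping.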
