 # Datathon

 ## Question
 Is it possible to predict the number of for-hire vehicle (FHV) customers in certain areas of the city from previous pickup data (day, time), weather, neighbourhood demographics, and the likely destination?

 ## Note
 The data we received is first analyzed and cleaned. For that, we created a semi-automatic process: you only need to put your CSV files in a folder called DataD.
 Then run this .py file, or run all cells of the .ipynb notebook.

 ## Libraries
 For current and future work, we import well-known libraries such as pandas, numpy, matplotlib, seaborn, folium, geopy, shapely, concurrent.futures and tqdm, among others.

In [14]:
import cython

print("Starting to load modules ...")
import numpy                 as np
import pandas                as pd
import matplotlib.pyplot     as plt
import seaborn               as sns
import sklearn.metrics       as Metrics
import os
from pathlib import Path

import folium  #needed for interactive map
from folium.plugins import HeatMap

# To check whether a point lies inside a polygon;
# used to identify which NTA a point belongs to
from shapely.geometry import Point, Polygon

# To calculate distances
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

from   collections           import Counter
from   sklearn               import preprocessing
from   datetime              import datetime
from   math                  import exp
from   sklearn.linear_model  import LinearRegression as LinReg
from   sklearn.metrics       import mean_absolute_error
from   sklearn.metrics       import median_absolute_error
from   sklearn.metrics       import r2_score


import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import freeze_support, cpu_count
from tqdm.auto import tqdm
from p_tqdm import p_map
from tqdm import tqdm_notebook, tnrange,tqdm


#%matplotlib inline

sns.set()


Starting to load modules ...


 ## Collecting and Reviewing Data
 As noted above, the data comes from files delivered by the TAs as well as from NYC Open Data. To visualize the data we use Visual Studio Code as an IDE, with its DataPreview extension, to explore the variables and information provided. We also wrote some scripts to detect possible problems in the data.
 The .py and .ipynb files are in a folder called DS4A. Inside it must be the folder DataD, where the CSV files are located. The script identifies the files located there and loads them automatically and concurrently.
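
 As a minimal sketch of the file discovery step, the same recursive listing can be written with `pathlib`. The helper name `list_csv_files` is hypothetical, and filtering on the `.csv` extension is an assumption that the notebook's own function does not make.

```python
from pathlib import Path

def list_csv_files(root):
    """Return every .csv file under `root`, including subfolders, sorted by path."""
    # rglob walks the directory tree recursively; is_file() skips folders
    return sorted(p for p in Path(root).rglob("*.csv") if p.is_file())
```

 Calling `list_csv_files("DataD")` would then return `Path` objects whose `.stem` can serve as dictionary keys, in the same way the loading cell uses `Path(myFile).stem`.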

In [15]:

''' 
# Change working directory from the workspace root to the ipynb file location. Turn this addition off with the DataScience.changeDirOnImportExport setting
# ms-python.python added
'''
try:
    os.chdir(os.path.join(os.getcwd(), 'DS4A'))
    #print(os.getcwd())
except OSError:
    # Already inside DS4A (or the folder does not exist): keep the current directory
    pass

'''
    For the given path, get the list of all files in the directory and its subdirectories
    (the subdirectory case has not been tested)
'''

def getListOfFiles(dirName):
    # create a list of file and sub directories 
    # names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = []
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles= allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles 

listOfFiles = getListOfFiles('DataD')

listOfDataFrames={}

print("Starting to load files ...")

with ThreadPoolExecutor(max_workers=2) as executor:
    futures ={executor.submit(pd.read_csv,file,low_memory=False):file for file in listOfFiles}
    with tqdm(total=len(listOfFiles)) as pbar:
        for future in concurrent.futures.as_completed(futures):
            myFile=futures[future]
            pbar.update(1)
            print(myFile + " loaded")
            listOfDataFrames[Path(myFile).stem] = [myFile,future.result()]
            

Starting to load files ...


  0%|                                                                                            | 0/9 [00:00<?, ?it/s]

DataD\demographics.csv loaded


 22%|██████████████████▋                                                                 | 2/9 [00:00<00:01,  5.24it/s]

DataD\geographic.csv loaded


 33%|████████████████████████████                                                        | 3/9 [00:28<00:51,  8.61s/it]

DataD\green_trips.csv loaded


 44%|█████████████████████████████████████▎                                              | 4/9 [00:30<00:32,  6.56s/it]

DataD\mta_trips.csv loaded


 56%|██████████████████████████████████████████████▋                                     | 5/9 [00:39<00:29,  7.42s/it]

DataD\uber_trips_2014.csv loaded


 67%|████████████████████████████████████████████████████████                            | 6/9 [00:39<00:15,  5.24s/it]

DataD\weather.csv loaded


 78%|█████████████████████████████████████████████████████████████████▎                  | 7/9 [01:03<00:21, 10.86s/it]

DataD\uber_trips_2015.csv loaded
DataD\zones.csv loaded


100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:19<00:00,  9.89s/it]

DataD\yellow_trips.csv loaded


100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:19<00:00,  8.80s/it]


 **To keep the dataFrames and general information about them, we create a class that represents each dataset**

In [16]:

# A class that represents each file loaded.

''' -----------Class Definition ---------------- '''
class dataset:
    def __init__(self, name, path, dataFrame=None):
        self.name=name
        # Set the path first: loadData() reads it
        self.path=path
        self.__isLoaded=False
        if dataFrame is None:
            self.loadData()
        else:
            self.dataFrame=dataFrame
            self.__isLoaded=True

        self.columnsDescription={}
        self.__isUpdated=False
    
    def getRowsCount(self):
        if self.__isLoaded:
            return len(self.dataFrame)
        else:
            print("There is no dataFrame loaded")
    
    def loadData(self):
        # (Re)load the CSV file found at self.path
        try:
            self.dataFrame=pd.read_csv(self.path)
            self.__isLoaded=True
        except Exception:
            self.__isLoaded=False
            print("Attribute path may be empty or file doesn't exist")
    
    def updateInfo(self):
        #Implementation is missing
        self.__isUpdated=True

    def isUpdate(self):
        return self.__isUpdated
    
    def getColumnsNames(self):
        if self.__isLoaded:
            return list(self.dataFrame.columns.values)
        else:
            print("There is no dataFrame loaded")
    
    def getColumnsDataTypes(self):
        if self.__isLoaded:
            return self.dataFrame.dtypes
        else:
            print("There is no dataFrame loaded")   

    def getColumnsInfo(self):
        if self.__isLoaded:
            return self.dataFrame.info()
        else:
            print("There is no dataFrame loaded")  

    def getDataFrameStatistics(self):
        if self.__isLoaded:
            return self.dataFrame.describe()
        else:
            print("There is no dataFrame loaded")

    #Try to infer which columns require to be converted to datetime

    def inferredDatetimeColumnsForConversion(self):
        if self.__isLoaded:
            inferredColumns=[x for x in self.getColumnsNames() if 'date' in x.lower()]
            for i in inferredColumns:
                self.dataFrame[i]=pd.to_datetime(self.dataFrame[i])
        else:
            print("There is no dataFrame loaded")

    def isNull(self):
        if self.__isLoaded:
            return self.dataFrame.isnull().sum()
        else:
            print("There is no dataFrame loaded")

''' -----------End class Definition ------------'''


' -----------End class Definition ------------'

In [17]:

#Creating a dictionary of datasets. You can access a specific dataset (dataframe or file content) through its key
print("Creating a dictionary of datasets. You can access a specific dataset (dataframe or file content) through its key")
listOfDataSets={}
for key,value in listOfDataFrames.items():
    listOfDataSets[key]=dataset(key,value[0],value[1])


Creating a dictionary of datasets. You can access a specific dataset (dataframe or file content) through its key


 Once the loading process is done, the next step is to review the data and extract useful information from it.

 For that, we first analyze the data types of the datasets to see whether any conversion is necessary and, above all, whether the data is homogeneous or a custom transformation is needed.

In [18]:

#Code to see datasets columns types
for key,value in listOfDataSets.items():
    print()
    print("Dataset " + key +" with " + str(value.getRowsCount()) + " rows")
    print()
    print("-----Data Types-------")
    print(value.getColumnsDataTypes())
    print()



Dataset demographics with 188 rows

-----Data Types-------
nta_name             object
borough              object
nta_code             object
population            int64
under_5_years         int64
5-9_years             int64
10-14_years           int64
15-19_years           int64
20-24_years           int64
25-29_years           int64
30-34_years           int64
35-39_years           int64
40-44_years           int64
45-49_years           int64
50-54_years           int64
55-59_years           int64
60-64_years           int64
over_65_years         int64
median_age            int64
people_per_acre     float64
households            int64
less_than_10,000      int64
10000_to_14999        int64
15000_to_24999        int64
25000_to_34999        int64
35000_to_49999        int64
50000_to_74999        int64
75000_to_99999        int64
100000_to_149999      int64
150000_to_199999      int64
200000_or_more        int64
median_income         int64
mean_income           int64
dtype: object




 **Now it is time to check whether there are any null values we must deal with.**

In [19]:
print("Starting review process of null values")
for key,value in listOfDataSets.items():
    print()
    print("Dataset " + key +" with " + str(value.getRowsCount()) + " rows")
    print()
    print(value.isNull())
    print()



Starting review process of null values

Dataset demographics with 188 rows

nta_name            0
borough             0
nta_code            0
population          0
under_5_years       0
5-9_years           0
10-14_years         0
15-19_years         0
20-24_years         0
25-29_years         0
30-34_years         0
35-39_years         0
40-44_years         0
45-49_years         0
50-54_years         0
55-59_years         0
60-64_years         0
over_65_years       0
median_age          0
people_per_acre     0
households          0
less_than_10,000    0
10000_to_14999      0
15000_to_24999      0
25000_to_34999      0
35000_to_49999      0
50000_to_74999      0
75000_to_99999      0
100000_to_149999    0
150000_to_199999    0
200000_or_more      0
median_income       0
mean_income         0
dtype: int64


Dataset geographic with 9302 rows

BK88    9042
QN52    9116
QN48    8924
QN51    9060
QN27    9078
        ... 
MN32    9242
MN33    8928
MN99    9278
QN18    9052
QN29    8836
Lengt

 ## Data Cleaning
From the above, we see that the geographic dataframe has many null values. That is because some NTAs have more boundary points than others, so the shorter columns are padded with nulls.

We decided not to use that table directly but to create a dictionary instead, with NTA codes as keys and lists of (longitude, latitude) tuples as values.
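
As a hedged sketch of that reshaping, one column of alternating longitude and latitude values can be paired up after dropping the padding. The helper name `column_to_points` and the tiny two-column frame are hypothetical, and the sketch assumes nulls only appear at the end of each column.

```python
import pandas as pd

def column_to_points(column):
    """Turn a column of alternating longitude, latitude values into (lon, lat) tuples."""
    values = column.dropna().to_numpy()
    usable = len(values) - len(values) % 2  # ignore a trailing unpaired value
    return list(zip(values[0:usable:2], values[1:usable:2]))

# Hypothetical two-NTA frame; the shorter polygon is padded with NaN
demo = pd.DataFrame({
    "AA01": [-73.97, 40.63, -73.98, 40.62],
    "BB02": [-73.90, 40.70, float("nan"), float("nan")],
})
ntas_demo = {code: column_to_points(demo[code]) for code in demo.columns}
```

Slicing the array avoids one `.loc` lookup per value, which may matter with a few hundred columns of several thousand rows each.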

 Below is a preview of 2 elements of the dictionary.

In [20]:

print("Creating dictionary of ntas and their surrounding coordinates")
ntas={}
geographic=listOfDataSets['geographic']
points=[]
for i in geographic.getColumnsNames():
    for j  in range(0,geographic.dataFrame[i].count()):
        if j%2 == 0:
            points.append((geographic.dataFrame.loc[j,i],geographic.dataFrame.loc[j+1,i]))
        else:
            continue
    ntas[i]=points
    points=[]
    
#ntas
from itertools import islice
n = 2
list(islice(ntas.items(),n))


Creating dictionary of ntas and their surrounding coordinates


[('BK88',
  [(-73.9760507905698, 40.6312841471042),
   (-73.9771665542679, 40.6307548954438),
   (-73.97699992349709, 40.6298797372738),
   (-73.9768510772105, 40.629096822478296),
   (-73.9766974777461, 40.62836280506979),
   (-73.9765791908336, 40.62758142753071),
   (-73.97651605490691, 40.6273027311531),
   (-73.9764511382828, 40.6270164960783),
   (-73.9762359705636, 40.6259845911404),
   (-73.9772629343846, 40.6258610131635),
   (-73.97719809044129, 40.6251102187637),
   (-73.97711102688271, 40.6249564998691),
   (-73.9769452575888, 40.6240609900222),
   (-73.9768360858429, 40.623489281844705),
   (-73.97675353017891, 40.623021552135704),
   (-73.9765672728723, 40.6220031428102),
   (-73.9769521418502, 40.6216382467059),
   (-73.97705496237609, 40.621540761316204),
   (-73.9753952383902, 40.6207782266542),
   (-73.9768257690581, 40.6186840391146),
   (-73.9775244525444, 40.617684779502),
   (-73.9778515271106, 40.6172958454415),
   (-73.9756309743991, 40.6159536291278),
   (-73.9

 **Important**
 For our objective, we need to know the NTAs of the green trips, Uber 2014 trips and yellow trips, for which we have the latitude and longitude of the pickup (and, for the taxi datasets, of the dropoff).

 For the Uber 2015 trips we do not have that kind of information, so we removed that dataset and wrote a CPU-intensive algorithm to find the NTA of each record in the green, Uber 2014 and yellow trips datasets.
 Also, mta_trips is deleted because we are not planning to use that data in our first version.

 Below is the code that finds the NTA from latitude and longitude information:
 ```
 :::python
 def settingNta(latitude, longitude):
     for key, value in ntas.items():
         p1=Point(longitude,latitude)
         poly=Polygon(value)
         if poly.contains(p1):
             return key
     return "N/A"
 green=listOfDataSets['green_trips'].dataFrame
 # Collect one nta per row, then assign the whole column once
 green["nta"] = [settingNta(lat, lon)
                 for lat, lon in tqdm(zip(green["pickup_latitude"], green["pickup_longitude"]),
                                      total=len(green), desc="Iterating")]
 ```
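
 Since the lookup builds a new `Polygon` for every NTA on every row, one possible speed-up is to build the polygons once and wrap them in shapely prepared geometries. This is only a sketch: `build_nta_index`, `find_nta` and the square demo polygons are hypothetical names and data, not part of the notebook.

```python
from shapely.geometry import Point, Polygon
from shapely.prepared import prep

def build_nta_index(ntas):
    """Build each polygon once and wrap it in a prepared geometry for repeated tests."""
    return {code: prep(Polygon(coords))
            for code, coords in ntas.items() if len(coords) >= 3}

def find_nta(index, latitude, longitude):
    """Return the first NTA code whose polygon contains the point, or "N/A"."""
    point = Point(longitude, latitude)  # shapely expects (x, y) = (lon, lat)
    for code, polygon in index.items():
        if polygon.contains(point):
            return code
    return "N/A"

# Hypothetical square "NTAs", only to show the call pattern
demo_ntas = {
    "AA01": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "BB02": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
index = build_nta_index(demo_ntas)
```

 With the real `ntas` dictionary, the index would be built once before the loop and `find_nta(index, lat, lon)` called per row. How much time this saves on the full datasets has not been measured here.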

In [21]:

print("deleting dataset uber_trips_2015")
if "uber_trips_2015" in listOfDataSets:
    del listOfDataSets["uber_trips_2015"]

print("deleting dataset mta_trips")
if "mta_trips" in listOfDataSets:
    del listOfDataSets["mta_trips"]



deleting dataset uber_trips_2015
deleting dataset mta_trips


In [22]:

# This code is commented out because it is highly CPU intensive. The process is run only once; the resulting files are located in a folder named Definitive

# def settingNta(latitude, longitude):
#     for key, value in ntas.items():
#         p1=Point(longitude,latitude)
#         poly=Polygon(value)
#         if poly.contains(p1):
#             return key
#     return "N/A"

# green=listOfDataSets['green_trips'].dataFrame
# # Collect one nta per row, then assign the whole column once
# green["nta"] = [settingNta(lat, lon)
#                 for lat, lon in tqdm(zip(green["pickup_latitude"], green["pickup_longitude"]),
#                                      total=len(green), desc="Iterating")]

# green.info()


 **Important**
 We commented out the code that searches for NTAs because this highly intensive process only needs to run once.
 We saved the result of the process as new CSV files containing the NTA info, so the updated versions of green_trips, uber_trips_2014 and yellow_trips are located in a folder called Definitive.
 Now we must update our datasets accordingly. A preview of them is shown below.

In [23]:
print("Updating datasets with nta info ...")
listOfDataSets["uber_trips_2014"].path=r"Definitive\uber_trips_2014.csv"
listOfDataSets["yellow_trips"].path=r"Definitive\yellow_trips.csv"
listOfDataSets["green_trips"].path=r"Definitive\green_trips.csv"
listOfDataSets["uber_trips_2014"].loadData()
listOfDataSets["yellow_trips"].loadData()
listOfDataSets["green_trips"].loadData()

listOfDataSets["uber_trips_2014"].dataFrame.head()
listOfDataSets["yellow_trips"].dataFrame.head()
listOfDataSets["green_trips"].dataFrame.head()


Updating datasets with nta info ...


| Unnamed: 0.1 | Unnamed: 0 | pickup_datetime | dropoff_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | trip_distance | total_amount | pickup_nta_code | dropoff_nta_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-02-01 01:26:45 | 2015-02-01 01:49:58 | -73.953545 | 40.811172 | -73.984764 | 40.728386 | 1 | 8.11 | 27.8 | MN09 | MN22 |
| 1 | 1 | 2015-01-02 20:06:28 | 2015-01-02 20:14:04 | -73.946709 | 40.714321 | -73.961571 | 40.711475 | 1 | 1.29 | 9.8 | BK90 | BK73 |
| 2 | 2 | 2014-09-27 17:55:38 | 2014-09-27 18:19:56 | -73.957626 | 40.718094 | -73.947304 | 40.777813 | 5 | 6.12 | 26.3 | BK73 | MN32 |
| 3 | 3 | 2014-04-27 02:27:04 | 2014-04-27 02:39:02 | -73.949501 | 40.713997 | -73.987785 | 40.718582 | 2 | 3.68 | 17.3 | BK73 | MN27 |
| 4 | 4 | 2014-05-26 18:32:19 | 2014-05-26 18:44:13 | -73.944092 | 40.672195 | -73.977325 | 40.664013 | 1 | 2.4 | 11.5 | BK61 | BK99 |


 We also found that most data types are inferred correctly by pandas, but for future work some datetime fields need to be converted.

  **So we proceed with that.**
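
 As a small illustration of the conversion idea, the sketch below converts every column whose name contains "date" and turns unparseable values into `NaT` instead of raising. The function name `convert_date_columns` and the sample frame are hypothetical, and `errors="coerce"` is an assumption about how bad values should be handled; the class method in this notebook uses the default behaviour.

```python
import pandas as pd

def convert_date_columns(frame):
    """Convert columns whose name contains 'date' to datetime; bad values become NaT."""
    converted = frame.copy()
    for name in converted.columns:
        if "date" in name.lower():
            converted[name] = pd.to_datetime(converted[name], errors="coerce")
    return converted

# Hypothetical sample with one malformed timestamp
demo = pd.DataFrame({
    "pickup_datetime": ["2014-04-27 02:27:04", "not a date"],
    "total_amount": [17.3, 11.5],
})
result = convert_date_columns(demo)
```

 Counting the resulting `NaT` values per column (for example with `result.isna().sum()`) is one way to see how many timestamps failed to parse.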

In [24]:
print("Starting to infer date types in each dataFrame ...")

with ThreadPoolExecutor(max_workers=2) as executor:
    futures ={executor.submit(value.inferredDatetimeColumnsForConversion):[key,value] for key,value in listOfDataSets.items()}
    with tqdm(total=len(listOfDataSets)) as pbar:
        for future in concurrent.futures.as_completed(futures):
            key=futures[future][0]
            value=futures[future][1]
            pbar.update(1)
            print()
            print("Dataset " + key)
            print()
            print("-----Data Types-------")
            print(value.getColumnsDataTypes())
            print()
           
print("---- Inferring date type process ended ----")
print()

Starting to infer date types in each dataFrame ...


  0%|                                                                                            | 0/7 [00:00<?, ?it/s]


Dataset geographic

-----Data Types-------
BK88    float64
QN52    float64
QN48    float64
QN51    float64
QN27    float64
         ...   
MN32    float64
MN33    float64
MN99    float64
QN18    float64
QN29    float64
Length: 195, dtype: object



 29%|████████████████████████                                                            | 2/7 [00:06<00:15,  3.15s/it]


Dataset demographics

-----Data Types-------
nta_name             object
borough              object
nta_code             object
population            int64
under_5_years         int64
5-9_years             int64
10-14_years           int64
15-19_years           int64
20-24_years           int64
25-29_years           int64
30-34_years           int64
35-39_years           int64
40-44_years           int64
45-49_years           int64
50-54_years           int64
55-59_years           int64
60-64_years           int64
over_65_years         int64
median_age            int64
people_per_acre     float64
households            int64
less_than_10,000      int64
10000_to_14999        int64
15000_to_24999        int64
25000_to_34999        int64
35000_to_49999        int64
50000_to_74999        int64
75000_to_99999        int64
100000_to_149999      int64
150000_to_199999      int64
200000_or_more        int64
median_income         int64
mean_income           int64
dtype: object



 43%|████████████████████████████████████                                                | 3/7 [00:08<00:11,  2.93s/it]


Dataset green_trips

-----Data Types-------
Unnamed: 0                    int64
pickup_datetime      datetime64[ns]
dropoff_datetime     datetime64[ns]
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
trip_distance               float64
total_amount                float64
pickup_nta_code              object
dropoff_nta_code             object
dtype: object



 57%|████████████████████████████████████████████████                                    | 4/7 [00:11<00:08,  2.85s/it]


Dataset uber_trips_2014

-----Data Types-------
Unnamed: 0                   int64
pickup_datetime     datetime64[ns]
pickup_latitude            float64
pickup_longitude           float64
base                        object
nta_code                    object
dtype: object



 71%|████████████████████████████████████████████████████████████                        | 5/7 [00:11<00:04,  2.06s/it]


Dataset zones

-----Data Types-------
location_id      int64
borough         object
zone            object
service_zone    object
nta_code        object
dtype: object


Dataset weather

-----Data Types-------
date             datetime64[ns]
max_temp                  int64
min_temp                  int64
avg_temp                float64
precipitation            object
snowfall                 object
snow_depth               object
location                 object
latitude                float64
longitude               float64
dtype: object


Dataset yellow_trips

-----Data Types-------
Unnamed: 0                    int64
pickup_datetime      datetime64[ns]
dropoff_datetime     datetime64[ns]
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
trip_distance               float64
total_amount                float64
pickup_nta_code              object
dropoff_nta_c

100%|████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:11<00:00,  1.66s/it]


---- Inferring date type process ended ----



 It is time to analyze location information on maps. For that we analyze green_trips.
 For instance, this first map shows each pickup point in blue and each dropoff point in red, with a green line joining the two. It shows that green taxis do not pick up passengers in most of Manhattan, and it also helps us spot outliers, such as a trip that started in Brooklyn and ended in Madagascar.
 **Important**
 Only the first 200 records of green_trips are drawn.
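
 Outliers like the far-away dropoff described above could also be flagged without a map, for instance with a rough bounding box. This is a sketch only: the box limits, the `inside_nyc` helper and the two-row sample (whose second row is made up) are assumptions, not values taken from the datasets.

```python
import pandas as pd

# Rough bounding box around New York City; the limits are an assumption
NYC_BOUNDS = {"lat": (40.4, 41.0), "lon": (-74.3, -73.6)}

def inside_nyc(frame, lat_col, lon_col, bounds=NYC_BOUNDS):
    """Boolean mask, True where the coordinate falls inside the bounding box."""
    lat_ok = frame[lat_col].between(*bounds["lat"])
    lon_ok = frame[lon_col].between(*bounds["lon"])
    return lat_ok & lon_ok

demo = pd.DataFrame({
    "dropoff_latitude": [40.728386, -18.9],   # second row is a made-up far-away point
    "dropoff_longitude": [-73.984764, 47.5],
})
mask = inside_nyc(demo, "dropoff_latitude", "dropoff_longitude")
outliers = demo[~mask]
```

 A bounding box is coarser than the NTA polygons, but it is cheap and enough to separate obviously wrong coordinates before any heavier processing.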

In [25]:
folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=10,
                        tiles="OpenStreetMap")

green=listOfDataSets['green_trips'].dataFrame

for i in range(0,200):
    p1=[green["pickup_latitude"][i],green["pickup_longitude"][i]]
    p2=[green["dropoff_latitude"][i],green["dropoff_longitude"][i]]
    points1=[p1,p2]

    marker = folium.CircleMarker(location=[p1[0],p1[1]],radius=5,color="blue",fill=True)
    marker.add_to(folium_map)

    marker2 = folium.CircleMarker(location=[p2[0],p2[1]],radius=5,color="red",fill=True)
    marker2.add_to(folium_map)
    
    folium.PolyLine(points1, color="green", weight=2.5, opacity=1).add_to(folium_map)

folium_map

The second map shows the dropoff points with the highest fares. The distribution is fairly even, since the fare usually depends on the length of the trip rather than only on the dropoff point. Even so, we found that trips to airports are usually more expensive.
 
**Important**

Note that we plot only 100 thousand records for performance reasons.
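
If the order of rows in the file is not random, taking the first rows may favour some period or area. A possible alternative, shown here only as a sketch, is to draw the subset at random with a fixed seed; the helper name `sample_for_plot` and its defaults are hypothetical.

```python
import pandas as pd

def sample_for_plot(frame, max_rows=100_000, seed=0):
    """Return at most `max_rows` rows, drawn at random so file order does not bias the map."""
    if len(frame) <= max_rows:
        return frame
    # random_state makes the same subset come back on every run
    return frame.sample(n=max_rows, random_state=seed)
```

The result can be passed to the heat map in place of `green.head(100000)`.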

In [29]:
max_amount = float(green['total_amount'].max())

folium_hmap = folium.Map(location=[40.738, -73.98],
                        zoom_start=13,
                        tiles="OpenStreetMap")
green=green.head(100000)

hm_wide = HeatMap( list(zip(green["dropoff_latitude"], green["dropoff_longitude"], green['total_amount'])),
                   min_opacity=0.2,
                   max_val=max_amount,
                   radius=8, blur=6, 
                   max_zoom=15, 
                 )


folium_hmap.add_child(hm_wide)

 ## Modeling and Analyzing the Data Sets
 In progress ...