#### Available Data

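The **trip**-related data points are listed below. Note that, apart from `medallion`, every column name in the raw CSV header begins with a space, and the code further down has to reproduce that space exactly: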

* 'medallion',
* ' hack_license',
* ' vendor_id',
* ' rate_code',
* ' store_and_fwd_flag',
* ' pickup_datetime',
* ' dropoff_datetime',
* ' passenger_count',
* ' trip_time_in_secs',
* ' trip_distance',
* ' pickup_longitude',
* ' pickup_latitude',
* ' dropoff_longitude',
* ' dropoff_latitude'

The **fare** CSV file contains the following columns:

* 'medallion',
* ' hack_license',
* ' vendor_id',
* ' pickup_datetime',
* ' payment_type',
* ' fare_amount',
* ' surcharge',
* ' mta_tax',
* ' tip_amount',
* ' tolls_amount',
* ' total_amount'
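
The column names listed above, apart from `medallion`, start with a space, which is easy to get wrong when selecting columns. One option is to strip the names once at load time. The sketch below runs on a small in-memory sample rather than the real files; the sample values and the helper name `read_clean` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the fare file's header layout
# (a space after every comma in the header); the values are invented.
SAMPLE = (
    "medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount\n"
    "A1,H1,CMT,2013-04-04 18:47:45,CRD,11.0\n"
    "B2,H2,VTS,2013-04-05 07:08:34,CSH,14.5\n"
)

def read_clean(source, **kwargs):
    """Read a CSV and strip surrounding whitespace from the column names."""
    df = pd.read_csv(source, **kwargs)
    df.columns = df.columns.str.strip()
    return df

fares = read_clean(io.StringIO(SAMPLE))
print(list(fares.columns))
```

If `usecols` is passed through to `pd.read_csv`, it is matched against the raw header, so it would still need the names with the leading space.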

The analysis sets out to answer the following questions:

a. What is the distribution of the number of passengers per trip?
b. What is the distribution of payment_type?
c. What is the distribution of fare amount?
d. What is the distribution of tip amount?
e. What is the distribution of total amount?
f. What are the top 5 busiest hours of the day?
g. What are the top 10 busiest locations in the city?
h. Which trip has the highest standard deviation of travel time?
i. Which trip has the most consistent fares?
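
For the distribution questions and the busiest-hours question, one possible approach is to count values chunk by chunk, so the full file never has to fit in memory. This is only a sketch on an invented four-row sample, not the method used in the cells below; the function name `chunked_counts` and its `transform` parameter are assumptions made for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical sample with the same header style as the trip file
# (leading space in the column names); the rows are invented.
SAMPLE = (
    "medallion, pickup_datetime, passenger_count\n"
    "A1,2013-04-04 18:47:45,1\n"
    "B2,2013-04-04 18:05:10,2\n"
    "C3,2013-04-05 07:08:34,1\n"
    "D4,2013-04-05 18:30:00,1\n"
)

def chunked_counts(source, column, chunksize=10_000, transform=None):
    """Count how often each value of `column` occurs, reading in chunks."""
    counts = Counter()
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column]
        if transform is not None:
            values = transform(values)
        # value_counts() tallies one chunk; Counter.update adds the tallies up
        counts.update(values.value_counts().to_dict())
    return counts

# (a) distribution of passengers per trip
passengers = chunked_counts(io.StringIO(SAMPLE), ' passenger_count', chunksize=2)

# (f) busiest hours: map each pickup timestamp to its hour before counting
hours = chunked_counts(
    io.StringIO(SAMPLE), ' pickup_datetime', chunksize=2,
    transform=lambda s: pd.to_datetime(s).dt.hour,
)

print(dict(passengers))
print(hours.most_common(5))
```

The same helper would apply to `' payment_type'` as is; for continuous columns such as fares or tips, binning the values in `transform` (for example with `pd.cut`) is probably needed before counting.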

In [36]:
import pandas as pd
import re
import seaborn as sns
import time
from collections import Counter, defaultdict

sns.set_style("whitegrid")

In [44]:
class Taxi:
    
    def __init__(self, trip_file, fare_file):
        
        self.trip_file = trip_file
        self.fare_file = fare_file
        
        self.distros = defaultdict(lambda: defaultdict(int))

    def get_distro(self, rows_at_once=10000):

        # a trip is identified by (medallion, pickup_datetime);
        # every column name except 'medallion' has a leading space
        trip_key = ['medallion', ' pickup_datetime']

        # read the trip file in chunks and accumulate the passenger count per trip
        for i, d in enumerate(pd.read_csv('data/' + self.trip_file, chunksize=rows_at_once,
                                          usecols=trip_key + [' passenger_count'])):
            for key, row in d.groupby(trip_key).sum().iterrows():
                self.distros['passengers'][key] += row[' passenger_count']

            if i%10 == 0:
                print(f'done {i*rows_at_once:,} rows..')

        # payment_type is a category code, not a quantity, so keep the first
        # value per trip instead of summing it; the amounts are still summed
        fare_aggs = {' payment_type': 'first', ' fare_amount': 'sum', ' tip_amount': 'sum'}

        for i, d in enumerate(pd.read_csv('data/' + self.fare_file, chunksize=rows_at_once,
                                          usecols=trip_key + list(fare_aggs))):
            for key, row in d.groupby(trip_key).agg(fare_aggs).iterrows():

                self.distros['payment_type'][key] = row[' payment_type']
                self.distros['fare_amount'][key] += row[' fare_amount']
                self.distros['tip_amount'][key] += row[' tip_amount']

            if i%10 == 0:
                print(f'done {i*rows_at_once:,} rows..')

        return self
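
`get_distro` stores one entry per trip in `self.distros[...]`, keyed by `(medallion, pickup_datetime)`, so a further step is needed to obtain an actual frequency distribution. A minimal sketch of that step, assuming a plain dict shaped like `tx.distros['passengers']`; the helper name `to_distribution` and the sample entries are hypothetical:

```python
from collections import Counter

def to_distribution(per_trip):
    """Collapse a {trip_key: value} mapping into {value: number_of_trips}."""
    return Counter(per_trip.values())

# Invented stand-in for tx.distros['passengers']
per_trip_passengers = {
    ('A1', '2013-04-04 18:47:45'): 1,
    ('B2', '2013-04-04 18:05:10'): 2,
    ('C3', '2013-04-05 07:08:34'): 1,
}

dist = to_distribution(per_trip_passengers)
for value, n_trips in sorted(dist.items()):
    print(f'{value} passenger(s): {n_trips} trip(s)')
```

The resulting counter can be wrapped in `pd.Series(dist).sort_index()` for plotting.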

In [None]:
if __name__ == '__main__':
    
    tx = Taxi(trip_file='trip_data_4.csv', fare_file='trip_fare_4.csv').get_distro()

done 0 rows..
done 100,000 rows..
done 200,000 rows..
done 300,000 rows..
done 400,000 rows..
done 500,000 rows..
done 600,000 rows..
done 700,000 rows..
done 800,000 rows..
done 900,000 rows..
done 1,000,000 rows..
done 1,100,000 rows..
done 1,200,000 rows..
done 1,300,000 rows..
done 1,400,000 rows..
done 1,500,000 rows..
done 1,600,000 rows..
done 1,700,000 rows..
done 1,800,000 rows..
done 1,900,000 rows..
done 2,000,000 rows..
done 2,100,000 rows..
done 2,200,000 rows..
done 2,300,000 rows..
done 2,400,000 rows..
done 2,500,000 rows..
done 2,600,000 rows..
done 2,700,000 rows..
done 2,800,000 rows..
done 2,900,000 rows..
done 3,000,000 rows..
done 3,100,000 rows..
done 3,200,000 rows..
done 3,300,000 rows..
done 3,400,000 rows..
done 3,500,000 rows..
done 3,600,000 rows..
done 3,700,000 rows..
done 3,800,000 rows..
done 3,900,000 rows..
done 4,000,000 rows..
done 4,100,000 rows..
done 4,200,000 rows..
done 4,300,000 rows..
done 4,400,000 rows..
done 4,500,000 rows..
done 4,600,000