In [None]:
# libraries 
import pandas as pd
import seaborn as sns
import numpy as np
import random

%matplotlib inline

import matplotlib.pyplot as plt

import time
import functions
import json
import folium
import geojson
import geopandas

import datetime
from scipy.stats import chi2_contingency, ttest_ind ,chisquare, kruskal, pearsonr

In [None]:
from importlib import reload
reload(functions)

In [None]:
df_names=['data/yellow_tripdata_2018-01.csv','data/yellow_tripdata_2018-02.csv',
          'data/yellow_tripdata_2018-03.csv','data/yellow_tripdata_2018-04.csv',
         'data/yellow_tripdata_2018-05.csv','data/yellow_tripdata_2018-06.csv']

# months to analyze
months = ['January','February','March','April','May','June']

#plots_colors = ['royalblue', 'orange', 'violet', 'crimson', 'darkcyan', 'coral', 'mediumseagreen']

# taxi_zone_lookup.csv file
taxi_zone_lookup = pd.read_csv('data/taxi_zone_lookup.csv')
borough_lst = list(np.unique(taxi_zone_lookup.Borough))
# np.unique returns the values sorted, so the last entry is 'Unknown': drop it
borough_lst.pop()

## Before starting

A brief summary of our situation.

The function **stats** prints some summary information, inspecting the dataset of each month.


In [None]:
#### "SHORT" ANALYSIS FUNCTION: SHOWS THE MAIN INFORMATION ABOUT THE CSV FILES
functions.stats(df_names)

As we can see, there are several odd values. For example, every month has some attribute values equal to zero. After identifying the columns with problematic values, we decided to write a function that performs an initial cleaning of the dataset, instead of removing those values every time we load the dataset for a question.


Therefore, the main goal was to eliminate the rows where the *total_amount* value is equal to 0 or where the date in the *tpep_pickup_datetime* attribute is wrong.


We decided to remove these values rather than others because we think that total_amount is the most reliable indicator that a trip actually occurred. For instance, taximeters can report erroneous values (such as a fare amount or trip distance equal to zero), but the main reference for a taxi company is the payment.

However, this rough cleaning is not enough. For each task we will need to select the appropriate data, and we will specify how step by step.
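The initial cleaning described above could look roughly like the following. This is only a minimal sketch: `clean_month` is a hypothetical helper (the real logic lives in `functions.make_new_csv`, which is not shown here), and it assumes that a "wrong date" means a pickup date outside the month the file is supposed to cover.

```python
import pandas as pd

def clean_month(df, year, month):
    """Hypothetical helper: keep rows with a non-zero total_amount
    and a pickup date inside the expected year and month."""
    pickup = pd.to_datetime(df["tpep_pickup_datetime"], errors="coerce")
    in_month = (pickup.dt.year == year) & (pickup.dt.month == month)
    paid = df["total_amount"] != 0
    return df[in_month & paid]

# Toy rows (made up for illustration, not taken from the real dataset)
raw = pd.DataFrame({
    "tpep_pickup_datetime": ["2018-01-05 08:00:00", "2009-01-01 00:10:00",
                             "2018-01-09 17:30:00", "2018-02-01 00:01:00"],
    "total_amount": [12.5, 8.0, 0.0, 9.3],
})
cleaned = clean_month(raw, 2018, 1)
print(cleaned)
```

Only the first toy row survives: the others have a wrong year, a zero total, or a pickup date in the following month.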

## Cleaning the data

Create new csv files with the cleaned data and store them at the paths listed in df_names.

In [None]:
# Used only once
# df_names: paths of the csv files (dataset from Jan to Jun 2018)
functions.make_new_csv(df_names)



We found out that one of the fastest ways to load the data is to extract only the columns that we need from the csv files, by passing the list of required attributes to the *usecols* argument of the function *read_csv( )*.
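As a small illustration of the *usecols* argument, the sketch below reads only two columns from an in-memory csv. The csv text is a toy stand-in for the real files, and the selected columns are just an example.

```python
import io
import pandas as pd

# Toy csv standing in for one of the monthly files
csv_text = (
    "VendorID,tpep_pickup_datetime,passenger_count,total_amount\n"
    "1,2018-01-05 08:00:00,1,12.5\n"
    "2,2018-01-05 09:15:00,2,20.0\n"
)

# Only the listed columns are parsed and kept in memory
needed = ["tpep_pickup_datetime", "total_amount"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=needed,
                     parse_dates=["tpep_pickup_datetime"])
print(subset.dtypes)
```

With the real files, the `io.StringIO` object would simply be replaced by one of the paths in `df_names`.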

So, now that we have cleaned the whole dataset, in each point of the homework below we load only the columns that are needed.

##  RQ1
### a) Plot the daily average for each month
For this task, the only attribute that we need is 'tpep_pickup_datetime'.
We decided to exclude the 'Unknown' borough from our analysis. This item could act as a confounder, since its properties are unknown.
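One possible way to compute a daily average from the pickup timestamps alone is sketched below. `daily_average` is a hypothetical helper, not the actual `functions.compute_daily_average`, and it assumes the average is taken over the days that have at least one trip.

```python
import pandas as pd

def daily_average(pickups):
    """Hypothetical helper: mean number of trips per calendar day
    (days without any trip are not counted)."""
    days = pd.to_datetime(pickups).dt.date
    return days.value_counts().mean()

# Toy timestamps: three trips on one day, one trip on the next
pickups = pd.Series([
    "2018-01-01 08:00:00", "2018-01-01 12:30:00",
    "2018-01-01 23:59:00", "2018-01-02 07:45:00",
])
print(daily_average(pickups))
```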

In [None]:
daily_average_lst = functions.compute_daily_average(df_names)


In [None]:
functions.plot_daily_averages(daily_average_lst,months)

The graph shows that the daily average of trips was highest in April and lowest in January.

### b) For each borough, plot the daily average for each month

We've decided to put the results into a dictionary which has:
    - key = borough
    - value = list of the daily averages for that specific borough
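A minimal sketch of how such a dictionary could be built for one month is shown below. It assumes the trips are linked to boroughs by joining `PULocationID` against `LocationID` in the lookup table; `borough_daily_average` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

# Toy lookup and trips (made up for illustration)
lookup = pd.DataFrame({"LocationID": [1, 2, 3],
                       "Borough": ["EWR", "Queens", "Bronx"]})
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2018-01-01 08:00", "2018-01-01 09:00",
        "2018-01-02 10:00", "2018-01-01 11:00"]),
    "PULocationID": [2, 2, 2, 3],
})

def borough_daily_average(trips, lookup):
    """Hypothetical helper: average number of trips per day, per borough."""
    merged = trips.merge(lookup, left_on="PULocationID", right_on="LocationID")
    merged["day"] = merged["tpep_pickup_datetime"].dt.date
    per_day = merged.groupby(["Borough", "day"]).size()
    return per_day.groupby(level="Borough").mean().to_dict()

month_avg = borough_daily_average(trips, lookup)
# One list per borough; further months would be appended to each list
borough_averages = {b: [avg] for b, avg in month_avg.items()}
print(borough_averages)
```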

In [None]:
# init dictionary of borough's averages
borough_averages = {}

In [None]:
borough_averages = functions.compute_borough_averages(df_names, taxi_zone_lookup)

In [None]:
for key, lst in borough_averages.items():
    print(key, lst)

In [None]:
functions.plot_boroug_averages(borough_averages, months)

The daily average of rides increases steadily from February to April in Brooklyn, Manhattan and Queens. Regarding EWR and Staten Island, we should point out that the former is not a real borough but only an airport area. For the latter, we think the low number of rides is due to a preference for another taxi company.

## RQ2 
### a) Plot of passenger count for each hour of the day

In [None]:
df = functions.passengers_NY_all_months(df_names)

In [None]:
functions.plot_NY_24_hours(df)

The plot above shows that taxi usage varies during the day, with a substantial increase from 6 am to 6 pm and a consequent drop during the night hours.
Based on this evidence, we decided to create time slots to make the daily taxi usage easier to visualise.
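Grouping hours into time slots can be sketched with `pd.cut`. The slot boundaries below (four six-hour slots) are an assumption for illustration only, since the slots actually used by `functions.time_slots_and_plot` are not shown here, and the rows are toy data.

```python
import pandas as pd

# Toy data: pickup hour and passenger count of a few trips
trips = pd.DataFrame({"hour": [0, 3, 7, 9, 13, 18, 22, 23],
                      "passenger_count": [1, 2, 1, 3, 2, 4, 1, 2]})

# Assumed slots: [0, 6), [6, 12), [12, 18), [18, 24)
edges = [0, 6, 12, 18, 24]
labels = ["00-06", "06-12", "12-18", "18-24"]
trips["slot"] = pd.cut(trips["hour"], bins=edges, labels=labels, right=False)

# Total passengers per time slot
per_slot = trips.groupby("slot", observed=False)["passenger_count"].sum()
print(per_slot)
```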

In [None]:
#functions.time_slots_and_plot(df,plots_colors[6])
functions.time_slots_and_plot(df,"green")

### b) Doing that for each borough

In [None]:
functions.passengers_for_each_borough (df, borough_lst, taxi_zone_lookup)

Different boroughs show a slight variation in taxi usage during the day. For the Bronx and Brooklyn, we can notice opposite trends in taxi usage during rush hour.

## RQ3 Analyzing the trip duration
### For this analysis we considered trip durations longer than 120 seconds and shorter than 5000 seconds
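The duration filter can be sketched as follows. This is a minimal example on toy rows, assuming the duration is the difference between the dropoff and pickup timestamps; the column name `durations` follows the one used later in this notebook, while the actual implementation of `functions.make_duration_df` is not shown here.

```python
import pandas as pd

# Toy trips lasting 60 s, 1200 s and 7200 s
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2018-01-01 08:00:00", "2018-01-01 09:00:00", "2018-01-01 10:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2018-01-01 08:01:00", "2018-01-01 09:20:00", "2018-01-01 12:00:00"]),
})

# Duration in seconds
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["durations"] = delta.dt.total_seconds()

# Keep only durations strictly between 120 and 5000 seconds
kept = trips[(trips["durations"] > 120) & (trips["durations"] < 5000)]
print(kept["durations"])
```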


In [None]:
df = functions.make_duration_df(df_names,taxi_zone_lookup)

In [None]:
functions.plot_frequencies(df['durations'], 'NYC')

We decided to visualize the variation of trip duration by dividing the durations into small intervals. From this plot we can observe that taxis are mostly used for short trips.

However, when we perform the analysis for each borough separately (plots shown below), we can notice that some distributions move away from the general trend.

Queens, for example, has higher frequencies for longer trip durations.


In [None]:
functions.Boroughs_durations_freq(df, borough_lst)

## RQ4
###  a) The number of payments for each payment type

In [None]:
df,payment_type_lst=functions.payments_per_borough(df_names,taxi_zone_lookup,borough_lst)

In [None]:
payment_type_all=list(map(int,df.sum().values))
# pad with zeros the payment types that have no column in df
for ind in range(len(payment_type_lst)-len(df.columns)):
    payment_type_all.append(0)

In [None]:
functions.payment_types_NYC_plot(payment_type_all,payment_type_lst)

Looking at all the boroughs together (the whole of NYC), we can conclude that most people paid for their taxi rides with a credit card, and that cash was the second most used means of payment. In addition, no observations were identified as "Unknown" or "Voided trip".

### Chi-squared test

In [None]:
%%latex
\[H_0: \text{the method of payment is independent of the borough}\]
\[H_1: \text{the method of payment is NOT independent of the borough}\]

In [None]:
chi2, p_value, dof, expected = chi2_contingency(df)

In [None]:
p_value

The p_value is smaller than 0.01. Therefore, the null hypothesis can be rejected at the 1% significance level.
We can conclude that the method of payment depends on the borough. In other words, there is a statistically significant association between method of payment and borough.
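The decision rule used above can be illustrated on a toy contingency table. The counts below are made up and the orientation (boroughs as rows, payment types as columns) is an assumption; only the call to `chi2_contingency` mirrors the one used in this notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table: rows are boroughs, columns are payment types
table = pd.DataFrame(
    {"Credit card": [60, 30], "Cash": [40, 70]},
    index=["Manhattan", "Queens"],
)

chi2, p_value, dof, expected = chi2_contingency(table)

# Reject the null hypothesis of independence at the 1% level
alpha = 0.01
reject_h0 = p_value < alpha
print("chi2: %0.3f" % chi2, "p_value:", p_value, "reject H0:", reject_h0)
```

The `expected` array holds the counts that would be observed if payment type and borough were independent, which is what the observed table is compared against.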

### b) The way payments are executed in each borough 

In [None]:
functions.payment_type_per_borough_plot(df,payment_type_lst)

Comparing the methods of payment across the boroughs, the graphs lead to the following conclusions:
Just like in the whole of NYC, the most common means of payment in each borough was the credit card, followed by cash.
Rides that ended with a dispute were rare in all boroughs, as were rides that ended with no charge.

## RQ5
### a) The dependence between distance and duration of the trip
To analyze this point, we first decided to filter our dataset on two parameters:
- For trip duration, we kept only the values higher than 120 seconds and lower than 2 hours
- For trip distance, we kept all values between 1.2 miles and 50 miles
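The two filters above, followed by a Pearson correlation, can be sketched on toy data as follows. The rows are made up, the column names follow those used later in this notebook, and treating the distance bounds as inclusive is an assumption.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy trips: duration in seconds, distance in miles
trips = pd.DataFrame({
    "trip_duration": [60, 600, 1200, 2400, 9000, 1800],
    "trip_distance": [0.3, 2.0, 4.5, 9.0, 30.0, 0.5],
})

# Duration between 120 seconds and 2 hours, distance between 1.2 and 50 miles
mask = (
    (trips["trip_duration"] > 120) & (trips["trip_duration"] < 2 * 3600)
    & (trips["trip_distance"] >= 1.2) & (trips["trip_distance"] <= 50)
)
filtered = trips[mask]

corr, p_value = pearsonr(filtered["trip_duration"], filtered["trip_distance"])
print("corr: %0.3f" % corr, "p_value:", p_value)
```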

In [None]:
df = functions.duration_distance_df(df_names)

In [None]:
df.corr()

We obtain a high correlation between trip duration and trip distance on the whole filtered dataset.

In [None]:
# sampling 1000 rows
temp = df.sample(1000)

In [None]:
temp.corr()

In [None]:
# plotting the sample
temp.plot(y='trip_duration', x = 'trip_distance', kind = 'scatter')

In [None]:
corr, p_value = pearsonr(temp['trip_duration'],temp['trip_distance'])
print('corr: %0.3f' %corr, "p_value:", p_value)

In [None]:
functions.plot_duration_distance_freq(temp)

# CQ1

In [None]:
df = functions.make_df_price_per_mile(df_names, taxi_zone_lookup)

In [None]:
# making a boro_dict with one dataframe for each borough
boro_dict = functions.make_boro_dict(df,borough_lst)

### compute the mean and the standard deviation

In [None]:
# mean and std table for each borough
mean_std_table = functions.mean_std_table(boro_dict,borough_lst, 'price_per_mile')
mean_std_table

In [None]:
# plot the price_per_mile for each borough
functions.plot_price_per_mile(boro_dict,borough_lst)

Run the mean and the standard deviation of the new variable for each borough. Then plot the distribution. What do you see?

### Run the t-test among all the possible pairs of distribution of different boroughs

- **H0: the means of the two independent samples are equal**
- **H1: the means of the two independent samples are different**
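Running the test on every pair of boroughs can be sketched with `itertools.combinations`. The samples below are synthetic (drawn from normal distributions), and using Welch's variant (`equal_var=False`) is an assumption, since the variant used by `functions.p_value_table` is not shown here.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic price-per-mile samples, one array per borough
rng = np.random.default_rng(0)
samples = {
    "Bronx": rng.normal(5.0, 1.0, 200),
    "Queens": rng.normal(5.0, 1.0, 200),
    "Manhattan": rng.normal(7.0, 1.0, 200),
}

# Symmetric table of p-values, with NaN on the diagonal
names = list(samples)
p_table = pd.DataFrame(np.nan, index=names, columns=names)
for a, b in combinations(names, 2):
    _, p = ttest_ind(samples[a], samples[b], equal_var=False)
    p_table.loc[a, b] = p_table.loc[b, a] = p

print(p_table)
```

A small p-value for a pair means that H0 (equal means) is rejected for those two boroughs.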

In [None]:
#p value table for price_per_mile
functions.p_value_table(boro_dict, borough_lst, 'price_per_mile')

### P' = P/T

In [None]:
# Create a new column P1 = P/T
df['p1'] = df['price_per_mile'] / df['trip_duration']

In [None]:
# reload borough
boro_dict = functions.make_boro_dict(df, borough_lst)

In [None]:
#mean and std
functions.mean_std_table(boro_dict,borough_lst, 'p1')

In [None]:
#plot p1 for each borough
functions.plot_p1 (boro_dict, borough_lst)

In [None]:
# make p_value_table
functions.p_value_table(boro_dict,borough_lst,'p1')

# CQ2

In [None]:
maps=functions.pickup_and_dropoff_maps(df_names,taxi_zone_lookup, 'taxi_zones.json')

In [None]:
maps[0]


In [None]:
maps[1]