**[RQ4] What is the most common payment method? Discover how payments are made in each borough and visualize the number of payments for each method. Then run the Chi-squared test to see whether the method of payment is correlated to the borough. Then, comment on the results.**

In [33]:
### Month of January

# We import only the columns needed to answer our research question.
# NB: payment_type is a numeric code indicating how the passenger paid for the trip:
# 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

data=pd.read_csv('/Users/Enzopc/Desktop/yellow_tripdata_2018-01.csv', usecols=['passenger_count','PULocationID', 'DOLocationID','payment_type','total_amount'])
zone=pd.read_csv('/Users/Enzopc/Desktop/taxi _zone_lookup.csv', usecols=['LocationID', 'Borough'])
datazone=pd.merge(data, zone, how='left', left_on=['PULocationID'],right_on=['LocationID'])

print("Number of rows:", datazone.shape[0])
print("Number of columns: ", datazone.shape[1])
datazone.head()

Number of rows: 8759874
Number of columns:  7


,passenger_count,PULocationID,DOLocationID,payment_type,total_amount,LocationID,Borough
0,1,41,24,2,5.8,41,Manhattan
1,1,239,140,2,15.3,239,Manhattan
2,2,262,141,1,8.3,262,Manhattan
3,1,140,257,2,34.8,140,Manhattan
4,2,246,239,1,16.55,246,Manhattan


In [34]:
# What is the most common payment method?
datazone['payment_type'].value_counts()

1    6105871
2    2598947
3      43204
4      11852
Name: payment_type, dtype: int64

As this initial count shows, the most common payment method is credit card (code 1), with about 70% of all trips, followed by cash (code 2).
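To make the ranking easier to read, the counts can be converted into shares and labelled with the payment names from the data dictionary. The sketch below hard-codes the counts printed above instead of reloading the file; the `labels` mapping and the variable names are illustrative, not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above (January 2018)
counts = pd.Series({1: 6105871, 2: 2598947, 3: 43204, 4: 11852}, name="payment_type")

# Code-to-name mapping taken from the comment in the first cell
labels = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute"}

# Share of each payment method over all trips
shares = (counts / counts.sum()).rename(index=labels)
print((shares * 100).round(2))
```

On the full DataFrame, `datazone['payment_type'].value_counts(normalize=True)` should give the same shares directly.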



In [35]:
# Now let's look more closely:
# how payments are made in each borough, and the number of payments for each method.

contingency_table = pd.crosstab(datazone['payment_type'], datazone['Borough'])
contingency_table

payment_type,Bronx,Brooklyn,EWR,Manhattan,Queens,Staten Island,Unknown
1,2939,63280,439,5564803,365207,69,109134
2,4114,35339,88,2356898,154704,65,47739
3,271,1320,33,34240,5697,9,1634
4,65,311,11,9762,1298,1,404


**Chi-squared test to see whether the method of payment is correlated to the borough.**

H0 (Null Hypothesis): Payment type and pickup borough are independent.

H1 (Alternative Hypothesis): Payment type and pickup borough are not independent.

If the p-value is below the chosen significance level (0.05 here), we reject the null hypothesis and conclude that the data support the alternative hypothesis.


Assumptions that must hold for the results of the Chi-square test to be trusted:

 - The cells must contain frequencies (counts of cases), not percentages. Converting to percentages after the test is fine.
 
 - The levels (categories) of the variables being tested are mutually exclusive.
 
 - Each observation (here, each trip) contributes to exactly one cell of the Chi-square table.
 
 - The groups being tested must be independent.
 
 - The expected count in each cell should be at least 5. A common rule of thumb tolerates up to 20% of cells below 5, provided none is below 1.

If all of these assumptions are met, then Chi-square is an appropriate test to use.
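The expected-count assumption can be checked directly before interpreting the test. This is a minimal sketch, assuming the contingency table is rebuilt by hand from the crosstab output above (the name `observed` is illustrative); `scipy.stats.contingency.expected_freq` returns the counts expected under independence.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Rows: payment_type 1-4; columns: Bronx, Brooklyn, EWR, Manhattan,
# Queens, Staten Island, Unknown (values copied from the crosstab above)
observed = np.array([
    [2939, 63280, 439, 5564803, 365207, 69, 109134],
    [4114, 35339, 88, 2356898, 154704, 65, 47739],
    [271, 1320, 33, 34240, 5697, 9, 1634],
    [65, 311, 11, 9762, 1298, 1, 404],
])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)

n_below_5 = int((expected < 5).sum())
n_below_1 = int((expected < 1).sum())
print(f"Cells with expected count < 5: {n_below_5} of {expected.size}")
print(f"Cells with expected count < 1: {n_below_1} of {expected.size}")
```

In the notebook itself, passing `contingency_table` instead of `observed` should work the same way.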

In [37]:
from scipy import stats

stats.chi2_contingency(contingency_table)

(14463.192599649135,
 0.0,
 18,
 array([[5.15033445e+03, 6.98769831e+04, 3.98002567e+02, 5.55231216e+06,
         3.67267847e+05, 1.00371926e+02, 1.10765299e+05],
        [2.19222553e+03, 2.97429434e+04, 1.69408685e+02, 2.36332622e+06,
         1.56326537e+05, 4.27230309e+01, 4.71469415e+04],
        [3.64428023e+01, 4.94436450e+02, 2.81619165e+00, 3.92871213e+04,
         2.59871852e+03, 7.10212955e-01, 7.83754520e+02],
        [9.99722462e+00, 1.35636997e+02, 7.72555861e-01, 1.07774966e+04,
         7.12897230e+02, 1.94830200e-01, 2.15004596e+02]]))

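The first value (14463.19) is the Chi-square statistic, followed by the p-value (reported as 0.0 because it is too small to be represented as a float), then the degrees of freedom ((4 - 1) × (7 - 1) = 18), and lastly the expected frequencies as an array.
Not all of the expected frequencies are at least 5: four of the 28 cells fall below 5 (payment types 3 and 4 in EWR and Staten Island), and three of those are below 1. The expected-count assumption is therefore not fully met, although the affected cells cover only a tiny fraction of the trips.
We reject the null hypothesis because the p-value is less than 0.05.
Thus, the results indicate an association between payment type and borough: the two variables are not independent of each other. With about 8.8 million trips, however, even a weak association is statistically significant, so the test alone says nothing about how strong the relationship is.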
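Because the sample is very large, a p-value near 0 does not by itself show that the relationship is strong. One common effect-size measure for contingency tables is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below is not part of the original analysis; it rebuilds the table from the crosstab output (the name `observed` is illustrative) and derives V from the Chi-square statistic.

```python
import numpy as np
from scipy import stats

# Rows: payment_type 1-4; columns: Bronx, Brooklyn, EWR, Manhattan,
# Queens, Staten Island, Unknown (values copied from the crosstab above)
observed = np.array([
    [2939, 63280, 439, 5564803, 365207, 69, 109134],
    [4114, 35339, 88, 2356898, 154704, 65, 47739],
    [271, 1320, 33, 34240, 5697, 9, 1634],
    [65, 311, 11, 9762, 1298, 1, 404],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()            # total number of trips
k = min(observed.shape) - 1   # min(rows - 1, columns - 1)

# Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))
cramers_v = np.sqrt(chi2 / (n * k))
print(f"chi2 = {chi2:.2f}, dof = {dof}, Cramér's V = {cramers_v:.4f}")
```

A value of V close to 0 would suggest that the association, although statistically significant, is weak in practical terms.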

We could relate this result to the "industrialization" of the boroughs. In Manhattan, Brooklyn, EWR and Queens, credit card is the most common payment method. In Staten Island there is almost no difference between credit card and cash (69 vs. 65 trips), and in the Bronx most passengers pay in cash (4114 cash trips vs. 2939 by credit card).
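Since Manhattan dominates the raw counts, the comparison across boroughs is easier to read as within-borough shares. A minimal sketch, assuming the table is rebuilt by hand from the crosstab output above (the names `table`, `shares` and `most_common` are illustrative):

```python
import pandas as pd

boroughs = ["Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island", "Unknown"]

# Values copied from the crosstab above; rows are payment_type codes
table = pd.DataFrame(
    [
        [2939, 63280, 439, 5564803, 365207, 69, 109134],
        [4114, 35339, 88, 2356898, 154704, 65, 47739],
        [271, 1320, 33, 34240, 5697, 9, 1634],
        [65, 311, 11, 9762, 1298, 1, 404],
    ],
    index=pd.Index([1, 2, 3, 4], name="payment_type"),
    columns=boroughs,
)

# Divide each column by its total so that every borough sums to 1
shares = table / table.sum(axis=0)
print((shares * 100).round(1))

# Most frequent payment code in each borough
most_common = shares.idxmax()
print(most_common)
```

On the full data, `pd.crosstab(datazone['payment_type'], datazone['Borough'], normalize='columns')` should produce the same shares without the manual division.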

In [None]:
### Plot to visualize the number of payments for each method ###

# Frequencies per borough for the two main payment types (1 = credit card, 2 = cash)
creditcard = contingency_table.loc[1].values
cash = contingency_table.loc[2].values
# Take the labels from the table itself so that bars and borough names always line up
categories = contingency_table.columns
# Plot the stacked bar chart
sns.set(font_scale=1.5)
fig = plt.figure(figsize=(14, 5))
p1 = plt.bar(categories, creditcard, color='#d62728')
p2 = plt.bar(categories, cash, bottom=creditcard)
plt.legend((p2[0], p1[0]), ('Cash', 'Credit Card'))
plt.xlabel('Borough')
plt.ylabel('Count')
plt.show()