# **Conduct an A/B test**
**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.

### Task 1. Imports and data loading

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA) to check data type, range, null value, data size. 

In [3]:
taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float

1. Data type: float64(8), int64(6), object(3)
2. There are no null value.

In [4]:
taxi_data.describe(include="all")

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [5]:
mask = ((taxi_data["fare_amount"] < 0) | (taxi_data["extra"] < 0) | (taxi_data["mta_tax"] < 0) | (taxi_data["improvement_surcharge"] < 0) | (taxi_data["total_amount"] < 0))
taxi_data[mask]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
105454287,2,12/13/2017 2:02:39 AM,12/13/2017 2:03:08 AM,6,0.12,1,N,161,161,3,-2.5,-0.5,-0.5,0.0,0.0,-0.3,-3.8
57337183,2,07/05/2017 11:02:23 AM,07/05/2017 11:03:00 AM,1,0.04,1,N,79,79,3,-2.5,0.0,-0.5,0.0,0.0,-0.3,-3.3
97329905,2,11/16/2017 8:13:30 PM,11/16/2017 8:14:50 PM,2,0.06,1,N,237,237,4,-3.0,-0.5,-0.5,0.0,0.0,-0.3,-4.3
28459983,2,04/06/2017 12:50:26 PM,04/06/2017 12:52:39 PM,1,0.25,1,N,90,68,3,-3.5,0.0,-0.5,0.0,0.0,-0.3,-4.3
833948,2,01/03/2017 8:15:23 PM,01/03/2017 8:15:39 PM,1,0.02,1,N,170,170,3,-2.5,-0.5,-0.5,0.0,0.0,-0.3,-3.8
91187947,2,10/28/2017 8:39:36 PM,10/28/2017 8:41:59 PM,1,0.41,1,N,236,237,3,-3.5,-0.5,-0.5,0.0,0.0,-0.3,-4.8
55302347,2,06/05/2017 5:34:25 PM,06/05/2017 5:36:29 PM,2,0.0,1,N,238,238,4,-2.5,-1.0,-0.5,0.0,0.0,-0.3,-4.3
58395501,2,07/09/2017 7:20:59 AM,07/09/2017 7:23:50 AM,1,0.64,1,N,50,48,3,-4.5,0.0,-0.5,0.0,0.0,-0.3,-5.3
29059760,2,04/08/2017 12:00:16 AM,04/08/2017 11:15:57 PM,1,0.17,5,N,138,138,4,-120.0,0.0,0.0,0.0,0.0,-0.3,-120.3
109276092,2,12/24/2017 10:37:58 PM,12/24/2017 10:41:08 PM,5,0.4,1,N,164,161,4,-4.0,-0.5,-0.5,0.0,0.0,-0.3,-5.3


In [6]:
print("total data size: ", len(taxi_data))
print("Unusual data size: ", len(taxi_data[mask]))
print("Unusual data size porpotion: ", len(taxi_data[mask])/len(taxi_data))

total data size:  22699
Unusual data size:  14
Unusual data size porpotion:  0.0006167672584695361


In [7]:
mask = ((taxi_data["fare_amount"] < 0) | (taxi_data["extra"] < 0) | (taxi_data["mta_tax"] < 0) | (taxi_data["improvement_surcharge"] < 0) | (taxi_data["total_amount"] < 0))
taxi_data = taxi_data[~mask]

Identified some unexpected data points (e.g., negative fare amounts). Given their limited quantity, we opted to remove them to minimize their impact on the A/B test results.

Next, investigate the relationship between payment type and the fare amount the customer
pays. 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



In [8]:
replacements = {1: 'Credit card', 2: 'Cash', 3: 'No charge', 4: 'Dispute', 5: 'Unknown'}

In [9]:
fare_amount_by_pay_type = taxi_data[["fare_amount","payment_type"]].groupby("payment_type").mean().reset_index()
fare_amount_by_pay_type['payment_type'] = fare_amount_by_pay_type['payment_type'].replace(to_replace=replacements)
fare_amount_by_pay_type

Unnamed: 0,payment_type,fare_amount
0,Credit card,13.429748
1,Cash,12.213546
2,No charge,13.127368
3,Dispute,15.320513


Based on the averages shown, it appears that customers who pay in credit card tend to pay a larger fare amount than customers who pay in cash. However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, you conduct a hypothesis test.


### Task 3. Hypothesis testing

$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.

Choose 5% as the significance level and proceed with a two-sample t-test.

In [10]:
#hypothesis test, A/B test
#significance level

alpha = 0.05
mask = taxi_data["payment_type"]== 1
credit_cards=taxi_data[mask]
mask = taxi_data["payment_type"]== 2
cash=taxi_data[mask]
stats.ttest_ind(credit_cards["fare_amount"], taxi_data["fare_amount"], equal_var=False)

Ttest_indResult(statistic=2.725027296585607, pvalue=0.006433145056446846)

Since the p-value is significantly smaller than the significance level of 5%, reject the null hypothesis.
The concludsion is that there is a statistically significant difference in the average fare amount between
customers who use credit cards and customers who use cash.

### Task 4. Communicate insights with stakeholders

The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers.

*** Please note that this is educaiton project, this project requires an assumption that passengers were forced to pay one way or the other,
and that once informed of this requirement, they always complied with it. The data was
not collected this way; so, an assumption had to be made to randomly group data entries
to perform an A/B test. This dataset does not account for other likely explanations. For
example, riders might not carry lots of cash, so it’s easier to pay for longer/farther trips with
a credit card. In other words, it’s far more likely that fare amount determines payment type,
rather than vice versa.