## Task: Conduct an A/B test

In [1]:
# Import the necessary packages
import pandas as pd
from scipy import stats

In [2]:
# Read the dataset
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col=0)

#### Data exploration

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
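
For readability in later tables, the integer codes can be mapped to the labels listed above. This is a minimal sketch: the `sample` frame holds made-up rows standing in for `taxi_data`, and the names `PAYMENT_LABELS` and `payment_label` are illustrative and not part of the dataset.

```python
import pandas as pd

# Labels taken from the note above; the dictionary name is illustrative.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Made-up rows standing in for taxi_data.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 4],
    "fare_amount": [13.0, 9.5, 7.0, 52.0],
})

# Series.map looks up each code in the dictionary; codes missing from it become NaN.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

Keeping the original integer column alongside the label column leaves later filters such as `payment_type == 1` unchanged.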

In [6]:
# Descriptive stats for the EDA
taxi_data.describe(include='all')

,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,04/15/2017 6:05:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


**Note:** We are interested in the relationship between payment type and the fare amount the customer pays. A first step is to compare the average fare amount for each payment type.

In [5]:
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

**Note:** Based on the averages shown, customers who pay by credit card appear to pay a larger fare on average (about 13.43) than customers who pay in cash (about 12.21). However, this difference might reflect sampling variability rather than a true difference in fare amount. To assess whether the difference is statistically significant, we need to conduct a hypothesis test.
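
The means alone do not show how many trips fall in each group or how spread out the fares are, and both determine how much a difference in means can vary from sample to sample. The sketch below reports count, mean, and standard deviation per payment type. It assumes a small made-up frame (`sample`) in place of `taxi_data`, so the numbers it prints are not results from the dataset.

```python
import pandas as pd

# Made-up fares standing in for taxi_data (1 = credit card, 2 = cash).
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 12.0, 14.0, 16.0, 8.0, 10.0, 12.0],
})

# Count, mean, and sample standard deviation (ddof=1) of the fare per payment type.
summary = sample.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

The same `agg` call could be applied to `taxi_data` to see the group sizes and spreads behind the averages.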

#### Hypothesis testing

**Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash.  
**Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.  
**Significance level**: 5%
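
In symbols, the hypotheses above can be written as follows, where $\mu_{\text{card}}$ and $\mu_{\text{cash}}$ (notation introduced here for illustration) denote the population mean fares for the two payment types and $\alpha$ is the significance level:

$$
H_0:\ \mu_{\text{card}} = \mu_{\text{cash}}, \qquad H_A:\ \mu_{\text{card}} \neq \mu_{\text{cash}}, \qquad \alpha = 0.05
$$

Because the alternative is two-sided, a large difference in either direction counts as evidence against $H_0$.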

In [13]:
# Hypothesis test: Welch's two-sample t-test (equal_var=False does not assume equal variances)

significance_level = 0.05

# Fare amounts for each group (payment_type 1 = credit card, 2 = cash)
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

test_results = stats.ttest_ind(a=credit_card, b=cash, equal_var=False)
print(test_results)

# Compare the two-sided p-value with the significance level
if test_results.pvalue < significance_level:
    print('You reject the Null hypothesis')
else:
    print('You fail to reject the Null hypothesis')

TtestResult(statistic=np.float64(6.866800855655372), pvalue=np.float64(6.797387473030518e-12), df=np.float64(16675.48547403633))
You reject the Null hypothesis
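
To make the reported `statistic` and `df` less of a black box, the sketch below computes a Welch t statistic and its Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The helper `welch_t` and the two short lists are illustrative assumptions, not part of the analysis above. The p-value is left to `scipy`, since the standard library has no t-distribution.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(a), len(b)
    # Squared standard error contributed by each sample: sample variance / n
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    # Difference in means divided by the combined standard error
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up fares, not taken from the dataset
card = [10.0, 12.0, 14.0, 16.0]
cash_fares = [8.0, 10.0, 12.0]

t_stat, dof = welch_t(card, cash_fares)
print(t_stat, dof)
```

The degrees of freedom are generally not an integer, which is why the output above reports a fractional `df`.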


#### Conclusion
Since the p-value (about 6.8e-12) is less than the 5% significance level, we reject the null hypothesis. We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash. In this sample, the credit card average is about 1.22 higher (13.43 versus 12.21).
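
Statistical significance says a difference exists but not how large it is. As a rough sketch, an approximate 95% confidence interval for the difference in means can be recovered from numbers already printed in this notebook: the two group means and the t statistic. It relies on two assumptions. First, the standard error is back-computed as the difference divided by t. Second, a normal critical value is used in place of the t critical value, which is reasonable only because the degrees of freedom are large.

```python
from statistics import NormalDist

# Values copied from the outputs shown earlier in the notebook
mean_card = 13.429748
mean_cash = 12.213546
t_statistic = 6.866800855655372

diff = mean_card - mean_cash

# t = diff / standard_error, so the standard error follows from the reported t
std_err = diff / t_statistic

# Normal approximation to the t critical value (assumption: df is about 16,675)
z = NormalDist().inv_cdf(0.975)

ci_low = diff - z * std_err
ci_high = diff + z * std_err
print(f"difference = {diff:.2f}, approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the 5% level.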