# **Yellow Taxi Trip A-B Testing**

**The goal** of this third notebook is to discover whether there is a relationship between the fare amount and the payment type. We will then create a visualization of our findings and add that visualization to the report we share with stakeholders.

**Part 1:** Load the Data

**Part 2:** Prepare the Data

**Part 3:** Construct the A-B Test

**Part 4:** Share Findings with Stakeholders

## **1: Load the Data**

### **Build dataframe**

In [8]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)


## **2: Prepare the Data**

### **Clean Data**

In [9]:
# Drop rows with missing values
taxi_data = taxi_data.dropna(axis=0)


### **Preliminary Data Exploration**

We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type.

In [10]:
# Compute the mean `fare_amount` for each group in `payment_type`
taxi_data.groupby('payment_type')['fare_amount'].mean()


payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

**Note:** In the dataset, `payment_type` is encoded as integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay in cash. However, this difference might be due to sampling variability rather than a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
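Group means alone say nothing about how many trips each group contains or how spread out the fares are, and both matter for the hypothesis test. A minimal sketch of a fuller per-group summary follows. It uses a tiny hypothetical DataFrame named `toy` (not the taxi data) that only borrows the notebook's two column names.

```python
import pandas as pd

# Tiny hypothetical frame (NOT the taxi data); it reuses the notebook's column names
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 20.0, 15.0, 8.0, 12.0, 10.0],
})

# Per-group sample size, mean, and sample standard deviation of the fare
summary = toy.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

The same `agg` call could be applied to `taxi_data` to see the sample sizes and spread behind the averages shown above.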

## **3: Construct the A-B Test**

### **Hypothesis Testing**

- **Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash (any observed difference in the sample data is due to chance or sampling variability).
- **Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash (any observed difference in the sample data is due to an actual difference in the corresponding population means).

### **Significance Level**

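We choose 5% as the significance level and proceed with a two-sample t-test. Because the test is run with `equal_var=False`, it is Welch's t-test, which does not assume that the two groups have equal variances.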

### **Find P-Value**

In [12]:
# Conduct a two-sample t-test to compare means

# Save each sample in a variable
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

# Implement a t-test using the two samples
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)


TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)
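To make the result above less of a black box, the following sketch computes Welch's t statistic and its degrees of freedom by hand, compares them with `stats.ttest_ind(..., equal_var=False)`, and applies the 5% decision rule. The two samples are synthetic stand-ins generated with NumPy, not the taxi data, and the helper `welch_t` is a hypothetical name introduced only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two fare samples (NOT the taxi data)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=13.4, scale=13.0, size=500)
sample_b = rng.normal(loc=12.2, scale=11.0, size=400)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df


t_manual, df_manual = welch_t(sample_a, sample_b)
# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# The library call used in the notebook, applied to the synthetic samples
result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)

alpha = 0.05  # significance level chosen in this notebook
reject_null = p_manual < alpha

print(f"manual: t = {t_manual:.4f}, df = {df_manual:.1f}, p = {p_manual:.4g}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4g}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

The manual and library values should agree up to floating-point error, which confirms that `equal_var=False` selects Welch's version of the test.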

### **Hypothesis Result**

Since the p-value is far smaller than the significance level of 5%, we reject the null hypothesis.

Our conclusion is that there **is** a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.

## **4: Share Findings with Stakeholders**

### **Conclusions**

The analysis shows that there is a statistically significant difference in the average `fare_amount` between the credit card `payment_type` and the cash `payment_type`. This suggests it might be more profitable for the taxi company to encourage payments by credit card.

However, this assumes that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. **The data was not collected this way**, so treating the data entries as randomly assigned groups in order to perform an A/B test rests on an assumption that might not necessarily be true.

This dataset does not account for other likely explanations. **For example**, riders might not carry lots of cash, so it's easier to pay for longer trips with a credit card. _In other words, it's far more likely that fare amount determines payment type, rather than vice versa._
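One way to probe this alternative explanation is to check whether card-paid trips are simply longer, and then compare fares within trips of similar length. The sketch below uses a small hypothetical DataFrame, not the taxi data. The column name `trip_distance` and the 3-unit cutoff for a "long" trip are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical rows (NOT the taxi data); `trip_distance` is an assumed column name
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [1.0, 5.0, 9.0, 1.0, 2.0, 3.0],
    "fare_amount": [6.0, 18.0, 30.0, 6.0, 9.0, 12.0],
})

# Step 1: do the payment types differ in average trip length?
distance_by_type = toy.groupby("payment_type")["trip_distance"].mean()

# Step 2: label trips as short or long using an arbitrary, assumed cutoff
toy["trip_length"] = toy["trip_distance"].gt(3).map({True: "long", False: "short"})

# Step 3: compare average fares by payment type within each trip-length group
fare_by_group = toy.groupby(["trip_length", "payment_type"])["fare_amount"].mean()

print(distance_by_type)
print(fare_by_group)
```

If the fare gap between payment types shrinks or disappears once trips of similar length are compared, that would be consistent with trip length, rather than payment method, driving the difference.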

### **Next Steps**

The key business insight is that encouraging customers to pay with credit cards _may_ generate more revenue for taxi cab drivers. For example, the taxi company can install signs that read “Credit card payments are preferred” in their cabs, and implement a protocol that requires cab drivers to verbally inform customers that credit card payments are preferred.