# Automatidata project 
**Course 4 - The Power of Statistics**

You are a data professional at Automatidata, a data analytics firm. The current project for its newest client, the New York City Taxi & Limousine Commission (New York City TLC), is reaching its midpoint: the team has completed a project proposal, Python coding work, and exploratory data analysis.

You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about a new request from the New York City TLC: to analyze the relationship between fare amount and payment type. You also discover follow-up emails from three other team members: Deshawn Washington, Luana Rodriguez, and Udo Bankole. These emails discuss the details of the analysis. A final email from Luana includes your specific assignment: to conduct an A/B test. 


# Course 4 End-of-course project: Statistical analysis

In this activity, you will explore the data provided and conduct A/B and hypothesis testing.  
<br/>   

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests.
  
**The goal** is to apply descriptive statistics and hypothesis testing in Python.

<br/>  
*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How did computing descriptive statistics help you analyze your data? 

* How did you formulate your null hypothesis and alternative hypothesis? 

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your A/B test?

* What business recommendations do you propose based on your results?

# **Conduct an A/B test**
 
In this activity, you will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. 

**The purpose** of this A/B test is to find ways to generate more revenue for taxi cab drivers. 

**Note:** For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.
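The random assignment assumed in the note can be sketched as follows. This is a minimal illustration, not part of the TLC data collection: the `customer_ids` array and the group sizes are hypothetical.

```python
import numpy as np

# Hypothetical customer identifiers (assumption for illustration only)
customer_ids = np.arange(1000)

# Shuffle the customers with a seeded generator so the split is reproducible
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(customer_ids)

# Split the shuffled customers into two equally sized groups
half = len(shuffled) // 2
credit_group = shuffled[:half]   # required to pay with credit card
cash_group = shuffled[half:]     # required to pay with cash

print(len(credit_group), len(cash_group))
```

Because each customer lands in a group by chance alone, systematic differences between the groups other than payment method are unlikely, which is what allows a causal reading of the result.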

**The goal** for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.

*This activity has two parts:*

**Part 1:** Exploratory data analysis 
Explore the NYC Taxi dataset with Python using a Jupyter notebook. This includes: 

* Computing descriptive statistics

**Part 2:** Hypothesis testing with Python

* Conducting a two-sample hypothesis test



## PACE: Plan 

In this stage, consider the following questions where applicable to complete your code response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


The purpose of this A/B test is to find ways to generate more revenue for taxi cab drivers. 

We will conduct an A/B test on the dataset from the New York City Taxi & Limousine Commission (TLC). The purpose of the A/B test is to analyze the relationship between fare amount and payment type. 

After conducting the A/B test and interpreting the results, an executive summary must be written to communicate the key findings of the analysis to the team. If the analysis concludes that there is a statistically significant relationship between credit card payment and fare amount, the task also requires proposing next steps or strategies to encourage customers to pay with a credit card.

Research question: The hypothesis to test is whether customers who use a credit card pay higher fare amounts than those who pay with cash.

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [1]:
import numpy as np
import pandas as pd
import seaborn.objects as so
from scipy import stats

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load the dataset, using the first CSV column as the index
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

## PACE: **Analyze and Construct**

In this stage, consider the following questions where applicable to complete your code response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


Computing descriptive statistics helps us understand the distribution of our numeric data: its central tendency (mean, median, mode) and its spread (range, standard deviation). These statistics offer insight into the normality and skewness of the data and the presence of outliers. We can also consider which distribution the data resembles, such as the normal, binomial, or Poisson distribution. This prepares us for later tasks such as machine learning.

Descriptive statistics also help identify patterns and relationships. For example, correlation coefficients can indicate whether variables in the dataset are related. We can check for collinearity and multicollinearity, which violate the assumptions of some statistical models.

Finally, descriptive statistics support data cleaning and provide a concise summary of the data.
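As a minimal sketch of these ideas, the example below computes the measures of central tendency and spread mentioned above, plus skewness and a simple outlier check. The `fares` values are hypothetical and chosen for illustration; they are not taken from the TLC dataset, and the 1.5 × IQR rule is one common convention rather than the only option.

```python
import pandas as pd

# Hypothetical fare values for illustration only (not from the TLC dataset)
fares = pd.Series([6.5, 9.0, 9.5, 9.5, 13.0, 16.0, 20.5, 52.0])

summary = {
    "mean": fares.mean(),
    "median": fares.median(),
    "mode": fares.mode().iloc[0],
    "range": fares.max() - fares.min(),
    "std": fares.std(),    # sample standard deviation (ddof=1)
    "skew": fares.skew(),  # a positive value suggests a long right tail
}

# Flag potential outliers with the 1.5 * IQR rule
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

print(summary)
print(outliers.tolist())
```

In this toy series the mean sits well above the median, which is the kind of gap that hints at right skew and high-value outliers.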

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



In [3]:
taxi_data.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
| 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.8 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.0 | 0.0 | 0.3 | 20.8 |
| 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.0 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
| 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.7 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
| 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 17.8 |


In [4]:
# Prints the shape of the dataframe
taxi_data.shape

(22699, 17)

In [5]:
# Prints the data types of the dataframe
taxi_data.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
RatecodeID                 int64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

In [6]:
# summary of the descriptive statistics
taxi_data.describe()

| | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 |
| mean | 1.556236 | 1.642319 | 2.913313 | 1.043394 | 162.412353 | 161.527997 | 1.336887 | 13.026629 | 0.333275 | 0.497445 | 1.835781 | 0.312542 | 0.299551 | 16.310502 |
| std | 0.496838 | 1.285231 | 3.653171 | 0.708391 | 66.633373 | 70.139691 | 0.496211 | 13.243791 | 0.463097 | 0.039465 | 2.800626 | 1.399212 | 0.015673 | 16.097295 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -120.0 | -1.0 | -0.5 | 0.0 | 0.0 | -0.3 | -120.3 |
| 25% | 1.0 | 1.0 | 0.99 | 1.0 | 114.0 | 112.0 | 1.0 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 8.75 |
| 50% | 2.0 | 1.0 | 1.61 | 1.0 | 162.0 | 162.0 | 1.0 | 9.5 | 0.0 | 0.5 | 1.35 | 0.0 | 0.3 | 11.8 |
| 75% | 2.0 | 2.0 | 3.06 | 1.0 | 233.0 | 233.0 | 2.0 | 14.5 | 0.5 | 0.5 | 2.45 | 0.0 | 0.3 | 17.8 |
| max | 2.0 | 6.0 | 33.96 | 99.0 | 265.0 | 265.0 | 4.0 | 999.99 | 4.5 | 0.5 | 200.0 | 19.1 | 0.3 | 1200.29 |


In [7]:
# filter the fare amounts and payment type

taxi_data = (taxi_data
              .filter(['payment_type','fare_amount'])
              )

In [8]:
taxi_data.head()

| | payment_type | fare_amount |
| --- | --- | --- |
| 24870114 | 1 | 13.0 |
| 35634249 | 1 | 16.0 |
| 106203690 | 1 | 6.5 |
| 38942136 | 1 | 20.5 |
| 30841670 | 2 | 16.5 |


In [9]:
# filter for credit card and cash only.

taxi_data = (taxi_data
              .query('(payment_type == 1) or (payment_type == 2)')
             )

In [10]:
taxi_data

| | payment_type | fare_amount |
| --- | --- | --- |
| 24870114 | 1 | 13.0 |
| 35634249 | 1 | 16.0 |
| 106203690 | 1 | 6.5 |
| 38942136 | 1 | 20.5 |
| 30841670 | 2 | 16.5 |
| ... | ... | ... |
| 14873857 | 2 | 4.0 |
| 66632549 | 1 | 52.0 |
| 74239933 | 2 | 4.5 |
| 60217333 | 1 | 10.5 |


You are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type.

In [11]:
payment_type = (taxi_data
            .groupby(by = 'payment_type')
            .agg(['mean','median','std'])
            )

In [12]:
# Credit card is payment type 1 and cash is type 2
payment_type

| payment_type | fare_amount (mean) | fare_amount (median) | fare_amount (std) |
| --- | --- | --- | --- |
| 1 | 13.429748 | 9.5 | 13.848964 |
| 2 | 12.213546 | 9.0 | 11.68994 |


Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay in cash. However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, you conduct a hypothesis test.


### Task 3. Hypothesis testing

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypothesis. Consider your hypotheses for this project as listed below.

Research question: The hypothesis to test is whether customers who use a credit card pay higher fare amounts than those who pay with cash.

$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 
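The four steps above can be sketched in code as follows. This is a minimal illustration on synthetic data: `group_a` and `group_b` are hypothetical samples generated for the example, not the TLC fare data.

```python
import numpy as np
from scipy import stats

# Step 1: state the hypotheses.
# H0: the two group means are equal. HA: the two group means differ.
# Synthetic groups for illustration only (assumption, not the TLC data).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=13.0, scale=4.0, size=500)
group_b = rng.normal(loc=12.0, scale=4.0, size=500)

# Step 2: choose a significance level.
alpha = 0.05

# Step 3: find the p-value with a two-sample (Welch's) t-test,
# which does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Step 4: reject or fail to reject the null hypothesis.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4g}: {decision}")
```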



**Note:** For the purpose of this exercise, your hypothesis test is the main component of your A/B test. 

You choose 5% as the significance level and proceed with a two-sample t-test.

In [13]:
# Subset the data into groups for hypothesis testing
# (boolean-mask indexing: works, but is harder to read)

credit = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']

cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

In [14]:
# Subset the data into groups for hypothesis testing
# (method chaining with query: reads like a recipe, clear and concise)

credit = (taxi_data
          .query('payment_type == 1')
          .fare_amount
          )

cash = (taxi_data
        .query('payment_type == 2')
        .fare_amount
        )

In [15]:
# For this analysis, the significance level is 5%

significance_level = 0.05

In [16]:
# hypothesis testing
t_stat, p_val = stats.ttest_ind(credit, cash, equal_var=False)

In [17]:
t_stat

6.866800855655372

In [18]:
p_val

6.797387473030518e-12

In [19]:
credit.shape

(15265,)

In [20]:
cash.shape

(7267,)
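As a cross-check, the t-statistic can be recomputed from the group summaries already shown in this notebook: the means and standard deviations from the grouped table, and the sample sizes from the two `shape` outputs. The sketch below applies the Welch formula by hand and then calls `scipy.stats.ttest_ind_from_stats`. Because the inputs are the rounded values displayed above, the result agrees with the notebook's statistic only to a few decimal places.

```python
import math
from scipy import stats

# Summary values copied from the notebook outputs (rounded as displayed)
mean_credit, std_credit, n_credit = 13.429748, 13.848964, 15265
mean_cash, std_cash, n_cash = 12.213546, 11.68994, 7267

# Welch's t-statistic: difference in means over the unpooled standard error
standard_error = math.sqrt(std_credit**2 / n_credit + std_cash**2 / n_cash)
t_manual = (mean_credit - mean_cash) / standard_error

# The same test computed by SciPy directly from summary statistics
t_scipy, p_scipy = stats.ttest_ind_from_stats(
    mean_credit, std_credit, n_credit,
    mean_cash, std_cash, n_cash,
    equal_var=False,
)

print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

Both values land close to the 6.8668 reported by `stats.ttest_ind`, which confirms that the test statistic is driven by a difference of about 1.22 in mean fare relative to a standard error of about 0.18.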


## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4. Communicate insights with stakeholders

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this pedagogical project.

With a p-value of about 6.8e-12, far below the 5% significance level, we reject the null hypothesis in favor of the alternative hypothesis.

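We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash. Because the t-statistic is positive and the credit card group has the higher mean (13.43 versus 12.21), the difference runs in the direction of the research question: customers who use a credit card pay higher fare amounts, on average, than those who pay with cash.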
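The hypotheses stated earlier are two-sided, while the research question is directional. As a minimal sketch, assuming the t-statistic and two-sided p-value reported in the notebook output, the one-sided p-value for "the credit card mean is greater than the cash mean" follows from the symmetry of the t distribution: when the statistic is positive, it is half the two-sided p-value.

```python
# Values copied from the notebook output of
# stats.ttest_ind(credit, cash, equal_var=False)
t_stat = 6.866800855655372
p_two_sided = 6.797387473030518e-12
significance_level = 0.05

# One-sided p-value for HA: mean(credit) > mean(cash).
# A positive statistic points in the hypothesized direction, so the
# one-sided p-value is half the two-sided one; otherwise it is the complement.
if t_stat > 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

reject_null = p_one_sided < significance_level
print(p_one_sided, reject_null)
```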

There are some assumptions and limitations to consider. The test suggests a significant difference in fare amounts between customers who use credit cards and those who use cash, but it does not by itself imply causation: a causal reading depends on the random-assignment assumption made for this exercise, which real trip data is unlikely to satisfy. The sample may also not be representative of all taxi rides. The groups are unbalanced, with more than twice as many credit card trips (15,265) as cash trips (7,267), and unbalanced classes can introduce bias and other limitations.

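Finally, the t-test assumes that the data are approximately normally distributed. If this assumption is not met, the results of the t-test may be unreliable, although with samples this large the test is generally robust to moderate departures from normality.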
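One way to probe the normality concern is to measure skewness and compare Welch's t-test with a rank-based alternative. The sketch below uses synthetic right-skewed data (an assumption for illustration, not the TLC fares) and `scipy.stats.mannwhitneyu`, which does not assume normality but tests a somewhat different hypothesis: whether values in one group tend to be larger than in the other, rather than whether the means differ.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed samples for illustration only (not the TLC data)
rng = np.random.default_rng(seed=42)
group_a = rng.lognormal(mean=2.4, sigma=0.6, size=2000)
group_b = rng.lognormal(mean=2.3, sigma=0.6, size=1000)

# Sample skewness: values well above 0 indicate a long right tail
skew_a = stats.skew(group_a)
skew_b = stats.skew(group_b)

# Welch's t-test (compares means) and Mann-Whitney U (rank-based)
t_stat, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"skewness: {skew_a:.2f}, {skew_b:.2f}")
print(f"Welch p = {p_welch:.4g}, Mann-Whitney p = {p_mw:.4g}")
```

If both tests point to the same decision at the chosen significance level, the conclusion is less likely to be an artifact of the normality assumption.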