# Automatidata project
### The Power of Statistics
You are a data professional at a data analytics firm called Automatidata. The current project for
its newest client, the New York City Taxi & Limousine Commission (New York City TLC), is
reaching its midpoint: the project proposal, Python coding work, and exploratory
data analysis are complete.
You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about
a new request from the New York City TLC: to analyze the relationship between fare amount
and payment type. You also discover follow-up emails from three other team members: Deshawn
Washington, Luana Rodriguez, and Udo Bankole. These emails discuss the details of the analysis.
A final email from Luana includes your specific assignment: to conduct an A/B test.

### Statistical analysis
In this activity, you will explore the data provided and conduct A/B and hypothesis testing.
The purpose of this project is to demonstrate knowledge of how to prepare, create, and analyze
A/B tests.
The goal is to apply descriptive statistics and hypothesis testing in Python.
This activity has three parts:
#### Part 1: Imports and data loading 
* What data packages will be necessary for hypothesis testing?
#### Part 2: Conduct hypothesis testing 
* How did computing descriptive statistics help you analyze your data?
* How did you formulate your null hypothesis and alternative hypothesis?
#### Part 3: Communicate insights with Stakeholders
* What key business insight(s) emerged from your A/B test?
* What business recommendations do you propose based on your results?
Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.
Recall that you have a helpful tool at your disposal! Refer to the PACE strategy document to help apply your learnings, apply new problem-solving skills, and guide your approach to this project.

### Conduct an A/B test
In this activity, you will practice using statistics to analyze and interpret data. The activity covers
fundamental concepts such as descriptive statistics and hypothesis testing.
The purpose of this A/B test is to find ways to generate more revenue for taxi cab drivers.
Note: For the purpose of this exercise, assume that the sample data comes from an experiment in
which customers are randomly selected and divided into two groups: 1) customers who are required
to pay with credit card, 2) customers who are required to pay with cash. Without this assumption,
we cannot draw causal conclusions about how payment method affects fare amount.
    
The goal for this A/B test is to sample data and analyze whether there is a relationship between
payment type and fare amount. For example: discover if customers who use credit cards pay higher
fare amounts than customers who use cash.
    
This activity has two parts:
Part 1: Exploratory data analysis. Explore the NYC Taxi dataset with Python using a Jupyter
notebook. This includes:
* Computing descriptive statistics
Part 2: Hypothesis testing with Python
* Conducting a two-sample hypothesis test

### PACE stages
* [Plan](#scrollTo=psz51YkZVwtN&line=3&uniqifier=1)
* [Analyze](#scrollTo=mA7Mz_SnI8km&line=4&uniqifier=1)
* [Construct](#scrollTo=Lca9c8XON8lc&line=2&uniqifier=1)
* [Execute](#scrollTo=401PgchTPr4E&line=2&uniqifier=1)

### PACE: Plan
In this stage, consider the following questions where applicable to complete your code response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.
 1. Answer: The research question for this data project: “Is there a relationship between total 
fare amount and payment type?”
Complete the following steps to perform statistical analysis of your data.


### Task 1. Imports and data loading
Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis
test.
Hint:
Before you begin, recall the following Python packages and functions that may be useful:
Main functions: stats.ttest_ind(a, b, equal_var)
Other functions: mean()
Packages: pandas, scipy.stats

In [2]:
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
taxi_data = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv', index_col=0)

### PACE: Analyze and Construct
In this stage, consider the following questions where applicable to complete your code response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing
descriptive statistics help you learn more about your data in this stage of your analysis?
1. Answer: In general, descriptive statistics are useful because they let you quickly explore and
understand large amounts of data. In this case, computing descriptive statistics helps you
quickly compare the average total fare amount among different payment types.

### Task 2. Data exploration
Use descriptive statistics to conduct Exploratory Data Analysis (EDA).
Hint:
Refer back to Self Review Descriptive Statistics for this step-by-step process.
Note: In the dataset, payment_type is encoded in integers:
* 1: Credit card
* 2: Cash
* 3: No charge
* 4: Dispute
* 5: Unknown
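
To make grouped output easier to read, the integer codes can be mapped to the labels listed in the note. The following is a minimal sketch on a small made-up frame; the `payment_labels` dictionary, the `sample` frame, and the `payment_label` column are illustrative names, not part of the dataset.

```python
import pandas as pd

# Labels taken from the note on payment_type; the dictionary name is an assumption.
payment_labels = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Tiny made-up sample standing in for taxi_data.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 4],
    "total_amount": [17.8, 9.3, 22.1, 0.0, 11.2],
})

# Translate each integer code to its label and count trips per label.
sample["payment_label"] = sample["payment_type"].map(payment_labels)
label_counts = sample["payment_label"].value_counts()
print(label_counts)
```

The same `map` call could be applied to `taxi_data['payment_type']` if labeled output is preferred over the raw codes.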


In [5]:
# Descriptive statistics for EDA
taxi_data.describe(include='all')

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22699.0 | 22699 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 |
| unique | | 22687 | 22688 | | | | 2 | | | | | | | | | | |
| top | | 07/03/2017 3:45:19 PM | 10/18/2017 8:07:45 PM | | | | N | | | | | | | | | | |
| freq | | 2 | 2 | | | | 22600 | | | | | | | | | | |
| mean | 1.556236 | | | 1.642319 | 2.913313 | 1.043394 | | 162.412353 | 161.527997 | 1.336887 | 13.026629 | 0.333275 | 0.497445 | 1.835781 | 0.312542 | 0.299551 | 16.310502 |
| std | 0.496838 | | | 1.285231 | 3.653171 | 0.708391 | | 66.633373 | 70.139691 | 0.496211 | 13.243791 | 0.463097 | 0.039465 | 2.800626 | 1.399212 | 0.015673 | 16.097295 |
| min | 1.0 | | | 0.0 | 0.0 | 1.0 | | 1.0 | 1.0 | 1.0 | -120.0 | -1.0 | -0.5 | 0.0 | 0.0 | -0.3 | -120.3 |
| 25% | 1.0 | | | 1.0 | 0.99 | 1.0 | | 114.0 | 112.0 | 1.0 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 8.75 |
| 50% | 2.0 | | | 1.0 | 1.61 | 1.0 | | 162.0 | 162.0 | 1.0 | 9.5 | 0.0 | 0.5 | 1.35 | 0.0 | 0.3 | 11.8 |
| 75% | 2.0 | | | 2.0 | 3.06 | 1.0 | | 233.0 | 233.0 | 2.0 | 14.5 | 0.5 | 0.5 | 2.45 | 0.0 | 0.3 | 17.8 |


You are interested in the relationship between payment type and the total fare amount the customer
pays. One approach is to look at the average total fare amount for each payment type.

In [6]:
taxi_data.groupby('payment_type')['total_amount'].mean()

payment_type
1    17.663577
2    13.545821
3    13.579669
4    11.238261
Name: total_amount, dtype: float64

Based on the averages shown, it appears that customers who pay with a credit card tend to pay a
larger total fare amount than customers who pay with cash. However, this difference might arise from
random sampling, rather than being a true difference in total fare amount. To assess whether the
difference is statistically significant, you conduct a hypothesis test.
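
A comparison of means is easier to judge alongside each group's size and spread, since both affect how much a difference could vary from sample to sample. The following is a minimal sketch on made-up numbers; the `sample` frame is a hypothetical stand-in for `taxi_data`, and its values are not taken from the dataset.

```python
import pandas as pd

# Made-up fares for illustration only; not drawn from the taxi dataset.
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "total_amount": [18.0, 20.0, 22.0, 10.0, 12.0, 14.0],
})

# Count, mean, and sample standard deviation of total_amount per payment type.
summary = sample.groupby("payment_type")["total_amount"].agg(["count", "mean", "std"])
print(summary)
```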

### Task 3. Hypothesis testing
Before you conduct your hypothesis test, consider the following questions where applicable to
complete your code response:
1. Recall the difference between the null hypothesis and the alternative hypothesis. What are
your hypotheses for this data project?
1. Answer: Null hypothesis: There is no difference in average total fare between customers
who use credit cards and customers who use cash. Alternative hypothesis: There is a difference
in average total fare between customers who use credit cards and customers who use cash.
Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis
test:
1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis
##### Note: For the purpose of this exercise, your hypothesis test is the main component of your A/B test.
##### H0: There is no difference in the average total fare amount between customers who use credit cards and customers who use cash.
##### HA: There is a difference in the average total fare amount between customers who use credit cards and customers who use cash.
##### You choose 5% as the significance level and proceed with a two-sample t-test.
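
The four steps can be sketched on a small made-up sample before running the test on the real data. The two lists and the names `alpha` and `reject_null` are illustrative assumptions, not values from the taxi dataset.

```python
from scipy import stats

# Made-up fares for illustration only; not drawn from the taxi dataset.
credit_card_sample = [18.0, 20.0, 22.0, 19.0, 21.0, 23.0, 17.0, 20.0]
cash_sample = [10.0, 12.0, 14.0, 11.0, 13.0, 15.0, 9.0, 12.0]

# Step 1 (hypotheses) is stated in prose: H0 says the two group means are equal.

# Step 2: choose a significance level (5%).
alpha = 0.05

# Step 3: find the p-value with a two-sample t-test.
# equal_var=False requests Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(a=credit_card_sample, b=cash_sample, equal_var=False)

# Step 4: reject the null hypothesis only if the p-value is below alpha.
reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```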

In [7]:
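# Hypothesis test (A/B test) at the 5% significance level
# Compare total_amount for credit card (1) vs. cash (2); equal_var=False runs Welch's t-test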
credit_card = taxi_data[taxi_data['payment_type'] == 1]['total_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['total_amount']
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)


TtestResult(statistic=20.34644022783838, pvalue=4.5301445359736376e-91, df=19245.398563776336)

Since the p-value is extremely small (much smaller than the significance level of 5%), you reject
the null hypothesis. You conclude that there is a statistically significant difference in the average
total fare amount between customers who use credit cards and customers who use cash.
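
Setting `equal_var=False` makes `stats.ttest_ind` run Welch's t-test, which is why the reported degrees of freedom are not a whole number. As a rough sketch of what that calculation involves, the statistic can be reproduced by hand on a small made-up sample. The arrays and variable names here are illustrative only and do not come from the taxi dataset.

```python
import numpy as np
from scipy import stats

# Made-up fares for illustration only; not drawn from the taxi dataset.
credit_card_sample = np.array([18.0, 20.0, 22.0, 19.0, 21.0, 23.0, 17.0, 20.0])
cash_sample = np.array([10.0, 12.0, 14.0, 11.0, 13.0])

# Variance of each sample mean: sample variance divided by group size.
var_mean_cc = credit_card_sample.var(ddof=1) / credit_card_sample.size
var_mean_cash = cash_sample.var(ddof=1) / cash_sample.size

# Welch's t-statistic: difference in means over its standard error.
t_manual = (credit_card_sample.mean() - cash_sample.mean()) / np.sqrt(
    var_mean_cc + var_mean_cash
)

# Welch-Satterthwaite degrees of freedom (generally not an integer).
df_manual = (var_mean_cc + var_mean_cash) ** 2 / (
    var_mean_cc ** 2 / (credit_card_sample.size - 1)
    + var_mean_cash ** 2 / (cash_sample.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# Cross-check against SciPy's implementation.
result = stats.ttest_ind(a=credit_card_sample, b=cash_sample, equal_var=False)
print(t_manual, df_manual, p_manual)
print(result.statistic, result.pvalue)
```

The hand-computed statistic and p-value should agree with SciPy's result up to floating-point error.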

### PACE: Execute
Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4. Communicate insights with stakeholders
In conclusion, ask yourself the following questions:
1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be
made for this pedagogical project.
### Responses: 
1. The key business insight is that encouraging customers to pay with
credit cards will likely generate more revenue for taxi cab drivers.