In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# wrangle data
import pandas as pd
import numpy as np

# Exploring
import scipy.stats as stats
import pandas_profiling

# Visualizing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# default pandas decimal number display format
pd.options.display.float_format = '{:20,.2f}'.format

# my modules
import acquire
import summarize
import prepare
import env

### Acquire df

In [2]:
df = sns.load_dataset("tips")

In [3]:
pandas_profiling.ProfileReport(df)



### Statistical Testing 

Note:

- t-test: compares the mean of a continuous variable across the groups of a categorical variable
- chi2: tests whether two categorical variables are independent
- pearson r: measures the linear correlation between two continuous variables
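
As a rough sketch of the rule of thumb in the note above, the helper below picks a test from the dtypes of two columns. The function name `suggest_test` and the small frame are hypothetical, and the sketch assumes that "numeric dtype" means continuous, which does not hold for integer codes such as party size.

```python
import pandas as pd


def suggest_test(x: pd.Series, y: pd.Series) -> str:
    """Map a pair of columns to one of the three tests in the note.

    Assumption: a numeric dtype means "continuous" and anything else
    means "categorical". Integer codes need a manual override.
    """
    x_is_continuous = pd.api.types.is_numeric_dtype(x)
    y_is_continuous = pd.api.types.is_numeric_dtype(y)

    if x_is_continuous and y_is_continuous:
        return "pearson r"
    if x_is_continuous or y_is_continuous:
        return "t-test"
    return "chi2"


# Small hand-made frame (hypothetical values, not the tips data)
example = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.0, 6.5],
    "sex": ["Female", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "Yes"],
})

print(suggest_test(example.total_bill, example.tip))     # two continuous columns
print(suggest_test(example.smoker, example.total_bill))  # one of each kind
print(suggest_test(example.sex, example.smoker))         # two categorical columns
```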
    
    

- Perform a 1 sample T-test on the df

- $H_0$: The bill for smokers is no different from the population mean.

- $H_a$: The bill for smokers is different from the population mean.

In [11]:
smokers_total_bills = df[df.smoker == 'Yes'].total_bill
overall_total_bill_mean = df.total_bill.mean()

In [12]:
test_results = stats.ttest_1samp(smokers_total_bills, overall_total_bill_mean)
test_results

Ttest_1sampResult(statistic=0.951796790928544, pvalue=0.3436939512284921)

- I fail to reject the Null Hypothesis that the average bill for smokers is no different from the population mean based on the p-value of .3437.
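
The decision behind this conclusion can be written as a small helper. This is a sketch only: the name `decide` is hypothetical, and the default `alpha` of .05 is assumed to be the significance level used throughout this notebook.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conclusion of a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-value reported by ttest_1samp for the smokers' total bills
print(decide(0.3436939512284921))
```

Fixing `alpha` before looking at the p-value keeps the conclusion from drifting toward whatever the result happens to suggest.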

- $H_0$: The tips on Sunday are no different from the population mean.

- $H_a$: The tips on Sunday are different from the population mean.

In [13]:
# Sunday tips (a Series of individual tips, not a mean)
sunday_tips = df[df.day == "Sun"].tip
overall_tip_mean = df.tip.mean()

In [16]:
test_results = stats.ttest_1samp(sunday_tips, overall_tip_mean)
test_results

Ttest_1sampResult(statistic=1.8132863682799842, pvalue=0.0737884052452269)

- I fail to reject the Null Hypothesis that the tips on Sunday are no different from the population mean because the p-value of .0738 is above my alpha of .05. The result is close to the threshold, so it may be worth revisiting with more data, but it is not significant at this alpha.
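
To make clear what `stats.ttest_1samp` computes in the cells above, the sketch below reproduces the statistic and the two-sided p-value by hand. The sample is synthetic and the reference mean is a made-up number, both assumptions chosen only so the block runs without the tips data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample (assumption: stands in for one group's values)
rng = np.random.default_rng(42)
sample = rng.normal(loc=3.2, scale=1.2, size=80)
popmean = 3.0  # hypothetical reference mean

# t = (sample mean - reference mean) / (sample std / sqrt(n))
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy should agree with the hand calculation
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```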

- Perform a 2 sample T-test on the df

- $H_0$: The average size of the tip left by parties of 2 and parties of 4 is the same.

- $H_a$: The average size of the tip left by parties of 2 and parties of 4 is not the same.

- Create the dataset to pass into the 2 sample t-test

In [30]:
parties_of_2 = df[df["size"] == 2]
parties_of_4 = df[df["size"] == 4]

In [31]:
test_results  = stats.ttest_ind(parties_of_2.tip, parties_of_4.tip)
test_results

Ttest_indResult(statistic=-7.462130391296251, pvalue=2.924028981378475e-12)

- I reject the Null Hypothesis that the average size of the tip left by parties of 2 and parties of 4 is the same because the p-value of 2.92e-12 is far below my alpha of .05.
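
`stats.ttest_ind` assumes equal variances in both groups by default. When the groups differ in size and spread, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups. The group names, sizes, and distributions are assumptions for illustration, not the tips data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic groups with unequal sizes and unequal spread (assumed values)
rng = np.random.default_rng(0)
small_parties = rng.normal(loc=2.5, scale=1.0, size=150)
large_parties = rng.normal(loc=5.0, scale=2.0, size=40)

# Levene's test gives a rough check of the equal-variance assumption
levene_stat, levene_p = stats.levene(small_parties, large_parties)

# Student's t-test (pooled variance) versus Welch's t-test
student = stats.ttest_ind(small_parties, large_parties)
welch = stats.ttest_ind(small_parties, large_parties, equal_var=False)

print(f"Levene p-value: {levene_p:.3g}")
print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3g}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3g}")
```

In this notebook the same switch would be `stats.ttest_ind(parties_of_2.tip, parties_of_4.tip, equal_var=False)`.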

- $H_0$: The average size of the tip left by parties on Saturdays and Sundays is the same.

- $H_a$: The average size of the tip left by parties on Saturdays and Sundays is not the same.

In [23]:
sunday_tip = df[df["day"] == "Sun"].tip
saturday_tip = df[df["day"] == "Sat"].tip

In [27]:
test_results = stats.ttest_ind(sunday_tip, saturday_tip)
test_results

Ttest_indResult(statistic=1.1431231469058438, pvalue=0.25468441632531236)

- I fail to reject the Null Hypothesis that the average size of the tip left by parties on Saturdays and Sundays is the same because of the p-value of .2547.

- Perform a chi2 test

- $H_0$: Sex is independent of whether someone is a smoker.


- $H_a$: Sex is not independent of whether someone is a smoker.

- Create contingency table to pass to function

In [33]:
contingency_table = pd.crosstab(df.sex, df.smoker)
contingency_table

| sex | smoker: Yes | smoker: No |
|---|---|---|
| Male | 60 | 97 |
| Female | 33 | 54 |


- Pass the table to the chi2 function

In [34]:
_, p, _, expected = stats.chi2_contingency(contingency_table)

In [35]:
print(f"The p-value is:  {p}")

The p-value is:  0.925417020494423


- I fail to reject the Null Hypothesis that sex is independent of whether someone is a smoker because the p-value of .9254 returned from the chi2 test is far above my alpha of .05.
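
The `expected` array returned by `stats.chi2_contingency` holds the counts we would see if sex and smoking status were independent. As a sketch, the block below rebuilds it by hand from the observed counts in the contingency table above (typed in directly so the block runs on its own) and checks it against scipy.

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the contingency table above
# rows: Male, Female; columns: smoker Yes, smoker No
observed = np.array([[60, 97],
                     [33, 54]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Under independence: expected = row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

chi2, p, dof, expected_scipy = stats.chi2_contingency(observed)

print(expected_manual.round(2))
print(np.allclose(expected_manual, expected_scipy))
```

The observed counts sit very close to the expected ones, which is consistent with the high p-value.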

- Perform a Pearson R test on total_bill and tip

- $H_0$: There is no linear correlation between the total bill and the tip amount.

- $H_a$: There is a linear correlation between the total bill and the tip amount.

In [36]:
r, p = stats.pearsonr(df.total_bill, df.tip)

print(f"The r value is: {r}")
print(f"The p value is: {p}")

The r value is: 0.6757341092113645
The p value is: 6.692470646863477e-34


- Based on the above results of the pearsonr test (r of .68, p-value of 6.69e-34), I reject the Null Hypothesis that there is no linear correlation between the total bill and the tip amount.
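
For reference, the r value from `stats.pearsonr` is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that on synthetic data. The variable names mirror the notebook, but the values are generated and the linear relationship is an assumption made for illustration.

```python
import numpy as np
import scipy.stats as stats

# Synthetic bills and tips with a built-in linear relationship (assumed)
rng = np.random.default_rng(1)
bill = rng.uniform(5, 50, size=200)
tip = 0.15 * bill + rng.normal(0, 1.5, size=200)

# r = cov(x, y) / (std(x) * std(y)), using sample (ddof=1) estimates
covariance = np.cov(bill, tip, ddof=1)[0, 1]
r_manual = covariance / (bill.std(ddof=1) * tip.std(ddof=1))

r_scipy, p_value = stats.pearsonr(bill, tip)

print(np.isclose(r_manual, r_scipy))
# r squared is the share of variance in one variable explained linearly by the other
print(f"r squared: {r_scipy ** 2:.3f}")
```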