<a href="https://www.kaggle.com/code/didemparlak/do-smokers-tip-more-at-restaurants?scriptVersionId=120864997" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Do smokers tip more at restaurants?**

Let's answer this question by applying an A/B test to the tips dataset from the seaborn library.



> A brief summary of the A/B test (as a comparison of means)

**Step 1: Hypothesis**


* H0: M1 = M2 (There is no statistically significant difference between the tips given by smokers and non-smokers at the restaurant)


* H1: M1 != M2 (There is a statistically significant difference between the tips given by smokers and non-smokers at the restaurant)

**Step 2: Testing of Assumptions**

**2.1.** Testing the Normality Assumption

* H0: The data are normally distributed.
* H1: The data are NOT normally distributed.

Apply the Shapiro-Wilk test to each group.
If the p-value < 0.05, H0 is rejected. This means the group does **not** have a normal distribution.

If the p-value is greater than 0.05, **H0 cannot be rejected**. This means there is no evidence against normality, so the assumption is treated as satisfied.

**2.2.** If both groups are normally distributed, check homogeneity of variances with Levene's test


**Step 3: Apply Hypothesis Testing** (3.1 or 3.2)

**3.1** Apply Independent Samples t Test (Parametric Method)
* If both assumptions hold, the Independent Samples t Test is run with the **"equal_var=True"** argument.
* If only homogeneity of variances fails, the Independent Samples t Test is run with the **"equal_var=False"** argument (Welch's t-test).

**3.2** Apply Mann-Whitney U Test (Non-Parametric Method)
* If the normality assumption fails (whether or not homogeneity holds), apply the Mann-Whitney U Test.

**Step 4:** Based on the p-value of the applied test, the hypothesis stated in Step 1 is either rejected or cannot be rejected.
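
The four steps above can be sketched as a single helper. This is only an illustration of the decision flow, not part of the analysis below: the function name `compare_two_groups`, the `alpha` parameter, and the synthetic exponential data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu


def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick and run a two-sample test, return (test_name, p_value)."""
    # Step 2.1: Shapiro-Wilk on each group; both must pass to treat the data as normal
    normal = shapiro(a)[1] >= alpha and shapiro(b)[1] >= alpha
    if not normal:
        # Step 3.2: normality fails, so use the non-parametric Mann-Whitney U test
        _, p_value = mannwhitneyu(a, b, alternative="two-sided")
        return "mannwhitneyu", p_value
    # Step 2.2: Levene's test for homogeneity of variances
    equal_var = bool(levene(a, b)[1] >= alpha)
    # Step 3.1: t-test; equal_var=False switches to Welch's t-test
    _, p_value = ttest_ind(a, b, equal_var=equal_var)
    return ("ttest_ind" if equal_var else "welch_ttest"), p_value


# Synthetic, strongly skewed data (assumption for the demo, not the tips data)
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=3.0, size=200)
group_b = rng.exponential(scale=3.0, size=200)

test_name, p_value = compare_two_groups(group_a, group_b)
print(f"{test_name}: p-value = {p_value:.4f}")
```

Because the synthetic groups are heavily skewed, the helper should take the Mann-Whitney branch here. Step 4 is then the usual comparison of `p_value` against `alpha`.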






**About Dataset**

* total_bill: total price of the meal (including tip and tax)
* tip
* sex: gender of the payer
* smoker: Are there any smokers in the group?
* day
* time
* size: How many people are in the group?

In [1]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# !pip install statsmodels
import statsmodels.stats.api as sms
from scipy.stats import ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu, \
    pearsonr, spearmanr, kendalltau, f_oneway, kruskal
from statsmodels.stats.proportion import proportions_ztest

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [2]:
df = sns.load_dataset("tips")
df.head()

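| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |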


In [3]:
df.groupby("smoker").agg({"tip": "mean"})
 
# The group means are close; an A/B test checks whether the difference is statistically significant.

| smoker | tip |
|---|---|
| Yes | 3.00871 |
| No | 2.99185 |


**STEP 1**
H0: M1 = M2 (There is no statistically significant difference between the tips given by smokers and non-smokers at the restaurant)

H1: M1 != M2 (There is a statistically significant difference between the tips given by smokers and non-smokers at the restaurant)


In [4]:
# Step 2.1: Normality assumption
# It may be helpful to perform outlier analysis and correction prior to the normality analysis.

# H0: the tips are normally distributed
# H1: the tips are NOT normally distributed

test_stat, pvalue = shapiro(df.loc[df["smoker"] == "Yes", "tip"])
print(f"Test Stat = {test_stat:.4f}, p-value = {pvalue:.4f}")

Test Stat = 0.8788, p-value = 0.0000


The p-value (reported as 0.0000) is below 0.05, so H0 is rejected. The normality assumption does not hold for the smokers' tips.

In [5]:
test_stat, pvalue = shapiro(df.loc[df["smoker"] == "No", "tip"])
print(f"Test Stat = {test_stat:.4f}, p-value = {pvalue:.4f}")


Test Stat = 0.9040, p-value = 0.0000


The p-value (reported as 0.0000) is below 0.05, so H0 is rejected. The normality assumption does not hold for the non-smokers' tips either.

In [6]:
# Since the normality assumption fails, checking homogeneity of variances is not needed; if it were, it would be done as follows.
test_stat, pvalue = levene(df.loc[df["smoker"] == "Yes", "tip"],
                           df.loc[df["smoker"] == "No", "tip"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = 0.0205, p-value = 0.8862


p-value = 0.8862 > 0.05, so H0 (the variances are homogeneous) cannot be rejected.
However, since the normality assumption does not hold, the Mann-Whitney U Test (non-parametric method) is applied.
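
For comparison, had both groups passed the normality check, Step 3.1 would apply instead. The sketch below shows that branch on synthetic normal data (an assumption for illustration, not the tips data), letting Levene's test decide the `equal_var` argument.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Synthetic, normally distributed groups (assumption for the demo)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=3.0, scale=1.0, size=100)
group_b = rng.normal(loc=3.0, scale=1.0, size=100)

# Step 2.2: homogeneity of variances
_, levene_p = levene(group_a, group_b)
equal_var = bool(levene_p >= 0.05)

# Step 3.1: Student's t-test if variances look equal, Welch's t-test otherwise
t_stat, t_p = ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"equal_var = {equal_var}, Test Stat = {t_stat:.4f}, p-value = {t_p:.4f}")
```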

In [7]:
# Step 3: Mann-Whitney U test (non-parametric)
test_stat, pvalue = mannwhitneyu(df.loc[df["smoker"] == "Yes", "tip"],
                                 df.loc[df["smoker"] == "No", "tip"])

print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = 7163.0000, p-value = 0.7919


Step 4: The hypothesis from Step 1 is evaluated using the p-value.

p-value from Step 3 = 0.7919

0.7919 > 0.05

**H0 cannot be rejected** because the p-value is greater than 0.05.

Recall H0 from the first step:

H0: M1 = M2 (There is no statistically significant difference between the tips given by smokers and non-smokers at the restaurant)



**As a result, there is no statistically significant difference between the tips given by smokers and non-smokers at the restaurant.**
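
One way to read the Mann-Whitney test statistic is to divide U by the number of pairs, n1 * n2. Assuming a recent SciPy where `mannwhitneyu` reports U for the first sample, this ratio estimates how often a value from the first group exceeds a value from the second, and 0.5 means no tendency either way. The helper name `prob_superiority` and the toy arrays below are hypothetical and only illustrate the calculation.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def prob_superiority(a, b):
    """Hypothetical helper: U divided by the number of (a, b) pairs."""
    u_stat, _ = mannwhitneyu(a, b, alternative="two-sided")
    return u_stat / (len(a) * len(b))


# Toy data (assumption for the demo): every value in `low` is below every value in `high`
low = np.array([1.0, 2.0, 3.0])
high = np.array([4.0, 5.0, 6.0])

ps_low = prob_superiority(low, high)
print(f"share of pairs where low > high: {ps_low:.2f}")
```

Applied to the result above, and assuming the usual group sizes of the tips data (93 smokers and 151 non-smokers, which are not printed in this notebook), U = 7163 gives 7163 / (93 * 151) ≈ 0.51. That is very close to 0.5 and consistent with failing to reject H0.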



