# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata about the questions is stored in a file named **schema.csv**. Load the data as a pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
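
The encoding note above can be sketched as a small fallback loader. This is only an illustration: `read_csv_with_fallback` is a hypothetical helper (it is not part of the lab files), and the temporary demo file stands in for the real CSV.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return the first DataFrame that loads.

    Hypothetical helper for illustration; not part of the lab files.
    """
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demonstrate on a tiny temporary file whose bytes are NOT valid UTF-8
# (0xE7 is the latin1 encoding of a c with cedilla).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"Country,Salary\nCura\xe7ao,50000\n")
    demo_df = read_csv_with_fallback(demo_path)

print(demo_df)
```

For the lab itself, passing `encoding='latin1'` directly to `pd.read_csv` is usually enough; the loop only avoids hard-coding the choice.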

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Your code here
df = pd.read_csv('multipleChoiceResponses_cleaned.csv')
print(df['FormalEducation'].head())
print(df['AdjustedCompensation'].head())

0    Bachelor's degree
1      Master's degree
2      Master's degree
3      Master's degree
4      Doctoral degree
Name: FormalEducation, dtype: object
0         NaN
1         NaN
2         NaN
3    250000.0
4         NaN
Name: AdjustedCompensation, dtype: float64


## Wages and Education

You've been asked to determine whether education has an impact on salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Is the difference between the two groups statistically significant according to your results?

> Note: The relevant features are stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a, b)`, `welch_df(a, b)`, and `p_value_welch_ttest(a, b, two_sided=False)`.

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test and that p-values derived from two-sided tests are two times the p-values derived from one-sided tests. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.    
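
As a rough sketch of the relationship described above, the following computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `scipy.stats.ttest_ind`. The samples `a` and `b` are synthetic and purely illustrative; this is not the code in `flatiron_stats.py`.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only (not the survey data).
rng = np.random.default_rng(0)
a = rng.normal(loc=50_000, scale=10_000, size=200)
b = rng.normal(loc=53_000, scale=25_000, size=150)

# Variance of each sample mean (sample variance divided by sample size).
vm_a = a.var(ddof=1) / a.size
vm_b = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over the unpooled standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(vm_a + vm_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual = (vm_a + vm_b) ** 2 / (vm_a**2 / (a.size - 1) + vm_b**2 / (b.size - 1))

# One-sided p-value from the t-distribution, then its two-sided equivalent.
p_one_sided = stats.t.sf(abs(t_manual), df_manual)
p_two_sided = 2 * p_one_sided

# SciPy's Welch test (equal_var=False) is two-sided by default.
result = stats.ttest_ind(a, b, equal_var=False)
print("t:", t_manual, result.statistic)
print("two-sided p:", p_two_sided, result.pvalue)
```

The hand-computed two-sided p-value should agree with SciPy's, which is why halving SciPy's p-value recovers the one-sided figure when the observed difference lies in the hypothesized direction.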

In [47]:
from flatiron_stats import welch_t, welch_df, p_value_welch_ttest

#Your code here
# H_0: there is no compensation difference between people who hold a Bachelor's degree and those who hold a Master's degree.
# H_a: there is a compensation difference between people who hold a Bachelor's degree and those who hold a Master's degree.

# Keep only the two relevant columns and drop rows with missing values.
# (dropna() returns a new DataFrame, so the result must be assigned.)
df_2_cols = df[['FormalEducation', 'AdjustedCompensation']].dropna()
df_2_cols = df_2_cols[df_2_cols['FormalEducation'].isin(["Bachelor's degree", "Master's degree"])]
print(df_2_cols.describe())

       AdjustedCompensation
count          3.097000e+03
mean           6.761977e+04
std            2.132118e+05
min            0.000000e+00
25%            1.793739e+04
50%            4.820250e+04
75%            8.950000e+04
max            9.999999e+06


In [48]:
df_2_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3097 entries, 3 to 26378
Data columns (total 2 columns):
FormalEducation         3097 non-null object
AdjustedCompensation    3097 non-null float64
dtypes: float64(1), object(1)
memory usage: 72.6+ KB


In [59]:
df_b = df_2_cols[df_2_cols['FormalEducation'] == 'Bachelor\'s degree']
df_m = df_2_cols[df_2_cols['FormalEducation'] == 'Master\'s degree']
b_arr = np.array(df_b['AdjustedCompensation'])
m_arr = np.array(df_m['AdjustedCompensation'])

welch_t(b_arr, m_arr)

0.43786693335411514

In [60]:
welch_df(b_arr, m_arr)

1350.0828973008781

In [83]:
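# two_sided=False returns a one-sided p-value; the alternative hypothesis stated above is two-sided (two_sided=True would match it).
p_value_welch_ttest(b_arr, m_arr, two_sided=False)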

0.33077639451272445

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
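
One possible way to look for anomalies is sketched below on synthetic data. The series `salaries` and the `cutoff` value are illustrative assumptions; on the real data you would apply the same calls to `df['AdjustedCompensation']` and choose a threshold that you can justify.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salaries plus two implausible entries (illustration only).
rng = np.random.default_rng(1)
salaries = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=1_000))
salaries = pd.concat([salaries, pd.Series([0.0, 9_999_999.0])], ignore_index=True)

# Upper quantiles show how far the maximum sits from the bulk of the data.
print(salaries.quantile([0.5, 0.9, 0.99, 1.0]))

# The mean is pulled toward extreme values far more than the median is.
print("mean:", salaries.mean(), "median:", salaries.median())

# Flag values at or beyond an illustrative cutoff; the threshold is a judgment call.
cutoff = 1_000_000
flagged = salaries[(salaries <= 0) | (salaries >= cutoff)]
print(len(flagged), "values flagged")
```

A histogram or box plot (for example with seaborn, which is already imported in this lab) gives the same picture visually.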

In [79]:
df2_cols = pd.DataFrame(df['FormalEducation'])
df2_cols['AdjustedCompensation'] = pd.DataFrame(df['AdjustedCompensation'])
df2_cols.dropna(inplace=True)
df2_cols = df2_cols[(df2_cols['FormalEducation'] == 'Bachelor\'s degree') | (df2_cols['FormalEducation'] == 'Doctoral degree')]

df_b = df2_cols[df2_cols['FormalEducation'] == 'Bachelor\'s degree']
df_d = df2_cols[df2_cols['FormalEducation'] == 'Doctoral degree']
b_arr = np.array(df_b['AdjustedCompensation'])
d_arr = np.array(df_d['AdjustedCompensation'])

#Your code here
import statistics as s
s1 = s.median(list(d_arr))
s2 = s.median(list(b_arr))
print(s1, s2)

74131.91999999997 38399.4


In [80]:
print(len(d_arr),len(b_arr))

967 1107


In [84]:
p_value_welch_ttest(b_arr, d_arr, two_sided=False)

0.15682381994720251

In [63]:
# Median Values: 
# s1:74131.92 
# s2:38399.4
# Sample sizes: 
# s1: 967 
# s2: 1107
# Welch's t-test p-value: 0.1568238199472023


# Repeated Test with Outliers Removed:
# Sample sizes: 
# s1: 964 
# s2: 1103
# Welch's t-test p-value with outliers removed: 0.0

In [91]:
df_d_no_outliers = df_d[(df_d['AdjustedCompensation'] > 0) & (df_d['AdjustedCompensation'] < 1000000)]
df_b_no_outliers = df_b[(df_b['AdjustedCompensation'] > 0) & (df_b['AdjustedCompensation'] < 1000000)]

print(len(df_d_no_outliers), len(df_b_no_outliers))

959 1096


In [93]:
d_no_outliers = np.array(df_d_no_outliers['AdjustedCompensation'])
b_no_outliers = np.array(df_b_no_outliers['AdjustedCompensation'])

p_value_welch_ttest(b_no_outliers, d_no_outliers, two_sided=False)

0.0
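
To see why removing a handful of rows can change the result this much, here is a small sketch on synthetic data. The group names and parameters are assumptions chosen for illustration; the single extreme entry is similar in size to the 9,999,999 maximum in the earlier `describe()` output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two synthetic groups with a genuine difference in means (illustration only).
group_low = rng.normal(loc=50_000, scale=15_000, size=500)
group_high = rng.normal(loc=60_000, scale=15_000, size=500)

# Contaminate one group with a single extreme entry.
group_low_dirty = np.append(group_low, 9_999_999)

clean = stats.ttest_ind(group_low, group_high, equal_var=False)
dirty = stats.ttest_ind(group_low_dirty, group_high, equal_var=False)

print("p-value without the extreme entry:", clean.pvalue)
print("p-value with the extreme entry:   ", dirty.pvalue)
print("std without / with:", group_low.std(ddof=1), group_low_dirty.std(ddof=1))
```

The extreme entry inflates the sample standard deviation by more than an order of magnitude, which inflates the standard error in the denominator of the t-statistic and hides a difference that is otherwise easy to detect.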

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
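
As a reminder of why repeated pairwise tests are a problem, this sketch computes the chance of at least one false positive across all pairwise comparisons. The group counts are hypothetical, and the formula assumes independent tests, which is an idealization.

```python
from math import comb

alpha = 0.05  # per-test significance level

# k is a hypothetical number of education categories, chosen for illustration.
fwer_by_k = {}
for k in (3, 5, 7):
    n_tests = comb(k, 2)  # number of distinct pairwise comparisons
    # Probability of at least one false positive if every null hypothesis
    # is true and the tests are independent.
    fwer_by_k[k] = 1 - (1 - alpha) ** n_tests
    print(f"{k} groups -> {n_tests} pairwise tests, familywise error rate = {fwer_by_k[k]:.3f}")
```

A single ANOVA tests all groups at once and so keeps the overall false positive rate at the chosen significance level.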

In [97]:
import scipy.stats as st

#Your code here
print(st.f_oneway(list(d_no_outliers), list(b_no_outliers)))  # Doctoral vs. Bachelor's, both with outliers removed
print(st.f_oneway(list(d_no_outliers), list(d_arr)))  # Doctoral without outliers vs. Doctoral with outliers (d_arr holds Doctoral data, not Master's)
print(st.f_oneway(list(b_no_outliers), list(d_arr)))  # Bachelor's without outliers vs. Doctoral with outliers

F_onewayResult(statistic=140.8642997756584, pvalue=1.7960011262560295e-31)
F_onewayResult(statistic=1.0064146083986878, pvalue=0.3158894727750025)
F_onewayResult(statistic=1.1527704060999588, pvalue=0.2830956451575657)
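
Each call above passes only two groups at a time. A single one-way ANOVA across every category could be sketched as follows. The DataFrame `demo` is a synthetic stand-in whose column names mirror the lab's; on the real data you would group the cleaned `df` instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the survey data (illustration only); the group means
# below are made up and do not come from the survey.
rng = np.random.default_rng(3)
levels = {"Bachelor's degree": 55_000, "Master's degree": 60_000, "Doctoral degree": 75_000}
demo = pd.concat(
    [
        pd.DataFrame({
            "FormalEducation": name,
            "AdjustedCompensation": rng.normal(loc=mean, scale=20_000, size=300),
        })
        for name, mean in levels.items()
    ],
    ignore_index=True,
)

# Build one array per category and pass them all to a single f_oneway call.
groups = [
    grp["AdjustedCompensation"].to_numpy()
    for _, grp in demo.groupby("FormalEducation")
]
f_stat, p_val = stats.f_oneway(*groups)
print(f_stat, p_val)
```

An alternative is to fit an ordinary least squares model with statsmodels and request an ANOVA table from it; either way, the decision about how to handle outliers should be made before the test is run.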


## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting hypothesis tests on real-world data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing steps such as outlier removal!