# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If the file fails to load properly, try specifying a different encoding, for example `encoding='latin1'`.

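If you are unsure which encoding a file uses, one option is to try a few in order. The sketch below is a minimal illustration; `read_csv_with_fallback` is a hypothetical helper name, and the list of encodings is an assumption rather than anything prescribed by this lab.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Return the first DataFrame that loads without a decoding error.

    Hypothetical helper: the encodings to try are an assumption.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error


# Example usage (commented out so the sketch runs without the data file):
# df = read_csv_with_fallback('multipleChoiceResponses_cleaned.csv')
```

Note that `latin1` can decode any byte sequence, so it never raises a decoding error; it works well as a last resort but may silently mislabel characters if the file actually uses another encoding.
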
In [2]:
# Imports and plotting setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')

# Load the survey responses and preview the first rows
df = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')
display(df.head())

# Load the question metadata and preview it
dfschema = pd.read_csv('schema.csv', encoding='latin1')
display(dfschema.head())



Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


Unnamed: 0,Column,Question,Asked
0,GenderSelect,Select your gender identity. - Selected Choice,All
1,GenderFreeForm,Select your gender identity. - A different ide...,All
2,Country,Select the country you currently live in.,All
3,Age,What's your age?,All
4,EmploymentStatus,What's your current employment status?,All


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of respondents with Master's degrees to those with Bachelor's degrees. Is the difference statistically significant according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a,b)`, `welch_df(a, b)`, and `p_value(a, b, two_sided=False)`. 

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test and that p-values derived from two-sided tests are two times the p-values derived from one-sided tests. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.    
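
As a rough illustration of the relationship described above, the following sketch computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `scipy.stats.ttest_ind`. The data is synthetic, and `welch_t_sketch` and `welch_df_sketch` are hypothetical stand-ins written for this example; they are not the contents of `flatiron_stats.py`.

```python
import numpy as np
from scipy import stats


def welch_t_sketch(a, b):
    """Difference in means divided by the unpooled standard error."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def welch_df_sketch(a, b):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    return (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))


# Synthetic groups with unequal variances and sample sizes (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=52, scale=10, size=200)
group_b = rng.normal(loc=50, scale=25, size=120)

t = welch_t_sketch(group_a, group_b)
dof = welch_df_sketch(group_a, group_b)

# One-sided p-value in the direction of the observed difference,
# and the two-sided p-value, which is twice as large
p_one_sided = stats.t.sf(abs(t), dof)
p_two_sided = 2 * p_one_sided

result = stats.ttest_ind(group_a, group_b, equal_var=False)
print('manual t:', t, ' scipy t:', result.statistic)
print('manual two-sided p:', p_two_sided, ' scipy p:', result.pvalue)
```

The sketch uses the survival function `stats.t.sf` instead of `1 - stats.t.cdf`, which gives the same quantity but is more accurate for very small p-values.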

In [19]:
# Compare the salaries of Master's and Bachelor's degree holders
from flatiron_stats import welch_t, welch_df
import scipy.stats as stats

# Salaries per education level, dropping respondents with no reported compensation
m_sal = df.AdjustedCompensation[df.FormalEducation == "Master's degree"].dropna()
b_sal = df.AdjustedCompensation[df.FormalEducation == "Bachelor's degree"].dropna()
print('masters degree mean salary',m_sal.mean(),', with stdev {}'.format(m_sal.std()), ",with {} samples".format(len(m_sal)))
print('bach degree mean salary',b_sal.mean(),', with stdev {}'.format(b_sal.std()), ",with {} samples".format(len(b_sal)))
print("since they have different std deviation, we need welch's t-test (of course we shouldn't pre-test but I was curious.)")

print('welch t: ', welch_t(m_sal, b_sal))
print('welch df:', welch_df(m_sal, b_sal))

# The imported p_value did not work here, so it is redefined locally
def p_value(a, b, two_sided=False):
    """Return the p-value of Welch's t-test for samples a and b."""
    t = welch_t(a, b)
    dof = welch_df(a, b)  # named dof so it is not confused with the DataFrame df
    p = 1 - stats.t.cdf(t, dof)
    return 2 * p if two_sided else p

print('welch p:', p_value(m_sal, b_sal, two_sided=False))

# Cross-check with SciPy's two-sided Welch's t-test
stats.ttest_ind(m_sal, b_sal, equal_var=False)

masters degree mean salary 69139.8998712 , with stdev 135527.2085045828 ,with 1990 samples
bach degree mean salary 64887.097994618794 , with stdev 306935.8723879783 ,with 1107 samples
since they have different std deviation, we need welch's t-test (of course we shouldn't pre-test but I was curious.)
welch t:  0.43786693335411514
welch df: 1350.0828973008781
welch p: 0.33077639451272445


Ttest_indResult(statistic=0.43786693335411514, pvalue=0.6615527890254489)

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
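
One way to look for anomalies is to inspect summary statistics and upper quantiles before running any test. The sketch below uses synthetic salaries with one deliberately implausible value, not the survey data, and the `cutoff` of 500,000 is an arbitrary, hypothetical threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic salaries (illustration only), plus one implausible entry
rng = np.random.default_rng(42)
salaries = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=1000))
salaries.iloc[0] = 2.8e10

# A mean far above the median and a huge standard deviation hint at outliers
print(salaries.describe())
print(salaries.quantile([0.5, 0.9, 0.99, 1.0]))

# Hypothetical cutoff: keep only values at or below the threshold
cutoff = 500_000
trimmed = salaries[salaries <= cutoff]

print('rows removed:', len(salaries) - len(trimmed))
print('mean before trimming:', salaries.mean())
print('mean after trimming: ', trimmed.mean())
```

Whatever rule you choose for the real data, apply it before testing and state it alongside your results, since the conclusion of the test can change with it.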

In [20]:
# Compare the salaries of Doctoral and Bachelor's degree holders
d_sal = df.AdjustedCompensation[df.FormalEducation == "Doctoral degree"].dropna()
print('masters degree mean salary',m_sal.mean(),', with stdev {}'.format(m_sal.std()), ",with {} samples".format(len(m_sal)))
print('bach degree mean salary',b_sal.mean(),', with stdev {}'.format(b_sal.std()), ",with {} samples".format(len(b_sal)))
print('Doc degree mean salary',d_sal.mean(),', with stdev {}'.format(d_sal.std()), ",with {} samples".format(len(d_sal)))
print("since they have different std deviation, we need welch's t-test (of course we shouldn't pre-test but I was curious.)")

print('welch t: ', welch_t(d_sal, b_sal))
print('welch df:', welch_df(d_sal, b_sal))
# p_value is the local definition from the previous cell
print('welch p:', p_value(d_sal, b_sal, two_sided=False))
print("p > 0.05, so this is not a significant difference; the extreme mean and stdev of the doctoral salaries point to outliers")

masters degree mean salary 69139.8998712 , with stdev 135527.2085045828 ,with 1990 samples
bach degree mean salary 64887.097994618794 , with stdev 306935.8723879783 ,with 1107 samples
Doc degree mean salary 29566175.762453098 , with stdev 909998082.3346785 ,with 967 samples
since they have different std deviation, we need welch's t-test (of course we shouldn't pre-test but I was curious.)
welch t:  1.0081234695549772
welch df: 966.0001919995985
welch p: 0.15682381994720251
p > 0.05, so this is not a significant difference; the extreme mean and stdev of the doctoral salaries point to outliers


## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
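
To see why repeated pairwise t-tests are a problem, the following sketch computes the probability of at least one false positive across all pairwise comparisons. It assumes the tests are independent, which is not strictly true when groups are reused across comparisons, so treat the numbers as a rough guide. The function name `familywise_error_rate` is hypothetical.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Number of pairwise comparisons grows quickly with the number of groups
for n_groups in (3, 5, 7):
    n_pairs = comb(n_groups, 2)
    rate = familywise_error_rate(n_pairs)
    print('{} groups -> {} pairwise tests -> family-wise error rate {:.3f}'.format(
        n_groups, n_pairs, rate))
```

An ANOVA sidesteps this by asking a single question, whether any group mean differs, with one test at the chosen significance level.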

In [23]:
# One-way ANOVA of compensation across all education categories
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Keep only rows with both an education level and a reported compensation
df2 = df[['FormalEducation', 'AdjustedCompensation']].dropna()

# Treat FormalEducation as a categorical predictor
formula = 'AdjustedCompensation ~ C(FormalEducation)'
lm = ols(formula, df2).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                          sum_sq      df         F    PR(>F)
C(FormalEducation)  6.540294e+17     6.0  0.590714  0.738044
Residual            7.999414e+20  4335.0       NaN       NaN


In [24]:
# One-way ANOVA restricted to the three groups compared above
f_stat, p = stats.f_oneway(b_sal, m_sal, d_sal)
display(f_stat, p)

1.6276474409881452

0.19651914373154994

## Additional Resources

The data was taken from the following source:

[Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting hypothesis tests on real-world data. You also saw how strongly the results can depend on the initial problem formulation, including preprocessing choices such as how anomalous values are handled.