# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`

In [2]:
# imports

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [3]:
# loading the dataset
df = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')
df.head(10)




Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,
5,Male,Brazil,46.0,Employed full-time,,,Yes,,Data Scientist,Fine,...,,,,,,,,,,
6,Male,United States,35.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,
7,Female,India,22.0,Employed full-time,,,No,Yes,Software Developer/Software Engineer,Fine,...,Very Important,Somewhat important,Somewhat important,Not important,Very Important,Very Important,Somewhat important,Somewhat important,,
8,Female,Australia,43.0,Employed full-time,,,Yes,,Business Analyst,Fine,...,,,,,,,,,0.80231,64184.8
9,Male,Russia,33.0,Employed full-time,,,Yes,,Software Developer/Software Engineer,Fine,...,,,,,,,,,0.017402,20882.4


In [4]:
# checking what the dataset contains
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26394 entries, 0 to 26393
Columns: 230 entries, GenderSelect to AdjustedCompensation
dtypes: float64(15), object(215)
memory usage: 46.3+ MB


In [5]:
# checking the numerical values of the dataset
df.describe()

Unnamed: 0,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect,CompensationAmount,exchangeRate,AdjustedCompensation
count,16236.0,16253.0,16238.0,16249.0,16253.0,16221.0,10657.0,10655.0,10644.0,10656.0,10650.0,10640.0,5178.0,4499.0,4343.0
mean,33.596945,25.81468,13.760184,21.13327,4.467212,1.449728,35.680304,27.455279,10.007657,13.639968,9.249953,2.254041,41294940.0,0.703416,6636071.0
std,23.78135,24.558786,17.845975,23.784604,10.186693,8.437395,19.36495,17.450835,10.45843,9.947624,12.429025,10.302431,1965335000.0,0.486681,429399600.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,3e-05,-73.51631
25%,20.0,10.0,0.0,0.0,0.0,0.0,25.0,15.0,5.0,10.0,0.0,0.0,50000.0,0.058444,20369.42
50%,30.0,20.0,10.0,15.0,0.0,0.0,30.0,30.0,10.0,10.0,5.0,0.0,90000.0,1.0,53812.17
75%,50.0,35.0,20.0,40.0,5.0,0.0,50.0,40.0,10.0,15.0,15.0,0.0,190000.0,1.0,95666.08
max,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,303.0,100.0,100000000000.0,2.652053,28297400000.0


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a,b)`, `welch_df(a, b)`, and `p_value(a, b, two_sided=False)`. 

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test and that p-values derived from two-sided tests are two times the p-values derived from one-sided tests. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.    
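
The relationship between one-sided and two-sided p-values can be checked numerically. The sketch below is one possible way to write the three helpers named above; it is an assumption about how they behave, not the actual contents of `flatiron_stats.py`, and the two sample arrays are synthetic rather than taken from the survey.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def welch_df(a, b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    return (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))


def p_value(a, b, two_sided=False):
    """p-value from the t distribution; one-sided unless two_sided=True.

    The one-sided value is taken in the direction of the observed difference.
    """
    one_sided = stats.t.sf(abs(welch_t(a, b)), welch_df(a, b))
    return 2 * one_sided if two_sided else one_sided


# Synthetic samples (made-up values, not the survey data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=70_000, scale=20_000, size=200)
group_b = rng.normal(loc=65_000, scale=30_000, size=150)

scipy_result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(welch_t(group_a, group_b), scipy_result.statistic)
print(p_value(group_a, group_b, two_sided=True), scipy_result.pvalue)
```

Each printed pair should agree, and halving the two-sided value recovers the one-sided one.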

In [33]:
# Null Hypothesis: There is no difference in mean salary between people with Master's degrees and those with Bachelor's degrees
# Alternative Hypothesis: There is a difference in mean salary between people with Master's degrees and those with Bachelor's degrees

In [23]:
# check for the value_count of FormalEducation
df.FormalEducation.value_counts()

Master's degree                                                      8204
Bachelor's degree                                                    4811
Doctoral degree                                                      3543
Some college/university study without earning a bachelor's degree     786
Professional degree                                                   451
I did not complete any formal education past high school              257
I prefer not to answer                                                 90
Name: FormalEducation, dtype: int64

In [24]:
# check for the value_count of AdjustedCompensation
df.AdjustedCompensation.value_counts()

100000.000    60
120000.000    59
150000.000    58
71749.560     47
59791.300     45
              ..
100489.936     1
55080.000      1
7075.200       1
123000.000     1
1.000          1
Name: AdjustedCompensation, Length: 1627, dtype: int64

In [28]:
# finding the p-value with a two-sided Welch's t-test

from scipy import stats

# AdjustedCompensation for each group, with missing salaries dropped
s1 = df.loc[df['FormalEducation'] == "Master's degree", 'AdjustedCompensation'].dropna()
s2 = df.loc[df['FormalEducation'] == "Bachelor's degree", 'AdjustedCompensation'].dropna()
stats.ttest_ind(s1, s2, equal_var=False)


Ttest_indResult(statistic=0.43786693335411514, pvalue=0.6615527890254489)

In [None]:
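# The p-value (about 0.66) is greater than 0.05, so we fail to reject the null hypothesis: the data show no statistically significant difference in mean salary between the two groups. Failing to reject is not evidence that the null hypothesis is true.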


## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
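
The `df.describe()` output above shows an AdjustedCompensation maximum near 2.8e10 against a median near 54,000, so a few extreme entries may dominate the test. The sketch below uses synthetic data to illustrate how a single implausible value can mask a real difference; the 500,000 cutoff and the `trim` helper are hypothetical choices for illustration, not values taken from the lab.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for two salary groups (NOT the survey data)
bachelors = pd.Series(rng.normal(65_000, 25_000, size=500))
doctorates = pd.Series(rng.normal(85_000, 25_000, size=500))
# One implausibly large entry, similar in spirit to the survey's maximum
doctorates.iloc[0] = 2.8e10


def trim(series, upper=500_000):
    """Keep values at or below a hypothetical plausibility cutoff."""
    return series[series <= upper]


raw = stats.ttest_ind(bachelors, doctorates, equal_var=False)
trimmed = stats.ttest_ind(trim(bachelors), trim(doctorates), equal_var=False)
print(f"raw:     t={raw.statistic:.3f}, p={raw.pvalue:.4f}")
print(f"trimmed: t={trimmed.statistic:.3f}, p={trimmed.pvalue:.3g}")
```

In this toy setup the outlier inflates the variance so much that the untrimmed test finds nothing, while the trimmed test detects the 20,000 gap built into the data. Whether and where to trim the real survey data is a judgment call that should be stated alongside the result.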

In [12]:
# check unique values in the column
df['FormalEducation'].unique()

array(["Bachelor's degree", "Master's degree", 'Doctoral degree', nan,
       "Some college/university study without earning a bachelor's degree",
       'I did not complete any formal education past high school',
       'Professional degree', 'I prefer not to answer'], dtype=object)

In [32]:
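# Welch's t-test (two-sided) against the Doctoral group
# s1 is the Master's group defined earlier; pass s2 instead to compare against Bachelor's degrees
print(df['FormalEducation'].unique())
c = df.loc[df['FormalEducation'] == 'Doctoral degree', 'AdjustedCompensation'].dropna()
stats.ttest_ind(s1, c, equal_var=False)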

["Bachelor's degree" "Master's degree" 'Doctoral degree' nan
 "Some college/university study without earning a bachelor's degree"
 'I did not complete any formal education past high school'
 'Professional degree' 'I prefer not to answer']


Ttest_indResult(statistic=-1.0079781866796824, pvalue=0.3137173462373764)

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
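
To see why repeated pairwise tests are a problem, the arithmetic can be sketched as follows. The calculation assumes the tests are independent, which pairwise comparisons sharing a group are not, so treat the figure as a rough approximation; the function name is illustrative.

```python
from math import comb


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_categories = 7  # distinct non-missing 'FormalEducation' values
n_pairs = comb(n_categories, 2)  # number of pairwise t-tests
fwer = familywise_error_rate(0.05, n_pairs)
print(n_pairs, round(fwer, 3))  # 21 0.659
```

With 21 pairwise tests at a 0.05 threshold, the chance of at least one spurious "significant" result is roughly two in three, which is why a single ANOVA is preferred here.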

In [8]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# performing a one-way ANOVA (FormalEducation is the only factor)
model = ols('AdjustedCompensation ~ C(FormalEducation)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
C(FormalEducation),6.540294e+17,6.0,0.590714,0.738044
Residual,7.999414e+20,4335.0,,
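
The p-value of about 0.74 is well above 0.05, so this ANOVA does not reject the hypothesis that mean compensation is equal across education levels. With a single factor, the same F-test can be cross-checked with `scipy.stats.f_oneway`. The sketch below runs it on a synthetic frame that reuses the lab's column names; the values and the `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic frame with the lab's column names (values are made up)
toy = pd.DataFrame({
    "FormalEducation": np.repeat(
        ["Bachelor's degree", "Master's degree", "Doctoral degree"], 50
    ),
    "AdjustedCompensation": np.concatenate([
        rng.normal(60_000, 15_000, 50),
        rng.normal(70_000, 15_000, 50),
        rng.normal(80_000, 15_000, 50),
    ]),
})
toy.loc[::10, "AdjustedCompensation"] = np.nan  # mimic missing salaries

# One array of non-missing salaries per education level
groups = [
    g.dropna().to_numpy()
    for _, g in toy.groupby("FormalEducation")["AdjustedCompensation"]
]
f_stat, p = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p={p:.3g}")
```

Missing salaries are dropped explicitly before the test, since `f_oneway` expects complete arrays.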


## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting actual hypothesis tests on actual data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing!