# Comparing Means


_______________________________________________________________________

|Goal|$H_{0}$|Data Needed|Parametric Test|Assumptions\*|
|---|---|---|---|---|
|Compare observed mean to theoretical one|$\mu_{obs} = \mu_{th}$|array-like of observed values & float of theoretical|One sample t-test: `scipy.stats.ttest_1samp`|Normally Distributed\*\*|
|Compare two observed means (independent samples)|$\mu_{a} = \mu_{b}$|2 array-like samples|Independent t-test (or 2-sample): `scipy.stats.ttest_ind`|Independent, Normally Distributed\*\*, Equal Variances\*\*\*|
|Compare several observed means (independent samples)|$\mu_{a} = \mu_{b} = \mu_{n}$|n array-like samples|ANOVA: `scipy.stats.f_oneway`|Independent, Normally Distributed\*\*, Equal Variances|

\*If assumptions can't be met, the equivalent non-parametric test can be used.  
\*\*The normal distribution assumption can be met by having a large enough sample (due to the Central Limit Theorem), or the data can be transformed toward a normal shape using a Gaussian scaler.   
\*\*\*If the variances are not equal, set the `equal_var` argument of `stats.ttest_ind()` to `False`, which runs Welch's t-test instead of the pooled-variance test.   

## One Sample T-Test

Goal: Compare observed mean to theoretical one. 

1. Plot Distributions (i.e. Histograms!)  

2. Establish Hypotheses   

||||  
|-----|-----|---------|  
|Null Hypothesis|$H_{0}$|$\mu_{obs} = \mu_{th}$|  
|Alternative Hypothesis (2-tail, significantly different)|$H_{a}$|$\mu_{obs} \neq \mu_{th}$|  
|Alternative Hypothesis (1-tail, significantly smaller)|$H_{a}$|$\mu_{obs} < \mu_{th}$|  
|Alternative Hypothesis (1-tail, significantly larger)|$H_{a}$|$\mu_{obs} > \mu_{th}$|      

3. Set Significance Level: $\alpha = .05$

4. Verify Assumptions: Normal Distribution, or at least 30 observations and "kinda" normal. The more observations you have, the less "normal" it needs to appear. (CLT)  

5. Compute test statistic and probability (t-statistic & p-value) using `scipy.stats.ttest_1samp`. 

6. Decide. **For a 2-tailed test, reject $H_{0}$ when $p < \alpha$. For a 1-tailed "greater than" test, reject when $p/2 < \alpha$ and $t > 0$. For a 1-tailed "less than" test, reject when $p/2 < \alpha$ and $t < 0$.**
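
The one-tailed rule in step 6 can be sketched as follows. The sample here is synthetic (drawn from a normal distribution purely for illustration), `mu_th` is a hypothetical theoretical mean, and the `alternative` keyword is assumed to be available (SciPy 1.6 or later).

```python
import numpy as np
from scipy import stats

alpha = 0.05
mu_th = 50  # hypothetical theoretical mean

# synthetic observations, for illustration only
rng = np.random.default_rng(42)
observed = rng.normal(loc=52, scale=5, size=40)

# two-tailed test: t-statistic and two-sided p-value
t_stat, p_two_sided = stats.ttest_1samp(observed, mu_th)

# manual 1-tailed ("greater than") decision from the two-sided result
reject_greater = bool((p_two_sided / 2 < alpha) and (t_stat > 0))

# the same test, letting SciPy compute the one-sided p-value directly
_, p_greater = stats.ttest_1samp(observed, mu_th, alternative='greater')

print(reject_greater, bool(p_greater < alpha))
```

Both routes should always agree: when $t > 0$ the one-sided p-value is exactly half the two-sided one, and when $t < 0$ it is above 0.5, so the "greater than" test cannot reject.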

Answer with the type of test you would use (assume normal distribution):

        Is there a difference in grades of students on the second floor compared to grades of all students? one sample t-test 
        
        Are adults who drink milk taller than adults who don't drink milk? independent t-test 
        
        Is the price of gas higher in Texas or in New Mexico? two sample t-test
        
        Are there differences in stress levels between students who take data science vs students who take web development vs students who take cloud academy? ANOVA

### Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance. Independent t-test 

H$_0$ : There is no difference in the average sale time of properties between office no. 1 and office no. 2

H$_a$ : There is a difference in the average sale time of properties between office no. 1 and office no. 2

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from math import sqrt

from scipy import stats
from pydataset import data

In [None]:
α = 0.05

μ1 = 90 # office1
μ2 = 100 # office2

sample_size1 = 40
sample_size2 = 50

stdev1 = 15 # office1
stdev2 = 20 # office2

In [None]:
office1 = stats.norm(μ1, stdev1).rvs(sample_size1)

In [None]:
office1

In [None]:
office2 = stats.norm(μ2, stdev2).rvs(sample_size2)

In [None]:
office2

In [None]:
x = np.arange(60, 165)
y = stats.norm(μ1, stdev1).pdf(x)
y2 = stats.norm(μ2, stdev2).pdf(x)
plt.plot(x, y)
plt.plot(x, y2)
plt.title('Office Differences by Probability of Sale Time')
plt.show()

> Levene Testing

H$_0$: there is equal variance between the two offices

H$_a$: there is unequal variance between the two offices

In [None]:
stat, p_levene = stats.levene(office1, office2)

In [None]:
stat

In [None]:
p_levene 

In [None]:
p_levene < α

> Failed to reject the null hypothesis which presumes equal variance

In [None]:
t_stat, p_val = stats.ttest_ind_from_stats(
    μ1,
    stdev1,
    sample_size1,
    μ2,
    stdev2,
    sample_size2,
    equal_var=True)

In [None]:
# two tailed test
if p_val < α:
    print('Null hypothesis rejected, presumes a difference in mean sale time between the offices')
else:
    print('Failed to reject the null hypothesis')
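
Because the problem only gives summary statistics, the pooled t-statistic can also be worked out by hand and checked against `stats.ttest_ind_from_stats`. The sketch below does that and additionally runs Welch's variant (`equal_var=False`), which does not depend on the outcome of a Levene test run on simulated draws. The variable names are illustrative.

```python
from math import sqrt
from scipy import stats

mean1, sd1, n1 = 90, 15, 40    # office 1
mean2, sd2, n2 = 100, 20, 50   # office 2

# pooled variance and standard error of the difference in means
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))

# t-statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual = (mean1 - mean2) / std_err
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# SciPy's result from the same summary statistics
t_scipy, p_scipy = stats.ttest_ind_from_stats(
    mean1, sd1, n1, mean2, sd2, n2, equal_var=True)

# Welch's variant, which does not assume equal variances
t_welch, p_welch = stats.ttest_ind_from_stats(
    mean1, sd1, n1, mean2, sd2, n2, equal_var=False)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
print(t_welch, p_welch)
```

With these inputs both variants give a p-value below .05, so the decision to reject the null hypothesis does not hinge on the equal-variance assumption.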

### Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

H$_0$: There is no difference in fuel efficiency in 2008 vehicles when compared to 1999 vehicles

H$_a$: There is a difference in fuel efficiency in 2008 vehicles when compared to 1999 vehicles

In [None]:
# load and view the dataframe
mpg = data('mpg')
mpg

In [None]:
# fuel economy per car: harmonic mean of highway and city mileage
mpg['f_econ'] = stats.hmean((mpg['hwy'], mpg['cty']))
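
A minimal sketch of what the harmonic mean does here, using three made-up cars rather than the `mpg` data: stacking the two columns and averaging down axis 0 yields one value per car, which for two numbers equals $2ab / (a + b)$.

```python
import numpy as np
from scipy import stats

# hypothetical highway and city mileage for three cars
hwy = np.array([29.0, 29.0, 31.0])
cty = np.array([18.0, 21.0, 20.0])

# harmonic mean down axis 0, i.e. one value per car
f_econ = stats.hmean(np.vstack([hwy, cty]), axis=0)

# closed form for two values: 2ab / (a + b)
f_econ_manual = 2 * hwy * cty / (hwy + cty)

print(f_econ)
print(f_econ_manual)
```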

In [None]:
# setup variables
f_econ_1999 = mpg[mpg.year == 1999].f_econ
f_econ_2008 = mpg[mpg.year == 2008].f_econ

In [None]:
plt.hist(f_econ_1999)

In [None]:
plt.hist(f_econ_2008)

In [None]:
# check variance
stat, p_val = stats.levene(f_econ_1999, f_econ_2008)

In [None]:
stat

In [None]:
p_val < α

In [None]:
# compare the two group means (independent two-sample t-test)
stat, p_val = stats.ttest_ind(f_econ_1999, f_econ_2008)

In [None]:
p_val < α

> Failed to reject the null hypothesis

### Are compact cars more fuel-efficient than the average car? 

H$_0$ : Compact cars have a lower or equal average fuel efficiency compared to all cars

H$_a$ : Compact cars have a greater average fuel efficiency compared to all cars

In [None]:
μth = mpg.f_econ.mean()

In [None]:
mpg

In [None]:
# one-sample t-test: compact cars vs. the overall mean (the column is named 'class')
t_stat, p_val = stats.ttest_1samp(
    mpg[mpg['class'] == 'compact'].f_econ,
    μth)

In [None]:
if ((p_val / 2 ) < α) and (t_stat > 0):
    print('Reject the null hypothesis!')
    print('Evidence discovered points to the alternative hypothesis!')
else:
    print('Failure to reject the null hypothesis')
    

### Do manual cars get better gas mileage than automatic cars?

H$_0$: Manual cars receive worse or equal gas mileage compared to automatic vehicles

H$_a$: Manual cars receive better gas mileage than automatic vehicles

In [None]:
manual_f_econ = mpg[mpg.trans.str.startswith('manual')].f_econ
auto_f_econ = mpg[mpg.trans.str.startswith('auto')].f_econ

In [None]:
stats.levene(manual_f_econ, auto_f_econ)

> Failed to reject null hypothesis, equal variance presumed

In [None]:
t_stat, p_val = stats.ttest_ind(manual_f_econ, auto_f_econ)

In [None]:
p_val

In [None]:
t_stat

> Reject the null hypothesis: there is a difference between automatic and manual vehicle fuel economy
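
Since the alternative hypothesis is directional, the whole workflow (Levene check, then a one-sided test) can be sketched as below. The two samples are synthetic stand-ins for the manual and automatic groups, not the `mpg` values, and the `alternative` keyword is assumed to be available (SciPy 1.6 or later).

```python
import numpy as np
from scipy import stats

alpha = 0.05

# synthetic mileage samples, for illustration only
rng = np.random.default_rng(0)
manual = rng.normal(loc=22, scale=4, size=80)
auto = rng.normal(loc=19, scale=4, size=150)

# Levene: H0 is equal variance; fall back to Welch's t-test if it is rejected
_, p_levene = stats.levene(manual, auto)
equal_var = bool(p_levene >= alpha)

# one-sided test of "manual mean is greater than auto mean"
t_stat, p_val = stats.ttest_ind(
    manual, auto, equal_var=equal_var, alternative='greater')

if p_val < alpha:
    print('Reject the null hypothesis: manual mileage is higher')
else:
    print('Failed to reject the null hypothesis')
```

Passing `alternative='greater'` is equivalent to the manual rule of checking $p/2 < \alpha$ and $t > 0$ on the two-sided result.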

Answer with the type of stats test you would use (assume normal distribution):
Is there a relationship between the length of your arm and the length of your foot? pearson r
Do guys and gals quit their jobs at the same rate? chi-squared
Does the length of time of the lecture correlate with a student's grade? pearson r

Use the telco_churn data.
Does tenure correlate with monthly charges?
Total charges?
What happens if you control for phone and internet service?

In [5]:
telco_churn = pd.read_csv('telco_customers.csv')

In [6]:
telco_churn

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,device_protection,tech_support,streaming_tv,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn
0,0002-ORFBO,Female,0,Yes,Yes,9,Yes,No,1,No,...,No,Yes,Yes,No,2,Yes,2,65.60,593.3,No
1,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,No,No,No,Yes,1,No,2,59.90,542.4,No
2,0004-TLHLJ,Male,0,No,No,4,Yes,No,2,No,...,Yes,No,No,No,1,Yes,1,73.90,280.85,Yes
3,0011-IGKFF,Male,1,Yes,No,13,Yes,No,2,No,...,Yes,No,Yes,Yes,1,Yes,1,98.00,1237.85,Yes
4,0013-EXCHZ,Female,1,Yes,No,3,Yes,No,2,No,...,No,Yes,Yes,No,1,Yes,2,83.90,267.4,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,9987-LUTYD,Female,0,No,No,13,Yes,No,1,Yes,...,No,Yes,No,No,2,No,2,55.15,742.9,No
7039,9992-RRAMN,Male,0,Yes,No,22,Yes,Yes,2,No,...,No,No,No,Yes,1,Yes,1,85.10,1873.7,Yes
7040,9992-UJOEL,Male,0,No,No,2,Yes,No,1,No,...,No,No,No,No,1,Yes,2,50.30,92.75,No
7041,9993-LHIEB,Male,0,Yes,Yes,67,Yes,No,1,Yes,...,Yes,Yes,No,Yes,3,No,2,67.85,4627.65,No


In [174]:
telco_churn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               7032 non-null   object 
 1   gender                    7032 non-null   object 
 2   senior_citizen            7032 non-null   int64  
 3   partner                   7032 non-null   object 
 4   dependents                7032 non-null   object 
 5   tenure                    7032 non-null   int64  
 6   phone_service             7032 non-null   object 
 7   multiple_lines            7032 non-null   object 
 8   internet_service_type_id  7032 non-null   int64  
 9   online_security           7032 non-null   object 
 10  online_backup             7032 non-null   object 
 11  device_protection         7032 non-null   object 
 12  tech_support              7032 non-null   object 
 13  streaming_tv              7032 non-null   object 
 14  streamin

### Does tenure correlate with monthly charges?

H$_0$: There is no linear relationship between tenure and monthly charges
    
H$_a$: There is a linear relationship between tenure and monthly charges

In [12]:
x = telco_churn.tenure
y = telco_churn.monthly_charges

In [13]:
r, p = stats.pearsonr(x, y)

In [14]:
α = 0.05

In [15]:
if p < α:
    print( 'Reject')
    print('The pearson r value is:', r)
else:
    print(' Failed to reject H0 ')

Reject
The pearson r value is: 0.24789985628615246


> There is a weak positive correlation (r ≈ 0.25) between tenure and monthly charges

### Does tenure correlate with total charges?

In [31]:
telco_churn['total_charges'] = pd.to_numeric(telco_churn['total_charges'], errors = 'coerce')

In [40]:
# drop all rows that have NaN/None values
telco_churn=telco_churn.dropna()
telco_churn

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,device_protection,tech_support,streaming_tv,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn
0,0002-ORFBO,Female,0,Yes,Yes,9,Yes,No,1,No,...,No,Yes,Yes,No,2,Yes,2,65.60,593.30,No
1,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,No,...,No,No,No,Yes,1,No,2,59.90,542.40,No
2,0004-TLHLJ,Male,0,No,No,4,Yes,No,2,No,...,Yes,No,No,No,1,Yes,1,73.90,280.85,Yes
3,0011-IGKFF,Male,1,Yes,No,13,Yes,No,2,No,...,Yes,No,Yes,Yes,1,Yes,1,98.00,1237.85,Yes
4,0013-EXCHZ,Female,1,Yes,No,3,Yes,No,2,No,...,No,Yes,Yes,No,1,Yes,2,83.90,267.40,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,9987-LUTYD,Female,0,No,No,13,Yes,No,1,Yes,...,No,Yes,No,No,2,No,2,55.15,742.90,No
7039,9992-RRAMN,Male,0,Yes,No,22,Yes,Yes,2,No,...,No,No,No,Yes,1,Yes,1,85.10,1873.70,Yes
7040,9992-UJOEL,Male,0,No,No,2,Yes,No,1,No,...,No,No,No,No,1,Yes,2,50.30,92.75,No
7041,9993-LHIEB,Male,0,Yes,Yes,67,Yes,No,1,Yes,...,Yes,Yes,No,Yes,3,No,2,67.85,4627.65,No


In [36]:
tt = telco_churn.tenure
ttc = telco_churn.total_charges

In [41]:
telco_churn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               7032 non-null   object 
 1   gender                    7032 non-null   object 
 2   senior_citizen            7032 non-null   int64  
 3   partner                   7032 non-null   object 
 4   dependents                7032 non-null   object 
 5   tenure                    7032 non-null   int64  
 6   phone_service             7032 non-null   object 
 7   multiple_lines            7032 non-null   object 
 8   internet_service_type_id  7032 non-null   int64  
 9   online_security           7032 non-null   object 
 10  online_backup             7032 non-null   object 
 11  device_protection         7032 non-null   object 
 12  tech_support              7032 non-null   object 
 13  streaming_tv              7032 non-null   object 
 14  streamin

In [37]:
r, p = stats.pearsonr(tt, ttc)

In [38]:
if p < α:
    print( 'Reject')
    print('The pearson r value is:', r)
else:
    print(' Failed to reject H0 ')

Reject
The pearson r value is: 0.8258804609332071


> There is a strong positive correlation (r ≈ 0.83) between tenure and total charges

### What happens if you control for phone and internet service?

In [177]:
# "controlling for" a categorical variable: test the correlation within each group
groups = telco_churn.groupby(['phone_service', 'internet_service_type_id'])

In [None]:
for (phone, internet), group in groups:
    # tenure vs. monthly charges for customers sharing the same services
    r, p = stats.pearsonr(group.tenure, group.monthly_charges)
    print(phone, internet, 'r =', r, 'p =', p)

Use the employees database.
Is there a relationship between how long an employee has been with the company and their salary?
Is there a relationship between how long an employee has been with the company and the number of titles they have had?

Use the sleepstudy data.
Is there a relationship between days and reaction time?

In [109]:
salaries = pd.read_csv('salaries.csv')

In [110]:
salaries

Unnamed: 0,emp_no,salary,from_date,to_date
0,10001,88958,2002-06-22,9999-01-01
1,10002,72527,2001-08-02,9999-01-01
2,10003,43311,2001-12-01,9999-01-01
3,10004,74057,2001-11-27,9999-01-01
4,10005,94692,2001-09-09,9999-01-01
...,...,...,...,...
240119,499995,52868,2002-06-01,9999-01-01
240120,499996,69501,2002-05-12,9999-01-01
240121,499997,83441,2001-08-26,9999-01-01
240122,499998,55003,2001-12-25,9999-01-01


In [149]:
# only the datetime class is needed; a bare `import datetime` would be shadowed by it
from datetime import datetime

In [150]:
salaries['from_date'] = pd.to_datetime(salaries['from_date'])

In [151]:
salaries['days_w_org'] = datetime.now().date() - salaries['from_date'].dt.date

In [152]:
salaries

Unnamed: 0,emp_no,salary,from_date,to_date,days_w_org
0,10001,88958,2002-06-22,9999-01-01,7563 days
1,10002,72527,2001-08-02,9999-01-01,7887 days
2,10003,43311,2001-12-01,9999-01-01,7766 days
3,10004,74057,2001-11-27,9999-01-01,7770 days
4,10005,94692,2001-09-09,9999-01-01,7849 days
...,...,...,...,...,...
240119,499995,52868,2002-06-01,9999-01-01,7584 days
240120,499996,69501,2002-05-12,9999-01-01,7604 days
240121,499997,83441,2001-08-26,9999-01-01,7863 days
240122,499998,55003,2001-12-25,9999-01-01,7742 days


In [153]:
salaries['days_w_org'] = salaries['days_w_org'] / np.timedelta64(1, 'D')

In [154]:
salaries

Unnamed: 0,emp_no,salary,from_date,to_date,days_w_org
0,10001,88958,2002-06-22,9999-01-01,7563.0
1,10002,72527,2001-08-02,9999-01-01,7887.0
2,10003,43311,2001-12-01,9999-01-01,7766.0
3,10004,74057,2001-11-27,9999-01-01,7770.0
4,10005,94692,2001-09-09,9999-01-01,7849.0
...,...,...,...,...,...
240119,499995,52868,2002-06-01,9999-01-01,7584.0
240120,499996,69501,2002-05-12,9999-01-01,7604.0
240121,499997,83441,2001-08-26,9999-01-01,7863.0
240122,499998,55003,2001-12-25,9999-01-01,7742.0


In [119]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240124 entries, 0 to 240123
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype          
---  ------      --------------   -----          
 0   emp_no      240124 non-null  int64          
 1   salary      240124 non-null  int64          
 2   from_date   240124 non-null  datetime64[ns] 
 3   to_date     240124 non-null  object         
 4   days_w_org  240124 non-null  timedelta64[ns]
dtypes: datetime64[ns](1), int64(2), object(1), timedelta64[ns](1)
memory usage: 9.2+ MB


In [124]:
salaries.corr()

Unnamed: 0,emp_no,salary,days_w_org
emp_no,1.0,0.001459,0.001904
salary,0.001459,1.0,-0.050614
days_w_org,0.001904,-0.050614,1.0


In [125]:
x = salaries.days_w_org
y = salaries.salary

In [126]:
r, p = stats.pearsonr(x, y)

In [127]:
if p < α:
    print( 'Reject')
    print('The pearson r value is:', r)
else:
    print(' Failed to reject H0 ')

Reject
The pearson r value is: -0.05061363684122252


> There is a statistically significant but very weak negative correlation (r ≈ -0.05) between salary and days with the organization
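
With roughly 240,000 rows, even a tiny correlation comes out as significant, so it helps to look at effect size as well. The sketch below plugs the reported `r` and row count into the standard t-transformation of a correlation coefficient and computes $r^2$; it is an illustration of why the p-value is so small, not necessarily the exact computation `pearsonr` performs internally.

```python
from math import sqrt
from scipy import stats

r = -0.05061363684122252  # Pearson r reported above
n = 240124                # number of rows in salaries

# t-statistic for testing H0: no linear correlation
t_stat = r * sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# share of the variance in salary explained by days_w_org
r_squared = r**2

print(t_stat, p, r_squared)
```

Here $r^2$ is well under 1%, so days with the organization explains almost none of the variation in salary even though the null hypothesis is rejected.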

In [129]:
sleepstudy = data('sleepstudy')

In [134]:
sleepstudy

Unnamed: 0,Reaction,Days,Subject
1,249.5600,0,308
2,258.7047,1,308
3,250.8006,2,308
4,321.4398,3,308
5,356.8519,4,308
...,...,...,...
176,329.6076,5,372
177,334.4818,6,372
178,343.2199,7,372
179,369.1417,8,372


In [135]:
sleepstudy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 1 to 180
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Reaction  180 non-null    float64
 1   Days      180 non-null    int64  
 2   Subject   180 non-null    int64  
dtypes: float64(1), int64(2)
memory usage: 5.6 KB


In [136]:
x = sleepstudy.Days
y = sleepstudy.Reaction

In [137]:
r, p = stats.pearsonr(x, y)

In [138]:
if p < α:
    print( 'Reject')
    print('The pearson r value is:', r)
else:
    print(' Failed to reject H0 ')

Reject
The pearson r value is: 0.5352302262650255


> There is a moderate positive correlation (r ≈ 0.54) between days and reaction time

Answer with the type of stats test you would use (assume normal distribution):

Do students get better test grades if they have a rubber duck on their desk? t-test
Does smoking affect whether or not someone has lung cancer? chi-squared
Is gender independent of a person’s blood type? chi-squared
A farming company wants to know if a new fertilizer has improved crop yield or not? t-test
Does the length of time of the lecture correlate with a student's grade? pearson r
Do people with dogs live in apartments more than people with cats? chi-squared

In [None]:
Use the following contingency table to help answer the question of whether using a macbook and being a codeup student are independent of each other.

| |Codeup Student|Not Codeup Student|
|---|---|---|
|Uses a Macbook|49|20|
|Doesn't Use A Macbook|1|30|
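
A minimal sketch of the test for this table, with $H_0$: Macbook use and being a Codeup student are independent, and $H_a$: they are related. Note that for a 2x2 table `stats.chi2_contingency` applies Yates' continuity correction by default.

```python
import numpy as np
from scipy import stats

alpha = 0.05

# rows: uses a Macbook / doesn't; columns: Codeup student / not
observed = np.array([[49, 20],
                     [1, 30]])

chi2, p, degf, expected = stats.chi2_contingency(observed)

print('chi2 =', chi2)
print('p =', p)
print('degrees of freedom =', degf)
print(expected)

if p < alpha:
    print('Reject H0: Macbook use and Codeup enrollment appear related')
else:
    print('Failed to reject H0')
```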


### Choose another 2 categorical variables from the mpg dataset and perform a chi2 contingency table test with them. Be sure to state your null and alternative hypotheses.

H$_0$: The variables of displacement and cylinders are independent of each other
    
H$_a$: There is a relationship between vehicle displacement and vehicle cylinders

In [155]:
mpg = data('mpg')

In [156]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [159]:
observed = pd.crosstab(mpg['displ'], mpg['cyl'])

In [160]:
observed 

cyl,4,5,6,8
displ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.6,5,0,0,0
1.8,14,0,0,0
1.9,3,0,0,0
2.0,21,0,0,0
2.2,6,0,0,0
2.4,13,0,0,0
2.5,14,4,2,0
2.7,5,0,3,0
2.8,0,0,10,0
3.0,0,0,8,0


In [161]:
α = 0.05

In [167]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [168]:
p

2.5049380107477617e-49

In [170]:
chi2

474.1751175406871

In [171]:
degf

102
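
The 102 degrees of freedom follow from the shape of the contingency table. Assuming the counts in the `value_counts` outputs of this notebook (35 distinct `displ` values as rows and 4 distinct `cyl` values as columns):

$$
\text{degf} = (r - 1)(c - 1) = (35 - 1)(4 - 1) = 34 \times 3 = 102
$$

Many of the expected counts in the `expected` output are well below 5, a common rule-of-thumb minimum for the chi-squared approximation, so this p-value is best read with some caution.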

In [172]:
expected 

array([[1.73076923, 0.08547009, 1.68803419, 1.4957265 ],
       [4.84615385, 0.23931624, 4.72649573, 4.18803419],
       [1.03846154, 0.05128205, 1.01282051, 0.8974359 ],
       [7.26923077, 0.35897436, 7.08974359, 6.28205128],
       [2.07692308, 0.1025641 , 2.02564103, 1.79487179],
       [4.5       , 0.22222222, 4.38888889, 3.88888889],
       [6.92307692, 0.34188034, 6.75213675, 5.98290598],
       [2.76923077, 0.13675214, 2.7008547 , 2.39316239],
       [3.46153846, 0.17094017, 3.37606838, 2.99145299],
       [2.76923077, 0.13675214, 2.7008547 , 2.39316239],
       [2.07692308, 0.1025641 , 2.02564103, 1.79487179],
       [3.11538462, 0.15384615, 3.03846154, 2.69230769],
       [1.38461538, 0.06837607, 1.35042735, 1.1965812 ],
       [1.73076923, 0.08547009, 1.68803419, 1.4957265 ],
       [0.69230769, 0.03418803, 0.67521368, 0.5982906 ],
       [1.03846154, 0.05128205, 1.01282051, 0.8974359 ],
       [2.76923077, 0.13675214, 2.7008547 , 2.39316239],
       [1.03846154, 0.05128205,

In [162]:
stats.chi2_contingency(observed)

(474.1751175406871,
 2.5049380107477617e-49,
 102,
 array([[1.73076923, 0.08547009, 1.68803419, 1.4957265 ],
        [4.84615385, 0.23931624, 4.72649573, 4.18803419],
        [1.03846154, 0.05128205, 1.01282051, 0.8974359 ],
        [7.26923077, 0.35897436, 7.08974359, 6.28205128],
        [2.07692308, 0.1025641 , 2.02564103, 1.79487179],
        [4.5       , 0.22222222, 4.38888889, 3.88888889],
        [6.92307692, 0.34188034, 6.75213675, 5.98290598],
        [2.76923077, 0.13675214, 2.7008547 , 2.39316239],
        [3.46153846, 0.17094017, 3.37606838, 2.99145299],
        [2.76923077, 0.13675214, 2.7008547 , 2.39316239],
        [2.07692308, 0.1025641 , 2.02564103, 1.79487179],
        [3.11538462, 0.15384615, 3.03846154, 2.69230769],
        [1.38461538, 0.06837607, 1.35042735, 1.1965812 ],
        [1.73076923, 0.08547009, 1.68803419, 1.4957265 ],
        [0.69230769, 0.03418803, 0.67521368, 0.5982906 ],
        [1.03846154, 0.05128205, 1.01282051, 0.8974359 ],
        [2.76923077, 

In [163]:
mpg.displ.value_counts(normalize=True)

2.0    0.089744
2.5    0.085470
4.7    0.072650
4.0    0.064103
1.8    0.059829
2.4    0.055556
4.6    0.047009
2.8    0.042735
3.3    0.038462
3.8    0.034188
2.7    0.034188
5.4    0.034188
5.7    0.034188
3.0    0.034188
5.3    0.025641
2.2    0.025641
3.1    0.025641
1.6    0.021368
5.2    0.021368
3.5    0.021368
3.4    0.017094
4.2    0.017094
3.7    0.012821
3.9    0.012821
1.9    0.012821
5.9    0.008547
3.6    0.008547
5.0    0.008547
6.2    0.008547
6.5    0.004274
7.0    0.004274
6.1    0.004274
4.4    0.004274
5.6    0.004274
6.0    0.004274
Name: displ, dtype: float64

In [164]:
mpg.cyl.value_counts(normalize=True)

4    0.346154
6    0.337607
8    0.299145
5    0.017094
Name: cyl, dtype: float64

> Reject the null hypothesis: there is a relationship between displacement and cylinders

In [None]:
Use the data from the employees database to answer these questions:

Is an employee's gender independent of whether an employee works in sales or marketing?
Is an employee's gender independent of whether or not they are or have been a manager?

In [179]:
join_title_salaries = pd.read_csv('title_salary.csv')

array(['Senior Engineer', 'Staff', 'Engineer', 'Senior Staff',
       'Assistant Engineer', 'Technique Leader', 'Manager'], dtype=object)