## Project Description

The goal of this lab is to determine whether the median salaries of those in the Major_category of Computers & Mathematics differ significantly from the median salaries of those in the Major_category of Education.

Our hypotheses are therefore:

Null hypothesis: There is no significant difference between the median salaries of Computers & Mathematics majors and those of Education majors.

Alternative hypothesis: There is a significant difference between the median salaries of Computers & Mathematics majors and those of Education majors.


## Import Libraries

In [1]:
import numpy as np
from numpy import count_nonzero, median, mean
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#import os
#import zipfile
import scipy.stats
from collections import Counter


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
sns.set_style('dark')
sns.set(font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Exploratory Data Analysis

In [2]:
df = pd.read_csv("all-ages.csv")

In [3]:
df

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.03,50000,34000,80000.00
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.03,54000,36000,80000.00
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.03,63000,40000,98000.00
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.04,46000,30000,72000.00
4,1104,FOOD SCIENCE,Agriculture & Natural Resources,24280,17281,12722,894,0.05,62000,38500,90000.00
...,...,...,...,...,...,...,...,...,...,...,...
168,6211,HOSPITALITY MANAGEMENT,Business,200854,163393,122499,8862,0.05,49000,33000,70000.00
169,6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business,156673,134478,118249,6186,0.04,72000,50000,100000.00
170,6299,MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION,Business,102753,77471,61603,4308,0.05,53000,36000,83000.00
171,6402,HISTORY,Humanities & Liberal Arts,712509,478416,354163,33725,0.07,50000,35000,80000.00


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Major_code                     173 non-null    int64  
 1   Major                          173 non-null    object 
 2   Major_category                 173 non-null    object 
 3   Total                          173 non-null    int64  
 4   Employed                       173 non-null    int64  
 5   Employed_full_time_year_round  173 non-null    int64  
 6   Unemployed                     173 non-null    int64  
 7   Unemployment_rate              173 non-null    float64
 8   Median                         173 non-null    int64  
 9   P25th                          173 non-null    int64  
 10  P75th                          173 non-null    float64
dtypes: float64(2), int64(7), object(2)
memory usage: 15.0+ KB


In [5]:
df.describe()

Unnamed: 0,Major_code,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
count,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
mean,3879.82,230256.64,166161.98,126307.77,9725.03,0.06,56816.18,38697.11,82506.36
std,1687.75,422068.48,307324.4,242425.39,18022.04,0.02,14706.23,9414.52,20805.33
min,1100.0,2396.0,1492.0,1093.0,0.0,0.0,35000.0,24900.0,45800.0
25%,2403.0,24280.0,17281.0,12722.0,1101.0,0.05,46000.0,32000.0,70000.0
50%,3608.0,75791.0,56564.0,39613.0,3619.0,0.05,53000.0,36000.0,80000.0
75%,5503.0,205763.0,142879.0,111025.0,8862.0,0.07,65000.0,42000.0,95000.0
max,6403.0,3123510.0,2354398.0,1939384.0,147261.0,0.16,125000.0,78000.0,210000.0


In [6]:
df.columns

Index(['Major_code', 'Major', 'Major_category', 'Total', 'Employed', 'Employed_full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median', 'P25th', 'P75th'], dtype='object')

First, you’ll want to extract the most relevant information from the data: the Major_category and Median columns, which are all that is needed to compare Computers & Mathematics with Education.

In [7]:
df2 = df[["Major_category", "Median"]]

In [8]:
df2

Unnamed: 0,Major_category,Median
0,Agriculture & Natural Resources,50000
1,Agriculture & Natural Resources,54000
2,Agriculture & Natural Resources,63000
3,Agriculture & Natural Resources,46000
4,Agriculture & Natural Resources,62000
...,...,...
168,Business,49000
169,Business,72000
170,Business,53000
171,Humanities & Liberal Arts,50000


In [9]:
edu = (df2[df2["Major_category"] == "Education"])
edu

Unnamed: 0,Major_category,Median
25,Education,43000
26,Education,58000
27,Education,41000
28,Education,40000
29,Education,43000
30,Education,48400
31,Education,35300
32,Education,46000
33,Education,45000
34,Education,42000


In [10]:
cm = (df2[df2["Major_category"] == "Computers & Mathematics"])
cm

Unnamed: 0,Major_category,Median
17,Computers & Mathematics,50000
18,Computers & Mathematics,65000
19,Computers & Mathematics,60000
20,Computers & Mathematics,78000
21,Computers & Mathematics,68000
22,Computers & Mathematics,55000
23,Computers & Mathematics,55000
90,Computers & Mathematics,66000
91,Computers & Mathematics,70000
92,Computers & Mathematics,70000


Next, since you are interested in the median salaries, you want to extract only the data from the Median column. In pandas, selecting the Median column and reading its .values attribute returns the data as a NumPy array.

Apply this to the Education rows (edu) and to the Computers & Mathematics rows (cm) selected above.

Once you have the two numeric arrays of the median salaries of Computers & Mathematics and Education, you can use them for data analysis.

In [11]:
edulist = edu["Median"].values
edulist

array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000, 45000,
       42000, 45000, 40000, 42000, 42600, 50000, 40000], dtype=int64)

In [12]:
cmlist = cm["Median"].values
cmlist

array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000, 70000,
       70000, 92000], dtype=int64)

In [13]:
type(cmlist)

numpy.ndarray

### Save to CSV

In [14]:
#df2.to_csv("statstest.csv", index=False)

## Hypothesis Testing

The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (the t-statistic, the F-statistic in ANOVA, etc.). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of seeing an effect at least as large as the observed one if the null hypothesis is true. Finally, interpret the p-value: if it is low (conventionally below 0.05), the effect is said to be statistically significant, which means the observed data would be unlikely under the null hypothesis.
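To make the idea of "seeing such an effect by chance" concrete, here is a minimal permutation-test sketch. It is not part of the original lab; it assumes the two salary arrays printed earlier in this notebook, and the variable names and the number of permutations are illustrative choices. It shuffles the group labels many times and counts how often a shuffled difference in means is at least as large as the observed one.

```python
import numpy as np

# Median salaries copied from the edulist and cmlist arrays shown in this notebook.
edu_medians = np.array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000,
                        45000, 42000, 45000, 40000, 42000, 42600, 50000, 40000])
cm_medians = np.array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000,
                       70000, 70000, 92000])

rng = np.random.default_rng(0)
observed_diff = cm_medians.mean() - edu_medians.mean()

pooled = np.concatenate([cm_medians, edu_medians])
n_cm = len(cm_medians)
n_permutations = 10_000  # arbitrary choice for illustration

# Under the null hypothesis the group labels are interchangeable,
# so shuffle the pooled values and recompute the difference each time.
count_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_cm].mean() - shuffled[n_cm:].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

# Add-one correction so the estimated p-value is never exactly zero.
perm_p_value = (count_extreme + 1) / (n_permutations + 1)
print("Observed difference in means:", observed_diff)
print("Permutation p-value:", perm_p_value)
```

The permutation p-value is an estimate of the same quantity the t-test computes analytically, without relying on the normality assumption.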

### T-Test

We will be using the t-test for independent samples. For the independent t-test, the following assumptions must be met.

-   One independent, categorical variable with two levels or groups
-   One dependent continuous variable
-   Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
-   The dependent variable must be approximately normally distributed within each group
-   Homogeneity of variance: the two groups should have similar variances
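The normality assumption in the list above can be checked with a Shapiro-Wilk test. The sketch below is an addition to the lab, not part of it; it assumes the salary arrays printed earlier in this notebook and uses `scipy.stats.shapiro`. With groups this small (16 and 11 values) the test has limited power, so treat the result as a rough guide only.

```python
import numpy as np
import scipy.stats

# Median salaries copied from the edulist and cmlist arrays shown in this notebook.
edu_medians = np.array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000,
                        45000, 42000, 45000, 40000, 42000, 42600, 50000, 40000])
cm_medians = np.array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000,
                       70000, 70000, 92000])

# Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution.
shapiro_edu = scipy.stats.shapiro(edu_medians)
shapiro_cm = scipy.stats.shapiro(cm_medians)

print("Education:               W =", shapiro_edu.statistic, " p =", shapiro_edu.pvalue)
print("Computers & Mathematics: W =", shapiro_cm.statistic, " p =", shapiro_cm.pvalue)
```

A p-value below the chosen significance level would suggest a departure from normality in that group.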


State the hypothesis

-   $H_0: \mu_1 = \mu_2$ ("there is no difference in median salaries between Computers & Mathematics majors and Education majors")
-   $H_1: \mu_1 \neq \mu_2$ ("there is a difference in median salaries between Computers & Mathematics majors and Education majors")


### Levene's Test

In [15]:
scipy.stats.levene(df2[df2['Major_category'] == 'Computers & Mathematics']['Median'],
                   df2[df2['Major_category'] == 'Education']['Median'], center='mean')

LeveneResult(statistic=4.791165752128612, pvalue=0.038159314391911024)
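The Levene p-value of about 0.038 is below 0.05, which suggests the two groups do not have equal variances, so the homogeneity assumption of the standard independent t-test is questionable here. A common alternative in that situation is Welch's t-test, which `scipy.stats.ttest_ind` runs when `equal_var=False`. The sketch below is an addition to the lab and assumes the salary arrays printed earlier in this notebook; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

# Median salaries copied from the edulist and cmlist arrays shown in this notebook.
edu_medians = np.array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000,
                        45000, 42000, 45000, 40000, 42000, 42600, 50000, 40000])
cm_medians = np.array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000,
                       70000, 70000, 92000])

# Welch's t-test: does not assume the two groups share a common variance.
welch_t, welch_p = scipy.stats.ttest_ind(a=edu_medians, b=cm_medians, equal_var=False)

print("Welch t-statistic:", welch_t)
print("Welch p-value:", welch_p)
```

If the Welch result and the equal-variance result lead to the same decision, the conclusion does not hinge on the variance assumption.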

## T-Test

To test whether the difference between the two numeric arrays is statistically significant, perform an independent t-test on them with scipy.stats.ttest_ind, passing one array as a and the other as b.

### One Sample T-Test

In [16]:
#t, p = scipy.stats.ttest_1samp(a=df.dose, popmean=1.166667)

In [17]:
# print("T-test value is: ", t)
# print("p-value value is: ", p)

### Two Samples T-Test

In [18]:
t, p = scipy.stats.ttest_ind(a=edulist,b=cmlist, equal_var = True)

In [19]:
print("t-statistic is: ",t)
print("p-value is: ",p)

t-statistic is:  -6.767976858331431
p-value is:  4.2974569323762015e-07
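To show where the t-statistic comes from, the sketch below recomputes it by hand using the pooled-variance formula that the equal-variance t-test is based on, then compares it with `scipy.stats.ttest_ind`. It is an addition to the lab and assumes the salary arrays printed earlier in this notebook; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

# Median salaries copied from the edulist and cmlist arrays shown in this notebook.
edu_medians = np.array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000,
                        45000, 42000, 45000, 40000, 42000, 42600, 50000, 40000])
cm_medians = np.array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000,
                       70000, 70000, 92000])

n_edu, n_cm = len(edu_medians), len(cm_medians)
dof = n_edu + n_cm - 2  # degrees of freedom for the pooled test

# Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom.
pooled_var = ((n_edu - 1) * edu_medians.var(ddof=1)
              + (n_cm - 1) * cm_medians.var(ddof=1)) / dof

# Standard error of the difference in means, then the t-statistic.
std_err = np.sqrt(pooled_var * (1 / n_edu + 1 / n_cm))
t_manual = (edu_medians.mean() - cm_medians.mean()) / std_err

# Two-sided p-value from the t distribution's survival function.
p_manual = 2 * scipy.stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = scipy.stats.ttest_ind(a=edu_medians, b=cm_medians, equal_var=True)
print("Manual t:", t_manual, " scipy t:", t_scipy)
print("Manual p:", p_manual, " scipy p:", p_scipy)
```

The sign of t is negative only because Education is passed first and has the lower mean; swapping the argument order flips the sign without changing the p-value.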


### ResearchPy

In [20]:
import researchpy as rp


In [22]:
rp.ttest(group1= df2['Median'][df2['Major_category'] == 'Computers & Mathematics'], group1_name= "CM",
         group2= df2['Median'][df2['Major_category'] == 'Education'], group2_name= "EDU",
         equal_variances=True, paired=False)

(   Variable     N     Mean       SD      SE  95% Conf.  Interval
 0        CM 11.00 66272.73 11790.60 3555.00   58351.70  74193.76
 1       EDU 16.00 43831.25  5174.00 1293.50   41074.22  46588.28
 2  combined 27.00 52974.07 13970.56 2688.64   47447.50  58500.64,
           Independent t-test  results
 0   Difference (CM - EDU) =  22441.48
 1      Degrees of freedom =     25.00
 2                       t =      6.77
 3   Two side test p value =      0.00
 4  Difference < 0 p value =      1.00
 5  Difference > 0 p value =      0.00
 6               Cohen's d =      2.65
 7               Hedge's g =      2.57
 8           Glass's delta =      1.90
 9             Pearson's r =      0.80)
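The effect sizes in the table above can be reproduced by hand. The sketch below is an addition to the lab: it computes Cohen's d as the difference in means divided by the pooled standard deviation, and Hedges' g using a common small-sample correction factor. Whether researchpy uses exactly this correction internally is an assumption; the arrays are the ones printed earlier in this notebook.

```python
import numpy as np

# Median salaries copied from the edulist and cmlist arrays shown in this notebook.
edu_medians = np.array([43000, 58000, 41000, 40000, 43000, 48400, 35300, 46000,
                        45000, 42000, 45000, 40000, 42000, 42600, 50000, 40000])
cm_medians = np.array([50000, 65000, 60000, 78000, 68000, 55000, 55000, 66000,
                       70000, 70000, 92000])

n_edu, n_cm = len(edu_medians), len(cm_medians)
dof = n_edu + n_cm - 2

# Pooled standard deviation from the two sample variances (ddof=1).
pooled_sd = np.sqrt(((n_cm - 1) * cm_medians.var(ddof=1)
                     + (n_edu - 1) * edu_medians.var(ddof=1)) / dof)

# Cohen's d: standardized difference in means (CM minus EDU, as in the table).
cohens_d = (cm_medians.mean() - edu_medians.mean()) / pooled_sd

# Hedges' g: Cohen's d scaled by an approximate small-sample correction.
hedges_g = cohens_d * (1 - 3 / (4 * dof - 1))

print("Cohen's d:", round(cohens_d, 2))
print("Hedges' g:", round(hedges_g, 2))
```

A d above 0.8 is conventionally called a large effect, so the difference here is large in practical terms as well as statistically significant.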

Lab Question 1

Enter the EXACT value of the produced mean of the median salaries of the major category Computers & Mathematics that you generated from your analysis. DO NOT ROUND!


In [23]:
cmlist.mean()

66272.72727272728

Lab Question 2

Enter the EXACT value of the produced mean of the median salaries of the major category Education that you generated from your analysis. DO NOT ROUND!


In [24]:
edulist.mean()

43831.25

Lab Question 3

Based on the generated p-value, SELECT ALL of the following that are TRUE.


We reject the null hypothesis in favor of the alternative hypothesis.

The results show that there is a significant difference between the median salaries of Computers & Mathematics majors versus Education majors.

The p-value is statistically significant.

#### Python code done by Dennis Lam