Question: Is there a difference in the percentage of women enrolled in an undergraduate engineering program depending on which quadrant of the country the school is in?

Dependent Var (y): percentage of women
<br>
Independent Var (x): Quadrant, 4 levels: NW, NE, SW, SE

In [3]:
# Import packages
import pandas as pd
import numpy as np
import scipy
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

In [11]:
# Load data
df = pd.read_csv("geo_quad_df_2020.csv")

In [12]:
df.head()

Unnamed: 0,UNITID,EFTOTLT,EFTOTLM,EFTOTLW,W/M_Ratio,INSTNM,CITY,STABBR,ZIP,LATITUDE,LONGITUDE,geometry,NorthSouth,EastWest,Quadrant
0,100654,578,434,144,0.331797,Alabama A & M University,Normal,AL,35762,34.783368,-86.568502,POINT (-86.568502 34.783368),S,E,SE
1,100663,558,382,176,0.460733,University of Alabama at Birmingham,Birmingham,AL,35294-0110,33.505697,-86.799345,POINT (-86.799345 33.505697),S,E,SE
2,100706,2792,2203,589,0.267363,University of Alabama in Huntsville,Huntsville,AL,35899,34.724557,-86.640449,POINT (-86.640449 34.724557),S,E,SE
3,100724,60,28,32,1.142857,Alabama State University,Montgomery,AL,36104-0271,32.364317,-86.295677,POINT (-86.295677 32.364317),S,E,SE
4,100751,4383,3405,978,0.287225,The University of Alabama,Tuscaloosa,AL,35487-0100,33.211875,-87.545978,POINT (-87.54597800000001 33.211875),S,E,SE


In [13]:
# Create a column with the share of women in total enrollment
# (stored as a fraction between 0 and 1, not as a 0-100 percentage)
df['PERCENT_WOMEN'] = df['EFTOTLW']/df['EFTOTLT']
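
As a quick check on the new column, the sketch below recomputes it for the five rows shown in the `df.head()` output above. The `sample` frame and the `PERCENT_WOMEN_PCT` column are illustrative names that are not part of the original notebook. The fractions should match the values that `df2.head()` displays later.

```python
import pandas as pd

# Enrollment counts copied from the five rows in the df.head() output above
sample = pd.DataFrame({
    "EFTOTLT": [578, 558, 2792, 60, 4383],   # total enrollment
    "EFTOTLW": [144, 176, 589, 32, 978],     # women enrolled
})

# Same calculation as in the notebook: women / total
sample["PERCENT_WOMEN"] = sample["EFTOTLW"] / sample["EFTOTLT"]

# Hypothetical extra column: the same share expressed on a 0-100 scale
sample["PERCENT_WOMEN_PCT"] = (sample["PERCENT_WOMEN"] * 100).round(1)

print(sample)
```

Because the values are fractions, a mean difference such as 0.02 in the later output corresponds to about two percentage points.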

In [18]:
# Data wrangling: keep only the columns I'm interested in: PERCENT_WOMEN and Quadrant
df2 = df[['PERCENT_WOMEN', 'Quadrant']].copy()

In [19]:
df2.head()

Unnamed: 0,PERCENT_WOMEN,Quadrant
0,0.249135,SE
1,0.315412,SE
2,0.21096,SE
3,0.533333,SE
4,0.223135,SE


In [20]:
# Recode Quadrant to a numeric code
# NE = 1
# NW = 2
# SE = 3
# SW = 4

quadrant_codes = {'NE': 1, 'NW': 2, 'SE': 3, 'SW': 4}
df2['QuadrantR'] = df2['Quadrant'].map(quadrant_codes)

In [21]:
df3 = df2[['PERCENT_WOMEN', 'QuadrantR']].copy()

In [22]:
df3.head()

Unnamed: 0,PERCENT_WOMEN,QuadrantR
0,0.249135,3
1,0.315412,3
2,0.21096,3
3,0.533333,3
4,0.223135,3


In [23]:
# TODO: test the ANOVA assumptions (normality within each group, equal variances across groups)
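
One possible way to fill in this step is sketched below. It is an assumption about how the check could be done, not what the notebook actually ran: Levene's test for equal variances across quadrants and a Shapiro-Wilk test for normality within each quadrant, both from `scipy.stats`. The helper `check_anova_assumptions` and the `demo` frame are hypothetical. The demo uses synthetic data so the block runs without the CSV; on the real data the call would take `df2`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def check_anova_assumptions(data, value_col, group_col):
    """Return (Levene p-value, {group: Shapiro-Wilk p-value})."""
    groups = {name: g[value_col].dropna().to_numpy()
              for name, g in data.groupby(group_col)}
    # Equal variances across groups (median-centered Levene test)
    levene_p = stats.levene(*groups.values(), center="median").pvalue
    # Normality within each group
    shapiro_p = {name: stats.shapiro(vals).pvalue
                 for name, vals in groups.items()}
    return levene_p, shapiro_p

# Synthetic stand-in for df2 (NOT the real enrollment data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "PERCENT_WOMEN": rng.normal(0.25, 0.08, size=120).clip(0, 1),
    "Quadrant": np.repeat(["NE", "NW", "SE", "SW"], 30),
})

levene_p, shapiro_p = check_anova_assumptions(demo, "PERCENT_WOMEN", "Quadrant")
print("Levene p-value:", levene_p)
print("Shapiro-Wilk p-values:", shapiro_p)
```

Small p-values would suggest a violated assumption. In that case a rank-based test such as `stats.kruskal` may be a safer choice than the one-way ANOVA.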

In [25]:
# Run one-way ANOVA

stats.f_oneway(df2['PERCENT_WOMEN'][df2['Quadrant']=='NW'],
            df2['PERCENT_WOMEN'][df2['Quadrant']=='NE'],
            df2['PERCENT_WOMEN'][df2['Quadrant']=='SE'],
            df2['PERCENT_WOMEN'][df2['Quadrant']=='SW'])

F_onewayResult(statistic=1.3994475033666858, pvalue=0.2416997063717485)

The result is not significant (F = 1.40, p = 0.242). We fail to reject the null hypothesis: the data provide no evidence that the mean of PERCENT_WOMEN differs between quadrants. This is not proof that the means are equal.
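
A p-value alone does not say how large any difference might be. The sketch below pairs the same `stats.f_oneway` call with eta squared (between-group sum of squares divided by total sum of squares) as an effect size. The helper `anova_with_eta_squared` and the synthetic `demo` frame are illustrative assumptions. On the real data the call would take `df2`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def anova_with_eta_squared(data, value_col, group_col):
    """Return (F statistic, p-value, eta squared) for a one-way ANOVA."""
    groups = [g[value_col].dropna().to_numpy()
              for _, g in data.groupby(group_col)]
    f_stat, p_value = stats.f_oneway(*groups)
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group and total sums of squares
    ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return f_stat, p_value, ss_between / ss_total

# Synthetic stand-in for df2 (NOT the real enrollment data)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "PERCENT_WOMEN": rng.normal(0.25, 0.08, size=120).clip(0, 1),
    "Quadrant": np.repeat(["NE", "NW", "SE", "SW"], 30),
})

f_stat, p_value, eta_sq = anova_with_eta_squared(demo, "PERCENT_WOMEN", "Quadrant")
print(f"F = {f_stat:.3f}, p = {p_value:.3f}, eta squared = {eta_sq:.4f}")
```

Grouping with `groupby` also avoids writing out one boolean filter per quadrant.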

In [27]:
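# Tukey HSD post-hoc comparison of all quadrant pairs
# Group codes: 1 = NE, 2 = NW, 3 = SE, 4 = SW
postHoc = MultiComparison(df3['PERCENT_WOMEN'], df3['QuadrantR'])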
postHocResults = postHoc.tukeyhsd()
print(postHocResults)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     1      2   0.0241 0.3552 -0.0137 0.0618  False
     1      3   0.0097 0.8722  -0.023 0.0424  False
     1      4   0.0222 0.3872 -0.0139 0.0583  False
     2      3  -0.0144 0.8214 -0.0572 0.0284  False
     2      4  -0.0018 0.9996 -0.0473 0.0436  False
     3      4   0.0126  0.862 -0.0288 0.0539  False
---------------------------------------------------
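
The numeric group codes make the table harder to read. As far as I know, `pairwise_tukeyhsd` (imported at the top of the notebook but not used) accepts string group labels directly, so the comparison can be run on the `Quadrant` column itself. The sketch below uses a synthetic `demo` frame as a stand-in for `df2`, so its numbers will not match the table above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for df2 (NOT the real enrollment data)
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "PERCENT_WOMEN": rng.normal(0.25, 0.08, size=120).clip(0, 1),
    "Quadrant": np.repeat(["NE", "NW", "SE", "SW"], 30),
})

# String labels are passed as the grouping variable, so the
# result table shows NE/NW/SE/SW instead of 1/2/3/4
result = pairwise_tukeyhsd(endog=demo["PERCENT_WOMEN"],
                           groups=demo["Quadrant"],
                           alpha=0.05)
print(result)
```

With four groups the table has six rows, one per pair of quadrants.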


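Consistent with the ANOVA, none of the pairwise comparisons is significant: every adjusted p-value is above 0.05 and every confidence interval includes zero.
<br>
In 2020, the data show no statistically significant difference between quadrants in the mean percentage of women enrolled in an undergraduate engineering program.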