I'm too lazy to parse any census data, so we'll let [FiveThirtyEight](https://www.fivethirtyeight.com) do all the hard work and just grab their data from https://github.com/fivethirtyeight/data/tree/master/college-majors.

That means this is all data from around 2010.

The [recent-grads.csv](https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv) file is the one we'll load below.

In [1]:
from statsmodels.stats.proportion import proportion_confint
import pandas as pd
import numpy as np

from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import Range1d

output_notebook()

pd.set_option('display.float_format', lambda x: '%.5f' % x)
pd.set_option("display.max_rows", 1000)

In [2]:
raw_data = pd.read_csv("recent-grads.csv")

Now we'll clean up the data and calculate 99.99% [credible intervals](https://en.wikipedia.org/wiki/Credible_interval) (using the Jeffreys prior for binomial proportions, if you're interested), which are [distinct](http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-interval-and-a-credible-interval) from [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval).
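For the curious, here's roughly what that interval means. With the Jeffreys prior, Beta(1/2, 1/2), observing `successes` women out of `trials` graduates gives a Beta(successes + 1/2, trials - successes + 1/2) posterior, and the equal-tailed credible interval is just a pair of quantiles of that posterior. The sketch below approximates those quantiles by sampling with the standard library only. The function name, the 95% level, and the 30-out-of-100 example are made up for illustration; the notebook itself relies on statsmodels, which computes the quantiles directly rather than by simulation.

```python
import random


def jeffreys_interval_mc(successes, trials, level=0.95, draws=100_000, seed=0):
    """Approximate an equal-tailed Jeffreys credible interval by sampling.

    Hypothetical helper for illustration only: it draws from the
    Beta(successes + 0.5, trials - successes + 0.5) posterior and reads
    off empirical quantiles.
    """
    rng = random.Random(seed)
    a = successes + 0.5           # posterior alpha under the Jeffreys prior
    b = trials - successes + 0.5  # posterior beta under the Jeffreys prior
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - level) / 2
    lo = samples[int(tail * draws)]
    hi = samples[int((1 - tail) * draws) - 1]
    return lo, hi


# Made-up example: 30 women among 100 graduates.
lo, hi = jeffreys_interval_mc(30, 100)
print(f"approximate 95% credible interval: ({lo:.3f}, {hi:.3f})")
```

Sampling gets noisy far out in the tails, so for a 99.99% interval you'd want the exact quantile function instead, which is what the library call is for.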

Why 99.99%? Well, there are roughly 200 majors represented (173 to be exact) and $0.9999^{200} \approx 0.98$. Don't want to accidentally run into multiple-comparison issues now, do we?
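To make that arithmetic concrete, here's a quick check. It assumes the intervals cover their true values independently of one another, which is a simplification, and it also prints the Bonferroni-style lower bound, which doesn't need independence. The variable names are mine.

```python
per_interval = 0.9999  # coverage of each individual interval
n_majors = 173         # number of majors in the data set

# Chance that every interval covers its true value, assuming independence.
joint_if_independent = per_interval ** n_majors

# Union (Bonferroni) bound: holds even when the intervals are dependent.
bonferroni_floor = 1 - n_majors * (1 - per_interval)

print(f"joint coverage if independent: {joint_if_independent:.4f}")
print(f"Bonferroni lower bound:        {bonferroni_floor:.4f}")
```

Either way, the chance that all 173 intervals hold at once stays around 98%.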

It turns out this was much less necessary than I thought it'd be. Originally, I was looking at the "Sample Size" column, which made it seem that some majors only had a dozen or so respondents in the survey. Luckily, or unluckily for my time efficiency, the "Sample Size" column only counts full-time employees. We're more interested in graduates, not employees.
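To get a feel for why the size of the group matters so much, here's a rough back-of-the-envelope sketch. It uses the normal approximation to the binomial, not the Jeffreys interval, so treat the numbers as ballpark only. The helper name is hypothetical, and the worst case $p = 0.5$ is assumed throughout.

```python
from math import sqrt
from statistics import NormalDist


def approx_half_width(n, p=0.5, level=0.9999):
    """Rough half-width of a two-sided interval for a proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which is a
    stand-in for (not the same as) the Jeffreys interval.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * sqrt(p * (1 - p) / n)


# A dozen respondents versus a few thousand or more graduates.
for n in (12, 3635, 128319):
    print(n, round(approx_half_width(n), 4))
```

With only a dozen observations the approximate interval is wider than the whole 0 to 1 range, so it tells you nothing. With thousands of graduates it narrows to a few percentage points.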

In [3]:
# 99.99% Jeffreys interval for the share of women among each major's graduates.
# The method name is "jeffreys" in current statsmodels.
total = raw_data.Men + raw_data.Women
lower, upper = proportion_confint(count=raw_data.Women, nobs=total,
                                  alpha=0.0001, method="jeffreys")
# A plain dict keeps the column order (Python 3.7+).
ci = pd.DataFrame({"lower": lower,
                   "point": raw_data.ShareWomen,
                   "upper": upper,
                   "sample_size": total}) \
       .set_index(pd.MultiIndex.from_tuples(list(zip(raw_data.Major_category, raw_data.Major)),
                                            names=["category", "major"])) \
       .sort_values("point").sort_index(level="category", sort_remaining=False)
ci

| category | major | lower | point | upper | sample_size |
|---|---|---|---|---|---|
| Agriculture & Natural Resources | FOOD SCIENCE | 0.2182 | 0.2227 | 0.22724 | 128319 |
| Agriculture & Natural Resources | GENERAL AGRICULTURE | 0.48329 | 0.51554 | 0.54771 | 3635 |
| Agriculture & Natural Resources | NATURAL RESOURCES MANAGEMENT | 0.55689 | 0.56464 | 0.57237 | 62052 |
| Agriculture & Natural Resources | AGRICULTURAL ECONOMICS | 0.57778 | 0.58971 | 0.60157 | 25894 |
| Agriculture & Natural Resources | AGRICULTURE PRODUCTION AND MANAGEMENT | 0.58003 | 0.59421 | 0.60827 | 18300 |
| Agriculture & Natural Resources | PLANT SCIENCE AND AGRONOMY | 0.59009 | 0.60689 | 0.62352 | 12920 |
| Agriculture & Natural Resources | FORESTRY | 0.68475 | 0.69037 | 0.69594 | 103480 |
| Agriculture & Natural Resources | MISCELLANEOUS AGRICULTURE | 0.70643 | 0.71997 | 0.73324 | 16977 |
| Agriculture & Natural Resources | SOIL SCIENCE | 0.75201 | 0.76443 | 0.77654 | 18109 |
| Agriculture & Natural Resources | ANIMAL SCIENCES | 0.90626 | 0.91093 | 0.91546 | 58001 |


You might notice some surprising things if you look carefully enough. For example, it turns out that computer science is 57.8% female! (Definitely not my experience at Columbia... makes me wonder how they're defining things and who they're sampling.) Anthropology looks like it's 96.8% female, which seems pretty insane.

Unfortunately, a giant table like this isn't exactly great for understanding what's going on. Instead, we'll make a bunch of plots.

In each plot below, each grey dot represents the point estimate for the gender ratio of a major, while the blue line is the 99.99% credible interval. If you can't see the blue line, that means the credible interval is too narrow to show up behind the dot.

Ideally, you'd be able to interactively select which majors you want to see and compare, but my JavaScript isn't good enough for the time/effort tradeoff to be worth it. I've honestly spent way too long on this as it is...

Anyways, enjoy!

In [18]:
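# Flatten the (category, major) index so each row is keyed by major name only.
df = pd.DataFrame(ci) \
       .assign(name=ci.index.get_level_values(1)).set_index("name")

def plot(df, title):
    """Draw one interval (blue line) and point estimate (grey dot) per row of df."""
    # Each interval is a horizontal segment from lower to upper at the row's label.
    intervals = [([row.lower, row.upper], [name, name])
                 for name, row in df.iterrows()]
    xs, ys = tuple(zip(*intervals))

    # Categorical y-axis with one tick per row; `width` is in pixels.
    p = figure(x_range=[0, 1], y_range=list(df.index),
               width=1000, title=title)
    p.xaxis.axis_label = "Gender Ratio (% Female)"
    p.multi_line(xs, ys, line_width=5, alpha=0.8)
    p.circle(df.point, df.index, size=8, line_color="black", fill_color="lightgrey")
    return p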

In [19]:
category_data = raw_data.groupby("Major_category")[["Men", "Women"]].sum()

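# Same 99.99% Jeffreys interval, now on the per-category totals.
# The method name is "jeffreys" in current statsmodels.
lower, upper = proportion_confint(count=category_data.Women, nobs=category_data.Men + category_data.Women,
                                  alpha=0.0001, method="jeffreys")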

plot_df = pd.DataFrame({"lower": lower, "upper": upper, 
                        "point": category_data.Women / (category_data.Men + category_data.Women)}) \
            .set_index(category_data.index) \
            .sort_values("point")
show(plot(plot_df, "Gender Ratio by Major Category"))

In [20]:
select = [
    "ACCOUNTING", "ANTHROPOLOGY AND ARCHEOLOGY", "ARCHITECTURE", "ART HISTORY AND CRITICISM",
    "BIOCHEMICAL SCIENCES", "BIOLOGICAL ENGINEERING", "BIOLOGY", "BUSINESS MANAGEMENT AND ADMINISTRATION", 
    "CHEMICAL ENGINEERING", "CHEMISTRY", "CIVIL ENGINEERING", "COMPUTER ENGINEERING", "COMPUTER SCIENCE", 
    "DRAMA AND THEATER ARTS", 
    "ECONOMICS", "ELECTRICAL ENGINEERING", "ENGLISH LANGUAGE AND LITERATURE", 
    "FINANCE", 
    "GENERAL BUSINESS", 
    "HISTORY", 
    "JOURNALISM", 
    "MATHEMATICS", "MECHANICAL ENGINEERING", "MUSIC", 
    "NURSING", 
    "PHILOSOPHY AND RELIGIOUS STUDIES", "PHYSICS", "POLITICAL SCIENCE AND GOVERNMENT", "PSYCHOLOGY", 
    "SOCIOLOGY",
]
plot_df = df.loc[select].sort_values("point")
show(plot(plot_df, "Gender Ratio of Selected Majors"))

In [21]:
# Last 25 rows after an ascending sort are the 25 largest majors.
plot_df = df.sort_values("sample_size").iloc[-25:] \
            .sort_values("point")
show(plot(plot_df, "Gender Ratio for 25 Most Common Majors"))

In [22]:
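# First 25 rows after an ascending sort are the 25 smallest majors.
plot_df = df.sort_values("sample_size").iloc[:25] \
            .sort_values("point")
# Truncate long major names so the axis labels stay readable.
plot_df.index = [name if len(name) < 30 else name[:27] + "..." for name in plot_df.index]
show(plot(plot_df, "Gender Ratio for 25 Least Common Majors"))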

In [23]:
plot_df = df.assign(extremeness=lambda df: np.abs(df.point - 0.5)) \
            .sort_values("extremeness") \
            .iloc[-20:] \
            .sort_values("point")
show(plot(plot_df, "Gender Ratio of 20 Most Skewed Majors"))

In [24]:
plot_df = df.assign(extremeness=lambda df: np.abs(df.point - 0.5)) \
            .sort_values("extremeness") \
            .iloc[:20] \
            .sort_values("point")
p = plot(plot_df, "Gender Ratio of 20 Most Even Majors")
p.x_range = Range1d(0.4, 0.6)
show(p)