# My first Jupyter Notebook

Hello!

This is a quick demonstration of what a Jupyter Notebook looks like and of how we work with data in R, a statistical programming language.

In this demonstration, we work with a dataset consisting of UC Berkeley's 1973 graduate admission data for six departments, by gender. In 1973, UC Berkeley was accused of gender bias in its graduate admissions, with the claim that admissions were skewed against women applicants.

We hope to glean from the data whether UC Berkeley graduate admissions in 1973 were indeed biased against women.

**You are not expected to understand the gory details at the moment.  The main goal today is to get a glimpse of the typical data exploration process using R that is involved in answering a question.**

In [11]:
# load a package (a "toolbox") called dplyr,
#  which is helpful for working with data tables

library(dplyr)

In [12]:
# load the dataset of UC Berkeley graduate admissions in 1973 from a CSV file

berkeleydata <- read.csv('berkeley73.csv')

In [13]:
# Let's look at the contents of the dataset and roughly examine the numbers.

berkeleydata

Department,Men_Applicants,Men_Admitted,Women_Applicants,Women_Admitted
A,825,512,108,89
B,560,353,25,17
C,325,120,593,202
D,417,138,375,131
E,191,53,393,94
F,373,22,341,24
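
If `berkeley73.csv` is not at hand, the same table can be typed in directly. The sketch below is one way to do that, assuming the six rows printed above are the whole file. The name `berkeley_inline` is made up for this illustration and is not used elsewhere in the notebook.

```r
# Rebuild the table by hand from the values printed above
# (berkeley_inline is a hypothetical name used only in this sketch)
berkeley_inline <- data.frame(
  Department       = c("A", "B", "C", "D", "E", "F"),
  Men_Applicants   = c(825, 560, 325, 417, 191, 373),
  Men_Admitted     = c(512, 353, 120, 138, 53, 22),
  Women_Applicants = c(108, 25, 593, 375, 393, 341),
  Women_Admitted   = c(89, 17, 202, 131, 94, 24)
)

# Quick sanity checks: size of the table and the type of each column
dim(berkeley_inline)
str(berkeley_inline)

# Nobody can be admitted without applying, so both of these should be TRUE
all(berkeley_inline$Men_Admitted <= berkeley_inline$Men_Applicants)
all(berkeley_inline$Women_Admitted <= berkeley_inline$Women_Applicants)
```

Checks like these are a cheap way to catch a mistyped or misread number before doing any analysis.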


In [14]:
# compute the total number of men and women applicants and 
#  the total number of men and women who were admitted

total <- colSums(berkeleydata[, -1])
total

In [15]:
# Organize it in a nicer format

berkeley_total <- data.frame( Applicants = c( total[[1]], total[[3]]), 
                              Admitted = c( total[[2]], total[[4]]) )
rownames(berkeley_total) <- c('Men', 'Women')
berkeley_total

,Applicants,Admitted
Men,2691,1198
Women,1835,557


In [16]:
# compute the percentage of admitted men (out of all men applicants) and
#  that of admitted women (out of all women applicants)

berkeley_total <- mutate(berkeley_total, percentAdmitted = round(Admitted/Applicants*100, 2) )
rownames(berkeley_total) <- c('Men', 'Women')
berkeley_total

,Applicants,Admitted,percentAdmitted
Men,2691,1198,44.52
Women,1835,557,30.35


In the table above, the admission rate for women is only about 30%, while the admission rate for men is almost 45%. This might support the claim that there was a gender bias against women in UC Berkeley's graduate admissions in 1973.

Are we done?

Well, if we look at the original dataset, which we reproduce below, it seems at a glance that some departments actually admitted a high percentage of women applicants.

In [17]:
berkeleydata

Department,Men_Applicants,Men_Admitted,Women_Applicants,Women_Admitted
A,825,512,108,89
B,560,353,25,17
C,325,120,593,202
D,417,138,375,131
E,191,53,393,94
F,373,22,341,24


Let's compute the admission rate of men and women, but this time for each department individually.

In [18]:
berkeleydata$Men_percentAdmitted <- berkeleydata$Men_Admitted / berkeleydata$Men_Applicants * 100
berkeleydata$Women_percentAdmitted <- berkeleydata$Women_Admitted / berkeleydata$Women_Applicants * 100

berkeleydata

Department,Men_Applicants,Men_Admitted,Women_Applicants,Women_Admitted,Men_percentAdmitted,Women_percentAdmitted
A,825,512,108,89,62.060606,82.407407
B,560,353,25,17,63.035714,68.0
C,325,120,593,202,36.923077,34.064081
D,417,138,375,131,33.093525,34.933333
E,191,53,393,94,27.748691,23.918575
F,373,22,341,24,5.898123,7.038123


In fact, the admission rate for women is higher than the admission rate for men in four of the six departments (A, B, D, and F). Furthermore, in the two remaining departments (C and E), the rates are comparable:
+ In department C, 36.9% of men applicants and 34.1% of women applicants were admitted
+ In department E, 27.7% of men applicants and 23.9% of women applicants were admitted

It's also worth noticing that department A admitted a much higher percentage of women than men (82.4% versus 62.1%).
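
The department-by-department comparison can also be left to R instead of being read off the table by eye. Here is a minimal sketch in base R, with the numbers typed in again so that the cell runs on its own. The names `dept`, `men_rate`, `women_rate`, and `gap` are invented for this example.

```r
# Values copied from the table above
dept       <- c("A", "B", "C", "D", "E", "F")
men_rate   <- c(512, 353, 120, 138, 53, 22) / c(825, 560, 325, 417, 191, 373) * 100
women_rate <- c(89, 17, 202, 131, 94, 24) / c(108, 25, 593, 375, 393, 341) * 100

# Difference in percentage points;
# a positive gap means women were admitted at a higher rate than men
gap <- round(women_rate - men_rate, 1)
names(gap) <- dept
gap

# Departments where the women's rate is the higher one
dept[women_rate > men_rate]
```

The last line picks out departments A, B, D, and F, which matches what we read off the table.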

This seems to contradict what we observed from the totals, where the numbers for all departments were pooled together!

## Take-home messages from this exercise
+ Sometimes a simple data analysis isn't enough to derive a definite conclusion.  More thorough analysis might be needed! <br><br>
+ This dataset illustrates something called "Simpson's Paradox": We obtain one conclusion when we look at the dataset as a whole, but when we look at the data by group (e.g., by department), we obtain a different conclusion.
    
    If you are curious, see https://en.wikipedia.org/wiki/Simpson%27s_paradox for more on Simpson's Paradox!
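
One way to see where the reversal comes from is to look at which departments each group applied to. The sketch below uses base R and invented variable names (`men_app`, `women_adm`, `mix`, and so on). It puts each department's share of a group's applications next to that department's admission rate for all applicants combined. It then rebuilds each group's overall admission rate as a weighted average of its department rates.

```r
# Values copied from the department table earlier in the notebook
dept      <- c("A", "B", "C", "D", "E", "F")
men_app   <- c(825, 560, 325, 417, 191, 373)
men_adm   <- c(512, 353, 120, 138, 53, 22)
women_app <- c(108, 25, 593, 375, 393, 341)
women_adm <- c(89, 17, 202, 131, 94, 24)

# Share of each group's applications that went to each department (in %),
# next to the department's admission rate for men and women combined (in %)
mix <- data.frame(
  Department      = dept,
  Men_share       = round(men_app / sum(men_app) * 100, 1),
  Women_share     = round(women_app / sum(women_app) * 100, 1),
  Dept_admit_rate = round((men_adm + women_adm) / (men_app + women_app) * 100, 1)
)
mix

# A group's overall rate is the average of its department rates,
# weighted by how many of its applicants each department received
men_overall   <- weighted.mean(men_adm / men_app, w = men_app) * 100
women_overall <- weighted.mean(women_adm / women_app, w = women_app) * 100
round(c(Men = men_overall, Women = women_overall), 2)
```

In this table, departments A and B have the highest admission rates, and they received about half of the men's applications but less than a tenth of the women's. Most of the women's applications went to departments C through F, where admission rates were lower for everyone, and that pulls the women's overall rate down even though their rate within most departments is the higher one.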