# Simpson's Paradox

> Simpson's paradox occurs when a trend that appears in each of several groups of data reverses when the groups are combined. Recognizing it is important for interpreting data correctly.

In [1]:
# Load and view first few lines of dataset
import pandas as pd
df = pd.read_csv('datasets/admission_data.csv')
df.head()

Unnamed: 0,student_id,gender,major,admitted
0,35377,female,Chemistry,False
1,56105,male,Physics,True
2,31441,female,Chemistry,False
3,51765,male,Physics,True
4,53714,female,Physics,True


### Proportion and admission rate for each gender

In [2]:
# Proportion of students that are female
total = df['student_id'].count()
female = df.query("gender == 'female'")
female_count = female['gender'].count()
female_prop = female_count/total
female_prop

0.514

In [3]:
# Proportion of students that are male
total = df['student_id'].count()
male = df.query("gender == 'male'")
male_count = male['gender'].count()
male_prop = male_count/total
male_prop

0.486

In [4]:
# Admission rate for females
female_admitted = female[female['admitted']]
female_admitted_count = female_admitted['gender'].count()
female_admit_rate = female_admitted_count/female_count
female_admit_rate

0.28793774319066145

In [5]:
# Admission rate for males
male_admitted = male[male['admitted']]
male_admitted_count = male_admitted['gender'].count()
male_admit_rate = male_admitted_count/male_count
male_admit_rate

0.48559670781893005

### Proportion and admission rate for physics majors of each gender

In [6]:
# What proportion of female students are majoring in physics?
female_physics = female.query("major=='Physics'")
female_physics_count = female_physics['gender'].count()
female_physics_prop = female_physics_count/female_count
female_physics_prop

0.12062256809338522

In [7]:
# What proportion of male students are majoring in physics?
male_physics = male.query("major=='Physics'")
male_physics_count = male_physics['gender'].count()
male_physics_prop = male_physics_count/male_count
male_physics_prop

0.9259259259259259

In [8]:
# Admission rate for female physics majors
female_physics_admit = female_physics.query("admitted")['gender'].count()
female_physics_admit_rate = female_physics_admit/female_physics_count
female_physics_admit_rate

0.7419354838709677

In [9]:
# Admission rate for male physics majors
male_physics_admit = male_physics.query("admitted")['gender'].count()
male_physics_admit_rate = male_physics_admit/male_physics_count
male_physics_admit_rate

0.5155555555555555

### Proportion and admission rate for chemistry majors of each gender

In [10]:
# What proportion of female students are majoring in chemistry?
female_chemistry = female.query("major=='Chemistry'")
female_chemistry_count = female_chemistry['gender'].count()
female_chemistry_prop = female_chemistry_count/female_count
female_chemistry_prop

0.8793774319066148

In [11]:
# What proportion of male students are majoring in chemistry?
male_chemistry = male.query("major=='Chemistry'")
male_chemistry_count = male_chemistry['gender'].count()
male_chemistry_prop = male_chemistry_count/male_count
male_chemistry_prop

0.07407407407407407

In [12]:
# Admission rate for female chemistry majors
female_chemistry_admit = female_chemistry.query("admitted")['gender'].count()
female_chemistry_admit_rate = female_chemistry_admit/female_chemistry_count
female_chemistry_admit_rate

0.22566371681415928

In [13]:
# Admission rate for male chemistry majors
male_chemistry_admit = male_chemistry.query("admitted")['gender'].count()
male_chemistry_admit_rate = male_chemistry_admit/male_chemistry_count
male_chemistry_admit_rate

0.1111111111111111

### Admission rate for each major

In [14]:
# Admission rate for physics majors
physics_count = len(df[(df['major'] == 'Physics')])
physics_admit = len(df[(df['major'] == 'Physics') & df['admitted']]) 
physics_admit_rate = physics_admit/physics_count
physics_admit_rate

0.54296875

In [15]:
# Admission rate for chemistry majors
chemistry_count = len(df[(df['major'] == 'Chemistry')])
chemistry_admit = len(df[(df['major'] == 'Chemistry') & df['admitted']]) 
chemistry_admit_rate = chemistry_admit/chemistry_count
chemistry_admit_rate

0.21721311475409835

> When we compare admission rates by gender alone, males appear to be favored: 48.6% of male applicants are admitted versus 28.8% of female applicants. When we compare admission rates within each major, however, the female admission rate is higher than the male rate in both Physics and Chemistry, which suggests that females are favored. Changing how the data is grouped reverses the conclusion; this is Simpson's paradox.
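
The two groupings described above can also be computed with a couple of `groupby` calls. The following is a minimal sketch, not the method used in the cells above: since `datasets/admission_data.csv` is not reproduced here, it rebuilds a stand-in frame (the hypothetical `demo`) from the applicant and admitted counts reported in this notebook, assuming the same `gender`, `major`, and `admitted` column names.

```python
import pandas as pd

# Counts reported in this notebook: (gender, major, applicants, admitted).
counts = [
    ("male", "Physics", 225, 116),
    ("male", "Chemistry", 18, 2),
    ("female", "Physics", 31, 23),
    ("female", "Chemistry", 226, 51),
]

# Rebuild one row per applicant; `demo` is a stand-in for the real dataset.
rows = []
for gender, major, applicants, admitted in counts:
    rows += [(gender, major, True)] * admitted
    rows += [(gender, major, False)] * (applicants - admitted)
demo = pd.DataFrame(rows, columns=["gender", "major", "admitted"])

# The mean of a boolean column is the fraction of True values,
# which here is the admission rate.
rate_by_gender = demo.groupby("gender")["admitted"].mean()
rate_by_major_gender = (
    demo.groupby(["major", "gender"])["admitted"].mean().unstack("gender")
)

print(rate_by_gender.round(3))
print(rate_by_major_gender.round(3))
```

`rate_by_gender` corresponds to the gender-only comparison, and `rate_by_major_gender` puts the per-major rates for both genders side by side, so the two views can be read from two small tables.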


# Conclusion and Findings

The results of the calculations above can be tabulated as follows:

| Major     | Male total | Male admitted | Male rate | Female total | Female admitted | Female rate |
|-----------|------------|---------------|-----------|--------------|-----------------|-------------|
| Physics   | 225        | 116           | 51.6%     | 31           | 23              | 74.2%       |
| Chemistry | 18         | 2             | 11.1%     | 226          | 51              | 22.6%       |


Comparing within each major, the female admission rate exceeds the male rate in Physics (74.2% versus 51.6%) and in Chemistry (22.6% versus 11.1%). From this view we would conclude that **`female applicants are favored.`**

Now let's look at the same table with a row of overall totals added:


| Major     | Male total | Male admitted | Male rate | Female total | Female admitted | Female rate |
|-----------|------------|---------------|-----------|--------------|-----------------|-------------|
| Physics   | 225        | 116           | 51.6%     | 31           | 23              | 74.2%       |
| Chemistry | 18         | 2             | 11.1%     | 226          | 51              | 22.6%       |
| Overall   | 243        | 118           | 48.6%     | 257          | 74              | 28.8%       |

- Of 243 male applicants, 118 were admitted, an overall male admission rate of 48.6%.
- Of 257 female applicants, 74 were admitted, an overall female admission rate of 28.8%.


From this view we would conclude that **`male applicants are favored.`**
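
The two views disagree because each overall rate is a weighted average of the per-major rates, weighted by the share of that gender's applicants in each major. About 92.6% of male applicants chose Physics, the major with the higher admission rate, while about 87.9% of female applicants chose Chemistry, the major with the lower one. The sketch below works through that arithmetic using the counts from the table; the dictionary layout and the helper name `overall_rate` are illustrative assumptions, not part of the original notebook.

```python
# (applicants, admitted) per major, taken from the table above.
male = {"Physics": (225, 116), "Chemistry": (18, 2)}
female = {"Physics": (31, 23), "Chemistry": (226, 51)}


def overall_rate(by_major):
    """Overall rate as the sum over majors of (applicant share) * (admission rate)."""
    total_applicants = sum(applicants for applicants, _ in by_major.values())
    rate = 0.0
    for applicants, admitted in by_major.values():
        share = applicants / total_applicants  # weight of this major
        rate += share * (admitted / applicants)
    return rate


for label, group in (("male", male), ("female", female)):
    total = sum(applicants for applicants, _ in group.values())
    shares = {
        major: round(applicants / total, 3)
        for major, (applicants, _) in group.items()
    }
    print(label, shares, round(overall_rate(group), 3))
```

The weighted sums reproduce the overall rates of 48.6% and 28.8%, which shows that the gap between the genders comes from the mix of majors applied to rather than from the per-major rates.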

This reversal between the per-major comparison and the overall comparison is known as **`Simpson's paradox`**.
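
A check for this kind of reversal can be written directly against the counts. The function below is a hypothetical helper (`shows_reversal` is not part of pandas or of this notebook), offered as a sketch: it reports whether one group has the higher rate in every stratum while the other group has the higher pooled rate.

```python
def rate(applicants, admitted):
    """Admission rate for one (applicants, admitted) pair."""
    return admitted / applicants


def shows_reversal(group_a, group_b):
    """Return True if one group wins in every stratum but loses overall.

    Each argument maps a stratum name to an (applicants, admitted) pair,
    and both arguments must use the same stratum names.
    """
    a_wins = [rate(*group_a[s]) > rate(*group_b[s]) for s in group_a]
    b_wins = [rate(*group_b[s]) > rate(*group_a[s]) for s in group_a]

    def pooled(group):
        applicants = sum(n for n, _ in group.values())
        admitted = sum(k for _, k in group.values())
        return admitted / applicants

    a_pooled, b_pooled = pooled(group_a), pooled(group_b)
    return (all(a_wins) and a_pooled < b_pooled) or (
        all(b_wins) and b_pooled < a_pooled
    )


# Counts from the table above.
female = {"Physics": (31, 23), "Chemistry": (226, 51)}
male = {"Physics": (225, 116), "Chemistry": (18, 2)}
print(shows_reversal(female, male))  # prints True
```

With these counts the female rate is higher in both majors while the pooled male rate is higher, so the check returns `True`.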