# An analysis of demographics information, 1994

**Authors:** Aaron Ou, Brian Lin, Julien Yu

## Abstract

We work with the `adult.csv` dataset (1994, UC Irvine Machine Learning Repository), which records demographic information (e.g. age, work class, education, race, and sex) about adults with an occupation. We also aim to forecast whether an individual makes more than \$50,000 per year from the rest of his or her demographic information. 

We chose this dataset because, although many papers cite it, they mostly do so in a machine learning context, trying to improve the accuracy of classifying whether an individual makes more than \$50,000 per year. To our knowledge, little work has looked into **the relationship between different explanatory variables**. For example, does one's gender affect how many years of education one obtains? What about one's race? These questions are essential, not only to understand the American society of 1994, but also to understand that of today.

## 1. Data Cleaning and Overview of Dataset

The dataset comprises demographic information about 32561 individuals (rows), each characterized by 15 attributes (columns) including age, work class, education, race, and sex. Values for workclass, occupation, or native country are missing in 2399 rows. After removing every row with missing information, we obtain a **cleaned dataset** of 30162 rows, whose summary statistics change little from those of the raw dataset. 
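
The cleaning step described above can be sketched as follows. This is a minimal illustration on a toy frame, not the script that produced `clean_adult`: it assumes that missing entries are encoded as the string `?` (the report does not state the encoding), the column name `native.country` is assumed by analogy with `education.num`, and the helper name `drop_missing_rows` is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missing_rows(df, missing_marker="?"):
    """Return a copy of df without rows that contain a missing value.

    Assumption: missing entries are encoded as `missing_marker`.
    """
    # Turn the marker into NaN so that pandas recognises it as missing.
    marked = df.replace(missing_marker, np.nan)
    # Drop every row with at least one NaN and renumber the index.
    return marked.dropna(axis=0, how="any").reset_index(drop=True)


# Toy data (values are made up; column names are assumptions).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "?", "Handlers-cleaners", "?"],
    "native.country": ["United-States", "Cuba", "?", "United-States"],
})

clean_toy = drop_missing_rows(toy)
n_dropped = len(toy) - len(clean_toy)
print(f"dropped {n_dropped} of {len(toy)} rows")
```

Comparing `toy.describe()` with `clean_toy.describe()` afterwards is the same check the cell below performs on the real data.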

In [2]:
import pandas as pd
adult = pd.read_hdf('results/df1.h5', 'adult')
print("Raw Dataset")
display(adult.describe())
print("\nCleaned Dataset")
clean_adult = pd.read_hdf('results/df1.h5', 'clean_adult')
display(clean_adult.describe())

Raw Dataset


statistic | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week
--- | --- | --- | --- | --- | --- | ---
count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0
mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456
std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429
min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0
25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0
50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0
75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0
max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0



Cleaned Dataset


statistic | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week
--- | --- | --- | --- | --- | --- | ---
count | 30162.0 | 30162.0 | 30162.0 | 30162.0 | 30162.0 | 30162.0
mean | 38.437902 | 189793.8 | 10.121312 | 1092.007858 | 88.372489 | 40.931238
std | 13.134665 | 105653.0 | 2.549995 | 7406.346497 | 404.29837 | 11.979984
min | 17.0 | 13769.0 | 1.0 | 0.0 | 0.0 | 1.0
25% | 28.0 | 117627.2 | 9.0 | 0.0 | 0.0 | 40.0
50% | 37.0 | 178425.0 | 10.0 | 0.0 | 0.0 | 40.0
75% | 47.0 | 237628.5 | 13.0 | 0.0 | 0.0 | 45.0
max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0


A few initial observations about the dataset are as follows: 
* The age distribution of the cleaned dataset is **skewed to the right**. Our explanation is that children do not have occupations and that people leave their occupations at a certain age (e.g. 60 to 70 years old). 
* The distribution of "years of education (since 4th grade)" has **2 peaks**, which correspond to individuals who finished high school and college, respectively. Relatively few people in the cleaned dataset have an education level below high school or above college (e.g. Master's, PhD). Back in 1994, graduate school was not yet prevalent, while most people had had the chance to graduate from high school. 
* About **one half** of the individuals in the cleaned dataset were married, and about one third had never been married. This by no means implies that one third of Americans never marry, since much of the data comes from people in their 20s and 30s. It is reasonable that they first had an occupation and then got married. 
* About **85%** of the individuals in the cleaned dataset self-identified as White, a figure that likely includes Hispanics too. This suggests that in 1994 the United States had far fewer immigrants than it has today. For comparison, in the 2010 census (as reported on Wikipedia), around 72.4% of the US population was White.
* About **two thirds** of the individuals in the cleaned dataset were male, and the remaining one third were female. Our explanation is that in 1994 female labor participation was still much lower than it is today, when around 47 percent of the US labor force is female.
* Conclusion: the graphs below show the univariate distributions of the labor force with respect to age, education background, marital status, ethnicity and sex. The labor statistics of sex and ethnicity, in particular, are quite different from those of today given that the US is becoming more inclusive.
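
Shares such as "about two thirds male" or "about 85% White" are plain category frequencies. The sketch below shows one way to compute them with pandas; the toy frame and its values are made up for illustration and do not reproduce the report's numbers.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up).
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "race": ["White", "White", "Black", "White", "White", "Asian-Pac-Islander"],
})

# Share of each category, as a fraction of all rows.
sex_share = toy["sex"].value_counts(normalize=True)
race_share = toy["race"].value_counts(normalize=True)

print(sex_share.round(3))
print(race_share.round(3))
```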

![title](fig/columns.png)

## 2. Analysis of Sex/Race versus Occupation

The second notebook mainly explores the pairwise relationship between sex (or race) and occupation. Bar charts are used for this section of our analysis, because sex, race and occupation are all qualitative variables.

The following graph compares the distribution of occupations between males and females. A few gaps (where one sex scores a percentage that **at least doubles** that of the other sex) are recorded as follows:
* Males are much more likely to engage in craft repair (e.g. auto mechanic), handling/cleaning (e.g. street cleaner), farming/fishing (e.g. farmer), protective services (e.g. guard), transport (e.g. bus driver), and the armed forces (e.g. soldier). From these results, males engage more in professions that require **body strength and mechanical skills**, which makes intuitive sense.

* Females are much more likely than males to engage in administrative clerical roles (e.g. secretary), private house services (e.g. housekeeper), and other services that are not explicitly stated (e.g. waitress). From these results, females engage more in professions that require **care and attentiveness**, which also makes intuitive sense.

* Other occupations, such as executive managerial roles (e.g. CEO), professions with specialty (e.g. hairdresser), technical support (e.g. AT&T worker) and sales (e.g. salesman), have close percentages of males and females.
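
The "at least doubles" criterion used in the list above can be sketched as follows. This is an illustration on made-up data, not the code behind the figure: each column of the cross-tabulation is normalised to percentages within one sex, and an occupation is flagged when the larger percentage is at least twice the smaller one.

```python
import pandas as pd

# Toy data (made up): 6 males and 4 females.
toy = pd.DataFrame({
    "sex": ["Male"] * 6 + ["Female"] * 4,
    "occupation": [
        "Craft-repair", "Craft-repair", "Craft-repair",
        "Sales", "Sales", "Adm-clerical",
        "Adm-clerical", "Adm-clerical", "Sales", "Sales",
    ],
})

# Percentage of each sex working in each occupation (each column sums to 100).
pct = pd.crosstab(toy["occupation"], toy["sex"], normalize="columns") * 100

# Flag occupations where one sex's percentage is at least double the other's.
larger = pct.max(axis=1)
smaller = pct.min(axis=1)
gap = larger >= 2 * smaller

print(pct.round(1))
print(gap)
```

An occupation in which one sex has no members at all is flagged as well, since any positive percentage is at least twice zero.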

![title](fig/occupation_by_sex.png)

We then compare the occupation distributions by race. Given that multiple ethnicities are recorded in the cleaned dataset, we show the distribution pattern of each ethnicity in a graph of 5 subplots. Some observations are as follows:
* Around 14% of the individuals identified as White or Asian/Pacific Islander have **executive/managerial roles**, while this percentage is markedly lower for the other ethnic groups. These two groups may have better opportunities than other groups, or the gap may show an inherent **bias of society**. A more detailed study is needed to confirm or refute this observation.
* For the Black group, **other services** and **administrative clerical roles** are almost twice as frequent as other occupations, while for the White group, the occupation distribution spreads more evenly across categories. This, again, may show the inherent bias of society mentioned above.
* The most common occupation for the Asian/Pacific Islander group is "**professions with specialty**". This may reflect the fact that immigrants from this ethnic group often come to the US with a specialty, and that American society is more willing to accept Asian immigrants with a technical skill.
* The most common occupation for the Amer/Indian/Eskimo group is **craft repair**. One possible, though untested, explanation is that skilled craft trades are common among individuals from this ethnic group. 
* The **armed-forces** occupation has a very small sample, probably because military service is rarely recorded as an occupation in this data, with only a few exceptions. In the cleaned dataset, no one from the Asian or the "Other race" ethnic group falls into the "armed-forces" category, possibly because recent immigrants were unlikely to serve back in 1994.

![title](fig/occupation_by_race.png)

To determine whether two categorical variables are associated, we apply the **Chi-Square Test of Independence**. The test is considered appropriate only when the expected frequency count for each cell of the contingency table is at least 5, yet for some occupations (e.g. armed-forces) one or more sex/ethnic groups have no members. Thus, we have to proceed with caution. 

When the expected frequency count for a cell is less than 5, we treat the case as showing an obvious relationship between the categorical variables; this is a working convention of ours, not a property of the test. We are interested in the pairwise group relationships (e.g. White/Black, White/Asian, etc.), so we run the chi-square test for each pair of groups. Still, we cannot claim causal relationships between sex/race groups and occupations: correlation is not causation, and the dataset contains many potential confounding variables, such as education level.
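
A chi-square test of independence, together with the "expected count at least 5" check, can be sketched with `scipy.stats.chi2_contingency`. This is a minimal sketch on made-up data; the helper name `independence_test` and the significance level `alpha=0.05` are assumptions, and it is not the `chi_square_test` function from `functions/chi_square_test.py` used in the cells below.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def independence_test(df, row_var, col_var, alpha=0.05):
    """Chi-square test of independence between two categorical columns."""
    # Contingency table of observed counts.
    observed = pd.crosstab(df[row_var], df[col_var])
    chi2, p_value, dof, expected = chi2_contingency(observed)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        # The usual rule of thumb asks for every expected count to be >= 5.
        "min_expected": expected.min(),
        "expected_ok": bool((expected >= 5).all()),
        "reject_independence": bool(p_value < alpha),
    }


# Toy data (made up): occupation depends strongly on sex.
toy = pd.DataFrame({
    "sex": ["Male"] * 40 + ["Female"] * 40,
    "occupation": (["Craft-repair"] * 30 + ["Adm-clerical"] * 10
                   + ["Craft-repair"] * 10 + ["Adm-clerical"] * 30),
})

result = independence_test(toy, "occupation", "sex")
print(result)
```

Reporting `min_expected` next to the p-value makes it easy to see when the rule of thumb is violated and the result should be read with caution.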

As supported by the Chi-Square Test below, there is a **clear association** between **sex** and **occupation**. For instance, females are more likely to work in administrative clerical roles and in other services not explicitly stated, while males are more likely to work in craft repair, handling/cleaning, farming/fishing, and transport. There is only a subtle difference between males' and females' engagement in executive managerial roles.

In [3]:
import scipy.stats as stats
%run -i 'functions/find_indices_with_value.py'
%run -i 'functions/chi_square_test.py'
chi_square_test(clean_adult, "occupation", "sex")

statistically significant: ('occupation', 'sex')
p-value is 0.0


occupation | Female | Male
--- | --- | ---
Adm-clerical | 25.67982 | 5.934907
Craft-repair | 2.208137 | 18.722694
Exec-managerial | 11.684727 | 13.985568
Farming-fishing | 0.664486 | 4.53586
Handlers-cleaners | 1.676549 | 5.822002
Machine-op-inspct | 5.551012 | 6.98542
Other-service | 17.971785 | 7.137598
Priv-house-serv | 1.380086 | 0.039272
Prof-specialty | 15.242282 | 12.503068
Protective-serv | 0.776937 | 2.788277
Protective-serv,0.776937,2.788277


(5711.6103020023511, 0.0, 12, array([[ 1207.13766458,  2513.86233542],
        [ 1307.38102345,  2722.61897655],
        [ 1295.05336119,  2696.94663881],
        [  320.84363082,   668.15636918],
        [  437.95642225,   912.04357775],
        [  637.79431566,  1328.20568434],
        [ 1042.01187278,  2169.98812722],
        [   46.39093954,    96.60906046],
        [ 1309.97632076,  2728.02367924],
        [  208.92143402,   435.07856598],
        [ 1162.69319802,  2421.30680198],
        [  295.86389414,   616.13610586],
        [  509.97592279,  1062.02407721]]))

The ensuing cell displays 10 tables, each of which compares the occupation distributions of one pair of ethnic groups. As supported by the Chi-Square Tests, **every pair of races differs significantly** in terms of occupation. 
* The greatest pairwise p-value is around **8e-3 ("Amer-Indian-Eskimo", "Other")** and the smallest is around **1e-131 ("Black", "White")**. 
* The Amer-Indian-Eskimo group and the "Other" group have the most alike (though not categorically alike) occupation distributions. Both are small groups, and a p-value also depends on sample size, so this larger p-value may partly reflect the limited data rather than a genuine similarity. 
* The p-value of about 1e-131 between the White group and the Black group is striking but not surprising. The second smallest p-value, about 2e-35, is between the Asian-Pac-Islander group and the Black group. This result, again, may illustrate the inherent bias against Black people.
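
Running one test per pair of groups can be sketched with `itertools.combinations`. The helper `pairwise_chi_square` and the group labels `A`, `B`, `C` are hypothetical, and the data is made up; how the authors' `chi_square_test` handles its fourth argument is not shown in this report, so this is only an illustration of the idea.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi_square(df, outcome, group):
    """Run one chi-square test of independence per pair of levels of `group`."""
    p_values = {}
    for a, b in combinations(sorted(df[group].unique()), 2):
        # Keep only the rows that belong to the two groups being compared.
        subset = df[df[group].isin([a, b])]
        observed = pd.crosstab(subset[outcome], subset[group])
        _, p_value, _, _ = chi2_contingency(observed)
        p_values[(a, b)] = p_value
    return p_values


# Toy data (made up): groups A and B share a distribution, C differs.
toy = pd.DataFrame({
    "race": ["A"] * 40 + ["B"] * 40 + ["C"] * 40,
    "occupation": (["x"] * 30 + ["y"] * 10) * 2 + ["x"] * 10 + ["y"] * 30,
})

p_values = pairwise_chi_square(toy, "occupation", "race")
for pair, p_value in p_values.items():
    print(pair, p_value)
```

With 5 groups this produces the 10 comparisons listed below; because many tests are run, small differences in p-values between pairs should not be over-interpreted.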

In [12]:
from itertools import combinations
chi_square_test(clean_adult, "occupation", "race", True)

statistically significant: ('Amer-Indian-Eskimo', 'Asian-Pac-Islander')
p-value is 5.39804424413e-11


occupation | Amer-Indian-Eskimo | Asian-Pac-Islander
--- | --- | ---
Adm-clerical | 10.877193 | 15.039282
Craft-repair | 15.438596 | 9.315376
Exec-managerial | 10.526316 | 13.580247
Farming-fishing | 3.508772 | 1.795735
Handlers-cleaners | 7.719298 | 2.469136
Machine-op-inspct | 6.666667 | 5.611672
Other-service | 11.578947 | 13.131313
Prof-specialty | 11.578947 | 19.753086
Protective-serv | 2.807018 | 1.571268
Sales | 9.122807 | 10.774411


statistically significant: ('Amer-Indian-Eskimo', 'Black')
p-value is 4.29464358621e-07


occupation | Amer-Indian-Eskimo | Black
--- | --- | ---
Adm-clerical | 10.839161 | 17.204301
Armed-Forces | 0.34965 | 0.035842
Craft-repair | 15.384615 | 8.387097
Exec-managerial | 10.48951 | 8.422939
Farming-fishing | 3.496503 | 1.505376
Handlers-cleaners | 7.692308 | 6.09319
Machine-op-inspct | 6.643357 | 9.641577
Other-service | 11.538462 | 19.820789
Prof-specialty | 11.538462 | 8.100358
Protective-serv | 2.797203 | 3.512545


statistically significant: ('Amer-Indian-Eskimo', 'Other')
p-value is 0.00793749750909


occupation | Amer-Indian-Eskimo | Other
--- | --- | ---
Adm-clerical | 10.877193 | 10.087719
Craft-repair | 15.438596 | 10.964912
Exec-managerial | 10.526316 | 4.824561
Farming-fishing | 3.508772 | 4.824561
Handlers-cleaners | 7.719298 | 4.824561
Machine-op-inspct | 6.666667 | 17.105263
Other-service | 11.578947 | 16.22807
Prof-specialty | 11.578947 | 12.280702
Protective-serv | 2.807018 | 2.192982
Sales | 9.122807 | 9.649123


statistically significant: ('Amer-Indian-Eskimo', 'White')
p-value is 0.000523857680953


occupation | Amer-Indian-Eskimo | White
--- | --- | ---
Adm-clerical | 10.839161 | 11.822336
Armed-Forces | 0.34965 | 0.027107
Craft-repair | 15.384615 | 14.110905
Exec-managerial | 10.48951 | 13.921159
Farming-fishing | 3.496503 | 3.523854
Handlers-cleaners | 7.692308 | 4.356413
Machine-op-inspct | 6.643357 | 6.153191
Other-service | 11.538462 | 9.572491
Prof-specialty | 11.538462 | 13.843711
Protective-serv | 2.797203 | 2.009758


statistically significant: ('Asian-Pac-Islander', 'Black')
p-value is 2.08005869185e-35


occupation | Asian-Pac-Islander | Black
--- | --- | ---
Adm-clerical | 14.972067 | 17.045455
Craft-repair | 9.273743 | 8.309659
Exec-managerial | 13.519553 | 8.34517
Farming-fishing | 1.787709 | 1.491477
Handlers-cleaners | 2.458101 | 6.036932
Machine-op-inspct | 5.586592 | 9.552557
Other-service | 13.072626 | 19.637784
Priv-house-serv | 0.446927 | 0.958807
Prof-specialty | 19.664804 | 8.025568
Protective-serv | 1.564246 | 3.480114


statistically significant: ('Asian-Pac-Islander', 'Other')
p-value is 1.10595148479e-11


occupation | Asian-Pac-Islander | Other
--- | --- | ---
Adm-clerical | 14.972067 | 9.95671
Craft-repair | 9.273743 | 10.822511
Exec-managerial | 13.519553 | 4.761905
Farming-fishing | 1.787709 | 4.761905
Handlers-cleaners | 2.458101 | 4.761905
Machine-op-inspct | 5.586592 | 16.883117
Other-service | 13.072626 | 16.017316
Priv-house-serv | 0.446927 | 1.298701
Prof-specialty | 19.664804 | 12.121212
Protective-serv | 1.564246 | 2.164502


statistically significant: ('Asian-Pac-Islander', 'White')
p-value is 7.82321814761e-15


occupation | Asian-Pac-Islander | White
--- | --- | ---
Adm-clerical | 14.972067 | 11.775823
Craft-repair | 9.273743 | 14.055388
Exec-managerial | 13.519553 | 13.866389
Farming-fishing | 1.787709 | 3.50999
Handlers-cleaners | 2.458101 | 4.339273
Machine-op-inspct | 5.586592 | 6.128982
Other-service | 13.072626 | 9.53483
Priv-house-serv | 0.446927 | 0.420427
Prof-specialty | 19.664804 | 13.789246
Protective-serv | 1.564246 | 2.001851


statistically significant: ('Black', 'Other')
p-value is 1.08587334985e-05


occupation | Black | Other
--- | --- | ---
Adm-clerical | 17.045455 | 9.95671
Craft-repair | 8.309659 | 10.822511
Exec-managerial | 8.34517 | 4.761905
Farming-fishing | 1.491477 | 4.761905
Handlers-cleaners | 6.036932 | 4.761905
Machine-op-inspct | 9.552557 | 16.883117
Other-service | 19.637784 | 16.017316
Priv-house-serv | 0.958807 | 1.298701
Prof-specialty | 8.025568 | 12.121212
Protective-serv | 3.480114 | 2.164502


statistically significant: ('Black', 'White')
p-value is 1.10592086605e-131


occupation | Black | White
--- | --- | ---
Adm-clerical | 17.039404 | 11.772645
Armed-Forces | 0.035499 | 0.026993
Craft-repair | 8.306709 | 14.051594
Exec-managerial | 8.342208 | 13.862646
Farming-fishing | 1.490948 | 3.509043
Handlers-cleaners | 6.034789 | 4.338102
Machine-op-inspct | 9.549166 | 6.127328
Other-service | 19.630813 | 9.532256
Priv-house-serv | 0.958466 | 0.420314
Prof-specialty | 8.022719 | 13.785524


statistically significant: ('Other', 'White')
p-value is 8.67474040807e-12


occupation | Other | White
--- | --- | ---
Adm-clerical | 9.95671 | 11.775823
Craft-repair | 10.822511 | 14.055388
Exec-managerial | 4.761905 | 13.866389
Farming-fishing | 4.761905 | 3.50999
Handlers-cleaners | 4.761905 | 4.339273
Machine-op-inspct | 16.883117 | 6.128982
Other-service | 16.017316 | 9.53483
Priv-house-serv | 1.298701 | 0.420427
Prof-specialty | 12.121212 | 13.789246
Protective-serv | 2.164502 | 2.001851


With more Chi-Square Tests, we reach a few additional interesting results:
* In the Chi-Square Test between race and sex, we use a table to show the male/female proportions of each pair of ethnic groups. The Black population differs significantly from the other races in that it has a **50:50 male to female labor participation ratio**, while for the other races the ratio is around 60:40 or 70:30.
* In the Chi-Square Test between occupation and education, the p-value is 0, which shows a strong association between the two and supports treating education level as a confounding factor in our discussion of sex/race versus occupation. Hence, **we cannot assert any causality between sex (or race) and occupation**.

## 3. Analysis of Sex/Race versus Education
The third notebook delves deeper into education level, the confounding factor identified in notebook two. We now explore the relationship between sex (or race) and education. Boxplots are used for this section of our analysis, because education can be measured quantitatively as a number of years.

The following two graphs show the summary statistics of the education level (in number of years) by sex and race, respectively. Our observations are as follows:

* The summary statistics of males' and females' education levels look quite similar. The only noticeable difference is that the upper quartile for males is one year higher than that for females.
* The summary statistics by race show more remarkable features. The White and Asian/Pacific Islander groups evidently have a greater median and upper quartile than the Black, Native American/Eskimo, and Other groups. The result is not surprising, and unfortunately not much has changed since 1994. 
* The Asian/Pacific Islander group has a high median education level, even when compared to the White group. This may reflect the tendency of Asian immigrants, many of whom come from cultures that place a high value on education, to have a higher education level than other ethnic groups.

![title](fig/education_sex.png)

![title](fig/education_race.png)

The **Two Sample t-test** is used in this section to quantitatively validate the conjectures drawn from the boxplots. Our conclusions are printed as the output of the following cell:

In [8]:
from itertools import combinations
%run -i 'functions/two_sample_t_test.py'
male = clean_adult[clean_adult["sex"] == "Male"]
female = clean_adult[clean_adult["sex"] == "Female"]
t, p, reject = two_sample_t_test(male["education.num"], female["education.num"], "Male", "Female")

races = clean_adult.groupby("race")
for race1_name, race2_name in combinations(races.groups.keys(), 2):
    race1 = races.get_group(race1_name)
    race2 = races.get_group(race2_name)
    two_sample_t_test(race1["education.num"], race2["education.num"], race1_name, race2_name)

There is no statistically significant difference between Group Male and Group Female

The mean difference is statistically significant for Group Amer-Indian-Eskimo and Group Asian-Pac-Islander
p-value is 3.13492993644e-23

There is no statistically significant difference between Group Amer-Indian-Eskimo and Group Black

The mean difference is statistically significant for Group Amer-Indian-Eskimo and Group Other
p-value is 0.00709780510747

The mean difference is statistically significant for Group Amer-Indian-Eskimo and Group White
p-value is 1.80818942375e-09

The mean difference is statistically significant for Group Asian-Pac-Islander and Group Black
p-value is 5.2608000932e-44

The mean difference is statistically significant for Group Asian-Pac-Islander and Group Other
p-value is 3.88949414009e-21

The mean difference is statistically significant for Group Asian-Pac-Islander and Group White
p-value is 2.62019447341e-18

The mean difference is statistically significant for Group B
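
For readers without access to `functions/two_sample_t_test.py`, a two-sample t-test can be sketched with `scipy.stats.ttest_ind`. The sketch below is an assumption-laden stand-in, not the authors' function: it uses Welch's variant (`equal_var=False`), a significance level of 0.05, a hypothetical helper name `welch_t_test`, and synthetic samples.

```python
import numpy as np
from scipy.stats import ttest_ind


def welch_t_test(sample_a, sample_b, alpha=0.05):
    """Two-sample t-test that does not assume equal variances."""
    t_stat, p_value = ttest_ind(sample_a, sample_b, equal_var=False)
    # Reject the null hypothesis of equal means when p is below alpha.
    return t_stat, p_value, bool(p_value < alpha)


# Synthetic "years of education" for two groups whose means differ by 1.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.5, size=500)
group_b = rng.normal(loc=11.0, scale=2.5, size=500)

t_stat, p_value, reject = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, reject = {reject}")

# Comparing a sample with itself gives no evidence of a difference.
t_same, p_same, reject_same = welch_t_test(group_a, group_a)
```

Welch's variant is chosen here because the groups in the report differ greatly in size, which makes an equal-variance assumption harder to justify.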

## 4. Analysis of different variables and their relationships with Hours Per Week
The fourth notebook examines the pairwise relationships between a few attributes (e.g. age, marital status, and education) and the number of working hours per week. Regression is used in this section of our analysis, given that we can now display two quantitative variables on a single scatterplot.

The following scatterplot shows the relationship between age and number of working hours per week, which are almost uncorrelated. Intuitively, we expected a slightly negative correlation, given that younger people have the capacity and motivation to work longer hours than older people. The fitted linear regression instead shows a slightly positive slope, which is counterintuitive, and we certainly do not conclude that the elderly work the most. Linear regression may simply not be the best method in this case: it is probable that there is a **downward quadratic relationship**, in which people in middle age work the longest hours. 

![title](fig/hours_age.png)
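
The difference between a linear and a downward quadratic fit can be illustrated on synthetic data. The sketch below does not use the report's data: the ages, the peak at 45 and the noise level are all made up, and `r_squared` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: hours peak in middle age and fall off on both sides.
age = rng.uniform(17, 90, size=2000)
hours = 45 - 0.01 * (age - 45) ** 2 + rng.normal(0, 5, size=age.size)


def r_squared(y, y_hat):
    """Fraction of the variance of y explained by the fitted values."""
    residual = np.sum((y - y_hat) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return 1 - residual / total


# Least-squares polynomial fits of degree 1 and degree 2.
linear = np.polyfit(age, hours, deg=1)
quadratic = np.polyfit(age, hours, deg=2)

r2_linear = r_squared(hours, np.polyval(linear, age))
r2_quadratic = r_squared(hours, np.polyval(quadratic, age))

print(f"linear slope: {linear[0]:.4f}, R^2 = {r2_linear:.3f}")
print(f"quadratic leading coefficient: {quadratic[0]:.4f}, R^2 = {r2_quadratic:.3f}")
```

A negative leading coefficient in the quadratic fit corresponds to the downward-opening shape, and comparing the two R² values gives a rough sense of how much the extra term helps.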

We also notice that although the average number of working hours in the entire cleaned dataset is a reasonable 41, some people overwork (work more than 60 hours a week). Hence, we extract the overworked individuals from the dataset and compare them with the overall population. Our conclusions are as follows:
* For the overall dataset, there is a clear disparity in working hours among individuals of different marital status. Widowed individuals work the fewest hours. We thought that married individuals would work fewer hours than non-married individuals because of their family commitments, but the reverse is true. One possible explanation is that married individuals need to work more to support their families.
* For the overworked individuals, there is no evident relationship between hours worked and marital status. Those in the armed-forces category have the highest average working hours, which is not surprising. Serving our country is a hard job. 
* The two sample t-test results for the different marital statuses are listed comprehensively in notebook 4. As the following boxplots suggest, **more** pairs of marital statuses show a **statistically significant difference** in the **overall** group than in the **overworked** group.
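
Extracting the overworked subset and comparing group means can be sketched as follows. The data is made up, and the column name `marital.status` is an assumption modelled on the `hours.per.week` column shown earlier.

```python
import pandas as pd

# Toy data (made up; the column name "marital.status" is an assumption).
toy = pd.DataFrame({
    "marital.status": ["Married", "Married", "Never-married",
                       "Never-married", "Widowed", "Widowed"],
    "hours.per.week": [45, 70, 40, 65, 20, 30],
})

# Overworked individuals: more than 60 working hours per week.
overworked = toy[toy["hours.per.week"] > 60]

# Average weekly hours per marital status, overall and among the overworked.
mean_all = toy.groupby("marital.status")["hours.per.week"].mean()
mean_overworked = overworked.groupby("marital.status")["hours.per.week"].mean()

print(mean_all)
print(mean_overworked)
```

Groups with no overworked members simply drop out of the second result, which is one reason fewer pairwise comparisons are possible in the overworked subset.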

![title](fig/marital_hours.png)

As supported by the two sample t-tests, average working hours depend a lot on **occupation**. 

Overall, farmers and fishers seem to work the most. This is not that surprising: farming and fishing are hard work and take a long time. We need to be more appreciative of these occupations for their hard work. On the other hand, people in other services and private house services work the least. 

For overworked individuals, farmers/fishers and armed forces still work the most, while those in transportation/moving, sales, and tech-support also work quite a lot. 

![title](fig/hours_occupation.png)

## 5. Forecasting of income

## Appendix: Author Contribution statement
 
 Work | Brian Lin | Julien Yu |  Aaron Ou
 --- | --- | --- | ---
 Conception of idea | Y | Y | Y
 Data finding & cleansing | Y | Y | Y
 Coding | Y | Y | Y
 Analysis of data | nb1-4 | 5&main | NA
 Summary | Y | Y | Y
 Testing | Y | Y | Y
 Host via Sphinx | NA | NA | Y

**Breakdown of Author Contributions**

`Conception of idea`: Came up with the idea of exploring the relationships between different explanatory variables, using scatterplots, boxplots, bar charts and regression methods for different circumstances, etc.

`Data finding & cleansing`: Looked for datasets, reached a consensus on using the 1994 demographics dataset from UCI, made a cleaned version of the dataset.

`Coding`: Wrote and ran the code, generated and saved the graphs.

`Analysis of data`: Analyzed the numerical and graphic results in different notebooks. 

`Summary`: Contributed to the README file and the main notebook.

`Testing`: Wrote and ran the tests, tested the validity of the code on different computers.

`Host via Sphinx`: Created and hosted the website. 