# CMPSC 448: Homework #1
# Exploratory Data Analysis with `pandas`

## Objectives

In this assignment, you will analyze the UCI Adult data set, which contains demographic information about US residents. The data was extracted from the Census Bureau database found at

http://www.census.gov/ftp/pub/DES/www/welcome.html

The features of the data set and the possible values of each feature are listed below:

| Feature Name| Possible Values  |
|------|------|
| age | continuous|
| workclass| Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
| fnlwgt| continuous|
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|education-num | continuous|
|marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
|race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|sex | Female, Male|
|capital-gain| continuous|
|capital-loss | continuous|
|hours-per-week | continuous |
|native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
|salary | >50K,<=50K |


Please complete the tasks in this Jupyter notebook by answering the following 8 questions.

In [33]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
# to draw pictures in jupyter notebook
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


In [34]:
data = pd.read_csv('adult.data.csv')
print("\n".join(data.columns))

age
 workclass
 fnlwgt
 education
 education-num
 marital-status
 occupation
 relationship
 race
 sex
 capital-gain
 capital-loss
 hours-per-week
 native-country
 salary
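Every column name after `age` in the listing above starts with a space, which is why the answers below index columns as `' sex'` or `' salary'`. As a minimal sketch, assuming the file uses `", "` as its separator, the `skipinitialspace` option of `pd.read_csv` can remove that blank at load time. The three-row sample is hypothetical and only mimics the file's layout.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the ", " separators of adult.data.csv.
sample_csv = io.StringIO(
    "age, workclass, sex, salary\n"
    "39, State-gov, Male, <=50K\n"
    "50, Self-emp-not-inc, Male, <=50K\n"
    "28, Private, Female, >50K\n"
)

# skipinitialspace=True skips the blank that follows each delimiter,
# so neither the column names nor the string values keep a leading space.
clean = pd.read_csv(sample_csv, skipinitialspace=True)

print(list(clean.columns))
print(clean["sex"].value_counts())
```

The answers in this notebook keep the default loading, so they continue to use the names with the leading space.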


In [35]:
data.shape

(32561, 15)

In [36]:
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### 1. How many men and women (sex feature) are represented in this dataset?

In [28]:
# Your answer (code + results)
# `data` was loaded in the setup cells above, so the imports are not repeated here.
print("Men and women represented in this dataset:")
data[' sex'].value_counts()

Men and women represented in this dataset:


 Male      21790
 Female    10771
Name:  sex, dtype: int64

### 2. What is the average age (age feature) of women?

In [20]:
# Your answer (code + results)
# Mean age per sex; `data` is the DataFrame loaded in the setup cells above.
mean_age_by_sex = data.groupby(' sex')['age'].mean()
print("Average Women Age:", mean_age_by_sex[' Female'])

Average Women Age: 36.85823043357163


### 3. What is the percentage of German citizens (native-country feature)?


In [19]:
# Your answer (code + results)
# Rows whose native-country is Germany, as a share of all rows.
ger = (data[' native-country'] == ' Germany').sum()
cnt = data[' native-country'].count()
per = (ger / cnt) * 100

print("Percentage of German citizens:", per)


Percentage of German citizens: 0.42074874850281013
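A share like the one above can also be computed without dividing two counts by hand. The sketch below, on a hypothetical four-row column, shows two equivalent options: the mean of a boolean mask, and `value_counts(normalize=True)`, which returns the share of every category at once.

```python
import pandas as pd

# Hypothetical toy column; values keep the leading space seen in the real file.
countries = pd.Series(
    [" Germany", " United-States", " United-States", " Japan"],
    name=" native-country",
)

# The mean of a boolean mask is the fraction of True values.
pct_mask = (countries == " Germany").mean() * 100

# value_counts(normalize=True) gives relative frequencies for all categories.
pct_counts = countries.value_counts(normalize=True)[" Germany"] * 100

print(pct_mask, pct_counts)
```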


###  4. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

In [18]:
# Your answer (code + results)
more = data[data[' salary'] == ' >50K']
less = data[data[' salary'] == ' <=50K']

print("Mean of age for those who earn more than 50K per year:", more.age.mean())
print("Mean of age for those who earn less than 50K per year:", less.age.mean())

print("Standard deviation of age for those who earn more than 50K per year:", more.age.std())
print("Standard deviation of age for those who earn less than 50K per year:", less.age.std())

Mean of age for those who earn more than 50K per year: 44.24984058155847
Mean of age for those who earn less than 50K per year: 36.78373786407767
Standard deviation of age for those who earn more than 50K per year: 10.51902771985177
Standard deviation of age for those who earn less than 50K per year: 14.020088490824813
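The same four numbers can be produced in one step by grouping on the salary column and aggregating. This is a minimal sketch on a hypothetical five-row frame, not the notebook's data. Note that `std` in pandas is the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as the real file.
toy = pd.DataFrame({
    "age": [30, 40, 50, 20, 60],
    " salary": [" >50K", " >50K", " >50K", " <=50K", " <=50K"],
})

# One row per salary group, one column per statistic.
age_stats = toy.groupby(" salary")["age"].agg(["mean", "std"])
print(age_stats)
```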


### 5. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

In [11]:
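# Your answer (code + results)

# Education levels below a high school diploma (values keep the file's leading space).
below_hs = [' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th']
more = data[data[' salary'] == ' >50K']
ed = more[more[' education'].isin(below_hs)]

# ed.empty is True only if no high earner falls below high school level.
print("Is it true that people who earn more than 50K have at least high school education?:", ed.empty)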

Is it true that people who earn more than 50K have at least high school education?: False
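The check can also be phrased on the numeric `education-num` column rather than on the education labels. The sketch below assumes that `education-num` is ordered by level and that 9 corresponds to `HS-grad`, as the `HS-grad` row in the `data.head()` output suggests. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the real column names carry the same leading space.
toy = pd.DataFrame({
    " education-num": [13, 9, 5, 7, 14],
    " salary": [" >50K", " >50K", " >50K", " <=50K", " <=50K"],
})

HS_GRAD_NUM = 9  # assumption: education-num of HS-grad, read off data.head()

high_earners = toy[toy[" salary"] == " >50K"]

# True only if every high earner reached at least the HS-grad level.
all_hs_or_above = (high_earners[" education-num"] >= HS_GRAD_NUM).all()
print(all_hs_or_above)
```

In this toy frame one high earner has `education-num` 5, so the check prints `False`.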


### 6.  Display age statistics for each race (race feature) and each gender (sex feature). 

Hint: Use `groupby()` and `describe()` functions of DataFrame. Find the maximum age of men of Amer-Indian-Eskimo race.

In [10]:
# Your answer (code + results)
# Age statistics for every (race, sex) pair; `data` was loaded in the setup cells above.
grp = data.groupby([' race', ' sex'])
print(grp['age'].describe())

# Maximum age among Amer-Indian-Eskimo men.
eskmen = data[(data[' race'] == ' Amer-Indian-Eskimo') & (data[' sex'] == ' Male')]
print("Maximum age of men of Amer-Indian-Eskimo:", eskmen['age'].max())

                               count       mean        std   min   25%   50%  \
 race                sex                                                       
 Amer-Indian-Eskimo  Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                     Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
 Asian-Pac-Islander  Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                     Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
 Black               Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                     Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
 Other               Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                     Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
 White               Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                     Male    19174.0  39.652498  13.436029  17.0  29.0  38.0   

                               75%   ma
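The maximum that the hint asks for can also be read directly from the `describe()` table, because the grouped result is indexed by `(race, sex)` pairs. A minimal sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Hypothetical toy frame using the same column names as the real file.
toy = pd.DataFrame({
    " race": [" White", " White", " Black", " Black", " Black"],
    " sex": [" Male", " Female", " Male", " Male", " Female"],
    "age": [25, 40, 33, 61, 47],
})

# describe() on a grouped column returns one row per (race, sex) pair.
stats = toy.groupby([" race", " sex"])["age"].describe()

# Select a single cell with a (row tuple, column label) lookup.
oldest = stats.loc[(" Black", " Male"), "max"]
print(oldest)
```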

### 7. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?


In [26]:
# Your answer (code + results)
# `data` was loaded in the setup cells above.
hrs = data[' hours-per-week'].max()
print("Max Hours a person works per week:", hrs)

# Everyone who works the maximum number of hours (no hard-coded 99).
cnt = data[data[' hours-per-week'] == hrs]
lppl = len(cnt)
lpplhigh = (cnt[' salary'] == ' >50K').sum()

print("People who worked", hrs, "hours:", lppl)
per = (lpplhigh / lppl) * 100
print("Percentage who earn a lot:", per)

Max Hours a person works per week: 99
People who worked 99 hours: 85
Percentage who earn a lot: 29.411764705882355


### 8. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

In [22]:
# Your answer (code + results)
little = data[data[' salary'] == ' <=50K']
lot = data[data[' salary'] == ' >50K']
countryli = little.groupby(' native-country')
countrylo = lot.groupby(' native-country')
print("Average time of work for those who earn little:", countryli[' hours-per-week'].mean())
print("Average time of work for those who earn a lot:", countrylo[' hours-per-week'].mean())

japanli = little[little[' native-country'] == ' Japan']
japanlo = lot[lot[' native-country'] == ' Japan']
print("Japan average time of work for those who earn little:", japanli[' hours-per-week'].mean())
print("Japan average time of work for those who earn a lot:", japanlo[' hours-per-week'].mean())

Average time of work for those who earn little:  native-country
 ?                             40.164760
 Cambodia                      41.416667
 Canada                        37.914634
 China                         37.381818
 Columbia                      38.684211
 Cuba                          37.985714
 Dominican-Republic            42.338235
 Ecuador                       38.041667
 El-Salvador                   36.030928
 England                       40.483333
 France                        41.058824
 Germany                       39.139785
 Greece                        41.809524
 Guatemala                     39.360656
 Haiti                         36.325000
 Holand-Netherlands            40.000000
 Honduras                      34.333333
 Hong                          39.142857
 Hungary                       31.300000
 India                         38.233333
 Iran                          41.440000
 Ireland                       40.947368
 Italy                         39.