# Demographic Data Analyzer

You must use Pandas to answer the following questions:

How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

What is the average age of men?

What is the percentage of people who have a Bachelor's degree?

What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

What percentage of people without advanced education make more than 50K?

What is the minimum number of hours a person works per week?

What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

What country has the highest percentage of people that earn >50K and what is that percentage?

Identify the most popular occupation for those who earn >50K in India.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("adult.data.csv")
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [79]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64
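
A zero count from `isnull()` only rules out real `NaN` values. Some copies of this census data encode unknown entries as the string `"?"`; whether this particular file does is an assumption, so the sketch below uses a small made-up frame (`toy` is hypothetical, not part of the notebook) to show how such placeholders could be counted.

```python
import pandas as pd

# Hypothetical toy frame; "?" stands in for an unknown value (an assumption).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
})

# isnull() reports nothing because "?" is an ordinary string, not NaN.
nan_counts = toy.isnull().sum()

# Comparing against the placeholder reveals the hidden gaps per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Running the same comparison on `df` would show whether any column needs cleaning before the percentages are computed.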

In [3]:
#How many people of each race are represented in this dataset?
#This should be a Pandas series with race names 
#as the index labels. (race column)
df["race"].value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

In [4]:
#What is the average age of men?
#df.groupby('sex', as_index=False).age.mean()
# Exact match on 'Male'; a substring test is fragile if labels or casing change.
df.loc[df['sex'] == 'Male', 'age'].mean()

39.43354749885268
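
The commented-out `groupby` line in the cell above hints at an alternative that returns the mean age for every sex in one pass. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from adult.data.csv.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 40, 50, 20],
})

# One mean per group, indexed by the group label.
mean_age_by_sex = toy.groupby("sex")["age"].mean()

# Pick out the value for men by label.
men_avg = mean_age_by_sex["Male"]
print(men_avg)
```

This form is convenient when both averages are wanted; the boolean filter in the cell above is enough when only one is needed.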

In [61]:
#What is the percentage of people who have a Bachelor's degree?
# Count of people with a Bachelor's degree divided by the count of all people.
(df[df["education"]=="Bachelors"]["education"].count()/df["education"].count())*100

16.44605509658794
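
The same percentage can also be obtained without dividing two counts by hand, because the mean of a boolean Series is the fraction of `True` values. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from adult.data.csv.
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors"]})

# Fraction of rows equal to "Bachelors", scaled to a percentage.
bachelors_pct = (toy["education"] == "Bachelors").mean() * 100

# value_counts(normalize=True) gives the share of every category at once.
shares = toy["education"].value_counts(normalize=True) * 100
print(bachelors_pct, shares["Bachelors"])
```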

In [74]:
df['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

In [116]:
#What percentage of people with advanced education 
#(Bachelors, Masters, or Doctorate) make more than 50K?

#df[df["education"].isin(["Bachelors", "Masters", "Doctorate"])]

# Count of all ppl with education that makes >50k
#edu_higher50 = df['education'].loc[df['salary']=='>50K'].count()


# Rows where the person holds an advanced degree
adv_mask = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])

# Count of people who have an advanced degree
adv_deg = df.loc[adv_mask, ['education']].count()

# Count of people who have an advanced degree and earn more than 50K
adv_higher50 = df.loc[adv_mask & (df['salary'] == '>50K'), ['education']].count()

# Percentage of people with advanced education who make more than 50K
(adv_higher50/adv_deg)*100

education    46.535843
dtype: float64

In [126]:
#What percentage of people without 
#advanced education make more than 50K?

# Count of people who don't have an advanced degree
lower_deg = df["education"].count() - adv_deg

# Count of people who don't have an advanced degree and make >50K
low_higher50 = df["education"].loc[df['salary']=='>50K'].count() - adv_higher50

# Percentage of people without advanced education who make more than 50K
(low_higher50/lower_deg)*100

education    17.37136
dtype: float64
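
Both education percentages can be computed together by grouping a ">50K" flag by an "advanced education" flag. The sketch below is one possible alternative, shown on made-up data; the `toy` frame and the group labels `"advanced"` and `"other"` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from adult.data.csv.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary":    [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Label each row by education group (label names are illustrative).
is_advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
group = is_advanced.map({True: "advanced", False: "other"})

# True where the person earns more than 50K.
rich = toy["salary"] == ">50K"

# The mean of the boolean flag within each group is that group's share.
pct_by_group = rich.groupby(group).mean() * 100
print(pct_by_group)
```

This yields plain floats per group rather than one-element Series, which can be easier to round or compare.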

In [90]:
#What is the minimum number of hours a person works per week?
df['hours-per-week'].min()

1

In [135]:
#What percentage of the people who work the minimum number of hours per week 
#have a salary of more than 50K?

# Minimum number of hours worked per week (computed rather than hard-coded)
min_hours = df["hours-per-week"].min()

# Count of people who work the minimum number of hours per week
number_minimum_hours = df[df["hours-per-week"] == min_hours]["hours-per-week"].count()

# Count of people who work the minimum number of hours per week and earn more than 50K
minimum_higher50 = df[(df["hours-per-week"] == min_hours) & (df["salary"] == ">50K")]["hours-per-week"].count()

# Percentage of people who work the minimum number of hours per week and earn more than 50K
(minimum_higher50/number_minimum_hours)*100

10.0

In [141]:
#What country has the highest percentage of people 
#that earn >50K and what is that percentage?

# Percentage of people in each country who earn >50K
rich_pct_by_country = (df[df["salary"]==">50K"]["native-country"].value_counts()
                       /df["native-country"].value_counts()*100)

# Country with the highest percentage of >50K earners
highest_earning_country = rich_pct_by_country.idxmax()

print(highest_earning_country)

# Percentage of people in that country who earn >50K
rich_pct_by_country.max()

Iran


41.86046511627907
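
Dividing two `value_counts()` results leaves `NaN` for any country that has no >50K earners, because that label is missing from the numerator. A grouped mean avoids this and reports 0 instead. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from adult.data.csv.
toy = pd.DataFrame({
    "native-country": ["Iran", "Iran", "Cuba", "Cuba", "Cuba", "India"],
    "salary":         [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Share of >50K earners per country; countries with none get 0.0, not NaN.
rich_share = (toy["salary"] == ">50K").groupby(toy["native-country"]).mean() * 100

# Label and value of the largest share.
top_country = rich_share.idxmax()
top_pct = rich_share.max()
print(top_country, top_pct)
```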

In [146]:
#Identify the most popular occupation for those who earn >50K in India.
# Keep only rows that are both >50K and from India, then take the most frequent occupation.
(df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation']).value_counts().idxmax()

'Prof-specialty'