# Demographic Data Analyzer

In this challenge I analyze demographic data using pandas. The dataset was extracted from the 1994 Census database. Here is a sample of what the data looks like:

In [1]:
# Importing pandas Library
import pandas as pd

In [2]:
# Loading the data with the pandas read_csv function
data = pd.read_csv('adult.data.csv')

In [141]:
data.head(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [28]:
# Checking the Size and Type of our data before moving onto the analysis stage 

print(f'No of Columns: {data.shape[1]} \nNo of Rows: {data.shape[0]}')
print(data.dtypes)

No of Columns: 15 
No of Rows: 32561
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object


In [39]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


Now that we know the size, column types, and summary statistics of the data set, let's begin answering some questions!
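
One caveat before the questions: the `native-country` counts further down include a `?` entry, so at least one text column uses `?` as a placeholder for missing values. Neither `dtypes` nor `describe()` reveals this. The sketch below shows one way to count such placeholders per column. It runs on a small made-up frame (`toy` is a hypothetical stand-in, not the census data), so the numbers it prints say nothing about the real file.

```python
import pandas as pd

# Toy frame standing in for the census data; all values are made up.
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov', 'Private'],
    'native-country': ['United-States', '?', '?', 'India'],
    'age': [39, 50, 38, 53],
})

# Compare every cell with the '?' placeholder and count the matches per column.
placeholder_counts = (toy == '?').sum()
print(placeholder_counts)
```

If the placeholders should be treated as missing values, `pd.read_csv` accepts an `na_values` argument (for example `na_values='?'`); whether that is appropriate depends on how the later questions should count those rows.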

##### How many people of each race are represented in this dataset?

In [38]:
race_count = data['race'].value_counts()
race_count

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

##### What is the average age of men and women?

In [88]:
# Mean age per sex, rounded to the nearest year, with a readable index label
avg_age_by_sex = data.groupby('sex')['age'].mean().round().add_prefix('Avg age of ')
avg_age_by_sex

sex
Avg age of Female    37.0
Avg age of Male      39.0
Name: age, dtype: float64

##### What is the percentage of people who have a Bachelor's degree?

In [380]:
bachelors_count = data['education'].where(data['education'] == 'Bachelors').value_counts()
bachelors_perc = (bachelors_count / len(data)) * 100
round(bachelors_perc)



Bachelors    16.0
Name: education, dtype: float64
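
A shorter route to the same kind of percentage is to take the mean of a boolean mask, since the mean of a boolean Series is the fraction of `True` values. The sketch below uses a made-up `toy` frame (hypothetical, for illustration only) rather than the census data.

```python
import pandas as pd

# Made-up education values, for illustration only.
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'Bachelors', '11th'],
})

# True where the row has a Bachelor's degree; the mean is the share of True values.
bachelors_share = (toy['education'] == 'Bachelors').mean() * 100
print(round(bachelors_share, 1))
```

Applied to the real frame, the same expression would be `(data['education'] == 'Bachelors').mean() * 100`, which returns a single number instead of a one-row Series.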

##### What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?

In [439]:
advance_edu = data[data['education'].isin(['Bachelors','Masters','Doctorate'])]
advance_edu_per = round(len(advance_edu[advance_edu['salary'] == '>50K']) / len(advance_edu) * 100,1)



print(f"Approximately {advance_edu_per} percent of people with advanced education earn more than 50K")

Approximately 46.5 percent of people with advanced education earn more than 50K


##### What percentage of people without advanced education make more than 50K?

In [441]:
lower_edu = data.loc[~data['education'].isin(['Bachelors','Masters','Doctorate'])]
lower_edu_per = round(len(lower_edu[lower_edu['salary'] == '>50K']) / len(lower_edu) * 100, 1)


print(f"Unsurprisingly, only about {lower_edu_per}% of people without advanced education earn more than 50K")

Unsurprisingly, only about 17.4% of people without advanced education earn more than 50K
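
The two education questions repeat the same calculation on complementary subsets, so they can share one helper. The sketch below is one possible way to factor that out. The function name `share_over_50k` and the `toy` frame are hypothetical and the values are made up.

```python
import pandas as pd

def share_over_50k(frame, mask):
    """Return the percentage of rows selected by `mask` whose salary is '>50K'."""
    subset = frame[mask]
    return round((subset['salary'] == '>50K').mean() * 100, 1)

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', '11th', 'Doctorate', 'HS-grad'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
})

# One mask for advanced education; its negation covers everyone else.
advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
higher_pct = share_over_50k(toy, advanced)
lower_pct = share_over_50k(toy, ~advanced)
print(higher_pct, lower_pct)
```

Building the mask once and negating it with `~` keeps the two groups complementary by construction.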


##### What is the minimum number of hours a person works per week?

In [445]:
min_hours = data['hours-per-week'].min()
min_hours

1

##### What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [450]:
# Reuse min_hours from the previous cell instead of hard-coding the value
min_work_hours = data[data['hours-per-week'] == min_hours]
min_work_hours_per = round(len(min_work_hours[min_work_hours['salary'] == '>50K']) / len(min_work_hours) * 100)
print(f'About {min_work_hours_per}% of people who work the minimum number of hours per week earn more than 50K')

About 10% of people who work the minimum number of hours per week earn more than 50K


In [456]:
# Number of people earning more than 50K in each country (index: country)
popOver50k = data[data['salary'] == '>50K']['native-country'].value_counts()

# Total number of people in each country (index: country)
popTotal = data['native-country'].value_counts()

# Fraction of people in each country earning more than 50K (index: country)
perOver50k = popOver50k / popTotal

popOver50k

United-States         7171
?                      146
Philippines             61
Germany                 44
India                   40
Canada                  39
Mexico                  33
England                 30
Italy                   25
Cuba                    25
Japan                   24
Taiwan                  20
China                   20
Iran                    18
South                   16
Puerto-Rico             12
Poland                  12
France                  12
Jamaica                 10
El-Salvador              9
Greece                   8
Cambodia                 7
Hong                     6
Yugoslavia               6
Ireland                  5
Vietnam                  5
Portugal                 4
Haiti                    4
Ecuador                  4
Thailand                 3
Hungary                  3
Guatemala                3
Scotland                 3
Nicaragua                2
Trinadad&Tobago          2
Laos                     2
Columbia                 2
D

##### What country has the highest percentage of people that earn >50K and what is that percentage?

In [467]:
over50k = data['native-country'].where(data['salary'] == '>50K').value_counts()
totalpop = data['native-country'].value_counts()
salary_per_country = (over50k / totalpop) * 100
# Convert the Series to a DataFrame with a descriptive column name
salary_per_country = salary_per_country.to_frame(name='Per of Salary >50K')

salary_per_country.head()


| | Per of Salary >50K |
|---|---|
| ? | 25.042882 |
| Cambodia | 36.842105 |
| Canada | 32.231405 |
| China | 26.666667 |
| Columbia | 3.389831 |
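
The first five rows do not yet name the country with the highest percentage. One way to pull out both the country and its percentage is `idxmax()` together with `max()`. The sketch below uses a made-up `toy` frame with placeholder country names `A`, `B`, and `C` (all hypothetical), so its output says nothing about the census data.

```python
import pandas as pd

# Made-up rows with placeholder country names, for illustration only.
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'B', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '>50K', '>50K', '<=50K', '<=50K'],
})

# Share of >50K earners per country: the mean of a boolean mask within each group.
is_over_50k = toy['salary'] == '>50K'
pct_by_country = is_over_50k.groupby(toy['native-country']).mean() * 100

# idxmax() returns the index label (the country) of the largest value.
top_country = pct_by_country.idxmax()
top_pct = round(pct_by_country.max(), 1)
print(top_country, top_pct)
```

Grouping the boolean mask also gives countries with no >50K earners a value of 0 rather than `NaN`, which is what dividing two `value_counts()` results produces for labels missing from the numerator.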


##### Identify the most popular occupation for those who earn >50K in India.

In [475]:
salover50k_india = data.loc[(data['salary'] == '>50K') & (data['native-country'] == 'India')]
salover50k_india['occupation'].value_counts()


Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Transport-moving     1
Sales                1
Adm-clerical         1
Name: occupation, dtype: int64
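
In the counts above, `Prof-specialty` leads with 25 people. To get the top label as a single value rather than reading it off the table, `value_counts().idxmax()` is one option. The sketch below runs on a made-up `toy` frame (hypothetical, not the census data).

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'Cuba', 'India'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'occupation': ['Prof-specialty', 'Sales', 'Prof-specialty', 'Sales', 'Sales'],
})

# Keep only high earners from India, then take the most frequent occupation.
high_earners_india = toy[(toy['salary'] == '>50K') & (toy['native-country'] == 'India')]
top_occupation = high_earners_india['occupation'].value_counts().idxmax()
print(top_occupation)
```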