![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
# Problem Statement: 
#### In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.
You must use Pandas to answer the following questions:

1. What is the average age of men?
2. What is the percentage of people who have a Bachelor's degree?
3. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
4. What percentage of people without advanced education make more than 50K?
5. What is the minimum number of hours a person works per week?
6. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
7. What country has the highest percentage of people that earn >50K and what is that percentage?
8. Identify the most popular occupation for those who earn >50K in India.
9. How many people of each race are represented in this dataset?

Round all decimals to the nearest tenth.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Solution:

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/adults_data.csv')
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [4]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

1. The average age of males is...

In [5]:
#answer
males = (data['sex'] == 'Male') 
(data['age'].loc[males].mean()).round(2)

39.43

2. The percentage of people who have a Bachelor's degree is...

Let us see all the values in the `education` column with their counts.

In [6]:
data['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

In [7]:
data.index

RangeIndex(start=0, stop=32561, step=1)

In [8]:
#answer
b_degree = (data['education'] =='Bachelors')
((data['education'].loc[b_degree].count()/32561)*100).round(2)

16.45
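The same kind of share can be computed without a hard-coded row count. The following is a minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical, not the census data): `value_counts(normalize=True)` returns each value's fraction of the rows, which can then be scaled to a percentage.

```python
import pandas as pd

# Hypothetical toy data, not the census file.
toy = pd.DataFrame(
    {'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad']}
)

# normalize=True turns the counts into fractions of the total row count.
shares = toy['education'].value_counts(normalize=True) * 100

# Look up one category by its label and round for display.
bachelors_pct = round(shares['Bachelors'], 2)
print(bachelors_pct)
```

Because the denominator comes from the data itself, the expression keeps working if rows are added or dropped.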

3. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

* Let us first see the values in `salary` column with their total counts.

In [9]:
data['salary'].value_counts()

<=50K    24720
>50K      7841
Name: salary, dtype: int64

* Now we can filter our data as per the given condition: `education` is *Bachelors*, *Masters*, or *Doctorate* and `salary` is *>50K*.

In [10]:
adv_ed = (((data['education'] == 'Bachelors') | (data['education'] == 'Masters')|(data['education'] == 'Doctorate' )) & (data['salary'] == '>50K'))
data['salary'].loc[adv_ed].count()

3486

* The question asks for a share *within* the advanced-education group, so the denominator must be the size of that group, not the size of the whole dataset.
- we get it as **has_adv_ed.sum()**, to avoid adding the counts by hand.

In [11]:
has_adv_ed = data['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
has_adv_ed.sum()

7491

In [12]:
#answer
((data['salary'].loc[adv_ed].count()/has_adv_ed.sum())*100).round(2)

46.54
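A percentage "of people with advanced education" is conditional on that group, so the denominator is the group size rather than the whole table. The following is a minimal sketch on a made-up frame (`toy`, `advanced`, and `rich` are hypothetical names, not part of the notebook): the mean of a boolean Series is the fraction of `True` values, so indexing the salary mask by the group mask gives the share inside the group.

```python
import pandas as pd

# Hypothetical toy data, not the census file.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', 'HS-grad', 'Doctorate', '11th'],
    'salary':    ['>50K',      '<=50K',   '<=50K',   '>50K',    '>50K',      '<=50K'],
})

advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
rich = toy['salary'] == '>50K'

# Restrict `rich` to one group, then take the mean of the booleans:
# that is (rich people in the group) / (people in the group).
pct_advanced_rich = round(rich[advanced].mean() * 100, 2)
pct_other_rich = round(rich[~advanced].mean() * 100, 2)
print(pct_advanced_rich, pct_other_rich)
```

The two percentages need not sum to 100, since each one is measured against a different group.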

4. What percentage of people without advanced education make more than 50K?

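In [13]:
not_adv_ed = (~((data['education'] == 'Bachelors') | (data['education'] == 'Masters')|(data['education'] == 'Doctorate' )) & (data['salary'] == '>50K'))
data['salary'].loc[not_adv_ed].count()

4355

In [14]:
#answer
# Divide by the number of people without advanced education, not by the whole dataset.
lacks_adv_ed = ~data['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
((data['salary'].loc[not_adv_ed].count()/lacks_adv_ed.sum())*100).round(2)

17.37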

5. What is the minimum number of hours a person works per week?

* First we'll find all the *unique* values in the `hours-per-week` column and then pick out the minimum value.

In [15]:
data['hours-per-week'].unique()

array([40, 13, 16, 45, 50, 80, 30, 35, 60, 20, 52, 44, 15, 25, 38, 43, 55,
       48, 58, 32, 70,  2, 22, 56, 41, 28, 36, 24, 46, 42, 12, 65,  1, 10,
       34, 75, 98, 33, 54,  8,  6, 64, 19, 18, 72,  5,  9, 47, 37, 21, 26,
       14,  4, 59,  7, 99, 53, 39, 62, 57, 78, 90, 66, 11, 49, 84,  3, 17,
       68, 27, 85, 31, 51, 77, 63, 23, 87, 88, 73, 89, 97, 94, 29, 96, 67,
       82, 86, 91, 81, 76, 92, 61, 74, 95], dtype=int64)

In [16]:
#answer
data['hours-per-week'].unique().min()

1

6. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [17]:
min_hr = ((data['hours-per-week'] == 1) & (data['salary'] == '>50K'))
data.loc[min_hr].count()

age               2
workclass         2
fnlwgt            2
education         2
education-num     2
marital-status    2
occupation        2
relationship      2
race              2
sex               2
capital-gain      2
capital-loss      2
hours-per-week    2
native-country    2
salary            2
dtype: int64

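In [18]:
#answer
# Divide by the number of people who work the minimum hours, not by the whole dataset.
works_min_hr = (data['hours-per-week'] == 1)
((data['salary'].loc[min_hr].count()/works_min_hr.sum())*100).round(2)

10.0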
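The minimum itself can be computed rather than typed in, and the share is then taken among the minimum-hours workers only. The following is a minimal sketch on made-up data (`toy`, `min_hours`, and `min_workers` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, not the census file.
toy = pd.DataFrame({
    'hours-per-week': [1, 40, 1, 20, 1, 1],
    'salary': ['>50K', '>50K', '<=50K', '<=50K', '<=50K', '<=50K'],
})

# Take the minimum from the column instead of hard-coding it.
min_hours = toy['hours-per-week'].min()

# Keep only the rows at the minimum, then measure the >50K share among them.
min_workers = toy[toy['hours-per-week'] == min_hours]
pct_min_rich = round((min_workers['salary'] == '>50K').mean() * 100, 2)
print(min_hours, pct_min_rich)
```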

7. What country has the highest percentage of people that earn >50K and what is that percentage? 

In [19]:
data['native-country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [20]:
cond = (data['salary'] == '>50K')

* The share is measured within each country: the number of >50K earners from a country divided by the number of people from that country.

In [21]:
countries50kplus = data['native-country'].loc[cond].value_counts()
countries50kplus.sum()

7841

In [22]:
#answer
pct_countries50k  = (countries50kplus/data['native-country'].value_counts())*100
pct_countries50k.idxmax()

'Iran'

In [23]:
pct_countries50k.sort_values(ascending=False).head(1).round(1)

Iran    41.9
Name: native-country, dtype: float64

In [24]:
#answer
pct_countries50k.max().round(2)

41.86
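A per-country percentage has to stay at or below 100, because each country's high earners are divided by that country's own head count. The following is a minimal sketch of one way to get that, assuming made-up data (`toy` and the country labels `A`, `B`, `C` are hypothetical): group the boolean salary mask by country and take the mean of each group.

```python
import pandas as pd

# Hypothetical toy data, not the census file.
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K', '<=50K'],
})

# Mean of the boolean mask per country = fraction of >50K earners in that country.
pct_rich_by_country = (
    (toy['salary'] == '>50K')
    .groupby(toy['native-country'])
    .mean()
    * 100
)

top_country = pct_rich_by_country.idxmax()        # label of the largest share
top_pct = round(pct_rich_by_country.max(), 1)     # the share itself
print(top_country, top_pct)
```

With this approach a country that has no >50K earners shows up as 0 rather than as a missing value.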

8. Identify the most popular occupation for those who earn >50K in India.

In [25]:
data['native-country'].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

In [26]:
data['native-country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [27]:
indian50kplus = ((data['native-country'] == 'India') & (data['salary'] == '>50K'))

In [28]:
indian50kplus_occupations = data['occupation'].loc[indian50kplus].value_counts()
indian50kplus_occupations

Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Transport-moving     1
Sales                1
Adm-clerical         1
Name: occupation, dtype: int64

In [29]:
#answer
indian50kplus_occupations.head(1)

Prof-specialty    25
Name: occupation, dtype: int64
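When only the occupation name is needed, rather than a one-row Series, the label can be pulled out directly. The following is a minimal sketch on a made-up Series (`occupations` is hypothetical): `idxmax()` returns the index label of the largest count.

```python
import pandas as pd

# Hypothetical toy data, not the census file.
occupations = pd.Series(
    ['Prof-specialty', 'Sales', 'Prof-specialty', 'Exec-managerial']
)

counts = occupations.value_counts()

# idxmax() gives the label (the occupation name), not the count.
top_occupation = counts.idxmax()
print(top_occupation)
```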

9. How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.

In [30]:
data['race'].value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

### Combining all results:

In [31]:
def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('data/adults_data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()
  

    # What is the average age of men?
    males = (df['sex'] == 'Male') 
    average_age_men = (df['age'].loc[males].mean()).round(2)

    # What is the percentage of people who have a Bachelor's degree?
    b_degree = (df['education'] =='Bachelors')

    percentage_bachelors = ((df['education'].loc[b_degree].count()/32561)*100).round(2)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    higher_education = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])

    # What percentage of people without advanced education make more than 50K?
    lower_education = ~higher_education

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    rich = (df['salary'] == '>50K')

    # percentage with salary >50K, measured within each education group
    higher_education_rich = (((higher_education & rich).sum()/higher_education.sum())*100).round(2)
    lower_education_rich = (((lower_education & rich).sum()/lower_education.sum())*100).round(2)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].unique().min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = (df['hours-per-week'] == min_work_hours)
    rich_percentage = (((num_min_workers & rich).sum()/num_min_workers.sum())*100).round(2)

    # What country has the highest percentage of people that earn >50K?
    # Divide the >50K earners from each country by that country's own head count.
    pct_countries50k = (df['native-country'].loc[rich].value_counts()/df['native-country'].value_counts())*100
    highest_earning_country = pct_countries50k.idxmax()

    highest_earning_country_percentage = pct_countries50k.max().round(2)

    # Identify the most popular occupation for those who earn >50K in India.

    indian50kplus = ((df['native-country'] == 'India') & (df['salary'] == '>50K'))
    indian50kplus_occupations = df['occupation'].loc[indian50kplus].value_counts()
    top_IN_occupation = indian50kplus_occupations.head(1)


    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

## Output:

In [32]:
calculate_demographic_data()

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Average age of men: 39.43
Percentage with Bachelors degrees: 16.45%
Percentage with higher education that earn >50K: 46.54%
Percentage without higher education that earn >50K: 17.37%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.86%
Top occupations in India: Prof-specialty    25
Name: occupation, dtype: int64


{'race_count': White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
 Name: race, dtype: int64,
 'average_age_men': 39.43,
 'percentage_bachelors': 16.45,
 'higher_education_rich': 46.54,
 'lower_education_rich': 17.37,
 'min_work_hours': 1,
 'rich_percentage': 10.0,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.86,
 'top_IN_occupation': Prof-specialty    25
 Name: occupation, dtype: int64}

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<img src="https://imgur.com/3g7LyTV.png" align = 'right' width="240">

$End$ $of$ $the$ $notebook...$