### Assignment

# Demographic Data Analyzer

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. Here is a sample of what the data looks like:

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |


You must use Pandas to answer the following questions:
* How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (`race` column)
* What is the average age of men?
* What is the percentage of people who have a Bachelor's degree?
* What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
* What percentage of people without advanced education make more than 50K?
* What is the minimum number of hours a person works per week?
* What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
* What country has the highest percentage of people that earn >50K and what is that percentage?
* Identify the most popular occupation for those who earn >50K in India. 

Use the starter code in the file `demographic_data_analyzer`. Update the code so all variables set to "None" are set to the appropriate calculation or code. Round all decimals to the nearest tenth.

Unit tests are written for you under `test_module.py`.

### Development

For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run.

### Testing 

We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button.

### Submitting

Copy your project's URL and submit it to freeCodeCamp.

### Dataset Source

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [153]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [154]:
df=pd.read_csv("adult.data.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [155]:
print("Number of Elements in df---------->","\n",df.size,"\n")
print("keys in df ------------->","\n",df.keys())


Number of Elements in df----------> 
 488415 

keys in df -------------> 
 Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')


In [156]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [157]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


# 1. How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

In [158]:
pd.Series(df['race'].unique(),name='Race',index=None)

0                 White
1                 Black
2    Asian-Pac-Islander
3    Amer-Indian-Eskimo
4                 Other
Name: Race, dtype: object

In [159]:
# value_counts() already returns a Series indexed by race name
df['race'].value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

# 2. What is the average age of men?

In [160]:
pd.Series(df.sex.unique())

0      Male
1    Female
dtype: object

In [161]:
print("Mean Age of men --------->  ",df[df["sex"]=="Male"]['age'].mean())

Mean Age of men --------->   39.43354749885268


# 3. What is the percentage of people who have a Bachelor's degree?

In [162]:
t = df.education.count()
b = df['education'][df['education']=="Bachelors"].count()
print("% of people who have a Bachelor's degree \n ", "{:.2f}".format(b*100/t))

% of people who have a Bachelor's degree 
  16.45
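
The same percentage can be read directly off a boolean mask, since the mean of a boolean Series is the fraction of `True` values. The sketch below uses a small made-up frame (`toy` and `percentage_bachelors` are illustrative names, not part of the notebook) and rounds to the nearest tenth, as the assignment asks:

```python
import pandas as pd

# Toy data standing in for the census file (values are illustrative only).
toy = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th", "HS-grad"]}
)

# Comparing a column to a value yields a boolean Series; its mean is the
# fraction of True values, so multiplying by 100 gives a percentage.
is_bachelor = toy["education"] == "Bachelors"
percentage_bachelors = round(float(is_bachelor.mean() * 100), 1)
print(percentage_bachelors)
```

On the real data, replacing `toy` with `df` should give the same figure as the cell above, up to rounding.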


# 4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

In [163]:
df[df['education']=="Doctorate"]

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
20,40,Private,193524,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,60,United-States,>50K
63,42,Private,116632,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,45,United-States,>50K
89,43,Federal-gov,410867,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,50,United-States,>50K
96,48,Self-emp-not-inc,191277,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,60,United-States,>50K
189,58,State-gov,109567,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,1,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32429,51,Local-gov,203334,Doctorate,16,Divorced,Exec-managerial,Not-in-family,White,Female,0,0,45,United-States,>50K
32469,58,Self-emp-inc,181974,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,99,?,<=50K
32470,50,Private,485710,Doctorate,16,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,50,United-States,<=50K
32532,34,Private,204461,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,60,United-States,>50K


In [174]:
t = df.age.count()
pd.Series(df.education.unique())

0        Bachelors
1          HS-grad
2             11th
3          Masters
4              9th
5     Some-college
6       Assoc-acdm
7        Assoc-voc
8          7th-8th
9        Doctorate
10     Prof-school
11         5th-6th
12            10th
13         1st-4th
14       Preschool
15            12th
dtype: object

In [175]:
a = df[df['education'].isin(["Bachelors","Masters","Doctorate"])& df['salary'].isin([">50K"])]

In [176]:
print(pd.Series(a.education.unique()))
print(pd.Series(a.salary.unique()))

0      Masters
1    Bachelors
2    Doctorate
dtype: object
0    >50K
dtype: object


In [177]:
c = a['age'].count()

In [179]:
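advanced = df['education'].isin(["Bachelors","Masters","Doctorate"])
# Divide by the number of people with advanced education, not by the whole dataset
print("Percentage of people with advanced education (Bachelors, Masters, or Doctorate) who make more than 50K\n","{:.1f}".format(c*100/advanced.sum()))

Percentage of people with advanced education (Bachelors, Masters, or Doctorate) who make more than 50K
 46.5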


# 5. What percentage of people without advanced education make more than 50K?

In [173]:
lower = ~df['education'].isin(["Bachelors","Masters","Doctorate"])
# Denominator is the number of people without advanced education, not the whole dataset
d = (lower & (df['salary'] == ">50K")).sum()
print("Percentage of people WITHOUT advanced education (Bachelors, Masters, or Doctorate) who make more than 50K\n","{:.1f}".format(d*100/lower.sum()))

Percentage of people WITHOUT advanced education (Bachelors, Masters, or Doctorate) who make more than 50K
 17.4
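
Questions 4 and 5 ask for a percentage within a subgroup, so the denominator is the size of that subgroup rather than the size of the full dataset. A minimal sketch of that idea on made-up data follows; the helper `pct_true_within` and the variable names are hypothetical, chosen only for illustration:

```python
import pandas as pd


def pct_true_within(condition: pd.Series, group: pd.Series) -> float:
    """Percentage of rows in `group` for which `condition` is also True."""
    # Selecting with the boolean `group` mask keeps only the subgroup rows,
    # so the mean is taken over the subgroup, not the whole frame.
    return round(float(condition[group].mean() * 100), 1)


# Toy data (illustrative only, not taken from the census file).
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "11th", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

higher_education_rich = pct_true_within(rich, advanced)   # 2 of 3 advanced rows
lower_education_rich = pct_true_within(rich, ~advanced)   # 1 of 3 remaining rows
print(higher_education_rich, lower_education_rich)
```

Note that the two results do not need to sum to the overall share of >50K earners, because each is computed against a different denominator.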


# 6. What is the minimum number of hours a person works per week?

In [183]:
df[df["hours-per-week"]==df["hours-per-week"].min()]

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
189,58,State-gov,109567,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,1,United-States,>50K
1036,66,Self-emp-inc,150726,9th,5,Married-civ-spouse,Exec-managerial,Husband,White,Male,1409,0,1,?,<=50K
1262,69,?,195779,Assoc-voc,11,Widowed,?,Not-in-family,White,Female,0,0,1,United-States,<=50K
5590,78,?,363134,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,0,1,United-States,<=50K
5632,45,?,189564,Masters,14,Married-civ-spouse,?,Wife,White,Female,0,0,1,United-States,<=50K
5766,62,?,97231,Some-college,10,Married-civ-spouse,?,Wife,White,Female,0,0,1,United-States,<=50K
5808,76,?,211574,10th,6,Married-civ-spouse,?,Husband,White,Male,0,0,1,United-States,<=50K
8447,67,?,244122,Assoc-voc,11,Widowed,?,Not-in-family,White,Female,0,0,1,United-States,<=50K
9147,75,?,260543,10th,6,Widowed,?,Other-relative,Asian-Pac-Islander,Female,0,0,1,China,<=50K
11451,27,Private,147951,HS-grad,9,Never-married,Machine-op-inspct,Other-relative,White,Male,0,0,1,United-States,<=50K


In [184]:
df["hours-per-week"].min()

1

# 7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [193]:
min_workers = df[df['hours-per-week'] == df['hours-per-week'].min()]
# Share of >50K earners among the people who work the minimum hours
e = (min_workers['salary'] == ">50K").sum()
print("Percentage of the people who work the minimum number of hours per week who have a salary of more than 50K\n","{:.1f}".format(e*100/len(min_workers)))

Percentage of the people who work the minimum number of hours per week who have a salary of more than 50K
 10.0
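
When combining pandas conditions, each comparison needs its own parentheses (or its own variable), because `&` binds more tightly than `==` in Python: `a == b & c` is evaluated as `a == (b & c)`. The sketch below works through the question on made-up data; `toy`, `works_min`, `num_min_workers`, and `rich_percentage` are illustrative names:

```python
import pandas as pd

# Toy data (illustrative only, not taken from the census file).
toy = pd.DataFrame({
    "hours-per-week": [1, 1, 40, 1, 20],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})
min_hours = toy["hours-per-week"].min()

# Build each mask separately so operator precedence cannot change the meaning.
works_min = toy["hours-per-week"] == min_hours
rich = toy["salary"] == ">50K"

# The denominator is the number of minimum-hours workers, not every row.
num_min_workers = int(works_min.sum())
rich_percentage = round(float((works_min & rich).sum() / num_min_workers * 100), 1)
print(num_min_workers, rich_percentage)
```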


# 8. What country has the highest percentage of people that earn >50K and what is that percentage?

In [207]:
# Share of >50K earners within each country (not each country's share of the whole dataset)
rich_share = (df['salary'] == ">50K").groupby(df['native-country']).mean() * 100
print("Country with the highest percentage of people that earn >50K:", rich_share.idxmax(), "{:.1f}".format(rich_share.max()))

Country with the highest percentage of people that earn >50K: Iran 41.9
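
The question asks for the share of >50K earners within each country, which can differ from the country with the most >50K earners by count. A minimal sketch on made-up data, grouping a boolean mask by country (the country labels and variable names here are hypothetical):

```python
import pandas as pd

# Toy data: country "A" has the most >50K earners by count (2 of 5),
# but country "B" has the highest share (1 of 2).
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "A", "A", "B", "B", "C"],
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Mean of the boolean mask per country = fraction of >50K earners there.
rich_share = (toy["salary"] == ">50K").groupby(toy["native-country"]).mean() * 100

highest_earning_country = rich_share.idxmax()
highest_earning_country_percentage = round(float(rich_share.max()), 1)
print(highest_earning_country, highest_earning_country_percentage)
```

`idxmax()` returns the index label (the country) of the largest value, so the name and the percentage come from the same Series.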


# 9. Identify the most popular occupation for those who earn >50K in India.

In [210]:
pd.Series(df['native-country'].value_counts())

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

In [231]:
india= df[df['salary'].isin([">50K"])&df['native-country'].isin(["India"])]['occupation'].value_counts()
india.index[0]

'Prof-specialty'

In [224]:
df[(df['salary'] == ">50K") & (df['native-country'] == "India")]["occupation"].value_counts().index[0]

'Prof-specialty'