## <center> Demographic Data Analyzer </center>

### What is this?

This is a freeCodeCamp project meant to test one's knowledge of data analysis with Python and pandas.
This notebook breaks down the problem and the steps I took to solve it.

The following is the problem from freeCodeCamp:

> #### <center> Demographic Data Analyzer</center>
>
> You will be [working on this project with our Replit starter code.](https://replit.com/github/freeCodeCamp/boilerplate-demographic-data-analyzer)
> 
> - Start by importing the project on Replit.
> - Next, you will see a `.replit` window.
> - Select `Use run command` and click the `Done` button.
> 
> We are still developing the interactive instructional part of the Python curriculum. For now, here are some videos on the freeCodeCamp.org YouTube channel that will teach you everything you need to know to complete this project:
> 
> - [Python for Everybody Video Course](https://www.freecodecamp.org/news/python-for-everybody/) (14 hours)
> 
> - [How to Analyze Data with Python Pandas](https://www.freecodecamp.org/news/how-to-analyze-data-with-python-pandas/) (10 hours)
> 
> In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. Here is a sample of what the data looks like:
> 
> |    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
> |---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
> |  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
> |  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
> |  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
> |  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
> |  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |
> 
> #### You must use Pandas to answer the following questions:
> 
> - How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (`race` column)
> - What is the average age of men?
> - What is the percentage of people who have a Bachelor's degree?
> - What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
> - What percentage of people without advanced education make more than 50K?
> - What is the minimum number of hours a person works per week?
> - What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
> - What country has the highest percentage of people that earn >50K and what is that percentage?
> - Identify the most popular occupation for those who earn >50K in India.
> 
> Use the starter code in the file `demographic_data_analyzer`. Update the code so all variables set to "None" are set to the appropriate calculation or code. Round all decimals to the nearest tenth.
> 
> Unit tests are written for you under `test_module.py`.
> 
> #### Development
> For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run.
> 
> #### Testing
> We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button.
> 
> #### Submitting
> Copy your project's URL and submit it to freeCodeCamp.
> 
> #### Dataset Source
> Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

### Imports
---

In [26]:
import pandas as pd

### Read the CSV file
---

In [27]:
df = pd.read_csv("adult.data.csv")

In [28]:
print(df)

       age         workclass  fnlwgt   education  education-num  \
0       39         State-gov   77516   Bachelors             13   
1       50  Self-emp-not-inc   83311   Bachelors             13   
2       38           Private  215646     HS-grad              9   
3       53           Private  234721        11th              7   
4       28           Private  338409   Bachelors             13   
...    ...               ...     ...         ...            ...   
32556   27           Private  257302  Assoc-acdm             12   
32557   40           Private  154374     HS-grad              9   
32558   58           Private  151910     HS-grad              9   
32559   22           Private  201490     HS-grad              9   
32560   52      Self-emp-inc  287927     HS-grad              9   

           marital-status         occupation   relationship   race     sex  \
0           Never-married       Adm-clerical  Not-in-family  White    Male   
1      Married-civ-spouse    Exec-manag

### Find all of the races
---

In [29]:
print(len(df['race'].drop_duplicates()))

5


In [30]:
print(df.groupby('race').size())

race
Amer-Indian-Eskimo      311
Asian-Pac-Islander     1039
Black                  3124
Other                   271
White                 27816
dtype: int64
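
As an alternative to `groupby('race').size()`, here is a minimal sketch using `value_counts()`. The `sample` frame is a made-up stand-in for the census data, not part of the original notebook. One difference worth knowing: `value_counts()` sorts by count (largest first), while `groupby().size()` sorts by the index label, so it may be worth checking which order the unit tests expect.

```python
import pandas as pd

# Toy stand-in for the census data; the values are invented for illustration.
sample = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White"]})

# value_counts() returns a Series indexed by race name, sorted by count (descending).
race_count = sample["race"].value_counts()
print(race_count)
```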


### Average age of males
---

In [31]:
df.loc[df['sex'] == "Male"]['age'].mean()

39.43354749885268
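
The same filter can be written as a single `.loc` call that picks the rows and the column together, which avoids chained indexing. This is only a sketch on a made-up `sample` frame (an assumption, not the real dataset):

```python
import pandas as pd

# Toy data; ages are invented for illustration.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 40, 50, 20],
})

# Select the rows where sex is "Male" and the "age" column in one step, then average.
average_age_men = round(sample.loc[sample["sex"] == "Male", "age"].mean(), 1)
print(average_age_men)  # 40.0
```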

### Percentage with a Bachelor's degree
---

In [32]:
len(df.query("education == 'Bachelors'")) / len(df) * 100

16.44605509658794
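
A percentage like this can also be computed as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. A minimal sketch on a made-up `sample` frame (hypothetical data):

```python
import pandas as pd

# Toy data; two of the four rows hold a Bachelors degree.
sample = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors"]})

# The mean of a boolean Series is the fraction of True values.
percentage_bachelors = round((sample["education"] == "Bachelors").mean() * 100, 1)
print(percentage_bachelors)  # 50.0
```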

In [33]:
df.query("education == 'Bachelors' or education == 'Masters' or education == 'Doctorate'")

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32538,38,Private,139180,Bachelors,13,Divorced,Prof-specialty,Unmarried,Black,Female,15020,0,45,United-States,>50K
32539,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
32544,31,Private,199655,Masters,14,Divorced,Other-service,Not-in-family,Other,Female,0,0,30,United-States,<=50K
32553,32,Private,116138,Masters,14,Never-married,Tech-support,Not-in-family,Asian-Pac-Islander,Male,0,0,11,Taiwan,<=50K
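
The chained `or` conditions in the query above could also be expressed with `isin`, and negating the mask with `~` gives the group without advanced education. This is a sketch on a made-up `sample` frame; the `advanced` list name is my own:

```python
import pandas as pd

# Toy data; values are invented for illustration.
sample = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Doctorate", "11th", "Masters"],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

advanced = ["Bachelors", "Masters", "Doctorate"]
has_advanced = sample["education"].isin(advanced)

higher_education = sample[has_advanced]    # rows with an advanced degree
lower_education = sample[~has_advanced]    # everyone else

print(len(higher_education), len(lower_education))  # 3 2
```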


### Salary over 50k
---

In [34]:
rich = df [df.salary == '>50K']

### How many with a salary over 50k have advanced degrees
---

In [35]:
len( rich [ rich.education == "Bachelors" ] ) + len( rich [ rich.education == "Masters" ] ) + len( rich [ rich.education == "Doctorate" ] ) 

3486

### Percentage with advanced degrees that make over 50k
---

In [36]:
round( ( (len( rich [ rich.education == "Bachelors" ] ) + len( rich [ rich.education == "Masters" ] ) + len( rich [ rich.education == "Doctorate" ] ) ) / ( ( len( df [ df.education == "Bachelors" ] ) + len( df [ df.education == "Masters" ] ) + len( df [ df.education == "Doctorate" ] ) ) ) *100), 1)

46.5

In [37]:
len(df [df.education == "Masters" ])

1723

In [38]:
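# Note: this divides by len(df), so it is the share of the whole dataset, not of the group without advanced education.
len( df.query("education != 'Bachelors' and education != 'Masters' and education != 'Doctorate'").query("salary == '>50K'") ) / len(df) * 100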

13.374896348392248
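
The cell above divides by `len(df)`, which answers a different question from the one asked: it gives the share of everyone in the dataset, not the share of people without advanced education. The sketch below shows the difference between the two denominators on a made-up `sample` frame (hypothetical data):

```python
import pandas as pd

# Toy data: four rows without advanced education (one earns >50K), two with.
sample = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "11th", "Some-college", "Bachelors", "Masters"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

advanced = ["Bachelors", "Masters", "Doctorate"]
lower_education = sample[~sample["education"].isin(advanced)]
rich_lower = lower_education[lower_education["salary"] == ">50K"]

# Dividing by the whole dataset understates the figure...
share_of_everyone = round(len(rich_lower) / len(sample) * 100, 1)
# ...while dividing by the group itself answers the question as posed.
share_of_group = round(len(rich_lower) / len(lower_education) * 100, 1)

print(share_of_everyone, share_of_group)  # 16.7 25.0
```

The function at the end of the notebook uses the second form.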

In [39]:
df['hours-per-week'].min()

1

### People that work 1 hour per week and make over 50k
---

In [40]:
df.loc[ (df['hours-per-week'] == 1) ].loc[df['salary'] == '>50K']

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
189,58,State-gov,109567,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,1,United-States,>50K
20072,65,?,76043,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,1,United-States,>50K
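
To turn those rows into the percentage the question asks for, the count of >50K earners needs to be divided by the number of people working the minimum hours. A minimal sketch, assuming a made-up `sample` frame:

```python
import pandas as pd

# Toy data: four people work the minimum of 1 hour, one of them earns >50K.
sample = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 20, 1, 1],
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

min_work_hours = sample["hours-per-week"].min()
min_workers = sample[sample["hours-per-week"] == min_work_hours]

# Fraction of minimum-hour workers earning >50K, as a percentage.
rich_percentage = round((min_workers["salary"] == ">50K").mean() * 100, 1)
print(min_work_hours, rich_percentage)  # 1 25.0
```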


### Unique Native Countries
---

In [41]:
df[ 'native-country' ]

0        United-States
1        United-States
2        United-States
3        United-States
4                 Cuba
             ...      
32556    United-States
32557    United-States
32558    United-States
32559    United-States
32560    United-States
Name: native-country, Length: 32561, dtype: object

In [42]:
countries = df[ 'native-country' ].drop_duplicates() 

In [43]:
countries

0                     United-States
4                              Cuba
6                           Jamaica
11                            India
14                                ?
15                           Mexico
27                            South
35                      Puerto-Rico
52                         Honduras
98                          England
112                          Canada
122                         Germany
135                            Iran
152                     Philippines
201                           Italy
217                          Poland
228                        Columbia
255                        Cambodia
265                        Thailand
289                         Ecuador
304                            Laos
311                          Taiwan
338                           Haiti
359                        Portugal
427              Dominican-Republic
497                     El-Salvador
503                          France
771                       Gu

In [53]:
country_average = {}
for c in countries:
  ls = df [ df['native-country'] == c ] 
  print( c, len( ls[ ls.salary == '>50K' ] )/ len( ls ) * 100 )
  country_average.update( { c : len( ls[ ls.salary == '>50K' ] )/ len( ls ) * 100 } )

print( country_average )
  

United-States 24.583476174151524
Cuba 26.31578947368421
Jamaica 12.345679012345679
India 40.0
? 25.04288164665523
Mexico 5.132192846034215
South 20.0
Puerto-Rico 10.526315789473683
Honduras 7.6923076923076925
England 33.33333333333333
Canada 32.231404958677686
Germany 32.11678832116788
Iran 41.86046511627907
Philippines 30.808080808080806
Italy 34.24657534246575
Poland 20.0
Columbia 3.389830508474576
Cambodia 36.84210526315789
Thailand 16.666666666666664
Ecuador 14.285714285714285
Laos 11.11111111111111
Taiwan 39.21568627450981
Haiti 9.090909090909092
Portugal 10.81081081081081
Dominican-Republic 2.857142857142857
El-Salvador 8.49056603773585
France 41.37931034482759
Guatemala 4.6875
China 26.666666666666668
Japan 38.70967741935484
Yugoslavia 37.5
Peru 6.451612903225806
Outlying-US(Guam-USVI-etc) 0.0
Scotland 25.0
Trinadad&Tobago 10.526315789473683
Greece 27.586206896551722
Nicaragua 5.88235294117647
Vietnam 7.462686567164178
Hong 30.0
Ireland 20.833333333333336
Hungary 23.076923076923
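
The per-country loop could also be written with `groupby`, which computes every country's percentage in one pass, and `idxmax()` then picks the winner. This is a sketch on a made-up `sample` frame with placeholder country names, not a replacement for the loop above:

```python
import pandas as pd

# Toy data with placeholder country names "A", "B", "C".
sample = pd.DataFrame({
    "native-country": ["A", "A", "B", "B", "B", "C"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Boolean mask of >50K earners, grouped by country; the group mean is the fraction of earners.
is_rich = sample["salary"] == ">50K"
rich_pct_by_country = (is_rich.groupby(sample["native-country"]).mean() * 100).round(1)

highest_earning_country = rich_pct_by_country.idxmax()
highest_earning_country_percentage = rich_pct_by_country.max()

print(highest_earning_country, highest_earning_country_percentage)  # B 66.7
```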

In [None]:
df[ df['salary'] == '>50K' ]['native-country'].value_counts().index[0]

'United-States'

In [None]:
df[ df['salary'] == '>50K' ]['native-country'].value_counts()[ 'United-States']

7171

In [None]:
df.loc['native-country':]

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary


### Put it together
---

In [None]:

def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
  
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby('race').size()

    # What is the average age of men?
    average_age_men = df.loc[ df[ 'sex' ] == "Male" ][ 'age' ].mean().round(1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round( len(df.query("education == 'Bachelors'")) / len(df) * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # higher_education = df.query("education == 'Bachelors' or education == 'Masters' or education == 'Doctorate'") 
    rich = df [df.salary == '>50K']

  
    #people with a higher ed
    higher_education = pd.concat( [ df[ df.education == 'Bachelors' ], df [ df.education == 'Masters' ], df[ df.education == 'Doctorate' ] ] )

    rich_higher_ed = higher_education[ higher_education.salary == '>50K' ]
    # print(higher_education)
  
    lower_education = df.query("education != 'Bachelors' and education != 'Masters' and education != 'Doctorate'") 
    rich_lower_ed = lower_education[ lower_education.salary == '>50K' ]
  
    # percentage with salary >50K
    higher_education_rich = round( len(rich_higher_ed) /  len(higher_education) * 100, 1)
  
    lower_education_rich =  round( len( rich_lower_ed ) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    min_workers = df[ df['hours-per-week'] == min_work_hours ]
    rich_percentage = round( len( min_workers[ min_workers.salary == '>50K' ] ) / len( min_workers ) * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    countries = df[ 'native-country' ].drop_duplicates() 
    country_average = {}
  
    for c in countries:
      ls = df [ df['native-country'] == c ]
      #print( c, len( ls[ ls.salary == '>50K' ] )/ len( ls ) * 100 )
      country_average.update( { c : round( len( ls[ ls.salary == '>50K' ] )/ len( ls ) * 100 ,1) } )

    #print( country_average )
    # Track the best [country, percentage] pair without shadowing the built-in max()
    top_country = [ "", 0 ]
    for c, avg in country_average.items():
      if avg > top_country[1]:
        top_country = [ c, avg ]
    
    #print( top_country )
    
    highest_earning_country = top_country[0]
    highest_earning_country_percentage = top_country[1]

    # Identify the most popular occupation for those who earn >50K in India.
    # Filter on salary as well as country, since the question only covers people earning >50K.
    india_occupations = df[ ( df[ "native-country" ] == 'India' ) & ( df[ "salary" ] == '>50K' ) ].groupby( "occupation" ).size()
    top_IN_occupation = india_occupations.idxmax()


    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }


In [None]:
calculate_demographic_data()

Number of each race:
 race
Amer-Indian-Eskimo      311
Asian-Pac-Islander     1039
Black                  3124
Other                   271
White                 27816
dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty


{'race_count': race
 Amer-Indian-Eskimo      311
 Asian-Pac-Islander     1039
 Black                  3124
 Other                   271
 White                 27816
 dtype: int64,
 'average_age_men': 39.4,
 'percentage_bachelors': 16.4,
 'higher_education_rich': 46.5,
 'lower_education_rich': 17.4,
 'min_work_hours': 1,
 'rich_percentage': 10.0,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.9,
 'top_IN_occupation': 'Prof-specialty'}
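
The India question asks about people who earn >50K, so both the country and the salary need to be part of the filter. A minimal sketch of that step on a made-up `sample` frame (hypothetical data, with the `rich_in_india` name being my own):

```python
import pandas as pd

# Toy data; values are invented for illustration.
sample = pd.DataFrame({
    "native-country": ["India", "India", "India", "India", "Cuba"],
    "salary": [">50K", ">50K", "<=50K", ">50K", ">50K"],
    "occupation": ["Prof-specialty", "Prof-specialty", "Sales", "Exec-managerial", "Sales"],
})

# Combine both conditions with & (each wrapped in parentheses).
rich_in_india = sample[(sample["native-country"] == "India") & (sample["salary"] == ">50K")]

# value_counts() tallies each occupation; idxmax() returns the label with the highest count.
top_IN_occupation = rich_in_india["occupation"].value_counts().idxmax()
print(top_IN_occupation)  # Prof-specialty
```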