## How Challenges Work

The American Community Survey is a U.S. Census Bureau survey that collects data on everything from housing affordability to industry employment rates. For this challenge, you'll be using the data that the team at FiveThirtyEight derived from the 2010-2012 American Community Surveys. [FiveThirtyEight](http://fivethirtyeight.com/) cleaned the data set and made it available in a [GitHub repository](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Here's a quick overview of the files we'll be working with:

- all-ages.csv - Employment data by major for all ages
- recent-grads.csv - Employment data by major for recent college graduates only

Here are descriptions for a few of the columns (out of 21 total columns):
- Rank - The major's numerical rank, by post-graduation median earnings
- Major_code - The major's numerical code
- Major - The major's description
- Major_category - The major's category
- Total - The total number of people who studied the major
- Sample_size - Sample size (unweighted) of full-time, year-round students
- Men - The number of men who studied the major
- Women - The number of women who studied the major
- ShareWomen - The share of women (from 0 to 1) who studied the major
- Employed - The number of people who studied the major and obtained a job after graduating
- Low_wage_jobs - Number in low-wage service jobs




Here are the first few rows and columns in recent-grads.csv. The all-ages.csv data set has a similar structure, but with fewer columns and different values.

By completing this challenge, you'll test your comfort level with using pandas to manipulate DataFrames and calculate summary statistics. First, we'll need to read the data set into pandas.

**Instruction**

- Read all-ages.csv into a DataFrame object, and assign it to all_ages.
- Read recent-grads.csv into a DataFrame object, and assign it to recent_grads.
- Display the first row of all_ages and recent_grads.


In [9]:
import pandas as pd
import numpy as np

# Skip malformed rows instead of raising an error (requires pandas >= 1.3)
all_ages = pd.read_csv("all-ages.csv", on_bad_lines="skip")
recent_grads = pd.read_csv("recent-grads.csv", on_bad_lines="skip")

In [14]:
all_ages.head(1)

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0


In [15]:
recent_grads.head(1)

Unnamed: 0,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
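
One way to see how the two DataFrames differ in structure is to compare their column labels. The sketch below is illustrative only: `toy_aa` and `toy_rg` are hypothetical, empty stand-ins with a handful of column names, not the real data sets.

```python
import pandas as pd

# Hypothetical stand-ins for all_ages and recent_grads (columns only, no rows).
toy_aa = pd.DataFrame(columns=["Major_code", "Major", "Total"])
toy_rg = pd.DataFrame(columns=["Rank", "Major_code", "Major", "Total", "Sample_size"])

# Index.difference() returns the labels present in one Index but not the other.
only_in_rg = toy_rg.columns.difference(toy_aa.columns)
only_in_aa = toy_aa.columns.difference(toy_rg.columns)

print(list(only_in_rg))
print(list(only_in_aa))
```

Running the same two `difference()` calls on `recent_grads.columns` and `all_ages.columns` would list the columns that appear in only one of the files.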


## Summarizing Major Categories
- Use Series.unique() to return the unique values in a column.


In [17]:
# Unique values of "Major_category" in all_ages 
print(all_ages["Major_category"].unique())

['Agriculture & Natural Resources' 'Biology & Life Science' 'Engineering'
 'Humanities & Liberal Arts' 'Communications & Journalism'
 'Computers & Mathematics' 'Industrial Arts & Consumer Services'
 'Education' 'Law & Public Policy' 'Interdisciplinary' 'Health'
 'Social Science' 'Physical Sciences' 'Psychology & Social Work' 'Arts'
 'Business']


In [19]:
# Unique values of "Major_category" in recent_grads
print(recent_grads["Major_category"].unique())

['Engineering' 'Business' 'Physical Sciences' 'Law & Public Policy'
 'Computers & Mathematics' 'Agriculture & Natural Resources'
 'Industrial Arts & Consumer Services' 'Arts' 'Health' 'Social Science'
 'Biology & Life Science' 'Education' 'Humanities & Liberal Arts'
 'Psychology & Social Work' 'Communications & Journalism'
 'Interdisciplinary']


**Instruction**
- Use the Total column to calculate the number of people who fall under each Major_category in each data set.
    - Store the result as a separate dictionary for each data set.
    - The key for the dictionary should be the Major_category, and the value should be the total count.
    - For the counts from all_ages, store the results as a dictionary named aa_cat_counts.
    - For the counts from recent_grads, store the results as a dictionary named rg_cat_counts.


In [20]:
# Initialize an empty dictionary for each data set
aa_cat_counts = dict()
rg_cat_counts = dict()

In [23]:
def total_stats_of_major_category(df, d):
    """Fill dictionary d in place with the summed Total for each Major_category in df."""
    for major_category in df["Major_category"].unique():
        # Select all rows for this major category
        all_rows = df[df["Major_category"] == major_category]
        # Use Series.sum() to calculate the category total
        d[major_category] = all_rows["Total"].sum()

total_stats_of_major_category(all_ages, aa_cat_counts)
total_stats_of_major_category(recent_grads, rg_cat_counts)

In [24]:
print(aa_cat_counts)

{'Agriculture & Natural Resources': 632437, 'Biology & Life Science': 1338186, 'Engineering': 3576013, 'Humanities & Liberal Arts': 3738335, 'Communications & Journalism': 1803822, 'Computers & Mathematics': 1781378, 'Industrial Arts & Consumer Services': 1033798, 'Education': 4700118, 'Law & Public Policy': 902926, 'Interdisciplinary': 45199, 'Health': 2950859, 'Social Science': 2654125, 'Physical Sciences': 1025318, 'Psychology & Social Work': 1987278, 'Arts': 1805865, 'Business': 9858741}


In [25]:
print(rg_cat_counts)

{'Engineering': 537583, 'Business': 1302376, 'Physical Sciences': 185479, 'Law & Public Policy': 179107, 'Computers & Mathematics': 299008, 'Agriculture & Natural Resources': 79981, 'Industrial Arts & Consumer Services': 229792, 'Arts': 357130, 'Health': 463230, 'Social Science': 529966, 'Biology & Life Science': 453862, 'Education': 559129, 'Humanities & Liberal Arts': 713468, 'Psychology & Social Work': 481007, 'Communications & Journalism': 392601, 'Interdisciplinary': 12296}
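
The per-category loop above can also be written with `DataFrame.groupby()`. The sketch below uses a tiny, made-up DataFrame (`toy`, with invented values) instead of the real CSV files, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for all_ages / recent_grads.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Business", "Engineering", "Arts"],
    "Total": [100, 250, 50, 75],
})

# Group rows by category, sum the Total column, and convert to a plain dict.
toy_cat_counts = toy.groupby("Major_category")["Total"].sum().to_dict()
print(toy_cat_counts)
```

By default `groupby()` sorts the group keys, so the dictionary keys come out in alphabetical order rather than in order of first appearance, as they do with the loop. The totals themselves are the same.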


**Instruction**

- Use the Low_wage_jobs and Total columns to calculate the proportion of recent college graduates who worked low-wage jobs.
    - Recall that you can use the Series.sum() method to return the sum of the values in a column.
- Store the resulting float as low_wage_proportion, and display the value with the print() function.


In [30]:
low_wage_proportion = recent_grads["Low_wage_jobs"].sum()/recent_grads["Total"].sum()
print("low_wage_proportion is", low_wage_proportion)

low_wage_proportion is 0.09852546076122913
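
Note that this value is a ratio of sums, which weights every graduate equally. It is not the same as averaging each major's own low-wage share, which would weight every major equally. A small sketch with invented numbers (the `toy` DataFrame is hypothetical) shows how far apart the two can be:

```python
import pandas as pd

# Hypothetical data: one large major and one small major.
toy = pd.DataFrame({
    "Total": [1000, 100],
    "Low_wage_jobs": [50, 50],
})

# Ratio of sums: every graduate counts equally.
pooled = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Mean of per-major ratios: every major counts equally.
per_major_mean = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(pooled, per_major_mean)
```

Here the pooled proportion is 100/1100, while the mean of the per-major shares is (0.05 + 0.5) / 2, because the small major pulls the unweighted average up.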


Both the all_ages and recent_grads data sets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two data sets, and perform some initial calculations to see how the statistics for recent college graduates compare with those for the entire population.

Next, let's calculate the number of majors where recent graduates had a lower unemployment rate than the overall population.

**Instruction**

- Use a for loop to iterate over majors.
    - For each major, use Boolean filtering to find the corresponding row in both DataFrames.
    - Compare the values for Unemployment_rate to see which DataFrame has a lower value.
    - Increment rg_lower_count if the value for Unemployment_rate is lower for recent_grads than it is for all_ages.
- Display rg_lower_count with the print() function.


In [48]:
print(np.sort(all_ages["Major"].unique()))

['ACCOUNTING' 'ACTUARIAL SCIENCE' 'ADVERTISING AND PUBLIC RELATIONS'
 'AEROSPACE ENGINEERING' 'AGRICULTURAL ECONOMICS'
 'AGRICULTURE PRODUCTION AND MANAGEMENT' 'ANIMAL SCIENCES'
 'ANTHROPOLOGY AND ARCHEOLOGY' 'APPLIED MATHEMATICS'
 'ARCHITECTURAL ENGINEERING' 'ARCHITECTURE'
 'AREA ETHNIC AND CIVILIZATION STUDIES' 'ART AND MUSIC EDUCATION'
 'ART HISTORY AND CRITICISM' 'ASTRONOMY AND ASTROPHYSICS'
 'ATMOSPHERIC SCIENCES AND METEOROLOGY' 'BIOCHEMICAL SCIENCES'
 'BIOLOGICAL ENGINEERING' 'BIOLOGY' 'BIOMEDICAL ENGINEERING' 'BOTANY'
 'BUSINESS ECONOMICS' 'BUSINESS MANAGEMENT AND ADMINISTRATION'
 'CHEMICAL ENGINEERING' 'CHEMISTRY' 'CIVIL ENGINEERING'
 'CLINICAL PSYCHOLOGY' 'COGNITIVE SCIENCE AND BIOPSYCHOLOGY'
 'COMMERCIAL ART AND GRAPHIC DESIGN'
 'COMMUNICATION DISORDERS SCIENCES AND SERVICES'
 'COMMUNICATION TECHNOLOGIES' 'COMMUNICATIONS'
 'COMMUNITY AND PUBLIC HEALTH' 'COMPOSITION AND RHETORIC'
 'COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY'
 'COMPUTER AND INFORMATION SYSTEMS' 'COMPUTER 

In [58]:
aa_lower_count = 0
rg_lower_count = 0
for major in all_ages["Major"].unique():
    
    aa_rows =  all_ages[all_ages["Major"] == major]
    rg_rows = recent_grads[recent_grads["Major"] == major]
    
    aa_unemp_rate = aa_rows.iloc[0]["Unemployment_rate"]
    rg_unemp_rate = rg_rows.iloc[0]["Unemployment_rate"]
    # Equivalent
    #aa_unemp_rate = aa_rows["Unemployment_rate"].values[0]
    #rg_unemp_rate = rg_rows["Unemployment_rate"].values[0]
    
    if aa_unemp_rate < rg_unemp_rate:
        aa_lower_count += 1
    elif rg_unemp_rate < aa_unemp_rate:
        rg_lower_count += 1
print("rg_lower_count is ", rg_lower_count)
print("Number of majors are", len(all_ages["Major"].unique()))

rg_lower_count is  43
Number of majors are 173


It appears that recent graduates had a lower unemployment rate than the general population in only 43 of the 173 majors.
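
As an alternative to looping over majors, the two tables could be aligned with `DataFrame.merge()` and compared in a single vectorized step. This is a minimal sketch on hypothetical toy data (`toy_aa`, `toy_rg`, and their values are invented), assuming each major appears exactly once in each table.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_aa = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.06, 0.04],
})
toy_rg = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.07, 0.05, 0.04],
})

# Align rows on Major; suffixes distinguish the shared column name,
# and validate raises an error if a major is duplicated in either table.
merged = toy_aa.merge(
    toy_rg, on="Major", suffixes=("_aa", "_rg"), validate="one_to_one"
)

# Count majors where recent grads have the strictly lower rate.
is_rg_lower = merged["Unemployment_rate_rg"] < merged["Unemployment_rate_aa"]
toy_rg_lower_count = int(is_rg_lower.sum())
print(toy_rg_lower_count)
```

The default inner join silently drops any major missing from one table, whereas the loop's `iloc[0]` lookup would raise an error in that case, so it is worth checking that `len(merged)` matches the expected number of majors.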