# Visualizing Earnings Based On College Majors 

This is a guided project from my course at [Dataquest.io](https://www.dataquest.io). It is intended to show my ability to visualise data with histograms, scatter plots and bar plots, using the libraries pandas, matplotlib and numpy.

Feel free to discuss this with me!



On this occasion I will work with the dataset `'recent-grads.csv'` on the job outcomes of students who graduated from college between 2010 and 2012 in the USA. The original data on job outcomes was released by the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data. [FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/college-majors) cleaned the dataset and released it on their GitHub repo.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
%matplotlib inline

In [None]:
recent_grads = pd.read_csv('recent-grads.csv')

In [None]:
recent_grads.head(2)

In [None]:
recent_grads.info()

In [None]:
recent_grads.iloc[0]

In [None]:
recent_grads.tail(2)

In [None]:
recent_grads.describe()

Let's count the number of rows in the dataframe `recent_grads` with [`shape`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html):

In [None]:
raw_data_count = recent_grads.shape[0]
raw_data_count

Let's drop the rows that contain missing values with [`dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html). Passing `inplace=True` modifies `recent_grads` itself instead of returning a new dataframe with the NA entries dropped.

In [None]:
recent_grads.dropna(inplace=True)

So now we have 172 rows (only one row had empty cells).

In [None]:
cleaned_data_count = recent_grads.shape[0]
cleaned_data_count
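Before dropping rows, it can help to see which ones contain missing values. The following is a minimal sketch of that check. The frame `toy` and its values are made up for illustration and only stand in for `recent_grads`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 300.0],
    'Median': [50000, 40000, 30000],
})

# Boolean mask: True for every row with at least one missing value.
has_missing = toy.isna().any(axis=1)

# The rows that dropna() would remove.
rows_to_drop = toy[has_missing]

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna() without inplace=True returns a new frame and leaves toy untouched.
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)
```

The same mask applied to `recent_grads` before the call to `dropna` would show which major is removed and which columns are empty for it.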

Now let's make some exploratory plots with the method `DataFrame.plot()`.

Remember that each row represents a different major in college.

`Median` is the median salary of full-time, year-round workers.

`Full_time` is the number of graduates employed full time (35 hours or more per week).

`Sample_size` is the unweighted sample size of full-time workers for a major.

`ShareWomen` is the share of women among all graduates of a major.



For reference, the dataset includes these columns:

- `Rank`: rank by median earnings (the dataset is ordered by this column).
- `Major_code`: major code.
- `Major`: major description.
- `Major_category`: category of major.
- `Total`: total number of people with the major.
- `Sample_size`: sample size (unweighted) of full-time workers.
- `Men`: male graduates.
- `Women`: female graduates.
- `ShareWomen`: women as a share of the total.
- `Employed`: number employed.
- `Median`: median salary of full-time, year-round workers.
- `Low_wage_jobs`: number in low-wage service jobs.
- `Full_time`: number employed 35 hours or more per week.
- `Part_time`: number employed less than 35 hours per week.
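The description of `ShareWomen` suggests that it equals `Women / Total`. A quick sanity check of that reading could look like the sketch below. The frame `toy` is hypothetical and only reuses the column names, and the formula itself is an assumption based on the column description.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as recent_grads; values are made up.
toy = pd.DataFrame({
    'Men': [80, 30],
    'Women': [20, 70],
    'Total': [100, 100],
    'ShareWomen': [0.2, 0.7],
})

# Recompute the share of women from the raw counts.
recomputed_share = toy['Women'] / toy['Total']

# Compare with a tolerance, because the stored values are floats.
share_matches = np.isclose(recomputed_share, toy['ShareWomen'])
all_match = bool(share_matches.all())
```

If `all_match` came out `False` on the real data, the rows where `share_matches` is `False` would be the ones to inspect.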


In [None]:
recent_grads.plot(x = 'Sample_size', y = 'Median', 
                  kind = 'scatter', title = 'Median vs Sample Size')


In [None]:
recent_grads.plot(x = 'Sample_size', y = 'Unemployment_rate', 
                  kind = 'scatter',
                  title = 'Unemployment Rate vs Sample Size')

In [None]:
recent_grads.plot(x = 'Full_time', y = 'Median', 
                  kind = 'scatter', 
                  title = 'Median vs Full_time')

Does this show a relationship between the median salaries and the number of full-time employees?

I'm not sure. Most majors cluster at the low end of both axes: they have relatively few full-time employees and a median salary near the bottom of the range.
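A scatter plot alone can leave this question open, so a correlation coefficient is one way to put a number on it. This is a minimal sketch: the helper `correlation_between` and the frame `toy` are hypothetical, and the values are made up so that the result is easy to verify.

```python
import pandas as pd

def correlation_between(df, x, y):
    """Return the Pearson correlation coefficient between two columns of df."""
    return df[x].corr(df[y])

# Hypothetical data, made up for illustration only.
toy = pd.DataFrame({
    'Full_time': [1000, 2000, 3000, 4000],
    'Median': [60000, 50000, 40000, 30000],
})

# The toy values fall on a straight descending line, so r is -1.
r = correlation_between(toy, 'Full_time', 'Median')
```

On the real data the call would be `correlation_between(recent_grads, 'Full_time', 'Median')`. A value close to 0 would suggest little linear relationship, although Pearson's coefficient does not capture non-linear patterns.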



In [None]:
recent_grads.plot(x = 'Unemployment_rate', y = 'ShareWomen', 
                  kind = 'scatter', 
                  title = 'Share women vs Unemployment Rate')

In [None]:
recent_grads.plot(x = 'Men', y = 'Median', 
                  kind = 'scatter', 
                  title = 'Median vs Men')

In [None]:
recent_grads.plot(x = 'Women', y = 'Median', 
                  kind = 'scatter', 
                  title = 'Median vs Women')

Let's explore the relation between the number of graduates in a major and their salary:

In [None]:
recent_grads.plot(x = 'Total', y = 'Median', 
                  kind = 'scatter', 
                  title = 'Median vs Total')

We can see that the major with the highest salary has few graduates.

Lower salaries are spread across majors with both small and large numbers of graduates.
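The top-earning major can also be looked up directly rather than read off the plot. The sketch below shows one way to do it, using a hypothetical frame `toy` with made-up values in place of `recent_grads`.

```python
import pandas as pd

# Hypothetical data with the same column names as recent_grads; values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Total': [2000, 150000, 90000, 5000],
    'Median': [110000, 35000, 40000, 30000],
})

# Row of the major with the highest median salary.
top_major = toy.loc[toy['Median'].idxmax(), ['Major', 'Total', 'Median']]

# Fraction of majors that have fewer graduates than the top-earning major.
share_smaller = (toy['Total'] < top_major['Total']).mean()
```

A small `share_smaller` means that the top-earning major is among the smallest majors by number of graduates.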

Let's look at the majors with a majority of women (that is, more than 50% of graduates are women, i.e. `ShareWomen > 0.5`).


In [None]:
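ax = recent_grads.plot(x = 'ShareWomen', y = 'Median',
                       kind = 'scatter',
                       title = 'Median vs Share women')

# Vertical line at ShareWomen = 0.5: majority-women majors lie to the right
ax.axvline(x = 0.5, color = 'r', linestyle = '-')
# Horizontal line at the mean of the median salaries
ax.axhline(y = recent_grads['Median'].mean(),
           color = 'r', linestyle = '-')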

It seems like students who majored in majority-female subjects do not earn more than those in majors where most students are men. Here, majority women means that more than 50% of graduates are women.

The red horizontal line marks the mean of the median salaries for full-time working graduates.

The red vertical line marks 0.5 in the column `ShareWomen`, which divides the majors where the majority are men (left) from the majors where the majority are women (right).
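The two red lines split the plot into four quadrants, and counting the majors in each one gives a numeric check on the visual impression. This is a minimal sketch under the same thresholds (0.5 for `ShareWomen`, the mean for `Median`). The frame `toy`, its values and the group labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as recent_grads; values are made up.
toy = pd.DataFrame({
    'ShareWomen': [0.1, 0.3, 0.6, 0.8, 0.9],
    'Median': [70000, 50000, 45000, 30000, 35000],
})

mean_of_medians = toy['Median'].mean()

# Label each major by the side of each red line it falls on.
share_group = pd.Series(
    np.where(toy['ShareWomen'] > 0.5, 'women > 50%', 'women <= 50%'),
    index=toy.index, name='share_group')
salary_group = pd.Series(
    np.where(toy['Median'] > mean_of_medians, 'above mean', 'at or below mean'),
    index=toy.index, name='salary_group')

# Number of majors in each of the four quadrants.
quadrants = pd.crosstab(share_group, salary_group)

# Typical (median) salary on each side of the vertical line.
median_by_group = toy.groupby(share_group)['Median'].median()
```

Replacing `toy` with `recent_grads` applies the same counts to the real data. Comparing the two entries of `median_by_group` is a simple way to check whether the majority-women majors sit lower on the salary axis.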

## Histogram visuals

