# Recent College Graduates Job Outcomes

## Introduction

Each row in the dataset represents one college major and records gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:

<ul>
<li><code>Rank</code> - Rank by median earnings (the dataset is ordered by this column).</li>
<li><code>Major_code</code> - Major code.</li>
<li><code>Major</code> - Major description.</li>
<li><code>Major_category</code> - Category of major.</li>
<li><code>Total</code> - Total number of people with major.</li>
<li><code>Sample_size</code> - Sample size (unweighted) of full-time, year-round workers.</li>
<li><code>Men</code> - Male graduates.</li>
<li><code>Women</code> - Female graduates.</li>
<li><code>ShareWomen</code> - Women as share of total.</li>
<li><code>Employed</code> - Number employed.</li>
<li><code>Median</code> - Median salary of full-time, year-round workers.</li>
<li><code>Low_wage_jobs</code> - Number in low-wage service jobs.</li>
<li><code>Full_time</code> - Number employed 35 hours or more.</li>
<li><code>Part_time</code> - Number employed less than 35 hours.</li>
</ul>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0,:],
      recent_grads.head(),
      recent_grads.tail(), 
      sep='\n \n')

## Cleaning Data

Remove rows that contain missing values, and compare the row counts before and after.

In [None]:
recent_grads.describe()

In [None]:
raw_data_count = recent_grads.shape[0]
raw_data_count

In [None]:
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]
cleaned_data_count
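
To see which rows `dropna()` removes, not only how many, the check can be sketched as below. This is a minimal sketch: `summarize_dropna` is a hypothetical helper, and `demo` is a made-up stand-in for `recent_grads`, not real data.

```python
import numpy as np
import pandas as pd

def summarize_dropna(df):
    """Return (rows_before, rows_after, rows_with_missing_values)."""
    missing_mask = df.isna().any(axis=1)  # True for rows with at least one NaN
    cleaned = df.dropna()
    return len(df), len(cleaned), df[missing_mask]

# Made-up values for illustration only (hypothetical stand-in for recent_grads)
demo = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Men': [100.0, np.nan, 300.0],
    'Women': [150.0, 250.0, 350.0],
})

before, after, dropped = summarize_dropna(demo)
print(f'{before - after} of {before} rows dropped')
print(dropped)
```

Calling `summarize_dropna(recent_grads)` before the cleaning cell would show the affected majors directly.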

## Scatter Plots

In [None]:
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Median vs. Sample_size')

In [None]:
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Sample_size')

In [None]:
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time')

In [None]:
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen')

In [None]:
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men')

In [None]:
recent_grads.plot(x='Women', y='Median', kind='scatter', title='Median vs. Women')

<ul>
<li><code>Do students in more popular majors make more money?</code> 
<br>
<br>
    The scatter plot of <code>Sample_size</code> against <code>Median</code> gives no strong indication that more popular majors lead to higher incomes.</li>
<br>
<li><code>Do students that majored in subjects that were majority female make more money?</code> 
<br>
<br>
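    The scatter plot of <code>Women</code> against <code>Median</code> does not indicate that graduates of majors with more women earn more. Note that <code>Women</code> is a head count; <code>ShareWomen</code> is the column that measures whether a major is majority female.</li>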
<br>
<li><code>Is there any link between the number of full-time employees and median salary?</code> 
<br>
<br>
    The scatter plot of <code>Full_time</code> against <code>Median</code> looks much like the <code>Sample_size</code> vs. <code>Median</code> plot, so it shows no clear link between the number of full-time employees and median salary.</li>
</ul>
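
The visual impressions above can be backed by a number. The sketch below computes Pearson correlations for chosen column pairs. `pairwise_corr` is a hypothetical helper, and `demo` holds made-up, perfectly linear values for illustration only; real data would give weaker correlations.

```python
import pandas as pd

def pairwise_corr(df, pairs):
    """Return the Pearson correlation for each (x, y) column pair."""
    return {f'{x} vs {y}': df[x].corr(df[y]) for x, y in pairs}

# Made-up values for illustration only (hypothetical stand-in for recent_grads)
demo = pd.DataFrame({
    'Sample_size': [10, 20, 30, 40],
    'ShareWomen': [0.2, 0.4, 0.6, 0.8],
    'Median': [60000, 50000, 40000, 30000],
})

corrs = pairwise_corr(demo, [('Sample_size', 'Median'),
                             ('ShareWomen', 'Median')])
for name, value in corrs.items():
    print(f'{name}: {value:.2f}')
```

On the real data, passing `recent_grads` with pairs such as `('Full_time', 'Median')` would give one coefficient per question. A value near zero supports the "no clear link" reading.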

## Histograms

In [None]:
columns = ['Sample_size', 'Median', 'Employed', 'Full_time',
           'ShareWomen', 'Unemployment_rate', 'Men', 'Women']

fig = plt.figure(figsize=(10,24))
for n, c in enumerate(columns):
    ax = fig.add_subplot(4, 2, n + 1)
    # Draw on the subplot explicitly instead of relying on the current axes
    recent_grads[c].hist(bins=20, xrot=40, ax=ax)
    ax.set_title(c + ' Distribution')
plt.show()

<ul>
<li><code>What percent of majors are predominantly male? Predominantly female?</code> 
<br>
<br>
  Based on the <code>ShareWomen</code> histogram, about 57% of the majors are predominantly female, which leaves about 43% predominantly male.</li>
<br>
<li><code>What's the most common median salary range?</code> 
<br>
<br>
  The most common median salary range is roughly $30,000 to $35,000.</li>
<br>
</ul>
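
Both answers above are read off the histograms by eye, but they can also be computed. The sketch below assumes "predominantly female" means `ShareWomen` above 0.5. `gender_majority_shares` and `most_common_bin` are hypothetical helpers, and `demo` contains made-up values for illustration only.

```python
import pandas as pd

def gender_majority_shares(df, col='ShareWomen'):
    """Return (percent of majors majority female, percent majority male)."""
    female = (df[col] > 0.5).mean() * 100
    male = (df[col] < 0.5).mean() * 100
    return female, male

def most_common_bin(series, bins=20):
    """Return the equal-width interval that holds the most values."""
    counts = pd.cut(series, bins=bins).value_counts()
    return counts.idxmax()

# Made-up values for illustration only (hypothetical stand-in for recent_grads)
demo = pd.DataFrame({
    'ShareWomen': [0.1, 0.6, 0.7, 0.9],
    'Median': [30000, 31000, 32000, 60000],
})

female, male = gender_majority_shares(demo)
top_bin = most_common_bin(demo['Median'], bins=3)
print(f'{female:.0f}% majority female, {male:.0f}% majority male')
print('Most common Median bin:', top_bin)
```

With `recent_grads` and `bins=20`, the interval returned by `most_common_bin` should roughly match the tallest bar in the `Median` histogram.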

## Scatter Matrix Plot

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(recent_grads[['Sample_size', 'Median']],
               figsize=(6,6))
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']],
               figsize=(10,10))
plt.show()

## Bar Plots

In [None]:
recent_grads.head(10).plot.bar(x='Major', 
                               y='ShareWomen', 
                               legend=False)
recent_grads.tail(10).plot.bar(x='Major', 
                               y='ShareWomen', 
                               legend=False)
recent_grads.head(10).plot.bar(x='Major', 
                               y='Unemployment_rate', 
                               legend=False)
recent_grads.tail(10).plot.bar(x='Major', 
                               y='Unemployment_rate', 
                               legend=False)
plt.show()



## Comparison of Men and Women for each Major Category

### Grouped Bar Plot without a Pivot Table

In [None]:
# Note: %%timeit runs the cell in an isolated scope, so names assigned under it
# (such as gender_totals) are not kept for later cells. It is omitted here.
men_sum = {}
women_sum = {}

major_categories = recent_grads['Major_category'].unique()
for category in major_categories:
    in_category = recent_grads['Major_category'] == category
    men_sum[category] = recent_grads.loc[in_category, 'Men'].sum()
    women_sum[category] = recent_grads.loc[in_category, 'Women'].sum()

# Build both columns from the dicts so values are aligned by category key
gender_totals = pd.DataFrame({'Men': men_sum, 'Women': women_sum})
gender_totals.sort_index(inplace=True)
gender_totals

In [None]:
gender_totals.plot.bar()

### Grouped Bar Plot with a Pivot Table

In [None]:
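# Note: %%timeit is omitted because names assigned under it are not kept for later cells.
# The string 'sum' avoids the deprecation warning that recent pandas raises for np.sum.
gender_totals_pivot = pd.pivot_table(recent_grads,
                                     index='Major_category',
                                     values=['Men', 'Women'],
                                     aggfunc='sum')
gender_totals_pivot.index.name = None
gender_totals_pivot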

In [None]:
gender_totals_pivot.plot.bar()
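
A third way to get the same totals is `groupby`, which also offers a quick check that the approaches agree. The sketch below uses a made-up `demo` frame as a hypothetical stand-in for `recent_grads`; the values are for illustration only.

```python
import pandas as pd

# Made-up values for illustration only (hypothetical stand-in for recent_grads)
demo = pd.DataFrame({
    'Major_category': ['Engineering', 'Arts', 'Engineering', 'Arts'],
    'Men': [100, 20, 300, 40],
    'Women': [50, 60, 70, 80],
})

# Sum Men and Women within each major category
by_groupby = demo.groupby('Major_category')[['Men', 'Women']].sum()

# Same aggregation expressed as a pivot table
by_pivot = pd.pivot_table(demo,
                          index='Major_category',
                          values=['Men', 'Women'],
                          aggfunc='sum')

same = by_groupby.equals(by_pivot)  # compares labels, values, and dtypes
print(by_groupby)
print('groupby and pivot_table agree:', same)
```

Either result can be passed to `.plot.bar()` to draw the grouped bar plot.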

## Distribution of Median Salaries and Unemployment Rate

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(10,8))
recent_grads['Unemployment_rate'].plot.box(ax=ax1)
recent_grads['Median'].plot.box(ax=ax2)
plt.show()

In [None]:
recent_grads.plot.hexbin(x='Women', y="Median", figsize=(10,10))