# Eduworth Introduction

Many high schoolers must decide whether to continue their education after graduation. Many different paths and opportunities follow, but the most encouraged one is to continue at a college or university. The problem with this path is that it is very expensive and takes many years to complete.

Questions<br>
Is college the most financially responsible post-secondary education option?<br>
Does the major of the degree and industry matter when going into college?

## Steps of Project

1. Obtain Data
2. Clean Data
3. Build Visualizations
4. Build/Train A Classification Machine Learning Model


## Data Sources

<a href="https://nces.ed.gov/datalab/index.aspx">National Center for Education Statistics</a> lets us select variables from the data it collected from <a href="https://www.data.gov/education/">data.gov</a> and place them in a table for us to visualize.

<a href="https://collegescorecard.ed.gov/data/">College Scorecard</a><br>
We used this dataset for finding out additional information regarding the performance of students in classes.<br>

<a href="https://research.collegeboard.org/trends/student-aid">Trends in Student Aid 2019</a><br>This dataset gave us information on what kinds of loans students took out in 2019 and how much money they usually took out.

## Data Shape Before Cleaning

In the College Scorecard dataset, there are multiple data files, each representing a different year. Each file has about 1000 columns of different data types, most named with acronyms, and even more rows, which represent all the different schools. Some of these columns were so specific that we disregarded them. The dataset also had many entries where the information was either null or PrivacySuppressed. Whenever we pulled information out of a row with such an entry, we eliminated the row entirely so that it would not distort the data we wanted.

The code below is an example of our cleaning process: it extracts the specific columns we are interested in and removes the PrivacySuppressed rows.

## Cleaning Data Code

In [1]:
import pandas as pd

# Load the raw field-of-study file and keep only the columns we need.
field_data = pd.read_csv('CollegeScorecard_Raw_Data/FieldOfStudyData1516_1617_PP.csv', low_memory=False)
field_data = field_data[['INSTNM', 'MD_EARN_WNE', 'CIPDESC']]

# Drop rows with suppressed earnings; .copy() avoids chained-assignment warnings.
result = field_data[field_data.MD_EARN_WNE != 'PrivacySuppressed'].copy()
result['MD_EARN_WNE'] = result['MD_EARN_WNE'].astype(int)

# Average earnings for each (major, school) pair, sorted by major.
result = result.groupby(['CIPDESC', 'INSTNM']).mean().astype(int).sort_values('CIPDESC')
result

result.to_csv(r'example_clean_data.csv')
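
The text above mentions null entries as well as PrivacySuppressed ones. As a minimal sketch (not the code we ran), both cases could be handled in one pass with `pd.to_numeric(..., errors='coerce')` followed by `dropna`. The rows below are made up for illustration; only the column names come from the Field of Study file.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the Field of Study file.
raw = pd.DataFrame({
    'INSTNM': ['School A', 'School A', 'School B', 'School C'],
    'CIPDESC': ['Accounting.', 'Accounting.', 'Accounting.', 'Zoology.'],
    'MD_EARN_WNE': ['30000', '40000', 'PrivacySuppressed', None],
})

# errors='coerce' turns anything non-numeric (including 'PrivacySuppressed') into NaN.
raw['MD_EARN_WNE'] = pd.to_numeric(raw['MD_EARN_WNE'], errors='coerce')

# Drop every row whose earnings are missing or were suppressed.
clean = raw.dropna(subset=['MD_EARN_WNE'])

# Average earnings for each (major, school) pair, as in the cell above.
summary = (
    clean.groupby(['CIPDESC', 'INSTNM'])['MD_EARN_WNE']
    .mean()
    .astype(int)
    .reset_index()
)
print(summary)
```

Coercing first means the integer conversion never sees a missing value, which would otherwise raise an error.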

## Cleaned Data File

In [2]:
result = pd.read_csv('example_clean_data.csv')
result

| Unnamed: 0 | CIPDESC | INSTNM | MD_EARN_WNE |
|---|---|---|---|
| 0 | Accounting and Computer Science. | Lone Star College System | 28100 |
| 1 | Accounting and Related Services. | Southern Careers Institute-Pharr | 19200 |
| 2 | Accounting and Related Services. | Southern Careers Institute-San Antonio | 19200 |
| 3 | Accounting and Related Services. | Southern Illinois University-Carbondale | 45850 |
| 4 | Accounting and Related Services. | Southern Illinois University-Edwardsville | 51350 |
| ... | ... | ... | ... |
| 38387 | Zoology/Animal Biology. | Ohio State University-Main Campus | 23400 |
| 38388 | Zoology/Animal Biology. | Ohio University-Main Campus | 27700 |
| 38389 | Zoology/Animal Biology. | Oregon State University | 20700 |
| 38390 | Zoology/Animal Biology. | Miami University-Oxford | 29300 |


TODO:
1. Shape of data before cleaning
2. A couple of the cleaning lines 
3. Cleaned Data File

# Machine Learning

To make classification possible, we needed to add a label to classify on. We were looking at the best value schools, where value is earnings one year post graduation divided by debt accumulated, and deemed that a value of 1.5 or above would leave enough money to pay off your loans. So we added a True/False category to our dataframe. We tried 3 different parameter sets: (Major), (School), and (School Average).
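
A minimal sketch of how such a True/False column could be built, assuming a dataframe with earnings and debt columns. The column names (`earnings`, `debt`, `value`, `good_value`) and the figures are hypothetical; only the 1.5 cutoff comes from the text.

```python
import pandas as pd

# Hypothetical earnings and debt figures, for illustration only.
df = pd.DataFrame({
    'major': ['Major A', 'Major B', 'Major C'],
    'earnings': [60000, 30000, 45000],
    'debt': [20000, 25000, 30000],
})

VALUE_THRESHOLD = 1.5  # cutoff stated in the text

# Value = earnings one year after graduation divided by accumulated debt.
df['value'] = df['earnings'] / df['debt']

# True when the value meets the 1.5 cutoff, False otherwise.
df['good_value'] = df['value'] >= VALUE_THRESHOLD
print(df)
```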

Our baseline classifier for each parameter set was the `DummyClassifier` provided by `sklearn`.
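
The contents of `classifierClasses` are not shown in this document. As a rough sketch of what a baseline-versus-SVM comparison can look like, the example below uses synthetic data from `make_classification` in place of our features. The `most_frequent` strategy and the default `SVC` settings are assumptions, not necessarily what our functions use.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the project's features and True/False labels.
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: always predict the most frequent class seen in training (assumed strategy).
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

# Support vector classifier with default settings (assumed).
svm = SVC().fit(X_train, y_train)

# .score() returns the mean accuracy on the held-out test set.
baseline_acc = baseline.score(X_test, y_test)
svm_acc = svm.score(X_test, y_test)
print('Baseline Classifier Accuracy:', baseline_acc)
print('SVM Accuracy:', svm_acc)
```

The baseline ignores the features entirely, so it shows how much of the accuracy comes from class imbalance alone.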

In [3]:
from classifierClasses import *

In [4]:
ClassifyByMajor()

Baseline Classifier Accuracy: 0.5797101449275363
SVM Accuracy: 0.9565217391304348


In [5]:
ClassifyByAllSchools()

Baseline Classifier Accuracy: 0.49710108394252583
SVM Accuracy: 0.9432820771363751


In [6]:
ClassifyBySchool()

Baseline Classifier Accuracy: 0.46503496503496505
SVM Accuracy: 0.9020979020979021


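Classifying by all schools gives the largest gain over its baseline (0.943 versus 0.497), although classifying by major has the highest raw accuracy (0.957). Let's try the all-schools set using K-Nearest Neighbors.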

In [7]:
KNNClassifyByAllSchools()

Baseline Classifier Accuracy: 0.534965034965035
KNN Accuracy: 0.9090909090909091


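K-Nearest Neighbors (0.909) did not provide a better accuracy rate than our all-schools SVM (0.943). The baselines of the two runs differ (0.535 versus 0.497), so the comparison is approximate.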
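
One way to make such a comparison more direct is to score both models on the same train/test split. The sketch below does this on synthetic data. The scaling step, `n_neighbors=5`, and the fixed `random_state` are assumptions for illustration and are not taken from our functions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the all-schools features and labels.
X, y = make_classification(n_samples=400, n_features=4, random_state=0)

# A fixed random_state gives both models exactly the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both models are distance-based, so features are standardized first.
models = {
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each model on the training split and record its test accuracy.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f'{name} Accuracy: {acc}')
```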

TODO: Maybe add a visualization?

# Visualization

![Loan Distribution Data](imgs/loan_dist.png)

This visualization shows the most common amounts of money taken out in loans. It gives us insight into how much going to college costs and why it is a large financial decision for incoming freshmen.

Visit https://datastudio.google.com/open/1g6G-O8LygSsjNdDV32BJNbLDH6IjJqIA to see our visualization for the student debt and student earnings (one year post graduation) by major and school in table form. 

This visualization provides an easy way for students to see their prospective choices for a school. It does not, however, provide intuition for how schools compare to each other.

For example, majoring in CS at Duke University yields average earnings of 99,600 with a debt of 7,890, while majoring in CS at UIUC yields average earnings of 92,200 with a debt of 19,500. Comparing these to another university requires comparing the raw values directly, which does not provide good intuition.
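
A single ratio makes the two schools easier to compare. The sketch below divides earnings by debt, the same "Best Value" measure this section uses for its map, using the Duke and UIUC figures quoted above. The function name `best_value` is chosen here for illustration.

```python
# Figures quoted in the text: average earnings and debt for CS majors.
schools = {
    'Duke University': {'earnings': 99_600, 'debt': 7_890},
    'UIUC': {'earnings': 92_200, 'debt': 19_500},
}


def best_value(earnings, debt):
    """Earnings one year after graduation divided by accumulated debt."""
    return earnings / debt


values = {name: best_value(**figures) for name, figures in schools.items()}

# Print schools from highest to lowest value.
for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {value:.2f}')
# Duke University: 12.62
# UIUC: 4.73
```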

Visit https://datastudio.google.com/u/0/reporting/1UYDnVQdFf6_hKf1z2TgFMWtHsNvXl8s6/page/5WZNB to see our visualization for "Best Value" schools on a map of the United States.

"Best Value" is the earnings one year post graduation divided by the debt accumulated, computed for each major at each school. This visualization lets prospective students see which majors at which schools provide the best value. You can search by state and by major.

For example, suppose I want to go to school in Iowa, Illinois, or Indiana and major in Computer Science. The most vibrant blue dot shows the highest value. In this instance, Northwestern University provides the best value.
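
The same search can be sketched in pandas as a filter by state and major, followed by picking the row with the highest value. The school names, column names, and figures below are placeholders, not rows from our data.

```python
import pandas as pd

# Hypothetical rows; school names, states, and figures are placeholders.
df = pd.DataFrame({
    'school': ['School A', 'School B', 'School C', 'School D'],
    'state': ['IA', 'IL', 'IN', 'OH'],
    'major': ['Computer Science'] * 4,
    'earnings': [60000, 80000, 70000, 90000],
    'debt': [20000, 16000, 28000, 10000],
})

# Value = earnings one year after graduation divided by accumulated debt.
df['value'] = df['earnings'] / df['debt']

# Keep only the states and major of interest.
wanted = df[df['state'].isin(['IA', 'IL', 'IN']) & (df['major'] == 'Computer Science')]

# The row with the highest value plays the role of the most vibrant dot on the map.
best = wanted.loc[wanted['value'].idxmax()]
print(best['school'], best['value'])
```

School D has the highest value overall in this made-up table, but it is excluded because it is outside the three chosen states.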

# Results

TODO:
1. 