# Lab One: Visualization and Data Preprocessing
### Ryan Bass, Brett Benefield, Cho Kim, Nicole Wittlin

In [None]:
# https://stackoverflow.com/questions/38918653/pandas-invalid-literal-for-long-with-base-10-error
# https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex

In [5]:
import pandas as pd
import numpy as np
from IPython.display import HTML, display

In [None]:
%cd "C:\sandbox\SMU\dataMining\7331DataMining\EducationDataNC\2017\Raw Datasets"
dfCollege = pd.read_csv("college-enrollment.csv")
dfTeachers = pd.read_csv("personnel.csv")

In [None]:
# Group columns by measurement type
nominal = ['graduation_year', 'unit_code', 'leaname', 'schname', 'status', 'subgroup', 'subgroup_name']
continuous = ['schcount', 'leacount', 'seacount']
# Percentages are continuous values, not ordinal categories
percentage = ['sch_percent_enrolled', 'lea_percent_enrolled', 'sea_percent_enrolled']

# Convert each group of columns to the matching dtype
dfCollege[nominal] = dfCollege[nominal].astype(object)
dfCollege[continuous] = dfCollege[continuous].astype(float)
dfCollege[percentage] = dfCollege[percentage].astype(float)
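
The `astype(float)` casts above raise an error if a column contains text that cannot be parsed as a number. A more forgiving alternative is `pd.to_numeric` with `errors="coerce"`, sketched here on a made-up frame. The toy values and the `"*"` placeholder are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for dfCollege; the "*" marker is an assumed placeholder
toy = pd.DataFrame({
    "schcount": ["120", "85", "*"],
    "sch_percent_enrolled": ["61.5", "*", "48.0"],
})

numeric_cols = ["schcount", "sch_percent_enrolled"]

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
converted = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Count how many values per column could not be parsed
n_unparsed = converted.isna().sum()

print(converted.dtypes)
print(n_unparsed)
```

Counting the coerced values makes it visible how much data would be lost before any rows are dropped.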

In [None]:
# Only look at the overall total of students that enrolled in college
# The dataset further divides it based on various categories which we can explore later
dfCollegeAll = dfCollege[(dfCollege.subgroup == "ALL") & (dfCollege.status == "ENROLL")]

# Remove schools identified by LEA and SEA unit_codes
# Not sure why these groupings of schools are treated as individual schools (warrants further investigation)
dfCollegeAll = dfCollegeAll[~dfCollegeAll.unit_code.str.contains('LEA|SEA')]

# Remove schools that didn't report the number of students enrolled in college courses (warrants further investigation)
dfCollegeAll = dfCollegeAll[~dfCollegeAll.schcount.isna()]
dfCollegeAll = dfCollegeAll[~dfCollegeAll.leacount.isna()]
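
The filtering steps above can be checked on a small frame before running them on the full file. The sketch below uses the same column names, but every row is made up, including the `"NC-SEA"` code. It also passes `na=False` to `str.contains` so that a missing `unit_code` does not produce a NaN in the boolean mask, which is an addition and not part of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are invented
toy = pd.DataFrame({
    "unit_code": ["995LEA", "995304", "NC-SEA", "010303", "010304"],
    "subgroup": ["ALL", "ALL", "ALL", "ALL", "FEMALE"],
    "status": ["ENROLL", "ENROLL", "ENROLL", "ENROLL", "ENROLL"],
    "schcount": [np.nan, 120.0, np.nan, np.nan, 40.0],
    "leacount": [900.0, 900.0, np.nan, 300.0, 300.0],
})

# Keep only the overall enrollment rows
is_overall = (toy["subgroup"] == "ALL") & (toy["status"] == "ENROLL")

# Flag district-level (LEA) and state-level (SEA) aggregate rows
is_aggregate = toy["unit_code"].str.contains("LEA|SEA", na=False)

# Drop aggregates, then drop rows with a missing school or district count
filtered = toy[is_overall & ~is_aggregate].dropna(subset=["schcount", "leacount"])

print(filtered)
```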

In [None]:
dfCollegeAll.info()

In [None]:
dfCollegeAll.describe()

### Business Understanding

Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.

#### Why was the data collected?
 - The data set was made possible by the John M. Belk Endowment. The founder and the endowment believe in the value of education.
 - One of the endowment's operating principles is "closing achievement gaps," which is defined as follows: "Increasing access to postsecondary education can bolster educational and economic success. We will focus our resource to eliminate barriers for students who are underrepresented on the pathways to success".
 - The purpose is to identify features that directly impact measures of academic success. How is academic success measured for a school? Could it be college entrance exam scores (ACT, SAT)? State standardized testing scores? The percentage of students in AP or IB courses and their corresponding results?
 
#### Define and Measure outcomes:
 - See the last bullet above for candidate outcome measures.
 - Alternatively, derive a new variable from the existing data that explains school performance.

### Data Meaning Type
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

In [8]:
# Data dictionary for the college enrollment file, rendered as an HTML table
column_descriptions = '''
<table>
  <tr><td>Column Name</td><td>Description</td></tr>
  <tr><td>Graduation_year</td><td>Year students graduated from high school</td></tr>
  <tr><td>Unit_code</td><td>Code to identify School/LEAState. "Unit codes belonging to individual schools may be mapped to a given district using the first 3 characters of the unit code. For example, schools belonging to the district "995LEA" will each have unit code that begins with "995.""</td></tr>
  <tr><td>lea name</td><td>LEA (Local Education Agency) Name. LEA is a commonly used acronym for a school district.</td></tr>
  <tr><td>schname</td><td>School Name</td></tr>
  <tr><td>status</td><td>The postsecondary enrollment action as defined in the US Department of Education C160 EDEN (Education Data Exchange Network) specification.
    <table>
      <tr><td>ENROLL</td><td>Enrolled in an IHE within 16 months of receiving a regular high school diploma.</td></tr>
      <tr><td>NOENROLL</td><td>Did not enroll in an IHE within 16 months of receiving a regular high school diploma</td></tr>
    </table>
  </td></tr>
</table>
'''
display(HTML(column_descriptions))

| Column Name | Description |
| --- | --- |
| Graduation_year | Year students graduated from high school |
| Unit_code | Code to identify School/LEAState. "Unit codes belonging to individual schools may be mapped to a given district using the first 3 characters of the unit code. For example, schools belonging to the district "995LEA" will each have unit code that begins with "995."" |
| lea name | LEA (Local Education Agency) Name. LEA is a commonly used acronym for a school district. |
| schname | School Name |
| status | The postsecondary enrollment action as defined in the US Department of Education C160 EDEN (Education Data Exchange Network) specification. ENROLL: Enrolled in an IHE within 16 months of receiving a regular high school diploma. NOENROLL: Did not enroll in an IHE within 16 months of receiving a regular high school diploma. |
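
The `Unit_code` description says a school can be mapped to its district through the first 3 characters of the unit code. A minimal sketch of that mapping follows. The unit codes are made up, the column names `district_prefix` and `lea_unit_code` are hypothetical, and the `"LEA"` suffix is inferred only from the single "995LEA" example in the description.

```python
import pandas as pd

# Hypothetical school unit codes
toy = pd.DataFrame({"unit_code": ["995304", "995310", "010303"]})

# The first 3 characters identify the district
toy["district_prefix"] = toy["unit_code"].astype(str).str[:3]

# Assumed pattern for the district-level code, based on the "995LEA" example
toy["lea_unit_code"] = toy["district_prefix"] + "LEA"

print(toy)
```

Keeping the codes as strings matters here, because a code such as "010303" would lose its leading zero if it were read as an integer.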


### Data Quality
Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods.
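
One possible starting point for these checks, sketched on made-up data: count missing values per column, count fully duplicated rows, and flag outliers with the common 1.5 × IQR rule. The IQR rule is an assumption chosen for illustration, and the values below are not from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicate row, one missing value, one extreme value
toy = pd.DataFrame({
    "unit_code": ["A", "B", "B", "C", "D", "E"],
    "schcount": [100.0, 110.0, 110.0, np.nan, 95.0, 900.0],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Flag outliers with the 1.5 * IQR rule after removing duplicates
deduped = toy.drop_duplicates()
q1 = deduped["schcount"].quantile(0.25)
q3 = deduped["schcount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped["schcount"] < q1 - 1.5 * iqr) | (deduped["schcount"] > q3 + 1.5 * iqr)
outliers = deduped[is_outlier]

print(missing_per_column)
print(n_duplicates)
print(outliers)
```

A flagged value is not necessarily a mistake. A very large school would also appear here, so each flagged row still needs a judgment call.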

### Simple Statistics
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics are meaningful.
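
The statistics named in the prompt can be collected for one attribute as in the sketch below. The numbers are made up, and `sch_percent_enrolled` is used only as an example column.

```python
import pandas as pd

# Hypothetical percent-enrolled values
toy = pd.DataFrame({"sch_percent_enrolled": [50.0, 60.0, 60.0, 70.0, 90.0]})
col = toy["sch_percent_enrolled"]

summary = pd.Series({
    "count": col.count(),
    "mean": col.mean(),
    "median": col.median(),
    "mode": col.mode().iloc[0],       # first mode if there are ties
    "variance": col.var(),            # sample variance (ddof=1)
    "range": col.max() - col.min(),
})

print(summary)
```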

### Visualize Attributes
Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate. 

### Explore Joint Attributes
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
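
Three of the techniques listed above (correlation, cross-tabulation, and group-wise averages) can be sketched with pandas as follows. The district names and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows using the notebook's column names
toy = pd.DataFrame({
    "leaname": ["District A", "District A", "District B", "District B"],
    "status": ["ENROLL", "NOENROLL", "ENROLL", "NOENROLL"],
    "schcount": [120.0, 80.0, 30.0, 70.0],
    "sch_percent_enrolled": [60.0, 40.0, 30.0, 70.0],
})

# Pearson correlation between two numeric attributes
correlation = toy["schcount"].corr(toy["sch_percent_enrolled"])

# Cross-tabulation of two categorical attributes (row counts per combination)
counts = pd.crosstab(toy["leaname"], toy["status"])

# Group-wise average of a numeric attribute
group_means = toy.groupby("leaname")["schcount"].mean()

print(correlation)
print(counts)
print(group_means)
```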

### Explore Attributes and Class
Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

### New Features
Are there other features that could be added to the data or created from existing features? Which ones?

### Exceptional Work
You have free rein to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.
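
As one illustration of dimensionality reduction, the sketch below runs principal component analysis (PCA) through NumPy's singular value decomposition. The input is random data with one deliberately redundant column, not the education data, and PCA via SVD is just one possible technique for this section.

```python
import numpy as np

# Hypothetical numeric feature matrix: 50 rows, 3 columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=50)  # make column 2 nearly redundant

# Standardize each column (zero mean, unit sample variance)
Xc = X - X.mean(axis=0)
Xs = Xc / Xc.std(axis=0, ddof=1)

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

# Share of total variance captured by each component
explained_variance_ratio = S**2 / np.sum(S**2)

# Project the data onto the first two components
scores = Xs @ Vt[:2].T

print(explained_variance_ratio)
print(scores.shape)
```

The explained variance ratio indicates how many components are worth keeping, and the two-column `scores` array can be passed directly to a scatter plot.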