# Project 1 Data Exploration and Analysis
##### 5/16/2020
##### Yang Zhang, Reannan McDaniel, Jonathan Roach, Fred Poon

### Business Understanding - describe the purpose of the data set you selected.

<!-- [10] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How  would you measure the effectiveness of a good prediction algorithm? Be specific. -->

<!--
title: "ML1-CaseStudy1"
author: "Reannan McDaniel"
output: html_document
-->

Choosing a data set that supports several kinds of tests across different types of data can be a challenge. It needs enough variables (more than 30 is a good baseline) and roughly 25,000 rows. The variables should include a mix of continuous and categorical attributes so that both regression and classification models can be run.
We decided to work with The Belk Foundation data sets from 2014 to 2017. These data sets cover four consecutive years of educational attributes in North Carolina, USA. For this exercise, we focus on how school performance relates to characteristics such as school type, social and economic demographics, location, and school category within 2014-2017. The Belk Foundation's website says, "Our goal is to empower today’s workforce by creating pathways to and through postsecondary education for underrepresented students". For the sake of this analysis, we assume that better-performing schools have better outcomes in postsecondary education. With North Carolina's rapidly changing demographics, it is important to consider schools' unique needs when allocating funds to strategic investment initiatives. Here, we explore where funding can best be applied based on educational achievement data.
To achieve this goal, we will use visualization and mathematical modeling to explore which features best predict the School Performance Grade (SPG), a measure of a school's overall success based on test scores and growth measures.

### Data Understanding
#### Describe the attributes

In [None]:
# [10] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
import pandas as pd
import numpy as np

schools = pd.read_csv('PublicSchools2014to2017_YZ.csv')
schools.head()

In [None]:
schools.info(verbose=True)
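
To describe each attribute's type and scale in one place, a compact summary table can complement `info()`. The sketch below is a minimal illustration on a small made-up frame; the values are placeholders rather than the real data, and `summarize_attributes` is a hypothetical helper name.

```python
import pandas as pd

def summarize_attributes(df):
    """Return one row per column: dtype, number of distinct values, missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),      # NaN is not counted as a distinct value
        "n_missing": df.isna().sum(),
    })

# toy stand-in for the schools frame (values are made up for illustration)
toy = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2017],
    "school_type_txt": ["Regular School", "Regular School", None, "Other"],
    "SPG Score": [71.0, 68.5, None, 80.0],
})

summary = summarize_attributes(toy)
print(summary)
```

Calling `summarize_attributes(schools)` on the real frame would give the same three columns for every attribute.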

It is important to understand the types of schools that are relevant to this analysis. Let's first take a look at school type and see if there are any anomalies (schools for children with disabilities, for example).

In [None]:
schools.school_type_txt.value_counts()

"Regular" schools seem to make up the majority of the dataset, so we will focus on those for now. Perhaps we can come back to the other school types later and analyze them separately.

In [None]:
# take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
schools2 = schools[schools["school_type_txt"] == 'Regular School'].copy()
schools2["school_type_txt"].unique()

Let's now take a look at schools by the age ranges of their students: elementary, middle, high, and various combinations of the three.

In [None]:
# map category codes to readable labels in a new column, leaving category_cd unchanged
schools2["category_cd_modified"] = schools2["category_cd"].map({"A": "Elem/Mid/High", "E": "Elementary", "H": "High", "I": "Elem/Mid", "M": "Middle"})
schools2.category_cd_modified.value_counts()

Some of these categories have very little representation in the data. For now, we'll remedy this by lumping the combo groups together.

In [None]:
# [15] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods. 
combo=schools2['category_cd_modified'].str.contains('/', regex=False)

schools2['category_cd_modified'] = np.where(combo, 'Combo', schools2['category_cd_modified'])

schools2.category_cd_modified.value_counts()
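
The data-quality prompt above also asks about missing values and duplicates. A minimal sketch of how those checks might look is shown below, using a small made-up frame. The `school_id` column is a hypothetical identifier, and dropping rows without a target value is only one possible policy, not necessarily the right one for this data.

```python
import pandas as pd

# toy stand-in for schools2 (values are made up for illustration)
toy = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015],
    "school_id": ["A", "A", "A", "B"],   # hypothetical identifier column
    "SPG Score": [70.0, 70.0, None, 85.0],
})

# share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# one possible cleanup: drop exact duplicates, then rows with no target value
cleaned = toy.drop_duplicates().dropna(subset=["SPG Score"])

print(missing_share)
print(n_duplicates, len(cleaned))
```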

We now need a way to capture the demographic composition of schools. Dr. Drew and his capstone groups have shown that classifying schools as majority-minority when they are composed of >50% non-white students highlights meaningful differences in school performance (likely because demographics can serve as a stand-in for economic measures). Let's take the same approach.

In [None]:
schools2['MinorityOverallPct'] = schools2['MinorityMalePct'] + schools2['MinorityFemalePct']

schools2['Majority_Minority'] = np.where(schools2['MinorityOverallPct'] > .5, 1,0)

schools2['Majority_Minority'].value_counts()

#### Visualize appropriate statistics (summary statistics)

In [None]:
# [10] Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful. 

schools2.describe()
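
`describe()` omits a few of the statistics named in the prompt above (mode, variance, range). The sketch below shows one way to collect them for a subset of attributes. It runs on a small made-up frame, so the numbers are illustrative only.

```python
import pandas as pd

# toy stand-in for two numeric attributes (values are made up for illustration)
toy = pd.DataFrame({
    "SPG Score": [55.0, 60.0, 60.0, 75.0, 90.0],
    "student_num": [200, 350, 350, 500, 1200],
})

# one row per attribute, one column per statistic
stats = toy.agg(["count", "mean", "median", "var", "min", "max"]).T
stats["range"] = stats["max"] - stats["min"]
stats["mode"] = toy.mode().iloc[0]   # first mode if there are ties

print(stats)
```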

#### Visualize most interesting attributes

In [None]:
# [15] Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate. 

#### Visualize relationships

Now let's do a bivariate analysis of attendance rates against school performance grade. Our expectation going in is that poor attendance rates go along with poor school performance.

In [None]:
# [15] Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships. 
import matplotlib.pyplot as plt
import seaborn as sns

grid = sns.FacetGrid(schools2, col="Year")
grid.map(plt.scatter, 'avg_daily_attend_pct', 'SPG Score')
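
A scatter plot shows the shape of the relationship; a per-year correlation coefficient puts a number on it. The following is a minimal sketch on made-up values, assuming the same column names as the plot above.

```python
import pandas as pd

# toy stand-in for schools2 (values are made up for illustration)
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015, 2015],
    "avg_daily_attend_pct": [0.90, 0.93, 0.96, 0.91, 0.94, 0.97],
    "SPG Score": [60.0, 70.0, 80.0, 85.0, 75.0, 65.0],
})

# Pearson correlation between attendance and SPG score within each year
corr_by_year = pd.Series({
    year: group["avg_daily_attend_pct"].corr(group["SPG Score"])
    for year, group in toy.groupby("Year")
})

print(corr_by_year)
```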

In [None]:
grid = sns.FacetGrid(schools2, col='Year')
grid.map(plt.scatter, 'student_num', 'SPG Score')

While raw enrollment isn't a good measure of class size (we have no way of measuring student/teacher ratio, only the raw number of students), it's interesting to observe that larger public schools generally receive higher grades than smaller ones. School size may be acting as a stand-in for rurality here: in rural areas, schools tend to be smaller, and students may have fewer opportunities. What happens when we color by the majority-minority variable? Of particular interest are high schools.

#### Identify and explain interesting relationships

In [None]:
# [10] Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification). 

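# color each point by the majority-minority flag created earlier on schools2
grid = sns.FacetGrid(schools2, col='Year', hue='Majority_Minority')
grid.map(plt.scatter, 'student_num', 'SPG Score', alpha=0.5)
grid.add_legend()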
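
A group-wise average is a simple numeric companion to the question of what happens when schools are split by the majority-minority variable. The sketch below rebuilds the flag on a small made-up frame and compares mean SPG scores. The values are illustrative and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# toy stand-in for schools2 (values are made up for illustration)
toy = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "MinorityOverallPct": [0.30, 0.80, 0.45, 0.65],
    "SPG Score": [82.0, 58.0, 78.0, 62.0],
})
toy["Majority_Minority"] = np.where(toy["MinorityOverallPct"] > 0.5, 1, 0)

# mean SPG score for each group overall
group_means = toy.groupby("Majority_Minority")["SPG Score"].mean()

# the same comparison broken out by year
by_year = toy.pivot_table(index="Year", columns="Majority_Minority",
                          values="SPG Score", aggfunc="mean")

print(group_means)
print(by_year)
```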

#### Other features

In [None]:
# [5] Are there other features that could be added to the data or created from existing features? Which ones? 
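
One candidate derived feature is the year-over-year change in SPG score for each school. The sketch below is a minimal illustration on made-up values. It assumes the data has some per-school identifier, and `school_id` is a hypothetical name for it.

```python
import pandas as pd

# toy stand-in for schools2 (values are made up for illustration)
toy = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],   # hypothetical identifier column
    "Year": [2014, 2015, 2016, 2014, 2015],
    "SPG Score": [60.0, 65.0, 63.0, 80.0, 84.0],
})

# order each school's records by year, then difference within the school
toy = toy.sort_values(["school_id", "Year"])
toy["SPG_change"] = toy.groupby("school_id")["SPG Score"].diff()

print(toy)
```

The first year for each school has no earlier record, so its `SPG_change` is missing.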

### Exceptional Work

In [None]:
# [10]
# • You have free rein to provide additional analyses.
# • One idea: implement dimensionality reduction, then visualize and interpret the results. 

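import seaborn as sb

# Assumption: df is not defined earlier in the notebook; here it is taken
# to be the regular-school subset built above
df = schools2.copy()

# filter rows (this does not reduce the number of columns)
df = df[df["school_type_txt"] != 0]
sb.lineplot(x="Year", y="SPG Score", data=df)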

In [None]:
# trendline by category
sb.lineplot(x="Year", y="SPG Score", hue="category_cd", data=df)

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# let's start by first changing the numeric values to be floats
continuous_features = []

# and the ordinal values to be integers
ordinal_features = []

# we won't touch these variables, keep them as categorical
categ_features = []

# use the "astype" function to change the variable type
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)

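df.info(verbose=True) # confirm the dtypes after conversion

# Assumption: the original assignment to df_pca was left unfinished; here PCA
# is run on the numeric columns only, dropping rows with missing values
df_pca = df.select_dtypes(include=[np.number]).dropna()

pca = PCA(n_components=2)
x_pca = pca.fit_transform(df_pca)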
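
PCA is sensitive to the scale of its inputs, so attributes measured in very different units (student counts versus percentages, for instance) are usually standardized first. The sketch below is a minimal illustration on randomly generated data. The column names mirror ones used earlier in this notebook, but the values are synthetic, and standardizing with `StandardScaler` is an assumption about preprocessing rather than something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a few numeric attributes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "student_num": rng.normal(500, 150, size=50),
    "avg_daily_attend_pct": rng.normal(0.95, 0.02, size=50),
    "MinorityOverallPct": rng.uniform(0, 1, size=50),
})

# standardize each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy)

# project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# fraction of total variance captured by each component
explained = pca.explained_variance_ratio_
print(explained)
```

The `explained` array is a quick check on how much information the two-dimensional projection keeps.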