## Assignment

In this assignment, you'll continue working with the [U.S. Education Dataset](https://www.kaggle.com/noriuk/us-education-datasets-unification-project/home) from Kaggle. The data gives detailed state-level information on several facets of education on an annual basis. To learn more about the data and the column descriptions, see the Kaggle link above. Access the data from the Thinkful database using the credentials below:

postgres_user = 'dsbc_student'<br>
postgres_pw = '7\*.8G9QH21'<br>
postgres_host = '142.93.121.174'<br>
postgres_port = '5432'<br>
postgres_db = 'useducation'<br>
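
A minimal sketch of how these credentials might be turned into a connection, assuming the backslash in the password above is only a Markdown escape and that the table shares the database's name (`useducation`); both are assumptions, and `build_url` and `load_table` are hypothetical helper names. The password is percent-encoded because it contains punctuation.

```python
from urllib.parse import quote_plus

postgres_user = "dsbc_student"
postgres_pw = "7*.8G9QH21"  # assumption: the backslash in the text is a Markdown escape
postgres_host = "142.93.121.174"
postgres_port = "5432"
postgres_db = "useducation"


def build_url():
    """Build a PostgreSQL connection URL with the password percent-encoded."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, quote_plus(postgres_pw), postgres_host, postgres_port, postgres_db
    )


def load_table(table_name="useducation"):
    """Read a whole table into a DataFrame (table name is an assumption)."""
    # Imported here so the URL helper above works without these packages installed.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(build_url())
    try:
        return pd.read_sql_query("select * from {}".format(table_name), con=engine)
    finally:
        engine.dispose()  # release the connection once the data is loaded
```

Calling `load_table()` requires network access to the database host; `build_url()` can be checked offline.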

Don't forget to apply the most suitable missing-value filling techniques from the previous checkpoints to the data. Answer the following questions only after you have handled the missing values.

To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

Say we want to understand the relationship between government expenditures and students' overall success in math and reading.

1. Create a new score variable from the weighted averages of all score variables in the datasets. **Notice that the number of students in the 4th grade isn't the same as the number of students in the 8th grade. So, you should appropriately weigh the scores!**
2. What are the correlations between this newly created score variable and the expenditure types? Which one of the expenditure types is more correlated than the others?
3. Now, apply PCA to the 4 expenditure types. How much of the total variance is explained by the 1st component?
4. What is the correlation between the overall score variable and the 1st principal component? 
5. If you were to choose the best variables for your model, would you prefer using the 1st principal component instead of the expenditure variables? Why?

Submit your work below, and plan on discussing with your mentor. You can also take a look at this [example solution](https://github.com/Thinkful-Ed/data-201-assignment-solutions/blob/master/model_prep_feature_engineering_2/feature_engineering_2_pca.ipynb).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import math
import sklearn
from sklearn import decomposition
from sklearn import preprocessing

In [2]:
%run us_ed_dataset.py

In [3]:
who

create_engine	 decomposition	 engine	 math	 np	 pd	 plot_nulls	 plt	 postgres_db	 
postgres_host	 postgres_port	 postgres_pw	 postgres_user	 preprocessing	 remove_nonstate	 sklearn	 sns	 statecolmean	 
used_cl_df	 usedu_df	 usedu_stateonly_df	 


1. Create a new score variable from the weighted averages of all score variables in the datasets. **Notice that the number of students in the 4th grade isn't the same as the number of students in the 8th grade. So, you should appropriately weigh the scores!**

In [4]:
# Calculate the average score per student, accounting for the different number of students in each grade
used_cl_df["score_w"] = ((((used_cl_df.AVG_MATH_4_SCORE + used_cl_df.AVG_READING_4_SCORE)/2)*used_cl_df.GRADES_4_G)+\
                        (((used_cl_df.AVG_MATH_8_SCORE + used_cl_df.AVG_READING_8_SCORE)/2)*used_cl_df.GRADES_8_G))/\
                        (used_cl_df.GRADES_4_G + used_cl_df.GRADES_8_G)

In [5]:
used_cl_df.head()

Unnamed: 0,PRIMARY_KEY,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,...,GRADES_8_G,GRADES_12_G,GRADES_1_8_G,GRADES_9_12_G,GRADES_ALL_G,AVG_MATH_4_SCORE,AVG_MATH_8_SCORE,AVG_READING_4_SCORE,AVG_READING_8_SCORE,score_w
89,1993_OKLAHOMA,OKLAHOMA,1993,312817.0,1436505.0,99809.0,802783.0,533913.0,1428916.0,726475.0,...,46153.0,34744.0,389548.0,162511.0,557515.0,234.706208,274.241518,216.423908,265.08752,247.126819
79,1993_MONTANA,MONTANA,1993,158875.0,813998.0,74386.0,445532.0,294080.0,833435.0,446245.0,...,12834.0,10325.0,103488.0,46111.0,150082.0,237.996124,287.543301,224.604637,264.354682,253.509608
80,1993_NEBRASKA,NEBRASKA,1993,281354.0,1625242.0,94308.0,509432.0,1021502.0,1642376.0,972575.0,...,22611.0,18578.0,178475.0,81671.0,263723.0,235.946495,283.384772,222.822481,263.267394,251.462694
81,1993_NEVADA,NEVADA,1993,222846.0,1148268.0,52876.0,753277.0,342115.0,1265367.0,611914.0,...,17825.0,12749.0,153912.0,60727.0,215876.0,228.979552,272.759805,210.255439,264.410923,243.125501
83,1993_NEW_JERSEY,NEW_JERSEY,1993,653488.0,6747567.0,331924.0,3104597.0,3311046.0,6641624.0,3676378.0,...,79459.0,64402.0,685132.0,288263.0,982620.0,241.186855,290.442595,228.423116,265.064092,255.585091
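
To see why the enrollment weights matter, here is a small sketch on made-up numbers (the values in `toy` are hypothetical; only the column names come from the dataset). It applies the same formula as the cell above and compares it with a plain unweighted mean.

```python
import pandas as pd

# Two hypothetical states with opposite grade-4 / grade-8 enrollment mixes
toy = pd.DataFrame({
    "AVG_MATH_4_SCORE": [240.0, 230.0],
    "AVG_READING_4_SCORE": [220.0, 210.0],
    "AVG_MATH_8_SCORE": [280.0, 270.0],
    "AVG_READING_8_SCORE": [260.0, 250.0],
    "GRADES_4_G": [100.0, 300.0],
    "GRADES_8_G": [300.0, 100.0],
})

# Mean of math and reading within each grade
grade4_mean = (toy.AVG_MATH_4_SCORE + toy.AVG_READING_4_SCORE) / 2
grade8_mean = (toy.AVG_MATH_8_SCORE + toy.AVG_READING_8_SCORE) / 2

# Enrollment-weighted average across the two grades
toy["score_w"] = (
    grade4_mean * toy.GRADES_4_G + grade8_mean * toy.GRADES_8_G
) / (toy.GRADES_4_G + toy.GRADES_8_G)

# Unweighted average, for comparison only
toy["score_unweighted"] = (grade4_mean + grade8_mean) / 2

print(toy[["score_w", "score_unweighted"]])
```

In the first row the larger 8th-grade enrollment pulls the weighted score above the unweighted one; in the second row the larger 4th-grade enrollment pulls it below.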


2. What are the correlations between this newly created score variable and the expenditure types? Which one of the expenditure types is more correlated than the others?

In [6]:
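# restrict to numeric columns first; recent pandas versions raise on string columns such as STATE
used_cl_corr_df = used_cl_df.select_dtypes(include="number").corr()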
used_cl_corr_df.filter(regex='EXPENDITURE', axis=0).score_w.abs().sort_values(ascending=False)

INSTRUCTION_EXPENDITURE         0.093706
SUPPORT_SERVICES_EXPENDITURE    0.088931
TOTAL_EXPENDITURE               0.083042
OTHER_EXPENDITURE               0.014565
CAPITAL_OUTLAY_EXPENDITURE      0.005830
Name: score_w, dtype: float64

#### A:  
Instruction expenditure is the most correlated, although support services and total expenditures are not far behind.
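
An alternative sketch that avoids building the full correlation matrix: `DataFrame.corrwith` correlates each selected column with one target series. The data here is synthetic (random numbers, only two hypothetical expenditure columns), so the values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
instruction = rng.normal(size=n)

# Synthetic stand-in for used_cl_df: the score depends on one expenditure column only
toy = pd.DataFrame({
    "INSTRUCTION_EXPENDITURE": instruction,
    "OTHER_EXPENDITURE": rng.normal(size=n),
    "STATE": ["X"] * n,  # a string column, never touched by the correlation below
    "score_w": 2.0 * instruction + rng.normal(scale=0.5, size=n),
})

# Select the expenditure columns by name, then correlate each with the score
expend_cols = toy.filter(regex="EXPENDITURE").columns
corrs = toy[expend_cols].corrwith(toy["score_w"]).abs().sort_values(ascending=False)
print(corrs)
```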

3. Now, apply PCA to the 4 expenditure types. How much of the total variance is explained by the 1st component?

In [7]:
# expenditure columns only (the regex also matches TOTAL_EXPENDITURE, so this selects 5 columns, not 4)
used_expend_df = used_cl_df.filter(regex='EXPENDITURE', axis=1)

# standardize to mean 0, std 1
used_expend_fit_df = sklearn.preprocessing.StandardScaler().fit_transform(used_expend_df.values) #aka X

#create principal components
pc = sklearn.decomposition.PCA()
Y = pc.fit_transform(used_expend_fit_df)

In [8]:
# note: index 1 is the second component; the first component is pc.explained_variance_[0]
pc.explained_variance_[1]

0.13171722539369024

In [9]:
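# note: index 1 is the second component; the first component's share is pc.explained_variance_ratio_[0]
pc.explained_variance_ratio_[1]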

0.026322010216997864
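
Since scikit-learn stores components in order of decreasing variance and Python indexes from zero, the first component's share is at index 0. A self-contained sketch on synthetic data (four hypothetical, strongly correlated columns standing in for the expenditure types) shows the pattern; the numbers it prints are not results for the education data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One shared factor plus small independent noise gives four correlated columns
common = rng.normal(size=(500, 1))
X = common + 0.3 * rng.normal(size=(500, 4))

# Standardize, then fit PCA with all components kept
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

ratios = pca.explained_variance_ratio_
first_ratio = ratios[0]  # index 0 is the first principal component

print("share of variance explained by PC 1:", first_ratio)
print("all shares:", ratios)
```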

In [14]:
Y.shape

(1229, 5)

In [15]:
used_expend_df.shape

(1229, 5)

In [10]:
# fit_transform returns all 5 components, so assigning the whole array to one column
# raises "ValueError: Wrong number of items passed 5, placement implies 1"; keep column 0 of Y instead
used_expend_df["PCA_1"] = Y[:, 0]

4. What is the correlation between the overall score variable and the 1st principal component? 

In [None]:
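# Y_df was never defined above; wrap the PCA output Y in a DataFrame on the same index
Y_df = pd.DataFrame(Y, index=used_expend_df.index)
used_expend_df = pd.concat([used_expend_df.drop(columns="PCA_1", errors="ignore"), Y_df[0].rename("PCA_1")], axis=1)
used_expend_df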
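
One way to get the correlation this question asks for is to put the first component in a Series that shares the score's index and call `Series.corr`. The sketch below uses synthetic data with hypothetical column names, so its output is illustrative only. Because the sign of a principal component is arbitrary, the absolute value is the meaningful quantity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
common = rng.normal(size=n)

# Four hypothetical expenditure columns driven by one shared factor
expend = pd.DataFrame(
    {"EXPENDITURE_{}".format(i): common + 0.3 * rng.normal(size=n) for i in range(4)}
)
# A synthetic score that depends weakly on the same factor
score = pd.Series(0.5 * common + rng.normal(size=n), name="score_w")

# First principal component of the standardized expenditure columns
pc1_values = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(expend))[:, 0]
pc1 = pd.Series(pc1_values, index=expend.index, name="PC_1")

corr_pc1 = score.corr(pc1)          # Pearson correlation with the first component
corr_each = expend.corrwith(score)  # correlation of each raw column, for comparison

print("score vs PC 1:", abs(corr_pc1))
print(corr_each.abs())
```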

5. If you were to choose the best variables for your model, would you prefer using the 1st principal component instead of the expenditure variables? Why?