# Project 1: SAT & ACT Analysis

### Part 2 Exploratory Data Analysis

_By: Evonne Tham_

### Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Data Import 

In [2]:
# Read in Final Dataset
final = pd.read_csv("../dataset/final.csv")
final.head()

Unnamed: 0,state,sat_part_17,sat_erw_17,sat_math_17,sat_total_17,act_part_17,act_eng_17,act_math_17,act_read_17,act_sci_17,...,sat_part_18,sat_erw_18,sat_math_18,sat_total_18,act_part_18,act_eng_18,act_math_18,act_read_18,act_sci_18,act_composite_18
0,Alabama,0.05,593,572,1165,1.0,18.9,18.4,19.7,19.4,...,0.06,595,571,1166,1.0,18.9,18.3,19.6,19.0,19.1
1,Alaska,0.38,547,533,1080,0.65,18.7,19.8,20.4,19.9,...,0.43,562,544,1106,0.33,19.8,20.6,21.6,20.7,20.8
2,Arizona,0.3,563,553,1116,0.62,18.6,19.8,20.1,19.8,...,0.29,577,572,1149,0.66,18.2,19.4,19.5,19.2,19.2
3,Arkansas,0.03,614,594,1208,1.0,18.9,19.0,19.7,19.5,...,0.05,592,576,1169,1.0,19.1,18.9,19.7,19.4,19.4
4,California,0.53,531,524,1055,0.31,22.5,22.7,23.1,22.2,...,0.6,540,536,1076,0.27,22.5,22.5,23.0,22.1,22.7


## Exploratory Data Analysis


### Summary Statistics
Transpose the output of pandas `describe` method to create a quick overview of each numeric feature.

In [3]:
#Code:
final.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sat_part_17,50.0,0.3922,0.353849,0.02,0.04,0.34,0.65,1.0
sat_erw_17,50.0,569.78,45.88241,482.0,532.75,559.5,613.5,644.0
sat_math_17,50.0,557.54,47.362781,468.0,523.25,549.5,601.0,651.0
sat_total_17,50.0,1127.42,92.945911,950.0,1055.25,1107.5,1214.0,1295.0
act_part_17,50.0,0.66,0.320191,0.08,0.3125,0.71,1.0,1.0
act_eng_17,50.0,20.884,2.352884,16.3,19.0,20.55,23.1,25.5
act_math_17,50.0,21.144,1.982902,18.0,19.4,20.9,23.0,25.3
act_read_17,50.0,21.97,2.064298,18.1,20.425,21.7,23.875,26.0
act_sci_17,50.0,21.416,1.738796,18.2,19.925,21.3,22.975,24.9
act_composite_17,50.0,21.478,2.019021,17.8,19.8,21.4,23.4,25.5


#### Manually calculate standard deviation

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

- Write a function to calculate standard deviation using the formula above

In [4]:
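def stddev(x):
    """Population standard deviation (divides by n), matching the formula above."""
    mean = x.mean()
    squared_deviations = (x - mean) ** 2   # (x_i - mu)^2 for every value
    variance = squared_deviations.mean()   # average of the squared deviations
    return variance ** 0.5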

- Use a **dictionary comprehension** to apply your standard deviation function to each numeric column in the dataframe.  **No loops**  
- Assign the output to variable `sd` as a dictionary where: 
    - Each column name is now a key 
    - The standard deviation of the column is the value 
     
*Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`

In [5]:
sd = {col_title: stddev(final[col_title]) for col_title in final.columns if col_title != 'state'}
sd

{'sat_part_17': 0.3502929631037426,
 'sat_erw_17': 45.4212681461009,
 'sat_math_17': 46.88676145779318,
 'sat_total_17': 92.01175794429753,
 'act_part_17': 0.3169731849857334,
 'act_eng_17': 2.3292367848718163,
 'act_math_17': 1.9629732550394061,
 'act_read_17': 2.0435508312738393,
 'act_sci_17': 1.7213204233959463,
 'act_composite_17': 1.9987285958828933,
 'sat_part_18': 0.37063194681516604,
 'sat_erw_18': 47.4275911258415,
 'sat_math_18': 47.6762456575599,
 'sat_total_18': 93.98142156830785,
 'act_part_18': 0.33798704117169925,
 'act_eng_18': 2.441197247253896,
 'act_math_18': 2.0319291326224937,
 'act_read_18': 2.1617039575297996,
 'act_sci_18': 1.8661232542359036,
 'act_composite_18': 2.1012957906967786}
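
The comprehension above skips the `state` column by name. As a minimal sketch of an alternative, the columns can be picked by dtype with `select_dtypes`, so that any other text column would be skipped as well. The small frame `toy` below is an assumption for illustration: it reuses the first five rows of two columns from `final.head()`, and `stddev` is repeated only so the block runs on its own.

```python
import numpy as np
import pandas as pd

def stddev(x):
    """Population standard deviation (divides by n), as defined in this notebook."""
    return ((x - x.mean()) ** 2).mean() ** 0.5

# First five rows of two columns, copied from final.head() above.
toy = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "sat_part_17": [0.05, 0.38, 0.30, 0.03, 0.53],
    "sat_total_17": [1165, 1080, 1116, 1208, 1055],
})

# Select numeric columns by dtype instead of excluding 'state' by name.
numeric_cols = toy.select_dtypes(include="number").columns
sd_toy = {col: stddev(toy[col]) for col in numeric_cols}
print(sd_toy)
```

Selecting by dtype keeps the comprehension unchanged if more non-numeric columns are added later.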

In [6]:
# Cross-check the manual calculation against np.std (numeric columns only; ddof=0 by default)
np.std(final.select_dtypes(include="number"), axis=0)

sat_part_17          0.350293
sat_erw_17          45.421268
sat_math_17         46.886761
sat_total_17        92.011758
act_part_17          0.316973
act_eng_17           2.329237
act_math_17          1.962973
act_read_17          2.043551
act_sci_17           1.721320
act_composite_17     1.998729
sat_part_18          0.370632
sat_erw_18          47.427591
sat_math_18         47.676246
sat_total_18        93.981422
act_part_18          0.337987
act_eng_18           2.441197
act_math_18          2.031929
act_read_18          2.161704
act_sci_18           1.866123
act_composite_18     2.101296
dtype: float64

Do your manually calculated standard deviations match up with the output from pandas `describe`? What about numpy's 
`std` method?

##### Answer: 
The manually calculated standard deviations match NumPy's `std` output but not the `std` column from pandas `describe`. pandas uses the sample standard deviation, which divides the sum of squared deviations by N-1, whereas the manual formula divides by N. NumPy's `std` has a "Delta Degrees of Freedom" parameter (`ddof`) that defaults to zero, so it also divides by N. Dividing by N-1 is appropriate when the data are a sample, because it gives an unbiased estimate of the population variance.
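
The effect of `ddof` can be checked on a small example. The sketch below uses a made-up series `toy` (not the project data) to compare the two conventions. It then rescales the manual `sat_part_17` value reported above by sqrt(n / (n - 1)), taking n = 50 from the `count` column of `describe`.

```python
import numpy as np
import pandas as pd

# Illustrative values only; this is not the SAT/ACT data.
toy = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(toy)

population_sd = np.std(toy.to_numpy())   # NumPy default: ddof=0, divide by n
sample_sd = toy.std()                    # pandas default: ddof=1, divide by n - 1
matched_sd = toy.std(ddof=0)             # pandas told to divide by n

# The two conventions differ by a factor of sqrt(n / (n - 1)).
rescaled = population_sd * np.sqrt(n / (n - 1))

# Same rescaling applied to the manual sat_part_17 result from this notebook.
manual_sat_part_17 = 0.3502929631037426
rescaled_sat_part_17 = manual_sat_part_17 * np.sqrt(50 / 49)

print(population_sd, sample_sd, matched_sd, rescaled, rescaled_sat_part_17)
```

The rescaled `sat_part_17` value agrees with the 0.353849 shown by `describe`, so the gap is consistent with the N versus N-1 divisor.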

#### Investigate trends in the data
Using sorting and/or masking (along with the `.head` method to not print our entire dataframe), consider the following questions:

- Which states have the highest and lowest participation rates for the:
    - 2017 SAT?
    - 2018 SAT?
    - 2017 ACT?
    - 2018 ACT?
- Which states have the highest and lowest mean total/composite scores for the:
    - 2017 SAT?
    - 2018 SAT?
    - 2017 ACT?
    - 2018 ACT?
- Do any states with 100% participation on a given test have a rate change year-to-year?
- Do any states have >50% participation on *both* tests in either year?

Based on what you've just observed, have you identified any states that you're especially interested in? **Make a note of these and state *why* you think they're interesting**.

**You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.

### i. Highest SAT Participation Rate

In [7]:
# 2017
final[['state', 'sat_part_17']].sort_values(by = ['sat_part_17'], ascending = False).head()

Unnamed: 0,state,sat_part_17
21,Michigan,1.0
6,Connecticut,1.0
7,Delaware,1.0
8,District of Columbia,1.0
28,New Hampshire,0.96


In [8]:
# 2018 
final[['state', 'sat_part_18']].sort_values(by = ['sat_part_18'], ascending = False).head()

Unnamed: 0,state,sat_part_18
5,Colorado,1.0
6,Connecticut,1.0
7,Delaware,1.0
21,Michigan,1.0
12,Idaho,1.0


##### Answer: 

Michigan, Connecticut, Delaware and the District of Columbia had the highest 2017 SAT participation rate at 100%. These states have compulsory SAT testing. 
In 2018, Connecticut, Delaware and Michigan remained at 100% and were joined by Colorado and Idaho, which made the SAT compulsory. 
The District of Columbia, however, dropped off the top of the list as it made the SAT optional.

### ii. Lowest SAT Participation Rate

In [9]:
# 2017
final[['state', 'sat_part_17']].sort_values(by = ['sat_part_17'], ascending = False).tail()

Unnamed: 0,state,sat_part_17
48,Wisconsin,0.03
49,Wyoming,0.03
33,North Dakota,0.02
23,Mississippi,0.02
15,Iowa,0.02


In [10]:
# 2018
final[['state', 'sat_part_18']].sort_values(by = ['sat_part_18'], ascending = False).tail()

Unnamed: 0,state,sat_part_18
40,South Dakota,0.03
15,Iowa,0.03
48,Wisconsin,0.03
49,Wyoming,0.03
33,North Dakota,0.02


##### Answer: 

North Dakota, Mississippi and Iowa had the lowest SAT participation rate in 2017, at only 2%. North Dakota remained the lowest in 2018, again at 2%. More generally, states with low participation show similar rates from year to year.
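
The year-on-year stability can be made explicit by subtracting one column from the other. The sketch below is illustrative rather than part of the original analysis: it rebuilds a small frame from the rates printed above for four low-participation states, and the column name `sat_part_change_pp` is a hypothetical helper.

```python
import pandas as pd

# Participation rates copied from the outputs above for four low-SAT states.
low_sat = pd.DataFrame({
    "state": ["Iowa", "North Dakota", "Wisconsin", "Wyoming"],
    "sat_part_17": [0.02, 0.02, 0.03, 0.03],
    "sat_part_18": [0.03, 0.02, 0.03, 0.03],
})

# Change in percentage points; rounding hides floating-point noise.
low_sat["sat_part_change_pp"] = (
    (low_sat["sat_part_18"] - low_sat["sat_part_17"]) * 100
).round(1)

print(low_sat.sort_values("sat_part_change_pp", ascending=False))
```

On the full `final` dataframe the same subtraction would give the change for every state at once.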

### iii. Highest ACT Participation Rate

In [11]:
# 2017
final[['state', 'act_part_17']].sort_values(by = ['act_part_17'], ascending = False).head(20)

Unnamed: 0,state,act_part_17
0,Alabama,1.0
17,Kentucky,1.0
48,Wisconsin,1.0
43,Utah,1.0
41,Tennessee,1.0
39,South Carolina,1.0
35,Oklahoma,1.0
32,North Carolina,1.0
27,Nevada,1.0
24,Missouri,1.0


In [12]:
# 2018
final[['state', 'act_part_18']].sort_values(by = ['act_part_18'], ascending = False).head(20)

Unnamed: 0,state,act_part_18
0,Alabama,1.0
17,Kentucky,1.0
48,Wisconsin,1.0
43,Utah,1.0
41,Tennessee,1.0
39,South Carolina,1.0
35,Oklahoma,1.0
34,Ohio,1.0
32,North Carolina,1.0
27,Nevada,1.0


##### Answer: 

Alabama, Kentucky, Wisconsin, Utah, Tennessee, South Carolina, Oklahoma, North Carolina, Nevada, Missouri, Mississippi, Minnesota, Louisiana, Montana, Wyoming, Arkansas and Colorado had the highest 2017 ACT participation rate at 100%. The ACT is compulsory in these states.

In 2018, Ohio and Nebraska also made the ACT compulsory and reached 100%, while Minnesota (99%) and Colorado (30%) made it optional and dropped below 100%.
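
Because many states tie at 100%, `head(n)` only lists them all if `n` happens to be large enough. A boolean mask against the column maximum avoids guessing `n`. The sketch below is one possible approach, using the five `act_part_17` values shown in `final.head()`; the names `act_17` and `top_states` are illustrative.

```python
import pandas as pd

# 2017 ACT participation for the first five states, copied from final.head() above.
act_17 = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "act_part_17": [1.0, 0.65, 0.62, 1.0, 0.31],
})

# Boolean mask: keep every row tied for the maximum rate.
is_max = act_17["act_part_17"] == act_17["act_part_17"].max()
top_states = act_17.loc[is_max, "state"].tolist()
print(top_states)
```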

### iv. Lowest ACT Participation Rate

In [13]:
# 2017
final[['state', 'act_part_17']].sort_values(by = ['act_part_17'], ascending = False).tail()

Unnamed: 0,state,act_part_17
37,Pennsylvania,0.23
38,Rhode Island,0.21
28,New Hampshire,0.18
7,Delaware,0.18
19,Maine,0.08


In [14]:
# 2018
final[['state', 'act_part_18']].sort_values(by = ['act_part_18'], ascending = False).tail()

Unnamed: 0,state,act_part_18
37,Pennsylvania,0.2
7,Delaware,0.17
28,New Hampshire,0.16
38,Rhode Island,0.15
19,Maine,0.07


##### Answer: 

Maine had the lowest ACT participation rate in both 2017 and 2018, falling by one percentage point from 8% to 7%.

### v. Highest SAT Total Score

In [15]:
# 2017
final[['state', 'sat_total_17']].sort_values(by = ['sat_total_17'], ascending = False).head()

Unnamed: 0,state,sat_total_17
22,Minnesota,1295
48,Wisconsin,1291
15,Iowa,1275
24,Missouri,1271
16,Kansas,1260


In [16]:
# 2018
final[['state', 'sat_total_18']].sort_values(by = ['sat_total_18'], ascending = False).head()

Unnamed: 0,state,sat_total_18
22,Minnesota,1298
48,Wisconsin,1294
33,North Dakota,1283
15,Iowa,1265
16,Kansas,1265


##### Answer: 

Minnesota had the highest mean SAT total score in both 2017 and 2018 (1295 and 1298 respectively), with little change from year to year.
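
When only the single top or bottom state is needed, `idxmax` and `idxmin` on a state-indexed series return it directly without sorting. The sketch below assumes a small series built from the 2017 totals printed in this notebook (the top five states plus the District of Columbia); it is an illustration, not the full dataset.

```python
import pandas as pd

# 2017 SAT totals copied from the sorted outputs in this notebook.
sat_total_17 = pd.Series(
    {
        "Minnesota": 1295,
        "Wisconsin": 1291,
        "Iowa": 1275,
        "Missouri": 1271,
        "Kansas": 1260,
        "District of Columbia": 950,
    },
    name="sat_total_17",
)

highest_state = sat_total_17.idxmax()   # index label of the largest value
lowest_state = sat_total_17.idxmin()    # index label of the smallest value
print(highest_state, lowest_state)
```

With the full data, `final.set_index("state")["sat_total_17"]` would give a series of the same shape.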

### vi. Lowest SAT Total Score

In [17]:
# 2017
final[['state','sat_total_17']].sort_values(by = ['sat_total_17'], ascending = False).tail()

Unnamed: 0,state,sat_total_17
19,Maine,1012
12,Idaho,1005
21,Michigan,1005
7,Delaware,996
8,District of Columbia,950


In [18]:
# 2018
final[['state','sat_total_18']].sort_values(by = ['sat_total_18'], ascending = False).tail()

Unnamed: 0,state,sat_total_18
11,Hawaii,1010
12,Idaho,1001
47,West Virginia,999
7,Delaware,998
8,District of Columbia,977


##### Answer: 

The District of Columbia had the lowest mean SAT total score in both years, rising from 950 in 2017 to 977 in 2018.

### vii. Highest ACT Composite Score

In [19]:
# 2017
final[['state','act_composite_17']].sort_values(by = ['act_composite_17'], ascending = False).head()

Unnamed: 0,state,act_composite_17
28,New Hampshire,25.5
20,Massachusetts,25.4
6,Connecticut,25.2
19,Maine,24.3
31,New York,24.2


##### Answer: 

New Hampshire had the highest mean ACT composite score in 2017 (25.5).

In [20]:
# 2018
final[['state','act_composite_18']].sort_values(by = ['act_composite_18'], ascending = False).head()

Unnamed: 0,state,act_composite_18
6,Connecticut,25.6
20,Massachusetts,25.5
28,New Hampshire,25.1
31,New York,24.5
21,Michigan,24.4


##### Answer: 

In 2018, Connecticut surpassed New Hampshire and had the highest mean ACT composite score (25.6).

### viii. Lowest ACT Composite Score

In [21]:
# 2017
final[['state','act_composite_17']].sort_values(by = ['act_composite_17'], ascending = False).tail()

Unnamed: 0,state,act_composite_17
32,North Carolina,19.1
11,Hawaii,19.0
39,South Carolina,18.7
23,Mississippi,18.6
27,Nevada,17.8


In [22]:
# 2018
final[['state','act_composite_18']].sort_values(by = ['act_composite_18'], ascending = False).tail()

Unnamed: 0,state,act_composite_18
0,Alabama,19.1
11,Hawaii,18.9
23,Mississippi,18.6
39,South Carolina,18.3
27,Nevada,17.7


##### Answer: 

Nevada had the lowest mean ACT composite score in 2017 (17.8) and remained the lowest in 2018 (17.7).

### Do any states with 100% participation on a given test have a rate change year-to-year?

In [23]:
#SAT
final[(final["sat_part_17"]==1) != (final["sat_part_18"]==1)]

Unnamed: 0,state,sat_part_17,sat_erw_17,sat_math_17,sat_total_17,act_part_17,act_eng_17,act_math_17,act_read_17,act_sci_17,...,sat_part_18,sat_erw_18,sat_math_18,sat_total_18,act_part_18,act_eng_18,act_math_18,act_read_18,act_sci_18,act_composite_18
5,Colorado,0.11,606,595,1201,1.0,20.1,20.3,21.2,20.9,...,1.0,519,506,1025,0.3,23.9,23.2,24.4,23.5,23.9
8,District of Columbia,1.0,482,468,950,0.32,24.4,23.5,24.9,23.5,...,0.92,497,480,977,0.32,23.7,22.7,24.4,23.0,23.6
12,Idaho,0.93,513,493,1005,0.38,21.9,21.8,23.0,22.1,...,1.0,508,493,1001,0.36,21.9,21.6,23.2,22.1,22.3


In [24]:
#ACT
final[(final["act_part_17"]==1) != (final["act_part_18"]==1)]

Unnamed: 0,state,sat_part_17,sat_erw_17,sat_math_17,sat_total_17,act_part_17,act_eng_17,act_math_17,act_read_17,act_sci_17,...,sat_part_18,sat_erw_18,sat_math_18,sat_total_18,act_part_18,act_eng_18,act_math_18,act_read_18,act_sci_18,act_composite_18
5,Colorado,0.11,606,595,1201,1.0,20.1,20.3,21.2,20.9,...,1.0,519,506,1025,0.3,23.9,23.2,24.4,23.5,23.9
22,Minnesota,0.03,644,651,1295,1.0,20.4,21.5,21.8,21.6,...,0.04,643,655,1298,0.99,20.2,21.4,21.7,21.4,21.3
26,Nebraska,0.03,629,625,1253,0.84,20.9,20.9,21.9,21.5,...,0.03,629,623,1252,1.0,19.4,19.8,20.4,20.1,20.1
34,Ohio,0.12,578,570,1149,0.75,21.2,21.6,22.5,22.0,...,0.18,552,547,1099,1.0,19.3,20.3,20.8,20.4,20.3


##### Answer: 

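States whose SAT participation moved to or from 100% between 2017 and 2018:
- Colorado (11% to 100%)
- District of Columbia (100% to 92%)
- Idaho (93% to 100%)

States whose ACT participation moved to or from 100% between 2017 and 2018:
- Colorado (100% to 30%)
- Minnesota (100% to 99%)
- Nebraska (84% to 100%)
- Ohio (75% to 100%)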
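
The mask used in the two cells above compares two boolean series with `!=`, which acts as an exclusive-or: a row is kept when exactly one of the two years is at 100%. The sketch below illustrates this on five states whose SAT rates are copied from the outputs in this notebook, and also splits the result by direction. The variable names are illustrative.

```python
import pandas as pd

# SAT participation rates copied from the outputs in this notebook.
sat_part = pd.DataFrame({
    "state": ["Alabama", "Colorado", "Connecticut", "District of Columbia", "Idaho"],
    "sat_part_17": [0.05, 0.11, 1.0, 1.0, 0.93],
    "sat_part_18": [0.06, 1.0, 1.0, 0.92, 1.0],
})

full_17 = sat_part["sat_part_17"] == 1
full_18 = sat_part["sat_part_18"] == 1

# `!=` on boolean series is an exclusive-or: exactly one year at 100%.
changed_states = sat_part.loc[full_17 != full_18, "state"].tolist()

# Split by direction of the change.
reached_full = sat_part.loc[~full_17 & full_18, "state"].tolist()
dropped_from_full = sat_part.loc[full_17 & ~full_18, "state"].tolist()

print(changed_states, reached_full, dropped_from_full)
```

Connecticut is excluded because it sits at 100% in both years, and Alabama because it never reaches 100%.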

### Do any states have >50% participation on both tests in either year?

In [25]:
# 2017
final[(final["sat_part_17"] > 0.5) & (final["act_part_17"] > 0.5)]

Unnamed: 0,state,sat_part_17,sat_erw_17,sat_math_17,sat_total_17,act_part_17,act_eng_17,act_math_17,act_read_17,act_sci_17,...,sat_part_18,sat_erw_18,sat_math_18,sat_total_18,act_part_18,act_eng_18,act_math_18,act_read_18,act_sci_18,act_composite_18
9,Florida,0.83,520,497,1017,0.73,19.0,19.4,21.0,19.4,...,0.56,550,549,1099,0.66,19.2,19.3,21.1,19.5,19.9
10,Georgia,0.61,535,515,1050,0.55,21.0,20.9,22.0,21.3,...,0.7,542,522,1064,0.53,20.9,20.7,21.2,21.4,21.4
11,Hawaii,0.55,544,541,1085,0.9,17.8,19.2,19.2,19.3,...,0.56,480,530,1010,0.89,18.2,19.0,19.1,19.0,18.9


In [26]:
# 2018
final[(final["sat_part_18"] > 0.5) & (final["act_part_18"] > 0.5)]

Unnamed: 0,state,sat_part_17,sat_erw_17,sat_math_17,sat_total_17,act_part_17,act_eng_17,act_math_17,act_read_17,act_sci_17,...,sat_part_18,sat_erw_18,sat_math_18,sat_total_18,act_part_18,act_eng_18,act_math_18,act_read_18,act_sci_18,act_composite_18
9,Florida,0.83,520,497,1017,0.73,19.0,19.4,21.0,19.4,...,0.56,550,549,1099,0.66,19.2,19.3,21.1,19.5,19.9
10,Georgia,0.61,535,515,1050,0.55,21.0,20.9,22.0,21.3,...,0.7,542,522,1064,0.53,20.9,20.7,21.2,21.4,21.4
11,Hawaii,0.55,544,541,1085,0.9,17.8,19.2,19.2,19.3,...,0.56,480,530,1010,0.89,18.2,19.0,19.1,19.0,18.9
32,North Carolina,0.49,546,535,1081,1.0,17.8,19.3,19.6,19.3,...,0.52,554,543,1098,1.0,18.0,19.3,19.5,19.2,19.1
39,South Carolina,0.5,543,521,1064,1.0,17.5,18.6,19.1,18.9,...,0.55,547,523,1070,1.0,17.3,18.2,18.6,18.5,18.3


##### Answer:

- In 2017, Florida, Georgia and Hawaii had participation rates above 50% on both tests.
- In 2018, North Carolina and South Carolina joined them, giving Florida, Georgia, Hawaii, North Carolina and South Carolina.
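
The result depends on the strict inequality: South Carolina's 2017 SAT rate is exactly 50%, so `> 0.5` excludes it. The sketch below shows the difference between `>` and `>=` on the 2017 rates of the five states listed above (values copied from the outputs). The frame `part_17` is an illustrative subset, not the full data.

```python
import pandas as pd

# 2017 participation rates copied from the outputs above.
part_17 = pd.DataFrame({
    "state": ["Florida", "Georgia", "Hawaii", "North Carolina", "South Carolina"],
    "sat_part_17": [0.83, 0.61, 0.55, 0.49, 0.50],
    "act_part_17": [0.73, 0.55, 0.90, 1.00, 1.00],
})

# Strictly more than 50% on both tests (the condition used in this notebook).
strict = part_17[(part_17["sat_part_17"] > 0.5) & (part_17["act_part_17"] > 0.5)]

# At least 50% on both tests.
inclusive = part_17[(part_17["sat_part_17"] >= 0.5) & (part_17["act_part_17"] >= 0.5)]

strict_states = strict["state"].tolist()
inclusive_states = inclusive["state"].tolist()
print(strict_states, inclusive_states)
```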

Based on what you've just observed, have you identified any states that you're especially interested in? Make a note of these and state why you think they're interesting.

----> Proceed to the next notebook for [Data Visualisation](./03_Data_Visualisation.ipynb)