<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Standardized Test Analysis

--- 
# Part 1

Part 1 requires knowledge of basic Python.

---

## Evaluating the Impact of Optional ACT and SAT Tests on Student Participation and Performance

As an analyst hired by the US Department of Education, my objective is to evaluate the ACT and SAT participation rates and scores, utilizing data from 2017, 2018, and 2019 versus 2021 when a majority of colleges made these tests optional. Additionally, I will investigate whether the necessity of these tests influences student participation, and if the voluntary test-taking led to improved performance. This analysis aims to assess the ongoing value of investing significant resources and effort into these tests and explore alternative aspects of learning that could be prioritized. Based on the findings, I will provide recommendations to the ED regarding the future of the ACT and SAT requirements beyond the 2021 applications.

### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-Data)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

The SAT and ACT are standardized tests that many colleges and universities in the United States require for their admissions process. These scores are used along with other materials, such as grade point average (GPA) and essay responses, to determine whether a prospective student will be accepted to the university.

The SAT has two sections of the test: Evidence-Based Reading and Writing and Math ([*source*](https://www.princetonreview.com/college/sat-sections)). The ACT has 4 sections: English, Mathematics, Reading, and Science, with an additional optional writing section ([*source*](https://www.act.org/content/act/en/products-and-services/the-act/scores/understanding-your-scores.html)). They have different score ranges, which you can read more about on their websites or additional outside sources (a quick Google search will help you understand the scores for each test):
* [SAT](https://collegereadiness.collegeboard.org/sat)
* [ACT](https://www.act.org/content/act/en.html)

Standardized tests have long been a controversial topic for students, administrators, and legislators. Since the 1940s, a growing number of colleges have used students' scores on tests like the SAT and the ACT as a measure of college readiness and aptitude ([*source*](https://www.minotdailynews.com/news/local-news/2017/04/a-brief-history-of-the-sat-and-act/)). Supporters argue that these scores offer an objective measure for determining college admittance. Opponents claim that the tests are not accurate measures of students' potential or ability and serve as an inequitable barrier to entry. More recently, a growing number of schools have opted to drop the SAT/ACT requirement for their Fall 2021 applications ([*read more*](https://www.cnn.com/2020/04/14/us/coronavirus-colleges-sat-act-test-trnd/index.html)).

The decision to make these tests optional is a response to various factors, including the ongoing debate surrounding the validity and equity of standardized testing, as well as the impact of external circumstances such as the COVID-19 pandemic ([*source*](https://www.cnn.com/2020/04/14/us/coronavirus-colleges-sat-act-test-trnd/index.html)). This shift in policy opens up opportunities for evaluating the impact of voluntary test participation on student engagement and performance, which forms the basis of this analysis. By examining participation rates and scores from 2017, 2018, and 2019 in comparison to the data from 2021, when many colleges adopted an optional testing policy, this project aims to shed light on the effects of this policy change.

### Data

* [`act_2017.csv`](./data/act_2017.csv): 2017 ACT Scores by State
* [`act_2018.csv`](./data/act_2018.csv): 2018 ACT Scores by State
* [`act_2019.csv`](./data/act_2019.csv): 2019 ACT Scores by State
* [`act_2021.csv`](./data/act_2021.csv): 2021 ACT Scores by State
* [`sat_2017.csv`](./data/sat_2017.csv): 2017 SAT Scores by State
* [`sat_2018.csv`](./data/sat_2018.csv): 2018 SAT Scores by State
* [`sat_2019.csv`](./data/sat_2019.csv): 2019 SAT Scores by State
* [`sat_2021.csv`](./data/sat_2021.csv): 2021 SAT Scores by State
* [`sat_act_by_college.csv`](./data/sat_act_by_college.csv): Ranges of Accepted ACT & SAT Student Scores by Colleges
* [`act_17_18_19_21.csv`](./data/act_17_18_19_21.csv): 2017, 2018, 2019, and 2021 ACT Scores by State Combined
* [`sat_17_18_19_21.csv`](./data/sat_17_18_19_21.csv): 2017, 2018, 2019, and 2021 SAT Scores by State Combined

### Outside Research

https://www.act.org/content/dam/act/unsecured/documents/2021/2021-Average-ACT-Scores-by-State.pdf

Data used to create `act_2021.csv`.

https://reports.collegeboard.org/sat-suite-program-results/data-archive

Data used to create `sat_2021.csv`.

https://nces.ed.gov/fastfacts/display.asp?id=1122

According to the National Center for Education Statistics (U.S. Department of Education), there were 3,931 colleges in the United States in 2021.

### Coding Challenges

1. Manually calculate mean:

    Write a function that takes in values and returns the mean of the values. Create a list of numbers that you test on your function to check to make sure your function works!
    
    *Note*: Do not use any mean methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

In [1]:
# Code:

def mean_machine(numbers):
    return sum(numbers) / len(numbers)

In [2]:
numbers = [1, 2, 3, 4, 5]

In [3]:
mean_machine(numbers)

3.0

2. Manually calculate standard deviation:

    The formula for standard deviation is below:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

    Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.

    Write a function that takes in values and returns the standard deviation of the values using the formula above. Hint: use the function you wrote above to calculate the mean! Use the list of numbers you created above to test on your function.
    
    *Note*: Do not use any standard deviation methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

In [4]:
# Code:

# calculate the mean of the dataset (already did this with mean_machine(numbers))
# subtract the mean from each value in the list and square it for squared differences
# add all the squared difference values together
# divide the sum total by the number of values for variance
# take the square root of everything for standard deviation

def standard_deviation(numbers):
    mean = mean_machine(numbers)
    sdev = (sum((x - mean)**2 for x in numbers) / len(numbers))**0.5
    return sdev

# Caution: the generator expression needs `for x in numbers`; without it, x is never defined
# and Python raises NameError: name 'x' is not defined

In [5]:
standard_deviation(numbers)

1.4142135623730951
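The formula above divides by $n$, so it is the population standard deviation. As a sanity check, here is a minimal sketch that works the example by hand and compares the result with `statistics.pstdev`. It assumes the standard library may be used for verification only, not inside the function itself. The two functions are repeated so the snippet runs on its own.

```python
import math
import statistics

def mean_machine(numbers):
    return sum(numbers) / len(numbers)

def standard_deviation(numbers):
    mean = mean_machine(numbers)
    return (sum((x - mean) ** 2 for x in numbers) / len(numbers)) ** 0.5

numbers = [1, 2, 3, 4, 5]

# Worked by hand: mean = 3, squared differences = 4, 1, 0, 1, 4 (sum 10),
# variance = 10 / 5 = 2, standard deviation = sqrt(2).
manual = standard_deviation(numbers)
reference = statistics.pstdev(numbers)  # population standard deviation

print(math.isclose(manual, reference))
```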

3. Data cleaning function:
    
    Write a function that takes in a string that is a number and a percent symbol (ex. '50%', '30.5%', etc.) and converts this to a float that is the decimal approximation of the percent. For example, inputting '50%' in your function should return 0.5, '30.5%' should return 0.305, etc. Make sure to test your function to make sure it works!

You will use these functions later on in the project!

In [6]:
# Code:

# replace the % symbol with empty string so it's just a flat number
# convert the string to a float (decimal number)
# you have to divide that by 100 or you get 50.0 not 0.5

# import pandas as pd

# def clean_data(data):
#    data = data.replace('%', '')
#    string_objects = pd.Series(data)
#    float_data = string_objects.astype(float) / 100
#    return float_data

def clean_data(data):
    data = data.replace('%', '')
    float_data = float(data) / 100
    return float_data

In [7]:
clean_data('50%')

0.5

In [8]:
clean_data('30.5%')

0.305
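Real files sometimes carry stray whitespace around the percent string. The sketch below is one possible, slightly more defensive variant. The name `clean_percent` is hypothetical and is not used elsewhere in this notebook.

```python
def clean_percent(value):
    """Convert a percent string such as '30.5%' to a decimal fraction."""
    # Hypothetical helper: strip surrounding whitespace, then the trailing percent sign.
    text = str(value).strip().rstrip('%')
    return float(text) / 100

samples = ['50%', ' 30.5% ', '100%']
converted = [clean_percent(s) for s in samples]
print(converted)
```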

---

# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

*All libraries used should be added here*

In [9]:
# Imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Import and Cleaning

Import the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary:
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values.
3. Check for any obvious issues with the observations (keep in mind the minimum & maximum possible values for each test/subtest).
4. Fix any errors you identified in steps 2-3.
5. Display the data types of each feature.
6. Fix any incorrect data types found in step 5.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If your dataset has a column of percents (ex. '50%', '30.5%', etc.), use the function you wrote in Part 1 (coding challenges, number 3) to convert this to floats! *Hint*: use `.map()` or `.apply()`.
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
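The checklist above can be sketched on a small toy table before touching the real files. Everything in this example is an assumption made for illustration: the values are made up and the column names only imitate the state-level files.

```python
import pandas as pd

# Toy data standing in for one state-level file (values are made up).
toy = pd.DataFrame({
    'State': ['A', 'B', 'B', 'C'],
    'Participation Rate': ['100%', '65%', '65%', None],
    'Total': [19.2, 19.8, 19.8, 20.5],
})

print(toy.head())              # step 1: display the data
missing = toy.isnull().sum()   # step 2: missing values per column
print(toy.describe())          # step 3: min/max sanity check
print(toy.dtypes)              # step 5: data types

# Step 7: lowercase names, underscores instead of spaces.
toy.columns = [c.lower().replace(' ', '_') for c in toy.columns]

# Steps 4 and 8: drop exact duplicate rows and rows with no participation rate.
clean = (
    toy.drop_duplicates()
       .dropna(subset=['participation_rate'])
       .reset_index(drop=True)
)

# Step 6: convert percent strings to decimal fractions.
clean['participation_rate'] = clean['participation_rate'].map(
    lambda s: float(s.replace('%', '')) / 100
)
print(clean)
```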

In [10]:
import os
os.getcwd()

# '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data' for data location

'/Users/argishtiovsepyan/DSI-508/Projects/project-1/code'

In [11]:
act_2017 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/act_2017.csv'
act_2018 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/act_2018.csv'
act_2019 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/act_2019.csv'
act_2021 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/act_2021.csv'
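Absolute paths only work on one machine. A minimal sketch of a portable alternative follows. It assumes the notebook runs from the project's `code` folder with the CSV files in a sibling `data` folder, as the `os.getcwd()` output suggests. The names `data_dir` and `act_paths` are hypothetical.

```python
from pathlib import Path

# Assumption: the working directory is the `code` folder, next to `data`.
data_dir = Path('..') / 'data'

# One path per year, built from the shared file name pattern.
act_paths = {year: data_dir / f'act_{year}.csv' for year in (2017, 2018, 2019, 2021)}

for year, path in act_paths.items():
    print(year, path)
```

Each path can then be passed to `pd.read_csv` in place of the hardcoded strings.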

In [12]:
df_act_2017 = pd.read_csv(act_2017)
df_act_2018 = pd.read_csv(act_2018)
df_act_2019 = pd.read_csv(act_2019)
df_act_2021 = pd.read_csv(act_2021)

In [13]:
df_act_2017.head(1), df_act_2018.head(1), df_act_2019.head(1), df_act_2021.head(1)

(      State Participation  English  Math  Reading  Science Composite
 0  National           60%     20.3  20.7     21.4     21.0      21.0,
      State Participation  Composite
 0  Alabama          100%       19.1,
      State Participation  Composite
 0  Alabama          100%       18.9,
      state 2021_participation  2021_score
 0  Alabama               100%        18.7)

In [14]:
df_act_2017.shape, df_act_2018.shape, df_act_2019.shape, df_act_2021.shape

((52, 7), (52, 3), (52, 3), (51, 3))

In [15]:
# drop(..., inplace = True) returns None, so the result is not assigned to a variable
df_act_2017.drop(['English', 'Math', 'Reading', 'Science'], axis = 1, inplace = True)

In [16]:
df_act_2017.drop([0],inplace = True)
df_act_2019.drop([51],inplace = True)

In [17]:
df_act_2017.reset_index(drop = True, inplace = True)
df_act_2018.reset_index(drop = True, inplace = True)
df_act_2019.reset_index(drop = True, inplace = True)
df_act_2021.reset_index(drop = True, inplace = True)

In [18]:
# duplicated() flags each repeated row as True; sum() counts them (True = 1, False = 0)

df_act_2017.duplicated().sum(), df_act_2018.duplicated().sum(), df_act_2019.duplicated().sum()

(0, 1, 0)

In [19]:
df_act_2018 = df_act_2018.drop_duplicates().reset_index(drop = True)

In [20]:
df_act_2017.duplicated().sum(), df_act_2018.duplicated().sum(), df_act_2019.duplicated().sum()

(0, 0, 0)
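Before dropping duplicates it can help to look at the repeated rows themselves. A small sketch on made-up data (the frame `toy` is an assumption, not one of the project files):

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['A', 'A', 'B'],
    'Participation': ['7%', '7%', '100%'],
    'Composite': [24.0, 24.0, 20.3],
})

# keep=False marks every copy of a duplicated row, so the whole pair can be inspected.
repeats = toy[toy.duplicated(keep=False)]
print(repeats)

# Keep the first copy of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```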

In [21]:
df_act_2017.head(1), df_act_2018.head(1), df_act_2019.head(1), df_act_2021.head(1)

(     State Participation Composite
 0  Alabama          100%      19.2,
      State Participation  Composite
 0  Alabama          100%       19.1,
      State Participation  Composite
 0  Alabama          100%       18.9,
      state 2021_participation  2021_score
 0  Alabama               100%        18.7)

In [22]:
df_act_2017.tail(1), df_act_2018.tail(1), df_act_2019.tail(1), df_act_2021.tail(1)

(      State Participation Composite
 50  Wyoming          100%     20.2x,
       State Participation  Composite
 50  Wyoming          100%       20.0,
       State Participation  Composite
 50  Wyoming          100%       19.8,
       state 2021_participation  2021_score
 50  Wyoming                91%        19.8)

In [23]:
df_act_2017.dtypes, df_act_2018.dtypes, df_act_2019.dtypes, df_act_2021.dtypes

(State            object
 Participation    object
 Composite        object
 dtype: object,
 State             object
 Participation     object
 Composite        float64
 dtype: object,
 State             object
 Participation     object
 Composite        float64
 dtype: object,
 state                  object
 2021_participation     object
 2021_score            float64
 dtype: object)

In [24]:
# df_act_2017['Composite'] = df_act_2017['Composite'].astype(float)
# ValueError: could not convert string to float: '20.2x'
# Coercing it made Wyoming 20.2x = NaN...

df_act_2017['Composite'] = pd.to_numeric(df_act_2017['Composite'], errors = 'coerce')
df_act_2017.dtypes

State             object
Participation     object
Composite        float64
dtype: object
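Coercion silently turns unparseable entries into `NaN`, so it is worth listing the offending strings first. The sketch below uses a toy Series. Stripping non-numeric characters is just one possible repair and assumes the stray character is a typo.

```python
import pandas as pd

scores = pd.Series(['19.2', '20.5', '20.2x'])

# Coerce to numbers; anything unparseable becomes NaN.
numeric = pd.to_numeric(scores, errors='coerce')

# Show the original strings that failed to parse.
bad = scores[numeric.isna()]
print(bad)

# One possible repair: remove every character that is not a digit or a dot.
repaired = pd.to_numeric(scores.str.replace(r'[^0-9.]', '', regex=True))
print(repaired)
```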

In [25]:
df_act_2017.loc[50, 'Composite'] = 20.2
df_act_2017.tail(3)

| | State | Participation | Composite |
|---|---|---|---|
| 48 | West Virginia | 69% | 20.4 |
| 49 | Wisconsin | 100% | 20.5 |
| 50 | Wyoming | 100% | 20.2 |


In [26]:
df_act = pd.concat([df_act_2017, df_act_2018, df_act_2019, df_act_2021], axis = 1)
df_act.head(1)

| | State | Participation | Composite | State.1 | Participation.1 | Composite.1 | State.2 | Participation.2 | Composite.2 | state | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 100% | 19.2 | Alabama | 100% | 19.1 | Alabama | 100% | 18.9 | Alabama | 100% | 18.7 |


In [27]:
# df_act.rename(columns = {'State': 'state', 'Participation': 'participation', 'Composite': 'composite'}) doesn't work with same name columns

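# Assign the full list of names; editing df_act.columns.values in place bypasses pandas and may leave stale lookups
df_act.columns = ['state', '2017_participation', '2017_score', 'extra 1', '2018_participation', '2018_score', 'extra 2', '2019_participation', '2019_score', 'extra 3', '2021_participation', '2021_score']
df_act.drop(columns = ['extra 1', 'extra 2', 'extra 3'], inplace = True)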

df_act.head(1)

| | state | 2017_participation | 2017_score | 2018_participation | 2018_score | 2019_participation | 2019_score | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 100% | 19.2 | 100% | 19.1 | 100% | 18.9 | 100% | 18.7 |


In [28]:
df_act.dtypes

state                  object
2017_participation     object
2017_score            float64
2018_participation     object
2018_score            float64
2019_participation     object
2019_score            float64
2021_participation     object
2021_score            float64
dtype: object
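Concatenating along `axis = 1` lines rows up by index position, so it only works when every yearly file lists the states in the same order. A hedged alternative is to merge on the state name. In the sketch below the toy frames, their values, and the helper `tag_year` are all assumptions made for illustration.

```python
import pandas as pd

# Toy frames with made-up values; note the different row order.
act_a = pd.DataFrame({'State': ['A', 'B'],
                      'Participation': ['100%', '65%'],
                      'Composite': [19.2, 19.8]})
act_b = pd.DataFrame({'State': ['B', 'A'],
                      'Participation': ['33%', '100%'],
                      'Composite': [20.8, 19.1]})

def tag_year(df, year):
    """Rename columns to state, <year>_participation, <year>_score (hypothetical helper)."""
    out = df.copy()
    out.columns = ['state', f'{year}_participation', f'{year}_score']
    return out

# Rows are matched by state name, not by position.
merged = tag_year(act_a, 2017).merge(tag_year(act_b, 2018), on='state', how='outer')
print(merged)
```

An outer merge also makes mismatches visible: a state missing from one year shows up with `NaN` instead of being silently paired with the wrong row.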

In [29]:
df_act['2017_participation'] = df_act['2017_participation'].apply(clean_data)
df_act['2018_participation'] = df_act['2018_participation'].apply(clean_data)
df_act['2019_participation'] = df_act['2019_participation'].apply(clean_data)
df_act['2021_participation'] = df_act['2021_participation'].apply(clean_data)
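The four near-identical lines can also be written as a loop over every column whose name ends in `_participation`. This is a sketch on a toy frame with made-up values. `clean_data` is repeated so the snippet runs on its own.

```python
import pandas as pd

def clean_data(data):
    return float(data.replace('%', '')) / 100

toy = pd.DataFrame({
    'state': ['A', 'B'],
    '2017_participation': ['100%', '65%'],
    '2018_participation': ['100%', '33%'],
})

# Select the participation columns by name, then convert each one.
rate_cols = [c for c in toy.columns if c.endswith('_participation')]
for col in rate_cols:
    toy[col] = toy[col].apply(clean_data)

print(toy.dtypes)
```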

In [30]:
df_act.dtypes

state                  object
2017_participation    float64
2017_score            float64
2018_participation    float64
2018_score            float64
2019_participation    float64
2019_score            float64
2021_participation    float64
2021_score            float64
dtype: object

In [31]:
df_act.to_csv('/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/act_17_18_19_21.csv', index = False)

In [32]:
df_act.head()

| | state | 2017_participation | 2017_score | 2018_participation | 2018_score | 2019_participation | 2019_score | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1.0 | 19.2 | 1.0 | 19.1 | 1.0 | 18.9 | 1.0 | 18.7 |
| 1 | Alaska | 0.65 | 19.8 | 0.33 | 20.8 | 0.38 | 20.1 | 0.16 | 20.6 |
| 2 | Arizona | 0.62 | 19.7 | 0.66 | 19.2 | 0.73 | 19.0 | 0.35 | 19.8 |
| 3 | Arkansas | 1.0 | 19.4 | 1.0 | 19.4 | 1.0 | 19.3 | 0.99 | 19.0 |
| 4 | California | 0.31 | 22.8 | 0.27 | 22.7 | 0.23 | 22.6 | 0.05 | 26.1 |


In [33]:
sat_2017 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_2017.csv'
sat_2018 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_2018.csv'
sat_2019 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_2019.csv'
sat_2021 = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_2021.csv'

In [34]:
df_sat_2017 = pd.read_csv(sat_2017)
df_sat_2018 = pd.read_csv(sat_2018)
df_sat_2019 = pd.read_csv(sat_2019)
df_sat_2021 = pd.read_csv(sat_2021)

In [35]:
df_sat_2017.head(1), df_sat_2018.head(1), df_sat_2019.head(1), df_sat_2021.head(1)

(     State Participation  Evidence-Based Reading and Writing  Math  Total
 0  Alabama            5%                                 593   572   1165,
      State Participation  Evidence-Based Reading and Writing  Math  Total
 0  Alabama            6%                                 595   571   1166,
      State Participation Rate  EBRW  Math  Total
 0  Alabama                 7%   583   560   1143,
      state 2021_participation  2021_score
 0  Alabama                 3%        1159)

In [36]:
# drop(..., inplace = True) returns None, so the results are not assigned to variables
df_sat_2017.drop(['Evidence-Based Reading and Writing', 'Math'], axis = 1, inplace = True)
df_sat_2018.drop(['Evidence-Based Reading and Writing', 'Math'], axis = 1, inplace = True)
df_sat_2019.drop(['EBRW', 'Math'], axis = 1, inplace = True)

In [37]:
df_sat_2017.head(1), df_sat_2018.head(1), df_sat_2019.head(1), df_sat_2021.head(1)

(     State Participation  Total
 0  Alabama            5%   1165,
      State Participation  Total
 0  Alabama            6%   1166,
      State Participation Rate  Total
 0  Alabama                 7%   1143,
      state 2021_participation  2021_score
 0  Alabama                 3%        1159)

In [38]:
df_sat_2017.shape, df_sat_2018.shape, df_sat_2019.shape, df_sat_2021.shape

((51, 3), (51, 3), (53, 3), (51, 3))

In [39]:
df_act_2017.dtypes, df_act_2018.dtypes, df_act_2019.dtypes, df_act_2021.dtypes

(State             object
 Participation     object
 Composite        float64
 dtype: object,
 State             object
 Participation     object
 Composite        float64
 dtype: object,
 State             object
 Participation     object
 Composite        float64
 dtype: object,
 state                  object
 2021_participation     object
 2021_score            float64
 dtype: object)

In [40]:
df_sat_2017.duplicated().sum(), df_sat_2018.duplicated().sum(), df_sat_2019.duplicated().sum()

(0, 0, 0)

In [41]:
df_sat_2019.drop([39, 47],inplace = True)

In [42]:
df_sat_2019.reset_index(drop = True, inplace = True)

In [43]:
df_sat = pd.concat([df_sat_2017, df_sat_2018, df_sat_2019, df_sat_2021], axis = 1)
df_sat.tail(1)

| | State | Participation | Total | State.1 | Participation.1 | Total.1 | State.2 | Participation Rate | Total.2 | state | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | Wyoming | 3% | 1230 | Wyoming | 3% | 1257 | Wyoming | 3% | 1238 | Wyoming | 2% | 1233 |


In [44]:
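# Assign the full list of names; editing df_sat.columns.values in place bypasses pandas and may leave stale lookups
df_sat.columns = ['state', '2017_participation', '2017_score', 'extra 1', '2018_participation', '2018_score', 'extra 2', '2019_participation', '2019_score', 'extra 3', '2021_participation', '2021_score']
df_sat.drop(columns = ['extra 1', 'extra 2', 'extra 3'], inplace = True)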

df_sat.head(1)

| | state | 2017_participation | 2017_score | 2018_participation | 2018_score | 2019_participation | 2019_score | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 5% | 1165 | 6% | 1166 | 7% | 1143 | 3% | 1159 |


In [45]:
df_sat.dtypes

state                 object
2017_participation    object
2017_score             int64
2018_participation    object
2018_score             int64
2019_participation    object
2019_score             int64
2021_participation    object
2021_score             int64
dtype: object

In [46]:
df_sat['2017_participation'] = df_sat['2017_participation'].apply(clean_data)
df_sat['2018_participation'] = df_sat['2018_participation'].apply(clean_data)
df_sat['2019_participation'] = df_sat['2019_participation'].apply(clean_data)
df_sat['2021_participation'] = df_sat['2021_participation'].apply(clean_data)

In [47]:
df_sat.dtypes

state                  object
2017_participation    float64
2017_score              int64
2018_participation    float64
2018_score              int64
2019_participation    float64
2019_score              int64
2021_participation    float64
2021_score              int64
dtype: object

In [48]:
df_sat.to_csv('/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_17_18_19_21.csv', index = False)

In [49]:
df_sat.head()

| | state | 2017_participation | 2017_score | 2018_participation | 2018_score | 2019_participation | 2019_score | 2021_participation | 2021_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 0.05 | 1165 | 0.06 | 1166 | 0.07 | 1143 | 0.03 | 1159 |
| 1 | Alaska | 0.38 | 1080 | 0.43 | 1106 | 0.41 | 1097 | 0.23 | 1119 |
| 2 | Arizona | 0.3 | 1116 | 0.29 | 1149 | 0.31 | 1134 | 0.11 | 1181 |
| 3 | Arkansas | 0.03 | 1208 | 0.05 | 1169 | 0.06 | 1141 | 0.02 | 1194 |
| 4 | California | 0.53 | 1055 | 0.6 | 1076 | 0.63 | 1065 | 0.24 | 1057 |


In [50]:
sat_act_2021_optional = '/Users/argishtiovsepyan/DSI-508/Projects/project-1/data/sat_act_by_college.csv'

In [51]:
df_sat_act_2021_optional = pd.read_csv(sat_act_2021_optional)

In [52]:
df_sat_act_2021_optional.shape

(416, 8)

In [53]:
df_sat_act_2021_optional.dtypes

School                            object
Test Optional?                    object
Applies to Class Year(s)          object
Policy Details                    object
Number of Applicants               int64
Accept Rate                       object
SAT Total 25th-75th Percentile    object
ACT Total 25th-75th Percentile    object
dtype: object

In [54]:
df_sat_act_2021_optional.duplicated().sum()

0

In [55]:
df_sat_act_2021_optional.head(1)

| | School | Test Optional? | Applies to Class Year(s) | Policy Details | Number of Applicants | Accept Rate | SAT Total 25th-75th Percentile | ACT Total 25th-75th Percentile |
|---|---|---|---|---|---|---|---|---|
| 0 | Stanford University | Yes | 2021 | Stanford has adopted a one-year test optional ... | 47452 | 4.3% | 1440-1570 | 32-35 |


In [56]:
# drop(..., inplace = True) returns None, so the result is not assigned to a variable
df_sat_act_2021_optional.drop(['Applies to Class Year(s)', 'Policy Details', 'Number of Applicants', 'Accept Rate', 'SAT Total 25th-75th Percentile', 'ACT Total 25th-75th Percentile'], axis = 1, inplace = True)

In [57]:
df_sat_act_2021_optional.columns = ['school', 'test_optional']

In [58]:
df_sat_act_2021_optional.head(1)

| | school | test_optional |
|---|---|---|
| 0 | Stanford University | Yes |


In [59]:
df_sat_act_2021_optional

| | school | test_optional |
|---|---|---|
| 0 | Stanford University | Yes |
| 1 | Harvard College | Yes |
| 2 | Princeton University | Yes |
| 3 | Columbia University | Yes |
| 4 | Yale University | Yes |
| ... | ... | ... |
| 411 | University of Texas Rio Grande Valley | No |
| 412 | University of South Dakota | No |
| 413 | University of Mississippi | No |
| 414 | University of Wyoming | No |


In [60]:
df_sat_act_2021_optional['test_optional'].unique()

array(['Yes', 'Yes (TB)', 'Yes*', 'Yes (TF)', 'No'], dtype=object)

In [61]:
df_sat_act_2021_optional['test_optional'].value_counts()['No']

26

In [62]:
tests_optional = 100 - ((26/416) * 100)
tests_optional

93.75

In [63]:
# According to National Center for Education Statistics (U.S. Department of Education), there are 3,931 colleges in the United States in 2021.

sample_size = ((416/3931) * 100)
sample_size

10.582548969727805
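The share of test-optional schools can also be computed from the column itself rather than from hardcoded counts. A minimal sketch on a toy Series with made-up rows follows. It assumes that every label beginning with `Yes` counts as test optional, which matches the five labels listed by `.unique()` above.

```python
import pandas as pd

# Toy policy column (made-up rows) using the same labels as the real data.
policy = pd.Series(['Yes', 'Yes (TB)', 'Yes*', 'Yes (TF)', 'No', 'Yes', 'Yes', 'No'])

# Treat every label that starts with 'Yes' as test optional.
is_optional = policy.str.startswith('Yes')

# The mean of a boolean Series is the share of True values.
pct_optional = is_optional.mean() * 100
print(pct_optional)
```

Applied to `df_sat_act_2021_optional['test_optional']`, the same expression should reproduce the 93.75 computed above, since 390 of the 416 rows are not labeled `No`.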

### Data Dictionary

| Feature | Type | Dataset | Description |
|---------|------|---------|-------------|
| year | integer | ACT/SAT Data | The year the data was collected (2017, 2018, 2019, 2021). |
| act_participation | float | ACT Data | The percentage of students participating in the ACT test in the given year. |
| sat_participation | float | SAT Data | The percentage of students participating in the SAT test in the given year. |
| act_scores | float | ACT Data | The average ACT test score in the given year. |
| sat_scores | integer | SAT Data | The average SAT test score in the given year. |
| test_requirement | boolean | College Policy Data | Indicates if the ACT/SAT test was required (True) or optional (False) in the given year. |

## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you created in Part 1 to each numeric column in the dataframe.  **No loops**.
    - Assign the output to variable `sd` as a dictionary where: 
        - Each column name is now a key 
        - The standard deviation of the column is the value 
        - *Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement. Some examples are provided below (but feel free to change these questions for your specific problem):
        - Which states have the highest and lowest participation rates for the 2017, 2018, or 2019 SAT and ACT?
        - Which states have the highest and lowest mean total/composite scores for the 2017, 2018, or 2019 SAT and ACT?
        - Do any states with 100% participation on a given test have a rate change year-to-year?
        - Do any states have >50% participation on *both* tests each year?
        - Which colleges have the highest median SAT and ACT scores for admittance?
        - Which California school districts have the highest and lowest mean test scores?
    - **You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
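Step 2 above can be sketched as follows. The toy frame and its values are assumptions made for illustration, and the Part 1 functions are repeated so the snippet runs on its own.

```python
import pandas as pd

def mean_machine(numbers):
    return sum(numbers) / len(numbers)

def standard_deviation(numbers):
    mean = mean_machine(numbers)
    return (sum((x - mean) ** 2 for x in numbers) / len(numbers)) ** 0.5

toy = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    '2017_participation': [1.0, 0.5, 0.0],
    '2017_score': [18.0, 20.0, 22.0],
})

# One key per numeric column; the text column 'state' is skipped.
numeric_cols = toy.select_dtypes(include='number').columns
sd = {col: standard_deviation(toy[col]) for col in numeric_cols}
print(sd)
```

Note that pandas' own `.std()` uses the sample formula (`ddof=1`) by default, so it will only agree with this dictionary when called with `ddof=0`.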

In [64]:
#Code:

**To-Do:** *Edit this cell with your findings on trends in the data (step 3 above).*
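A minimal sketch of the sorting and masking steps, using only the first five rows of the combined ACT table printed earlier in this notebook. The 20-point threshold is an arbitrary assumption, and five rows are not enough to support any finding.

```python
import pandas as pd

# First five rows of the combined ACT table shown earlier (selected columns).
act = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    '2019_participation': [1.0, 0.38, 0.73, 1.0, 0.23],
    '2021_participation': [1.0, 0.16, 0.35, 0.99, 0.05],
    '2021_score': [18.7, 20.6, 19.8, 19.0, 26.1],
})

# Sorting: lowest 2021 participation first.
lowest = act.sort_values('2021_participation').head(3)
print(lowest)

# Masking: states whose participation fell by more than 0.20 from 2019 to 2021.
change = act['2019_participation'] - act['2021_participation']
big_drop = act[change > 0.20]
print(big_drop)
```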

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Have legible text
- Demonstrate meaningful and valid relationships
- Have an interpretation to aid understanding

Here is an example of what your plots should look like following the above guidelines. Note that while the content of this example is unrelated, the principles of visualization hold:

![](https://snag.gy/hCBR1U.jpg)
*Interpretation: The above image shows that as we increase our spending on advertising, our sales numbers also tend to increase. There is a positive correlation between advertising spending and sales.*

---

Here are some prompts to get you started with visualizations. Feel free to add additional visualizations as you see fit:
1. Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features.
    - Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).
    - Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
2. Visualize distributions using histograms. If you have a lot, consider writing a custom function and use subplots.
    - *OPTIONAL*: Summarize the underlying distributions of your features (in words & statistics)
         - Be thorough in your verbal description of these distributions.
         - Be sure to back up these summaries with statistics.
         - We generally assume that data we sample from a population will be normally distributed. Do we observe this trend? Explain your answers for each distribution and how you think this will affect estimates made from these data.
3. Plot and interpret boxplots. 
    - Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.
    - Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).
    - Each boxplot should:
        - Only include variables of a similar scale
        - Have clear labels for each variable
        - Have appropriate titles and labels
4. Plot and interpret scatter plots to view relationships between features. Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.
    - Your plots should have:
        - Two clearly labeled axes
        - A proper title
        - Colors and symbols that are clear and unmistakable
5. Additional plots of your choosing.
    - Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go**.
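For the histogram prompt, a custom function with subplots might look like the sketch below. The helper name `plot_histograms`, its parameters, and the y-axis label are assumptions. The data are the first five rows of the combined ACT table printed earlier, which is far too few for a meaningful distribution.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_histograms(df, columns, titles, xlabels, bins=10):
    """Draw one histogram per column on a single row of subplots (hypothetical helper)."""
    fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4), squeeze=False)
    for ax, col, title, xlabel in zip(axes[0], columns, titles, xlabels):
        ax.hist(df[col], bins=bins)
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel('Number of states')
    fig.tight_layout()
    return fig, axes[0]

act = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    '2019_participation': [1.0, 0.38, 0.73, 1.0, 0.23],
    '2021_participation': [1.0, 0.16, 0.35, 0.99, 0.05],
})

fig, axes = plot_histograms(
    act,
    columns=['2019_participation', '2021_participation'],
    titles=['ACT participation, 2019', 'ACT participation, 2021'],
    xlabels=['Participation rate', 'Participation rate'],
    bins=5,
)
```

Passing titles and labels as arguments keeps every subplot in line with the plot checklist above.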

In [65]:
# Code

## Conclusions and Recommendations

Based on your exploration of the data, what are your key takeaways and recommendations? Make sure to answer your question of interest or address your problem statement here.

**To-Do:** *Edit this cell with your conclusions and recommendations.*

Don't forget to create your README!

**To-Do:** *If you combine your problem statement, data dictionary, brief summary of your analysis, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project.* Don't forget to cite your data sources!