# Homework 5: Interactive visualizations and mapping

Welcome to the fifth homework! 

In this homework you will practice creating interactive visualizations using plotly and creating maps using geopandas.

Please complete this notebook by filling in the cells provided. 

For all problems that you must write explanations and sentences for, please provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. 

**Deadline:**

This assignment is due **Sunday February 25th at 11pm.** You can turn in the assignment up to 24 hours late for 90% credit (after that, the homework will only be accepted with a dean's excuse). 

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you're stuck. The drop-in office hours schedule can be found on Canvas.  You can also post questions or start discussions on Ed Discussion.

## Getting started

In order to complete the homework, it is necessary to download a few files. Please run the code below **only once** to download data needed to complete the homework. To run the code, click in the cell below and press the play button (or press shift-enter). 

In [1]:
# If you are running this notebook in Google Colab, please uncomment and run the following line
# !pip install https://github.com/emeyers/YData_package/tarball/master

In [2]:
# Please run this code once to download the files you will need to complete the homework 

import YData 

# Download college scorecard data and mapping files
YData.download.download_data("college_scorecard_subset_2021_2022.csv")
YData.download.download_data("CCBASIC_categories.csv")
YData.download.download_data("connecticut.geojson")
YData.download.download_data("States_shapefile.geojson")

The file `college_scorecard_subset_2021_2022.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `CCBASIC_categories.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `connecticut.geojson` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `States_shapefile.geojson` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


## 0. Quote and reaction

For the past few weeks of class we have been discussing different ways of visualizing data. There are many different theories of what makes a good data visualization, with most of these theories being based on the author's intuition. For example, [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte) emphasized that the data-ink ratio should be as high as possible. But is there any way to more objectively say whether a visualization is "good" along particular dimensions? 

For this week's quote and reaction, you will read [a research paper by Borkin et al.](https://vcg.seas.harvard.edu/publications/what-makes-a-visualization-memorable/paper) where the authors run a psychophysics experiment to assess what features of a data visualization make it memorable. While one could argue that memorability might not be the most important property of a visualization, the approach of using psychophysics to more objectively evaluate results is a concept that could be broadly useful in Data Science. 

As always, please read this paper, and in the space below, write down the quote as well as a one paragraph description for why you thought the quote was interesting. 

**Question 0.1 (5 points)**  Please write down your "quote and reaction" here.

*Quote:*  ...

Reaction: ... 

In [3]:
# This cell imports functions from packages we will use below.
# Please run it each time you load the Jupyter notebook

import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

%matplotlib inline

## The College Scorecard data 

To practice creating interactive visualizations and mapping, we will use the College Scorecard dataset. The dataset is created by the [United States](https://www.npr.org/sections/ed/2015/09/12/439742485/president-obamas-new-college-scorecard-is-a-torrent-of-data) government to allow consumers to compare the cost and value of higher education institutions in the United States. 

The full dataset can be found at https://collegescorecard.ed.gov/data/ and the full codebook can be downloaded [here](https://collegescorecard.ed.gov/data/documentation/). In order to make the data more manageable, I created a smaller dataset that contains only data from institutions for which the predominant degree offered is a bachelor's degree, and that has only a subset of the variables in the full dataset. A codebook for the variables in this smaller dataset is below, and the next cell loads the data into a pandas DataFrame.

#### Codebook

1. `UNITID`: Unit ID for institution

2. `INSTNM`: Institution name

3. `CITY`: School city

4. `STABBR`: State 

5. `ZIP`: Zip code

6. `HIGHDEG`: Highest degree offered. 3 is BA, 4 is graduate degree

7. `CONTROL`: Control of institution. 1 = public, 2 = private non-profit, 3 = private for profit

8. `LATITUDE`: Latitude

9. `LONGITUDE`: Longitude

10. `CCBASIC`: Carnegie Classification; e.g., 15 = Doctoral Universities: Very High Research Activity

11. `ADM_RATE`: Admission rate

12. `SAT_AVG`: Average SAT equivalent score of students admitted

13. `UGDS`: Enrollment of undergraduate certificate/degree-seeking students (student body size)

14. `NPT4_PUB`: Average net price for Title IV institutions (public institutions)

15. `NPT4_PRIV`: Average net price for Title IV institutions (private for-profit and nonprofit institutions)

16. `TUITIONFEE_IN`: In-state tuition and fees cost

17. `TUITIONFEE_OUT`: Out-of-state tuition and fees cost

18. `TUITFTE`: Net tuition revenue per full-time equivalent student

19. `INEXPFTE`: Instructional expenditures per full-time equivalent student

20. `AVGFACSAL`: Average faculty salary

21. `PFTFAC`: Proportion of faculty that is full-time

22. `PCTPELL`: Percentage of undergraduates who receive a Pell Grant

23. `C100_4`: Completion rate for first-time, full-time students at four-year institutions (100% of expected time to completion). 100 percent of normal time is typically 4 years.

24. `C150_4`: Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion). 150 percent of normal time is typically 6 years. 

25. `GRAD_DEBT_MDN`: The median debt for students who have completed

26. `WDRAW_DEBT_MDN`: The median debt for students who have not completed

27. `PCIP27`: Percentage of degrees awarded in Mathematics And Statistics

28. `UGDS_MEN`: Total share of enrollment of undergraduate degree-seeking students who are men

29. `UGDS_WOMEN`: Total share of enrollment of undergraduate degree-seeking students who are women

30. `GRADS`: Number of graduate students

31. `BOOKSUPPLY`: Cost of attendance: estimated books and supplies cost

32. `ROOMBOARD_ON`: Cost of attendance: on-campus room and board cost

33. `ENDOWBEGIN`: Value of school's endowment at the beginning of the fiscal year

34. `ENDOWEND`: Value of school's endowment at the end of the fiscal year

35. `GT_THRESHOLD_P10`: Share of students earning more than a high school graduate (threshold earnings) 10 years after entry

36. `MD_EARN_WNE_MALE0_P10`: Median earnings of non-male students working and not enrolled 10 years after entry

37. `MD_EARN_WNE_MALE1_P10`: Median earnings of male students working and not enrolled 10 years after entry


In [4]:
scorecard = pd.read_csv("college_scorecard_subset_2021_2022.csv")

scorecard.head(3)

Unnamed: 0,UNITID,INSTNM,CITY,STABBR,ZIP,HIGHDEG,CONTROL,LATITUDE,LONGITUDE,CCBASIC,...,UGDS_MEN,UGDS_WOMEN,GRADS,BOOKSUPPLY,ROOMBOARD_ON,ENDOWBEGIN,ENDOWEND,GT_THRESHOLD_P10,MD_EARN_WNE_MALE0_P10,MD_EARN_WNE_MALE1_P10
0,100654,Alabama A & M University,Normal,AL,35762,4,1,34.783368,-86.568502,18.0,...,0.3978,0.6022,884.0,1600.0,9240.0,,,0.6044,36050.0,36377.0
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,4,1,33.505697,-86.799345,15.0,...,0.3816,0.6184,8685.0,1200.0,12307.0,537349307.0,539858544.0,0.7472,42007.0,56164.0
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,4,1,34.724557,-86.640449,16.0,...,0.5891,0.4109,1972.0,2200.0,10652.0,77250279.0,75837207.0,0.7769,45170.0,66070.0


## 1. Interactive visualizations

For the first set of exercises we will create some interactive data visualizations using [plotly express](https://plotly.com/python/plotly-express/). As we discussed in class, interactive visualizations can be very useful for:

1. Exploring data to see trends that can be investigated further. 
2. Sharing data on websites to allow other people to explore the data. 

While in this class you will not have to share interactive visualizations on websites, creating interactive visualizations will be useful for your class project in order to find trends that you can then show through static graphics (and I definitely encourage everyone to share static and interactive graphics you create in this class on a website such as GitHub to showcase your skills). In the exercises below we will also focus on answering specific questions and coming up with ways to clearly visualize data, which will again be good preparation for your class project. 

Let's now dive into the data!


**Question 1.1 (4 points)**:  As a first question, create an interactive scatter plot to explore which colleges have the highest endowments. In particular please create a scatter plot using the `px.scatter(data_frame = , x = , y = , hover_name = )` function with the following mappings:

- `x`: Should be a college's endowment at the beginning of the year (i.e., the beginning of the 2021 academic year)
- `y`: Should be the median earnings for students who are not male 10 years after entry
- `hover_name`: Should be the name of the college/university

For all figures in this homework (and for the rest of the semester), be sure to label your axes appropriately!

In the answer section please report:
1. Which three schools have the highest endowments? 
2. Does there appear to be a clear relationship between endowment and income (i.e., do students who graduate from schools that have higher endowments tend to have higher incomes)? 


In [5]:
import plotly.express as px







<font color='red'> **Answer**:

1. 

2. 
    

**Question 1.2 (3 points)**: To get a little more practice with pandas, in the cell below, please write code that prints the names of the 3 schools that have the highest endowments. 

In the answer section report one advantage of using interactive graphics to find this information, and one advantage of using data manipulation to report this information. 


<font color='red'> **Answer**:




**Question 1.3 (4 points)**:  From the visualization you created for question 1.1, you will notice that it is pretty hard to tell whether students who graduate from schools that have larger endowments end up earning more money. One reason this is difficult is that the endowments of different schools vary greatly, so the majority of schools, which have smaller endowments, are all squished on the left side of the graph. 

To deal with this we can use a log transformation of the data on the x-axis. This can be done by setting the `log_x` argument to `True`. 

Please recreate the graph you created in question 1.1, but with the x values being on a $\log_{10}$ scale. In the answer section please report:

1. Whether it is easier to see a trend between how much income students are making and schools' endowments when the endowment data is plotted on the log10 scale.
2. Report two schools where students are making much higher incomes than might be predicted based on the size of a school's endowment. 

<font color='red'> **Answer**:

1.

2.


**Question 1.4 (4 points)**: 

As you saw above, there are a few schools where students are earning much higher incomes than would be predicted based on the school's endowment. Let's see if we can find any trends that might explain why students earn higher incomes from these schools. 

Below is code that loads a DataFrame that has information about the "Carnegie classification" of each school, which categorizes schools into different types, such as schools that are primarily "Master's Colleges & Universities: Larger Programs" or schools that are primarily "Baccalaureate Colleges: Arts & Sciences Focus". For the remainder of this homework, we will refer to the schools' Carnegie classification as "cc-type". 

Using this `cc_categories` DataFrame, as well as the `scorecard` DataFrame, please do the following: 

1. Join the `cc_categories` onto the `scorecard` DataFrame and store the results in a DataFrame called `scorecard2`. Your join should be done such that all the rows of the `scorecard` are kept intact. 

2. Use plotly to create the same interactive scatter plot as you did in question 1.3 but have the color of each point be the school's Carnegie Classification. 
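As a reminder of how a join that keeps every row of the first table behaves, here is a small sketch using pandas' `merge()` with `how="left"`. The `schools` and `codes` DataFrames and their columns are hypothetical stand-ins, not the homework data.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real DataFrames
schools = pd.DataFrame({"school": ["A", "B", "C"], "code": [1, 2, 9]})
codes = pd.DataFrame({"code": [1, 2], "label": ["Type one", "Type two"]})

# how="left" keeps every row of the left table, even when the key
# has no match in the right table (the unmatched row gets NaN)
joined = schools.merge(codes, on="code", how="left")
print(joined)
```

School "C" has a code that does not appear in `codes`, so it is kept in the result with a missing `label` rather than being dropped.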

In the answer section below please report:

1. What cc-type of school tends to produce the most outliers, in terms of students earning higher incomes relative to the schools' endowments.

2. What cc-type of school tends to have the highest endowments. 


In [6]:
cc_categories = pd.read_csv("CCBASIC_categories.csv")
display(cc_categories.head(3))

Unnamed: 0,CCBASIC,carnegie_classification
0,-2,Not applicable
1,0,(Not classified)
2,1,Associate's Colleges: High Transfer-High Tradi...


<font color='red'> **Answer**:

1. 


2. 





**Question 1.5 (4 points)**: Let's explore one more relationship using interactive scatter plots. In particular, let's look at the relationship between SAT scores,  median income and type of school (as specified by the Carnegie Classification). 

Please create a scatter plot exploring this relationship. In the answer section, please answer the following questions:

1. What cc-types of schools tend to have students with the highest SAT scores? 
2. What do you think leads to higher median income? Is it SAT scores, the size of a school's endowment, or something else? 


<font color='red'> **Answer**:

1. 


2. 
   
   
   

**Question 1.6 (4 points)**: Now let's examine a couple other types of interactive graphs we can make using plotly. Let's first create a treemap showing how many schools there are of each cc-type in each state, and also the median income students make in each cc-type of school. 

To start this analysis, let's create a DataFrame called `scorecard_state_stats` where each row corresponds to a unique combination of state abbreviation and Carnegie Classification. The columns of the `scorecard_state_stats` should be: 

1. `STABBR`: The state's abbreviation
2. `carnegie_classification`: The Carnegie Classification
3. `count`: The number of institutions in each state with a particular Carnegie Classification
4. `median_income`: The median of the `MD_EARN_WNE_MALE0_P10` variable (i.e., the median of the median incomes)

Once you have created the `scorecard_state_stats` DataFrame, sort the values in it by the `count` variable and display the first 5 rows. 

Hint: You can group a DataFrame in pandas by multiple variables using `.groupby(['name_col1', 'name_col2'])`.
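The hint above can be sketched on a toy table as follows. The `toy` DataFrame and its columns are hypothetical, and using named aggregation in `.agg()` is just one of several ways to compute more than one summary per group. Sorting in descending order is also an assumption here, since the sort direction is not specified above.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns and one numeric column
toy = pd.DataFrame({
    "state": ["CT", "CT", "CT", "NY"],
    "kind": ["x", "x", "y", "x"],
    "income": [40.0, 60.0, 55.0, 70.0],
})

# Group by both columns, then compute a row count and a median per group
summary = (
    toy.groupby(["state", "kind"])
       .agg(count=("income", "size"), median_income=("income", "median"))
       .reset_index()                          # turn the group keys back into columns
       .sort_values("count", ascending=False)  # largest groups first (an assumption)
)
print(summary.head())
```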


**Question 1.7 (4 points)**: Now that we have the `scorecard_state_stats` DataFrame, let's visualize it as a treemap using the `px.treemap()` function. Please set the following arguments of the function: 

1. `data_frame`: should be the `scorecard_state_stats` DataFrame
2. `path`: should be a list with a constant set to USA, and then the state abbreviations and Carnegie Classification.
3. `values`: should be the count of the number of schools for each state
4. `color`: should be the median of the median income value for each school type. 

In the answer section, in 1-3 sentences, describe what you think of this visualization. 


<font color='red'> **Answer**:



**Question 1.8 (4 points)**: Finally, let's visualize the scorecard data using a heatmap that shows how tuition revenue depends on the cc-type of college and on whether the college is a public college or a non-profit private college. 

To start this analysis, use the `.pivot_table()` method to create a DataFrame called `scorecard_wide`. Each row of this DataFrame should correspond to one of the Carnegie Classification types. The table should also have the following properties:

1. The Index should be the Carnegie Classification 
2. There should be a column called `1` which has the median "Net tuition revenue per full-time equivalent student" for public schools (for each Carnegie Classification type). 
3. Likewise, there should be a column called `2` which has the median "Net tuition revenue per full-time equivalent student" for private schools (for each Carnegie Classification type). 

Once you have created this basic `scorecard_wide` DataFrame, please modify it by doing the following:
1. Rename the `1` column to be `Public` and the `2` column to be `Private`.
2. Remove the column called `3`
3. Sort the values based on the values in the `Private` column
4. Drop any rows from the DataFrame that have missing values by using the `.dropna()` method. 
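The sequence of steps above (pivot, rename, drop a column, sort, drop missing rows) can be sketched on a toy table as follows. The `toy` DataFrame and its column names are hypothetical; the sketch assumes the control codes are stored as integers, so the pivoted columns are labeled `1`, `2`, and `3`.

```python
import pandas as pd

# Hypothetical long-format data: one row per institution
toy = pd.DataFrame({
    "kind": ["x", "x", "x", "y", "y", "z"],
    "control": [1, 1, 2, 1, 2, 3],
    "revenue": [10.0, 20.0, 30.0, 5.0, 50.0, 7.0],
})

# One row per kind, one column per control code, median revenue in each cell
wide = toy.pivot_table(index="kind", columns="control",
                       values="revenue", aggfunc="median")

wide = (
    wide.rename(columns={1: "Public", 2: "Private"})
        .drop(columns=3)          # remove the third control category
        .sort_values("Private")   # order rows by the Private column
        .dropna()                 # drop rows with any missing value
)
print(wide)
```

Kind "z" only has data for control code 3, so after that column is removed its row contains only missing values and is dropped by `.dropna()`.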

Once you have created the `scorecard_wide` DataFrame, print the first few rows of this DataFrame to show your work. 


**Question 1.9 (4 points)**: Now that you have created the `scorecard_wide` DataFrame, please visualize it using the `px.imshow()` function. Also, set the `aspect='auto'` argument to make the columns wide. Finally, set the x-axis label to be "School type" and remove the y-axis label by setting it to an empty string.

In the answer section, report the two cc-types of schools that have the highest median "Net tuition revenue per full-time equivalent student" and whether Yale is one of these types of schools. 

For additional practice, create another Python cell in this Jupyter notebook and see if you can extract Yale's cc-type from the `scorecard2` DataFrame (although if you can figure out another way to get Yale's cc-type, that is fine too). 

<font color='red'> **Answer**:




**Question 1.10 (4 points)**: As you know, since you need to turn in your final project as a pdf document, you won't be able to include interactive graphics on the project. Fortunately, you can still create non-interactive heatmaps of the data using seaborn. 

Let's explore this now by creating the visualization above using the `sns.heatmap()` function, where the argument to the function should be the `scorecard_wide` DataFrame. Also set `annot=True` to add written values on the heatmap, and `fmt=".0f"` so that the values are displayed as integers. Finally, set the x and y labels to empty strings to turn off these labels. 
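For reference, a static heatmap with annotations might look like the following sketch. The `wide_toy` DataFrame is hypothetical, and clearing the labels through the returned matplotlib Axes is one way to turn them off.

```python
import pandas as pd
import seaborn as sns

# Hypothetical wide table: rows are categories, columns are school types
wide_toy = pd.DataFrame(
    {"Public": [15.0, 5.0], "Private": [30.0, 50.0]},
    index=["x", "y"],
)

# annot=True writes each value in its cell; fmt=".0f" shows no decimal places
ax = sns.heatmap(wide_toy, annot=True, fmt=".0f")
ax.set_xlabel("")  # empty strings turn off the axis labels
ax.set_ylabel("")
```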

In the answer section, briefly state whether you like the plotly or the seaborn heatmap better. 


In [7]:
import seaborn as sns





<font color='red'> **Answer**:




**Question 1.11 (4 points)**: Now create one or more visualizations of your own using plotly. You can either use the College Scorecard data or the data you are using for your class project - or you could create a visualization of each!  If you create a nice visualization of your own data, feel free to include it in your class project report. 

For the visualization you create, you can use the types of visualization we used above (e.g., scatter plots, heatmaps, etc.) or a different visualization type we have not discussed by looking at the [plotly visualization documentation](https://plotly.com/python/plotly-express/). In the answer section, briefly describe what your visualization shows. 


In [8]:
# Create your own data visualization here










<font color='red'> **Answer**:    


    
    

## 2. Mapping

For the second part of this homework we will explore creating maps of data using the geopandas package. In particular, we will create maps where we add points at particular locations, and also choropleth maps where we will fill in predefined regions based on particular values. Let's dive in!


**Question 2.1 (2 points)**: 

In order to create maps, we need to load data that has the boundary regions of the areas we would like to map. There are several different formats that are used to store such mapping data. We will load data that is in [GeoJSON format](https://geojson.org/), which is a format that is more human-readable than other formats, which can be useful if you want to look at the raw data. To start, we will load data on the state of Connecticut which was obtained from https://github.com/glynnbird/usstatesgeojson. 

The code below imports the geopandas package and uses it to load our data on Connecticut into a geopandas DataFrame. As we discussed in class, a geopandas DataFrame is similar to a regular pandas DataFrame except that it has an extra column called "geometry" which contains geometric shapes (these shapes are defined as "Shapely objects" which [you can read more about here](https://shapely.readthedocs.io/en/stable/manual.html) if you are interested). For our Connecticut data, there is only one row, which contains data on the state of Connecticut. As you can see, this geopandas DataFrame contains information on Connecticut (just like a regular pandas DataFrame has) along with a geometry column that contains the outline of the state. 

To start, please use the geopandas `.plot()` method to plot the outline of the state. Also set the `color` argument to "orange" to make the fill color of the state orange, and the `edgecolor` argument to "green" to create a green outline around the state.


In [9]:
import geopandas as gpd

ct_map = gpd.read_file('connecticut.geojson')

ct_map

Unnamed: 0,name,abbreviation,capital,city,population,area,waterarea,landarea,houseseats,statehood,group,geometry
0,Connecticut,CT,Hartford,Bridgeport,3596080,14356,1816,12541,5,1788-01-09,US States,"POLYGON ((-72.39743 42.03330, -72.19883 42.030..."


In [10]:
# plot Connecticut here




**Question 2.2 (3 points)**: In order to have some interesting data to plot, let's convert our College Scorecard data into a geopandas DataFrame. Since geopandas DataFrames are just like regular pandas DataFrames with a geometry column, all we need to do is to load the regular College Scorecard data along with data specifying what should be in the geometry column. 

If you look at our `scorecard` DataFrame, you will notice that there are columns called `LATITUDE` and `LONGITUDE` which contain the latitude and longitude coordinates of each college. What we need to do is to convert these coordinates into "shapely" geometric POINT objects, which can be stored in the geometry column of our geopandas DataFrame. This can be done using the geopandas `gpd.points_from_xy(long, lat)` function, which takes array objects of longitude and latitude coordinates and returns a "GeometryArray" object that stores our points. 

Please use the `gpd.points_from_xy(long, lat)` function to create a name `college_geometries` which contains the locations of all colleges in our `scorecard` DataFrame. Also print out what type of object `college_geometries` is, and the first 10 values in `college_geometries` (hint: square brackets will help).


**Question 2.3 (3 points)**: 

Now that we have all the locations of colleges in the proper format, we can create a geopandas DataFrame using `gpd.GeoDataFrame()` function. This function takes a regular pandas DataFrame as a first argument, and we can set the values in the geometry column using the `geometry` argument. 
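The two steps described so far (building point geometries from coordinates, then attaching them to a regular DataFrame) can be sketched on made-up data as follows. The `toy` DataFrame, its column names, and the coordinates are hypothetical.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical toy data with latitude and longitude columns
toy = pd.DataFrame({
    "place": ["P1", "P2"],
    "lat": [41.3, 41.8],
    "lon": [-72.9, -72.7],
})

# Note the argument order: x is longitude and y is latitude
points = gpd.points_from_xy(toy["lon"], toy["lat"])
print(type(points))

# A regular DataFrame plus a geometry argument gives a geopandas DataFrame
toy_gdf = gpd.GeoDataFrame(toy, geometry=points)
print(toy_gdf.head())
```

A common mistake is to pass latitude first; since maps are drawn with longitude on the horizontal axis, that swaps the axes of the resulting plot.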

Please create a name `scorecard_gpd` that contains the scorecard data as a geopandas DataFrame. This DataFrame should contain all the same columns as the original `scorecard` DataFrame, and its geometry column should hold the longitude and latitude coordinates of each college (as specified in the `college_geometries` name you created above). 

Once you have created this DataFrame, print the first 3 rows to show you have the correct answer.


**Question 2.4 (2 points)**: 

Now let's use the `scorecard_gpd` to create a new geopandas DataFrame called `ct_scorecard` that only has information on schools in Connecticut. Once you have created this DataFrame, print the first 3 rows to show your work.

Hint: Boolean masking or using the .query() method could be helpful here.


**Question 2.5 (3 points)**: 

Now we are ready to plot the locations of all schools in Connecticut. To do this, we can first plot a map of Connecticut, as we did in question 2.1, but rather than displaying the plot, store the output of the plot in the name `base`. Also, for this plot of Connecticut, have the color be white and the outline be black.

Once you have created the `base` plot, you can then pass it to a second plot function that plots the locations of our colleges. To do this, call the `.plot()` method on our `ct_scorecard` DataFrame, with the following arguments:

1. Set the `ax` argument to our `base` object so that the outline of Connecticut is also shown. 
2. Set the `color` argument to "red" to make our points red
3. Set the `markersize` argument to 5 to make the points for each college an appropriate size.
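The layering idea described above can be sketched with made-up shapes. The rectangle and the two points below are hypothetical and do not correspond to any real map; the sketch only shows how the Axes returned by the first `.plot()` call is passed to the second call through `ax`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical region: a simple rectangle standing in for a state outline
region = gpd.GeoDataFrame(
    {"name": ["toy region"]},
    geometry=[Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])],
)

# Hypothetical point locations inside the region
sites = gpd.GeoDataFrame(
    {"site": ["s1", "s2"]},
    geometry=[Point(1, 1), Point(3, 2)],
)

# The first plot returns a matplotlib Axes; keep it as the base layer
base = region.plot(color="white", edgecolor="black")

# Passing ax=base draws the points on top of the outline
sites.plot(ax=base, color="red", markersize=5)
```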

Note: The plot you will create will look a bit off. You will fix this in the next question.

**Question 2.6 (3 points)**: As mentioned above, there is something off in the plot you created in question 2.5 - namely, you will notice an outlier where one college seems to be located way outside of Connecticut.

In the cell below, use pandas to create a DataFrame called `outlier_school` which contains just data from this outlier school (and all the columns from the original `ct_scorecard` DataFrame). Then print out the DataFrame to see which school is the outlier. 


**Question 2.7 (3 points)**: When faced with an outlier in a dataset, what we should do is to investigate why the point is an outlier. If you look up the name of the school in Google Maps, you will notice that the coordinates of the school are different from what is in the `ct_scorecard` DataFrame. Thus it seems likely that for some reason, there was an error entering the coordinates for this school (which makes sense because all schools in Connecticut should be located inside the borders of Connecticut!). 

Now that we have identified that the outlier is due to an error, we should go ahead and fix this error. One way to do this would be to correct the mistaken coordinate either in the original data, or programmatically in Python. However, we are going to be lazy here and are going to just remove this school from our dataset. 

Please create a geopandas DataFrame called `ct_scorecard2` that has all the same data as `ct_scorecard` but that has the outlier school removed. Then recreate the plot you created in question 2.5 below. 

Hint: The `.drop(index_value)` method could be useful here, although there are several ways to remove the outlier row. 
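As a small illustration of the hint, here is a sketch of removing a row by its index label. The `toy` DataFrame, its index values, and the rule used to find the suspicious row are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data where one longitude is clearly out of range
toy = pd.DataFrame(
    {"school": ["A", "B", "C"], "lon": [-72.9, 10.0, -72.7]},
    index=[10, 11, 12],
)

# Find the index label(s) of the suspicious row, then drop them
bad_index = toy[toy["lon"] > 0].index
cleaned = toy.drop(bad_index)
print(cleaned)
```

Note that `.drop()` returns a new DataFrame and leaves `toy` unchanged, which fits the rule of not re-assigning existing names.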



**Question 2.8 (3 points)**: Let's now switch gears and create a choropleth map of the mainland United States. The code below loads geometric information on the outlines of all states in the United States as a geopandas DataFrame called `usa`. 

In order to make our visualizations a little easier to see, let's just focus on the mainland United States, which are the "lower 48 states" that do not include Alaska and Hawaii. Please create a geopandas DataFrame called `usa_mainland` which only contains data from the "lower 48 states". Once you have created this DataFrame plot a map of these states.

Hint: the state abbreviation for Alaska is "AK" and the state abbreviation for Hawaii is "HI". 

    

In [11]:
# choropleth map of USA

usa = gpd.read_file('States_shapefile.geojson')

usa = usa.drop(columns = ["FID", "Program", "Flowing_St", "FID_1"]) 

usa.head(3)

Unnamed: 0,State_Code,State_Name,geometry
0,AL,ALABAMA,"POLYGON ((-85.07007 31.98070, -85.11515 31.907..."
1,AK,ALASKA,"MULTIPOLYGON (((-161.33379 58.73325, -161.3824..."
2,AZ,ARIZONA,"POLYGON ((-114.52063 33.02771, -114.55909 33.0..."


In [12]:
# Create your usa_mainland geopandas DataFrame here and plot it






**Question 2.9 (4 points)**: Now that we have a map of the mainland USA, let's create a choropleth map where each state is filled with a color that represents the mean of the average SAT scores of colleges in the state. To do this, let's first create the data we need to plot. We can do this by first creating a DataFrame called `state_SAT` from the `scorecard` DataFrame which has just two columns:

1. `STABBR`: which is the abbreviation for each state.
2. `mean_SAT`: which is the mean average SAT scores of all colleges in each state.

Once you have created the `state_SAT` DataFrame, join it onto the `usa_mainland` DataFrame and save the results to the name `usa_endow_map`. Print the first 5 rows of this DataFrame to show you have the appropriate data. 

Hint: The `usa_endow_map` DataFrame should have 5 columns which are: 'State_Code', 'State_Name', 'geometry', 'STABBR', 'mean_SAT' 


**Question 2.10 (4 points)**:  Now that we have the relevant data in the `usa_endow_map` DataFrame, let's create a choropleth map! In particular, each state should be filled with a color based on the mean of the average SAT scores of all colleges in the state.

To do this use the `.plot()` method with the following arguments: 

1. `column`: should be set to the name of the column that you want to fill in the color of each state based on its values.
2. `legend`: set this to `True` to see a legend
3. `legend_kwds`: Set this to the following dictionary `{'label': "CHANGE TO APPROPRIATE LABEL", 'orientation': "horizontal"}`
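To show how these three arguments work together, here is a sketch of a choropleth over two made-up regions. The squares, the `score` column, and the legend label are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical regions: two adjacent unit squares, each with a numeric value
regions = gpd.GeoDataFrame(
    {"name": ["left", "right"], "score": [1000.0, 1200.0]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# column picks the values used for the fill color; legend=True adds a
# colorbar, and legend_kwds controls its label and orientation
ax = regions.plot(
    column="score",
    legend=True,
    legend_kwds={"label": "Toy score", "orientation": "horizontal"},
)
```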
    

## 3. Start thinking about your class project!

Believe it or not, it's already time to start thinking about your class project. Your class project will consist of a 6-10 page analysis (done in Jupyter) where you find a data set on your own, and apply the methods we have discussed in class to show some interesting insights. 

As a first step in this process, you need to select a data set! A list of a few sources where you can find data sets is on Canvas (on the left, under the link called "Data sources"). Please take some time this week to look through the data sets on this page and/or search for data you are interested in on Google or Bing. 

On the next homework, there will be a question where it asks you to list what your project will be on, and to also load a data set related to your project, so please start on this process now so that you are prepared.

## 4. Reflection (3 points)

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on reflection on homework 5. 



## 5. Submission

Please submit your assignment as a .pdf on Gradescope. You can access Gradescope through Canvas on the left hand side of the class home page. The problems in each homework assignment are numbered. **NOTE:** When submitting on Gradescope, please select the correct pages of your pdf that correspond to each problem. **Failure to mark pages correctly will result in points being deducted from your homework score.**

If you are running Jupyter Notebooks through an Anaconda installation on your own computer, you can produce the .pdf by completing the following steps:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

If you are running the assignment in Google Colab, you can use the following instructions: 
1.  Go to "File" at the top-left of your notebook and select "Print" (note you will not actually be printing)
2. From the print window, select the option to save as a .pdf
3. Be sure to look over the pdf file to make sure all your code and written work is saved in a clear way. 

<font color='red'> **NOTE ABOUT THE FIGURES IN THE PDF**: Note that the figures created from interactive plots (i.e., plotly) might not show up correctly when you print your homework to a pdf. This is totally fine, and we will grade these questions based on the code you submit if the figures are not visible or are not showing up correctly. Interactive graphics are not meant to be printed and shared as pdfs, and so they have not been designed to print well. 