# Optional: Interactive Visualization

This is a completely optional section of the project. There's no place to submit it. Feel free to come back to it later on when you have more time.

*Note*: You won't be able to work on this section unless you've completed at least Question 3.

In [None]:
import babypandas as bpd
import numpy as np
import plotly.express as px

Run the cell below to load in the `ucsd_state` DataFrame you produced in Question 3.

If you see the following error:

```
FileNotFoundError: [Errno 2] No such file or directory: 'ucsd_state.csv'
```

you need to go back to the [main notebook](../midterm-project.ipynb) and run the cell at the end that starts with `ucsd_state.to_csv`.
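
If you'd like to check for the file before loading it, here is a minimal sketch. It assumes the notebook is running from the folder that should contain `ucsd_state.csv`; the variable names `csv_path` and `message` are just for illustration.

```python
from pathlib import Path

# Assumption: the notebook's working directory is the folder that should hold the CSV.
csv_path = Path('ucsd_state.csv')

if csv_path.exists():
    message = 'Found ucsd_state.csv, so the next cell should run.'
else:
    message = 'ucsd_state.csv is missing; run the ucsd_state.to_csv cell first.'

print(message)
```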

In [None]:
ucsd_state = bpd.read_csv('ucsd_state.csv')
ucsd_state

In Question 1.5, you created a scatter plot that had `'Applied'` on the $x$-axis and `'AcceptanceRate'` on the $y$-axis. Like all of the other plots you created in this class, it was _static_, meaning that you couldn't click on it or move things around.

What we'll do here is produce an interactive version of the same plot, and modify the colors to highlight **your high school**!

First, note that we're going to use a new visualization library, called `plotly`. This is what we used to create the [map](https://dsc10.com/resources/midterm_project/q3.11-map.html) you'll see in Question 3.11, and what we used to create the interactive _Little Women_ graph at the end of [Lecture 1](https://dsc10.com/resources/lectures/lec01/lec01.html#Next-time).

We've imported `plotly.express` as `px`, and the relevant function is `px.scatter`. Below, we create a scatter plot that should resemble the one you produced in Question 1.5.

In [None]:
# The .to_df() converts ucsd_state from a babypandas DataFrame to a pandas DataFrame,
# which plotly requires.
px.scatter(ucsd_state.to_df(),      
           x='Applied',
           y='AcceptanceRate')

Note that you can hover over any point and see the exact values of `'Applied'` and `'AcceptanceRate'`.

But we can take things further! Below, we've modified the plot so that when you hover over a point, you see that school's `'ID'`, and so that in-state schools appear blue and out-of-state schools appear red.

In [None]:
px.scatter(ucsd_state.to_df(),      
           x='Applied',
           y='AcceptanceRate',
           hover_name='ID',
           color='instate')

Let's go a step further. Suppose we're interested only in schools from California, and in how schools from San Diego County, and Canyon Crest Academy in particular, fare. First, we can query for only the schools that are in-state.

In [None]:
in_state_only = ucsd_state[ucsd_state.get('instate') == True]
in_state_only

Now, since all of the in-state schools have known `'Region'`s, we can color each point according to its `'Region'`, or county. We can also use `'Name'` as the `hover_name` instead of `'ID'`, since no in-state schools have missing `'Name'`s.

In [None]:
px.scatter(in_state_only.to_df(),      
           x='Applied',
           y='AcceptanceRate',
           hover_name='Name',
           color='Region')

The issue is that there are a lot of colors. It would be nice if we could reduce this plot to just have three colors:
- One for our school of interest, Canyon Crest Academy.
- One for other schools in San Diego County.
- One for all other schools.

To do that, we'll add a column to `in_state_only` that describes the above "category" that each school is in. The following function implements this logic.

In [None]:
def find_category(name, county):
    if name == 'CANYON CREST ACADEMY':
        return 'Canyon Crest Academy'
    elif county == 'San Diego':
        return 'San Diego'
    else:
        return 'Other California County'
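
As a quick sanity check, here is a sketch that calls the function on a few inputs. Only `'CANYON CREST ACADEMY'` comes from the notebook; `'HYPOTHETICAL HIGH'` and the `'Alameda'` region value are made up for illustration. Note that the name check comes first, so Canyon Crest Academy gets its own category even though its county is San Diego, and that the comparison is case-sensitive.

```python
def find_category(name, county):
    if name == 'CANYON CREST ACADEMY':
        return 'Canyon Crest Academy'
    elif county == 'San Diego':
        return 'San Diego'
    else:
        return 'Other California County'

# The name is checked before the county, so this is not labeled 'San Diego'.
print(find_category('CANYON CREST ACADEMY', 'San Diego'))

# 'HYPOTHETICAL HIGH' is a made-up school name used only for this check.
print(find_category('HYPOTHETICAL HIGH', 'San Diego'))
print(find_category('HYPOTHETICAL HIGH', 'Alameda'))

# The match is case-sensitive: this spelling falls through to the county check.
print(find_category('Canyon Crest Academy', 'San Diego'))
```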

Because the function takes values from two columns of `in_state_only` as inputs, we can't use the `.apply` method, which only works on a single Series. Instead, we'll use a `for`-loop. We almost **never** need to use `for`-loops with DataFrames, but this is one of the rare instances where we do.

In [None]:
categories = np.array([])

for i in np.arange(in_state_only.shape[0]):
    name = in_state_only.get('Name').iloc[i]
    county = in_state_only.get('Region').iloc[i]
    category = find_category(name, county)
    categories = np.append(categories, category)
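
To see the row-by-row pattern in isolation, here is a rough sketch that runs without the project data. It swaps the DataFrame columns for plain Python lists, and the `'EXAMPLE HIGH'` names and `'Los Angeles'` region value are hypothetical stand-ins. The loop above reassigns `categories` on every pass because `np.append` returns a new array rather than changing the old one; a list's `.append` modifies the list in place, so no reassignment is needed here.

```python
def find_category(name, county):
    if name == 'CANYON CREST ACADEMY':
        return 'Canyon Crest Academy'
    elif county == 'San Diego':
        return 'San Diego'
    else:
        return 'Other California County'

# Hypothetical stand-ins for the 'Name' and 'Region' columns.
names = ['CANYON CREST ACADEMY', 'EXAMPLE HIGH A', 'EXAMPLE HIGH B']
regions = ['San Diego', 'San Diego', 'Los Angeles']

categories = []
for i in range(len(names)):
    # Look up the i-th value in each "column", just like .iloc[i] does.
    category = find_category(names[i], regions[i])
    categories.append(category)

print(categories)
```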

Now, `categories` is an array containing the category of each school.

In [None]:
len(categories)

We can add it as a column to `in_state_only`.

In [None]:
in_state_only = in_state_only.assign(Category=categories)
in_state_only

Now, instead of setting `color='Region'`, we can set `color='Category'` and we will only see three different colors!

In [None]:
px.scatter(in_state_only.to_df(),      
           x='Applied',
           y='AcceptanceRate',
           hover_name='Name',
           color='Category')

You can see Canyon Crest Academy around the point (350, 0.2). The issue is that it's kind of hard to spot, since all of the circles are quite small. One thing we can do is change the `size` argument, to tell `plotly` to make some circles bigger than others. If we set `size='Enrolled'`, points will be larger for schools where more students actually enrolled at UCSD!

In [None]:
px.scatter(in_state_only.to_df(),      
           x='Applied',
           y='AcceptanceRate',
           hover_name='Name',
           color='Category',
           size='Enrolled')

Starting to look cool! But, we can keep going. We may want to pick our own colors for the points – we can do that by creating a dictionary whose keys are the categories and values are the colors we want.

Note that `#777777` is a color in "hex", a standard way of specifying colors. You can see a hex color picker [here](https://g.co/kgs/S2wzwF).
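
If you're curious what a hex code encodes, here is a small sketch. Each pair of hex digits is the amount of red, green, or blue, from 0 to 255. The helper `hex_to_rgb` is a hypothetical function written for this illustration; it is not part of `plotly`.

```python
def hex_to_rgb(hex_color):
    """Convert a string like '#777777' to a (red, green, blue) tuple of ints from 0 to 255."""
    digits = hex_color.lstrip('#')
    # Read the string two hex digits at a time and interpret each pair in base 16.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb('#777777'))  # (119, 119, 119)
print(hex_to_rgb('#EEEEEE'))  # (238, 238, 238)
```

Equal red, green, and blue values give a shade of gray, which is why `#777777` works as a neutral color for the schools we aren't highlighting.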

In [None]:
color_dictionary = {
    'Canyon Crest Academy': 'blue',
    'San Diego': 'gold',
    'Other California County': '#777777'
}

By setting the `color_discrete_map` argument to `color_dictionary`, we can see the colors we picked:

In [None]:
px.scatter(in_state_only.to_df(),      
           x='Applied',
           y='AcceptanceRate',
           hover_name='Name',
           color='Category',
           size='Enrolled',
           color_discrete_map=color_dictionary)

We're just getting started; there are plenty of other things we can do to this plot! For instance, we can add a title, change the background color and font, and add an annotation:

In [None]:
fig = px.scatter(in_state_only.to_df(),
                 x='Applied',
                 y='AcceptanceRate',
                 hover_name='Name',
                 color='Category',
                 size='Enrolled',
                 color_discrete_map=color_dictionary,
                 title='Acceptance Rate vs. Number of Applicants to UCSD in Fall 2022')

fig.update_layout(
    plot_bgcolor='#EEEEEE',
    font={'family': 'Arial'}
)

fig.add_annotation(
    x=360,
    y=0.23,
    text='<span style="color:blue"><b>This</b> is my high school!</span>',
)

Feel free to explore! You'll note that the entire notebook above has been done for you.

If you're up for it, modify this plot so that it highlights your high school. If you're not from California, this will require a good amount of work, and your high school may not even be in the dataset (Suraj's isn't). Have fun with it – change the colors, add more annotations, and customize it to your own liking.

Of course, this is entirely optional, and won't be graded in any way. But if you do produce something cool, take a screenshot and share it with us in [this Ed thread](https://edstem.org/us/courses/38383/discussion/3019418)!