# Midterm Project – UCSD Admissions 🎟

## Due Saturday, February 12th at 11:59PM PST

Welcome to the Midterm Project! Projects in DSC 10 are similar in format to homeworks, but are different in a few key ways. First, a project is comprehensive, meaning that it draws upon everything we've learned this quarter so far. Second, since problems can vary quite a bit in difficulty, some problems will be worth more points than others. Finally, in a project, the problems are more open-ended; they will usually ask for some result, but won't tell you what method should be used to get it. There might be several equally valid approaches, and several steps might be necessary. This is closer to how data science is done in "real life".

It is important that you **start early** on the project! It will take the place of a homework in the week that it is due, but you should also expect it to take longer than a homework. You are especially encouraged to **find a partner** to work through the project with. If you work in a pair, you must follow the [Pair Programming Guidelines](https://dsc10.com/pair-programming/) on the course website. In particular, you must work together at the same time, and you are not allowed to split up the problems and each work on certain problems. If working in a pair, you should submit one notebook to Gradescope for both of you. Use [this sheet](https://docs.google.com/spreadsheets/d/1m5eDcFdYTQq5bu9VRYINZBFgckCyJEOXZFZGZ9bQqKY/edit?usp=sharing) to find someone else to work with.

**Important:** The `otter` tests don't usually tell you that your answer is correct. More often, they help catch basic mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). Directly sharing answers between groups is not okay, but discussing problems with the course staff or with other students is encouraged.

**Unless explicitly directed to, do not use for-loops or import any packages.** Loops in Python are slow, and looping over arrays and DataFrames should usually be avoided in favor of commands that are meant specifically for these objects. Please do not import any additional packages - you don't need them, and our autograder may not be able to run your code if you do.
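As a small illustration of the difference, the sketch below computes a per-row ratio and a total on two hypothetical NumPy arrays. The names `applied` and `admitted` and their values are made up for this example and are not taken from the project data. The array expressions do in one line what a loop would do element by element.

```python
import numpy as np

# Hypothetical counts for three made-up schools (not project data).
applied = np.array([200, 50, 120])
admitted = np.array([80, 10, 60])

# Elementwise division: one expression, no loop.
rates = admitted / applied

# Array methods also replace accumulation loops.
total_admitted = admitted.sum()
```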

As you work through this project, there are a few resources you may want to have open:
- [DSC 10 Course Notes](https://notes.dsc10.com/front.html)
- [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view)
- [`babypandas` documentation](https://babypandas.readthedocs.io/en/latest/)
- Other links in the [Resources](https://dsc10.com/resources/) and [Debugging](https://dsc10.com/debugging/) tabs of the course website

Start early, good luck, and let's get started! 😎

In [None]:
# Don't change this cell; just run it. 
import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

Use this outline to help you quickly navigate to the part of the project you're working on:

- [The Data 🎓](#the-data)
- [Question 1 – Basic Enrollment Statistics 📈](#q1)
- [Question 2 – Digging Deeper 🕵️](#q2)
- [Question 3 – California Counties 📍](#q3)
- [Question 4 – The Bay vs. SoCal 🆚](#q4)
- [Question 5 – Out-of-State ✈️](#q5)

<a name='the-data'></a>

## The Data 🎓

In this project, we'll take a look at UC San Diego's undergraduate admissions numbers for the class of first-years (i.e. not transfers) that entered in the Fall of 2020. The data we'll work with comes directly from the [University of California's Information Center](https://www.universityofcalifornia.edu/infocenter/admissions-source-school).

Run the cell below to load in our data as a DataFrame into the variable `ucsd_admissions_raw`.

In [None]:
ucsd_admissions_raw = bpd.read_csv('data/ucsd-admissions-2020.csv')
ucsd_admissions_raw

Each row corresponds to a high school. For each high school, we have the following information:

| Column | Description |
|:---|:---|
| `'ID'` | A unique identifier for each high school. Unlike `'Name'`, no two high schools share the same `'ID'`. |
| `'Name'`| The name of the high school. Note, this is not unique – for instance, the top two rows of `ucsd_admissions_raw` correspond to two different high schools both with the name `'ABRAHAM LINCOLN HIGH SCHOOL'`; one is in San Francisco and one is in San Jose. |
| `'City'` | The city in which the high school is located. Note, only schools within the US have a valid `'City'` listed; international schools have a city of `NaN`. `NaN` means "missing value". See the code cell below. |
| `'Region'` | The county (**not** country) in which the `'City'` is located if it is in California, or the state in which the `'City'` is located if it is not in California but is inside the US. If the high school is not within the US, `'Region'` is `NaN` (like `'City'`). |
| `'Applied'` | The number of students who applied to UCSD from that high school for admission in Fall 2020. |
| `'Admitted'` | The number of students who were admitted to UCSD from that high school for admission in Fall 2020. |
| `'Enrolled'` | The number of students who actually chose to attend UCSD from that high school starting in Fall 2020. |

Run the cell below. There's nothing you need to change in it; it's just showing you one of the many international high schools in the dataset. Notice that its `'City'` and `'Region'` are both NaN (missing).

In [None]:
ucsd_admissions_raw[ucsd_admissions_raw.get('ID') == 'BEIJING NATIONAL DAY SCHOOL694342']

<a name='q1'></a>

## Question 1 – Basic Enrollment Statistics 📈

Run the cell below to look at `ucsd_admissions_raw` again.

In [None]:
ucsd_admissions_raw

It's a good idea to set the index of our DataFrame to something more meaningful than 0, 1, 2, 3, ... if possible. It turns out that we cannot use `'Name'` as the index here, because there are some high schools with the same name (and we'd like the index of our DataFrame to be unique). One such example is `'ABRAHAM LINCOLN HIGH SCHOOL'`, as was mentioned in the table that described each of the columns of our data.

Instead, we'll have to use `'ID'` as our index, as it is unique for each school (note that the two `'ABRAHAM LINCOLN HIGH SCHOOL'`s have different `'ID'`s).

### Question 1.1 (1 point)

Assign `ucsd` to the DataFrame that results from setting the index of `ucsd_admissions_raw` to `'ID'`.

<!--
BEGIN QUESTION
name: q1_1
points: 1
-->

In [None]:
ucsd = ...
ucsd

In [None]:
grader.check("q1_1")

Great – we'll use `ucsd` moving forward instead of `ucsd_admissions_raw`.

### Question 1.2 (1 point)

Acceptance rate is defined as

$$\text{Acceptance Rate} = \frac{\text{# Admitted}}{\text{# Applied}}$$

Amongst students in the dataset, what was the overall acceptance rate at UCSD? Compute the acceptance rate as a proportion, and save your answer to the name `overall_acceptance`.

<!--
BEGIN QUESTION
name: q1_2
points: 1
-->

In [None]:
overall_acceptance = ...
overall_acceptance

In [None]:
grader.check("q1_2")

<!-- BEGIN QUESTION -->

### Question 1.3 (1 point)

The site [acceptancerate.com](https://www.acceptancerate.com/schools/university-of-california-san-diego) states the following:

> The overall acceptance rate for University of California-San Diego was reported as 31.5% in Fall 2020 with over 99100 applications submitted to UCSD. Both in state and out of state applicants are included in these figures. We do not have data on transfer acceptance rates currently.

31.5% is quite different from the acceptance rate you found in Question 1.2. Why is there a discrepancy between the number quoted above and the result you found?

_**Hint:**_ The answer is **not** that the website [acceptancerate.com](https://www.acceptancerate.com/schools/university-of-california-san-diego) is not credible. Instead, to find the answer, you'll want to look at the fine print at the [site we downloaded the data from](https://www.universityofcalifornia.edu/infocenter/admissions-source-school).


<!--
BEGIN QUESTION
name: q1_3
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 1.4 (1 point)

Yield rate is defined as

$$\text{Yield Rate} = \frac{\text{# Enrolled}}{\text{# Admitted}}$$

Below, add two columns to `ucsd`: `'AcceptanceRate'`, describing the acceptance rate at each high school, and `'YieldRate'`, describing the yield rate of each high school. Sort the resulting DataFrame so that high schools with the highest acceptance rates are at the top. Assign the name `ucsd_admit` to the resulting DataFrame.

<!--
BEGIN QUESTION
name: q1_4
points: 1
-->

In [None]:
ucsd_admit = ...
ucsd_admit

In [None]:
grader.check("q1_4")

<!-- BEGIN QUESTION -->

### Question 1.5 (1 point)

Let's see if there's a relationship between the acceptance rates and yield rates of high schools in `ucsd_admit`. To do this, create a scatter plot with `'AcceptanceRate'` on the $x$-axis and `'YieldRate'` on the $y$-axis.

Does there seem to be any trend in the relationship between acceptance rates and yield rates? (You don't need to answer this interpretation question anywhere; all you need to do in this question is create a plot.)

<!--
BEGIN QUESTION
name: q1_5
points: 1
manual: true
-->

In [None]:
...

<!-- END QUESTION -->

### Question 1.6 (2 points)

Let's try to identify the high schools with very low and very high acceptance rates. Assign `top_eight_acc` to an **array** of the **names** (not IDs) of the high schools with the 8 highest acceptance rates, and `bottom_eight_acc` to an **array** of the **names** of the high schools with the 8 lowest acceptance rates. The order of the names within your arrays does not matter.

_**Note:**_ Do **not** explicitly type the names of any of the high schools. Instead, use a combination of DataFrame, Series, and array manipulation techniques to create both arrays using code.

<!--
BEGIN QUESTION
name: q1_6
points: 2
-->

In [None]:
top_eight_acc = ...
bottom_eight_acc = ...

# Don't change the code below – it just shows you the two arrays you created.
print('Schools with the 8 highest acceptance rates:')
for school in top_eight_acc:
    print(school)
print('\nSchools with the 8 lowest acceptance rates:')
for school in bottom_eight_acc:
    print(school)

In [None]:
grader.check("q1_6")

<!-- BEGIN QUESTION -->

### Question 1.7 (2 points)

Below, complete the implementation of the function `plot_top_n`, which takes in an integer `n` and displays a horizontal overlaid bar chart with:
- One label on the $y$-axis for each of the top `n` high schools that had the most applicants to UCSD. Labels should be the **names** of the high schools, not IDs.
- For each of the aforementioned high schools, one bar displaying the number of students who applied to UCSD, and another bar displaying the number of students who were admitted to UCSD.
- Labels should be sorted such that the school at the top had the most applicants.

For example, `plot_top_n(10)` should show the plot below. Note that `plot_top_n` should not return anything; it should only display a chart.

<img src='images/example-17.png' width=500>

<!--
BEGIN QUESTION
name: q1_7
points: 2
manual: true
-->

In [None]:
def plot_top_n(n):
    ...

<!-- END QUESTION -->



Now, run the cell below to call both `plot_top_n(10)` and `plot_top_n(15)`. This question isn't autograded; instead, we'll be manually verifying that both your code and your outputs below look correct.

In [None]:
plot_top_n(10)
plot_top_n(15)

Do you see your high school above? There's a good chance you do, statistically speaking!

<a name='q2'></a>

## Question 2 – Digging Deeper 🕵️

Now that we've gotten a feel for the `ucsd_admit` dataset, let's perform some queries to learn a bit more about the nature of admissions at UCSD.

### Question 2.1 (1 point)

Let's first look at students from San Diego County. Set `sd_county` to a DataFrame of only the high schools in San Diego County. `sd_county` should have all of the columns that `ucsd_admit` has.

_**Note:**_ We're referring to San Diego County, not the City of San Diego!

In [None]:
sd_county = ...
sd_county

In [None]:
grader.check("q2_1")

### Question 2.2 (1 point)

Compute the overall acceptance rate of students from the county of San Diego. Assign your answer to `sd_county_acceptance`.

_**Hint:**_ If you find yourself computing the mean of a column in `sd_county`, you may want to reconsider your approach.
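To see why, consider two hypothetical schools (the numbers below are invented for illustration and are not project data). The mean of the per-school acceptance rates weights every school equally, while the overall acceptance rate weights every applicant equally, so the two can differ substantially when schools have very different numbers of applicants.

```python
# Hypothetical numbers for two made-up schools (not project data).
applied_a, admitted_a = 10, 9      # small school, 90% acceptance rate
applied_b, admitted_b = 990, 99    # large school, 10% acceptance rate

# Mean of the two per-school rates: each school counts equally.
mean_of_rates = (admitted_a / applied_a + admitted_b / applied_b) / 2

# Overall rate: total admitted divided by total applied.
overall_rate = (admitted_a + admitted_b) / (applied_a + applied_b)

print(mean_of_rates)  # about 0.5
print(overall_rate)   # about 0.108
```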

In [None]:
sd_county_acceptance = ...
sd_county_acceptance

In [None]:
grader.check("q2_2")

### Question 2.3 (1 point)

Run the following cell.

In [None]:
print('Overall acceptance rate:', overall_acceptance)
print('SD County acceptance rate:', sd_county_acceptance)

Compare the two acceptance rates. What do you notice? Just by looking at the results, we can't conclude exactly why this difference exists, but can you think of a possible reason?

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_3
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 2.4 (1 point)

How many high schools in the county of San Diego had at least 100 applicants and an acceptance rate of at least 40%? Assign your answer to the name `county_query_count`.

In [None]:
county_query_count = ...
county_query_count

In [None]:
grader.check("q2_4")

### Question 2.5 (2 points)

Assign the name `city_query` to a DataFrame containing all high schools from the **city** of San Diego with between 100 (inclusive) and 200 (inclusive) applicants and an acceptance rate between 25% (inclusive) and 35% (inclusive). `city_query` should have all of the columns that `ucsd_admit` has.
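Range conditions like these are usually written as Boolean arrays combined with `&`. The sketch below uses a hypothetical NumPy array of counts (not project data) to show how inclusive bounds translate to `>=` and `<=`, and why each comparison needs its own parentheses.

```python
import numpy as np

# Hypothetical applicant counts (not project data).
counts = np.array([99, 100, 150, 200, 201])

# Inclusive on both ends: use >= and <=, and wrap each comparison in
# parentheses because & binds more tightly than the comparison operators.
in_range = (counts >= 100) & (counts <= 200)

# A Boolean array can be used to keep only the matching entries.
kept = counts[in_range]
```

The same pattern should carry over to a DataFrame query, where each condition is a Boolean Series built from `.get(...)`.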

In [None]:
city_query = ...
city_query

In [None]:
grader.check("q2_5")

### Question 2.6 (2 points)

Assign the name `burbs` to a DataFrame containing all high schools in Los Angeles County but not the City of Los Angeles, and all of the high schools in San Diego County but not the City of San Diego. `burbs` should have all of the columns that `ucsd_admit` has.

In [None]:
burbs = ...
burbs

In [None]:
grader.check("q2_6")

### Question 2.7 (2 points)

If we wanted to learn about the admissions statistics of a particular high school, we'd need to write a query, which can get cumbersome. Here, we'll write a function to make this process a bit easier. 

Below, complete the implementation of the function `school_stats`, which takes in the **name** (not the ID) of a high school and returns an **array** with three values, containing the number of students who applied, were admitted, and enrolled at UCSD from that high school, in that order. If there is no school with the input name or there are multiple schools with the input name, return the array `np.array([0, 0, 0])`. You may assume the input name is in uppercase.

Example behavior is shown below.

```py
# From CANYON CREST ACADEMY, 300 students applied, 150 were admitted, and 35 enrolled
>>> school_stats('CANYON CREST ACADEMY')
array([300, 150, 35])

# There is no school named VINCENT MASSEY SECONDARY SCHOOL in ucsd_admit
>>> school_stats('VINCENT MASSEY SECONDARY SCHOOL')
array([0, 0, 0])
```

_**Note:**_ Once you've implemented the function, you should verify that it works as intended by trying a few examples yourself. Try it out on your high school! (Fun fact – `'VINCENT MASSEY SECONDARY SCHOOL'` is the high school that Suraj went to.)

In [None]:
def school_stats(name):
    ...

In [None]:
grader.check("q2_7")

### Question 2.8 (2 points)

Now, complete the implementation of the function `school_stats_multiple`, which takes in a **list** of school names and returns an array of three values, containing the **total** number of students who applied, were admitted, and enrolled at UCSD from those high schools combined.

If any of the names in `schools` are not valid schools or if there are multiple schools with a given name, `school_stats_multiple` again returns the array `np.array([0, 0, 0])`. Example behavior is shown below.

```py
# From these three high schools combined, 754 students applied, 326 were admitted, and 76 enrolled
>>> school_stats_multiple(['CANYON CREST ACADEMY', 'TORREY PINES HIGH SCHOOL', 'BERKELEY HIGH SCHOOL'])
array([754, 326, 76])

# There is no school named VINCENT MASSEY SECONDARY SCHOOL in ucsd_admit
>>> school_stats_multiple(['VINCENT MASSEY SECONDARY SCHOOL', 'CANYON CREST ACADEMY'])
array([0, 0, 0])
```

_**Hint 1:**_ Don't reinvent the wheel – use the function `school_stats` in your implementation of `school_stats_multiple`.

_**Hint 2:**_ You will need to use a for-loop.

In [None]:
def school_stats_multiple(schools):
    ...

In [None]:
grader.check("q2_8")

<a name='q3'></a>

## Question 3 – California Counties 📍

Let's switch our focus to studying the nature of admissions at UCSD for in-state high schools, based on county.

Note that the `'Region'` column of `ucsd_admit` contains a variety of values. From the data description table at the start of the project, the `'Region'` column contains:

> The county (**not** country) in which the `'City'` is located if it is in California, or the state in which the `'City'` is located if it is not in California but is inside the US. If the high school is not within the US, `'Region'` is `NaN` (`NaN` means "missing value").

In [None]:
ucsd_admit

### Question 3.1 (2 points)

Below, complete the implementation of the function `in_cali`, which takes in the name of a region and returns `True` if that region is a county in California and `False` otherwise. Example behavior is shown below.

```py
>>> in_cali('San Diego')
True

>>> in_cali('PA')
False

>>> in_cali(np.nan) # This is the region for international schools
False

>>> in_cali('Unknown')
False
```

_**Notes:**_
1. The line `region = str(region)` may seem redundant, since all of the regions are already strings. However, missing values (`NaN`) aren't technically stored as strings; this line converts the missing value to a string. This makes it easy to check if a region is missing; all you need to do is check if `region == 'nan'`.
2. There is a single row in `ucsd_admit` with a region of `'Unknown'`. If you look at this row you'll see that it technically corresponds to a high school in California, but since we don't know the county that the school is in, we will treat it as being out-of-state. As such, `in_cali` should return `False` if `region` is `'Unknown'`.
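The first note can be checked directly. The sketch below uses Python's built-in `float('nan')`, which behaves like `np.nan` for this purpose (`np.nan` is itself a Python float); no project data is involved.

```python
missing = float('nan')  # stands in for np.nan

# A NaN is not equal to anything, including itself,
# so an equality check against NaN never succeeds.
print(missing == missing)                # False

# Converting to a string first gives a value that can be compared normally.
print(str(missing) == 'nan')             # True

# Ordinary strings are unchanged by str().
print(str('San Diego') == 'San Diego')   # True
```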

In [None]:
def in_cali(region):
    region = str(region) # Don't change this
    ...

In [None]:
grader.check("q3_1")

### Question 3.2 (1 point)

Below, create two DataFrames:
- `ucsd_state`, which has all of the columns in `ucsd_admit` plus a new column, `'instate'`, which contains a Boolean value describing whether each high school is in-state or not (decided according to the function in 3.1), and
- `instate_only`, which contains only the high schools that are in-state. `instate_only` should have the same columns as `ucsd_admit`; it should not have a column named `'instate'` since all schools in `instate_only` will be in-state.

In [None]:
ucsd_state = ...
instate_only = ...

# Don't change the code below – it just shows you the two DataFrames you created.
print('ucsd_state')
display(ucsd_state)
print('\ninstate_only')
display(instate_only)

In [None]:
grader.check("q3_2")

### Question 3.3 (1 point)

Assign the name `california_counties` to a DataFrame indexed by county name that contains, for each California county, the total number of students who applied to, were admitted to, and enrolled at UCSD from that county. Sort by number of applicants in decreasing order.

The first few rows of `california_counties` should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Applied</th>
      <th>Admitted</th>
      <th>Enrolled</th>
    </tr>
    <tr>
      <th>Region</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Los Angeles</th>
      <td>11155</td>
      <td>4043</td>
      <td>957</td>
    </tr>
    <tr>
      <th>San Diego</th>
      <td>6677</td>
      <td>2702</td>
      <td>1141</td>
    </tr>
    <tr>
      <th>Orange</th>
      <td>6156</td>
      <td>2010</td>
      <td>440</td>
    </tr>
    <tr>
      <th>Santa Clara</th>
      <td>5610</td>
      <td>1855</td>
      <td>295</td>
    </tr>
    <tr>
      <th>Alameda</th>
      <td>3418</td>
      <td>1147</td>
      <td>263</td>
    </tr>
  </tbody>
</table>

**_Hint 1:_** Start with `instate_only`.

**_Hint 2:_** In both this subpart and the rest of Question 3, expect to use `groupby` often.

In [None]:
california_counties = ...
california_counties

In [None]:
grader.check("q3_3")

### Question 3.4 (1 point)

**Task 1:** Below, complete the implementation of the function `add_rates`, which takes in a DataFrame that has columns named `'Applied'`, `'Admitted'`, and `'Enrolled'`, and returns the same DataFrame with two added columns, `'AcceptanceRate'` and `'YieldRate'`. 

The columns of the returned DataFrame contain the values you'd expect: `'Applied'`, `'Admitted'`, and `'Enrolled'` still contain the number of students who applied to, were admitted to, and enrolled at UCSD for each row, and the two new columns contain the acceptance rate and yield rate at UCSD for each row. Each row of the input DataFrame may correspond to a school or to a county – the way you implement `add_rates` should be the same in both cases. Sort the resulting DataFrame by `'AcceptanceRate'` in descending order.

**Task 2:** After defining `add_rates`, define a new DataFrame named `california_counties_admit` that contains all of the columns in `california_counties` plus two new columns, `'AcceptanceRate'` and `'YieldRate'`, containing the overall acceptance rate and yield rate for each county. This will only take one line of code.

_**Hint:**_ Note that the function `add_rates` is a generalization of the code you wrote in Question 1.4; we'd recommend starting by copying the code you wrote there.

In [None]:
def add_rates(df):
    ...

california_counties_admit = ...
california_counties_admit

In [None]:
grader.check("q3_4")

### Question 3.5 (3 points)

In the previous question, we determined the acceptance rate and yield rate for each California county. Now, for each county, we want to know the following information:
- `'num_schools'`: the number of high schools from that county in the dataset
- `'max_enrolled'`: the largest number of students any high school in that county had enroll at UCSD
- `'mean_enrolled'`: the average (mean) number of students enrolled at UCSD from high schools in that county
- `'median_enrolled'`: the median number of students enrolled at UCSD from high schools in that county

Below, assign `county_stats` to a DataFrame indexed by county name with four columns, `'num_schools'`, `'max_enrolled'`, `'mean_enrolled'`, and `'median_enrolled'`, corresponding to the statistics above. Keep only the counties that had at least 3 schools in the dataset. Sort `county_stats` by `'mean_enrolled'` in descending order. 

The first few rows of `county_stats` are shown below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>num_schools</th>
      <th>max_enrolled</th>
      <th>mean_enrolled</th>
      <th>median_enrolled</th>
    </tr>
    <tr>
      <th>Region</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>San Diego</th>
      <td>87</td>
      <td>44</td>
      <td>13.114943</td>
      <td>9.0</td>
    </tr>
    <tr>
      <th>San Francisco</th>
      <td>8</td>
      <td>29</td>
      <td>9.875000</td>
      <td>7.5</td>
    </tr>
    <tr>
      <th>Alameda</th>
      <td>27</td>
      <td>49</td>
      <td>9.740741</td>
      <td>5.0</td>
    </tr>
    <tr>
      <th>Contra Costa</th>
      <td>15</td>
      <td>37</td>
      <td>9.266667</td>
      <td>6.0</td>
    </tr>
    <tr>
      <th>Orange</th>
      <td>55</td>
      <td>23</td>
      <td>8.000000</td>
      <td>6.0</td>
    </tr>
  </tbody>
</table>

_**Hint 1:**_ Start with `instate_only`. 

_**Hint 2:**_ You will need to group multiple times, with different aggregation methods each time.
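As a rough sketch of what "grouping multiple times" can look like, the example below summarizes a hypothetical table of pets (not project data). It uses `pandas` in place of `babypandas` and assumes the `groupby`, `count`, `max`, `mean`, `get`, and `assign` methods behave the same way there; all table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy table (not project data).
pets = pd.DataFrame({
    'Species': ['cat', 'dog', 'cat', 'dog', 'dog'],
    'Weight': [4.0, 20.0, 6.0, 30.0, 10.0],
})

# Group once per aggregation method, keeping one column from each result.
counts = pets.groupby('Species').count().get('Weight')
heaviest = pets.groupby('Species').max().get('Weight')
average = pets.groupby('Species').mean().get('Weight')

# Combine the three Series into one DataFrame; they share the same index.
summary = pd.DataFrame().assign(
    num_pets=counts,
    max_weight=heaviest,
    mean_weight=average,
)
```

Because every grouped result is indexed by the same group labels, the Series line up row by row when they are assigned as columns.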

In [None]:
county_stats = ...
county_stats

In [None]:
grader.check("q3_5")

### Question 3.6 (2 points)

We define a large county as being a county that had over 1000 applicants to UCSD. What proportion of **in-state applications** came from students in large counties? Assign your answer to the name `large_county_proportion`.

_**Hint 1:**_ Start with `california_counties`.

_**Hint 2:**_ Note that we're not asking for the acceptance rate from large counties, we're asking for the proportion of applicants that came from large counties, amongst all in-state applicants.

In [None]:
large_county_proportion = ...
large_county_proportion

In [None]:
grader.check("q3_6")

### Question 3.7 (2 points)

Now, let's suppose we're interested in looking at the relationship between the population of a county and the number of students who applied to UCSD from that county.

Below, we load in a dataset that contains the population of each county in California and save it in the DataFrame `county_populations_raw`. The dataset is taken from [here](https://worldpopulationreview.com/us-counties/states/ca).

In [None]:
county_populations_raw = bpd.read_csv('data/county_populations.csv')
county_populations_raw

Your job is to "clean" the `county_populations_raw` DataFrame. Specifically, create a new DataFrame named `county_populations` that has:
- County names in the index, but without the word `' County'` in the name (i.e. `'San Diego'` rather than `'San Diego County'`)
- A single column, `'Population'`, containing the population of each county in 2021

The first 5 rows of `county_populations` are shown below.

| | **Population** |
|---------------|-------------|
| **Los Angeles**    |  9969510 |
| **San Diego**      |  3347270 |
| **Orange**         |  3175130 |
| **Riverside**      |  2520060 |
| **San Bernardino** |  2206750 |

_**Hint:**_ In order, the steps we used were `.assign`, `.assign`, `.set_index`, and `.get`. 

You'll have to create a column that contains the county names without the word `'County'` in them and then set that column to be the index. You should expect to see the name of that column above the index in your final DataFrame as well, though it doesn't matter what name you choose for that intermediate column.
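The string-cleaning step can be sketched on a single name. The helper `drop_suffix` below is a hypothetical name chosen for this illustration; it shows why the leading space in `' County'` matters.

```python
def drop_suffix(name):
    """Return `name` with every occurrence of ' County' removed."""
    return name.replace(' County', '')

print(drop_suffix('San Diego County'))   # San Diego

# Replacing 'County' without the leading space leaves a trailing space behind.
print(repr('San Diego County'.replace('County', '')))   # 'San Diego '
```

A function like this could then be applied to every entry of a column, for example with a Series' `.apply` method.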

In [None]:
county_populations = ...
county_populations

In [None]:
grader.check("q3_7")

### Question 3.8 (1 point)

Assign `california_admit_pop` to a DataFrame that contains all of the information in `california_counties_admit`, plus a new column, `'Population'`, that contains the population of each county. Sort by `'Population'` in descending order.

_**Hint:**_ Use the `.merge` method.

In [None]:
california_admit_pop = ...
california_admit_pop

In [None]:
grader.check("q3_8")

### Question 3.9 (3 points)

You may have noticed that `california_admit_pop` – the DataFrame containing UCSD admissions data for all California counties – has 29 rows, while `county_populations` – the DataFrame containing population data for all California counties – has 58 rows. This means there are many counties that we don't have UCSD admissions data for. (This doesn't mean that nobody went to UCSD from these counties – see your answer to Question 1.3.)

Below, assign `no_sent_students` to an **array** of the names of all of the counties that we do not have enrollment data for. The order of the names in the array does not matter.

_**Hint:**_ There are multiple ways to solve this problem:

- One method involves using the `.merge` method with the optional `how` argument. After merging, you may need the fact that comparing a `NaN` value to any number with `==`, `<`, `<=`, `>`, or `>=` always evaluates to `False`. The [documentation for `.merge`](https://babypandas.readthedocs.io/en/latest/_autosummary/bpd.DataFrame.merge.html) and [this image](https://datacomy.com/data_analysis/pandas/merge/pandas-merge-right-2.png) may be helpful if you go down this route. If you find yourself using the `~` symbol in your solution, change it to a `-` (both mean "negate" or "not").
- Alternatively, you could use a for-loop.

Regardless, a good strategy is to first figure out **where** the information you need is stored.
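To make the first approach more concrete, here is a minimal sketch of how the `how` argument and `NaN` comparisons interact, using two hypothetical toy tables (not project data). It uses `pandas` in place of `babypandas` and assumes `.merge` accepts the same `left_index`, `right_index`, and `how` arguments there.

```python
import pandas as pd

# Hypothetical toy tables (not project data), both indexed by item name.
prices = pd.DataFrame({'Item': ['apple', 'banana', 'cherry'],
                       'Price': [1.0, 0.5, 3.0]}).set_index('Item')
sales = pd.DataFrame({'Item': ['apple', 'cherry'],
                      'Sold': [30, 12]}).set_index('Item')

# how='left' keeps every row of `prices`; items missing from `sales`
# get NaN in the 'Sold' column.
merged = prices.merge(sales, left_index=True, right_index=True, how='left')

# Ordering comparisons involving NaN are False, so 'banana' fails this check.
has_sales = merged.get('Sold') >= 0
```

Here `has_sales` is `True` for `'apple'` and `'cherry'` and `False` for `'banana'`, the one item that appears only in the left table.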

In [None]:
no_sent_students = ...
no_sent_students

In [None]:
grader.check("q3_9")

### Question 3.10 (1 point)

We've seen that Los Angeles County had the most applicants to UCSD. But we also know that Los Angeles County is the most populous county in California (and the US), so it almost certainly has the most 12th-graders as well. Let's try to determine the number of applicants each county had to UCSD, per 12th-grader.

Unfortunately, we don't have access to data that tells us the number of 12th-graders per county, so we'll need to do some estimation. According to [this site](https://www.infoplease.com/us/census/california/demographic-statistics), 15-19 year olds make up 7.2% of the population of California. Since you spend a year in 12th grade, let's assume 12th-graders make up one-fifth of the population of 15-19 year olds, or 1.44% of the population of California. We will further assume that this holds true for each county individually. For instance, since San Diego County has a population of 3347270, we will assume there were $3347270 \cdot 0.0144 = 48200.688$ 12th-graders in San Diego County.
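The arithmetic behind this estimate can be reproduced in a few lines. The variable names below are chosen for this illustration; the numbers are the ones quoted in the paragraph above.

```python
share_15_to_19 = 0.072             # 15-19 year olds as a share of the population
share_12th = share_15_to_19 / 5    # one of five ages, so about 0.0144

sd_population = 3347270            # San Diego County population quoted above
sd_12th_graders = sd_population * share_12th

print(round(share_12th, 4))        # 0.0144
print(round(sd_12th_graders, 3))   # 48200.688
```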

Using the estimate that 1.44% of the population of each county is made up of 12th-graders, compute a Series indexed by county name containing the **proportion of 12th-graders in each county that applied to UCSD**. Sort the Series in decreasing order, and store the result in `apps_per_capita`.

In [None]:
apps_per_capita = ...
apps_per_capita

In [None]:
grader.check("q3_10")

If you answered Question 3.10 correctly, you'll see that Santa Clara County had the most applicants to UCSD per 12th-grader in the county – roughly 20.3% of all 12th-graders in Santa Clara County applied for first-year admission to UCSD. Los Angeles isn't in the top 5.

### Question 3.11 (1 point)

Let's wrap up Question 3 by visualizing your work from 3.10. To do so, we'll use a new type of data visualization called a choropleth. Choropleths show the value of some variable (either numerical or categorical) for each region in a map. For example, the map you see [here](https://www.mayoclinic.org/coronavirus-covid-19/vaccine-tracker) is a choropleth.

Here, we'll look at a choropleth that shows the proportion of 12th-graders in each California county that applied to UCSD. We've already created the visualization for you, and it can be found at the link below. Note that it is interactive! If you hover over a county, you will see the county's name. (Ignore the "fips" number that appears for each county.)

<center><h4><a href="https://dsc10.com/resources/midterm_project/q3.11-map.html">Click here to see the choropleth.</a></h4></center>

**Your Job:** Which of the following can we conclude based solely on the choropleth linked above (i.e. without knowing anything else about each county)? Assign `choropleth_q` to either 1, 2, 3, or 4.

1. Santa Clara County had more applicants to UCSD than Alameda County.
2. Santa Clara County had more applicants to UCSD than Los Angeles County.
3. More students applied to UCSD from the Bay Area than from Southern California.
4. None of the above.

In [None]:
choropleth_q = ...

In [None]:
grader.check("q3_11")

In the choropleth that we just looked at, it seems like there are two "clusters" of darker regions. Let's look a bit closer at those regions.

<a name='q4'></a>

## Question 4 – The Bay vs. SoCal 🆚

When applying to the UCs, did you only apply to UCSD, or did you apply to other UCs as well? What factors did you consider? Competitiveness? Location?

In this question, we will compare admissions data for UCSD to that of UC Berkeley 🐻. Run the cell below to load in the raw data for Berkeley's first-year Fall 2020 admissions and store it in the DataFrame `berkeley_admissions_raw`.

In [None]:
berkeley_admissions_raw = bpd.read_csv('data/berkeley-admissions-2020.csv')
berkeley_admissions_raw

### Question 4.1 (1 point)

Create a DataFrame called `berkeley_admit` that contains all of the information in `berkeley_admissions_raw`, but with columns for `'AcceptanceRate'` and `'YieldRate'` and with an index of `'ID'`. Note that this should take only one line of code if you use your `add_rates` function from earlier.

<!--
BEGIN QUESTION
name: q4_1
points: 1
-->

In [None]:
berkeley_admit = ...
berkeley_admit

In [None]:
grader.check("q4_1")

Now, since we have both the UCSD data and Berkeley data stored in a similar format, we can merge them into one DataFrame called `both_ucs_raw`. We have done this for you in the cell below. Run the cell below – make sure you understand the code in the cell, but do not change it.

In [None]:
# Run this cell. DO NOT change it.
both_ucs_raw = ucsd_admit.merge(berkeley_admit, left_index=True, right_index=True) 
both_ucs_raw

Notice that the column names in the merged DataFrame are messy – some contain a `'_x'` and some contain a `'_y'`. We want to clearly label which columns represent UCSD data and which ones represent Berkeley data. Run the following cell to transform all the column names into something meaningful. Don't worry about how this is done.
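To see where the `'_x'` and `'_y'` suffixes come from, here is a minimal sketch on two made-up tables. It uses `pandas` as a stand-in for `babypandas` (an assumption made so the example is self-contained), and the table names and values are hypothetical.

```python
import pandas as pd  # stand-in for babypandas, used here only for illustration

# Two hypothetical tables that share both an index and a column name.
left_toy = pd.DataFrame({'Applied': [10, 20]}, index=['A', 'B'])
right_toy = pd.DataFrame({'Applied': [5, 7]}, index=['A', 'B'])

# Merging on the index, just like the cell above does.
merged_toy = left_toy.merge(right_toy, left_index=True, right_index=True)

# The shared column name appears twice, so the left copy gets '_x'
# and the right copy gets '_y'.
print(list(merged_toy.columns))  # ['Applied_x', 'Applied_y']
```

In `both_ucs_raw`, the `'_x'` columns therefore come from `ucsd_admit` (the left DataFrame) and the `'_y'` columns come from `berkeley_admit` (the right DataFrame).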

In [None]:
# Run this cell. DO NOT change it.
column_mapping = {
    'Name': both_ucs_raw.get('Name_x'),
    'City': both_ucs_raw.get('City_x'),
    'Region': both_ucs_raw.get('Region_x'),
    'Applied_UCSD': both_ucs_raw.get('Applied_x'),
    'Admitted_UCSD': both_ucs_raw.get('Admitted_x'),
    'Enrolled_UCSD': both_ucs_raw.get('Enrolled_x'),
    'AcceptanceRate_UCSD': both_ucs_raw.get('AcceptanceRate_x'),
    'YieldRate_UCSD': both_ucs_raw.get('YieldRate_x'),
    'Applied_Berkeley': both_ucs_raw.get('Applied_y'),
    'Admitted_Berkeley': both_ucs_raw.get('Admitted_y'),
    'Enrolled_Berkeley': both_ucs_raw.get('Enrolled_y'),
    'AcceptanceRate_Berkeley': both_ucs_raw.get('AcceptanceRate_y'),
    'YieldRate_Berkeley': both_ucs_raw.get('YieldRate_y')
}

both_ucs = both_ucs_raw.assign(**column_mapping).get(column_mapping.keys())
both_ucs

Notice that the column names in `both_ucs` are now descriptive.
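If you're curious about the `**column_mapping` syntax in the cell above (this is optional and isn't needed for any question), the sketch below shows the general idea in plain Python: `**` unpacks a dictionary into keyword arguments. The function `describe` and the dictionary `toy_mapping` are hypothetical names made up for this illustration.

```python
def describe(**named_values):
    """Return the keyword names this function received, in order."""
    return list(named_values.keys())

# A hypothetical dictionary whose keys will become keyword argument names.
toy_mapping = {'Applied_UCSD': [1, 2], 'Applied_Berkeley': [3, 4]}

# These two calls pass exactly the same keyword arguments.
via_unpacking = describe(**toy_mapping)
via_keywords = describe(Applied_UCSD=[1, 2], Applied_Berkeley=[3, 4])

print(via_unpacking)                  # ['Applied_UCSD', 'Applied_Berkeley']
print(via_unpacking == via_keywords)  # True
```

In the same way, `.assign(**column_mapping)` behaves as if each key in `column_mapping` were written out as its own `name=values` argument to `.assign`.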

### Question 4.2 (1 point)

Now, let's compare admissions statistics from UCSD and Berkeley with respect to the region of the high school. One way to do this would be to classify high schools as being in the Bay Area or Southern California.

The Bay Area ("The Bay") is made up of the following nine counties: Sonoma, Napa, Solano, Marin, Contra Costa, Alameda, San Mateo, San Francisco, and Santa Clara.

<img src="./images/bay-area-map.png"
     width=300/>

We will define Southern California ("SoCal") to consist of the following ten counties: San Luis Obispo, Kern, Santa Barbara, Ventura, Los Angeles, Orange, San Bernardino, Riverside, Imperial, and (of course) San Diego.

<img src="./images/socal-map.png"
     width=300/>

Below, complete the implementation of the function `bay_or_socal` that takes as input the name of a county and returns either `'bay_area'`, `'socal'`, or `'neither'` based on the definitions above.

For your convenience, we've defined two lists, `bay_counties` and `socal_counties`. You will need to use them in your implementation of `bay_or_socal`.

**_Hint:_** Use the `in` operator (see [here](https://www.askpython.com/python/examples/in-and-not-in-operators-in-python) for more details).
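As a quick illustration of the `in` operator, here is a minimal sketch on a made-up list that is unrelated to the project data (the name `weekend_days` is hypothetical).

```python
# A hypothetical list, used only to demonstrate membership checks.
weekend_days = ['Saturday', 'Sunday']

sunday_check = 'Sunday' in weekend_days   # the whole string is an element
monday_check = 'Monday' in weekend_days   # not an element of the list
partial_check = 'Sun' in weekend_days     # a partial match does not count

print(sunday_check)   # True
print(monday_check)   # False
print(partial_check)  # False
```

Note that with a list, `in` checks whether the value is equal to one of the list's elements, so the spelling and capitalization must match exactly.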

In [None]:
# Don't change the following two lines!
bay_counties = ['Sonoma', 'Napa', 'Solano', 'Marin', 'Contra Costa', 'Alameda', 'San Mateo', 'San Francisco', 'Santa Clara']
socal_counties = ['Kern', 'San Bernardino', 'Los Angeles', 'Santa Barbara', 'San Diego', 'Imperial', 'Riverside', 'Orange', 'San Luis Obispo', 'Ventura']

def bay_or_socal(county):
    ...
    
# Once you've completed your function, uncomment the following line and make sure the outputs
# are what you'd expect.
# [bay_or_socal('Solano'), bay_or_socal('San Diego'), bay_or_socal('Santa Cruz')]

In [None]:
grader.check("q4_2")

### Question 4.3 (1 point)

Now, create a new DataFrame named `both_ucs_region` with the same data as `both_ucs` but with an additional column called `'california_region'` that contains the California region (i.e. `'bay_area'` or `'socal'`) for each high school. Keep only the high schools located in either the Bay Area or Southern California.

In [None]:
both_ucs_region = ...
both_ucs_region

In [None]:
grader.check("q4_3")

### Question 4.4 (1 point)

Assign the name `enrolled_counts` to a DataFrame that contains the number of students enrolled in each of UCSD and Berkeley from the Bay Area and Southern California, amongst the schools in `both_ucs_region`. `enrolled_counts` should have two columns, `'Enrolled_UCSD'` and `'Enrolled_Berkeley'`, and should have an index of `'california_region'`, meaning that it should only have two rows.

In [None]:
enrolled_counts = ...
enrolled_counts

In [None]:
grader.check("q4_4")

Note that `enrolled_counts`, the DataFrame that you just computed, does not contain the true number of students that enrolled at each of these universities from each of these regions in Fall 2020. All numbers in `enrolled_counts` are underestimates. There are two reasons for this. First, our original datasets for both universities don't account for every single high school (see Question 1.3). Second, we merged the original datasets for both universities, meaning that `both_ucs_region` only contains information for high schools that had applicants to both.

(To be clear, this doesn't mean your answer to `enrolled_counts` is wrong – we just wanted to provide context regarding the numbers you calculated.)
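To see why merging leads to underestimates, consider the following minimal sketch on made-up data. It uses `pandas` as a stand-in for `babypandas` (an assumption made so the example is self-contained), and the school IDs and enrollment counts are hypothetical.

```python
import pandas as pd  # stand-in for babypandas, used here only for illustration

# Hypothetical enrollment tables, indexed by a made-up school ID.
ucsd_toy = pd.DataFrame({'Enrolled': [3, 1, 4]}, index=['S1', 'S2', 'S3'])
berkeley_toy = pd.DataFrame({'Enrolled': [2, 5]}, index=['S2', 'S4'])

# By default, merge keeps only the IDs that appear in BOTH tables.
merged_toy = ucsd_toy.merge(berkeley_toy, left_index=True, right_index=True)

full_total = ucsd_toy.get('Enrolled').sum()        # all three schools
merged_total = merged_toy.get('Enrolled_x').sum()  # only the shared school

print(list(merged_toy.index))    # ['S2']
print(merged_total, full_total)  # 1 8
```

Schools `'S1'` and `'S3'` disappear from the merged table because they have no match in `berkeley_toy`, so any total computed after the merge leaves out their students.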

For the remainder of Question 4, use `both_ucs`, not `both_ucs_region`.

### Question 4.5 (1 point)

Assign the name `total_enrolled` to a Series containing the total number of students enrolled at UCSD and Berkeley from each high school contained in `both_ucs`. Sort the Series in descending order of total enrollments, and make sure it is indexed by high school `'ID'`.

In [None]:
total_enrolled = ...
total_enrolled

In [None]:
grader.check("q4_5")

### Question 4.6 (1 point)

In the DataFrame `both_ucs`, how many high schools had a higher acceptance rate at Berkeley than at UCSD? Assign this number to a variable `higher_rate_berkeley`.

In [None]:
higher_rate_berkeley = ...
higher_rate_berkeley

In [None]:
grader.check("q4_6")

<a name='q5'></a>

## Question 5 – Out-of-State ✈️

For the final question on the project, we'll return to focusing on just UCSD admissions data. In this question you will do some data visualization and complete a "fun" challenge.

Let's start by looking at the DataFrame `ucsd_state` again. Notice that it has a column `instate` that indicates whether or not a high school is in-state.

In [None]:
ucsd_state

### Question 5.1 (2 points)

Create an **overlaid** histogram of UCSD `'AcceptanceRate'`s, with one histogram for in-state high schools and one histogram for out-of-state high schools.

The histogram should have the following features:
- The $x$-axis should be the acceptance rate in **percentage** (not proportion!).
- The bins should range from 0 through 90 with a bin width of 10 (i.e. the first bin should be $[0, 10)$ and the last bin should be $[80, 90]$).
- Set `density=True` (to make it a density histogram) and `figsize=(10, 5)` (to make it large).
- Set `alpha=0.3` to make sure the overlapped region is visible.

_**Note:**_ The method of creating an overlaid plot from class (calling `.plot` and not specifying a `y` argument) will not work here – since each school is either "in-state" or "out-of-state", we don't have separate "in-state" and "out-of-state" columns. 

As a result, you'll have to follow a slightly different approach:
- **You'll need to call `.plot` twice**, once for in-state schools, and once for out-of-state schools. **Create the histogram for in-state schools first.**
- You'll also need to assign your first plot to a variable. Then, when you create your second plot, you'll need to pass in an additional argument `ax` which references that variable. See [Note 14](https://notes.dsc10.com/03-visualization/intro.html#overlaying-scatters-lines-and-histograms) for some examples.
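The two-step approach described above can be sketched on made-up data as follows. This is only an illustration of the `ax` pattern, not a solution: the DataFrames, column name, and bins are hypothetical, and `pandas` is used as a stand-in for `babypandas`. The `matplotlib.use('Agg')` line is an assumption that lets the sketch run without a display; you don't need it in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend; not needed inside a notebook
import numpy as np
import pandas as pd  # stand-in for babypandas, used here only for illustration

# Two hypothetical DataFrames, one per group, sharing a column name.
morning = pd.DataFrame({'score': [12, 18, 25, 31, 33]})
evening = pd.DataFrame({'score': [5, 14, 22, 27, 38]})
toy_bins = np.arange(0, 50, 10)  # bin edges 0, 10, 20, 30, 40

# First call: save the plot (an Axes object) in a variable.
first_ax = morning.plot(kind='hist', y='score', bins=toy_bins,
                        density=True, alpha=0.3)

# Second call: pass that variable as `ax` to draw on the same axes.
second_ax = evening.plot(kind='hist', y='score', bins=toy_bins,
                         density=True, alpha=0.3, ax=first_ax)

print(second_ax is first_ax)  # True: both histograms share one set of axes
```

Because both calls use the same bins, the bars of the two histograms line up, which makes the overlapping regions easy to compare.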

Here's what your plot should look like (though yours will look bigger):

<img src='images/example-51.png' width=400>

<!-- BEGIN QUESTION -->


<!--
BEGIN QUESTION
name: q5_1
points: 2
manual: true
-->

In [None]:
# Make your histogram here.


# Don't change the following three lines
plt.title('UCSD Fall 2020 First-Year Admissions Rates')
plt.legend(['In-State', 'Out-of-State'])
plt.xlabel('Acceptance Rate (%)');

<!-- END QUESTION -->



### Question 5.2 (1 point)

In your own words, describe the differences between these two histograms. How do the shapes of the distributions vary?

<!-- BEGIN QUESTION -->


<!--
BEGIN QUESTION
name: q5_2
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



Additionally, why do you think the distributions are so different? (You don't need to answer this anywhere; it's just something to think about.)

### Question 5.3 (3 points)

Let's look at `ucsd_state` one last time.

In [None]:
ucsd_state

It turns out that there is exactly one city whose name is repeated in different regions.

Assign `repeated_city` to the name of this city, and assign `repeated_city_regions` to an array of the unique regions in which this city's name appears (the order of the names in the array does not matter). **This is a very challenging problem!** We intend it to be a "fun" wrap-up exercise to the project. Since it's the last question, no other questions will depend on your answer to this question.

_**Hint 1:**_ You will need to use `.groupby` twice.

_**Hint 2:**_ You can use the `.reset_index` method if you want to convert the index of a DataFrame into a column.

_**Hint 3:**_ Once you have a Series containing the regions corresponding to `repeated_city`, use the function `np.unique` on the Series to get an array of the unique region names.
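To illustrate what `.reset_index` and `np.unique` do, here is a minimal sketch on a made-up DataFrame that is unrelated to the project data. It uses `pandas` as a stand-in for `babypandas` (an assumption made so the example is self-contained), and all names and values are hypothetical.

```python
import numpy as np
import pandas as pd  # stand-in for babypandas, used here only for illustration

# A hypothetical DataFrame about pets.
pets = pd.DataFrame({
    'animal': ['cat', 'dog', 'cat', 'bird'],
    'color': ['black', 'brown', 'white', 'black'],
    'age': [3, 5, 2, 1],
})

# After grouping, 'animal' is the index rather than a column.
counts = pets.groupby('animal').count()

# reset_index moves the index back into a regular column named 'animal'.
flat = counts.reset_index()
print(list(flat.columns))  # ['animal', 'color', 'age']

# np.unique takes a Series and returns an array of its distinct values, sorted.
unique_colors = np.unique(pets.get('color'))
print(unique_colors)
```

Once the index has been moved back into a column, you can group by that column again or use it in a query, just like any other column.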

In [None]:
repeated_city = ...
repeated_city_regions = ...

# Don't change the following two lines
print('repeated city:', repeated_city)
print('repeated regions:', repeated_city_regions)

In [None]:
grader.check("q5_3")

## Congratulations! You've completed the Midterm Project! 🏁

All you need to do now is submit your assignment:

1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells. **If you do not do this, we may not be able to grade your work!**
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope. **Don't forget to add your partner to your group on Gradescope!**

If running all the tests at once causes a test to fail that didn't fail when you ran the notebook in order, check to see if you changed a variable's value later in your code. Make sure to use new variable names instead of reusing ones that are used in the tests.

Remember, the tests here and on Gradescope just check the format of your answers. We will run correctness tests after the assignment's due date has passed.

In [None]:
grader.check_all()