# Discussion 04: Merge, Conditionals, and Iteration
---
Welcome to Discussion 4!

Here's a quick rundown of topics for this week:

- **Subgroups**: `groupby` with multi-indices

- **Merge**: "cross-reference" DataFrames

- **Conditionals**: `if` statements

- **Iteration**: `for` loops

- **Simulation**: probability experiments

You can find additional help on these topics in the course notes ([11.4 Subgroups](https://notes.dsc10.com/02-data_sets/groupby.html#subgroups), [13 Merge](https://notes.dsc10.com/02-data_sets/merging.html)) and in CIT ([9.1 Conditional Statements](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html), [9.2 Iteration](https://inferentialthinking.com/chapters/09/2/Iteration.html), [9.3 Simulation](https://inferentialthinking.com/chapters/09/3/Simulation.html)).

[Here](https://babypandas.readthedocs.io/en/latest/) is a pointer to that reference sheet we saw last time.

<img src="data/panda_tree.jpg" width="1000">


In [1]:
import babypandas as bpd
import numpy as np

import otter
grader = otter.Notebook()


# College Scorecard

### A wild dataset has appeared!

Check out some interactions with the dataset here --> http://collegescorecard.ed.gov

In [2]:
colleges = bpd.read_csv('data/csc_basic.csv').set_index('UNITID')
colleges

# Question 1.1

Which state has the most colleges/universities? Output the state abbreviation.

<!--
BEGIN QUESTION
name: q11
-->

In [3]:
state_most_colleges = ( 
    colleges
    .groupby('STABBR')
    .count()
    .sort_values(by='CITY')
    .get(['CITY'])
    .index[-1] 
)

state_most_colleges

Can we use an institution's name, city, and state together to uniquely locate the information for that institution?

In [4]:
# groupby multi-indices
new = colleges.reset_index().get(['UNITID', 'INSTNM', 'CITY', 'STABBR'])
newG = new.groupby(by=['INSTNM', 'CITY', 'STABBR']).count()
newG.loc[newG.get('UNITID')  != 1]

In [5]:
# That's why UNITID was introduced

colleges.loc[colleges.get("INSTNM") == "Western Technical College"]

# State Population

**Another wild dataset has appeared!**

We'll need this data for the next question.

In [6]:
pops = bpd.read_csv('data/state-population.csv')
pops

# Question 1.2

Which state has the largest number of colleges *per person* in 2012?

**Hint**: Play around with the old dataset and this new dataset, then use `.merge()`.

**Hint**: You need two pieces of information for each state: the number of colleges and the population size.
<!--
BEGIN QUESTION
name: q12
-->
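
The merge step in the hints can be sketched on toy data. This is only an illustration, not the intended solution: it uses `pandas` directly (assuming `babypandas` accepts the same `left_index`/`right_on` arguments to `.merge()`), and the column names `num_colleges`, `state`, and `population`, as well as every number, are made up and may not match the real files.

```python
import pandas as pd

# Toy stand-ins for the real tables; all names and values are hypothetical.
colleges_per_state = pd.DataFrame(
    {"num_colleges": [3, 2]},
    index=pd.Index(["AA", "BB"], name="STABBR"),
)
pops_by_state = pd.DataFrame(
    {"state": ["AA", "BB", "CC"], "population": [300, 100, 50]}
)

# Join the index of the left table to the 'state' column of the right table.
# The default inner join drops 'CC', which has no match on the left.
merged = colleges_per_state.merge(pops_by_state, left_index=True, right_on="state")

# Add a colleges-per-person column, then take the state with the largest value.
merged = merged.assign(
    per_person=merged.get("num_colleges") / merged.get("population")
)
top_state = merged.sort_values("per_person", ascending=False).get("state").iloc[0]
print(top_state)
```

Note that the state with the most colleges is not necessarily the state with the most colleges per person, which is why the division has to happen before sorting.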

In [7]:
# get a DataFrame of colleges_per_state
colleges_per_state = ...
colleges_per_state

In [8]:
# get a DataFrame of pops_by_state
...
pops_by_state

In [9]:
# now merge them! which column or index should each side join on (left, right)?
...
colleges_with_pops

In [10]:
# calculate a new per_person column
...
per_person

In [11]:
...
largest_colleges_per_person

# Question 1.3

What if we had set the index of `pops_by_state`?

<!--
BEGIN QUESTION
name: q13
-->
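
When both tables carry the state abbreviation in their index, the join can be done index to index. The sketch below again uses `pandas` with made-up names and values (assuming `babypandas` supports `right_index=True` in the same way); it only shows how the call changes, not the answer to the question.

```python
import pandas as pd

# Toy stand-ins for the real tables; all names and values are hypothetical.
colleges_per_state = pd.DataFrame(
    {"num_colleges": [3, 2]},
    index=pd.Index(["AA", "BB"], name="STABBR"),
)
pops_by_state_with_index = pd.DataFrame(
    {"state": ["AA", "BB", "CC"], "population": [300, 100, 50]}
).set_index("state")

# Both sides are keyed by their index, so no column name is needed.
merged = colleges_per_state.merge(
    pops_by_state_with_index, left_index=True, right_index=True
)
print(merged)
```

Compared with joining on a column, the state abbreviation now stays in the index of the result instead of appearing as a regular column.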

In [12]:
pops_by_state_with_index = ...
pops_by_state_with_index

In [13]:
# merge this with colleges_per_state

...
colleges_with_pops_by_index

In [14]:
# calculate a new per_person column

...
per_person_with_index

In [15]:
...
largest_colleges_per_person_with_index

# Question 1.4

Suppose that a college is considered **"large"** if it has more than **15k** undergrads, **"medium"** if it has more than **5k** but at most **15k**, **"small"** if it has more than **100** but at most **5k**, and **"tiny"** if it has **100** or fewer. Write a function `college_size` that takes in a number of undergrads and returns one of the strings "tiny", "small", "medium", or "large".


<!--
BEGIN QUESTION
name: q14
-->
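
One possible shape for such a function is an `if`/`elif`/`else` chain that checks the largest threshold first, so each later branch only needs a single comparison. This is a sketch that assumes the input is a plain number; the parameter name `num_undergrads` is chosen here for illustration.

```python
def college_size(num_undergrads):
    """Classify a college by its number of undergrads."""
    if num_undergrads > 15000:
        return "large"
    elif num_undergrads > 5000:
        # Reaching this branch already implies num_undergrads <= 15000.
        return "medium"
    elif num_undergrads > 100:
        return "small"
    else:
        return "tiny"


print(college_size(5000))
```

Checking the boundary values (100, 5000, and 15000) by hand is a good way to confirm that each `>` versus `<=` matches the definitions above.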

In [None]:
grader.check("q14")

# College Scorecard with Earnings

### The dataset is evolving!

Here we see a bit more info about all of our colleges.

In [18]:
with_earnings = bpd.read_csv('data/csc_financials.txt')
with_earnings

# Part 2 : Cards

The next few questions are about a standard deck of 52 cards.

In [19]:
# Create a deck of cards as a list

values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']

ALL_CARDS = []
for value in values:
    for suit in suits:
        card = str(value) + ' of ' + suit
        ALL_CARDS.append(card)

In [20]:
ALL_CARDS

# Question 2.1

Simulate drawing 5 cards *without replacement*

<!--
BEGIN QUESTION
name: q21
-->
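
A minimal sketch of one way to do this, assuming `np.random.choice` is the intended tool: passing `replace=False` guarantees that no card is drawn twice. The deck is rebuilt here under the name `deck` (a stand-in for `ALL_CARDS`) so the snippet runs on its own.

```python
import numpy as np

# Rebuild the deck so this snippet is self-contained.
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [str(value) + ' of ' + suit for value in values for suit in suits]

# replace=False: each card can appear at most once in the hand.
hand = np.random.choice(deck, 5, replace=False)
print(hand)
```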

In [21]:
...

# Question 2.2

Simulate drawing 5 cards *with* replacement
<!--
BEGIN QUESTION
name: q22
-->
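
With replacement, the same card can come up more than once. The sketch below assumes `np.random.choice` again, this time with `replace=True`; the name `deck` is a stand-in for `ALL_CARDS`, and the 100-card draw is only there to show that repeats must occur.

```python
import numpy as np

# Rebuild the deck so this snippet is self-contained.
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [str(value) + ' of ' + suit for value in values for suit in suits]

# replace=True: a card goes back into the deck after each draw.
hand = np.random.choice(deck, 5, replace=True)
print(hand)

# Drawing more cards than the deck holds only works with replacement,
# and it forces at least one repeated card.
many = np.random.choice(deck, 100, replace=True)
print(len(set(many)))
```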

In [22]:
...

# Question 2.3

Make a function ```is_card_clubs(card)``` which, given a single card, determines if that card is a club.

**Hint**: Use string method(s). If you're not sure which ones exist, search online and try them out.

**Hint**: Use the `in` operator.


<!--
BEGIN QUESTION
name: q23
-->
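
As a sketch of the `in` hint: for strings, `in` tests whether one string appears inside another. This assumes cards are formatted like the entries of `ALL_CARDS` (for example `'5 of clubs'`), so the suit name appears in the card string.

```python
def is_card_clubs(card):
    """Return True if the card string names the clubs suit."""
    return 'clubs' in card


print(is_card_clubs('5 of clubs'))
print(is_card_clubs('K of hearts'))
```

A string method such as `card.endswith('clubs')` would be a stricter alternative, since it only matches when the suit comes last.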

In [23]:
def is_card_clubs(card):
    ...

In [None]:
grader.check("q23")

# Question 2.4

Make a function `number_of_suit(cards, suit)` which, given a list/array of cards, counts the number of cards matching the suit.

<!--
BEGIN QUESTION
name: q24
-->
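
One way to approach this is the accumulator pattern: start a counter at zero, loop over the cards, and add one whenever the suit matches. The sketch assumes the same card format as `ALL_CARDS`.

```python
def number_of_suit(cards, suit):
    """Count how many cards in `cards` belong to `suit`."""
    count = 0
    for card in cards:
        if suit in card:
            count = count + 1
    return count


sample_hand = ['2 of clubs', 'A of spades', 'K of clubs']
print(number_of_suit(sample_hand, 'clubs'))
```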

In [25]:
def number_of_suit(cards, suit):
    ...

In [None]:
grader.check("q24")

# Question 2.5

What is the probability that a hand of 5 cards, drawn *without* replacement, has at least 2 clubs?

1. Figure out how to run one experiment (put it in a function, e.g. `experiment()`).
2. Run the experiment many times (use a `for` loop!).
3. Calculate the proportion of experiments in which the event occurred.

<!--
BEGIN QUESTION
name: q25
-->
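
The three steps above can be sketched end to end as follows. This is one possible layout under stated assumptions: `deck` is a stand-in for `ALL_CARDS`, the experiment returns the number of clubs in the hand, and the names `repetitions`, `results`, and `prob_at_least_two` are chosen here for illustration.

```python
import numpy as np

# Rebuild the deck so this snippet is self-contained.
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [str(value) + ' of ' + suit for value in values for suit in suits]


def experiment():
    """Draw one 5-card hand without replacement; return its number of clubs."""
    hand = np.random.choice(deck, 5, replace=False)
    clubs = 0
    for card in hand:
        if 'clubs' in card:
            clubs = clubs + 1
    return clubs


# Step 2: repeat the experiment and record each outcome.
repetitions = 1000
results = np.array([])
for i in np.arange(repetitions):
    results = np.append(results, experiment())

# Step 3: proportion of hands with at least 2 clubs.
prob_at_least_two = np.count_nonzero(results >= 2) / repetitions
print(prob_at_least_two)
```

Because the hands are random, the estimate changes slightly from run to run; increasing `repetitions` makes it more stable.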

### 1. Run one experiment

In [27]:
def experiment():
    ...

### 2. Run the experiment a bunch of times

Start by running it 1000 times

In [28]:
...

In [29]:
# show results

### 3. Calculate the probability

That is, what proportion of times did we see >= 2 clubs?

In [30]:
...

# Question 2.6

What is the probability that all of the cards in that hand are clubs?

<!--
BEGIN QUESTION
name: q26
-->

In [31]:
...

# Question 2.7

What is the probability of getting all red cards (hearts or diamonds) when drawing 5 cards without replacement?

**Hint**: Follow the same structure as before.

<!--
BEGIN QUESTION
name: q27
-->
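
The same experiment, repeat, proportion structure applies here; only the condition checked inside the experiment changes. In this sketch the helper names `is_red` and `all_red_experiment` are hypothetical, and `deck` is a stand-in for `ALL_CARDS`.

```python
import numpy as np

# Rebuild the deck so this snippet is self-contained.
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [str(value) + ' of ' + suit for value in values for suit in suits]


def is_red(card):
    """A card is red if it is a heart or a diamond."""
    return 'hearts' in card or 'diamonds' in card


def all_red_experiment():
    """Draw 5 cards without replacement; return True if every card is red."""
    hand = np.random.choice(deck, 5, replace=False)
    for card in hand:
        if not is_red(card):
            return False
    return True


repetitions = 1000
all_red_count = 0
for i in np.arange(repetitions):
    if all_red_experiment():
        all_red_count = all_red_count + 1

prob_all_red = all_red_count / repetitions
print(prob_all_red)
```

Since all-red hands are fairly rare, an estimate from 1000 repetitions is noisy; more repetitions give a steadier value.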

In [32]:
...

In [33]:
...

In [34]:
...

In [35]:
...