# MIDTERM REVIEW

## Midterm: Wednesday, February 9th, during lecture

**Exam Timing and Logistics:** 

You must take the midterm during your assigned lecture slot.

- A00: 1-1:50PM
- B00: 2-2:50PM


The exam format will be a combination of multiple choice, true/false, filling in a numerical answer, and other short answer questions. The exam will be administered through Gradescope.

The midterm project assesses your coding ability, and the midterm exam will focus more on the conceptual aspects of the course. Questions will be more theoretical, designed to test your understanding of concepts and not so much your implementation in code.

- Lectures 01-13 will be covered
- Open-book, open-notes, open-Google (BUT NO STUDENT COLLABORATION)


**Best way to study:**
- do the project! (great for studying, plus it's due Feb 12th)
- Old homeworks, labs, discussions, review lectures (the exam covers lectures 1-13)
- two practice exams [linked here](https://dsc10.com/resources/)
- this discussion!

*Check out the [post on campuswire](https://campuswire.com/c/G6950E967/feed/529) for more details!*

Here are links to the course [notes](https://notes.dsc10.com/front.html) and the helpful [reference sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view) we often use.

<img src="data/panda_relax.jpg" width="500">

## Some Topics to study for the exam 
#### (this list is not exhaustive)

- News articles and randomized controlled trials (similar to HW1)
- Understanding and working with the index of a DataFrame
- Strategies for extracting information from a DataFrame (knowing how to combine different DataFrame methods to get out the desired information)
- Interpreting the output of code, including DataFrame manipulations
- Knowing when to use different types of visualizations
- Interpreting data visualizations
- Density histograms (calculating height, area, count, and percent)
- Probability questions (similar to Lecture 12)

In [None]:
import babypandas as bpd
import numpy as np

import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
    "livereveal", {
        'width': 1500,
        'height': 700,
        "scroll": True,
})

## 1. Estimating Probabilities

#### Rolling a die $N$ times

#### Discussion Question

If you roll a die 4 times, what is P(at least one 6)?

|Option|Answer|
|---|---|
|A| $5/6$|
|B| $1-5/6$|
|C| $1-(5/6)^4$|
|D| $1-(1/6)^4$|
|E| None of the above|

#### Answer for 4 rolls:

C) P(at least one 6) = 1 - P(no 6) = 1 - (5/6)\**4

#### Answer for N rolls
* P(at least one 6) = 1 - P(no 6) = 1 - (5/6)\**N
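
As a quick check on the formula above, the complement rule can be evaluated exactly with Python's built-in `fractions` module. This is only a minimal sketch: the helper name `p_at_least_one_six` is made up for illustration and is not part of the course materials.

```python
from fractions import Fraction

def p_at_least_one_six(n_rolls):
    """Exact P(at least one 6 in n_rolls fair die rolls), via the complement rule."""
    p_no_six = Fraction(5, 6) ** n_rolls   # every single roll must avoid the 6
    return 1 - p_no_six

p_four = p_at_least_one_six(4)
print(p_four, float(p_four))   # option C evaluated for N = 4
```

Working with fractions avoids rounding, which makes it easy to compare the result against answer choices written as fractions.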

#### Plot the true probability for each N

In [None]:
# chance of getting at least one six 
rolls = np.arange(1, 51)
at_least_one = bpd.DataFrame().assign(n_rolls=rolls, chance=1-(5/6)**rolls)
at_least_one

In [None]:
at_least_one.plot(kind='scatter', x='n_rolls', y='chance')

#### Simulate the probability for N=20
* What is the chance of getting at least one 6 in 20 rolls?

In [None]:
# SOLUTION
faces = np.arange(1, 7)
outcomes = np.random.choice(faces, 20) # pick random number from faces, 20 times
outcomes

In [None]:
# number of positive outcomes
np.count_nonzero(outcomes == 6) # SOLUTION

#### Run this simulation 100,000 times

In [None]:
# SOLUTION
trials = 100000

rolled6 = 0
for i in np.arange(trials):
    outcomes = np.random.choice(faces, 20)
    if np.count_nonzero(outcomes == 6) >= 1:
        rolled6 = rolled6 + 1
        
#estimate the probability
rolled6/trials

#### Simulate the probability for a given N and number of trials
* wrap the experiment in a function that takes the N and number of trials as the input
* run the experiment many times

In [None]:
def roll_n(N, trials):
    rolled6 = 0
    for i in np.arange(trials):
        outcomes = np.random.choice(faces, N)
        if np.count_nonzero(outcomes == 6) >=1:
            rolled6 = rolled6 + 1

    return rolled6/trials

# Simulating the probability of getting at least one 6 in 10 rolls, using 1000 trials
roll_n(10, 1000)
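
The same experiment can also be written without an explicit `for` loop. The sketch below is one possible vectorized variant, not the course's solution: it assumes NumPy's `default_rng` generator with an arbitrary seed, uses a hypothetical function name `roll_n_vectorized`, and compares the estimate with the exact value $1-(5/6)^N$.

```python
import numpy as np

def roll_n_vectorized(N, trials, seed=0):
    """Estimate P(at least one 6 in N rolls) from `trials` simulated experiments."""
    rng = np.random.default_rng(seed)              # seeded generator (an assumption)
    rolls = rng.integers(1, 7, size=(trials, N))   # one row per experiment, faces 1-6
    has_six = (rolls == 6).any(axis=1)             # True where a row contains a 6
    return has_six.mean()                          # proportion of rows with a 6

estimate = roll_n_vectorized(20, 20_000)
exact = 1 - (5 / 6) ** 20
print(estimate, exact)
```

With more trials the estimate should land closer to the exact value, which is the idea behind running the simulation many times.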

---

## NBA Player Data

In [None]:
df = bpd.read_csv('data/player_data.csv')
df

## 2. Top Ten DataFrame Patterns and Their Variations

Let's look at the most common patterns we have been using on DataFrames. They are quite simple when you have a computer. 

However, for the exam, you really need to get familiar with them.

Best way to study: Study by writing code with pen and paper. Learn to check your code for logical and syntax errors, without the help of Python!

### 2.1) Get and Drop Columns

**Pattern**: `df.get(column_name)`

**Pattern**: `df.drop(columns = column_name)`

Where `column_name` is a string (or a list of strings, to work with several columns at once)

#### What is the output type of the following line of code?

In [None]:
df.get('Points')

**Answer:** Series

In [None]:
df.get(['Age', 'Points'])

**Answer:** DataFrame

#### What will the variable df_modified contain after running the following line of code?

#### Drop column "Age"

In [None]:
df_modified = df.drop("Age")

**Answer:** This is a trick question: the line of code does not execute because you need to specify: columns = "Age"

In [None]:
help(bpd.DataFrame.drop)

In [None]:
df_modified = df.drop(columns = "Age")
df_modified

In [None]:
df_modified = df.drop(columns = ["Age", "Team", "Games"])
df_modified
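
The output types above can be checked on a tiny table. This sketch uses plain `pandas` (the library that `babypandas` is built on) and a made-up three-row table, so the names and values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "B", "C"],       # made-up players
    "Age": [25, 31, 22],
    "Points": [1200, 900, 1500],
})

one_col = toy.get("Points")             # a string gives back a Series
two_cols = toy.get(["Age", "Points"])   # a list of strings gives back a DataFrame
dropped = toy.drop(columns="Age")       # returns a new DataFrame; toy is unchanged

print(type(one_col).__name__, type(two_cols).__name__)
print(list(dropped.columns), list(toy.columns))
```

Note that `drop` does not modify `toy` itself, which is why the result is assigned to a new variable.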

### 2.2) Get something by its label & index
**Pattern**: `df.get(column_name).loc[row_label]`

#### Getting data by its label

In [None]:
df = df.set_index('Name')
df

#### What does the following line of code return?

In [None]:
df.get('Points').loc['LeBron James']

**Answer:** The number of points for LeBron James

#### What does the following line of code return?

In [None]:
df.get('Games').loc['Chris Paul']

**Answer:** The number of games for Chris Paul

#### Getting Multiple datapoints by their labels

Get the points of players James Harden, Stephen Curry, Adreian Payne

#### What is the output type of the last line of code?

In [None]:
query_players = ["James Harden", "Stephen Curry", "Adreian Payne"]
df.get('Points').loc[query_players]

**Answer:** Series object

### 2.3) Find the label with the largest/smallest value

**Pattern**: `df.sort_values(by = "Points").iloc[-1]`

**Pattern**: `df.sort_values(by = "Points", ascending = False).iloc[0]`

#### What information does the following line of code give us?

In [None]:
df.sort_values(by='Points').iloc[0].get('Points')

**Answer:** The number of points for the player with the least points

#### Get the name of the player with the most points

In [None]:
df.sort_values(by='Points').index[-1] # SOLUTION

#### Current state of the dataframe

In [None]:
df

#### What information is output below?

In [None]:
df = df.sort_values(by = "Age", ascending = False)
df.get("Age").iloc[4]

In [None]:
df.index[4]

**Answer:** The age and name of the 5th oldest player

#### What is the type of the following output, and what information does it contain?

In [None]:
query_range = np.arange(0, 6)

df.get('Age').sort_values(ascending = True).iloc[query_range]

**Answer:** A Series containing the ages of the 6 youngest players, with their names as the index

#### What is the output below?

In [None]:
df = df.sort_values(by = ["Age", "Points"]) 
df.index[-1]

**Answer:** The name of the oldest player. If several players are tied for oldest, it is the one among them with the largest number of points.

#### Get the age and points of this player

In [None]:
df.get("Age").iloc[-1]

In [None]:
df.get("Points").iloc[-1]
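
Sorting by a list of columns sorts by the first column and uses the second only to break ties. The sketch below shows this with plain `pandas` on made-up labels and values; it is an illustration, not the course dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [38, 25, 38, 30], "Points": [500, 2000, 1200, 900]},
    index=["P1", "P2", "P3", "P4"],   # made-up player labels
)

by_age_then_points = toy.sort_values(by=["Age", "Points"])
last_label = by_age_then_points.index[-1]   # oldest; ties broken by most points

print(by_age_then_points)
print(last_label)
```

Here `P2` has the most points overall but is not last, because `Age` is compared first.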

### 2.4) Compute a statistic for a subset. Filter to get the subset.

**Example**: Players info for players with age >= 30

**Pattern**:

`
bool_mask = df.get('Age') >= 30
df[bool_mask]
`

#### Return a DataFrame containing entries for players with age >= 30

In [None]:
bool_mask = df.get('Age') >= 30
df[bool_mask]

#### What is the output of the last line of code below?

In [None]:
bool_mask = df.get('Age') >= 30
df[bool_mask].get('Points').mean()

**Answer:** The mean number of points for players that are 30 years or older

#### What is the output of the last line of code below?

In [None]:
mean_points = df.get('Points').mean()
bool_mask = df.get('Points') >= mean_points
df[bool_mask].get('Age').mean()

**Answer:** The mean age of the players, who have scored higher than or equal to the mean points for all players.
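
To see what a boolean mask actually holds, it can help to trace the pattern on a few rows by hand. The following sketch uses plain `pandas` and made-up numbers, so only the mechanics carry over to the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22, 30, 35, 28],
    "Points": [1500, 800, 1200, 400],
})

bool_mask = toy.get("Age") >= 30        # Series of True/False, one entry per row
older = toy[bool_mask]                  # keeps only the rows where the mask is True
mean_points_older = older.get("Points").mean()

print(list(bool_mask), mean_points_older)
```

The statistic is computed only over the rows that survive the filter, here the two players aged 30 and 35.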

### 2.5) Combining Conditions, Filtering and Getting Statistics

**Pattern**:

`bool1 = df.get('col1') > num1
bool2 = df.get('col2') == num2
bool_condition = bool1 & bool2
df[bool_condition]
`

**Pattern**: Don't forget the parentheses if you write it like below:

`
df[(...) & (...) & (...)]
df[(df.get('col1') > num1) & (df.get('col2') == num2)]
`

#### Filter the DataFrame, players who have more than 600 assists and more than 100 steals

In [None]:
df[(df.get('Assists') > 600) & (df.get('Steals') > 100)]

**Answer:** Filters the DataFrame to contain players who have more than 600 assists and more than 100 steals

#### What information do we obtain from the following line of code?

In [None]:
df[(df.get('Rebounds') > 1000) & (df.get('Blocks') > 100)].shape[0]

**Answer:** The number of players who have more than 1000 rebounds and more than 100 blocks

#### What information do we obtain from the following lines of code?

In [None]:
mean_points = df.get('Points').mean()
cond1 = df.get('Points') >= mean_points
cond2 = df.get('Games') > 40
bool_cond = cond1 & cond2
df[bool_cond].get('Age').median()

**Answer:** The median age of players who scored at least the mean number of points and played more than 40 games.

### 2.6) Compute statistics for aggregated groups. 

**Pattern**:

`df.groupby(column_name).func()
`
Where func is the aggregating function

In [None]:
df = df.reset_index()

In [None]:
df.groupby('Team').min()

#### Al Horford is the youngest player on ATL. True or False?

**Answer:** False. `groupby('Team').min()` takes the minimum of each column separately within each group. The name shown is therefore just the alphabetically first name on Atlanta's team, and the age shown is the age of Atlanta's youngest player. The two values need not belong to the same player.

#### What does the following line of code return and what is its output type?

In [None]:
df.groupby('Team').count().get(["Points"])

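**Answer:** The number of players (rows) on each team, not the total number of points: `.count()` counts the rows in each group, so the column labeled "Points" holds a count. The output type is a DataFrame because the argument ["Points"] was passed to `get` as a list.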
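
The difference between aggregating functions is easiest to see side by side. This sketch uses plain `pandas` with a made-up three-player table (the names are invented) to contrast `.count()`, `.sum()`, and `.min()` after grouping.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Zed", "Amy", "Bo"],     # made-up players
    "Team": ["ATL", "ATL", "BOS"],
    "Age": [22, 35, 28],
    "Points": [1000, 300, 700],
})

counts = toy.groupby("Team").count()   # number of rows per team, in every column
totals = toy.get(["Team", "Points"]).groupby("Team").sum()   # total points per team
minimums = toy.groupby("Team").min()   # minimum of each column, taken separately

print(counts.get("Points").loc["ATL"], totals.get("Points").loc["ATL"])
print(minimums.loc["ATL"])
```

In the `minimums` row for ATL, the name comes from Amy and the age comes from Zed, so the row does not describe any single player.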

#### What do the values contain in the new column that was created, and what is the name of that column?

In [None]:
new_col = df.get("Points") / df.get("Games")
df_new = (df
    .assign(Points_Per_Game = new_col)
    .sort_values(by = "Points_Per_Game",ascending = False)
)
df_new

**Answer:** Each player's average points per game (Points divided by Games). The column name is Points_Per_Game.

### 2.7) Apply function & Conditionals

**Pattern**: `df.get(a_column).apply(a_function)`

#### Given a full name, write a function that finds how many words it has

In [None]:
def find_name_len(string):
    """ Finds how many words the name contains """
    return len(string.split())

In [None]:
find_name_len("Frank Lloyd Wright")

In [None]:
find_name_len("Tony Montana")

#### What does the following line of code output?

In [None]:
df.get("Name").apply(find_name_len)

**Answer:** Series object containing the name length (in terms of number of words) for each player

#### What information does this code output?

In [None]:
( df
 .reset_index()
 .get("Name")
 .apply(find_name_len)
 .max()
)

**Answer:** maximum name length (in terms of number of words)

#### Write a function to assign age groups

In [None]:
# [0, 21) -> 'young', [21, 31) -> 'mid', [31, infinity) -> 'old'
def assign_age_group(age):
    if age < 21:
        return "young"
    elif age < 31:
        return "mid"
    else:
        return "old"

#### What will the line of code below output?

In [None]:
assign_age_group(21)

In [None]:
assign_age_group(35)

#### Add a new column to the DataFrame, which shows the age group of each player

In [None]:
new_col = df.get("Age").apply(assign_age_group)
df_new = df.assign(Age_Group = new_col)
df_new

### 2.8) Groupby Multiple Columns and look at statistics


**Pattern**:

`df.groupby([column_name1, column_name2]).func()
`
Where func is the aggregating function

* There should always be an aggregating function. Otherwise we just get a groupby object.

#### What does the following line of code output?

In [None]:
df_new.groupby(["Team", "Age_Group"])

**Answer:** a DataFrameGroupBy object because we haven't used any sort of aggregation yet like .max(), .count(), etc.

#### What does the following line of code output?

In [None]:
df_groups = (df_new
 .groupby(["Team", "Age_Group"]).count()
 .get(["Team", "Age_Group", "Games"])
)

**Answer:** an error, because Team and Age_Group are now the index, and you can't use .get to get the index

In [None]:
df_groups = (df_new
 .groupby(["Team", "Age_Group"])
 .count()
 .reset_index() # critical change here because Team and Age_Group were the index!
 .get(["Team", "Age_Group", "Games"])
)
df_groups
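
The effect of `reset_index()` after grouping by two columns can be seen by comparing the column lists before and after. This is a small sketch in plain `pandas` with made-up rows, not the course dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "BOS"],
    "Age_Group": ["mid", "mid", "old", "young"],
    "Games": [70, 65, 40, 82],
})

grouped = toy.groupby(["Team", "Age_Group"]).count()   # grouping columns become the index
flat = grouped.reset_index()                           # and return as ordinary columns

print(list(grouped.columns))
print(list(flat.columns))
print(flat)
```

After `.count()`, the `Games` column holds the number of rows in each (Team, Age_Group) pair rather than any number of games.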

### 2.9) Get all rows containing a string.

**Pattern**

`bool_mask = df.get(column_of_strings).str.contains('James')
df[bool_mask]
`

#### What does the outputted dataframe contain?

In [None]:
bool_mask = df.get("Name").str.contains("James")
df[bool_mask]

**Answer:** All players who have the word "James" somewhere in their full name.

#### Only players with the substring "Reg" and substring "ie" in their full name remain.

In [None]:
mask1 = df.get("Name").str.contains("ie")
mask2 = df.get("Name").str.contains("Reg")
df[mask1 & mask2]
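
A short check of how two `str.contains` masks combine, using plain `pandas` and invented names. The matching is case-sensitive and looks for the substring anywhere in the full name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Reggie Stone", "James Lee", "Regan Hill", "Eddie Park"],   # invented names
})

mask_reg = toy.get("Name").str.contains("Reg")   # True for Reggie Stone and Regan Hill
mask_ie = toy.get("Name").str.contains("ie")     # True for Reggie Stone and Eddie Park
both = toy[mask_reg & mask_ie]                   # rows where both masks are True

print(list(both.get("Name")))
```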

### 2.10) Density Histogram

Here is the scatter plot we created earlier:
<img src="data/scatter_plot_example.png" width="500">

### What will the density histogram look like for the y-axis?

**Answer:** Skewed left: most of the probabilities are close to 1, with a tail extending toward the lower probabilities.

### What will the density histogram look like for the x-axis?

**Answer:** uniform, all bars should be the same height because there is one dot on the scatter plot for each number 1-50.
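
Both answers can be checked numerically, along with the height and area calculations for a density histogram. The sketch below uses `np.histogram` directly; the bin edges are chosen here for illustration and are not necessarily the ones a plotting call would pick by default.

```python
import numpy as np

n_rolls = np.arange(1, 51)                 # x-values from the scatter plot
chance = 1 - (5 / 6) ** n_rolls            # y-values from the scatter plot

# x-axis: bins of width 10 with edges 1, 11, 21, 31, 41, 51
counts, edges = np.histogram(n_rolls, bins=np.arange(1, 52, 10))
widths = np.diff(edges)
heights = counts / counts.sum() / widths   # density: proportion per unit of x
areas = heights * widths                   # area of each bar = proportion in the bin

# y-axis: ten bins of width 0.1 between 0 and 1
chance_counts, _ = np.histogram(chance, bins=np.linspace(0, 1, 11))

print(counts, heights, areas.sum())
print(chance_counts)
```

Each x-axis bar has the same height because every bin holds the same count, and the areas of all bars add up to 1. For the y-axis, most of the 50 values fall in the last bin, which matches a left-skewed shape.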

---

## 3. Top Possible Pitfalls & Things to Keep in Mind

### 3.1) Difference between `&` and `and`

Always use `and` with conditionals; always use `&` with boolean arrays.

Same goes for `or` with conditionals, and `|` with boolean arrays.

In [None]:
True and False

In [None]:
np.array([True, False, True]) & np.array([False, False, False])

In [None]:
# don't do this!
np.array([True, False, True]) and np.array([False, False, False])
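
The failure in the last cell can be captured without stopping the notebook. This is a minimal sketch using `try`/`except`; the variable names are chosen here for illustration.

```python
import numpy as np

a = np.array([True, False, True])
b = np.array([False, False, False])

elementwise_and = a & b     # compares the arrays position by position
elementwise_or = a | b

try:
    a and b                 # asks for one True/False answer for a whole array
    raised = False
except ValueError:
    raised = True           # NumPy refuses: the truth value of an array is ambiguous

print(elementwise_and, elementwise_or, raised)
```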

### 3.2) Parentheses when combining conditionals:

In [None]:
# don't do this! Without parentheses, & is evaluated before >=, which causes an error
df[df.get("Age") >= 25 & df.get("Points") >= 2000]

In [None]:
df[(df.get("Age") >= 25) & (df.get("Points") >= 1800)]
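
Why the parentheses matter: `&` has higher precedence than `>=`, so the unparenthesized version is grouped differently from what was intended. The sketch below reproduces this with plain NumPy arrays and made-up values instead of the course DataFrame.

```python
import numpy as np

ages = np.array([22, 30, 35])            # made-up values
points = np.array([2500, 1500, 2100])

mask = (ages >= 25) & (points >= 2000)   # each comparison is evaluated first

try:
    # Parsed as the chained comparison: ages >= (25 & points) >= 2000
    bad = ages >= 25 & points >= 2000
    raised = False
except ValueError:
    raised = True

print(mask, raised)
```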

### 3.3) Column names may no longer describe their contents after a `groupby` and aggregation

In [None]:
df.groupby("Team").count()

In [None]:
# After .count(), these columns hold the number of rows per team, not steals or blocks
df.groupby("Team").count().get(["Steals", "Blocks"])

### 3.4) Reset index, especially after grouping with multiple columns.

In [None]:
(df_new
 .groupby(["Team", "Age_Group"])
 .count()
 .reset_index()
)

### 3.5) `iloc[]` vs `loc[]` vs array indexing `[]`

In [None]:
# Before using loc, make sure of what type of index you have:
df.index

In [None]:
df = df.set_index("Name")

In [None]:
df

In [None]:
df.get("Age").loc["Stephen Curry"]

In [None]:
df.get("Age").iloc[2]

In [None]:
df.index[2]
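
The three kinds of indexing can be compared on a table small enough to check by eye. This sketch uses plain `pandas` and made-up labels, so the values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["P1", "P2", "P3"],   # made-up player labels
    "Age": [33, 27, 21],
}).set_index("Name")

by_label = toy.get("Age").loc["P2"]      # look up by index label
by_position = toy.get("Age").iloc[2]     # look up by integer position (third row)
label_at_position = toy.index[2]         # the label stored at integer position 2

print(by_label, by_position, label_at_position)
```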

### 3.6) Not specifying column while sorting DataFrame

Wrong: `df = df.sort_values(ascending = False)` 

Correct: `df = df.sort_values(by = column_name, ascending = False)` 

### 3.7) Trying to get the index using .get() instead of .index

In [None]:
# df.get("Name")
df.index

### 3.8) Using df.drop with missing argument

`df.drop(columns = column_name)` is correct. Leaving out the `columns` argument, as in `df.drop(column_name)`, is wrong.

In [None]:
# df.drop("Points")
df.drop(columns = "Points")

### 3.9) `np.arange` (and list slicing) includes the first value but excludes the last

In [None]:
np.arange(10, 20)

### 3.10) Use of `counter = 0` before `for` loop and `counter = counter + 1` inside loop

In [None]:
# Example: Get the number of elements that are divisible by 3
data = np.arange(28)

counter = 0
for i in data:
    if i % 3 == 0:
        counter += 1

counter
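
For comparison, the same count can be obtained without a loop, using the `np.count_nonzero` pattern from the dice simulation. This is just an alternative sketch; the loop-and-counter version is still the pattern to know for the exam.

```python
import numpy as np

data = np.arange(28)                             # 0, 1, ..., 27
divisible = data % 3 == 0                        # boolean array, True for multiples of 3
n_divisible = np.count_nonzero(divisible)        # counts the True entries

print(n_divisible)
```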

### 3.11) Variables defined inside the function cannot be used/called outside

In [None]:
# Example: Write a function to get all the elements that are divisible by 3
# Input will be in the form of np.array

def div_by_three(data):
    result = np.array([])
    
    for i in data:
        if i % 3 == 0:
            result = np.append(result, i)
    
    return result

div_by_three(np.arange(28))

In [None]:
result
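
The error from the cell above can be shown without halting execution. This sketch uses a plain Python list and a differently named local variable (`kept`, chosen here for illustration) so that it does not clash with anything already defined in the notebook.

```python
def div_by_three(values):
    """Return the elements of `values` that are divisible by 3."""
    kept = []                      # local variable: exists only while the function runs
    for v in values:
        if v % 3 == 0:
            kept.append(v)
    return kept

multiples = div_by_three(range(10))   # the returned value is what we can use outside

try:
    kept                              # the local name is not visible out here
    message = ""
except NameError as err:
    message = str(err)

print(multiples, message)
```

To use a function's result later, assign the returned value to a variable, as done with `multiples`.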

### 3.12) Take care of the indentations while using `for` loops, `if-else` statements, and functions

In [None]:
# Take a look at the two examples above to see the various indentations