# Discussion 4 - Functions, DataFrames, Control Flow, Probability, and Simulation

## DSC 10, Summer 2024

### Agenda
- Review of concepts:
    - Functions.
    - Grouping on multiple columns.
    - Merging DataFrames. 
- Work in groups of 2-4 on practice problems covering these topics.
    - Available at [practice.dsc10.com](https://practice.dsc10.com/).
- At the end, go over the problems people had the most trouble with, all together.

## Functions

Functions let us package code into small, reusable pieces so that we don't have to write the same code repeatedly; repetitive code is **prone to error**.

### 1) Defining functions

To define a function in Python, we use the following structure:
```py
def function(parameters):
    # indent!
    <function body>
    return <expression>
```

Functions take in inputs (**arguments**), do something, and produce outputs:

In [None]:
def pythag(a, b):
    return (a**2 + b**2) ** 0.5

pythag(3, 4)

- `pythag` has 2 **parameters**, `a` and `b` 
    - When we call `pythag` with the **arguments** 3 and 4, `a` is set to 3 and `b` is set to 4 within the function body.

- To be able to **save the function output to a variable, you must use `return`**!
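The difference between returning and printing can be sketched as follows. The names `pythag_return` and `pythag_print` are hypothetical variants of `pythag`, introduced only for this illustration.

```python
def pythag_return(a, b):
    # Hands the computed value back to the caller.
    return (a**2 + b**2) ** 0.5

def pythag_print(a, b):
    # Displays the value, but has no return statement,
    # so the call itself evaluates to None.
    print((a**2 + b**2) ** 0.5)

saved = pythag_return(3, 4)  # saved is 5.0
lost = pythag_print(3, 4)    # 5.0 is displayed, but lost is None
```

Only `saved` can be used in later computations; `lost` holds `None` even though a number appeared on screen.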

### 2) Applying functions to DataFrames

To apply a function `func` to every element in a column `'col'` in DataFrame `df`, use the syntax
<br>
<br>
<center>
    <code>df.get('col').apply(func)</code>
</center>

- `.apply` is a Series method: it is used on a Series, and outputs a Series.
- Each entry in `'col'` will be passed in individually as the argument to `func`.
- Pass only the name of the function, without parentheses: `.apply(func)`, not `.apply(func())`!

In [None]:
import babypandas as bpd
roster = bpd.read_csv('data/roster-anon.csv')
roster

In [None]:
def first_name(full_name):
    return full_name.split(' ')[0]

In [None]:
# Each name is passed into first_name individually; the outputs are collected into a new Series.
roster.get('name').apply(first_name)
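For a version that runs without the roster file, here is a minimal sketch on a small made-up Series. It assumes `pandas` in place of `babypandas` (the `.apply` call looks the same), and the names in `names` are invented for illustration.

```python
import pandas as pd

# Hypothetical data; not taken from the actual roster.
names = pd.Series(['Sam Lee', 'Kai Park', 'Noor Diaz'])

def first_name(full_name):
    # Keep only the text before the first space.
    return full_name.split(' ')[0]

# Pass the function itself (no parentheses); .apply calls it once per entry.
firsts = names.apply(first_name)
```

`firsts` is a Series with the same index as `names`, holding one output per input.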

## Grouping on Multiple Columns

To group on multiple columns, use the syntax
<br>
<br>
<center>
    <code>df.groupby(['col_1', 'col_2', ..., 'col_k'])</code>
</center>

- Groups `df` by `'col_1'` first. Within each group in `'col_1'`, groups by `'col_2'`, and so on.
- Results in a DataFrame with **one row per unique combination of values in the specified columns**!

This helps us answer questions like: how many students with each first name are in each section?

In [None]:
roster = roster.assign(first = roster.get('name').apply(first_name))
roster

In [None]:
#roster.groupby('first').count()
roster.groupby(['section', 'first']).count().reset_index()
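To see the "one row per unique combination" behavior on something small enough to check by hand, here is a sketch with a made-up table. It assumes `pandas` in place of `babypandas`; the DataFrame `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical mini-roster with five students in two sections.
toy = pd.DataFrame({
    'section': ['A', 'A', 'A', 'B', 'B'],
    'first': ['Sam', 'Sam', 'Kai', 'Sam', 'Kai'],
    'name': ['Sam Lee', 'Sam Cruz', 'Kai Park', 'Sam Diaz', 'Kai Chen'],
})

# One row per (section, first) pair that actually appears in the data.
counts = toy.groupby(['section', 'first']).count().reset_index()
```

There are four distinct (section, first) pairs, so `counts` has four rows, and the `'name'` column holds how many students fall in each pair.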

## Merging DataFrames

To combine information from multiple DataFrames, use `.merge`:

```py
left_df.merge(
    right_df, 
    left_on = 'left_col_name', 
    right_on = 'right_col_name'
)
```

The resulting DataFrame contains a **single row for every match between entries in the two specified columns**.
- Rows in either DataFrame without a match will not appear in the merged DataFrame!
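As a small illustration of unmatched rows being dropped, consider the sketch below. It assumes `pandas` in place of `babypandas`, and the DataFrames `students` and `sections` are made up for this example.

```python
import pandas as pd

# Hypothetical data: Noor's section 'C' has no entry in `sections`,
# and section 'D' has no students.
students = pd.DataFrame({
    'student': ['Sam', 'Kai', 'Noor'],
    'sec': ['A', 'B', 'C'],
})
sections = pd.DataFrame({
    'section': ['A', 'B', 'D'],
    'ta': ['Riley', 'Jordan', 'Casey'],
})

# Match students' 'sec' values against sections' 'section' values.
merged = students.merge(sections, left_on='sec', right_on='section')
```

Only the rows for sections `'A'` and `'B'` appear in `merged`; the student in `'C'` and the TA for `'D'` have no match and are left out.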

<center>
    <img src="images/merge.png" width=600>
</center>
<br>

- If the names of the columns we are merging on are the same in both DataFrames, use `on = 'col'`.
- If we want to merge using an index instead of a column, use `left_index = True` and/or `right_index = True`.
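Both variations can be sketched on a tiny made-up example. This again assumes `pandas` in place of `babypandas`, and `students` and `tas` are hypothetical DataFrames.

```python
import pandas as pd

# Hypothetical data; both tables use the column name 'section'.
students = pd.DataFrame({'student': ['Sam', 'Kai'], 'section': ['A', 'B']})
tas = pd.DataFrame({'section': ['A', 'B'], 'ta': ['Riley', 'Jordan']})

# Same column name in both DataFrames: use on=.
same_name = students.merge(tas, on='section')

# Section stored in the index of the right DataFrame: use right_index=True.
tas_by_section = tas.set_index('section')
via_index = students.merge(tas_by_section, left_on='section', right_index=True)
```

In both cases each student ends up paired with the TA for their section; the only difference is where the matching values live.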

*[PandasTutor](https://pandastutor.com/) is a great resource for visualizing how DataFrame merging works!*