(ws2)=
# Worksheet 2

:::{epigraph}
Datasets and Probability

-- TODO your name here
:::

:::{admonition} Collaboration Statement
- TODO brief statement on the nature of your collaboration.
- TODO your collaborator's names here.
:::

## Learning Objectives

- Learn about `pandas` and `seaborn` for dataset manipulation and visualization.
- Practice with probability concepts needed for the course:
    - Discrete random events and expectation
    - Contingency tables and conditional probabilities
- Become familiar with broadcasting and axis operations in `numpy`.


# 1. Pandas for tabular data [2 pts]

[Pandas](https://pandas.pydata.org/) is the de facto Python framework for working with tabular data, and it is supported by a large ecosystem of libraries, including integration with NumPy. Pandas provides two main data structures:

- `DataFrame`: a 2-dimensional data structure often used to represent a table with rows and named columns. We can think of a `DataFrame` as a 2D numpy array with named columns.
- `Series`: a 1-dimensional, **labelled** array, often used to represent a single column or row in a `DataFrame`

We will primarily be using pandas to load in and analyze datasets before we pass them into our machine learning models.
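
To make the two data structures concrete, here is a minimal sketch on a hypothetical toy table (the names `toy_df` and `age_series` and the values are made up for illustration and are not part of the census data):

```python
import pandas as pd

# Hypothetical toy table: two rows, two named columns
toy_df = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "age": [36, 45],
})

# Selecting one column gives a Series that keeps the row labels (the index)
age_series = toy_df["age"]

print(type(toy_df).__name__)      # DataFrame
print(type(age_series).__name__)  # Series
print(age_series.index.tolist())  # [0, 1]
print(toy_df.to_numpy().shape)    # (2, 2)
```

The last line shows the connection to NumPy: the values of a dataframe can be viewed as a 2D array with one row per record and one column per named column.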

To familiarize ourselves with working with the `pandas` library, we will load US Census data provided by the [folktables package](https://github.com/socialfoundations/folktables) and perform some fundamental pandas operations. 

From the creators of [folktables](https://github.com/socialfoundations/folktables):

> Folktables is a Python package that provides access to datasets derived from the US Census, facilitating the benchmarking of machine learning algorithms. The package includes a suite of pre-defined prediction tasks in domains including income, employment, health, transportation, and housing, and also includes tools for creating new prediction tasks of interest in the US Census data ecosystem. The package additionally enables systematic studies of the effect of distribution shift, as each prediction task can be instantiated on datasets spanning multiple years and all states within the US.
> 
> Why the name? Folktables is a neologism describing tabular data about individuals. **It emphasizes that data has the power to create and shape narratives about populations and challenges us to think carefully about the data we collect and use.**

We will study the context and history around machine learning research using US Census data in the upcoming weeks, but for now, let's take a look at one particularly prominent dataset: the [ACS (American Community Survey)](https://www.census.gov/programs-surveys/acs) Income dataset, which contains socioeconomic data about individuals in the US.

In [None]:
import numpy as np

# The standard import idiom for pandas
import pandas as pd

# Load in the ACS Income dataset
income_df = pd.read_csv('~/COMSC-335/data/adult_reconstruction.csv')

:::{note}
To load data from files manually, pandas provides various `pd.read_*` functions. For example, [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) loads CSV (comma-separated values) files. See pandas' [I/O documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more options.
:::
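
As a small illustration of `pd.read_csv` that does not depend on any file on disk, the sketch below reads hypothetical CSV text from an in-memory buffer. The contents of `csv_text` are invented for the example:

```python
import io
import pandas as pd

# Hypothetical CSV contents held in memory instead of in a file on disk
csv_text = "age,workclass\n25,Private\n41,State-gov\n"

# read_csv accepts a file path or any file-like object
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)             # (2, 2)
print(small_df.columns.tolist())  # ['age', 'workclass']
```

The first line of the text is used as the column names, and each following line becomes one row.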

We often suffix variable names with `_df` to make it clear that we are working with a dataframe.

For high-level inspection of the dataframe, we can use the following functions and attributes:

- `df.head()`: returns the first 5 rows of the dataframe
- `df.tail()`: returns the last 5 rows of the dataframe
- `df.info()`: returns a summary of the dataframe, including the number of rows, columns, and the data types of each column
- `df.columns`: returns the column names of the dataframe
- `df.shape`: returns the number of rows and columns in the dataframe
- `df.dtypes`: returns the data types of each column

You can play around with the dataframe in the cell below:

In [None]:
income_df.head()

In [None]:
# Like numpy, pandas provides a `shape` attribute that returns a tuple of (num rows, num columns)
income_df.shape

From above, we see that the dataframe has 49,531 rows and 14 columns. Here, each row represents an individual, and each column represents a feature of the individual. The machine learning task is to predict the `income` of an individual based on the other features.

Taking a look at the columns, we see that there is a wide range of demographic and occupational features collected:

In [None]:
income_df.columns

## Column selection and filtering

To select a single column, we can use square bracket indexing with the name of the column:

In [None]:
# Selects the 'education' column and prints the first 5 rows
income_df['education'].head()

When initially exploring a dataset, it is often useful to see the unique values in a column:

In [None]:
# Get the unique values in the 'education' column
income_df['education'].unique()

The square bracket indexing can be generalized to selecting multiple columns by passing a list of column names:

In [None]:
# Selects multiple columns and prints the last 10 rows
cols = ['education', 'age', 'hours-per-week']
income_df[cols].tail(10)
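
One detail worth seeing on a small example is that a single column name and a list of names return different types. The sketch below uses a hypothetical `toy_df` rather than the census data:

```python
import pandas as pd

toy_df = pd.DataFrame({"age": [25, 41, 33], "hours": [40, 20, 45]})

single = toy_df["age"]     # a column name -> Series
double = toy_df[["age"]]   # a list of names -> DataFrame with one column

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```

This mirrors the difference between a 1D array of shape `(3,)` and a 2D array of shape `(3, 1)` in NumPy.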

We can also remove columns by using the `drop` method:

In [None]:
# Drop the 'relationship' column
income_df = income_df.drop(columns=['relationship'])

:::{tip}

Most operations that modify a dataframe, such as `drop`, return a new dataframe with the changes and leave the original dataframe unmodified. So if we want to keep the change, we need to assign the result back to the original variable:

```python
income_df = income_df.drop(columns=['relationship'])
```

:::
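
The behavior described in the tip can be checked on a hypothetical two-column dataframe (`toy_df` and `dropped_df` are illustrative names):

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# drop returns a new dataframe; toy_df itself still has both columns
dropped_df = toy_df.drop(columns=["b"])

print(list(toy_df.columns))      # ['a', 'b']
print(list(dropped_df.columns))  # ['a']
```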

Just like NumPy, we can also use boolean indexing to select portions of the dataframe based on a condition:

In [None]:
# Selects individuals who are below the age of 30
sel_df = income_df[income_df['age'] < 30]

sel_df['age'].unique()

We can then use the `value_counts` function to get the frequency of each category in a column:

In [None]:
# Print the value counts of the 'workclass' column for individuals below the age of 30
print(sel_df['workclass'].value_counts())

# Normalize=True to get the proportion of each category
print(sel_df['workclass'].value_counts(normalize=True))

These boolean conditions can be combined using the `&` (AND), `|` (OR), and `~` (NOT) operators. Additionally, there are some special functions that can be used to select data based on a condition:

In [None]:
# Combining conditions: respondents who are less than 30 years old AND have non-zero capital-gain
income_df[(income_df['age'] < 30) 
        & (income_df['capital-gain'] > 0)]

# isin(): select rows where a column is in a list of values
income_df[income_df['workclass'].isin(['Local-gov', 'State-gov'])]

:::{tip}
To avoid errors, always use parentheses when combining conditions:

- Incorrect: `df[column1 == value1 & column2 == value2]`
- Correct: `df[(column1 == value1) & (column2 == value2)]`
:::
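
To see why the parentheses matter, here is a sketch on a hypothetical `toy_df` with made-up values. In Python, `&` binds more tightly than comparison operators such as `<` and `>`, so leaving out the parentheses changes what is being compared:

```python
import pandas as pd

toy_df = pd.DataFrame({"age": [22, 35, 28], "gain": [0, 500, 100]})

# With parentheses, each comparison is evaluated first, then combined with &
young_with_gain = toy_df[(toy_df["age"] < 30) & (toy_df["gain"] > 0)]
print(young_with_gain["age"].tolist())  # [28]

# ~ negates a condition, | means OR
not_young_or_no_gain = toy_df[~(toy_df["age"] < 30) | (toy_df["gain"] == 0)]
print(not_young_or_no_gain["age"].tolist())  # [22, 35]

# Without parentheses, Python evaluates 30 & toy_df["gain"] first,
# and the resulting chained comparison fails
try:
    toy_df[toy_df["age"] < 30 & toy_df["gain"] > 0]
except ValueError as err:
    print("ValueError:", err)
```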


**1.1**. Let's practice combining these operations. Select respondents who:

- work **full-time**, defined as `hours-per-week >= 40` AND
- have an `education` level in `['Bachelors', 'Masters', 'Doctorate']`

Within this group, compute the counts of `occupation` using `value_counts()`. This column indicates the type of job the respondent has.



How many individuals are in this group? What is the most common occupation in this group?


**Your response**: TODO

In [None]:
# TODO your code here (make sure to use income_df for the dataframe)
full_time_at_least_bachelors_df = None

:::{admonition} Solution (click to check once you've completed 1.1)
:class: dropdown

You should see that the group of respondents who work full-time and have an education level defined above has 9,463 individuals. The most common occupation in this group is `Prof-specialty` with 3,287 individuals, which corresponds to some professional specialization role (*not* an academic professor).
:::


We can also use boolean logic to create new columns. For example, a common age cutoff used in US policy studies is 65, as it is the age that corresponds to "seniors" and is when people are eligible for Medicare health insurance:

In [None]:
# Create a new column 'is_senior' that is 1 if the respondent is 65 or older, and 0 otherwise
# .astype(int) converts the boolean values into 0s and 1s
income_df['is_senior'] = (income_df['age'] >= 65).astype(int)

income_df['is_senior'].value_counts()

We'll commonly apply this transformation to a continuous variable to create a binary indicator column, such as when we want to create a $y$ column for a binary classification task.

**1.2** Complete the `binarize_column` function below. This function takes a pandas Series and a cutpoint as arguments, and returns a Series with 1s and 0s indicating whether each element of the input series is greater than the cutpoint:

In [None]:
def binarize_column(column: pd.Series, cutpoint: float) -> pd.Series:
    """
    Binarizes a column based on a cutpoint.

    Args:
        column (pd.Series): The column to binarize.
        cutpoint (float): The cutpoint to use for binarization.

    Returns:
        pd.Series: A column with 1s and 0s indicating whether each element of the input series is greater than the cutpoint.
    """
    # TODO your code here
    pass

In [None]:
#### Test binarize_column ####
if __name__ == "__main__":
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
    df['A_bin'] = binarize_column(df['A'], 2.5)
    assert df.equals(pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_bin': [0, 0, 1, 1, 1]})), "Binarization is incorrect"

## Grouping and aggregation

The [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) operation is a powerful tool for performing aggregations on subsets of the dataframe. We pass in one or more columns as the `by` argument, which then divides the original dataframe into groups based on the unique values of the column(s). We then often apply an **aggregation** function to each group, resulting in a new dataframe.

You can think of it as:

- split the dataframe into groups (based on unique values of a column), then
- apply an aggregation (like `mean`, `median`, `size`, etc.) to each group.


We can replicate `value_counts()` with `groupby()` and the `size()` aggregation function:

In [None]:
# Count number of people in each workclass category
income_df.groupby(by='workclass').size()
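
The split-then-aggregate description can be sketched by hand on a hypothetical toy dataframe and compared against `groupby()`. The manual version below is only meant to show the idea, not how pandas implements it:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov"],
    "hours": [40, 50, 30],
})

# Split: one boolean mask per unique value; apply: count the rows in each subset
manual_counts = {
    value: len(toy_df[toy_df["workclass"] == value])
    for value in toy_df["workclass"].unique()
}

# groupby performs the same split-then-aggregate in one call
grouped_counts = toy_df.groupby("workclass").size()

print(manual_counts)                  # {'Private': 2, 'State-gov': 1}
print(grouped_counts.index.tolist())  # ['Private', 'State-gov']
print(grouped_counts.tolist())        # [2, 1]
```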

We can also compute summary statistics like `mean()`, `std()`, `median()`, `min()`, `max()` on columns after grouping:

In [None]:
# First, group by the 'workclass' column
# Then, see the average number of work hours per week for each group
income_df.groupby('workclass')['hours-per-week'].mean()

To apply multiple aggregation functions, we can pass in a dictionary of columns as keys and functions as values to [agg()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html):

In [None]:
# we want to apply the following aggregations:
agg_dict = {
    # mean, min, max for hours-per-week
    'hours-per-week': ['mean', 'min', 'max'],
    # median for education-num
    'education-num': ['median']
}

# apply the aggregations, grouping by 'workclass'
income_df.groupby('workclass').agg(agg_dict)
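
When a dictionary of lists is passed to `agg()`, the resulting table has two-level column labels. The sketch below, on a hypothetical `toy_df`, shows one way to read a single statistic out of such a table:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "hours": [40, 50, 30],
    "years": [12, 16, 14],
})

summary = toy_df.groupby("group").agg({
    "hours": ["mean", "max"],
    "years": ["median"],
})

# Column labels are (column, aggregation) pairs
print(summary.columns.tolist())
# [('hours', 'mean'), ('hours', 'max'), ('years', 'median')]

# A single statistic is selected with that pair; .loc picks the group
print(summary[("hours", "mean")].loc["a"])  # 45.0
```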

**1.3**. Compute a summary table that groups by the `is_senior` column and then computes the following aggregations:   

- mean and median `income`
- mean and median `capital-gain`

What is the mean `income` for seniors? Do seniors have more or less median `capital-gain` than non-seniors?

**Your response**: TODO

In [None]:
# TODO your code here


:::{admonition} Solution (click to check once you've completed 1.3)
:class: dropdown

The groupby aggregation table should show that seniors have a mean income of about \$29,670. Seniors have a higher median capital gain of \$1,877, compared to non-seniors with a median capital gain of \$1,029.

:::


## Processing categorical variables with `get_dummies`

Many machine learning models expect purely numeric or binary features, but we've seen so far that our dataset has a number of categorical features that are encoded as strings.

A common step taken to prepare data is **one-hot encoding**, where we convert a categorical column like `workclass` into a set of binary indicator columns: a new column is generated for each category within the column, called a **dummy variable**.

In pandas, there is the [`pd.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function that can be used to perform this transformation.

Let's first see what the last 5 rows of the `workclass` column look like:

In [None]:
display(income_df['workclass'].tail())

Then, we'll use `pd.get_dummies` to convert the `workclass` column into a set of binary indicator columns, one for each category in the column:

In [None]:
# Generate a new dataframe with binary columns for each category in 'workclass'
# dtype=int converts the boolean values into 0s and 1s
workclass_dummy_df = pd.get_dummies(
    data=income_df,
    columns=['workclass'],
    dtype=int
)

display(workclass_dummy_df)

Notice how in the `workclass_dummy_df`, each row has a 1 in the `workclass_category` column if the original `workclass` column had that category, and 0 otherwise. The columns argument can also be a list of columns to encode.
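
The same transformation is easier to inspect on a tiny table. This sketch uses a hypothetical `toy_df` with an invented `color` column:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [25, 41, 33],
    "color": ["red", "blue", "red"],
})

dummy_df = pd.get_dummies(data=toy_df, columns=["color"], dtype=int)

# 'color' is replaced by one indicator column per category
print(dummy_df.columns.tolist())  # ['age', 'color_blue', 'color_red']

# Each row has exactly one 1 among the indicator columns
print(dummy_df[["color_blue", "color_red"]].sum(axis=1).tolist())  # [1, 1, 1]
```

Note that the encoded column itself is removed from the result, while the untouched columns such as `age` are carried over unchanged.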

**1.4**. Use `pd.get_dummies()` to one-hot encode the `occupation` column in the `income_df` dataframe.

What is the shape of the new dataframe compared to the original `income_df`? What does that tell us about the number of categories in the `occupation` column?

**Your response**: TODO

In [None]:
# TODO your code here
occupation_dummy_df = None

display(occupation_dummy_df.head())

:::{admonition} Solution (click to check once you've completed 1.4)
:class: dropdown

The shape of `occupation_dummy_df` is `(49531, 28)`, which has the same number of rows as `income_df` but 14 more columns: the single `occupation` column is replaced by 15 indicator columns. This tells us that there are 15 unique categories in the `occupation` column. We can also check this by using the `nunique()` function: `income_df['occupation'].nunique()`

:::

:::{admonition} More on pandas

If you'd like to learn more about pandas operations, see [this quickstart guide](https://pandas.pydata.org/docs/user_guide/10min.html) and the associated links within it.

:::


# 2. Seaborn for data visualization [1 pt]

[seaborn](https://seaborn.pydata.org/tutorial/introduction.html) is one of the most popular libraries for creating visualizations in Python, as it is a higher-level library built on top of the more fundamental [matplotlib](https://matplotlib.org/) library.

We'll use seaborn to visualize some relationships between variables in our income dataset.

First, let's import seaborn using its standard import idiom, which abbreviates the name to `sns`:

In [None]:
import seaborn as sns

Many seaborn plots follow a similar argument pattern, where we often pass in the following arguments:
- `data`: the dataframe to plot
- `x`: the column to plot on the x-axis
- `y`: the column to plot on the y-axis, if applicable
- `hue`: the column to use for color-coding the points

Seaborn has tight integration with pandas, which allows us to use pandas to filter and manipulate the data before passing it to seaborn. Let's generate a [sns.histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot) of `income` for individuals aged 30 or older who work in the private sector:

In [None]:
if __name__ == "__main__":
  # Select out data for individuals aged 30 or older who work in the private sector
  above_30_private_df = income_df[
        (income_df['age'] >= 30) 
      & (income_df['workclass'] == 'Private')
  ]

  # Generate the histogram with:
  # Data: the above_30_private_df dataframe
  # x-axis: 'income'
  # y-axis: is the count of individuals in each bin so it does not need to be specified
  sns.histplot(data=above_30_private_df, x='income')

**2.1.** You should see that the distribution tails off to the right, with a large mass of individuals at the \$100k mark. Speculate on why you think this mass is at \$100k, keeping in mind that this data is generated from a survey of individuals in the US:

**Your response**: TODO

The most common machine learning task associated with this dataset is a binary classification task, where we predict whether the `income` of an individual is greater than or equal to \$50k based on the given features. This dataset in particular has been extensively used historically as a benchmark for evaluating the [**fairness** of machine learning models](https://arxiv.org/pdf/2108.04884), such as whether the model is biased towards certain groups of individuals. We'll examine aspects of fairness in the upcoming weeks, but first let's get a sense of the data. Run the cell below to create a column that is a binary indicator of `income` >= \$50k:

In [None]:
income_df['income_>50k'] = income_df['income'] >= 50000

Complete the cell below to generate a histplot of `age` with the following parameters:

- `data=income_df`: use the `income_df` dataframe for the plot
- `x='age'`: plot the `age` column on the x-axis

In [None]:
if __name__ == "__main__":
    # TODO your code here
    pass


**2.2** Now, add the following parameters to your plot above:

- `hue='income_>50k'`: this colors the histogram bars by the `income_>50k` column
- `multiple='stack'`: stack the colored histograms on top of each other

Briefly describe what you see in the relationship between `age` and `income_>50k` in the plot above. Some questions to consider:

- Are there ages where there are very few individuals with incomes > \$50k?
- Is there a particular age, or age range where the proportion of individuals with incomes > \$50k is higher?

**Your response**: TODO

:::{admonition} Observations (click to check once you've completed 2.2)
:class: dropdown

In this data, you should see that there are very few individuals with incomes > \$50k with ages <20, as well as in the ~75-80 age range.

There is a peak in the proportion of individuals with incomes > \$50k around age 47, and generally it looks like the older the individual, the more likely they are to have an income > \$50k up until ~47, where it begins to slowly decline.

:::


The 4 most common occupations are listed below (how would you check this using the pandas commands we saw earlier?):

- Craft-repair
- Prof-specialty
- Exec-managerial
- Adm-clerical


Let's see how the proportion of individuals with incomes > \$50k varies by these top 4 occupations with an [sns.countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html#seaborn.countplot), which shows counts or percentages across categories. 


**2.3** Complete the code below with the following parameters passed to countplot:

- `data=top4_occupations_df`: use the `top4_occupations_df` dataframe for the plot
- `x='occupation'`: plot the `occupation` column on the x-axis
- `hue='income_>50k'`: this colors the bars by the `income_>50k` column
- `stat='percent'`: show the percentage of individuals in each bar on the y-axis

Which occupations seem to be most "predictive" of income > \$50k?

**Your response**: TODO

In [None]:
if __name__ == "__main__":
    top4_occupations = [
        'Craft-repair',
        'Prof-specialty',
        'Exec-managerial',
        'Adm-clerical',
    ]

    # TODO select the rows in income_df where the 'occupation' column is in the top4_occupations list
    top4_occupations_df = None
    

:::{admonition} Observations (click to check once you've completed 2.3)
:class: dropdown

The `Exec-managerial` and `Prof-specialty` occupations have the highest percentage of individuals with incomes > \$50k, while the `Adm-clerical` and `Craft-repair` occupations have the lowest.

:::

:::{admonition} More on seaborn
:class: note

The following official resources are good references for seaborn if you'd like to learn more:

- [seaborn plotting overview](https://seaborn.pydata.org/tutorial/function_overview.html)
- [seaborn data structures](https://seaborn.pydata.org/tutorial/data_structure.html)

:::

# 3. Probability primer [1 pt]

We're now in the process of moving into machine learning **classification**, where the goal is to predict categories instead of continuous values. For this, we'll need to utilize some fundamental concepts from probability.

## Chance events

Probability is the mathematical framework that allows us to reason about randomness and chance. We often want to reason about discrete events that happen, such as whether a coin flip comes up heads or tails, whether it will rain tomorrow, or whether a machine learning model trained to recognize cats identifies an image as a cat or not.

For all of these situations, we assign a probability $P(A)$ to an event $A$. For example, a fair coin has a 50% chance of landing heads and a 50% chance of landing tails, so:

$$P(\text{heads}) = 0.5, \quad P(\text{tails}) = 0.5$$

If we assign heads and tails to be numeric outcomes, e.g. $\text{heads} = 1$ and $\text{tails} = 0$, then the coin flip can be thought of as a **random variable** $Y$.

In order for something to be considered a valid (discrete) random variable, the probability of each event must be between 0 and 1, and the sum of the probabilities of all events **must** equal 1. In the case of a fair coin, $P(Y=1) + P(Y=0) = 0.5 + 0.5 = 1$.
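
The two validity conditions can be written as a short check. This is a minimal sketch in which `is_valid_pmf` is a hypothetical helper and a distribution is represented as a dictionary mapping outcomes to probabilities:

```python
import math

def is_valid_pmf(pmf: dict) -> bool:
    """Check that probabilities lie in [0, 1] and sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0)
    return in_range and sums_to_one

fair_coin = {1: 0.5, 0: 0.5}
broken = {1: 0.5, 0: 0.6}  # probabilities sum to 1.1

print(is_valid_pmf(fair_coin))  # True
print(is_valid_pmf(broken))     # False
```

`math.isclose` is used instead of `==` because sums of floating-point probabilities may differ from 1 by a tiny rounding error.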

We most frequently work in situations where the random variable is binary, for example:

$$
Y = \begin{cases}
    1 & \text{image has a cat} \\
    0 & \text{image does not have a cat}
\end{cases}
$$

**3.1:** Suppose that we have a dataset of 1000 images, where 300 of them have cats in them. What is $P(Y=0)$, that is, the probability that an image does not have a cat?

**Your response**: TODO

## Expectation

The **expectation** of a random variable $Y$ is the average value of $Y$ over all possible outcomes, weighted by the probability of each outcome. It is given by:

$$
E[Y] = \sum_{y \in \mathcal{Y}} y P(Y=y)
$$

where $\mathcal{Y}$ is the set of all possible outcomes for $Y$. In the case of binary random variables, $\mathcal{Y} = \{0, 1\}$ for the two possible outcomes.
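
As an illustration with a non-binary variable, the sum can be evaluated for a fair six-sided die. This example is separate from the cat dataset, and exact fractions are used to avoid rounding:

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, each face with probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E[Y] = sum over outcomes y of y * P(Y = y)
expectation = sum(y * p for y, p in pmf.items())

print(expectation)         # 7/2
print(float(expectation))  # 3.5
```

Note that the expectation, 3.5, is not itself a possible outcome of the die; it is a probability-weighted average of the outcomes.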

**3.2:** Continuing from our cat picture example above, what is the expectation $E[Y]$?

$$
E[Y] = TODO
$$

:::{admonition} Solutions (click to check once you've completed 3.1-3.2)
:class: dropdown

**3.1:** $P(Y=0) = 1 - P(Y=1) = 1 - 0.3 = 0.7$

**3.2:** $E[Y] = 0 \cdot P(Y=0) + 1 \cdot P(Y=1) = 0 \cdot 0.7 + 1 \cdot 0.3 = 0.3$

An identity that is often used is that $E[Y] = P(Y=1)$ for binary random variables.

:::


## Joint and conditional probabilities

We also often want to reason about the probability of two random variables occurring together, called a **joint probability**. If the two random variables $Y$ and $Z$ are binary, we can represent the joint probability using a **2x2 contingency table**:
 
$$
\begin{array}{c|c|c}
       & Y=0 & Y=1 \\
\hline
Z=0 & \text{count of } Z=0 \text{ AND } Y=0 & \text{count of } Z=0 \text{ AND } Y=1 \\
Z=1 & \text{count of } Z=1 \text{ AND } Y=0 & \text{count of } Z=1 \text{ AND } Y=1 \\
\end{array}
$$

Each entry in the contingency table is a count of the number of times the corresponding event occurs. For example, the entry in the first row and first column is the count of the number of times $Y=0$ and $Z=0$ occur together.
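
If the two variables are stored as columns of a dataframe, such a table of counts can be built with `pd.crosstab`. The observations in `obs_df` below are hypothetical:

```python
import pandas as pd

# Hypothetical observations of two binary variables
obs_df = pd.DataFrame({
    "Z": [0, 0, 0, 1, 1, 1, 1, 0],
    "Y": [0, 0, 1, 1, 1, 0, 1, 0],
})

# Rows are values of Z, columns are values of Y, entries are counts
table = pd.crosstab(obs_df["Z"], obs_df["Y"])
print(table)

print(table.loc[0, 0])         # 3: count of Z=0 AND Y=0
print(table.to_numpy().sum())  # 8: total number of observations
```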

Let $N$ be the total count of all the entries in the contingency table. We can compute the **marginal probability** of $Y$ by summing the entries in each column of the contingency table and dividing by $N$:

$$
P(Y=0) = \frac{(\text{count of } Y=0 \text{ and } Z=0) + (\text{count of } Y=0 \text{ and } Z=1)}{N}
$$

$$
P(Y=1) = \frac{(\text{count of } Y=1 \text{ and } Z=0) + (\text{count of } Y=1 \text{ and } Z=1)}{N}
$$

Similarly, we can compute the marginal probability of $Z$ by summing the entries in each row of the contingency table and dividing by $N$:

$$
P(Z=0) = \frac{(\text{count of } Z=0 \text{ and } Y=0) + (\text{count of } Z=0 \text{ and } Y=1)}{N}
$$

$$
P(Z=1) = \frac{(\text{count of } Z=1 \text{ and } Y=0) + (\text{count of } Z=1 \text{ and } Y=1)}{N}
$$



The **joint probability** of $Y=y$ and $Z=z$ occurring together, $P(Y=y, Z=z)$, is given by the entry in the contingency table for the corresponding row and column, divided by $N$. For example, if $Y=0$ and $Z=1$, then:

$$
P(Y=0, Z=1) = \frac{\text{count of } Y=0 \text{ and } Z=1}{N}
$$



The conditional probability of $Y$ given $Z$, $P(Y=y \mid Z=z)$, is the joint probability $P(Y=y, Z=z)$ divided by the marginal probability of $Z$. Equivalently, it is the entry in the contingency table for the corresponding row and column, divided by the total count in the $Z=z$ row:

$$
P(Y=y \mid Z=z) = \frac{P(Y=y, Z=z)}{P(Z=z)}
$$
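
The marginal, joint, and conditional formulas can be traced numerically with NumPy. The counts below are hypothetical, with rows for $Z$ and columns for $Y$ as in the table layout above:

```python
import numpy as np

# Hypothetical counts: rows are Z=0, Z=1 and columns are Y=0, Y=1
counts = np.array([[30, 10],
                   [20, 40]])
N = counts.sum()

joint = counts / N        # P(Y=y, Z=z), indexed as joint[z, y]
p_y = joint.sum(axis=0)   # marginal of Y: add up each column
p_z = joint.sum(axis=1)   # marginal of Z: add up each row

# P(Y=1 | Z=1) = P(Y=1, Z=1) / P(Z=1)
p_y1_given_z1 = joint[1, 1] / p_z[1]

print(p_y)            # [0.5 0.5]
print(p_z)            # [0.4 0.6]
print(p_y1_given_z1)  # approximately 0.667
```

Dividing the raw counts gives the same answer: `counts[1, 1] / counts[1].sum()` is 40 / 60.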

Suppose we also know that some of our 1000 images also contain a cardboard box. We can represent this as a new random variable $Z$:

$$
Z = \begin{cases}
    1 & \text{image contains a cardboard box} \\
    0 & \text{image does not contain a cardboard box}
\end{cases}
$$


:::{figure} images/uni_cat_box.jpg
:scale: 50%

A picture where $Z=1$ and $Y=1$. [Source: Uni the cat](https://www.instagram.com/unico_uniuni/)

:::




We then have the following contingency table for our 1000 images:

 
$$
\begin{array}{c|c|c}
 & Y=0 & Y=1 \\
\hline
Z=0 & 600 & 100 \\
Z=1 & 100 & 200 \\
\end{array}
$$

**3.3**: Compute the following probabilities:

- $P(Z=0) = TODO$
- $P(Z=1) = TODO$
- $P(Y=1 \mid Z=0) = TODO$
- $P(Y=1 \mid Z=1) = TODO$

**3.4** Given your answers to 3.3, does the presence or the absence of a cardboard box seem to be more predictive of whether an image has a cat in it?

**Your response**: TODO

:::{admonition} Solutions (click to check once you've completed 3.3 and 3.4)
:class: dropdown

- $P(Z=0) = 0.7$
- $P(Z=1) = 0.3$
- $P(Y=1 \mid Z=0) = 1/7 \approx 0.1429$
- $P(Y=1 \mid Z=1) = 2/3 \approx 0.6667$

To assess "predictiveness", we can look at the conditional probabilities of $Y$ given $Z$. The presence of a cardboard box seems to be more predictive of whether an image has a cat in it, as $P(Y=1 \mid Z=1) > P(Y=1 \mid Z=0)$. Additionally, $P(Y=1 \mid Z=1) > P(Y=1)$, so the presence of a cardboard box seems to be a useful predictor of whether an image has a cat in it overall.

:::

# 4. Broadcasting and axis operations in NumPy [1.5 pts]

As we scale up the number of features in our machine learning models, we can lean on broadcasting and axis operations in NumPy to make calculations across 2D arrays more efficient.


## Broadcasting

[Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html) is one of the most powerful concepts in NumPy, but it is also one that takes some getting used to. So, let's review the idea of NumPy array shapes and what happens when using operators on differently shaped arrays.

We saw on Worksheet 1 that arithmetic operations performed on arrays with the same shape are computed element-wise. For example:

In [None]:
import numpy as np

a = np.array([1, 1, 1])
b = np.array([2, 4, 6])

print(a + b) # Will print [3 5 7]

Note that both arrays have the same shape (3,), which indicates that they are 1D arrays with 3 elements. **This is different from a 2D array of shape (3, 1)**, which is a vector with 3 rows and 1 column or a 2D array of shape (1, 3), which is a vector with 1 row and 3 columns.
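
The difference between these shapes can be seen directly with `reshape`. This is a small sketch; the final line anticipates the broadcasting rule described next:

```python
import numpy as np

v = np.array([1, 2, 3])   # shape (3,)
col = v.reshape(3, 1)     # shape (3, 1): 3 rows, 1 column
row = v.reshape(1, 3)     # shape (1, 3): 1 row, 3 columns

print(v.shape, col.shape, row.shape)  # (3,) (3, 1) (1, 3)

# Same values, but the shape changes the result of arithmetic
print((v + v).shape)    # (3,)
print((col + v).shape)  # (3, 3)
```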

If NumPy encounters two arrays with different shapes, it will attempt to **broadcast** the arrays together. NumPy compares their shapes dimension by dimension, starting from the rightmost dimension and working its way to the left. If one array has fewer dimensions, its missing leading dimensions are treated as having size 1. Two dimensions are compatible when:

- they are equal, or
- one of them is 1

Once one of these conditions is met, the arithmetic operation is performed element-wise across that dimension. Below are some examples.


In [None]:
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# b is an array of shape (2, 3)
b = np.array([[1, 1, 1],
              [1, 1, 1]])

# Will print [[2 3 4], 
#             [5 6 7]]
print(a + b)

Above, working from the rightmost dimension, we see that both dimensions for `a` and `b` are equal:

```python
a       (2d array): 2 x 3
b       (2d array): 2 x 3
result  (2d array): 2 x 3
```

Therefore, the arithmetic operation is performed element-wise across the array.

In [None]:
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# c is an array of shape (3,)
c = np.array([3, 2, 1])


# Will print [[ 3  4  3], 
#             [12 10  6]]
print(c * a)

Even though `c` has a different shape than `a`, we see that the dimension of `c` matches the last dimension of `a`, so the multiplication operation is performed element-wise across that dimension:

```python
c       (1d array):     3
a       (2d array): 2 x 3
result  (2d array): 2 x 3
```

However, if we try to broadcast `a` with a 1D array of shape (2,), NumPy will raise a ValueError because the dimensions do not match:

```python
a       (2d array): 2 x 3
d       (1d array):     2
result: ValueError because of a dimension mismatch
```

In [None]:
# a is an array of shape (2, 3)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# d is an array of shape (2,)
d = np.array([1, 2])

# Will raise an Error because the dimensions do not match
print(a + d)
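
The compatibility rule can also be checked on shapes alone, without creating any arrays. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; `try_broadcast` is a hypothetical helper name:

```python
import numpy as np

def try_broadcast(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

print(try_broadcast((5, 4), (4,)))  # (5, 4)
print(try_broadcast((4, 1), (3,)))  # (4, 3)
print(try_broadcast((2, 3), (2,)))  # None
```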

**4.1** If `a` is an array of shape `(100,)` and `b` is an array of shape `(100, 10)`, what is the shape of `a - b` (or would it raise an error)?

**Your response**: TODO

**4.2** If `c` is an array of shape `(256, 16)` and `d` is an array of shape `(16,)`, what is the shape of `c + d` (or would it raise an error)?

**Your response**: TODO


**4.3** Work out what the output of the following arithmetic operation will be before checking your answer by running the code:

```python

# w is a 1D array of shape (2,)
w = np.array([1, -2])

# X is a 2D array of shape (3, 2)
X = np.array([[0, 1],
              [1, 0],
              [2, 2]])

print(w * X)
```

**Your response**: 

```python
[[TODO]]
```

:::{admonition} Solutions (click to check once you've completed 4.1-4.3)
:class: dropdown

**4.1:** Since `a` is a 1D array and its shape does not match the rightmost dimension of `b`, this will raise a ValueError.

**4.2:** Since `c`'s rightmost dimension matches `d`'s shape, the addition will be successful. The resulting array will have shape `(256, 16)`, with `d` being added to each **row** of `c`.

**4.3:** `w` is broadcast across the rows of `X`, so each row of `X` is multiplied element-wise by `w` (the first column is scaled by `w[0] = 1` and the second column by `w[1] = -2`):

```python
[[0, -2],
 [1, 0],
 [2, -4]]
```

:::

## Axis operations

When working with 2D arrays, we often want to apply operations across rows or columns. NumPy provides a convenient way to do this using the `axis` parameter in many aggregation functions.

For example, we have used the [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) function to sum all the elements in an array:

In [None]:
a = np.array([1, 2, 3])

# Will print 6
print(np.sum(a)) 

However, in 2D arrays we could also sum the elements across the rows or columns. By default, `np.sum` will sum all the elements in the array:

In [None]:
b = np.array([[1, 2, 3],
              [4, 5, 6]])

# Will print a single number, the sum of all the elements in b: 21
print(np.sum(b)) 

If we wanted to sum the column values for each row, we could specify the parameter `axis=1`, which tells `np.sum` to sum across the second dimension:

In [None]:
# Will print [ 6 15]
print(np.sum(b, axis=1))

This produces a new array of shape `(2,)`, which is the sum of each row in `b`.

Similarly, if we wanted the sum of each column, we could specify the parameter `axis=0`, which tells `np.sum` to sum across the first dimension:

In [None]:
# Will print [5 7 9]
print(np.sum(b, axis=0))

This produces a new array of shape `(3,)`, which is the sum of each column in `b`.

Another way we can think about the `axis` parameter is that we're telling NumPy which dimension to collapse in the resulting array:

```python
                           dim: 0   1
b                   (2d array): 2 x 3
# dimension 0 is collapsed
np.sum(b, axis=0)   (1d array): _   3


# dimension 1 is collapsed
np.sum(b, axis=1)   (1d array): 2   _
```
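
Axis operations and broadcasting are often used together. As one illustrative sketch, the `keepdims=True` option of `np.sum` keeps the collapsed dimension with size 1, so the result can be broadcast back against the original array:

```python
import numpy as np

b = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# keepdims=True leaves the collapsed axis in place with length 1
row_sums = np.sum(b, axis=1, keepdims=True)
print(row_sums.shape)  # (2, 1)

# (2, 3) divided by (2, 1) broadcasts across the columns
row_fractions = b / row_sums
print(np.sum(row_fractions, axis=1))  # each row now sums to 1 (up to rounding)
```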

**4.4.** Suppose that `X` is a 2D array of shape `(n, p)`. Write the line of code that computes the mean of each column of `X`, which results in an array of shape `(p,)`:

**Your response**: `TODO`

:::{admonition} Solution (click to check once you've completed 4.4)
:class: dropdown
We want to collapse the dimension that corresponds to the number of rows, so we specify `axis=0`: `np.mean(X, axis=0)`

:::

Finally, let's put broadcasting and axis operations together to more efficiently compute predictions from our linear regression model.

With one feature, our linear regression model prediction for a given example $x_i$ has the form:

$$
\hat{y}_i = w_0 + w_1 x_{i,1}
$$

With two features on Homework 1, our linear regression model had the form:

$$
\hat{y}_i = w_0 + w_1 x_{i,1} + w_2 x_{i,2}
$$

While we could compute the weighted sum for each row of `X` and then add the bias term manually, we'd like to generalize this to any number of features $p$:

$$
\hat{y}_i = w_0 + \sum_{j=1}^{p} w_j x_{i,j}
$$

Furthermore, we'd like to compute the predictions for all examples in `X` at once:
 

$$
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{bmatrix}
= 
\begin{bmatrix}
w_0 + \sum_{j=1}^{p} w_j x_{1,j} \\
w_0 + \sum_{j=1}^{p} w_j x_{2,j} \\
\vdots \\
w_0 + \sum_{j=1}^{p} w_j x_{n,j}
\end{bmatrix}
$$
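
For reference, the sums above can be written out with explicit loops. The sketch below is a deliberately slow version (`linreg_predictions_loop` is a hypothetical name, and the demo values are made up) that can serve as a check for a vectorized implementation:

```python
import numpy as np

def linreg_predictions_loop(X: np.ndarray, w: np.ndarray, w0: float) -> np.ndarray:
    """Slow reference version: explicit loops over examples and features."""
    n, p = X.shape
    y_hat = np.zeros(n)
    for i in range(n):
        total = w0
        for j in range(p):
            total += w[j] * X[i, j]
        y_hat[i] = total
    return y_hat

X_demo = np.array([[1., 0.],
                   [0., 1.],
                   [2., 2.]])
w_demo = np.array([2., 3.])

print(linreg_predictions_loop(X_demo, w_demo, w0=1.0))  # [ 3.  4. 11.]
```

The nested loops perform `n * p` Python-level multiplications one at a time, which is the work that broadcasting and axis operations can replace with array-level operations.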



Complete the function below to compute the predictions for all examples in `X` at once:

:::{admonition} Hint
:class: tip

This can be done using a single line of code. First, we can use broadcasting to compute the product of `X` and `w`. Then we can sum across the appropriate axis to compute $\sum_{j=1}^{p} w_j x_{i,j}$ for each row. Finally, since $w_0$ is a scalar, we can just add it to the result.
:::

In [None]:
def linreg_predictions(X: np.ndarray, w: np.ndarray, w0: float) -> np.ndarray:
    """Efficiently compute the predictions for all examples in X at once.

    Args:
        X: data examples of shape (n, p)
        w: weights of shape (p,)
        w0: scalar intercept term

    Returns:
        np.ndarray of shape (n,) where each element is the prediction for the corresponding example in X
    """

    # TODO your code here
    return None

In [None]:
if __name__ == "__main__":
    X = np.array([[1., 2.],
                  [3., 4.],
                  [5., 6.]])
    w = np.array([0.1, -0.2])
    w0 = 0.5

    predictions = linreg_predictions(X, w, w0)
    assert predictions.shape == (3,), "predictions should have shape (n,)"
    assert np.allclose(predictions, np.array([0.2, 0.0, -0.2])), "predictions are not correct"


# 5. Reflection [0.5 pts]

**5.1** How much time did it take you to complete this worksheet?

**Your Response**: TODO

**5.2** What is one thing you have a better understanding of after completing this worksheet and going through the class content this week? This could be about the concepts, the reading, or the code.

**Your Response**: TODO

**5.3** What questions or points of confusion do you have about the material covered in the past week of class?

**Your Response**: TODO

# Acknowledgments

- Portions of this worksheet are adapted from Bhargavi's study notes on pandas.
- Some exercises are adapted from [Deisenroth 2020: Mathematics for Machine Learning](https://mml-book.github.io/).
- Folktables was introduced by [Ding et al. 2021: Retiring Adult: New Datasets for Fair Machine Learning](https://arxiv.org/abs/2108.04884)