In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla15.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 15: Beyond `datascience`

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Reflect on the design of MATH 108.
2. Navigate tabular data using the `pandas` library.
3. Create visualizations using the `matplotlib` library.
4. Explore regression using the `sklearn` and `scipy` libraries.
5. Explore classification using the `sklearn` library.

---

## Configure the Notebook

Run the following code cell to set up the notebook.

In [None]:
import numpy as np

---

## MATH 108 Design

- MATH 108 was adapted from UC Berkeley's [**DATA 8** course](https://www.data8.org/)
- Data 8 was designed to introduce first-year students to foundational topics in data science
    - It assumed students had only a high school–level background in math (basic algebra and some statistics)
    - It did **not** assume any prior programming experience
- A central goal of the course was for students to **build their own versions** of common data analysis tools:
    - Hypothesis tests, confidence intervals, regression models, k-nearest neighbors classifiers, and more
- These tools typically have standard library implementations, but the emphasis in MATH 108 was on **understanding by constructing**
- The `datascience` library supported this approach by simplifying the programming interface, making it easier for beginners to focus on core concepts while learning to code
- For this activity, you will complete a few past guided learning activities using some more standard methods and avoid using the `datascience` library.

---

## Common Libraries

---

### `pandas`

- In MATH 108, you worked with tabular data using the `Table` data type from the `datascience` library
- A more common and widely used alternative is the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) from the `pandas` library
- While a `Table` is great for teaching and simplicity, a `DataFrame` offers more powerful and flexible tools for real-world data analysis
- The first notable difference for you is that a `DataFrame` shows the row index number by default
- Run the following code to import `pandas` using the standard alias `pd`

In [None]:
import pandas as pd

- View the [official User Guide for `pandas`](https://pandas.pydata.org/docs/user_guide/index.html) to see what features it has.
- Here is a brief comparison between a `DataFrame` and a `Table`:
    | Concept         | `datascience.Table`                  | `pandas.DataFrame`              |
    |----------------|---------------------------------------|----------------------------------|
    | Rows & columns | ✅ Yes                                 | ✅ Yes                            |
    | Column access  | `tbl.column('label')` → NumPy array   | `df['label']` → Series           |
    | Row access     | `tbl.row(index)`                      | `df.iloc[index]`                 |
    | Select columns | `tbl.select('col1', 'col2')`          | `df[['col1', 'col2']]`           |
    | Filter rows    | `tbl.where('col', condition)`         | `df[df['col'] > value]`          |
    | Add columns    | `tbl.with_column('label', values)`    | `df['label'] = values`           |
    | Summary funcs  | `tbl.group(...)`, `tbl.apply(...)`    | `df.groupby(...)`, `df.apply(...)` |
    | Plotting       | `tbl.plot(...)`                       | `matplotlib` or `df.plot(...)`   |
- If you haven't done so already, go through the notebook provided by UC Berkeley on [Translating from `datascience` to `pandas`](https://colab.research.google.com/drive/1zYnagJUnxZWI2BSrpRdnyG7Knu3Tkada?usp=sharing) that was linked in the MATH 108 Conclusion lecture.
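
To make the comparison table concrete, here is a minimal sketch of the `DataFrame` column on a small made-up dataset. The column labels `'name'` and `'score'` and all of the values are hypothetical and are not part of the course materials.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the syntax in the table.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'score': [70, 85, 90, 60],
})

scores = df['score']             # column access -> Series
second_row = df.iloc[1]          # row access by position -> Series
names_only = df[['name']]        # select columns -> DataFrame
high = df[df['score'] > 80]      # filter rows with a Boolean condition
df['curved'] = df['score'] + 5   # add a column (modifies df in place)
```

One design difference to keep in mind: `tbl.with_column(...)` returns a new `Table`, while `df['label'] = values` changes the existing `DataFrame`.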

---

### `matplotlib`

- One of the most fundamental visualization libraries in Python is `matplotlib`
- Matplotlib is renowned for its **customizability**, though it can be less intuitive for beginners.
- The plotting methods in the `datascience` library are built on top of Matplotlib
- You might have noticed that MATH 108 instructors often use Matplotlib directly to add custom touches like titles, updated axes, colors, and more
- Run the following code to import `matplotlib` using the standard alias `plt` and apply the same styling used in MATH 108

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

View the [official User Guide for `matplotlib`](https://matplotlib.org/stable/users/index) to see what features it has.

---

### `scikit-learn`

- The two general types of machine learning problems introduced in MATH 108 are:
  - **Regression**: Predicting numerical outcomes from numerical features.
  - **Classification**: Predicting categorical outcomes from numerical features.
- [**scikit-learn**](https://scikit-learn.org) is a widely used Python library for implementing machine learning models.
  - It includes tools for linear and nonlinear regression, k-nearest neighbors (kNN) classification, and more.
  - It also provides functions to **standardize data**, **evaluate models**, and **split data into training and testing sets**.
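- Run the following line to import `scikit-learn` (its import name is `sklearn`) under the alias `sk`. Unlike `pd` and `plt`, `sk` is not a widely used convention; most code imports tools directly from the submodules, as the tasks below do.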

In [None]:
import sklearn as sk

---

### `scipy`

- [**SciPy**](https://scipy.org) is a core library for scientific computing in Python, built on top of NumPy.
  - It includes tools for **curve fitting**, **optimization**, **integration**, **statistics**, and more.
  - The `scipy.optimize.curve_fit` function allows you to fit **custom mathematical models** to data.
- Run the following line to import `scipy` under the alias `sp`:

In [None]:
import scipy as sp

---

## Regression

In a previous Guided Learning Activity, you explored the relationship between substrate concentration and reaction rate and fit both a linear and a nonlinear model to the data. You examined the patterns of the residual plots and compared MSE values to decide which of the two model options was better. Now, we will have you rework that activity without the `datascience` library.

---

### Michaelis–Menten Kinetics

<a href="https://en.wikipedia.org/wiki/Michaelis%E2%80%93Menten_kinetics" target="_blank"><img src="./MM-curve.jpg" width=400px alt="Curve of the Michaelis–Menten equation labelled in accordance with IUBMB recommendations"></a>

From <a href="https://en.wikipedia.org/wiki/Michaelis%E2%80%93Menten_kinetics" target="_blank">Wikipedia</a>:

> In biochemistry, Michaelis–Menten kinetics, named after Leonor Michaelis and Maud Menten, is the simplest case of enzyme kinetics, applied to enzyme-catalysed reactions involving the transformation of one substrate into one product.

---

### Biochemistry Basics

- **Enzyme**: A protein that speeds up chemical reactions in the body.
- **Substrate**: The molecule that the enzyme acts on (like a key in a lock).
- **Enzyme + substrate → product**: The reaction rate depends on substrate amount.
- **Saturation point**: When all enzymes are busy, adding more substrate won't increase speed.

---

### The Data

The file [`puromycin.csv`](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin) contains 23 rows and 3 columns of the reaction velocity versus substrate concentration in an enzymatic reaction involving untreated cells or cells treated with [Puromycin](https://en.wikipedia.org/wiki/Puromycin), an inhibitor that slows down enzymatic reactions.

---

### Task 01 📍

Load `puromycin.csv` into a `DataFrame` called `puromycin`. The `DataFrame` should contain only the untreated data and the following columns:

- `'Concentration'`: a numeric vector of substrate concentrations (ppm)
- `'Reaction Rate'`: a numeric vector of instantaneous reaction rates (counts/min/min)

**Conversion Notes**:
- `pd.read_csv()` loads a `CSV` into a `DataFrame`, similar to `Table.read_table(...)`.
- `df = df[df['column_label'] == 'value']` filters rows of a `DataFrame` like `tbl.where('column_label', 'value')`.
    - Instead of `.where()`, you use Boolean indexing with a condition inside `[]`.
- `.drop(columns=...)` removes a column, similar to `tbl.drop(...)`.
- `.rename(columns={...})` changes column names, like `tbl.relabeled(...)`.
    - The `{...}` is a dictionary, containing `{key:value}` pairs.
    - In this case, the pairs would be `{old_label: new_label, ...}`
- `.reset_index(drop=True)` resets the row index (especially after filtering).
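
The conversion notes above can be chained together as in the following sketch. It uses a small made-up CSV held in a string, so the column labels (`'plot'`, `'treatment'`, `'height_cm'`) and values are hypothetical and unrelated to `puromycin.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """plot,treatment,height_cm
1,control,10.2
2,fertilizer,14.8
3,control,9.7
4,fertilizer,15.1
"""

plants = pd.read_csv(io.StringIO(csv_text))                     # load the CSV
control = plants[plants['treatment'] == 'control']              # keep matching rows
control = control.drop(columns=['treatment'])                   # remove a column
control = control.rename(columns={'height_cm': 'Height (cm)'})  # relabel a column
control = control.reset_index(drop=True)                        # renumber rows from 0
```

Without `reset_index(drop=True)`, the filtered rows would keep their original index labels (0 and 2 here).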

In [None]:
puromycin = ...
puromycin

In [None]:
grader.check("task_01")

---

### Task 02 📍🔎

<!-- BEGIN QUESTION -->

Visualize the relationship between the substrate concentrations and reaction rates, where the concentrations are the explanatory variable.

**Conversion Notes**:
- `df['column']` produces a `Series` (like an array), similar to `tbl.column('column')`.
- `plt.scatter(x, y)` creates a scatter plot like `tbl.scatter(...)` in `datascience`.
- You can add a title and update x and y labels using `plt.title`, `plt.xlabel`, and `plt.ylabel`.
- `plt.show()` displays the plot.
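
As a rough sketch of how these calls fit together, the following uses made-up data; the column labels `'hours'` and `'score'` and the axis text are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the puromycin data.
study = pd.DataFrame({'hours': [1, 2, 3, 4], 'score': [55, 63, 71, 80]})

points = plt.scatter(study['hours'], study['score'])  # x first, then y
plt.title('Score vs. Hours Studied')
plt.xlabel('Hours studied')
plt.ylabel('Score')
plt.show()
```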

In [None]:
...

<!-- END QUESTION -->

---

### Task 03 📍🔎

<!-- BEGIN QUESTION -->

Fit a linear model to the data in `puromycin`, and use the model to create an array of predicted reaction rates (`y_pred`) based on the model and given concentration amounts.

**Notes**:
- Determining the slope and intercept that minimize MSE is part of fitting a linear model to a data set.
- `sklearn` has a tool `LinearRegression` from the `linear_model` sub-module to help with that.
    - Use `from sklearn.linear_model import LinearRegression`
    - `LinearRegression()` creates an unfitted linear regression model object (e.g., `linear_model`), playing a role similar to `fit_line(...)` in MATH 108.
- `X` is a `DataFrame` with the predictor(s)
    - double brackets `[[]]` keep it 2-dimensional (as required by `sklearn`).
    - Like: `tbl.select('column')`
- `y` is a `Series` (1D)
    - This is the variable we're trying to predict.
    - Like: `tbl.column('column')`
- `linear_model.fit(X, y)` determines the slope and intercept of the regression line for the provided data and stores them in `linear_model.coef_` and `linear_model.intercept_`. It does not store the correlation coefficient.
- After fitting the model, you can use it to predict outcomes (`linear_model.predict(X)`) for given inputs (`X`).
    - This returns an array of predicted `y` values based on the learned regression line.
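
The fit-then-predict workflow described above might look like the following sketch on made-up data that lies exactly on the line $y = 2x + 1$. The names `toy`, `X_toy`, `y_toy`, and `toy_model` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data on the line y = 2x + 1.
toy = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0], 'y': [1.0, 3.0, 5.0, 7.0]})

X_toy = toy[['x']]  # double brackets -> 2D DataFrame of predictors
y_toy = toy['y']    # single brackets -> 1D Series to predict

toy_model = LinearRegression()  # unfitted model
toy_model.fit(X_toy, y_toy)     # learn slope and intercept

slope = toy_model.coef_[0]          # one slope per predictor column
intercept = toy_model.intercept_
toy_pred = toy_model.predict(X_toy)  # NumPy array of predicted y values
```

Because the toy data is perfectly linear, the fitted slope and intercept should come out very close to 2 and 1.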

In [None]:
...
X = ...
y = ...
linear_model = ...
...
y_pred = ...
y_pred

<!-- END QUESTION -->

---

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

Add the fitted line to the scatter plot. Include a legend that labels both the scatter plot and the regression line.

**Notes**:
- `plt.plot(x, y)` creates a **line plot**, similar to `tbl.plot(...)` in the `datascience` library.
- To **label a line** in the plot, use the `label='Name'` parameter inside `plt.plot(...)`; `plt.scatter(...)` accepts the same parameter.
- `plt.legend()` adds a **legend box** to the plot, displaying the label(s) you’ve specified.
    - This is especially useful when comparing multiple lines or datasets in one plot.
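
Here is a minimal sketch of overlaying a line on a scatter plot with a legend. The data is made up, and the line $y = 2x + 1$ is assumed for illustration rather than fitted.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data and an assumed (not fitted) line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.2, 4.8, 7.1, 9.0])
y_line = 2 * x + 1

plt.scatter(x, y, label='Observed')                   # labeled points
plt.plot(x, y_line, color='red', label='y = 2x + 1')  # labeled line
legend = plt.legend()                                 # legend uses the labels above
plt.show()
```

Calling `plt.scatter` and `plt.plot` before `plt.show()` draws both on the same axes.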

In [None]:
...

<!-- END QUESTION -->

---

### Task 05 📍🔎

<!-- BEGIN QUESTION -->

Create a residual plot for this regression model. Include a horizontal line at $y=0$.

**Notes**:
- `plt.axhline(y=...)` draws a horizontal line across the plot at the provided `y` value.
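
A residual plot could be sketched as follows, assuming the usual definition of a residual as the observed value minus the predicted value. The arrays `y_obs` and `y_hat` are made-up stand-ins for real observations and model predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed values and predictions.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_obs = np.array([0.9, 3.2, 4.8, 7.1, 9.0])
y_hat = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

resid = y_obs - y_hat                           # observed minus predicted
plt.scatter(x, resid)
plt.axhline(y=0, color='red', linestyle='--')   # reference line at zero
plt.xlabel('x')
plt.ylabel('Residual')
plt.show()
```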

In [None]:
residuals = ...

...

<!-- END QUESTION -->

---

### Task 06 📍🔎

<!-- BEGIN QUESTION -->

Calculate the mean squared error for the best fit linear model.

**Notes**:
- There is an MSE function in `sklearn.metrics`
- Use `from sklearn.metrics import mean_squared_error`
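
As a quick check of what `mean_squared_error` computes, this sketch compares it with the same calculation done by hand in NumPy. The arrays are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 4.5])

mse_sklearn = mean_squared_error(y_true, y_hat)  # true values first, then predictions
mse_manual = np.mean((y_true - y_hat) ** 2)      # mean of the squared errors
```

Here the squared errors are 0.25, 0, 1, and 0.25, so both values equal 0.375.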

In [None]:
...

mse = ...
print("Mean Squared Error:", mse)

<!-- END QUESTION -->

---

### Task 07 📍🔎

<!-- BEGIN QUESTION -->

Next, fit the model defined by the Michaelis–Menten function, and make predictions `y_pred_2` from the concentrations in `puromycin`.

**Notes:**
- A general curve can be fit to data using the `scipy` function `curve_fit`.
    - Use `from scipy.optimize import curve_fit` to access this function.
    - Provide `curve_fit` with the function, the $X$-data, and the $y$-data so it can fit the function to the data.
    - By default, `curve_fit` returns two items: first an array of parameter values for the given function that minimize the sum of squared residuals (and therefore the MSE), then an estimate of the covariance of those parameters.
- `df['column_label'].values` provides the values of the `Series` as a NumPy array.
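
To show the general `curve_fit` pattern without working the task itself, here is a sketch that fits a different model, exponential decay, to noise-free made-up data. The function `exp_decay`, its parameters `a` and `b`, and the starting guess `p0` are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, a, b):
    """Hypothetical model: a * exp(-b * t). The first argument is the input data."""
    return a * np.exp(-b * t)

# Noise-free toy data generated with a = 3 and b = 0.5.
t_vals = np.linspace(0, 5, 20)
y_vals = exp_decay(t_vals, 3.0, 0.5)

# p0 is an optional starting guess for the parameters (a, b).
params, _ = curve_fit(exp_decay, t_vals, y_vals, p0=(1.0, 1.0))
a_fit, b_fit = params

y_fit = exp_decay(t_vals, a_fit, b_fit)  # predictions from the fitted curve
```

The model function must take the input data as its first argument, followed by the parameters to fit, in the order they will appear in `params`.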

In [None]:
...

def michaelis_menten(S, Vmax, Km):
    return ...

X_vals = ...
y_vals = ...

params, __ = ... # __ is used to "catch" the rest of the output
Vmax, Km = params # expand the params array

y_pred_2 = ...
y_pred_2

<!-- END QUESTION -->

---

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

Visualize the fitted curve defined by the Michaelis–Menten formula on the scatter plot of reaction rate vs. substrate concentration.

In [None]:
...

<!-- END QUESTION -->

---

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

Create a residual plot for this model and calculate the MSE value for this model (`mse_2`).

In [None]:
...

<!-- END QUESTION -->

---

## Classification

In a previous Guided Learning Activity, you used data and research on gallstone disease diagnoses to build a kNN classifier that flags a patient as potentially having gallstone disease based on a few key features. You fine-tuned the classifier by choosing the `k` value with the best accuracy on a validation set. Lastly, you reported an overall accuracy score for the model based on a test set. Now, we will have you rework that activity without the `datascience` library.

---

### Gallstone Disease

<a href="https://en.wikipedia.org/wiki/Gallstone#/media/File:Gallstones.png" target="_blank"><img src="./gallstones.png" width=400px alt="A gallstone blocking a bile duct"></a>

According to [Johns Hopkins Medicine](https://www.hopkinsmedicine.org/health/conditions-and-diseases/gallstone-disease):

> Gallstone disease is the most common disorder affecting the biliary system, the body's system of transporting bile. Gallstones are solid, pebble-like masses that form in the gallbladder or the biliary tract (the ducts leading from the liver to the small intestine). They form when the bile hardens and are caused by an excess of cholesterol, bile salts or bilirubin.

---

### Research

A recent paper titled [_Early prediction of gallstone disease with a machine learning-based method from bioimpedance and laboratory data_](https://journals.lww.com/md-journal/fulltext/2024/02230/early_prediction_of_gallstone_disease_with_a.40.aspx) showed that vitamin D, C-reactive protein (CRP) level, total body water, and lean mass are crucial features in predicting gallstones.

---

### Data

The UC Irvine Machine Learning Repository hosts [the dataset from this study](https://archive.ics.uci.edu/dataset/1150/gallstone-1), which contains information on 319 individuals, 161 of whom were diagnosed with gallstone disease.

---

### Task 10 📍

Create a `DataFrame` called `gallstone` from the data in `dataset_uci.csv`.

In [None]:
gallstone = ...
gallstone

In [None]:
grader.check("task_10")

---

### Task 11 📍🔎

<!-- BEGIN QUESTION -->

Create a training set `train`, a validation set `val`, and a testing set `test` from the `gallstone` dataset:

- The **test set** should contain **20%** of the total data.
- The **validation set** should contain **20%** of the total data.
- The **training set** should contain the remaining **60%** of the data.

Also, narrow down the columns in each `DataFrame` to only `'Gallstone Status'` and the key features: `'Vitamin D'`, `'C-Reactive Protein (CRP)'`, `'Total Body Water (TBW)'`, and `'Lean Mass (LM) (%)'`. You should reset the index of each `DataFrame` as well.

**Notes**:
- `sklearn` provides a function called `train_test_split`, which is similar to `tbl.split(...)`.
- Use `from sklearn.model_selection import train_test_split` to import the function.
- Set the `random_state` parameter so that your results are reproducible and match ours.
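
Because `train_test_split` produces two pieces at a time, a three-way split takes two calls. The sketch below shows one way to get a 60/20/20 split on made-up data; the second call uses `test_size=0.25` because 25% of the remaining 80% is 20% of the total. The names and the `random_state` value are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows.
toy = pd.DataFrame({'feature': range(100),
                    'label': [i % 2 for i in range(100)]})

# First call: hold out 20% of all rows as the test set.
rest, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

# Second call: 25% of the remaining 80% is 20% of the total.
toy_train, toy_val = train_test_split(rest, test_size=0.25, random_state=0)
```

With 100 rows, this yields 60 training rows, 20 validation rows, and 20 test rows.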

In [None]:
...

seed = 42
train_val, test = ...
train, val = ...

key_features = ['Vitamin D', 'C-Reactive Protein (CRP)',
                'Total Body Water (TBW)', 'Lean Mass (LM) (%)']

train = train[['Gallstone Status'] + key_features].reset_index(drop=True)
val = val[['Gallstone Status'] + key_features].reset_index(drop=True)
test = test[['Gallstone Status'] + key_features].reset_index(drop=True)
display(train)
display(val)
display(test)

<!-- END QUESTION -->

---

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

Standardize the data in `train`, `val`, and `test` using the mean and standard deviation from `train` only, so that no information from the validation and test sets leaks into the model.

**Notes**:
- `sklearn` has a class called `StandardScaler` that standardizes data values.
    - Use `from sklearn.preprocessing import StandardScaler` to access it.
    - You `fit` the scaler to the data, similar to how you fit the regression models.
- Once the scaler is fit to the training data, use its `transform` method to standardize all 3 sets with the training data's statistics.
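
The fit-on-training, transform-everything pattern can be sketched on made-up data as follows. The column label `'f'` and the names `toy_train`, `toy_new`, and `toy_scaler` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a training set and some new rows.
toy_train = pd.DataFrame({'f': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_new = pd.DataFrame({'f': [3.0, 6.0]})

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train[['f']])  # learns the training mean and SD only

train_scaled = toy_scaler.transform(toy_train[['f']])  # 2D NumPy array
new_scaled = toy_scaler.transform(toy_new[['f']])      # uses the training mean and SD
```

`StandardScaler` divides by the population standard deviation (the same value `np.std` returns by default), so the new value 3.0, which equals the training mean, maps to 0.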

In [None]:
...

scaler = ...
...

train[key_features] = ...
val[key_features] = ...
test[key_features] = ...
display(train)
display(val)
display(test)

<!-- END QUESTION -->

---

### Task 13 📍🔎

<!-- BEGIN QUESTION -->

Fit a $k$NN classifier to the training data, check the accuracy for several values of $k$ based on the validation set, and visualize the trend of the accuracy values for various $k$ values to pick the optimal one.

**Notes**:
- `sklearn` has a kNN class called `KNeighborsClassifier`.
    - Use `from sklearn.neighbors import KNeighborsClassifier` to access it.
- `sklearn` also has an accuracy function called `accuracy_score`.
    - Use `from sklearn.metrics import accuracy_score` to access it.
- To fit the model, make predictions, and assess the accuracy, you'll need the $X$ values as a `DataFrame` or 2D array and the $y$ values as a `Series` or 1D array.
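
The loop over candidate $k$ values might be organized like this sketch, which uses two small, well-separated made-up clusters. All of the array names and the list of $k$ values are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: class 0 near (0, 0), class 1 near (1, 1).
X_fit = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
X_check = np.array([[0.05, 0.05], [0.95, 1.0]])
y_check = np.array([0, 1])

scores = []
for k in [1, 3, 5]:
    clf = KNeighborsClassifier(n_neighbors=k)  # k nearest neighbors vote
    clf.fit(X_fit, y_fit)                      # store the training points
    check_pred = clf.predict(X_check)          # classify the held-out points
    scores.append(accuracy_score(y_check, check_pred))
```

Each entry in `scores` is the fraction of held-out points classified correctly for that $k$, so the list can be plotted against the $k$ values to compare them.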

In [None]:
...
...

X_train = ...
y_train = ...
X_val = ...
y_val = ...
X_test = ...
y_test = ...

k_values = ...
val_accuracies = ...

for k in k_values:
    knn = ...
    ...
    y_val_pred = ...
    acc = ...
    ...

...
plt.xlabel('k')
plt.ylabel('Validation Accuracy')
plt.title('Validation Accuracy vs. k')
plt.show()

<!-- END QUESTION -->

---

### Task 14 📍🔎

<!-- BEGIN QUESTION -->

Using the optimal `k` value from the previous task, fit a kNN classifier to the training data, make predictions using the test set, and calculate the accuracy for the model based on the test set.

In [None]:
optimal_k = ...

knn = ...
...

y_pred = ...

print("Test Set Accuracy:", ...)

<!-- END QUESTION -->

---

## Reflection

In this activity, you reflected on the course design for MATH 108, learned about a few key Python libraries for data analysis and machine learning, and revisited two Guided Learning Activities by completing similar tasks without the `datascience` library used in MATH 108.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>