In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla11.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 11: Linear Regression

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Review the concepts behind linear regression.
2. Describe what least squares means.
3. Make a prediction using linear regression.
4. Outline predictive maintenance.
5. Apply linear regression to a data set to make a numerical prediction.

---

## Configure the Notebook

Run the following code cell to set up the notebook.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Summary of Linear Regression

- **Purpose**: Linear regression is used to predict values of $Y$ based on observed values of $X$, assuming a linear relationship exists between them.
- **Correlation Coefficient ($r$)**: Measures the strength and direction of the linear relationship between $X$ and $Y$. The value of $r$ ranges from -1 to 1.
- **Regression Line**: The line of best fit that minimizes the sum of squared errors between the predicted and actual $Y$ values.
  - A line is defined by its **slope** and **intercept**:
    - **Slope**:  
      $$
      \text{slope} = r \cdot \frac{\text{SD}_Y}{\text{SD}_X}
      $$
    - **Intercept**:  
      $$
      \text{intercept} = \text{average}_Y - \text{slope} \cdot \text{average}_X
      $$
  - The line can be used to make predictions of $Y$ values using an $X$ value and the equation of the regression line.
- **Prediction Error**: The regression line typically does not pass through all data points, so there is an error in prediction.
- **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted $Y$ values. It quantifies how well the regression line fits the data.
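
As a rough illustration of the formulas above, the following sketch computes $r$, the slope, the intercept, and the MSE for a small made-up data set, then cross-checks the line against NumPy's own least squares fit. The arrays and every name ending in `_toy` are assumptions chosen for this example; they are not part of the activity's data.

```python
import numpy as np

# Made-up toy data, for illustration only
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

def to_standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - np.mean(values)) / np.std(values)

# r is the average of the products of the standard units
r_toy = np.mean(to_standard_units(x_toy) * to_standard_units(y_toy))

# slope = r * SD_Y / SD_X and intercept = average_Y - slope * average_X
slope_toy = r_toy * np.std(y_toy) / np.std(x_toy)
intercept_toy = np.mean(y_toy) - slope_toy * np.mean(x_toy)

# MSE: the average squared difference between actual and predicted Y values
predictions_toy = slope_toy * x_toy + intercept_toy
mse_toy = np.mean((y_toy - predictions_toy) ** 2)

# Cross-check against NumPy's degree-1 least squares fit
fit_slope, fit_intercept = np.polyfit(x_toy, y_toy, 1)

# A line with a slightly different slope should have a larger MSE
nudged_predictions = (slope_toy + 0.1) * x_toy + intercept_toy
mse_nudged = np.mean((y_toy - nudged_predictions) ** 2)

print(f'slope: {slope_toy:.3f}, intercept: {intercept_toy:.3f}')
print(f'MSE of regression line: {mse_toy:.4f}, MSE of nudged line: {mse_nudged:.4f}')
```

The comparison with the nudged line is only a spot check on one alternative line, not a proof that the regression line has the smallest possible MSE.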


---

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

Why is the linear regression line called the least squares regression line? Specifically, what is so special about the linear regression line and the MSE measurement?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 02 📍

Suppose that you have the following summary data for two numerical variables $X$ and $Y$:

* $\text{average}_X = 3$
* $\text{average}_Y = 63$
* $\text{SD}_X = 1.58$
* $\text{SD}_Y = 9.35$
* $r = 0.98$

Predict a value of $Y$ using $x = 3.5$ and the linear regression based on the provided summary data. Assign that numerical value to `predicted_y`.

In [None]:
AVE_X = 3
AVE_Y = 63
SD_X = 1.58
SD_Y = 9.35
r = 0.98
x = 3.5
slope = ...
intercept = ...
predicted_y = ...
print(f'Using linear regression, the predicted value of Y \
based on an x value of {x} is {predicted_y:.2f}.')

In [None]:
grader.check("task_02")

---

### Task 03 📍

Complete the following code cell to define helper functions that calculate the slope and intercept of the regression line from array data.

In [None]:
def standard_units(an_array):
    '''Convert any array of numbers to standard units.'''
    return (an_array - np.mean(an_array))/np.std(an_array)

def get_correlation(x_array, y_array):
    '''Returns the correlation coefficient for the two arrays.'''
    return np.mean(standard_units(x_array)*standard_units(y_array))

def get_slope(x_array, y_array):
    '''Returns the slope of the linear regression line for the two arrays.'''
    r = ...
    return ...

def get_intercept(x_array, y_array):
    '''Returns the intercept of the linear regression line for the two arrays.'''
    return ...

In [None]:
grader.check("task_03")

---

## Predictive Maintenance

<img src="./Deteriorated_asphalt.jpg" alt="The nature and degree of asphalt deterioration is analyzed for predictive maintenance of roadways" width=400px>

[Predictive maintenance](https://en.wikipedia.org/wiki/Predictive_maintenance) is a field of study focused on predicting when a component needs to be taken offline for maintenance. For example, asphalt deteriorates over time and eventually needs to be repaired or replaced. It helps those who maintain the asphalt to have a good idea of when that work will be needed so they can schedule equipment and personnel and set up alternative routes for the users of that road.
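
To make the idea concrete, here is a minimal sketch of how a regression line could be used to estimate when maintenance will be due. The condition scores, the repair threshold of 50, and all variable names are hypothetical values invented for this illustration; they do not come from any real roadway data.

```python
import numpy as np

# Hypothetical pavement condition scores (100 = new), for illustration only
years_since_paving = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
condition_score = np.array([95.0, 88.0, 80.0, 71.0, 66.0])

# Hypothetical score below which repair would be scheduled
repair_threshold = 50.0

# Fit a least squares line: score = slope * years + intercept
wear_slope, wear_intercept = np.polyfit(years_since_paving, condition_score, 1)

# Solve slope * years + intercept = threshold for years
years_until_repair = (repair_threshold - wear_intercept) / wear_slope

print(f'Estimated score change per year: {wear_slope:.2f}')
print(f'Estimated years until the threshold is reached: {years_until_repair:.1f}')
```

This estimate extrapolates beyond the observed years, so it is only as good as the assumption that the deterioration stays linear.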

---

### Data

UC Irvine hosts a [synthetic predictive maintenance data set](https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset) to help with artificial intelligence training, since most companies are reluctant to share performance data publicly.

In [None]:
# # Load the dataset
# url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv'
# Table.read_table(url).to_csv('data.csv')

### Task 04 📍

Assign `data` to the Table containing the data from `data.csv`.

In [None]:
data = ...
data

In [None]:
grader.check("task_04")

---

### Operating Temperature

This synthetic data set includes measurements collected from operating machines. As machines run, they generate heat, and it's reasonable to expect some relationship between the machine's internal temperature and the air temperature in the surrounding environment. In the next step, you'll explore that relationship to see how closely these two variables are connected. Both temperatures are measured in Kelvin.

### Task 05 📍🔎

<!-- BEGIN QUESTION -->

Visualize the relationship between the air temperature and the process temperature with a scatterplot. Use the `fit_line=True` argument of `scatter` to include the linear regression line in the visual.

**Note**: You will predict process temperature from air temperature later in the activity, so place air temperature on the horizontal axis, as is conventional for the variable used to make predictions.

In [None]:
...
plt.title('Process Temp. vs. Air Temp.')
plt.show()

<!-- END QUESTION -->

---

### Task 06 📍

Calculate the correlation coefficient between these two sets of temperatures. Assign that value to `r_temps`.

In [None]:
r_temps = ...
r_temps

In [None]:
grader.check("task_06")

---

### Task 07 📍🔎

<!-- BEGIN QUESTION -->

Describe the relationship between these two variables. Incorporate the correlation coefficient, if relevant. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Fit Regression Line

You should notice that this relationship appears fairly linear, which makes linear regression a reasonable choice for predicting process temperature based on air temperature. Next, you'll model this relationship with linear regression; in other words, you'll fit a regression line to the data.

---

### Task 08 📍

Calculate the slope and intercept for the linear regression line associated with the air and process temperatures. Assign these values to `slope_temp` and `intercept_temp`, respectively.

In [None]:
slope_temp = ...
intercept_temp = ...
print(f'The slope of the line is {slope_temp:.3f} \
and the intercept of the line is {intercept_temp:.3f}.')

In [None]:
grader.check("task_08")

---

### Task 09 📍

Define a function called `predict_process_temp` that takes in an air temperature value and returns the predicted process temperature based on the linear regression line defined by `slope_temp` and `intercept_temp`. The function should predict a process temperature of approximately 310 Kelvin for an air temperature of 300 Kelvin.

In [None]:
...

# Check the function using an air temp of 300
predict_process_temp(300)

In [None]:
grader.check("task_09")

---

### Machine Failures

The `data` Table contains a column `'Machine failure'` that indicates whether the machine failed under the recorded measurements, such as air and process temperature. A value of `1` means that the machine failed and needs repair or replacement.

In [None]:
data.iscatter('Air temperature [K]', 'Process temperature [K]', group='Machine failure')

---

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

What do you notice about the machine failures in relation to the air and process temperatures?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Reflection

In this activity, you reviewed the concepts behind linear regression and made predictions using linear models in a variety of situations (using data summary information and an actual data set). You considered this prediction in relation to the subject of predictive maintenance.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>