<a href="https://colab.research.google.com/github/dymiyata/intro-to-ml-and-ai-2025-2026/blob/main/lin_reg_sklearn_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression with Scikit-learn Homework

In this assignment, you will practice doing linear regression using the Scikit-learn library just like we did in class.

As always, add extra code cells and text cells when appropriate to keep your work organized and presentable.

- If you run into trouble, start by looking through what we did in class.  Most of this assignment is very similar to what we did in class with the penguin dataset (in the Scikit-learn section).
- If you're still stuck, try asking a classmate for help.
- If you're still stuck after that, you can try searching for Scikit-learn's LinearRegression documentation.
- If you're still stuck, always feel free to email me with questions.



### Problem 1

Import the libraries we have been using over the last few weeks for working with and plotting data (you may need Pandas, Seaborn, and Matplotlib).

### Problem 2

For this assignment, you'll use the built-in Seaborn dataset called `tips`.  This is a dataset created by a waiter in 1995.  He recorded information about 244 customers, including the amount they tipped.

- Import this dataset just like we did for the `penguins` and `iris` datasets we've used previously.  Be sure to replace `"penguins"` or `"iris"` with `"tips"` when running the function to load the dataset.
- Store the dataframe in a variable. You can name it whatever you want, but I'd recommend either using `df` or something that has to do with "tips" like `tips_data` or `tips_df` or just `tips`.
- Get a feel for what information the dataset contains using methods like `.head()`, `.info()`, and `.describe()`.
- Make a histogram for total bill.  Play with the number of bins until you feel you get something representative of the distribution of total bills.

### Problem 3

Make a scatter plot where $x$ is the total bill and $y$ is the tip.
- Does there seem to be a correlation between these variables?
- If so, is it a positive or a negative correlation?

In the following problems, we will create a linear regression model to predict the tip based on the total bill.

### Problem 4

- Import the `LinearRegression` model from Scikit-learn just like we did in class.
- Define the `model` just like we did in class.

### Problem 5

Create `X_train` and `Y_train` like we did in class.  Here `X_train` should use the total bill variable and `Y_train` should use the tip variable.  
- Here the total bill is our *feature variable* and the tip is our *target variable*
- Recall that Scikit-learn's `LinearRegression` model expects the feature variables to be input as a 2D array rather than a 1D array.  Make sure you define `X_train` as a 2D array; otherwise, you'll get an error when you train the model.
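As a reminder of the 1D versus 2D distinction, here is a minimal sketch on a made-up dataframe. The name `toy` and the columns `hours` and `score` are hypothetical and have nothing to do with the `tips` dataset; the point is only how single and double square brackets change the shape.

```python
import pandas as pd

# Hypothetical toy data (NOT the tips dataset), just to illustrate shapes.
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [52.0, 61.0, 70.0, 79.0],
})

X_1d = toy["hours"]    # single brackets: a 1D Series
X_2d = toy[["hours"]]  # double brackets: a 2D DataFrame with one column

print(X_1d.shape)  # (4,)
print(X_2d.shape)  # (4, 1)
```

The second shape, with one row per sample and one column per feature, is the form the model expects for its features.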

### Problem 6

Use `model.fit()` to train the model.
- Again, if you get an error here, you probably didn't define your `X_train` as a 2D array.  Look at how we defined `X_train` in class with the penguin dataset to see how to get it to work.

What are the values for $w$ (the slope) and $b$ (the y-intercept)?
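If you don't remember where a trained model stores these values, the sketch below fits a model to hypothetical toy data that lies exactly on the line $y = 9x + 43$. The names `toy` and `toy_model` are made up for illustration; `coef_` and `intercept_` are the attributes Scikit-learn's `LinearRegression` sets after fitting.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (NOT the tips dataset): score = 9 * hours + 43 exactly.
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [52.0, 61.0, 70.0, 79.0],
})

toy_model = LinearRegression()
toy_model.fit(toy[["hours"]], toy["score"])  # 2D features, 1D target

w = toy_model.coef_[0]    # slope (coef_ holds one entry per feature)
b = toy_model.intercept_  # y-intercept

print(w, b)  # approximately 9 and 43, up to floating-point rounding
```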

### Problem 7

Use `model.predict()` to get an array of all the predicted y values from the training data.  Store this array in a variable called `Y_pred`.

Then,
- Create a Seaborn plot which contains a scatter plot of the original data (total bill vs actual tip).
- On top of that, add a line plot for the predicted data (total bill vs predicted tip) so you can see your regression line on top of your scatter plot.
    - Make sure your line plot is a different color than your scatter plot so it's easy to see.

See what we did in class for help.

Finally, use your model to predict how much a customer whose bill was \$30 would tip and how much a customer whose bill was \$15 would tip.  Do this by printing the output of:

`model.predict([[ amount of total bill here ]])`.  

(The double square brackets are important because, again, the model expects a 2D array of inputs.  Also, if you get a warning because your input doesn't have column labels, don't worry; you can ignore it.)
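To see why the double brackets matter, here is a small sketch using a hypothetical model fit to made-up numbers (again, not the `tips` data). The name `toy_model` and the values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 9x + 43.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2D: 4 rows, 1 column
y_toy = np.array([52.0, 61.0, 70.0, 79.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Double brackets: one row (one sample) with one column (one feature).
pred = toy_model.predict([[5.0]])
print(pred)  # an array holding one prediction, approximately 88

# Single brackets give a 1D input, which Scikit-learn rejects.
raised_error = False
try:
    toy_model.predict([5.0])
except ValueError:
    raised_error = True
print(raised_error)  # True
```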

### Comparing our linear model to a different linear model

In 1995 (the year this data is from), it was considered standard to tip 15% of the total bill.
- If you thought the service was particularly good you may have tipped more
- If you thought the service was particularly bad, you may have tipped less

Thus, a naive way to predict the tip from the total bill would be to just take 15% of the total bill.  In other words, to use the following linear model:
$$\hat{y} = 0.15 x$$
(here the slope is $w=0.15$ and the y-intercept is $b=0$)

We will compare this naive linear model to our trained linear model to see which one has the lower cost.

### Problem 8

Define an array of predicted tips for the 15% model mentioned above.  Store it in a variable called `Y_pred_15` using the following code:

```
Y_pred_15 = 0.15 * df["total_bill"]
```

If you named your dataframe something other than `df` be sure to change this code to match whatever name you used.

- Think about what this code is doing and make sure you understand why it's giving you the predicted values from the naive 15% model.
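If it helps, here is the same kind of operation on a tiny hypothetical Series (the name `bills` and its values are made up): multiplying a Series by a number multiplies every entry by that number.

```python
import pandas as pd

# Hypothetical bill amounts, just to show element-wise multiplication.
bills = pd.Series([10.0, 20.0, 40.0])

# Each entry is multiplied by 0.15, giving one prediction per bill.
pred_15 = 0.15 * bills

print(pred_15.tolist())  # approximately 1.5, 3.0, and 6.0
```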

Then,
- create a scatterplot with the original data (total bill vs tip)
- on top of that add a line plot for our trained model's predicted values
- on top of all of that, add a line plot for the 15% model's predicted values
- make sure the colors for all three are different

(The first two steps of the plot are exactly the same as the plot from Problem 7, so you can copy and paste that.  All you have to do is add one more line plot for the 15% model.)

### Problem 9

Like we did in class, import the `mean_squared_error` function from Scikit-learn.

- Use it to compute the MSE cost function for the trained model and the MSE cost function for the 15% model.
- Compare the two values.  Which model is better?
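As a refresher on what this function computes, the sketch below checks it against a by-hand calculation on three made-up values. It assumes the cost is the plain mean of the squared errors, $\frac{1}{m}\sum_i (y_i - \hat{y}_i)^2$; the arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# By hand: errors are 1, 0, -2, so the squared errors are 1, 0, 4.
mse_manual = np.mean((y_true - y_hat) ** 2)  # (1 + 0 + 4) / 3

# Scikit-learn takes the actual values first, then the predictions.
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)  # both approximately 1.667
```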

### Unfair Comparison

Finally, I want to mention that the way we compared the models was unfair.

OF COURSE the trained model will do better than the 15% model because the trained model was trained on the **same data** that we used to evaluate the two models.  
- Remember, training found the parameters that minimize the cost function on the training data set.  Therefore, no other linear model (including the 15% model) can have a lower cost on this training set, no matter what.
- However, it might perform worse on data that wasn't in the training set.
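You can see the first point on synthetic data. The sketch below is only an illustration: the data is randomly generated (not the `tips` data), and the names `fitted`, `mse_fitted`, and `mse_fixed` are made up. Whatever data you generate, the fitted line's training MSE can never be higher than that of a fixed line such as $\hat{y} = 0.15x$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: a noisy line, generated with a fixed seed.
rng = np.random.default_rng(0)
x = rng.uniform(5, 50, size=40)
y = 0.1 * x + 1.0 + rng.normal(0, 1.0, size=40)
X = x.reshape(-1, 1)  # 2D array of features

# Model trained on this data.
fitted = LinearRegression().fit(X, y)
mse_fitted = mean_squared_error(y, fitted.predict(X))

# Fixed linear model with slope 0.15 and intercept 0 (no training).
mse_fixed = mean_squared_error(y, 0.15 * x)

# Both costs are measured on the training data itself.
print(mse_fitted <= mse_fixed)  # True
```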

To properly compare two different models, we actually need to test them on data that neither model was trained with.
- Thus, it's common practice to separate your data into a *training set* and a *testing set*.  
- You use the training set to train your models (find the parameters).
- Then, evaluate the model performance by computing the cost function on the testing set (not the training set).
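As a preview, here is a rough sketch of that workflow using Scikit-learn's `train_test_split` function on synthetic data. We have not covered this in class yet, so treat the details (the 25% test size, the random seed, and all variable names) as assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy line, generated with a fixed seed.
rng = np.random.default_rng(0)
x = rng.uniform(5, 50, size=40)
y = 0.1 * x + 1.0 + rng.normal(0, 1.0, size=40)
X = x.reshape(-1, 1)

# Hold out 25% of the rows as a testing set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train only on the training set.
split_model = LinearRegression().fit(X_tr, y_tr)

# Evaluate both models only on the testing set.
mse_trained_test = mean_squared_error(y_te, split_model.predict(X_te))
mse_fixed_test = mean_squared_error(y_te, 0.15 * X_te[:, 0])

print(len(X_tr), len(X_te))  # 30 10
print(mse_trained_test, mse_fixed_test)
```

On the testing set there is no guarantee about which model comes out ahead, which is what makes the comparison fair.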

We will talk more about this idea in future weeks but it's important to start thinking about this kind of stuff early.