# Project: Linear Regression

The story (from Codecademy): Reggie is a mad scientist who has been hired by the local fast food joint to build their newest ball pit in the play area. As such, he is working on researching the bounciness of different balls so as to optimize the pit. He is running an experiment to bounce different sizes of bouncy balls, and then fitting lines to the data points he records. He has heard of linear regression, but needs your help to implement a version of linear regression in Python.

## Step 1: Calculating the Error

The function below implements the linear regression formula `y = m * x + b`.

In [5]:
def get_y(m, b, x):
    total = (m * x) + b
    return total

print(get_y(1, 0, 7) == 7)
print(get_y(5, 10, 3) == 25)

True
True


The function below, `calculate_error()`, takes in `m`, `b`, and an [x, y] point called `point` and returns the error: the absolute vertical distance between the point and the line at the same `x`.

In [6]:
# Absolute vertical distance between the line y = m*x + b and an (x, y) point
def calculate_error(m, b, point):
    x_point, y_point = point
    y_diff = get_y(m, b, x_point) - y_point
    return abs(y_diff)

The following tests the `calculate_error()` function:

In [7]:
#this is a line that looks like y = x, so (3, 3) should lie on it. thus, error should be 0:
print(calculate_error(1, 0, (3, 3)))
#the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))
#the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))
#the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))

0
1
1
5
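To make the last test concrete, the arithmetic can be traced by hand. The snippet below is a minimal sketch that restates `get_y` and `calculate_error` from the cells above so it runs on its own; the intermediate names `line_y` and `point_y` are introduced here only for illustration.

```python
def get_y(m, b, x):
    return (m * x) + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

# Evaluate the line y = -x + 1 at x = 3
line_y = get_y(-1, 1, 3)      # (-1 * 3) + 1 = -2
point_y = 3                   # y-coordinate of the point (3, 3)

print(abs(line_y - point_y))           # |-2 - 3| = 5
print(calculate_error(-1, 1, (3, 3)))  # same value via the function
```

Both lines print 5, matching the last test output above.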


***

The list `datapoints` below contains bouncy ball measurements as `(size, bounce height)` pairs. For example, the first datapoint, `(1, 2)`, means that a 1 cm ball bounces 2 meters.

In [6]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

The function `calculate_all_error` below iterates through each point in `points`, adds up the individual errors, and returns the total error.

In [8]:
# Sum the error of every point in `points` for the line y = m*x + b
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

The following cell tests the `calculate_all_error` function:

In [9]:
#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))

#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))


#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))

0
4
4
18
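The accumulation loop in `calculate_all_error` can also be written with the built-in `sum()` and a generator expression. The sketch below is one possible equivalent, not the version used in this project; the name `calculate_all_error_compact` is hypothetical, and the helper functions are restated so the snippet is self-contained.

```python
def get_y(m, b, x):
    return (m * x) + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

# Hypothetical compact variant: sum the per-point errors in one expression
def calculate_all_error_compact(m, b, points):
    return sum(calculate_error(m, b, point) for point in points)

datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error_compact(1, 0, datapoints))   # points lie on y = x
print(calculate_all_error_compact(-1, 1, datapoints))  # 1 + 5 + 9 + 3
```

With the same test dataset, this should reproduce the totals printed above (0 and 18 for these two lines).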


## Step 2: Slope and Intercept Analysis


For the sake of learning, this project finds the line of best fit by trial and error: it tries many candidate slopes and intercepts and keeps the pair that produces the smallest total error.

The list `possible_ms` is generated with a list comprehension and contains values ranging from -10.0 to 10.0 (inclusive) in increments of 0.1.

In [14]:
possible_ms = [ms * 0.1 for ms in range(-100, 101)]

The list `possible_bs`, also created with a list comprehension, contains values ranging from -20.0 to 20.0 (inclusive) in increments of 0.1.

In [16]:
possible_bs = [bs * 0.1 for bs in range(-200, 201)]
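Because 0.1 has no exact binary representation, multiplying by it leaves small rounding errors in some list entries, which is where values such as `0.30000000000000004` in the search output come from. The sketch below shows one possible alternative, dividing by 10 instead; the names `ms_times` and `ms_div` are illustrative and not part of the project code.

```python
# Same grid built two ways: multiplying by 0.1 versus dividing by 10
ms_times = [ms * 0.1 for ms in range(-100, 101)]
ms_div = [ms / 10 for ms in range(-100, 101)]

i = 103  # index of the entry generated from ms = 3
print(ms_times[i])  # 0.30000000000000004
print(ms_div[i])    # 0.3

# Both lists have the same length and the same endpoints
print(len(ms_div), ms_div[0], ms_div[-1])  # 201 -10.0 10.0
```

Either list works for the search in this project; the division form only makes the printed values easier to read.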

The cell below finds the smallest total error, and therefore the optimal `b` and `m` values, using the `calculate_all_error` function.

In [17]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

smallest_error = float("inf")
best_m = 0
best_b = 0
for m in possible_ms:
    for b in possible_bs:
        this_error = calculate_all_error(m, b, datapoints)
        if this_error < smallest_error:
            smallest_error = this_error
            best_m = m
            best_b = b
print("The best b is {}, the best m is {}, and the smallest error is {}".format(best_b, best_m, smallest_error))

The best b is 1.7000000000000002, the best m is 0.30000000000000004, and the smallest error is 4.999999999999999
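One caveat is worth checking: when errors are measured as absolute distances, more than one line can reach the same minimum. The sketch below restates the helper functions so it is self-contained and evaluates a few hand-picked candidates (chosen here for illustration; each line passes through the point (1, 2)). It suggests that they all share a total error of about 5, so the pair reported above appears to be one of several equally good fits on this grid rather than a unique optimum.

```python
import math

def get_y(m, b, x):
    return (m * x) + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

# Illustrative (m, b) candidates, not an exhaustive list of ties
candidates = [(0.3, 1.7), (0.4, 1.6), (0.5, 1.5), (0.6, 1.4)]
errors = [calculate_all_error(m, b, datapoints) for m, b in candidates]

for (m, b), error in zip(candidates, errors):
    # isclose absorbs the floating-point noise seen in the search output
    print(m, b, error, math.isclose(error, 5.0, abs_tol=1e-9))
```

Which of the tied pairs the search loop reports depends on floating-point rounding and on the order in which candidates are visited.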


## Step 3: Prediction via Model

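The optimal values of `m` and `b` were determined to be approximately 0.3 and 1.7, respectively. The extra trailing digits in the printed output are floating-point rounding artifacts.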

Plugged into the linear regression formula outlined in Step 1, the current model is:

```
y = 0.3x + 1.7
```

This line can be used to predict other bouncy ball values. The cell below predicts the bounce height for a 6cm ball.

In [25]:
print(get_y(0.3, 1.7, 6))

3.5


Our model predicts that a 6 cm ball will bounce 3.5 m. This model can now be used for Reggie's ball pit design.
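The same call can be repeated for several ball sizes at once. The sketch below is a minimal example that assumes the rounded values `m = 0.3` and `b = 1.7` from Step 2; the sizes 6, 7, and 8 cm are hypothetical and lie outside the measured 1 to 5 cm range, so the results are extrapolations and should be treated with caution.

```python
def get_y(m, b, x):
    return (m * x) + b

best_m, best_b = 0.3, 1.7  # rounded values from the Step 2 search

# Hypothetical ball sizes in cm (beyond the measured data)
sizes_cm = [6, 7, 8]
predictions = [get_y(best_m, best_b, size) for size in sizes_cm]

for size, height in zip(sizes_cm, predictions):
    # Format to one decimal place to hide floating-point noise
    print("{} cm ball: predicted bounce of {:.1f} m".format(size, height))
```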