# Example: Old Faithful Model

This dataset lists the eruption time (in minutes) of the Old Faithful geyser along with the wait time, which is the time between eruptions.

- **Before running the code**, do you think there might be a correlation between wait time and eruption time of the Old Faithful geyser?

- **Run the code** to see the scatterplot. Also, look at the console to see the model. Be sure to read the comments and match the code to the output.

- What does the correlation coefficient tell you about the relationship between wait time and eruption time?

Source: https://people.sc.fsu.edu/~jburkardt/datasets/stats/stats.html
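
To make the correlation coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand and compares it with the `.corr()` method used in the example. The five data points are made up for illustration and are not the Old Faithful measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption: NOT taken from faithful.csv)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.0, 9.8])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: how strongly the deviations move together, scaled to lie between -1 and 1
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value from pandas
r_pandas = y.corr(x)

print("Manual r: " + str(round(r_manual, 4)))
print("pandas r: " + str(round(r_pandas, 4)))
```

A value near 1 or -1 suggests the points lie close to a line, while a value near 0 suggests little linear relationship.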

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("faithful.csv")

# Set values for x and y
x = df.wait_time
y = df.eruption_time

# Determine correlation
correlation = y.corr(x)
print("Correlation: " + str(correlation)) 

# Add labels
plt.title("Old Faithful Eruptions")
plt.xlabel("Wait Time")
plt.ylabel("Eruption Time")

# Plot the scatterplot
plt.scatter(x, y)

# Create the model
model = np.polyfit(x, y, 1)
print(model)

# Print the line of best fit
m = str(round(model[0], 2))
b = str(round(model[1], 2))

print("Model: y = " + m +"x + " + b)

plt.show()

# Problem 1 - Swim Time Model

This dataset lists the gold medalist time for the women’s 400-meter freestyle swimming finals.

1. Import the data and set values for x and y.

2. Determine the correlation.

3. Plot the scatterplot.

4. Create a model using the polyfit() function.

5. Print the information from the model. What is the line of best fit?

Source: https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_swimming_(women)#400_metres_freestyle
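
As a reminder of what `polyfit()` returns, here is a small sketch on made-up data that follows y = 2x + 1 exactly. It is not the swim data; it only shows that for a degree-1 fit the first value is the slope and the second is the intercept.

```python
import numpy as np

# Hypothetical toy data (assumption: NOT taken from swim_times.csv)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 fit returns the coefficients from highest power to lowest
model = np.polyfit(x, y, 1)
slope, intercept = model

print("Slope: " + str(round(slope, 2)))
print("Intercept: " + str(round(intercept, 2)))
```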

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("swim_times.csv")

# Example: Predicting Old Faithful

- **Before running the code**: The previous example printed the model `y = 0.08x + -1.87`, which is the same as y = 0.08x - 1.87. Using this equation, what eruption time do you predict for a wait time of 60 minutes?

- **Run the code**. Look at the console to see the prediction. Be sure to read the comments and match the code to the output.

- Find the variable `my_wait_time` and change its value. Run the code to see the new corresponding predicted value.

Source: https://people.sc.fsu.edu/~jburkardt/datasets/stats/stats.html
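
A prediction can also be checked by hand by substituting a wait time into the equation. The sketch below does this for a wait time of 70 minutes (a value chosen only for illustration) using the rounded coefficients quoted above. Because those coefficients are rounded, the value printed by the notebook's `predict()` may differ slightly.

```python
# Rounded slope and intercept quoted in the text above
m = 0.08
b = -1.87

# Illustrative wait time in minutes (assumption: any value could be used here)
wait_time = 70

# Substitute into y = mx + b
by_hand = round(m * wait_time + b, 2)

print("Prediction by hand: " + str(by_hand) + " minutes")
```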

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("faithful.csv")

# Set values for x and y
x = df.wait_time
y = df.eruption_time

# Determine correlation
correlation = y.corr(x)
print("Correlation: " + str(correlation))  

# Add labels
plt.title("Old Faithful Eruptions")
plt.xlabel("Wait Time")
plt.ylabel("Eruption Time")
  
# Plot the scatterplot
plt.scatter(x, y)

# Create the model
model = np.polyfit(x, y, 1)
print(model)

# Predict using the model
predict = np.poly1d(model)

my_wait_time = 60

my_eruption_time = round(predict(my_wait_time), 2)

print("If you wait " + str(my_wait_time) + " minutes between eruptions, it is predicted that the eruption time of Old Faithful will be " + str(my_eruption_time) + " minutes long.")

plt.show()

# Problem 2 - Predicting Swim Times

Copy over your code from the previous Swim Times exercise.

1. Use the model equation to make a prediction for an x-value of your choice.

**Advanced:**
Use `input()` to ask the user for a value, then print the corresponding prediction.

Source: https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_swimming_(women)#400_metres_freestyle
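
One possible shape for the advanced task is sketched below. The coefficients are placeholders rather than a fit of the swim data, and `predict_from_text` is a hypothetical helper name. The key point is that `input()` returns text, so it must be converted to a number before it is passed to the model.

```python
import numpy as np

# Placeholder coefficients (slope, intercept); assumption: NOT fitted to swim_times.csv
predict = np.poly1d([2.0, 1.0])

def predict_from_text(text):
    """Convert typed text to a number and return the rounded prediction."""
    value = float(text)  # input() always returns a string
    return round(float(predict(value)), 2)

# In the notebook, the text would come from the user, for example:
# user_text = input("Enter an x-value: ")
user_text = "3.5"

print("Prediction: " + str(predict_from_text(user_text)))
```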

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("swim_times.csv")



# Example: Old Faithful Regression

- **Run the code**. Look at the console to see the minimum and maximum x-values. These values are used in the `range()` function when plotting the line of best fit. Note: `range()` accepts only integers, not floats.

- Be sure to read the comments and match the code to the output.

Source: https://people.sc.fsu.edu/~jburkardt/datasets/stats/stats.html
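
If the minimum or maximum x-value is not a whole number, it can be rounded outward before it is passed to `range()`. The sketch below uses made-up x-values, not the Old Faithful data, and also shows that `range()` leaves out its stop value.

```python
import math
import numpy as np

# Hypothetical x-values that include floats (assumption: NOT taken from faithful.csv)
x = np.array([43.5, 50.0, 71.2, 95.8])

start = math.floor(x.min())    # round the minimum down to an integer
stop = math.ceil(x.max()) + 1  # range() excludes the stop value, so add 1

x_line = range(start, stop)

# First and last integers covered by the range
print(x_line[0], x_line[-1])
```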

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("faithful.csv")

# Set values for x and y
x = df.wait_time
y = df.eruption_time

# Determine correlation
correlation = y.corr(x)
print("Correlation: " + str(correlation)) 

# Add labels
plt.title("Old Faithful Eruptions")
plt.xlabel("Wait Time")
plt.ylabel("Eruption Time")
  
# Plot the scatterplot
plt.scatter(x, y)

# Create the model
model = np.polyfit(x, y, 1)

# Predict using the model
predict = np.poly1d(model)

# Determine the min and max values of the x-axis
print(df.wait_time.min())
print(df.wait_time.max())

# Create the line of best fit

x_lin_reg = range(43, 96)  # start and stop come from the min and max printed above; range() excludes the stop value
y_lin_reg = predict(x_lin_reg)
plt.plot(x_lin_reg, y_lin_reg, color = "red")
  
plt.show()

# Problem 3 - Swim Time Regression

Copy over your code from the previous Swim Times exercise.

1. Determine what the maximum and minimum values should be for the line of best fit.

2. Plot the line of best fit.

Source: https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_swimming_(women)#400_metres_freestyle

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv("swim_times.csv")


# Problem 4 - Regression Reflection

## Old Faithful
1. Do the data points in the scatterplot appear to have a linear relationship? How do you know?

2. What does the value of r tell you about this relationship?

3. How accurate do you think this model’s predictions are? Do you think this model is accurate enough to reliably predict the length of the next eruption based on the wait time? Why or why not?

4. Test your model! Click on this link to watch the next Old Faithful eruption! Based on the wait time, was your prediction correct?
https://www.yellowstone.org/old-faithful-streaming-webcam/

## Olympic Swim Times
1. Do the data points in the scatterplot appear to have a linear relationship? How do you know?

2. What does the value of r tell you about this relationship?

3. How accurate do you think this model’s predictions are? Do you think this model is accurate enough to reliably predict the next Olympic swim time? Why or why not?

4. Test your model! Research and find the gold medal time for the 400-meter women’s freestyle final in the most recent Olympics. Was your prediction correct?
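
When judging how accurate a model is, two numbers can help: r squared, the fraction of the variation in y that the line accounts for, and the typical size of the prediction errors (residuals). The sketch below computes both on made-up data. It is one possible way to support your answers, not a required step, and the data is not from either dataset.

```python
import numpy as np

# Hypothetical toy data (assumption: NOT from faithful.csv or swim_times.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit the line and build a prediction function
predict = np.poly1d(np.polyfit(x, y, 1))

# Residuals: actual values minus predicted values
residuals = y - predict(x)

# Correlation coefficient and its square
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# Root mean squared error: a typical prediction error, in the units of y
typical_error = np.sqrt(np.mean(residuals ** 2))

print("r squared: " + str(round(r_squared, 4)))
print("Typical error: " + str(round(typical_error, 4)))
```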

In [None]:
# Answers go here