![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/Mathematics/CurveFitting/curve-fitting-data.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Fitting Functions to Real Data

In this notebook, we will try to fit functions to real-world data. You may want to first check out the [Curve Fitting](./curve-fitting.ipynb) notebook.

We'll start with some data from [Baxters Harbour](https://www.google.com/maps/place/Baxters+Harbour) in the [Bay of Fundy](https://www.waterlevels.gc.ca/eng/find/zone/30), an area famous for its unusually high tides.

In [None]:
import pandas as pd

try:
    # Get hourly tide data for the next 7 days from Fisheries and Oceans Canada
    url = 'https://www.waterlevels.gc.ca/eng/station?sid=305'
    data = pd.read_html(url)

    # construct a dataframe from the table at index 7 (the eighth table on the webpage) and melt it to tidy data
    tides = data[7].melt(id_vars=['Event Date'], var_name='Hour', value_name='Tide Height (m)')

    # Join the date and hour values to make datetime values
    tides['Date and Time'] = pd.to_datetime(tides['Event Date'] +' '+ tides['Hour'])
    # sort chronologically and renumber the index so that it follows time order
    tides.sort_values(by='Date and Time', inplace=True, ignore_index=True)
except Exception:
    # if that didn't work, read archived data from the Callysto data-files repository
    tides = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CurveFitting/tides-data.csv')

# display the dataframe
tides

There are some `NaN` values, so we will drop those rows.

In [None]:
tides.dropna(inplace=True)
tides
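
The reshaping and clean-up steps above can be tried on a small hand-made table. This is only a sketch: the `wide` dataframe and its values are invented for illustration and are not real tide measurements.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per hour
wide = pd.DataFrame({
    'Event Date': ['2021-01-01', '2021-01-02'],
    '00:00': [1.0, 2.0],
    '01:00': [1.5, None],   # a missing reading
})

# Melt to tidy data: one row per (date, hour) pair
tidy = wide.melt(id_vars=['Event Date'], var_name='Hour', value_name='Tide Height (m)')

# Join the date and hour strings, then parse them as datetime values
tidy['Date and Time'] = pd.to_datetime(tidy['Event Date'] + ' ' + tidy['Hour'])

# Sort chronologically and drop the row with the missing reading
tidy = tidy.sort_values(by='Date and Time').dropna()
print(tidy)
```

Each hourly column becomes its own row, which is the shape that the plotting and fitting steps expect.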

Then we can create an interactive visualization of the data using the [Plotly Express](https://plotly.com/python/plotly-express) library.

In [None]:
import plotly.express as px
px.line(tides, x='Date and Time', y='Tide Height (m)', title='Baxters Harbour Water Levels')

We can see periodic variation in the water level due to the tides. Notice that we are using `px.line` instead of `px.scatter` to connect the data points, which makes it easier to see what type of function would fit the data.

It looks like a sine wave, so let's try to find a good curve fit using the function $h=a\sin(bt+c)+d$.

* $a$ = amplitude, the maximum distance from the center of the wave
* $b$ = angular frequency, which controls how often the wave peaks (one full cycle takes $2\pi/b$)
* $c$ = phase shift, the horizontal offset of the wave (between $0$ and $2\pi$)
* $d$ = vertical shift, the height of the center of the wave
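
To see how the four parameters shape the wave, here is a minimal sketch using made-up values. The helper name `wave` and the numbers are assumptions for illustration only.

```python
import numpy as np

def wave(t, a, b, c, d):
    """Sinusoid h = a*sin(b*t + c) + d."""
    return a * np.sin(b * t + c) + d

a, b, c, d = 2.0, 0.5, 0.0, 10.0   # illustrative values only
period = 2 * np.pi / b             # time taken by one full cycle

# Sample exactly four full cycles
t = np.linspace(0, 4 * period, 2001)
h = wave(t, a, b, c, d)

print('period:', period)
print('half of (max - min):', (h.max() - h.min()) / 2)   # recovers a
print('mean height:', h.mean())                          # recovers d
```

Half the distance between the maximum and minimum recovers $a$, and the mean over whole cycles recovers $d$. The mean is only approximately $d$ when the data do not cover a whole number of cycles.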

We can calculate the amplitude as half of the distance between maximum and minimum height.

In [None]:
(tides['Tide Height (m)'].max() - tides['Tide Height (m)'].min()) / 2

And the vertical shift should be approximately equal to the mean height value.

In [None]:
tides['Tide Height (m)'].mean()

Adjust the values of $a$, $b$, $c$, and $d$ to see if you can get a reasonable fit to the data.

In [None]:
from scipy.optimize import curve_fit
import numpy as np

a = 4.2
b = np.radians(3) # frequency*2*pi
c = np.radians(90) # phase offset
d = 6.5

x_values = tides.index.to_series()
y_values = tides['Tide Height (m)']


def tide_fit(x, a, b, c, d):
    return a * np.sin(b * x + c) + d

p0 = [a, b, c, d] # some functions give different results from curve_fit based on initial guess p0 
values, fit_quality = curve_fit(tide_fit, x_values, y_values, p0=p0) #p0=[1,1,1,1] is default

tides['Manual Curve Fit'] = tide_fit(x_values, a, b, c, d) #a * np.sin(b * x_values + c) + d
tides['SciPy Curve Fit'] = tide_fit(x_values, *values) #star "unpacks" fit values for function

print('Values:', values)
# cross-correlations in the off-diagonal parts of the matrix are negligible/small,
# so we report the square roots of the diagonal variances (one standard deviation each)
print('Uncertainty:', np.sqrt(np.diag(fit_quality)))

fig1 = px.line(tides, x='Date and Time', y='Tide Height (m)', title='Tide Levels With Curve Fit')
fig1.add_scatter(x=tides['Date and Time'], y=tides['Manual Curve Fit'], name='Manual Curve Fit')
fig1.add_scatter(x=tides['Date and Time'], y=tides['SciPy Curve Fit'], name='SciPy Curve Fit')
fig1.show()

`curve_fit` can give better or worse results depending on the initial guess `p0`. Try changing the initial guess to see how the fit can improve or fail. Most solutions are able to fit the overall frequency, but not the slow increase in the height of the wave.
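
The effect of the initial guess can be explored on synthetic data where the true values are known. The sketch below is not part of the tide analysis: the data are invented, the helper `try_fit` is hypothetical, and reading a starting frequency off the largest FFT peak is just one possible approach.

```python
import numpy as np
from scipy.optimize import curve_fit

def wave(x, a, b, c, d):
    return a * np.sin(b * x + c) + d

# Synthetic "tide": the true values (4.0, 0.5, 1.0, 6.0) are invented for this demo
rng = np.random.default_rng(0)
x = np.arange(200)
y = wave(x, 4.0, 0.5, 1.0, 6.0) + rng.normal(0, 0.2, x.size)

def try_fit(p0):
    """Return (fitted values, residual sum of squares), or None if curve_fit gives up."""
    try:
        params, _ = curve_fit(wave, x, y, p0=p0)
    except RuntimeError:
        return None
    return params, np.sum((y - wave(x, *params)) ** 2)

# Data-driven starting point: amplitude and offset from the data,
# angular frequency from the strongest peak of the spectrum
a0 = (y.max() - y.min()) / 2
d0 = y.mean()
spectrum = np.abs(np.fft.rfft(y - d0))
freqs = np.fft.rfftfreq(x.size, d=1.0)      # cycles per sample
b0 = 2 * np.pi * freqs[spectrum.argmax()]   # radians per sample

informed = try_fit([a0, b0, 0.0, d0])
default = try_fit([1, 1, 1, 1])             # the guess curve_fit uses when p0 is omitted

print('informed start:', informed)
print('default start: ', default)
```

Comparing the two residual sums of squares shows how much the starting point matters for this kind of function.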

Let's look at a data set with more values.

## Atmospheric Carbon Dioxide

Next we will download measurements of carbon dioxide in the atmosphere from 1958 to 2019, then fit a function to this data so that we can try to predict future atmospheric $CO_2$ concentrations.

In [None]:
# A file containing the measurements
#url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
url = 'https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CurveFitting/co2_mm_mlo.txt' # the file from Callysto data-file repo

# Download the file, ignore the headers, and specify the data is separated by spaces
co2 = pd.read_csv(url, skiprows=72, sep=r'\s+').reset_index(drop=True)

# Rename the columns to something more convenient 
co2.columns = ["Year", "Month", "Date", "Carbon Dioxide (ppm)", "Interpolated", "Trend", "Number of Days"]

# Missing data is filled in with -99.99, so we get rid of it 
co2 = co2[~co2['Carbon Dioxide (ppm)'].isin([-99.99])]

# View the data
co2
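
The filtering step above can be checked on a tiny hand-made table. The values below are invented for illustration. The second approach, which converts the placeholder to `NaN` and then drops it, is an alternative shown for comparison and is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, with -99.99 standing in for a missing value
sample = pd.DataFrame({
    'Date': [1958.2, 1958.3, 1958.4],
    'Carbon Dioxide (ppm)': [315.7, -99.99, 317.5],
})

# Approach used above: keep rows whose value is not the placeholder
kept = sample[~sample['Carbon Dioxide (ppm)'].isin([-99.99])]

# Alternative: turn the placeholder into NaN, then drop those rows
alternative = sample.replace(-99.99, np.nan).dropna()

print(kept)
print(alternative)
```

Both approaches keep the same two rows in this example.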

The only data we need is the `Date` column (in [decimal time](https://en.wikipedia.org/wiki/Decimal_time), that is, the year written as a decimal number) and the `Carbon Dioxide (ppm)` column. Let's create an interactive plot of those two columns.

In [None]:
fig2 = px.scatter(co2, x='Date', y='Carbon Dioxide (ppm)', title='Atmospheric Carbon Dioxide Over Time')
fig2.show()

## Try It Yourself

You have probably noticed an increasing trend and some periodic behavior. Your task is to fit a function to this data.

Feel free to apply _any_ transformations to the data and to fit _any_ function you wish, and try several if you like. For example, to work with the square root of the `x_data` values defined in the starter code below, you could use `co2['Curve Fit'] = np.sqrt(x_data)`, which creates a new column holding those square roots.
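
As one illustration of a transformation, the sketch below shifts the dates so they start at zero before fitting. The data are synthetic: the coefficients 315, 0.8 and 0.012 are invented for this demo and are not real $CO_2$ results. The quadratic model is only one possible choice.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic data: an invented quadratic trend plus noise
rng = np.random.default_rng(2)
date = np.linspace(1958, 2019, 732)
years_since_start = date - 1958
ppm = 315 + 0.8 * years_since_start + 0.012 * years_since_start**2
ppm = ppm + rng.normal(0, 0.5, date.size)

# Transformation: measure time from the first observation instead of year 0
t = date - date.min()

def quadratic(t, a, b, c):
    return a + b * t + c * t**2

values, _ = curve_fit(quadratic, t, ppm)
print('Fitted a, b, c:', values)
```

Shifting the dates keeps the fitted values at an interpretable scale: here `a` is the concentration at the first observation rather than an extrapolation back to year 0.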

Below we set up some starter code for you to get working on this task. Remember that you can refer to [this curve fit notebook](./curve-fitting.ipynb) for examples. Good luck!

In [None]:
x_data = co2['Date']
y_data = co2['Carbon Dioxide (ppm)']

# Create your function to fit here
def co2_fit(x, a, b):
    return a * x + b

values, fit_quality = curve_fit(co2_fit, x_data, y_data)

# Evaluate the fitted function; the star "unpacks" the fitted values,
# so this line keeps working when you change co2_fit
co2['Curve Fit'] = co2_fit(x_data, *values)

fig3 = px.scatter(co2, x='Date', y='Carbon Dioxide (ppm)', title='Atmospheric Carbon Dioxide Over Time')
fig3.add_scatter(x=co2['Date'], y=co2['Curve Fit'], name='Curve Fit')
fig3.show()
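print('Values:', values)
# fit_quality is a covariance matrix; the square roots of its diagonal are the standard deviations
print('Uncertainty:', np.sqrt(np.diag(fit_quality)))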
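
The second value returned by `curve_fit` is the estimated covariance matrix of the fitted values. Taking the square root of its diagonal gives one standard deviation for each value. A minimal sketch on synthetic data, where the true slope 2.0 and intercept 1.0 are invented for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

# Synthetic straight-line data with noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = line(x, 2.0, 1.0) + rng.normal(0, 0.5, x.size)

values, covariance = curve_fit(line, x, y)

# Diagonal entries are variances; their square roots are standard deviations
std_errors = np.sqrt(np.diag(covariance))

for name, value, err in zip(['a', 'b'], values, std_errors):
    print(f'{name} = {value:.3f} +/- {err:.3f}')
```

Large off-diagonal entries in the matrix would indicate that the fitted values are correlated, in which case the diagonal alone understates how the uncertainties interact.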

## Conclusion

Hopefully you were able to find a reasonable curve fit for the data. For more information about `curve_fit`, check out the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)