# Tutorial 4: Curve Fitting

Developed by Megan Renz for Cornell Physics Labs.

Reminder: the code cells throughout this tutorial will build off of previous code cells. It is important to run every code cell <SHIFT + ENTER> in the document, in order, up to the point where you are working every time you open this tutorial. If you get an error message (particularly one that says that a particular variable is not defined) after attempting to run a code cell, first make sure that you have run every previous code cell. 

Oftentimes, we will want to figure out the relationship between two variables, i.e., $x$ and $y$, as a function: $f(x)=y$. The most common question will be if the relationship between $x$ and $y$ is linear; in this case, we need to also figure out what the slope and intercept of that line should be.  

Let's say we have some data, which we want to plot as $y$ vs $x$ and find out if the relationship between them is linear.  

Below we have a graph where the data are the blue dots and the solid red and dotted green lines show two attempts to fit the data. Run the block below. 

In [1]:
#These are some imports you don't need to worry about.  
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import *


#Again, don't worry too much about this code, just creating an example.  
x=np.arange(10)
y=np.arange(10)+np.random.random(10)-.5
plt.figure()
plt.errorbar(x,y, np.ones(10)*.2, fmt='.')
plt.plot(x,x,'r', label="f(x)=x")
plt.plot(x,.5*x, 'g-.', label="f(x)=.5*x")
plt.title("y vs x")
plt.legend()
plt.show()

### 1. Which one of the two lines above seems like a better fit to the data? Please explain your reasoning.


**ANSWER:**
The $f(x)=x$ line seems to be a better fit to the data, since the line stays close to the data points. The $f(x)=0.5x$ line strays from the data, moving further away from the points as $x$ increases.

Most of the time, the difference between possible fit lines will be a bit more subtle. In these cases, we want to come up with a way to make our goodness-of-fit assessment quantitative instead of qualitative.  To do that, we are going to use our data and our function and come up with a number or "score" that tells us if our fit is good or bad.  

When the score is small, our fit is good, and when the score is large, our fit is bad.  


We want to take several things into account:  

1.  The score should increase as the points get further away from the function, by our definition above.
2.  We want points with smaller uncertainties to "count" more towards the score; if the function is far away from a point with a small uncertainty, our fit is worse than if the function is far away from a point that has large uncertainty.   
3.  Our score should not depend on units. That is, we want the score to be dimensionless so we can have a standard way of interpreting a "good" or "bad" fit, regardless of the units in our data.
4.  Our score should not change as we add more points that are similar to the ones we already have. That is, we want a standard way of interpreting a "good" or "bad" fit, regardless of the number of data points.


While there are many ways to assess how well a curve fits data, the method that we will use here is called Chi-Squared ("chi" is pronounced like "sky", but without the 's'):


$$\chi^2=\frac{1}{N} \sum_{i=1}^N \frac{(f(x_i)-y_i)^2}{\delta y_i^2}$$

where we have data points $(x_1, y_1) ... (x_N, y_N)$ with associated uncertainties $\delta y_i$, and $f(x_i)$ is the function we are fitting evaluated at $x_i$. In the graph above, the red and green lines are examples of possible functions $f(x)$.
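
As a minimal sketch of the formula above (not the function used later in this tutorial), the sum can be written out with NumPy for four made-up points. The data values and the name `chi_squared_per_point` are assumptions chosen purely for illustration.

```python
import numpy as np

def chi_squared_per_point(x, y, dy, f):
    """Evaluate (1/N) * sum(((f(x_i) - y_i) / dy_i)**2) for a callable f."""
    return np.sum(((f(x) - y) / dy) ** 2) / len(x)

# Made-up data: four points lying close to the line y = x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([0.1, 0.9, 2.2, 2.9])
dy_demo = np.array([0.1, 0.1, 0.2, 0.1])

chi2_good = chi_squared_per_point(x_demo, y_demo, dy_demo, lambda x: x)
chi2_poor = chi_squared_per_point(x_demo, y_demo, dy_demo, lambda x: 0.5 * x)

print("f(x)=x:     ", round(chi2_good, 3))
print("f(x)=0.5*x: ", round(chi2_poor, 3))
```

In this made-up data set every point sits exactly one uncertainty away from $f(x)=x$, so each term in the sum is 1 and the average is 1. The line $f(x)=0.5x$ misses the points by many uncertainties and scores far higher.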



### 2. Explain how the formula for $\chi^2$ fulfills the four requirements above.

**ANSWER:**

1.  The numerator in the summation definitely gets larger as the points get further from the function. The square of the difference between each fitting point and data point represents the square of the distance between the points, so as the distance increases, the square of the difference increases, introducing larger terms in the summation.
2.  The denominator within the summation weights each term by the uncertainty in that data point. As the uncertainty increases, the denominator increases, which shrinks the term and reduces that point's contribution to the sum.
3.  Squaring the uncertainty in the denominator matches the squared difference in the numerator, so the units always cancel in each summation term and $\chi^2$ is dimensionless.
4. The $\frac{1}{N}$ term normalizes the sum based on the number of data points taken. This evens out the $\chi^2$ values to have normalized magnitude, regardless of the number of data points.



$\chi^2$ is written as a python function below. Run the code block below.

In [2]:
def chiSquared(x, y, dy, f, args):
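    '''Function Chi-Squared.  
    x, y and dy are numpy arrays, referring to x, y and the uncertainty in y respectively.
    f is the function we are fitting. 
    args are the arguments (fit parameters) of the function we have fit.  
    Note: the sum is divided by len(x)-len(args), the number of data points minus
    the number of fit parameters, rather than by N as in the formula above.
    '''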
    return 1/(len(x)-len(args))*np.sum((f(x, args)-y)**2/dy**2)

### 3. Compare the equation for $\chi^2$ to the equation for $t^{\prime}$ from the last homework tutorial. In what ways are the equations similar and in what ways are they different?


### 4. What might a small $\chi^2$ value mean? What should count as "small"?
**ANSWER:** A small $\chi^2$ value indicates that the fitting function fits the data points well. A small $\chi^2$ could be considered to be in the range $0\le \chi^2 \le 1$. A $\chi^2=0$ would indicate a perfect fit, since every difference $f(x_i)-y_i$ would be 0. A $\chi^2=1$ would indicate that the differences are, on average, about the same size as the uncertainties, since the summation terms would average to about 1. 

### 5. What might a large $\chi^2$ mean?  What should count as "large"?
**ANSWER:** A large $\chi^2$ would be something significantly larger than 1. This would indicate a bad fit.

Below are a few example functions one can fit. We will most often be fitting a line. Run the code block below. (You do not need to worry about the details.)

In [3]:
def poly(x, args):
    '''
    returns the value of the polynomial sum (x**i*args[i])
    '''
    total=x**0*args[0]
    for i in range(1,len(args)):
        total+=x**i*args[i]
    return total
def linear(x, args):
    '''
    A special case of Poly.  
    '''
    return args[0]+x*args[1]


Let's take a look at fitting a line to some data.  

Here we have some data of an experiment in which a spring is stretched and the spring's force ($y$) is measured at certain stretching distances ($x$). Run the code below, which will produce two graphs.

The first is a graph of the data points (black dots with uncertainty bars) and a fit line (in blue). The second is a *residuals* plot, which shows the difference between $f(x_i)$ (the value of our function at $x_i$) and $y_i$ (the measured data at $x_i$) at each $x_i$.  

In [4]:
y=np.array([ 1.36,  3.36,  3.92,  4.11,  3.43,  5.22,
  8.29,  8.22, 11.15, 10.86])
uncertainty=np.ones(10)
uncertainty[4]=4
x=np.linspace(1,8,10)

fig, ax=plt.subplots(1,2, figsize=(8,4))
ax[0].set_title("force vs extension")
line,=ax[0].plot(x,linear(x, [0,1]))
data=ax[0].errorbar(x,y, uncertainty, fmt='.k')
ax[1].set_title("residuals")
residuals=linear(x,[0, 1])-y
res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
ax[1].grid(True, which='both')
plt.show()
def update(intercept=0,slope=1):
    fx=linear(x,[intercept, slope])

    line.set_ydata(fx)
    residuals=fx-y
    ax[1].cla()
    res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
    ax[1].grid(True, which='both')
    ax[1].set_title("Residuals")
    ax[1].set_xlabel("extension (cm)")
    ax[1].set_ylabel("f(x) - y")
    ax[0].set_xlabel("extension (cm)")
    ax[0].set_ylabel("force (N)")
    fig.canvas.draw_idle()
    print("chi-squared value:  ")
    print(chiSquared(x,y, uncertainty, linear, [intercept, slope]).round(3))
interact(update, intercept=(-5, 20, .1), slope=(-1, 10, .1));

Try adjusting the slope and intercept of the line above using the sliders. Watch how the chi-squared value changes as the line becomes a better or worse fit.

You can also adjust the slope and intercept values by clicking the numbers to the right of the sliders and typing in your desired value.

### 6. What happens as you change the values for the slope and intercept?


**ANSWER:** As the slope and intercept are changed, the chi-squared value changes according to how closely the line fits the data points: it decreases as the residuals approach 0 and increases as they move away from 0.

### 7. What values for the slope and intercept give the smallest value of $\chi^2$ (to the first decimal place)?


**ANSWER:** Adjusting by hand, the slope and intercept that minimized the chi-squared value were $m=1.4$ and $b=0.2$, corresponding to $\chi^2=0.857$.

### 8. How confident are you that the line you fit is a good representation of the underlying phenomena? (One way to check this is to see if any other lines look like better fits, qualitatively.)


**ANSWER:** I am somewhat confident: looking at the graph and the residuals, it appears to be a good fit with a $\chi^2$ that is less than 1. Checking many more combinations of slope and intercept would increase my confidence in my answer.
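
Checking many slope/intercept pairs by hand is slow, so one possible way to automate it is a brute-force grid search. The sketch below is an illustration under stated assumptions, not part of the original tutorial: it re-creates the spring data from the cell above under hypothetical names (`x_data`, `y_data`, `dy_data`) and scans the same ranges and step size as the sliders.

```python
import numpy as np

# Spring data copied from the cell above.
y_data = np.array([1.36, 3.36, 3.92, 4.11, 3.43, 5.22, 8.29, 8.22, 11.15, 10.86])
dy_data = np.ones(10)
dy_data[4] = 4
x_data = np.linspace(1, 8, 10)

# Candidate values: same ranges and step as the sliders (an assumption of this sketch).
intercepts = np.arange(-5, 20.05, 0.1)
slopes = np.arange(-1, 10.05, 0.1)

# Model values for every (intercept, slope) pair.
# Resulting shape: (number of intercepts, number of slopes, number of points).
model = intercepts[:, None, None] + slopes[None, :, None] * x_data[None, None, :]

# Same normalization as chiSquared above: divide by N minus the two fit parameters.
chi2_grid = np.sum(((model - y_data) / dy_data) ** 2, axis=2) / (len(x_data) - 2)

# Locate the smallest chi-squared value on the grid.
i_best, j_best = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
best_intercept, best_slope = intercepts[i_best], slopes[j_best]
best_chi2 = chi2_grid[i_best, j_best]
print("intercept:", round(best_intercept, 1), "slope:", round(best_slope, 1),
      "chi-squared:", round(best_chi2, 3))
```

A grid search only finds the best pair among the values it tries, so the answer is limited by the step size. It is still a useful check on a fit found by hand.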

This brings us to another question - what if our $\chi^2$ is really small?  (This is a rhetorical question).  Run the code block below.  

In [5]:

largeUncertainty=uncertainty*5

fig, ax=plt.subplots(1,2, figsize=(8,4))
ax[0].set_title("force vs extension")
line,=ax[0].plot(x,linear(x, [0,1]))
data=ax[0].errorbar(x,y, largeUncertainty, fmt='.k')
ax[1].set_title("residuals")
residuals=linear(x,[0, 1])-y
res=ax[1].errorbar(x,residuals, largeUncertainty, fmt='.k')
ax[1].grid(True, which='both')
plt.show()
def update(intercept=0,slope=1):
    fx=linear(x,[intercept, slope])
    line.set_ydata(fx)
    residuals=fx-y
    ax[1].cla()
    res=ax[1].errorbar(x,residuals, largeUncertainty, fmt='.k')
    ax[1].set_title("residuals")
    ax[1].grid(True, which='both')
    fig.canvas.draw_idle()
    print("chi-squared value:  ")
    print(chiSquared(x,y, largeUncertainty, linear, [intercept, slope]).round(3))
interact(update, intercept=(-5, 12, .2), slope=(0, 5, .1));

Now you should have a large range of values for which the $\chi^2$ value is quite small. For example, a fit with intercept=-4.6 and slope=2.2 and one with intercept=3.2 and slope=.6 both give $\chi^2$ values around 0.2. *Reminder:  You can enter values into the boxes next to the sliders!*


### 9. How confident are you that either of these sets of fit parameters are a good representation of the underlying phenomena?  Do you trust them?  

**ANSWER:** I am not confident that these sets of fit parameters are a good representation of the underlying phenomena. The parameters fit the data points exceptionally well, but only because of the large uncertainties assigned to the measurements. A low chi-squared value is easy to attain when the uncertainties are this large.

If your $\chi^2$ is too small (e.g. $\chi^2 <<1$), you may have overestimated your uncertainties. That is, your fit is telling you that you measured these data points much more precisely than you thought! Uncertainty overestimation is a problem because it means that it is hard to identify which of the lines that appear to be a good fit actually reflect the underlying physics.   
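
The effect of overestimated uncertainties can be checked numerically. Because each term in the sum is divided by $\delta y_i^2$, multiplying every uncertainty by a factor $k$ divides $\chi^2$ by $k^2$ without changing the data or the line. The sketch below uses made-up measurements and a hypothetical helper `chi_squared_sum`; the factor of 5 mirrors `largeUncertainty=uncertainty*5` in the cell above.

```python
import numpy as np

# Made-up measurements scattered around the line y = x.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([1.2, 1.9, 3.1, 3.8, 5.3])
dy_demo = np.full(5, 0.2)

def chi_squared_sum(y, dy, fx):
    """Sum of squared residuals, each measured in units of its uncertainty."""
    return np.sum(((fx - y) / dy) ** 2)

scale = 5  # hypothetical overestimation factor
chi2_original = chi_squared_sum(y_demo, dy_demo, x_demo)
chi2_inflated = chi_squared_sum(y_demo, scale * dy_demo, x_demo)

# The normalization constant cancels in the ratio, so it is omitted here.
ratio = chi2_original / chi2_inflated
print("chi-squared shrinks by a factor of:", ratio)
```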


### 10. What do you think you should do if you obtain a very small $\chi^2$ value?

**ANSWER:** Obtaining a very small Chi-Square value indicates that the uncertainties have been overestimated. This means that the uncertainties should be reduced by taking more precise measurements of the underlying phenomena.

A $\chi^2$ value larger than 9 is considered a very poor fit for the data. (Why 9?) 

For $\chi^2$, there are a few possible outcomes:  

1. $\chi^2\approx1$

2.  $\chi^2<<1$

3.  $1\lesssim\chi^2<9$

4.  $\chi^2 >9$

### 11. Write down different interpretations for what each of these $\chi^2$ values could mean, and what you should do in each case.  *Hint: refer to the interpretations of values of $t^\prime$ from the previous tutorial.*
1. This indicates a good fit for the data, a good balance between uncertainties and variances of measurements.

2.  This indicates that the uncertainties are much greater than the variances, implying that uncertainties have been overestimated.

3.  This indicates a mediocre fit for the data; either better measurements could be taken or the fitting equation could be adjusted.

4. This indicates a bad fit to the data. Variances are much greater than uncertainties and the fitting parameters aren't a good fit to the data.



#### <span style='color:Red'>*You should never manipulate your uncertainties to obtain a specific $\chi^2$ value. Your uncertainties should always reflect your real measurements.*</span>

Now let's investigate the graph called "Residuals".  This is a graph of $f(x_i)-y_i$, the difference between what our fit predicts and what we actually got during the experiment.  The x-axis is the same as the graph "force vs. extension", but the y-axis is the vertical distance between the line and points. 

### 12. Given how you expect points to be distributed around the line, what do you expect to see in your residuals graph, if $f(x_i)$ is a good fit? 

The residual graph of a good fit should have the points scattered evenly above and below the horizontal axis (residual = 0), with vertical distances comparable to the size of the uncertainty bars.

Looking at the residuals graph is a good way to tell if you are trying to fit the right kind of function. The $\chi^2$  value does not necessarily tell the whole story.

Run the code below, and adjust the slider to slope=5, intercept=-17.

In [6]:
y=np.array([ 1.36,  3.36,  3.92,  4.11,  3.43,  5.22,
  8.29,  8.22, 11.15, 10.86])
uncertainty=np.ones(10)
uncertainty[4]=4
x=np.linspace(1,8,10)

fig, ax=plt.subplots(1,2, figsize=(8,4))
ax[0].set_title("force vs extension")
line,=ax[0].plot(x,linear(x, [0,1]))
data=ax[0].errorbar(x,y, uncertainty, fmt='.k')
ax[1].set_title("residuals")
residuals=linear(x,[0, 1])-y
res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
ax[1].grid(True, which='both')
plt.show()
def update(intercept=-17,slope=5):
    fx=linear(x,[intercept, slope])
    line.set_ydata(fx)
    residuals=fx-y
    ax[1].cla()
    res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
    ax[1].grid(True, which='both')
    ax[1].set_title("Residuals")
    ax[0].set_xlabel("extension (cm)")
    ax[0].set_ylabel("force (N)")
    fig.canvas.draw_idle()
    print("chi-squared value:  ")
    print(chiSquared(x,y, uncertainty, linear, [intercept, slope]).round(3))
interact(update, intercept=(-20, 20, .1), slope=(-1, 10, .1));



While the $\chi^2$ value tells us that this fit is bad (large $\chi^2$), the residual graph can give us an idea about *why*. In this case, the residuals show a linear trend, which tells us that the first half of the data points is systematically above the line and the second half is systematically below it. This clearly suggests that you should change the slope of the line!
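
A trend in the residuals can also be estimated numerically rather than by eye. The sketch below is one possible check, not part of the original tutorial: it copies the spring data from the cell above under hypothetical names and fits an unweighted straight line to the residuals with `np.polyfit`.

```python
import numpy as np

# Spring data copied from the cell above.
y_data = np.array([1.36, 3.36, 3.92, 4.11, 3.43, 5.22, 8.29, 8.22, 11.15, 10.86])
x_data = np.linspace(1, 8, 10)

# Residuals f(x) - y for the deliberately poor line with intercept=-17, slope=5.
residuals_demo = (-17 + 5 * x_data) - y_data

# Fit a straight line to the residuals themselves. np.polyfit returns
# coefficients from the highest power down, so the slope comes first.
trend_slope, trend_offset = np.polyfit(x_data, residuals_demo, 1)
print("trend in residuals (N per cm):", round(trend_slope, 2))
```

A trend slope far from zero suggests the slope of the fit line is off by roughly that amount. For a well-chosen line the trend should be close to zero.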

Let's try another example.



In [7]:
y=np.array([ 1.36,  3.36,  3.92,  4.21,  5.43,  6.22,
  8.29,  8.22, 10.15, 10.86])
y=.5*(y-2)**2
uncertainty=np.ones(10)
uncertainty[4]=4
x=np.linspace(1,8,10)

fig, ax=plt.subplots(1,2, figsize=(8,4))
ax[0].set_title("force vs extension")
line,=ax[0].plot(x,linear(x, [0,1]))
data=ax[0].errorbar(x,y, uncertainty, fmt='.k')
ax[1].set_title("residuals")
residuals=linear(x,[0, 1])-y
res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
ax[1].grid(True, which='both')
plt.show()
def update(intercept=-17,slope=5):
    fx=linear(x,[intercept, slope])
    line.set_ydata(fx)
    residuals=fx-y
    ax[1].cla()
    res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
    ax[1].grid(True, which='both')
    ax[1].set_title("Residuals")
    ax[0].set_xlabel("extension (cm)")
    ax[0].set_ylabel("force (N)")
    fig.canvas.draw_idle()
    print("chi-squared value:  ")
    print(chiSquared(x,y, uncertainty, linear, [intercept, slope]).round(3))
interact(update, intercept=(-20, 20, .1), slope=(-1, 10, .1));

In this case, the residuals show an upside down "v".

### 13. What do you think this shape of residuals might suggest about your fit? How might you change the function to get a better fit?

The shape of the residuals indicates why the fit is incorrect. For the linear fit, adjusting the intercept changes the y position of the residual points while adjusting the slope changes the relative positioning of the residual points. The function parameters can be changed to flatten out the residual points and center them on zero, leading to a small chi-squared value.
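
One way to explore the question above is to compare a straight line with a function that has an extra curved term. The sketch below is an assumption-based illustration, not the tutorial's method: it copies the data from the cell above under hypothetical names and uses `np.polyfit` with weights to fit polynomials of degree 1 and 2, then scores each with the same normalization as `chiSquared`.

```python
import numpy as np

# Data copied from the cell above (the second example).
y_raw = np.array([1.36, 3.36, 3.92, 4.21, 5.43, 6.22, 8.29, 8.22, 10.15, 10.86])
y_data = 0.5 * (y_raw - 2) ** 2
dy_data = np.ones(10)
dy_data[4] = 4
x_data = np.linspace(1, 8, 10)

def reduced_chi_squared(degree):
    """Weighted polynomial fit of the given degree, scored like chiSquared above."""
    # np.polyfit expects weights of 1/uncertainty, not 1/uncertainty**2.
    coeffs = np.polyfit(x_data, y_data, degree, w=1 / dy_data)
    fx = np.polyval(coeffs, x_data)
    n_params = degree + 1
    return np.sum(((fx - y_data) / dy_data) ** 2) / (len(x_data) - n_params)

chi2_line = reduced_chi_squared(1)
chi2_parabola = reduced_chi_squared(2)
print("straight line:", round(chi2_line, 2))
print("parabola:     ", round(chi2_parabola, 2))
```

If the curved function scores noticeably better and its residuals no longer show a clear shape, that is a hint that the straight line was the wrong kind of function for these data.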

Let's practice writing your own code to fit a function to some data manually below.  

For example, let's say you stretch a spring to the same extension multiple times and measure the force required each time.  

First, I have created some sample data, which is 10 extensions (in cm) of the spring (stored in `extensions`), and a matrix of 10 rows and 5 columns (stored in `forces`), where each row holds the forces (in N) measured for 5 trials.  

Each of the 10 rows corresponds to one of the extension values, in order. Notice that as extension increases, force increases. Within each row, there is no clear trend because each row displays the same measurement taken five times at that extension. 

### 14. Print out the measurements of the extensions and forces in a matrix. Why is the number of values in "extensions" equal to the number of rows in "forces"? What are the units of all the values in "extensions"? What are the units of all the values in "forces"? 

**ANSWER:** In this experiment, the force of the spring was measured 5 times at each extension of the spring. Each extension corresponds to one row of force measurements, which is why the number of extensions equals the number of rows. The extensions are in centimeters and the force measurements are in Newtons.

In [8]:
extensions=np.linspace(0,9, 10)
forces=np.random.normal(0,.5,size=(5,10))
forces=forces+extensions[None,:]
forces=forces.T

print("extensions: \n", extensions)
print("forces: \n",forces)


extensions: 
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
forces: 
 [[-0.5691605   0.0124447  -0.1285558   0.48688633  0.1403111 ]
 [ 1.51991396  1.60801488  1.06057833  0.44866587  1.3386632 ]
 [ 2.2952643   2.08679195  1.66489531  1.93698473  2.53212745]
 [ 2.880389    3.1589627   2.90444596  4.29631728  3.31794368]
 [ 3.74884856  3.31406673  3.86569437  3.87545802  4.42150758]
 [ 4.82604569  4.95672982  5.40404298  4.9006935   4.61320634]
 [ 5.89546958  6.46854729  5.83014719  5.04799738  6.44620772]
 [ 6.52975918  6.41517144  6.58391054  6.75034493  6.53728626]
 [ 8.67165898  7.63211211  8.41089706  7.14914337  7.78011955]
 [ 9.35775111  8.58487593  9.6119853   8.60976316  8.25644028]]


We do not want to plot all 50 data points in the table above. Instead, we want to average the 5 data points for each extension, so that we are only plotting 10 data points (with a clear trend). The uncertainty in the mean should be used to make the errorbars.  

As a hint, take a look at the code below:  



In [9]:
numpyExample=np.array([[1,2],[3,4], [5,1]])
print(numpyExample)
print("summing over axis=0:")
print(np.sum(numpyExample, axis=0))
print("summing over axis=1:")
print(np.sum(numpyExample, axis=1))

[[1 2]
 [3 4]
 [5 1]]
summing over axis=0:
[9 7]
summing over axis=1:
[3 7 6]


Note: When using functions in *numpy*, you can specify whether you want to take the average across the row or the column using <span style='font-family:Courier'>axis = 1</span> or <span style='font-family:Courier'>axis = 0</span> respectively.  This also works for <span style='font-family:Courier'>np.sum</span> and other functions.



We will need to average the five trials together for each extension of the spring, and put that in $y$, and find the uncertainty in the force measurement for each extension of the spring, and put that in $dy$. 

Your final answer for both $y$ and $dy$ should be a vector of length 10.  

### 15. Fill in the three '...' below to create an array for the mean force measurements for each extension of the spring, $y$, and their uncertainties, $dy$.  
  
[1.] Check that *y* and *dy* are what you expect (particularly check the number of data points). 

[2.] Manually check the first value (corresponding to extension=0 for each). 




In [10]:
y=np.mean(forces, axis=1) #Take the mean of each of the sets of 5 trials in "forces".  

#Note: When using functions in numpy, you can specify whether you want to take the average across the row or the 
#column using axis= 1 or 0 respectively.  This also works for np.sum and other functions.

print("y:")
print(y)
dy=np.std(forces, axis=1)/np.sqrt(len(forces[0])) #Calculate the standard uncertainty of the mean for each of the sets of 5 trials in "forces". 
#divide by square root of number of trials in each set to get the uncertainty.  
print("dy:")
print(dy)


y:
[-0.01161484  1.19516725  2.10321275  3.31161172  3.84511505  4.94014367
  5.93767384  6.56329447  7.92878621  8.88416316]
dy:
[0.15449221 0.1867483  0.13279687 0.23192446 0.15809517 0.11609313
 0.2319742  0.04864334 0.24513271 0.22916499]


Now let's plot what we just made, and try fitting it. Run the code block below. 

In [11]:

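uncertainty=dy 
x=extensions #use the extensions that belong to y and dy (x still held the values from the earlier examples)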
fig, ax=plt.subplots(1,2, figsize=(8,4))
ax[0].set_title("force vs extension")
line,=ax[0].plot(x,linear(x, [0,1]))
data=ax[0].errorbar(x,y, uncertainty, fmt='.k')
ax[1].set_title("residuals")
residuals=linear(x,[0, 1])-y
res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
ax[1].grid(True, which='both')
plt.show()
def update(intercept=0,slope=1):
    fx=linear(x,[intercept, slope])
    line.set_ydata(fx)
    residuals=fx-y
    ax[1].cla()
    res=ax[1].errorbar(x,residuals, uncertainty, fmt='.k')
    ax[1].grid(True, which='both')
    ax[1].set_title("Residuals")
    fig.canvas.draw_idle()
    print("chi-squared value:  ")
    print(chiSquared(x,y, uncertainty, linear, [intercept, slope]).round(3))
interact(update, intercept=(-2, 12, .2), slope=(0, 5, .1));
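
Once the sliders have given a feel for the fit, the best straight line can also be computed directly. The sketch below is one possible approach under stated assumptions, not the tutorial's own method: it simulates data in the same spirit as the cells above with a fixed random seed, then applies the standard weighted least-squares formulas for a line with weights $1/\delta y_i^2$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Simulated data: 5 trials at each of 10 extensions, scattered around force = extension.
ext = np.linspace(0, 9, 10)
trials = ext[:, None] + rng.normal(0, 0.5, size=(10, 5))
y_mean = np.mean(trials, axis=1)
dy_mean = np.std(trials, axis=1) / np.sqrt(trials.shape[1])

# Weighted least squares for a straight line, with weights 1/dy**2.
w = 1 / dy_mean**2
S, Sx, Sy = np.sum(w), np.sum(w * ext), np.sum(w * y_mean)
Sxx, Sxy = np.sum(w * ext**2), np.sum(w * ext * y_mean)
delta = S * Sxx - Sx**2
fit_intercept = (Sxx * Sy - Sx * Sxy) / delta
fit_slope = (S * Sxy - Sx * Sy) / delta

# Score the result with the same normalization as chiSquared above.
fx = fit_intercept + fit_slope * ext
chi2_fit = np.sum(((fx - y_mean) / dy_mean) ** 2) / (len(ext) - 2)
print("intercept:", round(fit_intercept, 2), "slope:", round(fit_slope, 2),
      "chi-squared:", round(chi2_fit, 2))
```

The formulas give the slope and intercept that minimize the weighted sum of squared residuals, so no slider setting should produce a smaller chi-squared value for the same data.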



# Exercise 3.6: Deterministic chaos and the Feigenbaum Plot

In [14]:
r = 3
x = 0.5

def logisticMap(r, x):
    return r*x*(1-x)

r_list = []
x_list = []

for r in np.linspace(1, 4, 500):
    for i in range(10):
        x = logisticMap(r, x)
        r_list.append(r)
        x_list.append(x)

r_points = np.array(r_list)
x_points = np.array(x_list)
plt.figure(figsize=(4,4))
plt.xlabel('r')
plt.ylabel('x')
plt.scatter(r_points, x_points)
plt.show()

**a) ANSWER:** For a given value of $r$, a fixed point is where the $x$ vs $r$ graph looks like a function: where it passes the vertical line test, somewhere in the range $1\le r \le 3$. This guarantees that a value of $r$ always converges to a single value of $x$. A limit cycle is where the plot cycles between a few distinct values of $x$, somewhere in the range $3\le r\le 3.5$. A region of chaos is where values of $r$ correspond to a wide range of non-uniform $x$ values.

**b) ANSWER:** The edge of chaos for this graph appears to be at $r=3.5$, where it switches from a clear limit cycle to a chaotic dispersion of $x$ values for each $r\ge3.5$.
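
The plot above keeps only the first 10 iterates at each $r$ and carries `x` over from one $r$ to the next, so transients can blur the picture. One possible refinement, sketched below under assumed settings, is to discard a run of early iterates before recording any points. The names `attractor_samples`, `n_transient` and `n_keep`, and their default values, are hypothetical.

```python
import numpy as np

def logistic_map(r, x):
    """One step of the logistic map."""
    return r * x * (1 - x)

def attractor_samples(r, x0=0.5, n_transient=500, n_keep=50):
    """Iterate from x0, discard n_transient steps, then record n_keep values."""
    x = x0
    for _ in range(n_transient):
        x = logistic_map(r, x)
    samples = []
    for _ in range(n_keep):
        x = logistic_map(r, x)
        samples.append(x)
    return np.array(samples)

# Count the distinct values visited after the transient has died away.
fixed = attractor_samples(2.5)
cycle = attractor_samples(3.2)
n_distinct_fixed = len(np.unique(np.round(fixed, 6)))
n_distinct_cycle = len(np.unique(np.round(cycle, 6)))
print("r=2.5 distinct values:", n_distinct_fixed)
print("r=3.2 distinct values:", n_distinct_cycle)
```

Counting distinct values in this way gives a rough numerical test for the three regimes described in the answers: one value for a fixed point, a handful for a limit cycle, and many for chaos.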