# Lecture 2 - Data Presentation

Heavily inspired by my colleague Anders S. Christensen. Check out his GitHub intros for yourself!

https://github.com/andersx/python-intro

Matplotlib examples are from the Matplotlib website.

## Section 1 - Linear Least Squares Regression

First, let's start with some data for a linear function:
\begin{equation}
y = 1.2x + \mathrm{noise}
\end{equation}

In [None]:
import numpy as np

np.random.seed(666)

# X-values
x = np.arange(0,20.0, 0.2)

# Y-values: Y = 1.2*X + random noise 
y = 1.2 * x + np.random.normal(scale=2.0, size=len(x))

print(x.shape)
print(y.shape)

#### Let's try to plot the data:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(x, y, marker='.', color='black', label='Training data')
plt.plot(x, x*1.2, color='red', label="Truth")
plt.xlabel('x', size=10, weight='bold')
plt.ylabel('y', size=10, weight='bold')
plt.grid(True)
plt.title('Linear Regression', size=12, weight='bold')
plt.legend()
plt.show()

#### We now want to fit a linear regression model to the data:

In [None]:
import scipy.optimize as sco

In [None]:
def func(x, m, b):
    y = m * x + b
    return y

In [None]:
# curve_fit returns the best-fit parameters and their covariance matrix
(m, b), cov = sco.curve_fit(func, x, y)
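
The second return value of `curve_fit` is the covariance matrix of the fitted parameters. As a minimal sketch (the names `popt`, `pcov` and `perr` are chosen here for illustration and are not part of the original notebook), the square roots of its diagonal give one-standard-deviation uncertainty estimates for the slope and intercept:

```python
import numpy as np
import scipy.optimize as sco


def func(x, m, b):
    """Straight line with slope m and intercept b."""
    return m * x + b


# Same synthetic data as above: y = 1.2*x + Gaussian noise
np.random.seed(666)
x = np.arange(0, 20.0, 0.2)
y = 1.2 * x + np.random.normal(scale=2.0, size=len(x))

# popt: best-fit parameters, pcov: 2x2 covariance matrix of (m, b)
popt, pcov = sco.curve_fit(func, x, y)

# One-standard-deviation uncertainties from the diagonal of pcov
perr = np.sqrt(np.diag(pcov))

print(f"m = {popt[0]:.3f} +/- {perr[0]:.3f}")
print(f"b = {popt[1]:.3f} +/- {perr[1]:.3f}")
```

If the fit is reasonable, the true slope of 1.2 should lie within a few of these standard deviations of the fitted `m`.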

In [None]:
residuals = y - func(x, m, b)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - np.mean(y))**2)
R2 = 1 - (ss_res / ss_tot)

In [None]:
plt.scatter(x, y, marker='.', color='black', label='Training data')
plt.plot(x, func(x, m, b), color='red', label=f'fit: R2={R2:.2f} m={m:.4f} b={b:.4f}')
plt.xlabel('x', size=10, weight='bold')
plt.ylabel('y', size=10, weight='bold')
plt.grid(True)
plt.title('Linear Regression', size=12, weight='bold')
plt.legend()
plt.show()

### What does linear regression do?

Linear regression approximates the data with a linear function of the features:

\begin{equation}
y(\mathbf{x}) = x_1 \alpha_1 + x_2 \alpha_2 + \dots + x_n \alpha_n
\end{equation}
This can conveniently be written in vector notation:
\begin{equation}
y(\mathbf{x}) = \mathbf{x} \cdot \mathbf{\alpha}
\end{equation}

Here $\mathbf{x}$ is the feature vector (also called descriptor or representation) of a given data point, and $\mathbf{\alpha}$ is the vector of regression coefficients.

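"Fitting" means finding the best set of $\alpha$-values. For all data points at once, with one feature vector per row of the matrix $\mathbf{X}$, the model reads:

\begin{equation}
\mathbf{y} = \mathbf{X}\mathbf{\alpha}
\end{equation}

The best $\mathbf{\alpha}$ is the "least squares" solution, the one that minimizes the squared error with respect to the reference values $\mathbf{y}^\text{ref}$:
\begin{equation}
\mathbf{\alpha} = \underset{\mathbf{\alpha}}{\text{arg min}} \, || \mathbf{y}^\text{ref} - \mathbf{X}\mathbf{\alpha}||^2
\end{equation}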
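
The minimization above can be sketched directly with NumPy. This is an illustration under stated assumptions, not the method `curve_fit` uses internally: the design matrix `X` is built with a column of ones so that the intercept becomes one of the $\alpha$-values, and the names `X`, `y_ref`, `alpha` and `alpha_ne` are chosen here for the example. When $\mathbf{X}$ has full column rank, the minimizer also satisfies the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{\alpha} = \mathbf{X}^T\mathbf{y}^\text{ref}$, which the second solve checks.

```python
import numpy as np

# Same synthetic data as in Section 1
np.random.seed(666)
x = np.arange(0, 20.0, 0.2)
y_ref = 1.2 * x + np.random.normal(scale=2.0, size=len(x))

# Design matrix: one row per data point, columns are the features [x, 1].
# The column of ones lets alpha[1] act as the intercept b.
X = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: minimizes ||y_ref - X @ alpha||^2
alpha, residual_ss, rank, singular_values = np.linalg.lstsq(X, y_ref, rcond=None)

# Equivalent solution from the normal equations (X^T X) alpha = X^T y_ref
alpha_ne = np.linalg.solve(X.T @ X, X.T @ y_ref)

print("lstsq:            m, b =", alpha)
print("normal equations: m, b =", alpha_ne)
```

Both routes should give the same slope and intercept as the `curve_fit` call above, up to numerical precision.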

## Section 2 - Non-linear regression?

How would we fit a sine function?

In [None]:
import numpy as np
import matplotlib.pyplot as plt


x = np.arange(0,6.6, 0.6)
y = np.sin(x) #+ (np.random.random(size=len(x)) - 0.5) * 0.5

print(x.shape)
print(y.shape)

xplot = np.arange(0,6.6, 0.01)

plt.scatter(x,y, label="Training")
plt.plot(xplot, np.sin(xplot), color="g", label="sin(x)")

plt.legend()
plt.show()

#### This is a job for machine learning! There is a very cool example (machine_learning_example_sinx.ipynb) at https://github.com/andersx/python-intro
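
For comparison, here is a minimal sketch of a simpler route that is not the machine-learning approach of the linked notebook: if we assume the functional form is already known, `scipy.optimize.curve_fit` can fit a non-linear model directly. The model `sine_model`, its parameters `a`, `w`, `phi` and the starting guess `p0` are assumptions made for this illustration.

```python
import numpy as np
import scipy.optimize as sco


def sine_model(x, a, w, phi):
    """Hypothetical model: amplitude a, angular frequency w, phase phi."""
    return a * np.sin(w * x + phi)


# Same training points as above
x = np.arange(0, 6.6, 0.6)
y = np.sin(x)

# Non-linear fits need a starting guess; this one is deliberately a bit off
p0 = [0.8, 1.1, 0.1]
params, _ = sco.curve_fit(sine_model, x, y, p0=p0)

# Compare the fitted curve with sin(x) on a fine grid
xplot = np.arange(0, 6.6, 0.01)
max_err = np.max(np.abs(sine_model(xplot, *params) - np.sin(xplot)))

print("fitted a, w, phi:", params)
print("max deviation from sin(x):", max_err)
```

This only works because the model family already contains the true function. When the functional form is unknown, a more flexible method such as the one in the linked notebook is needed.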

## Section 3 - Plotting types

Can everything be nicely plotted as a linear function? 

What about multivariable data sets?

What should I choose?

## Histograms

Example taken from the Matplotlib documentation.

In [None]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

num_bins = 50

fig, ax = plt.subplots()

# the histogram of the data
n, bins, patches = ax.hist(x, num_bins, density=True)

# add a 'best fit' line
y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
     np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
ax.plot(bins, y, '--')
ax.set_xlabel('Smarts')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')

# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()
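
The `density` option is what makes the histogram comparable to the probability density curve. As a small sketch using `np.histogram`, which follows the same convention, each count is divided by the total number of samples times the bin width, so the bar areas sum to 1. The names `counts`, `edges`, `widths` and `manual` are chosen here for illustration.

```python
import numpy as np

# Same example data as above
np.random.seed(19680801)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(437)

# Density-normalized histogram: bar heights are probability densities
counts, edges = np.histogram(x, bins=50, density=True)
widths = np.diff(edges)

# Total area of the bars (height * width) is 1
area = np.sum(counts * widths)
print("area under histogram:", area)

# The same normalization done by hand from the raw counts
raw_counts, _ = np.histogram(x, bins=50)
manual = raw_counts / (raw_counts.sum() * widths)
print("matches density=True:", np.allclose(counts, manual))
```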

## Boxplots

Example taken from the Matplotlib documentation.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

# Fixing random state for reproducibility
np.random.seed(19680801)


# generate some random test data
all_data = [np.random.normal(0, std, 100) for std in range(6, 10)]

# plot violin plot
axes[0].violinplot(all_data,
                   showmeans=False,
                   showmedians=True)
axes[0].set_title('Violin plot')

# plot box plot
axes[1].boxplot(all_data)
axes[1].set_title('Box plot')

# adding horizontal grid lines
for ax in axes:
    ax.yaxis.grid(True)
    ax.set_xticks([y + 1 for y in range(len(all_data))])
    ax.set_xlabel('Four separate samples')
    ax.set_ylabel('Observed values')

# add x-tick labels
plt.setp(axes, xticks=[y + 1 for y in range(len(all_data))],
         xticklabels=['x1', 'x2', 'x3', 'x4'])
plt.show()
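
To read a box plot it helps to know what the elements represent. The sketch below computes them by hand for a single sample, assuming Matplotlib's default settings (whiskers at 1.5 times the interquartile range); the variable names are chosen here for illustration.

```python
import numpy as np

# One sample, drawn like the first data set above
np.random.seed(19680801)
sample = np.random.normal(0, 6, 100)

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Whiskers reach the most extreme data points within 1.5*IQR of the box
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = sample[(sample >= low_fence) & (sample <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the fences are drawn individually as outliers ("fliers")
outliers = sample[(sample < low_fence) | (sample > high_fence)]

print(f"box: [{q1:.2f}, {q3:.2f}], median: {median:.2f}")
print(f"whiskers: [{whisker_low:.2f}, {whisker_high:.2f}]")
print("number of outliers:", len(outliers))
```

The violin plot shows the same data as a smoothed density estimate, which reveals the shape of the distribution that the box plot summarizes in five numbers.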