# Regression

## Diabetes

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Over time, having too much glucose in your blood can cause health problems, such as heart disease, nerve damage, eye problems, and kidney disease. You can take steps to prevent diabetes or manage it. 

As of 2014, 29.1 million people in the United States, or 9.3 percent of the population, had diabetes. One in four people with diabetes did not know they had the disease. An estimated 86 million Americans aged 20 years or older had prediabetes.

Reference: https://www.niddk.nih.gov/health-information/diabetes

Risk factors: http://www.mayoclinic.org/diseases-conditions/diabetes/basics/risk-factors/con-20033091

# Diabetes Dataset

Reference: http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

*Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline; where* $BMI = kg/m^2$

## Python Quick Review

If you are less familiar with Python, you might want to review the materials in the [DeCART Boot Camp, Part 2: Introduction to Python](https://github.com/UUDeCART/decart_bootcamp_part2), in particular our [Python crash course](https://github.com/UUDeCART/decart_bootcamp_part2/blob/master/modules/module1/python_crash_course.ipynb).

## [Pandas](http://pandas.pydata.org/) Review

Pandas is a Python package for working with tabular data that was developed in the finance community. Pandas will be our main framework for working with data, and standard Python packages for machine learning, like [scikit-learn](http://scikit-learn.org/stable/), work directly with Pandas DataFrames and Series.

### [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)



### Functions in Python

A Python function takes some input (possibly nothing), manipulates data (possibly by doing nothing), and returns some output (possibly nothing, denoted by the special Python value `None`).

In [None]:
def myfunction():
    P=[1,2,'3']
    return P
myfunction()

### Analysis of `myfunction`

* What goes in?
    * Nothing
* What data manipulation does `myfunction` do?
    * Nothing
* What comes out?
    * A Python `list` containing the integers `1` and `2` and the string `"3"`
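
For contrast, here is a minimal sketch of two functions that do take input. The names `double_all` and `say_hello` are made up for this illustration: the first manipulates its input and returns a new list, and the second prints something but returns `None` because it has no `return` statement.

```python
def double_all(values):
    """Return a new list with every element of `values` doubled."""
    return [2 * v for v in values]


def say_hello(name):
    """Print a greeting; with no return statement, the result is None."""
    print("Hello,", name)


doubled = double_all([1, 2, 3])   # input: a list; output: a new list
nothing = say_hello("DeCART")     # input: a string; output: None
print(doubled, nothing)
```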


In [None]:
L1=1
P1=['1','b']
a1='1.5'
_,L2,P2=L1,P1,a1
print(L2,P2)

### Notes

* By convention, `_` is a throwaway variable in Python
* `print` is an example of a function that takes some input and produces as output a string printed to standard output (the screen)

### Construct a Pandas DataFrame

In [None]:
import pandas as pd
L=pd.DataFrame({'a':[4,5,6,6],'b':[7,8,12.8,9],'c':[10,11,12,10]},
  index=[1,2,3,4],
  columns=['a','b','c'])
print(L)

### If we let the Notebook evaluate a Pandas DataFrame (e.g. `L`), it will provide a nice HTML table

In [None]:
L

### Pandas DataFrame Object
 
Python is primarily an [object oriented programming language](https://en.wikipedia.org/wiki/Object-oriented_programming). By contrast, R leans more toward [functional programming](https://en.wikipedia.org/wiki/Functional_programming).

In an object oriented programming language, objects have attributes and **methods**. Methods are special functions that operate on the attributes of the object.

In our example object, `L`, the attributes are the data values (contained in the columns `a`, `b`, and `c` and rows `1`, `2`, `3`, and `4`).
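
A short sketch of the distinction, rebuilding the same `L` so that it runs on its own: attributes such as `shape` and `columns` are read without parentheses, while methods such as `mean()` are called and compute something from the data.

```python
import pandas as pd

# Same small DataFrame as in the cell above
L = pd.DataFrame({'a': [4, 5, 6, 6], 'b': [7, 8, 12.8, 9], 'c': [10, 11, 12, 10]},
                 index=[1, 2, 3, 4])

# Attributes are looked up without parentheses
print(L.shape)          # (number of rows, number of columns)
print(list(L.columns))  # column labels

# Methods are called with parentheses and operate on the object's data
print(L.mean())         # mean of each column
```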

If we want to learn what methods `L` has, we can use the `help` function.

In [None]:
help(L)

### Example DataFrame and Series methods

We will look at the following methods that are useful for summarizing our data

* [`describe()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
    * A method for either DataFrames or Series
* [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
    * A method for Series
* [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)

In [None]:
L.describe()

In [None]:
L['a'].value_counts()

In [None]:
L.groupby("a").boxplot()

In [None]:
L.groupby(['a','c']).mean()

In [None]:
import numpy as np
a = np.arange(6)
print(a)
a=a.reshape((3, 2))
print(a)

###  [DS package](./DS.py)

I have created a Python package (file) containing functions that we will use. In order to access the functions in our notebook, we need to `import` it.


In [None]:
import DS

### Reloading a Package

If we change the DS.py file, we will not see those changes in this notebook unless we restart the notebook or `reload` the package. To reload the package, paste the code below into a code cell and evaluate it.

```Python
import importlib
importlib.reload(DS)
```

In [None]:
import importlib
importlib.reload(DS)

In [None]:
help(DS.main)

In [None]:
L1,L2,L3=DS.main()

### What is our Output

* `L1`: our original data
* `L2`: the data we will use to predict with
* `L3`: what we are going to predict

### Here are some functions to explore our Data

They take as arguments the index for the data set we want to examine:

* `0` for diabetes
* `1` for Parkinson's

Change the value of `which_data` to see the two data sets.

In [None]:
which_data = 0
x,_ = DS.numDes(which_data), DS.meanSDDes(which_data)

## [Take a Quiz](https://goo.gl/forms/iWUy8u6fsTvS9Nbz2)

### `x` is a dictionary of DataFrames and Series containing the results of a `describe` method call on our predictors and targets

In [None]:
x.keys()

In [None]:
x["myTrain"]

In [None]:
x["myVal"]

In [None]:
_ = pd.plotting.scatter_matrix(L1[['AGE', 'SEX', 'BMI', 'S1']], alpha=0.2, figsize=(10,10), diagonal='kde')

### [Feature Scaling](https://en.wikipedia.org/wiki/Feature_scaling)

Because our features have different natural scales, we want to transform them so that they have similar scales.

`dataScaling` reads in a data set and performs a [`minmax_scale`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) transformation on each numeric column of the predictors.

```Python
def dataScaling(index=0,taskID='filesReg'):
    data,myTrain,myVal=load_data(index,taskID)
    for name in myTrain.columns:
        if (not(myTrain[name].dtype=='O')):
            myTrain[name]=pre.minmax_scale(myTrain[name].astype('float'))             
    return data,myTrain,myVal
```
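
To see what `minmax_scale` does to a single column, here is a small sketch using made-up values (the `age_demo` array is hypothetical, not taken from our data). Min-max scaling maps each value $x$ to $(x - x_{min}) / (x_{max} - x_{min})$, so the smallest value becomes 0 and the largest becomes 1.

```python
import numpy as np
from sklearn import preprocessing as pre

# Hypothetical feature values, for illustration only
age_demo = np.array([20.0, 35.0, 50.0, 80.0])

# Min-max scaling by hand: subtract the minimum, divide by the range
by_hand = (age_demo - age_demo.min()) / (age_demo.max() - age_demo.min())

# The same transformation using scikit-learn
by_sklearn = pre.minmax_scale(age_demo)

print(by_hand)
print(by_sklearn)
```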


In [None]:
_,L6,_=DS.dataScaling(0)

In [None]:
L6.head()

In [None]:
pd.__version__

In [None]:
L6.columns

In [None]:
_ = pd.plotting.scatter_matrix(L6[['AGE', 'SEX', 'BMI', 'S1']], alpha=0.2, figsize=(6, 6), diagonal='kde')

### Or we could use [Seaborn](https://seaborn.pydata.org/index.html)

In [None]:
import seaborn as sns
sns.pairplot(L6[['AGE', 'SEX', 'BMI', 'S1']])

In [None]:
sns.pairplot(L1[['AGE', 'SEX', 'BMI', 'S1']])

### We could do this more directly with Pandas

In [None]:
diabetes_data, diabetes_predictors, diabetes_targets = DS.load_data(0, "filesReg")

In [None]:
from sklearn import preprocessing as pre
pd.DataFrame(pre.minmax_scale(diabetes_predictors), 
             index=diabetes_predictors.index, 
             columns=diabetes_predictors.columns).head()

**P-values using ANOVA or f-regression**

In [None]:
L7=DS.uniFeatureReg(0)
L7
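
Independently of how `DS.uniFeatureReg` is implemented, the idea of a univariate F-test can be sketched with scikit-learn's `f_regression`. The synthetic data below is an assumption made for illustration: one feature drives the target and the other is pure noise, so we would expect a very small p-value for the first and a much larger one for the second.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
n_samples = 200

# Synthetic predictors: one informative, one unrelated to the target
x_informative = rng.normal(size=n_samples)
x_noise = rng.normal(size=n_samples)
X_demo = np.column_stack([x_informative, x_noise])

# The target depends only on the informative feature, plus some noise
y_demo = 3.0 * x_informative + rng.normal(scale=0.5, size=n_samples)

# One F-statistic and one p-value per feature, each tested on its own
f_scores, p_values = f_regression(X_demo, y_demo)
print(p_values)
```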

## Linear Regression

In linear regression we want to find a straight line that optimally fits our **continuous** labeled data.

### There are several things we need to consider here

#### A straight line: Is our Data Linear? Is the world Linear?

#### What do we mean by "optimally"?


### Linearity
#### Is the function below linear?

In [None]:
from sympy import symbols
from sympy.plotting import plot, plot3d, plot3d_parametric_line
x = symbols('x')
plot(x, (x,-5,5))

#### Is this function linear?

In [None]:
plot(x**2, (x,-5,5))

#### What about this?

In [None]:
p1 = plot(x**2, (x,2,3))

### Optimality 

Optimal means that our model is the closest to our data, but we have to decide on what we mean by close.

#### Norms

\begin{equation}
\|x\|_p = \left(|x_1|^p + |x_2|^p+\cdots+|x_n|^p\right)^{\frac{1}{p}}
\end{equation}
* $L^2$
* $L^1$
* $L^\infty$
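
As a quick numerical sketch of these three norms, take the vector $(3, -4)$. The helper `p_norm` is a name made up for this illustration; it implements the formula above directly, and NumPy's `np.linalg.norm` is shown alongside it for comparison. The $L^\infty$ norm is the limit as $p$ grows, which is the largest absolute component.

```python
import numpy as np

v = np.array([3.0, -4.0])


def p_norm(vec, p):
    """Direct implementation of the p-norm formula for finite p."""
    return np.sum(np.abs(vec) ** p) ** (1.0 / p)


l1 = p_norm(v, 1)          # |3| + |-4|
l2 = p_norm(v, 2)          # sqrt(3**2 + 4**2)
linf = np.max(np.abs(v))   # largest absolute component

print(l1, l2, linf)
print(np.linalg.norm(v, 1), np.linalg.norm(v, 2), np.linalg.norm(v, np.inf))
```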

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as ra

# Sample x on [2, 3) and add Gaussian noise to y = x**2
xs = np.arange(2,3,0.1)
ys_ = xs**2+ra.normal(0,1, size=xs.shape)
fig1, ax1 = plt.subplots(1)
ax1.plot(xs, xs**2)  # noise-free curve
ax1.scatter(xs, ys_, marker='+', color='r')  # noisy observations

## Linear Regression

If we remember from high school, we can describe a straight line in the plane as 

\begin{equation}
y=mx+b
\end{equation}

So for example, with $m=0.5$ and $b=-2$ we would have the following line:

In [None]:
plot(x*0.5-2,(x,-5,5), ylim=(-5,5), axis=True, axis_center=(0,0))

### We can draw a line in 3D (depending on $x$ and $y$) $\cdots$

### Or 4D (depending on $x$, $y$, and $z$)

Of course this can extend on to arbitrary dimension, but our $x$, $y$, $z$ labeling for variables becomes problematic and we would typically do something like the following for our variable names

$x_1$ (x), $x_2$ (y), $x_3$ (z), $x_4$, $\cdots$, $x_n$

The mathematical expression of a linear regression in $n$ dimensions can be written as

\begin{equation}
\hat{y}= \Theta_0 + \Theta_1x_1 + \Theta_2x_2+\cdots+\Theta_nx_n
\end{equation}

$\Theta$ is the capital Greek letter ["theta"](https://en.wiktionary.org/wiki/theta#Pronunciation)

* $\hat{y}$ is the predicted value from our model
* $n$ is the number of features (independent variables) we have
* $x_i$ is the $i^{th}$ feature (variable)
* $\Theta_0$ corresponds to $b$ in our simple line
* $\Theta_1$ corresponds to $m$ in our simple line
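
To make the formula concrete, here is a small sketch that evaluates $\hat{y}$ for two observations with $n=3$ features. All numbers are made up for illustration; `theta_0` plays the role of $\Theta_0$ and the array `theta` holds $\Theta_1, \ldots, \Theta_n$.

```python
import numpy as np

theta_0 = -2.0                       # intercept (b in the simple line)
theta = np.array([0.5, 1.0, -3.0])   # one coefficient per feature

# Two observations (rows), three features each (columns)
X_rows = np.array([[2.0, 1.0, 0.0],
                   [4.0, 0.0, 1.0]])

# y_hat = theta_0 + theta_1*x_1 + theta_2*x_2 + theta_3*x_3, for each row
y_hat = theta_0 + X_rows @ theta
print(y_hat)
```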

##  Linear Regression 

<img src="../images/Linearregression.png" height= 55% width=55% style="right;">


<sub>
    Figure Reference: the first recommended book
</sub>

The functions `runDifferentRegressionTrTeDataSet()` and `plotLearningCurvesRegression()` used below are defined in [DS.py](DS.py).

**From the figure below, is this a good model (predicted vs. real)?**

In [None]:
L8=DS.runDifferentRegressionTrTeDataSet([0],[DS.Regs[0]],[0])

In [None]:
L8['diabetes'][2],L8['diabetes'][3], L8['diabetes'][4]

**From the figure below, describe the trends of the errors for the validation and training data sets**

In [None]:
L9=DS.plotLearningCurvesRegression([0],[DS.Regs[0]],[0])

## Example

Does BMI predict BP?

Following the scikit-learn [example here](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py).

**Note:** We had to address a [strange issue](https://stackoverflow.com/questions/38058486/fit-error-found-arrays-with-inconsistent-numbers-of-samples) with one predictor variable: scikit-learn expects a 2-D array of predictors, even when there is only one.

#### We can use scikit-learn to split our data into training/test sets

In [None]:
diabetes_predictors["BMI"].shape, diabetes_predictors["BP"].shape

In [None]:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split

lr = linear_model.LinearRegression()

# Reshape the single BMI column into a 2-D array, as scikit-learn expects
X = diabetes_predictors["BMI"].values.reshape(-1,1)

y = diabetes_predictors["BP"]

X_train, X_test, y_train, y_test = train_test_split(X, y)
lr.fit(X_train, y_train)
print(lr.coef_)

plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, lr.predict(X_test), color='blue',
         linewidth=3)
plt.xlabel("BMI")
plt.ylabel("BP")
plt.xticks(())
plt.yticks(())

plt.show()

### How optimal is our model?

In [None]:
print("Mean squared error: %.2f"
      % np.mean((lr.predict(X_test) - y_test) ** 2))

### What does this number mean?
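
One way to read it: the mean squared error is the average of the squared differences between predictions and actual values, so it is in squared units of the target. Its square root (RMSE) is back in the target's own units. The sketch below works this out by hand on three made-up values, not on our diabetes data.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true_demo = np.array([90.0, 100.0, 110.0])
y_pred_demo = np.array([94.0, 100.0, 107.0])

errors = y_pred_demo - y_true_demo   # 4, 0, -3
mse = np.mean(errors ** 2)           # (16 + 0 + 9) / 3
rmse = np.sqrt(mse)                  # same units as the target

print("MSE: %.2f, RMSE: %.2f" % (mse, rmse))
```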

## Using [statsmodels](http://www.statsmodels.org/stable/index.html)

In [None]:
import statsmodels.formula.api as smf
mod = smf.ols(formula='BP ~ BMI', data=diabetes_predictors)
res = mod.fit()
print(res.summary())

## I Can Easily Make a Higher Dimensional Model

#### I don't know the meaning of any of these S\* variables!

In [None]:
mod = smf.ols(formula='BP ~ BMI + S1 + S2', data=diabetes_predictors)
res = mod.fit()
print(res.summary())

## Exercise

Use statsmodels to make the **best** model you can.

### statsmodels is giving us warnings. Could we have anticipated these?

# Exercise


## Parkinson’s Disease

Parkinson's disease (PD) is a chronic and progressive movement disorder, meaning that symptoms continue and worsen over time. Nearly one million people in the US are living with Parkinson's disease. The cause is unknown, and although there is presently no cure, there are treatment options such as medication and surgery to manage its symptoms.

Data set information, paper and the related links: https://archive.ics.uci.edu/ml/datasets/parkinsons

### Questions [Discuss the pseudocode (steps)]

1. Use data set index `1` instead of `0` to load the Parkinson's data, then describe it, scale it, and calculate the related Mean/SD and p-values.
2. Build a linear regression model and output the above results.
3. Plot the graphs of actual-prediction and learning curves.
4. Compare the results for the two data sets and summarize your conclusions about model performance on each.
