# Python Tutorial

The typical workflow for data analysis in Python is based around Jupyter notebooks (like this one). Notebooks are divided into cells, which contain either Markdown-formatted text or Python code. The cells can be executed one by one, in any order. To execute a cell, select it and then either click Run or press Shift+Enter.

For example, you can do arithmetic:

In [None]:
2+4

Note that the output of the last line of the cell is printed below it. You can include multiple commands in a cell. You can also explicitly print the output.

In [None]:
x = 5
print(x)
y = 6
print(y)
x*y

## Basic Data Wrangling

Python is a general-purpose programming language. In order to do statistics we need to add several statistics-specific packages to the base language. We do this using the `import` command:

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns

Pandas is a library for working with tabular (CSV/spreadsheet) style data. Numpy adds a high-performance numerical array data type and associated mathematical operations. Seaborn is a library for making plots. All three are designed to work well together. It will be useful to have the documentation for each of these ready to hand:

- [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
- [Numpy](https://numpy.org/doc/stable/user/index.html)
- [Seaborn](http://seaborn.pydata.org/tutorial.html)

The `as` following the import is for convenience. We could simply have imported the packages, but importing Pandas as `pd` means we can write just `pd` instead of `pandas` whenever we need to use a Pandas command.

We'll start by importing the Cleveland Heart Disease dataset using Pandas. Make sure that the dataset is downloaded and in the same directory as this notebook (or modify this command to include the path to where you have saved it).

In [None]:
df = pd.read_csv(r"/Users/m298134/Downloads/data/processed.cleveland.data")

This creates a _dataframe_ object called `df` containing the contents of the file. Dataframes are arranged by rows and columns much like a spreadsheet. We can see the contents of `df` by executing a cell containing just the object name.

In [None]:
df

Note the occurrence of a question mark in the last row. The raw data contains some missing values indicated by `?`. Pandas does not know that `?` means missing, and interprets that as a literal data value. We can handle this by telling Pandas how NA is encoded.

In [None]:
df = pd.read_csv("/Users/m298134/Downloads/data/processed.cleveland.data", na_values="?")

If you rerun the cell containing just `df`, you should see the `?` replaced by `NaN`, which is how Pandas marks missing values.
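
The effect of `na_values` can be checked without the data file. The sketch below reads a made-up three-row CSV from a string; the column names and values are illustrative assumptions, not taken from the real dataset.

```python
import io
import pandas as pd

# Made-up three-row CSV standing in for the real file; "?" marks a missing value.
raw = "age,ca\n63,0\n67,3\n38,?\n"

as_text = pd.read_csv(io.StringIO(raw))                    # "?" kept as a literal value
as_missing = pd.read_csv(io.StringIO(raw), na_values="?")  # "?" treated as missing

print(as_text["ca"].dtype)             # a text dtype, because "?" is not a number
print(as_missing["ca"].dtype)          # a float dtype, because "?" became NaN
print(as_missing["ca"].isna().sum())   # number of missing values in the column
```

A column that unexpectedly has a text type is often the first sign that missing values were encoded in a way Pandas did not recognize.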

We can observe the data type of each column:

In [None]:
df.dtypes

Here `float` means a decimal number and `int` an integer. Notice that our categorical columns are all treated as numerical! We can explicitly convert them to categorical types.

In [None]:
df["sex"] = pd.Categorical(df["sex"])

In [None]:
df.dtypes

Try changing all of the categorical columns to categorical type. 
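
One possible approach is to loop over a list of column names. This is only a sketch on made-up rows: the list `categorical_columns` is hypothetical, and `cp` is assumed here to be the name of another categorical column, so adjust it to match your data.

```python
import pandas as pd

# Made-up rows; "cp" is an assumed column name used only for illustration.
toy = pd.DataFrame({"age": [63, 67, 41], "sex": [1, 1, 0], "cp": [1, 4, 2]})

# Hypothetical list: replace it with the columns that are categorical in your data.
categorical_columns = ["sex", "cp"]

for col in categorical_columns:
    toy[col] = toy[col].astype("category")  # another way to convert a column

print(toy.dtypes)
```

It is probably best to leave `num` as a number for now, because ordering comparisons such as `>` are not defined for unordered categorical columns.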

It's useful to add a column that indicates heart disease / no heart disease, rather than the `num` column which is divided into different types of diagnosis. Here's the syntax for accessing just one column:

In [None]:
df["num"]

We want a column indicating whether the subject has heart disease (any value greater than zero) or not (a value of zero).

In [None]:
df["num"] > 0

We can add this true/false list as a new column of the data frame called `hd`:

In [None]:
df["hd"] = df["num"] > 0
df
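
A quick way to sanity-check a new true/false column is to count its values. The sketch below uses made-up diagnosis codes rather than the real data.

```python
import pandas as pd

toy = pd.DataFrame({"num": [0, 2, 1, 0, 3]})  # made-up diagnosis codes
toy["hd"] = toy["num"] > 0

print(toy["hd"].value_counts())  # how many rows are True and how many are False
print(toy["hd"].mean())          # True counts as 1, so this is the fraction that is True
```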

We can compute basic statistics of any column in several different ways:

Using numpy:

In [None]:
np.mean(df["thalach"])

In [None]:
np.std(df["thalach"])

Using pandas:

In [None]:
df["thalach"].mean()

In [None]:
df["thalach"].std()
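
The two standard deviations above will likely differ slightly. By default, NumPy's `std` divides by $N$ (the population formula), while the Pandas `std` method divides by $N - 1$ (the sample formula). A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up numbers

pop_sd = np.std(values.to_numpy())             # NumPy default: divide by N (ddof=0)
sample_sd = values.std()                       # Pandas default: divide by N - 1 (ddof=1)
matched_sd = np.std(values.to_numpy(), ddof=1) # ask NumPy for the sample formula

print(pop_sd, sample_sd, matched_sd)
```

Passing `ddof=1` to NumPy (or `ddof=0` to Pandas) makes the two agree.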

We can create a new variable containing a single column:

In [None]:
y = df["thalach"]

Besides selecting columns, it is often useful to select only certain rows of data. If we want a particular row number, we can do:

In [None]:
df.iloc[2, :]

This gets the third row of data (Python starts counting at zero). The `:` means to select _all_ columns. If we only wanted one column, we could do:

In [None]:
df.iloc[2, 1]

We can also get a particular row and column based on the column _name_ rather than its number:

In [None]:
df.loc[2, "sex"]
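
In our dataframe the row labels happen to be the same as the row positions, so `iloc` and `loc` agree on the row. The difference shows up when the labels are something else. The sketch below uses made-up rows with the hypothetical labels 10, 20 and 30.

```python
import pandas as pd

# Made-up rows with row labels 10, 20, 30 instead of the default 0, 1, 2.
toy = pd.DataFrame({"age": [63, 67, 41], "sex": [1, 1, 0]}, index=[10, 20, 30])

by_position = toy.iloc[2, 1]    # third row, second column, counted from zero
by_label = toy.loc[30, "sex"]   # row labelled 30, column named "sex"

print(by_position, by_label)    # both refer to the same cell
```

Here `toy.loc[2, "sex"]` would raise a `KeyError`, because no row carries the label 2.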

Usually we want to select rows based on some value they contain, rather than their number. For example, we might want all of the male subjects in the data.

In [None]:
df["sex"] == 1

This gives us a boolean (true/false) list. We can use such a list to select a subset of the rows. We'll name the list `idx` (for index). Indexing `df` with this list then selects only the rows where the value is `True`.

In [None]:
idx = df["sex"] == 1
men = df[idx]
men

As a variation on this, try selecting only the rows containing subjects above the age of 60.
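
Conditions can also be combined. The sketch below, on made-up rows, selects subjects who are male and have heart disease, using `&` for "and".

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": [1, 0, 1, 1],
    "hd": [True, True, False, True],
})  # made-up rows

# Each comparison needs its own parentheses when combined with & or |.
idx = (toy["sex"] == 1) & toy["hd"]
men_with_hd = toy[idx]

print(men_with_hd)
```

Similarly, `|` means "or" and `~` negates a boolean list.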

## Plotting

We'll use Seaborn for most plots. Seaborn is a convenient interface built on top of Matplotlib. You can also use the matplotlib library directly, but it can be complicated for even simple graphics.

Most Seaborn plots take arguments for the `x` and `y` axes:

In [None]:
sns.scatterplot(x=df["age"], y=df["oldpeak"])

The axes can be either full columns of data, or names of columns. In the latter case we also have to tell Seaborn which dataframe contains the columns.

In [None]:
sns.scatterplot(x = "age", y = "oldpeak", data=df)

We can use categorical columns to change the color of the markers or the marker styles:

In [None]:
sns.scatterplot(x="age", y="oldpeak", hue="hd", data=df)

A histogram only has an `x` axis. If we only provide a column of data, Seaborn automatically uses that for the `x`-axis:

In [None]:
sns.histplot(df["age"])

## Programming Tools

Much of the power of using Python or R instead of a graphical or point-and-click interface for statistical analysis comes from being able to use all of the features of the programming language to automate parts of our analysis.

The two most important features in a programming language are the ability to repeat operations using loops and to execute commands conditionally. Simple loops use a counter which increments through a range of values. At each increment, the operations inside the loop are executed. 

In [None]:
for i in range(1, 10):
    print(i)

Note that Python loops start with the first value in `range` and end at one less than the last value. We can also provide just the ending value:

In [None]:
for i in range(10):
    print(i)

Python defaults to starting at zero. A range of 10 includes ten numbers, zero through nine.

We can put multiple commands inside a loop. We can also make the loop increment by an amount other than 1.

In [None]:
for i in range(5, 100, 20):
    x = i / 5
    print(x)
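
To see which values a `range` will produce without writing a loop, it can be converted to a list. This short sketch covers the three forms used above.

```python
start_stop = list(range(1, 10))     # start and stop
stop_only = list(range(10))         # stop only, so counting starts at zero
stepped = list(range(5, 100, 20))   # start, stop and step

print(start_stop)
print(stop_only)
print(stepped)
```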

A conditional uses the `if` statement. This starts with a logical condition. Code inside the `if` block is executed if and only if the logical statement is true.

In [None]:
x = 5
if x > 2:
    print("yes")
if x < 2:
    print("no")

The second statement is not printed because `x < 2` is false.

Loops and conditionals become powerful when we combine them. For example, suppose we want to print the numbers less than 20 which are divisible by 3. The operation `%` gives the remainder after division:

In [None]:
11 % 4

If `x % y` is zero, there is no remainder, so `x` is evenly divisible by `y`. Putting this in a conditional:

In [None]:
x = 11
y = 4

if x % y == 0:
    print("yes")

(Try changing the values of x and y).

Now we can put the conditional inside of a loop:

In [None]:
for i in range(20):
    if i % 3 == 0:
        print(i)

This iterates `i` through the range of numbers less than 20. For each number in that range, it tests the condition `i % 3 == 0`. It prints `i` only when that condition is true, which happens exactly for the numbers divisible by three.
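
Often we want to keep the results rather than just print them. As a sketch, the same loop can append each match to a list; the name `multiples` is just an illustrative choice.

```python
multiples = []                  # start with an empty list
for i in range(20):
    if i % 3 == 0:
        multiples.append(i)     # keep the value instead of printing it

print(multiples)

# The same list written as a one-line "list comprehension".
same_result = [i for i in range(20) if i % 3 == 0]
print(same_result == multiples)
```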

## Model fitting

We'll use two different packages for building statistical models in Python: Scikit-Learn and Statsmodels.

### Scikit-Learn

Start by building a simple linear regression model with scikit-learn.

In [None]:
from sklearn.linear_model import LinearRegression

Notice that instead of importing the whole package we only imported one class from it, the LinearRegression class.

The first thing to do is create an instance of this class, which we will call `model`. This essentially tells scikit-learn that we are building a model of the form $y = a + bx$, but does not provide any fitting data, just the abstract form. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) gives more details.

In [None]:
model = LinearRegression()

Some models will have more options we wish to specify. For example, here we could do `model = LinearRegression(fit_intercept=False)` to specify that the model has the form $y = bx$ (i.e. no intercept term $a$). In this case the defaults are what we want, however.

The next step is to provide data with which to fit the model. We'll create `X` and `y` variables.

In [None]:
y = df["thalach"]
X = df["age"]

Scikit-learn expects `X` to have the form of an $N \times p$ matrix, with $p = 1$ here since there's only one predictor column. Unfortunately, selecting a single column gives a one-dimensional object of shape $(N,)$ rather than an $N \times 1$ matrix. We need to manually reshape `X`.

In [None]:
X.shape

In [None]:
X = X.values.reshape(-1, 1)

In [None]:
X.shape
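
The shapes involved can be seen on a small made-up dataframe. The sketch also shows an alternative that should work as well: selecting with double brackets keeps the result two-dimensional.

```python
import pandas as pd

toy = pd.DataFrame({"age": [63, 67, 41]})  # made-up ages

flat = toy["age"]                             # single brackets: one-dimensional
reshaped = toy["age"].values.reshape(-1, 1)   # -1 lets NumPy infer the row count
two_dim = toy[["age"]]                        # double brackets: one-column dataframe

print(flat.shape, reshaped.shape, two_dim.shape)
```

In `reshape(-1, 1)`, the `1` fixes the number of columns and the `-1` tells NumPy to work out the number of rows from the data.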

Now that `X` has the right shape, fitting the model is simple.

In [None]:
model.fit(X=X, y=y)

It doesn't look like much happened, but now the model has calculated the values of the parameters $a$ and $b$. The value of $a$ is the intercept:

In [None]:
model.intercept_

The value of $b$ is the coefficient associated to `X`:

In [None]:
model.coef_[0]

Technically, `model.coef_` is an array of coefficients, one for each predictor in the model. We only have one predictor, so we are interested in the first coefficient, which is coefficient 0 in Python's indexing scheme. Notice the difference between printing `model.coef_` and `model.coef_[0]`.

We can also predict new values using the model. If we want to know the predicted maximum heart rates for subjects of ages 30, 40, and 50, we can create a new array of $x$ values and use the predict command.

In [None]:
X_new = np.array([[30], [40], [50]])
model.predict(X=X_new)

All of the brackets are again necessary so `X_new` has the right shape.
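
For a model of the form $y = a + bx$, a prediction is just that formula evaluated at the new $x$ values. The sketch below illustrates the arithmetic on made-up data. To keep it dependent only on NumPy, it uses `np.polyfit` as a stand-in for the scikit-learn fit, so it is not the same code path as `model.predict`.

```python
import numpy as np

# Made-up data lying exactly on the line y = 200 - x, so a = 200 and b = -1.
x = np.array([30.0, 40.0, 50.0, 60.0])
y = 200.0 - x

b, a = np.polyfit(x, y, deg=1)   # least-squares line: returns slope, then intercept

X_new = np.array([[30], [40], [50]])
predicted = a + b * X_new[:, 0]  # evaluate y = a + b*x at each new x value

print(predicted)
```

With the fitted scikit-learn model, the corresponding expression would be `model.intercept_ + model.coef_[0] * X_new[:, 0]`, which should agree with the output of `model.predict`.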

### Statsmodels

Now we'll repeat fitting the model, but with statsmodels. This has a more R-like syntax and output.

In [None]:
import statsmodels.formula.api as smf

Here we construct the model and provide the fitting data at the same time. We give the column names in the form of a formula, "y ~ x", in which the left-hand side is the response variable name and everything on the right-hand side is used as a predictor. We also have to state which dataframe contains the columns.

In [None]:
model = smf.ols("thalach ~ age", data=df)

The linear regression model in Statsmodels is called `ols` for Ordinary Least Squares.

Even though the model has data, it is not actually fitted until we tell Statsmodels to compute the fit. The result of fitting the model is a new object, which we'll call `res`.

In [None]:
res = model.fit()

To see all of the information about the fitted model, we can print a summary of `res`.

In [None]:
print(res.summary())

The model parameters are in there (under `coef`), as well as a bunch of other information that may or may not be useful. If we just want the parameters we can get those:

In [None]:
res.params