# Getting started with data analysis in Python

Bartosz Telenczuk, 2021
https://github.com/btel

This work is marked with CC0 1.0. To view a copy of this license, visit http://creativecommons.org/publicdomain/zero/1.0


Some of the examples were taken from ["Plotting and Programming in Python"](http://swcarpentry.github.io/python-novice-gapminder/index.html) by [The Carpentries](https://carpentries.org/), licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## Intro to Python for (data) scientists

### JupyterLab

* starting
* creating a Jupyter notebook
* keyboard shortcuts: `Enter` (enter edit mode), `Shift-Enter` (run cell), `Esc` (enter command mode), `M` (markdown, in command mode), `X` (remove cell, in command mode)

### Variables

* defining strings and integers
* variables stay defined even if you remove a cell
* indexing with integers and slices
* zero-based indexing!
* type

In [None]:
first_name = 'Adam'
age = 100
print(first_name, 'is', age, 'years old')

In [None]:
first_name[0]

In [None]:
first_name[1:3]

In [None]:
type(first_name)

In [None]:
type(age)
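
The indexing rules listed above can be tried on any string. The sketch below uses a made-up value (`word` is hypothetical, not part of the course data) to show zero-based indexing, a slice and a negative index side by side.

```python
# Hypothetical example value, chosen only for illustration.
word = 'Python'

first = word[0]      # indexing starts at zero, so this is the first character
middle = word[1:3]   # slice from index 1 up to, but not including, index 3
last = word[-1]      # negative indices count from the end of the string

print(first, middle, last)
print(type(word), type(len(word)))
```

Note that a slice never includes its end index, so `word[1:3]` has length 2.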

**Exercise** Test the following operations in your notebook. What output do they produce? What is the type of each result?

```python
first_name = 'Adam'
age = 100

variable_1 = 'hello' + first_name
variable_2 = age + 1
variable_3 = 5.1
variable_4 = first_name + 1
```

### Built-in functions, methods and help

* built-in functions
* positional arguments
* string methods
* official Python docs: https://docs.python.org/3/
* types have methods

In [None]:
max(1, 5, -2)

In [None]:
help(max)

In [None]:
max(first_name)

In [None]:
first_name.upper()
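
As a rough illustration of the difference between built-in functions and methods, the sketch below calls a few of each. The values are invented for the example; `round` and `str.replace` are standard Python but are not used elsewhere in this notebook.

```python
numbers = [3, 10, -4]                 # hypothetical values

largest = max(numbers)                # built-in function with one positional argument
rounded = round(3.14159, 2)           # two positional arguments: the value, then the digits
shouted = 'adam'.upper()              # method: called on the string itself
replaced = 'adam'.replace('a', 'A')   # methods can take positional arguments too

print(largest, rounded, shouted, replaced)
```

The order of positional arguments matters: `round(2, 3.14159)` would raise an error, because the second argument has to be an integer.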

#### Exercise (comparing strings)

What will the following program show:

```python
rich = "gold"
poor = "tin"
print(max(rich, poor))
```

## Data analysis with pandas

### Working with data

* opening files in JupyterLab
* importing extra function libraries (pandas)
* importing csv data with read_csv
* keyword arguments
* showing dataframe

In [None]:
import pandas as pd

In [None]:
# df = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", delimiter='\t')
df = pd.read_csv("diabetes.tab.txt", delimiter='\t')

In [None]:
df

Try `pd.read_<Tab>` to find readers for other formats (or look them up in the pandas docs).
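
To experiment with `read_csv` and its keyword arguments without a file on disk, one option is to read from a string. This is a minimal sketch: the text in `raw` is invented, and `io.StringIO` is only a stand-in for a real file.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk.
raw = "AGE\tBMI\n59\t32.1\n48\t21.6\n72\t30.5\n"

# `delimiter` is passed by keyword; io.StringIO lets read_csv treat the string as a file.
small = pd.read_csv(io.StringIO(raw), delimiter='\t')

print(small.shape)
print(small.head(2))
```

`sep='\t'` is an equivalent spelling of the same keyword argument.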

### Plotting

* line and dot plots
* histograms
* scatter plots

In [None]:
df.plot()

In [None]:
df.plot('S1', 'S2')

In [None]:
df.plot('S1', 'S2', style='.')

In [None]:
df.plot(kind='hist')
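
The cells above cover line, dot and histogram plots. For scatter plots, pandas also accepts `kind='scatter'` with explicit `x` and `y` columns. The sketch below assumes a small made-up data frame (`points` and its column names are hypothetical).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; drop this line in a notebook

import pandas as pd

# Hypothetical data, not taken from the diabetes dataset.
points = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2.0, 4.1, 5.9, 8.2]})

# kind='scatter' needs both x and y given as column names.
ax = points.plot(kind='scatter', x='x', y='y')
ax.set_title('scatter sketch')
```

The call returns a matplotlib axes object, which can be used to adjust titles and labels afterwards.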

#### Exercise (plotting styles)

Plot the relation between age and BMI using different plotting styles (such as 'o', ':', '-.', 'ro', 'bo').

### Indexing a data frame

* extract column
* iloc vs loc
* dataframe index
* two-dimensional indexing
* using empty slice


In [None]:
df['AGE'].plot(kind='hist')

In [None]:
stats = df.describe()
stats

In [None]:
stats.iloc[1]

In [None]:
mean_ = stats.loc['mean']
std_ = stats.loc['std']

In [None]:
stats.loc['mean', 'S1']

In [None]:
stats.loc[:, 'S1']
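
The difference between `iloc` (integer position) and `loc` (index label) is easier to see on a tiny table. The sketch below uses a hypothetical data frame with a string index; none of the names come from the diabetes data.

```python
import pandas as pd

# Hypothetical data with row labels instead of the default integer index.
table = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]},
    index=['low', 'mid', 'high'],
)

by_position = table.iloc[0]      # first row, selected by integer position
by_label = table.loc['low']      # the same row, selected by its index label
cell = table.loc['mid', 'B']     # two-dimensional indexing: row label, column label
column = table.loc[:, 'A']       # an empty slice keeps every row

print(by_position)
print(cell)
print(column)
```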

#### Exercise (automatic alignment)

Normalize all variables in the data frame (subtract mean and divide by standard deviation)

## Linear regression with sklearn

* split data into train/test set
* plotting with matplotlib
* fitting scikit learn linear regression on train set
* predicting on test set

In [None]:
df.plot(x='S1', y='S2', style='.')

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.loc[:, ['AGE', 'BMI', 'S1']]
y = df.loc[:, 'S2']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
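
`train_test_split` shuffles the rows at random, so each run of the cell above gives a different split. If a repeatable split is wanted, one option is the `random_state` argument. The sketch below is an illustration on made-up arrays (`X_demo`, `y_demo` are hypothetical) and is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# A fixed random_state makes the shuffle reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```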

In [None]:
import matplotlib.pyplot as plt
plt.plot(X_test.loc[:, 'S1'], y_test, '.')

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
lr.coef_

**Question** Why do we have 3 different coefficients?

In [None]:
y_pred = lr.predict(X_test)

In [None]:
plt.plot(X_test.loc[:, 'S1'], y_test, '.')
plt.plot(X_test.loc[:, 'S1'], y_pred, 'r.')
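
Besides plotting, a quick numeric check of the predictions can be useful. The sketch below repeats the split, fit and predict steps on synthetic data where the true relation is known. The data, the variable names and the use of `score` (which returns the coefficient of determination R²) are assumptions made for illustration and do not appear in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relation plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

# Same steps as above: split, fit on the training set, predict on the test set.
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_a, y_a)
y_hat = model.predict(X_b)

# R² on the held-out data; values close to 1 indicate a good fit.
r2 = model.score(X_b, y_b)
print(model.coef_, r2)
```

Because the noise is small here, the fitted coefficients should land close to the values used to generate the data.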