## Intro


This tutorial is not a step-by-step introduction to Python programming. For that, check out the Software Carpentry lessons (https://swcarpentry.github.io/python-novice-inflammation/) and join one of their workshops near you. Still, keep in mind:


* Python is meant to resemble a human language
* you can go a long way without knowing the details of Python syntax
* just load some data and start playing with it
* you can use Python to replace your Excel sheets or statistical package, or to create simple graphs to share with colleagues
* you can also use Python as a programmable calculator
* we will focus on data analysis
* it's a tutorial, not a lecture, so let's make it interactive

#### Discussion

* How many people are familiar with Python? 
* What are the main uses?
* Are you familiar with other programming languages?
* Do you have a working Jupyter installation?

## Using JupyterLab

> ⚠️ If you don't have a Jupyter installation, you can use the browser-based distribution (JupyterLite) by following this link:
>
>  https://btel.github.io/2023-09-20-eitn-school-python

* moving around
* editing mode
* executing cells
* getting help
* keyboard shortcuts:
  - Enter (enter edit mode)
  - Shift-Enter (run cell)
  - Esc (enter command mode)
  - M (convert cell to markdown, in command mode)
  - X (remove cell, in command mode)
  - B (insert new cell below, in command mode)

## Basic Python

##### expressions:

  ```python
  a = 4
  b = a + 1
  print(f"{a} + 1 = {b}")
  ```

##### data structures

  ```python
  # list
  my_list = [1, 5, 6]
  print(my_list[0])

  # string
  my_string = "hello world"

  # tuple
  my_tuple = (4, 5)
  x, y = my_tuple

  # dictionary
  my_dict = {'a': 1, 'b': 3}
  print(my_dict['a'])
  ```

##### conditionals:

  ```python
  if a > 0:
      print("a is positive")
  ```

##### loops

  ```python
  my_list = [1, 2, 3, 4]
  for i in range(4):
      print(my_list[i])
  ```

##### functions

  ```python
  def my_function(a):
      return a + 1
  print(my_function(5))
  ```


### Quiz

Name the type of the following data structures:

  a) `var_a = {'k': 0, 'l': 5}`

  b) `var_b = "Paris"`
  
  c) `var_c = ('hello', 'world')`
  
  d) `var_d = [(1, 1), (2, 2),  (3, 3)]`

What are the values of the following expressions:

  a) `var_a['k']`
  
  b) `var_b[1]`
  
  c) `var_d[2]`
  
  d) `var_a[1]`

## Importing and exploring data

* importing libraries
* pandas
* `read_csv`, `describe`, `head`
* `.iloc`, `.loc`
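
As a minimal sketch of the functions listed above, the example below reads a tiny made-up table in place of `eeg_powers.csv`. The rows, the values and the `channel` column name are assumptions for illustration only; the band names follow the ones used later in this tutorial.

```python
import io

import pandas as pd

# Made-up stand-in for eeg_powers.csv; the first (unnamed) column holds the row index
csv_text = """,channel,alpha,beta,gamma,delta
0,ch1,1.0,2.0,0.5,4.0
1,ch2,1.5,2.5,0.7,3.0
2,ch1,0.8,1.9,0.4,5.0
"""

# index_col=0 tells pandas to use the first column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.head(2))      # first two rows
print(df.describe())   # summary statistics of the numeric columns

first_row = df.iloc[0]            # select by integer position
alpha_of_2 = df.loc[2, "alpha"]   # select by index label and column name
```

`.iloc` counts rows from zero regardless of the index, whereas `.loc` looks up the labels stored in the index, so the two can return different rows once the data has been filtered or reordered.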

In [1]:
import pandas as pd
import matplotlib

In [2]:
#url = 'https://raw.githubusercontent.com/btel/2022-09-21-eitn-school/main/eeg_powers.csv'
url = 'https://bit.ly/3BTE0A1'

# use this command if you don't have the CSV file locally
# df = pd.read_csv(url, index_col=0)
df = pd.read_csv('eeg_powers.csv', index_col=0)

Definitions of EEG bands:

* delta: 0.5 -- 4 Hz
* alpha: 8 -- 13 Hz
* beta: 13 -- 30 Hz
* gamma: > 30 Hz

For details, see my notebook with feature extraction: https://www.kaggle.com/btelenczuk/eeg-extract-features


#### Exercise: importing and indexing

1) Read the data.
2) How many categorical data columns are there?
3) (Optional) Display the last 10 rows

## Working with categorical data

**Goal**
* check class names for each categorical column

**Functions**
* `unique`, `nunique`, `value_counts`
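
The three functions above can be tried on a small made-up table. The column names `channel` and `state` and their class names are hypothetical and may differ from the ones in the real dataset.

```python
import pandas as pd

# Made-up categorical columns (names and classes are assumptions)
df = pd.DataFrame({
    "channel": ["ch1", "ch2", "ch1", "ch3", "ch2", "ch1"],
    "state": ["drowsy", "awake", "awake", "drowsy", "awake", "awake"],
})

for column in ["channel", "state"]:
    print(column, df[column].unique())   # array of the distinct class names
    print(column, df[column].nunique())  # number of distinct classes

# Number of rows in each class, most frequent class first
state_counts = df["state"].value_counts()
print(state_counts)
```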

## Plotting: time series and distributions

**Goal**
* plot the time series of powers
* use subplots
* plot distributions with a custom number of bins

**Functions**
* pandas: `plot` and `hist`
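
A possible sketch of both plot types, using random placeholder data instead of the EEG powers (the values are not meant to look like real recordings, and the row index stands in for time):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random placeholder data with one column per band
powers = pd.DataFrame(
    rng.lognormal(size=(200, 4)),
    columns=["alpha", "beta", "gamma", "delta"],
)

# Time series: one panel per column, plotted against the row index
line_axes = powers.plot(subplots=True, figsize=(8, 6))

# Distributions: one histogram per column, with a custom number of bins
hist_axes = powers.hist(bins=30, figsize=(8, 6))
```

Both calls return the matplotlib Axes objects they created, which can be used afterwards to adjust labels or titles.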

## Transforming data

**Goal**:
* "normalize" the power distributions by taking the logarithm
* remove zero-power rows (the logarithm of zero is undefined)
* replace and/or add columns
* use a for loop to transform all power columns
* replot histograms and observe multimodality

**Functions**
* boolean indexing/masking/filtering
* `.apply`
* `.copy`
* for loops
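
The steps above could look roughly like this on a made-up table. The values are invented, and the base-10 logarithm is an assumption; any other base works the same way.

```python
import numpy as np
import pandas as pd

# Made-up powers, including one row with zero power
df = pd.DataFrame({
    "alpha": [1.0, 0.0, 10.0, 100.0],
    "beta": [2.0, 0.0, 20.0, 200.0],
})
power_columns = ["alpha", "beta"]

# Boolean mask: True for rows where every power is strictly positive
mask = (df[power_columns] > 0).all(axis=1)

# .copy() gives an independent table, so df itself stays unchanged
clean = df[mask].copy()

# Replace each power column by its logarithm
for column in power_columns:
    clean[column] = clean[column].apply(np.log10)

print(clean)
```

Filtering before taking the logarithm avoids `-inf` values, which would otherwise break the histograms.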

#### Discussion

What are the reasons for the multimodality?

#### Exercise: adding column

Create a new column 'total_power' which is the sum of the powers across the 'alpha', 'beta', 'gamma', and 'delta' columns. Plot its histogram.

## Scatter plots

**Goal**: 
* identify dependencies between continuous variables (powers) using a scatter plot
* observe clusters

**Functions**
* `.plot.scatter` or `.plot(kind='scatter', ...)`
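
A minimal sketch with two artificial clusters, generated here only so that there is something to see; the real data may look quite different:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Two made-up clusters of points in the (alpha, beta) plane
low = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
high = rng.normal(loc=3.0, scale=0.3, size=(100, 2))
df = pd.DataFrame(np.vstack([low, high]), columns=["alpha", "beta"])

# s sets the marker size; the call returns the matplotlib Axes
ax = df.plot.scatter(x="alpha", y="beta", s=10)
# equivalent spelling: df.plot(kind="scatter", x="alpha", y="beta", s=10)
```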

#### Exercise: plotting

We have the following dataset:

```python
import numpy as np
import pandas as pd

data = {
    'Value': np.random.randn(100),
    'Other Value': np.random.randn(100),
    'Category': np.random.choice(['A', 'B', 'C'], size=100)
}
df = pd.DataFrame(data)
```

Match the graphs and the plotting functions:

1) ```python
   df.plot(y='Value')
   ```

3) ```python
   df.plot(kind='scatter', x='Other Value', y='Value')
   ```

5) ```python
   df['Value'].plot.hist(bins=20, title='Histogram')
   ```

7) ```python
   bar_data = df.groupby('Category')['Value'].mean()
   bar_data.plot(kind='bar')
   ```



![](graphs.svg)

## Splitting data

**Goal**
* split data into distinct subsets (drowsy vs non-drowsy) using boolean indexing
* show two sub-sets on a single plot

**Functions**
* Axes object
* boolean indexing
* `.isin`
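
One way to combine these pieces is sketched below. The `state` column and its class names are hypothetical; substitute the categorical column and classes found in the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data (column name "state" and its classes are assumptions)
df = pd.DataFrame({
    "state": ["drowsy", "awake", "alert", "drowsy", "awake", "alert"],
    "alpha": [3.0, 1.0, 1.2, 3.3, 0.9, 1.1],
    "beta": [1.0, 2.0, 2.2, 0.8, 2.1, 1.9],
})

# Boolean indexing: compare against a single class ...
drowsy = df[df["state"] == "drowsy"]
# ... or use .isin to match any of several classes
not_drowsy = df[df["state"].isin(["awake", "alert"])]

# Passing the same Axes object to both calls draws them on a single plot
fig, ax = plt.subplots()
drowsy.plot.scatter(x="alpha", y="beta", color="C0", label="drowsy", ax=ax)
not_drowsy.plot.scatter(x="alpha", y="beta", color="C1", label="not drowsy", ax=ax)
```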


## Compare groups

**Goal**: 
* explore dependencies between categorical and continuous variables 
* use `groupby` to group the rows by channel name
* calculate the mean total power for each channel
* plot a bar plot
  
**Functions**

* `.groupby`
* `.plot.bar`
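
A small sketch of the group-and-average step on made-up numbers. The `channel` column name is an assumption, and `total_power` is the column created in the earlier exercise.

```python
import pandas as pd

# Made-up table with two channels
df = pd.DataFrame({
    "channel": ["ch1", "ch1", "ch2", "ch2"],
    "total_power": [1.0, 3.0, 10.0, 20.0],
})

# Group the rows by channel, then average total_power within each group
mean_power = df.groupby("channel")["total_power"].mean()
print(mean_power)

# One bar per channel
ax = mean_power.plot.bar()
```

The result of `groupby(...).mean()` is a Series indexed by channel name, which is why it can be passed directly to `.plot.bar`.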


#### Exercise: low and high channels

Redo the scatter plot for low and high channels.

#### Discussion: EDA

What have we learnt from this analysis? Why do we see the two clusters? Why is it important to do Exploratory Data Analysis (EDA)?

## Predicting the state

**Goal**

* use machine learning to predict the brain state from powers

### preprocess data

**Goal**
* select features (powers)
* split into train and test set
* normalize the data

**Functions**

* `train_test_split`
* `StandardScaler`

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
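
The preprocessing goals above might be sketched as follows, using random placeholder data. The `state` column name, its classes, the 25% test fraction and the `random_state` value are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random placeholder data: four power columns and a made-up label column
feature_columns = ["alpha", "beta", "gamma", "delta"]
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=feature_columns)
df["state"] = rng.choice(["drowsy", "awake"], size=100)

X = df[feature_columns]  # features
y = df["state"]          # labels to predict

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the train set
X_test_scaled = scaler.transform(X_test)        # reuse them on the test set
```

Fitting the scaler on the train set only keeps information about the test set from leaking into the model.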

### fit model

**Goal**
* choose a model
* fit a model on a train set
* calculate prediction on a test set
* calculate accuracy
* data cleaning

**Functions**

* `LogisticRegression`
* `accuracy_score`

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
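
A minimal sketch of fitting and scoring a model, run on two artificial, well-separated classes rather than the EEG data. The class names and the data are made up, so the accuracy printed here says nothing about the real problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two made-up classes drawn from Gaussians centred far apart
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(4.0, 1.0, size=(100, 2)),
])
y = np.array(["awake"] * 100 + ["drowsy"] * 100)

# stratify=y keeps the class proportions equal in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)        # fit on the train set
y_pred = model.predict(X_test)     # predict on the held-out test set

# Fraction of test rows whose predicted label matches the true label
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```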


#### Exercise: improving accuracy

1) What is the chance accuracy level for this problem?
2) What is the accuracy if we apply the analysis only to high channels?