# NEURO-105: Statistics and Probability using Python
## Lesson 2 - Friday 23/1/26

**Instructor:** Alexandros Pittis  
**Course:** MSc in Neurosciences, University of Crete

---

### Today's Objectives
1. Finish Lesson 1 (Pandas)
2. Correlation analysis
3. Linear regression

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

sns.set_theme(style="whitegrid")

print("Libraries loaded!")

---
## Part 1: Working with the Penguins Dataset (built into seaborn)

We'll use a real dataset: measurements of penguins from Palmer Station, Antarctica.

This dataset contains measurements of three species:
- Adelie
- Chinstrap  
- Gentoo

In [None]:
# Load the penguins dataset
penguins = sns.load_dataset('penguins')

# Show the first 10 rows
penguins.head(10)

In [None]:
# How many rows and columns in total?
penguins.shape

In [None]:
# Column names
penguins.columns

In [None]:
# Summary statistics
penguins.describe()

### Handling Missing Data

Real data often has missing values, which pandas displays as NaN ("Not a Number").

In [None]:
# Check for missing values
penguins.isna().sum()

In [None]:
# Remove rows with missing values
penguins_clean = penguins.dropna()

print("Original:", len(penguins), "rows")
print("After dropna:", len(penguins_clean), "rows")
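
Called with no arguments, `dropna()` removes a row if *any* column is missing. When only some columns matter for an analysis, the `subset` argument keeps more rows. A minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are illustrative assumptions, not the real penguins data):

```python
import numpy as np
import pandas as pd

# Toy table with made-up values (not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 3800.0],
    "sex": ["Male", "Female", None],
})

# Drop a row if ANY column has a missing value
all_complete = toy.dropna()

# Drop a row only if body_mass_g is missing
mass_complete = toy.dropna(subset=["body_mass_g"])

print("dropna():", len(all_complete), "rows")
print("dropna(subset=['body_mass_g']):", len(mass_complete), "rows")
```

The second call keeps the Chinstrap row, because its only missing value is in a column we did not ask about.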

### Accessing Columns

In [None]:
# Get one column
penguins_clean['bill_length_mm']

In [None]:
# Get multiple columns
penguins_clean[['species', 'bill_length_mm', 'body_mass_g']]

### Basic Statistics

In [None]:
# Mean body mass
penguins_clean['body_mass_g'].mean()

In [None]:
# Median
penguins_clean['body_mass_g'].median()

In [None]:
# Standard deviation
penguins_clean['body_mass_g'].std()

In [None]:
# Min and Max
print("Min:", penguins_clean['body_mass_g'].min())
print("Max:", penguins_clean['body_mass_g'].max())

### Filtering Rows

In [None]:
# Keep only Adelie penguins (boolean indexing)
penguins_clean[penguins_clean['species'] == 'Adelie']

In [None]:
# Only penguins heavier than 5000g
penguins_clean[penguins_clean['body_mass_g'] > 5000]

In [None]:
# Combine conditions: Adelie AND heavier than 4000g
penguins_clean[(penguins_clean['species'] == 'Adelie') & (penguins_clean['body_mass_g'] > 4000)]
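
Conditions can also be combined with `|` (OR), negated with `~` (NOT), or written as a membership test with `.isin()`. Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>` in Python. A small sketch on made-up rows (the `toy` table is an assumption for illustration):

```python
import pandas as pd

# Toy table with made-up values (not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700, 4300, 5200, 3600],
})

# OR: Gentoo penguins, or any penguin heavier than 4000 g
gentoo_or_heavy = toy[(toy["species"] == "Gentoo") | (toy["body_mass_g"] > 4000)]

# NOT: every species except Adelie
not_adelie = toy[~(toy["species"] == "Adelie")]

# Membership: species is one of several values
two_species = toy[toy["species"].isin(["Adelie", "Chinstrap"])]

print(len(gentoo_or_heavy), len(not_adelie), len(two_species))
```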

### Group By - Statistics per Group

In [None]:
# Show the documentation for groupby (the trailing ? works in Jupyter/IPython only)
penguins_clean.groupby?

In [None]:
# Mean body mass per species
penguins_clean.groupby('species')['body_mass_g'].mean()

In [None]:
# Full statistics per species
penguins_clean.groupby('species')['body_mass_g'].describe()

In [None]:
# Count per species and island
penguins_clean.groupby(['species', 'island']).size()
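
When a handful of statistics per group is enough, `.agg()` with a list of function names returns them in one table. A minimal sketch on made-up values (the `toy` table is an assumption for illustration):

```python
import pandas as pd

# Toy table with made-up values (not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3600, 3800, 5000, 5400],
})

# One row per species, one column per requested statistic
summary = toy.groupby("species")["body_mass_g"].agg(["mean", "min", "max", "count"])
print(summary)
```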

### Plotting

In [None]:
# Histogram of body mass
sns.histplot(data=penguins_clean, x='body_mass_g')

In [None]:
# Histogram by species
sns.histplot(data=penguins_clean, x='body_mass_g', hue='species')

In [None]:
# Box plot: body mass per species
sns.boxplot(data=penguins_clean, x='species', y='body_mass_g')

In [None]:
# Scatter plot: bill length vs body mass
sns.scatterplot(data=penguins_clean, x='bill_length_mm', y='body_mass_g')

In [None]:
# Scatter plot colored by species
sns.scatterplot(data=penguins_clean, x='bill_length_mm', y='body_mass_g', hue='species')

---
## Part 2: Correlation

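**Correlation** measures how strongly, and in which direction, two variables vary together.

**Pearson correlation coefficient (r)** measures the strength of a *linear* relationship:
- r = +1: Perfect positive linear correlation
- r = 0: No linear correlation (a nonlinear relationship may still exist)
- r = -1: Perfect negative linear correlation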
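
To see where r comes from, it can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below uses made-up numbers and a helper named `pearson_r` (a hypothetical name, not a SciPy function), and compares the result with `stats.pearsonr`. It also shows that a perfectly curved relationship can still give r = 0.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson r from deviations around the means (illustrative helper)."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Made-up measurements, for illustration only
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])
y = np.array([3500.0, 3900.0, 4100.0, 4700.0, 5000.0])

r_manual = pearson_r(x, y)
r_scipy, p_value = stats.pearsonr(x, y)
print("Manual r:", round(r_manual, 4))
print("SciPy r: ", round(r_scipy, 4))

# A symmetric parabola: y depends entirely on x, yet there is no linear trend
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_curve = x_sym**2
r_curve = pearson_r(x_sym, y_curve)
print("r for y = x^2:", r_curve)
```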

In [None]:
# Is there a correlation between bill (beak) length and body mass?
r, p_value = stats.pearsonr(penguins_clean['bill_length_mm'], penguins_clean['body_mass_g'])

print("Correlation (r):", round(r, 3))
print("P-value:", p_value)

**Interpretation?**

In [None]:
# Is there a correlation between flipper (wing) length and body mass?
r2, p2 = stats.pearsonr(penguins_clean['flipper_length_mm'], penguins_clean['body_mass_g'])

print("Correlation (r):", round(r2, 3))
print("P-value:", p2)

Does flipper length have a stronger correlation with body mass than bill length does?

### Correlation Matrix

We can calculate correlations between all numeric variables at once.

In [None]:
# Select only numeric columns
numeric_cols = penguins_clean[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]

# Correlation matrix
numeric_cols.corr()
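
Instead of typing the numeric column names by hand, `select_dtypes(include="number")` picks them automatically. A short sketch on a made-up table (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Toy table with made-up values (not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "flipper_length_mm": [181.0, 217.0, 195.0, 230.0],
    "body_mass_g": [3750.0, 5200.0, 3800.0, 5700.0],
})

# Keep only columns with a numeric dtype; 'species' is dropped
numeric = toy.select_dtypes(include="number")

corr = numeric.corr()
print(corr)
```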

In [None]:
# Visualize as heatmap
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
sns.despine()

**Reading the heatmap:**
- The diagonal is always 1 (each variable correlates perfectly with itself)
- Red = positive correlation
- Blue = negative correlation
- Strongest correlation: flipper_length_mm vs body_mass_g (r ≈ 0.87)

---
## Part 3: Linear Regression

**Linear regression** fits a straight line to data:

$$y = mx + b$$

- m = slope
- b = intercept
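
The "best" line is usually chosen by ordinary least squares: the m and b that minimise the sum of squared vertical distances between the points and the line. For one predictor this has a closed form, sketched below on made-up numbers (the arrays are illustrative assumptions) and compared with `stats.linregress`.

```python
import numpy as np
from scipy import stats

# Made-up measurements, for illustration only
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])       # e.g. flipper length (mm)
y = np.array([3500.0, 3900.0, 4100.0, 4700.0, 5000.0])  # e.g. body mass (g)

dx = x - x.mean()
dy = y - y.mean()

# Least-squares slope and intercept
m = (dx * dy).sum() / (dx**2).sum()
b = y.mean() - m * x.mean()

fit = stats.linregress(x, y)
print("By hand:    m =", round(m, 2), " b =", round(b, 2))
print("linregress: m =", round(fit.slope, 2), " b =", round(fit.intercept, 2))
```

The formula for b shows that the fitted line always passes through the point of means (x̄, ȳ).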

Let's predict body mass from flipper length.

In [None]:
# Show the documentation for stats.linregress (the trailing ? works in Jupyter/IPython only)
stats.linregress?

In [None]:
# Linear regression: flipper length → body mass
result = stats.linregress(penguins_clean['flipper_length_mm'], penguins_clean['body_mass_g'])

print("Slope:", round(result.slope, 2))
print("Intercept:", round(result.intercept, 2))
print("R-squared:", round(result.rvalue**2, 3))
print("P-value:", result.pvalue)
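
R-squared can be read as the fraction of the variance in y that the line accounts for: one minus the ratio of the residual sum of squares to the total sum of squares. For a straight-line fit with one predictor, this equals the squared Pearson r. A small check on made-up numbers (the arrays are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Made-up measurements, for illustration only
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])
y = np.array([3500.0, 3900.0, 4100.0, 4700.0, 5000.0])

fit = stats.linregress(x, y)
y_hat = fit.slope * x + fit.intercept   # values predicted by the line

ss_res = ((y - y_hat) ** 2).sum()       # variation left unexplained
ss_tot = ((y - y.mean()) ** 2).sum()    # total variation around the mean
r2 = 1 - ss_res / ss_tot

print("1 - SS_res/SS_tot:", round(r2, 4))
print("rvalue squared:   ", round(fit.rvalue**2, 4))
```

The slope carries the units of y per unit of x, so here it would be grams of body mass per millimetre of flipper length.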

**Interpretation?**

In [None]:
# Plot with regression line
sns.regplot(data=penguins_clean, x='flipper_length_mm', y='body_mass_g')

In [None]:
# Regression per species
sns.lmplot(data=penguins_clean, x='flipper_length_mm', y='body_mass_g', hue='species')

### Making Predictions

In [None]:
# Predict body mass for a penguin with flipper length = 200mm
flipper = 200

predicted_mass = result.slope * flipper + result.intercept

print("Flipper length:", flipper, "mm")
print("Predicted body mass:", round(predicted_mass), "g")

In [None]:
# Predict for multiple values
flippers = [180, 190, 200, 210, 220]

for f in flippers:
    mass = result.slope * f + result.intercept
    print("Flipper:", f, "mm → Mass:", round(mass), "g")
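
The same predictions can be made without a loop by passing a NumPy array through the line equation. It is also worth checking whether a new value lies inside the range the line was fitted on, since predictions outside that range (extrapolation) are less trustworthy. A sketch on made-up numbers (the arrays and the 260 mm value are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Made-up measurements, for illustration only
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])
y = np.array([3500.0, 3900.0, 4100.0, 4700.0, 5000.0])
fit = stats.linregress(x, y)

# Predict for several flipper lengths at once
new_flippers = np.array([185.0, 205.0, 260.0])
predicted = fit.slope * new_flippers + fit.intercept

# True where the new value lies within the fitted x range
inside_range = (new_flippers >= x.min()) & (new_flippers <= x.max())

for f, mass, ok in zip(new_flippers, predicted, inside_range):
    note = "" if ok else "  (extrapolation: outside the fitted range)"
    print(f"Flipper: {f:.0f} mm → Mass: {mass:.0f} g{note}")
```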

---
## Part 4: Summary

**Pandas:**
- Load data: `sns.load_dataset()` or `pd.read_csv()`
- View data: `.head()`, `.shape`, `.columns`
- Handle missing: `.isna()`, `.dropna()`
- Statistics: `.mean()`, `.median()`, `.std()`, `.describe()`
- Filter: `df[df['col'] > value]`
- Group: `df.groupby('col')['other'].mean()`

**Correlation:**
- `stats.pearsonr(x, y)` → (r, p-value)
- `df.corr()` → correlation matrix
- `sns.heatmap()` to visualize

**Linear Regression:**
- `stats.linregress(x, y)` → slope, intercept, r, p-value
- `sns.regplot()` → scatter + regression line
- Predict: `y = slope * x + intercept`

---
## Exercises

### Exercise 1
Find the mean bill length for each species.

In [None]:
# YOUR CODE HERE


### Exercise 2
Calculate the correlation between bill_length_mm and bill_depth_mm for Adelie penguins only.

In [None]:
# YOUR CODE HERE


### Exercise 3
Using our linear regression (flipper → body mass), predict the body mass of a penguin with flipper length 230mm.

In [None]:
# YOUR CODE HERE


---
## Resources

- [Palmer Penguins Dataset](https://allisonhorst.github.io/palmerpenguins/)
- [Seaborn Regression Tutorial](https://seaborn.pydata.org/tutorial/regression.html)
- [SciPy linregress](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)

---

**Next class (26/1/26):** Permutation tests, p-value estimation

---
*NEURO-105 - MSc in Neurosciences, University of Crete*