# Introduction to Python for Clinical Data Analysis

This notebook introduces core Python concepts using a healthcare dataset.

By the end of this session you should understand:
- How Python handles data types
- How to work with tabular data using pandas
- What vectorisation means
- How to filter and summarise clinical datasets

## 1. Importing Libraries

In Python, we use libraries (also called packages) to extend functionality.

- `pandas` is used for tabular data (like spreadsheets)
- `matplotlib` is used for visualisation
- `numpy` is used for numerical operations

We import them once, at the top of the notebook, under their conventional aliases (`pd`, `np`, `plt`).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 2. Loading a Dataset

A pandas DataFrame is like a spreadsheet in memory.
- Rows represent patients
- Columns represent variables (age, CRP, diagnosis, etc.)

We load data from a CSV file.

In [None]:
# Replace with your actual file path if needed
df = pd.read_csv('patients.csv')
df.head()
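
If `patients.csv` is not to hand, the rows-and-columns picture described above can be sketched with a small in-memory table. The values below are made up purely for illustration and `demo_df` is a hypothetical name; only the column names `age`, `crp`, and `diagnosis` come from this notebook.

```python
import pandas as pd

# Synthetic, made-up values for illustration only (not real patient data).
demo_df = pd.DataFrame(
    {
        "age": [72, 45, 81, 66, 58],
        "crp": [12.5, 3.1, 48.0, 9.8, 15.2],
        "diagnosis": ["pneumonia", "asthma", "sepsis", "copd", "pneumonia"],
    }
)

# Each row is one patient, each column is one variable.
n_patients, n_variables = demo_df.shape
print(demo_df.head())
print(f"{n_patients} patients, {n_variables} variables")
```

A table built this way behaves exactly like one returned by `pd.read_csv`, so it can stand in for the real file while trying out the later steps.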

## 3. Understanding the Structure of the Data

Before analysing, always inspect:
- Column names
- Data types
- Summary statistics

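Catching problems at this stage, such as a misnamed column, a numeric variable stored as text, or unexpected missing values, prevents errors later in the analysis.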

In [None]:
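# List the column names exactly as pandas read them
df.columns

In [None]:
# Show each column's data type and its count of non-missing values
df.info()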

In [None]:
df.describe()
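
As a concrete case of the data type check, a numeric column that contains stray text is read as text (shown as `object` or a string dtype, depending on the pandas version) and will not behave like a number. A minimal sketch, assuming a made-up table `raw` in which one CRP value was entered as `'<5'`; both the table and that entry are hypothetical:

```python
import pandas as pd

# Hypothetical raw export: CRP was read as text because of the '<5' entry.
raw = pd.DataFrame({"age": [72, 45, 81], "crp": ["12.5", "<5", "48.0"]})

print(raw.dtypes)  # age is an integer type; crp is a text type

# errors="coerce" turns anything unparseable into NaN instead of raising.
raw["crp_numeric"] = pd.to_numeric(raw["crp"], errors="coerce")

print(raw.dtypes)  # crp_numeric is now a floating-point column
print(raw["crp_numeric"].isna().sum(), "value(s) could not be converted")
```

Whether `'<5'` should become missing, 5, or some other value is an analysis decision; coercing to `NaN` simply makes the problem visible instead of hiding it.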

## 4. Filtering Data

Filtering allows us to define a cohort.

For example:
- Patients with CRP > 10
- Patients older than 65

This mirrors inclusion criteria in research studies.

In [None]:
high_crp = df[df['crp'] > 10]
high_crp.head()

In [None]:
# Combine conditions with & (and); each condition needs its own parentheses
older_high_crp = df[(df['crp'] > 10) & (df['age'] > 65)]
older_high_crp.head()
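
One detail of filtering is easy to miss: a comparison with a missing value evaluates to `False`, so patients with no recorded CRP silently drop out of the cohort. The sketch below shows this on made-up values and also names each condition before combining them, which can make longer inclusion criteria easier to read. The names `demo`, `is_high_crp`, `is_older`, and `cohort` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values; the last patient has no CRP recorded.
demo = pd.DataFrame(
    {
        "age": [72, 45, 81, 66, 90],
        "crp": [12.5, 30.0, 48.0, 9.8, np.nan],
    }
)

# Build each condition as a boolean Series, then combine element-wise with &.
is_high_crp = demo["crp"] > 10
is_older = demo["age"] > 65
cohort = demo[is_high_crp & is_older]

print(cohort)
print("Cohort size:", len(cohort))
print("Patients with missing CRP:", demo["crp"].isna().sum())
```

Reporting the cohort size and the number of missing values alongside the filter gives the same information as the exclusion counts in a study flow diagram.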

## 5. Vectorisation

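Vectorisation means applying an operation to an entire column at once instead of looping over it row by row.

It is usually much faster, because the loop runs inside optimised NumPy code rather than in Python, and the resulting code is shorter and easier to read.

For example, we can calculate a risk score for every patient in a single expression.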

In [None]:
df['risk_score'] = df['age'] * 0.3 + df['crp'] * 0.5
df.head()
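
To make the contrast with a row-by-row loop concrete, the sketch below computes the same score both ways on a small made-up table and checks that the results agree. The table `demo` and the names `vectorised` and `looped` are hypothetical; the weights are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({"age": [72, 45, 81], "crp": [12.5, 3.1, 48.0]})

# Vectorised: one expression applied to whole columns.
vectorised = demo["age"] * 0.3 + demo["crp"] * 0.5

# Row-by-row equivalent, shown only for comparison.
looped = []
for _, row in demo.iterrows():
    looped.append(row["age"] * 0.3 + row["crp"] * 0.5)

# Both approaches give the same numbers.
print(np.allclose(vectorised, looped))
```

On three rows the speed difference is negligible; it becomes noticeable as the number of rows grows, because `iterrows` builds a new Series for every row.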

## 6. Visualisation

Visualising data helps identify:
- The shape of the distribution
- Skewness
- Outliers

Here we plot the distribution of CRP as a histogram.

In [None]:
# Drop missing values first; NaNs can break the automatic bin range
plt.hist(df['crp'].dropna())
plt.xlabel('CRP')
plt.ylabel('Frequency')
plt.show()
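
What the histogram shows visually can also be checked numerically. The sketch below is one possible approach, not part of the original workflow: it uses `Series.skew()` for skewness and Tukey's 1.5 × IQR rule, a common convention, to flag outliers. The CRP values are made up for illustration.

```python
import pandas as pd

# Made-up CRP values with one extreme reading.
crp = pd.Series([2.0, 3.5, 4.0, 5.5, 8.0, 12.0, 15.0, 150.0], name="crp")

# Positive skewness indicates a long right tail.
skewness = crp.skew()

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = crp.quantile(0.25), crp.quantile(0.75)
iqr = q3 - q1
outliers = crp[(crp < q1 - 1.5 * iqr) | (crp > q3 + 1.5 * iqr)]

print(f"Skewness: {skewness:.2f}")
print(outliers)
```

A flagged value is not necessarily an error. A very high CRP may be clinically real, so outliers found this way are candidates for review, not automatic exclusions.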

## 7. Key Takeaways

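- Python makes an analysis reproducible: every step is written down as code and can be re-run
- pandas provides the DataFrame for handling structured, tabular data
- Vectorisation improves efficiency by operating on whole columns at once
- Filtering mirrors cohort selection

The same workflow of loading, inspecting, filtering, computing, and plotting applies to a small audit and to a national dataset alike.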