# Introduction to Programming, Statistics, Data Science, and Machine Learning for MD Students

Welcome to the first lecture of this course! Over the coming weeks, we will dive into the foundational concepts of programming, statistics, data science, and machine learning. This course is tailored to equip MD students with the computational and analytical skills needed to tackle real-world medical challenges and make data-driven decisions.

---

## About the Instructor

Hi, I’m **Arman Karshenas**, and I’ll be your instructor for the first and last sections of the course. I completed my BA at the University of Oxford, followed by an MPhil at the University of Cambridge. I recently finished my PhD in Biophysics at UC Berkeley, where my research focused on using machine learning and computational methods to solve complex biological problems. I’ve taught programming and bioinformatics courses and led initiatives to make these skills accessible to students globally. I am passionate about bridging the gap between computational techniques and medical research, and I’m excited to share my knowledge with you.

--- 

## Course Logistics

Please note the following:

1. The first section consists of 6 lectures covering the basics of programming in Python. If time permits, I may squeeze in some R for those of you who are interested in doing more statistics in the future.
2. Classes run from 7 to 10 PM Tehran time. We will meet on Zoom using the following link: [https://berkeley.zoom.us/my/karshenas](https://berkeley.zoom.us/my/karshenas)
3. The class is hands-on, with many examples to help you understand the material. I encourage everyone to raise their hand on Zoom and interrupt me with questions at any point during the class.
4. Each class comes with a homework assignment (HW) so that you can practice on different and harder problems.
5. I will post readings and resources for each major topic we introduce, so that you can refer to them later if you are interested.
6. If you have any questions or concerns, please email me at: karshenas [AT] berkeley [dot] edu



# The Importance of Data Science in Medical Fields

Data science plays a pivotal role in advancing medical research and practice. Here are some key use cases demonstrating its impact:

---

## 1. FDA Clinical Trials and Drug Development

Data science is integral to the design, analysis, and interpretation of FDA-regulated clinical trials. By leveraging statistical and machine learning techniques, researchers can:

- Optimize trial designs using adaptive methods.
- Predict patient responses to treatments, reducing trial durations and costs.
- Analyze large-scale clinical trial data to identify safety signals and efficacy patterns.

*Example:* The FDA’s Sentinel Initiative uses data science to monitor the safety of medical products by analyzing real-world evidence.  
**Reference:** [FDA Sentinel Initiative](https://www.fda.gov/safety/fdas-sentinel-initiative)

---

## 2. Genome-Wide Association Studies (GWAS)

GWAS uses large datasets to identify genetic variants associated with specific traits or diseases. Data science enables:

- The processing and analysis of vast genomic datasets.
- Identification of genetic markers linked to diseases like diabetes and cancer.
- Visualization of association results in Manhattan plots to pinpoint significant genetic loci.

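*Example:* The NHGRI-EBI GWAS Catalog, hosted by EMBL-EBI, is a curated, searchable collection of published genome-wide association studies covering a wide range of traits and health conditions.  
**Reference:** [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/)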

---

## 3. Genotype-Phenotype Mapping and Association Tests

Understanding how genetic variations influence phenotypic traits is crucial for precision medicine. Data science facilitates:

- Conducting association tests between genotypes and phenotypes.
- Modeling complex interactions between multiple genetic variants.
- Developing polygenic risk scores to predict disease susceptibility.

---

These examples underscore the transformative potential of data science in medical fields, from drug development to personalized medicine. By mastering the tools and techniques in this course, you will be equipped to contribute to such impactful applications.


# Statistical Testing with a Linear Model

To analyze the effect of Drug A on blood pressure, we can model blood pressure as a linear function of the binary variable $D$ (indicating whether the drug is administered). This approach allows us to estimate the effect of the drug and assess its statistical properties.

---

## Linear Model with Blood Pressure as Output

The model is defined as:
$$
Y = \beta_0 + \beta_1 D + \epsilon
$$
where:
- $Y$: Blood pressure (output variable).
- $\beta_0$: Intercept (average blood pressure for the control group).
- $\beta_1$: Effect of Drug A (difference in mean blood pressure between treatment and control groups).
- $D$: Binary variable ($D = 1$ if Drug A is administered, $D = 0$ otherwise).
- $\epsilon$: Error term (assumed to be normally distributed with mean $0$ and variance $\sigma^2$).

The goal is to estimate $\beta_1$, the effect of Drug A.

---

## Properties of the Estimator for $\beta_1$

The estimator for $\beta_1$ is given by:
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n (D_i - \bar{D})(Y_i - \bar{Y})}{\sum_{i=1}^n (D_i - \bar{D})^2}
$$
where $\bar{D}$ and $\bar{Y}$ are the means of $D$ and $Y$.
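
To make the formula concrete, here is a minimal sketch that simulates a small trial and computes $\hat{\beta}_1$ directly. The sample size, the true parameter values, and the variable names are assumptions chosen for illustration, not data from a real study. Because $D$ is binary, the estimate should coincide with the difference between the treated and control group means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial: 100 patients, half receive Drug A (all values are made up).
n = 100
D = np.repeat([0, 1], n // 2)             # 0 = control, 1 = treated
beta0, beta1, sigma = 120.0, -8.0, 10.0   # assumed true parameters
Y = beta0 + beta1 * D + rng.normal(0.0, sigma, size=n)

# Slope estimator: sum((D - mean(D)) * (Y - mean(Y))) / sum((D - mean(D))^2)
D_centered = D - D.mean()
beta1_hat = np.sum(D_centered * (Y - Y.mean())) / np.sum(D_centered ** 2)

# For a binary D, the same number is the difference in group means.
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()

print(f"beta1_hat     = {beta1_hat:.3f}")
print(f"diff in means = {diff_in_means:.3f}")
```

The two printed values agree up to floating-point error, which is why $\beta_1$ can be read as the difference in mean blood pressure between the two groups.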

### Mean of the Estimator
The mean of $\hat{\beta}_1$ is:
$$
E[\hat{\beta}_1] = \beta_1
$$

### Standard Deviation of the Estimator
The standard deviation of $\hat{\beta}_1$ is:
$$
\text{SD}(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n (D_i - \bar{D})^2}}
$$

### Bias of the Estimator
Since $E[\hat{\beta}_1] = \beta_1$, the estimator is unbiased:
$$
\text{Bias}(\hat{\beta}_1) = E[\hat{\beta}_1] - \beta_1 = 0
$$
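
One way to build intuition for these properties is a small Monte Carlo experiment: simulate the same trial many times and look at how the estimates are distributed. The sketch below assumes a balanced design with 100 patients and made-up true parameters; the number of repetitions is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_sims = 100, 5000
D = np.repeat([0, 1], n // 2)             # fixed treatment assignment
beta0, beta1, sigma = 120.0, -8.0, 10.0   # assumed true parameters

D_centered = D - D.mean()
sxx = np.sum(D_centered ** 2)             # denominator of the estimator

# Simulate n_sims independent trials at once: one row per trial.
eps = rng.normal(0.0, sigma, size=(n_sims, n))
Y = beta0 + beta1 * D + eps

# Apply the estimator to every simulated trial.
Y_centered = Y - Y.mean(axis=1, keepdims=True)
estimates = Y_centered @ D_centered / sxx

theoretical_sd = np.sqrt(sigma ** 2 / sxx)

print(f"true beta1        = {beta1}")
print(f"mean of estimates = {estimates.mean():.3f}")
print(f"SD of estimates   = {estimates.std(ddof=1):.3f}")
print(f"theoretical SD    = {theoretical_sd:.3f}")
```

With these settings $\sum_i (D_i - \bar{D})^2 = 25$, so the formula gives a standard deviation of $10 / 5 = 2$. The average of the simulated estimates should land close to the true $\beta_1$, and their spread close to the theoretical value.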

### 95% Confidence Interval
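The 95% confidence interval for $\beta_1$, centered at the estimate $\hat{\beta}_1$, uses the multiplier 1.96 (the 97.5th percentile of the standard normal distribution; when $\sigma$ is estimated from a small sample, the corresponding $t$-quantile is used instead):
$$
\text{CI}_{95\%} = \hat{\beta}_1 \pm 1.96 \times \text{SD}(\hat{\beta}_1)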
$$
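
In practice $\sigma$ is unknown, so the interval is usually computed with an estimate of it. The sketch below is one way to do this on simulated data: it estimates $\sigma$ from the residuals (dividing by $n - 2$ because two coefficients are fitted) and plugs the result into the formula above. The data and parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical trial with 200 patients (made-up parameters).
n = 200
D = np.repeat([0, 1], n // 2)
Y = 120.0 - 8.0 * D + rng.normal(0.0, 10.0, size=n)

# Least-squares estimates of slope and intercept.
D_centered = D - D.mean()
sxx = np.sum(D_centered ** 2)
beta1_hat = np.sum(D_centered * (Y - Y.mean())) / sxx
beta0_hat = Y.mean() - beta1_hat * D.mean()

# Estimate sigma from the residuals, then the standard error of beta1_hat.
residuals = Y - (beta0_hat + beta1_hat * D)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))
se = sigma_hat / np.sqrt(sxx)

# Normal-approximation 95% confidence interval.
ci_low, ci_high = beta1_hat - 1.96 * se, beta1_hat + 1.96 * se
print(f"beta1_hat = {beta1_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval excludes $0$, the data are inconsistent (at the 5% level) with Drug A having no effect on mean blood pressure.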

---

## Extending the Model with an Additional Variable $X$

Now suppose we introduce another variable $X$ (e.g., age or BMI), which also influences blood pressure. The linear model becomes:
$$
Y = \beta_0 + \beta_1 D + \beta_2 X + \epsilon
$$
where:
- $\beta_2$: Coefficient representing the effect of $X$ on blood pressure.

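### Estimator for $\beta_1$ with $X$
When all coefficients are estimated jointly by least squares, the estimator for $\beta_1$ in this model can be written as:
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n \tilde{D}_i Y_i}{\sum_{i=1}^n \tilde{D}_i^2}
$$
where $\tilde{D}_i = D_i - \hat{D}_i$ is the residual from regressing $D$ on $X$ (with an intercept), that is, the part of $D$ that is not linearly explained by $X$. If $D$ and $X$ are uncorrelated in the sample, then $\tilde{D}_i = D_i - \bar{D}$ and the expression reduces to the single-variable estimator.

### Mean, Standard Deviation, and Bias of $\hat{\beta}_1$
- **Mean**:  
  $$ E[\hat{\beta}_1] = \beta_1 $$  
  (Unbiased when the model is correctly specified, even if $D$ and $X$ are correlated, provided they are not perfectly collinear.)

- **Standard Deviation**:  
  $$ \text{SD}(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n \tilde{D}_i^2}} = \sqrt{\frac{\sigma^2}{(1 - R^2_{D,X}) \sum_{i=1}^n (D_i - \bar{D})^2}} $$  
  (Here $\sigma^2$ is the error variance of the extended model and $R^2_{D,X}$ is the $R^2$ from regressing $D$ on $X$. The formula matches the single-variable case when $D$ and $X$ are uncorrelated, and the standard deviation grows as their correlation increases.)

- **Bias**:  
  $$ \text{Bias}(\hat{\beta}_1) = 0 $$  
  (As long as the model is correctly specified. If $X$ is instead omitted from the model, $\hat{\beta}_1$ is biased unless $\beta_2 = 0$ or $D$ and $X$ are uncorrelated.)

- **95% Confidence Interval** for $\beta_1$:  
  $$ \text{CI}_{95\%} = \hat{\beta}_1 \pm 1.96 \times \text{SD}(\hat{\beta}_1) $$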

This extended model enables us to isolate the effect of Drug A ($\beta_1$) while accounting for the influence of $X$, leading to more robust and interpretable results.
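
The following sketch illustrates why adjusting for $X$ matters. It simulates a hypothetical observational setting in which older patients are more likely to receive Drug A, so $D$ and $X$ (age) are correlated. All parameter values and the treatment-assignment rule are assumptions made up for this example. It fits the extended model with `np.linalg.lstsq`, recomputes the same coefficient by first removing from $D$ the part explained by $X$, and compares both with the single-variable estimate that ignores $X$.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 500
X = rng.normal(50.0, 10.0, size=n)                   # hypothetical age in years
# Assumed assignment rule: older patients are more likely to be treated.
p_treat = 1.0 / (1.0 + np.exp(-(X - 50.0) / 10.0))
D = rng.binomial(1, p_treat)
beta0, beta1, beta2, sigma = 90.0, -8.0, 0.6, 10.0   # assumed true parameters
Y = beta0 + beta1 * D + beta2 * X + rng.normal(0.0, sigma, size=n)

# Joint least-squares fit of Y on [intercept, D, X].
A = np.column_stack([np.ones(n), D, X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
beta1_joint = coef[1]

# Same coefficient via the residual of D after regressing it on X.
Z = np.column_stack([np.ones(n), X])
gamma, *_ = np.linalg.lstsq(Z, D, rcond=None)
D_tilde = D - Z @ gamma
beta1_resid = np.sum(D_tilde * Y) / np.sum(D_tilde ** 2)

# Single-variable estimate that ignores X.
D_centered = D - D.mean()
beta1_naive = np.sum(D_centered * (Y - Y.mean())) / np.sum(D_centered ** 2)

print(f"true beta1            = {beta1}")
print(f"adjusted (joint fit)  = {beta1_joint:.3f}")
print(f"adjusted (residual D) = {beta1_resid:.3f}")
print(f"unadjusted (no X)     = {beta1_naive:.3f}")
```

The two adjusted estimates are identical up to floating-point error. The unadjusted estimate is pulled upward here because treated patients are older on average and age raises blood pressure in this simulation, so part of the age effect is wrongly attributed to the drug.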
