# Introduction to Programming, Statistics, Data Science, and Machine Learning for MD Students

Welcome to the first lecture of this course! Over the coming weeks, we will dive into the foundational concepts of programming, statistics, data science, and machine learning. This course is tailored to equip MD students with the computational and analytical skills needed to tackle real-world medical challenges and make data-driven decisions.

---

## About the Instructor

Hi, I’m **Arman Karshenas**, and I’ll be your instructor for the first and last sections of the course. I completed my BA at the University of Oxford followed by an MPhil from the University of Cambridge. I recently finished my PhD in Biophysics at UC Berkeley, where my research focused on using machine learning and computational methods to solve complex biological problems. I’ve taught programming and bioinformatics courses and led initiatives to make these skills accessible to students globally. I am passionate about bridging the gap between computational techniques and medical research, and I’m excited to share my knowledge with you.

--- 

## Course logistics 

Please kindly note the following: 

1. The first section consists of 6 lectures in total, covering the basics of programming in Python. If we have time, I might try to squeeze in some R for those of you who are interested in doing more stats in the future.
2. Classes run from 7 to 10 PM Tehran time. We will meet on Zoom using the following link: [https://berkeley.zoom.us/my/karshenas](https://berkeley.zoom.us/my/karshenas)
3. The class is hands-on, with a lot of examples to help you understand the material. I encourage everyone to raise their hand on Zoom and interrupt me with questions at any point during the class.
4. Each class is accompanied by a homework assignment (HW) that lets you try different and harder problems.
5. I will try to post readings and resources regarding each major topic that we introduce in the course so that you can refer to them later if you are interested.
6. If you have any questions/concerns please email me at: karshenas [AT] berkeley [dot] edu



# The Importance of Data Science in Medical Fields

Data science plays a pivotal role in advancing medical research and practice. Here are some key use cases demonstrating its impact:

---

## 1. FDA Clinical Trials and Drug Development

Data science is integral to the design, analysis, and interpretation of FDA-regulated clinical trials. By leveraging statistical and machine learning techniques, researchers can:

- Optimize trial designs using adaptive methods.
- Predict patient responses to treatments, reducing trial durations and costs.
- Analyze large-scale clinical trial data to identify safety signals and efficacy patterns.

*Example:* The FDA’s Sentinel Initiative uses data science to monitor the safety of medical products by analyzing real-world evidence.  
**Reference:** [FDA Sentinel Initiative](https://www.fda.gov/safety/fdas-sentinel-initiative)

---

## 2. Genome-Wide Association Studies (GWAS)

GWAS uses large datasets to identify genetic variants associated with specific traits or diseases. Data science enables:

- The processing and analysis of vast genomic datasets.
- Identification of genetic markers linked to diseases like diabetes and cancer.
- Visualization of association results in Manhattan plots to pinpoint significant genetic loci.

*Example:* The GWAS Catalog hosted by EMBL-EBI is a curated collection of published genome-wide association studies covering a wide range of traits and health conditions.  
**Reference:** [EMBL-EBI](https://www.ebi.ac.uk/gwas/)

---

## 3. Genotype-Phenotype Mapping and Association Tests

Understanding how genetic variations influence phenotypic traits is crucial for precision medicine. Data science facilitates:

- Conducting association tests between genotypes and phenotypes.
- Modeling complex interactions between multiple genetic variants.
- Developing polygenic risk scores to predict disease susceptibility.
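
A polygenic risk score is commonly computed as a weighted sum: each variant's effect size multiplied by the number of risk alleles a person carries. The sketch below illustrates that arithmetic only. The variant names, effect sizes, and allele counts are hypothetical values made up for illustration, not results from any study.

```python
# Hypothetical effect sizes (for example, log odds ratios from a GWAS)
effect_sizes = {"variant_1": 0.12, "variant_2": -0.05, "variant_3": 0.30}

# Hypothetical patient: number of risk alleles (0, 1, or 2) at each variant
allele_counts = {"variant_1": 2, "variant_2": 1, "variant_3": 0}


def polygenic_risk_score(effect_sizes, allele_counts):
    """Weighted sum of risk-allele counts; variants missing for the patient count as 0."""
    return sum(
        beta * allele_counts.get(variant, 0)
        for variant, beta in effect_sizes.items()
    )


score = polygenic_risk_score(effect_sizes, allele_counts)
print(f"Polygenic risk score: {score:.2f}")
```

A raw score like this is only meaningful relative to the distribution of scores in a reference population.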

---

These examples underscore the transformative potential of data science in medical fields, from drug development to personalized medicine. By mastering the tools and techniques in this course, you will be equipped to contribute to such impactful applications.


# A Closer Look at the FDA Trial Example

When studying the effect of a drug, such as Drug A, which is thought to reduce blood pressure, we need a systematic way to determine whether the observed effect is due to the drug or just random chance. Statistical tests allow us to:

- Quantify uncertainty in the results.
- Make evidence-based conclusions.
- Minimize the influence of bias and random noise.

Without statistical tests, we risk drawing incorrect conclusions that could lead to ineffective or even harmful treatments being administered.

---

## Control and Treatment Groups

To evaluate the effectiveness of Drug A, we divide the population into two groups:

- **Treatment Group**: Receives Drug A.
- **Control Group**: Does not receive the drug (may receive a placebo instead).

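This setup helps isolate the effect of the drug by comparing outcomes between the two groups. When participants are assigned to the groups at random, other factors (such as age or baseline blood pressure) are balanced between them on average, so a difference in outcomes can be attributed to the drug.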
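
Splitting patients into the two groups is often done by random assignment. Here is a minimal sketch using only Python's standard library. The patient IDs and the helper name `assign_groups` are hypothetical, and real trials typically use more elaborate randomization schemes.

```python
import random


def assign_groups(patient_ids, seed=0):
    """Randomly split patients into treatment and control groups of (near-)equal size."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = list(patient_ids)  # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical patient IDs, used only for illustration
patients = [f"P{i:03d}" for i in range(1, 21)]
treatment, control = assign_groups(patients)

print("Treatment group:", treatment)
print("Control group:  ", control)
```

Because the shuffle ignores everything about the patients, no characteristic can systematically end up in one group.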

---

## Hypothesis Testing and Errors

### Hypotheses
- **Null Hypothesis (H₀):** Drug A has no effect on blood pressure.
- **Alternative Hypothesis (H₁):** Drug A reduces blood pressure.

### Errors
- **Type I Error (False Positive):** Rejecting H₀ when it is true (concluding Drug A is effective when it’s not).
- **Type II Error (False Negative):** Failing to reject H₀ when it is false (concluding Drug A is not effective when it is).
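
Both error types can be made concrete with a small simulation. The sketch below assumes a binary outcome per patient (improved or not) and uses a one-sided two-proportion z-test at the 5% level as the decision rule. The test choice, the improvement rates, and the sample sizes are all assumptions picked for illustration.

```python
import math
import random
from statistics import NormalDist


def rejects_h0(s_t, s_c, n, alpha=0.05):
    """One-sided two-proportion z-test with n patients per group.

    s_t, s_c: number of patients who improved in treatment / control.
    Returns True when H0 (no effect) is rejected in favour of H1.
    """
    pooled = (s_t + s_c) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False  # no variation in the data: no evidence against H0
    z = (s_t / n - s_c / n) / se
    return z > NormalDist().inv_cdf(1 - alpha)


def rejection_rate(rate_t, rate_c, n=100, n_trials=2000, seed=1):
    """Fraction of simulated trials in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        s_t = sum(rng.random() < rate_t for _ in range(n))
        s_c = sum(rng.random() < rate_c for _ in range(n))
        rejections += rejects_h0(s_t, s_c, n)
    return rejections / n_trials


# H0 true (same improvement rate): every rejection is a Type I error
type_1_rate = rejection_rate(0.5, 0.5)
# H0 false (drug helps): every failure to reject is a Type II error
type_2_rate = 1 - rejection_rate(0.6, 0.5)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should land near the chosen 5% level, while the Type II rate depends on the effect size and on the number of patients.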

---

## p-Value and Estimators

- **p-Value**: The probability of observing results at least as extreme as those in the study, assuming H₀ is true. A low p-value (typically <0.05) is taken as evidence against H₀.
- **Estimators**: Quantities computed from the sample data, such as a sample mean or proportion, that are used to estimate the effect size of Drug A.
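
As a rough illustration of how a p-value is obtained, the sketch below applies a one-sided two-proportion z-test to made-up counts. The numbers (60 of 100 treated patients improving versus 45 of 100 controls) and the choice of test are assumptions for illustration, not data from a real trial.

```python
import math
from statistics import NormalDist


def one_sided_p_value(successes_t, n_t, successes_c, n_c):
    """p-value for H1: the improvement rate is higher in the treatment group."""
    p_t = successes_t / n_t
    p_c = successes_c / n_c
    # Under H0 both groups share one rate, estimated by pooling the data
    pooled = (successes_t + successes_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    # Probability of a z-score at least this large if H0 were true
    return 1 - NormalDist().cdf(z)


# Hypothetical counts: patients whose blood pressure decreased
p_value = one_sided_p_value(successes_t=60, n_t=100, successes_c=45, n_c=100)
print(f"p-value: {p_value:.4f}")
```

With these counts the p-value falls below 0.05, so H₀ would be rejected at the conventional threshold.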

Suppose a sample of the population receives Drug A. For each patient in that sample, let the effect of the drug $ D $ be a binary variable:
- $ D = 1 $: If blood pressure decreases.
- $ D = 0 $: Otherwise.

The **mean effect** of the drug is given by:  
$$
\mu_D = \frac{1}{n} \sum_{i=1}^n D_i
$$  
where $ n $ is the sample size.

The **standard deviation** of $ D $ is:  
$$
\sigma_D = \sqrt{\mu_D (1 - \mu_D)}
$$

The **95% confidence interval (CI)** for $ \mu_D $ can be calculated as:  
$$
\text{95\% CI} = \mu_D \pm 1.96 \times \frac{\sigma_D}{\sqrt{n}}
$$
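
The three formulas above can be evaluated directly in Python. The list `D` below is hypothetical data for 20 treated patients, invented for illustration. With a sample this small the normal approximation behind the factor 1.96 is only rough.

```python
import math

# Hypothetical outcomes: 1 = blood pressure decreased, 0 = otherwise
D = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

n = len(D)                              # sample size
mu_D = sum(D) / n                       # mean effect (proportion who improved)
sigma_D = math.sqrt(mu_D * (1 - mu_D))  # standard deviation of a binary variable

# 95% confidence interval for mu_D
half_width = 1.96 * sigma_D / math.sqrt(n)
ci_low, ci_high = mu_D - half_width, mu_D + half_width

print(f"mean effect: {mu_D:.2f}")
print(f"standard deviation: {sigma_D:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

Increasing `n` shrinks `half_width` in proportion to the square root of the sample size, which is why larger trials give narrower intervals.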

---

### Interpretation
- A narrow CI indicates a precise estimate of the drug’s effect.
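- Because $ \mu_D $ is a proportion between 0 and 1, its CI on its own does not show that the drug works; the estimate has to be compared with the control group. If the CI for the difference in $ \mu_D $ between the treatment and control groups does not include 0, it provides evidence that Drug A significantly affects blood pressure.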
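
To judge efficacy, the proportion of patients who improve on Drug A is usually compared against the proportion who improve in the control group. Below is a minimal sketch, assuming hypothetical counts and the standard normal-approximation interval for a difference of two proportions. The function name and the numbers are made up for illustration.

```python
import math


def diff_in_proportions_ci(successes_t, n_t, successes_c, n_c, z=1.96):
    """Difference in improvement rates (treatment - control) with a 95% CI by default."""
    p_t = successes_t / n_t
    p_c = successes_c / n_c
    diff = p_t - p_c
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: patients whose blood pressure decreased
diff, low, high = diff_in_proportions_ci(60, 100, 45, 100)

print(f"difference: {diff:.2f}, 95% CI: ({low:.3f}, {high:.3f})")
print("CI excludes 0:", low > 0 or high < 0)
```

For these counts the whole interval lies above 0, which would count as evidence of an effect in favour of the treatment group.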

By following this structured approach, we can scientifically assess whether Drug A is effective and quantify its impact on the population.
