# Lecture 1 - Introduction to Programming, Statistics, Data Science, and Machine Learning for MD Students

Welcome to the first lecture of this course! Over the coming weeks, we will dive into the foundational concepts of programming, statistics, data science, and machine learning. This course is tailored to equip MD students with the computational and analytical skills needed to tackle real-world medical challenges and make data-driven decisions.

---

## About the Instructor

Hi, I’m **Arman Karshenas**, and I’ll be your instructor for the first and last sections of the course. I completed my BA at the University of Oxford, followed by an MPhil at the University of Cambridge. I recently finished my PhD in Biophysics at UC Berkeley, where my research focused on using machine learning and computational methods to solve complex biological problems. I’ve taught programming and bioinformatics courses and led initiatives to make these skills accessible to students globally. I am passionate about bridging the gap between computational techniques and medical research, and I’m excited to share my knowledge with you.

--- 

## Course logistics 

Please kindly note the following: 

1. The first section consists of 6 lectures in total, and we will be going over the basics of programming in Python. If we have time, I might try to squeeze in some R for those of you who are interested in doing more stats in the future.
2. The classes are scheduled to run from 7-10 PM Tehran time and we will meet on Zoom using the following link [https://berkeley.zoom.us/my/karshenas](https://berkeley.zoom.us/my/karshenas)
3. The class is hands-on with a lot of examples to help you better understand and I encourage everyone to raise their hands on Zoom and interrupt me to ask questions at any point during the class.
4. All the course materials are going to be in English - I think it is good practice to learn the formal and proper words and vocabulary of the areas we are going to cover. Please do interrupt me or send me a message if you are struggling with this. 
5. I will be publishing course materials and HWs in Jupyter notebook format. You have the option to use [JupyterLab](https://jupyter.org/install) or [Google Colab](https://colab.research.google.com) to work with these notebooks (more on the setup later).
6. Each class is accompanied by a HW that is there for you to try different and harder problems.
7. I will try to post readings and resources regarding each major topic that we introduce in the course so that you can refer to them later if you are interested.
8. If you have any questions/concerns please email me at: karshenas [AT] berkeley [dot] edu



# The Importance of Data Science in Medical Fields

Data science plays a pivotal role in advancing medical research and practice. Here are some key use cases demonstrating its impact:

---

## 1. FDA Clinical Trials and Drug Development

Data science is integral to the design, analysis, and interpretation of FDA-regulated clinical trials. By leveraging statistical and machine learning techniques, researchers can:

- Optimize trial designs using adaptive methods.
- Predict patient responses to treatments, reducing trial durations and costs.
- Analyze large-scale clinical trial data to identify safety signals and efficacy patterns.

*Example:* The FDA’s Sentinel Initiative uses data science to monitor the safety of medical products by analyzing real-world evidence.  
**Reference:** [FDA Sentinel Initiative](https://www.fda.gov/safety/fdas-sentinel-initiative)

---

## 2. Genome-Wide Association Studies (GWAS)

GWAS uses large datasets to identify genetic variants associated with specific traits or diseases. Data science enables:

- The processing and analysis of vast genomic datasets.
- Identification of genetic markers linked to diseases like diabetes and cancer.
- Visualization of association results in Manhattan plots to pinpoint significant genetic loci.

*Example:* The NHGRI-EBI GWAS Catalog is a curated collection of published genome-wide association studies covering a wide range of traits and health conditions.  
**Reference:** [GWAS Catalog](https://www.ebi.ac.uk/gwas/)

---

## 3. Genotype-Phenotype Mapping and Association Tests

Understanding how genetic variations influence phenotypic traits is crucial for precision medicine. Data science facilitates:

- Conducting association tests between genotypes and phenotypes.
- Modeling complex interactions between multiple genetic variants.
- Developing polygenic risk scores to predict disease susceptibility.

---

These examples underscore the transformative potential of data science in medical fields, from drug development to personalized medicine. By mastering the tools and techniques in this course, you will be equipped to contribute to such impactful applications.


# Statistical Testing and Hypothesis Evaluation for Drug A

In evaluating the effectiveness of Drug A in reducing blood pressure, statistical testing provides a structured framework for determining whether observed changes in blood pressure are due to the drug or random chance.

---

## Hypothesis Testing for Drug A

To assess the impact of Drug A, we define the following hypotheses:

- **Null Hypothesis ($H_0$):** Drug A has no effect on blood pressure (the mean blood pressure in the treatment and control groups is the same).
- **Alternative Hypothesis ($H_1$):** Drug A reduces blood pressure (the mean blood pressure in the treatment group is lower than in the control group).

Using statistical testing, we analyze the collected data to decide whether to reject $H_0$ in favor of $H_1$. The decision is guided by the **p-value**, which is compared against a pre-specified significance level $\alpha$.

---

## Error Types in the Context of Drug A

### Type I Error (False Positive)
- **Definition:** Concluding that Drug A reduces blood pressure when, in reality, it does not.
- **Example:** Due to random variation, we observe a significant decrease in blood pressure in the treatment group, but this effect is not truly caused by the drug.
- **Probability:** Denoted by $\alpha$, typically set at 0.05 (5%).

### Type II Error (False Negative)
- **Definition:** Failing to detect that Drug A reduces blood pressure when it actually does.
- **Example:** Insufficient sample size or high variability leads to an inability to observe a statistically significant difference.
- **Probability:** Denoted by $\beta$.

### Power of the Test
The power of the test is $1 - \beta$, the probability of correctly detecting that Drug A reduces blood pressure when it truly does. Higher power means a real effect is less likely to be missed.
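The two error rates and the power can be estimated by simulation. The sketch below is only an illustration under stated assumptions, none of which come from a real study: 50 participants per group, a baseline mean of 140 mmHg, a known standard deviation of 10 mmHg, and a one-sided z-test in place of a t-test to keep the code within the standard library. The function names `rejects_h0` and `rejection_rate` are made up for this example.

```python
import random
from statistics import NormalDist, fmean


def rejects_h0(treatment, control, sigma, alpha=0.05):
    """One-sided z-test with known sigma: reject H0 if the treatment mean is significantly lower."""
    standard_error = sigma * (1 / len(treatment) + 1 / len(control)) ** 0.5
    z = (fmean(treatment) - fmean(control)) / standard_error
    return NormalDist().cdf(z) < alpha


def rejection_rate(true_reduction, n=50, sigma=10.0, n_trials=2000, seed=1):
    """Simulate many trials and return the fraction in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        control = [rng.gauss(140, sigma) for _ in range(n)]
        treatment = [rng.gauss(140 - true_reduction, sigma) for _ in range(n)]
        rejections += rejects_h0(treatment, control, sigma)
    return rejections / n_trials


# Drug has no effect: every rejection is a Type I error (rate should be near alpha).
type_i_rate = rejection_rate(true_reduction=0.0)

# Drug lowers blood pressure by 5 mmHg: every rejection is a correct detection (power).
power = rejection_rate(true_reduction=5.0)

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power:             {power:.3f}")
print(f"Estimated Type II error:     {1 - power:.3f}")
```

Under these assumptions the Type I error rate should land close to 0.05 and the power close to 0.80, with some run-to-run noise because the estimates come from a finite number of simulated trials.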

---

## p-Value Interpretation for Drug A

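The **p-value** is the probability of observing a difference at least as extreme as the one in our data if $H_0$ were true. It measures the strength of evidence against $H_0$:
- **Small p-value (e.g., $< 0.05$):** The data would be unlikely under $H_0$, so we reject $H_0$ and conclude there is evidence that Drug A reduces blood pressure.
- **Large p-value (e.g., $\geq 0.05$):** Insufficient evidence to conclude that Drug A is effective. This does not prove that the drug has no effect.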
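To make the idea of a p-value concrete, here is a minimal sketch of a one-sided permutation test, a technique chosen here because it needs only the standard library; it is not necessarily the test used later in the course. The blood pressure readings are invented for illustration, and `permutation_p_value` is a hypothetical helper name.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """One-sided permutation test for H1: mean(treatment) < mean(control)."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if diff <= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented systolic blood pressure readings (mmHg), for illustration only.
treatment = [128, 131, 125, 122, 130, 127, 124, 129, 126, 123]
control = [135, 138, 132, 140, 134, 137, 133, 139, 136, 131]

observed_difference = mean(treatment) - mean(control)
p_value = permutation_p_value(treatment, control)

print(f"Observed difference in means: {observed_difference:.1f} mmHg")
print(f"One-sided p-value: {p_value:.4f}")
```

The p-value here is the fraction of label shufflings that produce a difference at least as large as the observed one, which is exactly the "how surprising is this under $H_0$" question described above.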

---

## Role of Sample Size in Testing Drug A

### Larger Sample Sizes
- Reduce the variability (standard error) of the estimated mean blood pressure in each group.
- Increase the precision of the estimated effect of Drug A.
- Enhance the power of the test to detect a true effect.

### Smaller Sample Sizes
- Increase the likelihood of a Type II error, potentially missing the true effect of Drug A.
- Lead to less reliable conclusions.

For example, if we expect Drug A to reduce blood pressure by 5 mmHg, a sample size calculation ensures we have enough participants to detect this effect with sufficient power.
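The effect of sample size on precision can be seen through the standard error of a group mean, $\sigma / \sqrt{n}$. The short sketch below assumes a standard deviation of 10 mmHg; the sample sizes are arbitrary choices for illustration.

```python
import math

sigma = 10.0  # assumed standard deviation of blood pressure (mmHg)


def standard_error(n, sd=sigma):
    """Standard error of the mean for a group of n participants."""
    return sd / math.sqrt(n)


# Each fourfold increase in n halves the standard error.
for n in (10, 40, 160, 640):
    print(f"n = {n:>3}: standard error = {standard_error(n):.2f} mmHg")
```

Because the standard error shrinks with the square root of $n$, halving the uncertainty requires four times as many participants.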

---

## Bias and Its Impact

### Selection Bias
If the treatment and control groups are not randomly assigned, differences in blood pressure could be attributed to factors other than Drug A.

### Measurement Bias
Inaccurate or inconsistent blood pressure measurements can obscure the true effect of the drug.

### Confounding Bias
Variables like age, weight, or baseline health conditions could influence blood pressure and confound the observed effect of Drug A.
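Random assignment is the standard guard against selection bias, and it also tends to balance confounders across groups. A minimal sketch, assuming 20 participants with made-up IDs and a hypothetical helper called `randomize`:

```python
import random


def randomize(participant_ids, seed=42):
    """Shuffle participants and split them evenly into treatment and control."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Made-up participant IDs: P01 ... P20
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participants)

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because the assignment depends only on the random shuffle, neither the investigator nor the participant can influence who receives Drug A.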

---

## Power Analysis for Drug A

If we hypothesize that Drug A reduces blood pressure by 5 mmHg with a standard deviation of 10 mmHg, and we want 80% power to detect this difference at $\alpha = 0.05$, power analysis can help determine the required sample size.

### Example Calculation
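For a two-sample comparison of means (a normal approximation to the two-sample t-test), the required sample size **per group** is:
$$
n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot 2\sigma^2}{\Delta^2}
$$
where:
- $Z_{\alpha/2}$ is the standard normal critical value for a two-sided test at significance level $\alpha$ (about 1.96 for $\alpha = 0.05$).
- $Z_\beta$ is the standard normal quantile corresponding to the desired power $1 - \beta$ (about 0.84 for 80% power).
- $\sigma$ is the standard deviation of blood pressure in each group (10 mmHg in this example).
- $\Delta$ is the expected difference in means (5 mmHg).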

This ensures the study is adequately powered to detect the effect of Drug A on blood pressure.
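The formula can be evaluated directly with Python's standard library. The sketch below plugs in the numbers from this example ($\Delta = 5$ mmHg, $\sigma = 10$ mmHg, $\alpha = 0.05$, 80% power); the function name `sample_size_per_group` is made up here, and the result is a normal-approximation estimate rather than an exact t-test calculation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Participants needed in each group, using the normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up: we cannot recruit a fraction of a person


n_per_group = sample_size_per_group(delta=5, sigma=10)
print(n_per_group)  # 63
```

So roughly 63 participants per group (126 in total) would be needed under these assumptions. Halving the expected effect to 2.5 mmHg would roughly quadruple that number, because $\Delta$ enters the formula squared.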

---

By grounding hypothesis testing, p-values, errors, and power analysis in this specific example, we can ensure the study on Drug A is designed to produce reliable and interpretable results.


# Introduction to Programming and Python

Programming is the foundation of modern data analysis, automation, and problem-solving. It enables us to write instructions that computers can execute to perform tasks efficiently. Understanding programming is a key skill for anyone looking to work with data, develop software, or create innovative solutions.

---

## Types of Programming Languages

Programming languages can be broadly categorized into:

1. **Low-Level Languages**:
   - Close to machine code (e.g., Assembly).
   - Fast execution but harder to read and write.

2. **High-Level Languages**:
   - Human-readable and easier to use (e.g., Python, Java, R).
   - Slightly slower but ideal for most modern applications.

3. **Scripting Languages**:
   - Designed for automating tasks and writing short programs (e.g., Python, JavaScript).

4. **Compiled vs. Interpreted**:
   - **Compiled Languages** (e.g., C, C++): Converted into machine code before execution.
   - **Interpreted Languages** (e.g., Python, R): Executed by an interpreter at runtime instead of being compiled to machine code ahead of time.

---

## Why Python?

Python is one of the most popular programming languages for beginners and experts alike. Its simplicity and versatility have made it a favorite in fields ranging from web development to data science and machine learning.

### Pros of Python:
- **Easy to Learn and Use**: Python's syntax is simple and intuitive, making it ideal for beginners.
- **Versatile**: Used for web development, data analysis, artificial intelligence, scientific computing, and more.
- **Rich Ecosystem**: Thousands of libraries and frameworks (e.g., NumPy, Pandas, TensorFlow) make it powerful for various applications.
- **Cross-Platform**: Works seamlessly on Windows, macOS, and Linux.
- **Large Community**: A vast community ensures extensive support, tutorials, and resources.

Python's philosophy of readability and efficiency has led to its widespread adoption across industries.

--- 



## Installing Python on Your Local Computer

Follow these steps to install Python on your local machine, regardless of whether you are using Windows, macOS, or Linux.

---

## Step 1: Check if Python is Already Installed

1. Open your terminal:
   - **Windows**: Command Prompt or PowerShell.
   - **macOS/Linux**: Terminal.
2. Type:
   ```bash
   python --version
   ```
   or:
   ```bash
   python3 --version
   ```
3. If Python is installed, the version number will be displayed. If not, proceed to the next step to install Python.

---

## Step 2: Install Python

### For Windows:
1. Visit the [official Python website](https://www.python.org/downloads/).
2. Download the latest Python version for Windows.
3. Run the installer:
   - **Important:** Check the box for "Add Python to PATH."
   - Click "Install Now" or customize the installation settings if needed.
4. Verify the installation by typing:
   ```bash
   python --version
   ```

### For macOS:
1. Older macOS versions shipped with Python 2, which is no longer supported; recent versions (macOS 12.3 and later) do not include it. Install Python 3 as follows.
2. Install **Homebrew** if it’s not installed:
   ```bash
   /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
   ```
3. Use Homebrew to install Python 3:
   ```bash
   brew install python
   ```
4. Verify the installation by typing:
   ```bash
   python3 --version
   ```

### For Linux:
1. Open your terminal.
2. Use your distribution's package manager to install Python:
   - **Ubuntu/Debian**:
     ```bash
     sudo apt update
     sudo apt install python3
     ```
   - **Fedora**:
     ```bash
     sudo dnf install python3
     ```
   - **Arch Linux**:
     ```bash
     sudo pacman -S python
     ```
3. Verify the installation by typing:
   ```bash
   python3 --version
   ```

---

## Step 3: Install and Set Up JupyterLab (Optional)

JupyterLab is a browser-based interactive development environment commonly used in data science, machine learning, and scientific computing. It allows you to write and run Python code in interactive notebooks.

### Installing JupyterLab

1. Ensure Python is installed and working on your machine.
2. Install JupyterLab using `pip` (Python’s package manager):
   ```bash
   pip install jupyterlab
   ```

### Starting JupyterLab

1. Open a terminal or command prompt.
2. Type the following command to launch JupyterLab:
   ```bash
   jupyter lab
   ```
3. JupyterLab will open in your default web browser. If it doesn’t, copy the URL displayed in the terminal and paste it into your browser.

### Basic Usage of JupyterLab

- **Create a New Notebook**: Click the "Python 3" button under "Notebook" to start a new notebook.
- **Write and Run Code**: Type Python code in a cell and press `Shift + Enter` to run it.
- **Save Your Work**: Save your notebook using the `File > Save` option or the save icon.

### Advantages of JupyterLab

- Interactive environment for writing and testing code.
- Support for combining code, text (Markdown), and visualizations in one document.
- Easy integration with popular Python libraries like NumPy, Pandas, and Matplotlib.
- Export notebooks as `.ipynb` or `.html` files for sharing or presentation.

---


## Step 4: Test Your Python Installation

1. Open a terminal or command prompt.
2. Type:
   ```bash
   python3
   ```
   (On Windows, use `python` if `python3` doesn’t work.)
3. In the Python interactive shell, type:
   ```python
   print("Hello, Python!")
   ```
4. If you see:
   ```plaintext
   Hello, Python!
   ```
   Python is successfully installed!
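While you are in the interactive shell, you can also confirm which Python version is actually running. This is a small optional check using the standard `sys` module; the variable name `is_python3` is just for illustration.

```python
import sys

# Full version string of the interpreter that is running this code
print(sys.version)

# True if this is Python 3 or newer
is_python3 = sys.version_info.major >= 3
print("Running Python 3:", is_python3)
```

If this prints `False`, the `python` command on your machine points to an old Python 2 installation, and you should use `python3` instead.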

---

You’re now ready to start coding in Python. Enjoy your programming journey!


```python
# Let's go through a few examples!
# The print function:

print("Hello World!")
print(2+2)
print("2+2 = ",2+2)
```

```plaintext
Hello World!
4
2+2 =  4
```


```python
# Introducing variables

# Variables store values (numbers) or objects (we will get to this later), and they are the very first building blocks of any programming language

# Let's define a variable called name and store your name

name = "Arman"

# Now let's define another variable to store your age!

age = 27

# Now let's print "Arman is 27 years old"

print(name," is ",age," years old")
```

```plaintext
Arman  is  27  years old
```
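Notice the doubled spaces in the output above: `print` already puts a space between its arguments, so the extra spaces inside `" is "` are added on top. Two common ways to get cleaner output are sketched below, using the same `name` and `age` values; f-strings are not introduced elsewhere in this lecture, so treat this as a preview.

```python
name = "Arman"
age = 27

# Option 1: let print insert the spaces between arguments
print(name, "is", age, "years old")

# Option 2: build the sentence with an f-string (values go inside curly braces)
message = f"{name} is {age} years old"
print(message)

# The separator can also be changed with the sep argument
print(name, age, sep=", ")
```

The first two calls both print `Arman is 27 years old`, and the last prints `Arman, 27`.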



# Variables in Python

In Python, **variables** are used to store data values. They act as containers for information that you can use and manipulate throughout your code.

---

## What Are Variables?

A variable is a name that refers to a value. For example:
```python
x = 10
```
Here:
- `x` is the variable name.
- `10` is the value assigned to the variable.

---

## Rules for Naming Variables

1. Variable names can include letters, numbers, and underscores (`_`), but **cannot start with a number**.
   - Valid: `my_var`, `var2`
   - Invalid: `2var`
2. Variable names are **case-sensitive**.
   - `age` and `Age` are two different variables.
3. Python **keywords** (like `if`, `for`, `def`, etc.) are reserved and cannot be used as variable names; trying to do so raises a `SyntaxError`.
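These rules can be checked programmatically. The sketch below uses the built-in string method `isidentifier()` and the standard `keyword` module; the helper name `is_valid_variable_name` is made up for this example.

```python
import keyword


def is_valid_variable_name(name):
    """True if name follows Python's naming rules and is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["my_var", "var2", "2var", "for", "Age"]:
    print(f"{candidate!r}: {is_valid_variable_name(candidate)}")
```

Here `2var` fails because it starts with a number and `for` fails because it is a keyword; the other three names are valid.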

---

## Assigning Values to Variables

You can assign values of different types to variables:
```python
# Integer
x = 5

# Float
pi = 3.14

# String
name = "Alice"

# Boolean
is_active = True
```

---

## Operations Between Variables

Python allows you to perform operations between variables, depending on their types:

### Arithmetic Operations
For numeric variables:
```python
a = 10
b = 20

# Addition
total = a + b  # 30 (named total because sum is a built-in Python function)

# Subtraction
diff = a - b  # -10

# Multiplication
product = a * b  # 200

# Division
quotient = a / b  # 0.5

# Exponentiation
power = a ** 2  # 100
```
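Two more operators come up often alongside the ones above: floor division (`//`) and remainder (`%`). The values below are arbitrary and only illustrate the behavior. Note also that `/` always returns a float, even when the division is exact.

```python
a = 17
b = 5

# True division always gives a float
true_division = a / b      # 3.4

# Floor division drops the fractional part
floor_division = a // b    # 3

# Modulo gives the remainder
remainder = a % b          # 2

# Even an exact division with / returns a float
exact = 10 / 5             # 2.0

print(true_division, floor_division, remainder, exact)
```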

### String Operations
For string variables:
```python
# Concatenation
greeting = "Hello" + " " + "World"  # "Hello World"

# Repetition
repeat = "Hi! " * 3  # "Hi! Hi! Hi! "
```

---

## Variable Types and Type Checking

In Python, variables are **dynamically typed**, meaning you don’t need to declare their type explicitly. Python will infer the type based on the value.

To check the type of a variable:
```python
x = 10
print(type(x))  # Output: <class 'int'>

y = "Hello"
print(type(y))  # Output: <class 'str'>
```

---

## Things Beginners Need to Know

1. **Reassigning Variables**: You can change the value of a variable at any time.
   ```python
   x = 5
   x = "Now I'm a string!"
   ```

2. **Input and Output**:
   - Use `input()` to get user input.
   - Use `print()` to display output.
   ```python
   name = input("Enter your name: ")
   print("Hello, " + name + "!")
   ```

3. **Common Errors**:
   - Using a variable before it is defined:
     ```python
     print(x)  # NameError: name 'x' is not defined
     ```
   - Mismatched operations (e.g., adding a string to a number):
     ```python
     result = "Age: " + 25  # TypeError: can only concatenate str (not "int") to str
     ```

4. **Best Practices**:
   - Use meaningful variable names to make your code easier to read.
     ```python
     # Bad
     a = 25

     # Good
     age = 25
     ```
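The type mismatch shown under Common Errors is usually fixed by converting one value to the other's type. A minimal sketch follows; since `input()` always returns a string, the variable `age_text` below stands in for what a user would type.

```python
age = 25

# Convert the number to a string before concatenating...
message_1 = "Age: " + str(age)

# ...or let an f-string do the conversion
message_2 = f"Age: {age}"

print(message_1)
print(message_2)

# input() returns text, so convert it before doing arithmetic
age_text = "25"  # stands in for: age_text = input("Enter your age: ")
age_next_year = int(age_text) + 1
print(age_next_year)
```

Both messages print `Age: 25`, and the last line prints `26`. Without `int(...)`, the expression `age_text + 1` would raise the same kind of `TypeError` shown above.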

With these basics, you're ready to start working with variables in Python!
