
# Regression and Correlation – Module 5

## Big Question:
Can we predict a student's exam score using:
- Study hours
- Sleep hours?

### Today you will learn:
1. What correlation is
2. What regression is
3. How to use regression to make predictions


# 📘 What Is Regression?


**Regression** is a statistical method used to **understand relationships** between variables and to **predict an outcome**.

It answers the question:

> *How does one variable change when another variable changes?* <br>
> *And can we use this relationship to make predictions?*

---

## 🔹 1. What Regression Does

Regression helps us:

* Find a **pattern** in data
* Describe how variables are related
* **Predict future values**

Example:
Using study hours to predict test scores.

---

## 🔹 2. Simple Linear Regression

Simple regression uses **one predictor**.

### Formula:

```
      y = mx + b
```
Where:

* **y** = predicted value (test score)
* **x** = input (study hours)
* **m** = slope
* **b** = intercept

### Meaning:

* The **slope (m)** shows how much the predicted y changes when x increases by 1.
* The **intercept (b)** is the predicted value of y when x = 0.
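
As a minimal sketch of how the slope and intercept turn into a prediction, the snippet below uses made-up values (`m = 5`, `b = 50`). Both numbers and the helper name `predict_score` are assumptions for illustration; they are not fitted from any data.

```python
# Hypothetical line: these values are assumptions, not fitted from data.
m = 5.0   # slope: points gained per extra study hour
b = 50.0  # intercept: predicted score at 0 study hours

def predict_score(study_hours):
    """Return the predicted score y = m*x + b for a number of study hours."""
    return m * study_hours + b

score_3h = predict_score(3)  # 5*3 + 50
score_4h = predict_score(4)  # 5*4 + 50

print("3 hours:", score_3h)
print("4 hours:", score_4h)
# One extra hour changes the prediction by exactly the slope m.
print("Change per extra hour:", score_4h - score_3h)
```

Setting `study_hours` to 0 returns the intercept, which is why `b` is read as the prediction at x = 0.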

---

## 🔹 3. Multiple Regression

Multiple regression uses **two or more predictors**.

### Formula:
```
      y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
```
Example:
```
      Score = b_0 + b_1(Study) + b_2(Sleep)
```
Each coefficient shows the effect of one variable **while holding the others constant**.
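
To make "holding the others constant" concrete, here is a small sketch with hypothetical coefficients (`b0 = 40`, `b1 = 4`, `b2 = 2`). These values are assumptions chosen for easy arithmetic, not results from a fitted model.

```python
# Hypothetical coefficients: assumptions for illustration only.
b0 = 40.0  # intercept
b1 = 4.0   # effect of one extra study hour
b2 = 2.0   # effect of one extra sleep hour

def predict_score(study, sleep):
    """Score = b0 + b1*Study + b2*Sleep."""
    return b0 + b1 * study + b2 * sleep

base = predict_score(study=5, sleep=7)
more_study = predict_score(study=6, sleep=7)  # only Study changes
more_sleep = predict_score(study=5, sleep=8)  # only Sleep changes

print("Baseline:", base)
print("Effect of +1 study hour (sleep fixed):", more_study - base)  # equals b1
print("Effect of +1 sleep hour (study fixed):", more_sleep - base)  # equals b2
```

Changing one input while freezing the other isolates that input's coefficient.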

---

## 🔹 4. Regression Line

The regression line is the **best-fit line**: the line that minimizes the total squared error between predicted and actual values (the least-squares criterion).

It shows the **trend** of the data.
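
The idea of a best-fit line can be sketched with the standard least-squares formulas for one predictor. The five data points below are a made-up toy example, and `sum_squared_error` is a helper name introduced here for illustration.

```python
import numpy as np

# Toy data (hypothetical): study hours and test scores for five students.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([55.0, 61.0, 70.0, 74.0, 85.0])

# Least-squares slope and intercept for a single predictor.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

def sum_squared_error(m, b):
    """Total squared gap between the actual y and the line m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))

best_sse = sum_squared_error(slope, intercept)
tilted_sse = sum_squared_error(slope + 1.0, intercept)   # steeper line
shifted_sse = sum_squared_error(slope, intercept + 1.0)  # line moved up

print("slope:", slope, "intercept:", intercept)
print("error of best-fit line:", best_sse)
print("error after tilting:   ", tilted_sse)
print("error after shifting:  ", shifted_sse)
```

Any other line, whether tilted or shifted, gives a larger total squared error than the least-squares line.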

---

## 🔹 5. Why Regression Is Important

Used in:

* Business (predict sales)
* Health (predict risk)
* Education (predict grades)
* Economics (forecast trends)



# 📘 What Is Correlation?



**Correlation** is a statistical measure that shows **how strongly two variables are related** and **in which direction they move**.

It answers the question:

> *When one variable changes, does the other also change?*

---

## 🔹 1. Direction of Correlation

| Type         | Meaning                                 | Example                        |
| ------------ | --------------------------------------- | ------------------------------ |
| **Positive** | Both variables increase together        | More study → higher score      |
| **Negative** | One increases while the other decreases | More phone time → lower grades |
| **Zero**     | No clear relationship                   | Shoe size → intelligence       |

---

## 🔹 2. Strength of Correlation

Correlation is measured using **r**, the (Pearson) correlation coefficient, which ranges from -1 to +1.



| r Value        | Strength        |
| -------------- | --------------- |
| ±0.90 to ±1.00 | Very strong     |
| ±0.70 to ±0.89 | Strong          |
| ±0.40 to ±0.69 | Moderate        |
| ±0.10 to ±0.39 | Weak            |
| 0.00           | No relationship |

The sign (+ or -) shows the **direction**.
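
As a rough sketch, r can be computed directly from its definition and then labelled with the bands from the table above. The helper names `pearson_r` and `describe_strength` and the toy data are assumptions for illustration. The table does not label values between 0.00 and 0.10, so the sketch calls them "Little or no relationship".

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

def describe_strength(r):
    """Label r using the strength bands from the table; the sign gives direction."""
    size = abs(r)
    if size >= 0.90:
        label = "Very strong"
    elif size >= 0.70:
        label = "Strong"
    elif size >= 0.40:
        label = "Moderate"
    elif size >= 0.10:
        label = "Weak"
    else:
        label = "Little or no relationship"  # assumption: band not in the table
    direction = "positive" if r > 0 else "negative" if r < 0 else "no direction"
    return f"{label}, {direction}"

# Toy data (hypothetical): study hours and test scores.
study = [1, 2, 3, 4, 5]
score = [55, 61, 70, 74, 85]

r = pearson_r(study, score)
print(round(r, 3), "->", describe_strength(r))
```

The hand-written formula should agree with `np.corrcoef(study, score)[0, 1]`.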

---



## 🔹 4. Important Rule

> **Correlation does NOT mean causation.**

Just because two things move together does not mean one causes the other.

**Example:**
Ice cream sales and drowning both increase in summer.
They are correlated, but ice cream does not cause drowning.
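
A small simulation can illustrate the ice cream example. Everything below is synthetic: the temperatures, the coefficients, and the noise levels are all made-up assumptions, and `remove_trend` is a helper introduced here. Temperature drives both quantities, and neither one affects the other.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Synthetic data: every number here is an assumption for illustration.
temperature = rng.uniform(10, 35, size=200)                        # daily temperature
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=200)   # depends on temperature
drownings = 0.3 * temperature + rng.normal(0, 1.0, size=200)       # also depends on temperature

# The two outcomes move together even though neither causes the other.
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def remove_trend(values, driver):
    """Subtract the straight-line effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# After removing the temperature effect from both, little correlation remains.
r_adjusted = np.corrcoef(
    remove_trend(ice_cream_sales, temperature),
    remove_trend(drownings, temperature),
)[0, 1]

print("r between sales and drownings:      ", round(r_raw, 2))
print("r after accounting for temperature: ", round(r_adjusted, 2))
```

The shared cause (temperature) is what produces the correlation in this sketch.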

---

## 🔹 5. Why Correlation Matters

* Helps find relationships
* Helps choose variables for regression
* Used in science, business, health, economics

---


# CODE

In [None]:
# Import Tools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Our Dataset

Each row is one student.
Each column is a factor that may affect exam scores.


In [None]:
# 10 data points (one row per student)
data = {
    "StudyHours": [2, 4, 6, 3, 7, 8, 5, 1, 9, 4],
    "SleepHours": [5, 6, 7, 4, 8, 9, 6, 3, 8, 5],
    "TestScore": [60, 70, 85, 65, 90, 95, 78, 50, 92, 72]
}

df = pd.DataFrame(data)
df

In [None]:
# Correlation Table
df.corr()
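
To see what the correlation table contains, the same pairwise r values can be reproduced with NumPy alone. This is a sketch that repeats the dataset above so it runs on its own; the variable names are chosen here for illustration.

```python
import numpy as np

# Same values as the dataset above, repeated so this cell is self-contained.
study = [2, 4, 6, 3, 7, 8, 5, 1, 9, 4]
sleep = [5, 6, 7, 4, 8, 9, 6, 3, 8, 5]
score = [60, 70, 85, 65, 90, 95, 78, 50, 92, 72]

names = ["StudyHours", "SleepHours", "TestScore"]
corr = np.corrcoef([study, sleep, score])  # each row is treated as one variable

# Print the matrix row by row, like a correlation table.
for name, row in zip(names, corr):
    print(f"{name:<11}", np.round(row, 3))

r_study_score = corr[0, 2]  # StudyHours vs TestScore
r_sleep_score = corr[1, 2]  # SleepHours vs TestScore
r_study_sleep = corr[0, 1]  # StudyHours vs SleepHours
```

The diagonal is always 1 (each variable against itself), and the matrix is symmetric, so each pair appears twice.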

In [None]:
# Scatter Plot
plt.scatter(df["StudyHours"], df["TestScore"])
plt.xlabel("Study Hours")
plt.ylabel("Test Score")
plt.title("Study Hours vs Test Score")
plt.show()


## Simple Linear Regression

We draw the best straight line through the data.

Formula:
y = mx + b

- m = slope → change in predicted score per extra study hour
- b = intercept → predicted score when study hours = 0


In [None]:
# Train Simple Model
X = df[["StudyHours"]]
y = df["TestScore"]

model = LinearRegression()
model.fit(X, y)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

Slope: 5.610837438423648
Intercept: 48.20689655172413
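
As a cross-check, the same line can be refit without scikit-learn and used for a prediction. This sketch uses `np.polyfit` as a stand-in for `LinearRegression`; the 5-hour student is a hypothetical example.

```python
import numpy as np

# Same values as the dataset above, repeated so this cell is self-contained.
study = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4], dtype=float)
score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72], dtype=float)

# Degree-1 polynomial fit = least-squares straight line; returns [slope, intercept].
slope, intercept = np.polyfit(study, score, 1)
print("Slope:", slope)
print("Intercept:", intercept)

# Prediction for a hypothetical student who studies 5 hours: y = m*x + b.
hours = 5.0
predicted_score = slope * hours + intercept
print("Predicted score for 5 hours:", predicted_score)

# Residuals: actual score minus the score the line predicts for each student.
residuals = score - (slope * study + intercept)
print("Residuals:", np.round(residuals, 2))
```

The slope and intercept should agree with the printed model output up to floating-point rounding.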


In [None]:
# Regression Line
plt.scatter(X, y)
plt.plot(X, model.predict(X))
plt.xlabel("Study Hours")
plt.ylabel("Test Score")
plt.title("Regression Line")
plt.show()

## Multiple Regression

Now we use TWO predictors:
- Study Hours
- Sleep Hours

Formula:
y = b0 + b1(Study) + b2(Sleep)

Each coefficient shows the effect of one variable
while the other remains constant.


In [None]:
# Train Multiple Model
X_multi = df[["StudyHours", "SleepHours"]]

model_multi = LinearRegression()
model_multi.fit(X_multi, y)

print("Intercept:", model_multi.intercept_)
print("Coefficients:", model_multi.coef_)

In [None]:
# Make Prediction
new_data = pd.DataFrame({"StudyHours":[6], "SleepHours":[2]})
model_multi.predict(new_data)
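
To show what the prediction does under the hood, here is a sketch that fits the same two-predictor model with `np.linalg.lstsq` and then applies the formula by hand. It stands in for `LinearRegression` rather than reproducing its internals, and the variable names are chosen here for illustration.

```python
import numpy as np

# Same values as the dataset above, repeated so this cell is self-contained.
study = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4], dtype=float)
sleep = np.array([5, 6, 7, 4, 8, 9, 6, 3, 8, 5], dtype=float)
score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72], dtype=float)

# Design matrix: a column of ones (for the intercept b0) plus one column per predictor.
A = np.column_stack([np.ones_like(study), study, sleep])

# Least-squares solution for [b0, b1, b2].
coeffs, *_ = np.linalg.lstsq(A, score, rcond=None)
b0, b1, b2 = coeffs
print("Intercept:", b0)
print("Coefficients:", b1, b2)

# Apply y = b0 + b1(Study) + b2(Sleep) to the new student by hand.
new_study, new_sleep = 6.0, 2.0
prediction = b0 + b1 * new_study + b2 * new_sleep
print("Predicted score:", prediction)
```

Note that 2 hours of sleep is below the smallest value in the dataset (3), so this prediction extrapolates beyond the observed data and should be read with caution.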

# Reflection

1. Which affects the score more: study or sleep?
2. What does the slope mean?
3. Why is multiple regression more realistic?

