<a href="https://colab.research.google.com/github/epythonlab/epythonlab/blob/master/Machine_Learning_Step_by_Step_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression with Examples


## What is Linear Regression?

**Linear regression** is a commonly used statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The goal is to find the best-fitting line (or hyperplane in the case of multiple independent variables) that minimizes the sum of squared differences between the predicted values and the actual values.

This tutorial walks through a step-by-step example of simple linear regression, that is, linear regression with a single independent variable.

## Define the Problem

**Problem Statement:** Suppose you want to predict a student's score on a test (dependent variable) based on the number of hours they studied (independent variable). You have collected data on several students, including their study hours and test scores, and you want to build a linear regression model to make predictions.

## Step 1: Data Collection



Collect data on study hours (independent variable) and test scores (dependent variable) for a group of students. Here's a simplified dataset:

<table>
<tr><th>Study Hours (X)</th><th>Test Scores (Y)</th></tr>
<tr><td>2</td><td>50</td></tr>
<tr><td>3</td><td>60</td></tr>
<tr><td>4</td><td>70</td></tr>
<tr><td>5</td><td>75</td></tr>
<tr><td>6</td><td>80</td></tr>
</table>

## Step 2: Data Visualization


Visualize the data using a scatter plot to understand the relationship between study hours and test scores. This will help you determine if a linear relationship exists.

In [None]:
import matplotlib.pyplot as plt

# Data
study_hours = [2, 3, 4, 5, 6]
test_scores = [50, 60, 70, 75, 80]

# Scatter plot
plt.scatter(study_hours, test_scores)
plt.title('Study Hours vs. Test Scores')
plt.xlabel('Study Hours')
plt.ylabel('Test Scores')
plt.show()


From the scatter plot, the relationship between study hours and test scores looks approximately linear: as study hours increase, test scores increase as well.
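
One way to back up the visual impression with a number is the Pearson correlation coefficient. The sketch below is an addition to the original notebook. It assumes NumPy is installed, reuses the dataset from Step 1, and uses the names `x`, `y`, and `r`, which are chosen here for illustration.

```python
import numpy as np

# Data from the table in Step 1
x = np.array([2, 3, 4, 5, 6])       # study hours
y = np.array([50, 60, 70, 75, 80])  # test scores

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 (about 0.985 for this dataset) is consistent with a strong positive linear relationship.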

## Step 3: Model Building


Build a linear regression model to fit the data. The equation for a simple linear regression model is:

`Y = b0 + b1 * X`


Where:

- `Y` is the predicted test score.
- `X` is the study hours.
- `b0` is the intercept (constant).
- `b1` is the slope (coefficient) that represents the change in test score for each additional study hour.

You can use various methods to calculate `b0` and `b1`, such as the least squares method. In Python, you can use libraries like `NumPy` or `scikit-learn` to perform linear regression. Here's an example using scikit-learn:

In [None]:
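import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2D feature array of shape (n_samples, n_features),
# so reshape the study hours into a single column
study_hours = np.array(study_hours).reshape(-1, 1)
test_scores = np.array(test_scores)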

# Create a linear regression model
model = LinearRegression()

# Fit the model to the data
model.fit(study_hours, test_scores)

# Get the intercept and slope
b0 = model.intercept_
b1 = model.coef_[0]

print(f"Intercept (b0): {b0}")
print(f"Slope (b1): {b1}")
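
The least squares method mentioned above also has a closed-form solution for a single independent variable. The following is a minimal sketch using only NumPy, assuming the dataset from Step 1. The names `x`, `y`, `b0_manual`, and `b1_manual` are chosen here for illustration and are not part of the original notebook.

```python
import numpy as np

# Data from the table in Step 1
x = np.array([2, 3, 4, 5, 6], dtype=float)       # study hours
y = np.array([50, 60, 70, 75, 80], dtype=float)  # test scores

x_mean = x.mean()
y_mean = y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of X
b1_manual = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: chosen so the fitted line passes through the point of means
b0_manual = y_mean - b1_manual * x_mean

print(f"Intercept (b0): {b0_manual}")  # 37.0
print(f"Slope (b1): {b1_manual}")      # 7.5
```

These values should agree with `model.intercept_` and `model.coef_[0]` from scikit-learn up to floating-point rounding.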


## Step 4: Make Predictions

Once you have the model coefficients (`b0` and `b1`), you can make predictions for test scores based on new study hours.

In [None]:
# Predict test scores for new study hours
new_study_hours = [7, 8, 9]
predicted_scores = model.predict(np.array(new_study_hours).reshape(-1, 1))

for i, hours in enumerate(new_study_hours):
    print(f"Predicted score for {hours} hours of study: {predicted_scores[i]}")
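
To see what `model.predict` does for this kind of model, the sketch below applies `Y = b0 + b1 * X` by hand and compares the result with scikit-learn's output. It refits a separate model on the Step 1 data so that it runs on its own. The names `check_model`, `new_x`, `manual`, and `from_predict` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data from the table in Step 1
x = np.array([2, 3, 4, 5, 6]).reshape(-1, 1)  # study hours as a column
y = np.array([50, 60, 70, 75, 80])            # test scores

check_model = LinearRegression().fit(x, y)

new_x = np.array([7, 8, 9])

# Apply the regression equation Y = b0 + b1 * X directly
manual = check_model.intercept_ + check_model.coef_[0] * new_x

# Same predictions through the scikit-learn API
from_predict = check_model.predict(new_x.reshape(-1, 1))

print(np.allclose(manual, from_predict))
print(manual)
```

The predictions come out at roughly 89.5, 97, and 104.5. All three inputs lie outside the observed range of 2 to 6 hours, so they are extrapolations. If the test has a maximum score of 100 (the dataset does not say), the last value shows that the linear trend cannot continue indefinitely.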


## Step 5: Evaluate the Model


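You should also evaluate the model's performance using appropriate metrics such as mean squared error (MSE), R-squared, or others to assess how well it fits the data. These metrics compare actual values with predictions for the same observations. The new study hours from Step 4 have no known scores, so this example evaluates the model on the original data.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Predict scores for the original study hours so the predictions
# line up one-to-one with the actual test scores
fitted_scores = model.predict(study_hours)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(test_scores, fitted_scores)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate the R-squared (R^2) score
r2 = r2_score(test_scores, fitted_scores)
print(f"R-squared (R^2) score: {r2}")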





- **mean_squared_error** calculates the mean squared error between the actual test scores and the predicted scores. Lower MSE values indicate better model performance.

- **r2_score** calculates the R-squared (R^2) score, which measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). For a least-squares fit with an intercept, evaluated on the data it was fitted to, R^2 lies between 0 and 1, with higher values indicating a better fit. A value of 1 means a perfect fit, and a value of 0 means the model explains none of the variance. On new data, or for a model that fits worse than simply predicting the mean, `r2_score` can be negative.

By calculating and examining these metrics, you can assess how well your linear regression model fits the data and makes predictions. Remember that the choice of evaluation metrics may vary depending on the specific problem you are working on.
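
Both metrics can also be computed directly from their definitions, which makes it clearer what they measure. The sketch below is self-contained and assumes the Step 1 dataset. It uses `np.polyfit` with `deg=1` to obtain the least-squares line, and the names (`fitted`, `residuals`, `mse_manual`, `r2_manual`) are illustrative rather than taken from the original notebook.

```python
import numpy as np

# Data from the table in Step 1
x = np.array([2, 3, 4, 5, 6], dtype=float)       # study hours
y = np.array([50, 60, 70, 75, 80], dtype=float)  # test scores

# For deg=1, np.polyfit returns the least-squares [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

# Residuals: actual minus fitted values
residuals = y - fitted

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f}")
print(f"R^2: {r2_manual:.4f}")
```

For this dataset the MSE is about 3.5 and R^2 is about 0.97, evaluated on the same five points the line was fitted to.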

## Conclusion

This example demonstrates the basic steps of simple linear regression. In practice, you may encounter more complex regression scenarios with multiple independent variables (multiple linear regression) or non-linear relationships (polynomial regression), but the fundamental principles remain similar.
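
As a brief illustration of the polynomial case mentioned above, the sketch below fits both a straight line and a degree-2 polynomial to the same Step 1 data with `np.polyfit` and compares their residual sums of squares. This is an addition to the original notebook, and the names `linear_coeffs`, `quad_coeffs`, `sse_linear`, and `sse_quad` are illustrative.

```python
import numpy as np

# Data from the table in Step 1
x = np.array([2, 3, 4, 5, 6], dtype=float)       # study hours
y = np.array([50, 60, 70, 75, 80], dtype=float)  # test scores

# Least-squares fits of degree 1 (line) and degree 2 (quadratic)
linear_coeffs = np.polyfit(x, y, deg=1)
quad_coeffs = np.polyfit(x, y, deg=2)

# Residual sum of squares for each fit on the training data
sse_linear = np.sum((y - np.polyval(linear_coeffs, x)) ** 2)
sse_quad = np.sum((y - np.polyval(quad_coeffs, x)) ** 2)

print(f"Linear SSE:    {sse_linear:.3f}")
print(f"Quadratic SSE: {sse_quad:.3f}")
```

The quadratic fit can never have a larger training error than the linear one, because a line is a special case of a quadratic. With only five points, a lower training error alone is not evidence that the more flexible model will predict better.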