# Beginner-Friendly Introduction to Data Science and Simple Linear Regression

# 1. Understanding Data Science


Data science is the practice of extracting useful knowledge from data. Think of it as detective work: sifting through large amounts of information to find the patterns hidden inside.

Data science combines several disciplines:

* Mathematics and Statistics: used to understand and identify patterns in the data.
* Computer Science and Programming: used to write code, collect data, clean it, and automate tasks.
* Domain Knowledge: an understanding of the particular field in which data science is being applied, such as healthcare, finance, weather forecasting, or business.

**What Does a Data Scientist Do?**

* Collects and examines raw data, which is often incomplete and unorganized.
* Cleans the data to make it suitable for analysis.
* Analyzes the data to identify trends, correlations, and useful patterns.
* Builds models that can make predictions based on the data.
* Presents the results clearly to help people or organizations make better decisions.
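
The cleaning and analysis steps above can be sketched in a few lines of pandas. This is only an illustration: the records, the column names `hours` and `marks`, and the choice to drop incomplete rows are all assumptions made for the example, not a recommended general-purpose cleaning strategy.

```python
import pandas as pd

# Hypothetical raw records; None marks a missing measurement.
raw = pd.DataFrame({
    "hours": [1, 2, None, 4, 5],
    "marks": [35, 40, 50, None, 60],
})

# Clean: drop every row that has a missing value (one simple option among many).
clean = raw.dropna().reset_index(drop=True)

# Analyze: a summary statistic and the correlation between the two columns.
mean_marks = clean["marks"].mean()
correlation = clean["hours"].corr(clean["marks"])

print(f"Rows kept after cleaning: {len(clean)}")
print(f"Mean marks: {mean_marks}")
print(f"Correlation between hours and marks: {correlation:.3f}")
```

In practice, whether to drop, fill, or flag missing values depends on the dataset and the question being asked.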

**Real-Life Applications of Data Science**


* Entertainment: Netflix recommends movies based on your watching history using
data science algorithms.
* Banking: Banks use it to detect fraudulent transactions and ensure security.
* Healthcare: Doctors predict diseases by analyzing patient records with the help
of data science.

In short, data science is a combination of logical thinking, coding, mathematics, and
real-world application. It is used across various industries to solve problems and make
informed decisions using data.

# 2. Simple Linear Regression

**2.1 Introduction to Linear Regression**

Simple linear regression is a method for modeling the relationship between two continuous
variables. It fits a straight line through a scatter plot of the data, and that line is then
used to predict the value of one variable from the value of the other.

**Example:** Predicting exam marks (Y) based on hours studied (X).

**Mathematical Formula**

The equation of a simple linear regression model is:

\begin{equation}
Y \approx \beta_0 + \beta_1 X
\end{equation}
Where:

* $Y$: the value we want to predict (e.g., marks)
* $X$: the input variable (e.g., hours of study)
* $\beta_0$: the intercept (the starting value of Y when X is zero)
*  $\beta_1$: the slope (how much Y changes for a one-unit increase in X)
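
To make the roles of the intercept and slope concrete, here is a minimal sketch that evaluates the line for a single input. The function name `predict_marks` and the coefficient values (30 and 6) are hypothetical, picked only to keep the arithmetic easy to follow. They are not estimates from any dataset.

```python
def predict_marks(hours: float, beta_0: float, beta_1: float) -> float:
    """Evaluate the line Y = beta_0 + beta_1 * X for one input value."""
    return beta_0 + beta_1 * hours


# Hypothetical coefficients, for illustration only.
beta_0 = 30.0  # intercept: predicted marks for 0 hours of study
beta_1 = 6.0   # slope: extra marks for each additional hour

print(predict_marks(5.0, beta_0, beta_1))  # 30 + 6 * 5 = 60.0
```

Raising `hours` by one always raises the prediction by exactly `beta_1`, which is what "slope" means here.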


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = {
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 60, 65, 70, 80]
}
df = pd.DataFrame(data)

plt.scatter(df["Hours"], df["Marks"], color="blue")
plt.title("Study Hours vs Marks")
plt.xlabel("Hours Studied")
plt.ylabel("Marks")
plt.grid(True)
plt.show()

# Training the Model – Finding the Line of Best Fit


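To find the best values for $\beta_0$ and $\beta_1$, we use the least squares method. It chooses the line that minimizes the sum of squared differences between the actual values $y_i$ and the predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Solving this minimization gives the closed-form estimates below, where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$: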

\begin{equation}
\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{equation}
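
The formulas above can be applied directly with NumPy. The sketch below uses the same hours and marks data as the scatter plot. The variable names (`beta_1_hat`, `beta_0_hat`, `sse`) are chosen for this example only.

```python
import numpy as np

# Same study-hours data used in the scatter plot.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
marks = np.array([35, 40, 50, 55, 60, 65, 70, 80], dtype=float)

# Sample means of X and Y.
x_bar = hours.mean()
y_bar = marks.mean()

# Slope: sum of cross-deviations divided by sum of squared X deviations.
beta_1_hat = np.sum((hours - x_bar) * (marks - y_bar)) / np.sum((hours - x_bar) ** 2)

# Intercept: forces the line through the point (x_bar, y_bar).
beta_0_hat = y_bar - beta_1_hat * x_bar

# Sum of squared errors, the quantity that least squares minimizes.
residuals = marks - (beta_0_hat + beta_1_hat * hours)
sse = np.sum(residuals ** 2)

print(f"Slope: {beta_1_hat:.4f}")      # 6.1310
print(f"Intercept: {beta_0_hat:.4f}")  # 29.2857
print(f"Sum of squared errors: {sse:.4f}")
```

For this data, the fitted line predicts roughly 6.13 additional marks per extra hour of study, starting from about 29.29 marks at zero hours.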



**Python Code – Training the Linear Model**

In [None]:
from sklearn.linear_model import LinearRegression
import pandas as pd

# Create DataFrame
data = {
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 60, 65, 70, 80]
}
df = pd.DataFrame(data)

# Independent and Dependent variables
X = df[["Hours"]]  # X should be 2D
y = df["Marks"]    # y is 1D

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Coefficients
beta_0 = model.intercept_  # Intercept (β₀)
beta_1 = model.coef_[0]    # Slope (β₁)

# Print results
print(f"Intercept (β₀): {beta_0}")
print(f"Slope (β₁): {beta_1}")
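
Once a model has been fitted, it can be used to predict marks for study hours that were not in the data. The sketch below refits the same model so that it runs on its own. The new inputs (9 and 10 hours) are hypothetical values chosen for illustration, and `score` reports R² on the training data only, which is not a measure of how well the model generalizes.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same data as in the training example.
df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 60, 65, 70, 80],
})

model = LinearRegression()
model.fit(df[["Hours"]], df["Marks"])

# Hypothetical new inputs; the column name must match the training data.
new_hours = pd.DataFrame({"Hours": [9, 10]})
predicted_marks = model.predict(new_hours)

# R^2: share of the variance in Marks explained by the fitted line.
r_squared = model.score(df[["Hours"]], df["Marks"])

for h, m in zip(new_hours["Hours"], predicted_marks):
    print(f"{h} hours -> predicted marks: {m:.1f}")
print(f"R^2 on the training data: {r_squared:.3f}")
```

Both inputs lie outside the observed range of 1 to 8 hours, so these predictions are extrapolations and should be treated with caution.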