# Linear Regression

## Load & Understand the Data
The data for this exercise was synthetically generated.

<font color='green'>**Load the data and familiarize yourself with it.**</font>

In [None]:
import pandas as pd
df = pd.read_csv("03_salary_satisfaction.csv")

## Building a Linear Regression Model
In **linear regression**, we aim to predict one numerical variable using another numerical variable.

A linear regression results in a **model**. This is a fancy term for an equation or **function** $f(x)$. This function $f(x)$ takes an $x$ value as input and predicts the value of the variable $y = f(x)$.

This function has a fixed form: $f(x) = m \cdot x + b$.

- The function represents a **straight line** (hence the term *linear*).
- $m$ is the **slope** of the line. It tells us how much our target variable $y$ changes when the input variable $x$ (also called the **factor** or **predictor**) increases by one unit.
- $b$ is the **intercept**, i.e. the $y$-value when $x=0$. You might remember this from school as the *y-intercept*, because the line crosses the y-axis at this point.
- $x$ is the value of our predictor for which we want to make a prediction about $y$. We can plug in any value here, and the model will make a prediction.
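
To make the roles of $m$ and $b$ concrete, here is a minimal sketch of such a function in Python. The values of `m` and `b` are hypothetical and chosen only for illustration; they are not fitted from the dataset.

```python
# Hypothetical coefficients, chosen only for illustration (not fitted from data)
m = 2.0  # slope: y changes by 2 for each one-unit increase in x
b = 1.0  # intercept: the value of y at x = 0


def f(x):
    """Evaluate the straight line f(x) = m * x + b."""
    return m * x + b


print(f(0))         # the intercept b
print(f(1))         # one unit to the right: b + m
print(f(3) - f(2))  # the difference between neighbouring x values is the slope m
```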

In Python, a linear regression can be computed in many ways, either "manually" or with pre-built functions. Here, we'll use `scikit-learn` (imported as `sklearn`).

In [None]:
from sklearn.linear_model import LinearRegression
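
As a side note on the "manual" route, the following is a minimal sketch of the closed-form least-squares formulas for a single predictor, cross-checked against NumPy's `np.polyfit`. The arrays `x_toy` and `y_toy` are made up for this example and are not the salary data.

```python
import numpy as np

# Made-up toy data for illustration (not the salary dataset)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form ordinary least squares for one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
x_centered = x_toy - x_toy.mean()
y_centered = y_toy - y_toy.mean()
slope_manual = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)

# The fitted line always passes through the point (mean_x, mean_y)
intercept_manual = y_toy.mean() - slope_manual * x_toy.mean()

# Cross-check: a degree-1 polynomial fit returns [slope, intercept]
slope_polyfit, intercept_polyfit = np.polyfit(x_toy, y_toy, 1)

print(np.isclose(slope_manual, slope_polyfit))
print(np.isclose(intercept_manual, intercept_polyfit))
```

Pre-built functions such as `LinearRegression` solve the same least-squares problem, so for a single predictor the results should agree up to floating-point precision.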

We're building a linear regression model. In this case, we'll use years of experience as the predictor and salary as the target variable.

In [None]:
X = df[["years_experience"]]  # scikit-learn expects a 2-D input, so we select the column as a one-column DataFrame
y = df.salary

model = LinearRegression()
model.fit(X, y) # Here our model is automatically calculated ("fitted")

# Here we extract the results
slope = model.coef_[0]
intercept = model.intercept_

print(f"Model: f(x) = {slope:.0f} * x + {intercept:.0f}")  # round instead of truncating

<font color='green'>**Try to interpret what this model says.**</font>

Hint / Guiding questions:
- What does the model say about the starting salary with no work experience?
- According to our model, how much does each additional year of work experience add?


Answer:
- Without work experience (i.e., for $x=0$), our model predicts a starting salary of approx. €40,000 (we can see this from the **intercept** $b$).
- For each additional year of professional experience, the expected salary increases by approx. €2,600 🚀 (we can see this from the **slope** $m$).

So, how much will we earn after 10 years according to our model?

We can simply plug values into the formula to find out.

In [None]:
years_experience = 10
expected_salary = slope * years_experience + intercept

round(expected_salary,1)
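
The same plug-in calculation also works for several values at once when the inputs are stored in a NumPy array. The sketch below assumes the rounded coefficients from the answer above (about €2,600 per year and about €40,000 starting salary); `example_slope` and `example_intercept` are illustrative stand-ins, so the results are approximations rather than the fitted model's exact output.

```python
import numpy as np

# Rounded, illustrative coefficients; in the notebook, use the fitted slope and intercept
example_slope = 2600.0
example_intercept = 40000.0

years = np.array([0, 5, 10])

# NumPy applies the formula element-wise to every entry of `years`
expected = example_slope * years + example_intercept

for n_years, salary in zip(years, expected):
    print(f"{n_years} years: approx. {salary:,.0f}")
```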

We can also use the model object to make several predictions at once. For example, here are the predictions for 0 through 10 years of professional experience.

In [None]:
# Predict values
y_pred = model.predict(pd.DataFrame({"years_experience": range(11)}))  # 0, 1, ..., 10 years
print(f"Predicted values: {y_pred.round(1)}")

<font color='green'>**Do these predictions seem realistic to you?**</font>

# Evaluating Model Quality
Before we use our model for a salary negotiation, we should verify how reliable its predictions really are.

There are essentially two methods for this.

## Visual Inspection
We can see how close our modeled line is to the actually observed data points.

In [None]:
import matplotlib.pyplot as plt

Here's how our historical data is distributed:

In [None]:
ax = df.plot(x="years_experience", y="salary", kind="scatter", title="Salary by Years of Experience")
ax.set_xlim(left=0)    # start both axes at zero
ax.set_ylim(bottom=0)
plt.show()

Now we can overlay the regression line:

In [None]:
ax = df.plot(x="years_experience", y="salary", kind="scatter", title="Salary by Years of Experience")
plt.axline((0, intercept), slope=slope, color='salmon', linewidth=4, label=f'y = {int(slope)} * x + {int(intercept)}')

ax.set_xlim(0, 12)
ax.set_ylim(0, 150000)
plt.legend()
plt.show()

We can see that the model visually fits quite well. The line more or less runs through the middle of the distribution, and there's no x-range where it deviates significantly from the observed values.

A **crucial limitation** is that we have **no observations for less than 1 year or more than 20 years of work experience**. The model can still make predictions in these ranges (this is called extrapolation), but we should be skeptical of them because they are not supported by the observed data.
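
One simple safeguard is to check whether a value lies inside the observed range before trusting a prediction for it. The sketch below uses a hypothetical helper, `is_within_observed_range`, and made-up experience values standing in for `df.years_experience`.

```python
import pandas as pd

# Made-up experience values standing in for df.years_experience
observed_years = pd.Series([1, 3, 7, 12, 20])


def is_within_observed_range(x, observed):
    """Return True if x lies between the smallest and largest observed value."""
    return bool(observed.min() <= x <= observed.max())


for candidate in [0, 10, 25]:
    if is_within_observed_range(candidate, observed_years):
        status = "inside the observed range"
    else:
        status = "outside the observed range (extrapolation)"
    print(f"{candidate} years: {status}")
```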

## Empirical Evaluation
For linear regressions, the $R^2$ value is often used to assess how good a model is, or whether it is better than another model.

$R^2$ indicates how much of the variance in the observations is *explained* by the model. The higher this value, the better the model.

Expressed more intuitively: We compare the total squared deviation of a baseline that always just guesses the mean with the total squared deviation of the values predicted by our model. The better our model, the smaller the latter becomes and the closer $R^2$ gets to 1.
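
This comparison can be written as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared deviations from the model's predictions and $SS_{tot}$ is the sum of squared deviations from the mean. A minimal sketch, using made-up values (`y_true_toy` and `y_pred_toy` are not from the salary data), compares this manual calculation with scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values for illustration
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.5, 8.5])

# Squared deviations between the true values and the model's predictions
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)

# Squared deviations between the true values and a baseline that always guesses the mean
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy)))
```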

To compute $R^2$, we need to know the true values of the target variable so that we can compare them with the predictions.

In [None]:
from sklearn.metrics import r2_score

In [None]:
predicted_salary = model.predict(df[["years_experience"]])  # same 2-D input shape as used for fitting
true_salary = df.salary

r2_score(true_salary, predicted_salary)

**Interpretation**: The $R^2$ value is about 0.5, which means that roughly half of the variance in salary is explained by years of experience.

## Conclusion - Can we trust the model?
We have shown both visually and empirically that the model fits well. However, it's very important to emphasize that the model only fits well to the **observed data**, and predictions are only meaningful for individuals for whom the dataset is **representative**.

Therefore, if the dataset was collected in a specific country or only for a specific industry, the model only represents the salary development for that country or industry.

Since we don't know exactly where the data comes from and for which people it is representative, we unfortunately cannot rely on it despite the good results.

# Now It's Your Turn!
<font color='green'>**Perform a similar analysis for salary and satisfaction. Answer the following guiding questions:**
- How much more satisfaction does an additional €1,000 of annual salary provide?
- How well does salary explain overall satisfaction?
</font>

