# Model Comparison: Linear Regression vs Random Forest vs XGBoost

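This notebook compares three regression models (Linear Regression, Random Forest,
and XGBoost) on a small sample dataset. Each model is scored on a held-out test set
with mean squared error (MSE) and R2 score, and its predictions are plotted against
the actual values.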


In [None]:
%pip install xgboost


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor


## Create Sample Dataset


In [None]:
data = {
    "StudyHours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Score": [50, 55, 65, 70, 80, 82, 88, 92]
}

df = pd.DataFrame(data)
df


## Train-Test Split


In [None]:
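# Features must be 2-D (a DataFrame); the target is a 1-D Series
X = df[["StudyHours"]]
y = df["Score"]

# Hold out 25% of the 8 rows (2 rows) for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)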
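
As a quick check on what `test_size=0.25` means for eight rows, the arithmetic can be sketched in plain Python. This assumes scikit-learn rounds the test count up, as its current implementation appears to do; here the product is exactly 2, so the rounding does not matter. The variable names are illustrative.

```python
import math

n_samples = 8      # rows in the sample DataFrame
test_size = 0.25   # fraction passed to train_test_split

# Test count is the fraction of rows, rounded up; the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

With so few test rows, each metric below rests on just two predictions per model.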


## Train Models


In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)


In [None]:
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)


In [None]:
xgb = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)


## Model Evaluation


In [None]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost"],
    "MSE": [
        mean_squared_error(y_test, lr_pred),
        mean_squared_error(y_test, rf_pred),
        mean_squared_error(y_test, xgb_pred)
    ],
    "R2 Score": [
        r2_score(y_test, lr_pred),
        r2_score(y_test, rf_pred),
        r2_score(y_test, xgb_pred)
    ]
})

results
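
To make the two metrics concrete, here is a minimal sketch that computes MSE and R2 by hand. The helper names `mse` and `r2` and the demo values are hypothetical and chosen for illustration; they are not output from the models above.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot uses the mean of the actual values."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only
y_true_demo = [55, 82]
y_pred_demo = [58, 81]

print(mse(y_true_demo, y_pred_demo))  # (3**2 + 1**2) / 2 = 5.0
print(r2(y_true_demo, y_pred_demo))
```

Lower MSE is better. R2 is at most 1 and can be negative when a model does worse than predicting the mean of the actual values.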


## Prediction Comparison


In [None]:
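# Distinct markers keep overlapping predictions at the same x distinguishable
plt.scatter(X_test, y_test, label="Actual", color="black")
plt.scatter(X_test, lr_pred, label="Linear Regression", marker="x")
plt.scatter(X_test, rf_pred, label="Random Forest", marker="^")
plt.scatter(X_test, xgb_pred, label="XGBoost", marker="s")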

plt.xlabel("Study Hours")
plt.ylabel("Score")
plt.title("Model Prediction Comparison")
plt.legend()
plt.show()
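
The scatter plot only shows predictions at the test points. To see how a linear model and a tree ensemble differ away from the data, the sketch below refits two of the models on all eight rows and predicts at inputs outside the observed range. The `_demo` names and the inputs 10 and 20 are assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Same sample data as above; here the models are fit on all eight rows
df_demo = pd.DataFrame({
    "StudyHours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Score": [50, 55, 65, 70, 80, 82, 88, 92],
})
X_demo = df_demo[["StudyHours"]]
y_demo = df_demo["Score"]

lr_demo = LinearRegression().fit(X_demo, y_demo)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Hypothetical inputs beyond the observed range of 1 to 8 hours
X_new = pd.DataFrame({"StudyHours": [10, 20]})

lr_new = lr_demo.predict(X_new)
rf_new = rf_demo.predict(X_new)

print("Linear Regression:", lr_new)  # keeps rising with the fitted slope
print("Random Forest:", rf_new)      # same value for both inputs
```

Tree-based models predict from the leaves they learned, so beyond the training range their output stays flat, while the linear model continues along its fitted line.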


## Conclusion

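- Linear Regression provides a simple, interpretable baseline and suits data that is close to linear, as the sample scores here are.
- Random Forest can capture non-linear patterns that Linear Regression cannot, but it does not extrapolate beyond the range of the training data.
- XGBoost is often strong on larger tabular datasets; whether it performs best here should be read from the results table above rather than assumed.

With only eight rows, the test set holds two points, so the MSE and R2 values are
noisy and the ranking may change with a different `random_state`. This comparison
highlights the importance of selecting an appropriate model based on data
complexity and performance metrics.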
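
On a dataset this small, one way to make the comparison less dependent on a single split is leave-one-out cross-validation. The following is a minimal sketch rather than part of the original workflow; the names `models_cv` and `cv_mse` are hypothetical. It reports MSE only, because R2 is undefined on a single held-out point.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Same sample data as above
df_cv = pd.DataFrame({
    "StudyHours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Score": [50, 55, 65, 70, 80, 82, 88, 92],
})
X_cv = df_cv[["StudyHours"]]
y_cv = df_cv["Score"]

models_cv = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv_mse = {}
for name, model in models_cv.items():
    # scikit-learn reports negated MSE so that higher is always better
    scores = cross_val_score(
        model, X_cv, y_cv, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()

print(pd.Series(cv_mse, name="LOO MSE"))
```

`XGBRegressor` follows the scikit-learn estimator interface, so it could be added to `models_cv` in the same way; it is left out here only to keep the sketch free of the extra dependency.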
