# Linear Regression using Scikit-learn

This notebook demonstrates how to implement Linear Regression using scikit-learn, a powerful machine learning library in Python.

## What we'll cover:
1. Data preparation and visualization
2. Train-test splitting
3. Model training and prediction
4. Model evaluation using multiple metrics
5. Results visualization

## Import Required Libraries

We'll use:
- NumPy for numerical operations
- Matplotlib for visualization
- Scikit-learn for:
  - Linear Regression model
  - Train-test splitting
  - Model evaluation metrics

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(0)

## Generate Sample Data

We create synthetic data with a known relationship:
- True relationship: y = 4 + 3x + noise (intercept 4, slope 3)
- Features (X) are drawn uniformly at random from the interval [0, 2)
- Gaussian noise with mean 0 and standard deviation 1 is added for realism

This allows us to compare our model's learned parameters with the true values.

In [None]:
# Generate random data points
X = 2 * np.random.rand(100, 1)  # 100 random x values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # True relationship with added noise

# Plot the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data points')
plt.xlabel('X (Input feature)')
plt.ylabel('y (Target variable)')
plt.title('Generated Data: y = 4 + 3x + noise')
plt.legend()
plt.grid(True)
plt.show()

## Split Data into Training and Testing Sets

We split our data to:
1. Train the model on one portion (training set)
2. Evaluate its performance on unseen data (testing set)

This helps us assess how well our model generalizes to new data.

Parameters:
- test_size=0.2: 20% of the samples go to the testing set, 80% to the training set
- random_state=42: Fixes the shuffling so the same split is produced on every run
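
To make the effect of random_state concrete, here is a minimal sketch on a toy array. The names X_demo, y_demo, X_test_a and X_test_b are illustrative and not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 samples with a single feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Two independent calls with the same random_state
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Both calls should select exactly the same rows for the test set
same_split = np.array_equal(X_test_a, X_test_b)
print("Test set size:", X_test_a.shape[0])
print("Identical test sets:", same_split)
```

With a different random_state, or with none at all, the rows that land in the test set will generally differ from run to run.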

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Dataset Split:")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

## Train the Model

Scikit-learn makes it easy to:
1. Initialize the model
2. Fit it to our training data
3. Make predictions

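The LinearRegression class:
- Fits the model by ordinary least squares, choosing the coefficient and intercept that minimize the sum of squared errors on the training data
- Stores the learned parameters in coef_ and intercept_ after fitting
- Provides a predict method for generating outputs on new inputs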
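
As a rough illustration of what "optimal parameters" means here, the sketch below solves the same least-squares problem directly with NumPy and compares the result to scikit-learn. It regenerates similar synthetic data on its own. The names rng, X_demo, y_demo, sk_model and X_design are assumptions made for this example, and np.linalg.lstsq is used only for comparison, not as a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with the same form as above: y = 4 + 3x + noise
rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))
y_demo = 4 + 3 * X_demo + rng.standard_normal((100, 1))

# Fit with scikit-learn
sk_model = LinearRegression().fit(X_demo, y_demo)

# Solve the least-squares problem directly.
# A leading column of ones lets the first coefficient act as the intercept.
X_design = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])
theta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept_ls, slope_ls = theta[0, 0], theta[1, 0]

print(f"lstsq:        intercept={intercept_ls:.4f}, slope={slope_ls:.4f}")
print(f"scikit-learn: intercept={sk_model.intercept_[0]:.4f}, "
      f"slope={sk_model.coef_[0][0]:.4f}")
```

The two fits should agree to within floating-point precision, since both minimize the same sum of squared errors.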

In [None]:
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on both training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("Model Training Complete!")

## Model Evaluation

We evaluate our model using multiple metrics:
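1. R² Score (Coefficient of determination)
   - Measures the proportion of variance in y explained by the model
   - Best possible value is 1 (perfect prediction); 0 corresponds to always predicting the mean of y, and the score can be negative when the model does worse than that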

2. Mean Squared Error (MSE)
   - Average squared difference between predictions and actual values
   - Lower values indicate better fit
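
The two metrics can be checked by hand from their definitions, MSE = mean((y - ŷ)²) and R² = 1 - SS_res / SS_tot. The sketch below does this on a small made-up example; the arrays y_true, y_pred and y_bad are hypothetical values chosen for illustration, not notebook results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE manual: {mse_manual:.4f}, sklearn: {mean_squared_error(y_true, y_pred):.4f}")
print(f"R²  manual: {r2_manual:.4f}, sklearn: {r2_score(y_true, y_pred):.4f}")

# Predictions that are worse than always guessing the mean give a negative R²
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)
print(f"R² for reversed predictions: {r2_bad:.4f}")
```

The last case shows why R² is not confined to the range 0 to 1: it becomes negative whenever the residual sum of squares exceeds the total sum of squares.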

In [None]:
# Calculate performance metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print("Model Performance:")
print(f"Training R² Score: {train_r2:.4f}")
print(f"Testing R² Score: {test_r2:.4f}")
print(f"Training MSE: {train_mse:.4f}")
print(f"Testing MSE: {test_mse:.4f}")

print("\nLearned Parameters:")
print(f"Coefficient (weight): {model.coef_[0][0]:.4f} (True value: 3)")
print(f"Intercept (bias): {model.intercept_[0]:.4f} (True value: 4)")

## Visualize Results

Let's create a comprehensive visualization showing:
1. Training data points
2. Testing data points
3. Model's predictions

This helps us visually assess how well our model fits the data and if there are any patterns in the predictions.
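
One common way to look for patterns in the predictions is a residual plot (actual minus predicted values against X). The following is a minimal, self-contained sketch, not part of the original notebook flow. It regenerates similar synthetic data, and the names rng, X_demo, y_demo, demo_model and residuals_te are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with the same form as above: y = 4 + 3x + noise
rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))
y_demo = 4 + 3 * X_demo + rng.standard_normal((100, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)

# Residuals on the test set: actual minus predicted
residuals_te = y_te - demo_model.predict(X_te)

plt.figure(figsize=(10, 4))
plt.scatter(X_te, residuals_te, color='green', label='Test residuals')
plt.axhline(0, color='red', linestyle='--', label='Zero error')
plt.xlabel('X (Input feature)')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals on the Test Set')
plt.legend()
plt.grid(True)
plt.show()
```

If the linear model is appropriate, the residuals should scatter around zero with no visible trend. A curve or funnel shape would suggest the relationship is not well captured by a straight line.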

In [None]:
plt.figure(figsize=(12, 6))

# Plot training and testing data
plt.scatter(X_train, y_train, color='blue', label='Training Data')
plt.scatter(X_test, y_test, color='green', label='Testing Data')

# Plot the regression line
X_plot = np.sort(X, axis=0)
y_plot = model.predict(X_plot)
plt.plot(X_plot, y_plot, color='red', label='Model Predictions')

plt.xlabel('X (Input feature)')
plt.ylabel('y (Target variable)')
plt.title('Linear Regression: Training, Testing Data, and Predictions')
plt.legend()
plt.grid(True)
plt.show()

## Conclusion

In this notebook we used scikit-learn to:
1. Split the data into training and testing sets
2. Train a linear regression model
3. Make predictions on both sets and evaluate them with R² and MSE
4. Compare the learned parameters with the true values (slope 3, intercept 4)

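Advantages of using scikit-learn:
- Simple and consistent API: models are trained with fit and used with predict
- Tested, ready-made implementation, so we don't have to code the least-squares solution ourselves
- Built-in model evaluation tools such as r2_score and mean_squared_error
- Works directly with NumPy arrays and with utilities such as train_test_split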