<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_2/section_2_Python_Example__Linear_Regression_with_Scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 2 Python example - linear regression with Scikit-learn

Linear regression is a foundational technique in statistical modeling and data science, used for predicting a quantitative response. It's particularly useful for understanding the relationship between one dependent variable and one or more independent variables. This section provides a practical example of how to implement linear regression using Python’s Scikit-learn library, a powerful tool for machine learning that simplifies the application of various algorithms.
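
With a single independent variable, the model is a straight line, y = intercept + slope * x, and ordinary least squares chooses the intercept and slope that minimize the sum of squared differences between the observed and predicted values. As a minimal sketch of that idea (the toy numbers are invented for illustration and do not come from any dataset), the closed-form estimates can be computed directly with NumPy:

```python
import numpy as np

# Toy data, invented for illustration: y is exactly 2 + 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Closed-form ordinary least squares for one predictor:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_centered = x - x.mean()
slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

print(f"slope={slope:.3f}, intercept={intercept:.3f}")  # slope=3.000, intercept=2.000
```

Scikit-learn's `LinearRegression` solves the same least-squares problem, and it also handles several independent variables at once.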

1. Setting Up the Environment:

First, ensure that you have the necessary Python environment and libraries installed. Scikit-learn can be installed via pip if it is not already available in your environment:

In [None]:
%pip install numpy scipy matplotlib ipython scikit-learn pandas

2. Importing Required Libraries:

Start your Python script by importing the required libraries. You will need pandas for data manipulation, matplotlib for plotting, and scikit-learn for building the regression model.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

3. Loading and Preparing the Data:

For this example, let's assume you have a dataset, housing.csv, that contains housing prices (HP) along with attributes such as the number of rooms per dwelling (RM), the number of parking spaces (PS), and the property size in square feet (SQFT).

In [None]:
# Load the dataset
data = pd.read_csv('housing.csv')

# Selecting only the RM and HP columns for simplicity
data = data[['RM', 'HP']]
print(data.head())
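
If you do not have such a file at hand, a synthetic stand-in can be generated for practice. The sketch below is only an assumption about what the data might look like: the column names follow the description above, but every value, the linear relationship between RM and HP, and the file name `housing_synthetic.csv` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for housing.csv: every value below is synthetic.
rng = np.random.default_rng(seed=0)
n_rows = 200

rm = rng.uniform(3, 9, size=n_rows).round(2)              # rooms per dwelling
ps = rng.integers(0, 4, size=n_rows)                      # parking spaces (0 to 3)
sqft = (rm * 250 + rng.normal(0, 100, n_rows)).round()    # property size in square feet
# Assumed relationship: price rises by about 9 units per room, plus noise
hp = (-10 + 9 * rm + rng.normal(0, 3, n_rows)).round(1)

synthetic = pd.DataFrame({"RM": rm, "PS": ps, "SQFT": sqft, "HP": hp})
synthetic.to_csv("housing_synthetic.csv", index=False)

# Quick sanity checks before modeling
print(synthetic.shape)         # (200, 4)
print(synthetic.isna().sum())  # count of missing values per column
```

To use it, pass `'housing_synthetic.csv'` to `pd.read_csv` in place of `'housing.csv'`. The same two checks (shape and missing values) are worth running on a real dataset as well.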

4. Visualizing the Data:

It is good practice to plot the data before modeling, to check whether the relationship between the variables looks roughly linear.

In [None]:
plt.scatter(data['RM'], data['HP'], alpha=0.5)
plt.title('Room Count vs. House Price')
plt.xlabel('Average Number of Rooms')
plt.ylabel('House Price (in $1000s)')
plt.show()

5. Creating Training and Test Sets:

Divide the data into training and test sets to evaluate the performance of your model on unseen data.

In [None]:
# Define the features and the target
X = data[['RM']]  # features
y = data['HP']  # target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
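
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a made-up ten-row frame (the values and the `_toy` names are illustrative, not part of the housing data). With `test_size=0.2`, two of the ten rows are held out, and no row appears in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame using the same column names as the example; values are made up
toy = pd.DataFrame({
    "RM": [4, 5, 6, 7, 8, 5, 6, 7, 4, 8],
    "HP": [20, 25, 31, 36, 40, 26, 30, 35, 21, 41],
})

X_toy = toy[["RM"]]  # double brackets keep the features two-dimensional
y_toy = toy["HP"]    # the target is a one-dimensional Series

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
# The two index sets do not overlap
print(set(X_tr.index) & set(X_te.index))  # set()
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same split.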

6. Building the Linear Regression Model:

Use scikit-learn to build and train the linear regression model.

In [None]:
# Create a linear regression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Display the coefficients
print(f'Coefficient: {model.coef_[0]}')
print(f'Intercept: {model.intercept_}')
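
The coefficient is the estimated change in the target for each additional unit of the feature, and the intercept is the predicted value when the feature is zero. The following sketch, on made-up data where HP is exactly 5 * RM + 1 (the `toy` names are hypothetical), shows that a prediction is simply intercept + coefficient * value:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in which HP is exactly 5 * RM + 1
toy = pd.DataFrame({"RM": [3, 4, 5, 6, 7]})
toy["HP"] = 5 * toy["RM"] + 1

toy_model = LinearRegression().fit(toy[["RM"]], toy["HP"])

rooms = 6.5
# Prediction computed by hand from the fitted parameters
by_hand = toy_model.intercept_ + toy_model.coef_[0] * rooms
# Same prediction through the model API (input must be two-dimensional)
by_model = toy_model.predict(pd.DataFrame({"RM": [rooms]}))[0]

print(by_hand, by_model)  # both are approximately 33.5
```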

7. Making Predictions and Evaluating the Model:

After training the model, use it to make predictions on the test set, and then evaluate the performance.

In [None]:
# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

# Plotting predictions against actual values
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.title('Test vs Prediction')
plt.xlabel('Average Number of Rooms')
plt.ylabel('House Price (in $1000s)')
plt.show()
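
To see what the two metrics measure, they can be recomputed by hand. The sketch below uses made-up actual and predicted values (not output from the housing model) and checks the manual results against scikit-learn's functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_hat

# Mean squared error: the average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Root mean squared error is in the same units as the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```

An R^2 near 1 means the model explains most of the variance in the target, while a value near 0 means it does little better than always predicting the mean.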

This Python example walks through linear regression with Scikit-learn, from data preparation to model evaluation. The fitted coefficient estimates how much the house price changes for each additional room, and the mean squared error and R^2 score on the test set indicate how well that linear relationship holds on unseen data. The same steps can be applied to other datasets and to other predictive modeling tasks where a linear relationship is a reasonable assumption.