# Understanding Regression and Classification: The Two Pillars of Supervised Machine Learning

In supervised machine learning, we teach algorithms to learn from labeled data, aiming to make predictions or classifications on new, unseen data. There are two main types of tasks we can tackle:

## Regression
  
**Goal:** Predict a continuous numerical value. Think of it like estimating a number on a scale.

**Examples:**
  * Predicting house prices based on square footage, number of bedrooms, and location.
  * Forecasting stock prices based on historical trends and financial indicators.
  * Estimating crop yields based on weather conditions and soil quality.
  
**Models:**
  * Linear Regression
  * Polynomial Regression
  * Decision Tree Regression
  * Random Forest Regression
  * Support Vector Regression (SVR)  
  * Neural Networks (for more complex relationships)

## Classification
  
**Goal:** Predict a category or class label. Think of it like sorting items into different boxes.

**Examples:**

  * Classifying emails as spam or not spam.
  * Determining if a customer will churn (leave) based on their usage patterns.
  * Identifying whether a tumor is malignant or benign from medical images.
  
**Models:**
  * Logistic Regression
  * Decision Tree Classifier
  * Random Forest Classifier
  * Support Vector Machine (SVM)
  * Naive Bayes
  * K-Nearest Neighbors (KNN)
  * Neural Networks (especially for image or text classification)



## Key Differences

**Target Variable:**
* Regression: Continuous numerical value (e.g., price, temperature, salary)
* Classification: Categorical label (e.g., spam/not spam, red/green/blue, yes/no)

**Model Output:**
* Regression: Predicted numerical value
* Classification: Predicted class label or probability of belonging to each class
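
To make the difference in model output concrete, here is a minimal sketch on made-up data. The hours-studied feature, the 65-point pass mark, and all variable names are assumptions for illustration. It fits scikit-learn's `LinearRegression` and `LogisticRegression` on the same feature and compares what each one returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): one feature, hours studied.
hours = rng.uniform(0, 10, size=200).reshape(-1, 1)

# Regression target: a continuous exam score.
score = 40 + 5 * hours[:, 0] + rng.normal(0, 3, size=200)
# Classification target: a pass/fail label derived from the score.
passed = (score >= 65).astype(int)

reg = LinearRegression().fit(hours, score)
clf = LogisticRegression().fit(hours, passed)

new_student = np.array([[7.0]])
predicted_score = reg.predict(new_student)        # one continuous number
predicted_label = clf.predict(new_student)        # one class label (0 or 1)
predicted_proba = clf.predict_proba(new_student)  # one probability per class

print(predicted_score, predicted_label, predicted_proba)
```

The regressor returns a number on the score scale, while the classifier returns either a label or a row of class probabilities that sums to 1.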

**Model Types:**
* Regression: Linear regression, polynomial regression, decision tree regression, etc.
* Classification: Logistic regression, decision tree classifier, random forest classifier, etc.

**Evaluation Metrics:**

* Regression: Mean Squared Error (MSE), R-squared, Mean Absolute Error (MAE), etc.
* Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC, etc.
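
As a rough sketch of what several of these metrics measure, the snippet below computes them by hand with NumPy on small made-up arrays (all values are assumptions for illustration). In practice, `sklearn.metrics` provides ready-made functions for each of them.

```python
import numpy as np

# --- Regression metrics on made-up values ---
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_hat
mse = np.mean(errors ** 2)       # average squared error
mae = np.mean(np.abs(errors))    # average absolute error
# R-squared: 1 minus (residual sum of squares / total sum of squares).
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# --- Classification metrics on made-up labels (1 = positive class) ---
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((preds == 1) & (labels == 1))  # true positives
fp = np.sum((preds == 1) & (labels == 0))  # false positives
fn = np.sum((preds == 0) & (labels == 1))  # false negatives

accuracy = np.mean(preds == labels)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"MSE={mse:.4f} MAE={mae:.4f} R2={r2:.4f}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```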


**Purpose:**

* Regression: Predict or forecast a numerical outcome based on input features.

* Classification: Assign data points to predefined categories or classes based on their characteristics.


**Which One to Use?**

The choice between regression and classification depends on the type of problem you're trying to solve. If your goal is to predict a number, go with regression. If you want to categorize or label something, use classification.

**Important:** In both cases, the quality of your model depends heavily on the quality and relevance of your data.



**Fundamentals of Machine Learning with California Housing Data**

**Introduction**

Welcome to your first journey into the exciting world of machine learning! In this lesson, we'll use Python, pandas, and scikit-learn to explore real-world housing data, uncover patterns, and build a simple predictive model using linear regression.

**1. Setting the Stage: Libraries and Data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

* **pandas:** The backbone of data manipulation in Python.
* **numpy:** For numerical operations.
* **matplotlib & seaborn:** Our tools for visualization.
* **scikit-learn:** The machine learning powerhouse.

In [None]:
# Load the California housing dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Display the first few rows
print(df.head().to_markdown(index=False, numalign='left', stralign='left'))

In [None]:
# Print the dataset description
print(housing.DESCR)

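* `fetch_california_housing(as_frame=True)` returns a dictionary-like object. Its `frame` attribute is a DataFrame holding the eight features together with the target, `MedHouseVal` (median house value, in units of $100,000).
* `df.head()` returns the first 5 rows of the dataset. `to_markdown()` formats them as a table and requires the optional `tabulate` package.
* `housing.DESCR` contains a text description of the dataset and its columns.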

**2. Exploratory Data Analysis (EDA): Getting to Know the Data**

In [None]:
print(df.describe().to_markdown())

* **`df.describe()`:** This provides the statistical summary of the dataset.
* It reveals the count, mean, standard deviation, minimum, maximum and the percentiles (25%, 50%, 75%).


**Visualizing Feature-Price Relationships**

In [None]:
# Define features of interest
features_to_plot = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']  # Add more as needed

# Create subplots for each feature
fig, axes = plt.subplots(nrows=len(features_to_plot), ncols=1, figsize=(12, 15))

for i, feature in enumerate(features_to_plot):
    # Create a scatter plot for the current feature
    axes[i].scatter(df[feature], df['MedHouseVal'], alpha=0.5)

    # Add regression line
    sns.regplot(x=df[feature], y=df['MedHouseVal'], scatter=False, ax=axes[i])

    # Add labels and title
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Median House Value')
    axes[i].set_title(f'Relationship between {feature} and Median House Value')
    axes[i].grid(axis='both', alpha=0.75)

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()

**Explanation:**

1. **Features to Plot:** We define a list of features for which we want to visualize the relationship with the median house value.
2. **Subplots:** We create a grid of subplots (one for each feature) using `plt.subplots`.
3. **Iteration:** The code loops through each selected feature:
   - **Scatter Plot:** Creates a scatter plot with the feature on the x-axis and the median house value on the y-axis.
   - **Regression Line:** Adds a regression line (line of best fit) to the scatter plot.
   - **Labels and Title:**  Sets informative labels and titles for each subplot.
   - **Grid:** Adds a grid for easier interpretation.
4. **Layout Adjustment:**  `plt.tight_layout()` is used to ensure that the subplots don't overlap and are visually appealing.
5. **Display:**  `plt.show()` displays the final plot.

**Key Findings and Interpretation:**

* **Visual Trends:** Observe the overall trend in each plot. Is it positive (higher feature value → higher house price), negative, or more complex?
* **Linearity:**  Assess whether the relationship seems linear (well-represented by a straight line) or non-linear (curved).
* **Outliers:**  Look for any unusual data points (outliers) that might be influencing the regression line.
* **Strength of Relationship:**  Visually judge how closely the data points cluster around the regression line. A tight cluster suggests a stronger relationship.
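
A visual judgment of strength can be backed by a number. The sketch below uses the Pearson correlation coefficient, computed with `np.corrcoef` on synthetic data (the data and variable names are assumptions for illustration), to compare a tight cluster with a loose one around the same underlying line.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)

# Same underlying line (y = 2x), different amounts of noise.
tight = 2 * x + rng.normal(0, 1, size=500)    # little scatter around the line
loose = 2 * x + rng.normal(0, 15, size=500)   # a lot of scatter

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r_tight = np.corrcoef(x, tight)[0, 1]
r_loose = np.corrcoef(x, loose)[0, 1]

print(f"tight cluster: r = {r_tight:.2f}")
print(f"loose cluster: r = {r_loose:.2f}")
```

With the lesson's DataFrame, `df.corr()['MedHouseVal']` should give the corresponding value for every feature at once. Note that Pearson correlation only captures linear association, so a strong curved relationship can still produce a low value.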

**Student Challenges and Notes:**

* **Understanding Scatter Plots:** Ensure students understand how to interpret scatter plots and regression lines.
* **Non-Linear Relationships:** Discuss that not all relationships are linear. Show examples where polynomial regression or other methods might be more suitable.
* **Outlier Detection and Handling:** Introduce the concept of outliers and their potential impact on models. Explain how they can be detected and dealt with.

* **Distribution plots** help us understand the spread of values (skewness, outliers, etc.)
* **Scatter plots** with regression lines help us identify potential relationships.
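
One common way to flag outliers, offered here as a sketch and not as the only option, is the 1.5 × IQR rule: values far below the first quartile or far above the third quartile are marked as unusual. The helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up "average rooms" values with one extreme entry.
rooms = np.array([4.8, 5.1, 5.3, 5.6, 5.9, 6.2, 6.4, 141.9])
mask = iqr_outlier_mask(rooms)

print("Flagged as outliers:", rooms[mask])
print("Kept:", rooms[~mask])
```

Whether flagged points should be removed, capped, or kept depends on whether they are errors or genuine observations, so the mask is best treated as a prompt for inspection.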

**3. Preparing the Data**

In [None]:
# Define features and target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

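* We set aside 20% of our data (`test_size=0.2`) for testing our model's performance on rows it has never seen. Setting `random_state=42` makes the split reproducible from run to run.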
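
Conceptually, a random split shuffles the row indices once and then slices them into two groups. The sketch below illustrates that idea with NumPy. It is not scikit-learn's implementation, and `simple_train_test_split` is a hypothetical helper, so the rows it selects will differ from those chosen by `train_test_split`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices once, then use the first share as the test set."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Tiny made-up dataset: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Every row ends up in exactly one of the two sets, which is what keeps the test score an honest estimate of performance on unseen data.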

**4. Building the Linear Regression Model**

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

* We create a linear regression model and train it on our training data.
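
`LinearRegression` fits an ordinary least squares model: it looks for the weights and intercept that minimize the sum of squared differences between predictions and targets. As a sketch of the same idea outside scikit-learn, the snippet below solves that least squares problem with `np.linalg.lstsq` on synthetic data whose true weights are known (the data and names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known weights, plus a little noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(0, 0.1, size=200)

# Append a column of ones so the intercept is learned as one more weight.
X_design = np.column_stack([X, np.ones(len(X))])

# Least squares solution: minimizes ||X_design @ solution - y||^2.
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
weights, intercept = solution[:-1], solution[-1]

print("estimated weights:", np.round(weights, 2))
print("estimated intercept:", round(float(intercept), 2))
```

The estimates should land close to the true values used to generate the data, which is a useful sanity check when learning how a model behaves.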

**5. Making Predictions and Evaluating**

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

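* We evaluate the model using Mean Squared Error and R-squared.
* **Mean Squared Error** averages the squared differences between predicted and actual values, so it is expressed in squared target units; lower is better.
* **R-squared** is the share of the variance in the target that the model explains: 1.0 is a perfect fit, and 0 is no better than always predicting the mean.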
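
Two habits make these numbers easier to read: taking the square root of the MSE (RMSE) so the error is in the same units as the target, and comparing against a naive baseline that always predicts the mean. The sketch below uses made-up values in units of $100,000, and the "model" predictions are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up house values in units of $100,000.
y_actual = np.array([1.2, 2.5, 0.9, 3.1, 4.8])

# Baseline: always predict the mean of the observed values.
baseline_pred = np.full_like(y_actual, y_actual.mean())
# Hypothetical model output for the same five rows.
model_pred = np.array([1.0, 2.7, 1.1, 3.0, 4.4])

baseline_rmse = rmse(y_actual, baseline_pred)
model_rmse = rmse(y_actual, model_pred)

print(f"baseline RMSE: {baseline_rmse:.3f} (about ${baseline_rmse * 100_000:,.0f})")
print(f"model RMSE:    {model_rmse:.3f} (about ${model_rmse * 100_000:,.0f})")
```

A model is only useful if it beats the baseline by a meaningful margin. The same comparison can be made on the lesson's test set with `np.sqrt(mse)`.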



**6. Examining Feature Importance**

In [None]:
# Get the feature coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})

# Sort the coefficients by absolute value (magnitude)
coefficients = coefficients.reindex(coefficients['Coefficient'].abs().sort_values(ascending=False).index)

# Display the coefficients
print("Feature Importance:\n" + coefficients.to_markdown(index=False))

* **Explanation:**
    * We create a DataFrame to store feature names and their corresponding coefficients from the trained model.
    * Sorting by absolute value puts the largest coefficients first, whatever their sign. Each coefficient is the change in predicted median house value for a one-unit increase in that feature, with the other features held fixed.
    * We print this table to understand which features are most influential.

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 6))
sns.barplot(data=coefficients, x='Coefficient', y='Feature')
plt.title('Feature Importance (Coefficients Sorted by Magnitude)')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.grid(axis='x', alpha=0.75)
plt.show()

* **Explanation:**
    * The bar chart provides a visual representation of the feature importance table.
    * The length of the bar indicates the magnitude of the coefficient, while the direction (positive or negative) signifies whether a higher feature value is associated with a higher or lower predicted house value.

**Key Findings and Interpretation:**

* The features with the largest absolute coefficients change the predicted median house value the most per one-unit increase in the feature, with all other features held fixed.
* Positive coefficients imply that as the feature value increases, the predicted house value also increases.
* Negative coefficients imply the opposite relationship (e.g., a higher feature value means a lower predicted house value).
* Coefficient size depends on the feature's units, so a coefficient close to zero does not by itself mean the feature is unimportant. `Population`, for example, often takes values in the hundreds or thousands, so its per-unit coefficient is naturally small.

**Student Challenges and Notes:**

* **Understanding Model Behavior:** Challenge students to interpret the results. Which features are most critical? Does the model align with their intuition about real estate?
* **Feature Scaling:** If features have vastly different scales, some models might overemphasize the importance of features with larger values. Discuss feature scaling techniques (like standardization) to address this.
* **Advanced Feature Selection:** Introduce students to techniques like recursive feature elimination for further refining the model.
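
The effect of feature scale on linear regression coefficients can be seen in a small experiment. The sketch below fits ordinary least squares twice on synthetic data, once on raw features and once on standardized features (zero mean, unit standard deviation). The feature names, the data, and the helper `ols_coefficients` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Two made-up features on very different scales.
income = rng.normal(4.0, 2.0, size=n)              # values around 4
population = rng.normal(1500.0, 1000.0, size=n)    # values around 1500
value = 0.4 * income + 0.0005 * population + rng.normal(0, 0.1, size=n)

def ols_coefficients(X, y):
    """Fit least squares with an intercept and return the feature weights."""
    X_design = np.column_stack([X, np.ones(len(X))])
    solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return solution[:-1]

X_raw = np.column_stack([income, population])
# Standardize: subtract each column's mean, divide by its standard deviation.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_coef = ols_coefficients(X_raw, value)
std_coef = ols_coefficients(X_std, value)

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
```

On the raw features, the second coefficient looks negligible only because its feature is measured in large numbers. After standardization, each coefficient is the change in the prediction per one standard deviation of the feature, which makes the magnitudes comparable. scikit-learn's `StandardScaler` applies the same transformation.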

