
#Regression analysis


##Overview

Regression analysis is a statistical technique widely used in data science to model the relationship between a dependent variable (or target variable) and one or more independent variables (or predictor variables). The goal of regression analysis is to understand how changes in the independent variables are associated with changes in the dependent variable, enabling us to make predictions and draw insights from the data.

In data science, regression analysis serves various purposes, including:

1. Prediction: By fitting a regression model to historical data, we can predict the values of the dependent variable for new data points, making it valuable for forecasting future trends and outcomes.

2. Understanding Relationships: Regression analysis helps us quantify the strength, direction, and significance of the relationships between the independent and dependent variables. It allows us to identify which factors have the most significant impact on the target variable.

3. Hypothesis Testing: Regression analysis can be used to test hypotheses about the relationships between variables. By comparing the coefficients of the independent variables to their standard errors, we can determine whether the relationships are statistically significant.

4. Feature Importance: In predictive modeling, regression can help determine the importance of different features or predictors in explaining the variability of the target variable.

There are several types of regression techniques, and the choice of the appropriate method depends on the nature of the data and the research question. Some common types of regression analysis include:

1. Linear Regression: The simplest form of regression, used when the relationship between the dependent and independent variables is linear. The model assumes that the relationship can be represented by a straight line.

2. Multiple Regression: An extension of linear regression, used when there are multiple independent variables. It helps identify the combined effects of all predictors on the dependent variable.

3. Polynomial Regression: When the relationship between the variables is nonlinear, polynomial regression fits a polynomial equation to the data, allowing for more flexible curves.

4. Logistic Regression: While the name contains "regression," logistic regression is a classification technique used for predicting binary outcomes. It models the probability of a binary response based on predictor variables.

5. Ridge Regression and Lasso Regression: Regularization techniques that reduce overfitting in multiple regression models by adding a penalty on the size of the coefficients to the loss function (an L2 penalty for Ridge, an L1 penalty for Lasso).
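To make the difference between a linear and a polynomial fit concrete, here is a minimal sketch using NumPy's `polyfit`. The data are synthetic (generated from a known quadratic plus noise), and all variable names are illustrative assumptions rather than part of the dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a quadratic trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 0.5 * x**2 - x + 2 + rng.normal(0, 0.3, size=x.size)

# Fit a straight line (degree 1) and a quadratic (degree 2) by least squares
line_coefs = np.polyfit(x, y, deg=1)
quad_coefs = np.polyfit(x, y, deg=2)

def sum_squared_residuals(coefs):
    """Sum of squared differences between observed and fitted values."""
    return np.sum((y - np.polyval(coefs, x)) ** 2)

ssr_line = sum_squared_residuals(line_coefs)
ssr_quad = sum_squared_residuals(quad_coefs)

print("Straight-line SSR:", ssr_line)
print("Quadratic SSR:", ssr_quad)
print("Quadratic coefficients (highest power first):", quad_coefs)
```

On curved data like this, the quadratic fit should leave a much smaller sum of squared residuals than the straight line. Note that polynomial regression is still linear in its coefficients, which is why ordinary least squares can fit it.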

Regression analysis involves estimating the parameters of the regression equation, such as the coefficients and intercept, using statistical techniques like Ordinary Least Squares (OLS) for linear regression. OLS chooses the parameters that minimize the sum of squared differences between the predicted and actual values in the training data, which defines the best-fit line or curve.
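As a minimal sketch of what OLS does for a single predictor, the closed-form estimates can be computed directly with NumPy. The data below are synthetic and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=200)

# Closed-form OLS estimates for one predictor:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The quantity OLS minimizes: the sum of squared residuals
residuals = y - (slope * x + intercept)
ssr = np.sum(residuals ** 2)

print("Slope:", slope)
print("Intercept:", intercept)
print("Sum of squared residuals:", ssr)
```

The estimates should land close to the values used to generate the data (slope 2, intercept 1), and any other line through the same points would give a larger sum of squared residuals.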

In summary, regression analysis is a fundamental tool in data science that allows us to model and understand relationships between variables, make predictions, and gain valuable insights from data.

##Simple Linear Regression
To perform simple linear regression using Python's SciPy library on the Pima Indians Diabetes dataset, we can follow these steps:

**Step 1: Import the required libraries and load the data**


In [None]:
import pandas as pd
from scipy import stats

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=names)

# Select the predictor (X) and target (y) variables
X = data['BMI']
y = data['Glucose']


**Step 2: Perform linear regression using SciPy**


In [None]:
# Perform linear regression using the linregress function from scipy.stats
slope, intercept, r_value, p_value, std_err = stats.linregress(X, y)

# Print the slope and intercept of the linear regression line
print("Slope:", slope)
print("Intercept:", intercept)


**Step 3: Visualize the linear regression line**


In [None]:
import matplotlib.pyplot as plt

# Plot the original data points
plt.scatter(X, y, alpha=0.5)

# Add the linear regression line to the plot
plt.plot(X, slope * X + intercept, color='red')

# Add labels and title to the plot
plt.xlabel('BMI')
plt.ylabel('Glucose')
plt.title('Linear Regression: BMI vs. Glucose')

# Display the plot
plt.show()


In this example, we use the SciPy library's `stats.linregress()` function to perform simple linear regression. It returns the slope, intercept, correlation coefficient (r-value), p-value for the test that the slope is zero, and the standard error of the estimated slope. We then plot the original data points and add the linear regression line to visualize the relationship between BMI and Glucose levels.
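The remaining outputs of `stats.linregress()` can be put to use as well. The sketch below, on synthetic data with hypothetical variable names, shows how the r-value relates to R-squared and how the reported p-value follows from the slope and its standard error via a t-test with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the Pima dataset)
rng = np.random.default_rng(2)
x = rng.normal(30, 6, size=100)
y = 0.8 * x + 100 + rng.normal(0, 25, size=100)

result = stats.linregress(x, y)

# R-squared is the square of the correlation coefficient in simple regression
r_squared = result.rvalue ** 2

# The same quantity from the residuals: 1 - SSR / SST
fitted = result.intercept + result.slope * x
ssr = np.sum((y - fitted) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared_from_residuals = 1 - ssr / sst

# Two-sided t-test of the null hypothesis that the slope is zero
n = len(x)
t_stat = result.slope / result.stderr
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("R-squared:", r_squared)
print("t statistic:", t_stat)
print("p-value (manual):", p_manual)
print("p-value (linregress):", result.pvalue)
```

The manually computed p-value should agree with the one returned by `linregress`, which makes explicit what "comparing a coefficient to its standard error" means in practice.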

Please note that this is a simple example, and in practice, you might need to handle missing values, perform data preprocessing, validate the model, and consider additional factors such as multicollinearity or model assumptions.
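As one example of such preprocessing, a value of 0 for BMI or Glucose is physiologically implausible, and in this dataset such zeros are commonly treated as missing values. The sketch below applies that assumption to a small hypothetical sample with the same column names; the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample (made-up rows) using the same column names
sample = pd.DataFrame({
    "BMI": [33.6, 0.0, 23.3, 28.1, 43.1],
    "Glucose": [148, 85, 0, 89, 137],
})

# Treat zeros as missing (an assumption), then drop incomplete rows
cleaned = sample[["BMI", "Glucose"]].replace(0, np.nan).dropna()

print("Rows before cleaning:", len(sample))
print("Rows after cleaning:", len(cleaned))
print(cleaned)
```

The cleaned columns can then be passed to `stats.linregress()` in place of the raw ones. Dropping rows is the simplest option; imputation is an alternative when too much data would be lost.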


##Logistic Regression

SciPy's `stats.linregress()` handles simple linear regression, but SciPy has no dedicated logistic regression function. For logistic regression on the Pima Indians Diabetes dataset, it is therefore recommended to use a specialized library such as Scikit-learn, which provides a robust implementation of logistic regression and other machine learning algorithms. Scikit-learn is widely used, well documented, and offers a user-friendly API for machine learning tasks.

This is a case where another library, Scikit-learn, offers a ready-made solution for a task that SciPy leaves you to implement yourself.

Here's how you can perform logistic regression on the Pima Indians Diabetes dataset using Scikit-learn:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(url, header=None)

# Assigning column names to the dataset
data.columns = [
    "pregnancies",
    "glucose",
    "blood_pressure",
    "skin_thickness",
    "insulin",
    "bmi",
    "diabetes_pedigree",
    "age",
    "outcome",
]

# Preprocessing the data
X = data.drop("outcome", axis=1)  # Features
y = data["outcome"]  # Target variable

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

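# Initialize the Logistic Regression model
# (max_iter is raised above the default of 100 because the unscaled
# features can otherwise keep the solver from converging)
logreg_model = LogisticRegression(max_iter=1000)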

# Fitting the model to the training data
logreg_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = logreg_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


In this example, we use Scikit-learn to perform logistic regression. The dataset is loaded and preprocessed similarly to the previous example. We then split the data into training and testing sets, fit a logistic regression model to the training data, make predictions on the test set, and evaluate the model's performance using accuracy, classification report, and confusion matrix.

Scikit-learn makes it much easier to work with machine learning algorithms, including logistic regression, and is the recommended approach for this type of analysis.
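Two common refinements of this workflow are scaling the features before fitting and working with predicted probabilities instead of hard class labels. The sketch below illustrates both on synthetic stand-in data generated with `make_classification`; the 0.3 threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Pima dataset)
X, y = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features, then fit logistic regression, in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Predicted probability of the positive class for each test row
probabilities = model.predict_proba(X_test)[:, 1]

# Default predictions use a 0.5 threshold; a lower one flags more positives
default_predictions = model.predict(X_test)
custom_predictions = (probabilities >= 0.3).astype(int)

print("Test accuracy at the default threshold:", model.score(X_test, y_test))
print("Positives at threshold 0.5:", default_predictions.sum())
print("Positives at threshold 0.3:", custom_predictions.sum())
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the preprocessing step.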

Now let's compare this with SciPy's approach:

In [None]:
import numpy as np
import pandas as pd
import scipy.optimize as optimize

# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(url, header=None)

# Assigning column names to the dataset
data.columns = [
    "pregnancies",
    "glucose",
    "blood_pressure",
    "skin_thickness",
    "insulin",
    "bmi",
    "diabetes_pedigree",
    "age",
    "outcome",
]

# Preprocessing the data
X = data.drop("outcome", axis=1)  # Features
y = data["outcome"]  # Target variable

# Implementing Logistic Regression using Scipy
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(theta, X, y):
    z = np.dot(X, theta)
    h = sigmoid(z)
    epsilon = 1e-5  # small offset to avoid taking log(0)
    return -np.mean(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))

def gradient(theta, X, y):
    z = np.dot(X, theta)
    h = sigmoid(z)
    return np.dot(X.T, (h - y)) / len(y)

def logistic_regression(X, y):
    # Adding an intercept term (bias)
    X = np.c_[np.ones((X.shape[0], 1)), X]

    # Initialize theta (parameters) with zeros
    theta = np.zeros(X.shape[1])

    # Using Scipy's optimize.minimize function to find the optimal parameters
    result = optimize.minimize(
        cost_function, theta, args=(X, y), jac=gradient, method="BFGS"
    )
    optimal_params = result.x
    return optimal_params

# Fit the logistic regression model
optimal_params = logistic_regression(X, y)

# Display the coefficients for each feature
feature_names = ["Intercept"] + list(X.columns)
coefficients = pd.Series(optimal_params, index=feature_names)
print("Logistic Regression Coefficients:")
print(coefficients)

# Note: Each coefficient is the change in the log-odds of diabetes
# for a one-unit increase in that feature, holding the others fixed.
# Positive coefficients increase the predicted probability of diabetes,
# while negative coefficients decrease it.


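As you can see, the SciPy approach requires far more manual work: you write the sigmoid function, the cost function, and its gradient yourself, and SciPy supplies only the optimizer. Note also that the two fits are not directly comparable: this version is fitted on the full dataset without regularization, whereas Scikit-learn's `LogisticRegression` was fitted on the training split and applies L2 regularization by default.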
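When the cost function and gradient are written by hand, it is worth checking that the two agree before handing them to the optimizer. The sketch below is one way to do that with `scipy.optimize.check_grad`, which compares an analytical gradient against a finite-difference approximation. It uses synthetic data and a numerically stable rewrite of the cost function (via `np.logaddexp` and `scipy.special.expit`), so it is a variant of the code above and not a copy of it.

```python
import numpy as np
from scipy.optimize import check_grad
from scipy.special import expit

def cost_function(theta, X, y):
    """Mean negative log-likelihood of logistic regression."""
    z = X @ theta
    # log(1 + exp(z)) - y * z equals -[y*log(h) + (1-y)*log(1-h)] with h = sigmoid(z)
    return np.mean(np.logaddexp(0, z) - y * z)

def gradient(theta, X, y):
    """Analytical gradient of cost_function with respect to theta."""
    return X.T @ (expit(X @ theta) - y) / len(y)

# Synthetic data (an assumption for illustration), with an intercept column
rng = np.random.default_rng(4)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]
y = rng.integers(0, 2, size=50).astype(float)
theta = rng.normal(size=X.shape[1])

# Norm of the difference between analytical and finite-difference gradients
gradient_error = check_grad(cost_function, gradient, theta, X, y)
print("Gradient check error:", gradient_error)
```

A value close to zero suggests the gradient matches the cost function; a large value usually points to a sign or scaling mistake in one of them.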

So which do you choose?

Both SciPy and Scikit-learn (`scikit-learn`) are powerful Python libraries, but they serve different purposes and have distinct strengths when it comes to logistic regression.

1. **SciPy:**
   - SciPy is a comprehensive library for scientific computing in Python and includes a wide range of functionality for numerical computations, optimization, integration, interpolation, and more.
   - It provides a general-purpose optimization function, `scipy.optimize.minimize()`, which can be used to find the optimal parameters for logistic regression. However, you must implement the logistic regression model yourself, including the cost function and gradient calculation.
   - SciPy is suitable for users who prefer a more hands-on approach to implementation and customization, or those who need to perform various scientific computations beyond machine learning tasks.

2. **Scikit-learn:**
   - Scikit-learn is a dedicated machine learning library in Python and is widely used for building and evaluating machine learning models.
   - It provides a high-level, user-friendly interface to implement various machine learning algorithms, including logistic regression. The `LogisticRegression` class in Scikit-learn handles the entire logistic regression process, including optimization, regularization, and predictions.
   - Scikit-learn offers a wide range of utilities for data preprocessing, model evaluation, and model selection, making it an excellent choice for streamlined machine learning workflows.
   - It also provides a consistent API for other machine learning algorithms, allowing users to switch between different models easily.

In summary, if you are primarily focused on logistic regression and other machine learning tasks, and you prefer a more user-friendly and efficient implementation, Scikit-learn is the go-to choice. On the other hand, if you have more general scientific computing needs and want more control over the optimization process, SciPy might be more suitable. In many cases, users leverage both libraries, combining the strengths of SciPy for general scientific computations and Scikit-learn for machine learning-specific tasks.

#Reflection points


Linear Regression:
1. Interpretability: Linear regression provides straightforward and easily interpretable results. The coefficients of the regression equation indicate the direction and magnitude of the relationship between the predictor variables and the target variable.

2. Assumptions: Linear regression assumes a linear relationship between the predictor variables and the target variable, as well as independence and homoscedasticity of errors. It is essential to check these assumptions before relying on the model's predictions.

3. Continuous Dependent Variable: Linear regression is suitable for predicting continuous numeric values. It is commonly used for tasks such as predicting sales, housing prices, or temperature.

4. Outliers: Linear regression is sensitive to outliers, which can significantly impact the model's performance. It is important to identify and handle outliers appropriately during data preprocessing.

5. Overfitting: Linear regression can suffer from overfitting if there are too many predictor variables relative to the number of observations. Regularization techniques like Ridge or Lasso regression can be applied to mitigate overfitting.
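To illustrate the last point, the sketch below compares ordinary least squares with Ridge regression using the closed-form Ridge solution. The data are synthetic, with few observations relative to the number of predictors, and the penalty strength `lam=10.0` is an arbitrary value chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption): 30 observations, 10 predictors,
# of which only the first two actually influence y
rng = np.random.default_rng(3)
n, p = 30, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [3.0, -2.0]
y = X @ true_beta + rng.normal(0, 1.0, size=n)

# Center the data so the intercept drops out and is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, lam):
    """Closed-form Ridge solution: (X'X + lam * I)^-1 X'y; lam=0 gives OLS."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)

beta_ols = ridge_coefficients(Xc, yc, lam=0.0)
beta_ridge = ridge_coefficients(Xc, yc, lam=10.0)

print("OLS coefficient norm:  ", np.linalg.norm(beta_ols))
print("Ridge coefficient norm:", np.linalg.norm(beta_ridge))
```

The Ridge coefficients have a smaller overall magnitude than the OLS ones. This shrinkage is what trades a little bias for lower variance when predictors are numerous relative to the data.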

Logistic Regression:
1. Binary Classification: Logistic regression is widely used for binary classification problems where the target variable has two classes, such as "yes/no," "spam/not spam," or "1/0."

2. Probability Interpretation: Unlike linear regression, logistic regression models the probability of an observation belonging to a particular class. The predicted probabilities are then thresholded to make binary predictions.

3. Log Odds: Logistic regression works on the log-odds scale, transforming the linear combination of predictor variables into probabilities using the logistic (sigmoid) function.

4. Assumptions: Logistic regression assumes that the relationship between the predictor variables and the log-odds of the target variable is linear. Additionally, it assumes that there is little to no multicollinearity among the predictor variables.

5. Multi-class Classification: While logistic regression is inherently binary, it can be extended to handle multi-class classification problems using techniques like One-vs-Rest (OvR) or Multinomial logistic regression.

6. Decision Boundary: The decision boundary in logistic regression is a hyperplane that separates the two classes. It's important to visualize and understand how this boundary influences the model's predictions.

7. Imbalanced Data: Logistic regression can be affected by imbalanced datasets, where one class dominates the other in terms of the number of observations. Techniques like class weighting or resampling can be used to address this issue.
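The link between log-odds, probabilities, and class predictions described in the points above can be sketched in a few lines of NumPy. The intercept, coefficient, and predictor values below are hypothetical numbers chosen for illustration, not estimates from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model with one predictor (values made up for illustration)
intercept = -4.0
coef = 0.03  # change in log-odds per one-unit increase in the predictor
x = np.array([80.0, 120.0, 160.0, 200.0])

# Linear combination on the log-odds scale
log_odds = intercept + coef * x

# Transform to probabilities, then threshold at 0.5 for binary predictions
probabilities = sigmoid(log_odds)
predictions = (probabilities >= 0.5).astype(int)

# exp(coef) is the multiplicative change in the odds per one-unit increase
odds_ratio = np.exp(coef)

print("Log-odds:", log_odds)
print("Probabilities:", probabilities)
print("Predictions:", predictions)
print("Odds ratio per unit:", odds_ratio)
```

A log-odds of zero corresponds to a probability of 0.5, so with the default threshold the predicted class flips exactly where the linear combination changes sign. That point is the decision boundary.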

Both linear regression and logistic regression are fundamental and widely used techniques in statistical modeling and machine learning. Understanding their strengths, limitations, and assumptions is crucial for selecting the appropriate model for a given problem and interpreting the results accurately.

#A quiz on Regression analysis


1. What is the primary objective of Simple Linear Regression?
   <br>a) To predict a continuous dependent variable based on a single independent variable.
   <br>b) To classify data points into different categories.
   <br>c) To find the correlation between two dependent variables.
   <br>d) To find the area under the curve of a given dataset.

2. In Simple Linear Regression, the relationship between the independent variable (X) and dependent variable (Y) is modeled as:
   <br>a) A straight line: Y = mx + c, where m is the slope and c is the intercept.
   <br>b) A quadratic function: Y = ax^2 + bx + c.
   <br>c) An exponential function: Y = a * e^(bx), where 'e' is the base of natural logarithm.
   <br>d) A sine function: Y = a * sin(bx).

3. Which of the following statements is true regarding the coefficient of determination (R-squared) in Simple Linear Regression?
   <br>a) It has a range of [0, 1], where 0 indicates no correlation and 1 indicates a perfect fit.
   <br>b) It can have negative values, indicating a poor model fit.
   <br>c) It is calculated as R-squared = sum of squared residuals / total sum of squares.
   <br>d) R-squared value can exceed 1 if the model is overfit.

4. What is the main purpose of Logistic Regression?
   <br>a) To predict a continuous dependent variable based on one or more independent variables.
   <br>b) To classify data points into different categories (binary or multiclass) using probability scores.
   <br>c) To find the correlation between two dependent variables.
   <br>d) To estimate the mean and variance of a given dataset.

5. In Logistic Regression, the S-shaped curve used to model the relationship between the independent variable (X) and the probability of belonging to a specific class (Y=1) is known as:
   <br>a) Step function.
   <br>b) Logistic function (sigmoid function).
   <br>c) Exponential function.
   <br>d) Hyperbolic tangent function.

6. Which of the following statements is true regarding the cost function in Logistic Regression?
   <br>a) The cost function for Logistic Regression is the Mean Squared Error (MSE).
   <br>b) The goal is to minimize the cost function using optimization algorithms like Gradient Descent.
   <br>c) The cost function is only used in binary classification problems, not in multiclass problems.
   <br>d) The cost function is not necessary for Logistic Regression.

7. SciPy is a Python library that provides tools for scientific computing, including functions to perform:
   <br>a) Linear algebra operations and optimization routines, but not regression.
   <br>b) Statistical analysis, such as hypothesis testing and ANOVA, but not regression.
   <br>c) Regression analysis, interpolation, and optimization, among other functionalities.
   <br>d) Visualization and plotting, but not regression-related tasks.

8. Which SciPy function can be used to perform Simple Linear Regression?
   <br>a) scipy.regression.linregress()
   <br>b) scipy.stats.linregress()
   <br>c) scipy.linear_model.regression()
   <br>d) scipy.linear_model.LinearRegression()

9. To evaluate a Logistic Regression model, we typically split the data into training and testing sets. What is the purpose of this step?
   <br>a) To use a larger dataset for training to get more accurate results.
   <br>b) To use the testing set to fit the model and the training set to evaluate its performance.
   <br>c) To avoid overfitting and assess the model's generalization to new, unseen data.
   <br>d) To increase the computational efficiency of the Logistic Regression algorithm.

10. Which of the following SciPy functions, if any, performs Logistic Regression?
    <br>a) scipy.linear_model.LogisticRegression()
    <br>b) scipy.stats.logistic_regression()
    <br>c) scipy.regression.logistic_reg()
    <br>d) scipy.logistic_model.LogisticRegression()

---
Answers:
1. a) To predict a continuous dependent variable based on a single independent variable.
2. a) A straight line: Y = mx + c, where m is the slope and c is the intercept.
3. a) It has a range of [0, 1], where 0 indicates no correlation and 1 indicates a perfect fit.
4. b) To classify data points into different categories (binary or multiclass) using probability scores.
5. b) Logistic function (sigmoid function).
6. b) The goal is to minimize the cost function using optimization algorithms like Gradient Descent.
7. c) Regression analysis, interpolation, and optimization, among other functionalities.
8. b) scipy.stats.linregress()
9. c) To avoid overfitting and assess the model's generalization to new, unseen data.
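10. None of the above. None of the listed functions exists, because SciPy has no dedicated logistic regression function. `LogisticRegression` is provided by Scikit-learn (`sklearn.linear_model`); with SciPy, the model is fitted by minimizing a hand-written cost function with `scipy.optimize.minimize()`.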

---