# More on Machine Learning
## Nonlinear Regression
## More on Regression
* Gradient Boosting Regression
* Principal Component Regression (PCR)
* Bayesian Linear Regression (DIY)
## More on Classification
* Naive Bayes Classifier
* Gradient Boosting Classifier
* Linear Discriminant Analysis (LDA)

## Nonlinear Regression Overview
* Nonlinear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables when the relationship is not linear.
* Many real-world relationships are not strictly linear.
* Nonlinear regression allows us to capture more complex patterns in the data.
* Provides flexibility in modeling various functional forms.
* Suitable for problems where linear models fall short.

## Examples
Polynomial, logarithmic, square-root, and interaction terms can be combined to make a regression nonlinear in its inputs. Relationships that often call for this include:
* The growth of a population over time
* The spread of a disease
* The relationship between drug dosage and response
* The relationship between income and education level
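
As a small illustration of the first example, the sketch below fits an exponential growth curve by running an ordinary straight-line fit on the logarithm of the response. The data are synthetic, and the names `t`, `population`, `growth_rate`, and `initial_size` are illustrative assumptions rather than part of the notebook's dataset.

```python
import numpy as np

# Synthetic population following N(t) = 100 * exp(0.15 * t) with small multiplicative noise.
rng = np.random.default_rng(0)
t = np.arange(0, 20)
population = 100.0 * np.exp(0.15 * t) * np.exp(rng.normal(scale=0.02, size=t.size))

# log N(t) = log(N0) + r * t is linear in t, so a degree-1 fit recovers r and log(N0).
slope, intercept = np.polyfit(t, np.log(population), deg=1)
growth_rate = slope
initial_size = np.exp(intercept)

print(f"estimated growth rate: {growth_rate:.3f}")
print(f"estimated initial size: {initial_size:.1f}")
```

The model is nonlinear in `t` but becomes linear after the log transform, which is the same idea as adding `logx1` or `sqrtx1` columns before fitting a linear model.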

In [1]:
import pandas as pd
import numpy as np

# Generate synthetic data
np.random.seed(42)  # For reproducibility

# Generate random values for x1 and x2
n_samples = 100
x1 = np.random.rand(n_samples)
x2 = np.random.rand(n_samples)

# Create DataFrame
df = pd.DataFrame({'x1': x1, 'x2': x2})

# Define the relationship for y
df['y'] = 2 * df['x1'] + 3 * df['x2'] + 0.5 * df['x1'] * df['x2'] + \
          0.1 * df['x1']**2 + 0.2 * df['x2']**2 + \
          0.5 * np.sqrt(df['x1']) + 0.3 * np.sqrt(df['x2']) + \
          0.2 * np.log(df['x1']) + 0.1 * np.log(df['x2']) + \
          np.random.normal(scale=0.1, size=n_samples)  # Adding some random noise

# Display the DataFrame
print(df.head())


         x1        x2         y
0  0.374540  0.031429  0.612247
1  0.950714  0.636410  4.979346
2  0.731994  0.314356  3.042628
3  0.598658  0.508571  3.321972
4  0.156019  0.907566  3.561322


In [2]:
pd.set_option('display.precision', 2)

In [3]:
# Adding new columns
df['x1x2'] = df['x1'] * df['x2']
df['x1^2'] = df['x1']**2
df['x2^2'] = df['x2']**2
df['logx1'] = np.log(df['x1'])
df['logx2'] = np.log(df['x2'])
df['sqrtx1'] = np.sqrt(df['x1'])
df['sqrtx2'] = np.sqrt(df['x2'])
df

| | x1 | x2 | y | x1x2 | x1^2 | x2^2 | logx1 | logx2 | sqrtx1 | sqrtx2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.37 | 0.03 | 0.61 | 0.01 | 1.40e-01 | 9.88e-04 | -0.98 | -3.46 | 0.61 | 0.18 |
| 1 | 0.95 | 0.64 | 4.98 | 0.61 | 9.04e-01 | 4.05e-01 | -0.05 | -0.45 | 0.98 | 0.80 |
| 2 | 0.73 | 0.31 | 3.04 | 0.23 | 5.36e-01 | 9.88e-02 | -0.31 | -1.16 | 0.86 | 0.56 |
| 3 | 0.60 | 0.51 | 3.32 | 0.30 | 3.58e-01 | 2.59e-01 | -0.51 | -0.68 | 0.77 | 0.71 |
| 4 | 0.16 | 0.91 | 3.56 | 0.14 | 2.43e-02 | 8.24e-01 | -1.86 | -0.10 | 0.39 | 0.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.49 | 0.35 | 2.39 | 0.17 | 2.44e-01 | 1.22e-01 | -0.71 | -1.05 | 0.70 | 0.59 |
| 96 | 0.52 | 0.73 | 4.18 | 0.38 | 2.73e-01 | 5.27e-01 | -0.65 | -0.32 | 0.72 | 0.85 |
| 97 | 0.43 | 0.90 | 4.39 | 0.38 | 1.83e-01 | 8.05e-01 | -0.85 | -0.11 | 0.65 | 0.95 |
| 98 | 0.03 | 0.89 | 2.37 | 0.02 | 6.46e-04 | 7.87e-01 | -3.67 | -0.12 | 0.16 | 0.94 |


In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Linear regression using only x1 and x2
X1X2 = df[['x1', 'x2']]
y = df['y']

model_x1x2 = LinearRegression()
model_x1x2.fit(X1X2, y)
y_pred_x1x2 = model_x1x2.predict(X1X2)
r2_x1x2 = r2_score(y, y_pred_x1x2)

print("Linear Regression (x1 and x2):")
print("R^2 Score:", r2_x1x2)

# Linear regression using all columns
X_all = df.drop('y', axis=1)
model_all = LinearRegression()
model_all.fit(X_all, y)
y_pred_all = model_all.predict(X_all)
r2_all = r2_score(y, y_pred_all)

print("\nLinear Regression (all columns):")
print("R^2 Score:", r2_all)


Linear Regression (x1 and x2):
R^2 Score: 0.9899044661254759

Linear Regression (all columns):
R^2 Score: 0.9960396838098363


In [5]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
X1X2 = df[['x1', 'x2']]
y = df['y']

degree = 2  # You can adjust the degree as needed

poly_model_x1x2 = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model_x1x2.fit(X1X2, y)
y_pred_poly_x1x2 = poly_model_x1x2.predict(X1X2)
r2_poly_x1x2 = r2_score(y, y_pred_poly_x1x2)

print("Polynomial Regression (x1 and x2):")
print("R^2 Score:", r2_poly_x1x2)

# Polynomial regression using all columns
X_all = df.drop('y', axis=1)

poly_model_all = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model_all.fit(X_all, y)
y_pred_poly_all = poly_model_all.predict(X_all)
r2_poly_all = r2_score(y, y_pred_poly_all)

print("\nPolynomial Regression (all columns):")
print("R^2 Score:", r2_poly_all)


Polynomial Regression (x1 and x2):
R^2 Score: 0.993139617017216

Polynomial Regression (all columns):
R^2 Score: 0.9972279034747895


In [11]:
df= pd.read_csv('housing.csv')
df = df.dropna()
df.head(2)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.33 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3 | 358500.0 | NEAR BAY |


In [12]:
df.ocean_proximity.value_counts()

<1H OCEAN     9034
INLAND        6496
NEAR OCEAN    2628
NEAR BAY      2270
ISLAND           5
Name: ocean_proximity, dtype: int64

## Gradient Boosting:

* Start with a Simple Model: Begin with a basic model (e.g., decision tree).
* Initial Prediction: Make predictions with the simple model.
* Calculate Residuals: Find the differences between actual and predicted values.
* Build a New Model: Create a new model to predict residuals.
* Update Prediction: Add the new model's output to the previous prediction.
* Repeat: Iteratively build models to correct errors and refine predictions.
* Final Prediction: Combine predictions for a powerful ensemble model.

**Gradient Boosting in a Nutshell:**

1. **Objective:**
   - Minimize the square loss by building a model $F(x)$.

2. **Given:**
   - Dataset $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$

3. **Improving Model $F$:**
   - Residuals: $y_i - F(x_i)$ represent model deficiencies.
   - Goal: Enhance $F$ with an additional model $h$ to compensate for residuals.

4. **Gradient Boosting Iteration:**
   - Build $h_1$ to compensate $F$ deficiencies: $F_1(x) = F(x) + h_1(x)$.
   - If $F_1$ is not perfect, build $h_2$: $F_2(x) = F_1(x) + h_2(x)$.
   - Repeat until satisfied.

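5. **Test Data Generalization:**
   - The aim is for improvements on the training data to carry over to new, unseen data. This is not guaranteed, so check performance on a held-out test set: too many iterations can overfit.

6. **Connection to Gradient Descent:**
   - For the square loss $\frac{1}{2}(y_i - F(x_i))^2$, the residual $y_i - F(x_i)$ is exactly the negative gradient of the loss with respect to $F(x_i)$.
   - Fitting each $h$ to the residuals is therefore analogous to a gradient descent step: it moves $F$ in the direction of steepest descent of the loss function.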

**Formula:**
$$ F(x) = \sum_{m=1}^M \gamma_m h_m(x) $$

**Key Concepts:**
- $M$: Number of weak learners (trees).
- $\gamma_m$: Weight (step size) given to the $m$-th tree; a constant learning rate is the simplest choice.
- $h_m(x)$: Weak learner (tree) at iteration $m$.


In [13]:
# Drop a categorical variable now.
df = df.drop('ocean_proximity', axis=1)

## Gradient Boosting for Regression

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = gb_model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared Score: {r2}')


Mean Squared Error: 3087399399.5374117
R-squared Score: 0.7742333759355379


In [15]:
y.describe()

count     20433.00
mean     206864.41
std      115435.67
min       14999.00
25%      119500.00
50%      179700.00
75%      264700.00
max      500001.00
Name: median_house_value, dtype: float64

In [16]:
df['median_price_category'] = pd.cut(df['median_house_value'],
                                     bins=[-float('inf'), 119500, 179700, 264700, float('inf')],
                                     labels=['Very Low', 'Low', 'Moderate', 'High'],
                                     include_lowest=True)

df['median_price_category'].value_counts()

Very Low    5110
Moderate    5109
Low         5107
High        5107
Name: median_price_category, dtype: int64

In [17]:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

X = df.drop(['median_house_value', 'median_price_category'], axis=1)

# Target variable
y = df['median_price_category']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gradient Boosting Classifier
classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print('\nClassification Report:\n', classification_report_str)


Accuracy: 0.7093

Classification Report:
               precision    recall  f1-score   support

        High       0.84      0.76      0.79      1062
         Low       0.65      0.60      0.63      1054
    Moderate       0.59      0.65      0.62       960
    Very Low       0.77      0.82      0.79      1011

    accuracy                           0.71      4087
   macro avg       0.71      0.71      0.71      4087
weighted avg       0.71      0.71      0.71      4087



# Bayesian Linear Regression

## Objective
- Apply Bayesian statistics to linear regression for uncertainty estimates in predictions.

## Given
- Dataset $(X, y)$ with input features $X$ and target variable $y$.

## Model
- Linear regression model: $y = X\beta + \epsilon$
- Assume a prior distribution for the coefficients $\beta$ and for the noise variance $\sigma^2$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

## Bayesian Formulation
- Posterior distribution: $P(\beta, \sigma^2 | X, y) \propto P(y | X, \beta, \sigma^2) \cdot P(\beta) \cdot P(\sigma^2)$

The left-hand side of the equation, $P(\beta, \sigma^2 | X, y)$, represents the posterior distribution. The right-hand side of the equation is the product of three terms:

- $P(y | X, \beta, \sigma^2)$ is the likelihood function. It represents the probability of observing the data given the values of the parameters.

- $P(\beta)$ is the prior distribution for $\beta$. It represents our beliefs about the values of $\beta$ before we have seen the data.

- $P(\sigma^2)$ is the prior distribution for $\sigma^2$. It represents our beliefs about the values of $\sigma^2$ before we have seen the data.

## Key Equations
1. **Likelihood:** $y \mid X, \beta, \sigma^2 \sim \mathcal{N}(X\beta, \sigma^2 I)$
   - Explanation: This is the distribution of the target variable $y$ given the input features $X$, coefficients $\beta$, and noise variance $\sigma^2$. It is a normal (Gaussian) distribution with mean $X\beta$ and covariance $\sigma^2 I$, where $I$ is the identity matrix.
2. **Prior:** $\beta \sim \mathcal{N}(0, \Sigma)$
   - Explanation: This is the prior distribution for the coefficients $\beta$. It is a normal distribution with mean 0 and covariance matrix $\Sigma$. The prior represents our beliefs or knowledge about the likely values of the coefficients before observing the data.
3. **Prior for Noise:** $\sigma^2 \sim \text{Inv-Gamma}(a, b)$
   - Explanation: This is the prior distribution for the noise variance $\sigma^2$. It is an inverse-gamma distribution with shape parameter $a$ and scale parameter $b$ (written $a$ and $b$ to keep them distinct from the coefficients $\beta$). The prior captures our uncertainty about the amount of noise in the data.
4. **Posterior:** $P(\beta, \sigma^2 \mid X, y) \propto (\sigma^2)^{-(n/2 + a + 1)} \exp\left(-\frac{1}{2\sigma^2} \|y - X\beta\|^2 - \frac{1}{2} \beta^T \Sigma^{-1} \beta - \frac{b}{\sigma^2}\right)$
   - Explanation: This is the posterior distribution, representing our updated beliefs about the coefficients and noise variance after observing the $n$ data points. It is proportional to the product of the likelihood, the prior for $\beta$, and the prior for $\sigma^2$: the factor $(\sigma^2)^{-n/2}$ and the first term in the exponent come from the likelihood, the second term from the prior on $\beta$, and $(\sigma^2)^{-(a+1)} e^{-b/\sigma^2}$ from the inverse-gamma prior.

## Inference
- Use the posterior distribution to make predictions and estimate uncertainties.
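
To make the inference step concrete, here is a minimal NumPy sketch of a simplified case: the noise variance $\sigma^2$ is treated as known, so the posterior for $\beta$ under the $\mathcal{N}(0, \Sigma)$ prior is Gaussian in closed form, with covariance $(\Sigma^{-1} + X^T X / \sigma^2)^{-1}$ and mean equal to that covariance times $X^T y / \sigma^2$. Treating $\sigma^2$ as fixed is an assumption made for this example (the full model above also places a prior on it), and the function names and synthetic data are illustrative.

```python
import numpy as np

def posterior_beta(X, y, prior_cov, noise_var):
    """Gaussian posterior for beta with prior N(0, prior_cov) and known noise variance."""
    prior_precision = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_precision + X.T @ X / noise_var)
    post_mean = post_cov @ X.T @ y / noise_var
    return post_mean, post_cov

def predictive(x_new, post_mean, post_cov, noise_var):
    """Mean and variance of the prediction for one new input row x_new."""
    mean = x_new @ post_mean
    var = noise_var + x_new @ post_cov @ x_new   # noise plus coefficient uncertainty
    return mean, var

# Synthetic data: intercept 1.0 and slope 2.0, noise variance 0.25.
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0])
noise_var = 0.25
y_demo = X_demo @ true_beta + rng.normal(scale=np.sqrt(noise_var), size=n)

post_mean, post_cov = posterior_beta(X_demo, y_demo, prior_cov=10.0 * np.eye(2), noise_var=noise_var)
pred_mean, pred_var = predictive(np.array([1.0, 0.5]), post_mean, post_cov, noise_var)

print("posterior mean of beta:", post_mean)
print(f"prediction at x = 0.5: {pred_mean:.2f} +/- {np.sqrt(pred_var):.2f}")
```

The predictive variance is always larger than the noise variance alone, because it also carries the remaining uncertainty about the coefficients.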


The DIY exercise for this section needs the `pymc3` package (its successor is published as `pymc`). Documentation: https://www.pymc.io/projects/docs/en/stable/learn.html

In [18]:
!pip install pymc3



In [None]:
!pip install --upgrade scipy

# Naive Bayes Classifier

## Introduction

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem. It assumes independence among features, making it "naive" but often effective for text classification and spam filtering.

## Bayes' Theorem

Bayes' theorem is the foundation of Naive Bayes. It gives the probability of a hypothesis $A$ after observing evidence $B$, combining the likelihood of the evidence with prior knowledge about $A$.

$$
P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}
$$

Where:
- $ P(A | B) $ is the posterior probability.
- $ P(B | A) $ is the likelihood.
- $ P(A) $ is the prior probability.
- $ P(B) $ is the evidence.
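
A short worked example may help. The numbers below are made up for illustration: 20% of emails are spam, a given word appears in 60% of spam emails and in 5% of non-spam emails. The helper `bayes_posterior` is a hypothetical name; it expands the evidence $P(B)$ with the law of total probability.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) for a binary hypothesis A, given P(A), P(B | A), and P(B | not A)."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)   # P(B)
    return likelihood * prior / evidence

# P(spam | word) = 0.6 * 0.2 / (0.6 * 0.2 + 0.05 * 0.8) = 0.12 / 0.16
p_spam_given_word = bayes_posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
print(round(p_spam_given_word, 3))  # prints 0.75
```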

## Naive Bayes Classification

The Naive Bayes classifier assumes independence among features given the class. The probability of a class $ C_k $ given features $ x_1, x_2, \ldots, x_n $ can be expressed as:

$$
P(C_k | x_1, x_2, \ldots, x_n) \propto P(C_k) \cdot \prod_{i=1}^{n} P(x_i | C_k)
$$

## Types of Naive Bayes Classifiers

### 1. Multinomial Naive Bayes

- Suitable for discrete data, often used for document classification.

### 2. Gaussian Naive Bayes

- Assumes features follow a normal distribution.

### 3. Bernoulli Naive Bayes

- Appropriate for binary features.

## Training and Prediction

1. **Training:**
   - Estimate prior probabilities $ P(C_k) $ and class-conditional probabilities $ P(x_i | C_k) $ from the training data.

2. **Prediction:**
   - Calculate posterior probabilities using the Naive Bayes formula.
   - Assign the class with the highest posterior probability.
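
The two steps can be sketched directly in NumPy for the Gaussian case. This is a simplified illustration rather than scikit-learn's `GaussianNB` implementation; the function names, the small variance floor of `1e-9`, and the synthetic two-cluster data are assumptions of the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Training: estimate P(C_k) and a per-class mean and variance for every feature."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # avoid division by zero
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    """Prediction: pick the class maximizing log P(C_k) + sum_i log P(x_i | C_k)."""
    classes, priors, means, variances = params
    diff = X[:, None, :] - means[None, :, :]                 # shape (samples, classes, features)
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :])
    log_posterior = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Two well-separated synthetic classes.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(4, 1, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_nb(X_demo, y_demo)
y_hat = predict_gaussian_nb(params, X_demo)
accuracy_demo = np.mean(y_hat == y_demo)
print(f"training accuracy: {accuracy_demo:.2f}")
```

Working with log probabilities turns the product over features into a sum and avoids numerical underflow when there are many features.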

## Advantages

- Simple and computationally efficient.
- Performs well on high-dimensional data.

## Limitations

- Assumes independence, which may not always hold.
- Sensitivity to irrelevant features.



In [19]:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
X = df.drop(['median_house_value', 'median_price_category'], axis=1)
y = df['median_price_category']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the Naive Bayes classifier (Gaussian)
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Make predictions on the test set
y_pred = naive_bayes.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display results
print(f'Accuracy: {accuracy:.2f}')
print('\nConfusion Matrix:')
print(conf_matrix)
print('\nClassification Report:')
print(classification_report(y_test, y_pred))


Accuracy: 0.49

Confusion Matrix:
[[541  38 301 182]
 [ 22 217 264 551]
 [128 129 409 294]
 [  4 120  37 850]]

Classification Report:
              precision    recall  f1-score   support

        High       0.78      0.51      0.62      1062
         Low       0.43      0.21      0.28      1054
    Moderate       0.40      0.43      0.42       960
    Very Low       0.45      0.84      0.59      1011

    accuracy                           0.49      4087
   macro avg       0.52      0.50      0.47      4087
weighted avg       0.52      0.49      0.47      4087



In [20]:
X.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

# Principal Component Regression (PCR)

## Introduction

Principal Component Regression (PCR) is a technique that combines the concepts of Principal Component Analysis (PCA) and linear regression. It aims to handle multicollinearity in regression models by transforming the original features into a new set of uncorrelated variables known as principal components.

## Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms the original features into a set of linearly uncorrelated variables called principal components. The first principal component explains the most variance, followed by the second, and so on.

### PCA Equation

The principal components are obtained by linear combinations of the original features:

$$
Z_i = \sum_{j=1}^{p} \phi_{ij}X_j
$$

Where:
- $Z_i$ is the $i$-th principal component,
- $\phi_{ij}$ is the loading of the $j$-th feature on the $i$-th principal component,
- $X_j$ is the $j$-th original feature,
- $p$ is the number of features.
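
One way to compute the loadings $\phi_{ij}$ is through the singular value decomposition of the centered data matrix. The sketch below uses synthetic data and only centers the features (it does not scale them); the variable names are illustrative assumptions.

```python
import numpy as np

# Three synthetic features; the second is strongly correlated with the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X_demo = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=300), rng.normal(size=300)])

X_centered = X_demo - X_demo.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

loadings = Vt                      # row i holds phi_i1, ..., phi_ip
Z = X_centered @ loadings.T        # column i is the i-th principal component Z_i
explained_variance = singular_values**2 / (len(X_demo) - 1)

cov_Z = np.cov(Z, rowvar=False)    # diagonal: the components are uncorrelated
print("explained variance:", np.round(explained_variance, 3))
print("covariance of components:\n", np.round(cov_Z, 6))
```

The covariance matrix of the components is diagonal, and its diagonal entries are the explained variances in decreasing order.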

## Principal Component Regression (PCR)

PCR involves performing PCA on the original features and then using a subset of the principal components as predictors in a linear regression model.

### PCR Equation

The regression equation using $m$ principal components is:

$$
Y = \beta_0 + \sum_{i=1}^{m} \beta_i Z_i
$$

Where:
- $Y$ is the dependent variable,
- $\beta_0$ is the intercept term,
- $\beta_i$ are the regression coefficients for the principal components,
- $Z_i$ are the selected principal components.

## Steps in PCR

1. Standardize the original features.
2. Perform PCA to obtain principal components.
3. Select a subset of principal components based on explained variance or cross-validation.
4. Use selected principal components as predictors in a linear regression model.
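
The steps above can be sketched with a scikit-learn pipeline in which the number of components is chosen by cross-validation. The synthetic collinear data, the step names (`scale`, `pca`, `ols`), and the 5-fold grid search are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Six strongly collinear features generated from two hidden factors.
rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(n, 6))
y_demo = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

# Steps 1, 2, and 4: standardize, project onto principal components, regress.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Step 3: pick the number of components m by 5-fold cross-validated R^2.
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 7))}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
best_m = search.best_params_["pca__n_components"]

print("selected number of components:", best_m)
print(f"cross-validated R^2: {search.best_score_:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the cross-validation score is not contaminated by the held-out fold.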

## Advantages

- Addresses multicollinearity in regression.
- Reduces dimensionality while preserving important information.

## Limitations

- Interpretability can be challenging.
- The choice of the number of principal components requires consideration.



In [21]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop(['median_house_value', 'median_price_category'], axis=1)
y = df['median_house_value']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

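# Create a PCR pipeline.
# Note: PCA() with no n_components keeps every component, so this fit matches
# ordinary linear regression on the standardized features; pass n_components=m to use a subset.
pcr_model = make_pipeline(StandardScaler(), PCA(), LinearRegression())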

# Fit the model on training data
pcr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pcr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display results
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')


Mean Squared Error: 4921881237.63
R-squared: 0.64


# Linear Discriminant Analysis (LDA)

## Introduction

Linear Discriminant Analysis (LDA) is a dimensionality reduction and classification technique that finds the linear combinations of features that best separate different classes in a dataset.

## Objective

Find a linear transformation of the features to maximize the separation between classes.

## Mathematical Formulation

### Within-Class Scatter Matrix (S_W)

$$
S_W = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T
$$

Where:
- $C$ is the number of classes.
- $n_i$ is the number of samples in class $i$.
- $x_{ij}$ is the $j$-th sample in class $i$.
- $\mu_i$ is the mean vector of class $i$.

### Between-Class Scatter Matrix (S_B)

$$
S_B = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T
$$

Where:
- $n_i$ is the number of samples in class $i$.
- $\mu$ is the overall mean vector.

### Eigenvalue Problem

Solve the generalized eigenvalue problem:

$$
S_W^{-1}S_B \mathbf{v} = \lambda \mathbf{v}
$$

Where:
- $\mathbf{v}$ is the eigenvector.
- $\lambda$ is the eigenvalue.
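
The scatter matrices and the eigenvalue problem can be computed directly with NumPy. The sketch below uses three synthetic classes in three dimensions; the function name `scatter_matrices` and the data are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the within-class (S_W) and between-class (S_B) scatter matrices."""
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_W += centered.T @ centered                      # sum of (x - mu_i)(x - mu_i)^T
        d = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (d @ d.T)                       # n_i (mu_i - mu)(mu_i - mu)^T
    return S_W, S_B

# Three synthetic classes with 50 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
X_demo = np.vstack([rng.normal(c, 1.0, size=(50, 3)) for c in centers])
y_demo = np.repeat([0, 1, 2], 50)

S_W, S_B = scatter_matrices(X_demo, y_demo)

# Eigen-decomposition of S_W^{-1} S_B, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]

print("eigenvalues:", np.round(eigvals, 4))
```

With $C$ classes, $S_B$ has rank at most $C - 1$, so only the first two eigenvalues here are meaningfully non-zero; the corresponding eigenvectors are the discriminant directions.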

### Fisher's Linear Discriminant

The optimal projection vector $\mathbf{w}$ is the eigenvector corresponding to the largest eigenvalue. For two classes with means $\mu_1$ and $\mu_2$, it has the closed form:

$$
\mathbf{w} \propto S_W^{-1} (\mu_1 - \mu_2)
$$

## Classification

Project a new sample onto the discriminant vector and assign it to the class whose projected mean is closest.

$$
\hat{y} = \arg\min_i \left| \mathbf{w}^T x - \mathbf{w}^T \mu_i \right|
$$

Where:
- $\hat{y}$ is the predicted class.
- $x$ is the new sample.
- $\mu_i$ is the mean vector of class $i$.
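
For the two-class case, the projection and the nearest-projected-mean rule can be sketched as follows. The synthetic data, the class labels 1 and 2, and the helper name `classify` are assumptions made for the example.

```python
import numpy as np

# Two synthetic classes in two dimensions.
rng = np.random.default_rng(1)
X_1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X_2 = rng.normal([4.0, 2.0], 1.0, size=(100, 2))
mu_1, mu_2 = X_1.mean(axis=0), X_2.mean(axis=0)

# Within-class scatter and Fisher's direction w proportional to S_W^{-1} (mu_1 - mu_2).
S_W = (X_1 - mu_1).T @ (X_1 - mu_1) + (X_2 - mu_2).T @ (X_2 - mu_2)
w = np.linalg.solve(S_W, mu_1 - mu_2)

def classify(x):
    """Return 1 or 2: the class whose projected mean is closest to the projection of x."""
    projected_means = np.array([w @ mu_1, w @ mu_2])
    return int(np.argmin(np.abs(w @ x - projected_means))) + 1

X_all = np.vstack([X_1, X_2])
labels = np.array([1] * 100 + [2] * 100)
predictions = np.array([classify(x) for x in X_all])
accuracy_demo = np.mean(predictions == labels)
print(f"training accuracy: {accuracy_demo:.2f}")
```

Using `np.linalg.solve` instead of explicitly inverting $S_W$ gives the same direction and is numerically more stable.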

## Advantages

- Effective in high-dimensional spaces.
- Serves as both a classifier and a dimensionality reduction technique.

## Limitations

- Sensitive to outliers.
- Assumes features are normally distributed within each class, with the same covariance matrix for every class.



In [22]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X = df.drop(['median_house_value', 'median_price_category'], axis=1)
y = df['median_price_category']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the LDA model
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lda_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display results
print(f'Accuracy: {accuracy:.2f}')
print('\nConfusion Matrix:')
print(conf_matrix)
print('\nClassification Report:')
print(classification_report(y_test, y_pred))


Accuracy: 0.61

Confusion Matrix:
[[690  39 324   9]
 [ 23 524 309 198]
 [182 174 559  45]
 [  3 245  57 706]]

Classification Report:
              precision    recall  f1-score   support

        High       0.77      0.65      0.70      1062
         Low       0.53      0.50      0.51      1054
    Moderate       0.45      0.58      0.51       960
    Very Low       0.74      0.70      0.72      1011

    accuracy                           0.61      4087
   macro avg       0.62      0.61      0.61      4087
weighted avg       0.62      0.61      0.61      4087

