# PCR: Principal Component Regression

- Principal Component Regression (PCR) is a regression technique with the same goal as standard linear regression: modeling the relationship between a target variable and a set of predictor variables.
- PCR works in three steps:
1. Apply PCA to the predictor variables to generate principal components. With p original features, PCA yields up to p components.
2. Keep the first k principal components (k < p), which capture most of the variance in the predictors. The value of k is typically chosen by cross-validation.
3. Fit a linear regression model (ordinary least squares) on these k principal components.
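
The three steps above can be sketched as a scikit-learn pipeline. This is a minimal illustration on synthetic data, not the notebook's own analysis: `X_demo`, `y_demo`, `cv_demo`, the coefficient vector, and the choice of 5 folds are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 200 samples, p = 6 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
p = X_demo.shape[1]

# For each candidate k: standardize -> PCA keeping k components -> OLS.
# k = p is included only as a reference point; it is equivalent to
# ordinary least squares on all features.
cv_mse = {}
for k in range(1, p + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    scores = cross_val_score(
        pcr, X_demo, y_demo, cv=cv_demo, scoring="neg_mean_squared_error"
    )
    cv_mse[k] = -scores.mean()  # flip sign: scikit-learn returns negative MSE

# Pick the k with the lowest cross-validated MSE
best_k = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("best k:", best_k)
```

Putting the scaler and PCA inside the pipeline means both are refitted on the training part of every fold, so the held-out part never influences the components.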

## Benefits:
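1. PCR can reduce overfitting, because the regression is fitted on k < p components instead of all p features.
2. PCR mitigates multicollinearity: the principal components are uncorrelated with one another, and the components associated with small eigenvalues are removed.
3. PCR copes with high-dimensional data, where the number of features is large relative to the number of observations.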
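
The multicollinearity point can be seen on a small synthetic example. The sketch below builds three features, one of which is nearly a copy of another (the names `x1`, `x2`, `x3` and the noise level are assumptions for illustration), and shows that the near-duplicate produces one very small eigenvalue while the component scores themselves are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                   # independent feature
X_collinear = np.column_stack([x1, x2, x3])

pca_demo = PCA().fit(X_collinear)

# Eigenvalues of the covariance matrix, in descending order.
# The last one is close to zero because x1 and x2 are almost collinear.
print(pca_demo.explained_variance_)

# The principal component scores are mutually uncorrelated
component_scores = pca_demo.transform(X_collinear)
corr_scores = np.corrcoef(component_scores, rowvar=False)
print(np.round(corr_scores, 3))
```

Dropping the last component here discards the direction in which `x1` and `x2` differ only by noise, which is the direction that makes ordinary least squares coefficients unstable.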

## Caveats (Disadvantages):
1. PCR is not considered a feature selection method, because the principal components used in the regression are linear combinations of all the original features.
2. Because each component mixes all the original features, the predictors used in the regression lose their 'explainability' compared to the original features.
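
To make the "linear combination" point concrete, the sketch below inspects the loadings stored in `PCA.components_`. The feature names and random data are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names and random data, for illustration only
feature_names = ["acidity", "sugar", "alcohol"]
rng = np.random.default_rng(2)
X_small = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_names)

Z = StandardScaler().fit_transform(X_small)
pca_small = PCA(n_components=2).fit(Z)

# Each row holds the weights one component puts on every original feature
loadings = pd.DataFrame(
    pca_small.components_, columns=feature_names, index=["PC1", "PC2"]
)
print(loadings)

# A component score is the weighted sum of all (centered) features
pc_scores = pca_small.transform(Z)
manual_scores = (Z - pca_small.mean_) @ pca_small.components_.T
print(np.allclose(pc_scores, manual_scores))
```

Since every row of `loadings` generally has non-zero weights on all features, a regression coefficient on PC1 cannot be read as the effect of any single original variable.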

In [2]:
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale 
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA

In [14]:
# import dataset
df = pd.read_csv("winequality-red.csv",sep=';')

In [15]:
# observing table
df.head()

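|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |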


In [16]:
# separate the predictors (X) from the target (y)
target = 'quality'
X = df.drop(target,axis=1)
y = df[target]

In [18]:
df.shape

(1599, 12)

In [22]:
# split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

In [23]:
X_train.shape

(1279, 11)

In [24]:
# Data standardization
# Standardization rescales each feature to zero mean and unit variance.
# PCA is sensitive to feature scale, so features measured in larger units
# would otherwise dominate the principal components.
# Fit the scaler on the training set only and reuse its mean and standard
# deviation for the test set, so both sets are transformed identically and
# no information from the test set leaks into the model.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled

array([[ 0.95891964, -0.64729253,  0.91841253, ..., -0.66144466,
        -1.00107571,  1.22683979],
       [ 1.18715099, -0.53480404,  1.17686129, ..., -1.36563421,
         1.23858005,  0.85903588],
       [-0.0110636 ,  0.4213481 , -0.73565955, ..., -0.02127234,
        -0.41169262, -1.1638856 ],
       ...,
       [-0.75281549, -0.92851375, -0.32214153, ...,  0.55488275,
        -0.70638416, -0.33632681],
       [-0.12517927,  1.46186661,  1.28024079, ..., -0.66144466,
         0.17769048, -1.07193462],
       [-0.4675263 , -0.81602526,  1.53868955, ...,  0.29881382,
        -0.64744585,  0.85903588]])
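
In the statistical sense used here, standardizing a feature means subtracting its mean and dividing by its standard deviation. The short check below reproduces that by hand on a toy array (`X_toy` is an assumption for illustration) and compares it with `sklearn.preprocessing.scale`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy data (assumption for illustration): two features on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# z = (x - column mean) / column standard deviation (population std, ddof=0)
manual_z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
sk_z = scale(X_toy)

print(np.allclose(manual_z, sk_z))          # manual formula matches scale()
print(sk_z.mean(axis=0), sk_z.std(axis=0))  # per-column mean and std after scaling
```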

# Exploring different models and comparing their performance

In [25]:
# Define cross-validation folds
cv = KFold(n_splits=10, shuffle=True, random_state=42)
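
As a quick illustration of what this fold definition does, the sketch below splits a placeholder array with the same number of rows as the training set (1279, as reported by `X_train.shape` above) and counts the validation rows in each fold. The placeholder array and the names `cv_demo` and `n_rows` are assumptions; only the row count matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

cv_demo = KFold(n_splits=10, shuffle=True, random_state=42)

n_rows = 1279  # number of training rows reported by X_train.shape
placeholder = np.zeros((n_rows, 1))  # stand-in array; KFold only needs the length

# Each split yields (train indices, validation indices)
fold_sizes = [len(val_idx) for _, val_idx in cv_demo.split(placeholder)]
print(fold_sizes)
print(sum(fold_sizes))
```

Every row lands in exactly one validation fold, so each observation is used for validation once and for fitting nine times.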