<a href="https://colab.research.google.com/github/giocarro/Data_Science_Gio/blob/main/Actividades/Pandas_class_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DIPLOMADO CIENCIA MATEMÁTICA DE DATOS

## MODULE: Dimensionality Reduction

### 06/06/2023

## Libraries & Functions

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.decomposition import PCA

In [None]:
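def plot_confusion_matrix(cm, labels):
    """Plot a confusion matrix as an annotated Plotly heatmap.

    cm: square array as returned by sklearn.metrics.confusion_matrix.
    labels: class names in the same order as the rows/columns of cm.
    """
    fig_cm = px.imshow(cm, labels=dict(x="Predicted", y="Actual", color="Count"),
                       x=labels, y=labels, color_continuous_scale='Viridis', text_auto=True,
                       title="Confusion Matrix")
    fig_cm.update_layout(coloraxis_showscale=False)  # hide the redundant color bar
    fig_cm.show()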

## Data Loading

In [None]:
breast_cancer = load_breast_cancer()
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
df['target'] = breast_cancer.target
df.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
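
Before interpreting the `target` column, it may help to check how scikit-learn encodes the two classes. The following is a minimal sketch; the names `data` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that is encoded as integer i in data.target
for code, name in enumerate(data.target_names):
    print(code, name)

# Number of samples per encoded class (index 0 first, then index 1)
counts = np.bincount(data.target)
print({str(name): int(c) for name, c in zip(data.target_names, counts)})
```

In this dataset 0 encodes malignant and 1 encodes benign, so any class labels passed to a plot or report should follow that order.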


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

## Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used to transform a dataset from a high-dimensional space into a lower-dimensional space while retaining most of the original information. It accomplishes this by identifying the principal components, which are new orthogonal variables that capture the maximum variance in the data.

**Steps of PCA:**

1. **Standardization:** Before applying PCA, it is important to standardize the features to have zero mean and unit variance. This ensures that all features contribute equally to the analysis. The standardized value for a feature x is calculated as follows:

   ![Standardization Formula](https://latex.codecogs.com/png.image?\dpi{150}&space;\bg_white&space;z&space;=&space;\frac{x&space;-&space;\mu}{\sigma})

   where z is the standardized value, x is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature.

2. **Covariance Matrix:** Compute the covariance matrix of the standardized data. The covariance matrix measures the relationships between pairs of features and provides insights into the data's variance structure. The covariance between two features x and y is calculated as follows:

   ![Covariance Formula](https://latex.codecogs.com/png.image?\dpi{150}&space;\bg_white&space;Cov(x,&space;y)&space;=&space;\frac{1}{n-1}\sum_{i=1}^{n}(x_i&space;-&space;\bar{x})(y_i&space;-&space;\bar{y}))

   where Cov(x, y) is the covariance between x and y, n is the number of samples, xi and yi are the standardized values of x and y, and 𝑥̅ and 𝑦̅ are the means of x and y, respectively.

3. **Eigenvalue Decomposition:** Perform eigenvalue decomposition on the covariance matrix to obtain the eigenvectors and eigenvalues. The eigenvectors represent the principal components, and the corresponding eigenvalues indicate the amount of variance explained by each principal component. The eigenvalue decomposition equation is as follows:

   ![Eigenvalue Decomposition Formula](https://latex.codecogs.com/png.image?\dpi{150}&space;\bg_white&space;\mathbf{C}&space;\mathbf{v}&space;=&space;\lambda&space;\mathbf{v})

   where C is the covariance matrix, v is the eigenvector, λ is the eigenvalue.

4. **Selecting Principal Components:** Sort the eigenvalues in descending order and select the top k eigenvectors corresponding to the highest eigenvalues. These eigenvectors constitute the principal components that capture the most significant variation in the data.

5. **Projection:** Project the standardized data onto the selected principal components to obtain the transformed dataset in the lower-dimensional space. The transformed value for a sample x is calculated as follows:

   ![Projection Formula](https://latex.codecogs.com/png.image?\dpi{150}&space;\bg_white&space;\mathbf{y}&space;=&space;\mathbf{x}&space;\mathbf{W})

   where y is the transformed sample (a row vector of length k), x is the standardized sample (a row vector of length p, the number of features), and W is the p × k matrix whose columns are the selected eigenvectors.
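
The five steps above can be sketched directly in NumPy. This is an illustrative sketch on hypothetical random data (`X_toy`), not the implementation scikit-learn uses internally (which relies on an SVD); the signs of the eigenvectors are arbitrary, so results may differ from a library by a sign flip per component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 5 correlated features
X_toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Standardization: zero mean, unit (population) standard deviation
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# 2. Covariance matrix of the standardized features (1 / (n - 1) denominator)
C = np.cov(Z, rowvar=False)

# 3. Eigenvalue decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort in descending order and keep the top k eigenvectors as columns of W
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]

# 5. Projection: y = x W for every standardized sample
Y = Z @ W

print(Y.shape)                        # (200, 2)
print(eigvals[:k] / eigvals.sum())    # fraction of variance per kept component
```

A useful sanity check is that the covariance matrix of `Y` is diagonal with the selected eigenvalues on the diagonal, which reflects that the principal components are uncorrelated.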

**Interpreting PCA:**

PCA provides several insights into the data:

- The eigenvalues indicate the amount of variance explained by each principal component. Larger eigenvalues correspond to principal components that capture more variation in the data.

- The eigenvectors (principal components) are orthogonal to each other. They represent new axes in the transformed space, where each axis captures a different aspect of the data's variance.

- The cumulative explained variance is the sum of the first k eigenvalues divided by the sum of all eigenvalues. It helps determine the number of principal components needed to retain a desired amount of information:

  Explained Variance_cumulative(k) = (∑_{i=1}^{k} λ_i) / (∑_{j=1}^{p} λ_j)

  where λ_i is the eigenvalue of the i-th principal component and p is the total number of principal components.

- The principal components can be interpreted in terms of feature contributions. The larger the absolute value of a feature's coefficient (loading) in a component, the more that feature contributes to the component. A positive coefficient means the component score increases with the feature; a negative coefficient means it decreases.

- PCA can also be visualized by plotting the transformed data in the reduced-dimensional space. Each point represents a sample, and the axes correspond to the principal components. By examining the distribution of samples, patterns, clusters, or separability in the transformed space can be observed.

By understanding and interpreting the eigenvalues, eigenvectors, cumulative explained variance, feature importance, and visualizations, we can gain valuable insights into the underlying structure and patterns in the data through PCA.
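
As a small worked example of the cumulative explained variance, the sketch below uses hypothetical eigenvalues (chosen only for easy arithmetic) and picks the smallest k that reaches a threshold. The threshold of 0.90 is an assumption for illustration.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

# Fraction of total variance per component, then the running sum
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(cumulative)  # 0.5, 0.75, 0.875, 0.9375, 1.0

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.90
k = int(np.argmax(cumulative >= threshold)) + 1
print(k)  # 4
```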

**Applications of PCA:**

- *Dimensionality Reduction:* PCA is primarily used for dimensionality reduction, where high-dimensional datasets are transformed into lower-dimensional representations while preserving most of the information.

- *Data Visualization:* PCA can be used to visualize high-dimensional data in two or three dimensions by projecting it onto the principal components. This allows for better understanding and interpretation of the data.

- *Noise Reduction:* PCA can help in denoising data by removing noise-related principal components that contribute little to the overall variance. By discarding the components with lower eigenvalues, noise can be reduced in the reconstructed data.

- *Feature Extraction:* PCA can be used as a feature extraction technique to derive new features that capture the most important aspects of the data. The transformed principal components can be used as features in subsequent analysis tasks.

By applying PCA, we can gain insights into the data's variance structure, reduce dimensionality, visualize high-dimensional data, remove noise, and extract meaningful features.
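
The feature-extraction and noise-reduction uses can be illustrated with `PCA.inverse_transform`, which maps the reduced representation back to the original space. The data below is hypothetical: a rank-2 signal embedded in 10 dimensions with small Gaussian noise added, so the setup is deliberately favorable to PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical data: rank-2 signal in 10 dimensions plus small noise
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

pca = PCA(n_components=2)
reduced = pca.fit_transform(noisy)          # feature extraction: 10 -> 2 columns
denoised = pca.inverse_transform(reduced)   # reconstruction in 10 dimensions

# Mean squared error against the clean signal, before and after
err_noisy = np.mean((noisy - signal) ** 2)
err_denoised = np.mean((denoised - signal) ** 2)
print(err_noisy, err_denoised)
```

Here the reconstruction error drops because the discarded components carry mostly noise. On real data the benefit depends on how much of the signal lives in the discarded components.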

In [None]:
X = breast_cancer.data  # Features
y = breast_cancer.target  # Labels

In [None]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [None]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In PCA, the `explained_variance_ratio_` and `components_` attributes provide important insights into the transformed data and the contribution of each principal component. Here's how you can use and interpret these attributes:

#### Explained Variance Ratio (`explained_variance_ratio_`):

The `explained_variance_ratio_` attribute of the PCA object represents the proportion of the variance explained by each principal component.

- Each value in `explained_variance_ratio_` corresponds to a principal component and gives the fraction (between 0 and 1) of the total variance explained by that component.
- You can use this attribute to determine the relative importance of each principal component in capturing the variability in the data.
- The values sum to 1 only when all components are kept. With `n_components` smaller than the number of features, the sum is the fraction of total variance retained by the selected components.
- Higher values in `explained_variance_ratio_` indicate that the corresponding principal components capture more information from the original data.
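
To see how the sum of `explained_variance_ratio_` depends on the number of components kept, the following sketch fits PCA twice on the standardized breast cancer features. The names `pca_full` and `pca_two` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

pca_full = PCA().fit(X_std)               # keeps all 30 components
pca_two = PCA(n_components=2).fit(X_std)  # keeps only the first two

total_full = pca_full.explained_variance_ratio_.sum()
total_two = pca_two.explained_variance_ratio_.sum()
print(total_full)  # 1.0 up to floating-point error
print(total_two)   # fraction of variance retained by two components
```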

#### Principal Components' Coefficients (`components_`):

The `components_` attribute of the PCA object represents the coefficients or weights of the original features in the transformed PCA space.

- It is a matrix where each row corresponds to a principal component, and each column represents a feature in the original dataset.
- The values in `components_` indicate the contribution or influence of each feature on the corresponding principal component.
- Positive and negative values in `components_` represent the direction and magnitude of the feature's influence on the principal component.
- Features with higher absolute values in `components_` have a stronger impact on the corresponding principal component.
- You can use this attribute to understand which original features contribute the most to each principal component and identify patterns or relationships between features and components.
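
One way to read `components_` is to label its columns with the feature names and list the largest loadings per component. This is a sketch; the `loadings` DataFrame and the choice of showing five features are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_std)

# Rows are components, columns are original features (signed loadings)
loadings = pd.DataFrame(pca.components_, columns=data.feature_names,
                        index=['PC1', 'PC2'])

# For each component, show the five features with the largest absolute loading,
# keeping the sign so the direction of influence stays visible
for pc in loadings.index:
    top = loadings.loc[pc].abs().nlargest(5).index
    print(pc)
    print(loadings.loc[pc, top])
```

Keeping the signed values (rather than taking `abs` up front) preserves the direction of each feature's influence; the overall sign of a component is arbitrary, but relative signs within a component are meaningful.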

To interpret `explained_variance_ratio_` and `components_`, you can follow these steps:

1. Check `explained_variance_ratio_` to understand the proportion of variance explained by each principal component. Higher values indicate more significant contributions.
2. Examine the individual values in `explained_variance_ratio_` to identify the most important principal components. A common rule is to keep the leading components until their cumulative variance passes a chosen threshold.
3. Use `components_` to inspect the coefficients or weights of the original features in each principal component. Look for features with high absolute values, as they have a stronger influence on the component.
4. Analyze the relationships between the original features and the principal components. Features with similar signs and magnitudes within the same component tend to vary together, which may indicate correlated features.
5. Consider the cumulative contribution of the selected principal components. You can sum the `explained_variance_ratio_` values to determine the total proportion of variance explained by the chosen components.

By understanding the `explained_variance_ratio_` and `components_`, you can gain insights into the transformed data and make informed decisions about the importance of each principal component and the relationships between features and components.
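
The threshold-based selection described in the steps above can also be delegated to scikit-learn: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The 0.95 threshold below is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# A float in (0, 1) selects the number of components by explained variance
pca_95 = PCA(n_components=0.95).fit(X_std)

print(pca_95.n_components_)                       # number of components kept
print(pca_95.explained_variance_ratio_.cumsum())  # running total of variance
```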


### PCA with 2 components

In [None]:
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
pca_df_2d = pd.DataFrame(data=X_pca_2d, columns=['PC1', 'PC2'])
pca_df_2d['diagnosis'] = y
pca_df_2d.head()

|   | PC1 | PC2 | diagnosis |
|---|---|---|---|
| 0 | 9.192837 | 1.948583 | 0 |
| 1 | 2.387802 | -3.768172 | 0 |
| 2 | 5.733896 | -1.075174 | 0 |
| 3 | 7.122953 | 10.275589 | 0 |
| 4 | 3.935302 | -1.948072 | 0 |


In [None]:
fig_2d = px.scatter(pca_df_2d, x='PC1', y='PC2', color='diagnosis', template = 'plotly_white', title = 'PCA - 2 Components')
fig_2d.show()

In [None]:
explained_variance_2d = pca_2d.explained_variance_ratio_
print("Explained Variance Ratio (2D):", explained_variance_2d)

Explained Variance Ratio (2D): [0.44272026 0.18971182]


In [None]:
components = pd.DataFrame(abs(pca_2d.components_), columns=breast_cancer.feature_names)
components

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.218902 | 0.103725 | 0.227537 | 0.220995 | 0.14259 | 0.239285 | 0.2584 | 0.260854 | 0.138167 | 0.064363 | ... | 0.227997 | 0.104469 | 0.23664 | 0.224871 | 0.127953 | 0.210096 | 0.228768 | 0.250886 | 0.122905 | 0.131784 |
| 1 | 0.233857 | 0.059706 | 0.215181 | 0.231077 | 0.186113 | 0.151892 | 0.060165 | 0.034768 | 0.190349 | 0.366575 | ... | 0.219866 | 0.045467 | 0.199878 | 0.219352 | 0.172304 | 0.143593 | 0.097964 | 0.008257 | 0.141883 | 0.275339 |


#### Classification

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df['target'], test_size=0.2, random_state=7)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
pca = PCA(n_components=2)
X_train_pca_2d = pca.fit_transform(X_train_scaled)
X_test_pca_2d = pca.transform(X_test_scaled)
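
The fit-on-train, transform-on-test pattern used above can also be expressed as a scikit-learn `Pipeline`, which applies the same rule automatically inside cross-validation. This is a sketch of an alternative arrangement, not the notebook's own workflow; the 5-fold setting and the name `pipe` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and PCA are refit on the training part of every fold,
# so no statistics from the held-out part leak into the transform
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

scores = cross_val_score(pipe, X, y, cv=5)  # accuracy per fold
print(scores.mean())
```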

##### Logistic Regression

In [None]:
lr_pca_2d = LogisticRegression()
lr_pca_2d.fit(X_train_pca_2d, y_train)
lr_pca_2d_pred = lr_pca_2d.predict(X_test_pca_2d)

In [None]:
lr_pca_2d_accuracy = accuracy_score(y_test, lr_pca_2d_pred)
lr_pca_2d_precision = precision_score(y_test, lr_pca_2d_pred)
lr_pca_2d_recall = recall_score(y_test, lr_pca_2d_pred)
lr_pca_2d_f1 = f1_score(y_test, lr_pca_2d_pred)
lr_pca_2d_report = classification_report(y_test, lr_pca_2d_pred)
print("Logistic Regression PCA 2D Classification Report:")
print(lr_pca_2d_report)

Logistic Regression PCA 2D Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.93      0.95        40
           1       0.96      0.99      0.97        74

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [None]:
lr_pca_2d_cm = confusion_matrix(y_test, lr_pca_2d_pred)
# Label order follows the target encoding: 0 = malignant, 1 = benign
plot_confusion_matrix(lr_pca_2d_cm, ['Malignant', 'Benign'])

##### KNN

In [None]:
knn_pca_2d = KNeighborsClassifier()
knn_pca_2d.fit(X_train_pca_2d, y_train)
knn_pca_2d_pred = knn_pca_2d.predict(X_test_pca_2d)

In [None]:
knn_pca_2d_accuracy = accuracy_score(y_test, knn_pca_2d_pred)
knn_pca_2d_precision = precision_score(y_test, knn_pca_2d_pred)
knn_pca_2d_recall = recall_score(y_test, knn_pca_2d_pred)
knn_pca_2d_f1 = f1_score(y_test, knn_pca_2d_pred)
knn_pca_2d_report = classification_report(y_test, knn_pca_2d_pred)
print("KNN PCA 2D Classification Report:")
print(knn_pca_2d_report)

KNN PCA 2D Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.98        40
           1       1.00      0.97      0.99        74

    accuracy                           0.98       114
   macro avg       0.98      0.99      0.98       114
weighted avg       0.98      0.98      0.98       114



In [None]:
knn_pca_2d_cm = confusion_matrix(y_test, knn_pca_2d_pred)
# Label order follows the target encoding: 0 = malignant, 1 = benign
plot_confusion_matrix(knn_pca_2d_cm, ['Malignant', 'Benign'])

##### SVM

In [None]:
svm_pca_2d = SVC()
svm_pca_2d.fit(X_train_pca_2d, y_train)
svm_pca_2d_pred = svm_pca_2d.predict(X_test_pca_2d)

In [None]:
svm_pca_2d_accuracy = accuracy_score(y_test, svm_pca_2d_pred)
svm_pca_2d_precision = precision_score(y_test, svm_pca_2d_pred)
svm_pca_2d_recall = recall_score(y_test, svm_pca_2d_pred)
svm_pca_2d_f1 = f1_score(y_test, svm_pca_2d_pred)
svm_pca_2d_report = classification_report(y_test, svm_pca_2d_pred)
print("SVM PCA 2D Classification Report:")
print(svm_pca_2d_report)

SVM PCA 2D Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.88      0.91        40
           1       0.94      0.97      0.95        74

    accuracy                           0.94       114
   macro avg       0.94      0.92      0.93       114
weighted avg       0.94      0.94      0.94       114



In [None]:
svm_pca_2d_cm = confusion_matrix(y_test, svm_pca_2d_pred)
# Label order follows the target encoding: 0 = malignant, 1 = benign
plot_confusion_matrix(svm_pca_2d_cm, ['Malignant', 'Benign'])

##### Naive Bayes


In [None]:
nb_pca_2d = GaussianNB()
nb_pca_2d.fit(X_train_pca_2d, y_train)
nb_pca_2d_pred = nb_pca_2d.predict(X_test_pca_2d)

In [None]:
nb_pca_2d_accuracy = accuracy_score(y_test, nb_pca_2d_pred)
nb_pca_2d_precision = precision_score(y_test, nb_pca_2d_pred)
nb_pca_2d_recall = recall_score(y_test, nb_pca_2d_pred)
nb_pca_2d_f1 = f1_score(y_test, nb_pca_2d_pred)
nb_pca_2d_report = classification_report(y_test, nb_pca_2d_pred)
print("Naive Bayes PCA 2D Classification Report:")
print(nb_pca_2d_report)

Naive Bayes PCA 2D Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.80      0.84        40
           1       0.90      0.95      0.92        74

    accuracy                           0.89       114
   macro avg       0.89      0.87      0.88       114
weighted avg       0.89      0.89      0.89       114



In [None]:
nb_pca_2d_cm = confusion_matrix(y_test, nb_pca_2d_pred)
# Label order follows the target encoding: 0 = malignant, 1 = benign
plot_confusion_matrix(nb_pca_2d_cm, ['Malignant', 'Benign'])
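
To compare the four classifiers side by side, the metrics can be collected into a single table. The sketch below rebuilds the same kind of split (20% test, `random_state=7`) in a self-contained way; the `models` dictionary and the `results` DataFrame are illustrative names, and the scores refer to the positive class 1 (benign), which is scikit-learn's default `pos_label`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Naive Bayes': GaussianNB(),
}

rows = []
for name, clf in models.items():
    # Same preprocessing for every model: scale, project to 2 PCs, classify
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2), clf)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    rows.append({
        'model': name,
        'accuracy': accuracy_score(y_test, pred),
        'precision': precision_score(y_test, pred),
        'recall': recall_score(y_test, pred),
        'f1': f1_score(y_test, pred),
    })

results = pd.DataFrame(rows).set_index('model').sort_values('f1', ascending=False)
print(results.round(3))
```

A single held-out split of 114 samples gives noisy estimates, so small differences between models in this table should be read with caution.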

### PCA with 3 components

In [None]:
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)
pca_df_3d = pd.DataFrame(data=X_pca_3d, columns=['PC1', 'PC2', 'PC3'])
pca_df_3d['diagnosis'] = y
pca_df_3d.head()

|   | PC1 | PC2 | PC3 | diagnosis |
|---|---|---|---|---|
| 0 | 9.192837 | 1.948582 | -1.123165 | 0 |
| 1 | 2.387802 | -3.768172 | -0.529292 | 0 |
| 2 | 5.733896 | -1.075174 | -0.551748 | 0 |
| 3 | 7.122953 | 10.275588 | -3.232788 | 0 |
| 4 | 3.935302 | -1.948073 | 1.389768 | 0 |


In [None]:
fig_3d = px.scatter_3d(pca_df_3d, x='PC1', y='PC2', z='PC3', color='diagnosis', template = 'plotly_white', title = 'PCA - 3 Components')
fig_3d.show()

In [None]:
explained_variance_3d = pca_3d.explained_variance_ratio_
print("Explained Variance Ratio (3D):", explained_variance_3d)

Explained Variance Ratio (3D): [0.44272026 0.18971182 0.09393163]


In [None]:
components = pd.DataFrame(abs(pca_3d.components_), columns=breast_cancer.feature_names)
components

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.218902 | 0.103725 | 0.227537 | 0.220995 | 0.14259 | 0.239285 | 0.2584 | 0.260854 | 0.138167 | 0.064363 | ... | 0.227997 | 0.104469 | 0.23664 | 0.224871 | 0.127953 | 0.210096 | 0.228768 | 0.250886 | 0.122905 | 0.131784 |
| 1 | 0.233857 | 0.059706 | 0.215181 | 0.231077 | 0.186113 | 0.151892 | 0.060165 | 0.034768 | 0.190349 | 0.366575 | ... | 0.219866 | 0.045467 | 0.199878 | 0.219352 | 0.172304 | 0.143593 | 0.097964 | 0.008257 | 0.141883 | 0.275339 |
| 2 | 0.008531 | 0.06455 | 0.009314 | 0.0287 | 0.104292 | 0.074092 | 0.002734 | 0.025563 | 0.04024 | 0.022574 | ... | 0.047507 | 0.042298 | 0.048547 | 0.011902 | 0.259798 | 0.236076 | 0.173057 | 0.170344 | 0.271313 | 0.232791 |
