# Principal Component Analysis


# Lesson Goals

In this lesson we will learn how to reduce the dimensions of our data using a technique called principal component analysis (PCA). We will cover the idea behind the technique as well as how to apply it in Python.

# Introduction

PCA is a dimensionality reduction technique: it transforms our data into a smaller set of new variables. What motivates us to reduce the dimensions of our data?

- We may want to speed up our model by providing it with fewer variables.
- We may have many columns of very sparse data. With PCA, we end up with fewer columns that are less sparse, which may improve the performance of our model.

Creating fewer variables that are functions of our original data (in PCA, linear combinations of the original columns) enables us to accomplish these goals.



# Dimension Reduction

In real life, we often reduce the dimensions of things while still keeping most of the important information. For example, we can watch television and still understand the images even though they have been reduced from 3 dimensions to 2. PCA aims to do the same for data: it reduces the number of dimensions while retaining as much of the variation in the data as possible.
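To make the idea concrete, here is a minimal sketch of PCA computed by hand with NumPy. The data is synthetic and chosen only for illustration (an assumption, not part of this lesson's dataset): 200 points in 3 dimensions that lie close to a 2-dimensional plane. The sketch centers the data, takes a singular value decomposition, and projects onto the first two directions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (illustrative assumption): 200 points in 3 dimensions
# that lie close to a 2-dimensional plane, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.7]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# Center each column, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance along each direction.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal directions: 3 columns become 2.
X_reduced = X_centered @ Vt[:2].T

print(X_reduced.shape)
print(explained_variance_ratio.round(3))
```

Because the points were built to sit near a plane, the first two components should account for almost all of the variance, so little information is lost by dropping the third.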



# PCA in Python

We will find the principal components of our data using the PCA class in statsmodels. In this example, we will use a breast cancer dataset from the UCI data repository. The data contains 35 columns; we will retain 31 numeric columns and find the top 2 principal components.

In [1]:
import numpy as np
import pandas as pd
from statsmodels.multivariate.pca import PCA

In [2]:
total_cols = ['id', 'outcome', 'time', 'cell_1_radius', 'cell_1_texture', 'cell_1_perimiter',
              'cell_1_area', 'cell_1_smoothness', 'cell_1_compactness', 'cell_1_concavity',
              'cell_1_concave_points', 'cell_1_symmetry', 'cell_1_fractal_dimension',
              'cell_2_radius', 'cell_2_texture', 'cell_2_perimiter', 'cell_2_area',
              'cell_2_smoothness', 'cell_2_compactness', 'cell_2_concavity',
              'cell_2_concave_points', 'cell_2_symmetry', 'cell_2_fractal_dimension', 'cell_3_radius', 
              'cell_3_texture', 'cell_3_perimiter', 'cell_3_area', 'cell_3_smoothness',
              'cell_3_compactness', 'cell_3_concavity', 'cell_3_concave_points', 'cell_3_symmetry',
              'cell_3_fractal_dimension', 'tumor_size', 'lymph_status']


breast_cancer = pd.read_csv('breast-cancer.csv', names=total_cols)
breast_cancer.head()

Unnamed: 0,id,outcome,time,cell_1_radius,cell_1_texture,cell_1_perimiter,cell_1_area,cell_1_smoothness,cell_1_compactness,cell_1_concavity,...,cell_3_perimiter,cell_3_area,cell_3_smoothness,cell_3_compactness,cell_3_concavity,cell_3_concave_points,cell_3_symmetry,cell_3_fractal_dimension,tumor_size,lymph_status
0,119513,N,31,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,5
1,8423,N,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
2,842517,N,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
3,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,843584,R,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0


Now we will keep only the numeric measurement columns (dropping `id`, `outcome`, `time`, and `lymph_status`) and find the principal components using the PCA class. For this example, we will limit ourselves to 2 components.

In [3]:
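# Drop the identifier, label, time, and lymph_status columns; keep the remaining 31
numeric_columns = [x for x in total_cols if x not in ['id', 'outcome', 'time', 'lymph_status']]
breast_cancer_numeric = breast_cancer[numeric_columns]
# Fit PCA and keep only the first 2 principal components
pc = PCA(np.array(breast_cancer_numeric), ncomp=2)
# One row per observation, one column per component
pc.factors.shape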

(198, 2)
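Beyond `factors`, the fitted object exposes other attributes that help judge how much the two components capture. The sketch below uses random stand-in data with the same shape as our numeric table (an assumption, since the CSV is not bundled here), so the numbers it prints say nothing about the real dataset. It only shows where to look: `loadings` holds the weight of each original column in each component, and `rsquare` holds the cumulative fit as components are added.

```python
import numpy as np
from statsmodels.multivariate.pca import PCA

rng = np.random.default_rng(0)

# Stand-in for breast_cancer_numeric (hypothetical values): 198 rows, 31 columns.
data = rng.normal(size=(198, 31))

# By default statsmodels standardizes each column before extracting components.
pc_demo = PCA(data, ncomp=2)

print(pc_demo.factors.shape)   # (198, 2): one score per row per component
print(pc_demo.loadings.shape)  # (31, 2): one weight per original column per component

# rsquare[i] is the fit obtained from the first i components (i = 0, 1, 2),
# computed on the transformed (standardized) data.
print(pc_demo.rsquare)
```

On the real data, a value of `rsquare[2]` close to 1 would indicate that two components summarize the 31 columns well; a low value would suggest keeping more components.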

In [10]:
import numpy as np
import pandas as pd
from statsmodels.multivariate.pca import PCA

total_cols = ['id', 'outcome', 'time', 'cell_1_radius', 'cell_1_texture', 'cell_1_perimiter',
              'cell_1_area', 'cell_1_smoothness', 'cell_1_compactness', 'cell_1_concavity',
              'cell_1_concave_points', 'cell_1_symmetry', 'cell_1_fractal_dimension',
              'cell_2_radius', 'cell_2_texture', 'cell_2_perimiter', 'cell_2_area',
              'cell_2_smoothness', 'cell_2_compactness', 'cell_2_concavity',
              'cell_2_concave_points', 'cell_2_symmetry', 'cell_2_fractal_dimension', 'cell_3_radius', 
              'cell_3_texture', 'cell_3_perimiter', 'cell_3_area', 'cell_3_smoothness',
              'cell_3_compactness', 'cell_3_concavity', 'cell_3_concave_points', 'cell_3_symmetry',
              'cell_3_fractal_dimension', 'tumor_size', 'lymph_status']

df = pd.read_csv(r'C:\Users\Yael Aguilar\anaconda3\pkgs\scikit-learn-0.23.2-py38h47e9c7a_0\Lib\site-packages\sklearn\datasets\data\breast_cancer.csv', names = total_cols)
df.head()

Unnamed: 0,id,outcome,time,cell_1_radius,cell_1_texture,cell_1_perimiter,cell_1_area,cell_1_smoothness,cell_1_compactness,cell_1_concavity,...,cell_3_perimiter,cell_3_area,cell_3_smoothness,cell_3_compactness,cell_3_concavity,cell_3_concave_points,cell_3_symmetry,cell_3_fractal_dimension,tumor_size,lymph_status
0,569.0,30.0,malignant,benign,,,,,,,...,,,,,,,,,,
1,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,0.6656,0.7119,0.2654,0.4601,0.1189,0.0,,,,
2,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,0.1866,0.2416,0.186,0.275,0.08902,0.0,,,,
3,19.69,21.25,130,1203,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,0.4245,0.4504,0.243,0.3613,0.08758,0.0,,,,
4,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,0.8663,0.6869,0.2575,0.6638,0.173,0.0,,,,


In [12]:
df.dtypes 

id                          float64
outcome                     float64
time                         object
cell_1_radius                object
cell_1_texture              float64
cell_1_perimiter            float64
cell_1_area                 float64
cell_1_smoothness           float64
cell_1_compactness          float64
cell_1_concavity            float64
cell_1_concave_points       float64
cell_1_symmetry             float64
cell_1_fractal_dimension    float64
cell_2_radius               float64
cell_2_texture              float64
cell_2_perimiter            float64
cell_2_area                 float64
cell_2_smoothness           float64
cell_2_compactness          float64
cell_2_concavity            float64
cell_2_concave_points       float64
cell_2_symmetry             float64
cell_2_fractal_dimension    float64
cell_3_radius               float64
cell_3_texture              float64
cell_3_perimiter            float64
cell_3_area                 float64
cell_3_smoothness           

In [11]:
numeric_columns = [x for x in total_cols if x not in ['id', 'outcome', 'time', 'lymph_status']]
breast_cancer_numeric = df[numeric_columns]
pc = PCA(np.array(breast_cancer_numeric), ncomp=2)
pc.factors.shape

ValueError: could not convert string to float: 'benign'
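
The `ValueError` appears to come from the file layout rather than from PCA itself. Judging by the `head()` output above, the first row of this file (`569, 30, malignant, benign`) is metadata rather than a measurement, and the file has fewer columns than the 35 names we supplied. As a result, the strings `malignant` and `benign` land in the `time` and `cell_1_radius` columns, and `cell_1_radius` is one of the columns we pass to PCA. The sketch below reproduces that layout with a tiny in-memory CSV (hypothetical values, not the real file) and shows one way to read it: skip the metadata row, let pandas number the columns, and keep only the columns parsed as numbers.

```python
import io
import numpy as np
import pandas as pd

# Tiny in-memory stand-in for a CSV whose first row is metadata, not data.
# The values are hypothetical; only the layout mirrors the output above.
raw = io.StringIO(
    "3,4,malignant,benign\n"
    "17.99,10.38,122.8,1001.0,0\n"
    "20.57,17.77,132.9,1326.0,0\n"
    "11.42,20.38,77.58,386.1,1\n"
)

# Skip the metadata row and do not force column names that may not match.
df_demo = pd.read_csv(raw, skiprows=1, header=None)

# Keep only the columns pandas parsed as numbers before handing them to PCA.
numeric_demo = df_demo.select_dtypes(include="number")

print(df_demo.dtypes)
print(numeric_demo.shape)

# This conversion is the step that failed above; here it succeeds.
as_array = np.asarray(numeric_demo, dtype=float)
```

Checking `df.dtypes` before calling PCA, as we did above, is a quick way to spot this problem: any `object` column among the supposedly numeric ones means a string slipped in. Note also that this file does not seem to be the same dataset as `breast-cancer.csv`, so the names in `total_cols` may not describe its columns.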