In [None]:
"""
Domain
    Health Care

Focus
    Cancer detection

Business challenge/requirement
    John Cancer Hospital (JCH) is a leading cancer hospital in the USA. It specializes in 
    preventing breast cancer. 
    Over the last few years, JCH has collected breast cancer data from patients 
    who came for screening or treatment.
    However, this data has about 30 attributes, which makes models hard to run and their 
    results hard to interpret. As the ML expert, you have to reduce the number of attributes 
    (dimensionality reduction) so that the results are meaningful and accurate.

Key issues
    Reduce the number of attributes/features in the data so that the results and 
    analysis are comprehensible to doctors

Considerations
    NONE

Data volume
    - Approx 569 records - file breast-cancer-data.csv 

Fields in Data
    - Details in the ipynb notebook

Additional information
    - NA

Business benefits
    A higher success rate of cancer detection, with a direct impact on the hospital's 
    revenue and profit. Beyond that, it contributes to JCH's mission, "Better Life".
"""

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [13]:
CSV_PATH = r'D:\CourseWork\data-science-python-certification-course\Assignments\08 Dimensionality Reduction\Case Study III\resources\breast-cancer-data.csv'
data = pd.read_csv(CSV_PATH, index_col=0)

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se            

In [15]:
# Keep only the numeric features; the diagnosis label is not an input to PCA.
df = data.drop(columns=["diagnosis"])

In [16]:
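# Project the features onto the first two principal components.
# Note: PCA is fit on the raw, unscaled feature values here.
pca_model = PCA(n_components=2)
transformer_data = pca_model.fit_transform(df)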
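
The cell above fixes the number of components at 2. As a hedged alternative, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps as many components as are needed to pass that share of explained variance. The sketch below makes two assumptions that are not in the notebook: it loads scikit-learn's bundled `load_breast_cancer` table as a stand-in for `breast-cancer-data.csv`, and it standardizes the features first. The 0.95 threshold is an illustrative choice only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for breast-cancer-data.csv (assumption: same 569 x 30 feature table).
X = load_breast_cancer(as_frame=True).data

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction.
# Standardizing first is an assumption; the notebook fits PCA on raw values.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

pca_step = pipeline.named_steps["pca"]
cumulative = pca_step.explained_variance_ratio_.cumsum()
print("components kept:", pca_step.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Letting the threshold pick the component count trades the convenience of a 2-D view for a reduction that is tied to how much information is retained.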

In [17]:
ndf = pd.DataFrame(transformer_data, columns=["PC1", "PC2"], index=df.index)
ndf["diagnosis"] = data["diagnosis"]
ndf.head(5)

id,PC1,PC2,diagnosis
842302,1160.142574,-293.917544,M
842517,1269.122443,15.630182,M
84300903,995.793889,39.156743,M
84348301,-407.180803,-67.38032,M
84358402,930.34118,189.340742,M
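
With only two components, the projection can be inspected visually. The following is a minimal sketch of a scatter plot of PC1 against PC2, colored by diagnosis. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, and it maps that dataset's numeric target back to the `M`/`B` codes used in the notebook. The name `scores` is introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Stand-in for breast-cancer-data.csv (assumption: same records and features).
bunch = load_breast_cancer(as_frame=True)
X = bunch.data
# In the scikit-learn copy, target 0 is malignant and 1 is benign.
labels = bunch.target.map({0: "M", 1: "B"})

# Same projection as the notebook: two components fit on the raw features.
scores = pd.DataFrame(
    PCA(n_components=2).fit_transform(X),
    columns=["PC1", "PC2"],
    index=X.index,
)
scores["diagnosis"] = labels

# One scatter call per diagnosis so each class gets its own color and legend entry.
fig, ax = plt.subplots(figsize=(6, 4))
for code, group in scores.groupby("diagnosis"):
    ax.scatter(group["PC1"], group["PC2"], s=12, alpha=0.6, label=code)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Breast cancer records projected onto two principal components")
ax.legend(title="diagnosis")
plt.show()
```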


In [18]:
print(pca_model.explained_variance_ratio_)

[0.98204467 0.01617649]
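
A first component that explains about 98% of the variance is worth a second look. Because PCA here runs on unscaled values, features measured in large units can dominate the result. The sketch below is one way to check this, not part of the original analysis. It compares each raw feature's share of the total variance with the PC1 loadings from `components_`. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, so the column names differ (`mean area` rather than `area_mean`).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Stand-in for breast-cancer-data.csv (assumption: same 569 x 30 feature table).
X = load_breast_cancer(as_frame=True).data

# Share of the total variance contributed by each raw (unscaled) feature.
variance_share = (X.var() / X.var().sum()).sort_values(ascending=False)
print(variance_share.head(3).round(4))

# Loadings: the weight of each original feature in each principal component.
pca = PCA(n_components=2).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=["PC1", "PC2"])
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False)
print(top_pc1.head(3).round(3))
```

If the area columns lead both rankings, PC1 is mostly a measure of tumor area. Standardizing the features before fitting is a common remedy when the components are meant to reflect all attributes.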
