# Project 3 - Breast Cancer Wisconsin Dataset

### Download Dataset From The Following Link
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

- Follow the instructions to complete the project
- Knowledge of Python and its data libraries (NumPy, pandas, Matplotlib, seaborn, scikit-learn) is a prerequisite
- Feel free to add your own code and analysis. This is a practice exercise, so try to apply what you have learned
- If something is not clear, go back to the lessons or to the documentation pages of the Python libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# pretty display for notebooks (a comment on the same line as the magic causes an error)
%matplotlib inline


In [None]:
# load data using pandas
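# assumes the Kaggle download is saved as 'data.csv' in the working directory; adjust the path if needed
cancer = pd.read_csv('data.csv')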

In [None]:
# Initial Data Inspection
# Use cancer.head(n) to display the first n rows
# Use cancer.info() and cancer.describe() to inspect the data



In [None]:
# drop the empty 'Unnamed: 32' column and the 'id' column, which carry no predictive information
cancer.drop(['Unnamed: 32', 'id'], axis = 1, inplace = True)

In [None]:
# Display the number of malignant (M) and benign (B) samples
sns.countplot(x = 'diagnosis', data = cancer)

In [None]:
# Correlation Map for breast cancer data
plt.figure(figsize = (25, 20))
# correlate numeric columns only; 'diagnosis' holds string labels
sns.heatmap(cancer.select_dtypes('number').corr(method = 'pearson'), annot = True, cmap = 'Blues')

In [None]:
# drop features that are highly correlated with others in the map above
drop = ['radius_mean', 'perimeter_mean', 'compactness_mean', 'concave points_mean', 'radius_se',
        'perimeter_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concave points_worst',
        'compactness_se', 'concave points_se', 'texture_worst', 'area_worst']

cancer.drop(drop, axis = 1, inplace = True)
cancer.head()

In [None]:
plt.figure(figsize = (15, 10))
# correlation map of the remaining numeric features
sns.heatmap(cancer.select_dtypes('number').corr(method = 'pearson'), annot = True, cmap = 'Blues')

### Split features and target variable

In [None]:
target = cancer.diagnosis
features = cancer.drop('diagnosis', axis = 1)
features.head()

In [None]:
# Feature Scaling and Dimensionality Reduction
# Normalize the features
# Normalized_X = (X - mean)/(Max - Min)
features = (features - features.mean())/(features.max() - features.min())
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 6)
pca.fit(features)

final_data = pca.transform(features)
final_data = pd.DataFrame(final_data, columns = ['Dimension 1', 'Dimension 2', 'Dimension 3',
                                                 'Dimension 4', 'Dimension 5', 'Dimension 6'])
final_data.head()
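
The mean normalization applied before PCA, `(X - mean)/(Max - Min)`, can be checked by hand on a single column. The following is a minimal pure-Python sketch; the function name `mean_normalize` and the toy values are made up for illustration and are not part of the dataset.

```python
def mean_normalize(values):
    """Return (x - mean) / (max - min) for each x in values."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)  # would be zero for a constant column
    return [(v - mean) / spread for v in values]

# Hypothetical feature column, chosen only to make the arithmetic easy to follow
column = [2.0, 4.0, 6.0, 8.0]
scaled = mean_normalize(column)
print(scaled)
```

After this transformation the column is centered on zero and its range (max minus min) is exactly 1, which keeps features with large raw values from dominating the principal components.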

### Cross Validation
Cross validation sets aside a portion of the dataset, called the validation set, on which the model is not trained. The model is trained on the larger remaining portion to learn the patterns in the data, and is then tested on the held-out sample before it is finalized.

***k-fold cross validation***
Training and testing multiple times

- Randomly split the data set into k folds
- For each fold, train the model on the remaining k-1 folds and test it on the held-out fold
- Repeat until every fold has served once as the test set
- The average of the k test errors is the cross validation error
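
The steps above can be sketched without any library. This is an illustrative sketch, not the scikit-learn implementation: the helper names `k_fold_indices` and `cross_val_error`, the toy labels, and the deliberately trivial "model" (always predict the most common training label) are all assumptions made for the example.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i
    return [indices[i::k] for i in range(k)]

def cross_val_error(labels, k=5, seed=0):
    """Return (average error, per-fold errors) for a majority-class predictor."""
    errors = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        # Train on everything except the held-out fold
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = Counter(train_labels).most_common(1)[0][0]
        # Test on the held-out fold
        wrong = sum(1 for i in test_idx if labels[i] != prediction)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k, errors

# Made-up labels: 14 benign ('B') and 6 malignant ('M')
toy_labels = ['B'] * 14 + ['M'] * 6
cv_error, fold_errors = cross_val_error(toy_labels, k=5)
print('Per-fold error:', fold_errors)
print('Cross validation error:', cv_error)
```

Every sample is used for testing exactly once and for training k-1 times, so the averaged error depends less on one lucky or unlucky split than a single train/test split would.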

### F1 Score
F1 score is often a better measure of classification performance than accuracy_score, especially when the classes are imbalanced. Accuracy only measures the fraction of predictions we got right, so a model that predicts the majority class for every input can still score highly. For example, if 90% of the inputs are 'False', predicting 'False' for everything gives 90% accuracy while missing every 'True' input, which is a bad prediction.

The F1 score takes into account both the false positives and the false negatives, i.e. the inputs that were false but predicted true and the inputs that were true but predicted false.

**F1 = 2 × (precision × recall) / (precision + recall)**
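
As a worked example of the formula, the sketch below computes precision, recall and F1 by hand for a single positive class. The labels are made up for illustration, and the helper names `accuracy` and `precision_recall_f1` are hypothetical. It computes the plain binary F1, which is not the same as the `average = 'weighted'` variant of scikit-learn's `f1_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive='M'):
    """Binary precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up, imbalanced labels: 9 benign ('B'), 1 malignant ('M')
y_true = ['B'] * 9 + ['M']
always_benign = ['B'] * 10             # never finds the malignant case
one_false_alarm = ['B'] * 8 + ['M'] * 2  # finds it, with one false positive

for name, y_pred in [('always benign', always_benign), ('one false alarm', one_false_alarm)]:
    acc = accuracy(y_true, y_pred)
    precision, recall, f1 = precision_recall_f1(y_true, y_pred)
    print(name, '| accuracy:', acc, '| precision:', precision, '| recall:', recall, '| F1:', f1)
```

Both sets of predictions have the same accuracy, yet only the second one detects the malignant sample, and the F1 score reflects that difference.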



Learn more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold # import kfold

clf_1 = DecisionTreeClassifier()
clf_2 = RandomForestClassifier()
clf_3 = LogisticRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42) # shuffle so the folds are random splits

for train_index, test_index in kf.split(final_data):

    # splitting the test and train data using the kfolds
    X_train, X_test = final_data.iloc[train_index], final_data.iloc[test_index] 
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]
    
    # fit the model and predict
    clf_1.fit(X_train, y_train) # Select Classifier : clf_1, clf_2, clf_3
    prediction = clf_1.predict(X_test)
    score = accuracy_score(y_test, prediction)
    f1 = f1_score(y_test, prediction, average = 'weighted')
    
    print('Accuracy: ', score)
    print('F1 Score: ', f1)