# Cervical Cancer Risk Factors for Biopsy

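*This notebook trains three classifiers (KNN, Logistic Regression, and Decision Tree) to predict the Biopsy outcome from cervical cancer risk factors.*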

<b>Dataset:</b> <i>kag_risk_factors_cervical_cancer_cleaned.csv</i> <br>

<b><a href="https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29">Description</a></b><br>
    
<b>Objectives:</b>
- Load and Explore the Dataset
- Split into Training and Test Set (see guidelines below)
- Build the following models using the Training Set (using default parameters):
    - KNN
    - Logistic Regression
    - Decision Tree
- Print the Accuracy Score of a Train/Test Split (see parameters below)  
- Build the same models using the WHOLE dataset
- Print the Accuracy Score on a 5-fold cross-validation of the whole dataset given  
- <i>Bonus:
    - Experiment with Decision Tree parameters to determine if you can increase the accuracy
    - Report the highest accuracy score on a 5-fold cross-validation</i> 

<b>Guidelines:</b><br>
- Target Column: Biopsy
- Train Test Split Parameters:
    - test_size = 0.25
    - random_state = 12
- For models that have a random_state parameter:
    - random_state=12
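
The objectives and guidelines above can be sketched as a single helper. This is a minimal sketch, not the notebook's own code: `evaluate`, `X_demo`, and `y_demo` are hypothetical names, and the data is a synthetic stand-in generated with `make_classification` rather than the real CSV.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.87, 0.13], random_state=12
)

def evaluate(model, X, y):
    """Return (hold-out accuracy %, mean 5-fold CV accuracy %)."""
    # Train/test split with the guideline parameters
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=12
    )
    model.fit(X_tr, y_tr)
    holdout = accuracy_score(y_te, model.predict(X_te)) * 100
    # cross_val_score clones the model, so the fit above does not affect it
    cv = np.mean(cross_val_score(model, X, y, cv=5)) * 100
    return holdout, cv

holdout_acc, cv_acc = evaluate(DecisionTreeClassifier(random_state=12), X_demo, y_demo)
print("Hold-out: %.2f %%, 5-fold CV: %.2f %%" % (holdout_acc, cv_acc))
```

The same helper could be called with a `KNeighborsClassifier` or `LogisticRegression` instance to cover the other two models.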

## Importing of Libraries

### Standard Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

%matplotlib inline

### Additional Libraries

In [2]:
#Train Test Split
from sklearn.model_selection import train_test_split

#Accuracy Score Metric
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score

#Required Algorithms
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import MinMaxScaler

#### Ignore Warnings

In [3]:
import warnings
warnings.filterwarnings('ignore')

# Load and Explore the Dataset

In [4]:
df = pd.read_csv("kag_risk_factors_cervical_cancer_cleaned.csv")

In [5]:
df.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18.0,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15.0,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,34.0,1.0,17.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,42.0,3.0,23.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,44.0,3.0,26.0,4.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 411 entries, 0 to 410
Data columns (total 34 columns):
Age                                   411 non-null float64
Number of sexual partners             411 non-null float64
First sexual intercourse              411 non-null float64
Num of pregnancies                    411 non-null float64
Smokes                                411 non-null float64
Smokes (years)                        411 non-null float64
Smokes (packs/year)                   411 non-null float64
Hormonal Contraceptives               411 non-null float64
Hormonal Contraceptives (years)       411 non-null float64
IUD                                   411 non-null float64
IUD (years)                           411 non-null float64
STDs                                  411 non-null float64
STDs (number)                         411 non-null float64
STDs:condylomatosis                   411 non-null float64
STDs:cervical condylomatosis          411 non-null float64
STDs:vagin

In [7]:
df.describe()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
count,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,...,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0
mean,25.883212,2.155718,17.411192,2.014599,0.048662,0.351097,0.095423,0.666667,1.635693,0.029197,...,0.0,0.026764,0.014599,0.007299,0.014599,0.017032,0.060827,0.116788,0.043796,0.13382
std,7.362462,1.075258,2.745954,1.117393,0.215422,2.434709,0.852453,0.471979,2.848869,0.168564,...,0.0,0.16159,0.120085,0.085227,0.120085,0.129547,0.239304,0.321559,0.204889,0.340874
min,14.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20.0,1.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,25.0,2.0,17.0,2.0,0.0,0.0,0.0,1.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,30.0,3.0,18.0,3.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,52.0,10.0,29.0,8.0,1.0,34.0,15.0,1.0,20.0,1.0,...,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [8]:
df.shape

(411, 34)

In [9]:
df["Biopsy"].value_counts()

0.0    356
1.0     55
Name: Biopsy, dtype: int64
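
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The sketch below uses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
n_negative = 356
n_positive = 55

# Accuracy of a model that always predicts the majority class (Biopsy = 0)
baseline_accuracy = n_negative / (n_negative + n_positive) * 100
print("Majority-class baseline: %.2f %%" % baseline_accuracy)
```

Any model scoring close to this baseline is adding little beyond always predicting a negative biopsy.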

# Prepare, Train and Test the Dataset

### Separating Y (target variable) from X (predictor) columns

In [10]:
X = df.drop(columns=["Biopsy"])
y = df["Biopsy"]

### Split into train and test partitions using the train_test_split function
test_size should be 25% and random_state = 12

In [11]:
#Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

In [12]:
#Check shape to make sure it is all in order
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((308, 33), (103, 33), (308L,), (103L,))
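
With only 55 positive cases, a plain random split can leave the test set with a slightly different positive rate than the training set. One optional variation, which the guidelines do not ask for, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical labels with the same class counts as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the dataset's class counts (356 negatives, 55 positives)
y_demo = np.array([0] * 356 + [1] * 55)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions similar in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12, stratify=y_demo
)
print("Positive rate in train: %.3f, in test: %.3f" % (y_tr.mean(), y_te.mean()))
```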

### Scaling the Features: MinMaxScaler

In [13]:
#Instantiate the MinMax Scaler
minmax = MinMaxScaler()

#Fit the scaler to the training set only,
#so that no information from the test set leaks into the scaling
minmax.fit(X_train)

#Transform the training set
X_train_scaled_mm = minmax.transform(X_train)

#Transform the test set
X_test_scaled_mm = minmax.transform(X_test)

In [14]:
#Change to Pandas dataframe for easier viewing and manipulation of the data
X_train_smm = pd.DataFrame(X_train_scaled_mm, index=X_train.index, columns=X_train.columns)
X_test_smm = pd.DataFrame(X_test_scaled_mm, index=X_test.index, columns=X_test.columns)

In [15]:
X_train_smm.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:Hepatitis B,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology
343,0.567568,0.222222,0.277778,0.375,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
268,0.243243,0.0,0.222222,0.375,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
227,0.324324,0.111111,0.444444,0.125,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41,0.432432,0.111111,0.611111,0.25,0.0,0.0,0.0,1.0,0.004,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
232,0.135135,0.222222,0.277778,0.125,0.0,0.0,0.0,1.0,0.025,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
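
MinMaxScaler maps each column to (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below checks this on a toy column; the numbers are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training and test columns (hypothetical values)
train = np.array([[14.0], [20.0], [30.0], [51.0]])
test = np.array([[35.0], [60.0]])

# Fit on the training column only, as in the cell above
scaler = MinMaxScaler().fit(train)

# Manual formula using the training minimum and maximum
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaler.transform(test).ravel())
print(manual.ravel())
```

Because the scaler only knows the training range, a test value above the training maximum (60 here) is mapped to a value greater than 1.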


# Building and Validating the Models

#### Build models on the following algorithms and report ACCURACY SCORE on the test dataset
1. KNN 
2. Logistic Regression
3. Decision Tree Classifier

*Note: Accuracy Score should be presented as a percentage*<br>
*Note: For models that have a random_state parameter, set random_state = 12*

## 1. K-Nearest Neighbors (KNN)

In [16]:
#Set the value of K
k = 4

#Instantiate the model
knn = KNeighborsClassifier(n_neighbors=k)

#Fit the model to the training set
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform')

In [17]:
print(cross_val_score(knn, X_train, y_train, cv=5))

[0.9047619  0.90322581 0.8852459  0.8852459  0.90163934]


In [18]:
print(np.mean(cross_val_score(knn, X_train, y_train, cv=5)))

0.8960237717509003


In [19]:
#Predict on the Test Set
y_pred_k = knn.predict(X_test)

In [20]:
#Get the Confusion Matrix and other metrics to test performance
print("Classification report for classifier %s:\n%s\n"
      % (knn, metrics.classification_report(y_test, y_pred_k)))

Classification report for classifier KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform'):
              precision    recall  f1-score   support

         0.0       0.87      1.00      0.93        89
         1.0       1.00      0.07      0.13        14

   micro avg       0.87      0.87      0.87       103
   macro avg       0.94      0.54      0.53       103
weighted avg       0.89      0.87      0.82       103




In [21]:
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_pred_k))

Confusion matrix:
[[89  0]
 [13  1]]
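
The figures in the classification report can be recomputed by hand from this matrix. The sketch below copies the matrix from the output above, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix copied from the KNN output above
# (rows = actual class, columns = predicted class)
cm = np.array([[89, 0],
               [13, 1]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are real
recall = tp / (tp + fn)           # share of real positives that were found

print("accuracy=%.4f precision=%.2f recall=%.2f" % (accuracy, precision, recall))
```

The accuracy looks reasonable, but the recall shows that only 1 of the 14 positive biopsies in the test set was detected.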


In [22]:
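#Label the confusion matrix for easy viewing
#Sort the labels: confusion_matrix orders the classes in ascending order (0.0, then 1.0)
labels_knn = sorted(y_test.unique())
cmk = metrics.confusion_matrix(y_test, y_pred_k)
cm_dfk = pd.DataFrame(cmk,index=labels_knn, columns=labels_knn)
cm_dfk

Unnamed: 0,0.0,1.0
0.0,89,0
1.0,13,1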


In [49]:
ak = accuracy_score(y_test,y_pred_k) * 100
print(ak)

87.37864077669903


In [50]:
print("Percentage: %s %%\n" % ak)

Percentage: 87.37864077669903 %



In [24]:
print(np.mean(cross_val_score(knn, X, y, cv=5)) * 100)

88.55715545107259
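
KNN relies on distances, so features with large ranges can dominate the result. The KNN cells above use the unscaled features. One possible variation, not what this notebook ran, is to put the scaler and the classifier in a pipeline so that scaling is re-fitted inside each fold. The sketch uses synthetic stand-in data, and `knn_scaled`, `X_demo`, and `y_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.87, 0.13], random_state=12
)

# The scaler is fitted on the training part of every fold only,
# so the validation part never influences the scaling step
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=4))
scores = cross_val_score(knn_scaled, X_demo, y_demo, cv=5)
print("Mean 5-fold accuracy: %.2f %%" % (scores.mean() * 100))
```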


## 2. Logistic Regression

In [25]:
#Instantiate the Algorithm
#Note: random_state=20 here differs from the guideline value of 12
logreg = LogisticRegression(C=1e9, class_weight="balanced", solver='liblinear', random_state=20)

#Train/Fit the model
logreg.fit(X_train_smm, y_train)

LogisticRegression(C=1000000000.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=20,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [26]:
#Check the Trained Model Coefficients
print(logreg.coef_)

[[ -5.54400699  78.56846804 -67.72854791  18.14278772 -40.97439911
   -5.66302614  17.53445926  -4.41433067  25.96171773  -0.2772331
    1.90549274   4.4547195    5.28625253   6.11778555   0.
    0.           6.11778555   0.           0.           0.
    0.           0.          -1.66306605   0.           0.
    4.67526006  -2.68043628  20.75674038  -2.68043628  19.69303467
    4.4936084   66.25985248  38.99766839]]


In [27]:
#Create a DataFrame that pairs each feature name with its coefficient
coef = pd.DataFrame(X_train_smm.columns, columns=["Feature"])
coef['Coef'] = logreg.coef_.ravel()
coef.head(10)

Unnamed: 0,Feature,Coef
0,Age,-5.544007
1,Number of sexual partners,78.568468
2,First sexual intercourse,-67.728548
3,Num of pregnancies,18.142788
4,Smokes,-40.974399
5,Smokes (years),-5.663026
6,Smokes (packs/year),17.534459
7,Hormonal Contraceptives,-4.414331
8,Hormonal Contraceptives (years),25.961718
9,IUD,-0.277233


#### Validating the Model

In [28]:
y_pred_lgr = logreg.predict(X_test_smm)

In [29]:
#Get the Confusion Matrix and other metrics to test performance (model precision)
print("Classification report for classifier %s:\n%s\n"
      % (logreg, classification_report(y_test, y_pred_lgr)))

Classification report for classifier LogisticRegression(C=1000000000.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=20,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False):
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98        89
         1.0       1.00      0.71      0.83        14

   micro avg       0.96      0.96      0.96       103
   macro avg       0.98      0.86      0.91       103
weighted avg       0.96      0.96      0.96       103




In [30]:
print("Confusion matrix:\n%s" % confusion_matrix(y_test, y_pred_lgr))

Confusion matrix:
[[89  0]
 [ 4 10]]


In [31]:
#Predict the Probabilities (column 0 = class 0.0, column 1 = class 1.0)
pred_prob = logreg.predict_proba(X_test_smm)
pred_prob_0 = pred_prob[:,0]
pred_prob_1 = pred_prob[:,1]

In [32]:
#Put all information on a DataFrame for analysis
df_results = X_test.copy()

df_results["Predicted_Class"] = y_pred_lgr
df_results["Predicted_Prob(0)"] = pred_prob_0
df_results["Predicted_Prob(1)"] = pred_prob_1

In [33]:
df_results.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Predicted_Class,Predicted_Prob(0),Predicted_Prob(1)
361,38.0,2.0,15.0,4.0,0.0,0.0,0.0,1.0,16.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
152,18.0,1.0,15.0,1.0,0.0,0.0,0.0,1.0,0.33,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.7886e-18
328,20.0,2.0,16.0,1.0,0.0,0.0,0.0,1.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.681781e-16
6,44.0,2.0,25.0,2.0,0.0,0.0,0.0,1.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.375367e-28
1,15.0,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.103754e-14
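
For a binary logistic regression, the predicted probability of class 1 is the sigmoid of the linear score (intercept plus the weighted sum of the features). The sketch below checks this on synthetic data; `model`, `X_demo`, and `y_demo` are hypothetical names and the default `C` is used instead of the notebook's `C=1e9`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=12)
model = LogisticRegression(solver="liblinear", random_state=12).fit(X_demo, y_demo)

# Linear score: intercept + dot(coefficients, features)
z = model.decision_function(X_demo)

# The sigmoid maps the score to P(class = 1)
manual_prob_1 = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_prob_1, model.predict_proba(X_demo)[:, 1]))
```

A very large `C` means very weak regularization, which lets the coefficients grow and is a likely reason the probabilities in the table above sit so close to 0 and 1.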


In [47]:
alg = accuracy_score(y_test,y_pred_lgr) * 100
print(alg)

96.11650485436894


In [48]:
print("Percentage: %s %%\n" % alg)

Percentage: 96.11650485436894 %



In [35]:
print(np.mean(cross_val_score(logreg, X, y, cv=5)) * 100)

97.56979136056421


## 3. Decision Tree

In [36]:
#Instantiate the Algorithm
clf = tree.DecisionTreeClassifier(criterion="gini", min_samples_split=3, min_samples_leaf=3,
            max_depth=8, random_state=12)

#Train the model
clf.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=3,
            min_weight_fraction_leaf=0.0, presort=False, random_state=12,
            splitter='best')

##### Cross Validation: Decision Tree

In [37]:
print(cross_val_score(clf, X, y, cv=5))

[0.98795181 0.98780488 0.97560976 0.97560976 0.98780488]


In [38]:
print(np.mean(cross_val_score(clf, X, y, cv=5)))

0.9829562151043196


##### Validating the Model

In [39]:
y_pred_dt = clf.predict(X_test)

In [40]:
#Check the performance metrics
print("{:.2f}".format(metrics.accuracy_score(y_test,y_pred_dt)))

0.96


In [41]:
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(y_test, y_pred_dt)))

Classification report for classifier DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=3,
            min_weight_fraction_leaf=0.0, presort=False, random_state=12,
            splitter='best'):
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98        89
         1.0       1.00      0.71      0.83        14

   micro avg       0.96      0.96      0.96       103
   macro avg       0.98      0.86      0.91       103
weighted avg       0.96      0.96      0.96       103




In [42]:
print("Confusion Matrix: \n%s" % metrics.confusion_matrix(y_test,y_pred_dt))

Confusion Matrix: 
[[89  0]
 [ 4 10]]


In [43]:
#Encode Confusion Matrix into a DataFrame
#Sort the labels: confusion_matrix orders the classes in ascending order (0.0, then 1.0)
labels_dt = sorted(y_test.unique())
cm_dt = metrics.confusion_matrix(y_test, y_pred_dt)
cm_df_dt = pd.DataFrame(cm_dt,index=labels_dt, columns=labels_dt)
cm_df_dt

Unnamed: 0,0.0,1.0
0.0,89,0
1.0,4,10


In [44]:
#Accuracy score on the test set, as a percentage
adt = accuracy_score(y_test,y_pred_dt) * 100
print(adt)

96.11650485436894


In [45]:
print("Percentage: %s %%\n" % adt)

Percentage: 96.11650485436894 %



In [46]:
print(np.mean(cross_val_score(clf, X, y, cv=5)) * 100)

98.29562151043196
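
For the bonus objective, one way to experiment with Decision Tree parameters is a grid search scored by 5-fold cross-validation. This is a minimal sketch on synthetic stand-in data; the parameter grid is an illustrative assumption, not a set of values tested in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.87, 0.13], random_state=12
)

# Hypothetical grid of Decision Tree parameters to try
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 3, 5],
}

# Every combination is scored by mean accuracy over 5 folds
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best mean 5-fold accuracy: %.2f %%" % (search.best_score_ * 100))
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`. Note that the best score found this way is slightly optimistic, since the same folds are used to choose the parameters and to report the score.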
