## Project tasks

1. Build classifiers using Linear Regression, Random Forest and Neural Networks
2. Perform cross validation to measure each method's classification accuracy
3. Discuss the result of each method
    + Linear Regression: What are the coefficients of the Linear Regression?
    + Random Forest: What 5 features are the most informative and what are their information strength (i.e., information gain)?
    + Neural Network: How does the classification accuracy change with different number of hidden layers?
4. Perform K-means clustering
    + Measure the Rand index and silhouette scores
    + How does the score change with K={2,3,…, 10}?
    + Discuss the result of the clustering by comparing it with the classification results
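Task 4 can be sketched roughly as follows. This is a minimal sketch, not the report's actual implementation: it assumes scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for `data/data.csv`, standardizes the features before clustering, and uses the illustrative name `kmeans_scores`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data/data.csv
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # K-means is distance based, so scale first

kmeans_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    kmeans_scores[k] = {
        # Rand index: agreement between the clusters and the true diagnosis labels
        "rand": rand_score(y, labels),
        # Silhouette: cohesion and separation of the clusters, no labels needed
        "silhouette": silhouette_score(X_scaled, labels),
    }

for k, scores in kmeans_scores.items():
    print(f"K={k}: rand={scores['rand']:.3f}, silhouette={scores['silhouette']:.3f}")
```

The Rand index does not depend on how the cluster ids are numbered, so it can be compared directly with the diagnosis labels; `adjusted_rand_score` is a chance-corrected alternative.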

### The report shall include the following contents and shall be less than 5 pages:
+ Introduction (One paragraph)
+ Method
    + Linear Regression
    + Random Forest
    + Neural Network
+ Results
+ Discussion
    + Linear Regression: What are the coefficients of the Linear Regression?
    + Random Forest: What 5 features are the most informative and what are their information strength (i.e., information gain)?
    + Neural Network: How does the classification accuracy change with different number of hidden layers?
+ Conclusion

---

## 1. Introduction

### 1.1 Feature Selection

The given data contains many features. However, an analysis of the collinearity among them indicated that fractal_dimension_mean, smoothness_mean and symmetry_mean are not useful for determining the class of a tumor. These three features were therefore excluded from modeling.
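One way to screen features like this is to look at how strongly each one correlates with the diagnosis. The sketch below is only an illustration of that idea, not the analysis actually used for the report. It assumes scikit-learn's bundled copy of the data, whose columns are named `mean fractal dimension` and so on rather than `fractal_dimension_mean`, and whose target encodes malignant as 0; `is_malignant` and `target_corr` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled dataset stands in for data/data.csv
bunch = load_breast_cancer(as_frame=True)
features = bunch.data

# Re-encode the target so that malignant = 1, matching the notebook's mapping
is_malignant = (bunch.target == 0).astype(int)

# Absolute Pearson correlation of every feature with the diagnosis,
# sorted so that the weakest candidates come first
target_corr = features.corrwith(is_malignant).abs().sort_values()
print(target_corr.head(10))
```

A low correlation with the target is only a hint: a feature can still help in combination with others, so the screening is best checked against model accuracy.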

## 2. Method

### 2.1 Linear Regression

### 2.2 Random Forest

### 2.3 Neural Network
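The question of how accuracy changes with the number of hidden layers could be explored along these lines. This is a hedged sketch: the layer width of 16, the depths tried, the scaling step, and the use of scikit-learn's bundled dataset are all assumptions made for illustration, and `depth_scores` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data/data.csv
X, y = load_breast_cancer(return_X_y=True)

depth_scores = {}
for n_layers in (1, 2, 3):
    # Same width for every hidden layer, so only the depth varies
    hidden = (16,) * n_layers
    clf = make_pipeline(
        StandardScaler(),  # MLPs train poorly on unscaled inputs
        MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0),
    )
    # Mean accuracy over 5 cross-validation folds
    depth_scores[n_layers] = cross_val_score(clf, X, y, cv=5).mean()

for n_layers, acc in depth_scores.items():
    print(f"{n_layers} hidden layer(s): {acc:.3f}")
```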

---

## Code

reference: https://www.kaggle.com/jcrowe/model-comparison-for-breast-cancer-diagnosis/notebook

In [12]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import train_test_split # split the data into train and test sets
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import metrics # to check the error and accuracy of the model

from scipy.stats import randint as sp_randint

%matplotlib inline
import matplotlib.pyplot as plt

# 1. Introduction

In [2]:
df = pd.read_csv("data/data.csv", header=0)    # header=0 means row 0 holds the column names

In [3]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# Remove unnecessary columns
df.drop('id',axis=1,inplace=True)
df.drop('Unnamed: 32',axis=1,inplace=True)

In [5]:
df['diagnosis']=df['diagnosis'].map({'M':1,'B':0})

In [6]:
df.describe()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.372583,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,0.483918,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,0.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,0.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,0.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,1.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,1.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


## 1.1 Feature Selection

In [7]:
# Remove the three features judged uninformative; drop() returns a new DataFrame, so df is left intact
df2 = df.drop(columns=['fractal_dimension_mean', 'smoothness_mean', 'symmetry_mean'])

In [8]:
# Split the data into train and test sets
train, test = train_test_split(df2, test_size = 0.3) # 70% train, 30% test
# Check their dimensions
print(train.shape)
print(test.shape)

(398, 28)
(171, 28)


In [9]:
prediction_var = list(df2)[1:]
prediction_var

['radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']

In [10]:
train_X = train[prediction_var]  # taking the training data input 
train_y = train.diagnosis  # This is output of our training data
# same we have to do for test
test_X = test[prediction_var]  # taking test data inputs
test_y = test.diagnosis  # output value of test data

## 2. Classification Method

This section compares several learning algorithms. It ends with a breakdown of the data and an explanation of each algorithm's performance.

### 2.1 Logistic Regression

+ Linear Regression: What are the coefficients of the Linear Regression

In [None]:
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
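To answer the question about coefficients, the fitted model's `coef_` attribute can be paired with the feature names. The sketch below is self-contained and makes several assumptions: it uses scikit-learn's bundled copy of the data, standardizes the features so the coefficients are comparable, and uses the illustrative names `pipe` and `coefs`. In the bundled data the target encodes malignant as 0, the opposite of the notebook's mapping, so the signs are flipped relative to a model trained on `df`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data/data.csv
bunch = load_breast_cancer(as_frame=True)
X, y = bunch.data, bunch.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Standardize first so that coefficient magnitudes are comparable across features
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# One coefficient per feature for the binary problem
coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
# Order by absolute size, largest first
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(coefs.head(5))
print("test accuracy:", pipe.score(X_test, y_test))
```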

### 2.2 Random Forest Classification

+ Random Forest: What 5 features are the most informative and what are their information strength (i.e., information gain)?

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# Get the feature importances from the fitted random forest
featimp = pd.Series(model.feature_importances_, index=prediction_var).sort_values(ascending=False)
print(featimp) # RandomForestClassifier exposes the importance of each feature it used

---

reference : https://www.kaggle.com/gargmanish/basic-machine-learning-with-cancer/notebook

## Import Data

In [None]:
# Import the libraries used for machine learning
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv), data manipulation as in SQL
import matplotlib.pyplot as plt # used to plot graphs
import seaborn as sns # statistical plotting built on matplotlib
%matplotlib inline
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import train_test_split # split the data into two parts
from sklearn.model_selection import KFold # used for cross validation
from sklearn.model_selection import GridSearchCV # parameter tuning
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm # Support Vector Machine
from sklearn import metrics # to check the error and accuracy of the model
# KFold lives in sklearn.model_selection; the older sklearn.cross_validation module has been removed

In [None]:
data = pd.read_csv("data/data.csv",header=0)    # header=0 means row 0 holds the column names

In [None]:
data.head(2)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
# Drop the empty column "Unnamed: 32"
data.drop("Unnamed: 32",axis=1,inplace=True) # inplace=True changes data itself
# To keep the original data, assign the result to a new variable instead:
# data1=data.drop("Unnamed: 32",axis=1)
# axis=1 means a column is dropped

In [None]:
data.columns

In [None]:
# Likewise, the id column is not needed for the analysis
data.drop("id",axis=1,inplace=True)

In [None]:
# The features fall into three groups (mean, se, worst); split the column names by group
features_mean= list(data.columns[1:11])
features_se= list(data.columns[11:21])
features_worst=list(data.columns[21:31])
print(features_mean)
print("-----------------------------------")
print(features_se)
print("------------------------------------")
print(features_worst)

In [None]:
# Start with features_mean
# The diagnosis column has object type, so map it to integer values
data['diagnosis']=data['diagnosis'].map({'M':1,'B':0})

In [None]:
data.head(2)

## Exploratory Data Analysis

In [None]:
data.describe()

In [None]:
# Plot the frequency of each diagnosis class
sns.countplot(x='diagnosis', data=data)

In [None]:
# The plot shows that there are more benign cases than malignant ones

## Feature Selection

In [None]:
# Draw a correlation heatmap to detect multicollinearity, i.e. columns that depend on
# each other; such columns carry the same information twice and should be avoided
# Check the correlation between features
# This analysis covers features_mean only; the other groups follow, to see which does best
corr = data[features_mean].corr() # .corr() computes the pairwise correlation
plt.figure(figsize=(14,14))
sns.heatmap(corr, cbar = True,  square = True, annot=True, fmt= '.2f',annot_kws={'size': 15},
           xticklabels= features_mean, yticklabels= features_mean,
           cmap= 'coolwarm') # for more on heatmap see http://seaborn.pydata.org/generated/seaborn.heatmap.html

### observation

+ The radius, perimeter and area are highly correlated, as expected from their geometric relation, so only one of them will be used
+ compactness_mean, concavity_mean and concave points_mean are highly correlated, so only compactness_mean will be used
+ The selected parameters are perimeter_mean, texture_mean, compactness_mean and symmetry_mean
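The highly correlated pairs read off the heatmap can also be listed programmatically. The following is a small sketch under stated assumptions: it uses scikit-learn's bundled copy of the data, where the ten mean features are the first ten columns and are named `mean radius`, `mean perimeter` and so on, and the 0.9 threshold and the name `high_pairs` are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled dataset stands in for data/data.csv
X = load_breast_cancer(as_frame=True).data
X_mean = X.iloc[:, :10]  # the ten "mean" features

corr = X_mean.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)

# Pairs above an illustrative threshold of 0.9
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

From each group of mutually correlated features, one representative can then be kept, which is what the observations above do by hand.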

In [None]:
prediction_var = ['texture_mean','perimeter_mean','smoothness_mean','compactness_mean','symmetry_mean']
# these are the variables that will be used for prediction

In [None]:
# Split the data into train and test sets
train, test = train_test_split(data, test_size = 0.3) # 70% train, 30% test
# Check their dimensions
print(train.shape)
print(test.shape)

In [None]:
train_X = train[prediction_var]  # taking the training data input 
train_y = train.diagnosis  # This is output of our training data
# same we have to do for test
test_X = test[prediction_var]  # taking test data inputs
test_y = test.diagnosis  # output value of test data

## Models 1) Random Forest

In [None]:
model = RandomForestClassifier(n_estimators=100)  # a simple random forest model

In [None]:
model.fit(train_X,train_y)  # fit the model to the training data

In [None]:
prediction = model.predict(test_X) # predict for the test data
# prediction holds the model's predicted diagnosis values for the test inputs

In [None]:
metrics.accuracy_score(prediction,test_y) # check the accuracy
# accuracy is measured between the predicted values and the true test outputs

+ Here the accuracy of our model is 91%, which seems good

## Models 2) SVM

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

+ The SVM reaches only 0.85, which can be improved with other techniques. I will improve it later; until then, beginners can see how to model data and get an overview of ML.
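One common suspect when an SVM underperforms is unscaled input, since the default RBF kernel is sensitive to feature magnitudes. The sketch below compares `SVC` with and without standardization. It is an illustration on scikit-learn's bundled copy of the data with all 30 features, not a diagnosis of the result above, and `raw_acc` and `scaled_acc` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data/data.csv
X, y = load_breast_cancer(return_X_y=True)

# SVC on the raw features
raw_acc = cross_val_score(SVC(), X, y, cv=5).mean()

# SVC on standardized features; the scaler is refit inside every fold,
# so no information from the held-out part leaks into training
scaled_acc = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()

print(f"raw: {raw_acc:.3f}  scaled: {scaled_acc:.3f}")
```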

## Feature : all feature_mean

In [None]:
prediction_var = features_mean # taking all features

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

## Models 1) Random Forest

In [None]:
model = RandomForestClassifier(n_estimators=100)

In [None]:
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

+ Taking all features increased the accuracy, but only slightly, so by Occam's razor the simpler model is preferable
+ Next, check the important features in the prediction

In [None]:
featimp = pd.Series(model.feature_importances_, index=prediction_var).sort_values(ascending=False)
print(featimp) # RandomForestClassifier exposes the importance of each feature it used

## Models 2) SVM

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# As you can see, the accuracy of the SVM decreases sharply
# Now take only the top 5 important features given by the random forest classifier

In [None]:
prediction_var=['concave points_mean','perimeter_mean' , 'concavity_mean' , 'radius_mean','area_mean']  

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# This suggests that multicollinearity affects the SVM a lot,
# while the random forest is much less affected and needs less effort in the analysis step
# Now move to the third part of the data, the worst features
# Start with all of features_worst

## feature : all features_worst

In [None]:
prediction_var = features_worst

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# The same problem with the SVM: very low accuracy. Its parameters probably need tuning,
# which is left for a later, intermediate part
# Now run the random forest to get the important features

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# The random forest accuracy increased, which suggests the values in the worst group separate the classes better
# Get the important features
featimp = pd.Series(model.feature_importances_, index=prediction_var).sort_values(ascending=False)
print(featimp) # RandomForestClassifier exposes the importance of each feature it used

In [None]:
# The same parameters appear, with greater importance; concave points_worst alone seems to
# dominate, which may indicate bias, so check with only the top 5 important features

In [None]:
prediction_var = ['concave points_worst','radius_worst','area_worst','perimeter_worst','concavity_worst'] 

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
#check for SVM
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# now I think for simplicity the Randomforest will be better for prediction

In [None]:
# Now explore a little further
# From features_mean, find the variables that can be used for classification
# Draw scatter plots to identify the variables with a separable boundary between the two
# diagnosis classes

In [None]:
# Start the data analysis with features_mean
# to understand which features can be used for prediction
# Plot a scatter matrix of all features_mean, colored by diagnosis category,
# and see which features readily differentiate the two categories

In [None]:
color_function = {0: "blue", 1: "red"} # red is 1 (M, malignant) and blue is 0 (B, benign)
colors = data["diagnosis"].map(lambda x: color_function.get(x)) # map each diagnosis to its color
pd.plotting.scatter_matrix(data[features_mean], c=colors, alpha = 0.5, figsize = (15, 15)); # plot the scatter matrix

### Observation

1. Radius, area and perimeter have a strong linear relationship, as expected
2. As the graph shows, features such as texture_mean, smoothness_mean, symmetry_mean and fractal_dimension_mean cannot be used to classify the two categories, because the categories are mixed and there is no separating plane
3. So we can remove them from our prediction_var

In [None]:
# The full list of mean features
features_mean

In [None]:
# So the prediction features will be
predictor_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']

In [None]:
# With these variables, explore a little further and move on to cross validation
# For details on cross validation see
# https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/

In [None]:
def model(model,data,prediction,outcome):
    # This function will be used to check the accuracy of different models
    # model is the m
    kf = KFold(n_splits=10) # n_splits is the number of folds (see the link above)

In [None]:
prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']

In [None]:
# Features that can separate the classes will be more useful

In [None]:
# This part explains a few machine learning concepts
# and compares the accuracy of different models
# First, cross validation is applied to different models
# Then parameter tuning with GridSearchCV is explained
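Parameter tuning with `GridSearchCV` could look roughly like this. The grid values below are arbitrary examples, the data is scikit-learn's bundled copy of the dataset rather than `data/data.csv`, and `search` is an illustrative name; treat it as a sketch of the mechanics only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled dataset stands in for data/data.csv
X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid: every combination is evaluated with 5-fold cross validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on all the data with the best parameter combination.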

In [None]:
# Many models will be evaluated, so wrap the procedure in a function
# that works with any of them
def classification_model(model,data,prediction_input,output):
    # model: the estimator to evaluate
    # data: the DataFrame holding the inputs and the target
    # prediction_input: the columns used for prediction
    # output: the column to be predicted
    # First measure the accuracy by fitting and predicting on the same data
    # Fit the model:
    model.fit(data[prediction_input],data[output]) # fit on the full data set

    # Make predictions on the same data:
    predictions = model.predict(data[prediction_input])

    # Print the accuracy on the data the model was fitted on
    accuracy = metrics.accuracy_score(predictions,data[output])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    kf = KFold(n_splits=5)
    # For background on cross validation see
    # https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/
    # n_splits is the number of folds; kf.split(data) yields the row indices of each fold
    error = []
    for train, test in kf.split(data):
        # KFold yields a different train/test split on every iteration,
        # and the model is refit on each training part
        train_X = (data[prediction_input].iloc[train,:]) # rows of this fold's training part, all columns
        train_y = data[output].iloc[train] # a single column, so only rows are indexed
        # Train the algorithm using the predictors and target
        model.fit(train_X, train_y)

        # Score the model on this fold's held-out part
        test_X=data[prediction_input].iloc[test,:]
        test_y=data[output].iloc[test]
        error.append(model.score(test_X,test_y))
        # Print the running mean of the fold scores
        print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
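The manual fold loop above has a compact counterpart in `cross_val_score`. The sketch below assumes scikit-learn's bundled copy of the dataset instead of `data/data.csv`, shuffles the folds with a fixed seed, and uses the illustrative names `cv` and `fold_scores`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for data/data.csv
X, y = load_breast_cancer(return_X_y=True)

# Shuffling avoids folds that depend on the row order of the file
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold; the model is refit for every fold
fold_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("fold accuracies:", fold_scores.round(3))
print("mean accuracy:", round(fold_scores.mean(), 3))
```

Reporting the per-fold values alongside the mean makes it easier to see how much the score varies between splits.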

In [None]:
# From here on, try different models

## Decision Tree

In [None]:
model = DecisionTreeClassifier()
prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']
outcome_var= "diagnosis"
classification_model(model,data,prediction_var,outcome_var)

#### observation

+ An accuracy of 100% on the fitted data indicates overfitting
+ The cross-validation scores are not as good, so accuracy on the fitted data cannot be the only factor considered

## SVM

In [None]:
# now move to svm

In [None]:
model = svm.SVC()

classification_model(model,data,prediction_var,outcome_var)

In [None]:
# The SVM is causing problems here and the reason is unclear
# Leave it for now and revisit it later

## KNN

In [None]:
model = KNeighborsClassifier()
classification_model(model,data,prediction_var,outcome_var)

## Random Forest

In [None]:
# The same here: the cross-validation scores are not good
# Now move to RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
classification_model(model,data,prediction_var,outcome_var)

## Logistic Regression

In [None]:
# The cross-validation scores are also not bad,
# so the random forest is a good choice
# Now try logistic regression
model = LogisticRegression()
classification_model(model,data,prediction_var,outcome_var)