
#Lesson 11.6.3 Activity

Breast Cancer Detection

According to the American Cancer Society, breast cancer is the most common cancer in American women, except for skin cancers. The average risk of a woman in the United States developing breast cancer sometime in her life is about 13%. This means there is a 1 in 8 chance she will develop breast cancer.

Mammograms are used to detect breast cancer, hopefully at an early stage. However, many masses that appear on a mammogram are not actually cancer. Developing a machine learning model to predict whether a tumor is benign or cancerous would be helpful for physicians as they guide and treat patients.

In this module, we'll use a Naive Bayes classifier to classify the tumors as benign or malignant. We'll see whether this model does a better or worse job of classifying the tumors than logistic regression, KNN, and SVM.

#Step 1: Download and save the `cancer.csv` dataset from the class materials  

* Make a note of where you saved the file on your computer.

#Step 2: Upload the `cancer.csv` dataset by running the following code block 

* When prompted, navigate to and select the `cancer.csv` dataset where you saved it on your computer.

In [1]:
#Step 2

from google.colab import files
cancer = files.upload()

Saving cancer.csv to cancer.csv


#Step 3: Import necessary packages

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
```

In [5]:
#Step 3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB


# Step 4: Create a Pandas DataFrame from the CSV file
* Name the DataFrame `cancer`.
* Print the first five observations of `cancer`.  Note the kinds of data it contains.
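As a rough illustration of what "note the kinds of data" can mean in practice, the sketch below inspects a tiny, made-up DataFrame with `head()` and `dtypes`. The name `toy` and all of its values are hypothetical and only loosely mimic the columns of `cancer.csv`.

```python
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 11.5, 8.0],
})

# head() returns the first rows (five by default, fewer if the table is shorter).
first_rows = toy.head()
print(first_rows)

# dtypes lists the data type of every column (numeric vs. text).
print(toy.dtypes)
```

On the real data, the same two calls would show which columns are numeric measurements and which, like `diagnosis`, hold text labels.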

In [6]:
#Step 4
cancer = pd.read_csv('cancer.csv')
cancer.head()  # first five observations



         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

#Step 5: Convert the variable `diagnosis` into a numeric data type  
* There are many ways to accomplish this, but you may choose to work with the example shown below.  

```
cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0

```
* Name the result `cancer_present` and code malignant tumors with a `1` and benign tumors with a `0`.
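One of the other ways to do this conversion is pandas' `Series.map` with a dictionary. The sketch below applies it to a small, made-up DataFrame (`toy` is a hypothetical name, not part of the activity).

```python
import pandas as pd

# Hypothetical miniature of the diagnosis column, for illustration only.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map each text label to its numeric code in a single step.
toy["cancer_present"] = toy["diagnosis"].map({"M": 1, "B": 0})

print(toy)
```

Any label missing from the dictionary would become `NaN`, which makes unexpected values in the column easy to spot.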






In [7]:
#Step 5
cancer.loc[cancer['diagnosis'] == 'M', 'cancer_present'] = 1
cancer.loc[cancer['diagnosis'] == 'B', 'cancer_present'] = 0




#Step 6: Split the data into the target variable and the features
* We want to predict if a tumor is benign or malignant (`cancer_present`) using the tumor measurements.
* Select all of the features of the cancer DataFrame **except** `id`, `diagnosis` and `cancer_present`, and name the resulting DataFrame X.
* Select the column `cancer_present` from the cancer DataFrame and name it y. Make sure y is also a DataFrame and not a Series.
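The difference between a Series and a one-column DataFrame comes down to single versus double brackets. A minimal sketch on made-up data (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical values, invented for illustration.
toy = pd.DataFrame({
    "perimeter_mean": [100.0, 60.0, 130.0],
    "cancer_present": [1, 0, 1],
})

y_series = toy["cancer_present"]    # single brackets -> Series, shape (3,)
y_frame = toy[["cancer_present"]]   # double brackets -> DataFrame, shape (3, 1)
X_toy = toy.drop(["cancer_present"], axis=1)

print(type(y_series).__name__, y_series.shape)
print(type(y_frame).__name__, y_frame.shape)
print(list(X_toy.columns))
```

Note that scikit-learn estimators generally expect a one-dimensional target, so they may emit a warning and flatten a one-column DataFrame when fitting.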

In [8]:
#Step 6
y = cancer[['cancer_present']]  # double brackets keep y as a one-column DataFrame, as this step asks
X = cancer.drop(['id','diagnosis','cancer_present'],axis=1)


#Step 7: Split the data into a training dataset and a test dataset
* Use `train_test_split` from `sklearn.model_selection`.
* Name the X training/validation set `X_train` and the y training/validation set `y_train`.
* Name the X test set `X_test` and the y test set `y_test`.
* Set `test_size = 0.25` and `random_state = 42`.
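To see what these two settings do, the sketch below splits a small synthetic array (the `X_toy` and `y_toy` names and their values are assumptions for illustration). With 20 rows and `test_size=0.25`, 5 rows go to the test set, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 2 features, alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same seed gives the same split on a second call.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(np.array_equal(X_te, X_te2))
```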






In [9]:
#Step 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


#Step 8: What type of Naive Bayes classifier should we use to model our data? 
Recall from the lesson that the three types of classifier are:

**Categorical Naive Bayes:** Each feature takes one of a fixed set of categories.  This is often used for document classification.

**Bernoulli Naive Bayes:** The features have only yes or no values.

**Gaussian Naive Bayes:** The features we have are continuous measurements.
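To make the Gaussian case concrete, here is a minimal sketch on invented one-feature data. It scores each class by hand, using the log prior plus the log of a normal density built from that class's mean and variance, and compares the result with scikit-learn's `GaussianNB`. The helper `gaussian_log_density` and all of the data are hypothetical, and the hand calculation ignores the small variance smoothing that `GaussianNB` applies.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented continuous measurements for two well-separated classes.
X_toy = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[1.1], [2.9]])


def gaussian_log_density(x, mean, var):
    """Log of the normal density, evaluated feature by feature."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)


classes = np.unique(y_toy)
log_scores = []
for c in classes:
    X_c = X_toy[y_toy == c]
    mean, var = X_c.mean(axis=0), X_c.var(axis=0)
    log_prior = np.log(len(X_c) / len(X_toy))
    # Summing log densities across features is the "naive" independence assumption.
    log_scores.append(log_prior + gaussian_log_density(X_new, mean, var).sum(axis=1))

# Rows are samples, columns are classes; pick the highest-scoring class per row.
manual_pred = classes[np.argmax(np.column_stack(log_scores), axis=1)]
sklearn_pred = GaussianNB().fit(X_toy, y_toy).predict(X_new)

print(manual_pred, sklearn_pred)
```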





**Step 8 Answer:** Gaussian Naive Bayes, because the features are continuous tumor measurements.



#Step 9: Build a pipeline that will impute and standardize the data and fit a Gaussian Naive Bayes classifier
* The first step should be `SimpleImputer(missing_values=np.nan, strategy='mean')`.
* The second step should be `StandardScaler()`.
* The third step should be `GaussianNB()`.  
* Name the pipeline `nb`.
* Fit the pipeline to `X_train` and `y_train`.
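To show why the imputation step matters, the sketch below runs the same three steps on a tiny invented array that contains a missing value. The names `X_toy`, `y_toy`, and `toy_nb` and all of the numbers are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented data: two features, with one missing value in the second column.
X_toy = np.array([
    [1.0, 10.0],
    [1.2, np.nan],
    [0.9, 11.0],
    [3.0, 20.0],
    [3.1, 21.0],
    [2.9, 19.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_nb = Pipeline([
    ("imp_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
    ("gaussian", GaussianNB()),
])
toy_nb.fit(X_toy, y_toy)

# The fitted imputer stores the column means it uses to fill gaps.
column_means = toy_nb.named_steps["imp_mean"].statistics_
print(column_means)

# New rows pass through the same imputation and scaling before prediction.
toy_predictions = toy_nb.predict(np.array([[1.1, np.nan], [3.0, 20.5]]))
print(toy_predictions)
```

Because the imputer and scaler live inside the pipeline, they are refit on each training fold during cross-validation, so no information from held-out rows leaks into preprocessing.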






In [12]:
#Step 9
nb = Pipeline([
    ('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('Gaussian', GaussianNB())
])
nb.fit(X_train, y_train)



Pipeline(steps=[('imp_mean', SimpleImputer()), ('scaler', StandardScaler()),
                ('Gaussian', GaussianNB())])

#Step 10: Evaluate the pipeline using 10-fold cross-validation  
* Calculate and print the accuracy of each of the ten models using `scores = cross_val_score(nb, X_train, y_train, cv=10)`.
* Calculate and print the mean and SD of the accuracy measures returned from cross-validation.
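The sketch below shows one way to print the per-fold accuracies together with their mean and SD. It uses synthetic data from `make_classification` as a stand-in, so the names `X_demo`, `y_demo`, and `demo_scores` are assumptions and the numbers it prints say nothing about the cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the cancer dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_nb = Pipeline([
    ("scaler", StandardScaler()),
    ("gaussian", GaussianNB()),
])

# cv=10 fits ten models, each scored on a different held-out tenth of the data.
demo_scores = cross_val_score(demo_nb, X_demo, y_demo, cv=10)

for fold, score in enumerate(demo_scores, start=1):
    print(f"Fold {fold}: accuracy = {score:.3f}")
print(f"Mean accuracy: {demo_scores.mean():.3f}")
print(f"SD of accuracy: {demo_scores.std():.3f}")
```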





In [13]:
#Step 10

scores = cross_val_score(nb, X_train, y_train, cv=10)




#Step 11: Compare to other models 
* The logistic regression, KNN and SVM models we created in earlier lessons had accuracies around 94%.
* How does the accuracy of the Naive Bayes model compare to these models?
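For a side-by-side comparison, a rough sketch along the following lines could cross-validate all four model types in one loop. It makes several assumptions: it uses scikit-learn's bundled breast cancer dataset (`load_breast_cancer`) as a stand-in for `cancer.csv`, it scores on the full dataset rather than a training split, and it uses default hyperparameters (apart from `max_iter`) because the settings from the earlier lessons are not given here. Its numbers therefore need not match the figures quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data bundled with scikit-learn (an assumption, not cancer.csv itself).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Default settings are assumed; the earlier lessons may have used others.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Gaussian NB": GaussianNB(),
}

results = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean = {results[name][0]:.3f}, SD = {results[name][1]:.3f}")
```

Comparing the means alongside their SDs helps judge whether a gap between two models is larger than the fold-to-fold variation.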





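**Step 11 Answer:** The Naive Bayes model's mean cross-validated accuracy is about 91% (SD about 3%), slightly lower than the roughly 94% achieved by the logistic regression, KNN, and SVM models.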


In [14]:
scores.mean()

0.9106312292358805

In [15]:
scores.std()

0.03176885870533403