## CEBD 1260 Final Programming Assignment

Complete both questions in a single Jupyter notebook and upload it to your GitHub repo in a new directory named "final".

Answers should include both code and written explanation / interpretation of results. Be sure to answer all parts of the question completely.

**1) Your first task is to classify data from a cancer diagnostic database. This database contains patients with tumors, characteristics of those tumors, and biopsy results indicating whether the tumor is Malignant or Benign.**


In cancer_data.txt you will find the following variables:

   - radius (mean of distances from center to points on the perimeter)
   - texture (standard deviation of gray-scale values)
   - perimeter
   - area
   - smoothness (local variation in radius lengths)
   - compactness (perimeter^2 / area - 1.0)
   - concavity (severity of concave portions of the contour)
   - concave_points (number of concave portions of the contour)
   - symmetry 
   - fractal_dimension ("coastline approximation" - 1)
   - cancer (0 = Benign, 1 = Malignant)  *target*


Use any machine learning algorithm you wish. In your answer, include a short description of your algorithm of choice and the predicted category for a new patient whose tumor has the following features:

   - radius: 14
   - texture: 14
   - perimeter: 88
   - area: 566
   - smoothness: 1
   - compactness: 0.08
   - concavity: 0.06
   - concave points: 0.04
   - symmetry: 0.18
   - fractal dimension: 0.05

In [466]:
import numpy as np
import pandas
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the cancer dataset, skipping the header row
cancer_data = np.loadtxt("cancer_data.csv", delimiter=',', skiprows=1)

# Load the test dataset (target not included), skipping the header row
cancer_test = np.loadtxt("cancer_test.csv", delimiter=',', skiprows=1)
# A single patient must be a 2-D array of shape (1, n_features)
cancer_test = cancer_test.reshape(1, -1)
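
The contents of cancer_test.csv are not shown in this notebook; presumably it holds the new patient's values. A minimal sketch of building the same kind of single-row input directly from the values listed in the assignment follows. The names `feature_order`, `new_patient`, and `new_patient_row` are illustrative, and the column order is an assumption that should be checked against the header of cancer_data.csv.

```python
import numpy as np

# Column order assumed to match cancer_data.csv (an assumption; verify
# against the file's header row before using this with the real model).
feature_order = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Values copied from the assignment text for the new patient.
new_patient = {
    "radius": 14, "texture": 14, "perimeter": 88, "area": 566,
    "smoothness": 1, "compactness": 0.08, "concavity": 0.06,
    "concave_points": 0.04, "symmetry": 0.18, "fractal_dimension": 0.05,
}

# scikit-learn expects a 2-D array: one row per sample, one column per feature.
new_patient_row = np.array(
    [[new_patient[name] for name in feature_order]], dtype=float
)
print(new_patient_row.shape)  # (1, 10)
```

Keeping the values in a dictionary keyed by feature name makes it harder to put a number in the wrong column than typing a bare list would.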

In [467]:
# Use the 10 variables/features:
# radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension
cancer_X = cancer_data[:, :10]

# Train on every labeled row; the "test" set is the single new patient
cancer_X_train = cancer_X[:]
cancer_X_test = cancer_test

# Targets for the training set (Y): column 10, where 0 = Benign and 1 = Malignant
cancer_y_train = cancer_data[:, 10]

In [468]:
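# Create a linear classifier trained by stochastic gradient descent;
# loss='log' makes it a logistic regression model
# (scikit-learn 1.1 and later name this loss 'log_loss')
regr = linear_model.SGDClassifier(loss='log')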

# Train the model using the training sets
regr.fit(cancer_X_train, cancer_y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
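
The assignment also asks for a short description of the chosen algorithm. With `loss='log'`, `SGDClassifier` fits a logistic regression model by stochastic gradient descent: it learns one weight per feature plus an intercept, and predicts class 1 when the weighted sum is positive, which is the same as the sigmoid probability exceeding 0.5. A minimal sketch of that prediction rule follows; the function names, weights, and inputs are made up for illustration and are not the fitted values from this notebook.

```python
import math

def predict_proba_malignant(features, weights, intercept):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def predict_class(features, weights, intercept):
    """Return 1 when the estimated probability of class 1 exceeds 0.5."""
    return 1 if predict_proba_malignant(features, weights, intercept) > 0.5 else 0

# Hypothetical two-feature model, chosen only to illustrate the rule.
weights = [0.8, -0.5]
intercept = -1.0

print(predict_class([3.0, 1.0], weights, intercept))  # score = 0.9  -> 1
print(predict_class([1.0, 2.0], weights, intercept))  # score = -1.2 -> 0
```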

In [469]:
# Predict the category of the new patient
cancer_y_pred = regr.predict(cancer_X_test)

In [470]:
# The coefficients
print('Coefficients: \n', regr.coef_)

Coefficients: 
 [[  4.59055671e+03   6.69914152e+03   2.67219043e+04  -3.88527575e+03
    4.35087149e+01  -1.85892300e+01  -8.11311967e+01  -3.53547607e+01
    8.31347555e+01   3.44469303e+01]]
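
The coefficients above span several orders of magnitude. SGD is sensitive to feature scaling, so this likely reflects features on very different scales (area is in the hundreds, compactness is below 1). A common remedy is to standardize the features before fitting, and to hold out part of the labeled data so accuracy can be estimated. The sketch below is one way to do both, under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for cancer_data.csv, and `LogisticRegression` in place of `SGDClassifier(loss='log')` (the same model family with a different solver). All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cancer data: 300 samples, 10 features, binary target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# Stretch two columns so the features have very different scales.
X[:, 3] *= 500.0
X[:, 2] *= 80.0

# Hold out 25% of the labeled rows to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize each feature (zero mean, unit variance), then fit the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Held-out accuracy: %.2f" % model.score(X_test, y_test))
```

Putting the scaler inside the pipeline means the same transformation learned from the training rows is applied automatically to any new patient passed to `model.predict`.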


In [471]:
#Checking predicted category
pandas.DataFrame(cancer_y_pred) 

| | 0 |
|---|---|
| 0 | 1.0 |


The predicted cancer category is **1 (Malignant)** for a new patient with a tumor with the following features:
   - radius: 14
   - texture: 14
   - perimeter: 88
   - area: 566
   - smoothness: 1
   - compactness: 0.08
   - concavity: 0.06
   - concave points: 0.04
   - symmetry: 0.18
   - fractal dimension: 0.05


**2) The following code contains (had) 5 bugs (errors). Find and correct them all, then answer the following questions:**

  1. How many observations are in the training dataset?

  2. How many features are in the training dataset?

  3. How well did your model perform?

**BONUS: Which category is hockey, 0 or 1? Which category is baseball?**

In [505]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

categories = [ 'rec.sport.baseball','rec.sport.hockey']
twenty_train = fetch_20newsgroups(subset='train', categories=categories)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42)),
])
text_clf.fit(twenty_train.data, twenty_train.target)  
predicted = text_clf.predict(twenty_test.data)

In [506]:
#Observations in the training dataset
print(len(twenty_train.data))

1197


There are 1197 observations in the training dataset.

In [509]:
#How many features are in the training dataset?
import pandas
df = pandas.DataFrame(twenty_train.data)
print(len(df.columns))

1


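The raw training data has only 1 column (the text of each post). The classifier does not train on that column directly: `CountVectorizer` turns it into one feature per distinct token, so the number of features the model sees is the vocabulary size.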
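
To count the features after vectorization, the shape of `CountVectorizer`'s output can be inspected. The sketch below uses a made-up three-document corpus (illustrative, not newsgroup data) so that it runs without downloading anything; `docs`, `vect`, and `X_counts` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for twenty_train.data.
docs = [
    "the pitcher threw a fastball",
    "the goalie made a save",
    "the pitcher made a save",
]

vect = CountVectorizer()
X_counts = vect.fit_transform(docs)  # sparse matrix: rows = documents, columns = tokens

n_observations, n_features = X_counts.shape
print(n_observations, n_features)  # 3 7

# The default tokenizer keeps tokens of two or more characters, so "a" is dropped.
print(sorted(vect.vocabulary_))
```

In the notebook above, the equivalent check after fitting would be `len(text_clf.named_steps['vect'].vocabulary_)`. The resulting count is not reproduced here because it was not part of the original run.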

In [510]:
#How well did your model perform?

# Mean squared error (mean_squared_error and r2_score were imported in cell In [466])
print("Mean squared error: %.2f" % mean_squared_error(twenty_test.target, predicted))
# R^2 score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(twenty_test.target, predicted))


Mean squared error: 0.03
Variance score: 0.87


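The model performed very well:
 - Mean squared error is low (0.03). Because the labels are 0 and 1, this equals the fraction of misclassified test posts, so accuracy is roughly 97%.
 - Variance score (R^2) is 0.87, which is close to 1.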
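
Mean squared error and R^2 are regression metrics. For a two-class problem, accuracy and a confusion matrix are the more conventional report, and with 0/1 labels the mean squared error is simply the misclassification rate. The sketch below shows this on made-up labels; the arrays are illustrative and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up true labels and predictions with exactly one mistake out of ten.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print("Accuracy: %.2f" % accuracy)       # Accuracy: 0.90
print("Mean squared error: %.2f" % mse)  # Mean squared error: 0.10

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```

In the notebook, `accuracy_score(twenty_test.target, predicted)` would give the corresponding figure for the newsgroup classifier.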

In [511]:
#Which category is Hockey? 0 or 1? Which category is baseball?
pandas.DataFrame(twenty_test.target_names)

| | 0 |
|---|---|
| 0 | rec.sport.baseball |
| 1 | rec.sport.hockey |


Baseball is category 0 and hockey is category 1.