# Instructor Task
## Dataset
- [Here](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv) is the dataset.
- [Here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names) is a description of the data. Ignore column 0 as it is merely the ID of a patient record.

In [37]:
import pandas as pd
import numpy as np
import sklearn.linear_model  # binds the sklearn name and loads the linear_model submodule used in step 6
import matplotlib.pyplot as plt
from sklearn import cross_validation, svm, grid_search
from sklearn.metrics import classification_report
from sklearn import preprocessing
file_loc = '/Users/ericsperber/udacity/general_assembly/data.csv'


## 1. Read in the data

In [23]:
data = pd.read_csv(file_loc, header=None)

## 2. Separate the data into feature and target.

In [24]:

target = np.array(data[0])
features = np.array(data[list(range(1, 31))])  # columns 1 through 30
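
The same split can also be written with positional slicing. This is a minimal sketch, assuming the layout used above (label in column 0, features in the remaining columns). The three-row `toy` frame is made up for illustration, and the snippet is written for Python 3 rather than the Python 2 used in the cells.

```python
import pandas as pd

# Hypothetical stand-in for the real file: label in column 0, features after it.
toy = pd.DataFrame([
    ["M", 17.99, 10.38, 122.8],
    ["B", 13.54, 14.36, 87.46],
    ["B", 12.45, 15.70, 82.57],
])

target = toy.iloc[:, 0].to_numpy()                 # first column: class labels
features = toy.iloc[:, 1:].to_numpy(dtype=float)   # every remaining column

print(target.shape, features.shape)
```

Slicing by position avoids hard-coding the number of feature columns.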

## 3. Create and evaluate using cross_val_score and 5 folds.
- What is the mean accuracy?
- What is the standard deviation of accuracy?

In [27]:
svc = svm.SVC(C=1, kernel='linear')
kfold = cross_validation.KFold(len(data), n_folds=5, shuffle=True)
accuracy = [svc.fit(features[train], target[train]).score(features[test], target[test]) \
        for train, test in kfold]

In [29]:
print np.mean(accuracy)

0.956124825338


In [30]:
print np.std(accuracy)

0.0183705515745
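
Step 3 asks for `cross_val_score`, while the cell above loops over the folds by hand. Below is a minimal sketch of the same evaluation with `cross_val_score`, written for Python 3 and a current scikit-learn, where these helpers live in `sklearn.model_selection`. It assumes scikit-learn's bundled `load_breast_cancer` data is an acceptable stand-in for the CSV. `random_state=0` is an arbitrary choice, so the numbers will not match the output above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the CSV (30 features, binary label).
X, y = load_breast_cancer(return_X_y=True)

svc = SVC(C=1, kernel="linear")
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # seed chosen arbitrarily

# One accuracy value per fold; the model is refit from scratch on each training split.
scores = cross_val_score(svc, X, y, cv=kfold, scoring="accuracy")

print("mean accuracy:", scores.mean())
print("std of accuracy:", scores.std())
```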


## 4. Get a classification report to identify type 1, type 2 errors.
- Use train_test_split to run your model once, with a test size of 0.33
- Make predictions on the test set
- Compare the predictions to the answers to determine the classification report

In [31]:
train, test = cross_validation.train_test_split(data, train_size=.67)
train_labels = np.array(train[0])
test_labels = np.array(test[0])
feature_cols = list(range(1, 31))  # columns 1 through 30
train_features = np.array(train[feature_cols])
test_features = np.array(test[feature_cols])
fitted_model = svc.fit(train_features, train_labels)
predicted_model = svc.predict(test_features)
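# classification_report expects (y_true, y_pred). With the arguments in this order the
# precision and recall columns trade places and support counts predictions, not true labels.
print classification_report(predicted_model, test_labels)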

             precision    recall  f1-score   support

          B       0.99      0.96      0.98       125
          M       0.93      0.98      0.95        63

avg / total       0.97      0.97      0.97       188
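
To connect the report to type 1 and type 2 errors, it helps to count the four outcomes directly. This is a minimal sketch on made-up labels, assuming "M" (malignant) is treated as the positive class. The names `y_true` and `y_pred` are illustrative and follow scikit-learn's convention of passing true labels first.

```python
# Toy labels made up for illustration; "M" (malignant) is the positive class.
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "M", "M", "B"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "M" and p == "M" for t, p in pairs)  # malignant and flagged
fn = sum(t == "M" and p == "B" for t, p in pairs)  # type 2 error: malignant missed
fp = sum(t == "B" and p == "M" for t, p in pairs)  # type 1 error: benign flagged
tn = sum(t == "B" and p == "B" for t, p in pairs)  # benign and cleared

precision_m = tp / (tp + fp)  # of tumours predicted malignant, the share that truly are
recall_m = tp / (tp + fn)     # of truly malignant tumours, the share that were found

print(tp, fn, fp, tn, precision_m, recall_m)
```

Low precision for "M" signals many type 1 errors, while low recall for "M" signals many type 2 errors.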



## 5. Scale the data and see if that improves the score.

In [32]:
scaler = preprocessing.StandardScaler().fit(train_features)
fitted_model = svc.fit(scaler.transform(train_features), train_labels)
predicted_model = svc.predict(scaler.transform(test_features))


print classification_report(predicted_model, test_labels)

             precision    recall  f1-score   support

          B       0.99      0.97      0.98       124
          M       0.94      0.98      0.96        64

avg / total       0.97      0.97      0.97       188
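
The scaling step boils down to subtracting each feature's training mean and dividing by its training standard deviation, then reusing those same statistics on the test set. Below is a minimal NumPy sketch of that arithmetic on made-up numbers. The arrays `train` and `test` are illustrative, not taken from the dataset.

```python
import numpy as np

# Made-up numbers: column 0 is on a scale of hundreds, column 1 on a scale of tenths.
train = np.array([[500.0, 0.10], [700.0, 0.30], [900.0, 0.20]])
test = np.array([[800.0, 0.25]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics; never refit on test data

print(train_scaled)
print(test_scaled)
```

After scaling, both columns vary over a comparable range, so neither dominates the distance computations inside the SVM.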



## 6. Tune the model using automated parametric grid search via LogisticRegressionCV. Explain your intuition behind what is being tuned.

### Q: What should we do to prevent overfitting so our model generalizes well to the test data?

In [43]:
param_grid = {'Cs': [1, 10, 100, 1000]}
clf = grid_search.GridSearchCV(sklearn.linear_model.LogisticRegressionCV(),param_grid)
clf = clf.fit(train_features, train_labels)
clf_results = clf.predict(test_features)

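#We can limit overfitting by regularizing the model, so that the decision boundary does not swing sharply in response to small variations in the features.
#The grid search tunes the amount of regularization: C is the inverse of the regularization strength, so a smaller C means a more heavily regularized model.
#Note that an integer Cs in LogisticRegressionCV is the number of candidate C values it tries, not a C value itself.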
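
To tune C itself rather than the number of candidates, one option is a grid search over a plain `LogisticRegression`. This is a minimal sketch for Python 3 and a current scikit-learn, not the notebook's method. The bundled `load_breast_cancer` data, the candidate C values, `max_iter=1000`, and `random_state=0` are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled with its own training stats.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}  # illustrative grid

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_c = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(best_c, test_accuracy)
```

Because C is chosen using only the training folds, the held-out test accuracy remains an honest estimate of how well the tuned model generalizes.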

### Q: What was the best C?

In [39]:
print clf.best_params_

{'Cs': 10}
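
In scikit-learn's `LogisticRegressionCV`, an integer `Cs` is the number of candidate C values, spaced logarithmically between 1e-4 and 1e4, rather than a single C. Below is a minimal pure-Python sketch of the grid implied by `Cs=10`. The variable names are illustrative.

```python
# Candidate C values implied by an integer Cs: that many points spaced evenly
# on a log scale between 1e-4 and 1e4.
n_values = 10
exponents = [-4 + 8 * i / (n_values - 1) for i in range(n_values)]
candidates = [10 ** e for e in exponents]

for c in candidates:
    print("%.4g" % c)
```

The C that was actually selected from this grid is stored on the fitted `LogisticRegressionCV` estimator as its `C_` attribute.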


In [42]:
ax = plt.subplot()
ax.set_title("Input data")
# Plot the training points
ax.scatter(test_features[:, 0], test_features[:, 1], color=[ i for i in clf_results])
plt.plot()

X = features = np.array(data[[1,  2]])
y = [0 if x=='M' else 1 for x in np.array(data[0])]

svc = svm.SVC(kernel='rbf', C=1000.0).fit(X, y)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))
plt.subplot(2, 2, 1 + 1)
plt.subplots_adjust(wspace=0.4, hspace=0.4)

Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('SVM')

# Save before showing: after plt.show() the figure may be cleared, which would write a blank file.
plt.savefig('fig3_normalC=1000.jpg')
plt.show()

#This graph shows the results of the LogisticRegression model above.
#From the graph you can see that there's a fairly distinct classification curve to the model.

#The second graph shows the results of the SVM models from steps 4 and 5.
#These graphs show what happens as you change the C value of an SVM model.
#From these graphs, it becomes clear that as you increase C, the model starts to overfit.

[]

## 8. Provide a one-sentence summary for a non-technical audience. Then provide a longer paragraph-length technical explanation.


In [None]:
#For this exercise, we used machine learning to diagnose, with 97% accuracy,
#whether a tumour was benign or malignant, using 30 different measurements per tumour.


In [None]:
#For this exercise, we first read in a 569 x 31 dataset using Pandas, a Python library.
#Once the data was read in, we split it into features and labels.
#The label is the class of a row of data (benign or malignant). The features are the other 30 columns in that row.
#We then separated our data into a training set (67% of the data) and a test set (33% of the data).
#With the training and test sets in place, we fitted an SVM model to the training set.
#With the model fitted, we scored it against the test set and got 97% accuracy.
#We then scaled all the features, so that each dimension has a similar magnitude.
#This slightly improved the accuracy of the model (by less than 1%).
#We also ran the training and test sets through a Logistic Regression model tuned with Grid Search.
#Grid search picked Cs = 10 as the best setting. For LogisticRegressionCV an integer Cs is the number of candidate C values tried, not a single C value.
#Finally, we created a few charts with Matplotlib to plot our models.