# Machine Learning with Python

Sign-in page: https://goo.gl/forms/jmpV6Rhb2t9ef3Hl1

Now that we've learned how to acquire, clean, and visualize our data, let's start doing some machine learning. Throughout this workshop, we will be using a Python package called scikit-learn. It is surprisingly easy to implement machine learning algorithms using scikit-learn, and in fact much of the work is done for you. Often, it is the acquisition and structuring of the data itself that requires the most finesse. Because of this, we will be working with pre-processed data to illustrate just how simple machine learning can be using scikit-learn. 

Before we dive in, let's start with some background on machine learning. There are two main types of machine learning algorithms: **supervised** and **unsupervised**. The goal of **supervised learning** is to train a model on labeled examples so that it can predict the correct label for new data. For instance, we can train a model to separate images according to whether or not the image contains a face. In this case, the two labels would be *face* and *no face*. Labeled training data is a huge advantage in machine learning, but it is not always available.

When labeled training data is unavailable, we must use **unsupervised learning**. The goal of unsupervised learning is to train a model to extract *meaningful features* from the data. These meaningful features are often common patterns which are useful for compressing the information contained in the data. For example, in speech recognition, individual phonemes serve as meaningful features from which it is possible to reconstruct words independently of their tone or pitch. In computer vision, oriented edge detectors are a common example of such features.

## Digit Recognition

In the first half of this workshop, we'll be using a very important supervised machine learning algorithm called the **support vector machine** to classify handwritten digits. This is a very well-studied problem in the machine learning community, and serves as a great starting point. First, let's import scikit-learn and a couple of other modules we'll need.

In [None]:
#import necessary modules

Now, sklearn comes with a few preloaded datasets, so let's load up the handwritten digits dataset. This is a list of pixel intensities corresponding to images of handwritten digits plus their associated labels (0-9).

In [None]:
#load digits
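
One way to fill in the import and loading cells above, as a minimal sketch. It assumes scikit-learn is installed and uses `datasets.load_digits`, which ships the data with the package so nothing is downloaded. The variable name `digits` is our own choice.

```python
from sklearn import datasets

# Load the bundled handwritten digits dataset.
digits = datasets.load_digits()

# `images` holds each digit as an 8x8 array of pixel intensities,
# `data` holds the same pixels flattened to 64 features per sample,
# and `target` holds the label (0-9) for each sample.
print(digits.images.shape)
print(digits.data.shape)
print(digits.target[:10])
```

The flattened `data` array is the form a classifier expects, while `images` is more convenient for plotting.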

Let's use matplotlib to see what one of these images looks like.

In [None]:
#plot digits

By default, matplotlib plots each value on a color scale. We can convert this to greyscale to get a better idea of the actual image.

In [None]:
#convert to greyscale
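
A possible version of the two plotting cells above, shown side by side. This is a sketch that assumes matplotlib is installed; the `gray_r` colormap (dark ink on a light background) is our choice, and `gray` would work equally well. In a notebook the figure renders at the end of the cell; in a plain script you would add `plt.show()`.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

fig, (ax_color, ax_grey) = plt.subplots(1, 2, figsize=(6, 3))

# Default colormap: each intensity is mapped onto a color scale.
ax_color.imshow(digits.images[0])
ax_color.set_title("default colormap")

# Reversed grey colormap: high intensities are drawn dark, like ink on paper.
ax_grey.imshow(digits.images[0], cmap="gray_r", interpolation="nearest")
ax_grey.set_title(f"greyscale, label = {digits.target[0]}")
```

Changing the index `0` to any other sample number shows a different digit along with its label.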

In [None]:
#other digits

Now that we've got a better idea of what our data looks like, let's begin organizing it. An important concept in machine learning is overfitting. Overfitting typically occurs when a model has a large number of parameters, but not much data. An overfit model will perform very well on its training data, but will be unable to generalize beyond this training data.

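To detect overfitting, we will split our dataset into two groups: training data and test data. This will allow us to evaluate how our model performs on data it's never seen before. This procedure is called a holdout (train/test) split and is the simplest form of validation; cross-validation repeats the same idea over several different splits and averages the results. Evaluating on held-out data is extremely important in machine learning.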

In [None]:
#cross-validation
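
A minimal sketch of the split described above, using `train_test_split` from `sklearn.model_selection` (older scikit-learn releases exposed it under `sklearn.cross_validation`). The 75/25 ratio and the `random_state` seed are illustrative assumptions, not values prescribed by the workshop.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 25% of the samples as test data. Fixing random_state makes the
# shuffle reproducible; both values are assumed for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The test portion is set aside and only touched when scoring the trained model.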

Video describing SVM: https://www.youtube.com/watch?v=mA5nwGoRAOo

From here, creating and training our classifier is relatively straightforward.

In [None]:
#set up classifier
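
One way the classifier cell above could look. This is a sketch: the RBF kernel and the starting values `gamma=0.001` and `C=100` are assumptions chosen for illustration, and the split repeats the assumed 75/25 holdout so the block runs on its own.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# RBF-kernel support vector classifier; gamma and C are assumed starting values.
clf = svm.SVC(kernel="rbf", gamma=0.001, C=100.0)

# Training only ever sees the training portion of the data.
clf.fit(X_train, y_train)

# Compare a few predictions on unseen samples with their true labels.
print(clf.predict(X_test[:5]))
print(y_test[:5])
```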

Now let's compute the accuracy of our classifier using the test data.

In [None]:
#score classifier
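
A sketch of the scoring step. For a classifier, `score` returns the fraction of test samples labeled correctly, which the block cross-checks by hand. The hyperparameters and split are the same illustrative assumptions as before (`gamma=0.001`, `C=100`, 75/25 holdout).

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(kernel="rbf", gamma=0.001, C=100.0).fit(X_train, y_train)

# Mean accuracy on data the classifier has never seen.
accuracy = clf.score(X_test, y_test)

# The same number computed by hand: fraction of matching predictions.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)

print(f"test accuracy: {accuracy:.3f}")
```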

Video describing C: https://www.youtube.com/watch?v=joTa_FeMZ2s

Video describing gamma: https://www.youtube.com/watch?v=m2a2K4lprQw

That's pretty impressive. So what are those mysterious values gamma and C? The C parameter controls the penalty for misclassification of each example in the training data. Large values of C highly penalize misclassification, and thus will fit to the training data more exactly. However, this can lead to overfitting and trouble with outliers, in which case a smaller value of C should be chosen.

The gamma parameter is somewhat more complicated, but it can be understood as the inverse of the radius of influence of the individual support vectors: a small gamma lets each support vector influence a wide region, while a large gamma keeps its influence local. More info can be found in the sklearn SVM documentation: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

So let's try changing the values of C and gamma and see what we get.

In [None]:
#change C and gamma
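
A small experiment along these lines, as a sketch. The three `(C, gamma)` pairs are hypothetical values picked only to contrast behaviors. Recording both training and test accuracy makes overfitting visible as a gap between the two.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Hypothetical (C, gamma) pairs chosen only to show contrasting behavior.
settings = [(100.0, 0.001), (0.01, 0.001), (100.0, 1.0)]

results = {}
for C, gamma in settings:
    clf = svm.SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting.
    results[(C, gamma)] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for (C, gamma), (train_acc, test_acc) in results.items():
    print(f"C={C:<6} gamma={gamma:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

With a very large gamma each support vector only influences its immediate neighborhood, so the model tends to memorize the training set and generalize poorly.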

We can automate the process of finding the best values of gamma and C (also known as hyperparameters) by using sklearn's built-in grid search, GridSearchCV. Note that a new classifier must be built, trained, and evaluated for each combination of hyperparameters, so this process can be time-consuming.

In [None]:
#gridsearch
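
A minimal grid search sketch using `GridSearchCV` from `sklearn.model_selection`. The candidate values in `param_grid` and the choice of 3 folds are assumptions made to keep the run short; a real search would usually cover a wider, logarithmically spaced range.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Assumed candidate values: 3 x 3 = 9 combinations.
param_grid = {"C": [1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}

# Each combination is evaluated with 3-fold cross-validation on the
# training data only, then the best one is refit on all of it.
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(f"test accuracy of best model: {test_accuracy:.3f}")
```

Keeping the test data out of the search means the final score is still an honest estimate of performance on unseen data.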

## Kaggle Problem

Kaggle is an excellent platform for participating in data science competitions, accessing cool datasets, and getting involved in the data science community. For part two of this workshop, we will be using Kaggle's Biological Response competition data to create a Kaggle submission entry. This is part of Kaggle's Getting Started with Python tutorial, which can be found here: https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience

First let's check out the dataset, which can be found here: https://www.kaggle.com/c/bioresponse/data. 

In [None]:
#get data

Each row in this dataframe represents a specific molecule, and the column labeled "Activity" denotes whether the molecule elicits a specific biological response (1) or not (0). The descriptors D1, D2, ..., D1776 quantitatively describe relevant structural and chemical properties of the molecule. The goal is to use these descriptors as features and predict the activity of new molecules.

First, let's split our data into training and test cases. Since the first column represents our labels, or target values, we must be sure to separate them from our features.

In [None]:
#split train-test
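
A sketch of separating labels from features and splitting the result. So that it runs without a download, it builds a small random stand-in dataframe with the same layout ("Activity" first, then descriptor columns); the sizes, the random values, and the 75/25 split are all assumptions. With the real data you would load the downloaded file instead, for example `pd.read_csv("train.csv")` if it sits in the working directory.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the Kaggle training data (random values, illustration only).
rng = np.random.default_rng(0)
n_molecules, n_descriptors = 200, 10
df = pd.DataFrame(
    rng.random((n_molecules, n_descriptors)),
    columns=[f"D{i}" for i in range(1, n_descriptors + 1)],
)
df.insert(0, "Activity", rng.integers(0, 2, size=n_molecules))

# Separate the labels (first column) from the descriptor features.
y = df["Activity"].to_numpy()
X = df.drop(columns="Activity").to_numpy()

# Hold out part of the labeled data to evaluate the model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```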

In this section, we'll be using a random forest classifier. The random forest classifier can be thought of as an ensemble of individual decision tree classifiers. Each decision tree learns to make a series of decisions on the most informative features of its input. In a random forest, each tree is trained on a random sample of the training data and considers a random subset of the features at each split, so the trees make different errors. While decision trees are prone to overfitting by themselves, as part of an ensemble they form a robust model.

In [None]:
#set up random forest

Now let's use this random forest classifier to make some predictions on our test data, and see how it holds up.

In [None]:
#predictions
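
A sketch covering the set-up and prediction cells above with `RandomForestClassifier`. The data is a synthetic stand-in whose labels follow a made-up rule so that there is something to learn; the real descriptors and activities come from the Kaggle file. `n_estimators=100` and the seeds are assumed values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (illustration only).
rng = np.random.default_rng(0)
n_molecules, n_descriptors = 300, 10
features = pd.DataFrame(
    rng.random((n_molecules, n_descriptors)),
    columns=[f"D{i}" for i in range(1, n_descriptors + 1)],
)
# Made-up rule so the stand-in labels are learnable.
activity = (features["D1"] + features["D2"] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features.to_numpy(), activity.to_numpy(), test_size=0.25, random_state=0
)

# An ensemble of 100 decision trees (assumed size).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Hard 0/1 predictions and mean accuracy on the held-out portion.
predictions = rf.predict(X_test)
accuracy = rf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```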

We now have an idea of how our simple random forest classifier is performing. Let's try it out on the Kaggle test data.

In [None]:
#load real test data

Notice that there's no "Activity" column here, so we don't have to worry about splitting the dataset into targets and features. Let's use our classifier to create a list of predicted activities for these molecules.

In [None]:
#create submission predictions
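
A sketch of producing predictions for unlabeled molecules. Both dataframes are synthetic stand-ins, and the labeling rule is made up. The sketch uses `predict_proba` on the assumption that the submission wants a probability per molecule; swap in `rf.predict` for hard 0/1 activities.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
columns = [f"D{i}" for i in range(1, 11)]

# Stand-in labeled training data with a made-up labeling rule.
train = pd.DataFrame(rng.random((300, 10)), columns=columns)
train.insert(0, "Activity", (train["D1"] + train["D2"] > 1.0).astype(int))

# Stand-in for the Kaggle test data: descriptors only, no "Activity" column.
kaggle_test = pd.DataFrame(rng.random((50, 10)), columns=columns)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train[columns], train["Activity"])

# Column 1 of predict_proba is the estimated probability of class 1 (active).
predicted_probs = rf.predict_proba(kaggle_test[columns])[:, 1]
print(predicted_probs[:5])
```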

And finally, let's save our submission as a .csv file using numpy's savetxt function.

In [None]:
#save file

To double-check that our file was saved correctly, we can reload it using pandas and check out the first few lines.

In [None]:
#double-check file
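
A sketch of the save and reload steps using `np.savetxt` and `pd.read_csv`. The probabilities are placeholder numbers standing in for the classifier's output, the file goes to a temporary directory, and the header names `MoleculeId` and `PredictedProbability` are an assumption about the expected submission format that should be checked against the competition page.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Placeholder probabilities standing in for the classifier's output.
predicted_probs = np.array([0.91, 0.12, 0.55, 0.78])

# Molecule IDs are assumed to start at 1 and follow the row order.
molecule_ids = np.arange(1, len(predicted_probs) + 1)
submission = np.column_stack([molecule_ids, predicted_probs])

path = os.path.join(tempfile.mkdtemp(), "submission.csv")
np.savetxt(
    path,
    submission,
    delimiter=",",
    fmt=["%d", "%f"],  # integer ID, floating-point probability
    header="MoleculeId,PredictedProbability",
    comments="",  # suppress the default "# " prefix on the header line
)

# Reload the file to confirm it was written as intended.
check = pd.read_csv(path)
print(check.head())
```

Passing `comments=""` matters here: by default `savetxt` prefixes the header with `# `, which would corrupt the first column name.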

And that's it! If you want, you can upload this file to Kaggle and see how it does.

Post-workshop survey: https://goo.gl/forms/70WP46Ykl0G2eLls2