# Python and Scikit-learn Workshop 11/16/17
### Brown Data Science  |  Brown Initiative for Computation in Brain and Mind  |  Brown Institute for Brain Science

## Python

Python is a flexible programming language with relatively simple syntax. It also has a huge suite of tools for processing and analyzing data. This workshop will guide you through the basics of the language. 

### Hello World
You can print things with the `print` function.

In [None]:
print('Hello world!')

### Libraries
Python has a number of built-in modules and libraries that offer convenient access to useful functions. These libraries can be imported with the `import` statement followed by the library name.

Here is an example using the `random` library, which can generate random integers within a specified range.

In [None]:
import random
print(random.randint(1, 1000))  # prints a random integer between 1 and 1000, inclusive

### If/else 
We can use `if`, `elif` (else if), and `else` to define cases in our program.

In [None]:
age = 22
 
if age < 13:
    print('kid')
elif age < 20:
    print('teen')
else:
    print('adult')

### Loops
Use `for` and `while` loops to repeat the same block of code.

Remember to indent when using if/else statements and loops!

In [None]:
for i in range(10):
    print(i)

In [None]:
i = 0
while i < 10:
    print(i)
    i += 1  # increment i

### Strings
Python allows for easy manipulation of strings.

In [None]:
mystring = 'ham and eggs'
print(mystring[0:4])         # prints the first 4 characters of mystring
print(mystring.find('and'))  # prints the index of the first occurrence of 'and' in mystring
print(mystring.split(' '))   # prints a list of words separated by spaces
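
Slices such as `mystring[0:4]` include the start index and exclude the stop index, so the number of characters returned is `stop - start`. The short sketch below illustrates this with the same string; the variable name `first_three` is only for illustration.

```python
mystring = 'ham and eggs'

first_three = mystring[0:3]   # indices 0, 1, 2 (index 3 is excluded)
print(first_three)

print(mystring[:3])           # omitting the start means "from the beginning"
print(mystring[-4:])          # negative indices count from the end
print(len(mystring[0:4]))     # length of a slice is stop - start
```

The same slicing rules apply to lists.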

### Lists
You can store any type of data in a list.

In [None]:
mylist = [6, 'hello', 3.14, 0]
print(mylist)

In [None]:
for item in mylist:
    print(item)

In [None]:
print(mylist[1:3])

In [None]:
mylist.append('new item')
print(mylist)

### Dictionaries
Dictionaries map keys to values. 

In [None]:
mydict = {'key': 'value'}
mydict['key2'] = 'value2'
mydict[5] = 6
print(mydict)
print(mydict['key'])

In [None]:
for key in mydict:
    print(key, mydict[key])
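
A few other dictionary operations come up often. The sketch below uses a made-up `inventory` dictionary (not part of the workshop data) to show membership tests, lookups with a default value, iteration over key/value pairs, and deletion.

```python
inventory = {'apples': 3, 'pears': 0}   # hypothetical example data

# check whether a key is present
print('apples' in inventory)

# .get returns a default instead of raising KeyError for a missing key
print(inventory.get('kiwis', 0))

# .items() yields (key, value) pairs
for name, count in inventory.items():
    print(name, count)

# remove a key and its value
del inventory['pears']
print(len(inventory))
```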

### Dynamic Typing
In Python, a variable is a name bound to an object. The name itself has no fixed data type, so it can later be rebound to an object of a different type.

In [None]:
var = 5
print(var)
print(type(var))

var = 3.2
print(var)
print(type(var))

var = 'spam'
print(var)
print(type(var))

### Binding 
Python assignment binds a name to an object; it does not copy the object. After `b = a`, both names refer to the same list, so a change made through `b` is also visible through `a`. To make an independent clone of an object, use the `deepcopy` function from Python's `copy` module.

In [None]:
a = [1, 2]
print(a)
b = a
b.append(3)
print(a)

In [None]:
from copy import deepcopy

a = [1, 2]
print(a)
b = deepcopy(a)
b.append(3)
print(a)
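
For nested objects, the difference between an alias, a shallow copy, and a deep copy becomes visible. The following is a small sketch with made-up data; the names `alias`, `shallow`, and `deep` are chosen here only for illustration.

```python
from copy import copy, deepcopy

a = [[1, 2], [3, 4]]

alias = a            # a second name for the same object
shallow = copy(a)    # new outer list, but the inner lists are shared
deep = deepcopy(a)   # fully independent copy, inner lists included

a[0].append(99)      # mutate one of the inner lists

print(alias is a)    # the alias and a are the same object
print(shallow[0])    # the shallow copy sees the change to the inner list
print(deep[0])       # the deep copy is unaffected
```

For a flat list of numbers or strings, a shallow copy is enough; `deepcopy` matters when the object contains other mutable objects.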

### Functions
Specify functions using the `def` keyword.

In [None]:
def add(x, y):
    return x+y

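def multiply(x, y):
    # repeated addition; assumes y is a non-negative integer
    total = 0  # avoid the name `sum`, which would shadow the built-in function
    for _ in range(y):
        total = add(total, x)
    return total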

print(add(3,4))
print(multiply(3,4))

### NumPy
NumPy is a useful Python library for storing and manipulating data in arrays.

In [None]:
import numpy as np

array = np.zeros((3,4))
print(array)

In [None]:
nested_lists = [[1,2,3],[4,5,6]]
array = np.array(nested_lists)
print(array)

In [None]:
print(array.shape)

In [None]:
print(array.transpose())

In [None]:
array2 = np.array([[2,2,2],[2,2,2]])
print(array2)
print(array + array2)

In [None]:
# multiply element-wise
print(array * array2)

In [None]:
# matrix multiplication
array3 = np.array([[1, -1], [0, 4], [-2, 3]])
print(np.dot(array, array3))
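
Matrix multiplication requires the inner dimensions to match: a `(2, 3)` array times a `(3, 2)` array gives a `(2, 2)` result. As a small sketch (the names `A`, `B`, and `product` are chosen here for illustration), the `@` operator gives the same result as `np.dot` for 2-D arrays.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, -1], [0, 4], [-2, 3]])    # shape (3, 2)

product = A @ B                              # matrix product, shape (2, 2)
print(product.shape)
print(product)

# for 2-D arrays, @ and np.dot agree
print(np.array_equal(product, np.dot(A, B)))
```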

### Plotting
It is often useful to plot quantities and metrics to better understand the performance of a model.
Here is a simple example using matplotlib.pyplot (conventionally imported as `plt`), a standard plotting library.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0,2*np.pi,50)
y = np.cos(x)
plt.plot(x,y)
plt.title("my plot")
plt.xlabel('x')
plt.ylabel('Cos')
plt.show()

## Machine Learning

Machine learning is the subfield of artificial intelligence that involves learning from data. In particular, we will be discussing *supervised* learning. 

Let's say we have a function we want to approximate:

$$ y = f(x) $$

But we don't know the formula for *f*. So what do we know about *f*? We have a set of *training examples*, which are pairs of inputs *x* and outputs *y*, where *y* is the result of applying *f* to *x*. 

$$(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)$$

The outputs $y$ in our training examples are called *labels*. Based on these training examples, we want to form an approximation for *f*. The process of forming an approximation is called supervised learning. Here are the steps involved in supervised learning:

**Step 1:** We form a *hypothesis* about what *f* looks like. A hypothesis is also called a *model*, and it has certain unknown *parameters*. For example, we might hypothesize that *f* looks like this:

$$f(x) =w \cdot x$$

Here we are applying the dot product on vectors *w* and *x*. This model is called a *linear* model, and has unknown parameters *w*. Applying this model is called *linear regression*.

**Step 2:** Find the best set of parameters that fits the training examples. In this case, we want to find the best vector *w* that fits our examples above. 

What do we mean when we say best set of parameters? We want parameters that minimize some *loss function*, which is a function that tells us how bad our model is based on our training data. For linear regression, the typical loss function is the mean squared error:

$$ L(w) = \frac{1}{n}\sum_{i=1}^n (y_i - w \cdot x_i)^2 $$

By choosing the *w* that minimizes this function, we are choosing the *w* that best fits our training data. 
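
To make the loss function concrete, here is a minimal sketch that evaluates the mean squared error for a candidate *w* and then finds a minimizing *w* with NumPy's least-squares solver. The toy data and the helper name `mse_loss` are made up for illustration; this is one way to minimize the loss, not necessarily how any particular library does it.

```python
import numpy as np

# Toy training data (made up): the labels follow y = 2*x1 - 1*x2 exactly
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

def mse_loss(w, X, y):
    """Mean squared error of the linear model f(x) = w . x on (X, y)."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)

# a poor guess has a large loss
w_guess = np.array([0.0, 0.0])
print(mse_loss(w_guess, X, y))

# np.linalg.lstsq returns the w that minimizes the sum of squared errors
w_best, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_best, mse_loss(w_best, X, y))
```

Because the toy labels were generated by an exactly linear function, the best *w* recovers it and the loss drops to (numerically) zero. Real data is noisy, so the minimum loss is usually greater than zero.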

**Step 3:** Apply the approximation to unseen data (data without labels). Our model will predict what the actual label is for a given input.

### Regression vs. Classification

In the above scenario, we have a function that is producing real numbers as outputs. When this is the case, we are doing *regression*. However, there is a different type of supervised learning problem called *classification*.

Consider the problem of predicting whether or not an email is spam, based on the text of the email. In this case, we would have inputs *x* (the full email text) and labels *y*. The label for a given example is 1 if the email is spam, and 0 if not. 

Here there are only two possible outputs of our function. This is called *binary classification*. More generally, if our function only produces a finite number of outputs (usually a small number), then we are doing *classification*. The possible outputs of the function (0 and 1 in the above example) are called *classes*.

### Regression using Scikit-learn

Here, we'll walk through an example of regression using scikit-learn. We'll be using the diabetes patients dataset. 

In [None]:
from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.DESCR)

In [None]:
print(diabetes.data.shape)
print(diabetes.data[:3])  # print the first 3 inputs in the training data

In [None]:
print(diabetes.target.shape)
print(diabetes.target[:10])  # print the first 10 labels

In [None]:
from sklearn import linear_model

# create a model
model = linear_model.LinearRegression()

# fit the model to find the best set of parameters for our data
model.fit(diabetes.data, diabetes.target)

# see the model's prediction on unseen data
model.predict([[0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235,
                -0.0348207, -0.04340085, -0.00259226, 0.01990842, -0.01764613]])

# if you get a weird warning when you run this block, ignore it
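
After fitting, the learned parameters can be inspected, and the mean squared error from the earlier section can be computed with scikit-learn's `metrics` module. The sketch below refits the model so that it runs on its own. Note that `LinearRegression` also fits an intercept term by default, so its model is slightly more general than the $f(x) = w \cdot x$ form above.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics

diabetes = datasets.load_diabetes()
model = linear_model.LinearRegression()
model.fit(diabetes.data, diabetes.target)

# learned parameters: one weight per input feature, plus an intercept
print(model.coef_.shape)
print(model.intercept_)

# mean squared error on the same data the model was fit to
predictions = model.predict(diabetes.data)
mse = metrics.mean_squared_error(diabetes.target, predictions)
print(mse)
```

This error is measured on the training data itself, so it likely understates the error the model would make on new inputs.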

### Classification using Scikit-learn

Here we will be using the Iris dataset.

In [None]:
iris = datasets.load_iris()
print(iris.DESCR)

In [None]:
print(iris.data.shape)
print(iris.data[:5])

In [None]:
print(iris.target.shape)
print(iris.target[:10])

In [None]:
from sklearn import svm

# create a model
model = svm.SVC()

# fit the model to find the best set of parameters for our data
model.fit(iris.data, iris.target)

# see the model's prediction on unseen data
model.predict([[6.2, 3, 1.2, 0.3]])

### Train/validation split

How do we evaluate how good our model is? The simplest way to do this is to split the data into two subsets: training and validation. We fit our model to the training data, then use the trained model to predict outputs for the validation data. We can then compare the model's predictions to the actual labels.

In [None]:
from sklearn import model_selection

X = iris.data
y = iris.target
X_train, X_validate, y_train, y_validate = model_selection.train_test_split(X, y, test_size=0.33, random_state=29)

print(X_train.shape, X_validate.shape, y_train.shape, y_validate.shape)

In [None]:
# fit our model to the training data
model.fit(X_train, y_train)

# use our model to predict outputs for our validation data
y_predict = model.predict(X_validate)

# compare the predictions to our actual labels
from sklearn import metrics
print(metrics.accuracy_score(y_validate, y_predict))
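
Accuracy is the fraction of predictions that match the true labels, which is what `accuracy_score` reports by default. As a sanity check, it can be computed by hand with NumPy. The labels and predictions below are made up for illustration and are not output from the model above.

```python
import numpy as np

# made-up true labels and predictions, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# element-wise comparison gives True/False; the mean is the fraction of matches
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```

Here five of the six predictions are correct, so the accuracy is 5/6.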