## ML - Do it yourself
In this notebook we will get a feel for data generation and a few models.

In [None]:
from cheatdiy import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Using hints and cheats
Run the two cells below to learn how to get hints and solutions to the exercises. You only need to provide the number of the exercise. Exercise 0 is a dummy for trying out the cheating functionality.

In [None]:
cheat(0)

In [None]:
hint(0)

### Exercise 1: Gaussian distribution
- Use np.random.randn to sample 1000 numbers from the N(0,1) distribution, i.e. the normal (or Gaussian) distribution with mean 0 and standard deviation 1.

Whenever you don't know how a function works, run help(function) in any cell, for example help(np.random.randn).
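
As a quick illustration of how the arguments of np.random.randn determine the shape of the result, here is a minimal sketch. The seed and the small sizes are arbitrary choices for illustration and are not part of the exercise.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

# one argument: a 1-D array with that many N(0,1) samples
small = np.random.randn(5)
print(small.shape)

# two arguments: a 2-D array with that many rows and columns
table = np.random.randn(3, 2)
print(table.shape)
```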

In [None]:
x = # your code here
print(x)

### Exercise 2: Convert to pandas dataframe
- The class you should convert to is pd.DataFrame

In [None]:
df = # your code here
df

### Exercise 3: Histogram
- Use np.histogram to compute a histogram of x using the bins [-3,-2], [-2,-1], ..., [2,3], i.e. 6 bins.
  - You only need to specify the end points of the bins, which can be generated with range(start, end).
- Use df.hist for the same purpose (bonus: it also plots).
- Use plt.hist to plot a histogram of x.
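
To see what np.histogram returns when it is given bin edges, here is a small sketch on made-up data (the values and the edges are assumptions for illustration, not the ones asked for in the exercise).

```python
import numpy as np

values = np.array([0.5, 1.5, 1.7, 2.2])  # made-up data for illustration

# range(0, 4) yields the edges 0, 1, 2, 3, i.e. the bins [0,1), [1,2) and [2,3]
counts, edges = np.histogram(values, bins=range(0, 4))
print(counts)  # number of values per bin
print(edges)   # the bin edges, one more than the number of bins
```

Note that there is always one more edge than there are bins.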

In [None]:
# your code here

### Exercise 4: Multivariate gaussian
- Use np.random.randn to sample 1000 two-dimensional vectors independently from the N(0,1) distribution.

When we sample using this function, the samples have mean (0,0) and the two components of each vector are independent. Each sample is also independent of the others.
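
These properties can be checked empirically. The sketch below uses a larger sample than the exercise (size and seed are arbitrary assumptions) and prints the column means and the correlation between the two components. A correlation near zero is consistent with independence, although it does not prove it.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed
samples = np.random.randn(100000, 2)  # sample size chosen for illustration

col_means = samples.mean(axis=0)       # both entries should be close to 0
corr = np.corrcoef(samples.T)[0, 1]    # correlation between the two components
print(col_means)
print(corr)
```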

In [None]:
twoD = # your code here
twoD.shape

### Exercise 5: Scatterplot
- Split twoD into two arrays representing the two dimensions.
- Use plt.scatter to draw the scatter plot.

In [None]:
x = # your code here
y = # your code here

# your code here

### Exercise 6: Scale and translate
- Transform twoD by scaling both axes by 2.
- Transform twoD by adding 5 to x and subtracting 5 from y.
- Compute mean (along axis 0) and standard deviation as a sanity check. Means should be close to [5,-5] and standard deviations should be close to [2,2].
- Draw the scatter plot again.
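
The expected values in the sanity check follow from how the mean and the standard deviation behave under scaling and shifting. For a random variable $Z$ with mean 0 and standard deviation 1, and constants $s$ and $m$:

$$\mathbb{E}[sZ + m] = s\,\mathbb{E}[Z] + m = m, \qquad \operatorname{Std}(sZ + m) = |s|\,\operatorname{Std}(Z) = |s|.$$

Assuming the scaling is applied before the translation, $s = 2$ with $m = 5$ for x and $m = -5$ for y gives means close to $[5,-5]$ and standard deviations close to $[2,2]$, up to sampling noise.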

In [None]:
twoD # your code here
twoD[# fill in ] += 5
twoD[# fill in ] -= 5
print(twoD.mean(axis=0))
print(twoD.std(axis=0))

### Exercise 7: Append more rows
- Draw another 1000 samples from the 2D Gaussian distribution, again with independent N(0,1) entries.
- Create a concatenated 2D array with twoD followed by the additional samples. For this use np.vstack.

In [None]:
extra_rows = # your code here
X = # your code here
print(X.shape)
assert X.shape == (2000,2)

### Exercise 8: Adding labels
- Make a numpy array with shape (2000,1): 1000 entries of 1 followed by 1000 entries of -1.
  - For this use np.vstack and pass it np.ones((1000,1)) and -np.ones((1000,1)). (By the way np.zeros is a similar function).
- Stack this to the right of X to obtain a (2000,3) shaped array.
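
The difference between stacking vertically and stacking horizontally can be seen on small arrays. This is a sketch with made-up shapes; np.hstack is one possible way to stack to the right, and the exercise may intend a different function.

```python
import numpy as np

top = np.ones((2, 1))
bottom = np.zeros((2, 1))

# vstack puts arrays on top of each other: rows add up, columns must match
column = np.vstack([top, bottom])
print(column.shape)

# hstack puts arrays side by side: columns add up, rows must match
left = np.arange(8).reshape(4, 2)
wide = np.hstack([left, column])
print(wide.shape)
```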

In [None]:
Y = # your code here
data = # your code here

In [None]:
assert Y.shape == (2000,1)
assert Y.sum() == 0
assert Y[0] == 1
assert data.shape == (2000,3)
assert data[:,2].sum() == 0

### Exercise 9: Scatter plot
- Plot the first 1000 rows of data in gray. The color is set by the named argument c.
- Plot the remaining rows in pink.

In [None]:
# your code here

### Exercise 10: Shuffle the rows
- Use np.random.permutation to create a random index to shuffle the rows with.
- Use indexing to create a shuffled version of the data.
- Do the same using np.random.shuffle.
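
The two functions behave differently: np.random.permutation returns a new array, while np.random.shuffle changes its argument in place. A minimal sketch on toy data (array contents and seed are assumptions for illustration):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed
rows = np.arange(10).reshape(5, 2)  # toy data: row i is [2*i, 2*i + 1]

# permutation(n) returns a new array holding 0..n-1 in random order
perm = np.random.permutation(len(rows))
shuffled_copy = rows[perm]          # rows itself is left unchanged

# shuffle reorders the rows of its argument in place and returns None
result = np.random.shuffle(rows)
print(result)
```

In both cases whole rows are moved, so the columns of each row stay together.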

In [None]:
perm = # your code here
data = data[perm]
print(data)

# your code here to shuffle using np.random.shuffle
print(data)

In [None]:
assert data[:,2].sum() == 0

Before proceeding to the section about Sklearn, let's split the data into features and labels.

In [None]:
# select all rows and all columns up to but excluding the last one
features = data[:, :-1]
# select all rows and the last column
labels = data[:, -1]
print(features.shape)
print(labels.shape)

### Sklearn
Sklearn, or scikit-learn, is a very popular, lightweight and easy-to-use library for machine learning.
It lets you train models in a few lines of code while remaining flexible: you can vary algorithms, metrics, loss functions and other hyperparameters.

We will train a logistic regression model for the same data set as above.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
reg = LogisticRegression()

ML algorithms, or Inducers, in sklearn have few but important methods. Take a look at the help for 'fit' and 'predict'. If you type reg. and press Tab, you will also see which methods are supported.
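
The general fit/predict pattern can be sketched on a tiny data set. The data and the names toy_X, toy_y and toy_model are made up for illustration and are unrelated to the data generated above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data, made up for illustration: one feature, labels -1 and 1
toy_X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
toy_y = np.array([-1, -1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)            # learns coef_ and intercept_
toy_pred = toy_model.predict(toy_X)    # one predicted label per row
print(toy_pred)
```

fit expects a 2D feature array with one row per sample and a 1D label array of matching length; predict takes a feature array of the same layout.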

### Exercise 11: Train a model
- Fit the logistic regression model to the data.

We can see the values of all hyperparameters used above. We can also access them programmatically:

In [None]:
reg.get_params()

The weights can be accessed like so:

In [None]:
a,b = reg.coef_[0]
c = reg.intercept_[0]
(a,b,c)

### Plot
We reuse the scatter plot from above and add the logistic regression decision boundary, i.e. the line a*x + b*y + c = 0 solved for y.

In [None]:
plt.scatter(extra_rows[:,0], extra_rows[:,1], c='pink')
plt.scatter(twoD[:,0], twoD[:,1], c='gray')
# x positions from -4 to just under 12
ticks = [-4 + 0.16*t for t in range(100)]
# the boundary a*x + b*y + c = 0, solved for y
boundary_y = [-(a*x + c)/b for x in ticks]
# clip to [-15, 15] so the plot stays readable
boundary_y = np.clip(boundary_y, -15, 15)
plt.plot(ticks, boundary_y)

### Exercise 12: Predict and Evaluate the model
- Get the model's predictions for the entire data set.
- Compute the accuracy of the classifier on the training set.
  - You can compare two numpy arrays with == to get an elementwise comparison.
  - You can think of the accuracy as the average accuracy over all rows where accuracy for one row is 0 or 1.
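
The two hints can be illustrated on made-up arrays (the names truth and guess and their values are assumptions for illustration):

```python
import numpy as np

truth = np.array([1, -1, 1, 1])     # made-up labels
guess = np.array([1, 1, 1, -1])     # made-up predictions

hits = truth == guess               # boolean array: True where the entries agree
print(hits)
print(hits.mean())                  # True counts as 1, False as 0
```

Here two of the four entries agree, so the mean is 0.5.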

In [None]:
pred = # your code here

In [None]:
acc = # your code here
assert acc >= 0
assert acc <= 1
acc

### Exercise 13: Cross validation
- Perform 5-fold cross validation with logistic regression by using cross_val_score.
- The evaluation metric is accuracy.
- Only 1 line of code!

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
reg = LogisticRegression()
# your code here

For other metrics see
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

### Exercise 14: Learning Curve
- Retrain the logistic regression model above, but use sklearn's learning_curve function to get training and validation scores for increasing training set sizes. Specify train_sizes=np.linspace(0.1, 1.0, 200) and use 5-fold cross validation.
- Average the output scores over folds.
- Plot train scores and test scores in the same plot.
- Repeat the above for "neg_log_loss" instead of accuracy.
  - Note that you'll have to add a minus sign if you want to get positive loss values
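
learning_curve returns the scores as 2D arrays with one row per training set size and one column per cross validation fold. The sketch below uses a hand-made array (hypothetical numbers, not real model output) to show how to average over folds and how to flip the sign of a negated loss.

```python
import numpy as np

# hypothetical scores: 3 training set sizes (rows) x 5 folds (columns)
fake_scores = np.array([
    [-0.9, -1.1, -1.0, -0.8, -1.2],
    [-0.6, -0.7, -0.5, -0.6, -0.6],
    [-0.4, -0.5, -0.4, -0.3, -0.4],
])

mean_over_folds = fake_scores.mean(axis=1)   # one value per training set size
positive_loss = -mean_over_folds             # neg_log_loss is the negated loss
print(mean_over_folds.shape)
print(positive_loss)
```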

In [None]:
from sklearn.model_selection import learning_curve

In [None]:
sizes, train_scores, test_scores = # your code here
print(train_scores.shape)
print(test_scores.shape)

In [None]:
train_scores_mean = # your code here
test_scores_mean = # your code here

plt.xlabel("number of samples")
plt.ylabel("accuracy")
plt.plot(sizes, train_scores_mean, c='red', label='train')
plt.plot(sizes, test_scores_mean, c='blue', label='test')
plt.legend()

In [None]:
# repeat for negative log loss

### Exercise 15: Validation Curve
- Use np.logspace to create a list of values _params_ $=10^{-4},10^{-3},10^{-2},10^{-1},10^{0},10^{1}$.
- Use validation_curve to train a logistic regression model on the 'digits' data. The parameter C should take on the values above. Use 5-fold cross validation and negative log loss. When plotting, use plt.semilogx instead of plt.plot.
- Take averages over folds and plot average train loss and average validation loss vs _params_.

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve

In [None]:
# load handwritten digit image data
digits = load_digits()
X, y = digits.data, digits.target

params = np.logspace(-4,1,6)
train_scores, valid_scores = # your code here
train_scores_mean = # your code here
valid_scores_mean = # your code here

assert train_scores_mean.shape == (6,)
assert valid_scores_mean.shape == (6,)


plt.xlabel('C')
plt.ylabel('neg_log_loss')
plt.semilogx(params, train_scores_mean, label='train')
plt.semilogx(params, valid_scores_mean, label='valid')
plt.legend()

### Manually setting parameters

In [None]:
reg.set_params(penalty='elasticnet')
reg.set_params(solver='saga')

### Retrain
Now change 'C' to 'l1_ratio', rerun your cell above and check the resulting graph. Note that l1_ratio must lie between 0 and 1, so adjust _params_ accordingly.

### Exercise 16: Grid search
- Read the help on GridSearchCV.
- Set _parameters_ so that:
    - when _kernel_ is 'poly', _degree_ takes the values 1, 2 and 3
    - when _kernel_ is 'rbf', _C_ takes the values 1 and 10
    - when _kernel_ is 'linear', _C_ and _gamma_ take on all combinations of \[1,10\] and \[0.1,1\].
    - Hint: _parameters_ should be a list of dicts whose values are lists.
- Run the cell to fit a lot of models to iris data or digits data!
- Study the result.
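
To get a feel for how a list of dicts expands into individual settings, sklearn's ParameterGrid can enumerate them. The parameter names 'a' and 'b' below are hypothetical and chosen only for illustration; they are not SVC parameters.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical parameter names 'a' and 'b', chosen only for illustration
grid = [
    {'a': [1, 2], 'b': [3, 4]},   # 2 x 2 = 4 combinations
    {'a': [5]},                   # 1 more setting
]
settings = list(ParameterGrid(grid))
print(len(settings))
for s in settings:
    print(s)
```

Each dict is expanded into all combinations of its own lists, and the results of the separate dicts are concatenated rather than combined with each other.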

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn import svm, datasets
import pandas as pd

iris = datasets.load_iris()
digits = datasets.load_digits()

parameters = # your code here
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10)
clf.fit(iris.data, iris.target)
#clf.fit(digits.data, digits.target)
df = pd.DataFrame(clf.cv_results_)
# there should be 9 parameter settings
assert df.shape[0] == 9
df.sort_values('mean_test_score', ascending=False, inplace=True)
# hide the per-split score columns
important_cols = [col for col in df.columns.values if 'split' not in col]
df[important_cols]