# BIOS 823 Final Exam (14 December 2018)

- The time allocated is 3 hours
- This is a **closed book** examination
    - Close ALL applications on your laptop
    - Start an empty browser with a SINGLE Tab in FULL SCREEN MODE
    - You should only have this SINGLE notebook page open in your browser, with NO OTHER TABS or WINDOWS
- You are not allowed any reference material except for the following:
    - Cheat sheet (1 letter-sized paper, both sides)
    - Built-in help accessible either by `?foo`, `foo?` or `help(foo)`
- ALL necessary imports of Python modules have been done for you. 
- **You should not import any additional modules - this includes standard library packages**.

The questions are worth a total of 120 points, but the maximum score is 100. Note that answers will be graded on **correctness**, **efficiency** and **readability**.

<font color=blue>By taking this exam, you acknowledge that you have read the instructions and agree to abide by the Duke Honor Code.</font>

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**1**. (10 points)

Warm up exercise.

Find the 5 most common words and their counts in `data/moby.txt`, after removing punctuation, converting to lowercase, and splitting on whitespace.
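
As a hint, the cleaning and counting steps might look like the sketch below. It runs on a made-up string (`text` is a hypothetical stand-in, not the contents of `data/moby.txt`) and keeps the top 3 words rather than 5. It uses only `string` and built-in types.

```python
import string

# Hypothetical stand-in text; the exam itself reads data/moby.txt.
text = "The whale, the sea; the ship. A whale! The sea?"

# Remove punctuation, convert to lowercase, then split on whitespace.
cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
words = cleaned.split()

# Tally the words in a plain dict.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort by count in decreasing order and keep the top 3.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

A `pandas.Series` of words with `value_counts()` would be an equally reasonable route.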

In [None]:
import string

**2**. (10 points)

- Assemble the data from `features`, `subjects`, `X`, and `y` into a single `pandas.DataFrame` (DF) called `har`. You should end up with a DF that is 7352 by 562 with `activity` as the first column. Rows and columns should be appropriately labeled.
    - `X` is a matrix where each row is a feature vector
    - The columns of `X` are named in `features`
    - Each row of `X` belongs to the subject given in the corresponding entry of `subjects`
    - `y` is a code for the type of activity performed by the subject (name the column in the DataFrame `activity`)
- Name the index `subject`
- Display a sample of 5 rows chosen at random without replacement and the first 5 columns.
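
The assembly steps above can be sketched on toy arrays. All names prefixed with `toy_` are hypothetical stand-ins for `features`, `subjects`, `X` and `y`, and the shapes are much smaller than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for features, subjects, X and y.
rng = np.random.RandomState(0)
toy_features = np.array(['f1', 'f2', 'f3'])
toy_subjects = np.array([1, 1, 2, 2, 3, 3])
toy_X = rng.normal(size=(6, 3))
toy_y = np.array([1, 2, 1, 2, 1, 2])

# Rows are labeled by subject, columns by feature name.
toy = pd.DataFrame(toy_X, columns=toy_features, index=toy_subjects)
toy.index.name = 'subject'

# Put the activity code in the first column.
toy.insert(0, 'activity', toy_y)

# Random sample of 3 rows without replacement, first 2 columns only.
sample = toy.sample(3, replace=False).iloc[:, :2]
print(sample)
```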

In [None]:
activities = np.loadtxt('data/HAR/activity_labels.txt', dtype='str')
features = np.loadtxt('data/HAR/features.txt', dtype='str')[:, 1]
subjects = np.loadtxt('data/HAR/train/subject_train.txt', dtype='int')
X = np.loadtxt('data/HAR/train/X_train.txt')
y = np.loadtxt('data/HAR/train/y_train.txt', dtype='int')

**3**. (10 points)

Using the DF from Question 2, find the average value for each subject of every feature whose name contains the string `entropy` but does NOT end in X, Y or Z. Use method chaining to perform this operation and show a random sample of 5 rows without replacement, all as a single expression.
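
One possible shape for such a chain is sketched below on a toy DF. The column names are invented for illustration, and the regular expression is only one of several ways to select the columns. The sampling step is omitted because the toy result has only two rows.

```python
import numpy as np
import pandas as pd

# Toy DF with invented column names and a 'subject' index.
cols = ['a-entropy()', 'b-entropy()-X', 'c-mean()', 'd-entropy()-Z']
toy = pd.DataFrame(
    np.arange(16.0).reshape(4, 4),
    columns=cols,
    index=pd.Index([1, 1, 2, 2], name='subject'),
)

result = (
    toy
    # Keep columns that contain 'entropy' and do not end in X, Y or Z.
    .filter(regex=r'entropy.*(?<![XYZ])$')
    # Average each remaining feature within each subject.
    .groupby('subject')
    .mean()
)
print(result)
```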

**4**. (10 points)

Write an SQL query against the `har` table to count the number of distinct subjects and the total number of rows for each activity, ordering the results by number of rows for each activity in decreasing order. A simple example of how to run an SQL query using `pandas` is provided.

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/har.db', echo=False)

In [None]:
query = '''
SELECT subject, activity 
FROM har 
LIMIT 5
'''
pd.read_sql(query, con=engine)
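
For the aggregation pattern itself, a self-contained sketch against an in-memory SQLite database is given below. The table `visits` and its columns `patient` and `clinic` are hypothetical and unrelated to `har`. The sketch uses the standard-library `sqlite3` module only so that it runs without `data/har.db`.

```python
import sqlite3

# Hypothetical in-memory table, unrelated to the har table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (patient INTEGER, clinic TEXT)')
conn.executemany(
    'INSERT INTO visits VALUES (?, ?)',
    [(1, 'north'), (1, 'north'), (2, 'north'),
     (3, 'south'), (3, 'south'), (4, 'north')],
)

# Per-group distinct count and row count, largest groups first.
query = '''
SELECT clinic,
       COUNT(DISTINCT patient) AS n_patients,
       COUNT(*) AS n_rows
FROM visits
GROUP BY clinic
ORDER BY n_rows DESC
'''
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```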

**5**. (25 points)

- Create a new DF `df` from the `har` DF with all features that include the string `Acc-mean`
- Scale the feature columns so that all features have mean 0 and standard deviation 1
- Use SVD to find the first two principal components
- Plot the first two principal components as a scatter plot colored by the `activity` type of each feature vector
- Plot the 2D t-SNE plot colored in the same way (t-SNE dimension reduction may take 1-2 minutes)
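
The scaling and SVD steps above can be sketched on synthetic data. This sketch assumes `np.linalg.svd` in place of `scipy.linalg.svd` to stay dependency-light. The names `data`, `Z` and `pcs` are hypothetical, and plotting is left out.

```python
import numpy as np

# Synthetic correlated data: 100 observations, 4 features.
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))
data = base @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))

# Standardize each column to mean 0 and standard deviation 1.
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Thin SVD: Z = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Project onto the first two right singular vectors.
pcs = Z @ Vt[:2].T

# Fraction of total variance carried by each component.
explained = s**2 / np.sum(s**2)
print(pcs.shape, explained[:2])
```

The projection `Z @ Vt[:2].T` equals `U[:, :2] * s[:2]`, so either form gives the coordinates for the scatter plot.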

Do not import any packages other than those imported in the cell below.

In [None]:
from scipy.linalg import svd
from sklearn.manifold import TSNE

In [None]:
activities

In [None]:
X_test_data = np.loadtxt('data/HAR/test/X_test.txt')
y_test_data = np.loadtxt('data/HAR/test/y_test.txt', dtype='int')
subjects_test = np.loadtxt('data/HAR/test/subject_test.txt', dtype='int')

**6**. (25 points)

You are given training and test data and labels using a subset of the HAR data set. Your job is to use these features to classify rows into WALKING UPSTAIRS (code = 2) or WALKING DOWNSTAIRS (code = 3). 

- Scale the data to have mean zero and unit standard deviation using `StandardScaler`, taking care to apply the same scaling parameters for the training and test data sets
- Use `LabelEncoder` to transform the codes 2 and 3 to 0 and 1 in `y_train` and `y_test`
- Perform L2-penalized (ridge) logistic regression to classify data as WALKING UPSTAIRS or WALKING DOWNSTAIRS
    - Train the model with a `Cs` value chosen from (0.01, 0.1, 1, 10, 100) by 5-fold cross-validation using the training data
    - Plot the ROC curve (TPR versus FPR) evaluated on the test data

The necessary classes from `sklearn` are imported for you. Do not use any other `sklearn` classes.
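
To illustrate what "the same scaling parameters" means, the sketch below does the arithmetic by hand in NumPy on synthetic arrays (`train` and `test` are hypothetical). The mean and standard deviation are estimated from the training data only and then reused for the test data. With `StandardScaler`, this corresponds to fitting on the training set and only transforming the test set.

```python
import numpy as np

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.RandomState(1)
train = rng.normal(loc=10.0, scale=2.0, size=(50, 3))
test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

# Estimate the scaling parameters from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# Apply the same parameters to both sets.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

# The test set is close to, but not exactly, mean 0 and std 1.
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```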

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_curve

In [None]:
X_train = np.load('data/X_train.npy')
X_test = np.load('data/X_test.npy')
y_train = np.load('data/y_train.npy')
y_test = np.load('data/y_test.npy')

**7**. (30 points)

- Speed up the `kmeans` function given below by using Cython. A simple example is given as a hint.
- Find the speed-up of the Cython version.
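
One way to report a speed-up is as the ratio of the two running times measured with `timeit`. The sketch below compares two hypothetical pure Python functions rather than Cython code, so that it runs anywhere. The same ratio would apply to the Python and Cython versions of `kmeans`.

```python
from timeit import timeit

def slow_sum_squares(n):
    """Sum of i**2 for i in 0..n-1, computed with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum_squares(n):
    """Same sum using the closed form (n-1)n(2n-1)/6."""
    return (n - 1) * n * (2 * n - 1) // 6

n = 10000

# Time 20 calls of each version.
t_slow = timeit(lambda: slow_sum_squares(n), number=20)
t_fast = timeit(lambda: fast_sum_squares(n), number=20)

# Speed-up is the ratio of the slow time to the fast time.
speedup = t_slow / t_fast
print(f'speed-up: {speedup:.1f}x')
```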

#### Cython example 

In [None]:
from timeit import timeit

In [None]:
%load_ext cython

In [None]:
def square(x):
    return x**2

def foo(X):
    """Python function."""

    n = len(X)
    s = 0.0
    for i in range(n):
        s += square(X[i])
    return s

In [None]:
%%cython -a

import cython
from libc.math cimport pow

cdef double square_cython(double x):
    return pow(x, 2)

@cython.wraparound(False)
@cython.boundscheck(False)
def foo_cython(double[:] X):
    """Cython function."""

    cdef int n = X.shape[0]
    cdef double s = 0.0
    
    cdef int i
    
    for i in range(n):
        s += square_cython(X[i])
    return s

In [None]:
foo(np.arange(10.0))

In [None]:
foo_cython(np.arange(10.0))

#### Algorithm to Cythonize

In [None]:
def cdist(X, Y):
    """Matrix of Euclidean distances between vectors in X and vectors in Y."""
    
    m, p = X.shape
    n, p = Y.shape
    M = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            d = 0
            for k in range(p):
                d += (X[i,k] - Y[j,k])**2
            M[i, j] = np.sqrt(d)
    return M        

In [None]:
def kmeans(X, k, iters=10):
    """K-means with fixed number of iterations."""

    r, c = X.shape
    # Use the first k points as the initial centers.
    centers = X[:k]
    for i in range(iters):
        # Assign each point to its nearest center.
        m = cdist(X, centers)
        z = np.argmin(m, axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
    return (z, centers)

In [None]:
np.random.seed(2017)
from sklearn.datasets import make_blobs

npts = 10000
nc = 6
X, y = make_blobs(n_samples=npts, centers=nc)

In [None]:
z, centers = kmeans(X, nc)

In [None]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass