# Eric Karsten PSET  Solutions

## 1. Parallel computing versus serial computing a bootstrapped cross validation 
For this exercise, you will use the same Auto.csv file. This dataset includes 397 observations on miles per gallon (mpg), number of cylinders (cylinders), engine displacement (displacement), horsepower (horsepower), vehicle weight (weight), acceleration (acceleration), vehicle year (year), vehicle origin (origin), and vehicle name (name). We will study the factors that make miles per gallon high or low. Create a binary variable mpg high that equals 1 if mpg\_high $\geq$ median(mpg\_high) and equals 0 if mpg\_high $<$ median(mpg\_high). Create two indicator variables for vehicle origin 1 (orgn1) and vehicle origin 2 (orgn2).

In [103]:
import numpy as np
import pandas as pd

cars = pd.read_csv('data/Auto.csv')

# Small cleaning step
cars = cars[cars["horsepower"] != "?"]
cars["horsepower"] = cars["horsepower"].astype(float)

# Data Preparation
y = pd.DataFrame((cars.mpg <= np.median(cars.mpg)).astype(int))
X = cars[["cylinders", "displacement", "horsepower", "weight", "acceleration", "year"]]
X.loc[:,"orgn1"] = (cars.origin == 1).astype(int)
X.loc[:,"orgn2"] = (cars.origin == 2).astype(int)

### (a) Serial Logit
Using serial computation, perform an estimation of the logistic model on 100 bootstrapped training sets (with replacement) on random draws of training sets of 65% of the data. Use sklearn.linear model.LogisticRegression() function and make sure that the n jobs option is set to None or 1. This will guarantee that it runs in serial. Compute the error rate for each of the 100 test sets. Calculate the average error rate. Make sure to set the seed on each of the 100 random draws so that these draws can be replicated in part (b). What is your error rate? How long did this computation take?


In [104]:
from sklearn.linear_model import LogisticRegression
from time import process_time 

index = np.array(X.index)


def logit_error(seed):
    np.random.seed(seed)

    train_index = np.random.choice(index, size=round(len(index) * .65), replace=True)
    test_index = np.setdiff1d(index, train_index)

    X_train = np.array(X.loc[train_index, :])
    X_test = np.array(X.loc[train_index, :])
    y_train = np.array(y.loc[train_index, :]).flatten()
    y_test = np.array(y.loc[train_index, :]).flatten()

    mod = LogisticRegression(solver='lbfgs', max_iter=10000, n_jobs=None)
    fit_mod = mod.fit(X_train, y_train)
    y_test_predict = fit_mod.predict(X_test)

    error_rate = np.mean(y_test != y_test_predict)
    return error_rate


time_start = process_time()

serial_errors = []
for i in range(0,100):
    serial_errors.append(logit_error(i))

time_elapsed = (process_time() - time_start)



print("The serial computation took ", round(time_elapsed, 2), " seconds.")
print("The mean error from serial computation was ", np.mean(serial_errors))

The serial computation took  9.08  seconds.
The mean error from serial computation was  0.07992156862745096


### (b) Parallel Logit
Now write a function that takes as arguments the bootstrap number (1 through 100 or 0 through 99), random seed, and the data, and estimates the logistic model on 65% of the data and calculates an error rate on the remaining 35%. Use Dask to parallelize these bootstraps. What is your error rate from this parallelized list of error rates? It should be the same
as part (a). How long did this computation take?

In [105]:
from dask import compute, delayed
import dask.multiprocessing
import multiprocessing

num_cores = multiprocessing.cpu_count()

time_start = process_time()

parallel_errors = []
for i in range(0,100):
    parallel_errors.append(delayed(logit_error)(i))

results_par = compute(*parallel_errors, scheduler=dask.multiprocessing.get, num_workers=num_cores)
    
time_elapsed = (process_time() - time_start)

print("The parallel computation took ", round(time_elapsed, 2), " seconds.")
print("The mean error from parallel computation was ", np.mean(results_par))

The parallel computation took  0.59  seconds.
The mean error from parallel computation was  0.07992156862745096
