# Finding features with coefficients set to 0

In [1]:
# Load data
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn.linear_model import LogisticRegression
import project_env
from importlib import reload  # `imp` is deprecated and was removed in Python 3.12
import math
import os
import run_logreg
from sklearn.metrics import precision_recall_curve

reload(project_env)
reload(run_logreg)

%matplotlib inline


**The `project_env` package**

I wrote this Python package to make loading and working with the data quicker and easier. It has several convenience methods:
* `load_split_bucket(station_id)` - Load data for a bike station that's already pre-split into train, dev and test. This includes data cleaning and thresholding. The output is a dictionary:
```
  {
    'train': (DataFrame, Series),
    'dev': (DataFrame, Series),
    'test': (DataFrame, Series)
  }
```
  Each `(DataFrame, Series)` tuple is the feature values and target variables, respectively.
* `merge_training(split, df)` - Given two outputs of `load()`, append the training set of the second argument to the training set of the first. This is useful when trying to load data from multiple stations, but testing on one station only.
* `binarize(data, target)` - Given output of `load()` and either 1 or -1, binarize the target variable to 0 or 1. Whatever class is in the second argument will become '1' in the new data.
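
To make the binarization step concrete, here is a minimal sketch of what a function like `binarize` could do. It is not the code from `project_env`: the name `binarize_sketch`, the toy data, and the assumption that the raw target takes the values -1, 0 and 1 are all illustrative.

```python
import pandas as pd

def binarize_sketch(split, target):
    """Return a new split whose target Series are 1 where they equal `target`, else 0.

    Hypothetical stand-in for project_env.binarize, not the real implementation.
    """
    return {
        name: (features, (labels == target).astype(int))
        for name, (features, labels) in split.items()
    }

# Toy split with the same layout as the dictionary described above.
# Assumption: the raw target uses -1, 0 and 1 as class labels.
toy_split = {
    'train': (pd.DataFrame({'humidity': [0.3, 0.5, 0.9]}), pd.Series([-1, 0, 1])),
    'dev': (pd.DataFrame({'humidity': [0.4]}), pd.Series([-1])),
    'test': (pd.DataFrame({'humidity': [0.7]}), pd.Series([1])),
}

# Class -1 becomes 1 and every other class becomes 0
toy_empty = binarize_sketch(toy_split, -1)
print(toy_empty['train'][1].tolist())
```

The feature DataFrames are passed through unchanged; only the target Series is replaced.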

**The `run_logreg` package**

We wrote this Python package so running many logistic regression models with different parameters is cleaner and easier. Here are the methods included:

* `do_logreg` - Takes the result of split_data and performs a logistic regression given the input parameters. It takes a `squares` parameter that performs some basic feature engineering by squaring some of the variables, a penalty function (`l1` or `l2`, defaults to `l2`), and a `C` parameter (defaults to 100,000).

* `distance` - Calculates the distance between two stations, based on the distance formula

* `closest_stations` - Identifies the stations closest to the input station, using the `distance` function

* `add_closest_stations` - Takes split_data for one station and its station_id, splits, merges the closest stations' data and binarizes all

* `format_plot` - Formats the plot according to the desired target_recall and whether this is a plot of empty or full
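
As a rough illustration of the `distance` and `closest_stations` helpers, the sketch below assumes that "the distance formula" means plain Euclidean distance on (latitude, longitude) pairs. The function names, the `coords` dictionary, and the station IDs are hypothetical and are not taken from `run_logreg`.

```python
import math

def distance_sketch(a, b):
    """Euclidean distance between two (latitude, longitude) pairs (assumed formula)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def closest_stations_sketch(station_id, coords, n=2):
    """Return the IDs of the n stations nearest to station_id."""
    origin = coords[station_id]
    others = [sid for sid in coords if sid != station_id]
    return sorted(others, key=lambda sid: distance_sketch(origin, coords[sid]))[:n]

# Made-up coordinates, not real station locations
toy_coords = {
    'A': (0.0, 0.0),
    'B': (0.0, 1.0),
    'C': (3.0, 4.0),
    'D': (0.5, 0.0),
}

nearest = closest_stations_sketch('A', toy_coords, n=2)
print(nearest)
```

Over the short distances between nearby stations, treating latitude and longitude as flat coordinates is only an approximation, but it is usually good enough for ranking neighbours.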

### Logistic Regression Model

In [2]:
class Logistic_Regression_Specs():
    def __init__(self, split_data, stationid, target, empty=True, squares=False, num_append=0, C=1e5, penalty='l2'):
        self.stationid = stationid
        self.target = target
        self.split_data = split_data
        self.empty = empty
        self.squares = squares
        self.num_append = num_append
        self.penalty = penalty
        self.C = C

def construct_key(spec):
    key = ''
    if spec.target != 'y_60m':
        key = key + spec.target + ' '
    if spec.squares == True:
        key = key + 'squares; '
    if spec.num_append > 0:
        key = key + 'append: ' + str(spec.num_append) + '; '
    key = key + 'penalty: ' + spec.penalty + '; '
    key = key + 'c: ' + str(spec.C) + '; '
    return key

def run_models(list_of_specs):
    '''Creates a dictionary of models based on list of specs objects'''
    
    logregs = {}
    scalers = {}
    predictions = {}
    specs = {}
    
    for spec in list_of_specs:
        # Build the key once and reuse it for every dictionary
        key = construct_key(spec)
        logregs[key], scalers[key], predictions[key] = run_logreg.do_logreg(spec, plot=False)
        specs[key] = spec
    return logregs, scalers, predictions, specs

def pr_curve(predictions, true_value, target_recall=0.95):
    curve = precision_recall_curve(true_value, predictions)
    precision, recall, thresholds = curve
    mp, mr, mt = project_env.max_precision_for_recall(curve, target_recall=target_recall)
    return mp, mr, mt
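
`pr_curve` relies on `project_env.max_precision_for_recall`, which is not shown in this notebook. The sketch below is one plausible reading of that helper, assuming it returns the highest precision among thresholds whose recall meets the target, together with that recall and threshold. The function name and the toy scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_precision_for_recall_sketch(curve, target_recall=0.95):
    """Pick the operating point with the best precision at recall >= target_recall.

    Hypothetical stand-in for project_env.max_precision_for_recall.
    """
    precision, recall, thresholds = curve
    # The last (precision=1, recall=0) point has no matching threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    eligible = recall >= target_recall
    if not eligible.any():
        return None
    best = np.argmax(np.where(eligible, precision, -np.inf))
    return precision[best], recall[best], thresholds[best]

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.8, 0.9])

curve = precision_recall_curve(y_true, scores)
mp, mr, mt = max_precision_for_recall_sketch(curve, target_recall=0.95)
print(mp, mr, mt)
```

Fixing the recall first and then maximizing precision suits a setting where missing a positive case is more costly than raising a false alarm.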

In [3]:
# test loading data
data = project_env.load_split_bucket(519, target='y_60m', log=False)
print('done loading')

done loading


In [4]:
# best prediction: ['y_60m', False, 0, 10, 'l1']

data = project_env.load_split_bucket(519, target='y_60m', log=False)

spec001 = Logistic_Regression_Specs(data, stationid=519, target='y_60m', empty=True, squares=False, num_append=0, C=0.1, penalty='l1')
spec010 = Logistic_Regression_Specs(data, stationid=519, target='y_60m', empty=True, squares=False, num_append=0, C=1, penalty='l1')
spec100 = Logistic_Regression_Specs(data, stationid=519, target='y_60m', empty=True, squares=False, num_append=0, C=10, penalty='l1')

logregs_e_001, scalers_e, predictions_e = run_logreg.do_logreg(spec001, plot=False, merge_train_dev=True)
logregs_e_010, scalers_e, predictions_e = run_logreg.do_logreg(spec010, plot=False, merge_train_dev=True)
logregs_e_100, scalers_e, predictions_e = run_logreg.do_logreg(spec100, plot=False, merge_train_dev=True)


Training set X shape: (5204, 22)
Trained on train set of 5204 examples
Evaluating on dev set of 968 examples
Accuracy: 0.798553719008
[[659 142]
 [ 53 114]]
Training set X shape: (5204, 22)
Trained on train set of 5204 examples
Evaluating on dev set of 968 examples
Accuracy: 0.785123966942
[[632 169]
 [ 39 128]]
Training set X shape: (5204, 22)
Trained on train set of 5204 examples
Evaluating on dev set of 968 examples
Accuracy: 0.780991735537
[[626 175]
 [ 37 130]]


In [5]:
# coefficients

data_empty = project_env.binarize(data, -1)
#print(logregs_e.coef_.ravel())
#print(data_empty['train'][0].columns.ravel())

#coef_rows = logregs_e.coef_.ravel()
#coef_rows.append(logregs_e.coef_.ravel())
#coef_df = pd.DataFrame(coef_rows, columns=data_empty['train'][0].columns.ravel())
col_names = data_empty['train'][0].columns.ravel()
col_names = np.append(['C'], col_names)

# One row per value of C: the C value followed by the fitted coefficients.
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
coef_rows = [
    np.append([0.1], logregs_e_001.coef_.ravel()),
    np.append([1], logregs_e_010.coef_.ravel()),
    np.append([10], logregs_e_100.coef_.ravel()),
]
coef_df = pd.DataFrame(coef_rows, columns=col_names)

coef_df.transpose().round(decimals=3)

| | 0 | 1 | 2 |
|---|---|---|---|
| C | 0.1 | 1.0 | 10.0 |
| apparentTemperature | -0.381 | -1.21 | -1.247 |
| cloudCover | -0.19 | -0.199 | -0.194 |
| dewPoint | 0.0 | 1.947 | 3.316 |
| humidity | 0.745 | -0.331 | -1.142 |
| nearestStormDistance | -0.192 | -0.232 | -0.242 |
| ozone | -0.074 | -0.133 | -0.147 |
| precipIntensity | 0.079 | 0.144 | 0.156 |
| precipProbability | 0.0 | -0.031 | -0.047 |
| pressure | 0.109 | 0.152 | 0.159 |


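How many coefficients are exactly 0 depends on the value of C. C is the inverse of the regularization strength, so smaller values mean stronger L1 regularization. At C=10, no features are set to 0, and C has to be reduced before 0s start to appear. In the rows shown above, `dewPoint` and `precipProbability` are 0 at C=0.1.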

Although most coefficients are fairly stable regardless of C, dew point changes dramatically: it is 0 at C=0.1 but has the highest coefficient (3.316) at C=10. Also of note is humidity, which goes from one of the most positive coefficients (0.745 at C=0.1) to one of the most negative (-1.142 at C=10).
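
The relationship between C and the number of zeroed coefficients can be checked directly by listing the features whose coefficient is exactly 0 for each C. The sketch below uses synthetic data from `make_classification` rather than the bike station data, and the feature names `f0` to `f21` are made up, so the counts it prints will not match the table above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the station data: 22 features, only a few informative
X, y = make_classification(n_samples=500, n_features=22, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)
feature_names = np.array(['f%d' % i for i in range(X.shape[1])])  # made-up names

zero_features = {}
for C in [0.001, 0.1, 1, 10]:
    # liblinear is one of the scikit-learn solvers that supports the L1 penalty
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X, y)
    # Names of the features whose coefficient was driven to exactly 0
    zero_features[C] = feature_names[model.coef_.ravel() == 0].tolist()
    print('C=%g: %d of %d coefficients are 0' % (C, len(zero_features[C]), X.shape[1]))
```

With a very small C such as 0.001, the penalty dominates and every coefficient is 0. As C grows, features re-enter the model.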