## Objective

Since what I'm most interested in is which factors are most associated with whether an issue is unresolved, random forests seems like the most appropriate model choice, since I can look at the variable importance plot.

Logistic regression is a close second; random forests usually have better performance.

Random forests can also do a better job at getting signal out of raw lat and long. For logistic regression, there would need to be a linear relationship between lat and long and the log-odds of the issue being unresolved, which is a taller order.

Thinking about my engineered feature `days_from_feb_2016` for example, Logistic Regression wouldn't do a good job with it, but a tree-based model would.

In [1]:
from __future__ import division
import pandas as pd
from datetime import timedelta, datetime

In [17]:
import warnings
import seaborn as sns

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from pylab import rcParams
rcParams['figure.figsize'] = 20, 5

import matplotlib.pyplot as plt

%matplotlib inline

In [30]:
df = pd.read_pickle('../data/data_w_transformed_census_and_removed_invalid_rows_and_cols.pkl')
df.shape

(716310, 36)

## scratch looking at NAs

In [34]:
df.head(2000)[df.head(2000).race_white.isnull()][['TYPE', 'ADDR'.head(2).T

Unnamed: 0,601,605
CASE_ENQUIRY_ID,101001771843,101001066689
OPEN_DT,2016-04-19 05:51:01,2014-04-12 07:28:13
TARGET_DT,2016-04-21 08:30:00,2014-04-16 08:30:00
CLOSED_DT,2016-04-19 06:24:26,2014-04-14 06:03:35
CASE_TITLE,Parking Enforcement,Utility Call-In
SUBJECT,Transportation - Traffic Division,Public Works Department
REASON,Enforcement & Abandoned Vehicles,Highway Maintenance
TYPE,Parking Enforcement,Utility Call-In
Department,BTDT,PWDx
SubmittedPhoto,True,False


## Next section

In [5]:
df.head(1).T

Unnamed: 0,0
CASE_ENQUIRY_ID,101000958209
OPEN_DT,2013-11-01 09:27:19
TARGET_DT,2013-11-15 09:27:19
CLOSED_DT,2013-11-27 10:15:45
CASE_TITLE,Sign Repair
SUBJECT,Transportation - Traffic Division
REASON,Signs & Signals
TYPE,Sign Repair
Department,BTDT
SubmittedPhoto,False


## Preprocessing

In [None]:
df.TARGET_DT.head(100).isnull().sum()

In [None]:
df.LOCATION_ZIPCODE.head(100).isnull().sum()

In [6]:
df['days_from_feb_2016'] = (datetime(year=2016, month=2, day=1) - df.OPEN_DT).map(lambda x: x.days)
df['LOCATION_ZIPCODE'] = df['LOCATION_ZIPCODE'].astype('object').fillna('other')
df['neighborhood'] = df['neighborhood'].fillna('other')
df['is_snow_in_type'] = df.TYPE.map(lambda txt: 'snow' in txt.lower())
# df['race_majority'] = df[[i for i in df.columns if 'race' in i]].idxmax(axis=1) # not useful bc will be dummified

In [7]:
df = df.drop(
    ['OPEN_DT',
     'TARGET_DT',
     'CLOSED_DT',
     'COMPLETION_TIME',
     'Property_ID',
     'LOCATION_STREET_NAME',
     'CASE_TITLE',
     'CASE_ENQUIRY_ID',
     'SUBJECT',
     'Department',
     'tract_and_block_group'
    ], 
    axis=1
)

In [8]:
df.shape

(716310, 27)

In [7]:
df.head(1).T

Unnamed: 0,0
REASON,Signs & Signals
TYPE,Sign Repair
SubmittedPhoto,False
neighborhood,Downtown / Financial District
LOCATION_ZIPCODE,other
Property_Type,Intersection
LATITUDE,42.3537
LONGITUDE,-71.058
Source,Constituent Call
race_white,0.77619


### Dummifying bc sklearn's random forest implementation doesn't accept string values

In [9]:
def dummify(df, column):
    # from Darren's linear regression slides
    print '{} is your baseline'.format(sorted(df[column].unique())[-1])
    dummy = pd.get_dummies(df[column]).rename(columns=lambda x: column+'_'+str(x)).iloc[:,0:len(df[column].unique())-1]
    df = df.drop(column,axis=1) #Why not inplace? because if we do inplace, it will affect the df directly
    return pd.concat([df,dummy],axis=1)

In [10]:
df1 = dummify(df, 'TYPE')
df2 = dummify(df1, 'neighborhood')
df3 = dummify(df2, 'LOCATION_ZIPCODE')
df4 = dummify(df3, 'Property_Type')
df5 = dummify(df4, 'Source')
df6 = dummify(df5, 'school')
df7 = dummify(df6, 'housing')
df8 = dummify(df7, 'REASON')

Zoning is your baseline
other is your baseline
other is your baseline
Intersection is your baseline
Twitter is your baseline
8_6th_grade is your baseline
rent is your baseline


In [22]:
df8.shape

(716310, 356)

In [29]:
df.isnull().sum()

REASON                                 0
TYPE                                   0
SubmittedPhoto                         0
neighborhood                           0
LOCATION_ZIPCODE                       0
Property_Type                      23077
LATITUDE                               0
LONGITUDE                              0
Source                                 0
race_white                          2426
race_black                          2426
race_asian                          2426
race_hispanic                       2426
race_other                          2426
poverty_pop_below_poverty_level     4213
poverty_pop_w_public_assistance     4213
poverty_pop_w_food_stamps           4213
poverty_pop_w_ssi                   4213
school                                 0
housing                                0
bedroom                                0
value                                  0
rent                                   0
income                                 0
is_issue_unresol

## OK, let's put it into the model!

In [16]:
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix


Let's split the data first between train and test, 80/20.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    df7.drop('is_issue_unresolved', axis=1), 
    df7.is_issue_unresolved, 
    test_size=0.2, 
    random_state=300
)

In [14]:
pipe = make_pipeline(
    RandomForestClassifier(
        bootstrap=True, 
        class_weight='balanced_subsample', 
        n_estimators=10
    )
)

In [19]:
params = {'randomforestclassifier__max_depth': [10, 20],
      'randomforestclassifier__min_samples_leaf': [101, 1001], # odd for bin class, to avoid ties, 1% of num_rows for max bound
} # taking too long

params = {'randomforestclassifier__max_depth': [10],
      'randomforestclassifier__min_samples_leaf': [1001], # odd for bin class, to avoid ties, 1% of num_rows for max bound
} # Brad said hyperparams don't matter _that_ much for RF

model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=1, verbose=True)
model.fit(X_train, y_train);

Fitting 3 folds for each of 1 candidates, totalling 3 fits


JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/runpy.py in _run_module_as_main(mod_name='ipykernel.__main__', alter_argv=1)
    169     pkg_name = mod_name.rpartition('.')[0]
    170     main_globals = sys.modules["__main__"].__dict__
    171     if alter_argv:
    172         sys.argv[0] = fname
    173     return _run_code(code, main_globals, None,
--> 174                      "__main__", fname, loader, pkg_name)
        fname = '/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py'
        loader = <pkgutil.ImpLoader instance>
        pkg_name = 'ipykernel'
    175 
    176 def run_module(mod_name, init_globals=None,
    177                run_name=None, alter_sys=False):
    178     """Execute a module's code without importing it

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/runpy.py in _run_code(code=<code object <module> at 0x7f3d4b349830, file "/...2.7/site-packages/ipykernel/__main__.py", line 1>, run_globals={'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/home/ubuntu...python2.7/site-packages/ipykernel/kernelapp.pyc'>}, init_globals=None, mod_name='__main__', mod_fname='/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py', mod_loader=<pkgutil.ImpLoader instance>, pkg_name='ipykernel')
     67         run_globals.update(init_globals)
     68     run_globals.update(__name__ = mod_name,
     69                        __file__ = mod_fname,
     70                        __loader__ = mod_loader,
     71                        __package__ = pkg_name)
---> 72     exec code in run_globals
        code = <code object <module> at 0x7f3d4b349830, file "/...2.7/site-packages/ipykernel/__main__.py", line 1>
        run_globals = {'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/home/ubuntu...python2.7/site-packages/ipykernel/kernelapp.pyc'>}
     73     return run_globals
     74 
     75 def _run_module_code(code, init_globals=None,
     76                     mod_name=None, mod_fname=None,

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py in <module>()
      1 
      2 
----> 3 
      4 if __name__ == '__main__':
      5     from ipykernel import kernelapp as app
      6     app.launch_new_instance()
      7 
      8 
      9 
     10 

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    648 
    649         If a global instance already exists, this reinitializes and starts it
    650         """
    651         app = cls.instance(**kwargs)
    652         app.initialize(argv)
--> 653         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    654 
    655 #-----------------------------------------------------------------------------
    656 # utility functions, for convenience
    657 #-----------------------------------------------------------------------------

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    469             return self.subapp.start()
    470         if self.poller is not None:
    471             self.poller.start()
    472         self.kernel.start()
    473         try:
--> 474             ioloop.IOLoop.instance().start()
    475         except KeyboardInterrupt:
    476             pass
    477 
    478 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    157             PollIOLoop.configure(ZMQIOLoop)
    158         return PollIOLoop.current(*args, **kwargs)
    159     
    160     def start(self):
    161         try:
--> 162             super(ZMQIOLoop, self).start()
        self.start = <bound method ZMQIOLoop.start of <zmq.eventloop.ioloop.ZMQIOLoop object>>
    163         except ZMQError as e:
    164             if e.errno == ETERM:
    165                 # quietly return on ETERM
    166                 pass

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    882                 self._events.update(event_pairs)
    883                 while self._events:
    884                     fd, events = self._events.popitem()
    885                     try:
    886                         fd_obj, handler_func = self._handlers[fd]
--> 887                         handler_func(fd_obj, events)
        handler_func = <function null_wrapper>
        fd_obj = <zmq.sugar.socket.Socket object>
        events = 1
    888                     except (OSError, IOError) as e:
    889                         if errno_from_exception(e) == errno.EPIPE:
    890                             # Happens when the client closes the connection
    891                             pass

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    435             # dispatch events:
    436             if events & IOLoop.ERROR:
    437                 gen_log.error("got POLLERR event on ZMQStream, which doesn't make sense")
    438                 return
    439             if events & IOLoop.READ:
--> 440                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    441                 if not self.socket:
    442                     return
    443             if events & IOLoop.WRITE:
    444                 self._handle_send()

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    467                 gen_log.error("RECV Error: %s"%zmq.strerror(e.errno))
    468         else:
    469             if self._recv_callback:
    470                 callback = self._recv_callback
    471                 # self._recv_callback = None
--> 472                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    473                 
    474         # self.update_state()
    475         
    476 

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    409         close our socket."""
    410         try:
    411             # Use a NullContext to ensure that all StackContexts are run
    412             # inside our blanket exception handler rather than outside.
    413             with stack_context.NullContext():
--> 414                 callback(*args, **kwargs)
        callback = <function null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    415         except:
    416             gen_log.error("Uncaught exception, closing connection.",
    417                           exc_info=True)
    418             # Close the socket on an uncaught exception from a user callback

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    271         if self.control_stream:
    272             self.control_stream.on_recv(self.dispatch_control, copy=False)
    273 
    274         def make_dispatcher(stream):
    275             def dispatcher(msg):
--> 276                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    277             return dispatcher
    278 
    279         for s in self.shell_streams:
    280             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {u'allow_stdin': True, u'code': u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);", u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-02-02T02:08:03.306232', u'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', u'msg_type': u'execute_request', u'session': u'D998459802C24B1F99E2DF7DDAD866B8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', 'msg_type': u'execute_request', 'parent_header': {}})
    223             self.log.error("UNKNOWN MESSAGE TYPE: %r", msg_type)
    224         else:
    225             self.log.debug("%s: %s", msg_type, msg)
    226             self.pre_handler_hook()
    227             try:
--> 228                 handler(stream, idents, msg)
        handler = <bound method IPythonKernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = ['D998459802C24B1F99E2DF7DDAD866B8']
        msg = {'buffers': [], 'content': {u'allow_stdin': True, u'code': u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);", u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-02-02T02:08:03.306232', u'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', u'msg_type': u'execute_request', u'session': u'D998459802C24B1F99E2DF7DDAD866B8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', 'msg_type': u'execute_request', 'parent_header': {}}
    229             except Exception:
    230                 self.log.error("Exception in message handler:", exc_info=True)
    231             finally:
    232                 self.post_handler_hook()

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=['D998459802C24B1F99E2DF7DDAD866B8'], parent={'buffers': [], 'content': {u'allow_stdin': True, u'code': u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);", u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-02-02T02:08:03.306232', u'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', u'msg_type': u'execute_request', u'session': u'D998459802C24B1F99E2DF7DDAD866B8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'F9D0FAE3486E46A2BB7251C761ADB16E', 'msg_type': u'execute_request', 'parent_header': {}})
    385         if not silent:
    386             self.execution_count += 1
    387             self._publish_execute_input(code, parent, self.execution_count)
    388 
    389         reply_content = self.do_execute(code, silent, store_history,
--> 390                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    391 
    392         # Flush output before sending the reply.
    393         sys.stdout.flush()
    394         sys.stderr.flush()

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code=u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);", silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    191 
    192         self._forward_input(allow_stdin)
    193 
    194         reply_content = {}
    195         try:
--> 196             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);"
        store_history = True
        silent = False
    197         finally:
    198             self._restore_input()
    199 
    200         if res.error_before_exec is not None:

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=(u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);",), **kwargs={'silent': False, 'store_history': True})
    496             )
    497         self.payload_manager.write_payload(payload)
    498 
    499     def run_cell(self, *args, **kwargs):
    500         self._last_traceback = None
--> 501         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = (u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);",)
        kwargs = {'silent': False, 'store_history': True}
    502 
    503     def _showtraceback(self, etype, evalue, stb):
    504         # try to preserve ordering of tracebacks and print statements
    505         sys.stdout.flush()

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell=u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);", store_history=True, silent=False, shell_futures=True)
   2712                 self.displayhook.exec_result = result
   2713 
   2714                 # Execute the user code
   2715                 interactivity = "none" if silent else self.ast_node_interactivity
   2716                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2717                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler instance>
   2718                 
   2719                 self.last_execution_succeeded = not has_raised
   2720 
   2721                 # Reset this so later displayed values do not modify the

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Assign object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Expr object>], cell_name='<ipython-input-19-01dc5a8c4f29>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler instance>, result=<ExecutionResult object at 7f3d0b955990, executi..._before_exec=None error_in_exec=None result=None>)
   2822                     return True
   2823 
   2824             for i, node in enumerate(to_run_interactive):
   2825                 mod = ast.Interactive([node])
   2826                 code = compiler(mod, cell_name, "single")
-> 2827                 if self.run_code(code, result):
        self.run_code = <bound method ZMQInteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x7f3d0d59a0b0, file "<ipython-input-19-01dc5a8c4f29>", line 10>
        result = <ExecutionResult object at 7f3d0b955990, executi..._before_exec=None error_in_exec=None result=None>
   2828                     return True
   2829 
   2830             # Flush softspace
   2831             if softspace(sys.stdout, 0):

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x7f3d0d59a0b0, file "<ipython-input-19-01dc5a8c4f29>", line 10>, result=<ExecutionResult object at 7f3d0b955990, executi..._before_exec=None error_in_exec=None result=None>)
   2876         outflag = 1  # happens in more places, so it's easier as default
   2877         try:
   2878             try:
   2879                 self.hooks.pre_run_code_hook()
   2880                 #rprint('Running code', repr(code_obj)) # dbg
-> 2881                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x7f3d0d59a0b0, file "<ipython-input-19-01dc5a8c4f29>", line 10>
        self.user_global_ns = {'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'In': ['', u'from __future__ import division\nimport pandas as pd\nfrom datetime import timedelta, datetime', u'import warnings\nimport seaborn as sns\n\nwarn...5\n\nget_ipython().magic(u\'matplotlib inline\')', u'df.head(1).T', u"df = pd.read_pickle('../data/data_w_transforme...nd_removed_invalid_rows_and_cols.pkl')\ndf.shape", u'df.head(1).T', u"df['days_from_feb_2016'] = (datetime(year=2016...idxmax(axis=1) # not useful bc will be dummified", u"df = df.drop(\n    ['OPEN_DT',\n     'TARGET_D... 'tract_and_block_group'\n    ], \n    axis=1\n)", u'df.shape', u"def dummify(df, column):\n    # from Darren's ...irectly\n    return pd.concat([df,dummy],axis=1)", u"df1 = dummify(df, 'TYPE')\ndf2 = dummify(df1, ...fy(df5, 'school')\ndf7 = dummify(df6, 'housing')", u'df7.shape', u'from sklearn.cross_validation import train_tes...m sklearn.ensemble import RandomForestClassifier', u"X_train, X_test, y_train, y_test = train_test_..., \n    test_size=0.2, \n    random_state=300\n)", u"pipe = make_pipeline(\n    RandomForestClassif..._subsample', \n        n_estimators=10\n    )\n)", u"params = {'randomforestclassifier__max_depth':...v=5, verbose=True)\nmodel.fit(X_train, y_train);", u'from sklearn.cross_validation import train_tes...er\nfrom sklearn.metrics import confusion_matrix', u'import warnings\nimport seaborn as sns\n\nwarn...t\n\nget_ipython().magic(u\'matplotlib inline\')', u'X_train.shape', u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);"], 'Out': {4: (716310, 36), 5:                                                 ...                                            False, 8: (716310, 27), 11: (716310, 302), 18: (573048, 301)}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'X_test':                                   REASON Submitt...  0.0          0.0  

[143262 rows x 301 columns], 'X_train':                                    REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], '_': (573048, 301), '_11': (716310, 302), '_18': (573048, 301), '_4': (716310, 36), ...}
        self.user_ns = {'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'In': ['', u'from __future__ import division\nimport pandas as pd\nfrom datetime import timedelta, datetime', u'import warnings\nimport seaborn as sns\n\nwarn...5\n\nget_ipython().magic(u\'matplotlib inline\')', u'df.head(1).T', u"df = pd.read_pickle('../data/data_w_transforme...nd_removed_invalid_rows_and_cols.pkl')\ndf.shape", u'df.head(1).T', u"df['days_from_feb_2016'] = (datetime(year=2016...idxmax(axis=1) # not useful bc will be dummified", u"df = df.drop(\n    ['OPEN_DT',\n     'TARGET_D... 'tract_and_block_group'\n    ], \n    axis=1\n)", u'df.shape', u"def dummify(df, column):\n    # from Darren's ...irectly\n    return pd.concat([df,dummy],axis=1)", u"df1 = dummify(df, 'TYPE')\ndf2 = dummify(df1, ...fy(df5, 'school')\ndf7 = dummify(df6, 'housing')", u'df7.shape', u'from sklearn.cross_validation import train_tes...m sklearn.ensemble import RandomForestClassifier', u"X_train, X_test, y_train, y_test = train_test_..., \n    test_size=0.2, \n    random_state=300\n)", u"pipe = make_pipeline(\n    RandomForestClassif..._subsample', \n        n_estimators=10\n    )\n)", u"params = {'randomforestclassifier__max_depth':...v=5, verbose=True)\nmodel.fit(X_train, y_train);", u'from sklearn.cross_validation import train_tes...er\nfrom sklearn.metrics import confusion_matrix', u'import warnings\nimport seaborn as sns\n\nwarn...t\n\nget_ipython().magic(u\'matplotlib inline\')', u'X_train.shape', u"params = {'randomforestclassifier__max_depth':...v=3, verbose=True)\nmodel.fit(X_train, y_train);"], 'Out': {4: (716310, 36), 5:                                                 ...                                            False, 8: (716310, 27), 11: (716310, 302), 18: (573048, 301)}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'X_test':                                   REASON Submitt...  0.0          0.0  

[143262 rows x 301 columns], 'X_train':                                    REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], '_': (573048, 301), '_11': (716310, 302), '_18': (573048, 301), '_4': (716310, 36), ...}
   2882             finally:
   2883                 # Reset our crash handler in place
   2884                 sys.excepthook = old_excepthook
   2885         except SystemExit as e:

...........................................................................
/home/ubuntu/311-prediction-times/03 modeling/<ipython-input-19-01dc5a8c4f29> in <module>()
      5 params = {'randomforestclassifier__max_depth': [10],
      6       'randomforestclassifier__min_samples_leaf': [1001], # odd for bin class, to avoid ties, 1% of num_rows for max bound
      7 } # Brad said hyperparams don't matter _that_ much for RF
      8 
      9 model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=3, verbose=True)
---> 10 model.fit(X_train, y_train);
     11 
     12 
     13 
     14 

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_search.py in fit(self=GridSearchCV(cv=3, error_score='raise',
       e...in_score=True,
       scoring=None, verbose=True), X=                                   REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], y=592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool, groups=None)
    940 
    941         groups : array-like, with shape (n_samples,), optional
    942             Group labels for the samples used while splitting the dataset into
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
        self._fit = <bound method GridSearchCV._fit of GridSearchCV(...n_score=True,
       scoring=None, verbose=True)>
        X =                                    REASON Submit...  0.0          0.0  

[573048 rows x 301 columns]
        y = 592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool
        groups = None
        self.param_grid = {'randomforestclassifier__max_depth': [10], 'randomforestclassifier__min_samples_leaf': [1001]}
    946 
    947 
    948 class RandomizedSearchCV(BaseSearchCV):
    949     """Randomized search on hyper parameters.

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_search.py in _fit(self=GridSearchCV(cv=3, error_score='raise',
       e...in_score=True,
       scoring=None, verbose=True), X=                                   REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], y=592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool, groups=None, parameter_iterable=<sklearn.model_selection._search.ParameterGrid object>)
    559                                   fit_params=self.fit_params,
    560                                   return_train_score=self.return_train_score,
    561                                   return_n_test_samples=True,
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
        parameters = undefined
        parameter_iterable = <sklearn.model_selection._search.ParameterGrid object>
    565           for train, test in cv_iter)
    566 
    567         # if one choose to see train score, "out" will contain train score info
    568         if self.return_train_score:

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object <genexpr>>)
    763             if pre_dispatch == "all" or n_jobs == 1:
    764                 # The iterable was consumed all at once by the above for loop.
    765                 # No need to wait for async callbacks to trigger to
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time
    771             self._print('Done %3i out of %3i | elapsed: %s finished',
    772                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Thu Feb  2 02:08:52 2017
PID: 1582                  Python 2.7.12: /home/ubuntu/anaconda2/bin/python
...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (Pipeline(steps=[('randomforestclassifier', Rando...None, verbose=0,
            warm_start=False))]),                                    REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], 592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool, <function _passthrough_scorer>, memmap([191016, 191017, 191018, ..., 573045, 573046, 573047]), memmap([     0,      1,      2, ..., 191013, 191014, 191015]), True, {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__min_samples_leaf': 1001})
        kwargs = {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': True, 'return_times': True, 'return_train_score': True}
        self.items = [(<function _fit_and_score>, (Pipeline(steps=[('randomforestclassifier', Rando...None, verbose=0,
            warm_start=False))]),                                    REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], 592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool, <function _passthrough_scorer>, memmap([191016, 191017, 191018, ..., 573045, 573046, 573047]), memmap([     0,      1,      2, ..., 191013, 191014, 191015]), True, {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__min_samples_leaf': 1001}), {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': True, 'return_times': True, 'return_train_score': True})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=Pipeline(steps=[('randomforestclassifier', Rando...None, verbose=0,
            warm_start=False))]), X=                                   REASON Submit...  0.0          0.0  

[573048 rows x 301 columns], y=592998    False
359981    False
464981    False
...7    False
Name: is_issue_unresolved, dtype: bool, scorer=<function _passthrough_scorer>, train=memmap([191016, 191017, 191018, ..., 573045, 573046, 573047]), test=memmap([     0,      1,      2, ..., 191013, 191014, 191015]), verbose=True, parameters={'randomforestclassifier__max_depth': 10, 'randomforestclassifier__min_samples_leaf': 1001}, fit_params={}, return_train_score=True, return_parameters=True, return_n_test_samples=True, return_times=True, error_score='raise')
    233 
    234     try:
    235         if y_train is None:
    236             estimator.fit(X_train, **fit_params)
    237         else:
--> 238             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method Pipeline.fit of Pipeline(steps=[('...one, verbose=0,
            warm_start=False))])>
        X_train =                                    REASON Submit...  0.0          0.0  

[382032 rows x 301 columns]
        y_train = 355307    False
274437    False
210953    False
...7    False
Name: is_issue_unresolved, dtype: bool
        fit_params = {}
    239 
    240     except Exception as e:
    241         # Note fit time as time until error
    242         fit_time = time.time() - start_time

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/pipeline.py in fit(self=Pipeline(steps=[('randomforestclassifier', Rando...None, verbose=0,
            warm_start=False))]), X=                                   REASON Submit...  0.0          0.0  

[382032 rows x 301 columns], y=355307    False
274437    False
210953    False
...7    False
Name: is_issue_unresolved, dtype: bool, **fit_params={})
    265         self : Pipeline
    266             This estimator
    267         """
    268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
--> 270             self._final_estimator.fit(Xt, y, **fit_params)
        self._final_estimator.fit = <bound method RandomForestClassifier.fit of Rand...e=None, verbose=0,
            warm_start=False)>
        Xt =                                    REASON Submit...  0.0          0.0  

[382032 rows x 301 columns]
        y = 355307    False
274437    False
210953    False
...7    False
Name: is_issue_unresolved, dtype: bool
        fit_params = {}
    271         return self
    272 
    273     def fit_transform(self, X, y=None, **fit_params):
    274         """Fit the model and transform with the final estimator

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0,
            warm_start=False), X=                                   REASON Submit...  0.0          0.0  

[382032 rows x 301 columns], y=355307    False
274437    False
210953    False
...7    False
Name: is_issue_unresolved, dtype: bool, sample_weight=None)
    242         -------
    243         self : object
    244             Returns self.
    245         """
    246         # Validate or convert input data
--> 247         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
        X =                                    REASON Submit...  0.0          0.0  

[382032 rows x 301 columns]
    248         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    249         if issparse(X):
    250             # Pre-sort indices to avoid that each individual tree of the
    251             # ensemble sorts the indices.

...........................................................................
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py in check_array(array=                                   REASON Submit...  0.0          0.0  

[382032 rows x 301 columns], accept_sparse=['csc'], dtype=<type 'numpy.float32'>, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator=None)
    377 
    378     if sp.issparse(array):
    379         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
        array =                                    REASON Submit...  0.0          0.0  

[382032 rows x 301 columns]
        dtype = <type 'numpy.float32'>
        order = None
        copy = False
    383 
    384         if ensure_2d:
    385             if array.ndim == 1:
    386                 if ensure_min_samples >= 2:

ValueError: could not convert string to float: Highway Maintenance
___________________________________________________________________________

Let's look at the performance of our best model.

In [None]:
y_predict = model.predict(X_test.head())
y_predict

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
feature_importances = pd.np.argsort(model.feature_importances_)
print "12: top five:", list(df.columns[feature_importances[-1:-6:-1]])

In [None]:
# 13. Calculate the standard deviation for feature importances across all trees

n = 10 # top 10 features

importances = model.feature_importances_[:n]
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(n):
    print("%d. %s (%f)" % (f + 1, features[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices], yerr=std[indices], color="r", align="center")
plt.xticks(range(10), indices)
plt.xlim([-1, 10])
plt.show()

In [None]:
pd.DataFrame(model.cv_results)

If my best model included 40 trees, was that enough?

In [None]:
model.cv_results_['params']

Let's make variable importance graph.

A downside of this graph is that highly correlated features will split importance.

## Using Logistic Regression with L2 Lasso Regularization

which is my favorite way to do feature subset selection at the moment :)

There are probably issues when features are co-linear. I'll need to read up on this.

In [None]:
# http://nbviewer.jupyter.org/github/JWarmenhoven/ISLR-python/blob/master/Notebooks/Chapter%206.ipynb#6.6.2-The-Lasso
from sklearn.linear_model import Lasso

In [None]:
lasso = Lasso(max_iter=10000)
coefs = []

for a in alphas*2:
    lasso.set_params(alpha=a)
    lasso.fit(scale(X_train), y_train)
    coefs.append(lasso.coef_)

ax = plt.gca()
ax.plot(alphas*2, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization');