##### Machine Learning with scikit-learn
**Andreas Mueller** *NYU Center for Data Science, scikit-learn*

http://bit.ly/sklodsc

- big list of stuff it can do
- current version 0.17
- useful stuff:
    - train_test_split
    - are .fit/.predict/.score the standard methods on models?
        - yes! examples used: LinearSVC, RandomForestClassifier
    - PCA is pretty easy to run
    - preprocessing:
        - useful stuff for scaling, etc.
        - .transform
    - cross validation
        - takes a method, data, parameters
        - can do stratified cross validation
        - ShuffleSplit
            - draws an independent random train/test split on each iteration, so test sets can overlap (k-fold instead partitions the data)
        - n_jobs for parallelization
        - scoring
    - alternative scoring
        - F1 / average_precision / roc_auc
        - can multiple scoring be done?
    - grid search
        - GridSearchCV
        - pass a dict with the parameter values
        - after instantiation, acts like other classifiers (fit/predict/score)
            - .best_params_ then available
    - pipelines
        - hook multiple things together
        - such as scaler & classifier
        - can be used with grid search
            - special for the param grid
                - 'step\_\_param' to set which parameters to modify
    - Vectorizers
        - for feature creation
        - CountVectorizer
            - for, eg, bag of words
            - includes ngram_range parameter
        - TfidfVectorizer
        - HashingVectorizer
            - sparse representation, no vocabulary
            - useful for streaming
            - downsides: hash collisions, and less model transparency (features cannot be mapped back to words)
    - feature unions
        - essentially a way to extract features in different ways
    - out of core learning
        - partial_fit over batches
        
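A minimal sketch of the pipeline + grid search pattern from the notes above. It assumes a recent scikit-learn, where `train_test_split` and `GridSearchCV` live in `sklearn.model_selection` (in 0.17 they were in `sklearn.cross_validation` and `sklearn.grid_search`). The iris data, the step names `scaler` and `svm`, and the `C` values are illustrative choices, not from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# A pipeline hooks a scaler and a classifier together into one estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", LinearSVC(max_iter=10000)),
])

# Keys use the 'step__param' form: step name, two underscores, parameter name.
param_grid = {"svm__C": [0.01, 0.1, 1, 10]}

# n_jobs=-1 would use all cores; 1 keeps the example single-process.
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=1)
grid.fit(X_train, y_train)  # cross-validates each candidate, then refits the best

print(grid.best_params_)
print(grid.score(X_test, y_test))
```

After fitting, `grid` behaves like any other classifier, so `grid.predict(X_test)` also works.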

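The HashingVectorizer and out-of-core bullets combine naturally; here is a rough sketch. The tiny `batches` list is a made-up stand-in for a stream too large for memory, and `SGDClassifier` is just one estimator that supports `partial_fit`.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy stream of (texts, labels) batches.
batches = [
    (["great talk", "loved the demo"], [1, 1]),
    (["boring slides", "terrible audio"], [0, 0]),
    (["great demo", "terrible talk"], [1, 0]),
]

# Stateless: there is no vocabulary to fit, so each batch is transformed on its own.
vectorizer = HashingVectorizer(n_features=2**10, ngram_range=(1, 2))
clf = SGDClassifier(random_state=0)

for texts, labels in batches:
    X_batch = vectorizer.transform(texts)  # sparse matrix with a fixed number of columns
    # classes is required on the first call, since one batch may not contain every label.
    clf.partial_fit(X_batch, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["great slides"]))
print(pred)
```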
##### Performance Pandas
###### OR *stuff i really should've already known about pandas*
**Jeff Reback** *Continuum*

http://tiny.cc/pandaspydata

- the beeradvocate data is no longer available
    - probably for the reasons ben fields said yesterday
- types:
    - be careful about data types!
- formats:
    - speed by format
        - excel is slow
        - sql is better
        - json is mid
        - hdf5 is good
        - pickle and msgpack are great
    - text data: json is a good idea
    - numeric data: json is not so good
    - odo library provides a lot of good conversions
- data:
    - examining data: 
        - the .str accessor for text data: provides a lot of useful functions that would probably make my life better
        - *(tab completion in notebook)*
    - datetime64s are better than dates for pandas
        - dates are not first class objects in pandas (datetimes are)
    - if you are iterating, you should be vectorizing instead
    - categoricals & .cat accessor
        - should be converting some new fields to categoricals
        - maps to integers
    - objects are big
        - storing objects is not ideal
    - *(select_dtypes)*
    - *(.info)*
- indexing
    - use .loc
        - you can provide an indexer and the columns you actually want
        - it's just more explicit & efficient
    - .contains on categorical names
        - convert to string: not so good
        - method chain: 
            - use .cat.categories.str.contains to get the categories
            - then use those categories with .isin
            - use that output as the indexer
    - .iloc is purely positional indexer
    - multiindexing
        - represented by tuples
        - pd.IndexSlice is something to know about
            - to index by multiple levels
            - you can use : to not filter a particular level
        - .query takes strings
            - sql-like syntax
- grouping
    - steps: split, apply, combine
    - df.groupby takes a grouper
        - it can be a series, a mapping, a value...
    - .get_group is much more efficient than a selector
    - apply, things like .agg
        - a user function can be slow!
        - can use selectors, or on a single column
    - can chain group operations as well
    - .agg can be used for mean, stdev, etc.?
    - combine:
        - stack
    - can group by multiple things
    - .agg : aggregates one per group
    - .transform : output has the same shape
        - df.groupby.transform
    - .apply : anything goes
    - don't groupby a groupby
- tidying data
    - each variable should form a column
    - each observation should form a row
    - each type of observational unit should form a table
    - .assign method:
        - copies data, assigns a new column, returns
    - .drop method:
        - removes columns
    - .melt
        - can use to change wide data to long data
        - eg, create multiple new rows from one original
    - .pivot
        - inverse of melt
    - .pipe
        - new thing: lets you call your own functions inside a method chain
- datetimes
    - Grouper(key='blah', freq='D')
    - .set_index('timecol')[datetimestring\:datetimestring]
        - the \: *must* be present
    - timezones are better in 0.17
    - .rolling_mean (!) 
- other libraries
- numba & cython
    - pandas plays nicely with these two
    - cython:
        - code can be pretty similar
        - needs allocation
    - numba:
        - numba can do the memory allocation for you 
        - code just needs @jit
        - future support for numba will be even easier
- dask
    - dask.dataframe.from_pandas to get a dask dataframe from a pandas one
    - frequently will *just work*

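A small sketch of two patterns from the notes above: selecting rows by substring on a categorical column without converting it to strings, and `.agg` versus `.transform` on a groupby. The frame and its column names (`style`, `rating`) are invented for illustration, and `observed=True` assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "style": pd.Categorical(["IPA", "Stout", "Double IPA", "Lager", "IPA"]),
    "rating": [4.0, 3.5, 4.5, 3.0, 5.0],
})

# Search the few categories rather than every row, then select rows with .isin.
cats = df["style"].cat.categories
matching = cats[cats.str.contains("IPA")]
ipas = df.loc[df["style"].isin(matching), ["style", "rating"]]

# .agg returns one row per group.
per_style = df.groupby("style", observed=True)["rating"].agg(["mean", "count"])

# .transform returns one value per original row, so it can become a new column.
df = df.assign(
    style_mean=df.groupby("style", observed=True)["rating"].transform("mean")
)

print(ipas)
print(per_style)
print(df)
```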
##### One of these things is not like the others. Automatically detecting outliers.
**HOMIN LEE** *Datadog*

- problem domain for datadog is, of course, monitoring
- outliers vs anomalies
- MAD (median absolute deviation)
    - $$\mathrm{MAD}(D) = \mathrm{median}(\{\, |d_i - \mathrm{median}(D)| \,\})$$
    - tolerance & percent for a full time series being an outlier
- DBSCAN
    - again, epsilon & min_samples
- choice: care about the shapes being similar or alignment?
    - DBSCAN is likely to trigger anomaly on bad alignment
- hidden parameter: window size

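A rough sketch of how the MAD formula plus the tolerance and percent parameters could fit together. This is my reading of the notes, not Datadog's implementation: pooling all series to get one median, and the host names and values, are assumptions made for the example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median(|d_i - median(D)|)."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def outlier_series(series_by_host, tolerance=3.0, pct=0.5):
    """Flag hosts where more than `pct` of points lie beyond tolerance * MAD.

    Assumption: the median and MAD are taken over all hosts' points pooled together.
    """
    pooled = [v for s in series_by_host.values() for v in s]
    center, spread = median(pooled), mad(pooled)
    flagged = []
    for host, s in series_by_host.items():
        far = sum(abs(v - center) > tolerance * spread for v in s)
        if far / len(s) > pct:
            flagged.append(host)
    return flagged


# Hypothetical metrics for three hosts over one window.
metrics = {
    "web-1": [10, 11, 10, 12, 11],
    "web-2": [11, 10, 12, 11, 10],
    "web-3": [10, 30, 31, 29, 32],
}
print(outlier_series(metrics))  # → ['web-3']
```

The length of each list plays the role of the hidden window-size parameter: a longer window dilutes short spikes.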
##### The Art and Science of Data Matching
**MIKE MULL** 

https://github.com/mikemull/Notebooks/blob/master/PyDataNYCSlides.ipynb

Approximate string comparisons:
- Python modules:
    - NLTK
    - Difflib
    - Jellyfish
    - more
- Algorithms
    - Jaro
    - JaroWinkler
        - quite valuable algorithm
    - Jaccard
    - SoftTfIdf
        - uses TF-IDF with other measures

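Two of the comparison styles above can be tried with the standard library alone. This sketch uses `difflib.SequenceMatcher` for an edit-style similarity and a hand-rolled Jaccard index over character bigrams; the choice of bigrams and lowercasing is an assumption for the example, not something specified in the talk.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in [0, 1] from difflib's matching-blocks heuristic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def bigrams(s):
    """Set of overlapping two-character substrings."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


print(ratio("Jon Smith", "John Smith"))
print(jaccard("night", "nacht"))
```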
##### Getting Started in Computational Sociology with Python
**TARA ADISESHAN** *Coral Project*

- Collaboration between NYT, Mozilla, Washington Post
    - https://coralproject.net
    - 'don't read the comments'\: but community is important
    - quite new project
- what happens when journalists get involved in comments?
    - potential follow ups, better engagement, less toxicity
    - hard to fix:
        - trolls
        - drive-bys (no real discussion)
        - low quality contributions
        - power-users
        - hostile
        - scaling problems
- so how to help:
    - detect toxic behavior
    - highlight great discussion
    - design inclusive communities
    - deal with scale
- what kind of questions?

- agent-based modeling
    - flocking / craig reynolds / boids
    - schelling's model: (housing)
        - small preferences in who you want to be surrounded by causes big effects
    - abm + python:
        - mesa
        - pyabm
        - pycx.sourceforge.net
        - ...
- 'algorithmic accountability'
    - risks: 
        - grasp of language
        - reinforcing biases
    - possibilities:
        - sentiment analysis
        - relevance to article
        - topic modeling / hand-crafted? generated?
- silly things
    - http://haternews.herokuapp.com/
- lada adamic
    - interesting person to look up 
    
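Schelling's model from the agent-based modeling notes is small enough to sketch in plain Python without an ABM library. The grid size, the 0.4 threshold, the wrap-around neighbourhood, and the move-to-a-random-empty-cell rule are all assumptions chosen to keep the example short.

```python
import random


def like_fraction(grid, r, c):
    """Fraction of occupied neighbours (8 cells, wrapping edges) with the same type."""
    n = len(grid)
    me, same, occupied = grid[r][c], 0, 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            other = grid[(r + dr) % n][(c + dc) % n]
            if other is not None:
                occupied += 1
                same += other == me
    return same / occupied if occupied else 1.0


def step(grid, threshold, rng):
    """Sweep the grid, moving each unhappy agent to a random empty cell."""
    n = len(grid)
    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    if not empties:
        return 0
    moved = 0
    for r in range(n):
        for c in range(n):
            if grid[r][c] is not None and like_fraction(grid, r, c) < threshold:
                i = rng.randrange(len(empties))
                er, ec = empties[i]
                grid[er][ec], grid[r][c] = grid[r][c], None
                empties[i] = (r, c)  # the vacated cell is now empty
                moved += 1
    return moved


def mean_like(grid):
    """Average like-neighbour fraction over all agents."""
    n = len(grid)
    vals = [like_fraction(grid, r, c)
            for r in range(n) for c in range(n) if grid[r][c] is not None]
    return sum(vals) / len(vals)


rng = random.Random(0)
n = 20
cells = ["A"] * 180 + ["B"] * 180 + [None] * 40
rng.shuffle(cells)
grid = [cells[i * n:(i + 1) * n] for i in range(n)]

before = mean_like(grid)
for _ in range(30):
    if step(grid, threshold=0.4, rng=rng) == 0:
        break  # everyone is satisfied
after = mean_like(grid)
print(before, after)
```

Comparing `before` and `after` shows how much clustering a mild individual preference produces.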

##### Dask - Parallelizing NumPy/Pandas through Task Scheduling
**Jim Crist**

http://github.com/jcrist/Dask_PyData_NYC

- Big data sets
    - how to deal?
- Blocked algorithms
    - blocked mean as working example
        - breaking data into chunks & reducing at the end
    - get you parallelism & less memory usage
    - if you can figure out how to break it out into blocks
- Dask
    - parallel computing framework
    - leverages python ecosystem
    - blocked algorithms & scheduling
    - pure python
    - collections
        - array
            - out of core, parallel, n-dimensional array library
            - copies numpy interface
        - bag
            - map, filter, reduce, etc.
            - example on http://blaze.pydata.org/blog/2015/09/08/reddit-comments/
        - dataframe
            - from pandas: dd.from_pandas
    - visualizes the computations!
    - collections build task graphs
        - these are passed on to schedulers
        - the graph:
            - dictionary of name: task
            - tasks are tuples of (func, args...)
            - args can be names, values, or tasks
        - some things cannot be encoded
            - can create graphs directly and pass to schedulers
    - dask.imperative
        - decorate functions with @do to build up
    - distributed scheduler
        - in the works
        - http://distributed.readthedocs.org/en/latest
    
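To make the graph format concrete, here is the blocked mean written as a dictionary of name: task, with a toy recursive evaluator. The `get` function below is a stand-in written for this example, not dask's scheduler: it runs serially, does no caching, and only treats strings as names.

```python
from operator import truediv


def _eval(graph, arg):
    # A task is a tuple of (func, args...); evaluate the args, then call func.
    if isinstance(arg, tuple) and arg and callable(arg[0]):
        func, *args = arg
        return func(*(_eval(graph, a) for a in args))
    # Lists are walked so they can hold names or tasks.
    if isinstance(arg, list):
        return [_eval(graph, a) for a in arg]
    # A name refers to another entry in the graph.
    if isinstance(arg, str) and arg in graph:
        return _eval(graph, graph[arg])
    # Anything else is a plain value.
    return arg


def get(graph, key):
    """Toy serial evaluator for a {name: task} graph."""
    return _eval(graph, graph[key])


# Blocked mean: per-chunk partial sums and counts, reduced at the end.
chunks = [[1, 2, 3], [4, 5], [6]]
graph = {}
for i, chunk in enumerate(chunks):
    graph[f"chunk-{i}"] = chunk
    graph[f"sum-{i}"] = (sum, f"chunk-{i}")
    graph[f"len-{i}"] = (len, f"chunk-{i}")
graph["total"] = (sum, [f"sum-{i}" for i in range(len(chunks))])
graph["count"] = (sum, [f"len-{i}" for i in range(len(chunks))])
graph["mean"] = (truediv, "total", "count")

print(get(graph, "mean"))  # → 3.5
```

The per-chunk `sum-i` and `len-i` tasks do not depend on each other, which is what lets a real scheduler run them in parallel.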

##### Using Your Powers For Good
**Peter Bull** *DrivenData* @drivendataorg

- datascience in the social sector
    - not many data science people
    - not much money to pay them in nonprofits
- how to get involved
    - join data for good organizations
        - datakind
        - code for america
        

