# Every Data Science Problem is an Optimization Problem: and other lessons from the trenches.

## Speaker: Brad Null, Chief Scientist, Reputation.com

- The field of Data Science is still quite nascent. We are all learning as we go, and not only are we far from a generalized understanding of Data Science best practices, we don't even have a consensus on what a Data Scientist is and does. But we've learned a few things so far, and in this talk I want to share some guiding principles I've come to live by as a Data Scientist and a few of the learning experiences that have crystallized these principles. The title of the talk refers to one of these principles ("every data science problem is an optimization problem"): figuring out the right objective function (hint: it is probably not minimizing mean squared error) can have massive consequences for how you structure and approach that problem.
- Brad Null is the Chief Scientist and head of the Data Science team at Reputation.com, the leader in online reputation management. His team is responsible for analyzing a unique corpus of online and offline information (e.g. reviews, surveys, social, and other content) about companies in over 80 industries and determining what that means to those companies and how they should respond. He is also the founder of Voodoo Sports, where he assembled a team to build and maintain cutting edge sports data products leveraging his 10+ years of experience in sports analytics, including his PhD thesis "Stochastic Modeling and Optimization in Baseball." 

## Intro

- Data Institute Conference website up!  October 2017
- Brad Null, Reputation.com
    - PhD: Stochastic Modeling and Optimization in Baseball

## Presentation

- ML = Advanced Algorithms
    - Examples
        - Support Vector Machines
        - Regularized Regression
        - Random Forest
        - Boosted Trees
    - Being Commoditized
        - How good those models are => another talk
    - Easy to implement
    - Fundamental building blocks for data scientists right now: "scikit-learn, TensorFlow, NLTK"
- ML = Revolutionary
    - "Solving" big problems
    - At the doorstep of "true" AI
- ML = Overhyped (the counterpoint)
    - Making progress on well-defined problems
    - BUT
        - None of the problems he works on are well-defined
    - If you think your problem is well-defined, question if you really understand the problem
    - Lots of important problems we are still stuck on
        - Determining the value of each user
        - Predicting value-add of each feature or asset
- What is DS?
    - ML commonly assoc'd w/ Modeling
    - ML not yet good at
        - BO: Business Objectives
        - PF: Problem Formulation
    - Need to get BO/PF right
- I. Every Problem is an Optimization Problem
    - Another definition of DS
        - "We solve problems"
        - Bring math and data to bear on real-world problems
    - That's our first job
        - Formulate a problem given BO
        - Objective Function is the most important part of the problem (Dantzig)
    - Example: Acquiring Baseball Players
        - How to value baseball players in order to draft the best ones for your team (or fantasy team)
        - Problem: The winner's curse
        - Issue: You don't get the players you valued the most => you get the ones you value more than anyone else does
        - You aren't trying to predict player performance => you are trying to find undervalued assets
        - **What to do**
            - Residual analysis => Examine bias in your model
            - Fix your objective function
            - Reverse engineer the market => Understand what everyone else is doing and how to improve on it
    - Example: March Madness Optimization
        - Problem: Want to win a March Madness pool
        - Approach #1: Predict who will win each game
        - Issue: The goal is not to maximize expected value (EV), but to maximize your chances of winning your pool
    - Example: Ad
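
The winner's curse from the baseball example above can be illustrated with a small simulation. This is a minimal sketch, not the speaker's model: it assumes every team forms an unbiased estimate of a player's value with independent Gaussian noise, and that the team with the highest estimate acquires the player. The function name `simulate_winners_curse` and all parameter values are hypothetical.

```python
import random
import statistics

def simulate_winners_curse(n_teams=10, n_players=2000, noise_sd=1.0, seed=0):
    """Return (mean error of all estimates, mean error of the winning estimates)."""
    rng = random.Random(seed)
    all_errors = []
    winning_errors = []
    for _ in range(n_players):
        true_value = rng.gauss(0.0, 1.0)
        # Assumption: every team's estimate is unbiased but noisy.
        estimates = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_teams)]
        all_errors.extend(e - true_value for e in estimates)
        # Assumption: the team with the highest estimate acquires the player.
        winning_errors.append(max(estimates) - true_value)
    return statistics.mean(all_errors), statistics.mean(winning_errors)

avg_error, winner_error = simulate_winners_curse()
print(f"mean error, all estimates:     {avg_error:+.3f}")
print(f"mean error, winning estimates: {winner_error:+.3f}")
```

Each estimate is unbiased on its own, yet the winning estimates are biased upward because winning selects for optimistic errors. That is why a model tuned only for prediction accuracy can still lose money in a competitive market, and why the objective should target undervalued assets instead.
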
- II. Everything is Stochastic
    - Point estimates are over-rated
    - Anywhere you compute a point estimate you can compute a posterior distribution
    - Example: MLB Player Valuation
        - ** Add slide notes**
    - Things change
    - Example: MLB Park Factors and League Averages
        - Important components of a predictive model of baseball
        - Can change regularly (HR rate up 30% last 2 years)
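
The point about replacing point estimates with posterior distributions can be sketched with a Beta-Binomial model of batting average. This is an illustrative assumption, not the valuation model from the talk: the prior `Beta(27, 73)` (mean .270) and the function name `batting_average_posterior` are hypothetical choices made for the example.

```python
import random

def batting_average_posterior(hits, at_bats, prior_a=27.0, prior_b=73.0,
                              n_samples=20000, seed=0):
    """Beta-Binomial update: return (posterior mean, 90% credible interval)."""
    rng = random.Random(seed)
    # Conjugate update: add hits to a, outs to b (prior values are hypothetical).
    a = prior_a + hits
    b = prior_b + (at_bats - hits)
    # Approximate the 5th and 95th percentiles by sampling the posterior.
    samples = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    lower = samples[int(0.05 * n_samples)]
    upper = samples[int(0.95 * n_samples)]
    return a / (a + b), (lower, upper)

# Two players with the same raw average (.300) but very different sample sizes.
small_mean, small_interval = batting_average_posterior(hits=9, at_bats=30)
large_mean, large_interval = batting_average_posterior(hits=150, at_bats=500)
print(f"30 AB:  mean {small_mean:.3f}, 90% interval {small_interval}")
print(f"500 AB: mean {large_mean:.3f}, 90% interval {large_interval}")
```

Both players have the same raw point estimate, but the posterior for the 30 at-bat player is much wider and sits closer to the prior. A decision that depends on upside or downside risk needs that width, which a single number hides.
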
- III. Fail Fast
    - Entrenched entrepreneurship / consumer product mantra
        - Don't know what will resonate
        - Build Minimally Viable Product (MVP)
        - Market Test
        - Learn
        - Iterate
        - Repeat
    - Most steps may fail to advance key metrics
        - You will learn along the way and move forward in big chunks
    - Same approach works for DS
        - Models are limitless
        - Build MVP
        - Evaluate
            - How good is the solution?
            - Is it good enough? What is good enough?
            - What variables are missing? (Residual analysis)
        - What improvement will provide the maximum ROI
            - Direction of steepest ascent (Optimization!)
        - Build, eval, explore
        - Rinse, wash, repeat
    - **A lot of the things we try fail**
        - The faster we move, the more we try and the more we learn along the way
    - So many things you can learn at each step
        - How accurate is your solution
            - Getting first model out there is important => You need a baseline
            - If you don't know how to approach the model, you don't understand the problem
            - Even trying the simplest approach first and failing, but learning fast, is valuable
    - We are not just prioritizing the next step wrt this problem, but the next problem to work on
    - Especially useful for PhDs
    - Not just iterate fast but iterate simple
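
The build, evaluate, explore loop above can be sketched in a few lines. This is a toy illustration on synthetic data, assuming a hypothetical `segment` variable that the first model ignores: the baseline predicts the global mean, residual analysis shows what is missing, and the next iteration adds only that one variable.

```python
import random
import statistics
from collections import defaultdict

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

rng = random.Random(0)
# Synthetic data (hypothetical): the outcome depends on a segment label.
rows = []
for _ in range(1000):
    segment = rng.choice(["a", "b"])
    y = (2.0 if segment == "a" else 5.0) + rng.gauss(0.0, 1.0)
    rows.append((segment, y))

# Step 1: build the MVP. The simplest baseline predicts the global mean.
baseline = statistics.mean(y for _, y in rows)
residuals = [(seg, y - baseline) for seg, y in rows]
baseline_rmse = rmse([r for _, r in residuals])

# Step 2: evaluate. Group residuals by a candidate variable to look for bias.
by_segment = defaultdict(list)
for seg, r in residuals:
    by_segment[seg].append(r)
mean_resid = {seg: statistics.mean(rs) for seg, rs in by_segment.items()}

# Step 3: iterate. Add the variable the residuals point to, then re-evaluate.
segment_means = {seg: baseline + m for seg, m in mean_resid.items()}
improved_rmse = rmse([y - segment_means[seg] for seg, y in rows])

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"mean residual by segment: {mean_resid}")
print(f"improved RMSE: {improved_rmse:.3f}")
```

The baseline gives a number to beat, and the per-segment residuals show which single change is likely to pay off most, which is the "direction of steepest ascent" idea applied to model development.
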
- IV. Everything is Connected
    - Counterbalance to "Fail Fast"
        - Don't want simplistic models
        - Want foundational models
    - Example: Hierarchical stochastic model of baseball
        - Graphic
        - By understanding the whole system, we can
            - Predict real-time/ in-game
            - Optimize on every level
            
- http://reputation.com/careers
    - Operational insights: manage online presence for companies
    - NLP classification
    - ROI analysis
- Q: Fail Fast
    - As you're getting more complex, you need to add more value
    - If you're not making any progress on a problem, move on to another problem
    - If you're not convinced that there's something there, there might not be value => do other problems
- Q: Does size/amount of data matter?
    - Can work on problems with any amount of data
    - No minimum requirement needed

```python
from graphviz import Digraph

dot = Digraph(comment='Test')
dot.node('A', 'Hello')
print(dot.source)
```

```
// Test
digraph {
	A [label=Hello]
}
```

```python
dot.render()
```

```
RuntimeError: failed to execute ['dot', '-Tpdf', '-O', 'Digraph.gv'], make sure the Graphviz executables are on your systems' path
```