Commit

Lots more images
feaselkl committed Sep 29, 2018
1 parent aa08b07 commit 1f88f9d
Showing 30 changed files with 29 additions and 29 deletions.
58 changes: 29 additions & 29 deletions PITCHME.md
@@ -224,7 +224,7 @@ Data processing is made up of a few different activities:
* Data Cleansing
* Data Analysis

---
---?image=presentation/assets/background/collection.jpg&size=cover&opacity=15

### Data Gathering

@@ -235,7 +235,7 @@ Data gathering will likely be an iterative process; as you flesh out your models
* Paid APIs or data sources from third parties
* Survey data

---
---?image=presentation/assets/background/map.jpg&size=cover&opacity=15

### Data Gathering Example

@@ -275,15 +275,15 @@ Most of your time, you'll be a data plumber.

![Data Plumber](presentation/assets/image/SuperMario.jpg)

---
---?image=presentation/assets/background/2_0_cleaning.jpg&size=cover&opacity=15

### Data Cleansing

A common estimate you will hear from data scientists is that they spend approximately 80% of their time cleaning data. If anything, this is an underestimate--based on my experience, that number might be closer to 90%.

Simply getting the data is a start, but there's a long journey ahead.

---
---?image=presentation/assets/background/4_3_datatypes.jpg&size=cover&opacity=15

### Data Cleansing

@@ -293,7 +293,7 @@ After grabbing relevant-looking data sets, you will want to join them together t
* Defining join criteria (because there is no obvious natural join key)
* Reshaping data to fit join criteria

---
---?image=presentation/assets/background/4_1_dataquality.jpg&size=cover&opacity=15

### Data Cleansing

@@ -305,7 +305,7 @@ You will quickly find problems with your data sets, including (but not limited t
* Data inconsistencies: records conflicting with other records
* Misshapen flat files

---
---?image=presentation/assets/background/mismatch.jpg&size=cover&opacity=15

### Data Cleansing - Mismatches

@@ -315,7 +315,7 @@ You will quickly find problems with your data sets, including (but not limited t

**Incorrect** data: when data other than the label is incorrect. Ex: a person works 200 hours per week?
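
The 200-hours-per-week case can be caught with a simple range check, since a week only has 168 hours. A minimal sketch, assuming hypothetical records with an `hours_per_week` field (the field name and sample rows are illustrative, not from the talk):

```python
HOURS_IN_WEEK = 7 * 24  # 168; nobody can work more hours than this


def find_impossible_hours(records):
    """Return records whose weekly hours fall outside the possible range."""
    return [
        rec for rec in records
        if not 0 <= rec["hours_per_week"] <= HOURS_IN_WEEK
    ]


# Hypothetical sample data for illustration only.
records = [
    {"name": "A", "hours_per_week": 40},
    {"name": "B", "hours_per_week": 200},
]
suspect = find_impossible_hours(records)
print(suspect)
```

A check like this only flags values that are impossible; plausible-but-wrong values (say, 60 recorded instead of 40) need domain knowledge or a second source to catch.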

---
---?image=presentation/assets/background/hole.jpg&size=cover&opacity=15

### Data Cleansing - Missing Data

@@ -328,7 +328,7 @@ People don't always fill out the entirety of every form. When we're missing impo

None of these options is perfect, but the last three can help salvage incomplete records.

---
---?image=presentation/assets/background/duplicates.jpg&size=cover&opacity=15

### Data Cleansing - Duplicates

@@ -342,15 +342,15 @@ Independent systems may end up with inconsistent data due to reasons like typos,
* Make one data set canonical
* Institute rules (pick the lower number, pick the later date, etc.)
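
A rule such as "pick the later date" can be applied mechanically once conflicting records share a key. A minimal sketch, assuming hypothetical records with `id`, `value`, and `updated` fields (none of these names come from the talk):

```python
from datetime import date


def resolve_conflicts(records):
    """Keep one record per id, preferring the most recent `updated` date."""
    resolved = {}
    for rec in records:
        current = resolved.get(rec["id"])
        if current is None or rec["updated"] > current["updated"]:
            resolved[rec["id"]] = rec
    return resolved


# Hypothetical records from two independent systems.
records = [
    {"id": 1, "value": 10, "updated": date(2018, 9, 1)},
    {"id": 1, "value": 12, "updated": date(2018, 9, 15)},
    {"id": 2, "value": 7, "updated": date(2018, 8, 30)},
]
resolved = resolve_conflicts(records)
print(resolved[1]["value"])
```

If two records carry the same date, this sketch keeps whichever it saw first; a real rule set would need an explicit tie-breaker.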

---
---?image=presentation/assets/background/misshapen.jpg&size=cover&opacity=15

### Misshapen Data

Data stored in flat files or textual format can end up misshapen--some rows may not have enough delimiters (or maybe too many), there could be newlines in the middle of a record, or the file cuts off in the middle of a record.

This is a problem with flat files and certain semi-structured data formats. It is not a problem with relational databases, where data shape is enforced.
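
One way to spot rows with too few or too many delimiters is to compare each row's field count against the header's. A minimal sketch using Python's standard `csv` module; the sample text and the `find_misshapen_rows` helper are hypothetical:

```python
import csv
import io


def find_misshapen_rows(text, delimiter=","):
    """Return (row_number, field_count) for rows whose width differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = len(next(reader))  # the header defines the expected width
    return [
        (row_number, len(row))
        for row_number, row in enumerate(reader, start=2)
        if len(row) != expected
    ]


# Hypothetical file contents: row 3 is short, row 4 has an extra field.
sample = "id,name,city\n1,Ann,Durham\n2,Bob\n3,Cy,Raleigh,extra\n"
bad_rows = find_misshapen_rows(sample)
print(bad_rows)
```

This finds width problems only. An unquoted newline in the middle of a record shows up as two short rows, and a file that cuts off mid-record usually shows up as a short final row.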

---
---?image=presentation/assets/background/shapes.jpg&size=cover&opacity=15

### Data Shaping

@@ -365,7 +365,7 @@ There are several techniques we can use to reshape data to make it easier to ana

### Demo Time

---
---?image=presentation/assets/background/exploration.jpg&size=cover&opacity=15

### Data Analysis

@@ -444,7 +444,7 @@ Here we have two comparisons, depth vs table and x vs y. Depth and table are mil
![Microsoft Team Data Science Process](presentation/assets/image/tdsp-lifecycle2_modeling.png)
@divend

---
---?image=presentation/assets/background/model.jpg&size=cover&opacity=15

### Modeling

@@ -456,7 +456,7 @@ Modeling has five major steps:
* Model Evaluation
* Model Tuning

---
---?image=presentation/assets/background/engineering.jpg&size=cover&opacity=15

### Feature Engineering

@@ -468,7 +468,7 @@ Feature engineering involves creating relevant features from raw data.  Examples
* Aggregating data (by day, by hour, etc.)
* Text processing -- turning words into arbitrary numbers for numeric analysis (TF-IDF, Word2Vec)

---
---?image=presentation/assets/background/selection.jpg&size=cover&opacity=15

### Feature Selection

@@ -486,7 +486,7 @@ We use feature selection to winnow down the available set of features.  There ar
![My favorite example of spurious correlation](presentation/assets/image/SpuriousCorrelation.png)
(<a href="http://www.tylervigen.com/spurious-correlations">Source</a>)

---
---?image=presentation/assets/background/train.jpg&size=cover&opacity=15

### Model Training

@@ -498,7 +498,7 @@ There are four major branches of algorithms:
* Self-supervised learning
* Reinforcement learning

---
---?image=presentation/assets/background/supervision.jpg&size=cover&opacity=15

### Supervised Learning

@@ -508,7 +508,7 @@ Supervised learning models require known answers (labels). We train a model to m
* Classification -- Which?
* Recommendation -- What next?

---
---?image=presentation/assets/background/cluster.jpg&size=cover&opacity=15

### Unsupervised Learning

@@ -517,7 +517,7 @@ With unsupervised learning, we do not know the answers beforehand and try to der
* Clustering -- How can we segment?
* Dimensionality reduction -- What of this data is useful?

---
---?image=presentation/assets/background/book.jpg&size=cover&opacity=15

### Self-Supervised Learning

@@ -537,7 +537,7 @@ Self-supervised learning typically happens with neural networks.
Reinforcement learning is where we train an agent to observe its environment and use those environmental clues to make a decision.
@divend

---
---?image=presentation/assets/background/greenscreen.jpg&size=cover&opacity=15

### Choose An Algorithm

@@ -565,29 +565,29 @@ Once you understand the nature of the problem, you can choose among viable algor
Once you have an algorithm, features, and labels (if supervised), you can train the algorithm. Training a model is solving a system of equations, minimizing a loss function.
@divend
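
To make "minimizing a loss function" concrete, here is a minimal sketch that fits a straight line by gradient descent on mean squared error. The choice of model, the learning rate, and the epoch count are illustrative assumptions; the talk does not prescribe a particular algorithm.

```python
def fit_line(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point under the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step downhill along the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Here `xs` are the features and `ys` are the labels; the loop nudges the parameters until the loss stops shrinking.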

---
---?image=presentation/assets/background/fitting.jpg&size=cover&opacity=15

### Validate The Model

Instead of using up all of our data for training, we typically want to perform some level of validation within our training data set to ensure that we are on the right track and are not overfitting.

Overfitting happens when a model latches on to the particulars of a data set, leaving it unable to generalize to new data. To test for overfitting, test your model against unseen data. If there is a big dropoff in model accuracy between training and testing data, you are likely overfitting.
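
The train-versus-test comparison can be sketched with a deliberately overfit model: one that memorizes its training rows and falls back to a default guess for anything new. Everything here (the helper names, the 75/25 split, the toy labels) is a hypothetical illustration:

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out the last fraction for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def accuracy(predict, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == label for x, label in rows) / len(rows)


def memorizing_model(train):
    """An overfit model: exact lookup of training inputs, else guess 0."""
    lookup = dict(train)
    return lambda x: lookup.get(x, 0)


# Toy data: the label is whether x is odd, which a lookup table cannot generalize.
rows = [(x, x % 2) for x in range(40)]
train, test = train_test_split(rows)
model = memorizing_model(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

Training accuracy is perfect because every training input is in the lookup table. On unseen rows the model can only guess, so test accuracy should land near chance, and that drop is the signal to look for.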

---
---?image=presentation/assets/background/suitmeasure.jpg&size=cover&opacity=15

### Cross-Validation

Cross-validation is a technique where we slice and dice the training data, training our model with different subsets of the total data. The purpose here is to find a model which is fairly robust to the particulars of a subset of training data, thereby reducing the risk of overfitting.
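
One common form of this is k-fold cross-validation, where each row is held out exactly once. A minimal sketch of the index bookkeeping, assuming contiguous, unshuffled folds for readability (in practice you would usually shuffle first):

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder across the first few folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, validation_idx in folds:
    print(validation_idx)
```

You would train once per fold on `train_idx`, score on `validation_idx`, and compare the scores; a model whose scores swing widely between folds is sensitive to the particulars of its training subset.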

---
---?image=presentation/assets/background/tuning.jpg&size=cover&opacity=15

### Tune The Model

Most models have **hyperparameters**. For neural networks, the number of training epochs is a hyperparameter. For random forests, hyperparameters include things like the size of each decision tree and the number of trees.

We tune hyperparameters using our validation data set.
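
As a sketch of tuning against validation data, the example below picks `k` for a tiny one-dimensional k-nearest-neighbors classifier. k-NN is swapped in here only because it is short enough to write out in full; it is not one of the models named above, and the data is made up.

```python
def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def validation_accuracy(train, validation, k):
    """Fraction of validation rows that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


def tune_k(train, validation, candidates):
    """Return the candidate k with the best validation accuracy (first wins ties)."""
    return max(candidates, key=lambda k: validation_accuracy(train, validation, k))


# Hypothetical data: low x is class 0, high x is class 1, with two noisy labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
validation = [(2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1)]
best_k = tune_k(train, validation, candidates=[1, 3, 5])
print(best_k)
```

With `k=1` the model copies the noisy neighbors and gets every validation row wrong; larger values smooth the noise out. The key point is that the choice is scored on rows the model was not trained on.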

---
---?image=presentation/assets/background/evaluation.jpg&size=cover&opacity=15

### Evaluate The Model

@@ -635,7 +635,7 @@ You can also build a fitness function to evaluate certain types. Genetic algorit

Back in the day, one team would build a solution in an analytics language (e.g., R) but you would not go to production with that. Instead, an implementation team would rewrite your model in C++ or some other fast language. Those days of research versus implementation teams using completely different languages are now (mostly) gone.

---
---?image=presentation/assets/background/microscope.jpg&size=cover&opacity=15

### Deployment

@@ -649,7 +649,7 @@ Once you have a model ready to go, there are tools which make it relatively easy

![The DeployR process.](presentation/assets/image/deployrworkflow.png)

---
---?image=presentation/assets/background/stack.jpg&size=cover&opacity=15

### Deployment

@@ -660,7 +660,7 @@ You can also build your own services.  Stacks that I've put into production incl

With a microservices architecture, you're trying to plug in these new APIs while not forcing everybody else to change their skills.

---
---?image=presentation/assets/background/time.jpg&size=cover&opacity=15

### Deployment

@@ -701,7 +701,7 @@ Shiny is an interactive visualization product combining JavaScript and R. This i
4. Deployment
5. **What's Next?**

---
---?image=presentation/assets/background/sinkhole.jpg&size=cover&opacity=15

### What's Next?

@@ -711,7 +711,7 @@ It is important to keep checking the efficacy of models. Model shift happens, wh

You may also find out that your training/testing data was not truly indicative of real-world data.

---
---?image=presentation/assets/background/flow.jpg&size=cover&opacity=15

### What's Next?

@@ -721,7 +721,7 @@ Depending upon your choice of algorithm, you might be able to update the existin

Some algorithms, however, require you to retrain from scratch.

---
---?image=presentation/assets/background/repetition.jpg&size=cover&opacity=15

### What's Next?

Binary file added presentation/assets/background/2_0_cleaning.jpg
Binary file added presentation/assets/background/4_3_datatypes.jpg
Binary file added presentation/assets/background/book.jpg
Binary file added presentation/assets/background/cluster.jpg
Binary file added presentation/assets/background/collection.jpg
Binary file added presentation/assets/background/duplicates.jpg
Binary file added presentation/assets/background/engineering.jpg
Binary file added presentation/assets/background/evaluation.jpg
Binary file added presentation/assets/background/exploration.jpg
Binary file added presentation/assets/background/fitting.jpg
Binary file added presentation/assets/background/flow.jpg
Binary file added presentation/assets/background/greenscreen.jpg
Binary file added presentation/assets/background/hole.jpg
Binary file added presentation/assets/background/map.jpg
Binary file added presentation/assets/background/microscope.jpg
Binary file added presentation/assets/background/mismatch.jpg
Binary file added presentation/assets/background/misshapen.jpg
Binary file added presentation/assets/background/model.jpg
Binary file added presentation/assets/background/repetition.jpg
Binary file added presentation/assets/background/selection.jpg
Binary file added presentation/assets/background/shapes.jpg
Binary file added presentation/assets/background/sinkhole.jpg
Binary file added presentation/assets/background/stack.jpg
Binary file added presentation/assets/background/suitmeasure.jpg
Binary file added presentation/assets/background/supervision.jpg
Binary file added presentation/assets/background/time.jpg
Binary file added presentation/assets/background/train.jpg
Binary file added presentation/assets/background/tuning.jpg
