# Machine Learning and Data Science Framework

The six-step framework:

1. Problem definition ... what are we trying to solve?
2. Data ... what data do we have?
3. Evaluation ... what defines success for us?
4. Features ... what features should we model? What do we already know about the data?
5. Modeling ... what kind of model should we use?
6. Experiments ... what have we tried and what else can we try?

## 1. Problem definition

Remember ... machine learning isn't the best path for every data problem. Simple, rule-based systems can often be the best approach!

Think ... is this problem a good fit for supervised learning, unsupervised learning, transfer learning, or reinforcement learning?

Supervised learning: when you know your inputs (features) and outputs (labels)
* Classification (binary or multiclass)
* Regression

Unsupervised learning: when you only know your inputs
* Clustering
* Recommendation engines

Transfer learning: leverage a pre-trained model for our use case (when you think your problem might be similar to something else)

## 2. Data

What kind of data do we have?
* Structured?
* Unstructured? (text, audio, images)

How does it come in?
* Static/batch (most problems start here)
* Streaming?

## 3. Evaluation

What level of performance do we need for this project to be useful? What is state-of-the-art or human-level performance on this kind of problem?

Which metric will we use to measure that performance (for example accuracy, precision, recall, or mean absolute error)?

What do we want to avoid more? False positives or false negatives?
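To make the false positive / false negative question concrete, here's a minimal sketch in plain Python. It assumes binary labels where `1` is the positive class; the function names and the toy labels are made up for illustration. Precision drops as false positives go up, recall drops as false negatives go up, so which one you watch depends on which mistake hurts more.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision penalizes false positives; recall penalizes false negatives."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for this example
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```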

## 4. Features

Typical types of features:
* Numerical
* Categorical

Can we engineer new features from our existing ones?

If a feature has a lot of missing data, it might not be useful to us.
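A rough sketch of two quick feature checks, assuming the data is a list of dicts with `None` marking a missing value. The rows, feature names, and helper functions below are hypothetical. The first helper measures how much of a feature is missing; the second turns a categorical feature into numerical 0/1 columns (one-hot encoding), which is one common way to engineer model-ready features.

```python
def missing_fraction(rows, feature):
    """Fraction of rows where the feature is missing (None or absent)."""
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)


def one_hot(rows, feature):
    """Turn one categorical feature into 0/1 indicator columns."""
    categories = sorted({row[feature] for row in rows if row[feature] is not None})
    return [
        {f"{feature}={cat}": int(row[feature] == cat) for cat in categories}
        for row in rows
    ]


# Toy rows, invented for this example
rows = [
    {"age": 34, "colour": "red", "income": None},
    {"age": 51, "colour": "blue", "income": None},
    {"age": None, "colour": "red", "income": 42000},
    {"age": 29, "colour": "green", "income": None},
]

for feature in ("age", "colour", "income"):
    print(feature, missing_fraction(rows, feature))

print(one_hot(rows, "colour")[0])
```

Here `income` is missing in three out of four rows, so it's a candidate to drop or to investigate further before modeling.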

## 5. Modeling

Three parts:
1. Choosing and training a model (training data)
2. Tuning the model (validation data)
3. Comparing the model with others you've tried (test data)
    * watch out for under- and overfitting ... you want generality

Based on our problem and data, what model should we use?

An important note ... during our experiments we'll split our data into three sets:
* training data ... 70% - 80% (train the model on this)
* validation data ... 10% - 15% (tune the model on this)
* test data ... 10% - 15% (test and compare on this)
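The three-way split above can be sketched with the standard library alone. This is a minimal version, assuming the samples fit in memory and can be shuffled freely (that is not true for time series, for example); the function name, the 70/15/15 default, and the seed are choices made for this example.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then slice into train / validation / test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 100 dummy sample ids, invented for this example
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing matters: if the data is sorted (by date, by label, and so on), slicing without shuffling would give the three sets different distributions.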

Choose a model and train it on training data.

During experimentation, aim to minimize the time between experiments. If your training data is large enough that training takes hours, start with a subset of it so you can iterate faster.
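One simple way to get that subset is a random sample of the training rows. A minimal sketch, assuming the rows are in a list; the `sample_subset` helper and the 10% default are hypothetical, not a recommendation for every dataset.

```python
import random


def sample_subset(rows, fraction=0.1, seed=0):
    """Return a reproducible random sample of the rows for quick experiments."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


# 10,000 dummy training rows, invented for this example
training_rows = list(range(10_000))
quick_rows = sample_subset(training_rows, fraction=0.1)
print(len(quick_rows))
```

Once an idea looks promising on the subset, rerun it on the full training data before drawing conclusions.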

When selecting a model for production, consider not only accuracy, but also training time and time to make a prediction on new data.
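Training time and prediction time are easy to measure alongside accuracy. The sketch below uses `time.perf_counter` from the standard library; `MeanRegressor` is a toy stand-in model written for this example (not a real library class), and `timed` is a hypothetical helper.

```python
import time


class MeanRegressor:
    """Toy stand-in model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Dummy data, invented for this example
X = [[i] for i in range(10_000)]
y = [float(i) for i in range(10_000)]

model = MeanRegressor()
_, fit_seconds = timed(model.fit, X, y)
preds, predict_seconds = timed(model.predict, X)

print(f"fit: {fit_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

Logging these two numbers next to each model's score makes the trade-off visible when comparing candidates.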

Poor performance on training data means underfitting. Try a different model, improve the existing one (for example through tuning or better features), or collect more data.

Great performance on training data but poor performance on test data means overfitting and poor generalization. Try a simpler model, or make sure your training data is representative of your test data. You don't want a mismatch between the data you've trained your model on and the data it will encounter in production.
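The two checks above can be written down as a small rule of thumb. This is only a sketch: it assumes a score where higher is better (such as accuracy between 0 and 1), and the `good_enough` and `max_gap` thresholds are arbitrary placeholders you would set per project.

```python
def diagnose(train_score, test_score, good_enough=0.8, max_gap=0.1):
    """Label a model as underfitting, overfitting, or ok from two scores."""
    if train_score < good_enough:
        # The model can't even fit the data it was trained on
        return "underfitting"
    if train_score - test_score > max_gap:
        # Fits the training data well but doesn't generalize
        return "overfitting"
    return "ok"


# Scores invented for this example
print(diagnose(0.60, 0.58))
print(diagnose(0.99, 0.70))
print(diagnose(0.90, 0.88))
```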
