(https://medium.datadriveninvestor.com/how-my-computer-copies-a-baby-machine-learning-types-5ffc8add6b31)

# Reference Guide for All Things Data Science

Four Pillars of Data Science:
1. Domain Expertise - Relevant knowledge that helps formulate questions, make nuanced decisions, and interpret results.
2. Math & Stats - Allows investigation of data patterns to determine relationships.
3. Computer Science & Engineering - Utilized for data analysis, machine learning, and higher-order predictive techniques.
4. Communication - Makes work accessible to nontechnical audiences/stakeholders.

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/new_crisp-dm.png/new_crisp-dm.png" width="500">

**_CRISP-DM_** is probably the most popular Data Science process in the Data Science world right now. Take a look at the visualization above to get a feel for CRISP-DM. Notice that CRISP-DM is an iterative process!

Let's take a look at the individual steps involved in CRISP-DM.

**_Business Understanding:_**  This stage is all about gathering facts and requirements. Who will be using the model you build? How will they be using it? How will this help the goals of the business or organization overall? Data Science projects are complex, with many moving parts and stakeholders. They're also time intensive to complete or modify. Because of this, it is very important that the Data Science team working on the project has a deep understanding of what the problem is, and how the solution will be used. Consider the fact that many stakeholders involved in the project may not have technical backgrounds, and may not even be from the same organization.  Stakeholders from one part of the organization may have wildly different expectations about the project than stakeholders from a different part of the organization -- for instance, the sales team may be under the impression that a recommendation system project is meant to increase sales by recommending upsells to current customers, while the marketing team may be under the impression that the project is meant to help generate new leads by personalizing product recommendations in a marketing email. These are two very different interpretations of a recommendation system project, and it's understandable that both departments would immediately assume that the primary goal of the project is one that helps their organization. As a Data Scientist, it's up to you to clarify the requirements and make sure that everyone involved understands what the project is and isn't. 

During this stage, the goal is to get everyone on the same page and to provide clarity on the scope of the project for everyone involved, not just the Data Science team. Generate and answer as many contextual questions as you can about the project. 

Good questions for this stage include:

- Who are the stakeholders in this project? Who will be directly affected by the creation of this project?
- What business problem(s) will this Data Science project solve for the organization?  
- What problems are inside the scope of this project?
- What problems are outside the scope of this project?
- What data sources are available to us?
- What is the expected timeline for this project? Are there hard deadlines (e.g. "must be live before holiday season shopping") or is this an ongoing project?
- Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't?

**_Data Understanding:_**

Once we have a solid understanding of the business implications for this project, we move on to understanding our data. During this stage, we'll aim to get a solid understanding of the data needed to complete the project.  This step includes both understanding where our data is coming from, as well as the information contained within the data. 

Consider the following questions when working through this stage:

- What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?
- Who controls the data sources, and what steps are needed to get access to the data?
- What is our target?
- What predictors are available to us?
- What data types are the predictors we'll be working with?
- What is the distribution of our data?
- How many observations does our dataset contain? Do we have a lot of data? Only a little? 
- Do we have enough data to build a model? Will we need to use resampling methods?
- How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?

**_Data Preparation:_**

Once we have a strong understanding of our data, we can move onto preparing the data for our modeling steps. 

During this stage, we'll want to handle the following issues:

- Detecting and dealing with missing values
- Data type conversions (e.g. numeric data mistakenly encoded as strings)
- Checking for and removing multicollinearity (correlated predictors)
- Normalizing our numeric data
- Converting categorical data to numeric format through one-hot encoding
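
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not a recommended implementation: the helper names (`to_float`, `fill_missing`, `min_max`, `one_hot`) and the toy values are assumptions made for this example, and mean imputation is just one of several ways to handle missing values. In practice, libraries such as pandas and scikit-learn provide equivalents.

```python
def to_float(values):
    """Convert numeric strings to floats; empty entries become None (missing)."""
    return [float(v) if v not in ("", None) else None for v in values]

def fill_missing(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy columns: ages stored as strings (one missing) and a color category
raw_age = ["20", "", "40", "30"]
age = min_max(fill_missing(to_float(raw_age)))
categories, encoded = one_hot(["red", "blue", "red", "green"])

print(age)         # normalized ages
print(categories)  # column order of the one-hot encoding
print(encoded)
```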

**_Modeling:_**

Once we have clean data, we can begin modeling! Remember, modeling, as with any of these other steps, is an iterative process. During this stage, we'll try to build and tune models to get the highest performance possible on our task. 

Consider the following questions during the modeling step:

- Is this a classification task? A regression task? Something else?
- What models will we try?
- How do we deal with overfitting?
- Do we need to use regularization or not?
- What sort of validation strategy will we be using to check that our model works well on unseen data?
- What loss functions will we use?
- What threshold of performance do we consider as successful?

Other Questions:
* What decisions do I need to make regarding my data? How might these decisions affect overall performance?
* Which predictors do I need? How can I confirm that I have the right predictors?
* What parameter values (if any) should I choose for my model? How can I find the optimal value for a given parameter?
* What metrics will I use to evaluate the performance of my model? Why?
* How do I know if there's room left for improvement with my model? Are the potential performance gains worth the time needed to reach them?

**_Evaluation:_**

During this step, we'll evaluate the results of our modeling efforts. Does our model solve the problems that we outlined all the way back during step 1? Why or why not? Oftentimes, evaluating the results of our modeling step will raise new questions, or will cause us to consider changing our approach to the problem.  Notice from the CRISP-DM diagram above that the "Evaluation" step is unique in that it points to both _Business Understanding_ and _Deployment_.  As we mentioned before, Data Science is an iterative process -- that means that given the new information our model has provided, we'll often want to start over with another iteration, armed with our newfound knowledge! Perhaps the results of our model showed us something important about the goal or scope of the project that we had originally failed to consider.  Perhaps we learned that the model can't be successful without more data, or different data. Perhaps our evaluation shows us that we should reconsider our approach to cleaning and structuring the data, or how we frame the project as a whole (e.g. realizing we should treat the problem as a classification rather than a regression task). In any of these cases, revisiting the earlier steps is encouraged.  

Of course, if the results are satisfactory, then we instead move onto deployment!

**_Deployment:_**

During this stage, we'll focus on moving our model into production and automating as much as possible. Everything before this serves as a proof-of-concept or an investigation.  If the project has proved successful, then you'll work with stakeholders to determine the best way to implement models and insights.  For example, you might set up an automated ETL (Extract-Transform-Load) pipeline that feeds raw data into a database and reformats it so that it is ready for modeling. During the deployment step, you'll actively work to determine the best course of action for getting the results of your project into the wild, and you'll often be involved with building everything needed to put the software into production.

## Data Science follows this general pipeline:

1. Exploratory Data Analysis
    * Data Cleaning - Look for missing values, duplicates, mismatched data types, and [outliers](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/)
    * Data Visualization - Observe distributions and class balance or imbalance, and measure correlations
    * Feature Engineering - Create new features from existing ones (ratios, encoding, splitting datetime, etc.) or transform the data using logarithm or square root.
2. Inferential Statistics
    * Declare null/alternative hypotheses
    * Establish significance level (i.e. alpha)
    * Complete statistical tests
    * Interpret results (p-value)
3. Modeling
    * Machine Learning
        * Baseline model comparison & model selection
        * Split data into train/test groups
        * Data preprocessing (SMOTE, scaling, PCA/LDA) - Fit on X_train only, then apply the fitted transforms to X_test; resampling such as SMOTE is applied to the training set only
        * Model training
        * Hyperparameter tuning (GridSearch)
        * Model evaluation
    * Deep Learning
        * Model design - ANN, CNN, RNN
        * Data Preprocessing - Normalize/standardize inputs
        * Model training & optimization (SGD, Adam)
        * Regularization - Reduce overfitting (dropout, batch normalization, etc.)
        * Model evaluation
4. Deployment
    * Model serialization - Save the model without the need to retrain
    * API Development - Allows other systems to interact with the model
    * Containerization - Package model and dependencies together
    * Cloud deployment - Ensures scalability
    * Integration - Make accessible to non-technical users
    * Maintenance - Monitor performance and retrain as necessary
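
A point worth illustrating from the Machine Learning steps above: preprocessing statistics are usually learned from the training split and then reused on the test split, so that no information from the test set leaks into the model. The sketch below shows this for standardization in plain Python; `fit_scaler`, `transform`, and the toy values are hypothetical names and data chosen for this example.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def transform(values, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical single-feature train/test split
x_train = [2.0, 4.0, 6.0, 8.0]
x_test = [5.0, 10.0]

mean, std = fit_scaler(x_train)                 # fit on train only
x_train_scaled = transform(x_train, mean, std)
x_test_scaled = transform(x_test, mean, std)    # reuse the train statistics

print(mean, std)
print(x_test_scaled)
```

The test values are scaled with the training mean and standard deviation, which mirrors how a deployed model would treat unseen data.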

#### Outliers
Identifying Outliers
* 1.5 * IQR rule - Flag points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1
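
As a minimal sketch of the 1.5 * IQR rule, the function below flags values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The function name `iqr_outliers` and the sample data are assumptions for illustration. Quartile definitions vary between tools; this example uses the standard library's "inclusive" method, so results may differ slightly from other packages.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences used."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

# Hypothetical data with one obviously extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, fences = iqr_outliers(data)

print(outliers)  # [100]
print(fences)    # (-3.5, 14.5)
```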

Handling Outliers:
1. Investigate First:
    * Understand the cause of the outliers (error, natural variability, or special cases?).
2. Use Domain Knowledge:
    * Collaborate with subject matter experts to decide whether the outliers are meaningful or spurious.
3. Transform Data:
    * Use transformations (e.g., log, square root) to reduce the impact of outliers without removing them.
4. Use Robust Models:
    * Some machine learning algorithms (e.g., decision trees, random forests) are less sensitive to outliers.
5. Explore Alternatives:
    * Other methods like z-scores, Mahalanobis distance (for multivariate data), or robust statistical techniques might be more appropriate depending on your data and goals.
6. Document Decisions:
    * Always document how and why outliers were handled to ensure reproducibility and explainability.

### Supervised Machine Learning

#### KNN

The K-Nearest Neighbors algorithm works as follows: 

1. Choose a point 
2. Find the K-nearest points
    1. K is a predefined user constant such as 1, 3, 5, or other odd number. 
3. Predict a label for the current point:
    1. Classification - Take the most common class of the k neighbors
    2. Regression - Take the average target metric of the k neighbors
    3. Both classification and regression can also be modified to use weighted averages based on the distance of the neighbors 

Assumption: **_Distance helps us quantify similarity_** - Objects that are more alike are more likely to be the same class. By treating each column in your dataset as a separate dimension, you can plot each data point that you have and measure the distance between them!

Distance Metrics: 
* Minkowski - A generalized distance metric across a _Normed Vector Space_. A Normed Vector Space is a vector space equipped with a function (the norm) that assigns a length to every vector. Among other requirements (such as the triangle inequality), the norm must meet two criteria: 
    1. the zero vector (just a vector filled with zeros) will output a length of 0, and 
    2. every other vector must have a positive length 
    * `Formula`: $ d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$
* Manhattan - Measures the distance from one point to another traveling along the axes of a grid.
    * `Formula`: $ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $
* Euclidean - Measures the distance between two points, by moving in a straight line.
    * `Formula`: $ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $
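
The three formulas are closely related: setting `c = 1` in the Minkowski formula gives the Manhattan distance, and `c = 2` gives the Euclidean distance. A small sketch, with the function name `minkowski` and the sample points chosen only for illustration:

```python
def minkowski(x, y, c):
    """Minkowski distance between two equal-length points for exponent c."""
    return sum(abs(a - b) ** c for a, b in zip(x, y)) ** (1 / c)

p, q = (0, 0), (3, 4)

manhattan = minkowski(p, q, 1)  # |3| + |4| = 7
euclidean = minkowski(p, q, 2)  # sqrt(3^2 + 4^2) = 5

print(manhattan, euclidean)
```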

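KNN stores the entire training set in memory, because every prediction is computed directly from the training points. This is a major constraint on large datasets.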

During the "predict" step, KNN takes a point that you want a class prediction for, and calculates the distances between that point and every single point in the training set. It then finds the `K` closest points, or **_Neighbors_**, and examines the labels of each. You can think of each of the K-closest points getting to 'vote' about the predicted class. Naturally, they all vote for the same class that they belong to. The majority wins, and the algorithm predicts the point in question as whichever class has the highest count among all of the k-nearest neighbors.
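
The predict step described above can be sketched in a few lines of plain Python. This is a brute-force illustration under simple assumptions (Euclidean distance, unweighted majority vote); the function name `knn_predict` and the toy points are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training point, nearest first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Each of the k closest points votes for its own class
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two small, well-separated clusters
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))  # a
print(knn_predict(train_points, train_labels, (7, 7)))  # b
```

Note that the function touches every training point for each query, which is why the training set has to stay in memory.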

In general, the smaller K is, the tighter the "fit" of the model. Remember that with supervised learning, you want to fit a model to the data as closely as possible without **_overfitting_** to patterns in the training set that don't generalize.  This can happen if your model pays too much attention to every little detail and makes a very complex decision boundary. Conversely, if your model is overly simplistic, then you may have **_underfit_** the model, limiting its potential. A visual explanation helps demonstrate this concept in practice:

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/fit_fs.png" width = "700">

KNN isn't the best choice for extremely large datasets and/or data with high dimensionality. A brute-force implementation must compute the distance from the query point to every point in the training set, so the time complexity (what computer scientists call "Big O") of a single prediction grows linearly with both the number of training points and the number of dimensions. Predicting for many points multiplies that cost, and in high dimensions distances also become less informative. That said, for smaller datasets, KNN often works surprisingly well, given the simplicity of the overall algorithm.

#### Grid Search Cross Validation

Hyperparameters such as `max_depth` and `min_samples_split` work together to create the framework of the decision tree that will be trained. For a given problem, it may be the case that increasing the value of `min_samples_split` generally improves model performance up to a certain point, by reducing overfitting. However, if the value for `max_depth` is too high or too low, this may doom the model to overfitting or underfitting: a tree with too many arbitrary levels and splits overfits on noise, while a tree allowed to grow only one or two levels is nothing more than a "stump". 

So how do we know which combination of parameters is best? The only way we can really know for sure is to try **_every single combination!_** For this reason, grid search is sometimes referred to as an **_exhaustive search_**. 

The following code snippet demonstrates how to use `GridSearchCV` to perform a parameter grid search using a sample parameter grid, `param_grid`. Our parameter grid should be a dictionary, where the keys are the parameter names, and the values are the different parameter values we want to use in our grid search for each given key. After creating the dictionary, all you need to do is pass it to `GridSearchCV()` along with the classifier. You can also use K-fold cross-validation during this process, by specifying the `cv` parameter. In this case, we choose to use 3-fold cross-validation for each model created inside our grid search. 

```python
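from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Keys are parameter names; values are the candidate settings to try
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]  # integer values must be at least 2
}

# 3-fold cross-validation for every parameter combination
gs_tree = GridSearchCV(clf, param_grid, cv=3)
gs_tree.fit(train_data, train_labels)  # assumes an existing training split

gs_tree.best_params_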
```

This code will run all combinations of the parameters above.

##### Drawbacks of `GridSearchCV`

GridSearchCV is a great tool for finding the best combination of parameters. However, it is only as good as the parameters we put in our parameter grid -- so we need to be very thoughtful during this step! 

The main drawback of an exhaustive search such as `GridSearchCV` is that there is no way of telling what's best until we've exhausted all possibilities! This means training many versions of the same machine learning model, which can be very time consuming and computationally expensive. Consider the example code above: we have three different parameters, with 2, 4, and 4 variations to try, respectively. We also set the model to use cross-validation with a value of 3, meaning that each model will be built 3 times, and their performances averaged together. If we do some simple math, we can see that this simple grid search actually results in `2 * 4 * 4 * 3 =` **_96 different models trained!_** (With the default `refit=True`, one more model is then fit on the full training set using the best combination.) For projects that involve complex models and/or very large datasets, the time needed to run a grid search can often be prohibitive. For this reason, be very thoughtful about the parameters you set. Sometimes the extra runtime isn't worth it, especially when there's no guarantee that the model performance will improve!
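
The arithmetic above can be checked by enumerating the grid directly. This sketch uses only the standard library and a grid with the same shape as the example (2, 4, and 4 candidate values); the variable names are illustrative, and the `min_samples_split` values here start at 2, the smallest integer scikit-learn accepts.

```python
from itertools import product

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20],
}
cv = 3  # number of cross-validation folds

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv

print(len(combinations))  # 32 parameter combinations
print(n_fits)             # 96 cross-validation fits
```

Adding one more parameter with, say, 5 candidate values would multiply the total by 5, which is why grids grow expensive quickly.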



#### Integrating Grid Search in Pipelines

First, you define the pipeline in the same way as above. Next, you create a parameter grid. When this is all done, you use the function `GridSearchCV()`, which you've seen before, and specify the pipeline as the estimator and the parameter grid. You also have to define how many folds you'll use in your cross-validation. 

```python
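# Imports needed for the pipeline and the grid search
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Create the pipeline
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])

# Create the parameter grid; keys use the form <step name>__<parameter name>
grid = [{'tree__max_depth': [None, 2, 6, 10], 
         'tree__min_samples_split': [5, 10]}]
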
# Create the grid, with "pipe" as the estimator
gridsearch = GridSearchCV(estimator=pipe, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

# Fit using grid search
gridsearch.fit(X_train, y_train)

# Calculate the test score
gridsearch.score(X_test, y_test)
```

#### Ensemble Methods
In Data Science, the term **_ensemble_** refers to an algorithm that makes use of more than one model to make a prediction. These are typically ***supervised*** machine learning models.

Ensemble methods take advantage of the Delphi technique (or "wisdom of crowds"), where the average of multiple independent estimates is usually more consistently accurate than the individual estimates.

##### Bagging
The main concept that makes ensembling possible is **_Bagging_**, which is short for **_Bootstrap Aggregation_**. Bootstrap aggregation is itself a combination of two ideas: bootstrap resampling and aggregation. `Bootstrapping` refers to creating subsets of your dataset by sampling with replacement. Aggregation is exactly as it sounds: the practice of combining all the different estimates to arrive at a single estimate.
* A common approach is to treat each classifier in the ensemble's prediction as a "vote" and let our overall prediction be the majority vote. 
* It's also common to see ensembles that take the arithmetic mean of all predictions, or compute a weighted average.

Process:
1. Grab a sizable sample from your dataset, with replacement.
2. Train a classifier on this sample.
3. Repeat until all classifiers have been trained on their own sample from the dataset.
4. When making a prediction, have each classifier in the ensemble make a prediction.
5. Aggregate all predictions from all classifiers into a single prediction, using the method of your choice.
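
The process above can be sketched end to end in plain Python. This is an illustration under simplifying assumptions, not a production implementation: the base classifier is a hypothetical one-feature decision stump (`fit_stump`), each bootstrap sample has the same size as the dataset, and aggregation is a simple majority vote. All function names and the toy data are made up for this example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D decision stump: one threshold, one label on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (steps 1-3)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Let every model vote and return the majority class (steps 4-5)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: class 0 for small x, class 1 for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

models = bagging_fit(xs, ys)
print(bagging_predict(models, 1), bagging_predict(models, 10))
```

Because each stump sees a different bootstrap sample, the individual thresholds differ, and the vote smooths out those differences.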

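Decision trees are often used as the base models because they have high variance: small changes in the training data can produce very different trees. On their own, this is a weakness. However, when aggregated together into an ensemble, this actually becomes a good thing, because combining many differing trees averages much of that variance away!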

#### Random Forests

The **_Random Forest_** algorithm is a supervised learning algorithm and ensemble technique built from Decision Trees. Because a Decision Tree greedily maximizes information gain at every step, trees trained on identical data would come out identical. We therefore differentiate each tree by limiting the samples it is trained on (Bagging; roughly 2/3 of the rows) and limiting the number of features it is trained on (Subspace Sampling; a hyperparameter).

Example:

Let's say we have a training dataset that consists of 3000 rows and 10 columns.
1. Bag 2/3 of the overall data (2000 rows).
2. Randomly select a set number of features to use for training this tree (tunable; let's choose 6).
3. Train the tree on the modified dataset, which is now a DataFrame consisting of 2000 rows and 6 columns.
4. Drop the unused columns from step 3 from the out-of-bag rows that weren't bagged in step 1, and then use this as an internal testing set to calculate the out-of-bag error for this particular tree.
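
The row and feature sampling in this example can be sketched with the standard library. This assumes the bag is drawn as a bootstrap sample of the same size as the dataset (3000 draws with replacement), which is where the "roughly 2/3" figure comes from: about 63% of the rows end up in the bag and the rest are out-of-bag. The variable names and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)
n_rows, n_cols, n_features = 3000, 10, 6

# Step 1: bootstrap sample of row indices (with replacement)
bag_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
in_bag = set(bag_idx)

# Rows never drawn form the out-of-bag set used for internal testing
out_of_bag = [i for i in range(n_rows) if i not in in_bag]

# Step 2: random subset of feature (column) indices for this tree
features = sorted(rng.sample(range(n_cols), n_features))

in_bag_fraction = len(in_bag) / n_rows
print(round(in_bag_fraction, 3))  # close to 0.63
print(len(out_of_bag), features)
```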

Pros:
* Robust to variance
* High performance

Cons:
* Slow on large datasets
* High memory consumption (each tree is stored)

#### Model Deployment

Luckily there are techniques to *pickle* your model -- basically, to store the model for later, so that it can be loaded and can make predictions without being trained again. Pickled models are also typically used in the context of model deployment, where your model can be used as the backend of an API!
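
A minimal sketch of pickling with Python's built-in `pickle` module is shown below. To keep the example free of third-party dependencies, a plain dictionary of hypothetical weights stands in for a trained model; the `predict` helper and the file name `model.pkl` are likewise assumptions for illustration. Fitted scikit-learn estimators can generally be saved and loaded the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: hypothetical linear weights and bias
model = {"weights": [0.5, -1.0], "bias": 0.25}

def predict(model, features):
    """Linear prediction from the stored parameters."""
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")

    # Save the model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Later (e.g. inside an API backend): load it without retraining
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(predict(restored, [2.0, 1.0]))  # 0.25
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.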