(https://medium.datadriveninvestor.com/how-my-computer-copies-a-baby-machine-learning-types-5ffc8add6b31)

# Reference Guide for All Things Data Science

Four Pillars of Data Science:
1. Domain Expertise - Relevant knowledge that helps formulate questions, make nuanced decisions, and assists with interpretation of results.
2. Math & Stats - Allows investigation of data patterns to determine relationships.
3. Computer Science & Engineering - Utilized for data analysis, machine learning, and higher-order predictive techniques.
4. Communication - Makes work accessible to nontechnical audiences/stakeholders.

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/new_crisp-dm.png/new_crisp-dm.png" width="500">

**_CRISP-DM_** (Cross-Industry Standard Process for Data Mining) is one of the most popular process models in Data Science. Take a look at the visualization above to get a feel for CRISP-DM. Notice that CRISP-DM is an iterative process!

Let's take a look at the individual steps involved in CRISP-DM.

**_Business Understanding:_**  This stage is all about gathering facts and requirements. Who will be using the model you build? How will they be using it? How will this help the goals of the business or organization overall? Data Science projects are complex, with many moving parts and stakeholders. They're also time intensive to complete or modify. Because of this, it is very important that the Data Science team working on the project has a deep understanding of what the problem is, and how the solution will be used. Consider the fact that many stakeholders involved in the project may not have technical backgrounds, and may not even be from the same organization.  Stakeholders from one part of the organization may have wildly different expectations about the project than stakeholders from a different part of the organization -- for instance, the sales team may be under the impression that a recommendation system project is meant to increase sales by recommending upsells to current customers, while the marketing team may be under the impression that the project is meant to help generate new leads by personalizing product recommendations in a marketing email. These are two very different interpretations of a recommendation system project, and it's understandable that both departments would immediately assume that the primary goal of the project is one that helps their organization. As a Data Scientist, it's up to you to clarify the requirements and make sure that everyone involved understands what the project is and isn't. 

During this stage, the goal is to get everyone on the same page and to provide clarity on the scope of the project for everyone involved, not just the Data Science team. Generate and answer as many contextual questions as you can about the project. 

Good questions for this stage include:

- Who are the stakeholders in this project? Who will be directly affected by the creation of this project?
- What business problem(s) will this Data Science project solve for the organization?  
- What problems are inside the scope of this project?
- What problems are outside the scope of this project?
- What data sources are available to us?
- What is the expected timeline for this project? Are there hard deadlines (e.g. "must be live before holiday season shopping") or is this an ongoing project?
- Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't?

**_Data Understanding:_**

Once we have a solid understanding of the business implications for this project, we move on to understanding our data. During this stage, we'll aim to get a solid understanding of the data needed to complete the project.  This step includes both understanding where our data is coming from, as well as the information contained within the data. 

Consider the following questions when working through this stage:

- What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?
- Who controls the data sources, and what steps are needed to get access to the data?
- What is our target?
- What predictors are available to us?
- What data types are the predictors we'll be working with?
- What is the distribution of our data?
- How many observations does our dataset contain? Do we have a lot of data? Only a little? 
- Do we have enough data to build a model? Will we need to use resampling methods?
- How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?

**_Data Preparation:_**

Once we have a strong understanding of our data, we can move onto preparing the data for our modeling steps. 

During this stage, we'll want to handle the following issues:

- Detecting and dealing with missing values
- Data type conversions (e.g. numeric data mistakenly encoded as strings)
- Checking for and removing multicollinearity (correlated predictors)
- Normalizing our numeric data
- Converting categorical data to numeric format through one-hot encoding
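
The preparation steps above can be sketched with pandas. This is a minimal illustration rather than a full cleaning routine: the DataFrame, its column names (`age`, `income`, `city`), and the choice of median imputation and min-max scaling are all assumptions made for the example. Checking for multicollinearity is left out to keep it short.

```python
import pandas as pd

# Hypothetical toy data: 'age' is numeric but stored as strings, with missing values
df = pd.DataFrame({
    "age": ["25", "32", None, "41"],
    "income": [40000.0, 52000.0, 61000.0, None],
    "city": ["NYC", "LA", "NYC", "SF"],
})

# Data type conversion: strings -> numbers (missing entries become NaN)
df["age"] = pd.to_numeric(df["age"])

# Missing values: fill each numeric column with its median (one of many strategies)
num_cols = ["age", "income"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Normalization: min-max scale the numeric columns to the [0, 1] range
col_min, col_max = df[num_cols].min(), df[num_cols].max()
df[num_cols] = (df[num_cols] - col_min) / (col_max - col_min)

# One-hot encoding: one indicator column per category
df = pd.get_dummies(df, columns=["city"])
```

In a real project, statistics such as the median, minimum, and maximum would be computed on the training split only and then reused on the test split.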

**_Modeling:_**

Once we have clean data, we can begin modeling! Remember, modeling, as with any of these other steps, is an iterative process. During this stage, we'll try to build and tune models to get the highest performance possible on our task. 

Consider the following questions during the modeling step:

- Is this a classification task? A regression task? Something else?
- What models will we try?
- How do we deal with overfitting?
- Do we need to use regularization or not?
- What sort of validation strategy will we be using to check that our model works well on unseen data?
- What loss functions will we use?
- What threshold of performance do we consider as successful?

Other Questions:
* What decisions do I need to make regarding my data? How might these decisions affect overall performance?
* Which predictors do I need? How can I confirm that I have the right predictors?
* What parameter values (if any) should I choose for my model? How can I find the optimal value for a given parameter?
* What metrics will I use to evaluate the performance of my model? Why?
* How do I know if there's room left for improvement with my model? Are the potential performance gains worth the time needed to reach them?

**_Evaluation:_**

During this step, we'll evaluate the results of our modeling efforts. Does our model solve the problems that we outlined all the way back during step 1? Why or why not? Oftentimes, evaluating the results of our modeling step will raise new questions, or will cause us to consider changing our approach to the problem. Notice from the CRISP-DM diagram above that the "Evaluation" step is unique in that it points to both _Business Understanding_ and _Deployment_. As we mentioned before, Data Science is an iterative process -- that means that given the new information our model has provided, we'll often want to start over with another iteration, armed with our newfound knowledge! Perhaps the results of our model showed us something important about the goal or scope of the project that we had originally failed to consider. Perhaps we learned that the model can't be successful without more data, or different data. Perhaps our evaluation shows us that we should reconsider our approach to cleaning and structuring the data, or how we frame the project as a whole (e.g. realizing we should treat the problem as a classification rather than a regression task). In any of these cases, revisiting the earlier steps is totally encouraged.

Of course, if the results are satisfactory, then we instead move onto deployment!

**_Deployment:_**

During this stage, we'll focus on moving our model into production and automating as much as possible. Everything before this serves as a proof-of-concept or an investigation. If the project has proved successful, then you'll work with stakeholders to determine the best way to implement models and insights. For example, you might set up an automated ETL (Extract-Transform-Load) pipeline that feeds raw data into a database and reformats it so that it is ready for modeling. During the deployment step, you'll actively work to determine the best course of action for getting the results of your project into the wild, and you'll often be involved with building everything needed to put the software into production.

## Data Science follows this general pipeline:

1. Exploratory Data Analysis
    * Data Cleaning - Look for missing values, duplicates, mismatched data types, and [outliers](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/)
    * Data Visualization - Observe distributions and class balance or imbalance, and measure correlations
    * Feature Engineering - Create new features from existing ones (ratios, encoding, splitting datetime, etc.) or transform the data using logarithm or square root.
2. Inferential Statistics
    * Declare null/alternative hypotheses
    * Establish significance level (i.e. alpha)
    * Complete statistical tests
    * Interpret results (p-value)
3. Modeling
    * Machine Learning
        * Baseline model comparison & model selection
        * Split data into train/test groups
        * Data preprocessing (SMOTE, scaling, PCA/LDA) - fit on X_train only, then apply the fitted transformers to X_test (resampling such as SMOTE is applied to the training data only)
        * Model training
        * Hyperparameter tuning (GridSearch)
        * Model evaluation
    * Deep Learning
        * Model design - ANN, CNN, RNN
        * Data Preprocessing - Normalize/standardize inputs
        * Model training & optimization (SGD, adam)
        * Regularization - Reduce overfitting (batch normalization, dropout, etc.)
        * Model evaluation
4. Deployment
    * Model serialization - Save the model without the need to retrain
    * API Development - Allows other systems to interact with the model
    * Containerization - Package model and dependencies together
    * Cloud deployment - Ensures scalability
    * Integration - Make accessible to non-technical users
    * Maintenance - Monitor performance and retrain as necessary
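
The split, preprocess, train, and evaluate steps of the Machine Learning branch can be sketched as follows. This is only an illustration under assumptions: the built-in breast cancer dataset, the `StandardScaler`, and the logistic regression model are stand-ins chosen for the example. The point it shows is that the scaler learns its statistics from the training split and is then reused, unchanged, on the test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Example dataset (an assumption; any feature matrix X and target y would do)
X, y = load_breast_cancer(return_X_y=True)

# Split data into train/test groups
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Preprocessing: learn mean/std from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse those statistics on the test data
X_test_scaled = scaler.transform(X_test)

# Model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Model evaluation on data the model has never seen
test_accuracy = model.score(X_test_scaled, y_test)
```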

#### Outliers
Identifying Outliers:
* IQR rule - Flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1
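
A minimal sketch of the 1.5 * IQR rule with NumPy. The sample values are made up for illustration; points falling more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged.

```python
import numpy as np

# Hypothetical sample with one suspicious value
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```

Flagging a point is only the first step; whether to keep, transform, or drop it is a separate decision.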

Handling Outliers:
1. Investigate First:
    * Understand the cause of the outliers (error, natural variability, or special cases?).
2. Use Domain Knowledge:
    * Collaborate with subject matter experts to decide whether the outliers are meaningful or spurious.
3. Transform Data:
    * Use transformations (e.g., log, square root) to reduce the impact of outliers without removing them.
4. Use Robust Models:
    * Some machine learning algorithms (e.g., decision trees, random forests) are less sensitive to outliers.
5. Explore Alternatives:
    * Other methods like z-scores, Mahalanobis distance (for multivariate data), or robust statistical techniques might be more appropriate depending on your data and goals.
6. Document Decisions:
    * Always document how and why outliers were handled to ensure reproducibility and explainability.

#### Principal Component Analysis (PCA)

PCA re-encodes a dataset into an alternative basis (the axes). Here are the steps:

1. Recenter each feature of the dataset by subtracting that feature's mean from the feature vector
2. Calculate the covariance matrix for your centered dataset
3. Calculate the eigenvectors and eigenvalues of the covariance matrix, and order the eigenvectors by decreasing eigenvalue
4. Project the dataset into the new feature space: Multiply the eigenvectors by the mean-centered features

Fortunately, scikit-learn does this for us! We can instantiate `PCA` with `n_components` set to either a variance threshold (a float between 0 and 1) or the number of components to keep (an integer from 1 to n).
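
To connect the manual steps to the scikit-learn call, here is a sketch that runs both on the same data. The synthetic dataset is an assumption made for the example. Eigenvectors are only defined up to sign, so the manual projection and scikit-learn's output may differ by a sign flip per component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption for the example)
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

# 1. Recenter each feature
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix of the centered data (features are columns)
cov = np.cov(X_centered, rowvar=False)
# 3. Eigen-decomposition, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the first two principal components
X_manual = X_centered @ eigvecs[:, :2]

# The same projection with scikit-learn, keeping 2 components
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)

# Alternatively, keep as many components as needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
n_kept = pca_95.n_components_
```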

### Supervised Machine Learning

#### KNN

The K-Nearest Neighbors algorithm works as follows: 

1. Choose a point 
2. Find the K-nearest points
    1. K is a predefined user constant such as 1, 3, 5, or other odd number. 
3. Predict a label for the current point:
    1. Classification - Take the most common class of the k neighbors
    2. Regression - Take the average target metric of the k neighbors
    3. Both classification and regression can also be modified to use weighted averages based on the distance of the neighbors 

Assumption: **_Distance helps us quantify similarity_** - Objects that are more alike are more likely to be the same class. By treating each column in your dataset as a separate dimension, you can plot each data point that you have and measure the distance between them!

Distance Metrics: 
* Minkowski - A generalized distance metric across a _Normed Vector Space_. A Normed Vector Space is just a fancy way of saying a space in which every vector is assigned a length by some function (the norm). It can be any function, as long as it meets two criteria: 
    1. the zero vector (just a vector filled with zeros) will output a length of 0, and 
    2. every other vector must have a positive length 
    * `Formula`: $ d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$
* Manhattan - Measures the distance from one point to another traveling along the axes of a grid.
    * `Formula`: $ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $
* Euclidean - Measures the distance between two points, by moving in a straight line.
    * `Formula`: $ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $
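
The three formulas are closely related: setting $c = 1$ in the Minkowski formula gives the Manhattan distance, and $c = 2$ gives the Euclidean distance. A small sketch (the function name `minkowski` and the sample points are chosen just for this example):

```python
import numpy as np


def minkowski(x, y, c):
    """Minkowski distance between two points for a given exponent c."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** c) ** (1 / c)


a, b = [0, 0], [3, 4]
manhattan = minkowski(a, b, c=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, c=2)  # sqrt(3**2 + 4**2) = 5
```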

KNN stores the entire training set in memory, which is a major constraint for large datasets.

During the "predict" step, KNN takes a point that you want a class prediction for, and calculates the distances between that point and every single point in the training set. It then finds the `K` closest points, or **_Neighbors_**, and examines the labels of each. You can think of each of the K-closest points getting to 'vote' about the predicted class. Naturally, they all vote for the same class that they belong to. The majority wins, and the algorithm predicts the point in question as whichever class has the highest count among all of the k-nearest neighbors.
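
The predict step described above can be written in a few lines. This is a bare-bones sketch using Euclidean distance and unweighted voting; the function name `knn_predict` and the toy data are assumptions for the example, and a library implementation such as scikit-learn's `KNeighborsClassifier` would be the usual choice in practice.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the class of `query` by majority vote of its k nearest neighbors."""
    # Distance from the query to every single training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Each neighbor votes for its own class; the most common class wins
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

prediction = knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3)
```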

In general, the smaller K is, the tighter the "fit" of the model. Remember that with supervised learning, you want to fit a model to the data as closely as possible without **_overfitting_** to patterns in the training set that don't generalize.  This can happen if your model pays too much attention to every little detail and makes a very complex decision boundary. Conversely, if your model is overly simplistic, then you may have **_underfit_** the model, limiting its potential. A visual explanation helps demonstrate this concept in practice:

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/fit_fs.png" width = "700">

KNN isn't the best choice for extremely large datasets and/or data with high dimensionality. With a brute-force search, every prediction requires computing the distance from the query point to every point in the training set, so the cost of a single prediction grows linearly with both the number of training points and the number of features, and the whole training set has to stay in memory. High-dimensional data causes a second problem, often called the "curse of dimensionality": as the number of dimensions grows, distances between points become less informative. That said, for smaller datasets, KNN often works surprisingly well, given the simplicity of the overall algorithm.

#### Grid Search Cross Validation

A decision tree's hyperparameters work together to create the framework of the tree that will be trained. For a given problem, it may be the case that increasing the value of `min_samples_split` generally improves model performance up to a certain point, by reducing overfitting. However, if the value for `max_depth` is too high or too low, this may doom the model to overfitting or underfitting, respectively: a tree with too many arbitrary levels and splits overfits on noise, while a tree that is only allowed to grow one or two levels is nothing more than a "stump".

So how do we know which combination of parameters is best? The only way we can really know for sure is to try **_every single combination!_** For this reason, grid search is sometimes referred to as an **_exhaustive search_**. 

The following code snippet demonstrates how to use `GridSearchCV` to perform a parameter grid search using a sample parameter grid, `param_grid`. Our parameter grid should be a dictionary, where the keys are the parameter names, and the values are the different parameter values we want to use in our grid search for each given key. After creating the dictionary, all you need to do is pass it to `GridSearchCV()` along with the classifier. You can also use K-fold cross-validation during this process, by specifying the `cv` parameter. In this case, we choose to use 3-fold cross-validation for each model created inside our grid search. 

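```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Keys are parameter names; values are the candidate settings to try
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]  # integer values must be at least 2
}

# train_data and train_labels are assumed to be defined already
gs_tree = GridSearchCV(clf, param_grid, cv=3)
gs_tree.fit(train_data, train_labels)

# Best parameter combination found by the search
gs_tree.best_params_
```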

This code will run all combinations of the parameters above.

##### Drawbacks of `GridSearchCV`

GridSearchCV is a great tool for finding the best combination of parameters. However, it is only as good as the parameters we put in our parameter grid -- so we need to be very thoughtful during this step! 

The main drawback of an exhaustive search such as `GridSearchCV` is that there is no way of telling what's best until we've exhausted all possibilities! This means training many versions of the same machine learning model, which can be very time consuming and computationally expensive. Consider the example code above -- we have three different parameters, with 2, 4, and 4 variations to try, respectively. We also set the model to use cross-validation with a value of 3, meaning that each model will be built 3 times, and their performances averaged together. If we do some simple math, we can see that this simple grid search actually results in `2 * 4 * 4 * 3 =` **_96 different models trained!_** (By default, `GridSearchCV` then refits the best combination once more on the full training set.) For projects that involve complex models and/or very large datasets, the time needed to run a grid search can often be prohibitive. For this reason, be very thoughtful about the parameters you set -- sometimes the extra runtime isn't worth it -- especially when there's no guarantee that the model performance will improve!
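
The arithmetic can be checked before launching a search by counting the combinations with scikit-learn's `ParameterGrid`. The grid below mirrors the example, except that `min_samples_split` starts at 2, the smallest integer value scikit-learn accepts for that parameter.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 5, 10],
    "min_samples_split": [2, 5, 10, 20],
}

# Number of distinct parameter combinations: 2 * 4 * 4
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per cross-validation fold
cv_folds = 3
n_fits = n_combinations * cv_folds
```

Adding a single extra parameter with five candidate values would multiply `n_fits` by five, which is why grids grow expensive so quickly.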



#### Integrating Grid Search in Pipelines
See [Part 1](https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html) and [Part 2](https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-2.html) of the KDnuggets blog post series for examples.

First, you define the pipeline in the same way as above. Next, you create a parameter grid. When this is all done, you use the function `GridSearchCV()`, which you've seen before, and specify the pipeline as the estimator and the parameter grid. You also have to define how many folds you'll use in your cross-validation. 

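```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Create the pipeline
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])

# Create the parameter grid; the "tree__" prefix routes each parameter to the 'tree' step
grid = [{'tree__max_depth': [None, 2, 6, 10], 
         'tree__min_samples_split': [5, 10]}]

# Create the grid search, with "pipe" as the estimator
gridsearch = GridSearchCV(estimator=pipe, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

# Fit using grid search (X_train, y_train, X_test, y_test are assumed to exist already)
gridsearch.fit(X_train, y_train)

# Calculate the test score using the best pipeline found
gridsearch.score(X_test, y_test)
```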

#### Ensemble Methods
In Data Science, the term **_ensemble_** refers to an algorithm that makes use of more than one model to make a prediction. These are typically ***supervised*** machine learning models.

Ensemble methods take advantage of the Delphi technique (or "wisdom of crowds"), where the average of multiple independent estimates is usually more consistently accurate than the individual estimates.



##### Bagging
The main concept that makes ensembling possible is **_Bagging_**, which is short for **_Bootstrap Aggregation_**. Bootstrap aggregation is itself a combination of two ideas -- bootstrap resampling and aggregation. `Bootstrapping` refers to creating subsets of your dataset by sampling with replacement. Aggregation is exactly as it sounds -- the practice of combining all the different estimates to arrive at a single estimate.
* A common approach is to treat each classifier in the ensemble's prediction as a "vote" and let our overall prediction be the majority vote. 
* It's also common to see ensembles that take the arithmetic mean of all predictions, or compute a weighted average.

Process:
1. Grab a sizable sample from your dataset, with replacement.
2. Train a classifier on this sample.
3. Repeat until all classifiers have been trained on their own sample from the dataset.
4. When making a prediction, have each classifier in the ensemble make a prediction.
5. Aggregate all predictions from all classifiers into a single prediction, using the method of your choice.

Decision trees are often used because they are high-variance models: small changes in the training data can produce very different trees. On their own, this is a weakness. However, when aggregated together into an ensemble, this actually becomes a good thing!
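
The bagging process can be sketched by hand with decision trees. The synthetic dataset, the number of trees (25), and the simple majority vote are assumptions made for this illustration; scikit-learn's `BaggingClassifier` packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Steps 1-3: draw a bootstrap sample (with replacement) and train a tree on it
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 4: every tree makes its own prediction, shape (n_trees, n_test_rows)
all_votes = np.array([tree.predict(X_test) for tree in trees])

# Step 5: aggregate by majority vote (an odd number of trees avoids ties)
ensemble_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
ensemble_accuracy = (ensemble_pred == y_test).mean()
```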

#### Random Forests

The **_Random Forest_** algorithm is a supervised learning algorithm and ensemble technique built from Decision Trees. Because a Decision Tree greedily maximizes information gain at every step, trees trained on the same data would all come out the same. To make the trees differ, we limit the samples each tree is trained on (Bagging; a bootstrap sample contains roughly 2/3 of the distinct rows) and limit the number of features each tree is trained on (Subspace Sampling; a hyperparameter). Each tree then "votes" for a prediction, and the label with the highest count is assigned.

Example:

Let's say we have a training dataset that consists of 3000 rows and 10 columns.
1. Bag 2/3 of the overall data (2000 rows).
2. Randomly select a set number of features to use for training each node within this tree (tunable; let's choose 6).
3. Train the tree on the modified dataset, which is now a DataFrame consisting of 2000 rows and 6 columns.
4. Drop the unused columns from step 3 from the out-of-bag rows that weren't bagged in step 1, and then use this as an internal testing set to calculate the out-of-bag error for this particular tree.

Pros:
* Robust to variance
* High performance

Cons:
* Slow on large datasets
* High memory consumption (each tree is stored)
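
A sketch of the same idea with scikit-learn's `RandomForestClassifier`, using a synthetic 3000 x 10 dataset to echo the example above. One difference worth flagging: in scikit-learn, `max_features` limits the features considered at each split rather than fixing one subset per tree. Setting `oob_score=True` asks the forest to evaluate itself on the out-of-bag rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset with 3000 rows and 10 columns (an assumption for the example)
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest
    max_features=6,     # features considered at each split
    oob_score=True,     # score each row using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

# Accuracy estimated on out-of-bag rows; no separate validation set needed
oob_accuracy = forest.oob_score_
```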

##### Boosting

Utilizes `weak learners` that only perform slightly better than random chance. The process can use any algorithm (but mainly uses Decision Trees) and works as follows:

1. Train a single weak learner.  
2. Figure out which examples the weak learner got wrong.
3. Build another weak learner that focuses on the areas the first weak learner got wrong.
4. Continue this process until a predetermined stopping condition is met, such as until a set number of weak learners have been created, or the model's performance has plateaued.

Standout Features include:
* Each model used as a 'weak learner' is intentionally limited (such as max_depth=1 for Decision Trees)
* Iterative training (each weak learner improves on the previous)
* Utilizes weights to determine the importance of the previous trees' predicted labels.
    * Example: If there are many learners in the overall ensemble that can get the same questions right, then that tree isn't super important -- other trees already provide the same value that it does. This tree will have its overall weight reduced. As more and more trees get a hard problem wrong, the "reward" for a tree getting that hard problem correct goes higher and higher. This "reward" is actually just a higher weight when calculating the overall vote.

Pros:
* Often resistant to overfitting in practice, although boosting algorithms such as AdaBoost can be sensitive to noisy labels and outliers
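
As a sketch of boosting in practice, scikit-learn's `AdaBoostClassifier` uses a depth-1 decision tree (a "stump") as its default weak learner. The synthetic dataset and the choice of 50 learners are assumptions for the example; `staged_score` reports test accuracy after each additional weak learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default weak learner: a decision tree limited to max_depth=1
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)

# Accuracy on the test set after 1, 2, 3, ... weak learners
staged_accuracy = list(booster.staged_score(X_test, y_test))
test_accuracy = booster.score(X_test, y_test)

# Weight each weak learner carries in the final vote
learner_weights = booster.estimator_weights_
```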


##### Gradient Descent

##### Learning Rate
Often, we want to artificially limit the "step size" we take in gradient descent. Small, controlled changes in the parameters we're optimizing with gradient descent will mean that the overall process is slower, but the parameters are more likely to converge to their optimal values. The learning rate for your model is a small scalar meant to artificially reduce the step size in gradient descent. Learning rate is a tunable parameter for your model that you can set. Large learning rates get close to the optimal values more quickly, but have trouble landing exactly at the optimal values because the step size is too big for the small distances it needs to travel when it gets close; a learning rate that is too large can even overshoot so badly that the parameters diverge. Conversely, small learning rates mean the model will take longer to get to the optimal parameters, but when it does get there, it will typically be very close to the optimal values.
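
The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes $f(x) = x^2$ (gradient $2x$, minimum at $x = 0$); the function, starting point, step count, and the three learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, n_steps=50, start=10.0):
    """Minimize f(x) = x**2 by repeatedly stepping against the gradient 2 * x."""
    x = start
    for _ in range(n_steps):
        gradient = 2 * x
        x = x - learning_rate * gradient
    return x


small = gradient_descent(learning_rate=0.01)     # slow: still far from 0 after 50 steps
medium = gradient_descent(learning_rate=0.1)     # ends up very close to 0
too_large = gradient_descent(learning_rate=1.1)  # overshoots every step and diverges
```

With the same number of steps, the smallest rate has not arrived yet, the middle rate lands near the minimum, and the largest rate moves further away on every step.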

Ensemble Takeaways:
* The average of multiple independent estimates is usually more accurate than any single estimate, so ensemble techniques are a powerful way to improve the quality of your models.
* Sometimes you'll use model stacking or meta-ensembles where you use a combination of different types of models for your ensemble.
* It's also common to have multiple similar models in an ensemble - e.g. a bunch of decision trees.
* Two of the most common algorithms for Boosting are AdaBoost (Adaptive Boosting) and Gradient Boosted Trees.
* AdaBoost creates new classifiers by continually influencing the distribution of the data sampled to train each successive tree.
* Gradient Boosting is a more advanced boosting algorithm that makes use of Gradient Descent.
    * `XGBoost` is a stand-alone library that implements gradient boosting with a strong focus on speed and performance. It is one of the most widely used gradient boosting implementations.

#### Recommendation Systems

***Recommendation Systems are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. They have the potential to support and improve the quality of the
decisions consumers make while searching for and selecting products online.*** - [Bo Xiao and Izak Benbasat, 2017](https://misq.org/e-commerce-product-recommendation-agents-use-characteristics-and-impact.html)

They are used to:
- Help in suggesting the merchants/items which a customer might be interested in after buying a product in a marketplace.
- Estimate profit & loss of many competing items and make recommendations to the customer (e.g. buying and selling stocks).
- Based on the experience of the customer, recommend a customer-centric or product-centric offering.
- Enhance customer engagement by providing offers which can be highly appealing to the customer.

Considerations:
* Past purchases
* Ratings of items
* Demographics
* Interest scores for item features 

Data Collection:
* Ask for explicit ratings from a user
* Gather data implicitly while the user is in the domain of the system, that is, log the actions of a user on the site.

##### Content-Based Recommenders 
> __Main Idea__: If you like an item, you will also like "similar" items.

<img src="https://raw.githubusercontent.com/learn-co-curriculum/dsc-recommendation-system-introduction/master/images/content_based.png" alt="content based filtering. user watches movies, then similar movies are recommended to the user" width="500">

These systems are based on the characteristics of the items themselves. If you ever see a banner ad saying "try other items like this", it is most likely a content-based recommender system. The advantage of a content-based recommender system is that it gives the user a bit more information as to why they are seeing these recommendations. If they are on the page of a book they very much like, they will be happy to see another book that is similar to it. If they are told that this book is similar to their favorite book, they're more than likely to get that book. A disadvantage of content-based recommender systems is that they often require manual or semi-manual tagging of each of the products. More advanced versions of content-based recommender systems build an average of all the items a user has liked. This allows for a more nuanced approach that incorporates more than one item when calculating which items are most similar.
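
A minimal content-based sketch, assuming each item has already been tagged by hand. The titles, the four tags, and the use of cosine similarity are all hypothetical choices for this example. Passing several liked titles averages their tag vectors into a single user profile, as described above.

```python
import numpy as np

# Hypothetical catalog: each row tags one title as [sci_fi, action, romance, comedy]
titles = ["Space Saga", "Galactic War", "Love Story", "Rom-Com Classic", "Space Romance"]
features = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two tag vectors (1 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(liked_titles, top_n=2):
    """Rank unseen titles by similarity to the average of the liked titles."""
    liked_idx = [titles.index(t) for t in liked_titles]
    profile = features[liked_idx].mean(axis=0)
    scores = [
        (cosine_similarity(profile, features[i]), titles[i])
        for i in range(len(titles)) if i not in liked_idx
    ]
    return [title for _, title in sorted(scores, reverse=True)[:top_n]]


similar_to_one = recommend(["Space Saga"])
similar_to_many = recommend(["Love Story", "Rom-Com Classic"], top_n=1)
```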

##### Collaborative Filtering Systems
> __Main Idea__: If user A likes items 5, 6, 7, and 8 and user B likes items 5, 6, and 7, then it is highly likely that user B will also like item 8.

<img src="https://raw.githubusercontent.com/learn-co-curriculum/dsc-recommendation-system-introduction/master/images/collaborative_filtering.png" alt="collaborative filtering: movies watched by both users indicate that the users are similar, then movies are recommended by one user to another user" width="450">

Collaborative filtering systems use a collection of user ratings of items to make recommendations. The issue with collaborative filtering is what is called the "cold start problem": how do you recommend something based on user activity if you do not have any user activity to begin with? This can be overcome through various techniques. The most important thing to realize is that there is no one best recommendation system technique. In the end, what matters most is which system actually gives people recommendations that they will act upon. It might be that, in the aggregate, recommending the most popular items is the most cost-effective way to introduce users to new products.
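
The main idea can be sketched as user-based collaborative filtering on a tiny "likes" matrix. The matrix, the third user, and the cosine-similarity weighting are assumptions made for this example; it reproduces the scenario where user B is recommended item 8 because the similar user A liked it.

```python
import numpy as np

# Hypothetical like matrix: rows are users, columns are items 5-9 (1 = liked)
item_ids = [5, 6, 7, 8, 9]
likes = np.array([
    [1, 1, 1, 1, 0],  # user A
    [1, 1, 1, 0, 0],  # user B
    [0, 0, 0, 0, 1],  # user C
], dtype=float)


def recommend_for(user, likes):
    """Recommend the unseen item with the highest similarity-weighted vote."""
    # Cosine similarity between the target user and every user
    norms = np.linalg.norm(likes, axis=1)
    sims = likes @ likes[user] / (norms * norms[user])
    sims[user] = 0.0  # a user should not vote for themselves
    # Each item's score is the sum of the similarities of the users who liked it
    scores = sims @ likes
    scores[likes[user] > 0] = -np.inf  # skip items the user already likes
    return item_ids[int(np.argmax(scores))]


recommendation_for_b = recommend_for(1, likes)
```

A brand-new user with an all-zero row has no similarity to anyone (the norm is zero), which is the cold start problem in miniature.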

##### Singular Value Decomposition (SVD)
Pros:
* Efficient for matrix factorization in collaborative filtering.
* Helps reduce dimensionality, improving model interpretability.
* Works well with sparse matrices and can capture latent factors.

Cons:
* Assumes linear relationships in data, which might not capture all complexities.
* Can be computationally expensive for large datasets, because factorizing a large matrix is costly.
* May struggle with handling implicit data (like views or clicks) unless modified.

##### Alternating Least Squares (ALS)
Pros:
* Works well with implicit feedback data (e.g., clicks, views).
* Scales efficiently to large datasets and sparse matrices.
* More flexible for large-scale recommender systems, especially when data is implicit.

Cons:
* Requires careful tuning of parameters to prevent overfitting.
* Slower convergence in some cases compared to other methods like stochastic gradient descent.
* Can be sensitive to regularization choices.

### Unsupervised Learning

#### Clustering
##### Market Segmentation
A regression analysis on last year's data can give you a general idea of how much you can expect to make overall, assuming that there aren't major differences between last year and this year. However, regression just tells you what you can expect _overall_ -- what if we're trying to optimize where we spend our money, rather than just predict what the returns will be, based on the overall amount of money we spent? By identifying **_segments_** in our customer data, we can look for trends that identify one group or another, and create personalized regression models for each group.

By definition, **market segments** are groups within our dataset with substantive differences between them. Clustering provides a great way for us to allow the data to tell us what is and isn't significant -- lest we get caught up chasing down market segments that aren't actually all that different -- or worse, don't actually exist at all!

After we've identified the different market segments, the next step is to build individualized strategies to **_Target_** them! We should first start by answering questions such as "which market segment is most valuable to us?" This can be answered through research or through analyzing our data, or a combination of both.

The third step in this process is a bit outside the scope of clustering. This is where the marketing team really shines -- figuring out how to position our product to make it both as desirable as possible to a given segment, while also making our product stand out from competitors.

##### K-Means Clustering
**_K-means clustering_** is the most well-known clustering technique, and it belongs to the class of non-hierarchical clustering methods. When performing k-means clustering, you're essentially trying to find $k$ cluster centers as the mean of the data points that belong to these clusters. One challenging aspect of k-means is that the number _k_ needs to be decided upon before you start running the algorithm.

The k-means clustering algorithm is an iterative algorithm that reaches for a pre-determined number of clusters within an unlabeled dataset, and basically works as follows:

1. Select $k$ initial seeds 
2. Assign each observation to the cluster to which it is "closest"
3. Recompute the cluster centroids
4. Reassign the observations to one of the clusters according to some rule
5. Stop if there is no reallocation 

Two assumptions are of main importance for the k-means clustering algorithm:

1. To compute the "cluster center", you calculate the (arithmetic) mean of all the points belonging to the cluster. Each cluster center is recalculated at the beginning of each new iteration
2. After the cluster center has been recalculated, if a given point is now closer to a different cluster center than the center of its current cluster, then that point is reassigned to the closest center
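
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the toy points, and the choice to seed centroids by sampling random observations are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Arithmetic mean of a list of points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: select k initial seeds
    assignments = None
    for _ in range(max_iter):
        # steps 2 and 4: (re)assign each observation to its closest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # step 5: stop if nothing moved
            break
        assignments = new_assignments
        # step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical toy data: two visibly separate groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

In practice you would more likely reach for a library implementation such as scikit-learn's `KMeans`; the sketch only exists to make the assign/recompute loop concrete.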

A common way to choose the best value of k is the **_Variance Ratio_**.

The _variance ratio_ (also known as the Calinski-Harabasz score) compares the variance between clusters to the variance of the points within each cluster. Intuitively, we can understand that we want intra-cluster variance to be low (suggesting that the clusters are tightly knit), and inter-cluster variance to be high (suggesting that there is little to no ambiguity about which cluster the points belong to). A higher variance ratio therefore indicates better-defined clusters.
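
As a rough sketch of how such a score can be computed, the snippet below assumes the Calinski-Harabasz definition: between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom. The helper names and the toy data are hypothetical.

```python
def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def variance_ratio(points, labels):
    """Between-cluster dispersion over within-cluster dispersion."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    n, k = len(points), len(clusters)
    overall_center = mean_point(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = mean_point(members)
        # how far this cluster sits from the overall center, weighted by size
        between += len(members) * squared_distance(center, overall_center)
        # how spread out the members are around their own center
        within += sum(squared_distance(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = variance_ratio(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = variance_ratio(points, [0, 1, 0, 1, 0, 1])   # labels ignore the two groups
print(good > bad)
```

scikit-learn exposes this metric as `sklearn.metrics.calinski_harabasz_score`, which is what you would normally call instead of rolling your own.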

The advantages of the k-means clustering approach are:

* Very easy to implement!
* With many features, k-means is usually faster than HAC (as long as $k$ is reasonably small).
* Objects are not locked into the cluster they are first assigned to, and can change clusters as the centroids move around.
* Clusters are often tighter than those formed by HAC.

However, this algorithm often comes with several disadvantages:

* Quality of results depends on picking the right value for $k$. This can be a problem when we don't know how many clusters to expect in our dataset.
* Scaling our dataset can substantially change the results, because k-means relies on distances between points.
* Initial start points of each centroid have a very strong impact on our final results. A bad start point can cause sub-optimal clusters.
* Dimensionality has a significant effect on results. Consider assessing performance with and without PCA.

##### Hierarchical Agglomerative Clustering (HAC)
###### Linking Similar Clusters Together

Several linkage criteria, each with a different definition of "most similar clusters", can be used. The measure is always computed between two clusters that exist at that point in the algorithm, so the later in the process, the bigger the clusters being compared.

Three commonly used linkage criteria available in scikit-learn are:

- **ward** (default): picks the two clusters to merge in a way that the variance within all clusters increases the least. Generally, this leads to clusters that are fairly equally sized.
- **average**: merges the two clusters that have the smallest **_average_** distance between all the points.
- **complete** (or maximum linkage): merges the two clusters that have the smallest **_maximum_** distance between their points.
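
To make the three criteria concrete, here is a small sketch that scores a single candidate merge under each of them. The function names and the two toy clusters are assumptions for illustration; the Ward score uses the standard identity that merging clusters A and B increases the total within-cluster sum of squares by $\frac{|A||B|}{|A|+|B|}\lVert c_A - c_B \rVert^2$.

```python
from itertools import product
from math import dist


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def average_linkage(a, b):
    """Mean distance over every pair with one point from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def complete_linkage(a, b):
    """Largest distance between a point in a and a point in b."""
    return max(dist(p, q) for p, q in product(a, b))


def ward_increase(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * dist(centroid(a), centroid(b)) ** 2


# Hypothetical clusters
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (3.0, 1.0)]

print(average_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(ward_increase(cluster_a, cluster_b))
```

At each step, HAC would compute the chosen score for every pair of current clusters and merge the pair with the smallest value.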

HAC is useful as a clustering algorithm because:

* It produces an ordered relationship between clusters, which can be useful when visualized.
* It creates smaller clusters. This allows you to get a very granular understanding of the dataset, and zoom in at the level where the clusters make the most sense to you.

However, this algorithm is also built on some assumptions which can be disadvantages:

* Results are usually dependent upon the distance metric used.
* Objects can be grouped 'incorrectly' early on, with no way to relocate them. For instance, consider two points that belong to separate clusters, but are both nearer to each other than the center of the cluster they actually belong to (both are near the "boundary" between their cluster and the opposing cluster). These will be incorrectly grouped as a cluster, which will throw off the clustering of the groups they actually belong to, as well.

##### Evaluation
An _elbow plot_ is a general term for plots like the one below, where we can easily see the point at which we hit diminishing returns.

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-k-means-clustering/master/images/new_elbow-method.png' alt="Calinski Harabaz scores for different values of k" width='500'>

In the plot above, we can see that performance peaks at _k=6_, and then begins to drop off. That tells us that our data most likely has 6 naturally occurring clusters.

Elbow plots aren't exclusively used with variance ratios -- it's also quite common to calculate something like distortion (another clustering metric), which will result in a graph with a negative as opposed to a positive slope. 

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-k-means-clustering/master/images/new_elbow_2.png' alt="the elbow method showing the optimal k" width="500">

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-hierarchical-agglomerative-clustering/master/images/dendrogram_gif.gif' alt="animation of clusters shown in x-y space on the left and a dendrogram on the right, showing which clusters correspond to which parts of the dendrogram">

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-hierarchical-agglomerative-clustering/master/images/new_clustergram.png' alt="another view of clusters on the left and dendrogram on the right" width='600'>

* A _clustergram_ (left) depicts each cluster and the data points that it includes.
* A _dendrogram_ (right) visualizes the hierarchical relationship between the various clusters that are computed at each step. Dendrograms are very useful for seeing how the clusters change depending on the distance (here, Euclidean) at which they are merged.

So how to interpret this dendrogram? At the very bottom of the dendrogram, the data points are represented as individual clusters. Moving up, the first merged clusters start to form, starting with data points 12 and 15, and next data points 2 and 6, next 4 and 5, etc, until all the clusters are merged together. This along with the plot created through `plot_agglomerative()` gives basically a complete view of how clusters are created using the ward algorithm. 

Let's look at the y-axis next. The length of each branch shows how far apart the two merged clusters are. If the branches that take us from $k$ to $k-1$ clusters are very long, it means that the merged clusters are far apart. It might then make sense to stick to $k$ clusters!
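
The "long branch" idea can be illustrated by recording the distance at which each merge happens. The sketch below is a deliberately naive agglomerative loop that assumes complete linkage rather than Ward (to keep the code short); the `agglomerate` function and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist


def complete_linkage(a, b):
    return max(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns a list of (clusters_before_merge, merge_distance) tuples,
    which corresponds to reading a dendrogram from bottom to top.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((len(clusters), complete_linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
merges = agglomerate(points)
for n_clusters, distance in merges:
    print(f"{n_clusters} -> {n_clusters - 1} clusters at distance {distance:.2f}")
```

For this data the first two merges happen at a small distance and the final merge (from 2 clusters to 1) happens at a much larger one, which is the numerical counterpart of a long branch and suggests keeping 2 clusters.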

##### Key Takeaways
The key takeaways from this section include:
* There are two main types of clustering algorithms: non-hierarchical clustering (k-means) and hierarchical agglomerative clustering
* You can quantify the performance of a clustering algorithm using metrics such as variance ratios
* When working with the k-means clustering algorithm, it is useful to create elbow plots to find an optimal value for $k$
* When using hierarchical agglomerative clustering, different linkage criteria can be used to determine which clusters should be merged and at what point
* Dendrograms and clustergrams are very useful visual tools in hierarchical agglomerative clustering 
* Advantages of k-means clustering include easy implementation and speed, whereas the main disadvantage is that it isn't always straightforward how to pick the "right" value for $k$ 
* Advantages of hierarchical agglomerative clustering include easy visualization and intuitiveness, whereas the main disadvantage is that the result is very distance-metric-dependent
* You can use supervised and unsupervised learning together in a few different ways. Applications of this are look-alike models in market segmentation and semi-supervised learning

#### Semi-Supervised Learning
The main idea behind _semi-supervised learning_ is to generate **_pseudo-labels_** that are likely to be correct (at least more often than random chance). To do this, we don't usually use clustering algorithms -- instead, we use our supervised learning algorithms to label the unlabeled data. The main benefit is that it helps companies generate more revenue, get more customers, or increase model performance without paying for more labeled training data!

For example:

We are trying to build a supervised learning model, and we have 100,000 observations in our dataset. However, labels are exceedingly expensive, so only 5,000 of these 100,000 observations are labeled. In traditional supervised learning, this means that in a practical sense, we really only have a dataset of 5,000 observations, because we can't do anything with the 95,000 unlabeled examples. However, with semi-supervised learning we can, using this process:

1. **_Train your model on your labeled training data_**. In the case of our example above, we would build the best model possible with our tiny dataset of 5,000 labeled examples. 

2. **_Use your trained model to generate pseudo-labels for your unlabeled data_**. This means having our trained model make predictions on our 95,000 unlabeled examples. Since our trained model does better than random chance, this means that our generated pseudo-labels will be at least somewhat more correct than random chance. We can even put a number to this, by looking at the performance our trained model had on the test set. For example, if our trained model had an accuracy of ~70%, then we can assume that ~70% of the pseudo-labels will be correct, ~30% will be incorrect. 

3. **_Combine your labeled data and your pseudo-labeled data into a single, new dataset_**. This means that we concatenate all our labeled data of 5,000 examples with the 95,000 pseudo-labeled examples. 

4. **_Retrain your model on the new dataset_**. Although some of the pseudo-labeled data will certainly be wrong, it's likely that the amount that is correct will be more useful, and the signal that these correctly pseudo-labeled examples provide will outweigh the incorrectly labeled ones, thereby resulting in better overall model performance. 
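
The four steps above can be sketched as follows. To keep the example dependency-free, it assumes a toy nearest-centroid classifier (`NearestCentroid` below is a hypothetical stand-in written for this sketch, not a library class) and a handful of made-up points; any supervised model with `fit` and `predict` methods would slot into the same pattern.

```python
from math import dist


class NearestCentroid:
    """Toy classifier: predicts the label whose class centroid is closest."""

    def fit(self, X, y):
        groups = {}
        for x, label in zip(X, y):
            groups.setdefault(label, []).append(x)
        self.centroids_ = {
            label: tuple(sum(coords) / len(pts) for coords in zip(*pts))
            for label, pts in groups.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda label: dist(x, self.centroids_[label]))
            for x in X
        ]


# Hypothetical data: a small labeled set and a pool of unlabeled points
X_labeled = [(0.0, 0.0), (1.0, 0.5), (5.0, 5.0), (6.0, 5.5)]
y_labeled = ["a", "a", "b", "b"]
X_unlabeled = [(0.5, 1.0), (0.2, 0.1), (5.5, 6.0), (6.5, 5.0)]

# 1. Train on the labeled data only
model = NearestCentroid().fit(X_labeled, y_labeled)

# 2. Generate pseudo-labels for the unlabeled data
pseudo_labels = model.predict(X_unlabeled)

# 3. Combine labeled and pseudo-labeled data into one dataset
X_combined = X_labeled + X_unlabeled
y_combined = y_labeled + pseudo_labels

# 4. Retrain on the combined dataset
model = NearestCentroid().fit(X_combined, y_combined)
print(pseudo_labels)
```

A common refinement, not shown here, is to keep only the pseudo-labels the model is most confident about before retraining.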

##### Risks of Semi-Supervised Learning
If a model trained only on the real data with no pseudo-labels got a particular example wrong, then what happens when you train the model on the same example, but this time provide a pseudo-label that "confirms" this incorrect belief? When done correctly, we can hope that the signal provided by all the correctly pseudo-labeled examples will generalize to help the model correct its mistakes on the ones it got wrong. However, if the dataset is noisy, or the original model wasn't that good to begin with (or both), then it can be quite likely that we are introducing even more incorrect information than correct information, moving the model in the wrong direction.

So how do we make sure that we're not making these mistakes when using a semi-supervised approach? **_Use a holdout set!_** You should definitely have a test set that the model has never seen before to check the performance of your semi-supervised model. Obviously, make sure that your test set only contains actual, ground-truth labeled examples, no pseudo-labels allowed! Also, the noisier your dataset or more complicated your problem, the more likely you are to run into trouble with semi-supervised learning.

#### Natural Language Processing (NLP)

**_Natural Language Processing_**, or **_NLP_**, is the study of how computers can interact with humans through the use of human language.  Although this is a field that is quite important to Data Scientists, it does not belong to Data Science alone.  NLP has been around for quite a while, and sits at the intersection of *Computer Science*, *Artificial Intelligence*, *Linguistics*, and *Information Theory*. In the early days of NLP, it mainly consisted of trying to program algorithms that contained many rules borrowed from the field of linguistics. However, in the 1980s, machine learning started to show great success with many NLP tasks, and many of these rule-based methods took a back seat to approaches involving machine learning and AI. Fast forward to now, and NLP has become an area of applied machine learning that Data Scientists all around the globe work in every day. 

Primer:
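* **Context-free grammars** (CFGs) are sets of rules that define which sentence structures are grammatically valid. A CFG only describes syntax, so a bit of text can be grammatically correct under the grammar and still feel like complete nonsense on the semantic level.
* **Parts of Speech** (POS) tagging labels each word in a sentence with its grammatical role (noun, verb, adjective, and so on), which helps a computer understand how to interpret the sentence. The context-free grammar (CFG) then defines the rules for how those parts of speech can combine into valid sentences.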

##### Regular Expressions
Regular Expressions (regex) are used to quickly match patterns and filter through text documents. They are an important tool anytime we need to pull information from a larger text document without manually reading the entire thing. Regex is only as good as the **_Patterns_** we create. We can use these patterns to find or to replace text.

A **_Range_**, such as `[A-Z]`, matches any single character in the given set -- in this case, any uppercase letter. Ranges are always inside of square brackets. We can put many things inside of ranges at the same time, and regex will match on any of them. For instance, if we wanted to find any uppercase letter, lowercase letter, or digit, we could use `[A-Za-z0-9]`.

**_Groups_** are kind of like ranges, but they specify an exact pattern to match on. Groups are denoted by parentheses. Whereas `[A-Z0-9]` matches on any uppercase letter or any digit, `(A-Z0-9)` will only match on the sequence `'A-Z0-9'` exactly. This becomes much more useful when paired with **_Quantifiers_**, which allow us to specify how many times a group should happen in a row. If we want to specify an exact number of times, we can use curly braces. For instance, a group followed by `{3}` will only match on patterns that have that group repeated exactly 3 times. The most common quantifiers are:

* `*` (0 or more times)
* `+` (1 or more times)
* `?` (0 or 1 times)

In this way, we can fill a group with any pattern and specify the number of times we expect to see that pattern. When we combine ranges, groups, and quantifiers, it becomes easy to write a pattern that can match complex things, like email addresses -- note that the particular regex below does not allow for numbers or special characters, and only matches addresses ending in `.com`.

`'([A-Za-z]+)@([A-Za-z]+)\.com'`
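
Here is a short sketch of these pieces using Python's built-in `re` module. The sample strings and the `all_matches` helper are made up for illustration.

```python
import re


def all_matches(pattern, text):
    """Return the full text of every match (ignoring capture groups)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Quantifiers applied to a single character or to a group
print(all_matches(r"go*gle", "ggle gogle google"))   # * : zero or more "o"
print(all_matches(r"go+gle", "ggle gogle google"))   # + : one or more "o"
print(all_matches(r"colou?r", "color colour"))       # ? : zero or one "u"
print(all_matches(r"(ha){3}", "ha haha hahaha"))     # {3}: group exactly 3 times

# The email pattern from the text: two capture groups around "@"
email_pattern = re.compile(r"([A-Za-z]+)@([A-Za-z]+)\.com")
text = "Contact alice@example.com or bob42@example.com or carol@site.org."

# findall returns one tuple of captured groups per match
emails = email_pattern.findall(text)
print(emails)

# The same pattern can be used to replace text
redacted = email_pattern.sub("[email]", text)
print(redacted)
```

Only `alice@example.com` is matched: `bob42@example.com` has digits directly before the `@`, and `carol@site.org` does not end in `.com`, which shows the limitations mentioned above.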

[regexr](https://regexr.com/) allows you to put in a block of text and test your patterns by quickly seeing visually what a pattern will grab out of the text block.

 <img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-regular-expressions/master/images/regex_cheat_sheet.png' alt="regex cheat sheet">


##### Vectorization
The most common approach to working with text is to vectorize it by creating a **_Bag of Words_**.  In this case, the name "Bag of Words" is quite descriptive of the final product -- the bag contains information about all the important words in the text individually, but not in any particular order. If we have a number for every word, then we have a way to treat each bag as a **_vector_**, which opens up all kinds of machine learning tools for use.

Suppose we count how many times each word appears in a sentence that mentions "Apple" twice and "Apple's" once. As readers, we would likely say that "Apple" has a count of three. However, if we wrote a basic Python script to do this, our algorithm would tell us that the word "Apple" only appears twice! To a computer, "Apple" and "Apple's" are different words. Capitalization is also a problem -- "apple" would also be counted as a different word. Similarly, punctuation is also a problem. A basic counting algorithm would see "stock" and "stock." as two completely different words.
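
The counting problem is easy to reproduce. The sentence below is a made-up example (not from the original lesson), and the normalization shown, lowercasing, dropping possessive `'s`, and keeping only letters, is just one simple assumption about how to clean tokens.

```python
import re
from collections import Counter

# Hypothetical example sentence
sentence = "Apple announced that Apple's revenue beat estimates, lifting Apple stock."

# Naive approach: split on whitespace and count
naive_counts = Counter(sentence.split())
print(naive_counts["Apple"], naive_counts["stock"], naive_counts["stock."])

# Normalized approach: lowercase, strip possessives, keep only runs of letters
cleaned = re.sub(r"'s\b", "", sentence.lower())
tokens = re.findall(r"[a-z]+", cleaned)
bag_of_words = Counter(tokens)
print(bag_of_words["apple"], bag_of_words["stock"])
```

The naive counter sees "Apple" twice and never sees "stock" (only "stock."), while the normalized bag of words counts "apple" three times and "stock" once.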

**_Stemming_** and **_Lemmatization_** help us deal with this problem, where we reduce each word token down to its root word.  For cases such as "run", "runs", "running" and "ran", they are more similar than different -- we may want our algorithm to treat these as the same word, "run".
* **_Stemming_** accomplishes this by removing the ends of words where the end signals some sort of derivational change to the word.
<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-nlp-and-word-vectorization/master/images/new_stemming.png' alt="stemming rules and examples" width="400">

* **_Lemmatization_** accomplishes pretty much the same thing as stemming, but does it in a more complex way, by examining the **_morphology_** of words and attempting to reduce each word to its most basic form, or **_lemma_**.  Note that the results here often end up a bit different than stemming.  See the following table for an example of the differences in results:

|   Word   |  Stem | Lemma |
|:--------:|:-----:|:-----:|
|  Studies | Studi | Study |
| Studying | Study | Study |
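
To see why the two columns differ, here is a deliberately tiny sketch. `toy_stem` applies a few hard-coded suffix rules and `toy_lemmatize` uses a hand-written lookup table; both are hypothetical simplifications for illustration, not how real stemmers and lemmatizers (such as those shipped with NLTK) are implemented.

```python
# (suffix, replacement) pairs, checked in order -- a toy rule set
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]


def toy_stem(word):
    """Chop off a known suffix without checking that the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


# A real lemmatizer relies on a vocabulary and morphological analysis;
# this small dictionary stands in for that knowledge.
LEMMA_LOOKUP = {
    "studies": "study",
    "studying": "study",
    "runs": "run",
    "running": "run",
    "ran": "run",
}


def toy_lemmatize(word):
    return LEMMA_LOOKUP.get(word.lower(), word.lower())


for word in ["studies", "studying", "running", "ran"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer produces non-words like "studi" and "runn" and cannot relate "ran" to "run", while the lookup-based lemmatizer maps every form to a dictionary word.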

**_Stop Words_**, such as 'the' and 'of', are often removed after tokenization is complete in order to reduce the dimensionality of each corpus down to only the words that contain important information.

**_Term Frequency, Inverse Document Frequency_** (TF-IDF) is a more advanced form of vectorization that weighs each term in a document by how unique it is to the given document it is contained in, which allows us to summarize the contents of a document using a few key words. 
* If the word is used often in many other documents, it is not unique, and therefore probably not too useful if we wanted to figure out how this document is unique in relation to other documents.
* Conversely, if a word is used many times in a document, but rarely in all the other documents we are considering, then it is likely a good indicator for telling us that this word is important to the document in question.
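
A minimal sketch of the idea, assuming the plain textbook formula (term frequency multiplied by the log of the inverse document frequency) and a made-up three-document corpus. Library implementations such as scikit-learn's `TfidfVectorizer` use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus
docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(docs)

# Terms of the first document, most distinctive first
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

"the" appears in every document, so its score is zero. "mat" appears only in the first document, so it scores higher than "cat", which is shared with the second document.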

##### Prodigy and spaCy
**Prodigy** is a proprietary annotation tool that improves human-in-the-loop workflows for researchers building NLP models. Its efficient and user-friendly interface eliminates some of the common challenges that researchers face when building custom NLP pipelines and models, such as:
* fair labor/security concerns from using MTurk or other HITS services
* lack of domain knowledge from annotators
* decision fatigue from human annotators that compromises data integrity.

Prodigy's annotation tool is so powerful because it automatically selects edge cases from the labeled datasets and judiciously presents them to the human annotator. This means that the human annotator only has to label or correct a few edge cases to "retrain" the model. Prodigy is able to do this by using __similarity metrics__ and __blocking techniques__ to assess the similarity between examples. Examples that have the least in common with any of the known data are classified as edge cases.

**spaCy** is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. 

##### Text Data Summary
Preprocessing text data is a bit more challenging than working with more traditional data types because there's no clear-cut answer for exactly what sort of preprocessing and cleaning we need to do. Here are some questions that can guide the cleaning process:

* Do we remove stop words or not?    
* Do we stem or lemmatize our text data, or leave the words as is?   
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?   
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?

In general, there's no great answer for exactly which features will improve the performance of your model, and which won't. This means that your best bet is to experiment, and treat the entire project as an iterative process!

#### Model Deployment

Once a model has been trained, you can *pickle* it -- basically, store the model for later, so that it can be loaded and can make predictions without being trained again. Pickled models are also typically used in the context of model deployment, where your model can be used as the backend of an API!
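
Here is a minimal sketch of the save-and-reload round trip using Python's built-in `pickle` module. To stay self-contained, a dictionary of fitted parameters from a hand-rolled least-squares line stands in for a trained model object; `fit_line`, `predict`, and the file name `model.pkl` are all hypothetical.

```python
import os
import pickle
import tempfile


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    return model["slope"] * x + model["intercept"]


# "Train" once (the toy data follows y = 2x + 1)
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Store the trained model on disk
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later (for example, inside an API process): load it and predict
with open(path, "rb") as f:
    restored = pickle.load(f)

print(predict(restored, 10.0))
```

Fitted model objects from libraries such as scikit-learn can generally be stored with the same `dump`/`load` pattern. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.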