[Feature Selection](#feature selection)  
[Principal Component Analysis](#pca)  
[Natural Language Processing](#nlp)

<a id='feature selection'></a>
# Feature Selection
---
## Regularization for Feature Selection
Lasso/L1 regularization
- The L1 penalty favors small coefficients and pushes those of less useful features to exactly (or nearly) zero
- Features whose coefficients are 0 or near 0 are good candidates to remove
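
A minimal sketch of using Lasso coefficients to flag removal candidates. The synthetic data, `alpha=1.0`, and the variable names are illustrative assumptions, not values from these notes; in practice `alpha` would be tuned (for example with `LassoCV`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only 3 of which drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

# Scale first so the L1 penalty treats every feature on the same footing
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

# Features with a zero coefficient are candidates to drop
kept = np.flatnonzero(lasso.coef_ != 0)
dropped = np.flatnonzero(lasso.coef_ == 0)
print("kept:", kept, "dropped:", dropped)
```

A larger `alpha` zeroes out more coefficients; a smaller one keeps more features.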

## Recursive Feature Elimination
- Uses a bottom-up approach to feature selection: the lowest-ranked features are pruned first  
- Fits successive versions of a model, starting with all features. Each new version drops the least useful feature(s) until only a preset number of features, $i$, remains
- RFE in sklearn
   - <span style="color:green"> estimator </span> - model type (required)
   - <span style="color:green"> n_features_to_select </span> - number of features to keep at the end. Default is $\frac{1}{2}$ of the features, rounded down
   - <span style="color:green"> step </span> - how many features to eliminate with each pass (default is 1)
    
RFE only works with estimators that have .coef_ or .feature\_importances_ attributes
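
A short sketch of the three `RFE` arguments listed above. The synthetic dataset and the choice of `LogisticRegression` as the estimator are assumptions for illustration; any estimator that exposes `.coef_` or `.feature_importances_` would do.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 8 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Drop one feature per pass until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3, step=1)
rfe.fit(X, y)

selected_mask = rfe.support_   # boolean mask of the kept features
ranking = rfe.ranking_         # 1 = kept; larger numbers were eliminated earlier
X_reduced = rfe.transform(X)   # only the kept columns
print(selected_mask, ranking, X_reduced.shape)
```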

## Automatic Feature Selection with SelectKBest
- Top-down form of feature selection
- Scores every feature individually against the target with a univariate statistical test (`score_func`) and keeps the $k$ highest-scoring ones

If there are features that are less useful when another feature is present, RFE may catch that dynamic and SelectKBest may not 
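
For comparison, a minimal `SelectKBest` sketch. The dataset and the choice of `f_classif` (ANOVA F-test) with `k=3` are assumptions for illustration. Each feature is scored on its own, which is why interactions between features can be missed.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): 8 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Score each feature independently and keep the 3 best
selector = SelectKBest(score_func=f_classif, k=3)
X_best = selector.fit_transform(X, y)

scores = selector.scores_                        # one score per original feature
kept_idx = selector.get_support(indices=True)    # column indices that were kept
print(scores.round(2), kept_idx)
```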

<a id='pca'></a>
# Principal Component Analysis
---
**Goal** 
- Transform the original features (inputs) into higher-performing ones  
- Reduce the dimensionality of the data
- Eliminate multicollinearity

Reducing Collinearity in Input
- Want high variance in each of the features
- Off-diagonal elements of the covariance matrix -> the amount of collinearity & redundancy between variables

PCA produces new <span style="color:green"> super predictor </span> variables called components; often the first 1 or 2 capture most of the variance  
<span style="color:green"> Dimensionality Reduction </span> - the process of combining or collapsing existing features (columns) into fewer features.
- Retain signal in original data
- Reduce noise

PCA finds linear combinations of the current predictor variables that create new <span style="color:green"> principal components </span>, which explain the maximum possible amount of variance with the fewest variables.

Principal Components
- Looking for new directions in feature space
- Each consecutive direction tries to maximize remaining variance

PCA transformation creates new variables that:
- Optimize "explained variance"
- Are uncorrelated
- Each component is created as a weighted sum of your original cols such that components are orthogonal to each other
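
The three properties above can be checked numerically. This is a sketch on made-up data (two deliberately collinear columns plus one independent column); the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data (assumption): columns 0 and 1 are nearly collinear
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
Z = pca.fit_transform(X_scaled)   # the new variables (component scores)

# Uncorrelated: the off-diagonal covariances of the scores are ~0
component_cov = np.cov(Z, rowvar=False)
off_diag = component_cov - np.diag(np.diag(component_cov))

# Orthogonal: the weight vectors (rows of components_) are orthonormal
gram = pca.components_ @ pca.components_.T

print(pca.explained_variance_ratio_)   # sorted from largest to smallest
```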

Interpreting PCA: Signal vs. Noise
- PCA attempts to maximize signal (high variance) while isolating noise (low variance)
- Most variance is captured in the first several principal components
- Noise is isolated to the last several principal components
- Done simultaneously across all input variables

PCA Assumptions
- <span style="color:green"> Linearity </span> - the relationships among the variables are assumed to be linear; PCA will not capture nonlinear structure
- <span style="color:green"> Large variances define importance </span> - dimensions are constructed to maximize remaining variance

PCA Relies On:
- <span style="color:green"> Eigenvalue decomposition of the covariance matrix </span> - diagonalizes the covariance matrix
- <span style="color:green"> Principal component transformation </span> - projects the input variables onto a new orthogonal basis in which the new variables are maximally variant

<span style="color:green"> Eigenvalues </span> - PC explained variance  
<span style="color:green"> Eigenvectors </span> - 'weighting' (components) matrix
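
A by-hand sketch of the two steps PCA relies on, using only NumPy. The data and mixing matrix are made up for illustration; a library implementation such as sklearn's `PCA` typically uses an SVD instead of an explicit eigendecomposition, but the resulting components agree up to sign.

```python
import numpy as np

# Made-up correlated data (assumption): 200 rows, 3 columns
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: eigenvalue decomposition of the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending order

# Sort so the first component explains the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # columns are the weightings

# Step 2: principal component transformation (project onto the new basis)
Z = X_centered @ eigenvectors

# Eigenvalues are the variances explained by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

The covariance matrix of `Z` is diagonal, with the eigenvalues on the diagonal, which is what "diagonalizes the covariance matrix" means here.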

<a id='nlp'></a>
# Natural Language Processing
---
<span style="color:green"> Sparse Matrix </span> - stores only the non-zero values (and their positions), which reduces the memory footprint  
- Sparse matrices collapse regular (dense) matrices by recording only the row and column combinations where a non-zero value is found. All of the zeroes are dropped, allowing for a reduced memory footprint
- <span style="color:blue">.todense() </span> will fill the zeroes back in
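
A small sketch with SciPy's CSR format (one of several sparse formats; the example matrix is made up):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0]])

sp = sparse.csr_matrix(dense)   # keeps only the non-zero entries
stored = sp.nnz                 # number of values actually stored
restored = sp.todense()         # fills the zeroes back in

print(stored)
print(restored)
```

`.toarray()` does the same job as `.todense()` but returns a plain NumPy array rather than a `numpy.matrix`.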

<span style="color:green"> Stemming </span> - normalize words to a common root
 - <span style="color:blue"> nltk.stem </span>  

<span style="color:green"> Stop Words </span> - words that are common but provide no information on text context

## Count Vectorizer
Take a set of documents & split them into one column per word, with the count of that word for each row (document) in that column. This is known as the bag-of-words model.

`CountVectorizer` takes the following (useful) keyword arguments:

| Argument | Default Value | Definition |
| :--- | :--- | :--- |
| `decode_error` | `strict` | What to do if text cannot be decoded. `strict` will raise a `UnicodeDecodeError`, `ignore` will skip the undecodable characters, `replace` will substitute a replacement character for them |
| `strip_accents` | `None` | By default, `CountVectorizer` does nothing with accented characters when preprocessing a word. `ascii` will convert those characters if they have a direct ASCII mapping (à -> a, for example), and `unicode` is slower but will do it for all characters |
| `preprocessor / tokenizer` | `None` | Ways to override how to split text into words (`tokenizer`) and what to do with the raw text before tokenizing (`preprocessor`) -- we'll discuss shortly |
| `ngram_range` | `(1, 1)` | Sometimes we may want each sequence of _n_ words as well as each individual word. This is known as an _n-gram_, and we can set the range of _n_ here as `(min_n, max_n)` |
| `stop_words` | `None` | Whether or not to remove stop words (for example, `'english'` for the built-in English list, or a custom list) |
| `max_df` | `1.0` | `df` refers to the document frequency -- how many documents a given word shows up in. If we set this to a float less than 1.0, any word that appears in a larger proportion of documents than that value will be discarded. An integer is treated as a number of documents instead of a proportion |
| `min_df` | `1` | Same as `max_df`, but for the number / proportion of documents that a word has to appear in before it is included |
| `max_features` | `None` | The top _n_ most frequently occurring features to include. If `None`, include all features. This is a great way to coerce `CountVectorizer` to return a matrix of a specific shape / size |
| `binary` | `False` | If `True`, return dummy variables (1 if the word occurs, 0 otherwise) instead of a count of occurrences |
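
A brief sketch of the defaults plus two of the arguments from the table (`binary` and `ngram_range`). The two toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat",
        "the cat sat on the mat"]

# Defaults: one column per word, cell = count of that word in that document
cvec = CountVectorizer()
X = cvec.fit_transform(docs)        # sparse matrix, shape (n_docs, n_words)
print(sorted(cvec.vocabulary_))     # column order is alphabetical
print(X.toarray())

# binary=True: 1 if the word occurs at all, regardless of how often
X_binary = CountVectorizer(binary=True).fit_transform(docs)

# ngram_range=(1, 2): single words plus every pair of adjacent words
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(docs)
print(sorted(bigram.vocabulary_))
```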

## Hashing Vectorizer
Converts a collection of text docs to a matrix of occurrences. Each word is mapped to a feature (column) by a hash function rather than by a learned vocabulary
- Benefit: Low memory footprint, allowing you to do more work. Doesn't need to be fit $\rightarrow$ good for streaming data
- Downside: No way to recover which original word a given feature corresponds to
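
A minimal sketch of both points. The documents and the small `n_features=2**8` are assumptions for illustration (the sklearn default is much larger, which keeps hash collisions rare).

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

# No fit step: the hash function alone decides which column a word lands in
hvec = HashingVectorizer(n_features=2**8)
X = hvec.transform(docs)

print(X.shape)                        # (number of docs, n_features)
print(hasattr(hvec, "vocabulary_"))   # no vocabulary is stored

# Note: by default the rows are L2-normalized and hashed values can be
# negative (alternate_sign=True), so the cells are not raw counts.
```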

## Tfidf Vectorizer
**Term Frequency Inverse Document Frequency**  
Identifies which words are most discriminating between documents  
<span style="color:green"> Term frequency </span> - frequency of a certain word in a document  
<span style="color:green"> Inverse document frequency </span> - the inverse of the share of documents in the corpus that contain the term (usually log-scaled), so words that appear in many documents are down-weighted
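
A small sketch of how the inverse document frequency behaves, using made-up documents. With sklearn's default settings (`smooth_idf=True`) the weight is $\ln\frac{1 + n}{1 + df(t)} + 1$, where $n$ is the number of documents and $df(t)$ is the number of documents containing term $t$.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat",
        "the dog sat",
        "the bird flew"]

tvec = TfidfVectorizer()
X = tvec.fit_transform(docs)

# Pair each word with its learned IDF weight
words = sorted(tvec.vocabulary_, key=tvec.vocabulary_.get)
idf = dict(zip(words, tvec.idf_))

# "the" is in every document (lowest weight); "cat" is in only one (highest)
for word in ["the", "sat", "cat"]:
    print(word, round(idf[word], 3))
```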

## Dimensionality Reduction with PCA
<span style="color:green"> TruncatedSVD </span> - works directly on the (sparse) matrix itself, without centering it first. The maximum number of components you can make is limited to the smaller of your rows or columns.
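
A hedged sketch of reducing a sparse TF-IDF matrix with `TruncatedSVD`. The four documents and `n_components=2` are illustrative assumptions.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stocks fell sharply on monday",
        "markets and stocks rallied on friday"]

X = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
print(sparse.issparse(X), X.shape)

# TruncatedSVD accepts the sparse matrix as-is (no densifying, no centering)
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
print(svd.explained_variance_ratio_)
```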

## Using `spacy` to extract parts of speech and named entities
<span style="color:green"> spaCy </span> is a large-scale NLP and text processing library designed to help you extract useful information from text in a speedy and accurate manner. You can imagine it like `CountVectorizer()` turned up to 11. It has underpinnings in C to increase speed and a focus on usability.
### Parts of Speech
We may want to use some derived statistics about parts of speech in our work as Data Scientists, either as the inputs to a model (document _x_ is _y_% verbs) or to help us modify the inputs to a model (we may want to treat `book` the verb differently than `book` the noun).

## Using `textblob` to do sentiment analysis
We can also use a library known as <span style="color:green">textblob</span> to do a **lot** of text transformation and extraction on our behalf. For our purposes, we are going to use it to analyze text and derive the overall sentiment of the text.  
Sentiment can be split into two related scales:
- subjectivity (0 to 1): scores closer to 0 are more objective in tone, scores closer to 1 are more subjective in tone
- polarity (-1 to 1): scores closer to -1 are more negative in tone, closer to 0 are more neutral, and closer to 1 are more positive in tone.  

Using `textblob` is user-friendly -- pass a string into the `TextBlob()` class and then read the `.sentiment.polarity` or `.sentiment.subjectivity` attributes.

## Assigning documents to topics using LDA
<span style="color:green">LDA (Latent Dirichlet Allocation)</span> is an unsupervised machine learning technique that works by iteratively guessing how likely a given word is to be part of a given topic until we tell it to stop. 

We begin by picking a set of documents and a number of topics that we want to generate. One way to fit the model is what's known as collapsed Gibbs sampling. We do the following:
1. Randomly assign every word in every document to one of the $k$ topics:
    - $w$: a word in a document
    - $d$: a document
    - $k$: a topic
2. At this point, every word has a likelihood that it belongs to a given topic, based on the other words in the documents that it appears in. 
3. Iterate through every word in every document and:
    1. Assume that every other word has the correct likelihood that it belongs to each topic (so, `apple` might have a distribution of `[0.1, 0.1, 0.2, 0.4, 0.2]` for five topics).
    2. Look at the likelihood of seeing word $w$ in document $d$ and adjust the topic probabilities as needed
    > for example, if there are a lot of words in topic 1 in document $d$ and word $w$ has a stronger likelihood of being in topic 2, because we're assuming that every **other** distribution is correct, we should change our understanding of where word $w$ belongs and tweak it more in favor of belonging to topic 1, not topic 2

The name latent dirichlet allocation should begin to make more sense in this context:
- latent -- because we have no explicit marker of topic and are grouping things together based on features we are inferring, not seeing
- dirichlet -- a type of probability distribution over vectors of proportions that sum to 1 (like a document's mix of topics, or a topic's mix of words)
- allocation -- we are allocating different words to different topics via this iterative updating of priors
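
A small sketch of fitting LDA with scikit-learn. Note the swap: sklearn's `LatentDirichletAllocation` uses variational Bayes rather than the collapsed Gibbs sampling described above, so this illustrates the inputs and outputs, not that exact procedure. The documents and the choice of 2 topics are made up.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana fruit smoothie",
        "banana fruit salad apple",
        "python code software bug",
        "software bug code review"]

# LDA works on raw word counts
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # each row: a document's mix of topics
topic_words = lda.components_            # each row: a topic's weights over words

print(np.round(doc_topics, 2))
print(topic_words.shape)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of each topic in that document.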