## Text Classification

- Topic models
- NMF : Non-negative Matrix Factorization
- LDA : Latent Dirichlet Allocation
- Cosine Similarity
- word embedding

Simply counting term frequencies, as in a document-term matrix, suffers from a critical problem: all terms are considered equally important when assessing relevance to a query.

__Named Entity Recognition (NER)__

The process of locating and classifying elements in text into predefined categories such as the names of people, organizations, places, monetary values, percentages, etc.

__Tokenization__

- splitting a string into a list of "tokens" / words.
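
A minimal sketch of tokenization, assuming English text and a simple regular expression that keeps an internal apostrophe (so "Don't" stays one token). The `tokenize` helper is illustrative, not a library function; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split a string into word tokens, keeping one internal apostrophe."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

tokens = tokenize("Don't split me, please!")
print(tokens)
```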

__Normalization__

Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
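
The steps listed above can be sketched as follows. The contraction table is a deliberately tiny, hypothetical example; a real pipeline would use a much larger mapping and would also handle numbers.

```python
import string

# Hypothetical, deliberately tiny contraction table for illustration.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def normalize(text):
    """Lowercase, expand known contractions, strip punctuation, squeeze spaces."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("It's GREAT, don't you think?"))
```

Contractions are expanded before punctuation is removed, because stripping the apostrophe first would make them unrecognizable.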

__Stemming__

Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

running ===> run

__Lemmatization__

Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma. For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:

better ===> good
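
To make the difference concrete, here is a rough sketch that contrasts a crude suffix-stripping stemmer with a lookup-based lemmatizer. Both are toy stand-ins: `crude_stem` is not the Porter algorithm, and the `LEMMAS` table is a hypothetical three-entry dictionary standing in for a real lexical resource.

```python
def crude_stem(word):
    """Strip one common suffix; a toy rule set, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]
            break
    return word

# Hypothetical lemma table; a real lemmatizer consults a full dictionary.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}

def lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

print(crude_stem("running"), crude_stem("better"), lemmatize("better"))
```

The stemmer only manipulates characters, so it cannot map "better" to "good"; the lemmatizer can, because it relies on vocabulary knowledge.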


__Stop Words__

Stop words are words that are filtered out before further processing of text, since they contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a" are typical stop words.

__Document__

- a document is a set of terms.

__Corpus__

- a corpus is a set of documents.

__Bag of Words (BoW).__

- "Bag" refers to the set-theory concept of a multiset, which differs from a simple set in that elements can occur more than once.
- The bag-of-words model discards grammar and word order and keeps only the number of occurrences of each word in the text.
- BoW is simply a word count: how many times each word appears in a document.
- BoW has limitations such as high feature dimensionality and sparse representations.
- If the dataset is small and the context is domain-specific, BoW may work better than word embeddings.
- In simple bag-of-words representations (e.g. uni-gram, bi-gram, tri-gram), the frequency (or a similar weight such as term frequency-inverse document frequency) of each word or n-gram is treated as a separate feature.
- Bag-of-words models encode every word in the vocabulary as a one-hot vector.
- Bag-of-words models do not capture word semantics. For example, the words ‘car’ and ‘automobile’ are often used in the same context, yet BoW treats them as unrelated features.
- Bag-of-words loses all information about word order: “This is good” and “Is this good” have exactly the same vector representation, as do “John likes Mary” and “Mary likes John”.
- Bag-of-n-grams models partly address this by using phrases of length n as features, which captures local word order in fixed-length vectors, but they suffer from data sparsity and high dimensionality. The Word2Vec model addresses this second problem.

- CountVectorizer()
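
A small sketch of the word-order point above, using only the standard library rather than `CountVectorizer`. The helper names `bag_of_words` and `bag_of_ngrams` are made up for this illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def bag_of_words(text):
    """Multiset of lowercased tokens: order is discarded, counts are kept."""
    return Counter(text.lower().split())

def bag_of_ngrams(text, n=2):
    """Multiset of contiguous n-token tuples, which keeps local word order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a, b = "John likes Mary", "Mary likes John"
print(bag_of_words(a) == bag_of_words(b))    # True: unigram bags are identical
print(bag_of_ngrams(a) == bag_of_ngrams(b))  # False: bigrams differ
```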


__Term Frequency(TF)__

- Term frequency is a normalized occurrence count.
- The TF-IDF approach assumes that high-frequency words may not provide much information gain. In other words, rare words contribute more weight to the model.
- It is the ratio of the number of times a word occurs in a document to the total number of words in that document.
- TfidfVectorizer()

__Inverse Document Frequency(IDF)__

- IDF (inverse document frequency) assumes that the importance of a term is inversely proportional to the number of documents in the corpus that contain it.

- It is the logarithm of (total number of documents divided by the number of documents containing the word).

__TF-IDF__

- tf-idf helps you rank the importance of a term
- With TF-IDF, words are given weight – TF-IDF measures relevance, not frequency
- Term frequency (tf) is basically the output of the BoW model

- TF-IDF can achieve better results than simple term frequencies when combined with some supervised methods.
- TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
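- TfidfVectorizer is a feature-generation step, but it learns its vocabulary and IDF weights from the data it is fitted on. Fit it on the training set only and then transform the test set; fitting on the test set leaks test-set statistics into the features.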
- TF-IDF is a scoring scheme for words, that is, a measure of how important a word is to a document.
- TF-IDF is a way to judge the topic of an article from the kinds of words it contains.
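
The TF and IDF definitions above can be sketched directly. This follows the plain formulas from these notes (count divided by document length, and log of N over document frequency); scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ. The three-document corpus is invented for illustration.

```python
import math

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / df); returns 0.0 for a term that occurs in no document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" appears in every document
print(tf_idf("dog", corpus[1], corpus) > tf_idf("cat", corpus[0], corpus))  # True
```

A word that appears in every document gets a weight of zero, while the rarer "dog" outweighs "cat", which appears in two of the three documents.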


__Word2Vec__

- Its best-known training setup is Skip-gram with Negative Sampling (SGNS); the other Word2Vec architecture is Continuous Bag of Words (CBOW)
- Word2vec produces one vector per word, whereas tf-idf produces a score. 
- Word2vec produces one vector per word, whereas BoW produces one number (a wordcount)
- word2vec learns relationships between words automatically.
- Gensim is widely used for training word2vec and doc2vec models
- Can solve word analogies (e.g. king - man + woman ≈ queen) with vector arithmetic
- gensim.models.Word2Vec()
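
The analogy idea can be sketched with vector arithmetic. The vectors below are made up by hand purely for illustration; real Word2Vec vectors are learned from a corpus and have far more dimensions. With a trained gensim model, the analogous query would go through `model.wv.most_similar(positive=..., negative=...)`.

```python
import math

# Made-up 3-d vectors for illustration only; NOT learned embeddings.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(TOY_VECTORS[a], TOY_VECTORS[b], TOY_VECTORS[c])]
    candidates = [w for w in TOY_VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, TOY_VECTORS[w]))

print(analogy("man", "king", "woman"))  # queen
```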

__Doc2Vec__

- Doc2Vec is a model that represents each document as a vector.


__DeepIR__

__Global Vectors (GloVe)__
- GloVe learns word vectors from global word-word co-occurrence counts; in effect it performs a dimensionality reduction of the co-occurrence matrix.
- GloVe and Word2vec are both unsupervised models for generating word vectors

__Cosine Similarity__

- With cosine similarity we can measure the similarity between two document vectors
- A cosine similarity of 1 means the two vectors point in the same direction (for example, identical documents). If it is 0, the documents share no terms.

- Documents can be compared with cosine similarity, which is derived from the Euclidean dot product formula.
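
A minimal sketch of cosine similarity over term-count vectors, assuming whitespace-tokenized documents stored as `Counter` objects. The function name and the sample sentences are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

short = Counter("the cat sat".split())
doubled = Counter("the cat sat the cat sat".split())
other = Counter("dogs bark loudly".split())

print(cosine_similarity(short, doubled))  # close to 1: same direction
print(cosine_similarity(short, other))    # 0.0: no shared terms
```

Repeating a document twice changes the vector's length but not its direction, so the similarity stays at 1 up to floating-point rounding.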

__Latent Dirichlet Allocation (LDA)__

- Latent Dirichlet Allocation (LDA) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

- An LDA model trained on a very small corpus tends to do a poor job of assigning documents to topics. Using a larger corpus should give much better results.



__Latent Semantic Analysis (LSA)__

- Both LSA and LDA take the same input: a bag-of-words matrix. LSA focuses on reducing the matrix dimension, while LDA solves topic modeling problems.

__Positive Pointwise Mutual Information (PPMI)__

- PMI is a typical measure of the strength of association between two words. PPMI keeps only the positive values, replacing negative PMI scores with 0.
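
A rough sketch of PPMI computed from co-occurrence counts, assuming the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))) with negative values clipped to zero. The counts are invented for illustration.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """pair_counts maps (word, context) -> co-occurrence count."""
    total = sum(pair_counts.values())
    word_totals, ctx_totals = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        ctx_totals[c] += n
    scores = {}
    for (w, c), n in pair_counts.items():
        # P(w,c) / (P(w) * P(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log2(n * total / (word_totals[w] * ctx_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores

# Invented counts: "new" mostly co-occurs with "york", "old" with "car".
counts = Counter({("new", "york"): 8, ("new", "car"): 2,
                  ("old", "car"): 8, ("old", "york"): 2})
scores = ppmi(counts)
print(scores[("new", "york")], scores[("new", "car")])
```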

__Singular Value Decomposition__

- SVD is among the more popular methods for dimensionality reduction and came about in NLP originally via latent semantic analysis (LSA)
- The challenge with SVD is that it is hard to determine the optimal number of dimensions

__Random Projections(RP)__

Random projections aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.

__Q. What are the NLP preprocessing steps?__

We first clean the text by removing HTML tags. Next, we tokenize the message’s sentences and remove stopwords. Then, we conduct lemmatization to convert words in different inflected forms into the same base form. Finally, we convert the documents into a collection of words (a so-called bag of words) and build a dictionary of those words.
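
The pipeline described above can be sketched end to end. The stop-word list and lemma table are tiny hypothetical stand-ins, and stripping tags with a regular expression is a simplification that only suits simple markup.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "are", "on"}        # hypothetical tiny list
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}  # hypothetical lookup

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)                  # 1. remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())          # 2. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. drop stop words
    return [LEMMAS.get(t, t) for t in tokens]             # 4. lemmatize

docs = ["<p>The cats are sitting on the mats</p>", "<b>A cat and a mat</b>"]
processed = [preprocess(d) for d in docs]

# 5. Bag of words per document, plus a dictionary mapping term -> integer id.
bows = [Counter(doc) for doc in processed]
vocabulary = sorted({term for doc in processed for term in doc})
dictionary = {term: idx for idx, term in enumerate(vocabulary)}

print(processed)
print(dictionary)
```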

There are two kinds of work involved in text representation: indexing and term weighting. Indexing is the job of assigning indexing terms to documents.

Term weighting is the job of assigning weights to terms, which measure the importance of the terms in documents.



__Q 19. What is: collaborative filtering, n-grams, cosine distance?__

Collaborative filtering:
- Technique used by some recommender systems
- Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources.
1. A user expresses his/her preferences by rating items (movies, CDs)
2. The system matches this user’s ratings against other users’ and finds people with most similar tastes
3. Based on these similar users, the system recommends items that they have rated highly but that this user has not yet rated
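
The three steps above can be sketched as a very small user-based recommender. The ratings are invented, cosine similarity over co-rated items is one of several possible similarity choices, and using only the single nearest user is a simplification.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "alice": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Up": 1, "Jaws": 5},
    "carol": {"Alien": 1, "Heat": 2, "Up": 5, "Cars": 5},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = RATINGS[u].keys() & RATINGS[v].keys()
    if not shared:
        return 0.0
    dot = sum(RATINGS[u][i] * RATINGS[v][i] for i in shared)
    norm_u = math.sqrt(sum(RATINGS[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(RATINGS[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, min_rating=4):
    """Items the most similar user rated highly that `user` has not rated."""
    others = [u for u in RATINGS if u != user]
    nearest = max(others, key=lambda u: similarity(user, u))
    return sorted(item for item, rating in RATINGS[nearest].items()
                  if rating >= min_rating and item not in RATINGS[user])

print(recommend("alice"))  # ['Jaws']
```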

n-grams:
- Contiguous sequence of n items from a given sequence of text or speech
- “Andrew is a talented data scientist”
- Bi-grams: “Andrew is”, “is a”, “a talented”, “talented data”, “data scientist”.
- Tri-grams: “Andrew is a”, “is a talented”, “a talented data”, “talented data scientist”.
- An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
- More concisely, an n-gram model is a Markov model: $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$
- N-gram model: each word depends only on the previous n−1 words

Issues:
- Infrequent or unseen n-grams get unreliable or zero probability estimates
- Solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
- Methods: Good-Turing, Backoff, Kneser-Ney smoothing
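
A short sketch of n-gram extraction and a smoothed bigram probability. It uses add-k (Laplace) smoothing, which is simpler than the methods listed above and is swapped in here only to show how an unseen bigram ends up with a non-zero probability. The helper names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(prev, word, tokens, k=1.0):
    """P(word | prev) with add-k smoothing over the observed vocabulary."""
    vocab_size = len(set(tokens))
    bigram_counts = Counter(ngrams(tokens, 2))
    prev_counts = Counter(tokens[:-1])  # times `prev` starts a bigram
    return (bigram_counts[(prev, word)] + k) / (prev_counts[prev] + k * vocab_size)

tokens = "Andrew is a talented data scientist".split()
print(ngrams(tokens, 2))
print(bigram_prob("Andrew", "is", tokens))         # seen bigram
print(bigram_prob("Andrew", "scientist", tokens))  # unseen, but non-zero
```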

Cosine distance:
- How similar are two documents?
- Perfect similarity/agreement: cosine similarity of 1
- No agreement: cosine similarity of 0 (orthogonality)
- Cosine distance is 1 minus cosine similarity
- Measures the orientation, not magnitude

Given two vectors A and B representing word frequencies:
$$\text{cosine-similarity}(A, B) = \frac{\langle A, B \rangle}{\|A\| \cdot \|B\|}$$


__Q. How would you come up with a solution to identify plagiarism?__

- Vector space model approach
- Represent documents (the suspect and original ones) as vectors of terms
- Terms: n-grams, with n from 1 up to as large as practical (larger n detects passage-level plagiarism)
- Measure the similarity between both documents
- Similarity measure: cosine distance, Jaro-Winkler, Jaccard
- Declare plagiarism when the similarity exceeds a certain threshold
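
One way to sketch this approach is Jaccard similarity over sets of word trigrams. The threshold of 0.3 is an arbitrary placeholder that would need tuning on real data, and the function names are made up for this example.

```python
def shingles(text, n=3):
    """Set of contiguous n-word tuples from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def looks_plagiarized(suspect, original, n=3, threshold=0.3):
    # threshold is a hypothetical value chosen only for this illustration
    return jaccard(shingles(suspect, n), shingles(original, n)) >= threshold

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox jumps over a lazy dog"
unrelated = "completely different words appear in this sentence"

print(jaccard(shingles(suspect), shingles(original)))
print(looks_plagiarized(suspect, original), looks_plagiarized(unrelated, original))
```

Changing a single word breaks every trigram that contains it, so larger n makes the check stricter about verbatim passages.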
