 ## Text vectorization
 
 Raw text is intractable for statistical learning algorithms, which can generally
 handle only numeric data. The process of mapping a document (text) to a numeric space is
 called feature extraction or vectorization, and it is crucial for the performance of statistical learning
 methods.
 
 For the purposes of text classification the vectorization process should map similar (in meaning) 
 documents to points that are close to each other in the numeric space (semantic space).   
 
 ### Bag-of-words 
 
 Consider a collection (corpus) consisting of $$D$$ documents. Let $$N$$ denote the number of unique words (the vocabulary)
 in that collection, and let the words be indexed by $$j=1,\ldots,N$$.
 
 A simple yet commonly used model is the bag-of-words model that maps documents into
 a space of dimension $$N$$ spanned by the set of unique words (n-grams) in the vocabulary.
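
 To make the mapping concrete, here is a minimal sketch of the bag-of-words model written by hand with the standard library only. The three example sentences and the helper name `bag_of_words` are chosen for illustration, tokenization is assumed to be plain whitespace splitting, and Python indexes the vocabulary from 0 rather than 1.

```python
from collections import Counter

docs = ["we love you", "they love us", "we hate you"]

# Vocabulary: the sorted set of unique words in the collection
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Map a document to a count vector of length N = len(vocabulary)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One N-dimensional count vector per document
vectors = [bag_of_words(doc) for doc in docs]
```

 Each position in a vector corresponds to one vocabulary word, so every document lands in the same $$N$$-dimensional space regardless of its length.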

In [45]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import Binarizer

positive_texts = [
    "we love you",
    "they love us",
    "you are good",
    "he is good",
    "they love mary"
]

negative_texts = [
    "we hate you",
    "they hate us",
    "you are bad",
    "he is bad",
    "we hate mary"
]

all_texts = positive_texts + negative_texts

vectorizer = CountVectorizer()

vectorizer.fit(all_texts)
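# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()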

['are',
 'bad',
 'good',
 'hate',
 'he',
 'is',
 'love',
 'mary',
 'they',
 'us',
 'we',
 'you']

    During fitting, the vectorizer builds the collection vocabulary, mapping each word to a column index.

In [46]:
vectorizer.vocabulary_

{'are': 0,
 'bad': 1,
 'good': 2,
 'hate': 3,
 'he': 4,
 'is': 5,
 'love': 6,
 'mary': 7,
 'they': 8,
 'us': 9,
 'we': 10,
 'you': 11}

    After the vectorizer is fitted, we can
    transform the texts to obtain frequency count vectors.
    
    The resulting document-word matrix has only a few non-zero
    entries and a large number of zeros (a sparse matrix).

In [47]:
texts_transformed = vectorizer.transform(all_texts)
texts_transformed.toarray()


array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]])

    The count vectorizer from `scikit-learn` produces a matrix where each row corresponds
    to a document and each entry $$i,j$$ contains the frequency of vocabulary word $$j$$ within document $$i$$.
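
    As a small illustration of this indexing, the sketch below looks up a single entry of the document-word matrix and measures how sparse the matrix is. The three-sentence corpus and the variable names are chosen for the example; the indices are 0-based, as in Python.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["we love you", "they love us", "we hate mary"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)    # sparse matrix: one row per document

i = 1                                 # second document: "they love us"
j = vectorizer.vocabulary_["love"]    # column index of the word "love"
count = X[i, j]                       # frequency of "love" in document i

# Fraction of entries that are non-zero
density = X.nnz / (X.shape[0] * X.shape[1])
```

    Only the non-zero entries are stored, which is why the matrix has to be converted with `toarray()` before it can be printed as a dense array.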

    The bag-of-words model disregards word order and grammar. Count vectors may also be quite unbalanced,
    with the more common words having much higher frequencies than the less common ones (Zipf's law).
    This may have an impact on some models, e.g. generalized linear models.
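
    The loss of word order is easy to see in a short sketch. The two sentences below are hypothetical examples with opposite meanings, and the helper `bag` is an illustrative name; both sentences reduce to the same bag of words.

```python
from collections import Counter

def bag(doc):
    """Return the word counts of a document, ignoring word order."""
    return Counter(doc.lower().split())

a = bag("mary loves john")
b = bag("john loves mary")

# Both sentences contain the same words with the same counts
same = a == b
```

    Here `same` is `True`, so any model trained on these count vectors cannot tell the two sentences apart.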
    
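    Another way to vectorize the documents in the collection is to truncate the frequencies at 1, so that each
    entry records only whether a word occurs in the document. The result is called one-hot encoding.
    In this toy corpus no word occurs more than once in a document, so the output below is identical to the count matrix.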

In [48]:
one_hot_vectorizer = Binarizer()

one_hot_encoded_vecs = one_hot_vectorizer.fit_transform(texts_transformed)
one_hot_encoded_vecs.toarray()

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]])
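
    The effect of truncation only becomes visible when a word is repeated within a document. The sketch below uses a hypothetical two-document corpus with a repeated word, and also shows the `binary=True` option of `CountVectorizer`, which yields the same result in a single step.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer

docs = ["good good good movie", "bad movie"]

# Raw counts: "good" appears three times in the first document
counts = CountVectorizer().fit_transform(docs)
counts_dense = counts.toarray()

# Two routes to the binary representation
binarized = Binarizer().fit_transform(counts)               # counts > 0 become 1
direct = CountVectorizer(binary=True).fit_transform(docs)   # binarize while counting

same = np.array_equal(binarized.toarray(), direct.toarray())
```

    With the vocabulary `bad`, `good`, `movie`, the count for "good" in the first document drops from 3 to 1, and both routes agree.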

    The bag-of-words representation treats each document in isolation, without considering
    the whole collection. One way to introduce the context of the collection is to weight the
    frequencies so that words that occur frequently in a document but rarely in other
    documents receive a higher weight. A common vectorization that achieves this
    weights the term frequencies by the inverse document frequency (TF-IDF).
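
    To show what the weighting does, here is a sketch that computes the weights by hand for the toy corpus. It assumes the smoothed variant that `scikit-learn` documents as its default, $$\mathrm{idf}(j) = \ln\frac{1 + D}{1 + \mathrm{df}(j)} + 1$$, where $$\mathrm{df}(j)$$ is the number of documents containing word $$j$$, followed by scaling each document vector to unit Euclidean length. The helper names `idf` and `tfidf` and the whitespace tokenization are choices made for this example.

```python
import math
from collections import Counter

docs = [
    "we love you", "they love us", "you are good", "he is good", "they love mary",
    "we hate you", "they hate us", "you are bad", "he is bad", "we hate mary",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Term frequency times idf, then scaled to unit Euclidean length
    raw = {word: tf * idf(word) for word, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()}

weights = tfidf(tokenized[0])  # weights for "we love you"
```

    In "we love you", the word "you" appears in four documents of the collection while "we" and "love" appear in three each, so "you" receives the lowest weight (about 0.53 against about 0.60).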
    

In [49]:
idf_vectorizer = TfidfVectorizer()

texts_idf_vecs = idf_vectorizer.fit_transform(all_texts)
texts_idf_vecs.toarray()



array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.59863623, 0.        , 0.        , 0.        ,
        0.59863623, 0.53223051],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5499426 , 0.        , 0.5499426 , 0.62859071,
        0.        , 0.        ],
       [0.6195754 , 0.        , 0.6195754 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.48192597],
       [0.        , 0.        , 0.57735027, 0.        , 0.57735027,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5499426 , 0.62859071, 0.5499426 , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.59863623, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.59863623,

   ## Reducing the dimensions of the feature space
   
   When dealing with large collections of long documents containing a large number of unique words, the
   dimension of the feature space can become very high.
   
   One way to reduce the dimensionality of the feature space while retaining _most_ of the information
   is to use principal component analysis (PCA).

In [54]:

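# Keep the first 6 principal components
pca = PCA(n_components=6)

# Convert the sparse TF-IDF matrix to a dense array before fitting
reduced = pca.fit_transform(texts_idf_vecs.toarray())
reduced

# Cumulative variance captured by the retained components
variance_explained = np.cumsum(pca.explained_variance_)
variance_explained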

array([0.23606781, 0.42469203, 0.56094171, 0.660828  , 0.74051767,
       0.81415114])
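
    To clarify what these numbers measure, the sketch below carries out PCA by hand with a singular value decomposition in NumPy. The 5-by-3 data matrix is a hypothetical toy example and the variable names are illustrative. `explained_variance` is in absolute units (the variance along each component), whereas `explained_variance_ratio` divides by the total variance and therefore sums to 1 when all components are kept; `scikit-learn`'s `PCA` exposes both as `explained_variance_` and `explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 2.0],
    [2.0, 1.0, 1.0],
])

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each component (absolute) and as a fraction of the total
explained_variance = s**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
cumulative_ratio = np.cumsum(explained_variance_ratio)

# Project onto the first k components
k = 2
reduced = X_centered @ Vt[:k].T
```

    The cumulative ratio is usually the more convenient quantity for choosing the number of components, since it reads directly as the fraction of variance retained. The signs of the projected coordinates may differ from those returned by a library implementation, which does not affect the variances.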