# Preparing Text Data for Machine Learning

## Encoding text data in numeric form

- One-hot: the vector has the size of the vocabulary; each entry is 1 if the word is present and 0 if it is not, so the number of times a word appears in the text is lost
- Frequency-based:
    - count
    - TF-IDF (term frequency-inverse document frequency): weighs how often a word appears in a document against how often it appears across the whole dataset, so that words common everywhere, such as "a", "an", "the", count for less
    - co-occurrence: similar words tend to occur together and in similar contexts
- Prediction-based (based on ML models: word embeddings, dimensionality reduction)
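
The difference between one-hot and count encoding can be seen on a single sentence. The following is a minimal standard-library sketch, not the scikit-learn implementation; the helper names `one_hot` and `count_vector` are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def one_hot(tokens, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


tokens = "you learn more from failure than from success".split()
vocabulary = sorted(set(tokens))

print(vocabulary)
print(one_hot(tokens, vocabulary))       # 'from' is only marked as present
print(count_vector(tokens, vocabulary))  # 'from' is counted twice
```

Only the count vector keeps the information that "from" occurs twice.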

## Bag-based models

- Bag-of-words: any model that represents a document as a bag of its constituent words, disregarding order but maintaining multiplicity
    - count vectorization
    - TF-IDF vectorization
- Bag of n-grams: any model that represents a document as a bag of its constituent n-grams (sequences of n words that occur together), disregarding order but maintaining multiplicity
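
Both definitions can be sketched with `collections.Counter`, which behaves like a bag: it ignores order but keeps multiplicity. This is an illustrative sketch under simple assumptions (whitespace tokenization, a hypothetical `ngrams` helper), not how any particular library implements it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "we may encounter many defeats but we must not be defeated".split()

bag_of_words = Counter(ngrams(tokens, 1))    # unigrams are just the words
bag_of_bigrams = Counter(ngrams(tokens, 2))  # pairs of adjacent words

# Order does not matter: the reversed sentence gives the same bag of words.
same_bag = Counter(reversed(tokens)) == bag_of_words

print(bag_of_words["we"])        # multiplicity is kept
print(sorted(bag_of_bigrams))
print(same_bag)
```

Note that the bag of bigrams still captures some local word order ("we may" and "we must" are different entries), which a plain bag of words cannot do.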

- Not bag of words:
    - one hot encoding (no multiplicity)
    - word embeddings

## Bag of words

In [1]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
train_text = ["The Pessimist Sees Difficulty In Every Opportunity.",
              "The Optimist Sees Opportunity In Every Difficulty.",
              "Don’t Let Yesterday Take Up Too Much Of Today. ",
              "You Learn More From Failure Than From Success.",
              "We May Encounter Many Defeats But We Must Not Be Defeated.",
              "Life Is Either A Daring Adventure Or Nothing."]

count_vectorizer = CountVectorizer()

In [3]:
count_vectorizer.fit(train_text)

In [5]:
count_vectorizer.get_feature_names_out()

array(['adventure', 'be', 'but', 'daring', 'defeated', 'defeats',
       'difficulty', 'don', 'either', 'encounter', 'every', 'failure',
       'from', 'in', 'is', 'learn', 'let', 'life', 'many', 'may', 'more',
       'much', 'must', 'not', 'nothing', 'of', 'opportunity', 'optimist',
       'or', 'pessimist', 'sees', 'success', 'take', 'than', 'the',
       'today', 'too', 'up', 'we', 'yesterday', 'you'], dtype=object)

In [6]:
count_vectorizer.vocabulary_

{'the': 34,
 'pessimist': 29,
 'sees': 30,
 'difficulty': 6,
 'in': 13,
 'every': 10,
 'opportunity': 26,
 'optimist': 27,
 'don': 7,
 'let': 16,
 'yesterday': 39,
 'take': 32,
 'up': 37,
 'too': 36,
 'much': 21,
 'of': 25,
 'today': 35,
 'you': 40,
 'learn': 15,
 'more': 20,
 'from': 12,
 'failure': 11,
 'than': 33,
 'success': 31,
 'we': 38,
 'may': 19,
 'encounter': 9,
 'many': 18,
 'defeats': 5,
 'but': 2,
 'must': 22,
 'not': 23,
 'be': 1,
 'defeated': 4,
 'life': 17,
 'is': 14,
 'either': 8,
 'daring': 3,
 'adventure': 0,
 'or': 28,
 'nothing': 24}

In [7]:
count_vectorizer.vocabulary_.get('life')

17

In [8]:
transformer_vector = count_vectorizer.transform(train_text)

In [9]:
print(transformer_vector.shape)

(6, 41)


In [10]:
print(transformer_vector.toarray())

[[0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0
  0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1
  1 1 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
  0 0 0 0 1]
 [0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 2 0 0]
 [1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0]]


In [11]:
count_vectorizer.inverse_transform(transformer_vector)

[array(['difficulty', 'every', 'in', 'opportunity', 'pessimist', 'sees',
        'the'], dtype='<U11'),
 array(['difficulty', 'every', 'in', 'opportunity', 'optimist', 'sees',
        'the'], dtype='<U11'),
 array(['don', 'let', 'much', 'of', 'take', 'today', 'too', 'up',
        'yesterday'], dtype='<U11'),
 array(['failure', 'from', 'learn', 'more', 'success', 'than', 'you'],
       dtype='<U11'),
 array(['be', 'but', 'defeated', 'defeats', 'encounter', 'many', 'may',
        'must', 'not', 'we'], dtype='<U11'),
 array(['adventure', 'daring', 'either', 'is', 'life', 'nothing', 'or'],
       dtype='<U11')]

In [12]:
test_text = ["A room without books is like a body without a soul."]
count_vectorizer.transform(test_text).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)
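
The row above contains a single 1 (index 14, `'is'`) because every other word of the test sentence is missing from the fitted vocabulary, and the word "a" never becomes a token at all. The sketch below imitates this behaviour with the standard library only. It assumes the default `CountVectorizer` settings (lowercasing and the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters); the `tokenize` and `transform` helpers and the toy vocabulary are hypothetical.

```python
import re
from collections import Counter

# Assumed to match CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def transform(text, vocabulary):
    """Count only the tokens that exist in the vocabulary; ignore the rest."""
    row = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        index = vocabulary.get(word)
        if index is not None:
            row[index] = count
    return row


vocabulary = {"is": 0, "life": 1, "nothing": 2}  # toy vocabulary for illustration
test_text = "A room without books is like a body without a soul."

print(tokenize(test_text))               # no 'a': single characters are dropped
print(transform(test_text, vocabulary))  # only 'is' is counted
print(tokenize("Don’t Let Yesterday"))   # the apostrophe splits off a lone 't'
```

The same tokenization explains why the fitted vocabulary contains `'don'` rather than "don’t".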

In [13]:
count_vectorizer.fit(test_text+train_text)

In [14]:
count_vectorizer.vocabulary_

{'room': 33,
 'without': 44,
 'books': 3,
 'is': 16,
 'like': 20,
 'body': 2,
 'soul': 35,
 'the': 39,
 'pessimist': 32,
 'sees': 34,
 'difficulty': 8,
 'in': 15,
 'every': 12,
 'opportunity': 29,
 'optimist': 30,
 'don': 9,
 'let': 18,
 'yesterday': 45,
 'take': 37,
 'up': 42,
 'too': 41,
 'much': 24,
 'of': 28,
 'today': 40,
 'you': 46,
 'learn': 17,
 'more': 23,
 'from': 14,
 'failure': 13,
 'than': 38,
 'success': 36,
 'we': 43,
 'may': 22,
 'encounter': 11,
 'many': 21,
 'defeats': 7,
 'but': 4,
 'must': 25,
 'not': 26,
 'be': 1,
 'defeated': 6,
 'life': 19,
 'either': 10,
 'daring': 5,
 'adventure': 0,
 'or': 31,
 'nothing': 27}

In [15]:
count_vectorizer.transform(test_text).toarray()

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        2, 0, 0]], dtype=int64)

## Bag of n-grams

In [16]:
n_gram_vectorizer = CountVectorizer(ngram_range=(2,2))

In [17]:
transformer_vector = n_gram_vectorizer.fit_transform(train_text)

In [18]:
n_gram_vectorizer.vocabulary_

{'the pessimist': 35,
 'pessimist sees': 29,
 'sees difficulty': 30,
 'difficulty in': 5,
 'in every': 14,
 'every opportunity': 10,
 'the optimist': 34,
 'optimist sees': 27,
 'sees opportunity': 31,
 'opportunity in': 26,
 'every difficulty': 9,
 'don let': 6,
 'let yesterday': 17,
 'yesterday take': 40,
 'take up': 32,
 'up too': 37,
 'too much': 36,
 'much of': 22,
 'of today': 25,
 'you learn': 41,
 'learn more': 16,
 'more from': 21,
 'from failure': 12,
 'failure than': 11,
 'than from': 33,
 'from success': 13,
 'we may': 38,
 'may encounter': 20,
 'encounter many': 8,
 'many defeats': 19,
 'defeats but': 4,
 'but we': 2,
 'we must': 39,
 'must not': 23,
 'not be': 24,
 'be defeated': 1,
 'life is': 18,
 'is either': 15,
 'either daring': 7,
 'daring adventure': 3,
 'adventure or': 0,
 'or nothing': 28}

In [19]:
transformer_vector.toarray()

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

In [21]:
transformer_vector.toarray().shape

(6, 42)

In [22]:
n_gram_vectorizer.inverse_transform(transformer_vector)

[array(['the pessimist', 'pessimist sees', 'sees difficulty',
        'difficulty in', 'in every', 'every opportunity'], dtype='<U17'),
 array(['in every', 'the optimist', 'optimist sees', 'sees opportunity',
        'opportunity in', 'every difficulty'], dtype='<U17'),
 array(['don let', 'let yesterday', 'yesterday take', 'take up', 'up too',
        'too much', 'much of', 'of today'], dtype='<U17'),
 array(['you learn', 'learn more', 'more from', 'from failure',
        'failure than', 'than from', 'from success'], dtype='<U17'),
 array(['we may', 'may encounter', 'encounter many', 'many defeats',
        'defeats but', 'but we', 'we must', 'must not', 'not be',
        'be defeated'], dtype='<U17'),
 array(['life is', 'is either', 'either daring', 'daring adventure',
        'adventure or', 'or nothing'], dtype='<U17')]

In [24]:
n_gram_vectorizer = CountVectorizer(ngram_range=(1,2))
transformer_vector = n_gram_vectorizer.fit_transform(train_text)

n_gram_vectorizer.vocabulary_

{'the': 68,
 'pessimist': 58,
 'sees': 60,
 'difficulty': 11,
 'in': 27,
 'every': 19,
 'opportunity': 52,
 'the pessimist': 70,
 'pessimist sees': 59,
 'sees difficulty': 61,
 'difficulty in': 12,
 'in every': 28,
 'every opportunity': 21,
 'optimist': 54,
 'the optimist': 69,
 'optimist sees': 55,
 'sees opportunity': 62,
 'opportunity in': 53,
 'every difficulty': 20,
 'don': 13,
 'let': 33,
 'yesterday': 79,
 'take': 64,
 'up': 74,
 'too': 72,
 'much': 43,
 'of': 50,
 'today': 71,
 'don let': 14,
 'let yesterday': 34,
 'yesterday take': 80,
 'take up': 65,
 'up too': 75,
 'too much': 73,
 'much of': 44,
 'of today': 51,
 'you': 81,
 'learn': 31,
 'more': 41,
 'from': 24,
 'failure': 22,
 'than': 66,
 'success': 63,
 'you learn': 82,
 'learn more': 32,
 'more from': 42,
 'from failure': 25,
 'failure than': 23,
 'than from': 67,
 'from success': 26,
 'we': 76,
 'may': 39,
 'encounter': 17,
 'many': 37,
 'defeats': 9,
 'but': 4,
 'must': 45,
 'not': 47,
 'be': 2,
 'defeated': 8,
 'we

In [25]:
transformer_vector.toarray().shape

(6, 83)

In [26]:
n_gram_vectorizer.inverse_transform(transformer_vector)

[array(['the', 'pessimist', 'sees', 'difficulty', 'in', 'every',
        'opportunity', 'the pessimist', 'pessimist sees',
        'sees difficulty', 'difficulty in', 'in every',
        'every opportunity'], dtype='<U17'),
 array(['the', 'sees', 'difficulty', 'in', 'every', 'opportunity',
        'in every', 'optimist', 'the optimist', 'optimist sees',
        'sees opportunity', 'opportunity in', 'every difficulty'],
       dtype='<U17'),
 array(['don', 'let', 'yesterday', 'take', 'up', 'too', 'much', 'of',
        'today', 'don let', 'let yesterday', 'yesterday take', 'take up',
        'up too', 'too much', 'much of', 'of today'], dtype='<U17'),
 array(['you', 'learn', 'more', 'from', 'failure', 'than', 'success',
        'you learn', 'learn more', 'more from', 'from failure',
        'failure than', 'than from', 'from success'], dtype='<U17'),
 array(['we', 'may', 'encounter', 'many', 'defeats', 'but', 'must', 'not',
        'be', 'defeated', 'we may', 'may encounter', 'encounter 

## TF-IDF

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf_vectorizer = TfidfVectorizer()

In [29]:
transformer_vector = tfidf_vectorizer.fit_transform(train_text)

In [30]:
tfidf_vectorizer.vocabulary_

{'the': 34,
 'pessimist': 29,
 'sees': 30,
 'difficulty': 6,
 'in': 13,
 'every': 10,
 'opportunity': 26,
 'optimist': 27,
 'don': 7,
 'let': 16,
 'yesterday': 39,
 'take': 32,
 'up': 37,
 'too': 36,
 'much': 21,
 'of': 25,
 'today': 35,
 'you': 40,
 'learn': 15,
 'more': 20,
 'from': 12,
 'failure': 11,
 'than': 33,
 'success': 31,
 'we': 38,
 'may': 19,
 'encounter': 9,
 'many': 18,
 'defeats': 5,
 'but': 2,
 'must': 22,
 'not': 23,
 'be': 1,
 'defeated': 4,
 'life': 17,
 'is': 14,
 'either': 8,
 'daring': 3,
 'adventure': 0,
 'or': 28,
 'nothing': 24}

In [31]:
transformer_vector.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.36546139, 0.        , 0.        , 0.        ,
        0.36546139, 0.        , 0.        , 0.36546139, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.36546139, 0.        , 0.        , 0.44567684,
        0.36546139, 0.        , 0.        , 0.        , 0.36546139,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.36546139, 0.        , 0.        , 0.        ,
        0.36546139, 0.        , 0.        , 0.36546139, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.36546139, 0.44567684, 0.        , 0.        ,
        0.36546139, 0.     

In [32]:
transformer_vector.toarray().shape

(6, 41)

In [33]:
tfidf_vectorizer.idf_

array([2.25276297, 2.25276297, 2.25276297, 2.25276297, 2.25276297,
       2.25276297, 1.84729786, 2.25276297, 2.25276297, 2.25276297,
       1.84729786, 2.25276297, 2.25276297, 1.84729786, 2.25276297,
       2.25276297, 2.25276297, 2.25276297, 2.25276297, 2.25276297,
       2.25276297, 2.25276297, 2.25276297, 2.25276297, 2.25276297,
       2.25276297, 1.84729786, 2.25276297, 2.25276297, 2.25276297,
       1.84729786, 2.25276297, 2.25276297, 2.25276297, 1.84729786,
       2.25276297, 2.25276297, 2.25276297, 2.25276297, 2.25276297,
       2.25276297])
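
The two distinct values in `idf_` can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), for which the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the term. The function name `smoothed_idf` is made up for illustration.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_documents = 6
idf_rare = smoothed_idf(n_documents, 1)    # word found in one sentence only
idf_shared = smoothed_idf(n_documents, 2)  # word found in two sentences

# First sentence: six words shared with the second sentence plus 'pessimist'.
# Every term frequency is 1, so the raw TF-IDF weights equal the IDF values.
weights = [idf_shared] * 6 + [idf_rare]

# L2 normalization scales the row to unit Euclidean length.
norm = math.sqrt(sum(w * w for w in weights))
normalized = [w / norm for w in weights]

print(round(idf_rare, 8), round(idf_shared, 8))
print(round(normalized[0], 8), round(normalized[-1], 8))
```

The normalized values correspond to the two non-zero weights in the first row of the TF-IDF matrix: the lower one for the six shared words, the higher one for "pessimist".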

In [34]:
list(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))

[('adventure', 2.252762968495368),
 ('be', 2.252762968495368),
 ('but', 2.252762968495368),
 ('daring', 2.252762968495368),
 ('defeated', 2.252762968495368),
 ('defeats', 2.252762968495368),
 ('difficulty', 1.8472978603872037),
 ('don', 2.252762968495368),
 ('either', 2.252762968495368),
 ('encounter', 2.252762968495368),
 ('every', 1.8472978603872037),
 ('failure', 2.252762968495368),
 ('from', 2.252762968495368),
 ('in', 1.8472978603872037),
 ('is', 2.252762968495368),
 ('learn', 2.252762968495368),
 ('let', 2.252762968495368),
 ('life', 2.252762968495368),
 ('many', 2.252762968495368),
 ('may', 2.252762968495368),
 ('more', 2.252762968495368),
 ('much', 2.252762968495368),
 ('must', 2.252762968495368),
 ('not', 2.252762968495368),
 ('nothing', 2.252762968495368),
 ('of', 2.252762968495368),
 ('opportunity', 1.8472978603872037),
 ('optimist', 2.252762968495368),
 ('or', 2.252762968495368),
 ('pessimist', 2.252762968495368),
 ('sees', 1.8472978603872037),
 ('success', 2.25276296849536

In [35]:
tfidf_vectorizer.inverse_transform(transformer_vector)

[array(['opportunity', 'every', 'in', 'difficulty', 'sees', 'pessimist',
        'the'], dtype='<U11'),
 array(['optimist', 'opportunity', 'every', 'in', 'difficulty', 'sees',
        'the'], dtype='<U11'),
 array(['today', 'of', 'much', 'too', 'up', 'take', 'yesterday', 'let',
        'don'], dtype='<U11'),
 array(['success', 'than', 'failure', 'from', 'more', 'learn', 'you'],
       dtype='<U11'),
 array(['defeated', 'be', 'not', 'must', 'but', 'defeats', 'many',
        'encounter', 'may', 'we'], dtype='<U11'),
 array(['nothing', 'or', 'adventure', 'daring', 'either', 'is', 'life'],
       dtype='<U11')]

## Hashing

- Can be used to perform dimensionality reduction.
- A hash function maps values into buckets (a lower-dimensional representation).
- It is a one-way operation: the original value cannot be recovered from the bucket.

Apply a hash function to each word to determine its position in the feature vector that represents a document. This is fast and memory-efficient, but it has no inverse transform.
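
The idea can be sketched in a few lines. This is an illustration only: it uses MD5 from `hashlib` as a stand-in hash function, whereas scikit-learn's `HashingVectorizer` is documented to use a signed 32-bit MurmurHash3, so the bucket positions here will not match the outputs shown in these notes. The helper name `hash_vectorize` is hypothetical.

```python
import hashlib


def hash_vectorize(tokens, n_features):
    """Map each token to one of n_features buckets and count the hits."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an unsigned integer.
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vector[bucket] += 1
    return vector


tokens = "life is either daring adventure or nothing".split()
vector = hash_vectorize(tokens, n_features=8)

print(vector)
print(sum(vector))  # every token lands in exactly one bucket
```

No vocabulary is stored, which is why the approach is memory-efficient and needs no fitting step. The price is that different words can collide in the same bucket, and a bucket index cannot be mapped back to a word.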

In [36]:
from sklearn.feature_extraction.text import HashingVectorizer

In [37]:
vectorizer = HashingVectorizer(n_features=8, norm=None)
hash_vector = vectorizer.transform(train_text)
hash_vector.shape

(6, 8)

In [38]:
hash_vector.toarray()

array([[ 0.,  2.,  1., -1., -1.,  1., -1.,  0.],
       [ 0.,  2.,  0., -1., -1.,  1., -1., -1.],
       [-1., -1.,  0.,  0.,  2.,  1., -1., -1.],
       [ 0.,  0.,  0., -1.,  0.,  0.,  0., -1.],
       [ 1.,  1.,  0., -2.,  1.,  1.,  1.,  2.],
       [-1.,  0.,  0.,  0.,  1.,  1., -1.,  1.]])

In [39]:
vectorizer = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
hash_vector = vectorizer.transform(train_text)
hash_vector.toarray()

array([[0., 2., 1., 1., 1., 1., 1., 0.],
       [0., 2., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 2., 1., 3., 1.],
       [0., 0., 2., 1., 4., 0., 0., 1.],
       [1., 1., 0., 2., 1., 1., 3., 2.],
       [1., 2., 0., 0., 1., 1., 1., 1.]])

In [40]:
vectorizer = HashingVectorizer(n_features=8, norm='l1')
hash_vector = vectorizer.transform(train_text)
hash_vector.toarray()

array([[ 0.        ,  0.28571429,  0.14285714, -0.14285714, -0.14285714,
         0.14285714, -0.14285714,  0.        ],
       [ 0.        ,  0.28571429,  0.        , -0.14285714, -0.14285714,
         0.14285714, -0.14285714, -0.14285714],
       [-0.14285714, -0.14285714,  0.        ,  0.        ,  0.28571429,
         0.14285714, -0.14285714, -0.14285714],
       [ 0.        ,  0.        ,  0.        , -0.5       ,  0.        ,
         0.        ,  0.        , -0.5       ],
       [ 0.11111111,  0.11111111,  0.        , -0.22222222,  0.11111111,
         0.11111111,  0.11111111,  0.22222222],
       [-0.2       ,  0.        ,  0.        ,  0.        ,  0.2       ,
         0.2       , -0.2       ,  0.2       ]])

In [41]:
vectorizer = HashingVectorizer(n_features=8, norm='l2')
hash_vector = vectorizer.transform(train_text)
hash_vector.toarray()

array([[ 0.        ,  0.66666667,  0.33333333, -0.33333333, -0.33333333,
         0.33333333, -0.33333333,  0.        ],
       [ 0.        ,  0.66666667,  0.        , -0.33333333, -0.33333333,
         0.33333333, -0.33333333, -0.33333333],
       [-0.33333333, -0.33333333,  0.        ,  0.        ,  0.66666667,
         0.33333333, -0.33333333, -0.33333333],
       [ 0.        ,  0.        ,  0.        , -0.70710678,  0.        ,
         0.        ,  0.        , -0.70710678],
       [ 0.2773501 ,  0.2773501 ,  0.        , -0.5547002 ,  0.2773501 ,
         0.2773501 ,  0.2773501 ,  0.5547002 ],
       [-0.4472136 ,  0.        ,  0.        ,  0.        ,  0.4472136 ,
         0.4472136 , -0.4472136 ,  0.4472136 ]])
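
The effect of the `norm` parameter can be checked by hand on the first row of the unnormalized output (`norm=None`). A short sketch using only the standard library, assuming that `'l1'` divides by the sum of absolute values and `'l2'` divides by the Euclidean length:

```python
import math

# First row of the signed, unnormalized hash vector shown earlier.
row = [0.0, 2.0, 1.0, -1.0, -1.0, 1.0, -1.0, 0.0]

l1 = sum(abs(x) for x in row)            # sum of absolute values
l2 = math.sqrt(sum(x * x for x in row))  # Euclidean length

row_l1 = [x / l1 for x in row]
row_l2 = [x / l2 for x in row]

print(l1, l2)
print([round(x, 8) for x in row_l1])
print([round(x, 8) for x in row_l2])
```

The results agree with the first rows of the `norm='l1'` and `norm='l2'` outputs above: 2/7 ≈ 0.28571429 and 2/3 ≈ 0.66666667 for the bucket that received two tokens.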

## Feature extraction

In [42]:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

In [43]:
movie_ratings = [
    {'name': 'Avengers Endgame', 'imdb': 8.8},
    {'name':'Inception', 'imdb':8.8},
    {'name':'The Wolf of Wall Street', 'imdb':8.2}
]

In [44]:
vac = DictVectorizer()

In [45]:
transformer_vector = vac.fit_transform(movie_ratings).toarray()
transformer_vector

array([[8.8, 1. , 0. , 0. ],
       [8.8, 0. , 1. , 0. ],
       [8.2, 0. , 0. , 1. ]])

In [46]:
vac.get_feature_names_out()

array(['imdb', 'name=Avengers Endgame', 'name=Inception',
       'name=The Wolf of Wall Street'], dtype=object)

In [47]:
vac.vocabulary_

{'name=Avengers Endgame': 1,
 'imdb': 0,
 'name=Inception': 2,
 'name=The Wolf of Wall Street': 3}
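
The feature names show what happened to each key: the numeric `imdb` value is kept as a single column, while the string-valued `name` is expanded into one `name=<value>` indicator column per distinct value. The following is a simplified re-creation of that behaviour for illustration; `dict_vectorize` is a hypothetical helper, not the library implementation.

```python
def dict_vectorize(records):
    """One-hot encode string values and pass numeric values through."""

    def feature_name(key, value):
        # String values get a dedicated 'key=value' column.
        return f"{key}={value}" if isinstance(value, str) else key

    feature_names = sorted(
        {feature_name(k, v) for record in records for k, v in record.items()}
    )
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            column = index[feature_name(key, value)]
            row[column] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return feature_names, rows


movie_ratings = [
    {'name': 'Avengers Endgame', 'imdb': 8.8},
    {'name': 'Inception', 'imdb': 8.8},
    {'name': 'The Wolf of Wall Street', 'imdb': 8.2},
]

feature_names, rows = dict_vectorize(movie_ratings)
print(feature_names)
print(rows)
```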

In [48]:
df = pd.DataFrame(transformer_vector, columns=vac.get_feature_names_out())

In [49]:
df.head()

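|   | imdb | name=Avengers Endgame | name=Inception | name=The Wolf of Wall Street |
|---|------|-----------------------|----------------|------------------------------|
| 0 | 8.8  | 1.0                   | 0.0            | 0.0                          |
| 1 | 8.8  | 0.0                   | 1.0            | 0.0                          |
| 2 | 8.2  | 0.0                   | 0.0            | 1.0                          |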
