# <b style="color:green">Text Representation</b>

- How to convert text into numbers.
- Feature extraction

## <b style="color:red">Introduction</b>
- What is Feature Extraction from text?
  - Converting text into numbers is called `Feature Extraction`, `Text Representation`, or `Text Vectorization`. It is needed because machine learning models work only with numbers, not words.
- Why is it difficult?
  - It is easy to convert an image to numbers.
  - It is easy to convert speech to numbers.
  - It is not easy to convert a sentence to numbers, because a naive conversion can lose or change the meaning of the sentence.
- What are the techniques to convert a text to numbers?
  - OHE : One Hot Encoding
  - Bag of words
  - N-Grams
  - TF-IDF
  - Custom Features : Make your own features
  - Word2Vec : embeddings (DL Topics)
- Common Terms
  - _Corpus (C)_ : All the words of the dataset taken together, including repeated words. It is the combination of the words of every Document.
  - _Vocabulary (V)_ : The set of unique words in the Corpus.
  - _Document (D)_ : An individual text entry, i.e. a single row of the dataset.
  - _Word (W)_ : An individual word of a Document.
  - __Example__:
    - Dataset
      <pre>
      _____________________________
      |D1 | people watch campusx  |
      |___|_______________________|
      |D2 | campusx watch campusx |
      |___|_______________________|
      |D3 | people write comment  |
      |___|_______________________|
      |D4 | campusx write comment |
      |___|_______________________|
  </pre>
    - _Corpus_ : people watch campusx campusx watch campusx people write comment campusx write comment
    - _Vocabulary_ : people watch campusx write comment  : V=5
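
The Corpus and Vocabulary of the example dataset can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes words are separated by single spaces; the variable names `corpus` and `vocabulary` are illustrative only.

```python
# The four example Documents (D1..D4).
docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Corpus: every word of every Document, repeated words included.
corpus = [word for doc in docs for word in doc.split()]

# Vocabulary: the unique words, kept in order of first appearance.
vocabulary = list(dict.fromkeys(corpus))

print(" ".join(corpus))
print(vocabulary, "V =", len(vocabulary))
```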

### <b style="color:red">One Hot Encoding (OHE)</b>
- Dataset
  <pre>
      _____________________________
      |D1 | people watch campusx  |
      |___|_______________________|
      |D2 | campusx watch campusx |
      |___|_______________________|
      |D3 | people write comment  |
      |___|_______________________|
      |D4 | campusx write comment |
      |___|_______________________|
  </pre>
  - _Corpus_ : people watch campusx campusx watch campusx people write comment campusx write comment
  - _Vocabulary_ : people watch campusx write comment  : V=5
- One Hot Encoding converts every word of your Documents into a `V`-dimensional vector that has a 1 at the word's Vocabulary position and 0 everywhere else.
  <pre>
    ____________________________________________ 
    |       |people|watch|campusx|write|comment|
    |_______|______|_____|_______|_____|_______|
    |people |   1  |  0  |   0   |  0  |   0   |
    |_______|______|_____|_______|_____|_______|
    |watch  |   0  |  1  |   0   |  0  |   0   |
    |_______|______|_____|_______|_____|_______|
    |campusx|   0  |  0  |   1   |  0  |   0   |
    |_______|______|_____|_______|_____|_______|
    |write  |   0  |  0  |   0   |  1  |   0   |
    |_______|______|_____|_______|_____|_______|
    |comment|   0  |  0  |   0   |  0  |   1   |
    |_______|______|_____|_______|_____|_______|
  </pre>
- One Hot Encoding of each Document (one row per word)
  <pre>
      D1 = [[1, 0, 0, 0, 0],
            [0, 1, 0, 0, 0],
            [0, 0, 1, 0, 0]]
      D2 = [[0, 0, 1, 0, 0],
            [0, 1, 0, 0, 0],
            [0, 0, 1, 0, 0]]
      D3 = [[1, 0, 0, 0, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 1]]
      D4 = [[0, 0, 1, 0, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 1]]
  </pre> 
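- Advantage
  - It is intuitive.
  - It is easy to implement.
- Disadvantage
  - Sparsity : each word vector is a row of the identity matrix (almost all zeros), which wastes memory and can lead to overfitting.
  - Variable-length output : Documents with different word counts produce arrays of different shapes.
  - OOV (Out of Vocabulary) : a new word that is not in the Vocabulary cannot be encoded.
  - No semantic meaning is captured, e.g. walk, run, and bottle are all equally distant from each other.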
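
The encoding above can be sketched with NumPy. The helper `one_hot_document` is a hypothetical name introduced for illustration, and the Vocabulary order is taken from the table in this section.

```python
import numpy as np

vocabulary = ["people", "watch", "campusx", "write", "comment"]
index = {word: i for i, word in enumerate(vocabulary)}


def one_hot_document(doc):
    """Return one V-dimensional one-hot row per word of the Document."""
    words = doc.split()
    encoded = np.zeros((len(words), len(vocabulary)), dtype=int)
    for row, word in enumerate(words):
        # An Out of Vocabulary word raises KeyError here.
        encoded[row, index[word]] = 1
    return encoded


d2 = one_hot_document("campusx watch campusx")
print(d2)

# A shorter Document gives a different shape: the variable-length problem.
print(one_hot_document("people watch").shape)
```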

### <b style="color:red">Bag of Words</b>
- Commonly used for text classification problems.
- Depends on how many times each word occurs in a Document.
- In Bag of Words, the order of words does not matter.
- Dataset
  - <pre>
      ___________________________________
      |D1 | people watch campusx  |  1  |
      |___|_______________________|_____|
      |D2 | campusx watch campusx |  1  |
      |___|_______________________|_____| 
      |D3 | people write comment  |  0  |
      |___|_______________________|_____|
      |D4 | campusx write comment |  0  |
      |___|_______________________|_____|
  </pre>
  - _Corpus_ : people watch campusx campusx watch campusx people write comment campusx write comment
  - _Vocabulary_ : people watch campusx write comment  : V=5
- Bag of Words counts how many times each Vocabulary word occurs in a Document.
  <pre>
    ____________________________________________ 
    |       |people|watch|campusx|write|comment|
    |_______|______|_____|_______|_____|_______|
    |  D1   |   1  |  1  |   1   |  0  |   0   |
    |_______|______|_____|_______|_____|_______|
    |  D2   |   0  |  1  |   2   |  0  |   0   |
    |_______|______|_____|_______|_____|_______|
    |  D3   |   1  |  0  |   0   |  1  |   1   |
    |_______|______|_____|_______|_____|_______|
    |  D4   |   0  |  0  |   1   |  1  |   1   |
    |_______|______|_____|_______|_____|_______|
  </pre>
- Every _Document_ is converted into a ___V___-dimensional vector. \
  Documents can then be compared in this vector space: `cos(theta)`, the cosine of the angle between two document vectors, tells how closely \
  two Documents are related. A large `theta` means weakly related; a small `theta` means closely related. \
  To build these vectors we use `CountVectorizer`, imported with `from sklearn.feature_extraction.text import CountVectorizer`.
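
As a rough illustration of the `cos(theta)` idea, the sketch below builds the Bag of Words vectors and compares every pair of Documents. It assumes scikit-learn's `cosine_similarity` helper from `sklearn.metrics.pairwise`, which is not part of the original notes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# One V-dimensional count vector per Document.
bow = CountVectorizer().fit_transform(docs)

# sim[i, j] is cos(theta) between Document i and Document j.
sim = cosine_similarity(bow)
print(sim.round(2))
```

D1 and D2 share `watch` and `campusx`, so their cosine is high, while D2 and D3 share no word and get a cosine of 0.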


In [1]:
import numpy as np
import pandas as pd

docs = ['people watch campusx',
       'campusx watch campusx',
       'people write comment',
       'campusx write comment']
output = [1, 1, 0, 0]

df = pd.DataFrame({'text':docs, 'output':output})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [3]:
bow = cv.fit_transform(df['text'])
# vocab
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [4]:
# [   0   ,    1   ,   2   ,  3   ,  4   ]
# [campusx, comment, people, watch, write]

print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]


In [5]:
# Words outside the vocabulary ('and', 'of', 'because') do not raise an error; they are simply ignored
cv.transform(['campusx watch and write comment of people because people watch campusx']).toarray()

array([[2, 1, 2, 2, 1]], dtype=int64)

In [6]:
msg = ["people and student watch movies and write comment. campusx is youtube channel. from where people can learn manythings."]
cv.transform(msg).toarray()

array([[1, 1, 2, 1, 1]], dtype=int64)

- `from sklearn.feature_extraction.text import CountVectorizer` \
   `cv = CountVectorizer()`
- Useful parameters (default values shown): `lowercase=True`, `tokenizer=None`, `stop_words=None`, `token_pattern='(?u)\b\w\w+\b'`,   \
  `ngram_range=(1, 1)`, `binary=False` (with `binary=True` every count above 0 is stored as 1), `max_features=None`.  \
  `max_features=1` keeps only the single most frequent term of the Corpus; `max_features=2` keeps the two most frequent terms.
- Advantage :
  - Simple and intuitive
  - Documents of different lengths always produce a fixed-size (`V`-dimensional) output
  - OOV (Out of Vocabulary) words do not cause an error
  - Captures some semantic relationship: Documents that share many words get similar vectors
- Disadvantage :
  - Sparsity : can lead to overfitting
  - OOV (Out of Vocabulary) words are ignored, so their information is lost.
  - The order of words in a sentence is not considered.
  - * This is a very good movie.
    * This is not a very good movie.
    * Both sentences have opposite meanings, but `Bag of Words` treats them as very close to each other.
  
  

### <b style="color:red">N-grams or Bag of N-grams</b>
- Rather than using single words as the Vocabulary, as in `Bag of Words`, `N-grams` builds the Vocabulary from sequences of multiple words.
- Bi-grams : Take 2 contiguous words to make Vocabulary. \
  Tri-grams : Take 3 contiguous words to make Vocabulary. \
  n-grams : Take n contiguous words to make Vocabulary.
- Dataset
  - <pre>
      ___________________________________
      |D1 | people watch campusx  |  1  |
      |___|_______________________|_____|
      |D2 | campusx watch campusx |  1  |
      |___|_______________________|_____| 
      |D3 | people write comment  |  0  |
      |___|_______________________|_____|
      |D4 | campusx write comment |  0  |
      |___|_______________________|_____|
  </pre>
- Bag of ___Bi-grams___. \
  Vocabulary : | people watch | watch campusx | campusx watch | people write | write comment |campusx write | \
  __V__ = 6
  <pre>
    ___________________________________________________________________________________________
    |       |people watch|watch campusx|campusx watch|people write|write comment|campusx write|
    |_______|____________|_____________|_____________|____________|_____________|_____________|
    |  D1   |      1     |      1      |      0      |      0     |      0      |      0      |
    |_______|____________|_____________|_____________|____________|_____________|_____________|
    |  D2   |      0     |      1      |      1      |      0     |      0      |      0      |
    |_______|____________|_____________|_____________|____________|_____________|_____________|
    |  D3   |      0     |      0      |      0      |      1     |      1      |      0      |
    |_______|____________|_____________|_____________|____________|_____________|_____________|
    |  D4   |      0     |      0      |      0      |      0     |      1      |      1      |
    |_______|____________|_____________|_____________|____________|_____________|_____________|
  </pre>
- Bag of ___Tri-grams___. \
  Vocabulary : | people watch campusx | campusx watch campusx | people write comment | campusx write comment | \
  __V__ = 4
  <pre>
    _______________________________________________________________________________________________
    |       |people watch campusx|campusx watch campusx|people write comment|campusx write comment|
    |_______|____________________|_____________________|____________________|_____________________|
    |  D1   |          1         |          0          |          0         |          0          |
    |_______|____________________|_____________________|____________________|_____________________|
    |  D2   |          0         |          1          |          0         |          0          |
    |_______|____________________|_____________________|____________________|_____________________|
    |  D3   |          0         |          0          |          1         |          0          |
    |_______|____________________|_____________________|____________________|_____________________|
    |  D4   |          0         |          0          |          0         |          1          |
    |_______|____________________|_____________________|____________________|_____________________|
  </pre>

- `from sklearn.feature_extraction.text import CountVectorizer`  \
  `cv = CountVectorizer(ngram_range=(1, 1)) # Uni-grams or Bag of words` \
  `cv = CountVectorizer(ngram_range=(2, 2)) # Bi-grams` \
  `cv = CountVectorizer(ngram_range=(1, 2)) # Uni-grams + Bi-grams` \
  `cv = CountVectorizer(ngram_range=(3, 3)) # Tri-grams` \
  `cv = CountVectorizer(ngram_range=(2, 3)) # Bi-grams + Tri-grams` \
  `cv = CountVectorizer(ngram_range=(1, 3)) # Uni-grams + Bi-grams + Tri-grams`
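
To make "n contiguous words" concrete, here is a minimal pure-Python sketch. The helper `ngrams` is a hypothetical name and splits on whitespace only, which is simpler than the tokenization `CountVectorizer` performs.

```python
def ngrams(text, n):
    """Return every run of n contiguous words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

print(ngrams(docs[1], 2))

# The bi-gram Vocabulary is the set of distinct bi-grams over all Documents.
bigram_vocab = sorted({gram for doc in docs for gram in ngrams(doc, 2)})
print(bigram_vocab, "V =", len(bigram_vocab))
```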

In [7]:
import numpy as np
import pandas as pd

docs = ['people watch campusx',
       'campusx watch campusx',
       'people write comment',
       'campusx write comment']
output = [1, 1, 0, 0]

df = pd.DataFrame({'text':docs, 'output':output})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# Uni-grams
cv = CountVectorizer(ngram_range=(1, 1))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [10]:
# Bi-grams
cv = CountVectorizer(ngram_range=(2, 2))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}


In [11]:
# Uni-grams + Bi-grams
cv = CountVectorizer(ngram_range=(1, 2))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'write': 9, 'comment': 3, 'people write': 6, 'write comment': 10, 'campusx write': 2}


In [12]:
# Tri-grams
cv = CountVectorizer(ngram_range=(3, 3))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people watch campusx': 2, 'campusx watch campusx': 0, 'people write comment': 3, 'campusx write comment': 1}


In [13]:
# Bi-grams + Tri-grams
cv = CountVectorizer(ngram_range=(2, 3))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people watch': 4, 'watch campusx': 8, 'people watch campusx': 5, 'campusx watch': 0, 'campusx watch campusx': 1, 'people write': 6, 'write comment': 9, 'people write comment': 7, 'campusx write': 2, 'campusx write comment': 3}


In [14]:
# Uni-grams + Bi-grams + Tri-grams
cv = CountVectorizer(ngram_range=(1, 3))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
print(len(cv.vocabulary_))

{'people': 6, 'watch': 11, 'campusx': 0, 'people watch': 7, 'watch campusx': 12, 'people watch campusx': 8, 'campusx watch': 1, 'campusx watch campusx': 2, 'write': 13, 'comment': 5, 'people write': 9, 'write comment': 14, 'people write comment': 10, 'campusx write': 3, 'campusx write comment': 4}
15


- Dataset : \
  D1 : This movie is very good. \
  D2 : This movie is not good.
- Uni-grams :    
  <pre>
                   _________________________________________
      Vocabulary = | This | movie | is | very | good | not |       V=6
                   |______|_______|____|______|______|_____|
                D1 |  1   |   1   |  1 |  1   |  1   |  0  |  
                   |______|_______|____|______|______|_____|
                D2 |  1   |   1   |  1 |  0   |  1   |  1  |
                   |______|_______|____|______|______|_____|
                Matching positions = 4
                Differing positions = 2
                Both vectors are very close to each other.
  </pre>
- Bi-grams :    
  <pre>
                   ___________________________________________________________________
      Vocabulary = | This movie | movie is | is very | very good | is not | not good |      V=6
                   |____________|__________|_________|___________|________|__________|
                D1 |      1     |     1    |    1    |      1    |    0   |     0    |  
                   |____________|__________|_________|___________|________|__________|
                D2 |      1     |     1    |    0    |      0    |    1   |     1    |
                   |____________|__________|_________|___________|________|__________|
                Matching positions = 2
                Differing positions = 4
                Both vectors are not close to each other.
  </pre>

- Advantage :
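  - Able to capture some semantic meaning of a sentence, because local word order is preserved.
  - Easy to implement
  - Intuitive to understand
- Disadvantage :
  - On real datasets the Vocabulary usually grows as n increases, which means a more complex model and more computation.
  - No solution for new (Out of Vocabulary) words.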
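
The uni-gram versus bi-gram comparison of the two movie sentences can be quantified with cosine similarity. This sketch assumes scikit-learn's `cosine_similarity`; the helper `pair_similarity` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This movie is very good.", "This movie is not good."]


def pair_similarity(ngram_range):
    """Cosine similarity of the two sentences for a given ngram_range."""
    vectors = CountVectorizer(ngram_range=ngram_range).fit_transform(sentences)
    return cosine_similarity(vectors)[0, 1]


unigram_sim = pair_similarity((1, 1))
bigram_sim = pair_similarity((2, 2))
print("uni-grams:", round(unigram_sim, 2))
print("bi-grams :", round(bigram_sim, 2))
```

The bi-gram vectors are further apart because `is very`, `very good`, `is not`, and `not good` each appear in only one of the sentences.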


### <b style="color:red">TF-IDF</b>
- It assigns different weights to the different words of a Document. If the frequency of a word is high in a Document but low in the rest of the Corpus, that word gets more weight for that Document than the other words.
- `TF : Term Frequency`  \
  `IDF : Inverse Document Frequency`   \
  `Weightage of Word = TF x IDF`
-
  <pre>
                 (Number of occurrences of term <b><i>t</i></b> in document <b><i>d</i></b>)
 TF(t, d) =  _______________________________________________________________
                    (Total number of terms in the document <b><i>d</i></b>)

                    (Total number of documents in the corpus)
 IDF(t) =  log<sub>e</sub>_________________________________________________________ + 1
                    (Number of documents with term <b><i>t</i></b> in them)

  </pre>
- Dataset
  - <pre>
      ___________________________________
      |D1 | people watch campusx  |  1  |
      |___|_______________________|_____|
      |D2 | campusx watch campusx |  1  |
      |___|_______________________|_____| 
      |D3 | people write comment  |  0  |
      |___|_______________________|_____|
      |D4 | campusx write comment |  0  |
      |___|_______________________|_____|
  </pre>
  - TF(people, D1) = 1/3   \
    TF(campusx, D2) = 2/3  \
    Term frequency is like the __probability__ of that word in the Document, so it always lies in the range [0, 1].
  - __log<sub>e</sub>__ is also called `ln`. \
    IDF(campusx) = ln(4/3)+1 = 1.29  \
    IDF(watch) = ln(4/2)+1 = 1.69    \
    IDF(people) = ln(4/2)+1 = 1.69   \
    IDF(write) = ln(4/2)+1 = 1.69    \
    IDF(comment) = ln(4/2)+1 = 1.69
  - If a word occurs in every Document: IDF(word) = ln(4/4)+1 = 1  : __because ln(1)=0; the +1 keeps such a word from getting zero weight.__  \
    If a word occurs in 1 of 1000 Documents: IDF(word) = ln(1000/1)+1 = 7.91 : __the ln keeps IDF from growing so large that it dominates TF.__
  - `Weightage of Word = TF x IDF`
  <pre>
      ___________________________________________________________________________________
      |      |    people    |     watch    |   campusx    |   write      |   comment    |
      |______|______________|______________|______________|______________|______________|
      |  D1  | (1/3)*(1.69) | (1/3)*(1.69) | (1/3)*(1.29) | (0)*(1.69)   | (0)*(1.69)   |
      |______|______________|______________|______________|______________|______________|
      |  D2  | (0)*(1.69)   | (1/3)*(1.69) | (2/3)*(1.29) | (0)*(1.69)   | (0)*(1.69)   |
      |______|______________|______________|______________|______________|______________|
      |  D3  | (1/3)*(1.69) | (0)*(1.69)   | (0)*(1.29)   | (1/3)*(1.69) | (1/3)*(1.69) |
      |______|______________|______________|______________|______________|______________|
      |  D4  | (0)*(1.69)   | (0)*(1.69)   | (1/3)*(1.29) | (1/3)*(1.69) | (1/3)*(1.69) |
      |______|______________|______________|______________|______________|______________|
  </pre>
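
The hand calculation above can be reproduced in plain Python. This is a sketch of the TF and IDF formulas exactly as written in this section; the helper names `tf` and `idf` are illustrative.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
vocabulary = ["people", "watch", "campusx", "write", "comment"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def tf(term, words):
    """Occurrences of the term divided by the Document length."""
    return words.count(term) / len(words)


def idf(term):
    """ln(total Documents / Documents containing the term) + 1."""
    doc_freq = sum(term in words for words in tokenized)
    return math.log(n_docs / doc_freq) + 1


# One row of TF x IDF weights per Document, columns in Vocabulary order.
weights = [[tf(t, words) * idf(t) for t in vocabulary] for words in tokenized]
for name, row in zip(["D1", "D2", "D3", "D4"], weights):
    print(name, [round(w, 3) for w in row])
```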
  

In [15]:
import numpy as np
import pandas as pd

docs = ['people watch campusx',
       'campusx watch campusx',
       'people write comment',
       'campusx write comment']
output = [1, 1, 0, 0]

df = pd.DataFrame({'text':docs, 'output':output})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [17]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
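
The values printed by `TfidfVectorizer` differ from the hand formula: for example, the IDF of `campusx` is about 1.22 rather than ln(4/3)+1. With its default settings, scikit-learn smooths the IDF as ln((1 + N) / (1 + df)) + 1, uses the raw count as TF, and then scales every row to unit L2 length. The sketch below rebuilds the output under those defaults; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # Documents containing each term

# Smoothed IDF, as used when smooth_idf=True (the default).
smooth_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts times IDF, then L2-normalize each row (norm='l2' is the default).
weighted = counts * smooth_idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf = TfidfVectorizer()
expected = tfidf.fit_transform(docs).toarray()
print(np.allclose(smooth_idf, tfidf.idf_))
print(np.allclose(manual, expected))
```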


- Advantage :
  - Widely used in Information Retrieval systems such as search engines.
- Disadvantage :
  - Sparsity
  - OOV (Out of Vocabulary) words are ignored
  - The dimension can be very large, so overfitting may occur.
  - Cannot capture semantic meaning.

### <b style="color:red">Custom Features</b>
- Number of positive words. \
  Number of negative words. \
  Ratio of positive/negative words \
  Word count. \
  Character count 
- Features
  1. Technique Features (from the techniques above)
  2. Custom Features
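
The custom features listed above could be computed along these lines. This is only a sketch: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` are tiny hypothetical lexicons, and adding 1 to the denominator of the ratio is an assumption made here to avoid division by zero.

```python
# Hypothetical lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "boring", "hate"}


def custom_features(text):
    """Return a dictionary of hand-made features for one Document."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return {
        "positive_words": positive,
        "negative_words": negative,
        # Assumption: +1 in the denominator avoids division by zero.
        "pos_neg_ratio": positive / (negative + 1),
        "word_count": len(words),
        "char_count": len(text),
    }


review = "Good plot, great cast, boring ending."
features = custom_features(review)
print(features)
```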