## Part a: Bag of Words
## Transforming text to a vector

There are many ways to transform text data into numeric vectors. In this task you will try two of them. One well-known approach is the bag-of-words representation. To create it, follow these steps:

1. Find the N most popular words in the train corpus and enumerate them. This gives a dictionary of the most popular words.  
2. For each title in the corpus, create a zero vector of dimension N.  
3. For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.
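
Step 1 can be sketched as follows. This is a minimal illustration, not part of the assignment code: the helper name `build_words_to_index` is hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter

def build_words_to_index(corpus, n):
    """Return the n most frequent words in `corpus`, most frequent first.

    Hypothetical helper; assumes words are separated by whitespace.
    """
    counts = Counter(word for text in corpus for word in text.split())
    # most_common(n) yields (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common(n)]

top_words = build_words_to_index(['hi how are you', 'hi you', 'hi'], 2)
print(top_words)
```

The position of each word in the returned list serves as its index in the bag-of-words vector.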

In [14]:
import numpy as np

def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: a list, train corpus words
    dict_size: size of the dictionary

    return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    
    words_idx = {word: idx for idx, word in enumerate(words_to_index)}
    
    for word in text.split():
        if word in words_idx:
            result_vector[words_idx[word]] += 1
    
    return result_vector

text = 'hi how are you'
words_to_index = ['hi', 'you', 'me', 'are']
n = len(words_to_index)

bow_vector = my_bag_of_words(text, words_to_index, n)
print(bow_vector)


[1. 1. 0. 1.]
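
As a cross-check, scikit-learn's `CountVectorizer` should give the same counts when it is handed the same fixed vocabulary. This is a sketch that assumes scikit-learn is installed; note that its default tokenizer lowercases text and ignores single-character tokens, which does not matter for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

words_to_index = ['hi', 'you', 'me', 'are']

# With a fixed vocabulary, column k counts occurrences of words_to_index[k]
vectorizer = CountVectorizer(vocabulary=words_to_index)
counts = vectorizer.transform(['hi how are you']).toarray()
print(counts)
```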


## Part b:  TF-IDF

1. Test the script tfidf_demo.ipynb in a Jupyter notebook and make sure it works. 

2. Replace the movie review data "texts" in the script file with your own defined document and test it.

3. Given the documents below:  
texts = [
    "good movie", "not a good movie", "did not like", 
    "i like it", "good one"
]

Given the definitions of TF and IDF, what is the sum of TF-IDF values for the 1-grams in the "good movie" text? Enter a math expression as an answer.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "good movie", "not a good movie", "did not like", 
    "i like it", "good one"
]

# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
df = pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)

df

|   | good movie | like | movie | not |
|---|---|---|---|---|
| 0 | 0.707107 | 0.0 | 0.707107 | 0.0 |
| 1 | 0.57735 | 0.0 | 0.57735 | 0.57735 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.707107 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

test = [
    'a great film', 'great cast', 'a pleasure to watch', 
    'not good', 'hard to watch', 'boring film'
]

# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(test)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)

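|   | film | great | to | to watch | watch |
|---|---|---|---|---|---|
| 0 | 0.707107 | 0.707107 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |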


To calculate the sum of TF-IDF values for the term "good movie" (treated here as a single term) over the documents:  
t = a term, d = a document, D = the set of N total documents, n = the number of documents in D that contain t  

For a term t_i in the set of terms T and a given document d_j in the corpus D, here with  
t_i = "good movie", N = 5, n = 2:   
  
TFIDF(t_i,d_j,D) = TF(t_i,d_j)*IDF(t_i,D)  

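TF(t_i,d_j) = frequency of t_i in d_j, e.g.:    
TF(t_i,d_j) = n(t_i,d_j) / n(d_j), where n(t_i,d_j) is the number of times t_i occurs in d_j and n(d_j) is the total number of terms in d_j  

IDF(t_i,D) = log(N / |{d_j in D : t_i in d_j}|), where the denominator is the number of documents that contain t_i  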
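
The TF and IDF definitions above can be evaluated directly for the 1-grams of the "good movie" document. The sketch below is only an illustration under stated assumptions: whitespace tokenization, the natural logarithm, and no normalization. It is not what `TfidfVectorizer` computes, since scikit-learn by default smooths the IDF, l2-normalizes each row, and drops single-character tokens.

```python
import math

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
docs = [t.split() for t in texts]  # assumption: whitespace tokenization

def tf(term, doc):
    # n(t, d) / n(d): occurrences of the term over the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents containing the term), natural log assumed
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

target = docs[0]  # "good movie"
tfidf_sum = sum(tf(w, target) * idf(w, docs) for w in set(target))
print(tfidf_sum)
```

Under these assumptions the sum is the expression 0.5 * log(5/3) + 0.5 * log(5/2): "good" occurs in 3 of the 5 documents and "movie" in 2 of the 5, and each makes up half of the two-word document.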

With D = ["good movie", "not a good movie", "did not like", "i like it", "good one"],   
D = [d_0, ..., d_j] = [d_0, d_1, d_2, d_3, d_4]   
  
For the set of terms T = ["good movie", "like", "movie", "not"], take i = 0, so that t_i = "good movie".  
  
Using the l2 norm, for a given document d_j:  
$$
\lVert TFIDF(\cdot,d_{j},D) \rVert_2 = \sqrt{\sum_{t_{k} \in T} TFIDF(t_{k},d_{j},D)^2}
$$
The normalized TFIDF' for the document for a given term, t_i:  
$$
TFIDF'(t_{i},d_{j},D) = \frac{TFIDF(t_{i},d_{j},D)}{\lVert TFIDF(\cdot,d_{j},D) \rVert_2}
$$
  
Since term t_i does not appear in documents 2, 3, or 4:
$$
TFIDF(t_{i},d_{j},D) = 0 \quad \text{for } j = 2, 3, 4
$$
The sum of normalized TFIDF' values for "good movie" over all documents:  
$$
\sum_{j=0}^{4} TFIDF'(t_{i},d_{j},D) = \frac{TFIDF(t_{i},d_{0},D)}{\lVert TFIDF(\cdot,d_{0},D) \rVert_2} + \frac{TFIDF(t_{i},d_{1},D)}{\lVert TFIDF(\cdot,d_{1},D) \rVert_2}
$$

In [11]:
print(f"1-gram sum(good movie): {df['good movie'].sum()}")

1-gram sum(good movie): 1.2844570503761732
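
The printed value can be checked by hand, assuming scikit-learn's documented defaults (raw term counts and l2 normalization of each row). Every feature kept by `min_df=2, max_df=0.5` occurs in exactly two documents, so all features share the same IDF value, written w here. Document d_0 has two nonzero features ("good movie", "movie") and d_1 has three ("good movie", "movie", "not"), each with count 1, so w cancels:

$$
\sum_{j=0}^{4} TFIDF'(t_{i},d_{j},D) = \frac{w}{\sqrt{2w^2}} + \frac{w}{\sqrt{3w^2}} = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{3}} \approx 1.28446
$$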


To find the TF-IDF values of the 1-grams "good" and "movie" within "good movie", we can use the same formulas with a new set of terms in which each term is a single word. This is what the default arguments of `TfidfVectorizer()` produce:

In [16]:
# using default tokenizer in TfidfVectorizer
texts = [
    "good movie", "not a good movie", "did not like", 
    "i like it", "good one"
]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
df = pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)

print(f"1-gram sum(good) = {df['good'].sum()}")
print(f"1-gram sum(movie) = {df['movie'].sum()}")

df

1-gram sum(good) = 1.7013655042127687
1-gram sum(movie) = 1.379265529325895


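|   | did | good | it | like | movie | not | one |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.638711 | 0.0 | 0.0 | 0.769447 | 0.0 | 0.0 |
| 1 | 0.0 | 0.506204 | 0.0 | 0.0 | 0.609818 | 0.609818 | 0.0 |
| 2 | 0.659118 | 0.0 | 0.0 | 0.531772 | 0.0 | 0.531772 | 0.0 |
| 3 | 0.0 | 0.0 | 0.778283 | 0.627914 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.556451 | 0.0 | 0.0 | 0.0 | 0.0 | 0.830881 |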
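
Row 0 of the table above can be reproduced without scikit-learn. This is a sketch assuming the library's documented defaults: a smoothed IDF of ln((1 + N) / (1 + df)) + 1, raw term counts as TF, and l2 normalization of each row. The token filter below only approximates the default tokenizer, which drops single-character tokens such as "a" and "i".

```python
import math

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
# Assumption: approximate the default token pattern by keeping tokens of 2+ characters
docs = [[w for w in t.split() if len(w) >= 2] for t in texts]
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times IDF, then divide by the l2 norm of the row
    raw = {w: doc.count(w) * smooth_idf(w) for w in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

row0 = tfidf_row(docs[0])  # the "good movie" document
print(row0)
```

Summing this row (about 1.408) gives the total over the 1-grams of the "good movie" document itself, whereas the sums printed above run down each column, that is, over all five documents.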
