# Alexine Studios

## Create sample data

In [24]:
import pandas as pd
df = pd.DataFrame([(0, 'He is studying in the library'), (1, 'He scored good marks in exams')], columns=["label", "sentence"])
df

| | label | sentence |
|---|---|---|
| 0 | 0 | He is studying in the library |
| 1 | 1 | He scored good marks in exams |


## CountVectorizer

In [25]:
corpus = df.sentence.to_list()
corpus

['He is studying in the library', 'He scored good marks in exams']

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0, 0, 1, 1, 1, 1, 0, 0, 1, 1],
       [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]])

In [27]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['exams', 'good', 'he', 'in', 'is', 'library', 'marks', 'scored', 'studying', 'the']
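
To make the count matrix above less opaque, here is a minimal pure-Python sketch of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically, and count occurrences per document. The names `tokenize`, `vocabulary`, and `counts` are illustrative and not part of scikit-learn; the real implementation returns a sparse matrix and supports many more options.

```python
import re
from collections import Counter

corpus = ['He is studying in the library', 'He scored good marks in exams']

# Default token pattern of CountVectorizer: words with at least two characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (illustrative helper)."""
    return TOKEN_PATTERN.findall(text.lower())

tokenized = [tokenize(sentence) for sentence in corpus]

# Column order is the alphabetically sorted vocabulary
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term
counts = []
for doc in tokenized:
    term_counts = Counter(doc)
    counts.append([term_counts[term] for term in vocabulary])

print(vocabulary)
print(counts)
```

Each row of `counts` lines up with the corresponding row of `X.toarray()`, and `vocabulary` lines up with the feature names printed above.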


<h2 id="tf-idf">TF-IDF</h2>

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$.

Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$.

Term frequency alone overweights common words: a term that appears in most documents of the corpus carries little information about any particular document.

Inverse document frequency is a numerical measure of how much information a term provides: 

$$ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1} $$

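where $|D|$ is the total number of documents in the corpus and $DF(t, D)$ is the number of documents that contain term $t$.
Because of the logarithm, a term that appears in every document has an IDF of 0: the ratio is then $(|D| + 1)/(|D| + 1) = 1$, and $\log 1 = 0$.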

The TF-IDF measure is simply the product of TF and IDF:
$$ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). $$
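
As a rough illustration of these definitions, the sketch below evaluates TF, DF, IDF, and TF-IDF for the two sample sentences in plain Python. It assumes a natural logarithm and simple whitespace tokenization on lowercased text; the function names are made up for this example and do not come from any library.

```python
import math

corpus = ['He is studying in the library', 'He scored good marks in exams']

# Assumption: lowercase and split on whitespace (enough for this tiny corpus)
docs = [sentence.lower().split() for sentence in corpus]

def tf(term, doc):
    """Number of times the term appears in one document."""
    return doc.count(term)

def df(term, docs):
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), natural log assumed."""
    return math.log((len(docs) + 1) / (df(term, docs) + 1))

def tfidf(term, doc, docs):
    """TFIDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return tf(term, doc) * idf(term, docs)

for term in ['he', 'library']:
    print(term, df(term, docs), idf(term, docs), tfidf(term, docs[0], docs))
```

Here `he` occurs in both sentences, so its IDF and TF-IDF are 0, while `library` occurs in only one and gets $\log(3/2) \approx 0.405$.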


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0.        , 0.        , 0.31779954, 0.31779954, 0.44665616,
        0.44665616, 0.        , 0.        , 0.44665616, 0.44665616],
       [0.44665616, 0.44665616, 0.31779954, 0.31779954, 0.        ,
        0.        , 0.44665616, 0.44665616, 0.        , 0.        ]])
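
Note that these values are not what the IDF formula above produces: `he` and `in` occur in both sentences yet have non-zero weights. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes $\ln\frac{1 + n}{1 + DF(t)} + 1$ for the IDF and then scales each row to unit Euclidean length. The sketch below reproduces the matrix by hand under those assumptions; `sklearn_style_idf` and `rows` are illustrative names, and whitespace tokenization is used only because it happens to match the default tokenizer on this corpus.

```python
import math

corpus = ['He is studying in the library', 'He scored good marks in exams']

# Assumption: whitespace tokenization matches the default tokenizer here,
# since no sentence contains punctuation or one-letter words
docs = [sentence.lower().split() for sentence in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

def sklearn_style_idf(term):
    """Smoothed IDF with the extra +1: ln((1 + n) / (1 + df)) + 1."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

rows = []
for doc in docs:
    # Raw TF-IDF weights for every vocabulary term
    raw = [doc.count(term) * sklearn_style_idf(term) for term in vocabulary]
    # L2 normalization: divide by the Euclidean length of the row
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm for value in raw])

for row in rows:
    print([round(value, 8) for value in row])
```

Terms shared by both sentences get an IDF of 1 rather than 0, and the row normalization explains why the weights are below 1 even though every term occurs once.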