# Part I: Term Frequency
Term frequency can be calculated in Python using scikit-learn's `CountVectorizer`:

```python
vectorizer = CountVectorizer()

term_frequencies = vectorizer.fit_transform([stanza])
```
- A `CountVectorizer` object is initialized
- The `CountVectorizer` object is fit (trained) and transformed (applied) on the corpus of data, returning the term frequencies for each term-document pair
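
The two steps above can be tried on a small corpus. The following is a minimal sketch: `toy_corpus` is a hypothetical two-document corpus invented for illustration, and it relies on `CountVectorizer`'s default tokenizer, which lowercases text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration
toy_corpus = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
# Sparse matrix with one row per document and one column per term
term_frequencies = vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index
vocab = vectorizer.vocabulary_
counts = term_frequencies.toarray()

print(sorted(vocab))            # the terms the vectorizer learned
print(counts[0, vocab["the"]])  # "the" appears twice in the first document
```

Each row of `counts` is one document's bag-of-words vector, so a term's frequency in a document is found at the row for that document and the column for that term.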

```python
import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from preprocessing import preprocess_text

poem = '''
Success is counted sweetest
By those who ne'er succeed.
To comprehend a nectar
Requires sorest need.

Not one of all the purple host
Who took the flag to-day
Can tell the definition,
So clear, of victory,

As he, defeated, dying,
On whose forbidden ear
The distant strains of triumph
Break, agonized and clear!'''

# define clear_count: the number of times "clear" appears in the poem
clear_count = 2

# preprocess text
processed_poem = preprocess_text(poem)

# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
term_frequencies = vectorizer.fit_transform([processed_poem])

# get vocabulary of terms
# (scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
feature_names = vectorizer.get_feature_names_out()

# create pandas DataFrame with term frequencies
# terms become rows; errors are no longer silently swallowed by a bare except
df_term_frequencies = pd.DataFrame(term_frequencies.T.toarray(), index=feature_names, columns=['Term Frequency'])
print(df_term_frequencies)

```

# Part II: Inverse document frequency
We can calculate the inverse document frequency for some term `t` across a corpus using the equation below:
$$
\log \left(\frac{\text { Total number of documents }}{\text { Number of documents with term } t}\right)
$$
Inverse document frequency can be calculated on a group of documents using scikit-learn’s `TfidfTransformer`:
```python
transformer = TfidfTransformer(norm=None)
transformer.fit(term_frequencies)
inverse_doc_frequency = transformer.idf_
```
- A `TfidfTransformer` object is initialized. Don’t worry about the `norm=None` keyword argument for now; we will dig into this in the next exercise
- The `TfidfTransformer` is fit (trained) on a term-document matrix of term frequencies
- The `.idf_` attribute of the `TfidfTransformer` stores the inverse document frequencies of the terms as a NumPy array
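
One detail worth checking is that the values in `.idf_` do not come from the plain equation above. With its default `smooth_idf=True`, scikit-learn computes a smoothed variant, `ln((1 + N) / (1 + df(t))) + 1`, where `N` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below recomputes that variant by hand and compares it with `.idf_`; `toy_corpus` and the variable names are hypothetical and chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus, used only for illustration
toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_corpus)

transformer = TfidfTransformer(norm=None)  # smooth_idf=True by default
transformer.fit(counts)

n_docs = counts.shape[0]
# Document frequency: number of documents in which each term appears
doc_freq = (counts.toarray() > 0).sum(axis=0)

# Smoothed inverse document frequency, as scikit-learn defines it by default
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, transformer.idf_))  # True
```

A term that appears in every document, such as "the" here, gets an inverse document frequency of exactly 1 under this variant rather than 0, so it is down-weighted relative to rarer terms but never removed.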

```python
import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from term_frequency import term_frequencies, feature_names, df_term_frequencies

# display term-document matrix of term frequencies
print(df_term_frequencies)

# initialize and fit TfidfTransformer
transformer = TfidfTransformer(norm=None)
transformer.fit(term_frequencies)
idf_values = transformer.idf_

# create pandas DataFrame with inverse document frequencies
# one row per term; errors are no longer silently swallowed by a bare except
df_idf = pd.DataFrame(idf_values, index=feature_names, columns=['Inverse Document Frequency'])
print(df_idf)
```

# Part III: Putting It All Together: Tf-idf
Now that we understand how term frequency and inverse document frequency are calculated, let’s put it all together to calculate tf-idf!

Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for some term `t` in a document `d` in some `corpus` is calculated as follows:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t, \text{corpus})
$$

- `tf(t,d)` is the term frequency of term `t` in document `d`
- `idf(t,corpus)` is the inverse document frequency of a term `t` across `corpus`

We can calculate the tf-idf values for each term-document pair in our corpus using scikit-learn’s `TfidfVectorizer`:

```python
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)
```

- A `TfidfVectorizer` object is initialized. By default (`norm='l2'`), scikit-learn rescales each document's tf-idf vector to unit Euclidean length; the `norm=None` keyword argument turns this off, so each score stays the plain product of term frequency and inverse document frequency
- The `TfidfVectorizer` object is fit and transformed on the corpus of data, returning the tf-idf scores for each term-document pair
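
To make the product in the equation concrete, the sketch below multiplies raw term counts by the fitted inverse document frequencies and compares the result with the scores returned by `TfidfVectorizer(norm=None)`. It assumes a hypothetical `toy_corpus` and default settings for both vectorizers, so that they learn the same vocabulary in the same column order.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus, used only for illustration
toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]

# tf-idf scores computed by scikit-learn in one step
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = tfidf_vectorizer.fit_transform(toy_corpus).toarray()

# Raw term counts, one row per document and one column per term
tf = CountVectorizer().fit_transform(toy_corpus).toarray()

# idf_ holds one value per term, so broadcasting scales each column of tf
manual_scores = tf * tfidf_vectorizer.idf_

print(np.allclose(manual_scores, tfidf_scores))  # True
```

The comparison only holds with `norm=None`; with the default normalization each row would be rescaled after the multiplication.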

```python
import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from poems import poems
from preprocessing import preprocess_text

# preprocess documents
processed_poems = [preprocess_text(poem) for poem in poems]

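# initialize TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)

# fit and transform the preprocessed poems into tf-idf scores
tfidf_scores = vectorizer.fit_transform(processed_poems)

# get vocabulary of terms
# (scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
feature_names = vectorizer.get_feature_names_out()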

# get corpus index
corpus_index = [f"Poem {i+1}" for i in range(len(poems))]

# create pandas DataFrame with tf-idf scores
# terms as rows, poems as columns; errors are no longer silently swallowed
df_tf_idf = pd.DataFrame(tfidf_scores.T.toarray(), index=feature_names, columns=corpus_index)
print(df_tf_idf)
```

# Converting Bag-of-Words to Tf-idf
In addition to directly calculating the tf-idf scores for a set of terms across a corpus, you can also convert a bag-of-words model you have already created into tf-idf scores.

Scikit-learn’s `TfidfTransformer` is up to the task of converting your bag-of-words model to tf-idf. You begin by initializing a `TfidfTransformer` object.

```python
tfidf_transformer = TfidfTransformer(norm=None)
```

Given a bag-of-words matrix `count_matrix`, you can now multiply the term frequencies by their inverse document frequency to get the tf-idf scores as follows:

```python
tf_idf_scores = tfidf_transformer.fit_transform(count_matrix)
```

This is very similar to how we calculated inverse document frequency, except this time we call `.fit_transform()` on the term frequencies (the bag-of-words vectors) rather than only `.fit()`, so the `TfidfTransformer` both learns the inverse document frequencies and applies them.
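
As a sanity check, the two-step route (counts first, then `TfidfTransformer`) should give the same scores as `TfidfVectorizer` applied directly to the text. The sketch below compares the two on a hypothetical `toy_corpus`, assuming default settings apart from `norm=None`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical corpus, used only for illustration
toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]

# Step 1: build the bag-of-words matrix
count_matrix = CountVectorizer().fit_transform(toy_corpus)

# Step 2: convert the counts to tf-idf scores
tfidf_transformer = TfidfTransformer(norm=None)
two_step = tfidf_transformer.fit_transform(count_matrix)

# One-step equivalent, straight from the raw text
one_step = TfidfVectorizer(norm=None).fit_transform(toy_corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```

The two-step route is convenient when a bag-of-words matrix already exists, since the text does not need to be tokenized again.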

```python
import codecademylib3_seaborn
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from term_frequency import bow_matrix, feature_names, df_bag_of_words, corpus_index

# display term-document matrix of term frequencies (bag-of-words)
print(df_bag_of_words)

# initialize and fit TfidfTransformer, transform bag-of-words matrix
transformer = TfidfTransformer(norm=None)
tfidf_scores = transformer.fit_transform(bow_matrix)

# create pandas DataFrame with tf-idf scores
# terms as rows, documents as columns; errors are no longer silently swallowed
df_tf_idf = pd.DataFrame(tfidf_scores.T.toarray(), index=feature_names, columns=corpus_index)
print(df_tf_idf)
```