In [None]:
import string
import math
import pandas as pd
pd.set_option('display.max_rows', 500)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# **Data Set**

In [2]:
news_headlines = ['Stocks Slide After Jobless Claims Rise',
                   'Jobless Claims Rose Last Week, Still Historically Low',
                   'Stocks Extend Losses, Reversing Early-Week Gains',
                   'Jobless Claims Are Expected to Rise',
                   'More Americans apply for jobless benefits last week',
                   'NFL Week 4 Offense Rankings | NFL News, Rankings and Statistics',
                   'NFL rankings, figuring out the Ravens offense, Kenny Pickett gets the nod',
                   "Week 4's NFL Team of the Week - Offense",
                   '2022 NFL offense rankings',
                   'How Will Every NFL Offense Perform in 2022?']

# **Preprocess Text**

## **Remove Stop Words, turn characters into lowercase and remove punctuation**

In [4]:
stop_words = stopwords.words('english')

In [5]:
def stop_word_removal(text, stop_word_corpus, punct_str):
    clean_text = ' '.join([word.lower() for word in text.split() if word.lower()
                 not in stop_word_corpus]).replace('\n',' ')
    return clean_text.translate(str.maketrans('', '', punct_str))

In [6]:
news_cleaned = [stop_word_removal(headline,stop_words,string.punctuation)
                for headline in news_headlines]

news_cleaned

['stocks slide jobless claims rise',
 'jobless claims rose last week still historically low',
 'stocks extend losses reversing earlyweek gains',
 'jobless claims expected rise',
 'americans apply jobless benefits last week',
 'nfl week 4 offense rankings  nfl news rankings statistics',
 'nfl rankings figuring ravens offense kenny pickett gets nod',
 'week 4s nfl team week  offense',
 '2022 nfl offense rankings',
 'every nfl offense perform 2022']

# **Term Frequency (tf)**

**Term frequency, $tf(t,d)$, is the relative frequency of term $t$ within document $d$**

$\text{Term Frequency = }\large{\frac{f_{t,d}}{\sum_{t' \in d}f_{t',d}}}$ **, where the numerator represents the count of a given term and the denominator is the total number of terms in document $d$**

<sup>Source: [tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) from Wikipedia.org<sup/>

In [12]:
def term_freq(document_list):
    
    combined_string = ' '.join(document_list)
    word_set = [word for word in set(combined_string.split()) if len(word) > 1]
    
    filtered_docs = []
    for document in document_list:
        filtered_docs.append(' '.join([word for word in document.split() if len(word) > 1]))
        
    doc_word_count = []
    for document in filtered_docs:
        doc_word_count.append([document.split().count(word)/len(document.split()) for word in word_set])
    
    return {doc:vector for doc,vector in zip(filtered_docs,doc_word_count)}

In [14]:
def word_list(document_list):
    
    combined_string = ' '.join(document_list)
    return [word for word in set(combined_string.split()) if len(word) > 1]

In [15]:
tf_df = pd.DataFrame(data= term_freq(news_cleaned).values(),
                     columns = word_list(news_cleaned),
                     index = term_freq(news_cleaned).keys())

tf_df

Unnamed: 0,last,historically,reversing,low,rose,losses,gets,ravens,nfl,statistics,...,nod,news,week,perform,americans,kenny,claims,jobless,slide,pickett
stocks slide jobless claims rise,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.2,0.0
jobless claims rose last week still historically low,0.125,0.125,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.125,0.0,0.0,0.0,0.125,0.125,0.0,0.0
stocks extend losses reversing earlyweek gains,0.0,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jobless claims expected rise,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.0
americans apply jobless benefits last week,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0
nfl week offense rankings nfl news rankings statistics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.125,...,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0
nfl rankings figuring ravens offense kenny pickett gets nod,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.111111,0.0,...,0.111111,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.111111
week 4s nfl team week offense,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,...,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022 nfl offense rankings,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
every nfl offense perform 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0


# **Inverse Document Frequency (idf)**

**The inverse document frequency (idf) is a measure of importance of a term $t$ in a given document $d$. The measure gives a larger weight to terms which are less common in the corpus while more frequent terms across a document are given less weight.**

**The below idf formula differs from the standard idf contained in academic textbooks. The Sci-Kit Learn module adds a `1` in the numerator and the denominator which is akin to adding a document to the corpus containing every term to prevent a division by zero. Sci-Kit Learn also adds a `1` to the quotient in the formula.**

$\text{Inverse Document Frequency = } log \large{\frac{| \ D \ | + 1}{{| \ d : t_i \in \ d  \ | + 1}}} + 1$, **where $D$ is the corpus (total number of documents), and $d_{ti}$ is the number of documents the given term $t$ appears in**

<sup>Source: [Text Mining and Network Analysis of Digital Libraries in R](https://www.sciencedirect.com/science/article/pii/B9780124115118000049) by Eric Nguyen<sup/>
    
<sup>Source: [“Sklearn’s TF-IDF” vs “Standard TF-IDF”](https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d) by Siva Sivarajah from Towards Data Science<sup/>

In [21]:
def inverse_doc_freq(document_list,unique_words):
    
    
    idf_dict = {}
    
    for word in unique_words:
        temp_word_list = []
        for sentence in document_list:
            temp_word_list.append(word in sentence)
        idf_dict[word] = temp_word_list.count(True)+1
        
    idf_list = []

    for value in idf_dict.values():
        idf_list.append(math.log((len(tf_df)+1)/value)+1)
        
    return idf_list

In [22]:
idf_array = inverse_doc_freq(tf_df.index,tf_df.columns)
idf_array

[2.2992829841302607,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 1.6061358035703155,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.2992829841302607,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.2992829841302607,
 2.7047480922384253,
 2.01160091167848,
 2.2992829841302607,
 2.7047480922384253,
 1.6061358035703155,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 1.6061358035703155,
 2.7047480922384253,
 2.7047480922384253,
 2.7047480922384253,
 2.01160091167848,
 1.7884573603642702,
 2.7047480922384253,
 2.7047480922384253]

# **Term Frequency-Inverse Document Frequency (tf-idf)**

**The term frequency-inverse document frequency (tf-idf), is an array of values that represents how important a word is in a given document within a corpus. The higher the tf-idf value is for a word in a document, the more important the word is.**

$\text{tf-idf Product = (Term Frequency)(Inverse Document Frequency)}$

$\text{tf-idf =} \Large{\frac{\text{tf-idf Product}}{||\text{tf-idf Product}||}}$

<sup>Source: [tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) from Wikipedia.org<sup/>
    
<sup>Source: [“Sklearn’s TF-IDF” vs “Standard TF-IDF”](https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d) by Siva Sivarajah from Towards Data Science<sup/>

In [23]:
def tf_idf(term_freq_data,inverse_doc_freq_data):
    
    tf_idf_vector = []
    multiplied_vector = term_freq_data * inverse_doc_freq_data
    
    for array in multiplied_vector:
        tf_idf_vector.append(array/(sum(array**2))**.5)
        
    return tf_idf_vector

# **Saving the TF-IDF Vector to a pandas DataFrame**

In [24]:
tfidf_df = pd.DataFrame(data = tf_idf(tf_df.to_numpy(), np.array(idf_array)),
             columns = word_list(news_cleaned),
             index = term_freq(news_cleaned).keys())

tfidf_df

Unnamed: 0,last,historically,reversing,low,rose,losses,gets,ravens,nfl,statistics,...,nod,news,week,perform,americans,kenny,claims,jobless,slide,pickett
stocks slide jobless claims rise,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.401245,0.356735,0.539504,0.0
jobless claims rose last week still historically low,0.345166,0.406033,0.0,0.406033,0.406033,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.241111,0.0,0.0,0.0,0.301979,0.268481,0.0,0.0
stocks extend losses reversing earlyweek gains,0.0,0.0,0.418024,0.0,0.0,0.418024,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jobless claims expected rise,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.451533,0.401445,0.0,0.0
americans apply jobless benefits last week,0.400181,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.279542,0.0,0.470751,0.0,0.0,0.311274,0.0,0.0
nfl week offense rankings nfl news rankings statistics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.47211,0.397519,...,0.0,0.397519,0.236055,0.0,0.0,0.0,0.0,0.0,0.0,0.0
nfl rankings figuring ravens offense kenny pickett gets nod,0.0,0.0,0.0,0.0,0.0,0.0,0.371176,0.371176,0.220412,0.0,...,0.371176,0.0,0.0,0.0,0.0,0.371176,0.0,0.0,0.0,0.371176
week 4s nfl team week offense,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.292706,0.0,...,0.0,0.0,0.585412,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022 nfl offense rankings,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4219,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
every nfl offense perform 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.320731,0.0,...,0.0,0.0,0.0,0.540114,0.0,0.0,0.0,0.0,0.0,0.0


# **Scikit-learn's TfidfVectorizer Class**

In [25]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news_cleaned)

In [26]:
sklearn_df = pd.DataFrame(data = X.toarray(),columns=vectorizer.get_feature_names_out(),index=news_cleaned)
sklearn_df

Unnamed: 0,2022,4s,americans,apply,benefits,claims,earlyweek,every,expected,extend,...,ravens,reversing,rise,rose,slide,statistics,still,stocks,team,week
stocks slide jobless claims rise,0.0,0.0,0.0,0.0,0.0,0.401245,0.0,0.0,0.0,0.0,...,0.0,0.0,0.458627,0.0,0.539504,0.0,0.0,0.458627,0.0,0.0
jobless claims rose last week still historically low,0.0,0.0,0.0,0.0,0.0,0.299895,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.403231,0.0,0.0,0.403231,0.0,0.0,0.266628
stocks extend losses reversing earlyweek gains,0.0,0.0,0.0,0.0,0.0,0.0,0.418024,0.0,0.0,0.418024,...,0.0,0.418024,0.0,0.0,0.0,0.0,0.0,0.355359,0.0,0.0
jobless claims expected rise,0.0,0.0,0.0,0.0,0.0,0.451533,0.0,0.0,0.607119,0.0,...,0.0,0.0,0.516107,0.0,0.0,0.0,0.0,0.0,0.0,0.0
americans apply jobless benefits last week,0.0,0.0,0.466399,0.466399,0.466399,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.308397
nfl week 4 offense rankings nfl news rankings statistics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.394888,0.0,0.0,0.0,0.261111
nfl rankings figuring ravens offense kenny pickett gets nod,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.371176,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
week 4s nfl team week offense,0.0,0.473825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.473825,0.626614
2022 nfl offense rankings,0.603976,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
every nfl offense perform 2022,0.459147,0.0,0.0,0.0,0.0,0.0,0.0,0.540114,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
for word in sklearn_df.index[0].split():
    print(word,sklearn_df[0:1][word.lower()][0])

stocks 0.4586274284579443
slide 0.539503693426004
jobless 0.3567353847924965
claims 0.40124480526077394
rise 0.4586274284579443


In [28]:
for word in tfidf_df.index[0].split():
    print(word,tfidf_df[0:1][word][0])

stocks 0.4586274284579443
slide 0.5395036934260039
jobless 0.3567353847924965
claims 0.4012448052607739
rise 0.4586274284579443


# **References and Additional Learning**

## **Article**

- **[“Sklearn’s TF-IDF” vs “Standard TF-IDF”](https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d) by Siva Sivarajah from Towards Data Science**

## **Textbook**

- **[Text Mining and Network Analysis of Digital Libraries in R](https://www.sciencedirect.com/science/article/pii/B9780124115118000049) by Eric Nguyen**

## **Websites**

- **[TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from Scikit-learn.org**

- **[tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) from Wikipedia.org**

## **Videos**

- **[NLP with Python! Bag of Words (BoW)](https://www.youtube.com/watch?v=mVF0Lt5Sb84&t=3s) by Adrian Dolinay on YouTube**

- **[NLP with Python! Stop Words](https://www.youtube.com/watch?v=0D7ae7OaaHQ&t=3s) by Adrian Dolinay on YouTube**

# **Connect**
- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [Twitter](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**