<a href="https://colab.research.google.com/github/Yenaaa/24spring_hss510/blob/main/RepresentingComparing_Mar13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **HSS 510 Guided Coding: Representing and Comparing Texts**

### **2024 Mar 13, Taegyoon Kim**


---

## **Topics**
- Tokenization
- Text normalization
- Count-vectorization
- TF-IDF
- Cosine similarity

## **Tokenization**

- Various algorithms
- `word_tokenize` from `nltk` is most commonly used
- If necessary, explore domain-specific tokenizers such as `TweetTokenizer`
- Or even customize one
- `sent_tokenize` for sentence tokenization
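
As a minimal sketch of what customizing a tokenizer could look like, the snippet below uses only Python's standard `re` module rather than an NLTK class. The pattern and the function name `simple_social_tokenize` are illustrative assumptions, not part of `nltk`; the idea is to keep hashtags, mentions, and URLs intact while splitting off other punctuation.

```python
import re

# Hypothetical pattern: alternatives are tried left to right, so the
# more specific ones (hashtags, mentions, URLs) come first.
TOKEN_PATTERN = re.compile(
    r"#\w+"            # hashtags such as #awesomeMusic
    r"|@\w+"           # mentions such as @theArtist
    r"|https?://\S+"   # URLs
    r"|\w+(?:'\w+)?"   # words, optionally with an internal apostrophe (Don't)
    r"|[^\w\s]"        # any other single non-space symbol
)

def simple_social_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

custom_tokens = simple_social_tokenize(
    "Don't miss this song!!! #awesomeMusic @theArtist http://www.awesomemusic.ai"
)
print(custom_tokens)
```

A hand-written pattern like this is easy to adapt to a specific corpus, but it will likely miss edge cases that a maintained tokenizer already handles.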

In [None]:
# load tokenizers

import nltk
nltk.download('punkt') # install Punkt tokenizer separately

from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer, WordPunctTokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### `WhitespaceTokenizer`, `TreebankWordTokenizer`, `WordPunctTokenizer`

In [None]:
# example sentence

sentence = "Don't hesitate to ask questions! Are AI models 'understandable'?"
print(sentence)

# NLTK's WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokens_white = whitespace_tokenizer.tokenize(sentence)
print("WhitespaceTokenizer:", tokens_white)

# NLTK's TreebankWordTokenizer

treebank_word_tokenizer = TreebankWordTokenizer()
tokens_treebank_word = treebank_word_tokenizer.tokenize(sentence)
print("TreebankWordTokenizer:", tokens_treebank_word)

# NLTK's WordPunctTokenizer

word_punct_tokenizer = WordPunctTokenizer()
tokens_word_punct = word_punct_tokenizer.tokenize(sentence)
print("WordPunctTokenizer:", tokens_word_punct)

Don't hesitate to ask questions! Are AI models 'understandable'?
WhitespaceTokenizer: ["Don't", 'hesitate', 'to', 'ask', 'questions!', 'Are', 'AI', 'models', "'understandable'?"]
TreebankWordTokenizer: ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '!', 'Are', 'AI', 'models', "'understandable", "'", '?']
WordPunctTokenizer: ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions', '!', 'Are', 'AI', 'models', "'", 'understandable', "'?"]


###  `word_tokenize`

In [None]:
from nltk.tokenize import word_tokenize # built on a Treebank-style tokenizer, but first splits the text into sentences

# example sentence

sentence = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."

# Tokenization using word_tokenize

tokens_word_tokenize = word_tokenize(sentence)

# Tokenization using TreebankWordTokenizer

treebank_word_tokenizer = TreebankWordTokenizer()
tokens_treebank_word = treebank_word_tokenizer.tokenize(sentence)

print("word_tokenize:", tokens_word_tokenize)
print("TreebankWordTokenizer:", tokens_treebank_word)

word_tokenize: ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
TreebankWordTokenizer: ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']


### `TweetTokenizer`

In [None]:
from nltk.tokenize import TweetTokenizer

tt = TweetTokenizer()
tweet = "Loving this new song!!! 😍 #awesomeMusic @theArtist go to http://www.awesomemusic.ai"

print(tt.tokenize(tweet), "\n", word_tokenize(tweet)) # note the differences

['Loving', 'this', 'new', 'song', '!', '!', '!', '😍', '#awesomeMusic', '@theArtist', 'go', 'to', 'http://www.awesomemusic.ai'] 
 ['Loving', 'this', 'new', 'song', '!', '!', '!', '😍', '#', 'awesomeMusic', '@', 'theArtist', 'go', 'to', 'http', ':', '//www.awesomemusic.ai']


###  `sent_tokenize`
- See how it handles different types of sentence demarcations

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
# periods within abbreviations

text1 = "We offer many citrus fruits, e.g., oranges, lemons, and limes. They're fresh and tasty."
print(text1, '\n', sent_tokenize(text1))

We offer many citrus fruits, e.g., oranges, lemons, and limes. They're fresh and tasty. 
 ['We offer many citrus fruits, e.g., oranges, lemons, and limes.', "They're fresh and tasty."]


In [None]:
# quotation marks

text2 = "She said, 'Welcome to the club.' Then she showed me around the building."
print(text2, '\n', sent_tokenize(text2))

She said, 'Welcome to the club.' Then she showed me around the building. 
 ["She said, 'Welcome to the club.'", 'Then she showed me around the building.']


In [None]:
# question marks and exclamation points

text3 = "Can you believe this? It's amazing! We should try it."
print(text3, '\n', sent_tokenize(text3))

Can you believe this? It's amazing! We should try it. 
 ['Can you believe this?', "It's amazing!", 'We should try it.']


In [None]:
# ellipses and dashes

text4 = "Well... I'm not sure. Maybe we could try a different approach—something less conventional."
print(text4, '\n', sent_tokenize(text4))

Well... I'm not sure. Maybe we could try a different approach—something less conventional. 
 ['Well...', "I'm not sure.", 'Maybe we could try a different approach—something less conventional.']


In [None]:
# decimal numbers and times

text5 = "The total comes to $10.50. Please, make the payment by 12.30 p.m."
print(text5, '\n', sent_tokenize(text5))

The total comes to $10.50. Please, make the payment by 12.30 p.m. 
 ['The total comes to $10.50.', 'Please, make the payment by 12.30 p.m.']


## **Text normalization**
- Lowercasing
- Punctuation
- Stop words
- Lemmatization/stemming


### `lower`

In [None]:
text = "OpenAI has also worked with Microsoft to roll out ChatGPT-like technology across Microsoft’s products."
print(text)

OpenAI has also worked with Microsoft to roll out ChatGPT-like technology across Microsoft’s products.


In [None]:
print(text.lower())

openai has also worked with microsoft to roll out chatgpt-like technology across microsoft’s products.


### Stop words

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords') # lists of stop words by NLTK (over 20 languages)

print(len(stopwords.words('english'))) # English stop words
print(stopwords.words('english')[0:10])

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print(len(stopwords.words('italian'))) # Italian stop words
print(stopwords.words('italian')[0:10])

['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con']


In [None]:
tokens = word_tokenize(text)

stop_words_en = set(stopwords.words('english')) # build the set once for fast lookups
tokens_no_sw = [t for t in tokens if t.lower() not in stop_words_en] # remove stop words

print(tokens)
print(tokens_no_sw)

['OpenAI', 'has', 'also', 'worked', 'with', 'Microsoft', 'to', 'roll', 'out', 'ChatGPT-like', 'technology', 'across', 'Microsoft', '’', 's', 'products', '.']
['OpenAI', 'also', 'worked', 'Microsoft', 'roll', 'ChatGPT-like', 'technology', 'across', 'Microsoft', '’', 'products', '.']
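
The filtered list above still contains punctuation tokens such as `’` and `.`. One possible way to drop them, sketched here with the standard library only, is to test each token's Unicode category. The helper name `is_punct` is an assumption made for illustration, and the token list is copied from the output above so the snippet runs on its own.

```python
import unicodedata

def is_punct(token):
    """True if the token is non-empty and every character is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# Tokens copied from the stop-word example above
tokens_no_sw = ['OpenAI', 'also', 'worked', 'Microsoft', 'roll', 'ChatGPT-like',
                'technology', 'across', 'Microsoft', '’', 'products', '.']

tokens_no_punct = [t for t in tokens_no_sw if not is_punct(t)]
print(tokens_no_punct)
```

Tokens that merely contain punctuation, such as `ChatGPT-like`, are kept because not all of their characters are punctuation.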


### `WordNetLemmatizer`

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
tokens = ['cats', 'doing', 'lives', 'has', 'going', 'legislate', 'asocial', 'flew', 'friendly', 'loved']

wordnet_lemmatizer = WordNetLemmatizer()
tokens_lemma = [wordnet_lemmatizer.lemmatize(w) for w in tokens]

print(tokens)
print(tokens_lemma)

['cats', 'doing', 'lives', 'has', 'going', 'legislate', 'asocial', 'flew', 'friendly', 'loved']
['cat', 'doing', 'life', 'ha', 'going', 'legislate', 'asocial', 'flew', 'friendly', 'loved']


- Combine with POS tags

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm') # load the spaCy model for lemmatization, pos tags, NER, etc.

sentence = "The cats are sitting on the window sill"
doc = nlp(sentence)

for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}")

Token: The, Lemma: the, POS: DET
Token: cats, Lemma: cat, POS: NOUN
Token: are, Lemma: be, POS: AUX
Token: sitting, Lemma: sit, POS: VERB
Token: on, Lemma: on, POS: ADP
Token: the, Lemma: the, POS: DET
Token: window, Lemma: window, POS: NOUN
Token: sill, Lemma: sill, POS: NOUN


### `PorterStemmer`

In [None]:
from nltk.stem import PorterStemmer

tokens_stem_p = []

ps = PorterStemmer()
for w in tokens:
    root = ps.stem(w)
    tokens_stem_p.append(root)

print(tokens)
print(tokens_stem_p)

['cats', 'doing', 'lives', 'has', 'going', 'legislate', 'asocial', 'flew', 'friendly', 'loved']
['cat', 'do', 'live', 'ha', 'go', 'legisl', 'asoci', 'flew', 'friendli', 'love']


### `LancasterStemmer`

In [None]:
from nltk.stem import LancasterStemmer

tokens_stem_l = []

ls = LancasterStemmer()
for w in tokens:
    root = ls.stem(w)
    tokens_stem_l.append(root)

print("No stemming:", tokens)
print("Porter:", tokens_stem_p)
print("Lancaster:", tokens_stem_l)


No stemming: ['cats', 'doing', 'lives', 'has', 'going', 'legislate', 'asocial', 'flew', 'friendly', 'loved']
Porter: ['cat', 'do', 'live', 'ha', 'go', 'legisl', 'asoci', 'flew', 'friendli', 'love']
Lancaster: ['cat', 'doing', 'liv', 'has', 'going', 'legisl', 'asoc', 'flew', 'friend', 'lov']


## **Count-vectorization**

-   No built-in lemmatization/stemming
-   `stop_words`
-   `lowercase`
-   `max_df`
-   `min_df`
-   `ngram_range`
-   `binary`
-   See [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for a complete set of arguments
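
To make `min_df` and `max_df` more concrete, here is a small pure-Python sketch of document-frequency filtering when both are given as proportions. It mirrors the idea rather than scikit-learn's internals; the toy documents and threshold values are made up for illustration.

```python
from collections import Counter

# Toy, already-tokenized documents (hypothetical data)
toy_docs = [
    ["peace", "nation", "the"],
    ["the", "government", "nation"],
    ["the", "people"],
    ["the", "zeal"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

min_df, max_df = 0.5, 0.95  # proportions of documents (illustrative values)

# Keep terms that are neither too rare nor too common
kept_terms = sorted(
    term for term, df in doc_freq.items()
    if min_df * n_docs <= df <= max_df * n_docs
)
print(kept_terms)
```

Here `the` appears in every document and is dropped by `max_df`, while terms that appear in a single document are dropped by `min_df`.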

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

url = 'https://raw.githubusercontent.com/taegyoon-kim/programming_dhcss_23fw/main/week_14/inaugural_speech_us.csv'
inaugural_df = pd.read_csv(url)

print(len(inaugural_df)) # 58 speeches/documents
print(inaugural_df.head()) # the first five

58
          docnames                                               text
0  1789-Washington  Fellow-Citizens of the Senate and of the House...
1  1793-Washington  Fellow citizens, I am again called upon by the...
2       1797-Adams  When it was first perceived, in early times, t...
3   1801-Jefferson  Friends and Fellow Citizens:    Called upon to...
4   1805-Jefferson  Proceeding, fellow citizens, to that qualifica...


In [None]:
# 1789 by George Washington

print(inaugural_df['text'].iloc[0])

Fellow-Citizens of the Senate and of the House of Representatives:    Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not 

In [None]:
vectorizer = CountVectorizer() # plain vectorizer

In [None]:
# create a tokenizer + stemmer

def ps_stem_tokenize(text):
    return [ps.stem(token) for token in word_tokenize(text)]

In [None]:
vectorizer_norm = CountVectorizer( # custom vectorizer
    tokenizer = ps_stem_tokenize,
    lowercase = True,
    stop_words = 'english',
    ngram_range=(1, 1), # unigrams only; use (1, 2) for unigrams and bigrams
    max_df=0.95, # discard terms whose document frequency is above this (a proportion of documents if float, an absolute count if int); removes very common words
    min_df=0.05 # discard terms whose document frequency is below this (same convention); removes very rare words
    )

`fit_transform`

-   **fit** means learning the vocabulary of the corpus
-   **transform** means creating a matrix and populating with counts
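
The two steps can be sketched in plain Python as follows. This is a simplified illustration, not scikit-learn's implementation: the function names `fit` and `transform` and the toy documents are assumptions, and the regular expression imitates `CountVectorizer`'s default of keeping lowercased tokens with at least two word characters.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit(docs):
    """Learn the vocabulary: map each distinct term to a column index."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Build one row of counts per document, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

toy_docs = ["the cat sat", "the cat saw the dog"]
toy_vocab = fit(toy_docs)
toy_dtm = transform(toy_docs, toy_vocab)

print(toy_vocab)
print(toy_dtm)
```

Keeping the steps separate matters when new documents arrive later: they are transformed with the vocabulary learned earlier, so terms unseen during fitting are simply ignored.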

In [None]:
# create a document-term matrix

dtm = vectorizer.fit_transform(inaugural_df['text'])
dtm # display a summary of the sparse matrix

<58x9046 sparse matrix of type '<class 'numpy.int64'>'
	with 43638 stored elements in Compressed Sparse Row format>

In [None]:
# number of documents * size of vocabulary

print(dtm.shape)

# number of non-zero elements

print(dtm.size)

(58, 9046)
43638

In [None]:
# sparsity

import numpy as np

1 - (dtm.size / np.prod(dtm.shape))

0.9168274032340451
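
The sparsity figure can be checked by hand from the numbers reported above: 58 documents, 9,046 terms, and 43,638 stored non-zero entries. The variable names below are chosen for illustration only.

```python
n_docs, n_terms = 58, 9046   # shape of the document-term matrix
n_nonzero = 43638            # stored (non-zero) elements

n_cells = n_docs * n_terms           # total number of cells in the matrix
density = n_nonzero / n_cells        # share of cells that are non-zero
sparsity_check = 1 - density         # share of cells that are zero

print(n_cells, round(density, 4), round(sparsity_check, 4))
```

In other words, roughly 8% of the cells hold a count and the rest are zeros, which is why the matrix is stored in a sparse format.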

In [None]:
# transform into a numpy array

arr_dtm = dtm.toarray()
print(arr_dtm)
print(arr_dtm.shape)

numpy.ndarray

In [None]:
# get the terms

vocab = vectorizer.get_feature_names_out()

print(len(vocab))
print(vocab[0:30])
print(vocab[-10:])

9046


In [None]:
# transform into a pandas data frame

df_dtm = pd.DataFrame(arr_dtm, columns = vocab)
df_dtm.head()

Unnamed: 0,000,100,120,125,13,14th,15th,16,1774,1776,...,your,yours,yourself,yourselves,youth,youthful,zeal,zealous,zealously,zone
0,0,0,0,0,0,1,0,0,0,0,...,9,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,7,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,4,0,0,0,0,0,3,0,0,0


In [None]:
# most frequent terms

sum_words = arr_dtm.sum(axis = 0) # 1D array (length = 9,046)
words_freq = [(word, sum_words[idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)

words_freq[0:10]

[('the', 9821),
 ('of', 6889),
 ('and', 5207),
 ('to', 4423),
 ('in', 2726),
 ('our', 2146),
 ('that', 1748),
 ('we', 1740),
 ('be', 1452),
 ('is', 1430)]

In [None]:
# let's experiment with removing stop words

vectorizer_nostop = CountVectorizer(stop_words = 'english')
dtm_nostop = vectorizer_nostop.fit_transform(inaugural_df['text'])

In [None]:
# number of documents * size of vocabulary

print("Without stop words:", dtm_nostop.shape)
print("With stop words:", dtm.shape)

Without stop words: (58, 8771)
With stop words: (58, 9046)


In [None]:
arr_dtm_nostop = dtm_nostop.toarray()
arr_dtm_nostop

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
# most frequent terms

sum_words_nostop = arr_dtm_nostop.sum(axis = 0) # 1D array (length = 8,771)
words_freq_nostop = [(word, sum_words_nostop[idx]) for word, idx in vectorizer_nostop.vocabulary_.items()]
words_freq_nostop = sorted(words_freq_nostop, key = lambda x: x[1], reverse = True)

words_freq_nostop[0:10]

[('government', 591),
 ('people', 566),
 ('great', 338),
 ('world', 337),
 ('states', 324),
 ('nation', 311),
 ('shall', 310),
 ('country', 303),
 ('peace', 254),
 ('new', 252)]

## **TF-IDF**

In [None]:
# load Tfidfvectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# mini corpus example

corpus = ['can you fly', 'can you sleep']

tf_idf_vectorizer = TfidfVectorizer()
tf_idf_matrix = tf_idf_vectorizer.fit_transform(corpus)

print(tf_idf_vectorizer.get_feature_names_out())
print(tf_idf_matrix.toarray())

['can' 'fly' 'sleep' 'you']
[[0.50154891 0.70490949 0.         0.50154891]
 [0.50154891 0.         0.70490949 0.50154891]]
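
The values above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses only the standard library; the variable names are illustrative.

```python
import math

corpus = ["can you fly", "can you sleep"]
docs = [doc.split() for doc in corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

tfidf_rows = []
for doc in docs:
    raw = [doc.count(term) * idf[term] for term in vocab]   # tf * idf
    norm = math.sqrt(sum(v * v for v in raw))               # L2 norm of the row
    tfidf_rows.append([v / norm for v in raw])

print(vocab)
for row in tfidf_rows:
    print([round(v, 8) for v in row])
```

Terms shared by both documents (`can`, `you`) get an IDF of 1, while `fly` and `sleep` appear in only one document each and are weighted up, which is why they end up with the largest values in their rows.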


In [None]:
url = 'https://raw.githubusercontent.com/taegyoon-kim/programming_dhcss_23fw/main/week_14/news_title.csv'
df_news = pd.read_csv(url, sep = ';')
df_news = df_news.sample(1000) # random sample of 1,000

In [None]:
df_news.head()

Unnamed: 0,No,News Title,Category
52488,52489,Is Antarctic ice sheet disintegrating?,Technology
49405,49406,Mila Kunis talks about falling for Ashton and ...,Entertainment
61944,61945,Lloyds stake sale raises £4.2bn,Business
8289,8290,Facebook Researcher Responds To Criticism Over...,Technology
15070,15071,Driverless cars will be allowed on UK streets ...,Technology


In [None]:
# create a TF-IDF matrix

tfidf_vectorizer = TfidfVectorizer()
tfidf_mat = tfidf_vectorizer.fit_transform(df_news['News Title'])

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

df_tfidf = pd.DataFrame(
  tfidf_mat.toarray(),
  columns = tfidf_feature_names,
  index = df_news['No'])

In [None]:
df_tfidf

No,00,07,10,100,1000,1000000000000,100m,100th,101,106b,...,your,youtube,yum,z06,zac,zayn,zebra,zillow,zions,zuckerberg
2455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15813,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61771,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Cosine similarity**
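
Cosine similarity between two vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A minimal standard-library sketch, using raw count vectors for the two-sentence mini corpus from the TF-IDF section (the function name `cosine` is an illustrative choice):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors over the vocabulary ['can', 'fly', 'sleep', 'you']
doc_a = [1, 1, 0, 1]   # "can you fly"
doc_b = [1, 0, 1, 1]   # "can you sleep"

cos_ab = cosine(doc_a, doc_b)
cos_scaled = cosine(doc_a, [2 * x for x in doc_b])  # doubling a vector changes nothing

print(cos_ab, cos_scaled)
```

The two documents share two of their three words, giving a similarity of 2/3. Note that this sketch does not guard against all-zero vectors, which would cause a division by zero.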

### Example with TF-IDF matrix

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# create a cosine similarity matrix

tfidf_cos_mat = cosine_similarity(df_tfidf, df_tfidf)
df_tfidf_cos = pd.DataFrame(
  tfidf_cos_mat,
  columns = df_news['No'],
  index = df_news['No'])

df_tfidf_cos

No,2455,32074,48120,15813,61771,30397,54680,40341,33687,44603,...,12462,65221,21128,57158,24003,57143,26128,61369,45189,56712
2455,1.000000,0.000000,0.0,0.022954,0.000000,0.029955,0.028799,0.000000,0.125204,0.032461,...,0.097187,0.028589,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
32074,0.000000,1.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.091603,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.084959,0.0,0.000000,0.000000
48120,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
15813,0.022954,0.000000,0.0,1.000000,0.053542,0.029998,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
61771,0.000000,0.000000,0.0,0.053542,1.000000,0.027843,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.019704,0.037829,0.0,0.0,0.0,0.054265,0.0,0.021041,0.029407
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57143,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.051339,0.000000,0.0,0.0,1.0,0.000000,0.0,0.000000,0.000000
26128,0.000000,0.084959,0.0,0.000000,0.054265,0.000000,0.000000,0.000000,0.000000,0.107797,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,1.000000,0.0,0.000000,0.000000
61369,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,1.0,0.000000,0.000000
45189,0.000000,0.000000,0.0,0.000000,0.021041,0.000000,0.000000,0.027428,0.000000,0.000000,...,0.000000,0.021692,0.041644,0.0,0.0,0.0,0.000000,0.0,1.000000,0.000000


In [None]:
tfidf_edges = []
for i in range(len(df_tfidf_cos)):
  for j in range(i+1, len(df_tfidf_cos)):
    weight = df_tfidf_cos.iloc[i, j]
    tfidf_edges.append((df_tfidf_cos.index[i], df_tfidf_cos.columns[j], weight))

df_tfidf_edges = pd.DataFrame(tfidf_edges, columns = ["source", "target", "weight"])
df_tfidf_edges

499500.0
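
The nested loop visits each of the 499,500 document pairs one at a time through `.iloc`, which can be slow. As a hedged alternative, the same upper-triangle pairs could be pulled out in one step with NumPy's `triu_indices`. The small similarity matrix and IDs below are made-up stand-ins for `df_tfidf_cos`.

```python
import numpy as np

# Hypothetical 3 x 3 similarity matrix and document IDs
doc_ids = np.array([2455, 32074, 48120])
sim = np.array([
    [1.0, 0.0, 0.2],
    [0.0, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])

# Row and column positions strictly above the diagonal (k=1 skips self-pairs)
rows, cols = np.triu_indices(len(doc_ids), k=1)

edges = list(zip(
    doc_ids[rows].tolist(),    # source document
    doc_ids[cols].tolist(),    # target document
    sim[rows, cols].tolist(),  # similarity weight
))
print(edges)
```

The pairs come out in the same order as in the loop version (row by row), so the result could be passed to `pd.DataFrame(..., columns=["source", "target", "weight"])` in the same way.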

In [None]:
tfidf_edges_df_m1 = pd.merge(
  df_tfidf_edges, df_news[['No', 'Category']],
  left_on = 'source',
  right_on = 'No',
  how = 'left')

tfidf_edges_df_m2 = pd.merge(
  tfidf_edges_df_m1, df_news[['No', 'Category']],
  left_on = 'target',
  right_on = 'No',
  how = 'left')

tfidf_edges_df_m2['comb'] = tfidf_edges_df_m2['Category_x'] + '-' + tfidf_edges_df_m2['Category_y']

tfidf_edges_df_m2['comb']

0              Technology-Business
1              Technology-Business
2         Technology-Entertainment
3            Technology-Technology
4         Technology-Entertainment
                    ...           
499495         Business-Technology
499496           Business-Business
499497    Entertainment-Technology
499498      Entertainment-Business
499499         Technology-Business
Name: comb, Length: 499500, dtype: object
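
Because the label is built as source category followed by target category, `Technology-Business` and `Business-Technology` are treated as different groups even though they describe the same pair of categories. If that distinction is not wanted, one option is to sort the two categories before joining them. This is a sketch of that idea; the helper name `unordered_pair_label` is an assumption for illustration.

```python
def unordered_pair_label(cat_a, cat_b):
    """Join two category names in alphabetical order so that A-B equals B-A."""
    return "-".join(sorted([cat_a, cat_b]))

# A few category pairs like those in the output above
pairs = [
    ("Technology", "Business"),
    ("Business", "Technology"),
    ("Medical", "Medical"),
]
pair_labels = [unordered_pair_label(a, b) for a, b in pairs]
print(pair_labels)
```

With a data frame, the same function could be applied row-wise to `Category_x` and `Category_y` before grouping, which would merge the mirrored groups into one.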

In [None]:
tfidf_mean_cos = tfidf_edges_df_m2.groupby('comb')['weight'].agg('mean').sort_values(ascending = False)

### Example with count vectors

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# create a count matrix

count_mat = vectorizer.fit_transform(df_news['News Title'])

count_feature_names = vectorizer.get_feature_names_out()

df_count = pd.DataFrame(
  count_mat.toarray(),
  columns = count_feature_names,
  index = df_news['No'])

In [None]:
df_count

No,00,07,10,100,1000,1000000000000,100m,100th,101,106b,...,your,youtube,yum,z06,zac,zayn,zebra,zillow,zions,zuckerberg
2455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32074,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15813,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61771,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57143,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26128,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45189,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# create a cosine similarity matrix

count_cos_mat = cosine_similarity(df_count, df_count)
df_count_cos = pd.DataFrame(
  count_cos_mat,
  columns = df_news['No'],
  index = df_news['No'])

df_count_cos

No,2455,32074,48120,15813,61771,30397,54680,40341,33687,44603,...,12462,65221,21128,57158,24003,57143,26128,61369,45189,56712
2455,1.000000,0.000000,0.0,0.071611,0.000000,0.086066,0.172133,0.0000,0.29277,0.172133,...,0.105409,0.155700,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
32074,0.000000,1.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.100504,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.100504,0.0,0.000000,0.000000
48120,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
15813,0.071611,0.000000,0.0,1.000000,0.076923,0.092450,0.000000,0.0000,0.00000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000
61771,0.000000,0.000000,0.0,0.076923,1.000000,0.092450,0.000000,0.0000,0.00000,0.000000,...,0.000000,0.083624,0.160128,0.0,0.0,0.0,0.092450,0.0,0.083624,0.113228
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57143,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,...,0.000000,0.090909,0.000000,0.0,0.0,1.0,0.000000,0.0,0.000000,0.000000
26128,0.000000,0.100504,0.0,0.000000,0.092450,0.000000,0.000000,0.0000,0.00000,0.111111,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,1.000000,0.0,0.000000,0.000000
61369,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,1.0,0.000000,0.000000
45189,0.000000,0.000000,0.0,0.000000,0.083624,0.000000,0.000000,0.1066,0.00000,0.000000,...,0.000000,0.090909,0.174078,0.0,0.0,0.0,0.000000,0.0,1.000000,0.000000


In [None]:
count_edges = []
for i in range(len(df_count_cos)):
  for j in range(i+1, len(df_count_cos)):
    weight = df_count_cos.iloc[i, j]
    count_edges.append((df_count_cos.index[i], df_count_cos.columns[j], weight))

df_count_edges = pd.DataFrame(count_edges, columns = ["source", "target", "weight"])
df_count_edges

Unnamed: 0,source,target,weight
0,2455,32074,0.000000
1,2455,48120,0.000000
2,2455,15813,0.071611
3,2455,61771,0.000000
4,2455,30397,0.086066
...,...,...,...
499495,26128,45189,0.000000
499496,26128,56712,0.000000
499497,61369,45189,0.000000
499498,61369,56712,0.000000


In [None]:
count_edges_df_m1 = pd.merge(
  df_count_edges, df_news[['No', 'Category']],
  left_on = 'source',
  right_on = 'No',
  how = 'left')

count_edges_df_m2 = pd.merge(
  count_edges_df_m1, df_news[['No', 'Category']],
  left_on = 'target',
  right_on = 'No',
  how = 'left')

count_edges_df_m2['comb'] = count_edges_df_m2['Category_x'] + '-' + count_edges_df_m2['Category_y']

count_edges_df_m2['comb']

0              Technology-Business
1              Technology-Business
2         Technology-Entertainment
3            Technology-Technology
4         Technology-Entertainment
                    ...           
499495         Business-Technology
499496           Business-Business
499497    Entertainment-Technology
499498      Entertainment-Business
499499         Technology-Business
Name: comb, Length: 499500, dtype: object

In [None]:
count_mean_cos = count_edges_df_m2.groupby('comb')['weight'].agg('mean').sort_values(ascending = False)

Unnamed: 0,source,target,weight,No_x,Category_x,No_y,Category_y,comb
0,2455,32074,0.000000,2455,Technology,32074,Business,Technology-Business
1,2455,48120,0.000000,2455,Technology,48120,Business,Technology-Business
2,2455,15813,0.071611,2455,Technology,15813,Entertainment,Technology-Entertainment
3,2455,61771,0.000000,2455,Technology,61771,Technology,Technology-Technology
4,2455,30397,0.086066,2455,Technology,30397,Entertainment,Technology-Entertainment
...,...,...,...,...,...,...,...,...
499495,26128,45189,0.000000,26128,Business,45189,Technology,Business-Technology
499496,26128,56712,0.000000,26128,Business,56712,Business,Business-Business
499497,61369,45189,0.000000,61369,Entertainment,45189,Technology,Entertainment-Technology
499498,61369,56712,0.000000,61369,Entertainment,56712,Business,Entertainment-Business


In [None]:
print(tfidf_mean_cos)
print(count_mean_cos)

comb
Technology-Technology          0.012135
Medical-Medical                0.011207
Entertainment-Entertainment    0.010505
Business-Business              0.009733
Technology-Business            0.007703
Business-Technology            0.007557
Business-Medical               0.007417
Technology-Medical             0.007386
Medical-Business               0.006896
Technology-Entertainment       0.006730
Entertainment-Technology       0.006688
Entertainment-Medical          0.006593
Medical-Technology             0.006446
Medical-Entertainment          0.006369
Business-Entertainment         0.006247
Entertainment-Business         0.006143
Name: weight, dtype: float64