This is a very useful guide to TF-IDF calculation and K-means clustering using scikit-learn:

https://jonathansoma.com/lede/algorithms-2017/classes/clustering/k-means-clustering-with-scikit-learn/

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer #SciKit-Learn Machine Learning Library
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from sklearn.cluster import KMeans
import nltk
from nltk import word_tokenize

In [26]:
df = pd.read_csv("grid_250m_newest_geocluster_round1.csv")
#group by grid cell number and then concatenate all tags
df = df.astype(str).groupby('PageNumber')['custom_f_1'].apply(lambda x: ' '.join(x)).reset_index()

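#keep only runs of 4-15 word characters: tags shorter than 4 characters are dropped,
#and tags longer than 15 are cut into chunks (which is why 'invernesskingsc ross' shows up in the output)
df['custom_f_1'] = df['custom_f_1'].str.findall(r'\w{4,15}').str.join(' ')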

tokenized_tags = df['custom_f_1'].astype(str).apply(nltk.word_tokenize)

filter_words = ['wwwpoopmapeu','wwwpoopmapde','smcpentaxm','iansdigitalphot','italphotos','iansdigitalpho','flickrandroidap','camera','filter','none']
df["custom_f_1"] = [[t for t in tok_sent if t not in filter_words] for tok_sent in tokenized_tags]

#convert list back to string
df['custom_f_1'] = df['custom_f_1'].str.join(' ')

df

Unnamed: 0,PageNumber,custom_f_1
0,103,invernesskingsc ross virgineastcoast trailing ...
1,105,tower skyline canon square construction citysc...
2,108,building heritage history architecture cottage...
3,109,pufferfish twitter pufferball twittersphere sp...
4,110,scaffolding building builders hivis ibeams aut...
...,...,...
217,92,night lighttrails panorama night canon cycling...
218,93,landscape paisaje escocia edimburgo vegetaci g...
219,95,holyroodpark arthursseat holyroodpark building...
220,96,arthursseat holyroodpark nature landscape geol...
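
A quick check of what the `\w{4,15}` filter does, as a minimal sketch on a made-up tag string (not taken from the CSV). It also shows a word-boundary variant, `\b\w{4,15}\b`, which is an alternative and not what the cell above uses: it drops over-long tags whole instead of cutting them into chunks.

```python
import pandas as pd

# Toy tag string, invented for illustration only
tags = pd.Series(["abc invernesskingscross camera skyline"])

# \w{4,15} matches runs of 4 to 15 word characters, scanning left to right,
# so a 19-character tag becomes a 15-character chunk plus a 4-character chunk
tokens = tags.str.findall(r"\w{4,15}")
print(tokens[0])  # ['invernesskingsc', 'ross', 'camera', 'skyline']

# With word boundaries only whole tags of 4 to 15 characters survive
whole = tags.str.findall(r"\b\w{4,15}\b")
print(whole[0])  # ['camera', 'skyline']
```

Which variant is preferable depends on whether the fragments of long concatenated tags carry any meaning for the clustering.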


In [28]:
#create vectoriser - with norm='l2' every TF-IDF weight ends up between 0 and 1
#set use_idf to 'True' (the default) so that it actually calculates the IDF part of TF-IDF (otherwise it's just TF, which is bloody confusing)
#max_df and min_df are NOT set below, so the defaults apply (max_df=1.0, min_df=1) and no terms are dropped by document frequency
#  max_df=0.8 would skip terms that appear in more than 80% of the documents
#  min_df=5 would skip terms that appear in fewer than 5 documents
#norm='l2' scales each document vector to unit length, so cosine similarity reduces to a plain dot product
Vectorizer = TfidfVectorizer(lowercase = True, stop_words = 'english', use_idf = True, norm = 'l2')

#Vectors = Vectorizer.fit_transform(df['lemmatised_tags'])

#https://stackoverflow.com/questions/64743583/which-10-words-has-the-highest-tf-idf-value-in-each-document-total

#fit the vectoriser and build the TF-IDF document-term matrix (the top 10 tags per document can be read from it)
X_tfidf = Vectorizer.fit_transform(df['custom_f_1'])
#X_tfidf_array = X_tfidf.toarray()


In [29]:
#fit_transform already returns a sparse matrix; csr_matrix() just makes sure it is in CSR format
Vectors_sparse = sparse.csr_matrix(X_tfidf)

similarities = cosine_similarity(Vectors_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

pairwise dense output:
 [[1.         0.59257914 0.02686447 ... 0.00473226 0.00244154 0.        ]
 [0.59257914 1.         0.03820431 ... 0.01953031 0.00520625 0.        ]
 [0.02686447 0.03820431 1.         ... 0.01120797 0.00312765 0.        ]
 ...
 [0.00473226 0.01953031 0.01120797 ... 1.         0.59346654 0.24788304]
 [0.00244154 0.00520625 0.00312765 ... 0.59346654 1.         0.23540379]
 [0.         0.         0.         ... 0.24788304 0.23540379 1.        ]]
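
Because the vectoriser uses `norm='l2'`, every row has unit length, so `linear_kernel` (a plain dot product) should give the same matrix as `cosine_similarity` while skipping the normalisation step. A small sketch on toy documents (the corpus is made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents, invented for illustration
docs = [
    "castle skyline castle night",
    "park nature landscape park",
    "castle park panorama",
]
X = TfidfVectorizer(norm="l2").fit_transform(docs)

# Each row of an l2-normalised TF-IDF matrix has Euclidean length 1
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)

cos = cosine_similarity(X)  # normalises the rows, then takes dot products
dot = linear_kernel(X)      # dot products only
print(np.allclose(cos, dot))
```

This shortcut only holds when the rows are already normalised; with `norm=None` the two matrices would differ.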



In [30]:
#check shape - it should be a square matrix with one row and one column per grid cell
print(similarities.shape)

#build the DataFrame straight from the array (avoids shadowing the built-in list)
cosine_similarity_matrix = pd.DataFrame(similarities)

#cosine_similarity_matrix.to_csv("cosine_sim_matrix_250m_geocluster_1.csv")

(222, 222)
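
The similarity matrix is easier to read when its rows and columns are labelled with the grid cell ids. A possible sketch, using a hypothetical 3x3 matrix and three made-up cell ids in place of the real data, which also looks up the most similar other cell for each cell:

```python
import numpy as np
import pandas as pd

# Hypothetical cell ids and similarity values, for illustration only
page_numbers = ["103", "105", "108"]
sim = np.array([
    [1.00, 0.59, 0.03],
    [0.59, 1.00, 0.04],
    [0.03, 0.04, 1.00],
])
sim_df = pd.DataFrame(sim, index=page_numbers, columns=page_numbers)

# Blank out the diagonal so a cell is never reported as its own nearest neighbour
off_diag = sim_df.mask(np.eye(len(sim_df), dtype=bool))

# For each row, the column label holding the largest remaining similarity
nearest = off_diag.idxmax(axis=1)
print(nearest)
```

In the notebook itself the labels would presumably come from `df['PageNumber']`, provided the row order of `df` is unchanged since the vectoriser was fitted.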


This Stack Overflow thread is useful, and the code below draws on it: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity