This is a very useful guide to TF-IDF calculation and K-means clustering using scikit-learn:

https://jonathansoma.com/lede/algorithms-2017/classes/clustering/k-means-clustering-with-scikit-learn/

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer #SciKit-Learn Machine Learning Library
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from sklearn.cluster import KMeans
import nltk
from nltk import word_tokenize

In [13]:
df = pd.read_csv("grid_1000m_newest_geocluster_round1.csv")
#group by grid cell number and then concatenate all tags
df = df.astype(str).groupby('PageNumber')['custom_f_1'].apply(lambda x: ' '.join(x)).reset_index()

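#keep only runs of 4-15 word characters: tokens shorter than 4 characters are dropped,
#and tokens longer than 15 are cut into chunks of up to 15 characters (leftovers under 4 are dropped)
df['custom_f_1'] = df['custom_f_1'].str.findall(r'\w{4,15}').str.join(' ')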

tokenized_tags = df['custom_f_1'].astype(str).apply(nltk.word_tokenize)

#unwanted tags to remove from every grid cell
filter_words = ['wwwpoopmapeu','wwwpoopmapde','smcpentaxm','iansdigitalphot','italphotos','iansdigitalpho','flickrandroidap','camera','filter','none']
#keep only the tokens that are not in filter_words
df["custom_f_1"] = [[t for t in tok_sent if t not in filter_words] for tok_sent in tokenized_tags]

#convert list back to string
df['custom_f_1'] = df['custom_f_1'].str.join(' ')

df


   PageNumber  custom_f_1
0  1           house villa quartz rubble rockery rafters mmsu...
1  10          holiday policebox weeshop edinburghpolice poli...
2  11          history wheel vintage canal vintagecar spokes ...
3  12          cameraphone urban graffiti flickriosapp ghostv...
4  13          irieyoyo meadowsfestival music march rally cam...
5  14          themeadows urbanarte square sutro foursquare v...
6  15          night lighttrails flickriosapp longexposure la...
7  16          holyroodpark arthursseat holyroodpark building...
8  17          portobello firthofforth edinburghscotla holyro...
9  19          sculpture house gate iron lodge installation f...
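
The truncated tags in the output above (e.g. edinburghscotla) come from the \w{4,15} pattern. Here is a minimal sketch of what that pattern does, using plain re.findall (which pandas' str.findall is based on) and a made-up sample string rather than the real data. The word-boundary variant is only a suggested alternative, not what the notebook uses.

```python
import re

# Hypothetical sample tags, not taken from the dataset.
sample = "the arthursseat edinburghscotlandholiday sky"

# Same pattern as the notebook: any run of 4-15 word characters.
# Long tokens are cut into chunks instead of being removed.
tokens = re.findall(r"\w{4,15}", sample)
print(tokens)

# Alternative (an assumption, not the notebook's method): word boundaries
# make the pattern match whole tokens only, so long tokens are dropped.
strict_tokens = re.findall(r"\b\w{4,15}\b", sample)
print(strict_tokens)
```

With the notebook's pattern the long tag is split into a 15-character piece plus a leftover piece, which is probably why fragments such as 'iansdigitalphot' had to be added to filter_words.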


In [15]:
#create vectoriser - with norm='l2' every document vector has unit length, so each TF-IDF value lies between 0 and 1
#use_idf=True (the default) makes it calculate the IDF part of TF-IDF; with use_idf=False it is just TF, which is confusing
#max_df and min_df are not set below, so the defaults apply (max_df=1.0, min_df=1) and no terms are dropped by document frequency
#  max_df=0.8 would ignore terms that appear in more than 80% of documents
#  min_df=5 would ignore terms that appear in fewer than 5 documents (it counts documents, not occurrences)
#norm='l2' scales each document vector to unit Euclidean length, so cosine similarity reduces to a dot product
Vectorizer = TfidfVectorizer(lowercase = True, stop_words = 'english', use_idf = True, norm = 'l2')

#Vectors = Vectorizer.fit_transform(df['lemmatised_tags'])

#https://stackoverflow.com/questions/64743583/which-10-words-has-the-highest-tf-idf-value-in-each-document-total

#build the TF-IDF matrix (one row per grid cell); the top 10 TF-IDF tags per document can be read from its rows
X_tfidf = Vectorizer.fit_transform(df['custom_f_1'])
#X_tfidf_array = X_tfidf.toarray()



In [16]:
#fit_transform already returns a SciPy sparse (CSR) matrix, so this conversion is redundant but harmless
Vectors_sparse = sparse.csr_matrix(X_tfidf)

similarities = cosine_similarity(Vectors_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

pairwise dense output:
 [[1.         0.05929554 0.31165795 ... 0.02780234 0.05306333 0.03092724]
 [0.05929554 1.         0.19171665 ... 0.02337644 0.0122675  0.09834836]
 [0.31165795 0.19171665 1.         ... 0.04100583 0.03072733 0.07683422]
 ...
 [0.02780234 0.02337644 0.04100583 ... 1.         0.15699212 0.02142997]
 [0.05306333 0.0122675  0.03072733 ... 0.15699212 1.         0.01437803]
 [0.03092724 0.09834836 0.07683422 ... 0.02142997 0.01437803 1.        ]]
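
Because the vectoriser uses norm='l2', every row already has unit length, so cosine similarity should equal the plain dot product that linear_kernel computes. A small sketch to check this on a made-up corpus (the documents are hypothetical, not the grid cell tags):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus standing in for the grid cell tags.
docs = [
    "castle park hill",
    "park canal bridge",
    "castle bridge festival music",
]

X = TfidfVectorizer(norm="l2").fit_transform(docs)

cos = cosine_similarity(X)   # normalises rows, then takes dot products
dot = linear_kernel(X)       # dot products only, no normalisation

# Rows are already unit length, so both matrices should agree.
print(np.allclose(cos, dot))
```

If norm were set to None, the two results would generally differ and cosine_similarity would be the safer choice.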



In [17]:
#check the shape - it should be a square matrix with one row and one column per grid cell
print(similarities.shape)

#build the DataFrame straight from the array (naming a variable list would shadow the Python built-in)
cosine_similarity_matrix = pd.DataFrame(similarities)

#cosine_similarity_matrix.to_csv("cosine_sim_matrix_1000m_geocluster_3.csv")

(39, 39)
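
A sketch of how the similarity matrix could be labelled with grid cell numbers and used to look up each cell's most similar neighbour. The 3x3 matrix is just the top-left corner of the output above, rounded, paired with the first three PageNumber values; the names sim_df and nearest are made up for this example, and on the full 39x39 matrix the nearest neighbours may be different.

```python
import numpy as np
import pandas as pd

# First three PageNumber values and the rounded top-left corner
# of the similarity matrix printed earlier (illustration only).
page_numbers = ["1", "10", "11"]
similarities = np.array([
    [1.00, 0.06, 0.31],
    [0.06, 1.00, 0.19],
    [0.31, 0.19, 1.00],
])

# Label rows and columns with the grid cell numbers.
sim_df = pd.DataFrame(similarities, index=page_numbers, columns=page_numbers)

# Blank out the diagonal so a cell is not matched with itself.
off_diagonal = sim_df.mask(np.eye(len(sim_df), dtype=bool))

# For each grid cell, the label of the most similar other cell.
nearest = off_diagonal.idxmax(axis=1)
print(nearest)
```

In the notebook, df['PageNumber'] would supply the labels, assuming the row order of df has not changed since the TF-IDF matrix was built.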


This is useful, and the code below draws from it: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity