# Query-Centric Semantic Partitioning (SPARTI)

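Run the FP-Growth algorithm to find the most frequent patterns. Each line of the input file is one space-separated transaction; lines with fewer than two items are dropped.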

In [1]:
data = sc.textFile("hdfs://localhost:9000/user/amadkour/datasets/parsedoutput.txt")
# Split each line into items once, then keep transactions with at least two items.
transactions = (data.map(lambda line: line.strip().split(' '))
                    .filter(lambda items: len(items) >= 2))
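The parsing step above can be tried without a Spark cluster. The following is a minimal sketch in plain Python, assuming the same one-transaction-per-line, space-separated layout; the sample lines and the helper name `parse_transactions` are made up for illustration and do not come from the dataset.

```python
# Hypothetical sample lines; the real file is assumed to hold one
# space-separated transaction per line.
sample_lines = [
    "http://dbpedia.org/ontology/country http://www.w3.org/2000/01/rdf-schema#label\n",
    "http://www.w3.org/2000/01/rdf-schema#label\n",  # single item: dropped
    "   \n",                                          # blank line: dropped
]


def parse_transactions(lines):
    """Split each line on single spaces and keep lists with at least two items."""
    tokenized = (line.strip().split(' ') for line in lines)
    return [items for items in tokenized if len(items) >= 2]


local_transactions = parse_transactions(sample_lines)
print(local_transactions)
```

Note that `split(' ')` yields empty strings when two spaces appear in a row, whereas `split()` with no argument collapses runs of whitespace. Which one is appropriate depends on how the input file was produced.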

The number of entries/transactions is:

In [2]:
transactions.count()

347793

In [3]:
from pyspark.mllib.fpm import FPGrowth

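# Mine itemsets that occur in at least 8% of the transactions.
modelfpg = FPGrowth.train(transactions, minSupport=0.08, numPartitions=4)
result = modelfpg.freqItemsets().collect()
# Print only the itemsets that contain more than one predicate.
for fi in result:
    if len(fi.items) > 1:
        print(fi.items)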

[u'http://dbpedia.org/ontology/country', u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type']
[u'http://dbpedia.org/ontology/country', u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', u'http://www.w3.org/2000/01/rdf-schema#label']
[u'http://dbpedia.org/ontology/country', u'http://www.w3.org/2000/01/rdf-schema#label']
[u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', u'http://www.w3.org/2000/01/rdf-schema#label']
[u'http://dbpedia.org/ontology/thumbnail', u'http://www.w3.org/2000/01/rdf-schema#label']
[u'http://www.w3.org/2004/02/skos/core#broader', u'http://purl.org/dc/terms/subject']
[u'http://www.w3.org/2003/01/geo/wgs84_pos#long', u'http://www.w3.org/2003/01/geo/wgs84_pos#lat']
[u'http://www.w3.org/2004/02/skos/core#prefLabel', u'http://www.w3.org/2004/02/skos/core#broader']
[u'http://www.w3.org/2004/02/skos/core#prefLabel', u'http://www.w3.org/2004/02/skos/core#broader', u'http://purl.org/dc/terms/subject']
[u'http://www.w3.org/2004/02/skos/core#prefLabel', u'http://purl.org/d
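To make the `minSupport=0.08` setting concrete, here is a small sketch of what a relative support threshold means. It assumes the threshold is the ceiling of `minSupport` times the transaction count, and it uses brute-force counting on toy data rather than FP-Growth itself. The helper names and the toy transactions are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations


def min_support_count(min_support, n_transactions):
    """Smallest number of transactions an itemset must appear in (assumed: ceiling)."""
    return int(math.ceil(min_support * n_transactions))


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting; fine for toy data, not a substitute for FP-Growth."""
    threshold = min_support_count(min_support, len(transactions))
    counts = Counter()
    for items in transactions:
        unique_items = sorted(set(items))  # an item counts once per transaction
        for size in range(1, max_size + 1):
            for itemset in combinations(unique_items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= threshold}


# With the 347793 transactions counted above, 8% corresponds to this many transactions.
print(min_support_count(0.08, 347793))

# Toy example (hypothetical data): threshold is 2 of 4 transactions.
toy = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b"]]
print(frequent_itemsets(toy, min_support=0.5))
```

Under that assumption, an itemset has to appear in roughly 27.8 thousand of the 347793 transactions to be reported, which explains why only a handful of very common predicate combinations show up.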

Next, we train a word2vec model on the same transactions to see which predicates it places close together.

In [4]:
from pyspark.mllib.feature import Word2Vec

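word2vec = Word2Vec()
word2vec.setLearningRate(0.05)
# Ignore predicates that appear fewer than 100 times in the corpus.
word2vec.setMinCount(100)
# Each transaction (a list of predicates) is treated as one sentence.
modelw2v = word2vec.fit(transactions)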

The ten predicates closest to http://dbpedia.org/ontology/country in the word2vec model are:

In [5]:
synonyms = modelw2v.findSynonyms('http://dbpedia.org/ontology/country', 10)
# findSynonyms returns (word, similarity) pairs; higher means closer.
for word, similarity in synonyms:
    print("{} : {}".format(word, similarity))

http://www.w3.org/1999/02/22-rdf-syntax-ns#type : 1.21623571296
http://www.w3.org/2000/01/rdf-schema#subClassOf : 1.20385914074
http://www.w3.org/2000/01/rdf-schema#label : 1.18294444618
http://dbpedia.org/property/populationCensus : 1.14484372829
http://dbpedia.org/property/populationEstimate : 1.13115934062
http://dbpedia.org/ontology/language : 1.07629413415
http://dbpedia.org/property/currency : 0.975898747114
http://dbpedia.org/property/country : 0.962037671681
bif:contains : 0.946819285897
http://dbpedia.org/ontology/abstract : 0.940576365816
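For reference, the score attached to each neighbour is meant to be a cosine similarity. The sketch below computes it in plain Python on made-up three-dimensional vectors (the vectors and function names are hypothetical). A true cosine similarity lies between -1 and 1, so the scores above 1 in the output are presumably scaled differently, for example divided by the candidate vector's norm but not the query vector's. That explanation is an assumption and is not confirmed by the notebook.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by both norms; always within [-1, 1]."""
    return dot(u, v) / (norm(u) * norm(v))


def candidate_normalized_score(query, candidate):
    """Assumed variant: divides by the candidate's norm only, so it is not bounded by 1."""
    return dot(query, candidate) / norm(candidate)


# Hypothetical embedding vectors, for illustration only.
query = [1.0, 2.0, 0.5]
candidate = [2.0, 4.1, 0.9]

print(cosine_similarity(query, candidate))
print(candidate_normalized_score(query, candidate))
```

Both scores produce the same ranking for a fixed query, because they differ only by the constant factor of the query vector's norm.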


Next, we train a GloVe model on the same transactions. (We still need to work out how to filter based on word co-occurrences.)

In [6]:
from __future__ import print_function

from glove import Corpus, Glove

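# Build the co-occurrence matrix; collect() pulls all transactions to the driver.
corpus_model = Corpus()
corpus_model.fit(transactions.collect())

# Train GloVe on the co-occurrence matrix, then attach the word-to-id
# dictionary so the model can be queried by word.
glove = Glove(learning_rate=0.05)
glove.fit(corpus_model.matrix, no_threads=8)
glove.add_dictionary(corpus_model.dictionary)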
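One possible way to approach the open question about filtering on co-occurrences is to count pairs directly and drop those below a threshold before training. The sketch below does this in plain Python on toy data. The function names, the `window` and `min_count` parameters, and the toy transactions are all hypothetical, and the counts are unweighted, unlike the distance-weighted counts that some GloVe implementations use.

```python
from collections import Counter


def cooccurrence_counts(transactions, window=10):
    """Count unordered pairs of distinct items at most `window` positions apart."""
    counts = Counter()
    for items in transactions:
        for i, left in enumerate(items):
            for right in items[i + 1:i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


def filter_pairs(counts, min_count):
    """Keep only pairs that co-occur at least `min_count` times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy transactions (hypothetical data).
toy = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "d"]]
counts = cooccurrence_counts(toy)
print(filter_pairs(counts, min_count=2))
```

Whether such a filter should be applied to the pairs, to the vocabulary, or to the sparse matrix itself is a design choice that the notebook leaves open.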

The ten predicates closest to http://dbpedia.org/ontology/country in the GloVe model are:

In [9]:
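# number=11 yields the 10 neighbours shown below, presumably because the
# query word itself is excluded from the result.
glovesynonyms = glove.most_similar("http://dbpedia.org/ontology/country", number=11)
for word, similarity in glovesynonyms:
    print("{} : {}".format(word, similarity))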

http://dbpedia.org/ontology/populationTotal : 0.998675068977
http://dbpedia.org/ontology/currency : 0.995637170392
http://dbpedia.org/ontology/author : 0.995277772094
http://dbpedia.org/property/densityrank : 0.994142278242
http://dbpedia.org/ontology/populationDensity : 0.993767780191
http://dbpedia.org/ontology/leaderName : 0.993577761598
http://dbpedia.org/ontology/areaTotal : 0.993363884078
http://www.w3.org/2000/01/rdf-schema#subClassOf : 0.992752886787
http://www.w3.org/2004/02/skos/core#prefLabel : 0.992731451232
http://dbpedia.org/ontology/wikiPageRedirects : 0.991522760901
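As a rough way to compare the two embedding models, the neighbour lists printed above can be intersected. The sketch below copies the two lists from the outputs and computes their overlap and Jaccard index. The choice of Jaccard as a comparison measure is an assumption made here for illustration, not something the notebook prescribes.

```python
# Neighbours of http://dbpedia.org/ontology/country, copied from the outputs above.
w2v_neighbours = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://dbpedia.org/property/populationCensus",
    "http://dbpedia.org/property/populationEstimate",
    "http://dbpedia.org/ontology/language",
    "http://dbpedia.org/property/currency",
    "http://dbpedia.org/property/country",
    "bif:contains",
    "http://dbpedia.org/ontology/abstract",
]
glove_neighbours = [
    "http://dbpedia.org/ontology/populationTotal",
    "http://dbpedia.org/ontology/currency",
    "http://dbpedia.org/ontology/author",
    "http://dbpedia.org/property/densityrank",
    "http://dbpedia.org/ontology/populationDensity",
    "http://dbpedia.org/ontology/leaderName",
    "http://dbpedia.org/ontology/areaTotal",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://dbpedia.org/ontology/wikiPageRedirects",
]

# Set overlap between the two top-10 lists.
shared = set(w2v_neighbours) & set(glove_neighbours)
union = set(w2v_neighbours) | set(glove_neighbours)
jaccard = len(shared) / float(len(union))

print(sorted(shared))
print(jaccard)
```

Note that the two lists distinguish the `ontology` and `property` namespaces, so http://dbpedia.org/property/currency and http://dbpedia.org/ontology/currency count as different predicates in this comparison.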
