In [38]:
import os
import bz2
import pandas as pd
import numpy as np
import rdflib
from rdflib import Graph, URIRef, Literal, Namespace
from collections import Counter
import urllib.parse
import matplotlib.pyplot as plt
import seaborn as sns

In [13]:
ttl_files = [
    "data/hiwiki-20250601-anchor-text.ttl.bz2",
    "data/hiwiki-20250601-commons-page-links.ttl.bz2", 
    "data/hiwiki-20250601-disambiguations.ttl.bz2",
    "data/hiwiki-20250601-geo-coordinates-mappingbased.ttl.bz2",
    "data/hiwiki-20250601-homepages.ttl.bz2",
    "data/hiwiki-20250601-images.ttl.bz2",
    "data/hiwiki-20250601-instance-types-transitive.ttl.bz2",
    "data/hiwiki-20250601-instance-types.ttl.bz2",
    "data/hiwiki-20250601-mappingbased-literals.ttl.bz2",
    "data/hiwiki-20250601-mappingbased-objects-uncleaned.ttl.bz2",
    "data/hiwiki-20250601-redirects.ttl.bz2",
    "data/hiwiki-20250601-specific-mappingbased-properties.ttl.bz2",
    "data/hiwiki-20250601-topical-concepts.ttl.bz2"
]

In [41]:
all_triples = []
for file_path in ttl_files:
    print(f"Parsing {os.path.basename(file_path)}...")
    
    try:
        with bz2.open(file_path, 'rt', encoding='utf-8') as f:
            g = Graph()
            try:
                g.parse(f, format='turtle')
            except Exception as e:  # ParserError was never imported; Exception already covers it
                print(f"  WARNING: Could not parse {os.path.basename(file_path)}. Error: {e}")
                continue

            for s, p, o in g:
                # Subject and predicate are always URIs; strip stray newlines and whitespace.
                # This mostly addresses rdflib warnings such as:
                # http://hi.wikipedia.org/wiki/चित्र:\nPreah_Khan_temple_at_Angkor,_Cambodia.jpg 
                # does not look like a valid URI, trying to serialize this will break.
                s_clean = str(s).replace('\n', '').replace('\r', '').strip()                
                p_clean = str(p).replace('\n', '').replace('\r', '').strip()

                if isinstance(o, URIRef):
                    o_clean = str(o).replace('\n', '').replace('\r', '').strip()                
                    all_triples.append((s_clean, p_clean, o_clean))
                
                elif isinstance(o, Literal):
                    # Skip literals: attributes such as date of birth are not what we want to learn;
                    # we want symbolic links between two entities.
                    # Literals also cause many issues while parsing the graph
                    # (see the xsd:date conversion errors in the output below).
                    continue
                    # print(f"literal: {o}")
                    # o_clean = str(o.value) 
                    # all_triples.append((s_clean, p_clean, o_clean))

    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

print(f"total: {len(all_triples)} triples")

Parsing hiwiki-20250601-anchor-text.ttl.bz2...
Parsing hiwiki-20250601-commons-page-links.ttl.bz2...
Parsing hiwiki-20250601-disambiguations.ttl.bz2...
Parsing hiwiki-20250601-geo-coordinates-mappingbased.ttl.bz2...
Parsing hiwiki-20250601-homepages.ttl.bz2...
Parsing hiwiki-20250601-images.ttl.bz2...


http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg does not look like a valid URI, trying to serialize this will break.
http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg does not look like a valid URI, trying to serialize this will break.
http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg?width=300 does not look like a valid URI, trying to serialize this will break.
http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg does not look like a valid URI, trying to serialize this will break.
http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg?width=300 does not look like a valid URI, trying to serialize this will break.
http://commons.wikimedia.org/wiki/Special:FilePath/\nBhutan-Paro-Stadt-06-Zentrum-2015-gje.jpg does not look like a valid URI, trying to serialize this will bre

Parsing hiwiki-20250601-instance-types-transitive.ttl.bz2...
Parsing hiwiki-20250601-instance-types.ttl.bz2...
Parsing hiwiki-20250601-mappingbased-literals.ttl.bz2...


Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#date, Converter=<function parse_xsd_date at 0x113eeb490>
Traceback (most recent call last):
  File "/Users/adityavenkatesh/Documents/Code/nef_new/neural-extraction-framework/GSoC25_H/.env/lib/python3.10/site-packages/rdflib/term.py", line 2163, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
  File "/Users/adityavenkatesh/Documents/Code/nef_new/neural-extraction-framework/GSoC25_H/.env/lib/python3.10/site-packages/rdflib/xsd_datetime.py", line 593, in parse_xsd_date
    return parse_date(date_string if not minus else ("-" + date_string))
  File "/Users/adityavenkatesh/Documents/Code/nef_new/neural-extraction-framework/GSoC25_H/.env/lib/python3.10/site-packages/isodate/isodates.py", line 193, in parse_date
    raise ISO8601Error("Unrecognised ISO 8601 date format: %r" % datestring)
isodate.isoerror.ISO8601Error: Unrecognised ISO 8601 date format: '-0753-04-21'
Faile

Parsing hiwiki-20250601-mappingbased-objects-uncleaned.ttl.bz2...
Parsing hiwiki-20250601-redirects.ttl.bz2...
Parsing hiwiki-20250601-specific-mappingbased-properties.ttl.bz2...
Parsing hiwiki-20250601-topical-concepts.ttl.bz2...
total: 2039012 triples


In [24]:
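# Keep only DBpedia ontology (dbo:) relations and drop disambiguation links.
# owl:sameAs is already excluded by the prefix check; the test is kept for clarity.
DBO_PREFIX = 'http://dbpedia.org/ontology/'
core_triples = [
    t for t in all_triples
    if t[1].startswith(DBO_PREFIX)
    and 'http://www.w3.org/2002/07/owl#sameAs' not in t[1]
    and DBO_PREFIX + 'wikiPageDisambiguates' not in t[1]
]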

In [31]:
relation_counts = Counter(t[1] for t in core_triples)

for i, (relation, count) in enumerate(relation_counts.most_common(30)):
    print(f"{i+1}. {relation:<30} | Count: {count}")

1. http://dbpedia.org/ontology/language | Count: 108087
2. http://dbpedia.org/ontology/thumbnail | Count: 78098
3. http://dbpedia.org/ontology/wikiPageRedirects | Count: 77046
4. http://dbpedia.org/ontology/subdivision | Count: 49789
5. http://dbpedia.org/ontology/starring | Count: 32077
6. http://dbpedia.org/ontology/state | Count: 28398
7. http://dbpedia.org/ontology/district | Count: 25201
8. http://dbpedia.org/ontology/country | Count: 19176
9. http://dbpedia.org/ontology/birthPlace | Count: 14088
10. http://dbpedia.org/ontology/timeZone | Count: 11205
11. http://dbpedia.org/ontology/occupation | Count: 8277
12. http://dbpedia.org/ontology/termPeriod | Count: 7609
13. http://dbpedia.org/ontology/nationality | Count: 4206
14. http://dbpedia.org/ontology/deathPlace | Count: 3509
15. http://dbpedia.org/ontology/director | Count: 3279
16. http://dbpedia.org/ontology/politicalLeader | Count: 3214
17. http://dbpedia.org/ontology/producer | Count: 3178
18. http://dbpedia.org/ontology/type

In [32]:
least_common = relation_counts.most_common()[-30:]
for i, (relation, count) in enumerate(least_common):
    print(f"{i+1}. {relation:<30} | Count: {count}")

1. http://dbpedia.org/ontology/academicAdvisor | Count: 21
2. http://dbpedia.org/ontology/alongside | Count: 20
3. http://dbpedia.org/ontology/honours | Count: 19
4. http://dbpedia.org/ontology/recordedIn | Count: 17
5. http://dbpedia.org/ontology/architect | Count: 16
6. http://dbpedia.org/ontology/nonFictionSubject | Count: 16
7. http://dbpedia.org/ontology/administrativeCenter | Count: 15
8. http://dbpedia.org/ontology/viceChancellor | Count: 13
9. http://dbpedia.org/ontology/restingPlacePosition | Count: 11
10. http://dbpedia.org/ontology/minister | Count: 11
11. http://dbpedia.org/ontology/creatorOfDish | Count: 10
12. http://dbpedia.org/ontology/athletics | Count: 9
13. http://dbpedia.org/ontology/opponent | Count: 9
14. http://dbpedia.org/ontology/translator | Count: 9
15. http://dbpedia.org/ontology/coverArtist | Count: 8
16. http://dbpedia.org/ontology/province | Count: 7
17. http://dbpedia.org/ontology/manager | Count: 7
18. http://dbpedia.org/ontology/sport | Count: 7
19. ht

In [33]:
# Analyze Entity Degree (Connectivity)

entity_degrees = Counter()
for head, relation, tail in core_triples:
    entity_degrees[head] += 1
    entity_degrees[tail] += 1

print(f"\nTotal unique entities found in core_triples: {len(entity_degrees)}")



Total unique entities found in core_triples: 304395


In [34]:
degree_df = pd.DataFrame(entity_degrees.values(), columns=['degree'])
print(degree_df['degree'].describe())

count    304395.000000
mean          3.575361
std          84.226941
min           1.000000
25%           1.000000
50%           1.000000
75%           3.000000
max       28392.000000
Name: degree, dtype: float64


### Metadata and structural noise we can remove
wikiPageRedirects (77,046 count): This is not a semantic relationship. It is a structural artifact from Wikipedia that says "this page is a redirect to another". Keeping it would teach the model that many things are simply equivalent, which is not useful for predicting new facts. We can remove it safely.


thumbnail (78,098 count): This links an entity to its image URL. It has no semantic value for predicting relationships like (Person, birthPlace, City). It is just noise. Remove.


mainArticleForCategory (606 count): This is another structural link, between a category page and its main article. Not a semantic fact. Remove.

language (108,087 count): This almost always links an entity to a language entity (e.g. dbr:Hindi_language). It is technically semantic, but it can create a massive, uninformative hub around the Hindi_language entity and bias our model.




### High-frequency but potentially noisy relations
language (#1, 108,087 count): This links entities (like movies, books, people) to their language (e.g., http://dbpedia.org/resource/Hindi). This creates a massive "hub" where the "Hindi" entity is connected to thousands of things. While technically a fact, it does not help in predicting diverse relationships and can skew the model. We want to predict things about India, not that India's language is Hindi. For a more focused KGC task, it is best to remove this.


timeZone (#10, 11,205 count): Similar to language, this connects many locations to a few timezone entities (like "Asia/Kolkata"). This is also a candidate for removal to improve focus.
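
The removals discussed so far could be applied with a simple predicate blacklist. The following is a minimal sketch, not the final implementation: `RELATION_BLACKLIST`, `drop_blacklisted`, and the toy triples are hypothetical names and data introduced here for illustration, and it assumes triples are `(subject, predicate, object)` string tuples as in `core_triples`.

```python
# Hypothetical blacklist built from the relations flagged in the notes above.
DBO = "http://dbpedia.org/ontology/"
RELATION_BLACKLIST = {
    DBO + "wikiPageRedirects",
    DBO + "thumbnail",
    DBO + "mainArticleForCategory",
    DBO + "language",
    DBO + "timeZone",
}


def drop_blacklisted(triples, blacklist=RELATION_BLACKLIST):
    """Return only the triples whose predicate is not in the blacklist."""
    return [t for t in triples if t[1] not in blacklist]


# Toy triples, made up for illustration (not taken from the dump).
toy_triples = [
    ("ex:FilmA", DBO + "starring", "ex:ActorB"),
    ("ex:FilmA", DBO + "language", "ex:Hindi"),
    ("ex:PageC", DBO + "wikiPageRedirects", "ex:PageD"),
]
filtered = drop_blacklisted(toy_triples)
print(filtered)
```

Using an exact-match set rather than substring checks avoids accidentally dropping a relation whose URI merely contains a blacklisted name.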




### Semantic core relations we should keep
starring, occupation, nationality, director, subdivision, state, district, country, birthPlace, deathPlace, location, residence: these are the facts we want to model.
starring, producer, director, writer, musicBy, genre: these form a strong sub-graph about media and films.
occupation, politician, party, almaMater: good biographical and political relations.
All the others in the top 30 are generally good semantic relationships.

### Rare relations
Generally, a model cannot learn from sparse relations that only show up a handful of times. Relations like bodyDiscovered (1), taoiseach (2), or mother (5) are too sparse.
My plan is to set a threshold and remove any relation that appears fewer than 50 or 100 times. While we lose some specific facts, we get a denser, more learnable graph for the model.
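
A relation-frequency threshold like the one described above might look like the sketch below. The function name `prune_rare_relations` and the `min_count` parameter are hypothetical; the default of 50 is just one of the two candidate thresholds mentioned in the notes, and the toy data is made up.

```python
from collections import Counter


def prune_rare_relations(triples, min_count=50):
    """Drop every triple whose predicate occurs fewer than min_count times."""
    counts = Counter(p for _, p, _ in triples)
    return [t for t in triples if counts[t[1]] >= min_count]


# Toy data: "r1" occurs three times, "r2" only once.
toy = [("a", "r1", "b"), ("b", "r1", "c"), ("c", "r1", "a"), ("c", "r2", "d")]
kept = prune_rare_relations(toy, min_count=2)
print(kept)
```

Counting once and filtering in a single pass keeps this linear in the number of triples, which matters at the ~2M-triple scale of this dump.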


## Entity Degree / connectivity analysis

count: 304395: we have ~304k unique entities in this core_triples set.
mean: 3.58: the average entity takes part in 3-4 triples. That is quite low, so the graph is sparse.
std: 84.23: degrees are not distributed evenly at all. Some entities have far more connections than the average.

50% (median): 1.0: This is the most critical statistic. Since the minimum is also 1, at least 50% of our entities have only ONE connection. These are "leaf" nodes. The model has almost no context for them and cannot learn a meaningful vector, so they contribute mostly noise to training.
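
To put an exact number on the leaf-node share rather than reading it off the quartiles, something like the following could be run on `entity_degrees`. This is a small sketch; `degree_share` is a hypothetical helper and the toy counter is made up.

```python
from collections import Counter


def degree_share(degrees, k=1):
    """Fraction of entities whose degree is exactly k."""
    if not degrees:
        return 0.0
    return sum(1 for d in degrees.values() if d == k) / len(degrees)


# Toy degree counter: two of the four entities are leaves.
toy_degrees = Counter({"a": 1, "b": 1, "c": 4, "d": 2})
print(degree_share(toy_degrees, k=1))
```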


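max: 28392: This shows a massive "hub" entity. My first guess is an entity that many wikiPageRedirects triples point to, but a hub around a language, state, or country entity is just as plausible, so this needs to be confirmed by listing the highest-degree entities.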
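
One way to identify the hub instead of guessing is to list the highest-degree entities and the relations that touch them. The helpers below (`top_hubs`, `relations_touching`) are hypothetical names, degree is counted the same way as in the `entity_degrees` cell (one per triple the entity appears in), and the toy triples are made up.

```python
from collections import Counter


def top_hubs(triples, n=10):
    """Return the n entities that appear in the most triples, with their degree."""
    degrees = Counter()
    for head, _, tail in triples:
        degrees[head] += 1
        degrees[tail] += 1
    return degrees.most_common(n)


def relations_touching(triples, entity):
    """Count which predicates connect to the given entity."""
    return Counter(p for h, p, t in triples if h == entity or t == entity)


# Toy triples: "hub" is the tail of three triples.
toy = [
    ("a", "language", "hub"),
    ("b", "language", "hub"),
    ("c", "country", "hub"),
    ("a", "starring", "b"),
]
print(top_hubs(toy, n=1))
print(relations_touching(toy, "hub"))
```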


The Histogram: The "Zoomed-in View" plot confirms this visually. There is a gigantic bar at Degree = 1 with over 200,000 entities. This is the "long tail" of very sparsely connected nodes that we need to prune.


The structure is dominated by a small number of massive "hubs" with very many connections, while the vast majority of entities sit on the periphery with very few.

### The Strategy: Pruning the Graph
Based on this analysis, my strategy is a two-step pruning process that produces a smaller, denser, and more semantically coherent graph.

Semantic Filtering: We will remove the relations listed under "metadata and structural noise" and "high-frequency but potentially noisy relations" using a blacklist.


Frequency Pruning (K-Core Pruning): We will remove all relations that appear too infrequently and all entities that have too few connections.
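
The entity side of the frequency pruning can be sketched as an iterative, k-core-style filter: removing a low-degree entity lowers its neighbours' degrees, so the filter repeats until nothing changes. This is an illustration under assumptions, not a final implementation. `prune_low_degree` and `min_degree` are hypothetical names, and degree is counted per triple (as in `entity_degrees`), which differs slightly from a textbook k-core on a simple graph.

```python
from collections import Counter


def prune_low_degree(triples, min_degree=2):
    """Repeatedly drop triples that touch an entity with degree < min_degree."""
    triples = list(triples)
    while True:
        degrees = Counter()
        for h, _, t in triples:
            degrees[h] += 1
            degrees[t] += 1
        kept = [
            (h, r, t)
            for h, r, t in triples
            if degrees[h] >= min_degree and degrees[t] >= min_degree
        ]
        if len(kept) == len(triples):  # fixed point reached
            return kept
        triples = kept


# Toy graph: a triangle a-b-c plus one leaf "d" hanging off "c".
toy = [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "a"), ("c", "r", "d")]
dense = prune_low_degree(toy, min_degree=2)
print(dense)
```

A single pass would not be enough in general, since pruning one leaf can turn its neighbour into a new leaf.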

When we split into train/val/test sets, we should make sure that every entity in val/test also appears in the train set.
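
A possible way to enforce that constraint is to draw held-out candidates at random and move any candidate with an unseen entity back into train. The sketch below shows a two-way train/test split only; `split_with_entity_coverage`, `test_frac`, and `seed` are hypothetical names, and the same function could be applied a second time to the train portion to carve out a validation set.

```python
import random


def split_with_entity_coverage(triples, test_frac=0.1, seed=0):
    """Split triples so that every held-out entity also occurs in train."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    candidates, train = shuffled[:n_test], shuffled[n_test:]
    train_entities = {e for h, _, t in train for e in (h, t)}
    test = []
    for h, r, t in candidates:
        if h in train_entities and t in train_entities:
            test.append((h, r, t))
        else:
            # Unseen entity: keep the triple in train instead.
            train.append((h, r, t))
            train_entities.update((h, t))
    return train, test


# Toy triples over ten made-up entities e0..e9.
toy = [(f"e{i}", "r1", f"e{(i + 1) % 10}") for i in range(10)]
toy += [(f"e{i}", "r2", f"e{(i + 3) % 10}") for i in range(10)]
train, test = split_with_entity_coverage(toy, test_frac=0.2, seed=0)
print(len(train), len(test))
```

Because train only ever grows during the loop, a triple accepted into the held-out set earlier stays valid; the held-out set may end up smaller than `test_frac` suggests.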