# Transform the dataset on Harry Potter Fanfiction into LOD

We take the large [Harry Potter fanfiction data](https://www.kaggle.com/datasets/nehatiwari03/harry-potter-fanfiction-data) set, which can be downloaded from [Kaggle](https://www.kaggle.com/), and transform it into RDF with Python. The idea is then to combine it with the RDF version of the characters from the [Harry Potter Dataset](https://www.kaggle.com/datasets/gulsahdemiryurek/harry-potter-dataset) that we prepared in another class, starting from an original that is also available on [Kaggle](https://www.kaggle.com/).

## Preparation

To follow along, you need:
* `pandas`
* `rdflib`
* `tqdm` (for the progress bar in the conversion loop)
* the dataset in a file called `hpcleanvlarge1.csv.zip`, which can be downloaded directly from Kaggle (I keep it in my `~/Downloads` folder and read it from there; place it wherever you want and change the path accordingly)

In [1]:
import pandas as pd
import zipfile
import os

In [2]:
zip_path = os.path.expanduser('~/Downloads/hpcleanvlarge1.csv.zip')
csv_filename = 'hpcleanvlarge1.csv'

with zipfile.ZipFile(zip_path, 'r') as zip_file:
    with zip_file.open(csv_filename) as csv_file:        
        df = pd.read_csv(csv_file)
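
As a side note, pandas can usually read a zipped CSV directly, without opening the archive by hand. The sketch below builds a tiny stand-in archive (the file names and contents are hypothetical, for illustration only) and reads it back. It assumes the archive contains exactly one file, which is what the compression inference in `read_csv` expects.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny single-member zip archive to stand in for the Kaggle download
# (hypothetical file names and contents, for illustration only).
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, 'demo.csv.zip')
with zipfile.ZipFile(demo_zip, 'w') as zf:
    zf.writestr('demo.csv', 'title,language\nBloody Ballgowns,English\n')

# pandas infers the compression from the ".zip" extension; this only works
# when the archive contains exactly one file.
demo_df = pd.read_csv(demo_zip)
print(demo_df)
```

The explicit `zipfile` approach used in the cell above remains the safer choice when an archive holds several files.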


In [4]:
df.head()

Unnamed: 0,Chapters,Favs,Follows,Published,Reviews,Updated,Words,author,characters,genre,language,rating,story_link,synopsis,title,published_mmyy,pairing
0,1,2.0,,12/31/2019,1.0,,6840,reviews,"Sirius B., Remus L., James P., Regulus B.",Angst/Hurt/Comfort,English,T,https://www.fanfiction.net/s/13466909/1/If-You...,Regulus and James aren't happy. They know they...,If You Change Your Mind,12-2019,
1,1,1.0,,12/31/2019,,,10962,JoyI9199,"Harry P., Draco M., Narcissa M., Charlie W.",Angst/Drama,English,M,https://www.fanfiction.net/s/13466894/1/Bloody...,When a plot from the Founder's age is revealed...,Bloody Ballgowns,12-2019,
2,1,3.0,2.0,12/31/2019,,,8592,MoonytheMarauder1,"[James P., Regulus B.]",Angst,English,M,https://www.fanfiction.net/s/13466885/1/Nothin...,"Regulus Black is supposed to be dead, but he's...",Nothing Left To Do,12-2019,"James P., Regulus B."
3,2,,,12/31/2019,,,7260,LaviniaKatt,Cedric D.,Romance/Fantasy,English,M,https://www.fanfiction.net/s/13466880/1/Patien...,This is a spin off of Harry Potter taking plac...,Patience is a Virtue,12-2019,
4,1,4.0,3.0,12/31/2019,,,1529,Rowena-Moon-Moon,,,English,T,https://www.fanfiction.net/s/13466807/1/An-Und...,Harry makes a new discovery and perhaps a few ...,An Understanding,12-2019,


In [5]:
len(df)

648493

In [6]:
itdf = df[df.language == 'Italian']
len(itdf)

630

In [7]:
itdf.to_csv('/Users/francesco.mambrini/Desktop/fanfic_it.csv')

## Model the RDF graph version

We use a very simple mapping strategy, based on a set of custom, invented URIs (in the http://example.org domain). We are not interested in all the columns, and we will keep most of the values as plain literals. Any further linking is left for a later stage.

Let's inspect the list of the columns, to see what properties we are interested in.

In [5]:
df.columns

Index(['Chapters', 'Favs', 'Follows', 'Published', 'Reviews', 'Updated',
       'Words', 'author', 'characters', 'genre', 'language', 'rating',
       'story_link', 'synopsis', 'title', 'published_mmyy', 'pairing'],
      dtype='object')

In [6]:
df[df.story_link == 'https://www.fanfiction.net/s/4036064/1/Hermione-s-Parents']

Unnamed: 0,Chapters,Favs,Follows,Published,Reviews,Updated,Words,author,characters,genre,language,rating,story_link,synopsis,title,published_mmyy,pairing
484379,6,22,44,1/26/2008,44.0,6/4/2008,3418,LuvtoWrite,"Hermione G., Ron W.",,English,K+,https://www.fanfiction.net/s/4036064/1/Hermion...,"Ron...those...they're my parents!""",Hermione's Parents,1-2008,"Hermione G., Ron W."
484380,6,22,44,1/26/2008,44.0,6/4/2008,3418,LuvtoWrite,"Hermione G., Ron W.",,English,K+,https://www.fanfiction.net/s/4036064/1/Hermion...,"Ron...those...they're my parents!""",Hermione's Parents,1-2008,"Hermione G., Ron W."
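
The query above returns the same story twice, so the CSV evidently contains duplicate rows. One possible way to count and drop them before the conversion is sketched below on a toy frame; the second URL is a hypothetical placeholder, and keeping the first occurrence is an assumption about what is wanted.

```python
import pandas as pd

# Toy frame mimicking the situation above: the same story_link appears twice.
# 'https://example.org/another-story' is a made-up placeholder.
demo = pd.DataFrame({
    'story_link': [
        'https://www.fanfiction.net/s/4036064/1/Hermione-s-Parents',
        'https://www.fanfiction.net/s/4036064/1/Hermione-s-Parents',
        'https://example.org/another-story',
    ],
    'title': ["Hermione's Parents", "Hermione's Parents", 'Another Story'],
})

# Count rows whose story_link was already seen earlier in the frame
n_dupes = int(demo.duplicated(subset='story_link').sum())

# Keep only the first occurrence of each story
deduped = demo.drop_duplicates(subset='story_link', keep='first')

print(n_dupes, len(deduped))
```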


In [10]:
from rdflib import Graph, Literal, XSD, URIRef, Namespace
from rdflib.namespace import DCTERMS, RDF, RDFS


In [24]:
a = RDF.type
onto = Namespace('http://example.org/HP/ontology/')

In [10]:
from tqdm import tqdm
g = Graph()


chars = {}
couples = {}

story = onto.FanFiction
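for row in tqdm(df.sample(10000).itertuples()):
    sbj = URIRef(row.story_link)
    # The CSV contains duplicate rows: skip stories that are already in the graph
    if sbj in g.subjects():
        continue
    # Keep the raw character and pairing strings for later processing
    chars[row.story_link] = row.characters
    couples[row.story_link] = row.pairing

    g.add((sbj, a, story))
    g.add((sbj, onto.wordCount, Literal(row.Words, datatype=XSD.integer)))
    # Dates are M/D/YYYY strings: convert them to date objects, which rdflib
    # types as xsd:date, and skip missing or unparseable values
    for prop, value in ((DCTERMS.issued, row.Published), (DCTERMS.modified, row.Updated)):
        parsed = pd.to_datetime(value, errors='coerce')
        if pd.notna(parsed):
            g.add((sbj, prop, Literal(parsed.date())))
    # Text columns: skip NaN cells so they do not end up as "nan" literals
    for prop, value in ((DCTERMS.abstract, row.synopsis), (onto.genre, row.genre),
                        (onto.language, row.language), (onto.rating, row.rating),
                        (DCTERMS.title, row.title)):
        if pd.notna(value):
            g.add((sbj, prop, Literal(value)))
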


10000it [09:03, 18.39it/s]


In [11]:
g.bind('hp', onto)
g.bind('dct', DCTERMS)
g.bind('rdf', RDF)

In [12]:
g.serialize(destination='/Users/francesco.mambrini/Desktop/hpfanfic_selection.ttl', format='turtle')

<Graph identifier=Nc81f1fc994464e7abea1325017cde39a (<class 'rdflib.graph.Graph'>)>

In [25]:
import numpy as np
chars['https://www.fanfiction.net/s/10001163/1/Of-Purebloods-Mudbloods-and-Loyalty'] is np.nan

False

In [23]:
couples['https://www.fanfiction.net/s/10001163/1/Of-Purebloods-Mudbloods-and-Loyalty'] is np.nan

True
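
The `is np.nan` test works here because the missing values in these columns happen to be that very object, but identity checks on NaN are fragile: a NaN created elsewhere is a different object. A small sketch of the difference, using `pd.isna` as the more robust test:

```python
import numpy as np
import pandas as pd

other_nan = float('nan')  # a NaN that is not the np.nan object

# Identity compares objects, not values, so this is False
print(other_nan is np.nan)

# pd.isna tests for missingness regardless of which NaN object it is
print(pd.isna(other_nan))
print(pd.isna(np.nan))

# Regular strings are reported as not missing
print(pd.isna('Draco M., Hermione G., Voldemort, Harry P.'))
```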

In [30]:
chars['https://www.fanfiction.net/s/10001163/1/Of-Purebloods-Mudbloods-and-Loyalty']

'Draco M., Hermione G., Voldemort, Harry P.'

In [29]:
with open('/Users/francesco.mambrini/Desktop/couples.tsv', 'w') as out:
    out.write('Story\tCouple\n')
    for st, coup in couples.items():
        # Skip stories without pairing information
        if pd.notna(coup):
            out.write(f'{st}\t{coup}\n')

with open('/Users/francesco.mambrini/Desktop/characters.tsv', 'w') as out:
    out.write('Story\tCharacter\n')
    for st, ch in chars.items():
        # Skip stories without character information
        if pd.notna(ch):
            out.write(f'{st}\t{ch}\n')
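
The two TSV files keep the raw cell values, where several names share one comma-separated string (sometimes wrapped in square brackets, as in `[James P., Regulus B.]`). To get one value per character, for instance as objects of the `hasCharacter` property used later on, the strings have to be split first. The helper below is a hypothetical sketch, not the procedure actually used to build the later graph, and it assumes that a comma never occurs inside a single name.

```python
def split_names(cell):
    """Split a comma-separated cell into a list of names.

    Assumes a comma never occurs inside a single name; missing
    values (NaN) yield an empty list.
    """
    if not isinstance(cell, str):
        return []  # NaN or other non-string value
    # Drop the square brackets that wrap some groups of names
    cleaned = cell.replace('[', '').replace(']', '')
    return [name.strip() for name in cleaned.split(',') if name.strip()]


print(split_names('Draco M., Hermione G., Voldemort, Harry P.'))
print(split_names('[James P., Regulus B.]'))
print(split_names(float('nan')))
```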

In [17]:
g = Graph()

In [18]:
g.parse('/Users/francesco.mambrini/Downloads/Italian-Fanfiction.ttl')

<Graph identifier=Nd60a707e18304a91b18cbe52229941f3 (<class 'rdflib.graph.Graph'>)>

In [19]:
sbs = [s for s in g.subjects()]
preds = [p for p in g.predicates()]

In [21]:
g.bind('hp', 'http://example.org/HP/ontology/')
g.bind('dct', DCTERMS)
g.bind('rdf', RDF)
g.bind('rdfs', RDFS)

In [25]:
hp = Namespace('http://example.org/HP/ontology/')
a = RDF.type

In [23]:
sbs[0].startswith('https://www.fanfiction')

False

In [26]:
for s in g.subjects():
    if s.startswith('https://www.fanfiction'):
        g.add((s, a, hp.FanfictionNarrative))

In [27]:
chrs = [c for c in g.objects(predicate=hp.hasCharacter)]

In [30]:
for c in chrs:
    g.add((c, a, hp.Character))

In [43]:
for story in itdf.itertuples():
    s = URIRef(story.story_link)
    if s in sbs:
        g.add((s, hp.rating, Literal(story.rating)))

In [44]:
g.serialize(destination='/Users/francesco.mambrini/Desktop/italian_fanfiction.ttl', format='turtle')

<Graph identifier=Nd60a707e18304a91b18cbe52229941f3 (<class 'rdflib.graph.Graph'>)>