# Demonstration Notebook for SCHEMA Dissertation
This notebook demonstrates the matching techniques developed and compared for mapping JSON schemas, with the aim of improving data interoperability.

The notebook is organised as follows:
1. Overview of Schemas Analysed
2. Generalised Schema Approach
3. Matching Techniques
3.1. String Matching
3.2. Semantic Matching
3.3. Pattern Matching
4. Evaluation of Matching Techniques
5. Mapping
6. Challenges
7. Summary

## 1. Overview of Schemas Analysed
Facebook and Twitter archive data are the primary data sources for this research. After downloading and analysing the datasets, classes were developed manually to feed the data into the pipeline.

New classes will need to be developed as more data sources are added to the project.

The cells below cover two steps:
1. Reading in the data from the archive.
2. Transforming the JSON into the class-based schemas.
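
The imported classes themselves are not shown in this notebook. As a minimal sketch of the two steps, assuming the classes are Python dataclasses, the example below defines a hypothetical `PostSketch` whose field names are taken from the printed `Post` output further down; the real `facebook_post.Post` may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class PostSketch:
    """Hypothetical stand-in for facebook_post.Post (field names assumed from the printed output)."""
    title: Optional[str] = None
    timestamp: Optional[int] = None
    data: List[Any] = field(default_factory=list)
    attachments: List[Any] = field(default_factory=list)
    tags: List[Any] = field(default_factory=list)


# One archive record, shaped like the first post printed in this notebook.
raw = '[{"timestamp": 1603805068, "data": [{"update_timestamp": 1603805068}], "attachments": [], "tags": []}]'

# Step 1: read the JSON. Step 2: unpack each record into the class.
posts = [PostSketch(**record) for record in json.loads(raw)]
print(posts[0])
```

Unpacking with `**record` raises a `TypeError` if a record contains a key the class does not declare, which is one reason new classes are needed for each new data source.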

###### Facebook

In [1]:
import json
from facebook_post import Post, Data, Place, Coordinates, Attachments, ExternalContext

In [2]:
from facebook_profile import Profile, Friends, Likes, Groups, Name, Location, RelationshipStatus, Date, Education, Work


empty_facebook_profile = Profile()
empty_facebook_profile

Profile(profile_id=None, name=Name(full_name=None, first_name=None, middle_name=None, last_name=None), date_of_birth=Date(year=None, month=None, day=None), current_city=Location(current_city=None), relationship_status=RelationshipStatus(anniversary=Date(year=None, month=None, day=None), status=None, partner=None), education_experiences=[], work_experiences=[], gender=None, phone_number=None, registration_timestamp=None, intro_bio=None, website=None, friends=Friends(friends=[]), likes=Likes(activities=[], music=[], movies=[], television=[], other=[], favourite_athletes=[], games=[], clothing=[]), emails=[])

In [3]:
fpath = '/Users/clairefarrell/College/TCD/DISS/facebook-Clairefarrell13/posts/your_posts_1.json'
with open(fpath, 'r') as f:
    json_str = json.loads(f.read())

postObjectList = []
for obj in json_str:
    postObjectList.append(Post(**obj))

In [4]:
postObjectList[0]

Post(title=None, timestamp=1603805068, data=[{'update_timestamp': 1603805068}], attachments=[], tags=[])

In [5]:
ffpath = '/Users/clairefarrell/College/TCD/DISS/facebook-Clairefarrell13/friends/friends.json'
with open(ffpath, 'r') as f:
    json_str = json.loads(f.read())

friendsObjectList = []
for obj in json_str['friends']:
    friendsObjectList.append(obj['name'])
friends = Friends(friendsObjectList)

In [6]:
ffpath = '/Users/clairefarrell/College/TCD/DISS/facebook-Clairefarrell13/profile_information/profile_information.json'
with open(ffpath, 'r') as f:
    json_str = json.loads(f.read())

# Collect the pages for every "likes" category in a single pass over the archive.
categories = ["Activities", "Music", "Movies", "Television", "Other",
              "Favorite Athletes", "Games", "Clothing"]
pages_by_category = {category: [] for category in categories}
for obj in json_str['profile']['pages']:
    for v in obj.values():
        # Only string values can be category names; a missing category stays an empty list.
        if isinstance(v, str) and v in pages_by_category:
            pages_by_category[v] = obj['pages']

activitiesList = pages_by_category["Activities"]
musicList = pages_by_category["Music"]
moviesList = pages_by_category["Movies"]
televisionList = pages_by_category["Television"]
otherList = pages_by_category["Other"]
favoriteAthletesList = pages_by_category["Favorite Athletes"]
gamesList = pages_by_category["Games"]
clothesList = pages_by_category["Clothing"]

likes = Likes(activitiesList, musicList, moviesList, televisionList, otherList, favoriteAthletesList, gamesList, clothesList)
likes

Likes(activities=['Christmas'], music=['Z00L', 'Today FM', 'RTÃ\x89 Gold', 'AminÃ©', 'RTÃ\x89 2fm', 'Daragh McSloy', 'Melanie Martinez', 'Ruby Rose', 'DIE ANTWOORD', 'Us The Duo', 'Superhumanoids', 'Hozier', 'HAIM', 'Jack Ryan', 'Slane Concert', 'Delorentos', 'Ivy Joe', 'Bruno Mars', 'The Embers', 'The Wanted', 'David Guetta', 'Ed Sheeran', 'Drake', 'Olly Murs', 'Far East Movement', 'Ed Sheeran', 'Conor Maynard', 'Rihanna', 'Fret 13', 'LMFAO', 'Pitbull', 'Cutting Edge', 'JA The DragAn', 'Dj Phonetique', 'Oh No Not Stereo', 'Lady Gaga', 'Eminem', 'Katy Perry', 'Justin Tyler', 'Justin Bieber', 'Justin bieber to Ireland', 'Michael Jackson', 'Fox Avenue', 'The Script'], movies=['Mattress Men - The Movie', 'Taken 3 Ireland', 'Mean Girls Memes', 'Dory', 'The Hunger Games', 'The Inbetweeners Movie', 'Mean Girls'], television=['RTÃ\x892', 'Love/Hate Series 5', 'Game Of Thrones UK', 'Game of Thrones Memes', "Mrs Brown's Boys", 'Two Tube', "Grey's Anatomy", 'Mrs Browns Boys Memes', 'The Fresh Pr
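
Strings such as `'RTÃ\x89 Gold'` and `'AminÃ©'` in the output above look like UTF-8 bytes that were decoded as Latin-1. A minimal sketch of a repair step, assuming that is the cause, is given below; `fix_mojibake` is a hypothetical helper and not part of the project code.

```python
def fix_mojibake(text: str) -> str:
    """Re-interpret Latin-1 code points as the UTF-8 bytes they came from."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that is already correct, or cannot be recovered, is returned unchanged.
        return text


# Sample values shaped like the music list printed above.
music = ["RT\u00c3\u0089 Gold", "Amin\u00c3\u00a9", "Hozier"]
fixed_music = [fix_mojibake(m) for m in music]
print(fixed_music)
```

Applying the helper to each string before building the `Likes` object would keep the repaired names in the downstream schema.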

In [7]:
groupsList = []
for obj in json_str['profile']['groups']:
    groupsList.append(obj['name'])
groups = Groups(groupsList)
groups

Groups(groups=['Wild Camping Ireland', 'The Thought Leadership Accelerator Community', 'Dogspotting'])

In [8]:
full_name = json_str['profile']['name']['full_name']
first_name = json_str['profile']['name']['first_name']
middle_name = json_str['profile']['name']['middle_name']
last_name = json_str['profile']['name']['last_name']
name = Name(full_name, first_name, middle_name, last_name)
name

Name(full_name='Claire Farrell', first_name='Claire', middle_name='', last_name='Farrell')

In [9]:
current_city = json_str['profile']['current_city']['name']
current_city = Location(current_city)
current_city

Location(current_city='Dublin, Ireland')

In [10]:
relationship = json_str['profile']['relationship']
anniversary = Date(year = relationship['anniversary']['year'], month=relationship['anniversary']['month'], day=relationship['anniversary']['day'])
status = relationship['status']
partner = relationship['partner']
relationship_status = RelationshipStatus(anniversary, status, partner)
relationship_status

RelationshipStatus(anniversary=Date(year=2016, month=12, day=16), status='In a relationship', partner='A D')

In [11]:
education = json_str['profile']['education_experiences']
educationList = []
for e_obj in education:
    e_name = e_obj['name']
    graduated = e_obj['graduated']
    school_type = e_obj['school_type']
    start_timestamp = e_obj['start_timestamp']
    timestamp = e_obj['timestamp']
    concentrations = e_obj['concentrations']
    educationList.append(Education(e_name, graduated, school_type, start_timestamp, timestamp, concentrations))
    
educationList

[Education(name='Technological University Dublin', graduated=False, school_type='University', start_timestamp=1472713200, timestamp=1471872580, concentrations=['Business Analytics']),
 Education(name='Scoil Iosa', graduated=True, school_type='High School', start_timestamp=1316812400, timestamp=1316812400, concentrations=[])]

In [12]:
work = json_str['profile']['work_experiences']
workList = []
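for w_obj in work:
    # Work-specific variable names avoid overwriting the profile's `name` object created earlier.
    w_name = w_obj['name']
    title = w_obj['title']  # assumes the archive stores the job title under 'title'
    start_timestamp = w_obj['start_timestamp']
    end_timestamp = w_obj['end_timestamp']
    workList.append(Work(w_name, title, start_timestamp, end_timestamp))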
    
workList

[]

In [13]:
dob = json_str['profile']['birthday']
date_of_birth = Date(year = dob['year'], month=dob['month'], day=dob['day'])
gender = json_str['profile']['gender']['gender_option']
phone_number = json_str['profile']['phone_numbers'][0]['phone_number']
emails= json_str['profile']['emails']['emails']
    
# Keyword arguments are used because Profile's first field is profile_id;
# positional arguments would shift every value into the preceding field.
profile = Profile(name=name, date_of_birth=date_of_birth, current_city=current_city,
                  relationship_status=relationship_status,
                  education_experiences=educationList, work_experiences=workList,
                  gender=gender, phone_number=phone_number,
                  friends=friends, likes=likes, emails=emails)
profile

Profile(profile_id=None, name=Name(full_name='Claire Farrell', first_name='Claire', middle_name='', last_name='Farrell'), date_of_birth=Date(year=1998, month=6, day=13), current_city=Location(current_city='Dublin, Ireland'), relationship_status=RelationshipStatus(anniversary=Date(year=2016, month=12, day=16), status='In a relationship', partner='A D'), education_experiences=[Education(name='Technological University Dublin', graduated=False, school_type='University', start_timestamp=1472713200, timestamp=1471872580, concentrations=['Business Analytics']), Education(name='Scoil Iosa', graduated=True, school_type='High School', start_timestamp=1316812400, timestamp=1316812400, concentrations=[])], work_experiences=[], gender='FEMALE', phone_number='+35387111111', registration_timestamp=None, intro_bio=None, website=None, friends=Friends(friends=['Jane Smith', 'John Smith', 'Joe Blogs']), likes=Likes(activities=['Christmas'], music=['Z00L', 'Today FM', 'RTÃ\x89 Gold', 'AminÃ©', 'RTÃ\x89 2fm', 'Daragh McSloy', 'Melanie Martinez', 'Ruby Rose', 'D

In [14]:
empty_facebook_post = Post()
empty_facebook_post

Post(title=None, timestamp=None, data=Data(post=None, title=None, update_timestamp=None, timestamp=None, place=Place(name=None, address=None, url=None, coordinates=Coordinates(latitude=None, longitude=None)), external_context=ExternalContext(url=None), media=Media(uri=None, title=None, description=None, creation_timestamp=None, media_metadata=MediaMetadata(photo_metadata=None))), attachments=Attachments(data=[]), tags=[])

###### Twitter

In [15]:
from tweet import Tweet, Entities, UserMentions, Urls, Main

empty_tweet = Tweet()
empty_main = Main(empty_tweet)
empty_urls = Urls()
empty_user_mentions = UserMentions()
empty_entities = Entities()


# empty_main, empty_urls, empty_user_mentions, empty_entities, empty_tweet
tpath = '/Users/clairefarrell/College/TCD/DISS/Twitter/tweet.js'
with open(tpath, 'r') as f:
    tjson_str = json.loads(f.read())
tweetObjectList = []
for obj in tjson_str:
    tweetObjectList.append(Main(**obj))
tjson_str[:5]

[{'tweet': {'retweeted': False,
   'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
   'entities': {'hashtags': [],
    'symbols': [],
    'user_mentions': [{'name': 'James',
      'screen_name': 'Pojken2014',
      'indices': ['0', '11'],
      'id_str': '2786152654',
      'id': '2786152654'},
     {'name': 'Data Protection Commission Ireland',
      'screen_name': 'DPCIreland',
      'indices': ['12', '23'],
      'id_str': '786180288900542464',
      'id': '786180288900542464'}],
    'urls': [{'url': 'https://t.co/RkiqDmOAR5',
      'expanded_url': 'https://forms.dataprotection.ie/contact',
      'display_url': 'forms.dataprotection.ie/contact',
      'indices': ['71', '94']}]},
   'display_text_range': ['0', '94'],
   'favorite_count': '0',
   'in_reply_to_status_id_str': '1301136726837125121',
   'id_str': '1301139972175654914',
   'in_reply_to_user_id': '2786152654',
   'truncated': False,
   'retweet_count': '0',
   'id': '1301139972175654914

In [16]:
tweetObjectList[0]

Main(tweet={'retweeted': False, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'name': 'James', 'screen_name': 'Pojken2014', 'indices': ['0', '11'], 'id_str': '2786152654', 'id': '2786152654'}, {'name': 'Data Protection Commission Ireland', 'screen_name': 'DPCIreland', 'indices': ['12', '23'], 'id_str': '786180288900542464', 'id': '786180288900542464'}], 'urls': [{'url': 'https://t.co/RkiqDmOAR5', 'expanded_url': 'https://forms.dataprotection.ie/contact', 'display_url': 'forms.dataprotection.ie/contact', 'indices': ['71', '94']}]}, 'display_text_range': ['0', '94'], 'favorite_count': '0', 'in_reply_to_status_id_str': '1301136726837125121', 'id_str': '1301139972175654914', 'in_reply_to_user_id': '2786152654', 'truncated': False, 'retweet_count': '0', 'id': '1301139972175654914', 'in_reply_to_status_id': '1301136726837125121', 'possibly_sensitive': False, 'created_at': 'Wed Sep 02 12:48:3

In [17]:
from twitter_profile import Profile, Interests, Followers, Following

In [18]:
tpath = '/Users/clairefarrell/College/TCD/DISS/Twitter/follower.js'
with open(tpath, 'r') as f:
    tjson_str = f.read()
tjson_str1 = tjson_str.replace("window.YTD.follower.part0 = ", "")
json1 = json.loads(tjson_str1)

In [19]:
tpath1 = '/Users/clairefarrell/College/TCD/DISS/Twitter/following.js'
with open(tpath1, 'r') as f:
    tjson_str2 = f.read()
tjson_str2 = tjson_str2.replace("window.YTD.following.part0 = ", "")
json2 = json.loads(tjson_str2)

In [20]:
followersList = []
followingList = []
for obj in json1:
    followersList.append(obj['follower']['accountId'])
for obj in json2:
    followingList.append(obj['following']['accountId'])
followers = Followers(followersList)
following = Following(followingList)
following
# print(followers)
# print(followingList)

Following(following=['1912395307', '1243239258007183360', '902732309442514946', '105197819', '848945026822414336', '446461168', '59413748', '897060827316072448', '596377986', '177993199', '741210871154806785', '1151627408971390976', '349135414', '1269670536', '44340473', '284664146', '2497300458', '3415806478', '1076578032', '1138988831984771072', '1263416381644705792', '1214188137859039232', '972119197', '718045611795345408', '19231330', '86336954', '1126466698050199552', '88509498', '3749206157', '1195013928209788928', '1138429952251125760', '20309837', '3834530009', '857176506648416256', '1245118064057819142', '2617742840', '1179045316617748480', '216644075', '2731204826', '52753292', '114441145', '8919762', '1021665947243958274', '15028284', '10834752', '1366981790', '16191556', '3094974153', '180493678', '1110916901310480390', '16328393', '2646334171', '372372472', '2279552412', '2647394298', '2219841242', '17241749', '2255375424', '17405912', '22650465', '30967853', '541136693', 

In [21]:
tpath2 = '/Users/clairefarrell/College/TCD/DISS/Twitter/lists-member.js'
with open(tpath2, 'r') as f:
    tjson_str4 = f.read()
tjson_str4 = tjson_str4.replace("window.YTD.lists_member.part0 = ", "")
json4 = json.loads(tjson_str4)

In [22]:
interestsList = []
for obj in json4:
    interestsList.append(obj['userListInfo']['url'].rsplit('/', 1)[-1])
interests = Interests(interestsList)
interests

Interests(interests=['interesting-people1', 'diversity-culture', 'iswc', 'semanticweb', 'cs', 'python-list', 'evernote', 'privacy', 'linked-data'])

In [23]:
tpath3 = '/Users/clairefarrell/College/TCD/DISS/Twitter/profile.js'
with open(tpath3, 'r') as f:
    tjson_str5 = f.read()
tjson_str5 = tjson_str5.replace("window.YTD.profile.part0 = ", "")
json5 = json.loads(tjson_str5)[0]

In [24]:
tpath4 = '/Users/clairefarrell/College/TCD/DISS/Twitter/account.js'
with open(tpath4, 'r') as f:
    tjson_str6 = f.read()
tjson_str6 = tjson_str6.replace("window.YTD.account.part0 = ", "")
json6 = json.loads(tjson_str6)[0]
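
The cells above repeat the same pattern for each Twitter archive file: read the text, strip the `window.YTD.<name>.part0 = ` prefix, then parse the JSON. A minimal sketch of a shared helper is shown below. `load_twitter_js` is a hypothetical name, and the sketch assumes every file starts with such an assignment.

```python
import json
import os
import tempfile


def load_twitter_js(path):
    """Load a Twitter archive .js file by dropping its 'window.YTD.<name>.part0 = ' prefix."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Everything up to and including the first '=' is the JavaScript assignment.
    _, _, payload = text.partition("=")
    return json.loads(payload)


# Demonstration with a temporary file shaped like account.js (placeholder values).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "account.js")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('window.YTD.account.part0 = [ {"account": {"createdVia": "web"}} ]')
    demo = load_twitter_js(demo_path)

print(demo[0]["account"]["createdVia"])
```

Splitting on the first `=` avoids hard-coding one prefix string per file, but it would misparse a file that has no assignment prefix.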

In [25]:
json6

{'account': {'email': 'me@harshp.com',
  'createdVia': 'web',
  'username': 'coolharsh55',
  'accountId': '54597193',
  'createdAt': '2009-07-07T16:29:46.000Z',
  'accountDisplayName': 'Harshvardhan J. Pandit'}}

In [26]:
bio = json5['profile']['description']['bio']
website = json5['profile']['description']['website']
location = json5['profile']['description']['location']
email = json6['account']['email']
created_via = json6['account']['createdVia']
username = json6['account']['username']
account_id = json6['account']['accountId']
created_at = json6['account']['createdAt']
account_display_name = json6['account']['accountDisplayName']

profile = Profile(bio, website, location, email, created_via, username, account_id, created_at, account_display_name)
profile

Profile(bio='Researcher @ ADAPT Centre, Trinity College Dublin. PhD in GDPR 🔄 Privacy 🔄 Sem-Web. Avid reader. Aspiring polymath. Inner voice: @bohketto', website='https://t.co/6aHq0NUWnw', location='Dublin City, Ireland', email='me@harshp.com', created_via='web', username='coolharsh55', account_id='54597193', created_at='2009-07-07T16:29:46.000Z', account_display_name='Harshvardhan J. Pandit', phone_number=None, followers=Followers(followers=[]), following=Following(following=[]), interests=Interests(interests=[]))

## 2. Generalised Schema Approach

The Schema Adapter is initialised below to show the verticals it captures during the data transformation process: content (text, media and location), profile, relationships, interests and timestamp.

In [27]:
from schema_adapter import Data, TextualData, MediaData, LocationData, Profile, Relationships, Interests

empty_schema_adapter = Data()
vars(empty_schema_adapter)

{'content': Content(text=TextualData(text=None, urls=None, hashtags=None, people=None), media=MediaData(url=None, id_str=None, caption=None, types=None), location=LocationData(name=None, address=None, url=None, coordinates=(0, 0))),
 'profile': Profile(name=None, phone_number=None, emails=None, date_of_birth=None, gender=None, biography=None, language=None, location=LocationData(name=None, address=None, url=None, coordinates=(0, 0)), education=[], profession=[], interests=[]),
 'relationships': Relationships(relations=[]),
 'interests': Interests(interests=[]),
 'timestamp': None}

Empty instances of each class are initialised below for use in the PatternMatching stage.

In [28]:
empty_text = TextualData()
empty_media = MediaData()
empty_location = LocationData()
empty_profile = Profile()
empty_relationship = Relationships()
empty_interests = Interests()

## 3. Matching Techniques
### 3.1. String Matching

##### Introduction
String matching maps attributes directly by comparing their variable names.

##### Methodology
1. Create an instance of the source data, e.g. initialise a Twitter post class from the dictionary as demonstrated above.
2. Collect the annotations of these classes for comparison.
3. For every class in the Schema Adapter, create a new instance, using string matching to check whether each attribute exists in the source class instances and assigning the value if it is present.
4. Return the Schema Adapter object for analysis.
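
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the `string_match` module itself: `TextualDataSketch` is a hypothetical stand-in for the Schema Adapter's `TextualData` class, and the source is passed as a plain dictionary.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class TextualDataSketch:
    """Hypothetical stand-in for the Schema Adapter's TextualData class."""
    text: Optional[str] = None
    urls: Optional[List[Any]] = None
    hashtags: Optional[List[Any]] = None
    people: Optional[List[Any]] = None


def string_match(target_cls, source: Dict[str, Any]):
    """Fill target_cls with source values whose key equals a target field name."""
    matched = {f.name: source[f.name] for f in fields(target_cls) if f.name in source}
    return target_cls(**matched)


# Source attributes shaped like a tweet's 'entities' dictionary.
source = {"hashtags": [], "symbols": [], "user_mentions": [], "urls": ["https://t.co/RkiqDmOAR5"]}
result = string_match(TextualDataSketch, source)
print(result)
```

Only exact name matches are copied, so `user_mentions` is not mapped to `people`; closing that gap is the motivation for semantic matching.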

In [29]:
from string_match import data as str_data, mediaData, locationData, textData

string_twitter = str_data(vars(tweetObjectList[0]))

Data(content=Content(text=TextualData(text=None, urls=None, hashtags=[], people=None), media=MediaData(url=None, id_str=None, caption=None, types=None), location=LocationData(name=None, address=None, url=None, coordinates=(0.0, 0.0))), profile=None, relationships=None, interests=None, timestamp=1602850945)


### 3.2. Semantic Matching

##### Introduction
Semantic matching allows attribute names that mean the same thing to be matched even when they are spelled differently. This research uses thesauri to match the attributes of different classes. The Schema Adapter is the starting point: for each of its concepts, similar concepts are sought in the external data source.

That is, for each concept in the Schema Adapter, which similar concepts can be found in the source data (e.g. the Twitter archive)?

##### Methodology
1. Create an instance of the source data, e.g. initialise a Twitter post class from the dictionary as demonstrated above.
2. Collect the annotations of these classes for comparison.
3. Create a list of synonyms for each attribute in the Schema Adapter classes.
4. For every class in the Schema Adapter, create a new instance, using semantic matching to check whether each attribute exists in the source class instances and assigning the value if it is present.
5. Return the Schema Adapter object for analysis.
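
The steps above can be sketched as follows. The output of the next cell suggests the project draws synonyms from NLTK's WordNet corpus; to keep this sketch runnable without downloads, a small hand-written synonym table is used instead. The table entries, the `full_text` key and the function name are illustrative assumptions.

```python
from typing import Any, Dict, List

# Hand-written synonym table standing in for a thesaurus lookup (illustrative entries only).
SYNONYMS: Dict[str, List[str]] = {
    "text": ["text", "full_text", "content"],
    "people": ["people", "user_mentions", "persons"],
    "urls": ["urls", "links"],
}


def semantic_match(target_fields: List[str], source: Dict[str, Any]) -> Dict[str, Any]:
    """For each target field, take the first source key that appears in its synonym list."""
    result: Dict[str, Any] = {}
    for name in target_fields:
        candidates = SYNONYMS.get(name, [name])
        result[name] = next((source[c] for c in candidates if c in source), None)
    return result


# Hypothetical source attributes; 'full_text' is an assumed key name.
source = {"full_text": "hello", "user_mentions": ["DPCIreland"], "symbols": []}
semantic_result = semantic_match(["text", "urls", "hashtags", "people"], source)
print(semantic_result)
```

Because each synonym list includes the attribute name itself, this sketch also covers the exact-name case handled by string matching.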

In [30]:
from semantic_match import data as sem_data
# , mediaData, locationData, textData

semantic_twitter = sem_data(vars(tweetObjectList[0]))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/clairefarrell/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 3.3. Pattern Matching
"Given two entities, if their children in their ontologies are similar, then it is likely that those entities are similar as well."

##### Introduction
Pattern matching combines the previous two techniques to output a matching score. Unlike the previous implementations, the matching is performed on the class as a whole rather than on individual attributes in isolation.

##### Methodology
1. Empty instances of each class are initialised for the comparison.
2. Instances are added to a list for the overarching concept/organisation, i.e. all Facebook classes are added to one list.
3. Two organisations are compared by passing the lists as parameters to the PatternMatching function.
4. The PatternMatching function then performs the following steps:
    - The annotations for each class are collected.
    - Iterate through the classes in a nested manner.
    - In each iteration, identify which class has the larger number of attributes and store this number.
    - For each attribute in the classes, create a list of synonyms for the variable name.
    - Compare these lists and increase the semantic count score for every match found.
    - Similarly, compare the attributes on a string match basis, increasing the string match score for every match found.
    - Compute the similarity score for each class in the two lists and store this in a list of dictionaries.
        - e.g. [{class1: {class2: 0.65, class3: 0.3}}]
            - class1 is more similar to class2 than class3 with a similarity score of 0.65 compared to 0.3.
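
The scoring loop described above can be sketched as follows. The exact formula is not given in this notebook, so the normalisation used here (string matches plus semantic matches, divided by twice the attribute count of the larger class) is an assumption chosen to keep the score between 0 and 1. The synonym table and all function names are also illustrative.

```python
from typing import Dict, List, Set

# Illustrative synonym table; a real run would draw these from a thesaurus.
SYNONYMS: Dict[str, Set[str]] = {
    "url": {"url", "uri", "link"},
    "name": {"name", "title"},
}


def synonyms(attr: str) -> Set[str]:
    """Return the synonym set for an attribute, defaulting to the name itself."""
    return SYNONYMS.get(attr, {attr})


def similarity(attrs_a: List[str], attrs_b: List[str]) -> float:
    """Assumed score: (string matches + semantic matches) / (2 * size of the larger class)."""
    larger = max(len(attrs_a), len(attrs_b))
    if larger == 0:
        return 0.0
    string_score = sum(1 for a in attrs_a if a in attrs_b)
    semantic_score = sum(
        1 for a in attrs_a if any(synonyms(a) & synonyms(b) for b in attrs_b)
    )
    return (string_score + semantic_score) / (2 * larger)


def pattern_matching(classes_a: Dict[str, List[str]], classes_b: Dict[str, List[str]]):
    """Compare every class in classes_a with every class in classes_b, ranked by score."""
    results = []
    for name_a, attrs_a in classes_a.items():
        scores = {name_b: similarity(attrs_a, attrs_b) for name_b, attrs_b in classes_b.items()}
        ranked = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
        results.append({name_a: ranked})
    return results


# Classes are represented by their attribute names only.
adapter = {"MediaData": ["url", "id_str", "caption", "types"]}
twitter = {"Main": ["tweet"], "Urls": ["url", "expanded_url", "display_url", "indices"]}
scores = pattern_matching(adapter, twitter)
print(scores)
```

Representing each class by its attribute names keeps the sketch independent of the project's classes; with real instances, the names could be read from the class annotations.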

In [36]:
from pattern_match import PatternMatching

schema_adapter_ls = [empty_text, empty_media, empty_location, empty_profile, empty_relationship, empty_interests]
facebook_post_ls = []
tweet_ls = [empty_main, empty_urls, empty_user_mentions, empty_entities, empty_tweet]

pattern_twitter = PatternMatching(schema_adapter_ls, tweet_ls)

# Print the stored result rather than running the comparison a second time.
print(pattern_twitter)

[{'TextualData': {'Entities': 0.375, 'Urls': 0.125, 'Main': 0.0, 'UserMentions': 0.0, 'Tweet': 0.0}}, {'MediaData': {'Urls': 0.25, 'Entities': 0.125, 'UserMentions': 0.1, 'Tweet': 0.02631578947368421, 'Main': 0.0}}, {'LocationData': {'Urls': 0.25, 'UserMentions': 0.2, 'Entities': 0.125, 'Tweet': 0.02631578947368421, 'Main': 0.0}}, {'Profile': {'UserMentions': 0.09090909090909091, 'Main': 0.0, 'Urls': 0.0, 'Entities': 0.0, 'Tweet': 0.0}}, {'Relationships': {'Main': 0.0, 'Urls': 0.0, 'UserMentions': 0.0, 'Entities': 0.0, 'Tweet': 0.0}}, {'Interests': {'Main': 0.0, 'Urls': 0.0, 'UserMentions': 0.0, 'Entities': 0.0, 'Tweet': 0.0}}]


##### Observations
One particularly interesting observation is that the TextualData object has a similarity score of 0.0 with the Tweet object; its highest score, 0.375, is with Entities.

## 4. Evaluation of Matching Techniques

To assess the effectiveness of the matching techniques, three metrics are used:
1. Completeness
2. Performance
3. Achieves Research Objectives

### 4.1. Completeness
The manual matching explored in the manual_matching_notebook serves as a baseline for what can be successfully mapped between the two schemas. Against this baseline, the completeness of the three matching techniques described above (string matching, semantic matching and pattern matching) is measured.

##### Direct Matching
- From Twitter tweet to Facebook post completeness: 52.63%
- From Twitter profile to Facebook profile completeness: 38.7%
- From Facebook post to Twitter tweet completeness: 19.23%
- From Facebook profile to Twitter profile completeness: 92.31%

##### Indirect Matching
- From Twitter to schema adapter: 78.37% completeness
- From Facebook to schema adapter: 51.72% completeness

In [32]:
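import json
from collections.abc import MutableMapping

def flatten(d, parent_key='', sep='_'):
    """Flatten a nested dictionary into one level, joining keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # collections.MutableMapping was removed in Python 3.10; collections.abc is used instead.
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def to_dict(obj):
    """Convert nested objects to plain dictionaries via their __dict__."""
    return json.loads(json.dumps(obj, default=lambda o: o.__dict__))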

In [35]:
# String Matching

total = 0
count = 0
for k,v in flatten(to_dict(string_twitter)).items():
    total+=1
    if v is None:
        count+=1
print(1-count/total)

0.375


In [34]:
# Semantic Matching

total = 0
count = 0
for k,v in flatten(to_dict(semantic_twitter)).items():
    total+=1
    if v is None:
        count+=1
print(1-count/total)

0.25
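
The two cells above repeat the same calculation: the share of flattened attributes that are not `None`. A minimal sketch of a reusable helper is shown below; the names `flatten_dict` and `completeness` and the example dictionary are illustrative, not part of the project code.

```python
from collections.abc import MutableMapping


def flatten_dict(d, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level with joined keys."""
    items = {}
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, MutableMapping):
            items.update(flatten_dict(v, new_key, sep=sep))
        else:
            items[new_key] = v
    return items


def completeness(d):
    """Fraction of leaf values that are not None (0.0 for an empty mapping)."""
    leaves = flatten_dict(d)
    if not leaves:
        return 0.0
    filled = sum(1 for v in leaves.values() if v is not None)
    return filled / len(leaves)


# Small example: two of the four leaf values are filled.
example = {"content": {"text": None, "hashtags": []}, "profile": None, "timestamp": 1602850945}
example_score = completeness(example)
print(example_score)
```

As in the cells above, an empty list counts as a filled value, so an attribute that matched but carried no data still raises the score.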




In [40]:
# Pattern Matching

1.0
