# Data validation between the data set in BigQuery and GraphDB

The objectives are to:
1. Ensure complete upload of data to BigQuery and GraphDB.
2. Check for data quality issues.
3. Gain insights on the data.
4. Engineer features.

In [92]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [93]:
# Mount Google Drive to save results
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/MSc/2020-21/Research\ Project/Colab/
%ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/MSc/2020-21/Research Project/Colab


In [94]:
# Authenticate the Colab session so it can access the GCP buckets
from google.colab import auth
auth.authenticate_user()

In [95]:
# Set GCP project ID and region to Europe West 2 - London
PROJECT = 'detect-fake-news-313201'
!gcloud config set project $PROJECT
REGION = 'europe-west2'
CLUSTER = '{}-cluster'.format(PROJECT)
!gcloud config set compute/region $REGION
!gcloud config set dataproc/region $REGION

!gcloud config list # show some information

Updated property [core/project].
Updated property [compute/region].
Updated property [dataproc/region].
[component_manager]
disable_update_check = True
[compute]
gce_metadata_read_timeout_sec = 0
region = europe-west2
[core]
account = aaron.altrock@gmail.com
project = detect-fake-news-313201
[dataproc]
region = europe-west2

Your active configuration is: [default]


## Check the number of files in successive GCP cloud storage buckets

In [96]:
# Count the number of cleaned JSON files from the end of stage 1 in the pipeline
!gsutil ls -l gs://fake_news_cleaned_json/*.json | wc -l

59733


In [97]:
# Count the TTL and JSON files parsed into triples at the end of stage 2 in the pipeline
!gsutil ls -l gs://fake_news_ttl_json/*.ttl | wc -l
!gsutil ls -l gs://fake_news_ttl_json/*.json | wc -l

27590
27590


The drop from 59,733 cleaned files to 27,590 Turtle documents suggests that the raw data contains duplicate records for the same news web page. Because the Turtle files are indexed by the hash of the URL, a later record for the same URL overwrites the earlier one, which leads to a smaller number of samples.
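The overwrite effect can be illustrated with a minimal sketch. It assumes the file name is an MD5 digest of the URL (the hash function actually used in the pipeline is not stated here), and the records and the `url_hash` helper are hypothetical.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return a hex digest to use as the file key (MD5 is an assumption)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical raw records: two of the three share the same URL.
raw_records = [
    {"url": "http://example.com/a", "title": "First crawl"},
    {"url": "http://example.com/a", "title": "Second crawl"},
    {"url": "http://example.com/b", "title": "Another page"},
]

# Writing each record to "<hash>.ttl" is modelled as a dict assignment:
# a later record with the same URL replaces the earlier one.
files_by_hash = {}
for record in raw_records:
    files_by_hash[url_hash(record["url"]) + ".ttl"] = record

print(len(raw_records), "records ->", len(files_by_hash), "files")
```

Only the last record written for each URL survives, so the file count equals the number of distinct URLs rather than the number of raw records.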

In [98]:
# Based on https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#bigquery_simple_app_client-python
from google.cloud import bigquery
client = bigquery.Client(PROJECT)


## Profile the data in its original form held in BigQuery

In [99]:
# BigQuery data row count
query_job = client.query(
    """
    SELECT COUNT(*) AS POPULATION_COUNT
    FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
    """
)

res_df = query_job.result().to_dataframe()  # Waits for job to complete.

res_df

Unnamed: 0,POPULATION_COUNT
0,27589


The BigQuery table therefore holds 27,589 rows, one fewer than the 27,590 files of each type in `gs://fake_news_ttl_json`.
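One way to locate the record behind such a gap is to compare the two sets of identifiers. The sketch below assumes each file in the bucket is named after its URL hash; the paths, the hash values, and the `stems` helper are hypothetical stand-ins for a real bucket listing and a real `URL_HASH` query result.

```python
def stems(object_paths):
    """Return the file stems, assumed to be URL hashes, for a list of object paths."""
    return {path.rsplit("/", 1)[-1].rsplit(".", 1)[0] for path in object_paths}


# Hypothetical bucket listing and hypothetical URL_HASH values from the table.
bucket_paths = [
    "gs://fake_news_ttl_json/aaa.ttl",
    "gs://fake_news_ttl_json/bbb.ttl",
    "gs://fake_news_ttl_json/ccc.ttl",
]
table_hashes = {"aaa", "bbb"}

# Set differences show which side each unmatched identifier sits on.
only_in_bucket = stems(bucket_paths) - table_hashes
only_in_table = table_hashes - stems(bucket_paths)

print("In bucket but not in table:", sorted(only_in_bucket))
print("In table but not in bucket:", sorted(only_in_table))
```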

In [100]:
# BigQuery check for URLs that appear more than once
query_job = client.query(
    """
    WITH URL_LIST AS (
      SELECT 
      URL
      , COUNT(*) AS URL_COUNT
      FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
      GROUP BY URL
    )
    SELECT * FROM URL_LIST WHERE URL_COUNT > 1
    """
)

res_df = query_job.result().to_dataframe()  # Waits for job to complete.

res_df

Unnamed: 0,URL,URL_COUNT


The query returns no rows, so no URL appears more than once in the BigQuery table and every article has a unique URL.

In [101]:
# BigQuery data preview
query_job = client.query(
    """
    SELECT *
    FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
    LIMIT 10
    """
)

res_df = query_job.result().to_dataframe()  # Waits for job to complete.

res_df

Unnamed: 0,url,domain,domain_hash,title_hash,title,body_hash,body,label,url_hash
0,http://awm.com/woman-adopts-rescue-dog-starts-...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_761e82d05283bff6f1832c7164626e26,"Woman Adopts Rescue Dog, Starts Noticing Foul ...",body_072ebc1af96077202eb2ad843b6fe4cd,Kelly Benzel was looking for a new best friend...,unreliable,00d76dc2a2b9c02526ab6aba99cea68d
1,http://awm.com/grand-daughter-sings-favorite-s...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_2a30a9b6a5a75250a7317c60c0f871e6,Grand Daughter Sings Favorite Song to Grandma ...,body_3e0c1daeb28fa250067a74240a5af93d,Dementia is one of the scariest and heartbreak...,unreliable,022b655e21ca0d38a1fc33f2a37165a8
2,http://awm.com/first-class-passenger-sees-a-so...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_b89b152a0c80235c4e6174a96e98f166,First Class Passenger Sees A Soldier Walking U...,body_6953f3cbe02f2bdf2083937b1c8f9b7f,Air travel can be a tiring and trying experien...,unreliable,0405cdbffb7b0f0aef94eafaba0863ff
3,http://awm.com/model-who-boasts-she-transforme...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_9868eac5482fdbfebf7b94494bd078ec,Model Who Boasts She Transformed Into An Afric...,body_d9e3185ae65d5b97185f08d85c31abec,"Martina Big, a 28-year-old white German model,...",unreliable,09f269f0d217228587029f2012f875c9
4,http://awm.com/dad-puts-his-son-into-the-game-...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_08db9d62bccf3a51cb7ded0fdf666568,Dad Puts His Son Into The Game Days After Beat...,body_fcb8636125a87f14428856aff72e307f,A cancer diagnosis never comes at the right ti...,unreliable,0c739d0432f31e7805c0d62f6eff0849
5,http://awm.com/little-girls-reaction-to-shooti...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_0ef200f6649ab2907625d4db1f796c87,Little Girl’s Reaction To Shooting Her First D...,body_c8d4436c055c1516d67af592b850a17f,"In the YouTube video below, you’re going to se...",unreliable,154032234b0955080af9117e817a7676
6,http://awm.com/out-of-all-the-holiday-recipes-...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_0165dfcd088b4df972d8b23ee45198e0,"Out Of All The Holiday Recipes I’ve Tried, His...",body_7a5cb6d058bda9afc409df365438cc7f,"Once a year, the grocery stores start stocking...",unreliable,1af9a9f5326071c7bc5a6e16e04a75df
7,http://awm.com/a-starbucks-barista-couldnt-fin...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_55c16d853bfc26cc4697fe2158f2e273,A Starbucks Barista Couldn’t Find A Babysitter...,body_b2277eca06c8bd84aba7a6512dcbb5e6,Starbucks can be a great place to get luxuriou...,unreliable,201402629241e8945bc127ac0b387dc8
8,http://awm.com/the-easily-offended-whine-when-...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_ab7f6ea47a85aeacab878345864501df,The Easily Offended Whine When Kelly Clarkson ...,body_ddf7af702f3aef83dcb93cc5717ad34f,Dolling out corporal punishment is a controver...,unreliable,21825e99b66d5a6b55e9e3f8c5bdfe71
9,http://awm.com/during-heated-love-making-wife-...,awm.com,domain_9725d0802cf38288710d8bff8f64dcba,title_3ed2e9ed1879767b09e706f2a7f5ced5,"During Heated Love Making, Wife Accidentally B...",body_f62f488c90df847302cd010ff91e016b,He always told her that he liked it rough. But...,unreliable,24c1a494a889e14ef274e14d31fca550


In [102]:
# BigQuery count by domain
query_job = client.query(
    """
    SELECT
    DOMAIN_HASH
    , LABEL
    , COUNT(*) AS ARTICLES_COUNT
    FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
    GROUP BY DOMAIN_HASH, LABEL
    ORDER BY ARTICLES_COUNT DESC
    """
)

res_df = query_job.result().to_dataframe()  # Waits for job to complete.

# Tally the domain hash to ensure each domain only has one label.
# The query groups by (DOMAIN_HASH, LABEL), so a domain hash that appears on
# more than one row carries more than one label. duplicated() flags the second
# and later rows of such a domain, matching a row-by-row tally of domains
# already reviewed.
res_df['DOMAIN_HAS_MULTIPLE_LABEL'] = res_df['DOMAIN_HASH'].duplicated()

res_df

Unnamed: 0,DOMAIN_HASH,LABEL,ARTICLES_COUNT,DOMAIN_HAS_MULTIPLE_LABEL
0,domain_8f00b2b61ba335244231d632d390bf8d,fake,7046,False
1,domain_9f5bbbcbad4a48edd86162fabbe90b6e,political,5441,False
2,domain_7e058b16a00bca2c3620d5147881e34d,conspiracy,2819,False
3,domain_98845529bda170844508657ae469197a,bias,1162,False
4,domain_2bb13bea21c458421e33179105662bdd,political,1061,False
5,domain_bd8db3098e8505e2d80c8b89a2eccdb8,clickbait,813,False
6,domain_0d6ca907eef322628f81ae5a38735d44,bias,730,False
7,domain_f573aed8d0dcf08ef6ce3efd5f5c48c0,political,729,False
8,domain_213f2e7f6968dd534eb43ef113fce3e1,political,671,False
9,domain_8fbb80fdc3a337fd46babecb01f203fa,junksci,615,False


In [79]:
# BigQuery count by label
query_job = client.query(
    """
    SELECT
    LABEL
    , COUNT(*) AS LABEL_COUNT
    FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
    GROUP BY LABEL
    ORDER BY LABEL_COUNT DESC
    """
)

res_df = query_job.result().to_dataframe()

print('Total: {}'.format(res_df['LABEL_COUNT'].sum()))

res_df

Total: 27589


Unnamed: 0,LABEL,LABEL_COUNT
0,political,9776
1,fake,7081
2,conspiracy,3512
3,bias,2793
4,clickbait,1238
5,junksci,993
6,,990
7,unreliable,437
8,unknown,354
9,reliable,178


In [106]:
# BigQuery list of distinct URL hashes
query_job = client.query(
    """
    SELECT DISTINCT
    URL_HASH
    FROM `detect-fake-news-313201.fake_news_sql.src_fake_news`
    """
)

res_df = query_job.result().to_dataframe()

print('Total: {}'.format(res_df.shape[0]))

res_df.head()

Total: 27589


Unnamed: 0,URL_HASH
0,00d76dc2a2b9c02526ab6aba99cea68d
1,022b655e21ca0d38a1fc33f2a37165a8
2,0405cdbffb7b0f0aef94eafaba0863ff
3,09f269f0d217228587029f2012f875c9
4,0c739d0432f31e7805c0d62f6eff0849


Note from the label counts that 990 samples have no classification and a further 354 are classified as `unknown`.
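As a rough illustration of scale, the share of rows without a usable label can be worked out from the counts reported in the outputs above. Treating `unknown` as unusable alongside the missing labels is an assumption made for this sketch, not a decision taken in the notebook.

```python
# Figures taken from the notebook outputs.
total_rows = 27589    # rows in src_fake_news
missing_label = 990   # rows with no label
unknown_label = 354   # rows labelled "unknown"

# Assumption: both groups are treated as lacking a usable label.
unusable = missing_label + unknown_label
share = unusable / total_rows

print("Rows without a usable label: {} ({:.2%})".format(unusable, share))
print("Rows remaining: {}".format(total_rows - unusable))
```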



## Profile the data in GraphDB

In [80]:
# Install the wrapper package
# Source: https://github.com/RDFLib/sparqlwrapper
!pip install sparqlwrapper



In [110]:
# Code based on: https://sparqlwrapper.readthedocs.io/en/latest/main.html
from SPARQLWrapper import SPARQLWrapper, JSON

# queryString = """
# PREFIX aa: <http://www.city.ac.uk/ds/inm363/aaron_altrock#>
# PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# select ?a ?b ?c where {
#   ?a ?b ?c .
#   ?a rdf:type aa:urlHash .
#   aa:unreliable rdf:type aa:newsLabel .
#   aa:unreliable rdfs:label "unreliable" .
# } limit 100
# """

queryString = """
PREFIX aa: <http://www.city.ac.uk/ds/inm363/aaron_altrock#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select (count(?url_hash) as ?url_count) where {
  ?url_hash rdf:type aa:urlHash .
}
"""


sparql = SPARQLWrapper("http://35.246.120.165:7200/repositories/src_fake_news")
sparql.setReturnFormat(JSON)
sparql.setQuery(queryString)

try:
    res_dct = sparql.query().convert()
    print('OK')
except Exception as e:
    print('ERROR: {}'.format(e))
   

OK


In [117]:
# Number of URL hashes in GraphDB
res_dct.get('results').get('bindings')[0].get('url_count').get('value')

'25211'

In [83]:
pd.DataFrame.from_dict(res_dct.get('results').get('bindings'), orient='columns')

Unnamed: 0,a,b,c
0,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.w3.org/19...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
1,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.w3.org/20...","{'type': 'literal', 'value': '433cd62efa69ab67..."
2,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
3,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
4,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
5,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
6,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...",{'datatype': 'http://www.w3.org/2001/XMLSchema...
7,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...",{'datatype': 'http://www.w3.org/2001/XMLSchema...
8,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.city.ac.u...",{'datatype': 'http://www.w3.org/2001/XMLSchema...
9,"{'type': 'uri', 'value': 'http://www.city.ac.u...","{'type': 'uri', 'value': 'http://www.w3.org/19...","{'type': 'uri', 'value': 'http://www.city.ac.u..."
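Each cell in the table above is a nested dictionary rather than a plain value, which makes the results hard to read. A minimal sketch of flattening the bindings follows; it assumes the response uses the standard SPARQL JSON results layout (`head.vars` and `results.bindings`), and the `flatten_bindings` helper and the example response are hypothetical.

```python
def flatten_bindings(result):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # An unbound variable is absent from the binding, so default to None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hypothetical response in the SPARQL JSON results layout.
example = {
    "head": {"vars": ["a", "b", "c"]},
    "results": {"bindings": [
        {
            "a": {"type": "uri", "value": "http://example.org/subject"},
            "b": {"type": "uri", "value": "http://example.org/predicate"},
            "c": {"type": "literal", "value": "some text"},
        }
    ]},
}

flat_rows = flatten_bindings(example)
print(flat_rows)
```

The resulting list of dictionaries could then be passed to `pd.DataFrame(flat_rows)` to get one column of plain values per query variable.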
