# 8. pr_to_fg.ipynb


This notebook helps synchronize personal data between two databases, PR and FactGrid. It reads a CSV export from a local file, retrieves the corresponding FactGrid entries with a SPARQL query, and identifies entries whose gsn ID differs between the two databases. It then writes a new CSV file in a format compatible with QuickStatements, a tool used to batch-update FactGrid records.

In [13]:
import requests
import csv
import os
import pandas as pd
import json
import re
import time
from datetime import datetime, timedelta
import math
import traceback

In [49]:
today_string = datetime.now().strftime('%Y-%m-%d')

In [14]:
input_path = r"C:\Users\khan32\Documents\WIAGweb2\notebooks\sync_notebooks\input_files"
output_path = r"C:\Users\khan32\Documents\WIAGweb2\notebooks\sync_notebooks\output_files"

Run the following query on the personendatenbank database and export the result as a CSV file. For detailed instructions on running the SQL command, see [Run_SQL_Query_and_Export_CSV.md](https://github.com/WIAG-ADW-GOE/WIAGweb2/blob/main/notebooks/sync_notebooks/docs/Run_SQL_Query_and_Export_CSV.md).

```sql
SELECT persons.factgrid, persons.id, gsn.nummer, gsn.deleted 
FROM items 
INNER JOIN persons ON persons.item_id = items.id AND items.status = 'online'
INNER JOIN gsn ON gsn.item_id = items.id
```

```sql
SELECT
  *
FROM
  items
INNER JOIN
  persons ON persons.item_id = items.id AND items.status = 'online'
INNER JOIN
  gsn ON gsn.item_id = items.id AND gsn.deleted = 0 AND gsn.id = (
        SELECT MIN(g2.id)
        FROM gsn g2
        WHERE g2.item_id = items.id
        AND g2.deleted = 0
    )
WHERE
  gsn.nummer = '005-02761-001' OR gsn.nummer = '060-01228-001'
```



<!-- Please ignore ```sql
SELECT persons.factgrid, persons.id, gsn.nummer, items.status, items.deleted, persons.deleted, gsn.deleted
FROM items INNER JOIN persons ON persons.item_id = items.id
INNER JOIN gsn ON gsn.item_id = items.id;
``` -->

In [33]:
filename = 'persons_all_2024-12-09.csv'

In [34]:
pr_df = pd.read_csv(os.path.join(input_path, filename), names=["fg_id", "id", "pd_id", 'is_deleted'])
pr_df

Unnamed: 0,fg_id,id,pd_id,is_deleted
0,,281468,002-03990-001,0
1,,281467,002-03989-001,0
2,,281463,002-03985-001,0
3,,281460,002-03982-001,0
4,,281453,002-03975-001,0
...,...,...,...,...
83387,,339114,086-00372-001,0
83388,,339106,082-00879-001,0
83389,,339108,078-01107-001,0
83390,,339110,050-07639-001,0
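
Before joining, it can help to sanity-check the export. The following is a minimal sketch on toy rows, not part of the original workflow: the three-part gsn pattern is an assumption inferred from the sample values shown above, and the variable names `sample`, `linked`, and `well_formed` are hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for the exported CSV (no header row, same column order as the query).
sample = io.StringIO(
    "Q653492,281432,002-03954-001,0\n"
    ",281468,002-03990-001,0\n"
    "Q653473,300460,044-03599-001,1\n"
)

df = pd.read_csv(
    sample,
    names=["fg_id", "id", "pd_id", "is_deleted"],
    dtype={"fg_id": "string", "pd_id": "string"},
)

# Rows that actually carry a FactGrid ID; only these can match in the later merge.
linked = df[df["fg_id"].notna()]

# Assumed gsn format (3 digits, 5 digits, 3 digits), inferred from the sample rows.
well_formed = df["pd_id"].str.fullmatch(r"\d{3}-\d{5}-\d{3}")

print(len(linked), "of", len(df), "rows have a FactGrid ID")
print("all gsn values match the assumed pattern:", bool(well_formed.all()))
```

Rows that fail such a check would be worth inspecting by hand before generating any QuickStatements batch.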


In [35]:
url = 'https://database.factgrid.de/sparql'
query = (
"""SELECT ?item ?prid WHERE {
  ?item wdt:P472 ?prid.
}""")
# SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

r = requests.get(url, params={'query': query}, headers={"Accept": "application/json"})
r.raise_for_status()  # fail early on HTTP errors instead of trying to parse an error page
data = r.json()
factgrid_df = pd.json_normalize(data['results']['bindings'])

len(factgrid_df)

5409
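
The column handling in the next cells follows from the shape of the SPARQL JSON result. As an offline illustration, the sketch below flattens a hand-written response with the same structure; `sample_response` is a hypothetical stand-in built from two rows of the real output, not data fetched from the endpoint.

```python
import pandas as pd

# Hand-written stand-in for the JSON returned by the SPARQL endpoint.
sample_response = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "https://database.factgrid.de/entity/Q349"},
                "prid": {"type": "literal", "value": "034-00045-001"},
            },
            {
                "item": {"type": "uri", "value": "https://database.factgrid.de/entity/Q218"},
                "prid": {"type": "literal", "value": "039-01003-001"},
            },
        ]
    }
}

# json_normalize turns each nested binding into dotted column names.
flat = pd.json_normalize(sample_response["results"]["bindings"])
print(sorted(flat.columns))

# Only the ".value" columns carry data needed for the comparison.
values_only = flat[[c for c in flat.columns if c.endswith(".value")]]
print(values_only)
```

This is why the notebook drops the `type` columns and then renames the two remaining columns.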

In [36]:
# Extract the Q-ID (e.g. "Q349") from the full entity URI.
# str.strip() removes a set of characters rather than a literal prefix,
# so take the part after the last "/" instead.
def extract_qid(df, column):
    df[column] = df[column].map(lambda x: x.rsplit('/', 1)[-1])


# drop irrelevant columns
def drop_type_columns(df):
    df.drop(columns=[column for column in df.columns if column.endswith('type')], inplace=True)
    df.drop(columns=[column for column in df.columns if column.endswith('xml:lang')], inplace=True)
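
When pulling the Q-ID out of an entity URI, note that `str.strip` treats its argument as a set of characters, not as a literal prefix. The sketch below shows the difference and one prefix-safe alternative; the helper name `qid_from_uri` is hypothetical and used only for illustration.

```python
PREFIX = "https://database.factgrid.de/entity/"


def qid_from_uri(uri: str) -> str:
    """Return the last path segment of an entity URI, e.g. 'Q349'."""
    return uri.rsplit("/", 1)[-1]


# strip() removes any leading/trailing characters found in the given set,
# so an unrelated string can be consumed entirely.
stripped = "data".strip("adt")
print(repr(stripped))

# Splitting on the last "/" does not depend on which characters the ID contains.
print(qid_from_uri(PREFIX + "Q349"))
```

Q-IDs consist of an uppercase `Q` followed by digits, none of which occur in the prefix, so stripping happens to work for them; the split-based version avoids relying on that coincidence.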

In [37]:
drop_type_columns(factgrid_df)
extract_qid(factgrid_df, 'item.value')
factgrid_df.columns = ['FactGrid_ID', 'pd_id']
factgrid_df

Unnamed: 0,FactGrid_ID,pd_id
0,Q349,034-00045-001
1,Q218,039-01003-001
2,Q138,054-00308-001
3,Q243,058-01473-001
4,Q304,076-00451-001
...,...,...
5404,Q948490,700-00306-001
5405,Q948520,700-00807-001
5406,Q1059444,023-02009-001
5407,Q1059445,074-04585-001


In [38]:
joined_df = pr_df.merge(factgrid_df, left_on='fg_id', right_on='FactGrid_ID', suffixes=('_pr', '_fg'))
joined_df

Unnamed: 0,fg_id,id,pd_id_pr,is_deleted,FactGrid_ID,pd_id_fg
0,Q653492,281432,002-03954-001,0,Q653492,002-03954-001
1,Q653500,281434,002-03956-001,0,Q653500,002-03956-001
2,Q655778,280634,002-03152-001,0,Q655778,002-03152-001
3,Q655748,280534,002-03052-001,0,Q655748,002-03052-001
4,Q653079,279935,002-02450-001,0,Q653079,002-02450-001
...,...,...,...,...,...,...
3539,Q721387,339024,029-03106-001,0,Q721387,029-03106-001
3540,Q654830,339054,055-01352-001,0,Q654830,055-01352-001
3541,Q654929,339052,055-01438-001,0,Q654929,055-01438-001
3542,Q883349,339066,068-01773-001,0,Q883349,068-01773-001
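
The inner merge keeps only PR rows whose FactGrid ID also appears in the SPARQL result, and a FactGrid item that carries several P472 values yields one joined row per value. The sketch below reproduces this on toy frames; the values are taken from rows shown in this notebook, while the frame names `pr` and `fg` are hypothetical.

```python
import pandas as pd

# PR side: one row per person; the last row has no FactGrid ID.
pr = pd.DataFrame({
    "fg_id": ["Q538645", "Q812", None],
    "pd_id": ["712-00097-001", "082-00730-001", "002-03990-001"],
})

# FactGrid side: Q538645 carries two P472 values.
fg = pd.DataFrame({
    "FactGrid_ID": ["Q538645", "Q538645", "Q812"],
    "pd_id": ["008-00055-001", "043-02320-001", "076-01114-001"],
})

joined = pr.merge(fg, left_on="fg_id", right_on="FactGrid_ID", suffixes=("_pr", "_fg"))

# Items that appear more than once after the merge have multiple P472 values in FactGrid.
multi = joined[joined.duplicated("fg_id", keep=False)]
print(joined)
print(multi["fg_id"].unique().tolist())
```

Such items end up as several rows in the final export, one removal per stale value, so they are worth a manual look before the batch is run.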


In [39]:
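# Keep only rows whose gsn entry is not marked as deleted in PR;
# the actual ID comparison happens in the next step.
unequal_df = joined_df[joined_df['is_deleted'] == 0]
unequal_df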

Unnamed: 0,fg_id,id,pd_id_pr,is_deleted,FactGrid_ID,pd_id_fg
0,Q653492,281432,002-03954-001,0,Q653492,002-03954-001
1,Q653500,281434,002-03956-001,0,Q653500,002-03956-001
2,Q655778,280634,002-03152-001,0,Q655778,002-03152-001
3,Q655748,280534,002-03052-001,0,Q655748,002-03052-001
4,Q653079,279935,002-02450-001,0,Q653079,002-02450-001
...,...,...,...,...,...,...
3539,Q721387,339024,029-03106-001,0,Q721387,029-03106-001
3540,Q654830,339054,055-01352-001,0,Q654830,055-01352-001
3541,Q654929,339052,055-01438-001,0,Q654929,055-01438-001
3542,Q883349,339066,068-01773-001,0,Q883349,068-01773-001


In [40]:
# TODO: investigate Q653473 (id 300460, pd_id_pr 044-03599-001, is_deleted 1, pd_id_fg 044-03599-001).
# The SQL query also looks odd for this entry: it returns 1, 1, 1 for the deleted columns.

In [46]:
diff_df = unequal_df[unequal_df['pd_id_pr'] != unequal_df['pd_id_fg']]
diff_df

Unnamed: 0,fg_id,id,pd_id_pr,is_deleted,FactGrid_ID,pd_id_fg
2529,Q11290,324894,082-00609-001,0,Q11290,021-00687-001
2532,Q812,324913,082-00730-001,0,Q812,076-01114-001
3403,Q538645,335619,712-00097-001,0,Q538645,008-00055-001
3404,Q538645,335619,712-00097-001,0,Q538645,043-02320-001
3410,Q728993,336430,036-01925-001,0,Q728993,012-00602-001


In [48]:
export_csv = diff_df[['fg_id', 'pd_id_pr', 'pd_id_fg']].copy()
export_csv = export_csv.rename(columns={'fg_id': 'qid', 'pd_id_pr': 'P472', 'pd_id_fg': '-P472'})
export_csv

Unnamed: 0,qid,P472,-P472
2529,Q11290,082-00609-001,021-00687-001
2532,Q812,082-00730-001,076-01114-001
3403,Q538645,712-00097-001,008-00055-001
3404,Q538645,712-00097-001,043-02320-001
3410,Q728993,036-01925-001,012-00602-001


This generates a FactGrid CSV file that can be uploaded to [QuickStatements](https://database.factgrid.de/quickstatements/#/batch). For detailed instructions, see [Run_factgrid_csv.md](https://github.com/WIAG-ADW-GOE/WIAGweb2/blob/main/notebooks/sync_notebooks/docs/Run_factgrid_csv.md).

In [50]:
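# Wrap both ID columns in literal double quotes so they are written as quoted strings;
# to_csv escapes these quotes by doubling them in the output file.
export_csv["-P472"] = export_csv["-P472"].apply(lambda x: f'"{x}"')
export_csv["P472"] = export_csv["P472"].apply(lambda x: f'"{x}"')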
export_csv.to_csv(
    os.path.join(
        output_path,
        f'factgrid_pr_id_update_{today_string}.csv'
    ),
    index=False
)
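
To see what the quoting step does to the file contents, the sketch below writes two toy rows (taken from the diff output above) to a string instead of a file. The reading of the columns is an assumption based on QuickStatements' CSV conventions as I understand them: `P472` adds the PR value, and the `-` prefix on `-P472` requests removal of the old FactGrid value.

```python
import pandas as pd

# Toy rows taken from the diff output above.
export_csv = pd.DataFrame({
    "qid": ["Q11290", "Q812"],
    "P472": ["082-00609-001", "082-00730-001"],
    "-P472": ["021-00687-001", "076-01114-001"],
})

# Wrap the string values in literal double quotes, as the notebook does.
for column in ("P472", "-P472"):
    export_csv[column] = export_csv[column].map(lambda value: f'"{value}"')

# With no path given, to_csv returns the CSV text; embedded quotes are doubled
# and the whole field is quoted, which produces triple quotes around each value.
csv_text = export_csv.to_csv(index=False)
print(csv_text)
```

Inspecting the text this way before uploading makes it easy to confirm that each value is surrounded by three double quotes on either side.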