# Mismatch Finder

> Welcome!
> 
> Please download this notebook and load it into your [PAWS instance](https://hub-paws.wmcloud.org/)

This notebook provides an overview of [Mismatch Finder](https://www.wikidata.org/wiki/Wikidata:Mismatch_Finder) and how to prepare data for submission. Please also see the [Mismatch Finder User Guide on GitHub](https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md) for more information.

In [1]:
# pip install jupyter-black

In [2]:
%load_ext jupyter_black

In [3]:
import sys

import numpy as np
import pandas as pd

PATH_TO_UTILS = "../MismatchGeneration/"  # change based on your directory structure
sys.path.append(PATH_TO_UTILS)

from utils import check_mf_formatting

## Overview

The [Wikidata Mismatch Finder](https://www.wikidata.org/wiki/Wikidata:Mismatch_Finder) is a tool developed by Wikimedia Deutschland to surface discrepancies between Wikidata's data and that of external sources. The tool stores mismatches between Wikidata and external databases, then presents them to editors to review and fix. It can also be used to suggest statements that are missing from Wikidata but need human review before they are added.

The main purposes of Mismatch Finder include:
- Helping Wikidata editors spot and fix mistakes in Wikidata
- Allowing organizations that reuse Wikidata’s data to conveniently contribute back by reporting discrepancies between it and their own data

Mismatches can be checked in multiple ways:
- Go to the [Mismatch Finder website](https://mismatch-finder.toolforge.org/) and enter QIDs or check random mismatches
- Turn on the `Mismatch Finder` option in the [Gadgets section of your Wikidata user preferences](https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-gadgets)
  - Mismatches will then appear as notifications at the top of pages in the Wikidata user interface

To submit mismatched data to the Mismatch Store, you need to create an account, get an [access token for the Mismatch Finder](https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md#obtaining-an-api-access-token) and send a CSV with the mismatches [via an open API](https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md#accessing-the-api). To start with, we will submit your generated mismatches for you; if there's interest, we can discuss later how you could submit mismatches yourselves. You can also always open a ticket on the [Mismatch Finder project on Phabricator](https://phabricator.wikimedia.org/project/view/5385/) and include your mismatch file as an attachment.

To make your eventual submissions to Mismatch Finder easier, we've prepared `check_mf_formatting`, which goes through a passed pandas DataFrame, makes sure that all required columns are present and validates their values. This function is found in [utils.py](https://github.com/Wikidata/Purdue-Data-Mine-2024/tree/main/MismatchGeneration/utils.py) of the [MismatchGeneration](https://github.com/Wikidata/Purdue-Data-Mine-2024/tree/main/MismatchGeneration) directory on GitHub. You'll see all its checks later when we systematically break a DataFrame to show all the possible problems :)
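To give a feel for the column check, here is a minimal sketch that uses only the standard library. It is not the code in `utils.py`: the helper name `describe_column_problems` and its return format are hypothetical, and the column list is the one that `check_mf_formatting` reports in its error message.

```python
# The eight column names that check_mf_formatting reports, in the required order.
REQUIRED_COLUMNS = [
    "item_id",
    "statement_guid",
    "property_id",
    "wikidata_value",
    "meta_wikidata_value",
    "external_value",
    "external_url",
    "type",
]


def describe_column_problems(columns):
    """Compare column names with REQUIRED_COLUMNS (hypothetical helper)."""
    columns = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    unexpected = [c for c in columns if c not in REQUIRED_COLUMNS]
    # Only report an ordering problem when the set of columns itself is right.
    wrong_order = not missing and not unexpected and columns != REQUIRED_COLUMNS
    return {"missing": missing, "unexpected": unexpected, "wrong_order": wrong_order}


# An extra column is flagged as unexpected.
print(describe_column_problems(REQUIRED_COLUMNS + ["not_needed_col"]))
```

With a DataFrame, the same helper could be called as `describe_column_problems(df.columns)`.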

## Correctly Formatted Data

Load in the Mismatch Finder example CSV so we can test it to make sure it's valid to submit.

In [4]:
mismatch_finder_example_csv = "https://github.com/wmde/wikidata-mismatch-finder/raw/main/docs/exampleMismatchFile.csv"
df_mismatch_finder_example = pd.read_csv(mismatch_finder_example_csv)
df_mismatch_finder_example.head()

| | item_id | statement_guid | property_id | wikidata_value | meta_wikidata_value | external_value | external_url | type |
|---|---|---|---|---|---|---|---|---|
| 0 | Q184746 | Q184746$7814880A-A6EF-40EC-885E-F46DD58C8DC5 | P569 | 3 April 1934 | | 1934-04-03 | http://fake.source.url/12345 | statement |
| 1 | Q184746 | Q184746$7200D1AD-E4E8-401B-8D57-8C823810F11F | P21 | Q6581072 | | nonbinary | http://fake.source.url/12345 | statement |
| 2 | Q184746 | Q184746$417B1AD5-396D-471E-8F9F-D45619EDBE85 | P101 | Q7155 | | Anthropologist | | qualifier |
| 3 | Q184746 | Q184746$E347266B-AE85-4D91-84D9-442B28F6C33C | P937 | Q170478 | | Congo | | |
| 4 | Q184746 | | P106 | | | primatologist | | statement |


In [5]:
check_mf_formatting(df_mismatch_finder_example)

All checks have passed! The data is ready to be uploaded to Mismatch Finder.
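
Once the checks pass, the DataFrame still has to be written out as a CSV before it can be sent. The following is a minimal sketch, assuming pandas' `DataFrame.to_csv`; passing `index=False` keeps the row index out of the file so that the header holds only the eight expected columns. The one-row DataFrame and the file name in the final comment are illustrative placeholders.

```python
import pandas as pd

MF_COLUMNS = [
    "item_id",
    "statement_guid",
    "property_id",
    "wikidata_value",
    "meta_wikidata_value",
    "external_value",
    "external_url",
    "type",
]

# Illustrative single row mirroring the last row of the example file;
# keys that are left out become null values.
df_ready = pd.DataFrame(
    [
        {
            "item_id": "Q184746",
            "property_id": "P106",
            "external_value": "primatologist",
            "type": "statement",
        }
    ],
    columns=MF_COLUMNS,
)

# index=False leaves out the row index; with no path given, to_csv returns the CSV text.
csv_text = df_ready.to_csv(index=False)
print(csv_text)

# To write a file instead (placeholder file name):
# df_ready.to_csv("mismatches.csv", index=False)
```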


## Incorrectly Formatted Data

Now let's make some changes to the CSV to see what the output will be with improperly formatted data.

In [6]:
df_mismatch_finder_unformatted = df_mismatch_finder_example.copy()

# Add in a column that's not required (a scalar NaN is broadcast to every row).
df_mismatch_finder_unformatted["not_needed_col"] = np.nan

# Remove the Q from the first `item_id` (QID).
df_mismatch_finder_unformatted.loc[0, "item_id"] = df_mismatch_finder_unformatted[
    "item_id"
][0].split("Q")[1]

# Make the fourth `property_id` (PID) null.
df_mismatch_finder_unformatted.loc[3, "property_id"] = np.nan

# Make the fifth value in `wikidata_value` non-null while the `statement_guid` remains null.
df_mismatch_finder_unformatted.loc[4, "wikidata_value"] = "Q18612271"

# Make the third `external_url` an invalid URL.
df_mismatch_finder_unformatted.loc[2, "external_url"] = "broken.source.url/12345"

# Change the second `type` value to something that's not "statement", "qualifier" or null.
df_mismatch_finder_unformatted.loc[1, "type"] = "invalid_type"

# Change the third `external_value` to be a string with more than 1,500 characters.
df_mismatch_finder_unformatted.loc[2, "external_value"] = (
    df_mismatch_finder_unformatted.loc[2, "external_value"] * 150
)

In [7]:
df_mismatch_finder_unformatted.head()

| | item_id | statement_guid | property_id | wikidata_value | meta_wikidata_value | external_value | external_url | type | not_needed_col |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 184746 | Q184746$7814880A-A6EF-40EC-885E-F46DD58C8DC5 | P569 | 3 April 1934 | | 1934-04-03 | http://fake.source.url/12345 | statement | |
| 1 | Q184746 | Q184746$7200D1AD-E4E8-401B-8D57-8C823810F11F | P21 | Q6581072 | | nonbinary | http://fake.source.url/12345 | invalid_type | |
| 2 | Q184746 | Q184746$417B1AD5-396D-471E-8F9F-D45619EDBE85 | P101 | Q7155 | | AnthropologistAnthropologistAnthropologistAnth... | broken.source.url/12345 | qualifier | |
| 3 | Q184746 | Q184746$E347266B-AE85-4D91-84D9-442B28F6C33C | | Q170478 | | Congo | | | |
| 4 | Q184746 | | P106 | Q18612271 | | primatologist | | statement | |


In [8]:
check_mf_formatting(df_mismatch_finder_unformatted)

ValueError: 
There's a problem with the DataFrame. Please see the Mismatch Finder file creation directions on GitHub:

https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md#creating-a-mismatches-import-file

Directions on how to fix the DataFrame are also detailed below:

1. Please check that the following columns are present in this exact order:
    'item_id', 'statement_guid', 'property_id', 'wikidata_value', 'meta_wikidata_value', 'external_value', 'external_url', 'type'

2. Please assure that the following columns have valid ids:
    - item_id
    - property_id

3. Please assure that the following columns do not have null values:
    - property_id

4. Please assure that `statement_guid` is null only in cases where `wikidata_value` is as well.

5. Please check the following URLs in `external_url` to make sure that they're valid:
    - broken.source.url/12345

6. Please check that the `type` column contains only: 'statement', 'qualifier' or a null value.

7. Please assure that the following columns do not have values over 1,500 characters:
    - external_value
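
The value checks in points 2 to 7 above can be approximated row by row with the standard library. This is a hedged sketch and not the implementation in `utils.py`: the helper names are hypothetical, rows are plain dicts with `None` standing in for null, and treating a URL as valid only when it has an `http` or `https` scheme and a host is an assumption about what the real check requires.

```python
import re
from urllib.parse import urlparse

# Hypothetical constants; the limits and allowed values come from the error message above.
QID_PATTERN = re.compile(r"Q\d+")
PID_PATTERN = re.compile(r"P\d+")
ALLOWED_TYPES = {"statement", "qualifier", None}
MAX_VALUE_LENGTH = 1500


def is_valid_url(url):
    """Assumed rule: an http(s) scheme and a host must both be present."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def find_row_problems(row):
    """Return a list of problems found in one mismatch row (hypothetical helper)."""
    problems = []
    if not QID_PATTERN.fullmatch(row.get("item_id") or ""):
        problems.append("invalid item_id")
    if not PID_PATTERN.fullmatch(row.get("property_id") or ""):
        problems.append("invalid property_id")
    # statement_guid may be null only when wikidata_value is null as well.
    if row.get("statement_guid") is None and row.get("wikidata_value") is not None:
        problems.append("wikidata_value set without statement_guid")
    url = row.get("external_url")
    if url is not None and not is_valid_url(url):
        problems.append("invalid external_url")
    if row.get("type") not in ALLOWED_TYPES:
        problems.append("invalid type")
    if len(row.get("external_value") or "") > MAX_VALUE_LENGTH:
        problems.append("external_value too long")
    return problems


# A row that combines the problems introduced in this notebook.
broken_row = {
    "item_id": "184746",
    "statement_guid": None,
    "property_id": None,
    "wikidata_value": "Q18612271",
    "external_value": "Anthropologist" * 150,
    "external_url": "broken.source.url/12345",
    "type": "invalid_type",
}
print(find_row_problems(broken_row))
```

Collecting every problem in a list instead of stopping at the first one mirrors how the error message above reports all issues at once.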
