# Measure success of Media Matching bots

[T275888](https://phabricator.wikimedia.org/T275888)

## Purpose

The Structured Data team is working on an API that will allow bot writers to automatically add highly relevant images to specific articles.

To understand whether and how to move forward, or whether any major changes are needed, we would like to collect metrics about the health of the project:

- How many edits are made by bots to add images?
- What proportion of those edits are reverted within 48 hours (aka “unconstructive edits”)? 
- How many images are added to an article in each edit? Does the number of images added per edit relate to revert rate?
- Are there certain topic areas where images added by bots are more likely to be reverted?

## Data

For this analysis, we are collecting metrics for JarBot running on Arabic Wikipedia from 01 March 2021 to 31 August 2021. The image-adding edits can be found using [this query](https://quarry.wmflabs.org/query/57516).

In [24]:
import datetime as dt
import requests
import mwapi
import json
import urllib.parse

import pandas as pd
import numpy as np

from wmfdata import spark, hive

In [10]:
## load the CSV of image-adding edits into a Hive table

filepath = "arwiki-jarbot-2021.csv"
hive.load_csv(
    filepath,
    field_spec="rev_timestamp string, page_id int, rev_id int, comment_text string",
    db_name="cchen",
    table_name="sd_jarbot_0301_0901",
)

## Edits count

In [9]:
edits_count_query = '''
SELECT 
  FROM_UNIXTIME(UNIX_TIMESTAMP(SUBSTR(rev_timestamp,0,8), 'yyyyMMdd')) AS `date`,
  COUNT(DISTINCT(rev_id)) AS rev
FROM cchen.sd_jarbot_0301_0901 
GROUP BY SUBSTR(rev_timestamp,0,8)
'''

In [10]:
edits_count = spark.run(edits_count_query)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [11]:
edits_count

index,date,rev
0,2021-03-18 00:00:00,25
1,2021-03-19 00:00:00,54
2,2021-03-23 00:00:00,77
3,2021-08-01 00:00:00,2131
4,2021-07-31 00:00:00,17126
5,2021-03-08 00:00:00,13


In [12]:
#total image edits count
edits_count['rev'].sum()

19426
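
As a cross-check on the Hive query above, the same daily aggregation can be sketched in pandas. This is a minimal illustration on hypothetical rows (`toy` and its values are made up); it assumes `rev_timestamp` is a `yyyyMMddHHmmss` string, as the query's `SUBSTR(rev_timestamp,0,8)` suggests. Sorting by date also makes the output easier to read than the unordered Spark result.

```python
import pandas as pd

# Hypothetical rows in the same shape as the loaded table
toy = pd.DataFrame({
    "rev_timestamp": ["20210318101500", "20210318235959", "20210319000001"],
    "rev_id": [101, 102, 103],
})

# Keep only the yyyyMMdd part and parse it as a date
toy["date"] = pd.to_datetime(toy["rev_timestamp"].str[:8], format="%Y%m%d")

# Count distinct revisions per day, sorted chronologically
daily = (
    toy.groupby("date")["rev_id"]
    .nunique()
    .rename("rev")
    .reset_index()
    .sort_values("date")
)
print(daily)
print(daily["rev"].sum())
```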

## Reverted edits

In [3]:
reverted_query = '''
SELECT 
  event_timestamp,
  revision_id, 
  b.page_id, 
  CASE 
    WHEN event_entity = "revision" 
        AND revision_is_identity_reverted 
        AND revision_seconds_to_identity_revert <= 172800 THEN 1
    ELSE 0 
  END AS reverted
FROM cchen.sd_jarbot_0301_0901 b
  INNER JOIN wmf.mediawiki_history w ON b.rev_id = w.revision_id
WHERE snapshot = '2021-08'
  AND wiki_db = 'arwiki'
  AND substr(event_timestamp,1,10) BETWEEN '2021-03-01' AND '2021-09-01'
'''

In [4]:
edits_revert = spark.run(reverted_query)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [7]:
#number of image edits reverted within 48 hours
edits_revert['reverted'].sum()

369

In [9]:
#proportion
edits_revert['reverted'].sum()/edits_revert['reverted'].count()

0.018995161124266448

The proportion of image edits reverted within 48 hours is 1.9%. For reference, the 48-hour revert rate on Arabic Wikipedia is 5.0% for all edits and 2.4% for bot edits.
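
The 48-hour window in the query corresponds to 172,800 seconds. The sketch below mirrors the `CASE` logic in plain Python on a few hypothetical records (`toy_edits` and the helper `reverted_within_window` are illustrative names, not part of the analysis), then recomputes the reported rate from the counts shown above.

```python
REVERT_WINDOW_SECONDS = 48 * 60 * 60  # 172800, the threshold used in the query


def reverted_within_window(is_identity_reverted, seconds_to_revert,
                           window=REVERT_WINDOW_SECONDS):
    """Return 1 if the edit was reverted within the window, else 0."""
    if is_identity_reverted and seconds_to_revert is not None:
        return int(seconds_to_revert <= window)
    return 0


# Hypothetical records: (revision_is_identity_reverted, seconds_to_identity_revert)
toy_edits = [(True, 3600), (True, 200000), (False, None), (False, None)]
flags = [reverted_within_window(r, s) for r, s in toy_edits]
toy_rate = sum(flags) / len(flags)
print(flags, toy_rate)

# Reported counts from this analysis: 369 reverted out of 19426 edits
reported_pct = round(369 / 19426 * 100, 1)
print(reported_pct)
```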

## Number of images per edit

Use [API with action=compare](https://www.mediawiki.org/w/api.php?action=help&modules=compare) to identify how many images were added per edit.

In [9]:
image_added = []

# 'ملف' ("File") is the file namespace prefix on Arabic Wikipedia
file_str = 'ملف'

# create the API session once and reuse it for every request
end_point = 'https://ar.wikipedia.org'
session = mwapi.Session(end_point, user_agent="get image count <cchen@wikimedia.org>")

for rev_id in edits_revert['revision_id']:
    try:
        # diff between this revision and its parent
        api_result = session.get(action='compare', fromrev=int(rev_id), torelative="prev")
        diff_html = api_result['compare']['*']
        # count occurrences of the file prefix in the diff HTML
        image_count = diff_html.count(file_str)
    except Exception:
        # treat failed lookups as 0 images
        image_count = 0

    image_added.append(image_count)
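
Counting every occurrence of the prefix in the diff HTML also picks up file references on lines that were only modified, since those appear on both sides of the diff. A possible refinement, sketched here under assumptions, is to count the prefix in added cells and subtract the count in deleted cells. The sketch assumes the diff table marks those cells with the `diff-addedline` and `diff-deletedline` classes; `DiffSideCounter`, `net_file_mentions`, and `sample_diff` are hypothetical names and data, and this was not the method used for the results in this analysis.

```python
from html.parser import HTMLParser


class DiffSideCounter(HTMLParser):
    """Collect the text of added and deleted cells in a diff table."""

    def __init__(self):
        super().__init__()
        self.side = None  # "added", "deleted", or None outside those cells
        self.text = {"added": [], "deleted": []}

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "diff-addedline" in classes:
            self.side = "added"
        elif "diff-deletedline" in classes:
            self.side = "deleted"
        else:
            self.side = None

    def handle_endtag(self, tag):
        if tag == "td":
            self.side = None

    def handle_data(self, data):
        if self.side is not None:
            self.text[self.side].append(data)


def net_file_mentions(diff_html, file_prefix="ملف"):
    """Prefix occurrences on the added side minus the deleted side, floored at 0."""
    parser = DiffSideCounter()
    parser.feed(diff_html)
    parser.close()
    added = "".join(parser.text["added"]).count(file_prefix)
    deleted = "".join(parser.text["deleted"]).count(file_prefix)
    return max(added - deleted, 0)


# Hypothetical diff: one new line with a file, one modified line with an existing file
sample_diff = (
    '<tr><td class="diff-marker">+</td>'
    '<td class="diff-addedline"><div>[[ملف:Example.jpg|thumb]]</div></td></tr>'
    '<tr><td class="diff-deletedline"><div>[[ملف:Old.jpg]] text</div></td>'
    '<td class="diff-addedline"><div>[[ملف:Old.jpg]] text <ins>more</ins></div></td></tr>'
)
print(net_file_mentions(sample_diff))
```

On the sample, a plain `sample_diff.count('ملف')` would report 3, while the net count reports the single newly added file.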
        

In [13]:
image_added = pd.DataFrame(image_added).rename(columns={0: "image_count"})

In [14]:
edits_image = pd.concat([edits_revert, image_added],axis=1)

In [32]:
edits_image.groupby(['image_count']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'})


image_count,edit_count,pct_total
0,1,5.1e-05
1,19255,0.991197
2,3,0.000154
3,126,0.006486
5,25,0.001287
7,7,0.00036
9,4,0.000206
11,1,5.1e-05
13,2,0.000103
21,2,0.000103


99.1% of JarBot's image edits add exactly one image to the article.

In [33]:
summary = edits_image.groupby(['image_count']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary["revert_rate"] = summary['reverted']/summary['edit_count']
summary

image_count,edit_count,reverted,revert_rate
0,1,0,0.0
1,19255,369,0.019164
2,3,0,0.0
3,126,0,0.0
5,25,0,0.0
7,7,0,0.0
9,4,0,0.0
11,1,0,0.0
13,2,0,0.0
21,2,0,0.0


All of the reverts occurred in edits where JarBot added one image. Because 99.1% of edits fall in that group, we don't have sufficient data to draw a relationship between the number of images added per edit and the revert rate.

## Article topics and image edits

Note that one article may have multiple topics. We count edits and reverts per article topic, so when topics are aggregated, edits to multi-topic articles are counted more than once and the total edits and reverts look much larger than they are.
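
The effect can be illustrated with a small sketch on hypothetical data (`toy_topics` and its values are made up): summing per-topic counts overstates the totals, while counting distinct `revision_id` values recovers the true number of edits and reverts. The same pattern shows up in the tables below, where per-main-topic edit counts sum to 43,607 against 19,426 distinct edits.

```python
import pandas as pd

# Hypothetical edits: revision 1 belongs to two topics, revision 2 to one
toy_topics = pd.DataFrame({
    "revision_id": [1, 1, 2],
    "main_topic": ["Culture", "History_and_Society", "Culture"],
    "reverted": [1, 1, 0],
})

# Per-topic counts, as in the topic tables
per_topic = toy_topics.groupby("main_topic").agg(
    edit_count=("revision_id", "nunique"),
    reverted=("reverted", "sum"),
)

# Summing across topics double counts revision 1
summed_edits = per_topic["edit_count"].sum()
summed_reverts = per_topic["reverted"].sum()

# Deduplicating on revision_id gives the true totals
unique_edits = toy_topics["revision_id"].nunique()
unique_reverts = toy_topics.drop_duplicates("revision_id")["reverted"].sum()

print(summed_edits, summed_reverts)
print(unique_edits, unique_reverts)
```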

In [36]:
topic_query = '''
SELECT 
  event_timestamp,
  revision_id, 
  b.page_id,
  ato.topic,
  tc.main_topic, 
  tc.sub_topic,
  CASE 
    WHEN event_entity = "revision" 
        AND revision_is_identity_reverted 
        AND revision_seconds_to_identity_revert <= 172800 THEN 1
    ELSE 0 
  END AS reverted
FROM cchen.sd_jarbot_0301_0901 b
  INNER JOIN wmf.mediawiki_history w ON (
    b.rev_id = w.revision_id
    AND w.wiki_db = 'arwiki'
  )
  INNER JOIN isaacj.article_topics_outlinks_2021_07 ato ON (
    ato.wiki_db =  'arwiki'
    AND b.page_id = ato.pageid
    AND ato.score >= 0.5
  )
  LEFT JOIN cchen.topic_component tc ON ato.topic = tc.topic
WHERE snapshot = '2021-08'
  AND substr(event_timestamp,1,10) BETWEEN '2021-03-01' AND '2021-09-01'
'''

In [35]:
edits_topic = spark.run(topic_query)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [39]:
edits_topic.groupby(['main_topic']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'}).sort_values(['edit_count'],ascending=False)

main_topic,edit_count,pct_total
STEM,16931,0.388263
Geography,10268,0.235467
Culture,9445,0.216594
History_and_Society,6963,0.159676


In [41]:
edits_topic.groupby(['topic']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'}).sort_values(['edit_count'],ascending=False).head(10)


topic,edit_count,pct_total
STEM.STEM*,8087,0.185452
Geography.Regions.Europe.Europe*,2149,0.049281
Culture.Biography.Biography*,2054,0.047103
Culture.Philosophy_and_religion,2002,0.04591
STEM.Medicine_&_Health,1990,0.045635
Geography.Regions.Asia.Asia*,1861,0.042677
History_and_Society.Politics_and_government,1660,0.038067
History_and_Society.History,1411,0.032357
STEM.Biology,1330,0.0305
STEM.Technology,1279,0.02933


JarBot's image edits span all 64 topics (see [the taxonomy](https://www.mediawiki.org/wiki/ORES/Articletopic) for a detailed list of article topics). The most edited main topic is STEM (38.8% of edits, counted per topic).

In [44]:
summary_m = edits_topic.groupby(['main_topic']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary_m["revert_rate"] = summary_m['reverted']/summary_m['edit_count']
summary_m.sort_values(['revert_rate'],ascending=False)

main_topic,edit_count,reverted,revert_rate
Culture,9445,366,0.038751
History_and_Society,6963,70,0.010053
Geography,10268,7,0.000682
STEM,16931,9,0.000532


JarBot's image edits are most likely to be reverted in articles whose main topic is Culture (3.9% revert rate), followed by History_and_Society (1.0% revert rate).

In [46]:
summary_t = edits_topic.groupby(['topic']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary_t["revert_rate"] = summary_t['reverted']/summary_t['edit_count']
summary_t.sort_values(['reverted'],ascending=False).head(10)

topic,edit_count,reverted,revert_rate
Culture.Biography.Biography*,2054,363,0.176728
History_and_Society.History,1411,69,0.048901
STEM.STEM*,8087,5,0.000618
STEM.Medicine_&_Health,1990,2,0.001005
Geography.Regions.Europe.Europe*,2149,2,0.000931
Culture.Sports,450,1,0.002222
STEM.Space,344,1,0.002907
STEM.Computing,803,1,0.001245
History_and_Society.Military_and_warfare,993,1,0.001007
Geography.Regions.Europe.Southern_Europe,595,1,0.001681


Most reverts fall in the Culture.Biography.Biography* topic, which has a 17.7% revert rate, followed by History_and_Society.History, where 4.9% of edits were reverted. Revert rates in the other topics shown are much lower, at 0.3% or less.