# Measurement of Lsjbot

[T275888](https://phabricator.wikimedia.org/T275888)

## Purpose

The Structured Data team is working on an API that will allow bot writers to automatically add highly relevant images to specific articles.

To understand whether and how to move forward, and whether any major changes are needed, we would like to collect metrics about the health of the project:

- How many edits are made by bots to add images?
- What proportion of those edits are reverted within 48 hours (aka “unconstructive edits”)? 
- How many images are added to an article in each edit? Does the number of images added per edit relate to revert rate?
- Are there certain topic areas where images added by bots are more likely to be reverted?

In [1]:
import datetime as dt
import requests
import mwapi
import json
import urllib.parse
import re
import wmfdata 

import pandas as pd
import numpy as np

from wmfdata import hive, spark, mariadb



## Data

For this analysis, we are collecting metrics for Lsjbot running on Cebuano Wikipedia from Dec 2021 to Jan 2022. The image-adding edits can be found using [this query](https://quarry.wmcloud.org/query/61891).

We run the same query as the Quarry query linked above against cebwiki in MariaDB to find the image edits.


In [2]:
image_edits_query = '''
SELECT 
    rev_timestamp, 
    page_id, 
    rev_id
FROM revision
    INNER JOIN page ON rev_page = page_id
    INNER JOIN revision_actor_temp ON rev_id = revactor_rev
    INNER JOIN actor ON revactor_actor = actor.actor_id
    INNER JOIN revision_comment_temp ON rev_id = revcomment_rev
    INNER JOIN comment ON comment_id = revcomment_comment_id
WHERE rev_timestamp between 20211201000000 and 20220201000000
    AND page_namespace = 0
    AND page.page_is_redirect = 0
    AND actor_name = 'Lsjbot'
    AND comment_text IN ('Images from API', 'Galeriya sa hulagway sa API');
'''

In [3]:
image_edits = mariadb.run(image_edits_query, 'cebwiki')

We preview the result, then create a temporary Spark view to hold the image-edit data so that it can be joined with MediaWiki history to aggregate edit data.

In [4]:
image_edits.tail()

,rev_timestamp,page_id,rev_id
49212,20211215022909,9820532,33581705
49213,20211218232452,1722030,33629897
49214,20211217030638,9820592,33608091
49215,20211218222630,9820594,33629473
49216,20211213212549,9820579,33571307


In [5]:
spark_session = spark.get_session()
image_edits_sdf = spark_session.createDataFrame(image_edits)
image_edits_sdf.createGlobalTempView("image_edits_view")

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


## Edit counts

In [6]:
edits_count_query = '''
SELECT
  FROM_UNIXTIME(UNIX_TIMESTAMP(SUBSTR(rev_timestamp,0,8), 'yyyyMMdd')) AS `date`,
  COUNT(DISTINCT(rev_id)) AS rev
FROM global_temp.image_edits_view
GROUP BY SUBSTR(rev_timestamp,0,8)
'''

In [7]:
edits_count = spark.run(edits_count_query)



In [8]:
edits_count.head()

,date,rev
0,2021-12-17 00:00:00,6669
1,2021-12-16 00:00:00,4427
2,2021-12-13 00:00:00,3277
3,2021-12-31 00:00:00,842
4,2021-12-14 00:00:00,4146


In [9]:
#total image edits count
edits_count['rev'].sum()

49217
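
For readers without Spark access, the per-day grouping can be illustrated in plain Python. This is a minimal sketch, not the notebook's method: `edits_per_day` is a hypothetical helper, it assumes one timestamp per revision (the Spark query counts distinct `rev_id` values), and the sample values are the five timestamps shown in the `image_edits.tail()` output.

```python
from collections import Counter
from datetime import datetime


def edits_per_day(rev_timestamps):
    """Count revisions per calendar day (hypothetical helper).

    Assumes each timestamp belongs to exactly one revision.
    """
    days = Counter()
    for ts in rev_timestamps:
        # MediaWiki timestamps are 14 digits: yyyyMMddHHmmss
        day = datetime.strptime(str(ts), "%Y%m%d%H%M%S").date()
        days[day] += 1
    # Return the days in chronological order
    return dict(sorted(days.items()))


# Timestamps copied from the image_edits.tail() output
sample = [
    20211215022909,
    20211218232452,
    20211217030638,
    20211218222630,
    20211213212549,
]

daily = edits_per_day(sample)
for day, n in daily.items():
    print(day, n)
```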

## Reverted edits

In [10]:
reverted_query = '''
SELECT 
  event_timestamp,
  revision_id, 
  b.page_id, 
  CASE 
    WHEN event_entity = "revision" 
        AND revision_is_identity_reverted 
        AND revision_seconds_to_identity_revert <= 172800 THEN 1
    ELSE 0 
  END AS reverted
FROM global_temp.image_edits_view b
  INNER JOIN wmf.mediawiki_history w ON b.rev_id = w.revision_id
WHERE snapshot = '2022-01'
  AND wiki_db = 'cebwiki'
  AND substr(event_timestamp,1,10) BETWEEN '2021-11-01' AND '2022-02-01'
'''

In [11]:
edits_revert = spark.run(reverted_query)



In [12]:
# number of image edits reverted within 48 hours
edits_revert['reverted'].sum()

8

In [13]:
#proportion
edits_revert['reverted'].sum()/edits_revert['reverted'].count()

0.00016254546193388464

The proportion of image edits reverted within 48 hours is 0.016% (8 of 49,217 edits).
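
The 48-hour rule in the query is a threshold of 172,800 seconds on `revision_seconds_to_identity_revert`. The sketch below restates that `CASE` expression in plain Python and reproduces the overall proportion from the two totals reported above. The function name `reverted_within_window` and its handling of missing values are assumptions made for illustration.

```python
REVERT_WINDOW_SECONDS = 48 * 60 * 60  # 172800, the threshold used in the query


def reverted_within_window(is_identity_reverted, seconds_to_revert):
    """Return 1 if the edit was reverted within 48 hours, else 0.

    Hypothetical helper mirroring the CASE expression in the query.
    A missing revert time is treated as "not reverted" (an assumption).
    """
    if not is_identity_reverted or seconds_to_revert is None:
        return 0
    return 1 if seconds_to_revert <= REVERT_WINDOW_SECONDS else 0


# Overall proportion, using the totals reported in the notebook output
reverted_edits = 8
total_edits = 49217
overall_rate = reverted_edits / total_edits
print(f"{overall_rate:.6f} ({overall_rate * 100:.3f}%)")
```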

## Number of images per edit

Use the [API with action=compare](https://www.mediawiki.org/w/api.php?action=help&modules=compare) to retrieve the diff for each edit and count the media file names it mentions, as an estimate of how many images were added per edit.

In [81]:
image_added = []

# File extensions to look for in the diff:
file_str = ['.jpg', '.png', '.svg', '.gif', '.jpeg', '.tif', '.pdf', '.ogv', '.webm', '.mpg', '.mpeg']

# Create the API session once and reuse it for every request.
end_point = 'https://ceb.wikipedia.org'
session = mwapi.Session(end_point, user_agent="image count <cchen@wikimedia.org>")

for i in range(len(edits_revert)):

    image_count = 0

    try:
        # Diff between this revision and the previous revision of the page.
        api_result = session.get(action='compare', fromrev=edits_revert.iloc[i]['revision_id'], torelative="prev")
        diff_html = api_result['compare']['*']

        # Count every occurrence of each extension in the diff HTML.
        for n in range(len(file_str)):
            count = diff_html.lower().count(file_str[n])
            image_count += count

    except Exception:
        # If the request fails, record 0 images for this edit.
        image_count = 0

    image_added.append(image_count)
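
Counting extensions across the whole diff HTML also picks up file names on the removed side of the diff. The sketch below shows that heuristic on a hand-written sample, next to a possible refinement that only looks at added cells. It is an illustration under assumptions: the `diff-addedline` and `diff-deletedline` class names are assumed to match the markup returned by the API, the sample HTML is invented, and both helper functions are hypothetical.

```python
import re

FILE_EXTENSIONS = ('.jpg', '.png', '.svg', '.gif', '.jpeg', '.tif',
                   '.pdf', '.ogv', '.webm', '.mpg', '.mpeg')

# Assumption: cells holding added text carry the class "diff-addedline".
ADDED_CELL = re.compile(
    r'<td[^>]*class="[^"]*diff-addedline[^"]*"[^>]*>(.*?)</td>',
    re.DOTALL | re.IGNORECASE,
)


def count_file_mentions(text):
    """Count occurrences of known file extensions, case-insensitively."""
    lowered = text.lower()
    return sum(lowered.count(ext) for ext in FILE_EXTENSIONS)


def count_added_file_mentions(diff_html):
    """Count file extensions only inside cells marked as added."""
    return sum(count_file_mentions(cell) for cell in ADDED_CELL.findall(diff_html))


# Invented sample: one file on the removed side, two on the added side.
sample_diff = (
    '<tr>'
    '<td class="diff-deletedline"><div>[[File:Old.jpg|thumb]]</div></td>'
    '<td class="diff-addedline"><div>[[File:New_a.JPG|thumb]] '
    '[[File:New_b.png|thumb]]</div></td>'
    '</tr>'
)

whole_diff_count = count_file_mentions(sample_diff)
added_only_count = count_added_file_mentions(sample_diff)
print(whole_diff_count, added_only_count)
```

On this sample the two counts differ because the removed file name is included in the whole-diff count. Whether that matters for Lsjbot's edits depends on how often its diffs include existing file names, which the notebook does not measure.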

In [82]:
image_added = pd.DataFrame(image_added).rename(columns={0: "image_count"})

In [83]:
edits_image = pd.concat([edits_revert, image_added],axis=1)

In [24]:
edits_image.agg({'image_count':'sum'})

image_count    128294
dtype: int64

In [19]:
edits_image.groupby(['image_count']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'})

image_count,edit_count,pct_total
1,24129,0.490158
2,10038,0.203912
3,4616,0.09377
4,1926,0.039125
5,1465,0.02976
6,1245,0.025291
7,1914,0.038881
8,1759,0.035732
9,1079,0.021919
10,1055,0.021431


Lsjbot added 128,294 images from Dec 2021 to Jan 2022.
49% of the image edits by Lsjbot add 1 image to the article, and 20.4% add 2 images.

In [21]:
summary = edits_image.groupby(['image_count']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary["revert_rate"] = summary['reverted']/summary['edit_count']
summary

image_count,edit_count,reverted,revert_rate
1,24129,2,8.3e-05
2,10038,3,0.000299
3,4616,0,0.0
4,1926,0,0.0
5,1465,1,0.000683
6,1245,0,0.0
7,1914,0,0.0
8,1759,0,0.0
9,1079,1,0.000927
10,1055,1,0.000948


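Reverts occurred in image edits that added 1, 2, 5, 9, or 10 images. With only 8 reverts in total, these counts are too small to reliably indicate a relationship between the number of images added and the revert rate.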
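
One rough way to compare groups is to pool the table rows into single-image and multi-image edits. The sketch below does that arithmetic with the rows copied from the summary table. `pooled_revert_rate` is a hypothetical helper, and the pooling itself is an illustrative choice, not part of the original analysis.

```python
# (image_count, edit_count, reverted) rows copied from the summary table
summary_rows = [
    (1, 24129, 2),
    (2, 10038, 3),
    (3, 4616, 0),
    (4, 1926, 0),
    (5, 1465, 1),
    (6, 1245, 0),
    (7, 1914, 0),
    (8, 1759, 0),
    (9, 1079, 1),
    (10, 1055, 1),
]


def pooled_revert_rate(rows):
    """Total reverts divided by total edits across the given rows."""
    edits = sum(edit_count for _, edit_count, _ in rows)
    reverts = sum(reverted for _, _, reverted in rows)
    return reverts / edits


single_image_rate = pooled_revert_rate([r for r in summary_rows if r[0] == 1])
multi_image_rate = pooled_revert_rate([r for r in summary_rows if r[0] > 1])

print(f"1 image:   {single_image_rate:.6f}")
print(f"2+ images: {multi_image_rate:.6f}")
```

Both pooled rates rest on a handful of reverts, so any difference between them should be read with caution.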

## Article topics and image edits

Note that one article may have multiple topics. We count edits and reverts per article topic, so an edit to an article with several topics is counted once for each topic. When topics are aggregated, this double counting makes the total numbers of edits and reverts look much larger than they are.
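
The double counting can be seen on a toy example: joining edits to a page-to-topics mapping yields one row per (edit, topic) pair. The data below is invented for illustration and the variable names are hypothetical. Only the topic labels are taken from the taxonomy used in this notebook.

```python
# Invented toy data: two edits, one of them to a page with two topics.
edits = [
    {"revision_id": 1, "page_id": 10, "reverted": 0},
    {"revision_id": 2, "page_id": 20, "reverted": 1},
]
page_topics = {
    10: ["STEM.Biology", "STEM.STEM*"],
    20: ["Geography.Regions.Oceania"],
}

# Equivalent of the inner join on page_id: one row per (edit, topic) pair.
joined = [
    dict(edit, topic=topic)
    for edit in edits
    for topic in page_topics.get(edit["page_id"], [])
]

rows_after_join = len(joined)
unique_edits = len({row["revision_id"] for row in joined})
print(rows_after_join, unique_edits)
```

Here two edits become three rows after the join, which is why per-topic totals should not be summed to obtain an overall edit count.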

In [25]:
topic_query = '''
SELECT 
  event_timestamp,
  revision_id, 
  b.page_id,
  ato.topic,
  tc.main_topic, 
  tc.sub_topic,
  CASE 
    WHEN event_entity = "revision" 
        AND revision_is_identity_reverted 
        AND revision_seconds_to_identity_revert <= 172800 THEN 1
    ELSE 0 
  END AS reverted
FROM global_temp.image_edits_view b
  INNER JOIN wmf.mediawiki_history w ON (
    b.rev_id = w.revision_id
    AND w.wiki_db = 'cebwiki'
  )
  INNER JOIN isaacj.article_topics_outlinks_2021_11 ato ON (
    ato.wiki_db =  'cebwiki'
    AND b.page_id = ato.pageid
    AND ato.score >= 0.5
  )
  LEFT JOIN cchen.topic_component tc ON ato.topic = tc.topic
WHERE snapshot = '2022-01'
  AND substr(event_timestamp,1,10) BETWEEN '2021-11-01' AND '2022-02-01'
'''

In [26]:
edits_topic = spark.run(topic_query)



In [27]:
edits_topic.groupby(['main_topic']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'}).sort_values(['edit_count'],ascending=False)

main_topic,edit_count,pct_total
STEM,67357,0.646762
Geography,35148,0.337491
Culture,1431,0.01374
History_and_Society,209,0.002007


In [30]:
edits_topic.groupby(['topic']).agg({'revision_id':'size'}).assign(pct_total=lambda x: x / x.sum()).rename(columns={'revision_id':'edit_count'}).sort_values(['edit_count'],ascending=False).head(10)


topic,edit_count,pct_total
STEM.STEM*,33651,0.323117
STEM.Biology,33520,0.321859
Geography.Regions.Europe.Europe*,13130,0.126074
Geography.Regions.Europe.Western_Europe,12863,0.12351
Geography.Regions.Oceania,2648,0.025426
Geography.Regions.Asia.Asia*,1472,0.014134
Geography.Regions.Africa.Africa*,733,0.007038
Geography.Regions.Americas.South_America,670,0.006433
Geography.Regions.Asia.Southeast_Asia,615,0.005905
Geography.Regions.Americas.North_America,538,0.005166


The image edits by Lsjbot were made across all 64 topics (please refer to [the taxonomy](https://www.mediawiki.org/wiki/ORES/Articletopic) for a detailed list of article topics). The most edited main topic is STEM (64.7% of total edits).

In [31]:
summary_m = edits_topic.groupby(['main_topic']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary_m["revert_rate"] = summary_m['reverted']/summary_m['edit_count']
summary_m.sort_values(['revert_rate'],ascending=False)

main_topic,edit_count,reverted,revert_rate
STEM,67357,8,0.000119
Geography,35148,4,0.000114
Culture,1431,0,0.0
History_and_Society,209,0,0.0


In [32]:
summary_t = edits_topic.groupby(['topic']).agg({'revision_id':'size','reverted':'sum'}).rename(columns={'revision_id':'edit_count'})
summary_t["revert_rate"] = summary_t['reverted']/summary_t['edit_count']
summary_t.sort_values(['reverted'],ascending=False).head(10)

topic,edit_count,reverted,revert_rate
STEM.STEM*,33651,4,0.000119
STEM.Biology,33520,4,0.000119
Geography.Regions.Americas.North_America,538,2,0.003717
Geography.Regions.Africa.Africa*,733,1,0.001364
Geography.Regions.Africa.Eastern_Africa,199,1,0.005025
Culture.Biography.Biography*,164,0,0.0
Geography.Regions.Europe.Northern_Europe,57,0,0.0
Geography.Regions.Oceania,2648,0,0.0
Geography.Regions.Europe.Western_Europe,12863,0,0.0
Geography.Regions.Europe.Southern_Europe,108,0,0.0
