# INFO 3402 – Week 09: Assignment - Solutions

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Background

My research explores how Wikipedia editors self-organize to cover breaking news events. If you go to an article about a current event (*e.g.*, [2022 Russian invasion of Ukraine](https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine)) you will see there is a tremendous amount of content (text, citations, images, *etc*.) that has been added in the space of days.

In this assignment we will use code I have developed to retrieve and analyze data about Wikipedia.

In [1]:
import pandas as pd
idx = pd.IndexSlice
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import wikifunctions as wf

## Question 01: Retrieve and explore an article revision history (10 pts)

Use the `wikifunctions` function `get_all_page_revisions` to retrieve the complete revision history of the article "2022 Russian invasion of Ukraine" and store it as `revs_en_df`. (1 pt)

Count the number of unique revisions ("sha1") and distinct users ("user") in `revs_en_df`. (2 pts)
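
One way to approach the counting step is with `Series.nunique`. The sketch below uses a small hand-made DataFrame (`toy_revs_df` is a hypothetical stand-in, not the real revision history) that is assumed to share the "sha1" and "user" column names.

```python
import pandas as pd

# Hypothetical stand-in for a revision history; the real data comes from wikifunctions.
toy_revs_df = pd.DataFrame({
    "sha1": ["a1", "b2", "c3", "a1", "d4"],  # "a1" repeats, as it might after a revert
    "user": ["Alice", "Bob", "Alice", "Carol", "Bob"],
})

# nunique counts distinct non-null values in each column.
n_revisions = toy_revs_df["sha1"].nunique()
n_users = toy_revs_df["user"].nunique()

print(f"{n_revisions:,} unique revisions by {n_users:,} unique users")
```

Because a revert can restore an earlier version of the text, the number of unique "sha1" values may be smaller than the number of rows.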

Make two side-by-side histograms. On the left, visualize the "lag" column with geometrically-sized bins and log-scaled x-axis. On the right, visualize the *absolute value* (hint: `np.abs`, [docs](https://numpy.org/doc/stable/reference/generated/numpy.absolute.html)) of the "diff" column with geometrically-sized bins and log-scaled x-axis. (5 pts)
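
The histogram step could be sketched as follows. The data here are synthetic (`toy_lag` and `toy_abs_diff` are hypothetical arrays drawn from random distributions), and `geometric_bins` is an illustrative helper, not part of `wikifunctions`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumptions, not real revision data).
rng = np.random.default_rng(0)
toy_lag = rng.lognormal(mean=4, sigma=2, size=1000)              # seconds since previous revision
toy_abs_diff = np.abs(rng.normal(loc=0, scale=500, size=1000))   # absolute bytes changed

def geometric_bins(values, n_bins=30):
    """Return n_bins + 1 geometrically spaced edges spanning the positive values."""
    positive = values[values > 0]  # geometric edges cannot start at zero
    return np.geomspace(positive.min(), positive.max(), n_bins + 1)

lag_bins = geometric_bins(toy_lag)
diff_bins = geometric_bins(toy_abs_diff)

# One row, two columns: lag on the left, absolute diff on the right.
fig, (ax_lag, ax_diff) = plt.subplots(1, 2, figsize=(12, 4))

ax_lag.hist(toy_lag, bins=lag_bins)
ax_lag.set_xscale("log")
ax_lag.set_xlabel("Lag (synthetic)")
ax_lag.set_ylabel("Number of revisions")

ax_diff.hist(toy_abs_diff, bins=diff_bins)
ax_diff.set_xscale("log")
ax_diff.set_xlabel("Absolute diff (synthetic)")
```

Geometric bin edges grow by a constant ratio, so the bars appear equally wide on a log-scaled axis. Values of exactly zero fall outside the bins and are not drawn.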

Write a few sentences about what surprises you about the size and speed of the collaboration on this article. (2 pts)

**Extra credit (2 pts)**. Annotate each histogram with a vertical red line at the median and a label showing its numeric value.
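
A minimal sketch of the annotation, assuming a single histogram of hypothetical values (`toy_values`). It uses `Axes.axvline` for the line and `Axes.text` for the label; the label position (90% of the axes height) is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values, chosen so the median is easy to verify by eye.
toy_values = np.array([1, 2, 4, 8, 16, 32, 64])
median_value = np.median(toy_values)

fig, ax = plt.subplots()
ax.hist(toy_values, bins=np.geomspace(1, 64, 7))
ax.set_xscale("log")

# Vertical red line at the median.
ax.axvline(median_value, color="red")

# Label: x is in data coordinates, y is a fraction of the axes height.
ax.text(
    median_value, 0.9, f" median = {median_value:,.0f}",
    color="red", transform=ax.get_xaxis_transform(),
)
```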

## Question 02: Comparing across language editions (10 pts)

Use the `get_all_page_revisions` function to get the revision history for the Russian and Ukrainian language versions of the article about the conflict and store them as `revs_ru_df` and `revs_uk_df` respectively. Make sure to change the default "endpoint" in the retrieval function. Inspect each DataFrame. (4 pts)

In [46]:
ru_title = 'Вторжение России на Украину (2022)'
uk_title = 'Російське вторгнення в Україну (2022)'

Print the number of unique revisions and unique users for each language. (2 pts)

How many users overlap between the English and Russian versions? English and Ukrainian? Russian and Ukrainian? (3 pts)
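
The overlap question maps naturally onto set intersection. The sketch below uses three hypothetical DataFrames (`toy_en`, `toy_ru`, `toy_uk`) that are assumed to have a "user" column like the real revision histories.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-ins for the three revision DataFrames.
toy_en = pd.DataFrame({"user": ["Alice", "Bob", "Carol", "Alice"]})
toy_ru = pd.DataFrame({"user": ["Bob", "Dmitri", "Carol"]})
toy_uk = pd.DataFrame({"user": ["Carol", "Oksana"]})

# One set of distinct usernames per language.
users = {
    "English": set(toy_en["user"].dropna()),
    "Russian": set(toy_ru["user"].dropna()),
    "Ukrainian": set(toy_uk["user"].dropna()),
}

# Intersect every pair of languages.
overlaps = {
    (a, b): users[a] & users[b]
    for a, b in combinations(users, 2)
}

for (a, b), shared in overlaps.items():
    print(f"{a} and {b}: {len(shared)} shared users")
```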

Write a sentence or two about some of the patterns you've seen so far. (1 pt)

## Question 03: Retrieve the pageviews (10 pts)

Use the `get_pageviews` function to retrieve the pageviews for the articles "[Ukraine](https://en.wikipedia.org/wiki/Ukraine)" (en), "[Украина](https://ru.wikipedia.org/wiki/%D0%A3%D0%BA%D1%80%D0%B0%D0%B8%D0%BD%D0%B0)" (ru), and "[Україна](https://uk.wikipedia.org/wiki/%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D0%B0)" (uk) from their respective language editions with a "start" of "20190101". Store the resulting pandas Series as `pvs_en_s`, `pvs_ru_s`, and `pvs_uk_s` respectively. (3 pts)

Make a DataFrame called `ukraine_pvs` with the three Series as columns named "English", "Russian", and "Ukrainian". Inspect the first few rows of `ukraine_pvs`. (2 pts)
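
Passing a dictionary of Series to `pd.DataFrame` is one way to build such a table: the keys become column names and the Series are aligned on their shared index. The Series below (`toy_en_s`, `toy_ru_s`, `toy_uk_s`) hold made-up numbers, not real pageviews.

```python
import pandas as pd

# Hypothetical daily pageview Series sharing a date index.
dates = pd.date_range("2019-01-01", periods=4, freq="D")
toy_en_s = pd.Series([100, 120, 90, 110], index=dates)
toy_ru_s = pd.Series([80, 85, 70, 95], index=dates)
toy_uk_s = pd.Series([60, 55, 65, 50], index=dates)

# Dictionary keys become the column names.
toy_pvs = pd.DataFrame({
    "English": toy_en_s,
    "Russian": toy_ru_s,
    "Ukrainian": toy_uk_s,
})

print(toy_pvs.head())
```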

Make one, larger-than-default plot containing the pageview data of all three languages with a legend. (3 pts)
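
`DataFrame.plot` draws one line per column and accepts a `figsize` argument, so a sketch of this step can be quite short. The frame `toy_pvs` below is filled with random integers purely for illustration, and the 12 by 5 inch size is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical pageview table filled with random numbers.
rng = np.random.default_rng(1)
dates = pd.date_range("2019-01-01", periods=30, freq="D")
toy_pvs = pd.DataFrame(
    rng.integers(1000, 5000, size=(30, 3)),
    index=dates,
    columns=["English", "Russian", "Ukrainian"],
)

# figsize is (width, height) in inches; the legend is built from the column names.
ax = toy_pvs.plot(figsize=(12, 5), legend=True)
ax.set_ylabel("Daily pageviews")
```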

Slice the `ukraine_pvs` data from July 1, 2019 to June 30, 2020 and re-visualize. (2 pts)
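
Label-based slicing with `.loc` and date strings is one way to take this slice. The sketch assumes the index is a sorted `DatetimeIndex`; `toy_pvs` is a hypothetical frame with placeholder values.

```python
import pandas as pd

# Hypothetical frame covering two full years of daily data.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
toy_pvs = pd.DataFrame({"English": range(len(dates))}, index=dates)

# Unlike positional slicing, .loc includes both endpoints.
toy_slice = toy_pvs.loc["2019-07-01":"2020-06-30"]

print(toy_slice.index.min(), toy_slice.index.max(), len(toy_slice))
```

The slice covers 366 days because 2020 is a leap year.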

**Extra credit (1 pt)**. Identify a specific event that drove one of the five significant spikes in English pageviews in this timeframe.

## Question 04: Aggregate and visualize revisions (10 pts)

Using a `Grouper` on the "timestamp" column in each language's revision DataFrame, aggregate the number of unique revisions ("sha1") and number of unique users ("user") at an hourly frequency and store as `revs_en_h`, `revs_ru_h`, and `revs_uk_h` respectively. (3 pts)
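
A minimal sketch of the hourly aggregation, using a four-row hypothetical DataFrame (`toy_revs_df`) in which "timestamp" is assumed to already be a datetime column.

```python
import pandas as pd

# Hypothetical revisions: three in the 03:00 hour, one in the 05:00 hour.
toy_revs_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-02-24 03:05", "2022-02-24 03:40",
        "2022-02-24 03:55", "2022-02-24 05:10",
    ]),
    "sha1": ["a1", "b2", "c3", "d4"],
    "user": ["Alice", "Bob", "Alice", "Carol"],
})

# Group rows into hourly bins, then count distinct values per bin.
# "h" is the hourly alias; older pandas code often spells it "H".
hourly = pd.Grouper(key="timestamp", freq="h")
toy_revs_h = toy_revs_df.groupby(hourly).agg({"sha1": "nunique", "user": "nunique"})

print(toy_revs_h)
```

If the real "timestamp" column is stored as strings, it would need to be converted with `pd.to_datetime` before grouping.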

Use pandas's `concat` function to combine the hourly revision activity for all three languages into `war_editing_h`. Rename the columns with the language. Inspect the *last* five rows. (3 pts)
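
One possible way to combine and label the hourly tables in a single step is the `keys` argument of `pd.concat`, which adds the language as an outer column level. The three frames below (`toy_en_h`, `toy_ru_h`, `toy_uk_h`) contain made-up counts.

```python
import pandas as pd

# Hypothetical hourly activity tables sharing the same index.
hours = pd.date_range("2022-02-24 03:00", periods=3, freq="h")
toy_en_h = pd.DataFrame({"sha1": [5, 3, 4], "user": [3, 2, 2]}, index=hours)
toy_ru_h = pd.DataFrame({"sha1": [2, 4, 3], "user": [2, 3, 2]}, index=hours)
toy_uk_h = pd.DataFrame({"sha1": [1, 2, 5], "user": [1, 1, 3]}, index=hours)

# axis=1 places the tables side by side; keys label each block of columns.
toy_editing_h = pd.concat(
    [toy_en_h, toy_ru_h, toy_uk_h],
    axis=1,
    keys=["English", "Russian", "Ukrainian"],
)

print(toy_editing_h.tail())
```

The result has two column levels (language, then measure), so a single measure across languages can be selected with `toy_editing_h.xs("sha1", axis=1, level=1)`.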

Visualize `war_editing_h` with a figure containing two subplots in one column and two rows. (3 pts)
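
A sketch of a two-row, one-column figure follows. It assumes a hypothetical frame (`toy_editing_h`) whose columns have a language level and a measure level, and it assumes revisions go in the top panel and users in the bottom panel; other layouts would satisfy the prompt as well.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical hourly counts with (language, measure) column levels.
hours = pd.date_range("2022-02-24 03:00", periods=4, freq="h")
columns = pd.MultiIndex.from_product(
    [["English", "Russian", "Ukrainian"], ["sha1", "user"]]
)
toy_editing_h = pd.DataFrame(
    [[5, 3, 2, 2, 1, 1],
     [3, 2, 4, 3, 2, 1],
     [4, 2, 3, 2, 5, 3],
     [6, 4, 2, 1, 3, 2]],
    index=hours, columns=columns,
)

# Two rows, one column, with a shared time axis.
fig, (ax_revs, ax_users) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

# Top panel: unique revisions per hour, one line per language.
toy_editing_h.xs("sha1", axis=1, level=1).plot(ax=ax_revs)
ax_revs.set_ylabel("Unique revisions per hour")

# Bottom panel: unique users per hour, one line per language.
toy_editing_h.xs("user", axis=1, level=1).plot(ax=ax_users)
ax_users.set_ylabel("Unique users per hour")

fig.tight_layout()
```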

Write a sentence or two about some of the similarities, differences, or other patterns you see in the editing behavior across languages. (1 pt)