# Step 1: Loading the Data

We load the gzip-compressed JSON Lines file produced by the previous notebook and convert it into a columnar Polars DataFrame.

In [1]:
import gzip
import json
from tqdm import tqdm
import polars as pl

revdocs_lines = [json.loads(line) for line in tqdm(gzip.open("/kaggle/input/wikipedia-downloader-extractor-final-stage-1/enwikinews_news_pages.jsonl.gz", "rt"))]
df = pl.from_dicts(revdocs_lines)
del revdocs_lines

df

23956it [00:01, 12342.46it/s]


revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp
i64,i64,i64,str,str,str
4568285,3,0,"""Main Page""","""<templatestyles src=""Main Page…","""2020-06-05T08:22:20Z"""
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z"""
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z"""
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z"""
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z"""
…,…,…,…,…,…
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z"""
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z"""
4804984,3003865,0,"""Music director PVR Raja comple…","""{{delete|Spam}} __NOINDEX__{{d…","""2024-11-16T09:10:37Z"""
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z"""


# Step 2: Fixing issues with the Data

The WikiNews website is designed to be read by people, so its export is not well organized for machine processing. Many pages are administrative or navigational rather than actual articles, and articles do not declare their dates in a consistent format.

The text is also in a format known as WikiText, which needs to be converted to plain text. While the specifics of WikiText are not particularly important, it should be noted that `[[title]]` is a link to the page with title "title", and `{{template}}` or `{{template|arg1|arg2}}` are ways of instantiating a template with one or more string arguments. Much of WikiNews uses these templates to organize things - we simply need to manipulate them in a way that allows us to extract the actual article text.
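
As a rough illustration of the two constructs described above, the sketch below pulls link targets and template invocations out of a short string. The sample text and the names `WIKILINK` and `TEMPLATE` are made up for this example, and the patterns deliberately ignore nested templates.

```python
import re

# Hypothetical sample; real articles are much longer.
sample = "{{date|November 13, 2004}} Talks were held in [[Brazil]] and {{w|Beijing|the capital}}."

# [[target]] or [[target|label]]: capture the link target only.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# {{name}} or {{name|arg1|arg2}}: capture the name and the raw argument string.
# Only matches templates that contain no nested braces.
TEMPLATE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")

links = WIKILINK.findall(sample)
templates = [
    (name.strip(), args.split("|")[1:])  # drop the empty piece before the first "|"
    for name, args in TEMPLATE.findall(sample)
]
print(links)
print(templates)
```

Each template comes back as a `(name, [args])` pair, which is the shape the rest of this notebook relies on when it looks for `date` and `byline`.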

## Issue #1: Main page included in export

The main page of WikiNews is simply a "hub" page that links to the current day's articles, and therefore should be removed.

In [2]:
print(df.filter(df["page_id"] == 3)["page_text"][0])

<templatestyles src="Main Page/minerva.css" />
<!-- DO NOT, UNDER ANY CIRCUMSTANCE, ADD CSS TO THE MAIN PAGE OR ITS TEMPLATES.
     Use [[MediaWiki:Common.css/Main Page]] or [[MediaWiki:Common.css]] for all
     your CSS needs. Else, beware the wrath of SGN -->
{| class="the_table"
!colspan="3"|
{{Main page header}}
|-
|colspan="2" class="lead_big upper_lead"|
<!-- Lead 1 -->
{{Lead article 1}}
|rowspan="3" class="latest_news" id="MainPage_latest_news"|
<!-- Latest News -->
{{Main headlines}}
|-
|class="lead_normal upper_lead"|
<!-- Lead 2 -->
{{Lead article 2}}
|class="lead_normal"|
<!-- Lead 3 -->
{{Lead article 3}}
|-
|class="lead_normal"|
<!-- Lead 4 -->
{{Lead article 4}}
|class="lead_normal"|
<!-- Lead 5 -->
{{Lead article 5}}
|-
|colspan="3" class="portals"|
{{Main page portals}}
|-
|class="main_popular" id="writeAnArticleCell"|<div id="mf-write">{{Main write}}</div>
|class="recent_interviews"|{{Main interviews}}
|class="original_stories"|{{Main original}} 
|-
|colspan="3" class

In [3]:
df = df.filter(df["page_id"] != 3)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp
i64,i64,i64,str,str,str
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z"""
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z"""
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z"""
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z"""
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z"""
…,…,…,…,…,…
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z"""
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z"""
4804984,3003865,0,"""Music director PVR Raja comple…","""{{delete|Spam}} __NOINDEX__{{d…","""2024-11-16T09:10:37Z"""
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z"""


## Issue #2: Article date of creation format

Most articles on WikiNews declare the date at the top of the article by instantiating a "date" template - for example, `{{date|November 30, 2024}}`. We need to extract this date so that it is accessible programmatically and does not remain in the article text. The cell below collects every such template per page and then lists the pages where none was found.
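
A minimal sketch of turning one such template into a real date value, assuming the `Month D, YYYY` layout shown above and English month names in the current locale. The helper name `parse_date_template` is hypothetical and is not used elsewhere in this notebook.

```python
import re
from datetime import date, datetime
from typing import Optional

# A capture group around the date text drops the surrounding template syntax.
DATE_TEMPLATE = re.compile(r"\{\{date\|([A-Za-z0-9,. ]+)\}\}")

def parse_date_template(article_text: str) -> Optional[date]:
    """Return the first {{date|...}} value as a date, or None if absent or unparseable."""
    match = DATE_TEMPLATE.search(article_text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1).strip(), "%B %d, %Y").date()
    except ValueError:
        # The template was present but not in "Month D, YYYY" form.
        return None

print(parse_date_template("{{date|November 30, 2024}} Article text."))
```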

In [4]:
import re
article_publish_date_regex = r"{{date\|[A-Za-z0-9,\. ]+}}"

def get_date_elements(article_text):
    return re.findall(article_publish_date_regex, article_text)

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str), strategy='thread_local').alias('page_dates')
)
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""",[]
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""",[]
1975609,783,0,"""Iran Agrees to Suspend Uranium…","""{{Iran nuclear program}}{{byli…","""2013-08-25T23:36:57Z""",[]
4516766,797,0,"""Brazilian soccer player's moth…","""{{Brazil}}{{byline|date=Novemb…","""2019-09-28T10:59:26Z""",[]
1886435,798,0,"""Colin Powell Resigns as U.S. S…","""{{United States infobox}}[[Fil…","""2013-04-27T00:32:50Z""",[]
…,…,…,…,…,…,…
4804406,3003613,0,"""Potchefstroom university""","""{{delete|not news}} Also known…","""2024-11-10T22:50:18Z""",[]
4804681,3003730,0,"""Filter""","""{{delete|a3}} Your spirit {{h…","""2024-11-13T14:19:39Z""",[]
4804945,3003809,0,"""1 Ceres""","""{{delete|a1}} dwarf planet""","""2024-11-15T15:48:46Z""",[]
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


In [5]:
print(df.filter(df["page_id"] == 741)["page_text"][0])

[[File:Mahmoud abbas.jpg|frame|left|Mahmoud Abbas]] 
{{Palestine}}{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}} Acting president {{w|Rawhi Fattuh|Rawhi Fattuh}} has announced today that [[Palestine|Palestinian]] elections will be held on January 9. Futtuh, head of the {{w|Palestinian National Authority|Palestinian parliament}}, was sworn in hours after the death of {{w|Yasser Arafat|Yasser Arafat}} on Thursday, and Palestinian Basic Law dictates that he may only serve up to two months before elections are held.

New leadership could prove to be the key to revitalizing the peace process in the {{w|Middle East|Middle East}}, as both [[Israel]] and the [[United States]] had refused to work with Arafat.

The ''{{w|Haaretz|Haaretz}}'' had initially reported that former prime minister {{w|Mahmoud Abbas|Mahmoud Abbas}} was selected by the {{w|Fatah|Fatah}} central committee as their candidate for president, but Abbas has denied this, saying, "the matter is still being dis

## Issue #3: Dates declared in byline templates

Some articles declare both a date and a location through the `byline` template instead, for example `{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}}`. The `date` and `location` arguments can appear in either order, and the location is often a wikilink to a page about that place, so the location part of the pattern has to accept more than plain text.
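
To make the "either order" point concrete, here is a sketch that reads the `date` argument out of a byline wherever it sits. The name `byline_date` is hypothetical, and the pattern assumes the byline contains no nested `{{...}}` templates.

```python
import re

# Skip lazily over any brace-free arguments until "date=", then capture up to the next "|" or "}}".
BYLINE_DATE = re.compile(r"\{\{byline\|[^{}]*?\bdate=([A-Za-z0-9,. ]+?)\s*(?:\||\}\})")

def byline_date(article_text: str):
    """Return the raw date string from the first byline template, or None."""
    match = BYLINE_DATE.search(article_text)
    return match.group(1) if match else None

date_first = "{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}}"
location_first = "{{byline|location=[[w:Ramallah|RAMALLAH]]|date=November 14, 2004}}"
print(byline_date(date_first), "/", byline_date(location_first))
```

A single pattern with a lazy skip covers both argument orders, at the cost of being harder to read than two explicit alternatives.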

In [6]:
article_publish_date_regexes = [
    r"{{date\|[A-Za-z0-9,\. ]+}}",
    r"{{byline\|date=[A-Za-z0-9,\. ]+\|location=[^}{]+}}",
    r"{{byline\|location=[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    return re.findall(full_article_publish_date_regex, article_text)

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str)).alias('page_dates')
)
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
1975609,783,0,"""Iran Agrees to Suspend Uranium…","""{{Iran nuclear program}}{{byli…","""2013-08-25T23:36:57Z""",[]
4516766,797,0,"""Brazilian soccer player's moth…","""{{Brazil}}{{byline|date=Novemb…","""2019-09-28T10:59:26Z""",[]
2630053,880,0,"""Australians talk tough ahead o…","""''November 17, Brisbane, Austr…","""2014-05-19T01:52:41Z""",[]
4329095,1027,0,"""Kanchi Shankaracharya Jayendra…","""''Thursday, November 18 2004, …","""2017-07-08T14:31:52Z""",[]
1535509,1037,0,"""Kyoto Treaty becomes legally b…","""''November 18 2004, Nairobi.''…","""2012-06-21T15:27:08Z""",[]
…,…,…,…,…,…,…
4804406,3003613,0,"""Potchefstroom university""","""{{delete|not news}} Also known…","""2024-11-10T22:50:18Z""",[]
4804681,3003730,0,"""Filter""","""{{delete|a3}} Your spirit {{h…","""2024-11-13T14:19:39Z""",[]
4804945,3003809,0,"""1 Ceres""","""{{delete|a1}} dwarf planet""","""2024-11-15T15:48:46Z""",[]
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


In [7]:
print(df.filter(df["page_id"] == 880)["page_text"][0])

''November 17, Brisbane, Australia'' - The Australian [[w:cricket|cricket]] team has brashly talked up their chances of winning the first [[w:Trans-Tasman|trans-Tasman]] test, which is due to begin in [[w:Brisbane|Brisbane]] tomorrow.

Bowler [[w:Shane Warne|Shane Warne]] has stated that he will be targeting New Zealand captain [[w:Stephen Fleming|Stephen Fleming]], [[w:Glenn McGrath|Glenn McGrath]] has revealed that he will be after [[w:Nathan Astle|Nathan Astle]], and Australian batsman [[w:Matthew Hayden|Matthew Hayden]] has vowed to hit New Zealand bowler [[w:Daniel Vettori|Daniel Vettori]] out of the ground.

The New Zealand team's reaction to this Australian bravado was muted.  "It's just history repeating itself", said former New Zealand coach and former test player [[w:John Bracewell|John Bracewell]].  "They just have a set of lines they've been using since I've been coming over here and it's exactly the same story, just a different name saying it. It's just repetitious", he we
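
The article above has no date template at all; it opens with a free-text dateline in italics. The next cell adds a pattern for category links of the form `[[Category:Month Day, Year]]`. The sketch below shows what that pattern accepts and rejects; the sample strings are hypothetical and are not taken from the dataset.

```python
import re

# Same pattern as the category entry added in the next cell.
CATEGORY_DATE = re.compile(r"\[\[Category:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]")

samples = [
    "[[Category:November 17, 2004]]",     # plain date
    "[[Category: November 17th, 2004]]",  # leading space and ordinal suffix
    "[[Category:Cricket]]",               # topic category, not a date
]
matched = [s for s in samples if CATEGORY_DATE.search(s)]
print(matched)
```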

In [8]:
article_publish_date_regexes = [
    r"{{date\|[A-Za-z0-9,\. ]+}}",
    r"{{byline\|date=[A-Za-z0-9,\. ]+\|location=[^}{]+}}",
    r"{{byline\|location=[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
    r"\[\[Category:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    return re.findall(full_article_publish_date_regex, article_text)

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str)).alias('page_dates')
)
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
1975609,783,0,"""Iran Agrees to Suspend Uranium…","""{{Iran nuclear program}}{{byli…","""2013-08-25T23:36:57Z""",[]
4516766,797,0,"""Brazilian soccer player's moth…","""{{Brazil}}{{byline|date=Novemb…","""2019-09-28T10:59:26Z""",[]
4516748,1053,0,"""New Beta Version of MSN Search…","""{{Microsoft}}{{byline|date=Nov…","""2019-09-28T10:07:56Z""",[]
4663471,1181,0,"""Ukraine election results delay…","""{{Ukraine}}{{byline|date=Novem…","""2022-02-26T03:49:45Z""",[]
440494,1255,0,"""US Secretary of Homeland Secur…","""{{Byline|date=November 30, 200…","""2007-06-09T00:31:59Z""",[]
…,…,…,…,…,…,…
4804406,3003613,0,"""Potchefstroom university""","""{{delete|not news}} Also known…","""2024-11-10T22:50:18Z""",[]
4804681,3003730,0,"""Filter""","""{{delete|a3}} Your spirit {{h…","""2024-11-13T14:19:39Z""",[]
4804945,3003809,0,"""1 Ceres""","""{{delete|a1}} dwarf planet""","""2024-11-15T15:48:46Z""",[]
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


## Issue #4: redirect pages

Some redirect pages were included, which start with the text `#REDIRECT` and do nothing other than redirect to a different page when loaded. These should be removed.
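
A small sketch of the check on plain strings, using `re.IGNORECASE` rather than lowercasing the whole page. The helper name `is_redirect` is hypothetical.

```python
import re

# re.match anchors at the start of the string; leading whitespace is tolerated.
REDIRECT = re.compile(r"\s*#redirect", re.IGNORECASE)

def is_redirect(page_text: str) -> bool:
    """True if the page text begins with a #REDIRECT directive in any letter case."""
    return REDIRECT.match(page_text) is not None

print(is_redirect("#REDIRECT: [[Main Page]]"))
```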

In [9]:
print(df.filter(df["page_id"] == 1354)["page_text"][0])

#REDIRECT: [[Main Page]]

[[Category:Protected mainspace redirects]]
[[Category:Non-news mainspace redirects]]


In [10]:
df = df.filter(df["page_text"].map_elements(lambda p: re.search(r"^\s*#redirect", p.lower()) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|November 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|November 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=November 15, 2004|location=[[Melbourne|MELBOURNE]], [[Victoria, Australia|Victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|November 13, 2004}}""]"
…,…,…,…,…,…,…
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|November 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|November 15, 2024}}""]"
4804984,3003865,0,"""Music director PVR Raja comple…","""{{delete|Spam}} __NOINDEX__{{d…","""2024-11-16T09:10:37Z""","[""{{date|August 5, 2022}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


## Issue #5: pages marked for deletion

Some pages included are either deleted or marked for deletion from WikiNews. This is implemented by using the `delete` template with the syntax: `{{delete|<deletion reason>}}`. (This template expands into a large red banner saying "This page has been marked for deletion for reason \<reason\>" on the actual website.) We should remove any page containing this template from our dataset.

Alternative forms of the deletion banner include `{{Delete|<deletion reason>}}` (the case of the first character doesn't matter) and `{{speedy|<deletion reason>}}` for [speedy deletion](https://en.wikinews.org/wiki/Wikinews:Criteria_for_speedy_deletion).
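
The variants above can be folded into one pattern. This is a sketch only: it assumes the first letter of `speedy` is case-insensitive in the same way as `delete`, and the helper name `is_marked_for_deletion` is hypothetical.

```python
import re

# First letter of the template name in either case, followed by a non-empty reason.
DELETION = re.compile(r"\{\{(?:[Dd]elete|[Ss]peedy)\|[^{}]+\}\}")

def is_marked_for_deletion(page_text: str) -> bool:
    """True if the page contains a delete or speedy template with a reason."""
    return DELETION.search(page_text) is not None

print(is_marked_for_deletion("{{Delete|1=Author's request}} Some text"))
```

The pattern sticks to basic regex syntax, so it should also be usable as the argument to Polars' `str.contains`, although that has not been checked here.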

In [11]:
print(df.filter(df["page_id"] == 3002171)["page_text"][0])

{{delete|a3}}
{{date|September 1, 1976}}

On Wednesday, Don Felder got stuck at Hotel California, where he checked in and stayed for the night. A woman stood in the doorway and showed him the way, and voices in the distance said, “Welcome to the Hotel California, what a lovely place.”

{{haveyoursay}}

== Sources ==
<!--Use at least two independent sources.-->
*{{source
|url        = 
|title      = 
|author     = 
|pub        = 
|date       = e.g. December 31, 1999
|archiveurl = 
}}
*{{source
|url        = 
|title      = 
|author     = 
|pub        = 
|date       = 
|archiveurl = 
}}


In [12]:
deletion_regexes = [
    r"\{\{delete\|[^\}\{]+\}\}",
    r"\{\{Delete\|[^\}\{]+\}\}",
    r"\{\{speedy\|[^\}\{]+\}\}"
]
full_deletion_regex = "|".join(deletion_regexes)

df.filter(df["page_text"].str.contains(full_deletion_regex))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4800971,3002171,0,"""Eagles member Felder gets stuc…","""{{delete|a3}} {{date|September…","""2024-10-09T13:15:00Z""","[""{{date|September 1, 1976}}""]"
4801814,3002220,0,"""Hotel California hasn’t had “t…","""{{delete|a3}} {{date|June 21, …","""2024-10-16T13:56:17Z""","[""{{date|June 21, 1974}}""]"
4802559,3002329,0,"""Florida officials warn hurrica…","""{{Delete|1=Author's request}} …","""2024-10-22T00:02:19Z""","[""{{date|October 12, 2024}}""]"
4801449,3002361,0,"""How to Craft Effective Global …","""{{speedy|spam}} == '''''How to…","""2024-10-13T14:35:56Z""",[]
4801716,3002384,0,"""Drone attack hurts more than 6…","""{{delete|Author request please…","""2024-10-15T18:01:10Z""","[""{{date|October 13, 2024}}""]"
…,…,…,…,…,…,…
4804681,3003730,0,"""Filter""","""{{delete|a3}} Your spirit {{h…","""2024-11-13T14:19:39Z""",[]
4804682,3003731,0,"""Number of seconds in days""","""{{delete|a3}} {{date|November …","""2024-11-13T14:19:54Z""","[""{{date|November 13, 2024}}""]"
4804783,3003774,0,"""Sdad""","""{{delete|nonsense}} {{develop}…","""2024-11-14T00:38:38Z""","[""{{date|November 13, 2024}}""]"
4804945,3003809,0,"""1 Ceres""","""{{delete|a1}} dwarf planet""","""2024-11-15T15:48:46Z""",[]


In [13]:
df = df.filter(~df["page_text"].str.contains(full_deletion_regex))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|November 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|November 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=November 15, 2004|location=[[Melbourne|MELBOURNE]], [[Victoria, Australia|Victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|November 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|November 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|November 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|November 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


## Issue #6: Bylines that link to Wikipedia articles

Sometimes, bylines contain a conditional link to Wikipedia. These are represented as templates: `{{w|<link target>|anchor=<section>|<link label>}}`. This can still be parsed with regexes since these conditional links cannot be nested - we can add extra cases to handle these conditional links within the list of regexes we use to find dates of creation for pages.
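
An alternative to adding regex cases, sketched here for comparison and not the approach this notebook takes, is to flatten each `{{w|...}}` template to its visible label first, so that the byline no longer contains nested braces. The helper name `flatten_w_links` is hypothetical.

```python
import re

# {{w|target}}, {{w|target|label}} or {{w|target|anchor=section|label}}; no nesting.
W_TEMPLATE = re.compile(r"\{\{w\|([^{}]+)\}\}", re.IGNORECASE)

def flatten_w_links(text: str) -> str:
    """Replace each {{w|...}} template with its last non-anchor argument."""
    def visible_label(match):
        args = [a for a in match.group(1).split("|") if not a.startswith("anchor=")]
        return args[-1] if args else ""
    return W_TEMPLATE.sub(visible_label, text)

byline = "{{byline|date=November 15, 2004|location={{w|Praia Grande|PRAIA GRANDE}}, [[Brazil]]}}"
flat = flatten_w_links(byline)
print(flat)
```

After flattening, the simpler byline patterns from Issue #3 would apply unchanged; the trade-off is an extra pass over every article.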

In [14]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
1975609,783,0,"""Iran Agrees to Suspend Uranium…","""{{Iran nuclear program}}{{byli…","""2013-08-25T23:36:57Z""",[]
4516766,797,0,"""Brazilian soccer player's moth…","""{{Brazil}}{{byline|date=Novemb…","""2019-09-28T10:59:26Z""",[]
4516748,1053,0,"""New Beta Version of MSN Search…","""{{Microsoft}}{{byline|date=Nov…","""2019-09-28T10:07:56Z""",[]
4663471,1181,0,"""Ukraine election results delay…","""{{Ukraine}}{{byline|date=Novem…","""2022-02-26T03:49:45Z""",[]
440494,1255,0,"""US Secretary of Homeland Secur…","""{{Byline|date=November 30, 200…","""2007-06-09T00:31:59Z""",[]
…,…,…,…,…,…,…
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


In [15]:
print(df.filter(df["page_id"] == 797)["page_text"][0])

{{Brazil}}{{byline|date=November 15, 2004|location={{w|Praia Grande|PRAIA GRANDE}}, [[Brazil]]}} Miss Marina Souza, aged 43, was kidnapped last Saturday, November 6, in Praia Grande, {{w|São Paulo|São Paulo}}, during a barbecue with her relatives. She is the mother of Robson de Souza, known as {{w|Robinho|Robinho}} (Little Robin), a Brazilian {{w|soccer|soccer}} player for the {{w|Santos FC|Santos Football Club}}. Since the incident, Robson De Souza ('Robinho') has made few public appearances and stopped playing soccer, troubling his team Santos.  "I hope this all finishes well and that I can go back to playing football again," he was quoted by ''{{w|Reuters|Reuters}}'' on November 9.

Robinho is considered one of the best Brazilian forward players at the present time and he is an important player for Santos.

The authorities are cautious to release any information concerning the case, as this could jeopardize both the investigations and Ms. Souza's life.


{{haveyoursay}}
== Sources =

In [16]:
article_publish_date_regexes = [
    r"{{date\|[A-Za-z0-9,\. ]+}}",
    r"{{byline\|date=[A-Za-z0-9,\. ]+\|[^}{]+}}",
    r"{{byline\|date=[A-Za-z0-9,\. ]+\|[^}{]*{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*}}",
    r"{{byline\|[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
    r"{{byline\|[^}{]+{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*\|date=[A-Za-z0-9,\. ]+}}",
    r"\[\[Category:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

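def get_date_elements(article_text):
    # The patterns now contain capture groups, so re.findall would return tuples of groups;
    # finditer with group(0) returns the full matched text instead.
    return [m.group(0) for m in re.finditer(full_article_publish_date_regex, article_text)]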

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.Object).alias('page_dates')
)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","['{{date|November 13, 2004}}']"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","['{{byline|date=November 14, 2004|location=[[w:Ramallah|RAMALLAH]]}}']"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","['{{date|November 13, 2004}}']"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","['{{byline|date=November 15, 2004|location=[[Melbourne|MELBOURNE]], [[Victoria, Australia|Victoria]]}}']"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","['{{date|November 13, 2004}}']"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","['{{date|November 13, 2024}}']"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","['{{date|November 12, 2024}}']"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","['{{date|November 15, 2024}}']"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


In [17]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
440494,1255,0,"""US Secretary of Homeland Secur…","""{{Byline|date=November 30, 200…","""2007-06-09T00:31:59Z""",[]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
4490582,1969,0,"""Draw for Champions League roun…","""{{Byline|date=December 17, 200…","""2019-07-17T18:13:45Z""",[]
4678714,2524,0,"""Digest/14December2004""","""<div style=""font-size:190%;"">D…","""2022-05-29T07:40:42Z""",[]
809056,2588,0,"""Main Page/Sandbox3""","""Hello all. Is this a sandbox? …","""2009-04-23T12:54:33Z""",[]
…,…,…,…,…,…,…
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""",[]


## Issue #7: Other case-insensitive templates

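It turns out that the `date` and `byline` template names are also case-insensitive in their first letter, as is the `Category` prefix. The cell below handles this by lowercasing the article text before matching, which makes the explicit `(d|D)`-style alternations in the patterns redundant. As a side effect, the strings stored in `page_dates` are lowercased too.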
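
If the original capitalization of the matched text matters, one option is to compile the pattern with `re.IGNORECASE` and leave the text untouched. This is a sketch on a hypothetical sample string, shown for the `date` template only.

```python
import re

# Case-insensitive matching without altering the searched text.
DATE_TEMPLATE_CI = re.compile(r"\{\{date\|[A-Za-z0-9,. ]+\}\}", re.IGNORECASE)

text = "{{review}} {{Date|November 18, 2024}} Some article text."
matches = [m.group(0) for m in DATE_TEMPLATE_CI.finditer(text)]
print(matches)
```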

In [18]:
article_publish_date_regexes = [
    r"{{(d|D)ate\|[A-Za-z0-9,\. ]+}}",
    r"{{(b|B)yline\|date=[A-Za-z0-9,\. ]+\|[^}{]+}}",
    r"{{(b|B)yline\|date=[A-Za-z0-9,\. ]+\|[^}{]*{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*}}",
    r"{{(b|B)yline\|[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
    r"{{(b|B)yline\|[^}{]+{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*\|date=[A-Za-z0-9,\. ]+}}",
    r"\[\[(c|C)ategory:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    # Matching against the lowercased text means the returned strings are lowercased as well.
    return [m.group(0) for m in re.finditer(full_article_publish_date_regex, article_text.lower())]

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.Object).alias('page_dates')
)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","['{{date|november 13, 2004}}']"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","['{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}']"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","['{{date|november 13, 2004}}']"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","['{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}']"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","['{{date|november 13, 2004}}']"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","['{{date|november 13, 2024}}']"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","['{{date|november 12, 2024}}']"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","['{{date|november 15, 2024}}']"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","['{{date|november 18, 2024}}']"


## Issue #8: Digest pages

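Some pages in the main namespace are digests, which summarize and link to the articles from a date range rather than being articles themselves. Their titles start with `Digest/`, so we can remove them by title.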
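
A sketch of the title test on plain strings, assuming every digest title starts with the `Digest/` prefix as in the listing below. The helper name `is_digest_title` is hypothetical.

```python
def is_digest_title(title: str) -> bool:
    """True for titles such as 'Digest/14December2004'."""
    return title.startswith("Digest/")

titles = ["Digest/14December2004", "Main Page/Sandbox3", "Shropshire"]
kept = [t for t in titles if not is_digest_title(t)]
print(kept)
```

A prefix test is stricter than an unanchored regex search, which would also match `Digest/` in the middle of a title. In Polars the equivalent expression would be `pl.col("page_title").str.starts_with("Digest/")`.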

In [19]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
4678714,2524,0,"""Digest/14December2004""","""<div style=""font-size:190%;"">D…","""2022-05-29T07:40:42Z""",[]
809056,2588,0,"""Main Page/Sandbox3""","""Hello all. Is this a sandbox? …","""2009-04-23T12:54:33Z""",[]
4539327,2619,0,"""Main Page/Sandbox4""","""{|{|cellpadding=""4"" cellspacin…","""2020-01-09T23:38:47Z""",[]
527576,2969,0,"""Digest/1January2005exp""","""<table valign=top><tr><td widt…","""2007-11-27T00:24:42Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [20]:
print(df.filter(df["page_id"] == 2524)["page_text"][0])

<div style="font-size:190%;">Digest for 14-25 December 2004</div>

< [[Digest/6December2004|6-13 December 2004]] • [[Wikinews:Digests|Index]] • [[Digest/26December2004|26-31 December 2004 >]]
__NOTOC__
Articles dated 14 to 25 December 2004 are included in the compilation below.

==Articles==
'''[[Mumbai officials demolish 39K shanties; 200K homeless.]]'''<br /> 
25 December 2004<br /> 
India government demolishes over 6,000 shanties in a push to eradicate the capital city's slums.

'''[[Former Indian PM Narasimha Rao passes away]]'''<br /> 
23 December 2004<br /> 
Former Indian PM Narasimha Rao passed away after suffering cardiac arrest in a private hospital in New Delhi.

'''[[Zambian government launches a new agricultural policy]]<br />
22 December 2oo4 <br />
African nation moves to modernize food production infrastructure.

'''[[Mozambique's ruling party retains control in elections]]'''<br /> 
22 December 2004<br /> 
Elections in the Southern African nation Mozambique have resulte

In [21]:
df.filter(df["page_title"].map_elements(lambda p: re.search(r"\s*Digest/[^ ]\s*", p) is not None, return_dtype=bool))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
1389307,1445,0,"""Digest/29November2004""","""{{date|November 29, 2004}} <Br…","""2012-02-01T01:12:28Z""","['{{date|november 29, 2004}}']"
810579,1446,0,"""Digest/22November2004""","""{{date|November 23, 2004}} <d…","""2009-04-26T15:32:32Z""","['{{date|november 23, 2004}}']"
514779,2242,0,"""Digest/6December2004""","""{{date|December 6, 2004}} <div…","""2007-11-02T02:15:54Z""","['{{date|december 6, 2004}}']"
4678714,2524,0,"""Digest/14December2004""","""<div style=""font-size:190%;"">D…","""2022-05-29T07:40:42Z""",[]
514184,2526,0,"""Digest/26December2004""","""{{date|December 31, 2004}} <di…","""2007-11-01T03:25:43Z""","['{{date|december 31, 2004}}']"
594869,2639,0,"""Digest/1January2005""","""{{date|January 2, 2005}} < [[…","""2008-03-11T20:56:58Z""","['{{date|january 2, 2005}}']"
527576,2969,0,"""Digest/1January2005exp""","""<table valign=top><tr><td widt…","""2007-11-27T00:24:42Z""",[]
515719,3137,0,"""Digest/11January2005""","""{{date|January 11, 2005}} < […","""2007-11-03T20:43:24Z""","['{{date|january 11, 2005}}']"


In [22]:
df = df.filter(df["page_title"].map_elements(lambda p: re.search(r"\s*Digest/[^ ]\s*", p) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","['{{date|november 13, 2004}}']"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","['{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}']"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","['{{date|november 13, 2004}}']"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","['{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}']"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","['{{date|november 13, 2004}}']"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","['{{date|november 13, 2024}}']"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","['{{date|november 12, 2024}}']"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","['{{date|november 15, 2024}}']"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","['{{date|november 18, 2024}}']"


## Issue #9: Main Page Sandboxes

There are "sandbox" versions of the main page that can safely be deleted.

In [23]:
df.filter(df["page_title"].map_elements(lambda p: re.search(".*Sandbox.*", p) is not None, return_dtype=bool))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
809056,2588,0,"""Main Page/Sandbox3""","""Hello all. Is this a sandbox? …","""2009-04-23T12:54:33Z""",[]
4539327,2619,0,"""Main Page/Sandbox4""","""{|{|cellpadding=""4"" cellspacin…","""2020-01-09T23:38:47Z""",[]
308595,26966,0,"""Main Page/WideSandbox3""","""[[Category:No publish]] <!-- P…","""2006-09-05T22:43:17Z""",[]


In [24]:
df = df.filter(df["page_title"].map_elements(lambda p: re.search(".*Sandbox.*", p) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","['{{date|november 13, 2004}}']"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","['{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}']"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","['{{date|november 13, 2004}}']"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","['{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}']"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","['{{date|november 13, 2004}}']"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","['{{date|november 13, 2024}}']"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","['{{date|november 12, 2024}}']"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","['{{date|november 15, 2024}}']"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","['{{date|november 18, 2024}}']"


## Issue #10: More date formats

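The `date` template comes in more variants: `{{date|January 19, 2005|2005}}`, `{{date|1=January 26, 2005}}`, and `{{date|1=January 26, 2005|2=2005}}` are also valid. Pages using them currently show up with no detected date, so the patterns need to be extended.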
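
As a quick sanity check, the sketch below runs the three variants through a simplified pattern using only Python's `re` module. The name `DATE_TEMPLATE` and the pattern itself are illustrative assumptions, not the notebook's exact regexes.

```python
import re

# Hypothetical, simplified pattern: a {{date|...}} template with one or two
# arguments, each of which may carry a "1=" / "2=" style key.
DATE_TEMPLATE = re.compile(r"\{\{date\|[a-z0-9=,. ]+(\|[a-z0-9=,. ]+)?\}\}")

samples = [
    "{{date|January 19, 2005|2005}}",
    "{{date|1=January 26, 2005}}",
    "{{date|1=January 26, 2005|2=2005}}",
]

# Lowercase first, as the notebook does, so one pattern also covers {{Date|...}}.
matches = [DATE_TEMPLATE.search(s.lower()) is not None for s in samples]
print(matches)
```

Lowercasing before matching keeps the pattern short, at the cost of returning the matched tag in lowercase.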

In [25]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,object
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
4412005,3331,0,"""2005/01/19 Pacific Northwest s…","""{{date|January 19, 2005|2005}}…","""2018-06-09T02:24:15Z""",[]
3888809,3697,0,"""Iraq: Marines killed in helico…","""{{date|1=January 26, 2005}} A…","""2015-10-11T17:39:29Z""",[]
527593,3882,0,"""Crosswords/2005/February""",""";< [[Crosswords/2005/January|J…","""2007-11-27T00:41:37Z""",[]
1089693,4618,0,"""'The Gates' opens in New York …","""__NOTOC__ {{byline| date=Febr…","""2010-09-06T20:26:56Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [26]:
# Raw strings keep backslash escapes such as \| intact for the regex engine
# and avoid Python's invalid-escape-sequence warnings.
article_publish_date_regexes = [
    r"{{(d|D)ate\|[A-Za-z0-9=,\. ]+}}",
    r"{{(d|D)ate\|[A-Za-z0-9=,\. ]+\|[A-Za-z0-9=,\. ]+}}",
    r"{{(b|B)yline\|date=[A-Za-z0-9,\. ]+\|[^}{]+}}",
    r"{{(b|B)yline\|date=[A-Za-z0-9,\. ]+\|[^}{]*{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*}}",
    r"{{(b|B)yline\|[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
    r"{{(b|B)yline\|[^}{]+{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*\|date=[A-Za-z0-9,\. ]+}}",
    r"\[\[(c|C)ategory:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    return [m.group(0) for m in re.finditer(full_article_publish_date_regex, article_text.lower())]

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str)).alias('page_dates')
)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #11: Crossword Puzzles

WikiNews also has crossword puzzles, which we have to remove (since they are not news articles).
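
The title-based exclusions used so far can be gathered into a single helper. The sketch below is one possible arrangement using plain Python; the names `NON_ARTICLE_TITLE_PATTERNS` and `is_non_article_title` are hypothetical, and the patterns are simplified versions of the ones in this notebook.

```python
import re

# Title patterns for non-article pages seen so far in this notebook.
# Combining them into one list is an illustrative choice, not the notebook's code.
NON_ARTICLE_TITLE_PATTERNS = [
    r"Digest/[^ ]",   # daily digest pages
    r"Sandbox",       # main-page sandboxes
    r"^Crosswords/",  # crossword index and puzzle pages
]
NON_ARTICLE_TITLE_RE = re.compile("|".join(NON_ARTICLE_TITLE_PATTERNS))


def is_non_article_title(title: str) -> bool:
    """Return True when the title matches any known non-article pattern."""
    return NON_ARTICLE_TITLE_RE.search(title) is not None


titles = [
    "Crosswords/2005/February",
    "Main Page/Sandbox3",
    "Digest/29November2004",
    "2024 ARPS Conference",
]
kept = [t for t in titles if not is_non_article_title(t)]
print(kept)
```

Filtering on the title matters for crosswords in particular, because the individual puzzle pages carry a `{{date|...}}` tag and would otherwise pass a date-based check.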

In [27]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
527593,3882,0,"""Crosswords/2005/February""",""";< [[Crosswords/2005/January|J…","""2007-11-27T00:41:37Z""",[]
1089693,4618,0,"""'The Gates' opens in New York …","""__NOTOC__ {{byline| date=Febr…","""2010-09-06T20:26:56Z""",[]
4520272,5131,0,"""Princeton media class discusse…","""{{WikimediaMention}} {{datelin…","""2019-10-09T00:27:06Z""",[]
434369,5140,0,"""Bank of America declares 1.2 m…","""{{dateline|date=February 28, 2…","""2007-06-03T09:15:32Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [28]:
print(df.filter(df["page_id"] == 3882)["page_text"][0])

;< [[Crosswords/2005/January|January crosswords]]
Quick crosswords for February (solutions are on following days):

==February==
* [[Crosswords/2005/February/1|1 February 2005]]
* [[Crosswords/2005/February/2|2 February 2005]]
* [[Crosswords/2005/February/3|3 February 2005]]
* [[Crosswords/2005/February/4|4 February 2005]]
* [[Crosswords/2005/February/5|5 February 2005]]
* [[Crosswords/2005/February/6|6 February 2005]]
* [[Crosswords/2005/February/7|7 February 2005]]
* [[Crosswords/2005/February/8|8 February 2005]]
* [[Crosswords/2005/February/9|9 February 2005]]
* [[Crosswords/2005/February/10|10 February 2005]]
* [[Crosswords/2005/February/11|11 February 2005]]
* [[Crosswords/2005/February/12|12 February 2005]]
* [[Crosswords/2005/February/13|13 February 2005]]
* [[Crosswords/2005/February/14|14 February 2005]]
* [[Crosswords/2005/February/15|15 February 2005]]
* [[Crosswords/2005/February/16|16 February 2005]]
* [[Crosswords/2005/February/17|17 February 2005]]
* [[Crosswords/2005/Fe

In [29]:
df.filter(df["page_title"].map_elements(lambda p: re.search("^Crosswords/.*$", p) is not None, return_dtype=bool))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
515766,3726,0,"""Crosswords/2005/January/27""","""{{date|January 27, 2005}} Fee…","""2007-11-03T21:30:24Z""","[""{{date|january 27, 2005}}""]"
515752,3771,0,"""Crosswords/2005/January/28""","""{{date|January 28, 2007}} Fee…","""2007-11-03T21:22:55Z""","[""{{date|january 28, 2007}}""]"
515737,3808,0,"""Crosswords/2005/January/29""","""{{date|January 29, 2005}} Fee…","""2007-11-03T21:08:00Z""","[""{{date|january 29, 2005}}""]"
517667,3850,0,"""Crosswords/2005/January/30""","""{{date|January 30, 2005}} Fee…","""2007-11-07T03:14:58Z""","[""{{date|january 30, 2005}}""]"
517677,3877,0,"""Crosswords/2005/January/31""","""{{date|January 31, 2005}} Fee…","""2007-11-07T03:18:45Z""","[""{{date|january 31, 2005}}""]"
…,…,…,…,…,…,…
515729,5619,0,"""Crosswords/2005/March/12""","""{{date|March 12, 2005}} {{cro…","""2007-11-03T21:01:37Z""","[""{{date|march 12, 2005}}""]"
516197,5976,0,"""Crosswords/2005/March/19""","""{{date|March 19, 2005}} {{cro…","""2007-11-04T21:08:26Z""","[""{{date|march 19, 2005}}""]"
517670,7078,0,"""Crosswords/2005/April/2""","""{{date|April 2, 2005}} {{cros…","""2007-11-07T03:16:38Z""","[""{{date|april 2, 2005}}""]"
525647,18191,0,"""Crosswords/2005/August/11""","""{{date|August 11, 2005}} {{cro…","""2007-11-23T00:40:36Z""","[""{{date|august 11, 2005}}""]"


In [30]:
df = df.filter(df["page_title"].map_elements(lambda p: re.search("^Crosswords/.*$", p) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #12: More date formats (again)

- The `dateline` template provides another way for articles to define dates. For our purposes, it is identical to the `byline` template.
- The `byline` template can also have whitespace around its keyword arguments.
- There are also some general inconsistencies in spacing inside these templates.
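
To make the points above concrete, here is a minimal sketch that pulls the `date=` value out of either template while tolerating extra whitespace. The pattern `BYLINE_DATE` is a simplified assumption, and the second sample string is a constructed illustration rather than text taken from a page.

```python
import re

# Hypothetical simplified pattern: byline or dateline, optional whitespace
# around the "|" separator, and date= as the first keyword argument.
BYLINE_DATE = re.compile(
    r"\{\{(?:byline|dateline)\s*\|\s*date=([a-z0-9,. ]+?)\s*(?:\||\}\})"
)

samples = [
    "{{dateline|date=February 28, 2005|location=Charlotte, North Carolina}}",
    "{{byline| date=February 16, 2005 |location=New York, New York}}",
]

dates = []
for s in samples:
    m = BYLINE_DATE.search(s.lower())
    # group(1) is the date text without surrounding whitespace.
    dates.append(m.group(1) if m else None)
print(dates)
```

This sketch only handles `date=` in the first position; the notebook's regex list also covers the case where other arguments come first.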

In [31]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
1089693,4618,0,"""'The Gates' opens in New York …","""__NOTOC__ {{byline| date=Febr…","""2010-09-06T20:26:56Z""",[]
4520272,5131,0,"""Princeton media class discusse…","""{{WikimediaMention}} {{datelin…","""2019-10-09T00:27:06Z""",[]
434369,5140,0,"""Bank of America declares 1.2 m…","""{{dateline|date=February 28, 2…","""2007-06-03T09:15:32Z""",[]
1884991,5148,0,"""Romania announces 18 percent i…","""[[Image:CJROCetatuie 2.jpg|thu…","""2013-04-25T14:39:41Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [32]:
print(df.filter(df["page_id"] == 5140)["page_text"][0])

{{dateline|date=February 28, 2005|location=Charlotte, North Carolina}} One of the biggest domestic banks in the [[w:United States|United States]], [[w:Bank of America|Bank of America]], has admitted to losing computer tapes containing 1.2 million federal employee accounts,  including the accounts of several U.S. senators, in a statement by the bank. According to the [[w:Pentagon|Pentagon]], most of the accounts belong to  staff and civilians in the [[w:Department of Defense|Department of Defense]]. The bank said the tapes were lost in December 2004 as they were being transported to a  data back-up centre by a commercial plane.

Currently, the U.S. [[W:Secret Service|Secret Service]] are looking in to the matter, a federal agency whose brief includes investigations of serious financial crime such as this. All parties concerned are worrying about possible [[w:identity theft|identity theft]] as it contained valuable information such as bank account numbers, names and addresses.   

== Sou

In [33]:
# Raw strings keep backslash escapes such as \| intact for the regex engine
# and avoid Python's invalid-escape-sequence warnings.
article_publish_date_regexes = [
    r"{{(d|D)ate\|[A-Za-z0-9=,\. ]+}}",
    r"{{(d|D)ate\|[A-Za-z0-9=,\. ]+\|[A-Za-z0-9=,\. ]+}}",
    r"{{((b|B)yline|(d|D)ateline)\|date=[A-Za-z0-9,\. ]+\|[^}{]+}}",
    r"{{((b|B)yline|(d|D)ateline)\|date=[A-Za-z0-9,\. ]+\|[^}{]*{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*}}",
    r"{{((b|B)yline|(d|D)ateline)\|[^}{]+\|date=[A-Za-z0-9,\. ]+}}",
    r"{{((b|B)yline|(d|D)ateline)\|[^}{]+{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*\|date=[A-Za-z0-9,\. ]+}}",
    r"\[\[(c|C)ategory:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    return [m.group(0) for m in re.finditer(full_article_publish_date_regex, article_text.lower())]

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str)).alias('page_dates')
)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #13: More date formats (again)

It turns out that WikiText template tags can also span multiple lines. We can still parse the date correctly by removing newlines from the article text before looking for date tags.
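
The effect of stripping newlines can be seen on a small example. The sketch below uses a deliberately strict, hypothetical pattern (`BYLINE`, with no whitespace tolerance) and a helper named `flatten`, both of which are assumptions for illustration only.

```python
import re

# Hypothetical strict pattern with no whitespace tolerance, so the effect of
# removing line breaks is visible on its own.
BYLINE = re.compile(r"\{\{byline\|date=[a-z0-9,. ]+\|[^}{]+\}\}")

# A byline template spread over three lines.
raw = "__NOTOC__\n\n{{byline|\ndate=February 16, 2005|\nlocation=New York, New York}}"


def flatten(text: str) -> str:
    """Lowercase the text and drop line breaks so a multi-line template becomes one line."""
    return text.lower().replace("\n", "").replace("\r", "")


before = BYLINE.search(raw.lower())  # fails: a newline sits between "|" and "date="
match = BYLINE.search(flatten(raw))  # succeeds once line breaks are removed
print(before, match.group(0) if match else None)
```

Removing line breaks is safe here because only the template tags are extracted from the flattened text; the article body itself is left untouched.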

In [34]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
1089693,4618,0,"""'The Gates' opens in New York …","""__NOTOC__ {{byline| date=Febr…","""2010-09-06T20:26:56Z""",[]
1321145,6221,0,"""Market Data""","""[[Image:Stop hand.svg|left|50p…","""2011-11-12T22:32:27Z""",[]
4562001,6222,0,"""Market Data/^DJI""","""{{historical}} ===Latest data=…","""2020-04-30T02:40:00Z""",[]
4561994,6314,0,"""Market Data/Homepage""","""{{historical}}<div style=""bord…","""2020-04-30T02:35:34Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [35]:
print(df.filter(df["page_id"] == 4618)["page_text"][0])

__NOTOC__

{{byline|
date=February 16, 2005|
location=New York, New York}}
On [[w:February 12|February 12]], [[w:2005|2005]], at 8:30 a.m., New York Mayor [[w:Michael Bloomberg|Michael Bloomberg]] dropped the first piece of fabric in '''[[w:The Gates|The Gates]]''', a [[w:Land art|land art]] project by [[w:Christo|Christo]] and [[w:Jeanne-Claude Denat de Guillebon|Jeanne Claude]].

The artists installed 7,500 metal "gates" along 23 miles of pathways in [[w:New York City|New York City's]] [[w:Central Park|Central Park]]. Each gate supported a flag-shaped piece of saffron fabric. The project is scheduled to run from [[w:February 12, 2005|Feb. 12]], [[w:2005|2005]] through [[w:February 27|Feb. 27]], [[w:2005|2005]]. The project is sometimes referred to as "''The Gates'', Central Park, New York, 1979-2005" in reference to the time between the artists' initial proposal and the present day.

== Installation ==

The installation of the project began on [[w:January 3|January 3]], [[w:2005|2005

In [36]:
article_publish_date_regexes = [
    r"{{(d|D)ate\s*\|[A-Za-z0-9=,\. ]+(\|?)}}",
    r"{{(d|D)ate\s*\|[A-Za-z0-9=,\. ]+\|[A-Za-z0-9=,\. ]+}}",
    r"{{((b|B)yline|(d|D)ateline)\s*\|\s*date=[A-Za-z0-9,\. ]+\s*\|\s*[^}{]+}}",
    r"{{((b|B)yline|(d|D)ateline)\s*\|\s*date=[A-Za-z0-9,\. ]+\s*\|\s*[^}{]*{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*}}",
    r"{{((b|B)yline|(d|D)ateline)\s*\|\s*[^}{]+\s*\|\s*date=[A-Za-z0-9,\. ]+}}",
    r"{{((b|B)yline|(d|D)ateline)\s*\|\s*[^}{]+{{w\|[^}{]+(\|[^}{]+)?(\|[^}{]+)?}}[^}{]*\s*\|\s*date=[A-Za-z0-9,\. ]+}}",
    r"\[\[(c|C)ategory:\s*[A-Za-z]+\s*[0-9]+[A-Za-z]{0,2},\s*[0-9]+\]\]",
]
full_article_publish_date_regex = "|".join(article_publish_date_regexes)

def get_date_elements(article_text):
    return [m.group(0) for m in re.finditer(full_article_publish_date_regex, article_text.lower().replace("\n", "").replace("\r", ""))]

df = df.with_columns(
    pl.col("page_text").map_elements(get_date_elements, return_dtype=pl.List(str)).alias('page_dates')
)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #14: Disambiguation pages & "No Publish" pages

Some pages are marked as not published, and others are disambiguation pages that point the reader to one of several pages sharing a name. Neither kind is a news article, so both can be removed.
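
A minimal sketch of detecting such pages from their text follows. The marker list is a reduced version of what these pages contain, and compiling with `re.IGNORECASE` is an assumption added here so that `{{Mainspace disambig}}` matches regardless of capitalization; the helper name is hypothetical.

```python
import re

# Markers that identify disambiguation and "No publish" pages.
# re.IGNORECASE is an assumption of this sketch, not part of the notebook's code.
MARKERS = [
    r"\{\{\s*mainspace disambig\s*\}\}",
    r"__DISAMBIG__",
    r"\[\[\s*Category\s*:\s*No publish\s*\]\]",
    r"\{\{\s*nopublish\s*\}\}",
]
MARKER_RE = re.compile("|".join(MARKERS), re.IGNORECASE)


def is_unpublished_or_disambig(page_text: str) -> bool:
    """Return True when the page text contains any of the markers."""
    return MARKER_RE.search(page_text) is not None


disambig_page = "{{Mainspace disambig}}\n__DISAMBIG__\n\n[[Category:No publish]]"
article_page = "{{date|November 13, 2004}} Some article text."

print(is_unpublished_or_disambig(disambig_page))
print(is_unpublished_or_disambig(article_page))
```

Because the check runs over the full page text, an article that merely quotes one of these markers would also be dropped, which is worth keeping in mind when reviewing the filtered rows.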

In [37]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
1321145,6221,0,"""Market Data""","""[[Image:Stop hand.svg|left|50p…","""2011-11-12T22:32:27Z""",[]
4562001,6222,0,"""Market Data/^DJI""","""{{historical}} ===Latest data=…","""2020-04-30T02:40:00Z""",[]
4561994,6314,0,"""Market Data/Homepage""","""{{historical}}<div style=""bord…","""2020-04-30T02:35:34Z""",[]
4562016,6316,0,"""Market Data/^MERV""","""{{historical}} The MerVal inde…","""2020-04-30T03:06:21Z""",[]
…,…,…,…,…,…,…
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]
4784575,2995324,0,"""Main Page 2""","""<templatestyles src=""Main Page…","""2024-06-05T20:30:21Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [38]:
print(df.filter(df["page_id"] == 2971496)["page_text"][0])

{{mainspace disambig}}
__DISAMBIG__

[[Category:No publish]]
[[Category:Protected disambiguation pages]]


In [39]:
deletion_regexes = [
    r"{{\s*mainspace disambig\s*}}",
    r"__DISAMBIG__",
    r"\[\[\s*Category\s*:\s*No publish\s*\]\]",
    r"\[\[\s*Category\s*:\s*Protected disambiguation pages\s*\]\]",
    r"{{\s*nopublish\s*}}"
]
full_deletion_regex = "|".join(deletion_regexes)

df.filter(df["page_text"].map_elements(lambda p: re.search(full_deletion_regex, p) is not None, return_dtype=bool))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4562001,6222,0,"""Market Data/^DJI""","""{{historical}} ===Latest data=…","""2020-04-30T02:40:00Z""",[]
4561994,6314,0,"""Market Data/Homepage""","""{{historical}}<div style=""bord…","""2020-04-30T02:35:34Z""",[]
4562016,6316,0,"""Market Data/^MERV""","""{{historical}} The MerVal inde…","""2020-04-30T03:06:21Z""",[]
4562003,6509,0,"""Market Data/^FTSE""","""{{historical}} ===Latest data …","""2020-04-30T02:42:12Z""",[]
4562002,6510,0,"""Market Data/^FCHI""","""{{historical}} ===Latest data …","""2020-04-30T02:40:32Z""",[]
…,…,…,…,…,…,…
4718990,2969011,0,"""IPCC""","""{{mainspace disambig}} __DISAM…","""2023-04-02T10:37:32Z""",[]
4726230,2971496,0,"""AI""","""{{mainspace disambig}} __DISAM…","""2023-05-10T06:51:56Z""",[]
4732270,2973823,0,"""West Midlands""","""{{mainspace disambig}} __DISAM…","""2023-06-13T14:40:33Z""",[]
4733445,2973832,0,"""Shropshire""","""{{Mainspace disambig}} __DISAM…","""2023-06-22T13:37:11Z""",[]


In [40]:
df = df.filter(df["page_text"].map_elements(lambda p: re.search(full_deletion_regex, p) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #15: "Market Data" pages

The "Market Data" page is a market index summary that is no longer maintained, not a news article, so we remove it by title.

In [41]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) == 0)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4493894,1631,0,"""South American Community of Na…","""[[File:SACN member states.jpg|…","""2019-07-26T12:11:46Z""",[]
1321145,6221,0,"""Market Data""","""[[Image:Stop hand.svg|left|50p…","""2011-11-12T22:32:27Z""",[]
510074,9607,0,"""Kentucky Derby field chart""","""{| border=1 cellpadding=4 cell…","""2007-10-25T21:24:16Z""",[]
4576378,60209,0,"""Wikinews Shorts: February 8, 2…","""{{Wikinews Shorts header|Febru…","""2020-07-30T08:07:31Z""",[]
860355,60521,0,"""Wikinews Shorts: February 12, …","""{{Wikinews Shorts header|Febru…","""2009-08-12T09:48:48Z""",[]
…,…,…,…,…,…,…
4281914,2796391,0,"""Wikinews Shorts: August 15, 20…","""{{Wikinews Shorts header|Augus…","""2017-01-23T23:23:21Z""",[]
4658703,2932497,0,"""North Korea launches ""early-st…","""{{Correction | label = Retract…","""2022-01-27T17:46:34Z""",[]
4671338,2945126,0,"""Wikisource""","""{{softredirect|s:Main page|Wik…","""2022-03-31T20:55:44Z""",[]
4804852,3001820,0,"""Jimmie Johnson's Consistency""","""{{abandoned|October 8, 2024|}}…","""2024-11-14T17:14:23Z""",[]


In [42]:
print(df.filter(df["page_id"] == 6221)["page_text"][0])

[[Image:Stop hand.svg|left|50px]]
<br style="clear:both;" />
----

Information about the world's markets index, no longer maintained.

{| style="background: #EEEEEE; border: 1px dotted #666666;"
|- style="background: #DDDDDD; border: 1px solid #666666;" 
! Index Name
! Description
! Current Value
! Change 
! Updated
{{Market Data/Index/Summary}}
|}

===Market Data===
{{:Market Data/Homepage}}
===Commodities===

==== Metals ====
*'''[[w:Forex Gold Index|Forex Gold Index]]:''' $424.30/[[w:barrel|barrel]], [[Image:Arrowupred.png]] $0.80

===Currencies</hr>===
*[[W:US dollar|1 US Dollar]] (US$):
:= STG£0.5349 = €0.7727 = ¥106.4000
*[[w:Euro|1 Euro]] (€):
:= STG£0.6923 = $1.2942 = ¥137.6900
*[[w:Pound sterling|1 Pound Sterling]] (STG£):
:= US$1.8694 = €1.4443 = ¥198.8550
*[[w:Yen|1 Japanese Yen]] (¥)
:= STG£0.0050 = $0.0094 = €0.0073
''(Commodities & currencies as of 2005-03-24 T 23:00 [[w:UTC|UTC]], or last close were applicable. '''None of this data is guaranteed to be correct'''. Please 

In [43]:
deletion_regex = r"\s*Market *Data/?.*"

df.filter(df["page_title"].map_elements(lambda p: re.search(deletion_regex, p) is not None, return_dtype=bool))

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
1321145,6221,0,"""Market Data""","""[[Image:Stop hand.svg|left|50p…","""2011-11-12T22:32:27Z""",[]


In [44]:
df = df.filter(df["page_title"].map_elements(lambda p: re.search(deletion_regex, p) is None, return_dtype=bool))
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"
4805304,3003969,0,"""US sanctions far-right Israeli…","""{{review}} {{Date|November 18…","""2024-11-20T05:45:46Z""","[""{{date|november 18, 2024}}""]"


## Issue #16 - Anything else remaining

Having parsed dates very liberally, we can be reasonably confident that any page still lacking a date is not a news article. I estimate that this rejects only about 10-20 genuine news articles in total, due to poor formatting.

In [45]:
df = df.filter(df["page_dates"].map_elements(len, return_dtype=int) > 0)
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]"
…,…,…,…,…,…,…
4804529,3003708,0,"""Man drives car into crowd of p…","""{{develop}} {{date|November 12…","""2024-11-12T17:32:29Z""","[""{{date|november 12, 2024}}""]"
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]"


### Note: 

Some articles still have multiple dates due to the variety of ways we probed for dates. This will be handled later.

In [46]:
df.filter(df["page_dates"].map_elements(len, return_dtype=int) > 1)

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates
i64,i64,i64,str,str,str,list[str]
1535527,1003,0,"""Israeli army kills three Egypt…","""{{date|November 18, 2004}} [[I…","""2012-06-21T15:36:10Z""","[""{{date|november 18, 2004}}"", ""[[category:november 18, 2004]]""]"
1525575,1083,0,"""Sudanese parties sign peace pl…","""[[Image:Darfur map.png|right|3…","""2012-06-10T16:02:34Z""","[""{{byline|date=november 19, 2004|location={{w|nairobi|nairobi}}}}"", ""[[category:november 29, 2004]]""]"
931431,1089,0,"""Spanish tourist fatally shot i…","""{{byline|date=November 19, 200…","""2010-01-03T00:00:22Z""","[""{{byline|date=november 19, 2004|location=rio de janeiro}}"", ""[[category:november 19, 2004]]""]"
354699,1095,0,"""19 Burmese political prisoners…","""{{byline|date=November 20, 200…","""2006-12-19T12:08:24Z""","[""{{byline|date=november 20, 2004|location=yangon, myanmar}}"", ""[[category:november 20, 2004]]""]"
715647,1103,0,"""Brazilian economist Celso Furt…","""[[Image:CFurtado4.jpg|thumb|Ce…","""2008-10-26T19:18:32Z""","[""{{byline|date=november 20, 2004|location=rio de janeiro}}"", ""[[category:november 20, 2004]]""]"
…,…,…,…,…,…,…
1643896,537233,0,"""Hailemariam Desalegn sworn in …","""{{date|September 21, 2012}} {{…","""2012-10-01T14:27:00Z""","[""{{date|september 21, 2012}}"", ""[[category:september 21, 2012]]""]"
4404330,783911,0,"""United States spies accused of…","""{{date|August 26, 2013}} [[Fi…","""2018-05-06T23:49:56Z""","[""{{date|august 26, 2013}}"", ""[[category:august 26, 2013]]""]"
3065067,1734356,0,"""Philae space probe lands on co…","""{{date|November 13, 2014}} {{s…","""2014-11-30T01:46:04Z""","[""{{date|november 13, 2014}}"", ""[[category:november 12, 2014]]""]"
4627193,2506784,0,"""Scottish Ebola nurse Pauline C…","""{{date|October 15, 2015}} {{Au…","""2021-07-11T06:14:40Z""","[""{{date|october 15, 2015}}"", ""[[category: october 14, 2015]]""]"
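
The rows above show that the markers for one article sometimes agree and sometimes differ by a day or more. As a rough sketch of how such disagreements could be detected (not necessarily how they are handled later), the snippet below pulls the calendar date out of each marker string and collects the distinct dates. The names `DATE_RE` and `extract_dates` are hypothetical helpers introduced here, and the sketch assumes every marker uses the "month day, year" form seen in the output.

```python
import re
from datetime import date, datetime

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")
# Matches e.g. "november 13, 2014" inside any of the marker formats.
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),\s*(\d{{4}})", re.IGNORECASE)

def extract_dates(markers):
    """Return the set of distinct dates found in a list of marker strings."""
    found = set()
    for marker in markers:
        m = DATE_RE.search(marker)
        if m:
            month, day, year = m.groups()
            found.add(datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date())
    return found

# Marker lists copied from two rows of the output above.
agreeing = ["{{date|november 18, 2004}}", "[[category:november 18, 2004]]"]
conflicting = ["{{date|november 13, 2014}}", "[[category:november 12, 2014]]"]

print(extract_dates(agreeing))
print(extract_dates(conflicting))
```

A set with more than one element marks an article whose probes disagree and that needs a tie-breaking rule.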


# Step 3: Obtaining Plain-Text versions of the WikiNews articles found

The original plan was to parse the WikiText of each article directly and derive a plain-text version from it. However, the MediaWiki text format is poorly specified and has no formal grammar, which makes it difficult to parse. Instead, we let MediaWiki do the rendering: for each revision we query the MediaWiki API for a plain-text extract (`prop=extracts` with `explaintext`) and store the raw JSON response, from which the article text can be pulled out afterwards.

In [47]:
import requests
import time

def get_page_text_extract(rev_id):
    """Fetch the plain-text extract for one revision and return the raw JSON body."""
    backoff = 0.05  # seconds to wait before the first retry
    while True:
        response = requests.get(
            'https://en.wikinews.org/w/api.php',
            params={
                'action': 'query',
                'format': 'json',
                'revids': rev_id,
                'prop': 'extracts',
                'exlimit': 'max',
                'exsectionformat': 'plain',
                'explaintext': 'true'
            }
        )
        if response.status_code == 200:
            return response.text
        # Non-200 response: wait, then retry with a 1.5x longer delay.
        print("failed, backoff")
        time.sleep(backoff)
        backoff *= 1.5
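
The retry loop above keeps trying until it gets a 200 response, which can spin indefinitely if a revision never succeeds. As a minimal sketch of one alternative (not what this notebook ran), the helper below caps both the number of attempts and the delay. `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` stands in for any callable returning a `(status_code, body)` pair, so the sketch runs without network access.

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, initial_delay=0.05,
                       factor=1.5, max_delay=30.0, sleep=time.sleep):
    """Call fetch() until it returns status 200, or give up after max_attempts."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if attempt < max_attempts:
            sleep(min(delay, max_delay))  # never wait longer than max_delay
            delay *= factor
    raise RuntimeError(f"gave up after {max_attempts} attempts")

# Simulated endpoint: fails twice with 503, then succeeds.
responses = iter([(503, ""), (503, ""), (200, "ok")])
recorded_sleeps = []
result = fetch_with_backoff(lambda: next(responses), sleep=recorded_sleeps.append)
print(result, len(recorded_sleeps))
```

Passing `sleep` as a parameter keeps the example instantaneous; in real use the default `time.sleep` would apply.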

In [48]:
from tqdm.auto import tqdm
import numpy as np

df = df.with_columns(
    pl.Series(map(get_page_text_extract, tqdm(df["revision_id"].to_list())), dtype=str).alias("page_text_extract_result")
)

  0%|          | 0/21657 [00:00<?, ?it/s]

In [49]:
df

revision_id,page_id,page_namespace,page_title,page_text,last_update_timestamp,page_dates,page_text_extract_result
i64,i64,i64,str,str,str,list[str],str
4516743,736,0,"""President of China lunches wit…","""{{date|November 13, 2004}} {{B…","""2019-09-28T09:51:53Z""","[""{{date|november 13, 2004}}""]","""{""batchcomplete"":"""",""query"":{""…"
4516759,741,0,"""Palestinians to elect new pres…","""[[File:Mahmoud abbas.jpg|frame…","""2019-09-28T10:45:51Z""","[""{{byline|date=november 14, 2004|location=[[w:ramallah|ramallah]]}}""]","""{""batchcomplete"":"""",""query"":{""…"
2280888,743,0,"""Brazilian delegation returns f…","""{{date|November 13, 2004}} {{P…","""2014-01-02T19:36:05Z""","[""{{date|november 13, 2004}}""]","""{""batchcomplete"":"""",""query"":{""…"
4516758,764,0,"""Hearing begins over David Hook…","""{{Crime and law}}{{byline|date…","""2019-09-28T10:39:36Z""","[""{{byline|date=november 15, 2004|location=[[melbourne|melbourne]], [[victoria, australia|victoria]]}}""]","""{""batchcomplete"":"""",""query"":{""…"
1973838,779,0,"""Iran close to decision on nucl…","""{{date|November 13, 2004}} {{I…","""2013-08-21T16:07:41Z""","[""{{date|november 13, 2004}}""]","""{""batchcomplete"":"""",""query"":{""…"
…,…,…,…,…,…,…,…
4804529,3003708,0,"""Man drives car into crowd of p…","""{{develop}} {{date|November 12…","""2024-11-12T17:32:29Z""","[""{{date|november 12, 2024}}""]","""{""batchcomplete"":"""",""query"":{""…"
4805088,3003768,0,"""Prison riot in Ecuador, at lea…","""{{tasks|src|npov|mos|re-review…","""2024-11-17T16:35:30Z""","[""{{date|november 13, 2024}}""]","""{""batchcomplete"":"""",""query"":{""…"
4805240,3003827,0,"""2024 ARPS Conference""","""{{develop}} {{date|November 12…","""2024-11-19T12:19:00Z""","[""{{date|november 12, 2024}}""]","""{""batchcomplete"":"""",""query"":{""…"
4805272,3003842,0,"""Japan's oldest Princess Yuriko…","""{{tasks|src|re-review}}{{date|…","""2024-11-19T18:23:32Z""","[""{{date|november 15, 2024}}""]","""{""batchcomplete"":"""",""query"":{""…"
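
The new `page_text_extract_result` column holds the raw JSON body of each response, so the article text still has to be pulled out of it. The sketch below shows one way this might be done, assuming the usual `format=json` layout in which `query.pages` is an object keyed by page ID and each page carries an `extract` field. `extract_plain_text` is a hypothetical helper, and the sample payload is hand-written for illustration rather than a real API response.

```python
import json

def extract_plain_text(raw_response):
    """Return the first non-empty 'extract' in a raw API response, or None."""
    payload = json.loads(raw_response)
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        text = page.get("extract")
        if text:
            return text
    return None

# Hand-written sample in the assumed response shape (not real API output).
sample = json.dumps({
    "batchcomplete": "",
    "query": {
        "pages": {
            "736": {
                "pageid": 736,
                "ns": 0,
                "title": "Example title",
                "extract": "Example body text.",
            }
        }
    },
})

print(extract_plain_text(sample))
```

Returning `None` for responses without an extract makes it easy to spot revisions that need to be re-fetched or inspected by hand.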


In [50]:
df.write_parquet("enwikinews-pages-revdocs.parquet", compression="zstd", compression_level=22)