# Part 1 : Wikipedia Evolution

## How did Wikipedia change between 2007 and now?

In this part we investigate how Wikipedia has evolved between the 2007 version and the one we use today. It occurred to us that the Wikispeedia game may be hard to play because the structure it relies on is out of date. As players navigate a restricted snapshot of Wikipedia from 2007, this could also affect how the game is played.

Before testing this hypothesis, we first investigate how much Wikipedia has changed between the 2007 version and the 2024 version, to see whether the difference is significant.

In the following part we study *only the four thousand articles* from the 2007 selection and compare them to their current versions.


#### Setting the environment 
Please check SETUP.md and pip_requirements.txt before running this notebook.

In [2]:
import pandas as pd
import numpy as np

import networkx as nx
import matplotlib.pyplot as plt
import os

### 1.0) About the subset of articles used in the game

As mentioned above, we are working with a subset of Wikipedia articles that was selected by the creators of Wikispeedia [1], [2]. From the list of article names in Wikispeedia 2007, we extracted the corresponding 4587 articles in Wikipedia 2024.

[1] Robert West and Jure Leskovec:
     Human Wayfinding in Information Networks.
     21st International World Wide Web Conference (WWW), 2012.
[2] Robert West, Joelle Pineau, and Doina Precup:
     Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts.
     21st International Joint Conference on Artificial Intelligence (IJCAI), 2009.


In [3]:
os.getcwd()

'/Users/eglantinevialaneix/Desktop/ADA/Project/ada-2024-project-outlier-1'

In [4]:
# Setting the path
DATA_PATH = 'data/2007/'

# Loading 2007 data from Wikispeedia
article_names_2007 = pd.read_csv(os.path.join(DATA_PATH, 'articles.tsv'), sep='\t', comment='#', names=['article_2007'])
links_2007 = pd.read_csv(os.path.join(DATA_PATH, 'links.tsv'), sep='\t', comment='#', names=['linkSource_2007', 'linkTarget2007'])

# Update path for 2024
DATA_PATH = 'data/2024/'

# Loading 2024 data scraped from Wikipedia for the Wikispeedia articles (see scrapping.ipynb)
# Note: skiprows=2 skips the first two lines of each file
raw_article_names_2024 = pd.read_csv(os.path.join(DATA_PATH, 'raw_articles2024.csv'), skiprows=2, comment='#', names=['article_2024'])
raw_links_2024 = pd.read_csv(os.path.join(DATA_PATH, 'raw_links2024.csv'), skiprows=2, comment='#', names=['linkSource_2024', 'linkTarget_2024'])

n_articles_2007, n_articles_2024 = article_names_2007.shape[0], raw_article_names_2024.shape[0]
n_links_2007, n_links_2024 = links_2007.shape[0], raw_links_2024.shape[0]

print(f"The dataset of articles from 2007 contains {n_articles_2007} articles with a total of {n_links_2007} links.")
print(f"The dataset of retrieved articles from 2024 contains {n_articles_2024} articles with a total of {n_links_2024} links.")
print(f"There are {n_articles_2007 - n_articles_2024} articles from 2007 that could not be found in 2024.")
print(f"However, there are {n_links_2024/n_links_2007:.2f} times more links in 2024 than in 2007.")

The dataset of articles from 2007 contains 4604 articles with a total of 119882 links.
The dataset of retrieved articles from 2024 contains 4592 articles with a total of 377148 links.
There are 12 articles from 2007 that could not be found in 2024.
However, there are 3.15 times more links in 2024 than in 2007.


In [5]:
links_2007.drop_duplicates().shape

(119882, 2)

In [6]:
# Checking for uniqueness in the article names
print(np.unique(article_names_2007.article_2007).shape[0] == article_names_2007.article_2007.shape[0], 
      np.unique(raw_article_names_2024.article_2024).shape[0] == raw_article_names_2024.article_2024.shape[0])

# Checking for uniqueness in the article links
print(links_2007.drop_duplicates().shape[0] == links_2007.shape[0],
      raw_links_2024.drop_duplicates().shape[0] == raw_links_2024.shape[0])

True True
True False
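
The `False` above means that the raw 2024 links contain repeated rows. A minimal sketch of how the number of repeats could be counted before dropping them, using a small hypothetical table rather than the real `raw_links_2024`:

```python
import pandas as pd

# Toy stand-in for raw_links_2024 (hypothetical rows, not taken from the real data)
toy_links = pd.DataFrame({
    "linkSource_2024": ["A", "A", "A", "B"],
    "linkTarget_2024": ["B", "B", "C", "A"],
})

# duplicated() flags every repeat of a row after its first occurrence
n_duplicates = int(toy_links.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each (source, target) pair
deduplicated = toy_links.drop_duplicates()

print(f"{n_duplicates} duplicated link(s), {len(deduplicated)} unique links remain")
```

Applied to the real dataframe, the same two calls would tell how many of the 2024 links are repeats of a link already listed for the same source article.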


#### Note: to decode article names for our graphs:

In [7]:
article_names_2007.iloc[1].article_2007

'%C3%85land'

In [8]:
from urllib.parse import unquote

decoded_article_name = unquote(article_names_2007.iloc[1].article_2007, encoding='utf-8')
print(decoded_article_name)

Åland
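
The same decoding can be applied to a whole column at once. The sketch below is only an illustration: the `decoded` column name and the replacement of underscores by spaces are our own assumptions for nicer graph labels, not something the rest of the notebook relies on.

```python
import pandas as pd
from urllib.parse import unquote

# A few percent-encoded names as they appear in the article lists
toy_names = pd.DataFrame(
    {"article_2007": ["%C3%85land", "%C3%89douard_Manet", "Zirconium"]}
)

# unquote decodes percent-escapes as UTF-8 by default;
# underscores are turned into spaces for readability (assumed labelling choice)
toy_names["decoded"] = (
    toy_names["article_2007"].map(unquote).str.replace("_", " ", regex=False)
)

print(toy_names["decoded"].tolist())
```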


### From eleven missing articles to only four

In [9]:
# Articles from 2007 missing in 2024 (both lists must be in the same order)
i_2024 = 0
for article_2007 in article_names_2007.article_2007:
    article_2024 = raw_article_names_2024.article_2024.iloc[i_2024]
    if article_2007 != article_2024:  # article is in 2007 but not in 2024
        print(article_2007)  # print the name and stay on the same 2024 article
    else:
        i_2024 += 1  # match found: move on to the next 2024 article

%C3%81ed%C3%A1n_mac_Gabr%C3%A1in
Athletics_%28track_and_field%29
Bionicle__Mask_of_Light
Directdebit
Friend_Directdebit
Gallery_of_the_Kings_and_Queens_of_England
Newshounds
Sponsorship_Directdebit
Star_Wars_Episode_IV__A_New_Hope
Wikipedia_Text_of_the_GNU_Free_Documentation_License
Wowpurchase
X-Men__The_Last_Stand
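
The comparison above walks through both lists in parallel, so it depends on them being in the same order. As a possible cross-check, a membership test gives the missing names regardless of ordering. This is a sketch on toy series, not the code used to produce the output above:

```python
import pandas as pd

# Toy stand-ins for the two article lists (a handful of names only)
toy_2007 = pd.Series(
    ["Directdebit", "Newshounds", "Zionism", "Zirconium"], name="article_2007"
)
toy_2024 = pd.Series(["Zionism", "Zirconium"], name="article_2024")

# Keep the 2007 names that do not appear anywhere in the 2024 list
missing_in_2024 = toy_2007[~toy_2007.isin(toy_2024)].tolist()

print(missing_in_2024)
```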


When retrieving the 2024 Wikipedia articles, eleven articles could not be found when scraping for the exact article name provided in Wikispeedia's data. For seven of them we could find the equivalent page on Wikipedia 2024 under a slightly different name. The URL of the new page is given in brackets, along with the new name. For the other four, however, no obvious equivalent page could be found. This leads us to think that these four pages were removed from Wikipedia between 2007 and 2024. Note that the first name printed above, %C3%81ed%C3%A1n_mac_Gabr%C3%A1in, is not counted among the eleven: it is present in the 2024 data loaded later in this notebook.

- Athletics_%28track_and_field%29 (https://en.wikipedia.org/wiki/Track_and_field, Track_and_field)
- Bionicle__Mask_of_Light (https://en.wikipedia.org/wiki/Bionicle:_Mask_of_Light, Bionicle:_Mask_of_Light)
- Directdebit (https://en.wikipedia.org/wiki/Direct_debit, Direct_debit)
- Friend_Directdebit (-)
- Gallery_of_the_Kings_and_Queens_of_England (-)
- Newshounds (https://en.wikipedia.org/wiki/News_Hounds, News_Hounds)
- Sponsorship_Directdebit (-)
- Star_Wars_Episode_IV__A_New_Hope (https://en.wikipedia.org/wiki/Star_Wars_(film), Star_Wars_(film))
- Wikipedia_Text_of_the_GNU_Free_Documentation_License (https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License, Wikipedia:Text_of_the_GNU_Free_Documentation_License)
- Wowpurchase (-)
- X-Men__The_Last_Stand (https://en.wikipedia.org/wiki/X-Men:_The_Last_Stand, X-Men:_The_Last_Stand)

We propose to manually add the seven renamed articles back to the dataframe of 2024 articles, referring to them by their previous names for comparability.
To investigate the four missing pages a bit further, let us look at their plain-text articles, provided by Wikispeedia.

In [10]:
# define path to the data
PLAIN_TEXT_PATH = '../data/plaintext_articles/'

def read_plain_text(file_name):
    """Read one plain-text article and close the file afterwards."""
    # UTF-8 is stated explicitly so that characters such as £ decode the same on every platform
    with open(os.path.join(PLAIN_TEXT_PATH, file_name), encoding='utf-8') as f:
        return f.read()

# extract the plain text articles from 2007
Friend_Directdebit = read_plain_text('Friend_directdebit.txt')
Gallery_of_the_Kings_and_Queens_of_England = read_plain_text('Gallery_of_the_Kings_and_Queens_of_England.txt')
Sponsorship_Directdebit = read_plain_text('Sponsorship_directdebit.txt')
Wowpurchase = read_plain_text('Wowpurchase.txt')

In [11]:
print(Friend_Directdebit)

                             [1x1.gif] [1x1.gif]


   [Direct_Debit.gif]

Become an SOS Friend - Direct Debit


   Thank you for taking a moment to complete this simple form, and for
   helping us help orphaned and abandoned children around the world.

   The minimum donation for an SOS friend is £10/month. If you cannot
   afford this, please use this link to making a smaller regular donation.

   All the normal Direct Debit safeguards and guarantees apply. No changes
   in the amount, date or frequency to be debited can be made without
   notifying you at least 10 working days in advance of your accounts
   being debited. In the event of any error, you are entitled to an
   immediate refund from your Bank or Building Society. You have the right
   to cancel a Direct Debit Instruction at any time simply by writing to
   your Bank or Building Society, with a copy to us.

   Any questions?
   If you have any queries, or would like to make a Direct Debit Donation
   over the phone, pleas

In [12]:
print(Gallery_of_the_Kings_and_Queens_of_England)

   #copyright

Gallery of the Kings and Queens of England

2007 Schools Wikipedia Selection. Related subjects: British History

   This is a gallery of the Kings and Queens of England.

House of Wessex

                         Alfred the Great (871-899)

                         Edward the Elder (899-924)

                               Ælfweard (924)

       Athelstan (924-939)The first de facto King of a unified England

                             Edmund I (939-946)

                               Edred (946-955)

                           Edwy the Fair (955-959)

                               Edgar (959-975)

                       St Edward the Martyr (975-978)

                 Ethelred the Unready (978-1013, 1014-1016)

                         Sweyn Forkbeard (1013-1014)

                           Edmund Ironside (1016)

                             Canute (1016-1035)

                         Harold Harefoot (1035-1040)

                          Harthacanute (1040-1042)


In [13]:
print(Sponsorship_Directdebit)

                             [1x1.gif] [1x1.gif]


   [Direct_Debit.gif]

Sponsor a Child with SOS Children - Direct Debit


   Thank you for taking a moment to complete this simple form, and for
   helping us help orphaned and abandoned children around the world.

   The minimum donation for child sponsorship is £20/month. If you cannot
   afford this, please consider making a smaller regular donation. Use one
   of these links: you can become an SOS friend for a minimum of £10 per
   month or make a regular donation for a minimum of £5 per month.

   All the normal Direct Debit safeguards and guarantees apply. No changes
   in the amount, date or frequency to be debited can be made without
   notifying you at least 10 working days in advance of your accounts
   being debited. In the event of any error, you are entitled to an
   immediate refund from your Bank or Building Society. You have the right
   to cancel a Direct Debit Instruction at any time simply by writing to
   your Bank 

In [14]:
print(Wowpurchase)

                             [1x1.gif] [1x1.gif]


Buy WOW wrist bands
to support SOS Children


   You can purchase WOW Wristbands from this page.
   You will need a debit or credit card to pay for them.
   If you don't have a card, please place your order by post.

   The wrist bands cost £1 each.

   Postage is 50p for up to 5 bands, £1 for 6 to 20 bands,
   £1.50 for 21 to 50 bands and £2 for 51 to 1000 bands.
   Number of wrist bands: _____
   Proceed to Purchase

                           SOS Children's Villages
                            WOW wrist band detail

   SOS Children refers to the worldwide work of SOS-KDI and is a trading
   name for SOS Children's Villages UK

   For further information about our work please see our children charity
   web site or sponsoring a child.

   Charity Commission registered number 1069204

   [1x1.gif] [1x1.gif]



Interestingly, the three articles Friend_Directdebit, Sponsorship_Directdebit and Wowpurchase are not articles but fundraising forms. Since this kind of page probably does not comply with Wikipedia's policies, they were likely removed from the platform. Wikispeedia players could nevertheless visit these pages and click on their links if they wanted to. Lastly, it is unclear why Gallery_of_the_Kings_and_Queens_of_England was removed from Wikipedia. The subject was probably restructured into several pages, one for each royal family, which would explain why we cannot find a single equivalent page in 2024.

For the rest of our study, we will leave these four 2007 articles without a matching article in 2024. However, we will probably have to remove them from some visualizations when the dimensions of the two datasets need to match.
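
When the two datasets need matching dimensions, one possible way to set the four unmatched articles aside is to drop every 2007 link that touches them. The sketch below uses hypothetical toy links and the hypothetical names `unmatched_2007` and `comparable_links`; only the column names follow `links_2007` as loaded above.

```python
import pandas as pd

# The four 2007 articles with no 2024 counterpart
unmatched_2007 = [
    "Friend_Directdebit",
    "Gallery_of_the_Kings_and_Queens_of_England",
    "Sponsorship_Directdebit",
    "Wowpurchase",
]

# Toy stand-in for links_2007 (hypothetical links, same column names)
toy_links_2007 = pd.DataFrame({
    "linkSource_2007": ["Wowpurchase", "Zionism", "Zirconium"],
    "linkTarget2007": ["Zionism", "Friend_Directdebit", "Zionism"],
})

# A link is dropped if either of its endpoints is an unmatched article
has_unmatched = (
    toy_links_2007["linkSource_2007"].isin(unmatched_2007)
    | toy_links_2007["linkTarget2007"].isin(unmatched_2007)
)
comparable_links = toy_links_2007[~has_unmatched]

print(comparable_links)
```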

### Re-import 2024 articles
As seen in the previous part, we identified seven articles present in 2007 whose 2024 equivalents we missed during the first import because of a name change. In scrapping.ipynb we re-scraped all 2024 articles together with these seven additional ones, using their current names. Let's import this new dataset and use it for the rest of our analysis.

In [15]:
# Re-loading the re-scraped 2024 data (see scrapping.ipynb)
# Note: skiprows=2 skips the first two lines of each file
article_names_2024 = pd.read_csv('data/2024/articles2024.csv', skiprows=2, comment='#', names=['article_2024'])
links_2024 = pd.read_csv('data/2024/links2024.csv', skiprows=2, comment='#', names=['linkSource_2024', 'linkTarget_2024'])


In [16]:
# Renaming the seven new (2024) names back to their old (2007) names
old_names = ["Athletics_%28track_and_field%29",
             "Bionicle__Mask_of_Light",
             "Directdebit",
             "Newshounds",
             "Star_Wars_Episode_IV__A_New_Hope",
             "Wikipedia_Text_of_the_GNU_Free_Documentation_License",
             "X-Men__The_Last_Stand"]

new_names = ["Track_and_field",
            "Bionicle:_Mask_of_Light",
            "Direct_debit",
            "News_Hounds",
            "Star_Wars_(film)",
            "Wikipedia:Text_of_the_GNU_Free_Documentation_License",
            "X-Men:_The_Last_Stand"]

links_2024 = links_2024.replace(to_replace = new_names, value = old_names)
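
The two parallel lists must stay aligned element by element for the replacement to be correct. A possible alternative, sketched here on toy data with only two of the seven pairs, is to pass a single dictionary to `DataFrame.replace` so that each new name sits next to its old name. The `new_to_old` name is ours, not part of the notebook.

```python
import pandas as pd

# Each 2024 name is mapped to the 2007 name it replaces (two of the seven pairs)
new_to_old = {
    "Track_and_field": "Athletics_%28track_and_field%29",
    "Direct_debit": "Directdebit",
}

# Toy stand-in for links_2024 (hypothetical links)
toy_links_2024 = pd.DataFrame({
    "linkSource_2024": ["Track_and_field", "Zionism"],
    "linkTarget_2024": ["Zionism", "Direct_debit"],
})

# A flat dict replaces matching values in every column of the dataframe
renamed = toy_links_2024.replace(new_to_old)

print(renamed)
```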

In [17]:
links_2024

Unnamed: 0,linkSource_2024,linkTarget_2024
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,D%C3%A1l_Riata
1,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,D%C3%A1l_Riata
2,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Columba
3,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Orkney
4,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Isle_of_Man
...,...,...
377269,Zuid-Gelders,German_language
377270,Zuid-Gelders,Dutch_language
377271,Zuid-Gelders,East_Flemish
377272,Zuid-Gelders,West_Flemish


In [18]:
links_2024 = links_2024.drop_duplicates()

In [19]:
article_names_2024

Unnamed: 0,article_2024
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in
1,%C3%85land
2,%C3%89douard_Manet
3,%C3%89ire
4,%C3%93engus_I_of_the_Picts
...,...
4599,Zionism
4600,Zirconium
4601,Zoroaster
4602,Zuid-Gelders


In [20]:
links_2024

Unnamed: 0,linkSource_2024,linkTarget_2024
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,D%C3%A1l_Riata
2,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Columba
3,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Orkney
4,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Isle_of_Man
5,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Bede
...,...,...
377265,Zuid-Gelders,Afrikaans
377266,Zuid-Gelders,West_Flemish
377267,Zuid-Gelders,East_Flemish
377269,Zuid-Gelders,German_language



In [22]:
import src.scripts.scrapper_and_writters as scr

scr.export_df_links_to_csv(links_2024, "data/2024/links2024.csv")