# The Wikipedia Game

## The Rules

* Take a Wikipedia URL as input.
* Extract all links from the page.
* Go to a random page linked from it.
* Check whether one of that page's links leads back to the initial page.
* If yes, you succeeded.
* If not, keep following random links until you reach a page that links back to the initial page.
* Print the number of pages needed.
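
The rules above can be sketched without any network access as a random walk over a small in-memory link graph. The function name `play` and the dictionary `toy_links` are hypothetical and only serve as an illustration; the real pages are scraped further below.

```python
import random

def play(links, start, rng=random):
    """Return the number of pages visited before one links back to `start`."""
    # First hop: any page linked from the start page, except the start itself.
    candidates = [page for page in links[start] if page != start]
    current = rng.choice(candidates)
    steps = 1
    # Keep hopping until the current page links back to the start page.
    while start not in links[current]:
        current = rng.choice(links[current])
        steps += 1
    return steps

# Hypothetical toy graph: page name -> pages it links to.
toy_links = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(play(toy_links, "A"))  # prints 2: B is visited first, then C links back to A
```

Each page in this toy graph has exactly one outgoing link, so the walk is deterministic. On real pages the path depends on the random choices.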

## Setup and requirements

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import random as rd
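
Wikimedia asks automated clients to identify themselves with a descriptive User-Agent header, so it may be worth wrapping each URL in a `Request` before passing it to `urlopen`. The sketch below is one possible way to do that; the helper name `build_request` and the User-Agent string are placeholders, not part of the original notebook.

```python
from urllib.request import Request

def build_request(url, user_agent="wikipedia-game-notebook/0.1 (example contact)"):
    """Wrap `url` in a Request that carries a descriptive User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://en.wikipedia.org/wiki/Germany")
# urlopen(req, timeout=10) would then fetch the page; it is not executed here.
```

Passing a `timeout` to `urlopen` also keeps a stalled connection from blocking a long walk.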

## Handle Wikipedia links

First, let us try to extract all links from the Wikipedia page of "Python (programming language)".

In [6]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

html = urlopen(url)
soup = bs(html.read(), "html.parser")  # name the parser explicitly to avoid bs4's parser-guessing warning
all_links = []

for link in soup.find_all('a'):
    current_link = link.get('href')
    all_links.append(current_link)

print(len(all_links))

1654


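Next, clean up the list: drop missing href values, skip the first three links (the standard Wikipedia page header), and exclude links that contain characters or names typical of non-article pages.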

In [7]:
all_links = [link for link in all_links if link is not None] # remove None entries
all_links = all_links[3:] # delete standard Wikipedia head

bad_string = ["$", "%", "&", ":", "Main_Page", "ISO"] # exclude links containing any of these substrings
temp = []
for i in range(0, len(all_links)):
    if any(string in all_links[i] for string in bad_string):
        continue
    else:
        temp.append(all_links[i])
all_links = temp
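
The same cleaning step can be written more compactly as a small pure function, which also makes it easy to try on a hand-made list. This is a sketch only: the name `clean_links`, its `skip` parameter, and the sample hrefs are hypothetical.

```python
BAD_SUBSTRINGS = ["$", "%", "&", ":", "Main_Page", "ISO"]

def clean_links(hrefs, bad_substrings=BAD_SUBSTRINGS, skip=3):
    """Drop None values, skip the first `skip` links, and filter out unwanted links."""
    present = [href for href in hrefs if href is not None]
    present = present[skip:]  # assumed to be the standard Wikipedia header links
    return [href for href in present
            if not any(bad in href for bad in bad_substrings)]

sample = [None, "#a", "#b", "#c", "/wiki/Python", "/wiki/Help:Contents",
          "/wiki/Main_Page", "https://example.org/x"]
print(clean_links(sample))  # prints ['/wiki/Python']
```

Note that filtering on ":" also removes absolute URLs such as "https://...", because they contain a colon after the scheme.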

Now we keep only the links that lead to other Wikipedia pages.

In [8]:
substring = "/wiki/"

wiki_links = []
for i in range(0, len(all_links)):
    if all_links[i].startswith(substring):
        wiki_links.append(all_links[i])
        
print(wiki_links[0:9])
print(len(wiki_links))

['/wiki/Python_(disambiguation)', '/wiki/Programming_paradigm', '/wiki/Multi-paradigm_programming_language', '/wiki/Functional_programming', '/wiki/Imperative_programming', '/wiki/Object-oriented_programming', '/wiki/Reflective_programming', '/wiki/Software_design', '/wiki/Guido_van_Rossum']
738


We have extracted the relative paths of the Wikipedia pages. Now we turn them into full URLs that can be opened.

In [9]:
substring = "https://en.wikipedia.org"

wiki_urls = []

for i in range(0, len(wiki_links)):
    url = substring + wiki_links[i]
    wiki_urls.append(url)
    
print(wiki_urls[0:9])

['https://en.wikipedia.org/wiki/Python_(disambiguation)', 'https://en.wikipedia.org/wiki/Programming_paradigm', 'https://en.wikipedia.org/wiki/Multi-paradigm_programming_language', 'https://en.wikipedia.org/wiki/Functional_programming', 'https://en.wikipedia.org/wiki/Imperative_programming', 'https://en.wikipedia.org/wiki/Object-oriented_programming', 'https://en.wikipedia.org/wiki/Reflective_programming', 'https://en.wikipedia.org/wiki/Software_design', 'https://en.wikipedia.org/wiki/Guido_van_Rossum']
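
As an alternative to string concatenation, the last two steps (keeping only "/wiki/" paths and building full URLs) could be combined with `urllib.parse.urljoin` from the standard library. The helper name `to_absolute` and the constant `BASE` are assumptions made for this sketch.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"

def to_absolute(paths, base=BASE):
    """Keep only article paths and resolve them against the Wikipedia base URL."""
    return [urljoin(base, path) for path in paths if path.startswith("/wiki/")]

print(to_absolute(["/wiki/Guido_van_Rossum", "#cite_note-1"]))
# prints ['https://en.wikipedia.org/wiki/Guido_van_Rossum']
```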


## The Function

In [10]:
def wiki_winner(input_url):
    
    # read html
    html = urlopen(input_url)
    soup = bs(html.read(), "html.parser")  # name the parser explicitly
    
    # extract links
    all_links = []
    for link in soup.find_all('a'):
        current_link = link.get('href')
        all_links.append(current_link)
    
    all_links = [link for link in all_links if link is not None]
    all_links = all_links[3:]
    
    bad_string = ["$", "%", "&", ":", "Main_Page", "ISO"] # exclude links containing any of these substrings
    temp = []
    for i in range(0, len(all_links)):
        if any(string in all_links[i] for string in bad_string):
            continue
        else:
            temp.append(all_links[i])
    all_links = temp
    
    # keep only wiki links
    substring = "/wiki/"
    wiki_links = []
    for i in range(0, len(all_links)):
        if all_links[i].startswith(substring):
            wiki_links.append(all_links[i])
    
    # generate wiki urls
    substring = "https://en.wikipedia.org"
    wiki_urls = []
    for i in range(0, len(wiki_links)):
        url = substring + wiki_links[i]
        wiki_urls.append(url)
        
    # drop every copy of input_url (list.remove would only delete the first one), then choose a random url
    wiki_urls = [u for u in wiki_urls if u != input_url]
    current_url = rd.choice(wiki_urls)
    
    # initialize counter
    counter = 1
    
    # parse and scrape pages until initial page is reached
    while current_url != input_url:
        
        print("Iteration ", counter, ": ", current_url)
        
        # again read html
        html = urlopen(current_url)
        soup = bs(html.read(), "html.parser")  # name the parser explicitly
        
        # again extract links
        all_links = []
        for link in soup.find_all('a'):
            current_link = link.get('href')
            all_links.append(current_link)
        
        all_links = [link for link in all_links if link is not None]
        all_links = all_links[3:]
        
        bad_string = ["$", "%", "&", ":", "Main_Page", "ISO"] # exclude links containing any of these substrings
        temp = []
        for i in range(0, len(all_links)):
            if any(string in all_links[i] for string in bad_string):
                continue
            else:
                temp.append(all_links[i])
        all_links = temp
        
        # again keep only wiki links
        substring = "/wiki/"
        wiki_links = []
        for i in range(0, len(all_links)):
            if all_links[i].startswith(substring):
                wiki_links.append(all_links[i])
        
        # again generate wiki urls
        substring = "https://en.wikipedia.org"
        wiki_urls = []
        for i in range(0, len(wiki_links)):
            url = substring + wiki_links[i]
            wiki_urls.append(url)
        
        # search for input url
        if input_url in wiki_urls:
            print("-----------------------------")
            print("You succeeded in ", counter, " steps")
            print("You reached your input URL: ", input_url)
            print("-----------------------------")
            break
        else:
            current_url = rd.choice(wiki_urls)
            counter += 1
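
The function above keeps walking until it succeeds, and `rd.choice` raises an `IndexError` if a page yields no usable links. The sketch below shows one possible way to guard against both cases. It takes the link extraction as a parameter so that it can be tried without network access; the names `walk_back` and `get_wiki_urls`, and the default cap of 1000 steps, are assumptions and not part of the original function.

```python
import random

def walk_back(start_url, get_wiki_urls, max_steps=1000, rng=random):
    """Return the step count, or None on a dead end or when max_steps is exceeded."""
    candidates = [u for u in get_wiki_urls(start_url) if u != start_url]
    if not candidates:
        return None  # the start page has no usable links
    current = rng.choice(candidates)
    for step in range(1, max_steps + 1):
        urls = get_wiki_urls(current)
        if start_url in urls:
            return step  # this page links back to the start page
        if not urls:
            return None  # dead end: nothing to follow
        current = rng.choice(urls)
    return None  # gave up after max_steps pages

# Hypothetical stand-in for the scraping step: page -> linked pages.
fake_web = {"A": ["B"], "B": ["C"], "C": ["A"],
            "D": ["E"], "E": [], "F": ["G"], "G": ["G"]}
print(walk_back("A", fake_web.get))  # prints 2
```

With real pages, `get_wiki_urls` would be a function that downloads a URL and returns its cleaned list of Wikipedia links, as done step by step in the cells above.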
        

## Some examples

In [12]:
url = "https://en.wikipedia.org/wiki/Germany"
wiki_winner(url)

Iteration  1 :  https://en.wikipedia.org/wiki/Printing
Iteration  2 :  https://en.wikipedia.org/wiki/Goryeo
Iteration  3 :  https://en.wikipedia.org/wiki/Chongchon_River
Iteration  4 :  https://en.wikipedia.org/wiki/Lake_Rangrim
Iteration  5 :  https://en.wikipedia.org/wiki/Daedong_Bay_Important_Bird_Area
Iteration  6 :  https://en.wikipedia.org/wiki/Black-faced_spoonbill
Iteration  7 :  https://en.wikipedia.org/wiki/Green_ibis
Iteration  8 :  https://en.wikipedia.org/wiki/Paraguay
Iteration  9 :  https://en.wikipedia.org/wiki/Central_Bank_of_Paraguay
Iteration  10 :  https://en.wikipedia.org/wiki/Discount_window
Iteration  11 :  https://en.wikipedia.org/wiki/Early_2000s_recession
-----------------------------
You succeeded in  11  steps
You reached your input URL:  https://en.wikipedia.org/wiki/Germany
-----------------------------


In [14]:
url = "https://en.wikipedia.org/wiki/Facebook"
wiki_winner(url)

Iteration  1 :  https://en.wikipedia.org/wiki/George_Orwell
Iteration  2 :  https://en.wikipedia.org/wiki/The_Road_to_Wigan_Pier
Iteration  3 :  https://en.wikipedia.org/wiki/Radio_Four
Iteration  4 :  https://en.wikipedia.org/wiki/Isle_of_Man
Iteration  5 :  https://en.wikipedia.org/wiki/Culture_of_Ireland
Iteration  6 :  https://en.wikipedia.org/wiki/Trinity
Iteration  7 :  https://en.wikipedia.org/wiki/Dead_Sea_Scrolls
Iteration  8 :  https://en.wikipedia.org/wiki/Aramaic_Enoch_Scroll
Iteration  9 :  https://en.wikipedia.org/wiki/7Q5
Iteration  10 :  https://en.wikipedia.org/wiki/Temple_Scroll
Iteration  11 :  https://en.wikipedia.org/wiki/Torah
Iteration  12 :  https://en.wikipedia.org/wiki/Summa_Theologica
Iteration  13 :  https://en.wikipedia.org/wiki/Thomism
Iteration  14 :  https://en.wikipedia.org/wiki/Edmund_Husserl
Iteration  15 :  https://en.wikipedia.org/wiki/Authenticity_(philosophy)
Iteration  16 :  https://en.wikipedia.org/wiki/Walter_Kaufmann_(philosopher)
Iteration  1