# Wikipedia Analysis

### Project 3 - Massive computing

By  Mireia Kesti (100406960) and Aleksandra Jamróz (100491363)

### Context - page rank algorithm
The goal of this algorithm was to set a weight for each webpage on the internet, based on the number of external links pointing to the page. The intention behind this was to objectively determine the interest of a webpage. 
The PageRank algorithm was proposed in 1998 by Larry Page and Sergey Brin and served as the foundation for the original Google search engine.
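
To make the idea concrete, the following is a minimal pure-Python sketch of the iterative update on a toy three-page graph. The damping factor of 0.85, the toy graph, and the function name `pagerank` are illustrative assumptions rather than values taken from this project, and pages without outgoing links are not handled.

```python
def pagerank(forward_links, damping=0.85, iterations=20):
    """Iterate PR(p) = (1 - d) / N + d * sum(PR(q) / out(q)) over pages q linking to p."""
    pages = list(forward_links)
    n = len(pages)
    # Start from a uniform distribution over all pages
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        # Each page splits its damped rank evenly among the pages it links to
        for source, targets in forward_links.items():
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy graph: page -> pages it links to
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_graph)
```

In this toy graph, page `c` is linked from both other pages and ends up with the highest weight, while `b`, linked only once from a page that splits its vote, ends up with the lowest.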

### Context - goals of this project
We will be implementing the original algorithm to set the weights for each page in Wikipedia (English version), using the copy stored in the Databricks datasets. This dataset contains 5,823,210 entries, stored as a set of Parquet files. For simplicity and optimization purposes we will not be using the complete dataset for the analysis of the structure, but a smaller version instead. 
To achieve this we will be using Apache Spark DataFrames to handle both the Wikipedia database and the intermediate results. The end goal is to obtain a pandas DataFrame with three columns, namely the title of the page, its ID, and its PageRank. 


#### Steps we will be following
- 1. Configuration of the library
- 2. Checking of the raw data structure
- 3. Extraction of relevant data
- 4. Creation of a table of forward links
- 5. Computation of the number of outgoing links
- 6. Creation of a table of reverse links 
- 7. Creation of the page rank table 
- 8. Definition of the number of iterations for the page rank
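
Steps 4 to 6 can be pictured on a tiny example before running them on Spark. The sketch below uses plain Python dictionaries instead of DataFrames, and the page IDs are made up for illustration.

```python
from collections import defaultdict

# Step 4 (hypothetical toy input): page ID -> IDs of the pages it links to
forward = {1: [2, 3], 2: [3], 3: [1]}

# Step 5: number of outgoing links per page
out_degree = {page: len(targets) for page, targets in forward.items()}

# Step 6: invert the table so that each page lists the pages linking to it
reverse = defaultdict(list)
for page, targets in forward.items():
    for target in targets:
        reverse[target].append(page)
reverse = dict(reverse)
```

Here `reverse[3]` contains pages 1 and 2, which are exactly the pages whose weight flows into page 3 during an iteration.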

# 1. Library configuration

In [0]:
import pandas as pd
import re
import math as m
import numpy as np

In [0]:
from pyspark.sql.types import * # Define the data types in the PySpark data model that will be used. Once the type of data is defined, it makes the analysis of data easier
from pyspark.sql.types import ArrayType, StringType, LongType, FloatType

from pyspark.sql.functions import *
from pyspark.sql.functions import size, explode, collect_list 

from pyspark.sql import functions as F # Built-in functions available for DataFrames. Importing them as F avoids shadowing Python built-ins (for example, PySpark's sum hiding Python's built-in sum)
from pyspark.sql import SparkSession # So we can access PySpark/Spark SQL capabilities in PySpark

from operator import add 

In [0]:
# Enable Apache Arrow to speed up converting a Spark DataFrame to a pandas DataFrame with toPandas() and creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). On Spark 3.x this key is deprecated in favor of "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# 2. Checking of the raw data structure
Now that our libraries are configured, we will import the dataset and inspect it in its raw form in order to get an idea of the data that we will be handling.

In [0]:
# The URI for the database, which contains 5823210 entries, stored in a parquet set of files.

# Create a Spark DataFrame from those parquet files
wikipediaDF = spark.read.parquet("dbfs:/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")

As we can see, this database contains 7 columns with the following names and datatypes:
- title: string
- id: integer
- revisionId: integer
- revisionTimestamp: timestamp
- revisionUsername: string
- revisionUsernameId: integer
- text: string

The relevant information for our Wikipedia analysis will be stored in the columns "title", "id", and "text".

It is important to be aware of which datatype each of these columns stores; otherwise our algorithm will not function correctly.

In [0]:
# N = wikipediaDF.count() 

# N would be the total number of pages in the entire database, i.e. the size of the dataset we would operate on if we used all of it. Because we perform the calculations on a smaller dataset, N will instead be the size of the reverse-links DataFrame

In [0]:
# Instead of using the full database, we will analyse the structure on a smaller version containing just 0.01% of the records (approx. 582 records)
# These records are selected randomly. To make the sample reproducible, we fix the seed value to 0

PartialWikipediaDF = wikipediaDF.sample(fraction=0.0001,seed=0).cache()
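
Spark's `sample` treats `fraction` as a per-row inclusion probability, so the resulting size is only approximately `fraction * N`, while a fixed seed makes the selection repeatable. The sketch below illustrates both properties in plain Python; `bernoulli_sample` is a hypothetical helper, not Spark's actual sampler.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [item for item in items if rng.random() < fraction]


rows = range(100_000)
first = bernoulli_sample(rows, fraction=0.01, seed=0)
second = bernoulli_sample(rows, fraction=0.01, seed=0)
```

Both calls return the same rows, and the sample size is close to, but not necessarily exactly, 1,000.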

In [0]:
# PartialWikipediaDF.count()

# This command counts the number of pages in our new, reduced database. Since it is not needed by our algorithm and takes a long time to run (7.88 min), we decided to comment it out. 
# If we do run it, we can see how much smaller our dataset is compared to the original one. 

In [0]:
# display(PartialWikipediaDF)

# Display our smaller version of the database so that we can inspect the information in an easier-to-read format. 
# Out of curiosity, we also created a word cloud in which the most frequent words appear largest. 

# As viewing this table is not necessary for the algorithm to work and the cell took a long time to run (7.59 min), we comment this line out as well.

# 3. Extraction of relevant data
Now that we have a rough idea of our dataset and have decreased its size to a more manageable subset, we will begin with our analysis.

As we mentioned previously and highlighted when displayed as a table, the most important information for our algorithm is stored in the columns named “title”, “id”, and “text”. 
- Title stores the title of the document (string)
- ID stores the ID of the document (integer)
- Text stores the content of the document (string)

This is intuitive when thinking about the purpose of the PageRank algorithm, which is to rank web pages in search engine results. To complete this task, the algorithm needs to know the name (title) of the page, what the page is about (text), and the unique identity of the page to distinguish it from others (ID). The most important piece of information for this algorithm is the set of links that the page content (text) includes, as the importance rating each page receives is a recursively defined measure whereby a page becomes important if important pages link to it.

To create the table containing the forward links, we first have to identify the outgoing links (enclosed in double square brackets, "[[ ]]") and then look up the IDs of the pages they point to. Further, we will be ignoring all external references and resources, such as images. 

Thus, in this section we will focus on those three columns and extract the key information from the text column, namely the links (identifiable as strings in double brackets). To extract this information we will use regular expressions, which are a powerful tool for splitting text into fragments.

As we will also be using Python functions that are not directly callable from the Spark DataFrame, we will have to define some UDFs (user-defined functions).
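
As a small illustration of the extraction idea, the sketch below applies a non-greedy bracket pattern to a short sentence. The sample sentence and the variable names are made up for this example.

```python
import re

# Non-greedy match of everything between "[[" and the nearest "]]"
LINK_PATTERN = re.compile(r'\[\[(.+?)\]\]')

sample_text = "He is an [[India]]n [[actor]] known for [[Crime Patrol (TV series)|Crime Patrol]]."

# Raw candidates, still containing the optional "|display text" part
candidates = LINK_PATTERN.findall(sample_text)

# Keep only the target article name, in lowercase
targets = [candidate.split('|')[0].lower() for candidate in candidates]
```

The non-greedy quantifier `+?` matters here: a greedy `.+` would swallow everything from the first `[[` to the last `]]` in the sentence.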

In [0]:
# parse_links extracts the wiki links from the body of a document.
# It receives a string and returns, as a list of lowercase strings, all non-overlapping matches of the [[...]] pattern.
# The string is scanned left to right, and matches are returned in the order found.

def parse_links(document_body):
  # Locate the candidates that may be links. Only these candidates are lowercased below,
  # which is cheaper than lowercasing the whole document body.
  data = re.findall(r'\[\[(.+?)\]\]', document_body)

  # Get the article name from [[name_article | text_in_article]] and convert it to lowercase
  links = [s.split('|')[0].lower() for s in data]

  # Get the article name after the colon for category links
  links = [s.split(':')[1] if "category:" in s else s for s in links]

  # Drop links that are not relevant for PageRank (any link containing one of these words).
  # A new list is built because removing items from a list while iterating over it skips elements.
  bad_words = ['file', 'image', 'video', 'template']
  links = [link for link in links if not any(bad_word in link for bad_word in bad_words)]

  # Ignore references to sections (internal links within an article)
  links = [s.split('#')[0] for s in links]

  return links
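
A general note on filtering link lists in Python: removing items from a list while looping over that same list can silently skip elements, so building a new list is usually safer. The sketch below shows the difference on made-up link names.

```python
links = ['file:a.png', 'image:b.jpg', 'mumbai']

# Pitfall: removing while iterating shifts the remaining items to the left,
# so the element right after a removed one is never examined.
buggy = list(links)
for link in buggy:
    if 'file' in link or 'image' in link:
        buggy.remove(link)

# Safer: build a new list instead of mutating the one being iterated over
bad_words = ('file', 'image')
clean = [link for link in links if not any(word in link for word in bad_words)]
```

After the loop, `buggy` still contains `'image:b.jpg'`, whereas `clean` contains only `'mumbai'`.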

In [0]:
test="{{Use Indian English|date=April 2015}} {{Infobox person | name = Shavez Khan | image = | caption = | birth_date = | birth_place = India | nationality = India | residence = [[Mumbai]], India | occupation = [[Actor]] | years_active = present | height = }} '''Shavez Khan''' is an [[India]]n television [[actor]]. He has done his roles in various Indian television shows like Shaitaan,<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-feature-episodic-of-colors-shaitaan|title=Shavez Khan to feature in an episodic of Colors' Shaitaan|work=Tellychakkar|date=11 April 2013|accessdate=24 April 2015}}</ref> [[Encounter (Indian TV series)|Encounter]], [[Ek Hasina Thi (TV series)|Ek Hasina Thi]], [[Savdhaan India]],<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-anshul-singh-and-damini-joshi-episodic-of-savdhan-india-140915|title=Shavez Khan, Anshul Singh and Damini Joshi in an episodic of Savdhan India|work=Tellychakkar|date=15 September 2014|accessdate=24 April 2015}}</ref> [[SuperCops vs Supervillains]],<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/rituraj-singh-and-shavez-khan-life-oks-shapath-141009|title=Rituraj Singh and Shavez Khan in Life OK's Shapath|work=Tellychakkar|date=9 October 2014|accessdate=24 April 2015}}</ref> Pyaar Ka The End,<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-bindass-pyaar-ka-the-end-141029|title=Shavez Khan in Bindass' Pyaar Ka The End|work=Tellychakkar|date=29 October 2014|accessdate=24 April 2015}}</ref> [[Pyaar Kii Ye Ek Kahaani]], [[MTV Fanaah]], [[Crime Patrol (TV series)|Crime Patrol]]. He has played his recent role in [[Sony Entertainment Television (India)|Sony TV]]'s [[C.I.D. (Indian TV series)|CID]].<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-sony-tvs-cid-150417|title=Shavez Khan in Sony TV's CID|work=Tellychakkar|date=17 April 2015|accessdate=24 April 2015}}</ref> ==Television== *[[Colors (TV channel)|Colors]]'s Shaitaan *[[Sony Entertainment Television (India)|Sony TV]]'s [[Encounter (Indian TV series)|Encounter]], [[Crime Patrol (TV series)|Crime Patrol]] & [[C.I.D. (Indian TV series)|CID]] *[[Star Plus]]'s [[Ek Hasina Thi (TV series)|Ek Hasina Thi]] *[[Life OK]]'s [[Savdhaan India]] & [[SuperCops vs Supervillains]] *[[Bindass]]' Pyaar Ka The End *[[Star One]]'s [[Pyaar Kii Ye Ek Kahaani]] *[[MTV]]'s [[MTV Fanaah]] ==References== {{Reflist}} ==External links== {{Persondata | NAME = Khan, Shavez | ALTERNATIVE NAMES = | SHORT DESCRIPTION = Indian model and television actor | DATE OF BIRTH = <!--Birth date has been contested. Do not add without providing a reliably published source with a reputation for editorial oversight--> | PLACE OF BIRTH = India | DATE OF DEATH = | PLACE OF DEATH = }} {{DEFAULTSORT:Khan, Shavez}} [[Category:Living people]] [[Category:Indian male television actors]] [[Category:Actors in Hindi television]] [[Category:Indian television personalities]]"

In [0]:
test_links = parse_links(test)
test_links

Out[147]: ['mumbai',
 'actor',
 'india',
 'actor',
 'encounter (indian tv series)',
 'ek hasina thi (tv series)',
 'savdhaan india',
 'supercops vs supervillains',
 'pyaar kii ye ek kahaani',
 'mtv fanaah',
 'crime patrol (tv series)',
 'sony entertainment television (india)',
 'c.i.d. (indian tv series)',
 'colors (tv channel)',
 'sony entertainment television (india)',
 'encounter (indian tv series)',
 'crime patrol (tv series)',
 'c.i.d. (indian tv series)',
 'star plus',
 'ek hasina thi (tv series)',
 'life ok',
 'savdhaan india',
 'supercops vs supervillains',
 'bindass',
 'star one',
 'pyaar kii ye ek kahaani',
 'mtv',
 'mtv fanaah',
 'living people',
 'indian male television actors',
 'actors in hindi television',
 'indian television personalities']

In [0]:
# User defined function to parse the links 
parse_links_udf = udf(parse_links,ArrayType(StringType()))

# 4. Creation of a table of forward links
As our goal is to know how many links each document forwards to, we will now transform our list of links to a list of document IDs and create a table with them. In order to extract the ID of the target linked documents (the pages that are linked) we will have to analyse the entire Wikipedia database and create a table with these two pieces of information. 

The table of forward links will feature one column of page IDs and another column containing all the target pages linked from that ID (the "origin" page).
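
The translation from link titles to page IDs can be sketched in plain Python as a dictionary lookup. The toy pages below are made up, and dropping links whose target title is not among the known pages is an assumption about how unmatched titles are treated.

```python
# Hypothetical toy data: (page ID, lowercase title, parsed link titles)
pages = [
    (1, "godzilla", ["toho", "kaiju", "not in sample"]),
    (2, "toho", ["godzilla"]),
    (3, "kaiju", []),
]

# Lookup table from title to page ID
title_to_id = {title: page_id for page_id, title, _ in pages}

# Forward links expressed as IDs; titles without a known page are dropped
forward_links = {
    page_id: [title_to_id[title] for title in links if title in title_to_id]
    for page_id, _, links in pages
}
```

On Spark DataFrames the same lookup is typically expressed as a join between the exploded link titles and the `title` column, rather than as an in-memory dictionary.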

In [0]:
tolower_udf= udf(lambda x: x.lower())

In [0]:
# Extract the data from the partial Wikipedia DF & store it in a new Spark DF
TempForwardDF = PartialWikipediaDF.select("id", tolower_udf("title").alias("title"),parse_links_udf("text").alias("links"))

In [0]:
display(TempForwardDF)

id,title,links
11986,godzilla,"List(godzilla (franchise), godzilla (1954 film), godzilla (1954 film), godzilla (2014 film), tomoyuki tanaka, ishirō honda, eiji tsubaraya, haruo nakajima, takeo murata, ishirō honda, godzilla (1954 film), dvd, classic media, al c. ward, ishirō honda, terry morse, godzilla, king of the monsters!, dvd, classic media, katsumi tezuka, yū sekida, seiji onaka, toru kawai, kenpachiro satsuma, tsutomu kitagawa, mizuho yoshida (actor), akira watanabe (film art director), teizô toshimitsu, godzilla, king of the monsters!, ishirō honda, terry o. morse, toho, godzilla vs. megaguirus, toho, godzilla: final wars, ryuhei kitamura, toho, godzilla raids again, toho, invasion of astro-monster, ishirō honda, toho, shusuke kaneko, toho, godzilla (2014 film), gareth edwards (director), legendary pictures, kaiju, kaiju, tokusatsu, godzilla (franchise), japan, ishirō honda, godzilla (1954 film), godzilla (comics), godzilla (franchise), toho, godzilla_(franchise), godzilla, king of the monsters!, nuclear bombings of hiroshima and nagasaki, daigo fukuryū maru, nuclear weapon, portmanteau, gorilla, whale, ateji, kanji, hepburn romanization, kunrei romanization, ray harryhausen, the beast from 20,000 fathoms, amphibian, dinosaur, wired (magazine), tyrannosaurus, iguanodon, stegosaurus, alligator, usa today, chimera (mythology), life (magazine), keloid, hibakusha, turner classic movies, akira ifukube, double bass, godzilla 2000, black belt (martial arts), godzilla raids again, transitional form, regeneration (biology), godzilla 2000, takao okawara, toho, godzilla vs. king ghidorah, kazuki ōmori, toho, godzilla vs. mechagodzilla, jun fukuda, toho, godzilla vs. biollante, kazuki ōmori, toho, godzilla: destroy all monsters melee, pipeworks software, zone fighter, the godzilla power hour, godzilla vs. 
hedorah, yoshimitsu banno, toho, ghidorah, the three-headed monster, ishirō honda, toho, godzilla: unleashed, the return of godzilla, koji hashimoto (director), toho, shinto, king kong vs. godzilla, gender-neutral pronoun, godzilla (1998 film), parthogenesis, jsdf, king ghidorah, gigan, mechagodzilla, megalon, biollante, megaguirus, mothra, rodan, anguirus, minilla, king kong, fantastic four, godzilla (1954 film), godzilla, king of the monsters!, joseph e. levine, tokyo metropolitan government building, godzilla vs. king ghidorah, legendary pictures, godzilla (2014 film), gareth edwards (director), godzilla (2016 film), suitmation, animatronics, stop-motion, king kong (1933 film), eiji tsuburaya, the return of godzilla, godzilla vs biollante, godzilla vs destoroyah, kenpachiro satsuma, oxygen deprivation, godzilla vs. spacegodzilla, patrick tatopoulos, motion capture, digitigrade, stan winston, jurassic park (film), the moving picture company, bear, komodo dragon, lizard, lion, gray wolf, the hollywood reporter, popular culture, kaijū, tokusatsu, metaphor, atomic bombings of hiroshima and nagasaki, mtv movie awards, hollywood walk of fame, godzilla: final wars, bambi meets godzilla, mystery science theater 3000, godzilla (song), blue öyster cult, nike, inc., basketball, charles barkley, godzilla vs. 
charles barkley, jeff butler, gamera, yonggary, gorgo (film), gojirasaurus, nomen dubium, coelophysidae, paleontologist, kenneth carpenter, dakosaurus, thalattosuchia, jurassic, nickname, ceratosauria, darren naish, public domain, generic trademark, subway (restaurant), honda, honda odyssey, sea shepherd conservation society, toho, gojira (band), technical death metal, voltage pictures, anne hathaway, shinjuku, toho, godzilla characters, toho monsters, fictional characters introduced in 1954, horror film characters, fictional mutants, fictional dinosaurs, fictional characters with superhuman strength, science fiction film characters, king kong characters, fictional characters with nuclear or radiation abilities, fictional telepaths)"
13049,george eliot,"List(nuneaton, warwickshire, chelsea, london, middlesex, highgate cemetery, highgate, george henry lewes, the mill on the floss, silas marner, middlemarch, daniel deronda, baruch spinoza, miguel de cervantes, honoré de balzac, charlotte brontë, jane austen, arthur schopenhauer, ludwig feuerbach, leo tolstoy, marcel proust, thomas mann, virginia woolf, margaret atwood, norman mailer, christopher hitchens, martin amis, j.k. rowling, victorian era, adam bede, the mill on the floss, silas marner, middlemarch, daniel deronda, realism (arts), george henry lewes, middlemarch, martin amis, julian barnes, the paris review, arbury hall, newdigate family, nuneaton, bedworth, attleborough, warwickshire, evangelicalism, low church, anglican, english midlands, english dissenters, foleshill, coventry, charles bray, robert owen, herbert spencer, harriet martineau, ralph waldo emerson, david strauss, ludwig feuerbach, coventry herald and observer, geneva, john chapman (publisher), the westminster review, herbert spencer, george henry lewes, open marriage, thornton leigh hunt, weimar, baruch spinoza, ethics (spinoza), charles bray, friedrich engels, wilkie collins, realism (arts), scenes of clerical life, blackwood's magazine, adam bede, parson, princess louise, duchess of argyll, queen victoria, edward henry corbould, the mill on the floss, daniel deronda, witley, john walter cross, westminster abbey, highgate cemetery, george henry lewes, poets' corner, foleshill, john birch (engineer), nuneaton museum & art gallery, adam bede, the mill on the floss, silas marner, felix holt, the radical, the legend of jubal, middlemarch, reform act 1832, john ruskin, modern painters, westminster review, william wordsworth, bucolic, romola, florence, girolamo savonarola, impressions of theophrastus such, daniel deronda, virginia woolf, harold bloom, the western canon: the books and school of the ages, harold bloom, the western canon: the books and school of the ages, adam bede, 
the mill on the floss, silas marner, romola, felix holt, the radical, middlemarch, daniel deronda, david strauss, ludwig feuerbach, scenes of clerical life, the lifted veil, impressions of theophrastus such, gordon s. haight, henry alley, david daiches, fr leavis, the mill on the floss, geraldine jewsbury, harry ransom center, university of texas at austin, daniel radcliffe, lily evans, arbury, nuneaton, chelsea, london, 1819 births, 1880 deaths, alumni of bedford college (london), alumni of royal holloway, university of london, english people of welsh descent, english essayists, english sceptics, english women novelists, english novelists, victorian women writers, victorian novelists, 19th-century women writers, 19th-century english writers, people from nuneaton, pseudonymous writers, burials at highgate cemetery, 19th-century british novelists, english translators, english philosophers, 19th-century philosophers, english women philosophers, women essayists, women translators, 19th-century translators)"
38010,heliox,"List(breathing gas, helium, oxygen, saturation diving, technical diving, medicine, air, laminar flow, turbulent flow, standard conditions for temperature and pressure, reynolds number, hagen–poiseuille equation, asthma, bronchodilators, vocal cord dysfunction, croup, chronic obstructive pulmonary disease, dyspnea, exhaustion, anaesthesia, ogg, commercial diving, rebreather, open circuit scuba, wiktionary:hypoxic, bespoke, gas blending, diving cylinder, sound, human voice, formant, trimix (breathing gas), technical diving, argox (breathing gas), nitrox, hydreliox, hydrox (breathing gas), trimix (breathing gas), breathing gases, asthma, helium, medical treatments, respiratory therapy, ro:heliox)"
38252,radiohead,"List(abingdon-on-thames, oxfordshire, alternative rock, experimental rock, electronic music, art rock, atoms for peace (band), 7 worlds collide, xl recordings, ticker tape ltd., hostess entertainment, tbd records, parlophone, capitol records, thom yorke, jonny greenwood, colin greenwood, ed o'brien, philip selway, rock music, abingdon-on-thames, oxfordshire, thom yorke, jonny greenwood, colin greenwood, phil selway, ed o'brien, creep (radiohead song), pablo honey, the bends, ok computer, social alienation, kid a, amnesiac (album), electronic music, krautrock, jazz, hail to the thief, emi, in rainbows, music download, the king of limbs, bbc, rolling stone, rolling stone, abingdon school, abingdon, oxfordshire, the sydney morning herald, thom yorke, colin greenwood, ed o'brien, phil selway, jonny greenwood, nigel powell, andy yorke, jericho tavern, guitar world, the new yorker, headless chickens (uk band), thames valley, independent music, shoegazing, ride (band), slowdive, mojo (magazine), emi, a&r, talking heads, true stories (talking heads album), drill (ep), extended play, paul q. kolderie, sean slade, pixies (band), dinosaur jr., record producer, creep (radiohead song), nme, bbc radio 1, melody maker, pablo honey, anyone can play guitar, stop whispering, pop is dead, grunge, nirvana (band), falsetto, guitar distortion, israel, yoav kutner, tel aviv, kits, modern rock tracks, top 40, uk singles chart, abbey road studios, john leckie, blender (magazine), australasia, the wire (magazine), my iron lung (ep), vox (magazine), the bends, britpop, fake plastic trees, high and dry, just (song), street spirit (fade out), r.e.m., michael stipe, yahoo! music, lucky (radiohead song), war child (charity), the help album, nigel godrich, audio engineer, b-side, street spirit (fade out), didcot, circus (magazine), alanis morissette, st. 
catherine's court, bath, somerset, the beatles, dj shadow, ennio morricone, miles davis, exit music (for a film), baz luhrmann, romeo + juliet, ok computer, ambient music, avant garde, electronic music, rolling stone, progressive rock, pink floyd, the dark side of the moon, select (magazine), billboard 200, grammy awards, grammy award for best alternative music album, grammy award for album of the year, paranoid android, karma police, no surprises, grant gee, meeting people is easy, maryland film festival, 7 television commercials, airbag/how am i driving?, no surprises/running from demons, amnesty international, q (magazine), copenhagen, gloucester, oxford, the observer, kid a, minimalism, ondes martenot, electronic music, string orchestra, billboard 200, spice girls, napster, san francisco chronicle, mp3 newswire, promotional recording, optimistic (song), idioteque, naomi klein, anti-globalisation, no logo, electronic music, grammy award, grammy award for best alternative music album, grammy award for album of the year, independent music, underground music, metacritic, amnesiac (album), uk albums chart, mercury music prize, pyramid song, knives out, i might be wrong, i might be wrong: live recordings, alternative rock, nigel godrich, hail to the thief, metacritic, british phonographic industry, riaa certification, there there, go to sleep, 2 + 2 = 5 (song), modern rock, grammy award for best alternative music album, grammy award for best engineered album, non-classical, united states presidential election, 2000, bbc radio 4, glastonbury festival, coachella valley music and arts festival, com lag (2plus2isfive), the eraser, bodysong (album), there will be blood (album), war child (charity), help!: a day in the life, spike stent, somerset, in rainbows, music download, pay what you want, wired (magazine), billboard (magazine), financial times, the colbert report, comedy central, yahoo! 
news, xl recordings, tbd records, metacritic, rolling stone, mercury music prize, grammy award for best alternative music album, grammy award for album of the year, house of cards (radiohead song), jigsaw falling into place, nude (song), billboard hot 100, bodysnatchers (song), reckoner, remix, social networking service, greatest hits, radiohead: the best of, reading and leeds festivals, these are my twisted words, harry patch (in memory of), harry patch, world war ii, british legion, henry fonda theater, oxfam, 2010 haiti earthquake, familial (album), fan-made, radiohead for haiti, live in praha, the king of limbs, valentine's day, rolling stone, billboard 200, billboard (magazine), prometheus global media, uk albums chart, music week, united business media, metacritic, metacritic, 54th grammy awards, grammy award for best alternative music album, grammy award for best boxed or special limited edition package, lotus flower, grammy award for best rock performance, grammy award for best rock song, supercollider / the butcher, supercollider / the butcher, record store day, tkol rmx 1234567, a.v. 
club, portishead (band), the king of limbs: live from the basement, the daily mail / staircase, metro (british newspaper), glastonbury festival, roseland ballroom, the colbert report, saturday night live, downsview park, ministry of labour (ontario), live nation, cbc news, white stripes, jack white, amok (atoms for peace album), atoms for peace (band), android (operating system), ios, digital art, pitchfork media, tomorrow's modern boxes, pitchfork media, weatherhouse (album), inherent vice (film), supergrass, the guardian, drowned in sound, pitchfork media, charles mingus, timing (music), queen (band), pink floyd, elvis costello, post-punk, joy division, siouxsie and the banshees, magazine (band), alternative rock, r.e.m., pixies, the smiths, sonic youth, hip hop, sample (music), london free press, miles davis, ennio morricone, the beatles, the beach boys, phil spector, girl groups, krzysztof penderecki, electronic music, glitch (music), ambient music, intelligent dance music, warp records, autechre, aphex twin, computer music, jazz, charles mingus, alice coltrane, miles davis, krautrock, can (band), neu!, juice (magazine), 20th century classical music, olivier messiaen, ondes martenot, björk, m.i.a. 
(artist), liars (band), modeselektor, spank rock, mojo (magazine), experimental music, ondes martenot, the age, nigel godrich, canadian broadcasting corporation, george martin, fifth beatle, graphic design, stanley donwood, grammy award for best recording package, carbon-neutral, led, colin greenwood, jonny greenwood, ondes martenot, analogue synthesiser, ed o'brien, philip selway, thom yorke, clive deamer, pablo honey, the bends, ok computer, kid a, amnesiac (album), hail to the thief, in rainbows, the king of limbs, radiohead and philosophy: fitter happier more deductive, radiohead, ato records artists, english alternative rock groups, english electronic music groups, experimental rock groups, capitol records artists, english rock music groups, ivor novello award winners, grammy award winners, nme awards winners, music in oxford, musical groups established in 1985, musical quintets, parlophone artists, xl recordings artists)"
196900,leon schlesinger,"List(looney tunes, you ought to be in pictures, philadelphia, viral infection, warner bros. cartoons, warner bros. cartoons, golden age of american animation, looney tunes, merrie melodies, harman-ising, philadelphia, palace theater (new york city), buffalo, new york, the buffalo news, pacific title & art studio, silent film, talkie, film historian, the jazz singer, looney tunes, animator, hugh harman, rudy ising, bosko, sunset boulevard, bob clampett, friz freleng, looney tunes, merrie melodies, tex avery, chuck jones, frank tashlin, carl stalling, mel blanc, porky pig, daffy duck, bugs bunny, pacific title & art studio, termite terrace, academy awards, list of assets owned by disney, mgm cartoon studio, looney tunes, lisp (speech), mel blanc, sylvester the cat, wikt:foppish, you ought to be in pictures, hollywood steps out, russian rhapsody (film), nutty news, b-movie, western movie, eddie selzer, hollywood forever cemetery, wikipedia:persondata, philadelphia, 1884 births, 1949 deaths, american film producers, american film studio executives, american people of german descent, american television producers, burials at hollywood forever cemetery, businesspeople from los angeles, california, businesspeople from new york, businesspeople from pennsylvania, people from buffalo, new york)"
201227,it can't happen here,"List(sinclair lewis, united states, english language, political fiction, doubleday (publisher), hardcover, political fiction, sinclair lewis, fascism, united states senate, president of the united states, plutocratic, totalitarianism, adolf hitler, schutzstaffel, huey long, united states presidential election, 1936, president of the united states, populism, adjusted for inflation, franklin delano roosevelt, authoritarianism, paramilitary, united states congress, kangaroo court, corporatism, underground railroad, social liberalism, federal theater project, huey long, works progress administration, new deal, federal theater project, rotary club, national socialism, federal theater project, mgm, louis b. mayer, will h. hays, motion picture production code, television movie, screen gems, television pilot, television series, kenneth johnson (producer), nbc, miniseries, extraterrestrials, science fiction, v (1983 miniseries), person of interest (tv series), it happened here, the iron heel, v for vendetta, boston globe, 1935 novels, american novels adapted into films, american political novels, american satirical novels, dystopian novels, novels about totalitarianism, novels by sinclair lewis, novels set in vermont, novels set in washington, d.c., prometheus award winning works)"
324799,baby monster group,"List(group theory, sporadic simple group, order (group theory), monster group, centralizer, outer automorphism group, schur multiplier, bernd fischer (mathematician), character table, charles sims (mathematician), robert griess, john horton conway, oxford university press, griess algebra, group representation, finite field, vertex operator algebra, monstrous moonshine, dedekind eta function, journal of algebra, sporadic groups)"
162369,list of heads of state of north korea,"List(emblem of north korea, kim yong-nam, kim tu-bong, heads of state, north korea, kim tu-bong, workers' party of korea, choe yong-gon (army commander), kim il-sung, eternal president of the republic, kim jong-il, yang hyong-sop, kim yong-nam, supreme people's assembly, kim jong un, list of leaders of north korea, president of north korea, eternal president of the republic, premier of north korea, list of premiers of north korea, government of north korea, politics of north korea, government of north korea, heads of state of north korea, north korea-related lists, lists of heads of state, politics of north korea)"
340927,shenzhou 5,"List(shenzhou (spacecraft), long march 2f, jiuquan satellite launch center, jiuquan launch area 4, geocentric orbit, low earth orbit, yang liwei, shenzhou 4, shenzhou 6, shenzhou program, human spaceflight, chinese space program, shenzhou spacecraft, long march 2f, soviet union, russia, united states, yang liwei, chinese nationalism, united nations, people's daily, spacedaily, political status of taiwan, people's daily, general secretary of the communist party of china, president of the people's republic of china, hu jintao, great hall of the people, sina.com, central committee of the communist party of china, state council of china, central military commission, sina.com, prime minister of japan, junichiro koizumi, cctv.com, george w. bush, united states department of state, sean o'keefe, dprk, kwangmyŏngsŏng-1, yang liwei, chinese space program, tiangong program, shenzhou spacecraft, long march rocket, jiuquan satellite launch center, shenzhou 4, shenzhou 6, human spaceflights, shenzhou program, spacecraft launched in 2003, 2003 in china)"
343031,friedmann–lemaître–robertson–walker metric,"List(riemannian metric, exact solutions in general relativity, einstein field equations, general relativity, homogeneity (physics), isotropic, metric expansion of space, universe, simply connected space, multiply connected, physics reports, tuple, scale factor (cosmology), alexander friedmann, georges lemaître, howard p. robertson, arthur geoffrey walker, springer (publisher), homogeneity (physics), isotropy, elliptical space, euclidean space, hyperbolic space, scale factor (universe), gaussian curvature, circumference, schwarzschild coordinates, comoving distance, radius of curvature (mathematics), elliptical geometry, gaussian curvature, comoving distance, radius of curvature (mathematics), analytic function, power series, sinc function, equation of state (cosmology), einstein field equations, friedmann equations, energy-momentum tensor, international journal of theoretical physics, big bang, lambda-cdm model, observable universe, primordial fluctuations, cosmic background explorer, wmap, constant of integration, thermodynamics of the universe, first law of thermodynamics, adiabatic process, gravitation, general relativity, cosmological constant, dark energy, cosmological constant, cosmological constant, cosmological constant, dark energy, scalar field theory, quintessence (physics), first law of thermodynamics, cosmological constant, planck epoch, quantum mechanics, alexander friedmann, zeitschrift für physik, albert einstein, georges lemaître, catholic university of leuven (1834–1968), edwin hubble, arthur eddington, monthly notices of the royal astronomical society, howard p. 
robertson, arthur geoffrey walker, radius of curvature (mathematics), einstein's universe, static spacetime, gravitational constant, light year, wilkinson microwave anisotropy probe, planck (spacecraft), ehlers–geren–sachs theorem, the large scale structure of space-time, monthly notices of the royal astronomical society, astrophysical journal, astrophysical journal, astrophysical journal, proceedings of the london mathematical society, coordinate charts in general relativity, exact solutions in general relativity, physical cosmology, metric tensors)"


In [0]:
# Extract the data (ID and title) from the full Wikipedia DF & store it in a new Spark DF
titleidDF = wikipediaDF.select("id", tolower_udf("title").alias("title"))

In [0]:
# Convert from Spark DF to a Pandas DF that contains the title and ID of the documents
# Note: in our runs this conversion often failed on the first attempt and had to be re-run several times before it succeeded
titleidPDF = titleidDF.toPandas()

In [0]:
# Map the link titles of a page to the IDs of the matching pages

def titles2id(links, titleidPDF):
  if len(links) > 0: # Check if there are links
    # Select the IDs of the rows in titleidPDF whose title appears in links
    ids = titleidPDF[titleidPDF.title.isin(links)].id.to_list()
    
  else: # If there are no links
    ids = [] # Create empty list 
    
  return list(set(ids)) # Remove duplicate IDs and return the result as a list

In [0]:
# Testing our title matching function
titles2id(test_links, titleidPDF)

Out[156]: [28473060,
 14533,
 42314121,
 42331404,
 34063244,
 43374716,
 19189,
 592282,
 18531772,
 17336029,
 36496543]
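
The matching step can also be illustrated without pandas. The following is a minimal sketch, not the notebook's implementation: it assumes a plain dictionary from lowercase title to page ID (the helper names `build_title_index` and `titles_to_ids` are hypothetical), and, unlike `isin`, it keeps only the first ID when several pages share a title.

```python
def build_title_index(id_title_pairs):
    """Map each lowercase title to its page ID (the first occurrence wins)."""
    index = {}
    for page_id, title in id_title_pairs:
        index.setdefault(title.lower(), page_id)
    return index


def titles_to_ids(links, index):
    """Return the unique IDs of the linked titles that exist in the index."""
    return sorted({index[title] for title in links if title in index})


# A few (id, title) pairs taken from the tables in this notebook.
pairs = [(736, "Albert Einstein"), (307, "Abraham Lincoln"), (358, "Algeria")]
index = build_title_index(pairs)

# Duplicate and unknown titles are dropped, as in titles2id.
ids = titles_to_ids(["albert einstein", "algeria", "albert einstein", "missing page"], index)
print(ids)
```

A dictionary lookup avoids scanning the whole title table for every page, which may matter when the function is called once per row.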

In [0]:
# Wrap titles2id in a Spark UDF that returns an array of long IDs without null elements
titles2id_UDF = udf(lambda x: titles2id(x,titleidPDF),ArrayType(LongType(),False))

In [0]:
# Extract the IDs of all the links a page forwards to in order to create the forward links table 
ForwardDF = TempForwardDF.select("id",titles2id_UDF("links").alias("links")).cache()

In [0]:
# Display the IDs of all the links a page forwards to, i.e. the forward links table
display(ForwardDF)

id,links
11986,"List(36865, 438275, 303628, 79887, 1190931, 36896, 930853, 20518, 1670693, 925736, 3135027, 1222711, 854081, 567874, 538696, 208463, 634965, 187479, 1922658, 18653283, 18637926, 1947751, 267370, 17229419, 621, 28272, 4627056, 8311, 1236610, 36101762, 1862287, 21123733, 1009817, 396968, 1236650, 30313654, 53430, 4259513, 302275, 4726468, 870601, 2348754, 11988, 15573, 5220058, 325852, 12000, 12004, 12005, 37604, 45178603, 9073903, 325875, 3572981, 9182971, 1909502, 9322751, 12546, 30467, 18993927, 18184, 226575, 490771, 192790, 147734, 1646358, 21785, 7396632, 23188764, 18057502, 1112869, 15655, 2120494, 4400, 105776, 54581, 3921, 162143, 11014498, 894309, 3540339, 1245556, 791422, 23624578, 65411, 11778948, 68485, 11664, 8660880, 570258, 570263, 1356696, 1252253, 1790878, 13729, 18637733, 33702, 570279, 16823212, 1407410, 248760, 228798, 30863297, 2156483, 17360, 30683, 266204, 20002271, 59361, 5765607, 65001, 9182702, 33777, 18998781)"
13049,"List(42368, 230019, 6532, 33925, 228998, 2427653, 29604227, 21244047, 94097, 12785172, 34543638, 199961, 21018, 323481, 208411, 248859, 32798, 670749, 383648, 294177, 15782, 18622119, 2799016, 3038250, 70188, 8495660, 213936, 4368178, 47923, 147635, 23953851, 700, 875196, 27480261, 1550022, 87240, 1972169, 32915915, 189774, 153295, 45777, 7997906, 22928212, 26202, 213978, 1047259, 803804, 44766, 310113, 946273, 209892, 39658852, 32742, 12521, 1129578, 5178089, 10891114, 43245, 1056621, 6148078, 16320493, 1128563, 19444, 34115577, 2922874, 161275, 3745788, 26255613)"
38010,"List(31489, 43638795, 18957, 11024, 30206738, 22303, 460321, 3117601, 18994087, 291242, 49838, 19334830, 23868856, 13256, 42480973, 459471, 38481, 200033, 9955, 44905, 294126, 18209535)"
38252,"List(43008, 57858, 259082, 797714, 48147, 42010, 156699, 2000410, 156702, 2943008, 10277, 494630, 5166, 2776113, 14733874, 51763, 22068, 41523, 72758, 2555445, 174650, 2416701, 21611071, 2465857, 273476, 431685, 198217, 1926730, 186444, 273484, 273486, 273487, 431695, 273489, 2011729, 18949200, 30866008, 44635, 6126173, 11690592, 22113, 1072228, 4883556, 152171, 3823213, 191086, 915566, 18933360, 43956336, 29812, 13696118, 21113, 238715, 1102461, 9282173, 39483010, 19344515, 31365, 25063047, 43144329, 42634, 8803981, 491664, 148113, 1775761, 28309, 371350, 13696153, 2416795, 23648411, 303261, 32927, 146595, 24007843, 22944933, 66730, 79026, 87731, 1721523, 25804468, 462002, 25817785, 228538, 171195, 18994363, 437949, 182462, 5369534, 16579, 591556, 3852997, 232142, 1407696, 639185, 92374, 1574616, 24061144, 31514330, 31453, 2530527, 730350, 21231, 2004720, 30456, 423161, 30419193, 587003, 5126908, 15613, 1588483, 19698439, 32009, 10510, 19344654, 53518, 170771, 170772, 170773, 205078, 511253, 954140, 2041117, 1170208, 22308, 7129894, 9510, 47777069, 355118, 51503, 205105, 81213, 6974, 172350, 28480, 1732421, 29511, 6766412, 13825869, 18309966, 2798927, 140624, 140630, 1462102, 238424, 155482, 7038301, 34041182, 27803997, 719715, 20836, 52581, 235879, 38252, 27705200, 136566, 1029495, 1654648, 170874, 77179, 4477, 1121666, 65411, 156547, 180103, 17292, 1529230, 37610899, 2186651, 202652, 20110751, 1712034, 7444902, 486312, 30860202, 45294511, 416688, 17339824, 2632114, 12610483, 20405, 276919, 16728504, 1988033, 612293, 1560005, 2291656, 4865481, 8642505, 3897803, 2520013, 6062032, 319442, 5079506, 53207, 28178906, 928738, 1714150, 24047593, 13603306, 27424748, 805870, 921071, 167409, 1571314, 1040371, 7668, 32722929, 33642486, 13695994, 2808829, 12799)"
196900,"List(198400, 51847, 51848, 73100, 161935, 3985, 50585, 68145, 199608, 324, 44748, 26956, 180432, 198354, 42524885, 201174, 6547798, 140632, 197469, 325863, 50287, 7672)"
201227,"List(333058, 31756, 24850, 21780, 50714, 396572, 27040, 24909346, 26787, 14705955, 73257, 146730, 501161, 11054, 102446, 24113, 2320314, 8569916, 32190, 2731583, 400064, 21347657, 31770697, 2764750, 55779, 30439, 40558, 382576, 765430, 350712, 3434750)"
324799,"List(29628162, 5034215, 728168, 15807, 493527, 1121587, 728596, 14838867, 20022166, 12695, 1287577, 47422, 11615)"
162369,"List(2012485, 21255, 2646090, 21259, 5345872, 3803825, 15572722, 154099, 392436, 695507, 19718837, 25084215, 7380029)"
340927,"List(24833, 3414021, 153223, 263163, 342667, 22309268, 553623, 31769, 26779, 151210, 4980140, 25391, 340917, 340934, 2481352, 47568, 28749264, 340947, 357847, 489179, 26440802, 31975, 869878, 14014329, 66555, 3434750, 220159)"
343031,"List(5378, 166404, 19604228, 251399, 31880, 30862736, 14865, 10639249, 4116, 318742, 610583, 5916, 67229, 144417, 61478, 2406953, 1529775, 1536565, 59958, 38454, 285623, 1686520, 224698, 2578746, 728892, 1148092, 5985207, 23874378, 38992, 5173456, 921168, 1147994, 424540, 1682143, 736, 39137, 164193, 147939, 186468, 1540704, 10363747, 523879, 9697, 563689, 18399589, 985963, 25202, 12024, 10489, 18421631)"


# 5. Computation of the number of outgoing links
Now that we have created the forward links table, we will compute the number of outgoing links for each page (document) using the forward links DF. This is a crucial step because the page rank algorithm divides the rank of each page by its number of outgoing links when it distributes that rank among the pages it links to.

To perform this computation we will use the PySpark function "size", which returns the length of the links array of each row in the ForwardDF dataframe.

In [0]:
# Compute the number of outgoing links for each document
OutgoingsLinksCountersDF = ForwardDF.select("id",F.size("links").alias("counter"))

**Note:** Instead of using a user-defined function (udf_count_links), check the pyspark.sql.functions library and use the *size* function, which is more efficient!

In [0]:
display(OutgoingsLinksCountersDF)

id,counter
11986,114
13049,68
38010,22
38252,211
196900,22
201227,31
324799,13
162369,13
340927,27
343031,50
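
As a plain-Python illustration of what the counter column holds, the sketch below counts the outgoing links of a toy graph. The page IDs and link lists are hypothetical and are not taken from the dataset.

```python
# Hypothetical forward links: page ID -> IDs of the pages it links to.
forward_links = {1: [2, 3], 2: [3], 3: [1], 4: []}

# The counter of a page is simply the length of its list of links.
outgoing_counts = {page_id: len(links) for page_id, links in forward_links.items()}
print(outgoing_counts)
```

A page without links (page 4 here) gets a counter of 0, which is why the rank computation has to guard against dividing by zero.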


In [0]:
# Convert from Spark DF to a Pandas DF
OutgoingsLinksCountersPDF = OutgoingsLinksCountersDF.toPandas()

# 6. Creation of a table of reverse links 
Next, we will create the table of reverse links using the previously computed forward links DF. This step is also important, as our page rank algorithm needs to know which pages link to each document.

In [0]:
# Explode the links array of ForwardDF so that each row of TemporalReverseLinks holds one (source id, target id) pair
TemporalReverseLinks = ForwardDF.select("id",F.explode("links").alias("t_link"))

# Display the first 10 rows
TemporalReverseLinks.show(10)

+-----+-------+
|   id| t_link|
+-----+-------+
|11986|  36865|
|11986| 438275|
|11986| 303628|
|11986|  79887|
|11986|1190931|
|11986|  36896|
|11986| 930853|
|11986|  20518|
|11986|1670693|
|11986| 925736|
+-----+-------+
only showing top 10 rows



In [0]:
# Create a new DF (ReverseDF) by grouping the TemporalReverseLinks DF by the t_link column and aggregating the id column using collect_list
# Creating RDD will help us later with optimizing the algorithm

tempReverseDF = TemporalReverseLinks.groupBy("t_link").agg(F.collect_list("id").alias("links"))
reverseRDD = tempReverseDF.rdd
reverseDF=spark.createDataFrame(reverseRDD,["id","links"])

In [0]:
display(reverseDF)

id,links
307,List(27021283)
324,List(196900)
358,"List(1493498, 45318426)"
621,List(11986)
627,List(45318426)
657,List(2316601)
700,List(13049)
717,"List(520457, 1099529, 15873675)"
734,"List(32264786, 29880047)"
736,List(343031)
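
The two Spark operations used here (explode, then group by target and collect the sources) can be sketched in plain Python. This is only an illustration on a hypothetical toy graph; the variable names are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical forward links: page ID -> IDs of the pages it links to.
forward_links = {1: [2, 3], 2: [3], 3: [1]}

# "Explode": one (source, target) pair per link.
pairs = [(src, dst) for src, links in forward_links.items() for dst in links]

# "Group by target and collect": target ID -> list of source IDs.
reverse_links = defaultdict(list)
for src, dst in pairs:
    reverse_links[dst].append(src)
reverse_links = dict(reverse_links)

print(pairs)
print(reverse_links)
```

Only pages that receive at least one link appear as keys in the reverse table, which is why the reverse links DF can contain fewer IDs than the forward links DF.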


# 7. Creation of the page rank table 
Now that we have computed the forward links, the outgoing link counts, and the reverse links, we will use the algorithm described in Brin and Page's paper to set the PageRank for each page. Our goal is to create a table that displays the page rank by ID. 

We will initialise the rank of every page to 0.85 / N, where N is the number of pages in the reverse links table (the pages for which we compute a rank) and 0.85 is the so-called damping factor, i.e. the click-through probability (probability that a user will go to the linked web page).

The PageRankPDF will include the same IDs as the ones in the reverse links DF.

In [0]:
# Collect the page IDs from reverseDF:
data = [row["id"] for row in reverseDF.collect()]                   
PageRankPDF = pd.DataFrame(data,columns=["id"])

# Counting number of pages that we will calculate pagerank for:
N = reverseDF.count()

# Initiating all ranks equally
PageRankPDF["pagerank"] = 0.85/N

In [0]:
display(PageRankPDF)

id,pagerank
307,0.0001133031191682218
324,0.0001133031191682218
358,0.0001133031191682218
621,0.0001133031191682218
627,0.0001133031191682218
657,0.0001133031191682218
700,0.0001133031191682218
717,0.0001133031191682218
734,0.0001133031191682218
736,0.0001133031191682218


# 8. Implementation of page rank algorithm

The page rank algorithm is based on the links pointing to a page. The new rank of a single page is calculated as follows:
1. Take the current rank of the chosen page.
2. For every page that links to the chosen page:
- take its rank,
- take the number of pages it links to,
- if that number is not zero, divide its rank by it.
3. Add up the values from step 2, multiply the sum by 0.15, and add 0.85 divided by the number of all pages.

These steps are repeated until the ranks are stable or the maximum number of iterations is reached.
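
The update described in the steps above can be written compactly. As a sketch, using symbols that do not appear in the notebook: $B(p)$ is the set of pages linking to page $p$, $C(q)$ is the number of outgoing links of page $q$, $N$ is the number of ranked pages, and $PR_t$ is the rank after iteration $t$ (taken as 0 for a linking page that has no entry in the rank table):

$$PR_{t+1}(p) = \frac{0.85}{N} + 0.15 \sum_{q \in B(p),\; C(q) > 0} \frac{PR_t(q)}{C(q)}$$

Note that Brin and Page's paper weights the link sum by the damping factor $d$ and the constant term by $1 - d$, so with the constants used here the link contributions carry a weight of 0.15 rather than 0.85.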

In [0]:
def new_pagerank(link_id: int, links: list, current_pr):    
  """
  Calculate new rank for specific page
  
  Arguments: 
  - link_id: id of chosen link; Used to retrieve information about current rank from dataframe
  - links: array containing ids of pages forwarding to chosen page
  - current_pr: pandas dataframe containing: pages ids, current rank of pages, differences between current and previous ranks
  
  Returns:
  - n_pr: new rank
  - difference: difference between current and previous rank
  """
  
  n_pr = 0
  # Current rank of the chosen page (looked up by column name rather than by position)
  curr_pr = current_pr[current_pr["id"] == link_id].iloc[0]["pagerank"]
  for link in links:
    # Number of outgoing links of the page that forwards to the chosen page
    counter = OutgoingsLinksCountersPDF[OutgoingsLinksCountersPDF["id"] == link].iloc[0]["counter"]
    if counter > 0:
      current_link_pr = current_pr[current_pr["id"] == link]
      if len(current_link_pr) > 0:
        current_link_pr = current_link_pr.iloc[0]["pagerank"]
      else: # The forwarding page has no entry in the rank table
        current_link_pr = 0
      n_pr += current_link_pr/counter
  n_pr = 0.85/N + 0.15*n_pr
  difference = n_pr - curr_pr
  
  return [float(n_pr), float(difference)]

Now we need to determine how many times the ranks should be recalculated. To do this, we will use a loop with two exit conditions. 
We will set a maximum of 20 iterations to calculate the pagerank. If our algorithm does not converge within this number of iterations, the loop stops and the results of the last iteration are displayed.
The other exit condition is the stability of the ranks. If the difference between the previous and the new rank is smaller than a specific threshold (1e-7) for every page, the loop is terminated too.

In [0]:
count = 1
stop = False

# Loop with 2 exit conditions: number of iterations (at most 20) and stability of the ranks
while count <= 20 and not stop: 
  print("Iteration no:", count)
  # Re-create the UDF in every iteration so that it uses the latest PageRankPDF
  new_pagerank_udf = udf(lambda x, y: new_pagerank(x,y,PageRankPDF), ArrayType(FloatType()))
  NewPageRankDF = reverseDF.select('id',new_pagerank_udf('id','links').alias('pagerank'))
  NewPageRankDF = NewPageRankDF.select('id',NewPageRankDF.pagerank[0].alias("pagerank"), NewPageRankDF.pagerank[1].alias("difference"))
  PageRankPDF = NewPageRankDF.toPandas()
  
  # Stop only if the rank of every page changed by less than the threshold
  stop = True
  for diff in PageRankPDF["difference"]:
    if abs(diff) >= 1e-7:
      stop = False
      
  count += 1 

Iteration no: 1
Iteration no: 2
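
To make the loop easier to follow, here is a minimal, self-contained sketch of the same iteration on a toy graph, written in plain Python instead of Spark. It assumes the update rule and constants used in this notebook (0.85/N plus 0.15 times the link sum); the graph and the function name `pagerank_sketch` are hypothetical.

```python
def pagerank_sketch(forward, base=0.85, weight=0.15, tol=1e-7, max_iter=20):
    """Iterate the notebook's update rule on a dict-based graph."""
    # Number of outgoing links per page.
    out_count = {page: len(links) for page, links in forward.items()}

    # Reverse links: target ID -> list of source IDs.
    reverse = {}
    for page, links in forward.items():
        for target in links:
            reverse.setdefault(target, []).append(page)

    n = len(reverse)
    ranks = {page: base / n for page in reverse}  # equal initial ranks

    iterations = 0
    for _ in range(max_iter):
        new_ranks = {}
        for page, sources in reverse.items():
            # Sources without a rank entry contribute 0.
            total = sum(ranks.get(src, 0.0) / out_count[src]
                        for src in sources if out_count[src] > 0)
            new_ranks[page] = base / n + weight * total
        iterations += 1
        largest_change = max(abs(new_ranks[p] - ranks[p]) for p in reverse)
        ranks = new_ranks
        if largest_change < tol:  # ranks are stable
            break
    return ranks, iterations


# Hypothetical toy graph: page ID -> IDs of the pages it links to.
forward = {1: [2, 3], 2: [3], 3: [1]}
ranks, iterations = pagerank_sketch(forward)
print(iterations, ranks)
```

In this toy graph page 3 receives links from two pages and ends up with the highest rank, while the loop stops well before the iteration limit because every change falls below the threshold.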


In [0]:
# Computed ranks:
display(PageRankPDF)

id,pagerank,difference
307,0.00011330312,8.992626e-13
324,0.00011330312,8.992626e-13
358,0.00011330312,8.992626e-13
621,0.00011330312,8.992626e-13
627,0.00011330312,8.992626e-13
657,0.00011330312,8.992626e-13
700,0.00011330312,8.992626e-13
717,0.00011330312,8.992626e-13
734,0.00011330312,8.992626e-13
736,0.00011330312,8.992626e-13


# 8. Showing the results

In order to better analyze the results, we create a dataframe that contains the page titles in addition to the page IDs. We use the function *id2title*, the counterpart of the previously used function *titles2id*, to get the titles.

In [0]:
# Extract the title of a page based on its ID
def id2title(ids, titleidPDF):
  title = titleidPDF[titleidPDF["id"] == ids].title.to_list() # Return a list of the titles in titleidPDF whose ID matches the given ID
  if len(title) > 0:
    return title[0]
  else: # If there are no titles
    return "There is no matching title to this id" 

# Create a UDF function of the id2title function 
id2title_UDF = udf(lambda x: id2title(x,titleidPDF),StringType())

In [0]:
# Convert the ranks back to a Spark DF and add the title of each page
FinalPDF = spark.createDataFrame(PageRankPDF)
FinalPDF = FinalPDF.select("id", id2title_UDF("id").alias("title"), "pagerank")

In [0]:
display(FinalPDF)

id,title,pagerank
307,abraham lincoln,0.00011330312
324,academy awards,0.00011330312
358,algeria,0.00011330312
621,amphibian,0.00011330312
627,agriculture,0.00011330312
657,asphalt,0.00011330312
700,arthur schopenhauer,0.00011330312
717,alberta,0.00011330312
734,actinopterygii,0.00011330312
736,albert einstein,0.00011330312
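
When the ranks differ, the pages of interest are the highest-ranked ones. The sketch below shows one way to attach titles and sort by rank in plain Python. The rank values are made up for illustration (they are not results of this notebook), and `top_pages` is a hypothetical helper.

```python
# Made-up ranks (NOT results from the notebook) and an ID -> title lookup.
ranks = {736: 0.00014, 307: 0.00011, 358: 0.00012}
titles = {736: "albert einstein", 307: "abraham lincoln", 358: "algeria"}


def top_pages(ranks, titles, k=2):
    """Return the k highest-ranked pages as (id, title, pagerank) tuples."""
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    return [(page_id, titles.get(page_id, "unknown"), rank)
            for page_id, rank in ordered[:k]]


top = top_pages(ranks, titles)
print(top)
```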


# 9. Conclusions

Our implementation is a simplified version of the page rank algorithm created and used by Google. Before running it on the data, a number of preparation steps were needed to obtain the dataframes to work on. 

The first step was loading the data. The dataset is available in Databricks, so we could import it directly from there. It is stored in Parquet, a binary, columnar storage format for big data that is designed to be highly efficient for both reading and writing large datasets. Data is organized into columns rather than rows, and each column is stored separately. This allows for more efficient compression and encoding of the data, as well as more efficient querying, because only the columns needed for a particular query have to be read from disk.

To save time, we used only a fraction of the whole dataset, which contains over 5 million rows of Wikipedia page information. We then kept only the useful columns and parsed the links that were relevant for the algorithm. The next steps were creating the *forward dataframe*, which holds page IDs and the lists of pages those pages link to, and the *reverse dataframe*, which holds page IDs and the lists of pages linking to each of those pages. Both were crucial for running the algorithm and are described in detail in the corresponding sections of the notebook. User-defined functions are used throughout the notebook. They allow us to apply a Python function to a column of a Spark dataframe.

Looking at the results of the algorithm, we can see that the ranks do not vary much. At first we expected larger differences between the calculated values. After closer analysis, we came to the conclusion that the small variance is a consequence of drastically shrinking the data size. Taking only 0.001 of all records of the Wikipedia dataset left many pages with very few other pages linking to them, so their rank did not change. The values should vary more when the algorithm is run on a bigger chunk or on the whole dataset. Choosing a proper threshold also matters when performing the calculations. A smaller threshold yields more stable ranks, but also requires more iterations to compute. For testing the implementation we chose a threshold that made the calculations quicker.