# Wikipedia Analysis

### Project 3 - Massive computing

By  Mireia Kesti (100406960) and Aleksandra Jamróz (100491363)

### Context - page rank algorithm
The goal of this algorithm was to set a weight for each webpage on the internet, based on the number of external links pointing to the page. The intention behind this was to objectively determine the interest of a webpage. 
The PageRank algorithm was proposed in 1998 by Larry Page and Sergey Brin and served as the foundation for the original Google search engine.

### Context - goals of this project
We will be implementing the original algorithm to set the weights for each page in Wikipedia (English version), using the image stored in the Databricks Databases Set. This database contains 5823210 entries, stored in a parquet set of files. For simplicity and optimization purposes we will not be using the complete dataset for the analysis of the structure, but a smaller version instead. 
To achieve this we will be using Apache Spark Dataframes to handle both the Wikipedia database and the intermediate results. The end-goal is to obtain a Pandas DF with three columns, namely the title of the page, its ID, and its pagerank. 


#### Steps we will be following
- 1. Configuration of the library
- 2. Checking of the raw data structure
- 3. Extraction of relevant data
- 4. Creation of a table of forward links
- 5. Computation of the number of outgoing links
- 6. Creation of a table of reverse links 
- 7. Creation of the page rank table 
- 8. Definition of the number of iterations for the page rank

# 1. Library configuration

In [0]:
import pandas as pd
import re
import math as m
import numpy as np

In [0]:
from pyspark.sql.types import * # Define the data types in the PySpark data model that will be used. Once the type of data is defined, it makes the analysis of data easier
from pyspark.sql.types import ArrayType, StringType, LongType, FloatType

from pyspark.sql.functions import *
from pyspark.sql.functions import size, explode, collect_list 

from pyspark.sql import functions as F # List of built-in functions available for DataFrame. Import them as F to avoid namespace coverage (such as pyspark sum function covering python built-in sum function)
from pyspark.sql import SparkSession # So we can access PySpark/Spark SQL capabilities in PySpark

from operator import add 

In [0]:
# To optimize when converting a Spark DataFrame to a Pandas DataFrame using the call toPandas() & when creating a Spark DataFrame from a Pandas DataFrame with createDataFrame(pandas_df), set to "true"
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# 2. Checking of the raw data structure
Now that our libraries are configured, we will import the dataset and inspect it in its raw form in order to get an idea of the data that we will be handling.

In [0]:
# The URI for the database, which contains 5823210 entries, stored in a parquet set of files.

# Create a Spark DataFrame from those parquet files
wikipediaDF = spark.read.parquet("dbfs:/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")

As we can see, this database contains 7 columns with the following names and datatypes:
- title: string
- id: integer
- revisionId: integer
- revisionTimestamp: timestamp
- revisionUsername: string
- revisionUsernameId: integer
- text: string

The relevant information for our Wikipedia analysis will be stored in the columns "title", "id", and "text".

It is important to be aware of which datatype each of this columns is storing, else our algorithm will not function correctly.

In [0]:
# N = wikipediaDF.count() 

# N was a count the total number of pages in the entire database. This would be the size of the dataset that we operate on if we used the whole dataset. For performing calculations on smaller dataset, N value will be the size of reverse df 

In [0]:
# Instead of using the full database, we will use a smaller version to analyse the structure, with just 0.01% of records (approx. 582 records)
# We will select this small percentage of records randomly. In order to minimize random results, we will set a fixed seed value equal to 0

PartialWikipediaDF = wikipediaDF.sample(fraction=0.0001,seed=0).cache()

In [0]:
# PartialWikipediaDF.count()

# This command counts the number of pages in our new, reduced database Since this command is not relevant for our algorithm and takes a long time to run (7.88 min), we decided to comment it out. 
# If we do run it we can appreciate how much our dataset has diminished in comparison to the original one. 

In [0]:
# display(PartialWikipediaDF)

# Display our smaller version of the database. This way we can see the information in an easier to read format. 
# For curiosity purposes, we also created a word cloud which displays the most frequent words used the largest. 

# As it is not necessary to view this table for the algorithm to work and it took a long time to run (7.59 min), we will be commenting this line out as well.

# 3. Extraction of relevant data
Now that we have a rough idea of our dataset and have decreased its size to a more manageable subset, we will begin with our analysis.

As we mentioned previously and highlighted when displayed as a table, the most important information for our algorithm is stored in the columns named “title”, “id”, and “text”. 
- Title stores the title of the document (string)
- ID stores the ID of the document (integer)
- Text stores the content of the document (string)

This is intuitive when thinking about the purpose of the page rank algorithm, which is to rank web pages in their search engine results. In order to complete this task, the algorithm will need to know the name (title) of the page, what the page is about (text), and the unique identity of the page to distinguish it from others (ID). The most important piece of information for this algorithm is are the links that the page content (text) includes, as the rating of importance each page will recieve is a recursively defined measure whereby a page becomes important if important pages link to it.

To create the table containing the forward links, we have to first identify which are the outgoing links (identified with double "[[]]") and then select the ID rows for these hyperlinks. Further, we will be ignoring all external references and resources, such as images. 

Thus, in this section we will focus on those three columns, and extract the key information from the text column, namely the links (identifiable as strings in brackets). To extract this information we will use regular expressions because regular expressions are a powerful tool to split texts into fragments.

As we will also be using Python functions that are not directly callable from the Spark Dataframe, we will be have to define some UDFs (user defined functions).

In [0]:
# Implementation of the parse_links function to parse the links in the document body
# This function recieves strings and returns returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

def parse_links(document_body):
  # document_body = document_body.lower() # Convert to lowercase. To optimize our code we chose to only convert to lowercase the pontential candidates that may be links instead of all the text 
  
  # Locate the potential candidates that may be links
  data = re.findall(r'\[\[(.+?)\]\]',document_body) 
  links = [] # Create empty list of links
  
  if (len(data)>0): # Check if there are links in the article
    
    # Convert to lowercase all candidate links
    for s in data:
      s = s.lower() 
      
    # Get article name when [[name_article | text_in_article]]
    links = [s.split('|')[0].lower() for s in data]  
    
    # Get article name after a colon
    links = [s.split(':')[1] if "category:" in s else s for s in links]  
      
    # Deleting not relevant links for page rank:
    bad_words = ['file', 'image', 'video', 'template']
    for bad_word in bad_words:
      for link in links:
        if bad_word in link:
          links.remove(link)

     # Ignore references or internal links within an article
    links = [s.split('#')[0] if "#" in s else s for s in links]
 
  return links

In [0]:
test="{{Use Indian English|date=April 2015}} {{Infobox person | name = Shavez Khan | image = | caption = | birth_date = | birth_place = India | nationality = India | residence = [[Mumbai]], India | occupation = [[Actor]] | years_active = present | height = }} '''Shavez Khan''' is an [[India]]n television [[actor]]. He has done his roles in various Indian television shows like Shaitaan,<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-feature-episodic-of-colors-shaitaan|title=Shavez Khan to feature in an episodic of Colors' Shaitaan|work=Tellychakkar|date=11 April 2013|accessdate=24 April 2015}}</ref> [[Encounter (Indian TV series)|Encounter]], [[Ek Hasina Thi (TV series)|Ek Hasina Thi]], [[Savdhaan India]],<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-anshul-singh-and-damini-joshi-episodic-of-savdhan-india-140915|title=Shavez Khan, Anshul Singh and Damini Joshi in an episodic of Savdhan India|work=Tellychakkar|date=15 September 2014|accessdate=24 April 2015}}</ref> [[SuperCops vs Supervillains]],<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/rituraj-singh-and-shavez-khan-life-oks-shapath-141009|title=Rituraj Singh and Shavez Khan in Life OK's Shapath|work=Tellychakkar|date=9 October 2014|accessdate=24 April 2015}}</ref> Pyaar Ka The End,<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-bindass-pyaar-ka-the-end-141029|title=Shavez Khan in Bindass' Pyaar Ka The End|work=Tellychakkar|date=29 October 2014|accessdate=24 April 2015}}</ref> [[Pyaar Kii Ye Ek Kahaani]], [[MTV Fanaah]], [[Crime Patrol (TV series)|Crime Patrol]]. He has played his recent role in [[Sony Entertainment Television (India)|Sony TV]]'s [[C.I.D. (Indian TV series)|CID]].<ref>{{cite web|url=http://www.tellychakkar.com/tv/tv-news/shavez-khan-sony-tvs-cid-150417|title=Shavez Khan in Sony TV's CID|work=Tellychakkar|date=17 April 2015|accessdate=24 April 2015}}</ref> ==Television== *[[Colors (TV channel)|Colors]]'s Shaitaan *[[Sony Entertainment Television (India)|Sony TV]]'s [[Encounter (Indian TV series)|Encounter]], [[Crime Patrol (TV series)|Crime Patrol]] & [[C.I.D. (Indian TV series)|CID]] *[[Star Plus]]'s [[Ek Hasina Thi (TV series)|Ek Hasina Thi]] *[[Life OK]]'s [[Savdhaan India]] & [[SuperCops vs Supervillains]] *[[Bindass]]' Pyaar Ka The End *[[Star One]]'s [[Pyaar Kii Ye Ek Kahaani]] *[[MTV]]'s [[MTV Fanaah]] ==References== {{Reflist}} ==External links== {{Persondata | NAME = Khan, Shavez | ALTERNATIVE NAMES = | SHORT DESCRIPTION = Indian model and television actor | DATE OF BIRTH = <!--Birth date has been contested. Do not add without providing a reliably published source with a reputation for editorial oversight--> | PLACE OF BIRTH = India | DATE OF DEATH = | PLACE OF DEATH = }} {{DEFAULTSORT:Khan, Shavez}} [[Category:Living people]] [[Category:Indian male television actors]] [[Category:Actors in Hindi television]] [[Category:Indian television personalities]]"

In [0]:
test_links = parse_links(test)
test_links

In [0]:
# User defined function to parse the links 
parse_links_udf = udf(parse_links,ArrayType(StringType()))

# 4. Creation of a table of forward links
As our goal is to know how many links each document forwards to, we will now transform our list of links to a list of document IDs and create a table with them. In order to extract the ID of the target linked documents (the pages that are linked) we will have to analyse the entire Wikipedia database and create a table with these two pieces of information. 

The table of forward links with feature a column of page IDs and another column with containing all the target pages of that ID ( "origin" page).

In [0]:
tolower_udf= udf(lambda x: x.lower())

In [0]:
# Extract the data from the partial Wikipedia DF & store it in a new Spark DF
TempForwardDF = PartialWikipediaDF.select("id", tolower_udf("title").alias("title"),parse_links_udf("text").alias("links"))

In [0]:
display(TempForwardDF)

In [0]:
# Extract the data (ID and title) from the full Wikipedia DF & store it in a new Spark DF
titleidDF=wikipediaDF.select("id",tolower_udf("title").alias("title"));

In [0]:
# Convert from Spark DF to a Pandas DF that contains the title and ID of the documents
# running it for the first time never worked for us, sometimes it needed several attempts to work properly
titleidPDF = titleidDF.toPandas()

In [0]:
# Extract the ID value of the links in the title column 

def titles2id(links, titleidPDF):
  data_titles = titleidPDF
  
  if (len(links)>0): # Check if there are links
    ids = data_titles[data_titles.title.isin(links)].id.to_list() # Return a list of IDs from data_titles that match the titles in links
    
  else: # If there are no links
    ids = [] # Create empty list 
    
  return list(set(ids)) # Convert the result to a list 

In [0]:
# Broadcasting dataframe with all linkages between links and titles
# broadcast_title_idPDF = sc.broadcast(titleidPDF)

In [0]:
# Testing our title matching function
titles2id(test_links, titleidPDF)

In [0]:
# Create a UDF function of the titles2id function 
titles2id_UDF = udf(lambda x: titles2id(x,titleidPDF),ArrayType(LongType(),False))

In [0]:
# Extract the IDs of all the links a page forwards to in order to create the forward links table 
ForwardDF = TempForwardDF.select("id",titles2id_UDF("links").alias("links")).cache()

In [0]:
# Display the IDs of all the links a page forwards to, i.e. the forward links table
display(ForwardDF)

# 5. Computation of the number of outgoing links
Now that we have created the forward links matrix, we will compute the backward links matrix. For this we will compute the number of outgoing links for each page (document) using our previous forward links DF. This is a crucial step because our page rank algorithm needs this information in order to adequately perfom its ranking and importance calculations.

To perform this computation we will use the PySpark function "size" to check the length of the links in the ForwardDF dataframe.

In [0]:
# Compute the number of outgoing links for each document
OutgoingsLinksCountersDF = ForwardDF.select("id",F.size("links").alias("counter"))

**Note:** Instead use user defined function udf_count_links, check the pyspark.sql.functions library and use *size* function, which is more efficient!

In [0]:
display(OutgoingsLinksCountersDF)

In [0]:
# Convert from Spark DF to a Pandas DF
OutgoingsLinksCountersPDF = OutgoingsLinksCountersDF.toPandas()

# 6. Creation of a table of reverse links 
Next, we will create the table of reverse links using the previously computed forward links DF. This step is also important, as our page rank algorithm needs to know which links are directed to each document.

In [0]:
# Create a new dataframe (TemporalReverseLinks) from an existing one (ForwardDF)
TemporalReverseLinks = ForwardDF.select("id",F.explode("links").alias("t_link"))

# Display the first 10 rows
TemporalReverseLinks.show(10)

In [0]:
# Create a new DF (ReverseDF) by grouping the TemporalReverseLinks DF by the t_link column and aggregating the id column using collect_list
# Creating RDD will help us later with optimizing the algorithm

tempReverseDF = TemporalReverseLinks.groupBy("t_link").agg(F.collect_list("id").alias("links"))
reverseRDD = tempReverseDF.rdd
reverseDF=spark.createDataFrame(reverseRDD,["id","links"])

In [0]:
display(reverseDF)

# 7. Creation of the page rank table 
Now that we have computed the forward, backward, and reverse links matrices, we will use the algorithm described in Brin and Page's paper to set the PageRank for each page. Our goal is to create a table that displays the page rank by ID. 

We will initialise the rank number to be 0,85 / N, where N is the total number of pages in the full Wikipedia database and 0,85 is the so-called damping factor, i.e. the click-through probability (probability that a user will go to the linked web page).

The pagerankDF will include the same IDs are the ones in the Reverse DF.

In [0]:
# Collecting data from rdd: 
data = [row["id"] for row in reverseDF.collect()]                   
PageRankPDF = pd.DataFrame(data,columns=["id"])

# Counting number of pages that we will calculate pagerank for:
N = reverseDF.count()

# Initiating all ranks equally
PageRankPDF["pagerank"] = 0.85/N

In [0]:
display(PageRankPDF)

id,pagerank
307,0.0001133031191682218
324,0.0001133031191682218
358,0.0001133031191682218
621,0.0001133031191682218
627,0.0001133031191682218
657,0.0001133031191682218
700,0.0001133031191682218
717,0.0001133031191682218
734,0.0001133031191682218
736,0.0001133031191682218


# 8. Implementation of page rank algorithm

Page rank algorithm is based on links forwarding to a page. Steps of calculating new rank for 1 page:
1. take current rank of chosen page
2. for every page forwarding to chosen page:
- take its rank
- take the number of sites this page is forwarding to
- divide its rank by number of sites if number of sites is not zero
3. Sum calculated value multiplied by 0.15 and 0.85 divided by number of all pages

Loop steps above until ranks are stable or maximum number of iterations is approached

In [0]:
def new_pagerank(link_id: int, links: list, current_pr: pandas.DataFrame):    
  """
  Calculate new rank for specific page
  
  Arguments: 
  - link_id: id of chosen link; Used to retrieve information about current rank from dataframe
  - links: array containing ids of pages forwarding to chosen page
  - current_pr: pandas dataframe containing: pages ids, current rank of pages, differences between current and previous ranks
  
  Returns:
  - n_pr: new rank
  - difference: difference between current and previous rank
  """
  
  n_pr = 0
  curr_pr = current_pr[current_pr["id"] == link_id].iloc[0][1]
  for link in links:
    counter = OutgoingsLinksCountersPDF[OutgoingsLinksCountersPDF["id"] == link].iloc[0]["counter"]
    if counter > 0:
      current_link_pr = current_pr[current_pr["id"] == link]      
      if len(current_link_pr)>0:                                              
        current_link_pr = current_link_pr.iloc[0][1]                                       
      else:                                                     
        current_link_pr = 0
      n_pr += current_link_pr/counter
  n_pr = 0.85/N + 0.15*n_pr
  difference = n_pr - curr_pr
  
  return [float(n_pr), float(difference)]

Now we need to determine how many times the page rank should calculate the rank. To do this, we will use a loop. 
We will set a maximum of 20 iterations to calculate the pagerank. In the case that our algorithm were not to converge in this number of iterations, the loop will stop and display the end results.
Another break condition for the function is stability of the rank. If difference between previous and new rank is smaller than a specific treshold, the loop is terminated too.

In [0]:
count = 1
stop = False

# Loop with 2 exit conditions: no of iterations and stability of rank
while count < 20 and not stop: 
  print("Iteration no:", count)
  new_pagerank_udf = udf(lambda x, y: new_pagerank(x,y,PageRankPDF), ArrayType(FloatType()))
  NewPageRankDF = reverseDF.select('id',new_pagerank_udf('id','links').alias('pagerank'))
  NewPageRankDF = NewPageRankDF.select('id',NewPageRankDF.pagerank[0].alias("pagerank"), NewPageRankDF.pagerank[1].alias("difference"))
  PageRankPDF = NewPageRankDF.toPandas()
  
  stop = True
  for diff in PageRankPDF["difference"]:
    if diff >= 1e-7:
      stop = False
      
  count += 1 

Iteration no: 1
Iteration no: 2


In [0]:
# Counted ranks:
display(PageRankPDF)

id,pagerank,difference
307,0.00011330312,8.992626e-13
324,0.00011330312,8.992626e-13
358,0.00011330312,8.992626e-13
621,0.00011330312,8.992626e-13
627,0.00011330312,8.992626e-13
657,0.00011330312,8.992626e-13
700,0.00011330312,8.992626e-13
717,0.00011330312,8.992626e-13
734,0.00011330312,8.992626e-13
736,0.00011330312,8.992626e-13


# 8. Showing the results

In order to better analyze the results, we create dataframe containing titles apart from pages ids. We use function *id2title*, similar to previously used function *titletoid* to get the titles.

In [0]:
# Extract the title value of the page based on id
def id2title(ids, titleidPDF):
  title = titleidPDF[titleidPDF["id"] == ids].title.to_list() # Return a list of titles from data_titles that match the titles in links
  if len(title) > 0:
    return title[0]
  else: # If there are no titles
    return "There is no machting title to this id" 

# Create a UDF function of the titles2id function 
id2title_UDF = udf(lambda x: id2title(x,titleidPDF),StringType())

In [0]:
FinalPDF = spark.createDataFrame(PageRankPDF)
FinalPDF = FinalPDF.select("id", id2title_UDF("id").alias("title"), "pagerank")

In [0]:
display(FinalPDF)

id,title,pagerank
307,abraham lincoln,0.00011330312
324,academy awards,0.00011330312
358,algeria,0.00011330312
621,amphibian,0.00011330312
627,agriculture,0.00011330312
657,asphalt,0.00011330312
700,arthur schopenhauer,0.00011330312
717,alberta,0.00011330312
734,actinopterygii,0.00011330312
736,albert einstein,0.00011330312


# 9. Conclusions