# Wikipedia Analysis

This notebook is NOT a skeleton. It is a sample set of instructions for analysing, in the classroom, the Wikipedia dataset provided by Databricks.

During the class we will fill in the cells to implement the PageRank algorithm.

Background on the theory: https://blog.majestic.com/company/understanding-googles-algorithm-how-pagerank-works/

# Steps to implement **Page Rank Algorithm**
* Check the raw data structure
* Extract the relevant data: [Document ID, List of links]
* Transform the *List of links* into a *List of Document IDs*: the *Forward Links Table*
* Calculate the *Number of output links*
* Construct the *Reverse Links Table* from the *Forward Links Table*
* Initialize the *Page Rank Table*
* Recalculate the *Page Rank Table* until either:
 * All the *Page Rank* values are stable
 * The maximum number of iterations is reached (suggested value: 20 iterations)

# A) Library configuration

In [0]:
import pandas as pd
import re
import math 

In [0]:
from pyspark.sql.types import *
from pyspark.sql.types import ArrayType, StringType,LongType, FloatType
from pyspark.sql import functions as F
from pyspark.sql.functions import udf  # used below to wrap Python functions as Spark UDFs
from pyspark.sql import SparkSession
from operator import add 

In [0]:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

# 1) Check Data Raw Structure
We will use the Databricks Wikipedia dataset, which contains the 2012 English Wikipedia database.

Here we define the wikipediaDF Spark DataFrame, with the full database content.
We need to know the total number of documents in this database.

In [0]:
wikipediaDF=spark.read.parquet("dbfs:/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")

To program the PageRank algorithm, we extract a subset of the full database. We sample a fraction of 0.00001 of the full database and, to make the sample reproducible, we fix the seed to 0.

**Note:** For the final evaluation, the fraction should be changed to 0.001.

In [0]:
PartialWikipediaDF=wikipediaDF.sample(fraction=0.00001,seed=0).cache()

Now, we can check the data structure:

## Conclusions from the raw data analysis:
* There are several columns, but the relevant information is stored in the following columns:
 * **title**: The title of the document.
 * **id**: The id of the document.
 * **text**: The content of the document. The most relevant information here (for the PageRank algorithm) is the set of links to other documents. Each link is enclosed in double square brackets and contains the title of the target document.

# 2) Extract Relevant Data
From the previous conclusions, we know we need to select just three columns [*title*,*id*,*text*], and that the relevant information in the *text* column is the links, identified by the titles enclosed in brackets.

We use regular expressions to select the relevant information from the *text* column.

We implement a *parse_links* function, which receives a string and returns a list of strings with the titles of the linked documents.

This is a plain Python function, so it is not directly callable from a Spark DataFrame. We therefore also need to define a User Defined Function (UDF) to make it usable in Spark DataFrames.

In [0]:
# This is a suboptimal function. It should be improved to detect more complex links, links to documents in image references, etc.
# STUDY THIS ARTICLE: https://es.wikipedia.org/wiki/Wikipedia:Estructura_de_un_art%C3%ADculo
def parse_links(document_body):
  document_body_u = document_body.lower() 
  data=re.findall(r'\[\[(?!category|wikipedia|file|help|special)((?:(?![\[\]]).)*)\]\]',document_body_u)
  if (len(data)>0):
    links=[s.split('|')[0].split('#')[0].lower() for s in data]
  else:
    links=[]
  return links
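
As a quick check of how the link extraction behaves, the sketch below applies the same regular expression to a small made-up wikitext string. The sample text and the name `extract_link_titles` are illustrative assumptions and are not part of the dataset or of the notebook.

```python
import re

# Same pattern as in parse_links: text inside [[...]] that does not start
# with one of the excluded namespace prefixes.
LINK_PATTERN = re.compile(
    r'\[\[(?!category|wikipedia|file|help|special)((?:(?![\[\]]).)*)\]\]'
)

def extract_link_titles(document_body):
    """Hypothetical stand-in for parse_links, for local experimentation."""
    matches = LINK_PATTERN.findall(document_body.lower())
    # Drop the display text after '|' and the section anchor after '#'.
    return [m.split('|')[0].split('#')[0] for m in matches]

# Made-up sample text (assumption, not taken from the dataset).
sample_text = (
    "'''Madrid''' is the capital of [[Spain]]. "
    "See [[Community of Madrid|the region]] and [[Spain#History]]. "
    "[[Category:Capitals]] [[File:x.jpg|thumb]]"
)
sample_links = extract_link_titles(sample_text)
print(sample_links)
```

Note that the same title can appear more than once in the result, and that category and file links are skipped by the negative lookahead.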

In [0]:
parse_links_udf = udf(parse_links,ArrayType(StringType()))

It is necessary to convert the text to lowercase (both the *title* and *text* columns).

In [0]:
tolower_udf= udf(lambda x: x.lower())

Now we create parsedDF with the selected information, naming the resulting columns "title", "id", and "links".

In [0]:
parsedDF = PartialWikipediaDF.select(tolower_udf("title").alias("title"),"id",parse_links_udf("text").alias("links"))

# 3) Transform *List of links* to *List of Document IDs*: *Forward Links Table*

To get the *id* of the target documents, we need to analyse the full Wikipedia database and extract a table with these two pieces of information: *id* and *title*.

This information is static and is used in a distributed way, so we will collect the data and convert it to a Pandas DataFrame (PDF suffix).

In [0]:
titleidDF=wikipediaDF.select("id",tolower_udf("title").alias("title"))

In [0]:
titleidPDF=titleidDF.toPandas()

In [0]:
broadcast_title_idPDF = sc.broadcast(titleidPDF)

In [0]:
def text_links_2idx_2(links):
  title_idxPDF = broadcast_title_idPDF.value
  if ( len(links)>0):
    # Select the rows whose title appears in the list of links and take
    # their id values. The result is converted to a list.
    result = title_idxPDF[title_idxPDF.title.isin(links)].id.to_list()
  else:
    result = [] 
  return result

In [0]:
udf_text_links_2idx =udf(text_links_2idx_2,ArrayType(LongType()))

In [0]:
ForwardDF = parsedDF.select("id",udf_text_links_2idx("links").alias("links")).cache()

Once the function is verified, we define the UDF so that it can be invoked on **parsedDF** to select just [id, list of ids].

To be efficient, we broadcast the variable *titleidPDF* so it can be used inside the transformation.

We will call this DataFrame **ForwardDF**.
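
The title-to-id translation can also be pictured with plain Python structures. The sketch below is only an illustration under stated assumptions: the titles and ids are made up, `links_to_ids` is a hypothetical name, and a dictionary lookup replaces the Pandas `isin` filter, so it keeps link order and drops repeated ids, which may differ from what the notebook's function returns.

```python
# Made-up title -> id table (assumption; the real one comes from wikipediaDF).
title_to_id = {"spain": 1, "madrid": 2, "community of madrid": 3}

def links_to_ids(links, title_to_id):
    """Translate link titles into document ids, skipping unknown titles."""
    ids = []
    seen = set()
    for title in links:
        doc_id = title_to_id.get(title)
        # Ignore titles missing from the table and ids already collected.
        if doc_id is not None and doc_id not in seen:
            seen.add(doc_id)
            ids.append(doc_id)
    return ids

# Forward links table for a single hypothetical document with id 2.
forward = {
    2: links_to_ids(
        ["spain", "community of madrid", "spain", "unknown page"],
        title_to_id,
    )
}
# Number of output links per document, as needed in the next step.
out_degree = {doc_id: len(targets) for doc_id, targets in forward.items()}
print(forward, out_degree)
```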

# 4) Calculate *Number of output links*
Using **ForwardDF**, we calculate the number of output links per document.
Because we will need this information to calculate the PageRank, we will collect it in a Pandas DataFrame and define a broadcast variable.

In [0]:
ForwardDF_WC= ForwardDF.select("id","links",F.size("links").alias("n_succesors"))

# 5) Construct *Reverse Links Table* from *Forward Links Table*
Now we define the Reverse Links Table DataFrame (**ReverseDF**) by transforming **ForwardDF** into a DataFrame with [*id*,*list of ids*] or similar.

*Suggestion*: The *list of ids* could contain not only the document ids but also their number of output links. This would simplify the PageRank calculation.

In [0]:
def reverseId(id,links):
  if (len(links)>0):
    reverse = [ (tgt_id,id) for tgt_id in links ]
  else:
    reverse=[]
  return reverse

In [0]:
ForwardRDD = ForwardDF_WC.rdd

In [0]:
ReverseRDD=(ForwardRDD
 .flatMap(lambda r: reverseId(r.id,r.links))
 .groupByKey()
 .map(lambda r: (r[0],list(r[1])))
 )
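
The inversion performed by `flatMap` and `groupByKey` can be sketched without Spark. The example below is a minimal illustration on a made-up forward table; `build_reverse` is a hypothetical name, and storing the out-degree next to each predecessor follows the suggestion above rather than the notebook's actual cells.

```python
from collections import defaultdict

def build_reverse(forward):
    """Invert {source: [targets]} into {target: [(source, out_degree)]}."""
    out_degree = {src: len(targets) for src, targets in forward.items()}
    reverse = defaultdict(list)
    for src, targets in forward.items():
        for tgt in targets:
            # Keep the source id together with its number of output links.
            reverse[tgt].append((src, out_degree[src]))
    return dict(reverse)

# Made-up forward links table (assumption): document 4 has no output links.
forward = {1: [2, 3], 2: [3], 3: [1], 4: []}
reverse = build_reverse(forward)
print(reverse)
```

Documents that nobody links to, such as document 4 here, do not appear as keys in the reverse table, which is why the notebook later needs a full join with the forward table.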


In [0]:
reverseDF=spark.createDataFrame(ReverseRDD,["id","links"])

In [0]:
reverseDF_WC= reverseDF.select("id","links",F.size("links").alias("n_precessors")).withColumnRenamed("links","precessors").withColumnRenamed("id","id1")

In [0]:
joineddf = reverseDF_WC.join(ForwardDF_WC, ForwardDF_WC.id == reverseDF_WC.id1, how='full')
YesId1DF = joineddf.filter("id1 is NOT NULL")
NoId1DF = joineddf.filter("id1 is NULL")
NoId1DF_v1 = NoId1DF.withColumn("id1", NoId1DF.id)
FullId1DF= YesId1DF.union(NoId1DF_v1)
ReverseDF = FullId1DF.select("id1", "precessors", "n_precessors", "links", "n_succesors").withColumnRenamed("links","succesors").withColumnRenamed("id1","id")

In [0]:
ReverseDF.show()

+--------+--------------------+------------+---------+-----------+
|      id|          precessors|n_precessors|succesors|n_succesors|
+--------+--------------------+------------+---------+-----------+
|   22936|[14800, 755367, 6...|          68|     null|       null|
|31108048|      [25652, 25652]|           2|     null|       null|
|  175924|             [28413]|           1|     null|       null|
|25575836|             [28413]|           1|     null|       null|
| 3350364|             [10577]|           1|     null|       null|
|10929004|             [10577]|           1|     null|       null|
|  266204|             [11986]|           1|     null|       null|
|  925736|    [11986, 6416597]|           2|     null|       null|
|   26840|[50479, 718098, 1...|          10|     null|       null|
|  572302|             [50479]|           1|     null|       null|
|  291336|[55578, 1669998, ...|           3|     null|       null|
|   22326|[68554, 2751139, ...|           6|     null|       n

# 6) Initialize *Page Rank Table*
We define a Pandas DataFrame (**PageRankPDF**) with the ids of the documents in **ReverseDF** and their initial rank value.

This could be:

\\(\frac{0.85}{N}\\)

where *N* is the number of documents in **ReverseDF**. The cell below uses a constant initial value of 0.2 instead.

In [0]:
ReverseDF= ReverseDF.select("id", "precessors", "n_precessors","n_succesors").withColumn("rank", F.lit(0.2)).na.fill(0).cache()

In [0]:
ReversePDF= ReverseDF.toPandas()

In [0]:
line = ReversePDF[ReversePDF["id"]== 204228]["precessors"].tolist()[0]
line[0]

Out[27]: 37729

# 7) Recalculate *Page Rank Table* until:
 * All the *Page Rank* values are stable
 * The maximum number of iterations is reached (suggested value: 20 iterations)

We should define a PageRank function, which receives the *id* of the document, the *list of links*, and the current PageRank table (**PageRankPDF**), and returns the new PageRank of the document *id*.
The loop must use **ReverseDF** as the master source of information.

Now, we define a loop:
* While the exit conditions are false:
 * Calculate the new page ranks by invoking the PageRank function on **ReverseDF**, creating a new DataFrame NewPageRankDF with the [id,new_pagerank] info. (It could contain more information if you use the suggestion in section 5.)
 * Collect the new page ranks and compare them with the previous **PageRankPDF**, checking whether the new page ranks vary by more than a threshold. If not, the exit condition is met.
 * Update **PageRankPDF** with the new values, and update the UDF functions.
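
The loop described above can be sketched in plain Python on a toy graph. This is a minimal illustration under stated assumptions, not the notebook's Spark implementation: the graph is made up, the names `pagerank_step` and `iterate_pagerank` are hypothetical, and the update rule is the non-normalised form `(1 - d) + d * sum(rank(q) / out_degree(q))` over the distinct predecessors `q` of each page, without any special handling of pages that have no output links.

```python
def pagerank_step(ranks, predecessors, out_degree, d=0.85):
    """One update of every page from the ranks of its predecessors."""
    new_ranks = {}
    for page in ranks:
        total = 0.0
        # Distinct predecessors only, ignoring self-links.
        for q in set(predecessors.get(page, [])):
            if q != page:
                total += ranks[q] / out_degree[q]
        new_ranks[page] = (1 - d) + d * total
    return new_ranks

def iterate_pagerank(ranks, predecessors, out_degree, max_iter=20, tol=1e-9):
    """Repeat the update until the ranks are stable or max_iter is reached."""
    iterations = 0
    for _ in range(max_iter):
        new_ranks = pagerank_step(ranks, predecessors, out_degree)
        # Sum of per-page absolute changes between two iterations.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in ranks)
        ranks = new_ranks
        iterations += 1
        if delta < tol:
            break
    return ranks, iterations

# Made-up graph (assumption): A -> B, A -> C, B -> C, C -> A.
predecessors = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2, "B": 1, "C": 1}
initial = {page: 0.2 for page in out_degree}
final_ranks, n_iter = iterate_pagerank(initial, predecessors, out_degree, max_iter=200)
print(final_ranks, n_iter)
```

With this form of the formula and no dangling pages, the ranks settle so that they add up to the number of pages rather than to 1.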

In [0]:
def new_pagerank_wd(id1, pagerank, reversePDF):
    '''
    1 - Iterate over the ids of the predecessors 
    2 - Compute the weighted sum (PageRank formula) 
    '''
    list_of_ids= reversePDF[reversePDF["id"]== id1]["precessors"].tolist()[0]
    d = 0.85 
    
    suma= 0.0
    new_rank = 0.0
    
    if list_of_ids is not None:
      no_repeated_list = list(set(list_of_ids))
      for k in no_repeated_list:
        a = int(k) 
        temp= pagerank
        if int(id1) != a:
          s= pagerank[pagerank["id"]== a]["n_succesors"].tolist()
          r= pagerank[pagerank["id"]== a]["rank"].tolist()
          
          suma += float(r[0]/s[0])
    else: 
      suma= 0 
          
    new_rank = (1-d)+ d * suma
          
    return float(new_rank)

In [0]:
def new_pagerank(id1, pagerank, reversePDF):
    '''
    1 - Iterate over the ids of the predecessors 
    2 - Compute the weighted sum (PageRank formula) 
    '''
    list_of_ids= reversePDF[reversePDF["id"]== id1]["precessors"].tolist()[0]
    
    suma= 0.0
    new_page_rank = 0.0
    if list_of_ids is not None:
      no_repeated_list = list(set(list_of_ids))
      for k in no_repeated_list:
        a = int(k) 
        temp= pagerank
        if int(id1) != a:
          s= pagerank[pagerank["id"]== a]["n_succesors"].tolist()
          r= pagerank[pagerank["id"]== a]["rank"].tolist()
          
          suma += float(r[0]/s[0]) 
        
    return float(suma)

In [0]:
def new_pagerank_nosuccessors(pagerank,size):
    '''
    Takes into account all the ids that do not have any successor, and shares their rank across the whole dataframe
    '''
    # First we add the corresponding value for the pages with no successors 
    withoutNextDF = pagerank.filter(pagerank["n_succesors"]==0)
    #temp= withoutNextDF.select("rank").rdd.reduce(add)
    #sum= 0.0
    #for i in temp: sum += i 
    value = withoutNextDF.select(F.sum(withoutNextDF["rank"]).alias("rank")).collect()[0]["rank"]
    
    UnionDF = pagerank.withColumn("rank", pagerank.rank + F.lit(value/size))
    
    # Then we will normalize the data set so that the greatest value of the rank will be 1 
    #max_val = UnionDF.agg({"rank" : "max"}).first()[0]
    #newpagerank= UnionDF.withColumn("rank", UnionDF.rank/max_val)
    
    return UnionDF
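
The redistribution done by `new_pagerank_nosuccessors` can be pictured with a dictionary instead of a Spark DataFrame. The sketch below is an illustration on made-up values, and `share_dangling_rank` is a hypothetical name: it adds up the rank of the pages without successors and adds an equal share of that total to every page, including the pages without successors themselves.

```python
def share_dangling_rank(ranks, out_degree):
    """Add an equal share of the rank held by pages with no successors."""
    size = len(ranks)
    dangling_total = sum(
        rank for page, rank in ranks.items() if out_degree.get(page, 0) == 0
    )
    bonus = dangling_total / size
    return {page: rank + bonus for page, rank in ranks.items()}

# Made-up ranks and out-degrees (assumption): only page C has no successors.
ranks = {"A": 0.5, "B": 0.25, "C": 0.25}
out_degree = {"A": 1, "B": 1, "C": 0}
shared = share_dangling_rank(ranks, out_degree)
print(shared)
```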

In [0]:
count= 0
temp= 2
# broadcast_PageRankPDF= sc.broadcast(ReversePDF)
PageRankPDF= ReversePDF
size = ReversePDF.shape[0]
PrevRankPDF= PageRankPDF
temps= []

while (count < 5) and (temp > 0.00000000001):
    
    # First we take into account the contribution of the usual pages
    # udf_new_pagerank = udf(lambda l: new_pagerank(l, PageRankPDF), FloatType())
    udf_new_pagerank = udf(lambda l: new_pagerank_wd(l, PageRankPDF, ReversePDF), FloatType())
    # NewPageRankDF = ReverseDF.select("id",udf_new_pagerank("precessors").alias("rank"),"n_succesors").cache()
    NewPageRankDF = ReverseDF.select("id", udf_new_pagerank("id").alias("rank"),"n_succesors").cache()
    
    # Then the contribution of the non successor pages 
    NewPageRankDF_2 = new_pagerank_nosuccessors(NewPageRankDF,size)
    PageRankPDF = NewPageRankDF_2.toPandas() 
    #PageRankPDF = NewPageRankDF.toPandas()
    
    # temp = abs(PageRankPDF["rank"] - PrevRankPDF["rank"]).sum()
    temp = abs((abs(PageRankPDF["rank"].sum()) - abs(PrevRankPDF["rank"].sum()))/size) *1.0 
    temps.append([temp]) 
    
    PrevRankPDF = PageRankPDF
    count += 1 
        
# FinalPageRankDF = NewPageRankDF_2.orderBy(F.desc("rank")).select("id", "rank")
FinalPageRankDF = NewPageRankDF_2.orderBy(F.desc("rank")).select("id", "rank")
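
The exit condition in the loop above compares the totals of the rank column between two iterations. A possible alternative, hinted at by the commented-out line, is to add up the per-page absolute differences. The sketch below contrasts both measures on made-up values; the function names are hypothetical and both tables are assumed to be keyed by the same page ids.

```python
def total_based_change(new_ranks, prev_ranks):
    """Change measured on the totals, as in the loop above."""
    size = len(new_ranks)
    return abs((abs(sum(new_ranks.values())) - abs(sum(prev_ranks.values()))) / size)

def per_page_change(new_ranks, prev_ranks):
    """Change measured page by page (sum of absolute differences)."""
    return sum(abs(new_ranks[page] - prev_ranks[page]) for page in new_ranks)

# Made-up ranks (assumption): the two pages simply swap their values.
prev_ranks = {"a": 0.3, "b": 0.7}
new_ranks = {"a": 0.7, "b": 0.3}
total_delta = total_based_change(new_ranks, prev_ranks)
page_delta = per_page_change(new_ranks, prev_ranks)
print(total_delta, page_delta)
```

In this example the totals are identical, so the total-based measure reports no change even though every page moved. The per-page measure requires matching the rows of both tables by id before subtracting.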

In [0]:
FinalPageRankDF = NewPageRankDF_2.orderBy(F.desc("rank")).select("id", "rank")
FinalPageRankDF.show()
print("count")
print(count)
print("temp")
print(temp)
print("size")
print(size)


In [0]:
ReverseDF.filter("id == 6667872").show()

+-------+----------+------------+-----------+----+
|     id|precessors|n_precessors|n_succesors|rank|
+-------+----------+------------+-----------+----+
|6667872|      null|           0|          1| 0.2|
+-------+----------+------------+-----------+----+



In [0]:
temps

Out[48]: [[0.10090229524218522],
 [0.005070704351300779],
 [0.00025946528151423636],
 [1.327645951414854e-05],
 [6.791474163904148e-07]]