# Create a Historical Link Graph for Wikipedia

https://phabricator.wikimedia.org/T186558

Note that in this notebook the Spark context is already created by the driver. To run the notebook, execute:

* export PYSPARK_DRIVER_PYTHON=jupyter 
* export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
* pyspark2 --master yarn --deploy-mode client --executor-memory 2g --conf spark.dynamicAllocation.maxExecutors=32 

If you are executing this script with spark-submit instead, you should add an init function like this:

from pyspark import SparkContext
from pyspark.sql import HiveContext

def spark_init():
    # Initialize the Spark context and quiet the logs
    sc = SparkContext()
    log4j = sc._jvm.org.apache.log4j
    log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
    sqlctx = HiveContext(sc)
    return sc, sqlctx

sc, sqlContext = spark_init()
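
The cells in this notebook call `spark.read` and `spark.sql`, and the `pyspark2` shell provides that `spark` session object, whereas `spark_init()` only returns a context and a Hive context. A minimal sketch of an alternative for spark-submit, assuming Spark 2.x; the function name `spark_session_init` and the application name are made up for illustration:

```python
def spark_session_init(app_name="historical-link-graph"):
    """Create a Hive-enabled SparkSession and quiet the logs (sketch)."""
    # Imported inside the function so the definition loads even where
    # pyspark is not installed; the call itself still requires pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)      # placeholder application name (assumption)
        .enableHiveSupport()    # needed to query Hive tables such as wmf.mediawiki_history
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")  # quiet the logs
    return spark
```

In a spark-submit script, `spark = spark_session_init()` would then stand in for the session that the notebook driver creates.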

## Define UDF to get wikilinks

In [None]:
from pyspark.sql.functions import udf
import re

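def getWikilinks(wikitext):
    # UDF: return the page titles targeted by the wikilinks in a revision's wikitext
    if wikitext is None:  # guard against revisions with null text
        return []
    links = re.findall(r"\[\[(.*?)\]\]", wikitext)  # text inside [[...]]
    titles = [link.split('|')[0] for link in links]  # drop the display label after '|'
    return titles

udfGetWikilinks = udf(getWikilinks)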
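
The extraction logic can be checked without Spark. The following is a plain-Python sketch of the same regex approach, not the notebook's exact function: the helper name `extract_link_titles` is hypothetical, and stripping section anchors (`#...`) and surrounding whitespace are extra assumptions about what counts as a page title.

```python
import re

# Non-greedy match of the text between [[ and ]]
WIKILINK_RE = re.compile(r"\[\[(.*?)\]\]")

def extract_link_titles(wikitext):
    """Return the target page titles of the wikilinks in `wikitext`."""
    if not wikitext:                     # None or empty revision text
        return []
    titles = []
    for link in WIKILINK_RE.findall(wikitext):
        target = link.split("|")[0]      # drop the display label
        target = target.split("#")[0]    # drop a section anchor (assumption)
        target = target.strip()          # drop surrounding whitespace (assumption)
        if target:                       # skip same-page links such as [[#History]]
            titles.append(target)
    return titles

sample = "See [[Apache Spark|Spark]], [[Wikipedia#History]] and [[ Graph theory ]]."
print(extract_link_titles(sample))
# prints ['Apache Spark', 'Wikipedia', 'Graph theory']
```

`udf` defaults to a string return type, so to keep the links as an array column one option is to register the function as `udf(extract_link_titles, ArrayType(StringType()))`, with both types imported from `pyspark.sql.types`.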


## Loading parquet dump
To create the parquet dump from XML, follow this ticket: https://phabricator.wikimedia.org/T186559#3977087

In [None]:
df = spark.read.parquet('hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/enwiki')


In [None]:
df2 = df.where('page_namespace = 0')  # keep only pages in namespace 0

In [None]:
# Build on df2 (not df) so the namespace filter above is kept
df2 = df2.withColumn('wikilinks', udfGetWikilinks(df2.revision_text))


In [None]:
df2.show()

In [None]:
df2 = df2.select('page_id','revision_id','wikilinks')

In [None]:
df2.show()

In [None]:
query = "SELECT revision_id, event_timestamp, page_title FROM wmf.mediawiki_history WHERE wiki_db='enwiki' AND page_namespace=0 AND snapshot='2018-01'"
result = spark.sql(query)

In [None]:
allInfo = df2.join(result,'revision_id')
allInfo.show()

In [None]:
allInfo.write.parquet('linkGraph_es.parquet')
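
Each row of `allInfo` describes one revision together with the links it contains. To illustrate how such rows map onto timestamped graph edges, here is a plain-Python sketch. The rows are made up for illustration and only mirror the column names above; the helper `rows_to_edges` is hypothetical and assumes `wikilinks` holds a list of titles rather than a single string.

```python
def rows_to_edges(rows):
    """Yield one (source_title, target_title, timestamp) edge per link per revision."""
    for row in rows:
        for target in row["wikilinks"]:
            yield (row["page_title"], target, row["event_timestamp"])

# Made-up rows shaped like the joined columns, for illustration only
rows = [
    {"page_id": 1, "revision_id": 10, "page_title": "A",
     "event_timestamp": "2017-01-01 00:00:00", "wikilinks": ["B", "C"]},
    {"page_id": 1, "revision_id": 11, "page_title": "A",
     "event_timestamp": "2017-06-01 00:00:00", "wikilinks": ["B"]},
]

edges = list(rows_to_edges(rows))
for edge in edges:
    print(edge)
```

On the DataFrame itself, `pyspark.sql.functions.explode` plays the same role when `wikilinks` is an array column.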