# ST446 Distributed Computing for Big Data
## Homework PART 1
### Milan Vojnovic, Christine Yuen, Simon Schoeller LT 2019
---


## P1: Querying the YAGO semantic knowledge base

YAGO is a semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. YAGO contains knowledge about more than 10 million entities (like persons, organizations and cities) and contains more than 120 million facts about these entities. You may find more about YAGO [here](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/#c10444).

*You may use GCP or your own computer. Please document your steps. We highly recommend using GCP, as the data sets used are about 20 GB in total.*

In this homework assignment, you are asked to use parts of the YAGO dataset to demonstrate your knowledge about Spark graphframes and motif queries. In particular, you are asked to **_use motif queries_** to find out answers to the following queries stated in English:

**A (max points 0)**. _Which city was Albert Einstein born in?_ 

**B (max points 5)**. _Politicians who are also scientists_ (sorted alphabetically by name of person)

**C (max points 5)**. _Companies whose founders were born in London_ (sorted alphabetically by name of founder)

**D (max points 5)**. _Writers who have won a Nobel Prize (in any discipline)_ (sorted alphabetically by name of person)

**E (max points 5)**. _Nobel prize winners who were born in the same city as their spouses_ (sorted alphabetically by name of person)

**F (max points 5)**. _Politicians that are affiliated with a right-wing party_ (sorted alphabetically by name of person)

Please always show the first 20 entries of the resulting DataFrame and the total count of relevant entries.

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.4.0-spark2.0-s_2.11'

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

import graphframes
from graphframes import *
import matplotlib.pyplot as plt
%matplotlib inline

from pyspark.sql.types import *
from pyspark.sql.functions import col, lit, when
from pyspark.sql import Row

from datetime import datetime

import re
import numpy as np


## 0.1 Get YAGO data

You will need to download the following datasets that are part of YAGO (see [here](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/) for more information):

* A set of relationships between instances (for example, specifying that Emomali Rahmon is the leader of the Military of Tajikistan). Link: http://resources.mpi-inf.mpg.de/yago-naga/yago3.1/yagoFacts.tsv.7z

* A set of subclass relationships (for example, specifying that *A1086* is *a road in England*, or that *Salmonella Dub* is *a Reggae music group* and also a *New Zealand dub musical group*). Link: http://resources.mpi-inf.mpg.de/yago-naga/yago3.1/yagoTransitiveType.tsv.7z

Please use `wget` to download the data to your compute engine (the files are big!).

Next, you will need to extract the `tsv` files from the `7z` archives that you have downloaded.
Use the following commands to install `p7zip` on your compute engine and extract the files.
```
sudo apt-get install p7zip-full
7z x yagoTransitiveType.tsv.7z 
7z x yagoFacts.tsv.7z 
```
Please note that this can take a while, in particular because `yagoTransitiveType.tsv` is **18 GB** in size.

Put the files (`yagoTransitiveType.tsv` and `yagoFacts.tsv`) into the hadoop file system. 
Also, have a look at their first few lines to understand what kind of data they contain.

## 0.2 Read the data into a Spark DataFrame

Please load the data from `yagoFacts.tsv` into a DataFrame called `df` and `yagoTransitiveType.tsv` into a DataFrame called `df_subclasses`.
Have a look at the beginning of the files to understand the schema.
Once imported, both DataFrames should have columns labelled as `id`, `subject`, `predicate`, `object` and `value`.
In the case of `yagoTransitiveType.tsv`, some of the predicates can be understood as *"is a subclass of"* or *"is a member of the class"*, and the objects can be understood as classes.

In [3]:
df = spark.read.option("sep", "\t").csv("hdfs://anyacluster02apr2019-m/user/Anya/yagoFacts.tsv")
df_subclasses = spark.read.option("sep", "\t").csv("hdfs://anyacluster02apr2019-m/user/Anya/yagoTransitiveType.tsv")

df = df.selectExpr("_c0 as id", "_c1 as subject", "_c2 as predicate", "_c3 as object", "_c4 as value")
df_subclasses = df_subclasses.selectExpr("_c0 as id", "_c1 as subject", "_c2 as predicate", "_c3 as object", "_c4 as value")

## 0.3 Understand the database schema

Let's look at the schema:

In [4]:
df.printSchema()
df_subclasses.printSchema()

root
 |-- id: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- predicate: string (nullable = true)
 |-- object: string (nullable = true)
 |-- value: string (nullable = true)

root
 |-- id: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- predicate: string (nullable = true)
 |-- object: string (nullable = true)
 |-- value: string (nullable = true)



The useful information is in columns "subject", "predicate" and "object". "predicate" defines the relation between entities "subject" and "object". For example, for "Albert Einstein was born in Ulm", "Albert Einstein" is the subject, "was born in" is the predicate and "Ulm" is the object.
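
To make the triple structure concrete, here is a minimal sketch in plain Python (not Spark). It assumes tab-separated rows in the five-column layout described above; the two sample rows and their ids are made up for illustration, and `read_triples` is a hypothetical helper name.

```python
import csv
import io

# Two made-up rows in the five-column layout (id, subject, predicate, object, value).
# The ids are hypothetical placeholders, not real YAGO fact ids.
SAMPLE_TSV = (
    "<id_example_1>\t<Albert_Einstein>\t<wasBornIn>\t<Ulm>\t\n"
    "<id_example_2>\t<Albert_Einstein>\t<hasWonPrize>\t<Nobel_Prize_in_Physics>\t\n"
)

COLUMNS = ["id", "subject", "predicate", "object", "value"]


def read_triples(text):
    """Parse tab-separated rows into dicts keyed by the five column names."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


rows = read_triples(SAMPLE_TSV)

# Keep only the "was born in" facts and read off (subject, object) pairs.
born = [(r["subject"], r["object"]) for r in rows if r["predicate"] == "<wasBornIn>"]
print(born)
```

The predicate acts as the label of a directed link from subject to object, which is why the same rows can later be read as graph edges.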

## 0.4 Simple query example

To get information about where Albert Einstein was born, we load data into Spark using the following query:

In [35]:
born_city_df = df.where("predicate == '<wasBornIn>'")
born_city_df.show(1)

+--------------------+--------------------+-----------+---------------+-----+
|                  id|             subject|  predicate|         object|value|
+--------------------+--------------------+-----------+---------------+-----+
|<id_thPX9b1zg!_7f...|<William_Jones_(W...|<wasBornIn>|<Penrhiwceiber>| null|
+--------------------+--------------------+-----------+---------------+-----+
only showing top 1 row



In [36]:
born_city_df.where("subject = '<Albert_Einstein>'").show()

+--------------------+-----------------+-----------+------+-----+
|                  id|          subject|  predicate|object|value|
+--------------------+-----------------+-----------+------+-----+
|<id_sbCVliqDT2_7f...|<Albert_Einstein>|<wasBornIn>| <Ulm>| null|
+--------------------+-----------------+-----------+------+-----+



You may wonder how one would know whether to use the predicate `<wasBornIn>` or `<was_born_in>`, and the subject `<Albert_Einstein>` or `<AlbertEinstein>`. For YAGO subjects (and objects), the naming is aligned with Wikipedia. For example, Albert Einstein's Wikipedia page is https://en.wikipedia.org/wiki/Albert_Einstein, so the identifier is `Albert_Einstein`.

For predicates, you can look at the "property" list from the [yago web interface](https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface?L01=%3Fx&L0R=%3CwasBornIn%3E&L02=%3Fc&L0T=&L03=&L0L=&L04=&L05=&L11=&L1R=&L12=&L1T=&L13=&L1L=&L14=&L15=&L21=&L2R=&L22=&L2T=&L23=&L2L=&L24=&L25=&L31=&L3R=&L32=&L3T=&L33=&L3L=&L34=&L35=&L41=&L4R=&L42=&L4T=&L43=&L4L=&L44=&L45=). 
Try different queries in this web interface to get a better understanding of how to query YAGO.

## 0.5 Simple motif example (Question A)

In this part of the homework, you are required to use **motif queries** to answer the questions. Please complete the following example to find out "Which city was Albert Einstein born in?" using a motif query instead of SQL queries on the first DataFrame (`df`):
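
As a rough illustration of what a single-edge motif does, the sketch below emulates the pattern `"(a)-[e]->(b)"` in plain Python on a toy edge list. It is not GraphFrames code: `find_single_edge` is a hypothetical stand-in, and the edge attribute name `relationship` is an assumption borrowed from the later cells. With a GraphFrame `g` (hypothetical name) built from the `<wasBornIn>` rows, the corresponding call would presumably be `g.find("(a)-[e]->(b)")` followed by a filter on `a.id`.

```python
# Toy edge list in the GraphFrames layout: src, dst, plus the predicate as an edge attribute.
edges = [
    {"src": "<Albert_Einstein>", "dst": "<Ulm>", "relationship": "<wasBornIn>"},
    {"src": "<Albert_Einstein>", "dst": "<Nobel_Prize_in_Physics>", "relationship": "<hasWonPrize>"},
    {"src": "<Aubrey_de_Grey>", "dst": "<London>", "relationship": "<wasBornIn>"},
]


def find_single_edge(edges):
    """Emulate the motif "(a)-[e]->(b)": one match per edge, binding a, e and b."""
    return [{"a": e["src"], "e": e, "b": e["dst"]} for e in edges]


matches = find_single_edge(edges)

# The filter plays the role of .filter(...) on the motif result.
answer = [
    m["b"]
    for m in matches
    if m["a"] == "<Albert_Einstein>" and m["e"]["relationship"] == "<wasBornIn>"
]
print(answer)
```

The motif itself only describes the shape of the match; the conditions on names and predicates are applied afterwards as ordinary filters.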

## 0.6 Some useful tips

### Get a subset of YAGO database
The YAGO database is large, so we don't try to load the entire database into a DataFrame and then query it. If you do this, you will find that you won't even be able to execute `df.take(1)`, as it would take up too much space (at least on a laptop). Instead, use Spark SQL commands or `df.where` to get a suitable fraction of the data.

### Try the queries in the YAGO web interface first
It is sometimes tricky to get the right "subject", "predicate" and "object". It is easier to start from the [YAGO web interface](https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface?L01=%3Fx&L0R=%3CwasBornIn%3E&L02=%3Fc&L0T=&L03=&L0L=&L04=&L05=&L11=&L1R=&L12=&L1T=&L13=&L1L=&L14=&L15=&L21=&L2R=&L22=&L2T=&L23=&L2L=&L24=&L25=&L31=&L3R=&L32=&L3T=&L33=&L3L=&L34=&L35=&L41=&L4R=&L42=&L4T=&L43=&L4L=&L44=&L45=) than to query directly in PySpark. Once your query works, you can convert it to PySpark code. Note that the web version of an object or subject code may sometimes differ from what you need to type here. For example, the company code is `<wordnet_company_108058098>` when you run the query here, but `<wordnet company 108058098>` in the web interface.

### Be patient and don't leave this exercise to the last minute
Some trial and error is needed to get the queries right, and it may take some time to get the result of a query. For these reasons, we advise you not to leave this exercise until just before the submission deadline.

### Make sure to get the initialization actions right
For this exercise, you will be using GraphFrames.

## 1. Politicians who are also scientists (Question B)
Find all politicians who are also scientists. Output top 20 of them. How many people are in the dataset who are both scientists and politicians?
Please follow these steps:
* Operate on the subsets of `df_subclasses` where the objects are `'<wordnet_scientist_110560637>'` (scientists) and `'<wordnet_politician_110450303>'` (politicians), and where the predicates are `rdf:type`.
* Use graphframes and the right parts of `df_subclasses` to construct a graph whose (directed) edges point from subjects to objects. Hence, its source vertices are subjects and its destination vertices are objects. It may be convenient to use intermediate DataFrames and join all the required DataFrames of edges and vertices.
* The subjects will be people and the objects will be classes (e.g., scientists, politicians).
* Use a motif query to find all instances that fulfil the criteria specified in the question.
* It is a good idea to define a function that takes a DataFrame and outputs a set of data frames for vertices and edges.

Please sort the output alphabetically by the person column.
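
The helper suggested in the last step could look roughly like the sketch below. It is written in plain Python on lists of triples rather than on Spark DataFrames, so `to_vertices_and_edges` and the `relationship` field are illustrative assumptions; a Spark version would do the same renaming with `select`, `union` and `distinct`.

```python
def to_vertices_and_edges(triples):
    """Turn (subject, predicate, object) triples into vertex and edge records.

    Edges point from subject to object; every subject and object becomes a vertex.
    """
    edges = [{"src": s, "dst": o, "relationship": p} for s, p, o in triples]
    # Collect each endpoint once, so the vertex list has unique ids.
    ids = sorted({e["src"] for e in edges} | {e["dst"] for e in edges})
    vertices = [{"id": v} for v in ids]
    return vertices, edges


triples = [
    ("<Larry_Wos>", "rdf:type", "<wordnet_scientist_110560637>"),
    ("<Reg_Freeson>", "rdf:type", "<wordnet_politician_110450303>"),
]
vertices, edges = to_vertices_and_edges(triples)
print(len(vertices), len(edges))
```

Keeping the vertex ids unique matters because GraphFrames expects an `id` column on vertices and `src`/`dst` columns on edges that refer to those ids.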

In [69]:
Scientist_Polit = df_subclasses.where("predicate == 'rdf:type' and (object == '<wordnet_scientist_110560637>' or object == '<wordnet_politician_110450303>')")

In [5]:
Scientist_Polit.show(5)

+--------------------+--------------------+---------+--------------------+-----+
|                  id|             subject|predicate|              object|value|
+--------------------+--------------------+---------+--------------------+-----+
|<id_wGHfubCwBs_KC...|<Jean-Baptiste-Jo...| rdf:type|<wordnet_politici...| null|
|<id_EQgbQobwPR_KC...|       <Reg_Freeson>| rdf:type|<wordnet_politici...| null|
|<id_AZU1dcMWPB_KC...|       <Akbar_Ahmad>| rdf:type|<wordnet_politici...| null|
|<id_OqLYLwYtOx_KC...|<it/Luigi_Di_Paol...| rdf:type|<wordnet_politici...| null|
|<id_X76ScLZ?xM_KC...|         <Larry_Wos>| rdf:type|<wordnet_scientis...| null|
+--------------------+--------------------+---------+--------------------+-----+
only showing top 5 rows



In [70]:
Scientist_Polit.createOrReplaceTempView("Scientist_Polit")
People = spark.sql("select distinct sp.subject from Scientist_Polit sp")
People = People.withColumn('type', lit('subject'))

In [9]:
People.show(3)

+--------------------+-------+
|             subject|   type|
+--------------------+-------+
|    <Vladimír_Mařík>|subject|
|        <Ivan_Bauer>|subject|
|<fr/Jacques-Antoi...|subject|
+--------------------+-------+
only showing top 3 rows



In [71]:
Jobs = spark.sql("select distinct sp.object from Scientist_Polit sp")
Jobs = Jobs.withColumn('type', lit('object'))

In [10]:
Jobs.show()

+--------------------+------+
|              object|  type|
+--------------------+------+
|<wordnet_politici...|object|
|<wordnet_scientis...|object|
+--------------------+------+



In [73]:
Vertex = People.union(Jobs).withColumnRenamed("subject", "id")

In [11]:
Vertex.show(3)

+--------------------+-------+
|             subject|   type|
+--------------------+-------+
|    <Vladimír_Mařík>|subject|
|        <Ivan_Bauer>|subject|
|<fr/Jacques-Antoi...|subject|
+--------------------+-------+
only showing top 3 rows



In [74]:
Edges = spark.sql("select sp.subject, sp.object from Scientist_Polit sp").withColumnRenamed("subject", "src").withColumnRenamed("object", "dst")

In [13]:
Edges.show(3)

+--------------------+--------------------+
|                 src|                 dst|
+--------------------+--------------------+
|<Jean-Baptiste-Jo...|<wordnet_politici...|
|       <Reg_Freeson>|<wordnet_politici...|
|       <Akbar_Ahmad>|<wordnet_politici...|
+--------------------+--------------------+
only showing top 3 rows



In [75]:
SciPoliGrFr = GraphFrame(Vertex, Edges)

In [76]:
motifs = SciPoliGrFr.find("(a)-[]->(b); (a)-[]->(c)")
result = motifs.filter("b != c").select("a").distinct()

In [77]:
result.sort("a", ascending = True).show(20, False)

+------------------------------------+
|a                                   |
+------------------------------------+
|[<A._C._Cuza>, subject]             |
|[<A._P._J._Abdul_Kalam>, subject]   |
|[<Aad_Kosto>, subject]              |
|[<Aad_Nuis>, subject]               |
|[<Aaron_Aaronsohn>, subject]        |
|[<Aaron_Farrugia>, subject]         |
|[<Ab_Klink>, subject]               |
|[<Abba_P._Lerner>, subject]         |
|[<Abbas_Ahmad_Akhoundi>, subject]   |
|[<Abbie_Hoffman>, subject]          |
|[<Abbott_Lawrence_Lowell>, subject] |
|[<Abdallah_Salem_el-Badri>, subject]|
|[<Abdelbaki_Hermassi>, subject]     |
|[<Abdellatif_Abid>, subject]        |
|[<Abdelouahed_Souhail>, subject]    |
|[<Abdelwahed_Radi>, subject]        |
|[<Abdesslam_Yassine>, subject]      |
|[<Abdi_Farah_Shirdon>, subject]     |
|[<Abdirahman_Duale_Beyle>, subject] |
|[<Abdiweli_Mohamed_Ali>, subject]   |
+------------------------------------+
only showing top 20 rows



In [20]:
print(result.count())

7182


The total number of politicians that are also scientists is: 7182.  The top 20 when sorted alphabetically are shown above.  
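
The role of the `b != c` filter can be seen in a small plain-Python emulation of the motif `"(a)-[]->(b); (a)-[]->(c)"`. The sketch assumes, as the filter suggests, that a motif may bind `b` and `c` to the same vertex; the person names are made up and `find_shared_source` is a hypothetical stand-in for the GraphFrames query.

```python
from itertools import product

# Toy (src, dst) edges: Person_X has both classes, Person_Y only one.
edges = [
    ("<Person_X>", "<wordnet_scientist_110560637>"),
    ("<Person_X>", "<wordnet_politician_110450303>"),
    ("<Person_Y>", "<wordnet_politician_110450303>"),
]


def find_shared_source(edges):
    """Emulate "(a)-[]->(b); (a)-[]->(c)": all pairs of edges with the same source."""
    return [(s1, d1, d2) for (s1, d1), (s2, d2) in product(edges, repeat=2) if s1 == s2]


matches = find_shared_source(edges)

# Without the filter, Person_Y would also match (with b == c).
both = sorted({a for a, b, c in matches if b != c})
print(len(matches), both)
```

Each qualifying person appears twice among the filtered matches (once per ordering of `b` and `c`), which is why the result is reduced to distinct values of `a` before counting.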

## 2. Companies whose founders were born in London (Question C)
For companies, use `'<wordnet_company_108058098>'`. 
For *"being founder"*, use `<created>`.

By now, you will understand which DataFrame to use for what. 

Set up a graph and use a motif query to find companies whose founders were born in London.
Please take some time to figure out what a suitable configuration of nodes and edges should look like. How many such companies are there in our dataset?

Please sort the output alphabetically by the founder column.

In [63]:
Companies_edge = df_subclasses.where("predicate == 'rdf:type' and object == '<wordnet_company_108058098>'").select("subject", "predicate", "object")
Companies_vert = Companies_edge.select("subject").distinct().withColumn('type', lit('creation')).withColumnRenamed("subject", "id")
Companies_vert2 = Companies_edge.select("object").distinct().withColumn('type', lit('companydescription')).withColumnRenamed("object", "id")

In [64]:
print(Companies_edge.count())
print(Companies_vert.count())
print(Companies_vert2.count())

134474
134474
1


In [65]:
#Get founders of those companies
Founders_edge = df.where("predicate == '<created>'").select("subject", "predicate", "object")
Founders_vert = Founders_edge.select("subject").distinct().withColumn('type', lit('person')).withColumnRenamed("subject", "id")
Creation_vert = Founders_edge.select("object").distinct().withColumn('type', lit('creation')).withColumnRenamed("object", "id")

In [66]:
print(Founders_edge.count())
print(Founders_vert.count())
print(Creation_vert.count())

485392
113433
360647


In [67]:
#Get birth locations of founders
Births_edge = df.where("predicate == '<wasBornIn>'").select("subject", "predicate", "object")
Births_vert = Births_edge.select("subject").distinct().withColumn('type', lit('person')).withColumnRenamed("subject", "id")
Locs_vert = Births_edge.select("object").distinct().withColumn('type', lit('place')).withColumnRenamed("object", "id")

In [68]:
print(Births_edge.count())
print(Births_vert.count())
print(Locs_vert.count())

848846
848846
91153


In [8]:
#Make dataframe of vertices
Vertices = Companies_vert.union(Companies_vert2).union(Founders_vert).union(Creation_vert).union(Births_vert).union(Locs_vert).distinct()

In [9]:
#Make dataframe of edges: relationships
Edges = Companies_edge.union(Founders_edge).union(Births_edge).withColumnRenamed("predicate", "relationship").withColumnRenamed("subject", "src").withColumnRenamed("object", "dst").distinct()

In [10]:
LondonFoundersGrFr = GraphFrame(Vertices, Edges)

In [11]:
motifs = LondonFoundersGrFr.find("(a)-[e1]->(b); (a)-[e2]->(c); (c)-[e3]->(d)")
result = motifs.filter("b.id == '<London>' and c.type == 'creation' and d.id == '<wordnet_company_108058098>' ").select("a", "c").distinct() 

There are 61 companies created by people who were born in London, but only 53 distinct founders.  The top 20 companies and their founders sorted alphabetically by founder's name are shown below.  

In [12]:
result.sort("a", ascending = True).show(20, False)

+-----------------------------------------+-------------------------------------------------+
|a                                        |c                                                |
+-----------------------------------------+-------------------------------------------------+
|[<Adam_Hamdy>, person]                   |[<Dare_Comics>, creation]                        |
|[<Alexander_Asseily>, person]            |[<Jawbone_(company)>, creation]                  |
|[<Antony_Jay>, person]                   |[<Video_Arts>, creation]                         |
|[<Aubrey_de_Grey>, person]               |[<SENS_Research_Foundation>, creation]           |
|[<Ben_Horowitz>, person]                 |[<Andreessen_Horowitz>, creation]                |
|[<Bernard_MacMahon_(filmmaker)>, person] |[<LO-MAX_Records>, creation]                     |
|[<Brian_Maxwell>, person]                |[<PowerBar>, creation]                           |
|[<Bruno_Heller>, person]                 |[<Primrose_Hill_P

In [13]:
print(result.count())

61


In [38]:
result = motifs.filter("b.id == '<London>' and d.id == '<wordnet_company_108058098>' ").select("a").distinct() 
print(result.count())

53
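
The difference between the two counts (distinct founder and company pairs versus distinct founders) can be reproduced on a toy graph. The sketch below is plain Python with made-up founder and company names; it checks the same three conditions as the motif (born in London, created something, and that something is a company) using set lookups instead of GraphFrames.

```python
# Toy (src, predicate, dst) edges; all founder and company names are hypothetical.
edges = [
    ("<Founder_1>", "<wasBornIn>", "<London>"),
    ("<Founder_1>", "<created>", "<Company_A>"),
    ("<Founder_1>", "<created>", "<Company_B>"),
    ("<Founder_2>", "<wasBornIn>", "<London>"),
    ("<Founder_2>", "<created>", "<Company_C>"),
    ("<Founder_3>", "<wasBornIn>", "<Paris>"),
    ("<Founder_3>", "<created>", "<Company_D>"),
    ("<Company_A>", "rdf:type", "<wordnet_company_108058098>"),
    ("<Company_B>", "rdf:type", "<wordnet_company_108058098>"),
    ("<Company_C>", "rdf:type", "<wordnet_company_108058098>"),
    ("<Company_D>", "rdf:type", "<wordnet_company_108058098>"),
]
edge_set = set(edges)


def has_edge(src, predicate, dst):
    """True if the exact labelled edge is present in the toy graph."""
    return (src, predicate, dst) in edge_set


# One row per (founder, company) pair that satisfies all three conditions.
pairs = sorted(
    (a, c)
    for a, p, c in edges
    if p == "<created>"
    and has_edge(a, "<wasBornIn>", "<London>")
    and has_edge(c, "rdf:type", "<wordnet_company_108058098>")
)
founders = sorted({a for a, _ in pairs})
print(len(pairs), len(founders))
```

A founder with several companies contributes several pairs but only one founder, which is how the pair count can exceed the founder count.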


## 3. Writers who have won a Nobel Prize in any discipline, including economics (Question D)
Tags for Nobel prizes look like these: `'<Nobel_Prize_in_Chemistry>'`, `'<Nobel_Prize_in_Physics>'` or `'<Nobel_Prize>'`, etc.
We are also counting this one: `'<Nobel_Memorial_Prize_in_Economic_Sciences>'`.

The tag for writers is `'<wordnet_writer_110794014>'`.

You will need to use `'<hasWonPrize>'` as a predicate.

Please sort the output alphabetically by the person column.

In [26]:
#Writers
writers_edge = df_subclasses.where("predicate == 'rdf:type' and object == '<wordnet_writer_110794014>'").select("subject", "predicate", "object")
writers_vert1 = writers_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('name'))
writers_vert2 = writers_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('occupation'))

In [27]:
print(writers_edge.count())
print(writers_vert1.count())
print(writers_vert2.count())

256772
256772
1


There are 6 different Nobel prize or Nobel memorial prize objects, which match the true categories.  

In [28]:
#Make sure this includes all types of Nobel prizes
NobelPrizes = df.where("predicate == '<hasWonPrize>' and (object like '%<Nobel_Prize%' or object like '%<Nobel_Memorial_Prize%')").select("object").distinct()
NobelPrizes.show(10, False)

+-------------------------------------------+
|object                                     |
+-------------------------------------------+
|<Nobel_Prize_in_Physiology_or_Medicine>    |
|<Nobel_Prize_in_Chemistry>                 |
|<Nobel_Prize_in_Literature>                |
|<Nobel_Prize_in_Physics>                   |
|<Nobel_Memorial_Prize_in_Economic_Sciences>|
|<Nobel_Prize>                              |
+-------------------------------------------+



In [33]:
#Prize recipients
NobelPrizes_edge = df.where("predicate == '<hasWonPrize>' and (object like '%<Nobel_Prize%' or object like '%<Nobel_Memorial_Prize%')").select("subject", "predicate", "object")
#Build both vertex sets from NobelPrizes_edge (the name defined on the line above)
NobelPrizes_vert1 = NobelPrizes_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('name'))
NobelPrizes_vert2 = NobelPrizes_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('prize'))

In [34]:
print(NobelPrizes_edge.count())
print(NobelPrizes_vert1.count())
print(NobelPrizes_vert2.count())

838
826
6


In [35]:
#Make dataframe of vertices
Vertices = writers_vert1.union(writers_vert2).union(NobelPrizes_vert1).union(NobelPrizes_vert2).distinct()

In [36]:
#Make dataframe of edges: relationships
Edges = writers_edge.union(NobelPrizes_edge).withColumnRenamed("predicate", "relationship").withColumnRenamed("subject", "src").withColumnRenamed("object", "dst").distinct()

In [37]:
NobelWritersGrFr = GraphFrame(Vertices, Edges)

In [38]:
motifs = NobelWritersGrFr.find("(a)-[]->(b); (a)-[]->(c)")
result = motifs.filter("b != c").select("a").distinct() 

There are 219 writers who have won a Nobel Prize or the Nobel Memorial Prize. The first 20, listed alphabetically, are shown below.

In [39]:
print(result.count())
result.sort("a", ascending = True).show(20, False)

219
+--------------------------------------+
|a                                     |
+--------------------------------------+
|[<Adrienne_Clarkson>, name]           |
|[<Albert_Camus>, name]                |
|[<Albert_Einstein>, name]             |
|[<Aleksandr_Solzhenitsyn>, name]      |
|[<Alexander_Prokhorov>, name]         |
|[<Alexei_Alexeyevich_Abrikosov>, name]|
|[<Alexis_Carrel>, name]               |
|[<Alfred_Kastler>, name]              |
|[<Alice_Munro>, name]                 |
|[<Alvin_E._Roth>, name]               |
|[<Alvin_Toffler>, name]               |
|[<Amartya_Sen>, name]                 |
|[<Anatole_France>, name]              |
|[<André_Gide>, name]                  |
|[<António_Egas_Moniz>, name]          |
|[<Arthur_Kornberg>, name]             |
|[<Artturi_Ilmari_Virtanen>, name]     |
|[<Aziz_Sancar>, name]                 |
|[<Bert_Sakmann>, name]                |
|[<Bertrand_Russell>, name]            |
+--------------------------------------+
only showing

## 4. Nobel prize winners who were born in the same city as their spouses (Question E)
You may find the predicate `'<isMarriedTo>'` useful to create a DataFrame of all marriages.
Please also show the cities in which the Nobel laureates and their spouses were born.

Please sort the output alphabetically by the person (prize winner) column.

In [43]:
#Prize recipients
NobelPrizes_edge = df.where("predicate == '<hasWonPrize>' and (object like '%<Nobel_Prize%' or object like '%<Nobel_Memorial_Prize%')").select("subject", "predicate", "object")
NobelPrizes_vert1 = NobelPrizes_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('person'))
NobelPrizes_vert2 = NobelPrizes_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('prize'))

In [49]:
#Marriages
Marriages_edge = df.where("predicate == '<isMarriedTo>'").select("subject", "predicate", "object")
Marriages_edge2 = df.where("predicate == '<isMarriedTo>'").select("object", "predicate", "subject")

Marriages_vert1 = Marriages_edge.select("subject").distinct().withColumn('type', lit('person'))
Marriages_vert2 = Marriages_edge.select("object").distinct().withColumn('type', lit('person'))

In [48]:
print(Marriages_edge.count())
print(Marriages_vert1.count())
print(Marriages_vert2.count())

53039
47539
47122


In [45]:
#Get birth locations of people
Births_edge = df.where("predicate == '<wasBornIn>'").select("subject", "predicate", "object")
Births_vert = Births_edge.select("subject").distinct().withColumn('type', lit('person')).withColumnRenamed("subject", "id")
Locs_vert = Births_edge.select("object").distinct().withColumn('type', lit('place')).withColumnRenamed("object", "id")

In [46]:
#Make dataframe of vertices
Vertices = NobelPrizes_vert1.union(NobelPrizes_vert2).union(Marriages_vert1).union(Marriages_vert2).union(Births_vert).union(Locs_vert).distinct()

In [50]:
#Make dataframe of edges: relationships
Edges = NobelPrizes_edge.union(Marriages_edge).union(Marriages_edge2).union(Births_edge).withColumnRenamed("predicate", "relationship").withColumnRenamed("subject", "src").withColumnRenamed("object", "dst").distinct()

In [58]:
NobelMarriagesGF = GraphFrame(Vertices, Edges)
motifs = NobelMarriagesGF.find("(a)-[]->(b); (a)-[e1]->(c); (a)-[e2]->(d); (c)-[e3]->(d)")
result = motifs.filter("b.type == 'prize' and e1.relationship == '<isMarriedTo>' and e2.relationship == '<wasBornIn>' and e2.relationship == e3.relationship").select("a", "c", "d").distinct() 

There are 5 people who have won a Nobel Prize or the Nobel Memorial Prize and were born in the same city as their spouse. Note that 4 of them form 2 couples in which both partners are prize winners, so they may have been collaborators who won the prize together. The couples and birth cities are shown below, where column `a` is the prize winner, `c` the spouse and `d` the shared birth city.

In [59]:
print(result.count())

5


In [60]:
result.sort("a", ascending = True).show(20, False)

+---------------------------------+---------------------------------+------------------------+
|a                                |c                                |d                       |
+---------------------------------+---------------------------------+------------------------+
|[<Carl_Ferdinand_Cori>, person]  |[<Gerty_Cori>, person]           |[<Prague>, place]       |
|[<Frédéric_Joliot-Curie>, person]|[<Irène_Joliot-Curie>, person]   |[<Paris>, place]        |
|[<Gerty_Cori>, person]           |[<Carl_Ferdinand_Cori>, person]  |[<Prague>, place]       |
|[<Irène_Joliot-Curie>, person]   |[<Frédéric_Joliot-Curie>, person]|[<Paris>, place]        |
|[<Robert_Hofstadter>, person]    |[<Douglas_Hofstadter>, person]   |[<New_York_City>, place]|
+---------------------------------+---------------------------------+------------------------+
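
The reason for adding the reversed marriage edges can be illustrated in plain Python. The sketch assumes that a marriage may be stored in only one direction, so that a prize winner can appear on either side of the relation; all names are made up and `same_city_couples` is a hypothetical helper, not part of GraphFrames.

```python
# The marriage is stored in one direction only; the winner is on the "object" side.
married = [("<Person_A>", "<Person_B>")]
born_in = {"<Person_A>": "<City_1>", "<Person_B>": "<City_1>"}
prize_winners = {"<Person_B>"}


def same_city_couples(married, born_in, prize_winners):
    """Return (winner, spouse, city) for winners born in the same city as their spouse."""
    # Add the reverse of every marriage edge so the winner can be on either side.
    both_ways = set(married) | {(b, a) for a, b in married}
    return sorted(
        (a, c, born_in[a])
        for a, c in both_ways
        if a in prize_winners
        and a in born_in
        and c in born_in
        and born_in[a] == born_in[c]
    )


print(same_city_couples(married, born_in, prize_winners))
```

With the reverse edges in place, a couple in which both partners are winners shows up twice, once from each partner's point of view, which matches the paired rows in the table above.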



## 5. Politicians that are affiliated with a right-wing party (Question F)

We are looking for all connections of the form `politician -> party`, where the party is a right-wing party and politicians are defined as above. If one politician is associated with several right-wing parties, you may count them several times.

Use `'<isAffiliatedTo>'` to find membership in organisations and `'<wikicat_Right-wing_parties>'` to identify right-wing parties.

There are multiple ways to do this.

Please sort the output alphabetically by the person (politician) column.
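
One way to read the `politician -> party` requirement is sketched below in plain Python: both ends of each affiliation are constrained (the person must be a politician and the organisation must be in the right-wing class), and connections are counted rather than distinct people. All names are made up, and this is only an illustration of the counting rule, not the GraphFrames query itself.

```python
# Hypothetical class memberships and affiliations.
politicians = {"<Politician_1>", "<Politician_2>"}
right_wing = {"<Party_R1>", "<Party_R2>"}
affiliations = [
    ("<Politician_1>", "<Party_R1>"),
    ("<Politician_1>", "<Party_R2>"),
    ("<Politician_2>", "<Party_L>"),   # politician, but not a right-wing party
    ("<Athlete_1>", "<Party_R1>"),     # right-wing party, but not a politician
]

# Keep an affiliation only if both ends satisfy their class constraint.
connections = sorted(
    (person, party)
    for person, party in affiliations
    if person in politicians and party in right_wing
)
distinct_politicians = {person for person, _ in connections}
print(len(connections), len(distinct_politicians))
```

Counting connections gives 2 here while counting distinct politicians gives 1, so it is worth stating which of the two a reported total refers to.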

In [17]:
#Politicians
Polits_edge = df_subclasses.where("object == '<wordnet_politician_110450303>'").select("subject", "predicate", "object")
Polits_vert1 = Polits_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('name'))
Polits_vert2 = Polits_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('occupation'))

In [15]:
#Affiliated organizations
AffOrgs_edge = df.where("predicate == '<isAffiliatedTo>'").select("subject", "predicate", "object")
AffOrgs_vert1 = AffOrgs_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('name'))
AffOrgs_vert2 = AffOrgs_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('org'))


In [16]:
#Orgs that are right-wing
RWOrgs_edge = df_subclasses.where("object == '<wikicat_Right-wing_parties>'").select("subject", "predicate", "object")
RWOrgs_vert1 = RWOrgs_edge.select("subject").distinct().withColumnRenamed("subject", "id").withColumn('type', lit('org'))
RWOrgs_vert2 = RWOrgs_edge.select("object").distinct().withColumnRenamed("object", "id").withColumn('type', lit('RW'))


In [19]:
#Make dataframe of vertices
Vertices = Polits_vert1.union(Polits_vert2).union(AffOrgs_vert1).union(AffOrgs_vert2).union(RWOrgs_vert1).union(RWOrgs_vert2).distinct()

In [21]:
#Make dataframe of edges: relationships
Edges = Polits_edge.union(AffOrgs_edge).union(RWOrgs_edge).withColumnRenamed("predicate", "relationship").withColumnRenamed("subject", "src").withColumnRenamed("object", "dst").distinct()

In [22]:
RWPolitGrFr = GraphFrame(Vertices, Edges)

In [40]:
motifs = RWPolitGrFr.find("(a)-[]->(b); (b)-[]->(c); (a)-[]->(d)").filter("d.id == '<wordnet_politician_110450303>'")
result = motifs.select("a").distinct() 

There are 32736 politicians who are affiliated with right wing organizations.  The top 20 sorted alphabetically are shown below.  

In [41]:
print(result.count())

32736


In [42]:
result.sort("a", ascending = True).show(20, False)

+---------------------------------------+
|a                                      |
+---------------------------------------+
|[<A.N.M._Ehsanul_Hoque_Milan>, name]   |
|[<A._A._Wijethunga>, name]             |
|[<A._Anwhar_Raajhaa>, name]            |
|[<A._Arunmozhithevan>, name]           |
|[<A._B._Colton>, name]                 |
|[<A._C._Clemons>, name]                |
|[<A._C._Gibbs>, name]                  |
|[<A._C._Hamlin>, name]                 |
|[<A._Clifford_Jones>, name]            |
|[<A._Dean_Jeffs>, name]                |
|[<A._Devaraj>, name]                   |
|[<A._F._M._Ahsanuddin_Chowdhury>, name]|
|[<A._G._Crowe>, name]                  |
|[<A._Homer_Byington>, name]            |
|[<A._J._M._Muzammil>, name]            |
|[<A._J._McNamara>, name]               |
|[<A._J._Ranasinghe>, name]             |
|[<A._K._A._Firoze_Noon>, name]         |
|[<A._K._Patel>, name]                  |
|[<A._K._S._Vijayan>, name]             |
+---------------------------------