
# Advanced Graph Analysis & NLP

In this notebook, we will be using a combination of Natural Language Processing and network analysis to look at a Mafia network. The settings the mafia network is the following: As members of an investigation unit, we have been observing a network of families, which have been associated with nefarious activities. We will use two datasets:

1. PDF files written by an undercover agent
2. A network of interactions between those families

For NLP in Spark we can use the open-source *spark-nlp* library, which allows us to use a variety of Deep Learning models.

Since here we are given several PDFs to work with, we need a Python library to parse them using ``UDF``. We first need to install the external Python library ``pypdf``. The *graphframes* library, which we used in the previous lab, offers the useful ``GraphFrame``, but choices for graph algorithms are relatively limited. Thus, we will install the ``networkx`` library, which offers a range of popular graph algorithms.

In [0]:
!pip3 install pypdf networkx

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
[?25l[K     |█▏                              | 10 kB 17.3 MB/s eta 0:00:01[K     |██▎                             | 20 kB 2.6 MB/s eta 0:00:01[K     |███▍                            | 30 kB 3.8 MB/s eta 0:00:01[K     |████▌                           | 40 kB 3.3 MB/s eta 0:00:01[K     |█████▋                          | 51 kB 3.5 MB/s eta 0:00:01[K     |██████▊                         | 61 kB 4.1 MB/s eta 0:00:01[K     |████████                        | 71 kB 4.2 MB/s eta 0:00:01[K     |█████████                       | 81 kB 4.3 MB/s eta 0:00:01[K     |██████████▏                     | 92 kB 4.8 MB/s eta 0:00:01[K     |███████████▎                    | 102 kB 4.6 MB/s eta 0:00:01[K     |████████████▍                   | 112 kB 4.6 MB/s eta 0:00:01[K     |█████████████▌                  | 122 kB 4.6 MB/s eta 0:00:01[K     |██████████████▋                 | 133 kB 4.6 MB/s eta 0:00:01[K    

In [0]:
import io
import numpy as np
import networkx as nx
from pypdf import PdfReader 

We install the ``spark-nlp`` dependencies next. 

In [0]:
import sparknlp

# Start Spark Session
spark = sparknlp.start()

from sparknlp.base import DocumentAssembler, Pipeline, LightPipeline
from sparknlp.annotator import (
    Tokenizer,
    WordEmbeddingsModel,
    NerDLModel,
    NerConverter
)

import pyspark.sql.functions as F



We load all the reports from our agent as PDF files.

In [0]:
mafia_network_communications = spark.read.format("binaryFile").load("dbfs:/FileStore/crime_letters/*.pdf")

In [0]:
mafia_network_communications.display()

path,modificationTime,length,content
dbfs:/FileStore/crime_letters/communication_22_5_2013.pdf,2024-05-06T22:50:58.000+0000,142828,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTQxIDIgMCBSID4+IC9Gb250IDw8IC8= (truncated)
dbfs:/FileStore/crime_letters/communication_15_4_2013.pdf,2024-05-06T22:50:58.000+0000,91833,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTIzIDIgMCBSID4+IC9Gb250IDw8IC8= (truncated)
dbfs:/FileStore/crime_letters/communication_2_3_2013.pdf,2024-05-06T22:50:57.000+0000,89489,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTUgMiAwIFIgPj4gL0ZvbnQgPDwgL0Y= (truncated)
dbfs:/FileStore/crime_letters/communication_19_4_2013.pdf,2024-05-06T22:50:58.000+0000,27676,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated)
dbfs:/FileStore/crime_letters/communication_3_4_2013.pdf,2024-05-06T22:50:57.000+0000,27587,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated)
dbfs:/FileStore/crime_letters/communication_17_4_2013.pdf,2024-05-06T22:50:58.000+0000,27535,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated)
dbfs:/FileStore/crime_letters/communication_6_4_2013.pdf,2024-05-06T22:50:58.000+0000,27517,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated)


As the next step, we define a Python UDF that takes the binary content of each PDF and convert it to text.

In [0]:
@udf
def pdf_to_text(pdf) -> str:
    """
    We transform a PDF (binary) into a string. The "content" column is already binary, so we read the bytes directly.
    """

    # First we load the binary content
    bytes_stream = io.BytesIO(pdf)

    # We initialize the reader
    reader = PdfReader(bytes_stream)

    # As the final step, we go over each page (note though that in our case our PDFs have only one page each) and extract the text
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"

    return text

In [0]:
mafia_network_communications = mafia_network_communications \
  .withColumn("report_text", pdf_to_text(F.col("content")))

In [0]:
mafia_network_communications.display()

path,modificationTime,length,content,report_text
dbfs:/FileStore/crime_letters/communication_22_5_2013.pdf,2024-05-06T22:50:58.000+0000,142828,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTQxIDIgMCBSID4+IC9Gb250IDw8IC8= (truncated),"I am writing to report on my recent surveillance findings regarding illegal activities within various crime families. Giulia Bianchi, a key figure in the Bianchi organization, has been observed exchanging encrypted messages with Federico Romano of the Romano syndicate. These communications suggest a potential collaboration between the two families to expand their illicit enterprises. Furthermore, intercepted conversations between Elena Conti from the Conti Crime Family and Francesco Ricci indicate negotiations for a large-scale drug trafficking operation. It is evident that these families are actively seeking to strengthen their positions in the criminal underworld. Additionally, during my investigations, I uncovered disturbing information regarding Luca Moretti, who appears to be operating an illegal Balsamico Vinegar factory that adulterates the finest Tuscan Balsamico vinegar. The shipments are facilitated by Giuseppe Rossi, who acts as a business associate in transporting goods from Peru to Italy. This operation poses significant risks to public health and safety and must be addressed promptly. On a personal note, amidst the complexities of my undercover work, I must confess to finding solace in the indulgence of Italian Grappa spirit and embracing the ""Dolce Vita"" lifestyle. While it may seem trivial in the context of my duties, it serves as a reminder of the finer pleasures amidst the darkness of criminal activities. I will continue to gather intelligence and keep you informed of any significant developments."
dbfs:/FileStore/crime_letters/communication_15_4_2013.pdf,2024-05-06T22:50:58.000+0000,91833,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTIzIDIgMCBSID4+IC9Gb250IDw8IC8= (truncated),"I am writing to update you on my ongoing undercover investigation into various criminal syndicates. Recent surveillance has revealed interactions between Sofia Rossi of the Rossi Crime Family and Carlo Romano from the Romano organization. These meetings suggest a potential partnership between the two families to expand their criminal enterprises. Additionally, intercepted communications between Giovanni Moretti of the Moretti syndicate and Francesca Marini from the Marini family indicate discussions regarding the smuggling of contraband goods. It is crucial that we take decisive action to disrupt these illegal activities and dismantle the alliances forming between the crime families. I will continue to gather intelligence and provide further updates as necessary."
dbfs:/FileStore/crime_letters/communication_2_3_2013.pdf,2024-05-06T22:50:57.000+0000,89489,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9YT2JqZWN0IDw8IC9JbTUgMiAwIFIgPj4gL0ZvbnQgPDwgL0Y= (truncated),"I am writing to provide updates on my infiltration into the criminal underworld. During my surveillance, I have observed interactions between members of various families. Giuseppe Rossi, head of the Rossi Crime Family, has been seen meeting with Marco Bianchi from the Bianchi syndicate. This rendezvous raises concerns about a potential collaboration between the two families. Additionally, there have been intercepted communications between Francesco Ricci of the Ricci Crime Family and Teresa Marini from the Marini organization, discussing the distribution of illegal goods. Such alliances could significantly impact organized crime dynamics in the region. I will continue to monitor these developments closely and provide further updates as necessary."
dbfs:/FileStore/crime_letters/communication_19_4_2013.pdf,2024-05-06T22:50:58.000+0000,27676,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated),"I am writing to provide updates on my infiltration into the criminal underworld. During my surveillance, I have observed interactions between members of various families. Giuseppe Rossi, head of the Rossi Crime Family, has been seen meeting with Marco Bianchi from the Bianchi syndicate. This rendezvous raises concerns about a potential collaboration between the two families. Additionally, there have been intercepted communications between Francesco Ricci of the Ricci Crime Family and Teresa Marini from the Marini organization, discussing the distribution of illegal goods. Such alliances could significantly impact organized crime dynamics in the region. I will continue to monitor these developments closely and provide further updates as necessary. P.S. I must express my frustration as my cover was almost blown during a close encounter with a member of the Rossi Family. Extra caution is warranted moving forward."
dbfs:/FileStore/crime_letters/communication_3_4_2013.pdf,2024-05-06T22:50:57.000+0000,27587,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated),"I am writing to report on my recent surveillance findings regarding illegal activities within various crime families. Giulia Bianchi, a key figure in the Bianchi organization, has been observed exchanging encrypted messages with Federico Romano of the Romano syndicate. These communications suggest a potential collaboration between the two families to expand their illicit enterprises. Furthermore, intercepted conversations between Elena Conti from the Conti Crime Family and Francesco Ricci indicate negotiations for a large-scale drug trafficking operation. It is evident that these families are actively seeking to strengthen their positions in the criminal underworld. I will continue to gather intelligence and keep you informed of any significant developments."
dbfs:/FileStore/crime_letters/communication_17_4_2013.pdf,2024-05-06T22:50:58.000+0000,27535,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated),"I am writing to provide an update on my surveillance of criminal syndicates operating in the region. Recent observations have revealed interactions between Francesco Ricci of the Ricci Crime Family and Paolo Conti from the Conti organization. These encounters suggest a potential partnership between the two families to expand their influence in the illicit arms trade. Additionally, intercepted communications between Giuliana Conti of the Conti syndicate and Giovanni Moretti from the Moretti family indicate discussions regarding the distribution of narcotics. It is imperative that we intervene to disrupt these criminal activities and prevent further collaboration between the crime families. I will continue to monitor their movements and communications closely and provide updates as necessary."
dbfs:/FileStore/crime_letters/communication_6_4_2013.pdf,2024-05-06T22:50:58.000+0000,27517,JVBERi0xLjQNCiXi48/TDQoxIDAgb2JqDQo8PA0KL1R5cGUgL1BhZ2UNCi9NZWRpYUJveCBbIDAgMCA1OTUuMzA0IDg0MS44OSBdDQovUmVzb3VyY2VzIDw8IC9Gb250IDw8IC9GMSAyIDAgUiA+PiA+Pg0KL0NvbnRlbnRzIDM= (truncated),"I am writing to provide updates on my covert surveillance of criminal activities within various syndicates. Recent observations have revealed interactions between Antonio Rossi of the Rossi Crime Family and Chiara Ricci from the Ricci organization. These encounters suggest a potential alliance between the two families to consolidate their power in the region. Additionally, communications intercepted between Luca Bianchi of the Bianchi syndicate and Martina Romano from the Romano family indicate discussions about a large-scale money laundering operation. It is imperative that we intervene to disrupt these illicit activities and prevent further collaboration between the crime families. I will continue to monitor their movements and communications closely."


#### Named Entity Recognition
Named Entity Recognition (NER) is a natural language processing (NLP) technique that identifies and categorizes named entities within text into predefined categories such as names of persons, organizations, locations and dates. Our informant, who has infiltrated the organization, is sending regular letters. You are tasked with building a prototype of automatically processing the reports and extracting the names of the people in each letter, which can then be related to the larger network. 

We will define manually a pipeline that transforms our report text first into an embedding representation and then extracts our entities.

In [0]:
# Step 1: Transforms raw texts to "document" annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("report_text")\
    .setOutputCol("document")

# Step 2: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Step 3: Get the embeddings using glove_100d
embeddings = WordEmbeddingsModel.pretrained("glove_100d").\
                  setInputCols(["document", "token"]).\
                  setOutputCol("embeddings")

# Step 4: Use the ``ner_dl`` model
public_ner = NerDLModel.pretrained("ner_dl", "en") \
          .setInputCols(["document", "token", "embeddings"]) \
          .setOutputCol("ner")

# Step 5: Convert to NER
ner_converter = NerConverter() \
                .setInputCols(["document", "token", "ner"]) \
                  .setOutputCol("entities")

# Define the pipeline
ner_pipeline = Pipeline(stages=[ documentAssembler, 
                                 tokenizer,
                                 embeddings,
                                 public_ner,
                                 ner_converter
                                 ])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][OK!]


``Pipelines`` are a Spark concept that we will revisit during our labs on Machine Learning. You will find the API similar to that of ``scikit-learn`` in that it follows the fit/transform structure. Using pipelines, you may specify the steps of a sequence. Most often, you will find yourself using them 

In [0]:
# We fit our model 
ner_pipeline_model = ner_pipeline.fit(mafia_network_communications)

# And transform the data
processed = ner_pipeline_model.transform(mafia_network_communications)

In [0]:
ner_results = processed \
    .select(F.col("ner"), F.col("path"))

Let us now inspect the results.

In [0]:
ner_results.limit(1).display()

ner,path
"List(List(named_entity, 0, 0, O, Map(word -> I, sentence -> 0), List()), List(named_entity, 2, 3, O, Map(word -> am, sentence -> 0), List()), List(named_entity, 5, 11, O, Map(word -> writing, sentence -> 0), List()), List(named_entity, 13, 14, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 16, 21, O, Map(word -> report, sentence -> 0), List()), List(named_entity, 23, 24, O, Map(word -> on, sentence -> 0), List()), List(named_entity, 26, 27, O, Map(word -> my, sentence -> 0), List()), List(named_entity, 29, 34, O, Map(word -> recent, sentence -> 0), List()), List(named_entity, 36, 47, O, Map(word -> surveillance, sentence -> 0), List()), List(named_entity, 49, 56, O, Map(word -> findings, sentence -> 0), List()), List(named_entity, 58, 66, O, Map(word -> regarding, sentence -> 0), List()), List(named_entity, 68, 74, O, Map(word -> illegal, sentence -> 0), List()), List(named_entity, 78, 87, O, Map(word -> activities, sentence -> 0), List()), List(named_entity, 89, 94, O, Map(word -> within, sentence -> 0), List()), List(named_entity, 96, 102, O, Map(word -> various, sentence -> 0), List()), List(named_entity, 104, 108, O, Map(word -> crime, sentence -> 0), List()), List(named_entity, 110, 117, O, Map(word -> families, sentence -> 0), List()), List(named_entity, 118, 118, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 120, 125, B-PER, Map(word -> Giulia, sentence -> 0), List()), List(named_entity, 127, 133, I-PER, Map(word -> Bianchi, sentence -> 0), List()), List(named_entity, 134, 134, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 136, 136, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 138, 140, O, Map(word -> key, sentence -> 0), List()), List(named_entity, 142, 147, O, Map(word -> figure, sentence -> 0), List()), List(named_entity, 149, 150, O, Map(word -> in, sentence -> 0), List()), List(named_entity, 153, 155, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 157, 163, B-ORG, Map(word -> Bianchi, sentence -> 0), List()), List(named_entity, 165, 176, O, Map(word -> organization, sentence -> 0), List()), List(named_entity, 177, 177, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 179, 181, O, Map(word -> has, sentence -> 0), List()), List(named_entity, 183, 186, O, Map(word -> been, sentence -> 0), List()), List(named_entity, 188, 195, O, Map(word -> observed, sentence -> 0), List()), List(named_entity, 197, 206, O, Map(word -> exchanging, sentence -> 0), List()), List(named_entity, 208, 216, O, Map(word -> encrypted, sentence -> 0), List()), List(named_entity, 218, 225, O, Map(word -> messages, sentence -> 0), List()), List(named_entity, 228, 231, O, Map(word -> with, sentence -> 0), List()), List(named_entity, 233, 240, B-PER, Map(word -> Federico, sentence -> 0), List()), List(named_entity, 242, 247, I-PER, Map(word -> Romano, sentence -> 0), List()), List(named_entity, 249, 250, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 252, 254, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 256, 261, B-PER, Map(word -> Romano, sentence -> 0), List()), List(named_entity, 263, 271, O, Map(word -> syndicate, sentence -> 0), List()), List(named_entity, 272, 272, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 274, 278, O, Map(word -> These, sentence -> 0), List()), List(named_entity, 280, 293, O, Map(word -> communications, sentence -> 0), List()), List(named_entity, 295, 301, O, Map(word -> suggest, sentence -> 0), List()), List(named_entity, 305, 305, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 307, 315, O, Map(word -> potential, sentence -> 0), List()), List(named_entity, 317, 329, O, Map(word -> collaboration, sentence -> 0), List()), List(named_entity, 331, 337, O, Map(word -> between, sentence -> 0), List()), List(named_entity, 339, 341, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 343, 345, O, Map(word -> two, sentence -> 0), List()), List(named_entity, 347, 354, O, Map(word -> families, sentence -> 0), List()), List(named_entity, 356, 357, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 359, 364, O, Map(word -> expand, sentence -> 0), List()), List(named_entity, 366, 370, O, Map(word -> their, sentence -> 0), List()), List(named_entity, 372, 378, O, Map(word -> illicit, sentence -> 0), List()), List(named_entity, 381, 391, O, Map(word -> enterprises, sentence -> 0), List()), List(named_entity, 392, 392, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 394, 404, O, Map(word -> Furthermore, sentence -> 0), List()), List(named_entity, 405, 405, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 407, 417, O, Map(word -> intercepted, sentence -> 0), List()), List(named_entity, 419, 431, O, Map(word -> conversations, sentence -> 0), List()), List(named_entity, 433, 439, O, Map(word -> between, sentence -> 0), List()), List(named_entity, 441, 445, B-PER, Map(word -> Elena, sentence -> 0), List()), List(named_entity, 447, 451, I-PER, Map(word -> Conti, sentence -> 0), List()), List(named_entity, 454, 457, O, Map(word -> from, sentence -> 0), List()), List(named_entity, 459, 461, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 463, 467, B-ORG, Map(word -> Conti, sentence -> 0), List()), List(named_entity, 469, 473, I-ORG, Map(word -> Crime, sentence -> 0), List()), List(named_entity, 475, 480, I-ORG, Map(word -> Family, sentence -> 0), List()), List(named_entity, 482, 484, O, Map(word -> and, sentence -> 0), List()), List(named_entity, 486, 494, B-PER, Map(word -> Francesco, sentence -> 0), List()), List(named_entity, 496, 500, I-PER, Map(word -> Ricci, sentence -> 0), List()), List(named_entity, 502, 509, O, Map(word -> indicate, sentence -> 0), List()), List(named_entity, 511, 522, O, Map(word -> negotiations, sentence -> 0), List()), List(named_entity, 524, 526, O, Map(word -> for, sentence -> 0), List()), List(named_entity, 528, 528, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 531, 541, O, Map(word -> large-scale, sentence -> 0), List()), List(named_entity, 543, 546, O, Map(word -> drug, sentence -> 0), List()), List(named_entity, 548, 558, O, Map(word -> trafficking, sentence -> 0), List()), List(named_entity, 560, 568, O, Map(word -> operation, sentence -> 0), List()), List(named_entity, 569, 569, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 571, 572, O, Map(word -> It, sentence -> 0), List()), List(named_entity, 574, 575, O, Map(word -> is, sentence -> 0), List()), List(named_entity, 577, 583, O, Map(word -> evident, sentence -> 0), List()), List(named_entity, 585, 588, O, Map(word -> that, sentence -> 0), List()), List(named_entity, 590, 594, O, Map(word -> these, sentence -> 0), List()), List(named_entity, 596, 603, O, Map(word -> families, sentence -> 0), List()), List(named_entity, 606, 608, O, Map(word -> are, sentence -> 0), List()), List(named_entity, 610, 617, O, Map(word -> actively, sentence -> 0), List()), List(named_entity, 619, 625, O, Map(word -> seeking, sentence -> 0), List()), List(named_entity, 627, 628, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 630, 639, O, Map(word -> strengthen, sentence -> 0), List()), List(named_entity, 641, 645, O, Map(word -> their, sentence -> 0), List()), List(named_entity, 647, 655, O, Map(word -> positions, sentence -> 0), List()), List(named_entity, 657, 658, O, Map(word -> in, sentence -> 0), List()), List(named_entity, 660, 662, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 664, 671, O, Map(word -> criminal, sentence -> 0), List()), List(named_entity, 674, 683, O, Map(word -> underworld, sentence -> 0), List()), List(named_entity, 684, 684, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 686, 697, O, Map(word -> Additionally, sentence -> 0), List()), List(named_entity, 698, 698, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 700, 705, O, Map(word -> during, sentence -> 0), List()), List(named_entity, 707, 708, O, Map(word -> my, sentence -> 0), List()), List(named_entity, 710, 723, O, Map(word -> investigations, sentence -> 0), List()), List(named_entity, 724, 724, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 726, 726, O, Map(word -> I, sentence -> 0), List()), List(named_entity, 728, 736, O, Map(word -> uncovered, sentence -> 0), List()), List(named_entity, 738, 747, O, Map(word -> disturbing, sentence -> 0), List()), List(named_entity, 749, 759, O, Map(word -> information, sentence -> 0), List()), List(named_entity, 763, 771, O, Map(word -> regarding, sentence -> 0), List()), List(named_entity, 773, 776, B-PER, Map(word -> Luca, sentence -> 0), List()), List(named_entity, 778, 784, I-PER, Map(word -> Moretti, sentence -> 0), List()), List(named_entity, 785, 785, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 787, 789, O, Map(word -> who, sentence -> 0), List()), List(named_entity, 791, 797, O, Map(word -> appears, sentence -> 0), List()), List(named_entity, 799, 800, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 802, 803, O, Map(word -> be, sentence -> 0), List()), List(named_entity, 805, 813, O, Map(word -> operating, sentence -> 0), List()), List(named_entity, 815, 816, O, Map(word -> an, sentence -> 0), List()), List(named_entity, 818, 824, O, Map(word -> illegal, sentence -> 0), List()), List(named_entity, 826, 834, B-ORG, Map(word -> Balsamico, sentence -> 0), List()), List(named_entity, 837, 843, I-ORG, Map(word -> Vinegar, sentence -> 0), List()), List(named_entity, 845, 851, O, Map(word -> factory, sentence -> 0), List()), List(named_entity, 853, 856, O, Map(word -> that, sentence -> 0), List()), List(named_entity, 858, 868, O, Map(word -> adulterates, sentence -> 0), List()), List(named_entity, 870, 872, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 874, 879, O, Map(word -> finest, sentence -> 0), List()), List(named_entity, 881, 886, B-MISC, Map(word -> Tuscan, sentence -> 0), List()), List(named_entity, 888, 896, I-MISC, Map(word -> Balsamico, sentence -> 0), List()), List(named_entity, 898, 904, O, Map(word -> vinegar, sentence -> 0), List()), List(named_entity, 905, 905, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 907, 909, O, Map(word -> The, sentence -> 0), List()), List(named_entity, 912, 920, O, Map(word -> shipments, sentence -> 0), List()), List(named_entity, 922, 924, O, Map(word -> are, sentence -> 0), List()), List(named_entity, 926, 936, O, Map(word -> facilitated, sentence -> 0), List()), List(named_entity, 938, 939, O, Map(word -> by, sentence -> 0), List()), List(named_entity, 941, 948, B-PER, Map(word -> Giuseppe, sentence -> 0), List()), List(named_entity, 950, 954, I-PER, Map(word -> Rossi, sentence -> 0), List()), List(named_entity, 955, 955, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 957, 959, O, Map(word -> who, sentence -> 0), List()), List(named_entity, 961, 964, O, Map(word -> acts, sentence -> 0), List()), List(named_entity, 966, 967, O, Map(word -> as, sentence -> 0), List()), List(named_entity, 969, 969, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 971, 978, O, Map(word -> business, sentence -> 0), List()), List(named_entity, 980, 988, O, Map(word -> associate, sentence -> 0), List()), List(named_entity, 992, 993, O, Map(word -> in, sentence -> 0), List()), List(named_entity, 995, 1006, O, Map(word -> transporting, sentence -> 0), List()), List(named_entity, 1008, 1012, O, Map(word -> goods, sentence -> 0), List()), List(named_entity, 1014, 1017, O, Map(word -> from, sentence -> 0), List()), List(named_entity, 1019, 1022, B-LOC, Map(word -> Peru, sentence -> 0), List()), List(named_entity, 1024, 1025, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 1027, 1031, B-LOC, Map(word -> Italy, sentence -> 0), List()), List(named_entity, 1032, 1032, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 1034, 1037, O, Map(word -> This, sentence -> 0), List()), List(named_entity, 1039, 1047, O, Map(word -> operation, sentence -> 0), List()), List(named_entity, 1049, 1053, O, Map(word -> poses, sentence -> 0), List()), List(named_entity, 1055, 1065, O, Map(word -> significant, sentence -> 0), List()), List(named_entity, 1068, 1072, O, Map(word -> risks, sentence -> 0), List()), List(named_entity, 1074, 1075, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 1077, 1082, O, Map(word -> public, sentence -> 0), List()), List(named_entity, 1084, 1089, O, Map(word -> health, sentence -> 0), List()), List(named_entity, 1091, 1093, O, Map(word -> and, sentence -> 0), List()), List(named_entity, 1095, 1100, O, Map(word -> safety, sentence -> 0), List()), List(named_entity, 1102, 1104, O, Map(word -> and, sentence -> 0), List()), List(named_entity, 1106, 1109, O, Map(word -> must, sentence -> 0), List()), List(named_entity, 1111, 1112, O, Map(word -> be, sentence -> 0), List()), List(named_entity, 1114, 1122, O, Map(word -> addressed, sentence -> 0), List()), List(named_entity, 1124, 1131, O, Map(word -> promptly, sentence -> 0), List()), List(named_entity, 1132, 1132, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 1134, 1135, O, Map(word -> On, sentence -> 0), List()), List(named_entity, 1137, 1137, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 1139, 1146, O, Map(word -> personal, sentence -> 0), List()), List(named_entity, 1148, 1151, O, Map(word -> note, sentence -> 0), List()), List(named_entity, 1152, 1152, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 1154, 1159, O, Map(word -> amidst, sentence -> 0), List()), List(named_entity, 1161, 1163, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1165, 1176, O, Map(word -> complexities, sentence -> 0), List()), List(named_entity, 1178, 1179, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1181, 1182, O, Map(word -> my, sentence -> 0), List()), List(named_entity, 1184, 1193, O, Map(word -> undercover, sentence -> 0), List()), List(named_entity, 1195, 1198, O, Map(word -> work, sentence -> 0), List()), List(named_entity, 1199, 1199, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 1201, 1201, O, Map(word -> I, sentence -> 0), List()), List(named_entity, 1203, 1206, O, Map(word -> must, sentence -> 0), List()), List(named_entity, 1209, 1215, O, Map(word -> confess, sentence -> 0), List()), List(named_entity, 1217, 1218, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 1220, 1226, O, Map(word -> finding, sentence -> 0), List()), List(named_entity, 1228, 1233, O, Map(word -> solace, sentence -> 0), List()), List(named_entity, 1235, 1236, O, Map(word -> in, sentence -> 0), List()), List(named_entity, 1238, 1240, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1242, 1251, O, Map(word -> indulgence, sentence -> 0), List()), List(named_entity, 1253, 1254, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1256, 1262, B-MISC, Map(word -> Italian, sentence -> 0), List()), List(named_entity, 1264, 1269, I-MISC, Map(word -> Grappa, sentence -> 0), List()), List(named_entity, 1271, 1276, O, Map(word -> spirit, sentence -> 0), List()), List(named_entity, 1278, 1280, O, Map(word -> and, sentence -> 0), List()), List(named_entity, 1283, 1291, O, Map(word -> embracing, sentence -> 0), List()), List(named_entity, 1293, 1295, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1297, 1297, O, Map(word -> "", sentence -> 0), List()), List(named_entity, 1298, 1302, B-ORG, Map(word -> Dolce, sentence -> 0), List()), List(named_entity, 1304, 1307, I-ORG, Map(word -> Vita, sentence -> 0), List()), List(named_entity, 1308, 1308, O, Map(word -> "", sentence -> 0), List()), List(named_entity, 1310, 1318, O, Map(word -> lifestyle, sentence -> 0), List()), List(named_entity, 1319, 1319, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 1321, 1325, O, Map(word -> While, sentence -> 0), List()), List(named_entity, 1327, 1328, O, Map(word -> it, sentence -> 0), List()), List(named_entity, 1330, 1332, O, Map(word -> may, sentence -> 0), List()), List(named_entity, 1334, 1337, O, Map(word -> seem, sentence -> 0), List()), List(named_entity, 1339, 1345, O, Map(word -> trivial, sentence -> 0), List()), List(named_entity, 1347, 1348, O, Map(word -> in, sentence -> 0), List()), List(named_entity, 1350, 1352, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1354, 1360, O, Map(word -> context, sentence -> 0), List()), List(named_entity, 1364, 1365, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1367, 1368, O, Map(word -> my, sentence -> 0), List()), List(named_entity, 1370, 1375, O, Map(word -> duties, sentence -> 0), List()), List(named_entity, 1376, 1376, O, Map(word -> ,, sentence -> 0), List()), List(named_entity, 1378, 1379, O, Map(word -> it, sentence -> 0), List()), List(named_entity, 1381, 1386, O, Map(word -> serves, sentence -> 0), List()), List(named_entity, 1388, 1389, O, Map(word -> as, sentence -> 0), List()), List(named_entity, 1391, 1391, O, Map(word -> a, sentence -> 0), List()), List(named_entity, 1393, 1400, O, Map(word -> reminder, sentence -> 0), List()), List(named_entity, 1402, 1403, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1405, 1407, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1409, 1413, O, Map(word -> finer, sentence -> 0), List()), List(named_entity, 1415, 1423, O, Map(word -> pleasures, sentence -> 0), List()), List(named_entity, 1425, 1430, O, Map(word -> amidst, sentence -> 0), List()), List(named_entity, 1432, 1434, O, Map(word -> the, sentence -> 0), List()), List(named_entity, 1437, 1444, O, Map(word -> darkness, sentence -> 0), List()), List(named_entity, 1446, 1447, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1449, 1456, O, Map(word -> criminal, sentence -> 0), List()), List(named_entity, 1458, 1467, O, Map(word -> activities, sentence -> 0), List()), List(named_entity, 1468, 1468, O, Map(word -> ., sentence -> 0), List()), List(named_entity, 1470, 1470, O, Map(word -> I, sentence -> 0), List()), List(named_entity, 1472, 1475, O, Map(word -> will, sentence -> 0), List()), List(named_entity, 1477, 1484, O, Map(word -> continue, sentence -> 0), List()), List(named_entity, 1486, 1487, O, Map(word -> to, sentence -> 0), List()), List(named_entity, 1489, 1494, O, Map(word -> gather, sentence -> 0), List()), List(named_entity, 1496, 1507, O, Map(word -> intelligence, sentence -> 0), List()), List(named_entity, 1509, 1511, O, Map(word -> and, sentence -> 0), List()), List(named_entity, 1513, 1516, O, Map(word -> keep, sentence -> 0), List()), List(named_entity, 1518, 1520, O, Map(word -> you, sentence -> 0), List()), List(named_entity, 1522, 1529, O, Map(word -> informed, sentence -> 0), List()), List(named_entity, 1531, 1532, O, Map(word -> of, sentence -> 0), List()), List(named_entity, 1534, 1536, O, Map(word -> any, sentence -> 0), List()), List(named_entity, 1539, 1549, O, Map(word -> significant, sentence -> 0), List()), List(named_entity, 1551, 1562, O, Map(word -> developments, sentence -> 0), List()), List(named_entity, 1563, 1563, O, Map(word -> ., sentence -> 0), List()))",dbfs:/FileStore/crime_letters/communication_22_5_2013.pdf


To facilitate our analysis, we will use the ``path`` column to extract the date of the report and use the date to assign a report id. We will accomplish this by using a Window function. From there, we explode the array of struct (the result of the NER) and retrieve only the entity and the associated word.

In [0]:
from pyspark.sql.window import Window

In [0]:
# 1. We extract the date
# 2. Get a row ID based on date
# 3. Explode the column
# 4. Extract the results (in a struct)

reports_parsed = ner_results \
    .withColumn("date", F.to_date(F.regexp_extract(F.col("path"), r"(\d{4}-\d+-\d{2})", 1))) \
    .withColumn("report_id", F.row_number().over(Window.orderBy("date"))) \
    .withColumn("ner_exploded", F.explode("ner")) \
    .withColumns({
        "result":  F.col("ner_exploded.result"), 
        "metadata": F.col("ner_exploded.metadata.word") 
    }
    ) \
    .withColumn("row_number", F.row_number().over(Window.orderBy("report_id"))) \
    .select(F.col("result"), F.col("metadata"), F.col("report_id"), F.col("row_number"))

In [0]:
reports_parsed.limit(50).display()

result,metadata,report_id,row_number
O,I,1,1
O,am,1,2
O,writing,1,3
O,to,1,4
O,report,1,5
O,on,1,6
O,my,1,7
O,recent,1,8
O,surveillance,1,9
O,findings,1,10


We know that our agent always spells out the full name of the persons he follows (how convenient!). This allows us to do a clever join to get the people's full name: We join the DataFrame onto itself on the newly created variable ``row_number``, where the left side corresponds to the first name and the right side to the last name. Since we know the ordering we have as the join key (``row_number``, ``row_number - 1``).

In [0]:
sub_network = reports_parsed.alias("df1").withColumnRenamed("metadata", "First Name").join(
    reports_parsed.alias("df2").withColumnRenamed("metadata", "Last Name"),
    (F.col("df1.row_number") == F.col("df2.row_number") - 1) & 
    (F.col("df1.result") == "B-PER") & 
    (F.col("df2.result") == "I-PER"),
    "inner"
)

display(sub_network)

result,First Name,report_id,row_number,result.1,Last Name,report_id.1,row_number.1
B-PER,Giulia,1,19,I-PER,Bianchi,1,20
B-PER,Federico,1,37,I-PER,Romano,1,38
B-PER,Elena,1,65,I-PER,Conti,1,66
B-PER,Francesco,1,73,I-PER,Ricci,1,74
B-PER,Luca,1,113,I-PER,Moretti,1,114
B-PER,Giuseppe,1,139,I-PER,Rossi,1,140
B-PER,Carlo,2,280,I-PER,Romano,2,281
B-PER,Giovanni,2,308,I-PER,Moretti,2,309
B-PER,Francesca,2,315,I-PER,Marini,2,316
B-PER,Giuseppe,3,394,I-PER,Rossi,3,395


We have now extracted a subnet of the overall Mafia network. Our tasks are now two-fold:

1. Which is the individual with the highest influence within the sub-network (based on each relation type)?
2. Which is the individual with the highest influence within the overall network (based on each relation type)?

Both of these question can be answered using network analysis!

In [0]:
from graphframes import * 

nodes = spark.read.csv("dbfs:/FileStore/mafia_nodes.csv", header=True)
edges = spark.read.csv("dbfs:/FileStore/mafia_edges.csv", header=True)


We will create an index column for the relation types.

In [0]:
import pyspark.sql.types as tp

edges = edges \
    .withColumn("relation_type_index", F.dense_rank().over(Window.orderBy("relation_type"))) \
    .withColumn("weight", F.col("weight").cast(tp.IntegerType()))
edges.display()

src,dst,relation_type,weight,relation_type_index
1,2,Asked for Meeting With,1,1
3,4,Asked for Meeting With,2,1
4,1,Asked for Meeting With,1,1
5,6,Asked for Meeting With,1,1
5,7,Asked for Meeting With,1,1
6,5,Asked for Meeting With,2,1
7,5,Asked for Meeting With,1,1
8,7,Asked for Meeting With,1,1
9,12,Asked for Meeting With,1,1
11,12,Asked for Meeting With,2,1


Now we instantiate our complete graph.

In [0]:
mafia_graph = GraphFrame(nodes, edges)



In [0]:
# Let us inspect the graph
mafia_graph.vertices.show()
mafia_graph.edges.show()

+---+----------+---------+--------------+
| id|First Name|Last Name|        Family|
+---+----------+---------+--------------+
|  1|  Giuseppe|    Rossi|  Rossi Family|
|  2|     Maria|    Rossi|  Rossi Family|
|  3|   Antonio|    Rossi|  Rossi Family|
|  4|     Sofia|    Rossi|  Rossi Family|
|  5|      Luca|  Bianchi|Bianchi Family|
|  6|    Giulia|  Bianchi|Bianchi Family|
|  7|     Marco|  Bianchi|Bianchi Family|
|  8|   Alessia|  Bianchi|Bianchi Family|
|  9| Francesco|    Ricci|  Ricci Family|
| 10|    Chiara|    Ricci|  Ricci Family|
| 11|   Roberto|    Ricci|  Ricci Family|
| 12|     Laura|    Ricci|  Ricci Family|
| 13|  Giovanni|  Moretti|Moretti Family|
| 14|      Anna|  Moretti|Moretti Family|
| 15|    Matteo|  Moretti|Moretti Family|
| 16|     Elena|  Moretti|Moretti Family|
| 17|     Carlo|   Romano| Romano Family|
| 18|     Lucia|   Romano| Romano Family|
| 19|  Federico|   Romano| Romano Family|
| 20|   Martina|   Romano| Romano Family|
+---+----------+---------+--------

In [0]:
import pyspark.sql.functions as F
mafia_graph.edges.select(F.col("relation_type")).distinct().show()

+--------------------+
|       relation_type|
+--------------------+
|Asked for Meeting...|
|          Threatened|
|          Sent Money|
|              Called|
+--------------------+



We now subset our overall ``nodes`` DataFrame to extract the sub-network only.

In [0]:
sub_network_nodes = mafia_graph.vertices \
    .join(sub_network, on=["First Name", "Last Name"], how="inner") \
    .dropDuplicates(["First Name", "Last Name"])

display(sub_network_nodes)

First Name,Last Name,id,Family,result,report_id,row_number,result.1,report_id.1,row_number.1
Antonio,Rossi,3,Rossi Family,B-PER,7,898,I-PER,7,899
Carlo,Romano,17,Romano Family,B-PER,2,280,I-PER,2,281
Chiara,Ricci,10,Ricci Family,B-PER,7,906,I-PER,7,907
Elena,Conti,26,Conti Family,B-PER,1,65,I-PER,1,66
Federico,Romano,19,Romano Family,B-PER,1,37,I-PER,1,38
Francesca,Marini,24,Marini Family,B-PER,2,315,I-PER,2,316
Francesco,Ricci,9,Ricci Family,B-PER,1,73,I-PER,1,74
Giovanni,Moretti,13,Moretti Family,B-PER,2,308,I-PER,2,309
Giulia,Bianchi,6,Bianchi Family,B-PER,1,19,I-PER,1,20
Giuliana,Conti,28,Conti Family,B-PER,6,818,I-PER,6,819


In [0]:
# mafia_graph_edges_sub_network = sub_network_nodes \
#     .withColumn("edge_id", F.concat(F.col("src"), F.col("dst"), F.col("row_number"))) \
#     .selectExpr("edge_id", 'explode(array(src, dst)) AS node_in_edge') \
#     .dropDuplicates()

# mafia_graph_edges_sub_network.display()

edge_id,node_in_edge
121,1
121,2
341,3
341,4
411,4
411,1
561,5
561,6
571,5
571,7


In [0]:
# Get unique edges from subnetwork
sub_network_edges = list(map(lambda x: x["id"], sub_network_nodes.select("id").collect()))

# Filter network by subnetwork nodes
sub_network_df = mafia_graph.filterEdges(F.col("src").isin(sub_network_edges) | F.col("dst").isin(sub_network_edges)).edges



In [0]:
mafia_subgraph = GraphFrame(sub_network_nodes, sub_network_df)

mafia_subgraph.vertices.show()



+----------+---------+---+--------------+------+---------+----------+------+---------+----------+
|First Name|Last Name| id|        Family|result|report_id|row_number|result|report_id|row_number|
+----------+---------+---+--------------+------+---------+----------+------+---------+----------+
|   Antonio|    Rossi|  3|  Rossi Family| B-PER|        7|       898| I-PER|        7|       899|
|     Carlo|   Romano| 17| Romano Family| B-PER|        2|       280| I-PER|        2|       281|
|    Chiara|    Ricci| 10|  Ricci Family| B-PER|        7|       906| I-PER|        7|       907|
|     Elena|    Conti| 26|  Conti Family| B-PER|        1|        65| I-PER|        1|        66|
|  Federico|   Romano| 19| Romano Family| B-PER|        1|        37| I-PER|        1|        38|
| Francesca|   Marini| 24| Marini Family| B-PER|        2|       315| I-PER|        2|       316|
| Francesco|    Ricci|  9|  Ricci Family| B-PER|        1|        73| I-PER|        1|        74|
|  Giovanni|  Morett

While ``graphframes`` offers a method to compute ``degree centrality``, its inventory is relatively limited. Since we are operating on different partitions of the graph (i.e. the subgraph induced by the mentioned people in the reports and the entire graph), we can use Spark's capabilities and parallelize the operations. To this end, we will make use of the ``networkx`` library, which offers a wealth of functions to work with graphs.

As we can see, there are four different types of edges:
- Asked for Meeting
- Threatened
- Sent Money
- Called

Those edge types give us the idea that this graph is directed.

Let us now proceed to our actual network analysis. We will compute to network centrality measures here, (in-/out-)degree centrality and betweenness centrality.

- *Degree centrality* measures the importance of a node in a network based on its connections. In the context of in-degree centrality, this metric quantifies how many incoming connections a node has, reflecting its popularity or influence within the network. Conversely, out-degree centrality assesses the number of outgoing connections from a node, indicating its capacity to disseminate information or influence others. 

- *Betweenness centrality*, on the other hand, evaluates the extent to which a node serves as a bridge or intermediary between other nodes in the network. Nodes with high betweenness centrality often lie on many shortest paths between pairs of nodes, suggesting their critical role in maintaining connectivity and facilitating communication within the network.

Both of these, among many others, are implement in ``networkx``. To make us of these functionalities, we use a trick we learned in a previous class: ``pandas`` UDFs, which allow us to pass our graphframe or dataframe into a function and operate on it as normal Python code.

In [0]:
import pandas as pd

output_schema_degree = tp.StructType([
    tp.StructField("relation_type", tp.StringType(), False),
    tp.StructField("node", tp.StringType(), False),
    tp.StructField("in_degree_centrality", tp.FloatType(), False),
    tp.StructField("out_degree_centrality", tp.FloatType(), False),
])

output_schema_betweenness = tp.StructType([
    tp.StructField("relation_type", tp.StringType(), False),
    tp.StructField("node", tp.StringType(), False),
    tp.StructField("betweenness_centrality", tp.FloatType(), False),
])

def nx_degree_centrality(pdf: pd.DataFrame) -> pd.DataFrame:
    # We get the relation_type key
    key = pdf["relation_type"].iloc[0]

    # Here we instantiate a networkx directed graph (DiGraph)
    in_degree_centralities = nx.in_degree_centrality(nx.DiGraph(nx.from_pandas_edgelist(pdf, "src", "dst", edge_attr="weight")))
    out_degree_centralities = nx.out_degree_centrality(nx.DiGraph(nx.from_pandas_edgelist(pdf, "src", "dst", edge_attr="weight")))
    
    # Finally, we return a pd.DataFrame
    return pd.DataFrame(
        {
            "relation_type": [key for _ in range(1, len(in_degree_centralities.values()) + 1)], 
            "node": in_degree_centralities.keys(), 
            "in_degree_centrality": in_degree_centralities.values(),
            "out_degree_centrality": out_degree_centralities.values(),
        }
    )

def nx_betweenness_centrality(pdf: pd.DataFrame) -> pd.DataFrame:
    # We get the relation_type key
    key = pdf["relation_type"].iloc[0]

    # Here we instantiate a networkx directed graph (DiGraph)
    betweenness_centrality = nx.betweenness_centrality(nx.DiGraph(nx.from_pandas_edgelist(pdf, "src", "dst", edge_attr="weight")))
    
    # Finally, we return a pd.DataFrame
    return pd.DataFrame(
        {
            "relation_type": [key for _ in range(1, len(betweenness_centrality.values()) + 1)], 
            "node": betweenness_centrality.keys(), 
            "betweenness_centrality": betweenness_centrality.values()
        }
    )



We compute degree centrality and betweenness centrality for the subgraph

In [0]:
# Summarize results for degree centrality
mafia_subgraph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()

relation_type,node,in_degree_centrality,out_degree_centrality,id,First Name,Last Name,Family
Threatened,2,0.07692308,0.07692308,2,Maria,Rossi,Rossi Family
Threatened,1,0.07692308,0.07692308,1,Giuseppe,Rossi,Rossi Family
Threatened,4,0.07692308,0.07692308,4,Sofia,Rossi,Rossi Family
Threatened,3,0.07692308,0.07692308,3,Antonio,Rossi,Rossi Family
Threatened,10,0.07692308,0.07692308,10,Chiara,Ricci,Ricci Family
Threatened,9,0.07692308,0.07692308,9,Francesco,Ricci,Ricci Family
Threatened,12,0.07692308,0.07692308,12,Laura,Ricci,Ricci Family
Threatened,11,0.07692308,0.07692308,11,Roberto,Ricci,Ricci Family
Threatened,30,0.07692308,0.07692308,30,Paola,Rossi,Rossi Family
Threatened,29,0.07692308,0.07692308,29,Giorgio,Rossi,Rossi Family


In [0]:
# Summarize results for betweenness centrality
mafia_subgraph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_betweenness_centrality, output_schema_betweenness) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "betweenness_centrality", ascending=False) \
    .display()

relation_type,node,betweenness_centrality,id,First Name,Last Name,Family
Threatened,2,0.0,2,Maria,Rossi,Rossi Family
Threatened,1,0.0,1,Giuseppe,Rossi,Rossi Family
Threatened,4,0.0,4,Sofia,Rossi,Rossi Family
Threatened,3,0.0,3,Antonio,Rossi,Rossi Family
Threatened,10,0.0,10,Chiara,Ricci,Ricci Family
Threatened,9,0.0,9,Francesco,Ricci,Ricci Family
Threatened,12,0.0,12,Laura,Ricci,Ricci Family
Threatened,11,0.0,11,Roberto,Ricci,Ricci Family
Threatened,30,0.0,30,Paola,Rossi,Rossi Family
Threatened,29,0.0,29,Giorgio,Rossi,Rossi Family


We repeat this exercise for the entire graph.

In [0]:
# Summarize results for degree centrality
mafia_graph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()

relation_type,node,in_degree_centrality,out_degree_centrality,id,First Name,Last Name,Family
Threatened,2,0.07692308,0.07692308,2,Maria,Rossi,Rossi Family
Threatened,1,0.07692308,0.07692308,1,Giuseppe,Rossi,Rossi Family
Threatened,4,0.07692308,0.07692308,4,Sofia,Rossi,Rossi Family
Threatened,3,0.07692308,0.07692308,3,Antonio,Rossi,Rossi Family
Threatened,10,0.07692308,0.07692308,10,Chiara,Ricci,Ricci Family
Threatened,9,0.07692308,0.07692308,9,Francesco,Ricci,Ricci Family
Threatened,12,0.07692308,0.07692308,12,Laura,Ricci,Ricci Family
Threatened,11,0.07692308,0.07692308,11,Roberto,Ricci,Ricci Family
Threatened,30,0.07692308,0.07692308,30,Paola,Rossi,Rossi Family
Threatened,29,0.07692308,0.07692308,29,Giorgio,Rossi,Rossi Family


In [0]:
# Summarize results for betweenness centrality
mafia_graph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_betweenness_centrality, output_schema_betweenness) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "betweenness_centrality", ascending=False) \
    .display()

relation_type,node,betweenness_centrality,id,First Name,Last Name,Family
Threatened,2,0.0,2,Maria,Rossi,Rossi Family
Threatened,1,0.0,1,Giuseppe,Rossi,Rossi Family
Threatened,4,0.0,4,Sofia,Rossi,Rossi Family
Threatened,3,0.0,3,Antonio,Rossi,Rossi Family
Threatened,10,0.0,10,Chiara,Ricci,Ricci Family
Threatened,9,0.0,9,Francesco,Ricci,Ricci Family
Threatened,12,0.0,12,Laura,Ricci,Ricci Family
Threatened,11,0.0,11,Roberto,Ricci,Ricci Family
Threatened,30,0.0,30,Paola,Rossi,Rossi Family
Threatened,29,0.0,29,Giorgio,Rossi,Rossi Family


#### Bonus
If would like to compute the overall influence, without regard to the relation type, you can use a *fictional* relation type by creating a column with a literal value, such as 1 or a string. Note that using networkx does not leverage the speed of Spark, unless we partition our network in some way. Since we have a network with weights, we can use a groupby to sum the weight, which automatically imposes a uniqueness condition. We still need our fictional groupby to use ``applyInPandas``.

In [0]:
mafia_graph \
    .edges \
    .withColumn("relation_type", F.lit("1")) \
    .groupby(["src", "dst", "relation_type"]) \
    .agg(F.sum("weight").alias("weight")) \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()

relation_type,node,in_degree_centrality,out_degree_centrality,id,First Name,Last Name,Family
1,9,0.29090908,0.29090908,9,Francesco,Ricci,Ricci Family
1,13,0.29090908,0.29090908,13,Giovanni,Moretti,Moretti Family
1,1,0.29090908,0.29090908,1,Giuseppe,Rossi,Rossi Family
1,3,0.29090908,0.29090908,3,Antonio,Rossi,Rossi Family
1,6,0.27272728,0.27272728,6,Giulia,Bianchi,Bianchi Family
1,8,0.27272728,0.27272728,8,Alessia,Bianchi,Bianchi Family
1,11,0.27272728,0.27272728,11,Roberto,Ricci,Ricci Family
1,4,0.27272728,0.27272728,4,Sofia,Rossi,Rossi Family
1,7,0.27272728,0.27272728,7,Marco,Bianchi,Bianchi Family
1,5,0.27272728,0.27272728,5,Luca,Bianchi,Bianchi Family
