# THE GOAL

The goal is to take Watson's role and use the intel (the data in the supplied files) from the police, Interpol, and undercover agents about Europe's criminals to identify the name behind which Moriarty is hiding.


# SOLUTION

# PART 1
- Watson, just like our great-grandfathers, we are again after Moriarty.

We need to catch him. Hmm... I need to be careful here: maybe it is not him, maybe it is her. All we know is that someone is masterminding unlawful activities and planning something bad. The Interpol agents, with the help of my boys, collected information that should give us the clues to determine the name Moriarty is hiding behind and to make an arrest.

- I have a number of .csv and .txt files about criminal activity and high-profile suspicious sales that were sent over from our neighbors: France, Germany, the Netherlands, and our own MI6 in the United Kingdom.

So, the first task would be to combine the data into one table. I requested the name, the alias, and the last known whereabouts as latitude and longitude, but since the data comes from all over Europe, the agencies may have named the columns differently.

I am thinking that adding the country to the data might be helpful in our future analysis.

Lastly, from my correspondence with our undercover agents, all the activity seems to be happening around major financial centers. If the city names are not in the data, I suppose you can work them out from the latitude and longitude. Hmm... And a map of course, unless your knowledge of Europe's geography is exceptional.





Tasks:
1. Read the data from each file into a separate dataframe and add the country name ('country' column).
2. Identify the city around which the criminals in each country operate. Add it to the dataframe ('city' column).
3. Concatenate the dataframes into a single dataframe with the four original columns renamed to: [name, alias, latitude, longitude]
4. Fill NAs in aliases with an empty string.


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [2]:
%ls data

crime_type_profit_France.txt          criminals_Germany.csv
crime_type_profit_Germany.txt         criminals_Netherlands.csv
crime_type_profit_Netherlands.txt     criminals_United Kingdom.csv
crime_type_profit_United Kingdom.txt  id_dates.csv
criminals_France.csv


In [3]:
#sample one of the csvs
country_ = "France"
file_name = "./data/criminals_{}.csv".format(country_)
df_country = spark.read.csv(file_name, header=True, inferSchema=True)
print(df_country.columns)
df_country.show(5, False)

['id', 'nom', 'pseudonyme', 'latitude', 'longitude']
+---+-------------------------+----------+--------+---------+
|id |nom                      |pseudonyme|latitude|longitude|
+---+-------------------------+----------+--------+---------+
|0  |Henriette Thomas du Peron|null      |48.9072 |2.2521   |
|1  |Marianne Francois        |null      |48.7158 |2.167    |
|2  |Chantal Laurent          |null      |48.8507 |2.3281   |
|3  |Dorothée Coulon          |null      |49.0233 |2.5613   |
|4  |Astrid Meunier           |null      |49.0044 |2.547    |
+---+-------------------------+----------+--------+---------+
only showing top 5 rows



In [4]:
import pyspark.sql.functions as F

In [5]:
def rename_cols(df, new_col_names):
    """Rename the columns of df, by position, to the names in new_col_names."""
    for col, new_col in zip(df.columns, new_col_names):
        df = df.withColumnRenamed(col, new_col)
        
    return df

#explore the dataframes: column names, shapes and combine into a single dataframe
country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/criminals_{}.csv".format(country_)
    df = spark.read.csv(file_name, header=True, inferSchema=True)
    print("Country: {}, rows: {}".format(country_, df.count()))
    new_col_names = ["id", "name", "alias", "latitude", "longitude"]
    df = rename_cols(df, new_col_names)
    df = df.withColumn('country', F.lit(country_))
    dfs_dict[country_] = df  # add data frame to the dict for a future union
print("Len dfs_dict: {}".format(len(dfs_dict)))



Country: United Kingdom, rows: 306
Country: Germany, rows: 264
Country: Netherlands, rows: 250
Country: France, rows: 349
Len dfs_dict: 4
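The renaming above works purely by position: `rename_cols` zips the existing header with the new names, so it relies on every country delivering the columns in the same order. A minimal plain-Python sketch of that pairing, using the French and German headers as Spark reports them for these files; `positional_rename_map` is a hypothetical helper, not part of the notebook:

```python
# Header names as Spark reports them for two of the source files.
french_cols = ["id", "nom", "pseudonyme", "latitude", "longitude"]
german_cols = ["id", "benennen", "aliasnamen", "breitengrad", "länge"]
target_cols = ["id", "name", "alias", "latitude", "longitude"]


def positional_rename_map(current_cols, new_cols):
    """Pair each existing column with its replacement by position (hypothetical helper)."""
    if len(current_cols) != len(new_cols):
        raise ValueError(
            "column count mismatch: {} vs {}".format(len(current_cols), len(new_cols))
        )
    return dict(zip(current_cols, new_cols))


french_map = positional_rename_map(french_cols, target_cols)
german_map = positional_rename_map(german_cols, target_cols)
print(french_map)
print(german_map)
```

The length check is the one safeguard a positional rename can offer cheaply; a file with reordered columns would still be renamed silently, so it is worth printing the original header of each file once, as done for France above.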


In [6]:
len(list(dfs_dict.values()))

4

In [7]:
# from https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspark
from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(dfs):
    """Stack dataframes row-wise; columns are matched by position, not by name."""
    return reduce(DataFrame.union, dfs)

df_criminals_combined = unionAll(list(dfs_dict.values()))
print("Rows in combined df: {}".format(df_criminals_combined.count()))

Rows in combined df: 1169


In [8]:
# alternatively, chain the unions one by one
dfs_list = list(dfs_dict.values())
df_combined = dfs_list[0].union(dfs_list[1]).union(dfs_list[2]).union(dfs_list[3])
print("Rows in combined df: {}".format(df_combined.count()))
df_combined.show(5, False)

Rows in combined df: 1169
+---+------------------------+-----+--------+---------+--------------+
|id |name                    |alias|latitude|longitude|country       |
+---+------------------------+-----+--------+---------+--------------+
|0  |Ms. Diane Barnett       |null |51.3327 |-0.0328  |United Kingdom|
|1  |Elizabeth McDonald      |null |51.3732 |-0.0396  |United Kingdom|
|2  |Jacqueline Martin-Winter|null |51.3536 |-0.223   |United Kingdom|
|3  |Roger Farmer            |null |51.2891 |-0.208   |United Kingdom|
|4  |Mrs. Georgina Harrison  |null |51.6004 |0.0054   |United Kingdom|
+---+------------------------+-----+--------+---------+--------------+
only showing top 5 rows



In [10]:
# calculate the mean latitude and longitude per country to identify the major financial centers (cities)
# (copy and paste the lat, lon values into Google Maps)
for country_ in country_list:
    country_df = df_criminals_combined.where("country = '{}'".format(country_))
    lat = round(country_df.agg({"latitude": "avg"}).collect()[0][0], 4)
    lon = round(country_df.agg({"longitude": "avg"}).collect()[0][0], 4)
    print("Country: {}, (lat, lon): {}, {}".format(country_, lat, lon))
    print(40 * "*")

Country: United Kingdom, (lat, lon): 51.5046, -0.124
****************************************
Country: Germany, (lat, lon): 50.0971, 8.679
****************************************
Country: Netherlands, (lat, lon): 52.3753, 4.901
****************************************
Country: France, (lat, lon): 48.8606, 2.3646
****************************************
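Instead of pasting the means into a map, the lookup can also be sketched in code: compute the great-circle (haversine) distance from each mean point to a handful of candidate cities and keep the closest one. The city coordinates below are approximate reference values assumed for illustration, not part of the supplied data, and `nearest_city` is a hypothetical helper:

```python
from math import asin, cos, radians, sin, sqrt

# Approximate city-centre coordinates (assumed reference values, not from the data files).
CITY_COORDS = {
    "London": (51.5074, -0.1278),
    "Manchester": (53.4808, -2.2426),
    "Frankfurt": (50.1109, 8.6821),
    "Berlin": (52.5200, 13.4050),
    "Amsterdam": (52.3676, 4.9041),
    "Rotterdam": (51.9244, 4.4777),
    "Paris": (48.8566, 2.3522),
    "Lyon": (45.7640, 4.8357),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def nearest_city(lat, lon):
    """Return the candidate city closest to the given point (hypothetical helper)."""
    return min(CITY_COORDS, key=lambda city: haversine_km(lat, lon, *CITY_COORDS[city]))


# Mean coordinates per country, copied from the output above.
mean_coords = {
    "United Kingdom": (51.5046, -0.124),
    "Germany": (50.0971, 8.679),
    "Netherlands": (52.3753, 4.901),
    "France": (48.8606, 2.3646),
}
nearest = {country: nearest_city(lat, lon) for country, (lat, lon) in mean_coords.items()}
print(nearest)
```

With this candidate list, each mean point lands within a few kilometres of one city, which agrees with the country-to-city mapping used in the next cells.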


In [11]:
country_list

['United Kingdom', 'Germany', 'Netherlands', 'France']

In [12]:
# add the city name to the df

# this can be done with a series of if/else statements (if country_ == 'France': city = 'Paris', etc.)
# or with a dictionary, as below:
country_city_dict = {"United Kingdom": "London", "Germany": "Frankfurt", "Netherlands": "Amsterdam", "France": "Paris"}
country_city_dict


{'United Kingdom': 'London',
 'Germany': 'Frankfurt',
 'Netherlands': 'Amsterdam',
 'France': 'Paris'}

In [15]:
from pyspark.sql.functions import when
df_with_city = df_criminals_combined.withColumn('city', \
                                                 when(F.col('country')=='United Kingdom', 'London').\
                                                 when(F.col('country')=='France', 'Paris').\
                                                 when(F.col('country')=='Germany', 'Frankfurt').\
                                                 when(F.col('country')=='Netherlands', 'Amsterdam').\
                                                 otherwise(None))
df_with_city.show(10, False)

+---+------------------------+-----+--------+---------+--------------+------+
|id |name                    |alias|latitude|longitude|country       |city  |
+---+------------------------+-----+--------+---------+--------------+------+
|0  |Ms. Diane Barnett       |null |51.3327 |-0.0328  |United Kingdom|London|
|1  |Elizabeth McDonald      |null |51.3732 |-0.0396  |United Kingdom|London|
|2  |Jacqueline Martin-Winter|null |51.3536 |-0.223   |United Kingdom|London|
|3  |Roger Farmer            |null |51.2891 |-0.208   |United Kingdom|London|
|4  |Mrs. Georgina Harrison  |null |51.6004 |0.0054   |United Kingdom|London|
|5  |Peter Stevens           |null |51.6441 |0.0188   |United Kingdom|London|
|6  |Georgina Bell           |null |51.5304 |-0.0927  |United Kingdom|London|
|7  |Miss Lesley Sullivan    |null |51.7303 |-0.2607  |United Kingdom|London|
|8  |Keith Kelly             |Happy|51.4393 |-0.1421  |United Kingdom|London|
|9  |Shane Bailey            |null |51.2735 |-0.3407  |United Ki

In [16]:
# Fillna in alias.
df_with_city = df_with_city.fillna({"alias": ""})
print("Df shape: {}".format(df_with_city.count()))
df_with_city.orderBy("name").show(5)

Df shape: 1169
+---+--------------------+-----+--------+---------+--------------+---------+
| id|                name|alias|latitude|longitude|       country|     city|
+---+--------------------+-----+--------+---------+--------------+---------+
|247|          Abbie Bond|     | 51.7279|  -0.2436|United Kingdom|   London|
| 63|          Abel Greij|     |  52.603|     5.06|   Netherlands|Amsterdam|
|150|Adam van de Pol-K...|     | 52.5674|   4.8518|   Netherlands|Amsterdam|
| 28|Adelgunde Hensche...|     | 50.2047|   8.7456|       Germany|Frankfurt|
|261|         Adrian West|     | 51.4574|   0.0411|United Kingdom|   London|
+---+--------------------+-----+--------+---------+--------------+---------+
only showing top 5 rows



# PART 2
Add crime_type and profit info to the criminals table, that is, merge/join the criminals table with the crime type and profit information.

- Great, Watson!
- Now we need to know what every one of those suspects did wrong, that is, the crime type, and preferably how much they profited from it. Moriarty is no small fish: he is in the category with the largest total sales.

- You'll need to add the crime type and the profit from the files to the table you already put together. Be mindful of the file types. I also believe that the separator in these files may be different from the one in the files you used previously.
- Moriarty made one of the top 5 sales last year. He is too smart for nicknames; I am pretty sure he doesn't have an alias.
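The warning about the separator can be tried out without Spark. The sketch below parses a small space-separated sample with Python's `csv` module. The sample text is hypothetical: it assumes that values containing spaces (names, "drug sale") are wrapped in double quotes, which is the usual way a space-delimited file keeps such values in one field.

```python
import csv
import io

# Hypothetical sample in the assumed layout of the crime_type_profit_*.txt files:
# space-separated, with multi-word values wrapped in double quotes.
sample = (
    "name crime_type profit\n"
    '"Marianne Francois" "drug sale" 8100\n'
    '"Chantal Laurent" theft 52\n'
)

reader = csv.DictReader(io.StringIO(sample), delimiter=" ", quotechar='"')
rows = list(reader)
print(rows)

# Every parsed value is text, so numeric columns need an explicit conversion.
profits = [int(row["profit"]) for row in rows]
print(profits)
```

The same point applies to the Spark reader: without a declared or inferred schema, every column of a delimited text file arrives as a string.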


# Solution (task 2)

In [17]:
df = spark.read.csv("./data/crime_type_profit_France.txt", header=True, sep=" ")
print("Columns: ", list(df.columns))
df.show(5, False)

Columns:  ['name', 'crime_type', 'profit']
+-------------------------+-------------+------+
|name                     |crime_type   |profit|
+-------------------------+-------------+------+
|Henriette Thomas du Peron|robbery      |558   |
|Marianne Francois        |drug sale    |8100  |
|Chantal Laurent          |theft        |52    |
|Dorothée Coulon          |theft        |37    |
|Astrid Meunier           |pickpocketing|41    |
+-------------------------+-------------+------+
only showing top 5 rows



In [18]:
# union (concatenate) the crime type and profit files of all countries

country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/crime_type_profit_{}.txt".format(country_)
    df = spark.read.csv(file_name, header=True, sep=" ")
    print("rows: {}".format(df.count()))
    df = df.withColumn('country', F.lit(country_))
    dfs_dict[country_] = df
print("Len dfs_dict: {}".format(len(dfs_dict)))

#combine all dataframes into one
df_crime_type_profit = unionAll(list(dfs_dict.values()))
print(list(df_crime_type_profit.columns))

df_crime_type_profit.show(10)

rows: 306
rows: 264
rows: 250
rows: 349
Len dfs_dict: 4
['name', 'crime_type', 'profit', 'country']
+--------------------+----------+------+--------------+
|                name|crime_type|profit|       country|
+--------------------+----------+------+--------------+
|   Ms. Diane Barnett|     theft|   284|United Kingdom|
|  Elizabeth McDonald|     theft|    59|United Kingdom|
|Jacqueline Martin...|   forgery|   150|United Kingdom|
|        Roger Farmer|     theft|   378|United Kingdom|
|Mrs. Georgina Har...|     theft|    55|United Kingdom|
|       Peter Stevens|   robbery|   868|United Kingdom|
|       Georgina Bell|     theft|   365|United Kingdom|
|Miss Lesley Sullivan|   forgery|   320|United Kingdom|
|         Keith Kelly|     theft|   399|United Kingdom|
|        Shane Bailey|   forgery|   495|United Kingdom|
+--------------------+----------+------+--------------+
only showing top 10 rows



In [19]:
# drop duplicates 
df_with_city = df_with_city.drop_duplicates(["name"])
df_with_city.count()


1169
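The unchanged row count suggests that no name is repeated. The same kind of check is worth running on the join key itself, because a left join multiplies rows whenever a key occurs more than once on the right-hand side. A minimal plain-Python sketch of such a check; the last entry is a hypothetical duplicate added for illustration, and `duplicated_keys` is not part of the notebook:

```python
from collections import Counter

# (name, country) pairs; the first four appear in the tables above,
# the repeated last entry is a hypothetical duplicate added for illustration.
keys = [
    ("Keith Kelly", "United Kingdom"),
    ("Abel Greij", "Netherlands"),
    ("Astrid Meunier", "France"),
    ("Adrian West", "United Kingdom"),
    ("Keith Kelly", "United Kingdom"),
]


def duplicated_keys(pairs):
    """Return the keys that occur more than once, with their counts (hypothetical helper)."""
    return {key: count for key, count in Counter(pairs).items() if count > 1}


dupes = duplicated_keys(keys)
print(dupes)
```

An empty result means the key is unique and the join cannot inflate the row count.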

In [20]:
# join main criminal info with crime type and profit
df_city_profit = df_with_city.join(df_crime_type_profit, ["name","country"], "left")
print("Df shape: {}".format(df_city_profit.count()))
# print(df_city_profit.columns)
df_city_profit.orderBy('profit', ascending = False).show(10, False)

Df shape: 1169
+-----------------------+--------------+---+-----+--------+---------+---------+----------+------+
|name                   |country       |id |alias|latitude|longitude|city     |crime_type|profit|
+-----------------------+--------------+---+-----+--------+---------+---------+----------+------+
|Ing. Karla Lindner MBA.|Germany       |196|Chuey|49.9286 |8.8667   |Frankfurt|robbery   |992   |
|Rhys Evans             |United Kingdom|248|     |51.5803 |-0.2249  |London   |drug sale |990   |
|Alfred Morvan          |France        |295|     |48.9516 |2.4703   |Paris    |robbery   |988   |
|Amina van Ochten       |Netherlands   |67 |     |52.545  |4.9964   |Amsterdam|robbery   |986   |
|Valerie Smith          |United Kingdom|161|     |51.5111 |0.0496   |London   |robbery   |976   |
|Sara Foster            |United Kingdom|274|     |51.452  |-0.2654  |London   |robbery   |974   |
|Dean Ward              |United Kingdom|243|     |51.3494 |-0.204   |London   |theft     |97    |
|Ann-

In [21]:
# The profit column is not sorted numerically (97 appears right after 974); the data type is probably the issue.

In [22]:
df_city_profit.printSchema()

root
 |-- name: string (nullable = true)
 |-- country: string (nullable = false)
 |-- id: integer (nullable = true)
 |-- alias: string (nullable = false)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- city: string (nullable = true)
 |-- crime_type: string (nullable = true)
 |-- profit: string (nullable = true)
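The schema confirms the suspicion: `profit` is a string, and strings are compared character by character, so "97" ranks above "8100" and "498000". A minimal plain-Python sketch of the effect, using profit values that appear in the outputs of this notebook:

```python
# Profit values taken from the outputs in this notebook, stored as text.
profits_as_text = ["992", "990", "97", "498000", "8100"]

# Descending text order compares character by character.
text_order = sorted(profits_as_text, reverse=True)
# Descending numeric order converts each value before comparing.
numeric_order = sorted(profits_as_text, key=int, reverse=True)

print(text_order)     # ['992', '990', '97', '8100', '498000']
print(numeric_order)  # ['498000', '8100', '992', '990', '97']
```

Casting the column to an integer type before ordering gives the numeric behaviour.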



In [23]:
df_city_profit = df_city_profit.withColumn("profit", F.col("profit").cast("int"))
df_city_profit.printSchema()

root
 |-- name: string (nullable = true)
 |-- country: string (nullable = false)
 |-- id: integer (nullable = true)
 |-- alias: string (nullable = false)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- city: string (nullable = true)
 |-- crime_type: string (nullable = true)
 |-- profit: integer (nullable = true)



In [24]:
# let's order by profit again...
df_city_profit.orderBy('profit', ascending = False).show(5, False)

+------------------------+--------------+---+----------+--------+---------+---------+------------+------+
|name                    |country       |id |alias     |latitude|longitude|city     |crime_type  |profit|
+------------------------+--------------+---+----------+--------+---------+---------+------------+------+
|Odette Renard du Michaud|France        |302|          |48.7832 |2.259    |Paris    |weapons sale|498000|
|Anthony Mitchell        |United Kingdom|58 |          |51.421  |0.1152   |London   |weapons sale|495000|
|Gabriel Le Schneider    |France        |307|          |48.8161 |2.3073   |Paris    |weapons sale|493000|
|Malcolm Cox-Mason       |United Kingdom|62 |Handlebars|51.5569 |0.0905   |London   |weapons sale|491000|
|Lily Walter             |Netherlands   |25 |Montana   |52.3557 |4.7229   |Amsterdam|weapons sale|484000|
+------------------------+--------------+---+----------+--------+---------+---------+------------+------+
only showing top 5 rows



In [25]:
#investigate crime types and get total sales for each
df_by_profit = df_city_profit.groupBy(["crime_type"]).\
                agg(F.sum("profit").alias("total_profit")).\
                orderBy("total_profit", ascending=False)

df_by_profit.show(10, False)

+-------------+------------+
|crime_type   |total_profit|
+-------------+------------+
|weapons sale |14942000    |
|drug sale    |2214270     |
|robbery      |96582       |
|theft        |95702       |
|forgery      |37863       |
|pickpocketing|6359        |
+-------------+------------+



In [26]:
crime_type_big_sales = df_by_profit.select("crime_type").collect()[0][0]
crime_type_big_sales

'weapons sale'

In [27]:
print("crime_type = '{}'".format(crime_type_big_sales))

crime_type = 'weapons sale'


In [28]:
df_city_profit.columns

['name',
 'country',
 'id',
 'alias',
 'latitude',
 'longitude',
 'city',
 'crime_type',
 'profit']

In [29]:
countries_crime_type_profit_df = df_city_profit.where("crime_type == '{}'".format(crime_type_big_sales))\
                    .groupBy(["country"])\
                    .agg(F.sum("profit").alias('total_profit'))\
                    .orderBy("total_profit", ascending=False)
    
countries_crime_type_profit_df.show(10, False)

+--------------+------------+
|country       |total_profit|
+--------------+------------+
|France        |6312000     |
|United Kingdom|3914000     |
|Germany       |2365000     |
|Netherlands   |2351000     |
+--------------+------------+



In [30]:
top_country = countries_crime_type_profit_df.select("country").collect()[0][0]
top_country

'France'

In [31]:
df_city_profit.show(3)

+--------------------+-----------+---+-----+--------+---------+---------+----------+------+
|                name|    country| id|alias|latitude|longitude|     city|crime_type|profit|
+--------------------+-----------+---+-----+--------+---------+---------+----------+------+
|    Amina van Ochten|Netherlands| 67|     |  52.545|   4.9964|Amsterdam|   robbery|   986|
|Dr. Raissa Benthi...|    Germany|218|     | 50.3391|   8.6749|Frankfurt|     theft|   212|
|     Eugène de Costa|     France|189|     | 48.9425|   2.5928|    Paris|     theft|   232|
+--------------------+-----------+---+-----+--------+---------+---------+----------+------+
only showing top 3 rows



In [32]:
# Show top 5 salesmen in the selected country
df_large_sales_alias_null = df_city_profit.where("country = '{}' and alias = '' and crime_type = '{}'".format(top_country, crime_type_big_sales))\
                                            .orderBy("profit", ascending = False)

df_large_sales_alias_null.show(5)

+--------------------+-------+---+-----+--------+---------+-----+------------+------+
|                name|country| id|alias|latitude|longitude| city|  crime_type|profit|
+--------------------+-------+---+-----+--------+---------+-----+------------+------+
|Odette Renard du ...| France|302|     | 48.7832|    2.259|Paris|weapons sale|498000|
|Gabriel Le Schneider| France|307|     | 48.8161|   2.3073|Paris|weapons sale|493000|
|Constance du Laurent| France|171|     | 48.8806|   2.2083|Paris|weapons sale|453000|
|   Valentine Meunier| France|200|     |  48.822|   2.5017|Paris|weapons sale|435000|
|René Tessier du L...| France| 19|     | 48.6504|   2.3543|Paris|weapons sale|423000|
+--------------------+-------+---+-----+--------+---------+-----+------------+------+
only showing top 5 rows



# PART 3

Add the date of the last deal. Moriarty does not deal on Sundays.
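The Sunday rule comes down to turning a date into a day name. A minimal standard-library sketch, assuming dates in `YYYY-MM-DD` form and an English (default) locale; `weekday_name` is a hypothetical helper that also passes missing dates through as `None`:

```python
from datetime import date, datetime


def weekday_name(value):
    """Return the English day name for a date, a 'YYYY-MM-DD' string, or None."""
    if value is None:  # keep missing dates missing instead of raising
        return None
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d")
    return value.strftime("%A")


print(weekday_name("2020-08-13"))       # Thursday
print(weekday_name(date(2020, 8, 16)))  # Sunday
print(weekday_name(None))               # None
```

A record would then be ruled out whenever `weekday_name(deal_date) == "Sunday"`.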

In [33]:
id_dates = spark.read.csv("./data/id_dates.csv", header=True, inferSchema=True)
print("id_dates shape: {}".format(id_dates.count()))
id_dates.show(4)

id_dates shape: 1169
+---+----------+-------+
| id|      date|country|
+---+----------+-------+
|  0|2020-06-15| France|
|  1|2020-01-06| France|
|  2|2020-08-03| France|
|  3|2020-06-19| France|
+---+----------+-------+
only showing top 4 rows



In [34]:
df_selected_with_dates = df_city_profit.join(id_dates, on=["id", "country"], how="left")
print(df_selected_with_dates.count())
df_selected_with_dates.show(3)

1169
+---+-----------+--------------------+-----+--------+---------+---------+----------+------+----------+
| id|    country|                name|alias|latitude|longitude|     city|crime_type|profit|      date|
+---+-----------+--------------------+-----+--------+---------+---------+----------+------+----------+
| 67|Netherlands|    Amina van Ochten|     |  52.545|   4.9964|Amsterdam|   robbery|   986|2020-08-13|
|218|    Germany|Dr. Raissa Benthi...|     | 50.3391|   8.6749|Frankfurt|     theft|   212|2020-10-20|
|189|     France|     Eugène de Costa|     | 48.9425|   2.5928|    Paris|     theft|   232|2020-01-09|
+---+-----------+--------------------+-----+--------+---------+---------+----------+------+----------+
only showing top 3 rows



In [35]:
df_selected_with_dates.printSchema()

root
 |-- id: integer (nullable = true)
 |-- country: string (nullable = false)
 |-- name: string (nullable = true)
 |-- alias: string (nullable = false)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- city: string (nullable = true)
 |-- crime_type: string (nullable = true)
 |-- profit: integer (nullable = true)
 |-- date: string (nullable = true)



In [44]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType, StringType


def weekday(date):
    """ Generate day of the week based on date (as string or as datetime object)"""
    
    if date is None:  # null dates (e.g. unmatched rows from the left join) stay null
        return None
    
    if isinstance(date, str):
        from datetime import datetime
        
        date = datetime.strptime(date, "%Y-%m-%d")  # change the format if necessary
        
    return date.strftime("%A")


weekday_udf = udf(weekday, StringType())

# conversion to DateType is not necessary as it is handled inside the function
# here it is offered as an example of re-casting
df_selected_with_dates = df_selected_with_dates.withColumn("date", F.col("date").cast(DateType()))

# withColumn sets the column name ("weekdate"); an extra .alias() on the expression has no effect
df_selected_with_dates = df_selected_with_dates.withColumn("weekdate", weekday_udf("date"))
df_selected_with_dates.show(10)

+---+--------------+--------------------+-----+--------+---------+---------+-------------+------+----------+---------+
| id|       country|                name|alias|latitude|longitude|     city|   crime_type|profit|      date| weekdate|
+---+--------------+--------------------+-----+--------+---------+---------+-------------+------+----------+---------+
| 67|   Netherlands|    Amina van Ochten|     |  52.545|   4.9964|Amsterdam|      robbery|   986|2020-08-13| Thursday|
|218|       Germany|Dr. Raissa Benthi...|     | 50.3391|   8.6749|Frankfurt|        theft|   212|2020-10-20|  Tuesday|
|189|        France|     Eugène de Costa|     | 48.9425|   2.5928|    Paris|        theft|   232|2020-01-09| Thursday|
| 34|   Netherlands|Yasmine Zaal-Lang...|     | 52.4738|   4.6771|Amsterdam|        theft|   248|2020-04-20|   Monday|
|161|       Germany|    Alfons Dörr MBA.|     | 50.0601|   8.7972|Frankfurt| weapons sale|165000|2020-05-06|Wednesday|
| 40|United Kingdom|        Carolyn Reid|     | 

In [43]:
crime_type_big_sales

'weapons sale'

In [45]:
# Show top 5 salesmen in the selected country, excluding Sunday deals
# (the day-of-week column created above is named 'weekdate'; filtering on
# 'weekday' raises an AnalysisException because no such column exists)
df_final = df_selected_with_dates.where("""country = '{}' 
                                            and alias = '' 
                                            and crime_type = '{}'
                                            and weekdate != 'Sunday'
                                       """.format(top_country, crime_type_big_sales))\
                                 .orderBy("profit", ascending=False)

df_final.show(5)


In [None]:
moriarty_name =  df_final.select("name").collect()[0][0]
print("The name Moriarty is hiding behind: {}".format(moriarty_name))