# THE GOAL

The goal is to take on Watson's role and use the intel (the data in the supplied files) from the police, Interpol, and undercover agents about Europe's criminals to identify the name behind which Moriarty is hiding.


# SOLUTION

# PART 1
- Watson, just like our great-grandfathers, we are again after Moriarty.

We need to catch him. H-mmm... I need to be careful here: maybe it is not him, maybe it is her. All we know is that someone is masterminding unlawful activities and planning something bad. The Interpol agents, with the help of my boys, collected information that should provide us with the clues to determine the name Moriarty is hiding behind and arrest him.

- I have a number of .csv and .txt files about criminal activity and high-profile suspicious sales that were sent over from our neighbors, France, Germany, and the Netherlands, and from our own MI6 in the United Kingdom.

So, the first task would be to combine the data into one table. I requested info on the name, alias, and last known whereabouts (as latitude and longitude), but since the data comes from all around Europe, they might have named the columns differently.

I am thinking that adding the country to the data might be helpful in our future analysis.

Lastly, from my correspondence with our undercover agents, all the activity seems to be happening around major financial centers. If the city names are not in the data, I suppose you can work them out from the latitude and longitude. Mmmm... And a map, of course, unless your knowledge of Europe's geography is exceptional.





Tasks:
1. Read the data from each file into its own dataframe and add the country name ('country' column).
2. Identify the city around which the criminals operate. Add it to the dataframe ('city' column).
3. Concatenate the dataframes into a single dataframe, with the four original columns renamed to: [name, alias, latitude, longitude].
4. Fill NAs in aliases with an empty string.


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Holmes_Moriarty_SQL") \
    .getOrCreate()

In [None]:
%ls data

In [None]:
#sample one of the csvs
country_ = "France"
file_name = "./data/criminals_{}.csv".format(country_)
df_country = spark.read.csv(file_name, header=True, inferSchema=True)
print(df_country.columns)
df_country.show(5, False)

In [None]:
import pyspark.sql.functions as F

In [None]:
def rename_cols(df, new_col_names):
    """Rename the columns of df, in order, to new_col_names and return the new dataframe."""
    for col, new_col in zip(df.columns, new_col_names):
        df = df.withColumnRenamed(col, new_col)

    return df

#explore the dataframes: column names, shapes and combine into a single dataframe
country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/criminals_{}.csv".format(country_)
    df = spark.read.csv(file_name, header=True, inferSchema=True)
    print("Country: {}, rows: {}".format(country_, df.count()))
    new_col_names = ["id", "name", "alias", "latitude", "longitude"]
    df = rename_cols(df, new_col_names)
    df = df.withColumn('country', F.lit(country_))
    dfs_dict[country_] = df  # add data frame to the dict for a future union
print("Len dfs_dict: {}".format(len(dfs_dict)))



In [None]:
len(list(dfs_dict.values()))

In [None]:
# from https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspark
from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(dfs):
    """Stack a list of dataframes row-wise; columns are matched by position, not by name."""
    return reduce(DataFrame.unionAll, dfs)

df_criminals_combined = unionAll(list(dfs_dict.values()))
print("Rows in combined df: {}".format(df_criminals_combined.count()))

In [None]:
#alternatively, chain the unions one by one
dfs_list = list(dfs_dict.values())
df_combined = dfs_list[0].union(dfs_list[1]).union(dfs_list[2]).union(dfs_list[3])
print("Rows in combined df: {}".format(df_combined.count()))
df_combined.show(5, False)

In [None]:
# calculate mean latitude and longitude to identify the major financial centers (cities)
# (copy and paste the lat, lon values into Google Maps)
# syntax reminder: dataframe.filter(df['salary'] > 100000).agg({"age": "avg"})
for country_ in country_list:
    country_df = df_criminals_combined.where("country = '{}'".format(country_))
    lat = round(country_df.agg({"latitude": "avg"}).collect()[0][0], 4)
    lon = round(country_df.agg({"longitude": "avg"}).collect()[0][0], 4)
    print("Country: {}, (lat, lon): {}, {}".format(country_, lat, lon))
    print(40 * "*")
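
If a map is not at hand, the averaged coordinates printed by the loop above could also be matched to a city in code. The following is only a sketch in plain Python: the list of candidate cities and their approximate coordinates are assumptions introduced for illustration (they are not in the data files), as are the helper names `haversine_km` and `nearest_city` and the sample inputs.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate city-centre coordinates (lat, lon); illustrative values, not from the data files.
FINANCIAL_CENTERS = {
    "London": (51.5074, -0.1278),
    "Frankfurt": (50.1109, 8.6821),
    "Amsterdam": (52.3676, 4.9041),
    "Paris": (48.8566, 2.3522),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius


def nearest_city(lat, lon, centers=FINANCIAL_CENTERS):
    """Return the name of the candidate city closest to (lat, lon)."""
    return min(centers, key=lambda city: haversine_km(lat, lon, *centers[city]))


# Hypothetical averaged coordinates, standing in for the values printed per country.
print(nearest_city(48.9, 2.3))
print(nearest_city(50.0, 8.7))
```

This only picks the closest city from a hand-made shortlist, so it is a cross-check for the map lookup rather than a replacement for it.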

In [None]:
country_list

In [None]:
# add the city name to the df

# this can be done with a series of if/else statements, such as "if country_ == 'France': city = 'Paris'", etc.,
# or by keeping the country-to-city mapping in a dictionary, as below:
country_city_dict = {"United Kingdom": "London", "Germany": "Frankfurt", "Netherlands": "Amsterdam", "France": "Paris"}
country_city_dict


In [None]:
from pyspark.sql.functions import when
df_with_city = df_criminals_combined.withColumn('city', \
                                                 when(F.col('country')=='United Kingdom', 'London').\
                                                 when(F.col('country')=='France', 'Paris').\
                                                 when(F.col('country')=='Germany', 'Frankfurt').\
                                                 when(F.col('country')=='Netherlands', 'Amsterdam').\
                                                 otherwise(None))
df_with_city.show(10, False)

In [None]:
# Fillna in alias.
df_with_city = df_with_city.fillna({"alias": ""})
print("Df shape: {}".format(df_with_city.count()))
df_with_city.orderBy("name").show(5)

# PART 2
Add crime_type and profit info to the criminals table (merge/join the criminals table with the crime type and profit information).

- Great, Watson! 
- Now we need to know what every one of those suspects did wrong, that is, the crime type, and preferably how much they profited from it. Moriarty is not a small fish: he is in the category with the largest total sales.

- You'll need to add the crime type and the profit from the files to the table you already put together. Be mindful of the file types. I also believe that the separator in these files may be different from the one in the files you used previously.
- Moriarty made one of the top 5 sales last year. He is not foolish enough to use nicknames; I am pretty sure he doesn't have an alias.
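
Holmes's clues amount to a short filtering recipe: find the crime type with the largest total profit, keep the suspects in that category who have no alias, and rank them by profit. A minimal sketch of that logic in plain Python follows; the suspect rows are hypothetical and invented for illustration, not taken from the data files.

```python
from collections import defaultdict

# Hypothetical suspects as (name, alias, crime_type, profit); not from the data files.
suspects = [
    ("A. Smith", "", "fraud", 900),
    ("B. Jones", "Fox", "fraud", 1200),
    ("C. Brown", "", "smuggling", 300),
    ("D. White", "", "fraud", 700),
]

# 1. Total profit per crime type; Moriarty is in the largest category.
totals = defaultdict(int)
for _, _, crime_type, profit in suspects:
    totals[crime_type] += profit
top_crime_type = max(totals, key=totals.get)

# 2. Keep suspects in that category without an alias, largest profit first, top 5 only.
candidates = sorted(
    (s for s in suspects if s[2] == top_crime_type and s[1] == ""),
    key=lambda s: s[3],
    reverse=True,
)[:5]

print(top_crime_type, [name for name, *_ in candidates])
```

The same steps map onto dataframe operations: a group-by with a sum, a filter, and an ordering by profit.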


# Solution (part 2)

In [None]:
df = spark.read.csv("./data/crime_type_profit_France.txt", header=True, sep=" ")
print("Columns: ", list(df.columns))
df.show(5, False)

In [None]:
# union (concatenate) the crime type and profit files of all countries

country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/crime_type_profit_{}.txt".format(country_)
    df = spark.read.csv(file_name, header=True, sep=" ")
    print("rows: {}".format(df.count()))
    df = df.withColumn('country', F.lit(country_))
    dfs_dict[country_] = df
print("Len dfs_dict: {}".format(len(dfs_dict)))

#combine all dataframes into one
df_crime_type_profit = unionAll(list(dfs_dict.values()))
print(list(df_crime_type_profit.columns))

df_crime_type_profit.show(10)

In [None]:
# drop duplicates 
df_with_city = df_with_city.drop_duplicates(["name"])
df_with_city.count()


In [None]:
# join main criminal info with crime type and profit
df_city_profit = df_with_city.join(df_crime_type_profit, ["name","country"], "left")
print("Df shape: {}".format(df_city_profit.count()))
# print(df_city_profit.columns)
df_city_profit.orderBy('profit', ascending = False).show(10, False)

In [None]:
# profit column is not sorted properly. possibly the data type is the issue
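
A likely cause, assuming the profit values were read as strings (the .txt files were loaded without `inferSchema`), is that text sorts character by character rather than by numeric value. The toy values below are hypothetical and only illustrate the effect in plain Python.

```python
# Hypothetical profit values as they arrive from a text file read without schema inference.
profits_as_text = ["9", "100", "25"]

# Strings are compared character by character, so "9" ranks above "100".
text_order = sorted(profits_as_text, reverse=True)
print(text_order)

# After converting to integers the order is numeric.
numeric_order = sorted((int(p) for p in profits_as_text), reverse=True)
print(numeric_order)
```

Checking the schema and casting the column to a numeric type, as done in the next cells, addresses the same problem on the dataframe.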

In [None]:
df_city_profit.printSchema()

In [None]:
df_city_profit = df_city_profit.withColumn("profit", F.col("profit").cast("int"))
df_city_profit.printSchema()

In [None]:
# let's order by profit again...
df_city_profit.orderBy('profit', ascending = False).show(5, False)

In [None]:
#investigate crime types and get total sales for each
df_by_profit = df_city_profit.groupBy(["crime_type"]).\
                agg(F.sum("profit").alias("total_profit")).\
                orderBy("total_profit", ascending=False)

df_by_profit.show(10, False)

In [None]:
crime_type_big_sales = df_by_profit.select("crime_type").collect()[0][0]
crime_type_big_sales

In [None]:
print("crime_type = '{}'".format(crime_type_big_sales))

In [None]:
df_city_profit.columns

In [None]:
countries_crime_type_profit_df = df_city_profit.where("crime_type == '{}'".format(crime_type_big_sales))\
                    .groupBy(["country"])\
                    .agg(F.sum("profit").alias('total_profit'))\
                    .orderBy("total_profit", ascending=False)
    
countries_crime_type_profit_df.show(10, False)

In [None]:
top_country = countries_crime_type_profit_df.select("country").collect()[0][0]
top_country

In [None]:
df_city_profit.show(3)

In [None]:
# Show top 5 salesmen in the selected country
df_large_sales_alias_null = df_city_profit.where("country = '{}' and alias = '' and crime_type = '{}'".format(top_country, crime_type_big_sales))\
                                            .orderBy("profit", ascending = False)

df_large_sales_alias_null.show(5)

# PART 3

Add the date of each suspect's last deal. Moriarty does not deal on Sundays.

In [None]:
id_dates = spark.read.csv("./data/id_dates.csv", header=True, inferSchema=True)
print("id_dates shape: {}".format(id_dates.count()))
id_dates.show(4)

In [None]:
df_selected_with_dates = df_city_profit.join(id_dates, on=["id", "country"], how="left")
print(df_selected_with_dates.count())
df_selected_with_dates.show(3)

In [None]:
df_selected_with_dates.printSchema()

In [None]:
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import DateType, StringType


def weekday(date):
    """Return the day of the week (e.g. 'Sunday') for a date given as a string or a date/datetime object."""
    if date is None:  # rows without a matching date after the left join
        return None
    if isinstance(date, str):
        date = datetime.strptime(date, "%Y-%m-%d")  # change the format if necessary
    return date.strftime("%A")


weekday_udf = udf(weekday, StringType())

# conversion to DateType is not necessary as it is handled inside the function;
# here it is offered as an example of re-casting
df_selected_with_dates = df_selected_with_dates.withColumn("date", F.col("date").cast(DateType()))

df_selected_with_dates = df_selected_with_dates.withColumn("weekday", weekday_udf("date"))
df_selected_with_dates.show(10)
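
The day-of-week logic can be checked outside Spark before trusting it on the full table. The sketch below is a standalone plain-Python version; the name `weekday_name` and the sample dates are hypothetical, and it assumes dates in `YYYY-MM-DD` form and the default (English) locale for day names.

```python
from datetime import date, datetime


def weekday_name(value):
    """Day name for a 'YYYY-MM-DD' string or a date/datetime object; None stays None."""
    if value is None:
        return None
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d")
    return value.strftime("%A")


# Hypothetical deal dates: a string, a date object, and a missing value.
deal_dates = ["2019-03-08", date(2019, 3, 10), None]
names = [weekday_name(d) for d in deal_dates]
print(names)

# Keep only deals that are known not to have happened on a Sunday.
not_sunday = [d for d, n in zip(deal_dates, names) if n is not None and n != "Sunday"]
print(not_sunday)
```

A Python UDF is not the only option: Spark's built-in `F.date_format` with a day-name pattern such as `"EEEE"` should give a similar result without a UDF, though that route is not used in this notebook.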

In [None]:
crime_type_big_sales

In [None]:
# Narrow down the suspects: selected country, no alias, top crime type, no Sunday deals
df_final = df_selected_with_dates.where("""country = '{}' 
                                            and alias = '' 
                                            and crime_type = '{}'
                                            and weekday != 'Sunday'
                                       """.format(top_country, crime_type_big_sales))

df_final.show(5)

In [None]:
moriarty_name =  df_final.orderBy("profit", ascending=False).select("name").collect()[0][0]
print("The name Moriarty is hiding behind: {}".format(moriarty_name))