# THE GOAL

The goal is to use the intel (the data in the supplied files) gathered by the police, Interpol, and undercover agents on Europe's criminals to identify the name behind which Moriarty is hiding.


# SOLUTION

# PART 1
-Watson, just like our great-grandfathers, we are again after Moriarty.

We need to catch him... or maybe her. All we know is that someone is masterminding unlawful activities and planning something bad. The Interpol agents, with the help of my boys, collected information that should give us the clues to determine the name Moriarty is hiding behind and make the arrest.

-The data is in the csv and text files and contains info on criminal activity in the last year, as well as high-profile and suspicious sales. The files were sent over by our colleagues from the neighboring countries (France, Germany, and the Netherlands) and by our own MI6 in the United Kingdom.

-The first task is to combine the data into one table. I requested the name, the alias, and the last known whereabouts of each criminal, as latitude and longitude, but since the data comes from all around Europe, the column names may differ between files.

-I am thinking that adding the country to the data might be helpful in our future analysis.

-Lastly, from my correspondence with our undercover agents, all the activity seems to be happening around major financial centers. If those are not in the data, I suppose you can work out the city names from the latitude and longitude. And a map, of course, unless your knowledge of Europe's geography is exceptional.


Data tasks outline:
1. Read the data from the files (named 'criminals_' plus country name) into separate dataframes and add the country name as 'country' column.
2. Identify the city around which the criminals operate and add it to the dataframe as 'city' column.
3. Concatenate the dataframes into a single dataframe with the four original columns renamed to: [name, alias, latitude, longitude]
4. Fill NAs in aliases with an empty string.
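
Steps 1, 3, and 4 can be sketched with a rename mapping instead of overwriting the column list by position. This is only an illustration under stated assumptions: the three inline strings stand in for the real files (their rows are copied from the tables shown in this notebook), and `COLUMN_MAP` and `load_criminals` are hypothetical names that do not come from the supplied data.

```python
import io

import pandas as pd

# Stand-ins for three of the country files (assumption: same layout as the real csvs).
uk_csv = "id,name,alias,latitude,longitude\n8,Keith Kelly,Happy,51.4393,-0.1421\n"
german_csv = "id,benennen,aliasnamen,breitengrad,länge\n28,Adelgunde Henschel B.Eng.,,50.2047,8.7456\n"
french_csv = "id,nom,pseudonyme,latitude,longitude\n0,Henriette Thomas du Peron,,48.9072,2.2521\n"

# Hypothetical mapping from the localized headers to the common ones.
COLUMN_MAP = {
    "nom": "name", "pseudonyme": "alias",
    "benennen": "name", "aliasnamen": "alias",
    "breitengrad": "latitude", "länge": "longitude",
}

def load_criminals(csv_text, country):
    """Read one country's data, harmonize the headers, and tag the country."""
    df = pd.read_csv(io.StringIO(csv_text))
    df = df.rename(columns=COLUMN_MAP)  # headers not in the map are left unchanged
    df["country"] = country
    return df

combined = pd.concat(
    [
        load_criminals(uk_csv, "United Kingdom"),
        load_criminals(german_csv, "Germany"),
        load_criminals(french_csv, "France"),
    ],
    ignore_index=True,
)
combined["alias"] = combined["alias"].fillna("")
print(list(combined.columns))
```

Renaming by name rather than by position keeps the result correct even if a file lists its columns in a different order.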


In [1]:
from datetime import datetime
import random
import pandas as pd
import numpy as np

In [2]:
#sample one of the csvs
country_ = "France"
file_name = "./data/criminals_{}.csv".format(country_)
df_country = pd.read_csv(file_name, index_col=False)
print(df_country.columns)
df_country.head(2)

Index(['id', 'nom', 'pseudonyme', 'latitude', 'longitude'], dtype='object')


Unnamed: 0,id,nom,pseudonyme,latitude,longitude
0,0,Henriette Thomas du Peron,,48.9072,2.2521
1,1,Marianne Francois,,48.7158,2.167


In [3]:
#explore the dataframes: column names, shapes and combine into a single dataframe
country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/criminals_{}.csv".format(country_)
    df = pd.read_csv(file_name, index_col=False)
    print(list(df.columns), df.shape)
    df.columns = ["id", "name", "alias", "latitude", "longitude"]
    df["country"] = country_
    dfs_dict[country_] = df  # add data frame to the dict for a future union
print("Len dfs_dict: {}".format(len(dfs_dict)))

# combine(concatenate/union) into a single dataframe
df_criminals_combined = pd.concat(dfs_dict.values())
print("Combined shape: {}".format(df_criminals_combined.shape))
df_criminals_combined.head(3)

['id', 'name', 'alias', 'latitude', 'longitude'] (306, 5)
['id', 'benennen', 'aliasnamen', 'breitengrad', 'länge'] (264, 5)
['id', 'benennen', 'aliasnamen', 'breitengrad', 'länge'] (250, 5)
['id', 'nom', 'pseudonyme', 'latitude', 'longitude'] (349, 5)
Len dfs_dict: 4
Combined shape: (1169, 6)


Unnamed: 0,id,name,alias,latitude,longitude,country
0,0,Ms. Diane Barnett,,51.3327,-0.0328,United Kingdom
1,1,Elizabeth McDonald,,51.3732,-0.0396,United Kingdom
2,2,Jacqueline Martin-Winter,,51.3536,-0.223,United Kingdom


In [4]:
# calculate mean latitude and longitude to identify the major financial centers (cities)
# (copy and paste the lat, lon values into Google Maps)
for country_ in country_list:
    test_df = df_criminals_combined.loc[df_criminals_combined.country == country_]
    print("Country: {}, (lat, lon): {}, {}".format(country_, 
                                                   round(test_df.latitude.mean(), 4), 
                                                   round(test_df.longitude.mean(), 4)))
    print(40 * "*")

Country: United Kingdom, (lat, lon): 51.5046, -0.124
****************************************
Country: Germany, (lat, lon): 50.0971, 8.679
****************************************
Country: Netherlands, (lat, lon): 52.3753, 4.901
****************************************
Country: France, (lat, lon): 48.8606, 2.3646
****************************************
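
Instead of pasting the mean coordinates into a map, the lookup can be approximated in code. The sketch below assumes a small hand-typed table of approximate city-center coordinates (these values are not part of the supplied data) and picks the closest one by great-circle distance; `FINANCIAL_CENTERS`, `haversine_km`, and `nearest_city` are hypothetical names.

```python
import math

# Approximate city-center coordinates typed in by hand (assumed values, not from the data files).
FINANCIAL_CENTERS = {
    "London": (51.5074, -0.1278),
    "Frankfurt": (50.1109, 8.6821),
    "Amsterdam": (52.3676, 4.9041),
    "Paris": (48.8566, 2.3522),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest_city(lat, lon, centers=FINANCIAL_CENTERS):
    """Return the name of the center closest to the given point."""
    return min(centers, key=lambda city: haversine_km(lat, lon, *centers[city]))

# Mean coordinates per country, copied from the output above.
mean_positions = {
    "United Kingdom": (51.5046, -0.124),
    "Germany": (50.0971, 8.679),
    "Netherlands": (52.3753, 4.901),
    "France": (48.8606, 2.3646),
}
for country, (lat, lon) in mean_positions.items():
    print(country, "->", nearest_city(lat, lon))
```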


In [5]:
country_list

['United Kingdom', 'Germany', 'Netherlands', 'France']

In [6]:
# add the city name to the df

# this could be done with a series of if/else statements
# (e.g. if country_ == "France": city = "Paris"), or with a dictionary as below:
country_city_dict = {"United Kingdom": "London", "Germany": "Frankfurt", "Netherlands": "Amsterdam", "France": "Paris"}
country_city_dict



{'United Kingdom': 'London',
 'Germany': 'Frankfurt',
 'Netherlands': 'Amsterdam',
 'France': 'Paris'}

In [7]:
# assign city to the country
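def assign_city(df, country_city_dict):
    """Add a 'city' column by looking up each row's country in country_city_dict."""
    # map() builds a string column directly; pre-filling with np.nan creates a float
    # column, and newer pandas versions warn when strings are then written into it
    df["city"] = df["country"].map(country_city_dict)

    return df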

df_with_city = assign_city(df_criminals_combined, country_city_dict)
df_with_city.head(10)

Unnamed: 0,id,name,alias,latitude,longitude,country,city
0,0,Ms. Diane Barnett,,51.3327,-0.0328,United Kingdom,London
1,1,Elizabeth McDonald,,51.3732,-0.0396,United Kingdom,London
2,2,Jacqueline Martin-Winter,,51.3536,-0.223,United Kingdom,London
3,3,Roger Farmer,,51.2891,-0.208,United Kingdom,London
4,4,Mrs. Georgina Harrison,,51.6004,0.0054,United Kingdom,London
5,5,Peter Stevens,,51.6441,0.0188,United Kingdom,London
6,6,Georgina Bell,,51.5304,-0.0927,United Kingdom,London
7,7,Miss Lesley Sullivan,,51.7303,-0.2607,United Kingdom,London
8,8,Keith Kelly,Happy,51.4393,-0.1421,United Kingdom,London
9,9,Shane Bailey,,51.2735,-0.3407,United Kingdom,London


In [8]:
# Fillna in alias.
df_with_city = df_with_city.fillna({"alias": ""})
print("Df shape: {}".format(df_with_city.shape[0]))
df_with_city.sort_values("name").head(5)

Df shape: 1169


Unnamed: 0,id,name,alias,latitude,longitude,country,city
247,247,Abbie Bond,,51.7279,-0.2436,United Kingdom,London
63,63,Abel Greij,,52.603,5.06,Netherlands,Amsterdam
150,150,Adam van de Pol-Konings,,52.5674,4.8518,Netherlands,Amsterdam
28,28,Adelgunde Henschel B.Eng.,,50.2047,8.7456,Germany,Frankfurt
261,261,Adrian West,,51.4574,0.0411,United Kingdom,London


# PART 2

- Great, Watson! 
- Now we need to know what every one of those suspects did wrong, that is, the crime type, and ideally how much they profited from it: Moriarty is not a small fish.

- You'll need to add the crime type and the profit from the files to the table you already put together. Be mindful of the file types. I also believe that the separator in these files may be different from the one in the files you used previously.
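
A space-separated file can still carry multi-word values if they are quoted, and `pd.read_csv` honors double quotes by default. The sketch below is only a guess at the layout: the quoting in `sample_txt` is an assumption about the text files, not something confirmed by the data, and the two rows are copied from the tables in this notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt: quoting of multi-word values is an assumption about the file layout.
sample_txt = (
    'name crime_type profit\n'
    '"Ms. Diane Barnett" theft 284\n'
    '"Odette Renard du Michaud" "weapons sale" 498000\n'
)

# sep=" " splits on single spaces; quoted fields are kept together.
df_sample = pd.read_csv(io.StringIO(sample_txt), sep=" ")
print(df_sample.shape)
print(df_sample.loc[1, "crime_type"])
```

If the shape printed for a real file shows more columns than expected, the separator or the quoting assumption is the first thing to check.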


# Solution for PART 2

In [9]:
df = pd.read_csv("./data/crime_type_profit_France.txt", index_col=False, sep=" ")
print("Columns: ", list(df.columns))

Columns:  ['name', 'crime_type', 'profit']


In [10]:
# union (concatenate) the crime type and profit files for all countries

country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/crime_type_profit_{}.txt".format(country_)
    df = pd.read_csv(file_name, index_col=False, sep=" ")
    print(list(df.columns), df.shape)
    df["country"] = country_
    dfs_dict[country_] = df
print("Len dfs_dict: {}".format(len(dfs_dict)))

#combine all dataframes into one
df_crime_type_profit = pd.concat(dfs_dict.values())
print(list(df_crime_type_profit.columns))

df_crime_type_profit.head(10)

['name', 'crime_type', 'profit'] (306, 3)
['name', 'crime_type', 'profit'] (264, 3)
['name', 'crime_type', 'profit'] (250, 3)
['name', 'crime_type', 'profit'] (349, 3)
Len dfs_dict: 4
['name', 'crime_type', 'profit', 'country']


Unnamed: 0,name,crime_type,profit,country
0,Ms. Diane Barnett,theft,284,United Kingdom
1,Elizabeth McDonald,theft,59,United Kingdom
2,Jacqueline Martin-Winter,forgery,150,United Kingdom
3,Roger Farmer,theft,378,United Kingdom
4,Mrs. Georgina Harrison,theft,55,United Kingdom
5,Peter Stevens,robbery,868,United Kingdom
6,Georgina Bell,theft,365,United Kingdom
7,Miss Lesley Sullivan,forgery,320,United Kingdom
8,Keith Kelly,theft,399,United Kingdom
9,Shane Bailey,forgery,495,United Kingdom


In [11]:
# check that names are unique: the count after dropping duplicates should equal the number of rows
df = df_with_city[["name"]].drop_duplicates()
df.shape[0]

1169

In [12]:
# join main criminal info with crime type and profit
df_city_profit = pd.merge(df_with_city, df_crime_type_profit, on=["name","country"], how="left")
print("Df shape: {}".format(df_city_profit.shape[0]))
print(df_city_profit.columns)
df_city_profit.sort_values('profit', ascending = False).head(4)

Df shape: 1169
Index(['id', 'name', 'alias', 'latitude', 'longitude', 'country', 'city',
       'crime_type', 'profit'],
      dtype='object')


Unnamed: 0,id,name,alias,latitude,longitude,country,city,crime_type,profit
1168,302,Odette Renard du Michaud,,48.7832,2.259,France,Paris,weapons sale,498000
58,58,Anthony Mitchell,,51.421,0.1152,United Kingdom,London,weapons sale,495000
1126,307,Gabriel Le Schneider,,48.8161,2.3073,France,Paris,weapons sale,493000
62,62,Malcolm Cox-Mason,Handlebars,51.5569,0.0905,United Kingdom,London,weapons sale,491000
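
The join can also check itself. The sketch below uses two options of `pd.merge`: `validate="one_to_one"` raises an error if a key is duplicated on either side, and `indicator=True` adds a `_merge` column that shows which rows found a match. The two small frames are illustrative stand-ins built from rows shown earlier; `people` and `crimes` are hypothetical names.

```python
import pandas as pd

# Tiny stand-ins for df_with_city and df_crime_type_profit (rows copied from the tables above).
people = pd.DataFrame({
    "name": ["Keith Kelly", "Peter Stevens"],
    "country": ["United Kingdom", "United Kingdom"],
})
crimes = pd.DataFrame({
    "name": ["Keith Kelly", "Peter Stevens"],
    "country": ["United Kingdom", "United Kingdom"],
    "crime_type": ["theft", "robbery"],
    "profit": [399, 868],
})

merged = pd.merge(
    people,
    crimes,
    on=["name", "country"],
    how="left",
    validate="one_to_one",  # fail loudly if a (name, country) key repeats
    indicator=True,         # adds the _merge column: both / left_only / right_only
)
unmatched = merged.loc[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```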


In [13]:
#investigate crime types
df_city_profit["crime_type"].value_counts()

theft            467
pickpocketing    237
robbery          198
forgery          116
drug sale         93
weapons sale      58
Name: crime_type, dtype: int64

Determine the crime type with the highest total profit

In [14]:
df_by_profit = df_city_profit.groupby(["crime_type"])\
                        .agg({"profit": "sum"})\
                        .sort_values("profit", ascending=False)\
                        .reset_index()
df_by_profit

Unnamed: 0,crime_type,profit
0,weapons sale,14942000
1,drug sale,2214270
2,robbery,96582
3,theft,95702
4,forgery,37863
5,pickpocketing,6359


In [15]:
crime_type_big_sales = df_by_profit["crime_type"][0]
crime_type_big_sales

'weapons sale'
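
The sort-and-take-first pattern can be shortened with `idxmax`, which returns the group label with the largest sum. A minimal sketch on four rows copied from the tables above (the `sample` frame is an illustrative stand-in, so its totals are not the real ones):

```python
import pandas as pd

# Four rows copied from earlier outputs; not the full data set.
sample = pd.DataFrame({
    "name": ["Odette Renard du Michaud", "Anthony Mitchell", "Peter Stevens", "Ms. Diane Barnett"],
    "crime_type": ["weapons sale", "weapons sale", "robbery", "theft"],
    "profit": [498000, 495000, 868, 284],
})

profit_by_type = sample.groupby("crime_type")["profit"].sum()
top_crime_type = profit_by_type.idxmax()  # label of the largest total
print(top_crime_type, profit_by_type.max())
```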

Identify the country where the most profitable crime type brings in the highest total profit

In [16]:
countries_crime_type_profit_df = df_city_profit.loc[df_city_profit["crime_type"] == crime_type_big_sales]\
                    .groupby(["country"])\
                    .agg({"profit": "sum"})\
                    .sort_values('profit', ascending=False)\
                    .reset_index()
countries_crime_type_profit_df

Unnamed: 0,country,profit
0,France,6312000
1,United Kingdom,3914000
2,Germany,2365000
3,Netherlands,2351000


In [17]:
top_country = countries_crime_type_profit_df.country.tolist()[0]
top_country

'France'

In [18]:
df_crime_type_alias_null = df_city_profit.loc[(df_city_profit["country"] == top_country)  & 
                                           (df_city_profit.alias == "") &
                                             (df_city_profit["crime_type"] == crime_type_big_sales)]
df_crime_type_alias_null.sort_values("profit", ascending=False).head(5)


Unnamed: 0,id,name,alias,latitude,longitude,country,city,crime_type,profit
1168,302,Odette Renard du Michaud,,48.7832,2.259,France,Paris,weapons sale,498000
1126,307,Gabriel Le Schneider,,48.8161,2.3073,France,Paris,weapons sale,493000
991,171,Constance du Laurent,,48.8806,2.2083,France,Paris,weapons sale,453000
1020,200,Valentine Meunier,,48.822,2.5017,France,Paris,weapons sale,435000
839,19,René Tessier du Lagarde,,48.6504,2.3543,France,Paris,weapons sale,423000


# PART 3
-Watson, I think we got the last piece of the puzzle! 

I learned that Moriarty doesn't do his dealings on Sunday. 

That means that the top seller (in the country with the top sales in the last year) who didn't sell on a Sunday and who doesn't have an alias will be him.

All we have to do now is add the date information I just got and determine the weekday for that date. We already know the rest.

And we'll send Lestrade right after him!

In [19]:
id_dates = pd.read_csv("./data/id_dates.csv", index_col=False)
print("id_dates shape: {}".format(id_dates.shape[0]))
id_dates.head(4)

id_dates shape: 1169


Unnamed: 0,id,date,country
0,0,2020-06-15,France
1,1,2020-01-06,France
2,2,2020-08-03,France
3,3,2020-06-19,France


In [20]:
df_selected_with_dates = pd.merge(df_crime_type_alias_null, id_dates, on=["id", "country"], how="left")
print(df_selected_with_dates.shape[0])

20


In [21]:
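# pd.to_datetime parses the strings; astype("datetime64") without a unit is rejected by newer pandas versions
df_selected_with_dates["date"] = pd.to_datetime(df_selected_with_dates["date"])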
df_selected_with_dates.dtypes

id                     int64
name                  object
alias                 object
latitude             float64
longitude            float64
country               object
city                  object
crime_type            object
profit                 int64
date          datetime64[ns]
dtype: object

In [22]:
def weekday(date):
    """Return the day name (e.g. "Monday") for a date given as a string or a datetime object."""

    if isinstance(date, str):
        # datetime is already imported in cell In [1]
        date = datetime.strptime(date, "%Y-%m-%d")  # change the format if necessary

    return date.strftime("%A")

df_selected_with_dates["weekday"]= df_selected_with_dates["date"].apply(weekday)
df_selected_with_dates.sort_values("profit", ascending = False).head(4)

Unnamed: 0,id,name,alias,latitude,longitude,country,city,crime_type,profit,date,weekday
19,302,Odette Renard du Michaud,,48.7832,2.259,France,Paris,weapons sale,498000,2020-01-29,Wednesday
17,307,Gabriel Le Schneider,,48.8161,2.3073,France,Paris,weapons sale,493000,2020-07-05,Sunday
12,171,Constance du Laurent,,48.8806,2.2083,France,Paris,weapons sale,453000,2020-06-22,Monday
13,200,Valentine Meunier,,48.822,2.5017,France,Paris,weapons sale,435000,2020-05-03,Sunday
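
The day names produced by `strftime("%A")` depend on the active locale, so comparing against the string "Sunday" can fail on a machine set to another language. A locale-independent sketch using the integer weekday, on the four rows shown above (`sales` and `is_sunday` are hypothetical names):

```python
from datetime import date

# (name, profit, date) copied from the table above.
sales = [
    ("Odette Renard du Michaud", 498000, "2020-01-29"),
    ("Gabriel Le Schneider", 493000, "2020-07-05"),
    ("Constance du Laurent", 453000, "2020-06-22"),
    ("Valentine Meunier", 435000, "2020-05-03"),
]

def is_sunday(iso_date):
    """True if the ISO-formatted date (YYYY-MM-DD) falls on a Sunday."""
    return date.fromisoformat(iso_date).weekday() == 6  # Monday is 0, Sunday is 6

not_sunday = [row for row in sales if not is_sunday(row[2])]
top = max(not_sunday, key=lambda row: row[1])  # highest profit among the remaining rows
print(top[0])
```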


In [23]:
print("Shape of df selected: {}".format(df_selected_with_dates.shape[0]))

Shape of df selected: 20


In [24]:
df_selected_not_sunday = df_selected_with_dates.loc[df_selected_with_dates.weekday != "Sunday"]
df_selected_not_sunday = df_selected_not_sunday.sort_values("profit", ascending = False).reset_index()
print(df_selected_not_sunday.shape[0])
df_selected_not_sunday.head(5)

14


Unnamed: 0,index,id,name,alias,latitude,longitude,country,city,crime_type,profit,date,weekday
0,19,302,Odette Renard du Michaud,,48.7832,2.259,France,Paris,weapons sale,498000,2020-01-29,Wednesday
1,12,171,Constance du Laurent,,48.8806,2.2083,France,Paris,weapons sale,453000,2020-06-22,Monday
2,0,19,René Tessier du Lagarde,,48.6504,2.3543,France,Paris,weapons sale,423000,2020-07-04,Saturday
3,18,343,Zoé Guibert de la Levy,,48.6893,2.3624,France,Paris,weapons sale,364000,2020-04-06,Monday
4,9,135,Denis Lesage,,48.6713,2.5139,France,Paris,weapons sale,328000,2020-06-01,Monday


In [25]:
print("The name Moriarty is hiding behind: {}".format(df_selected_not_sunday.name.iloc[0]))

The name Moriarty is hiding behind: Odette Renard du Michaud
