# THE GOAL

The goal is to use the intel (data in the supplied files) from the police, Interpol, and undercover agents about Europe's criminals to identify the name behind which Moriarty is hiding.


# SOLUTION

# PART 1
-Watson, just like our great-grandfathers, we are again after Moriarty.

We need to catch him... or maybe her. All we know is that someone is masterminding unlawful activities and planning something bad. The Interpol agents, with the help of my boys, collected information that should give us the clues to determine the name Moriarty is hiding behind, and to arrest him.

-The data is in the csv and text files and contains info on criminal activity in the last year as well as high-profile and suspicious sales. The files were sent over by our colleagues from the neighboring countries (France, Germany, and the Netherlands) and by our own MI6 in the United Kingdom.

-The first task is to combine the data into one table. I requested the name, alias, and last known whereabouts of each criminal, as latitude and longitude, but since the data comes from all around Europe the column names may differ between files.

-I am thinking that adding the country to the data might be helpful in our future analysis.

-Lastly, from my correspondence with our undercover agents, all the activity seems to be happening around major financial centers. If those are not in the data, I suppose you can work out the city names from the latitude and longitude. And a map, of course, unless your knowledge of Europe's geography is exceptional.


Data tasks outline:
1. Read the data from the files (named 'criminals_' plus the country name) into separate dataframes and add the country name as a 'country' column.
2. Identify the city around which the criminals operate and add it to the dataframe as a 'city' column.
3. Concatenate the dataframes into a single dataframe with the four original columns renamed to: [name, alias, latitude, longitude].
4. Fill NAs in the alias column with an empty string.
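
Since the column names may differ between files, one option is to rename columns through an explicit mapping rather than by position. The sketch below is only an illustration: the header variants in `COLUMN_ALIASES`, the helper `harmonize`, and the two toy rows are all made up, because the real headers are only known once the files are read.

```python
import pandas as pd

# Hypothetical header variants; the real files may use different names.
COLUMN_ALIASES = {
    "nom": "name",
    "naam": "name",
    "lat": "latitude",
    "lon": "longitude",
    "lng": "longitude",
}


def harmonize(df: pd.DataFrame, country: str) -> pd.DataFrame:
    """Return a copy with standard column names, a 'country' column and no NA aliases."""

    def standard_name(col: str) -> str:
        key = col.strip().lower()
        return COLUMN_ALIASES.get(key, key)

    out = df.rename(columns=standard_name)  # rename() returns a new dataframe
    out["country"] = country
    out["alias"] = out["alias"].fillna("")
    return out


# Toy frames standing in for two country files (made-up rows).
fr = pd.DataFrame({"Nom": ["A. Dupont"], "Alias": [None], "Lat": [48.86], "Lon": [2.35]})
uk = pd.DataFrame({"name": ["J. Smith"], "alias": ["Fox"], "latitude": [51.51], "longitude": [-0.13]})

combined = pd.concat(
    [harmonize(fr, "France"), harmonize(uk, "United Kingdom")],
    ignore_index=True,
)
print(combined)
```

Renaming by name fails loudly (a `KeyError` on `alias`) when a file has an unexpected header, whereas assigning a list of names by position would silently mislabel columns if their order differed.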


In [None]:
from datetime import datetime
import random
import pandas as pd
import numpy as np

In [None]:
#sample one of the csvs
country_ = "France"
file_name = "./data/criminals_{}.csv".format(country_)
df_country = pd.read_csv(file_name, index_col=False)
print(df_country.columns)
df_country.head(2)

In [None]:
#explore the dataframes: column names, shapes and combine into a single dataframe
country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/criminals_{}.csv".format(country_)
    df = pd.read_csv(file_name, index_col=False)
    print(list(df.columns), df.shape)
    df.columns = ["id", "name", "alias", "latitude", "longitude"]
    df["country"] = country_
    dfs_dict[country_] = df  # add data frame to the dict for a future union
print("Len dfs_dict: {}".format(len(dfs_dict)))

# combine(concatenate/union) into a single dataframe
df_criminals_combined = pd.concat(dfs_dict.values())
print("Combined shape: {}".format(df_criminals_combined.shape))
df_criminals_combined.head(3)

In [None]:
# calculate mean latitude and longitude to identify the major financial centers (cities)
# (copy and paste the lat, lon values into Google Maps)
for country_ in country_list:
    test_df = df_criminals_combined.loc[df_criminals_combined.country == country_]
    print("Country: {}, (lat, lon): {}, {}".format(country_, 
                                                   round(test_df.latitude.mean(), 4), 
                                                   round(test_df.longitude.mean(), 4)))
    print(40 * "*")
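
As an alternative to pasting the mean coordinates into a map by hand, the nearest financial center can be looked up programmatically. This is a minimal sketch, not part of the original solution: the candidate list `FINANCIAL_CENTERS` (with approximate city-centre coordinates) and the helpers `haversine_km` and `nearest_center` are assumptions introduced for illustration.

```python
import math

# Approximate city-centre coordinates; the choice of candidate cities is an assumption.
FINANCIAL_CENTERS = {
    "London": (51.5074, -0.1278),
    "Frankfurt": (50.1109, 8.6821),
    "Amsterdam": (52.3676, 4.9041),
    "Paris": (48.8566, 2.3522),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_center(lat, lon, centers=FINANCIAL_CENTERS):
    """Return the name of the candidate city closest to (lat, lon)."""
    return min(centers, key=lambda city: haversine_km(lat, lon, *centers[city]))


print(nearest_center(48.9, 2.3))  # → Paris
```

Applied to the mean latitude and longitude of each country, this would give the same kind of answer as the manual lookup, provided the true center is among the candidates.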

In [None]:
country_list

In [None]:
# add the city name to the df

# this can be done with a series of if/else statements, such as `if country_ == "France": city = "Paris"`, etc.,
# or with a dictionary, as below:
country_city_dict = {"United Kingdom": "London", "Germany": "Frankfurt", "Netherlands": "Amsterdam", "France": "Paris"}
country_city_dict



In [None]:
# assign city to the country
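def assign_city(df, country_city_dict):
    """Add a 'city' column by mapping each row's country to its financial center."""
    df = df.copy()  # avoid mutating the caller's dataframe
    # map() leaves NaN where a country is missing from the dictionary
    df["city"] = df["country"].map(country_city_dict)
    return df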

df_with_city = assign_city(df_criminals_combined, country_city_dict)
df_with_city.head(10)

In [None]:
# Fillna in alias.
df_with_city = df_with_city.fillna({"alias": ""})
print("Number of rows: {}".format(df_with_city.shape[0]))
df_with_city.sort_values("name").head(5)

# PART 2

- Great, Watson!
- Now we need to know what every one of those suspects did wrong, that is, the crime type, and ideally how much they profited from it: Moriarty is not a small fish.

- You'll need to add the crime type and the profit from the files to the table you already put together. Be mindful of the file types. I also believe that the separator in these files may be different from the one in the files you used previously.
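
When the separator is not known in advance, pandas can try to detect it: passing `sep=None` together with `engine="python"` makes `read_csv` sniff the delimiter from the first line using Python's `csv.Sniffer`. The sketch below uses a made-up in-memory sample in place of one of the `.txt` files; its column names and rows are assumptions.

```python
import io

import pandas as pd

# Made-up sample standing in for one of the .txt files; real columns may differ.
sample = io.StringIO("name crime_type profit\nAlice fraud 100\nBob theft 50\n")

# sep=None requires the Python engine; the delimiter is guessed from the header line.
df_sniffed = pd.read_csv(sample, sep=None, engine="python")
print(list(df_sniffed.columns), df_sniffed.shape)
```

Sniffing is a convenience for exploration and can guess wrong on unusual files, so once the separator is known it is safer to pass it explicitly, for example `sep=" "`.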


# Solution for PART 2

In [None]:
df = pd.read_csv("./data/crime_type_profit_France.txt", index_col=False, sep=" ")
print("Columns: ", list(df.columns))

In [None]:
# union (concatenate) the crime type and profit files for all countries

country_list = ["United Kingdom", "Germany", "Netherlands", "France"]
dfs_dict = {}
for country_ in country_list:
    file_name = "./data/crime_type_profit_{}.txt".format(country_)
    df = pd.read_csv(file_name, index_col=False, sep=" ")
    print(list(df.columns), df.shape)
    df["country"] = country_
    dfs_dict[country_] = df
print("Len dfs_dict: {}".format(len(dfs_dict)))

#combine all dataframes into one
df_crime_type_profit = pd.concat(dfs_dict.values())
print(list(df_crime_type_profit.columns))

df_crime_type_profit.head(10)

In [None]:
# drop duplicates 
df = df_with_city[["name"]].drop_duplicates()
df.shape[0]

In [None]:
# join main criminal info with crime type and profit
df_city_profit = pd.merge(df_with_city, df_crime_type_profit, on=["name","country"], how="left")
print("Number of rows: {}".format(df_city_profit.shape[0]))
print(df_city_profit.columns)
df_city_profit.sort_values('profit', ascending = False).head(4)
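
Because the join above is a left merge, suspects without a matching crime record end up with NaN in `crime_type` and `profit`, and duplicate keys would silently multiply rows. A hedged way to check both is sketched here on toy data (the names, crime types and profits are invented), using the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy tables standing in for df_with_city and df_crime_type_profit (made-up rows).
suspects = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "country": ["France", "France", "Germany"],
})
crimes = pd.DataFrame({
    "name": ["Alice", "Carol"],
    "country": ["France", "Germany"],
    "crime_type": ["fraud", "smuggling"],
    "profit": [100, 250],
})

merged = pd.merge(
    suspects,
    crimes,
    on=["name", "country"],
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)  # → ['Bob']
```

The `one_to_one` check assumes each (name, country) pair appears once in both tables; if a suspect can have several crime records, `validate="one_to_many"` would be the appropriate setting.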

In [None]:
#investigate crime types
df_city_profit["crime_type"].value_counts()

Determine the crime type with the most sales (highest total profit)

In [None]:
df_by_profit = df_city_profit.groupby(["crime_type"])\
                        .agg({"profit": "sum"})\
                        .sort_values("profit", ascending=False)\
                        .reset_index()
df_by_profit

In [None]:
crime_type_big_sales = df_by_profit["crime_type"][0]
crime_type_big_sales

Identify the country where the crime type with the biggest sales brings in the most profit

In [None]:
countries_crime_type_profit_df = df_city_profit.loc[df_city_profit["crime_type"] == crime_type_big_sales]\
                    .groupby(["country"])\
                    .agg({"profit": "sum"})\
                    .sort_values('profit', ascending=False)\
                    .reset_index()
countries_crime_type_profit_df

In [None]:
top_country = countries_crime_type_profit_df.country.tolist()[0]
top_country

In [None]:
df_crime_type_alias_null = df_city_profit.loc[(df_city_profit["country"] == top_country)  & 
                                           (df_city_profit.alias == "") &
                                             (df_city_profit["crime_type"] == crime_type_big_sales)]
df_crime_type_alias_null.sort_values("profit", ascending=False).head(5)


# PART 3
-Watson, I think we got the last piece of the puzzle! 

I learned that Moriarty doesn't do his dealings on Sunday. 

That means that the top seller (in the country with the top sales in the last year) who didn't sell on a Sunday and who doesn't have an alias will be him.

All we have to do now is add the date information I just got and determine the weekday for that date. We already know the rest.

And we'll send Lestrade right after him!
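
The weekday step can also be done without a row-by-row function, using pandas' vectorized datetime accessor. This is a sketch on made-up dates, and it assumes the dates are stored as `YYYY-MM-DD` strings; the real format in `id_dates.csv` may differ.

```python
import pandas as pd

# Made-up rows; the last one is deliberately unparseable.
dates = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2023-01-01", "2023-01-02", "not a date"],
})

# errors="coerce" turns unparseable values into NaT instead of raising
dates["date"] = pd.to_datetime(dates["date"], format="%Y-%m-%d", errors="coerce")
dates["weekday"] = dates["date"].dt.day_name()  # NaT becomes NaN

not_sunday = dates.loc[dates["weekday"] != "Sunday"]
print(not_sunday)
```

Note that rows with a missing date are kept by the `!= "Sunday"` filter, since NaN never equals "Sunday"; if a missing date should rule a suspect out, add `dates["weekday"].notna()` to the condition.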

In [None]:
id_dates = pd.read_csv("./data/id_dates.csv", index_col=False)
print("id_dates shape: {}".format(id_dates.shape[0]))
id_dates.head(4)

In [None]:
df_selected_with_dates = pd.merge(df_crime_type_alias_null, id_dates, on=["id", "country"], how="left")
print(df_selected_with_dates.shape[0])

In [None]:
# pd.to_datetime parses the strings; a unitless astype("datetime64") raises in recent pandas versions
df_selected_with_dates["date"] = pd.to_datetime(df_selected_with_dates["date"])
df_selected_with_dates.dtypes

In [None]:
def weekday(date):
    """Return the day of the week for a date given as a string or a datetime object."""
    if isinstance(date, str):
        # datetime is already imported at the top of the notebook
        date = datetime.strptime(date, "%Y-%m-%d")  # change the format if necessary
    return date.strftime("%A")

df_selected_with_dates["weekday"] = df_selected_with_dates["date"].apply(weekday)
df_selected_with_dates.sort_values("profit", ascending = False).head(4)

In [None]:
print("Shape of df selected: {}".format(df_selected_with_dates.shape[0]))

In [None]:
df_selected_not_sunday = df_selected_with_dates.loc[df_selected_with_dates.weekday != "Sunday"]
# drop=True discards the old index instead of adding it as an extra 'index' column
df_selected_not_sunday = df_selected_not_sunday.sort_values("profit", ascending=False).reset_index(drop=True)
print(df_selected_not_sunday.shape[0])
df_selected_not_sunday.head(5)

In [None]:
print("The name Moriarty is hiding behind: {}".format(df_selected_not_sunday.name.iloc[0]))