## Datlinq data assessment
The first step is to import the packages I'll be using. I'll start with Pandas for importing the data and doing the initial cleaning. Various parts of Scikit-Learn will become important in the later processing steps.

In [12]:
import numpy as np
import pandas as pd
from src.functions import import_json_to_df

### Data Preparation and Exploration
The next step is to import the data into Pandas to prepare it for processing. The function `import_json_to_df` uses `json_normalize` to flatten nested dicts inside the JSON into new columns. An unfortunate side effect is that columns which should be numeric are stored as `object` types. The import logic lives in a function in a separate file, which keeps this file cleaner.
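
To illustrate the flattening, here is a minimal sketch on a few made-up records. The records, the `sep="_"` choice, and the name `df_demo` are assumptions for illustration; the actual `import_json_to_df` in `src/functions` may work differently.

```python
import pandas as pd

# Hypothetical records with nested dicts; not taken from the real data files.
records = [
    {
        "name": "Cafe A",
        "location": {"country": "Netherlands", "city": "Rotterdam"},
        "__timestamp": {"$numberLong": "1485859200000"},
    },
    {"name": "Cafe B", "location": {"country": "Netherlands"}},
]

# sep="_" joins nested keys with an underscore,
# so {"location": {"country": ...}} becomes the column "location_country".
df_demo = pd.json_normalize(records, sep="_")

print(sorted(df_demo.columns))
# The timestamp arrives as a string (or NaN), so the column is not numeric.
print(df_demo.dtypes)
```

Keys missing from a record, such as `location_city` for the second entry, simply become `NaN` in the flattened frame.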

In [None]:
facebook_path = "data-sample/facebook-rotterdam-20170131.json"
factual_path = "data-sample/factual-rotterdam-20170207.csv"
google_path = "data-sample/google-rotterdam-20170207.json"

df_facebook = import_json_to_df(facebook_path)
df_factual = pd.read_csv(factual_path)
df_google = import_json_to_df(google_path)

Converting columns that contain `NaN` values to numeric has the unintended consequence that long integer values become floats and are subject to floating-point errors. I'm skipping this step for now.

In [16]:
# Some features were tagged as a numeric type and could be converted
#fb_numeric_columns = df_facebook.columns[df_facebook.columns.str.contains('numberLong')]
#df_facebook[fb_numeric_columns] = \
#    df_facebook[fb_numeric_columns].apply(pd.to_numeric)
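
The precision problem can be reproduced on a tiny example. This is a sketch with a made-up value, not data from the Facebook set; the nullable `Int64` dtype shown at the end is one possible workaround, not something this notebook currently uses.

```python
import pandas as pd

# Hypothetical column: one large integer stored as a string, plus a missing entry.
big = 9007199254740993  # 2**53 + 1, not exactly representable as float64
raw = pd.Series([str(big), None], dtype=object)

# The missing value forces float64, so the large integer gets rounded.
as_float = pd.to_numeric(raw)
print(as_float.dtype, int(as_float.iloc[0]) == big)

# Pandas' nullable integer dtype keeps the exact value alongside a missing marker.
as_nullable = pd.array(
    [int(v) if v is not None else None for v in raw], dtype="Int64"
)
print(as_nullable.dtype, int(as_nullable[0]) == big)
```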

Not all of the data represents entities located in Rotterdam, NL. For now, this doesn't have much effect on the data (fewer than 10 out of 14516 entries), but it could be important later.

In [14]:
# Rows with a known country other than the Netherlands
outside_nl = (df_facebook['location_country'].notnull()
              & (df_facebook['location_country'] != 'Netherlands'))
df_facebook.loc[outside_nl, ['name', 'location_country']]


Unnamed: 0,name,location_country
4024,"Rotterdam, Limpopo, South Africa",South Africa
7650,Recovery Sports Grill,United States
7677,Bricklayers Pub and Pizzeria,United States
7715,Rotterdam Mall Cinema,United States
8299,"Rotterdam, Eastern Cape, South Africa",South Africa
8309,Rotterdam (New York),United States
9894,Rotterdam Elks Lodge #2157,United States
12541,Entertainment Express,United States
13493,Planet Fitness,United States


Looking at the `.info()` output for each set gives some insight into what data is available: how many features have `NaN` values, and how much of the data is populated. Some of the numerical data could be filled in, with a separate feature tracking which values were originally `NaN`. Alternatively, the data can be filtered to include only available values. Columns with no data can safely be dropped (preferably in the import stage).
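
As a rough sketch of the fill-and-flag idea, using a made-up frame: the column names and the choice of the median as the fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one partly populated column and one completely empty column.
df_demo = pd.DataFrame({
    "checkins": [10.0, np.nan, 30.0],
    "po_box": [np.nan, np.nan, np.nan],
})

# Drop columns that contain no data at all.
df_demo = df_demo.dropna(axis="columns", how="all")

# Record which values were missing before filling them in.
df_demo["checkins_was_nan"] = df_demo["checkins"].isna()
df_demo["checkins"] = df_demo["checkins"].fillna(df_demo["checkins"].median())

print(df_demo)
```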

In [15]:
df_facebook.info(max_cols=150)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14516 entries, 0 to 14515
Data columns (total 135 columns):
__timestamp_$numberLong                         14516 non-null object
__timestamp_basic_$numberLong                   14498 non-null object
__timestamp_detail_$numberLong                  14478 non-null object
__timestamp_insights_$numberLong                8059 non-null object
__timestamp_photos_$numberLong                  8059 non-null object
about                                           10329 non-null object
affiliation                                     41 non-null object
artists_we_like                                 2 non-null object
attire                                          159 non-null object
awards                                          608 non-null object
band_interests                                  1 non-null object
band_members                                    1 non-null object
best_page_id                                    1046 non-null object
be

In [17]:
df_factual.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53828 entries, 0 to 53827
Data columns (total 30 columns):
factual_id            53828 non-null object
name                  53828 non-null object
address               51429 non-null object
address_extended      11 non-null object
po_box                0 non-null float64
locality              53828 non-null object
region                53828 non-null object
post_town             0 non-null float64
admin_region          0 non-null float64
postcode              50253 non-null object
country               53828 non-null object
tel                   50529 non-null object
fax                   9401 non-null object
latitude              53828 non-null float64
longitude             53828 non-null float64
neighborhood          24950 non-null object
website               23815 non-null object
email                 1019 non-null object
category_ids          47589 non-null object
category_labels       47589 non-null object
chain_name            0

In [18]:
df_google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91242 entries, 0 to 91241
Data columns (total 32 columns):
address_components                 91242 non-null object
adr_address                        91242 non-null object
formatted_address                  91242 non-null object
formatted_phone_number             66517 non-null object
geometry_location_lat              91242 non-null float64
geometry_location_lng              91242 non-null float64
geometry_viewport_northeast_lat    76375 non-null float64
geometry_viewport_northeast_lng    76375 non-null float64
geometry_viewport_southwest_lat    76375 non-null float64
geometry_viewport_southwest_lng    76375 non-null float64
icon                               91242 non-null object
id                                 91242 non-null object
international_phone_number         66517 non-null object
loc_coordinates                    91242 non-null object
loc_type                           91242 non-null object
name                          

Only the Factual data set claims to have duplicate records. These often appear to be locations in a chain, so they are different branches of the same business rather than duplicated information. None of the records are complete duplicates. In this looser sense of repeated names, all the data sets contain duplicate values (a search using `df_google[df_google['name'].str.contains('AMRO')]` confirms this).

In [19]:
df_factual[df_factual['name'].str.contains('Zeetuin')] 

Unnamed: 0,factual_id,name,address,address_extended,po_box,locality,region,post_town,admin_region,postcode,...,chain_name,chain_id,hours,hours_display,existence,_org_filename,_org_filedate,_imported,__main_category_id,__hash
37721,f033fad0-94f3-4f9e-b8f0-beaa60f93fbf,Zeetuin Kinderdagverblijf De,Statenweg 123/A,,,Rotterdam,Zuid-Holland,,,3039 HK,...,,,,,0.6,nl_places.factual.v3_49.1484024731.tab,10/01/2017 06:05:31,31/01/2017 13:03:44,23,641a864ad64a8ec8e25206725ff3be5c
50373,7f048f80-7986-4ffd-80ba-8a07c2aac163,Zeetuin Kinderdagverblijf De,Schiewg 147d-149a,,,Rotterdam,Zuid-Holland,,,3038 AN,...,,,,,0.6,nl_places.factual.v3_49.1484024731.tab,10/01/2017 06:05:31,31/01/2017 13:03:44,23,a3f98044255d2eba1276a1a1fa34e857
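
The difference between complete duplicates and repeated names can be checked with `DataFrame.duplicated`. The frame below is made up for illustration; the same calls should apply to `df_factual`, assuming its columns hold hashable values.

```python
import pandas as pd

# Hypothetical frame: two branches share a name but differ in address.
df_demo = pd.DataFrame({
    "name": ["Branch Bank", "Branch Bank", "Corner Shop"],
    "address": ["Street 1", "Street 2", "Street 3"],
})

# Rows that are identical across every column.
full_dupes = df_demo.duplicated().sum()

# Rows whose name occurs more than once (keep=False marks every occurrence).
name_dupes = df_demo.duplicated(subset="name", keep=False).sum()

print(full_dupes, name_dupes)
```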


### Geolocation Data
Each of the data sets contains geolocation data (latitude and longitude), which provides a good basis for cross-referencing the data. We can start by using the location to find what else is located nearby. The first step is plotting locations.
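
Alongside plotting, "nearby" can be made concrete with a great-circle distance. The sketch below uses the haversine formula with NumPy; the coordinates, the 1 km threshold, and the function name `haversine_km` are assumptions for illustration, not part of the assessment data.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical points: two in central Rotterdam, one in Amsterdam.
lats = np.array([51.9225, 51.9244, 52.3676])
lons = np.array([4.4792, 4.4777, 4.9041])

# Distance from the first point to every point, then a 1 km "nearby" mask.
dist = haversine_km(lats[0], lons[0], lats, lons)
nearby = dist < 1.0

print(dist.round(2), nearby)
```

For data sets of this size, a pairwise comparison across sources would be expensive, so a spatial index would likely be a better fit than this brute-force approach.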