## Datlinq data assessment
The first step is to import the the packages I'll be using. I'll start with Pandas for importing the data and initial cleaning of the data. Various parts of Scikit-Learn will become important in the next steps of processing the data.

In [None]:
import numpy as np
import pandas as pd
from src.functions import import_json_to_df

### Data preparation and Exploration
The first step is to import the data into Pandas to prepare it for processing. The function `import_json_to_df` uses `json_normalize` to flatten nested dicts inside the JSON into new columns. An unfortunate side-effect is that the columns are stored as object types when they should be numeric. This import is included in a function in a separate file and imported to keep this file cleaner.

In [None]:
facebook_path = "data-sample/facebook-rotterdam-20170131.json"
factual_path = "data-sample/factual-rotterdam-20170207.csv"
google_path = "data-sample/google-rotterdam-20170207.json"

df_facebook = import_json_to_df(facebook_path)
df_factual = pd.read_csv(factual_path)
df_google = import_json_to_df(google_path)

Changing columns with NaN values to numeric has the unintended consequence that long integer values are changed to float and subject to floating-point errors. Skipping this step for now.

In [None]:
# Some features were tagged as a numeric type and could be converted
#fb_numeric_columns = df_facebook.columns[df_facebook.columns.str.contains('numberLong')]
#df_facebook[fb_numeric_columns] = \
#    df_facebook[fb_numeric_columns].apply(pd.to_numeric)

Not all of the data represents entities located in Rotterdam, NL. For now, this doesn't have much affect on the data (less than 10 out of 14516 entries), but could be important later.

In [3]:
df_facebook[(df_facebook['location_country'] 
             != 'Netherlands') & (df_facebook['location_country'].notnull())][['name', 'location_country']]

Looking at the `.info()` for each set can give some insight into what data is available - how many features have `NaN` values, and how much of the data is populated. Some of the numerical data could be filled in, and a separate feature could be used to track which values were originally `NaN`. Or the data can be filtered to only include available values. Columns with no data can safely be dropped (preferrably in the import stage).

In [None]:
df_facebook.info(max_cols=150)

In [None]:
df_factual.info()

In [None]:
df_google.info()

Only the factual data set claims to have duplicate records. These often appear to be locations in a chain, so different instances of the same business and not necessarily duplicated information. None of the records are complete duplicates. In this sense, all the data sets contain duplicate values (a search using `df_google[df_google['name'].str.contains('AMRO')]` confirms this).

In [None]:
df_factual[df_factual['name'].str.contains('Zeetuin')] 

### Geolocation Data
Each of the data sets contains geolocation data (latitude and longitude), this provides a good point for crossreferencing data. We can start by using the location to find what else is located nearby. The first step is plotting locations. The packages `matplotlib` and `sklearn` will be useful here.

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

%matplotlib inline

The first step is plotting the location latitude and longitude for each item in the data (choosing the Facebook data because it's smallest). Using an alpha value of .4 lets us see where the points are stacked on top of each other. In this case, the majority appear in a cluster to the right. Points outside this cluster are likely wrong. One example is for a place called 'Soulbrothers' that claims to be in Rotterdam, NL, but shows coordinates for Las Vegas, NV. Some cleaning of the data would be required to resolve this.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10, 6)
loc_scatter = ax.scatter(df_facebook['location_longitude'], df_facebook['location_latitude'], alpha=.4)

plt.show()

The data from Google has better coordinate data. The scale of the plot has a tighter range, and the clustering around Rotterdam is much more visible. There are fewer points that need to be fixed or explained before use.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10, 6)
loc_scatter = ax.scatter(df_google['geometry_location_lng'], df_google['geometry_location_lat'], alpha=.4)

plt.show()

Ideally, any points outside a range that defines the boundaries of Rotterdam, NL, would be identified and corrected or simply eliminated from the data. The remaining points could then be mapped onto a map of Rotterdam and used to identify popular areas, or identify similar locations based on similarity in their characteristics (location, opening hours, type of business, etc). It can be seen that the shapes become recognizable once the outliers are removed.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10, 6)
loc_scatter = ax.scatter(
    df_google['geometry_location_lng'].iloc[:50000], 
    df_google['geometry_location_lat'].iloc[:50000], 
    alpha=.4)

plt.show()

Clustering can make the data representation more efficient and do a better job of showing overlap in denser locations. It can be tempting to use k-means to group locations around k centroids on the map. However, k-means algorithms are not very robust for geo-spatial data. DBSCAN from sklearn is considered a better option.

In [None]:
coords = df_google.as_matrix(columns=['geometry_location_lat', 'geometry_location_lat'])
loc_clusters = DBSCAN(eps=.0001, min_samples=10, n_jobs=-1).fit(coords)
unique_labels = set(loc_clusters.labels_)
len(unique_labels)

The algorithm has created 46 unique groupings based on the spatial data. Borrowing heavily from the [SkLearn demo](http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html) we could use these labels as a mask for plotting the locations. This method has an additional advantage of being able to remove some of the noise mentioned previously. Individual points far outside the main body of the points won't register. It also decides the appropriate number of clusters from the paramaters and can run multiple jobs at once, reducing the time for processing.

### Text Analysis

Spacy allows for sophisticated, robust, and fast text analysis and natural language processing. Spacy offers a number of [tutorials](https://spacy.io/docs/usage/) on basic usage that cover the preliminary steps of things like entity recognition. The standard models are available for English, German, and French. Work is being done on models for the Dutch language, but Spacy also allows for the training of custom models for use in a pipeline.

In [None]:
import spacy
nlp = spacy.load('en') # this assumes `python -m spacy download en` has been run after installing Spacy
doc = nlp(df_facebook['description'].iloc[14474])

Once in a document is loaded in, the model is applied and many of the features of the model are immediately trained and available for use. This requires the suitable model for the text being used, but should work as well on a small text as it does on a larger text. The results are only as good as the quality of the model, but the ability to use custom models allows for specialization to types of text an new languages.

In [None]:
df_facebook['description'].iloc[14474]

In [None]:
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)

The model also provides entity recognition that can be trained to recognize new entities relatively easily. As can be seen here, the model does need some additional training. It reads the name Elise as an organization, and reads 'THE' as the lead into organizations as well.

In [None]:
for word in doc:
    print(word.text, word.ent_iob, word.ent_type_)

Spacy also uses pretrained, 300-dimensional vectors using the GloVe algorithm. Vectors are available for individual tokens and for whole documents. If the standard vectors are not suitable, it is easy to import a new set using custom training on a corpus.

In [None]:
print(doc.vector)
print(doc[3].vector)

Using word embedding vectors allows for similarity comparisons. Here we can see that 'better' (3rd token from the end of the document) is more similar to 'good' (12th from the end) than it is to bad (6th from the end). These similarities can also be found for the whole texts in our data. This would allow for finding similar text descriptions within the cluster based on locality described above to find locations with in an area that have similar descriptions. If a company found their product to be successful at a location that highlights its party atmosphere and connection to the nightlife, this could be a way for them to find similar locations within an area that already has a number of locations near each other.

In [None]:
print(doc[-3].similarity(doc[-12]))
print(doc[-3].similarity(doc[-6]))

Spacy has been designed to work well with other packages (including SKLearn). Other algorithms such as TfIdf could be imported and used to find important terms from descriptions. The ability to train and import new models means that you can start with the pretrained models, create a working pipeline, and then continue to improve it by adding more accurate models. These models can then be combined with the clustering algorithms described previously to find characterisitcs of areas based on proximity or other values. These could be further defined by filtering locations based on on other features such as opening times or specializations such as 'coffee' or 'breakfast'. These levels of specification could allow for a finer grained approach to finding new locations for products. It would also be flexible enough to take into account descriptions provided by the locations themselves, or their customers, and compare them to product descriptions or characterisitcs seen as desireable by the owners of the products (ie, if the company wants to focus on a particcular kind of location).

### Apply NLP to descriptions

#### Use vector embedding to analyze document similarity

Use SpaCy to create an `nlp` object for each text description in the Facebook dataframe. Then use the similarity method on one of the objects to compare it to the others (the default for SpaCy is cosine similarity). First, convert all the text in records that have a text description (about 30 seconds for 6588 texts using `nlp.pipe()`). Then save several of the NLP objects as references for our search. Finally, calculate which other descriptions are most similar to the reference documents.

In [None]:
from src.functions import apply_nlp_to_column, find_other_documents, spacy_docs_to_df_tfidf, top_n_doc_tokens

In [None]:
df_facebook['description_nlp'] = apply_nlp_to_column(df_facebook, 'description')

In [None]:
nlp_elise = nlp(df_facebook['description'].iloc[14474])
nlp_iweek = nlp(df_facebook['description'].iloc[14466])
nlp_mangrove = nlp(df_facebook['description'].iloc[5])

In [None]:
sim_elise = find_other_documents(nlp_elise, df_facebook)
sim_iweek = find_other_documents(nlp_iweek, df_facebook)
sim_mangrove = find_other_documents(nlp_mangrove, df_facebook)

Validating the results can be difficult. We can print out the text for each reference item and see that they are each getting a unique set of related descriptions. If necessary, the cosine similarity could also be printed, to see the expected level of similarity. The real problem with this implementation is that the models are are trained for English language texts. As a result, I have chosen only English descriptions as reference documents. The results show how important this training is - all the results were also written in English.

In [None]:
print(df_facebook['name'].iloc[14474])
print(df_facebook['description'].iloc[14474])
print(sim_elise)

In [None]:
print(df_facebook['name'].iloc[14466])
print(df_facebook['description'].iloc[14466])
print(sim_iweek)

In [None]:
print(df_facebook['name'].iloc[5])
print(df_facebook['description'].iloc[5])
print(sim_mangrove)

##### Investment Week Comparison
Here is the description for *Investment Week* for comparison with the description of *Robeco Asset Management*. We can see similar themes of finance and advice. These businesses are clearly in related industries.

In [None]:
print(df_facebook['name'].iloc[14466])
print(df_facebook['description'].iloc[14466])

In [None]:
print(df_facebook['name'].iloc[13504])
print(df_facebook['description'].iloc[13504])

##### Mangrove Comparison
The text for *Mangrove* is very short, and may even be too short for proper comparison. The description for *Ranj* has been returned as the most similar text. *Ranj* appears to be primarily a gaming company, however both texts emphasize the impact of their companies and the focus they have on developing solutions.

In [None]:
print(df_facebook['name'].iloc[5])
print(df_facebook['description'].iloc[5])

In [None]:
print(df_facebook['name'].iloc[8162])
print(df_facebook['description'].iloc[8162])

#### Use vector matrices to find key words in documents

In [None]:
df_tfidf, tfidf_vocabulary = spacy_docs_to_df_tfidf(df_facebook)
top_n_doc_tokens(doc_idx=5, df_tfidf=df_tfidf, vocab=tfidf_vocabulary, max_tokens=10)

#### Use Support Vector Classifier (SVC) to predict categories from Facebook descriptions