# Exploring the *RMS Titanic* sinking in Neo4j

The Titanic dataset is well known in the data science and analytics community. This notebook outlines the initial preprocessing steps that prepare the data for import into a property graph database such as Neo4j. Once the data is in Neo4j, the relationships between entities such as passengers, lifeboats and destinations can be analyzed and visualized directly. Graph data science exposes the context around each record in a way that tabular data does not easily support.

## Setting up resources
1) Configure environment using Anaconda or virtualenv.

2) Deploy a local Neo4j Docker instance:
```bash
cd neo4j-titanic \
&& docker build -t neo4j-titanic:neo4j_db ./neo4j \
&& docker run --name neo4j_db -d -p 7474:7474 -p 7473:7473 -p 7687:7687 \
-v $PWD/data/interim:/var/lib/neo4j/import neo4j-titanic:neo4j_db
```

3) (Optional) If geoparsing is desired, the Mordecai Python package requires an Elasticsearch service running on the expected port with the GeoNames index loaded. This can be set up from the command line:
```bash
docker pull elasticsearch:5.5.2 \
&& wget https://s3.amazonaws.com/ahalterman-geo/geonames_index.tar.gz --output-file=wget_log.txt \
&& tar -xzf geonames_index.tar.gz \
&& docker run -d -p 127.0.0.1:9200:9200 -v "$(pwd)/es/geonames_index/:/usr/share/elasticsearch/data" elasticsearch:5.5.2
```

## Running the pipeline
The ```src.preprocess``` module contains functions for cleaning, feature engineering and batch geoparsing with NLP analysis. 

## Import Data

In [100]:
import sys
# Set paths for modules
sys.path[0] = '../'

import pandas as pd
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

import py2neo

# import pipeline
from src.preprocess import clean_data, remap_abbrev

In [101]:
# Define paths --> move to .env
URL = 'https://query.data.world/s/xjk6hp7t7w3553bfpkfshr2bjd67a4'
RAW_PATH = sys.path[0] + 'data/raw/titanic.csv'
INTERIM_PATH = sys.path[0] + 'data/interim/titanic_clean.csv'
GEOPARSED_PATH = sys.path[0] + 'data/processed/titanic_final.csv'

# import
data = pd.read_csv(URL)

## Preprocessing

Before building a property graph, it helps to correct errors, fill NaN values with meaningful placeholders, and rewrite values for readability. Creating new feature columns up front also simplifies node creation when the graph is built. Preprocessing uses the ```clean_data``` function.

### Cleaning Steps:
* ***embarked*** - Fills NaN values and replaces port letter codes with place names for readability.
* ***home.dest*** - Fills NaN with 'Unspecified Destination' and replaces abbreviations with full names.
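
The cleaning steps above can be sketched in plain Python. This is not the project's `clean_data` implementation: the helper names `clean_embarked` and `clean_home_dest`, the fallback port and the small abbreviation table are all assumptions made for illustration.

```python
import re

# Port codes used in the Titanic dataset's `embarked` column.
PORTS = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Illustrative subset only (assumption); the project's remap_abbrev may cover more.
ABBREVIATIONS = {"MO": "Missouri", "PQ": "Quebec", "ON": "Ontario", "NY": "New York"}


def clean_embarked(value, default="Southampton"):
    """Map a port code to its name; NaN, None or unknown codes become `default`.

    The default is a placeholder choice (assumption), not the author's rule.
    """
    if isinstance(value, str) and value.strip() in PORTS:
        return PORTS[value.strip()]
    return default


def clean_home_dest(value):
    """Fill missing destinations and expand standalone two-letter abbreviations."""
    if not isinstance(value, str) or not value.strip():
        return "Unspecified Destination"
    # \b keeps 'ON' from matching inside longer words such as 'LONDON'.
    return re.sub(
        r"\b[A-Z]{2}\b",
        lambda m: ABBREVIATIONS.get(m.group(0), m.group(0)),
        value,
    )


print(clean_home_dest("Montreal, PQ / Chesterville, ON"))
```

With pandas, helpers like these could be applied column by column, for example `data['embarked'].map(clean_embarked)`.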

### Feature Engineering Steps:
* ***family.size*** - Combines *sibsp* and *parch* to give the total family size, including the passenger.
* ***surname*** - Extracts *surname* from *name*. This makes it much easier to define family relationships.
* ***deck*** - Extracts the *deck* from *cabin* so that it can become a node.
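
As a rough illustration of the three engineered features, the sketch below derives each one from a single row's values. The function names are hypothetical and the logic is inferred from the sample output in this notebook (for example, *sibsp* = 1 and *parch* = 2 giving *family.size* = 4); the actual `clean_data` code may differ.

```python
def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children plus the passenger themself."""
    return sibsp + parch + 1


def extract_surname(name):
    """Names look like 'Allison, Master. Hudson Trevor'; the surname precedes the first comma."""
    return name.split(",", 1)[0].strip()


def extract_deck(cabin):
    """Return the deck letter of the first cabin listed, or None when cabin is missing."""
    if not isinstance(cabin, str) or not cabin.strip():
        return None
    return cabin.strip()[0]


print(extract_surname("Allison, Master. Hudson Trevor"), extract_deck("C22 C26"), family_size(1, 2))
```

In pandas these could be applied with, for instance, `data['surname'] = data['name'].map(extract_surname)`.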

In [102]:
# Clean data and save
data = clean_data(data)
data.to_csv(INTERIM_PATH, index=False)

In [103]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,family.size,surname,deck
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,Southampton,2.0,,"St Louis, MO",1.0,Allen,B
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,Southampton,11.0,,"Montreal, PQ / Chesterville, ON",4.0,Allison,C
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,,"Montreal, PQ / Chesterville, ON",4.0,Allison,C
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,135.0,"Montreal, PQ / Chesterville, ON",4.0,Allison,C
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,,"Montreal, PQ / Chesterville, ON",4.0,Allison,C


## Geoparsing

To get more out of the data, it would be useful to create nodes from the destination countries of passengers. This requires extracting the country from the unstructured text in *home.dest* using Natural Language Processing (NLP).

The `geoparse_data` function uses the Mordecai package to extract geopolitical entities from unstructured text. Applying this to the *home.dest* column returns ISO country codes, and the pycountry package converts these into country names for the *home.country* nodes. These steps can be viewed in detail in the *0.1-process-data* notebook.
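
The code-to-name step can be sketched as a simple lookup. This is an illustration only: it assumes the geoparser yields three-letter ISO 3166-1 alpha-3 codes, and both the `country_name` helper and the hand-written `SAMPLE_NAMES` table are hypothetical stand-ins for whatever `geoparse_data` does internally.

```python
def country_name(iso_code, names_by_code, default="Unknown"):
    """Look up a display name for an ISO country code; fall back to `default`."""
    if not isinstance(iso_code, str):
        return default  # covers NaN/None when no country was detected
    return names_by_code.get(iso_code.strip().upper(), default)


# Tiny hand-written table for illustration (assumption). With pycountry installed,
# a full table could instead be built as:
#   {c.alpha_3: c.name for c in pycountry.countries}
SAMPLE_NAMES = {"USA": "United States", "CAN": "Canada", "GBR": "United Kingdom"}

print(country_name("can", SAMPLE_NAMES))  # Canada
```

Passing the table in as an argument keeps the lookup independent of any particular package, which makes the fallback for undetected countries easy to check.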

In [104]:
# Replace home.dest abbreviations with full names for NLP step
data['home.dest'] = remap_abbrev(data['home.dest'])

In [105]:
# Run geoparser. This step can take some time so prepare to wait.
#data = geoparse_data(data[:5])

In [106]:
# Save to processed data folder. Careful not to overwrite accidentally.
#data.to_csv(GEOPARSED_PATH,index=False)

In [107]:
# Inspect
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,family.size,surname,deck
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,Southampton,2.0,,"St Louis, Missouri",1.0,Allen,B
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,Southampton,11.0,,"Montreal, Quebec / Chesterville, Ontario",4.0,Allison,C
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,,"Montreal, Quebec / Chesterville, Ontario",4.0,Allison,C
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,135.0,"Montreal, Quebec / Chesterville, Ontario",4.0,Allison,C
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,Southampton,,,"Montreal, Quebec / Chesterville, Ontario",4.0,Allison,C


## Load to Neo4j

There are several options for getting data into Neo4j, depending on the size of the import. The simplest way to load data into a Neo4j container is to run a query that reads a preprocessed CSV, either through py2neo or a shell command, although this may not be the quickest option. The fastest method is to use the `neo4j-admin import` tool, which works especially well with large datasets. The drawback is that it requires specially formatted CSV files, which means additional preprocessing steps.
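
To give a sense of the extra preprocessing that `neo4j-admin import` needs, the sketch below writes separate node and relationship files whose headers carry the `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE` fields the tool reads. The `Passenger` and `Port` labels, the `EMBARKED_AT` relationship type, the input keys and the function name are assumptions for illustration, not this project's schema.

```python
import csv
import io


def write_import_csvs(rows, passengers_out, ports_out, rels_out):
    """Write node and relationship CSVs with neo4j-admin import style headers."""
    passengers = csv.writer(passengers_out, lineterminator="\n")
    ports = csv.writer(ports_out, lineterminator="\n")
    rels = csv.writer(rels_out, lineterminator="\n")

    # Header rows: ID spaces in parentheses keep passenger and port IDs separate.
    passengers.writerow(["passengerId:ID(Passenger)", "name", ":LABEL"])
    ports.writerow(["name:ID(Port)", ":LABEL"])
    rels.writerow([":START_ID(Passenger)", ":END_ID(Port)", ":TYPE"])

    seen_ports = set()
    for row in rows:
        passengers.writerow([row["id"], row["name"], "Passenger"])
        if row["embarked"] not in seen_ports:  # one node per distinct port
            seen_ports.add(row["embarked"])
            ports.writerow([row["embarked"], "Port"])
        rels.writerow([row["id"], row["embarked"], "EMBARKED_AT"])


# Usage with in-memory buffers; real use would pass open file handles.
sample = [
    {"id": 0, "name": "Allen, Miss. Elisabeth Walton", "embarked": "Southampton"},
    {"id": 1, "name": "Allison, Master. Hudson Trevor", "embarked": "Southampton"},
]
p_buf, port_buf, r_buf = io.StringIO(), io.StringIO(), io.StringIO()
write_import_csvs(sample, p_buf, port_buf, r_buf)
print(r_buf.getvalue())
```

Each node type and relationship type gets its own file, which is the main difference from the single flat CSV read by a Cypher query.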

In [108]:
# Method 1: use shell to load
#!cat neo4j/create_db.cyp | docker exec --interactive neo4j_db bin/cypher-shell -u neo4j -p test

In [109]:
# Method 2: use py2neo
cypher_path = sys.path[0] + 'neo4j/create_db.cyp'

with open(cypher_path, 'r') as f:
    query = f.read()

# Connect to local running neo4j instance
graph = py2neo.Graph(host='127.0.0.1', password='test')

# Split the script into individual statements.
# Note: a plain split on ';' assumes no statement contains a literal semicolon.
queries = [q.strip() for q in query.split(';')]

# Run each non-empty statement
for cypher in queries:
    if cypher:
        graph.run(cypher)