# Exploring the *RMS Titanic* sinking in Neo4j

The Titanic dataset is widely used in the data science and analytics community. This notebook outlines the initial preprocessing steps that prepare the data for import into a property graph database such as Neo4j. Once the data is in Neo4j, the relationships between entities such as passengers, lifeboats, and destinations can be analyzed and visualized. Graph data science can reveal context within data that is hard to see in tabular form.

## Setting up resources
The Mordecai Python package requires an Elasticsearch service running on the expected port (9200 in the command below) with the Geonames index loaded. This can be set up from the command line:
```bash
docker pull elasticsearch:5.5.2 \
&& wget https://s3.amazonaws.com/ahalterman-geo/geonames_index.tar.gz --output-file=wget_log.txt \
&& tar -xzf geonames_index.tar.gz \
&& docker run -d -p 127.0.0.1:9200:9200 -v "$(pwd)/es/geonames_index/:/usr/share/elasticsearch/data" elasticsearch:5.5.2
```
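
Before running the geoparsing step, it can be worth checking that something is actually listening on the Elasticsearch port. The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` is hypothetical, and the default host and port are taken from the `docker run` command above. It only tests TCP reachability and does not confirm that the Geonames index is loaded.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 9200, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Report whether the container's published port is reachable.
print("Elasticsearch port reachable:", port_is_open())
```

If this prints `False`, the container is probably not running yet, and calls that query the index are likely to fail with a connection error.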

## Running the pipeline
The ```src.pipeline``` module provides two functions: ```clean_data``` for cleaning and feature engineering, and ```geoparse_data``` for batch geoparsing with NLP.

## Import Data

In [2]:
import sys
# Set paths for modules
sys.path[0] = '../'

import pandas as pd
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

# import pipeline
from src.pipeline import clean_data, geoparse_data



In [3]:
# Define paths
url = 'https://query.data.world/s/xjk6hp7t7w3553bfpkfshr2bjd67a4'
raw = sys.path[0] + 'data/raw/titanic.csv'
clean = sys.path[0] + 'data/interim/titanic_clean.csv'
processed = sys.path[0] + 'data/processed/titanic_final.csv'

# import
data = pd.read_csv(url)

## Preprocessing

When building a property graph, it is useful to correct errors, fill NaN values with meaningful placeholders, and replace coded values with readable ones. Creating new feature columns up front can simplify node creation when the graph is built. Preprocessing uses the ```clean_data``` function.

### Cleaning Steps:
* ***embarked*** - Fill NaN values and replace the port letter codes with place names for readability
* ***home.dest*** - Fill NaN with 'Unspecified Destination'
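
The two cleaning steps above could look roughly like the following in pandas. This is a sketch, not the actual `clean_data` implementation: the function name `clean_sketch`, the letter-to-port mapping (the conventional S/C/Q codes of this dataset), and the choice to fill missing `embarked` values with the most common port are all assumptions.

```python
import pandas as pd

# Conventional port codes for this dataset (assumed mapping).
PORTS = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the cleaning part of clean_data."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    # Assumption: fill missing ports with the most frequent one.
    most_common = out["embarked"].mode().iloc[0]
    out["embarked"] = out["embarked"].fillna(most_common).map(PORTS)
    # Placeholder text as described in the cleaning steps.
    out["home.dest"] = out["home.dest"].fillna("Unspecified Destination")
    return out


# Small illustrative frame (values chosen for the example only).
sample = pd.DataFrame({
    "embarked": ["S", None, "C", "S"],
    "home.dest": ["St Louis, MO", None, "Paris, France", None],
})
print(clean_sketch(sample))
```

Working on a copy avoids modifying the caller's DataFrame in place, which also sidesteps pandas' chained-assignment warnings.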

### Feature Engineering Steps:
* ***family.size*** - Combines *sibsp* and *parch* to give the total family size, including the passenger
* ***surname*** - Extracts *surname* from *name*. This will make it much easier to define family relationships.
* ***deck*** - Extracts the *deck* from *cabin* in order to make this into a node.
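
The feature engineering steps above can be sketched in a few lines of pandas. The function name `add_features_sketch` is hypothetical and the parsing rules (surname is the text before the first comma, deck is the first character of the cabin string) are assumptions about how `clean_data` might work, inferred from the column descriptions.

```python
import pandas as pd


def add_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the feature engineering part of clean_data."""
    out = df.copy()
    # Siblings/spouses + parents/children + the passenger themselves.
    out["family.size"] = out["sibsp"] + out["parch"] + 1
    # Names look like "Surname, Title. Given names"; keep the part before the first comma.
    out["surname"] = out["name"].str.split(",").str[0].str.strip()
    # Cabins look like "C22 C26"; take the first character as the deck letter.
    out["deck"] = out["cabin"].str[0]
    return out


# Two rows taken from the dataset for illustration.
sample = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor"],
    "sibsp": [0, 1],
    "parch": [0, 2],
    "cabin": ["B5", "C22 C26"],
})
features = add_features_sketch(sample)
print(features[["family.size", "surname", "deck"]])
```

Taking only the first character of *cabin* means a passenger listed with several cabins is assigned the deck of the first one.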

In [4]:
# Clean data and save
data = clean_data(data)
data.to_csv(clean, index=False)

## Geoparsing

It would be interesting to create nodes from the passengers' destination countries. To do this, the country has to be extracted from the free-text *home.dest* column using Natural Language Processing (NLP).

The `geoparse_data` function uses the Mordecai package to extract geopolitical entities from unstructured text. Applying it to the *home.dest* column returns ISO country codes, and the pycountry package converts these into country names for the *home.country* nodes. These steps can be viewed in detail in the *0.1-process-data* notebook.

In [5]:
# Run geoparser. This step can take some time so prepare to wait.
data = geoparse_data(data[:5])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['home.dest'] = remap_abbrev(data['home.dest'])
GET http://localhost:9200/geonames/_count [status:N/A request:3.682s]
Traceback (most recent call last):
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (m

GET http://localhost:9200/geonames/_count [status:N/A request:0.005s]
Traceback (most recent call last):
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 114, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/Users/gregory/anaconda3/lib/python3.7/site-packages/urlli

IndexError: tuple index out of range

In [28]:
# Save to processed data folder. Careful not to overwrite accidentally.
#data.to_csv(processed,index=False)

In [32]:
# Inspect
data.head()

| Unnamed: 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | family.size | surname | deck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | Southampton | 2.0 |  | St Louis, MO | 1.0 | Allen | B |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | Southampton | 11.0 |  | Montreal, PQ / Chesterville, ON | 4.0 | Allison | C |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | Southampton |  |  | Montreal, PQ / Chesterville, ON | 4.0 | Allison | C |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | Southampton |  | 135.0 | Montreal, PQ / Chesterville, ON | 4.0 | Allison | C |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | Southampton |  |  | Montreal, PQ / Chesterville, ON | 4.0 | Allison | C |


## Conclusion
Once the data has been processed, it can be imported into Neo4j using the Cypher query language.
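
As a rough illustration of that last step, the sketch below turns processed rows into a parameter list for a single Cypher `UNWIND ... MERGE` statement. The node labels (`Passenger`, `Deck`), the relationship type `STAYED_ON`, and the helper name `deck_rows` are hypothetical and not taken from this project. The query is only built here, not executed.

```python
import pandas as pd

# Hypothetical graph model: (:Passenger)-[:STAYED_ON]->(:Deck)
QUERY = """
UNWIND $rows AS row
MERGE (p:Passenger {name: row.name})
SET p.survived = row.survived
MERGE (d:Deck {name: row.deck})
MERGE (p)-[:STAYED_ON]->(d)
"""


def deck_rows(df: pd.DataFrame) -> list:
    """Build the $rows parameter, skipping passengers without a known deck."""
    # MERGE cannot match on a null property, so rows with a missing deck are dropped.
    known = df.dropna(subset=["deck"])
    return known[["name", "survived", "deck"]].to_dict(orient="records")


# First row comes from the dataset; the second is made up to show the filtering.
sample = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth Walton", "Doe, Mr. John"],
    "survived": [1.0, 0.0],
    "deck": ["B", None],
})
rows = deck_rows(sample)
print(rows)
```

With the official `neo4j` Python driver and a running database, the statement could then be executed with something like `session.run(QUERY, rows=rows)`. Sending rows in one batch through `UNWIND` generally avoids a round trip per passenger.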