# Pre-processing of Lagain Mars craters for DaCHS DB ingestion

To ingest the data from the Lagain & Chiara data release (https://github.com/alagain/martian_crater_database) into DaCHS, which stores its data in a Postgres database, we must first work through the GeoJSON files provided. The goal is to arrive at a flat table.

The workflow is basically:
* read JSON
* unravel data into one or more tables
* check data
* write CSV
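
The four steps above can be sketched end to end on a tiny in-memory example. This is only an illustration under stated assumptions: `flatten_features` is a hypothetical helper name, the sample rows are made up, and the two checks shown are placeholders for whatever validation the real table needs.

```python
import io
import json
import pandas

# Tiny stand-in for lagain_db.json (illustrative values, not real catalogue rows)
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"CRATER_ID": "a-1", "RADIUS": 500.0, "X": 10.0, "Y": -5.0},
         "geometry": {"type": "Point", "coordinates": [10.0, -5.0]}},
        {"type": "Feature",
         "properties": {"CRATER_ID": "a-2", "RADIUS": 750.0, "X": 20.5, "Y": 3.25},
         "geometry": {"type": "Point", "coordinates": [20.5, 3.25]}},
    ],
})

def flatten_features(geojson_text):
    """Hypothetical helper: one row per feature, one column per property."""
    js = json.loads(geojson_text)                      # 1. read JSON
    rows = [f["properties"] for f in js["features"]]   # 2. unravel into a flat table
    return pandas.DataFrame(rows)

table = flatten_features(sample_geojson)

# 3. check data: every row needs an identifier and a positive radius
assert table["CRATER_ID"].notnull().all()
assert (table["RADIUS"] > 0).all()

# 4. write CSV (to an in-memory buffer here; a file path works the same way)
buffer = io.StringIO()
table.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a `StringIO` buffer keeps the sketch free of side effects; for the real file, passing a path such as `'lagain_db.csv'` (an assumed name) to `to_csv` would do the same job.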

In [1]:
import json
import pandas
print('Pandas version: ', pandas.__version__)

Pandas version:  0.24.1


<div class="alert alert-info">
Where are we?

<pre>
<code>
$ git clone https://github.com/alagain/martian_crater_database.git
$ cd martian_crater_database/Global
$ unzip lagain_db.json.zip
</code>
</pre>
</div>

In [2]:
%ls

[0m[01;31mlagain_db_filtered.json.zip[0m  Notebook_PreProcLagain.html
[01;31mlagain_db_filtered.zip[0m       Notebook_PreProcLagain.ipynb
lagain_db.json               notebook.tex
[01;31mlagain_db.json.zip[0m           README.md
[01;31mlagain_db.zip[0m


In [3]:
with open('lagain_db.json','r') as fp:
    js = json.load(fp)
    
features = js['features']
print("Number of features: ", len(features))

Number of features:  384582


In [4]:
features[0]

{'type': 'Feature',
 'properties': {'CRATER_ID': '200-007',
  'RADIUS': 500.0,
  'X': 23.671499,
  'Y': -43.584301,
  'TYPE': 1.0,
  'STATUS': 'Valid',
  'LRD_MORPH': None,
  'ORIGIN': None,
  'ADDING': 1.0},
 'geometry': {'type': 'Point', 'coordinates': [23.671499, -43.584301]}}

## Let's check if everything is "Point"
...a sanity check really, because AFAIK they are all points...

In [5]:
{f['geometry']['type'] for f in features}

{'Point'}

...all entries are points. OK. Given that, and judging from the first feature shown above, I assume that (`X`,`Y`) and `coordinates` hold the same values. If so, I can simply drop `geometry` in what follows.
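
The assumption that (`X`,`Y`) matches `coordinates` can be verified across all features instead of being inferred from a single sample. Below is a minimal sketch; `coordinates_mismatches` is a hypothetical helper, and `sample_features` is a one-element stand-in built from the first feature printed above. In the notebook, the real `features` list would be passed instead.

```python
# One-element stand-in for `features` (values copied from the first feature shown above)
sample_features = [
    {"type": "Feature",
     "properties": {"CRATER_ID": "200-007", "X": 23.671499, "Y": -43.584301},
     "geometry": {"type": "Point", "coordinates": [23.671499, -43.584301]}},
]

def coordinates_mismatches(features):
    """Hypothetical helper: CRATER_IDs whose (X, Y) differ from geometry.coordinates."""
    bad = []
    for f in features:
        props = f["properties"]
        lon, lat = f["geometry"]["coordinates"][:2]
        # Exact comparison: both pairs are parsed from the same JSON text
        if (props["X"], props["Y"]) != (lon, lat):
            bad.append(props["CRATER_ID"])
    return bad

mismatches = coordinates_mismatches(sample_features)
print("Features with differing coordinates:", len(mismatches))
```

An empty result would justify dropping `geometry`. If the two sources were rounded differently, a tolerance-based comparison (for example with `math.isclose`) would be the safer choice.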

## Let's check what is going on in `properties`

In [6]:
df = pandas.read_json(json.dumps([f['properties'] for f in features]))

In [7]:
df.describe(include='all')

,ADDING,CRATER_ID,LRD_MORPH,ORIGIN,RADIUS,STATUS,TYPE,X,Y
count,384582.0,384582,8480,39204,384582.0,384582,384582.0,384582.0,384582.0
unique,,384561,4,106,,5,,,
top,,21-003453,SLE,05-000000,,Valid,,,
freq,,2,6680,5857,,288117,,,
mean,0.000793,,,,1778.392579,,1.66577,10.135017,-7.18077
std,0.02815,,,,4295.514407,,1.211242,96.634768,33.612876
min,0.0,,,,500.0,,1.0,-179.996994,-86.699997
25%,0.0,,,,590.0,,1.0,-58.806,-30.92875
50%,0.0,,,,765.0,,1.0,12.7595,-10.056
75%,0.0,,,,1280.0,,2.0,89.264749,17.259001
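
The summary reports 384561 unique `CRATER_ID` values for 384582 rows, with `21-003453` appearing twice, so some identifiers repeat. Before using `CRATER_ID` as a key in the database, the repeated rows could be listed as sketched below. `sample_df` is a small stand-in for `df`, and its `RADIUS` values are made up for illustration.

```python
import pandas

# Small stand-in for `df`; only the repeated ID "21-003453" comes from the summary above
sample_df = pandas.DataFrame({
    "CRATER_ID": ["21-003453", "21-003453", "20-015722"],
    "RADIUS": [600.0, 610.0, 550.0],  # illustrative values
})

# keep=False flags every member of a duplicated group, not only the later copies
dup_mask = sample_df.duplicated(subset="CRATER_ID", keep=False)
duplicates = sample_df[dup_mask].sort_values("CRATER_ID")
print(duplicates)
```

Sorting by `CRATER_ID` puts the members of each group next to each other, which makes it easier to see whether they are exact copies or distinct craters sharing an identifier.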


In [8]:
pandas.set_option('display.max_rows',100)
df.sample(100)

,ADDING,CRATER_ID,LRD_MORPH,ORIGIN,RADIUS,STATUS,TYPE,X,Y
280879,0,24-001127,,,6895,Valid,1,-174.186005,-58.084999
45563,0,20-015722,,,550,Valid,1,5.905,-26.218
213574,0,07-001750,,,1115,Valid,1,152.475006,55.555
24490,0,24-014612,,,525,Valid,1,-151.048996,-38.698002
131435,0,12-012470,,,695,Valid,1,14.999,24.162001
3314,0,18-014319,,,505,Valid,1,-76.845001,-21.645
359626,0,26-011110,,26-000008,590,Secondary,4,-15.291,-48.409
268618,0,02-000382,,,2900,Valid,1,-155.679001,39.369999
228086,0,02-001071,,,1290,Valid,1,-151.828995,38.723999
294720,0,21-001286,SLE,,5475,Layered,2,63.188999,-17.815001
