## Creating a data model

In this notebook we will create a new **data model/schema** and apply it to a raw `.imma` file, using the [mdf_reader](https://git.noc.ac.uk/iregon/mdf_reader) tool. We will add supplemental metadata to the basic `imma1` data model and display the supplemental data as a pandas DataFrame.

Let's first import all the tools that we will need.

In [None]:
import os
import sys

sys.path.append("/home/bea/")

import mdf_reader

The `mdf_reader` tool comes with data model templates, stored as `.json` files, that we can use to build our models. For more information, see the [manual](https://git.noc.ac.uk/iregon/mdf_reader/-/blob/master/docs/User_manual.docx).

In [None]:
path_to_data_models = "/home/bea/c3s_work/mdf_reader/data_models/lib/"

template_names = mdf_reader.schemas.templates()
template_names

According to the manual, ICOADS data stored in the [IMMA format](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf) represents a complex data model, since the data includes blocks of sections that are exclusive to certain decks (DCKs), e.g. data coming from the NOAA National Climatic Data Center (NCDC) TD-11 formats. Most ICOADS data, however, will need a **schema** based on the `imma1.json` format, which is itself based on the template `*_complex_opt.json`.

Let's try to build our own **schema** based on this template for a new deck (dck). In this notebook we will organise the data and metadata from the **US Maury collection** that corresponds to `source/dck 069-701`.

1. First, let's read a raw `.imma` file from dck 701 as an example, for a subset of the data collected in April 1845.

In [None]:
schema = "imma1"

data_file_path = "/home/bea/c3s_work/mdf_reader/tests/data/069-701_1845-04_subset.imma"

data_raw = mdf_reader.read(data_file_path, data_model=schema)

In [None]:
data_raw.data["c99"]

The `c99` column is a bit messy. Here, we will need to separate the Supplemental Metadata, which was ingested into ICOADS as a single string, and sort out each row according to the source and deck documentation.

2. We then need to make a new data model or **schema** to be stored in the library folder of `mdf_reader`. For this, we create a folder named `imma1_d701` in the lib directory.
3. Under this folder (`/data_models/lib/imma1_d701`) we need to add a `.json` file with the same name. This `imma1_d701.json` file will contain all the data model information, with instructions on how to subdivide the metadata added to `c99`. The file is named `imma1_d701.json` because the data model for this deck is based on the `imma1` template shown above, but `c99` will be further subdivided into other columns/sections.

In [None]:
path_to_folder = "/home/bea/c3s_work/mdf_reader/data_models/lib/"
model_name = "imma1_d701"
model_path = os.path.join(path_to_folder, model_name)
model_path

> Uncomment the following lines to create new data models. This folder is already within the repository, so you don't need to run the lines below. They only serve as a guide for further schemas.

In [None]:
# if not os.path.exists(model_path):
#     os.makedirs(model_path)

Into that path we will copy the template that our **schema** will be based on. In this case, that is the `imma1` schema.

In [None]:
# import shutil
# shutil.copyfile(os.path.join(path_to_folder, 'imma1/imma1.json'),  os.path.join(model_path, model_name+'.json'))

Now we need to make a directory called `code_tables` and copy into it all the code tables from the `imma1` template folder.

In [None]:
# import shutil
# shutil.copytree(os.path.join(path_to_folder, 'imma1/code_tables'), os.path.join(model_path,'code_tables'))
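The folder-and-copy steps above can be sketched as a single helper. This is only an illustration under stated assumptions: `clone_model` is a hypothetical name (not part of `mdf_reader`), and the template is a placeholder built in a temporary directory so that no real `data_models/lib/` folder is touched.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Stand-in for data_models/lib/, created in a temporary directory.
# The file contents below are placeholders, not real imma1 schema content.
lib_dir = Path(tempfile.mkdtemp())
template_dir = lib_dir / "imma1"
(template_dir / "code_tables").mkdir(parents=True)
(template_dir / "imma1.json").write_text(json.dumps({"placeholder": True}))
(template_dir / "code_tables" / "example_table.json").write_text(
    json.dumps({"1": "placeholder"})
)


def clone_model(lib_dir, template_name, model_name):
    """Copy <template>.json and its code_tables/ into a new model folder."""
    src = Path(lib_dir) / template_name
    dst = Path(lib_dir) / model_name
    dst.mkdir(parents=True, exist_ok=True)
    # The schema file is renamed to match the new model folder.
    shutil.copyfile(src / f"{template_name}.json", dst / f"{model_name}.json")
    shutil.copytree(src / "code_tables", dst / "code_tables", dirs_exist_ok=True)
    return dst


new_model = clone_model(lib_dir, "imma1", "imma1_d701")
print(sorted(p.relative_to(new_model).as_posix() for p in new_model.rglob("*")))
```

Pointing `lib_dir` at a real library folder would reproduce the commented-out cells above in one call; `dirs_exist_ok=True` requires Python 3.8 or later.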

We end up with something like this:

In [None]:
from IPython.display import Image

Image(filename="/home/bea/c3s_work/figures/deckschema.png")

In [None]:
from IPython.display import Image

Image(filename="/home/bea/c3s_work/figures/code_tables_schema_one.png")

Now the key will be to modify the `c99` section of the `imma1_d701.json`. See the highlighted text in the figure below.

In [None]:
from IPython.display import Image

Image(filename="/home/bea/c3s_work/figures/c99differences.png")

The `c99_sentinal` section identifies where a new section starts in the data. In this case, the new section corresponds to the Supplemental Metadata.

In our example, this supplemental metadata is described in the documentation of the US Maury collection on the [ICOADS website](https://icoads.noaa.gov/e-doc/other/transpec/maury/maury_transpec).

4. We will need to add the metadata information from the website inside that `c99_sentinal` section and create as many sections as the data requires.

> sentinal: section identifier
> applies to: format.fixed_width
> is mandatory: it is not mandatory if the section is unique, unique in a parsing_order block, or
> part of a sequential parsing_order block.
> type: string
> comments: the element bearing the sentinal needs to be, additionally, declared in the
> elements block
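To make the role of a sentinal more concrete, here is a minimal sketch of how a section identifier can be used to decide whether a section is present in a fixed-width record. It is not the `mdf_reader` implementation: the function name `split_by_sentinal` and the layout values (names, sentinals, lengths) are all made up for illustration.

```python
def split_by_sentinal(record, sections):
    """Walk a fixed-width record, cutting out each section whose sentinal matches.

    `sections` is an ordered list of (name, sentinal, length) tuples.
    A section is read only if the unread text starts with its sentinal;
    otherwise it is treated as absent and skipped.
    """
    found = {}
    pos = 0
    for name, sentinal, length in sections:
        if record.startswith(sentinal, pos):
            found[name] = record[pos:pos + length]
            pos += length
    return found


# Hypothetical layout: these names, sentinals, and lengths are invented.
layout = [("core", "A", 4), ("extra", "B", 3), ("supp", "C", 5)]

# The optional "extra" section is missing from this record.
print(split_by_sentinal("A123C4567", layout))
```

In this toy record the `extra` section is skipped because the text after `core` does not start with its sentinal.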

5. We will have to build additional `.json` files and save them under the `code_tables` folder of our schema. Each `.json` file inside `code_tables` is a dictionary that helps decode metadata observations (e.g. wind force scales or weather codes). For each encoded variable that we add, we need to add a new `ICOADS.C99_Variable.json` file to the **schema**. Files need to be named after the section that they represent, in this case `ICOADS.C99_Variable.json`. See the images below:

In [None]:
Image(filename="/home/bea/c3s_work/figures/code_tables_schema_two.png")
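As a rough illustration of the idea that a code table is a dictionary from raw codes to meanings, the sketch below decodes a few values. The codes and labels are invented for the example and do not come from the Maury documentation, and `decode` is a hypothetical helper, not an `mdf_reader` function.

```python
import json

# Hypothetical code table content: codes and labels are invented.
code_table_text = json.dumps({"0": "calm", "1": "light", "2": "strong"})

# A code table file would be loaded the same way with json.load(file_handle).
code_table = json.loads(code_table_text)


def decode(raw_code, table):
    """Return the label for a raw code, or None if the code is not in the table."""
    return table.get(raw_code.strip())


print([decode(c, code_table) for c in ["0", " 2", "9"]])
```

Returning `None` for an unknown code keeps undecodable values visible instead of raising an error on the first bad record.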

From the US Maury collection [ICOADS documentation](https://icoads.noaa.gov/e-doc/other/transpec/maury/maury_transpec), we find out that the `c99` for this deck is composed of the following sections:

- Data
- Header information
- Quality control information (qc)

```
Data stored in the supplemental attachment consisted of the entire data record
(173 characters); followed by a selection of fields from, or derived from, the
associated header record (through character 241); and selected fields from the
qc file (total 250 characters):
  # Pos.     Total #  Field  Record
    range    of pos.   name    type  Description of field (of derived field)
--- -------  -------  -----  ------  ----------------------------------------
  1 1-7         7     cvoyd    data  voyage number
... ...               ...       ...  ...
 47 172-173     2     cmvq     data  magnetic variation QC indicator
 NA 174-175     2     cts2   header  (fr ship type, ctship, according to [5])
  4 176-177     2     cft    header  form type
  5 178-193    16     comm   header  commander (first 16 positions only) [6]
  6 194-217    24     cfr    header  from city
  7 218-241    24     cto    header  to city
  2 242-246     5     qc2    qc      reel sequence number
  5 247-248     2     qc5    qc      day  (local time) (99 indicates missing)
  6 249-250     2     qc6*   qc      hour (local time) (99 indicates missing)
--- -------  -------  -----  ------  ----------------------------------------
* Whenever qc6 was 24, zero was inadvertently written out to the supplemental
attachment.  This resulted from an error in the conversion program, but can
be fixed by interpretation of hour zero as hour 24 of qc5 + 1 (as noted in [2],
qc6 originally ranged 1-24, with 24 signifying hour 0 of the next day.  As
intended, qc5 was included in the supplementary attachment in original form.
```
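The footnote on `qc6` describes a small correction that can be sketched in a few lines. This is one possible reading of the note, not code from the ICOADS conversion program: `restore_qc6` is a hypothetical helper that maps the erroneous hour 0 back to the original code 24 and leaves every other value, including the missing-value code 99, unchanged.

```python
def restore_qc6(qc6):
    """Map the erroneous hour 0 back to the original code 24."""
    hour = int(qc6)
    # Per the footnote, 0 was written whenever the original value was 24.
    # In the original convention, 24 means hour 0 of the day after qc5.
    return 24 if hour == 0 else hour


# 99 (missing) and ordinary hours pass through untouched.
print([restore_qc6(v) for v in ["0", "12", "99"]])
```

The day field `qc5` is deliberately left alone here, since the documentation says it was stored in its original form.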

In the raw data file the information looks like this:

In [None]:
c99 = data_raw.data["c99"]
line = c99.iloc[63]

In [None]:
line.values[0]

We then need to divide this whole string according to the documentation above and the data format specified in the [US Maury data docs](https://icoads.noaa.gov/e-doc/other/transpec/maury/maury_format).

In [None]:
# sentinal = 5
part_1 = line.values[0][0:5]
part_1

In [None]:
# cvoyd voyage number = 7
part_2 = line.values[0][5 : 5 + 7]
part_2

In [None]:
# date = 10
part_3 = line.values[0][12 : 12 + 10]
part_3
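The manual slicing above can be generalised with a small lookup table. The sketch below uses only the position ranges listed in the documentation excerpt (the fields elided there are left out here too) and assumes, as in the cells above, that the documented positions start right after the 5 leading sentinal characters. `slice_fields` is a hypothetical helper, and the record is synthetic rather than real Maury data.

```python
# (name, first position, last position), 1-based and inclusive, copied from
# the documentation table; fields elided there ("...") are omitted.
LAYOUT = [
    ("cvoyd", 1, 7),
    ("cmvq", 172, 173),
    ("cts2", 174, 175),
    ("cft", 176, 177),
    ("comm", 178, 193),
    ("cfr", 194, 217),
    ("cto", 218, 241),
    ("qc2", 242, 246),
    ("qc5", 247, 248),
    ("qc6", 249, 250),
]

SENTINAL_LENGTH = 5  # assumption: the leading characters sliced off as part_1


def slice_fields(record, layout=LAYOUT, offset=SENTINAL_LENGTH):
    """Cut a c99 string into named fields using 1-based inclusive positions."""
    return {
        name: record[offset + first - 1 : offset + last]
        for name, first, last in layout
    }


# Synthetic record: 5 filler characters, then 250 characters in which each
# position holds the last digit of its own 1-based position number.
fake = "XXXXX" + "".join(str(i % 10) for i in range(1, 251))
fields = slice_fields(fake)
print(fields["cvoyd"], fields["qc6"])
```

With `offset=5`, the `cvoyd` entry reproduces the `[5 : 5 + 7]` slice used for `part_2`.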

6. We build our `.json` file so that each group of data fields in the ICOADS documentation becomes a new section, and each parameter in the data becomes a new element of its section. Having a `sentinal` section at the beginning of `c99` is important because, in the `.imma` format, `c99` starts with 5 characters that are always the same, regardless of the source/dck.

In [None]:
Image(filename="/home/bea/c3s_work/figures/new_schema.png")

To each section we add the corresponding elements/parameters.

In [None]:
Image(filename="/home/bea/c3s_work/figures/elements.png")

Now we feed this new data model to the `mdf_reader.read` function. It is important that the data model is saved under the right directory:

In [None]:
model_path

In [None]:
data_file_path = "/home/bea/c3s_work/mdf_reader/tests/data/069-701_1845-04_subset.imma"

data = mdf_reader.read(data_file_path, data_model_path=model_path)

And, as if by magic, the whole messy string is split into separate sections!

In [None]:
import pandas as pd

pd.options.display.max_columns = None
data.data[["c99_sentinal"]].head()

> The section above is the sentinal section, which is the same in the `c99` column of all ICOADS decks.

In [None]:
data.data[["c99_data"]].head(n=5)

In [None]:
data.data[["c99_header"]].head(n=5)

In [None]:
data.data[["c99_qc"]].head(n=5)