This notebook shows you how to scan your raw data in a pre-processing step to understand what it contains.

This helps you avoid failures caused by multiple column names for the same item (e.g. lon and LON), and lets you map raw data values to standard values (e.g. string descriptions to integers).

In [13]:
import trackio as tio
import glob
import numpy as np

First, define the raw data files.

In [2]:
#define raw data files
data_path = './files'
files = glob.glob(f'{data_path}/AIS_*.csv')

files

['./files\\AIS_2021_01_01.csv']

Use the function below to make a dictionary mapper for all of the column names encountered in the raw data files.

In [3]:
#make a column mapper for raw data
col_mapper = tio.make_col_mapper(files, 
                                 ncores=1)
                                 
col_mapper

Making column mapper: 100%|██████████| 1/1 [00:00<00:00, 499.98it/s]




{'BaseDateTime': 'BaseDateTime',
 'COG': 'COG',
 'CallSign': 'CallSign',
 'Cargo': 'Cargo',
 'Draft': 'Draft',
 'Heading': 'Heading',
 'IMO': 'IMO',
 'LAT': 'LAT',
 'LON': 'LON',
 'Length': 'Length',
 'MMSI': 'MMSI',
 'SOG': 'SOG',
 'Status': 'Status',
 'TranscieverClass': 'TranscieverClass',
 'VesselName': 'VesselName',
 'VesselType': 'VesselType',
 'Width': 'Width'}
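
Conceptually, the mapper above is an identity dictionary built from every header name seen across the files. The sketch below reproduces that idea with only the standard library. `sketch_col_mapper` is a hypothetical helper written for illustration, not trackio's implementation, and the two in-memory CSV strings stand in for files on disk.

```python
import csv
import io

def sketch_col_mapper(csv_texts):
    """Collect every header name seen and map it to itself (hypothetical helper)."""
    names = set()
    for text in csv_texts:
        # read only the header row of each CSV
        header = next(csv.reader(io.StringIO(text)))
        names.update(header)
    # identity mapping, sorted for readability
    return {name: name for name in sorted(names)}

# two stand-in "files" that name the same fields differently
raw_a = "MMSI,LAT,LON\n1,29.1,-90.2\n"
raw_b = "MMSI,lat,lon\n2,30.4,-88.9\n"

sketch_mapper = sketch_col_mapper([raw_a, raw_b])
print(sketch_mapper)
```

Both `LAT` and `lat` appear as separate keys, which is exactly the kind of duplication the scan is meant to surface before processing.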

Every raw data file must have X, Y, and Time columns, or have column names that map to these three columns. Here the raw file uses LON, LAT, and BaseDateTime instead, so you simply edit the column mapper as below.

In [4]:
col_mapper['LON'] = 'X'
col_mapper['LAT'] = 'Y'
col_mapper['BaseDateTime'] = 'Time'

col_mapper

{'BaseDateTime': 'Time',
 'COG': 'COG',
 'CallSign': 'CallSign',
 'Cargo': 'Cargo',
 'Draft': 'Draft',
 'Heading': 'Heading',
 'IMO': 'IMO',
 'LAT': 'Y',
 'LON': 'X',
 'Length': 'Length',
 'MMSI': 'MMSI',
 'SOG': 'SOG',
 'Status': 'Status',
 'TranscieverClass': 'TranscieverClass',
 'VesselName': 'VesselName',
 'VesselType': 'VesselType',
 'Width': 'Width'}
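
Before moving on, it can help to confirm that the edited mapper really covers the three required names. The check below is a minimal sketch; `REQUIRED` and `missing_required` are hypothetical names introduced here, not part of trackio.

```python
# the three standard column names the text says every file must map to
REQUIRED = {"X", "Y", "Time"}

def missing_required(col_mapper):
    """Return the required standard names that no raw column maps to."""
    return REQUIRED - set(col_mapper.values())

# shortened versions of the mapper before and after the manual edit
before = {"BaseDateTime": "BaseDateTime", "LAT": "LAT", "LON": "LON"}
after = {"BaseDateTime": "Time", "LAT": "Y", "LON": "X"}

print(sorted(missing_required(before)))  # ['Time', 'X', 'Y']
print(sorted(missing_required(after)))   # []
```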

Alternatively, you can use the built-in column mapper. This has been built over time by encountering new column names and data fields in raw data files.

In [5]:
#make a column mapper for raw data, use the built in mapper
col_mapper = tio.make_col_mapper(files, 
                                 ncores=4,
                                 fill_mapper=tio.mappers.columns)

col_mapper

Making column mapper: 100%|██████████| 1/1 [00:00<00:00, 587.60it/s]




{'BaseDateTime': 'Time',
 'COG': 'Coursing',
 'CallSign': 'CallSign',
 'Cargo': 'Cargo',
 'Draft': 'Draft',
 'Heading': 'Heading',
 'IMO': 'IMO',
 'LAT': 'Y',
 'LON': 'X',
 'Length': 'Length',
 'MMSI': 'MMSI',
 'SOG': 'Speed',
 'Status': 'Status',
 'TranscieverClass': 'TranscieverClass',
 'VesselName': 'Name',
 'VesselType': 'AISCode',
 'Width': 'Width'}

Notice that the X, Y, and Time fields have been automatically detected, as well as some other fields.

You can also update the built-in column mapper to save newly encountered column names and data fields, so that they are detected automatically next time.
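
The effect of a fill mapper can be pictured as a dictionary lookup with a fallback. The sketch below assumes that names missing from the known mapper keep their original name, which matches the output above (e.g. `'Draft': 'Draft'`) but is not confirmed as trackio's internal logic. `fill_from_known` and the shortened `known` dictionary are hypothetical.

```python
def fill_from_known(raw_names, known):
    """Map each raw name via `known`, falling back to the name itself."""
    return {name: known.get(name, name) for name in raw_names}

# a small stand-in for a mapper of previously encountered names
known = {"LON": "X", "LAT": "Y", "BaseDateTime": "Time", "SOG": "Speed"}
raw_names = ["BaseDateTime", "LAT", "LON", "SOG", "Draft"]

filled = fill_from_known(raw_names, known)
print(filled)
```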

In [6]:
builtin_col_mapper = tio.mappers.columns
builtin_col_mapper

{'BaseDateTime': 'Time',
 'basedatetime': 'Time',
 'BASEDATETIME': 'Time',
 'Basedatetime': 'Time',
 'dt_pos_utc': 'Time',
 'DT_POS_UTC': 'Time',
 'Dt_Pos_Utc': 'Time',
 'dt pos utc': 'Time',
 'dt-pos-utc': 'Time',
 'DATE TIME (UTC)': 'Time',
 'date time (utc)': 'Time',
 'Date Time (Utc)': 'Time',
 'Date time stamp': 'Time',
 'date time stamp': 'Time',
 'DATE TIME STAMP': 'Time',
 'Date Time Stamp': 'Time',
 'POSITION_UTC_DATE': 'Time',
 'position_utc_date': 'Time',
 'Position_Utc_Date': 'Time',
 'POSITION UTC DATE': 'Time',
 'POSITION-UTC-DATE': 'Time',
 'Time': 'Time',
 'time': 'Time',
 'TIME': 'Time',
 'MovementDateTime': 'Time',
 'movementdatetime': 'Time',
 'MOVEMENTDATETIME': 'Time',
 'Movementdatetime': 'Time',
 'X': 'X',
 'x': 'X',
 'Longitude': 'X',
 'longitude': 'X',
 'LONGITUDE': 'X',
 'Lon': 'X',
 'lon': 'X',
 'LON': 'X',
 'Longitude (DDD.ddd)': 'X',
 'longitude (ddd.ddd)': 'X',
 'LONGITUDE (DDD.DDD)': 'X',
 'Longitude (Ddd.Ddd)': 'X',
 'longitude [deg]': 'X',
 'LONGITUDE [

In [7]:
#check whether they are already in the built-in mapper
print('longitude [deg]' in builtin_col_mapper.keys())
print('latitude [deg]' in builtin_col_mapper.keys())

#make new mappings
add = {'longitude [deg]': 'X',
       'latitude [deg]': 'Y'}

#append to built in
tio.mappers.update(tio.mappers.columns, add)

True
True
Updated mapper in c:\Users\dere\Miniconda3\envs\trackio\lib\site-packages\trackio/supporting/column_mapper.csv


Now you can auto-detect these fields next time by passing `fill_mapper=tio.mappers.columns` to `tio.make_col_mapper`.
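
The built-in mapper printed earlier stores several case variants of each name (for example `BaseDateTime`, `basedatetime`, `BASEDATETIME`, `Basedatetime`). If you want new entries to follow the same pattern, one option is to generate the variants before updating. This is a sketch only; `case_variants` is a hypothetical helper, and whether trackio generates such variants itself is not shown here.

```python
def case_variants(name, target):
    """Map the name and its lower, upper, and title-case forms to `target`."""
    variants = {name, name.lower(), name.upper(), name.title()}
    return {variant: target for variant in sorted(variants)}

# build a dictionary with the same shape as `add` above
add_sketch = {}
add_sketch.update(case_variants("longitude [deg]", "X"))
add_sketch.update(case_variants("latitude [deg]", "Y"))

print(add_sketch)
```

The resulting dictionary has the same `{raw name: standard name}` shape as `add`, so it could be passed to the update call in the same way.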

Next, you can perform the same operation on any of the raw data fields. This allows you to do a cursory scan over data fields you know might be problematic (e.g. mix of integers, floats, strings, etc.) and map them to consistent values when processing the raw data.

In [8]:
#make a data mapper for raw data
data_mapper = tio.make_raw_data_mapper(files,
                                       col_mapper=col_mapper,
                                       data_col='Status',
                                       fill_mapper={},
                                       ncores=4)

data_mapper


QCing data columns: 100%|██████████| 1/1 [00:00<00:00, 622.30it/s]


{0.0: None,
 1.0: None,
 3.0: None,
 5.0: None,
 8.0: None,
 9.0: None,
 11.0: None,
 12.0: None,
 15.0: None,
 nan: None}
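
The output above is essentially the set of distinct values in the column, each mapped to `None` as a placeholder. The sketch below shows that idea on a plain list, collapsing every NaN into a single key. `sketch_data_mapper` is a hypothetical helper, not trackio's implementation, and it assumes the non-NaN values are all sortable numbers.

```python
import math

def sketch_data_mapper(values):
    """Map each distinct raw value to None, collapsing all NaNs into one key."""
    distinct = set()
    seen_nan = False
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            seen_nan = True
        else:
            distinct.add(value)
    # sorted non-NaN keys first, then a single NaN key if any were seen
    mapper = {key: None for key in sorted(distinct)}
    if seen_nan:
        mapper[float("nan")] = None
    return mapper

raw_status = [0.0, 5.0, float("nan"), 0.0, 15.0, float("nan")]
sketch = sketch_data_mapper(raw_status)
print(sketch)  # {0.0: None, 5.0: None, 15.0: None, nan: None}
```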

You can do the same thing for multiple data fields at once.

In [9]:
#make a data mapper for raw data
data_mapper = tio.make_raw_data_mapper(files,
                                       col_mapper=col_mapper,
                                       data_col=['Status', 'AISCode'],
                                       fill_mapper={},
                                       ncores=4)

data_mapper

QCing data columns: 100%|██████████| 1/1 [00:00<00:00, 500.81it/s]


{'Status': {0.0: None,
  1.0: None,
  3.0: None,
  5.0: None,
  8.0: None,
  9.0: None,
  11.0: None,
  12.0: None,
  15.0: None,
  nan: None},
 'AISCode': {31.0: None,
  36.0: None,
  37.0: None,
  59.0: None,
  60.0: None,
  70.0: None,
  80.0: None,
  90.0: None,
  nan: None}}
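
When a mapper covers several fields, it is easy to lose track of which values still point to the `None` placeholder. The helper below is a minimal sketch for listing them; `unmapped_values` and the labels in `example` are hypothetical and only illustrate the nested dictionary shape shown above.

```python
def unmapped_values(data_mapper):
    """For each field, list the raw values that still map to None."""
    return {
        field: [key for key, value in mapping.items() if value is None]
        for field, mapping in data_mapper.items()
    }

# a partially edited mapper with made-up labels
example = {
    "Status": {0.0: "label_a", 5.0: None},
    "AISCode": {70.0: None, 80.0: None},
}

todo = unmapped_values(example)
print(todo)  # {'Status': [5.0], 'AISCode': [70.0, 80.0]}
```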

You can now edit these dictionaries manually and use them when processing the data. There are also two built-in mappers specifically meant for AIS data, as shown below.

In [14]:
#make a data mapper for raw data
data_mappers = tio.make_raw_data_mapper(files,
                                        col_mapper=col_mapper,
                                        data_col=['Status','AISCode','Draft'],
                                        fill_mapper={'Status': tio.mappers.ais['Status'],
                                                     'AISCode': tio.mappers.ais['AISCode']},
                                        ncores=4)

#make a descriptor for draft
for key in data_mappers['Draft'].keys():
    if key <= 3:
        data_mappers['Draft'][key] = 'small'
    elif np.isnan(key):
        data_mappers['Draft'][key] = 'unknown'
    else:
        data_mappers['Draft'][key] = 'large'
        
data_mappers

QCing data columns: 100%|██████████| 1/1 [00:00<00:00, 400.14it/s]


{'Status': {0.0: 0,
  1.0: 1,
  3.0: 3,
  5.0: 5,
  8.0: 8,
  9.0: 9,
  11.0: 11,
  12.0: 12,
  15.0: 15,
  nan: None},
 'AISCode': {31.0: 31,
  36.0: 36,
  37.0: 37,
  59.0: 59,
  60.0: 60,
  70.0: 70,
  80.0: 80,
  90.0: 90,
  nan: None},
 'Draft': {2.3: 'small',
  2.5: 'small',
  2.7: 'small',
  2.9: 'small',
  3.0: 'small',
  3.3: 'large',
  3.4: 'large',
  3.5: 'large',
  3.6: 'large',
  3.7: 'large',
  3.8: 'large',
  3.9: 'large',
  4.0: 'large',
  4.1: 'large',
  4.2: 'large',
  4.3: 'large',
  4.4: 'large',
  4.5: 'large',
  4.6: 'large',
  4.9: 'large',
  5.0: 'large',
  5.2: 'large',
  5.5: 'large',
  5.8: 'large',
  6.1: 'large',
  9.4: 'large',
  9.8: 'large',
  9.9: 'large',
  10.1: 'large',
  12.0: 'large',
  14.0: 'large',
  14.5: 'large',
  14.9: 'large',
  nan: 'unknown'}}
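
One detail worth knowing when using such a dictionary yourself: NaN does not compare equal to itself, so a plain lookup with a different NaN object can miss the `nan` key. The sketch below applies a value mapper to a list while handling NaN explicitly. `apply_mapper` is a hypothetical helper for illustration; how trackio applies these mappers internally is not shown in this notebook.

```python
import math

def is_nan(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)

def apply_mapper(values, mapper, default=None):
    """Map raw values through `mapper`, treating every NaN as the same key."""
    # find the value stored under a NaN key, if the mapper has one
    nan_value = default
    for key, mapped in mapper.items():
        if is_nan(key):
            nan_value = mapped
            break
    # NaNs use the NaN entry; everything else is a normal lookup
    return [nan_value if is_nan(v) else mapper.get(v, default) for v in values]

draft_mapper = {2.5: "small", 3.0: "small", 4.1: "large", float("nan"): "unknown"}
raw_draft = [2.5, 4.1, float("nan"), 7.7]

labels = apply_mapper(raw_draft, draft_mapper)
print(labels)  # ['small', 'large', 'unknown', None]
```

The last value, 7.7, was never seen during the scan, so it falls through to `default`. That is a useful signal that the mapper should be rebuilt when new files are added.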

See notebook `04 - Grouping Points and Splitting Tracks.ipynb` for how these mappers are used when processing raw data.