# Fix bad ISO8859 characters on ERDDAP

ERDDAP encodes ISO8859 extended Latin characters (à á â ã ä å æ ç è é ê ë ì etc.) in a strange way. I wrote a Python function to fix this on the user side.

ERDDAP also renders these characters incorrectly on its own pages, so this is a workaround rather than a real fix.

### 1. Generate a list of ISO8859 characters

We skip rows 0, 1, 8 and 9 of the code table, as they contain no printable characters in the specification.

We also skip the ; character so we can use it as a separator in our ERDDAP csv file without causing confusion.

This generates a string of key: value pairs where the key is the code point in hexadecimal notation and the value is the character that code point encodes.

In [1]:
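def make_iso_8859():
    """Return 'hex_code: character' pairs for the printable rows of the table."""
    pairs = []
    for row in range(16):
        # Rows 0, 1, 8 and 9 contain no printable characters
        if row in (0, 1, 8, 9):
            continue
        for col in range(16):
            code_point = row * 16 + col
            # Below 256, chr() returns the same character as the '\xNN' escape
            char = chr(code_point)
            # ';' is reserved as the separator in the csv file
            if char == ';':
                continue
            pairs.append(f"{hex(code_point)}: {char}")
    return ", ".join(pairs)
string_in = make_iso_8859()
print(string_in)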

0x20:  , 0x21: !, 0x22: ", 0x23: #, 0x24: $, 0x25: %, 0x26: &, 0x27: ', 0x28: (, 0x29: ), 0x2a: *, 0x2b: +, 0x2c: ,, 0x2d: -, 0x2e: ., 0x2f: /, 0x30: 0, 0x31: 1, 0x32: 2, 0x33: 3, 0x34: 4, 0x35: 5, 0x36: 6, 0x37: 7, 0x38: 8, 0x39: 9, 0x3a: :, 0x3c: <, 0x3d: =, 0x3e: >, 0x3f: ?, 0x40: @, 0x41: A, 0x42: B, 0x43: C, 0x44: D, 0x45: E, 0x46: F, 0x47: G, 0x48: H, 0x49: I, 0x4a: J, 0x4b: K, 0x4c: L, 0x4d: M, 0x4e: N, 0x4f: O, 0x50: P, 0x51: Q, 0x52: R, 0x53: S, 0x54: T, 0x55: U, 0x56: V, 0x57: W, 0x58: X, 0x59: Y, 0x5a: Z, 0x5b: [, 0x5c: \, 0x5d: ], 0x5e: ^, 0x5f: _, 0x60: `, 0x61: a, 0x62: b, 0x63: c, 0x64: d, 0x65: e, 0x66: f, 0x67: g, 0x68: h, 0x69: i, 0x6a: j, 0x6b: k, 0x6c: l, 0x6d: m, 0x6e: n, 0x6f: o, 0x70: p, 0x71: q, 0x72: r, 0x73: s, 0x74: t, 0x75: u, 0x76: v, 0x77: w, 0x78: x, 0x79: y, 0x7a: z, 0x7b: {, 0x7c: |, 0x7d: }, 0x7e: ~, 0x7f: , 0xa0:  , 0xa1: ¡, 0xa2: ¢, 0xa3: £, 0xa4: ¤, 0xa5: ¥, 0xa6: ¦, 0xa7: §, 0xa8: ¨, 0xa9: ©, 0xaa: ª, 0xab: «, 0xac: ¬, 0xad: ­, 0xae: ®, 0xaf: ¯, 0
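
As a quick sanity check on the pairs above, the sketch below looks up a single code point by decoding one byte. It assumes the table meant here is ISO 8859-1 (Latin-1); `latin1_char` is a hypothetical helper name, not part of the original notebook.

```python
def latin1_char(code_point):
    """Decode a single byte as ISO 8859-1 (assumed to be the intended table)."""
    return bytes([code_point]).decode("latin-1")

# Each value should match the character printed next to its hex code above
for code_point in (0x41, 0xa3, 0xe9):
    print(f"{hex(code_point)}: {latin1_char(code_point)}")
```

For every code point from 0x20 to 0xff this gives the same character as `chr(code_point)`.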

### 2. Add this data to an ERDDAP server

I have hosted this on the VOTO ERDDAP server for now. If it is no longer available, you can host it on your own server.

https://erddap.observations.voiceoftheocean.org/erddap/tabledap/experiment_characters.html
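
If you need to recreate the source file yourself, a minimal sketch of a one-column, semicolon-separated csv might look like the following. The column name `characters` matches the dataset above; the helper names and the exact file layout are assumptions, since the original file is not shown here.

```python
import csv
import io

def characters_to_csv(text, column="characters"):
    """Write one header row and one data row, separated by ';' (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow([column])
    writer.writerow([text])
    return buffer.getvalue()

def characters_from_csv(csv_text, column="characters"):
    """Read the single value back, to confirm that nothing was lost on the way."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return next(reader)[column]

sample = '0x22: ", 0x2c: ,, 0xe9: é'
assert characters_from_csv(characters_to_csv(sample)) == sample
```

The round trip at the end only checks the file format itself; it says nothing about what ERDDAP does with the characters once the file is served.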

### 3. Download the dataset from ERDDAP

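ERDDAP reproduces the ASCII characters (the first 8 rows of the table) faithfully, apart from doubling the backslash. From `0x7f` onward it returns literal `\uXXXX` escape sequences instead, and in the output shown every character from `0xa1` on arrives as two escapes (for example `\u00c2\u00a1` for `¡`).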

In [2]:
import pandas as pd
df = pd.read_csv("https://erddap.observations.voiceoftheocean.org/erddap/tabledap/experiment_characters.csv?characters", sep=';')
erddap_string = df['characters'].values[0]
erddap_string

'0x20:  , 0x21: !, 0x22: ", 0x23: #, 0x24: $, 0x25: %, 0x26: &, 0x27: \', 0x28: (, 0x29: ), 0x2a: *, 0x2b: +, 0x2c: ,, 0x2d: -, 0x2e: ., 0x2f: /, 0x30: 0, 0x31: 1, 0x32: 2, 0x33: 3, 0x34: 4, 0x35: 5, 0x36: 6, 0x37: 7, 0x38: 8, 0x39: 9, 0x3a: :, 0x3c: <, 0x3d: =, 0x3e: >, 0x3f: ?, 0x40: @, 0x41: A, 0x42: B, 0x43: C, 0x44: D, 0x45: E, 0x46: F, 0x47: G, 0x48: H, 0x49: I, 0x4a: J, 0x4b: K, 0x4c: L, 0x4d: M, 0x4e: N, 0x4f: O, 0x50: P, 0x51: Q, 0x52: R, 0x53: S, 0x54: T, 0x55: U, 0x56: V, 0x57: W, 0x58: X, 0x59: Y, 0x5a: Z, 0x5b: [, 0x5c: \\\\, 0x5d: ], 0x5e: ^, 0x5f: _, 0x60: `, 0x61: a, 0x62: b, 0x63: c, 0x64: d, 0x65: e, 0x66: f, 0x67: g, 0x68: h, 0x69: i, 0x6a: j, 0x6b: k, 0x6c: l, 0x6d: m, 0x6e: n, 0x6f: o, 0x70: p, 0x71: q, 0x72: r, 0x73: s, 0x74: t, 0x75: u, 0x76: v, 0x77: w, 0x78: x, 0x79: y, 0x7a: z, 0x7b: {, 0x7c: |, 0x7d: }, 0x7e: ~, 0x7f: \\u007f, 0xa0: \\u00a0, 0xa1: \\u00c2\\u00a1, 0xa2: \\u00c2\\u00a2, 0xa3: \\u00c2\\u00a3, 0xa4: \\u00c2\\u00a4, 0xa5: \\u00c2\\u00a5, 0xa6: \\u
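
The doubled escapes look like the UTF-8 bytes of each character, written out one byte at a time. This is only a reading of the output above, not confirmed ERDDAP behaviour, and `utf8_bytes_as_escapes` is a hypothetical helper written to illustrate it.

```python
def utf8_bytes_as_escapes(char):
    """Encode a character as UTF-8, then write each byte as a \\u00XX escape."""
    return "".join(f"\\u{byte:04x}" for byte in char.encode("utf-8"))

# U+00A1 (¡) is two bytes in UTF-8: 0xc2 0xa1
print(utf8_bytes_as_escapes("\u00a1"))  # \u00c2\u00a1
```

The result matches what ERDDAP returned for `0xa1`, which suggests the fix needs to turn the escapes back into bytes and decode those bytes as UTF-8.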

### 4. Fix the special characters 

I have written an ugly function that fixes this. I don't fully understand why it works, but it seems to.

In [3]:
from erddap_admin_utils import fix_erddap_unicode
fixed_string = fix_erddap_unicode(erddap_string)
fixed_string

'0x20:  , 0x21: !, 0x22: ", 0x23: #, 0x24: $, 0x25: %, 0x26: &, 0x27: \', 0x28: (, 0x29: ), 0x2a: *, 0x2b: +, 0x2c: ,, 0x2d: -, 0x2e: ., 0x2f: /, 0x30: 0, 0x31: 1, 0x32: 2, 0x33: 3, 0x34: 4, 0x35: 5, 0x36: 6, 0x37: 7, 0x38: 8, 0x39: 9, 0x3a: :, 0x3c: <, 0x3d: =, 0x3e: >, 0x3f: ?, 0x40: @, 0x41: A, 0x42: B, 0x43: C, 0x44: D, 0x45: E, 0x46: F, 0x47: G, 0x48: H, 0x49: I, 0x4a: J, 0x4b: K, 0x4c: L, 0x4d: M, 0x4e: N, 0x4f: O, 0x50: P, 0x51: Q, 0x52: R, 0x53: S, 0x54: T, 0x55: U, 0x56: V, 0x57: W, 0x58: X, 0x59: Y, 0x5a: Z, 0x5b: [, 0x5c: \\, 0x5d: ], 0x5e: ^, 0x5f: _, 0x60: `, 0x61: a, 0x62: b, 0x63: c, 0x64: d, 0x65: e, 0x66: f, 0x67: g, 0x68: h, 0x69: i, 0x6a: j, 0x6b: k, 0x6c: l, 0x6d: m, 0x6e: n, 0x6f: o, 0x70: p, 0x71: q, 0x72: r, 0x73: s, 0x74: t, 0x75: u, 0x76: v, 0x77: w, 0x78: x, 0x79: y, 0x7a: z, 0x7b: {, 0x7c: |, 0x7d: }, 0x7e: ~, 0x7f: \x7f, 0xa0: \xa0, 0xa1: ¡, 0xa2: ¢, 0xa3: £, 0xa4: ¤, 0xa5: ¥, 0xa6: ¦, 0xa7: §, 0xa8: ¨, 0xa9: ©, 0xaa: ª, 0xab: «, 0xac: ¬, 0xad: \xad, 0xae: ®
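
The internals of `fix_erddap_unicode` are not shown here. As a rough sketch of one way such a repair could work, the function below treats each `\u00XX` escape as a single byte, decodes runs of those bytes as UTF-8, and falls back to Latin-1 for bytes that are not valid UTF-8 (such as the lone `\u00a0`). The name `fix_escapes_sketch` is hypothetical, and the approach assumes ERDDAP only emits escapes of the form `\u00XX`, which is all that appears in the output above.

```python
import re

# Either a doubled backslash, or a run of \u00XX escapes (one byte each)
_TOKEN = re.compile(r"\\\\|(?:\\u00[0-9a-fA-F]{2})+")

def _decode_token(match):
    token = match.group(0)
    if token == "\\\\":
        # ERDDAP doubled the backslash; restore a single one
        return "\\"
    # Collect the byte value carried by each escape in the run
    raw = bytes(int(h, 16) for h in re.findall(r"\\u00([0-9a-fA-F]{2})", token))
    # Valid UTF-8 sequences decode normally; stray bytes become lone surrogates
    text = raw.decode("utf-8", errors="surrogateescape")
    # Map each lone surrogate back to the Latin-1 character for that byte
    return "".join(
        chr(ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c for c in text
    )

def fix_escapes_sketch(erddap_text):
    """Undo the escaping seen in the ERDDAP output (assumed \\u00XX form only)."""
    return _TOKEN.sub(_decode_token, erddap_text)

print(fix_escapes_sketch("0xa0: \\u00a0, 0xa1: \\u00c2\\u00a1, 0x5c: \\\\"))
```

Handling the doubled backslash in the same pattern as the escapes keeps a genuine backslash followed by `u00..` from being mistaken for an escape.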

### 5. Check that you fixed them all

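We build a DataFrame by splitting the input string, the ERDDAP string and the fixed string on ',' and compare them row by row. The `0x2c: ,` entry contains a comma itself and so splits into two pieces, but it does so identically in all three strings, so the rows still line up.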

In [4]:
df_comp = pd.DataFrame({
    "original": string_in.split(','),
    "erddap_raw": erddap_string.split(','),
    "erddap_fixed": fixed_string.split(','),
})
df_comp[::10]

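| | original | erddap_raw | erddap_fixed |
|---|---|---|---|
| 0 | `0x20:` | `0x20:` | `0x20:` |
| 10 | `0x2a: *` | `0x2a: *` | `0x2a: *` |
| 20 | `0x33: 3` | `0x33: 3` | `0x33: 3` |
| 30 | `0x3e: >` | `0x3e: >` | `0x3e: >` |
| 40 | `0x48: H` | `0x48: H` | `0x48: H` |
| 50 | `0x52: R` | `0x52: R` | `0x52: R` |
| 60 | `0x5c: \` | `0x5c: \\` | `0x5c: \` |
| 70 | `0x66: f` | `0x66: f` | `0x66: f` |
| 80 | `0x70: p` | `0x70: p` | `0x70: p` |
| 90 | `0x7a: z` | `0x7a: z` | `0x7a: z` |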


In [5]:
(df_comp["original"] == df_comp["erddap_fixed"]).all()

True
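
If the check ever returns `False`, it helps to see which entries differ. The sketch below does the same split-and-compare without pandas; `find_mismatches` is a hypothetical helper, not part of the original notebook.

```python
def find_mismatches(original, fixed, sep=","):
    """Return (index, original_piece, fixed_piece) for every piece that differs."""
    original_parts = original.split(sep)
    fixed_parts = fixed.split(sep)
    if len(original_parts) != len(fixed_parts):
        raise ValueError("the two strings split into different numbers of pieces")
    return [
        (i, o, f)
        for i, (o, f) in enumerate(zip(original_parts, fixed_parts))
        if o != f
    ]

# An empty list means every piece matches
print(find_mismatches("0x41: A, 0xe9: é", "0x41: A, 0xe9: é"))  # []
```

With the strings from this notebook, the call would be `find_mismatches(string_in, fixed_string)`.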