# Approaching "Unsupported cast" errors

**Background**

Occasionally, an import will throw errors like the following:

```
Key:       4_2666
Function:  reduce_pixel_shards
args:      ()
kwargs:    {...}
Exception: "ArrowNotImplementedError('Unsupported cast from string to null using function cast_null')"
```

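We've observed that this is due to the way that PyArrow encodes types in parquet files: a column's type is taken from the data in each file, so a column that contains only nulls is written with the null type.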

At the reduce stage, we're combining several intermediate parquet files for a single spatial tile into the final parquet file. It's possible at this stage that some files will contain only empty (null) values in a column that we expect to be a string field.

For example:

File1

| int_field | string_field | float_field |
| --------- | ------------ | ----------  |
|         5 |      <empty> |         3.4 |
|         8 |      <empty> |         3.8 |

which will have a schema like:
       
    optional int64 field_id=-1 int_field;
    optional int32 field_id=-1 string_field (Null);
    optional double field_id=-1 float_field;
    
File2
    
| int_field | string_field | float_field |
| --------- |------------- | ----------- |
|         6 |      hello   |         4.1 |
|         7 |      <empty> |         3.9 |

will have a schema like:

    optional int64 field_id=-1 int_field;
    optional binary field_id=-1 string_field (String);
    optional double field_id=-1 float_field;

When we try to merge these files, the parquet engine will not cast between the null and string types for `string_field`, and throws the error shown above.

**Objective**

In this notebook, we'll look at an approach to resolving this issue: generating a single standard parquet schema file to use when writing the dataset.

**Read a single input file**

Start by reading a single input file as a pandas dataframe. You can reuse the InputReader class that you've instantiated for the import pipeline.

    import pandas as pd
    from hipscat_import.catalog.file_readers import CsvReader

    input_file = "/data3/epyc/data3/hipscat/raw/allwise_raw/wise-allwise-cat-part05"

    # This input CSV file requires header and type data from another source.
    type_frame = pd.read_csv("/astro/users/mmd11/git/hipscripts/epyc/allwise/allwise_types.csv")
    type_names = type_frame["name"].values.tolist()
    type_map = dict(zip(type_frame["name"], type_frame["type"]))

    file_reader = CsvReader(
        header=None,
        separator="|",
        column_names=type_names,
        type_map=type_map,
        chunksize=5,
    )

    # Read only the first chunk (5 rows) of the input file.
    data_frame = next(file_reader.read(input_file))

Now that we have the typed data from the input file, write out only the column-level schema to a new file:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema_only_file = "/data3/epyc/data3/hipscat/tmp/allwise_schema.parquet"

    # Build an empty table that carries only the schema, and write it to parquet.
    pq.write_table(pa.Table.from_pandas(data_frame).schema.empty_table(), where=schema_only_file)