<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Handle Corrupt/Bad Records


## Getting Started

Let's start by importing the libraries and defining a few useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
import pandas
import s3fs

baseUri = "/home/jovyan/materials/local-data/"

## Handle Malformed Lines

Let's read a pipe-separated file and let pandas infer the data types.

In [None]:
goodDF = pandas.read_csv(baseUri+"good.csv", sep="|")
goodDF

No problem here, but what if the file contains lines with fewer fields?

In [None]:
badDF = pandas.read_csv(baseUri+"bad-fewer.csv", sep="|")
badDF

From a programmatic point of view there is no problem: lines with too few fields get `NA` values in the trailing fields.

Unfortunately, the resulting DataFrame loses its meaning...
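As a minimal sketch of this behaviour, the snippet below parses a small in-memory sample (hypothetical data, not one of the lab files) and isolates the rows that were padded with missing values, so that they can be inspected or rejected before further processing.

```python
import io
import pandas

# Hypothetical sample: the second data line has only two of the three fields
sample = "id|name|score\n1|alice|10\n2|bob\n3|carol|30\n"
short_df = pandas.read_csv(io.StringIO(sample), sep="|")

# Rows where at least one field is missing after parsing
incomplete = short_df[short_df.isna().any(axis=1)]
print(incomplete)
```

Note that a row flagged this way may also be a well-formed line with a legitimately empty field, so the check is only a first filter.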

...but what about lines with more fields?

In [None]:
badDF = pandas.read_csv(baseUri+"bad-more.csv", sep="|")
badDF

By default, pandas raises an error.
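The default failure can be reproduced without the lab files. The sketch below uses a hypothetical in-memory sample and catches `pandas.errors.ParserError`, which is the exception the default C parser is expected to raise for a line with too many fields.

```python
import io
import pandas

# Hypothetical sample: the second data line has one field too many
sample = "id|name|score\n1|alice|10\n2|bob|20|extra\n"

try:
    pandas.read_csv(io.StringIO(sample), sep="|")
    error_message = None
except pandas.errors.ParserError as exc:
    # The message reports the offending line and the field counts
    error_message = str(exc)

print(error_message)
```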

Let's try to play with parameters.

`error_bad_lines`
>boolean, default True
>
>Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned.

`warn_bad_lines`
>boolean, default True
>
>If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output.

 

In [None]:
badDF2 = pandas.read_csv(baseUri+"bad-more.csv", sep="|", error_bad_lines=False, warn_bad_lines=True)
badDF2

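By setting `error_bad_lines` to `False`, pandas emits a warning for each malformed line and skips it instead of failing. Both parameters were deprecated in pandas 1.3 and removed in 2.0, where `on_bad_lines` replaces them.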
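On recent pandas versions the same behaviour can be sketched with `on_bad_lines`. The example assumes pandas 1.4 or later (the callable form requires `engine="python"`) and uses hypothetical in-memory data. The `collect_bad_line` helper is illustrative and not part of pandas.

```python
import io
import pandas

# Hypothetical sample: the second data line has one field too many
sample = "id|name|score\n1|alice|10\n2|bob|20|extra\n3|carol|30\n"

# "skip" drops malformed lines silently; "warn" also reports each one
skipped_df = pandas.read_csv(io.StringIO(sample), sep="|", on_bad_lines="skip")

# Alternatively, keep the rejected lines for later inspection
bad_rows = []

def collect_bad_line(bad_line):
    """Store the split fields of a malformed line and drop it from the result."""
    bad_rows.append(bad_line)
    return None  # returning None tells pandas to skip the line

kept_df = pandas.read_csv(
    io.StringIO(sample), sep="|", engine="python", on_bad_lines=collect_bad_line
)

print(skipped_df)
print(bad_rows)
```

Collecting the rejected lines, rather than only skipping them, makes it possible to report or repair them in a later step.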

## Handle Data Type Problem

Let's go back to the file we used in [Lab 1](./notebooks/python/data-ingestion/structured-semi_structured/lab/bulk-lab1.ipynb).

As an alternative to the pandas `Int64` array data type, the clash between `NaN` values and the NumPy integer data type can be worked around with the `converters` parameter (see the `read_csv` [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)).

Let's reproduce the problem...

In [None]:
import numpy as np
import csv
import math

dtp = { 'prev_id': np.int64
        , 'curr_id': np.int64
        , 'n': np.int64
        , 'prev_title': np.string_
        , 'curr_title': np.string_
        , 'type': np.string_
    }

#pyCsvPath = baseUri + "2015_02_clickstream.csv"
filePath = baseUri + "good.csv"

# Read the file, forcing the dtypes declared above
df = pandas.read_csv(filePath, sep="|", dtype=dtp)


df.info()
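Whether the cell above fails depends on the content of the file being read. The following sketch reproduces the problem on hypothetical in-memory data, where the first row has no `prev_id`: a NumPy integer column cannot hold a missing value, so `read_csv` is expected to raise a `ValueError`.

```python
import io
import numpy as np
import pandas

# Hypothetical sample: prev_id is missing in the first data row
sample = "prev_id|curr_id\n|10\n5|20\n"

try:
    pandas.read_csv(
        io.StringIO(sample),
        sep="|",
        dtype={"prev_id": np.int64, "curr_id": np.int64},
    )
    failure = None
except ValueError as exc:
    # numpy.int64 has no representation for a missing value
    failure = str(exc)

print(failure)
```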


... and solve it with converters.

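Our converter uses a `lambda function` ([info](https://en.wikipedia.org/wiki/Anonymous_function#Python) and [more info](05-Lambda-Expressions-Map-and-Filter.ipynb)) to cast the value in the column `prev_id` to `numpy.int64` when it is present, and to return `0` when it is missing. Converters receive the raw field as a string, so a missing value arrives as an empty string rather than as `NaN`.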

In [None]:
dtp = {'curr_id': np.int64
       , 'n': np.int64
       , 'prev_title': np.string_
       , 'curr_title': np.string_
       , 'type': np.string_
       }

#pyCsvPath = baseUri + "2015_02_clickstream.csv"
filePath = baseUri + "good.csv"

# Read the file: the converter handles prev_id, dtype handles the other columns
df = pandas.read_csv(
    filePath
    , sep="|"
    , dtype=dtp
    , converters={'prev_id': lambda x: np.int64(0) if not x else np.int64(x)}
    )

df.info()


In [None]:
df

**Note**
> If converters are specified, they will be applied INSTEAD of dtype conversion.

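The note above can be checked with a small sketch on hypothetical in-memory data: `prev_id` is given both a `dtype` and a converter, and the resulting column follows the converter. pandas typically emits a warning in this situation, which the sketch captures instead of printing.

```python
import io
import warnings
import numpy as np
import pandas

# Hypothetical sample: prev_id is missing in the first data row
sample = "prev_id|curr_id\n|10\n5|20\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    conv_df = pandas.read_csv(
        io.StringIO(sample),
        sep="|",
        # float64 is requested for prev_id, but the converter takes precedence
        dtype={"prev_id": "float64", "curr_id": np.int64},
        converters={"prev_id": lambda x: np.int64(x) if x else np.int64(0)},
    )

print(conv_df.dtypes)
print([str(w.message) for w in caught])
```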
**Discussion**: is this a good solution?

`prev_id` and `curr_id` are identifiers. Should we invent identifiers?

Moreover, we are now using `0` to identify several different values in the columns `prev_title` and `curr_title`. Technically we solved the problem, but from a business perspective we did not.

To address this common problem, integers need to be nullable too. pandas provides a [nullable integer data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#nullable-integer-data-type), which since version 1.0.0 represents missing values with `pandas.NA`.

Therefore, the correct solution is:

In [None]:
dtp = {
    'prev_id': "Int64"
    , 'curr_id': "Int64"
    , 'n': np.int64
    , 'prev_title': np.string_
    , 'curr_title': np.string_
    , 'type': np.string_
}

#pyCsvPath = baseUri + "2015_02_clickstream.csv"
filePath = baseUri + "good.csv"

# Read the file with nullable integer dtypes for the identifier columns
df = pandas.read_csv(
    filePath
    , sep="|"
    , dtype=dtp
    )

df.info()


In [None]:
df
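To see the nullable integer type in isolation, here is a minimal sketch on hypothetical in-memory data. The missing identifier stays missing (`<NA>`) instead of being replaced by an invented value, and the column keeps an integer type.

```python
import io
import pandas

# Hypothetical sample: prev_id is missing in the first data row
sample = "prev_id|curr_id|n\n|10|3\n5|20|7\n"

nullable_df = pandas.read_csv(
    io.StringIO(sample),
    sep="|",
    dtype={"prev_id": "Int64", "curr_id": "Int64", "n": "int64"},
)

# Count the identifiers that are genuinely unknown
missing_count = int(nullable_df["prev_id"].isna().sum())

print(nullable_df.dtypes)
print(missing_count)
```

Downstream code can then decide explicitly how to treat unknown identifiers, for example by filtering them out with `dropna`.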

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.