<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data Lab
* The goal of this lab is to put into practice some of what you have learned about reading data with pandas.

## Instructions
0. Start with the file **2015_02_clickstream.csv**, a file you haven't seen yet.
0. Inspect the file (what is the separator?).
0. Read in the data and assign it to a `DataFrame` named **pyTestDF**.
0. Run the last cell to verify that the data was loaded correctly and to print its schema.
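
One way to approach step 2 is to print the first raw lines and, optionally, let the standard library's `csv.Sniffer` guess the delimiter. This is a minimal sketch, not the lab's solution: the in-memory `sample` is a hypothetical stand-in for the first lines of the real file, and its values are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the first lines of the real file.
sample = (
    "prev_id|curr_id|n|prev_title|curr_title|type\n"
    "|3632887|121|other-google|!!|other\n"
    "64486|3632887|10|!!!_(album)|!!|link\n"
)

# Peek at the raw lines first: the separator is usually obvious by eye.
for line in io.StringIO(sample).readlines()[:2]:
    print(line.rstrip())

# csv.Sniffer guesses the delimiter from a text sample; restricting the
# candidates keeps the guess from landing on a character inside a title.
dialect = csv.Sniffer().sniff(sample, delimiters="|,;\t")
print(repr(dialect.delimiter))
```

For the real file, the same idea applies to a handful of lines read from the source rather than to the whole dataset.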

For the test to pass, the following columns should have the specified data types:
* **prev_id**: Int64
* **curr_id**: Int64
* **n**: int64
* **prev_title**: string
* **curr_title**: string
* **type**: string
  
**Note:** 
* The columns `prev_id` and `curr_id` contain `NaN` values (we will see later how to handle them in a more useful way). `NaN` is a float, and NumPy's `int64` dtype has no way to represent it. Pandas provides the nullable integer dtype `Int64` (note the capital "I", which distinguishes it from NumPy's `int64`), which can hold missing values in an integer column ([src](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#nullable-integer-data-type)).
* Python strings are stored in pandas columns of dtype `object`, which is why the verification cell expects `object` for the string columns (http://pbpython.com/pandas_dtypes.html).
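
The difference between the two integer dtypes can be seen on a tiny example. This is a sketch under stated assumptions: the two-row `sample` is made up and covers only three of the lab's columns, and `pd` is used as an alias for pandas purely for brevity.

```python
import io

import numpy as np
import pandas as pd

# Made-up two-row sample; the first row has no prev_id.
sample = "prev_id|n|prev_title\n|121|other-google\n64486|10|some_page\n"

# Without an explicit dtype, the missing value forces prev_id to float64.
inferred = pd.read_csv(io.StringIO(sample), sep="|")
print(inferred.dtypes)

# With the nullable "Int64" dtype the column stays integer and the gap becomes <NA>.
typed = pd.read_csv(
    io.StringIO(sample),
    sep="|",
    dtype={"prev_id": "Int64", "n": np.int64, "prev_title": object},
)
print(typed.dtypes)
print(typed)
```

Requesting NumPy's `int64` for a column with missing values would fail instead, which is why the lab asks for `Int64` on the two ID columns only.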

## Getting Started

Let's start by importing libraries and creating useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
import pandas
import s3fs

baseUri = "s3a://quantia-master/training/"

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Show Your Ingestion Work in Python

In [None]:
import numpy as np
import csv

pyCsvPath = baseUri + "2015_02_clickstream.csv"

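# Declare the dtypes while reading: the ID columns contain missing values,
# so they need the nullable "Int64" dtype rather than NumPy's int64
pyTestDF = ( pandas
            .read_csv(
              pyCsvPath
              , sep="|"
              , dtype={
                'prev_id': "Int64"
                , 'curr_id': "Int64"
                , 'n': np.int64
                # np.string_ was removed in NumPy 2.0; strings are stored as object
                , 'prev_title': object
                , 'curr_title': object
                , 'type': object
              }
            )
           )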
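
The cell above does not itself skip malformed rows. If the file did contain rows with more fields than the header, `read_csv` would raise an error by default. The sketch below shows one possible way to handle that with the `on_bad_lines` parameter (available in pandas 1.3 and later); the `sample` data is made up, and whether the real file needs this is an assumption.

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many.
sample = "prev_id|curr_id|n\n1|2|10\n3|4|20|extra\n5|6|30\n"

# Default behaviour (on_bad_lines="error") raises a ParserError on the long row.
try:
    pd.read_csv(io.StringIO(sample), sep="|")
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines="skip" drops the offending row and keeps the rest.
cleaned = pd.read_csv(io.StringIO(sample), sep="|", on_bad_lines="skip")
print(raised, len(cleaned))
```

Skipping rows silently loses data, so `on_bad_lines="warn"` may be a safer first choice while exploring a new file.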

In [None]:
pyTestDF.info()

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Verify Your Python Result

In [None]:
pyTestDF.info()

columns = pyTestDF.columns
types = pyTestDF.dtypes

assert len(columns) == 6, "Expected 6 columns but found " + str(len(columns))

# (column name, expected dtype) in the order the columns should appear
expected = [
    ("prev_id", "Int64"),
    ("curr_id", "Int64"),
    ("n", "int64"),
    ("prev_title", "object"),
    ("curr_title", "object"),
    ("type", "object"),
]

for i, (name, dtype) in enumerate(expected):
    # Report the full column name and the actual dtype when a check fails
    assert columns[i] == name, f'Expected column {i} to be "{name}" but found "{columns[i]}".'
    assert types.iloc[i] == dtype, f'Expected column {i} to be of type "{dtype}" but found "{types.iloc[i]}".'

print("Congratulations, all tests passed!\n")

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.