<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data with Pandas - CSV Files

**Technical Accomplishments:**
- Start working with the API documentation
- Read data from:
  * CSV without a Schema.
  * CSV with a Schema.

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Getting Started

Let's start by importing the libraries and defining a few useful variables.

In [None]:
%load_ext autotime

import pandas
import s3fs
import boto3
import io
import qcutils

baseUri = "s3://quantia-master/training/"

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Pandas API

The pandas I/O API is a set of top-level `reader` functions, such as `pandas.read_csv()`, that generally return a pandas object (usually a `DataFrame`). The corresponding `writer` functions are object methods, such as `DataFrame.to_csv()`.

|**Format**|**Data Description**|**Reader**    |**Writer**  |
|----------|--------------------|--------------|------------|
|text      |CSV                 |read_csv      |to_csv      |
|text      |JSON                |read_json     |to_json     |
|text      |HTML                |read_html     |to_html     |
|text      |Local clipboard     |read_clipboard|to_clipboard|
|binary    |MS Excel            |read_excel    |to_excel    |
|binary    |HDF5 Format         |read_hdf      |to_hdf      |
|binary    |Feather Format      |read_feather  |to_feather  |
|binary    |Parquet Format      |read_parquet  |to_parquet  |
|binary    |Msgpack             |read_msgpack  |to_msgpack  |
|binary    |Stata               |read_stata    |to_stata    |
|binary    |SAS                 |read_sas      |            |
|binary    |Python Pickle Format|read_pickle   |to_pickle   |
|SQL       |SQL                 |read_sql      |to_sql      |
|SQL       |Google Big Query    |read_gbq      |to_gbq      |

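**Note:** For more information, see http://pandas.pydata.org/pandas-docs/stable/io.html. Availability depends on the pandas version: for example, `read_msgpack` and `to_msgpack` were removed in pandas 1.0.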
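
The reader/writer split described above can be sketched with a small in-memory example. The sample data and the variable names (`csv_text`, `sampleDF`, `round_trip`) are made up for illustration and are not part of the training dataset.

```python
import io
import pandas

# Made-up sample data, only for illustration
csv_text = "site,requests\nmobile,10\ndesktop,20\n"

# Reader: a top-level function that returns a DataFrame
sampleDF = pandas.read_csv(io.StringIO(csv_text))

# Writer: a method of the DataFrame itself
buffer = io.StringIO()
sampleDF.to_csv(buffer, index=False)  # index=False: do not write the row index
round_trip = buffer.getvalue()

print(round_trip)
```

The other formats in the table follow the same pattern: `pandas.read_<format>()` to read and `DataFrame.to_<format>()` to write.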

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Reading from CSV w/InferSchema

We are going to start by reading in a very simple text file.

### The Data Source
* For this exercise, we will be using a tab-separated file called **wikipedia_pageviews_by_second.tsv** (<a href="https://datahub.io/en/dataset/english-wikipedia-pageviews-by-second" target="_blank">255 MB</a> file from Wikipedia)

In [None]:
qcutils.print_s3_bucket_object(key='training/wikipedia_pageviews_by_second.tsv')

### Read The CSV File
Let's start with the bare minimum by specifying only the location of the file and leaving every other option at its default:

In [None]:
# A reference to our tab-separated file
csvFile = baseUri + "wikipedia_pageviews_by_second.tsv"

tempDF = pandas.read_csv(csvFile)

Let's take a look at the result!

In [None]:
print(tempDF.head())

The result looks strange. Is the default separator correct?

No: the file is a TSV, so its fields are separated by tabs, while `read_csv` expects commas by default.
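
To see why the default separator gives a strange result, here is a minimal sketch on a made-up, in-memory sample (the row below is illustrative, not copied from the real file):

```python
import io
import pandas

# Made-up tab-separated sample, only for illustration
tsv_text = "timestamp\tsite\trequests\n2015-03-16T00:09:55\tmobile\t1595\n"

# Default separator (comma): each line ends up in a single column
wrongDF = pandas.read_csv(io.StringIO(tsv_text))
print(wrongDF.shape)

# Tab separator: the line is split into three columns
rightDF = pandas.read_csv(io.StringIO(tsv_text), sep="\t")
print(rightDF.shape)
```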

In [None]:
tempDF = pandas.read_csv(csvFile, sep="\t")
display(tempDF)

We can inspect the structure of the DataFrame by calling its `info()` method:

In [None]:
tempDF.info()

As you can see, pandas automatically uses the first row of the file as the header and infers the data type of each column (e.g. `timestamp` is read as an `object`, that is, as plain text).
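
The same behavior can be checked on a small, made-up sample with the same three column names. This is only a sketch: the rows are illustrative, and the exact dtype shown for text columns may vary with the pandas version.

```python
import io
import pandas

# Made-up sample with the same column names as the exercise file
tsv_text = (
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:10:39\tdesktop\t2460\n"
)

sampleDF = pandas.read_csv(io.StringIO(tsv_text), sep="\t")

# The first row became the column names
print(list(sampleDF.columns))

# requests is inferred as an integer column;
# timestamp is kept as text, not converted to a datetime
print(sampleDF.dtypes)
```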

## Additional Info

The `read_csv` function offers many parameters to customize how a file is read. For the complete list, see: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
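
As a small illustration, the sketch below uses two of those parameters, `usecols` and `nrows`, to read only some columns and only the first rows. The choice of parameters and the sample data are assumptions made for this example; they are not used elsewhere in this notebook.

```python
import io
import pandas

# Made-up sample, only for illustration
tsv_text = (
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:10:39\tdesktop\t2460\n"
    "2015-03-16T00:19:39\tmobile\t1692\n"
)

partialDF = pandas.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    usecols=["site", "requests"],  # keep only these columns
    nrows=2,                       # read only the first two data rows
)

print(partialDF)
```

Limiting rows and columns this way can be handy for a quick look at a large file before loading it in full.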

### Review: Reading CSV w/InferSchema

* The schema was automatically inferred
  * We have three columns of different types
* The header was automatically detected

**Question:** Is it possible to manually set the index column?

**Question:** What if one column is forced to be an index? Is it still a data column?

**Hints**

Go to the documentation and look for the parameter:
* index_col

In [None]:
tempDF = pandas.read_csv(csvFile, sep="\t", index_col='timestamp')
tempDF.info()
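
The second question can also be checked directly. The sketch below uses a made-up, in-memory sample and suggests that a column used as the index is no longer listed among the data columns, although `reset_index()` turns it back into a regular column.

```python
import io
import pandas

# Made-up sample, only for illustration
tsv_text = (
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:10:39\tdesktop\t2460\n"
)

indexedDF = pandas.read_csv(io.StringIO(tsv_text), sep="\t", index_col="timestamp")

# The index carries the name of the original column
print(indexedDF.index.name)

# The column is no longer among the data columns
print("timestamp" in indexedDF.columns)

# reset_index() moves the index back into a regular column
restoredDF = indexedDF.reset_index()
print(list(restoredDF.columns))
```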

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Reading from CSV w/User-Defined Schema

This time we are going to read the same file.

The difference here is that we are going to define the schema beforehand.

First, let us recall the automatically detected schema:

In [None]:
tempDF = pandas.read_csv(csvFile, sep="\t")
tempDF.info()

This time we specify the data types of the columns ourselves through the `dtype` parameter.

**Note:** We import numpy to exploit standard data types.

In [None]:
import numpy as np

tempDF = pandas.read_csv(csvFile, sep="\t", dtype={'site': np.str_, 'requests': np.int64})
tempDF.info()

We passed the schema for only two columns and skipped the date column.

**Why?**

Because `read_csv` does not support `datetime64` as a `dtype` for parsing. Date columns must instead be listed in the `parse_dates` parameter of `read_csv()`.

In [None]:
tempDF = pandas.read_csv(csvFile, sep="\t", dtype={'site': np.str_, 'requests': np.int64}, parse_dates=['timestamp'])
tempDF.info()
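
The effect of combining `dtype` and `parse_dates` can be verified on a small, made-up sample. This is a sketch under the assumption that the timestamps are ISO 8601 strings, as in the illustrative rows below.

```python
import io
import numpy as np
import pandas

# Made-up sample, only for illustration
tsv_text = (
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:10:39\tdesktop\t2460\n"
)

typedDF = pandas.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    dtype={"site": np.str_, "requests": np.int64},  # explicit column types
    parse_dates=["timestamp"],                       # parse this column as dates
)

# timestamp is now a datetime column instead of plain text
print(typedDF.dtypes)

# Datetime columns expose date parts through the .dt accessor
print(typedDF["timestamp"].dt.year.tolist())
```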

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.