# Welcome to the Dark Art of Coding:
## Introduction to Python
pandas: reading in files

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Importing pandas, etc
---

When using pandas, it is common to import pandas as `pd` and to import the two core classes, `Series` and `DataFrame`, directly.

In [None]:
import pandas as pd
from pandas import Series, DataFrame

Pandas is adept at reading in **many, many** data formats 

To see which ones, you can type `pd.read` and use **tab completion**:

```python
pd.read<Press the tab key>
```
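
If tab completion is not available (for example, in a plain script), a rough equivalent is to list the pandas names that begin with `read_`. This is only a sketch; the exact list depends on your pandas version.

```python
import pandas as pd

# Collect every top-level pandas name that starts with "read_"
readers = sorted(name for name in dir(pd) if name.startswith('read_'))
print(readers)
```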

# Reading from the clipboard
---

Let's start by reading from the clipboard after we copy data from a table on a webpage.

There is a sample file in the folder called:

```bash
sample_table.html
```

WARNING: if you are running this on a JupyterHub, this won't work as expected. If you run the notebook on a system where you have access to the system clipboard (i.e. natively on your local laptop), open this file with your browser and you will see a table. Copy the table to your clipboard and pandas will be able to extract the table from the clipboard.
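
A minimal sketch of the read itself, assuming the table has already been copied. `pd.read_clipboard()` is the pandas function for this; the `try`/`except` wrapper and the name `clip_df` are just illustrative choices so the cell does not fail where no clipboard is available.

```python
import pandas as pd

try:
    # Parse whatever tabular text is currently on the system clipboard
    clip_df = pd.read_clipboard()
except Exception as err:
    # No clipboard (e.g. on a JupyterHub) or nothing table-like was copied
    clip_df = None
    print('Could not read the clipboard:', err)

clip_df
```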


This produces a DataFrame in memory. As with all DataFrames, you would be able to access:

* the columns
* the rows
* the `.attributes`
* the `.methods()` 
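
As a quick illustration of those four kinds of access, here is a sketch using a tiny made-up DataFrame (the names and values are placeholders, not the contents of `sample_table.html`):

```python
import pandas as pd

# Hypothetical stand-in for the copied table
table = pd.DataFrame({'name': ['barry allen', 'bruce wayne'],
                      'payload': [4000, 6500]})

print(table.columns.tolist())   # the columns
print(table.iloc[0])            # a row, selected by position
print(table.shape)              # an attribute: (rows, columns)
print(table.head(1))            # a method
```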

# Reading from CSV
---

We will focus on `csv` and `sql` for the remainder of this discussion.

Let's dive into `csv` first.

To read from `csv` files, we use the `pd.read_csv()` function.

In [None]:
# In the following results, notice the column headers.
# By default, pandas will use the first
#     row as a header row...
#     thus 'barry allen' shows up as the header for column 1.
# Maybe NOT what you want...

data = pd.read_csv('../universal_datasets/log_file.csv')
data

In [None]:
# If we include a list of names in the function
#     call, pandas will use those as the headers,
#     instead of the first row.

named_cols = pd.read_csv('../universal_datasets/log_file.csv',
                         names=['name', 
                                'email', 
                                'fmip', 
                                'toip',
                                'datetime', 
                                'lat', 
                                'long', 
                                'payload'])
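
To see the difference without opening the log file, here is a small sketch that reads a made-up two-line sample from memory using `io.StringIO`. The sample text and the three column names are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up two-line sample with no header row
raw = 'barry allen,barry@example.com,4000\nbruce wayne,bruce@example.com,6500\n'

# Default behavior: the first line is consumed as the header
default = pd.read_csv(io.StringIO(raw))

# With names=, every line is kept as data
named = pd.read_csv(io.StringIO(raw), names=['name', 'email', 'payload'])

print(len(default), default.columns.tolist())
print(len(named), named.columns.tolist())
```

Notice that the default read ends up with one fewer row of data, because the first record became the header.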

In [None]:
# The .info() method shows us some details 
#     about our new DataFrame
#     * The number and names of the columns
#     * The datatypes for each column
#     * etc

named_cols.info()

In [None]:
# If we want to see a single column, we can use
#     bracket notation

named_cols['fmip']

In [None]:
# You will see it, but I don't recommend using the dot notation:
#     df.column_name

named_cols.fmip
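
One reason to prefer brackets: dot notation breaks down when a column name collides with an existing DataFrame attribute or is not a valid Python identifier. A small sketch with a made-up DataFrame:

```python
import pandas as pd

# Made-up frame whose column names clash with, or cannot be, attributes
df = pd.DataFrame({'count': [3, 5], 'payload size': [4000, 6500]})

print(df['count'].tolist())          # bracket notation returns the column
print(callable(df.count))            # dot notation finds the .count() method instead
print(df['payload size'].tolist())   # names with spaces only work with brackets
```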

It is not necessary to ingest all the lines from a file. 

Some lines may lack useful information, such as:

* metadata
* header lines
* document data, etc.

In this case, let's skip rows 0, 1, and 2 from the csv.

In [None]:
skipped_rows = pd.read_csv('../universal_datasets/log_file_junk.csv', 
                           names=['name', 'email', 'fmip', 'toip',
                                  'datetime', 'lat', 'long', 'payload'],
                           skiprows=[0, 1, 2])

skipped_rows
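
A self-contained sketch of the same idea, using a made-up sample with three junk lines at the top. When the lines to skip are all at the start of the file, `skiprows` also accepts a plain integer count.

```python
import io
import pandas as pd

# Made-up sample: three lines of junk followed by two data lines
raw = ('# exported by a made-up tool\n'
       '# version 1\n'
       '# do not edit\n'
       'barry allen,4000\n'
       'bruce wayne,6500\n')

# Skip specific line numbers...
kept = pd.read_csv(io.StringIO(raw), names=['name', 'payload'],
                   skiprows=[0, 1, 2])

# ...or skip the first N lines
same = pd.read_csv(io.StringIO(raw), names=['name', 'payload'],
                   skiprows=3)

print(kept)
```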

You may receive files with alternate separators/delimiters. Pandas gives you tools to  
deal with this situation. 

In [None]:
# This file uses a 'pipe' character as the separator.

piped_data = pd.read_csv('../universal_datasets/log_file_pipes.csv',
                         names=['name', 'email', 'fmip',
                                'toip', 'datetime', 'lat',
                                'long', 'payload'],
                         sep='|')

piped_data
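
If you forget `sep`, pandas still reads the file, but each line lands in a single column. A quick sketch with made-up pipe-delimited text shows the symptom to watch for:

```python
import io
import pandas as pd

# Made-up pipe-delimited sample
raw = 'barry allen|4000\nbruce wayne|6500\n'

# Without sep, pandas looks for commas and finds none
wrong = pd.read_csv(io.StringIO(raw), header=None)

# With sep='|', the two fields are split as intended
right = pd.read_csv(io.StringIO(raw), header=None, sep='|')

print(wrong.shape)
print(right.shape)
```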

In [None]:
# This was a short file, but when you have thousands
#     of rows, sometimes you simply want a quick look at 
#     samples of the data.

# .tail() and .head() are good examples of this.

piped_data.tail(3)

In [None]:
piped_data.head(4)

## Indexing

When reading in data, `pandas` assigns a default index of `0..n`. Sometimes we want to use something different than the default indexing.

We **can choose** a particular column to be used as an index.

Here we choose to use the `datetime` column:

In [None]:
import pandas as pd
date_index = pd.read_csv('../universal_datasets/log_file.csv', 
                         names=['name', 'email', 'fmip', 'toip',
                                'datetime', 'lat', 'long', 'payload'],
                         index_col='datetime')

date_index

In [None]:
# If we have an index, we can select data from the DataFrame
# based on the index. In this case, since we just made the
# date/time our index, we can
# easily select rows based on the date/time stamps

date_index.loc['2016-02-06T21:44:56':'2016-02-06T21:49:36']
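
Unlike ordinary Python slices, label-based slices with `.loc` include **both** endpoints. Here is a sketch on a made-up four-line sample (the names, timestamps and payloads are placeholders):

```python
import io
import pandas as pd

# Made-up sample with a timestamp in the second field
raw = ('barry allen,2016-02-06T21:44:56,4000\n'
       'bruce wayne,2016-02-06T21:46:10,6500\n'
       'clark kent,2016-02-06T21:49:36,1200\n'
       'diana prince,2016-02-06T21:52:01,800\n')

stamped = pd.read_csv(io.StringIO(raw),
                      names=['name', 'datetime', 'payload'],
                      index_col='datetime')

# Both the start label and the end label are included in the result
window = stamped.loc['2016-02-06T21:44:56':'2016-02-06T21:49:36']
print(window)
```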


# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_read_01.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_read_01.py
```

1. With a text editor look inside the file `../universal_datasets/log_file_sign.csv` and identify the delimiter.
1. Open the file using the pandas `read_csv()` function and the following:
   * Assign the following names to the columns, in this order:

     ```
     name
     email
     fmip
     toip
     datetime
     lat
     long
     payload_size
     ```

   * Assign the correct delimiter based on what you found
   * Skip the even numbered rows using `range()` to count up to 1000 stepping by twos
   * Index the DataFrame using the `datetime` column
1. Display the data between the index `2016-01-29T22:27:34` and `2016-01-28T22:34:28`

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
import pandas as pd
df = pd.read_csv('../universal_datasets/log_file_sign.csv',
                 names=['name',
                        'email',
                        'fmip',
                        'toip',
                        'datetime',
                        'lat',
                        'long',
                        'payload_size'],
                 sep='!',
                 index_col='datetime',
                 skiprows=range(0, 1000, 2))

df.loc['2016-01-29T22:27:34':'2016-01-28T22:34:28']

# Missing data
---

Some files have missing data or markers indicating that data is not available.

In [None]:
data_na = pd.read_csv('../universal_datasets/log_file_na.csv', 
                      names=['name', 'email', 'fmip',
                             'toip', 'datetime', 'lat',
                             'long', 'payload'])

data_na


In [None]:
# You can drop any rows that contain NaN data:

data_na.dropna()
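
By default `.dropna()` drops a row if **any** value in it is missing. The `subset` and `how` parameters let you be more selective. A sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up frame with a few missing values
df = pd.DataFrame({'name': ['barry allen', 'bruce wayne', None],
                   'lat': [40.7, None, None],
                   'payload': [4000, 6500, 1200]})

print(len(df.dropna()))                  # keep rows with no missing values at all
print(len(df.dropna(subset=['name'])))   # only require 'name' to be present
print(len(df.dropna(how='all')))         # drop rows only if every value is missing
```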

Checking for missing values and converting them to the pandas `NaN` marker takes time, which may not be optimal when loading large files.

You **can** turn this process off with `na_filter=False`:

In [None]:
data_na = pd.read_csv('../universal_datasets/log_file_na.csv', 
                      names=['name', 'email', 'fmip',
                             'toip', 'datetime', 'lat', 
                             'long', 'payload'],
                      na_filter=False)

data_na.head(20)

You can provide a list of particular values to use as na values. 

Some files or software will use sentinels or flag values to represent a null value.

NOTE: in this case, pandas will combine the na_values you give with the built-in na values.

In [None]:
data_na = pd.read_csv('../universal_datasets/log_file_na.csv', 
                      names=['name', 'email', 'fmip',
                             'toip', 'datetime', 'lat',
                             'long', 'payload'],
                      na_values=['', '9999'])
data_na.head(20)

In [None]:
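# With keep_default_na=False, pandas uses ONLY the na_values
#     we supply and ignores its built-in list.

data_na = pd.read_csv('../universal_datasets/log_file_na.csv', 
                      names=['name', 'email', 'fmip',
                             'toip', 'datetime', 'lat',
                             'long', 'payload'],
                      na_values=['', '9999'],
                      keep_default_na=False)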

data_na.head(20)
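
To make the effect of `keep_default_na` concrete, here is a sketch on a made-up two-line sample in which one field holds the text `NA` (one of the pandas built-in markers) and another holds the sentinel `9999`:

```python
import io
import pandas as pd

# Made-up sample: 'NA' is a built-in marker, 9999 is our own sentinel
raw = 'barry allen,NA,9999\nbruce wayne,41.2,6500\n'
cols = ['name', 'lat', 'payload']

# Our values are added to the built-in list
combined = pd.read_csv(io.StringIO(raw), names=cols,
                       na_values=['', '9999'])

# Only our values count as missing
only_ours = pd.read_csv(io.StringIO(raw), names=cols,
                        na_values=['', '9999'], keep_default_na=False)

print(combined.isna().sum().tolist())
print(only_ours.isna().sum().tolist())
```

In the second read, `NA` survives as ordinary text in the `lat` column.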

In [None]:
# It is possible to tell pandas how many rows to read using:
# nrows


data_na = pd.read_csv('../universal_datasets/log_file_na.csv', 
                      names=['name', 'email', 'fmip',
                             'toip', 'datetime', 'lat',
                             'long', 'payload'],
                      na_values=['', '9999'], 
                      keep_default_na=False,
                      nrows=7)
data_na

In [None]:
# Sometimes the amount of data you need to process is 
#     too large to read into memory, so you need to process
#     it portion by portion.
# The chunksize argument allows you to identify how many
#     rows to read in at a time

data = pd.read_csv('../universal_datasets/log_file.csv', 
                   names=['name', 'email', 'fmip', 
                          'toip', 'datetime', 'lat',
                          'long', 'payload'],
                   chunksize=3)

for chunk in data:
    print('\n### pre-processing')
    print('### more pre-processing')
    print('### even more pre-processing')
    print(chunk[['name', 'lat', 'long', 'payload']])
    print('subtotal sum:', chunk['payload'].sum())
    print('### post processing\n')
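
A common pattern with `chunksize` is to keep a running result across chunks so the whole file never has to sit in memory. A minimal sketch on seven made-up rows (the names `reader`, `total` and `chunk_sizes` are illustrative):

```python
import io
import pandas as pd

# Seven made-up rows: user1,100 ... user7,700
raw = '\n'.join(f'user{i},{i * 100}' for i in range(1, 8)) + '\n'

reader = pd.read_csv(io.StringIO(raw), names=['name', 'payload'],
                     chunksize=3)

total = 0
chunk_sizes = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most 3 rows
    chunk_sizes.append(len(chunk))
    total += chunk['payload'].sum()

print(chunk_sizes, total)
```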

In [None]:
# If you want to convert data in one or more columns of your
#     DataFrame, you can use functions to transform the data.

# To convert multiple columns with different functions
#     you can use a dictionary to create a mapping that
#     defines which conversion function(s)
#     to use against which column(s)

In [None]:
def dsplitter(address):
    userid, domain = address.split('@')
    return domain

def date_only(datetime):
    return datetime.split('T')[0]

In [None]:
data = pd.read_csv('../universal_datasets/log_file.csv', 
                   names=['name', 'email', 'fmip',
                          'toip', 'datetime', 'lat',
                          'long', 'payload'],
                   converters={'email':dsplitter,
                               'datetime':date_only})
data
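
Converter functions receive the raw text of each field, so a malformed value can make them fail. For example, `dsplitter` raises an error for an address with no `@`. Here is a sketch of a more forgiving variant, tried on made-up data; `domain_or_blank` is a hypothetical name, not part of pandas.

```python
import io
import pandas as pd

def domain_or_blank(address):
    # partition() never raises; 'at' is empty when no '@' was found
    _, at, domain = address.partition('@')
    return domain if at else ''

# Made-up sample: the second address is malformed
raw = 'barry allen,barry@example.com\nbruce wayne,not-an-address\n'

converted = pd.read_csv(io.StringIO(raw), names=['name', 'email'],
                        converters={'email': domain_or_blank})
print(converted)
```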

In [None]:
# If you only want to retain certain columns, you can
#     identify which columns to keep, using:
# usecols


data = pd.read_csv('../universal_datasets/log_file.csv', 
                   names=['name', 'email', 'fmip',
                          'toip', 'datetime', 'lat',
                          'long', 'payload'],
                   usecols=['email', 'fmip', 'toip'])

data

# SQL
---

In [None]:
# pandas can read from sql databases easily...

import sqlite3
conn = sqlite3.connect('../universal_datasets/log_file.sql')
cur = conn.cursor()

df = pd.read_sql("SELECT * FROM customers", conn)
df.head()

In [None]:
# SQL string literals belong in single quotes
df1 = pd.read_sql("""SELECT date, email, lat, long FROM customers
                     WHERE name LIKE '%wayne%'""", conn)
df1.head()

In [None]:
df2 = pd.read_sql("""SELECT date, email, lat, long
                     FROM customers
                     WHERE name LIKE '%wayne%'""",
                  conn,
                  index_col='date')
df2.head()
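
When the search term comes from a variable, it is safer to pass it through the `params` argument than to paste it into the SQL string. A sketch using a throwaway in-memory SQLite database; the table contents here are made up and only loosely mirror the `customers` table.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with made-up rows
mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE customers (name TEXT, email TEXT, date TEXT)')
mem.executemany('INSERT INTO customers VALUES (?, ?, ?)',
                [('bruce wayne', 'bruce@example.com', '2016-02-06'),
                 ('barry allen', 'barry@example.com', '2016-02-07')])

pattern = '%wayne%'

# The ? placeholder is filled in from params by the sqlite3 driver
waynes = pd.read_sql('SELECT date, email FROM customers WHERE name LIKE ?',
                     mem, params=(pattern,), index_col='date')
mem.close()

print(waynes)
```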

# Writing to disk
---

In [None]:
# df2 holds only email, lat and long (date is the index),
#     so those are the columns we can write out.
df2.to_csv('class_out.csv',
           columns=['email', 'lat', 'long'],
           header=True, sep='|')
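
To check what `.to_csv()` produces without creating a file, you can write to an in-memory buffer and read the result straight back. A sketch with a made-up one-row DataFrame:

```python
import io
import pandas as pd

# Made-up frame indexed by date, similar in shape to a query result
out = pd.DataFrame({'email': ['bruce@example.com'],
                    'lat': [40.7],
                    'long': [-74.0]},
                   index=pd.Index(['2016-02-06'], name='date'))

buffer = io.StringIO()
out.to_csv(buffer, columns=['email', 'lat'], header=True, sep='|')
text = buffer.getvalue()
print(text)

# Reading it back requires the same separator
back = pd.read_csv(io.StringIO(text), sep='|', index_col='date')
print(back)
```

The index is written as the first field, and only the columns listed in `columns` are included.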


# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_read_02.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_read_02.py
```

1. Connect to the sql database `../universal_datasets/log_file.sql`, which contains the `customers` table
1. Read from the connection using the pandas `read_sql()` function to create a DataFrame with these characteristics:
    * Read in only the following columns: `email`, `lat`, `long`, and `date`
    * Choose only the rows where the name contains the string `barry`
1. Use DataFrame's `.head()` method to display just the first ten rows.

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>