# Read a zip archive

In this exercise we'll explore how to work with zip archives in Python.

The idea is to process the data without unpacking the archive to disk, which Python allows us to do.

In [121]:
import numpy as np
import scipy as sc
import pandas as pd
import zipfile as zf
import matplotlib as mp
from io import StringIO 
%matplotlib inline

## Open a zip archive

Our first step will be to open the zip archive.

We could use the generic open() and read() functions in binary mode, but then we would have to decipher the internal structure of the zip file ourselves.

This is a bad idea for a couple of reasons.  First, it would be a lot of work to reverse-engineer the zip data structure.  Second, the format has been extended over time and has optional features, so hand-written parsing code could break in ways we can't easily anticipate.

A much better approach is to use a python package written for that purpose.  Since there is a package for just about anything you want to do, we just have to find one.

One possibility is the zipfile module from the python standard library, which we have imported under the name 'zf'.

The statement that opens the archive for reading and assigns the resulting ZipFile object the name 'z' is:

z = zf.ZipFile('filename','r')

The file we will be using resides in the parent directory of this notebook, and its name is:

FARS2015NationalCSV.zip

This file contains data collected by the National Highway Traffic Safety Administration on fatal traffic accidents.  

It was downloaded from the NHTSA FARS website:

https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars
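
As a rough sketch of the same pattern, the cell below builds a tiny stand-in archive in memory and opens it for reading. The member name 'sample.csv', its contents, and the name 'z_demo' are made up for illustration; they are not part of the FARS data.

```python
import io
import zipfile as zf

# Build a tiny stand-in archive in memory (hypothetical contents,
# not the real FARS data).
buffer = io.BytesIO()
with zf.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.csv', 'id,value\n1,10\n2,20\n')

# ZipFile accepts a path or a file-like object; 'r' opens it read-only.
z_demo = zf.ZipFile(buffer, 'r')
print(z_demo.mode)
```

With a file on disk, the first argument would simply be the path to the archive.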

In [122]:
#

## Determining what kind of object we have

The python built-in function type() returns the class an object belongs to.

The syntax for an arbitrary python object called 'obj' is:

type(obj)

Use this function to display the type of the object we just created.

For a list of python's built-in functions, see:

https://docs.python.org/3/library/functions.html
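
For a quick feel of what type() reports, here is a small sketch on a few unrelated objects (the sample values and the name 'demo_type' are arbitrary):

```python
import io

# type() works on any object, built-in or not.
for obj in (42, 'text', b'raw', [1, 2], io.StringIO()):
    print(type(obj))

# The result is the class itself, so it can be compared directly.
demo_type = type(b'raw')
print(demo_type is bytes)
```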

In [123]:
#

## Print the embedded documentation 

One of the strengths of python is "introspection", which means that python objects have the ability to reveal their properties.

One of the most useful of these properties, although it is optional, is the 'doc' property.

Developers can provide as much built-in documentation as they want (or none at all).

For an arbitrary python object 'obj', the 'doc' information is accessed by:

obj.\_\_doc\_\_

(Note that there are two underscores preceding and following 'doc')

Use this feature to print the built-in documentation for our ZipFile object.
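
To see where this text comes from, the sketch below defines a hypothetical class 'Greeter' (invented for this illustration) with docstrings, then reads them back. The same attribute works on built-in functions such as len.

```python
class Greeter:
    """Hypothetical class used only to show where docstrings live."""

    def greet(self):
        """Return a fixed greeting."""
        return 'hello'


g = Greeter()
print(g.__doc__)        # docstring of the class
print(g.greet.__doc__)  # docstring of a method
print(len.__doc__)      # built-in functions carry one too
```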

In [124]:
#

## Using dir()

Another piece of the introspection feature is the built-in dir() function, which lists the attributes of a python object.

Because dir() is a function rather than a method, the object is passed as an argument.  The syntax for an arbitrary python object named 'obj' is:

dir(obj)

Use dir() to list the attributes of our ZipFile object.
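
As a small sketch on an unrelated object, the cell below lists the attributes of a string and filters out the names that begin with an underscore. The names 'attrs' and 'public' are arbitrary.

```python
# dir() takes the object as its argument and returns a list of names.
attrs = dir('abc')

# Names starting with an underscore are usually internal; hide them.
public = [name for name in attrs if not name.startswith('_')]
print(public[:5])
```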

In [125]:
#

## Examining the methods available

The attributes returned by dir() include the names of available methods for this class, i.e., the names of the functions we can run on an instance of the class.

The standard way to refer to an arbitrary attribute 'att' of a python object 'obj' is:

obj.att

Use this to reference the 'printdir' attribute of our zipfile object.
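
Referencing a method is not the same as calling it. A minimal sketch on a string (the names 'text', 'method' and 'same' are arbitrary):

```python
text = 'abc'

# No parentheses: this only looks the attribute up, nothing runs yet.
method = text.upper
print(method)
print(callable(method))

# getattr() performs the same lookup from a name held in a string.
same = getattr(text, 'upper')
print(same())
```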

In [126]:
#

## Using \_\_doc\_\_

Since methods are python objects, we should be able to use \_\_doc\_\_ to access the documentation on printdir.

See if you can print the \_\_doc\_\_ information for the printdir method.

In [127]:
#

## Calling a method

To call a method, we need to have an instance of the class that method belongs to.  

For an arbitrary method 'meth', if an arbitrary object 'obj' is an instance of a class that contains 'meth', we can invoke the method on 'obj' with the statement:

obj.meth()

Use this to invoke the printdir method on our zipfile object.

In [128]:
#

## Getting the names as a list

One of the advantages of reading a zip archive without expanding it is that we can automate the processing of the files it contains, rather than having to extract them and run a program against each one of them.

One of the steps in the automation will be to obtain a machine-readable list of the names of the files in the archive.

Use the \_\_doc\_\_ feature to print the documentation on the namelist method.

In [129]:
#

## Running namelist

Now invoke the namelist() method on our zipfile object.

Call the result 'nl'
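
To illustrate why a machine-readable list is useful, the sketch below builds a stand-in archive with two made-up members and loops over the names that namelist() returns. All file names here are hypothetical.

```python
import io
import zipfile as zf

# Stand-in archive with two hypothetical members.
buffer = io.BytesIO()
with zf.ZipFile(buffer, 'w') as archive:
    archive.writestr('first.csv', 'a,b\n1,2\n')
    archive.writestr('second.csv', 'a,b\n3,4\n')

with zf.ZipFile(buffer, 'r') as z_demo:
    names = z_demo.namelist()

# The result can be filtered and looped over like any other list.
csv_names = [name for name in names if name.endswith('.csv')]
for name in csv_names:
    print(name)
```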

In [130]:
#

## Determine the type returned

Use the type() function to determine the type of the object returned by namelist()

In [131]:
#

## Print the list of names

Print the filenames in the archive

In [132]:
#

## Determine how to read a single file from the archive

Print the \_\_doc\_\_ information for the zipfile method 'read'

In [133]:
#

## Read 'accident.csv' from the archive

Use the read() method to read the file 'accident.csv' from the archive into a buffer

Call the result 'acc'
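
Here is a minimal sketch of read() on a stand-in archive. The member 'sample.csv' and the name 'raw' are invented for the example.

```python
import io
import zipfile as zf

# Stand-in archive with one hypothetical member.
buffer = io.BytesIO()
with zf.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.csv', 'id,value\n1,10\n2,20\n')

# read() decompresses one member and returns its entire contents.
with zf.ZipFile(buffer, 'r') as z_demo:
    raw = z_demo.read('sample.csv')

print(type(raw))
print(len(raw))
print(raw[:8])
```

Note that read() loads the whole member into memory at once, which is worth keeping in mind for large files.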

In [134]:
#

## Determine the length and type of the result

Use type() and len() for this

In [135]:
#

## See if the result can be used as input to pandas read_csv

Trial and error: pass 'acc' directly to read_csv and see what happens.

In [136]:
#

## Converting the result to a suitable form

Currently, the expanded contents of the file 'accident.csv' are contained in an object of class 'bytes' called 'acc'

If we had a string object, we could turn it into a file-like object using the StringIO class from the io module.

Our first task is to see if we can turn 'acc' into a string.

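Use the dir() function to list the attributes of the 'bytes' class.  For an arbitrary python object called 'obj', the syntax is:

dir(obj)

(Note that there are no quotes around the name; dir('obj') would list the attributes of the string 'obj' instead.)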

In [137]:
#

## The decode() method

Use \_\_doc\_\_ to print the built-in documentation for the decode() method of the 'bytes' class

In [138]:
#

## Converting 'bytes' to 'str'

Use the decode() method to convert the 'bytes' object to a string.  For an arbitrary python object of class 'bytes' called 'obj', the syntax (using the default codec) is:

obj.decode()

Call the result 'accstr'
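
A small sketch of decode() on made-up data follows. The default codec is UTF-8; if the bytes are not valid UTF-8, decode() raises a UnicodeDecodeError and a different codec name would have to be passed. Which codec suits the FARS files is not stated here, so treat UTF-8 as an assumption.

```python
raw = b'id,value\n1,10\n'

# With no argument, decode() uses UTF-8.
text = raw.decode()
print(type(raw), type(text))

# Naming the codec explicitly gives the same result.
same = raw.decode('utf-8')
print(same == text)

# Bytes and characters are not always one-to-one: the accented
# letter below takes two bytes in UTF-8 but is one character.
accented = 'caf\u00e9'.encode('utf-8')
print(len(accented), len(accented.decode('utf-8')))
```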

In [139]:
#

## Determine the type of the result

Use the type() function to determine what type of object we have now

In [140]:
#

## Create a file-like object

Now that the contents of 'accident.csv' are in the form of a string, we can use the StringIO class to create a file-like object.  The syntax for an arbitrary string object called 'obj' is:

StringIO(obj)

Call the resulting object 'accfile'
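
To show what "file-like" means in practice, the sketch below wraps a made-up string in StringIO and hands it to a reader that expects an open file. It uses the standard-library csv module as a stand-in for pandas, purely to keep the example free of third-party dependencies. The names 'demo_file', 'header' and 'rows' are arbitrary.

```python
import csv
from io import StringIO

text = 'id,value\n1,10\n2,20\n'
demo_file = StringIO(text)

# A StringIO object supports the usual file methods.
header = demo_file.readline()
demo_file.seek(0)  # rewind to the start, as with a real file

# Anything that reads from an open text file can read from it.
rows = list(csv.DictReader(demo_file))
print(rows[0]['value'])
```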

In [141]:
#

## Determine the type of the file-like object

In [142]:
#

## Create a pandas dataframe

Now pass the file-like object in place of a filename to the pandas read_csv function.

Call the resulting dataframe 'adf'

Print the first five rows of the resulting dataframe.

In [143]:
#

## Print the first 5 rows of the dataframe

In [144]:
#