# Lab 4 - How to work with open data
*© 2020 Colin Conrad*

Welcome to Week 4 of INFO 6270! Last week we covered basic data cleaning and analysis using lists and dictionaries. This week we are making our way to our final lesson on basic Python: libraries and external files. This week we will do a few things that will be more relatable (and useful!) to most of you. We will start by learning how to navigate files in our Python environment before making our way to work with CSV and PDF files.

This week's work covers a **lot** of ground from [Sweigart (2020)](https://automatetheboringstuff.com/). Instead of going through these chapters in depth, we will borrow some of the material from Chapters 15 and 16 throughout and apply it in a way that is more relevant to our context. If you are interested (and have the time) it is also helpful to have read Chapters 6 and 7 on string manipulation and regular expressions. The latter is quite complex, however, and we will not cover it in this course; if you want to be a data science expert though, you should definitely learn regular expressions!

**This week, we will achieve the following objectives:**
- Locate files using Python
- Retrieve CSV data
- Analyze and write CSV data
- Retrieve and analyze PDF data
- Write PDF files

Weekly reading: Sweigart (2020) Ch. 15 and 16.

# Case: The 2016 Canadian Census
The Canadian Census Program is a data collection program conducted by Statistics Canada every five years and has occurred regularly since 1851, before confederation. Census records are maintained by two federal government agencies based on their date of record. Census records prior to 1926 are curated by [Library and Archives Canada](https://www.bac-lac.gc.ca/eng/census/Pages/census.aspx), while records after 1926 are maintained by [Statistics Canada](https://www12.statcan.gc.ca/census-recensement/index-eng.cfm). These records represent the most comprehensive data on the Canadian population and are (for the most part) publicly available.

In addition to census profiles on various regions throughout the country, the Census Program provides [detailed data tables](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/index-eng.cfm) on a variety of topics. Housing is one such topic and the data table titled ["Tenure including presence of mortgage payments and subsidized housing"](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110574&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=121&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0) is particularly relevant to understanding housing affordability among Canadians because it provides the number of Canadian households which reported unaffordable housing or whether their households were in need of repairs. In this last lab in a series related to housing security, we will retrieve and analyze the tables provided by the census to generate insights about housing needs in Canada and Halifax specifically, if so desired.

# Objective 1: Locate files using Python
Before we can get started with data science in earnest, we need to know more about one last core Python feature: *libraries*. As mentioned in class, one of the main advantages of Python versus other programming languages is that it is *high level and highly portable*. This is to say that you can do a lot with Python in a few lines of code. One of the main things that makes this possible is Python's collection of libraries.

In programming, a library is a collection of pre-defined routines that you can import into your code without writing them. These greatly accelerate the time that it takes to finish a programming task, and in some cases, save you years of work. Just like libraries designed for humans, Python programming libraries are generally curated by groups of people who ensure that the library is usable. As a free and open source programming language, experienced developers will often create and curate libraries for free, sometimes at great expense of their time!

Let's start by using the `pathlib` library. This library is provided in all Python distributions and is maintained by the Python Software Foundation. As Sweigart does in Chapter 9, we can import the `Path` class from the `pathlib` library, which can help us locate our files.

In [None]:
from pathlib import Path # imports the Path class into our environment

The code above imports a class called `Path`, which helps Python navigate our computer's file system. If you are taking this course, chances are that you would find it difficult to write code that can interface between our Python environment and the operating system. Fortunately, more experienced developers have created `Path` for us to use. Let's try calling it.

In [None]:
Path() # creates a Path object that points to the current folder ('.')

This is probably not that exciting to you yet. However, what `Path()` reveals is that Python is aware of the local directory it is running in. We can get more context by asking `Path` what its current directory is. We can do that using the `cwd()` ("**c**urrent **w**orking **d**irectory") method.

In [None]:
Path.cwd() # gets the full current working directory; yours is probably different from Colin's

How this works is complicated and well beyond the scope of this course. Fortunately, however, we did not have to understand how it works in order to use `Path`. This will be an ongoing theme from this point onwards. Libraries enhance our ability to do things--we don't need to know *how* they work for now, just that they do!
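
To make the difference between `Path()` and `Path.cwd()` more concrete, here is a minimal sketch. The variable names and the `img` folder name are only illustrative, and the folder does not need to exist for the code to run.

```python
from pathlib import Path

here = Path()         # a relative path to the current folder, shown as '.'
full = Path.cwd()     # the absolute path to that same folder
img = full / "img"    # the / operator joins path parts; "img" is an example name

print(here)
print(full.is_absolute())
print(img.name)
```

Building paths with `/` rather than by gluing strings together lets `pathlib` pick the right separator for Windows or Mac.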

## Navigating your local folder
Though `Path` is handy, its usefulness is not really evident until we pair it with another library. The `os` library is Python's library for navigating **o**perating **s**ystems, such as Windows or Mac OS. While `Path()` gives us paths, `os` allows Python to send commands to your computer. This is very handy if you want to change your directory or make new ones. Let's again start by importing os.

In [None]:
import os # import the os library

Let's see this library in action. One `os` function is called `listdir()`, which will give us the files and folders in our current directory.

In [None]:
os.listdir() # lists the files in the current directory

Chances are high that you downloaded this file as well as the `img` folder and placed them together. If so, you should see a series of files and folders, including Lab 4 and img. This is very handy for figuring out the names of files that we download! We can also take this one step further by navigating to the `img` subfolder. Let's create a new path containing our current path as well as the `img` subfolder. The following line will combine this file's current path and the `img` folder.

In [None]:
img_path = Path("img") # the path to the img subfolder

The os library also contains a `chdir()` method which allows us to **c**hange **d**irectories. We can now combine the `img_path` with the os library to navigate to the subfolder.

In [None]:
os.chdir(img_path) # change to the image path

If we now run `listdir()` again, we should see the contents of the `img` subfolder instead.

In [None]:
os.listdir()

Great work! Let's navigate back to the original path. We can do this by changing to the folder above using the ".." string. This is a feature of operating systems for moving up a level. This should bring us back to where we started.

In [None]:
first_path = Path("..") # denotes the directory above the current one
os.chdir(first_path) # change to the above directory
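
Moving up with ".." works when you know exactly how deep you are. A slightly more defensive pattern, sketched below, is to remember the starting folder with `Path.cwd()` and return to it directly. The temporary folder created by `tempfile` is only a stand-in so that the example runs anywhere; it is not part of the lab files.

```python
import os
import tempfile
from pathlib import Path

start = Path.cwd()                  # remember where we began

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)               # move into the temporary folder
    inside = Path.cwd()             # where we are now
    os.chdir(start)                 # return directly to the starting folder

back = Path.cwd()
print(back == start)
```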

### *Challenge Question 1 (2 points)*
The `os` library will be very helpful for many future tasks. It is also very important to refer to documentation on how the various libraries that we will use work. Using the [documentation for os](https://docs.python.org/3/library/os.html) look up the `mkdir()` function. Using the `mkdir` function of the os library, create a subdirectory (a.k.a. a subfolder) in your current folder called `data`. We will use this folder later to place our csv files.

In [None]:
# insert code here

# Objective 2: Retrieve CSV data
It's now time to apply this skill to something practical. One of the main reasons why we would want to navigate the operating system in our Python environment is so that we can read and write files. When working with open data in Python you will have to retrieve files from the internet and process them.

The first step with any data project is to source and attain the data that you plan to analyze. Statistics Canada's census portal provides a (somewhat complicated) interface for attaining the data that we are interested in. By default, the portal gives a user the ability to create a custom table containing only the data that a user is interested in. Visit the table provided at [this link](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110574&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=121&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0) and click on the Download tab. Select `CSV (comma-separated values) file` from the Download interface. This will download a CSV file into your downloads folder.

![alt text](img/4-1.png "Download the Table")

Rename this file to `w4_canada_housing.csv` and move it into the `/data` subdirectory that you created earlier. You can do this by right clicking on the file that you downloaded and selecting rename and then by dragging it to the relevant folder. Once your data is in the relevant folder, you are ready to interpret it.

## The CSV library
You will probably be unsurprised to learn that we will again leverage a library to read this file; in this case, the `csv` library. This library is the bread and butter of basic data science and we will come back to it almost every class from here on out. 

Python's [csv library](https://docs.python.org/3/library/csv.html) is a fantastic resource for reading and writing csv files. This one takes a little getting used to, so it is better to simply give an example of its basic use and then explain it. The following cell gives code for reading the file that you downloaded from Statistics Canada.

In [None]:
import csv
with open('data/w4_canada_housing.csv', newline='') as csvfile: # tells Python which file to read
    housing_reader = csv.reader(csvfile, delimiter=',') # draws on the reader object to read the file
    for row in housing_reader:
        print(row)

As you can see, the file is quite messy. The code really only has two unfamiliar parts to it. The first is the `with open('data/w4_canada_housing.csv', newline='') as csvfile:`. The `with open()` statement is the way that you command Python to open an external file. In this line of code, you are telling Python to open this file and call its contents `csvfile` in our environment.

The second unfamiliar piece of code is `housing_reader = csv.reader(csvfile, delimiter=',')`. The `csv.reader` is an object contained in the `csv` library which is designed to read csv files. In the `(csvfile, delimiter=',')` bit, we commanded the csv reader to read the opened `csvfile` and told it that each value in that file is separated (delimited) by the character `,`. CSV (comma separated values) files are simply a series of values separated by commas. In this line of code, we thus created a reader called `housing_reader` which reads the data inside of the csv file.

The reader object consists of a series of rows for each row in the csv file. We can loop through the rows using a for loop, just like with other data! Unfortunately, Statistics Canada's CSV files contain a lot of data which are not useful for this task. What we need is a way to clean the data efficiently. Fortunately, we learned this skill in Week 2; let's apply it to CSV files!
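
To see what the reader produces without needing the downloaded file, the sketch below feeds it a few lines of made-up text through `io.StringIO`, which behaves like an open file. The data are invented for illustration and are not taken from the census table.

```python
import csv
import io

# made-up data for illustration; note the comma inside the quoted header
text = 'region,"households, total"\nHalifax,100\nToronto,250\n'

reader = csv.reader(io.StringIO(text), delimiter=',')
rows = list(reader)   # each row becomes a list of strings

for row in rows:
    print(row)
```

Two things are worth noticing: every value comes back as a string (even the numbers), and the reader keeps `households, total` together because it was wrapped in quotes.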

### *Challenge Question 2 (2 points)*
Currently the Statistics Canada table is structured poorly for Python analysis. Fortunately most of the unusable data are structured in the same way, each sitting on a row that has only a single column. Modify the code below to do the following:
- check to see if the row contains too few columns
- append the rows that have the useful data to a list

Doing this will give us a "list of lists" (a.k.a. two dimensional list) which we can use for analysis later.

In [None]:
import csv

housing = []

with open('data/w4_canada_housing.csv', newline='') as csvfile:
    housing_reader = csv.reader(csvfile, delimiter=',')
    for row in housing_reader:
        # add some logic to filter out rows that have too few items
        # add some logic to append the row to the housing list
        pass # placeholder so that the cell runs; replace it with your own logic

#### Sample Test 
Should return:

`[['Housing indicators (5)', 'Total - Tenure including presence of mortgage payments and subsidized housing [4]', '  Owner', '    With mortgage', '    Without mortgage', '  Renter', '    Subsidized housing', '    Not subsidized housing', ' '], ['Total - Housing indicators [5]', '13798300', '9357290', '5680655', '3676630', '4441020', '575830', '3865190 '], ['   Adequacy: major repairs needed', '867565', '516640', '337990', '178645', '350925', '54300', '296625 '], ['   Suitability: not suitable', '670735', '253560', '199985', '53575', '417175', '48835', '368335 '], ['   Affordability: 30% or more of household income is spent on shelter costs', '3325950', '1550380', '1308780', '241600', '1775570', '238825', '1536740 '], ['   Adequacy, suitability or affordability: major repairs needed, or not suitable, or 30% or more of household income is spent on shelter costs [6]', '4373550', '2140660', '1694325', '446335', '2232895', '304675', '1928215 ']]`

In [None]:
print(housing)
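
Once the rows are stored as a list of lists, a single value can be reached with two indexes: the first picks the row and the second picks the column. The sketch below uses a shortened, two-column copy of the sample output above (the `Total` header is abbreviated here) and converts the strings to integers before doing arithmetic.

```python
# a shortened copy of two columns from the sample output above
table = [
    ["Housing indicators (5)", "Total"],
    ["Total - Housing indicators [5]", "13798300"],
    ["   Affordability: 30% or more of household income is spent on shelter costs", "3325950"],
]

total = int(table[1][1])          # row 1, column 1
unaffordable = int(table[2][1])   # row 2, column 1
share = unaffordable / total      # proportion of households

print(round(share * 100, 1))      # the share as a percentage
```

With these figures, roughly 24% of households reported spending 30% or more of their income on shelter costs.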

# Objective 3: Analyze and write CSV data

In addition to reading CSV files, the CSV library helps us write files. One of the most tangible, practical uses for Python in an office setting is that you can clean such files and return them accordingly. Let's start by retrieving the current `housing` list.

In [None]:
for h in housing: # print each line separately for readability
    print(h)

It would be desirable to reduce the length of the long titles, such as `Adequacy: major repairs needed`, and replace them with something more digestible. This would be a pain to do by hand in a conventional business analytics program.

## Cleaning your CSV data

An effective way to clean csv data is to create a function that iterates through a list. For instance, we already decided to save the contents of our csv file in a list called `housing`. We could create the `cleanCSV` function, which keeps only the text to the left of each colon, as follows.

In [None]:
def cleanCSV(csvfile):
    new_file = []
    i = 0 
    while i < len(csvfile): # the length of the number of rows
        new_list = [] # a placeholder for cleaned row strings
        j = 0
        while j < len(csvfile[i]): # the number of values in this row
            if ":" in csvfile[i][j]: 
                colon_index = csvfile[i][j].index(":") #retrieve the index of the colon
                new_list.append(csvfile[i][j][:colon_index]) # [:colon_index] retrieves the string characters to the left of the index.
            else:
                new_list.append(csvfile[i][j]) # if there is no colon in the value, append it to the placeholder
            j += 1
        new_file.append(new_list) # append the cleaned list to the first level list
        i += 1
    return(new_file) # returns the cleaned file
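
For comparison, the same colon-trimming idea can be written more compactly with `str.split` and a nested list comprehension. This is only an alternative sketch; `clean_rows` is a hypothetical name, and the two sample rows are abbreviated from the census output.

```python
def clean_rows(rows):
    """Keep only the text before the first colon in every value."""
    return [[value.split(":", 1)[0] for value in row] for row in rows]

# two abbreviated rows for illustration
sample = [
    ["   Adequacy: major repairs needed", "867565"],
    ["Total - Housing indicators [5]", "13798300"],
]

print(clean_rows(sample))
```

Values without a colon pass through unchanged, because splitting on a character that is not present returns the whole string.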

We can then create a `new_housing` list which contains the cleaned version of the `housing` data.

In [None]:
new_housing = cleanCSV(housing) # apply the function

for h in new_housing: # print each line separately for readability
    print(h)

This is a bit better. Some of the unwieldy titles have changed. We can then use this cleaned data to write a CSV file.

## Writing a CSV file

Just as it has a `reader`, the Python csv library has a `writer`. The writer also uses the `with open(` structure, though with the 'w' (write) option. It creates one new row at a time and is designed to be used inside a loop. Try executing the following code--you will be left with a cleaned data file called `w4_canada_housing_cleaned.csv` which you can open in Excel.

In [None]:
with open('data/w4_canada_housing_cleaned.csv', 'w', newline='') as csvfile:
    housing_writer = csv.writer(csvfile, delimiter=',')
    for row in new_housing:
        housing_writer.writerow(row)

![alt text](img/4-2.png "Cleaned Data")
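
A quick way to check that a file was written as expected is to read it straight back. The sketch below does this round trip in a temporary folder so that it does not touch your `data` folder; the file name and rows are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

rows = [["indicator", "total"], ["Adequacy", "867565"]]  # small example rows

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "example.csv"                # hypothetical file name
    with open(target, "w", newline="") as csvfile:
        csv.writer(csvfile).writerows(rows)              # write every row at once
    with open(target, newline="") as csvfile:
        read_back = list(csv.reader(csvfile))            # read the rows back in

print(read_back == rows)
```

Here `writerows()` is a shortcut for calling `writerow()` inside a loop.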

### *Challenge Question 3 (2 points)*
Though the `cleanCSV()` function currently trims away everything after the colon, there is still data which can be further cleaned. Modify the function to clean the data further. There are many ways to answer this question; you will be evaluated based on whether you:
- modify the cleanCSV function
- apply it to create an even cleaner csv file

#### Modify the cleanCSV function here!

In [None]:
def cleanCSV(csvfile):
    new_file = []
    i = 0 
    while i < len(csvfile): # the length of the number of rows
        new_list = [] # a placeholder for cleaned row strings
        j = 0
        while j < len(csvfile[i]): # the number of values in this row
            if ":" in csvfile[i][j]: 
                colon_index = csvfile[i][j].index(":") #retrieve the index of the colon
                new_list.append(csvfile[i][j][:colon_index]) # [:colon_index] retrieves the string characters to the left of the index.
            else:
                new_list.append(csvfile[i][j]) # if there is no colon in the value, append it to the placeholder
            j += 1
        new_file.append(new_list) # append the cleaned list to the first level list
        i += 1
    return(new_file) # returns the cleaned file

#### Apply the function

In [None]:
new_housing = cleanCSV(housing) # apply the function

#### Write the file

In [None]:
import csv

with open('data/w4_canada_housing_cleaned.csv', 'w', newline='') as csvfile:
    housing_writer = csv.writer(csvfile, delimiter=',')
    for row in new_housing:
        housing_writer.writerow(row)

# Objective 4: Retrieve and analyze PDF data

In addition to csv files, Python can often be used to process files generated by everyday business applications such as Adobe Acrobat and Microsoft Word. Unlike the csv library however, Python does not have built-in libraries for processing these types of files by default and we must install new libraries to add this functionality to our environment. 

## Installing libraries
The easiest way to install Python libraries is by using the `pip` tool. Pip is a recursive acronym which stands for "**p**ip **i**nstalls **p**ackages". It is a package management tool which indexes Python libraries and makes it easy to install them in your Python environment.

In order to complete this step you must go outside of your Jupyter environment. Look for a tool called the **Anaconda Prompt** which should have been installed on your computer when you installed Anaconda. This is a shell (a.k.a. command line) interface that uses your Anaconda Python installation rather than your system's Python. This comes with the advantage that we do not need to be administrators in order to install new stuff.

In the Anaconda Prompt, write the following command: `pip install PyPDF2`.

This will install the PyPDF2 library in your Anaconda environment. You can similarly install other Python libraries which you find interesting.
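
If you are unsure whether an installation worked, you can ask Python whether it can find the library before importing it. The following is a small sketch using the standard `importlib` module; `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(name):
    """Return True if Python can find a library with this name."""
    return importlib.util.find_spec(name) is not None

print(is_installed("csv"))      # built into Python
print(is_installed("PyPDF2"))   # depends on whether the pip command succeeded
```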

## PDF documents
Portable Document Format (PDF) documents are employed virtually everywhere in the working world. Frustratingly, organizations will occasionally post data tables in this format, which makes it difficult to employ analytical tools such as Excel or Tableau. Fortunately, these documents are files like any other, and programming languages such as Python can be used to interpret their data, though with some added difficulty.

In this week's data folder you will find a list of Halifax city councillors which was provided [online in pdf format](https://www.halifax.ca/sites/default/files/documents/city-hall/districts-councillors/CouncillorsExternalContactList.pdf). I speculate that this was an effort to prevent web crawlers from creating spam. We will now learn why this is futile.

Let's start by importing the PyPDF2 library and reading the document. The following code should bring the file's data into your Python environment.

In [None]:
import PyPDF2 # imports the PyPDF2 library
# note: PdfFileReader is the older PyPDF2 name; PyPDF2 3.0 and later call it PdfReader
pdfFileObj = open('data/halifax_councillors_list.pdf', 'rb') # creates a PDF file object which can be read by Python
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # creates a Reader object which can read the PDF file

We can now extract text from the document. As Sweigart points out in Chapter 15, the PyPDF2 library does not extract text perfectly, though it generally does a good job. The following code will set the target page and print the text contents.

In [None]:
pageObj = pdfReader.getPage(0) # set the page that we want to extract text from
print(pageObj.extractText()) # extract the text

With a method for extracting PDF text in hand, we can also opt to write that text in a more Python-friendly format, such as plain text (a.k.a. .txt). Python is equipped to write .txt files by default and we can write one using the same `open()` function that we used with csv files. The following code writes a text file with the city councillors' information and places it in the data folder.

In [None]:
councillors_text = open('data/councillors.txt','w') # opens a new write file called councillors.txt in the data folder
councillors_text.write(pageObj.extractText()) # writes the PDF contents in the txt file
councillors_text.close() # closes the txt file
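
The same write can also be done with the `with open(...)` structure that we used for csv files, which closes the file automatically even if something goes wrong partway through. The sketch below uses a stand-in string and a temporary folder rather than the real PDF text, so the file name and contents are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

text = "Example line one\nExample line two\n"   # stand-in for extracted PDF text

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "example.txt"       # hypothetical file name
    with open(target, "w", encoding="utf-8") as outfile:
        outfile.write(text)                     # no close() call needed
    was_closed = outfile.closed                 # the file closed when the with block ended
    saved = target.read_text(encoding="utf-8")  # read the file back to check it

print(was_closed)
print(saved == text)
```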

### *Challenge Question 4 (2 points)*
As Sweigart points out in Chapter 15, PDF documents are difficult to work with. You have been provided with a second file called `halifax_sorting_guide.pdf`. This document has a few tables and text scattered across four pages. Using Python, extract the data from the third page which concerns the communities in Area I. Your script should:
- Open the `halifax_sorting_guide.pdf` as a pdfFileObj
- Retrieve the third page
- Retrieve the part of the string that corresponds to the list of communities in Area I
- Print your substring
- **Hint:**  `pageObj.extractText()` will retrieve the data as a string. You may wish to use the `.index()` method to retrieve the location of the "AREA I" and "Area II" substrings. You don't need to do this in a fancy way -- you are welcome to locate the range of the substring with trial and error if it makes more sense to you!

In [None]:
import PyPDF2
# insert code here!

# Objective 5: Write PDF files
Finally, we can also use Python to write, and even combine, PDF files! Like the PDF reader, PyPDF2 also provides a PDF writer. As Sweigart points out, however, this library is limited in the sense that it cannot modify PDF pages, just write pre-existing pages. This still has a few obvious uses, such as
- Reducing redundant PDF pages
- Combining select pages from various PDF documents
- Merging PDF documents

Consider reading through Sweigart Chapter 15 to learn more about how the PDF writer works!

### *Challenge Question 5 (2 points)*
Using the PyPDF2 library we can retrieve the city councillors page and merge it with the garbage collection guide. Write a script that achieves the following:
- Open the Halifax Sorting Guide
- Open the Halifax Councillors List
- Create a writer instance
- Loop through Sorting Guide pages and add them to the writer
- Loop through the Councillors List pages and add them to the writer
- Write a new file

**Hint:** Sweigart gives a useful example in Chapter 15. You are welcome to borrow from his example as long as you cite the reference at the end of your document.

In [None]:
# insert code here!

# References
Statistics Canada (2016). Statistics Canada Catalogue no. 98-400-X2016230. *2016 Census of Population*.