# Introduction

Source

Practical Python Data Wrangling and Data Quality
by Susan E. McGregor
2022 published by O'Reilly Media

File-based formats

They contain historical data in static files which can be downloaded from a database, emailed, or accessed via file-sharing sites.

Feed-based data

Real-time data sources. These sources have their own unique formats and structures and are often accessed via specialized application programming interfaces (APIs). Accessing an API "endpoint" returns the most recent data.

[Citi Bike's real-time json feed](https://gbfs.citibikenyc.com/gbfs/en/station_status.json)

*note: these are not perfect distinctions, they can also be seen as complementary as one can be used to augment the other*

# Structured Versus Unstructured Data

The goal of most data wrangling projects is to generate insight and, often, to use data to make better decisions. But decisions are time sensitive, so our work with data also requires balancing trade-offs...As long as we can gain these efficiencies without sacrificing too much in terms of data quality, improving the timeliness of our data work can also increase its impact.

One of the simplest ways to make our data wrangling more efficient is to seek out data formats that are easy for Python and other computational tools to access and understand...structured, machine-readable data...the United States legal definition of "machine readable" data from the Foundations for Evidence-Based Policymaking Act of 2019: data in a format that can be easily processed by a computer without human intervention while ensuring no semantic meaning is lost.  

Structured data

Organized and classified in some way, into some version of records and fields, such as rows and columns, lists of objects, or dictionaries.
Examples: xls, xlsx, ods, tsv, csv, dbf, spss, txt

Unstructured data

May consist of different data types, combining text, numbers, photographs, images, waveforms of sound...Has some sort of record-and-field structure.
Examples: xml, json, rss, atom, doc(x), pdf, mp3, jpg

Unstructured to Structured

Collecting information about the world and applying structure to it...organizing information.

*Note: structure influences how it can be analyzed...data is the product of inherently subjective human choices...which reflect interests and priorities...trade-offs...inheriting bias...engaging a robust data quality process*



# Smart Searching for Specific Data Types

* Use the file extension as a keyword in search terms (.csv)
* Use the desired source, such as part of a URL, as a keyword (.com)
* Locate only secure websites (https)
* Use a hyphen (-) before a term to exclude results that contain it (-apple)


# Working with Structured Data

The TABLE (or collection of)

### File-Based, Table-Type Data—Take It to Delimit

See extension: 

.csv
Comma-separated value files

.tsv
Tab-separated value files

.txt
Structured data files with this extension are often .tsv files in disguise; older data systems often labeled tab-separated data with the .txt extension. Open and review any data file you want to wrangle with a basic text program (or a code editor like Atom).

.xls(x)
Spreadsheets produced with Microsoft Excel. Because these files can contain multiple “sheets” in addition to formulas, formatting, and other features that simple delimited files cannot replicate, they need more memory to store the same amount of data

.ods
Open-document spreadsheet files are the default extension for spreadsheets produced by a number of open source software suites like LibreOffice and OpenOffice and have limitations and features similar to those of .xls(x) files
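
The "open and review" step for .txt files can also be done from Python by printing the raw first line of a file and counting candidate delimiters. This is only a rough heuristic sketch, not a method from the book: the helper name `peek_delimiter` and the sample file contents are made up for illustration.

```python
import os
import tempfile


def peek_delimiter(path, candidates=(",", "\t")):
    """Return the first line of a file and the candidate delimiter it contains most often."""
    with open(path, "r", newline="") as source_file:
        first_line = source_file.readline()
    # count how often each candidate delimiter occurs in the header line
    counts = {candidate: first_line.count(candidate) for candidate in candidates}
    best_guess = max(counts, key=counts.get)
    return first_line, best_guess


# build a small tab-separated file with a .txt extension (hypothetical data)
temp_dir = tempfile.mkdtemp()
sample_path = os.path.join(temp_dir, "sample.txt")
with open(sample_path, "w", newline="") as sample_file:
    sample_file.write("name\tcity\tyear\nAda\tLondon\t1815\n")

first_line, guessed_delimiter = peek_delimiter(sample_path)
# repr() makes whitespace characters such as \t and \n visible
print(repr(first_line))
print(repr(guessed_delimiter))
```

Counting characters in a single line can be fooled by commas or tabs inside the data itself, so the result is best treated as a hint to confirm by eye.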

# Reading data from CSVs

In [None]:
# a simple example of reading data from a .csv file with Python
# using the "csv" library.
# the source data was sampled from the Citi Bike system data:
# https://drive.google.com/file/d/17b461NhSjf_akFWvjgNXQfqgh9iFxCu_/
# which can be found here:
# https://s3.amazonaws.com/tripdata/index.html
# import the `csv` library 
import csv
# open the `202009CitibikeTripdataExample.csv` file in read ("r") mode
# this file should be in the same folder as our Python script or notebook
#source_file = open("202009CitibikeTripdataExample.csv","r")
path = "/content/202009CitibikeTripdataExample.csv"
source_file = open(path,"r")
# pass our `source_file` as an ingredient to the `csv` library's
# DictReader "recipe".
# store the result in a variable called `citibike_reader`
citibike_reader = csv.DictReader(source_file)
# the DictReader method has added some useful information to our data,
# like a `fieldnames` property that lets us access all the values
# in the first or "header" row
print(citibike_reader.fieldnames)
# let's just print out the first 5 rows - i values of 0, 1, 2, 3, and 4.
for i in range(0,5):
    print(next(citibike_reader))

['tripduration', 'starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid', 'usertype', 'birth year', 'gender']
{'tripduration': '4225', 'starttime': '2020-09-01 00:00:01.0430', 'stoptime': '2020-09-01 01:10:26.6350', 'start station id': '3508', 'start station name': 'St Nicholas Ave & Manhattan Ave', 'start station latitude': '40.809725', 'start station longitude': '-73.953149', 'end station id': '116', 'end station name': 'W 17 St & 8 Ave', 'end station latitude': '40.74177603', 'end station longitude': '-74.00149746', 'bikeid': '44317', 'usertype': 'Customer', 'birth year': '1979', 'gender': '1'}
{'tripduration': '1868', 'starttime': '2020-09-01 00:00:04.8320', 'stoptime': '2020-09-01 00:31:13.7650', 'start station id': '3621', 'start station name': '27 Ave & 9 St', 'start station latitude': '40.7739825', 'start station longitu

* csv: the workhorse library when it comes to dealing with table-type data
* open() is a built-in function that takes a filename and a “mode” as parameters; the “mode” can be "r" for “read” or "w" for “write”
* citibike_reader.fieldnames shows the exact label for each column
* range() gives us a way to execute some piece of code a specific number of times, starting with the value of the first argument and ending just *before* the value of the second
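
The points above can be tried without the Citi Bike download. The following sketch writes a tiny stand-in file and reads it back; the file name `sample_trips.csv`, the three-column header, and the values are all made up for illustration, and the `with` statement is used here (instead of a bare `open()`) so the file is closed automatically.

```python
import csv
import os
import tempfile

# hypothetical sample data; NOT real Citi Bike records
sample_rows = [
    ["tripduration", "bikeid", "usertype"],
    ["100", "1", "Customer"],
    ["200", "2", "Subscriber"],
    ["300", "3", "Subscriber"],
]

# write the sample to a temporary .csv file in write ("w") mode
temp_dir = tempfile.mkdtemp()
sample_path = os.path.join(temp_dir, "sample_trips.csv")
with open(sample_path, "w", newline="") as output_file:
    csv.writer(output_file).writerows(sample_rows)

# read it back in read ("r") mode with DictReader
first_rows = []
with open(sample_path, "r", newline="") as source_file:
    sample_reader = csv.DictReader(source_file)
    # the header row becomes the `fieldnames` list
    header = sample_reader.fieldnames
    print(header)
    # range(0, 2) yields 0 and 1, so only the first two data rows are read
    for i in range(0, 2):
        row = next(sample_reader)
        first_rows.append(row)
        print(row)
```

Note that every value comes back as a string, just as in the Citi Bike output above.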

### Adding Iterators: The range Function

Python’s for loop is designed to run through all values in a list or a dataset by default.

Iterator variable

Like any variable, you can name an iterator anything you like, though i (for iterator!) is traditional. One place where Python iterators typically appear is within the range function, another example of a control flow function.

The range function includes an iterator variable that lets us write a slightly different kind of for loop: one that goes through a certain number of rows, rather than all of them.

All items:
```python
for item in complete_list_of_items:
    pass  # action here
```
A certain number of items:
```python
# general pattern (the names are placeholders)
for item_position in range(starting_position, number_of_places_to_move):
    pass  # action here

# the concrete loop from the CSV example above
for i in range(0, 5):
    print(next(citibike_reader))
```
When the range iterates over the values specified in the parentheses, it includes the first number but excludes the second.
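
To see the include-the-first, exclude-the-second behavior directly, here is a minimal sketch that uses a plain list as a stand-in for the reader; the list contents and variable names are made up for illustration.

```python
# a stand-in for a reader: any iterator that hands out one row per next() call
rows = ["row 0", "row 1", "row 2", "row 3", "row 4", "row 5", "row 6"]
row_iterator = iter(rows)

visited_positions = []
first_five = []
for i in range(0, 5):
    # i takes the values 0, 1, 2, 3 and 4; the loop stops before 5
    visited_positions.append(i)
    first_five.append(next(row_iterator))

print(visited_positions)
print(first_five)

# the iterator remembers its place, so the next call continues after the loop
next_row = next(row_iterator)
print(next_row)
```

Because the iterator keeps its position, rows consumed inside the loop are not returned again afterward; the same holds for a `csv.DictReader`.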


# Reading data from TSV and TXT files

The DictReader function has a delimiter option. By default, DictReader assumes that the comma character (,) is the separator it should look for, but you can simply specify a different character when you call the function. Here we specify the tab character (\t), but we could easily substitute any delimiter we prefer (or that appears in a particular source file); see the [csv documentation](https://docs.python.org/3/library/csv.html).
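
As a small sketch of the delimiter option, the example below reads the same two fields from in-memory text (using `io.StringIO` in place of a file, which is an assumption made to keep the example self-contained). It also shows what happens when the delimiter is left at its default for tab-separated data.

```python
import csv
import io

# tab-separated text, read with the matching delimiter
tab_data = io.StringIO("Justice\tState\nWilliam Strong\tPA\n")
tab_reader = csv.DictReader(tab_data, delimiter="\t")
tab_header = tab_reader.fieldnames
tab_row = next(tab_reader)
print(tab_header, tab_row)

# the same idea works for any single-character delimiter, such as a pipe
pipe_data = io.StringIO("Justice|State\nWilliam Strong|PA\n")
pipe_reader = csv.DictReader(pipe_data, delimiter="|")
pipe_row = next(pipe_reader)
print(pipe_row)

# with the default comma delimiter, the whole tab-separated header is one field
wrong_reader = csv.DictReader(io.StringIO("Justice\tState\nWilliam Strong\tPA\n"))
wrong_header = wrong_reader.fieldnames
print(wrong_header)
```

A single long field name containing `\t` is a quick sign that the wrong delimiter was used.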

In [None]:
# a simple example of reading data from a .tsv file with Python, using
# the `csv` library. The source data was downloaded as a .tsv file
# from Jed Shugerman's Google Sheet on prosecutor politicians: 
# https://docs.google.com/spreadsheets/d/1E6Z-jZWbrKmit_4lG36oyQ658Ta6Mh25HCOBaz7YVrA
# import the `csv` library
import csv
# open the `ShugermanProsecutorPoliticians-SupremeCourtJustices.tsv` file
# in read ("r") mode.
# this file should be in the same folder as our Python script or notebook
#tsv_source_file = open("ShugermanProsecutorPoliticians-SupremeCourtJustices.tsv","r")
path = "/content/Shugerman Research on Rise of Prosecutor Politicians - Supreme Court Justices.tsv"
tsv_source_file = open(path,"r")
# pass our `tsv_source_file` as an ingredient to the csv library's
# DictReader "recipe."
# store the result in a variable called `politicians_reader`
politicians_reader = csv.DictReader(tsv_source_file, delimiter='\t')
# the DictReader method has added some useful information to our data,
# like a `fieldnames` property that lets us access all the values
# in the first or "header" row
print(politicians_reader.fieldnames)
# we'll use the `next()` function to print just the first row of data
print (next(politicians_reader))

['', 'Justice', 'Term Start/End', 'Party', 'State', 'Pres Appt', 'Other Offices Held', 'Relevant Prosecutorial Background']
{'': '40', 'Justice': 'William Strong', 'Term Start/End': '1870-1880', 'Party': 'D/R', 'State': 'PA', 'Pres Appt': 'Grant', 'Other Offices Held': 'US House, Supr Court of PA, elect comm for elec of 1876', 'Relevant Prosecutorial Background': 'lawyer'}


This dataset was listed in Jeremy Singer-Vine’s (@jsvine) “Data Is Plural” newsletter (https://data-is-plural.com).

*Note:  Changing the extension of a file (for
example, from .tsv to .txt or vice versa) does absolutely nothing to change its contents.
All it does is change what your computer assumes should be done with it...Just specify the correct delimiter*

In [None]:
# a simple example of reading data from a .tsv file with Python, using
# the `csv` library. The source data was downloaded as a .tsv file
# from Jed Shugerman's Google Sheet on prosecutor politicians:
# https://docs.google.com/spreadsheets/d/1E6Z-jZWbrKmit_4lG36oyQ658Ta6Mh25HCOBaz7YVrA
# the original .tsv file was renamed with a file extension of .txt
# import the `csv` library
import csv
# open the `ShugermanProsecutorPoliticians-SupremeCourtJustices.txt` file
# in read ("r") mode.
# this file should be in the same folder as our Python script or notebook
#txt_source_file = open("ShugermanProsecutorPoliticians-SupremeCourtJustices.txt","r")
path = "/content/Shugerman Research on Rise of Prosecutor Politicians - Supreme Court Justices.txt"
txt_source_file = open(path,"r")
# pass our txt_source_file as an ingredient to the csv library's DictReader
# "recipe" and store the result in a variable called `politicians_reader`
# add the "delimiter" parameter and specify the tab character, "\t"
politicians_reader = csv.DictReader(txt_source_file, delimiter='\t')
# the DictReader function has added useful information to our data,
# like a label that shows us all the values in the first or "header" row
print(politicians_reader.fieldnames)
# we'll use the `next()` function to print just the first row of data
print (next(politicians_reader))


['', 'Justice', 'Term Start/End', 'Party', 'State', 'Pres Appt', 'Other Offices Held', 'Relevant Prosecutorial Background']
{'': '40', 'Justice': 'William Strong', 'Term Start/End': '1870-1880', 'Party': 'D/R', 'State': 'PA', 'Pres Appt': 'Grant', 'Other Offices Held': 'US House, Supr Court of PA, elect comm for elec of 1876', 'Relevant Prosecutorial Background': 'lawyer'}


### Escaped

Whitespace characters have to be escaped when we’re using them in code.

* Here we’re using the escaped character for a tab, which is \t.
* Another common whitespace character code is \n for newline (or \r for return).
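
A short sketch of how these escapes behave in Python strings follows; the sample text reuses two fields from the output above, and the variable names are illustrative.

```python
# "\t" is written with two keystrokes but is a single tab character
tab_length = len("\t")
# a raw string (r"...") turns escaping off, leaving a backslash and a "t"
raw_length = len(r"\t")
print(tab_length, raw_length)

tab_separated = "Justice\tState\nWilliam Strong\tPA\n"
# repr() shows the escape codes instead of the invisible whitespace
print(repr(tab_separated))

# splitlines() breaks on newline characters; split("\t") breaks on tabs
lines = tab_separated.splitlines()
fields = lines[1].split("\t")
print(lines)
print(fields)
```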

# Understanding Unemployment

Unemployment numbers are released monthly by the Bureau of Labor Statistics (BLS).
Six unemployment rates are calculated each month.  The one most reported on is the "U3" unemployment rate, per BLS, "Total unemployed, as a percent of the civilian labor force (official unemployment rate)."  BLS defines the "U6" unemployment rate as, "Total unemployed, plus all persons marginally attached to the labor force, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all persons marginally attached to the labor force."  Persons marginally attached to the labor force are those who currently are neither working nor looking for work but indicate that they want and are available for a job and have looked for work sometime in the past 12 months.  Persons employed part time for economic reasons are those who want and are available to work full time but have had to settle for part time.  So, the U6 rate is higher than the U3 rate.
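
The two definitions quoted above translate into simple ratios. The sketch below is only an illustration of that arithmetic: the function names are made up, and the counts (in millions) are hypothetical rather than BLS figures.

```python
def u3_rate(unemployed, labor_force):
    """Total unemployed as a percent of the civilian labor force."""
    return 100 * unemployed / labor_force


def u6_rate(unemployed, marginally_attached, part_time_economic, labor_force):
    """Unemployed + marginally attached + part time for economic reasons,
    as a percent of the civilian labor force plus the marginally attached."""
    numerator = unemployed + marginally_attached + part_time_economic
    denominator = labor_force + marginally_attached
    return 100 * numerator / denominator


# hypothetical counts in millions, chosen only to make the arithmetic easy to follow
unemployed = 6.0
marginally_attached = 2.0
part_time_economic = 4.0
labor_force = 160.0

u3 = u3_rate(unemployed, labor_force)
u6 = u6_rate(unemployed, marginally_attached, part_time_economic, labor_force)
print(round(u3, 2), round(u6, 2))
```

With these made-up inputs, U6 comes out higher than U3 because its numerator adds two more groups of people while its denominator grows only by the marginally attached.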

The St. Louis Federal Reserve is a source of economic datasets in many formats, including .xlsx and feed-type formats. [FRED U-6](https://fred.stlouisfed.org/series/U6RATE) gives data back to 1994, when the measure was first created. To obtain U3, use ADD LINE, UNRATE, Add series. When downloaded, the tables include metadata. Let us download several formats (although .csv is preferable): .xlsx, .ods, .xls

In [1]:
!pip install openpyxl
!pip install pyexcel-ods
!pip install xlrd==2.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyexcel-ods
  Downloading pyexcel_ods-0.6.0-py2.py3-none-any.whl (10 kB)
Collecting odfpy>=1.3.5
  Downloading odfpy-1.4.1.tar.gz (717 kB)
  Preparing metadata (setup.py) ... done
Collecting pyexcel-io>=0.6.2
  Downloading pyexcel_io-0.6.6-py2.py3-none-any.whl (44 kB)
Collecting lml>=0.0.4
  Downloading lml-0.1.0-py2.py3-none-any.whl (10 kB)
Building wheels for collected packages: odfpy
  Building wheel for odfpy (setup.py) ... done
  Created wheel for odfpy: filename=odfpy-1.4.1-py2.py3-none-any.whl size=160691 sh

openpyxl library to access (or parse) .xlsx files [documentation](https://openpyxl.readthedocs.io/en/stable/tutorial.html)

pyexcel-ods library for .ods files [documentation](https://docs.pyexcel.org/en/latest/)

xlrd library for reading .xls files [documentation](https://xlrd.readthedocs.io/en/latest/)

In [2]:
# an example of reading data from an .xlsx file with Python, using the "openpyxl"
# library. First, you'll need to pip install the openpyxl library:
# https://pypi.org/project/openpyxl/
# the source data can be composed and downloaded from:
# https://fred.stlouisfed.org/series/U6RATE
# specify the "chapter" you want to import from the "openpyxl" library
# in this case, "load_workbook"
from openpyxl import load_workbook
# import the `csv` library, to create our output file
import csv
# pass our filename as an ingredient to the `openpyxl` library's
# `load_workbook()` "recipe"
path = '/content/fredgraph.xlsx'
# store the result in a variable called `source_workbook`
source_workbook = load_workbook(filename = path)

In [3]:
# an .xlsx workbook can have multiple sheets
# print their names here for reference
print(source_workbook.sheetnames)

['FRED Graph']


In [17]:
# loop through the worksheets in `source_workbook`
for sheet_num, sheet_name in enumerate(source_workbook.sheetnames):
    # create a variable that points to the current worksheet by
    # passing the current value of `sheet_name` to `source_workbook`
    current_sheet = source_workbook[sheet_name]
    # print `sheet_name`, just to see what it is
    print(sheet_name)
    # create an output file called "xlsx_"+sheet_name
    output_file = open("xlsx_"+sheet_name+".csv","w")
    # use this csv library's "writer" recipe to easily write rows of data
    # to `output_file`, instead of reading data *from* it
    output_writer = csv.writer(output_file)
    # loop through every row in our sheet
    for row in current_sheet.iter_rows():
        # we'll create an empty list where we'll put the actual
        # values of the cells in each row
        row_cells = []
        # for every cell (or column) in each row....
        for cell in row:
            # let's print what's in here, just to see how the code sees it
            print(cell, cell.value)
            # add the values to the end of our list with the `append()` method
            row_cells.append(cell.value)
        # write our newly (re)constructed data row to the output file,
        # once per row, after all of its cells have been collected
        output_writer.writerow(row_cells)
    # officially close the `.csv` file we just wrote all that data to
    output_file.close()

FRED Graph
<Cell 'FRED Graph'.A1> FRED Graph Observations
<Cell 'FRED Graph'.B1> None
<Cell 'FRED Graph'.C1> None
<Cell 'FRED Graph'.A2> Federal Reserve Economic Data
<Cell 'FRED Graph'.B2> None
<Cell 'FRED Graph'.C2> None
<Cell 'FRED Graph'.A3> Link: https://fred.stlouisfed.org
<Cell 'FRED Graph'.B3> None
<Cell 'FRED Graph'.C3> None
<Cell 'FRED Graph'.A4> Help: https://fredhelp.stlouisfed.org
<Cell 'FRED Graph'.B4> None
<Cell 'FRED Graph'.C4> None
<Cell 'FRED Graph'.A5> Economic Research Division
<Cell 'FRED Graph'.B5> None
<Cell 'FRED Graph'.C5> None
<Cell 'FRED Graph'.A6> Federal Reserve Bank of St. Louis
<Cell 'FRED Graph'.B6> None
<Cell 'FRED Graph'.C6> None
<Cell 'FRED Graph'.A7> None
<Cell 'FRED Graph'.B7> None
<Cell 'FRED Graph'.C7> None
<Cell 'FRED Graph'.A8> U6RATE
<Cell 'FRED Graph'.B8> Total Unemployed, Plus All Persons Marginally Attached to the Labor Force, Plus Total Employed Part Time for Economic Reasons, as a Percent of the Civilian Labor Force Plus All Persons Margin