# Introduction

File-based formats

They contain historical data in static files which can be downloaded from a database, emailed, or accessed via file-sharing sites.

Feed-based data

Real-time data sources. These sources have their own unique formats and structures. Access is often through specialized application programming interfaces (APIs). Requesting an API "endpoint" returns the most recent data.

[Citi Bike's real-time json feed](https://gbfs.citibikenyc.com/gbfs/en/station_status.json)
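As a rough sketch of what reading a feed like this involves, the snippet below parses a small JSON payload with Python's built-in `json` library. The payload is a hand-written stand-in, and its field names (`data`, `stations`, `station_id`, `num_bikes_available`) are assumptions about the feed's layout, not something confirmed in these notes. In a live setting the text would come from requesting the endpoint rather than from a hardcoded string.

```python
import json

# hypothetical, hand-written sample standing in for a real-time feed response;
# the field names are assumptions, not copied from the live feed
sample_feed_text = """
{
  "last_updated": 1600000000,
  "data": {
    "stations": [
      {"station_id": "72", "num_bikes_available": 5},
      {"station_id": "79", "num_bikes_available": 0}
    ]
  }
}
"""

# convert the JSON text into Python dictionaries and lists
feed = json.loads(sample_feed_text)

# each station is one "record"; its keys are the "fields"
for station in feed["data"]["stations"]:
    print(station["station_id"], station["num_bikes_available"])

# keep only the ids of stations that currently have at least one bike
stations_with_bikes = [
    s["station_id"]
    for s in feed["data"]["stations"]
    if s["num_bikes_available"] > 0
]
print(stations_with_bikes)
```

Because a feed only shows the most recent state, running the same request later would generally return different values.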

*Note: these distinctions are not perfect. The two can also be seen as complementary, since one can be used to augment the other.*

# Structured Versus Unstructured Data

The goal of most data wrangling projects is to generate insight and, often, to use data to make better decisions. But decisions are time sensitive, so our work with data also requires balancing trade-offs...As long as we can gain these efficiencies without sacrificing too much in terms of data quality, improving the timeliness of our data work can also increase its impact.

One of the simplest ways to make our data wrangling more efficient is to seek out data formats that are easy for Python and other computational tools to access and understand...structured, machine-readable data...the United States legal definition of "machine readable" data from the Foundations for Evidence-Based Policymaking Act of 2019: data in a format that can be easily processed by a computer without human intervention while ensuring no semantic meaning is lost.  

Structured data

Organized and classified in some way, into some version of records and fields, such as rows and columns, lists of objects, or dictionaries.
Examples: xls, xlsx, ods, tsv, csv, dbf, spss, txt
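
To make "records and fields" concrete, here is a small sketch of the same two records stored first as rows and columns and then as a list of dictionaries. The column names and values are invented purely for illustration.

```python
# the same two records, first as rows and columns (a list of lists)...
header = ["station_id", "name", "bikes"]
rows = [
    ["1", "Station A", 5],
    ["2", "Station B", 0],
]

# ...and then as a list of dictionaries, one dictionary per record
records = [dict(zip(header, row)) for row in rows]

# a field can now be looked up by its label instead of its position
print(rows[0][1])
print(records[0]["name"])
```

Both shapes hold the same information; the dictionary form simply carries the field labels along with every record.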

Unstructured data

May consist of different data types, combining text, numbers, photographs, images, and waveforms of sound. Some formats in this group still have some sort of record-and-field structure.
Examples: xml, json, rss, atom, doc(x), pdf, mp3, jpg

Unstructured to Structured

Collecting information about the world and applying structure to it...organizing information.

*Note: structure influences how it can be analyzed...data is the product of inherently subjective human choices...which reflect interests and priorities...trade-offs...inheriting bias...engaging a robust data quality process*



# Smart Searching for Specific Data Types

* Use the file extension as a keyword in search terms (.csv)
* Use the desired source, such as part of a URL, as a keyword (.com)
* Restrict results to secure websites (https)
* Use a hyphen (-) before a term to exclude results that contain it (-apple)


# Working with Structured Data

The TABLE (or collection of)

### File-Based, Table-Type Data—Take It to Delimit

See extension: 

.csv
Comma-separated value files

.tsv
Tab-separated value files

.txt
Structured data files with this extension are often .tsv files in disguise: older data systems often labeled tab-separated data with the .txt extension. Open and review any data file you want to wrangle with a basic text program (or a code editor like Atom).

.xls(x)
Spreadsheets produced with Microsoft Excel. Because these files can contain multiple “sheets” in addition to formulas, formatting, and other features that simple delimited files cannot replicate, they need more memory to store the same amount of data.

.ods
Open-document spreadsheet files are the default extension for spreadsheets produced by a number of open source software suites like LibreOffice and OpenOffice and have limitations and features similar to those of .xls(x) files
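
The advice above to open and review a file before wrangling it can also be followed from Python itself. The sketch below writes a small throwaway .txt file (the filename `mystery_data.txt` and its contents are made up for the example) and then prints the first line with `repr()`, which makes otherwise invisible tab and newline characters visible.

```python
import os
import tempfile

# create a small throwaway .txt file that actually holds tab-separated data
# (hypothetical contents, standing in for a real download)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "mystery_data.txt")
with open(sample_path, "w") as sample_file:
    sample_file.write("Justice\tParty\tState\nWilliam Strong\tD/R\tPA\n")

# read the first line and show it with repr(), which displays
# invisible characters like tabs (\t) and newlines (\n) in escaped form
with open(sample_path, "r") as sample_file:
    first_line = sample_file.readline()

print(repr(first_line))

# count candidate delimiters to see which one the file actually uses
delimiter_counts = {
    "tab": first_line.count("\t"),
    "comma": first_line.count(","),
}
print(delimiter_counts)
```

A line with several tabs and no commas is a strong hint that the file is tab-separated, whatever its extension says.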

# Reading data from CSVs

```python
# a simple example of reading data from a .csv file with Python
# using the "csv" library.
# the source data was sampled from the Citi Bike system data:
# https://drive.google.com/file/d/17b461NhSjf_akFWvjgNXQfqgh9iFxCu_/
# which can be found here:
# https://s3.amazonaws.com/tripdata/index.html

# import the `csv` library
import csv

# open the `202009CitibikeTripdataExample.csv` file in read ("r") mode
# this file should be in the same folder as our Python script or notebook
#source_file = open("202009CitibikeTripdataExample.csv","r")
path = "/content/202009CitibikeTripdataExample.csv"
source_file = open(path, "r")

# pass our `source_file` as an ingredient to the `csv` library's
# DictReader "recipe".
# store the result in a variable called `citibike_reader`
citibike_reader = csv.DictReader(source_file)

# the DictReader method has added some useful information to our data,
# like a `fieldnames` property that lets us access all the values
# in the first or "header" row
print(citibike_reader.fieldnames)

# let's just print out the first 5 rows - i values of 0, 1, 2, 3, and 4.
for i in range(0, 5):
    print(next(citibike_reader))
```

```
['tripduration', 'starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid', 'usertype', 'birth year', 'gender']
{'tripduration': '4225', 'starttime': '2020-09-01 00:00:01.0430', 'stoptime': '2020-09-01 01:10:26.6350', 'start station id': '3508', 'start station name': 'St Nicholas Ave & Manhattan Ave', 'start station latitude': '40.809725', 'start station longitude': '-73.953149', 'end station id': '116', 'end station name': 'W 17 St & 8 Ave', 'end station latitude': '40.74177603', 'end station longitude': '-74.00149746', 'bikeid': '44317', 'usertype': 'Customer', 'birth year': '1979', 'gender': '1'}
{'tripduration': '1868', 'starttime': '2020-09-01 00:00:04.8320', 'stoptime': '2020-09-01 00:31:13.7650', 'start station id': '3621', 'start station name': '27 Ave & 9 St', 'start station latitude': '40.7739825', 'start station longitu
```

* csv: the workhorse library when it comes to dealing with table-type data
* open() is a built-in function that takes a filename and a “mode” as parameters; the “mode” can be r for “read” or w for “write”
* citibike_reader.fieldnames: by printing these values, we can see the exact label for each column
* range() gives us a way to execute some piece of code a specific number of times, starting with the value of the first argument and ending just *before* the value of the second
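
For anyone without the Citi Bike file at hand, the same steps can be tried on in-memory text. This is only a sketch: `io.StringIO` stands in for the opened file, and the two columns are an abbreviated excerpt of the first two rows in the output printed above. It also shows a detail that is easy to miss: every value comes back as a string.

```python
import csv
import io

# a two-column excerpt of the Citi Bike sample, held in memory;
# io.StringIO lets the text behave like an opened file
sample_text = "tripduration,start station id\n4225,3508\n1868,3621\n"
source_file = io.StringIO(sample_text)

reader = csv.DictReader(source_file)

# the header row becomes the list of field names
print(reader.fieldnames)

# every value is read as a string, even when it looks like a number
first_row = next(reader)
print(first_row)

# convert explicitly before doing arithmetic (seconds to minutes here)
duration_minutes = int(first_row["tripduration"]) / 60
print(round(duration_minutes, 1))
```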

### Adding Iterators: The range Function

Python’s for loop is designed to run through all values in a list or a dataset by default.

Iterator variable

Like any variable, you can name an iterator anything you like, though i (for iterator!) is traditional. One place where Python iterators typically appear is within the range function, another example of a control flow function.

The range function includes an iterator variable that lets us write a slightly different kind of for loop: one that goes through a certain number of rows, rather than all of them.

All items:
```python
for item in complete_list_of_items:
    pass  # action here
```
A certain number of items:
```python
# general pattern; the names here are placeholders, not real variables:
# for item_position in range(starting_position, number_of_places_to_move):
#     action here

# concrete case: print the first 5 rows from the reader created earlier
for i in range(0, 5):
    print(next(citibike_reader))
```
When range iterates over the values specified in the parentheses, it includes the first number but excludes the second.
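
The include-the-first, exclude-the-second behavior is easy to check directly. The sketch below also shows `itertools.islice`, an alternative that these notes do not use, for taking "the first few" items from a source that may turn out to be shorter than expected. The list of row labels is a made-up stand-in for a reader.

```python
from itertools import islice

# range(0, 5) yields 0, 1, 2, 3, 4: the start is included, the stop is not
print(list(range(0, 5)))

# a non-zero start works the same way
print(list(range(2, 5)))

# an alternative for "the first few items": islice stops quietly if the
# source runs out, whereas calling next() on an exhausted iterator
# raises StopIteration
rows = iter(["row 1", "row 2", "row 3"])
first_five = list(islice(rows, 5))
print(first_five)
```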


# Reading data from TSV and TXT files

The DictReader function has a delimiter option. By default, DictReader assumes that the comma character (,) is the separator it should look for, but you can simply specify a different character when you call the function. The examples in this section specify the tab character (\t), but we could easily substitute any delimiter we prefer (or that appears in a particular [source file](https://docs.python.org/3/library/csv.html)).
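
To illustrate substituting another delimiter, the sketch below reads pipe-separated text. The text is a made-up, in-memory sample (`io.StringIO` stands in for an opened file), and it also shows what happens when the delimiter option is left out.

```python
import csv
import io

# hypothetical pipe-delimited text; io.StringIO stands in for an opened file
pipe_text = "Justice|Party|State\nWilliam Strong|D/R|PA\n"

# tell DictReader to split on "|" instead of the default ","
pipe_reader = csv.DictReader(io.StringIO(pipe_text), delimiter="|")
print(pipe_reader.fieldnames)
pipe_row = next(pipe_reader)
print(pipe_row)

# without the delimiter option, each line is treated as a single column
default_reader = csv.DictReader(io.StringIO(pipe_text))
print(default_reader.fieldnames)
```

A header that comes back as one long field name is usually the first sign that the wrong delimiter is in use.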

```python
# a simple example of reading data from a .tsv file with Python, using
# the `csv` library. The source data was downloaded as a .tsv file
# from Jed Shugerman's Google Sheet on prosecutor politicians:
# https://docs.google.com/spreadsheets/d/1E6Z-jZWbrKmit_4lG36oyQ658Ta6Mh25HCOBaz7YVrA

# import the `csv` library
import csv

# open the `ShugermanProsecutorPoliticians-SupremeCourtJustices.tsv` file
# in read ("r") mode.
# this file should be in the same folder as our Python script or notebook
#tsv_source_file = open("ShugermanProsecutorPoliticians-SupremeCourtJustices.tsv","r")
path = "/content/Shugerman Research on Rise of Prosecutor Politicians - Supreme Court Justices.tsv"
tsv_source_file = open(path, "r")

# pass our `tsv_source_file` as an ingredient to the csv library's
# DictReader "recipe."
# store the result in a variable called `politicians_reader`
politicians_reader = csv.DictReader(tsv_source_file, delimiter='\t')

# the DictReader method has added some useful information to our data,
# like a `fieldnames` property that lets us access all the values
# in the first or "header" row
print(politicians_reader.fieldnames)

# we'll use the `next()` function to print just the first row of data
print(next(politicians_reader))
```

```
['', 'Justice', 'Term Start/End', 'Party', 'State', 'Pres Appt', 'Other Offices Held', 'Relevant Prosecutorial Background']
{'': '40', 'Justice': 'William Strong', 'Term Start/End': '1870-1880', 'Party': 'D/R', 'State': 'PA', 'Pres Appt': 'Grant', 'Other Offices Held': 'US House, Supr Court of PA, elect comm for elec of 1876', 'Relevant Prosecutorial Background': 'lawyer'}
```


This dataset was listed in Jeremy Singer-Vine’s (@jsvine) “Data Is Plural” newsletter (https://data-is-plural.com).

*Note: Changing the extension of a file (for example, from .tsv to .txt or vice versa) does absolutely nothing to change its contents. All it does is change what your computer assumes should be done with it. To read the file, just specify the correct delimiter.*

```python
# a simple example of reading data from a .tsv file with Python, using
# the `csv` library. The source data was downloaded as a .tsv file
# from Jed Shugerman's Google Sheet on prosecutor politicians:
# https://docs.google.com/spreadsheets/d/1E6Z-jZWbrKmit_4lG36oyQ658Ta6Mh25HCOBaz7YVrA
# the original .tsv file was renamed with a file extension of .txt

# import the `csv` library
import csv

# open the `ShugermanProsecutorPoliticians-SupremeCourtJustices.txt` file
# in read ("r") mode.
# this file should be in the same folder as our Python script or notebook
#txt_source_file = open("ShugermanProsecutorPoliticians-SupremeCourtJustices.txt","r")
path = "/content/Shugerman Research on Rise of Prosecutor Politicians - Supreme Court Justices.txt"
txt_source_file = open(path, "r")

# pass our txt_source_file as an ingredient to the csv library's DictReader
# "recipe" and store the result in a variable called `politicians_reader`
# add the "delimiter" parameter and specify the tab character, "\t"
politicians_reader = csv.DictReader(txt_source_file, delimiter='\t')

# the DictReader function has added useful information to our data,
# like a label that shows us all the values in the first or "header" row
print(politicians_reader.fieldnames)

# we'll use the `next()` function to print just the first row of data
print(next(politicians_reader))
```


```
['', 'Justice', 'Term Start/End', 'Party', 'State', 'Pres Appt', 'Other Offices Held', 'Relevant Prosecutorial Background']
{'': '40', 'Justice': 'William Strong', 'Term Start/End': '1870-1880', 'Party': 'D/R', 'State': 'PA', 'Pres Appt': 'Grant', 'Other Offices Held': 'US House, Supr Court of PA, elect comm for elec of 1876', 'Relevant Prosecutorial Background': 'lawyer'}
```


### Escaped

Whitespace characters have to be escaped when we’re using them in code...

* we’re using the escaped character for a tab, which is \t
* another common whitespace character code is \n for newline (or \r for return...)
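
A short sketch of how these escaped characters behave in Python strings follows. The sample line reuses a couple of values from the justices data, and the variable names are only illustrative.

```python
# "\t" is typed as two characters but represents a single tab character
tab = "\t"
print(len(tab))

# repr() shows the escaped form; print() shows the whitespace itself
line = "Justice\tParty\nWilliam Strong\tD/R"
print(repr(line))
print(line)

# a raw string (r"...") keeps the backslash as an ordinary character
raw_tab = r"\t"
print(len(raw_tab))

# splitting on the escaped characters recovers rows and fields
fields = [row.split("\t") for row in line.split("\n")]
print(fields)
```

This is why `delimiter='\t'` works with DictReader: Python turns the two-character escape into one real tab before the `csv` library ever sees it.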

...to be continued...