## Working with CSVs

### CSV Overview

CSV, short for _Comma Separated Values_, is one of the simplest and most widely used data formats. A CSV file contains rows of data where each line is _terminated_ by a newline character (`\n`) and the columns within a row are _delimited_ (or _separated_) by commas (`,`). You might be familiar with this format from working with Excel, which commonly uses CSV files to import and export data.

Commas and newline characters are therefore special characters: if a field contains one of them, it needs to be wrapped in another special character called the _Enclosing Character_. This is typically the double quote (`"`). For example, the following lines contain 3 fields, _name, home\_town, genre_, where the home\_town field is wrapped in quotes because it includes a comma:

```csv
name,home_town,genre
Biggie Smalls,"Brooklyn, NY",hip hop
"Queen Latifah","Newark, NJ","hip hop"
"Salt-N-Pepa","Queens, NY","hip hop"
Tupac Shakur,"Harlem, NY","hip hop"
```
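
As a quick check of the quoting rule, the sketch below parses two of the rows above with Python's built-in `csv.reader`. The in-memory `io.StringIO` buffer stands in for a real file and is only an assumption made for illustration.

```python
import csv
import io

# in-memory stand-in for a CSV file (illustrative only)
data = io.StringIO(
    'name,home_town,genre\n'
    'Biggie Smalls,"Brooklyn, NY",hip hop\n'
    '"Queen Latifah","Newark, NJ","hip hop"\n'
)

# each parsed row is a list of fields; the quoted comma stays inside one field
parsed_rows = list(csv.reader(data))
for fields in parsed_rows:
    print(len(fields), fields)
```

Every row comes back with exactly three fields, whether or not the other fields were quoted.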

Optionally, to be safe, all string or text fields could be wrapped in quotes.

There's another special character called the _Escape Character_, commonly set to backslash (`\`). This character is used to escape the _Enclosing_ character, allowing a text field to include a quote. Yes, we almost get into an endless cycle! If we need a single backslash in the text, we should enter a double backslash: `\\`. Take a look at the example below. It contains a lyric field that includes both a quote and a backslash:

```csv
name,lyric
Biggie Smalls,"The song Juicy includes: \"Salt-n-Pepa \\ Heavy D up in the limousine\" -the end"
```
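
Here is a minimal sketch of how the lines above could be parsed in Python. It assumes the backslash convention just described, so it passes `escapechar="\\"` and `doublequote=False` to `csv.reader`; without these arguments Python's `csv` module does not treat the backslash as special. The list of strings stands in for a file.

```python
import csv

# raw strings keep the backslashes exactly as they appear in the CSV text
lines = [
    r'name,lyric',
    r'Biggie Smalls,"The song Juicy includes: \"Salt-n-Pepa \\ Heavy D up in the limousine\" -the end"',
]

# tell the reader that a backslash escapes the next character
reader = csv.reader(lines, escapechar="\\", doublequote=False)
header, lyric_row = reader

print(header)
print(lyric_row[1])
```

After parsing, the lyric field contains plain double quotes and a single backslash.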

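Luckily, you don't have to worry about these too much. Python's `csv` module takes care of enclosing and escaping fields for you. Just know that these special characters exist and that you can change them. The table lists the usual choices; note that Python's `csv` module sets no escape character unless you ask for one and instead doubles a quote that appears inside a quoted field: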

| Name | Default Character |
| --- | --- |
| Delimiter | `,` (comma) |
| Line Terminator | `\n` (enter or newline) |
| Enclosing or Quoting | `"` (double quote) |
| Escape | `\` (backslash) |
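
To see which of these characters Python actually uses, you can inspect the dialects that ship with the `csv` module. This is a small sketch using `csv.get_dialect`; the attribute names (`delimiter`, `lineterminator`, `quotechar`, `escapechar`, `quoting`) are the module's own.

```python
import csv

# print the special characters configured for two built-in dialects
for dialect_name in ("excel", "unix"):
    dialect = csv.get_dialect(dialect_name)
    print(
        dialect_name,
        repr(dialect.delimiter),
        repr(dialect.lineterminator),
        repr(dialect.quotechar),
        repr(dialect.escapechar),
    )

excel = csv.get_dialect("excel")
unix = csv.get_dialect("unix")
```

Neither built-in dialect sets an escape character, and the two differ in their line terminator and quoting policy.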

It's important to mention that there's another similar file type called TSV or _Tab Separated Values_, where fields are delimited by tabs (`\t`) instead of commas.


### Basic CSV I/O

In this lesson we'll follow some simple examples to read/write CSV rows from a file. 

Python provides a built-in `csv` module. This module comes equipped with a `DictWriter` and a `DictReader` class to write/read CSV files. These classes automatically take care of CSV formatting including adding delimiters, parsing fields, and escaping special characters. Life is easy with Python 😉

Let's see these classes in action. The following code writes a series of rows from a list into CSV format:

```python
from csv import DictWriter

# csv header fields
header = ["name", "home_town"]
# rows to write, each row is a dict of values
rows = [
    {"name": "Biggie Smalls", "home_town": "Brooklyn, NY"},
    {"name": "Queen Latifah", "home_town": "Newark, NJ"},
    {"name": "Salt-N-Pepa", "home_town": "Queens, NY"},
    {"name": "Tupac Shakur", "home_town": "Harlem, NY"},
]

# open a file for writing; newline="" lets the csv module control line endings
with open("./data/hiphop_legends.csv", "w", newline="") as csv_file:
    # create a csv writer with header fields
    csv_writer = DictWriter(csv_file, fieldnames=header, dialect="unix")
    # write our header row
    csv_writer.writeheader()
    # iterate through rows and write to file
    for row in rows:
        csv_writer.writerow(row)

print("done!")
```

Let's take a closer look here:
- We open a file for writing as before
- We use a special `csv` module class called `DictWriter`. This class writes dicts into CSV format. The dict fields map one-to-one to the CSV fields written.
- `DictWriter` takes a list of `fieldnames`. This is the list of dict keys to write into the file; it is also used for the CSV header column names.
- `dialect="unix"` selects Unix formatting, which uses a single newline (`\n`) character as the line terminator and wraps every field in quotes. You can also use the `excel` dialect (the default), which uses Windows line terminators (`\r\n`) and quotes fields only when needed. Yes, Unix and Windows have different line terminators. This is typically very annoying!
- The `writerow()` method writes a dict to our file in CSV format
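
To make the dialect difference concrete, the sketch below writes the same row with both dialects into an in-memory buffer and prints the raw text. The helper name `to_csv_text` is made up for this illustration, and `io.StringIO` stands in for a file.

```python
import csv
import io


def to_csv_text(rows, fieldnames, dialect):
    """Write dict rows to a string so the raw output can be inspected."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, dialect=dialect)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [{"name": "Biggie Smalls", "home_town": "Brooklyn, NY"}]
unix_text = to_csv_text(sample, ["name", "home_town"], "unix")
excel_text = to_csv_text(sample, ["name", "home_town"], "excel")

# repr() makes the line terminators visible
print(repr(unix_text))
print(repr(excel_text))
```

The `unix` output quotes every field and ends lines with `\n`, while the `excel` output quotes only the field that contains a comma and ends lines with `\r\n`.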

<br/>

Now, let's read our rows back using a similar `csv` module class called `DictReader`:

```python
from csv import DictReader

# open our file for reading; newline="" lets the csv module handle line endings
with open("./data/hiphop_legends.csv", "r", newline="") as csv_file:
    # create a csv reader and pass any special args
    csv_reader = DictReader(csv_file, dialect="unix")
    for row in csv_reader:
        # get the current line number
        line_num = csv_reader.line_num
        # access fields as a dict
        name = row["name"]
        home_town = row["home_town"]
        print(f"lineno: {line_num} - name: {name} home_town: {home_town}")
```

It's important to note that:
- `DictReader` does not detect whether a header exists. When no `fieldnames` argument is given, it simply treats the first row of the file as the header. This row supplies the field names, which become the keys of the parsed dict rows. Note that the header row is not returned as a data row, so the printed line numbers start from the second line of the CSV file.
- You can use the `DictReader` object itself as an iterator. Iterating over this object returns a dict object corresponding to each row of the CSV file.
- The `DictReader` takes a series of optional constructor parameters such as `dialect`. Read the docs! There are other parameters to change the delimiter, quoting character, or other CSV special characters.
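- In Python 3.8 and later, the `DictReader` returns each row as a regular `dict`, which keeps the keys in the order in which they appear in the file. Python 3.6 and 3.7 returned an `OrderedDict` instead, a special dict type that also retains insertion order.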
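
The header behavior described above can be checked with a short sketch. It reads one buffer that has a header row and one that does not; for the second, the field names are supplied through the `fieldnames` parameter. The `io.StringIO` buffers are stand-ins for files.

```python
import csv
import io

# file with a header row: the first line becomes the field names
with_header = io.StringIO('name,home_town\nBiggie Smalls,"Brooklyn, NY"\n')
reader = csv.DictReader(with_header)
first = next(reader)
print(reader.fieldnames, reader.line_num, first)

# file without a header row: pass the field names explicitly
no_header = io.StringIO('Queen Latifah,"Newark, NJ"\n')
plain_reader = csv.DictReader(no_header, fieldnames=["name", "home_town"])
first_plain = next(plain_reader)
print(plain_reader.line_num, first_plain)
```

With a header, the first data row is reported as line 2; with explicit `fieldnames`, the first line of the file is already data.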


### Reading/Writing Lists

The built-in `csv` module has two other interfaces for CSV files that do NOT rely on a header row. They read/write the fields of each row as a `list` object instead of a `dict`, and are simply called `csv.writer` and `csv.reader`. You can read the docs to familiarize yourself with them.

The example below shows how to work with them:

```python
import csv

# our rows are defined as lists of values (instead of dicts)
rows = [
    ['Biggie Smalls', 'Brooklyn, NY'],
    ['Queen Latifah', 'Newark, NJ'],
    ['Salt-N-Pepa', 'Queens, NY'],
    ['Tupac Shakur', 'Harlem, NY']
]

# write data to a tab-separated file
with open("./data/hiphop_legends.tsv", "w", newline="") as csv_file:
    # we change our delimiter to tabs and only use enclosing characters when necessary
    csv_writer = csv.writer(csv_file, dialect="unix", delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(rows)

print("done writing!")

# open the file again for reading
with open("./data/hiphop_legends.tsv", "r", newline="") as csv_file:
    csv_reader = csv.reader(csv_file, dialect="unix", delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in csv_reader:
        # we read back lists (instead of dicts)
        print(row)
```

A few things to notice here:
- Our rows are now a list of lists
- We also pass a few more parameters to change the delimiter to tabs (making this a TSV file) and set the quoting to only when necessary (minimal)
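
To see what "only when necessary" means in practice, the sketch below writes two rows into an in-memory buffer with the same settings and prints the raw text. The second row, with a tab inside a field, is a made-up value added purely for illustration.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, dialect="unix", delimiter="\t", quoting=csv.QUOTE_MINIMAL)
# a comma is harmless in a TSV file, so this row needs no quotes
writer.writerow(["Biggie Smalls", "Brooklyn, NY"])
# hypothetical field containing the delimiter itself, so it must be quoted
writer.writerow(["tab\tinside", "plain"])

tsv_text = buffer.getvalue()
print(repr(tsv_text))

# reading with the same delimiter recovers the original lists
round_trip = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(round_trip)
```

Only the field that contains a tab is wrapped in quotes; everything else is written as-is.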

#### Exercise

- Read the file `./data/deb-airports.csv` using the `csv.DictReader`
- Print each of the fields in each row
- Store each row into a list
- Open another file for writing using the `csv.DictWriter`
- Store the rows back into CSV format
- Bonus:
  - Create another list with only Oregon (OR), Washington (WA), and California (CA) airports
  - Write these into a separate CSV file

```python
# import both DictReader and DictWriter

# declare a list to store your rows

# open the file for reading using DictReader
# loop through all the rows and append them to your list


# open a file for writing using DictWriter
# loop through your list and write the rows into csv
```


### Generating Random Person Profiles

Before the next set of CSV exercises, let's use the third-party Faker module to generate a series of random person profiles. We call various methods of the `Faker` class to generate random people and save each person in a corresponding dict.

Let's see this in action:

```python
from faker import Faker

fake = Faker()

# a list to contain our profiles
profiles = []

for _ in range(30):
    person = {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "street_address": fake.street_address(),
        "city": fake.city(),
        "state": fake.state_abbr(),
        "zip": fake.zipcode_in_state(),
        "email": fake.free_email(),
        "birth_date": fake.date_of_birth(),
    }
    print(person)
    profiles.append(person)
```

#### Exercise (Writer)

1. Write the profiles into a CSV file using the `DictWriter` class
2. Be sure to include a header row

```python
# get our field names from the data
field_names = list(profiles[0].keys())

# write profiles into a CSV file
```


#### Exercise (Reader)

1. Read the profiles back from the CSV file using the `DictReader`
2. Print a few of the fields
3. **Bonus**: When reading each row, add some post processing to parse the _birth\_date_ column as datetime objects
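
As a hint for the bonus, remember that every value read from a CSV file arrives as a string. The sketch below assumes the birth date was written in the default `YYYY-MM-DD` form (which is what `str()` produces for a `datetime.date`); the sample value is made up.

```python
from datetime import date, datetime

# hypothetical value, as it would come back from a DictReader row
raw_birth_date = "1987-03-14"

# option 1: parse into a date object
as_date = date.fromisoformat(raw_birth_date)
# option 2: parse into a datetime object with an explicit format
as_datetime = datetime.strptime(raw_birth_date, "%Y-%m-%d")

print(as_date, as_datetime)
```

If your file stores dates in another format, adjust the format string passed to `strptime` accordingly.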

#### Exercise - Bonus Points

1. Write and read the profiles again but this time with the `csv.reader` and `csv.writer` classes instead.
2. Remember, these classes work with rows as a `list` of fields instead of a `dict`. You must convert each profile with `dict.values()`.
3. Set the delimiter character to tabs (`'\t'`); making this a TSV file. Read the docs to find the correct parameter to set.
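
As a small hint, this sketch shows how a dict can be turned into a header list and a row list, and how a row can be paired with the header again after reading. The `sample_profile` values are invented for the illustration and are not output from Faker.

```python
# hypothetical profile with just a few of the fields
sample_profile = {"first_name": "Jane", "last_name": "Doe", "city": "Portland"}

# keys become the header row, values become a data row
header_row = list(sample_profile.keys())
data_row = list(sample_profile.values())
print(header_row)
print(data_row)

# after reading a list back, zip it with the header to rebuild a dict
rebuilt = dict(zip(header_row, data_row))
print(rebuilt)
```

Because dicts keep insertion order, the keys and values line up as long as every profile was built with the same key order.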

```python
# write the profiles into a TSV file

# read the profiles from a TSV file
```