# Welcome to the Dark Art of Coding:
## Introduction to Python
Comma separated value (CSV) files

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, we will look at handling files based on Comma Separated Values (CSV).

Students should expect to:

* Open CSVs manually, using straight Python
* Open CSVs using the `csv` module

# Manually opening CSVs with straight Python
---

In their simplest form, CSVs can be opened/processed just like regular text files. This is generally only successful if the CSV is well-formed and uniformly consistent throughout. 
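
As a minimal sketch of what happens to a single line, the snippet below splits one comma separated line held in a string rather than read from a file. The sample line is borrowed from the `stocks.csv` data used later in this session, and the variable names are only illustrative.

```python
# One raw line, as it would come out of a file (note the trailing newline)
line = 'APPL,Apple,27.49,28.41,27.5,27.78,23860500,27.78\n'

# rstrip() removes the trailing newline, split(',') cuts on every comma
fields = line.rstrip().split(',')

print(len(fields))            # number of fields found
print(fields[0], fields[6])   # symbol and volume, still as strings
```

Every value in `fields` is a string, so numeric columns need `int()` or `float()` before any arithmetic or comparison.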

In [1]:
file = open('folder/yahoo_prices_short.csv', 'r')

for line in file:
    fields = line.rstrip().split(',')
    print(fields[0], fields[5])           
    
# Having a header row is not quite what we want...    
    

Date Volume
2/29/2008 23860500
2/28/2008 30113200
2/27/2008 27664100
2/26/2008 26013000
2/25/2008 32470600
2/22/2008 26157800
2/21/2008 34494000
2/20/2008 29274700


In [2]:
file = open('folder/yahoo_prices_short.csv')
file.readline()        # Date,Open,High,Low,Close,Volume,Adj Close

for line in file:
    fields = line.strip().split(',')
    print(fields[0], fields[5])           

# this is great, but is of limited utility in terms of data analysis...

2/29/2008 23860500
2/28/2008 30113200
2/27/2008 27664100
2/26/2008 26013000
2/25/2008 32470600
2/22/2008 26157800
2/21/2008 34494000
2/20/2008 29274700


## Detour: two built-in functions: `max()` and `min()`

In [16]:
# A pair of built-in functions may help: max() & min()
# given a random list of integers

random_list = [1, 2, 3, 4, 42]
print('Max:')
max(random_list)

Max:


42

In [13]:
print('Min:')
min(random_list)

Min:


1

In [9]:
file = open('folder/yahoo_prices_short.csv')
file.readline()

volumes = list()              # volumes = [] also works.

for line in file:
    fields = line.strip().split(',')
    volumes.append(int(fields[5]))
    
print(max(volumes), min(volumes))    

34494000 23860500


In [17]:
# Sometimes, even though it introduces more lines of 
#     code, we may choose to introduce intermediary
#     variable names, simply to improve the readability
#     of the code.

file = open('folder/yahoo_prices_short.csv')
file.readline()

volumes = list()              # volumes = [] also works.

for line in file:
    fields = line.strip().split(',')
    
    volume = int(fields[5])        # we add this line for readability
    volumes.append(volume)

print(max(volumes), min(volumes))    


34494000 23860500


# Experience Points!

In your **text editor** create a simple script called:

```bash
my_csv_01.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_01.py
```

I suggest running your script each time you add a feature, so that you test it incrementally.

Task | Sample Object(s)
:---|---
1. open the file `yahoo_prices_short.csv` and label it with a suitable filehandle| `fin` OR `infile` OR `stocks`
2. create a list to hold adjusted closing values for the stocks|`adjusted_close`
3. parse each row for the value in the `Adj Close` column|
4. convert each value to a float| `float()`
5. `append()` each value to the `adjusted_close` list|
6. `print()` the maximum and minimum values|`max()` & `min()`

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Opening CSVs with the csv module
---

## Classic example of what can go wrong!

As noted earlier, manual, pure-Python parsing of CSVs only works if the CSV is well-formed and consistent throughout. Let's look at what happens when we attempt to parse a CSV that has some oddities or unexpected conditions.

In [18]:
with open('folder/stocks.csv') as filein:
    filein.readline()           # "Stock Symbol","Stock Name",Open,High,Low,Close,Volume,Adj Close
    
    for line in filein:
        line = line.strip()
        print(line)
        fields = line.split(',')
        print('Num of fields:', len(fields))
        print('The fields:', fields)
        print('-' * 60)

APPL,Apple,27.49,28.41,27.5,27.78,23860500,27.78
Num of fields: 8
The fields: ['APPL', 'Apple', '27.49', '28.41', '27.5', '27.78', '23860500', '27.78']
------------------------------------------------------------
C,Citigroup,27.98,28.82,27.96,28.15,30113200,28.15
Num of fields: 8
The fields: ['C', 'Citigroup', '27.98', '28.82', '27.96', '28.15', '30113200', '28.15']
------------------------------------------------------------
GOOG,Google,30.45,29.81,28.36,30.18,27664100,28.46
Num of fields: 8
The fields: ['GOOG', 'Google', '30.45', '29.81', '28.36', '30.18', '27664100', '28.46']
------------------------------------------------------------
HOG,"Harley-Davidson, Inc.",28.48,30.9,27.98,28.48,52354100,28.31
Num of fields: 9
The fields: ['HOG', '"Harley-Davidson', ' Inc."', '28.48', '30.9', '27.98', '28.48', '52354100', '28.31']
------------------------------------------------------------
MMM,3m,30.64,27.70,29.11,28.64,82354200,29.74
Num of fields: 8
The fields: ['MMM', '3m', '30.64', '27.7

If we go looking for the **High** value by position (shown in BLUE), the comma inside `"Harley-Davidson, Inc."` shifts every later field by one. For Harley-Davidson we end up with the **Open** value (shown in RED) instead.


<pre>
Stock Symbol,Stock Name,<strong>Open,High</strong>,Low,Close,Volume,Adj Close

APPL,Apple,27.49,<strong><span style="color:blue">28.41</span></strong>,27.5,27.78,23860500,27.78
C,Citigroup,27.98,<strong><span style="color:blue">28.82</span></strong>,27.96,28.15,30113200,28.15
GOOG,Google,30.45,<strong><span style="color:blue">29.81</span></strong>,28.36,30.18,27664100,28.46
HOG,"Harley-Davidson, Inc.",<strong><span style="color:red">28.48</span></strong>,30.9,27.98,28.48,52354100,28.31
MMM,3m,30.64,<strong><span style="color:blue">27.70</span></strong>,29.11,28.64,82354200,29.74
M,"Macy's",29.7,<strong><span style="color:blue">26.32</span></strong>,28.16,30.3,72371982,28.12
MSFT,Microsoft,26.13,<strong><span style="color:blue">26.75</span></strong>,26.101,25.51,12365478,29.50
WAG,Walgreens,26.63,<strong><span style="color:blue">26.51</span></strong>,28.47,28.33,81271452,26.29</pre>

In [None]:
with open('folder/stocks.csv') as filein:
    filein.readline()
    high_prices = []            # using list() OR [] is acceptable
    
    for line in filein:
        fields = line.strip().split(',')
        
        symbol = fields[0]
        name = fields[1]
        high = fields[3]
        
        high_prices.append(float(high))
        
        print(symbol, name, high)

print()        
print('Max:', max(high_prices))
    
# Max 'High' value SHOULD have been 30.9 for Harley-Davidson.
# The comma inside the quoted name shifts the fields, so we read its 'Open' value instead.
# Oooops.

## Let's do it right...

In [None]:
import csv                          # The csv module provides more flexibility and tools than 
                                    # opening the file with straight Python

file = open('folder/stocks.csv')
file.readline()    

csv_stocks = csv.reader(file)       # produces a "reader" object that parses rows
                                    # the way you expect, out of the box
                                    #     it defaults to comma separator
                                    #     it understands quoted fields

for line in csv_stocks:
    print(line)
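
To compare the two approaches without needing the file on disk, here is a small sketch that parses two rows held in memory. The rows are copied from the stocks data shown above; `io.StringIO` is used only as a stand-in for an open file, and the variable names are illustrative.

```python
import csv
import io

# Two rows copied from the stocks data above, held in memory for illustration
raw = (
    'C,Citigroup,27.98,28.82,27.96,28.15,30113200,28.15\n'
    'HOG,"Harley-Davidson, Inc.",28.48,30.9,27.98,28.48,52354100,28.31\n'
)

# Naive approach: split every line on commas
naive_rows = [line.split(',') for line in raw.splitlines()]

# csv.reader accepts any iterable of lines, including a StringIO object
parsed_rows = list(csv.reader(io.StringIO(raw)))

# Print field counts from both approaches, then the name and High from csv.reader
for naive, parsed in zip(naive_rows, parsed_rows):
    print(len(naive), len(parsed), parsed[1], parsed[3])
```

The naive split sees nine fields in the Harley-Davidson row, while `csv.reader` keeps the quoted name together and returns eight.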

In [None]:
import csv                          
file = open('folder/stocks.csv')
file.readline()    

csv_stocks = csv.reader(file)       

high_prices = list()

for line in csv_stocks:
    
    # Let's extract key fields from each of the lists
    symbol = line[0]
    name = line[1]
    high = line[3]
    
    high_prices.append(float(high))   # convert so max() compares numbers, not text
    print(symbol, name, high)
    
print('\nMax:', max(high_prices))   # NOTICE the '\n' character
file.close()
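
One point worth a quick sketch: `csv.reader` hands back every field as a string, and `max()` compares strings character by character rather than by numeric value. The values below are made up purely to show the effect.

```python
# Hypothetical price strings, chosen only to show the difference
as_text = ['9.5', '30.9', '28.41']

# Comparing strings: '9.5' wins because '9' sorts after '3' and '2'
print(max(as_text))

# Comparing numbers: convert first, then compare
as_numbers = [float(value) for value in as_text]
print(max(as_numbers))
```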

# Experience Points!

In your **text editor** create a simple script called:

```bash
my_csv_02.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_02.py
```

I suggest running your script each time you add a feature, so that you test it incrementally.

Task | Sample Object(s)
:---|---
1. open the file `stocks.csv` and label it with a suitable filehandle| `fin` OR `infile` OR `stocks`
2. parse each row for the values in the `Stock Symbol`, `High` and `Low` columns|`stock`, `high`, `low`
3. convert the `high` and `low` values to a `float()`|
4. calculate the difference between the `high` and `low`| `diff`
5. `print()` the values of `stock` and `diff`| `stock`, `diff`

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
# Let's take a quick look at the raw data in this CSV...

%cat folder/stocks.csv

In [None]:
# Next, let's take a quick look at the raw data in this file with 
#     tab separated values (TSV)...

%cat folder/stocks.tsv

In [None]:
import csv                          

file = open('folder/stocks.tsv')
file.readline()                     # "Stock Symbol" "Stock Name" Open High Low Close Volume "Adj Close"

tsv_stocks = csv.reader(file, 
                        delimiter='\t', 
                        quotechar="'",
                        escapechar="\\")                        
                                    # csv.reader takes several arguments here
                                    #     the filehandle
                                    #     a delimiter character 
                                    #     a quote character to encapsulate any 
                                    #     delimiters
                                    #     an escape character

open_prices = list()

for line in tsv_stocks:
    symbol, name, _open, high, low, close, volume, adjclose = line
    open_prices.append(float(_open))    # convert so max() compares numbers, not text
    print(symbol, name, _open)

print('\nMax:', max(open_prices))
file.close()
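
A small sketch of the same idea using an in-memory string, so it runs without `stocks.tsv`. The rows are tab separated versions of rows shown earlier, which is an assumption about the file's layout, and `next()` is used here as an alternative way to pull off the header row.

```python
import csv
import io

# Assumed layout: tab separated columns, header on the first line
raw = (
    'Stock Symbol\tStock Name\tOpen\tHigh\tLow\tClose\tVolume\tAdj Close\n'
    'APPL\tApple\t27.49\t28.41\t27.5\t27.78\t23860500\t27.78\n'
    'HOG\tHarley-Davidson, Inc.\t28.48\t30.9\t27.98\t28.48\t52354100\t28.31\n'
)

reader = csv.reader(io.StringIO(raw), delimiter='\t')
header = next(reader)      # consume the header row from the reader itself

open_values = []
for symbol, name, _open, high, low, close, volume, adjclose in reader:
    open_values.append(float(_open))
    print(symbol, name, _open)

print('Max:', max(open_values))
```

Because the delimiter is a tab, the comma inside the company name needs no quoting in this sketch.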

# Experience Points!

In your **text editor** create a simple script called:

```bash
my_csv_03.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_03.py
```

I suggest running your script each time you add a feature, so that you test it incrementally.

Task | Sample Object(s)
:---|---
1. open the file `stocks.csv` and label it with a suitable filehandle| `fin` OR `infile` OR `stocks`
2. parse each row for the values in the `Stock Symbol`, `High`, and `Low` columns|`stock`, `high`, `low`
3. convert the `high` and `low` values to a `float()`|
4. calculate the difference between the `high` and `low`| `diff`
5. `print()` the values of `stock` and `diff`| `stock`, `diff`

When you finish this exercise, please post your Green Sticky.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
# If you're going to read a bunch of CSVs with the same style of formatting then you can
# make a "dialect" which saves some of your arguments

file = open('folder/stocks.tsv')
csv.register_dialect('tsvDialect', delimiter='\t', quotechar="'", escapechar="\\")

tsvinput = csv.reader(file, 'tsvDialect')

for line in tsvinput:
    print(line)
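
To check what a registered dialect actually holds, the sketch below registers a dialect and reads it back. The name `demoTsv` is hypothetical, chosen so that it does not clash with the `tsvDialect` registered above, and the sample row is an assumed single-quoted, tab separated line.

```python
import csv
import io

# Register a dialect under a hypothetical name
csv.register_dialect('demoTsv', delimiter='\t', quotechar="'", escapechar='\\')

# The registered names and settings can be inspected
print('demoTsv' in csv.list_dialects())
dialect = csv.get_dialect('demoTsv')
print(repr(dialect.delimiter), repr(dialect.quotechar))

# The dialect name can be passed by keyword as well as by position
sample = io.StringIO("HOG\t'Harley-Davidson, Inc.'\t28.48\n")
rows = list(csv.reader(sample, dialect='demoTsv'))
print(rows)
```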

In [None]:
import csv                          
csv.register_dialect('tsvDialect', delimiter='\t', quotechar="'", escapechar="\\")

filein = open('folder/stocks.tsv')
fileout = open('folder/stocks_slim.csv', 'w', newline='')   # newline='' lets the csv module control line endings
filein.readline()                     # "Stock Symbol","Stock Name",Open,High,Low,Close,Volume,Adj Close

tsv_stocks = csv.reader(filein, 'tsvDialect')
stocks_slim = csv.writer(fileout, quotechar='"')

for line in tsv_stocks:
    print(line)
    symbol, name, _open, high, low, close, volume, adjclose = line
    output = [name, symbol, close]
    stocks_slim.writerow(output)

filein.close()
fileout.close()
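
To see what `csv.writer` produces without creating a file, this sketch writes to an in-memory `io.StringIO` buffer. With the default settings, the writer adds quotes only around fields that need them, such as a name containing a comma. The rows reuse values shown earlier in this session.

```python
import csv
import io

buffer = io.StringIO()            # stands in for an open output file
writer = csv.writer(buffer)

# Rows in the [name, symbol, close] order used above
writer.writerow(['Apple', 'APPL', '27.78'])
writer.writerow(['Harley-Davidson, Inc.', 'HOG', '28.48'])

output = buffer.getvalue()
print(output)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that the writer's own line endings are not translated a second time.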
    

# Experience points
---

The following problems combine a number of features from our previous discussions, such as:
* creating functions
* reading CSVs
* processing strings
* indexing lists
* converting values from one datatype to another

## Problem 0

In your **text editor** create a simple script called:

```bash
my_csv_prob_0.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_prob_0.py
```
---

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract user IDs (UID) from email addresses and pair them with latitude and longitude values from each row in the file. 

1. Start by creating a function that, when given an email address, extracts and returns the UID (e.g. given<br>
**ballen@jleague.org**, it returns **ballen**)
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module!)
1. Parse each row of the CSV.
1. For each row, call your function AND provide as an argument the email address from that row.
1. `print()` the UID returned by your function and print the latitude and longitude from that row


<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

## Problem 1

In your **text editor** create a simple script called:

```bash
my_csv_prob_1.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_prob_1.py
```
---

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract the:
* `to IP addresses`
* `from IP addresses`
* `name`
* `date` 


1. Create a function that when given an IP address will return `True` if the IP address is part of the `75.0.0.0/8` network (For simplicity, check if the first three characters are `"75."`) >>> **see NOTE: below**.
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the `from IP address` from that row. Similarly call your function and give it the `to IP address` from that row.
1. `print()` the `name`, `from IP`, `to IP` and `date` for only those rows where an IP address falls into the `75.0.0.0/8` network.


* NOTE: this check is greatly simplified. There are much better ways to process IP addresses; see the `ipaddress` module.


When you finish this exercise, please post your Green Sticky.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

## Problem 2

In your **text editor** create a simple script called:

```bash
my_csv_prob_2.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_prob_2.py
```
---

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract the:

* `latitude`
* `longitude`
* `date`


1. Create a function that when given two text values (`latitude` and `longitude`) returns two modified numeric values: 
   1. Convert each text value to a `float()`
   1. Use the `round()` function to round the float to two decimal places
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the lat and long values
1. `print()` the `date` and the `latitude` and `longitude`.


When you finish this exercise, please post your Green Sticky.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

## Problem 3

In your **text editor** create a simple script called:

```bash
my_csv_prob_3.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_prob_3.py
```
---

This problem uses the file: `folder/yahoo_prices_short.csv`, which has the following columns:

`date, open, high, low, close, volume, adj_close`

You will create a script to extract the:
* `date`
* `volume`


1. Create a function that when given one text value (`volume`) returns one value: 
   1. Convert the text value to an integer
   1. Divide the value by `1,000,000`
   1. Use the round() function to round to one decimal place
   1. Convert the float to a string and concatenate with an `M` to convert the value to a human-readable form (e.g. 28345623 becomes 28.3M)
   1. Return the human-readable `volume`
1. Open the file `folder/yahoo_prices_short.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the `volume` as an argument
1. Print the `date` and `volume` in human-readable form.

When you finish this exercise, please post your Green Sticky.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

## Problem 4

In your **text editor** create a simple script called:

```bash
my_csv_prob_4.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_csv_prob_4.py
```
---

This problem uses the file: `folder/yahoo_prices_short.csv`, which has the following columns:

`date, open, high, low, close, volume, adj_close`

You will create a script to extract the:
* `date`
* `volume`

1. Create a function that when given one text value (`date`) returns one value: 
   1. Split the text and extract the day
   1. Return the `day` of the month (e.g. mm/dd/yyyy >>> 2/29/2008 >>> 29)
1. Open the file `folder/yahoo_prices_short.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, extract the `volume`
1. For the row with the highest `volume`, calculate the `day` of the month associated with that row, using your function
1. `print()` the `day` of the month with the highest `volume` and the value associated with the `volume`

When you finish this exercise, please post your Green Sticky.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Solutions:
---

## Problem 0

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract user IDs (UID) from email addresses and pair them with latitude and longitude values from each row in the file. 

1. Start by creating a function that, when given an email address, extracts and returns the UID (e.g. given<br>
**ballen@jleague.org**, it returns **ballen**)
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module!)
1. Parse each row of the CSV.
1. For each row, call your function AND provide as an argument the email address from that row.
1. `print()` the UID returned by your function and print the latitude and longitude from that row

In [None]:
import csv

def get_uid(email):
    uid = email.split('@')[0]
    return uid

fin = open('folder/log_file_1000.csv')
logs = csv.reader(fin)

for fields in logs:
    
    name, email, fmip, toip, date, lat, long, payload = fields
    uid = get_uid(email)
    print(uid, lat, long)

## Problem 1

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract the:
* `to IP addresses`
* `from IP addresses`
* `name`
* `date` 


1. Create a function that when given an IP address will return `True` if the IP address is part of the `75.0.0.0/8` network (For simplicity, check if the first three characters are `"75."`) >>> **see NOTE: below**.
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the `from IP address` from that row. Similarly call your function and give it the `to IP address` from that row.
1. `print()` the `name`, `from IP`, `to IP` and `date` for only those rows where an IP address falls into the `75.0.0.0/8` network.


* NOTE: this check is greatly simplified. There are much better ways to process IP addresses; see the `ipaddress` module.


In [None]:
import csv

def ip75(ipaddress):
    return ipaddress.startswith('75.')

fin = open('folder/log_file_1000.csv')

logs = csv.reader(fin)

for fields in logs:
    
    # barry allen,ballen@jleague.org,155.130.121.215,75.122.133.241,2016-02-08T21:44:41,49.8316,8.01485,764272    
    name, email, fmip, toip, date, lat, long, payload = fields
    if ip75(fmip) or ip75(toip):
        print(name, fmip, toip, date)
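
As the NOTE suggests, the standard library's `ipaddress` module can do a real network membership test. A minimal sketch, where the function name `in_75_network` is hypothetical and the sample addresses come from the row shown in the comment above:

```python
import ipaddress

# The network we want to test against
NETWORK_75 = ipaddress.ip_network('75.0.0.0/8')

def in_75_network(ip_text):
    # Convert the text to an address object, then test membership
    return ipaddress.ip_address(ip_text) in NETWORK_75

print(in_75_network('75.122.133.241'))
print(in_75_network('155.130.121.215'))
```

Unlike `startswith('75.')`, `ipaddress.ip_address()` raises a `ValueError` when the text is not a valid IP address, so malformed rows are not silently accepted.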

## Problem 2

This problem uses the file: `folder/log_file_1000.csv`, which has the following columns:

`name,email,fm_ip,to_ip,date,lat,long,payload`

You will create a script to extract the:

* `latitude`
* `longitude`
* `date`


1. Create a function that when given two text values (`latitude` and `longitude`) returns two modified numeric values: 
   1. Convert each text value to a `float()`
   1. Use the `round()` function to round the float to two decimal places
1. Open the file `folder/log_file_1000.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the lat and long values
1. `print()` the `date` and the `latitude` and `longitude`.


In [None]:
import csv

def geoConverter(lat, long):
    lat = round(float(lat), 2)
    long = round(float(long), 2)
    return lat, long

with open("folder/log_file_1000.csv") as fin:
    logs = csv.reader(fin)
    
    for line in logs:
        name, email, fmip, toip, date, lat, long, payload = line
        lat, long = geoConverter(lat, long)
        print(date, lat, long)  

## Problem 3

This problem uses the file: `folder/yahoo_prices_short.csv`, which has the following columns:

`date, open, high, low, close, volume, adj_close`

You will create a script to extract the:
* `date`
* `volume`


1. Create a function that when given one text value (`volume`) returns one value: 
   1. Convert the text value to an integer
   1. Divide the value by `1,000,000`
   1. Use the round() function to round to one decimal place
   1. Convert the float to a string and concatenate with an `M` to convert the value to a human-readable form (e.g. 28345623 becomes 28.3M)
   1. Return the human-readable `volume`
1. Open the file `folder/yahoo_prices_short.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, call your function AND give it the `volume` as an argument
1. Print the `date` and `volume` in human-readable form.

In [None]:
import csv

def human(txt):
    num = int(txt)/1000000
    num = round(num, 1)
    return str(num) + 'M'
    
with open("folder/yahoo_prices_short.csv") as fin:
    fin.readline()
    logs = csv.reader(fin)
    
    for line in logs:
        date, _open, high, low, close, volume, adj_close = line
        print(date, human(volume))

## Problem 4

This problem uses the file: `folder/yahoo_prices_short.csv`, which has the following columns:

`date, open, high, low, close, volume, adj_close`

You will create a script to extract the:
* `date`
* `volume`

1. Create a function that when given one text value (`date`) returns one value: 
   1. Split the text and extract the day
   1. Return the `day` of the month (e.g. mm/dd/yyyy >>> 2/29/2008 >>> 29)
1. Open the file `folder/yahoo_prices_short.csv` AND label it with a suitable filehandle
1. Use the `csv` module to read the data from the file (don't forget to import the module)
1. For each row, extract the `volume`
1. For the row with the highest `volume`, calculate the `day` of the month associated with that row, using your function
1. `print()` the `day` of the month with the highest `volume` and the value associated with the `volume`

In [None]:
import csv

def dayFromDate(date):
    return date.split('/')[1]
    
with open("folder/yahoo_prices_short.csv") as fin:
    fin.readline()
    logs = csv.reader(fin)
    
    highest_vol = 0
    
    for line in logs:
        date, _open, high, low, close, volume, adj_close = line
        volume = int(volume)
        if volume > highest_vol:
            highest_vol = volume
            day = dayFromDate(date)
        
        
    print(day, highest_vol)
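
An alternative sketch that leans on the built-in `max()` from the detour earlier. It assumes the rows have already been reduced to `(date, volume)` pairs; the pairs below are taken from the output shown at the start of this session, and the name `day_from_date` is only illustrative.

```python
# (date, volume) pairs, as printed by the first examples in this session
rows = [
    ('2/29/2008', 23860500),
    ('2/28/2008', 30113200),
    ('2/27/2008', 27664100),
    ('2/26/2008', 26013000),
    ('2/25/2008', 32470600),
    ('2/22/2008', 26157800),
    ('2/21/2008', 34494000),
    ('2/20/2008', 29274700),
]

def day_from_date(date):
    # mm/dd/yyyy: the day is the second piece
    return date.split('/')[1]

# key= tells max() to compare rows by their volume (index 1)
busiest_date, busiest_volume = max(rows, key=lambda row: row[1])
print(day_from_date(busiest_date), busiest_volume)
```

The `key=` argument avoids tracking the running maximum by hand, at the cost of holding all rows in memory first.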
