<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Base Python: Comma Separated Value (CSV) Files
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 3</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Comma Separated Value Format

- Tabular data
- Column entries separated by commas
- Typical file extensions: .csv, .dat, .txt
- Very common data format

Example: Track times (s) for the 100m dash for four athletes across three meets.

Clear tabular format:

| Meet 1 | Meet 2 | Meet 3 |
| ------ | ------ | ------ |
| 13.10  | 13.59  | 13.44  |
| 13.93  | 13.85  | 13.47  |
| 14.12  | 14.41  | 13.89  |
| 14.42  | 13.55  | 13.43  |


Data as nested list of numbers:

In [3]:
track_times = [
    [13.10, 13.59, 13.44],
    [13.93, 13.85, 13.47],
    [14.12, 14.41, 13.89],
    [14.42, 13.55, 13.43]
]

track_times

[[13.1, 13.59, 13.44],
 [13.93, 13.85, 13.47],
 [14.12, 14.41, 13.89],
 [14.42, 13.55, 13.43]]

Data in a file is represented as plain text with a comma delimiter:
- Read from a file, the data arrives as a string.
- It looks like this:

In [4]:
# Initialize an empty string
track_times_csv = ""

# Loop over all lists in the overall list
for index, athlete_times in enumerate(track_times):
    # Join together the values in the nested list using
    # a comma as a separator
    athlete_times_string = ",".join([str(time) for time in athlete_times])
    # Append the values to the overall string
    track_times_csv += athlete_times_string
    # Append a newline, unless this is the last row
    if index < (len(track_times) - 1):
        track_times_csv += "\n"
    
print(track_times_csv)

13.1,13.59,13.44
13.93,13.85,13.47
14.12,14.41,13.89
14.42,13.55,13.43


What we want (inverse process):
- Interpret string as tabular data.
- Convert to a Python list or dict.
- Load into memory.

Write comma separated string to file:

In [5]:
with open("Data/track_times.csv", "w") as f:
    f.write(track_times_csv)

In [6]:
import os
os.path.exists('Data/track_times.csv')

True

#### Open and read a csv file (raw string):

In [7]:
with open("Data/track_times.csv", "r") as f:
    csv_string = f.read()

In [101]:
print(type(csv_string))
csv_string

<class 'str'>


'13.1,13.59,13.44\n13.93,13.85,13.47\n14.12,14.41,13.89\n14.42,13.55,13.43'

#### Open and read a csv file (convert to list format) line by line:
- Preferred approach for larger files.
- Uses less memory: only one line of raw text is read from the file at a time.
- Each line can be processed and cleaned as it is read.


In [8]:

with open("Data/track_times.csv", "r") as f:
    data = []
    
    while True:
        csv_string = f.readline()
        if csv_string == '':
            break
        row = csv_string.strip().split(",")
        data.append(row)


In [9]:
data

[['13.1', '13.59', '13.44'],
 ['13.93', '13.85', '13.47'],
 ['14.12', '14.41', '13.89'],
 ['14.42', '13.55', '13.43']]

#### Another (better) way to do this:

- The file object can be treated as an iterable.
- A for loop over the file object:
    - fetches the next line on each pass, much like calling .readline().
    - terminates automatically when there are no more lines.
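
The iteration behaviour can be checked without a file on disk. The following is a minimal sketch that uses `io.StringIO` as a stand-in for an open text file (an assumption for illustration; the lesson itself reads from `Data/track_times.csv`). It shows that `next()` pulls a single line and that a later loop resumes where `next()` left off.

```python
import io

# Stand-in for an open text file (illustrative data, two rows only)
f = io.StringIO("13.1,13.59,13.44\n13.93,13.85,13.47\n")

# next() returns the same line a for loop would yield first
first = next(f)

# Iterating afterwards continues from the second line
remaining = [line for line in f]

print(repr(first))   # '13.1,13.59,13.44\n'
print(remaining)     # ['13.93,13.85,13.47\n']
```

Pulling one line with `next()` before looping is a handy way to treat the first row differently from the rest.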

In [112]:
with open("Data/track_times.csv", "r") as f:
    data = []
    delimiter = ","
    # each pass of the loop yields the next line, as f.readline() would:
    for line in f:
        # strip "\n" and split on delimiter.
        data.append(line.strip().split(delimiter))


In [113]:
data

[['13.1', '13.59', '13.44'],
 ['13.93', '13.85', '13.47'],
 ['14.12', '14.41', '13.89'],
 ['14.42', '13.55', '13.43']]

That's better, but it's still not exactly what we want.

What's the issue?

In [10]:
print(type(data[0][0]))
data[0][0]

<class 'str'>


'13.1'

DATA TYPES!!! DATA TYPES!!! DATA TYPES!!!!

Have to take care of this manually.

In [11]:
with open("Data/track_times.csv", "r") as f:
    data = []
    for line in f:
        # line is a string. strip "\n" and split on delimiter.
        delimiter = ","
        stripped_line_list = line.strip().split(delimiter)
        # convert each element from string to float
        float_line = [float(x) for x in stripped_line_list]
        
        data.append(float_line)

In [12]:
data

[[13.1, 13.59, 13.44],
 [13.93, 13.85, 13.47],
 [14.12, 14.41, 13.89],
 [14.42, 13.55, 13.43]]

In [13]:
type(data[0][0])

float
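
Manual conversion also means deciding what to do with fields that are not numbers. The sketch below is one possible approach, not part of the original lesson: `to_float` is a hypothetical helper, and the sample line with an empty field and a `DNF` marker is made up for illustration. It assumes that mapping unparseable fields to `None` is acceptable.

```python
def to_float(text):
    """Convert a CSV field to float; return None if it is empty or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


# Hypothetical row with a missing value and a non-numeric marker
raw_line = "13.1,,DNF\n"
converted = [to_float(field) for field in raw_line.strip().split(",")]

print(converted)  # [13.1, None, None]
```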

#### Dealing with headers and column names.

- Often .csv file explicitly contains column names on first row.
- Have to deal with this.

In [14]:
with open("Data/track_times_header.csv", "r") as f:
    data = f.read()
data

'Meet 1,Meet 2,Meet 3\n13.1,13.59,13.44\n13.93,13.85,13.47\n14.12,14.41,13.89\n14.42,13.55,13.43\n'

In [214]:

data = []

with open("Data/track_times_header.csv", "r") as f:
    col_line = 0  # index of the line that holds the column names
    delimiter = ','
    # enumerate returns (integer index, element) tuple:
    # helps us keep track of line number in file.
    
    for i, line in enumerate(f): 
        # line is a string. strip "\n" and split on delimiter.
        stripped_line_list = line.strip().split(delimiter) 
        
        if i == col_line:  # column name line: keep entries as strings
            data.append(stripped_line_list)
            
        else:
            # convert each element from string to float: only for data lines
            float_line = [float(x) for x in stripped_line_list]
            data.append(float_line)
  

In [215]:
data

[['Meet 1', 'Meet 2', 'Meet 3'],
 [13.1, 13.59, 13.44],
 [13.93, 13.85, 13.47],
 [14.12, 14.41, 13.89],
 [14.42, 13.55, 13.43]]

This is a common operation. 
- The csv library can help with some of this:
    - Takes care of stripping and splitting.
    - Can convert unquoted fields to floats (quoting=csv.QUOTE_NONNUMERIC).
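
One case where the csv library does more than `strip().split(",")` is a quoted field that itself contains a comma. The sketch below compares the two approaches on a single made-up line; the athlete name column is hypothetical and does not appear in the lesson's data files, and `io.StringIO` stands in for an open file.

```python
import csv
import io

# Hypothetical line: the first field contains a comma, so it is quoted
text = '"Smith, Jane",13.1,13.59\n'

# Naive splitting cuts the quoted field in two
naive = text.strip().split(",")

# csv.reader respects the quotes and returns three fields
parsed = next(csv.reader(io.StringIO(text)))

print(naive)   # ['"Smith', ' Jane"', '13.1', '13.59']
print(parsed)  # ['Smith, Jane', '13.1', '13.59']
```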
    

In [15]:
# import csv library (included in base python)
import csv

In [221]:
data = []
with open("Data/track_times.csv", "r") as f:
    # QUOTE_NONNUMERIC converts every unquoted field to float; the reader also strips line endings and splits on the delimiter.
    csv_obj = csv.reader(f, delimiter = "," , quoting=csv.QUOTE_NONNUMERIC) 
    for row in csv_obj:
        data.append(row)
data  

[[13.1, 13.59, 13.44],
 [13.93, 13.85, 13.47],
 [14.12, 14.41, 13.89],
 [14.42, 13.55, 13.43]]

Compared to:

In [218]:
with open("Data/track_times.csv", "r") as f:
    data = []
    for line in f:
        # line is a string. strip "\n" and split on delimiter.
        delimiter = ","
        stripped_line_list = line.strip().split(delimiter)
        # convert each element from string to float
        float_line = [float(x) for x in stripped_line_list]
        
        data.append(float_line)
data

Take in csv with column name row:

In [222]:
data = []
with open("Data/track_times_header.csv", "r") as f:
    
    # input the whole file, header row included, into the csv reader
    csv_obj = csv.reader(f, delimiter = ',')
    for row in csv_obj:
        data.append(row)
        
data

[['Meet 1', 'Meet 2', 'Meet 3'],
 ['13.1', '13.59', '13.44'],
 ['13.93', '13.85', '13.47'],
 ['14.12', '14.41', '13.89'],
 ['14.42', '13.55', '13.43']]

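The data is still in string format. To get numeric values, read the column names manually first, because QUOTE_NONNUMERIC would try to convert the unquoted header row to float and fail: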

In [220]:
data = []
with open("Data/track_times_header.csv", "r") as f:
    # manually get first row as column names
    cols = next(f).strip().split(',')
    data.append(cols)

    
    # input the rest of the file into the csv reader
    # convert to numeric
    csv_obj = csv.reader(f, delimiter = ',', quoting = csv.QUOTE_NONNUMERIC)
    
    for row in csv_obj:
        data.append(row)
data

[['Meet 1', 'Meet 2', 'Meet 3'],
 [13.1, 13.59, 13.44],
 [13.93, 13.85, 13.47],
 [14.12, 14.41, 13.89],
 [14.42, 13.55, 13.43]]

#### csv DictReader
- Returns each row as a dictionary keyed by column (attribute) name.
- Lets you address values by column name rather than by position.


In [187]:
with open("Data/track_times_header.csv", "r") as f:
    reader = csv.DictReader(f, delimiter = ",")
    data = list(reader)

data

[{'Meet 1': '13.1', 'Meet 2': '13.59', 'Meet 3': '13.44'},
 {'Meet 1': '13.93', 'Meet 2': '13.85', 'Meet 3': '13.47'},
 {'Meet 1': '14.12', 'Meet 2': '14.41', 'Meet 3': '13.89'},
 {'Meet 1': '14.42', 'Meet 2': '13.55', 'Meet 3': '13.43'}]

- Want to load numeric data as floats while keeping the column name keys as strings.
    - csv DictReader takes a fieldnames argument for the column keys.
    - Can tell DictReader to interpret data as numeric with quoting=csv.QUOTE_NONNUMERIC.

In [17]:
with open("Data/track_times_header.csv", "r") as f:
    colnames = next(f).strip().split(',')
    reader = csv.DictReader(f, delimiter = ",", quoting=csv.QUOTE_NONNUMERIC, fieldnames = colnames)
    data = list(reader)

data

[{'Meet 1': 13.1, 'Meet 2': 13.59, 'Meet 3': 13.44},
 {'Meet 1': 13.93, 'Meet 2': 13.85, 'Meet 3': 13.47},
 {'Meet 1': 14.12, 'Meet 2': 14.41, 'Meet 3': 13.89},
 {'Meet 1': 14.42, 'Meet 2': 13.55, 'Meet 3': 13.43}]

In [18]:
print(type(data[0]['Meet 2']))
data[0]['Meet 2']

<class 'float'>


13.59
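
An alternative is to let `DictReader` read the header row itself and convert the values afterwards. This is a minimal sketch under the assumption that every data field is numeric; the inline text is a shortened stand-in for `Data/track_times_header.csv` so the example runs on its own, and `records` is an illustrative name.

```python
import csv
import io

# Stand-in for Data/track_times_header.csv (first two data rows only)
text = "Meet 1,Meet 2,Meet 3\n13.1,13.59,13.44\n13.93,13.85,13.47\n"

with io.StringIO(text) as f:
    reader = csv.DictReader(f)  # the header row becomes the keys
    # Convert each value to float, keeping the column names as string keys
    records = [
        {column: float(value) for column, value in row.items()}
        for row in reader
    ]

print(records[0]["Meet 2"])  # 13.59
```

This keeps the header handling inside `DictReader` at the cost of an explicit conversion step, which also makes it easy to convert only selected columns.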

What we've learned:

- How to import CSVs:
    - base python
    - csv library.
- Dealing with:
    - headers: code
    - column names: more code
    - different data types: tricky, more code


Important to know base python data importing for csv...

but:

Pandas will save us from code apocalypse!!!
<br>
<br>
<div align = "right">
    <center><img src="Images/pandas.jpg" align = "center" width="500"/></center>
</div>
    


