# Setup

## Import packages for later use

`csv` is the built-in Python library for interacting with .csv files.

`os` and `shutil` are for interacting with the operating system and files.

In [None]:
import csv
import os
import shutil

# Prepare paths to datasets
pokemon_csv_path = 'pokemon_data.csv'
mtg_csv_path = 'mtg_data.csv'

# Examples

## Reading .csv files

### Counting rows

First, let's do something simple: count the lines in a .csv file.

At the very least, we'll have to open the file.

In [None]:
pokemon_file = open(pokemon_csv_path, encoding='utf-8')

The built-in way to read .csv files is using the `csv.reader()` function

In [None]:
csv_reader = csv.reader(pokemon_file)

Now we can use `csv_reader` as an iterable, and use it to count the lines in the file.

Each iteration returns one row of the .csv as a list of its comma-separated values (each value is a string).

In [None]:
# Comment out the line below to continue iterating through the csv
# You'll keep seeing this throughout the example, it resets the file position
pokemon_file.seek(0)

# next(iterator) returns the next item of the iterator, or raises StopIteration when it is exhausted
# Internally, it calls the iterator's __next__() method
first_line = next(csv_reader)
print(first_line)
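
The behaviour of `next()`, `StopIteration`, and `.seek(0)` described above can be tried without the dataset. The following is a minimal sketch that uses an in-memory `io.StringIO` object in place of a real file; the two data rows are made-up sample values, not taken from `pokemon_data.csv`.

```python
import csv
import io

# In-memory stand-in for a small .csv file (made-up sample data)
sample_file = io.StringIO('name,weight\nBulbasaur,6.9\nSnorlax,460.0\n')
sample_reader = csv.reader(sample_file)

header = next(sample_reader)       # each row comes back as a list of strings
first_row = next(sample_reader)
print(header)
print(first_row)

remaining = list(sample_reader)    # consume whatever rows are left

# The reader is now exhausted, so another next() raises StopIteration
try:
    next(sample_reader)
    exhausted = False
except StopIteration:
    exhausted = True

# Rewinding the underlying file lets the same reader start over
sample_file.seek(0)
header_again = next(sample_reader)
print(exhausted, header_again)
```

Note that the reader itself is never reset; only the file position is, which is why the examples call `.seek(0)` on the file rather than on `csv_reader`.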

In [None]:
pokemon_file.seek(0)
line_count = 0
for line in csv_reader:
    line_count += 1
    
line_count

We can also use a trick with `sum()` to count up the number of lines

In [None]:
pokemon_file.seek(0)
line_count_with_sum = sum(1 for i in csv_reader)
line_count_with_sum

Finally, we should close the file handle to release the system resources.  
(Note: You will have to rerun the code block with `open(pokemon_csv_path, ...)` to use the above code examples again.)

In [None]:
pokemon_file.close()

To generalize this and be able to count lines for any .csv file, we can throw this logic into a function.

We also don't want to forget using `.close()`, so we could use a context manager to open the file instead.

In [None]:
def count_csv_rows(path_to_csv):
    with open(path_to_csv, encoding='utf-8') as f:
        r = csv.reader(f)
        return sum(1 for _ in r)

In [None]:
print(f'Number of rows: {count_csv_rows(pokemon_csv_path)}')
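
To see that the context manager really does close the file for us, here is a small self-contained sketch. It writes a throwaway two-row file into a temporary directory (the file name and contents are made up for illustration), counts its rows, and then checks the `closed` attribute of the file object.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created only for this demonstration
    sample_path = os.path.join(tmp_dir, 'sample.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([['name', 'weight'], ['Bulbasaur', '6.9']])

    with open(sample_path, newline='', encoding='utf-8') as f:
        row_count = sum(1 for _ in csv.reader(f))

    # Leaving the with-block closed the file, no explicit .close() needed
    closed_after_with = f.closed

print(row_count, closed_after_with)
```

The file is closed even if an exception is raised inside the `with` block, which is the main reason to prefer it over a manual `.close()`.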

### Finding data in a .csv

Say we want to find all the Pokemon that weigh more than 900kg.

We really want:
* A list
* of Pokemon names (a single Pokemon is 1 row, name is a single column in that row)
* that have >900 weight (weight is just another column in the row)

In [None]:
with open(pokemon_csv_path, encoding='utf-8') as f:
    r = csv.reader(f)
    
    # Find the indexes of the columns we're interested in
    header_line = next(r)
    name_index = header_line.index('name')
    weight_index = header_line.index('weight')
    
    names = []
    for row in r:
        # Need to convert value to float to compare to 900
        if float(row[weight_index]) > 900:
            names.append(row[name_index])
            
    # List comprehension equivalent to the above loop, but it's not very clear what's going on
    # names = [row[name_index] for row in r if float(row[weight_index]) > 900]

print(names)

However, there is a potentially better way within the `csv` library.

By using a `csv.DictReader` instead, the first row will become our field names which we can then access directly for each row. Since boilerplate code has been reduced, the list comprehension that was a bit unwieldy above will fit nicely and be very readable.

In [None]:
def more_than_900_weight(filename):
    with open(filename, encoding='utf-8') as f:
        r = csv.DictReader(f)
        return [row['name'] for row in r if float(row['weight']) > 900]

print(more_than_900_weight(pokemon_csv_path))
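
To make the difference concrete, this sketch shows what a single `csv.DictReader` row looks like. It reads from an in-memory `io.StringIO` object, and the three rows are made-up sample values rather than rows from the real dataset.

```python
import csv
import io

# Made-up sample data; the first line supplies the field names
sample_file = io.StringIO(
    'name,weight\n'
    'Bulbasaur,6.9\n'
    'Snorlax,460.0\n'
    'Groudon,950.0\n'
)

rows = list(csv.DictReader(sample_file))

# Each row maps column name -> value, and every value is still a string
print(rows[0])

heavy = [row['name'] for row in rows if float(row['weight']) > 900]
print(heavy)
```

Because the values are strings, the `float()` conversion is still needed before comparing against a number.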

Try modifying the code above to see other information with different conditions!

# TODO

Now we'll set up a function to search for grass-type Pokemon

In [None]:
def search_csv(path_to_csv, search_for):
    with open(path_to_csv, encoding='utf-8') as f:
        r = csv.reader(f)
        return [row for row in r if search_for in row]

In [None]:
search_csv(pokemon_csv_path, 'grass')

Why doesn't the code block above output anything? Is that what we want?

If not, try to change `search_csv` to suit your desired behavior.

Hint: If you're having trouble, try running the code block below. Think about why "Moltres" returns a result but "Moltre" does not...

In [None]:
search_terms = ['Squirtle', 'Squ', 'Moltres', 'Moltre']

for term in search_terms:
    print(f'Search for "{term}" has {len(search_csv(pokemon_csv_path, term))} results')
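
If the hint is still puzzling, the sketch below isolates the behaviour of the `in` operator on a list versus on a string. The row is made up, and `rows_containing` is a hypothetical helper showing one possible way to get case-insensitive substring matching; it is only one of several reasonable answers to the exercise.

```python
# One made-up row, shaped like what csv.reader yields
row = ['Moltres', 'fire', 'flying']

whole_field_match = 'Moltres' in row     # list membership compares whole fields
partial_in_list = 'Moltre' in row        # no field is exactly 'Moltre'
partial_in_string = 'Moltre' in row[0]   # string membership is a substring test


def rows_containing(rows, search_for):
    """Return rows where any field contains search_for, ignoring case.

    Hypothetical helper, not part of the csv library.
    """
    needle = search_for.lower()
    return [row for row in rows
            if any(needle in field.lower() for field in row)]


print(whole_field_match, partial_in_list, partial_in_string)
print(rows_containing([row, ['Squirtle', 'water', '']], 'squ'))
```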

In [None]:
artifact_count = len(search_csv(mtg_csv_path, 'Artifact'))
print(f'Rows that have a field that contains only "Artifact": {artifact_count}')

In [None]:
def select_column(path_to_csv, column_name):
    with open(path_to_csv, encoding='utf-8') as f:
        r = csv.reader(f)
        column_index = next(r).index(column_name)
        return [row[column_index] for row in r if row[column_index]]

In [None]:
mega_names = select_column('pokemon_data.csv', 'megas')
print(mega_names)

In [None]:
def get_average(path_to_csv, column_name):
    with open(path_to_csv, encoding='utf-8') as f:
        r = csv.DictReader(f)
        
        values = [float(row[column_name]) for row in r]
        return sum(values) / len(values)

In [None]:
attributes_to_average = ['speed', 'health', 'attack', 'defense', 'height']

postfixes = {'height': 'm',
             'weight': 'kg'}

for attribute in attributes_to_average:
    average = get_average(pokemon_csv_path, attribute)
    
    # Get a little more fancy by adding units to weight and height
    post_string = ''
    if attribute in postfixes:
        post_string = postfixes[attribute]
        
    print(f'Average {attribute}: {average:.2f}' + post_string)

### Find most common occurrence in a column

In the example below, we can take advantage of Python's great built-in libraries with the `collections.Counter` class. Documentation for the `collections` library can be found [here](https://docs.python.org/3/library/collections.html).

***WARNING***  
*In some college classes you may be restricted from using some built-in libraries, so you should still know the basics of how these work behind the scenes.*

In [None]:
from collections import Counter

def most_common(path_to_csv, column_name, predicate=None):
    with open(path_to_csv, encoding='utf-8') as f:
        r = csv.DictReader(f)
        
        # Only count values that pass the predicate, if one was given
        if predicate:
            counter = Counter(row[column_name] for row in r if predicate(row[column_name]))
        else:
            counter = Counter(row[column_name] for row in r)
        
        # .most_common(1) returns a list with one (value, count) tuple, so return that tuple
        return counter.most_common(1)[0]

Below are some examples using our new function.

In [None]:
speed, count = most_common(pokemon_csv_path, 'speed')
print(f'"{speed}" was the most common speed with {count} occurrences')

In [None]:
print(f'Our Magic The Gathering dataset contains information on '
      + f'{count_csv_rows(mtg_csv_path) - 1} cards')

common_type, count = most_common(mtg_csv_path, 'type')

print(f'"{common_type}" was the most common type with {count} occurrences')

In [None]:
common_subtype, count = most_common(mtg_csv_path, 'subtypes')

print(f'"{common_subtype}" was the most common subtype with {count} occurrences')

In [None]:
common_subtype, count = most_common(mtg_csv_path, 'subtypes',
                                    lambda s: len(s))

print(f'"{common_subtype}" was the most common subtype with {count} occurrences')
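
For settings where `collections.Counter` is off limits, the same counting can be done by hand with a plain dictionary. The following is a rough sketch of the idea, not a description of how `Counter` is actually implemented; `most_common_manual` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def most_common_manual(values):
    """Return the (value, count) pair with the highest count (hypothetical helper)."""
    counts = {}
    for value in values:
        # .get() supplies 0 the first time a value is seen
        counts[value] = counts.get(value, 0) + 1
    return max(counts.items(), key=lambda item: item[1])


sample = ['Creature', 'Artifact', 'Creature', 'Land']  # made-up values

manual_result = most_common_manual(sample)
counter_result = Counter(sample).most_common(1)[0]
print(manual_result, counter_result)
```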

## Writing to .csv files

### Most basic example

In [None]:
students = ['Bob Gel Sr.', 'Bob Gel Jr.', 'Jane Doe']
students_with_id = [(name, i) for i, name in enumerate(students)]

with open('new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(students_with_id)  # parameter is some iterable

In [None]:
# Confirm that the .csv we wrote is as expected and then delete it
with open('new.csv', newline='') as f:
    r = csv.reader(f)
    for row in r:
        print(row)
        
os.remove('new.csv')
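
Two details of `csv.writer` are easy to miss: fields that contain a comma are quoted automatically, and numbers come back as strings when the file is read again. This sketch writes to an in-memory `io.StringIO` buffer instead of a file; the names are made up, with a comma added on purpose.

```python
import csv
import io

# newline='' keeps the writer's own line endings untouched
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)

writer.writerow(['name', 'id'])                            # header row
writer.writerows([('Gel, Bob Sr.', 0), ('Jane Doe', 1)])   # made-up data

written = buffer.getvalue()
print(repr(written))

# Read the same text back to check the round trip
buffer.seek(0)
round_trip = list(csv.reader(buffer))
print(round_trip)
```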

### Modifying existing values

Going back to a previous example, let's modify our Pokemon dataset to cap the weight at 800kg.

In order to not break the earlier code blocks, we're going to first copy the dataset to a new file and operate on that.

In [None]:
capped_pokemon_csv_path = 'sub800kg_pokemon_data.csv'
shutil.copy(pokemon_csv_path, capped_pokemon_csv_path)

In [None]:
print(f'# of Pokemon with weight >900: '
      + f'{len(more_than_900_weight(capped_pokemon_csv_path))}')

Reading from and writing to the same file at once is a recipe for disaster, so in this example we load the entire dataset into memory first and only start writing after we have finished reading. However, this method has a larger memory footprint and may not work for datasets that don't fit in memory.

Another solution would be to first write to some temporary file, and then move that file to overwrite the original.  
*PS. There is a built-in library* `tempfile` *for this too!*

In [None]:
fields = []
data = []
with open(capped_pokemon_csv_path, encoding='utf-8') as f:
    r = csv.DictReader(f)
    fields = r.fieldnames
    
    for row in r:
        if float(row['weight']) > 800:
            row['weight'] = 800
        data.append(row)

# newline='' stops extra blank lines from appearing between rows on some platforms
with open(capped_pokemon_csv_path, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, fields)
    w.writeheader()
    
    for row in data:
        w.writerow(row)
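
The temporary-file alternative mentioned earlier could look roughly like the sketch below. It streams rows one at a time into a temporary file in the same directory and then moves that file over the original with `os.replace`. `cap_column` is a hypothetical helper, the demo file is made up, and error handling (such as deleting the temporary file if something fails midway) is left out for brevity.

```python
import csv
import os
import tempfile


def cap_column(path_to_csv, column_name, cap):
    """Rewrite the file so values above cap become cap (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path_to_csv))
    with open(path_to_csv, newline='', encoding='utf-8') as source, \
         tempfile.NamedTemporaryFile('w', newline='', encoding='utf-8',
                                     dir=directory, delete=False) as temp:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(temp, reader.fieldnames)
        writer.writeheader()
        for row in reader:  # only one row is held in memory at a time
            if float(row[column_name]) > cap:
                row[column_name] = cap
            writer.writerow(row)
    # Both files are closed here, so the temp file can replace the original
    os.replace(temp.name, path_to_csv)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')  # made-up demo file
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        f.write('name,weight\r\nSnorlax,460.0\r\nGroudon,950.0\r\n')

    cap_column(demo_path, 'weight', 800)

    with open(demo_path, newline='', encoding='utf-8') as f:
        capped_rows = list(csv.reader(f))

print(capped_rows)
```

Creating the temporary file in the same directory as the target keeps the final move on one filesystem, which is why `dir=directory` is passed here.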

# Profiling Python Performance

The code blocks below can be run to show a "pager" at the bottom of the window that displays the profiler summary, including how long that block took to run.

In [None]:
%%prun -l 0
count_csv_rows(pokemon_csv_path) # 930 rows

In [None]:
%%prun -l 0
count_csv_rows(mtg_csv_path) # 35758 rows

In [None]:
%%prun -l 0
most_common(mtg_csv_path, 'type')

In [None]:
%%prun -l 0
most_common(mtg_csv_path, 'subtypes', lambda s: len(s))
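
`%%prun` is an IPython magic, so it only works inside a notebook or IPython shell. In a plain Python script, a similar measurement can be sketched with the standard-library `cProfile` and `pstats` modules. The `count_to` function below is a made-up stand-in for something like `count_csv_rows`, so the example runs without the datasets.

```python
import cProfile
import io
import pstats


def count_to(n):
    """Made-up workload: count n items the same way count_csv_rows counts rows."""
    return sum(1 for _ in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = count_to(100_000)
profiler.disable()

# Collect the report in a string instead of printing straight to stdout
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats('cumulative').print_stats(5)  # show at most 5 entries

print(result)
print(report.getvalue())
```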