
### Pandas: Introduction

#### Under the hood

Suppose we have the following csv file, cash.csv:

    date,person,dollars
    2000-01-03,Michael,200
    2000-01-03,George,500
    2000-01-03,Lisa,450
    2000-01-04,Michael,180.50
    2000-01-04,George,450
    2000-01-04,Lisa,448
    2000-01-05,Michael,177
    2000-01-05,George,420
    2000-01-05,Lisa,447
    2000-01-06,Michael,150
    2000-01-06,George,300
    2000-01-06,Lisa,344.60

Suppose we want to use this file to do some calculations. It is not obvious what we are looking at here (total earnings? total cash withdrawals?), but either way we would have to process the contents of the file.

In Python this could look like:

    #!/usr/bin/env python
    import sys
    input_file = sys.argv[1]
    # ... more stuff
    with open(input_file, 'r', newline='') as filereader:
        header = filereader.readline()
        header = header.strip()
        header_list = header.split(',')
        # ... do something useful with the header
        for row in filereader:
            row = row.strip()
            row_list = row.split(',')
            # ... do something useful with the contents
    # Report about the useful stuff

In [0]:
#!/usr/bin/env python
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as filereader:
    with open(output_file, 'w', newline='') as filewriter:
        header = filereader.readline()
        header = header.strip()
        header_list = header.split(',')
        print(header_list)
        filewriter.write(','.join(map(str,header_list))+'\n')
        for row in filereader:
            row = row.strip()
            row_list = row.split(',')
            print(row_list)
            filewriter.write(','.join(map(str,row_list))+'\n')

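Run as a notebook cell, this code fails with `FileNotFoundError: [Errno 2] No such file or directory: '-f'`, because inside a notebook sys.argv holds the arguments of the notebook kernel rather than our file names. The script is meant to be saved to a file (csv_01.py) and run from the command line.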

So what does the code do?

The first line is a comment line, the so-called shebang, which lets Unix-like systems run the script directly with the Python interpreter found in the environment.
On the second line we import the sys (short for system) library, so that on lines 4 and 5 we can assign the first two command-line arguments after the script name to the variables "input_file" and "output_file".

Then, on line 7, we open the file named by the variable "input_file" for reading ('r') as a file object referred to as "filereader". On the next line we open the file named by "output_file" for writing ('w') as a file object referred to as "filewriter". Remember that list indexing in Python starts at 0, so argv[0] always refers to the script being run, argv[1] is the next argument on the command line (the input_file) and argv[2] refers to the output_file.
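
The indexing of sys.argv described above can be tried out with a plain list standing in for the real command line. This is only a sketch: the helper name `parse_paths` and the output file name `out1.csv` are assumptions made up for illustration and are not part of the original script.

```python
def parse_paths(argv):
    """Return (input_file, output_file) from an argv-style list.

    Hypothetical helper: argv[0] is the script itself, so the two
    paths we care about sit at positions 1 and 2.
    """
    if len(argv) < 3:
        raise SystemExit(f"usage: {argv[0]} INPUT_FILE OUTPUT_FILE")
    return argv[1], argv[2]


# A hand-written list standing in for sys.argv
sample_argv = ["csv_01.py", "calories.csv", "out1.csv"]
input_file, output_file = parse_paths(sample_argv)
print(input_file, output_file)
```

Checking the length first gives a readable usage message instead of an IndexError when an argument is missing.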

Line 9 uses the file object's readline method to read the first line of the input file as a string and assigns it to a variable named "header". With calories.csv as input, "header" now contains the first line of that file: "food,amount,calories". On line 10 we strip leading and trailing whitespace (including the newline) from the string, and line 11 splits the content of header on "," into a list of strings, ['food', 'amount', 'calories'], referred to by the variable "header_list".

Then we do two things. Line 12 prints the content of "header_list" to standard output (the terminal when running the program). Line 13 writes the contents of "header_list" to the filewriter object, doing the opposite of the split method: it first maps the "str" function over all elements of "header_list", then joins these elements with a ',' and adds a newline at the end.

So what it does is: ['food', 'amount', 'calories'] => 'food,amount,calories' and hands it over to filewriter. The rest of the input file goes through the same motions, but here we use a for loop to process all remaining lines of the input file.
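
The split and join steps can be tried in isolation on a single string. A minimal sketch, using the header line quoted in the text:

```python
header = "food,amount,calories\n"

# strip() removes the trailing newline, split(',') breaks the string into fields
header_list = header.strip().split(',')
print(header_list)   # ['food', 'amount', 'calories']

# join() is the inverse of split(): glue the fields back together with commas
line = ','.join(map(str, header_list)) + '\n'
print(repr(line))    # 'food,amount,calories\n'
```

The map(str, ...) call changes nothing here, because the fields are already strings. It only matters once some fields have been converted to numbers.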

So we wrote a little "roundtripper" that touched all elements of the csv file (input), writing intermediate results to screen and writing the end result to file.

Now this might seem like a lot of work for doing "nothing", but it served a special purpose here: showing how to process a csv file *without* using any libraries. What is more, we could have done all sorts of useful things before writing the contents of the input file back to the output file.
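
As one example of such "useful stuff", the same line-by-line approach can add up the dollars per person in cash.csv. This is a sketch under assumptions: summing per person is just one possible calculation, and the first rows of the file are held in memory so that the snippet runs without a file on disk.

```python
import io
from collections import defaultdict

# In-memory copy of the first rows of cash.csv, so the sketch runs without a file
cash_csv = io.StringIO(
    "date,person,dollars\n"
    "2000-01-03,Michael,200\n"
    "2000-01-03,George,500\n"
    "2000-01-03,Lisa,450\n"
    "2000-01-04,Michael,180.50\n"
    "2000-01-04,George,450\n"
    "2000-01-04,Lisa,448\n"
)

# Use the header to find the columns by name instead of hard-coding positions
header_list = cash_csv.readline().strip().split(',')
person_col = header_list.index('person')
dollars_col = header_list.index('dollars')

totals = defaultdict(float)
for row in cash_csv:
    row_list = row.strip().split(',')
    # Fields arrive as strings, so convert before doing arithmetic
    totals[row_list[person_col]] += float(row_list[dollars_col])

for person, total in totals.items():
    print(person, total)
```

Note how much of the code deals with bookkeeping (finding columns, converting strings) rather than with the calculation itself.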

In the following snippet, csv_02.py, we use the Python library "csv" to do most of the work for us. We use the library's reader function to read the contents of the csv file into a reader object.

If we use the file csv_02.py to print the contents of the file "calories.csv" to screen, we see that the library does the same as our reader code in csv_01.py: it splits the items of a line into a list of strings.

In [0]:
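#!/usr/bin/env python
import csv

def dataset(path):
    # newline='' is what the csv module expects ('rU' was removed in Python 3.11)
    with open(path, newline='') as data:
        reader = csv.reader(data)
        next(reader)  # skip the header row (assumes the file starts with one)
        for row in reader:
            row[2] = int(row[2])  # calories as a number instead of a string
            yield row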
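
What the csv library adds over a plain split(',') shows up as soon as a field itself contains a comma. A minimal sketch, using an invented row (not taken from calories.csv) held in memory:

```python
import csv
import io

# Hypothetical data: the second line has a comma inside a quoted field
text = 'food,amount,calories\n"cheese, cheddar",slice,113\n'

# Naive approach from the first script: split every line on commas
naive = [line.strip().split(',') for line in io.StringIO(text)]

# csv.reader understands the quoting rules of the csv format
parsed = list(csv.reader(io.StringIO(text, newline='')))

print(naive[1])   # ['"cheese', ' cheddar"', 'slice', '113']
print(parsed[1])  # ['cheese, cheddar', 'slice', '113']
```

In both cases the values are still strings, which is why dataset() converts the third column with int().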

Suppose we made a directory ("~/Projects/myproject") and stored the files we mentioned there, with the snippet above saved as foo.py (the name used in the import below). We could then open a Python REPL in that directory with the command python or ipython.

In [0]:
# After the prompt (">>>") enter the following line:
from foo import dataset # and hit [return]
# And then the following two lines
for row in dataset("calories.csv"):
    print(row)

![Python REPL commands](https://github.com/eur-nl/bootcamps/blob/master/pandas/graphics/pandas_intro_01.png?raw=1)

We can add the commands we just entered, which you can see in the image above, to the file foo.py:

In [0]:
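#!/usr/bin/env python
import csv

def dataset(path):
    # newline='' is what the csv module expects ('rU' was removed in Python 3.11)
    with open(path, newline='') as data:
        reader = csv.reader(data)
        next(reader)  # skip the header row (assumes the file starts with one)
        for row in reader:
            row[2] = int(row[2])
            yield row

if __name__ == '__main__':
    for row in dataset("calories.csv"):
        print(row[0])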

The trick here is the line beginning with "if __name__ ==": the block below it only runs when the file is executed directly from the command line (python foo.py), and not when the file is imported as a module.
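
The mechanism can be seen in a file of only a few lines. A minimal sketch, in which the function `greet` is a hypothetical stand-in for dataset():

```python
def greet():
    # Hypothetical function standing in for dataset()
    return "hello from the module"


# __name__ is '__main__' only when the file is executed directly
# (python thisfile.py); when the file is imported, __name__ is the
# module name and this block is skipped
if __name__ == '__main__':
    print(greet())
```

This way the same file serves both as an importable module and as a small command-line program.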

![Run foo.py from the CLI](https://github.com/eur-nl/bootcamps/blob/master/pandas/graphics/pandas_intro_02.png?raw=1)

So now we have a project directory with a data file and a small program. There are guidelines on how to properly structure a larger Python project.

You will find more information on the website of Zed Shaw, [Learn Python the hard way](https://learnpythonthehardway.org/book/ex46.html).

The idea is to use a standard directory structure, together with virtual environments, and to keep everything under version control.

We have seen how libraries can simplify your work by shielding you from a lot of low-level details. On the other hand, when things go wrong, and things will go wrong :-), you are looking at a black box, unless you understand what a library does under the hood. That is why it is always a good idea to carefully study the documentation and examples a library provides. Start testing the problems you want to solve right away and plug them in, instead of just following the tutorial.

In the first example we read in a csv file line by line, and in our second try we saw that the csv library's reader function returns an iterator over all rows of the csv file.

The Pandas library reads in a csv file and returns a dataframe.

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/eur-nl/bootcamps/master/pandas/data/calories.csv"
df = pd.read_csv(url)
df.head()

|   | food | amount | calories |
|---|------|--------|----------|
| 0 | butter | tbsp | 102 |
| 1 | cheddar cheese | slice | 113 |
| 2 | whole milk | cup | 148 |
| 3 | hamburger | item | 254 |
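
Compared with the csv library, read_csv also takes care of the header and of type conversion. A minimal sketch, assuming an in-memory stand-in for calories.csv that contains the rows shown in the output above:

```python
import io
import pandas as pd

# In-memory stand-in for calories.csv
text = (
    "food,amount,calories\n"
    "butter,tbsp,102\n"
    "cheddar cheese,slice,113\n"
    "whole milk,cup,148\n"
    "hamburger,item,254\n"
)

df = pd.read_csv(io.StringIO(text))

print(type(df).__name__)      # DataFrame
print(list(df.columns))       # ['food', 'amount', 'calories']

# The calories column was parsed as integers, no int(row[2]) needed
print(df['calories'].sum())   # 617
```

The first line of the file became the column names, and the numeric column is ready for arithmetic.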


Before we dive deeper into Pandas, we present the code for our "roundtrip" example.

In [0]:
#!/usr/bin/env python
import sys
import pandas as pd
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_csv(input_file)
print(data_frame)
data_frame.to_csv(output_file, index=False)

We can run this file in the project directory as:

    python csv_03.py calories.csv out3.csv
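
The round trip can also be tried without any files, because to_csv returns the csv text as a string when no path is given. A minimal sketch, using two invented-in-place rows in the format of calories.csv:

```python
import io
import pandas as pd

source = "food,amount,calories\nbutter,tbsp,102\nwhole milk,cup,148\n"

data_frame = pd.read_csv(io.StringIO(source))

# With no path given, to_csv returns the csv text as a string
result = data_frame.to_csv(index=False)

# Compare line by line so that platform line endings do not matter
print(result.splitlines() == source.splitlines())   # True

# Leaving out index=False writes the row labels as an extra first column
print(data_frame.to_csv().splitlines()[0])          # ,food,amount,calories
```

This is why the script passes index=False: without it the output file would gain a column that was not in the input.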

### Pandas intro

Pandas uses two data structures: series and dataframes.

A series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its *index*.

In [0]:
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Pandas generated the index for us, sequentially numbering the elements of the list we defined as our series. Both parts of the Series, the index and the array of values, are available:

In [0]:
obj.values

In [0]:
obj.index
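
The two cells above do not show their output here. A small sketch of what the two attributes hold:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# The values are a NumPy array
print(type(obj.values).__name__)   # ndarray
print(obj.values)                  # [ 4  7 -5  3]

# The automatically generated index counts from 0
print(type(obj.index).__name__)    # RangeIndex
print(list(obj.index))             # [0, 1, 2, 3]
```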

It is also possible to label the data points while constructing the series:

In [0]:
obj2 = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
obj2

In [0]:
obj2['a']
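
Labels can be used for more than looking up a single value. A short sketch of selecting several labels at once and of assigning by label:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

print(obj2['a'])                   # 4

# A list of labels selects several values, in the order requested
print(obj2[['c', 'a']].tolist())   # [-5, 4]

# Assignment by label changes the series in place
obj2['d'] = 6
print(obj2.tolist())               # [4, 7, -5, 6]
```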

Because under the hood a Pandas series stores its values in a NumPy array, we can do all sorts of NumPy-like operations on it.

In [0]:
# Boolean filtering
obj2[obj2 > 0]

In [0]:
# scalar multiplication
obj2 * 2

In [0]:
# applying math functions
import numpy as np
np.exp2(obj2) # calculate 2**x for all elements of the array
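
What sets these operations apart from plain NumPy is that the link between labels and values is preserved. A sketch that repeats the three operations above and inspects the resulting index each time:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

# Boolean filtering keeps the labels of the surviving elements
positive = obj2[obj2 > 0]
print(list(positive.index))   # ['a', 'b', 'd']

# Scalar multiplication is applied element by element
doubled = obj2 * 2
print(doubled.tolist())       # [8, 14, -10, 6]

# NumPy functions return a Series again, with the same index
powers = np.exp2(obj2)
print(type(powers).__name__)  # Series
print(list(powers.index))     # ['a', 'b', 'c', 'd']
```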

A series also looks a lot like a Python dictionary or dict: a fixed-length, ordered dict that maps index values to data values. In fact, when we have a Python dict we can create a Series directly from it.

In [0]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [0]:
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
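
The dict-like behaviour goes beyond construction. A short sketch of membership tests, lookup by key, and the way back to a plain dict:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

# "in" tests the index, just as it tests the keys of a dict
print('Texas' in obj3)           # True
print('California' in obj3)      # False

# Lookup by label works like lookup by key
print(obj3['Utah'])              # 5000

# to_dict() converts the series back into a dict
print(obj3.to_dict() == sdata)   # True
```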

Then there is the Pandas workhorse: The dataframe.

A dataframe represents a rectangular table of data and contains an ordered collection of columns. Each of these columns can be of a different value type.

Think of it as a dict of series that all share the same index.

In [0]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
       }

frame = pd.DataFrame(data)
frame

|   | pop | state | year |
|---|-----|-------|------|
| 0 | 1.5 | Ohio | 2000 |
| 1 | 1.7 | Ohio | 2001 |
| 2 | 3.6 | Ohio | 2002 |
| 3 | 2.4 | Nevada | 2001 |
| 4 | 2.9 | Nevada | 2002 |
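
The "dict of series" picture can be taken literally. A minimal sketch that builds a small dataframe from two explicitly constructed series (the shortened data is only for illustration) and then takes one column out again:

```python
import pandas as pd

# Two series that share the same index
index = [0, 1, 2]
state = pd.Series(['Ohio', 'Ohio', 'Nevada'], index=index)
year = pd.Series([2000, 2001, 2001], index=index)

frame = pd.DataFrame({'state': state, 'year': year})

# Selecting a column by name gives back a Series
column = frame['state']
print(type(column).__name__)              # Series

# The column carries the same index as the dataframe
print(column.index.equals(frame.index))   # True

# Each column has its own value type
print(frame.dtypes.tolist())
```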


There is a lot to say about the internal representation of dataframes in Pandas. If you are going to use the Pandas library for your work, be sure to read chapter 5 of Wes McKinney's book (Python for Data Analysis 2E) carefully, because it discusses in detail how indices work.

We are going to look at Pandas in the context of creating "Tidy Data".