### Pandas: Introduction

#### Under the hood: Using libraries

Suppose we have the following csv file, cash.csv:

    date,person,dollars
    2000-01-03,Michael,200
    2000-01-03,George,500
    2000-01-03,Lisa,450
    2000-01-04,Michael,180.50
    2000-01-04,George,450
    2000-01-04,Lisa,448
    2000-01-05,Michael,177
    2000-01-05,George,420
    2000-01-05,Lisa,447
    2000-01-06,Michael,150
    2000-01-06,George,300
    2000-01-06,Lisa,344.60

Suppose we want to use this file for some calculations, for example the total number of dollars per person or per day. To do that, we first have to process the contents of the file.

In Python this could look like:

    #!/usr/bin/env python
    import sys
    input_file = sys.argv[1]
    # ... more stuff
    with open(input_file, 'r', newline='') as filereader:
        header = filereader.readline()
        header = header.strip()
        header_list = header.split(',')
        # ... do something useful with the header
        for row in filereader:
            row = row.strip()
            row_list = row.split(',')
            # ... do something useful with the contents
    # Report about the useful stuff

If we were to deal with csv files a lot, we would probably want to capture the above in some reusable form, so that we can process csv files regardless of their actual contents. And this is precisely what a *library* does: it automates some common tasks in a certain context.
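
To make the manual approach concrete, here is a minimal sketch that fills in the "do something useful" placeholders by summing the dollars per person. The function name `totals_per_person` and the in-memory `CASH_CSV` string (a shortened stand-in for cash.csv) are assumptions made for illustration only.

```python
import io

# Hypothetical in-memory stand-in for cash.csv (first six data rows only).
CASH_CSV = """date,person,dollars
2000-01-03,Michael,200
2000-01-03,George,500
2000-01-03,Lisa,450
2000-01-04,Michael,180.50
2000-01-04,George,450
2000-01-04,Lisa,448
"""


def totals_per_person(filereader):
    """Sum the dollars column per person using plain string handling."""
    header_list = filereader.readline().strip().split(',')
    person_col = header_list.index('person')
    dollars_col = header_list.index('dollars')
    totals = {}
    for row in filereader:
        row = row.strip()
        if not row:  # skip blank lines
            continue
        row_list = row.split(',')
        person = row_list[person_col]
        totals[person] = totals.get(person, 0.0) + float(row_list[dollars_col])
    return totals


totals = totals_per_person(io.StringIO(CASH_CSV))
print(totals)
```

Even for this small task we have to find the column positions, convert strings to numbers and skip blank lines ourselves.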

Mind you, there is always a trade-off involved in using libraries made by others: a library makes a lot of things easier (faster) to do, so that you can concentrate on the task at hand, but it does this in a certain way, and that might not be the (best) way to solve your particular problem. That said, well-designed libraries can make life so much more pleasant.

In data science csv files are often used as the basic storage file format for datasets.

So, it will come as no surprise that Python has a csv library: Python code that makes it easier to read csv files, process their contents and report results back.

Reading in a csv file, using the Python csv library, looks like:

    #!/usr/bin/env python
    import csv
    
    def dataset(path):
        # newline='' is the mode recommended by the csv module documentation
        with open(path, newline='') as data:
            reader = csv.reader(data)
            # ... do useful stuff with the reader object
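
As a small illustration of what the csv library takes care of, the sketch below compares a naive `split(',')` with `csv.reader` on a row that contains a quoted comma, and then uses `csv.DictReader` to sum a column. The quoted name and the in-memory `CASH_CSV` string are assumptions for illustration; they are not part of cash.csv as shown earlier.

```python
import csv
import io

# Hypothetical row: the quoted name contains a comma.
line = '2000-01-03,"Bluth, Michael",200\n'

naive = line.strip().split(',')               # also splits inside the quotes
parsed = next(csv.reader(io.StringIO(line)))  # respects the quotes

print(len(naive), naive)
print(len(parsed), parsed)

# DictReader uses the header row as keys for each data row.
CASH_CSV = """date,person,dollars
2000-01-03,Michael,200
2000-01-03,George,500
2000-01-03,Lisa,450
"""
reader = csv.DictReader(io.StringIO(CASH_CSV))
total = sum(float(row['dollars']) for row in reader)
print(total)
```

The naive split yields four fields for the quoted row, while `csv.reader` yields the intended three.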

Libraries can simplify your work by shielding you from a lot of low-level details. On the other hand, when things go wrong, and things will go wrong :-), you are looking at a black box unless you understand what the library does under the hood. That is why it is always a good idea to study a library's documentation and examples carefully. Rather than just following the tutorial, start plugging in the problems you want to solve right away.

So, why Pandas when we already have the Python csv library? Pandas is much more than a library for easy processing of csv files.

The Pandas library for Python was built around the dataframe idea taken from R, the statistical programming language. Wes McKinney is the driving force behind the library (O'Reilly book: Python for Data Analysis). Pandas provides high-level data structures and tools to make data analysis easy and fast.

Both languages have "notebooks", and Jupyter notebooks (the name refers to Julia, Python and R) can incorporate both Python and R snippets. CSV files are the "lingua franca" between the languages.

We added references to both programming environments to the bibliography.

In [2]:
import pandas as pd
df = pd.read_csv("data/cash.csv")
df.head()

             date   person  dollars
    0  2000-01-03  Michael    200.0
    1  2000-01-03   George    500.0
    2  2000-01-03     Lisa    450.0
    3  2000-01-04  Michael    180.5
    4  2000-01-04   George    450.0


In [None]:
total_earned = df['dollars'].sum()
print(total_earned)

In [None]:
df['person'].value_counts()
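
The per-person and per-day totals can be computed in a single line each with `groupby`. The sketch below is self-contained: it reads the rows of cash.csv from an in-memory string (the name `CASH_CSV` is an assumption for illustration) instead of from `data/cash.csv`.

```python
import io

import pandas as pd

# The contents of cash.csv, held in memory so the example runs on its own.
CASH_CSV = """date,person,dollars
2000-01-03,Michael,200
2000-01-03,George,500
2000-01-03,Lisa,450
2000-01-04,Michael,180.50
2000-01-04,George,450
2000-01-04,Lisa,448
2000-01-05,Michael,177
2000-01-05,George,420
2000-01-05,Lisa,447
2000-01-06,Michael,150
2000-01-06,George,300
2000-01-06,Lisa,344.60
"""

df = pd.read_csv(io.StringIO(CASH_CSV))

# Group the rows by a column, then sum the dollars within each group.
per_person = df.groupby('person')['dollars'].sum()
per_day = df.groupby('date')['dollars'].sum()

print(per_person)
print(per_day)
```

Compared with the hand-written loop, the header handling, type conversion and bookkeeping are all done by the library.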

#### Pandas import conventions

In [None]:
from pandas import Series, DataFrame
import pandas as pd

#### Pandas: Series and dataframe

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its *index*. If no index is given, a default one consisting of the integers 0 through N - 1 is created.

In [None]:
obj = Series([4, 7, -5, 3])
obj

In [None]:
obj.values

In [None]:
obj.index

In [None]:
obj[2]
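
The index does not have to consist of integers. As a minimal sketch, the Series below uses string labels (the labels and the name `obj2` are assumptions chosen for illustration) and shows the difference between label-based and position-based access.

```python
import pandas as pd

# Same values as before, but with explicit string labels as the index.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(obj2.index)
print(obj2['a'])       # label-based access
print(obj2.iloc[0])    # position-based access
print(obj2[obj2 > 0])  # boolean filtering keeps the labels
```

Using `.iloc` for positions and labels for everything else avoids ambiguity once the index is no longer the default one.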

A Series can also be created directly from a Python dictionary: the keys become the index and the values become the data.

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [None]:
sdata_series = Series(sdata)

In [None]:
sdata_series

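Series make it easy to mix and merge data. Take for example a second Series built from the same Python dict, sdata, but with a new index of states. States that are missing from sdata (California) get a missing value (NaN), and states that are not in the new index (Utah) are dropped: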

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
new_series = Series(sdata, index = states)

In [None]:
new_series

In [None]:
new_series.isnull()
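
One consequence of labelled data is that arithmetic between two Series aligns on the index rather than on position. The following sketch adds the two Series from above; the name `combined` is an assumption for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

sdata_series = pd.Series(sdata)
new_series = pd.Series(sdata, index=states)

# Values are matched by label; a label missing on either side yields NaN.
combined = sdata_series + new_series

print(combined)
print(combined.isnull())
```

California and Utah each appear in only one of the two Series, so their sums are missing values, while the other states are added label by label.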

#### Dataframes

Dataframes are spreadsheet-like data structures in which the columns may contain different value types and rows contain observations. Both columns and rows have indexes.
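
As a minimal sketch of these properties, the dataframe below is built from a dictionary of columns with different value types. The `visits` column and the name `frame` are hypothetical and only serve the illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'person': ['Michael', 'George', 'Lisa'],  # text
    'dollars': [200.0, 500.0, 450.0],         # floating point numbers
    'visits': [1, 2, 3],                      # integers (hypothetical column)
})

print(frame.dtypes)   # one value type per column
print(frame.columns)  # the column index
print(frame.index)    # the row index

# Select a single cell by row label and column label.
print(frame.loc[1, 'person'])
```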

To give just two simple examples of the use of dataframes:

In [3]:
df2 = pd.read_csv("data/treatment1.csv", sep = ",")
df2

               Name  Treatment A  Treatment B
    0    John Smith          NaN            2
    1      Jane Doe         16.0           11
    2  Mary Johnson          3.0            1


A clear case of so-called "messy data". The column names Treatment A and Treatment B are not variable names; they contain values of a variable (the treatment: A or B). And we have 5 or 6 values, depending on whether we count the missing value for John Smith.

There is much to say about "tidy data". If you are interested, please read the article on the topic by Hadley Wickham, of R and RStudio fame.

Here we just use Pandas to change this dataset into a tidy form: each column contains one variable and each row contains one observation.

In [4]:
melted_df2 = pd.melt(df2,
                     ["Name"],
                     var_name = "Treatment",
                     value_name = "Result")
melted_df2

               Name    Treatment  Result
    0    John Smith  Treatment A     NaN
    1      Jane Doe  Treatment A    16.0
    2  Mary Johnson  Treatment A     3.0
    3    John Smith  Treatment B     2.0
    4      Jane Doe  Treatment B    11.0
    5  Mary Johnson  Treatment B     1.0
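
Once the data is tidy, follow-up questions become short expressions. The sketch below rebuilds the treatment table in memory (instead of reading `data/treatment1.csv`), melts it, drops the missing observation and computes the mean result per treatment. The names `tidy`, `complete` and `mean_per_treatment` are assumptions for illustration.

```python
import pandas as pd

# In-memory version of the treatment table; NaN marks the missing value.
df2 = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'Treatment A': [float('nan'), 16, 3],
    'Treatment B': [2, 11, 1],
})

# Wide to long: one row per (Name, Treatment) observation.
tidy = pd.melt(df2, id_vars=['Name'],
               var_name='Treatment', value_name='Result')

# Keep only the treatment letter as the value of the Treatment variable.
tidy['Treatment'] = tidy['Treatment'].str.replace('Treatment ', '', regex=False)

# Drop the observation without a result, then average per treatment.
complete = tidy.dropna(subset=['Result'])
mean_per_treatment = complete.groupby('Treatment')['Result'].mean()

print(tidy)
print(mean_per_treatment)
```

The tidy table has six rows, five of which carry a result, which makes the "5 or 6 values" question explicit.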


The second example shows how simple it is to pass a nested Python dictionary to DataFrame: the outer keys become the columns and the inner keys become the row index.

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [None]:
df3 = DataFrame(pop)
df3

In [None]:
df3.T
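
The sketch below spells out what happens to the keys of the nested dictionary. The row index is sorted explicitly because the order in which pandas combines the inner keys is an assumption we prefer not to rely on.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns, inner keys become the row index.
df3 = pd.DataFrame(pop).sort_index()

print(df3.columns.tolist())
print(df3.index.tolist())

# Nevada has no entry for 2000, so that cell is a missing value.
print(df3.loc[2000, 'Nevada'])

# Transposing swaps rows and columns: states become the row index.
print(df3.T.loc['Ohio', 2002])
```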

With the above we have only scratched the surface of what is possible with Python and data. But we hope you now have an idea of what these tools can do: Python, its scientific libraries, and the Jupyter Notebook. The notebook 99_titanic_bibliography.ipynb provides useful links to sources for self-study.

Time to dive in and take on a somewhat larger dataset.