# Intro to Pandas
Duncan Callaway

My objective in this notebook is to review enough of the basics to give people a clear sense of 
* what a pandas **data frame** is 
* what it can contain
* how to access the information in it

## Pandas references

Introductory:

* [Getting started with Python for research](https://github.com/TiesdeKok/LearnPythonforResearch), a gentle introduction to Python in data-intensive research.

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/index.html), by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), by Jake VanderPlas.

* [Python for Data Analysis, 2nd Edition](http://proquest.safaribooksonline.com/book/programming/python/9781491957653), by Wes McKinney, creator of Pandas. [Companion Notebooks](https://github.com/wesm/pydata-book)

* [Effective Pandas](https://github.com/TomAugspurger/effective-pandas), a book by Tom Augspurger, core Pandas developer.


Complementary resources:

* [An introduction to "Data Science"](https://github.com/stefanv/ds_intro), a collection of Notebooks by BIDS' [Stéfan Van der Walt](https://bids.berkeley.edu/people/st%C3%A9fan-van-der-walt).

* [Effective Computation in Physics](http://proquest.safaribooksonline.com/book/physics/9781491901564), by Kathryn D. Huff and Anthony Scopatz. [Notebooks to accompany the book](https://github.com/physics-codes/seminar). Don't be fooled by the title: it's a great book on modern computational practices with very little that's physics-specific.

## Introduction 
In Data8 you used the `tables` library to organize and manipulate data.  I've not used it myself, but it has been described to me as a 'light' version of pandas, so much of this will be familiar to you.

Pandas has several features that make it immediately better than numpy for organizing data.
1. You can have different data types (string, int, float) in each column
2. You can label columns (headers)
3. You can label rows (index)

We call the data structure that holds all these things together a **data frame**.
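As a rough sketch of those three features, the cell below builds a small frame whose columns hold different data types and whose rows and columns both carry labels. The data and the names `mixed_df`, `row1`, `row2`, `row3` are made up purely for illustration.

```python
import pandas as pd

# Hypothetical example data: each column holds a different data type.
mixed_df = pd.DataFrame(
    {"name": ["a", "b", "c"],        # strings
     "count": [1, 2, 3],             # integers
     "price": [0.5, 1.25, 2.0]},     # floats
    index=["row1", "row2", "row3"],  # row labels (the index)
)

# Each column has its own dtype.
print(mixed_df.dtypes)

# Column labels and row labels are stored alongside the data.
print(mixed_df.columns.tolist())
print(mixed_df.index.tolist())
```

A plain numpy array, by contrast, stores a single data type and has no built-in row or column labels.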

**Ambiguity alert**: up to now we've talked about indexing to access individual elements of a data structure.  The index in pandas has a slightly more specific meaning, in that it references the rows of the data frame.  Pandas documentation talks about "location" or "position" in place of "index" as we used it with the other data structures.  I'll try my best to disambiguate, but will also rely on context to clarify the meaning of the terms.

At its core, the data frame structure is what draws us to use pandas.  But it also has a bunch of great built-in functions you can use to manipulate data once it is loaded in.

In [1]:
import pandas as pd

I'm going to define a simple data frame in a way that shows the connections to existing Python data types and structures.

First let's define a dict of fruit information:

In [2]:
fruit_info_dict = {'fruit':['apple','banana','orange','raspberry'],
                  'color':['red','yellow','orange','pink'],
                  'weight':[120,150,250,15]}
fruit_info_dict

{'fruit': ['apple', 'banana', 'orange', 'raspberry'],
 'color': ['red', 'yellow', 'orange', 'pink'],
 'weight': [120, 150, 250, 15]}

### Q: What do you get if you call the dict of lists with a key?

In [3]:
fruit_info_dict['color']

['red', 'yellow', 'orange', 'pink']

Ans: The list associated with the key.

### Q: Figure out how to get the weight of a raspberry out of the dict.

In [4]:
fruit_info_dict['weight'][3]

15

## Now let's make a dataframe
It's just a fancy version of a dict of lists:

In [5]:
fruit_info_df = pd.DataFrame(
    data={'fruit':['apple','banana','orange','raspberry'],
                  'color':['red','yellow','orange','pink'],
                  'weight':[120,150,250,15]
         })
fruit_info_df

       fruit   color  weight
0      apple     red     120
1     banana  yellow     150
2     orange  orange     250
3  raspberry    pink      15


Some notes:
1. We put the data inside curly brackets because what we pass to `pd.DataFrame` is a dict.
    1. We defined it with fruit, color and weight as the keys.
    1. In the data frame, these keys become the column names.
2. The pandas data structure is called the data frame.

In [6]:
type(fruit_info_df)

pandas.core.frame.DataFrame
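Since the data argument is an ordinary dict, the dict defined earlier can be passed in directly. The sketch below does that and compares the dict keys with the resulting column names; `df_from_dict` is a name introduced here for illustration.

```python
import pandas as pd

fruit_info_dict = {'fruit': ['apple', 'banana', 'orange', 'raspberry'],
                   'color': ['red', 'yellow', 'orange', 'pink'],
                   'weight': [120, 150, 250, 15]}

# Passing the dict directly gives the same frame as writing it out inline.
df_from_dict = pd.DataFrame(fruit_info_dict)

# The dict keys have become the column names, in the same order.
print(list(fruit_info_dict.keys()))
print(df_from_dict.columns.tolist())
```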

We can call the values associated with each column name (like the dict key) in much the same way that we did with the dict:    

In [7]:
fruit_info_dict['color']

['red', 'yellow', 'orange', 'pink']

In [8]:
fruit_info_df['color']

0       red
1    yellow
2    orange
3      pink
Name: color, dtype: object

But notice above in the dataframe we don't just have a list of colors.  Instead we have something called a **series**.   This is a pandas object that is analogous to a one-dimensional numpy array.

In [9]:
type(fruit_info_df['color'])

pandas.core.series.Series

The series differs from the list in at least one important way: each element has a label directly associated with it, and together these labels are called the index.  (The left column of the Series.)

Note that in the above, the index is just numeric.  But as we'll see, we can make it whatever we want.  This makes the data frame much more flexible than a list -- we can call elements from a meaningful index, rather than just a number.
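Here is a minimal sketch of what a non-numeric index can look like, assuming we want the fruit names to label the rows. It uses `set_index`, which returns a new frame; the names `by_fruit` and `weights` are introduced here for illustration.

```python
import pandas as pd

fruit_info_df = pd.DataFrame(
    {'fruit': ['apple', 'banana', 'orange', 'raspberry'],
     'color': ['red', 'yellow', 'orange', 'pink'],
     'weight': [120, 150, 250, 15]})

# Use the fruit names as row labels instead of 0, 1, 2, 3.
# set_index returns a new frame; fruit_info_df itself is unchanged.
by_fruit = fruit_info_df.set_index('fruit')

# The Series for a column now carries the fruit names as its index,
# so we can look up a value by name rather than by number.
weights = by_fruit['weight']
print(weights['raspberry'])   # 15
```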

We can also call columns from the data frame as follows:

In [10]:
fruit_info_df.color

0       red
1    yellow
2    orange
3      pink
Name: color, dtype: object

But as we'll see soon, there are alternative ways to get access to the elements of the data frame (`.loc` and `.iloc`) that enable us to work with the frame more as we would a numpy array.

First, let me show you how to get into the dataframe to index individual elements if we *don't* use `.loc` or `.iloc`.

In this case you index into the df in a way that looks a bit like indexing into a nested *list*, or into the dict of lists above.  That is, you use two sets of square brackets, the first selecting the column and the second selecting the row.

In [11]:
fruit_info_df['fruit'][1]

'banana'

In [12]:
fruit_info_df.fruit[1]

'banana'
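One detail worth checking: with an integer index like this one, the number in the second bracket is matched against the index labels, not counted from the top of the column. The two only coincide here because the default index is 0, 1, 2, 3. The sketch below makes the difference visible by sorting the rows first; `sorted_df` is a name introduced for illustration, and `.iloc` is used only as a position-based comparison.

```python
import pandas as pd

fruit_info_df = pd.DataFrame(
    {'fruit': ['apple', 'banana', 'orange', 'raspberry'],
     'color': ['red', 'yellow', 'orange', 'pink'],
     'weight': [120, 150, 250, 15]})

# Sorting reorders the rows, but each row keeps its original index label.
sorted_df = fruit_info_df.sort_values('weight')
print(sorted_df.index.tolist())     # [3, 0, 1, 2]

# The second bracket looks up the index *label* 1, not the second row.
print(sorted_df['fruit'][1])        # banana

# .iloc asks for a position instead: position 0 is the lightest fruit.
print(sorted_df['fruit'].iloc[0])   # raspberry
```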

Slicing is limited in this case: we can slice along the rows of a column, but not across the columns of the df:

In [15]:
fruit_info_df['fruit'][0:2]

0     apple
1    banana
Name: fruit, dtype: object
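For comparison, here is a short sketch of what plain square brackets do on the whole frame. A slice still selects rows, while several columns can be picked by passing a list of column names. The names `first_two_rows` and `two_columns` are introduced here for illustration.

```python
import pandas as pd

fruit_info_df = pd.DataFrame(
    {'fruit': ['apple', 'banana', 'orange', 'raspberry'],
     'color': ['red', 'yellow', 'orange', 'pink'],
     'weight': [120, 150, 250, 15]})

# A slice applied to the frame itself selects rows (here the first two).
first_two_rows = fruit_info_df[0:2]
print(first_two_rows)

# A list of column names selects those columns and returns a data frame.
two_columns = fruit_info_df[['fruit', 'color']]
print(two_columns)
```

Note the double brackets in the second case: the outer pair indexes the frame and the inner pair is the Python list of names.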

### Anatomy of the data frame.

Let's talk a little about the anatomy of the data frame.

<img src="dataframe_anatomy.png" width="800px" align="left" float="left"/>

We have the following important attributes:
1. Rows
2. Columns
3. Index
4. Column names

The "index" can be numeric, but as we'll see we can also make the indices strings.  
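These pieces can be inspected directly on the frame. The sketch below uses the `shape`, `index` and `columns` attributes on the fruit example; `n_rows` and `n_cols` are names introduced for illustration.

```python
import pandas as pd

fruit_info_df = pd.DataFrame(
    {'fruit': ['apple', 'banana', 'orange', 'raspberry'],
     'color': ['red', 'yellow', 'orange', 'pink'],
     'weight': [120, 150, 250, 15]})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = fruit_info_df.shape
print(n_rows, n_cols)                  # 4 3

# The index holds the row labels (numeric by default).
print(list(fruit_info_df.index))       # [0, 1, 2, 3]

# columns holds the column names.
print(list(fruit_info_df.columns))     # ['fruit', 'color', 'weight']
```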

## Recap
* Pandas dataframes are sophisticated dicts of lists.
    * They have attributes like columns and index that have special meaning in the pandas context.
    * You can store any combination of data types in the dataframe.
* You can access information in them as though they are dicts of lists.