# Day 2: Intro to Pandas

## Pandas
* high-level library to support data manipulation and analysis
* DataFrame is the primary object we’ll be dealing with
* similar to R’s dataframe
* maps onto tabular structure
* good for time series and econometric data

## Shift from "pythonic" to "pandorable"

* less looping over elements
* lots of built-in functionality
* a "paradigm shift"

# Data structures

We're all familiar with lists:

In [10]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now suppose we want to divide each of those scores by two and assign the results to another variable. There are lots of ways to do this, so pick one (without importing any additional Python packages) and assign the results to a variable called ```half```:

In [13]:
# insert your code here

If you followed the above instructions, the following cell should print a list of floats that looks like ```[40.0, 47.5, 42.5, 35.0]```:


In [14]:
half

[40.0, 47.5, 42.5, 35.0]
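
For reference, here is one possible answer to the exercise above. It is only a sketch and assumes Python 3, where `/` always performs true division. The name `half_loop` is introduced here just for comparison and is not part of the exercise.

```python
scores = [80, 95, 85, 70]

# List comprehension: build a new list by dividing each element by two.
half = [score / 2 for score in scores]

# Equivalent explicit loop, shown only for comparison.
half_loop = []
for score in scores:
    half_loop.append(score / 2)

print(half)
```

Both versions visit the elements one at a time, which is the "pythonic" style that the rest of this section moves away from.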

We can put data into an array structure that allows us to apply more powerful
functions.  The data structure that we're interested in is called an ```ndarray``` and is from the ```numpy``` package:

In [16]:
import numpy as np
ascores = np.array(scores)

In [17]:
ascores 

array([80, 95, 85, 70])

In [18]:
ahalf = ascores / 2
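
The division above is applied to every element at once, with no explicit loop. The sketch below shows a few more element-wise operations on the same array. The names `curved` and `above_80` are made up for illustration.

```python
import numpy as np

scores = [80, 95, 85, 70]
ascores = np.array(scores)

ahalf = ascores / 2        # element-wise division, no explicit loop
curved = ascores + 5       # the scalar 5 is added to every element
above_80 = ascores > 80    # comparisons return a boolean array

print(ahalf)
print(curved)
print(above_80)
print(ascores.mean())      # arrays also carry built-in aggregations
```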

NumPy arrays are powerful, but they have some limitations. For example, all elements of an array must share a single data type (e.g. int). pandas provides two additional
data structures that are built on NumPy ndarrays.
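
To see the single-type limitation in action, the sketch below builds arrays from mixed inputs and inspects the resulting `dtype`. The variable names are illustrative. It checks `dtype.kind` rather than the full dtype because the default integer width can vary by platform.

```python
import numpy as np

ints = np.array([80, 95, 85, 70])
mixed_numeric = np.array([80, 95.5])   # int and float: everything is upcast to float
mixed_text = np.array([80, "Ian"])     # int and str: everything becomes a string

print(ints.dtype.kind)           # 'i' for signed integer
print(mixed_numeric.dtype)       # float64
print(mixed_text.dtype.kind)     # 'U' for unicode string
print(repr(mixed_text[0]))       # the number 80 is now the string '80'
```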

The first is the Series.  Let's create a simple pandas Series and examine it:

In [25]:
import pandas as pd

In [26]:
from pandas import Series

In [28]:
sscores = Series(scores,name='scores')

In [29]:
sscores

0    80
1    95
2    85
3    70
Name: scores, dtype: int64

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is int64.

**A Series is a one-dimensional ndarray with axis labels**
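
A quick way to check that description is to look at the pieces of a Series directly. The following is a minimal sketch; `halved` is a name made up for this example.

```python
import numpy as np
import pandas as pd

scores = [80, 95, 85, 70]
sscores = pd.Series(scores, name="scores")

print(sscores.index)          # the axis labels (0 to 3 by default)
print(type(sscores.values))   # the underlying data is a NumPy ndarray
print(sscores.name)           # the name we passed in

# Arithmetic is element-wise, as with ndarrays, and the labels are kept.
halved = sscores / 2
print(halved)
```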

In [30]:
data = dict(zip(names,scores))

In [31]:
data

{'Charlotte': 80, 'Eric': 70, 'Ian': 85, 'Ingrid': 95}

In [42]:
sData = Series(data=data,name='score')

In [43]:
sData

Charlotte    80
Eric         70
Ian          85
Ingrid       95
Name: score, dtype: int64
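
With names as the index, values can be selected by label rather than by position. The sketch below shows a few common selections. The names `ian`, `high`, and `pair` are illustrative only.

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]
sData = pd.Series(data=dict(zip(names, scores)), name="score")

ian = sData["Ian"]                 # look up a single value by its label
high = sData[sData > 80]           # a boolean mask keeps only matching entries
pair = sData[["Eric", "Ingrid"]]   # a list of labels returns a smaller Series

print(ian)
print(high)
print(pair)
```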

So Series are a bit friendlier than NumPy arrays, but they're still only one-dimensional.  Keep in mind that our basic data abstraction is a table, which can be thought of as a two-dimensional array.  Let's go ahead and create a simple DataFrame with just one column:

In [52]:
from pandas import DataFrame


In [53]:
sData.to_frame()

           score
Charlotte     80
Eric          70
Ian           85
Ingrid        95
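
To confirm that the result really is two-dimensional, we can inspect its shape and columns, and then add a second column. This is only a sketch; the column name `half` is made up for the example.

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]
sData = pd.Series(dict(zip(names, scores)), name="score")

df = sData.to_frame()
one_col_shape = df.shape       # (rows, columns)
print(one_col_shape)
print(list(df.columns))        # the Series name becomes the column name

# Each column is itself a Series, so element-wise arithmetic still works.
df["half"] = df["score"] / 2
print(df)
```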


Let's return to the code that you "just ran" last time:

In [71]:
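years = range(1880, 2015)
pieces = []
for year in years:
    # one file per birth year, e.g. names/yob1880.csv
    path = 'names/yob%d.csv' % year
    frame = pd.read_csv(path)
    frame['year'] = year  # tag every row with the year it came from
    pieces.append(frame)
# stack the yearly frames and renumber the rows from 0
df_names = pd.concat(pieces, ignore_index=True)
df_names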

     name  gender  birth_count  year
0  Simeon       M           23  1880
1   Raoul       M            7  1880
2     Lou       M           14  1880
3    Myra       F           83  1880
4   Alois       M           10  1880
5   Hosea       M           10  1880
6  Arthur       M         1599  1880
7    Vena       F           11  1880
8  Electa       F            6  1880
9  Tessie       F           17  1880
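
The loop above can be tried without the `names/` files by reading from in-memory text instead. The sketch below does that to show what `ignore_index=True` changes. The 1880 rows are taken from the output above; the 1881 rows are invented placeholders, not real counts.

```python
import io
import pandas as pd

# Stand-ins for two yearly files. The 1881 rows are invented placeholders.
raw = {
    1880: "name,gender,birth_count\nSimeon,M,23\nMyra,F,83\n",
    1881: "name,gender,birth_count\nPlaceholderA,F,1\nPlaceholderB,M,2\n",
}

pieces = []
for year, text in raw.items():
    frame = pd.read_csv(io.StringIO(text))
    frame["year"] = year       # tag every row with its year
    pieces.append(frame)

kept = pd.concat(pieces)                            # each frame keeps its own 0, 1 index
renumbered = pd.concat(pieces, ignore_index=True)   # rows are renumbered 0 to 3

print(list(kept.index))
print(list(renumbered.index))
print(renumbered)
```

Without `ignore_index=True` the row labels repeat across years, so selecting by label can return several rows.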


