## Python Dictionary for storing mixed-type data

"Tabular data" is what you would find in a spreadsheet or, in database terms, a table. Ideally, each column contains a single type of data.

Using the built-in Python data structures:
- A **List** holds a sequence of objects. Members are accessed by an integer index (starting with 0).
- A **Dictionary** holds "key:value pairs". Members are accessed by the "key" name, which makes it harder to grab the wrong object by mistake.

Dictionaries of lists might be a natural way of storing tabular data.

In [1]:
data_dict = {'letters':['A','B','c','D','eee'], 
             'hundreds':[100,200,300,400,500], 
             'tens':[10.0,20.0,30.0,40.0,50.0],
             'boolean':[True,False,True,True,False]}

#### Accessing the equivalent of a column

In [2]:
data_dict['hundreds']

[100, 200, 300, 400, 500]

#### We can even do math on lists of numbers

In [3]:
sum(data_dict['hundreds'])

1500

#### But math between multiple columns doesn't work like we'd expect

Adding two lists in Python just concatenates them

In [4]:
data_dict['hundreds'] + data_dict['tens']

[100, 200, 300, 400, 500, 10.0, 20.0, 30.0, 40.0, 50.0]
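
To add the two columns value by value with plain lists, you have to pair the values up yourself. A minimal sketch, using the same numbers as above (`row_sums` is just an illustrative name):

```python
hundreds = [100, 200, 300, 400, 500]
tens = [10.0, 20.0, 30.0, 40.0, 50.0]

# zip() pairs the values up position by position,
# and the list comprehension adds each pair
row_sums = [h + t for h, t in zip(hundreds, tens)]
print(row_sums)  # [110.0, 220.0, 330.0, 440.0, 550.0]
```

This works, but it relies on both lists being the same length and in the same order, and nothing checks that for you.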

## DataFrame is convenient storage for tabular data

Using the Pandas module, we can easily make a DataFrame out of our data dictionary of lists

In [5]:
import pandas as pd

In [6]:
df = pd.DataFrame(data_dict)
df

  letters  hundreds  tens  boolean
0       A       100  10.0     True
1       B       200  20.0    False
2       c       300  30.0     True
3       D       400  40.0     True
4     eee       500  50.0    False


### Each column has a data "type"

- **object** is how Pandas refers to strings of text
- **int64** is a 64-bit integer (whole number). The number of bits is just the amount of internal storage used for that number. *For integers it limits how big the number can be.*
- **float64** is a "floating point" number (number with decimal places). *For floats the number of bits limits the precision of the number.*
- **bool** is a boolean value, which is just True/False

In [7]:
df.dtypes

letters      object
hundreds      int64
tens        float64
boolean        bool
dtype: object
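
To make the "number of bits" remarks concrete, here is a small sketch that asks NumPy (the numerical library Pandas builds on) about the 64-bit types. The names `int_limits` and `float_info` are just for illustration.

```python
import numpy as np

# Range information for 64-bit integers
int_limits = np.iinfo(np.int64)
print(int_limits.max)   # largest value an int64 can hold (2**63 - 1)

# Precision information for 64-bit floats
float_info = np.finfo(np.float64)
print(float_info.eps)   # gap between 1.0 and the next representable float (2**-52)
```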

### DataFrame index

Notice the column of sequential integers off to the left-hand side of the DataFrame output. That is the DataFrame's **index**. 

- The index contains the names of the rows
- Because we didn't explicitly specify an index column, Pandas created one for us

In [8]:
df.index

RangeIndex(start=0, stop=5, step=1)
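
If you would rather name the rows yourself, you can pass an `index` when building the DataFrame. A minimal sketch with a made-up three-row table (`small` and the `row_*` labels are illustrative, not part of the example data):

```python
import pandas as pd

# Supplying index= replaces the automatic 0, 1, 2, ... row labels
small = pd.DataFrame({'hundreds': [100, 200, 300]},
                     index=['row_a', 'row_b', 'row_c'])
print(small.index)
```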

### DataFrame columns

There is a separate index of column names

In [9]:
df.columns

Index(['letters', 'hundreds', 'tens', 'boolean'], dtype='object')

## Accessing (selecting/indexing) a column with `df[]` notation

The most common, and concise, way of selecting a column out of a DataFrame is just using square brackets with the column name inside -- similar to how you access a dictionary value using its key.

In [10]:
df['hundreds']

0    100
1    200
2    300
3    400
4    500
Name: hundreds, dtype: int64

### Each column is a Series

A column is not just a list of values; it is a **Series**.

- A Series has an index alongside its values, which keeps it properly aligned with the other Series in a DataFrame
- Series is the base one-dimensional (1D) data structure in Pandas, while a DataFrame is 2D

In [11]:
type(df['hundreds'])

pandas.core.series.Series
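
A quick way to see both halves of a Series is to look at its index and its values separately. A small sketch with a stand-in Series (`s` is just an illustrative name):

```python
import pandas as pd

s = pd.Series([100, 200, 300], name='hundreds')

print(s.index)      # the row labels (created automatically here)
print(s.to_list())  # the values as a plain Python list
```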

#### List of column names

You can select multiple columns by putting a *list of column names* inside the square brackets

In [12]:
df[['tens','hundreds']]

   tens  hundreds
0  10.0       100
1  20.0       200
2  30.0       300
3  40.0       400
4  50.0       500


## Math is easier with a DataFrame

### Math with Series

Both Series and DataFrames have mathematical methods associated with them, like sum(), mean(), median(), max(), min()...

In [33]:
df['hundreds'].sum()

1500

There are many simple math operations that can be performed element-wise between Series and DataFrames, some of which can be specified with the common math operators (+, -, /, \*).

**Operations between Series align values based on their associated index values.**

In [14]:
df['hundreds'] + df['tens']

0    110.0
1    220.0
2    330.0
3    440.0
4    550.0
dtype: float64
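
The alignment is easier to see when the two indexes don't match. A minimal sketch with two made-up Series (`left`, `right` and their labels are illustrative): values are paired by label, not by position, and labels found in only one Series give NaN.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'z'])

# 'b' and 'c' exist in both, so they are added by label;
# 'a' and 'z' exist in only one Series, so the result is NaN there
total = left + right
print(total)
```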

### Math on a DataFrame

Simple math methods will operate down columns, giving you a Series, indexed by the column names. *Notice what it did with the column of strings!*

In [15]:
df.sum()

letters     ABcDeee
hundreds       1500
tens            150
boolean           3
dtype: object
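
If you would rather leave the text column out of the result, `sum()` accepts a `numeric_only` argument. A minimal sketch that rebuilds the same example data (`numeric_totals` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'letters': ['A', 'B', 'c', 'D', 'eee'],
                   'hundreds': [100, 200, 300, 400, 500],
                   'tens': [10.0, 20.0, 30.0, 40.0, 50.0],
                   'boolean': [True, False, True, True, False]})

# Only sum the columns that hold numbers; the strings are skipped
numeric_totals = df.sum(numeric_only=True)
print(numeric_totals)
```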

## Boolean series as a selector

Logical tests on Series return a series of boolean (True/False) values

In [16]:
df['tens'] < 35

0     True
1     True
2     True
3    False
4    False
Name: tens, dtype: bool

**You can use a series of True/False values to return only the rows of a DataFrame where the Series equals True**

In [17]:
df[df['tens'] < 35]

  letters  hundreds  tens  boolean
0       A       100  10.0     True
1       B       200  20.0    False
2       c       300  30.0     True


In [18]:
df[df['boolean']]

  letters  hundreds  tens  boolean
0       A       100  10.0     True
2       c       300  30.0     True
3       D       400  40.0     True
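
Boolean Series can also be combined before they are used as a selector. A minimal sketch on the same example data, using `&` for "and" and `~` for "not" (`both` and `not_flagged` are illustrative names). Each comparison needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({'letters': ['A', 'B', 'c', 'D', 'eee'],
                   'hundreds': [100, 200, 300, 400, 500],
                   'tens': [10.0, 20.0, 30.0, 40.0, 50.0],
                   'boolean': [True, False, True, True, False]})

# Rows where tens is under 35 AND the boolean column is True
both = df[(df['tens'] < 35) & df['boolean']]
print(both)

# Rows where the boolean column is NOT True
not_flagged = df[~df['boolean']]
print(not_flagged)
```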


## Series are automatically aligned by their index

Here we create a Series from scratch, but its index is in descending numerical order

In [38]:
series_spelled = pd.Series(['Five','Four','Three','Two','One'],
                   index=[4,3,2,1,0])
series_spelled

4     Five
3     Four
2    Three
1      Two
0      One
dtype: object

### New DataFrame column gets aligned

Now when we create a new column in our DataFrame, using that series, Pandas forces alignment based on the two indexes!

*Any row in our Series whose index label is not in the DataFrame's index will be dropped, and any row of the DataFrame without a matching label in the Series will end up with a NaN / Null value.*

In [39]:
df['spelled_out'] = series_spelled
df

  letters  hundreds  tens  boolean spelled_out
0       A       100  10.0     True         One
1       B       200  20.0    False         Two
2       c       300  30.0     True       Three
3       D       400  40.0     True        Four
4     eee       500  50.0    False        Five
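
In the example above every label matched, so nothing was dropped or left empty. Here is a minimal sketch of the mismatched case, using a small made-up DataFrame and Series (`df_demo`, `partial` and the label 99 are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({'hundreds': [100, 200, 300]})  # index is 0, 1, 2

# This Series has no value for label 0, and an extra value at label 99
partial = pd.Series(['Two', 'Three', 'Ninety-nine'], index=[1, 2, 99])

df_demo['spelled_out'] = partial
print(df_demo)
# Row 0 gets NaN (no matching label in the Series);
# the value at label 99 is dropped (no matching row in the DataFrame)
```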


## Index doesn't have to be integers

Because the Index holds the names of the rows rather than row numbers, you can use things like strings or dates for the Index. Here we set an existing column as the Index.

In [21]:
df2 = df.set_index('spelled_out')
df2

            letters  hundreds  tens  boolean
spelled_out
One               A       100  10.0     True
Two               B       200  20.0    False
Three             c       300  30.0     True
Four              D       400  40.0     True
Five            eee       500  50.0    False


And then every column's Series has this same Index

In [22]:
df2['letters']

spelled_out
One        A
Two        B
Three      c
Four       D
Five     eee
Name: letters, dtype: object

## `.loc[]` label-based two-axis indexing/selecting

A more general, and in some ways more readable, way of selecting DataFrame elements (rows, columns, values) is by using `df.loc[]` to specify \[row, column\] labels

In [23]:
df2.loc['One','letters']

'A'

### Colon `:` for wholes or slices

A notation that comes from accessing lists in Python is the "slice" operator, which is specified with a colon between two values. **The colon by itself denotes the whole row or column.** So, here we grab a single column.

In [24]:
df2.loc[:,'hundreds']

spelled_out
One      100
Two      200
Three    300
Four     400
Five     500
Name: hundreds, dtype: int64
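
Putting labels on either side of the colon selects a range. A minimal sketch on a stand-in table with the same row names (`df_demo` and `middle` are illustrative names). One difference from list slicing: with `.loc[]` the end label is included in the result.

```python
import pandas as pd

df_demo = pd.DataFrame({'hundreds': [100, 200, 300, 400, 500],
                        'tens': [10.0, 20.0, 30.0, 40.0, 50.0],
                        'boolean': [True, False, True, True, False]},
                       index=['One', 'Two', 'Three', 'Four', 'Five'])

# Rows 'Two' through 'Four' and columns 'hundreds' through 'tens',
# with both end labels included
middle = df_demo.loc['Two':'Four', 'hundreds':'tens']
print(middle)
```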

### Lists for combinations

Lists of values work the same as with the `df[]` notation

In [25]:
df2.loc[:,['tens','letters']]

             tens letters
spelled_out
One          10.0       A
Two          20.0       B
Three        30.0       c
Four         40.0       D
Five         50.0     eee


### Single rows are a Series, too

Remember, any 1D result, row or column, will be a Series in Pandas

In [26]:
df2.loc['Three',:]

letters        c
hundreds     300
tens          30
boolean     True
Name: Three, dtype: object

### Two or more rows are a DataFrame

In [27]:
df2.loc[['Three','Five'],:]

            letters  hundreds  tens  boolean
spelled_out
Three             c       300  30.0     True
Five            eee       500  50.0    False


### Boolean series can be used for grabbing True rows or columns

In [28]:
df2.loc[df2['tens']<35,:]

            letters  hundreds  tens  boolean
spelled_out
One               A       100  10.0     True
Two               B       200  20.0    False
Three             c       300  30.0     True


## SettingWithCopyWarning

When you take a subset of a DataFrame and then assign into it, Pandas may warn that it cannot tell whether you meant to change the original or only the subset. Making the subset with an explicit `.copy()` removes the ambiguity.

In [29]:
df_nums = df2[['hundreds','tens']]
df_nums

             hundreds  tens
spelled_out
One               100  10.0
Two               200  20.0
Three             300  30.0
Four              400  40.0
Five              500  50.0


In [30]:
df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [31]:
df_nums = df2[['hundreds','tens']].copy()
df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']
df_nums

             hundreds  tens   sums
spelled_out
One               100  10.0  110.0
Two               200  20.0  220.0
Three             300  30.0  330.0
Four              400  40.0  440.0
Five              500  50.0  550.0
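
The point of `.copy()` is that the new DataFrame is fully independent of the one it came from. A minimal sketch with a small stand-in table (`df_demo` is an illustrative name). Whether the warning itself appears can depend on your Pandas version, but the copy behaves the same way either way.

```python
import pandas as pd

df_demo = pd.DataFrame({'hundreds': [100, 200], 'tens': [10.0, 20.0]})

# An explicit copy: changes to df_nums can never touch df_demo
df_nums = df_demo[['hundreds', 'tens']].copy()
df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']

print('sums' in df_nums.columns)  # True
print('sums' in df_demo.columns)  # False, the original is unchanged
```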
