### Unit 2.1.2 Pandas Data Frames

Pandas is one of the most widely used Python packages among data scientists.  Built on top of NumPy, it is essential for data manipulation, organization, and modeling.  The primary data structure of pandas is the <b>data frame</b>. 

The data frame is like a two-dimensional NumPy array with additional features such as column names and row labels (an integer index starting at 0 by default), much like a spreadsheet in Excel. 


In [1]:
import pandas as pd
import numpy as np

In [2]:
my_array = np.array([[0,1,2,3],[4,5,6,7]])
df = pd.DataFrame(my_array)
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
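To make the default labels concrete, here is a minimal sketch that rebuilds the same data frame and inspects its shape and labels. It only uses the `shape`, `index`, and `columns` attributes; the variable names match the cell above.

```python
import numpy as np
import pandas as pd

# Rebuild the same 2x4 data frame from the cell above.
my_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
df = pd.DataFrame(my_array)

# shape is (number of rows, number of columns).
print(df.shape)

# With no names supplied, rows and columns are both labeled 0, 1, 2, ...
print(list(df.index))
print(list(df.columns))
```

The row and column labels are separate from the data itself, which is why they can be replaced later without touching the values.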


In [3]:
#You can name columns and rows
df.columns = ['first','this','that','last']
df.index = ['row_1','row_2']
df

Unnamed: 0,first,this,that,last
row_1,0,1,2,3
row_2,4,5,6,7


In [4]:
#You can also set column and row names through the columns= and index= keyword arguments of the pd.DataFrame() constructor
#Note: It is customary to omit spaces around '=' in keyword arguments to distinguish them from variable assignments.

df2 = pd.DataFrame(
    my_array,
    columns=['first','this','that','last'],
    index=['row_1','row_2'])
df2

Unnamed: 0,first,this,that,last
row_1,0,1,2,3
row_2,4,5,6,7
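As an alternative sketch (not used in the cells above), the same table can be built from a dictionary whose keys become the column names. The names `data` and `df3` are illustrative only.

```python
import numpy as np
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
data = {
    'first': [0, 4],
    'this': [1, 5],
    'that': [2, 6],
    'last': [3, 7],
}
df3 = pd.DataFrame(data, index=['row_1', 'row_2'])
print(df3)

# The values match the array-based data frame built earlier.
my_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
print((df3.to_numpy() == my_array).all())
```

Which form is more convenient usually depends on whether the data arrives row by row (an array) or column by column (a dictionary of lists).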


In [6]:
#Adding more data.  Let's make a bigger data frame with named columns using lists.  

# This list will become our row names.
names = ['George',
         'John',
         'Thomas',
         'James',
         'Andrew',
         'Martin',
         'William',
         'Zachary',
         'Millard',
         'Franklin']

# Create an empty data frame with named rows.
purchases = pd.DataFrame(index=names)

# Add our columns to the data frame one at a time.
purchases['country'] = ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13, 63, 19, 65, 23, 16, 77]
purchases['items_purchased'] = [2, 1, 0, 8, 0, 5, 7, 3, 0, 5]
purchases 

Unnamed: 0,country,ad_views,items_purchased
George,US,16,2
John,CAN,42,1
Thomas,CAN,32,0
James,US,13,8
Andrew,CAN,63,0
Martin,US,19,5
William,US,65,7
Zachary,US,23,3
Millard,CAN,16,0
Franklin,US,77,5


In [12]:
#Note: you can select a column as a Series using either dot notation (purchases.country) 
#    or bracket notation (purchases['country']).  
#Bracket notation is generally preferred because it works for any column name.

print(purchases.country)

print('\n')

print(purchases['country'])

George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object


George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object
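One reason bracket notation tends to be the safer choice: dot notation breaks down when a column name contains a space or collides with an existing data frame method. The small `demo` frame below is hypothetical and exists only to illustrate this.

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with DataFrame.count(),
# and 'ad views' contains a space.
demo = pd.DataFrame(
    {'count': [3, 1], 'ad views': [16, 42]},
    index=['George', 'John'])

# Bracket notation returns the column in both cases.
print(demo['count'])
print(demo['ad views'])

# Dot notation finds the built-in method, not the column.
print(callable(demo.count))

# demo.ad views  would be a syntax error, so it cannot be written at all.
```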


In [13]:
#You can create a new column from existing columns in the data frame.  
# Let's say we want a column of items purchased per ad viewed

purchases['item_purch_per_ad'] = purchases['items_purchased'] / purchases['ad_views']

purchases

Unnamed: 0,country,ad_views,items_purchased,item_purch_per_ad
George,US,16,2,0.125
John,CAN,42,1,0.02381
Thomas,CAN,32,0,0.0
James,US,13,8,0.615385
Andrew,CAN,63,0,0.0
Martin,US,19,5,0.263158
William,US,65,7,0.107692
Zachary,US,23,3,0.130435
Millard,CAN,16,0,0.0
Franklin,US,77,5,0.064935


In [15]:
#If you just want to see the values and not store them in the data frame, you can run the expression without assigning it to
#   purchases['item_purch_per_ad'], and it will return labeled values giving the name and purchases per ad for each user

purchases['items_purchased'] / purchases['ad_views']

George      0.125000
John        0.023810
Thomas      0.000000
James       0.615385
Andrew      0.000000
Martin      0.263158
William     0.107692
Zachary     0.130435
Millard     0.000000
Franklin    0.064935
dtype: float64
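If you want to preview the whole table with the extra column while leaving the original untouched, one option is `DataFrame.assign`, which returns a new data frame. This is a sketch on a three-row subset of the data above; the names `ratio` and `preview` are illustrative.

```python
import pandas as pd

# A three-row subset of the purchases data, for brevity.
purchases = pd.DataFrame(
    {'ad_views': [16, 42, 32], 'items_purchased': [2, 1, 0]},
    index=['George', 'John', 'Thomas'])

# Element-wise division returns a Series labeled by the row index.
ratio = purchases['items_purchased'] / purchases['ad_views']

# assign() returns a copy with the new column; purchases is not modified.
preview = purchases.assign(item_purch_per_ad=ratio)
print(preview)
print(list(purchases.columns))
```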