## Introduction to Pandas ##
We will learn:

a) What is a DataFrame?

b) How to read data from the clipboard into a DataFrame?

c) How to read data from a file into a DataFrame?

d) Let's code together - to understand object types.
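
As a first look at item (a), here is a minimal sketch of a DataFrame built by hand. The three rows are copied from the NFL table shown later in this notebook; the variable name `small_frame` is only an illustrative choice.

```python
import pandas as pd

# A DataFrame is a 2D labelled table; each of its columns is a Series.
small_frame = pd.DataFrame({
    "Team": ["Dallas Cowboys", "Green Bay Packers", "Chicago Bears"],
    "Won": [493, 730, 744],
    "Lost": [367, 553, 568],
})

print(type(small_frame))         # the table itself
print(type(small_frame["Won"]))  # a single column
print(small_frame.shape)         # (number of rows, number of columns)
```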

In [9]:
#Remember these will become the standard imports from now on!
import numpy as np

from pandas import Series,DataFrame
import pandas as pd


In [8]:
#Now we'll learn DataFrames

#Let's get some data to play with. How about the NFL?
import webbrowser
website = 'http://en.wikipedia.org/wiki/NFL_win-loss_records'
webbrowser.open(website)

True

In [15]:
#Copy the table on the web page to the clipboard, then read it into a DataFrame
#On Linux, if xclip is not installed, install it using `sudo apt-get install xclip`

nfl_frame = pd.read_clipboard()
type(nfl_frame)

pandas.core.frame.DataFrame

In [16]:
#Show
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
0,1,Dallas Cowboys,493,367,6,0.573,1960,866,NFC East
1,2,Green Bay Packers,730,553,37,0.567,1921,1320,NFC North
2,3,Chicago Bears,744,568,42,0.565,1920,1354,NFC North
3,4,Miami Dolphins,439,341,4,0.563,1966,784,AFC East
4,5,New England Patriots,476,383,9,0.554,1960,868,AFC East
5,6,New York Giants,684,572,33,0.543,1925,1289,NFC East
6,7,Denver Broncos,465,393,10,0.541,1960,868,AFC West
7,8,Minnesota Vikings,457,387,10,0.541,1961,854,NFC North
8,9,Baltimore Ravens,181,154,1,0.54,1996,336,AFC North
9,10,San Francisco 49ers,522,450,14,0.537,1950,986,NFC West


In [17]:
# We can grab the column names with .columns
nfl_frame.columns

Index([u'Rank ', u'Team ', u'Won ', u'Lost ', u'Tied ', u'Pct. ',
       u'First NFL Season ', u'Total Games ', u'Division'],
      dtype='object')
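
The column names above carry trailing spaces (for example `'Rank '`), so a lookup such as `nfl_frame['Rank']` would fail. One possible cleanup, sketched on a small hypothetical frame rather than on the clipboard data, is to strip the whitespace from every label:

```python
import pandas as pd

# Hypothetical frame whose column labels have the same kind of trailing spaces
frame = pd.DataFrame({
    "Rank ": [1, 2],
    "Team ": ["Dallas Cowboys", "Green Bay Packers"],
    "Division": ["NFC East", "NFC North"],
})

# Strip leading and trailing whitespace from every column label
frame.columns = frame.columns.str.strip()

print(list(frame.columns))
print(frame["Rank"])  # the clean label now works
```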

In [18]:
nfl_frame.head(3)

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
0,1,Dallas Cowboys,493,367,6,0.573,1960,866,NFC East
1,2,Green Bay Packers,730,553,37,0.567,1921,1320,NFC North
2,3,Chicago Bears,744,568,42,0.565,1920,1354,NFC North


In [19]:
nfl_frame.tail(1)

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
17,18,San Diego Chargers,426,431,11,0.497,1960,868,AFC West


### Read data from a CSV file ###

In [20]:
trainingData = pd.read_csv("train.csv")   #this returns a DataFrame; each column is a Series
trainingData.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [25]:
trainingData.tail(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
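
If `train.csv` is not at hand, the same call can be tried on in-memory text. This is a sketch only: the rows are a subset of the columns shown above, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# A few columns of the first three rows shown above; no file on disk is needed
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))          # the whole table
print(type(sample["Age"]))   # one column
print(sample.dtypes)         # the type pandas inferred for each column
```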


In [26]:
# LET'S DO IT TOGETHER:  
# a) Try printing out the last 15 values in the trainingData.
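
One way to approach exercise (a), sketched on a hypothetical 20-row stand-in for `trainingData`, is to pass the number of rows to `tail`:

```python
import pandas as pd

# Hypothetical stand-in for trainingData: 20 rows with ids 1..20
demo = pd.DataFrame({"PassengerId": range(1, 21)})

# tail(n) returns the last n rows (the default is 5)
last_rows = demo.tail(15)
print(last_rows)
```

With the real data, the equivalent call would be `trainingData.tail(15)`.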

In [27]:
# b) Print the rows where the PassengerId is less than 10
# Hint: a DataFrame can be indexed with a boolean condition such as: PassengerId < 10.
series = trainingData["PassengerId"] 
trainingData[series<10]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
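
The filter above works because the comparison produces a boolean Series that is then used as a mask. The sketch below, on a hypothetical frame with illustrative values, shows the mask itself and how to keep only the `PassengerId` column of the matching rows using `.loc`:

```python
import pandas as pd

# Hypothetical stand-in for trainingData; the Survived values are illustrative
demo = pd.DataFrame({
    "PassengerId": range(1, 13),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
})

mask = demo["PassengerId"] < 10        # boolean Series, one value per row
rows = demo[mask]                      # all columns of the matching rows
ids = demo.loc[mask, "PassengerId"]    # only the PassengerId column

print(mask.head())
print(rows)
print(ids)
```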


In [28]:
# c) Try printing out a non-existent column.
trainingData["test"] 


KeyError: 'test'
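
To avoid the `KeyError`, the column can be checked before it is accessed. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in for trainingData
demo = pd.DataFrame({"PassengerId": [1, 2, 3]})

# Membership test on the column labels
print("test" in demo.columns)

# Bracket access raises KeyError for a missing column
try:
    demo["test"]
except KeyError as err:
    print("KeyError:", err)

# DataFrame.get returns a default (None unless specified) instead of raising
missing = demo.get("test")
print(missing)
```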

In [30]:
# d) Try adding a column, say Alive/Dead, to the DataFrame 
trainingData["Alive/Dead"] = 'Dead'
#create a new column and initialize every row to 'Dead'

trainingData.head()
# to print the top `five` rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Alive/Dead
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Dead
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Dead
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Dead
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Dead
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Dead


In [31]:
#Assigning a Series aligns on the index: rows 0 and 4 get "Alive", every other row becomes NaN
aliveDead = Series(["Alive","Alive"],index=[4,0])
trainingData["Alive/Dead"] = aliveDead

trainingData.head()
#print the top five

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Alive/Dead
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Alive
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Alive
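
The empty cells in rows 1 to 3 come from index alignment: the assigned Series only has labels 4 and 0, so every other row receives a missing value. The sketch below reproduces this on a hypothetical five-row frame and shows one possible way to fill the gaps afterwards:

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of trainingData
demo = pd.DataFrame({"PassengerId": [1, 2, 3, 4, 5]})

# The Series is matched to rows by index label, not by position
alive = pd.Series(["Alive", "Alive"], index=[4, 0])
demo["Alive/Dead"] = alive
print(demo)   # rows 1, 2 and 3 hold missing values

# One option: replace the missing values with a default label
filled = demo["Alive/Dead"].fillna("Dead")
print(filled)
```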


In [32]:
#      Example: 
#           stadiums = Series(["Levi's Stadium","AT&T Stadium"],index=[4,0])
#           nfl_frame['Stadium']=stadiums

In [33]:
# e) Check what these return:
#               1)type(trainingData) - note the datatype
#               2)trainingData.info()  - note the starting index
type(trainingData)

pandas.core.frame.DataFrame

In [None]:
trainingData.info()
# f) Try out:
#         1) M = trainingData.to_numpy()   (as_matrix() was removed in pandas 1.0)
#         2) Check the datatype of M
#         3) Print M[0,0] and M[0]; also try trainingData[0,0] and trainingData[0] (both raise KeyError here)
#         4) Check the datatype of M[0]
# g) Try out:
#         1) Print trainingData.iloc[0] 
#         2) Print trainingData.loc[0]   (.ix was removed in pandas 1.0)
#         3) Check the datatype of trainingData.loc[0] 
M = trainingData.to_numpy()
type(M)
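
Exercises (f) and (g) can be tried without the Titanic file. The following is a sketch on a hypothetical two-column frame; it assumes a pandas version that provides `to_numpy()` (0.24 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for trainingData
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
})

# Convert the whole table to a 2D NumPy array
M = demo.to_numpy()
print(type(M))
print(M.dtype)     # mixed column types usually give a generic object dtype
print(M[0, 0])     # first row, first column
print(M[0])        # the whole first row

# Row access on the DataFrame itself returns a Series
first_row = demo.iloc[0]   # by position
same_row = demo.loc[0]     # by index label (the default index starts at 0)
print(type(first_row))
```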

## Check point ##

Pandas can load data from the clipboard and from files.

A pandas Series can be thought of as a single column or row (1D); a DataFrame is a table (2D).

Pandas does not return NumPy arrays directly, but a DataFrame can be converted to one (for example with `to_numpy()`).

So a common pattern is to load the data with pandas and convert it to a NumPy array when array-style manipulation is needed.

REMEMBER: DataFrame[0] returns all values of the column labelled 0 (and raises a KeyError if no such column exists).
NumpyArray[0] returns all values in row 0.
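
The difference in the reminder above can be checked on a small example. This sketch uses a hypothetical 2x2 array; wrapping it in a DataFrame without column names gives the default integer labels 0 and 1, so `frame[0]` is a valid column lookup.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4]])
frame = pd.DataFrame(arr)   # default column labels: 0 and 1

col0 = frame[0]        # column labelled 0 -> values 1 and 3
row0 = arr[0]          # first row of the array -> values 1 and 2
row0_df = frame.iloc[0]  # first row of the DataFrame, by position

print(col0.tolist())
print(row0.tolist())
print(row0_df.tolist())
```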