## Introduction to Pandas ##
We will learn:

a) What is Pandas?

b) What is a Series?

c) What is a DataFrame?

d) How to read data from a file into a DataFrame?

e) Exploring the DataFrame.

In [2]:
import numpy as np

from pandas import Series,DataFrame
import pandas as pd

In [3]:
# Series
# A Series is a labelled, one-dimensional collection of values, similar to a 1-D NumPy array.

series = Series([3,6,9,12])

print(series)

0     3
1     6
2     9
3    12
dtype: int64


In [4]:
# Let's print the values
print(series.values)

# Let's print the index
print(series.index.values)

[ 3  6  9 12]
[0 1 2 3]
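
Because `values` is a plain NumPy array, vectorised arithmetic works on a Series much as it does on an array, while the index labels are carried along. A minimal sketch reusing the numbers from the cell above (the names `doubled` and `raw` are illustrative only):

```python
import numpy as np
import pandas as pd

series = pd.Series([3, 6, 9, 12])

# Arithmetic is applied element-wise and the index is preserved.
doubled = series * 2

# .values exposes the underlying NumPy array, without the labels.
raw = series.values

print(doubled.tolist())     # [6, 12, 18, 24]
print(type(raw).__name__)   # ndarray
print(series.mean())        # 7.5
```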


In [5]:
# A key advantage of a Series is that its index labels do not have to be integers.

ww2_casuality = Series([8700000,4300000,3000000,2100000,400000],
                       index=['USSR','Germany','China','Japan','USA'])

print(ww2_casuality)

USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
dtype: int64


In [6]:
# Now we can use index labels to select values in the Series
print(ww2_casuality['USA'])
print('\n')

# Check if USSR is in the Series index
print('USSR' in ww2_casuality)
print('\n')

# Check who had casualties greater than 4 million (this builds a boolean mask)
casualities_greater_than_4_million = ww2_casuality > 4000000
print(casualities_greater_than_4_million)
print('\n')

# Use the boolean mask to keep only the matching entries
print(ww2_casuality[casualities_greater_than_4_million])

400000


True


USSR        True
Germany     True
China      False
Japan      False
USA        False
dtype: bool


USSR       8700000
Germany    4300000
dtype: int64
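
Label-based selection and boolean masks can be combined in a few more ways. The sketch below rebuilds the same Series so that it runs on its own; the names `subset` and `between` are illustrative and not part of the original notebook:

```python
import pandas as pd

ww2_casuality = pd.Series(
    [8700000, 4300000, 3000000, 2100000, 400000],
    index=['USSR', 'Germany', 'China', 'Japan', 'USA'],
)

# Select several labels at once by passing a list of labels.
subset = ww2_casuality[['USA', 'Japan']]

# Combine two conditions with &; each condition needs its own parentheses.
between = ww2_casuality[(ww2_casuality > 1000000) & (ww2_casuality < 4000000)]

print(subset.tolist())       # [400000, 2100000]
print(list(between.index))   # ['China', 'Japan']
```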


In [7]:
# Naming the Series and its index
ww2_casuality.name = "Casualties in World War 2"
ww2_casuality.index.name = "Countries"
ww2_casuality

Countries
USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
Name: Casualties in World War 2, dtype: int64

### Now we'll learn DataFrame ###

A DataFrame is a two-dimensional, labelled data structure: each column is a Series, and all columns share one row index. It provides a suite of methods and attributes to quickly explore, analyze, and visualize data.
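
To make the "table of Series" idea concrete, here is a minimal sketch that builds a small DataFrame in memory. The three rows are copied from the ship data used in this notebook; the name `small_df` is illustrative only:

```python
import pandas as pd

# Three rows taken from the ship data; columns= fixes the column order explicitly.
small_df = pd.DataFrame(
    {
        'Passenger ID': [1, 2, 3],
        'Name': ['Alexander Harris', 'Frank Parsons', 'Anthony Churchill'],
        'Age': [22.0, 38.0, 26.0],
    },
    columns=['Passenger ID', 'Name', 'Age'],
)

print(small_df.shape)                   # (3, 3)
print(list(small_df.columns))           # ['Passenger ID', 'Name', 'Age']
print(type(small_df['Age']).__name__)   # Series
```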

In [8]:
# Read data from csv file
df = pd.read_csv("ship_data.csv")
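
`pd.read_csv` also accepts any file-like object, which is handy for experimenting without a file on disk. The sketch below uses a two-row, in-memory stand-in; its layout is an assumption and may not match the real `ship_data.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the second row has no Age value.
csv_text = """Passenger ID,Class,Name,Age
1,3,Alexander Harris,22.0
6,3,Megan Clarkson,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                     # (2, 4)
print(sample['Age'].isnull().sum())     # 1
```

Empty fields are read as `NaN`, which is why missing ages show up as blanks or `NaN` in the outputs of this notebook.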

In [9]:
# Now that we've read the dataset into a dataframe, we can start using the dataframe methods to explore the data
df.head(10)

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0
5,6,3,Megan Clarkson,male,,0,0,8458.3,Chicago,0
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0
7,8,3,Steven Jones,male,2.0,3,1,21075.0,New York,0
8,9,3,Bernadette Vance,female,27.0,0,2,11133.3,New York,1
9,10,2,Irene Chapman,female,-20.0,1,0,30070.8,Los Angeles,1


In [10]:
# Let's try it together:
# a) Try printing out the last 15 rows of df.
df.tail(15)

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
876,877,3,Isaac Roberts,male,20.0,0,0,9845.8,New York,0
877,878,3,Audrey Fisher,male,19.0,0,0,7895.8,New York,0
878,879,3,David Allan,male,,0,0,7895.8,New York,0
879,880,1,Victor Springer,female,56.0,0,1,83158.3,Los Angeles,1
880,881,2,Alison Oliver,female,25.0,0,1,26000.0,New York,1
881,882,3,William Walker,male,33.0,0,0,7895.8,New York,0
882,883,3,Dorothy Rampling,female,22.0,0,0,10516.7,New York,0
883,884,2,Max Poole,male,28.0,0,0,10500.0,New York,0
884,885,3,Simon Rutherford,male,25.0,0,0,7050.0,New York,0
885,886,3,Diane Simpson,female,39.0,0,5,29125.0,Chicago,0


In [11]:
# Length of the DataFrame (number of rows)
len(df)

891

In [12]:
# To access the full list of column names, use the columns attribute
column_names = df.columns.values

print(column_names)

['Passenger ID' 'Class' 'Name' 'Gender' 'Age' 'Siblings Count'
 'Parents Count' 'Fare' 'Embarked' 'Survived']


In [13]:
# Selecting an individual column returns a Series
series = df["Passenger ID"]
print(series.head(10))

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: Passenger ID, dtype: int64


In [14]:
# Selecting multiple columns (pass a list of names) returns a DataFrame
df_subset = df[["Passenger ID", "Name"]]
print(df_subset.head(10))

   Passenger ID               Name
0             1   Alexander Harris
1             2      Frank Parsons
2             3  Anthony Churchill
3             4   Alexandra Hughes
4             5        Joan Fraser
5             6     Megan Clarkson
6             7        Molly Bower
7             8       Steven Jones
8             9   Bernadette Vance
9            10      Irene Chapman


In [16]:
# Selecting rows by index label
df.loc[[1, 2, 6]]

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1
6,7,1,Molly Bower,male,54.0,0,0,51862.5,New York,0


In [17]:
df.loc[[1,5,10]]

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1
5,6,3,Megan Clarkson,male,,0,0,8458.3,Chicago,0
10,11,3,Gavin Payne,female,4.0,1,1,16700.0,New York,1
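
`.loc` selects by index label, whereas `.iloc` selects by position. With the default 0, 1, 2, ... index the two look identical, so the sketch below uses a hypothetical frame whose labels do not start at 0 (the names are taken from rows 1, 5 and 10 above):

```python
import pandas as pd

# Hypothetical frame with non-consecutive index labels.
frame = pd.DataFrame(
    {'Name': ['Frank Parsons', 'Megan Clarkson', 'Gavin Payne']},
    index=[1, 5, 10],
)

by_label = frame.loc[5]       # the row whose index label is 5
by_position = frame.iloc[0]   # the first row, whatever its label

print(by_label['Name'])       # Megan Clarkson
print(by_position['Name'])    # Frank Parsons
```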


In [18]:
# Try adding a column, say Alive/Dead, to the DataFrame (a scalar value is broadcast to every row)
df["Alive/Dead"] = 'Dead'
df.head()

Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Dead
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,Dead
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,Dead
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,Dead
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Dead


In [19]:
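# Assigning a Series aligns it with the DataFrame on index labels;
# rows whose label is missing from the Series are filled with NaN.
alive_or_dead = Series(["Alive","Alive"],index=[4,0])

print(alive_or_dead)
df["Alive/Dead"] = alive_or_dead
df.head()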

4    Alive
0    Alive
dtype: object


Unnamed: 0,Passenger ID,Class,Name,Gender,Age,Siblings Count,Parents Count,Fare,Embarked,Survived,Alive/Dead
0,1,3,Alexander Harris,male,22.0,1,0,7250.0,New York,0,Alive
1,2,1,Frank Parsons,female,38.0,1,0,71283.3,Los Angeles,1,
2,3,3,Anthony Churchill,female,26.0,0,0,7925.0,New York,1,
3,4,1,Alexandra Hughes,female,35.0,1,0,53100.0,New York,1,
4,5,3,Joan Fraser,male,35.0,0,0,8050.0,New York,0,Alive
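
The blanks in rows 1 to 3 are `NaN` values produced by index alignment. The sketch below reproduces that effect on a small hypothetical frame and then derives a full column from `Survived` with `Series.map`. That `Survived == 1` means the passenger lived is an assumption, as are the names `frame` and `Partial`:

```python
import pandas as pd

# Survived values of the first five rows shown above.
frame = pd.DataFrame({'Survived': [0, 1, 1, 1, 0]})

# Alignment on index labels: only rows 4 and 0 receive a value, the rest become NaN.
frame['Partial'] = pd.Series(['Alive', 'Alive'], index=[4, 0])

# Assumption: Survived == 1 means alive, 0 means dead.
frame['Alive/Dead'] = frame['Survived'].map({1: 'Alive', 0: 'Dead'})

print(frame['Partial'].isnull().sum())   # 3
print(frame['Alive/Dead'].tolist())      # ['Dead', 'Alive', 'Alive', 'Alive', 'Dead']
```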


In [20]:
# Check what these return:
#               1) type(df)   - Check the datatype
#               2) df.info()  - Check the index range, column dtypes and non-null counts

In [30]:
# Try out:
#         1) matrix = df.values  (df.as_matrix() was removed in pandas 1.0)
#         2) Check the datatype of matrix
#         3) Print matrix[0, 0] and matrix[0]
#         4) Check the datatype of matrix[0]
# Try out:
#         1) Print df.iloc[0]
#         2) Print df.loc[0]  (df.ix was removed in pandas 1.0)
#         3) Check the datatype of df.loc[0]
matrix = df.values
type(matrix)
# print(df.iloc[0])

numpy.ndarray
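
When a DataFrame mixes numeric and text columns, the converted array has to fall back to the generic `object` dtype. A minimal sketch on a hypothetical two-row frame (the names `frame`, `matrix` and `ids` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'Passenger ID': [1, 2], 'Name': ['Alexander Harris', 'Frank Parsons']},
    columns=['Passenger ID', 'Name'],
)

matrix = frame.values                 # mixed column types -> dtype object
ids = frame['Passenger ID'].values    # a single numeric column keeps a numeric dtype

print(matrix.shape)    # (2, 2)
print(matrix.dtype)    # object
print(matrix[0, 1])    # Alexander Harris
print(np.issubdtype(ids.dtype, np.integer))   # True
```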

In [33]:
# Select the rows where the Passenger ID is less than 10
# Hint: a DataFrame can be indexed with a boolean condition such as df["Passenger ID"] < 10.
passenger_id_less_than_10 = (df["Passenger ID"] < 10)

print(passenger_id_less_than_10.head())

df[passenger_id_less_than_10]

0    True
1    True
2    True
3    True
4    True
Name: Passenger ID, dtype: bool


In [37]:
# Select the passengers travelling in first class
passenger_in_first_class = (df["Class"] == 1)
print(df[passenger_in_first_class])

     Passenger ID  Class                Name  Gender   Age  Siblings Count  \
1               2      1       Frank Parsons  female  38.0               1   
3               4      1    Alexandra Hughes  female  35.0               1   
6               7      1         Molly Bower    male  54.0               0   
11             12      1       Melanie Scott  female  58.0               0   
23             24      1      Maria Mitchell    male  -5.0               0   
27             28      1       Steven Butler    male  19.0               3   
30             31      1       Piers Coleman    male  40.0               0   
31             32      1        Angela Paige  female   NaN               1   
34             35      1    Penelope Manning    male  28.0               1   
35             36      1    Bernadette Nolan    male  42.0               1   
52             53      1           Una Vance  female  49.0               1   
54             55      1     Steven McDonald    male  65.0      


In [38]:
# Try it: you can combine multiple conditions with the element-wise operators & (and) and | (or).
# Wrap each condition in parentheses, because & and | bind more tightly than comparisons.
passengers_in_first_class_with_id_less_than_10 = (df["Class"] == 1) & (df["Passenger ID"] < 10)
df[passengers_in_first_class_with_id_less_than_10]
len(df)

891
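
Note that `len(df)` is still 891: filtering returns a new DataFrame and leaves `df` unchanged. The sketch below shows the and, or and not operators side by side on a hypothetical five-row frame that uses the `Class` values of the first five passengers:

```python
import pandas as pd

frame = pd.DataFrame(
    {'Passenger ID': [1, 2, 3, 4, 5], 'Class': [3, 1, 3, 1, 3]},
    columns=['Passenger ID', 'Class'],
)

first_class = frame['Class'] == 1
low_id = frame['Passenger ID'] < 3

both = frame[first_class & low_id]          # AND: both conditions hold
either = frame[first_class | low_id]        # OR: at least one condition holds
neither = frame[~(first_class | low_id)]    # NOT: invert a mask with ~

print(both['Passenger ID'].tolist())      # [2]
print(either['Passenger ID'].tolist())    # [1, 2, 4]
print(neither['Passenger ID'].tolist())   # [3, 5]
print(len(frame))                         # 5
```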

## Check point ##

Pandas can load data from a file (for example with pd.read_csv) into a DataFrame.

In Pandas, a Series can be thought of as a single column or row (1D), and a DataFrame as a table (2D).

A Pandas Series or DataFrame can easily be converted to a NumPy array.

REMEMBER: DataFrame[0] - returns all values of the column labelled '0' (and raises a KeyError if no such column exists).

NumpyArray[0] - returns all values in row '0'.
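
The difference between the two kinds of indexing can be checked on a tiny hypothetical frame (the names `frame`, `array` and `lookup_failed` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, columns=['a', 'b'])
array = frame.values

# Square brackets on a DataFrame look up a column by its label.
print(frame['a'].tolist())   # [1, 2]

# Square brackets on a NumPy array pick a row by position.
print(array[0].tolist())     # [1, 3]

# There is no column labelled 0 here, so frame[0] raises a KeyError.
try:
    frame[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)         # True
```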