NumPy's ndarrays are well suited to math operations on one- and two-dimensional arrays of numeric values, but they fall short when dealing with heterogeneous data sets. To store data from an external source such as an Excel workbook or a database, we need a data structure that can hold different data types. It is also desirable to refer to rows and columns by custom labels rather than by numeric indexes.

Pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy.
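To make the limitation concrete, here is a minimal sketch that compares how a NumPy array and a DataFrame treat mixed values. The names `mixed` and `people` and their values are illustrative assumptions, not part of any real data set.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed values are coerced to strings.
mixed = np.array(["Joe", 10, 4.5])
print(mixed.dtype.kind)   # "U" means a Unicode string dtype

# A DataFrame keeps a separate dtype for each column.
# The column names and values are illustrative only.
people = pd.DataFrame({"name": ["Joe"], "age": [10], "height": [4.5]})
print(people.dtypes)
```

The array ends up storing every value as text, while the DataFrame keeps `age` as an integer column and `height` as a floating-point column.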

<B>Pandas Series</B>

A Series is very similar to an ndarray. The main difference is that a Series lets you provide custom index labels, and operations you perform on Series automatically align the data by those labels.

In [2]:
import numpy as np
import pandas as pd 

my_series = pd.Series( data = [2,3,5,4],             # Data
                       index= ['a', 'b', 'c', 'd'])  # Indexes

my_series

a    2
b    3
c    5
d    4
dtype: int64

In [3]:
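# Create a Series from a dictionary. The output below comes from an older pandas
# release that sorted the keys; newer releases keep insertion order (x, a, b, c).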

my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

my_series2 = pd.Series(my_dict)

my_series2 

a    5
b    4
c    8
x    2
dtype: int64

In [4]:
my_series["a"]

2

In [5]:
my_series[1:3]

b    3
c    5
dtype: int64

In [6]:
# Operations performed on two Series align by label:

my_series + my_series

a     4
b     6
c    10
d     8
dtype: int64
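Adding a Series to itself cannot show the difference between label alignment and position alignment, because both give the same answer. The sketch below reorders a copy of the data first; the names `s` and `shuffled` are assumptions introduced for the illustration.

```python
import pandas as pd

s = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
shuffled = s[["d", "c", "b", "a"]]   # Same data, rows in reverse order

by_label = s + shuffled                           # pandas pairs values by label
by_position = s.to_numpy() + shuffled.to_numpy()  # NumPy pairs values by position

print(by_label.to_dict())
print(by_position.tolist())
```

With label alignment every value is simply doubled, whereas the positional sum pairs "a" with "d", "b" with "c", and so on.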

If you perform an operation on two Series that have different labels, each unmatched label gets the value NaN (not a number) in the result.

In [7]:
my_series + my_series2

a     7.0
b     7.0
c    13.0
d     NaN
x     NaN
dtype: float64
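If NaN is not the result you want for unmatched labels, one option is the `add` method with its `fill_value` parameter, which substitutes a default for the missing side. This is a minimal sketch that rebuilds the two Series from the cells above under the assumed names `s1` and `s2`.

```python
import pandas as pd

s1 = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
s2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

plain = s1 + s2                   # "d" and "x" have no partner, so they become NaN
filled = s1.add(s2, fill_value=0) # Treat the missing partner as 0 instead

print(plain.isna().sum())
print(filled.to_dict())
```

Whether 0 is a sensible default depends on the data; for a sum it leaves the unmatched value unchanged.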

<B>DataFrames</B>

A DataFrame is a 2D table with labeled columns that can each hold a different type of data. DataFrames are essentially a Python implementation of the kind of table you'd see in an Excel workbook or SQL database. DataFrames are the de facto standard data structure for working with tabular data in Python.

In [8]:
# Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : [75,123,239],
           "height" : pd.Series([4.5, 5, 6.1], 
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}

df = pd.DataFrame(my_dict)   # Convert the dict to DataFrame

df                           # Show the DataFrame

       age gender  height   name  siblings  weight
Joe     10      M     4.5    Joe         1      75
Bob     15      M     5.0    Bob         1     123
Frans   20      M     6.1  Frans         1     239


In [9]:
my_dict2 = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : [75,123,239],
           "height" :[4.5, 5, 6.1],
           "siblings" : 1,
           "gender" : "M"}

df2 = pd.DataFrame(my_dict2)   # Convert the dict to DataFrame

df2                            # Show the DataFrame

   age gender  height   name  siblings  weight
0   10      M     4.5    Joe         1      75
1   15      M     5.0    Bob         1     123
2   20      M     6.1  Frans         1     239


In [10]:
df2 = pd.DataFrame(my_dict2,
                   index = my_dict["name"] )

df2

       age gender  height   name  siblings  weight
Joe     10      M     4.5    Joe         1      75
Bob     15      M     5.0    Bob         1     123
Frans   20      M     6.1  Frans         1     239


A DataFrame behaves like a dictionary of Series objects that all share the same length and index. This means we can get, add, and delete columns in a DataFrame the same way we would with a dictionary:

In [11]:
# Get a column by name

df2["weight"]

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64

In [12]:
# You can also use dot notation

df2.weight

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64
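Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses two hypothetical columns, `count` and `shoe size`, chosen to show both cases.

```python
import pandas as pd

# Hypothetical columns chosen to trigger both problems.
df = pd.DataFrame({"count": [1, 2, 3], "shoe size": [7, 9, 11]})

print(type(df["count"]))      # Bracket notation returns the column (a Series)
print(callable(df.count))     # Dot notation finds the DataFrame.count method instead
print(df["shoe size"].max())  # A name with a space can only be reached with brackets
```

Bracket notation works for every column name, so it is the safer default in scripts.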

In [13]:
# Delete a column

del df2['name']

In [14]:
# Add a new column

df2["IQ"] = [130, 105, 115]

df2

       age gender  height  siblings  weight   IQ
Joe     10      M     4.5         1      75  130
Bob     15      M     5.0         1     123  105
Frans   20      M     6.1         1     239  115


In [15]:
# Assigning a single value to a column fills that column with the value in every row

df2["Married"] = False

df2

       age gender  height  siblings  weight   IQ  Married
Joe     10      M     4.5         1      75  130    False
Bob     15      M     5.0         1     123  105    False
Frans   20      M     6.1         1     239  115    False


When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN:

In [16]:
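# The label 2 matches none of the row labels (Joe, Bob, Frans),
# so every row of the new column is filled with NaN
df2["College"] = pd.Series(["Harvard"],
                           index=[2])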

df2

       age gender  height  siblings  weight   IQ  Married College
Joe     10      M     4.5         1      75  130    False     NaN
Bob     15      M     5.0         1     123  105    False     NaN
Frans   20      M     6.1         1     239  115    False     NaN
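For contrast, here is a minimal sketch in which the inserted Series does carry a label that exists in the DataFrame. The frame `people` is a small stand-in built for this example; it is an assumption, not the `df2` object from the cells above.

```python
import pandas as pd

# Small stand-in frame with the same row labels as the example above.
people = pd.DataFrame({"age": [10, 15, 20]}, index=["Joe", "Bob", "Frans"])

# The Series is indexed by the label "Frans", so only that row receives a value.
people["College"] = pd.Series(["Harvard"], index=["Frans"])

print(people)
```

Only the row labeled "Frans" gets "Harvard"; the rows for Joe and Bob have no matching label and are filled with NaN.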


In [17]:
# Use loc to select rows or columns by label

df2.loc["Bob"]          # Select row "Bob"

age            15
gender          M
height          5
siblings        1
weight        123
IQ            105
Married     False
College       NaN
Name: Bob, dtype: object

In [18]:
df2.loc["Bob", "height"]          # Select row "Bob" and column "height"

5.0

In [19]:
df2["height"] # Select column "height"

Joe      4.5
Bob      5.0
Frans    6.1
Name: height, dtype: float64

In [20]:
df2.iloc[0]          # Get row 0

age            10
gender          M
height        4.5
siblings        1
weight         75
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object

In [21]:
df2.iloc[0, 5]       # Get row 0, column 5

130
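Both `loc` and `iloc` also accept slices, with one difference worth knowing: a label slice in `loc` includes its end label, while a position slice in `iloc` excludes its end position. The sketch below uses a small stand-in frame named `people`, which is an assumption made for the example.

```python
import pandas as pd

people = pd.DataFrame({"age": [10, 15, 20], "height": [4.5, 5.0, 6.1]},
                      index=["Joe", "Bob", "Frans"])

by_label = people.loc["Joe":"Bob", "age"]   # Joe and Bob: the end label is included
by_position = people.iloc[0:1, 0]           # Joe only: the end position is excluded

print(by_label)
print(by_position)
```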

In [22]:
# Import the built-in mtcars DataFrame from the ggplot package:

from ggplot import mtcars

type(mtcars)

pandas.core.frame.DataFrame

In [23]:
mtcars.shape      # Check dimensions

(32, 12)

In [24]:
mtcars.head(10)    # Check the first 10 rows

                name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110   3.9   2.62  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85   2.32  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15   3.44  17.02   0   0     3     2
5            Valiant  18.1    6  225.0  105  2.76   3.46  20.22   1   0     3     1
6         Duster 360  14.3    8  360.0  245  3.21   3.57  15.84   0   0     3     4
7          Merc 240D  24.4    4  146.7   62  3.69   3.19   20.0   1   0     4     2
8           Merc 230  22.8    4  140.8   95  3.92   3.15   22.9   1   0     4     2
9           Merc 280  19.2    6  167.6  123  3.92   3.44   18.3   1   0     4     4


In [25]:
mtcars.columns

Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

In [26]:
mtcars["name"]

0               Mazda RX4
1           Mazda RX4 Wag
2              Datsun 710
3          Hornet 4 Drive
4       Hornet Sportabout
5                 Valiant
6              Duster 360
7               Merc 240D
8                Merc 230
9                Merc 280
10              Merc 280C
11             Merc 450SE
12             Merc 450SL
13            Merc 450SLC
14     Cadillac Fleetwood
15    Lincoln Continental
16      Chrysler Imperial
17               Fiat 128
18            Honda Civic
19         Toyota Corolla
20          Toyota Corona
21       Dodge Challenger
22            AMC Javelin
23             Camaro Z28
24       Pontiac Firebird
25              Fiat X1-9
26          Porsche 914-2
27           Lotus Europa
28         Ford Pantera L
29           Ferrari Dino
30          Maserati Bora
31             Volvo 142E
Name: name, dtype: object
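Since the car names identify each row, they can serve as row labels in place of the default integer index. The sketch below assumes the `ggplot` package may not be installed, so it rebuilds a tiny frame by hand from the first three rows and four of the columns of the `head()` output shown earlier. The name `cars` is an assumption.

```python
import pandas as pd

# First three rows of the head() output shown earlier, limited to four columns.
cars = pd.DataFrame({
    "name": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "cyl": [6, 6, 4],
    "hp": [110, 110, 93],
})

cars = cars.set_index("name")   # Use the car names as row labels

print(cars.shape)
print(cars.loc["Datsun 710", "mpg"])
```

After `set_index`, the `name` column becomes the index, so the frame has three columns and rows can be looked up with `loc` by car name.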