## Introduction to Pandas

Pandas is an open source package providing high-performance data structures and analysis tools for Python. It provides a Python equivalent of the data analysis and manipulation tools available in the R programming language.

Pandas offers two key data structures that are optimised for data analysis and manipulation: the Series and the Data Frame. Unlike basic Python data structures, they make it easy to associate an *index* with the data, i.e. row and column names.

To start off, we import the Pandas package. We can import it as *pd* for shorthand.

In [1]:
import pandas as pd

### Pandas Series

A *Series* is a one-dimensional array capable of holding any data type (integers, strings, floating point numbers, etc.).  

To create a new Series, we can use the *Series()* constructor. The simplest (but least useful) approach is to pass in a Python list. In that case the Series gets a default numeric index, counting from 0.

In [2]:
s1 = pd.Series([2,101,45,232,45,67])
s1

0      2
1    101
2     45
3    232
4     45
5     67
dtype: int64

We can also explicitly pass a list of index values as the second argument to *Series()* to use a more interesting index (in this case strings containing country names):

In [3]:
populations = pd.Series([1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000], 
                        ["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"])
print(populations)

China            1357000000
India            1252000000
United States     321068000
Indonesia         249900000
Brazil            200400000
Pakistan          191854000
dtype: int64
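The second positional argument in the call above is the index. As a small sketch (using only three of the countries, purely for brevity), the same kind of Series can be built with the `index` keyword spelled out, which tends to make the call easier to read:

```python
import pandas as pd

# Same idea as above, with the index passed by keyword (three countries only)
populations = pd.Series(
    [1357000000, 1252000000, 321068000],
    index=["China", "India", "United States"],
)

# The labels and the underlying values can be inspected separately
print(populations.index.tolist())   # the index labels as a plain list
print(populations.to_numpy())       # the values as a NumPy array
```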


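This is very similar to a Python dictionary. In fact, we can create a Series directly from a Python dictionary, in which case the keys of the dictionary act as the index for the Series. Note that the output below was produced by an older pandas release that sorted the dictionary keys alphabetically; recent releases keep the keys in insertion order.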

In [4]:
populations = pd.Series({"China":1357000000, "India":1252000000, "United States":321068000, "Indonesia":249900000, 
                         "Brazil":200400000, "Pakistan":191854000})
print(populations)

Brazil            200400000
China            1357000000
India            1252000000
Indonesia         249900000
Pakistan          191854000
United States     321068000
dtype: int64
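If a particular ordering matters, it is safer to request it explicitly than to rely on how the constructor happens to order dictionary keys. A minimal sketch using `sort_index()`, with a shortened three-country dictionary chosen only for illustration:

```python
import pandas as pd

populations = pd.Series(
    {"China": 1357000000, "India": 1252000000, "Brazil": 200400000}
)

# sort_index() returns a new Series ordered by index label
by_label = populations.sort_index()
print(by_label.index.tolist())   # ['Brazil', 'China', 'India']
```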


#### Basic Statistics
A Series has associated methods for a range of simple analyses:

In [5]:
populations.min()

191854000

In [6]:
populations.max()

1357000000

In [7]:
populations.median()

285484000.0

In [8]:
populations.mean()

595370333.3333334

In [10]:
populations.std()   # standard deviation

552206625.2879139

The *describe()* function gives a useful statistical summary of a Series using a single line:

In [9]:
populations.describe()

count    6.000000e+00
mean     5.953703e+08
std      5.522066e+08
min      1.918540e+08
25%      2.127750e+08
50%      2.854840e+08
75%      1.019267e+09
max      1.357000e+09
dtype: float64
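The percentile rows of the summary can also be computed one at a time. The sketch below assumes the same six population values and uses `quantile()`, whose default linear interpolation should match the 25%, 50% and 75% rows reported by *describe()*:

```python
import pandas as pd

populations = pd.Series(
    [1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
    index=["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
)

# Each call returns a single float; 0.50 is the median
q25 = populations.quantile(0.25)
q50 = populations.quantile(0.50)
q75 = populations.quantile(0.75)
print(q25, q50, q75)
```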

#### Accessing Series Values by Position
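A Series offers a number of choices for accessing element values. We can use simple position numbers, as with lists, counting from 0. Recent pandas releases deprecate plain integer keys on a Series whose index is not numeric, so *populations.iloc[0]* is the explicit way to ask for a position there: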

In [11]:
populations[0]  # value at 1st position

200400000

In [12]:
populations[2]  # value at 3rd position

1252000000

We can use slicing via the *:* operator (remember that a slice includes the elements from the start position up to, but not including, the end position):

In [13]:
populations[0:2]  # start at 1st position, end before 3rd position

Brazil     200400000
China     1357000000
dtype: int64

In [14]:
populations[1:]  # everything from the 2nd position onwards

China            1357000000
India            1252000000
Indonesia         249900000
Pakistan          191854000
United States     321068000
dtype: int64

In [15]:
populations[:3]  # everything before the 4th position

Brazil     200400000
China     1357000000
India     1252000000
dtype: int64
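Position-based access can also be written with the `iloc` accessor, which always interprets its argument as a position and never as a label. A short sketch on a three-element Series (the values are taken from the examples above; the shortened Series is only for illustration):

```python
import pandas as pd

populations = pd.Series(
    [200400000, 1357000000, 1252000000],
    index=["Brazil", "China", "India"],
)

first = populations.iloc[0]         # value at the 1st position
first_two = populations.iloc[0:2]   # 1st and 2nd positions; the end is excluded
last = populations.iloc[-1]         # negative positions count from the end
print(first, last)
```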

We can also provide Boolean expressions for conditional data access:

In [16]:
# Check which of the values in the Series are > 1 billion
populations > 1000000000

Brazil           False
China             True
India             True
Indonesia        False
Pakistan         False
United States    False
dtype: bool

In [17]:
# Select only the positions for which the corresponding values in the Series are > 1 billion
populations[populations > 1000000000]

China    1357000000
India    1252000000
dtype: int64
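Boolean conditions can be combined before they are used for selection. The sketch below assumes the original six population values and uses the element-wise operators `&` (and) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`:

```python
import pandas as pd

populations = pd.Series(
    [1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
    index=["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
)

# Between 200 million and 1 billion
mid_sized = populations[(populations > 200000000) & (populations < 1000000000)]

# Everything that is NOT above 1 billion
not_huge = populations[~(populations > 1000000000)]
print(mid_sized)
```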

We can use numeric positions to modify values in a series:

In [18]:
populations[0] = 1364730000
populations

Brazil           1364730000
China            1357000000
India            1252000000
Indonesia         249900000
Pakistan          191854000
United States     321068000
dtype: int64

#### Accessing Series Values by Index
We can also access elements using the index defined at creation, similar to a dictionary:

In [19]:
populations["India"]

1252000000

In [20]:
populations["China"]

1357000000

We can access multiple elements by passing a list of index values:

In [21]:
populations[["Brazil","China"]]

Brazil    1364730000
China     1357000000
dtype: int64

We can also use an index to modify values in a series:

In [22]:
populations["China"] = 1374730000
populations

Brazil           1364730000
China            1374730000
India            1252000000
Indonesia         249900000
Pakistan          191854000
United States     321068000
dtype: int64
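Label-based access has an explicit form as well, the `loc` accessor. One detail worth a small example: unlike positional slices, a label slice includes its end label. The four-country Series below is a shortened stand-in for the one used above:

```python
import pandas as pd

populations = pd.Series(
    [200400000, 1357000000, 1252000000, 249900000],
    index=["Brazil", "China", "India", "Indonesia"],
)

one = populations.loc["India"]             # single label
span = populations.loc["Brazil":"India"]   # label slice, "India" is included
has_peru = "Peru" in populations           # membership test checks the index
print(span)
```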

### Pandas Data Frames

A Pandas *Data Frame* is a 2-dimensional labelled data structure with columns of data that can be of different types. Like a series, it supports both position-based and index-based data access.

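The easiest way to create a Data Frame is to pass the *DataFrame()* constructor a dictionary of lists, where each list becomes a column. Notice that IPython Notebooks render frames in a tabular format. The output below comes from an older pandas release, which sorted the columns alphabetically; recent releases keep the dictionary order.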

In [23]:
countries = pd.DataFrame({"Country":["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
                          "Population":[1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
                          "GDP":[11384760, 2182580, 17968200, 888648, 1799610, 246849],
                          "Life Expectancy":[75.41, 68.13, 79.68, 72.45, 73.53, 67.39]})
countries

|   | Country | GDP | Life Expectancy | Population |
|---|---|---|---|---|
| 0 | China | 11384760 | 75.41 | 1357000000 |
| 1 | India | 2182580 | 68.13 | 1252000000 |
| 2 | United States | 17968200 | 79.68 | 321068000 |
| 3 | Indonesia | 888648 | 72.45 | 249900000 |
| 4 | Brazil | 1799610 | 73.53 | 200400000 |
| 5 | Pakistan | 246849 | 67.39 | 191854000 |


We can get the dimensions of the Data Frame using the *shape* attribute, which is a tuple of the form (rows, columns):

In [24]:
countries.shape

(6, 4)
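Besides `shape`, a frame exposes its column labels and per-column types. The sketch below also passes the optional `columns` argument to fix the column order explicitly; the two-row frame is a cut-down version of the one above, used only to keep the example short:

```python
import pandas as pd

countries = pd.DataFrame(
    {
        "Country": ["China", "India"],
        "Population": [1357000000, 1252000000],
        "GDP": [11384760, 2182580],
    },
    columns=["Country", "Population", "GDP"],  # explicit column order
)

print(countries.shape)             # (rows, columns)
print(countries.columns.tolist())  # column labels in order
print(countries.dtypes)            # one data type per column
```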

#### Loading Data Frames

We can read a Data Frame from a CSV file via the *read_csv()* function. The first line of the file contains the column names, and each subsequent line becomes a row in the frame. By default, the function assumes the values are comma-separated.

In [26]:
df = pd.read_csv("../data/countries.csv")

In [27]:
df.shape   # check the size of the dataset which we have loaded

(16, 7)

We can also tell the *read_csv()* function to use one of the columns in the CSV file as the index for the rows in our data. In the example below we will use the "Country ID" column:

In [29]:
df = pd.read_csv("../data/countries.csv",index_col="Country ID")
df

| Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | CPI |
|---|---|---|---|---|---|---|
| Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | 1.5171 |
| Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | 1.7999 |
| Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | 2.4493 |
| Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | 2.8622 |
| Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | 2.9961 |
| China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | 3.6356 |
| Brazil | 73.12 | 42.93 | 14.5 | 1.43 | 7.2 | 3.7741 |
| Israel | 81.3 | 28.8 | 3.6 | 6.77 | 12.5 | 5.8069 |
| U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | 7.1357 |
| Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | 7.536 |
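To try `index_col` without the data file at hand, the CSV text can be fed to `read_csv()` from memory. In the sketch below the header line is an assumption inferred from the displayed column names (the real file may differ), and the three data rows are copied from the table above:

```python
import io
import pandas as pd

# Stand-in for ../data/countries.csv; the header layout is assumed
csv_text = """Country ID,Life Exp.,Top-10 Income,Infant Mort.,Mil. Spend,School Years,CPI
Afghanistan,59.61,23.21,74.3,4.44,0.4,1.5171
Haiti,45.0,47.67,73.1,0.09,3.4,1.7999
Nigeria,51.3,38.23,82.6,1.07,4.1,2.4493
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="Country ID")
print(df.shape)        # the index column is no longer counted as a data column
print(df.index.name)   # the name of the column that became the index
```

This also explains why the shape drops from 7 columns to 6 once "Country ID" is used as the index.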


#### Data Frame Statistics
Again, we can use the *describe()* function to get a basic summary of the values in a Data Frame, which is presented as a table with statistics for each column:

In [30]:
df.describe()

|   | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | CPI |
|---|---|---|---|---|---|---|
| count | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 |
| mean | 73.47625 | 29.845 | 20.55 | 2.079375 | 9.4 | 5.72575 |
| std | 11.481893 | 7.295689 | 28.351296 | 1.76695 | 4.28859 | 2.917551 |
| min | 45.0 | 22.07 | 2.4 | 0.09 | 0.4 | 1.5171 |
| 25% | 72.46 | 25.2475 | 4.05 | 1.115 | 6.125 | 2.962625 |
| 50% | 79.3 | 28.15 | 5.6 | 1.425 | 11.5 | 6.4713 |
| 75% | 80.75 | 30.56 | 15.775 | 2.11 | 12.575 | 8.2027 |
| max | 82.09 | 47.67 | 82.6 | 6.77 | 14.2 | 9.4627 |


We can also get individual statistics for each column:

In [31]:
df.mean()

Life Exp.        73.476250
Top-10 Income    29.845000
Infant Mort.     20.550000
Mil. Spend        2.079375
School Years      9.400000
CPI               5.725750
dtype: float64

In [32]:
df.median()

Life Exp.        79.3000
Top-10 Income    28.1500
Infant Mort.      5.6000
Mil. Spend        1.4250
School Years     11.5000
CPI               6.4713
dtype: float64

In [33]:
df.sum()

Life Exp.        1175.620
Top-10 Income     477.520
Infant Mort.      328.800
Mil. Spend         33.270
School Years      150.400
CPI                91.612
dtype: float64

In [34]:
df.std()

Life Exp.        11.481893
Top-10 Income     7.295689
Infant Mort.     28.351296
Mil. Spend        1.766950
School Years      4.288590
CPI               2.917551
dtype: float64
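The calls above work directly because every column in `df` is numeric. For a frame that also holds text, such as the earlier `countries` frame, recent pandas versions expect the numeric columns to be selected explicitly. A minimal sketch, rebuilding part of that frame and using `numeric_only=True` and `agg()`:

```python
import pandas as pd

countries = pd.DataFrame({
    "Country": ["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
    "Population": [1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
    "Life Expectancy": [75.41, 68.13, 79.68, 72.45, 73.53, 67.39],
})

# Skip the text column when averaging
means = countries.mean(numeric_only=True)

# Several statistics at once: one row per statistic, one column per variable
summary = countries[["Population", "Life Expectancy"]].agg(["mean", "median"])
print(summary)
```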

#### Accessing Columns in Data Frames
Columns in a Data Frame can be accessed using the name of the column to give a single column Series. 

In [35]:
df["School Years"]

Country ID
Afghanistan     0.4
Haiti           3.4
Nigeria         4.1
Egypt           5.3
Argentina      10.1
China           6.4
Brazil          7.2
Israel         12.5
U.S.A          13.7
Ireland        11.5
U.K.           13.0
Germany        12.0
Canada         14.2
Australia      11.5
Sweden         12.8
New Zealand    12.3
Name: School Years, dtype: float64

Multiple columns can be selected by passing a list of column names. The result is a new Data Frame, where the row index values are also copied.

In [36]:
df[["CPI","School Years"]]

| Country ID | CPI | School Years |
|---|---|---|
| Afghanistan | 1.5171 | 0.4 |
| Haiti | 1.7999 | 3.4 |
| Nigeria | 2.4493 | 4.1 |
| Egypt | 2.8622 | 5.3 |
| Argentina | 2.9961 | 10.1 |
| China | 3.6356 | 6.4 |
| Brazil | 3.7741 | 7.2 |
| Israel | 5.8069 | 12.5 |
| U.S.A | 7.1357 | 13.7 |
| Ireland | 7.536 | 11.5 |


We can also select columns by numeric position, using *iloc* with slicing. The first slice selects rows and the second selects columns:

In [37]:
# Return all rows, and first two columns
df.iloc[:,0:2]

| Country ID | Life Exp. | Top-10 Income |
|---|---|---|
| Afghanistan | 59.61 | 23.21 |
| Haiti | 45.0 | 47.67 |
| Nigeria | 51.3 | 38.23 |
| Egypt | 70.48 | 26.58 |
| Argentina | 75.77 | 32.3 |
| China | 74.87 | 29.98 |
| Brazil | 73.12 | 42.93 |
| Israel | 81.3 | 28.8 |
| U.S.A | 78.51 | 29.85 |
| Ireland | 80.15 | 27.23 |
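Column selection by label and by position can be written in the same two-part form, with `:` standing for "all rows". The sketch below builds a two-row frame by hand from values shown above, so that it runs without the CSV file:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Life Exp.": [59.61, 45.0],
        "Top-10 Income": [23.21, 47.67],
        "CPI": [1.5171, 1.7999],
    },
    index=pd.Index(["Afghanistan", "Haiti"], name="Country ID"),
)

by_label = df.loc[:, ["CPI", "Life Exp."]]   # all rows, columns chosen by name
by_position = df.iloc[:, [0, 2]]             # all rows, 1st and 3rd columns
print(by_label)
```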


#### Accessing Rows in Data Frames
We can also access rows of the Data Frame in different ways. We can use numeric slicing to access ranges of individual rows:

In [38]:
df[0:2] # access first two rows

| Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | CPI |
|---|---|---|---|---|---|---|
| Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | 1.5171 |
| Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | 1.7999 |


We can access a single row of a Data Frame as a Series, by using *iloc* and the row position:

In [39]:
df.iloc[0]   # get first row

Life Exp.        59.6100
Top-10 Income    23.2100
Infant Mort.     74.3000
Mil. Spend        4.4400
School Years      0.4000
CPI               1.5171
Name: Afghanistan, dtype: float64

We can access a single row of a Data Frame, by using *loc* and the index of the row. Again this returns a Series:

In [41]:
df.loc["Sweden"]

Life Exp.        81.4300
Top-10 Income    22.1800
Infant Mort.      2.4000
Mil. Spend        1.2700
School Years     12.8000
CPI               9.2985
Name: Sweden, dtype: float64

In [42]:
df.loc["Argentina"]

Life Exp.        75.7700
Top-10 Income    32.3000
Infant Mort.     13.3000
Mil. Spend        0.7600
School Years     10.1000
CPI               2.9961
Name: Argentina, dtype: float64
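Rows can also be selected several at a time, either by passing a list of labels to `loc` or by filtering with a Boolean condition on a column, as was done for a Series earlier. The four-row frame below is built by hand from values shown above so that the sketch is self-contained:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Life Exp.": [59.61, 45.0, 51.3, 70.48],
        "School Years": [0.4, 3.4, 4.1, 5.3],
    },
    index=pd.Index(["Afghanistan", "Haiti", "Nigeria", "Egypt"], name="Country ID"),
)

some_rows = df.loc[["Haiti", "Egypt"]]      # a list of labels gives a Data Frame
schooled = df[df["School Years"] > 4]       # rows where the condition is True
print(schooled)
```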

#### Accessing Individual Values in Data Frames
There are a variety of different ways to access and change the individual values in a Data Frame, using combinations of an index and/or position.

In [43]:
# Access by column index, then row index
df["Mil. Spend"]["China"]

1.95

In [44]:
# Access by row index, then column index
df.loc["Ireland"]["School Years"]

11.5

In [45]:
# Access by row position, then column index
df.iloc[3]["Life Exp."]

70.480000000000004

In [46]:
# Access by row position and column position in a single iloc call
df.iloc[3, 4]

5.2999999999999998
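The lookups above each take two steps: first a row or column, then a value inside it. As a hedged alternative, the row and column can be given together in one call. The sketch uses a hand-built four-row frame with values from the examples above, so the column positions differ from the full dataset:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Life Exp.": [59.61, 45.0, 51.3, 70.48],
        "Mil. Spend": [4.44, 0.09, 1.07, 1.86],
        "School Years": [0.4, 3.4, 4.1, 5.3],
    },
    index=pd.Index(["Afghanistan", "Haiti", "Nigeria", "Egypt"], name="Country ID"),
)

a = df.loc["Egypt", "Life Exp."]   # row label, column label
b = df.iloc[3, 2]                  # row position, column position
c = df.at["Haiti", "Mil. Spend"]   # at: label-based access to a single cell
d = df.iat[0, 0]                   # iat: position-based access to a single cell
print(a, b, c, d)
```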

We can also use the *loc* and *iloc* accessors to modify values in a Data Frame. For example:

In [47]:
df.loc["Sweden"]

Life Exp.        81.4300
Top-10 Income    22.1800
Infant Mort.      2.4000
Mil. Spend        1.2700
School Years     12.8000
CPI               9.2985
Name: Sweden, dtype: float64

In [50]:
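# Change an individual value with a single loc[row, column] call; the chained
# form df.loc["Sweden"]["Life Exp."] = ... may only modify a temporary copy
df.loc["Sweden", "Life Exp."] = 85.1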

In [49]:
df.loc["Sweden"]

Life Exp.        85.1000
Top-10 Income    22.1800
Infant Mort.      2.4000
Mil. Spend        1.2700
School Years     12.8000
CPI               9.2985
Name: Sweden, dtype: float64
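Several cells in a row can be updated in one assignment by giving `loc` a list of column labels. In the sketch below the frame is a two-row stand-in built from values shown above, and the new "Mil. Spend" figure of 1.3 is a made-up value used only to demonstrate the syntax:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Life Exp.": [81.43, 75.77],
        "Mil. Spend": [1.27, 0.76],
        "CPI": [9.2985, 2.9961],
    },
    index=pd.Index(["Sweden", "Argentina"], name="Country ID"),
)

# Update two cells of the Sweden row at once (1.3 is a hypothetical value)
df.loc["Sweden", ["Life Exp.", "Mil. Spend"]] = [85.1, 1.3]
print(df.loc["Sweden"])
```

Other rows and columns are left untouched by the assignment.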