## Introduction to Pandas

Pandas is an open source package providing high-performance data structures and analysis tools for Python. It provides a Python equivalent of the data analysis and manipulation tools available in the R programming language.

Pandas offers two key data structures that are optimised for data analysis and manipulation: *Series* and *Data Frame* (written *DataFrame* in code). Unlike basic Python data structures, they make it easy to associate an *index* with data, i.e. row and column labels.

To start off, we import the Pandas package. We can import it as *pd* for shorthand.

In [None]:
import pandas as pd

### Pandas Series

A *Series* is a one-dimensional array capable of holding any data type (integers, strings, floating point numbers, etc.).  

To create a new series, we can use the *Series()* constructor. The simplest (but least useful) approach is to pass in a Python list, in which case the Series gets a default numeric index starting at 0.

In [None]:
s1 = pd.Series([2,101,45,232,45,67])
s1

We can also explicitly pass a list of index values as the second argument to the Series() constructor to use a more interesting index (in this case strings containing country names):

In [None]:
populations = pd.Series([1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000], 
                        ["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"])
print(populations)

This is very similar to a Python dictionary. In fact we can create a Series directly from a Python dictionary. Here the keys in the dictionary act as the index for the Series:

In [None]:
populations = pd.Series({"China":1357000000, "India":1252000000, "United States":321068000, "Indonesia":249900000, 
                         "Brazil":200400000, "Pakistan":191854000})
print(populations)
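As a small sketch of the dictionary analogy (using a shortened version of the population data above; the variable names *from_dict* and *from_lists* are just for illustration), the two construction styles give the same Series, and membership tests check the index in the same way that they check the keys of a dictionary:

```python
import pandas as pd

data = {"China": 1357000000, "India": 1252000000, "United States": 321068000}

# Build the same Series two ways: from a dictionary, and from values plus an index
from_dict = pd.Series(data)
from_lists = pd.Series(list(data.values()), index=list(data.keys()))

print(from_dict.equals(from_lists))   # the two Series hold the same data
print(list(from_dict.index))          # the dictionary keys became the index
print("India" in from_dict)           # "in" tests the index, like dictionary keys
```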

#### Basic Statistics
A Series has associated functions for a range of simple analyses:

In [None]:
populations.min()

In [None]:
populations.max()

In [None]:
populations.median()

In [None]:
populations.mean()

In [None]:
populations.std()   # standard deviation

The *describe()* function gives a useful statistical summary of a Series using a single line:

In [None]:
populations.describe()

#### Accessing Series Values by Position
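A Series offers a number of choices for accessing element values. We can use position numbers via *iloc*, counting from 0 as with lists. (With a string index like this one, using plain square brackets with an integer, e.g. *populations[0]*, is deprecated in recent versions of Pandas.)

In [None]:
populations.iloc[0]  # value at 1st position

In [None]:
populations.iloc[2]  # value at 3rd position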

We can use slicing via the *:* operator (remember this includes the elements from index start up to but not including index end): 

In [None]:
populations[0:2]  # start at 1st position, end before 3rd position

In [None]:
populations[1:]  # everything from the 2nd position onwards

In [None]:
populations[:3]  # everything before the 4th position

We can also provide Boolean expressions for conditional data access:

In [None]:
# Check which of the values in the Series are > 1 billion
populations > 1000000000

In [None]:
# Select only the positions for which the corresponding values in the Series are > 1 billion
populations[populations > 1000000000]
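Boolean conditions can also be combined. The sketch below reuses the population figures from above and assumes we want the countries between 200 million and 1 billion people; the names *mask* and *mid_sized* are only illustrative. Note that each condition needs its own parentheses, and that *&* (and) or *|* (or) are used rather than the Python keywords:

```python
import pandas as pd

populations = pd.Series({"China": 1357000000, "India": 1252000000,
                         "United States": 321068000, "Indonesia": 249900000,
                         "Brazil": 200400000, "Pakistan": 191854000})

# Each comparison produces a Boolean Series; & combines them element by element
mask = (populations > 200000000) & (populations < 1000000000)

# Keep only the entries where the mask is True
mid_sized = populations[mask]
print(mid_sized)
```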

We can use numeric positions with *iloc* to modify values in a series:

In [None]:
populations.iloc[0] = 1364730000
populations

#### Accessing Series Values by Index
We can also access elements using the index defined at creation, similar to a dictionary:

In [None]:
populations["India"]

In [None]:
populations["China"]

We can access multiple elements by passing a list of index values:

In [None]:
populations[["Brazil","China"]]

We can also use an index to modify values in a series:

In [None]:
populations["China"] = 1374730000
populations
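To make the distinction between positions and index labels explicit, a Series also provides *iloc* (position-based) and *loc* (label-based) accessors. The following is a minimal sketch on a shortened version of the population data; one difference worth noticing is that a label slice includes its end label, whereas a position slice excludes its end position:

```python
import pandas as pd

populations = pd.Series({"China": 1357000000, "India": 1252000000,
                         "United States": 321068000, "Indonesia": 249900000})

by_label = populations.loc["India"]    # look up by index label
by_position = populations.iloc[1]      # look up by position (2nd element)
print(by_label == by_position)

label_slice = populations.loc["India":"Indonesia"]   # includes both end labels
position_slice = populations.iloc[1:3]               # stops before position 3
print(len(label_slice), len(position_slice))
```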

### Pandas Data Frames

A Pandas *Data Frame* is a 2-dimensional labelled data structure with columns of data that can be of different types. Like a series, it supports both position-based and index-based data access.

The easiest way to create a Data Frame is to pass the *DataFrame()* constructor a dictionary of lists, where each list will be a column. Notice that Jupyter notebooks will render frames in a tabular format.

In [None]:
countries = pd.DataFrame({"Country":["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
                          "Population":[1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
                          "GDP":[11384760, 2182580, 17968200, 888648, 1799610, 246849],
                          "Life Expectancy":[75.41, 68.13, 79.68, 72.45, 73.53, 67.39]})
countries

We can get the dimensions of the Data Frame using the *shape* attribute, which is a tuple of the form (rows, columns):

In [None]:
countries.shape

#### Loading Data Frames

We read a Data Frame from a CSV file via the *read_csv()* function. The first line contains the column names, and each subsequent line becomes a row in the frame. By default, the function assumes the values are comma-separated.

In [None]:
df = pd.read_csv("countries.csv")

In [None]:
df.shape   # check the size of the dataset which we have loaded

We can also tell the *read_csv()* function to use one of the columns in the CSV file as the index for the rows in our data. In the example below we will use the "Country ID" column:

In [None]:
df = pd.read_csv("countries.csv",index_col="Country ID")
df
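The effect of *index_col* can be tried without a file on disk. The sketch below is self-contained: it reads from an in-memory string via *io.StringIO*, and the rows (labelled A, B and C) and their values are made up purely for illustration; only the column names are borrowed from this notebook.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file (placeholder values only)
csv_text = """Country ID,Life Exp.,School Years
A,70.5,8.0
B,80.0,12.5
C,75.0,10.0
"""

# Without index_col, "Country ID" is an ordinary column and rows are numbered 0, 1, 2
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col, "Country ID" becomes the row index and is no longer a data column
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="Country ID")

print(df_default.shape, df_indexed.shape)
print(df_indexed.loc["B", "School Years"])
```

The indexed frame has one column fewer, because the index is not counted in *shape*.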

#### Data Frame Statistics
Again we can use the describe() function to get a basic summary of the values in a Data Frame, which is presented as a table with statistics for each column:

In [None]:
df.describe()

We can also get individual statistics for each column:

In [None]:
df.mean()

In [None]:
df.median()

In [None]:
df.sum()

In [None]:
df.std()
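If a frame mixes numeric and text columns, recent versions of Pandas may raise an error when asked for the mean of a text column. One way around this, sketched here on a small made-up frame (the column names *Region*, *Score* and *Count* are hypothetical), is to pass *numeric_only=True*, or to compute the statistic on a single column:

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
frame = pd.DataFrame({"Region": ["X", "Y", "X"],
                      "Score": [1.0, 2.0, 3.0],
                      "Count": [10, 20, 30]})

# Restrict the calculation to the numeric columns
means = frame.mean(numeric_only=True)
print(means)

# Statistics can also be computed for one column at a time
score_median = frame["Score"].median()
print(score_median)
```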

#### Accessing Columns in Data Frames
Columns in a Data Frame can be accessed using the name of the column to give a single column Series. 

In [None]:
df["School Years"]

Multiple columns can be selected by passing a list of column names. The result is a new Data Frame, where the row index values are also copied.

In [None]:
df[["CPI","School Years"]]

We can also select columns by numeric position, using *iloc* with slicing:

In [None]:
# Return all rows, and first two columns
df.iloc[:,0:2]
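Position-based and label-based column selection can return the same result. The following sketch uses a small made-up frame (the row labels r1 to r3 and column names a, b and c are placeholders) to compare *iloc* with *loc*:

```python
import pandas as pd

# Small made-up frame; row and column names are placeholders
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]},
                     index=["r1", "r2", "r3"])

first_two_by_position = frame.iloc[:, 0:2]      # all rows, columns at positions 0 and 1
first_two_by_label = frame.loc[:, ["a", "b"]]   # all rows, columns named "a" and "b"
label_range = frame.loc[:, "a":"b"]             # label slices include the end label

print(first_two_by_position.equals(first_two_by_label))
print(list(label_range.columns))
```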

#### Accessing Rows in Data Frames
We can also access rows of the Data Frame in different ways. We can use numeric slicing to access ranges of individual rows:

In [None]:
df[0:2] # access first two rows

We can access a single row of a Data Frame as a Series, by using *iloc* and the row position:

In [None]:
df.iloc[0]   # get first row

We can also access a single row of a Data Frame by its index label, using *loc*. Again this returns a Series:

In [None]:
df.loc["Sweden"]

In [None]:
df.loc["Argentina"]

#### Accessing Individual Values in Data Frames
There are a variety of different ways to access and change the individual values in a Data Frame, using combinations of an index and/or position.

In [None]:
# Access by column index, then row index
df["Mil. Spend"]["China"]

In [None]:
# Access by row index, then column index
df.loc["Ireland"]["School Years"]

In [None]:
# Access by row position, then column index
df.iloc[3]["Life Exp."]

In [None]:
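# Access by row position and column position in a single iloc call
# (chained df.iloc[3][4] relies on integer lookups that are deprecated for labelled Series)
df.iloc[3, 4]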

We can also use these functions to modify values in a Data Frame. For example:

In [None]:
df.loc["Sweden"]

In [None]:
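# Change an individual value, selecting the row and column in a single loc call
# (chained indexing such as df.loc["Sweden"]["Life Exp."] may only modify a temporary copy)
df.loc["Sweden", "Life Exp."] = 85.1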

In [None]:
df.loc["Sweden"]
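When setting values, selecting the row and column in one *loc*, *iloc* or *at* call writes directly into the frame, whereas chained indexing can end up writing to a temporary copy. Here is a minimal sketch on a made-up two-row frame (the row labels A and B and all values are placeholders; the column names are borrowed from this notebook):

```python
import pandas as pd

# Made-up frame with placeholder rows "A" and "B"
frame = pd.DataFrame({"Life Exp.": [70.5, 80.0], "School Years": [8.0, 12.5]},
                     index=["A", "B"])

frame.loc["B", "Life Exp."] = 85.1    # row label, column label
frame.iloc[0, 1] = 9.0                # row position, column position
frame.at["A", "Life Exp."] = 71.0     # at: label-based access to a single value

print(frame)
```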