# Pandas library

Pandas is built on the [numpy]() library that we discussed in a previous lesson.
It provides Python with a new object which allows us to work with "relational" or "labeled" data in an easy way.
It offers access to data from databases or spreadsheets similar to what you can find in languages like R.

Pandas is designed more for data science than for pure numerical analysis, but the two sets of tools can be combined.

The library provides input/output tools for reading and writing data in MS Excel, CSV or HDF5 files.
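
As a minimal sketch of these input/output tools, the example below writes a small table to CSV and reads it back. It assumes an in-memory buffer instead of a file on disk, and the column names and values are made up for illustration. Only CSV is shown because the Excel and HDF5 readers need extra packages installed.

```python
import io
import pandas

# Hypothetical table: the column names and values are made up for illustration.
table = pandas.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})

# Write the table as CSV text into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
table.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pandas.read_csv(buffer)
print(restored)
```

Passing a file name instead of the buffer works the same way for both `to_csv` and `read_csv`.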

As usual we are going to import the library...

In [None]:
import pandas

We will plot some data later, so we also import the matplotlib library.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

And we will move the current directory to where our data files are located:

In [None]:
cd data

In [None]:
%more gapminder_gdp_europe.csv

In [None]:
pandas.read_csv?

Inside this directory you will find a file called *gapminder_gdp_europe.csv* that comes from the Software Carpentry project. The first column of this file contains the names of the countries, and each of the other columns contains the GDP per capita for a given year.

Using the 'index_col' parameter, we tell pandas to use the column 'country' as our row labels.

In [None]:
data = pandas.read_csv('gapminder_gdp_europe.csv', 
                       index_col='country')
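
To see what `index_col` changes, here is a small sketch that reads the same CSV text twice, with and without the parameter. The CSV text is hypothetical: it mimics the layout of the lesson file, but the country names and numbers are made up.

```python
import io
import pandas

# Hypothetical CSV text with the same layout as the lesson file; the numbers are made up.
csv_text = (
    "country,gdpPercap_1952,gdpPercap_1957\n"
    "A,100.0,110.0\n"
    "B,200.0,220.0\n"
)

# Without index_col, the rows are labeled with the positions 0, 1, ...
default_index = pandas.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' column becomes the row labels.
labeled = pandas.read_csv(io.StringIO(csv_text), index_col="country")

print(list(default_index.index))
print(list(labeled.index))
print(list(labeled.columns))  # 'country' is no longer a regular column
```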

We can check the type of this object:

In [None]:
type(data)

We now have an object of a new type called DataFrame, which contains our data:

In [None]:
print(data)

In [None]:
data

In [None]:
data.info()

In [None]:
data.shape

It looks like an array-like object, but be aware that you cannot access its elements the way you would with a numpy array. The following cell raises a `KeyError`:

In [None]:
data[0,0]

### Accessing elements, rows and columns using iloc and loc

To access the element in the first row and first column, we can use numpy-style positions with the **iloc** indexer (positions start at 0, as usual in Python):

In [None]:
print(data.iloc[0,0])

or we can use the row and column labels with the **loc** indexer:

In [None]:
print(data.loc["Albania", "gdpPercap_1952"])
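
The sketch below puts the three access styles side by side on a tiny table. The table is hypothetical (two made-up countries and values), so it runs without the lesson file.

```python
import pandas

# Hypothetical two-country table; the labels and values are made up.
small = pandas.DataFrame(
    {"gdpPercap_1952": [100.0, 200.0], "gdpPercap_1957": [110.0, 220.0]},
    index=["A", "B"],
)

# iloc selects by position: second row, first column.
by_position = small.iloc[1, 0]

# loc selects by label: row 'B', column 'gdpPercap_1952'.
by_label = small.loc["B", "gdpPercap_1952"]
print(by_position, by_label)

# Plain [] looks up column labels, so a numpy-style index fails.
try:
    small[0, 0]
except KeyError as error:
    print("KeyError:", error)
```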

It is possible to obtain all the information related to a country (i.e. a row):

In [None]:
print(data.loc["Albania", :])

or the GDP of every country for a given year (i.e. a column):

In [None]:
print(data.loc[:, "gdpPercap_1952"])

It is possible to slice by label. Unlike positional slices, a **loc** slice includes both end points:

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
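
One detail worth checking when slicing is which end points are included. The following sketch, on a hypothetical table with made-up labels and values, compares a label slice with **loc** against a positional slice with **iloc**.

```python
import pandas

# Hypothetical table; the row labels, column labels and values are made up.
small = pandas.DataFrame(
    {"y1962": [1.0, 2.0, 3.0], "y1967": [4.0, 5.0, 6.0], "y1972": [7.0, 8.0, 9.0]},
    index=["A", "B", "C"],
)

# Label slices with loc include both the start and the stop label.
label_slice = small.loc["A":"B", "y1962":"y1967"]

# Position slices with iloc exclude the stop position, as in numpy.
position_slice = small.iloc[0:1, 0:1]

print(label_slice.shape)
print(position_slice.shape)
```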

And we can do some numerical operations, like finding the maximum GDP in each column of a slice:

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

or the minimum value:

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
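
Note that `max()` and `min()` on a DataFrame return one value per column. The sketch below, using a hypothetical table with made-up values, shows how to get one value per row or a single overall value instead.

```python
import pandas

# Hypothetical table; the labels and values are made up.
small = pandas.DataFrame(
    {"y1962": [1.0, 2.0, 3.0], "y1967": [4.0, 5.0, 6.0], "y1972": [7.0, 8.0, 9.0]},
    index=["A", "B", "C"],
)

per_column_max = small.max()      # one value per column (year)
per_row_max = small.max(axis=1)   # one value per row (country)
overall_max = small.max().max()   # a single number for the whole table

print(per_column_max)
print(per_row_max)
print(overall_max)
```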

We can create a variable which contains a subset of the data:

In [None]:
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

We can select items using certain criteria, in this case a GDP greater than 10,000.

One way to do this is to create a boolean mask (similar to a numpy masked array) and use it to print only the items that satisfy the condition; the other items are shown as NaN:

In [None]:
mask = subset > 10000
print(subset[mask])
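
Here is a small sketch of what the mask and the masked table look like. It uses a hypothetical table with made-up values so that exactly one entry falls below the threshold.

```python
import pandas

# Hypothetical table; the labels and values are made up.
subset_demo = pandas.DataFrame(
    {"y1962": [8000.0, 12000.0], "y1967": [11000.0, 15000.0]},
    index=["A", "B"],
)

# The comparison returns a DataFrame of booleans with the same shape.
mask_demo = subset_demo > 10000

# Indexing with the mask keeps the shape and replaces rejected items with NaN.
masked = subset_demo[mask_demo]

print(mask_demo)
print(masked)
```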

We can ask pandas to provide a statistical description of the data using the method **describe**:

In [None]:
print(subset[subset > 10000].describe())
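
When the data contains NaN values left by a mask, `describe` leaves them out of the statistics. The sketch below illustrates this on a hypothetical table with made-up values; look at the `count` row of the result.

```python
import pandas

# Hypothetical table; the labels and values are made up.
subset_demo = pandas.DataFrame(
    {"y1962": [8000.0, 12000.0], "y1967": [11000.0, 15000.0]},
    index=["A", "B"],
)

# Entries at or below 10000 become NaN after masking.
masked = subset_demo[subset_demo > 10000]

# describe() computes its statistics per column and skips the NaN entries.
summary = masked.describe()
print(summary)
```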

Using the pandas plot function, it is possible to plot the change in GDP for a specific country directly from the selected row:

In [None]:
data.loc['Sweden'].plot()
plt.xticks(rotation=90)
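
It may help to look at what is being plotted. Selecting a single row returns a Series whose index holds the column labels, and those labels end up on the x axis. The sketch below shows this on a hypothetical table with made-up values, without drawing anything.

```python
import pandas

# Hypothetical two-country table; the labels and values are made up.
small = pandas.DataFrame(
    {"gdpPercap_1952": [100.0, 110.0], "gdpPercap_1957": [200.0, 220.0]},
    index=["A", "B"],
)

# Selecting one row by label returns a Series.
row = small.loc["A"]

print(type(row).__name__)
print(list(row.index))   # the column labels of the table
print(row.tolist())      # the values of that row
```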

To analyse the data with a bar chart in which every column represents a country, you can transpose the data. Here we transpose the first three countries:

In [None]:
data[0:3]

In [None]:
data[0:3].T
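
As a sketch of what these two steps do, the example below applies them to a hypothetical table with made-up labels and values. A slice inside `[]` selects rows by position, and `.T` swaps the row and column labels.

```python
import pandas

# Hypothetical table; the labels and values are made up.
small = pandas.DataFrame(
    {"y1962": [1.0, 2.0, 3.0], "y1967": [4.0, 5.0, 6.0]},
    index=["A", "B", "C"],
)

# A slice inside [] selects rows by position (here the first two rows).
first_rows = small[0:2]

# Transposing turns the row labels into columns and the column labels into rows.
transposed = first_rows.T

print(first_rows)
print(transposed)
```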

And we then plot the bar chart.

For people used to R, it is possible to adjust the properties of the plot to resemble the **ggplot2** package used in R:

In [None]:
plt.style.use('ggplot')
data[0:3].T.plot(kind='bar')
plt.xticks(rotation=90)
plt.ylabel('GDP per capita')

<div style='background:#B1E0A8; padding:10px 10px 10px 10px;'>
<H2> Challenges </H2>
<ol>
<li>
Create two variables 'gdp_sweden' and 'gdp_iceland' which contain the GDP for Sweden and Iceland respectively.
</li>
<li>
Plot the change in GDP per year.
</li>
</ol>
 </div>

<div style='background:#B1E0A8; padding:10px 10px 10px 10px;'>
<H2> Challenges </H2>

 <ol>
 <li>Read the data from the file 'agelist.txt' using *python* only (not pandas) and create a dictionary which contains this data (the key should be the name).
 </li>
 <li>
 Read the same file, this time using *pandas*, and create a dataframe df with the data.
 Hint: you cannot use the read_csv function, but pandas does provide a fixed-width formatted reader (read_fwf).
 </li>
 </ol>
 </div>

In [None]:
df

In [None]:
df.keys()

In [None]:
df['Name']

In [None]:
# Select the 'Age' entries of the rows whose 'Name' is 'Bob' in a single loc call
df.loc[df['Name'] == 'Bob', 'Age']

In [None]:
df['Name']

In [None]:
df2 = pandas.read_fwf('agelist.txt', index_col='Name')

In [None]:
df2
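
For readers without the data file at hand, here is a minimal sketch of `read_fwf` on fixed-width text. The content of 'agelist.txt' is not shown in this lesson, so the text below is an assumption: two hypothetical columns, 'Name' and 'Age', with made-up rows, read from an in-memory buffer.

```python
import io
import pandas

# Hypothetical fixed-width text; the real layout of 'agelist.txt' may differ.
text = (
    "Name   Age\n"
    "Alice   31\n"
    "Bob     27\n"
)

# read_fwf infers the column boundaries from the whitespace in the text.
plain = pandas.read_fwf(io.StringIO(text))

# With index_col, the 'Name' column becomes the row labels.
indexed = pandas.read_fwf(io.StringIO(text), index_col="Name")

print(plain)
print(indexed.loc["Bob", "Age"])
```

With the names as row labels, looking up a value by name needs a single **loc** call instead of a boolean filter.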