# pandas

This workshop's goal&mdash;which is facilitated by this Jupyter notebook&mdash;is to give attendees the confidence to use `pandas` in their research projects. Basic familiarity with Python _is_ assumed.

`pandas` is designed to make it easier to work with structured data. Most of the analyses you might perform will likely involve using tabular data, e.g., from .csv files or relational databases (e.g., SQL). The `DataFrame` object in `pandas` is "a two-dimensional tabular, column-oriented data structure with both row and column labels."

If you're curious:

>The `pandas` name itself is derived from _panel data_, an econometrics term for multidimensional structured data sets, and _Python data analysis_ itself.

To motivate this workshop, we'll work with example data and go through the various steps you might need to prepare data for analysis. You'll (hopefully) realize that doing this type of work is much more difficult using Python's built-in data structures.

The data used in these examples is available in the following [GitHub repository](#null). If you've [cloned that repo](https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-clone), which is the recommended approach, you'll have everything you need to run this notebook. Otherwise, you can download the data file(s) from the above link. (Note: this notebook assumes that the data files are in a directory named `data/` found within your current working directory.)

For this example, we're working with European unemployment data from Eurostat, which is hosted by [Google](https://code.google.com/p/dspl/downloads/list). There are several .csv files that we'll work with in this workshop.

Let's begin by importing `pandas` under its conventional abbreviation, `pd`. (We'll also import `numpy`, which we'll use for some operations.)

In [1]:
import numpy as np
import pandas as pd

The `read_csv()` function in `pandas` allows us to easily import our data. It assumes the data is comma-delimited. There are several parameters that you can specify. See the documentation [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). `read_csv()` returns a `DataFrame`.

Notice that we call `read_csv()` using the `pd` abbreviation from the import statement above.

In [2]:
unemployment = pd.read_csv('data/european_unemployment/country_total.csv')
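
If your data isn't comma-delimited, or you only need some of the columns, `read_csv()` has parameters for that. Here's a minimal sketch using `sep` and `usecols`. The semicolon-delimited text below is made up for illustration and stands in for a file on disk; it is not one of the workshop's data files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a semicolon-delimited file on disk.
raw = io.StringIO(
    "country;month;unemployment_rate\n"
    "at;1993.01;4.5\n"
    "at;1993.02;4.6\n"
)

# `sep` overrides the default comma delimiter;
# `usecols` keeps only the listed columns.
sample = pd.read_csv(raw, sep=";", usecols=["country", "unemployment_rate"])
print(sample)
```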

Great! You've created a `pandas` `DataFrame`. We can look at our data by using the `.head()` method. By default, this shows the header (column names) and the first five rows. Passing an integer, $n$, to `.head()` returns that number of rows. To see the last $n$ rows, use `.tail()`.

In [3]:
unemployment.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9
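
To see how the integer argument changes what `.head()` and `.tail()` return, here's a small sketch on a made-up ten-row `DataFrame` (the name `toy` and its contents are ours, not part of the workshop data).

```python
import pandas as pd

# Hypothetical ten-row frame with the default integer index 0..9.
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)  # rows labeled 0, 1, 2
last_two = toy.tail(2)     # rows labeled 8, 9

print(first_three)
print(last_two)
```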


To find the number of rows, you can use the `len()` function. Alternatively, you can use the `.shape` attribute.

In [4]:
unemployment.shape

(20796, 5)

There are 20,796 rows and 5 columns.
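
Here's a quick sketch of how `len()` and `.shape` relate, using a small made-up `DataFrame` rather than the unemployment data.

```python
import pandas as pd

# Hypothetical three-row, two-column frame.
toy = pd.DataFrame({"country": ["at", "be", "bg"], "rate": [4.5, 8.1, 12.0]})

n_rows = len(toy)                  # number of rows only
rows_again, n_cols = toy.shape     # (rows, columns) tuple, unpacked

print(n_rows, rows_again, n_cols)
```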

You may have noticed that the `month` column also includes the year. It'd be nice to have two separate columns for this information. To do this, we'll need to know how to select a single column. We can either use bracket (`[]`) or dot notation.

In [5]:
unemployment['month'].head()

0    1993.01
1    1993.02
2    1993.03
3    1993.04
4    1993.05
Name: month, dtype: float64

In [6]:
unemployment.month.head()

0    1993.01
1    1993.02
2    1993.03
3    1993.04
4    1993.05
Name: month, dtype: float64

Bracket notation is preferable because a column name might inadvertently match the name of a `DataFrame` attribute or method, in which case dot notation returns the attribute or method rather than the column.
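
Here's a small sketch of that name clash. The column name `count` is chosen on purpose because `DataFrame` already has a `.count()` method; the data itself is made up.

```python
import pandas as pd

# Hypothetical frame with a column that shares its name with a DataFrame method.
toy = pd.DataFrame({"count": [3, 1, 2], "rate": [4.5, 4.6, 4.4]})

via_brackets = toy["count"]  # always the column, returned as a Series
via_dot = toy.count          # the DataFrame.count method, not the column

print(type(via_brackets).__name__)  # Series
print(callable(via_dot))            # True
```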

When selecting a single column in this way, what we have is a `pandas` `Series` object, which is "a one-dimensional array-like object containing an array of data (of any `NumPy` data type) and an associated array of data labels, called its _index_." A `DataFrame` also has an index. In our example, the index is a sequence of consecutive integers. You can find it in the left-most position of the output, without a column label.
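
To make the data/index split concrete, here's a minimal sketch on a made-up three-row `DataFrame`. It shows that a selected column carries the same index as the `DataFrame` it came from.

```python
import pandas as pd

# Hypothetical frame; pandas assigns the default integer index 0, 1, 2.
toy = pd.DataFrame({"month": [1993.01, 1993.02, 1993.03]})
months = toy["month"]

print(months.index)       # the labels
print(months.to_numpy())  # the underlying data as a NumPy array
print(months.loc[1])      # look up a value by its label
```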

As we saw above, `month` is type `float64`. To parse this data, we can either round the values down or convert them to strings and slice. For the year, let's round the values down. We'll use `numpy`'s `floor()` function. To create a new column, we'll also use bracket notation.

In [7]:
unemployment['year'] = np.floor(unemployment['month'])

In [8]:
unemployment.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate,year
0,at,nsa,1993.01,171000,4.5,1993
1,at,nsa,1993.02,175000,4.6,1993
2,at,nsa,1993.03,166000,4.4,1993
3,at,nsa,1993.04,157000,4.1,1993
4,at,nsa,1993.05,147000,3.9,1993
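
One thing to keep in mind: `np.floor()` returns floats, so the new column holds values like `1993.0`. If you'd rather store whole-number years, one option is to cast with `.astype(int)`. This is a sketch on made-up values, and it assumes the column has no missing values (the cast fails on `NaN`).

```python
import numpy as np
import pandas as pd

# Hypothetical values in the same year.month format as the workshop data.
toy = pd.DataFrame({"month": [1993.01, 1993.12, 1994.10]})

# Round down, then convert the float result to integers.
toy["year"] = np.floor(toy["month"]).astype(int)

print(toy)
```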


Now, to extract the month, we'll use the second option&mdash;slicing a string.

In [12]:
unemployment['month'] = unemployment['month'].apply(lambda x: ''.join(str(x).split('.')[-1:]))

The `.apply()` method applies a function to each element of the `month` column. In this case, the function converts the value to a string, splits it at the decimal point, and keeps the portion that comes _after_ (to the right of) the decimal. Because the slice `[-1:]` returns a one-element list, we pass that list to the string method `.join()` to get a plain string back.

What would have happened had there been _more than one_ decimal? We could have introduced an error. In this case, since we know it was type `float64`, it's not possible for there to be more than one decimal point. However, we'll see we might have another issue to deal with.

In [14]:
unemployment.head(10)

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate,year
0,at,nsa,1,171000,4.5,1993
1,at,nsa,2,175000,4.6,1993
2,at,nsa,3,166000,4.4,1993
3,at,nsa,4,157000,4.1,1993
4,at,nsa,5,147000,3.9,1993
5,at,nsa,6,134000,3.5,1993
6,at,nsa,7,128000,3.4,1993
7,at,nsa,8,130000,3.4,1993
8,at,nsa,9,132000,3.5,1993
9,at,nsa,1,141000,3.7,1993


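Do you notice anything odd? The month in the last row is `1`. It should be `10`. The float `1993.10` is the same number as `1993.1`, so converting it to a string drops the trailing zero.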
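
One possible way around this is to format each value with exactly two decimal places before splitting, so October keeps its trailing zero. This is a sketch on made-up values, not necessarily the approach the workshop takes later; the column name `month_number` is ours.

```python
import pandas as pd

# Hypothetical values, including October (1993.10), which str() would shorten.
toy = pd.DataFrame({"month": [1993.01, 1993.09, 1993.10, 1993.12]})

# Formatting with two decimals keeps the trailing zero: '1993.10'.
as_text = toy["month"].apply(lambda x: f"{x:.2f}")

# Split on the decimal point, take the last piece, and convert to an integer.
toy["month_number"] = as_text.str.split(".").str[-1].astype(int)

print(toy)
```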

The first thing we might want to do is get country names. These can be found in another file, `countries.csv`.

In [4]:
countries = pd.read_csv('data/european_unemployment/countries.csv')

In [5]:
countries.tail(3)

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
27,se,SE,eu,Sweden,Suède,Schweden,62.198468,14.896307
28,tr,TR,non-eu,Turkey,Turquie,Türkei,38.952942,35.439795
29,uk,GB,eu,United Kingdom,Royaume-Uni,Vereinigtes Königreich,54.315447,-2.232612


This file has lots of useful information. It even has the country names in three different languages.

Because the data we need is stored in two separate files, we'll want to merge the data somehow. Let's determine which column we can use to join this data. `country` looks like a good option. However, we don't need all of the columns in this `DataFrame`. Let's see how to select certain columns.

In [6]:
country_names = countries[['country', 'country_group', 'name_en']]

In [7]:
country_names.head(2)

Unnamed: 0,country,country_group,name_en
0,at,eu,Austria
1,be,eu,Belgium


We select individual columns using the bracket notation (`[]`), simply passing in a list of column names.
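
The number of brackets matters. Here's a small sketch, on a made-up two-row `DataFrame`, of what comes back when you pass a single name versus a list of names.

```python
import pandas as pd

# Hypothetical frame with the same column names used above.
toy = pd.DataFrame({
    "country": ["at", "be"],
    "country_group": ["eu", "eu"],
    "name_en": ["Austria", "Belgium"],
})

as_series = toy["name_en"]                 # a single name returns a Series
as_frame = toy[["name_en"]]                # a one-item list returns a DataFrame
two_columns = toy[["country", "name_en"]]  # columns come back in the order listed

print(type(as_series).__name__, type(as_frame).__name__)
```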

`pandas` includes an easy-to-use merge function. Let's see it in action.

In [8]:
unemployment = pd.merge(unemployment, country_names, on='country')

In [9]:
unemployment.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate,country_group,name_en
0,at,nsa,1993.01,171000,4.5,eu,Austria
1,at,nsa,1993.02,175000,4.6,eu,Austria
2,at,nsa,1993.03,166000,4.4,eu,Austria
3,at,nsa,1993.04,157000,4.1,eu,Austria
4,at,nsa,1993.05,147000,3.9,eu,Austria
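
By default, `pd.merge()` performs an inner join, so rows whose key has no match in the other `DataFrame` are dropped. The sketch below uses made-up data (including a fake country code, `xx`) to show how `how='left'` and `indicator=True` can help you spot unmatched rows.

```python
import pandas as pd

# Hypothetical data: 'xx' has no matching row in `names`.
rates = pd.DataFrame({
    "country": ["at", "be", "xx"],
    "unemployment_rate": [4.5, 8.1, 9.9],
})
names = pd.DataFrame({
    "country": ["at", "be"],
    "name_en": ["Austria", "Belgium"],
})

# Default how="inner": the 'xx' row is dropped.
inner = pd.merge(rates, names, on="country")

# how="left" keeps every row of `rates`; indicator=True adds a `_merge` column.
left = pd.merge(rates, names, on="country", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left))
print(unmatched)
```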


That makes things a little clearer. We now know that the abbreviation "at" corresponds to Austria. We might be curious to check which countries we have data for. Before we do this, we need to know how to select a single column. We can use either bracket (`[]`) or dot notation.

In [10]:
unemployment['name_en'].head()

0    Austria
1    Austria
2    Austria
3    Austria
4    Austria
Name: name_en, dtype: object

In [11]:
unemployment.name_en.head()

0    Austria
1    Austria
2    Austria
3    Austria
4    Austria
Name: name_en, dtype: object

Bracket notation is preferable because a column name might inadvertently match the name of a `DataFrame` attribute or method.

When selecting a single column in this way, what we have is a `pandas` `Series` object, which is "a one-dimensional array-like object containing an array of data (of any `NumPy` data type) and an associated array of data labels, called its _index_."

Now that we've selected the column, let's look at the unique values.

In [12]:
unemployment['name_en'].unique()

array(['Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic',
       'Germany (including  former GDR from 1991)', 'Denmark', 'Estonia',
       'Spain', 'Finland', 'France', 'Greece', 'Croatia', 'Hungary',
       'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta',
       'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Sweden',
       'Slovenia', 'Slovakia', 'Turkey', 'United Kingdom'], dtype=object)
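
If you only want the number of distinct values rather than the values themselves, `.nunique()` is a shortcut. A small sketch on a made-up `Series`:

```python
import pandas as pd

# Hypothetical column of repeated country names.
names = pd.Series(["Austria", "Austria", "Belgium", "Austria", "Belgium"])

distinct = names.unique()      # distinct values, in order of first appearance
n_distinct = names.nunique()   # how many distinct values there are

print(distinct, n_distinct)
```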

It might be more interesting to know how many observations we actually have. For this, we'll introduce `pandas`'s `.groupby()` method.

In [13]:
unemployment.groupby('name_en')['name_en'].count()

name_en
Austria                                       648
Belgium                                      1008
Bulgaria                                      576
Croatia                                       324
Cyprus                                        396
Czech Republic                                468
Denmark                                      1008
Estonia                                       387
Finland                                       828
France                                       1008
Germany (including  former GDR from 1991)     504
Greece                                        450
Hungary                                       576
Ireland                                      1008
Italy                                         924
Latvia                                        459
Lithuania                                     459
Luxembourg                                   1008
Malta                                         576
Netherlands                               

Let's explain what just happened. We start with our `DataFrame`. We tell `pandas` that we want to group the data by country name&mdash;that's what goes in the parentheses. Next, we need to tell it what column we'd like to perform the `.count()` operation on. In this case, it's country name again.
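
A detail worth knowing: `.count()` only counts non-missing values in the column you select, while `.size()` counts every row in each group. The sketch below uses made-up data with one missing rate to show the difference, along with `.value_counts()` as another way to get row counts per value.

```python
import pandas as pd

# Hypothetical data; Austria has one missing unemployment rate.
toy = pd.DataFrame({
    "name_en": ["Austria", "Austria", "Belgium", "Belgium", "Belgium"],
    "unemployment_rate": [4.5, None, 8.1, 8.3, 8.0],
})

# Non-missing values per group in the selected column.
by_count = toy.groupby("name_en")["unemployment_rate"].count()

# All rows per group, missing values included.
by_size = toy.groupby("name_en").size()

# Rows per distinct value, sorted from most to least frequent.
by_value_counts = toy["name_en"].value_counts()

print(by_count)
print(by_size)
print(by_value_counts)
```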

This will be useful for our analysis. The maximum number of observations for any country in this time period is 1,008. We'll note that certain countries, such as Turkey, have far fewer observations.

Before we start analyzing our data, let's make some additional changes. You may have noticed that the `month` column actually includes the year. It'd be nice to have two separate columns for this information. We'll do this in two ways. First, we'll create a new variable. To do this, we use the bracket notation once again.

In [20]:
unemployment['year'] = unemployment['month'] // 1
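
The `//` operator is floor division, so dividing by 1 rounds each value down. For these positive floats it gives the same result as `np.floor()`. A quick sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values in year.month format.
months = pd.Series([1993.01, 1993.10, 2010.12])

with_operator = months // 1     # floor division, element-wise
with_numpy = np.floor(months)   # NumPy's floor, element-wise

print(with_operator.equals(with_numpy))
```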

In [25]:
unemployment['month'].apply(lambda x: str(x).split('.')[-1:])

0        [01]
1        [02]
2        [03]
3        [04]
4        [05]
5        [06]
6        [07]
7        [08]
8        [09]
9         [1]
10       [11]
11       [12]
12       [01]
13       [02]
14       [03]
15       [04]
16       [05]
17       [06]
18       [07]
19       [08]
20       [09]
21        [1]
22       [11]
23       [12]
24       [01]
25       [02]
26       [03]
27       [04]
28       [05]
29       [06]
         ... 
20766    [05]
20767    [06]
20768    [07]
20769    [08]
20770    [09]
20771     [1]
20772    [11]
20773    [12]
20774    [01]
20775    [02]
20776    [03]
20777    [04]
20778    [05]
20779    [06]
20780    [07]
20781    [08]
20782    [09]
20783     [1]
20784    [11]
20785    [12]
20786    [01]
20787    [02]
20788    [03]
20789    [04]
20790    [05]
20791    [06]
20792    [07]
20793    [08]
20794    [09]
20795     [1]
Name: month, dtype: object
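
Each value in the output above is wrapped in square brackets because the slice `[-1:]` returns a one-element list. Indexing with `[-1]` returns the element itself. Here's a small sketch of the difference on made-up values:

```python
import pandas as pd

# Hypothetical values in year.month format.
months = pd.Series([1993.01, 1993.02])

# A slice keeps the list wrapper around the last piece.
as_lists = months.apply(lambda x: str(x).split(".")[-1:])

# An index returns the last piece as a plain string.
as_strings = months.apply(lambda x: str(x).split(".")[-1])

print(as_lists.tolist())
print(as_strings.tolist())
```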