# Introduction to Pandas

The `pandas` library allows us to import and manipulate data with much greater ease, but it can take a little getting used to if you've never used it before. This notebook gives you a brief introduction to the tools and commands you'll have at your disposal, so you can jump right into analysing your data.

## Useful Jupyter Notebook Hotkeys

Before we start, there are several hotkeys in Jupyter Notebooks that may come in useful:
* Press `Shift`+`Enter` to execute the currently selected cell.
* Press `Shift`+`CTRL`+`-` to split the current cell in two at the current position of the cursor.

You can enter `Command Mode` by pressing `Esc`. You should see the indicator box around your currently selected cell change from Green to Blue.
Once in `Command Mode`:
* Press `a` to insert a new cell above the currently selected cell.
* Press `b` to insert a new cell below the currently selected cell.
* Press `d` twice in a row to delete the currently selected cell.
* Press `m` to swap the current cell to Markdown format (used for displaying text, like this cell).
* Press `y` to swap the current cell back to Code format.


---

## Pandas

To start with, let's import the library that we're going to be using.

In [3]:
import pandas as pd # It is fairly standard to use 'pd' to represent 'pandas'

### DataFrames and Series

A table of data in `pandas` is known as a `DataFrame`, and the data is split up into rows and columns. Each column individually is called a `Series`, so when you see this word it's referring to a single column of the data. This is important, as some of the functions that we will use can only be used on individual `Series`, and not on the `DataFrame` as a whole.

A DataFrame can be created from a Python `dict` object, or even a list, and will look something like the following.

In [16]:
example_dict = {
    "apples": [1, 2, 3, 4, 5],
    "oranges": [2, 4, 6, 8, 10],
    "bananas": [1, 0, 3, 4, 2],
    "id": [0, 2, 4, 6, 8]
}

df_from_dict = pd.DataFrame(example_dict)

df_from_dict

Unnamed: 0,apples,oranges,bananas,id
0,1,2,1,0
1,2,4,0,2
2,3,6,3,4
3,4,8,4,6
4,5,10,2,8


Here we can see that, as mentioned above, the data is arranged in a table, with a value for each row in each column (if a row does not have a value for a given column, it will usually be filled in with a `NaN` value, which is the NumPy 'Not-a-Number' representation).
You might also notice that there is an additional column on the left that does not have a column title. This is the `index` of the DataFrame; it holds a label for every row (unique by default) and is used to refer to individual rows. If you have a column of data that has a unique value for every row, you can use this as the index instead, if you prefer.
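
As a small sketch of the two points above (the row data here is made up for illustration, and `df_from_list` is just an example name), building a DataFrame from a list of dictionaries shows both the automatically generated index and how a missing value is filled in with `NaN`:

```python
import pandas as pd

# Hypothetical rows; the second one has no "bananas" entry.
rows = [
    {"apples": 1, "bananas": 1},
    {"apples": 2},
    {"apples": 3, "bananas": 3},
]

df_from_list = pd.DataFrame(rows)

# The missing "bananas" value in the second row is filled with NaN.
print(df_from_list)

# The index is generated automatically, counting up from 0.
print(list(df_from_list.index))
```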

In [5]:
df_from_dict.set_index("id")

Unnamed: 0_level_0,apples,oranges,bananas
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,2,1
2,2,4,0
4,3,6,3
6,4,8,4
8,5,10,2


Here the `id` column is used as the index, which is indicated by the way that the column label is lower than the others.

#### IMPORTANT NOTE:

Most `pandas` functions do not modify the dataframe at all. Instead, they return a modified copy of the dataframe. Because of this, if you do not assign the output of these functions to a new variable (or the same variable), the changes that they created will be lost. For example, even though we've used the `.set_index()` function above, if we look at the `df_from_dict` variable again, we can see that it hasn't changed.

In [9]:
df_from_dict

Unnamed: 0,apples,oranges,bananas,id
0,1,2,1,0
1,2,4,0,2
2,3,6,3,4
3,4,8,4,6
4,5,10,2,8


Not modifying DataFrames in place works well in IPython or Jupyter Notebooks, where cells can be run out of order and variables can end up in a state you did not expect. If you really want to change the DataFrame without assigning the result to a variable, many functions accept the additional argument `inplace=True`, as below. Keep in mind, though, that running the same cell twice when using this method can have unforeseen consequences.

In [17]:
df_from_dict.set_index("id", inplace=True)

In [18]:
df_from_dict

Unnamed: 0_level_0,apples,oranges,bananas
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,2,1
2,2,4,0
4,3,6,3
6,4,8,4
8,5,10,2


In [20]:
df_from_dict.set_index("id", inplace=True)
# Here we run the previous command again, but because 'id' is already the index, it fails.

KeyError: "None of ['id'] are in the columns"
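
One way to avoid this kind of error is to assign the result to a new variable rather than using `inplace=True`, so that the cell can be rerun safely. A minimal sketch, using a smaller made-up table (the names `fruit` and `fruit_by_id` are only for illustration):

```python
import pandas as pd

fruit = pd.DataFrame({"apples": [1, 2, 3], "id": [0, 2, 4]})

# set_index returns a new DataFrame; 'fruit' itself is left untouched,
# so running this line any number of times gives the same result.
fruit_by_id = fruit.set_index("id")

print(list(fruit.columns))        # 'id' is still an ordinary column here
print(list(fruit_by_id.columns))  # 'id' has moved into the index here
```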

## Reading a datafile into a pandas DataFrame

Most of the datafiles that you will be using today will be in the form of `.csv` files, which are fairly simple to read in, as `pandas` has a built-in function to deal with them. For any other file types, there should be additional help in the introductory booklet for that dataset, or you can see if `pandas` has a built-in function to import them on this page: https://pandas.pydata.org/pandas-docs/stable/reference/io.html

In [13]:
df = pd.read_csv("data/HoC-GE2019-results-by-candidate.csv") # 'df' is often used to represent 'dataframe'
df

Unnamed: 0,ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change
0,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Labour,Lab,Stephen,Kinnock,Male,Yes,Yes,17008,0.538262,-0.142933
1,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Conservative,Con,Charlotte,Lang,Female,No,No,6518,0.206279,0.028901
2,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Brexit Party,Brexit,Glenda,Davies,Female,No,No,3108,0.098361,
3,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Plaid Cymru,PC,Nigel,Hunt,Male,No,No,2711,0.085797,0.002804
4,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Liberal Democrat,LD,Sheila,Kingston-Jones,Female,No,No,1072,0.033926,0.015921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3315,E14001061,E12000003,York Central,North Yorkshire,Yorkshire and The Humber,England,Borough,Social Democratic Party,SDP,Andrew,Dunn,Male,No,No,134,0.002707,
3316,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Conservative,Con,Julian,Sturdy,Male,Yes,Yes,27324,0.493685,-0.017503
3317,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Labour,Lab,Anna,Perrett,Female,No,No,17339,0.313278,-0.053570
3318,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Liberal Democrat,LD,Keith,Aspden,Male,No,No,9992,0.180534,0.077620
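
If you would like to try `pd.read_csv` without a file on disk, it also accepts file-like objects. The sketch below uses a tiny made-up CSV string (not the election data) wrapped in `io.StringIO`; note how the empty field is read in as `NaN`:

```python
import io
import pandas as pd

# A small, invented CSV; the last row has an empty 'change' field.
csv_text = """party,votes,change
A,100,0.5
B,50,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```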


## Exploring the data

As you can see, Jupyter cannot show all of the data in the display box, and even if it did, it would be fairly cumbersome to try and look at all the data at once. The rows are not all displayed, and often in datasets with many columns, the columns themselves will not all be displayed in this format either.

It's often best to get to know what your data is like before you try and analyse it, so we're going to have a bit of a closer look at smaller elements of the DataFrame. Going forward, I will be representing commands as `DF.function()` or `DF.attribute` for functions and attributes that relate to DataFrames, and `S.function()` for functions that relate only to Series.

#### DF.head(), DF.tail()

To get a quick look at some of the data in a controlled manner, you can use the `DF.head()` or `DF.tail()` functions to have a look at the top or bottom rows of the data. By default, each will only show 5 rows, but you can customise that number by providing an integer argument like so:

In [17]:
df.head(10)

Unnamed: 0,ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change
0,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Labour,Lab,Stephen,Kinnock,Male,Yes,Yes,17008,0.538262,-0.142933
1,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Conservative,Con,Charlotte,Lang,Female,No,No,6518,0.206279,0.028901
2,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Brexit Party,Brexit,Glenda,Davies,Female,No,No,3108,0.098361,
3,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Plaid Cymru,PC,Nigel,Hunt,Male,No,No,2711,0.085797,0.002804
4,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Liberal Democrat,LD,Sheila,Kingston-Jones,Female,No,No,1072,0.033926,0.015921
5,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Independent,Ind,Captain,Beany,Male,No,No,731,0.023134,
6,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Green,Green,Giorgia,Finney,Female,No,No,450,0.014241,
7,W07000058,W92000004,Aberconwy,Clwyd,Wales,Wales,County,Conservative,Con,Robin,Millar,Male,No,No,14687,0.460913,0.014972
8,W07000058,W92000004,Aberconwy,Clwyd,Wales,Wales,County,Labour,Lab,Emily,Owen,Female,No,No,12653,0.397081,-0.029108
9,W07000058,W92000004,Aberconwy,Clwyd,Wales,Wales,County,Plaid Cymru,PC,Lisa,Goodier,Female,No,No,2704,0.084858,-0.013742
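
`DF.tail()` works the same way from the other end. A quick sketch on a made-up DataFrame of 20 rows (the name `numbers` is only for illustration):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(20)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 18, 19

print(first_three)
print(last_two)
```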


#### DF.shape

Although Jupyter Notebook also states it when you view the entire dataset, you can access the number of rows and columns in a DataFrame with the `DF.shape` attribute.

In [19]:
num_rows = df.shape[0]
num_cols = df.shape[1]
print("The DataFrame has {} rows and {} columns.".format(num_rows, num_cols))

The DataFrame has 3320 rows and 17 columns.
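
Because `DF.shape` is a tuple of `(rows, columns)`, both values can also be unpacked in one step. A small sketch on a made-up DataFrame (`small` is a hypothetical name):

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# DF.shape is a (rows, columns) tuple, so it can be unpacked directly.
num_rows, num_cols = small.shape
print(f"The DataFrame has {num_rows} rows and {num_cols} columns.")
```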


### Looking at the Columns

#### DF.columns

If your dataset has too many columns to all be viewed at once in the Jupyter display, you can view all of the column names by checking the `DF.columns` attribute.

In [20]:
df.columns

Index(['ons_id', 'ons_region_id', 'constituency_name', 'county_name',
       'region_name', 'country_name', 'constituency_type', 'party_name',
       'party_abbreviation', 'firstname', 'surname', 'gender', 'sitting_mp',
       'former_mp', 'votes', 'share', 'change'],
      dtype='object')

#### DF.info()

We can get a closer look at the columns by calling the `DF.info()` function. This will give us additional information, such as the type of data in each column and how many valid values each column contains. Here we can see that while there are 3320 rows in the entire dataset, the `change` column only has 2541 valid (non-null) entries.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3320 entries, 0 to 3319
Data columns (total 17 columns):
ons_id                3320 non-null object
ons_region_id         3320 non-null object
constituency_name     3320 non-null object
county_name           3320 non-null object
region_name           3320 non-null object
country_name          3320 non-null object
constituency_type     3320 non-null object
party_name            3320 non-null object
party_abbreviation    3320 non-null object
firstname             3320 non-null object
surname               3320 non-null object
gender                3320 non-null object
sitting_mp            3320 non-null object
former_mp             3320 non-null object
votes                 3320 non-null int64
share                 3320 non-null float64
change                2541 non-null float64
dtypes: float64(2), int64(1), object(14)
memory usage: 441.1+ KB


#### DF.isnull(), DF.sum()

Instead of inferring how many null values there are from the table above, we can use a couple of functions to calculate them. 

The `DF.isnull()` function checks if each value in the DataFrame is a null value, and returns a DataFrame of the same dimensions with a corresponding `True` or `False` in each cell. `True` if the value is null, and `False` otherwise.

This can be combined with the `DF.sum()` function, which will add up the values in each column (if possible) and produce a single Series with a value for each column in the DataFrame. It works best on numerical values, which Boolean values also count as in this case (`True` is 1 and `False` is 0). It also works on text values, where it joins the strings together, so the result is rarely useful unless you've explicitly planned for that.

In [29]:
df.isnull()

Unnamed: 0,ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3315,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
3316,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3317,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3318,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [30]:
df.isnull().sum()

ons_id                  0
ons_region_id           0
constituency_name       0
county_name             0
region_name             0
country_name            0
constituency_type       0
party_name              0
party_abbreviation      0
firstname               0
surname                 0
gender                  0
sitting_mp              0
former_mp               0
votes                   0
share                   0
change                779
dtype: int64
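
The same idea works on a single Series, and taking the mean of the Boolean values instead of the sum gives the fraction of null entries. A minimal sketch with invented numbers (the `changes` DataFrame is not the election data):

```python
import numpy as np
import pandas as pd

changes = pd.DataFrame({
    "votes": [10, 20, 30, 40],
    "change": [0.1, np.nan, -0.2, np.nan],
})

# One null count per column of the DataFrame.
null_counts = changes.isnull().sum()

# Fraction of null values in a single column (a Series).
null_fraction = changes["change"].isnull().mean()

print(null_counts)
print(null_fraction)
```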

#### DF.describe(), S.describe()

Another useful tool when looking at numerical columns is the `DF.describe()` function. This will immediately give you a series of simple statistical values for each of the numerical columns in the DataFrame. Below you can see that it has excluded all columns that are not of a numerical type.

In [31]:
df.describe()

Unnamed: 0,votes,share,change
count,3320.0,3320.0,2541.0
mean,9642.804217,0.195783,-0.002188
std,10478.619793,0.206009,0.063352
min,5.0,0.000104,-0.249037
25%,1408.25,0.029588,-0.044088
50%,4381.5,0.090327,0.010216
75%,16125.0,0.353926,0.032245
max,47028.0,0.84681,0.358822
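
`S.describe()` gives the same kind of summary for one column. On a text column it reports counts and the most frequent value rather than a mean. A sketch on a small made-up table (`people` is a hypothetical name):

```python
import pandas as pd

people = pd.DataFrame({
    "votes": [100, 200, 300],
    "gender": ["Male", "Female", "Female"],
})

# Numerical column: count, mean, std, min, quartiles, max.
votes_summary = people["votes"].describe()

# Text column: count, unique, top (most common value), freq.
gender_summary = people["gender"].describe()

print(votes_summary)
print(gender_summary)
```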


### Renaming columns

When importing a datafile into a DataFrame, you may want to change the labels of the columns from the ones provided. This can be especially true when importing multiple datafiles with very similar data, but for different time periods. Renaming columns can easily be done with the `DF.rename` function (which also has other utilities we won't cover here). The new names are provided as a dictionary mapping each old name to its new name, so you can rename only the columns you want. To replace every column name at once, you can instead assign a list of names to the `DF.columns` attribute.

In [32]:
df.rename(columns={"votes": "vote_count", "share": "vote_percentage_share", "change": "change_in_vote_percentage_share_from_2017"})

Unnamed: 0,ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,vote_count,vote_percentage_share,change_in_vote_percentage_share_from_2017
0,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Labour,Lab,Stephen,Kinnock,Male,Yes,Yes,17008,0.538262,-0.142933
1,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Conservative,Con,Charlotte,Lang,Female,No,No,6518,0.206279,0.028901
2,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Brexit Party,Brexit,Glenda,Davies,Female,No,No,3108,0.098361,
3,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Plaid Cymru,PC,Nigel,Hunt,Male,No,No,2711,0.085797,0.002804
4,W07000049,W92000004,Aberavon,West Glamorgan,Wales,Wales,County,Liberal Democrat,LD,Sheila,Kingston-Jones,Female,No,No,1072,0.033926,0.015921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3315,E14001061,E12000003,York Central,North Yorkshire,Yorkshire and The Humber,England,Borough,Social Democratic Party,SDP,Andrew,Dunn,Male,No,No,134,0.002707,
3316,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Conservative,Con,Julian,Sturdy,Male,Yes,Yes,27324,0.493685,-0.017503
3317,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Labour,Lab,Anna,Perrett,Female,No,No,17339,0.313278,-0.053570
3318,E14001062,E12000003,York Outer,North Yorkshire,Yorkshire and The Humber,England,County,Liberal Democrat,LD,Keith,Aspden,Male,No,No,9992,0.180534,0.077620
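
Like most `pandas` functions, `DF.rename` returns a modified copy, so the result needs to be assigned to a variable to be kept. A short sketch on a made-up table (the names `results`, `renamed` and `relabelled` are only for illustration), which also shows replacing every column name by assigning a list to `DF.columns`:

```python
import pandas as pd

results = pd.DataFrame({"votes": [10, 20], "share": [0.4, 0.6]})

# Rename selected columns with a dictionary and keep the result.
renamed = results.rename(columns={"votes": "vote_count"})

# Replace every column name at once by assigning a list to DF.columns.
# This changes the DataFrame in place, so work on a copy here.
relabelled = results.copy()
relabelled.columns = ["vote_count", "vote_share"]

print(list(results.columns))     # the original is unchanged
print(list(renamed.columns))
print(list(relabelled.columns))
```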


## Filter The DataFrame

It is often useful to filter the DataFrame, either by selecting only certain rows or only certain columns.

### Selecting a Column

A single column can be accessed with square brackets, much like you would access a certain value from a dictionary.

In [33]:
df['gender']

0         Male
1       Female
2       Female
3         Male
4       Female
         ...  
3315      Male
3316      Male
3317    Female
3318      Male
3319      Male
Name: gender, Length: 3320, dtype: object

#### Selecting a subset of multiple columns

As well as selecting a single column, multiple columns can also be selected by passing a list of column names as the argument. Note that because this selects more than one column, the resulting output is still a DataFrame, not a Series.

In [34]:
df[ ['firstname', 'surname'] ]

Unnamed: 0,firstname,surname
0,Stephen,Kinnock
1,Charlotte,Lang
2,Glenda,Davies
3,Nigel,Hunt
4,Sheila,Kingston-Jones
...,...,...
3315,Andrew,Dunn
3316,Julian,Sturdy
3317,Anna,Perrett
3318,Keith,Aspden
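
The difference between the two forms of selection can be checked directly. In this sketch (made-up data, hypothetical variable names), a single column name gives a Series, while a list of names gives a DataFrame, even when the list contains only one name:

```python
import pandas as pd

names = pd.DataFrame({
    "firstname": ["Ann", "Bob"],
    "surname": ["Lee", "Ray"],
    "votes": [5, 7],
})

one_column = names["surname"]                   # a Series
two_columns = names[["firstname", "surname"]]   # a DataFrame
one_column_df = names[["surname"]]              # still a DataFrame

print(type(one_column))
print(type(two_columns))
print(type(one_column_df))
```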


#### Selecting a row

Rows are selected using one of the DataFrame's indexers, either `DF.loc[]` or `DF.iloc[]`.

`DF.loc[]` selects rows by their index label (it also accepts a Boolean condition), while `DF.iloc[]` selects rows by their position within the DataFrame (so regardless of the index, `df.iloc[0]` will always select the first row).

In [46]:
df.loc[0]

ons_id                     W07000049
ons_region_id              W92000004
constituency_name           Aberavon
county_name           West Glamorgan
region_name                    Wales
country_name                   Wales
constituency_type             County
party_name                    Labour
party_abbreviation               Lab
firstname                    Stephen
surname                      Kinnock
gender                          Male
sitting_mp                       Yes
former_mp                        Yes
votes                          17008
share                       0.538262
change                     -0.142933
Name: 0, dtype: object
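
With the default index, `df.loc[0]` and `df.iloc[0]` return the same row, so the difference is easier to see when the index labels do not match the row positions. A minimal sketch with a made-up table and index:

```python
import pandas as pd

fruit = pd.DataFrame(
    {"apples": [1, 2, 3]},
    index=[0, 2, 4],  # made-up index labels that differ from positions
)

by_label = fruit.loc[2]      # row whose index label is 2 (the second row)
by_position = fruit.iloc[2]  # third row, whatever its label is

# .loc also accepts a Boolean condition to select matching rows.
many_apples = fruit.loc[fruit["apples"] > 1]

print(by_label)
print(by_position)
print(many_apples)
```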

### Unique Values

Often in your data exploration, it can be useful to see what unique values exist in a column. For example, we might want to see whether the data in the `gender` column is strictly binary, or whether any other options are included. This can be done with the `S.unique()` function.
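
A minimal sketch of `S.unique()` on a made-up Series (the values below are invented for illustration and are not taken from the election data):

```python
import pandas as pd

genders = pd.Series(["Male", "Female", "Female", "Male", "Non-binary"])

# Returns each distinct value once, in order of first appearance.
unique_values = genders.unique()
print(unique_values)
```

On the real dataset, the equivalent call would be `df['gender'].unique()`.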