# Pandas basics

[Pandas](https://pandas.pydata.org) is widely used for data analysis in Python and is built on top of NumPy (which we saw in [notebook 1](01-numpy-basics.ipynb)). Pandas provides data structures that aid analytics, as well as tools to manipulate and restructure data and to perform aggregations and queries on it.


This notebook will give you a quick introduction to pandas. Like the earlier notebooks in this series, some of the cells are in question-and-answer format, whilst some contain working code. Make sure you think about what these cells are doing as you go through the notebook.

Let's go! 

## Importing Pandas

It is common practice to import `pandas` with the alias `pd`, thereby saving yourself four keystrokes every time you want to type the module name.

In [None]:
import pandas as pd

## Creating data structures

There are two main data structures in pandas: `Series` and `DataFrame`.

### Series

A pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) holds one-dimensional data, together with an index that labels each entry.

There are a few ways we can make a `Series`, but perhaps the simplest is from a list of data:

In [None]:
dat = pd.Series(['alpha', 'gamma', 'delta', 'omega'])
dat

You can see that, as well as the data, the `Series` has indices, in this case the integers 0 to 3.

We can use these indices to select specific entries from the `Series`:

In [None]:
dat[3]

How would you select the first 3 entries in `dat`?

In [None]:
## Solution:

dat[0:3]

If we decide we want to change the indices of our `Series` we can do so as follows: 

In [None]:
dat.index = ['a', 'g', 'd', 'o']
dat
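Once the index holds labels rather than the default integers, it helps to be explicit about whether you are selecting by label or by position. The following is a minimal sketch, assuming the same data and letter labels as above, using the `.loc` (label) and `.iloc` (position) accessors:

```python
import pandas as pd

# Recreate the Series from above with letter labels as its index.
dat = pd.Series(['alpha', 'gamma', 'delta', 'omega'],
                index=['a', 'g', 'd', 'o'])

by_label = dat.loc['g']          # select by index label
by_position = dat.iloc[1]        # select by integer position
label_slice = dat.loc['a':'d']   # label slices include the end label

print(by_label, by_position)
print(label_slice)
```

Note that, unlike positional slices, a label slice includes its end point.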

### Initialising a Series with the desired indexes and names

Instead of changing indexes and names once the `Series` has been initialised, you can do it all in one fell swoop.

Using `help(pd.Series)` for guidance, make a new `Series` with the data below and the correct indices and names in one command:

In [None]:
## Amend this code so the series
## has the correct indices and names

dat2 = pd.Series(['epsilon', 'lambda', 'mu'])

## Solution: 
dat2 = pd.Series(['epsilon', 'lambda', 'mu'], index = ['e','l','m'],
                 name = 'greek letters')
dat2

We can also give the index a name to aid readability of the `Series`:

In [None]:
dat.index.name = 'latin'
dat

### Making a Series from a dictionary

Another way to make a `Series` is from a Python dict object:

In [None]:
views = {'Cats': 798, 'Star Wars': 1938, 'Jumanji': 1802}
dat3 = pd.Series(views)
dat3
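The dictionary keys become the index of the `Series`. As a small sketch of what that allows, you can also pass an `index` argument to pick and order the keys you want; a label that is not in the dictionary gets `NaN`. The label `'Frozen'` below is a made-up example and is not part of the data in this notebook.

```python
import pandas as pd

views = {'Cats': 798, 'Star Wars': 1938, 'Jumanji': 1802}

# Dictionary keys become index labels, so we can look values up by key.
dat3 = pd.Series(views)
cats_views = dat3['Cats']

# Passing index= selects and orders keys; 'Frozen' (hypothetical) is missing,
# so its entry is NaN and the values are stored as floats.
subset = pd.Series(views, index=['Jumanji', 'Cats', 'Frozen'])

print(cats_views)
print(subset)
```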

### NumPy operations on Series

[Notebook 1](01-numpy-basics.ipynb) and [notebook 2](02-array-programming.ipynb) introduced us to NumPy, and we saw how to use some NumPy operations on arrays.

You can also apply NumPy-style operations to pandas objects:

In [None]:
dat3*2 
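Beyond simple arithmetic, NumPy functions and boolean masks also work on a `Series`. This is a brief sketch, assuming the same `dat3` data as above; the threshold of 1000 is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

dat3 = pd.Series({'Cats': 798, 'Star Wars': 1938, 'Jumanji': 1802})

# A NumPy function applied to a Series returns a Series with the same index.
log_views = np.log10(dat3)

# Comparisons produce a boolean Series that can be used as a mask.
popular = dat3[dat3 > 1000]

print(log_views)
print(popular)
```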

When you apply arithmetic to two `Series`, pandas first aligns them by index label and then applies the operation to the matching entries.

In [None]:
dat4 = pd.Series({'Little Women':908, 'Star Wars': 1145, 'Cats':102})

dat3 + dat4

Note that if an index does not appear in all of the `Series` involved in the arithmetic, the resulting entry for that index will be `NaN`.
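If you would rather treat a missing entry as zero than end up with `NaN`, one option is the `.add` method with its `fill_value` parameter. A minimal sketch using the same two `Series` as above:

```python
import pandas as pd

dat3 = pd.Series({'Cats': 798, 'Star Wars': 1938, 'Jumanji': 1802})
dat4 = pd.Series({'Little Women': 908, 'Star Wars': 1145, 'Cats': 102})

# Plain addition: labels present in only one Series give NaN.
plain_sum = dat3 + dat4

# .add with fill_value=0 treats a label missing from one Series as 0.
total = dat3.add(dat4, fill_value=0)

print(plain_sum)
print(total)
```

Whether zero is a sensible stand-in for missing data depends on what the numbers mean, so treat this as one choice rather than the default.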

## DataFrames

As well as the one-dimensional, indexed `Series` object, pandas contains a `DataFrame` object, which is more like a classic two-dimensional spreadsheet, with both rows and columns.

One easy way to make a `DataFrame` is from a Python dictionary of lists, where each key becomes a column name:

In [None]:
import numpy as np

data = {
    'film': ['Little Women', 'Cats', 'Jumanji', 'Star Wars'],
    'mean views': [908, 102, 3604, 1145],
    'date': [np.datetime64('2019-12-25'), np.datetime64('2019-12-20'),
             np.datetime64('2019-12-04'), np.datetime64('2019-12-18')],
    'run time': [135, 110, 123, 142],
}

In [None]:
movie_df = pd.DataFrame(data) 
movie_df

We used NumPy's `datetime64` type to store the dates in our `DataFrame`. Remember that you can find out more about a function or object type using the `help()` command, for example `help(np.datetime64)`.
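To check how pandas has stored each column, you can look at the `dtypes` attribute. The sketch below uses a cut-down, two-row version of the film data and builds the dates with `pd.to_datetime`, which is an alternative to `np.datetime64` and not what the cell above uses.

```python
import pandas as pd

# A smaller version of the film data; dates are parsed from strings here.
df = pd.DataFrame({
    'film': ['Little Women', 'Cats'],
    'date': pd.to_datetime(['2019-12-25', '2019-12-20']),
    'run time': [135, 110],
})

# One dtype per column: text, datetime and integer.
print(df.dtypes)

# Datetime columns expose date helpers through the .dt accessor.
weekday = df['date'].dt.day_name()
print(weekday)
```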

As we saw with `Series`, rows of the `DataFrame` are indexed with integers. Using the `dir()` and `help()` functions, see if you can change the indexes of the `DataFrame` to the letters `a` to `d`.

In [None]:
#### Solution: 

movie_df.index=['a', 'b', 'c', 'd']
movie_df

We can retrieve a column of the `DataFrame` by its name using square brackets, and a row by its index label using `.loc`:

In [None]:
movie_df['mean views']

In [None]:
movie_df.loc['b']
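Row and column selection can also be combined. The following sketch, on a reduced copy of the film data with the letter index from above, shows `.loc` with both a row and a column label, and `.iloc` for purely positional access:

```python
import pandas as pd

movie_df = pd.DataFrame(
    {'film': ['Little Women', 'Cats', 'Jumanji', 'Star Wars'],
     'mean views': [908, 102, 3604, 1145],
     'run time': [135, 110, 123, 142]},
    index=['a', 'b', 'c', 'd'])

# A single cell: row label first, then column label.
film_b = movie_df.loc['b', 'film']

# Several rows and columns at once: pass lists of labels.
sub_df = movie_df.loc[['a', 'c'], ['film', 'run time']]

# Positional access: first row, whatever its label is.
first_row = movie_df.iloc[0]

print(film_b)
print(sub_df)
print(first_row)
```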

We can also add columns to the `DataFrame`:

In [None]:
movie_df['theaters']=np.array([1093,987,4038,4874])
movie_df

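`DataFrame.append` was removed in pandas 2.0, so we use `pd.concat` to add rows to the `DataFrame`. Columns are aligned by name, so we first turn the new row into a one-row `DataFrame`.

In [None]:
yday = pd.Series(['yesterday', '795', np.datetime64('2019-06-28'), '116', 'Lily James'])
yday.index = ['film', 'mean views', 'date', 'run time', 'star']

In [None]:
# to_frame().T turns the Series into a one-row DataFrame whose columns are the Series labels
pd.concat([movie_df, yday.to_frame().T], ignore_index=True)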

What do you notice has changed about the `DataFrame`?

The `.drop` function allows us to delete columns:

In [None]:
movie_df.drop(columns=['date'])
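It is worth knowing that `.drop` returns a new `DataFrame` and leaves the original untouched, so you need to assign the result if you want to keep it. A short sketch on a reduced copy of the film data:

```python
import numpy as np
import pandas as pd

movie_df = pd.DataFrame({
    'film': ['Little Women', 'Cats'],
    'date': [np.datetime64('2019-12-25'), np.datetime64('2019-12-20')],
    'run time': [135, 110],
})

# drop returns a new DataFrame without the 'date' column.
trimmed = movie_df.drop(columns=['date'])

print(list(trimmed.columns))   # 'date' is gone here
print(list(movie_df.columns))  # the original still has it
```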

Can you use the `help` function to find out how to delete row `b` from the DataFrame, using the `.drop` function?

In [None]:
## Solution 

#help(pd.DataFrame.drop)
#movie_df.drop('b')

As well as adding and deleting data from the `DataFrame` it is important to be able to transform and apply functions to the data in the `DataFrame`. We can sort the `DataFrame` by one (or more) of its columns using the `.sort_values` function:

In [None]:
movie_df.sort_values(by = 'mean views')
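As a sketch of the "one (or more)" part, `sort_values` also accepts an `ascending` flag and a list of column names. The example below uses a reduced copy of the film data:

```python
import pandas as pd

movie_df = pd.DataFrame({
    'film': ['Little Women', 'Cats', 'Jumanji', 'Star Wars'],
    'mean views': [908, 102, 3604, 1145],
    'run time': [135, 110, 123, 142],
})

# Largest 'mean views' first.
most_viewed = movie_df.sort_values(by='mean views', ascending=False)

# Sort by 'run time', using 'mean views' to break any ties.
by_two = movie_df.sort_values(by=['run time', 'mean views'])

print(most_viewed)
print(by_two)
```

Like `.drop`, `sort_values` returns a new `DataFrame` rather than reordering the original.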

To illustrate other functions we can apply to the `DataFrame` let's first make a `DataFrame` that just contains the columns `mean views`, `run time` and `theaters`:

In [None]:
movie_df2 = movie_df[['mean views', 'run time', 'theaters']]
movie_df2

We can compute the mean value down columns using the `.mean` function:

In [None]:
movie_df2.mean()

Using the `axis` parameter we are able to apply the same function across the rows: 

In [None]:
movie_df2.mean(axis=1)
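The `axis` argument can be confusing at first, so here is a small self-contained sketch with the same three numeric columns and the letter index from above. With the default `axis=0` you get one value per column; with `axis=1` you get one value per row.

```python
import pandas as pd

movie_df2 = pd.DataFrame(
    {'mean views': [908, 102, 3604, 1145],
     'run time': [135, 110, 123, 142],
     'theaters': [1093, 987, 4038, 4874]},
    index=['a', 'b', 'c', 'd'])

# axis=0 (the default): one mean per column, indexed by column name.
col_means = movie_df2.mean()

# axis=1: one mean per row, indexed by row label.
row_means = movie_df2.mean(axis=1)

print(col_means)
print(row_means)
```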

Take a look at `help(pd.DataFrame)` to see what other column summaries you can compute.

We can also apply functions we define to a `DataFrame`: 

In [None]:
movie_df2.apply(lambda x: x**2)  # ** is the power operator; ^ is bitwise XOR in Python
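By default, `.apply` passes each column to your function as a `Series`, so the function can return either a transformed column or a single summary value. A minimal sketch, assuming the same numeric columns as above:

```python
import pandas as pd

movie_df2 = pd.DataFrame(
    {'mean views': [908, 102, 3604, 1145],
     'run time': [135, 110, 123, 142],
     'theaters': [1093, 987, 4038, 4874]},
    index=['a', 'b', 'c', 'd'])

# The function receives a whole column and returns one number,
# so the result is a Series with one entry per column.
ranges = movie_df2.apply(lambda col: col.max() - col.min())

# For simple element-wise maths, operating on the DataFrame directly
# is usually simpler than going through apply.
squared = movie_df2 ** 2

print(ranges)
print(squared)
```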