# A quick note:

These notes are written as a Jupyter Notebook, exploiting the fact that you can use different languages in different cells. The notes themselves are written in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). In fact, if you double-click on any of these Markdown cells you will see the underlying code (re-running the cell, e.g. using the Run button, returns it to the more pleasant compiled version).

Other cells are written directly in Python (the toolbar of the Jupyter Notebook above shows which language each cell is written in!). For these, you should Run the cell to see its output. None of this should be new to you, but it is worth repeating since this is the first part of Year 2.

Also, as a reminder: while going through this document, you should run every cell that contains Python code, because some of the definitions are used in subsequent cells. If you skip them, Python will start raising error messages!

In any case, it is good practice sometimes to experiment to see if you have understood correctly. To do that, clear the output cell (selecting Cell-> Current Outputs -> Clear from the toolbar), change something in the base code and re-Run the cell to see how the output changes! 

================================================================================================================


# `pandas`

> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

We will start this course by introducing a new Python package (a collection of pre-defined functions and objects) called `pandas`. The description above is from its official website, which can be found [here](https://pandas.pydata.org/). <br/>
`pandas` contains a wide variety of highly-optimised tools to facilitate the manipulation of big databases, as well as providing a myriad of different input/output management tools.

Before we can use a package we must install it and then import it. To install `pandas`, we run one of the following commands in the Jupyter Notebook. Which command works depends on our package manager: you should have conda installed, but use either of the other two if you get an error.

```bash
# Run only ONE of the following, depending on your package manager
conda install pandas
pip install pandas
python -m pip install pandas
```

Following a successful installation, we can import `pandas` using:

```python
import pandas
```

====================================================================================================================================

**Note: With conda, packages can generally be installed using the syntax**

```bash
conda install name_of_package
```

====================================================================================================================================

In most of our code we will import `numpy`, `pandas` and `matplotlib.pyplot` by default, so don't be surprised by this. We will also use their commonly accepted abbreviated forms via the `as` keyword. The first few lines of our code will therefore typically look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

In general, it is better **NOT TO use abbreviated names** for packages **EXCEPT** for commonly accepted abbreviated names of packages. This is because unexpected names can make it difficult for other users to read your code. Please bear this in mind throughout the course, and especially when you create your own packages or modules.

In [3]:
# import package
import pandas as pd

## `pd.DataFrame`

To begin, we will present the main class that is used in `pandas` and that is the `DataFrame`. Since `DataFrame` is a class, we must call its constructor - `__init__(self, ...)` - by using parentheses - `()` - after we type its name. Let's construct a `DataFrame` now:

In [4]:
pd.DataFrame()

As you can see, not much has happened. This is because we have just created an empty `DataFrame`. There are many ways of initialising `DataFrame` objects, and even more ways to use them. In this part of the course we will go through some examples and also point to external examples that show how `DataFrame` objects can be used.
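
To check that the object really is empty, a small sketch along the following lines may help. The variable name `empty_frame` is ours and is used only for illustration.

```python
import pandas as pd

# Construct a DataFrame with no data (the name `empty_frame` is illustrative).
empty_frame = pd.DataFrame()

print(type(empty_frame))   # the class of the object we created
print(empty_frame.empty)   # True when the DataFrame holds no data
print(empty_frame.shape)   # (number of rows, number of columns)
```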

One of the main ways of initialising `DataFrame` objects is by passing either a `dict` or `list` to them as the main argument. Here are a few examples:

> In these examples you may recognise a new way of formatting strings called f-strings. They take the form `f"Text {variable}"` and insert the value of a variable into a string. Find more information about f-strings [here](https://www.python.org/dev/peps/pep-0498/)

> Note that for improving the readability of the printed strings, we use `'\n'` characters which are represented as new lines

In [5]:
data_dict = {
    'time' : [0.0, 1.0, 2.0, 3.0, 4.0],
    'temperature' : [37.0, 35.9, 36.0, 37.3, 35.6]
}

print(f"Dictionary:\n{data_dict}\n")
print(f"Data from Dictionary:\n{pd.DataFrame(data_dict)}\n")

data_list = [
    [0.0, 37.0],
    [1.0, 35.9],
    [2.0, 36.0],
    [3.0, 37.3],
    [4.0, 35.6],
]

print(f"List:\n{data_list}\n")
print(f"Data from List:\n{pd.DataFrame(data_list)}\n")

Dictionary:
{'time': [0.0, 1.0, 2.0, 3.0, 4.0], 'temperature': [37.0, 35.9, 36.0, 37.3, 35.6]}

Data from Dictionary:
   time  temperature
0   0.0         37.0
1   1.0         35.9
2   2.0         36.0
3   3.0         37.3
4   4.0         35.6

List:
[[0.0, 37.0], [1.0, 35.9], [2.0, 36.0], [3.0, 37.3], [4.0, 35.6]]

Data from List:
     0     1
0  0.0  37.0
1  1.0  35.9
2  2.0  36.0
3  3.0  37.3
4  4.0  35.6



Already we can see that `DataFrame` objects present data in a more readable way, and when using dictionaries, the columns gain the titles of the keys from the dictionary. Now that we have constructed our dataframes, we need to know some of their basic functions for data manipulation, so that we can harness them as tools later on in the course.
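
When the data come from a `list`, the columns are simply numbered. One way to give them names, sketched here with a shortened copy of the data above, is the `columns` keyword argument of the `DataFrame` constructor. The variable names `unnamed` and `named` are ours.

```python
import pandas as pd

data_list = [
    [0.0, 37.0],
    [1.0, 35.9],
    [2.0, 36.0],
]

# Without `columns`, pandas labels the columns 0, 1, ...
unnamed = pd.DataFrame(data_list)

# With `columns`, each position in the inner lists is given a name.
named = pd.DataFrame(data_list, columns=['time', 'temperature'])

print(list(unnamed.columns))
print(list(named.columns))
```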

> In the next parts of the course, we will use some of the concepts covered in the first year course MATE40001 with little or no explanation. This is to refresh your memory. One concept that may be new is the use of **annotations** which are exclusive to Python 3 (`def function(args: "argument annotation") -> "return annotation":`). We will use these without explanation but a description of their use, **which you will be required to know**, can be found [here](https://www.python.org/dev/peps/pep-3107/). 

In [6]:
import math
import random

# create a function to generate random data
def generate_dataframe(n_rows: int) -> pd.DataFrame:
    """Generate a DataFrame with day and temperature data"""
    data = {
        'day' : range(1, n_rows+1),
        'temperature' : [32 + 5*math.sin(i) + random.random() for i in range(n_rows)]
    }
    result = pd.DataFrame(data)
    return result

generate_dataframe(10)

   day  temperature
0    1    32.795705
1    2    36.374262
2    3    36.749039
3    4    32.800396
4    5    29.012655
5    6    27.309292
6    7    30.726343
7    8    36.114951
8    9    37.318667
9   10    34.073380
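
The annotations used in `generate_dataframe` above describe the expected types but are not checked when the code runs. The following minimal sketch illustrates this with a hypothetical function, `celsius_to_kelvin`, which is not part of the course code.

```python
# A hypothetical function used only to illustrate annotation syntax.
def celsius_to_kelvin(temperature: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin."""
    return temperature + 273.15

# Annotations are stored on the function object ...
print(celsius_to_kelvin.__annotations__)

# ... but they are not enforced: passing an int still works,
# even though the annotation says float.
print(celsius_to_kelvin(25))
```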


## Accessing Data

We are now going to demonstrate a few different ways of inspecting the basic data stored in a `DataFrame`. Firstly, to access the names of the columns we use the attribute `DataFrame.columns`, and to access the names of the rows we use the attribute `DataFrame.index`.

In [7]:
temp = generate_dataframe(10)
print(f'Index:\n{temp.index}')
print(f'Columns:\n{temp.columns}')

Index:
RangeIndex(start=0, stop=10, step=1)
Columns:
Index(['day', 'temperature'], dtype='object')
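
As a further illustration, the sketch below inspects a small hand-written table (the values and the name `small` are made up for this example) and shows that row labels do not have to be integers.

```python
import pandas as pd

# Small hand-written table (values are illustrative, not measured data).
small = pd.DataFrame({'day': [1, 2, 3], 'temperature': [32.1, 36.4, 36.7]})

print(small.shape)           # (rows, columns)
print(list(small.index))     # row labels as a plain list
print(list(small.columns))   # column labels as a plain list

# Row labels need not be integers: here we relabel the rows with strings.
small.index = ['mon', 'tue', 'wed']
print(small)
```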


### loc, iloc and indexing
One of the most powerful features of `pandas` is indexing either by numerical position or by the labels of the columns or index. We shall now demonstrate how some of these work. This section is quite complicated, and we will revisit indexing `DataFrame` objects throughout the course with many different examples.

> The term 'indexing' refers to the act of accessing data within a table

We can use `DataFrame.loc` for `numpy`-like indexing by the <b>actual labels</b> of the index and columns, in the form \[row, column\]. Here are some examples:

In [8]:
temp.loc[1, :]

day             2.000000
temperature    36.980507
Name: 1, dtype: float64

In [9]:
temp.loc[:, 'temperature']

0    32.875117
1    36.980507
2    36.909371
3    33.088624
4    28.796627
5    28.169348
6    31.339406
7    35.818383
8    37.517344
9    34.201856
Name: temperature, dtype: float64

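We can also use `DataFrame.iloc` to index the values by their integer position. With both `loc` and `iloc` it is also possible to use just one value (rather than two) to access a whole row: `loc` takes the row's index label and `iloc` takes its position. The numbers below differ from the `loc` output above, which suggests the random data were regenerated between runs.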

In [8]:
temp.iloc[1]

day             2.000000
temperature    37.056996
Name: 1, dtype: float64

In [9]:
temp.iloc[:, 1]

0    32.393463
1    37.056996
2    36.566897
3    33.186805
4    29.098503
5    27.654398
6    30.700131
7    35.866356
8    37.522673
9    34.236665
Name: temperature, dtype: float64
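
With the default index, labels and positions coincide, so `loc` and `iloc` look alike. The sketch below uses made-up data with string row labels to make the difference visible. The names `frame`, `by_label` and `by_position` are ours.

```python
import pandas as pd

# Illustrative data with string row labels, so labels and positions differ.
frame = pd.DataFrame(
    {'day': [1, 2, 3, 4], 'temperature': [32.1, 36.4, 36.7, 33.0]},
    index=['a', 'b', 'c', 'd'],
)

# .loc uses labels; a label slice includes BOTH end points.
by_label = frame.loc['a':'c', 'temperature']

# .iloc uses integer positions; the stop position is excluded, as with lists.
by_position = frame.iloc[0:2, 1]

print(by_label)
print(by_position)
```

Here the label slice returns three rows (`'a'`, `'b'` and `'c'`), while the position slice returns two.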

When the column names are strings, we can select a column by its name using square brackets. We can then index the returned column (known as a `pd.Series`) again to select individual values. Here are some examples of this:

In [10]:
temp['temperature']

0    32.393463
1    37.056996
2    36.566897
3    33.186805
4    29.098503
5    27.654398
6    30.700131
7    35.866356
8    37.522673
9    34.236665
Name: temperature, dtype: float64

In [11]:
temp['temperature'][1]

37.0569964812263

In [12]:
temp['temperature'][1:6]

1    37.056996
2    36.566897
3    33.186805
4    29.098503
5    27.654398
Name: temperature, dtype: float64
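
The two-step selection above can also be written as a single `loc` call, which is the form usually preferred when assigning values. This is a minimal sketch on made-up data, not the only way to do it.

```python
import pandas as pd

frame = pd.DataFrame({'day': [1, 2, 3], 'temperature': [32.1, 36.4, 36.7]})

# Two-step selection: first the column, then the row label.
two_step = frame['temperature'][1]

# One-step selection with .loc: row label and column name together.
one_step = frame.loc[1, 'temperature']

print(two_step == one_step)

# For assignment, the one-step form modifies the DataFrame directly.
frame.loc[1, 'temperature'] = 40.0
print(frame)
```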

## DataFrame methods

For the next examples, we're going to re-write our `generate_dataframe()` function to add some keyword arguments and we're going to generate some data for each day of the 12 months of the year and store each `DataFrame` as an entry in a dictionary.

> Remember that a <b>method</b> is just a <b>function</b> that belongs to a <b>class</b> 

In [13]:
def generate_temperatures(
    mean_temperature: float, 
    days: int, 
    variation: float = 5.0
) -> pd.DataFrame:
    """Generate a DataFrame with day and temperature data"""
    data = {
        'day' : range(1, days+1),
        'temperature' : [mean_temperature + variation * (random.random() - 0.5) for i in range(days)]
    }
    result = pd.DataFrame(data)
    return result

months = {
    'january' : generate_temperatures(10.0, 31, variation=1.0),
    'february' : generate_temperatures(12.0, 28, variation=3.0),
    'march' : generate_temperatures(17.0, 31, variation=7.0),
    'april' : generate_temperatures(16.0, 30, variation=8.0),
    'may' : generate_temperatures(23.0, 31, variation=8.0),
    'june' : generate_temperatures(25.0, 30, variation=4.0),
    'july' : generate_temperatures(30.0, 31, variation=0.5),
    'august' : generate_temperatures(30.0, 31, variation=0.5),
    'september' : generate_temperatures(20.0, 30, variation=5.0),
    'october' : generate_temperatures(12.0, 31, variation=6.0),
    'november' : generate_temperatures(10.0, 30, variation=4.0),
    'december' : generate_temperatures(5.0, 31, variation=5.0),
}

print(f'Month Names:\n{months.keys()}')
print(f"\nJanuary Weather:\n{months['january']}")

Month Names:
dict_keys(['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december'])

January Weather:
    day  temperature
0     1    10.210911
1     2     9.913583
2     3    10.216349
3     4     9.738865
4     5     9.698485
5     6    10.065464
6     7     9.723614
7     8     9.602762
8     9    10.179839
9    10     9.532841
10   11    10.292285
11   12    10.313594
12   13     9.784719
13   14    10.471105
14   15    10.049943
15   16    10.403821
16   17    10.079736
17   18     9.996480
18   19    10.168183
19   20     9.604798
20   21     9.730309
21   22    10.218975
22   23     9.603863
23   24    10.227438
24   25    10.178251
25   26     9.736987
26   27    10.280161
27   28     9.595116
28   29    10.276092
29   30     9.729638
30   31     9.622401


### DataFrame Statistics
Now let's use the following built-in methods of the `DataFrame` to do some basic statistics on our data. For reference, here are the methods we are going to use:
- `pd.DataFrame.head()` - first 5 rows (or n)
- `pd.DataFrame.tail()` - last 5 rows (or n)
- `pd.DataFrame.sum()` - sum of each column (use axis=1 for rows)
- `pd.DataFrame.mean()` - arithmetic mean
- `pd.DataFrame.std()` - standard deviation

> For a full list of different `DataFrame` methods, you can look [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

In [14]:
january = months['january']
print(f'Head:\n{january.head()}')
print(f'\nTail:\n{january.tail()}')
print(f'\nSum:\n{january.sum()}')
print(f'\nMean:\n{january.mean()}')
print(f'\nStandard Deviation:\n{january.std()}')

Head:
   day  temperature
0    1    10.210911
1    2     9.913583
2    3    10.216349
3    4     9.738865
4    5     9.698485

Tail:
    day  temperature
26   27    10.280161
27   28     9.595116
28   29    10.276092
29   30     9.729638
30   31     9.622401

Sum:
day            496.000000
temperature    309.246608
dtype: float64

Mean:
day            16.000000
temperature     9.975697
dtype: float64

Standard Deviation:
day            9.092121
temperature    0.288772
dtype: float64
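
To show the effect of `axis=1`, the following sketch uses readings from two hypothetical sensors (the column names and values are invented). It also calls `describe()`, a `DataFrame` method that reports several of these statistics at once.

```python
import pandas as pd

# Illustrative readings from two hypothetical sensors.
readings = pd.DataFrame({
    'sensor_a': [10.0, 11.0, 12.0],
    'sensor_b': [20.0, 21.0, 22.0],
})

column_means = readings.mean()     # one value per column (the default, axis=0)
row_sums = readings.sum(axis=1)    # one value per row
column_stds = readings.std()       # sample standard deviation per column

print(column_means)
print(row_sums)
print(column_stds)
print(readings.describe())         # count, mean, std, min, quartiles and max
```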


### Some other functions

We will now use some more functions to demonstrate how to use `DataFrame` objects. Firstly, we will use the `pd.concat` function to create a new, larger `DataFrame` that contains the temperature for each day of the year, by concatenating (joining end to end) a `list` of `DataFrame` objects. If we look at the manual for `pd.concat` by typing `help(pd.concat)`, we can see which arguments we need:

> For most large Python projects (e.g. numpy, scipy, pandas) the output of the `help()` function will usually match the function's entry in the official documentation. It is essential to be able to look up the definitions of functions and classes when there is no one to ask!

In [15]:
help(pd.concat)

Help on function concat in module pandas.core.reshape.concat:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes.
    
    Can also add a layer of hierarchical indexing on the concatenation axis,
    which may be useful if the labels are the same (or overlapping) on
    the passed axis number.
    
    Parameters
    ----------
    objs : a sequence or mapping of Series, DataFrame, or Panel objects
        If a dict is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised
    axis : {0/'index', 1/'columns'}, default 0
        The axis to concatenate along
    join : {'in

The first argument is called `objs`, which is described as a sequence or mapping of different objects (one of which is `DataFrame`). We will therefore pass `months.values()` as the argument, since the `dict.values()` method returns our values, all of which are `DataFrame` objects, in the order in which they were inserted. We will also use the keyword argument `ignore_index=True`, which discards each month's own index and numbers the rows of the result consecutively from 0. This matters to us because we want the `index` to run in order through the whole year.

In [16]:
year = pd.concat(months.values(), ignore_index=True)
print(f'Year:\n\n{year.head()}\n...\n{year.tail()}')

Year:

   day  temperature
0    1    10.210911
1    2     9.913583
2    3    10.216349
3    4     9.738865
4    5     9.698485
...
     day  temperature
360   27     4.847370
361   28     3.049918
362   29     6.433475
363   30     5.130456
364   31     2.977165
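
The effect of `ignore_index` is easier to see on a small case. The sketch below concatenates two tiny, made-up `DataFrame` objects with and without the argument. The names `first`, `second`, `kept` and `renumbered` are ours.

```python
import pandas as pd

first = pd.DataFrame({'day': [1, 2], 'temperature': [10.0, 10.5]})
second = pd.DataFrame({'day': [1, 2], 'temperature': [12.0, 11.5]})

# Default: each piece keeps its own row labels, so the labels repeat.
kept = pd.concat([first, second])

# ignore_index=True: the rows are relabelled 0, 1, 2, ... in order.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```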


Each column of a `DataFrame` is called a `Series`. The `day` column still restarts at 1 every month, so we will now replace it with the day of the year by setting it to the `index` of the `DataFrame` plus one (the index starts at 0). By calling `year.tail()` we can see that the days have been updated to the correct values.

In [17]:
year['day'] = year.index + 1
year.tail()

     day  temperature
360  361     4.847370
361  362     3.049918
362  363     6.433475
363  364     5.130456
364  365     2.977165
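
The same assignment syntax also creates a column when the name does not yet exist. The sketch below shows both cases on made-up data. The column name `kelvin` is purely illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'day': [1, 2, 1, 2],
    'temperature': [10.0, 10.5, 12.0, 11.5],
})

# Overwrite an existing column: index values 0..3 become days 1..4.
frame['day'] = frame.index + 1

# Assigning to a name that does not exist yet creates a new column.
# The arithmetic is applied to every element of the Series at once.
frame['kelvin'] = frame['temperature'] + 273.15

print(frame)
```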


## Summary

We have now covered some of the basics of using `pd.DataFrame` objects. By completing some of the tasks and by using `DataFrame` objects to manage our data throughout the course, we will learn some of their many uses and gain experience in searching the documentation and the internet to work out how to use `DataFrame` objects for the purpose at hand.

Aside from learning about `pandas` and `DataFrame` objects, we have also covered (and in some cases revisited) the following topics:

- f-strings
- Function arguments and keyword-arguments
- Function annotations
- Indexing
- Variables, attributes, functions and methods
- `list` and `dict`
- `import` ... `as` ...