# Pandas

   - A powerful package for manipulating data
   - Unlike NumPy arrays, Pandas data structures have named columns
   - The [Pandas](http://pandas.pydata.org/) library is built on top of the NumPy library, and it provides a special kind of two-dimensional data structure called `DataFrame`. A `DataFrame` lets you name its columns, so that a column can be accessed by its name instead of its position.



In [None]:
import pandas as pd    # This is the standard way of importing the Pandas library
import numpy as np

### Loading CSV Files

- `read_csv` can handle both local files and URLs
- There are many options that control how the data is loaded; see the documentation of `read_csv`
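
As a minimal sketch of a few of those options, the following reads a small CSV from an in-memory string instead of a file or URL. The data and the column names are made up for illustration; `sep`, `usecols`, and `nrows` are regular `read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, standing in for a file or URL
csv_text = "day;temp;rain\n1;-0.5;0.0\n2;1.2;3.4\n3;2.0;7.1\n"

weather = pd.read_csv(
    io.StringIO(csv_text),    # a file path, URL, or file-like object
    sep=";",                  # field separator (the default is a comma)
    usecols=["day", "temp"],  # load only these columns
    nrows=2,                  # read only the first two data rows
)
print(weather)
```

Restricting columns and rows this way can be handy for a quick look at a large file before loading all of it.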

In [None]:
df = pd.read_csv("https://www.cs.helsinki.fi/u/jttoivon/dap/data/fmi/kumpula-weather-2017.csv")
df.head(10)   

In [None]:
df.tail()

### Accessing Data

Refer to a column by its name:

In [None]:
df["Snow depth (cm)"].head()     # Using the tab key can help enter long column names

There are several summary statistic methods that operate on a column or on all the columns. The next example computes the mean of the temperatures over all rows of the DataFrame:

In [None]:
df["Air temperature (degC)"].mean()    # Mean temperature
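
The same summary methods can also be applied to the whole DataFrame, in which case they produce one result per column. The sketch below uses a small hypothetical DataFrame (not the weather data) to show the difference:

```python
import pandas as pd

# Hypothetical measurements, not taken from the weather data set
obs = pd.DataFrame({"temp": [-1.0, 2.0, 5.0], "rain": [0.0, 4.0, 8.0]})

temp_mean = obs["temp"].mean()  # mean of a single column: a scalar
col_means = obs.mean()          # mean of every column: a Series indexed by column name
summary = obs.describe()        # count, mean, std, min, quartiles, and max per column

print(temp_mean)
print(col_means)
print(summary)
```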

We can drop some columns from the DataFrame with the `drop` method:

In [None]:
df.drop("Time zone", axis=1).head()    # Returns a copy with one column removed; the original DataFrame stays intact

In [None]:
df.head()     # Original DataFrame is unchanged

In case you want to modify the original DataFrame, you can either assign the result to the original DataFrame, or use the `inplace` parameter of the `drop` method. Many of the modifying methods of the DataFrame have the `inplace` parameter.
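
Both options can be sketched on a small hypothetical DataFrame (the names `frame` and `reassigned` are only for illustration):

```python
import pandas as pd

# Hypothetical two-column frame
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Option 1: keep the returned copy; assigning it back to `frame`
# would replace the original
reassigned = frame.drop("b", axis=1)

# Option 2: modify the object itself; the method then returns None
result = frame.drop("b", axis=1, inplace=True)

print(list(reassigned.columns))
print(list(frame.columns))
print(result)
```

Because the in-place call returns `None`, its result should not be assigned back to the variable.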

Addition of a new column works like adding a new key-value pair to a dictionary:

In [None]:
df["Rainy"] = df["Precipitation amount (mm)"] > 5

In [None]:
df.Rainy.sum()
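
The sum works as a count because `True` is treated as 1 and `False` as 0. A minimal sketch with made-up precipitation values:

```python
import pandas as pd

# Hypothetical precipitation amounts in millimetres
rain = pd.DataFrame({"mm": [0.0, 6.5, 12.0, 4.9]})

rain["heavy"] = rain["mm"] > 5   # new boolean column from an elementwise comparison
n_heavy = rain["heavy"].sum()    # number of rows where the condition holds

print(rain)
print(n_heavy)
```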

In the next sections we will systematically go through the DataFrame and its one-dimensional version: *Series*.

## Creation and indexing of series

One can turn any one-dimensional iterable into a Series, which is a one-dimensional data structure:

In [None]:
s = pd.Series(range(10, 40, 4))
s
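
A `range` is only one possibility. As a brief sketch with made-up values, lists and NumPy arrays work the same way, and the element type is inferred from the data:

```python
import numpy as np
import pandas as pd

from_list = pd.Series([1.5, 2.5, 3.5])         # floats from a list
from_array = pd.Series(np.array([1, 2, 3]))    # integers from a NumPy array
from_strings = pd.Series(["a", "b"])           # strings from a list

print(from_list.dtype, from_array.dtype, from_strings.dtype)
```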

The data type of the elements in this Series is `int64`, that is, 64-bit integers.

We can also attach a name to this series:

In [None]:
s.name = "Grades"
s

The common attributes of the series are the `name`, `dtype`, and `size`:

In [None]:
print(f"Name: {s.name}, dtype: {s.dtype}, size: {s.size}")

Note that the row indices were printed alongside the values of the series. All the access methods of NumPy arrays also work for a Series: indexing, slicing, masking, and fancy indexing.

In [None]:
s[1]

In [None]:
s2 = s[[0,5]]                    # Fancy indexing
print(s2)

In [None]:
t=s[-2:]                    # Slicing
t

Note that the indices stick to the corresponding values; they are not renumbered!

In [None]:
t[6]                        # The labels of t are 6 and 7; t[0] would give an error
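
If renumbering is what you want, one way to get it is the `reset_index` method with `drop=True`. The sketch below rebuilds the same series, and the names `tail` and `renumbered` are only for illustration:

```python
import pandas as pd

s = pd.Series(range(10, 40, 4))

tail = s[-2:]                             # keeps the original labels 6 and 7
renumbered = tail.reset_index(drop=True)  # new labels 0 and 1, old ones discarded

print(list(tail.index))
print(list(renumbered.index))
```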

The values as a NumPy array are accessible via the `values` attribute:

In [None]:
s2.values

And the indices are available through the `index` attribute:

In [None]:
s2.index

The index is not simply a NumPy array, but a data structure that allows fast access to the elements. The indices need not be integers, as the next example shows:

In [None]:
s3=pd.Series([1, 4, 5, 2, 5, 2], index=list("abcdef"))
s3

In [None]:
s3.index

In [None]:
s3["b"]

<div class="alert alert-warning">
Note a special case here: when slicing by explicit labels, as with the non-integer index above, the end point of the slice is included in the result. This is contrary to slicing by integer position, where the end point is excluded!
</div>

In [None]:
s3["b":"e"]
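
To make the difference concrete, here is a small sketch that slices the same kind of series once by label and once by position (the variable names are only for illustration):

```python
import pandas as pd

letters = pd.Series([1, 4, 5, 2, 5, 2], index=list("abcdef"))

by_label = letters.loc["b":"e"]   # labels b, c, d, e: end point included
by_position = letters.iloc[1:4]   # positions 1, 2, 3: end point excluded

print(by_label)
print(by_position)
```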

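It is still possible to access the series using NumPy-style *implicit integer indices*, although recent Pandas versions deprecate this fallback for `[]` and recommend `iloc` for positional access: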

In [None]:
s3[1]

This can be confusing though. Consider the following series:

In [None]:
s4 = pd.Series(["Jack", "Jones", "James"], index=[1,2,3])
s4

What do you think `s4[1]` will print? To resolve this ambiguity, Pandas offers the attributes `loc` and `iloc`. The attribute `loc` always uses the explicit index, while the attribute `iloc` always uses the implicit integer index:

In [None]:
print(s4.loc[1])
print(s4.iloc[1])
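
The same distinction carries over to slices. A short sketch, rebuilding the series under the hypothetical name `names`:

```python
import pandas as pd

names = pd.Series(["Jack", "Jones", "James"], index=[1, 2, 3])

first_by_label = names.loc[1]            # element whose label is 1
second_by_position = names.iloc[1]       # element at position 1, i.e. the second one
label_slice = names.loc[1:2].tolist()    # labels 1 and 2, end point included
position_slice = names.iloc[1:2].tolist()  # position 1 only, end point excluded

print(first_by_label, second_by_position)
print(label_slice, position_slice)
```

Using `loc` or `iloc` explicitly makes the intent visible to the reader and avoids depending on how `[]` resolves integer keys.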

In [None]:
# A Series can also be created from a dictionary: the keys become the index
d = {2001: "Bush", 2005: "Bush", 2009: "Obama", 2013: "Obama", 2017: "Trump"}
s4 = pd.Series(d, name="Presidents")
s4
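
A Series built from a dictionary can then be queried by key, much like the dictionary itself. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: dict keys become the index, dict values become the data
capitals = pd.Series({"Finland": "Helsinki", "Sweden": "Stockholm"}, name="Capitals")

print(capitals["Finland"])    # lookup by label
print(list(capitals.index))   # the former dictionary keys
```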