# Pandas Intro

## DataFrames

Pandas is a Python library for working with tabular data. Its most important data structure is the DataFrame, which you can think of as being similar to an Excel table or an SQL table. Let's import the Pandas module and create a DataFrame:

In [None]:
import pandas as pd

df = pd.DataFrame({'col_1': [6,7,11], 'col_2': ['a', 'b', 'c']})
print(df)

We've created a dataframe called ``df``. It contains two variables (columns) and three observations (rows). In this case, we created the DataFrame using a dictionary, where the keys in the dictionary become the column names in the DataFrame, and the values in the dictionary become the data within each column.

Note that DataFrames always have an index, shown in the above output as the leftmost column (in this case the index contains the values 0, 1, and 2). If you do not specify an index when you create a DataFrame, Pandas creates a default integer index starting at 0.
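
To illustrate the point about the index, here is a minimal sketch that passes explicit row labels through the ``index`` argument of ``pd.DataFrame``. The labels ``'r1'``, ``'r2'``, ``'r3'`` and the variable name ``df_labelled`` are arbitrary choices for this illustration, not part of the original example.

```python
import pandas as pd

# Same data as above, but with explicit row labels instead of the default 0, 1, 2.
# The labels 'r1', 'r2', 'r3' are hypothetical names chosen for this sketch.
df_labelled = pd.DataFrame(
    {'col_1': [6, 7, 11], 'col_2': ['a', 'b', 'c']},
    index=['r1', 'r2', 'r3'],
)
print(df_labelled)
print(df_labelled.index)

# Rows can then be looked up by their label with .loc
print(df_labelled.loc['r2'])
```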

Now let's create a DataFrame that looks a little more like real data:

In [None]:
import numpy as np

data = {
    'name': ['Jeff', 'Julia', 'Ronda', 'Dinesh', 'Susan'],
    'height': [183, 176, 187, 157, 165],  # fifth value (165) is an assumed placeholder so every column has five entries
    'weight': [69, 73, np.nan, 77, 82],
    'gender': ['m', 'f', 'f', 'm', 'f']
}

df = pd.DataFrame(data)
print(df)

Here we have five observations across four variables. Note that one of the weight observations is a missing value, which in Pandas is conventionally represented by the NaN value ("Not a Number").
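
As a small sketch of how missing values behave, the cell below rebuilds only the weight column as a standalone Series (the variable names ``weights``, ``n_missing`` and ``mean_weight`` are introduced here for illustration). It shows one way to count missing values with ``.isna()``, and that ``.mean()`` skips NaN by default.

```python
import numpy as np
import pandas as pd

# The weight values from the example above, as a standalone Series
weights = pd.Series([69, 73, np.nan, 77, 82], name='weight')

# .isna() returns a Boolean Series that is True where a value is missing
print(weights.isna())

# Summing the Boolean Series counts the missing values
n_missing = weights.isna().sum()
print(n_missing)

# Most aggregations, such as .mean(), ignore NaN by default
mean_weight = weights.mean()
print(mean_weight)
```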

The above DataFrame ``df`` was created by converting a Python dictionary into a Pandas DataFrame. Note that you can also pass the data and the column names separately when creating a DataFrame, for example when the data is a Numpy array:

In [None]:
a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
print(a)
print(type(a))

# Create DataFrame using a Numpy array and specifying column names
df_fromarray = pd.DataFrame(a, columns=['x', 'y', 'z'])
print(df_fromarray)
print(type(df_fromarray))

Now create another dataframe that contains identical columns and observations to ``df_fromarray``, but use a dictionary rather than an array to create it:

In [None]:
d = # Write your dictionary code here...



# Create DataFrame using a dictionary
df_fromdict = # Write your code here...


assert df_fromarray.equals(df_fromdict), 'The DataFrames are not identical'

Note that the ``.equals()`` method in the code above is used to check if two DataFrames are identical.
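
One detail worth knowing is that ``.equals()`` also compares data types, not just values. The sketch below uses small made-up DataFrames (``ints``, ``same`` and ``floats`` are names invented for this illustration) to show the difference between ``.equals()`` and an element-wise ``==`` comparison.

```python
import pandas as pd

ints = pd.DataFrame({'x': [1, 2, 3]})
same = pd.DataFrame({'x': [1, 2, 3]})
floats = pd.DataFrame({'x': [1.0, 2.0, 3.0]})

# Same values and same dtype: .equals() reports the frames as identical
same_result = ints.equals(same)
print(same_result)

# Same values but integer vs. float dtype: .equals() reports a difference
floats_result = ints.equals(floats)
print(floats_result)

# An element-wise comparison only looks at the values
elementwise = bool((ints == floats).all().all())
print(elementwise)
```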

## The Series class, and selecting DataFrame columns

Another important data structure in Pandas is the Series, which is like a single column of an Excel table. You can also think of a Series as being very similar to the array data structure in Numpy, although a bit more flexible.

Let's create a Series:

In [None]:
s = pd.Series([39, 43, 25])
print(s)
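
As a rough illustration of the extra flexibility a Series has over a Numpy array, the sketch below gives the same values an index of labels and a name. The labels ``'a'``, ``'b'``, ``'c'`` and the name ``'age'`` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([39, 43, 25])

# A Series wraps the array values and adds an index of labels and an optional name
s_labelled = pd.Series(arr, index=['a', 'b', 'c'], name='age')
print(s_labelled)

# Values can be looked up by label rather than only by position
value_b = s_labelled['b']
print(value_b)

# The underlying values are still available as a Numpy array
print(s_labelled.to_numpy())
```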

Note that if we select a particular column of a DataFrame, it returns a Series object, not a DataFrame object:

In [None]:
df['name']

In [None]:
type(df['name'])

However, if we select multiple columns, a DataFrame is returned, not a Series:

In [None]:
df[['name', 'height']]

In [None]:
type(df[['name', 'height']])

Note that we can also select a column of a DataFrame using slightly different syntax: ``df.col_name``, rather than ``df['col_name']``:

In [None]:
df.name

You can use either syntax, but note that the ``.`` syntax is more limited: the column name must be a valid Python identifier (only letters, numbers, and underscores, not starting with a number, and no whitespace or other special characters), and it must not clash with the name of an existing DataFrame attribute or method.
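
The sketch below shows these limitations on a small made-up DataFrame (``people`` and its column names ``'height cm'`` and ``'count'`` are hypothetical, chosen to trigger each case). Bracket syntax works for every column, whereas dot syntax only works for ``name``.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Jeff', 'Julia'],
    'height cm': [183, 176],  # contains a space, so people.height cm is not valid Python
    'count': [1, 2],          # clashes with the DataFrame.count() method
})

# Both syntaxes return the same Series for a simple column name
by_attr = people.name
by_key = people['name']
print(by_attr.equals(by_key))

# Bracket syntax works for any column name
heights = people['height cm']
counts = people['count']
print(heights)
print(counts)

# Dot syntax finds the method, not the column, when the names clash
is_method = callable(people.count)
print(is_method)
```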

## Some useful DataFrame attributes and methods

DataFrames have a number of useful attributes and methods. Some that you might find useful are:

In [None]:
# Provides information on the DataFrame's variables (columns), number of observations (rows), and data types
df.info()

In [None]:
# Gives the shape of a DataFrame
df.shape

In [None]:
# Gives the column names of the DataFrame (an attribute, so no parentheses)
df.columns

In [None]:
# Gives the index of the DataFrame
df.index

The ``.unique()`` Series method gives the unique values in a DataFrame column (Series):

In [None]:
df['gender'].unique()

In [None]:
df['height'].unique()

The ``.value_counts()`` Series method can be used to count instances of each unique value in a DataFrame column (Series). For example, we can use it to count instances of each unique value in the ``'gender'`` column of ``df``:

In [None]:
# Count instances of each unique value in a column
df['gender'].value_counts()

In [None]:
df['height'].value_counts()
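
As a small extension, ``.value_counts()`` also accepts ``normalize=True``, which returns proportions rather than raw counts. The sketch below rebuilds the gender column as a standalone Series so that it runs on its own (the names ``gender``, ``counts`` and ``shares`` are introduced for this illustration).

```python
import pandas as pd

# The gender values from the example DataFrame, as a standalone Series
gender = pd.Series(['m', 'f', 'f', 'm', 'f'], name='gender')

# Raw counts, sorted from most to least frequent
counts = gender.value_counts()
print(counts)

# Proportions of the total instead of counts
shares = gender.value_counts(normalize=True)
print(shares)
```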

In [None]:
# Simple descriptive statistics for each numeric column (pass include='all' to also summarise object columns)
df.describe()

The descriptive statistics included in the ``.describe()`` DataFrame method are:

- **count:** count of the number of non-NA/null observations,
- **mean:** mean of the values,
- **std:** standard deviation of the observations,
- **min:** minimum of the values,
- **max:** maximum of the values,
- the 25th, 50th (median), and 75th percentiles.
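
By default ``.describe()`` summarises only the numeric columns when a DataFrame contains any. The sketch below, on a reduced copy of the example data (``df_demo`` is a name made up for this illustration), compares the default output with ``include='all'``, which adds rows such as ``unique``, ``top`` and ``freq`` for non-numeric columns.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'weight': [69, 73, np.nan, 77, 82],
    'gender': ['m', 'f', 'f', 'm', 'f'],
})

# Default: only the numeric 'weight' column is summarised
numeric_summary = df_demo.describe()
print(numeric_summary)

# include='all' also summarises the non-numeric 'gender' column
full_summary = df_demo.describe(include='all')
print(full_summary)
```

Note that the count for ``weight`` excludes the missing value, in line with the description of **count** above.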

## Selecting rows

We can select particular rows in a DataFrame which satisfy a given condition, using either of the following syntax options:

- ``df[condition]``
- ``df.loc[condition]``

For example:

In [None]:
df.loc[df['name'] == 'Julia']

In [None]:
df[df['name'] == 'Julia']

This returns a new DataFrame of any rows that match the condition within ``[...]``. So depending on the search condition, we can also return multiple rows:

In [None]:
df[df['gender'] == 'f']

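To combine multiple conditions, enclose each condition in parentheses ``(...)`` and join them with ``&`` for the AND operator or ``|`` for the OR operator (``~`` negates a condition). The parentheses are needed because ``&`` and ``|`` bind more tightly than comparison operators such as ``==`` and ``>``.

Write a line in the cell below that selects the rows for women who are taller than 180 cm.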
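
As a sketch of the syntax for combining conditions, the cell below uses a separate, made-up DataFrame (``items`` and its columns ``'item'``, ``'price'`` and ``'in_stock'`` are hypothetical and unrelated to ``df``).

```python
import pandas as pd

items = pd.DataFrame({
    'item': ['pen', 'book', 'lamp', 'desk'],
    'price': [2, 15, 40, 120],
    'in_stock': [True, False, True, True],
})

# AND: each condition sits in its own parentheses, joined with &
cheap_and_available = items[(items['price'] < 50) & (items['in_stock'])]
print(cheap_and_available)

# OR, with ~ negating the Boolean 'in_stock' column
cheap_or_unavailable = items[(items['price'] < 5) | (~items['in_stock'])]
print(cheap_or_unavailable)
```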

In [None]:
df[... your code here ...]

Now change the above to an OR operator (``|``) rather than an AND operator (``&``), and re-run the above cell. It should now select all rows where gender is female or height is > 180cm.

Now try selecting all data other than Ronda's data using the NOT EQUALS operator (``!=``):

In [None]:
df[... your code here ...]

Note that we can also assign any returned DataFrame to a new variable:

In [None]:
j_rows = df[df['name'].str.startswith('J')]
print(j_rows)

It is useful to understand how Pandas identifies and returns the above rows. The query ``df[df['gender'] == 'f']`` actually involves two steps. First, the condition ``df['gender'] == 'f'`` takes the "gender" column and compares each of its values with "f":

In [None]:
df['gender'] == 'f'

This returns a Series of Boolean values (True or False) that identify whether each row satisfies the condition. This Series is then used to return all rows where the condition is True:

In [None]:
df[df['gender'] == 'f']

This means that if we provide a list of Boolean values directly, with one value per row, we get the same result:

In [None]:
df[[False, True, True, False, True]]
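
The two steps can be sketched explicitly by storing the Boolean Series in a variable first. The cell below rebuilds a reduced copy of the data so that it runs on its own; ``df_mask``, ``mask``, ``from_condition`` and ``from_list`` are names introduced for this illustration.

```python
import pandas as pd

df_mask = pd.DataFrame({
    'name': ['Jeff', 'Julia', 'Ronda', 'Dinesh', 'Susan'],
    'gender': ['m', 'f', 'f', 'm', 'f'],
})

# Step 1: the comparison produces a Boolean Series, one value per row
mask = df_mask['gender'] == 'f'
print(mask)

# Step 2: the Boolean Series selects the rows where it is True
from_condition = df_mask[mask]

# A plain list of Booleans with the same values selects the same rows
from_list = df_mask[[False, True, True, False, True]]
print(from_condition.equals(from_list))

# Summing the mask counts how many rows match
n_selected = mask.sum()
print(n_selected)
```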

## Saving and loading DataFrames

Pandas has built-in functions that make it easy to save DataFrames and to load datasets of different formats into DataFrames. One of the most common data formats is the comma-separated values (CSV) format, where each row is an observation, and the values in that row are separated by commas. We can save our ``df`` DataFrame to disk as a CSV file:

In [None]:
file_name = 'height_weight.csv'
df.to_csv(file_name)

We can then load the same DataFrame, assigning it to a different variable name:

In [None]:
data = pd.read_csv(file_name)
print(data)

Note that this has an extra column ``'Unnamed: 0'``. This was loaded because the ``.to_csv()`` method will automatically save the DataFrame's index to the CSV file as a new column, in addition to all other data.

To prevent this, we need to specify ``index=False`` as an option when calling the ``.to_csv()`` method. Add the ``index=False`` option to the ``.to_csv()`` call above and re-run that cell. Then re-run the subsequent cell containing the ``.read_csv()`` call. It should now load a dataframe without the ``'Unnamed: 0'`` column.
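
To see where the extra column comes from without writing to disk, the sketch below writes a small made-up DataFrame to CSV text in memory (calling ``.to_csv()`` without a file name returns the CSV as a string) and reads it back through ``io.StringIO``. The names ``df_small``, ``with_index`` and ``without_index`` are introduced for this illustration. It also shows ``index_col=0``, an alternative that tells ``pd.read_csv`` to use the first column as the index.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'name': ['Jeff', 'Julia'], 'height': [183, 176]})

# Without a file name, .to_csv() returns the CSV text as a string
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)
print(with_index)     # the header row starts with an empty field for the index
print(without_index)

# Reading the text back: the saved index becomes an 'Unnamed: 0' column
loaded_with = pd.read_csv(io.StringIO(with_index))
loaded_without = pd.read_csv(io.StringIO(without_index))
print(list(loaded_with.columns))
print(list(loaded_without.columns))

# Alternative: keep the index in the file and restore it when loading
loaded_index_col = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(loaded_index_col.columns))
```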

## Exercise 1

Create a DataFrame called ``volumes`` with:
- a single column called ``'radius (m)'``,
- 20 rows, each with a radius value randomly chosen from between the bounding values 3.0 and 10.0. You can use Numpy's ``np.random.uniform(low=a, high=b, size=(n,))`` function, which returns an array of ``n`` randomly generated numbers drawn from a uniform distribution between the numbers ``a`` and ``b``.

Create a new column called ``'sphere_vol (m^3)'`` containing the corresponding volume of a sphere for each of the radius values. The formula for calculating the volume of a sphere is:

$V = {4 \over 3} \pi r^3$

Note that the value for $\pi$ is available in Numpy using the ``np.pi`` variable.

Now print only the rows where the volume is > $30 m^3$.

Save the ``volumes`` DataFrame to the location ``'../data/volumes.csv'``. Then load the data into a new DataFrame called ``vol`` and check that it is equivalent to the ``volumes`` DataFrame. If the DataFrames are not equivalent, one possibility is that floating point rounding errors have created slightly different floats in the new dataframe. One way to check this is by rounding all float values in each dataframe to, say, 5 decimal places using the ``df.round()`` method, and then performing the comparison between the rounded dataframes.