# pandas Basics

[*pandas*](http://pandas.pydata.org/) is a popular data science tool for managing, manipulating and analyzing column-oriented data.

In [None]:
import pandas as pd

In this notebook, we will cover:

- pandas data structures
- Description of the data
- Loading data from CSV
- Inspecting the data
- Data selection and filtering
- Data transformation
- Sorting values

## 1. pandas data structures

We'll work with two main [data structures](https://en.wikipedia.org/wiki/Data_structure) offered by pandas:

- `Series`, an array-like (1-dimensional) data structure
- `DataFrame`, a table-like (2-dimensional) data structure

You can create `Series` and `DataFrame` objects by passing the desired values explicitly, e.g. a list for a `Series` and a dictionary in the format `{'column': [list of values]}` for a `DataFrame`.

pandas also offers some helper functions to load data from specific formats, e.g. `read_csv()`.
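
As a minimal sketch of the explicit construction described above (the values and the names `ages` and `people` are made up for illustration):

```python
import pandas as pd

# A Series built from a plain list; pandas assigns the default labels 0, 1, 2
ages = pd.Series([31, 45, 27])

# A DataFrame built from a dictionary in the format {'column': [list of values]}
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [31, 45, 27],
})

print(ages)
print(people)
```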


## 2. Description of the data

In the next couple of notebooks, we'll perform some analysis on a (semi-)randomly generated data set of clients and transactions (people buying stuff at a fake store).

There are two tables that we will load into pandas DataFrames: `clients` and `transactions`.

Attributes of `clients`:

- `client_id`: unique number identifying the client
- `name`: string representing the full (fake) name of the client
- `date_of_birth`: datetime object representing the date of birth
- `city`: string representing the location of the client 

Attributes of `transactions`:

- `transaction_id`: unique number identifying the transaction
- `client_id`: id of the client associated with the transaction
- `date`: datetime object representing the date of the transaction
- `product`: string representing the specific product being purchased
- `quantity`: number of items purchased for the specific product
- `unit_price`: number representing the price of a single unit for the specific product
- `total`: unit price times quantity for the specific product

In [None]:
clients_file = '../data/fake_shop/fake_clients.csv'
transactions_file = '../data/fake_shop/fake_transactions.csv'

## 3. Loading Data From CSV

The function `pandas.read_csv()` can be used to load a CSV file into a pandas `DataFrame`.

pandas also supports other file formats out of the box: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [None]:
clients = pd.read_csv(clients_file, parse_dates=['date_of_birth'], encoding='utf-8')
clients

#### A Note About the `read_csv()` function

The `pandas.read_csv()` function takes several optional arguments to customise the way we load the file.

For instance, in the example above we're using `parse_dates` to specify which columns should be treated as `datetime` objects rather than plain strings.

Full reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
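
The effect of `parse_dates` can be seen on a small in-memory CSV. This is only a sketch: the data is made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV content, used in place of a real file
csv_text = "client_id,date_of_birth\n1,1980-05-17\n2,1992-11-03\n"

# Without parse_dates, the date column is loaded as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the same column is converted to datetime values
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_of_birth'])

print(as_text['date_of_birth'].dtype)
print(as_dates['date_of_birth'].dtype)
```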

## 4. Inspecting the `DataFrame`

Instead of printing out the whole `DataFrame`, we can inspect the first few records using `head()`.

Note: `head()` returns a new `DataFrame`, made of the desired number of records.

Most functions working on a `DataFrame` or a `Series` don't modify the original data structure, but create a new one instead. For downstream processing, you could also save the output to a new variable name instead of printing it out directly like we're doing here.
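
A quick way to convince yourself of this, sketched on a made-up table (the names `df` and `first_three` are only for illustration):

```python
import pandas as pd

# Made-up table with 8 rows
df = pd.DataFrame({'value': range(8)})

# head() returns a new DataFrame; store it in a variable for later use
first_three = df.head(3)

print(len(first_three))  # number of rows in the new object
print(len(df))           # number of rows in the original, which is unchanged
```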

In [None]:
clients.head()  # first 5 records

In [None]:
clients.head(10)  # first 10 records

The function `info()` provides an overview of data types and null values.

In [None]:
clients.info()

The function `count()` provides the number of non-null values per field.

In [None]:
clients.count()

The attribute `columns` returns the sequence of column names (index) of the `DataFrame`.

In [None]:
clients.columns

The function `tail()` is similar to `head()`, but works with the end of the `DataFrame`.

In [None]:
clients.tail()

Let's load the dataset of transactions:

In [None]:
transactions = pd.read_csv(transactions_file)
transactions.head(10)

In [None]:
transactions.info()

You may notice from the result of running `info()` above that the [dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes) of the `date` column is an object instead of a datetime. This is because pandas treats it as a string rather than a date. This is bad not just from an efficiency standpoint, but also because you wouldn't be able to do datetime operations on that column (e.g. checking whether it's earlier or later than another date). It is good practice to process the data and fix this as early as possible!
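
To illustrate what datetime operations make possible, the sketch below uses made-up data and converts an already-loaded column of date strings with `pd.to_datetime()`. This is one possible approach, not necessarily the one intended for the exercise that follows.

```python
import pandas as pd

# Made-up transactions whose dates were loaded as plain strings
df = pd.DataFrame({'date': ['2019-01-15', '2019-03-02', '2018-12-24']})

# Convert the strings to datetime values
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)

# Datetime operations now work, e.g. comparing against another date
in_2019 = df['date'] >= pd.Timestamp('2019-01-01')
print(df[in_2019])
```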

#### Exercise

Reload the transactions file, making sure that the column `date` is parsed as a `datetime` object.

In [None]:
# Write your solution here:


If you are having trouble, recall the discussion earlier in this notebook about how `read_csv()` takes optional arguments to customise the way we load the file.

Or reveal the solution below:

In [None]:
# when you run this cell, the cell's contents will be replaced by the solution! MAGIC!
%load ../solutions/01-transactions-parse-dates.py

The function `describe()` provides an overview of summary statistics.

By default, it's run against all the numerical columns.

In [None]:
transactions.describe()

**Note**: summary stats on `transaction_id` and `client_id` are not meaningful in this case.

It's always worth focusing only on the columns of interest first.

Basic statistics can also be computed individually using the relevant functions, e.g.

In [None]:
transactions['total'].sum()

In [None]:
transactions['unit_price'].min()

In [None]:
transactions['unit_price'].max()

In [None]:
transactions['quantity'].mean()

For categorical data, we can compute the frequency distribution using `value_counts()`:

In [None]:
transactions['product'].value_counts()

The function `unique()` returns the unique values of a `Series`:

In [None]:
clients['city'].unique()

## 5. Data Selection and Filtering

This section discusses different options to select partial subsets of your data.

### Select records by row label

In [None]:
# Select record with row label = 0
# the result is a Series
clients.loc[0]  

In [None]:
# Select records with labels between 0 and 2 (included)
# the result is a DataFrame
clients.loc[0:2]
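
Note that `.loc` works with row labels, not positions. A sketch on a made-up table whose row labels are strings (an assumption for illustration, since `clients` uses the default numeric labels) shows that a label slice includes both endpoints:

```python
import pandas as pd

# Made-up table with string row labels instead of the default 0, 1, 2, ...
df = pd.DataFrame(
    {'city': ['London', 'Paris', 'Rome', 'Berlin']},
    index=['a', 'b', 'c', 'd'],
)

# A single label returns a Series
row = df.loc['a']

# A label slice returns a DataFrame and includes both 'a' and 'c'
subset = df.loc['a':'c']
print(subset)
```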

### Select columns by column name

In [None]:
# Select a column - the output is a Series
clients['name']

In [None]:
# Select multiple columns - the output is a DataFrame
clients[['name', 'date_of_birth']]

Note: the double square brackets indicate that we're passing a list of labels, e.g.

In [None]:
columns_of_interest = ['name', 'date_of_birth']
clients[columns_of_interest].head()

#### Exercise

Use the function `describe()` to get an overview of the summary statistics for the table of transactions.

Focus only on the relevant numerical columns: `quantity`, `unit_price` and `total`.

In [None]:
# Write your solution here


In [None]:
# or load the proposed solution by running this cell
%load ../solutions/01-transactions-describe.py

### Select the intersection of rows and columns (by label/name)

In [None]:
transactions.loc[0:3, ['product', 'total']]  # select by both rows and columns at the same time

### Select records using a condition

Comparisons are performed element-wise, e.g.

In [None]:
transactions['unit_price'] > 2

A boolean Series can be used for selecting records based on a condition:

In [None]:
transactions[transactions['unit_price'] > 2]

We can combine multiple conditions with *bitwise operators*:

- `&` instead of `and`
- `|` instead of `or`
- `~` instead of `not`

Note: the Python boolean operators `and`, `or` and `not` work on values that can be considered `True` or `False`. For a pandas `Series`, the truth value is ambiguous, hence *bitwise operators* are used.

In [None]:
transactions[(transactions['unit_price'] > 2) & (transactions['quantity'] > 2)]

**Note**: the round brackets around each `(condition)` are required because *bitwise operators* have higher precedence than comparison operators such as `>`, so we need to force the order of operations.
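
The other operators follow the same pattern. The sketch below uses a small made-up table to show why `and` fails on a `Series`, and how `|` and `~` behave:

```python
import pandas as pd

# Made-up prices and quantities
df = pd.DataFrame({'unit_price': [1, 2, 5], 'quantity': [4, 1, 3]})

# Python's `and` needs a single True/False value, so it fails on a Series
try:
    (df['unit_price'] > 2) and (df['quantity'] > 2)
except ValueError as error:
    print('ValueError:', error)

# `|` keeps rows where at least one condition holds
either = df[(df['unit_price'] > 2) | (df['quantity'] > 2)]

# `~` negates a condition: rows where unit_price is NOT greater than 2
not_over_2 = df[~(df['unit_price'] > 2)]

print(either)
print(not_over_2)
```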

**Tip**: when a condition is complex or difficult to read, we can break it down to improve readability.

For example the previous cell could be rephrased as:

In [None]:
# No need for round brackets here, the order of operations is clearly defined
price_over_2 = transactions['unit_price'] > 2
quantity_over_2 = transactions['quantity'] > 2
condition = price_over_2 & quantity_over_2

transactions[condition]

## 6. Data transformations

### Arithmetic operations

Basic arithmetic operations are performed element-wise, e.g.

In [None]:
vat_percentage = 0.2
transactions['vat_to_add'] = transactions['total'] * vat_percentage

transactions.head()

Arithmetic operations also work between two `Series`, as long as their indexes are compatible (e.g. two columns of the same table):

In [None]:
transactions['total_with_vat'] = transactions['total'] + transactions['vat_to_add']

transactions.head()
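
When two `Series` do not share the same index, pandas matches values by label before computing. A sketch with made-up labels shows what happens to a label that appears on one side only:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['x', 'y'])

# Values are matched by label; 'z' has no partner in b, so the result is NaN
result = a + b
print(result)
```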

### Applying custom functions

A custom function can be applied to all the elements of the `Series` using the function `apply()`.

In [None]:
def describe_price(price):
    if price <= 2:
        return 'Cheap'
    elif price >= 5:
        return 'Expensive'
    else:
        return 'Fair'
    
transactions['unit_price'].apply(describe_price)

### Mapping values

We can map specific values from a `Series` to other custom values by passing them as a dictionary to the function `map()`.

In [None]:
price_descriptions = {
    5: 'EXPENSIVE!',
    1: 'CHEAP!'
}

transactions['unit_price'].map(price_descriptions)

Note: values that are not explicitly described in the mapping are converted to null values.
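
A small sketch makes the null values visible. The prices are made up, and the `'REGULAR'` default label is a hypothetical choice for illustration:

```python
import pandas as pd

prices = pd.Series([1, 3, 5])
labels = prices.map({5: 'EXPENSIVE!', 1: 'CHEAP!'})

# 3 is not a key in the dictionary, so it becomes a null value
print(labels)

# One option is to replace the nulls with a default label
filled = labels.fillna('REGULAR')
print(filled)
```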

### Working with text

The `.str` attribute of a `Series` offers an interface that closely mirrors the string manipulation functions of regular Python strings, plus a few pandas-specific additions such as `contains()`.

The functions are applied element-wise.

Examples of functions:

- `upper()`, `lower()`, `capitalize()`
- `split()`
- `replace()`
- `startswith()`, `endswith()`, `contains()`
- `islower()`, `isupper()`, `isalpha()`, `isdigit()`

In [None]:
transactions['product'].str.upper()

In [None]:
transactions['product'].str.contains('Ice')
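
A few of the other functions from the list can be sketched on a small, made-up `Series` of product names. Passing `regex=False` to `replace()` is a choice made here so the pattern is treated as literal text.

```python
import pandas as pd

# Made-up product names
products = pd.Series(['Ice Cream', 'iced tea', 'Apple Pie'])

# Each call is applied to every element of the Series
lowered = products.str.lower()
starts_with_ice = products.str.startswith('Ice')
underscored = products.str.replace(' ', '_', regex=False)
words = products.str.split(' ')

print(lowered)
print(starts_with_ice)
print(underscored)
print(words)
```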

**Try it yourself**: if you're not familiar with regular Python strings, try different functions from the list above to build an intuition of how they work.

### Working with dates

The `.dt` attribute of a `Series` offers a datetime-like interface. The functions/attributes are applied element-wise.

In [None]:
clients['date_of_birth'].dt.year

In [None]:
clients['date_of_birth'].dt.strftime('%d %B, %Y')
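
Other parts of a date are available in the same way. A sketch on made-up dates, using the `month` attribute and the `day_name()` function:

```python
import pandas as pd

# Made-up dates of birth
births = pd.Series(pd.to_datetime(['2000-01-01', '1985-07-23']))

months = births.dt.month          # month as a number
weekdays = births.dt.day_name()   # name of the day of the week

print(months)
print(weekdays)
```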

## 7. Sorting Values

The `sort_values()` function sorts by value.

The default behaviour is to sort low-to-high, numerically or alphabetically depending on the column's data type.

In [None]:
transactions.sort_values(by='product')

In [None]:
clients.sort_values(by='date_of_birth', ascending=False)

Note: `NaT` at the end of the output is similar to `NaN`, but for datetime columns.
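
Missing values are placed last by default, whatever the sort direction. The sketch below uses made-up dates to show this, along with the `na_position` argument of `sort_values()`, which moves them to the top instead:

```python
import pandas as pd

# Made-up dates of birth, one of them missing
df = pd.DataFrame({
    'date_of_birth': pd.to_datetime(['1990-04-12', None, '1975-09-30']),
})

# Missing values go last by default, even when sorting high-to-low
newest_first = df.sort_values(by='date_of_birth', ascending=False)

# na_position='first' puts the missing values at the top
missing_first = df.sort_values(
    by='date_of_birth', ascending=False, na_position='first'
)

print(newest_first)
print(missing_first)
```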

## Exercises

Once you are familiar with the concepts described in this notebook, please continue with the following notebook:

[Exercises on pandas basics](01.1%20-%20Exercises%20on%20pandas%20basics.ipynb)