# Chapter I - Pandas
## Part 1: Introduction

In this notebook we will cover how to:
- work with the two main data types in `pandas`: `DataFrame` and `Series`
- work with data types in `pandas`, especially strings and dates
- load data from JSON and CSV into a `DataFrame`
- manipulate the columns of a `DataFrame`
- access data in a `DataFrame` by means of indexes and slicing

In [2]:
from random import random
import pandas as pd
import numpy as np

### Section (1): Creating a first dataframe from scratch

This section will take you through the steps needed to create a `pandas` DataFrame from scratch.

To create a DataFrame from scratch, you need the values for its columns.
The values of each column are stored in a data type called a `Series`, which can be thought of as the `pandas` version of a list.

A pandas `Series` can be created as follows:

#### A) Pandas Series

In [8]:
s = pd.Series([1,2,3])

print(s)
print(' > The type of s is:', type(s))

0    1
1    2
2    3
dtype: int64
 > The type of s is: <class 'pandas.core.series.Series'>
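The list analogy only goes so far. As a small illustrative sketch (the names `values`, `s_demo` and `labelled` are made up for this example), arithmetic on a `Series` is applied element by element, and a `Series` can carry its own labels:

```python
import pandas as pd

values = [1, 2, 3]
s_demo = pd.Series(values)

# Multiplying a list repeats it; multiplying a Series scales each element.
doubled_list = values * 2
doubled_series = s_demo * 2

# A Series can also carry custom labels instead of the default 0..n-1 index.
labelled = pd.Series(values, index=["a", "b", "c"])

print(doubled_list)              # [1, 2, 3, 1, 2, 3]
print(doubled_series.tolist())   # [2, 4, 6]
print(labelled["b"])             # 2
```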


✏️ [Ex.1] 
- ✏️ Create a series called `s` containing 100 random numbers ranging between 0 and 1

In [9]:
## SOLUTION:
# s = pd.Series([random() for n in range(0, 100)])

A series consists of an **index** and a set of **values**, which can be accessed via the properties of the same name (`.index` and `.values`).
- The data type of the **index** is a `pandas RangeIndex`, akin to a Python `range`.
- The data type of the **values** is a `numpy array`.
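A quick way to check the two types described above, sketched on a throwaway series (`s_demo` is a name chosen for this example only):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([10, 20, 30])

# With no explicit index, pandas assigns a RangeIndex starting at 0.
print(s_demo.index)
print(list(s_demo.index))        # [0, 1, 2]

# The underlying data is exposed as a numpy array.
print(type(s_demo.values))
```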

✏️ [Ex.2] 
- ✏️ Using the series **index**, print the length of the `Series`
- ✏️ Print the first three elements of the **values** of series `s`.

In [16]:
## SOLUTION:
# print(len(s.index))
# print(s.values[:3])

Pandas `Series` provide useful methods that give easy access to information about the data they hold.
Some of them are:
- `head(n)` and `tail(n)` to access the beginning and end of the series, where `n` is the number of values to get.
- `value_counts()` to count the occurrences of each distinct value in the series. It returns another `Series`, whose `.index` holds the distinct values and whose `.values` holds their counts.
- `min()`, `max()`, `mean()`, `median()` give some basic statistics on the series' data.
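The methods listed above can be tried on small hand-made series. This is only a sketch; the `colours` and `numbers` data are invented for the example:

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() returns a Series: index = distinct values, values = counts.
counts = colours.value_counts()
print(type(counts))
print(counts["red"])              # 3

print(colours.head(2).tolist())   # ['red', 'blue']
print(colours.tail(2).tolist())   # ['red', 'blue']

numbers = pd.Series([4, 8, 15, 16, 23, 42])
print(numbers.min(), numbers.max(), numbers.mean(), numbers.median())
```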

✏️ [Ex.3] 
- ✏️ Calculate the range of values in `s`
- ✏️ Find if there are some duplicate values in `s`
- ✏️ Calculate the mean of the first 50 values in `s`

In [27]:
## SOLUTION:
# print(s.max()-s.min())
# print(max(s.value_counts().values) > 1)  # True if at least one value occurs more than once
# print(s.head(50).mean())

Some of you might want to manipulate time data in the form of dates, and `pandas` makes this very convenient.

To do that, use the dedicated `pandas` date type, called `Timestamp`.

For example, VE Day can be encoded as follows:

In [50]:
print(pd.Timestamp(1945, 5, 8))

1945-05-08 00:00:00


A date can also be encoded as a string, and pandas will do its best to convert it to a timestamp.

Note that it flexibly supports both 'YYYYMMDD' and 'YYYYMMDDHHMMSS'.

In [82]:
print(pd.Timestamp('19450508'))
print(pd.Timestamp('19690711025615'))

# What happens if you try to create a Timestamp with a date that doesn't exist? Try it out.

1945-05-08 00:00:00
1969-07-11 02:56:15


The difference between two `Timestamp` objects is a `Timedelta` object. The number of whole days contained in the time difference can be accessed through its `days` property:

In [65]:
print((pd.Timestamp('19690711025615') - pd.Timestamp('19450508')).days)

8830

A date can be shifted simply by adding a `Timedelta` to it:

In [73]:
print(pd.Timestamp('19450508')+pd.Timedelta('55 days 2 hours 15 minutes 10 seconds'))

1945-07-02 02:15:10
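Note that `days` only counts whole days. As a brief sketch reusing the two dates from above, the remainder of a `Timedelta` is available through `seconds`, and the full duration through `total_seconds()`:

```python
import pandas as pd

delta = pd.Timestamp('19690711025615') - pd.Timestamp('19450508')

print(delta.days)              # whole days only: 8830
print(delta.seconds)           # leftover seconds within the last day: 10575
print(delta.total_seconds())   # the full duration expressed in seconds
```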


✏️ [Ex.4] 
- ✏️ Create a list of pandas `Timestamps` of all the days between the 24th May 1819 and the 22nd January 1901.
- ✏️ By converting this list into a pandas `Series`, get the median day of this time interval.


In [107]:
## SOLUTION 
# start = pd.Timestamp('18190524')
# end = pd.Timestamp('19010122')
# all_dates = [start]
# max_date = start
# while max_date < end:
#     max_date = max_date + pd.Timedelta('1 day')
#     all_dates.append(max_date)

# all_dates_series = pd.Series(all_dates)

# print(all_dates_series.median())
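The loop above works, but it is not the only way. One possible alternative, sketched here under the assumption that you are happy to use `pd.date_range` (not introduced elsewhere in this notebook), builds all the days in a single call:

```python
import pandas as pd

start = pd.Timestamp('18190524')
end = pd.Timestamp('19010122')

# date_range includes both endpoints by default; freq='D' means one entry per day.
all_dates_series = pd.Series(pd.date_range(start, end, freq='D'))

print(len(all_dates_series))
print(all_dates_series.median())
```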

#### B) Pandas DataFrames

What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` that have the same length and whose indexes are in sync. A *collection* means that each column of a dataframe is a series.
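To make the idea of columns being series concrete, here is a minimal sketch. The `weather` data and its day labels are invented for illustration only:

```python
import pandas as pd

# Two Series sharing the same index labels.
temperature = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])
humidity = pd.Series([0.40, 0.55, 0.35], index=["mon", "tue", "wed"])

weather = pd.DataFrame({"temperature": temperature, "humidity": humidity})

# Pulling a column back out gives a Series with the same index as the dataframe.
column = weather["temperature"]
print(type(column))
print(column.index.equals(weather.index))   # True
```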

Let's create a toy `DataFrame` by hand. 

In [94]:
dates = [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14), pd.Timestamp(1986, 3, 14), pd.Timestamp(1993, 1, 1), pd.Timestamp(1998, 7, 14)]
events = ['birth', 'anniversary', 'wedding', 'wedding', 'anniversary']


From those two lists, you can create a `DataFrame` by passing to `pd.DataFrame` a dictionary:

In [97]:
toy_df = pd.DataFrame({
    "date": dates,
    "event": events
})

# What do you expect when dates and events are changed from lists to Series? Try it out.
# What will happen if the lists are of different lengths?

You can check that the `DataFrame` has been properly constructed. Notice how it is indeed of a tabular shape.

In [96]:
display(toy_df)

        date        event
0 1970-05-23        birth
1 1978-07-14  anniversary
2 1986-03-14      wedding
3 1993-01-01      wedding
4 1998-07-14  anniversary


✏️ [Ex.5] 
- ✏️ Create a list of pandas `Timestamps` of 200 random dates between 1900 and 2000.
- ✏️ Create a list of 200 events taken *at random* among ['birth', 'anniversary', 'wedding']. The numpy function `np.random.choice()` might help.
- ✏️ Combine those two lists to create a dataframe
- ✏️ Get the count of occurrences of all **events** in the dataframe


In [104]:
## SOLUTION
# dates = [pd.Timestamp(int(random()*100)+1900, int(random()*12)+1, int(random()*28)+1) for n in range(0, 200)]
# events = [np.random.choice(['birth', 'anniversary', 'wedding']) for n in range(0, 200)]

# df = pd.DataFrame({
#     "date": dates,
#     "event": events
# })

# print(df.event.value_counts())
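As a possible variation on the solution above, both lists can be drawn without explicit loops. This is only a sketch: it assumes numpy's `default_rng` generator and `pd.to_timedelta`, neither of which is covered in this notebook, and the seed and the name `random_df` are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only

start = pd.Timestamp(1900, 1, 1)
end = pd.Timestamp(2000, 1, 1)
n_days = (end - start).days

# Draw 200 day offsets from the start date, and 200 events, in one call each.
offsets = rng.integers(0, n_days, size=200)
dates = start + pd.to_timedelta(offsets, unit='D')
events = rng.choice(['birth', 'anniversary', 'wedding'], size=200)

random_df = pd.DataFrame({"date": dates, "event": events})
print(random_df['event'].value_counts())
```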

### Section (2): First manipulations of the dataframe

The columns of a `pandas.DataFrame` can be accessed as follows:

In [115]:
toy_df['date']

0   1970-05-23
1   1978-07-14
2   1986-03-14
3   1993-01-01
4   1998-07-14
Name: date, dtype: datetime64[ns]

This returns a `pandas.Series`, the type we saw in the introductory section of this notebook. To access its **values**, use the `.values` property:

In [119]:
print(type(toy_df['date']))
print(toy_df['date'].values)

<class 'pandas.core.series.Series'>
['1970-05-23T00:00:00.000000000' '1978-07-14T00:00:00.000000000'
 '1986-03-14T00:00:00.000000000' '1993-01-01T00:00:00.000000000'
 '1998-07-14T00:00:00.000000000']
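Bracket access is not the only option. The sketch below, on a small stand-in dataframe (`df_demo` is a name made up for this example), compares three common ways of selecting columns:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14)],
    "event": ["birth", "anniversary"],
})

single = df_demo["event"]             # one label -> Series
same = df_demo.event                  # attribute access, for names that are valid identifiers
subset = df_demo[["event", "date"]]   # list of labels -> DataFrame, in the order given

print(type(single).__name__, type(subset).__name__)
```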


Each column in a `pandas.DataFrame` has a data type. Making sure that the right data type is used is essential.

Depending on the nature of the data, its type can be changed using the method `.astype()`. 

For example, changing from a `pandas.Timestamp` to a `str` is possible:

In [125]:
print(toy_df['date'].astype(str))

0    1970-05-23
1    1978-07-14
2    1986-03-14
3    1993-01-01
4    1998-07-14
Name: date, dtype: object
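A common case where the data type matters is numbers that were loaded as text. As a small sketch (the `raw` series is invented for this example), `.astype()` turns them into integers so that arithmetic works as expected:

```python
import pandas as pd

raw = pd.Series(["1", "2", "3"])   # numbers stored as text
as_int = raw.astype("int64")       # explicit 64-bit integers

print(raw.dtype, "->", as_int.dtype)
print(as_int.sum())                # 6
```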


But changing from a `pandas.Timestamp` to a `float` is not possible:

In [126]:
## What do you expect when you run the following?
# print(toy_df['date'].astype(float))

#### A) Accessor properties

For certain data types (string, datetime), `pandas` provides a number of common methods that can be called on any series containing values of that type. These methods are exposed through a property of the series, called an *accessor*, named after the data type:

- the `.dt.*` accessor contains methods to operate on `datetime` series
- the `.str.*` accessor contains methods to operate on `str` (string) series.

As you will see in a moment, these methods are very convenient when filtering rows of a dataset based on the value of a certain column.
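As a first taste of this kind of filtering, here is a minimal sketch. It rebuilds the toy data under the made-up name `df_demo` so that it runs on its own:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14),
             pd.Timestamp(1986, 3, 14), pd.Timestamp(1993, 1, 1),
             pd.Timestamp(1998, 7, 14)],
    "event": ["birth", "anniversary", "wedding", "wedding", "anniversary"],
})

# .dt exposes datetime attributes for every row at once.
years = df_demo["date"].dt.year
after_1980 = df_demo[years > 1980]

# .str exposes string methods; contains() returns one boolean per row.
weddings = df_demo[df_demo["event"].str.contains("wed")]

print(len(after_1980))   # 3
print(len(weddings))     # 2
```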

In [111]:
toy_df.date.astype(str)

0    1970-05-23
1    1978-07-14
2    1986-03-14
3    1993-01-01
4    1998-07-14
Name: date, dtype: object

##### `datetime` accessor

To work with datetime series, `pandas` provides a number of useful methods, which can be called from the `.dt` property of a datetime series.

They can be used to:
- convert from one timezone to another
- get the day/day name/month/year information from each date
- and much more (see the [documentation]())

In [ ]:
# s1 is not defined earlier in this notebook; here it is assumed to be the date column of toy_df
s1 = toy_df['date']
s1.head()

In [ ]:
s1.dt.day_of_week.head()

##### `str` accessor

In [ ]:
s = pd.Series(["One", "TWO", "tHrEE"])

Accessors can be used to filter a series by testing whether each value satisfies a certain condition, as is the case with `contains()`. Such methods return a boolean value (`True` or `False`) for each element of the series.

In [ ]:
s.str.contains('o')

In [ ]:
s.str.contains('O')

Other methods can be used, instead, to manipulate an entire series, e.g. `lower()` and `upper()`.

In [ ]:
s.str.lower()

#### B) Exploring a dataframe

Exploring a dataframe: `df.head()`, `df.tail()`, `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many `null` values does each column contain (!)?

In [ ]:
toy_df.info()

Alternatively, if you only need to know the number of rows and columns, you can use the `.shape` property.

It returns a tuple with 1) number of rows, 2) number of columns.

In [ ]:
toy_df.shape

`head()` prints the first five rows of a dataframe by default:

In [ ]:
toy_df.head()

But the number of rows displayed is a parameter that can be changed:

In [ ]:
toy_df.head(2)

`tail()` does the opposite, i.e. prints the last n rows in the dataframe:

In [ ]:
toy_df.tail()
