# Chapter I - Pandas I

In this notebook we will cover how to:
- work with the two main data types in `pandas`: `DataFrame` and `Series`
- work with data types in `pandas`, especially strings and dates
- load data from JSON and CSV into a `DataFrame`
- manipulate the columns of a `DataFrame`
- access data in a `DataFrame` by means of indexes and slicing

In [2]:
from random import random
import pandas as pd
import numpy as np

## Section (1): Creating a first dataframe from scratch

This section will take you through the steps needed to create a `pandas` DataFrame from scratch.

To create a DataFrame from scratch, you need the values for one or more columns.
Those values are stored in a data type called a `Series`, which can be thought of as the `pandas` version of a list.

### A) Pandas Series

A pandas `Series` can be created as follows:

In [8]:
s = pd.Series([1,2,3])

print(s)
print(' > The type of s is:', type(s))

0    1
1    2
2    3
dtype: int64
 > The type of s is: <class 'pandas.core.series.Series'>


✏️ [Ex.1] 
- ✏️ Create a series called `s` containing 100 random numbers ranging between 0 and 1
- ✏️ Print the first value of the `Series` you just created.

In [152]:
# your solution

0.5527028611138213


Each observation in the series has an **index** as well as a **value**: both can be accessed via the properties of the same name, `.index` and `.values`.
- The data type of the **index** is a `pandas RangeIndex`, akin to a Python `range`.
- The data type of the **values** is a `numpy array`.
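
As a small illustration of these two properties, here is a sketch on a separate series (the name `numbers` is made up for this example and is not used elsewhere in the notebook):

```python
import numpy as np
import pandas as pd

# Illustrative series, unrelated to the exercises
numbers = pd.Series([10, 20, 30, 40])

print(numbers.index)          # the default index, a RangeIndex
print(list(numbers.index))    # [0, 1, 2, 3]
print(type(numbers.values))   # a numpy array
print(numbers.values[:2])     # numpy slicing works on the values
```

Because the values are a plain `numpy` array, anything you know about `numpy` slicing applies to them.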

✏️ [Ex.2] 
- ✏️ Using the series **index**, print the length of the `Series`
- ✏️ Print the first three elements of the **values** of series `s`.

In [16]:
# your solution

Pandas `Series` come with useful methods that give quick access to information on the data they hold.
Some of them include:
- `head(n)` and `tail(n)` to access the beginning and end of the series, where `n` is the number of values to get.
- `value_counts()` to show the occurrences of all values in the series. Calling this method returns a new `Series` whose `.index` holds the distinct values and whose `.values` hold the corresponding counts.
- `min()`, `max()`, `mean()`, `median()` give some basic statistics on the series' data.
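
The methods listed above can be tried on two throwaway series. This is only a sketch: `colours` and `numbers` are hypothetical names chosen for the example.

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() returns a Series: distinct values in the index, counts in the values
counts = colours.value_counts()
print(counts.index.tolist())   # ['red', 'blue', 'green']
print(counts.values.tolist())  # [3, 2, 1]

numbers = pd.Series([4, 8, 15, 16, 23, 42])
print(numbers.head(2).tolist())  # [4, 8]
print(numbers.tail(2).tolist())  # [23, 42]
print(numbers.min(), numbers.max(), numbers.mean(), numbers.median())
```
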

✏️ [Ex.3] 
- ✏️ Calculate the range of values in `s`
- ✏️ Find if there are some duplicate values in `s`
- ✏️ Calculate the mean of the first 50 values in `s`

In [27]:
# your solution

Some of you might want to manipulate time data in the form of dates, and `pandas` makes this very convenient.

To do that, you should use the dedicated `pandas` date type, called `Timestamp`.

For example, VE-day can be encoded as such:

In [50]:
print(pd.Timestamp(1945, 5, 8))

1945-05-08 00:00:00


A date can also be encoded as a string, and pandas will do its best to convert it to a timestamp.

Note that it flexibly supports both 'YYYYMMDD' and 'YYYYMMDDHHMMSS'.

In [82]:
print(pd.Timestamp('19450508'))
print(pd.Timestamp('19690711025615'))

# What happens if you try to create a Timestamp with a date that doesn't exist? Try it out.

1945-05-08 00:00:00
1969-07-11 02:56:15


The difference between two `Timestamps` is a `Timedelta` object. The number of days contained in the time difference can be accessed through the eponymous property:

In [65]:
print((pd.Timestamp('19690711025615') - pd.Timestamp('19450508')).days)

8830

A date can be shifted simply by adding to it a `Timedelta`:

In [73]:
print(pd.Timestamp('19450508')+pd.Timedelta('55 days 2 hours 15 minutes 10 seconds'))

1945-07-02 02:15:10


✏️ [Ex.4] 
- ✏️ Create a list of pandas `Timestamps` of all the days between the 24th May 1819 and the 22nd January 1901.
- ✏️ By converting this list into a pandas `Series`, get the median day of this time interval.


In [107]:
# your solution

### B) Pandas DataFrames

What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` having the same length and whose indexes are in sync. A *collection* means that each column of a dataframe is a series.

Let's create a toy `DataFrame` by hand. 

In [243]:
dates = [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14), pd.Timestamp(1986, 3, 14), pd.Timestamp(1993, 1, 1), pd.Timestamp(1998, 7, 14)]
events = ['birth', 'anniversary', 'wedding', 'wedding', 'anniversary']


From those two lists, you can create a `DataFrame` by passing to `pd.DataFrame` a dictionary:

In [244]:
toy_df = pd.DataFrame({
    "date": dates,
    "event": events
})

# What do you expect when dates and events are changed from lists to Series? Try it out.
# What will happen if the lists are of different lengths?

You can check that the `DataFrame` has been properly constructed. Notice how it is indeed of a tabular shape. To extract its length, you can use `len(DataFrame)`. 

In [245]:
print('> This DataFrame has length:', len(toy_df))
display(toy_df)

> This DataFrame has length: 5


Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,wedding
3,1993-01-01,wedding
4,1998-07-14,anniversary


✏️ [Ex.5] 
- ✏️ Create a list of pandas `Timestamps` of 200 random dates between 1900 and 2000.
- ✏️ Create a list of 200 events taken *at random* among ['birth', 'anniversary', 'wedding']. The numpy function `np.random.choice()` might help.
- ✏️ Combine those two lists to create a dataframe
- ✏️ Get the count of occurrences of all **events** in the dataframe


In [104]:
# your solution

Once the `DataFrame` exists, you can add a column in exactly the same way you would define a new key/value pair in a dictionary:

`dataframe_name['new_column'] = values`


Here, the datatype of `values` is quite flexible, as `pandas` allows many inputs: `pandas.Series` as we've seen before, but also `numpy.array` or even simple lists.

The only condition is that the new column must have the same length as the `DataFrame`.

There is one exception to that rule: if all rows of the new column have the same value, you can just pass that value as input.
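
To make these rules concrete, here is a minimal sketch on a hypothetical three-row frame (`demo_df` and its column names are invented for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo_df = pd.DataFrame({"event": ["birth", "wedding", "anniversary"]})

demo_df["from_list"] = [1, 2, 3]                     # plain Python list
demo_df["from_array"] = np.array([0.1, 0.2, 0.3])    # numpy array
demo_df["from_series"] = pd.Series(["x", "y", "z"])  # pandas Series, aligned on the index
demo_df["constant"] = "same value"                   # single value, repeated on every row

# A list-like input of the wrong length is rejected
try:
    demo_df["too_short"] = [1, 2]
except ValueError as error:
    print("Length mismatch:", error)

print(demo_df)
```
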

✏️ [Ex.6] 
- ✏️ Add a new column named `author_firstname` to the dataframe created in [Ex.5]. 
- ✏️ This new column should be filled using a list-like variable, containing your first name as many times as there are rows.
- ✏️ Add a new column named `author_lastname` to the dataframe, this time containing your last name, and without using a list-like input.



In [186]:
# your solution

## Section (2): First manipulations of the dataframe

### A) General information on the dataframe

Some first pieces of information on a dataframe are given by the following useful functions: `df.head()`, `df.tail()`, `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many `null` values does each column contain (!)?


In [174]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    5 non-null      datetime64[ns]
 1   event   5 non-null      object        
dtypes: datetime64[ns](1), object(1)
memory usage: 212.0+ bytes


Alternatively, if you need to know only the number of columns and rows you can use the `.shape` property. 

It is a property, not a method: it is therefore accessed without parentheses.

Calling the property returns a tuple with 1) number of rows, 2) number of columns.



In [176]:
toy_df.shape

(5, 2)

`head()` shows by default the first five rows of a dataframe:


In [177]:
toy_df.head()


Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,wedding
3,1993-01-01,wedding
4,1998-07-14,anniversary


But the number of lines displayed is a parameter that can be changed:


In [178]:
toy_df.head(2)


Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary


`tail()` does the opposite, i.e. it shows the last `n` rows of the dataframe (five by default):

In [179]:
toy_df.tail()

Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,wedding
3,1993-01-01,wedding
4,1998-07-14,anniversary


You may sometimes want to sort the dataframe based on the values in one column.

To do this, you may use the `.sort_values(<column>)` method. 

The column will then be sorted depending on the datatype:
- numerically (float, integers);
- chronologically (datetimes);
- alphabetically (strings).

In [270]:
toy_df.sort_values('event')

Unnamed: 0,date,event
1,1978-07-14,anniversary
4,1998-07-14,anniversary
0,1970-05-23,birth
3,1993-01-01,marriage
2,1986-03-14,unknown


If you want to invert the sorting (z-to-a, 9-to-1, etc.), use the `ascending=False` argument.

In [271]:
toy_df.sort_values('event', ascending=False)

Unnamed: 0,date,event
2,1986-03-14,unknown
3,1993-01-01,marriage
0,1970-05-23,birth
1,1978-07-14,anniversary
4,1998-07-14,anniversary


### B) Columns and datatype

The columns of a `pandas.DataFrame` can be accessed as follows:

In [115]:
toy_df['date']

0   1970-05-23
1   1978-07-14
2   1986-03-14
3   1993-01-01
4   1998-07-14
Name: date, dtype: datetime64[ns]

It returns a `pandas.Series`, the type we've seen in the introductory section of this notebook. To access its **values**, use the `.values` property:

In [119]:
print(type(toy_df['date']))
print(toy_df['date'].values)

<class 'pandas.core.series.Series'>
['1970-05-23T00:00:00.000000000' '1978-07-14T00:00:00.000000000'
 '1986-03-14T00:00:00.000000000' '1993-01-01T00:00:00.000000000'
 '1998-07-14T00:00:00.000000000']


Each column in a `pandas.DataFrame` has a data type. Being sure that the right datatype is used is essential.

Depending on the nature of the data, its type can be changed using the method `.astype()`. 

For example, changing from a `pandas.Timestamp` to a `str` is possible:

In [125]:
print(toy_df['date'].astype(str))

0    1970-05-23
1    1978-07-14
2    1986-03-14
3    1993-01-01
4    1998-07-14
Name: date, dtype: object


But changing from a `pandas.Timestamp` to a `float` is not possible:

In [126]:
## What do you expect when you run the following?
# print(toy_df['date'].astype(float))
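
A more common use of `.astype()` is turning numbers stored as text into real numbers. The following is a sketch under the assumption that every string is a clean integer; the series name `years_as_text` is made up for the example.

```python
import pandas as pd

# Years stored as text, as often happens after loading a file
years_as_text = pd.Series(["1841", "1888", "1892"])

years_as_int = years_as_text.astype(int)
print(years_as_int.max() - years_as_int.min())  # arithmetic now works: 51

# A single value that cannot be parsed makes the whole conversion fail
try:
    pd.Series(["1841", "n/a"]).astype(int)
except ValueError as error:
    print("Conversion failed:", error)
```
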

### C) Accessor properties

For certain data types (string, datetime), `pandas` provides a number of common methods that can be called on any series containing values of that type. These methods are grouped under a property of the series, called an *accessor*, that is named after the data type:

- the `.dt` accessor contains methods to operate on `datetime` series
- the `.str` accessor contains methods to operate on `str` (string) series.

Accessors are amongst the most convenient features of data manipulation in `pandas`.

They act on a `pandas.Series`, typically the column of a `DataFrame` and return a `pandas.Series` of the same length. The output new series is the result of the element-wise operation on the input series.

Let's start with temporal manipulation:

#### `datetime` accessor

To work with datetime series, `pandas` provides many useful methods, which can be called from the `.dt` property of a datetime series.

They can be used to:
- convert from one timezone to another
- get the day/day name/month/year information from each date
- and much more (see the `pandas` documentation on datetime properties)



In [139]:
type(toy_df['date'].dt.day_name())

pandas.core.series.Series
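
A few of the `.dt` features listed above, sketched on a separate two-element series (`moments` is a hypothetical name, and the times of day are arbitrary):

```python
import pandas as pd

moments = pd.Series(pd.to_datetime(["1945-05-08 15:00", "1969-07-20 20:17"]))

print(moments.dt.year.tolist())        # [1945, 1969]
print(moments.dt.month.tolist())       # [5, 7]
print(moments.dt.day_name().tolist())  # ['Tuesday', 'Sunday']

# Timezones: first declare which zone the naive timestamps are in, then convert
in_utc = moments.dt.tz_localize("UTC")
in_paris = in_utc.dt.tz_convert("Europe/Paris")
print(in_paris)
```

Each call returns a new series of the same length, which is why the results can be stored directly as new columns.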

✏️ [Ex.7] 
To see this in action: 
- ✏️ Access your dataframe's `date` column
- ✏️ Print for each date the corresponding day of the week. To do this, you should use the `datetime` accessor, and use the method `day_name`. This will return a `pandas.Series`.
- ✏️ Add the day of the week as a new column.
- ✏️ Try to do this again in a one-liner.

In [None]:
# your solution

#### `str` accessor

Much like the `datetime` accessor, the `str` one is the entry door to many very useful methods that you may need to tidy, process, or analyse your data.

Among other things, you can easily:
- test if the string starts with another string, 
- convert between lower and upper case, 
- determine if the string matches a regular expression,
- replace one substring with another.

For example, if you want to check if the first three letters of the `event` column are those of "wedding", you can use:

In [164]:
toy_df['event'].str.startswith('wed')

0    False
1    False
2     True
3     True
4    False
Name: event, dtype: bool

You can also chain the accessors. However, remember that the output of an accessor method is a `pandas.Series`. You will therefore need to access `str` again!

For example, if you want first to capitalise a column, before checking whether it starts with the first letters of "wedding", you can do:

In [169]:
#This time, we match with "Wed"

toy_df['event'].str.capitalize().str.startswith('Wed')

0    False
1    False
2     True
3     True
4    False
Name: event, dtype: bool
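
The other `str` operations mentioned earlier follow the same pattern. Here is a brief sketch on a deliberately messy series (`labels` is a hypothetical name and the values are invented):

```python
import pandas as pd

labels = pd.Series(["birth", "Wedding ", "anniversary"])

# Chain two accessor calls: strip whitespace, then lower-case
cleaned = labels.str.strip().str.lower()
print(cleaned.tolist())  # ['birth', 'wedding', 'anniversary']

# Regular expression: starts with "an" OR ends with "ing"
print(cleaned.str.contains(r"^an|ing$", regex=True).tolist())  # [False, True, True]

# Literal substring replacement
print(cleaned.str.replace("wedding", "marriage", regex=False).tolist())
```
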

✏️ [Ex.8] 
- ✏️ Using the `str` accessors, create a new column `is_weekend` that states if the date fell during a weekend.
- ✏️ The new column should be a boolean (True/False).

- ✏️ Again, using accessors, replace the first three letters of your last name with the first three letters of your first name, and vice-versa. This time the new last name should be lower case.
- ✏️ (Advanced) Try to write an expression where the letters are not manually entered, and that would work for any dataframe with columns 'author_firstname' and 'author_lastname'.

In [None]:
# your solution

## Section (3): Input/Output

Up until now, we have created and manipulated data from scratch.

Most often, however, dataframes are created by loading existing data by means of the `pandas` input/output methods:

- Either by loading a complete dataframe, for example if you want to manipulate a CSV file;
- Or by loading the data columns independently, and combining them into one dataframe.

#### From JSON

A very common data format is JSON: it is explicit, efficient, and widely used over the internet.

We will take here the example of some data on books from the British Library. We extracted it from the BL as a JSON file. You may face such a scenario in your research.

Loading data from a JSON file is very similar to creating a `DataFrame` from a `dict`, like we've done in Section (1).

This is how one would do it in pure Python:

In [181]:
import json
json_file_path = '../data/bl_books/sample/book_data_sample.json'

# JSON data gets read into a dictionary

with open(json_file_path, 'r') as jsonfile:
    json_data = json.load(jsonfile)
    
books_df = pd.DataFrame(json_data)

In [183]:
books_df.head()

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1841,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1888,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1847,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1892,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1863,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}


Since reading from files is a very common operation in any data analysis workflow, `pandas` provides methods to read from a variety of formats (JSON, CSV, clipboard, etc.)

The block of code above can be replaced by the following one-liner:

In [184]:
books_df = pd.read_json(json_file_path)



In [185]:
books_df.head()

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1970-01-01 00:31:28,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1970-01-01 00:30:47,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1970-01-01 00:31:32,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1970-01-01 00:31:03,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}


#### From CSV

Similarly to `pandas.read_json()`, `pandas.read_csv()` is there to make your life easier when it comes to loading CSV data into a dataframe (and that happens very often!).

This is particularly useful if you want to import files exported from Excel or other tabular data software.

Let's see how to import one of the CSV files from the "Venice Apprenticeship" dataset (`../data/apprenticeship_venice/`). 

They contain information extracted from ~10,000 work contracts in 17th-century Venice, in particular master-apprentice relationships in the glass industry. The apprentices were nicknamed *garzoni*, hence the name of the dataset.

In [192]:
csv_file_path = '../data/apprenticeship_venice/professions_data.csv'

Try the following. Does it work? Before going to the next cell, can you guess why?

In [None]:
garzoni_df = pd.read_csv(csv_file_path)

Let's have a look at the file first:

In [193]:
!head -n 2 ../data/apprenticeship_venice/professions_data.csv

page_title;register;annual_salary;a_profession;profession_code_strict;profession_code_gen;profession_cat;corporation;keep_profession_a;complete_profession_a;enrolmentY;enrolmentM;startY;startM;length;has_fled;m_profession;m_profession_code_strict;m_profession_code_gen;m_profession_cat;m_corporation;keep_profession_m;complete_profession_m;m_gender;m_name;m_surname;m_patronimic;m_atelier;m_coords;a_name;a_age;a_gender;a_geo_origins;a_geo_origins_std;a_coords;a_quondam;accommodation_master;personal_care_master;clothes_master;generic_expenses_master;salary_in_kind_master;pledge_goods_master;pledge_money_master;salary_master;female_guarantor;period_cat;incremental_salary
Carlo Della sosta (Orese) 1592-08-03;asv, giustizia vecchia, accordi dei garzoni, 114, 155;NA;orese;orese;orefice;orefice;Oresi;1;1;1592;08;1592;08;3;0;orese;orese;orefice;orefice;Oresi;1;1;1;Zuan Battista;Amigoni;;;0, 0;Carlo Della sosta;17;1;;;0, 0;1;0;1;1;1;0;0;0;0;0;NA;0


Rather than comma-separated values, the file contains semicolon-separated values.
We thus need to adjust the `sep` parameter to specify which character is used to separate column values.

In [194]:
garzoni_df = pd.read_csv(
    csv_file_path,
    sep=';'
)
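
The effect of `sep` can be reproduced without the data file. The sketch below uses two invented records that only mimic the layout of the real file; with the real data, the default separator may instead raise a parsing error, because some fields contain commas.

```python
import io
import pandas as pd

# Two made-up records (not taken from the dataset), separated by semicolons
raw_text = "a_name;a_age;startY\nCarlo;17;1592\nZuane;14;1601\n"

# Default separator (comma): every line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw_text))
print(wrong.shape)  # (2, 1)

# Correct separator: three proper columns
right = pd.read_csv(io.StringIO(raw_text), sep=";")
print(right.shape)  # (2, 3)
print(right.columns.tolist())
```
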

### Export

Once you have a `DataFrame` to play with in your Python environment, exporting it is quite straightforward. 

Each export format has its own method: `.to_csv()`, `.to_json()`, `.to_html()`, etc.

The argument of the method is simply the `<path name>`, sometimes followed by optional arguments, like which value separator you want to use for `.csv` files:

`<your_dataframe>.to_csv(<export_path>, sep=<which_separator>)`

In [221]:
toy_df.to_csv('example_dataframe.csv', sep=';')
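
To check what an export looks like without writing a file, one option is to export to an in-memory buffer and read it back. This is a sketch on a hypothetical frame named `small_df`; `index=False` is an optional argument that leaves the row index out of the file.

```python
import io
import pandas as pd

small_df = pd.DataFrame({"event": ["birth", "wedding"], "year": [1970, 1986]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
small_df.to_csv(buffer, sep=";", index=False)
print(buffer.getvalue())

# Read the exported text back, using the same separator
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=";")
print(reloaded)
```

Without `index=False`, the index is written as an extra first column, which then comes back as `Unnamed: 0` when the file is reloaded.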

## Section (4): Accessing and manipulating data.

We now have learned to get our hands on larger dataframes, with thousands of rows and dozens of columns, akin to the sort you may manipulate in a typical DH project.

It is now time to learn how to access the data stored in those dataframes.

### Indexing

There are two main ways to find information contained in a dataframe cell:
- either you know exactly where it is in the dataframe, for example if your frame is in a specific order;
- or, you want to access one or more cells based on conditions.

Let's take our books dataframe for the first of those cases. If you want to know the 'title' of the n-th row, you can get it using the `.at` property, which takes the row's index label (the row number, when the default index is used). The structure is:

`<dataframe>.at[<row_number>, <column_name>]`
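
A minimal sketch of this structure on a hypothetical three-row frame (`mini_df` and its contents are invented for the example):

```python
import pandas as pd

mini_df = pd.DataFrame({
    "title": ["First", "Second", "Third"],
    "year": [1841, 1888, 1892],
})

# Read one cell: row label 1, column 'title'
print(mini_df.at[1, "title"])  # Second

# The same syntax can be used to overwrite a single cell
mini_df.at[1, "year"] = 1889
print(mini_df.at[1, "year"])   # 1889
```
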

✏️ [Ex.9]
- ✏️ What is the 'title' of the 26th row?

In [None]:
# your solution

The second case is selection by condition. Taking `books_df` again, we might want to find all books printed in 1841, all books that were privately printed, or all books privately printed in 1841:

In [200]:
books_df.head(2)

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1970-01-01 00:31:28,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}


The property to use to do this is called `.loc`. This is perhaps the most important tool in `pandas`.

It is applied to a `pandas.DataFrame` and returns a subset of that dataframe that matches the conditions given.

What is the structure? Taking up the examples phrased earlier, this is what it looks like. Is that structure clear to you?

In [202]:
# One condition
books_df.loc[(books_df['datefield']=='1841')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
33,1841,"[British Library HMNTS 793.f.25., British Libr...",John Mason,[Ashantee and the Gold Coast: being a sketch o...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['METCALFE, George Edgar.'], '...",1970-01-01 00:30:41,{'1': 'lsidyv38fa87d5'},249060,{},sample/full_texts/000249060_01_text.json,{'0': {'000027': ['11019234505']}}
138,1841,"[British Library HMNTS RB.23.a.16334., British...",Longman & Co.,[The History of Guernsey; with occasional noti...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['DUNCAN, Jonathan - B.A']}",1970-01-01 00:30:41,{'1': 'lsidyv3c4bbd67'},1005994,{},sample/full_texts/001005994_01_text.json,
141,1841,[British Library HMNTS 1295.c.36.],,[The Tragedy of the Seas; or sorrow on the oce...,,http://www.flickr.com/photos/britishlibrary/ta...,New York,monographic,"{'creator': ['ELLMS, Charles.']}",1970-01-01 00:30:41,{'1': 'lsidyv3cb7f1da'},1061079,{},sample/full_texts/001061079_01_text.json,"{'0': {'000415': ['11224532573'], '000012': ['..."
233,1841,[British Library HMNTS 10347.ee.19.(12.)],Hobson and Smiles,[An Abstract of Accounts for fourteen years en...,,http://www.flickr.com/photos/britishlibrary/ta...,Leeds,monographic,{},1970-01-01 00:30:41,{'1': 'lsidyv3c347a87'},2112441,{},sample/full_texts/002112441_01_text.json,


In [203]:
# Another condition
books_df.loc[books_df['publisher']=='Privately printed']

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
284,1899,[British Library HMNTS 11601.ddd.11.(6.)],Privately printed,[Nightshades. [Poems.]],,http://www.flickr.com/photos/britishlibrary/ta...,Paris,monographic,"{'creator': ['MONKSHOOD, G. F. - pseud. [i.e. ...",1970-01-01 00:31:39,{'1': 'lsidyv37212e0a'},2527441,{},sample/full_texts/002527441_01_text.json,


In [204]:
# Both conditions
books_df.loc[(books_df['datefield']=='1841')&(books_df['publisher']=='Privately printed')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,


Conditions can be combined using Boolean logic. We can thus for example search for:
- Condition A **and** Condition B
- Condition A **or** Condition B
- Condition A **and not** Condition C
- etc.

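Inside a `pandas` selection, conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not) rather than the Python keywords `and`, `or` and `not`. Because these operators bind more tightly than comparisons such as `==`, each condition must be wrapped in parentheses.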
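
The three kinds of combination can be sketched on a small hypothetical frame (`shelf`, with invented publishers "A", "B" and "C"). Storing each condition in a variable first keeps the selection readable:

```python
import pandas as pd

shelf = pd.DataFrame({
    "publisher": ["A", "B", "A", "C"],
    "year": [1841, 1841, 1856, 1899],
})

# Each condition is a boolean Series with one value per row
is_1841 = shelf["year"] == 1841
by_a = shelf["publisher"] == "A"

print(shelf.loc[is_1841 & by_a])   # and: row 0
print(shelf.loc[is_1841 | by_a])   # or: rows 0, 1 and 2
print(shelf.loc[is_1841 & ~by_a])  # and not: row 1
```
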

✏️ [Ex.10]

Using the indexing structure we've just seen, try to find which book(s) obey the following conditions:

- ✏️ 1° Published by 'John Murray' in 1856
- ✏️ 2° Printed in London in 1818
- ✏️ 3° Published between 1830 and 1848
- ✏️ 4° Published by 'Longmans & Co.' or 'Henry Colburn'
- ✏️ 5° Published in 1894 but not by 'W. Blackwood & Sons'

In [None]:
# your solution

### NaN values

Throughout your own research, you will most likely stumble upon instances where your data is incomplete: incomplete metadata from an external source, data manipulation exceptions that don't return a value, missing record. 

This will take the form of a cell containing a `None` or `numpy.nan` value.

It is important to know how to filter them out (or in) in `pandas`, especially as they can mess up some functions you might want to apply to the whole dataframe.

For example, let's take our earlier toy dataframe and empty one cell:

In [246]:
toy_df.at[2, 'event'] = np.nan

To find which rows are NA in a specific **column**, we use the `.isna()` method.

It returns a `Series` of booleans telling whether the matching row has a NaN value.

In [249]:
toy_df['event'].isna()

0    False
1    False
2     True
3    False
4    False
Name: event, dtype: bool

Using our `.loc` operator, we can convert that to a subset of the dataframe:

In [250]:
toy_df.loc[toy_df['event'].isna()]

Unnamed: 0,date,event
2,1986-03-14,NaN


Empty cells can be filled with one replacement value. For example here, we might want to prefer "unknown" to a NaN value.

Be careful: when called on a column, `.fillna()` returns a new `pandas.Series` and leaves the dataframe untouched. Therefore we assign its output back to the column:

In [248]:
toy_df['event'] = toy_df['event'].fillna('unknown')

display(toy_df)

Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,unknown
3,1993-01-01,wedding
4,1998-07-14,anniversary
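
A few related tools are worth knowing when dealing with missing values. The sketch below uses a hypothetical frame, `gaps_df`, with gaps in both columns:

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame({
    "event": ["birth", np.nan, "wedding", None],
    "year": [1970, 1980, np.nan, 1990],
})

# Number of missing values per column
print(gaps_df.isna().sum())

# notna() is the opposite of isna()
print(gaps_df["event"].notna().tolist())  # [True, False, True, False]

# dropna() keeps only the rows without any missing value
print(len(gaps_df.dropna()))              # 1

# Most statistics skip missing values by default
print(gaps_df["year"].mean())             # 1980.0
```
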


### Reindexing

To change the value of one or more cells, you should know which rows are concerned by the change.
Like earlier, this can come either:
- from the precise row number (if you know beforehand which row/column cell you want to change)
- from a certain condition being met (let's change all cells that start with 'unk', for example)

The reindexing can be thought of as two distinct parts:
- 1° knowing which rows to change;
- 2° modifying the values of these rows for a specific column.

To do this, use the following structure:

`<dataframe>.loc[<dataframe>[<column_to_test>]==<value_to_check_for>, <column_to_change>] = <new_value>`

Continuing with our toy dataframe, this expression selects all rows whose 'event' is 'wedding':

In [252]:
toy_df.loc[toy_df['event']=='wedding']

Unnamed: 0,date,event
3,1993-01-01,wedding


Let's say you may want to change that into 'marriage':

In [253]:
toy_df.loc[toy_df['event']=='wedding', 'event'] = 'marriage'

But again, the testing column doesn't have to be the one you change.

More generally, this is:
`<dataframe>.loc[<condition>, <column_to_change>] = <new_value>`

For example, say I want to change my last name in our earlier dataframe:

In [257]:
df.loc[df.author_firstname=='paul', 'author_lastname'] = 'Robert'

display(df)

Unnamed: 0,date,event,author_firstname,author_lastname,day_name
0,1925-10-22,wedding,paul,Robert,Thursday
1,1900-11-20,anniversary,paul,Robert,Tuesday
2,1998-01-08,birth,paul,Robert,Thursday
3,1905-08-22,wedding,paul,Robert,Tuesday
4,1978-09-17,anniversary,paul,Robert,Sunday
...,...,...,...,...,...
195,1904-02-19,anniversary,paul,Robert,Friday
196,1956-01-04,anniversary,paul,Robert,Wednesday
197,1908-04-04,anniversary,paul,Robert,Saturday
198,1929-07-07,anniversary,paul,Robert,Sunday


✏️ [Ex. 11]

In this exercise, we will combine what we've seen on Indexing, Detecting NA values, and Reindexing.

The dataframe of books loaded earlier from a JSON file, `books_df`, contains for each book the identifier of a first page scan. This information is in the column `imgs`.
In some cases, however, the image does not exist and the cell is empty.

- ✏️ Find how many books do not have an image (column `imgs`)
- ✏️ Find which books do not have an image
- ✏️ (Advanced) Find the top four geographical origins of books without an image.

In [None]:
# your solution

### Iterating

As a rule of thumb, one should prioritise any data manipulation that can be performed in one shot. `pandas.DataFrame` objects are inefficient when modified frequently, row by row.

However, there are scenarios where you may want to iterate over the rows of the dataframe. For example, to perform row-wise tests or manipulation that require access to more than one column.

To do that, the `.iterrows()` method (for "iteration over rows") should be used. 

Just like `enumerate` in plain Python, the `.iterrows()` method will return two elements at each iteration:
- the index label of the row (with the default index, an integer between 0 and the length of the dataframe minus one);
- the row corresponding to that iteration.

That row is a `pandas.Series`. It acts a bit like a python `dictionary` and the value for a specific column is accessed through:
`row[<column>]`



In [264]:
for index, row in toy_df.iterrows():
    print('The event at index', index, 'is:', row['event'])

The event at index 0 is: birth
The event at index 1 is: anniversary
The event at index 2 is: unknown
The event at index 3 is: marriage
The event at index 4 is: anniversary
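
To illustrate the rule of thumb above, the sketch below computes the same column twice, once row by row and once in one shot. The frame `contracts` and its values are invented for the example.

```python
import pandas as pd

contracts = pd.DataFrame({
    "start_year": [1592, 1601, 1610],
    "length": [3, 5, 4],
})

# Row by row: works, but slow on large dataframes
end_years = []
for index, row in contracts.iterrows():
    end_years.append(row["start_year"] + row["length"])

# One shot: the addition is applied to whole columns at once
contracts["end_year"] = contracts["start_year"] + contracts["length"]

print(end_years)
print(contracts["end_year"].tolist())  # [1595, 1606, 1614]
```

Both approaches give the same numbers; the column-wise version is usually shorter and considerably faster as the number of rows grows.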


✏️ [Ex. 12]

Here we will combine iteration, and `datetime` and `str` accessors. By iterating over the `toy_df` dataframe:

- ✏️ Print the date of each row as well
- ✏️ Print the day of the week that each date corresponds to
- ✏️ Print whether the event starts with 'bir'

In [None]:
# your solution

**Dataset**

For this exercise, we will be working with open-access data from a DH research project: the [*Shakespeare and Company project*](https://shakespeareandco.princeton.edu/).

The dataset we'll be using contains the list of books that were lent out in the 20th century by the celebrated *Shakespeare and Company* library in Paris.

The dataset can be downloaded from the following address: https://dataspace.princeton.edu/handle/88435/dsp01jm214s28p (file size = ~2MB).

You may use CSV or JSON:
- the CSV file will only contain `str`, `float` and `int`;
- whereas the JSON file will contain `lists`.

The choice will be one of comfort. Pick whichever yields the datatypes you are more comfortable with.
 
**Steps**

Start by loading the file into a `pandas.DataFrame` called `SCo`.

**Try to answer the following questions**
- 1° How many records does it contain?


- 2° What is/are the format(s) of the **most purchased** document(s)? How many times was it/were they purchased?
- 3° What is/are the format(s) of the **least borrowed** document(s)? How many times was it/were they borrowed?

- 4° Which is the most borrowed **book**? How many times was it borrowed?

- 5° How many 'Poems' are there in the collection? What is the earliest one? How many don't have a date?

- 6° Replace the empty cells in the `circulation_year` column with "unknown".
- 7° Find which books were in circulation in 1951
- 8° (Advanced) By iterating over the rows of the dataframe, create a dictionary keyed by year that gives how many books from the dataframe were in circulation that year.

In [None]:
# your solution