# Chapter I - Pandas (Introduction)

In this notebook we will cover how to:
- work with the two main data types in `pandas`: `DataFrame` and `Series`
- work with data types in `pandas`, especially strings and dates
- load data from JSON and CSV into a `DataFrame`
- manipulate the columns of a `DataFrame`
- access data in a `DataFrame` by means of indexes and slicing

In [2]:
import pandas as pd
import numpy as np
from random import random

## Motivation

In [3]:
df = pd.read_csv('../data/2024-08-13_data.txt', header=None)
df.columns=['ID', 'Name', 'Math', 'Science', 'English']
df['average'] = df[['Math', 'Science', 'English']].mean(axis=1)
df

Unnamed: 0,ID,Name,Math,Science,English,average
0,101,John Doe,78,85,92,85.0
1,102,Jane Smith,88,79,85,84.0
2,103,Emily Davis,91,89,94,91.333333
3,104,Michael Brown,70,75,80,75.0
4,105,Jessica White,85,93,89,89.0


## Section (1): Creating a first dataframe from scratch

This section will take you through the steps needed to create a `pandas` DataFrame from scratch.

To create a DataFrame from scratch, you need the values for each of its columns.
Those values are stored in a data type called a `Series`, which can be thought of as the `pandas` version of a list.

### A) Pandas Series

A pandas `Series` can be created as follows:

In [4]:
s = pd.Series([1,2,3])

print(s)
print(' > The type of s is:', type(s))

0    1
1    2
2    3
dtype: int64
 > The type of s is: <class 'pandas.core.series.Series'>


✏️ [Ex.1] 
- ✏️ Create a series called `s` containing 100 random numbers ranging between 0 and 1. You may use the `random` function.
- ✏️ Print the first value of the `Series` you just created.

In [5]:
# step 1: create a list of 100 random values
# step 2: convert that list to a Pandas Series
# step 3: get the first element of that Series

# your solution here:


#Step 1
random_list = []

#Step 2
for i in range(100):
    random_list.append(random())
random_series = pd.Series(random_list)
s = random_series
#Step 3
random_series[0]

0.6329100393914142

Each observation in the series has an **index** as well as a **value**: they can be accessed via the properties of the same name, `.index` and `.values`.
- The data type of the **index** is a `pandas RangeIndex`, akin to a Python `range`.
- The data type of the **values** is a `numpy array`.
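
As a minimal sketch of these two properties (the series `demo` below is a made-up example, not part of the exercise data):

```python
import numpy as np
import pandas as pd

# `demo` is a hypothetical series used only for illustration
demo = pd.Series([10, 20, 30])

# The index holds the labels of the observations (0, 1, 2 by default)
print(demo.index)
print(type(demo.index))

# The values are stored in a numpy array
print(demo.values)
print(type(demo.values))
```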

✏️ [Ex.2] 
- ✏️ Using the series **index**, print the length of the `Series`
- ✏️ Print the first three elements of the **values** of series `s`.

In [6]:
# your solution here:

s.values[0:3]

array([0.63291004, 0.14821827, 0.02329244])

Pandas `Series` come with useful methods that give easy access to information on the data in the Series.
Some of them include:
- `head(n)` and `tail(n)` to access the beginning and end of the series, where `n` is the number of values to get.
- `value_counts()` to show the occurrences of all values in the series. Calling this method returns another `Series`, whose `.index` contains the distinct values and whose `.values` contains the number of occurrences of each.
- `min()`, `max()`, `mean()`, `median()` give some basic statistics on the series' data.
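
A minimal sketch of `value_counts()` on a made-up series (the name `colours` and its contents are assumptions for illustration only):

```python
import pandas as pd

# `colours` is a hypothetical series used only for illustration
colours = pd.Series(['red', 'blue', 'red', 'green', 'red'])

counts = colours.value_counts()

print(type(counts))              # value_counts() returns a pandas Series
print(counts.index[0])           # the most frequent value comes first
print(counts['red'])             # number of occurrences of 'red'
print(colours.head(2).tolist())  # first two values of the original series
```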

✏️ [Ex.3] 
- ✏️ Calculate the range of values in `s`
- ✏️ Find if there are some duplicate values in `s`
- ✏️ Calculate the mean of the first 50 values in `s`

In [7]:
# your solution here:

#question 1
s.max() - s.min()

#question 2
s.duplicated().sum()

#question 3
s.head(50).mean()


0.4953951186316323

Some of you might want to manipulate time data in the form of dates. Pandas is very convenient for the manipulation of dates.

To do that, you should use the dedicated `pandas` date type, called `Timestamp`.

For example, VE-day can be encoded as such:

In [8]:
print(pd.Timestamp(1945, 5, 8, 20, 10, 56))

1945-05-08 20:10:56


A date can also be encoded as a string, and pandas will do its best to convert it to a timestamp.

Note that it flexibly supports both 'YYYYMMDD' and 'YYYYMMDDHHMMSS'.

In [9]:
print(pd.Timestamp('19450508'))
print(pd.Timestamp('19690711025615'))

# What happens if you try to create a Timestamp with a date that doesn't exist? Try it out.

1945-05-08 00:00:00
1969-07-11 02:56:15


The difference between two `Timestamps` is a `Timedelta` object. The number of days contained in the time difference can be accessed through the eponymous property:

In [10]:
(pd.Timestamp('19690711025615') - pd.Timestamp('19450508')).days

8830

A date can be shifted simply by adding to it a `Timedelta`:

In [11]:
print(pd.Timestamp('19450508')+pd.Timedelta('55 days 2 hours 15 minutes 10 seconds'))

1945-07-02 02:15:10


✏️ [Ex.4] 
- ✏️ Create a list of pandas `Timestamps` of all the days between the 24th May 1819 and the 22nd January 1901.
- ✏️ By converting this list into a pandas `Series`, get the median day of this time interval.


In [12]:
# Your solution here:
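
One possible approach, sketched under the assumption that `pd.date_range` is an acceptable way to enumerate the days (the variable names are illustrative, and other solutions are equally valid):

```python
import pandas as pd

# Enumerate every day of the interval (both ends are included by default)
days = list(pd.date_range(start='1819-05-24', end='1901-01-22', freq='D'))

# Convert the list of Timestamps to a Series and take its median
days_series = pd.Series(days)
median_day = days_series.median()

print(len(days))
print(median_day)
```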


### B) Pandas DataFrames

What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` having the same length and whose indexes are in sync. Here, *collection* means that each column of a dataframe is a `Series`.

Let's create a toy `DataFrame` by hand. 

In [13]:
dates = [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14), pd.Timestamp(1986, 3, 14), pd.Timestamp(1993, 1, 1), pd.Timestamp(1998, 7, 14)]
events = ['birth', 'anniversary', 'wedding', 'wedding', 'anniversary']

From those two lists, you can create a `DataFrame` by passing to `pd.DataFrame` a dictionary:

In [14]:
toy_df = pd.DataFrame({
    "date": dates,
    "event": events
})

# What do you expect when dates and events are changed from lists to Series? Try it out.
# What will happen if the lists are of different lengths?

You can check that the `DataFrame` has been properly constructed. Notice how it is indeed of a tabular shape. To extract its length, you can use `len(DataFrame)`. 

In [15]:
print('> This DataFrame has length:', len(toy_df))
display(toy_df)

> This DataFrame has length: 5


Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,wedding
3,1993-01-01,wedding
4,1998-07-14,anniversary


In [16]:
toy_df

Unnamed: 0,date,event
0,1970-05-23,birth
1,1978-07-14,anniversary
2,1986-03-14,wedding
3,1993-01-01,wedding
4,1998-07-14,anniversary


Once the `DataFrame` exists, you can add a column in exactly the same way you would define a new key/value pair in a dictionary:

`dataframe_name['new_column'] = values`


Here, the datatype of `values` is quite flexible as `pandas` allows many inputs: a `pandas.Series` as we've seen before, but also a `numpy.array` or even a simple `list`.

The only condition is that the new column must have the same length as the `DataFrame`.

There is one exception to that rule: if all rows of the new column have the same value, you can just pass that value as input.
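
A minimal sketch of these input types, using a made-up three-row dataframe (`demo_df` and its column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# `demo_df` is a hypothetical three-row dataframe used only for illustration
demo_df = pd.DataFrame({'letter': ['a', 'b', 'c']})

demo_df['from_list'] = [1, 2, 3]                    # plain Python list
demo_df['from_array'] = np.array([0.1, 0.2, 0.3])   # numpy array
demo_df['constant'] = 'same'                        # single value, repeated on every row

# A list of the wrong length is rejected
try:
    demo_df['too_short'] = [1, 2]
except ValueError as error:
    print('ValueError:', error)

print(demo_df)
```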

✏️ [Ex.5] 
- ✏️ Add a new column named `author_firstname` to the dataframe `toy_df`. 
- ✏️ This new column should be input using a list-like variable, containing your first name as many times as there are rows.
- ✏️ Add a new column named `author_lastname` to the dataframe, this time containing your last name, and without using a list-like input.



In [17]:
# your solution

toy_df['author_firstname'] = ['Paul', 'Ellen', 'Paul', 'Ellen', 'Ellen']
#toy_df['author_firstname'] = ['Paul' for i in range(5)]

toy_df['author_lastname'] = 'Guhennec'

In [18]:
toy_df.set_index('date')

date,event,author_firstname,author_lastname
1970-05-23,birth,Paul,Guhennec
1978-07-14,anniversary,Ellen,Guhennec
1986-03-14,wedding,Paul,Guhennec
1993-01-01,wedding,Ellen,Guhennec
1998-07-14,anniversary,Ellen,Guhennec
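
Note that `set_index` returns a new dataframe and leaves the original untouched unless the result is assigned back to a variable. A minimal sketch on a made-up dataframe (`demo_df` and `indexed_df` are illustrative names):

```python
import pandas as pd

# `demo_df` is a hypothetical dataframe used only for illustration
demo_df = pd.DataFrame({
    'date': [pd.Timestamp(1970, 5, 23), pd.Timestamp(1978, 7, 14)],
    'event': ['birth', 'anniversary'],
})

indexed_df = demo_df.set_index('date')

print(indexed_df.index.name)     # 'date' is now the index of the new dataframe
print(list(indexed_df.columns))  # and no longer one of its columns
print(list(demo_df.columns))     # the original dataframe is unchanged
```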


## Section (2): First manipulations of the dataframe

### A) General information on the dataframe

Some first pieces of information on a dataframe are given by the following useful methods: `df.head()`, `df.tail()`, `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many non-null values does each column contain (and therefore how many are missing)?


In [19]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              5 non-null      datetime64[ns]
 1   event             5 non-null      object        
 2   author_firstname  5 non-null      object        
 3   author_lastname   5 non-null      object        
dtypes: datetime64[ns](1), object(3)
memory usage: 292.0+ bytes


Alternatively, if you need to know only the number of columns and rows you can use the `.shape` property. 

It is a property, not a method — therefore it should be called without brackets.

Calling the property returns a tuple with 1) number of rows, 2) number of columns.



In [20]:
toy_df.shape

(5, 4)

`head()` prints by default the first five rows of a dataframe:


In [21]:
toy_df.head()


Unnamed: 0,date,event,author_firstname,author_lastname
0,1970-05-23,birth,Paul,Guhennec
1,1978-07-14,anniversary,Ellen,Guhennec
2,1986-03-14,wedding,Paul,Guhennec
3,1993-01-01,wedding,Ellen,Guhennec
4,1998-07-14,anniversary,Ellen,Guhennec


But the number of lines displayed is a parameter that can be changed:


In [22]:
toy_df.head(2)


Unnamed: 0,date,event,author_firstname,author_lastname
0,1970-05-23,birth,Paul,Guhennec
1,1978-07-14,anniversary,Ellen,Guhennec


`tail()` does the opposite, i.e. prints the last n rows in the dataframe:

In [23]:
toy_df.tail(2)

Unnamed: 0,date,event,author_firstname,author_lastname
3,1993-01-01,wedding,Ellen,Guhennec
4,1998-07-14,anniversary,Ellen,Guhennec


You may sometimes want to sort the dataframe based on the values in one column.

To do this, you may use the `.sort_values(<column>)` method. 

The column will then be sorted depending on the datatype:
- numerically (float, integers);
- chronologically (datetimes);
- alphabetically (strings).

In [24]:
toy_df.sort_values('event', ascending=True)

Unnamed: 0,date,event,author_firstname,author_lastname
1,1978-07-14,anniversary,Ellen,Guhennec
4,1998-07-14,anniversary,Ellen,Guhennec
0,1970-05-23,birth,Paul,Guhennec
2,1986-03-14,wedding,Paul,Guhennec
3,1993-01-01,wedding,Ellen,Guhennec


If you want to invert the sorting (z-to-a, 9-to-1, etc.), use the `ascending=False` argument.

In [25]:
toy_df.sort_values('event', ascending=False)

Unnamed: 0,date,event,author_firstname,author_lastname
2,1986-03-14,wedding,Paul,Guhennec
3,1993-01-01,wedding,Ellen,Guhennec
0,1970-05-23,birth,Paul,Guhennec
1,1978-07-14,anniversary,Ellen,Guhennec
4,1998-07-14,anniversary,Ellen,Guhennec


### B) Columns and datatype

The columns of a `pandas.DataFrame` can be accessed as follows:

In [26]:
toy_df

Unnamed: 0,date,event,author_firstname,author_lastname
0,1970-05-23,birth,Paul,Guhennec
1,1978-07-14,anniversary,Ellen,Guhennec
2,1986-03-14,wedding,Paul,Guhennec
3,1993-01-01,wedding,Ellen,Guhennec
4,1998-07-14,anniversary,Ellen,Guhennec


In [27]:
toy_df['date']

0   1970-05-23
1   1978-07-14
2   1986-03-14
3   1993-01-01
4   1998-07-14
Name: date, dtype: datetime64[ns]

It returns a `pandas.Series`, the type we've seen in the introductory section of this notebook. To access its **values**, use the `.values` property:

In [28]:
print(type(toy_df['date']))
print(toy_df['date'].values)

<class 'pandas.core.series.Series'>
['1970-05-23T00:00:00.000000000' '1978-07-14T00:00:00.000000000'
 '1986-03-14T00:00:00.000000000' '1993-01-01T00:00:00.000000000'
 '1998-07-14T00:00:00.000000000']


Each column in a `pandas.DataFrame` has a data type. Being sure that the right datatype is used is essential.

Depending on the nature of the data, its type can be changed using the method `.astype()`. 

For example, changing from a `pandas.Timestamp` to a `str` is possible:

In [29]:
type(toy_df['date'].astype(str)[0])

str

In [30]:
toy_df['date'].astype(str)

0    1970-05-23
1    1978-07-14
2    1986-03-14
3    1993-01-01
4    1998-07-14
Name: date, dtype: object

But changing from a `pandas.Timestamp` to a `float` is not possible:

In [31]:
## What do you expect when you run the following?
#print(toy_df['date'].astype(float))

### C) Accessor properties

For certain data types (string, datetime), `pandas` provides a number of common methods that can be called on any series containing values of that type. These methods are grouped under a property of the series, called an *accessor*, which is named after the data type:

- the `.dt.*` accessor contains methods to operate on `datetime` series
- the `.str.*` accessor contains methods to operate on `str` (string) series.

Accessors are amongst the most convenient features of data manipulation in `pandas`.

They act on a `pandas.Series`, typically the column of a `DataFrame` and return a `pandas.Series` of the same length. The output new series is the result of the element-wise operation on the input series.

Let's start with temporal manipulation:

#### `datetime` accessor

To work with datetime series, `pandas` provides a number of useful methods: they can be called from the `.dt` property of a datetime series.

They can be used to:
- convert from one timezone to another
- get the day/day name/month/year information from each date
- and much more (see the `pandas` documentation on the `.dt` accessor)
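
A minimal sketch of a few of these `.dt` properties and methods, on a made-up series (`demo_dates` is an assumption for illustration only):

```python
import pandas as pd

# `demo_dates` is a hypothetical series used only for illustration
demo_dates = pd.Series([pd.Timestamp(1970, 5, 23), pd.Timestamp(1998, 7, 14)])

# Each call returns a new Series with one result per input date
print(demo_dates.dt.year.tolist())
print(demo_dates.dt.month.tolist())
print(demo_dates.dt.month_name().tolist())
```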



In [32]:
toy_df['date'].dt.day_name()

0    Saturday
1      Friday
2      Friday
3      Friday
4     Tuesday
Name: date, dtype: object

✏️ [Ex.6] 
To see this in action: 
- ✏️ Access your dataframe's `date` column
- ✏️ Print for each the corresponding day of the week. To do this, you should use the `datetime` accessor, and use the method `day_name`. This will return a `pandas.Series`.
- ✏️ Add the day of the week as a new column.
- ✏️ Try to do this again in a one-liner.

In [33]:
# your solution here:
toy_df['date_day'] = toy_df['date'].dt.day_name()
toy_df

Unnamed: 0,date,event,author_firstname,author_lastname,date_day
0,1970-05-23,birth,Paul,Guhennec,Saturday
1,1978-07-14,anniversary,Ellen,Guhennec,Friday
2,1986-03-14,wedding,Paul,Guhennec,Friday
3,1993-01-01,wedding,Ellen,Guhennec,Friday
4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday


#### `str` accessor

Much like the `datetime` accessor, the `str` one is the entry door to many very useful methods that you may need to tidy, process, or analyse your data.

Among other things, you can easily:
- test if the string starts with another string, 
- convert between lower and upper case, 
- determine if the string matches a regular expression,
- replace one substring with another.
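
A minimal sketch of three of these operations, on a made-up series (`demo_events` and the patterns used are assumptions for illustration only):

```python
import pandas as pd

# `demo_events` is a hypothetical series used only for illustration
demo_events = pd.Series(['birth', 'anniversary', 'wedding'])

# Convert to upper case
print(demo_events.str.upper().tolist())

# Regular expression: does the string end in 'ing' or 'ary'?
print(demo_events.str.contains(r'(?:ing|ary)$', regex=True).tolist())

# Literal replacement of one substring with another
print(demo_events.str.replace('wedding', 'marriage', regex=False).tolist())
```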

For example, if you want to check if the first three letters of the `event` column are those of "wedding", you can use:

In [34]:
toy_df

Unnamed: 0,date,event,author_firstname,author_lastname,date_day
0,1970-05-23,birth,Paul,Guhennec,Saturday
1,1978-07-14,anniversary,Ellen,Guhennec,Friday
2,1986-03-14,wedding,Paul,Guhennec,Friday
3,1993-01-01,wedding,Ellen,Guhennec,Friday
4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday


In [35]:
toy_df['event'].str.startswith('wed')

0    False
1    False
2     True
3     True
4    False
Name: event, dtype: bool

You can also chain the accessors. However, remember that the output of an accessor method is a `pandas.Series`. You will therefore need to access `str` again!

For example, if you want first to capitalise a column, before checking whether it starts with the first letters of "wedding", you can do:

In [36]:
toy_df['event'].str.capitalize().str.startswith('Wed')

0    False
1    False
2     True
3     True
4    False
Name: event, dtype: bool

In [37]:
# This time, we match with lowercase "wed": the capitalised strings no longer match

toy_df['event'].str.capitalize().str.startswith('wed')

0    False
1    False
2    False
3    False
4    False
Name: event, dtype: bool

✏️ [Ex.7] 
- ✏️ Using the `str` accessors, create a new column `is_weekend` that states if the date fell during a weekend.
- ✏️ The new column should be a boolean (True/False).


In [38]:
~ toy_df['date_day'].str.startswith('S')

0    False
1     True
2     True
3     True
4     True
Name: date_day, dtype: bool

In [39]:
# your solution here:
toy_df['is_weekend'] = toy_df['date_day'].str.startswith('S')

toy_df

Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend
0,1970-05-23,birth,Paul,Guhennec,Saturday,True
1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False
2,1986-03-14,wedding,Paul,Guhennec,Friday,False
3,1993-01-01,wedding,Ellen,Guhennec,Friday,False
4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False


In [40]:
toy_df

Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend
0,1970-05-23,birth,Paul,Guhennec,Saturday,True
1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False
2,1986-03-14,wedding,Paul,Guhennec,Friday,False
3,1993-01-01,wedding,Ellen,Guhennec,Friday,False
4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False


## Section (3): Input/Output

Up until now, we have created and manipulated data from scratch.

Most often, however, a dataframe is created by loading existing data by means of the input/output methods of `pandas`:

- Either by loading a complete dataframe, for example if you want to manipulate a CSV file;
- Or by loading the data columns independently, and combining them into one dataframe.

#### From JSON

A very common data format is JSON: it is explicit, efficient, and widely used over the internet.

We will take here the example of some data on books from the British Library. We extracted it from the BL as a JSON file. You may face such a scenario in your research.

Loading data from a JSON file is very similar to creating a `DataFrame` from a `dict`, like we've done in Section (1).

This is how one would do it in pure Python:

In [41]:
import json
json_file_path = '../data/bl_books/sample/book_data_sample.json'

# JSON data gets read into a dictionary

with open(json_file_path, 'r') as jsonfile:
    json_data = json.load(jsonfile)
    
books_df = pd.DataFrame(json_data)

In [42]:
books_df.head()

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1841,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1888,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1847,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1892,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1863,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}


Since reading from files is a very common operation in any data analysis workflow, `pandas` provides methods to read from a variety of formats (JSON, CSV, clipboard, etc.)

The block of code above can be replaced by the following one-liner:

In [43]:
books_df = pd.read_json(json_file_path)


In [47]:
books_df.head()

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1970-01-01 00:31:28,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1970-01-01 00:30:47,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1970-01-01 00:31:32,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1970-01-01 00:31:03,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}
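
Compared with the first loading method, the `date` column now shows timestamps in 1970 rather than publication years. This is most likely because `pd.read_json` tries by default to parse columns with date-like names (such as `date`) as datetimes, so that a year like 1841 ends up being read as a number of seconds since 1970-01-01. A minimal sketch of how to switch this off with the `convert_dates` parameter, using a small made-up JSON string in place of the real file:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-record JSON sample standing in for the books file
json_text = '[{"date": "1841", "place": "Calcutta"}, {"date": "1888", "place": "London"}]'

# convert_dates=False asks read_json not to parse date-like columns
sample_df = pd.read_json(StringIO(json_text), convert_dates=False)

print(sample_df['date'].tolist())
print(sample_df['date'].dtype)
```

On the real file, the equivalent call would be `pd.read_json(json_file_path, convert_dates=False)`.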


#### From CSV

Similarly to `pandas.read_json()`, `pandas.read_csv()` is there to make your life easier when it comes to loading CSV data into a dataframe (and that happens very often!).

CSV is a particularly useful format for exchanging data with Excel or other tabular data software.

Let's see how to import one of the CSV files from the "Venice Apprenticeship" dataset (`../data/apprenticeship_venice/`). 

They contain information extracted from ~10,000 work contracts in 17th-century Venice, in particular master-apprentice relationships in the glass industry. The apprentices were nicknamed *garzoni*, hence the name of the dataset.

In [48]:
csv_file_path = '../data/apprenticeship_venice/professions_data.csv'

Try the following. Does it work? Before going to the next cell, can you guess why?

In [49]:
garzoni_df = pd.read_csv(csv_file_path)

ParserError: Error tokenizing data. C error: Expected 7 fields in line 36, saw 8


Let's have a look at the file first:

In [50]:
!head -n 2 ../data/apprenticeship_venice/professions_data.csv

page_title;register;annual_salary;a_profession;profession_code_strict;profession_code_gen;profession_cat;corporation;keep_profession_a;complete_profession_a;enrolmentY;enrolmentM;startY;startM;length;has_fled;m_profession;m_profession_code_strict;m_profession_code_gen;m_profession_cat;m_corporation;keep_profession_m;complete_profession_m;m_gender;m_name;m_surname;m_patronimic;m_atelier;m_coords;a_name;a_age;a_gender;a_geo_origins;a_geo_origins_std;a_coords;a_quondam;accommodation_master;personal_care_master;clothes_master;generic_expenses_master;salary_in_kind_master;pledge_goods_master;pledge_money_master;salary_master;female_guarantor;period_cat;incremental_salary
Carlo Della sosta (Orese) 1592-08-03;asv, giustizia vecchia, accordi dei garzoni, 114, 155;NA;orese;orese;orefice;orefice;Oresi;1;1;1592;08;1592;08;3;0;orese;orese;orefice;orefice;Oresi;1;1;1;Zuan Battista;Amigoni;;;0, 0;Carlo Della sosta;17;1;;;0, 0;1;0;1;1;1;0;0;0;0;0;NA;0


Rather than comma-separated values, the file contains semicolon-separated values.
We thus need to adjust the `sep` parameter to specify which character is used to separate column values.

In [None]:
garzoni_df = pd.read_csv(csv_file_path,sep=';')

In [52]:
garzoni_df

Unnamed: 0,page_title,register,annual_salary,a_profession,profession_code_strict,profession_code_gen,profession_cat,corporation,keep_profession_a,complete_profession_a,...,personal_care_master,clothes_master,generic_expenses_master,salary_in_kind_master,pledge_goods_master,pledge_money_master,salary_master,female_guarantor,period_cat,incremental_salary
0,Carlo Della sosta (Orese) 1592-08-03,"asv, giustizia vecchia, accordi dei garzoni, 1...",,orese,orese,orefice,orefice,Oresi,1,1,...,1,1,1,0,0,0,0,0,,0
1,Antonio quondam Andrea (squerariol) 1583-01-09,"asv, giustizia vecchia, accordi dei garzoni, 1...",12.500000,squerariol,squerariol,lavori allo squero,lavori allo squero,Squerarioli,1,1,...,0,0,1,0,0,0,1,0,1.0,0
2,Cristofollo di Zuane (batioro in carta) 1591-0...,"asv, giustizia vecchia, accordi dei garzoni, 1...",,batioro,batioro,battioro,fabbricatore di foglie/fili/cordelle d'oro o a...,Battioro,1,1,...,0,0,0,0,0,0,0,0,,0
3,Illeggibile (marzer) 1584-06-21,"asv, giustizia vecchia, accordi dei garzoni, 1...",,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,0,0,0,0,0,0,,0
4,Domenico Morebetti (spechier) 1664-09-13,"asv, giustizia vecchia, accordi dei garzoni, 1...",7.000000,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,1,0,0,0,1,0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9648,Zuane de Antonio (paternoster) 1594-06-01,"asv, giustizia vecchia, accordi dei garzoni, 1...",2.500000,paternoster,paternostrer,paternoster,paternosteri o margariteri,Margariteri,1,1,...,1,0,1,0,0,0,1,0,1.0,0
9649,Isepo di Tononi (zavater) 1594-06-01,"asv, giustizia vecchia, accordi dei garzoni, 1...",2.571429,zavatter,zavater,ciabattino,fabbricazione calzature,Calegheri e zavatteri,1,1,...,1,0,1,0,0,0,1,1,1.0,0
9650,Antonio Ofiagio (caseler) 1594-06-01,"asv, giustizia vecchia, accordi dei garzoni, 1...",2.153846,casseler,casseler,fabbricatore di casse,fabbricatore di casse e simili,Casseleri,1,1,...,1,0,1,0,0,0,1,0,1.0,0
9651,Alvise Ferazo (caleger) 1594-06-02,"asv, giustizia vecchia, accordi dei garzoni, 1...",2.500000,calegher,calegher,calzolaio,fabbricazione calzature,Calegheri e zavatteri,1,1,...,1,0,1,0,0,0,1,0,1.0,0


### Export

Once you have a `DataFrame` to play with in your Python environment, exporting it is quite straightforward.

Each export format has its own method: `.to_csv()`, `.to_json()`, `.to_html()`, etc.

The argument of the method is simply the `<path name>`, sometimes followed by optional arguments, such as the value separator you want to use for `.csv` files:

`<your_dataframe>.to_csv(<export_path>, sep=<which_separator>)`

In [53]:
toy_df.to_csv('example_dataframe.csv', sep=';')
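
By default, `to_csv` also writes the index as a first, unnamed column, which shows up as `Unnamed: 0` when the file is read back. A minimal sketch of a round trip without the index, using an in-memory buffer instead of a file on disk (the dataframe `demo_df` is made up for illustration):

```python
from io import StringIO

import pandas as pd

# `demo_df` is a hypothetical dataframe used only for illustration
demo_df = pd.DataFrame({'year': [1970, 1978], 'event': ['birth', 'anniversary']})

# Write to an in-memory text buffer; index=False leaves out the row labels
buffer = StringIO()
demo_df.to_csv(buffer, sep=';', index=False)
print(buffer.getvalue())

# Read it back with the same separator
buffer.seek(0)
restored_df = pd.read_csv(buffer, sep=';')
print(restored_df)
```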

## Section (4): Accessing and manipulating data.

We now have learned to get our hands on larger dataframes, with thousands of rows and dozens of columns, akin to the sort you may manipulate in a typical DH project.

It is now time to learn how to access the data stored in those dataframes.

In [54]:
books_df

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1970-01-01 00:31:28,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1970-01-01 00:30:47,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1970-01-01 00:31:32,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1970-01-01 00:31:03,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,1766,[British Library HMNTS 643.i.2.(5.)],,"[The Country Girl, a comedy [in five acts and ...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Wycherley, William']}",1970-01-01 00:29:26,{'1': 'lsidyv36d016bd'},3988464,{},sample/full_texts/003988464_01_text.json,"{'0': {'000005': ['11000084354'], '000009': ['..."
448,1767,[British Library HMNTS 643.h.10.(7.)],,[[The Plain Dealer: a comedy [in five acts and...,[Another edition.],http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['BICKERSTAFFE, Isaac.'], 'cre...",1970-01-01 00:29:27,{'1': 'lsidyv3882404e'},3988508,{},sample/full_texts/003988508_01_text.json,"{'0': {'000011': ['11000138144', '11000190693'..."
449,[1833],[British Library HMNTS 10347.ee.18.(14.)],Atkinson,[Address to the Friends of Justice and Humanit...,,http://www.flickr.com/photos/britishlibrary/ta...,Bradford,monographic,{},1970-01-01 00:30:33,{'1': 'lsidyv3c3d324a'},3999073,"{'creator': ['West Riding (YORK, County of)']}",sample/full_texts/003999073_01_text.json,
450,1881,[British Library HMNTS 11781.dd.50],C. Kegan Paul & Co.,[Anne Boleyn. A tragedy in five acts. By the a...,,http://www.flickr.com/photos/britishlibrary/ta...,enk,monographic,"{'creator': ['Nutt, D. - Miss']}",1970-01-01 00:31:21,{'1': 'lsidyv3b77c51b'},4088697,{},sample/full_texts/004088697_01_text.json,


### Indexing

There are two main ways to find information contained in a dataframe cell:
- either you know exactly where it is in the dataframe, for example if your frame is in a specific order;
- or, you want to access one or more cells based on conditions.

Let's take our books dataframe for the first of those cases. If you want to know the 'title' of the n-th row, you can get it using the `.at` accessor. The structure is:

`<dataframe>.at[<row_number>, <column_name>]`

✏️ [Ex.8]
- ✏️ What is the 'title' of the 26th row?

In [58]:
# your solution here:
books_df.at[25, 'title']

['Lurline: a grand romantic original opera, in three acts, composed by W. Vincent Wallace, the words by E. Fitzball: first produced at the Royal English Opera, Covent Garden ... February 23rd 1860']
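
Strictly speaking, `.at` looks a cell up by row *label* rather than by row position; the two coincide here because `books_df` has the default 0, 1, 2, ... index. For purely positional access, `pandas` also offers `.iat`. A minimal sketch of the difference, on a made-up dataframe with a non-default index:

```python
import pandas as pd

# `demo_df` is a hypothetical dataframe used only for illustration
demo_df = pd.DataFrame(
    {'title': ['First', 'Second', 'Third']},
    index=[10, 20, 30],  # row labels that differ from row positions
)

by_label = demo_df.at[20, 'title']   # the row labelled 20
by_position = demo_df.iat[1, 0]      # second row, first column

print(by_label)
print(by_position)
```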

In [59]:
books_df

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
1,1888,[British Library HMNTS 9025.cc.14.],Rivingtons,[A History of Greece. Part I. From the earlies...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Abbott, Evelyn']}",1970-01-01 00:31:28,{'1': 'lsidyv376da437'},4047,{},sample/full_texts/000004047_01_text.json,{'0': {'000257': ['11104648374']}}
2,"1847, 48 [1846-48]","[British Library HMNTS C.131.d.16., British Li...",Punch Office,[The Comic History of England ... With ... col...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['LEECH, John - Artist'], 'cre...",1970-01-01 00:30:47,{'1': 'lsidyv38b27a31'},5382,{},sample/full_texts/000005382_01_text.json,"{'0': {'000410': ['11026944866'], '000412': ['..."
3,1892,[British Library HMNTS 10351.cc.60.],"Eden, Remington & Co.",[The Cruise of “The Tomahawk”: the story of a ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Laffan, De Courcy - Mrs']}",1970-01-01 00:31:32,{'1': 'lsidyv3c4a946a'},14627,{},sample/full_texts/000014627_01_text.json,"{'0': {'000134': ['11300364604'], '000039': ['..."
4,1863,[British Library HMNTS 9078.c.10.],Virtue Bros. & Co.,[Scenes from the Drama of European History],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Adams, W. H. Davenport (William ...",1970-01-01 00:31:03,{'1': 'lsidyv32c34c22'},17057,{},sample/full_texts/000017057_01_text.json,{'0': {'000005': ['11046098454']}}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,1766,[British Library HMNTS 643.i.2.(5.)],,"[The Country Girl, a comedy [in five acts and ...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['Wycherley, William']}",1970-01-01 00:29:26,{'1': 'lsidyv36d016bd'},3988464,{},sample/full_texts/003988464_01_text.json,"{'0': {'000005': ['11000084354'], '000009': ['..."
448,1767,[British Library HMNTS 643.h.10.(7.)],,[[The Plain Dealer: a comedy [in five acts and...,[Another edition.],http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['BICKERSTAFFE, Isaac.'], 'cre...",1970-01-01 00:29:27,{'1': 'lsidyv3882404e'},3988508,{},sample/full_texts/003988508_01_text.json,"{'0': {'000011': ['11000138144', '11000190693'..."
449,[1833],[British Library HMNTS 10347.ee.18.(14.)],Atkinson,[Address to the Friends of Justice and Humanit...,,http://www.flickr.com/photos/britishlibrary/ta...,Bradford,monographic,{},1970-01-01 00:30:33,{'1': 'lsidyv3c3d324a'},3999073,"{'creator': ['West Riding (YORK, County of)']}",sample/full_texts/003999073_01_text.json,
450,1881,[British Library HMNTS 11781.dd.50],C. Kegan Paul & Co.,[Anne Boleyn. A tragedy in five acts. By the a...,,http://www.flickr.com/photos/britishlibrary/ta...,enk,monographic,"{'creator': ['Nutt, D. - Miss']}",1970-01-01 00:31:21,{'1': 'lsidyv3b77c51b'},4088697,{},sample/full_texts/004088697_01_text.json,



Now for the second case. Taking `books_df` again, we might want to find all books printed in 1841, all books printed privately, or all books printed privately in 1841.

The tool for this is `.loc`, perhaps the most important indexer in `pandas`.

It is applied to a `pandas.DataFrame` and returns the subset of that dataframe that matches the given conditions.

What does the structure look like? The cells below apply it to the examples just mentioned:

In [78]:
# One condition
messy_data = books_df.loc[~(books_df['datefield'].str.isnumeric())]
clean_data = books_df.loc[(books_df['datefield'].str.isnumeric())]

print('Clean data: ', len(clean_data))
print('Messy data: ', len(messy_data))
print(f"The data to clean is: {round(100*len(messy_data)/len(books_df),1)}%")

Clean data:  370
Messy data:  82
The data to clean is: 18.1%


In [77]:
# Another condition
books_df.loc[books_df['publisher']=='Privately printed']
books_df.loc[books_df['datefield']=='1841']

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,
33,1841,"[British Library HMNTS 793.f.25., British Libr...",John Mason,[Ashantee and the Gold Coast: being a sketch o...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['METCALFE, George Edgar.'], '...",1970-01-01 00:30:41,{'1': 'lsidyv38fa87d5'},249060,{},sample/full_texts/000249060_01_text.json,{'0': {'000027': ['11019234505']}}
138,1841,"[British Library HMNTS RB.23.a.16334., British...",Longman & Co.,[The History of Guernsey; with occasional noti...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['DUNCAN, Jonathan - B.A']}",1970-01-01 00:30:41,{'1': 'lsidyv3c4bbd67'},1005994,{},sample/full_texts/001005994_01_text.json,
141,1841,[British Library HMNTS 1295.c.36.],,[The Tragedy of the Seas; or sorrow on the oce...,,http://www.flickr.com/photos/britishlibrary/ta...,New York,monographic,"{'creator': ['ELLMS, Charles.']}",1970-01-01 00:30:41,{'1': 'lsidyv3cb7f1da'},1061079,{},sample/full_texts/001061079_01_text.json,"{'0': {'000415': ['11224532573'], '000012': ['..."
233,1841,[British Library HMNTS 10347.ee.19.(12.)],Hobson and Smiles,[An Abstract of Accounts for fourteen years en...,,http://www.flickr.com/photos/britishlibrary/ta...,Leeds,monographic,{},1970-01-01 00:30:41,{'1': 'lsidyv3c347a87'},2112441,{},sample/full_texts/002112441_01_text.json,


In [None]:
# Both conditions, following the pattern: books_df.loc[(condition1)&(condition2)]
books_df.loc[(books_df['datefield']=='1841')&(books_df['publisher']=='Privately printed')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,


Conditions can be combined using Boolean logic. We can thus for example search for:
- Condition A **and** Condition B
- Condition A **or** Condition B
- Condition A **and not** Condition C
- etc.

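Unlike plain Python, which uses `and`, `or` and `not`, `pandas` combines conditions element-wise with `&` (and), `|` (or) and `~` (not). Wrap each condition in parentheses, because these operators take precedence over comparisons such as `==`.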
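A minimal sketch of the three operators on a hypothetical frame (`demo_df` and its values are assumptions made for illustration). Storing each condition in its own variable avoids the parentheses that are otherwise needed around comparisons, since `&` and `|` bind more tightly than `==`:

```python
import pandas as pd

# Hypothetical data, only for illustration
demo_df = pd.DataFrame({
    'publisher': ['A', 'A', 'B', 'C'],
    'year': ['1841', '1850', '1841', '1841'],
})

is_a = demo_df['publisher'] == 'A'
is_1841 = demo_df['year'] == '1841'

both = demo_df.loc[is_a & is_1841]          # A and 1841
either = demo_df.loc[is_a | is_1841]        # A or 1841
only_1841 = demo_df.loc[is_1841 & ~is_a]    # 1841 and not A

print(len(both), len(either), len(only_1841))
```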

✏️ [Ex.9]

Using the indexing structure we've just seen, try to find which book(s) obey the following conditions:

- ✏️ 1° Published by 'John Murray' in 1856
- ✏️ 2° Printed in London in 1818
- ✏️ 3° Published between 1830 and 1848
- ✏️ 4° Published by 'Longmans & Co.' or 'Henry Colburn'
- ✏️ 5° Published in 1894 but not by 'W. Blackwood & Sons'

In [None]:
# your solution here:
# 1.
books_df.loc[(books_df['publisher']=='John Murray')&(books_df['datefield']=='1856')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
85,1856,[British Library HMNTS 10096.g.21.],John Murray,[Correspondence of Lieut.-General the Hon. Sir...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['CATHCART, George - Hon. Sir']}",1970-01-01 00:30:56,{'1': 'lsidyv3c6dd420'},635054,{},sample/full_texts/000635054_01_text.json,"{'0': {'000020': ['11048889234'], '000022': ['..."


In [93]:
# 2.
books_df.loc[(books_df['place']=='London')&(books_df['datefield']=='1818')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
55,1818,[British Library HMNTS 1048.i.11.],Henry Colburn,[Letters of a Prussian Traveller; descriptive ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['BRAMSEN, John.']}",1970-01-01 00:30:18,{'1': 'lsidyv3cbc195b'},451882,{},sample/full_texts/000451882_01_text.json,"{'0': {'000349': ['11004041184'], '000749': ['..."
172,1818,[British Library HMNTS 1055.e.21.(1.)],,"[The Campaign of 1815; or, a narrative of the ...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['GOURGAUD, Gaspard - Baron']}",1970-01-01 00:30:18,{'1': 'lsidyv37557aa8'},1476998,{},sample/full_texts/001476998_01_text.json,{'0': {'000006': ['11004061314']}}


In [116]:
# 3.
# .copy() gives an independent dataframe, so adding a column does not trigger a SettingWithCopyWarning
clean_data = books_df.loc[books_df['datefield'].str.isnumeric()].copy()
clean_data['datefield_clean'] = clean_data['datefield'].astype(int)
clean_data.loc[(clean_data['datefield_clean']>=1830)&(clean_data['datefield_clean']<=1848)]


Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs,datefield_clean
0,1841,[British Library HMNTS 11601.ddd.2.],Privately printed,"[The Poetical Aviary, with a bird's-eye view o...",,http://www.flickr.com/photos/britishlibrary/ta...,Calcutta,monographic,{'creator': ['A. A.']},1970-01-01 00:30:41,{'1': 'lsidyv35c55757'},196,{},sample/full_texts/000000196_01_text.json,,1841
8,1833,[British Library HMNTS 10411.bbb.28.],"Carey, Lea & Carey",[[Notions of the Americans; picked up by a tra...,[Another edition.],http://www.flickr.com/photos/britishlibrary/ta...,Philadelphia,monographic,{},1970-01-01 00:30:33,{'1': 'lsidyv35028b1b'},69890,{},sample/full_texts/000069890_01_text.json,,1833
33,1841,"[British Library HMNTS 793.f.25., British Libr...",John Mason,[Ashantee and the Gold Coast: being a sketch o...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['METCALFE, George Edgar.'], '...",1970-01-01 00:30:41,{'1': 'lsidyv38fa87d5'},249060,{},sample/full_texts/000249060_01_text.json,{'0': {'000027': ['11019234505']}},1841
35,1843,[British Library HMNTS 1424.h.8.],Henry Colburn,"[Narrative of a Voyage round the World, perfor...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['BELCHER, Edward - Sir']}",1970-01-01 00:30:43,{'1': 'lsidyv3e610def'},254656,{},sample/full_texts/000254656_01_text.json,"{'0': {'000352': ['11226250853'], '000299': ['...",1843
40,1845,[British Library HMNTS 10055.ee.19.],Henry Colburn,[[Narrative of the Voyages and Services of the...,Second edition. [Abridged.],http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['HALL, William Hutcheon - Sir...",1970-01-01 00:30:45,{'1': 'lsidyv3cdeef6a'},293364,{},sample/full_texts/000293364_01_text.json,"{'0': {'000059': ['11007549816'], '000058': ['...",1845
67,1848,[British Library HMNTS 10410.e.4.],T. R. Marvin,"[History of the Town of Groton, including Pepp...",,http://www.flickr.com/photos/britishlibrary/ta...,Boston,monographic,"{'creator': ['BUTLER, Caleb.']}",1970-01-01 00:30:48,{'1': 'lsidyv35b1c0e3'},551646,{},sample/full_texts/000551646_01_text.json,"{'0': {'000274': ['11028212356'], '000531': ['...",1848
71,1836,"[British Library HMNTS 993.h.11.(1.), British ...",W. F. Wakeman,[Arnaldo; Gaddo; and other unacknowledged poem...,,http://www.flickr.com/photos/britishlibrary/ta...,Dublin,monographic,"{'contributor': ['VOLPI, Odoardo - pseud. [i.e...",1970-01-01 00:30:36,{'1': 'lsidyv388c64a9'},558610,{},sample/full_texts/000558610_01_text.json,,1836
76,1831,[British Library HMNTS 10361.b.23.],Benjamin Bridges,[The Poll on the Election of a Representative ...,,http://www.flickr.com/photos/britishlibrary/ta...,Cambridge,monographic,"{'contributor': ['COOPER, Charles Henry.']}",1970-01-01 00:30:31,{'1': 'lsidyv32d04301'},579863,{},sample/full_texts/000579863_01_text.json,,1831
86,1832,[British Library HMNTS 11781.df.1.],T. & W. Boone,"[Attila, a tragedy; and other poems]",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['CAUNTER, Richard Macdonald.']}",1970-01-01 00:30:32,{'1': 'lsidyv35e8c627'},638231,{},sample/full_texts/000638231_01_text.json,,1832
110,1830,[British Library HMNTS 10370.dd.9.],John Johnstone,"[A Description of Craufurd Priory, the propert...",,http://www.flickr.com/photos/britishlibrary/ta...,Edinburgh,monographic,{},1970-01-01 00:30:30,{'1': 'lsidyv2afc3e7d'},813771,{},sample/full_texts/000813771_01_text.json,"{'0': {'000002': ['11221590156'], '000024': ['...",1830
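
An alternative sketch for condition 3°, assuming you would rather not split the frame into clean and messy parts first: `pd.to_numeric` with `errors='coerce'` turns every unparseable date into `NaN`, and `Series.between` keeps the rows inside an inclusive range. The `demo_df` frame below is hypothetical:

```python
import pandas as pd

# Hypothetical date strings, mimicking the mix found in 'datefield'
demo_df = pd.DataFrame({'datefield': ['1841', '[1833]', '1847, 48 [1846-48]', '1830', '1892']})

# Unparseable strings become NaN instead of raising an error
demo_df['datefield_clean'] = pd.to_numeric(demo_df['datefield'], errors='coerce')

# between() is inclusive on both ends by default; NaN never matches
in_range = demo_df.loc[demo_df['datefield_clean'].between(1830, 1848)]
print(in_range)
```

Note that this approach silently drops messy dates such as `[1833]`, so it is only appropriate once you have decided that those rows can be ignored.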


In [118]:
#4.

books_df.loc[(books_df['publisher']=='Longmans & Co.')|(books_df['publisher']=="Henry Colburn")]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
7,1828,"[British Library HMNTS 792.g.28., British Libr...",Henry Colburn,[Notions of the Americans; picked up by a trav...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,{},1970-01-01 00:30:28,{'1': 'lsidyv38f9ad5c'},69889,{},sample/full_texts/000069889_01_text.json,
17,1896,[British Library HMNTS 010057.g.3.],Longmans & Co.,[East and West. Being papers reprinted from th...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['PRITCHETT, Robert Taylor - K...",1970-01-01 00:31:36,{'1': 'lsidyv39175c26'},119124,{},sample/full_texts/000119124_01_text.json,"{'0': {'000132': ['11301423416'], '000344': ['..."
35,1843,[British Library HMNTS 1424.h.8.],Henry Colburn,"[Narrative of a Voyage round the World, perfor...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['BELCHER, Edward - Sir']}",1970-01-01 00:30:43,{'1': 'lsidyv3e610def'},254656,{},sample/full_texts/000254656_01_text.json,"{'0': {'000352': ['11226250853'], '000299': ['..."
40,1845,[British Library HMNTS 10055.ee.19.],Henry Colburn,[[Narrative of the Voyages and Services of the...,Second edition. [Abridged.],http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'contributor': ['HALL, William Hutcheon - Sir...",1970-01-01 00:30:45,{'1': 'lsidyv3cdeef6a'},293364,{},sample/full_texts/000293364_01_text.json,"{'0': {'000059': ['11007549816'], '000058': ['..."
55,1818,[British Library HMNTS 1048.i.11.],Henry Colburn,[Letters of a Prussian Traveller; descriptive ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['BRAMSEN, John.']}",1970-01-01 00:30:18,{'1': 'lsidyv3cbc195b'},451882,{},sample/full_texts/000451882_01_text.json,"{'0': {'000349': ['11004041184'], '000749': ['..."
80,1869,[British Library HMNTS 9603.cc.30.],Longmans & Co.,[History of Grant's Campaign for the capture o...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['CANNON, John - Historical Writer']}",1970-01-01 00:31:09,{'1': 'lsidyv38cfd1aa'},595792,{},sample/full_texts/000595792_01_text.json,
245,1859,[British Library HMNTS 1325.f.26.],Longmans & Co.,"[Lectures on the History of England, etc. Lect...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['LONGMAN, William - President of ...",1970-01-01 00:30:59,{'1': 'lsidyv37bd8d02'},2255687,{},sample/full_texts/002255687_01_text.json,"{'0': {'000299': ['11032879016'], '000055': ['..."


In [None]:
#5.

books_df.loc[(books_df['datefield']=='1894')&(books_df['publisher']!='W. Blackwood & Sons')]

Unnamed: 0,datefield,shelfmarks,publisher,title,edition,flickr_url_to_book_images,place,issuance,authors,date,pdf,identifier,corporate,fulltext_filename,imgs
14,1894,[British Library HMNTS 010370.e.8.],J. Lewis,[Ettrick and Yarrow: a guide. With songs and b...,,http://www.flickr.com/photos/britishlibrary/ta...,Selkirk,monographic,"{'creator': ['ANGUS, William - of Selkirk']}",1970-01-01 00:31:34,{'1': 'lsidyv3c40a3cf'},89266,{},sample/full_texts/000089266_01_text.json,"{'0': {'000033': ['11305479244'], '000102': ['..."
68,1894,[British Library HMNTS 9771.aaa.4.],Hunt & Eaton,"[Sketches of Mexico in prehistoric, primitive,...",,http://www.flickr.com/photos/britishlibrary/ta...,New York,monographic,"{'creator': ['BUTLER, John Wesley.']}",1970-01-01 00:31:34,{'1': 'lsidyv2e9fd959'},552517,{},sample/full_texts/000552517_01_text.json,{'0': {'000008': ['11228572834']}}
124,1894,[British Library HMNTS 012641.e.93.],Cassell & Co.,[A Toy Tragedy],,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['DE LA PASTURE, Elizabeth Lydia R...",1970-01-01 00:31:34,{'1': 'lsidyv38dfa0d8'},897980,{},sample/full_texts/000897980_01_text.json,
139,1894,[British Library HMNTS 10172.h.3.],Cassell & Co.,"[Old and New Paris. Its history, its people, a...",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['EDWARDS, Henry Sutherland.']}",1970-01-01 00:31:34,{'1': 'lsidyv3bbac19d'},1041515,{},sample/full_texts/001041515_01_text.json,"{'0': {'000626': ['11305031043'], '000155': ['..."
281,1894,[British Library HMNTS 010057.f.29.],Osgood & McIlvaine,"[When we were Strolling Players in the East, etc]",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['MILN, Louise Jordan.']}",1970-01-01 00:31:34,{'1': 'lsidyv391683a6'},2499813,{},sample/full_texts/002499813_01_text.json,"{'0': {'000412': ['11226831694'], '000338': ['..."
382,1894,[British Library HMNTS 11779.g.8.(17.)],Jordison & Co.,[[Book of words of a romantic opera in three a...,[Another edition.] Book of words of a romantic...,http://www.flickr.com/photos/britishlibrary/ta...,Middlesbrough,monographic,"{'contributor': ['SANKEY, Villiers - and SMITH...",1970-01-01 00:31:34,{'1': 'lsidyv31921bdd'},3424364,{},sample/full_texts/003424364_01_text.json,
387,1894,[British Library HMNTS 9210.cc.28.],Rivington & Co.,[History and Literature of France in synoptic ...,,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['SPIERS, Victor Julian Taylor.']}",1970-01-01 00:31:34,{'1': 'lsidyv317bd242'},3463549,{},sample/full_texts/003463549_01_text.json,
411,1894,[British Library HMNTS 10352.g.56.],Printed for the Editor,"[The Old History of Bradford, 1776; with the M...",,http://www.flickr.com/photos/britishlibrary/ta...,Idle,monographic,"{'creator': ['TURNER, Joseph Horsfall.']}",1970-01-01 00:31:34,{'1': 'lsidyv301cdc47'},3693495,{},sample/full_texts/003693495_01_text.json,"{'0': {'000010': ['11304481434'], '000101': ['..."
429,1894,[British Library HMNTS 10161.dd.9.],T. F. Unwin,"[The Heart and Songs of the Spanish Sierras, etc]",,http://www.flickr.com/photos/britishlibrary/ta...,London,monographic,"{'creator': ['WHITE, George Whit.']}",1970-01-01 00:31:34,{'1': 'lsidyv3bf95ee4'},3907040,{},sample/full_texts/003907040_01_text.json,"{'0': {'000125': ['11304563934'], '000187': ['..."


### NaN values

Throughout your own research, you will most likely come across data that is incomplete: incomplete metadata from an external source, data manipulations that fail to return a value, missing records.

This will take the form of a cell containing a `None` or `numpy.nan` value.

It is important to know how to filter them out (or in) in `pandas`, especially as they can mess up some functions you might want to apply to the whole dataframe.

For example, let's take our earlier toy dataframe and blank out one cell:

In [128]:
toy_df = pd.read_csv('example_dataframe.csv', sep=';')
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False
2,2,1986-03-14,wedding,Paul,Guhennec,Friday,False
3,3,1993-01-01,wedding,Ellen,Guhennec,Friday,False
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False


In [147]:
toy_df.at[2, 'event'] = np.nan
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True,birth
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False,anniversary
2,2,1986-03-14,,Paul,Guhennec,Friday,False,not known
3,3,1993-01-01,wedding,Ellen,Guhennec,Friday,False,wedding
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False,anniversary


To find which cells are missing, we use the `.isna()` method.

Called on a whole dataframe, it returns a dataframe of booleans with the same shape. Called on a specific **column**, it returns a `Series` of booleans telling whether each row has a NaN value in that column.

In [131]:
toy_df.isna()

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False


In [132]:
toy_df['event'].isna()

0    False
1    False
2     True
3    False
4    False
Name: event, dtype: bool

Using our `.loc` operator, we can convert that to a subset of the dataframe:

In [133]:
toy_df.loc[toy_df['event'].isna()]

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend
2,2,1986-03-14,,Paul,Guhennec,Friday,False
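
The selection above keeps the rows where the value is missing. To filter them *out* instead, a minimal sketch (on a hypothetical `demo_df`, not the notebook's `toy_df`) can rely on `.notna()`, the complement of `.isna()`, or on `DataFrame.dropna` with its `subset` parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing event
demo_df = pd.DataFrame({
    'event': ['birth', np.nan, 'wedding'],
    'name': ['Paul', 'Ellen', 'Paul'],
})

# Keep only the rows whose 'event' is present
with_event = demo_df.loc[demo_df['event'].notna()]

# Same result with dropna, restricted to the 'event' column
also_with_event = demo_df.dropna(subset=['event'])

print(with_event.equals(also_with_event))
```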


In [140]:
toy_df['event'].fillna('ARHHHHHH')

0          birth
1    anniversary
2       ARHHHHHH
3        wedding
4    anniversary
Name: event, dtype: object

Empty cells can be filled with a replacement value. Here, for example, we might prefer a label such as "unknown" to a NaN value.

Be careful: called on a column, `.fillna()` returns a new `pandas.Series` and leaves the dataframe unchanged. We therefore assign its output back to the column:

In [None]:
s = toy_df['event'].fillna('not known')
toy_df['event'] = s

In [None]:
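# Assign the result back rather than calling fillna(..., inplace=True) on a selected column:
# recent pandas versions warn that the in-place form may leave the dataframe unchanged.
toy_df['event'] = toy_df['event'].fillna('not known')
toy_df['event']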

0          birth
1    anniversary
2      not known
3        wedding
4    anniversary
Name: event, dtype: object

In [None]:
# Template for overwriting a single cell; NEW_TITLE stands for the value you want to store:
# books_df.at[0, 'title'] = NEW_TITLE

### Reindexing

To change the value of one or more cells, you first need to know which rows are affected by the change. (In this section, "reindexing" means assigning new values to the cells you have indexed; it is unrelated to the `DataFrame.reindex` method.)
As before, this can come either:
- from the precise row number (if you know beforehand which cell you want to change)
- from a certain condition being met (for example, changing all cells that start with 'unk')

Reindexing can be thought of as two distinct steps:
- 1° knowing which rows to change;
- 2° modifying the values of these rows for a specific column.

To do this, use the following structure:

`<dataframe>.loc[<dataframe>[<column_to_test>]==<value_to_check_for>, <column_to_change>] = <new_value>`

Continuing with our toy dataframe, this expression selects all rows whose 'event' is 'wedding':

In [154]:
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True,birth
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False,anniversary
2,2,1986-03-14,not known,Paul,Guhennec,Friday,False,not known
3,3,1993-01-01,wedding,Ellen,Guhennec,Friday,False,wedding
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False,anniversary


In [155]:
toy_df.loc[toy_df['event']=='wedding']

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled
3,3,1993-01-01,wedding,Ellen,Guhennec,Friday,False,wedding


Say you want to change that value to 'marriage':

In [None]:
toy_df.loc[toy_df['event']=='wedding', 'event'] = 'marriage'

In [170]:
toy_df.loc[toy_df['event']=='anniversary', 'event'] = "birthday"

In [178]:
toy_df.loc[toy_df['author_firstname']=='Ellen', 'event'] = 'anniversary'

In [179]:
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True,birth
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False,anniversary
2,2,1986-03-14,not known,Paul,Guhennec,Friday,False,not known
3,3,1993-01-01,anniversary,Ellen,Guhennec,Friday,False,wedding
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False,anniversary


As the last assignment shows, the column you test does not have to be the one you change.

More generally, this is:
`<dataframe>.loc[<condition>, <column_to_change>] = <new_value>`

For example, say I want to change my last name in our earlier dataframe:
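
A minimal sketch of that change, rebuilt on a hypothetical two-column copy of the toy data (the new name 'Dupont' is an invented placeholder). The condition tests `author_firstname`, while the assignment targets `author_lastname`:

```python
import pandas as pd

# Hypothetical stand-in for the toy dataframe
demo_df = pd.DataFrame({
    'author_firstname': ['Paul', 'Ellen', 'Paul'],
    'author_lastname': ['Guhennec', 'Guhennec', 'Guhennec'],
})

# Test one column, change another ('Dupont' is an invented placeholder)
demo_df.loc[demo_df['author_firstname'] == 'Paul', 'author_lastname'] = 'Dupont'
print(demo_df)
```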

✏️ [Ex. 10]

In this exercise, we will combine what we've seen on Indexing, Detecting NA values, and Reindexing.

The `books_df` dataframe, loaded from a JSON file earlier, records the identifier of each book's first page scan in the column `imgs`.
In some cases, however, the image does not exist and the cell is empty.

- ✏️ Find how many books do not have an image (column `imgs`)
- ✏️ Find *which* books do not have an image
- ✏️ Replace the empty `imgs` cells with a dummy value
- ✏️ (Advanced) Find the four most common places of publication among the books without an image.

In [None]:
# your solution here:

#1.
len(books_df.loc[books_df['imgs'].isna()])

#2.
books_df.loc[books_df['imgs'].isna(), 'title']

#3.
books_df['imgs'] = books_df['imgs'].fillna('no_img_here')
books_df

#4.
books_df.loc[books_df['imgs']=='no_img_here', 'place'].value_counts().head(4)

place
London    97
Leeds      6
enk        5
Dublin     5
Name: count, dtype: int64

### Iterating

As a rule of thumb, prioritise data manipulations that can be performed in one shot on whole columns. A `pandas.DataFrame` is an inefficient object for frequent, piecemeal modification.

However, there are scenarios where you may want to iterate over the rows of the dataframe, for example to perform row-wise tests or manipulations that require access to more than one column.

To do that, use the `.iterrows()` method (for "iteration over rows").

Just like `enumerate` in plain Python, the `.iterrows()` method returns two elements at each iteration:
- the index label of the row (with the default index, an integer from 0 to the length of the dataframe minus one);
- the row corresponding to that iteration.

That row is a `pandas.Series`. It acts a bit like a Python dictionary: the value for a specific column is accessed through:
`row[<column>]`
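
To tie these points together, here is a minimal sketch on a hypothetical frame (`demo_df` and its columns are assumptions). It computes the same totals twice: once in one shot with a column-wise expression, and once with `.iterrows()`, unpacking each iteration into the row's index and the row itself:

```python
import pandas as pd

# Hypothetical marks, only for illustration
demo_df = pd.DataFrame({'math': [78, 88, 91], 'science': [85, 79, 89]})

# One shot: a single expression over whole columns
demo_df['total'] = demo_df['math'] + demo_df['science']

# Row by row: same result, but one Python-level step per row
totals = []
for index, row in demo_df.iterrows():
    totals.append(row['math'] + row['science'])

print(totals == demo_df['total'].tolist())
```

When both forms are possible, the column-wise one is usually the better choice; the loop is worth keeping for logic that is awkward to express on whole columns.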



In [216]:
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True,birth
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False,anniversary
2,2,1986-03-14,not known,Paul,Guhennec,Friday,False,not known
3,3,1993-01-01,anniversary,Ellen,Guhennec,Friday,False,wedding
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False,anniversary


In [215]:
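# Take the first (index, row) pair produced by iterrows() and inspect the row
i = next(toy_df.iterrows())
type(i[1])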

pandas.core.series.Series

In [221]:
for i in toy_df.iterrows():
    index = i[0]
    row = i[1]
    print(f"This is {row['author_firstname']}'s {row['event']} on {row['date']}!")
    


This is Paul's birth on 1970-05-23!
This is Ellen's anniversary on 1978-07-14!
This is Paul's not known on 1986-03-14!
This is Ellen's anniversary on 1993-01-01!
This is Ellen's anniversary on 1998-07-14!


In [237]:
toy_df['day'] = pd.to_datetime(toy_df['date']).dt.day_name()

for index, row in toy_df.iterrows():
    print('The day at index', index, 'is:', row['day'])

The day at index 0 is: Saturday
The day at index 1 is: Friday
The day at index 2 is: Friday
The day at index 3 is: Friday
The day at index 4 is: Tuesday


✏️ [Ex. 11]

Here we will combine iteration with the `datetime` and `str` accessors. By iterating over the `toy_df` dataframe:

- ✏️ Print the date of each row
- ✏️ Print the day of the week that each date corresponds to
- ✏️ Print whether the event starts with 'bir'

In [254]:
toy_df

Unnamed: 0.1,Unnamed: 0,date,event,author_firstname,author_lastname,date_day,is_weekend,event_filled,day,day_of_the_week
0,0,1970-05-23,birth,Paul,Guhennec,Saturday,True,birth,Saturday,Saturday
1,1,1978-07-14,anniversary,Ellen,Guhennec,Friday,False,anniversary,Friday,Friday
2,2,1986-03-14,not known,Paul,Guhennec,Friday,False,not known,Friday,Friday
3,3,1993-01-01,anniversary,Ellen,Guhennec,Friday,False,wedding,Friday,Friday
4,4,1998-07-14,anniversary,Ellen,Guhennec,Tuesday,False,anniversary,Tuesday,Tuesday


In [262]:
pd.to_datetime(toy_df['date']).dt.day_name()

0    Saturday
1      Friday
2      Friday
3      Friday
4     Tuesday
Name: date, dtype: object

In [None]:
# your solution here:

toy_df['day_of_the_week'] = pd.to_datetime(toy_df['date']).dt.day_name()

for i in toy_df.iterrows():
    index = i[0]
    row = i[1]

    print("The date of this row is:", row['date'])
    print("The day of this row is:", row['day_of_the_week'])
    print("The event starts with 'bir':", row['event'].startswith('bir'))
    print('-----')

The date of this row is: 1970-05-23
The day of this row is: Saturday
The event starts with 'bir': True
-----
The date of this row is: 1978-07-14
The day of this row is: Friday
The event starts with 'bir': False
-----
The date of this row is: 1986-03-14
The day of this row is: Friday
The event starts with 'bir': False
-----
The date of this row is: 1993-01-01
The day of this row is: Friday
The event starts with 'bir': False
-----
The date of this row is: 1998-07-14
The day of this row is: Tuesday
The event starts with 'bir': False
-----


## Final exercise: Shakespeare and Company project

**Dataset**

For this exercise, we will be working with open-access data from a DH research project: the [*Shakespeare and Company project*](https://shakespeareandco.princeton.edu/).

The dataset we'll be using contains the list of books that were lent out in the 20th century by the celebrated *Shakespeare and Company* library in Paris.

The dataset can be downloaded from the following address: https://dataspace.princeton.edu/handle/88435/dsp01jm214s28p (file size: ~2 MB).

You may use CSV or JSON:
- the CSV file will only contain `str`, `float` and `int` values;
- whereas the JSON file will also contain `list` values.

The choice will be one of comfort. Pick whichever yields the datatypes you are more comfortable with.
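
To illustrate the difference, here is a sketch that assumes a multi-valued column is stored in the CSV as one semicolon-separated string (the two rows below are invented, and they are modelled on the `circulation_years` column of the CSV file). After loading, each cell is a plain `str`; `.str.split` turns it into a list of the kind that, per the description above, the JSON file holds:

```python
import io
import pandas as pd

# Invented two-row CSV, modelled on the circulation_years column
csv_text = "title,circulation_years\nBook A,1920;1921\nBook B,1951\n"
demo = pd.read_csv(io.StringIO(csv_text))

# In the CSV version every cell is a single string
print(type(demo.at[0, 'circulation_years']))

# Splitting on ';' yields one list of year strings per row
demo['years_list'] = demo['circulation_years'].str.split(';')
print(demo.at[0, 'years_list'])
```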
 
**Steps**

Start by loading the file into a `pandas.DataFrame` called `sco`.

**Try to answer the following questions**
- 1° How many records does it contain?


- 2° What's the format of the **most purchased** document? How many times was it purchased?
- 3° What's the format of the **least borrowed** document? How many times was it borrowed?

- 4° Which is the most borrowed **book**? How many times was it borrowed?

- 5° How many 'Poems' are there in the collection? What is the earliest one? How many don't have a date?

- 6° Replace the empty cells in the `circulation_years` column with "unknown".
- 7° Find which books were in circulation in 1951.
- 8° (Advanced) By iterating over the rows of the dataframe, create a dictionary that gives, for each year, how many books from the dataframe were in circulation that year.

In [264]:
# your solution here:

sco = pd.read_csv('../data/SCoData_books_v1.2_2022_01.csv')

In [280]:
#2.
max_purchases = sco['purchase_count'].max()
sco.loc[(sco['purchase_count']==max_purchases)]['format'].values[0]

#3.
min_borrows = sco['borrow_count'].min()
sco.loc[(sco['borrow_count']==min_borrows)]['format'].value_counts()

#4.
max_borrows_book = sco.loc[sco['format']=='Book']['borrow_count'].max()
sco.loc[
    (sco['format']=='Book')
    &
    (sco['borrow_count']==max_borrows_book)
    ]

Unnamed: 0,uri,title,author,editor,translator,introduction,illustrator,photographer,year,format,uncertain,ebook_url,volumes_issues,notes,event_count,borrow_count,purchase_count,circulation_years,updated
1305,https://shakespeareandco.princeton.edu/books/j...,A Portrait of the Artist as a Young Man,"Joyce, James",,,,,,1916.0,Book,False,https://archive.org/details/portraitofartist00...,1928,,63,56,5,1920;1921;1922;1923;1924;1925;1926;1928;1929;1...,2021-02-25T15:33:55+00:00


In [286]:
#5.
poems = sco.loc[(sco['title'].str.contains('Poems'))]
len(poems.loc[poems['year'].isna()])

29
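
Questions 6° to 8° can be approached along the following lines. This is only a sketch on an invented three-row frame (`demo` and its values are hypothetical), and it assumes the CSV layout in which `circulation_years` holds semicolon-separated years:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the real dataset
demo = pd.DataFrame({
    'title': ['Book A', 'Book B', 'Book C'],
    'circulation_years': ['1950;1951', np.nan, '1951'],
})

# 6. Replace the empty cells with "unknown"
demo['circulation_years'] = demo['circulation_years'].fillna('unknown')

# 7. Books in circulation in 1951
in_1951 = demo.loc[demo['circulation_years'].str.contains('1951')]

# 8. Count, year by year, how many books were in circulation
books_per_year = {}
for index, row in demo.iterrows():
    if row['circulation_years'] == 'unknown':
        continue
    for year in row['circulation_years'].split(';'):
        books_per_year[year] = books_per_year.get(year, 0) + 1

print(list(in_1951['title']))
print(books_per_year)
```

On the real data the same steps would apply to `sco` instead of `demo`, provided the column is formatted as assumed here.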