In [None]:
# Full name
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

In [None]:
#@title install required packages
%pip install otter-grader
%pip install pandas
%pip install matplotlib
%pip install seaborn

In [None]:
%%capture
#@title clone git repository
%rm -rf aica-assignments
!git clone https://github.com/aica-wavelab/aica-assignments.git
%cd aica-assignments/A1_introduction

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("3_tutorial_data_science.ipynb")

***

# Day 2 - Rudiments of Data Science: Analysing the Curation of the MoMA

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 10.04.2025
+ **Author:** Dr. Benedikt Zönnchen & Dr. Téo Sanchez

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A1_introduction/3_tutorial_data_science.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


This notebook is a brief introduction to data science. It is intended for beginners who are interested in the manipulation of data and the process of answering research questions using data.

This notebook focuses on a concrete application: **the analysis of the artistic curation of the Museum of Modern Art (MoMA) in New York**.

<img src="https://live.staticflickr.com/159/430975967_c97c2d8e35_b.jpg" style="width: 700px; display: block; margin-left: auto; margin-right: auto;"/>

## 9 Data manipulation

In the previous tutorial, you learned about two native data structures in Python: lists (`list`) and dictionaries (`dict`).
Even though these data structures are essential in Python, they lack the functionality needed to manipulate data efficiently.


With the package `pandas`, we are going to learn how to manipulate data frames, i.e. tables with *attributes* as columns and *instances* as rows.

### 9.1 The Pandas Package

[Pandas](https://fr.wikipedia.org/wiki/Pandas) is a ``Python`` package that provides data structures and data analysis tools. It is built on top of the ``numpy`` package, which allows for efficient numerical operations, especially in the field of applied linear algebra.

In [None]:
# Run this cell if you do this notebook on Google Colab or have not installed the package yet
%pip install pandas
import pandas as pd

Using ``numpy`` we can very easily work with lists of numbers like mathematical vectors and lists of lists like matrices. In ``numpy`` we call these lists ``arrays``.

For example, let us define two vectors $a$ and $b$ and then compute the dot product of these vectors:

In [None]:
import numpy as np

a = np.array([[1,2,3,4]])
b = np.array([[-1, 0, 0, 1]])

dot_product = a @ b.T
dot_product

This computes $$\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \end{bmatrix} = 1 \cdot (-1) + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot 1 = -1+4 = 3$$

Instead of $a \cdot b^\top$ we can also compute $a^\top \cdot b$ which results in a matrix:

In [None]:
matrix = a.T @ b
matrix

Machine learning is all about matrix-matrix and matrix-vector multiplication! ``numpy`` is highly optimized for these operations.
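
The same two products can also be written with one-dimensional arrays. The following is a minimal sketch, not part of the original tutorial, that uses the standard `numpy` functions `np.dot` and `np.outer`; the variable names `a_1d` and `b_1d` are chosen here only for illustration.

```python
import numpy as np

# Same numbers as above, but stored as one-dimensional arrays of shape (4,)
a_1d = np.array([1, 2, 3, 4])
b_1d = np.array([-1, 0, 0, 1])

# Dot product of two 1-D arrays: a single number instead of a 1x1 matrix
dot_1d = np.dot(a_1d, b_1d)
print(dot_1d)

# Outer product: a 4x4 matrix, the 1-D counterpart of a.T @ b above
outer = np.outer(a_1d, b_1d)
print(outer.shape)
```

Whether to use 1-D or 2-D arrays is mostly a matter of convention; 2-D arrays make the row and column orientation explicit, while 1-D arrays are shorter to write.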

### 9.2 Series (`pandas.Series`)

But now let us focus on ``pandas``, which differs from ``numpy`` in that it deals not only with numbers!
A series (`pandas.Series`) is a one-dimensional array that can store any data type (integers, strings, floats, etc.). It is similar to a list in Python, but with more functionality.

A series can be instantiated from a Python `list`.

Let's create a series of artworks from the MoMA collection.


In [None]:
artworks = pd.Series(["Spiral for Shared Dreams",
                      "Woman with Dead Child",
                      "Self-Portrait en Face",
                      "Standard Station, Ten-Cent Western Being Torn in Half",
                      "Houston Street",
                      "Shooting Painting American Embassy",
                      "Self-Portrait with Cropped Hair",
                      "Patchwork Quilt"
])

artworks

The series is displayed as a column. The rows are displayed alongside their indexes $0,1,\ldots$. It is possible to reassign indexes, even to non-integers.

The notation `series[i]` is used to access the row with index `i`:

In [None]:
artworks[5]
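
Index labels do not have to be integers. The following is a small sketch with made-up short keys as labels (the keys and the variable name `labelled_prices` are assumptions for illustration); it shows access by label and, via `iloc`, access by integer position.

```python
import pandas as pd

# Hypothetical example: label each value with a short text key
labelled_prices = pd.Series([500, 300, 250], index=["spiral", "woman", "portrait"])

# Access by label
print(labelled_prices["woman"])

# Access by integer position, independent of the labels
print(labelled_prices.iloc[0])
```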

As with Python lists, the length of a series can be accessed with the `len` function:

In [None]:
len(artworks)

Let us create a hypothetical series of prices for the artworks stored in the `artworks` variable (in thousands of US dollars):

In [None]:
prices = pd.Series([500, 300, 250, 5000, 1000, 1200, 10000, 700])
prices

Because this series is numerical, we can perform operations on it. For example, we can calculate the average price of the artworks:

In [None]:
prices.mean()

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

<b>Instruction 9.1:</b> Now calculate the median price using the `median` method.

</div>

In [None]:
...

In [None]:
grader.check("q91")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 9.2:</b> Calculate the minimum and the maximum price of the series using the `min` and `max` methods, respectively.

</div>

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 9.3:</b> What is the range of prices, i.e. the difference between the maximum and minimum prices?
</div>

In [None]:
print(prices.max() - prices.min())

<!-- END QUESTION -->

### 9.3 Data frames (`pandas.DataFrame`)

*Data frames* (type `pandas.DataFrame`) are two-dimensional arrays, with labels associated with rows and columns.
A data frame is not necessarily **homogeneous**, as data types can change from one column to another (strings, floats, integers, etc.).


In the rest of the course, by **table**, we mean a two-dimensional array of type `pandas.DataFrame`, while by **series** we mean a one-dimensional array of type `pandas.Series`.

These concepts of `DataFrame` and `Series` can also be found in other packages or data analysis systems such as [`R`](https://www.r-project.org/).

Let's create a table containing the artworks and their corresponding prices. A `DataFrame` can be instantiated from a dictionary whose keys are the column names and whose values are the series.

In [None]:
df = pd.DataFrame({
    "Artwork" : artworks,
    "Price" : prices
})
df

NB: It is customary to use the name `df` (or the suffix `_df`) for a variable that stores a `DataFrame`. It is, however, better to choose a more descriptive and explicit name such as `artwork_prices_df`.

The table we created has two dimensions and can be seen as a collection of series (i.e. the columns). The series (columns) can be accessed from their labels using the notation `df['label']`:

In [None]:
df["Artwork"]

In [None]:
df["Price"]

In [None]:
print(type(df))
print(type(df["Artwork"]))
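
Because every column is its own series, every column also carries its own data type. The sketch below uses a small stand-in table (`toy_df` is a name introduced here, not the tutorial's `df`) and the `dtypes` attribute to make this visible.

```python
import pandas as pd

# A small stand-in table with one text column and one numeric column
toy_df = pd.DataFrame({
    "Artwork": ["Houston Street", "Patchwork Quilt"],
    "Price": [1000, 700],
})

# Each column has its own data type (dtype)
print(toy_df.dtypes)

# A single column is a Series
price_column = toy_df["Price"]
print(type(price_column).__name__)
```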

You can then access each of the values in the table by
specifying its column label and then its row index. Here is the
artwork in the second row (index 1):

In [None]:
df['Artwork'][1]

And the price value in the fourth row (index 3):

In [None]:
df["Price"][3]

You can use the `loc` indexer to access the values of a row by specifying its index label:

In [None]:
df.loc[1]

You can also access the values of a column by specifying its integer position instead of its label using `iloc`:

In [None]:
df.iloc[:, 0]

The `:` symbol is used to select all rows or all columns, but it can also be used to select a range of rows or columns. For example, to select the rows from the second to the fourth, you can use the notation `df[1:4]`:

In [None]:
df[1:4]

The `:` notation can also be used with `iloc` to select an arbitrary range of rows and columns:

In [None]:
df.iloc[1:4, 0]
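
With the default index $0,1,\ldots$, labels and positions coincide, which hides the difference between `loc` and `iloc`. The following sketch uses a made-up table whose index labels deliberately differ from the row positions; `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"Price": [500, 300, 250, 5000]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions 0..3
)

# loc selects by label; a label slice includes both end points
by_label = toy_df.loc[20:40]

# iloc selects by integer position; the end of the slice is excluded
by_position = toy_df.iloc[1:3]

print(by_label)
print(by_position)
```

The label slice returns three rows (labels 20, 30 and 40), whereas the position slice returns two rows (positions 1 and 2).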

### 9.4 Metadata and Statistics of a Table

We will now use `pandas` to extract some metadata
and statistics from our data.

Firstly, the size of the table:

In [None]:
df.shape  # returns (number of rows, number of columns)

The column labels:

In [None]:
df.columns

The number of rows:

In [None]:
len(df)

Some general information on column types and memory usage:

In [None]:
df.info()

The average of each column containing numerical values (in this case only prices):

In [None]:
df.mean(numeric_only = True)

Standard deviations:

In [None]:
df.std(numeric_only = True)

Medians:

In [None]:
df.median(numeric_only = True)

The `describe` method provides a more exhaustive summary and statistics on each column:

In [None]:
df.describe()

To add a new column to the table, you can use the following syntax:
```python
df['new_column'] = values
```

Let's add a column with the year of creation of each artwork: 

In [None]:
df["Year"] = [2023,
              1903, 
              1904,
              1964, 
              1986,
              1961,
              1940,
              1970]
df

In [None]:
df.describe()
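
A new column can also be computed from existing columns instead of being typed in by hand. The sketch below works on a small stand-in table; the `Age` column and the `reference_year` value are hypothetical and serve only as an illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Artwork": ["Woman with Dead Child", "Houston Street"],
    "Year": [1903, 1986],
})

# Hypothetical reference year, chosen only for this illustration
reference_year = 2025

# The subtraction is applied to every row of the "Year" column at once
toy_df["Age"] = reference_year - toy_df["Year"]
print(toy_df)
```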

### 9.5 Basic operations on data

Pandas allows you to perform operations on data tables. 

Let's first transform the table so that one of the columns is used as the index, for example the artwork names:

In [None]:
df_tmp = df.set_index("Artwork")

It is now possible to access a price value using the name directly as a row index:

In [None]:
df_tmp['Price']["Patchwork Quilt"]

Let us now select all rows whose price is greater than 1 million US dollars (prices are given in thousands):

In [None]:
df[df["Price"] > 1000]

Do you understand the above expression? Let's decompose it...

It is important to understand that operations are **vectorised** on all elements of a series. So, if we write `prices + 100`, this adds 100 to all the values in the series:

In [None]:
prices + 100

Similarly, if we write `prices == 5000`, it returns a series of Booleans, each one indicating whether the corresponding value is equal to $5000$ or not:

In [None]:
prices == 5000

Let's now store the Boolean series that indicates which artworks cost more than 1 million US dollars in a new variable:

In [None]:
costMoreThan1000 = prices > 1000
costMoreThan1000

Indexing a `DataFrame` by a series of Booleans extracts the rows for which the series contains `True`:

In [None]:
df[costMoreThan1000]

This is why `df[df["Price"] > 1000]` selects the rows for which the price is greater than 1000 thousand US dollars, i.e. 1 million.
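
Several conditions can be combined into one Boolean series. The following sketch, on a small made-up table, uses the element-wise operator `&` ("and"); `|` ("or") and `~` ("not") work the same way. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Price": [500, 5000, 1200, 10000],
    "Year": [2023, 1964, 1961, 1940],
})

# Each comparison yields a Boolean series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and <.
expensive_and_old = (toy_df["Price"] > 1000) & (toy_df["Year"] < 1962)

print(toy_df[expensive_and_old])
```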

## 10 A Closer Look at the MoMA's Artist Curation

Now that you've learned the basics of manipulating data with `pandas`, let's move on to a real problem!

The **Museum of Modern Art (MoMA)**, established in 1929, is a museum located in Midtown Manhattan, New York City. It houses an extensive collection that has grown to almost 200,000 artworks from around the globe, covering the last 150 years.

The file `moma_artist.csv` contains information about artists including: their name, nationality, gender, birth and death year.

The file `moma_artwork.csv` contains information about artworks including: their title, artist name, the date of acquisition, and the category of the artwork (painting, sculpture, etc.).

Our analysis will try to answer the following research questions:

> **RQ1.** What are the demographic characteristics of the artists curated by the MoMA?

> **RQ2.** Has the MoMA's artist curation shifted since its creation?

> **RQ3.** In particular, how has the proportion of international and women artists changed?

Data frames can be loaded from comma-separated values (CSV) files using the `read_csv` function:

In [None]:
df_artists = pd.read_csv("data/moma_artist.csv") 
df_artists
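
To see what `read_csv` does without needing a file, the sketch below parses a made-up CSV text held in memory through `io.StringIO`. The rows are invented and only the column names mimic the ones mentioned above; an empty field is read as a missing value (`NaN`).

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """Name,Nationality,Birth Year
Artist A,American,1930
Artist B,German,
Artist C,French,1955
"""

toy_artists = pd.read_csv(io.StringIO(csv_text))

# The empty field is read as NaN, which turns the column into floats
print(toy_artists)
print(toy_artists["Birth Year"].isna().sum())
```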

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 10.1:</b> How many rows and columns does the dataset contain?

</div>

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

The column `Artist ID` assigns a unique identifier to each individual, which distinguishes artists who share the same name.

In [None]:
df_artists["Artist ID"].nunique()
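
One way to check that a column really identifies rows uniquely is to compare its number of distinct values with the number of rows. The sketch below does this on a made-up table containing two artists with the same name; all values are invented for illustration.

```python
import pandas as pd

toy_artists = pd.DataFrame({
    "Artist ID": [1, 2, 3],
    "Name": ["Artist A", "Artist B", "Artist A"],  # made-up namesakes
})

# IDs are unique if there are as many distinct values as rows
ids_are_unique = toy_artists["Artist ID"].nunique() == len(toy_artists)

# Names are not unique here: duplicated() marks every repeat after the first
repeated_names = toy_artists["Name"].duplicated().sum()

print(ids_are_unique, repeated_names)
```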

### 10.1 Analyzing Artists' Nationality

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 10.2:</b> What are the most common artist nationalities in this dataset? Use the `value_counts` method on the `Nationality` series extracted from the data frame.

</div>

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

Visualization is of crucial importance in data analysis. The `seaborn` package is an extension of `matplotlib` that provides a high-level interface for drawing attractive and informative statistical graphics.

If you are not sure what would be a good visualization for your data, you can browse the excellent [Python graph gallery](https://python-graph-gallery.com/) that provides code snippets (using `matplotlib` or `seaborn`) to generate the plots.

Let's use a horizontal bar plot to visualize the most common artist nationalities:


In [None]:
import seaborn as sns

For readability, let's only display the top 10 artist nationalities:

In [None]:
top10Nationalities = df_artists["Nationality"].value_counts().head(10)
top10Nationalities

In [None]:
sns.barplot(x=top10Nationalities, y=top10Nationalities.index)

### 10.2 Analyzing Artists' Gender

Datasets often contain incorrect or missing values.
For example, let's look at artists' gender:

In [None]:
gender = df_artists["Gender"].value_counts()
gender

It becomes obvious that there is an inconsistency in the format of the labels. Let's fix this by converting all labels to lowercase:

In [None]:
df_artists["Gender"] = df_artists["Gender"].str.lower()
gender = df_artists["Gender"].value_counts()
gender
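
Inconsistent capitalisation is only one kind of label noise. The sketch below, on made-up labels, additionally strips stray spaces with `str.strip` and counts missing entries by passing `dropna=False` to `value_counts`; whether the real dataset contains such cases is not assumed here.

```python
import pandas as pd

# Made-up labels with inconsistent case, stray spaces and a missing value
raw_gender = pd.Series(["Male", "male", " Female", "female", None])

cleaned = raw_gender.str.strip().str.lower()

# dropna=False also counts the missing entries
counts = cleaned.value_counts(dropna=False)
print(counts)
```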

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 10.3:</b> Visualize the gender distribution of artists that were curated at the MoMA using the visualization of your choice.

</div>

In [None]:
...

<!-- END QUESTION -->

### 10.3 Birth Date of Artists

Let's have a look at the birth years of the artists curated at the MoMA, using a histogram:

In [None]:
sns.histplot(data=df_artists, x='Birth Year', bins=50)

Now let's visualize the birth year distribution of artists, grouped by gender. We assign the `Gender` column to the `hue` (color) parameter of the `sns.histplot` function, and decide to stack the bars by setting the `multiple` parameter to `stack`.

In other words, the histogram has the same shape as before, but now colors indicate the artists' gender.

In [None]:
sns.histplot(data=df_artists, x='Birth Year', hue="Gender", multiple="stack")


It is hard to interpret gender balance from this plot because:

- the number of artists per band of birth years also varies;
- there are many bins (bands of birth years), i.e. the histogram is too granular.

Let's instead plot the proportion of each gender per band of birth years. We can do this by using the `sns.histplot` function with the `stat` parameter set to `density` and the `multiple` parameter set to `fill`.

In [None]:
sns.histplot(data=df_artists, x='Birth Year', hue="Gender", stat="density", multiple="fill", bins=12)
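
The filled histogram shows, for each band of birth years, the share of each gender within that band. As a rough sketch of the underlying computation (on made-up data, and not the code `seaborn` runs internally), the same proportions can be obtained in `pandas` with `pd.cut` and `pd.crosstab`:

```python
import pandas as pd

toy_artists = pd.DataFrame({
    "Birth Year": [1880, 1890, 1895, 1930, 1940, 1945, 1948],
    "Gender": ["male", "male", "female", "male", "female", "female", "male"],
})

# Assign each artist to a band of birth years (two bands, for illustration)
bands = pd.cut(toy_artists["Birth Year"], bins=[1850, 1900, 1950])

# Count artists per band and gender, then normalise each row to sum to 1
proportions = pd.crosstab(bands, toy_artists["Gender"], normalize="index")
print(proportions)
```

Each row of `proportions` sums to 1, which corresponds to the bars filling the full height of the plot.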

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 10.4:</b> Describe the visualization above.

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 10.5:</b> Give possible interpretations of the visualization above. Keep in mind that interpretations are additional hypotheses that need to be verified with further analysis. They are, however, essential to guide future analysis and are part of your role as a data analyst or researcher.
</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 11 Analyzing the Artworks Acquired by the MoMA

The `moma_artist.csv` file contains limited information about the curation of the MoMA.
In particular, we do not have information at the artwork level or the date of acquisition of the pieces.

Let us now import the `moma_artwork.csv` file to get more information about the artworks.

In [None]:
df_artworks = pd.read_csv("data/moma_artwork.csv")
df_artworks.head(7)

To further shed light on our question about gender balance in the curation of the MoMA, we can now analyze their curation at the level of artworks, and in particular the acquisition year.

<div class="alert alert-info">
    
<b>Instruction 11.1:</b> What are the most recent acquisitions in the present dataset? Use the `sort_values` method to list the artworks in descending order of acquisition date.

</div>

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 11.2:</b> What is the oldest acquisition present in the dataset? What can you say about it?
</div>

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

Let's delete the row corresponding to the artwork with the acquisition date 1216 using the `drop` method and the `index` parameter:

In [None]:
df_artworks = df_artworks.drop(index=128443)
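
Hard-coding a row label works for a single known outlier. An alternative, sketched here on made-up data and not taken from the original tutorial, is to filter with a Boolean mask so that every implausible year is removed regardless of its row label. The lower bound `founding_year` is an assumption based on the founding year stated above.

```python
import pandas as pd

toy_artworks = pd.DataFrame({
    "Title": ["Work A", "Work B", "Work C"],    # made-up titles
    "YearAcquired": [1995.0, 1216.0, 2001.0],   # 1216 mimics the faulty entry
})

# Hypothetical lower bound: the museum was established in 1929
founding_year = 1929

# Keep rows with a plausible year; rows with a missing year are kept too
year = toy_artworks["YearAcquired"]
plausible = year.isna() | (year >= founding_year)
toy_artworks = toy_artworks[plausible]
print(toy_artworks)
```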

We want to analyze artists' gender at the time their works are acquired by the MoMA. The problem is that `moma_artwork.csv` does not contain information about artists' gender.

We can, however, merge the two datasets, `moma_artist.csv` and `moma_artwork.csv`, on the basis of artist names and artists' birth years.

In [None]:
df_merged = pd.merge(df_artists, df_artworks, left_on=["Name", "Birth Year"], right_on=["Artist", "BirthYear"])
df_merged.head(5)

The `merge` function merges two data frames on the basis of common columns.
Now, each row corresponds to an artwork as in `moma_artwork.csv`, but the columns also have information about the artist as in `moma_artist.csv`.
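
By default, `pd.merge` performs an inner join, so artworks without a matching artist row are silently dropped. The sketch below illustrates this on two small made-up tables and uses the `how` and `indicator` parameters of `pd.merge` to make unmatched rows visible; all names and values in the tables are invented.

```python
import pandas as pd

toy_artists = pd.DataFrame({
    "Name": ["Artist A", "Artist B"],
    "Birth Year": [1930, 1955],
    "Gender": ["female", "male"],
})
toy_artworks = pd.DataFrame({
    "Title": ["Work 1", "Work 2", "Work 3"],
    "Artist": ["Artist A", "Artist A", "Artist C"],
    "BirthYear": [1930, 1930, 1901],
})

# how="inner" (the default) keeps only artworks whose artist is found
inner = pd.merge(toy_artists, toy_artworks,
                 left_on=["Name", "Birth Year"], right_on=["Artist", "BirthYear"])

# how="right" keeps every artwork; indicator=True adds a "_merge" column
# that tells whether a matching artist row was found
right = pd.merge(toy_artists, toy_artworks, how="right", indicator=True,
                 left_on=["Name", "Birth Year"], right_on=["Artist", "BirthYear"])

print(len(inner), len(right))
print(right["_merge"].value_counts())
```

Comparing the row counts before and after a merge is a quick way to notice how many artworks were lost.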

### 11.1 Analyzing Artists' Nationality

Now that we have merged the two datasets, we can analyze the nationality of artists over time and at the artwork level.

Let's now implement an appropriate format and visualization to show the evolution of artists' nationality as a function of the acquisition year of artworks.

In [None]:
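# Clean data; .copy() makes df_clean an independent table, which avoids
# a possible SettingWithCopyWarning when a column is added below
df_clean = df_merged.dropna(subset=['Nationality', 'YearAcquired']).copy()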

# Identify the top 5 nationalities
top_nationalities = df_clean['Nationality'].value_counts().nlargest(5).index

# Group and recategorize nationalities
df_clean['GroupedNationality'] = df_clean['Nationality'].apply(lambda x: x if x in top_nationalities else 'Other')

# Group by YearAcquired and GroupedNationality and count the artworks
nationality_yearly = df_clean.groupby(['YearAcquired', 'GroupedNationality']).size()

# Convert the Series to a DataFrame for easier manipulation
nationality_yearly_df = nationality_yearly.unstack(fill_value=0)

# Cumulative total artworks acquired by nationality over years
cumulative_nationality = nationality_yearly_df.cumsum()

# Sort columns based on values in 2017
cumulative_nationality_sorted = cumulative_nationality.loc[:, cumulative_nationality.loc[2017].sort_values(ascending=False).index]
cumulative_nationality_sorted
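
The chain of `groupby`, `size`, `unstack` and `cumsum` is easier to follow on a table small enough to check by hand. The following sketch repeats the three central steps on made-up data; the column name `Group` and all values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "YearAcquired": [1930, 1930, 1931, 1931, 1932],
    "Group": ["American", "French", "American", "American", "French"],
})

# Step 1: count rows per (year, group) pair -> Series with a two-level index
counts = toy.groupby(["YearAcquired", "Group"]).size()

# Step 2: move the group level into columns; absent pairs become 0
table = counts.unstack(fill_value=0)

# Step 3: running total down the years
cumulative = table.cumsum()
print(cumulative)
```

The last row of `cumulative` holds the total number of rows per group, which is why the tutorial sorts the columns by the values of the final year.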

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Plotting
plt.figure(figsize=(12, 6))
cumulative_nationality_sorted.plot(kind='area', stacked=True, ax=plt.gca())
plt.title("Cumulative artists' nationality from MoMA artworks acquisitions")
plt.xlabel('Year Acquired')
plt.ylabel('Total Artworks Acquired')
plt.legend(title='Nationality', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()  # Adjusts plot to ensure everything fits without overlap
plt.show()


<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 11.3:</b> Describe and interpret the visualization above.
</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 11.2 Analyzing Artists' Gender

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 11.4:</b> Repeat a similar analysis but for artists' gender as a function of artwork acquisition year. Use a stacked area plot as a visualization.
</div>

In [None]:
...

<!-- END QUESTION -->



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Plotting (requires `cumulative_gender_sorted` from Instruction 11.4)
plt.figure(figsize=(12, 6))
cumulative_gender_sorted.plot(kind='area', stacked=True, ax=plt.gca())
plt.title("Cumulative artists' gender from MoMA artworks acquisitions")
plt.xlabel('Year Acquired')
plt.ylabel('Total Artworks Acquired')
plt.legend(title='Gender', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()  # Adjusts plot to ensure everything fits without overlap
plt.show()

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 11.5:</b> Describe and interpret the visualization above.
</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 11.6:</b> Draw conclusions from your analysis of the MoMA dataset with respect to the research questions raised at the beginning. What are the limitations of your analysis, and what future work would be possible?
</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Well done!

You've learned the basics of `pandas` and `seaborn`, two essential packages for data manipulation and visualization in Python.

In particular, you should now be able to:

 - create and manipulate `Series` and `DataFrame` objects;
 - extract metadata and statistics from a dataset;
 - perform basic operations on data, such as filtering and adding columns;
 - visualize data using `seaborn` and `matplotlib`;
 - merge two datasets on the basis of common columns;
 - comment and interpret the results of your analysis.
 
It is time to learn about machine learning in Python! Go to the next sheet: [Machine learning tutorial](4_tutorial_machine_learning.ipynb).