# Homework 1 - data validation & cleaning (deadline 3. 11. 2024, 23:59)

In short, the main task is to clean The Metropolitan Museum of Art Open Access dataset.
  
> The instructions are not given in detail: It is up to you to come up with ideas on how to fulfill the particular tasks as best as possible!

However, we **strongly recommend and require** the following:
* Follow the assignment step by step. Number each step.
* Most steps contain the number of features that should be treated. You can preprocess more features. However, it does not mean the teacher will give you more points. Focus on quality, not quantity.
* Properly comment on all your steps. Use Markdown cells and visualizations. Comments are evaluated for 2 points of the total, together with the final presentation of the solution. However, it is not desirable to write novels! 
* This task is timewise and computationally intensive. Do not leave it to the last minute.
* Hand in a notebook that has already been run (i.e., do not delete outputs before handing in).

## What are you supposed to do:

  1. Download the dataset MetObjects.csv from the repository https://github.com/metmuseum/openaccess/.
  1. Check consistency (i.e., that the same things are represented in the same way) of at least **three features** where you expect problems (including the "Object Name" feature). You can propose how to clean the selected features. However, **do not apply cleaning** (in your interest) 🙂 _(1.5 points)_
  1. Select at least **two features** (i.e., one couple) where you expect integrity problems (describe your choice) and check the integrity of those features. By integrity, we mean correct logical relations between features (e.g., female names for females only). _(2 points)_
  1. Convert at least **five features** to a proper data type. Choose at least one numeric, one categorical (i.e., ordinal or nominal), and one datetime. _(1.5 points)_
  1. Find some outliers and describe your method. _(3 points, depends on creativity)_
  1. Detect missing data in at least **three features**, convert them to a proper representation (if they are not already), and impute missing values in at least **one feature** using some imputation method (i.e., imputation by mean or median is too trivial to obtain any points). _(2 + 3 points, depends on creativity)_
  1. Focus more closely on cleaning the "Medium" feature, as if you were going to use it in a KNN classification algorithm later. _(3 points)_
  1. Focus on the extraction of the physical dimensions of each item (width, depth, and height in centimeters) from the "Dimensions" feature. _(2 points)_
  
All your steps, your choices of methods, and the following code **must be commented on!** For text comments (discussion, etc., not code comments), use **Markdown cells**. Comments are evaluated for 2 points together with the final presentation of the solution. 

**If you do all this properly, you will obtain 20 points.**

## Comments

  * Please follow the technical instructions from https://courses.fit.cvut.cz/NI-PDD/homeworks/index.html.
  * Methods that are more complex and were not shown during the tutorials are considered more creative and should be described in detail.
  * English is not compulsory.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## 1. Downloading the dataset
We download the file directly from its URL (hoping it will not change within the next two months [30.09]) and read it with pandas.

In [3]:
url = 'https://github.com/metmuseum/openaccess/raw/refs/heads/master/MetObjects.csv?download='
df = pd.read_csv(url, index_col='Object ID')
df.head()



Object ID,Object Number,Is Highlight,Is Timeline Work,Is Public Domain,Gallery Number,Department,AccessionYear,Object Name,Title,Culture,...,River,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Metadata Date,Repository,Tags,Tags AAT URL,Tags Wikidata URL
1,1979.486.1,False,False,False,,The American Wing,1979.0,Coin,One-dollar Liberty Head Coin,,...,,,,http://www.metmuseum.org/art/collection/search/1,,,"Metropolitan Museum of Art, New York, NY",,,
2,1980.264.5,False,False,False,,The American Wing,1980.0,Coin,Ten-dollar Liberty Head Coin,,...,,,,http://www.metmuseum.org/art/collection/search/2,,,"Metropolitan Museum of Art, New York, NY",,,
3,67.265.9,False,False,False,,The American Wing,1967.0,Coin,Two-and-a-Half Dollar Coin,,...,,,,http://www.metmuseum.org/art/collection/search/3,,,"Metropolitan Museum of Art, New York, NY",,,
4,67.265.10,False,False,False,,The American Wing,1967.0,Coin,Two-and-a-Half Dollar Coin,,...,,,,http://www.metmuseum.org/art/collection/search/4,,,"Metropolitan Museum of Art, New York, NY",,,
5,67.265.11,False,False,False,,The American Wing,1967.0,Coin,Two-and-a-Half Dollar Coin,,...,,,,http://www.metmuseum.org/art/collection/search/5,,,"Metropolitan Museum of Art, New York, NY",,,
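
Because the remote file may change over time, it can help to keep a local snapshot. The following is only a minimal sketch: `load_csv_cached` and its `cache_path` argument are hypothetical names introduced here, not part of the original solution. Passing `low_memory=False` asks pandas to parse the file as a whole rather than in chunks, which usually avoids mixed-type inference at the cost of more memory.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd


def load_csv_cached(url: str, cache_path: str, **read_kwargs) -> pd.DataFrame:
    """Download `url` to `cache_path` once, then always read the local copy.

    Hypothetical helper for illustration; extra keyword arguments are
    forwarded to pandas.read_csv.
    """
    cache = Path(cache_path)
    if not cache.exists():
        # First run only: store the raw bytes so the snapshot stays unmodified.
        urlretrieve(url, str(cache))
    return pd.read_csv(cache, low_memory=False, **read_kwargs)


# Possible usage in this notebook (not executed here):
# df = load_csv_cached(url, 'MetObjects.csv', index_col='Object ID')
```

Later runs of the notebook then work on exactly the same data, even if the repository is updated in the meantime.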


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 484956 entries, 1 to 900748
Data columns (total 53 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Object Number            484956 non-null  object 
 1   Is Highlight             484956 non-null  bool   
 2   Is Timeline Work         484956 non-null  bool   
 3   Is Public Domain         484956 non-null  bool   
 4   Gallery Number           49541 non-null   object 
 5   Department               484956 non-null  object 
 6   AccessionYear            481094 non-null  object 
 7   Object Name              482690 non-null  object 
 8   Title                    456153 non-null  object 
 9   Culture                  208190 non-null  object 
 10  Period                   91143 non-null   object 
 11  Dynasty                  23201 non-null   object 
 12  Reign                    11236 non-null   object 
 13  Portfolio                26514 non-null   object 
 14  Constitue

In [5]:
display(df.select_dtypes(include=object).describe())
display(df.select_dtypes(exclude=object).describe())

Unnamed: 0,Object Number,Gallery Number,Department,AccessionYear,Object Name,Title,Culture,Period,Dynasty,Reign,...,Excavation,River,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Repository,Tags,Tags AAT URL,Tags Wikidata URL
count,484956.0,49541,484956,481094.0,482690,456153,208190,91143,23201,11236,...,16571,2092,406239,24529,484956,69154,484956,192455,192455,192455
unique,481656.0,563,19,316.0,28631,245800,7313,1891,405,396,...,411,228,1244,1507,484956,69076,1,44171,43699,43886
top,62.635,774,Drawings and Prints,1963.0,Print,Terracotta fragment of a kylix (drinking cup),American,Edo period (1615–1868),Dynasty 18,reign of Amenhotep III,...,MMA excavations,Upper Sepik River,Prints,"© Walker Evans Archive, The Metropolitan Museu...",http://www.metmuseum.org/art/collection/search...,https://www.wikidata.org/wiki/Q97732991,"Metropolitan Museum of Art, New York, NY",Flowers,http://vocab.getty.edu/page/aat/300132399,https://www.wikidata.org/wiki/Q506
freq,4.0,7037,172630,39846.0,102986,6415,28579,9127,7184,2750,...,2455,362,84326,7364,1,17,484956,8543,8543,8543


Unnamed: 0,Object Begin Date,Object End Date,Metadata Date
count,484956.0,484956.0,0.0
mean,1303.913734,1402.978142,
std,1710.259182,1132.101347,
min,-400000.0,-240000.0,
25%,1535.0,1593.0,
50%,1800.0,1840.0,
75%,1891.0,1905.0,
max,5000.0,2870.0,


## 2. Checking consistency

TODO:
* write reduction functions
  * runs of whitespace characters -> a single space
  * non-alphabetic characters -> removed

### Object Name column
Let's look at the least frequent names.

In [25]:
col_count = df['Object Name'].value_counts()
col_count.tail(15)

Object Name
Academicien's Habit                        1
Coat Dress                                 1
Print; portfolio                           1
Album\r\nPrints                            1
Contrabass saxophone in E flat             1
La Mort du Cygne Grand Piano               1
Mark for a mail shirt                      1
Valentine maker's album (Jonathan King)    1
Tops                                       1
Frock Coat                                 1
Toy musket                                 1
Ten wootz steel ingots with bag            1
Helmet (<i>Top</i>)                        1
Armor of mail                              1
Spearhead (<i>Sang</i>)                    1
Name: count, dtype: int64

There appear to be names with HTML tags, extra whitespace characters, and semicolons used as separators. Let's check whether cleaned variants such as "Album print", "Print portfolio", "Helmet (Top)" and "Spearhead (Sang)" also occur in the data.

In [31]:
for name in ["Album print", "Print portfolio", "Helmet (Top)", "Spearhead (Sang)"]:
    print(f"'{name}' occurs {col_count.get(name, 0)} times")

'Album print' occurs 11 times
'Print portfolio' occurs 2 times
'Helmet (Top)' occurs 0 times
'Spearhead (Sang)' occurs 0 times
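
The kinds of irregularities mentioned above can also be counted directly instead of being spotted by eye. This is a rough sketch: the `CHECKS` dictionary and the `count_irregularities` helper are hypothetical, and the regular expressions are only one assumption about what counts as a problem.

```python
import pandas as pd

# Illustrative patterns only; adjust them to the problems actually observed.
CHECKS = {
    'html tag': r'<[^>]+>',
    'line break or tab': r'[\r\n\t]',
    'repeated spaces': r' {2,}',
    'semicolon': r';',
}


def count_irregularities(names: pd.Series) -> pd.Series:
    """Return, for each check, how many non-missing values match its pattern."""
    values = names.dropna().astype(str)
    return pd.Series({
        label: int(values.str.contains(pattern, regex=True).sum())
        for label, pattern in CHECKS.items()
    })


# Small hand-made sample built from names seen in the output above.
sample = pd.Series(['Helmet (<i>Top</i>)', 'Album\r\nPrints', 'Print; portfolio', 'Coin', None])
print(count_irregularities(sample))
```

On the real data this would be called as `count_irregularities(df['Object Name'])`, which gives a quick measure of how widespread each problem is.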


As we can see, "Album print" and "Print portfolio" already exist as separate names, so I would suggest turning any sequence of whitespace characters into a single space, stripping HTML tags, and removing characters that are not letters, spaces, or apostrophes (optionally keeping parentheses).

In [44]:
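def wc_to_space(s: pd.Series) -> pd.Series:
    """Collapse every run of whitespace (including line breaks) into one space."""
    return s.str.replace(r'\s+', ' ', regex=True)

def strip_html(s: pd.Series) -> pd.Series:
    """Remove HTML tags such as <i> and </i> but keep the text between them."""
    # '<[^>]+>' matches one tag at a time; the greedy '<.*>' would also delete the enclosed text.
    return s.str.replace(r'<[^>]+>', '', regex=True)

def remove_nonalpha(s: pd.Series, *, keep_parenthesis=False) -> pd.Series:
    """Drop everything except word characters (letters, digits, underscore), spaces and apostrophes."""
    allowed = r"\w '" + (r'\(\)' if keep_parenthesis else '')
    return s.str.replace(f'[^{allowed}]', '', regex=True)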

In [46]:
tmp = wc_to_space(df['Object Name'])
print('\n', tmp.value_counts().tail(5))
tmp = strip_html(tmp)
print('\n', tmp.value_counts().tail(5))
tmp = remove_nonalpha(tmp, keep_parenthesis=True)
print('\n', tmp.value_counts().tail(5))


 Object Name
Painting fragments          1
Tusk fragment               1
IIllustrated single work    1
Fragments of a robe         1
Possibly a lid              1
Name: count, dtype: int64

 Object Name
IIllustrated single work    1
Fragments of a robe         1
Mixing spoon                1
Fragment of dish            1
Hatchet (Nata)              1
Name: count, dtype: int64

 Object Name
Work box                 1
Collar or belt           1
Gold Bracelet            1
Valentine makers' box    1
Etching plate            1
Name: count, dtype: int64
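
To judge whether the proposed cleaning actually improves consistency, one can compare the number of distinct names before and after, and list which raw spellings collapse onto the same cleaned name. The sketch below is self-contained and uses hypothetical helpers (`normalize`, `merged_variants`). It assumes a tag pattern that removes only the tags themselves, and it additionally strips leading and trailing spaces, which goes slightly beyond the three steps proposed above.

```python
import pandas as pd


def normalize(names: pd.Series) -> pd.Series:
    """Collapse whitespace, strip HTML tags, drop other symbols, trim the ends."""
    out = names.str.replace(r'\s+', ' ', regex=True)
    out = out.str.replace(r'<[^>]+>', '', regex=True)
    out = out.str.replace(r"[^\w '()]", '', regex=True)
    return out.str.strip()


def merged_variants(names: pd.Series) -> pd.Series:
    """Map each cleaned name to the raw spellings that collapse onto it.

    Only cleaned names with more than one raw spelling are returned.
    """
    raw = names.dropna().drop_duplicates()
    groups = raw.groupby(normalize(raw)).agg(list)
    return groups[groups.map(len) > 1]


# Hand-made sample; on the real data use df['Object Name'] instead.
sample = pd.Series(['Album\r\nPrints', 'Album Prints', 'Helmet (<i>Top</i>)', 'Helmet (Top)', 'Coin'])
print(sample.nunique(), '->', normalize(sample).nunique())
print(merged_variants(sample))
```

A large drop in the number of unique values, together with plausible groups of merged spellings, would support the proposal without the cleaning having to be applied to the dataset itself.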


## 3. Integrity

## 4. Correcting dtypes

## 5. Outliers