# Data Set

**Source:** FBI National Stolen Art File\
**Retrieval Date:** August 25, 2024\
**About:** A listing of the paintings, statuary, and other forms of fine art in the FBI's database of stolen artwork and culturally significant property.\
**Source URL:** [Access web database here.](https://artcrimes.fbi.gov/nsaf-view?searchText=&crimeCategory=)\
**GitHub Source:** The data taken from the FBI National Stolen Art File is saved in several formats in this repo. Which formatted version is used at a given point in the project depends on the transformation and the type of test being applied.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df1 = pd.read_csv('/Users/Alli_1/Documents/Repos/fbi_stolen_art_research/Data/FBI_to_sample.csv',na_values=['NA'])

Briefly making sure the data loaded correctly and looks as expected...

In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4561 entries, 0 to 4560
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Title                   4561 non-null   object
 1   Category                4561 non-null   object
 2   Ref_Num                 4548 non-null   object
 3   Maker_Artist            4129 non-null   object
 4   Materials               4224 non-null   object
 5   Measurements            3595 non-null   object
 6   Time Period             3293 non-null   object
 7   Additional Information  4022 non-null   object
dtypes: object(8)
memory usage: 285.2+ KB


In [4]:
df1.head()

|   | Title | Category | Ref_Num | Maker_Artist | Materials | Measurements | Time Period | Additional Information |
|---|-------|----------|---------|--------------|-----------|--------------|-------------|------------------------|
| 0 | Buddhist Bell Sets and Dorjes | Instruments | 258 | Tibetan | Bronze |  |  | bell, dorje, music, Buddhism |
| 1 | Preston C. Hudson Binoculars | Other - Assorted | 808 |  | Brass; Glass | Base diameter 6.0 cm.; Eyepiece diameter 4.1 c... | 1862-1865 | binoculars; Civil War; The inscription describ... |
| 2 | Musician's Sword | Firearms and Blades | 89 | Emerson and Silver | Brass and brushed steel | 35 in | 1860s | sword; US DFII 1863, DFM, Emerson and Silver,... |
| 3 | Model 1933 SS Dagger | Firearms and Blades | 34 | Attributed to "Jacobs" |  |  | 1933 |  |
| 4 | Model 1933 SS Full Roehm Dagger | Firearms and Blades | 34 | Attributed to "Eickhorn" |  |  | 1933 |  |
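
A load check like the one above can also be written as explicit tests rather than read off by eye. The following is a minimal sketch, not part of the original notebook: the helper name `missing_report` and the two toy rows are assumptions made for illustration, while the column names are the ones reported by `df1.info()`.

```python
import pandas as pd

# Column names as listed in the df1.info() output.
EXPECTED_COLUMNS = [
    "Title", "Category", "Ref_Num", "Maker_Artist",
    "Materials", "Measurements", "Time Period", "Additional Information",
]

def missing_report(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    pct_missing = (n_missing / len(df) * 100).round(1)
    return pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})

# Toy stand-in for df1 (two illustrative rows, not the real file).
toy = pd.DataFrame(
    [
        ["Model 1933 SS Dagger", "Firearms and Blades", "34",
         'Attributed to "Jacobs"', None, None, "1933", None],
        ["Musician's Sword", "Firearms and Blades", "89",
         "Emerson and Silver", "Brass and brushed steel", "35 in", "1860s", "sword"],
    ],
    columns=EXPECTED_COLUMNS,
)

# Columns that are missing or unexpected relative to the expected schema.
unexpected = set(toy.columns) ^ set(EXPECTED_COLUMNS)
report = missing_report(toy)
print(unexpected)
print(report)
```

On the real data, the same helper could be called as `missing_report(df1)` to see which columns have the most gaps before deciding how to sample.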


### Step 2: Stratified Sample

My dataset is too big and too messy to clean properly at its current size: much of the editing has to be done by hand, and additional art-historical context columns still need to be added. So, I need to create a sample. To keep the sample from being too imbalanced, I'm going to take a stratified sample based on the "Category" feature. This is one of the data points most consistently included in the original web database, and the one I will transform the least down the road, which makes it a good basis for the stratification.

In [5]:
## briefly getting a closer handle on the 'category' feature
df1['Category'].describe()

count          4561
unique           27
top       Paintings
freq           1634
Name: Category, dtype: object

In [6]:
## checking the general distribution among the category types as a kick-off
percentage_counts = df1['Category'].value_counts(normalize=True) * 100
print(percentage_counts)

Paintings                       35.825477
Sculptures                      16.136812
Prints                          11.203683
Books                            6.007455
Other - Arts and Antiques        5.064679
Drawings, Watercolors            4.406928
Jewelry                          3.617628
Other - Assorted                 2.345977
Photographs                      1.863626
Ornamental Ceramic Wares         1.863626
Textiles                         1.403201
Coins and Paper Money            1.337426
Collectibles                     1.271651
Vases                            1.118176
Dolls and Figurines              0.986626
Bowls                            0.898926
Clock, Timepiece                 0.767376
Instruments                      0.613900
Ethnographic Works of Art        0.591975
Lamps, Lighting                  0.460425
Icons, Triptychs, Diptychs       0.438500
Firearms and Blades              0.372725
Clothing and Costumes            0.328875
Ceramic Tea and Coffee Wares     0

In [7]:
# making the sample: draw a random 10% of the rows within each category
fbi_messy_sample = df1.groupby('Category', group_keys=False).apply(lambda x: x.sample(frac=0.1))
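
Two properties of this kind of per-group sampling are worth keeping in mind. Without a seed, each run draws a different 10%. And because pandas rounds `frac * group size` to a whole number of rows, very small categories can end up with no rows in the sample. The sketch below illustrates both on toy data. It assumes pandas 1.1 or later (where `DataFrameGroupBy.sample` is available); the helper name `stratified_sample`, the seed value, and the toy frame are illustrative assumptions, not the method used above.

```python
import pandas as pd

def stratified_sample(df, by, frac, seed=42):
    """Hypothetical helper: draw the same fraction from every group, reproducibly."""
    return df.groupby(by, group_keys=False).sample(frac=frac, random_state=seed)

# Toy stand-in for the real data: three categories of very different sizes.
toy = pd.DataFrame({
    "Category": ["Paintings"] * 100 + ["Prints"] * 50 + ["Holloware"] * 3,
    "Title": [f"item {i}" for i in range(153)],
})

sample_a = stratified_sample(toy, "Category", frac=0.1)
sample_b = stratified_sample(toy, "Category", frac=0.1)

# With a fixed seed, two runs select the same rows.
same_rows = sample_a.index.equals(sample_b.index)
# Rows drawn per category; the 3-row group rounds down to 0 rows.
counts = sample_a["Category"].value_counts().to_dict()
# Categories present in the data but absent from the sample.
dropped = set(toy["Category"]) - set(sample_a["Category"])

print(same_rows)
print(counts)
print(dropped)
```

If the smallest categories need to be represented, one option is to sample a minimum number of rows per group instead of a pure fraction; whether that is appropriate depends on how the sample will be used.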

In [8]:
## checking to see if sample is indeed appropriately proportional
sample_percentage_counts = fbi_messy_sample['Category'].value_counts(normalize=True) * 100
print(sample_percentage_counts)

Paintings                       35.903084
Sculptures                      16.299559
Prints                          11.233480
Books                            5.947137
Other - Arts and Antiques        5.066079
Drawings, Watercolors            4.405286
Jewelry                          3.524229
Other - Assorted                 2.422907
Photographs                      1.762115
Ornamental Ceramic Wares         1.762115
Textiles                         1.321586
Coins and Paper Money            1.321586
Collectibles                     1.321586
Vases                            1.101322
Bowls                            0.881057
Dolls and Figurines              0.881057
Clock, Timepiece                 0.881057
Instruments                      0.660793
Ethnographic Works of Art        0.660793
Lamps, Lighting                  0.440529
Firearms and Blades              0.440529
Clothing and Costumes            0.440529
Icons, Triptychs, Diptychs       0.440529
Holloware                        0
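
Comparing two long printouts by eye is error-prone, so the proportionality check can also be reduced to one table and one number. This is a minimal sketch under stated assumptions: `compare_proportions` is a hypothetical helper, and the two toy series stand in for `df1['Category']` and `fbi_messy_sample['Category']`.

```python
import pandas as pd

def compare_proportions(population, sample):
    """Hypothetical helper: percentage share per category in both series, plus the gap."""
    pop_pct = population.value_counts(normalize=True) * 100
    samp_pct = sample.value_counts(normalize=True) * 100
    table = pd.concat([pop_pct, samp_pct], axis=1, keys=["population_pct", "sample_pct"])
    # Categories that are absent from one side show up as NaN; treat them as 0%.
    table = table.fillna(0.0)
    table["abs_diff"] = (table["population_pct"] - table["sample_pct"]).abs()
    return table.sort_values("population_pct", ascending=False)

# Toy data: the sample is an exact 10% of each category.
population = pd.Series(["Paintings"] * 60 + ["Prints"] * 30 + ["Vases"] * 10)
sample = pd.Series(["Paintings"] * 6 + ["Prints"] * 3 + ["Vases"] * 1)

report = compare_proportions(population, sample)
max_gap = report["abs_diff"].max()
print(report)
print(max_gap)
```

On the real data, the call would be `compare_proportions(df1['Category'], fbi_messy_sample['Category'])`; the largest value in `abs_diff` then shows how far the sample drifts from the full dataset for any single category.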

In [9]:
#again just checking in to see that the sample is the right size etc.
fbi_messy_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 454 entries, 138 to 4439
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Title                   454 non-null    object
 1   Category                454 non-null    object
 2   Ref_Num                 454 non-null    object
 3   Maker_Artist            411 non-null    object
 4   Materials               423 non-null    object
 5   Measurements            354 non-null    object
 6   Time Period             335 non-null    object
 7   Additional Information  403 non-null    object
dtypes: object(8)
memory usage: 31.9+ KB


### Step 3: Export

Since everything looks as it should, I am going to export this sample as a new CSV, which I can then clean further by hand and later merge with other necessary data in a separate notebook.

In [11]:
fbi_messy_sample.to_csv('fbi_messy_sample.csv')
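
By default, `to_csv` writes the DataFrame index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. For a sample, that index holds the row positions from the full dataset, so it can be useful to keep it under an explicit name. The sketch below shows one way to do that; the column name `orig_row` and the toy frame are assumptions for illustration, and an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

# Toy stand-in for the sample; the index holds row positions from the full dataset.
toy_sample = pd.DataFrame(
    {
        "Title": ["Musician's Sword", "Buddhist Bell Sets and Dorjes"],
        "Category": ["Firearms and Blades", "Instruments"],
    },
    index=[2, 0],
)

buffer = io.StringIO()  # stands in for a path such as 'fbi_messy_sample.csv'
# index_label names the index column; "orig_row" is a hypothetical name.
toy_sample.to_csv(buffer, index_label="orig_row")
buffer.seek(0)

# Reading the file back with the named column restored as the index.
restored = pd.read_csv(buffer, index_col="orig_row")
print(restored.index.tolist())
print(list(restored.columns))
```

Keeping the original row positions makes it possible to trace any hand-edited sample row back to its record in the full file.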