# CSV Export

This notebook builds and previews the CSV export that moves data from the source files into Django.

In [1]:
import pandas as pd

In [2]:
cols = [
    'ExhibitionID',
    'ExhibitionNumber',
    'ExhibitionTitle',
    'ConstituentURL', 
    'FirstName',
    'MiddleName',
    'LastName',
    'Suffix',
    'ExhibitionURL',
    'ExhibitionRole',
    'DisplayName',
]

exh = pd.read_csv(
    '~/data1/moma/exhibitions/MoMAExhibitions1929to1989.csv', 
    usecols=cols,
    dtype={
        'ExhibitionID': 'Int64',
    },
    converters={
        'ExhibitionTitle': str,
        'FirstName': str,
        'LastName': str,
        'MiddleName': str,
        'Suffix': str,
    },
    encoding="iso8859-1",
)

## The complete list of columns

['ExhibitionID',
 'ExhibitionNumber',
 'ExhibitionTitle',
 'ExhibitionCitationDate',
 'ExhibitionBeginDate',
 'ExhibitionEndDate',
 'ExhibitionSortOrder',
 'ExhibitionURL',
 'ExhibitionRole',
 'ExhibitionRoleinPressRelease',
 'ConstituentID',
 'ConstituentType',
 'DisplayName',
 'AlphaSort',
 'FirstName',
 'MiddleName',
 'LastName',
 'Suffix',
 'Institution',
 'Nationality',
 'ConstituentBeginDate',
 'ConstituentEndDate',
 'ArtistBio',
 'Gender',
 'VIAFID',
 'WikidataID',
 'ULANID',
 'ConstituentURL']
 

## Filter for artists

The CSV contains one row per person and role in a given exhibition, so let's keep only the rows whose `ExhibitionRole` is `Artist`.

In [3]:
artists = exh.loc[exh['ExhibitionRole'] == 'Artist']
artists = artists.loc[artists['ExhibitionTitle'] != "No#"]

In [None]:
artists

## Add a column for the Gensim token

Gensim's preprocessing stems and otherwise alters artist names (for example, it trims trailing 'e's), so the model's tokens no longer match the names in the CSV. It would be cool to have a column for the token, so that Django can translate those names when interacting with the model.

In [4]:
from gensim.parsing.preprocessing import preprocess_string

In [5]:
def format_name(names):
    "Preprocess each name with Gensim and join the resulting tokens into one string"
    # Process all the names at once; tokens are concatenated with no separator.
    return ["".join(preprocess_string(n)) for n in names]

In [6]:
artists_tokenized = artists.assign(
    token=lambda x: format_name(x.DisplayName)
)

In [None]:
artists_tokenized

## Export time!

All right, now we have a mapping to the Gensim model's token. 

TODO:

This is good, but one thing would make it better. This script exports _all_ the exhibitions, but only some of them are included in the model. Exhibitions with more than 50 entries were skipped when the model's input was built:

```python
for en in exh_numbers:
    terms = Moma.exhibition_artists(en)
    # Don't calculate and output big lists
    if len(terms) <= 50:
        Moma.append_to_outfile(terms)
```


The dataframe isn't ready yet: to match the model, we need to remove exhibitions with more than 50 artists.

In [7]:
exh_numbers = artists_tokenized.ExhibitionNumber.unique()

drop_indices = []

for exn in exh_numbers:
    # Select a group of rows by ExhibitionNumber
    e = artists_tokenized.loc[artists_tokenized["ExhibitionNumber"] == exn]
    # Collect the rows of exhibitions the model skipped (it kept len <= 50)
    if len(e) > 50:
        drop_indices.extend(e.index)

# Drop the oversize exhibitions
tokenized_filtered = artists_tokenized.drop(drop_indices)
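
The loop above scans the whole dataframe once per exhibition. As a rough sketch of an alternative, the same filter can be written without a loop by counting rows per `ExhibitionNumber` and keeping the small groups. The helper name `drop_oversize`, the constant `MAX_ARTISTS`, and the toy dataframe are assumptions made for illustration; the limit of 50 is assumed to mirror the `len(terms) <= 50` check shown earlier.

```python
import pandas as pd

# Assumed to mirror the `len(terms) <= 50` check used when the model was built
MAX_ARTISTS = 50

def drop_oversize(df, key="ExhibitionNumber", limit=MAX_ARTISTS):
    """Keep only rows whose `key` value occurs at most `limit` times (hypothetical helper)."""
    # Number of rows sharing each row's key; rows with a missing key get NaN
    counts = df[key].map(df[key].value_counts())
    # Treat a missing key as count 0 so those rows are kept, as in the loop version
    return df[counts.fillna(0) <= limit]

# Toy data: exhibition "1" has three artists, exhibition "2" has two
toy = pd.DataFrame({
    "ExhibitionNumber": ["1"] * 3 + ["2"] * 2,
    "DisplayName": ["A", "B", "C", "D", "E"],
})
small = drop_oversize(toy, limit=2)
```

With the real data this would be called as `drop_oversize(artists_tokenized)`, which should give the same rows as `tokenized_filtered` if the assumptions above hold.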

In [8]:
# Export the CSV
tokenized_filtered.to_csv(
    index=False,
    path_or_buf='../data/artists_tokenized_filtered.csv',
)
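
To illustrate how the exported file might be consumed, here is a minimal sketch that reads the CSV and builds a token-to-name lookup using only the standard library. It assumes the export has `DisplayName` and `token` columns, as produced above; the helper `build_token_lookup` is hypothetical, and the sample rows are placeholders rather than real MoMA data or real Gensim tokens.

```python
import csv
import io

def build_token_lookup(csv_file):
    """Map each token to the first DisplayName seen for it (hypothetical helper)."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        # An artist appears once per exhibition, so keep only the first name seen
        lookup.setdefault(row["token"], row["DisplayName"])
    return lookup

# Toy stand-in for ../data/artists_tokenized_filtered.csv (placeholder values)
sample = io.StringIO(
    "ExhibitionNumber,DisplayName,token\n"
    "1,Artist One,tok_one\n"
    "2,Artist One,tok_one\n"
    "2,Artist Two,tok_two\n"
)
lookup = build_token_lookup(sample)
```

A Django-side loader could pass an open file handle for the real export in place of `sample`; how the lookup is stored afterwards is left open here.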