# Creator Update

### Steps

1. Import the dictionary ("dict.csv") used to update creators. The dictionary has two columns, "Old" and "New", based on the data cleaning work done by Claire Cui and Claire Williams. The update history and process can be found in 2018-07-04_Creators List.xlsx.
2. Import the file ("chung_creator.csv") that contains the IDs and creators of the Chung collection. The creator names in this file will be replaced by the new creator names according to the dictionary imported in the previous step.
3. Replace the creators with the new creator names from "dict.csv".

## Step 1: Import dict.csv

In [1]:
import pandas as pd
#import cleaned creators with old and new creator names
df_dict = pd.read_csv('dict.csv')

In [2]:
#Show the first five rows of the data. To see the full data, enter df_dict.
df_dict.head()

| | Old | New |
|---|---|---|
| 0 | [Bauzong studio] | Bauzong Studio |
| 1 | [Bing Kong Tong] | Bing Kong Tong |
| 2 | [Bohm, Charles] | Bohm, Charles |
| 3 | [Bonhene?] | Borchers, Louis |
| 4 | [Burrard Yarrows Corporation] | Yarrows Limited |


In [3]:
#Convert the dataframe to a list of [Old, New] pairs, for convenience when replacing creators
c_dict=df_dict.values.tolist()

In [4]:
#The length of the list is 1320, which means there are 1320 pairs of old and new creators
len(c_dict)

1320
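
Because each entry of `c_dict` is an `[Old, New]` pair, the pairs could also be collapsed into a plain Python dictionary keyed by the old name. The following is a minimal sketch and not part of the original notebook: `build_lookup` is a hypothetical helper name, the first three sample pairs are copied from the `df_dict.head()` output above, and the last pair is invented to show an unchanged name being skipped.

```python
def build_lookup(pairs):
    """Map each stripped old name to its stripped new name.

    Pairs whose old and new names are identical are skipped,
    because they would not change anything.
    """
    lookup = {}
    for old, new in pairs:
        old, new = str(old).strip(), str(new).strip()
        if old != new:
            lookup[old] = new
    return lookup


# The first three pairs come from the df_dict.head() output above.
sample_pairs = [
    ["[Bauzong studio]", "Bauzong Studio"],
    ["[Bing Kong Tong]", "Bing Kong Tong"],
    ["[Bohm, Charles]", "Bohm, Charles"],
    ["Yip, Randall", "Yip, Randall"],  # hypothetical unchanged pair
]
lookup = build_lookup(sample_pairs)
print(lookup["[Bing Kong Tong]"])  # Bing Kong Tong
print(len(lookup))                 # 3
```

Unlike the notebook, this sketch compares the names after stripping whitespace, so a pair that differs only by surrounding spaces is treated as unchanged.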

## Step 2: Import chung_creator.csv

In [5]:
#Import chung creators, with the identifier linked to the creator
df_chung=pd.read_csv('chung_creator.csv')
df_chung.head()

| | ID | RBSC_Access Identifier | RBSC_Creator |
|---|---|---|---|
| 0 | 1 | CC-PH-28-1 | [unknown] |
| 1 | 2 | CC-PH-28-2 | [unknown] |
| 2 | 3 | CC-PH-28-3 | Browne, D. L. |
| 3 | 4 | CC-PH-29-1 | [unknown] |
| 4 | 5 | CC-PH-30-1 | [unknown] |


In [6]:
df_chung=df_chung.set_index('ID')

In [7]:
#Convert dataframe to dictionary for processing
chung_creator=df_chung.to_dict()

In [8]:
#Print the creator of record ID 229, which is BSL. This is used as a test case to check whether the replacement succeeded.
chung_creator["RBSC_Creator"][229]

'BSL'

In [9]:
#Save all creators in a dictionary named "creators", with ID and corresponding creators
import itertools
creators=chung_creator["RBSC_Creator"]
dict(itertools.islice(creators.items(), 20))

{1: '[unknown]',
 2: '[unknown]',
 3: 'Browne, D. L. ',
 4: '[unknown]',
 5: '[unknown]',
 6: '[unknown]',
 7: '[unknown]',
 8: '[unknown]',
 9: 'David, Kitty',
 10: 'David, Kitty',
 11: 'David, Kitty',
 12: 'Jue, Frank',
 13: '[unknown]',
 14: '[unknown]',
 15: '[unknown]',
 16: '[unknown]',
 17: '[unknown]',
 18: '[unknown]',
 19: '[unknown]',
 20: '[unknown]'}

In [10]:
#Store multiple creators as separate entities by splitting on ";"
for key, creator in creators.items():
    creators[key]=str(creator).strip(";").split(";")

In [11]:
print(len(creators))
creators[55]

20302


['Yip, Randall ', ' Canadian Broadcasting Corporation']
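
The split above leaves the spaces around each name in place (they are removed later in the notebook). As a minimal sketch, assuming it is acceptable to clean the names at split time, the two steps could be combined. `split_creators` is a hypothetical helper, not part of the original code.

```python
def split_creators(raw):
    """Split a ';'-separated creator string into a list of clean names."""
    parts = str(raw).split(";")
    # Strip whitespace and drop empty pieces left by a leading,
    # trailing, or doubled ';'.
    return [part.strip() for part in parts if part.strip()]


print(split_creators("Yip, Randall ; Canadian Broadcasting Corporation"))
# ['Yip, Randall', 'Canadian Broadcasting Corporation']
print(split_creators("[unknown]"))
# ['[unknown]']
```

This sketch differs slightly from the notebook: an empty string yields an empty list rather than `['']`. In both versions, `str()` turns a missing value (NaN) into the text `'nan'`, so records without a creator may need separate handling.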

In [12]:
old_creators_df=pd.DataFrame.from_dict(creators,orient='index')

In [13]:
old_creators_df.loc[54:55]

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 54 | Yip, Randall | | | | | | |
| 55 | Yip, Randall | Canadian Broadcasting Corporation | | | | | |


In [14]:
#Create a placeholder dictionary with the same keys; the values are overwritten in Step 3
creators_new=dict.fromkeys(creators.keys(), [])
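
One detail worth knowing about `dict.fromkeys(keys, [])` is that every key receives the same list object. In this notebook the placeholder values are reassigned in Step 3, so this does no harm here, but appending to one of them would appear to change all of them. A small illustration with made-up keys:

```python
keys = [1, 2, 3]

# dict.fromkeys reuses the single list object for every key.
shared = dict.fromkeys(keys, [])
shared[1].append("Yip, Randall")
print(shared)
# {1: ['Yip, Randall'], 2: ['Yip, Randall'], 3: ['Yip, Randall']}

# A dict comprehension creates a new list for each key.
independent = {key: [] for key in keys}
independent[1].append("Yip, Randall")
print(independent)
# {1: ['Yip, Randall'], 2: [], 3: []}
```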

## Step 3: Update creators in Chung
This is the main code to update creators. The code loops through the dictionary and, for each "Old" creator that differs from its "New" creator, finds all matching creators in the Chung creator list and replaces them. This is similar to the "Replace All" function in Excel, except that:
1. It's more efficient, since Excel can only replace one pair of creators at a time;
2. In some cases "Replace All" can cause errors. For example, suppose I want to replace "Canadian Pacific" with "Canadian Pacific Railway Company". If I simply use "Replace All", it will also change "Canadian Pacific Railway Company" (which I don't want to change) to "Canadian Pacific Railway Company Railway Company". The code avoids this by comparing whole creator names rather than parts of them.
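
The difference between the two behaviours can be reproduced in a few lines. This is only an illustration using the names from the example above; `str.replace` stands in for a substring-based "Replace All".

```python
old, new = "Canadian Pacific", "Canadian Pacific Railway Company"
names = ["Canadian Pacific", "Canadian Pacific Railway Company"]

# Substring replacement: the already-correct name is changed as well.
substring_result = [name.replace(old, new) for name in names]
print(substring_result[1])
# Canadian Pacific Railway Company Railway Company

# Exact matching on the whole (stripped) name leaves it alone.
exact_result = [new if name.strip() == old else name for name in names]
print(exact_result)
# ['Canadian Pacific Railway Company', 'Canadian Pacific Railway Company']
```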

In [15]:
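#Replace the creators
for x in c_dict:
    #Only update the record if the new creator is different from the old.
    if x[0]!=x[1]:
        for key,z in creators.items():
            #creators_new[key] refers to the same list object as creators[key],
            #so the replacement below also changes "creators".
            creators_new[key]=z
            for index, item in enumerate(z):
                #Compare whole names, ignoring surrounding spaces
                if str(x[0]).strip()==str(item).strip():
                    creators_new[key][index] = str(x[1]).strip()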

In [16]:
#Test if the creator is updated. The old creator of record 229 was "BSL". It has been updated to "Birmingham Silver Plate Limited". "creators" shows the new value too, because it shares its lists with "creators_new".
creators[229]

['Birmingham Silver Plate Limited']

In [17]:
creators_new[229]

['Birmingham Silver Plate Limited']
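
Both `creators[229]` and `creators_new[229]` show the new name because the replacement loop stores each list from `creators` in `creators_new` without copying it. If the original names need to stay available for comparison, one option is to build new lists and look each name up in a dictionary of old-to-new names. The following is a sketch under that assumption: `replace_creators` and `lookup` are hypothetical names, and the sample data is based on values shown in this notebook.

```python
def replace_creators(creators, lookup):
    """Return a new dict of creator lists with old names replaced.

    A fresh list is built for every record, so `creators` is not modified.
    """
    updated = {}
    for key, names in creators.items():
        cleaned = [str(name).strip() for name in names]
        # dict.get falls back to the original name when there is no match.
        updated[key] = [lookup.get(name, name) for name in cleaned]
    return updated


# Sample data based on the values shown in this notebook.
creators = {
    55: ["Yip, Randall ", " Canadian Broadcasting Corporation"],
    229: ["BSL"],
}
lookup = {"BSL": "Birmingham Silver Plate Limited"}

creators_new = replace_creators(creators, lookup)
print(creators_new[229])  # ['Birmingham Silver Plate Limited']
print(creators[229])      # ['BSL']
```

Each name is looked up once instead of being compared against every pair in turn. One behavioural difference to keep in mind: the notebook applies the pairs in order, so a name produced by one pair could be matched again by a later pair, whereas this sketch replaces each name at most once.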

In [18]:
#Remove spaces at the beginning or end of the creator
for key,z in creators_new.items():
    for index, item in enumerate(z):
        creators_new[key][index] = str(item).strip()

In [19]:
creators_new[55]

['Yip, Randall', 'Canadian Broadcasting Corporation']

### Count how many times each creator appears in the Chung collection

In [20]:
word_freq={}
for key,z in creators_new.items():
    for index, item in enumerate(z):
        if item not in word_freq:
            word_freq[item] = 0
        word_freq[item] += 1

In [21]:
# How many times does "Canadian Pacific Railway Company" appear in the dataset?
word_freq['Canadian Pacific Railway Company']

1643
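
The same tally can be produced with `collections.Counter` from the standard library, which also offers `most_common()` for ranking. A minimal sketch with a few sample records based on values shown in this notebook:

```python
from collections import Counter

# Sample records based on the values shown in this notebook.
creators_new = {
    54: ["Yip, Randall"],
    55: ["Yip, Randall", "Canadian Broadcasting Corporation"],
    229: ["Birmingham Silver Plate Limited"],
}

# Count every name across all records.
word_freq = Counter(name for names in creators_new.values() for name in names)
print(word_freq["Yip, Randall"])  # 2
print(word_freq.most_common(1))   # [('Yip, Randall', 2)]
```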

In [22]:
word_freq_df=pd.DataFrame.from_dict(word_freq,orient='index')

In [23]:
# Export the frequency of creators to "word_freq.csv"
word_freq_df.to_csv('word_freq.csv')

### Save the creators in the required CSV template format

In [24]:
updated_creators_df=pd.DataFrame.from_dict(creators_new,orient='index')

In [25]:
#Note: this is a second name for the same dictionary, not a copy
creators_new_pipe=creators_new

In [26]:
#For multiple creators, use pipe separators ("|") to separate them for import into AtoM
for key,z in creators_new_pipe.items():
    creators_new_pipe[key]="|".join(z)

In [27]:
#Check the number of records
len(creators_new_pipe)

20302
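
Since plain assignment of a dictionary does not copy it, the pipe-joined strings could instead be built with a dictionary comprehension, which creates a separate dictionary and leaves the lists untouched. A minimal sketch with sample records based on values shown in this notebook:

```python
# Sample records based on the values shown in this notebook.
creators_new = {
    55: ["Yip, Randall", "Canadian Broadcasting Corporation"],
    229: ["Birmingham Silver Plate Limited"],
}

# Build a new dict so the lists in creators_new stay available.
creators_new_pipe = {key: "|".join(names) for key, names in creators_new.items()}

print(creators_new_pipe[55])  # Yip, Randall|Canadian Broadcasting Corporation
print(creators_new[55])       # ['Yip, Randall', 'Canadian Broadcasting Corporation']
```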

In [28]:
updated_creators_pipe_df=pd.DataFrame.from_dict(creators_new_pipe,orient='index')

In [29]:
#Check if the creators are updated
updated_creators_pipe_df.loc[220:230]

| | 0 |
|---|---|
| 220 | Elkington and Company |
| 221 | Elkington and Company |
| 222 | International Silver Company of Canada |
| 223 | Birmingham Silver Plate Limited |
| 224 | Elkington and Company |
| 225 | Elkington and Company |
| 226 | Elkington and Company |
| 227 | Elkington and Company |
| 228 | Elkington and Company |
| 229 | Birmingham Silver Plate Limited |


In [30]:
#Save the updated creator to "chungCreatorUpdated.csv"
updated_creators_pipe_df.to_csv('chungCreatorUpdated.csv')
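
As written, the exported file has an unnamed index column and a data column named `0`. If the import template expects named columns, the header can be set explicitly. The sketch below uses the standard `csv` module and writes to an in-memory buffer so it can be run anywhere. The column names `ID` and `creators` are placeholders chosen for illustration and are not taken from the AtoM template.

```python
import csv
import io

# Sample record based on the values shown in this notebook.
creators_new_pipe = {229: "Birmingham Silver Plate Limited"}

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "creators"])  # placeholder column names
for key, names in creators_new_pipe.items():
    writer.writerow([key, names])

print(buffer.getvalue())
# ID,creators
# 229,Birmingham Silver Plate Limited
```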