# ACDE Map

<style>
  a {
    color: #1ea5a6 !important;
  }
</style>

This notebook showcases the ACDE Map, a web-based interactive map that lets users explore the ACDEA data visually and intuitively. Instructions on how to use the application are provided below.

### Fetch data

In [1]:
# for data mgmt
import numpy as np
from collections import Counter
import requests, gzip, io, os, json, pandas as pd
import ast

import warnings
warnings.filterwarnings("ignore")

# # provide folder_name which contains uncompressed data i.e., csv and jsonl files
# # only need to change this if you have already downloaded data
# # otherwise data will be fetched from google drive
# global folder_name
# folder_name = 'data/local'

# def fetch_small_data_from_github(fname):
#     url = f"https://raw.githubusercontent.com/acd-engine/jupyterbook/master/data/analysis/{fname}"
#     response = requests.get(url)
#     rawdata = response.content.decode('utf-8')
#     return pd.read_csv(io.StringIO(rawdata))

# def fetch_date_suffix():
#     url = f"https://raw.githubusercontent.com/acd-engine/jupyterbook/master/data/analysis/date_suffix"
#     response = requests.get(url)
#     rawdata = response.content.decode('utf-8')
#     try: return rawdata[:12]
#     except: return None

# def check_if_csv_exists_in_folder(filename):
#     try: return pd.read_csv(os.path.join(folder_name, filename), low_memory=False)
#     except: return None

# def fetch_data(filetype='csv', acdedata='organization'):
#     filename = f'acde_{acdedata}_{fetch_date_suffix()}.{filetype}'

#     # first check if the data exists in current directory
#     data_from_path = check_if_csv_exists_in_folder(filename)
#     if data_from_path is not None: return data_from_path

#     urls = fetch_small_data_from_github('acde_data_gdrive_urls.csv')
#     sharelink = urls[urls.data == acdedata][filetype].values[0]
#     url = f'https://drive.google.com/u/0/uc?id={sharelink}&export=download&confirm=yes'

#     response = requests.get(url)
#     decompressed_data = gzip.decompress(response.content)
#     decompressed_buffer = io.StringIO(decompressed_data.decode('utf-8'))

#     try:
#         if filetype == 'csv': df = pd.read_csv(decompressed_buffer, low_memory=False)
#         else: df = [json.loads(jl) for jl in pd.read_json(decompressed_buffer, lines=True, orient='records')[0]]
#         return pd.DataFrame(df)
#     except: return None 

# acde_events = fetch_data(acdedata='event')
# acde_works = fetch_data(acdedata='work')

acde_events = pd.read_csv('acde_event_202307252224.csv') # 12 secs
acde_works = pd.read_csv('acde_work_202307252224.csv')

### Events (DAAO)

In [3]:
acde_events['data_source'].value_counts()

"AusStage"    124916
"DAAO"         21838
"CircusOZ"       480
Name: data_source, dtype: int64

In [2]:
events_expanded = []  # Use a list to store the expanded rows

relevant_columns = ['_id','data_source','coverage_ranges','types','title','ori_url', 'related_people']
acde_events_reduced = acde_events[acde_events['coverage_ranges'].str.contains('geo_coord',na=False)][relevant_columns]

# expand each coverage range (dates and place) into its own row
for idx, row in acde_events_reduced.iterrows(): # takes 8 min
    this_locations = pd.json_normalize(ast.literal_eval(row['coverage_ranges']))
    row_expanded = pd.DataFrame(index=range(len(this_locations)), columns=relevant_columns)

    for idx2, row2 in this_locations.iterrows():
        row_expanded.loc[idx2, row2.index] = row2.values

    # Fill the remaining columns with the same values as the original row
    row_expanded.fillna(row, inplace=True)

    events_expanded.append(row_expanded)

# Concatenate all the expanded rows into a final DataFrame
events_expanded = pd.concat(events_expanded, ignore_index=True)
events_expanded = events_expanded[(events_expanded['place.geo_coord.latitude'].notnull()) &\
                                    (events_expanded['place.geo_coord.longitude'].notnull())]
events_expanded.shape

(136764, 32)
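
The row-by-row loop above is slow (about 8 minutes for events). As a rough alternative, the same one-row-per-location expansion can be sketched with `DataFrame.explode` and `pd.json_normalize`. The helper name `expand_coverage` and the toy records below are illustrative assumptions, not part of the ACDE pipeline, and the sketch assumes every coverage string parses to a list of dictionaries.

```python
import ast
import pandas as pd

def expand_coverage(df, column):
    """Hypothetical helper: one output row per entry in the list stored in `column`."""
    out = df.copy()
    out[column] = out[column].apply(ast.literal_eval)    # string -> list of dicts
    out = out.explode(column).dropna(subset=[column])    # one dict per row, skip empty lists
    out = out.reset_index(drop=True)
    flat = pd.json_normalize(out[column].tolist())       # nested keys -> dotted column names
    return pd.concat([out.drop(columns=[column]), flat], axis=1)

# Toy input (made-up records shaped like the coverage strings used above)
toy = pd.DataFrame({
    "_id": ["a", "b"],
    "coverage_ranges": [
        '[{"place": {"geo_coord": {"latitude": -37.8, "longitude": 144.9}}}]',
        '[{"place": {"geo_coord": {"latitude": -33.9, "longitude": 151.2}}},'
        ' {"place": {"geo_coord": {"latitude": 51.5, "longitude": -0.1}}}]',
    ],
})
expanded = expand_coverage(toy, "coverage_ranges")
print(expanded)
```

Record `b` has two locations, so it contributes two rows. Each row keeps the original `_id` alongside the flattened `place.geo_coord.*` columns.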

### Works (DAQA, DAAO)

In [5]:
acde_works['data_source'].value_counts()

"DAAO"        23729
"AusStage"    19972
"DAQA"         2203
Name: data_source, dtype: int64

In [3]:
works_expanded = []  # Use a list to store the expanded rows

relevant_columns = ['_id','data_source','coverage_range','title','ori_url', 'typologies','related_people']
acde_works_reduced = acde_works[acde_works['coverage_range'].str.contains('geo_coord',na=False)][relevant_columns]

# expand each coverage range (dates and place) into its own row
for idx, row in acde_works_reduced.iterrows():
    # if idx == 1000: break
    this_locations = pd.json_normalize(ast.literal_eval(row['coverage_range']))
    row_expanded = pd.DataFrame(index=range(len(this_locations)), columns=relevant_columns)

    for idx2, row2 in this_locations.iterrows():
        row_expanded.loc[idx2, row2.index] = row2.values

    # Fill the remaining columns with the same values as the original row
    row_expanded.fillna(row, inplace=True)
    works_expanded.append(row_expanded)

# Concatenate all the expanded rows into a final DataFrame
works_expanded = pd.concat(works_expanded, ignore_index=True)
works_expanded = works_expanded[(works_expanded['place.geo_coord.latitude'].notnull()) &\
                                (works_expanded['place.geo_coord.longitude'].notnull())]
works_expanded.shape

(1447, 19)

### Unify events and works and remove redundant fields

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
geocoded_data = pd.concat([events_expanded, works_expanded]).drop(['coverage_ranges', 'coverage_range'], axis=1)

geocoded_data['title'] = geocoded_data['title'].apply(lambda x: ast.literal_eval(x) if type(x) == str else np.nan)
geocoded_data['ori_url'] = geocoded_data['ori_url'].apply(lambda x: ast.literal_eval(x) if type(x) == str else np.nan)

geocoded_data['place.display_name'] = np.where(geocoded_data['place.display_name'].notnull(), geocoded_data['place.display_name'], 
                                                np.where(geocoded_data['place.address.ori_address'].notnull(), 
                                                         geocoded_data['place.address.ori_address'], 'Unknown'))

cols_to_keep = ['_id', 'data_source', 'types', 'typologies','title', 'ori_url', 'date_range.date_end.year',
                'date_range.date_end.month','date_range.date_start.year', 'date_range.date_start.month',
                'place.geo_coord.latitude','place.geo_coord.longitude','place.display_name', 'related_people']

geocoded_data = geocoded_data[(geocoded_data['date_range.date_start.year'].notnull()) | (geocoded_data['date_range.date_end.year'].notnull())][cols_to_keep]

# if date_range.date_start.month and date_range.date_end.month are null, then change both values to 1
geocoded_data.loc[(geocoded_data['date_range.date_start.month'].isnull()) & (geocoded_data['date_range.date_end.month'].isnull()),
                 ['date_range.date_start.month', 'date_range.date_end.month']] = (1,1)

geocoded_data['date_range.date_start.month'] = np.where((geocoded_data['date_range.date_start.year'].notnull()) &\
                                                        (geocoded_data['date_range.date_end.year'].notnull()) &\
                                                        (geocoded_data['date_range.date_end.month'].notnull()) &\
                                                        (geocoded_data['date_range.date_start.month'].isnull()),
                                                        geocoded_data['date_range.date_end.month'], geocoded_data['date_range.date_start.month'])


# correct month and year inaccuracies
geocoded_data['date_range.date_start.month'] = np.where(geocoded_data['date_range.date_start.month'].isnull(), geocoded_data['date_range.date_end.month'], geocoded_data['date_range.date_start.month'])
geocoded_data['date_range.date_end.month'] = np.where(geocoded_data['date_range.date_end.month'].isnull(), geocoded_data['date_range.date_start.month'], geocoded_data['date_range.date_end.month'])

geocoded_data['date_range.date_start.year'] = np.where(geocoded_data['date_range.date_start.year'].isnull(), geocoded_data['date_range.date_end.year'], geocoded_data['date_range.date_start.year'])
geocoded_data['date_range.date_end.year'] = np.where(geocoded_data['date_range.date_end.year'].isnull(), geocoded_data['date_range.date_start.year'], geocoded_data['date_range.date_end.year'])

geocoded_data['date_range.date_start.year'] = geocoded_data['date_range.date_start.year'].astype(int)
geocoded_data['date_range.date_start.month'] = geocoded_data['date_range.date_start.month'].astype(int)
geocoded_data['date_range.date_end.year'] = geocoded_data['date_range.date_end.year'].astype(int)
geocoded_data['date_range.date_end.month'] = geocoded_data['date_range.date_end.month'].astype(int)

# flag reversed ranges before overwriting the end year, so the end month is corrected for the same rows
year_misaligned_cond = geocoded_data['date_range.date_start.year'] > geocoded_data['date_range.date_end.year']
geocoded_data['date_range.date_end.month'] = np.where(year_misaligned_cond, geocoded_data['date_range.date_start.month'], geocoded_data['date_range.date_end.month'])
geocoded_data['date_range.date_end.year'] = np.where(year_misaligned_cond, geocoded_data['date_range.date_start.year'], geocoded_data['date_range.date_end.year'])

month_misaligned_cond = (geocoded_data['date_range.date_start.year'] == geocoded_data['date_range.date_end.year']) &\
                (geocoded_data['date_range.date_start.month'] > geocoded_data['date_range.date_end.month'])

geocoded_data['date_range.date_end.month'] = np.where(month_misaligned_cond, geocoded_data['date_range.date_start.month'], geocoded_data['date_range.date_end.month'])
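
The date clean-up in this cell follows three rules: default missing months to January when both are missing, fill a missing start or end from its counterpart, and collapse the end date onto the start date when the range runs backwards. A minimal sketch of those rules on toy data follows. The short column names and the `reconcile_dates` helper are assumptions made for readability, not names used in the notebook.

```python
import numpy as np
import pandas as pd

def reconcile_dates(df):
    """Hypothetical helper applying the fill-and-clamp rules to year/month columns."""
    out = df.copy()
    # Rule 1: both months missing -> January for both
    both_missing = out["start_month"].isna() & out["end_month"].isna()
    out.loc[both_missing, ["start_month", "end_month"]] = 1
    # Rule 2: fill a missing start from the end, then a missing end from the start
    for start, end in [("start_month", "end_month"), ("start_year", "end_year")]:
        out[start] = out[start].fillna(out[end])
        out[end] = out[end].fillna(out[start])
    out = out.astype(int)
    # Rule 3: if the end precedes the start, collapse the end onto the start
    reversed_range = (out["start_year"] > out["end_year"]) | (
        (out["start_year"] == out["end_year"]) & (out["start_month"] > out["end_month"])
    )
    out.loc[reversed_range, ["end_year", "end_month"]] = (
        out.loc[reversed_range, ["start_year", "start_month"]].to_numpy()
    )
    return out

# Toy rows (made up): year only, end only, reversed years, reversed months
toy_dates = pd.DataFrame({
    "start_year":  [1967, np.nan, 1990, 2000],
    "start_month": [np.nan, np.nan, 5, 9],
    "end_year":    [np.nan, 1939, 1985, 2000],
    "end_month":   [np.nan, 11, 3, 2],
})
cleaned_dates = reconcile_dates(toy_dates)
print(cleaned_dates)
```

The sketch assumes every row has at least one year, which the notebook guarantees by filtering beforehand on a non-null start or end year.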

### Clean categories

#### Pre-process DAQA and DAAO data

In [5]:
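# take an explicit copy so the column assignments below do not act on a view of geocoded_data
daao_daqa_geocoded_data = geocoded_data[~geocoded_data['data_source'].str.contains('AusStage')].copy()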

daao_daqa_geocoded_data['types'] = daao_daqa_geocoded_data['types']\
    .apply(lambda x: ast.literal_eval(x.replace('[{"primary_type": [','').replace(']}]','')) if type(x) == str else np.nan)
daao_daqa_geocoded_data['typologies'] = daao_daqa_geocoded_data['typologies']\
    .apply(lambda x: ast.literal_eval(x.replace('[','').replace(']','')) if type(x) == str else np.nan)
daao_daqa_geocoded_data['types'] = np.where(daao_daqa_geocoded_data['types'].notnull(), daao_daqa_geocoded_data['types'], daao_daqa_geocoded_data['typologies'])
daao_daqa_geocoded_data.drop('typologies', axis=1, inplace=True)

# all nan values are daao, so we can fillna with exhibition
daao_daqa_geocoded_data['types'] = daao_daqa_geocoded_data['types'].fillna('exhibition')

# clean categories
daao_daqa_geocoded_data['category'] = np.where(daao_daqa_geocoded_data['ori_url'].str.contains('qldarch', na=False), 'Architecture', np.nan)
daao_daqa_geocoded_data['category'] = np.where(daao_daqa_geocoded_data['types'].str.contains('other event|performance|recital|opening', na=False), 'Other', daao_daqa_geocoded_data['category'])
daao_daqa_geocoded_data['category'] = np.where(daao_daqa_geocoded_data['types'].str.contains('exhibition', na=False), 'Exhibition', daao_daqa_geocoded_data['category'])
daao_daqa_geocoded_data['category'] = np.where(daao_daqa_geocoded_data['types'].str.contains('festival', na=False), 'Festival', daao_daqa_geocoded_data['category'])

festival_index = daao_daqa_geocoded_data[(daao_daqa_geocoded_data['category'] == 'nan')]['types'].apply(lambda x: 'festival' if 'festival' in str(x) else np.nan).to_frame()
daao_daqa_geocoded_data['types'] = np.where(daao_daqa_geocoded_data.index.isin(festival_index[festival_index.types.notnull()].index), 'festival', daao_daqa_geocoded_data['types'])

daao_daqa_geocoded_data['category'] = daao_daqa_geocoded_data['category'].apply(lambda x: 'Exhibition' if x == 'nan' else x)
daao_daqa_geocoded_data['subcategory'] = np.where(daao_daqa_geocoded_data['category'] == 'Architecture', daao_daqa_geocoded_data['types'], daao_daqa_geocoded_data['category'])
daao_daqa_geocoded_data.drop('types', axis=1, inplace=True)

daao_daqa_geocoded_data['category'].value_counts()

Exhibition      18453
Architecture     1005
Other             139
Festival           64
Name: category, dtype: int64
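
The chain of `np.where` calls above acts as a priority list: later assignments override earlier ones, so Festival wins over Exhibition, which wins over Other, which wins over Architecture, and anything unmatched falls back to Exhibition. As an illustrative sketch, the same priority can be written once with `np.select`, which picks the first matching condition. The toy rows and the `example.org` URLs are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); only the 'qldarch' substring in the URL matters for the rule
toy_types = pd.DataFrame({
    "types": ["exhibition", "festival", "recital", "house", None],
    "ori_url": [
        "https://example.org/daao/1",
        "https://example.org/daao/2",
        "https://example.org/daao/3",
        "https://example.org/qldarch/4",
        "https://example.org/daao/5",
    ],
})

types = toy_types["types"].fillna("")
# Conditions are listed from highest to lowest priority
conditions = [
    types.str.contains("festival"),
    types.str.contains("exhibition"),
    types.str.contains("other event|performance|recital|opening"),
    toy_types["ori_url"].str.contains("qldarch", na=False),
]
choices = ["Festival", "Exhibition", "Other", "Architecture"]
toy_types["category"] = np.select(conditions, choices, default="Exhibition")
print(toy_types[["types", "category"]])
```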

In [6]:
# take an explicit copy so the in-place drops and assignments below do not act on a view
ausstage_geocoded_data = geocoded_data[geocoded_data['data_source'].str.contains('AusStage')].copy()
ausstage_geocoded_data.drop('typologies', axis=1, inplace=True)

primary_type = []
secondary_type = []

for t in ausstage_geocoded_data['types']: # takes 1 min
    if 'numberDouble' in t:
        primary_type.append('Other'); secondary_type.append(None)
        continue

    primary_type.append(", ".join(pd.json_normalize(json.loads(t))['primary_type'].unique()))

    try: secondary_type.append(", ".join(pd.json_normalize(json.loads(t))['secondary_type'].unique()))
    except: secondary_type.append(None)

ausstage_geocoded_data['category'] = primary_type
ausstage_geocoded_data['subcategory'] = secondary_type

ausstage_geocoded_data['category'] = np.where(ausstage_geocoded_data['subcategory'].str.contains('Festival', na=False), 'Festival', ausstage_geocoded_data['category'])
ausstage_geocoded_data['category'] = np.where(ausstage_geocoded_data['subcategory'].str.contains('Circus', na=False), 'Circus', ausstage_geocoded_data['category'])
ausstage_geocoded_data['category'] = np.where((ausstage_geocoded_data['category'] == 'Other') & (ausstage_geocoded_data['subcategory'].str.contains('Exhibition', na=False)), 
                                     'Exhibition', ausstage_geocoded_data['category'])

ausstage_geocoded_data.drop('types', axis=1, inplace=True)
ausstage_geocoded_data['category'].value_counts()

Theatre - Spoken Word    64338
Music Theatre            21426
Music                    11049
Dance                     9793
Other                     8223
Circus                    1673
Exhibition                 446
Festival                   427
Name: category, dtype: int64
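
The loop above parses each `types` string twice, once for the primary type and once for the secondary type. A minimal sketch of the same split, parsing each string once with the standard library, is shown below. The helper name `split_types` and the sample strings are assumptions, and the secondary type `Opera` is a made-up value. Unlike the original loop, the sketch skips records that lack a secondary type instead of discarding the whole value.

```python
import json

def split_types(raw):
    """Hypothetical helper: return (primary, secondary) labels from a JSON `types` string."""
    if "numberDouble" in raw:            # malformed numeric placeholder -> treat as 'Other'
        return "Other", None
    records = json.loads(raw)
    # dict.fromkeys de-duplicates while preserving first-seen order
    primary = ", ".join(dict.fromkeys(r["primary_type"] for r in records))
    secondary_values = [r["secondary_type"] for r in records if r.get("secondary_type")]
    secondary = ", ".join(dict.fromkeys(secondary_values)) or None
    return primary, secondary

both = split_types('[{"primary_type": "Music", "secondary_type": "Opera"},'
                   ' {"primary_type": "Music", "secondary_type": "Festival"}]')
primary_only = split_types('[{"primary_type": "Dance"}]')
malformed = split_types('[{"primary_type": {"$numberDouble": "NaN"}}]')
print(both, primary_only, malformed)
```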

In [17]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
geocoded_data_categorised = pd.concat([daao_daqa_geocoded_data, ausstage_geocoded_data])
geocoded_data_categorised['category'].value_counts()

Theatre - Spoken Word    64338
Music Theatre            21426
Exhibition               18899
Music                    11049
Dance                     9793
Other                     8362
Circus                    1673
Architecture              1005
Festival                   491
Name: category, dtype: int64

### Extract related people

In [11]:
related_persons = []

for idx, row in geocoded_data_categorised.iterrows(): # takes 4 mins
    try:
        this_person = pd.json_normalize(json.loads(row['related_people']))
        [related_persons.append(row2[['relation_class','object.label','subject.curr_dbid.$oid']])\
            if 'RelatedPerson' in row2['relation_class'] \
            else related_persons.append(row2[['relation_class','subject.label','object.curr_dbid.$oid']])\
            for idx2, row2 in this_person.iterrows()]
    except: continue

related_persons = pd.DataFrame(related_persons)

related_persons['person'] = np.where(related_persons['relation_class'].str.contains('RelatedPerson'), related_persons['object.label'], related_persons['subject.label'])
related_persons['_id'] = np.where(related_persons['relation_class'].str.contains('RelatedPerson'), related_persons['subject.curr_dbid.$oid'], related_persons['object.curr_dbid.$oid'])

related_persons = related_persons[['person','_id']]
related_persons.head()

| | person | _id |
|---|---|---|
| 0 | Anita Aarons | 64581122d72f0e29f613ad80 |
| 0 | Anita Aarons | 64581122d72f0e29f613ad81 |
| 0 | Inez M. Abbot | 64581122d72f0e29f613ad82 |
| 0 | L. B. Abercrombie | 64581122d72f0e29f613ad83 |
| 0 | Myra Felton | 64581122d72f0e29f613ad85 |


In [18]:
geocoded_data_categorised['_id'] = geocoded_data_categorised['_id'].apply(lambda x: x.replace('{"$oid": "','').replace('"}',""))
geocoded_data_categorised.drop(['data_source','related_people'], axis=1, inplace=True)
geocoded_data_categorised.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| _id | 64581122d72f0e29f613ad80 | 64581122d72f0e29f613ad81 | 64581122d72f0e29f613ad82 |
| title | Expo | World Crafts Conference | Exhibition of water colors by Inez Abbott |
| ori_url | https://www.daao.org.au/bio/event/expo | https://www.daao.org.au/bio/event/world-crafts... | https://www.daao.org.au/bio/event/exhibition-o... |
| date_range.date_end.year | 1967 | 1964 | 1939 |
| date_range.date_end.month | 1 | 1 | 11 |
| date_range.date_start.year | 1967 | 1964 | 1939 |
| date_range.date_start.month | 1 | 1 | 11 |
| place.geo_coord.latitude | 45.508889 | 40.714353 | -37.813187 |
| place.geo_coord.longitude | -73.554167 | -74.005973 | 144.96298 |
| place.display_name | Canadian pavillion, Montreal, Canada | New York, United States | Sedon Galleries, Melbourne, Victoria |


### Create binary indicator to see if the event occurred in Australia

In [19]:
longs = []
lats = []

for idx,row in geocoded_data_categorised.iterrows():
    try: longs.append(pd.to_numeric(row['place.geo_coord.longitude']))
    except: longs.append(None)

    try: lats.append(pd.to_numeric(row['place.geo_coord.latitude']))
    except: lats.append(None)

geocoded_data_categorised['place.geo_coord.longitude'] = longs
geocoded_data_categorised['place.geo_coord.latitude'] = lats

geocoded_data_categorised = geocoded_data_categorised[geocoded_data_categorised['place.geo_coord.latitude'].notnull() &\
                                                        geocoded_data_categorised['place.geo_coord.longitude'].notnull()]

# Define the valid range for longitude and latitude
valid_longitude_range = (-180, 180); valid_latitude_range = (-90, 90)

# Filter out rows with longitude outside the valid range
geocoded_data_categorised = geocoded_data_categorised[(geocoded_data_categorised['place.geo_coord.longitude'] >= valid_longitude_range[0]) &\
    (geocoded_data_categorised['place.geo_coord.longitude'] <= valid_longitude_range[1])]

# Filter out rows with latitude outside the valid range
geocoded_data_categorised = geocoded_data_categorised[(geocoded_data_categorised['place.geo_coord.latitude'] >= valid_latitude_range[0]) &\
    (geocoded_data_categorised['place.geo_coord.latitude'] <= valid_latitude_range[1])]

# Reset the DataFrame index after filtering
geocoded_data_categorised = geocoded_data_categorised.reset_index(drop=True)
geocoded_data_categorised.shape

(136715, 12)
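
The coordinate clean-up above can also be expressed without an explicit loop. The sketch below uses `pd.to_numeric(..., errors="coerce")` to turn unparseable values into `NaN`, and `Series.between` to keep only coordinates inside the valid ranges. The toy values are made up, and this is an alternative formulation, not the code that produced the counts above.

```python
import pandas as pd

# Toy coordinates (made up): one valid pair, one out-of-range latitude,
# one unparseable latitude, one out-of-range longitude
toy_coords = pd.DataFrame({
    "place.geo_coord.latitude": ["-37.81", "95.0", "abc", "45.5"],
    "place.geo_coord.longitude": ["144.96", "10.0", "151.2", "-273.5"],
})

# Unparseable strings become NaN instead of raising
coords = toy_coords.apply(pd.to_numeric, errors="coerce")

# between() is inclusive on both ends and evaluates to False for NaN
valid = (
    coords["place.geo_coord.latitude"].between(-90, 90)
    & coords["place.geo_coord.longitude"].between(-180, 180)
)
clean_coords = coords[valid].reset_index(drop=True)
print(clean_coords)
```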

In [20]:
# !pip install reverse_geocoder
import reverse_geocoder as rg
unique_geocodes = set(zip(geocoded_data_categorised['place.geo_coord.latitude'], geocoded_data_categorised['place.geo_coord.longitude']))
unique_geocodes_list = list(unique_geocodes)
results = rg.search(unique_geocodes_list)
results_df = pd.concat([pd.DataFrame(results),pd.DataFrame(unique_geocodes_list)], axis=1)
results_df.rename(columns={0: 'place.geo_coord.latitude', 1: 'place.geo_coord.longitude'}, inplace=True)
results_df.head()

Loading formatted geocoded file...


| | lat | lon | name | admin1 | admin2 | cc | place.geo_coord.latitude | place.geo_coord.longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | -33.76604 | 151.16213 | Killara | New South Wales | Ku-ring-gai | AU | -33.763331 | 151.155534 |
| 1 | -34.086 | 150.78512 | Glen Alpine | New South Wales | Campbelltown Municipality | AU | -34.069551 | 150.792157 |
| 2 | -37.9 | 144.66667 | Werribee | Victoria | Wyndham | AU | -37.883915 | 144.647691 |
| 3 | 51.68139 | -2.35333 | Dursley | England | Gloucestershire | GB | 51.683653 | -2.304888 |
| 4 | -37.46667 | 145.23333 | Kinglake West | Victoria | Murrindindi | AU | -37.3223 | 145.30293 |


In [29]:
# attach the reverse-geocoded country code ('cc') to every record; the value_counts call below needs this column
geocoded_data_categorised = pd.merge(geocoded_data_categorised,results_df,on=['place.geo_coord.latitude', 'place.geo_coord.longitude'])
geocoded_data_categorised.drop(['lat','lon','name','admin1','admin2'], axis=1, inplace=True)

# # WA error - fixes 470 rows (Perth is in the southern hemisphere, so its latitude is negative)
# geocoded_data_categorised.loc[(round(geocoded_data_categorised['place.geo_coord.latitude'], 2) == 47.75),
#                                 ['place.geo_coord.latitude', 'place.geo_coord.longitude']] = (-31.9523, 115.8613)

# geocoded_data_categorised.loc[geocoded_data_categorised['place.geo_coord.latitude'] == -31.9523, ['cc']] = ('AU')

## save data
## geocoded_data_categorised.drop_duplicates(subset=['_id']).to_csv('acde_map_geocoded_data.csv', index=False)
## related_persons.drop_duplicates().to_csv('acde_map_related_persons.csv', index=False)

geocoded_data_categorised['cc'].value_counts(normalize=True).head(10)

AU    0.934952
GB    0.034349
NZ    0.008251
US    0.006481
FR    0.002202
DE    0.001704
CA    0.001178
NL    0.001068
JP    0.000966
CN    0.000746
Name: cc, dtype: float64
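
The section heading mentions a binary indicator, which is one comparison once the `cc` column is attached. The sketch below joins a small lookup of country codes back onto the records and derives the flag. The lookup is hand-written here as a stand-in for the `rg.search` results, and the column name `in_australia` is an assumption, not a field used by the map.

```python
import pandas as pd

keys = ["place.geo_coord.latitude", "place.geo_coord.longitude"]

# Two records taken from the preview above
records = pd.DataFrame({
    "title": ["Expo", "Exhibition of water colors by Inez Abbott"],
    "place.geo_coord.latitude": [45.508889, -37.813187],
    "place.geo_coord.longitude": [-73.554167, 144.96298],
})

# Hand-written stand-in for the reverse-geocoding results (one row per unique coordinate)
lookup = pd.DataFrame({
    "place.geo_coord.latitude": [45.508889, -37.813187],
    "place.geo_coord.longitude": [-73.554167, 144.96298],
    "cc": ["CA", "AU"],
})

# A left join keeps every record even if a coordinate is missing from the lookup
merged = records.merge(lookup, on=keys, how="left")
merged["in_australia"] = merged["cc"].eq("AU")   # hypothetical indicator column
print(merged[["title", "cc", "in_australia"]])
```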

## Temporal Mapping of Australian Cultural Activity

**Instructions**
1. Drag the slider to set a start year for the animated spatio-temporal map.
2. Uncheck any cultural categories that you do not want on the map; otherwise leave them as is.
3. Press the play button to begin the animation. Each frame corresponds to one month. You can pause the animation at any time by pressing the pause button.
4. You can zoom in and out, drag the map, and click data points for more information.
5. Once you have paused the animation on a month/year of interest, press the `Generate network graph` button to load a network graph displaying relationships across all organisations and participants for the selected month/year. The graph is built from a sample of what is within the map view. To fetch another sample for the same month/year, press the `Generate network graph` button again.
6. To produce a network graph for a different month/year, start the process again. First, press the `Clear map` button to remove all data points from the map. Next, adjust the animation speed, the start year or the cultural category filters as needed. Then press the play button again to generate a new animation. The data shown in the map view is the same data used to generate the network graph, so to graph one or more specific cultural categories, make sure those categories are checked.

**Other Functionalities**
- If you would like the animation to run faster, you can reduce the default speed value from 500 to 200. Be careful with this input as every time you modify the speed value, the slider will reset to the default start year (1900).
- In its initial state, the animated map will keep data from previous months/years. If you would prefer to only see relevant data for each month/year, you can check the `Clear data after every month` checkbox.
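
The application's source is not shown in this notebook, so the following is only one possible reading of how a monthly frame could select records from the prepared data, with and without the `Clear data after every month` option. The helper name `records_for_frame` and its parameters are hypothetical.

```python
import pandas as pd

def records_for_frame(df, year, month, start_year, clear_each_month=False):
    """Hypothetical helper: rows to display at the (year, month) frame."""
    # Encode year/month as a single month counter so ranges compare easily
    starts = df["date_range.date_start.year"] * 12 + df["date_range.date_start.month"]
    ends = df["date_range.date_end.year"] * 12 + df["date_range.date_end.month"]
    frame = year * 12 + month
    if clear_each_month:
        # Only records whose date range covers this month
        return df[(starts <= frame) & (ends >= frame)]
    # Cumulative view: everything that started since January of the start year
    first = start_year * 12 + 1
    return df[(starts >= first) & (starts <= frame)]

# Dates taken from the three records previewed earlier in the notebook
toy_events = pd.DataFrame({
    "title": ["Exhibition of water colors by Inez Abbott", "World Crafts Conference", "Expo"],
    "date_range.date_start.year": [1939, 1964, 1967],
    "date_range.date_start.month": [11, 1, 1],
    "date_range.date_end.year": [1939, 1964, 1967],
    "date_range.date_end.month": [11, 1, 1],
})

cumulative = records_for_frame(toy_events, 1964, 1, start_year=1900)
single_month = records_for_frame(toy_events, 1964, 1, start_year=1900, clear_each_month=True)
print(len(cumulative), len(single_month))
```

In the cumulative view the 1939 exhibition is still on the map in January 1964. With the clearing option only the conference dated to that month remains.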

**Network Graph Legend**
- Blue nodes represent organisations/venues.
- Grey nodes represent participants.
- Edges represent an occurrence in the given time period and are uniformly weighted. The colour of an edge corresponds to the cultural category of the participant's engagement with an organisation/venue.
- The size of a participant node corresponds to the number of events the participant took part in during the given time period. This frequency can be obtained by hovering over the node.
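
To make the legend concrete, here is a small standard-library sketch of how such a graph could be assembled from participant, venue and category records. It illustrates the legend only and is not the application's code. The records, names and attribute keys are all made up.

```python
from collections import Counter

# Made-up (participant, venue, category) records for a single month
records = [
    ("Person A", "Venue X", "Exhibition"),
    ("Person A", "Venue Y", "Music"),
    ("Person B", "Venue X", "Exhibition"),
]

# Blue nodes for organisations/venues, grey nodes for participants
nodes = {}
for person, venue, _ in records:
    nodes[person] = {"kind": "participant", "colour": "grey"}
    nodes[venue] = {"kind": "organisation/venue", "colour": "blue"}

# One uniformly weighted edge per distinct participant-venue-category combination;
# the category would determine the edge colour
edges = sorted(set(records))

# Participant node size: number of events the person took part in during the period
event_counts = Counter(person for person, _, _ in records)
for person, count in event_counts.items():
    nodes[person]["size"] = count

print(nodes)
print(edges)
```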

<iframe height="1950" width="800" frameborder="no" src="https://jmunoz.shinyapps.io/acde-map/"> </iframe>