## Data Overview and Analysis


In the following, we inspect and analyze the data set <em>template</em> using the programming language Python 3, henceforth simply called Python. Before touching the data, we load the modules needed for loading and preparing the data and for creating graphs and maps. If importing these modules fails on your computer, you probably have to install them into your Python distribution first.

In [2]:
import pandas as pd
import geopandas as gpd
import altair as alt
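
If you are unsure whether the three modules are installed, a check along the following lines reports any that are missing without raising an `ImportError`. This is only a minimal sketch; the helper name `find_missing_modules` is our own and not part of any library.

```python
import importlib.util


def find_missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The three modules imported in the cell above.
required = ["pandas", "geopandas", "altair"]
missing = find_missing_modules(required)

if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required modules are available.")
```

How a missing module is installed depends on your Python distribution, for example through its package manager.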

If the modules loaded without an error, the next step is to load the data set <em>template</em> from the compressed csv-file <em>template.csv.gz</em> into a data frame. Both the csv-file and the data frame can be understood as a table of data, similar to the spreadsheet tables most computer users have seen in one of the popular office suites. The csv-file is a notational format of this table, comparable to sheet music, while the data frame is its computational equivalent: the actual music, to continue the metaphor.

In [3]:
df = pd.read_csv('template.csv.gz', sep='\t', compression='gzip')
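
Although the file name ends in csv, the call above reads it with a tab as separator. One way to confirm the separator before loading is to peek at the header line of the compressed file. The sketch below does this with the standard library only. It writes a tiny stand-in file (`sample.tsv.gz`, with made-up values) to a temporary directory, so it runs without the real data set.

```python
import gzip
import os
import tempfile

# Build a tiny tab-separated, gzip-compressed stand-in file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write("siteId\tscript\tsurname\n")
    handle.write("1\tchinese\tA\n")

# Read only the header line of the compressed file.
with gzip.open(sample_path, "rt", encoding="utf-8") as handle:
    header = handle.readline()

# Count how often each candidate separator occurs in the header.
candidates = ["\t", ",", ";"]
counts = {sep: header.count(sep) for sep in candidates}
likely_separator = max(counts, key=counts.get)

print(repr(likely_separator), counts)
```

For the real file, `sample_path` would be replaced by `'template.csv.gz'`. A simple count like this can be misled by separators that occur inside quoted text, so treat the result as a hint rather than proof.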

We can get a first glimpse of the content of the data set by displaying the data frame, which shows its first and last rows. Most readers will be struck by the cryptic column names in this table, such as <em>cidoc.e27.p48.e42.X</em>. This naming is based on the CIDOC CRM standard, developed by the International Council of Museums (ICOM) for the documentation of cultural heritage items.

In [3]:
df

Unnamed: 0,cidoc.e90.p48.e42.X,cidoc.e27.p46.e19.carries.p128.symbolic.object.e90,cidoc.e90.is.carried.by.p128.physical.object.e19,cidoc.e27.p46.e19.p48.e42.X,cidoc.e27.p48.e42.X,cidoc.e90.length.middleline,cidoc.e90.focus.latex,cidoc.e90.latex,cidoc.e90.character.script,cidoc.e90.family,...,cidoc.e31.p70.artwork.figurative.scroll,cidoc.e31.p70.artwork.figurative.flower,cidoc.e31.p70.offerings,cidoc.e31.p70.offerings.paper,cidoc.e31.p70.offerings.stone,cidoc.e31.p70.offerings.flower,ecidoc.31.p70.offerings.candle,cidoc.e31.p70.writing,cidoc.e31.p70.tomb.angle,cidoc.e31.p70.tomb.inside.outside
0,22036,22036,22343,22343,22015,6.0,,\large ？位之佳城\\ \\ erected,chinese,,...,,,,,,f,f,f,12.0,i
1,41151,41151,44422,44422,70606,,,,--,--,...,,,,,,f,f,t,12.0,i
2,94834,94834,94933,94933,92233,,,,chinese,洪,...,,,,,,f,f,t,12.0,i
3,21786,21786,22093,22093,21773,,,,chinese,,...,,,f,,,f,f,f,12.0,i
4,21845,21845,22152,22152,21824,12.0,,loc\\ \\ \small 民國三十八年季秋\\ \\ \small 五代𩔰考蔡公諱...,chinese,蔡,...,,,f,f,f,f,f,t,5.0,o
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254,51987,51987,52197,52197,50396,12.0,\large 陽青\\,\small 民國八十七年戊寅桐月修\\ \\ \small 蔡家歷代祖先顕妣閨名草墓\...,chinese,蔡,...,,,f,f,,f,f,t,5.0,o
255,74186,74186,74281,74281,72489,,,,--,,...,,,,f,,f,f,t,12.0,i
256,51987,51987,52197,52197,50396,12.0,\large 陽青\\,\small 民國八十七年戊寅桐月修\\ \\ \small 蔡家歷代祖先顕妣閨名草墓\...,chinese,蔡,...,,,,f,,f,f,t,12.0,i
257,51987,51987,52197,52197,50396,12.0,\large 陽青\\,\small 民國八十七年戊寅桐月修\\ \\ \small 蔡家歷代祖先顕妣閨名草墓\...,chinese,蔡,...,,,f,f,f,f,f,t,12.0,i
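
The dotted column names can be read as paths through CIDOC CRM classes (codes starting with `e`) and properties (codes starting with `p`). The following sketch spells out some of the codes that occur in the column names, using the class and property labels of the CIDOC CRM standard. The lookup table covers only a handful of codes, and the helper `describe_column` is our own illustration, not part of the data set or of any library.

```python
# Labels of a few CIDOC CRM classes (E...) and properties (P...).
CIDOC_LABELS = {
    "e19": "E19 Physical Object",
    "e27": "E27 Site",
    "e31": "E31 Document",
    "e42": "E42 Identifier",
    "e55": "E55 Type",
    "e90": "E90 Symbolic Object",
    "p2": "P2 has type",
    "p46": "P46 is composed of",
    "p48": "P48 has preferred identifier",
    "p55": "P55 has current location",
    "p70": "P70 documents",
    "p128": "P128 carries",
}


def describe_column(name):
    """Split a dotted column name and replace known codes by their labels."""
    return [CIDOC_LABELS.get(part, part) for part in name.split(".")]


print(describe_column("cidoc.e27.p48.e42.X"))
```

Read this way, <em>cidoc.e27.p48.e42.X</em> is the preferred identifier of a site. Parts without an entry in the lookup table, such as the trailing `X`, are passed through unchanged.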


For convenience, we rename the frequently used cryptic column names to user-friendly names that can be understood intuitively and are easy to remember when writing Python code.

In addition, we drop columns that seem redundant or are unlikely to be used.

In [4]:
# renaming column names
df.rename(columns={'cidoc.e27.p48.e42.X':'siteId',
                   'cidoc.e27.p2.e55':'siteType',
                   'cidoc.e27.community':'community',
                   'cidoc.e27.island':'island',
                   'cidoc.e27.archipelago':'archipelago',
                   'cidoc.e27.p46.e19.p48.e42.X':'physicalObjectId',
                   'cidoc.e90.p48.e42.X':'symbolicObjectId',
                   'cidoc.e31.url':'document.url',
                   'cidoc.e31.p48.e42.X':'documentId',
                   'cidoc.e27.p46.e19.has.current.location.p55.x':'longitude',
                   'cidoc.e27.p46.e19.has.current.location.p55.y':'latitude',
                   'cidoc.e90.character.script':'script',
                   'cidoc.e90.family':'surname'},
             inplace=True)

# dropping some columns we are not going to use
del df['cidoc.e27.p46.e19.carries.p128.symbolic.object.e90']
del df['cidoc.e90.is.carried.by.p128.physical.object.e19']
del df['cidoc.e90.focus.latex']
del df['cidoc.e90.latex']

df

Unnamed: 0,symbolicObjectId,physicalObjectId,siteId,cidoc.e90.length.middleline,script,surname,cidoc.e90.character.color,cidoc.e90.semantic.roles,cidoc.e90.semantic.roles.focus,cidoc.e90.semantic.roles.focus.sub,...,cidoc.e31.p70.artwork.figurative.scroll,cidoc.e31.p70.artwork.figurative.flower,cidoc.e31.p70.offerings,cidoc.e31.p70.offerings.paper,cidoc.e31.p70.offerings.stone,cidoc.e31.p70.offerings.flower,ecidoc.31.p70.offerings.candle,cidoc.e31.p70.writing,cidoc.e31.p70.tomb.angle,cidoc.e31.p70.tomb.inside.outside
0,22036,22343,22015,6.0,chinese,,gold,f:person:erected,loc,,...,,,,,,f,f,f,12.0,i
1,41151,44422,70606,,--,--,--,--,,,...,,,,,,f,f,t,12.0,i
2,94834,94933,92233,,chinese,洪,stone,--,,,...,,,,,,f,f,t,12.0,i
3,21786,22093,21773,,chinese,,stone,f:loc:date:person:erected,loc,,...,,,f,,,f,f,f,12.0,i
4,21845,22152,21824,12.0,chinese,蔡,stone,f:loc:date:person:erected,loc,,...,,,f,f,f,f,f,t,5.0,o
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254,51987,52197,50396,12.0,chinese,蔡,red,f:loc:date:person:erected,loc,ch,...,,,f,f,,f,f,t,5.0,o
255,74186,74281,72489,,--,,red,--,,,...,,,,f,,f,f,t,12.0,i
256,51987,52197,50396,12.0,chinese,蔡,red,f:loc:date:person:erected,loc,ch,...,,,,f,,f,f,t,12.0,i
257,51987,52197,50396,12.0,chinese,蔡,red,f:loc:date:person:erected,loc,ch,...,,,f,f,f,f,f,t,12.0,i


The result is a much more readable table, which we can now start to put to use.
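
One caveat about the renaming step: by default, `DataFrame.rename` ignores mapping keys that do not match any column, so a typo in one of the long names would go unnoticed. A small check like the following can list such keys up front. It is sketched on plain Python lists with a shortened set of names and one deliberately misspelled key. With the real data you would pass `df.columns` before the rename is applied. The helper name `unmatched_keys` is hypothetical.

```python
def unmatched_keys(mapping, columns):
    """Return the keys of `mapping` that are not present in `columns`."""
    available = set(columns)
    return [key for key in mapping if key not in available]


# Shortened stand-in for df.columns.
columns = ["cidoc.e27.p48.e42.X", "cidoc.e90.character.script", "cidoc.e90.family"]

mapping = {
    "cidoc.e27.p48.e42.X": "siteId",
    "cidoc.e90.character.script": "script",
    "cidoc.e90.familly": "surname",  # deliberate typo for the demonstration
}

print(unmatched_keys(mapping, columns))
```

An empty list means every key in the mapping matches a column.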

### The Map of Sites

In [5]:
# load Penghu geojson data
geojsonUrl = '../externalResources/ph-20201220.json'

gdf = gpd.read_file(geojsonUrl)

# find the rows whose ISLAND value is the placeholder 'xyz' (unnamed islands) and drop them
unnamedIslands = gdf[gdf['ISLAND'] == 'xyz'].index
gdf.drop(unnamedIslands, inplace=True)

# show the coordinate reference system of the file
print(gdf.crs)

# make sure the data is in WGS 84 (EPSG:4326), i.e. longitude/latitude in degrees
gdf = gdf.to_crs(epsg=4326)

# define inline geojson data object
data_geojson = alt.InlineData(values=gdf.to_json(), format=alt.DataFormat(property='features', type='json'))


epsg:4326
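
The filtering step in the cell above removes every feature whose `ISLAND` property equals the placeholder `'xyz'`. The same idea can be sketched on a raw GeoJSON structure with the standard library, which may help to see what the data looks like underneath the geo data frame. The feature collection below is made up for illustration and uses points instead of island outlines. The helper name `drop_features` is our own.

```python
import json

# Made-up feature collection with one named and one placeholder island.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"ISLAND": "A"},
            "geometry": {"type": "Point", "coordinates": [119.5, 23.5]},
        },
        {
            "type": "Feature",
            "properties": {"ISLAND": "xyz"},
            "geometry": {"type": "Point", "coordinates": [119.6, 23.6]},
        },
    ],
}


def drop_features(collection, key, value):
    """Return a copy of `collection` without features whose property `key` equals `value`."""
    kept = [
        feature
        for feature in collection["features"]
        if feature["properties"].get(key) != value
    ]
    return {"type": "FeatureCollection", "features": kept}


named_only = drop_features(feature_collection, "ISLAND", "xyz")
print(json.dumps(named_only, indent=2))
```

The `features` list is also the part of the structure that `property='features'` in the `alt.DataFormat` call points to.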


In the following step we create several possible background maps. In a later step we will draw dots or lines onto these backgrounds to produce informative maps.

In [6]:
# backgrounds for maps

bg_township = alt.Chart(
       data_geojson
   ).mark_geoshape(
   ).encode(
       color=alt.Color("properties.TOWN:N",
                       title='Township', 
                       legend=alt.Legend(columns=2)))

 
bg_island = alt.Chart(
       data_geojson
   ).mark_geoshape(
   ).encode(
       color=alt.Color("properties.ISLAND:N",
                       title='Island', 
                       legend=alt.Legend(columns=2)))

bg_village = alt.Chart(
       data_geojson
   ).mark_geoshape(
   ).encode(
       color=alt.Color("properties.VILLAGE:N",
                       title='Village', 
                       legend=alt.Legend(columns=10,symbolLimit=200)))


In [47]:
points = alt.Chart(
       df
   ).transform_aggregate(
       latitude='mean(latitude)',
       longitude='mean(longitude)',
       count='count()',
       groupby=['island']
   ).mark_circle(
   ).encode(
       longitude='longitude:Q',
       latitude='latitude:Q',
       size=alt.Size('count:Q', title='Number of Objects'),
       color=alt.value('red'),
       tooltip=['island:N','count:Q']
   ).properties(
       title='Number of Objects per Island'
)

bg_island + points
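
The `transform_aggregate` step collapses all rows of an island into a single point: the mean of the coordinates gives the position, and the row count gives the circle size. The same computation is sketched below in plain Python on a few made-up rows, which may help when checking plotted values by hand.

```python
from collections import defaultdict

# Made-up rows standing in for the data frame.
rows = [
    {"island": "A", "latitude": 23.0, "longitude": 119.0},
    {"island": "A", "latitude": 24.0, "longitude": 120.0},
    {"island": "B", "latitude": 23.5, "longitude": 119.5},
]

# Group the rows by island.
groups = defaultdict(list)
for row in rows:
    groups[row["island"]].append(row)

# Compute mean coordinates and the number of rows per island.
summary = {}
for island, members in groups.items():
    count = len(members)
    summary[island] = {
        "latitude": sum(m["latitude"] for m in members) / count,
        "longitude": sum(m["longitude"] for m in members) / count,
        "count": count,
    }

for island, values in summary.items():
    print(island, values)
```

Note that the mean of the coordinates is only a rough position: for islands with unevenly spread sites, the circle may land away from the area where most sites are.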

In [7]:

# chart object
background = alt.Chart(data_geojson).mark_geoshape(
      stroke='white'
    ).encode(
      color=alt.Color("properties.VILLAGE:N",title='Village', 
                      legend=alt.Legend(columns=4)))

points = alt.Chart(df).transform_aggregate(
    latitude='mean(latitude)',
    longitude='mean(longitude)',
    count='count()',
    groupby=['community']
).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.Size('count:Q', title='Number of Objects'),
    color=alt.value('steelblue'),
    tooltip=['community:N','count:Q']
).properties(
    title='Number of Objects in community'
)

background + points

In [9]:
# chart object
background = alt.Chart(data_geojson).mark_geoshape(stroke='white')

# assumption: the object count is meant to drive the circle color as well as the size
points = alt.Chart(df).transform_aggregate(
    latitude='mean(latitude)',
    longitude='mean(longitude)',
    count='count()',
    groupby=['community']
).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.Size('count:Q', title='Number of Objects'),
    color=alt.Color('count:Q', title='Number of Objects'),
    tooltip=['community:N','count:Q']
).properties(
    title='Number of Objects in community'
)

background + points