# AusStage Event Relations Exploration
<font size=4.5> —— How should we define the unit of events? </font>

Note: We are using the backup data dump from March 2022.

## How Are Events Recorded in AusStage?

Each event record has the following essential attributes:
* **eventid**;
* **event_name**;
* **first_date**;
* **last_date**;
* **part_of_a_tour**;
* **umbrella**;

In general, there are 2 types of event records by duration:
1. One-day Event Record: [Annie Hamilton](https://www.ausstage.edu.au/pages/event/161427)
2. Multi-day Event Record: [An American in Paris](https://www.ausstage.edu.au/pages/event/162016), [Adelaide Festival of Arts 2020](https://www.ausstage.edu.au/pages/event/160508)

## Our 3 Options

So far, we have 3 options for counting AusStage events:<br>
<span style="color:red">
<br>
1. **Our assumption**: Treat every event record (every page in the AusStage event category) as one credit to the contributor.<br>
2. Treat every single "production" as one credit to the contributor. That means a contributor who has participated in 3 events of a tour still gets only 1 credit. We can get those single productions by filtering out the event records whose **part_of_a_tour** = **yes**.<br>
3. **Jonathan's method**: For each contributor in each year, count each **event_name** once. For example, [John Bell](https://www.ausstage.edu.au/pages/contributor/639) has 5 event records whose **event_name** = **Henry 4** in 2013, so he would get only 1 credit for them.
</span>
<br>

In practice, each option has its shortcomings, and none is fully reliable. If we need more consistency and reliability, we should keep our method (option 1); if we want a more practical definition for analysis, Jonathan's method (option 3) looks better.
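The three counting rules above can be sketched on a small, hypothetical dataset. The frames `toy_events` and `toy_links`, the contributor id `1`, and all values are invented for illustration; the sketch also assumes that, for option 2, the record with `part_of_a_tour = "no"` is the one that stands for the whole production.

```python
import pandas as pd

# Hypothetical toy data: one contributor linked to six event records.
toy_events = pd.DataFrame(
    {
        "eventid": [1, 2, 3, 4, 5, 6],
        "event_name": ["Henry 4"] * 5 + ["Hamlet"],
        "part_of_a_tour": ["yes", "yes", "yes", "yes", "no", "no"],
        "year_of_event": [2013] * 5 + [2014],
    }
)
toy_links = pd.DataFrame({"contributorid": [1] * 6, "eventid": [1, 2, 3, 4, 5, 6]})
credits = toy_links.merge(toy_events, on="eventid")

# Option 1: every event record is one credit.
option_1 = credits.groupby("contributorid")["eventid"].nunique()

# Option 2: drop records flagged as tour legs, then count what is left
# (assumption: the "no" record represents the production).
option_2 = (
    credits[credits["part_of_a_tour"] != "yes"]
    .groupby("contributorid")["eventid"]
    .nunique()
)

# Option 3: one credit per distinct (contributor, year, event_name).
option_3 = (
    credits.drop_duplicates(["contributorid", "year_of_event", "event_name"])
    .groupby("contributorid")
    .size()
)

summary = pd.DataFrame(
    {"option_1": option_1, "option_2": option_2, "option_3": option_3}
)
print(summary)
```

On this toy data, option 1 gives the largest count, while options 2 and 3 collapse the five "Henry 4" records into a single credit.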

## Set-up & Data Loading

In [1]:
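import sys

import pandas as pd  # imported explicitly; `pd` is used in the cells below

codefolder = "./codes"
sys.path.append(codefolder)

# Project helper module (assumed to provide CreateMySQLEngine)
from AusStage_Extraction import *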

In [2]:
ausstage_engine = CreateMySQLEngine()

In [3]:
events_df = pd.read_sql_table(
    "acde_event",
    ausstage_engine,
    columns=[
        "eventid",
        "event_name",
        "umbrella",
        "part_of_a_tour",
        "first_date",
        "yyyyfirst_date",
        "last_date",
        "yyyylast_date",
        "primary_genre",
        "primary_content_indicator",
    ],
)
events_df.yyyylast_date = events_df.yyyylast_date.fillna(events_df.yyyyfirst_date)
events_df["year_of_event"] = events_df.yyyyfirst_date

cont_df = pd.read_sql_table(
    "acde_full_cont",
    ausstage_engine,
    columns=[
        "contributorid",
        "display_name",
        "yyyydate_of_birth",
        "career_start_year",
        "career_end_year",
    ],
)

cc_df = pd.read_sql(
    """
    select
        ccl.CHILDID as childid,
        ccl.CONTRIBUTORID as contributorid,
        rl.parent_relation,
        rl.child_relation
    from
        `ausstage`.`contribcontriblink` `ccl`
    left join `ausstage`.`relation_lookup` `rl` on
        `ccl`.`relationlookupid` = `rl`.`relationlookupid`;
    """,
    ausstage_engine,
)

event_contfunc = pd.read_sql(
    """
    select
        cel.CONTRIBUTORID as contributorid,
        cel.EVENTID as eventid,
        cel.`FUNCTION` as functionid,
        ifnull(cfp.PREFERREDTERM, "Unknown") as `function`,
        ifnull(cel.`PRIMARY_CREATOR`, "no") as `primary_creator`
    from
        `ausstage`.`conevlink` `cel`
    left join `ausstage`.`contributorfunctpreferred` `cfp` on
        `cfp`.`CONTRIBUTORFUNCTPREFERREDID` = `cel`.`FUNCTION`
    where cel.CONTRIBUTORID is not null;
    """,
    ausstage_engine,
)

ee_df = pd.read_sql(
    """
    select
        eel.CHILDID as childid,
        eel.EVENTID as eventid,
        rl.parent_relation,
        rl.child_relation
    from
        `ausstage`.`eventeventlink` `eel`
    left join `ausstage`.`relation_lookup` `rl` on
        `eel`.`relationlookupid` = `rl`.`relationlookupid`;
    """,
    ausstage_engine,
)

In [4]:
cont_df[["career_start_year", "career_end_year"]] = cont_df[
    ["career_start_year", "career_end_year"]
].astype(float)
cont_df.yyyydate_of_birth = cont_df.yyyydate_of_birth.apply(
    lambda x: None if x == "" else x
).astype(float)
cont_df = cont_df.query("career_start_year >= 1900")

## Events


Ideally, the second option would be the best. However, we need to work out how to apply this definition and whether it is feasible with the current dataset. We check the data integrity (accuracy, completeness, and consistency) by exploring the event dataset.

### How many events have event relations?

In [5]:
print(f"There are {ee_df.shape[0]} event relation records")

There are 7351 event relation records


There are 4 types of relations in the event relation records. Only <b>Is umbrella event of</b> and <b>Is tour of</b> have an associated attribute in the event records: <b>umbrella</b> and <b>part_of_a_tour</b>, respectively.

In [6]:
# rename_axis + reset_index(name=...) gives the same columns on old and new pandas versions
ee_df.parent_relation.value_counts().rename_axis("Event_Relation").reset_index(
    name="Num_of_Relation"
)

| | Event_Relation | Num_of_Relation |
|---|---|---|
| 0 | Is umbrella event of | 5618 |
| 1 | Is tour of | 1103 |
| 2 | Has part | 318 |
| 3 | Mixed Bill | 305 |


### How many events are in AusStage?

In [7]:
print(f"There are {events_df.shape[0]} events in total")

There are 124745 events in total


Two attributes of an event record indicate its relations to other events:
1. <b>part_of_a_tour</b>: this attribute has very good completeness.
2. <b>umbrella</b>: the missing rate is so high (77.24%) that we can't use it in practice.

(I guess these 2 attributes were added recently, especially *umbrella*. That might explain why *umbrella* has such a high missing rate.)

Since we won't use the <b>umbrella</b> attribute, the difference between <b>umbrella</b> and <b>part_of_a_tour</b> does not matter here.

In [8]:
round(
    (
        (events_df[["eventid", "umbrella", "part_of_a_tour"]].isnull())
        | (events_df[["eventid", "umbrella", "part_of_a_tour"]] == "")
    ).sum()
    * 100
    / events_df.shape[0],
    2,
).apply(lambda x: str(x) + "%").to_frame("Missing Rate")

| | Missing Rate |
|---|---|
| eventid | 0.0% |
| umbrella | 77.24% |
| part_of_a_tour | 0.0% |


### Check Consistency of "part_of_a_tour"

Next, let's look into the <b>part_of_a_tour</b> attribute.<br>
Here is an example of a tour, "Henry 4". Only one of its five records has <b>part_of_a_tour = 'no'</b>. This suggests that, to count events in units of tours, we need to filter out the event records having <b>part_of_a_tour = 'yes'</b>.

In [62]:
events_df[(events_df.eventid.isin([139866, 104685, 104686, 104687, 104688,]))][
    ["eventid", "event_name", "part_of_a_tour", "first_date", "last_date"]
]

| | eventid | event_name | part_of_a_tour | first_date | last_date |
|---|---|---|---|---|---|
| 69245 | 104685 | Henry 4 | yes | 2013-02-23 | 2013-03-09 |
| 69246 | 104686 | Henry 4 | yes | 2013-03-14 | 2013-03-30 |
| 69247 | 104687 | Henry 4 | yes | 2013-04-05 | 2013-04-13 |
| 69248 | 104688 | Henry 4 | yes | 2013-04-19 | 2013-04-26 |
| 102872 | 139866 | Henry 4 | no | 2013-05-04 | 2013-05-26 |


As shown below, 31407 event records are flagged as part of a tour. Compared with only 1103 <b>"Is tour of"</b> relations, this strongly suggests that the tour relation records are incomplete and inconsistent with the <b>part_of_a_tour</b> attribute.

In [10]:
events_df.part_of_a_tour.value_counts().to_frame()

| | part_of_a_tour |
|---|---|
| no | 93338 |
| yes | 31407 |
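One way to quantify this mismatch is to check how many flagged events appear in any "Is tour of" relation. The following is a minimal sketch on hypothetical stand-ins for `events_df` and `ee_df`; it assumes that both `eventid` and `childid` in the relation table refer to event ids, which matches the query above but has not been verified against the schema.

```python
import pandas as pd

# Hypothetical toy stand-ins for events_df and ee_df.
toy_events = pd.DataFrame(
    {"eventid": [1, 2, 3, 4], "part_of_a_tour": ["yes", "yes", "yes", "no"]}
)
toy_relations = pd.DataFrame(
    {"eventid": [1], "childid": [2], "parent_relation": ["Is tour of"]}
)

# Keep only tour relations and collect every event id on either side.
tour_links = toy_relations[toy_relations["parent_relation"] == "Is tour of"]
linked_ids = set(tour_links["eventid"]) | set(tour_links["childid"])

# Share of flagged events that are backed by at least one tour relation.
flagged = toy_events[toy_events["part_of_a_tour"] == "yes"]
has_link = flagged["eventid"].isin(linked_ids)
coverage = has_link.mean()
print(f"{has_link.sum()} of {len(flagged)} flagged events have a tour relation")
```

Running the same steps on the real `events_df` and `ee_df` would show how much of the `part_of_a_tour` flag is backed by explicit relation records.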


Also, some events have only "yes" records in <b>part_of_a_tour</b> (see "Tatty Toys and Perfect People"). If we filter out the records flagged as part of a tour, those tours are dropped entirely.

In [11]:
events_df.pivot_table(
    index="event_name",
    columns=["part_of_a_tour"],
    values="eventid",
    aggfunc="count",
    fill_value=0,
).sort_values("no", ascending=False)

| event_name | no | yes |
|---|---|---|
| Sydney Symphony Orchestra | 501 | 172 |
| Corroboree | 494 | 23 |
| Sydney Symphony | 380 | 6 |
| A Midsummer Night's Dream | 197 | 67 |
| Australian Chamber Orchestra | 190 | 42 |
| ... | ... | ... |
| Ballet Rambert at the Belgrade Theatre Coventry September 1958 | 0 | 1 |
| Ballet Rambert at the Birmingham Repertory Theatre August 1950 | 0 | 1 |
| Ballet Rambert at the Birmingham Repertory Theatre August 1951 | 0 | 1 |
| Ballet Rambert at the Birmingham Repertory Theatre December 1980 | 0 | 1 |
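The event names that would disappear under such a filter can be listed directly. This is a minimal sketch on a hypothetical `toy_events` frame (names and values are invented); the same steps should apply to `events_df`.

```python
import pandas as pd

# Hypothetical toy data: "Tour Only Show" has no record flagged "no".
toy_events = pd.DataFrame(
    {
        "eventid": [1, 2, 3, 4, 5],
        "event_name": [
            "Henry 4",
            "Henry 4",
            "Tour Only Show",
            "Tour Only Show",
            "Hamlet",
        ],
        "part_of_a_tour": ["yes", "no", "yes", "yes", "no"],
    }
)

# Count records per name and flag; reindex guards against a missing column.
counts = pd.crosstab(toy_events["event_name"], toy_events["part_of_a_tour"])
counts = counts.reindex(columns=["no", "yes"], fill_value=0)

# Names whose records are all flagged "yes" vanish once "yes" records are dropped.
only_yes = counts[(counts["no"] == 0) & (counts["yes"] > 0)]
lost_names = only_yes.index.tolist()
print(lost_names)
```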


### Some failure cases for Option 3

* Different names for the same event: "Richard 3" vs. "Richard III" for [John Bell](https://www.ausstage.edu.au/pages/contributor/639).
* For a popular masterpiece that a contributor has been performing for years, is it a single production (one credit), or a different production (credit) each year? How do we know whether it is the same version? See, e.g., [Yasmina Reza](https://www.ausstage.edu.au/pages/contributor/1).
* Same name for different events: "Brandenburg Orchestra" for [Paul Dyer](https://www.ausstage.edu.au/pages/contributor/449767).
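The first and third cases above can be reproduced on hypothetical data. In this sketch the contributor ids, years, and the `production_id` column are all invented; `production_id` is a ground-truth label that AusStage does not provide and is used only to show the direction of each error.

```python
import pandas as pd

# Hypothetical credits: contributor 1 has one production under two spellings,
# contributor 2 has two different productions under one shared name.
toy = pd.DataFrame(
    {
        "contributorid": [1, 1, 2, 2],
        "year_of_event": [2002, 2002, 2010, 2010],
        "event_name": [
            "Richard 3",
            "Richard III",
            "Brandenburg Orchestra",
            "Brandenburg Orchestra",
        ],
        "production_id": ["A", "A", "B", "C"],  # invented ground truth
    }
)


def count_distinct(df: pd.DataFrame, key: str) -> pd.Series:
    """Count distinct (year, key) pairs per contributor."""
    deduped = df.drop_duplicates(["contributorid", "year_of_event", key])
    return deduped.groupby("contributorid").size()


comparison = pd.DataFrame(
    {
        "option_3": count_distinct(toy, "event_name"),
        "true_productions": count_distinct(toy, "production_id"),
    }
)
print(comparison)
```

Contributor 1 is over-counted because of the spelling variant, while contributor 2 is under-counted because two different events share a name.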