<a href="https://colab.research.google.com/github/claytoncohn/CoralBleaching_SkinCancer/blob/master/DetectingCausalRelations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
DATA_TYPE = "coral"

This notebook is created by Clayton Cohn for the purposes of detecting the existence of causal chains in the Coral Bleaching and Skin Cancer datasets using BERT.

BERT will be fine-tuned for binary classification: 0 indicating the absense of a causal relation and 1 indicating the presence of a causal relation.

The code in this notebook is originally adopted from:

https://colab.research.google.com/drive/1ywsvwO6thOVOrfagjjfuxEf6xVRxbUNO#scrollTo=IUM0UA1qJaVB

I have adapted it for use with the Skin Cancer and Coral Bleaching datasets below:

https://knowledge.depaul.edu/display/DNLP/Tasks+and+Data

**Make sure to define the DATA_TYPE constant at the top as either "coral" or "skin" prior to running this notebook.**

---


Mount Drive to Colab.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import the desired dataset.

**Note: the Coral Bleaching data had to be preprocessed to remove extraneous "did not finish" notations. Otherwise, Pandas would not be able to properly import it.**

In [3]:
import pandas as pd

DATA_PATH = "drive/My Drive/colab/data/"
DATA_NAME = ""

if DATA_TYPE == "skin":
  DATA_NAME = "EBA1415-SkinCancer-big-sentences.tsv"
elif DATA_TYPE == "coral":
  DATA_NAME = "EBA1415-CoralBleaching-big-sentences.tsv"
else:
  print("DATA_TYPE must be set to either 'coral' or 'skin.'\nThe DATA_TYPE variable is the first line in this notebook.")

h = 0 if DATA_TYPE == "skin" else None

df = pd.read_csv(DATA_PATH + DATA_NAME, delimiter='\t', header=h, names=['essay', 'relation', 's_num', 'sentence'])
df.head(10)

Unnamed: 0,essay,relation,s_num,sentence
0,EBA1415_KNKC_1_CB_ES-05410,O,1.0,Coral and zooxanthellae depend an each other i...
1,EBA1415_KNKC_1_CB_ES-05410,R-7-50,2.0,"If the coral dies, or gets bleached, then the ..."
2,EBA1415_KNKC_1_CB_ES-05410,O,3.0,Or the other way around.
3,EBA1415_KNKC_1_CB_ES-05410,O,4.0,"In the text Shifting Trade Winds, it talks abo..."
4,EBA1415_KNKC_1_CB_ES-05410,R-3-1,5.0,And another source states how when the water t...
5,EBA1415_KNKC_1_CB_ES-05410,O,6.0,"When trade winds weaken, sea levels rises inch..."
6,EBA1415_KNKC_1_CB_ES-05410,O,7.0,Which is bad for coral and zooxanthellae becau...
7,EBA1415_KNKC_1_CB_ES-05410,O,8.0,If the zooxanthellae can't get a light source ...
8,EBA1415_KNKC_1_CB_ES-05410,R-5B-50,9.1,Which can affect the coral because if the zoox...
9,EBA1415_KNKC_1_CB_ES-05410,R-50-7,9.2,Which can affect the coral because if the zoox...


Must transform relation labels to binary labels.


In [4]:
relations_pd = df.relation.copy(deep=True)

coral_relations = [
                   "1,2", "1,3", "1,4", "1,5", "1,5B", "1,14", "1,6", "1,7", "1,50",
                   "2,3", "2,4", "2,5", "2,5B", "2,14", "2,6", "2,7", "2,50",
                   "3,4", "3,5", "3,5B", "3,14", "3,6", "3,7", "3,50",
                   "4,5", "4,5B", "4,14", "4,6", "4,7", "4,50",
                   "5,5B", "5,14", "5,6", "5,7", "5,50",
                   "5B,14", "5B,6", "5B,7", "5B,50",
                   "11,12", "11,13", "11,14", "11,6", "11,7", "11,50",
                   "12,13", "12,14", "12,6", "12,7", "12,50",
                   "13,14", "13,6", "13,7","13,50",
                   "14,6", "14,7", "14,50",
                   "6,7", "6,50",
                   "7,50"
                  ]
print("{} unique coral bleaching relations".format(len(coral_relations)))

skin_relations = [
                  "1,2", "1,3", "1,4", "1,5", "1,6", "1,50",
                  "2,3", "2,4", "2,5", "2,6", "2,50",
                  "3,4", "3,5", "3,6", "3,50",
                  "4,5", "4,6", "4,50",
                  "5,6", "5,50",
                  "11,12", "11,6", "11,50",
                  "12,6", "12,50",
                  "6,50"     
                 ]

print("{} unique skin cancer relations".format(len(skin_relations)))

for i, rel in relations_pd.items():
  chain = rel.split("-")
  if chain[0] != "O":

    chain = chain[1] + "," + chain[2]

    if DATA_TYPE == "coral":
      if chain in coral_relations:
        relations_pd.at[i] = True
        continue
    
    elif DATA_TYPE == "skin":
      if chain in skin_relations:
        relations_pd.at[i] = True
        continue
    
  relations_pd.at[i] = False

df_binary = df.copy(deep=True)
df_binary.head(10)

60 unique coral bleaching relations
26 unique skin cancer relations


Unnamed: 0,essay,relation,s_num,sentence
0,EBA1415_KNKC_1_CB_ES-05410,O,1.0,Coral and zooxanthellae depend an each other i...
1,EBA1415_KNKC_1_CB_ES-05410,R-7-50,2.0,"If the coral dies, or gets bleached, then the ..."
2,EBA1415_KNKC_1_CB_ES-05410,O,3.0,Or the other way around.
3,EBA1415_KNKC_1_CB_ES-05410,O,4.0,"In the text Shifting Trade Winds, it talks abo..."
4,EBA1415_KNKC_1_CB_ES-05410,R-3-1,5.0,And another source states how when the water t...
5,EBA1415_KNKC_1_CB_ES-05410,O,6.0,"When trade winds weaken, sea levels rises inch..."
6,EBA1415_KNKC_1_CB_ES-05410,O,7.0,Which is bad for coral and zooxanthellae becau...
7,EBA1415_KNKC_1_CB_ES-05410,O,8.0,If the zooxanthellae can't get a light source ...
8,EBA1415_KNKC_1_CB_ES-05410,R-5B-50,9.1,Which can affect the coral because if the zoox...
9,EBA1415_KNKC_1_CB_ES-05410,R-50-7,9.2,Which can affect the coral because if the zoox...


In [5]:
df_binary.relation = relations_pd
df_binary.head(10)

Unnamed: 0,essay,relation,s_num,sentence
0,EBA1415_KNKC_1_CB_ES-05410,False,1.0,Coral and zooxanthellae depend an each other i...
1,EBA1415_KNKC_1_CB_ES-05410,True,2.0,"If the coral dies, or gets bleached, then the ..."
2,EBA1415_KNKC_1_CB_ES-05410,False,3.0,Or the other way around.
3,EBA1415_KNKC_1_CB_ES-05410,False,4.0,"In the text Shifting Trade Winds, it talks abo..."
4,EBA1415_KNKC_1_CB_ES-05410,False,5.0,And another source states how when the water t...
5,EBA1415_KNKC_1_CB_ES-05410,False,6.0,"When trade winds weaken, sea levels rises inch..."
6,EBA1415_KNKC_1_CB_ES-05410,False,7.0,Which is bad for coral and zooxanthellae becau...
7,EBA1415_KNKC_1_CB_ES-05410,False,8.0,If the zooxanthellae can't get a light source ...
8,EBA1415_KNKC_1_CB_ES-05410,True,9.1,Which can affect the coral because if the zoox...
9,EBA1415_KNKC_1_CB_ES-05410,False,9.2,Which can affect the coral because if the zoox...


Next, we must address the issue that some sentences have multiple relations. This could be a problem if a sentence has one valid relation and one invalid one (the same sentence will be labeled True in one instance and False in another instance). To correct this, we will remove the duplicate instances and define each sentence to be True if it contains *at least one* causal relation.

The parse was provided by @TrentonMcKinney on StackOverflow:
https://stackoverflow.com/questions/63697275/regex-string-for-different-versions/63697498#63697498

In [6]:
df_duplicate_sentences = df_binary[df_binary.s_num.astype(str).str.split('.', expand=True)[1] != '0']
df_duplicate_sentences.head(25)

Unnamed: 0,essay,relation,s_num,sentence
8,EBA1415_KNKC_1_CB_ES-05410,True,9.1,Which can affect the coral because if the zoox...
9,EBA1415_KNKC_1_CB_ES-05410,False,9.2,Which can affect the coral because if the zoox...
23,EBA1415post_WSKT_1_CB_ES-05486,True,23.1,This is a problem because corals need co2 in o...
24,EBA1415post_WSKT_1_CB_ES-05486,True,23.2,This is a problem because corals need co2 in o...
35,EBA1415_KYNS_4_CB_ES-05388,False,34.1,"In the text it states ""This is because"
36,EBA1415_KYNS_4_CB_ES-05388,False,34.2,Ran out time //
38,EBA1415_KYLS_5_CB_ES-05647,True,36.1,The reason why coral reefs are 'bleach' becaus...
39,EBA1415_KYLS_5_CB_ES-05647,True,36.2,The reason why coral reefs are 'bleach' becaus...
44,EBA1415_KYLS_5_CB_ES-05647,True,41.1,The reasons is if the winds weaken where the P...
45,EBA1415_KYLS_5_CB_ES-05647,True,41.2,The reasons is if the winds weaken where the P...


Now that the duplicates are isolated, they need to be evaluated. If there is at least one relation, one copy of the sentence will be kept as true. If there are no relations, one copy will be kept as false.

In [7]:
import numpy as np

current = -1

same_arr_inds = []
drop_list = []

for i, row in df_duplicate_sentences.iterrows():
  s_num = str(df_duplicate_sentences.loc[i].s_num)
  first_num, second_num = s_num.split(".")

  if first_num != current:
    current = first_num

    if len(same_arr_inds) > 1:
      flag = False
      for n in same_arr_inds:
        if df_duplicate_sentences.loc[n].relation == True:
          flag = True
          break

      left = same_arr_inds[0]
      right = same_arr_inds[1:]

      if flag == True:
       df_duplicate_sentences.loc[left].relation = True
      else:
       df_duplicate_sentences.loc[left].relation = False

      drop_list += right   

    same_arr_inds = []
  same_arr_inds.append(i)

df_duplicate_sentences.drop(drop_list, inplace=True)   

df_duplicate_sentences.head(25)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,essay,relation,s_num,sentence
8,EBA1415_KNKC_1_CB_ES-05410,True,9.1,Which can affect the coral because if the zoox...
23,EBA1415post_WSKT_1_CB_ES-05486,True,23.1,This is a problem because corals need co2 in o...
35,EBA1415_KYNS_4_CB_ES-05388,False,34.1,"In the text it states ""This is because"
38,EBA1415_KYLS_5_CB_ES-05647,True,36.1,The reason why coral reefs are 'bleach' becaus...
44,EBA1415_KYLS_5_CB_ES-05647,True,41.1,The reasons is if the winds weaken where the P...
64,EBA1415_TRJA_11_CB_ES-06108,True,60.1,Another thing that can lead to differences of ...
66,EBA1415_TRJA_11_CB_ES-06108,True,61.1,"The storms increase the amount of fresh water,..."
74,EBA1415_TWDG_11_CB_ES-05698,True,68.1,I don't know exactly the differences in the ra...
85,EBA1415_SDMK_4_CB_ES-04768,True,78.1,Environmental stressors cause the coral to eje...
91,EBA1415_SDMK_4_CB_ES-04768,True,83.1,It seems that when the trade winds push the co...
