### Installation
```
pip install spacy_experimental
pip install chardet
pip install thinc[torch]
pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
```


In [6]:
import spacy
nlp_coref = spacy.load("en_coreference_web_trf")

In [34]:
doc = nlp_coref("My dad passed away last summer after suffering from the behavioral variant FTD, he was older when he was diagnosed but he probably had it longer but we probably missed a lot of signs over the years. When the symptoms really started getting bad that's when my dad's doctor ordered an MRI and we learned that the frontal part of his brain was shrinking and in atrophy.  For my dad, from diagnosis to death it took less than two years, a year and nine and a half months to be exact.    As another poster mentioned I wrote a journal of everything he went through and it was challenging for sure as my mother and I were his caregivers the entire time.  If there was one blessing it was that he never lost his memory of who my mom and I were so that was a good thing.  I could write a book here on what we went through with this disease but almost seven months since his passing, I wish I could take care of him for just one more day.")
print(doc.spans)
print(type(doc.spans.get("coref_clusters_1")))
s = doc.spans.get("coref_clusters_1")
[str(a) for a in s]

{'coref_clusters_1': [My dad, he, he, he, my dad's, his, my dad, he, his, he, his, his, him], 'coref_clusters_2': [the behavioral variant FTD, it, this disease], 'coref_clusters_3': [passed, that, diagnosis to death, his passing], 'coref_clusters_4': [My, my, my, I, my, I, my, I, I, I, I], 'coref_clusters_5': [wrote, it], 'coref_clusters_6': [one blessing, it], 'coref_clusters_7': [my mother, my mom, we], 'coref_clusters_8': [lost, that]}
<class 'spacy.tokens.span_group.SpanGroup'>


['My dad',
 'he',
 'he',
 'he',
 "my dad's",
 'his',
 'my dad',
 'he',
 'his',
 'he',
 'his',
 'his',
 'him']

In [47]:
type(doc)

spacy.tokens.doc.Doc

In [9]:
doc = nlp_coref("Dr. PERSON, mom is behaving the opposite way. She lives in an assisted living facility and she is pushing the call button every few minutes to have them hand her the remote when it is sitting right next to her, wanting them to wipe her when she goes to the bathroom and many other things like that. The caregivers are so frustrated and the nurse is trying to get her to do these things for herself while she still can. I am frustrated and can’t be around her because I am just so exhausted and I feel like her slave. What is your advice?")
print(doc.spans)

{'coref_clusters_1': [mom, She, she, her, her, her, she, her, herself, she, her, her], 'coref_clusters_2': [the remote, it], 'coref_clusters_3': [them, them], 'coref_clusters_4': [Dr. PERSON, I, I, I, your]}


In [46]:
doc = nlp_coref("""Yes but it all depends on the state of dementia of your LO. Anything I try to say, request her to do or discuss with my mother is met with hostility. She is deeply paranoid and suspicious. She makes unfounded and quite horrid accusations to and about me. She argues with me even when I am being nice to her. She argues about her arguing!!! Bascially I am her punch bag and it is soul destroying. If you haven't already get a PoA and do what you have to do without them making it harder for you. If day to day tasks become impossible to complete then get home help for assistance or check your LO into a care facility. There comes a point when the stress and aggravation is just not worth it. They have dementia what do they know? Many times I feel like my mother is frying my brain and it literally hurts my head and I just want out. Can't do it any more, but I am trapped and she knows it because as well as having dementia she is a manipulative selfish narcissist which is a terrible combination. Get help is all I can say. Trying to deal with contentious issues on a one to one basis with a LO who is resistant non compliant is, 99% of the time, going to fail.""")
print(doc.spans)

{'coref_clusters_1': [I, my, me, me, I, I, I, my, my, my, I, I, I], 'coref_clusters_2': [her, my mother, She, She, She, her, She, her, her, my mother, she, she], 'coref_clusters_3': [am, it], 'coref_clusters_4': [They, they], 'coref_clusters_5': [frying, it], 'coref_clusters_6': [am, it]}


In [11]:
doc = nlp_coref("""We embraced my husband's dementia because.... it is what it is.   When he first got the diagnosis we were mostly relieved, better than the dramas and psychosis that plagued him for a few years.   That had been a miserable time and I was his target.   Knowing what it was,  made our lives better.

We ended up with a good medical team, the right meds after a few months of trials, government pension,  did all the legal paperwork while he could still function and we told people what he had with no shame or hesitation.

7 years down with him at a moderate to severe stage and him sitting most days with his own thoughts, I think maybe his life is not so bad, no decisions, no bills, no driving, no responsibility,  not answerable for anything...... perpetual holiday of the mind.  It's then I think he's the lucky one.   I just get the work.
"""
)
print(doc.spans)

{'coref_clusters_1': [my husband's dementia, it, it, it], 'coref_clusters_2': [my husband's, he, him, his, he, he, him, him, his, his, he], 'coref_clusters_3': [We, we, our, We, we], 'coref_clusters_4': [my, I, I, I, I]}


### Proposed Methodology
1. Rank the most common coref cluster heads of interest (starting with "my" followed by another word). Count them and calculate the distribution of chain lengths (number of references per cluster).
2. Create a list of relevant coref cluster heads.
3. Determine a threshold for "talking about self": how long is the "I" chain (the knowledgeable informant) compared with the patient's chain?
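
Steps 1 and 3 can be sketched as follows, assuming each comment's clusters are already available as plain lists of mention strings (the shape printed by `doc.spans` above). The helper names `cluster_head` and `chain_length_distribution` are hypothetical, and treating the longest mention as the head is an assumption for illustration, not a settled choice.

```python
from collections import Counter

# Hypothetical input: mention strings per cluster, shortened from the first example.
clusters = {
    "coref_clusters_1": ["My dad", "he", "he", "his", "him"],
    "coref_clusters_2": ["the behavioral variant FTD", "it", "this disease"],
    "coref_clusters_4": ["My", "my", "I", "I"],
}

def cluster_head(mentions):
    """Assumed heuristic: the longest mention (lowercased) names the cluster."""
    return max(mentions, key=len).lower()

def chain_length_distribution(all_clusters):
    """Map each cluster head to the number of mentions in its chain."""
    return {cluster_head(m): len(m) for m in all_clusters.values()}

# How often each head occurs, and how long each chain is.
head_counts = Counter(cluster_head(m) for m in clusters.values())
chain_lengths = chain_length_distribution(clusters)
```

Across many comments the same head (for example "my dad") will recur, so `head_counts` would be accumulated over the whole dataset rather than a single comment.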

### Proposed NLP Pipeline
1. Remove single-sentence comments
2. Run NER and replace names
3. Remove thank-you messages
4. Apply the coreference pipeline
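
Steps 1 and 3 can be approximated without a model. The sketch below uses a naive regex sentence splitter and a hand-written thank-you pattern; both are assumptions for illustration, and `preprocess`, `THANKS_RE` and `SENT_SPLIT_RE` are hypothetical names. NER (step 2) and coreference (step 4) are left out because they need the spaCy models.

```python
import re

# Assumed pattern for sentences that open with a thank-you; tune against real comments.
THANKS_RE = re.compile(
    r"^\s*(thanks|thank you|thank u|thx)\b[^.!?]*[.!?]*\s*$", re.IGNORECASE
)
# Naive splitter: break after ., ! or ? when followed by whitespace.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into rough sentences, dropping empty pieces."""
    return [s for s in SENT_SPLIT_RE.split(text.strip()) if s]

def preprocess(text):
    """Return None for single-sentence comments, else the text minus thank-you sentences."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return None
    kept = [s for s in sentences if not THANKS_RE.match(s)]
    return " ".join(kept)
```

The pattern is deliberately crude: it would also drop a sentence such as "Thanks to my sister, we managed", so it should be checked on a sample before use.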

In [4]:
import pandas as pd
import sqlite3
import os
from dotenv import load_dotenv
import sys
load_dotenv()

comments = pd.read_sql("SELECT ROWID, * FROM comments", sqlite3.connect(os.path.join("..", "data", os.environ["SQLITE_DB_NAME"])))

In [13]:
print(comments.shape)
sys.getsizeof(comments) / 1024**3 # GB

(232214, 13)


0.25164382439106703

In [7]:
nlp_sm = spacy.load("en_core_web_sm")
nlp_sm.add_pipe("merge_noun_chunks")



### Initial Filtering Dataset
Reduce the dataset size before applying the compute-heavy coreference model.

In [15]:
# Reduce the dataset before applying coref
# Filtering criteria
comments["sentence_count"] = comments["comment_text"].apply(lambda x: len(list(nlp_sm(x).sents)))
filter_1 = (comments.is_reply == 0)
# filter_2 = (comments.reply_by_channel_owner == 0)  # redundant
filter_3 = (comments.sentence_count > 1)

filters = [
    ('remove replies', filter_1),
    ('remove comments with only one sentence', filter_3)
]

filtered = comments.copy()
for name, bool_srs in filters:
    orig_rows = filtered.shape[0]
    filtered = filtered.loc[bool_srs, :]
    updated_rows = filtered.shape[0]
    print(f"{name}: {orig_rows} -> {updated_rows} ({updated_rows / orig_rows:.2%})")

filtered.to_pickle(os.path.join("..", "data", "filtered_comments.pkl"))

remove replies: 232214 -> 179816 (77.44%)
remove replies by channel owner: 179816 -> 179816 (100.00%)
remove comments with only one sentence: 179816 -> 87476 (48.65%)


In [27]:
filtered.comment_text.sample(10)

215671    I understand what you going through. keep the ...
3450      Yes I have.  Just recently.  Good resource for...
162735    Would love to hear her on a decent piano. That...
10660     OMG all of them!!!! But then I looked at the p...
160900    Is it out of tune?...I forgot how a perfectly ...
61022     Doc… I love you man, but vegetable carbohydrat...
87825     My grandma has dementia..her memory lasts like...
114940    even though she’s and actress, i HATE that bit...
10805     Yes, going to make a binder for myself and my ...
103976    Why do some people scratch their heads when th...
Name: comment_text, dtype: object

### Apply Coreference Resolution

In [43]:
filtered = pd.read_pickle(os.path.join("..", "data", "filtered_comments.pkl"))
database_path = os.path.join("..", "data", os.environ["SQLITE_DB_NAME"])
conn = sqlite3.connect(database_path)

# Iterate over the filtered comments in batches so results are written incrementally
batch_size = 1500
n_rows = filtered.shape[0]
for i in range(0, n_rows, batch_size):
    print(f"Processing {i} to {min(i + batch_size, n_rows)}, {i / n_rows:.2%}")
    batch = filtered.iloc[i:i+batch_size, :].copy()
    batch["coref_result"] = batch["comment_text"].apply(nlp_coref)
    batch["coref_result_json"] = batch["coref_result"].apply(lambda x: str(x.to_json()))
    batch["coref_spans_json"] = batch["coref_result"].apply(lambda x: str(x.spans))
    batch["coref_doc_bytes"] = batch["coref_result"].apply(lambda x: x.to_bytes())
    batch.rename(columns={"rowid": "comment_rowid"}, inplace=True)
    batch.drop(columns=[c for c in batch.columns if c not in ["comment_rowid", "coref_result_json", "coref_spans_json", "coref_doc_bytes"]], inplace=True)
    batch.to_sql(con=conn, name="comments_coref", if_exists="append", index=False)


Processing 0 to 150, 0.00%


### Load Coref Results

In [8]:
database_path = os.path.join("..", "data", os.environ["SQLITE_DB_NAME"])
conn = sqlite3.connect(database_path)

comments_coref = pd.read_sql("SELECT * FROM comments_coref3", conn)
comments_coref["coref_result"] = comments_coref["coref_doc_bytes"].apply(lambda x: spacy.tokens.doc.Doc(nlp_coref.vocab).from_bytes(x))

In [73]:

def unpack_coref_to_tokens(row):
    lst = []
    all_coref_chains = row["coref_result"].spans
    for span in all_coref_chains:
        span_index = span.split("_")[-1]
        for j, ent in enumerate(all_coref_chains[span]):
            p = nlp_sm(ent.text)
            s = next(p.sents)

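            # Reset per-mention state so values never leak from a previous mention
            possessive_tup = None
            possessive_lemma = None
            root_tup = None
            root_lemma = None
            root_first = True
            compressed_possessive = None
            compressed_possessive_lemma = None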

            for i, t in enumerate(s):
                if t.dep_ == "poss" and t.pos_ == "PRON":
                    possessive_tup = (t.text, i)
                    possessive_lemma = t.lemma_
                elif t.dep_ == "ROOT":
                    root_tup = (t.text, i, t.pos_)
                    root_lemma = t.lemma_
            if root_tup is not None and possessive_tup is not None:
                root_first = root_tup[1] < possessive_tup[1]
                if root_first:
                    compressed_possessive = root_tup[0] + " " + possessive_tup[0] 
                    compressed_possessive_lemma = root_lemma + " " + possessive_lemma
                else:
                    compressed_possessive = possessive_tup[0] + " " + root_tup[0]
                    compressed_possessive_lemma = possessive_lemma + " " + root_lemma
                    
                
            row_dict = {
                "comment_rowid": row["comment_rowid"],
                "ref_chain_index": span_index,
                "ref_index": j,
                "original_token": ent.text,
                "lower_token": ent.text.lower(),
                "root": s.root,
                "root_lemmatized": root_lemma,
                "root_pos": root_tup[2],
                "compressed_possessive": compressed_possessive,
                "compressed_possessive_lemmatized": compressed_possessive_lemma
            }
            lst.append(row_dict)
    return lst

    


In [74]:
# Collect all rows first and build the DataFrame once; concatenating inside the loop is quadratic
rows = []
for _, row in comments_coref.iterrows():
    rows.extend(unpack_coref_to_tokens(row))
df = pd.DataFrame(rows)

In [75]:
df["final_token"] = df["compressed_possessive_lemmatized"].fillna(df["root_lemmatized"]).str.lower()
df["final_token_len"] = df["final_token"].str.len()
df["within_chain_max_len"] = df.groupby(["comment_rowid", "ref_chain_index"])["final_token_len"].transform("max")


In [79]:
a = nlp_sm("mother")
for i, t in enumerate(a):
    print(t.text, t.dep_, t.pos_, t.ent_iob, t.ent_type)

mother ROOT NOUN 2 0


In [76]:
df.iloc[25:50]

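|  | comment_rowid | ref_chain_index | ref_index | original_token | lower_token | root | root_lemmatized | root_pos | compressed_possessive | compressed_possessive_lemmatized | final_token | final_token_len | within_chain_max_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 1 | 2 | 10 | he | he | he | he | PRON |  |  | he | 2 | 10 |
| 26 | 1 | 3 | 0 | we | we | we | we | PRON |  |  | we | 2 | 2 |
| 27 | 1 | 3 | 1 | we | we | we | we | PRON |  |  | we | 2 | 2 |
| 28 | 1 | 4 | 0 | an attorney | an attorney | attorney | attorney | NOUN |  |  | attorney | 8 | 8 |
| 29 | 1 | 4 | 1 | that | that | that | that | PRON |  |  | that | 4 | 8 |
| 30 | 1 | 5 | 0 | you | you | you | you | PRON |  |  | you | 3 | 4 |
| 31 | 1 | 5 | 1 | you | you | you | you | PRON |  |  | you | 3 | 4 |
| 32 | 1 | 5 | 2 | your | your | your | your | PRON |  |  | your | 4 | 4 |
| 33 | 1 | 6 | 0 | FTD | ftd | FTD | FTD | PROPN |  |  | ftd | 3 | 7 |
| 34 | 1 | 6 | 1 | this disease | this disease | disease | disease | NOUN |  |  | disease | 7 | 7 |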


### Compute Chain Level Features
- Count of first-person subject pronoun "I" mentions, excluding possessive "my"
- Longest lemmatized possessive phrase is designated as the chain "head"; its root must be PROPN or NOUN
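
The two chain-level features above might be computed along these lines. This is a sketch over plain dictionaries that mirror the columns of `df`; the function name `chain_features`, the sample rows, and the choice to return `None` when a chain has no nominal mention are all assumptions.

```python
from collections import defaultdict

# Hypothetical rows using a subset of the `df` columns.
rows = [
    {"comment_rowid": 1, "ref_chain_index": "1", "lower_token": "my dad", "final_token": "my dad", "root_pos": "NOUN"},
    {"comment_rowid": 1, "ref_chain_index": "1", "lower_token": "he", "final_token": "he", "root_pos": "PRON"},
    {"comment_rowid": 1, "ref_chain_index": "4", "lower_token": "my", "final_token": "my", "root_pos": "PRON"},
    {"comment_rowid": 1, "ref_chain_index": "4", "lower_token": "i", "final_token": "i", "root_pos": "PRON"},
    {"comment_rowid": 1, "ref_chain_index": "4", "lower_token": "i", "final_token": "i", "root_pos": "PRON"},
]

def chain_features(rows):
    """Group mentions by (comment, chain) and compute per-chain features."""
    chains = defaultdict(list)
    for r in rows:
        chains[(r["comment_rowid"], r["ref_chain_index"])].append(r)

    features = {}
    for key, mentions in chains.items():
        # Only mentions whose root is a noun or proper noun can serve as the head.
        nominal = [m["final_token"] for m in mentions if m["root_pos"] in {"NOUN", "PROPN"}]
        features[key] = {
            "n_mentions": len(mentions),
            "n_subject_i": sum(m["lower_token"] == "i" for m in mentions),
            "head": max(nominal, key=len) if nominal else None,
        }
    return features

features = chain_features(rows)
```

A chain made only of pronouns, such as the narrator's own "I"/"my" chain, ends up with `head` set to `None`, which gives one possible way to tell it apart from chains about other people.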

### Compute Comment Level Features
- Balance first-person "I" mentions against mentions of the narrated subjects (for example, the patient)
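
One possible reading of this balance is the share of tracked mentions that are first-person "I". The sketch below applies it to the narrator and patient chains from the first example; `self_focus_ratio` and `count_subject_i` are hypothetical names, and the ratio itself is an assumed definition rather than an established metric.

```python
def count_subject_i(mentions):
    """Count subject-pronoun 'I' mentions, ignoring possessive 'my'/'My'."""
    return sum(1 for m in mentions if m == "I")

def self_focus_ratio(first_person_count, narrated_count):
    """Share of tracked mentions that are first-person 'I'; None when nothing is tracked."""
    total = first_person_count + narrated_count
    return first_person_count / total if total else None

# Chains copied from the first example's doc.spans output.
narrator = ["My", "my", "my", "I", "my", "I", "my", "I", "I", "I", "I"]
patient = ["My dad", "he", "he", "he", "my dad's", "his", "my dad",
           "he", "his", "he", "his", "his", "him"]

ratio = self_focus_ratio(count_subject_i(narrator), len(patient))
```

A value near 0 would suggest the comment mostly narrates someone else, and a value near 1 that it mostly talks about the writer; where to put the threshold is the open question from the methodology.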