In [1]:
import pandas as pd

import spacy

nlp = spacy.load('en_core_web_sm')

In [2]:
df = pd.read_csv(
    '../data/raw/plot_summaries.txt', 
    sep='\t', 
    header=None, 
    names=['wikipedia_movie_id', 'summary']
)

In [14]:
# Single-word terms, compared against token lemmas.
# A set gives constant-time membership tests.
death_terms = {
    'die', 'dies', 'died', 'dying',
    'kill', 'kills', 'killed', 'killing',
    'murder', 'murders', 'murdered', 'murdering',
    'assassinate', 'assassinates', 'assassinated', 'assassinating',
    'perish', 'perishes', 'perished', 'perishing',
    'execute', 'executes', 'executed', 'executing',
    'slaughter', 'slaughters', 'slaughtered', 'slaughtering',
    'slay', 'slays', 'slew', 'slain', 'slaying',
    'poison', 'poisons', 'poisoned', 'poisoning',
    'drown', 'drowns', 'drowned', 'drowning',
    'hang', 'hangs', 'hanged', 'hanging',
    'decapitate', 'decapitates', 'decapitated', 'decapitating',
    'sacrifice', 'sacrifices', 'sacrificed', 'sacrificing',
    'death', 'demise', 'fatality', 'casualty', 'massacre',
    'decease', 'grave', 'suicide', 'extinct', 'martyr',
    'annihilated', 'decimated', 'obliterated', 'devastated',
    'overkill', 'euthanatized', 'extinguished', 'overdosed',
    'deathbed', 'mortal', 'posthumous', 'postmortem',
    'snuffed', 'suffocate', 'corpse', 'coroner', 'cadaver',
}

# Multi-word expressions, matched as token sequences.
death_phrases = [
    'pass away', 'passes away', 'passed away', 'passing away',
    'lose his life', 'lose her life', 'lost his life', 'lost her life',
    'meet their end', 'meets their end', 'met their end',
    'meet his end', 'meet her end', 'met his end', 'met her end',
    'breathe his last', 'breathe her last', 'breathed his last', 'breathed her last',
    'take his life', 'take her life', 'took his life', 'took her life',
    'put down', 'mowed down', 'six feet under', 'bleed out', 'bled out',
    'meet their maker', 'meet his maker', 'meet her maker',
    'met their maker', 'met his maker', 'met her maker',
    'not long for this world', 'not long for this life',
    'pay the ultimate price', 'paid the ultimate price', 'paying the ultimate price',
    'taken out', 'took out',
    'rigor mortis',  # multi-word, so it belongs with the phrases rather than the lemma terms
]
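
Hand-maintained keyword lists tend to pick up duplicates, as well as multi-word entries that can never equal a single token lemma. The following is a minimal sketch of a sanity check using only the standard library. The helper name `audit_terms` and the sample list are hypothetical and only illustrate the idea.

```python
from collections import Counter

def audit_terms(terms):
    """Return (duplicates, multi_word) for a list of keyword strings."""
    counts = Counter(terms)
    # Entries listed more than once.
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    # A term containing whitespace cannot equal a single token's lemma.
    multi_word = sorted({t for t in terms if len(t.split()) > 1})
    return duplicates, multi_word

# Illustrative sample, not the full list.
sample_terms = ['perish', 'corpse', 'rigor mortis', 'perish']
duplicates, multi_word = audit_terms(sample_terms)
print(duplicates)
print(multi_word)
```

Running the same check on the full lists before building the matcher is cheap and catches entries that would otherwise silently never match.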

In [15]:
from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
death_phrase_patterns = [nlp.make_doc(text) for text in death_phrases]
# Pass the patterns as a list (spaCy v3 signature; the v2 form was add(key, None, *patterns)).
phrase_matcher.add('DEATH_PHRASES', death_phrase_patterns)
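
By default `PhraseMatcher` compares exact token text, so a phrase that starts a sentence with a capital letter is not matched by a lowercase pattern. The sketch below contrasts the default with `attr='LOWER'`. It assumes spaCy v3 is installed and uses a blank English pipeline, since only tokenization is needed. The names `nlp_blank` and `build_matcher` and the sample sentence are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is enough here: PhraseMatcher only needs tokenization.
nlp_blank = spacy.blank('en')

def build_matcher(phrases, attr):
    """Build a PhraseMatcher that compares tokens on the given attribute."""
    matcher = PhraseMatcher(nlp_blank.vocab, attr=attr)
    matcher.add('DEATH_PHRASES', [nlp_blank.make_doc(p) for p in phrases])
    return matcher

phrases = ['passed away']
doc = nlp_blank('Passed away in 1950, he left no heirs.')

exact = build_matcher(phrases, 'ORTH')     # default: exact token text
lowered = build_matcher(phrases, 'LOWER')  # case-insensitive comparison

n_exact = len(exact(doc))
n_lowered = len(lowered(doc))
print(n_exact, n_lowered)
```

Whether case-insensitive matching is appropriate here is a judgment call: it recovers sentence-initial phrases but would also change the counts reported later in this notebook.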

In [16]:
def contains_death_terms(text):
    """Return True if the text has a non-negated death-related term or phrase."""
    doc = nlp(text)

    # Check for death-related lemmas
    for token in doc:
        if token.lemma_.lower() in death_terms:
            # Skip negated terms (e.g., "did not die")
            if not any(child.dep_ == 'neg' for child in token.children):
                return True

    # Check for death-related phrases
    for match_id, start, end in phrase_matcher(doc):
        span = doc[start:end]
        # Skip phrases whose root token is negated
        if not any(child.dep_ == 'neg' for child in span.root.children):
            return True

    return False

In [17]:
df['contains_death'] = df['summary'].apply(contains_death_terms)

df.head()

|   | wikipedia_movie_id | summary | contains_death |
|---|---|---|---|
| 0 | 23890098 | Shlykov, a hard-working taxi driver and Lyosha... | False |
| 1 | 31186339 | The nation of Panem consists of a wealthy Capi... | True |
| 2 | 20663735 | Poovalli Induchoodan is sentenced for six yea... | True |
| 3 | 2231378 | The Lemon Drop Kid , a New York City swindler,... | False |
| 4 | 595909 | Seventh-day Adventist Church pastor Michael Ch... | True |


In [18]:
df['contains_death'].value_counts()

contains_death
True     21452
False    20851
Name: count, dtype: int64

### Many summaries include links and other tags:

```
<ref namehttp://www.mtv.com/news/articles/1596736/20081009/spears_britney.jhtml|titleVena|first2008-09-08|publisher2008-09-10}}{{cite news}}
```
```
<ref name Die Like a Dog, A lauded Mongolian film probes a mongrel's soul | publisher  http://www.time.com/time/asia/asia/magazine/1999/990125/mongolia_dog1.html}}
```
```
<ref namehttp://movies.nytimes.com/movie/226202/Waldo's-Last-Stand/overview |title2008-10-08|work=NY Times}}
```
```
<ref namehttp://www.onf-nfb.gc.ca/eng/collection/film/?idSkin Deep|work9 June 2009}}<ref name0111211|title=Skin Deep}}
```
```
{{Plot|date"Farewell My Concubine Study Notes">{{cite web}}
```

**Notice that some of the tags are mismatched (`<ref ...}}`) or unclosed (`{{...`).**
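
Because the tags are mismatched, an HTML or wikitext parser is unlikely to help, and a few targeted regular expressions may be a more practical first pass. The sketch below is one possible approach rather than a complete cleaner. The function names `strip_wiki_residue` and `has_markup_residue`, the patterns, and the sample strings are all assumptions made for illustration.

```python
import re

# "<ref ... }}": a ref opener closed by template braces instead of ">".
REF_TO_BRACES = re.compile(r'<ref\b[^<>]*?\}\}')
# Well-formed templates such as "{{cite news}}" (no nesting).
TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')
# Ordinary "<ref ...>" and "</ref>" tags.
REF_TAG = re.compile(r'</?ref\b[^<>]*>')

def strip_wiki_residue(text):
    """Remove the markup shapes above and collapse leftover whitespace."""
    for pattern in (REF_TO_BRACES, TEMPLATE, REF_TAG):
        text = pattern.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

def has_markup_residue(text):
    """Flag text that still contains unmatched openers or closers."""
    return any(marker in text for marker in ('{{', '}}', '<ref'))

raw = 'The hero returns.<ref namehttp://example.org/a|title2008}}{{cite news}} He wins.'
cleaned = strip_wiki_residue(raw)
print(cleaned)

unclosed = strip_wiki_residue('{{Plot|date"Notes">{{cite web}} The story begins.')
print(has_markup_residue(unclosed))
```

Unclosed openers such as `{{Plot|...` are deliberately left in place and only flagged, since there is no reliable way to tell where they should end.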


### Some summaries are cast lists:

Cast  *Violent J&nbsp;– J *Shaggy 2 Dope&nbsp;– Shaggy *Krista Kalmus&nbsp;– Amy *Lindsay Ballew&nbsp;– Stacy *Kathlyne Pham&nbsp;– Tiffany *Damian Lea&nbsp;– Brad *Sabin Rich&nbsp;– Carl *Mark Jury&nbsp;– Guy In Car *Roxxi Dolt&nbsp;– Girl In Car
--&#62;  *Peter Haber as Martin Beck *Mikael Persbrandt as Gunvald Larsson *Stina Rautelin as Lena Klingström *Per Morberg as Joakim Wersén *Rebecka Hemse as Inger  *Michael Nyqvist as John Banck *Anna Ulrica Ericsson as Yvonne Jäder *Peter Hüttner as Oljelund *Lennart Hjulström as Gavling *Lasse Lindroth as Peter
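
Cast lists like these have a recognizable shape: many short `*` items, most of which pair a name with a role through `as` or a dash. A rough heuristic for flagging them is sketched below. The function name `looks_like_cast_list` and the thresholds `min_items` and `min_ratio` are assumptions that would need tuning against the real data.

```python
import html
import re

# A credit pairs a name and a role: "X as Y", "X – Y" or "X - Y".
CREDIT = re.compile(r'\s(?:as|–|-)\s')

def looks_like_cast_list(summary, min_items=3, min_ratio=0.6):
    """Heuristically decide whether a summary is mostly cast credits."""
    # Decode entities such as &nbsp; and turn non-breaking spaces into plain ones.
    text = html.unescape(summary).replace('\xa0', ' ')
    # Items are the chunks after each '*'; text before the first '*' is ignored.
    items = [chunk.strip() for chunk in text.split('*')[1:] if chunk.strip()]
    if len(items) < min_items:
        return False
    credits = sum(
        1 for item in items
        if CREDIT.search(item) and len(item.split()) <= 10
    )
    return credits / len(items) >= min_ratio

cast = 'Cast  *Violent J&nbsp;– J *Shaggy 2 Dope&nbsp;– Shaggy *Krista Kalmus&nbsp;– Amy'
plot = ('Among the suspects are: *Fred and Janice Gage, who live off the fortune '
        '*Prof. Bowen, who is paid handsomely *Mr. Phelps, the executor of the estate')
is_cast = looks_like_cast_list(cast)
is_plot_cast = looks_like_cast_list(plot)
print(is_cast, is_plot_cast)
```

The word limit per item keeps bulleted plot descriptions, which tend to be longer clauses, from being flagged along with genuine credits.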

### Some summaries are reviews:

An attempt to bring the famed "Mr. Bill" clay characters to "life" in a sitcom format, this Showtime special featured Mr. Bill , his wife  and son , as well as his next-door neighbor, Sluggo ([[Michael McManus , his wife  and daughter . Although starring actors, the "Bills" were shown to be a "miniature" family, with many of the jokes revolving around the characters' small size and the challenges they faced living in a "large" human world, as well as scenarios where Mr. Bill is subjected to the various abusive situations the original Saturday Night Live character was best known for. Although the audience was invited to "look out for more shows" at the end of the 43-minute special, no follow-up "Mr. Bill" shows were ever produced.


### Some valid summaries have special characters:

Alan Colby, heir to a vast fortune, reappears after a seven year absence, only to be murdered before he can claim his inheritance. The Lowells have been living off the Colby fortune, and now someone is trying to kill Henrietta Lowell, matriarch of the family. Among the suspects are: *Fred and Janice Gage, who live off the Lowell  fortune, which would have gone to Alan Colby, the murdered man *Prof. Bowen, who is paid handsomely by the Lowells for his valuable psychic research *Mr. Phelps, the executor of the Lowell estate *Ulrich, who had a longstanding grudge against Alan Colby *Henrietta Lowell, who wants to continue psychic research
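
Summaries like this one are worth keeping, but the inline `*` markers can confuse sentence segmentation. One option, sketched below, is to rewrite the bullets as a semicolon-separated run before parsing. The function name `flatten_inline_bullets` is hypothetical, and the sketch assumes every `*` in the text is a bullet marker, which may not hold for all summaries.

```python
import re

def flatten_inline_bullets(summary):
    """Replace inline '*' bullet markers with '; ' separators."""
    intro, *items = [part.strip() for part in summary.split('*')]
    items = [item for item in items if item]
    if not items:
        # No bullets: return the text unchanged apart from outer whitespace.
        return intro
    joined = intro + ' ' + '; '.join(items)
    return re.sub(r'\s+', ' ', joined).strip()

sample = 'Among the suspects are: *Fred Gage, who lives off the fortune *Mr. Phelps, the executor'
flattened = flatten_inline_bullets(sample)
print(flattened)
```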