# Generate annotations of movie plots using Stanford CoreNLP 4.5.5

1. Download Stanford CoreNLP 4.5.5 from [here](https://nlp.stanford.edu/software/stanford-corenlp-4.5.5.zip) (you will need Java 8 installed and added to PATH)
2. Launch a server from a terminal opened in the extracted CoreNLP folder: `java -Xmx24g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 1200000 -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment -coref.algorithm neural -parse.originalDependencies true -outputFormat text`

Here `-Xmx24g` lets the JVM heap grow up to 24 GB, `-timeout 1200000` sets a server timeout of 20 minutes (the value is in milliseconds), `-annotators tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment` selects the Stanford annotators we are interested in, `-coref.algorithm neural` selects the neural network version of the coreference algorithm, and `-parse.originalDependencies true` keeps the original dependency names, which match the ones from 2013. **Adjust the memory parameter according to your machine**.

3. Run the `Step 1` cell
4. If some files failed or the run exhausted memory, reboot and run `Step 2`. If some files still fail, run the Stanford CoreNLP pipeline directly on those specific files from the terminal (`Step 3`).
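Before running the `Step 1` cell it can help to confirm that the server from step 2 has finished loading. The snippet below is a minimal sketch, assuming the server listens on `http://localhost:9000` as in the command above and exposes the `/ready` status endpoint described in the CoreNLP server documentation. The helper name `server_is_ready` is hypothetical and not part of this notebook.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url="http://localhost:9000", timeout=5.0):
    """Return True if the server answers its /ready endpoint with HTTP 200.

    Assumption: the CoreNLP server exposes /ready; adjust the path if your
    version behaves differently.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False
```

Calling `server_is_ready()` in a notebook cell returns `False` while the server is down or still starting, so the parallel run can be delayed until it returns `True`.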

In [None]:
import pandas as pd
import os
import requests
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
import sys

sys.path.append('..')
from helpers.readers import read_dataframe

PATH_OUT = '../nlp_results/'
if not os.path.exists(PATH_OUT):
    os.makedirs(PATH_OUT)

In [None]:
plot_summaries = read_dataframe('cmu/summaries', usecols=[
    "Wikipedia movie ID", 
    "Plot Summary"
])

# Remove backslash characters from the summaries
plot_summaries['Plot Summary'] = plot_summaries['Plot Summary'].str.replace('\\\\', '', regex=True)
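
The pattern `'\\\\'` is easy to misread because it is escaped twice, once for Python and once for the regex engine. The sketch below shows the same substitution with `re.sub` on a made-up sample string (the sample is an assumption, not taken from the dataset).

```python
import re

# In Python source, '\\\\' is a two-character string (two backslashes),
# which the regex engine reads as "one literal backslash".
pattern = '\\\\'

# Hypothetical sample; the actual text is: He says \"run\" and leaves.
sample = 'He says \\"run\\" and leaves.'

# Every backslash is removed, everything else is kept.
cleaned = re.sub(pattern, '', sample)
print(cleaned)  # prints: He says "run" and leaves.
```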

### Step 1: Parallel Run (adjust workers if needed)

In [None]:
# Server properties (properties sent with a request override the server defaults)
url = "http://localhost:9000"
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment",
    "coref.algorithm": "neural",
    "parse.originalDependencies": "true",
    "outputFormat": "text",
}

def process_summary(row):
    data = row['Plot Summary']
    response = requests.post(url, params={"properties": str(properties)}, data=data.encode('utf-8'))

    if response.status_code == 200:
        output_file = os.path.join(PATH_OUT, f'nlp_movie_{row["Wikipedia movie id"]}.txt')
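        # Write as UTF-8 explicitly so the result does not depend on the platform default
        with open(output_file, 'w', encoding='utf-8') as outfile: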
            outfile.write(response.text.strip())
    else:
        print(f"Error processing movie ID {row['Wikipedia movie id']}: {response.status_code}")

# Adjust max_workers according to your machine
with ThreadPoolExecutor(max_workers=16) as executor:
    future_to_row = {executor.submit(process_summary, row): row for index, row in plot_summaries.iterrows()}
    for future in tqdm(as_completed(future_to_row), total=len(future_to_row)):
        future.result()

Some files did not go through, and the run hit out-of-memory errors at around 30k files (this can happen earlier depending on the machine's CPU and RAM, the number of workers, and the memory allowed for the server). The solution is to save the ids of the files that did not go through, reboot, and restart the annotation for the failed ids and the files that were never annotated.
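
One way to save the failed ids so that they survive a reboot is to append each one to a text file as soon as the error occurs. This is a minimal sketch under assumptions: the helper names `record_failure` and `load_failed_ids` and the file name are hypothetical, and the notebook itself only prints the failing ids.

```python
import os
import threading

# Hypothetical location for the failure log.
FAILED_IDS_PATH = "failed_ids.txt"

# Worker threads may fail at the same time, so serialize the writes.
_failed_lock = threading.Lock()


def record_failure(movie_id, path=FAILED_IDS_PATH):
    """Append one failed movie id to the log, one id per line."""
    with _failed_lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(f"{movie_id}\n")


def load_failed_ids(path=FAILED_IDS_PATH):
    """Return the set of ids recorded so far (empty if no log exists)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {int(line) for line in f if line.strip()}
```

`record_failure` could be called from the `else` branch of `process_summary`, and `load_failed_ids` read back after the reboot. Appending immediately, rather than writing once at the end, keeps the log usable even if the run is killed partway through.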

### Step 2: Run again
Find how many files are not annotated:

In [3]:
import numpy as np
import os
import re

def extract_ids_from_filenames(folder_path):
    ids = []
    pattern = r'nlp_movie_(\d+)\.txt'  # escape the dot so it only matches a literal "."

    for filename in os.listdir(folder_path):
        match = re.match(pattern, filename)
        if match:
            ids.append(int(match.group(1))) 

    return ids


folder_path = '../nlp_results/'
annoated_ids = extract_ids_from_filenames(folder_path)
len(annoated_ids) # missing 2 files

42301

Relaunch the server from the terminal: `java -Xmx24g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 1200000 -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment -coref.algorithm neural -parse.originalDependencies true -outputFormat text`

In [None]:
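import pandas as pd
import os
import requests
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
import sys

# The kernel was restarted, so re-import the project helper and redefine the output folder
sys.path.append('..')
from helpers.readers import read_dataframe

PATH_OUT = '../nlp_results/'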


plot_summaries = read_dataframe('cmu/summaries', usecols=[
    "Wikipedia movie ID", 
    "Plot Summary"
])

filtered_summaries = plot_summaries[
    (~plot_summaries['Wikipedia movie id'].isin(annoated_ids)) 
]

In [5]:
filtered_summaries

| Unnamed: 0 | Wikipedia movie id | plot_summary |
|---|---|---|
| 29218 | 16019180 | The film shows a selection of Suras from the Q... |
| 34186 | 30039 | The film begins with a prologue, the only comm... |


### Parallel Run (adjust workers if needed)

In [None]:
url = "http://localhost:9000"
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment",
    "coref.algorithm": "neural",
    "parse.originalDependencies": "true",
    "outputFormat": "text",
}

def process_summary(row):
    data = row['plot_summary']
    response = requests.post(url, params={"properties": str(properties)}, data=data.encode('utf-8'))

    if response.status_code == 200:
        output_file = os.path.join(PATH_OUT, f'nlp_movie_{row["Wikipedia movie id"]}.txt')
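        # Write as UTF-8 explicitly so the result does not depend on the platform default
        with open(output_file, 'w', encoding='utf-8') as outfile: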
            outfile.write(response.text.strip())
    else:
        print(f"Error processing movie ID {row['Wikipedia movie id']}: {response.status_code}")

# Adjust max_workers according to your machine
with ThreadPoolExecutor(max_workers=1) as executor:
    future_to_row = {executor.submit(process_summary, row): row for index, row in filtered_summaries.iterrows()}
    for future in tqdm(as_completed(future_to_row), total=len(future_to_row)):
        future.result() 

### Step 3: Manual Stanford pipeline for naughty files

Example: `java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment -coref.algorithm neural -parse.originalDependencies true -outputFormat text -file 16019180.txt`
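
The `-file` option expects the plot summary to exist as a text file on disk. The sketch below shows one possible way to write a summary to `<movie id>.txt` and to assemble the same command as an argument list. Both helper names are hypothetical, and the `16g` default simply mirrors the example above.

```python
import os


def write_summary_file(movie_id, summary, folder="."):
    """Write one plot summary to <folder>/<movie_id>.txt and return the path."""
    path = os.path.join(folder, f"{movie_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(summary)
    return path


def build_pipeline_command(file_path, heap="16g"):
    """Return the example command above as a list of arguments."""
    return [
        "java", f"-Xmx{heap}", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref,sentiment",
        "-coref.algorithm", "neural",
        "-parse.originalDependencies", "true",
        "-outputFormat", "text",
        "-file", file_path,
    ]
```

The returned list can be passed to `subprocess.run(..., cwd=<CoreNLP folder>)`, where the working directory has to be the extracted CoreNLP folder so that `-cp "*"` finds the jars, and `file_path` should be absolute or relative to that folder.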