# Harry's EDA on the script data and reference data

The goal of this notebook is to answer our open questions about the script data and the reference data before we move on to fast iteration on baseline models.

### Installation
1. Mount Google Drive so we have access to the data folder.
2. Load the combined data (a list of reference texts).
3. Load the scripts data via helper functions.


In [None]:
%pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.3


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import os
import zipfile
from io import BytesIO, StringIO
import pprint
from google.colab import drive
import operator
drive.mount('/content/drive')

Mounted at /content/drive


### The reference data (the **combined_data.xlsx**)

In [None]:
# read in combined_data.xlsx
try:
    combined_data_df = pd.read_excel('/content/drive/MyDrive/w266/final_proj_data/combined_data.xlsx')
    display(combined_data_df.sample(10))
except FileNotFoundError:
    print('combined_data.xlsx not found; please upload it to /content/drive/MyDrive/w266/final_proj_data/')

Unnamed: 0,Title,Overview
83,Son of Rambow,Will Proudfoot (Bill Milner) is looking for an...
407,My Cousin Vinny,Two New Yorkers accused of murder in rural Ala...
3791,Hellboy II: The Golden Army,An evil elf breaks an ancient pact between hum...
3353,The Human Stain,Coleman Silk is a worldly and admired professo...
3342,"I, Tonya",Competitive ice skater Tonya Harding rises amo...
4220,Sixteen Candles,A teenage girl deals with her parents forgetti...
2823,City of Joy,Hasari Pal (Om Puri) is a rural farmer who mov...
2580,Inside Out,"In 1975, Harry Morgan (Telly Savalas) and Sylv..."
1733,All About Eve,Margo Channing (Bette Davis) is one of the big...
4640,Why Him?,A dad forms a bitter rivalry with his daughter...


In [None]:
print('we have', combined_data_df['Overview'].isnull().sum(), 'records that do not have a reference overview')
print(combined_data_df.shape[0], '<-- total number of records. Are there duplicates?')


we have 564 records that do not have a reference overview
5314 <-- total number of records. Are there duplicates?


### Investigating the duplication in movie title in reference data.

We have multiple records associated with one movie title. It seems we need a waterfall join: only the movies that did not get a match in the wiki overview are joined with the 2nd data source, the remainder is joined with the 3rd, and so on.
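The waterfall join described above could be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual pipeline: the helper name `waterfall_join`, the source names `wiki` and `second`, and the toy frames are all hypothetical. It also matches on title alone and keeps the first overview per title, which is a simplification because titles are not unique in this data.

```python
import pandas as pd

def waterfall_join(base, sources, key="Title", value="Overview"):
    """Hypothetical helper: fill `value` from the first source that has a match."""
    result = base.copy()
    result[value] = pd.Series(pd.NA, index=result.index, dtype="object")
    result["source"] = pd.Series(pd.NA, index=result.index, dtype="object")
    for name, src in sources:
        # one overview per title; assumption: the first non-null overview wins
        lookup = (
            src.dropna(subset=[value])
               .drop_duplicates(subset=key)
               .set_index(key)[value]
        )
        # only rows that are still unmatched take part in this round
        missing = result[value].isna()
        result.loc[missing, value] = result.loc[missing, key].map(lookup)
        # record which source filled the rows matched in this round
        result.loc[missing & result[value].notna(), "source"] = name
    return result

# toy data, invented for illustration only
scripts = pd.DataFrame({"Title": ["A", "B", "C"]})
wiki = pd.DataFrame({"Title": ["A"], "Overview": ["wiki A"]})
second = pd.DataFrame({"Title": ["A", "B"], "Overview": ["second A", "second B"]})

joined = waterfall_join(scripts, [("wiki", wiki), ("second", second)])
print(joined)
```

Title `A` keeps its wiki overview even though the second source also has one, `B` falls through to the second source, and `C` stays unmatched.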

In [None]:
# check the number of distinct movie titles
num_distinct_titles = combined_data_df['Title'].nunique()
print(f'{num_distinct_titles} DISTINCT movie titles')

# check the number of titles that appear more than once
num_repeated_titles = len(combined_data_df['Title'].value_counts().loc[lambda x: x > 1])
print(f'{num_repeated_titles} REPEATED movie titles (appear more than once)')

# the remaining titles appear exactly once
num_single_titles = num_distinct_titles - num_repeated_titles
print(f'{num_single_titles} movie titles that appear exactly once')

2820 DISTINCT movie titles
1731 REPEATED movie titles (appear more than once)
1089 movie titles that appear exactly once


In [None]:
combined_data_df['Title'].value_counts().loc[lambda x: x > 1].sort_values(ascending=False).head(10)

The Three Musketeers       9
Beauty and the Beast       8
Alice in Wonderland        8
King Kong                  7
Carrie                     7
Anna Karenina              7
The Mummy                  7
Robin Hood                 7
Little Women               6
Dr. Jekyll and Mr. Hyde    6
Name: Title, dtype: int64

In [None]:
# take one example: Frozen
combined_data_df[combined_data_df['Title'] == 'Frozen']

Unnamed: 0,Title,Overview
19,Frozen,Young princess Anna of Arendelle dreams about ...
146,Frozen,When three skiers find themselves stranded on ...
808,Frozen,The Film tells story of Gigi and Kit who meet ...
1366,Frozen,Princess Elsa of Arendelle possesses cryokinet...
3379,Frozen,"Childhood friends Dan Walker and Joe Lynch, al..."


### The movie scripts data

In [None]:
%cd drive/MyDrive/w266/final_proj_data/

/content/drive/MyDrive/w266/final_proj_data


In [None]:
%ls

 BERT_annotations/      raw_texts.zip
 BERT_annotations.zip   subset_BERT_annotations/
 combined_data.xlsx     subset_BERT_annotations.zip
 final_data.csv         train_df_f1k.csv
 openaiapi.txt          train_df_f1k_prepared.jsonl
 raw_text_lemmas/       Wikipedia_movie_meta_data.csv
'raw_texts (1)'/        Wikipedia_Summary.csv


In [None]:
# load contents from BERT annotations
# bert_annotations_file_path = '/content/drive/MyDrive/W266_Movie_Data/subset_BERT_annotations/'
# use the full Drive path: '/BERT_annotations/' would be resolved from the filesystem root
bert_annotations_file_path = '/content/drive/MyDrive/w266/final_proj_data/BERT_annotations/'
raw_texts_file_path =        '/content/drive/MyDrive/w266/final_proj_data/raw_texts/'
try:
    all_files = os.listdir(bert_annotations_file_path)
    print(all_files)
except FileNotFoundError:
    print(f'directory not found: {bert_annotations_file_path}')


##### functions to load data

In [None]:
def get_movie_title(script_txt_file):
    '''get the movie title without the unique identifier and _anno.txt suffix'''
    movie_title = script_txt_file.split('_')[0]
    return movie_title

def get_script_length(file_path, script_txt_file):
    '''calculate the number of lines in a BERT annotated script'''
    with open(str(file_path) + str(script_txt_file), 'r') as test_file:
        script_length = len(test_file.readlines())
    return script_length

def read_script(file_path, script_txt_file):
    '''read in the BERT annotated script'''
    # use a context manager so the file handle is closed after reading
    with open(str(file_path) + str(script_txt_file), 'r') as script_file:
        return script_file.read()

def count_script_elements(file_path, script_txt_file):
    '''count script elements such as dialog, text, speaker_heading, scene_heading'''
    script_element_dict = {}
    with open(str(file_path) + str(script_txt_file), 'r') as script_file:
        for line in script_file:
            # the script element label is everything before the first colon
            script_element = line.split(':')[0]
            script_element_dict[script_element] = script_element_dict.get(script_element, 0) + 1
    return script_element_dict

def identify_characters(file_path, script_txt_file):
    '''count number of characters and their speaking parts'''
    speaker_heading_dict = {}
    with open(str(file_path) + str(script_txt_file), 'r') as script_file:
        for line in script_file:
            # split once: parts[0] is the script element, parts[1] the text after it
            parts = line.split(':')

            # if the script element is 'speaker_heading' then that is a character
            if 'speaker_heading' in parts[0] and len(parts) > 1:
                # some speaker_headings do not contain character names
                if re.search('[a-zA-Z]', parts[1]) is not None:

                    # remove leading and trailing whitespace, including the trailing newline
                    character = parts[1].strip()

                    # remove text that is not uppercase
                    character = ''.join(ch for ch in character if not ch.islower())

                    # remove (O.S.) off screen from character name
                    character = character.replace(' (O.S.)', '')

                    # remove trailing punctuation
                    character = character.rstrip('.').rstrip('?').rstrip('!')

                    ##### NEED TO ADD LOGIC TO DEAL WITH CONTINUOUS, CONTINUED, and CONT'D #####

                    # tally one speaking part for this character
                    speaker_heading_dict[character] = speaker_heading_dict.get(character, 0) + 1

    # remove characters that only have one speaking line
    character_dict = {k:v for k, v in speaker_heading_dict.items() if v > 1}
    print(f'character_dict length before removing single speaking lines: {len(speaker_heading_dict)}')
    print(f'character_dict length after removing single speaking lines: {len(character_dict)}')

    return character_dict
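The TODO in `identify_characters` about CONTINUED and CONT'D could be handled with a small normalization step applied to the text after the colon, before the tally. The sketch below is only one possible approach: `normalize_speaker` is a hypothetical helper, and the list of extensions it strips is an assumption that should be checked against the headings that actually occur in the annotated scripts.

```python
import re

# Parenthetical extensions to strip, e.g. (CONT'D), (CONTINUED), (O.S.), (V.O.).
# Assumption: this list is illustrative, not derived from the real data.
_EXTENSION_PATTERN = re.compile(
    r"\s*\((?:CONT['’]?D?\.?|CONTINUED|CONTINUOUS|O\.?S\.?|V\.?O\.?)\)",
    flags=re.IGNORECASE,
)

def normalize_speaker(raw_heading):
    """Hypothetical helper: reduce a speaker heading to a bare character name."""
    # drop every parenthetical extension, wherever it appears
    name = _EXTENSION_PATTERN.sub("", raw_heading)
    # also drop a bare trailing CONT'D written without parentheses
    name = re.sub(r"\s+CONT['’]?D\.?$", "", name.strip(), flags=re.IGNORECASE)
    # finally remove trailing punctuation, as the original function does
    return name.strip().rstrip(".?!")

examples = ["JOHN (CONT'D)", "MARY (O.S.)", "JOHN (V.O.) (CONT'D)", "  BOB CONT'D ", "DR. SMITH"]
for heading in examples:
    print(repr(heading), "->", repr(normalize_speaker(heading)))
```

With a step like this, `JOHN` and `JOHN (CONT'D)` would be counted as the same character instead of two separate dictionary keys.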

In [None]:
# get script length of the first three BERT annotated scripts
# all_files lists the BERT annotation directory, so read from that path
for script_file in all_files[:3]:
    print(get_movie_title(script_file), get_script_length(bert_annotations_file_path, script_file))

In [None]:
read_script(bert_annotations_file_path, all_files[0])



In [None]:
from collections import Counter

def CountFrequency(my_list):
    # Counter tallies every item in one pass; list.count() would rescan the list for each item
    return dict(Counter(my_list))

movie_scripts_count_dict = CountFrequency(all_files)
sorted_d = dict(sorted(movie_scripts_count_dict.items(), key=operator.itemgetter(1), reverse=True))
print('Dictionary in descending order by value : ', sorted_d)
# only claim there are no duplicates if every file name occurs exactly once
if all(count == 1 for count in movie_scripts_count_dict.values()):
    print('no duplications!')




In [None]:
print(len(all_files))
print(len(title_and_length))

# the two counts differ by 11: which 11 movies account for the gap?

1987
1998
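One way to investigate the mismatch between the two counts is to check whether several files map to the same title, because `title_and_length` is keyed by title and later files overwrite earlier ones. The sketch below assumes file names follow the `<title>_<id>_anno.txt` pattern implied by `get_movie_title`; the helper `find_title_collisions` and the example file names are hypothetical.

```python
from collections import defaultdict

def find_title_collisions(file_names):
    """Hypothetical helper: return titles that are shared by more than one file."""
    files_by_title = defaultdict(list)
    for file_name in file_names:
        # same rule as get_movie_title: the title is the text before the first underscore
        title = file_name.split('_')[0]
        files_by_title[title].append(file_name)
    # keep only titles with at least two files
    return {title: files for title, files in files_by_title.items() if len(files) > 1}

# invented file names, for illustration only
example_files = ["Frozen_0123_anno.txt", "Frozen_0456_anno.txt", "Network_0789_anno.txt"]
collisions = find_title_collisions(example_files)
print(collisions)
```

Running it on `all_files` would show how many dictionary keys are lost to collisions. Comparing `set(title_and_length)` with the set of titles derived from `all_files` would show whether the dictionary was built from a different directory listing.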


In [None]:
# title_and_length = {}

# THIS IS TIME CONSUMING! Think before uncommenting this section.
######
# for i in range(len(all_files)):
#     title_and_length[get_movie_title(all_files[i])] = get_script_length(bert_annotations_file_path, all_files[i])
######

first10pairs = {k: title_and_length[k] for k in list(title_and_length)[:10]}
pprint.pprint(first10pairs)

{'Molly s Game': 8508,
 'Moonstruck': 3919,
 'My Best Friend s Wedding': 5285,
 'Network': 7275,
 'Night of the Living Dead': 3786,
 'Nine': 3864,
 'Noah': 4575,
 'Notes on a Scandal': 3625,
 'Oldboy': 3686,
 'Olympus Has Fallen': 3678}


In [None]:
%pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
