# Analyse subtitles of Harry Potter 4

In this notebook we'll go through the Harry Potter 4 subtitles to see whether we can find entities in the text and map them to characters.

In [1]:
%load_ext autoreload
%autoreload 2

## Import packages

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

from bechdelai.nlp.process_srt import load_srt
from bechdelai.data.tmdb import get_movie_cast_from_id
from bechdelai.nlp.analyse_srt import extract_person_references_in_srt
from bechdelai.nlp.entities import match_entities_with_cast

In [30]:
pd.set_option('display.max_rows', 100)

## Load data

SRT file

In [31]:
fpath = "../../../data/srt/harry_potter_4.srt"
srt_list = load_srt(fpath)

In [32]:
for srt in srt_list[:2]:
    print(srt)

1
00:01:46,125 --> 00:01:48,543
Bloody kids.

2
00:02:35,007 --> 00:02:38,676
How fastidious you've become,
Wormtail.
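
The extracted results further down carry `start_sec` and `end_sec` columns, so the SRT time range has to be turned into seconds at some point. The sketch below shows one way this could be done with the standard library only. The function names `srt_timestamp_to_seconds` and `parse_time_range` are hypothetical and are not part of `bechdelai`; how `load_srt` handles timestamps internally is not shown in this notebook.

```python
import re

# Matches the SRT timestamp layout "HH:MM:SS,mmm" (some files use "." instead of ",").
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_timestamp_to_seconds(stamp: str) -> float:
    """Convert one SRT timestamp to seconds (hypothetical helper)."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_time_range(line: str) -> tuple[float, float]:
    """Split a 'start --> end' line into (start_seconds, end_seconds)."""
    start, end = line.split("-->")
    return srt_timestamp_to_seconds(start), srt_timestamp_to_seconds(end)


start, end = parse_time_range("00:01:46,125 --> 00:01:48,543")
print(int(start), int(end))
```

Truncating with `int()` gives whole seconds, which is the granularity the `start_sec` and `end_sec` columns appear to use; whether the library truncates or rounds is an assumption here.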



TMDB metadata

In [33]:
tmdb_id = "674"
cast = get_movie_cast_from_id(tmdb_id)["cast"]

In [34]:
cast_df = pd.DataFrame(cast)
cast_df.head()

Unnamed: 0,adult,gender,id,known_for_department,name,original_name,popularity,profile_path,cast_id,character,credit_id,order
0,False,2,10980,Acting,Daniel Radcliffe,Daniel Radcliffe,60.579,/iPg0J9UzAlPj1fLEJNllpW9IhGe.jpg,1,Harry Potter,52fe4268c3a36847f801c21d,0
1,False,2,10989,Acting,Rupert Grint,Rupert Grint,34.387,/q2KZZ0ltTEl7Sf8volNFV1JDEP4.jpg,2,Ron Weasley,52fe4268c3a36847f801c221,1
2,False,1,10990,Acting,Emma Watson,Emma Watson,64.354,/tvPPRGzAzdQFhlKzLbMO1EpuTJI.jpg,3,Hermione Granger,52fe4268c3a36847f801c225,2
3,False,2,1923,Acting,Robbie Coltrane,Robbie Coltrane,21.872,/jOHs3xvlwRiiG2CLtso5zzmGCXg.jpg,7,Rubeus Hagrid,52fe4268c3a36847f801c235,3
4,False,2,5469,Acting,Ralph Fiennes,Ralph Fiennes,66.925,/tJr9GcmGNHhLVVEH3i7QYbj6hBi.jpg,4,Lord Voldemort,52fe4268c3a36847f801c229,4
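
The `gender` column above holds TMDB's integer codes (0 for not specified, 1 for female, 2 for male, 3 for non-binary). A minimal sketch of turning those codes into the kind of labels used in the `gender` column of the entity results could look like this. The label strings and the helper `character_genders` are assumptions for illustration, and the rows are copied by hand from the output above rather than fetched from TMDB.

```python
# Assumed label strings; only "man" and "unknown" actually appear in this notebook's output.
TMDB_GENDER_LABELS = {0: "unknown", 1: "woman", 2: "man", 3: "non-binary"}

# A few rows copied from the cast table above (only the fields needed here).
cast_rows = [
    {"name": "Daniel Radcliffe", "character": "Harry Potter", "gender": 2},
    {"name": "Rupert Grint", "character": "Ron Weasley", "gender": 2},
    {"name": "Emma Watson", "character": "Hermione Granger", "gender": 1},
]


def character_genders(rows):
    """Map each character name to a readable gender label (hypothetical helper)."""
    return {
        row["character"]: TMDB_GENDER_LABELS.get(row["gender"], "unknown")
        for row in rows
    }


print(character_genders(cast_rows))
```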


## Get entities and nouns

In [39]:
%%time
results = extract_person_references_in_srt(srt_list, 100)

Wall time: 20.9 s


In [40]:
results.sample(20)

Unnamed: 0,srt_id,text,start_sec,end_sec,ent,start_idx,end_idx,ent_type,gender
44,31,Arthur!,270,271,Arthur !,0,2,PER,unknown
81,67,"- Excellent, excellent.\n- Ginny, look!",420,422,Ginny,6,7,PER,unknown
16,8,It cannot be done without him.\nAnd it will be...,178,182,him,6,7,PRON,man
79,64,"See you later, Cedric.",401,402,you,1,2,PRON,unknown
73,61,"Parting of the ways, I think, old chap.",395,397,I,5,6,PRON,unknown
89,76,"Well, put it this way:",460,461,it,3,4,PRON,unknown
24,13,Nagini tells me\nthe old Muggle caretaker...,206,209,Nagini,0,1,PER,unknown
61,42,"- Shall we?\n- Oh, yeah.",307,309,we,2,3,PRON,unknown
28,15,"Step aside, Wormtail,\nso I can give our guest...",213,219,Wormtail,3,4,MISC,unknown
93,79,Father and I are in\nthe minister's box...,466,469,I,2,3,PRON,unknown
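
In the sample above, the pronoun `him` is labelled `man` while `you`, `I`, `it` and `we` stay `unknown`. One simple way to get that behaviour is a lookup table of gendered pronouns. The sketch below is an assumption about how such a rule might look, not the code behind `extract_person_references_in_srt`; the table `PRONOUN_GENDER` and the function `pronoun_gender` are hypothetical names.

```python
# Hypothetical lookup: only third-person singular pronouns carry a gender.
PRONOUN_GENDER = {
    "he": "man", "him": "man", "his": "man", "himself": "man",
    "she": "woman", "her": "woman", "hers": "woman", "herself": "woman",
}


def pronoun_gender(pronoun: str) -> str:
    """Return 'man', 'woman' or 'unknown' for a pronoun, ignoring case."""
    return PRONOUN_GENDER.get(pronoun.strip().lower(), "unknown")


for word in ["him", "you", "I", "She"]:
    print(word, "->", pronoun_gender(word))
```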


In [41]:
results_with_cast = match_entities_with_cast(results, cast_df)

In [42]:
results_with_cast.sample(20)

Unnamed: 0,srt_id,text,start_sec,end_sec,ent,start_idx,end_idx,ent_type,gender,character_found
88,75,"Blimey, Dad. How far up are we?",456,459,we,8,9,PRON,unknown,
125,99,...to welcome\neach and every one of you...,568,572,you,8,9,PRON,unknown,
21,10,"- I will not disappoint you, my Lord.\n- Good.",185,189,you,5,6,PRON,unknown,
117,93,Krum! Krum! Krum!,540,543,Krum,0,1,PROPN,man,Viktor Krum
89,76,"Well, put it this way:",460,461,it,3,4,PRON,unknown,
69,49,"- Ready! After three. One, two...\n- Harry!",324,328,Harry !,11,13,PER,man,Harry Potter
67,48,What's a Portkey?,323,324,What,0,1,PRON,unknown,
86,74,Get your Quidditch World Cup\nprograms here!,453,456,Quidditch World Cup,2,5,MISC,unknown,
11,6,...perhaps if we were to do it\nwithout the boy.,172,175,it,7,8,PRON,unknown,
119,93,Krum! Krum! Krum!,540,543,Krum,4,5,PROPN,man,Viktor Krum
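
The `character_found` column shows `Krum` resolved to `Viktor Krum` and `Harry !` resolved to `Harry Potter`, while pronouns and `Quidditch World Cup` stay empty. A rough sketch of one matching rule that would produce these results is given below: strip punctuation, then accept a character only when every token of the entity appears in exactly one character name. This is an illustrative assumption and not the implementation of `match_entities_with_cast`; `normalise` and `match_entity_to_character` are hypothetical helpers.

```python
import re


def normalise(text: str) -> list[str]:
    """Lower-case the text and keep only alphabetic tokens (drops '!' and similar)."""
    return re.findall(r"[a-z]+", text.lower())


def match_entity_to_character(entity: str, characters: list[str]):
    """Return the single character whose name contains all entity tokens, else None."""
    tokens = set(normalise(entity))
    if not tokens:
        return None
    candidates = [name for name in characters if tokens <= set(normalise(name))]
    # Ambiguous or missing matches are left unresolved rather than guessed.
    return candidates[0] if len(candidates) == 1 else None


characters = ["Harry Potter", "Ron Weasley", "Hermione Granger", "Viktor Krum"]
for entity in ["Harry !", "Krum", "Quidditch World Cup", "we"]:
    print(repr(entity), "->", match_entity_to_character(entity, characters))
```

Returning `None` for ambiguous entities is a deliberate choice in this sketch: a shared family name would otherwise be attributed to whichever character happens to come first in the cast list.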


TODO:
- Match with character name OK
- Handle idx problem OK
- Remove other notebooks OK
- Remove subtitles.py OK
- Remove unused functions
- Get path syn dict from path
- Documentation
- Tests?