---
title: "Final Project: A text mining analysis of the Harry Potter films"
subtitle: "Text Mining - UC3M"
author: "Carlos San Juan, Eric Hausken-Brates"
date: "2024-03-26"
format:
  html:
    toc: true
    toc-depth: 2
    number-sections: true
    number-depth: 2
editor_options:
  chunk_output_type: inline
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Libraries
The libraries we are going to use for the work are the following:
```{r}
library(tidyverse)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)
library(ggsci)
library(ggthemes)
library(lubridate)
library(viridis)
library(ggrepel)
library(reshape)
library(gridExtra)
library(tm)
library(SnowballC)
library(wordcloud)
library(NLP)
library(widyr)
library(wordcloud2)
library(tidytext)
library(janeaustenr)
library(htmlwidgets)
library(topicmodels)
library(stringr)
```
## Introduction
In this paper we apply different text mining techniques to the scripts of the Harry Potter films to reveal patterns and trends in the narrative, the characters, and the emotions they may have experienced. Using natural language processing, we look for insights into plot evolution and emotional development throughout the saga, offering a fresh view of one of the most iconic universes in literature and cinema.
Before we continue, fair warning: this work is written by big fans of the saga, so it is done with great affection, and we apologise in advance if our enthusiasm shows. There may also be some spoilers, though we promise they are small and not plot-related. For that reason, we recommend that readers watch the movies first, or better yet, read the books. You will thank us when you finish them.
## Databases
The database we will use in this project has been compiled from GitHub and is directly accessible through the following link: [GitHub - HP Dataset](https://github.com/Kornflex28/hp-dataset/tree/main/datasets).
```{r}
hp1 <- read_csv("hp1.csv")
hp2 <- read_csv("hp2.csv")
hp3 <- read_csv("hp3.csv")
hp4 <- read_csv("hp4.csv")
hp5 <- read_csv("hp5.csv")
hp6 <- read_csv("hp6.csv")
hp7 <- read_csv("hp7.csv")
hp8 <- read_csv("hp8.csv")
# Fix misspelling of movie #4
hp4 <- hp4 |>
mutate(movie = str_replace_all(string = movie, pattern = "Gobelt", replacement = "Goblet"))
df <- rbind(hp1,hp2,hp3,hp4,hp5,hp6,hp7,hp8)
```
### Data wrangling
```{r movies-order}
movie_order <- tribble(~num, ~movie, ~film.name,
1, "Harry Potter and the Philosopher's Stone", "1-Philosopher's Stone",
2, "Harry Potter and the Chamber of Secrets", "2-Chamber of Secrets",
3, "Harry Potter and the Prisoner of Azkaban", "3-Prisoner of Azkaban",
4, "Harry Potter and the Goblet of Fire", "4-Goblet of Fire",
5, "Harry Potter and the Order of the Phoenix", "5-Order of the Phoenix",
6, "Harry Potter and the Half-Blood Prince", "6-Half-Blood Prince",
7, "Harry Potter and the Deathly Hallows Part 1", "7-Deathly Hallows Part 1",
8, "Harry Potter and the Deathly Hallows Part 2", "8-Deathly Hallows Part 2"
)
df <- df |>
left_join(movie_order, by = "movie")
```
## Initial Hypothesis
Some of the questions we are going to address in the paper are:
- Who are the characters that have made the greatest impact on popular culture?
- Is the number of words related to the length of the films?
- What are the most distinctive words or characters in each film?
- How do the most frequently used words differ from the most common bigrams and trigrams?
- Which films, scenes, and characters have the most positive and negative sentiment throughout the series?
- Does the grouping size of dialogue chunks affect our results in sentiment analysis?
# TF-IDF
### Most sentences in movies
One of the most useful tools in text mining is counting the words in each text to determine how relevant a given word or topic is within a corpus. This approach lets us identify key terms, frequencies, and patterns that emerge in the dialogue, revealing the predominant themes and the relative importance of different concepts throughout the narrative.
Counting is also useful for identifying the main characters of different novels or, in this case, films: the characters with the most scripted lines should be the main ones.
That is the first step in our work: identifying the main characters of the different films. Luckily, as big fans of the saga we can check the results quite easily, but the same approach would work with any script.
```{r}
# Save df as a dataframe with variables 'character' and 'movie'
Char_Dial <- data.frame(table(df$character, df$movie))
# Sum lines for each character throughout all movies
Char_Dial_Sum <- Char_Dial %>%
group_by(Var1) %>%
summarise(Total_Freq = sum(Freq)) %>%
ungroup()
# Select top 10 characters with the most spoken lines
Char_Dial_Top10 <- Char_Dial_Sum %>%
arrange(desc(Total_Freq)) %>%
slice_max(Total_Freq, n = 10)
# Create a graph for the top 10 characters with the most lines
ggplot(Char_Dial_Top10, aes(x = reorder(Var1, Total_Freq), y = Total_Freq)) +
geom_bar(stat = "identity", width = 0.62, fill = "steelblue") +
coord_flip() +
labs(title = "Characters with the most sentences",
subtitle = "Top 10 across all parts of a movie series",
x = "Character", y = "Number of sentences") +
theme_minimal() +
theme(legend.position = "none") # Remove legend because it is not relevant
```
This graph shows the 10 characters with the most lines across the Harry Potter saga.
As you might expect, the character with the most lines is the one in the title of the films, known as 'The Chosen One', or simply Harry Potter for those who aren't such big fans. Harry is followed by his best friends, `Ron Weasley` and `Hermione Granger`, completing the `Golden Trio`.
However, as fans we are struck by the appearance of one particular character, `Horace Slughorn`. He first appears in the sixth installment, has great prominence only in that film, and fades into the background in the last two. Meanwhile, some memorable characters are missing from this list: `Draco Malfoy`, despite being one of Harry Potter's main antagonists, is not in the top 10. This hints at the outsized impact Draco had on popular culture: everyone who has seen the films or read the books remembers him, yet according to these results he hardly appears on screen.
Next, we divide this analysis by film to observe how the lines are distributed throughout the saga. To do this, we store the character names in the following vector, taking the liberty of swapping `Horace Slughorn` for `Draco Malfoy`.
```{r fig.width=11, fig.height=8}
top_characters <- c("Harry Potter", "Ron Weasley", "Hermione Granger", "Albus Dumbledore", "Rubeus Hagrid", "Severus Snape", "Minerva McGonagall", "Voldemort","Neville Longbottom", "Draco Malfoy")
Char_Dial <- data.frame(table(df$character, df$film.name))
Char_Dial %>%
arrange(desc(Freq)) %>%
filter(Var1 %in% top_characters) %>%
ggplot(aes(reorder(Var1, +Freq), Freq, fill = Var2)) +
geom_bar(stat = "identity", width = 0.62)+
scale_fill_brewer(type = "div") +
coord_flip()+
guides(fill = guide_legend(title.position = "top", reverse = T))+
labs(title = "Characters with the most sentences",
subtitle = "Top 10, by movie", fill = "Movie",
x = "Character", y = "Number of sentences")+
theme_minimal()+
theme(legend.title.align = 0.5, legend.position = "right", legend.direction = "vertical")
```
### Most used Spells
In this section, we are going to talk about magic, more specifically spells. In the world of Harry Potter, in order to do magic, you have to cast a spell in a certain way. That is why we are going to see which are the most used spells in the saga.
To do this, we first store in a vector all the spells mentioned in the films. The list of spells is taken from [here](https://screenrant.com/harry-potter-spells-list-from-movies-and-books/).
```{r}
spells <- c('Accio', 'Alohomora', 'Avada Kedavra', 'Crucio', 'Expecto Patronum', 'Expelliarmus', 'Imperio',
'Lumos', 'Obliviate', 'Petrificus Totalus', 'Reparo', 'Riddikulus', 'Sectumsempra', 'Wingardium Leviosa')
# We add a column to identify the spell mentioned in each dialogue.
df$spell <- NA # Initialize the variable with `NA`
# Loop through the `spells` vector. For each spell, check if it is written in each line of the script. If it is present in that line, add that spell to the variable `spell` in `df` for that row.
for(spell in spells) {
df$spell <- ifelse(grepl(spell, df$dialog, ignore.case = TRUE), spell, df$spell)
}
# We calculate the count of each spell.
spell_counts <- df %>%
  filter(!is.na(spell)) %>%   # Exclude lines without spells
  count(spell, sort = TRUE)   # Count occurrences of each spell and sort in order
ggplot(spell_counts, aes(x = reorder(spell, n), y = n, fill = spell)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), position = position_dodge(width = 0.9), hjust = -0.1, size = 3.5) +
coord_flip() +
scale_fill_viridis(discrete = TRUE, option = "D") +
labs(title = 'Spells most commonly used',
subtitle = "Frequency of mentioning spells in dialogues",
x = 'Spells',
y = 'Frequency') +
theme_minimal() +
theme(legend.title = element_blank(),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
legend.position = "none")
```
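The loop above checks each spell pattern in turn, so a line mentioning several spells keeps the last spell in the vector's order. An equivalent vectorized approach builds one alternation regex and extracts the leftmost match instead. A minimal base-R sketch on invented dialogue lines (`toy_lines` and the shortened `toy_spells` vector are made up for illustration):

```{r}
# Invented dialogue lines and a shortened spell list, for illustration only
toy_lines  <- c("Expecto Patronum!", "He shouted Expelliarmus at once", "No spell here")
toy_spells <- c("Accio", "Expecto Patronum", "Expelliarmus")

# One alternation pattern matching any spell, case-insensitively
pattern <- paste(toy_spells, collapse = "|")
m <- regexpr(pattern, toy_lines, ignore.case = TRUE)

# Extract the matched spell, or keep NA when no spell occurs in the line
found <- rep(NA_character_, length(toy_lines))
found[m != -1] <- regmatches(toy_lines, m)
found
```

Note the behavioural difference: with several spells in one line, this keeps the leftmost match, while the loop keeps the last one in `spells` order.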
The most used spells are `Expelliarmus` and `Expecto Patronum`, each with a total of 12 mentions across all 8 movies. Let's see how they are distributed across the series.
```{r}
Spels_df <- data.frame(table(df$character, df$film.name, df$spell))
Spels_df %>%
arrange(desc(Freq)) %>%
filter(Var3 %in% spells) %>%
ggplot(aes(reorder(Var3, +Freq), Freq, fill = Var2)) +
geom_bar(stat = "identity", width = 0.62) +
scale_fill_brewer(palette = "Set2") +
coord_flip() +
guides(fill = guide_legend(title.position = "top", title = "Movie Part")) +
labs(title = "Spells most commonly used",
subtitle = "Frequency of mentioning spells by movie",
x = "Spells",
y = "Number of appearances") +
theme_minimal() +
theme(legend.title.align = 0.5,
legend.position = "right",
legend.direction = "vertical",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.text = element_text(size = 10))
```
However, this view can be misleading. For example, the spell `Riddikulus` might seem to be of great importance to the plot, but in reality it is not (again, being fans helps us spot this). It appears because numerous characters say it in a single scene, a magic class, and it is never used again, which is why it only shows up in 'Prisoner of Azkaban'.
Consequently, we repeat the analysis, but this time counting, for each spell, the number of films it appears in:
```{r}
Spels_df |>
filter(Freq > 0) |>
select(Var2, Var3) |>
distinct() |>
count( Var3) |>
ggplot(aes(y = reorder(Var3, +n), x = n, fill = Var3 == "Riddikulus") ) +
geom_bar(stat = "identity", width = 0.62, color = "steelblue") +
scale_fill_manual(values = c( "steelblue", "peru")) +
labs(title = "Riddikulus is in only one movie",
subtitle = "Number of movies mentioning this spell",
y = "Spells",
x = "Number of movies appearing in") +
theme_minimal() +
theme(legend.title.align = 0.5,
legend.position = "none",
legend.direction = "vertical",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.text = element_text(size = 10),
panel.grid.minor = element_blank()
)
```
Having seen how the spells are distributed by film, another way to visualize it is to divide it up according to who the characters are who conjure them. Let's get down to it:
```{r, fig.width=15, fig.asp=0.65, out.width='100%', preview = TRUE}
spell_character_counts <- df %>%
filter(spell %in% spells) %>%
count(spell, character) %>%
arrange(spell, desc(n))
# Create the graph
ggplot(spell_character_counts, aes(y = reorder(character, n), x = n, fill = character, label = n)) +
geom_bar(stat = "identity") +
geom_text(hjust = -1) +
facet_wrap(~ spell, scales = "free_y") +
scale_fill_viridis_d(begin = 0.2, end = 0.8, direction = -1, option = "C") +
labs(title = "Character Spell Usage",
y = NULL,
x = "Frequency of Spell Usage") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(face = "bold", hjust = 0, size = 12),
legend.position = "none",
panel.border = element_rect(fill = NA, color = "gray20")
) +
scale_x_continuous(breaks = c(0,2,4,6,8,10), limits = c(0,10))
```
Here we can see how the different spells are distributed among the characters who cast them. Most are dominated by either `Harry Potter` or `Hermione Granger`, while the 'Unforgivable Curses' (for non-fans: spells that are forbidden in the Harry Potter world, such as `Avada Kedavra`) are dominated by `Voldemort` and other dark wizards of the saga. This is a good indicator of the relevance of these characters, and it also reveals who the other influential characters in the plot are.
At this point, we have already found out which spells are used the most and which characters speak the most. The next step is to observe which words are repeated the most and calculate their frequency.
### Most used Words
The first step is to check which film has the most dialogue, on the assumption that the films with the most dialogue are also the ones that spend the most minutes on the big screen.
```{r}
total_dialogs <- df %>%
  group_by(film.name) %>%
  summarize(total_dialogs = n()) # Count the number of rows per group, i.e. the number of dialogues
total_dialogs
```
Let's represent it in a graph.
```{r}
ggplot(total_dialogs, aes(y = reorder(film.name, -total_dialogs), x = total_dialogs, fill = film.name)) +
geom_bar(stat = "identity") +
labs(title = "Total Dialogues by Harry Potter Movie",
y = NULL,
x = "Total Dialogues") +
theme_minimal() +
theme(legend.position = "none")
```
Looking at the results and comparing them with the length of the films, we can see that they do not match: the film with the most dialogue is `Harry Potter and the Order of the Phoenix`, which is the second shortest film of the saga. The running times are listed below; more details can be found at the following [link](https://www.pottertalk.net/harry-potter-movie-lengths/).
- Philosopher’s Stone = 152 minutes = 2 hours 32 minutes
- Chamber of Secrets = 161 minutes = 2 hours 41 minutes
- Prisoner of Azkaban = 142 minutes = 2 hours 22 minutes
- Goblet of Fire = 157 minutes = 2 hours 37 minutes
- Order of the Phoenix = 139 minutes = 2 hours 18 minutes
- Half Blood Prince = 153 minutes = 2 hours 33 minutes
- Deathly Hallows pt 1 = 146 minutes = 2 hours 26 minutes
- Deathly Hallows pt 2 = 130 minutes = 2 hours 10 minutes
There does not seem to be much of a relationship between the amount of dialogue and the length of the film. Let's repeat the exercise, counting words instead of dialogue lines.
```{r}
words <- df %>%
  # we tokenize as usual (as an exception, we won't filter stopwords here)
  unnest_tokens(word, dialog) %>%
  count(movie, word, sort = TRUE)
movie_words <- words %>%
  group_by(movie) %>%
  summarize(total_words = sum(n))
movie_words
ggplot(movie_words, aes(x = reorder(movie, -total_words), y = total_words, fill = movie)) +
  geom_bar(stat = "identity") +
  coord_flip() + # Horizontal bars
  labs(title = "Total Words by Harry Potter Movie",
       x = "Movie",
       y = "Total Words") +
  theme_minimal() +
  theme(legend.position = "none")
```
Again, although the number of words tracks the length of the film somewhat more closely, it still does not coincide, so we can discard the hypothesis that more dialogue means a longer film.
Having checked this, the next step is to compute the term frequency of each word, to see which words are most representative of each film. Term frequency is the number of times a word occurs in a film divided by the total number of words in that film.
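As a quick arithmetic check of the definition, with invented numbers (a word occurring 12 times in a 24,000-word film):

```{r}
word_count <- 12       # invented: occurrences of a word in one film
film_total <- 24000    # invented: total words in that film
term_freq  <- word_count / film_total
term_freq              # 12 / 24000 = 0.0005, i.e. 0.05% of the film's words
```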
```{r}
movie_words <- words %>%
  left_join(movie_words, by = "movie") |>
  # add a column for the term frequency of each word within its movie
  mutate(term_frequency = n/total_words)
movie_words
ggplot(movie_words, aes(x = term_frequency)) +
  geom_histogram(binwidth = 0.0001, fill = "#0073C2FF", color = "black") +
  xlim(NA, 0.009) + # Limit the x-axis to focus on the low-frequency region
  scale_y_continuous(breaks = seq(0, 7000, by = 500)) + # Adjust the y-axis breaks
  labs(title = "Distribution of Term Frequency Across All Movies",
       x = "Term Frequency (as a percentage of total words)",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
This histogram shows the distribution of term frequencies in dialogue across all the Harry Potter films. Each bar represents the number of terms (y-axis) that occur with a certain frequency (x-axis) within the total set of words in the films. A large number of words have very low frequency, suggesting that most words are rarely repeated.
Now we are going to observe a frequency distribution but per film, looking at which film is richer in vocabulary.
```{r}
ggplot(movie_words, aes(x = term_frequency, fill = movie)) +
  geom_histogram(bins = 30, position = "identity") + # overlaid histograms
  scale_x_continuous(limits = c(NA, 0.0009), labels = scales::percent_format(accuracy = 0.01)) +
  scale_fill_manual(values = c("Harry Potter and the Chamber of Secrets" = "#1f77b4",
                               "Harry Potter and the Deathly Hallows Part 1" = "#ff7f0e",
                               "Harry Potter and the Deathly Hallows Part 2" = "#2ca02c",
                               "Harry Potter and the Goblet of Fire" = "#d62728",
                               "Harry Potter and the Half-Blood Prince" = "#9467bd",
                               "Harry Potter and the Order of the Phoenix" = "#8c564b",
                               "Harry Potter and the Philosopher's Stone" = "#e377c2",
                               "Harry Potter and the Prisoner of Azkaban" = "#7f7f7f")) + # Add more colors as needed
  labs(title = "Distribution of Term Frequency Across Movies",
       x = "Term Frequency (as a percentage of total words)",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "right", legend.title = element_blank())
```
We can observe that the film richest in vocabulary is `Harry Potter and the Half-Blood Prince`. However, with all films overlaid, the individual distributions are hard to distinguish.
```{r, fig.width=15, fig.asp=0.65, out.width='100%', preview = TRUE}
# Visualization
ggplot(movie_words, aes(x = term_frequency, fill = movie)) +
  geom_histogram(binwidth = 0.0001, color = "white") +
  scale_x_continuous(limits = c(NA, 0.003), labels = scales::percent_format(accuracy = 0.0001)) +
  facet_wrap(~ movie, ncol = 2, scales = "free_y") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Term Frequency Distribution by Movie",
       x = "Term Frequency",
       y = "Count") +
  theme_light() + # Apply a light theme
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_rect(fill = "lightblue"),
        strip.text.x = element_text(size = 8, color = "navy"))
```
In general, we can observe that all the films present a similar distribution.
### TF-IDF
The next step in our analysis is to use the term frequency-inverse document frequency (tf-idf) technique to highlight words that are distinctive in each film in the Harry Potter series. Tf-idf is useful because it helps us to identify not only the most frequent words, but also those that are particularly significant in a given document in relation to a collection of documents. This allows us to look beyond mere frequency and consider the relevance of a term, giving us a more nuanced view of how language is used in the different films.
By calculating the tf-idf of each term in the context of each film, we can filter and visualise the 20 most characteristic words per film, giving us a list of distinctive terms that define or are emblematic of each film.
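Before running it on the scripts, the quantity `bind_tf_idf` computes can be checked by hand on a toy two-document corpus (the counts below are invented for illustration); tidytext uses the natural logarithm in the idf term, idf = log(number of documents / number of documents containing the term):

```{r}
# Invented toy corpus: doc A contains "wand" x3 and "the" x7 (10 words);
# doc B contains "the" x5. Two documents in total.
tf_wand_A  <- 3 / 10        # term frequency of "wand" in doc A
idf_wand   <- log(2 / 1)    # "wand" appears in 1 of 2 documents
tfidf_wand <- tf_wand_A * idf_wand   # about 0.208: distinctive of doc A

tf_the_A  <- 7 / 10
idf_the   <- log(2 / 2)     # "the" appears in every document, so idf = 0
tfidf_the <- tf_the_A * idf_the      # exactly 0: common words are zeroed out
```

This is why tf-idf surfaces words like character names rather than "the": a term present in every film gets an idf of zero no matter how often it is said.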
```{r, fig.width=15, fig.asp=0.65, out.width='100%', preview = TRUE}
#we create a new variable with the analysis
movie_tf_idf <- movie_words %>%
bind_tf_idf(word, movie, n) |>
select(-total_words) %>%
#we arrange by tf-idf in descending order
arrange(desc(tf_idf))
# Visualization
movie_tf_idf %>%
group_by(movie) %>%
slice_max(tf_idf, n = 20) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = movie)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ movie, ncol = 2, scales = "free") +
scale_fill_brewer(palette = "Set2") +
labs(x = "TF-IDF", y = "Words") +
theme_minimal() +
theme(axis.text.y = element_text(angle = 0),
strip.background = element_rect(fill = "lightblue"),
strip.text.x = element_text(size = 10, color = "navy"))
```
This analysis is quite revealing and helps identify very relevant elements of the plot. Moreover, the results are quite faithful to reality: as fans we can confirm that in most cases the words with the highest tf-idf are very important in their film.
For example, in `Harry Potter and the Deathly Hallows Part 1` the most distinctive word is `Dobby`; without going into spoilers, this character has one of the most important and moving scenes of the whole saga in this film. In the fifth instalment (`Harry Potter and the Order of the Phoenix`), the most distinctive word is `prophecy`, which makes sense because the whole film revolves around a very important prophecy. Likewise, in the third instalment, `Harry Potter and the Prisoner of Azkaban`, the most distinctive words are `Pettigrew` and `dementors`, both quite relevant to that film. Fans reading this will likely agree with the results.
Finally, at the beginning of the analysis we were surprised by how many lines `Horace Slughorn` had, since he only appears in 3 of the 8 films; apparently this character is very distinctive in the sixth film, with his first name and surname being the most important words in the whole film.
# SENTIMENT ANALYSIS
We will perform Sentiment Analysis on all the dialogues using the `Bing` and `AFINN` lexicon dictionaries. The Bing lexicon provides two options for each word in its dictionary: `positive` or `negative`. The AFINN dictionary includes a numerical score between [-5, +5], which will be more useful for quantifying the mood of a dialogue, scene or large portions of the movie.
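The join-and-sum mechanics used below can be illustrated with a tiny hand-made numeric lexicon in the spirit of AFINN (both the scores and the sentence are invented; they are not real AFINN values):

```{r}
# Invented mini-lexicon of word scores, for illustration only
toy_lexicon  <- c(happy = 3, brilliant = 4, dark = -2, terrible = -3)
toy_sentence <- "What a brilliant and happy day in a dark castle"

# Tokenize, keep only words present in the lexicon, and sum their scores
toy_words      <- strsplit(tolower(toy_sentence), "\\s+")[[1]]
matched        <- toy_words[toy_words %in% names(toy_lexicon)]
dialogue_score <- sum(toy_lexicon[matched])
dialogue_score   # 4 (brilliant) + 3 (happy) - 2 (dark) = 5
```

The `inner_join` with `get_sentiments()` in the chunk below does exactly this word-by-word matching, just over every dialogue at once.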
```{r setup-sentiment-data}
# Bing dataset
bing_sentiment <- df %>%
unnest_tokens(output = word, input = dialog) %>%
inner_join(get_sentiments("bing"), "word")
# Afinn dataset
AFINN_sentiments <- df |>
  mutate(linenumber = row_number(), .by = film.name) %>% # add a line number to each dialogue, restarting at one for each movie
  unnest_tokens(output = word, input = dialog) %>%
  inner_join(get_sentiments("afinn"), "word")
```
### Sentiment by chunks of dialogue
```{r plot-AFINN-100, fig.width=11}
AFINN_sentiments |>
group_by(film.name, index = linenumber %/% 100) %>% # split movie dialogue into chunks of lines.
summarise(sentiment = sum(value)) |> # sum up the sentiment
ggplot(aes(index, sentiment, fill = sentiment, label = sentiment)) +
geom_col(show.legend = FALSE) +
geom_text( ) +
scale_fill_gradient2(high = "darkgreen",
midpoint = 0,
low = "red3")+
facet_wrap(~film.name , ncol = 2, scales = "free_x") +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_blank(),
strip.text = element_text(face = "bold", hjust = 0),
panel.border = element_rect(fill = NA, color = "gray20")
) +
labs(
x = "Each chunk split up by 100 lines",
subtitle = "AFINN method"
)
```
When splitting up the dialogue into 100-line chunks, you can see which movies have the most positive and negative sentiment. The 6th movie appears to be the most positive. One chunk in the early middle part of the movie has the highest rating of all. The most negative chunk of dialogue is at the middle of the 2nd movie. The 7th movie appears to be neutral throughout and the 8th movie is mostly negative. The first two movies end very positively, unlike any of the others.
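The chunk index used above comes from integer division: `linenumber %/% 100` maps lines 1-99 to chunk 0, lines 100-199 to chunk 1, and so on. A quick base-R check:

```{r}
line_no     <- c(1, 99, 100, 199, 200, 350)
chunk_index <- line_no %/% 100   # integer division assigns each line to a chunk
chunk_index                      # 0 0 1 1 2 3
```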
Now let's get more detailed.
```{r plot-AFINN-8, fig.width=11}
AFINN_sentiments |>
group_by(film.name, index = linenumber %/% 20) %>%
summarise(sentiment = sum(value)) |>
ggplot(aes(index, sentiment, fill = sentiment)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient2(high = "darkgreen", midpoint = 0,
low = "red3") +
facet_wrap(~film.name , ncol = 2, scales = "free_x") +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(),
strip.text = element_text(face = "bold", hjust = 0),
panel.border = element_rect(fill = NA, color = "gray20")
) +
labs(
x = "Each chunk split up by 20 lines",
subtitle = "AFINN method"
)
```
Now we can see more detailed changes in sentiment with 20-line chunks. The 6th movie is still positive, especially in the middle, but now we see a few negative chunks between the positive ones; it is not *all* positive. The 7th movie has both positive and negative dialogues, but they cancel each other out when grouped into large chunks. The most positive sentiment occurs at the end of the 1st movie. The 3rd movie starts very positive, as we saw with the 100-line chunks, but now we can see some strongly negative dialogues in its second half; indeed, the most negative dialogues appear to be in the 3rd film.
Now, let's see about grouping it by chapter:
```{r plot-AFINN-chapter, fig.width=11}
AFINN_sentiments |>
group_by(film.name, chapter) %>% # group by chapter
summarise(sentiment = sum(value), index = min(linenumber)) |>
arrange(index) |>
ggplot(aes(reorder(chapter, index),
sentiment, fill = sentiment)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient2(high = "darkgreen", midpoint = 0,
low = "red3")+
facet_wrap(~film.name , ncol = 2, scales = "free_x") +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(),
strip.text = element_text(face = "bold", hjust = 0),
panel.border = element_rect(fill = NA, color = "gray20")
)+
labs(
x = "Each chunk split up by chapter",
subtitle = "AFINN method"
)
```
The most positive chapter is in the 6th movie, with one in the 3rd close behind. The most negative chapter is in the 2nd. We can now see one very positive chapter towards the end of the 8th movie. We also see that the negative detailed chunks in the middle of the 6th movie disappear: at this grouping, the chapters in the middle of that film read as positive.
These three ways of grouping the text show that the grouping choice changes the results. The first method, large chunks of dialogue, gives an overall feeling for long portions of the movie but hides the mood of specific scenes. The second method, small chunks of dialogue, shows the detailed view of sentiment that is cancelled out in the first. The last method, grouping dialogue by chapter, provides a more natural analysis of the flow of each movie: the writers and director meant each chapter to have a certain feeling, which can be lost when a movie is cut into arbitrary chunks of dialogue.
## Bigrams
Bigrams are pairs of consecutive words in the script, which give us more context than single words. We would expect to see lots of names (like Harry Potter) and common two-word phrases. But many two-word phrases are not helpful for text analysis, so we will remove all stopwords.
```{r build-bigram}
df_bigrams <- df %>%
#we take the dialogue in df, and tokenize it to sequences of 2 words
unnest_tokens(bigram, dialog, token = "ngrams", n = 2) %>%
drop_na(bigram)
# all bigrams, before taking out stopwords
df_bigrams %>%
count(bigram, sort = TRUE)
```
Notice how the most popular bigrams are "in the" and "are you," which do not provide valuable information about the topic or sentiment. Let's remove the stopwords and see what we get.
```{r bigram-filter}
bigrams_separated <- df_bigrams |>
# separate each bigram in two columns, word1 and word2
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
```
After removing all stopwords from the list of bigrams, we see a lot of names: Harry Potter, Professor Dumbledore, Sirius Black, and so on. The second most used bigram is "ha ha," which I assume is referring to someone laughing. Also in the top ten is "bloody hell," which is an exclamation and "dark arts," which is part of the name of a class at Hogwarts.
```{r bigram}
bigrams_united %>%
count(bigram, sort = T) %>%
slice_max(n, n=15) %>%
ggplot(aes(y = reorder(bigram, n), x = n))+
geom_bar(stat = "identity", width = 0.65, fill = "peru", alpha = 1)+
labs(title = "Most popular bigrams in the movies",
subtitle = "Top 15, after removing stopwords",
y = NULL, x = "Frequency")+
theme_minimal()
```
### Negation words
There is a major problem with looking at the sentiment of each word by itself: *context*. On its own, the word "funny" is very positive; the AFINN lexicon gives it a score of **+4**. But in a dialogue, "not funny" is not positive and could well be negative! Let's see how bigrams that start with negation words like *not*, *no*, and *never* affect the sentiment of the Harry Potter films.
```{r}
negation_words <- c("not", "no", "never", "without", "neither", "nor")
not_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
```
```{r not-word-plot, fig.width=11}
not_words %>%
mutate(contribution = n * value, # multiply the number of appearances with the value that AFINN assigns to the word to get its overall contribution to the movies.
sign = if_else(value > 0, "positive", "negative")) %>%
group_by(word1) %>%
slice_max(abs(contribution), n=10) %>% # get the top 10 bigrams for each negation word.
ungroup() %>%
ggplot(aes(y = reorder_within(word2, contribution, word1),
x = contribution,
fill = contribution,
label = contribution)) +
geom_col() +
geom_text() +
geom_vline(xintercept = 0, alpha = .3) +
scale_y_reordered() +
facet_wrap(~ word1, scales = "free_y", ncol = 2) +
labs(y = 'Words preceded by a negation',
x = "Contribution (sentiment value * number of mentions)",
title = "Most common positive or negative words following a negation") +
scale_fill_gradient2(high = "darkgreen",
midpoint = 0,
low = "red3")+
theme_minimal() +
theme(
strip.text = element_text(face = "bold", hjust = 0, size = 12),
panel.border = element_rect(fill = NA, color = "gray20"),
legend.position = "none"
)
```
Now we see a different picture from the single-word sentiment analysis performed before. The negation bigram with the largest absolute contribution was "no no," which arguably should still count as negative in a dialogue. We also see that "not good" falsely contributed 21 points in the positive direction.
Excluding "no no," the net contribution of the negation bigrams is **7** points in the positive direction, when it arguably should have been negative.
```{r}
not_words |>
filter(!(word1 == "no" & word2 == "no")) |>
summarise(sum = sum(value * n, na.rm = TRUE))
```
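One common way to correct for this, not applied above but easy to bolt on, is to flip the AFINN score of a word whenever it follows a negation. A minimal sketch with invented bigram counts (the words, scores, and counts below are hypothetical, not taken from the scripts):

```{r negation-flip-sketch}
library(dplyr)

negation_words <- c("not", "no", "never", "without", "neither", "nor")

# Hypothetical bigram counts (not the real script data)
toy_bigrams <- tibble(
  word1 = c("not", "very", "never"),
  word2 = c("good", "happy", "fail"),
  value = c(3, 3, -2),   # AFINN score of word2
  n     = c(7, 5, 2)
)

# Flip the score when word1 is a negation, then recompute the contribution
toy_bigrams %>%
  mutate(adjusted     = if_else(word1 %in% negation_words, -value, value),
         contribution = adjusted * n)
```

With this adjustment, "not good" contributes -21 instead of +21, matching the intuition discussed above.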
## Trigrams
Yet another form of tokenization is to look at three consecutive words. This provides even more context for topic modelling, term frequency, and sentiment analysis. Once again, we remove the stopwords.
```{r building-trigrams}
df_trigrams <- df %>%
#we take the dialogue in df, and tokenize it to sequences of 3 words
unnest_tokens(trigram, dialog, token = "ngrams", n = 3) %>%
drop_na(trigram)
trigrams_separated <- df_trigrams |>
# separate each trigram into three columns: word1, word2 and word3
separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word)
trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
trigrams_united <- trigrams_filtered %>%
unite(trigram, word1, word2, word3, sep = " ")
```
We can see some familiar words among the most popular trigrams: "ha ha ha" and "dark arts teacher." There are not as many proper names, but we do have "Professor Dumbledore sir" in the top ten.
```{r trigram}
trigrams_united %>%
count(trigram, sort = T) %>%
slice_max(n, n=10) %>%
ggplot(aes(y = reorder(trigram, n), x = n))+
geom_bar(stat = "identity", width = 0.65, fill = "peru", alpha = 1)+
labs(title = "Most popular trigrams in the movies",
subtitle = "Top 10, after removing stopwords",
y = NULL, x = "Frequency")+
theme_minimal()
```
### Using Bing's lexicon
Let's see how Bing's lexicon describes the sentiment for this movie series.
```{r}
bing_sentiment %>%
group_by(word, sentiment) %>%
summarise(count = n()) |>
ungroup() %>%
arrange(desc(count)) %>%
slice(1:20) %>%
ggplot(aes(y = reorder(word, count), x = count, fill = sentiment))+
geom_bar(stat = "identity", width = 0.62)+
scale_fill_manual(values = c("darkred", "darkgreen")) +
labs(title = "Top 20 most popular words with assigned sentiment",
subtitle = "Bing lexicon",
y = NULL, x = "Frequency", fill = "Sentiment")+
guides(fill = guide_legend(reverse = T))+
theme_minimal()+
theme(legend.position = "top")
```
The most common words in Bing's dictionary that show up in the Harry Potter series are "well," "right," and "like." The top four words are positive and the next three after that are negative. Just looking at this graph, it appears that the negative words are related to killing and death.
```{r}
bing_sentiment %>%
group_by(film.name, num, sentiment) %>%
summarise(count = n()) %>%
ungroup() |>
ggplot(aes(y = reorder(film.name, -num ), x = count, fill = sentiment)) +
geom_bar(stat = "identity", position = "fill", width = 0.7, alpha = 0.9)+
geom_vline(xintercept = 0.5) +
scale_fill_manual(values = c("red4", "darkgreen"))+
scale_x_continuous(labels = scales::percent)+
labs(title = "Share of words with positive and negative sentiment",
subtitle = "Bing lexicon", fill = "Sentiment",
x = "Percentage", y = NULL)+
guides(fill = guide_legend(reverse = T))+
theme_minimal()+
theme(legend.position = "top")
```
We can see that two films have a higher share of negative sentiment words: *Chamber of Secrets* and *Deathly Hallows Part 2*. Recall that *Chamber of Secrets* had the most negative chapter in all the movies, according to AFINN. The *Half-Blood Prince* has a higher share of positive words, which aligns with the AFINN lexicon. Despite some dark scenes in the Harry Potter series, most movies have a slightly higher share of positive sentiment overall.
```{r}
bing_sentiment %>%
filter(character %in% c("Harry Potter", "Ron Weasley", "Hermione Granger", "Rubeus Hagrid", "Albus Dumbledore", "Remus Lupin", "Minerva McGonagall", "Draco Malfoy", "Severus Snape", "Lucius Malfoy", "Voldemort", "Tom Riddle", "Sirius Black", "Neville Longbottom")) %>%
group_by(character, sentiment) %>%
summarise(count = n(), .groups = 'drop') %>%
ggplot(aes(y= reorder(character, count), x = count, fill = sentiment))+
geom_bar(stat = "identity", position = "fill", width = 0.57, alpha = 0.9) +
scale_fill_manual(values = c("darkred", "darkgreen"))+
scale_x_continuous(labels = scales::percent)+
geom_vline(xintercept = 0.5)+
labs(title = "Share of words with positive and negative sentiment",
subtitle = "Bing lexicon, main characters with the most lines", fill = "Sentiment",
y = NULL, x = "Percentage")+
guides(fill = guide_legend(reverse = T))+
theme_minimal()+
theme( legend.position = "top")
```
Of the main characters with the most lines, `Remus Lupin` and `Minerva McGonagall` have the highest share of positive sentiment, which makes sense given their inspirational impact on the plot. Not surprisingly, `Draco Malfoy` and his father `Lucius Malfoy` are the most negative characters. `Tom Riddle`, the character later known as `Voldemort`, has a higher share of positive words than his later self. This is interesting because it shows his personality changing as he grows up and becomes the story's villain.
# TOPIC MODELLING
In a saga with multiple films, such as Harry Potter, where characters, plots and motifs evolve over time, topic modelling can identify patterns and thematic shifts in the narrative. It can reveal how certain themes are introduced, how they develop over the course of the films, and which films focus more on certain aspects of the story or characters. The advantage is that, instead of looking at individual words as in TF-IDF analysis, we examine patterns of co-occurring words that represent topics, which can provide a deeper understanding of the content and structure of the series' narratives.
First of all, we prepare the dataset we will work with throughout this section. To do so, we treat each chapter of each movie, as defined by the script, as a separate document.
```{r}
library(stringr)
# divide into documents, each representing one chapter
by_chapter <- df %>%
unite(document, movie, chapter)
# tokenize
by_chapter_word <- by_chapter %>%
unnest_tokens(word, dialog)
# find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE)
word_counts
```
We now have a tidy data frame with one row per document-word pair and a count of how many times the word appears in that document. The next step is to apply LDA to the chapters.
### LDA
First, let's look at the degree of sparsity.
```{r}
chapters_dtm <- word_counts %>%
cast_dtm(document, word, n)
chapters_dtm
```
The result shows 99% sparsity, meaning that almost all document-term cells are zero: most words appear in only a few chapters. This suggests a rich vocabulary, which is to be expected, because as these are film scripts, the vocabulary is perhaps less repetitive than it might be in books. Such high sparsity is typical for a document-term matrix built from natural language and is no obstacle to fitting an LDA model.
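Concretely, the sparsity figure is just the share of zero cells in the document-term matrix. A tiny made-up example (the matrix values are invented):

```{r sparsity-sketch}
# A toy 2-document, 4-term count matrix (invented values)
m <- matrix(c(1, 0, 0, 2,
              0, 3, 0, 0), nrow = 2, byrow = TRUE)

# Share of zero entries: 5 of 8 cells are zero, i.e. 62.5% sparsity
mean(m == 0)
```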
```{r}
library(topicmodels)
# fit an LDA model with k = 4 topics
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
```
However, to continue the analysis we need to transform the model output into a tidy format.
```{r}
chapter_topics <- tidy(chapters_lda)
chapter_topics
```
By analysing the beta values in our topic model, the per-topic word probabilities, it is evident that some words have strong associations with particular topics, giving us clues as to their relevance in the context of the story. Let's visualise the top five words in each topic, which will give us a clear and tangible picture of the recurring motifs throughout the series.
```{r}
library(forcats)
top_terms <- chapter_topics %>%
group_by(topic) %>%
slice_max(beta, n = 5) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder_within(term, beta, topic)) # Rearranges the terms within each topic for visualisation
# Create the plot
ggplot(top_terms, aes(x = beta, y = term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered() +
labs(x = "Beta Value", y = "Top Terms") +
theme_minimal() +
theme(axis.title.y = element_blank(),
strip.background = element_rect(fill = "lightblue"),
strip.text = element_text(size = 12, color = "navy"))
```
At a glance, `harry` is a star term, topping most of the four topics, with `potter` playing that role in topic 2. This is not surprising, since we are talking about Harry Potter. In topic 1, words like `sir`, `dumbledore` and `professor` accompany `harry`.
Topic 2 shows a different mix, with `ron` and `hermione` near the top, suggesting that this topic is about Harry's friends and companions (since Harry's name is on the cover, it would be strange if he did not appear here too). On the other hand, topic 3 highlights `hagrid` and `hermione` alongside `harry`, perhaps pointing to key moments where these characters interact or play significant roles.
Finally, topic 4, apart from `harry`, highlights `dobby` and `time`. `Dobby` could be associated with emotional and significant moments in the saga, while `time` could relate to critical events in the plot, or even to the time-turner in "The Prisoner of Azkaban".
### Each chapter to its book
Once we have identified the keywords that define each topic, the next step is to reconstruct the puzzle of chapters and assign each one to its original book. To do this, we use gamma, the per-document topic probabilities, which tell us how strongly each chapter is associated with each topic. Think of gamma as the `Sorting Hat` deciding which Hogwarts house a chapter belongs to (if the person reading this hasn't seen a movie or read the Harry Potter books, they've missed a very good reference for explaining gamma). With this information, we can predict which book each chapter might belong to based on its dominant topics.
```{r}
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma
```
At the end, each chapter will have a set of probabilities associated with the themes, and thus a suggestion of which book it belongs to. This step is crucial because it allows us to see how the chapters are grouped together. If we get it right, we can see how the pieces of the puzzle fit together perfectly!
```{r}
chapters_gamma <- chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_gamma
```
And now, let's make a plot:
```{r, fig.width=15, fig.asp=0.65, out.width='100%', preview = TRUE}
# reorder titles in order of topic 1, topic 2, etc before plotting
chapters_gamma %>%
mutate(title = fct_reorder(title, gamma, .fun = sum)) %>% # Rearrange the titles by adding the gammas
ggplot(aes(x = factor(topic), y = gamma, fill = factor(topic))) +
geom_boxplot() +
facet_wrap(~ title, scales = "free_y") +
labs(x = "Topic", y = expression(Gamma)) +
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
```
We can observe that some books show a clear preference for certain topics, indicated by the height of the boxes. For example, `Harry Potter and the Half-Blood Prince` has a very high box for the topic shown in red, which might indicate that this topic is particularly prominent in that book. In contrast, `Harry Potter and the Chamber of Secrets` shows a more even split between two topics, with the highest boxes in green and blue, i.e. topics 2 and 3.
The dots outside the boxes, i.e. outliers, could point to chapters that are uniquely different in their thematic content compared to other chapters within the same book. For example, some chapters in `Harry Potter and the Order of the Phoenix` appear to be unique in their topic profile, given the scattering of dots.
Let's investigate further.
First, we should have a look at the topic that has been most associated with each chapter of the book.
```{r}
chapter_classifications <- chapters_gamma %>%
group_by(title, chapter) %>%
#for each chapter, keep the topic with the highest gamma
slice_max(gamma) %>%
ungroup()
chapter_classifications
```
It seems that most of the chapters of "The Chamber of Secrets" have an almost total linkage (gamma values very close to 1) with topics 2 and 3. For example, the chapters `About the Chamber`, `Aragog`, and `Cornelius Fudge` are strongly linked to topic 3.
Some chapters, such as `Backfire` and `Car rescue`, are almost entirely associated with topic 2, which could represent another central facet of this book's narrative, perhaps more action- and adventure-oriented.
However, chapters such as `Dobby's reward` and `Dobby's warning` deviate from this trend, with a stronger relationship to topic 4, which may reflect aspects more related to character development or plot elements that are less focused on the main mystery of the chamber.
Finally, to conclude our magical analysis, let's unveil the chapters that seem to have taken an unexpected plot twist and aligned themselves with topics different from the main thread of their book. Imagine the `Sorting Hat` having a peculiar day and assigning students to unexpected houses (another great reference that fans are sure to enjoy; if you have no idea what I'm talking about, I apologise :)).
First, we create a dataframe just with two columns: the book and its consensus topic.
```{r}
book_topics <- chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
slice_max(n, n = 1) %>%
ungroup() %>%
transmute(consensus = title, topic)
book_topics
```
Second, we check with an inner join if any chapter is assigned to a topic different from the consensus.
```{r}
chapter_classifications %>%
inner_join(book_topics, by = "topic") %>%
filter(title != consensus)
```
Most chapters are consistently associated with topic 3, suggesting that this topic captures key elements central to this particular book; it is possible that topic 3 encapsulates the mystery and suspense of the Chamber of Secrets.
However, there are also chapters associated with other topics, such as "Backfire" and "Car rescue" with topic 2, and "Chamber of Secrets" with topic 1, which shows that there is thematic diversity in the book. Particularly interesting are "Dobby's reward" and "Dobby's warning", which stand apart with an association to topic 4; this may indicate a subplot or a focus on character development, especially as it relates to Dobby.
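The consensus-and-mismatch logic used above can be sketched end to end on toy data (the book titles and topic assignments below are invented, not the real results):

```{r consensus-sketch}
library(dplyr)

# Invented chapter-level topic assignments
toy_class <- tibble(
  title   = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  chapter = c("ch1", "ch2", "ch3", "ch1", "ch2"),
  topic   = c(1, 1, 2, 2, 2)
)

# Consensus topic per book: the topic most of its chapters map to
toy_consensus <- toy_class %>%
  count(title, topic) %>%
  group_by(title) %>%
  slice_max(n, n = 1) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

# Chapters whose dominant topic belongs to another book's consensus
toy_class %>%
  inner_join(toy_consensus, by = "topic") %>%
  filter(title != consensus)
```

Here `ch3` of `Book A` is flagged because its topic matches `Book B`'s consensus, exactly the pattern of the `Chamber of Secrets` chapters discussed above.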