# Harry Potter and a Network Analysis

## based on NER, Sentiment Analysis & Topic Modeling using SpaCy & Prodigy


### What is Network Analysis & Why Network Analysis with Harry Potter?

For social network analysis, we follow this definition:

>Network analysis can be considered as a set of techniques with a common methodological perspective that allow researchers to represent the relationships between actors and to analyze the social structures that result from the recurrence of these relationships. The basic assumption is that a better explanation of social phenomena is achieved by analyzing the relationships between actors. This analysis is done by collecting relationship data organized in the form of a matrix. When actors are represented as nodes and their relationships are represented as lines between pairs of nodes, the concept of a social network is transformed from a metaphor to an operational analytical tool that uses the mathematical language of graph theory and the linear assumptions of matrix algebra. (Chiesi 2015, 518)

The forText digital research environment defines network analysis in the context of literary analysis and visualization as follows:

>In network analysis, certain entities (e.g., figures, authors, places) are previously examined in their relationship to each other as a network of nodes and connecting edges or relations. In this way, quantitative aspects of the relational system, primarily the number of nodes, edges, and links, first become clear and can serve as the basis for a qualitative analysis. (Schumacher 2018)
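
The idea of nodes, edges and a relationship matrix can be illustrated with a small sketch. It uses only the Python standard library; the `mentions` records are made up for illustration and are not taken from the film scripts.

```python
from collections import Counter

# Hypothetical interaction records: (speaker, character mentioned in the line).
mentions = [
    ("Harry Potter", "Albus Dumbledore"),
    ("Albus Dumbledore", "Harry Potter"),
    ("Harry Potter", "Hermione Granger"),
    ("Harry Potter", "Albus Dumbledore"),
]

# Count how often each directed pair occurs; the count is the edge weight.
edge_weights = Counter(mentions)

# Nodes are all characters that appear as speaker or as mentioned character.
nodes = sorted({name for pair in mentions for name in pair})
index = {name: i for i, name in enumerate(nodes)}

# Adjacency matrix: row = speaker, column = mentioned character.
matrix = [[0] * len(nodes) for _ in nodes]
for (source, target), weight in edge_weights.items():
    matrix[index[source]][index[target]] = weight

for name, row in zip(nodes, matrix):
    print(f"{name:<18}{row}")
```

Each row shows how often one character mentions the others, which is the kind of quantitative basis the definitions above describe.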

Our goal is to use natural language processing (NLP) and network analysis tools to **provide a representation of the social relationships of the main characters in the Harry Potter films**. Based on a database containing the dialog scripts of the eight films and various views created by queries, we first want to pre-select the characters (NER) and build up some plain text data. Then, based on the results of Sentiment Analysis and Topic Modeling, we want to map their relationships.

We think that this approach could have **exemplary value for the analysis of more complex (fictional) social networks**. In particular, how such relationships change over time could be of interest for film, literary, or theatre studies. In these fields, comparing relationship patterns across periods, authors, or genres might also be of interest.


### What is SpaCy & What could we use it for?

SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. SpaCy is designed specifically for production use and helps you build applications that process and *understand* large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Among other things, SpaCy can be used for Tokenization, Part-of-Speech Tagging, Dependency Parsing, Rule-based Matching, training statistical models and Named Entity Recognition (NER).

...



### What is Prodigy & What could we use it for?

Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. Prodigy is designed to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models. Among other things, Prodigy can be used for Span Categorization, Named Entity Recognition and Text Classification. Prodigy can be used as both a command-line tool and web application. 

...


### What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is the task of automatically finding entities in a text and assigning them to categories. Possible entity types include persons, locations, and dates.

#### Basic Python NER with SpaCy

When working with NER, it is important to tell the model how to recognise the words that are to be marked as named entities. For this purpose, a pattern for labelling the data is defined, with which the model is trained to enable the most accurate recognition possible.

Basic steps include:

1. Create match pattern for film characters
2. Label data with help of the defined patterns 
3. Train temporary model
4. Label more by correcting model
5. Train a new and better model
6. Run model
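
Steps 1 and 2 can be tried out without any training data. The following is a minimal sketch that uses spaCy's rule-based `entity_ruler` component on a blank English pipeline; it is not the Prodigy training workflow itself, and the patterns and the example sentence are assumptions chosen for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model download required.
nlp = spacy.blank("en")

# Step 1: match patterns for film characters (names chosen for illustration).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Harry Potter"},
    {"label": "PERSON", "pattern": [{"LOWER": "albus"}, {"LOWER": "dumbledore"}]},
    {"label": "PERSON", "pattern": "Hermione"},
])

# Step 2: label text with the help of the defined patterns.
doc = nlp("Hermione told Harry Potter that Albus Dumbledore had left the castle.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Pattern-based labels like these only find what the patterns describe; the training steps are what allow a model to generalize beyond them.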

[See: Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning](https://www.youtube.com/watch?v=59BKHO_xBPA)

[Named Entity Recognition 101](https://spacy.io/usage/linguistic-features#named-entities)



### What is Sentiment Analysis?

Sentiment analysis or Sentiment Detection is a subfield of text mining and refers to the automatic evaluation of texts with the goal of identifying an expressed attitude as positive, negative or neutral.


#### Basic Sentiment Analysis using the SpaCy Pipeline Extension Textblob

The TextBlob tool evaluates the sentiment of a text using two parameters, polarity and subjectivity: it outputs the former as a value between -1 and 1, the latter as a value between 0 and 1. Other machine-learning-based models can also distinguish between different emotions, but for our purposes the results of the TextBlob extension for SpaCy prove sufficient.
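
A polarity value still has to be turned into a positive, negative or neutral label. The sketch below shows one possible mapping in plain Python; the function name `label_polarity`, the threshold of 0.1 and the sample scores are assumptions for illustration, not values prescribed by TextBlob.

```python
def label_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical (polarity, subjectivity) pairs such as a sentiment tool might return.
scores = [(0.8, 0.9), (-0.45, 0.6), (0.05, 0.1)]
labels = [label_polarity(polarity) for polarity, _ in scores]
print(labels)
```

A wider threshold sends more lines to the neutral class, so the value is worth checking against a few hand-labeled dialog lines.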

https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524

https://techxplore.com/news/2019-07-sentiart-sentiment-analysis-tool-profiling.html


### What is Topic Modeling?

Topic modeling is a probabilistic text-mining method that finds "topics" by identifying words that frequently occur together in natural language texts. A word may belong to several topics.

#### Basic Python Topic Modeling with Gensim

To use topic modeling, it is important to prepare the corpus. For this, it is useful to create a manually annotated training and testing corpus.

Basic steps include:
1. Preparing our filmscripts based on the pre-selected characters (via NER) 
2. Cleaning, tokenizing and stemming the scripts
3. Splitting the scripts into a training and a testing corpus
4. Applying the Latent Dirichlet allocation (LDA) function
5. Finding the dominant topic for each character
6. Visualizing the relationship of dominant topics and relevance labels
7. Classification of the entire corpus
8. Creating uni-, bi- or trigram word clouds
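
The cleaning and tokenizing in step 2 and the n-gram counts needed for step 8 can be sketched with the standard library alone. This sketch omits stemming and the LDA step; the stop-word list and the dialog lines are made up for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "is", "to", "of", "and", "you", "i", "it"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all consecutive n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical dialog lines for one character.
lines = [
    "The dark lord is back.",
    "The dark lord wants the stone.",
]

tokens_per_line = [tokenize(line) for line in lines]
unigram_counts = Counter(t for tokens in tokens_per_line for t in tokens)
bigram_counts = Counter(bg for tokens in tokens_per_line for bg in ngrams(tokens, 2))

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(1))
```

Counts like these can feed a word cloud directly, while the token lists are the input a topic model expects.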

[See: Oberbichler, S., & Pfanzelter, E. Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods, in: Journal of Digital History, jdh001 (2021)](https://journalofdigitalhistory.org/en/article/4yxHGiqXYRbX#n103)

...

[Building a Topic Modeling Pipeline with SpaCy and Gensim](https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619)


### First Steps

##### Project Goal, Ideas and Structure
Our goal is to **represent social relations of the Harry Potter world** using NLP based on data from the movie scripts. To do this, we will first take a closer look at the given data and select main characters based on certain parameters. Then, we will analyze the reciprocal relationship of these characters to the main protagonist, Harry Potter, using sentiment analysis and topic modeling. Finally, sentiment and topics will be visualized in a descriptive way to make them presentable. This will be followed by a critical reflection on the project.

To develop this project idea, we started with research on corpora that would be suitable for NLP and chose the subject matter based on personal interest. After that, we started to familiarize ourselves with the basics of Git, GitHub and Python (Jupyter Notebook). We created a project notebook and decided to use the following data as the basis of our project.


##### Data
We use the data from the GitHub account kornflex28, which we found via Kaggle. The repository **contains the .csv files of the eight-part film saga** that we use, as well as a table of their metadata.

To gain better insight into the data, we loaded it into DB Browser for SQLite and ran some initial queries (more on this in the digressions: Hands on Data - from .csv to SQLite & SQLite Queries).

https://www.kaggle.com/kornflex/harry-potter-movies-dataset?select=datasets

https://github.com/kornflex28/hp-dataset

### Realization

#### How to: Named Entity Recognition

#### How to: Sentiment Analysis

Step One: Preprocessing the Data

Step Two: Feeding Data to Prodigy

Step Three: Text Categorization using Prodigy

**Problem! Might not be enough Data to train a model**
To perform the sentiment analysis, we wanted to train our own classification model using Prodigy, as introduced earlier. However, it quickly became clear that the data would not be sufficient to train a meaningful model.

**Alternative?** Using the Textblob SpaCy Pipeline Extension for Sentiment Analysis

**What is Textblob?** 

https://medium.com/analytics-vidhya/sentiment-analysis-using-textblob-ecaaf0373dff

https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524

https://stackabuse.com/sentiment-analysis-in-python-with-textblob/

#### How to: Sentiment Analysis Part II


Step One: Preprocessing the Data (again)
> - Manually exporting the SQLite Views as .csv-files and converting them to .txt-files?
> - Trying to use SQLite within the Python Jupyter Notebook?

Step Two: Adding the TextBlob Extension 

Step Three: Using TextBlob to assign sentiments

#### How to: Topic Modeling

Step One:

Step Two:

Step Three:

### Visualization

#### How to: Visualize Networks with Gephi

[Gephi](https://gephi.org/)
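
One way to hand the network to Gephi is an edge list in CSV form, which its spreadsheet import reads when the columns are named `Source`, `Target` and `Weight`. The sketch below assumes that the line count serves as edge weight and that the mean polarity is carried along as an extra `Polarity` column; all values are made up for illustration.

```python
import csv
import io

# Hypothetical edges: (source, target, number of dialog lines, mean polarity).
edges = [
    ("Harry Potter", "Albus Dumbledore", 12, 0.21),
    ("Albus Dumbledore", "Harry Potter", 9, 0.34),
    ("Harry Potter", "Draco Malfoy", 7, -0.18),
]


def write_edge_list(rows, handle):
    """Write edges as CSV with a Source/Target/Weight header plus Polarity."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight", "Polarity"])
    writer.writerows(rows)


# Written to an in-memory buffer here; pass an open file to create a .csv file.
buffer = io.StringIO()
write_edge_list(edges, buffer)
print(buffer.getvalue())
```

Keeping polarity in its own column leaves the weight positive, and the polarity can then be used for edge color inside Gephi.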

...

#### How to: Visualize topics with Wordclouds

...


### Résumé

#### Future Options
* time factor
* language model
* extended sentiments (different emotions instead of polarity)
* sarcasm as relevant factor? (see: https://www.sciencedirect.com/science/article/pii/B9780128044124000073) 

...


#### Documentation

1. Get Inspiration (Data, Purpose)
2. Create GitHub Project
3. Create Project Schedule (Milestones)
4. Hands on Data (.csv to .sqlite?)
5. Assign fundamentals (Merle: network analysis, sentiment analysis / Teresa: NER, topic modeling)
6. Decision-Making: use Prodigy to train models?
7. Create plain text data using SQLite DBBrowser
8. Use SQLite within JupyterNotebook to shorten process
9. Loop over dialogdata (SQL views) to assign sentiments
10. Visualize sentiment distribution for further data understanding

...

11. Visualisation: Gephi? wordcloudtool?


#### Bibliography

Antonio M. Chiesi: *Network Analysis* in: *International Encyclopedia of the Social & Behavioral Sciences* Edited by James D. Wright (2015: 518-523)

Jan Horstmann: *Topic Modeling* on: https://fortext.net/routinen/methoden/topic-modeling (January 15, 2018) 

S. Oberbichler & E. Pfanzelter: *Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods.* in: Journal of Digital History, jdh001 https://journalofdigitalhistory.org/en/article/4yxHGiqXYRbX (2021)

Mareike Schumacher: *Netzwerkanalyse* on: https://fortext.net/routinen/methoden/netzwerkanalyse (November 12, 2018)

Mareike Schumacher: *Named Entity Recognition (NER)* on: https://fortext.net/routinen/methoden/named-entity-recognition-ner (May 17, 2018)




#### EXTRA: Hands on Data - from .csv to SQLite

Trying to import the data from .csv to .sqlite using dataframes (pandas)
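
A minimal sketch of this import, assuming pandas is installed. The table name `all_parts` and the columns `character` and `dialog` follow the queries used in this notebook; the two sample rows are made up, and an in-memory database stands in for a real .sqlite file.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for one of the dataset's .csv files; the rows are invented examples.
csv_text = io.StringIO(
    "character,dialog\n"
    "Harry Potter,Example line one.\n"
    "Albus Dumbledore,Example line two.\n"
)

# Read the CSV into a DataFrame (pass a file path instead of the buffer).
df = pd.read_csv(csv_text)

# Write the DataFrame to a SQLite table; ":memory:" keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
df.to_sql("all_parts", conn, if_exists="replace", index=False)

row_count = conn.execute("SELECT count(*) FROM all_parts").fetchone()[0]
print(row_count)
conn.close()
```

Replacing `":memory:"` with a file name would persist the table so that it can be opened in a database browser afterwards.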


#### EXTRA: SQLite Queries?

Task: Create plain text data for NER, for training a sentiment analysis model, and for topic analysis with Gensim.

How much does each character speak, measured by dialog length?
* SELECT DISTINCT character from all_parts; 175 results
* SELECT character, count(dialog) FROM all_parts GROUP BY character ORDER BY count(dialog)
* SELECT character, sum(length(dialog)) AS dialogsum FROM all_parts GROUP BY character ORDER BY dialogsum DESC LIMIT 30


* CREATE VIEW harry_albus AS SELECT * FROM all_parts WHERE character = 'Harry Potter' AND (dialog LIKE '%Dumbledore%' OR dialog LIKE '%Albus%');
* CREATE VIEW albus_harry AS SELECT * FROM all_parts WHERE character = 'Albus Dumbledore' AND (dialog LIKE '%Harry%' OR dialog LIKE '%Potter%' OR dialog LIKE '%chosen one%');
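
Because SQL evaluates AND before OR, grouping the LIKE conditions in parentheses changes which rows such a view returns. The following sketch checks this with Python's `sqlite3` module on an in-memory table; the three rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_parts (character TEXT, dialog TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO all_parts VALUES (?, ?)",
    [
        ("Harry Potter", "Professor Dumbledore, may I ask something?"),
        ("Harry Potter", "Ron, pass the salt."),
        ("Ron Weasley", "Albus is a funny name."),
    ],
)

# Without parentheses, AND is evaluated before OR, so Ron's line slips in.
loose = conn.execute(
    "SELECT character FROM all_parts "
    "WHERE character = 'Harry Potter' AND dialog LIKE '%Dumbledore%' "
    "OR dialog LIKE '%Albus%'"
).fetchall()

# With parentheses, only Harry's lines that mention Dumbledore remain.
conn.execute(
    "CREATE VIEW harry_albus AS SELECT * FROM all_parts "
    "WHERE character = 'Harry Potter' "
    "AND (dialog LIKE '%Dumbledore%' OR dialog LIKE '%Albus%')"
)
strict = conn.execute("SELECT character FROM harry_albus").fetchall()

print(len(loose), len(strict))
conn.close()
```

The same pattern of querying a view from Python also covers the step of pulling dialog text into the notebook without a manual export.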


#### EXTRA: Jupyter Notebook Slideshow?
Task: Create a presentable version of the Jupyter Notebook using the slideshow view.

How does this work: https://medium.com/@mjspeck/presenting-code-using-jupyter-notebook-slides-a8a3c3b59d67