# Harry Potter and a Network Analysis

## based on NER, Sentiment Analysis & Topic Modeling using SpaCy & Prodigy


### What is Network Analysis & Why Network Analysis with Harry Potter?

We understand social network analysis here as follows:

>Network analysis can be considered as a set of techniques with a common methodological perspective that allow researchers to represent the relationships between actors and to analyze the social structures that result from the recurrence of these relationships. The basic assumption is that a better explanation of social phenomena is achieved by analyzing the relationships between actors. This analysis is done by collecting relationship data organized in the form of a matrix. When actors are represented as nodes and their relationships are represented as lines between pairs of nodes, the concept of a social network is transformed from a metaphor to an operational analytical tool that uses the mathematical language of graph theory and the linear assumptions of matrix algebra. (Chiesi 2015, 518)

The forText digital research environment defines network analysis in the context of literary analysis and visualization as follows:

>In network analysis, certain entities (e.g., figures, authors, places) are previously examined in their relationship to each other as a network of nodes and connecting edges or relations. In this way, quantitative aspects of the relational system, primarily the number of nodes, edges, and links, first become clear and can serve as the basis for a qualitative analysis. (Schumacher 2018)
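
To make the node-and-edge vocabulary of the two definitions concrete, here is a minimal sketch using only the Python standard library. The relationships listed are hypothetical and serve only as an illustration; they are not results of this project. The sketch stores the relationship data as a matrix and derives the basic quantitative figures (number of nodes, number of edges, degree per node).

```python
# Hypothetical relationships between characters (illustrative only,
# not derived from the film scripts).
edges = [
    ("Harry", "Ron"),
    ("Harry", "Hermione"),
    ("Ron", "Hermione"),
    ("Harry", "Dumbledore"),
]

# Nodes are all actors that appear in at least one relationship.
nodes = sorted({name for edge in edges for name in edge})
index = {name: i for i, name in enumerate(nodes)}

# Symmetric adjacency matrix: 1 if two actors are related, else 0.
matrix = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    matrix[index[a]][index[b]] = 1
    matrix[index[b]][index[a]] = 1

# Degree = number of relationships each actor takes part in.
degree = {name: sum(matrix[index[name]]) for name in nodes}

print("nodes:", len(nodes), "edges:", len(edges))
print(degree)
```

A tool such as Gephi works with the same kind of data, typically supplied as a node list and an edge list.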

Our goal is to use natural language processing (NLP) and network analysis tools to **provide a representation of the social relationships of the main characters in the Harry Potter films**. Starting from a database that contains the dialog scripts of the eight films, together with various views created by queries, we first want to pre-select the characters (NER) and build plain text data. Then, based on the results of Sentiment Analysis and Topic Modeling, we want to map their relationships.

We think that this approach could have **exemplary value for the analysis of more complex (fictional) social networks**. In particular, we can imagine that the change of such relationships over time could be of interest for film, literature, or theatre studies. In these fields, comparing patterns of relationships across certain periods, authors, or genres might also be of interest.


### What is SpaCy & What could we use it for?

SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. SpaCy is designed specifically for production use and helps you build applications that process and *understand* large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Among other things, SpaCy can be used for Tokenization, Part-of-Speech Tagging, Dependency Parsing, Rule-based Matching, training statistical models and Named Entity Recognition (NER).

...



### What is Prodigy & What could we use it for?

Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. Prodigy is designed to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models. Among other things, Prodigy can be used for Span Categorization, Named Entity Recognition and Text Classification.

...


### What is NER?

#### Basic Python NER with SpaCy

...


### What is Sentiment Analysis?

Sentiment analysis, or sentiment detection, is a subfield of text mining and refers to the automatic evaluation of texts with the goal of identifying an expressed attitude as positive, negative or neutral.
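
The three-way labeling can be illustrated with a deliberately simple sketch. It is a toy word-list approach, not the trained text classification model planned for this project; the word lists and the function name `label_sentiment` are assumptions made for the example.

```python
import re

# Hypothetical mini-lexicons, chosen only for illustration.
POSITIVE = {"brilliant", "good", "love", "thanks", "wonderful"}
NEGATIVE = {"awful", "bad", "hate", "horrible", "stupid"}


def label_sentiment(text):
    """Label a text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up example lines, not taken from the scripts.
for line in ["That was brilliant, Harry!",
             "I hate this horrible place.",
             "We are going to the library."]:
    print(label_sentiment(line), "<-", line)
```

A trained model replaces the fixed word lists with patterns learned from annotated examples, which is where an annotation tool comes in.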

#### Basic Python Sentiment Analysis with Prodigy

Text classification models learn to assign one or more labels to text. You can use text classification over short pieces of text like sentences or headlines, or longer texts like paragraphs or even whole documents.

Basic steps include:

1. Training a model for Harry Potter sentiment analysis
2. Applying the model to the movie script data

...

###### Basic Python Sentiment Analysis with Prodigy?

[Sentiment Analysis Using SpaCy Pipelines](https://medium.com/mlearning-ai/ai-in-the-real-world-2-sentiment-analysis-using-spacy-pipelines-b39a2618d7c1)

...


### What is Topic Modeling?

...

##### Basic Python Topic Modeling with Gensim

...


### First Steps

##### Project Idea and Structure

##### Data

##### Recap: Project Goals


### Realization

##### How to: Named Entity Recognition

##### How to: Sentiment Analysis

##### How to: Topic Modeling

### Visualization

#### How to: Visualize Networks with Gephi

[Gephi](https://gephi.org/)

...

#### How to: Visualize topics with Wordclouds

...


### Résumé

#### Future Options
* time factor

...


#### Documentation

1. Get Inspiration (Data, Purpose)
2. Create GitHub Project
3. Create Project Schedule (Milestones)
4. Hands on Data (.csv to .sqlite?)
5. Assign fundamentals (Merle: network analysis, sentiment analysis / Teresa: NER, topic modeling)
6. Decision-Making: use Prodigy to train models?
7. Create plain text data using DB Browser for SQLite
8. Visualisation: Gephi? Word cloud tool?


#### Bibliography

Antonio M. Chiesi: *Network Analysis*, in: *International Encyclopedia of the Social & Behavioral Sciences*, edited by James D. Wright (2015: 518-523)

Mareike Schumacher: *Netzwerkanalyse* on: https://fortext.net/routinen/methoden/netzwerkanalyse (November 12, 2018)



#### EXTRA: Hands on Data - from .csv to SQLite

Trying to import the data from .csv to .sqlite using dataframes (pandas)

## Code
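
A minimal sketch of such an import could look as follows. It uses the standard library modules `csv` and `sqlite3` rather than pandas, so it runs without extra dependencies. The table name `all_parts` and the columns `character` and `dialog` are taken from the queries in this document; the function name, the in-memory database and the two sample rows are assumptions made for the example.

```python
import csv
import io
import sqlite3


def import_dialog_csv(csv_file, conn):
    """Copy rows with 'character' and 'dialog' columns into the table all_parts."""
    reader = csv.DictReader(csv_file)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS all_parts (character TEXT, dialog TEXT)"
    )
    rows = [(row["character"], row["dialog"]) for row in reader]
    conn.executemany(
        "INSERT INTO all_parts (character, dialog) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


# Stand-in for the real CSV file; the dialog lines are made up.
sample_csv = io.StringIO(
    "character,dialog\n"
    "Harry Potter,Hello Professor.\n"
    'Albus Dumbledore,"Good evening, Harry."\n'
)

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
inserted = import_dialog_csv(sample_csv, conn)
count = conn.execute("SELECT count(*) FROM all_parts").fetchone()[0]
print(inserted, "rows inserted,", count, "rows in table")
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`. With pandas, `DataFrame.to_sql` offers a comparable route.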

#### EXTRA: SQLite Queries?

Task: Create plain text data to use for NER, for training a sentiment analysis model, and for analysing topics with gensim.

How much does each character speak, measured by length of dialog?
* SELECT DISTINCT character from all_parts; 175 results
* SELECT character, count(dialog) FROM all_parts GROUP BY character ORDER BY count(dialog)
* SELECT character, sum(length(dialog)) AS dialogsum FROM all_parts GROUP BY character ORDER BY dialogsum DESC LIMIT 30


* CREATE VIEW harry_albus AS SELECT * FROM all_parts WHERE character = 'Harry Potter' AND (dialog LIKE '%Dumbledore%' OR dialog LIKE '%Albus%')
* CREATE VIEW albus_harry AS SELECT * FROM all_parts WHERE character = 'Albus Dumbledore' AND (dialog LIKE '%Harry%' OR dialog LIKE '%Potter%')
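
Queries like these can also be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory table with made-up dialog lines, so the rows and variable names are assumptions. It also shows why the `OR` condition needs parentheses: in SQLite, `AND` binds more tightly than `OR`, so without parentheses the character filter applies only to the first `LIKE` pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_parts (character TEXT, dialog TEXT)")
# Made-up rows for illustration, not taken from the scripts.
conn.executemany("INSERT INTO all_parts VALUES (?, ?)", [
    ("Harry Potter", "Professor Dumbledore, sir?"),
    ("Harry Potter", "Ron, wait!"),
    ("Ron Weasley", "Albus Dumbledore is the greatest wizard."),
    ("Albus Dumbledore", "Harry, come in."),
])

# Total dialog length per character, longest first.
ranking = conn.execute(
    "SELECT character, sum(length(dialog)) AS dialogsum "
    "FROM all_parts GROUP BY character ORDER BY dialogsum DESC LIMIT 30"
).fetchall()

# Lines spoken by Harry that mention Dumbledore (OR grouped in parentheses).
grouped = conn.execute(
    "SELECT character, dialog FROM all_parts "
    "WHERE character = 'Harry Potter' "
    "AND (dialog LIKE '%Dumbledore%' OR dialog LIKE '%Albus%')"
).fetchall()

# Same query without parentheses: also returns lines by other characters.
ungrouped = conn.execute(
    "SELECT character, dialog FROM all_parts "
    "WHERE character = 'Harry Potter' "
    "AND dialog LIKE '%Dumbledore%' OR dialog LIKE '%Albus%'"
).fetchall()

print(ranking)
print(len(grouped), "row(s) with parentheses,", len(ungrouped), "without")
```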
