Commit

star wars arc diagram project
gastonstat committed Jul 11, 2012
0 parents commit 50e35ab
Showing 29 changed files with 25,500 additions and 0 deletions.
63 changes: 63 additions & 0 deletions README.md
@@ -0,0 +1,63 @@
## Visualizing Star Wars Movie Scripts

**Description:**
Perform a statistical text analysis on the Star Wars scripts from Episodes IV, V, and VI, in order to obtain an arc-diagram representation (more or less equivalent to the one in [Similar Diversity](http://similardiversity.net/)) of the most talkative characters of the trilogy.

<br>

### R scripts
**Parsing scripts:** R scripts to parse the movie scripts, extract the characters & dialogues, and produce the data tables:
```
1. parsing_episodeIV.R
2. parsing_episodeV.R
3. parsing_episodeVI.R
```
**Analysis scripts:** R scripts that perform the analysis (run them sequentially):
```
1. get_top_characters.R
2. get_top_characters_network.R
3. get_terms_by_episodes.R
4. ultimate_arc_diagram.R
```
**Functions:** R functions for plotting different types of arc diagrams (these functions are used in some of the above analysis scripts):
```
1. arcDiagram.R
2. arcPies.R
3. arcBands.R
4. arcBandBars.R
```

<br>

### Text Files
**Movie Scripts:** These are the movie scripts (raw text data) which are parsed to produce the dialogue files:
```
1. StarWars_EpisodeIV_script.txt
2. StarWars_EpisodeV_script.txt
3. StarWars_EpisodeVI_script.txt
```
**Text dialogues:** These are *intermediate* files containing the dialogues extracted from the movie scripts:
```
1. EpisodeIV_dialogues.txt
2. EpisodeV_dialogues.txt
3. EpisodeVI_dialogues.txt
```
**Input tables:** These are the input files (in data table format) used for the text mining analysis:
```
1. SW_EpisodeIV.txt
2. SW_EpisodeV.txt
3. SW_EpisodeVI.txt
```
**Output tables:** These are the output files (in data table format) produced by the text mining analysis and used to build the different arc-diagrams:
```
1. top_chars_by_eps.txt
2. top_chars_network.txt
3. top_char_terms.txt
4. weight_edges_graph1.txt
5. weight_edges_graph2.txt
```
<br>

#### PS:
Keep in mind that my main motivation for this project was to replicate, as much as possible, the arc diagram of Similar Diversity by Philipp Steinweber and Andreas Koller. I'm sure there are a lot of parts in the analysis that can be made faster, more efficient, and simpler. However, my quota of spare time is very limited and I haven't had enough time to write the ideal code.
80 changes: 80 additions & 0 deletions Rscripts/DOCUMENTATION.md
@@ -0,0 +1,80 @@
## Star Wars Arc Diagram
by [Gaston Sanchez](http://www.gastonsanchez.com/)

**Motivation**
The main idea behind this project is to reproduce, as much as possible, an arc-diagram in R like the one depicted in [Similar Diversity](http://similardiversity.net/) (by Philipp Steinweber and Andreas Koller).

**Long story short**
I think I saw the arc-diagram of ***Similar Diversity*** for the first time back in 2010. Every time I contemplated that diagram, both amazed and amused, I ended up wondering how on earth Philipp and Andreas did it. Finally, one day I couldn't stand my questions anymore, so I decided to try to make my own arc-diagram in R... and this project is the result of that attempt. Although I haven't been able to obtain an identical representation of the Similar Diversity arc diagram, I'm really happy with what I got, and I think it comes pretty close.

**Source**
In my case, I decided to use the movie scripts from the original Star Wars trilogy (Episodes IV, V, and VI) as the text data for my analysis. I found the movie scripts at [Ben and Grover](http://www.benandgrover.com/scripts.asp) and [corky.net](http://corky.net/scripts/), where you can also find other scripts if you want to play with other movies.

----------------------------------

## Analysis recipe:

### Step 1) Text Parsing
Parse the text data of the movie scripts to extract the dialogues of each character. The final output files are the extracted dialogues in table format (one table per movie):
```
Movie script text files | Parsed files | Data table files
------------------------------- | ------------------------ | -----------------
StarWars_EpisodeIV_script.txt | EpisodeIV_dialogues.txt | SW_EpisodeIV.txt
StarWars_EpisodeV_script.txt | EpisodeV_dialogues.txt | SW_EpisodeV.txt
StarWars_EpisodeVI_script.txt | EpisodeVI_dialogues.txt | SW_EpisodeVI.txt
```
The R scripts for this step are:
```
1. parsing_episodeIV.R
2. parsing_episodeV.R
3. parsing_episodeVI.R
```
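The parsing step can be pictured with a small base-R sketch. This is a hypothetical simplification, not the actual logic of the `parsing_episode*.R` scripts: it assumes a speaker's name appears as an all-uppercase line and that the single line after it is the dialogue.

```r
# Toy screenplay fragment (hypothetical sample, not the real script text)
raw_lines <- c(
  "          LUKE",
  "     But I was going into Tosche Station!",
  "          THREEPIO",
  "     I'm afraid that's quite impossible.")

# Assumption: an all-uppercase line is a character name,
# and the line right after it is that character's dialogue
parse_dialogues <- function(lines) {
  lines <- trimws(lines)
  is_name <- grepl("^[A-Z][A-Z' -]+$", lines)
  data.frame(
    character = lines[is_name],
    dialogue  = lines[which(is_name) + 1],
    stringsAsFactors = FALSE)
}

dlg <- parse_dialogues(raw_lines)
```

A real parser also has to cope with scene headings, stage directions, and dialogues spanning several lines, which is where most of the work in the actual scripts goes.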
------------------------------------
### Step 2) Identify top characters
Once we have the dialogues in table format, the next step is to identify the most talkative characters of the trilogy, that is, the characters with the greatest number of dialogues in English. This implies performing a frequency analysis on the number of dialogues per character. Unfortunately, Artoo and Chewie are excluded since they don't speak English. The R script for this step is:
```
get_top_characters.R
```
The output file is:
```
top_chars_by_eps.txt
```
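At its core, this step is a frequency count of dialogues per character. A minimal sketch with made-up data (the actual script works on the `SW_Episode*.txt` tables):

```r
# Hypothetical dialogue table: one row per dialogue, as in the parsed tables
dialogues <- data.frame(
  character = c("LUKE", "HAN", "LUKE", "LEIA", "LUKE", "HAN"),
  stringsAsFactors = FALSE)

# count dialogues per character and rank them
freqs <- sort(table(dialogues$character), decreasing = TRUE)
top_chars <- names(freqs)
```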
------------------------------------
### Step 3) Network between top characters
Having identified the top characters (i.e., the most talkative ones) of the trilogy, the following step is to build a network between them. The way I get the network is by looking at associations between the words that appear in the dialogues of the top characters. This process involves a text mining analysis with the help of the ```tm``` package. In turn, the network is obtained using the ```igraph``` package. The R script for this step is:
```
get_top_characters_network.R
```
The output files are:
```
top_chars_network.txt
weight_edges_graph1.txt
weight_edges_graph2.txt
```
For exploratory purposes, we can get some preliminary visualizations using some of the plotting functions:
```
1. arcDiagram.R
2. arcPies.R
3. arcBands.R
```
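To illustrate the edge-weighting idea without the `tm` and `igraph` machinery, here is a simplified base-R sketch that connects two characters by the number of distinct words their dialogues share. The toy dialogues and the raw word-overlap measure are my own simplification, not the association scores the actual script computes:

```r
# Hypothetical mini-corpus: one string of pooled dialogue per character
dialogues <- list(
  LUKE = "the force is strong the ship is fast",
  HAN  = "the ship is fast and the odds are bad",
  LEIA = "the plans are in the ship")

# vocabulary of each character
words <- lapply(dialogues, function(s) unique(strsplit(s, " ")[[1]]))

# weighted edge list: one row per pair of characters,
# weight = number of distinct words they have in common
pairs <- combn(names(words), 2)
edges <- data.frame(
  from   = pairs[1, ],
  to     = pairs[2, ],
  weight = apply(pairs, 2, function(p)
    length(intersect(words[[p[1]]], words[[p[2]]]))),
  stringsAsFactors = FALSE)
```

An edge list in this shape is exactly what `igraph` consumes to build the graph.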
------------------------------------
### Step 4) Identify top terms of top characters
The next step is to identify the top terms said by the top characters, according to the episodes in which they appear. The results obtained in this phase are used for plotting the bar-charts below each character in the final arc-diagram. The R script for this step is:
```
get_terms_by_episodes.R
```
The output file is:
```
top_char_terms.txt
```
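The idea behind this step can be sketched with plain term counting. The toy lines below are hypothetical; the actual script derives the terms from the parsed dialogue tables, broken down by episode:

```r
# Hypothetical pooled dialogue lines for one character
luke_lines <- c("the force is strong", "trust the force")

# split into terms and count occurrences
terms <- unlist(strsplit(luke_lines, " "))
term_freq <- sort(table(terms), decreasing = TRUE)

# the most frequent terms feed the bar-chart under that character
head(term_freq, 3)
```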
------------------------------------
### Step 5) Ultimate Arc-Diagram
The last step is to get the final arc-diagram containing the top characters, their participation by episode, and the associated top terms from their dialogues. The size of the bands around each character reflects how much they talk in the movies, as well as their participation in each episode. The width of the arcs connecting two characters reflects the number of words they have in common. In turn, the color of the arcs reflects the association between two characters, depending on the similarity of the words in their dialogues. The R script for this step is:
```
ultimate_arc_diagram.R
```
The plotting function is:
```
arcBandBars.R
```
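The geometry behind the diagram is simple to state: each connecting arc is a semicircle centered at the midpoint of the two node positions, with radius equal to half the distance between them. This mirrors the computation inside `arcBandBars.R`:

```r
# two nodes at x-positions a and b on the horizontal axis
a <- 0.2
b <- 0.8

# semicircular arc between them
radius <- abs(a - b) / 2
center <- (a + b) / 2
z <- seq(0, pi, length.out = 100)   # angles for the upper half-circle
x <- center + radius * cos(z)
y <- radius * sin(z)
# lines(x, y) would draw the arc on an open plot
```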
184 changes: 184 additions & 0 deletions Rscripts/arcBandBars.R
@@ -0,0 +1,184 @@
############################################################################
# Title: arcBandBars.R
# Description: function to plot an arc-diagram with bands and bar-charts
# Author: Gaston Sanchez
# www.gastonsanchez.com
# License: BSD Simplified License
# http://www.opensource.org/license/BSD-3-Clause
# Copyright (c) 2012, Gaston Sanchez
# All rights reserved
############################################################################

arcBandBars <- function(
edgelist, bands, bars, col.bands=NULL, sorted=TRUE, decreasing=FALSE,
lwd=NULL, col=NULL, cex=NULL, col.nodes=NULL, cex.terms=NULL, col.terms=NULL,
lend=1, ljoin=2, lmitre=1, bg=NULL, mar=c(4,1,3,1))
{
# ARGUMENTS
# edgelist: two-column matrix with edges
# bands: numeric matrix with rows=nodes and columns=numbers
# bars: list of numeric tables with proportions for bar-charts
# sorted: logical to indicate if nodes should be sorted
# decreasing: logical to indicate type of sorting (used only when sorted=TRUE)
# lwd: widths for the arcs (default 1)
# col: color for the arcs (default "gray50")
# cex: magnification of the nodes labels (default 1)
# col.nodes: color of the nodes labels (default "gray50")
# cex.terms: magnification of the terms in bar charts
# col.terms: color of the terms in bar charts
# lend: the line end style for the arcs (see par)
# ljoin: the line join style for the arcs (see par)
# lmitre: the line mitre limit for the arcs (see par)
# bg: background color (default "white")
# mar: numeric vector for margins (see par)

# make sure edgelist is a two-col matrix
if (!is.matrix(edgelist) || ncol(edgelist)!=2)
stop("argument 'edgelist' must be a two column matrix")
edges = edgelist
# how many edges
ne = nrow(edges)
# get nodes
nodes = unique(as.vector(edges))
nums = seq_along(nodes)
# how many nodes
nn = length(nodes)
# enumerate (sort nodes if requested)
if (sorted) {
nodes = sort(nodes, decreasing=decreasing)
nums = order(nodes, decreasing=decreasing)
}
# make sure bands is correct
if (!is.matrix(bands) && !is.data.frame(bands))
stop("argument 'bands' must be a numeric matrix or data frame")
if (is.data.frame(bands))
bands = as.matrix(bands)
if (nrow(bands) != nn)
stop("number of rows in 'bands' is different from number of nodes")

# check default argument values
if (is.null(lwd)) lwd = rep(1, ne)
if (length(lwd) != ne) lwd = rep(lwd, length=ne)
if (is.null(col)) col = rep("gray50", ne)
if (length(col) != ne) col = rep(col, length=ne)
if (is.null(col.nodes)) col.nodes = rep("gray50", nn)
if (length(col.nodes) != nn) col.nodes = rep(col.nodes, length=nn)
if (!is.null(cex) && length(cex) != nn) cex = rep(cex, length=nn)
if (is.null(bg)) bg = "white"

# nodes frequency from bands
nf = rowSums(bands) / sum(bands)
# words center coordinates
fin = cumsum(nf)
ini = c(0, cumsum(nf)[-nn])
centers = (ini + fin) / 2
names(centers) = nodes
# node radii
nrads = nf / 2

# arcs coordinates
# matrix with numeric indices
e_num = matrix(0, nrow(edges), ncol(edges))
for (i in 1:nrow(edges))
{
e_num[i,1] = centers[which(nodes == edges[i,1])]
e_num[i,2] = centers[which(nodes == edges[i,2])]
}
# max arc radius
radios = abs(e_num[,1] - e_num[,2]) / 2
max_radios = which(radios == max(radios))
max_rad = unique(radios[max_radios] / 2)
# arc locations
locs = rowSums(e_num) / 2

# function to get pie segments
t2xy <- function(x1, y1, u, rad)
{
t2p <- pi * u + 0 * pi/180
list(x2 = x1 + rad * cos(t2p), y2 = y1 + rad * sin(t2p))
}

# plot
par(mar = mar, bg=bg)
plot.new()
plot.window(xlim=c(-0.025, 1.025), ylim=c(-0.7*max_rad, 1*max_rad*2))
# plot connecting arcs
z = seq(0, pi, l=100)
for (i in 1:ne)
{
radio = radios[i]
x = locs[i] + radio * cos(z)
y = radio * sin(z)
lines(x, y, col=col[i], lwd=lwd[i],
lend=lend, ljoin=ljoin, lmitre=lmitre)
}
# plot node bands
for (i in 1:nn)
{
radius = nrads[i]
p = c(0, cumsum(bands[i,] / sum(bands[i,])))
dp = diff(p)
np = length(dp)
angle <- rep(45, length.out = np)
for (k in 1:np)
{
n <- max(2, floor(200 * dp[k]))
P <- t2xy(centers[i], 0, seq.int(p[k], p[k+1], length.out=n), rad=radius)
polygon(c(P$x2, centers[i]), c(P$y2, 0), angle=angle[k],
border=NA, col=col.bands[k], lty=0)
}
# draw white circles
theta = seq(0, pi, length=100)
x3 = centers[i] + 0.7*nrads[i] * cos(theta)
y3 = 0 + 0.7*nrads[i] * sin(theta)
polygon(x3, y3, col=bg, border=bg, lty=1, lwd=2)
}
# default size for node labels
if (is.null(cex)) {
cex = nf
cex[nf < 0.01] = 0.01
cex = cex * 5
}
# add node names
text(centers, 0, nodes, cex=cex, adj=c(0.5,0), col=col.nodes)

# plot bar-charts below each node
# max number of bar-chart divisions
bar_divs = max(sapply(bars, nrow)) + 1
# heights (I'm adding one more division to plot blank lines)
yh = seq(-0.05*max_rad, -0.7*max_rad, length.out=bar_divs+1)
# default cex effect for the terms
if (is.null(cex.terms)) {
cex.terms = nf
cex.terms[nf < 0.01] = 0.01
cex.terms = cex.terms * 3
}
# for each node
for (w in nums)
{
xrange = fin[w] - ini[w]
# for each term (row) in this node's bar-chart
nadjs = nrow(bars[[w]])
nc = ncol(bars[[w]])
for (i in 1:nadjs)
{
Bp_ranges = bars[[w]][i,] * xrange
ord = order(Bp_ranges, decreasing=TRUE)
Bp_ranges_ord = sort(Bp_ranges, decreasing=TRUE)
Bp_intervals = ini[w] + cumsum(Bp_ranges_ord)
x_start = c(ini[w], Bp_intervals[-nc])
x_end = Bp_intervals
cols_ord = col.bands[ord]
for (j in 1:nc)
{
rect(x_start[j], yh[i+1], x_end[j], yh[i], col=cols_ord[j],
border=NA, lwd=0, lty=1)
}
# add white line in right border
lines(rep(x_end[nc],nadjs+1), yh[1:(nadjs+1)], col=bg)
# add labels of terms
text(x_start[1], yh[i+1], rownames(bars[[w]])[i],
col=col.terms, cex=cex.terms[w], adj=c(-0.05,-0.4))
}
}
}
