Commit

star wars arc diagram project
gastonstat committed Jul 11, 2012
0 parents commit 50e35ab
Showing 29 changed files with 25,500 additions and 0 deletions.
63 changes: 63 additions & 0 deletions README.md
@@ -0,0 +1,63 @@
## Visualizing Star Wars Movie Scripts

**Description:**
Perform a statistical text analysis on the Star Wars scripts from Episodes IV, V, and VI, in order to obtain an arc-diagram representation (more or less equivalent to the one in [Similar Diversity](http://similardiversity.net/)) of the most talkative characters of the trilogy.

<br>

### R scripts
**Parsing scripts:** R scripts to parse the movie scripts, extract the characters & dialogues, and produce the data tables:
```
1. parsing_episodeIV.R
2. parsing_episodeV.R
3. parsing_episodeVI.R
```
**Analysis scripts:** R scripts that perform the analysis (run them sequentially):
```
1. get_top_characters.R
2. get_top_characters_network.R
3. get_terms_by_episodes.R
4. ultimate_arc_diagram.R
```
**Functions:** R functions for plotting different types of arc diagrams (these functions are used in some of the above analysis scripts):
```
1. arcDiagram.R
2. arcPies.R
3. arcBands.R
4. arcBandBars.R
```

<br>

### Text Files
**Movie Scripts:** These are the movie scripts (raw text data) which are parsed to produce the dialogue files:
```
1. StarWars_EpisodeIV_script.txt
2. StarWars_EpisodeV_script.txt
3. StarWars_EpisodeVI_script.txt
```
**Text dialogues:** These are *intermediate* files containing the dialogues extracted from the movie scripts:
```
1. EpisodeIV_dialogues.txt
2. EpisodeV_dialogues.txt
3. EpisodeVI_dialogues.txt
```
**Input tables:** These are the input files (in data table format) used for the text mining analysis:
```
1. SW_EpisodeIV.txt
2. SW_EpisodeV.txt
3. SW_EpisodeVI.txt
```
**Output tables:** These are the output files (in data table format) produced by the text mining analysis and used to build the different arc-diagrams:
```
1. top_chars_by_eps.txt
2. top_chars_network.txt
3. top_char_terms.txt
4. weight_edges_graph1.txt
5. weight_edges_graph2.txt
```
<br>

#### PS:
Keep in mind that my main motivation for this project was to replicate, as much as possible, the arc diagram of Similar Diversity by Philipp Steinweber and Andreas Koller. I'm sure there are a lot of parts in the analysis that can be made faster, more efficient, and simpler. However, my quota of spare time is very limited and I haven't had enough time to write the ideal code.
80 changes: 80 additions & 0 deletions Rscripts/DOCUMENTATION.md
@@ -0,0 +1,80 @@
## Star Wars Arc Diagram
by [Gaston Sanchez](http://www.gastonsanchez.com/)

**Motivation**
The main idea behind this project is to reproduce, as much as possible, an arc-diagram in R like the one depicted in [Similar Diversity](http://similardiversity.net/) (by Philipp Steinweber and Andreas Koller).

**Long story short**
I think I saw the arc-diagram of ***Similar Diversity*** for the first time back in 2010. Every time I contemplated that diagram, both amazed and amused, I ended up wondering how on earth Philipp and Andreas did it. Finally, one day I couldn't stand my questions anymore, so I decided to try to make my own arc-diagram in R... and this project is the result of that attempt. Although I haven't been able to obtain an identical representation of the Similar Diversity arc diagram, I'm really happy with what I got, and I think it comes pretty close.

**Source**
In my case, I decided to use the movie scripts from the original Star Wars trilogy (Episodes IV, V, and VI) as the text data for my analysis. I found the movie scripts at [Ben and Grover](http://www.benandgrover.com/scripts.asp) and [corky.net](http://corky.net/scripts/), where you can also find other scripts if you want to play with other movies.

----------------------------------

## Analysis recipe:

### Step 1) Text Parsing
Parse the text data of the movie scripts to extract the dialogues of each character. The final output files are the extracted dialogues in table format (one table per movie):
```
Movie script text files | Parsed files | Data table files
------------------------------- | ------------------------ | -----------------
StarWars_EpisodeIV_script.txt | EpisodeIV_dialogues.txt | SW_EpisodeIV.txt
StarWars_EpisodeV_script.txt | EpisodeV_dialogues.txt | SW_EpisodeV.txt
StarWars_EpisodeVI_script.txt | EpisodeVI_dialogues.txt | SW_EpisodeVI.txt
```
The R scripts for this step are:
```
1. parsing_episodeIV.R
2. parsing_episodeV.R
3. parsing_episodeVI.R
```
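The parsing step can be pictured with a small base-R sketch. This is a hypothetical simplification, not the actual logic of the `parsing_episode*.R` scripts: it assumes a speaker's name appears as an all-uppercase line and that the single line after it is the dialogue.

```r
# Toy screenplay fragment (hypothetical sample, not the real script text)
raw_lines <- c(
  "          LUKE",
  "     But I was going into Tosche Station!",
  "          THREEPIO",
  "     I'm afraid that's quite impossible.")

# Assumption: an all-uppercase line is a character name,
# and the line right after it is that character's dialogue
parse_dialogues <- function(lines) {
  lines <- trimws(lines)
  is_name <- grepl("^[A-Z][A-Z' -]+$", lines)
  data.frame(
    character = lines[is_name],
    dialogue  = lines[which(is_name) + 1],
    stringsAsFactors = FALSE)
}

dlg <- parse_dialogues(raw_lines)
```

A real parser also has to cope with scene headings, stage directions, and dialogues spanning several lines, which is where most of the work in the actual scripts goes.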
------------------------------------
### Step 2) Identify top characters
Once we have the dialogues in table format, the next step is to identify the most talkative characters of the trilogy, that is, the characters with the greatest number of dialogues in English. This implies performing a frequency analysis on the number of dialogues per character. Unfortunately, Artoo and Chewie are excluded since they don't speak English. The R script for this step is:
```
get_top_characters.R
```
The output file is:
```
top_chars_by_eps.txt
```
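At its core, this step is a frequency count of dialogues per character. A minimal sketch with made-up data (the actual script works on the `SW_Episode*.txt` tables):

```r
# Hypothetical dialogue table: one row per dialogue, as in the parsed tables
dialogues <- data.frame(
  character = c("LUKE", "HAN", "LUKE", "LEIA", "LUKE", "HAN"),
  stringsAsFactors = FALSE)

# count dialogues per character and rank them
freqs <- sort(table(dialogues$character), decreasing = TRUE)
top_chars <- names(freqs)
```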
------------------------------------
### Step 3) Network between top characters
Having identified the top characters (i.e., the most talkative ones) of the trilogy, the following step is to build a network between them. The way I get the network is by looking at associations between the words that appear in the dialogues of the top characters. This process involves a text mining analysis with the help of the ```tm``` package. In turn, the network is obtained using the ```igraph``` package. The R script for this step is:
```
get_top_characters_network.R
```
The output files are:
```
top_chars_network.txt
weight_edges_graph1.txt
weight_edges_graph2.txt
```
For exploratory purposes, we can get some preliminary visualizations using some of the plotting functions:
```
1. arcDiagram.R
2. arcPies.R
3. arcBands.R
```
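To illustrate the edge-weighting idea without the `tm` and `igraph` machinery, here is a simplified base-R sketch that connects two characters by the number of distinct words their dialogues share. The toy dialogues and the raw word-overlap measure are my own simplification, not the association scores the actual script computes:

```r
# Hypothetical mini-corpus: one string of pooled dialogue per character
dialogues <- list(
  LUKE = "the force is strong the ship is fast",
  HAN  = "the ship is fast and the odds are bad",
  LEIA = "the plans are in the ship")

# vocabulary of each character
words <- lapply(dialogues, function(s) unique(strsplit(s, " ")[[1]]))

# weighted edge list: one row per pair of characters,
# weight = number of distinct words they have in common
pairs <- combn(names(words), 2)
edges <- data.frame(
  from   = pairs[1, ],
  to     = pairs[2, ],
  weight = apply(pairs, 2, function(p)
    length(intersect(words[[p[1]]], words[[p[2]]]))),
  stringsAsFactors = FALSE)
```

An edge list in this shape is exactly what `igraph` consumes to build the graph.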
------------------------------------
### Step 4) Identify top terms of top characters
The next step is to identify the top terms said by the top characters, according to the episodes in which they appear. The results obtained in this phase are used for plotting the bar-charts below each character in the final arc-diagram. The R script for this step is:
```
get_terms_by_episodes.R
```
The output file is:
```
top_char_terms.txt
```
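The idea behind this step can be sketched with plain term counting. The toy lines below are hypothetical; the actual script derives the terms from the parsed dialogue tables, broken down by episode:

```r
# Hypothetical pooled dialogue lines for one character
luke_lines <- c("the force is strong", "trust the force")

# split into terms and count occurrences
terms <- unlist(strsplit(luke_lines, " "))
term_freq <- sort(table(terms), decreasing = TRUE)

# the most frequent terms feed the bar-chart under that character
head(term_freq, 3)
```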
------------------------------------
### Step 5) Ultimate Arc-Diagram
The last step is to get the final arc-diagram containing the top characters, their participation by episode, and the associated top terms from their dialogues. The size of the bands around each character reflects how much they talk in the movies, as well as their participation in each episode. The width of the arcs connecting two characters reflects the number of words they have in common. In turn, the color of the arcs reflects the association between two characters, depending on the similarity of the words in their dialogues. The R script for this step is:
```
ultimate_arc_diagram.R
```
The plotting function is:
```
arcBandBars.R
```
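The geometry behind the diagram is simple to state: each connecting arc is a semicircle centered at the midpoint of the two node positions, with radius equal to half the distance between them. This mirrors the computation inside `arcBandBars.R`:

```r
# two nodes at x-positions a and b on the horizontal axis
a <- 0.2
b <- 0.8

# semicircular arc between them
radius <- abs(a - b) / 2
center <- (a + b) / 2
z <- seq(0, pi, length.out = 100)   # angles for the upper half-circle
x <- center + radius * cos(z)
y <- radius * sin(z)
# lines(x, y) would draw the arc on an open plot
```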
184 changes: 184 additions & 0 deletions Rscripts/arcBandBars.R
@@ -0,0 +1,184 @@
############################################################################
# Title: arcBandBars.R
# Description: function to plot an arc-diagram with bands and bar-charts
# Author: Gaston Sanchez
# www.gastonsanchez.com
# License: BSD Simplified License
# http://www.opensource.org/license/BSD-3-Clause
# Copyright (c) 2012, Gaston Sanchez
# All rights reserved
############################################################################

arcBandBars <- function(
edgelist, bands, bars, col.bands=NULL, sorted=TRUE, decreasing=FALSE,
lwd=NULL, col=NULL, cex=NULL, col.nodes=NULL, cex.terms=NULL, col.terms=NULL,
lend=1, ljoin=2, lmitre=1, bg=NULL, mar=c(4,1,3,1))
{
# ARGUMENTS
# edgelist: two-column matrix with edges
# bands: numeric matrix with rows=nodes and columns=numbers
# bars: list of numeric tables with proportions for bar-charts
# sorted: logical to indicate if nodes should be sorted
# decreasing: logical to indicate type of sorting (used only when sorted=TRUE)
# lwd: widths for the arcs (default 1)
# col: color for the arcs (default "gray50")
# cex: magnification of the nodes labels (default 1)
# col.nodes: color of the nodes labels (default "gray50")
# cex.terms: magnification of the terms in bar charts
# col.terms: color of the terms in bar charts
# lend: the line end style for the arcs (see par)
# ljoin: the line join style for the arcs (see par)
# lmitre: the line mitre limit for the arcs (see par)
# bg: background color (default "white")
# mar: numeric vector for margins (see par)

# make sure edgelist is a two-col matrix
if (!is.matrix(edgelist) || ncol(edgelist)!=2)
stop("argument 'edgelist' must be a two column matrix")
edges = edgelist
# how many edges
ne = nrow(edges)
# get nodes
nodes = unique(as.vector(edges))
nums = seq_along(nodes)
# how many nodes
nn = length(nodes)
# enumerate (sort nodes if requested)
if (sorted) {
nodes = sort(nodes, decreasing=decreasing)
nums = order(nodes, decreasing=decreasing)
}
# make sure bands is correct
if (!is.matrix(bands) && !is.data.frame(bands))
stop("argument 'bands' must be a numeric matrix or data frame")
if (is.data.frame(bands))
bands = as.matrix(bands)
if (nrow(bands) != nn)
stop("number of rows in 'bands' is different from number of nodes")

# check default argument values
if (is.null(lwd)) lwd = rep(1, ne)
if (length(lwd) != ne) lwd = rep(lwd, length=ne)
if (is.null(col)) col = rep("gray50", ne)
if (length(col) != ne) col = rep(col, length=ne)
if (is.null(col.nodes)) col.nodes = rep("gray50", nn)
if (length(col.nodes) != nn) col.nodes = rep(col.nodes, length=nn)
if (!is.null(cex) && length(cex) != nn) cex = rep(cex, length=nn)
if (is.null(bg)) bg = "white"

# nodes frequency from bands
nf = rowSums(bands) / sum(bands)
# words center coordinates
fin = cumsum(nf)
ini = c(0, cumsum(nf)[-nn])
centers = (ini + fin) / 2
names(centers) = nodes
# node radii
nrads = nf / 2

# arcs coordinates
# matrix with numeric indices
e_num = matrix(0, nrow(edges), ncol(edges))
for (i in 1:nrow(edges))
{
e_num[i,1] = centers[which(nodes == edges[i,1])]
e_num[i,2] = centers[which(nodes == edges[i,2])]
}
# max arc radius
radios = abs(e_num[,1] - e_num[,2]) / 2
max_radios = which(radios == max(radios))
max_rad = unique(radios[max_radios] / 2)
# arc locations
locs = rowSums(e_num) / 2

# function to get pie segments
t2xy <- function(x1, y1, u, rad)
{
t2p <- pi * u + 0 * pi/180
list(x2 = x1 + rad * cos(t2p), y2 = y1 + rad * sin(t2p))
}

# plot
par(mar = mar, bg=bg)
plot.new()
plot.window(xlim=c(-0.025, 1.025), ylim=c(-0.7*max_rad, 1*max_rad*2))
# plot connecting arcs
z = seq(0, pi, l=100)
for (i in 1:ne)
{
radio = radios[i]
x = locs[i] + radio * cos(z)
y = radio * sin(z)
lines(x, y, col=col[i], lwd=lwd[i],
lend=lend, ljoin=ljoin, lmitre=lmitre)
}
# plot node bands
for (i in 1:nn)
{
radius = nrads[i]
p = c(0, cumsum(bands[i,] / sum(bands[i,])))
dp = diff(p)
np = length(dp)
angle <- rep(45, length.out = np)
for (k in 1:np)
{
n <- max(2, floor(200 * dp[k]))
P <- t2xy(centers[i], 0, seq.int(p[k], p[k+1], length.out=n), rad=radius)
polygon(c(P$x2, centers[i]), c(P$y2, 0), angle=angle[k],
border=NA, col=col.bands[k], lty=0)
}
# draw white circles
theta = seq(0, pi, length=100)
x3 = centers[i] + 0.7*nrads[i] * cos(theta)
y3 = 0 + 0.7*nrads[i] * sin(theta)
polygon(x3, y3, col=bg, border=bg, lty=1, lwd=2)
}
# default size for node labels
if (is.null(cex)) {
cex = nf
cex[nf < 0.01] = 0.01
cex = cex * 5
}
# add node names
text(centers, 0, nodes, cex=cex, adj=c(0.5,0), col=col.nodes)

# plot bar-charts below each node
# max number of bar-chart divisions
bar_divs = max(sapply(bars, nrow)) + 1
# heights (I'm adding one more division to plot blank lines)
yh = seq(-0.05*max_rad, -0.7*max_rad, length.out=bar_divs+1)
# default cex effect for the terms
if (is.null(cex.terms)) {
cex.terms = nf
cex.terms[nf < 0.01] = 0.01
cex.terms = cex.terms * 3
}
# for each node
for (w in nums)
{
xrange = fin[w] - ini[w]
# for each term (row) in this node's bar-chart
nadjs = nrow(bars[[w]])
nc = ncol(bars[[w]])
for (i in 1:nadjs)
{
Bp_ranges = bars[[w]][i,] * xrange
ord = order(Bp_ranges, decreasing=TRUE)
Bp_ranges_ord = sort(Bp_ranges, decreasing=TRUE)
Bp_intervals = ini[w] + cumsum(Bp_ranges_ord)
x_start = c(ini[w], Bp_intervals[-nc])
x_end = Bp_intervals
cols_ord = col.bands[ord]
for (j in 1:nc)
{
rect(x_start[j], yh[i+1], x_end[j], yh[i], col=cols_ord[j],
border=NA, lwd=0, lty=1)
}
# add white line in right border
lines(rep(x_end[nc],nadjs+1), yh[1:(nadjs+1)], col=bg)
# add labels of terms
text(x_start[1], yh[i+1], rownames(bars[[w]])[i],
col=col.terms, cex=cex.terms[w], adj=c(-0.05,-0.4))
}
}
}
