<a href="https://colab.research.google.com/github/christianbentz/Workshop_DGfS2022/blob/main/Code/Application2/Visualization_WO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Order Entropy Visualization

Author: Chris Bentz

Date: 18/02/2022

## Install Libraries

Some packages come pre-installed in the Jupyter/Colab runtime, while others have to be installed first. Run the cell below to make sure that the packages needed by this notebook are available.


In [None]:
install.packages("ggrepel")

## Load Libraries
If a library is not installed yet, install it first, for example with the command: install.packages("ggplot2").

In [10]:
library(ggplot2)
library(ggrepel)

## Load Data
Load the file with maximum likelihood (ML) estimates of word order entropies and unigram orthographic word entropies across ca. 1000 languages.

In [None]:
h.est.wo <- read.csv(file = "/content/results/wordorder_tokens.csv")
head(h.est.wo)
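
Before cleaning and plotting, it can help to confirm that the loaded table contains the columns used later in this notebook. The following is a minimal sketch on a small made-up data frame: `toy` and all of its values are invented for illustration, and only the column names are taken from the code in this notebook.

```r
# Toy stand-in for h.est.wo; the values are invented for illustration only
toy <- data.frame(
  ISO = c("aaa", "bbb"),
  corpus = c("PBC", "PBC"),
  Num_Sentences = c(120, 8),
  H_ML = c(9.1, 10.3),
  H_SO = c(0.2, 0.7)
)

# Columns referenced later in this notebook
required <- c("ISO", "corpus", "Num_Sentences", "H_ML", "H_SO")

# Names that are required but absent from the table (empty if all are present)
missing_cols <- setdiff(required, names(toy))
missing_cols
```

Replacing `toy` with `h.est.wo` would apply the same check to the loaded file.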

### Data Cleaning
Use only data from the Parallel Bible Corpus (PBC); this also removes NAs. In addition, languages can be included or excluded according to how many sentences they have in the word order variable.

In [25]:
# use PBC
h.est.wo.clean <- h.est.wo[h.est.wo$corpus == "PBC", ]
# only use languages with more than x sentences in the word order variable (here x = 10)
h.est.wo.clean <- h.est.wo.clean[h.est.wo.clean$Num_Sentences > 10, ]
# check how many rows, i.e. languages, are left
nrow(h.est.wo.clean)
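
One detail of logical indexing in R is worth keeping in mind here: when the condition itself evaluates to NA, the result contains a row filled with NA rather than dropping that row. Whether this matters depends on the actual data. The sketch below uses a made-up data frame (`toy`, with invented values and the invented corpus label "UDHR") to show the behaviour and one NA-safe alternative using `which()`.

```r
# Toy data; values and the "UDHR" label are invented for illustration only
toy <- data.frame(
  corpus = c("PBC", "UDHR", NA, "PBC"),
  Num_Sentences = c(120, 50, 30, NA)
)

# Plain logical indexing: a condition that is NA yields a row filled with NA
naive <- toy[toy$corpus == "PBC", ]
nrow(naive)

# which() drops NA conditions, so only rows that truly match both filters are kept
safe <- toy[which(toy$corpus == "PBC" & toy$Num_Sentences > 10), ]
nrow(safe)
```

In the toy example, the naive filter returns three rows (one of them all NA), while the `which()` version returns only the single row that satisfies both conditions.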

## Scatterplots
Create a scatterplot with the unigram word entropy estimates on the x-axis and the SO order entropy estimates on the y-axis.

In [None]:
h.plot <- ggplot(h.est.wo.clean, aes(x = H_ML, y = H_SO)) +
  geom_point(alpha = 0.8, size = 1) +
  geom_smooth(method = "lm") +
  #geom_label_repel(aes(label = ISO), label.size = 0.2, size = 3) + 
  labs(x = "Unigram Entropy for Words (ML Estimate)", 
       y = "Entropy of SO Order (ML Estimate)")
h.plot

## Linear model
Fit a simple linear model which corresponds to the linear model (lm) smoother in the scatterplot.

In [None]:
model <- lm(H_SO ~ H_ML, data = h.est.wo.clean)
summary(model)
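
To inspect a fit beyond the printed summary, the coefficients, the R-squared value and confidence intervals can be extracted from the model object. The following is a minimal sketch on simulated data: the variables `x`, `y`, `toy_model` and all values are made up and unrelated to the entropy data.

```r
set.seed(1)

# Simulated predictor and response with a known linear relation (y = 2 - 0.5 * x plus noise)
x <- seq(1, 10, length.out = 50)
y <- 2 - 0.5 * x + rnorm(50, sd = 0.1)
toy_model <- lm(y ~ x)

# Intercept and slope of the fitted line
# (the same line that geom_smooth(method = "lm") draws for these variables)
coef(toy_model)

# Proportion of variance explained by the model
summary(toy_model)$r.squared

# 95% confidence intervals for intercept and slope
confint(toy_model)
```

The same three calls can be applied to `model` from the cell above.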

## Save Figures
Save the complete figure to file.

In [None]:
ggsave("/content/WordOrderEntropyPlot.pdf", h.plot, dpi = 300, 
       scale = 1, device = cairo_pdf)