<a href="https://colab.research.google.com/github/christianbentz/Workshop_DGfS2022/blob/main/Code/Application1/EstimationML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Entropy Estimation with the Maximum Likelihood Method

Author: Chris Bentz

Date: 18/02/2022

# Install Libraries
Some packages come pre-installed in the notebook environment, but others do not. Run this cell to install the packages that the rest of the notebook needs.

In [2]:
install.packages("stringr")
install.packages("entropy")
install.packages("quanteda")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘ISOcodes’, ‘fastmatch’, ‘RcppParallel’, ‘SnowballC’, ‘stopwords’, ‘RcppArmadillo’
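
Reinstalling all three packages on every run can be slow. One possible alternative, sketched below, is to check first which packages are missing and install only those. `missing_packages` is a hypothetical helper introduced for illustration, not part of the original notebook, and the actual install call is left commented out.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The three packages this notebook uses.
to_install <- missing_packages(c("stringr", "entropy", "quanteda"))
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```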




# Load Libraries

If a library is not installed yet, install it first, for example with the command: install.packages("stringr").

In [3]:
library(stringr)
library(entropy)
library(quanteda)

Package version: 3.2.0
Unicode version: 10.0
ICU version: 60.2

Parallel computing: 2 of 2 threads used.

See https://quanteda.io for tutorials and examples.



# List Files
Create a character vector with the paths of all files to be processed.

In [6]:
filelist <- list.files(path = "/content/processed", 
                        recursive = TRUE, full.names = TRUE)
head(filelist)
length(filelist)
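
If the input directory might also contain files that are not texts, the listing can be restricted with the `pattern` argument of `list.files()`. The sketch below assumes that only `.txt` files should be processed. It runs on a throwaway directory under `tempdir()` whose layout and file names are made up for illustration.

```r
# Build a throwaway directory that mimics the assumed layout.
demo_dir <- file.path(tempdir(), "processed_demo")
dir.create(file.path(demo_dir, "writing"), recursive = TRUE,
           showWarnings = FALSE)
writeLines("a b c", file.path(demo_dir, "writing", "writing_arb_0001.txt"))
writeLines("notes", file.path(demo_dir, "README.md"))

# Keep only paths that end in ".txt".
demo_files <- list.files(path = demo_dir, pattern = "\\.txt$",
                         recursive = TRUE, full.names = TRUE)

# Stop early if nothing was found, e.g. because the path is wrong.
stopifnot(length(demo_files) > 0)
print(basename(demo_files))
```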

# Character Entropy Estimation
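Estimate the entropy of character unigrams, bigrams, and trigrams for each file, using the relative frequencies of the n-grams in the text as probability estimates (the maximum likelihood method).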
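
As a point of reference, the maximum likelihood (plug-in) estimator replaces each probability with the observed relative frequency $\hat{p}_i = n_i / N$ and computes $\hat{H} = -\sum_i \hat{p}_i \log_2 \hat{p}_i$. The sketch below reproduces this in base R on a toy sequence. `ml_entropy` is a hypothetical helper written for illustration, not the implementation in the `entropy` package.

```r
# Hypothetical helper: plug-in (maximum likelihood) entropy in bits.
ml_entropy <- function(counts) {
  counts <- counts[counts > 0]   # zero counts contribute nothing
  p <- counts / sum(counts)      # relative frequencies
  -sum(p * log2(p))
}

# Toy sequence: a occurs 4 times, b twice, c and d once each.
toy <- c("a", "b", "a", "c", "a", "b", "a", "d")
h_toy <- ml_entropy(as.numeric(table(toy)))
print(h_toy)  # 1.75
```

With probabilities 1/2, 1/4, 1/8, and 1/8, the sum is 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits.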

In [7]:
# start time
start_time <- Sys.time()

# set counter
counter <- 0

# initialize data frame to append results to
entropy.df <- data.frame(filename = character(0), subcorpus = character(0), 
                         id = character(0), h.unigrams = numeric(0),
                         h.bigrams = numeric(0), h.trigrams = numeric(0)) 

for (file in filelist) {
  # load the text file as a vector of space-separated tokens ("skip" is the
  # number of lines to skip; a non-positive "nmax" reads to the end of the file)
  chars <- scan(file, what = "character", quote = "", comment.char = "", 
                encoding = "UTF-8", sep = " ", skip = 0, nmax = -1) 
  # get filename
  filename <- basename(file) 
  # get subcorpus category
  subcorpus <- sub("_.*", "", filename)
  # get three letter identifier from filename
  id <- substr(str_extract(filename, "_.*_"), 2, 4) 
  
  # estimate entropy for character unigrams
  chars.uni.df <- as.data.frame(table(chars))
  h.unigrams <- entropy(chars.uni.df$Freq, method = "ML", unit = "log2")
  
  # estimate entropy for character bigrams
  chars.bi <- char_ngrams(chars, n = 2, concatenator = "")
  chars.bi.df <- as.data.frame(table(chars.bi)) 
  h.bigrams <- entropy(chars.bi.df$Freq, method = "ML", unit = "log2")
  
  # estimate entropy for character trigrams
  chars.tri <- char_ngrams(chars, n = 3, concatenator = "")
  chars.tri.df <- as.data.frame(table(chars.tri)) 
  h.trigrams <- entropy(chars.tri.df$Freq, method = "ML", unit = "log2")
  
  # append results to data frame
  local.df <- data.frame(filename, subcorpus, id, h.unigrams, 
                        h.bigrams, h.trigrams)
  entropy.df <- rbind(entropy.df, local.df)
  # counter
  counter <- counter + 1
  # print(counter)
}

# get end time
end_time <- Sys.time()
end_time - start_time

# show final data frame with results
print(entropy.df)

Time difference of 1.729106 secs

                    filename    subcorpus  id h.unigrams h.bigrams h.trigrams
1  unclassified_voy_0001.txt unclassified voy   3.850406  6.182006   8.103153
2       writing_aii_0001.txt      writing aii   4.172559  7.613666   9.779882
3       writing_arb_0001.txt      writing arb   4.548830  8.241311  10.512547
4       writing_azj_0001.txt      writing azj   4.613046  8.077250  10.258100
5       writing_azj_0002.txt      writing azj   4.611793  8.076691  10.257565
6       writing_ben_0001.txt      writing ben   4.955011  8.471218  10.427342
7       writing_blt_0001.txt      writing blt   5.300575  8.838212  10.563588
8       writing_bod_0001.txt      writing bod   4.456135  7.410592   9.402117
9       writing_bos_0001.txt      writing bos   4.603988  7.966511  10.310123
10      writing_bos_0002.txt      writing bos   4.532071  7.854217  10.212521
11      writing_chr_0001.txt      writing chr   5.489144  8.920484  10.558936
12      writing_cmn_0001.txt      writing cmn   7.690635 10.1613
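
Two steps inside the loop can be illustrated in isolation: parsing the metadata out of a file name, and building n-grams from a token vector. The sketch below uses base R only. `ngrams_base` is a hypothetical helper that is assumed to mirror what `char_ngrams()` does with `concatenator = ""` for contiguous n-grams. The file name follows the `subcorpus_id_number.txt` pattern visible in the results table.

```r
# Hypothetical file name following the pattern seen in the results table.
filename <- "writing_arb_0001.txt"

# Split "subcorpus_id_number.txt" at the underscores.
parts <- strsplit(sub("\\.txt$", "", filename), "_", fixed = TRUE)[[1]]
subcorpus <- parts[1]
id <- parts[2]

# Hypothetical base-R helper: join every run of n adjacent tokens.
ngrams_base <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = ""),
         character(1))
}

tokens <- c("a", "b", "a", "b", "c")
bigrams <- ngrams_base(tokens, 2)
trigrams <- ngrams_base(tokens, 3)
print(bigrams)         # "ab" "ba" "ab" "bc"
print(table(bigrams))  # counts that would be passed to the entropy estimator
```

A text of N tokens yields N - n + 1 n-grams, so the higher-order estimates rest on roughly the same number of observations spread over many more possible types.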

# Write Table to File
Write the table to a file as comma-separated values (CSV). This is useful for inspecting the table in a spreadsheet program such as Excel and for loading it into a later session.

In [9]:
write.csv(entropy.df, "/content/results/entropy_ML.csv", row.names = FALSE)
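
`write.csv()` fails if the target directory does not exist, so it can help to create it first. The sketch below shows that step together with reading the file back in. It writes to a temporary directory, and the data frame holds dummy values for illustration only, not results from this notebook.

```r
# Create the output directory if needed (no warning if it already exists).
out_dir <- file.path(tempdir(), "results_demo")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Dummy data frame standing in for entropy.df.
demo_df <- data.frame(filename = "demo_xxx_0001.txt", h.unigrams = 4.5)

# Write without row names, then reload to check the round trip.
out_file <- file.path(out_dir, "entropy_ML.csv")
write.csv(demo_df, out_file, row.names = FALSE)
reloaded <- read.csv(out_file)
print(reloaded)
```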