
# Title: Text Preprocessing

# Author: Chris Bentz

# Date: 

# Install Libraries
Some packages come pre-installed in the Jupyter environment; others must be installed first. Run the cell below to install the packages (libraries) this notebook needs.

In [1]:
install.packages("gsubfn")
install.packages("readr")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘proto’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# Load Libraries
If one of the libraries below is not installed yet, install it first with install.packages(), for example: install.packages("ggplot2")

In [9]:
library(gsubfn)
library(readr)

Loading required package: proto

“no DISPLAY variable so Tk is not available”
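
As a minimal sketch of the install-if-missing idea described above, the helper below checks which packages are absent before anything is installed. The function name `missing_packages` is hypothetical and not part of the original notebook; `requireNamespace()` is base R.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (without an error) if a package is absent
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs_needed <- missing_packages(c("gsubfn", "readr"))
print(pkgs_needed)

# Uncomment to install only what is missing:
# if (length(pkgs_needed) > 0) install.packages(pkgs_needed)
```

Checking first avoids re-installing packages on every run of the notebook.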


# List Files
Create a list of all files to be processed. The path must point to the folder where the raw text files are stored.

In [7]:
# file list for the UDHR and Voynich texts
# (recursive = TRUE also searches subfolders; full.names = TRUE returns full paths)
file.list <- list.files(path = "/content/original", 
                        recursive = TRUE, full.names = TRUE)
head(file.list)
length(file.list)
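
To illustrate what the listing call returns, here is a small sketch that builds a throwaway folder and lists it. The folder and file names are made up for the demonstration, and the `pattern` argument is an optional addition that the original cell does not use.

```r
# Build a temporary folder with a subfolder and a few dummy files.
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("a", file.path(demo_dir, "writing_xxx_0001.txt"))
writeLines("b", file.path(demo_dir, "sub", "writing_yyy_0001.txt"))
writeLines("c", file.path(demo_dir, "notes.md"))

# recursive = TRUE descends into "sub"; full.names = TRUE keeps the folder path;
# pattern restricts the result to files ending in ".txt".
txt_files <- list.files(path = demo_dir, pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)
print(basename(txt_files))
print(length(txt_files))
```

Filtering by pattern can help when the raw-text folder also contains files that should not be processed.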

In [11]:
# capture starting time
start_time <- Sys.time()

# set a counter 
counter <- 0

# use for-loop to run through all files
for (file in file.list) { 
  # load the text file ("skip" is the number of lines to skip at the top,
  # "nmax" is the maximum number of lines to read)
  textfile <- scan(file, what = "character", quote = "", comment.char = "", 
                   encoding = "UTF-8", sep = "\n", skip = 8, nmax = 100) 
  # remove tabs, parentheses, and further symbols that cause processing issues
  textfile <- gsubfn(".", list("\t" = "", "(" = "", ")" = "", "]" = "",
                              "[" = "", "}" = "",  "{" = "", "*" = "", "+" = ""), textfile)
  # remove line annotations marked by '<>' ("[^>]*" stops at the first '>', so
  # text between two annotations on the same line is kept)
  textfile <- gsub("<[^>]*>", "", textfile) 
  # get filename
  filename <- basename(file)
  print(filename)
  # split the text into individual UTF-8 characters. The output of strsplit() 
  # is a list, so it needs to be "unlisted" to get a vector. Note that white spaces are
  # counted as UTF-8 characters here.
  chars <- unlist(strsplit(textfile, ""))
  # remove white spaces from character vector
  chars <- chars[chars != " "] 
  # remove any NA entries left in the character vector
  chars <- chars[!is.na(chars)]
  # print the first six characters as a quick check
  print(head(chars))
  # write the characters to file, separated by single spaces
  # (the output folder must already exist)
  output.file <- paste("/content/processed/", filename, sep = "")
  write_lines(chars, output.file, sep = " ")
  # increment and print the counter to track progress
  counter <- counter + 1
  print(counter)
}
# capture the end time and print the elapsed time
end_time <- Sys.time()
end_time - start_time

[1] "unclassified_voy_0001.txt"
[1] "f" "a" "c" "h" "y" "s"
[1] 1
[1] "writing_aii_0001.txt"
[1] "ܥ" "ܘ" "ܬ" "ܕ" "ܐ" "ܡ"
[1] 2
[1] "writing_arb_0001.txt"
[1] "ا" "ل" "د" "ي" "ب" "ا"
[1] 3
[1] "writing_azj_0001.txt"
[1] "П" "Р" "Е" "А" "М" "Б"
[1] 4
[1] "writing_azj_0002.txt"
[1] "P" "R" "E" "A" "M" "B"
[1] 5
[1] "writing_ben_0001.txt"
[1] "ম" "ু"  "খ" "ব" "ন" "্" 
[1] 6
[1] "writing_blt_0001.txt"
[1] "ꪹ" "ꪜ" "ꪸ"  "ꪙ" "ꪭ" "ꪴ" 
[1] 7
[1] "writing_bod_0001.txt"
[1] "ས" "ྔ"  "ོ"  "ན" "་" "བ"
[1] 8
[1] "writing_bos_0001.txt"
[1] "У" "В" "О" "Д" "Б" "У"
[1] 9
[1] "writing_bos_0002.txt"
[1] "U" "V" "O" "D" "B" "U"
[1] 10
[1] "writing_chr_0001.txt"
[1] "Ꭼ" "ꮒ" "ᏻ" "ꮙ" "Ꭷ" "ꮓ"
[1] 11
[1] "writing_cmn_0001.txt"
[1] "序" "言" "鉴" "于" "对" "人"
[1] 12
[1] "writing_cmn_0002.txt"
[1] "序" "言" "鑑" "於" "對" "人"
[1] 13
[1] "writing_csw_0001.txt"
[1] "ᓂ" "ᑲ" "ᓂ" "ᐃ" "ᑗ" "ᐎ"
[1] 14
[1] "writing_div_0001.txt"
[1] "ދ" "ީ"  "ބ" "ާ"  "ޖ" "ާ" 
[1] 15
[1] "writing_ell_0001.txt"
[1] "Π" "Ρ" "Ο" "Ο" "Ι" "Μ"
[1] 16
[1]

Time difference of 18.66021 secs
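
The cleaning steps in the loop can be tried on a small in-memory example before running them over all files. The sketch below uses base R only; the helper name `clean_to_chars`, the sample lines, and the temporary output folder are made up for illustration and are not taken from the corpus. The annotation pattern used here is `<[^>]*>`, which removes each `<...>` span separately.

```r
# Hypothetical helper: apply the cleaning steps with base R only.
clean_to_chars <- function(lines) {
  # drop annotations such as <f1r.1>; [^>]* stops at the first closing '>'
  lines <- gsub("<[^>]*>", "", lines)
  # drop tabs and the bracket/operator symbols that the loop removes
  lines <- gsub("[\t(){}*+]|\\[|\\]", "", lines)
  # split into single characters and drop plain spaces
  chars <- unlist(strsplit(lines, ""))
  chars[chars != " "]
}

# Made-up sample lines, not real corpus data.
sample_lines <- c("<f1r.1>\tfachys ykal", "(a)[b]{c}*d+ <note>e")
chars <- clean_to_chars(sample_lines)
print(chars)

# Write one space-separated line to a temporary folder and read it back.
out_dir <- file.path(tempdir(), "processed_demo")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(out_dir, "sample.txt")
writeLines(paste(chars, collapse = " "), out_file)
read_back <- scan(out_file, what = "character", sep = " ", quote = "", quiet = TRUE)
print(identical(read_back, chars))
```

Creating the output folder with `dir.create()` before writing avoids a failed write when the folder does not exist yet. Note that `writeLines()` ends the line with a newline, so the file differs slightly from one written with `write_lines(chars, file, sep = " ")`.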