NaLaFi

This repo contains data and code to work on Natural Language Fingerprints (NaLaFi), and to replicate the results in Bentz (2023). The code should be run in the following order:

Data Generation

randomStringGenerator.Rmd: generates random strings for comparison to natural languages and other sign strings.
shuffledTextGenerator.Rmd: randomly shuffles the characters of each file in the folder NaLaFi/data/writing.
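
The two data generation steps above can be illustrated with a minimal Python sketch. This is not the repository's R code: the function names, the uniform sampling over a fixed alphabet, and the fixed seed are all assumptions made for illustration, since the README does not specify how the random strings are drawn.

```python
import random


def random_string(alphabet, length, rng):
    """Draw `length` characters uniformly at random (with replacement) from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def shuffle_characters(text, rng):
    """Return `text` with its characters in random order; character counts stay the same."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(42)  # fixed seed: an assumption, used here only for reproducibility
noise = random_string("abcd", 20, rng)
shuffled = shuffle_characters("the quick brown fox", rng)
```

Shuffling keeps the character frequencies of the original text but destroys their order, so measures based only on character counts are unaffected while order-sensitive measures can change.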

Simple Stats for the Data

simpleStats.Rmd: gives an overview of the files in NaLaFi/data, namely the number of files per subcorpus and the number of characters per file.

Sampling of Character Strings

sampler.Rmd: samples chunks of UTF-8 characters of pre-defined length (e.g. 10, 100, 1000) and stores them in NaLaFi/samples. Note that this folder should be emptied before re-running the code.
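
As a rough illustration of fixed-length sampling, the Python sketch below draws chunks from random start positions in a text. It is not the repository's R code: the function name `sample_chunks`, the use of random (possibly overlapping) start offsets, and the example text are assumptions, since the README does not say how chunk positions are chosen.

```python
import random


def sample_chunks(text, chunk_length, n_chunks, rng):
    """Return `n_chunks` substrings of exactly `chunk_length` characters.

    Python strings are indexed by Unicode code point, so a chunk of
    length 10 holds 10 characters regardless of their UTF-8 byte size.
    """
    if len(text) < chunk_length:
        return []  # text too short to yield a full chunk
    last_start = len(text) - chunk_length
    starts = [rng.randint(0, last_start) for _ in range(n_chunks)]
    return [text[s:s + chunk_length] for s in starts]


rng = random.Random(0)  # fixed seed: an assumption, for reproducibility only
text = "naïve café über straße " * 10
chunks = sample_chunks(text, 10, 5, rng)
```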

Estimations of Feature Values

estimations.Rmd: calculates the feature values (TTR, unigram entropy, entropy rate, repetition measure) for each string of UTF-8 characters (one per line) in the files of NaLaFi/samples. The output is written to the CSV file NaLaFi/results/features.csv. Note that this file should be deleted before re-running the code.
estimationPlots.Rmd: provides plots for the estimated feature values.
stabilizationAnalyses.Rmd: estimates feature values for different step sizes (i.e. given numbers of characters), and creates plots of "stabilization", i.e. how feature values change with the number of characters.
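
Two of the features named above have simple textbook definitions at the character level, sketched here in Python. This is not the repository's R code, and the estimators actually used there may differ: the sketch assumes the type-token ratio is the number of distinct characters divided by the string length, and that unigram entropy is the plug-in (maximum likelihood) estimate in bits. Entropy rate and the repetition measure are left out because the README does not define them.

```python
import math
from collections import Counter


def type_token_ratio(s):
    """Distinct characters (types) divided by total characters (tokens)."""
    return len(set(s)) / len(s) if s else 0.0


def unigram_entropy(s):
    """Plug-in estimate of the character unigram entropy, in bits."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


ttr = type_token_ratio("abab")      # 2 types / 4 tokens = 0.5
entropy = unigram_entropy("abab")   # two equiprobable characters = 1.0 bit
```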

Classification

classificationKnn.Rmd: classifies the character strings into "writing" and "non-writing" based on the feature values (TTR, unigram entropy, entropy rate, repetition rate) with the k-nearest neighbor method, and stores the results in results/KNN.
classificationLR.Rmd: classifies the character strings with a logistic regression (LR) model, and stores the results in results/LR.
classificationSVM.Rmd: classifies the character strings with a support vector machine (SVM), and stores the results in results/SVM.
classificationMLP.Rmd: classifies the character strings with different Multilayer Perceptron (MLP) architectures, and stores the results in results/MLP.
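
To make the idea of classifying strings by their feature vectors concrete, here is a minimal k-nearest-neighbor sketch in plain Python. It is not the repository's R code: the function name `knn_predict`, the Euclidean distance, the majority vote, the value of `k`, and the two-dimensional toy feature vectors are all assumptions, and the numbers are invented for illustration rather than taken from the data or results.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy feature vectors (two features each), NOT values from the repository.
train_points = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.75),
]
train_labels = ["writing"] * 3 + ["non-writing"] * 3

prediction = knn_predict(train_points, train_labels, (0.2, 0.2), k=3)
```

With real features on different scales, the values would typically be standardized first, because Euclidean distance is otherwise dominated by the feature with the largest range.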

Hyperparameters

HyperParamTuning.Rmd: gives diagnostic plots for hyperparameter values and model performance.

Reference

Bentz (2023). The Zipfian Challenge: Learning the statistical fingerprint of natural languages. CoNLL, Singapore.
