The Small World of Words (SWOW) project is a scientific project to map word meaning in various languages. In contrast to dictionaries, it focuses on the aspects of word meaning that are shared between people, without imposing restrictions on which aspects of meaning should be considered. The methodology is based on a continued word association task, in which participants see a cue word and are asked to give three associated responses to it.
This repository provides a basic analysis pipeline for the Rioplatense Spanish SWOW project, which allows you to import and preprocess the data as well as compute some basic statistics.
Suggestions are always appreciated, and do not hesitate to get in touch if you have any questions.
In addition to the scripts, you will need to retrieve the word association data. Currently, word association and participant data are available for over 13,000 cues. The data consist of over 3 million responses collected between 2014 and 2022. They have been submitted for publication; note that the final version is subject to change. If you want to use these data for your own research, you can obtain them from the Small World of Words research page.
Please note that the data themselves are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. They cannot be redistributed or used for commercial purposes.
To cite these data:
If you find any of this useful, please consider sharing the word association study.
Since this is an ongoing project, the data are updated regularly. Hence, each data file includes a release date in its filename.
- The release associated with the published paper is 26-04-2022.
- The current release is 16-09-2022.
In order to use a specific release, you have to set the release variable in the settings.R file accordingly.
The raw data consists of the original participant and response data.
- participantID: unique identifier for the participant
- age: age of the participant
- nativeLanguage: native language from a short list of common languages
- gender: gender of the participant (Female / Male / X)
- education: highest level of education: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master
- city: city (city location when tested, might be an approximation)
- country: country (country location when tested)
- responseID: identifier for the response
- section: identifier for the snowball iteration (e.g. set2017)
- cue: cue word
- R1: primary associative response
- R2: secondary associative response
- R3: tertiary associative response
The file containing this raw data follows the naming convention SWOW-RP.complete.version.csv, should be located in the data/SWOW/raw folder, and is the output of the importRawData.R script. It is used by the preprocessData.R script as input (see below).
The preprocessed data consist of normalizations of cues and responses obtained by spell-checking them, correcting capitalization, and Americanizing spelling. This file is generated by the preprocessData.R script.
In many cases, this preprocessed data is used to derive the associative strengths (i.e. the conditional probability of a response given a cue). These data can be derived using the createAssoStrengthTable.R script.
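As a toy illustration of this conditional probability (the cue, responses, and counts below are made up for the example, not taken from SWOW-RP, and createAssoStrengthTable.R works on the full response tables):

```python
from collections import Counter

# Hypothetical responses given to a single cue, e.g. "mascota".
# Associative strength of a response = its count / total responses to the cue.
responses = ["perro", "perro", "gato", "animal"]

counts = Counter(responses)
total = sum(counts.values())
strengths = {resp: n / total for resp, n in counts.items()}

print(strengths)  # {'perro': 0.5, 'gato': 0.25, 'animal': 0.25}
```

By construction, the strengths for a cue sum to 1, which is what makes them conditional probabilities.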
The preprocessed data file has the same structure as the raw data file described above. It follows the naming convention SWOW-RP.spellchecked.version.csv and should be located in the data/SWOW/processed folder. It is used by the balanceParticipants.R script as input (see below).
After normalizing cues and responses, the balanceParticipants.R script also extracts a balanced dataset, in which each cue is judged by exactly 70 participants. Because each participant generated three responses, each cue has 210 associations. The participants were selected to favor native speakers.
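The balancing idea can be sketched as follows. The field names, tiny cue set, and the simple "natives first" selection rule are illustrative assumptions, not the actual logic of balanceParticipants.R:

```python
# Per cue, keep exactly N participants, preferring native speakers.
N = 2  # SWOW-RP keeps 70 per cue; a small N keeps the toy example readable

rows = [  # hypothetical (cue, participantID, nativeLanguage) records
    ("sol", 1, "Spanish"), ("sol", 2, "English"),
    ("sol", 3, "Spanish"), ("sol", 4, "Spanish"),
]

by_cue = {}
for cue, pid, lang in rows:
    by_cue.setdefault(cue, []).append((pid, lang))

balanced = {}
for cue, participants in by_cue.items():
    # Stable sort puts native speakers first, then take the first N.
    participants.sort(key=lambda p: p[1] != "Spanish")
    balanced[cue] = [pid for pid, _ in participants[:N]]

print(balanced)  # {'sol': [1, 3]}
```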
The file containing the balanced data follows the naming convention SWOW-RP.R70.version.csv and should be located in the data/SWOW/processed folder. It is used by all downstream analysis scripts (see below).
Use createSWOWENGraph.R to extract the largest strongly connected component for graphs based on the first response (R1) or all responses (R123). The results are written to data/SWOW/output/adjacencyMatrices and consist of a file with labels and a sparse file with three columns: row indices, column indices, and the corresponding association frequencies.
In most cases, associative frequencies will need to be converted to associative strengths by dividing each frequency by the sum of all frequencies for a particular cue.
Vertices that are not part of the largest connected component are listed in a report in the data/SWOW/output/centrality subdirectory.
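For intuition, here is how the largest strongly connected component of a small directed graph can be found with Kosaraju's algorithm. This pure-Python sketch on a made-up four-node graph is only a stand-in for whatever graph routine createSWOWENGraph.R actually uses:

```python
def largest_scc(edges, n):
    """Return the vertices of the largest strongly connected component."""
    graph = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)

    # Pass 1: record finish order with an iterative DFS on the graph.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for v in it:
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, iter(graph[v])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse finish order.
    comp, label = [-1] * n, 0
    for u in reversed(order):
        if comp[u] == -1:
            stack, comp[u] = [u], label
            while stack:
                x = stack.pop()
                for v in rev[x]:
                    if comp[v] == -1:
                        comp[v] = label
                        stack.append(v)
            label += 1

    sizes = {}
    for c in comp:
        sizes[c] = sizes.get(c, 0) + 1
    best = max(sizes, key=sizes.get)
    return [u for u in range(n) if comp[u] == best]

# Nodes 0-2 form a directed cycle; node 3 only receives an edge,
# so it falls outside the largest strongly connected component.
print(largest_scc([(0, 1), (1, 2), (2, 0), (0, 3)], 4))  # [0, 1, 2]
```

In the SWOW graphs, vertices like node 3 here are exactly the ones reported in the centrality subdirectory.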
Use createResponseStats.R to calculate a number of response statistics. Currently the script calculates the number of types, tokens, and hapax legomena responses (responses that occur only once). The results can be found in the output directory.
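These three statistics are straightforward frequency counts; the toy response list below is illustrative, not SWOW-RP data:

```python
from collections import Counter

# Hypothetical pooled responses across cues.
responses = ["perro", "gato", "perro", "sol", "luna", "sol", "mar"]

counts = Counter(responses)
tokens = sum(counts.values())                       # total responses
types = len(counts)                                 # distinct responses
hapax = sum(1 for n in counts.values() if n == 1)   # responses occurring once

print(types, tokens, hapax)  # 5 7 3
```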
Use createCueStats.R to calculate cue statistics. Only words that are part of the strongly connected component are considered. Results are provided for the R1 graph and the graph with all responses (R123). The file includes the following:
- coverage: how many of the responses are retained in the graph after removing words that are not cues or are not part of the largest strongly connected component
- H: Shannon entropy of the response distribution for each cue
- unknown: the number of unknown responses
- x.R2: the number of missing R2 responses
- x.R3: the number of missing R3 responses
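The entropy H of a cue's response distribution can be computed as follows (toy counts again; createCueStats.R derives the distribution from the actual data):

```python
from collections import Counter
from math import log2

# Hypothetical responses to one cue; a flatter distribution gives higher H.
responses = ["perro", "perro", "gato", "animal"]

counts = Counter(responses)
total = sum(counts.values())
# Shannon entropy: H = -sum over responses of p * log2(p).
H = -sum((n / total) * log2(n / total) for n in counts.values())

print(round(H, 3))  # 1.5
```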
A histogram of the response coverage for R1 and R123 graphs can be obtained from the script plotCoverage.R. Vocabulary growth curves can be obtained with plotVocabularyGrowth.R.
Later responses can be affected by the previous response a participant gave. In general this is quite rare, but for some cues the effect can be more pronounced. To investigate response chaining, we compare the conditional probability of the second response when it is preceded by a mediating R1 response with the conditional probability when R2 is not preceded by this mediator. An example of this analysis is available in calculateR12ResponseChaining.R.
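The comparison boils down to two conditional probabilities. The (R1, R2) pairs, mediator, and target below are invented for illustration; calculateR12ResponseChaining.R runs this on the full dataset:

```python
# Hypothetical (R1, R2) response pairs collected for a single cue.
pairs = [
    ("luna", "estrella"), ("luna", "estrella"), ("luna", "noche"),
    ("sol", "estrella"), ("sol", "calor"), ("cielo", "nube"),
]

mediator, target = "luna", "estrella"

with_med = [r2 for r1, r2 in pairs if r1 == mediator]
without_med = [r2 for r1, r2 in pairs if r1 != mediator]

p_with = with_med.count(target) / len(with_med)           # P(R2=target | R1=mediator)
p_without = without_med.count(target) / len(without_med)  # P(R2=target | R1!=mediator)

# A large gap between the two probabilities is evidence of chaining.
print(round(p_with, 2), round(p_without, 2))  # 0.67 0.33
```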
We tried to check the spelling of the most common responses (those occurring at least twice in the data), but it is quite likely that some corrections can be improved and some misspellings were missed. This is where git can make our lives a bit easier: if you find errors, please check the correction file and submit a pull request with additional or amended corrections.
Many files are of importance at this step; please look at the data/dictionaries folder. Among these:

- rioplatenseProperNames.txt: a list of proper names that should not be corrected when found as a response
- rioplatenseWordlist.txt: responses that have been manually checked. The data in these files take priority over automated (and sometimes faulty) spell-checking. As such, exceptions that should not be touched can easily be included by adding the original response together with an identical correction.