SWOW-RP

The Small World of Words (SWOW) project is a scientific project to map word meaning in various languages. In contrast to dictionaries, it focuses on the aspects of word meaning that are shared between people, without imposing restrictions on which aspects of meaning should be considered. The methodology is based on a continued word association task, in which participants see a cue word and are asked to give three associated responses to it.

In this repository you will find a basic analysis pipeline for the Rioplatense Spanish SWOW project, which allows you to import and preprocess the data as well as compute some basic statistics.

Suggestions are always appreciated, and do not hesitate to get in touch if you have any questions.

Obtaining the data

In addition to the scripts, you will need to retrieve the word association data. Currently, word association and participant data are available for over 13,000 cues. The data consist of over 3M responses collected between 2014 and 2022 and are currently submitted for publication. Note that the final version is subject to change. If you want to use these data for your own research, you can obtain them from the Small World of Words research page.

Please note that the data themselves are licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. They cannot be redistributed or used for commercial purposes.

To cite these data:

If you find any of this useful, please consider sharing the word association study.

Data versioning

Since this is an ongoing project, the data are regularly updated. Hence, every data file includes a release date in its filename.

  • The release associated with the published paper is 26-04-2022.
  • Current release is 16-09-2022.

In order to use a specific release, you have to set the release variable in the settings.R file accordingly.
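As a rough illustration of how the release setting relates to the filenames described in this document, the sketch below assumes that settings.R defines a character variable named release. The helper swow_file is hypothetical and is not part of the repository.

```r
# Assumption: settings.R defines the release date as a string like this.
release <- "16-09-2022"

# Hypothetical helper: build a data filename for a given stage and release.
swow_file <- function(stage, release) {
  paste0("SWOW-RP.", stage, ".", release, ".csv")
}

# Path to the raw data file for the chosen release.
raw_file <- file.path("data", "SWOW", "raw", swow_file("complete", release))
print(raw_file)
```

Switching to the release associated with the published paper would then only require changing the value of release.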

Raw data

The raw data consists of the original participant and response data.

  • participantID: unique identifier for the participant
  • age: age of the participant
  • nativeLanguage: native language from a short list of common languages
  • gender: gender of the participant (Female / Male / X)
  • education: Highest level of education: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master
  • city: city (city location when tested, might be an approximation)
  • country: country (country location when tested)
  • responseID: identifier for the response
  • section: identifier for the snowball iteration (e.g. set2017)
  • cue: cue word
  • R1: primary associative response
  • R2: secondary associative response
  • R3: tertiary associative response

The file containing these raw data follows the naming convention SWOW-RP.complete.version.csv, where version is the release date. It should be located in the data/SWOW/raw folder and is the output of the importRawData.R script. It is used by the preprocessData.R script as input (see below).

Preprocessed data

The preprocessed data consist of cues and responses that have been normalized by spell-checking, correcting capitalization and Americanizing. This file is generated by the preprocessData.R script.

In many cases, this preprocessed data is used to derive the associative strengths (i.e. the conditional probability of a response given a cue). These data can be derived using the createAssoStrengthTable.R script.
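To make the definition of associative strength concrete, here is a minimal base R sketch on toy data. It is not the code in createAssoStrengthTable.R. The toy words are invented, and it assumes missing responses are coded as NA, which may differ from the coding in the actual files.

```r
# Toy data in the same wide layout as the data files (cue, R1, R2, R3).
responses <- data.frame(
  cue = c("mate", "mate", "mate", "perro"),
  R1  = c("yerba", "yerba", "amargo", "gato"),
  R2  = c("amargo", "bombilla", NA, "ladrar"),
  R3  = c(NA, "amigos", NA, "hueso"),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per (cue, response) pair,
# dropping missing responses.
long <- data.frame(
  cue      = rep(responses$cue, times = 3),
  response = c(responses$R1, responses$R2, responses$R3),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$response), ]

# Count each cue-response pair, then divide by the total count for the cue.
freq <- aggregate(n ~ cue + response, data = transform(long, n = 1), FUN = sum)
freq$strength <- freq$n / ave(freq$n, freq$cue, FUN = sum)

print(freq[order(freq$cue, -freq$strength), ])
```

By construction, the strengths for each cue sum to 1, so each row can be read as an estimate of P(response | cue).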

The preprocessed data file has the same structure as the raw data file described above. It has the naming convention: SWOW-RP.spellchecked.version.csv, and should be located in the data/SWOW/processed folder. It is used by the balanceParticipants.R script as input (see below).

Balanced data

After normalizing cues and responses, the balanceParticipants.R script extracts a balanced dataset in which each cue is judged by exactly 70 participants. Because each participant generated three responses, this means each cue has 70 × 3 = 210 associations. The participants were selected to favor native speakers.

The file containing the balanced data follows the naming convention SWOW-RP.R70.version.csv and should be located in the data/SWOW/processed folder. It is used by all downstream analysis scripts (see below).
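One way to picture the balancing step is sketched below. This is not the logic of balanceParticipants.R. The logical column native and the function balance_cue are assumptions made for illustration, and the sketch keeps at most k rows per cue rather than guaranteeing exactly k.

```r
# Hypothetical helper: keep at most k participants per cue,
# preferring native speakers.
balance_cue <- function(df, k = 70) {
  # Order rows so that native speakers come first within each cue.
  df <- df[order(df$cue, !df$native), ]
  # Rank rows within each cue and keep the first k.
  rank_in_cue <- ave(seq_len(nrow(df)), df$cue, FUN = seq_along)
  df[rank_in_cue <= k, ]
}

# Toy data: one row per participant-cue pair.
toy <- data.frame(
  participantID = 1:5,
  cue           = c("mate", "mate", "mate", "perro", "perro"),
  native        = c(FALSE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

balanced <- balance_cue(toy, k = 2)
print(balanced)
```

With k = 2, the non-native participant for "mate" is dropped because two native speakers are available, while both participants for "perro" are kept.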

Processing scripts

Graphs

Use createSWOWENGraph.R to extract the largest strongly connected component for graphs based on the first response (R1) or all responses (R123). The results are written to data/SWOW/output/adjacencyMatrices and consist of a file with labels and a sparse file with three values per entry: the row index, the column index and the association frequency. In most cases, associative frequencies will need to be converted to associative strengths by dividing each frequency by the sum of all frequencies for that cue. Vertices that are not part of the largest strongly connected component are listed in a report in the data/SWOW/output/centrality subdirectory.
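The conversion from frequencies to strengths can be sketched as follows. The column names i, j and freq are assumptions, since the header of the sparse file is not described here, and rows are taken to index cues.

```r
# Sparse triplets: row index (cue), column index (response), frequency.
triplets <- data.frame(
  i    = c(1, 1, 2, 3),
  j    = c(2, 3, 1, 1),
  freq = c(3, 1, 2, 5)
)

# Divide each frequency by the total frequency of its row (cue).
triplets$strength <- triplets$freq / ave(triplets$freq, triplets$i, FUN = sum)

print(triplets)
```

After this step, the strengths in each row of the adjacency matrix sum to 1.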

Derived statistics

Response statistics

Use createResponseStats.R to calculate a number of response statistics. Currently the script calculates the number of types, tokens and hapax legomena responses (responses that only occur once). The results can be found in the output directory.
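For readers unfamiliar with these terms, the three counts can be illustrated on a toy response vector. This is a minimal sketch, not the code in createResponseStats.R, and the words are invented.

```r
# Toy vector of responses pooled over cues.
responses <- c("yerba", "yerba", "amargo", "gato", "gato", "gato", "ladrar")

counts <- table(responses)

n_tokens <- sum(counts)        # total number of responses
n_types  <- length(counts)     # number of distinct responses
n_hapax  <- sum(counts == 1)   # responses that occur exactly once

print(c(tokens = n_tokens, types = n_types, hapax = n_hapax))
```

Here there are 7 tokens and 4 types, of which 2 ("amargo" and "ladrar") are hapax legomena.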

Cue statistics

Use createCueStats.R to calculate cue statistics. Only words that are part of the strongly connected component are considered. Results are provided for the R1 graph and the graph with all responses (R123). The file includes the following:

  • coverage: how many of the responses are retained in the graph after removing words that aren't a cue or aren't part of the largest strongly connected component
  • H: Shannon entropy of response distributions for each cue
  • unknown: the number of unknown responses
  • x.R2: the number of missing R2 responses
  • x.R3: the number of missing R3 responses
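The entropy and coverage measures can be sketched in a few lines of base R. The function names are hypothetical. The base-2 logarithm, and expressing coverage as a proportion, are assumptions, since createCueStats.R may use a different base or scale.

```r
# Shannon entropy (in bits, assuming log base 2) of a response
# frequency distribution for one cue.
shannon_entropy <- function(freq) {
  p <- freq / sum(freq)
  p <- p[p > 0]          # zero-frequency responses contribute nothing
  -sum(p * log2(p))
}

# Coverage as the proportion of response tokens that are retained,
# i.e. that also occur in the graph vocabulary.
coverage <- function(responses, vocabulary) {
  mean(responses %in% vocabulary)
}

h_uniform <- shannon_entropy(c(2, 2))   # two equally likely responses
h_single  <- shannon_entropy(c(4))      # a single response type
cov_toy   <- coverage(c("yerba", "amargo", "bombilla", "yerba"),
                      vocabulary = c("yerba", "amargo"))
print(c(h_uniform, h_single, cov_toy))
```

A cue with two equally frequent responses has an entropy of 1 bit, a cue that always elicits the same response has an entropy of 0, and in the toy example three of the four response tokens are retained.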

A histogram of the response coverage for R1 and R123 graphs can be obtained from the script plotCoverage.R. Vocabulary growth curves can be obtained with plotVocabularyGrowth.R.

R1 - R2 response chaining

Later responses can be affected by the previous response a participant gave. In general, this is quite rare, but for some cues the effect can be more pronounced. To investigate response chaining, we compare the conditional probability of a second response (R2) when it is preceded by a mediating R1 response with its conditional probability when it is not preceded by this mediator. An example of this analysis is available in calculateR12ResponseChaining.R.
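The comparison can be sketched on toy data as follows. This is not the analysis in calculateR12ResponseChaining.R. The function chaining and the example words are hypothetical.

```r
# Compare P(R2 = target | cue, R1 = mediator)
# with    P(R2 = target | cue, R1 != mediator).
chaining <- function(df, cue_word, mediator, target) {
  d <- df[df$cue == cue_word & !is.na(df$R1) & !is.na(df$R2), ]
  with_med <- d$R1 == mediator
  c(p_with    = mean(d$R2[with_med] == target),
    p_without = mean(d$R2[!with_med] == target))
}

toy <- data.frame(
  cue = rep("sol", 5),
  R1  = c("luna", "luna", "calor", "calor", "playa"),
  R2  = c("noche", "noche", "noche", "verano", "arena"),
  stringsAsFactors = FALSE
)

res <- chaining(toy, cue_word = "sol", mediator = "luna", target = "noche")
print(res)
```

In this toy example, "noche" follows "luna" in every case but appears in only one of the three responses without it, which is the pattern expected when R2 is chained to R1 rather than to the cue.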

Spelling and lexica

We tried to check the spelling of the most common responses (those occurring at least two times in the data), but it's quite likely that some corrections can be improved and some misspellings are missed. This is where git can make our lives a bit easier. If you find errors, please check the correction file and submit a pull request with additional or amended corrections.

Several files are important at this step; they can be found in the data/dictionaries folder. Among these:

  • rioplatenseProperNames.txt: List of proper names that should not be corrected when found as response

  • rioplatenseWordlist.txt: responses that are manually checked. The data in these files take priority over automated (and sometimes faulty) spell-checking. As such, exceptions that should not be touched can easily be included by listing the original response together with an identical correction.
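The priority rule described above can be sketched as a pair of lookup tables. The representation as named character vectors, the function apply_corrections and the example entries are all assumptions for illustration. The actual layout of the dictionary files may differ.

```r
# Hypothetical helper: apply automatic corrections first, then let manual
# corrections override them. Names are original responses, values are
# corrections.
apply_corrections <- function(responses, manual, automatic) {
  out <- responses
  in_auto <- responses %in% names(automatic)
  out[in_auto] <- automatic[responses[in_auto]]
  in_manual <- responses %in% names(manual)
  out[in_manual] <- manual[responses[in_manual]]
  unname(out)
}

# "che" maps to itself in the manual list, which protects it from the
# (faulty) automatic correction.
manual    <- c(che = "che", yerva = "yerba")
automatic <- c(che = "cheque", gatto = "gato")

corrected <- apply_corrections(c("che", "yerva", "gatto", "sol"),
                               manual, automatic)
print(corrected)
```

Responses found in neither table, such as "sol" here, pass through unchanged.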
