
Querying and manipulating open lexical databases

Most lexical databases consist of plain text files in .tsv or .csv formats which can easily be imported into R using readr::read_delim, or into Python with pandas.read_csv. To open a .tsv or a .csv file with Excel, check out "How to open a tsv file in Excel".
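As a minimal sketch of the Python route, the snippet below parses a small tab-separated table with pandas.read_csv. The three rows are made up for illustration (only the column names ortho, cgram and nblettres follow Lexique382); with a real database you would pass the file path instead of the in-memory buffer.

```python
import io
import pandas as pd

# Toy tab-separated data; the rows are invented for illustration only.
TOY_TSV = "ortho\tcgram\tnblettres\nbateau\tNOM\t6\navion\tNOM\t5\nmanger\tVER\t6\n"

# sep='\t' tells pandas that the columns are separated by tabs (.tsv);
# for a comma-separated .csv file, the default sep=',' applies.
lex = pd.read_csv(io.StringIO(TOY_TSV), sep="\t")

print(lex.shape)          # number of rows and columns
print(list(lex.columns))  # column names read from the header line
```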


Example 1: extracting a list of words with R

To extract the rows of Lexique382.tsv corresponding to a list of words:

    library(tidyverse)  # attaches readr, stringr and the %>% pipe

    items <- c('bateau', 'avion', 'maison', 'arbre')

    # read the database
    lex <- read_delim("", delim='\t')
    # lex <- read_delim('Lexique382.tsv.gz', delim='\t')  # if you have the file

    # keep the rows whose 'ortho' column matches one of the items
    selection <- subset(lex, ortho %in% items)

    write_tsv(selection, 'selection.tsv')

    ### Using regular expressions

    # list the words that end in "ion"
    lex$ortho %>% str_subset("ion$")

    # list the words that contain three successive vowels
    lex$ortho %>% str_subset('[aeiouy][aeiouy][aeiouy]')

    # find the words that contain a repeated group of 3 letters
    lex$ortho %>% str_subset("(...)\\1")

    # see

Download select.R. (If you have not already done so, install R and RStudio Desktop first.)
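For readers who prefer Python, the same word selection and regular-expression searches can be sketched with pandas. This is an illustrative equivalent, not part of select.R: the toy word list stands in for the ortho column of Lexique382, and the variable names are chosen here for the example.

```python
import re
import pandas as pd

# Toy stand-in for the 'ortho' column of Lexique382 (words chosen for illustration).
lex = pd.DataFrame(
    {"ortho": ["bateau", "avion", "maison", "arbre", "nation", "bonbon", "joueur"]}
)

items = ["bateau", "avion", "maison", "arbre"]

# rows whose 'ortho' value belongs to the list of items
selection = lex[lex.ortho.isin(items)]

# words that end in "ion"
ends_in_ion = lex.ortho[lex.ortho.str.contains(r"ion$")]

# words that contain three successive vowels
three_vowels = lex.ortho[lex.ortho.str.contains(r"[aeiouy]{3}")]

# words that contain a repeated group of 3 letters (backreference \1)
repeated = lex.ortho[lex.ortho.map(lambda w: re.search(r"(...)\1", w) is not None)]

print(selection)
print(list(ends_in_ion), list(three_vowels), list(repeated))
```

To save the selection as a tab-separated file, as write_tsv does in R, you can call selection.to_csv('selection.tsv', sep='\t', index=False).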

Note that this code reads Lexique382.tsv directly from the web. If the server or the connection is too slow, you will get the message "Error in open.connection(con, "rb") : Timeout was reached".

In this case, you should first download Lexique382.tsv to your local hard drive and change the file path passed as an argument to read_delim.
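The same workaround can be automated. The Python sketch below tries a first source and falls back to a local copy when reading fails; read_table is a hypothetical helper written for this example, and both file names are placeholders. It assumes that network failures surface as OSError subclasses, which is the case for urllib's URLError and for timeouts.

```python
import os
import tempfile
import pandas as pd


def read_table(primary, fallback):
    """Read a tab-separated table from `primary`; on an I/O error
    (missing file, unreachable server, timeout), read `fallback` instead."""
    try:
        return pd.read_csv(primary, sep="\t")
    except OSError as err:  # FileNotFoundError and urllib's URLError are OSErrors
        print(f"Could not read {primary!r} ({err}); using {fallback!r} instead")
        return pd.read_csv(fallback, sep="\t")


# Demonstration with a toy local copy; both file names are hypothetical.
local_copy = os.path.join(tempfile.mkdtemp(), "toy_lexicon.tsv")
with open(local_copy, "w", encoding="utf-8") as f:
    f.write("ortho\tcgram\nbateau\tNOM\n")

lex = read_table("no_such_file.tsv", local_copy)
print(lex)
```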

More generally, you can download the source tables of a number of databases from our list of open databases.

Example 2: selecting items with Python

This example shows how to use Python to select four random sets of twenty items each from Lexique382: high-frequency nouns, low-frequency nouns, high-frequency verbs and low-frequency verbs. (If you have not already done so, install Python: Go to ; select your OS (Windows, MacOS or Linux) and download the Python 3.7 installer.)

    """Example of item selection in the Lexique382 database."""

    import pandas as pd

    lex = pd.read_csv("../databases/Lexique382/Lexique382.tsv", sep='\t')

    # alternatively, you can download the table from the Internet:
    # lex = pd.read_csv('', sep='\t')


    # restrict the search to words between 5 and 8 letters long
    subset = lex.loc[(lex.nblettres >= 5) & (lex.nblettres <= 8)]

    # separate the nouns and the verbs into two dataframes
    noms = subset.loc[subset.cgram == 'NOM']
    verbs = subset.loc[subset.cgram == 'VER']

    # select on the basis of lexical frequency
    noms_hi = noms.loc[noms.freqlivres > 50.0]
    noms_low = noms.loc[(noms.freqlivres < 10.0) & (noms.freqlivres > 1.0)]

    verbs_hi = verbs.loc[verbs.freqlivres > 50.0]
    verbs_low = verbs.loc[(verbs.freqlivres < 10.0) & (verbs.freqlivres > 1.0)]

    # draw items at random from each of the 4 subsets
    N = 20
    noms_hi.sample(N).ortho.to_csv('nomhi.txt', index=False)
    noms_low.sample(N).ortho.to_csv('nomlo.txt', index=False)
    verbs_hi.sample(N).ortho.to_csv('verhi.txt', index=False)
    verbs_low.sample(N).ortho.to_csv('verlo.txt', index=False)
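The four selections above follow the same pattern, so they can be factored into one function. The sketch below is one possible way to do it: sample_items is a hypothetical helper, the toy table and its frequencies are invented, and the seed argument (passed to pandas' random_state) is an addition that makes the random draw reproducible.

```python
import pandas as pd


def sample_items(lex, cgram, fmin, fmax, n, seed=None):
    """Return n randomly drawn words of category `cgram` whose book
    frequency (freqlivres) lies strictly between fmin and fmax."""
    pool = lex.loc[
        (lex.cgram == cgram) & (lex.freqlivres > fmin) & (lex.freqlivres < fmax)
    ]
    if len(pool) < n:
        raise ValueError(f"only {len(pool)} candidates for {cgram}, {n} requested")
    # random_state makes the draw reproducible from one run to the next
    return pool.sample(n, random_state=seed).ortho.tolist()


# Toy table with invented frequencies, standing in for Lexique382.
toy = pd.DataFrame({
    "ortho": ["maison", "bateau", "arbre", "avion", "manger", "dormir"],
    "cgram": ["NOM", "NOM", "NOM", "NOM", "VER", "VER"],
    "freqlivres": [120.0, 5.0, 60.0, 8.0, 70.0, 3.0],
})

noms_hi = sample_items(toy, "NOM", 50.0, float("inf"), n=2, seed=0)
noms_low = sample_items(toy, "NOM", 1.0, 10.0, n=2, seed=0)
print(noms_hi, noms_low)
```

The explicit size check gives a clearer error message when a subset holds fewer than n candidates, which can happen with strict frequency or length criteria.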


French syllabification

french-syllabation provides the scripts that were used to syllabify the phonological representations in Brulex and Lexique.


Time-stamp: <2019-03-31 14:01:37>
