A package for basic linguistic analysis.
Table of Contents

  1. Introduction
  2. How to install
  3. How to use
    1. Building a Corpus
    2. Basic operations
    3. Collocation
      1. Collocation Frequency
    4. Ngrams
    5. Word Frequency
    6. Ngram Frequency
    7. Custom variables
  4. Examples
    1. Integration with eww
    2. With large .txt files

Introduction

Linguistic mode allows you to perform basic linguistic analysis on a buffer's contents. You can do collocation searches and obtain word or ngram frequencies.

It was designed to introduce linguistics students and enthusiasts to the field of corpus analysis. It is therefore well suited to simple analyses of small files.

How to install

You can install linguistic-mode from MELPA with M-x package-list-packages, after you have added the MELPA repository to your .emacs file with the following:

    (add-to-list 'package-archives
                 '("melpa" . "http://melpa.org/packages/"))

You can also use M-x package-build-create-recipe with the following recipe:

    (linguistic :fetcher github
                :repo "andcarnivorous/linguistic-mode"
                :files ("*.el" "*.org"))

If you want to install with git clone, you will have to tweak the functions linguistic-grams-freq and linguistic-word-freq so that they can find the graphs.org file.
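For a manual install, a minimal .emacs sketch might look like the following. The clone path is hypothetical (adjust it to wherever you cloned the repository), and the feature name is assumed from the file name linguistic.el:

```elisp
;; Assuming the repository was cloned into ~/src/linguistic-mode
;; (hypothetical path -- change it to your actual clone location):
(add-to-list 'load-path "~/src/linguistic-mode")

;; Load the package; the feature name is assumed from linguistic.el.
(require 'linguistic)
```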

How to use

There are five main interactive functions:

  • linguistic-collocation (C-c C-k)
  • linguistic-collocation-freq (C-c C-f)
  • linguistic-ngrams (C-c C-g)
  • linguistic-word-freq (C-c C-w)
  • linguistic-grams-freq (C-c C-l)

If you activate the minor mode with M-x linguistic-mode these functions will be available with their respective keybindings.
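If you prefer not to enable the mode by hand each time, a possible .emacs snippet (a sketch, not part of the package's documentation) is to hook it into text buffers so the keybindings above are always available there:

```elisp
;; Turn on linguistic-mode automatically in all text-mode buffers,
;; making C-c C-k, C-c C-f, etc. available without M-x linguistic-mode.
(add-hook 'text-mode-hook #'linguistic-mode)
```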

Building a Corpus

If you want to build a custom corpus from different files, buffers or regions, you can do so with the following functions (these have no keybindings):

  • linguistic-build-corpus
  • linguistic-collect-file
  • linguistic-collect-buffer
  • linguistic-collect-region

With linguistic-build-corpus you simply create a new, empty buffer called corpus, into which you can then paste text from other windows or applications.

With the other functions you can take the contents of multiple files, buffers or regions and gather them all in the corpus buffer.

This lets you combine, for example, the contents of a .txt file, a few eww buffers and a region of another file, and analyze them all together with the functions above.
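The corpus-building workflow above can be sketched as follows. These commands prompt for their input interactively, and their non-interactive signatures are not documented here, so the sketch goes through call-interactively rather than assuming argument lists:

```elisp
;; Create the empty "corpus" buffer.
(call-interactively #'linguistic-build-corpus)

;; Append the contents of a file (prompts for the file name).
(call-interactively #'linguistic-collect-file)

;; Append the contents of another buffer (prompts for the buffer).
(call-interactively #'linguistic-collect-buffer)
```

After collecting, switch to the corpus buffer and run the analysis commands (e.g. C-c C-w for word frequencies) on it.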

Basic operations

With linguistic-mode you can get the number of sentences in a raw corpus and also the average number of characters or words per sentence.

  • linguistic-count-sentences
  • linguistic-average-sent-length
  • linguistic-average-words-sent

Collocation

This function allows you to find every instance of a word in the buffer and its surrounding context. The size of the context (how many words on the left and right will be displayed) is chosen by the user.

After calling linguistic-collocation you will be asked to insert the number of words after the keyword, before the keyword and finally the keyword itself. Remember that the keyword should always be lowercase.

Once the function has analyzed the whole buffer it will return in a new buffer the list of all the occurrences of the selected keyword.

If you enter 0 for the context before and/or after the keyword, the word "nil" will appear in place of the context.

Remember that the more context words you request on each side, the longer it will take to analyze the buffer. The function also substitutes any punctuation with a period.

Collocation Frequency

The function linguistic-collocation-freq requires the same input as linguistic-collocation, but returns, in a new buffer, a list of all the collocates and their frequencies.

Ngrams

This function will return, in a new buffer, a list of all the ngrams present in the buffer along with the total number of ngrams found.

When you call linguistic-ngrams you will have to insert the size of the ngram first.

Word Frequency

This function will return, in a new buffer, an org-table with the N most frequent words in the buffer or selected area and their occurrences. The new buffer will also contain some code snippets in Python, Gnuplot and R that will allow you to get a bar chart of the result.

When you call linguistic-word-freq you will be asked how many entries you want in the result table. If you choose a number higher than the number of distinct words in the result, the function will return an error.

You will also be asked whether you want to include stopwords. Stopwords are words that are usually not relevant to word-frequency analysis (e.g. "and", "I", "what", "could", etc.).


Ngram Frequency

This function works just like linguistic-word-freq and can be applied to the whole buffer or a selected area. The only difference is that you will also be prompted to insert the size of the ngram you want in your result (2 for bigrams, 3 for trigrams and so on).

Custom variables

There are two custom variables that you can set with M-x customize-group RET linguistic-analysis:

  • linguistic-splitter contains the regex that regulates which special characters will be included in ngrams, word-freq and grams-freq.

  • linguistic-stopwords contains a list of words that will not be included in word-freq if the user chooses to apply the stopwords filter.
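Both variables can also be set directly in your .emacs. The values below are illustrative assumptions (a splitting regex and a stopword list built from the examples above); check each variable's docstring with C-h v for the exact format it expects:

```elisp
;; Hypothetical splitter regex -- characters treated as word boundaries.
;; Verify the expected format with C-h v linguistic-splitter.
(setq linguistic-splitter "[ \f\t\n\r\v.,;:!?()\"]+")

;; Hypothetical stopword list, using examples mentioned in this README.
(setq linguistic-stopwords '("and" "i" "what" "could" "the" "a"))
```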

Examples

Integration with eww

A nice way to use linguistic-mode is while browsing with eww. Instead of copy-pasting the contents of a tabloid article into a new .txt file or stripping the HTML in another programming language, you can get preliminary results immediately, which is quite handy for people just starting out with corpus linguistics.

With large .txt files

When you use linguistic-mode on large files, results can take a long time.

On an i5 machine, linguistic-collocation took 61 seconds to find all the occurrences of the word "emma" in Austen's novel "Emma" with 3 context words on each side, and 33 seconds with just 1 context word per side.

Using linguistic-grams-freq to get the most frequent trigrams in the novel took 160 seconds.

TODO

  • Preview word-freq and gram-freq list length before prompt.
  • Affixes in collocation
  • Snowball Stemmer