# Chapter 19 - Introduction to Quantitative Linguistic Methods

Quantitative linguistics covers a huge variety of subfields and individual methods,
spanning not only linguistics itself, but computer science, statistics, information
theory, and more.

We can generally distinguish two large subfields, which share many methods but
typically pursue different goals: applied linguistics, commonly referred
to as Natural Language Processing (NLP), and empirical linguistics. The term
NLP is often used interchangeably with empirical linguistics because of the large
overlap in methods, but the two fields pursue distinct goals.

**Natural Language Processing** is often aimed at using quantitative methods to build useful 
tools that solve real-world problems. Think of tools such as Google search or Alexa. NLP methods
are often theory-neutral when it comes to language science. The success of an NLP project is best 
judged by the question, *does it work?*

**Empirical linguistics** is aimed at using quantitative methods to formulate and test linguistic 
hypotheses with real-world data. This field often relies on data procured from digital corpora or 
from human respondents. That data is then analyzed with a given research question in mind. The success
of an empirical linguistic project is best judged by the question, *what question does it answer?*

**The following chapters will primarily focus on the empirical linguistic approach from the perspective
of corpus linguistics.** Many theoretical issues can be addressed via empirical corpus 
linguistics, including, but not limited to, the following (see Stefanowitsch 2020, cited below): 

* [lexical/grammatical collocation](https://www.researchgate.net/publication/37929828_Collostructions_Investigating_the_interaction_of_words_and_constructions)
* [grammar, syntax, semantics](http://wwwling.arts.kuleuven.be/qlvl/prints/levshina_heylen_9998draft_radically_data-driven_construction.pdf)
* [metaphor](https://benjamins.com/catalog/scl.2)
* [intertextuality](https://www.springer.com/gp/book/9783030234133)
* [historical linguistics](http://lingpy.org)
* and more

The chapters of this tutorial will not cover these areas individually, but will rather focus on a 
few methods that are widely applicable to these various areas.

A major source for this chapter that provides a more comprehensive overview of the relevant
issues and methods is *Corpus Linguistics* (2020) by Anatol Stefanowitsch. A free copy of the 
book can be downloaded by following the image link below.

<a href="https://langsci-press.org/catalog/book/148"><img src="https://langsci-press.org/$$$call$$$/submission/cover/cover?submissionId=148&random=1485eaad8671e2ef" height=300 width=300></a>

Additional resources are cited below. Even though they are written for R, the methods are 
equally applicable in Python and only require some "translation" of R data frames to
pandas DataFrames.

* [Harald Baayen, *Analyzing Linguistic Data: A practical introduction to statistics using R*, 2008.](https://www.researchgate.net/publication/311509723_Analyzing_Linguistic_Data_A_practical_introduction_to_statistics_using_R)
* [Natalia Levshina, *How to do Linguistics with R*, 2015.](https://benjamins.com/catalog/z.195)

## Formulating Linguistic Hypotheses

In empirical linguistics, we seek to convert theoretical questions into 
testable hypotheses. In the natural sciences, the normal objective is to 
**disprove** a hypothesis rather than prove it. Have a look at the image below. 
We will pretend this is our dataset which tells us something about "swans".

<img src=https://i.dailymail.co.uk/i/pix/2014/03/13/article-0-1C40765B00000578-240_964x614.jpg height=600 width=600>

Based on our dataset, we formulate the following universal hypothesis:

> "All swans are white."

Is this hypothesis warranted? According to our limited dataset, we might say, based on 
[inductive logic](https://plato.stanford.edu/entries/logic-inductive/), that our hypothesis
is not only warranted but true! But this kind of thinking, as Karl Popper (1959) showed, is 
prone to error. It would take only a single observation of a black swan to disprove the 
hypothesis. In fact, **the primary objective of the scientific method is to disprove a
hypothesis**. In the natural sciences, this is most often done through experimentation. 

As Stefanowitsch points out (2020: 77-93), the normal goal of finding counterexamples to
a hypothesis is particularly problematic for corpus linguistics. This is due to the 
near-infinite variation and diversity found in human language. A corpus is always 
a very limited snapshot of a particular speech situation. 

**In linguistics, hypotheses tend to be formulated in terms of statistical tendencies 
rather than in universal terms.** For instance (examples from Stefanowitsch 2020: 75):

> "X's tend to be Y"

or 

> "Z's prefer Y" 

We can say formally:

    A linguistic hypothesis is a "statement postulating relationships between
    [two or more] constructs". (Stefanowitsch 2020: 64)
    
Here are two examples, taken from Stefanowitsch (2020):

> Most occurrences of the suffix *-wards* are British English\
> Most occurrences of the suffix *-ward* are American English

In each hypothesis, we have two constructs: 1) "a suffix" consisting
of a sequence that is either *-wards* or *-ward*,  and 2) a dialect 
of English, either American or British. Note that we assume the existence
of these constructs when formulating these hypotheses.

There are ways of inductively testing for natural patterns in data, such as
unsupervised clustering methods, that provide interesting avenues for testing the validity 
of our constructs. But the majority of research questions will revolve around
constructs already assumed to exist.

## Operationalizing a Hypothesis

After we formulate a hypothesis that can be tested with statistical
means, we must then decide how to go about gathering and arranging 
our data. This requires us to "operationalize" our terms. 

For instance, Stefanowitsch provides a useful example of the concept 
of 'hardness', defined in a dictionary as (2020: 77):

> FIRM TO TOUCH firm, stiff and difficult to press down, break, or cut

This definition fits our "experiential" knowledge of 'hardness'. But more 
objective measures of hardness are needed if we're going to, for instance,
compare how rugged two different kinds of phone screens are. This requires
us to condense the experiential / cultural concept of hardness into a 
procedure that can be accurately reproduced in various situations. One such procedure
is the Vickers Hardness Test (Stefanowitsch 2020: 78):

$HV = \frac{0.102 \times F}{A}$

where $F$ is the load in newtons, $A$ is the surface area of the indentation left by an 
object pressed into the material, and $0.102$ is a constant that converts
newtons into kiloponds.
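As a minimal sketch, the formula can be written as a small Python function. The function name `vickers_hardness` and the input values are illustrative assumptions, not measurements from Stefanowitsch (2020); the area is assumed to be given in square millimetres.

```python
def vickers_hardness(force_newtons, area_mm2):
    """Return the Vickers hardness number HV = 0.102 * F / A."""
    if area_mm2 <= 0:
        raise ValueError("indentation area must be positive")
    return 0.102 * force_newtons / area_mm2


# Hypothetical measurement: a 100 N load leaving a 2 mm^2 indentation
hv = vickers_hardness(100, 2)
print(round(hv, 2))
```

The point is not the number itself but that anyone who follows the same procedure with the same inputs arrives at the same value.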

In linguistic research we operationalize complex entities into a series of operations.
For instance, say we are trying to measure the behavior of verbs within a text corpus. 
We cannot simply define something like a 'verb' intuitively, since our Python code has 
no concept of 'verb': it cannot think about actions and movements. It cannot think at all!
We are therefore forced to operationalize a very intricate concept as
something we can compute. For instance: 

In [None]:
def is_verb(string):
    """Test whether a string is a verb (toy rule: it ends in 'ing')."""
    return string.endswith('ing')

This, of course, is by no means a comprehensive definition of a verb; it is only an example. Even if 
we manually annotate all cases of verbs, we still must represent 'verb' as some entity in the data. For instance,
see the following mini-dataset:

In [None]:
verbs = [
    ('went', 'v'),
    ('going', 'v')
]

In the small dataset, we've marked a 'v' for each word that we think is a verb. That 'v' string is not, 
of course, itself a 'verb'; rather, it is a simplified formalism meant to point to a 'verb'. In fact,
the 'v' is simply *our interpretation* of a verb. 
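The two operationalizations above can be compared directly. The following sketch repeats the toy suffix rule and the annotated mini-dataset and checks where they disagree; the extra test word `'king'` is an illustrative assumption added here.

```python
def is_verb(string):
    """Toy operationalization: a string counts as a verb if it ends in 'ing'."""
    return string.endswith('ing')


# Manually annotated mini-dataset from above ('v' = verb)
verbs = [
    ('went', 'v'),
    ('going', 'v'),
]

# Compare the rule's decision with the manual annotation for each word
for word, tag in verbs:
    predicted = is_verb(word)
    annotated = (tag == 'v')
    print(word, predicted, annotated, predicted == annotated)

# A noun that the suffix rule wrongly accepts
print(is_verb('king'))
```

The rule misses *went* and accepts *king*, which shows that every operationalization captures only part of the underlying concept and has to be judged against the research question.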

Another example might be operationalizing the concept of word order. This is often done already in
linguistic research, where word order is simplified to:

$word\_order = SV$

or 

$word\_order = VS$
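One way to arrive at such a label can be sketched as follows, assuming each clause has already been annotated with the positions of its subject and its verb. The function name `word_order` and the positions are hypothetical and serve only as an illustration.

```python
def word_order(subject_position, verb_position):
    """Label a clause 'SV' if the subject precedes the verb, else 'VS'."""
    if subject_position == verb_position:
        raise ValueError("subject and verb cannot occupy the same position")
    return 'SV' if subject_position < verb_position else 'VS'


# Hypothetical annotations: (subject position, verb position) per clause
clauses = [(0, 1), (2, 0), (1, 3)]
labels = [word_order(s, v) for s, v in clauses]
print(labels)  # ['SV', 'VS', 'SV']
```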

## Gathering and Structuring Data

### Dependent versus Independent Variables

Remember our definition of a linguistic hypothesis as:

    a "statement postulating relationships between
    [two or more] constructs". (Stefanowitsch 2020: 64)
    
We make a further specification, again following Stefanowitsch (2020: 64):
 
     the construct "we want to explain" is called the dependent
     variable; the construct "we believe provides an explanation"
     is called the independent variable.
     
For example, take the following hypothesis:

> Present tense verbs are used more frequently in speech than in prose.

We have one independent variable, the medium of the text, which takes two values: 
"speech" and "prose". This is the variable we assume "explains" (in a limited sense) 
how often a present tense verb is used in the corpus. The tense of the verb (present 
versus any other tense) is therefore the dependent variable.
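In a data table, the constructs of such a hypothesis become columns. The sketch below uses a handful of invented observations (not real corpus results), and the column names `medium` and `tense` are assumptions chosen for illustration.

```python
import pandas as pd

# Invented toy observations, one verb per row (NOT real corpus data)
toy_verbs = pd.DataFrame({
    # speech vs. prose: the explanatory (independent) side
    'medium': ['speech', 'speech', 'speech', 'prose', 'prose', 'prose'],
    # tense of the verb: what we want to explain (dependent)
    'tense': ['present', 'present', 'past', 'past', 'present', 'past'],
})

# Share of present tense verbs within each medium
is_present = toy_verbs['tense'] == 'present'
present_share = is_present.groupby(toy_verbs['medium']).mean()
print(present_share)
```

Whether a difference between the two shares is meaningful is a question for statistical testing, which this sketch does not attempt.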

### Structuring A Dataset

**Always structure your data in the raw data table format consisting of observations
in rows, and features of those observations in columns (Stefanowitsch 2020: 137)**.
This is the standard in data science, and is constructed so as to preserve your
data in a reproducible format.

|  | feature 1 | feature 2 |  feature 3 | ... |
|--|--|--|--|--|
| observation 1 | 
| observation 2 | 
| observation 3 | 
| ... | 

The observation is the primary entity you're analyzing. For example, if you're analyzing
verbs, each observation is a verb in the corpus, and each feature is a particular feature
of that verb or of its context.

**NEVER store your data in summarized format, i.e. where each cell is a sum or 
cross tabulation of co-occurring features (Stefanowitsch 2020: 137).** An example of a 
summarized format is given below:

|            | subject | object |
|------------|---------|--------|
| verb_stem1 | 123     | 225    |
| verb_stem2 | 521     | 12     |

This format "irreversibly destroys the relationship between the corpus hits and 
the different annotations applied to them" (Stefanowitsch 2020: 138). This format
is fine (and necessary) for analysis and for displaying results! But it should
never be the primary way you store your data. Always use the raw data table format. 
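A summary can always be derived from a raw data table, but not the other way around. The following sketch illustrates this with invented rows (not real corpus data); the column names `sentence_id`, `verb_stem` and `role` are assumptions made for the example.

```python
import pandas as pd

# Invented raw data table: one row per corpus hit (NOT real corpus data)
raw = pd.DataFrame({
    'sentence_id': [1, 2, 3, 4, 5],
    'verb_stem': ['verb_stem1', 'verb_stem1', 'verb_stem2', 'verb_stem2', 'verb_stem2'],
    'role': ['subject', 'object', 'subject', 'subject', 'object'],
})

# The summary is derived on demand; the raw rows stay untouched
summary = pd.crosstab(raw['verb_stem'], raw['role'])
print(summary)
```

From `summary` alone there is no way to tell which sentence a given hit came from, whereas `raw` still holds that link for every row.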

Furthermore, as we shall discuss below, always publish your raw data alongside 
your research, preferably on an open platform intended for data, such as GitHub 
or Zenodo.

### Dataset Illustration

Let's create an example of a raw data table.

In [None]:
import pandas as pd
import re

Let's say we're researching sentence tendencies in a corpus. Here's a snapshot of [that corpus](https://github.com/CambridgeSemiticsLab/nena_corpus/blob/master/nena/0.01/Barwar/The%20Bear%20and%20the%20Fox.nena), 
which is Neo-Aramaic:

In [None]:
the_bear_and_fox = """
xà-yomaˈ də́bba xðára-wawa gu-ṭuràne.ˈ tfíqla b-xa-tèla.ˈ ʾo-tèlaˈ mə́re há
lɛ̀kət zála,ˈ ya-gáni də̀bba?ˈ b-álaha xðárən báθər rə̀sqiˈ bálki xa-mə́ndi táfəq
bìyiˈ ṱ-àxlən mə́ndi,ˈ ʾáxxa m-tàmma.ˈ ʾámər ma-lat-ðáya b-gánəx qàrθɛla?ˈ dáx-it
jwàja?ˈ ʾáxxa l-tàmma.ˈ mə́ra làˈ ču-qárθa lɛ̀la ʾáxxa.ˈ mə́re ʾən-bằyətˈ
zaqrə̀nnəxˈ xa-qərṭàla,ˈ sàla y-amrə́xle,ˈ xa-sàla,ˈ ṭla-sə̀twaˈ ta-t-lá qɛràti.ˈ
t-lá-hoya qàrθa-ʾəlləx,ˈ t-lá-hawe tàlga-ʾəlləx.ˈ hóla rəš-ṭùra,ˈ tìwtɛla.ˈ mə́re
də-hàyyo!ˈ qímɛle múθya tùreˈ ʾu-č̭ənnək̭ɛ́ra díya di-di-dì,ˈ mə́re túgən ʾáti
gàwe.ˈ tìwtɛla.ˈ zqìrəllaˈ hal-làxxa.ˈ mə́ra də-klìgən!ˈ pàlṭən m-gáwe.ˈ mə́re làˈ
ta-ṱ-óðən qắpəx ʾàp ṭla-réšəx,ˈ baʿdḕn pàlṭət.ˈ qìmɛle,ˈ xθìməlle.ˈ kúlle
zqìrəlle-wˈ píštɛla hádəx də́bba gàwe.ˈ ʾu-maxéla ʾáqle díye ʾə̀lla.ˈ ʾu-ṣàlya.ˈ
nàpla,ˈ ṣálya-wˈ ṣàlya,ˈ ṣàlyaˈ b-o-ṭùraˈ hal-šə́ttət hə̀nna,ˈ gu-rawùlta.ˈ ṣléla
tàma.ˈ ʾína kúlla gɛ́rme díya šmìṭe.ˈ píšta be-sarùber,ˈ hat-šmíṭle ʾɛ-sála
t-wéwa zqìrəlla.ˈ kúlla šmìṭla.ˈ módi wìdle?ˈ ʾaw-téla rìqle.ˈ téla rìqle,ˈ
ʾáy hédi-hedi qìmla.ˈ xà-yoma,ˈ trè,ˈ ṭḷàθa,ˈ bar-píšla spày,ˈ zìlla.ˈ zílla
báre báre dìye.ˈ wírre gu-xa-bòya.ˈ wírre gu-xà ʾisára.ˈ 
"""

First we split the text into sentences. We use the period followed by the intonation
boundary mark (ˈ) as the delimiter.

In [None]:
sentences = the_bear_and_fox.split('.ˈ')

print(sentences[:5])
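Sentences that end in `?ˈ` or `!ˈ` are not separated by this plain string split and stay attached to the sentence that follows. As a hedged alternative, which the rest of this chapter does not use, a regular expression can treat all three punctuation marks as sentence ends. The variable names `excerpt` and `sentences_alt` are assumptions, and the excerpt joins a few short sentences taken from the snapshot above.

```python
import re

# A few sentences taken from the snapshot above, joined into one string
excerpt = "tfíqla b-xa-tèla.ˈ ya-gáni də̀bba?ˈ mə́re də-hàyyo!ˈ tìwtɛla.ˈ"

# Split on '.', '?' or '!' when followed by the intonation boundary mark
pieces = re.split(r'[.?!]ˈ', excerpt)

# Drop surrounding whitespace and empty leftovers
sentences_alt = [piece.strip() for piece in pieces if piece.strip()]
print(len(sentences_alt))
print(sentences_alt)
```

Note that this split discards the punctuation mark itself, so the distinction between statements, questions and exclamations would have to be recorded separately if it matters for the analysis.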

**Each sentence is a single observation**. Now we'll construct the dataset by
gathering features on those observations. We'll gather the following features:

* number of words in sentence (segmented by space or -)
* number of stress groups in sentence (segmented by space only)
* whether a sentence contains a segment marked with a comma (,)
* the first word of the sentence

Pay attention to how we store each feature.

In [None]:
raw_data_table = [] # <- put data here

# for each sentence, create a new row
# in the table by appending a list of
# feature values to the raw_data_table list

for sentence in sentences:
    
    # replace newlines and repeated spaces with single spaces, so that words on
    # either side of a line break are not fused; then skip any empty sentences
    sentence = ' '.join(sentence.split())
    if not sentence:
        continue
    
    # gather the data to store
    words = re.split(r'[ -]', sentence) # use re to split by either space or dash -
    stress_groups = sentence.split(' ')
    has_comma = ',' in sentence
    first_word = words[0]
    
    # package the data into a list which will become a "row" in the raw_data_table
    sentence_data = [len(words), len(stress_groups), has_comma, first_word]
    
    # add the "row" into the table by appending the data list to the raw_data_table list
    raw_data_table.append(sentence_data)

Let's have a look at the dataset.

In [None]:
raw_data_table[:5]

Now, we convert the dataset into a DataFrame so we can do statistical analysis
with it later.

In [None]:
raw_data_df = pd.DataFrame(raw_data_table, columns=['n_words', 'n_stress', 'has_comma', 'first_word'])

raw_data_df.head(5)

We can store the data for later use and also for archiving by exporting it as a CSV file:

In [None]:
raw_data_df.to_csv('example_raw_data.csv')
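To check that an exported table can be reproduced by someone else, it helps to read the file back in. The sketch below is self-contained: it uses a small stand-in table (`example_df`, an assumed name) with the same columns, filled with the values for the first two sentences of the snapshot, and writes to a temporary directory rather than the working directory.

```python
import os
import tempfile

import pandas as pd

# Stand-in for raw_data_df with the same columns
example_df = pd.DataFrame(
    [[7, 4, False, 'xà'], [4, 2, False, 'tfíqla']],
    columns=['n_words', 'n_stress', 'has_comma', 'first_word'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_raw_data.csv')
    # to_csv writes the row index as the first column by default
    example_df.to_csv(path)
    # index_col=0 reads that first column back in as the index
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Reading with `index_col=0` avoids ending up with an extra unnamed column that merely repeats the row numbers.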

## Open Science & Reproducibility

Since scientific research should be falsifiable, it must also be reproducible.
In other words, when you report the results of your research, anyone should be
able to pick up your data and reproduce the same results you found. Often in linguistics
this is not the case. We frequently see summary data tables with counts of observed
features, but no provision of the data that produced those counts.

**Your results are only as good as the accessibility of your data**. If you publish really 
nice results, but not the data itself, those results are of limited or no value to the 
advancement of the field. As such, proprietary or protected data has lesser value to good
research than data that is freely available.

Here are a couple of avenues for publishing your data online. Both are 
relatively easy to use. Upload your data to one of these archives in the form of a
raw data table, either as a CSV file or in another format that preserves the structure of
a table. Then provide a hyperlink in your paper/article that directs
your readers to the full form of the data.

* [Get started with GitHub](https://help.github.com/en/github/getting-started-with-github)
* [About Zenodo](https://about.zenodo.org)