# Analysis of Variation Data Processing Notebook

The purpose of this notebook is to annotate VOC data for further variationist analysis.  

# 0.0 Creating a virtual environment

A virtual environment in Python provides a self-contained and isolated environment for each project, allowing you to manage dependencies separately and avoid conflicts between different projects or system installations. It ensures reproducibility, simplifies dependency management, and promotes consistent environments across development stages without impacting the global Python environment.

First, create an environment called ``linguist258`` in which you will install all necessary packages. You can do this either with Conda (0.1A) or with Python's built-in ``venv`` module (0.1B).

## 0.1A Conda


In [None]:
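# --yes skips the confirmation prompt, which a notebook cell cannot answer
!conda create --yes --name linguist258
# Activation inside a notebook cell does not persist; run this line in a terminal
!conda activate linguist258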


## 0.1B Python 

In [None]:
!python -m venv linguist258
# macOS / Linux
!source linguist258/bin/activate
# Windows
!.\linguist258\Scripts\activate


## 0.2 Packages
Once we have created and activated ``linguist258``, we install the required packages. This cell only needs to be run once; after the packages are installed, they do not need to be installed again.

In [2]:
!pip install pympi-ling
!pip install pandas
!pip install stanza
!pip install numpy

Collecting pympi-ling
  Downloading pympi_ling-1.70.2-py2.py3-none-any.whl (24 kB)
Installing collected packages: pympi-ling
Successfully installed pympi-ling-1.70.2


# 1.0 Loading Data

These are the imports necessary to run the rest of the cells. Cells from here down should be run in sequence.

In [5]:
from pympi import Eaf, Praat
import pandas as pd
from pathlib import Path

In the following cell, replace the text in quotation marks with the path on your computer where your .eaf or .TextGrid files are located.

In [4]:
path_to_transcripts = Path("<path to folder where your recordings are located>")
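
Before moving on, it can help to confirm that the folder exists and actually contains transcripts. The sketch below is not part of the original pipeline: ``list_transcripts`` is a hypothetical helper name, and it only assumes that transcript files end in ``.eaf`` or ``.TextGrid``.

```python
from pathlib import Path


def list_transcripts(folder, extensions=(".eaf", ".TextGrid")):
    """Return the sorted transcript files found directly inside `folder`.

    Hypothetical helper for sanity-checking the path; not part of pympi.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Not a folder: {folder}")
    # Keep only regular files whose suffix matches one of the extensions
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix in extensions
    )
```

Calling ``list_transcripts(path_to_transcripts)`` should return a non-empty list if the path is set correctly; an empty list suggests the folder is right but holds no transcripts.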

Here, we define three helper functions for reading annotations from .eaf files.

In [5]:
def format_annotations(tier_name, annotation_list, output_list):
    """This function formats a list of tuples of the form (begin, end, value)
    into a list of dictionaries of the form {speaker, start_ms, end_ms, text}"""
    
    for start_ms, end_ms, text in annotation_list:

        output_list.append({
            'speaker': tier_name,
            'start_ms': start_ms,
            'end_ms': end_ms,
            'text': text
        })
    
    return output_list


def get_annotations(root_path, filename, annotations=None, tier_name=None):
    """Extracts all annotations from a given .eaf file, either for
    a single tier (if tier_name is specified) or for all tiers."""
    # Use a fresh list on each call; a mutable default such as list()
    # is shared across calls and would silently accumulate annotations.
    if annotations is None:
        annotations = []

    # Load the ELAN file
    elan_object = Eaf(file_path=root_path / filename)

    # If a tier name is provided, only annotations for that
    # tier will be used.
    if tier_name:
        tiernames = [tier_name]
    else:
        # Get all tier names
        tiernames = elan_object.get_tier_names()

    # Iterate over all tiers and extract annotations
    for tier in tiernames:
        annotations = format_annotations(
            tier,
            elan_object.get_annotation_data_for_tier(tier),
            annotations
            )
    return annotations
    

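def iterate_over_folder(
    root_path, annotations=None,
    file_extension='.eaf', # This parameter can't really be changed with the code as is.
    output_format='dict_list',
    tier_name=None
    ):
    """Iterates over files with the specified extension and
    extracts annotations into either a list of dictionaries
    or a pandas dataframe."""
    # Use a fresh list on each call instead of a shared mutable default
    if annotations is None:
        annotations = []

    for file in root_path.glob(f'*{file_extension}'):
        # glob() yields paths that already include root_path, so pass only
        # the file name to avoid duplicating the folder for relative paths
        annotations = get_annotations(
            root_path, file.name,
            annotations=annotations,
            tier_name=tier_name
            )

    if output_format == 'pandas':
        return pd.DataFrame(annotations)
    elif output_format == 'dict_list':
        return annotations
    else:
        raise ValueError(f"Unknown output_format: {output_format!r}")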
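
To make the output structure concrete, here is a small sketch of the transformation that ``format_annotations`` performs, using made-up annotation tuples rather than a real .eaf file. The tier name and the text values are hypothetical, and ``to_records`` is a condensed stand-in written only for this illustration so that the snippet runs without pympi.

```python
def to_records(tier_name, annotation_list):
    """Condensed stand-in for format_annotations (illustration only)."""
    return [
        {"speaker": tier_name, "start_ms": start, "end_ms": end, "text": text}
        for start, end, text in annotation_list
    ]


# Made-up (begin, end, value) tuples, with times in milliseconds
toy_tier = [(0, 1500, "hello there"), (1500, 2100, "yeah")]
records = to_records("Speaker_A", toy_tier)

for record in records:
    print(record)
```

Each dictionary becomes one row, and each key one column, when ``iterate_over_folder`` is called with ``output_format='pandas'``.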


The following functions are utilities for writing annotations back to ELAN once they have been processed. Praat output is not implemented yet.

In [6]:
# The tier name can probably be derived from annotations. 
# I can change this once we have a better idea of the 
# output structure.
def output_tier(eaf_object, tier_name, annotation_name, annotation_list):
    
    eaf_object.add_tier(tier_name)
    
    for annotation_dict in annotation_list:
        
        
        eaf_object.add_annotation(
            tier_name, 
            annotation_dict['start_ms'],
            annotation_dict['end_ms'],
            value=annotation_dict[annotation_name] # Probably something like 'pos_tags'
            )
    
    return eaf_object


def write_eaf(eaf_object, output_file):
    
    eaf_object.to_file(output_file)
        
###################################################################################
###################################################################################
####### Praat functionality can be added if necessary to both #####################
####### input and output pipelines.                           #####################
###################################################################################
###################################################################################


# 2.0 Linguistic Annotation

We begin by importing the stanza and numpy packages.

In [21]:
import stanza
import numpy as np

We now download the necessary models from stanza. This only needs to be done once.

The first argument ``en`` is the language code for English, and the processors include a tokenizer, part of speech tagger, and a lemmatizer.


In [2]:
stanza.download('en', processors='tokenize,pos,lemma')

2024-04-30 10:06:28 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package  |
------------------------------
| tokenize        | combined |
| pos             | combined |
| lemma           | combined |
| pretrain        | combined |
| backward_charlm | 1billion |
| forward_charlm  | 1billion |

2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/tokenize/combined.pt
2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/pos/combined.pt
2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/lemma/combined.pt
2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/pretrain/combined.pt
2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/backward_charlm/1billion.pt
2024-04-30 10:06:28 INFO: File exists: /Users/jesushermosillo/stanza_resources/en/forward_charlm/1billion.pt
2024-04-30 10:06:28 INFO: Finished

Once we download the necessary stanza packages, we initialize the pipeline. 

In [16]:
nlp = stanza.Pipeline('en',
                      use_gpu=False, 
                      processors=['tokenize','pos','lemma','mwt'])





2024-04-30 10:10:30 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2024-04-30 10:10:31 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2024-04-30 10:10:31 INFO: Use device: cpu
2024-04-30 10:10:31 INFO: Loading: tokenize
2024-04-30 10:10:31 INFO: Loading: pos
2024-04-30 10:10:31 INFO: Loading: lemma
2024-04-30 10:10:31 INFO: Done loading processors!


In [17]:
# substitute with the actual path to the data
df = pd.read_csv('Transcriptions/toy_data.csv')

We now annotate the utterances using ``nlp()``. The ``apply()`` function allows us to apply a function to a column in a pandas dataframe. 

In [18]:
df['annotated'] = df['text'].apply(nlp)

In [25]:
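# Flatten the annotated utterances into one row per word,
# keeping the speaker and timestamps of the source utterance
annotated_data = []
for speaker, start_ms, end_ms, doc in np.array(df[['speaker', 'start_ms', 'end_ms', 'annotated']]):
    for sent in doc.sentences:
        for word in sent.words:
            # xpos is the treebank-specific tag; feats holds morphological features
            annotation = [speaker, start_ms, end_ms, word.text, word.xpos, word.feats]
            annotated_data.append(annotation)
            print(annotation)
    print()  # blank line between utterances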

['AMA_Stepney_SarahA', 0, 2294, 'so', 'RB', None]
['AMA_Stepney_SarahA', 0, 2294, 'being', 'VBG', 'VerbForm=Ger']
['AMA_Stepney_SarahA', 0, 2294, '..', ',', None]
['AMA_Stepney_SarahA', 0, 2294, 'introduced', 'VBN', 'Tense=Past|VerbForm=Part']
['AMA_Stepney_SarahA', 0, 2294, 'like', 'IN', None]
['AMA_Stepney_SarahA', 0, 2294, 'an', 'DT', 'Definite=Ind|PronType=Art']
['AMA_Stepney_SarahA', 0, 2294, 'actual', 'JJ', 'Degree=Pos']
['AMA_Stepney_SarahA', 0, 2294, ',', ',', None]

['AMA_Stepney_SarahA', 270165, 270655, 'yeah', 'UH', None]
['AMA_Stepney_SarahA', 270165, 270655, ',', ',', None]

['AMA_Stepney_SarahA', 272560, 273470, 'yeah', 'UH', None]
['AMA_Stepney_SarahA', 272560, 273470, ',', ',', None]

['AMA_Stepney_SarahA', 282355, 285675, 'uhm', 'UH', None]
['AMA_Stepney_SarahA', 282355, 285675, 'yeah', 'UH', None]
['AMA_Stepney_SarahA', 282355, 285675, 'they', 'PRP', 'Case=Nom|Number=Plur|Person=3|PronType=Prs']
['AMA_Stepney_SarahA', 282355, 285675, 'all', 'DT', None]
['AMA_Stepney_S

In [27]:
annotated_df = pd.DataFrame(annotated_data,
                            columns=['speaker', 'start_ms', 'end_ms', 'Word','POS','Features'] )

Here's what the data looks like:

In [28]:
annotated_df

Unnamed: 0,speaker,start_ms,end_ms,Word,POS,Features
0,AMA_Stepney_SarahA,0,2294,so,RB,
1,AMA_Stepney_SarahA,0,2294,being,VBG,VerbForm=Ger
2,AMA_Stepney_SarahA,0,2294,..,",",
3,AMA_Stepney_SarahA,0,2294,introduced,VBN,Tense=Past|VerbForm=Part
4,AMA_Stepney_SarahA,0,2294,like,IN,
...,...,...,...,...,...,...
106,Interviewer,279710,282260,rich,JJ,Degree=Pos
107,Interviewer,279710,282260,kids,NNS,Number=Plur
108,Interviewer,279710,282260,and,CC,
109,Interviewer,279710,282260,-,",",
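
The ``Features`` column stores each word's morphological features as a single string of ``Name=Value`` pairs separated by ``|``. For analysis it may be convenient to split that string into a dictionary. The sketch below is one way to do so; ``parse_feats`` is a hypothetical helper name, not a stanza function.

```python
def parse_feats(feats):
    """Turn 'Tense=Past|VerbForm=Part' into {'Tense': 'Past', 'VerbForm': 'Part'}.

    Returns an empty dict when there are no features (None, NaN or '').
    """
    # Missing features show up as None (or NaN after a CSV round trip)
    if not isinstance(feats, str) or not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("Tense=Past|VerbForm=Part"))
print(parse_feats(None))
```

Applied to the dataframe, this could look like ``annotated_df['Features'].apply(parse_feats)``.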


We can also see all adjectives in the data using the following line of code:

In [29]:
annotated_df[annotated_df.POS=='JJ']

Unnamed: 0,speaker,start_ms,end_ms,Word,POS,Features
6,AMA_Stepney_SarahA,0,2294,actual,JJ,Degree=Pos
72,Interviewer,270790,273530,sorry,JJ,Degree=Pos
96,Interviewer,278385,279495,big,JJ,Degree=Pos
101,Interviewer,279710,282260,High,JJ,Degree=Pos
106,Interviewer,279710,282260,rich,JJ,Degree=Pos
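
Note that ``JJ`` only matches base-form adjectives. Assuming the ``POS`` column holds Penn Treebank tags (as the output above suggests), comparatives and superlatives are tagged ``JJR`` and ``JJS``. The sketch below shows one way to select all three at once; the toy dataframe is made up for illustration and stands in for ``annotated_df``.

```python
import pandas as pd

# Made-up rows standing in for annotated_df
toy_df = pd.DataFrame({
    "Word": ["big", "bigger", "biggest", "kids"],
    "POS": ["JJ", "JJR", "JJS", "NNS"],
})

# Base, comparative and superlative adjective tags
adjective_tags = ["JJ", "JJR", "JJS"]
all_adjectives = toy_df[toy_df["POS"].isin(adjective_tags)]
print(all_adjectives)
```

The same filter on the real data would be ``annotated_df[annotated_df.POS.isin(adjective_tags)]``.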
