# Analysis of Variation Data Processing Notebook

The purpose of this notebook is to annotate VOC data for further variationist analysis.  

# 0.0 Creating a virtual environment

A virtual environment in Python provides a self-contained and isolated environment for each project, allowing you to manage dependencies separately and avoid conflicts between different projects or system installations. It ensures reproducibility, simplifies dependency management, and promotes consistent environments across development stages without impacting the global Python environment.

First, create an environment called ``linguist258`` in which you will install all the necessary packages. You can do this with either conda (0.1A) or Python's built-in ``venv`` module (0.1B).

## 0.1A Conda


In [None]:
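# --yes skips the confirmation prompt, which is awkward to answer from a notebook cell
!conda create --yes --name linguist258
# Each ! command runs in its own subshell, so this activation does not carry
# over to later cells; run it in a terminal before launching the notebook.
!conda activate linguist258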


## 0.1B Python 

In [None]:
!python -m venv linguist258
# macOS / Linux (the environment folder is named linguist258, as created above)
!source linguist258/bin/activate
# Windows
#!.\linguist258\Scripts\activate
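
Because every ``!`` command runs in its own subshell, it can be worth confirming from Python which environment the notebook kernel is actually using. The following is a minimal sketch using only the standard library; the helper name ``describe_environment`` is our own and not part of any package.

```python
import os
import sys


def describe_environment():
    """Report which Python environment the current interpreter runs in."""
    return {
        # Path of the interpreter that is executing this code
        "executable": sys.executable,
        # True inside an environment created with `python -m venv`
        "in_venv": sys.prefix != sys.base_prefix,
        # Name of the active conda environment, or None outside conda
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


print(describe_environment())
```

If neither ``in_venv`` is ``True`` nor ``conda_env`` reads ``linguist258``, the kernel is most likely running outside the environment created above.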


## 0.2 Packages
Once we have created and activated ``linguist258``, we will install the required packages. This cell only needs to be run once: after the packages are installed in the environment, they do not need to be installed again.

In [6]:
!pip install pympi-ling
!pip install pandas
!pip install stanza
!pip install numpy

Collecting stanza
  Downloading stanza-1.8.2-py3-none-any.whl (990 kB)
Collecting networkx
  Downloading networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting emoji
  Downloading emoji-2.11.1-py2.py3-none-any.whl (433 kB)
Installing collected packages: networkx, emoji, stanza
Successfully installed emoji-2.11.1 networkx-3.1 stanza-1.8.2
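
To check that the installation worked in the environment the notebook is using, we can ask Python whether each module can be found. This is a small sketch, not part of the original pipeline; ``missing_modules`` is a hypothetical helper, and note that the ``pympi-ling`` package is imported under the name ``pympi``.

```python
from importlib.util import find_spec

# Import names for the packages installed above
# (pympi-ling is imported as `pympi`).
REQUIRED_MODULES = ["pympi", "pandas", "stanza", "numpy"]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES))
```

An empty list means all four packages are visible to the kernel.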


# 1.0 Loading Data

These are the imports necessary to run the rest of the cells. Cells from here down should be run in sequence.

In [8]:
from pympi import Eaf, Praat
import pandas as pd
from pathlib import Path

In the following cell, replace the text in quotation marks with the path on your computer where your .eaf or .TextGrid files are located.

In [9]:
# This should be the path to the folder containing your .eaf files.
path_to_transcripts = Path("<Path to your data>")

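Here, we define three helper functions: ``format_annotations`` turns raw annotation tuples into dictionaries, ``get_annotations`` extracts the annotations from a single .eaf file, and ``iterate_over_folder`` runs that extraction over every .eaf file in a folder.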

In [46]:
def format_annotations(tier_name, filename, annotation_list, output_list):
    """Appends each (begin, end, value) tuple in annotation_list to output_list
    as a dictionary with the keys filename, speaker, start_ms, end_ms and text."""
    
    for start_ms, end_ms, text in annotation_list:

        output_list.append({
            'filename': filename,
            'speaker': tier_name,
            'start_ms': start_ms,
            'end_ms': end_ms,
            'text': text
        })
    
    return output_list


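def get_annotations(root_path, filename, annotations=None, tier_name=None):
    """Extracts all annotations for a given .eaf
    file for a given tier (if tier_name specified) or for all tiers"""
    # Start from a fresh list on each call; a mutable default such as list()
    # is shared between calls and would accumulate rows across runs.
    if annotations is None:
        annotations = []
    # Load elan file
    elan_object = Eaf(file_path=filename)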
    
    # If a tier name is provided, only annotations for that
    # tier will be used.
    if tier_name:
        tiernames = [tier_name]
    else:
        # Get all tier names
        tiernames = elan_object.get_tier_names()

    # Iterate over all tiers and extract annotations
    for tier in tiernames:
        
        annotations = format_annotations(
            tier,
            filename.stem,
            elan_object.get_annotation_data_for_tier(tier),
            annotations
            )
    return annotations
    

def iterate_over_folder(
    root_path, annotations=None,
    file_extension='.eaf', # This parameter can't really be changed with the code as is.
    output_format='dict_list',
    tier_name=None
    ):
    """Iterates over files of a specified extension and 
    extracts annotations into either a list of dictionaries
    or a pandas dataframe."""
    # Use a fresh list per call so that re-running the cell does not duplicate rows
    if annotations is None:
        annotations = []
    for file in root_path.glob(f'*{file_extension}'):
        
        annotations = get_annotations(
            root_path, file,
            annotations=annotations,
            tier_name=tier_name
            )
    
    
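    if output_format == 'pandas':
        return pd.DataFrame(annotations)
    elif output_format == 'dict_list':
        return annotations
    else:
        # Fail loudly rather than silently returning None
        raise ValueError(f"Unknown output_format: {output_format!r}")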
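
To see the row structure these functions produce without needing an .eaf file, we can run the formatting step on a few made-up annotations. The sketch below repeats ``format_annotations`` so that it runs on its own; the tier name, file name and annotation values are invented for illustration.

```python
def format_annotations(tier_name, filename, annotation_list, output_list):
    """Same logic as the function defined above, repeated so this runs alone."""
    for start_ms, end_ms, text in annotation_list:
        output_list.append({
            'filename': filename,
            'speaker': tier_name,
            'start_ms': start_ms,
            'end_ms': end_ms,
            'text': text,
        })
    return output_list


# Made-up annotations in the (begin, end, value) shape; times are in ms.
toy_tier = [(0, 1200, "hello there"), (1200, 2500, "how are you")]

# Pass a fresh list so rows from earlier calls are not carried over.
rows = format_annotations("SpeakerA", "toy_file", toy_tier, [])
for row in rows:
    print(row)
```

Each dictionary becomes one row when the list is handed to ``pd.DataFrame``, which is what ``output_format='pandas'`` does.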


### Process annotations
The cell below writes the annotations to a CSV file. Please make sure you have set ``path_to_transcripts`` to your own folder before running it.

In [47]:
data = iterate_over_folder(path_to_transcripts, output_format='pandas')
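# index=False keeps the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column.
data.to_csv(path_to_transcripts / 'annotation_data.csv', index=False)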

The following functions are utilities for writing annotations back into an ELAN file once they have been processed. Praat output is not implemented yet.

In [11]:
# The tier name can probably be derived from annotations. 
# I can change this once we have a better idea of the 
# output structure.
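def output_tier(eaf_object, tier_name, annotation_name, annotation_list):
    """Adds a new tier to eaf_object and fills it with one annotation per
    dictionary in annotation_list, using the value stored under annotation_name."""
    eaf_object.add_tier(tier_name)

    for annotation_dict in annotation_list:
        eaf_object.add_annotation(
            tier_name,
            annotation_dict['start_ms'],
            annotation_dict['end_ms'],
            value=annotation_dict[annotation_name] # Probably something like 'pos_tags'
            )

    return eaf_object


def write_eaf(eaf_object, output_file):
    """Writes eaf_object to disk as an .eaf file at output_file."""
    eaf_object.to_file(output_file)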
        
# Praat functionality can be added to both the input and
# output pipelines if necessary.
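
``output_tier`` expects ``annotation_list`` to be a list of dictionaries, each with ``start_ms``, ``end_ms`` and the key named by ``annotation_name``. As a rough sketch of how such a list might be built, the hypothetical helper below collapses word-level rows into one ``word/TAG`` string per utterance. The row keys ``Word`` and ``POS``, the ``pos_tags`` key and the helper name are assumptions for illustration, and the rows are assumed to come from a single file.

```python
def build_pos_annotations(word_rows):
    """Collapse word-level rows into one 'word/TAG ...' string per utterance."""
    grouped = {}
    for row in word_rows:
        # Words from the same utterance share the same time span
        key = (row['start_ms'], row['end_ms'])
        grouped.setdefault(key, []).append(f"{row['Word']}/{row['POS']}")
    return [
        {'start_ms': start, 'end_ms': end, 'pos_tags': ' '.join(tags)}
        for (start, end), tags in grouped.items()
    ]


# Hypothetical word-level rows for a single utterance
word_rows = [
    {'start_ms': 598390, 'end_ms': 600770, 'Word': 'doing', 'POS': 'VBG'},
    {'start_ms': 598390, 'end_ms': 600770, 'Word': 'whatever', 'POS': 'WP'},
    {'start_ms': 598390, 'end_ms': 600770, 'Word': 'they', 'POS': 'PRP'},
]

annotation_list = build_pos_annotations(word_rows)
print(annotation_list)
```

The resulting list could then be passed to ``output_tier`` with ``annotation_name='pos_tags'``.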


# 2.0 Linguistic Annotation

We begin by importing stanza and numpy.

In [12]:
import stanza
import numpy as np



We now download the necessary English models for stanza. This only needs to be done once.

The first argument ``en`` is the language code for English, and the processors include a tokenizer, part of speech tagger, and a lemmatizer.


In [13]:
stanza.download('en', processors='tokenize,pos,lemma')

2024-05-06 18:55:42 INFO: Downloaded file to /Users/anton/stanza_resources/resources.json
2024-05-06 18:55:42 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package           |
---------------------------------------
| tokenize        | combined          |
| mwt             | combined          |
| pos             | combined_charlm   |
| lemma           | combined_nocharlm |
| forward_charlm  | 1billion          |
| pretrain        | conll17           |
| backward_charlm | 1billion          |

2024-05-06 18:55:43 INFO: Downloaded file to /Users/anton/stanza_resources/en/tokenize/combined.pt
2024-05-06 18:55:45 INFO: Downloaded file to /Users/anton/stanza_resources/en/mwt/combined.pt
2024-05-06 18:56:02 INFO: Downloaded file to /Users/anton/stanza_resources/en/pos/combined_charlm.pt
2024-05-06 18:56:09 INFO: Downloaded file to /Users/anton/stanza_resources/en/lemma/combined_nocharlm.pt
2024-05-06 18:56:09 INFO: File exists: /Users/anton/stanza_resources/en/forward_charlm/1billion.pt
2024-05-06 18:56:27 INFO: Downloaded file to /Users/anton/stanza_resources/en/pretrain/conll17.pt
2024-05-06 18:56:27 INFO: File exists: /Users/anton/stanza_resources/en/backward_charlm/1billion.pt
2024-05-06 18:56:27 INFO: Finished downloading models and saved to /Users/anton/stanza_resources


Once we download the necessary stanza packages, we initialize the pipeline. 

In [14]:
nlp = stanza.Pipeline('en',
                      use_gpu=False, 
                      processors=['tokenize','pos','lemma','mwt'])





2024-05-06 18:56:27 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2024-05-06 18:56:27 INFO: Downloaded file to /Users/anton/stanza_resources/resources.json
2024-05-06 18:56:28 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2024-05-06 18:56:28 INFO: Using device: cpu
2024-05-06 18:56:28 INFO: Loading: tokenize
2024-05-06 18:56:28 INFO: Loading: mwt
2024-05-06 18:56:28 INFO: Loading: pos
2024-05-06 18:56:28 INFO: Loading: lemma
2024-05-06 18:56:28 INFO: Done loading processors!


In [48]:
# Load the CSV written in section 1.0; substitute the actual path to your data if it differs
df = pd.read_csv(path_to_transcripts / 'annotation_data.csv')

We now annotate the utterances using ``nlp()``. The ``apply()`` function allows us to apply a function to a column in a pandas dataframe. 

In [49]:
df['annotated'] = df['text'].apply(nlp)

In [50]:
annotated_data = []
columns = ['filename', 'speaker', 'start_ms', 'end_ms', 'annotated']
for filename, speaker, start_ms, end_ms, doc in df[columns].itertuples(index=False, name=None):
    for sent in doc.sentences:
        # (word, tag) pairs for the whole sentence, shared by all of its words
        tagged_sentence = [(wrd.text, wrd.xpos) for wrd in sent.words]
        for word in sent.words:
            annotation = [filename, speaker, start_ms, end_ms, sent.text,
                          word.text, word.xpos, word.feats, tagged_sentence]
            annotated_data.append(annotation)
            print(annotation)
    print()
                    
                    

['RED_Hooks_Evelyn', 'RED_Hooks_Evelyn', 598390, 600770, 'doing whatever they wanted to do.', 'doing', 'VBG', 'VerbForm=Ger', [('doing', 'VBG'), ('whatever', 'WP'), ('they', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('do', 'VB'), ('.', '.')]]
['RED_Hooks_Evelyn', 'RED_Hooks_Evelyn', 598390, 600770, 'doing whatever they wanted to do.', 'whatever', 'WP', 'PronType=Rel', [('doing', 'VBG'), ('whatever', 'WP'), ('they', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('do', 'VB'), ('.', '.')]]
['RED_Hooks_Evelyn', 'RED_Hooks_Evelyn', 598390, 600770, 'doing whatever they wanted to do.', 'they', 'PRP', 'Case=Nom|Number=Plur|Person=3|PronType=Prs', [('doing', 'VBG'), ('whatever', 'WP'), ('they', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('do', 'VB'), ('.', '.')]]
['RED_Hooks_Evelyn', 'RED_Hooks_Evelyn', 598390, 600770, 'doing whatever they wanted to do.', 'wanted', 'VBD', 'Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin', [('doing', 'VBG'), ('whatever', 'WP'), ('they', 'PRP'), ('wanted', 'VBD'), 

In [51]:
annotated_df = pd.DataFrame(annotated_data,
                            columns=['filename', 'speaker', 'start_ms', 'end_ms', 'text', 'Word','POS','Features', 'tagged_sentence'] )

Here's what the data looks like:

In [52]:
annotated_df

Unnamed: 0,filename,speaker,start_ms,end_ms,text,Word,POS,Features,tagged_sentence
0,RED_Hooks_Evelyn,RED_Hooks_Evelyn,598390,600770,doing whatever they wanted to do.,doing,VBG,VerbForm=Ger,"[(doing, VBG), (whatever, WP), (they, PRP), (w..."
1,RED_Hooks_Evelyn,RED_Hooks_Evelyn,598390,600770,doing whatever they wanted to do.,whatever,WP,PronType=Rel,"[(doing, VBG), (whatever, WP), (they, PRP), (w..."
2,RED_Hooks_Evelyn,RED_Hooks_Evelyn,598390,600770,doing whatever they wanted to do.,they,PRP,Case=Nom|Number=Plur|Person=3|PronType=Prs,"[(doing, VBG), (whatever, WP), (they, PRP), (w..."
3,RED_Hooks_Evelyn,RED_Hooks_Evelyn,598390,600770,doing whatever they wanted to do.,wanted,VBD,Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbF...,"[(doing, VBG), (whatever, WP), (they, PRP), (w..."
4,RED_Hooks_Evelyn,RED_Hooks_Evelyn,598390,600770,doing whatever they wanted to do.,to,TO,,"[(doing, VBG), (whatever, WP), (they, PRP), (w..."
...,...,...,...,...,...,...,...,...,...
2654,SAL_Valdez_David,SAL_Valdez_David,895270,901453,"so you know it was good it was, uh, it was gre...",up,RP,,"[(so, RB), (you, PRP), (know, VBP), (it, PRP),..."
2655,SAL_Valdez_David,SAL_Valdez_David,895270,901453,"so you know it was good it was, uh, it was gre...",on,IN,,"[(so, RB), (you, PRP), (know, VBP), (it, PRP),..."
2656,SAL_Valdez_David,SAL_Valdez_David,895270,901453,"so you know it was good it was, uh, it was gre...",roller,NN,Number=Sing,"[(so, RB), (you, PRP), (know, VBP), (it, PRP),..."
2657,SAL_Valdez_David,SAL_Valdez_David,895270,901453,"so you know it was good it was, uh, it was gre...",blading,NN,Number=Sing,"[(so, RB), (you, PRP), (know, VBP), (it, PRP),..."
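
The ``Features`` column stores morphological features as a single string of ``Key=Value`` pairs separated by ``|``, and it is empty for words without features. If we want to filter on an individual feature, one option is to split that string into a dictionary. The helper below is a minimal sketch; ``parse_feats`` is our own name, not a stanza function.

```python
def parse_feats(feats):
    """Turn 'Key=Value|Key=Value' into a dict; empty dict if no features."""
    # Missing features show up as None (from stanza) or NaN (from pandas)
    if not isinstance(feats, str) or not feats:
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))


print(parse_feats('Case=Nom|Number=Plur|Person=3|PronType=Prs'))
print(parse_feats(None))
```

Applied to the dataframe, this could look like ``annotated_df['Features'].apply(parse_feats)``. Note that all values stay strings, so ``Person`` is ``'3'`` rather than the number 3.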


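We can also see all adjectives in the data using the following line of code. ``JJ`` is the Penn Treebank tag for base-form adjectives; comparatives (``JJR``) and superlatives (``JJS``) have their own tags and are not matched by this filter.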

In [53]:
annotated_df[annotated_df.POS=='JJ']

Unnamed: 0,filename,speaker,start_ms,end_ms,text,Word,POS,Features,tagged_sentence
41,RED_Hooks_Evelyn,RED_Hooks_Evelyn,611160,616180,I kinda just uh was taken in by the- the white...,white,JJ,Degree=Pos,"[(I, PRP), (kinda, RB), (just, RB), (uh, UH), ..."
45,RED_Hooks_Evelyn,RED_Hooks_Evelyn,611160,616180,I kinda just uh was taken in by the- the white...,else,JJ,Degree=Pos,"[(I, PRP), (kinda, RB), (just, RB), (uh, UH), ..."
93,RED_Hooks_Evelyn,RED_Hooks_Evelyn,624410,627350,uh I mean I love my family don't get me wrong.,wrong,JJ,Degree=Pos,"[(uh, UH), (I, PRP), (mean, VBP), (I, PRP), (l..."
196,RED_Hooks_Evelyn,RED_Hooks_Evelyn,649370,657130,"oh yeah, you know playing sports and stuff you...",black,JJ,Degree=Pos,"[(oh, UH), (yeah, UH), (,, ,), (you, PRP), (kn..."
222,RED_Hooks_Evelyn,RED_Hooks_Evelyn,657540,662660,the people that I grew up with in high school ...,high,JJ,Degree=Pos,"[(the, DT), (people, NNS), (that, WDT), (I, PR..."
...,...,...,...,...,...,...,...,...,...
2609,SAL_Valdez_David,SAL_Valdez_David,885901,888676,she told me like oh so you can be at the same ...,same,JJ,Degree=Pos,"[(she, PRP), (told, VBD), (me, PRP), (like, UH..."
2618,SAL_Valdez_David,SAL_Valdez_David,888965,895233,and it was good because I later looked up thos...,good,JJ,Degree=Pos,"[(and, CC), (it, PRP), (was, VBD), (good, JJ),..."
2630,SAL_Valdez_David,SAL_Valdez_David,888965,895233,and it was good because I later looked up thos...,worth,JJ,Degree=Pos,"[(and, CC), (it, PRP), (was, VBD), (good, JJ),..."
2640,SAL_Valdez_David,SAL_Valdez_David,895270,901453,"so you know it was good it was, uh, it was gre...",good,JJ,Degree=Pos,"[(so, RB), (you, PRP), (know, VBP), (it, PRP),..."
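
The filter above matches the ``JJ`` tag only. If comparative and superlative adjectives are also of interest, the Penn Treebank tags ``JJR`` and ``JJS`` can be included as well. Here is a small sketch on a list of ``(word, tag)`` pairs in the same shape as the ``tagged_sentence`` column; ``adjectives_in`` is a hypothetical helper and the last pair is made up for illustration.

```python
# Penn Treebank adjective tags: base form, comparative, superlative
ADJECTIVE_TAGS = {'JJ', 'JJR', 'JJS'}


def adjectives_in(tagged_sentence):
    """Return the words in a tagged sentence that carry an adjective tag."""
    return [word for word, tag in tagged_sentence if tag in ADJECTIVE_TAGS]


tagged = [('and', 'CC'), ('it', 'PRP'), ('was', 'VBD'), ('good', 'JJ'),
          ('better', 'JJR')]  # ('better', 'JJR') is a made-up addition

print(adjectives_in(tagged))
```

On the dataframe itself, the equivalent filter would be ``annotated_df[annotated_df.POS.isin(['JJ', 'JJR', 'JJS'])]``.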


# 3.0 Beautify our data

Here we're going to filter the data so that each row corresponds to one adjective and keeps the full text of the utterance it came from. We're also converting the millisecond timestamps to seconds. Inspect the output to check that everything was processed correctly.

In [56]:
annotated_df['start_s'] = annotated_df['start_ms'] / 1000
annotated_df['end_s'] = annotated_df['end_ms'] / 1000
clean_df = annotated_df[['filename','speaker', 'start_s', 'end_s', 'text', 'tagged_sentence', 'Word', 'POS', 'Features']]
adj_df = clean_df[clean_df.POS == 'JJ']
adj_df.to_csv(path_to_transcripts / 'hooks_valdez_adjectives.csv')
adj_df

Unnamed: 0,filename,speaker,start_s,end_s,text,tagged_sentence,Word,POS,Features
41,RED_Hooks_Evelyn,RED_Hooks_Evelyn,611.160,616.180,I kinda just uh was taken in by the- the white...,"[(I, PRP), (kinda, RB), (just, RB), (uh, UH), ...",white,JJ,Degree=Pos
45,RED_Hooks_Evelyn,RED_Hooks_Evelyn,611.160,616.180,I kinda just uh was taken in by the- the white...,"[(I, PRP), (kinda, RB), (just, RB), (uh, UH), ...",else,JJ,Degree=Pos
93,RED_Hooks_Evelyn,RED_Hooks_Evelyn,624.410,627.350,uh I mean I love my family don't get me wrong.,"[(uh, UH), (I, PRP), (mean, VBP), (I, PRP), (l...",wrong,JJ,Degree=Pos
196,RED_Hooks_Evelyn,RED_Hooks_Evelyn,649.370,657.130,"oh yeah, you know playing sports and stuff you...","[(oh, UH), (yeah, UH), (,, ,), (you, PRP), (kn...",black,JJ,Degree=Pos
222,RED_Hooks_Evelyn,RED_Hooks_Evelyn,657.540,662.660,the people that I grew up with in high school ...,"[(the, DT), (people, NNS), (that, WDT), (I, PR...",high,JJ,Degree=Pos
...,...,...,...,...,...,...,...,...,...
2609,SAL_Valdez_David,SAL_Valdez_David,885.901,888.676,she told me like oh so you can be at the same ...,"[(she, PRP), (told, VBD), (me, PRP), (like, UH...",same,JJ,Degree=Pos
2618,SAL_Valdez_David,SAL_Valdez_David,888.965,895.233,and it was good because I later looked up thos...,"[(and, CC), (it, PRP), (was, VBD), (good, JJ),...",good,JJ,Degree=Pos
2630,SAL_Valdez_David,SAL_Valdez_David,888.965,895.233,and it was good because I later looked up thos...,"[(and, CC), (it, PRP), (was, VBD), (good, JJ),...",worth,JJ,Degree=Pos
2640,SAL_Valdez_David,SAL_Valdez_David,895.270,901.453,"so you know it was good it was, uh, it was gre...","[(so, RB), (you, PRP), (know, VBP), (it, PRP),...",good,JJ,Degree=Pos
