# Come up with your own project!

Define your programming project at a high level (individual or in small groups)!

Here is a recommended outline to get you started:

__Task:__ extract the most frequent words / annotations from (sub-)corpora by genre

__Data:__ UD_English-GUM

Now it's your turn:
- pick two genres out of {academic, bio, conversation, court, essay, fiction, interview, letter, news, podcast, speech, textbook, vlog, voyage, whow}
- pick a language feature out of {POS, MORPH, DEPREL, Stopwords, SentenceLength, WordLength}
- formulate a research question and/or hypothesis about
    a) what frequency difference you expect between the two genres and
    b) what this difference could mean (e.g. in terms of how language works, who the authors/audiences are, how we can apply AI to the texts)

- coding: read the UD GUM file(s), parse the CoNLL-U format, and count feature statistics

__Important:__ Document your code and your coding process. Answer your research question as fully as you can, propose explanations for your findings, and describe any difficulties you ran into.
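
The coding step can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function name `count_feature_by_genre` and the toy input lines are made up for this example, and it assumes the genre is announced by a `# meta::genre = ...` comment line, as in the GUM files. Column index 3 is UPOS in the CoNLL-U format; passing `column=7` would count DEPREL instead.

```python
from collections import Counter, defaultdict

UPOS_COLUMN = 3  # 0-based index of the UPOS column in a CoNLL-U token line


def count_feature_by_genre(lines, column=UPOS_COLUMN):
    """Count the values of one CoNLL-U column separately for each genre."""
    counts = defaultdict(Counter)
    genre = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('# meta::genre ='):
            genre = line.split('=', 1)[1].strip()
        elif line and not line.startswith('#'):
            fields = line.split('\t')
            # Skip multiword-token ranges (ID like "1-2") and empty nodes (ID like "1.1")
            if fields[0].isdigit():
                counts[genre][fields[column]] += 1
    return counts


# Toy input invented for illustration, NOT taken from the corpus
sample = [
    '# newdoc id = toy_academic\n',
    '# meta::genre = academic\n',
    '1\tResults\tresult\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n',
    '2\tvary\tvary\tVERB\tVBP\t_\t0\troot\t_\t_\n',
    '\n',
    '# newdoc id = toy_conversation\n',
    '# meta::genre = conversation\n',
    '1\tYeah\tyeah\tINTJ\tUH\t_\t0\troot\t_\t_\n',
    '\n',
]

counts = count_feature_by_genre(sample)
print(counts['academic'].most_common())
print(counts['conversation'].most_common())
```

To run this on the real data, an open file handle can be passed in place of `sample`, since the function only needs an iterable of lines.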

In [2]:
%pip install pandas


In [2]:
# Load the dataset
import pandas as pd
import re
import pathlib  # Needed to handle file paths

# Specify the path to the file
infile = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev.conllu')

# Open the file with UTF-8 encoding so that non-ASCII characters are read correctly
with open(infile, 'r', encoding='utf-8') as f:
    for line in f:  # Loop over the individual lines in the file
        # Each line keeps its trailing newline, so print() adds a second one
        # and the output appears double-spaced
        print(line)


# newdoc id = GUM_academic_exposure

# global.Entity = GRP-etype-infstat-centering-minspan-link-identity

# meta::author = Kara Morgan-Short, Ingrid Finger, Sarah Grey, Michael T. Ullman

# meta::dateCollected = 2018-09-11

# meta::dateCreated = 2012-03-28

# meta::dateModified = 2012-03-28

# meta::genre = academic

# meta::salientEntities = 3, 56, 98

# meta::sourceURL = https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0032974

# meta::speakerCount = 0

# meta::summary = This study shows that limited exposure to a second language (L2) after it is no longer being actively used generally causes attrition of L2 competence.

# meta::title = Second Language Processing Shows Increased Native-Like Neural Responses after Months of No Exposure

# sent_id = GUM_academic_exposure-1

# s_prominence = 2

# s_type = frag

# transition = establishment

# text = Introduction

# newpar

# newpar_block = head (1 s)

1	Introduction	introduction	NOUN	NN	Number=Sing	0	root	0:root	Discour
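
The token line at the end of the output above has ten tab-separated columns, which the CoNLL-U format names ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC. A small sketch of how one such line could be turned into a dictionary is shown below. The helper name `parse_token_line` is hypothetical, and because the last column is cut off in the output above, the example line uses `_` (the CoNLL-U placeholder for an empty value) in its place.

```python
CONLLU_FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc']


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip('\n').split('\t')
    token = dict(zip(CONLLU_FIELDS, values))
    # FEATS looks like "Number=Sing|Case=Nom"; "_" means no features
    feats = token['feats']
    if feats == '_':
        token['feats'] = {}
    else:
        token['feats'] = dict(pair.split('=', 1) for pair in feats.split('|'))
    return token


# MISC column replaced by "_" because it is truncated in the printed output
example = '1\tIntroduction\tintroduction\tNOUN\tNN\tNumber=Sing\t0\troot\t0:root\t_'
token = parse_token_line(example)
print(token['upos'], token['deprel'], token['feats'])
```

With named fields, picking a feature such as POS (`upos`), MORPH (`feats`) or DEPREL (`deprel`) becomes a dictionary lookup rather than a column index.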

In [3]:
import pathlib
import re

# Specify input and output file paths
input_file = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev.conllu')
combined_output_file = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev-academic-conversation.conllu')
academic_output_file = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev-academic.conllu')
conversation_output_file = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev-conversation.conllu')

# Track the genre of the current document and collect the lines of the current sentence
current_genre = None
relevant_genres = ['academic', 'conversation']  # The genres we are interested in
buffer = []  # Buffer to hold the lines of one sentence block

# Open the input file and create separate files for the genres
with open(input_file, 'r', encoding='utf-8') as infile, \
     open(combined_output_file, 'w', encoding='utf-8') as combined_outfile, \
     open(academic_output_file, 'w', encoding='utf-8') as academic_outfile, \
     open(conversation_output_file, 'w', encoding='utf-8') as conversation_outfile:

    for line in infile:
        # A '# newdoc' line starts a new document whose genre is not known yet
        if line.startswith('# newdoc'):
            current_genre = None
        # Check if the line defines the genre of the current document
        elif line.startswith('# meta::genre ='):
            current_genre = line.split('=')[-1].strip()  # Extract the genre from the line

        # Buffer every line: the document header comes before its genre line,
        # so we can only decide at the end of the block whether to keep it
        buffer.append(line)

        # An empty line marks the end of a sentence block in CoNLL-U format
        if line.strip() == '':
            if current_genre in relevant_genres:
                # Write to combined file (both academic and conversation)
                combined_outfile.writelines(buffer)

                # Write to separate files based on genre
                if current_genre == 'academic':
                    academic_outfile.writelines(buffer)
                elif current_genre == 'conversation':
                    conversation_outfile.writelines(buffer)

            # Clear the buffer for the next sentence block
            buffer = []
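
Once the two genres are in separate files, their feature counts still need to be made comparable, because the sub-corpora are unlikely to have the same number of tokens. One common approach is to convert raw counts to a rate per 1,000 tokens before comparing. The sketch below illustrates this; the helper names `relative_frequencies` and `compare` are hypothetical, and the two `Counter` objects contain made-up numbers rather than results from GUM.

```python
from collections import Counter


def relative_frequencies(counts, per=1000):
    """Convert raw counts into a rate per `per` tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: per * n / total for label, n in counts.items()}


def compare(counts_a, counts_b, per=1000):
    """Return (label, rate_a, rate_b) rows, largest absolute difference first."""
    freq_a = relative_frequencies(counts_a, per)
    freq_b = relative_frequencies(counts_b, per)
    labels = sorted(set(freq_a) | set(freq_b))
    rows = [(label, freq_a.get(label, 0.0), freq_b.get(label, 0.0))
            for label in labels]
    return sorted(rows, key=lambda row: abs(row[1] - row[2]), reverse=True)


# Made-up counts for illustration only, NOT results from the corpus
academic = Counter({'NOUN': 6, 'VERB': 3, 'PRON': 1})
conversation = Counter({'NOUN': 2, 'VERB': 3, 'PRON': 4, 'INTJ': 1})

rows = compare(academic, conversation)
for label, rate_a, rate_b in rows:
    print(f'{label:6} {rate_a:8.1f} {rate_b:8.1f}')
```

Sorting by the absolute difference puts the labels that best separate the two genres at the top, which is a convenient starting point for discussing the hypothesis.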


Space for your notes