String processing with Python

Using a text corpus found on the cds-language GitHub repo or a corpus of your own found on a site such as Kaggle, write a Python script which calculates collocates for a specific keyword.



The script should take a directory of text files, a keyword, and a window size (number of words) as input parameters, and write its results to an output file called out/{filename}.csv
These parameters can be defined in the script itself
Find out how often each word collocates with the target across the corpus
Use this to calculate mutual information between the target word and all collocates across the corpus
Save the result as a single file consisting of three columns: collocate, raw_frequency, MI
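
The assignment does not say which formulation of mutual information to use. A minimal sketch, assuming the common corpus-linguistics definition (log2 of observed over expected co-occurrence frequency), could look like the block below. The function name `mutual_information` and its parameter names are hypothetical: `o11` is the number of times the collocate occurs with the keyword, `r1` is the row total for the keyword, `c1` is the column total for the collocate, and `n` is the total number of observations.

```python
import math


def mutual_information(o11, r1, c1, n):
    """Return MI = log2(O11 / E11), where E11 = (R1 * C1) / N.

    Assumption: this is the observed/expected formulation of MI;
    other variants exist and the assignment does not name one.
    """
    expected = (r1 * c1) / n      # E11: co-occurrences expected by chance
    return math.log2(o11 / expected)


# Illustrative numbers only, not taken from the corpus
print(mutual_information(o11=2, r1=4, c1=8, n=64))
```

A positive score means the pair co-occurs more often than chance would predict under these assumptions; a score near zero means roughly chance level.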


BONUS CHALLENGE: Use argparse to take inputs from the command line as parameters
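
One possible way to approach the bonus challenge is sketched below using `argparse` from the standard library. The flag names (`--text_dir`, `--keyword`, `--window_size`) and the helper `parse_args` are assumptions chosen to mirror the parameters described above; they are not prescribed by the assignment.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line inputs (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Calculate collocates for a keyword in a directory of text files."
    )
    parser.add_argument("--text_dir", required=True,
                        help="directory containing the .txt files")
    parser.add_argument("--keyword", required=True,
                        help="target word to find collocates for")
    parser.add_argument("--window_size", type=int, default=1,
                        help="number of words on each side of the keyword")
    # argv=None makes argparse read sys.argv; a list can be passed for testing
    return parser.parse_args(argv)


# Example call with an explicit argument list instead of the real command line
args = parse_args(["--text_dir", "corpus", "--keyword", "denounce", "--window_size", "2"])
print(args.text_dir, args.keyword, args.window_size)
```

Passing the argument list explicitly keeps the parser easy to test; in a script run from the command line, calling `parse_args()` with no argument would read the real inputs.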


General instructions

For this assignment, you should upload a standalone .py script which can be executed from the command line.
Save your script as collocation.py
Make sure to include a requirements.txt file and your data
You can either upload the scripts here or push to GitHub and include a link - or both!
Your code should be clearly documented in a way that allows others to easily follow the structure of your script and to use it from the command line


Purpose

This assignment is designed to test that you have an understanding of:

how to structure, document, and share Python scripts;
how to effectively make use of native Python packages for string processing;
how to extract basic linguistic information from large quantities of text, specifically in relation to a specific target keyword

In [8]:
#Importing all the necessary modules
import os
import re
from collections import Counter
import pandas as pd
from pathlib import Path

filepath = os.path.join("..", "data", "100_english_novels", "corpus")



In [9]:
# Define a function that takes a text directory, a keyword, and a window size (the number of words before and after the keyword)
def collocation(text_dir, keyword, window_size = 1):
    # Initialise the lists that the loop appends to
    collocations = list()
    collocations_unique = list()
    concordance_lines = list()
    collocate_lines = list()

    # For each file in text_dir whose name ends with "9.txt" (a subset of the corpus), read the file into "text"
    for path in Path(text_dir).glob("*9.txt"):
        with open(path, "r", encoding="utf-8") as file:
            text = file.read()

            # Tokenize each text into individual words
            text_tokens = re.compile(r"\W+").split(text)
            
            # Return index for each element text_tokens if the element in text_tokens is equal to keyword
            indices = [index for index, match in enumerate(text_tokens) if match == keyword]
            
            # For each keyword in the text, create an object (= concordance_line) that has keyword and the words just before and after (keyword +- window_size)
            for index in indices:
                concordance_line = text_tokens[max(0,index - window_size):index+window_size+1]
                
                # Append the concordance line to "concordance_lines"
                concordance_lines.append(concordance_line)

                # For each word in the concordance_line, add it to "new_collocations" if it is not the keyword.
                new_collocations = [collocate for collocate in concordance_line if collocate != keyword]

                # Extend the list collocations with all the collocations (words around keyword)
                collocations.extend(new_collocations)

                # Add each collocate to collocations_unique only if it does not already appear in the list.
                # (Checking word by word also avoids duplicates within a single concordance line.)
                for collocate in new_collocations:
                    if collocate not in collocations_unique:
                        collocations_unique.append(collocate)

    
    # Go through the collocations (all words that have appeared with the keyword in any of the texts) and count how often they have occurred with the keyword.
    o11 = Counter(collocations)

    # Create an empty dictionary for counting concordance lines where keyword occurs without collocation.
    o12 = dict()
    
    # For each unique collocation in collocations_unique:
    for collocation_unique in collocations_unique:
        
        # Set loop counter
        loop_count = 0
        
        # For each line in concordance_lines, if the unique collocation does NOT appear, add +1 to counter
        for concordance_line in concordance_lines:
            if collocation_unique not in concordance_line:
                loop_count += 1

        # Store the number of concordance lines in which this unique collocation did not appear with the keyword
        o12[collocation_unique] = loop_count
    
    # Get the number of concordance lines (there is one per occurrence of the keyword)
    n_times_keyword = len(concordance_lines)
    
    # Build a dictionary that maps every collocate to the total number of keyword occurrences.
    R1 = {x: n_times_keyword for x in o12}


    print(f"All concordance lines with keyword: {concordance_lines}")
    print(f"All collocations (not unique): {collocations}")
    print(f"All unique collocations: {collocations_unique}")
    print(f"Number of collocations: {len(collocations)}")
    print(f"Number of unique collocations: {len(collocations_unique)}")
    print(f"N-times keyword occurs with collocate: {o11}") # n-times keyword occurs with collocate
    print(f"n-times keyword occurs w/o collocate: {o12}") # n-times keyword occurs w/o collocate
    print(f"N-times keyword occurs (regardless of collocate): {R1}") # n-times keyword occurs (regardless of collocate)
    
######################## STILL MISSING (HAVEN'T DONE THIS YET) ########################
            # o21 # n-times collocate occurs w/o keyword
            # o22 # n-times neither keyword nor collocate appears within a 1 + window_size
            # R2 # n-times a window contains collocate w/o keyword, or neither contains keyword nor collocate
            # C1 # n-times collocate occurs either with or without keyword
            # C2 # n-times collocate does not occur
            # N # total number of observations (R1 + R2, which equals C1 + C2)

In [10]:
collocation(filepath, "denounce")

All concordance lines with keyword: [['to', 'denounce', 'the'], ['would', 'denounce', 'me'], ['would', 'denounce', 'them'], ['to', 'denounce', 'my']]
All collocations (not unique): ['to', 'the', 'would', 'me', 'would', 'them', 'to', 'my']
All unique collocations: ['to', 'the', 'would', 'me', 'them', 'my']
Number of collocations: 8
Number of unique collocations: 6
N-times keyword occurs with collocate: Counter({'to': 2, 'would': 2, 'the': 1, 'me': 1, 'them': 1, 'my': 1})
n-times keyword occurs w/o collocate: {'to': 2, 'the': 3, 'would': 2, 'me': 3, 'them': 3, 'my': 3}
N-times keyword occurs (regardless of collocate): {'to': 4, 'the': 4, 'would': 4, 'me': 4, 'them': 4, 'my': 4}
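
The quantities listed as still missing in the function can, in principle, be derived from the marginal totals of a 2x2 contingency table. The sketch below is one way to do that, not the author's implementation. It assumes that `c1` (how often the collocate occurs anywhere in the corpus) and `n` (the total number of observations) have already been counted elsewhere, for example with a `Counter` over all tokens; the function name `complete_table` is hypothetical.

```python
def complete_table(o11, r1, c1, n):
    """Fill in the remaining cells of a 2x2 contingency table.

    Assumptions: o11, r1, c1 and n are counted over the same
    observations, so that every derived cell is non-negative.
    """
    o12 = r1 - o11   # keyword occurs without the collocate
    o21 = c1 - o11   # collocate occurs without the keyword
    r2 = n - r1      # observations without the keyword
    c2 = n - c1      # observations without the collocate
    o22 = r2 - o21   # neither keyword nor collocate
    return {"o11": o11, "o12": o12, "o21": o21, "o22": o22,
            "R1": r1, "R2": r2, "C1": c1, "C2": c2, "N": n}


# Illustrative numbers only: c1 and n are made up, not taken from the corpus
table = complete_table(o11=2, r1=4, c1=8, n=64)
print(table)
```

With `o11=2` and `r1=4`, the derived `o12` is 2, which matches the value printed for 'to' in the output above. The four cells always sum to `n`, which is a quick sanity check on the counts.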
