# Mini Lab 6: Collocations

We're going to use the same libraries and introductory steps in our processing pipeline as in previous mini labs.

For details about the package and its functions, see: <https://docuscospacy.readthedocs.io/en/latest/docuscope.html>

If you'd like to explore what this library does in an interactive online interface, you can go to: <https://docuscope-ca.eberly.cmu.edu/>

## Prepare your data

In this lab, we'll also be reviewing the steps for loading text data directly from your Google Drive. To prepare, follow these steps:


1.   Download the `ya_corpus` from Canvas.
2.   This is a zipped file, so you will need to unzip it. On a Mac, just double-click the file. On Windows, right-click the zip file and choose "Extract All" from the context menu.
3.   Make a `data` folder on your Google Drive in a location that is sensible to you.
4.   In Drive, open that new folder, click `+ New`, then `Folder upload`. Find the unzipped `ya_corpus` folder and upload it.

---
If you're working through this lab without access to Canvas, you can instead read in some preprocessed data and filter it, in place of code chunk 11.

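```{python}
import polars as pl

# Read the preprocessed token table for the whole corpus
twilight_tokens = pl.read_parquet('https://github.com/browndw/humanities_analytics/raw/refs/heads/main/data/data_tables/ya_tokens.parquet')

# Keep only the rows whose doc_id mentions Twilight
twilight_tokens = twilight_tokens.filter(pl.col("doc_id").str.contains("Twilight"))
```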


## Install the libraries

Note that the capture decorator simply suppresses the installation output.

## Mount Drive

Connect to your Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load the libraries

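We'll need these for our processing pipeline (docuscospacy, spacy), to wrangle data frames (polars), to generate and manipulate tables (great_tables), and to create plots (matplotlib). The cell below also imports `Word2Vec` from gensim and the standard-library `re` module, which the functions in this lab rely on.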

In [26]:
from gensim.models import Word2Vec
import re

## Import data

Once you've added the ya_corpus folder to your Drive, you can read in the **Twilight** texts.

Change the path (to JUST the Twilight directory), remove the comment (#) and run...

In [65]:
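# 1. Prepare the text data
def preprocess_text(text_file):
    """Read a text file and return a list of sentences, each a list of lowercase word tokens."""
    with open(text_file, 'r', encoding='utf-8') as f:
        text = f.read()
    # Split on whitespace that follows '.' or '?', skipping abbreviations such as "e.g." or "Dr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    # Lowercase each sentence and keep only runs of word characters
    tokenized_sentences = [re.findall(r'\b\w+\b', sentence.lower()) for sentence in sentences]
    return tokenized_sentences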

# 2. Train the Word2Vec model
def train_word2vec_model(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=0):
    model = Word2Vec(sentences=tokenized_sentences, vector_size=vector_size, window=window, min_count=min_count, workers=workers, sg=sg)
    return model

# 3. Save and use the model
def save_model(model, file_path):
    model.save(file_path)

def load_model(file_path):
    return Word2Vec.load(file_path)
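
To see what the preparation step produces, here is a minimal sketch that applies the same two regular expressions as `preprocess_text` to a short in-memory string instead of a file. The helper name `tokenize_string` and the sample text are hypothetical and exist only for illustration.

```python
import re

# The same patterns used in preprocess_text above
SENTENCE_BOUNDARY = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
WORD = r'\b\w+\b'

def tokenize_string(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    sentences = re.split(SENTENCE_BOUNDARY, text)
    return [re.findall(WORD, sentence.lower()) for sentence in sentences]

# Hypothetical sample text, not taken from the corpus
sample = "Dr. Lee saw the rain. Was it cold? It was."
tokens = tokenize_string(sample)

# One list of word tokens per sentence
for sentence_tokens in tokens:
    print(sentence_tokens)
```

Each inner list is one sentence, which is the shape `Word2Vec` expects for its `sentences` argument. Note that the abbreviation "Dr." does not end a sentence, and that the pattern only looks for `.` or `?`, so a sentence ending in `!` stays attached to the one that follows.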

In [66]:
twilight_file = "/content/drive/MyDrive/76-380-780 MiHA/Mini Labs/data/ya_corpus/Twilight/Meyer_Twilight.txt"

In [67]:
model_file = '/content/drive/MyDrive/76-380-780 MiHA/Mini Labs/data/ya_corpus/twilight.model'
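
Because these paths depend on where you put the `data` folder, it can help to confirm that the folder exists and contains text files before training. The following is a minimal sketch using only the standard library. The helper name `list_text_files` is hypothetical, and it assumes the texts are stored as `.txt` files directly inside the folder.

```python
from pathlib import Path

def list_text_files(folder):
    """Return the sorted names of the .txt files directly inside folder."""
    folder = Path(folder)
    if not folder.is_dir():
        # Fail early with a readable message if the path is wrong
        raise FileNotFoundError(f"Folder not found: {folder}")
    return sorted(p.name for p in folder.glob("*.txt"))

# In Colab, after mounting Drive, you might check the folder that holds
# your Twilight file (twilight_file is the variable defined above):
# print(list_text_files(Path(twilight_file).parent))
```

If the call raises `FileNotFoundError`, compare the path against the folder structure shown in the Drive file browser.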

In [68]:
tokenized_sentences = preprocess_text(twilight_file)

In [69]:
model = train_word2vec_model(tokenized_sentences)

In [70]:
save_model(model, model_file)

In [71]:
loaded_model = load_model(model_file)

In [73]:
word = "her"
if word in loaded_model.wv:
    vector = loaded_model.wv[word]
    similar_words = loaded_model.wv.most_similar(word, topn=10)
    print(f"Vector for '{word}': {vector}")
    print(f"Similar words to '{word}': {similar_words}")
else:
    print(f"Word '{word}' not found in vocabulary.")

Vector for 'her': [-0.78750575  0.7612749   0.14320377  0.2425537   0.4410034  -1.376455
  0.06663188  1.5922779  -1.1987109  -0.35662937 -0.5558759  -1.3470191
  0.05842101  0.54743433  0.41945302 -0.25711858  0.04326954 -0.70031637
  0.03159953 -1.6344703   0.08516036  0.08223625  0.8512079  -0.5773743
  0.01574954 -0.3833253  -1.3480359  -0.28446513 -0.20267667  0.35531062
  1.3980274  -0.08136886  0.50472754 -0.6934735  -0.84024024  0.9532006
  0.1011598  -0.02656677 -0.44757658 -1.24818     0.2757881  -0.38599417
 -0.13939567  0.21535529  0.30118388  0.45548576 -0.53019744 -0.2606152
 -0.13330145  0.48818192  0.7010559  -0.64420027 -0.4334239   0.2564224
  0.27379394  0.43746457  0.5885829  -0.04886107 -0.70934486  0.40370113
 -0.397404    0.15977304  0.2905917  -0.06621511 -0.83842707  0.96380174
  0.4524631   0.24281016 -0.7333111   1.0476041  -0.74061394  0.18997973
  0.61162716  0.02186894  0.58837914  0.08166508 -0.39852753 -0.1778204
 -0.55372804 -0.05321027 -0.63092124  0.0
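
The numbers above are hard to interpret on their own. What `most_similar` does with them is rank other words by the cosine similarity between their vectors and the target word's vector. The sketch below illustrates that idea in plain Python. The three-dimensional vectors are made up for illustration (the model above uses 100 dimensions), and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT values from the trained model
toy_vectors = {
    "her": [1.0, 2.0, 0.0],
    "she": [2.0, 4.0, 0.0],   # points in the same direction as "her"
    "rain": [0.0, 0.0, 3.0],  # orthogonal to "her"
}

def rank_by_similarity(word, vectors):
    """Rank every other word by cosine similarity to word, highest first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_by_similarity("her", toy_vectors)
print(ranking)
```

In this toy example "she" ranks first because its vector points in the same direction as the one for "her", while "rain" scores zero because the two vectors share no direction. Vector length does not matter, only the angle.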

The resulting table should have a `doc_id` column and a `text` column. This is conventional formatting for processing textual data.