# Corpus of Contemporary American English (COCA) Full Text Analysis

This notebook is for you if you purchased the Corpus of Contemporary American English (COCA) Full Text and want to work with it offline rather than through the web GUI at https://www.english-corpora.org/coca/. We demonstrate the functionality in `getout_of_text_3` for reading and analyzing the COCA full-text data.

- Goals of this notebook include:
  - understanding how to use the COCA full text data
  - staging a nested dictionary of DataFrames, `{'genre': {'year': pd.DataFrame}}`

In [1]:
import getout_of_text_3 as got3
import pandas as pd


In [2]:
got3.__version__

'0.2.39'

In [3]:
coca_corpus = got3.read_corpus('../coca-text/')

Genres:   0%|          | 0/8 [00:00<?, ?genre/s]

Processing genre: mag


Genres:  12%|█▎        | 1/8 [00:00<00:06,  1.04genre/s]

Finished genre: mag (total files: 30)
Processing genre: web


Genres:  25%|██▌       | 2/8 [00:01<00:05,  1.02genre/s]

Finished genre: web (total files: 34)
Processing genre: acad


Genres:  38%|███▊      | 3/8 [00:02<00:04,  1.02genre/s]

Finished genre: acad (total files: 30)
Processing genre: news


Genres:  50%|█████     | 4/8 [00:03<00:03,  1.04genre/s]

Finished genre: news (total files: 30)
Processing genre: spok


Genres:  62%|██████▎   | 5/8 [00:04<00:02,  1.05genre/s]

Finished genre: spok (total files: 30)
Processing genre: blog


Genres:  75%|███████▌  | 6/8 [00:05<00:01,  1.04genre/s]

Finished genre: blog (total files: 34)
Processing genre: fic


Genres:  88%|████████▊ | 7/8 [00:06<00:00,  1.09genre/s]

Finished genre: fic (total files: 30)
Processing genre: tvm


Genres: 100%|██████████| 8/8 [00:07<00:00,  1.06genre/s]

Finished genre: tvm (total files: 30)





In [4]:
coca_corpus.keys()

dict_keys(['mag', 'web', 'acad', 'news', 'spok', 'blog', 'fic', 'tvm'])

In [5]:
coca_corpus['acad'].keys()

dict_keys(['2013', '2007', '2006', '2012', '2004', '2010', '2011', '2005', '2001', '2015', '2014', '2000', '2016', '2002', '2003', '2017', '1999', '1998', '1996', '1997', '1995', '1994', '1990', '1991', '1993', '1992', '2019', '2018', '2008', '2009'])

In [6]:
coca_corpus['web'].keys()

dict_keys(['13', '07', '06', '12', '04', '10', '11', '05', '29', '01', '15', '14', '28', '16', '02', '03', '17', '32', '26', '27', '33', '25', '31', '19', '18', '30', '24', '08', '20', '34', '21', '09', '23', '22'])
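
As the two outputs above show, the inner keys come back unsorted, and `web` is keyed by two-digit file number rather than by year. A minimal sketch for getting an ordered overview of such a nested dictionary follows. `summarize_corpus` and the toy `text` column are hypothetical names used for illustration, not part of `getout_of_text_3`, and the real DataFrames may have different columns.

```python
import pandas as pd

def summarize_corpus(corpus):
    """Return one row per (genre, key) with the number of DataFrame rows."""
    rows = [
        {"genre": genre, "key": key, "n_rows": len(df)}
        for genre, by_key in corpus.items()
        for key, df in by_key.items()
    ]
    summary = pd.DataFrame(rows, columns=["genre", "key", "n_rows"])
    # Keys are fixed-width strings ('1990', '07', ...), so a string sort orders them
    return summary.sort_values(["genre", "key"]).reset_index(drop=True)

# Toy stand-in for the {'genre': {'year': DataFrame}} structure (made-up contents)
toy_corpus = {
    "acad": {"2013": pd.DataFrame({"text": ["a", "b"]}),
             "1990": pd.DataFrame({"text": ["c"]})},
    "web": {"07": pd.DataFrame({"text": ["d", "e", "f"]})},
}
toy_summary = summarize_corpus(toy_corpus)
print(toy_summary)
```

With the real data, `summarize_corpus(coca_corpus)` should give a quick check that every genre loaded the expected number of files.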

____________________________
## Search Keyword 

- Use `bovine` as a test keyword across the full COCA corpus.
- If possible, compare your results with the web interface at https://www.english-corpora.org/coca/
  - The offline counts are sometimes lower and sometimes higher than the site's. The cause is still to be determined and needs review.


### Comparing parallel vs non-parallel kwic search

- When the search runs in parallel, the `n_jobs` parameter defaults to n-1 workers, so all but one of your CPU cores are used. This is much faster on large corpora.
- For example, searching `bovine` across the full COCA text corpus on a 10-core machine (10-1=9 worker processes):
  - non-parallel: time elapsed: 0 days 00:01:01.157718
  - parallel: time elapsed: 0 days 00:00:22.578978
  - about 2.7x faster!
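
To make the comparison concrete, the speedup can be computed directly from the two timings listed above. This sketch uses only the standard library, and the variable names are illustrative.

```python
from datetime import timedelta

# Timings copied from the bullets above
non_parallel = timedelta(minutes=1, seconds=1, microseconds=157718)
parallel = timedelta(seconds=22, microseconds=578978)

# Dividing one timedelta by another yields a plain float ratio
speedup = non_parallel / parallel
print(f"speedup: {speedup:.2f}x")
```

Timings like these vary with disk caching and core count, so treat the ratio as a rough indication rather than a benchmark.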

In [None]:
before = pd.Timestamp.now()
bovine_kwic = got3.search_keyword_corpus('bovine', coca_corpus, 
                                            case_sensitive=False,
                                            show_context=True, 
                                            context_words=15,
                                            output='print',
                                            parallel=True)
after = pd.Timestamp.now()
print('time elapsed:', after - before)

_____________________________
### Run keyword_frequency_analysis

- Get the distribution of keyword frequencies across the COCA genres.
- The analysis counts `1178812039` tokens (~1.18 billion) in the COCA dataset, whereas the site publishes the corpus size as `1.0 billion` words.
  - Notably, the English-Corpora site returns 1248 hits for `bovine` in the full COCA corpus, whereas this analysis returns `1252`, a discrepancy of 4 hits. The cause is still to be determined.

> 🚨 There is also a discrepancy between the KWIC search and the frequency analysis here.
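
The frequency output below labels its matching as a "loose substring match". One possible contributor to the extra hits, offered as a guess rather than a confirmed diagnosis, is that substring matching also counts longer word forms such as `bovines`. The sketch uses only the standard `re` module on a made-up sentence; it does not reproduce how `getout_of_text_3` or the website actually match.

```python
import re

# Made-up sentence for illustration only
sample = "Bovine growth hormone is given to bovines; the ovine flock was unaffected."

# Loose substring match: counts 'bovine' anywhere, including inside 'bovines'
substring_hits = len(re.findall(r"bovine", sample, flags=re.IGNORECASE))

# Whole-word match: word boundaries exclude 'bovines'
whole_word_hits = len(re.findall(r"\bbovine\b", sample, flags=re.IGNORECASE))

print("substring:", substring_hits, "| whole word:", whole_word_hits)
```

Comparing the two counts on a single genre or year could help narrow down where the 4 extra hits come from.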

In [8]:
before = pd.Timestamp.now()
bovine_freq = got3.keyword_frequency_analysis('bovine', 
                                              coca_corpus, 
                                              case_sensitive=False,
                                              relative=True,  # also report the rate per 10k tokens
                                              parallel=True  # use parallel processing
                                              )
after = pd.Timestamp.now()
print('time elapsed:', after - before)

📊 Frequency Analysis for 'bovine' (case_sensitive=False, loose substring match)
  acad    :    501 hits | 140449282 tokens | 0.04 /10k
  web     :    208 hits | 149036464 tokens | 0.01 /10k
  mag     :    162 hits | 146417442 tokens | 0.01 /10k
  fic     :    109 hits | 142585624 tokens | 0.01 /10k
  blog    :     92 hits | 143156927 tokens | 0.01 /10k
  tvm     :     71 hits | 162287598 tokens | 0.00 /10k
  news    :     68 hits | 143377305 tokens | 0.00 /10k
  spok    :     41 hits | 151501397 tokens | 0.00 /10k
------------------------------------------------------------
TOTAL: 1252 hits across 8 genres (~1178812039 tokens)
time elapsed: 0 days 00:00:29.319176
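
The `/10k` column appears to be hits divided by tokens, times 10,000; that is an inference from the printed numbers, which the sketch below reproduces. At two decimals several genres round to 0.00, so a per-million rate may be easier to compare. The dictionaries are copied from the output above and the variable names are illustrative.

```python
# Hits and token counts copied from the frequency output above
hits = {"acad": 501, "web": 208, "mag": 162, "fic": 109,
        "blog": 92, "tvm": 71, "news": 68, "spok": 41}
tokens = {"acad": 140449282, "web": 149036464, "mag": 146417442, "fic": 142585624,
          "blog": 143156927, "tvm": 162287598, "news": 143377305, "spok": 151501397}

# Relative frequency at two scales
per_10k = {g: hits[g] / tokens[g] * 10_000 for g in hits}
per_million = {g: hits[g] / tokens[g] * 1_000_000 for g in hits}

for g in hits:
    print(f"{g:<5} {per_10k[g]:.2f} /10k   {per_million[g]:.2f} /1M")

print("total hits:", sum(hits.values()), "| total tokens:", sum(tokens.values()))
```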


______________________________________
## Finding collocates of `bovine` in the COCA corpus

The `find_collocates` function in `getout_of_text_3` allows you to identify words that frequently appear near a target keyword within a large corpus, such as COCA. 

- Simply provide your `keyword` and `coca_corpus`, i.e. the corpus dictionary, and optional parameters like window size, minimum frequency, and parallel processing. 

- The function returns a dictionary whose `all_collocates` entry maps each collocate to its count, which you can easily convert to a DataFrame for further analysis or visualization.
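
To illustrate the idea of window-based collocates, here is a minimal sketch on a toy token list. `toy_collocates` is a hypothetical helper written for this example; it is not how `find_collocates` is implemented, and it ignores options such as `min_freq` and `case_sensitive`.

```python
from collections import Counter

def toy_collocates(tokens, keyword, window_size=2):
    """Count the words within +/- window_size tokens of each keyword hit."""
    counts = Counter()
    kw = keyword.lower()
    for i, tok in enumerate(tokens):
        if kw in tok.lower():  # loose, case-insensitive substring match
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # skip the keyword itself
                    counts[tokens[j]] += 1
    return counts

# Made-up sentence for illustration only
tokens = "the bovine growth hormone and the bovine virus".split()
toy_counts = toy_collocates(tokens, "bovine", window_size=1)
print(toy_counts.most_common())
```

Even in this tiny example the function word `the` tops the list, which matches the pattern in the real output and is one reason to filter by part of speech afterwards.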



In [None]:
before = pd.Timestamp.now()
bovine_collocates = got3.find_collocates('bovine', 
                                         coca_corpus,
                                         window_size=15,
                                         min_freq=2,
                                         case_sensitive=False,
                                         parallel=True)
after = pd.Timestamp.now()
print('time elapsed:', after - before)

🔗 Collocate Analysis for 'bovine' (window: ±15 words, loose substring match)

📚 MAG_1993 Genre Collocates:
  Found 9 instances of 'bovine' in mag_1993
  the            :  13 times
  and            :   6 times
  for            :   4 times
  that           :   3 times
  company        :   2 times
  with           :   2 times
  cattle         :   2 times
  human          :   2 times
  field          :   2 times
  industry       :   2 times

📚 MAG_1992 Genre Collocates:
  Found 7 instances of 'bovine' in mag_1992
  the            :   7 times
  and            :   6 times
  The            :   4 times
  from           :   2 times
  cattle         :   2 times
  diseases       :   2 times
  know           :   2 times
  leukemia       :   2 times
  virus          :   2 times
  rampant        :   2 times

📚 MAG_1990 Genre Collocates:
  Found 6 instances of 'bovine' in mag_1990
  and            :   5 times
  milk           :   3 times
  into           :   2 times
  produced       :   2 times
  fro

### Part of Speech tagging for collocates

- using the `spacy` library to tag the collocates with their parts of speech (POS)
- tbd if this will get included with `got3`!

In [None]:
import spacy
from spacy.cli import download
download('en_core_web_sm')
nlp = spacy.load('en_core_web_sm')

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 39.1 MB/s eta 0:00:00



✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
# Convert collocates to DataFrame and show value counts
collocates_df = pd.DataFrame(list(bovine_collocates['all_collocates'].items()), columns=['word', 'count'])
collocates_df = collocates_df.sort_values('count', ascending=False).reset_index(drop=True)

# Get top 50 collocates
top_collocates = collocates_df.head(50)['word'].tolist()

# POS tagging
collocate_pos = [(word, nlp(word)[0].pos_) for word in top_collocates]

# Convert to DataFrame for display
collocate_pos_df = pd.DataFrame(collocate_pos, columns=['word', 'pos'])
collocate_pos_df.head(20)

# now merge the two so we get pos along with the counts
collocates_with_pos = pd.merge(collocates_df, collocate_pos_df, on='word', how='left')
collocates_with_pos.head(20)


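|   | word | count | pos |
|---|------|-------|-----|
| 0 | the | 1245 | PRON |
| 1 | and | 802 | CCONJ |
| 2 | with | 350 | ADP |
| 3 | that | 255 | SCONJ |
| 4 | for | 206 | ADP |
| 5 | The | 200 | PRON |
| 6 | from | 187 | ADP |
| 7 | were | 157 | AUX |
| 8 | was | 151 | AUX |
| 9 | are | 146 | AUX |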
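
The graph-building cell later in this notebook iterates over a `top_pos_table` with `word`, `pos_group`, and `count` columns, but the cell that created it is not shown. Below is one way it could be derived from `collocates_with_pos`, as a sketch under assumptions: `build_top_pos_table`, `KEPT_POS`, and `top_n` are hypothetical names, and the grouping rule (tags without their own plot color fall into `OTHER`) is a guess based on the color palette used in the plotting cells.

```python
import pandas as pd

# POS tags that get their own color in the plotting cells; all others become 'OTHER'
KEPT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def build_top_pos_table(df, top_n=10):
    """Add a `pos_group` column and keep the top_n most frequent words per group."""
    tagged = df.dropna(subset=["pos"]).copy()  # untagged rows have no POS after the left merge
    tagged["pos_group"] = tagged["pos"].where(tagged["pos"].isin(KEPT_POS), "OTHER")
    return (
        tagged.sort_values("count", ascending=False)
        .groupby("pos_group", sort=False)
        .head(top_n)
        .reset_index(drop=True)
    )

# Toy input: the first three rows come from the table above,
# the last two are made-up placeholders (not real COCA counts)
demo = pd.DataFrame({
    "word": ["the", "and", "with", "cattle", "produced"],
    "count": [1245, 802, 350, 40, 12],
    "pos": ["PRON", "CCONJ", "ADP", "NOUN", None],
})
top_pos_table = build_top_pos_table(demo, top_n=2)
print(top_pos_table)
```

With the real data, the call would be `top_pos_table = build_top_pos_table(collocates_with_pos)`.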


In [23]:
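import networkx as nx
import numpy as np

def hub_layout(G, keyword, group_attr="pos_group", radius=10.0, seed=42):
    """Put the keyword at the origin and scatter each POS group around its own point on a circle."""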
    rng = np.random.default_rng(seed)
    groups = [g for g in sorted(set(nx.get_node_attributes(G, group_attr).values())) if g != "KEYWORD"]
    n_groups = len(groups)

    # keyword at center
    pos = {keyword: (0,0)}

    # arrange other groups in orbit
    angles = np.linspace(0, 2*np.pi, n_groups, endpoint=False)
    group_positions = {
        grp: (radius*np.cos(a), radius*np.sin(a))
        for grp, a in zip(groups, angles)
    }

    for node, data in G.nodes(data=True):
        if node == keyword: 
            continue
        grp = data.get(group_attr, "OTHER")
        cx, cy = group_positions.get(grp, (0,0))
        jitter = rng.normal(scale=1.0, size=2)
        pos[node] = (cx + jitter[0], cy + jitter[1])
    return pos

# usage (assumes the graph `G` and `keyword` have already been built):
pos = hub_layout(G, keyword, group_attr="pos_group", radius=3.5, seed=4)


In [37]:
# ------------------------------------------------------------------
# Node size + color computation
# ------------------------------------------------------------------

# Color palette for groups
pos_colors = {
    'KEYWORD': '#d62728',  # red
    'NOUN': '#1f77b4',     # blue
    'VERB': '#2ca02c',     # green
    'ADJ': '#ff7f0e',      # orange
    'ADV': '#9467bd',      # purple
    'PROPN': '#bcbd22',    # olive
    'OTHER': '#7f7f7f'     # gray
}

# Group-relative sizing parameters
BASE_MIN = 1500    # minimum size for non-keyword nodes
BASE_MAX = 4500    # maximum size (largest within a group)
KEYWORD_SIZE = 6000  # size of the central keyword node

from collections import defaultdict
max_per_group = defaultdict(lambda: 1)
for n in G.nodes:
    if n == keyword:
        continue
    grp = G.nodes[n]['pos_group']
    max_per_group[grp] = max(max_per_group[grp], G.nodes[n]['count'])

# Compute node sizes
node_sizes = []
for n in G.nodes:
    data = G.nodes[n]
    if n == keyword:
        node_sizes.append(KEYWORD_SIZE)
        continue
    grp = data['pos_group']
    grp_max = max_per_group.get(grp, 1)
    rel = data['count'] / grp_max if grp_max else 0
    size = BASE_MIN + (BASE_MAX - BASE_MIN) * rel
    node_sizes.append(size)

# Compute node colors
node_colors = [
    pos_colors.get(G.nodes[n].get('pos_group', 'OTHER'), '#7f7f7f')
    for n in G.nodes
]


### Thinking about crawling or querying wiki data

In [83]:
import re

import requests
from bs4 import BeautifulSoup

# Identify the client to the server with a descriptive User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MyBot/0.1; +https://example.com/bot)"
}
url = "https://en.wikipedia.org/wiki/Bovinae"
response = requests.get(url, headers=headers, timeout=3)
print(response)
print(response.status_code)
print('response content:', response.content[:500])  # print the first 500 bytes

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find any table whose class starts with 'infobox'
    infobox = soup.find('table', class_=re.compile(r'^infobox'))
    if infobox:
        # Extract key info from infobox
        info_text = infobox.get_text(separator=' | ', strip=True)
        # Limit to first 200 chars for hover
        print(info_text[:200] + "..." if len(info_text) > 200 else info_text)
    else:
        # Get first paragraph if no infobox
        first_para = soup.find('p')
        if first_para:
            text = first_para.get_text(strip=True)
            print(text[:150] + "..." if len(text) > 150 else text)

<Response [200]>
200
response content: b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vect'
Bovines | Temporal range: | Miocene | to | present | African buffalo | ( | Syncerus caffer | ) | Scientific classification | Kingdom: | Animalia | Phylum: | Chordata | Class: | Mammalia | Order: | Art...


In [84]:
# ------------------------------------------------------------------
# 1. Build the graph
# ------------------------------------------------------------------
import networkx as nx
import numpy as np


keyword = 'bovine'  # or whatever your keyword is
G = nx.Graph()
G.add_node(keyword, pos_group='KEYWORD', count=collocates_with_pos['count'].max())

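# NOTE: `top_pos_table` is not defined in the cells shown; it needs `word`, `pos_group`, and `count` columns
for _, row in top_pos_table.iterrows():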
    w = row['word']
    posg = row['pos_group']
    cnt = int(row['count'])
    if w == keyword:
        continue
    G.add_node(w, pos_group=posg, count=cnt)
    G.add_edge(keyword, w, weight=cnt)

# ------------------------------------------------------------------
# 2. Hub layout (so groups orbit the keyword)
# ------------------------------------------------------------------

def hub_layout(G, keyword, group_attr="pos_group", radius=10.0, seed=42):
    rng = np.random.default_rng(seed)
    groups = [g for g in sorted(set(nx.get_node_attributes(G, group_attr).values())) if g != "KEYWORD"]
    n_groups = len(groups)

    pos = {keyword: (0,0)}  # keyword at center
    angles = np.linspace(0, 2*np.pi, n_groups, endpoint=False)
    group_positions = {
        grp: (radius*np.cos(a), radius*np.sin(a))
        for grp, a in zip(groups, angles)
    }

    for node, data in G.nodes(data=True):
        if node == keyword:
            continue
        grp = data.get(group_attr, "OTHER")
        cx, cy = group_positions.get(grp, (0,0))
        jitter = rng.normal(scale=1.0, size=2)
        pos[node] = (cx + jitter[0], cy + jitter[1])
    return pos

pos = hub_layout(G, keyword, group_attr="pos_group", radius=3.5, seed=4)

# ------------------------------------------------------------------
# 3. Interactive Plotly draw
# ------------------------------------------------------------------
import plotly.graph_objects as go

x_nodes = [pos[n][0] for n in G.nodes]
y_nodes = [pos[n][1] for n in G.nodes]
node_texts = [f"{n}<br>POS: {G.nodes[n]['pos_group']}<br>Count: {G.nodes[n]['count']}" 
              for n in G.nodes]

x_edges, y_edges = [], []
for u, v in G.edges:
    x_edges += [pos[u][0], pos[v][0], None]
    y_edges += [pos[u][1], pos[v][1], None]

edge_trace = go.Scatter(
    x=x_edges, y=y_edges,
    line=dict(width=0.8, color="#888"),
    hoverinfo="none",
    mode="lines"
)

node_trace = go.Scatter(
    x=x_nodes, y=y_nodes,
    mode="markers+text",
    text=[n for n in G.nodes],
    textposition="middle center",
    marker=dict(
        size=[s/60 for s in node_sizes],  # rescale sizes for Plotly
        color=node_colors,
        line=dict(width=1.5, color="black")
    ),
    hovertext=node_texts,
    hoverinfo="text"
)

fig = go.Figure(data=[edge_trace, node_trace])
fig.update_layout(
    title=f"Collocates Network for “{keyword}” (Interactive)",
    title_x=0.5,
    height=800,
    plot_bgcolor="white",
    showlegend=False,
    margin=dict(l=20, r=20, t=40, b=20),
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()


### FUN! There are only a few hits for `gabagool`!

In [17]:
gabagool_kwic = got3.search_keyword_corpus('gabagool', coca_corpus,
                                            case_sensitive=False,
                                            show_context=False, 
                                            context_words=15,
                                            output='print',
                                            parallel=True)

🔍 COCA Corpus Search: 'gabagool'
🚀 Using parallel processing with 9 processes...

📚 MAG_1993 :
------------------------------
  ❌ No matches found in mag_1993

📚 MAG_1992 :
------------------------------
  ❌ No matches found in mag_1992

📚 MAG_1990 :
------------------------------
  ❌ No matches found in mag_1990

📚 MAG_1991 :
------------------------------
  ❌ No matches found in mag_1991

📚 MAG_1995 :
------------------------------
  ❌ No matches found in mag_1995

📚 MAG_1994 :
------------------------------
  ❌ No matches found in mag_1994

📚 MAG_1996 :
------------------------------
  ❌ No matches found in mag_1996

📚 MAG_1997 :
------------------------------
  ❌ No matches found in mag_1997

📚 MAG_2008 :
------------------------------
  ❌ No matches found in mag_2008

📚 MAG_2009 :
------------------------------
  ❌ No matches found in mag_2009

📚 MAG_2019 :
------------------------------
  ❌ No matches found in mag_2019

📚 MAG_2018 :
------------------------------
  ❌ No matches f

![Tony and gabagool](https://static0.cbrimages.com/wordpress/wp-content/uploads/2025/01/tony-and-gabagool-featured-image.PNG?w=1200&h=628&fit=crop)