In [None]:
# A. Update the installers
!pip install --upgrade pip setuptools wheel

# B. Install spaCy, BeautifulSoup, Requests, and pandas (pre-built wheels only)
!pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org beautifulsoup4 requests spacy pandas --only-binary :all:

# C. Download the English NLP "Brain"
!python -m spacy download en_core_web_sm

# Automated Sociolinguistic Metadata Extraction (Trinidadian Creole)

Research Context

This file serves as a contextual mining tool for the broader study of Trinidadian rhoticity. While acoustic analysis provides quantitative data on phonological shift, this NLP-driven pipeline automatically extracts sociolinguistic metadata from research literature (Wikipedia, archives, etc.) to provide a historical and cultural framework for the speakers' linguistic environment.

Key Features:
- Web Scraping: Automated retrieval of research text using Requests and BeautifulSoup.

- NLP Metadata Extraction: Utilizing spaCy's Named Entity Recognition (NER) to identify ethnic groups (NORP), parent languages (LANGUAGE), and geographical hubs (GPE).
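
The label filter in the NER step can be tried out without loading a model. The sketch below is a minimal illustration, not the pipeline itself: `filter_entities` and `TARGET_LABELS` are hypothetical names, and `sample` is a hand-written stand-in for spaCy output rather than real model predictions.

```python
# Labels of interest, as named in the feature list above.
TARGET_LABELS = {"NORP", "LANGUAGE", "GPE"}

def filter_entities(pairs, wanted=TARGET_LABELS):
    """Keep (text, label) pairs with a wanted label, dropping exact repeats
    while preserving first-seen order."""
    seen = set()
    kept = []
    for text, label in pairs:
        if label in wanted and (text, label) not in seen:
            seen.add((text, label))
            kept.append((text, label))
    return kept

# Hand-written example pairs (assumption: not actual model output).
sample = [
    ("Trinidad", "GPE"),
    ("French", "LANGUAGE"),
    ("1962", "DATE"),
    ("Trinidad", "GPE"),
    ("Indian", "NORP"),
]
print(filter_entities(sample))
```

With a loaded spaCy pipeline, the input pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.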

In [None]:
import requests
from bs4 import BeautifulSoup
import spacy
import pandas as pd

# 1. Load the NLP engine
# This is the "brain" that will read the text for us
nlp = spacy.load("en_core_web_sm")

def linguistic_literature_miner(url):
    """
    Automated tool to scrape research text and extract
    geographic (GPE), group (NORP), and language (LANGUAGE) metadata.
    """
    # Set headers to look like a real browser
    headers = {'User-Agent': 'Mozilla/5.0'}
    
    # A. Scrape the content
    print(f"Scraping data from: {url}...")
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors (e.g. 403, 404)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # B. Extract only the paragraph text (the "meat" of the content)
    text = " ".join([p.get_text() for p in soup.find_all('p')])
    
    # C. NLP Processing: Named Entity Recognition (NER)
    # We process the first 15,000 characters to keep it fast
    doc = nlp(text[:15000]) 
    
    entities = []
    for ent in doc.ents:
        # We only want places (GPE), nationalities or religious/political groups (NORP), and languages (LANGUAGE)
        if ent.label_ in ['GPE', 'NORP', 'LANGUAGE']:
            entities.append({'Entity': ent.text, 'Category': ent.label_})
    
    # D. Return a clean, unique table of findings
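    # Naming the columns keeps the schema stable even when nothing is found
    return pd.DataFrame(entities, columns=['Entity', 'Category']).drop_duplicates().reset_index(drop=True)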

# 2. RUN THE MINER
# Pointing it at the Wikipedia article on Trinidadian Creole
df_metadata = linguistic_literature_miner("https://en.wikipedia.org/wiki/Trinidadian_Creole")

print("\n--- Automated Research Metadata Extraction ---")
print(df_metadata.head(20))

# Optional: Save results for your portfolio
# df_metadata.to_csv('linguistic_metadata_results.csv', index=False)

### Data Synthesis & Observations
By programmatically "reading" the Wikipedia article on Trinidadian Creole, the pipeline surfaces several key vectors of influence:

Language Contact: The extraction of LANGUAGE entities identifies the superstrate and substrate influences (English, French, Spanish) that compete with native Creole features.

Cultural Stratigraphy: The NORP (nationalities or religious or political groups) labels highlight the diverse demographic history (e.g., African, Indian, European), which offers context for the phonetic variation observed in our acoustic pipeline.

Geographic Anchoring: Identifying GPE (Geopolitical Entities) gives a quick list of the places the literature names, a starting point for mapping the hub areas of language contact.
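
One way to check these observations against the extracted table is to tally entities per category. This is a minimal sketch, assuming rows shaped like the `Entity`/`Category` columns of `df_metadata`; `tally_by_category` is a hypothetical helper, and the sample rows are illustrative placeholders, not actual output from the miner.

```python
from collections import Counter

def tally_by_category(rows):
    """Count how many rows fall under each NER category."""
    return Counter(row["Category"] for row in rows)

# Placeholder rows for illustration only (assumption). Real rows could be
# obtained with df_metadata.to_dict("records").
rows = [
    {"Entity": "English", "Category": "LANGUAGE"},
    {"Entity": "Spanish", "Category": "LANGUAGE"},
    {"Entity": "African", "Category": "NORP"},
    {"Entity": "Trinidad", "Category": "GPE"},
]

counts = tally_by_category(rows)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Because the miner already removes duplicate rows, each count reflects distinct entity strings rather than how often an entity is mentioned.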