In [1]:
### A. Updating installers ###
!pip install --upgrade pip setuptools wheel

### B. Installing spaCy and BeautifulSoup ###
!pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org beautifulsoup4 requests spacy --only-binary :all:

### C. Downloading the English NLP data ###
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Automated Sociolinguistic Metadata Extraction (Trinidadian Creole)

Research Context

This file serves as a contextual mining tool for the broader study of Trinidadian rhoticity. While acoustic analysis provides quantitative data on phonological shift, this NLP-driven pipeline automatically extracts sociolinguistic metadata from research literature (Wikipedia, archives, etc.) to provide a historical and cultural framework for the speakers' linguistic environment.

Key Features:
- Web Scraping: Automated retrieval of research text using Requests and BeautifulSoup.

- NLP Metadata Extraction: Utilizing spaCy's Named Entity Recognition (NER) to identify ethnic groups (NORP), parent languages (LANGUAGE), and geographical hubs (GPE).

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
import pandas as pd

### 1. Loading spaCy ###
nlp = spacy.load("en_core_web_sm")

def linguistic_literature_miner(url):
    """
    Automated tool to scrape research text and extract
    geographic (GPE), group (NORP), and language (LANGUAGE) metadata.
    """
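    # 1a. Setting a User-Agent header, since some sites reject the default one sent by requests
    headers = {'User-Agent': 'Mozilla/5.0'}
    
    # 1b. Scraping the content (the timeout avoids hanging; raise_for_status stops on HTTP errors)
    print(f"Scraping data from: {url}...")
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')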
    
    # 1c. Extracting only paragraph text
    text = " ".join([p.get_text() for p in soup.find_all('p')])
    
    # 1d. NLP Processing (NER), processing first 15,000 characters
    doc = nlp(text[:15000]) 
    
    entities = []
    for ent in doc.ents:
        if ent.label_ in ['GPE', 'NORP', 'LANGUAGE']:
            entities.append({'Entity': ent.text, 'Category': ent.label_})
    
    # 1e. Returning clean table (explicit columns keep the shape stable when nothing is found)
    return pd.DataFrame(entities, columns=['Entity', 'Category']).drop_duplicates().reset_index(drop=True)

### 2. Running the miner ###
df_metadata = linguistic_literature_miner("https://en.wikipedia.org/wiki/Trinidadian_Creole")

print("\n--- Automated Research Metadata Extraction ---")
print(df_metadata.head(20))

### 3. Saving results to portfolio ###
df_metadata.to_csv('linguistic_metadata_results.csv', index=False)

Scraping data from: https://en.wikipedia.org/wiki/Trinidadian_Creole...

--- Automated Research Metadata Extraction ---
               Entity  Category
0         Trinidadian      NORP
1            Trinidad       GPE
2              Tobago       GPE
3    Lesser Antillean      NORP
4             English  LANGUAGE
5          Tobagonian      NORP
6              French      NORP
7             African      NORP
8         East Indian      NORP
9          Amerindian      NORP
10            Spanish      NORP
11  Caribbean English      NORP
12              China       GPE
13         Portuguese      NORP
14          Venezuela       GPE
15            Madeira       GPE
16              India       GPE
17        west Africa       GPE
18              Syria       GPE
19            Lebanon       GPE
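
A quick tally of the labels makes the table above easier to read. The sketch below copies the 20 rows shown in the output into a plain list, so it runs without spaCy or pandas, and counts them with `collections.Counter`. The names `rows` and `category_counts` are illustrative and not part of the notebook.

```python
from collections import Counter

# The 20 (entity, label) pairs copied from the output shown above.
rows = [
    ("Trinidadian", "NORP"), ("Trinidad", "GPE"), ("Tobago", "GPE"),
    ("Lesser Antillean", "NORP"), ("English", "LANGUAGE"),
    ("Tobagonian", "NORP"), ("French", "NORP"), ("African", "NORP"),
    ("East Indian", "NORP"), ("Amerindian", "NORP"), ("Spanish", "NORP"),
    ("Caribbean English", "NORP"), ("China", "GPE"), ("Portuguese", "NORP"),
    ("Venezuela", "GPE"), ("Madeira", "GPE"), ("India", "GPE"),
    ("west Africa", "GPE"), ("Syria", "GPE"), ("Lebanon", "GPE"),
]

# Count how many entities fall under each NER label.
category_counts = Counter(label for _, label in rows)

for label, count in category_counts.most_common():
    print(f"{label}: {count}")
```

On these 20 rows, NORP accounts for 10 entries, GPE for 9, and LANGUAGE for 1. With the live DataFrame, `df_metadata['Category'].value_counts()` gives the equivalent tally.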


### Data Synthesis & Observations
By programmatically "reading" the Wikipedia article on Trinidadian Creole, the model has identified several key vectors of influence:

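Language Contact: The LANGUAGE entities are meant to identify the superstrate and substrate influences (English, French, Spanish) that compete with native Creole features. In the output shown, however, only English received the LANGUAGE label; French and Spanish were tagged NORP, so language names have to be read from both categories.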
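
In the output shown earlier, French and Spanish carry the NORP label rather than LANGUAGE, because the same word can name a people or a language. One possible way to surface such cases is a small post-processing pass that flags candidates without overwriting the model's label. This is only a sketch: `KNOWN_LANGUAGES`, `flag_possible_languages`, and the `PossibleLanguage` key are hypothetical names, and the lookup set is hand-written for illustration, not derived from the study.

```python
# Hypothetical, hand-written lookup of language names; extend it to suit the corpus.
KNOWN_LANGUAGES = {"English", "French", "Spanish", "Portuguese"}


def flag_possible_languages(records, known_languages=KNOWN_LANGUAGES):
    """Return copies of the records with a boolean 'PossibleLanguage' key.

    Each record is a dict with 'Entity' and 'Category' keys, the same
    shape the miner collects before building its DataFrame.
    """
    flagged = []
    for record in records:
        is_candidate = (
            record["Category"] in {"NORP", "LANGUAGE"}
            and record["Entity"] in known_languages
        )
        # Build a new dict so the original records are left untouched.
        flagged.append({**record, "PossibleLanguage": is_candidate})
    return flagged


sample = [
    {"Entity": "English", "Category": "LANGUAGE"},
    {"Entity": "French", "Category": "NORP"},
    {"Entity": "African", "Category": "NORP"},
    {"Entity": "Venezuela", "Category": "GPE"},
]

for record in flag_possible_languages(sample):
    print(record)
```

Keeping the original `Category` alongside the flag leaves the ambiguity visible, since a mention of "French" in the article may refer to settlers, to the language, or to both.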

Cultural Stratigraphy: The NORP labels (nationalities or religious or political groups) highlight the diverse demographic history (e.g., African, East Indian, Amerindian, Portuguese), which provides context for the phonetic variation observed in our acoustic pipeline and fits the study's mixed-methods design.

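Geographic Anchoring: The GPE (geopolitical entity) labels give a quick map of the places the article associates with the creole: Trinidad and Tobago themselves, plus regions such as Venezuela, Madeira, India, China, Syria, and Lebanon. This is a first step toward locating the hub areas where language contact is most intense.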