# Languages and Proficiency

This example models a many-to-many relationship with an association table and uses international standards for its reference data. You'll learn:

- **Many-to-many relationships** — People speak multiple languages; languages have multiple speakers
- **Lookup tables** — Standardized reference data (ISO language codes, CEFR levels)
- **Association tables** — Linking entities with additional attributes
- **Complex queries** — Aggregations, filtering, and joins

## International Standards

This example uses two widely adopted standards:

- **ISO 639-1** — Two-letter language codes (`en`, `es`, `ja`)
- **CEFR** — Common European Framework of Reference for language proficiency (A1–C2)

Using international standards keeps codes consistent across records and makes it easier to exchange data with external systems that use the same codes.
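
As a rough illustration of what these two code formats look like, the sketch below checks only the shape of a value. The helper names are hypothetical and not part of DataJoint, and a shape check does not confirm that a two-letter code is actually assigned in ISO 639-1.

```python
import re

# Hypothetical helpers for illustration only.
ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")   # two lowercase ASCII letters
CEFR_SHAPE = re.compile(r"[ABC][12]")       # A1, A2, B1, B2, C1, C2


def looks_like_iso_639_1(code: str) -> bool:
    """True if the string has the shape of an ISO 639-1 code."""
    return ISO_639_1_SHAPE.fullmatch(code) is not None


def is_cefr_level(level: str) -> bool:
    """True if the string is one of the six CEFR level codes."""
    return CEFR_SHAPE.fullmatch(level) is not None


for value in ("en", "eng", "EN"):
    print(value, looks_like_iso_639_1(value))
for value in ("B2", "D1"):
    print(value, is_cefr_level(value))
```

In the tutorial itself this kind of validation is handled by the lookup tables and foreign keys rather than by pattern matching.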

In [None]:
import datajoint as dj
import numpy as np
from faker import Faker

dj.config['display.limit'] = 8
schema = dj.Schema('tutorial_languages')

## Lookup Tables

Lookup tables store standardized reference data that rarely changes. The `contents` attribute lists the rows that are inserted automatically when the table is declared.

In [None]:
@schema
class Language(dj.Lookup):
    definition = """
    # ISO 639-1 language codes
    lang_code : char(2)             # two-letter code (en, es, ja)
    ---
    language : varchar(30)          # full name
    native_name : varchar(50)       # name in native script
    """
    contents = [
        ('ar', 'Arabic', 'العربية'),
        ('de', 'German', 'Deutsch'),
        ('en', 'English', 'English'),
        ('es', 'Spanish', 'Español'),
        ('fr', 'French', 'Français'),
        ('hi', 'Hindi', 'हिन्दी'),
        ('ja', 'Japanese', '日本語'),
        ('ko', 'Korean', '한국어'),
        ('pt', 'Portuguese', 'Português'),
        ('ru', 'Russian', 'Русский'),
        ('zh', 'Chinese', '中文'),
    ]

In [None]:
@schema
class CEFRLevel(dj.Lookup):
    definition = """
    # CEFR proficiency levels
    cefr_level : char(2)            # A1, A2, B1, B2, C1, C2
    ---
    level_name : varchar(20)        # descriptive name
    category : enum('Basic', 'Independent', 'Proficient')
    description : varchar(100)      # can-do summary
    """
    contents = [
        ('A1', 'Beginner', 'Basic',
         'Can use familiar everyday expressions'),
        ('A2', 'Elementary', 'Basic',
         'Can communicate in simple routine tasks'),
        ('B1', 'Intermediate', 'Independent',
         'Can deal with most travel situations'),
        ('B2', 'Upper Intermediate', 'Independent',
         'Can interact with fluency and spontaneity'),
        ('C1', 'Advanced', 'Proficient',
         'Can express ideas fluently for professional use'),
        ('C2', 'Mastery', 'Proficient',
         'Can understand virtually everything'),
    ]

In [None]:
print("Languages:")
print(Language())
print("\nCEFR Levels:")
print(CEFRLevel())

## Entity and Association Tables

- **Person** — The main entity
- **Proficiency** — Association table linking Person, Language, and CEFRLevel

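The primary key of `Proficiency` combines the keys of `Person` and `Language`. A person can therefore have many languages and a language many speakers, but each (person, language) pair appears at most once. The CEFR level is a secondary attribute, so each pair carries exactly one level.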
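
To see the effect of a composite primary key outside DataJoint, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout mirrors the one in this example, but the SQL, the sample names, and the in-memory database are assumptions made for illustration; this is not the SQL that DataJoint generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person   (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE language (lang_code TEXT PRIMARY KEY, language TEXT NOT NULL);
    CREATE TABLE proficiency (
        person_id  INTEGER NOT NULL REFERENCES person(person_id),
        lang_code  TEXT    NOT NULL REFERENCES language(lang_code),
        cefr_level TEXT    NOT NULL,
        PRIMARY KEY (person_id, lang_code)
    );
    INSERT INTO person VALUES (0, 'Ada'), (1, 'Grace');
    INSERT INTO language VALUES ('en', 'English'), ('es', 'Spanish');
""")

# One person with several languages and one language with several speakers
# are both allowed.
conn.executemany(
    "INSERT INTO proficiency VALUES (?, ?, ?)",
    [(0, "en", "C2"), (0, "es", "B1"), (1, "en", "A2")],
)

# A second row for the same (person, language) pair violates the primary key.
try:
    conn.execute("INSERT INTO proficiency VALUES (0, 'en', 'B2')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_rows = conn.execute("SELECT COUNT(*) FROM proficiency").fetchone()[0]
print(duplicate_rejected, n_rows)
```

The duplicate insert fails and the table still holds three rows, which is the at-most-one-row-per-pair guarantee described above.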

In [None]:
@schema
class Person(dj.Manual):
    definition = """
    # People with language skills
    person_id : int32               # unique identifier
    ---
    name : varchar(60)
    date_of_birth : date
    """

In [None]:
@schema
class Proficiency(dj.Manual):
    definition = """
    # Language proficiency (many-to-many: person <-> language)
    -> Person
    -> Language
    ---
    -> CEFRLevel
    """

In [None]:
dj.Diagram(schema)

**Reading the diagram:**
- **Gray tables** (Language, CEFRLevel) are Lookup tables
- **Green tables** (Person, Proficiency) are Manual tables
- **Solid lines** indicate foreign keys that are part of the primary key (here, the two that form the many-to-many relationship)
- **Dashed line** indicates a foreign key among the secondary attributes (here, the reference to CEFRLevel)

## Populate Sample Data

In [None]:
np.random.seed(42)
fake = Faker()
fake.seed_instance(42)

# Generate 200 people
n_people = 200
Person.insert(
    {
        'person_id': i,
        'name': fake.name(),
        'date_of_birth': fake.date_of_birth(
            minimum_age=18, maximum_age=70)
    }
    for i in range(n_people)
)

print(f"Created {len(Person())} people")
Person()

In [None]:
# Assign random language proficiencies
lang_keys = Language.fetch('KEY')
cefr_keys = CEFRLevel.fetch('KEY')

# More people at intermediate levels than extremes
cefr_weights = [0.08, 0.12, 0.20, 0.25, 0.20, 0.15]
avg_languages = 2.5

for person_key in Person.fetch('KEY'):
    # Number of languages for this person; a Poisson draw can be zero
    n_langs = np.random.poisson(avg_languages)
    if n_langs > 0:
        # Sample without replacement so each (person, language) pair is unique
        selected_langs = np.random.choice(
            len(lang_keys), min(n_langs, len(lang_keys)), replace=False)
        Proficiency.insert(
            {
                **person_key,
                **lang_keys[i],
                **np.random.choice(cefr_keys, p=cefr_weights)
            }
            for i in selected_langs
        )

print(f"Created {len(Proficiency())} proficiency records")
Proficiency()
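
Because the number of languages per person is drawn from a Poisson distribution with mean 2.5, some people receive no proficiency records at all. The sketch below works out the expected proportions from the Poisson formula. The helper `poisson_pmf` is a hypothetical name, and the counts in a seeded run will differ somewhat from these expectations.

```python
import math

avg_languages = 2.5   # Poisson mean used in the cell above
n_people = 200


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson variable with mean lam equals k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Probability of drawing zero languages, and of drawing four or more.
p_zero = poisson_pmf(0, avg_languages)
p_four_plus = 1 - sum(poisson_pmf(k, avg_languages) for k in range(4))

print(f"P(no languages) = {p_zero:.3f}, about {n_people * p_zero:.0f} people")
print(f"P(4+ languages) = {p_four_plus:.3f}, about {n_people * p_four_plus:.0f} people")
```

Roughly 8% of people are expected to have no records and roughly 24% to have four or more languages, which is worth keeping in mind when reading the query results later on.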

## Query Examples

### Finding Speakers

In [None]:
# Proficient English speakers (C1 or C2)
proficient_english = (
    Person.proj('name') &
    (Proficiency & {'lang_code': 'en'} & 'cefr_level >= "C1"')
)
print(f"Proficient English speakers: {len(proficient_english)}")
proficient_english
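
The restriction `'cefr_level >= "C1"'` compares strings, and it works here only because the six CEFR codes happen to sort alphabetically in order of increasing proficiency. The plain-Python sketch below illustrates that property; it assumes the database compares these strings in the same alphabetical order.

```python
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]  # lowest to highest proficiency

# Alphabetical order and proficiency order coincide for CEFR codes.
same_order = sorted(levels) == levels

# So a string comparison selects the top two levels.
at_least_c1 = [lvl for lvl in levels if lvl >= "C1"]

print(same_order, at_least_c1)
```

A scheme whose codes do not sort this way (for example, level names such as "Beginner" and "Advanced") would need an explicit rank attribute in the lookup table instead.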

In [None]:
# People who speak BOTH English AND Spanish
bilingual = (
    Person.proj('name') &
    (Proficiency & {'lang_code': 'en'}) &
    (Proficiency & {'lang_code': 'es'})
)
print(f"English + Spanish speakers: {len(bilingual)}")
bilingual

In [None]:
# People who speak English OR Spanish
either = (
    Person.proj('name') &
    (Proficiency & 'lang_code in ("en", "es")')
)
print(f"English or Spanish speakers: {len(either)}")
either
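
The difference between the AND and OR queries can be pictured with sets of person IDs. This is a toy model in plain Python with made-up rows, not a description of how DataJoint evaluates the queries.

```python
# Hypothetical rows of an association table: (person_id, lang_code)
rows = [(0, "en"), (0, "es"), (1, "en"), (2, "es"), (2, "fr"), (3, "ja")]

english = {pid for pid, code in rows if code == "en"}
spanish = {pid for pid, code in rows if code == "es"}

both = english & spanish      # speaks English AND Spanish
either = english | spanish    # speaks English OR Spanish

# Asking a single row to carry both codes matches nothing,
# because each row holds exactly one language.
same_row = {pid for pid, code in rows if code == "en" and code == "es"}

print(sorted(both), sorted(either), sorted(same_row))
```

This is why the AND query restricts `Person` by two separate `Proficiency` subqueries: applying both language conditions to `Proficiency` directly would look for one row with two different codes and return nothing.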

### Aggregations

In [None]:
# People who speak 4+ languages
polyglots = Person.aggr(
    Proficiency,
    'name',
    n_languages='COUNT(lang_code)',
    languages='GROUP_CONCAT(lang_code)'
) & 'n_languages >= 4'

print(f"Polyglots (4+ languages): {len(polyglots)}")
polyglots

In [None]:
# Top 5 polyglots
top_polyglots = Person.aggr(
    Proficiency,
    'name',
    n_languages='COUNT(lang_code)'
) & dj.Top(5, order_by='n_languages DESC')

top_polyglots

In [None]:
# Number of speakers per language
speakers_per_lang = Language.aggr(
    Proficiency,
    'language',
    n_speakers='COUNT(person_id)'
)
speakers_per_lang

In [None]:
# CEFR level distribution for English
english_levels = CEFRLevel.aggr(
    Proficiency & {'lang_code': 'en'},
    'level_name',
    n_speakers='COUNT(person_id)'
)
english_levels
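
For readers who think in SQL, the aggregations above resemble `GROUP BY` queries. The sketch below uses Python's built-in `sqlite3` module with a few made-up rows and a threshold of two languages instead of four. The SQL is an assumption about the general shape of such queries, and the statements DataJoint actually generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE proficiency (
        person_id INTEGER, lang_code TEXT, cefr_level TEXT,
        PRIMARY KEY (person_id, lang_code)
    );
    INSERT INTO person VALUES (0, 'Ada'), (1, 'Grace'), (2, 'Linus');
    INSERT INTO proficiency VALUES
        (0, 'en', 'C2'), (0, 'es', 'B1'), (0, 'fr', 'A2'),
        (1, 'en', 'B2'),
        (2, 'en', 'C1'), (2, 'ja', 'B1');
""")

# Languages per person, keeping only people with at least two.
per_person = conn.execute("""
    SELECT p.person_id, p.name, COUNT(pr.lang_code) AS n_languages
    FROM person AS p
    JOIN proficiency AS pr ON pr.person_id = p.person_id
    GROUP BY p.person_id
    HAVING COUNT(pr.lang_code) >= 2
    ORDER BY n_languages DESC, p.person_id
""").fetchall()

# Speakers per language.
per_language = dict(conn.execute("""
    SELECT lang_code, COUNT(person_id) FROM proficiency GROUP BY lang_code
""").fetchall())

print(per_person)
print(per_language)
```

Grouping by the person key answers "how many languages per person", while grouping by the language key answers "how many speakers per language"; the same association table supports both directions.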

### Joining Tables

In [None]:
# Full profile: person + language + proficiency details
full_profile = (
    Person * Proficiency * Language * CEFRLevel
).proj('name', 'language', 'level_name', 'category')

# Show profile for person_id=0
full_profile & {'person_id': 0}

In [None]:
# Find people with C1+ proficiency in multiple languages
advanced_polyglots = Person.aggr(
    Proficiency & 'cefr_level >= "C1"',
    'name',
    n_advanced='COUNT(*)'
) & 'n_advanced >= 2'

print(f"Advanced in 2+ languages: {len(advanced_polyglots)}")
advanced_polyglots

## Key Concepts

| Pattern | Implementation |
|---------|----------------|
| **Many-to-many** | `Proficiency` links `Person` and `Language` |
| **Lookup tables** | `Language` and `CEFRLevel` with `contents` |
| **Association data** | `cefr_level` stored in the association table |
| **Standards** | ISO 639-1 codes, CEFR levels |

### Benefits of Lookup Tables

1. **Data consistency** — Only valid codes can be used
2. **Rich metadata** — Full names, descriptions stored once
3. **Easy updates** — Change "Español" to "Spanish" in one place
4. **Self-documenting** — `Language()` shows all valid options
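
The first benefit can be sketched outside DataJoint with Python's built-in `sqlite3` module: a foreign key into a lookup table rejects codes that the lookup table does not contain. The table layout and the helper `try_insert` are assumptions made for illustration, and SQLite enforces foreign keys only when the pragma shown is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE language (lang_code TEXT PRIMARY KEY, language TEXT NOT NULL);
    CREATE TABLE proficiency (
        person_id INTEGER NOT NULL,
        lang_code TEXT NOT NULL REFERENCES language(lang_code),
        PRIMARY KEY (person_id, lang_code)
    );
    INSERT INTO language VALUES ('en', 'English'), ('es', 'Spanish');
""")


def try_insert(person_id: int, lang_code: str) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO proficiency VALUES (?, ?)", (person_id, lang_code)
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(0, "es"))   # code exists in the lookup table
print(try_insert(0, "sp"))   # code is not in the lookup table
```

The second insert is refused because `sp` has no matching row in `language`, so a mistyped code cannot enter the association table.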

## Next Steps

- [University Database](university.ipynb) — Academic records
- [Hotel Reservations](hotel-reservations.ipynb) — Workflow dependencies
- [Queries Tutorial](../basics/04-queries.ipynb) — Query operators in depth

In [None]:
# Cleanup
schema.drop(prompt=False)