# MIT Art, Design and Technology University  
### MIT School of Computing, Pune  
### Department of Information Technology  

---

## Experiential Learning Activity  
### Subject – Natural Language Processing  
### Topic – Build Your Own Mini NLP Tool: Language Detection  
### Academic Year 2025 – 2026 (SEM I)  
### Course Coordinator – Prof. Kalyani Lokhande  


In [2]:
# Install necessary libraries
!pip install langdetect pandas matplotlib

# Import required modules
import pandas as pd
import matplotlib.pyplot as plt
from langdetect import detect, DetectorFactory

# Fix randomness for consistent results
DetectorFactory.seed = 0

print("All libraries installed and imported successfully.")





All libraries installed and imported successfully.


### Phase 1 – Creating or Loading the Dataset

To test the language detection tool, we need a collection of short sentences, each written in a different language (English, Hindi, French, Spanish, and others).  

We will first create a small in-memory dataset with sample multilingual text.  
Later, the same approach can be applied to a larger dataset (like the Kaggle Language Detection dataset).

**Objective:**
- Prepare text data containing multiple languages.
- Store it in a DataFrame for further analysis.
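
For a larger dataset, the sentences would typically come from a CSV file rather than a hand-written dictionary. The following is a minimal sketch, assuming the file has columns named `Text` and `Language`. Those column names, the sample rows, and the `load_rows` helper are assumptions for illustration, so check them against the actual file before use.

```python
import csv
import io

# Stand-in for a CSV file on disk; the column names "Text" and "Language"
# are assumed for illustration and may differ in a real dataset.
sample_csv = io.StringIO(
    "Text,Language\n"
    '"Hello, how are you?",English\n'
    '"Bonjour, comment allez-vous?",French\n'
)

def load_rows(handle):
    """Read rows from an open CSV handle into a dict of column lists."""
    columns = {"Text": [], "Language": []}
    for row in csv.DictReader(handle):
        columns["Text"].append(row["Text"])
        columns["Language"].append(row["Language"])
    return columns

data_from_csv = load_rows(sample_csv)
print(data_from_csv["Text"])
```

The resulting dictionary has the same shape as the one passed to `pd.DataFrame` in this phase. With a real file, `pd.read_csv(path)` builds the DataFrame in one call.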


In [3]:
# Create a small multilingual dataset
data = {
    "Text": [
        "Hello, how are you?",                  # English
        "Bonjour, comment allez-vous?",         # French
        "Namaste, aap kaise ho?",               # Hindi (romanized, Latin script)
        "Hola, buenos días.",                   # Spanish
        "Ciao, come stai?",                     # Italian
        "Hallo, wie geht es dir?",              # German
        "Olá, como vai você?",                  # Portuguese
        "Привет, как дела?",                    # Russian
        "こんにちは、お元気ですか？",          # Japanese
        "안녕하세요, 어떻게 지내세요?"             # Korean
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)
df


|   | Text |
|---|------|
| 0 | Hello, how are you? |
| 1 | Bonjour, comment allez-vous? |
| 2 | Namaste, aap kaise ho? |
| 3 | Hola, buenos días. |
| 4 | Ciao, come stai? |
| 5 | Hallo, wie geht es dir? |
| 6 | Olá, como vai você? |
| 7 | Привет, как дела? |
| 8 | こんにちは、お元気ですか？ |
| 9 | 안녕하세요, 어떻게 지내세요? |


### Phase 2 – Language Detection using `langdetect`

Now that our dataset is ready, we will use the `langdetect` library to identify the language of each sentence.  
The `detect()` function analyzes the text and returns an ISO 639-1 language code (two letters, except for Chinese, which is reported as `zh-cn` or `zh-tw`), such as:

- `en` → English  
- `fr` → French  
- `es` → Spanish  
- `hi` → Hindi  
- `de` → German  
- and so on.

We will:
1. Apply the detection function to each sentence in the DataFrame.
2. Store the results in a new column called **Predicted_Language**.
3. Display the DataFrame to verify the detected languages.


In [4]:
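from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # for consistent results

# Detect the language, returning "error" when langdetect cannot
# extract any features from the text (e.g. empty or digits-only input)
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return "error"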

# Apply the function to each sentence
df["Predicted_Language"] = df["Text"].apply(detect_language)

# Display results
df


|   | Text | Predicted_Language |
|---|------|--------------------|
| 0 | Hello, how are you? | en |
| 1 | Bonjour, comment allez-vous? | fr |
| 2 | Namaste, aap kaise ho? | et |
| 3 | Hola, buenos días. | es |
| 4 | Ciao, come stai? | it |
| 5 | Hallo, wie geht es dir? | de |
| 6 | Olá, como vai você? | pt |
| 7 | Привет, как дела? | mk |
| 8 | こんにちは、お元気ですか？ | ja |
| 9 | 안녕하세요, 어떻게 지내세요? | ko |
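
Two rows in the output above do not match the intended language: the romanized Hindi sentence is labelled `et` (Estonian) and the Russian sentence is labelled `mk` (Macedonian). A quick way to quantify this is to compare the predictions against the intended labels. The sketch below hard-codes both lists from this notebook. The `accuracy` helper is a hypothetical name introduced here, not part of `langdetect`.

```python
# Intended codes, taken from the comments in the dataset cell
expected = ["en", "fr", "hi", "es", "it", "de", "pt", "ru", "ja", "ko"]
# Predicted codes, copied from the output table above
predicted = ["en", "fr", "et", "es", "it", "de", "pt", "mk", "ja", "ko"]

def accuracy(expected_codes, predicted_codes):
    """Return the fraction of positions where the two lists agree."""
    if not expected_codes or len(expected_codes) != len(predicted_codes):
        raise ValueError("lists must be non-empty and of equal length")
    matches = sum(e == p for e, p in zip(expected_codes, predicted_codes))
    return matches / len(expected_codes)

# Collect (row index, intended code, predicted code) for every disagreement
mismatches = [
    (i, e, p)
    for i, (e, p) in enumerate(zip(expected, predicted))
    if e != p
]

print(f"Accuracy: {accuracy(expected, predicted):.0%}")
print("Mismatched rows:", mismatches)
```

On this ten-sentence sample the check reports 80% accuracy, with rows 2 and 7 mismatched. Statistical detectors tend to be less reliable on very short sentences, so results on longer text may differ.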


### Phase 3 – Converting Language Codes to Full Names

The `langdetect` library returns short language codes (ISO 639-1, plus region-tagged codes such as `zh-cn`).  
For presentation and analysis, it is better to display the **full language names**, such as  
"English", "French", and "Hindi". Any code missing from the mapping is shown as "Unknown".

To achieve this:
1. We create a mapping dictionary of language codes to names.  
2. We replace each code with its corresponding name using this dictionary.  
3. Finally, we display the updated DataFrame.


In [5]:
# Define a mapping of language codes to language names
lang_map = {
    'en': 'English', 'fr': 'French', 'es': 'Spanish', 'hi': 'Hindi',
    'de': 'German', 'it': 'Italian', 'pt': 'Portuguese', 'ru': 'Russian',
    'ja': 'Japanese', 'ko': 'Korean', 'ar': 'Arabic', 'zh-cn': 'Chinese'
}

# Apply the mapping
df["Detected_Language"] = df["Predicted_Language"].map(lang_map).fillna("Unknown")

# Display final DataFrame
df[["Text", "Detected_Language"]]


|   | Text | Detected_Language |
|---|------|-------------------|
| 0 | Hello, how are you? | English |
| 1 | Bonjour, comment allez-vous? | French |
| 2 | Namaste, aap kaise ho? | Unknown |
| 3 | Hola, buenos días. | Spanish |
| 4 | Ciao, come stai? | Italian |
| 5 | Hallo, wie geht es dir? | German |
| 6 | Olá, como vai você? | Portuguese |
| 7 | Привет, как дела? | Unknown |
| 8 | こんにちは、お元気ですか？ | Japanese |
| 9 | 안녕하세요, 어떻게 지내세요? | Korean |
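
The two `Unknown` rows above are not detection errors in the sense of an exception: `langdetect` returned `et` and `mk`, which have no entry in `lang_map`. One option, sketched below under the assumption that showing the raw code is acceptable in the output, is a fallback that preserves the code instead of hiding it. The names `lang_names` and `code_to_name` are hypothetical.

```python
# Same idea as lang_map above, trimmed to a few entries for illustration
lang_names = {"en": "English", "fr": "French", "ru": "Russian"}

def code_to_name(code, names):
    """Return the full name for a code, or a label that keeps the raw code."""
    if code in names:
        return names[code]
    return f"Unmapped ({code})"

for code in ["en", "et", "mk"]:
    print(code, "->", code_to_name(code, lang_names))
```

With pandas, the same function could be applied via `df["Predicted_Language"].apply(lambda c: code_to_name(c, lang_map))`, which makes it easier to tell a missing dictionary entry apart from a genuine misdetection.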


### Phase 4 – Building an Interactive Language Detection Tool

Now that we can detect the language of a text automatically,  
we will make the tool interactive by letting the user enter sentences manually.  

The system will:
1. Accept user input.  
2. Detect the language using `langdetect`.  
3. Display both the language code and its full name.  
4. Continue detecting until the user types **exit**.

This interactive version shows how the tool behaves on arbitrary user-supplied text, including input it cannot classify.


In [6]:
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0

# Mapping language codes to readable names
lang_map = {
    'en': 'English', 'fr': 'French', 'es': 'Spanish', 'hi': 'Hindi',
    'de': 'German', 'it': 'Italian', 'pt': 'Portuguese', 'nl': 'Dutch',
    'ru': 'Russian', 'ja': 'Japanese', 'ko': 'Korean', 'ar': 'Arabic',
    'zh-cn': 'Chinese'
}

def detect_language_live(text):
    """Return (code, name) for the text, or an error pair if detection fails."""
    try:
        code = detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only)
        return "error", "Could not detect"
    return code, lang_map.get(code, "Unknown Language")

# Interactive loop
while True:
    text = input("Enter a sentence (or type 'exit' to stop): ")
    if text.strip().lower() == 'exit':
        print("Exiting Language Detector. Goodbye!")
        break

    code, name = detect_language_live(text)
    print(f"Detected Language Code: {code}")
    print(f"Detected Language Name: {name}")
    print("-" * 50)


Detected Language Code: et
Detected Language Name: Unknown Language
--------------------------------------------------
Detected Language Code: en
Detected Language Name: English
--------------------------------------------------
Detected Language Code: error
Detected Language Name: Could not detect
--------------------------------------------------
Exiting Language Detector. Goodbye!
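
The transcript above does not show which inputs produced each result, because the loop reads from `input()` directly. A possible restructuring, sketched here, separates the loop logic from the input source so it can be exercised with a fixed list of sentences. `run_detector` and `fake_detect` are hypothetical names, and the stub stands in for `langdetect.detect` so the example runs without the library.

```python
def run_detector(lines, detect_fn):
    """Process sentences until 'exit' is seen; return (text, code) pairs.

    detect_fn is any callable that maps a string to a language code,
    for example a wrapper around langdetect's detect().
    """
    results = []
    for line in lines:
        text = line.strip()
        if text.lower() == "exit":
            break
        if not text:
            continue  # skip blank input instead of calling the detector
        results.append((text, detect_fn(text)))
    return results

# Stub detector used only for this demonstration (not langdetect)
def fake_detect(text):
    return "en" if text.isascii() else "other"

print(run_detector(["Hello there", "", "Привет", "EXIT", "ignored"], fake_detect))
```

Because each result is stored alongside its input, the output can be checked or saved later. In the notebook, the same function could be fed from `input()` by passing a generator that yields user-entered lines.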
