## Coding a Language Detection Model in Python

In [3]:
# Install the necessary packages if they are not already installed
# pip install transformers torch datasets pandas scikit-learn nltk

Importing Necessary Libraries 

In [4]:
import pandas as pd
import re
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification, Trainer, TrainingArguments, pipeline
from datasets import Dataset, DatasetDict, ClassLabel
import torch
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



Add your file path here

In [6]:
# Load the dataset
file_path = 'dataset/language-detection-full-dataset.csv'
df = pd.read_csv(file_path)

### Building a model using the pretrained XLM-RoBERTa base model

In [7]:
# Data Preprocessing
df['Text'] = df['Text'].str.lower().apply(lambda x: re.sub(r'[^\w\s]', '', x))
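One caveat with the pattern `[^\w\s]`: in Python's `re` module, `\w` does not match Unicode combining marks, so vowel signs in scripts such as Devanagari, Bengali, Tamil, or Thai are stripped together with the punctuation. The sketch below shows the effect and one possible alternative, assuming it is acceptable to drop only characters whose Unicode category is punctuation (`P*`) or symbol (`S*`). The helper name `strip_punctuation` is hypothetical and not part of the original notebook; unlike the regex, it also removes the underscore.

```python
import re
import unicodedata

def strip_punctuation(text: str) -> str:
    """Lowercase text and drop punctuation/symbols, keeping letters, digits, marks, spaces."""
    kept = [
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith(("P", "S"))
    ]
    return "".join(kept)

hindi = "नमस्ते, दुनिया!"
regex_cleaned = re.sub(r"[^\w\s]", "", hindi)  # also removes vowel signs and the virama
mark_safe = strip_punctuation(hindi)           # removes only the comma and exclamation mark

print(repr(regex_cleaned))
print(repr(mark_safe))
```

If the mark-preserving behaviour is wanted, the same helper could replace the lambda in the preprocessing cell, for example `df['Text'].apply(strip_punctuation)`.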

In [8]:
# Map each numeric label to its language name, and each language name back to its label
label_to_language = df[['Label', 'Language']].drop_duplicates().set_index('Label').to_dict()['Language']
language_to_label = {v: k for k, v in label_to_language.items()}
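Inverting a dictionary silently drops entries when two labels share the same language name, for instance if the CSV repeats a name under a different label. A small sketch of a sanity check, using a hypothetical list of `(label, language)` pairs in place of the real DataFrame rows:

```python
from collections import defaultdict

# Hypothetical (label, language) pairs standing in for df[['Label', 'Language']]
pairs = [(0, "English"), (1, "French"), (2, "Hindi"), (1, "French")]

label_to_language = dict(pairs)  # identical repeated pairs collapse harmlessly
language_to_label = {name: label for label, name in label_to_language.items()}

# Collect every label seen for each language name to detect collisions
labels_per_language = defaultdict(set)
for label, name in pairs:
    labels_per_language[name].add(label)

collisions = {name: labels for name, labels in labels_per_language.items() if len(labels) > 1}
is_one_to_one = not collisions and len(label_to_language) == len(language_to_label)
print(is_one_to_one)
```

If `collisions` is not empty, the inverse mapping keeps only one of the conflicting labels, so it is worth cleaning the `Language` column first.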

In [9]:
# Map ISO 639-1 language codes to full language names
language_code_to_name = {
    "af": "Afrikaans",
    "ar": "Arabic",
    "bg": "Bulgarian",
    "bn": "Bengali",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "et": "Estonian",
    "fa": "Persian",
    "fi": "Finnish",
    "fr": "French",
    "gu": "Gujarati",
    "he": "Hebrew",
    "hi": "Hindi",
    "hr": "Croatian",
    "hu": "Hungarian",
    "id": "Indonesian",
    "it": "Italian",
    "ja": "Japanese",
    "kn": "Kannada",
    "ko": "Korean",
    "lt": "Lithuanian",
    "lv": "Latvian",
    "ml": "Malayalam",
    "mr": "Marathi",
    "ne": "Nepali",
    "nl": "Dutch",
    "no": "Norwegian",
    "pa": "Punjabi",
    "pl": "Polish",
    "pt": "Portuguese",
    "ro": "Romanian",
    "ru": "Russian",
    "si": "Sinhala",
    "sk": "Slovak",
    "sl": "Slovenian",
    "sq": "Albanian",
    "sv": "Swedish",
    "sw": "Swahili",
    "ta": "Tamil",
    "te": "Telugu",
    "th": "Thai",
    "tl": "Tagalog",
    "tr": "Turkish",
    "uk": "Ukrainian",
    "ur": "Urdu",
    "vi": "Vietnamese",
    "zh": "Chinese"
}

In [10]:
# Load a pretrained language detection model
model_name = "papluca/xlm-roberta-base-language-detection"
language_detection_pipeline = pipeline("text-classification", model=model_name)

In [12]:
# Creating a Prediction Function using the pretrained model
def predict_language(text):
    # Preprocess the input text
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    # Predict the language label using the pretrained model
    prediction = language_detection_pipeline(text)
    # Extract the predicted language code from the model output
    predicted_code = prediction[0]['label']
    # Convert language code to full language name
    language_name = language_code_to_name.get(predicted_code, "Unknown")
    return language_name
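The text-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. A hedged sketch of how the score could be used to return "Unknown" for low-confidence predictions follows. The `min_score` cutoff of 0.5 and the stub classifier are assumptions for illustration only; the real `language_detection_pipeline` can be passed in place of the stub.

```python
from typing import Callable, Dict, List

def predict_language_with_score(
    text: str,
    classifier: Callable[[str], List[Dict]],
    code_to_name: Dict[str, str],
    min_score: float = 0.5,  # assumed cutoff; tune it on held-out data
) -> str:
    """Return the language name, or 'Unknown' when the top score is too low."""
    top = classifier(text)[0]  # e.g. {'label': 'fr', 'score': 0.99}
    if top["score"] < min_score:
        return "Unknown"
    return code_to_name.get(top["label"], "Unknown")

# Stub that mimics the pipeline's output format, so the sketch runs offline
def fake_classifier(text: str) -> List[Dict]:
    if "bonjour" in text.lower():
        return [{"label": "fr", "score": 0.99}]
    return [{"label": "en", "score": 0.30}]

names = {"fr": "French", "en": "English"}
confident = predict_language_with_score("Bonjour tout le monde", fake_classifier, names)
uncertain = predict_language_with_score("???", fake_classifier, names)
print(confident, uncertain)
```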

In [None]:
# Testing the prediction function
sample_text = "Insert Text Here" 

A few example sentences in different languages:

- **Afrikaans (af)**: "Ek hou van musiek." (I like music.)
- **Arabic (ar)**: "أنا أحب الموسيقى." (I love music.)
- **Bulgarian (bg)**: "Обичам музиката." (I love music.)
- **Bengali (bn)**: "আমি সঙ্গীত পছন্দ করি।" (I like music.)
- **German (de)**: "Ich mag Musik." (I like music.)
- **Greek (el)**: "Μου αρέσει η μουσική." (I like music.)
- **English (en)**: "I like music."
- **Spanish (es)**: "Me gusta la música." (I like music.)
- **Estonian (et)**: "Mulle meeldib muusika." (I like music.)
- **Persian (fa)**: "من موسیقی را دوست دارم." (I like music.)
- **Finnish (fi)**: "Pidän musiikista." (I like music.)
- **French (fr)**: "J'aime la musique." (I like music.)
- **Gujarati (gu)**: "મને સંગીત ગમે છે." (I like music.)
- **Hebrew (he)**: "אני אוהב מוזיקה." (I love music.)
- **Hindi (hi)**: "मुझे संगीत पसंद है।" (I like music.)
- **Croatian (hr)**: "Volim glazbu." (I love music.)
- **Hungarian (hu)**: "Szeretem a zenét." (I love music.)
- **Indonesian (id)**: "Saya suka musik." (I like music.)
- **Italian (it)**: "Mi piace la musica." (I like music.)
- **Japanese (ja)**: "音楽が好きです。" (I like music.)
- **Kannada (kn)**: "ನಾನು ಸಂಗೀತವನ್ನು ಇಷ್ಟಪಡುತ್ತೇನೆ." (I like music.)
- **Korean (ko)**: "저는 음악을 좋아합니다." (I like music.)
- **Lithuanian (lt)**: "Man patinka muzika." (I like music.)
- **Latvian (lv)**: "Man patīk mūzika." (I like music.)
- **Malayalam (ml)**: "എനിക്ക് സംഗീതം ഇഷ്ടമാണ്." (I like music.)
- **Marathi (mr)**: "मला संगीत आवडते." (I like music.)
- **Nepali (ne)**: "मलाई संगीत मन पर्छ।" (I like music.)
- **Dutch (nl)**: "Ik hou van muziek." (I like music.)
- **Norwegian (no)**: "Jeg liker musikk." (I like music.)
- **Punjabi (pa)**: "ਮੈਨੂੰ ਸੰਗੀਤ ਪਸੰਦ ਹੈ।" (I like music.)
- **Polish (pl)**: "Lubię muzykę." (I like music.)
- **Portuguese (pt)**: "Eu gosto de música." (I like music.)
- **Romanian (ro)**: "Îmi place muzica." (I like music.)
- **Russian (ru)**: "Я люблю музыку." (I love music.)
- **Sinhala (si)**: "මට සංගීතය කැමතියි." (I like music.)
- **Slovak (sk)**: "Mám rád hudbu." (I like music.)
- **Slovenian (sl)**: "Rad imam glasbo." (I love music.)
- **Albanian (sq)**: "Më pëlqen muzika." (I like music.)
- **Swedish (sv)**: "Jag gillar musik." (I like music.)
- **Swahili (sw)**: "Ninapenda muziki." (I like music.)
- **Tamil (ta)**: "எனக்கு இசை பிடிக்கும்." (I like music.)
- **Telugu (te)**: "నాకు సంగీతం ఇష్టం." (I like music.)

These examples can be used to test and validate the language detection model.
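A minimal sketch of such a validation loop pairs each sentence with its expected language name and reports the mismatches. The `evaluate` helper and the stub predictor are hypothetical; in the notebook, `predict_language` would be passed instead. Note that the model card for `papluca/xlm-roberta-base-language-detection` lists 20 supported languages, so sentences from the list above in other languages are expected to be misclassified.

```python
from typing import Callable, List, Tuple

def evaluate(
    predict: Callable[[str], str],
    labeled: List[Tuple[str, str]],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return accuracy and a list of (text, expected, predicted) mismatches."""
    mismatches = []
    for text, expected in labeled:
        predicted = predict(text)
        if predicted != expected:
            mismatches.append((text, expected, predicted))
    accuracy = 1 - len(mismatches) / len(labeled)
    return accuracy, mismatches

labeled_examples = [
    ("Ich mag Musik.", "German"),
    ("Me gusta la música.", "Spanish"),
    ("J'aime la musique.", "French"),
    ("I like music.", "English"),
]

# Stub predictor for an offline run; it always answers "English"
accuracy, mismatches = evaluate(lambda text: "English", labeled_examples)
print(f"accuracy={accuracy:.2f}, mismatches={len(mismatches)}")
```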

In [14]:
predicted_language = predict_language(sample_text)
print(f"The predicted language for the input text is: {predicted_language}")

The predicted language for the input text is: English


### Building a model using NLTK preprocessing, Random Forest, and cross-validation

In [15]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/apple/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
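# Data cleaning function
# Build the English stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    # Only English stop words are removed; other languages pass through unchanged
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)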

# Apply data cleaning to the 'Text' column
df['cleaned_text'] = df['Text'].apply(clean_text)

In [17]:

# Split the data into features (X) and labels (y)
X = df['cleaned_text']
y = df['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
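The classes in this dataset are not equally frequent, so a plain random split can leave rare languages under-represented in the test set. scikit-learn's `train_test_split` accepts a `stratify=y` argument for this purpose. The sketch below shows the idea in plain Python on hypothetical toy labels: indices are grouped by label and the same fraction is sampled from each group. The helper name `stratified_indices` is an assumption for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_indices(labels, test_size=0.2, seed=42):
    """Split indices so each label keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))  # at least one test sample
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced labels: 50 samples of class 0 and 10 of class 1
labels = [0] * 50 + [1] * 10
train_idx, test_idx = stratified_indices(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

In the notebook itself, the equivalent would be passing `stratify=y` to `train_test_split`.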

In [18]:
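# Create a pipeline with TF-IDF vectorizer and RandomForestClassifier
# Note: this rebinds the name `pipeline`, shadowing the transformers `pipeline`
# function imported earlier; re-import it before re-running the pretrained-model cell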
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 3), analyzer='char_wb', max_features=50000)),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=42))
])
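With `analyzer='char_wb'` and `ngram_range=(1, 3)`, the vectorizer builds features from character n-grams of length 1 to 3 taken inside word boundaries, with each word padded by a space on both sides. The plain-Python sketch below approximates that extraction to show what the features look like. It is an illustration, not scikit-learn's implementation, and `char_wb_ngrams` is a hypothetical name.

```python
def char_wb_ngrams(text, min_n=1, max_n=3):
    """Approximate 'char_wb' n-grams: per word, padded with one space on each side."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the padded word
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

features = char_wb_ngrams("hi", min_n=2, max_n=3)
print(features)  # [' h', 'hi', 'i ', ' hi', 'hi ']
```

Character n-grams need no language-specific tokenizer, which is one reason they are a common choice for language identification.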

In [19]:
# Define hyperparameters for tuning
param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5, 10]
}
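The size of this grid determines how long the search runs: 2 values of `n_estimators` times 3 of `max_depth` times 3 of `min_samples_split` give 18 candidates, and with 5-fold cross-validation that is 90 fits, which matches the "18 candidates, totalling 90 fits" line in the grid search log. A quick check with `itertools.product`:

```python
from itertools import product

param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [None, 10, 20],
    "clf__min_samples_split": [2, 5, 10],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 18 90
```

With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the full training set.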

In [20]:
# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...

[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=100; total time=  46.8s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=100; total time=  47.4s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=100; total time=  48.6s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=100; total time=  49.0s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=100; total time=  49.0s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=200; total time=  54.3s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=200; total time=  54.3s
[CV] END clf__max_depth=None, clf__min_samples_split=2, clf__n_estimators=200; total time=  54.6s
[CV] END clf__max_depth=None, clf__min_samples_split=5, clf__n_estimators=100; total time=  34.7s
[CV] END clf__max_depth=None, clf__min_samples_split=5, clf__n_estimators=100; total time=  35.3s
[CV] END clf__max_de
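The `huggingface/tokenizers` warning in the log above appears because `GridSearchCV` with `n_jobs=-1` starts worker processes after the tokenizer of the pretrained model has already been used in the same session. It does not affect the Random Forest search, which does not use that tokenizer. One way to silence it, as a sketch, is to set the `TOKENIZERS_PARALLELISM` environment variable before `transformers` is imported, for example in the first cell of the notebook:

```python
import os

# Must run before `transformers`/`tokenizers` is first used in the process
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```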

In [21]:
# Get the best model
best_model = grid_search.best_estimator_

In [22]:
# Make predictions on the test set
y_pred = best_model.predict(X_test)

In [23]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.98


In [24]:
# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       302
           1       0.99      0.98      0.98       208
           2       0.95      0.91      0.93        85
           3       0.99      0.97      0.98       294
           4       0.83      1.00      0.91       461
           5       1.00      0.96      0.98       183
           6       0.99      0.98      0.99       422
           7       0.95      0.96      0.96        85
           8       1.00      1.00      1.00        75
           9       1.00      0.99      1.00       224
          10       1.00      0.97      0.99       219
          11       0.96      0.96      0.96       137
          12       1.00      0.97      0.99       195
          13       1.00      1.00      1.00        74
          14       1.00      1.00      1.00       212
          15       1.00      0.91      0.95       193
          16       1.00      0.99      1.00       112
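In the report, class 4 has a precision of 0.83 with a recall of 1.00: every true class-4 sample is found, but samples from other classes are sometimes predicted as class 4. The sketch below recomputes precision and recall by hand on hypothetical toy labels to make the two definitions concrete.

```python
def precision_recall(y_true, y_pred, label):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: all five class-4 samples are found, and one class-2 sample
# is wrongly predicted as class 4
y_true = [4, 4, 4, 4, 4, 2, 2, 2]
y_pred = [4, 4, 4, 4, 4, 4, 2, 2]

precision, recall = precision_recall(y_true, y_pred, label=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.83, recall=1.00
```

Because overall accuracy is dominated by the larger classes, the per-class rows are the better place to look for this kind of confusion.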
   

In [25]:
# Function to predict language for new text
# (this redefines predict_language and replaces the pretrained-model version above)
def predict_language(text):
    cleaned = clean_text(text)
    prediction = best_model.predict([cleaned])[0]
    language = df[df['Label'] == prediction]['Language'].iloc[0]
    return language


In [28]:
# Test the model with some example sentences
examples = [
    "Hi, how are you?",  # English
    "Bonjour, comment allez-vous?",  # French
    "Hola, ¿cómo estás?",  # Spanish
    "Ciao, come stai?",  # Italian
    "Hallo, wie geht es dir?",  # German
    "नमस्ते, आप कैसे हैं?",  # Hindi
    "こんにちは、お元気ですか?",  # Japanese
    "안녕하세요, 어떻게 지내세요?",  # Korean
    "你好，你好吗?",  # Chinese
    "Здравствуйте, как вы?",  # Russian
    "مرحبا كيف حالك؟",  # Arabic
    "வணக்கம், நீங்கள் எப்படி இருக்கிறீர்கள்?",  # Tamil
    "Oi, como você está?",  # Portuguese
    "გამარჯობა, როგორ ხარ?",  # Georgian
    "ہیلو، آپ کیسے ہیں؟",  # Urdu
    "Olá, como estás?",  # Portuguese (Portugal)
    "Tere, kuidas sul läheb?",  # Estonian
    "Sveiki, kā jums klājas?",  # Latvian
    "Sveiki, kaip sekasi?",  # Lithuanian
    "Hej, hur mår du?",  # Swedish
    "Saluton, kiel vi fartas?",  # Esperanto
    "Hallo, hoe gaan dit?",  # Afrikaans
    "Hej, hvordan har du det?",  # Danish
    "Hej, hvordan går det?",  # Norwegian
    "Hei, miten voit?",  # Finnish
    "Привіт, як ти?",  # Ukrainian
    "Bună, ce mai faci?",  # Romanian
    "Γεια σας, πώς είστε;",  # Greek
    "สวัสดีคุณเป็นอย่างไร?",  # Thai
    "مرحبا، كيف حالك؟",  # Arabic (alternative)
    "مرحبا، كيفك؟",  # Arabic (colloquial)
]

In [29]:
print("\nPredictions for example sentences:")
for example in examples:
    predicted_language = predict_language(example)
    print(f"Text: '{example}' - Predicted Language: {predicted_language}")


Predictions for example sentences:
Text: 'Hi, how are you?' - Predicted Language: English
Text: 'Bonjour, comment allez-vous?' - Predicted Language: French
Text: 'Hola, ¿cómo estás?' - Predicted Language: Spanish
Text: 'Ciao, come stai?' - Predicted Language: Italian
Text: 'Hallo, wie geht es dir?' - Predicted Language: German
Text: 'नमस्ते, आप कैसे हैं?' - Predicted Language: Hindi
Text: 'こんにちは、お元気ですか?' - Predicted Language: Japanese
Text: '안녕하세요, 어떻게 지내세요?' - Predicted Language: Korean
Text: '你好，你好吗?' - Predicted Language: English
Text: 'Здравствуйте, как вы?' - Predicted Language: Russian
Text: 'مرحبا كيف حالك؟' - Predicted Language: Arabic
Text: 'வணக்கம், நீங்கள் எப்படி இருக்கிறீர்கள்?' - Predicted Language: Tamil
Text: 'Oi, como você está?' - Predicted Language: Portugeese
Text: 'გამარჯობა, როგორ ხარ?' - Predicted Language: English
Text: 'ہیلو، آپ کیسے ہیں؟' - Predicted Language: Urdu
Text: 'Olá, como estás?' - Predicted Language: Spanish
Text: 'Tere, kuidas sul läheb?' - Predict
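The output above shows the Chinese and Georgian sentences labelled as English. A classifier always returns one of the classes it was trained on, so text in a language or script that is missing or rare in the training data still receives some label. A hedged sketch of one mitigation is to use `predict_proba` and return "Unknown" when the top probability falls below a cutoff. The `min_proba` value and the stub model are assumptions for illustration; in the notebook, `best_model` (which exposes `predict_proba` and `classes_`) could be passed instead.

```python
from typing import Dict, Sequence

def predict_or_unknown(text: str, model, label_names: Dict[int, str], min_proba: float = 0.5) -> str:
    """Return the predicted language name, or 'Unknown' below the cutoff."""
    probabilities = list(model.predict_proba([text])[0])
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < min_proba:
        return "Unknown"
    return label_names[model.classes_[best]]

class StubModel:
    """Stand-in with the two attributes used from a fitted scikit-learn classifier."""
    classes_ = [0, 1]

    def predict_proba(self, texts: Sequence[str]):
        # Confident on ASCII text, undecided otherwise (purely illustrative)
        return [[0.9, 0.1] if t.isascii() else [0.5, 0.5] for t in texts]

names = {0: "English", 1: "French"}
known = predict_or_unknown("hi how are you", StubModel(), names)
unknown = predict_or_unknown("გამარჯობა", StubModel(), names, min_proba=0.6)
print(known, unknown)
```

With the real model, the text should first go through `clean_text`, as in `predict_language`, and the cutoff is best chosen on held-out data.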