# STOP - This is a read-only Jupyter notebook. Save a copy.

Please save a copy in your drive so you will be able to run and modify the code on your copy.

`"File" > "Save a copy in Drive"`

# Getting started with AI

This notebook walks through some simpler ways of using AI that require little to no experience with AI models. It works through one example problem using several different methods. The hope is that some of these simpler methods may be available and effective for other use cases.

We will walk through using:
1. Packaged Library
2. LLM API*
3. Hugging Face Model

*For the LLM, an OpenAI API key is required.

# Language Detection Example

Let's say we want to detect the language used in the title field to confirm whether it was cataloged in the language of the material. For this example, we will use the Japanese title "南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画", taken from a MARC 245 field in a [HathiTrust record](https://catalog.hathitrust.org/Record/100069174.marc) and stripped of punctuation. The item is available for [full view](https://babel.hathitrust.org/cgi/pt?id=keio.10812569024). We will explore using models to detect the language of the title.



## Method 1: Use a Packaged Library
The easiest thing to do is hunt for a well-supported library package that comes with a model. Fortunately, language detection has been worked on extensively, and there is already a package ready to use. For this example, we will use [fastText](https://fasttext.cc/), a library that can use a few different models for text-related tasks. We need to download the library and also the [language detection model](https://fasttext.cc/docs/en/language-identification.html) to load into the library. The model can detect 176 different languages.

In [1]:
import os

# Install the fastText Python library
# More info: https://fasttext.cc/
!pip install fasttext

# Download the uncompressed model for the library for language detection.
# More info: https://fasttext.cc/docs/en/language-identification.html

# Define the file path and URL
fasttext_model_filepath = '/content/lid.176.bin'
url = 'https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin'

# Check if the file already exists
if not os.path.exists(fasttext_model_filepath):
    # If the file does not exist, download it
    !wget {url} -O {fasttext_model_filepath}
else:
    print("Model already downloaded: {}".format(fasttext_model_filepath))


Model already downloaded: /content/lid.176.bin


### Simple Detection Script
In just a few steps, we get the model to correctly guess that the title is Japanese.

In [2]:
import fasttext

# Set title text for language detection
title = "南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画"

# Load the pre-trained language detection model
model = fasttext.load_model(fasttext_model_filepath)

# Predict the language of the title with the model
model_output = model.predict(title)

# Clean up output to get language code
language = model_output[0][0].replace("__label__", "")

print(f"Detected language: {language}")




Detected language: ja


### The Model Output
Let's take a closer look at the model output, which is more complex than a simple language code. The model returns a tuple with two elements: a tuple of labels and an array of confidence scores.



In [3]:
print(model.predict(title))

(('__label__ja',), array([0.69136173]))


 ```
 (('__label__ja',), array([0.69136173]))
 ```

*   The label is `"__label__ja"`, which carries the ISO 639-1 code "ja" (Japanese). The `__label__` prefix is a naming convention that fastText uses to mark labels.

*   The confidence score is 0.69136173 (69%). This number represents the [confidence score](https://www.mindee.com/blog/how-use-confidence-scores-ml-models#most-common-machine-learning-confidence-scores) or probability that the model assigns to its prediction of `__label__ja`. It's based on the data the model saw in training, so if your data is somewhat or very different from the original training data, these confidence scores may not be well calibrated. The model may be over- or under-confident. Keep this in mind when relying on confidence scores.
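Calibration can be checked empirically when a small set of titles with known languages is available. The sketch below is one possible way to do that and is not part of fastText: `confidence_gap` is a hypothetical helper, and the sample records are made-up placeholders rather than real results. It compares the average confidence with the actual accuracy; a large positive gap would suggest over-confidence.

```python
def confidence_gap(records):
    """Return mean confidence minus accuracy.

    records: list of (predicted_code, confidence, true_code) tuples.
    A positive result means the model was more confident than accurate.
    """
    total = len(records)
    accuracy = sum(1 for predicted, _, truth in records if predicted == truth) / total
    mean_confidence = sum(confidence for _, confidence, _ in records) / total
    return mean_confidence - accuracy


# Placeholder records for illustration only (not real model output)
sample_records = [
    ("ja", 0.90, "ja"),
    ("zh", 0.80, "ja"),
]
print(f"Confidence gap: {confidence_gap(sample_records):.2f}")
```

In practice the records would come from running `model.predict` over titles whose language has already been verified by a cataloger.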

## What other languages make up the 31%?

Another thing we can do with this model is get the top predictions for the text. We can see the other possibilities by passing in the `k` parameter, which sets how many of the top predictions to return.


In [4]:
print(model.predict(title, k=5))


(('__label__ja', '__label__zh', '__label__en', '__label__es', '__label__ko'), array([0.69136173, 0.29256272, 0.0096868 , 0.00158292, 0.00132226]))


We now have multiple language codes and probabilities. Let's clean it up so it's easier to read with some helper functions to use in the rest of the notebook.

In [5]:
# Install pycountry to map language codes to language names.
!pip install pycountry



In [6]:
# Define helper functions for pretty print
import pycountry

def pprint_language_codes(labels, scores):
    for label, score in zip(labels, scores):
        # Check the length of the label to determine if it's alpha_2 or alpha_3
        if len(label) == 2:
            language = pycountry.languages.get(alpha_2=label)
        elif len(label) == 3:
            language = pycountry.languages.get(alpha_3=label)
        else:
            language = None

        # Get language name or use "Unknown" if not found
        language_name = language.name if language else "Unknown"

        # Convert to percentage and round
        percentage = round(score * 100, 2)

        # Print the language details with confidence percentage
        print(f"Language: {language_name} ({label}), Confidence: {percentage}%")

def clean_labels(labels):
  # Remove the prefix '__label__' from each label in the tuple
  cleaned = tuple(label.replace('__label__', '') for label in labels)
  return cleaned

import time
from contextlib import contextmanager

# A timer context manager will be useful for comparing run times
@contextmanager
def timer(description: str):
    start_time = time.time()
    try:
        yield
    finally:
        elapsed_time = time.time() - start_time
        print(f"{description}: {elapsed_time:.2f} seconds")

In [7]:
labels, codes = model.predict(title, k=5)
labels = clean_labels(labels)
pprint_language_codes(labels,codes)

Language: Japanese (ja), Confidence: 69.14%
Language: Chinese (zh), Confidence: 29.26%
Language: English (en), Confidence: 0.97%
Language: Spanish (es), Confidence: 0.16%
Language: Korean (ko), Confidence: 0.13%


From this output, we see that Chinese is a respectable second at 29%. To get some theories about why, we consulted ChatGPT (GPT-4). Below is a summary of its response.

### Why is Chinese scoring so high for a Japanese text?

1. Japanese uses three types of scripts: Kanji, Hiragana, and Katakana. Kanji are derived from Chinese characters and are identical in form to many traditional Chinese characters. When Japanese text predominantly features Kanji, it can look very similar to Chinese text. Kanji and Chinese characters can share the same Unicode code points because they are visually identical or very similar.
2. There is not a lot of context in a title. The fewer characters there are, the less context the model has to make a good decision.

__An additional note__: Language models might not always accurately identify languages that share common scripts. If a model has been trained predominantly on data where Chinese characters are used more frequently in Chinese contexts, it might be biased towards predicting Chinese when it encounters similar characters in an ambiguous setting.

OpenAI. (2024). ChatGPT (GPT 4 May 7 version) [Large language model].
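The first point can be checked directly by counting which scripts appear in a title. This is a minimal sketch based on Unicode block ranges (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF). `script_counts` is a hypothetical helper, and the ranges deliberately ignore rarer extension blocks.

```python
def script_counts(text):
    """Count characters by script using basic Unicode block ranges."""
    counts = {"hiragana": 0, "katakana": 0, "han": 0, "other": 0}
    for char in text:
        code_point = ord(char)
        if 0x3040 <= code_point <= 0x309F:
            counts["hiragana"] += 1
        elif 0x30A0 <= code_point <= 0x30FF:
            counts["katakana"] += 1
        elif 0x4E00 <= code_point <= 0x9FFF:
            counts["han"] += 1
        else:
            # Spaces, brackets, Latin letters, and anything else
            counts["other"] += 1
    return counts


example_title = "南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画"
print(script_counts(example_title))
```

For the example title, this reports no Hiragana or Katakana at all, which is consistent with the model having only Chinese-derived characters to go on.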

### How about titles in other languages?

Let's add an Arabic (ar) and a Slovak (sk) title to play around with as well. You can see below that when we run them through fastText, it's incredibly confident the ar title is Arabic, and fairly confident in the Slovak title. If we had more information than just a title, the model could likely be more confident. Only you can tell whether this will be acceptable in the context where your model will be used.

In [8]:
title_ja = "南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画"
title_ar = "دراسات سكانية"
title_sk = "Trávnice slovenských národných piesní Na klavír pristrojíl Miroslav Francisci"

In [9]:
print("Title: " + title_ar)
labels, codes = model.predict(title_ar, k=5)
labels = clean_labels(labels)
pprint_language_codes(labels,codes)

Title: دراسات سكانية
Language: Arabic (ar), Confidence: 99.78%
Language: Egyptian Arabic (arz), Confidence: 0.17%
Language: Mazanderani (mzn), Confidence: 0.03%
Language: Persian (fa), Confidence: 0.01%
Language: Dhivehi (dv), Confidence: 0.01%


The model is incredibly confident this title is in Arabic, nearly 100%.

In [10]:
print("Title: " + title_sk)
with timer("SK"):
  labels, codes = model.predict(title_sk, k=5)
labels = clean_labels(labels)
pprint_language_codes(labels,codes)


Title: Trávnice slovenských národných piesní Na klavír pristrojíl Miroslav Francisci
SK: 0.00 seconds
Language: Slovak (sk), Confidence: 77.42%
Language: Czech (cs), Confidence: 22.15%
Language: Russian (ru), Confidence: 0.2%
Language: Catalan (ca), Confidence: 0.07%
Language: Swedish (sv), Confidence: 0.07%


The model is moderately confident in Slovak with 77% confidence.


## So what does uncertainty mean for using the model?

One of the most significant differences between typical programming scenarios and using a model is that models have varying degrees of certainty. Some models include a confidence score to help evaluate certainty and risk, but other models do not report any confidence score at all. All models can be confidently wrong, so these risks need to be considered when incorporating AI into any pipeline or application.
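One common way to act on this uncertainty is to accept only predictions that are both confident and clearly ahead of the runner-up, and to send everything else to a person for review. The sketch below illustrates that idea using the fastText scores shown earlier in this notebook. `route_prediction` and both cutoff values are hypothetical and would need to be tuned for a real workflow.

```python
def route_prediction(labels, scores, min_confidence=0.9, min_margin=0.5):
    """Decide whether to accept the top prediction or flag it for review.

    labels and scores are ordered from most to least likely,
    as returned by model.predict(text, k=...) after cleaning the labels.
    """
    top_label, top_score = labels[0], scores[0]
    runner_up_score = scores[1] if len(scores) > 1 else 0.0
    margin = top_score - runner_up_score
    if top_score >= min_confidence and margin >= min_margin:
        return ("accept", top_label)
    return ("review", top_label)


# Top-two scores copied from the fastText outputs above
print(route_prediction(("ja", "zh"), [0.6914, 0.2926]))
print(route_prediction(("ar", "arz"), [0.9978, 0.0017]))
print(route_prediction(("sk", "cs"), [0.7742, 0.2215]))
```

With these assumed cutoffs, only the Arabic title would be accepted automatically; the Japanese and Slovak titles would be flagged for review.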

## Using LLM API (GPT example)

What about using an LLM API? Large language models are powerful models with great language abilities. It's rare to find a packaged library with a model that does exactly what you want, so for this exercise, let's pretend fastText doesn't exist. The next easiest thing is to see how an LLM performs.

In [11]:
!pip install openai



### Setting up an API Key

To do this part of the exercise, you will need an API key from OpenAI. If you don't have one, you will need to create an OpenAI account and set up billing. OpenAI charges by input token and output token.

In [12]:
import openai

# Initialize the OpenAI client
client = openai.OpenAI(api_key='Your API Key Here')
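Pasting a key directly into a notebook makes it easy to share by accident. A safer pattern, sketched below, is to read it from an environment variable. `load_api_key` is a hypothetical helper; `OPENAI_API_KEY` is the variable name the OpenAI client also checks by default when no `api_key` argument is given.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read an API key from an environment variable, failing loudly if it is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return api_key


# Usage (assumes the variable has been set, for example through Colab secrets):
# client = openai.OpenAI(api_key=load_api_key())
```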

### Helper Methods for OpenAI

In [13]:
import openai
import json



def get_response_from_openai(prompt, model_name):
  # Call the API to generate text
  completion = client.chat.completions.create(
      messages=[
          {
              "role": "user",
              "content": prompt,
          }
      ],
      model=model_name,
      temperature=0,
      response_format={ "type": "json_object" }
  )
  return completion.choices[0].message.content.strip()



In [14]:
# helper method for generating prompts
def generate_prompt(prompt, data):
  return prompt + data

In [15]:
# defining a simple prompt
simple_top_5_prompt = "Predict up to five possible ISO 639-1 language codes,\
 ranked by likelihood in json: "

print(generate_prompt(simple_top_5_prompt, title_ja))

Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json: 南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画


In [16]:
import time
from contextlib import contextmanager

# helper timer method
@contextmanager
def timer(description: str):
    start_time = time.time()
    try:
        yield
    finally:
        elapsed_time = time.time() - start_time
        print(f"{description}: {elapsed_time:.2f} seconds")


In [17]:
with timer("GPT 3.5 Turbo Model w/ Simple Prompt"):
    simple_top_5_result_35_turbo = get_response_from_openai(generate_prompt(simple_top_5_prompt, title_ja), "gpt-3.5-turbo-0125")

print("Prompt:" + simple_top_5_prompt)
print("Data input:" + title_ja)
print(simple_top_5_result_35_turbo)

GPT 3.5 Turbo Model w/ Simple Prompt: 0.92 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "1": "zh",
  "2": "ja",
  "3": "ko",
  "4": "vi",
  "5": "th"
}


In [18]:
with timer("GPT 4o Model w/ Simple Prompt"):
  simple_top_5_result_4o = get_response_from_openai(generate_prompt(simple_top_5_prompt, title_ja), "gpt-4o")

print("Prompt:" + simple_top_5_prompt)
print("Data input:" + title_ja)
print(simple_top_5_result_4o)

GPT 4o Model w/ Simple Prompt: 0.87 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "predictions": [
    "zh",
    "ja",
    "ko",
    "vi",
    "th"
  ]
}


### Expanding and improving a prompt

In [19]:
# defining a more complex prompt for format and accuracy:
complex_top_5_prompt = "You are a specialized AI model trained for language \
detection in bibliographic titles found within MARC records. Your task is to \
analyze the given text, focusing particularly on the context of shared \
characters to distinguish between languages that use overlapping scripts. \
Use contextual cues and other indicators to accurately predict up to five \
possible ISO 639-1 language codes, ranked by likelihood. \
The output should be formatted as a JSON array in the following structure: \
[{'code': '<ISO_code>', 'language': '<language_name>'}]. For instance, for the \
text 'Pride and Prejudice', you should return: {'result': \
[{'code': 'en', 'language': 'English'}]}. Please analyze the text below and \
enter your text after the colon: "


print(generate_prompt(complex_top_5_prompt, title_ja))


You are a specialized AI model trained for language detection in bibliographic titles found within MARC records. Your task is to analyze the given text, focusing particularly on the context of shared characters to distinguish between languages that use overlapping scripts. Use contextual cues and other indicators to accurately predict up to five possible ISO 639-1 language codes, ranked by likelihood. The output should be formatted as a JSON array in the following structure: [{'code': '<ISO_code>', 'language': '<language_name>'}]. For instance, for the text 'Pride and Prejudice', you should return: {'result': [{'code': 'en', 'language': 'English'}]}. Please analyze the text below and enter your text after the colon: 南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画


In [20]:
with timer("GPT 3.5 Turbo Model w/ Complex Prompt"):
    complex_top_5_result_35_turbo = get_response_from_openai(generate_prompt(complex_top_5_prompt, title_ja), "gpt-3.5-turbo-0125")

print("Prompt:" + complex_top_5_prompt)
print("Data input:" + title_ja)
print(complex_top_5_result_35_turbo)

GPT 3.5 Turbo Model w/ Complex Prompt: 1.04 seconds
Prompt:You are a specialized AI model trained for language detection in bibliographic titles found within MARC records. Your task is to analyze the given text, focusing particularly on the context of shared characters to distinguish between languages that use overlapping scripts. Use contextual cues and other indicators to accurately predict up to five possible ISO 639-1 language codes, ranked by likelihood. The output should be formatted as a JSON array in the following structure: [{'code': '<ISO_code>', 'language': '<language_name>'}]. For instance, for the text 'Pride and Prejudice', you should return: {'result': [{'code': 'en', 'language': 'English'}]}. Please analyze the text below and enter your text after the colon: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{"result": [{"code": "ja", "language": "Japanese"}]}


In [21]:
with timer("GPT 4o Model w/ Complex Prompt"):
    complex_top_5_result_4o = get_response_from_openai(generate_prompt(complex_top_5_prompt, title_ja), "gpt-4o")

print("Prompt:" + complex_top_5_prompt)
print("Data input:" + title_ja)
print(complex_top_5_result_4o)

GPT 4o Model w/ Complex Prompt: 0.99 seconds
Prompt:You are a specialized AI model trained for language detection in bibliographic titles found within MARC records. Your task is to analyze the given text, focusing particularly on the context of shared characters to distinguish between languages that use overlapping scripts. Use contextual cues and other indicators to accurately predict up to five possible ISO 639-1 language codes, ranked by likelihood. The output should be formatted as a JSON array in the following structure: [{'code': '<ISO_code>', 'language': '<language_name>'}]. For instance, for the text 'Pride and Prejudice', you should return: {'result': [{'code': 'en', 'language': 'English'}]}. Please analyze the text below and enter your text after the colon: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "result": [
    {"code": "ja", "language": "Japanese"},
    {"code": "zh", "language": "Chinese"}
  ]
}


In [22]:
motivating_prompt =  "Predict up to five possible ISO 639-1 language codes,\
 ranked by likelihood in json, the good bots will get high fives: "

In [23]:
with timer("GPT 4o Model w/ Complex Prompt"):
    motivating_result_4o = get_response_from_openai(generate_prompt(motivating_prompt, title_ja), "gpt-4o")

print("Prompt:" + motivating_prompt)
print("Data input:" + title_ja)
print(motivating_result_4o)

GPT 4o Model w/ Complex Prompt: 0.69 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json, the good bots will get high fives: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "predictions": [
    "ja",
    "zh",
    "ko",
    "vi",
    "th"
  ]
}


In [24]:
format_motivating_prompt =  "Predict up to five possible ISO 639-1 language codes,\
 ranked by likelihood in json, code and language in results, the good bots will get high fives: "

In [25]:
with timer("GPT 4o Model w/ Complex Prompt"):
    format_motivating_result_4o = get_response_from_openai(generate_prompt(format_motivating_prompt, title_ja), "gpt-4o")

print("Prompt:" + format_motivating_prompt)
print("Data input:" + title_ja)
print(format_motivating_result_4o)

GPT 4o Model w/ Complex Prompt: 1.83 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json, code and language in results, the good bots will get high fives: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "predictions": [
    {
      "code": "ja",
      "language": "Japanese"
    },
    {
      "code": "zh",
      "language": "Chinese"
    },
    {
      "code": "ko",
      "language": "Korean"
    },
    {
      "code": "vi",
      "language": "Vietnamese"
    },
    {
      "code": "th",
      "language": "Thai"
    }
  ]
}
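The responses so far come back in several different JSON shapes: a rank-to-code mapping, a list of codes under a wrapper key, and a list of code/language objects. Before using them in a pipeline, it helps to normalize them. The sketch below is one way to do that under the assumption that only these shapes occur; `extract_codes` is a hypothetical helper, not part of the OpenAI library.

```python
import json


def extract_codes(raw_response):
    """Return an ordered list of language codes from the JSON shapes seen above."""
    data = json.loads(raw_response)

    if isinstance(data, dict):
        values = list(data.values())
        if len(values) == 1 and isinstance(values[0], list):
            # Wrapper key such as {"predictions": [...]} or {"result": [...]}
            items = values[0]
        else:
            # Rank-to-code mapping such as {"1": "zh", "2": "ja"}
            items = values
    elif isinstance(data, list):
        items = data
    else:
        items = []

    codes = []
    for item in items:
        if isinstance(item, str):
            codes.append(item)
        elif isinstance(item, dict) and "code" in item:
            codes.append(item["code"])
    return codes


print(extract_codes('{"1": "zh", "2": "ja"}'))
print(extract_codes('{"predictions": ["ja", "zh"]}'))
print(extract_codes('{"result": [{"code": "ja", "language": "Japanese"}]}'))
```

Spelling out the exact output structure in the prompt, as the complex prompt does, reduces how much of this normalization is needed.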


### Let's look at how this does with the cheaper model, 3.5 turbo, and the other title language examples.



In [26]:
with timer("3.5 Turbo Model w/ Complex Prompt"):
    format_motivating_top_5_result_35_turbo = get_response_from_openai(generate_prompt(format_motivating_prompt, title_ja), "gpt-3.5-turbo")

print("Prompt:" + format_motivating_prompt)
print("Data input:" + title_ja)
print(format_motivating_top_5_result_35_turbo)

3.5 Turbo Model w/ Complex Prompt: 1.95 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json, code and language in results, the good bots will get high fives: 
Data input:南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
{
  "results": [
    {
      "code": "zh",
      "language": "Chinese"
    },
    {
      "code": "ja",
      "language": "Japanese"
    },
    {
      "code": "ko",
      "language": "Korean"
    },
    {
      "code": "vi",
      "language": "Vietnamese"
    },
    {
      "code": "th",
      "language": "Thai"
    }
  ]
}


In [27]:
with timer("3.5 Turbo Model w/ Complex Prompt"):
    format_motivating_top_5_result_35_turbo = get_response_from_openai(generate_prompt(format_motivating_prompt, title_ar), "gpt-3.5-turbo")

print("Prompt:" + format_motivating_prompt)
print("Data input:" + title_ar)
print(format_motivating_top_5_result_35_turbo)

3.5 Turbo Model w/ Complex Prompt: 2.58 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json, code and language in results, the good bots will get high fives: 
Data input:دراسات سكانية
{
  "results": [
    {
      "code": "ar",
      "language": "Arabic"
    },
    {
      "code": "fa",
      "language": "Persian"
    },
    {
      "code": "ur",
      "language": "Urdu"
    },
    {
      "code": "ps",
      "language": "Pashto"
    },
    {
      "code": "sd",
      "language": "Sindhi"
    }
  ]
}


In [28]:
with timer("3.5 Turbo Model w/ Complex Prompt"):
    format_motivating_top_5_result_35_turbo = get_response_from_openai(generate_prompt(format_motivating_prompt, title_sk), "gpt-3.5-turbo")

print("Prompt:" + format_motivating_prompt)
print("Data input:" + title_sk)
print(format_motivating_top_5_result_35_turbo)

3.5 Turbo Model w/ Complex Prompt: 2.34 seconds
Prompt:Predict up to five possible ISO 639-1 language codes, ranked by likelihood in json, code and language in results, the good bots will get high fives: 
Data input:Trávnice slovenských národných piesní Na klavír pristrojíl Miroslav Francisci
{
  "results": [
    {
      "code": "sk",
      "language": "Slovak"
    },
    {
      "code": "cs",
      "language": "Czech"
    },
    {
      "code": "hu",
      "language": "Hungarian"
    },
    {
      "code": "pl",
      "language": "Polish"
    },
    {
      "code": "de",
      "language": "German"
    }
  ]
}


## Working with Hugging Face Models

Hugging Face hosts a ton of models. Much like GitHub repositories, some are far along, some are experiments, some are great but unsupported, and so on. Hugging Face also provides packages that make it quite easy to use some of the hosted models.

In [29]:
# Install the Hugging Face transformer library
!pip install transformers



### Create a Pipeline and Detect Language

For this exercise, we will use the `papluca/xlm-roberta-base-language-detection` model.

In [30]:
from transformers import pipeline
# create a pipeline for language detection
language_detector = pipeline('text-classification', model='papluca/xlm-roberta-base-language-detection', return_all_scores=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





## Using the Pipeline

In [31]:
# Pass the title to the pipeline and print the result
result = language_detector(title_ja)
print(result)

[[{'label': 'ja', 'score': 0.8546233773231506}, {'label': 'nl', 'score': 0.002020410029217601}, {'label': 'ar', 'score': 0.005020438693463802}, {'label': 'pl', 'score': 0.0019789296202361584}, {'label': 'de', 'score': 0.001905736280605197}, {'label': 'it', 'score': 0.0024393198546022177}, {'label': 'pt', 'score': 0.004449539352208376}, {'label': 'tr', 'score': 0.0050081294029951096}, {'label': 'es', 'score': 0.0024506777990609407}, {'label': 'hi', 'score': 0.006665884982794523}, {'label': 'el', 'score': 0.002764463657513261}, {'label': 'ur', 'score': 0.004795867949724197}, {'label': 'bg', 'score': 0.002327968366444111}, {'label': 'en', 'score': 0.004049498122185469}, {'label': 'fr', 'score': 0.004364688880741596}, {'label': 'zh', 'score': 0.07286593317985535}, {'label': 'ru', 'score': 0.00292335101403296}, {'label': 'th', 'score': 0.012064753100275993}, {'label': 'sw', 'score': 0.0035523998085409403}, {'label': 'vi', 'score': 0.0037285308353602886}]]


Let's pretty up the result.

In [32]:
for title in [title_ja, title_ar, title_sk]:
  print("\n")
  with timer("HF Model for Title: "):
    results = language_detector(title)

    # Flatten the list of lists if needed (in case your results are nested in a list of lists)
    flat_results = results[0] if isinstance(results[0], list) else results

    # Sort the list of dictionaries by the 'score' key in descending order
    sorted_results = sorted(flat_results, key=lambda x: x['score'], reverse=True)

    # Retain only the top 5 entries
    top_5_languages = sorted_results[:5]

    # Extract labels and scores
    labels = [result['label'] for result in top_5_languages]
    scores = [result['score'] for result in top_5_languages]
  print("Title: " + title)
  pprint_language_codes(labels, scores)



HF Model for Title: : 0.30 seconds
Title: 南品傀儡 青海舎主人述 亀毛[兎角亭亀毛]画
Language: Japanese (ja), Confidence: 85.46%
Language: Chinese (zh), Confidence: 7.29%
Language: Thai (th), Confidence: 1.21%
Language: Hindi (hi), Confidence: 0.67%
Language: Arabic (ar), Confidence: 0.5%


HF Model for Title: : 0.15 seconds
Title: دراسات سكانية
Language: Arabic (ar), Confidence: 99.12%
Language: Modern Greek (1453-) (el), Confidence: 0.1%
Language: Urdu (ur), Confidence: 0.1%
Language: Portuguese (pt), Confidence: 0.1%
Language: Japanese (ja), Confidence: 0.07%


HF Model for Title: : 0.23 seconds
Title: Trávnice slovenských národných piesní Na klavír pristrojíl Miroslav Francisci
Language: Polish (pl), Confidence: 18.95%
Language: Bulgarian (bg), Confidence: 12.46%
Language: Russian (ru), Confidence: 11.47%
Language: Thai (th), Confidence: 7.15%
Language: Urdu (ur), Confidence: 6.81%


### Uh oh, that Hugging Face model doesn't support Slovak.

While this model seems as good as fastText for the languages it supports, it doesn't have the same breadth of coverage. To make this useful for our problem, we would need to do one of the following:

1. Find another model with more supported languages.
2. Make alternative processing for unsupported languages, maybe using an LLM or a patchwork of models.
3. Fine-tune this model on other languages.

While #3 is possible, it's out of scope for this tutorial. You would need to find or construct a dataset and learn more about using Hugging Face for fine-tuning. If you need to fine-tune and can find the data, working alongside an LLM can be very helpful for getting the Hugging Face or PyTorch syntax right.
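Option 2 needs some way to notice that a title is probably outside the model's coverage. One rough signal, visible in the Slovak result above, is that no label receives a high score. The sketch below assumes that signal is good enough to trigger a fallback; `pick_language` and the 0.5 cutoff are hypothetical and would need tuning on real data.

```python
def pick_language(results, min_score=0.5):
    """Return the top label, or None when no label is confident enough.

    results: list of {"label": ..., "score": ...} dicts, as produced by the
    text-classification pipeline for a single input.
    """
    best = max(results, key=lambda result: result["score"])
    if best["score"] >= min_score:
        return best["label"]
    return None


# Top scores copied (rounded) from the Hugging Face outputs above
japanese_results = [{"label": "ja", "score": 0.8546}, {"label": "zh", "score": 0.0729}]
slovak_results = [{"label": "pl", "score": 0.1895}, {"label": "bg", "score": 0.1246}]

for name, results in [("ja title", japanese_results), ("sk title", slovak_results)]:
    label = pick_language(results)
    if label is None:
        print(f"{name}: low confidence, hand off to another method")
    else:
        print(f"{name}: {label}")
```

A `None` result could then be routed to fastText, an LLM, or manual review, depending on what is available.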

## Thanks!

That is about all for getting started. Fine-tuning or building a model from scratch is far more involved and too task-specific to cover in this worksheet. Hopefully you have found some of this useful. Cheers.