<div class="alert alert-block alert-danger">

For this practice notebook, please download the notebook and run it on Google Colab, as it requires GPU access. As of September 19, 2025, our GPU infrastructure is still under development. We will notify you once GPU-enabled containers become available.

</div>

# Introduction to Language Models

This section provides practical examples of using open-source large language models (LLMs) in Python, focusing on models compatible with the `AutoModelForCausalLM` class from the Hugging Face `transformers` library. Loading models this way ensures smooth integration for causal language modeling tasks, such as text generation, where the model predicts the next tokens based on preceding input.

## Key Highlights

- **Compatibility:** Models demonstrated here work seamlessly with Hugging Face’s `transformers` ecosystem, enabling straightforward use of pretrained causal language models.
- **Efficiency:** The selected models are optimized to run efficiently on Google Colab environments with GPU constraints, such as the NVIDIA T4 GPU with approximately 15GB of VRAM.
- **Quantization:** Many of these models leverage quantization techniques, which reduce memory usage and speed up inference without significant loss in accuracy, making them accessible on modest hardware.
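
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts model weights only (activations and the attention cache add more on top), and it assumes 1 GB = 10^9 bytes. The helper name `estimate_weight_memory_gb` is illustrative and not part of any library.

```python
def estimate_weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_parameters * bits_per_parameter / 8
    return bytes_total / 1e9


# Weights-only estimates for an 8-billion-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(8e9, bits):.0f} GB")
```

At 16 bits per parameter an 8B model already needs about 16 GB for its weights, which is why quantized variants matter on a GPU with roughly 15 GB of VRAM.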

## Model Selection

A curated list of open-source LLMs has been chosen from the Hugging Face Model Hub ([https://huggingface.co/models](https://huggingface.co/models)) as of July 14, 2025. These models balance performance, size, and resource requirements, making them ideal for hands-on learning and experimentation.

## Getting Started: Loading Pretrained Models

To begin using these models, you can load both the model and tokenizer using the following Python code snippet:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example: Load a pretrained causal language model and its tokenizer
pretrained_model_name_or_path = "model-identifier-here"  # Replace with actual model ID from Hugging Face

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)
```


## Recommended Models

The models below meet these criteria. The table lists each model's parameter count, estimated VRAM requirement (for quantized versions), and key strengths. All can be downloaded directly from Hugging Face and run offline.

| Model Name | Parameter Size | VRAM (Quantized) | Key Strengths |
| :-- | :-- | :-- | :-- |
| [Llama 3.1 8B (Quantized)](https://huggingface.co/meta-llama/Llama-3.1-8B) | 8B | ~7–10 GB | Chat, summarization, reasoning |
| [Gemma 7B (Quantized)](https://huggingface.co/google/gemma-7b) | 7B | ~6–8 GB | Text generation, multilingual |
| [Phi-3 Mini (Quantized)](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) | 3.8B | ~7–8 GB | Reasoning, coding, efficient |
| [Qwen2.5 7B](https://huggingface.co/Qwen/Qwen2.5-7B) | 7B | ~6 GB | General, multilingual |
| [Mistral Small 22B (Quant.)](https://huggingface.co/nbeerbower/Mistral-Small-Drummer-22B) | 22B | ~15 GB | Creative writing, fast queries |
| [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | 7B | ~8–10 GB | Reasoning, research |
| [StableLM 7B](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | 7B | ~8 GB | Prototyping, general purpose |
| [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) | 1.1B | ~4–5 GB | Low-resource, fast testing |
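
The VRAM column assumes quantized weights. One common way to get them is to quantize on the fly at load time. The sketch below is one possible approach, assuming the `bitsandbytes` and `accelerate` packages are installed and a CUDA GPU is available. The helper name `load_quantized` and the chosen settings are illustrative, not a recommendation from the model authors.

```python
def load_quantized(model_id: str):
    """Load a causal LM with 4-bit weights; requires a CUDA GPU and `bitsandbytes`."""
    # Imported lazily so that defining this helper does not need a GPU runtime.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4-bit blocks
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # run matrix multiplications in fp16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers on the available GPU
    )
    return model, tokenizer


# Example call (downloads several GB, so it is left commented out):
# model, tokenizer = load_quantized("Qwen/Qwen2.5-7B")
```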

## Meta Llama 3.1-8B-Instruct

The Meta Llama 3.1-8B-Instruct model, hosted at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, is a part of Meta's Llama 3.1 collection of multilingual large language models. Released on July 23, 2024, this instruction-tuned variant features 8 billion parameters and is optimized for multilingual dialogue use cases, outperforming many open-source and closed chat models on industry benchmarks. It operates as an auto-regressive language model using an optimized transformer architecture, with supervised fine-tuning and reinforcement learning with human feedback to enhance helpfulness and safety. The model supports a context length of 128k tokens and is trained on over 15 trillion tokens from publicly available sources, with a knowledge cutoff of December 2023. It excels in tasks like text generation, summarization, and instruction-following, making it suitable for applications in education, research, and creative content generation, such as developing chatbots or analyzing regional data in places like Columbia, MO.

This model is designed for commercial and research use in eight supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. While it can output text in other languages, developers are advised to fine-tune it responsibly for unsupported ones to ensure safety and accuracy. The Llama 3.1 family emphasizes openness and inclusivity, with a custom commercial license (Llama 3.1 Community License) that allows redistribution and modifications, provided users comply with terms like displaying "Built with Llama" attribution and adhering to the Acceptable Use Policy. Ethical considerations are highlighted, including risks of biased or inaccurate outputs, and developers are encouraged to perform tailored safety testing. Training involved significant computational resources, equivalent to 1.46 million GPU hours, but Meta maintains net-zero greenhouse gas emissions through renewable energy matching.

Accessing the Meta Llama 3.1-8B-Instruct model requires requesting permission because the repository is gated on Hugging Face, which helps enforce licensing and responsible use. The gate ensures that users agree to Meta's terms, including prohibitions on illegal activities, harmful content generation, and violations of intellectual property. Once approved, the model can be integrated into Python workflows using libraries like Hugging Face Transformers for local or cloud-based inference. At 16-bit precision, 8 billion parameters occupy roughly 16 GB for the weights alone, so on Google Colab's free tier (about 15 GB of VRAM) a quantized version is the safer choice.

### Steps to Request Access and Authenticate

To gain access to the model and resolve authentication errors, follow these steps:

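- **Create or Log In to a Hugging Face Account:** Visit huggingface.co and sign up or log in, ensuring your account uses a valid email for notifications.
- **Request Access to the Model:** Navigate to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, click "Request Access," fill out the form agreeing to the Llama 3.1 Community License, and provide details on your intended use (e.g., educational projects at the University of Missouri). Approval may take 1–3 days.
- **Create an Access Token:** In your account settings (https://huggingface.co/settings/tokens), create a token and keep it private. The token authenticates your downloads of gated models.

Once access is granted, pass the token when loading the model and tokenizer:
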
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Replace "YOUR_HF_TOKEN_HERE" with your actual Hugging Face token
access_token = "YOUR_HF_TOKEN_HERE"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype="auto", token=access_token)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
```
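
Pasting a token into a notebook makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the token from an environment variable instead. It assumes the variable is named `HF_TOKEN`, and `get_hf_token` is an illustrative helper, not a library function.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token stored in an environment variable, or None if it is unset."""
    token = os.environ.get(env_var, "").strip()
    return token or None


access_token = get_hf_token()
if access_token is None:
    print("HF_TOKEN is not set; gated models such as Llama 3.1 will fail to download.")
else:
    print("Token found; pass it as token=access_token to from_pretrained().")
```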

```python
# Example prompt
prompt = "Describe data science opportunities at the University of Missouri."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
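
Instruction-tuned models are trained on conversations in a specific format, so they usually respond better when the prompt is passed through the tokenizer's chat template instead of as raw text. The following is a minimal sketch of that idea, assuming `model` and `tokenizer` were loaded as above. The helper names `build_chat` and `chat_generate` are illustrative.

```python
def build_chat(user_prompt: str) -> list:
    """Wrap a plain prompt in the role/content message format used by chat models."""
    return [{"role": "user", "content": user_prompt}]


def chat_generate(model, tokenizer, user_prompt: str, max_new_tokens: int = 200) -> str:
    """Format the prompt with the tokenizer's chat template and decode only the reply."""
    inputs = tokenizer.apply_chat_template(
        build_chat(user_prompt),
        add_generation_prompt=True,  # append the marker that opens the assistant turn
        return_dict=True,            # return input_ids together with attention_mask
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so that only newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)


# Example call (requires the loaded model, so it is left commented out):
# print(chat_generate(model, tokenizer, "Describe data science opportunities at the University of Missouri."))
```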

## Google Gemma-7B-IT

The Google Gemma-7B-IT model, hosted at https://huggingface.co/google/gemma-7b-it, is an instruction-tuned variant of the Gemma family of lightweight, state-of-the-art open models developed by Google. Released as part of the Gemma series, this 7 billion parameter model is designed for text-to-text tasks, including question answering, summarization, and reasoning, with a focus on English-language content. Built on technology similar to Google's Gemini models, it supports efficient deployment on resource-limited environments like laptops or cloud instances, making it suitable for educational and research applications in settings such as data science programs at the University of Missouri in Columbia, MO. The model uses a decoder-only architecture and was trained on a diverse dataset of 6 trillion tokens, including web documents, code, and mathematical text, with a knowledge cutoff around early 2024.

Gemma-7B-IT excels in generating creative text formats, powering chatbots, and supporting NLP research, with strong performance on benchmarks like MMLU (64.3% accuracy) and HellaSwag (81.2%). It incorporates safety measures such as CSAM filtering and bias mitigation, aligning with Google's Responsible AI practices. The model is licensed under open terms that require users to review and accept Google's usage license for access, promoting responsible deployment. Ethical considerations include potential biases from training data and risks of misinformation, so users are encouraged to implement content safety safeguards and perform evaluations for specific use cases.

Access to Gemma-7B-IT requires accepting conditions on Hugging Face because the repository is gated, which ensures that users agree to Google's terms for responsible use. Acceptance is typically processed immediately for logged-in users. Once access is granted, the model can be loaded via the Hugging Face Transformers library for local inference, supporting optimizations like quantization for environments with limited VRAM, such as Google Colab's free tier (approximately 15 GB VRAM and 12.7 GB RAM).

To access and use the model, follow these steps:

- **Create or Log In to a Hugging Face Account:** Visit huggingface.co and sign up or log in with a valid email for notifications.
- **Request Access to the Model:** Navigate to https://huggingface.co/google/gemma-7b-it, review Google's usage license, and click to accept the conditions. Access is typically granted immediately for logged-in users.

After gaining access, load the model for text generation tasks. The following example uses bfloat16 precision for efficiency:
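
A minimal sketch of this loading step, assuming the license has been accepted and a GPU runtime is active. The `load_gemma` helper and its arguments are illustrative names. Recent `transformers` releases also accept `dtype` in place of `torch_dtype`. Note that the unquantized weights may not fit entirely in about 15 GB of VRAM, in which case `device_map="auto"` can offload some layers at the cost of speed.

```python
def load_gemma(model_id: str = "google/gemma-7b-it", token=None):
    """Load Gemma in bfloat16; needs accepted license terms and a GPU runtime."""
    # Imported lazily so that defining the helper does not require torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half the memory of float32 weights
        device_map="auto",           # place layers on the GPU, offloading if needed
        token=token,
    )
    return model, tokenizer


# Example call (requires gated access, so it is left commented out):
# model, tokenizer = load_gemma(token="YOUR_HF_TOKEN_HERE")
```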

## StabilityAI StableLM-Base-Alpha-7B

The StabilityAI StableLM-Base-Alpha-7B model, hosted at https://huggingface.co/stabilityai/stablelm-base-alpha-7b, is a 7 billion parameter decoder-only language model pre-trained on a diverse collection of English datasets with a sequence length of 4096 tokens. Released as part of the StableLM-Base-Alpha suite, it was designed to address context window limitations in open-source language models, serving as a foundational model for tasks like text generation and further fine-tuning[1]. Although this model has been superseded by newer versions in the Stable LM collection, it remains a valuable resource for research and experimentation, particularly in educational settings such as data science programs at the University of Missouri in Columbia, MO, where users can explore English-language NLP tasks like summarization or creative writing[1]. The model uses the NeoX transformer architecture and was trained on datasets roughly three times larger than The Pile, emphasizing broad English coverage.

StableLM-Base-Alpha-7B is intended for foundational use in application-specific fine-tuning without strict commercial limitations, licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA-4.0), which requires attribution to Stability AI and sharing modifications under the same terms[1]. Developers should be aware of potential limitations, including the risk of generating offensive or biased content from its pre-training data, and exercise caution in production systems to avoid harm[1]. As an older model, it may not match the performance of successors, but it supports efficient inference on hardware with at least 15 GB VRAM, such as Google Colab's free tier, making it accessible for local projects like analyzing university-related texts or prototyping AI tools.

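This model does not require special access or approval on Hugging Face, so it can be downloaded directly. Its open availability encourages quick prototyping, with community-driven optimizations available for various platforms. Note that the cell below loads `stabilityai/stablelm-2-1_6b`, a smaller 1.6B-parameter checkpoint from the newer Stable LM 2 line, rather than the 7B alpha model described above.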

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "stabilityai/stablelm-2-1_6b"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
```

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



```python
prompt = "Describe data science opportunities at the University of Missouri."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.


Describe data science opportunities at the University of Missouri. The Data Science Institute is a new initiative at the University of Missouri that will provide a new home for data science research and education. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science Institute will be a center for data science research, education, and outreach. The Data Science I
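
The output above falls into a loop, repeating the same sentence until the token limit is reached. This is typical of greedy decoding with a small base model. One hedged way to reduce it is to sample and to penalize repeats. The values below are illustrative starting points rather than tuned settings, and `generate_with_sampling` is a hypothetical helper name.

```python
# Keyword arguments for model.generate(); the values are illustrative starting points.
SAMPLING_KWARGS = {
    "max_new_tokens": 200,
    "do_sample": True,          # sample from the distribution instead of greedy argmax
    "temperature": 0.7,         # values below 1.0 sharpen the distribution
    "top_p": 0.9,               # nucleus sampling: keep the top 90% probability mass
    "repetition_penalty": 1.2,  # down-weight tokens that already appeared
    "no_repeat_ngram_size": 3,  # never emit the same 3-token sequence twice
}


def generate_with_sampling(model, tokenizer, prompt: str) -> str:
    """Generate text with the sampling settings above and return the decoded string."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, **SAMPLING_KWARGS)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Because sampling is random, repeated runs will produce different text.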

## Microsoft Phi-3-Mini-4K-Instruct

The Microsoft Phi-3-Mini-4K-Instruct model, hosted at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct, is a lightweight, state-of-the-art open-source large language model with 3.8 billion parameters. Released as part of the Phi-3 family, it is optimized for high-quality reasoning and instruction-following tasks, trained on a mix of synthetic and filtered public data to enhance performance in areas like math, coding, and general knowledge. With a context length of 4K tokens, it supports efficient text generation and is particularly suited for memory-constrained environments, making it an excellent choice for educational and research applications in settings like data science programs at the University of Missouri in Columbia, MO. The model has undergone supervised fine-tuning and direct preference optimization to improve safety and alignment with human preferences, achieving strong results on benchmarks such as MMLU (70.9%) and GSM8K (85.7%).

Phi-3-Mini-4K-Instruct is designed for broad commercial and research use, primarily in English, with some multilingual capability from its diverse training data of 4.9 trillion tokens. It uses a dense decoder-only Transformer architecture and is licensed under the MIT license, which permits redistribution and modification as long as the license notice is retained. Developers should note potential limitations, including biases from training data and lower performance on non-English languages or highly specialized domains. Responsible AI practices are emphasized, such as evaluating for fairness and implementing safeguards against harmful content, in line with Microsoft's guidelines.

Unlike some gated models, Phi-3-Mini-4K-Instruct does not require special access or approval on Hugging Face, so it can be downloaded directly. This open accessibility facilitates quick experimentation, with optimized versions available for ONNX runtime across platforms like CPU, GPU, and mobile devices. For users in Columbia, MO, this model is well suited to local projects, such as generating summaries of university research or analyzing regional datasets, and its weights fit comfortably within the roughly 15 GB of VRAM on Google Colab's free tier.


To begin using the model, load it via the Hugging Face Transformers library. The following example demonstrates text generation with bfloat16 precision for efficiency:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

pretrained_model_name_or_path = "microsoft/Phi-3-mini-4k-instruct"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    device_map="cuda",
    dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
```


Although we can now use the model and tokenizer directly, it's much easier to wrap them in a `pipeline` object:

```python
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)
```

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Finally, we create our prompt as a user and give it to the model:

```python
# The prompt (user input/query)
messages = [{"role": "user",
             "content": "Write down a paragraph about University of Missouri."}]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
```

 The University of Missouri, also known as Mizzou, is a public research university located in Columbia, Missouri. Founded in 1839, it is the flagship institution of the University of Missouri System and is one of the oldest universities in the state. Mizzou offers a wide range of undergraduate and graduate programs across various disciplines, including business, engineering, education, health sciences, and the arts. The university is known for its strong research programs, particularly in the fields of agriculture, engineering, and health sciences. Mizzou is also home to the Mizzou Athletics program, which competes in NCAA Division I and is a member of the Southeastern Conference. The university's beautiful campus, rich history, and commitment to academic excellence make it a top choice for students seeking a quality education in the Midwest.
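
Each prompt cell in this part of the notebook repeats the same three steps: build a message list, call the pipeline, and print the reply. A small wrapper can bundle these steps. This is only a convenience sketch, and `ask` is a hypothetical helper name. It assumes the pipeline was created with `return_full_text=False`, so that `generated_text` is a plain string.

```python
def ask(generator, question: str) -> str:
    """Send one user message through a text-generation pipeline and return the reply."""
    messages = [{"role": "user", "content": question}]
    output = generator(messages)
    return output[0]["generated_text"].strip()


# Example call (requires the pipeline created earlier, so it is left commented out):
# print(ask(generator, "Tell me something interesting about Columbia in Missouri"))
```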


```python
# The prompt (user input/query)
messages = [{"role": "user",
             "content": "Tell me something interesting about Columbia in Missouri"}]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
```

 Columbia, Missouri, is not only the state capital but also a vibrant city with a rich history and a strong sense of community. Founded in 1821, Columbia was named after Christopher Columbus, reflecting the era's common practice of naming places after explorers. The city is home to the University of Missouri, which was established in 1839 and is a significant contributor to the local economy and culture. The university's influence is evident in the city's architecture, with many historic buildings and the beautiful Mizzou Botanical Garden. Columbia also hosts the annual Missouri State Fair, which is one of the largest and most popular state fairs in the United States, showcasing agriculture, arts, and crafts.


```python
# The prompt (user input/query)
messages = [{"role": "user",
             "content": "Are you familiar with the Institute for Data Science and Informatics at the University of Missouri"}]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
```

 Yes, the Institute for Data Science and Informatics (IDSI) at the University of Missouri is a research institute that focuses on the development of data science and informatics. It aims to advance the field by fostering interdisciplinary research and education, and by providing resources and support for data-driven innovation. The institute brings together faculty and students from various disciplines to collaborate on projects that leverage data science and informatics to address complex problems in areas such as healthcare, agriculture, and environmental science.


```python
# The prompt (user input/query)
messages = [{"role": "user",
             "content": "What programs are available at the Institute for Data Science and Informatics at the University of Missouri"}]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
```

 The Institute for Data Science and Informatics (IDSI) at the University of Missouri offers a variety of programs and resources for students and researchers interested in data science and informatics. As of my knowledge cutoff in 2023, the following programs and resources are available:


1. **Master of Science in Data Science (MSDS)**: This program is designed to provide students with a strong foundation in data science, including data analytics, machine learning, and data visualization.


2. **Master of Science in Informatics (MSI)**: The MSI program focuses on the study of information systems, computer science, and information technology.


3. **Graduate Certificate in Data Science**: A shorter program for those who want to gain specific skills in data science without committing to a full master's degree.


4. **Graduate Certificate in Informatics**: Similar to the Data Science certificate, this program is tailored for students interested in the field of informatics.


5. **Research