<center><h1>Anatomy of a Large Language Model</h1></center>
<center><i>Part of the Knowledge Journal Project</i></center>

---

**Author:** Augusto Damasceno  
**Project Link:** [github.com/augustodamasceno/knowledge-journal](https://github.com/augustodamasceno/knowledge-journal)  
**Last Updated:** August 19, 2025

---
<div style="text-align: right;">
<small>Copyright © 2025, Augusto Damasceno</small><br>
<small>SPDX-License-Identifier: <code>BSD-2-Clause</code></small>
</div>

# Summary

### **1. Definition**
### **2. Chats and Engines**
### **3. Data and Preprocessing**

# Session 1 - Definition
> All text within this session has been sourced from IBM [1].

Large language models (LLMs) are a category of foundation models trained on immense amounts of data, making them capable of understanding 
and generating natural language and other types of content to perform a wide range of tasks. 


## How large language models work

LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, such as the generative pre-trained transformer (GPT), which excels at handling sequential data like text input. LLMs consist of multiple layers of neural networks, each with parameters that are adjusted during training. These layers are complemented by the attention mechanism, which lets the model focus on the most relevant parts of the input sequence.
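
As a rough illustration of what the attention mechanism computes, the sketch below implements scaled dot-product attention for a single query in plain Python. The vectors and the helper names (`softmax`, `attend`) are made up for this example and do not come from the IBM text; real transformers use learned projections, many attention heads, and batched matrix operations.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy 2-dimensional vectors for three tokens (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

In this toy setup the query is equally similar to the first and third keys, so those two tokens receive equal weight and the second token receives less.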

During the training process, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The model does this by assigning a probability score to each candidate token. Tokens are produced by tokenization, which breaks words down into smaller sequences of characters. These tokens are then transformed into embeddings, which are numeric representations of this context.
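
A toy version of these steps can make the idea concrete. The sketch below assumes whitespace tokenization and simple bigram counts in place of a neural network; the corpus and the names `tokenize` and `next_token_probs` are illustrative only. Real LLMs use subword tokenizers and learned embeddings rather than counts.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (real models use subword units)."""
    return text.lower().split()

corpus = "the cat sat on the mat the cat ate"
tokens = tokenize(corpus)

# Map each distinct token to an integer id, the usual step before an embedding lookup.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Count which token follows which, then normalize counts into probabilities.
follow_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    follow_counts[current][following] += 1

def next_token_probs(token):
    """Estimate P(next token | current token) from the bigram counts."""
    counts = follow_counts[token]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(token_ids)
print(next_token_probs("the"))
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the estimated probabilities are 2/3 and 1/3.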

To ensure accuracy, this process involves training the LLM on massive corpora of text (billions of pages), allowing it to learn grammar, semantics and conceptual relationships through zero-shot and self-supervised learning. Once trained on this training data, LLMs can generate text by autonomously predicting the next word based on the input they receive, drawing on the patterns and knowledge they have acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.
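
The generation step can be sketched as a loop that repeatedly picks the most likely next token and appends it to the output. The probability table below is hard-coded and hypothetical; in an LLM it would come from the trained network, and sampling is often used instead of the greedy choice shown here.

```python
# Hypothetical next-token probabilities, standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "alice": {"was": 0.7, "said": 0.3},
    "was": {"beginning": 0.6, "tired": 0.4},
    "beginning": {"to": 1.0},
    "to": {"get": 0.8, "<end>": 0.2},
    "get": {"<end>": 1.0},
}

def generate(prompt_token, max_tokens=10):
    """Greedy decoding: always append the highest-probability next token."""
    output = [prompt_token]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS.get(output[-1])
        if not candidates:
            break  # no known continuation for this token
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break  # the stand-in model signals the end of the sequence
        output.append(best)
    return output

print(" ".join(generate("alice")))
```

Each new token depends only on the previous one in this sketch, whereas an LLM conditions on the whole preceding context.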

Model performance can also be increased through prompt engineering, prompt-tuning, fine-tuning and other tactics like reinforcement learning from human feedback (RLHF) to remove the biases, hateful speech and factually incorrect answers known as “hallucinations” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation. 


# Session 2 - Chats and Engines

## Chats [2-24]
| Chat Service 💬 | Developer | Key Focus / Differentiator | Primary Model(s) | Notable Feature 🚀 |
| :--- | :--- | :--- | :--- | :--- |
| **Amazon Q** ☁️ | Amazon AWS | Business-focused AI for DevOps and AWS cloud services | Proprietary (Titan) | Directly takes actions in your AWS environment |
| **Character.AI** 🎭 | Character.AI | Character-driven, immersive roleplay conversations | Proprietary Models | User-generated characters and group chats |
| **ChatGPT** 🤖 | OpenAI | General-purpose leader with strong multimodal capabilities | GPT-4o | Native desktop app & advanced voice mode |
| **Claude** 🕊️ | Anthropic | Safety, nuanced instruction following, & industry-leading context | Claude 3.5 Sonnet | Massive 200k token context window |
| **CLOVA X** 🇰🇷 | Naver (Korea) | Korean-focused assistant, integrated with LINE | HyperCLOVA X | Dominant presence in the Korean and Japanese markets |
| **Cohere Chat** 🌍 | Cohere | Multilingual enterprise AI with on-premises deployment options | Command R+, Command A | 256k-token window with 23-language support |
| **Copilot (GitHub)** 💻 | Microsoft | AI pair programmer for developers in the IDE | GPT-4-Turbo, GPT-4o | Real-time code completion and commands in VS Code |
| **DeepSeek Chat** 🐋 | DeepSeek | Powerful open-weight models with massive context support | DeepSeek-V3 | 128k token context for free, strong coding focus |
| **Ernie Bot** 🐉 | Baidu (China) | Chinese market leader with search & cloud integration | ERNIE 4.0 Turbo | Deep integration with Baidu's ecosystem |
| **Gemini** 🌐 | Google | Deep integration with Google ecosystem & services | Gemini 1.5 Pro/Flash | "Google it" button for fact-checking responses |
| **Grok** 🤖 | xAI | Real-time knowledge via integration with X (Twitter) platform | Grok-2 | "Grok mode" for unfiltered, sarcastic responses |
| **HuggingChat** 🤗 | Hugging Face | Open-source AI hub for customizable, community-driven chat | Mixtral, Llama-based models | Access to thousands of community-hosted models |
| **Kimi Chat** 📄 | Moonshot AI (China) | Ultra-long context processing for document analysis | Moonshot AI Models | Supports up to 2 million characters of context |
| **Le Chat (Mistral)** 🐓 | Mistral AI | High-performance models with strong multilingual support | Mistral Large 2, Codestral | Strong coding and European language capabilities |
| **Meta AI** 🦙 | Meta | Social-first AI integrated into WhatsApp, Instagram, Facebook | Llama 3 | On-the-go access via Ray-Ban Meta smart glasses |
| **Microsoft Copilot** 🪄 | Microsoft | Deeply embedded in Windows, Edge, and Office 365 | GPT-4-Turbo, GPT-4o | Accessible via a dedicated keyboard key on Windows PCs |
| **Perplexity** 🔎 | Perplexity AI | Answer engine with real-time web search and citations | Mixture of Experts (Proprietary + OpenAI) | Copilot mode for guided, interactive research |
| **Pi** 🤝 | Inflection AI | Empathetic and supportive personal AI companion | Inflection-2 | Proactive, kind, and conversationally fluid |
| **Qwen Chat** 🦚 | Alibaba Cloud | High-performance multilingual AI for enterprise and research | Qwen3 | Excels in over 100 languages, strong reasoning benchmarks |
| **Replika** ❤️ | Luka, Inc. | AI companion with emotional intelligence and memory | Proprietary Models | AR avatar for immersive interaction |
| **SenseNova** 🌌 | SenseTime (China)| Enterprise and industry-specific AI solutions | SenseNova 5.0 | Strong computer vision and image gen capabilities |
| **Spark (Xinghuo)** ✨ | iFlytek (China) | Conversational AI with education & business focus | SparkDesk / iFlytek Models | Specialized hardware and software for education |
| **YouChat** 🔗 | You.com | Search-integrated chatbot with citations and apps | Proprietary + OpenAI | "YouApps" for adding tools like image gen to the chat |
> This table was generated with contributions from multiple large language models, including Gemini, ChatGPT, DeepSeek, and Grok.

## Engines [25-42]
| Engine ⚙️             | Developer / Org          | Key Focus / Differentiator               | Main Model(s)                            |
| ---------------------- | ------------------------ | ------------------------------------------ | ---------------------------------------- |
| **GPT-4 / GPT-4o** 🤖  | OpenAI                   | General-purpose, industry-leading, multimodal | GPT-4, GPT-4o                            |
| **Gemini** 🌐          | Google (DeepMind)        | Multimodal, deep Google ecosystem integration | Gemini 1.5 Pro, Ultra                    |
| **Claude** 🕊️           | Anthropic                | Safety, constitutional AI, long context      | Claude 3 & 3.5 family (Opus, Sonnet)     |
| **LLaMA 3.1** 🦙       | Meta                     | Open-weight, research + industry adoption    | LLaMA 3.1 (8B, 70B, 405B)                |
| **Command** 📏         | Cohere                   | Enterprise, RAG, multilingual              | Command R+, Command R                    |
| **Mistral** 🦸         | Mistral AI               | High-performance, open-source & MoE          | Mistral Large 2, Mixtral 8x22B           |
| **Grok-2** 🐦‍⬛          | xAI                      | Social/chat integration, reasoning         | Grok-2                                   |
| **DeepSeek-V3** 🕵️      | DeepSeek AI              | MoE efficiency, cost-effective scaling     | DeepSeek-V3 (671B total, 37B active)     |
| **Qwen-3** 🐉          | Alibaba Cloud            | Strong MoE scaling, Chinese/English        | Qwen 3 family                            |
| **Kimi K2** 🛰️          | Moonshot AI              | Agentic MoE, top coding/math performance     | Kimi K2 (1T total, 32B active)           |
| **Jurassic-2** 🦖      | AI21 Labs                | Enterprise NLP, writing assistance         | Jurassic-2 family                        |
| **Amazon Titan** 🏭    | Amazon Web Services      | AWS ecosystem, enterprise applications     | Titan Text, Titan Multimodal             |
| **ERNIE Bot** 🧧       | Baidu                    | Chinese ecosystem, multimodal              | ERNIE 4.0                                |
| **Phi-3** 🧠           | Microsoft Research       | Small Language Models (SLMs), on-device AI | Phi-3 family (Mini, Small, Medium)       |
| **Falcon** 🦅          | TII (UAE)                | Multilingual, open-weight Apache 2.0 license | Falcon 40B, Falcon 180B                  |
| **Yi-1.5** 🏯          | 01.AI                    | Open-weight, strong reasoning & multilingual | Yi-1.5 (6B, 9B, 34B, 200B)               |
| **GLM-4** 📚          | Zhipu AI & Tsinghua University | Academic & enterprise focus, bilingual     | GLM-4 family                             |
| **EleutherAI** ⏳      | EleutherAI (community)   | Open-source research, data transparency    | GPT-J, GPT-NeoX, Pythia                  |
> This table was generated with contributions from multiple large language models, including Gemini, ChatGPT, DeepSeek, and Grok.

# Session 3 - Data and Preprocessing  

> The data used in this notebook is the book Alice's Adventures in Wonderland by Lewis Carroll [44],  
> available for download from Project Gutenberg [43].

## Libraries

In [27]:
import re
import unicodedata

import pandas as pd
import contractions

## Cleaning Functions

In [22]:
def remove_square_brackets(text: str) -> str:
    """Remove bracketed notes such as "[Illustration]" (non-greedy, within one line)."""
    return re.sub(r'\[.*?\]', '', text)

def clean_gutenberg_text(raw_text: str) -> str:
    """Strip the Project Gutenberg header and footer, then unwrap hard-wrapped lines."""
    start_markers = [
        r"\*\*\* START OF (THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*",
        r"\*\*\*START OF THE PROJECT GUTENBERG EBOOK.*",
    ]
    end_markers = [
        r"\*\*\* END OF (THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*",
        r"End of (the|this) Project Gutenberg EBook",
        r"End of Project Gutenberg's",
    ]

    start_index = -1
    for marker in start_markers:
        match = re.search(marker, raw_text, re.IGNORECASE)
        if match:
            start_index = match.end()
            break

    if start_index == -1:
        start_index = 0

    end_index = -1
    for marker in end_markers:
        match = re.search(marker, raw_text, re.IGNORECASE)
        if match:
            end_index = match.start()
            break
            
    if end_index == -1:
        end_index = len(raw_text)

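    core_content = raw_text[start_index:end_index].strip()
    # Collapse runs of blank lines into a single paragraph break.
    cleaned_text = re.sub(r'\n\s*\n', '\n\n', core_content)
    # Join hard-wrapped lines inside a paragraph with a space.
    cleaned_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', cleaned_text)
    cleaned_text = cleaned_text.strip()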

    return cleaned_text


def clean_text(text: str) -> str:
    """Remove Gutenberg boilerplate first, then bracketed notes such as "[Illustration]"."""
    text_clean_gutenberg = clean_gutenberg_text(text)
    output = remove_square_brackets(text_clean_gutenberg)
    return output
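
The effect of these cleaning steps can be checked on a tiny made-up sample. The snippet below repeats the same regular expressions on a hypothetical string instead of importing the notebook functions, so it runs on its own; the sample text is invented for illustration.

```python
import re

sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "[Illustration]\n"
    "\n\n\n"
    "First line of a paragraph\n"
    "wrapped onto a second line.\n"
    "\n"
    "Second paragraph.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text\n"
)

start = re.search(r"\*\*\* START OF (THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*", sample)
end = re.search(r"\*\*\* END OF (THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*", sample)
core = sample[start.end():end.start()].strip()

# Collapse runs of blank lines into a single paragraph break.
core = re.sub(r"\n\s*\n", "\n\n", core)
# Join hard-wrapped lines inside a paragraph with a space.
core = re.sub(r"(?<!\n)\n(?!\n)", " ", core)
# Drop bracketed notes such as "[Illustration]".
cleaned = re.sub(r"\[.*?\]", "", core)

print(repr(cleaned))
```

Because the bracketed note is removed after the text has been stripped, the result starts with two newline characters. An extra `.strip()` at the end would remove them if that matters downstream.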

    

## Normalization Functions

In [32]:
def normalize_text(text: str) -> str:
    """Lowercase, apply Unicode NFKC normalization, and expand English contractions."""
    text_lower = text.lower()
    text_normalized = unicodedata.normalize('NFKC', text_lower)
    text_expanded_contractions = contractions.fix(text_normalized)
    return text_expanded_contractions
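
To see what the lowercase and NFKC steps change, the sketch below applies them to a few hand-picked strings using only the standard library. The helper `lower_and_nfkc` and the sample strings are assumptions made for this example; the contraction-expansion step is left out because it depends on the third-party `contractions` package.

```python
import unicodedata

def lower_and_nfkc(text):
    """Lowercase, then apply NFKC, mirroring the first two steps of normalize_text."""
    return unicodedata.normalize("NFKC", text.lower())

ligature = lower_and_nfkc("\ufb01nal")             # U+FB01 "fi" ligature followed by "nal"
fullwidth = lower_and_nfkc("\uff21\uff22\uff23")   # fullwidth letters "ABC"
curly = lower_and_nfkc("Alice\u2019s")             # U+2019 curly apostrophe

print(ligature, fullwidth, repr(curly))

# NFKC leaves U+2019 untouched; mapping it to "'" is an optional extra step
# (an assumption here, not something the notebook does).
straight = curly.replace("\u2019", "'")
print(straight)
```

NFKC folds compatibility characters such as ligatures and fullwidth letters, but typographic quotes have no compatibility mapping, so the curly apostrophe passes through unchanged.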

## Load Data

In [47]:
book_filename = "dataset/book1.txt"
with open(book_filename, 'r', encoding='utf-8') as file:
    raw_data = file.read()

data_clean = clean_text(raw_data)
data = normalize_text(data_clean)

## Comparison of Raw and Preprocessed Data

In [48]:
data[:3000]

'\n\nalice’s adventures in wonderland\n\nby lewis carroll\n\nthe millennium fulcrum edition 3.0\n\ncontents\n\n chapter i.     down the rabbit-hole  chapter ii.    the pool of tears  chapter iii.   a caucus-race and a long tale  chapter iv.    the rabbit sends in a little bill  chapter v.     advice from a caterpillar  chapter vi.    pig and pepper  chapter vii.   a mad tea-party  chapter viii.  the queen’s croquet-ground  chapter ix.    the mock turtle’s story  chapter x.     the lobster quadrille  chapter xi.    who stole the tarts?  chapter xii.   alice’s evidence\n\nchapter i. down the rabbit-hole\n\nalice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought alice “without pictures or conversations?”\n\nso she was considering in her own mind (as well as she could, for 

In [49]:
raw_data[:3000]

"\ufeffThe Project Gutenberg eBook of Alice's Adventures in Wonderland\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: Alice's Adventures in Wonderland\n\nAuthor: Lewis Carroll\n\nRelease date: June 27, 2008 [eBook #11]\n                Most recently updated: June 26, 2025\n\nLanguage: English\n\nCredits: Arthur DiBianca and David Widger\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***\n\n[Illustration]\n\n\n\n\nAlice’s Adventures in Wonderland\n\nby Lewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3.0\n\nContents\n\n CHAPTER I.     Down the

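Beyond reading the two previews side by side, a few summary numbers can make the comparison easier. The helper below is a minimal sketch; `summarize` and the two short strings are hypothetical stand-ins for `raw_data` and `data`, chosen so the snippet runs without the book file.

```python
def summarize(text):
    """Return basic size statistics for a text string."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "characters": len(text),
        "words": len(text.split()),
        "paragraphs": len(paragraphs),
    }

# Tiny stand-ins for raw_data and data (hypothetical strings, not the book).
raw_sample = "Title: Example\n\n*** START ***\n\nAlice was\nbeginning to get\nvery tired.\n"
clean_sample = "alice was beginning to get very tired."

for name, text in [("raw", raw_sample), ("preprocessed", clean_sample)]:
    print(name, summarize(text))
```

In the notebook, the same function could be called on `raw_data` and `data` to quantify how much header and footer text the cleaning step removed.
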
# References

[1] “What are large language models (LLMs)?,” IBM Think, [Online]. Available: https://www.ibm.com/think/topics/large-language-models. [Accessed: Aug. 19, 2025].

[2] Amazon Web Services, Amazon Q. [Online]. Available: https://aws.amazon.com/q/. [Accessed: Aug. 19, 2025].

[3] Character.AI, Character.AI. [Online]. Available: [suspicious link removed]. [Accessed: Aug. 19, 2025].

[4] OpenAI, ChatGPT. [Online]. Available: https://chat.openai.com/. [Accessed: Aug. 19, 2025].

[5] Anthropic, Claude. [Online]. Available: https://claude.ai/. [Accessed: Aug. 19, 2025].

[6] Naver, CLOVA X. [Online]. Available: https://clova-x.naver.com/. [Accessed: Aug. 19, 2025].

[7] Cohere, Cohere Chat. [Online]. Available: https://cohere.com/chat. [Accessed: Aug. 19, 2025].

[8] Microsoft, GitHub Copilot. [Online]. Available: https://github.com/features/copilot. [Accessed: Aug. 19, 2025].

[9] DeepSeek, DeepSeek Chat. [Online]. Available: https://chat.deepseek.com/. [Accessed: Aug. 19, 2025].

[10] Baidu, ERNIE Bot (文心一言). [Online]. Available: https://yiyan.baidu.com/. [Accessed: Aug. 19, 2025].

[11] Google, Gemini. [Online]. Available: https://gemini.google.com/. [Accessed: Aug. 19, 2025].

[12] xAI, Grok. [Online]. Available: https://grok.x.ai/. [Accessed: Aug. 19, 2025].

[13] Hugging Face, HuggingChat. [Online]. Available: https://huggingface.co/chat/. [Accessed: Aug. 19, 2025].

[14] Moonshot AI, Kimi Chat. [Online]. Available: https://kimi.moonshot.cn/. [Accessed: Aug. 19, 2025].

[15] Mistral AI, Le Chat. [Online]. Available: https://chat.mistral.ai/. [Accessed: Aug. 19, 2025].

[16] Meta, Meta AI. [Online]. Available: https://meta.ai/. [Accessed: Aug. 19, 2025].

[17] Microsoft, Microsoft Copilot. [Online]. Available: https://copilot.microsoft.com/. [Accessed: Aug. 19, 2025].

[18] Perplexity AI, Perplexity. [Online]. Available: https://www.perplexity.ai/. [Accessed: Aug. 19, 2025].

[19] Inflection AI, Pi. [Online]. Available: https://pi.ai/. [Accessed: Aug. 19, 2025].

[20] Alibaba Cloud, Qwen Chat (通义千问). [Online]. Available: https://qianwen.aliyun.com/. [Accessed: Aug. 19, 2025].

[21] Luka, Inc., Replika. [Online]. Available: https://replika.ai/. [Accessed: Aug. 19, 2025].

[22] SenseTime, SenseNova (商汤日日新). [Online]. Available: https://www.sensetime.com/en/business-group-SenseNova. [Accessed: Aug. 19, 2025].

[23] iFlytek, SparkDesk (讯飞星火). [Online]. Available: https://xinghuo.xfyun.cn/. [Accessed: Aug. 19, 2025].

[24] You.com, YouChat. [Online]. Available: https://you.com/chat. [Accessed: Aug. 19, 2025].

[25] OpenAI, “Hello GPT-4o,” OpenAI Blog, May 13, 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/.

[26] Gemini Team, “Gemini 1.5: Unlocking multimodal understanding across hundreds of thousands of tokens,” Google for Developers, Feb. 15, 2024. [Online]. Available: https://developers.googleblog.com/2024/02/gemini-15-available-for-private-preview-in-google-ai-studio.html.

[27] Anthropic, “Introducing Claude 3.5 Sonnet,” Anthropic Research, Jun. 20, 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-5-sonnet.

[28] Meta AI, “Introducing Meta Llama 3.1,” Meta AI, Jul. 23, 2024. [Online]. Available: https://ai.meta.com/blog/meta-llama-3-1/.

[29] Cohere, “Introducing Command R+: A Scalable LLM for Enterprises,” Cohere Blog, Apr. 4, 2024. [Online]. Available: https://txt.cohere.com/command-r-plus-scalable-llm-for-enterprises/.

[30] Mistral AI Team, “Au Large: Our new flagship model, a new generative endpoint, and more,” Mistral AI, Jul. 11, 2024. [Online]. Available: https://mistral.ai/news/au-large/.

[31] xAI, “Announcing Grok-2,” xAI Blog, Aug. 2024.

[32] DeepSeek-AI, “DeepSeek-V3 Technical Report,” arXiv preprint arXiv:2412.19437, 2024.

[33] J. Qwen Team et al., “Qwen2: A Family of Strong and General Open-Source Language Models,” arXiv preprint arXiv:2406.04832, 2024.

[34] Moonshot AI, "Kimi K2," Moonshot AI Research, Aug. 2024.

[35] O. Sharir, Y. Levine, and A. Shashua, “The Jurassic-2 series of language models,” AI21 Labs, Mar. 2023. [Online]. Available: https://www.ai21.com/blog/jurassic-2-language-models.

[36] S. Vasisht, “Harness the power of Amazon Titan multimodal embeddings model in Amazon Bedrock,” AWS Machine Learning Blog, Apr. 2, 2024. [Online]. Available: https://aws.amazon.com/blogs/machine-learning/harness-the-power-of-amazon-titan-multimodal-embeddings-model-in-amazon-bedrock/.

[37] Baidu, “ERNIE 4.0: The latest foundation model from Baidu,” Baidu Research, Oct. 17, 2023.

[38] M. A. Abdelfattah et al., “Phi-3: Redefining What’s Possible with Open Small Language Models,” Microsoft Research, 2024. [Online]. Available: https://www.microsoft.com/en-us/research/publication/phi-3-redefining-whats-possible-with-open-small-language-models/.

[39] Technology Innovation Institute, “Falcon 180B: an open-source powerhouse,” Falcon LLM, Sep. 6, 2023. [Online]. Available: https://falconllm.tii.ae/falcon-180b.html.

[40] 01.AI, “Yi-1.5: A New Generation of Open-Source Models, Excelling in Code and Math,” 01.AI Blog, Jun. 5, 2024. [Online]. Available: https://www.01.ai/blog/yi-1.5.

[41] Z. Zeng et al., “GLM-4: A new generation of multimodal language models,” Zhipu AI, 2024.

[42] S. Black et al., “GPT-NeoX-20B: An Open-Source Autoregressive Language Model,” in Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models, 2022.

[43] Project Gutenberg, "Alice's Adventures in Wonderland, by Lewis Carroll," Project Gutenberg, 2008. [Online]. Available: https://www.gutenberg.org/ebooks/11. [Accessed: Aug. 19, 2025].

[44] L. Carroll, Alice's Adventures in Wonderland. London, U.K.: Macmillan, 1865.