27 Apr 02:38
  • fixed a problem caused by a llama.cpp API change.

What's Changed

  • Pass false for special tokens to be compatible with llama.cpp commit 40f74e4 by @shawiz in #21

27 Mar 02:25
  • added IQ1_M quantization
  • renamed preProcess and postProcess to preprocess and postprocess, respectively (#17).

07 Mar 17:19
  • removed the redundant BOS token from the mistral template, since llama.cpp adds it anyway.
  • added more of the quantization options that llama.cpp supports (it's just a String-typed enum, so you could extend it yourself anyway, but still).
  • func decode(_ token: Token) -> String is now private, and you now have func decode(_ token: [Token]) -> String instead. the former handles multibyte characters under the hood, so it was never supposed to be public in the first place.
  • changed params.n_ctx = UInt32(maxTokenCount) + (maxTokenCount % 2 == 1 ? 1 : 2) to params.n_ctx = UInt32(self.maxTokenCount). the old code was a workaround for an error i was hitting at the time; it is now written the way it was supposed to be from the beginning.
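
a rough sketch of why decoding one token at a time is unsafe to expose: a token boundary can fall in the middle of a UTF-8 character. the byte arrays below are made up for illustration and stand in for whatever llama.cpp returns per token; nothing here is the library's actual decode.

```swift
// hypothetical stand-in: each "token" decodes to a few raw UTF-8 bytes
let pieces: [[UInt8]] = [
    [0x63, 0x61, 0x66],   // "caf"
    [0xC3],               // first byte of "é"
    [0xA9],               // second byte of "é"
]

// decoding token by token: each incomplete byte sequence is repaired to U+FFFD
let perToken = pieces.map { String(decoding: $0, as: UTF8.self) }.joined()

// decoding all tokens together: bytes are concatenated first, then decoded once
let together = String(decoding: pieces.flatMap { $0 }, as: UTF8.self)

print(perToken == "café")   // false
print(together == "café")   // true
```

decoding a whole [Token] array at once sidesteps the problem, which is presumably why only that overload stays public.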

01 Feb 09:38
  • fixed the initializer with template:
public convenience init(
    from url: URL,
    template: Template,
    history: [Chat] = [],
    seed: UInt32 = .random(in: .min ... .max),
    topK: Int32 = 40,
    topP: Float = 0.95,
    temp: Float = 0.8,
    historyLimit: Int = 8,
    maxTokenCount: Int32 = 2048
) {
    self.init(
        from: url.path,
        stopSequence: template.stopSequence,
        history: history,
        seed: seed,
        topK: topK,
        topP: topP,
        temp: temp,
        historyLimit: historyLimit,
        maxTokenCount: maxTokenCount
    )
    self.preProcess = template.preProcess
    self.template = template
}

the last line was missing. damn.

30 Jan 20:14
  1. renamed some things to improve readability, e.g. the properties below, plus endIndex is now stopSequenceEndIndex.
extension Model {
    public var endToken: Token { llama_token_eos(self) }
    public var newLineToken: Token { llama_token_nl(self) }
}
  2. added a download progress callback that you can pass to the initializer.
fileprivate func downloadData(to destination: URL, _ updateProgress: @escaping (Double) -> Void) async throws {
    // keeps the KVO observation alive until the download finishes
    var observation: NSKeyValueObservation!
    let url: URL = try await withCheckedThrowingContinuation { continuation in
        let task = URLSession.shared.downloadTask(with: self) { url, response, error in
            if let error { return continuation.resume(throwing: error) }
            guard let url else { return continuation.resume(throwing: HuggingFaceError.urlIsNilForSomeReason) }
            let statusCode = (response as! HTTPURLResponse).statusCode
            // the error case name here is reconstructed; the source text lost the wrapper
            guard statusCode / 100 == 2 else { return continuation.resume(throwing: HuggingFaceError.network(statusCode: statusCode)) }
            continuation.resume(returning: url)
        }
        observation = task.progress.observe(\.fractionCompleted) { progress, _ in
            updateProgress(progress.fractionCompleted)
        }
        task.resume()
    }
    let _ = observation
    try FileManager.default.moveItem(at: url, to: destination)
}

30 Jan 10:08
  1. you can now override a new recovery function, which is called when the input is too long to be handled even after trimming history:
open func recoverFromLengthy(_ input: borrowing String, to output: borrowing AsyncStream<String>.Continuation)
  2. fixed a potential bug where inference could run when it shouldn't. usually it wouldn't cause any damage, because maxTokenCount is most likely set lower than the model's actual limit, but still. the inner check used to be an if statement; now it is a while statement.
private func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
    if maxTokenCount <= currentCount {
        // keep dropping the oldest chats until the prompt fits or history is empty
        while !history.isEmpty && maxTokenCount <= currentCount {
            history.removeFirst(min(2, history.count))
            tokens = encode(preProcess(self.input, history))
            initialCount = tokens.count
            currentCount = Int32(initialCount)
        }
        if maxTokenCount <= currentCount {
            isFull = true
            recoverFromLengthy(input, to: output)
            return false
        }
    }
    return true
}
  3. i changed the order of the HuggingFaceModel initializer's parameters and their labels in 94bcc54.
// so now instead of:
HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: .Q2_K)

// you should do:
HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", .Q2_K, template: .chatML(systemPrompt))

this just makes more sense, so i had to change it.
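
to make the if-versus-while change in prepare concrete, here is a toy simulation. it is only a sketch: history is an array of plain strings, tokenCount is a made-up helper that counts one token per character, and none of it touches the real encoder.

```swift
// hypothetical stand-ins: one "token" per character, history as plain strings
var history = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
let input = "hello"
let maxTokenCount = 12

func tokenCount(_ history: [String], _ input: String) -> Int {
    history.reduce(0) { $0 + $1.count } + input.count
}

var currentCount = tokenCount(history, input)   // 25
var isFull = false

// a single `if` would drop one pair of chats and stop at 17, still over the limit;
// `while` keeps dropping the oldest pair until the prompt fits or history is empty
while !history.isEmpty && maxTokenCount <= currentCount {
    history.removeFirst(min(2, history.count))
    currentCount = tokenCount(history, input)
}
if maxTokenCount <= currentCount { isFull = true }

print(history, currentCount, isFull)   // ["eeee"] 9 false
```

if the input alone were longer than maxTokenCount, the loop would empty history and isFull would end up true, which is the case where recoverFromLengthy gets called.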

29 Jan 21:34
  1. fixed a potential crash in func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool:
private func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
    guard !input.isEmpty else { return false }
    return true
}
  2. fixed the initializer with template (#9) for real this time in 7a3ab0c. it was hard to test it without an initializer, which brings me to my next point:
  3. added an initializer for LLM that takes a HuggingFaceModel struct instead of a local URL, like this:
lazy var model = HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: .Q2_K)


func testInferenceFromHuggingFaceModel() async throws {
    let bot = try await LLM(from: model)
    let input = "have you heard of this so-called LLM.swift library?"
    await bot.respond(to: input)
}

public enum Quantization: String {
    case IQ2_XXS
    case IQ2_XS
    case Q2_K_S
    case Q2_K
    case Q3_K_S
    case Q3_K_M
    case Q3_K_L
    case Q4_K_S
    case Q4_K_M
    case Q5_K_S
    case Q5_K_M
    case Q6_K
    case Q8_0
}

public struct HuggingFaceModel {
    public let name: String
    public let template: Template
    public let filterRegexPattern: String
    public init(_ name: String, template: Template, filterRegexPattern: String) {
        self.name = name
        self.template = template
        self.filterRegexPattern = filterRegexPattern
    }
    public init(_ name: String, template: Template, with quantization: Quantization = .Q4_K_M) {
        self.name = name
        self.template = template
        // case-insensitive match on the quantization name, e.g. "(?i)Q4_K_M"
        self.filterRegexPattern = "(?i)\(quantization.rawValue)"
    }
    package func getDownloadURLStrings() async throws -> [String] {
        let url = URL(string: "\(name)/tree/main")!
        let data = try await url.getData()
        let content = String(data: data, encoding: .utf8)!
        let downloadURLPattern = #"(?<=href=").*\.gguf\?download=true"#
        let matches = try! downloadURLPattern.matches(in: content)
        let root = ""
        return matches.map { match in root + match }
    }

    package func getDownloadURL() async throws -> URL? {
        let urlStrings = try await getDownloadURLStrings()
        for urlString in urlStrings {
            let found = try filterRegexPattern.hasMatch(in: urlString)
            if found { return URL(string: urlString)! }
        }
        return nil
    }
    public func download(to directory: URL = .documentsDirectory, as name: String? = nil) async throws -> URL {
        var destination: URL
        if let name {
            destination = directory.appending(path: name)
            guard !destination.exists else { return destination }
        }
        guard let downloadURL = try await getDownloadURL() else { throw HuggingFaceError.noFilteredURL }
        destination = directory.appending(path: downloadURL.lastPathComponent)
        guard !destination.exists else { return destination }
        let data = try await downloadURL.getData()
        try data.write(to: destination)
        return destination
    }
    public static func tinyLLaMA(_ systemPrompt: String, with quantization: Quantization = .Q4_K_M) -> HuggingFaceModel {
        HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: quantization)
    }
}
open class LLM: ObservableObject {


    public convenience init(
        from huggingFaceModel: HuggingFaceModel,
        to url: URL = .documentsDirectory,
        as name: String? = nil,
        history: [Chat] = [],
        seed: UInt32 = .random(in: .min ... .max),
        topK: Int32 = 40,
        topP: Float = 0.95,
        temp: Float = 0.8,
        historyLimit: Int = 8,
        maxTokenCount: Int32 = 2048
    ) async throws {
        let url = try await huggingFaceModel.download(to: url, as: name)
        self.init(
            from: url,
            template: huggingFaceModel.template,
            history: history,
            seed: seed,
            topK: topK,
            topP: topP,
            temp: temp,
            historyLimit: historyLimit,
            maxTokenCount: maxTokenCount
        )
    }
}
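
a small sketch of how a pattern shaped like filterRegexPattern picks a file. the file names are invented for illustration, and firstMatch is a hypothetical helper built on Foundation's range(of:options:), not the library's own hasMatch(in:).

```swift
import Foundation

// hypothetical file names, standing in for the .gguf links scraped from a repo page
let urlStrings = [
    "tinyllama-1.1b-chat-v1.0.Q2_K.gguf?download=true",
    "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf?download=true",
    "tinyllama-1.1b-chat-v1.0.q8_0.gguf?download=true",
]

// same shape as filterRegexPattern: "(?i)" makes the match case-insensitive
func firstMatch(for quantization: String, in candidates: [String]) -> String? {
    let pattern = "(?i)\(quantization)"
    return candidates.first { $0.range(of: pattern, options: .regularExpression) != nil }
}

print(firstMatch(for: "Q8_0", in: urlStrings) ?? "none")   // tinyllama-1.1b-chat-v1.0.q8_0.gguf?download=true
print(firstMatch(for: "Q6_K", in: urlStrings) ?? "none")   // none
```

one thing to keep in mind with a plain substring pattern like this: "(?i)Q2_K" would also match a file named with Q2_K_S, so whichever link comes first wins. passing your own filterRegexPattern is the way around that.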



27 Jan 22:30
  • added a stop feature, so that you can stop in the middle of inference:
private var shouldContinuePredicting = false
public func stop() {
    shouldContinuePredicting = false
}

private func predictNextToken() async -> Token {
    // once stopped, report the end-of-sequence token so generation ends
    guard shouldContinuePredicting else { return llama_token_eos(model) }
    return token
}

private func prepare(from input: consuming String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
    shouldContinuePredicting = true
    return true
}
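
the same idea in a runnable toy, as a sketch only: Generator is a hypothetical class that hands out integers from a queue instead of sampling from a model, and -1 plays the role of the end token.

```swift
// hypothetical generator, not the library's LLM class
final class Generator {
    private var shouldContinuePredicting = false
    private let endToken = -1
    private var queue: [Int]

    init(tokens: [Int]) { queue = tokens }

    func stop() { shouldContinuePredicting = false }

    private func predictNextToken() -> Int {
        // once stopped, pretend the model emitted its end token
        guard shouldContinuePredicting, !queue.isEmpty else { return endToken }
        return queue.removeFirst()
    }

    // onToken lets the caller react to each token, including calling stop()
    func generate(onToken: (Int) -> Void) -> [Int] {
        shouldContinuePredicting = true
        var output: [Int] = []
        while true {
            let token = predictNextToken()
            if token == endToken { break }
            output.append(token)
            onToken(token)
        }
        return output
    }
}

let generator = Generator(tokens: [1, 2, 3, 4, 5])
let output = generator.generate { token in
    if token == 3 { generator.stop() }
}
print(output)   // [1, 2, 3]
```

returning the end token from the prediction step means the generation loop needs no extra exit path: stopping looks the same as the model finishing on its own.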


27 Jan 15:41
  1. fixed a potential history removal error. the removal is now clamped to the number of chats that actually exist:
history.removeFirst(min(2, history.count))

this used to be a problem for users who manually add an odd number of chats to history, or for anyone who runs into the race condition issue (#10).

  2. fixed a potential race condition (#10) by adding a global actor attribute to the concurrent functions that change LLM's properties:
@globalActor public actor InferenceActor {
    static public let shared = InferenceActor()
}


private func predictNextToken() async -> Token
private func finishResponse(from response: inout [String], to output: borrowing AsyncStream<String>.Continuation) async

public func getCompletion(from input: borrowing String) async -> String

public func respond(to input: String, with makeOutputFrom: @escaping (AsyncStream<String>) async -> String) async
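
for the clamped history removal in item 1 above, a tiny illustration of what the clamp buys you. history here is an array of plain strings, standing in for the library's [Chat].

```swift
// hypothetical history holding a single chat (an odd count)
var history = ["only one chat"]

// history.removeFirst(2) would trap here: the array holds fewer than 2 elements
history.removeFirst(min(2, history.count))
print(history.isEmpty)   // true

// on an empty array the clamped call removes nothing
history.removeFirst(min(2, history.count))
print(history.count)     // 0
```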


27 Jan 10:30
  • fixed initializer that takes template:
    public convenience init(
        from url: URL,
        template: Template,
        history: [Chat] = [],
        seed: UInt32 = .random(in: .min ... .max),
        topK: Int32 = 40,
        topP: Float = 0.95,
        temp: Float = 0.8,
        historyLimit: Int = 8,
        maxTokenCount: Int32 = 2048
    ) {
        self.init(
            from: url.path,
            history: history,
            seed: seed,
            topK: topK,
            topP: topP,
            temp: temp,
            historyLimit: historyLimit,
            maxTokenCount: maxTokenCount
        )
        self.template = template
    }

it used to set only the stopSequence property, leaving preProcess unset.