Releases: eastriverlee/LLM.swift
v1.5.1
v1.5.0
Highlights
- added IQ1_M quantization
- renamed `preProcess` and `postProcess` to `preprocess` and `postprocess`, respectively (#17).
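as a rough migration sketch of the rename (assuming you assign these hooks directly, the way the initializers below do with `preProcess`):

```swift
// before (v1.4.x):
bot.preProcess = template.preProcess
// after (v1.5.0), renamed in #17:
bot.preprocess = template.preprocess
```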
Full Changelog: 1.4.3...v1.5.0
1.4.3
Highlights
- removed the redundant BOS token in the mistral template, as it is added by `llama.cpp` anyway.
- added more quantization options that `llama.cpp` supports (it's just a `String`-typed `enum`, so you can extend it yourself anyway, but still).
- `func decode(_ token: Token) -> String` is now `private`, and you now have `func decode(_ token: [Token]) -> String` instead (see the short usage sketch after this list). the prior one was handling multibyte characters under the hood, so it was not supposed to be `public` from the beginning.
- changed `params.n_ctx = UInt32(maxTokenCount) + (maxTokenCount % 2 == 1 ? 1 : 2)` to `params.n_ctx = UInt32(self.maxTokenCount)`. the prior code was like that because of some error i was experiencing, but i just changed it to the code as it was supposed to be from the beginning.
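a minimal usage sketch of the new array-taking `decode`, where `bot` is an `LLM` instance and `tokens` is a `[Token]` you already have (the variable names are illustrative):

```swift
// decode a whole token array at once, so multibyte characters (emoji, CJK, …)
// that span several tokens are assembled into a valid String correctly
let text: String = bot.decode(tokens)
```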
Full Changelog: v1.4.2...1.4.3
v1.4.2
Highlights
- fixed the initializer with `template`:
public convenience init(
from url: URL,
template: Template,
history: [Chat] = [],
seed: UInt32 = .random(in: .min ... .max),
topK: Int32 = 40,
topP: Float = 0.95,
temp: Float = 0.8,
historyLimit: Int = 8,
maxTokenCount: Int32 = 2048
) {
self.init(
from: url.path,
stopSequence: template.stopSequence,
history: history,
seed: seed,
topK: topK,
topP: topP,
temp: temp,
historyLimit: historyLimit,
maxTokenCount: maxTokenCount
)
self.preProcess = template.preProcess
self.template = template
}
last line was missing. damn.
Full Changelog: v1.4.1...v1.4.2
v1.4.1
Highlights
- renamed some things to improve readability, like the extension below, and changed `endIndex` to `stopSequenceEndIndex`.
extension Model {
public var endToken: Token { llama_token_eos(self) }
public var newLineToken: Token { llama_token_nl(self) }
...
}
- added a download progress observing function that you can pass to the initializer. check the updated README.md.
fileprivate func downloadData(to destination: URL, _ updateProgress: @escaping (Double) -> Void) async throws {
var observation: NSKeyValueObservation!
let url: URL = try await withCheckedThrowingContinuation { continuation in
let task = URLSession.shared.downloadTask(with: self) { url, response, error in
if let error { return continuation.resume(throwing: error) }
guard let url else { return continuation.resume(throwing: HuggingFaceError.urlIsNilForSomeReason) }
let statusCode = (response as! HTTPURLResponse).statusCode
guard statusCode / 100 == 2 else { return continuation.resume(throwing: HuggingFaceError.network(statusCode: statusCode)) }
continuation.resume(returning: url)
}
observation = task.progress.observe(\.fractionCompleted) { progress, _ in
updateProgress(progress.fractionCompleted)
}
task.resume()
}
let _ = observation
try FileManager.default.moveItem(at: url, to: destination)
}
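a hypothetical usage sketch of the progress callback; the exact initializer label and placement of the closure aren't shown here, so check README.md for the real signature:

```swift
// hypothetical: pass a progress closure when constructing an LLM from a
// HuggingFaceModel, so the UI can show download percentage
let bot = try await LLM(from: model) { progress in
    print("downloaded: \(Int(progress * 100))%")
}
```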
Full Changelog: v1.4.0...v1.4.1
v1.4.0
Highlights
- you can now override a new recovery function, `func recoverFromLengthy(_ input: borrowing String, to output: borrowing AsyncStream<String>.Continuation)`, that is called when the input is too long to be handled.
open func recoverFromLengthy(_ input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) {
output.yield("tl;dr")
}
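for example, a minimal sketch of overriding it in a subclass to yield a custom fallback message (the class name and message are illustrative):

```swift
class Bot: LLM {
    // called instead of normal inference when, even after trimming history,
    // the input still doesn't fit within maxTokenCount
    override func recoverFromLengthy(_ input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) {
        output.yield("your message is too long for me to handle, sorry.")
    }
}
```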
- fixed a potential bug of inferencing when it shouldn't. usually it won't cause any damage, because we are most likely going to set `maxTokenCount` lower than the actual limit of the model, but still. it used to be an `if` statement; now it is a `while` statement:
private func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
...
if maxTokenCount <= currentCount {
while !history.isEmpty && maxTokenCount <= currentCount {
history.removeFirst(min(2, history.count))
tokens = encode(preProcess(self.input, history))
initialCount = tokens.count
currentCount = Int32(initialCount)
}
if maxTokenCount <= currentCount {
isFull = true
recoverFromLengthy(input, to: output)
return false
}
}
...
return true
}
- i changed the order of the `HuggingFaceModel` initializer parameters and their labels in 94bcc54:
// so now instead of:
HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: .Q2_K)
// you should do:
HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", .Q2_K, template: .chatML(systemPrompt))
this just makes more sense, so i had to change it.
Full Changelog: v1.3.0...v1.4.0
v1.3.0
Highlights
- fixed a potential crash in `func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool`:
private func prepare(from input: borrowing String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
guard !input.isEmpty else { return false }
...
return true
}
- fixed the initializer (#9) with template for real this time in 7a3ab0c. it was hard to test it without an initializer, which brings me to my next point:
- added an initializer for `LLM` that takes a `HuggingFaceModel` struct instead of a local `URL`, like this:
lazy var model = HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: .Q2_K)
...
func testInferenceFromHuggingFaceModel() async throws {
let bot = try await LLM(from: model)
let input = "have you heard of this so-called LLM.swift library?"
await bot.respond(to: input)
#assert(!bot.output.isEmpty)
}
public enum Quantization: String {
case IQ2_XXS
case IQ2_XS
case Q2_K_S
case Q2_K
case Q3_K_S
case Q3_K_M
case Q3_K_L
case Q4_K_S
case Q4_K_M
case Q5_K_S
case Q5_K_M
case Q6_K
case Q8_0
}
public struct HuggingFaceModel {
public let name: String
public let template: Template
public let filterRegexPattern: String
public init(_ name: String, template: Template, filterRegexPattern: String) {
self.name = name
self.template = template
self.filterRegexPattern = filterRegexPattern
}
public init(_ name: String, template: Template, with quantization: Quantization = .Q4_K_M) {
self.name = name
self.template = template
self.filterRegexPattern = "(?i)\(quantization.rawValue)"
}
package func getDownloadURLStrings() async throws -> [String] {
let url = URL(string: "https://huggingface.co/\(name)/tree/main")!
let data = try await url.getData()
let content = String(data: data, encoding: .utf8)!
let downloadURLPattern = #"(?<=href=").*\.gguf\?download=true"#
let matches = try! downloadURLPattern.matches(in: content)
let root = "https://huggingface.co"
return matches.map { match in root + match }
}
package func getDownloadURL() async throws -> URL? {
let urlStrings = try await getDownloadURLStrings()
for urlString in urlStrings {
let found = try filterRegexPattern.hasMatch(in: urlString)
if found { return URL(string: urlString)! }
}
return nil
}
public func download(to directory: URL = .documentsDirectory, as name: String? = nil) async throws -> URL {
var destination: URL
if let name {
destination = directory.appending(path: name)
guard !destination.exists else { return destination }
}
guard let downloadURL = try await getDownloadURL() else { throw HuggingFaceError.noFilteredURL }
destination = directory.appending(path: downloadURL.lastPathComponent)
guard !destination.exists else { return destination }
let data = try await downloadURL.getData()
try data.write(to: destination)
return destination
}
public static func tinyLLaMA(_ systemPrompt: String, with quantization: Quantization = .Q4_K_M) -> HuggingFaceModel {
HuggingFaceModel("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", template: .chatML(systemPrompt), with: quantization)
}
}
open class LLM: ObservableObject {
...
public convenience init(
from huggingFaceModel: HuggingFaceModel,
to url: URL = .documentsDirectory,
as name: String? = nil,
history: [Chat] = [],
seed: UInt32 = .random(in: .min ... .max),
topK: Int32 = 40,
topP: Float = 0.95,
temp: Float = 0.8,
historyLimit: Int = 8,
maxTokenCount: Int32 = 2048
) async throws {
let url = try await huggingFaceModel.download(to: url, as: name)
self.init(
from: url,
template: huggingFaceModel.template,
history: history,
seed: seed,
topK: topK,
topP: topP,
temp: temp,
historyLimit: historyLimit,
maxTokenCount: maxTokenCount
)
}
...
}
Full Changelog: v1.2.4...v1.3.0
v1.2.4
Highlights
- added the stop feature, so that you can stop in the middle of inference:
private var shouldContinuePredicting = false
public func stop() {
shouldContinuePredicting = false
}
@InferenceActor
private func predictNextToken() async -> Token {
guard shouldContinuePredicting else { return llama_token_eos(model) }
...
return token
}
private func prepare(from input: consuming String, to output: borrowing AsyncStream<String>.Continuation) -> Bool {
...
shouldContinuePredicting = true
return true
}
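a minimal usage sketch, e.g. wiring `stop()` to a cancel button (the surrounding task structure is illustrative):

```swift
// start a response without blocking the caller
let task = Task { await bot.respond(to: "tell me a very long story") }
// later, e.g. when the user taps a stop button:
bot.stop()        // flips shouldContinuePredicting, so predictNextToken bails out early
await task.value  // the stream finishes with whatever was generated so far
```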
v1.2.3
Highlights
- fixed a potential history removal error:
// before:
history.removeFirst(2)
//after:
history.removeFirst(min(2, history.count))
this used to be a problem for users manually adding an odd number of chats to history, or if one experiences the race condition issue (#10).
- fixed the potential race condition issue (#10) by adding a global actor attribute to the concurrent functions that change `LLM`'s properties:
@globalActor public actor InferenceActor {
static public let shared = InferenceActor()
}
...
@InferenceActor
private func predictNextToken() async -> Token
@InferenceActor
private func finishResponse(from response: inout [String], to output: borrowing AsyncStream<String>.Continuation) async
@InferenceActor
public func getCompletion(from input: borrowing String) async -> String
@InferenceActor
public func respond(to input: String, with makeOutputFrom: @escaping (AsyncStream<String>) async -> String) async
v1.2.2
Highlights
- fixed the initializer that takes a `template`:
public convenience init(
from url: URL,
template: Template,
history: [Chat] = [],
seed: UInt32 = .random(in: .min ... .max),
topK: Int32 = 40,
topP: Float = 0.95,
temp: Float = 0.8,
historyLimit: Int = 8,
maxTokenCount: Int32 = 2048
) {
self.init(
from: url.path,
history: history,
seed: seed,
topK: topK,
topP: topP,
temp: temp,
historyLimit: historyLimit,
maxTokenCount: maxTokenCount
)
self.template = template
}
it was only setting the `stopSequence` property, leaving out the `preProcess` property.