Memory Error & No Auto Complete when binary flict is large ~ 6mb #403
Comments
Thanks for reporting this issue! I am currently analyzing a few memory crashes (related both to suggestions and to other features). Somewhere in the code some references aren't released, which very quickly builds up garbage and crashes the app on some devices. Will comment here once I have found out more about what could be the cause.
Hmm, I am surprised that the load method doesn't crash for this size. I think I will start there (because the entire binary file is currently read into memory and then analyzed, instead of reading only small chunks and processing them bit by bit), then move on to the actual suggestion algorithm.
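The chunked-reading idea could be sketched roughly like this (a minimal sketch using plain java.io streams; `processInChunks` is a hypothetical helper, not existing FlorisBoard code):

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream

// Hypothetical sketch: process a dictionary stream in fixed-size chunks
// instead of reading the whole file into one large ByteArray first.
fun processInChunks(
    input: InputStream,
    chunkSize: Int = 4096,
    onChunk: (ByteArray, Int) -> Unit
): Long {
    val buffer = ByteArray(chunkSize)
    var total = 0L
    while (true) {
        val n = input.read(buffer)
        if (n <= 0) break          // -1 signals end of stream
        onChunk(buffer, n)         // only the first `n` bytes of `buffer` are valid
        total += n
    }
    return total
}
```

The peak allocation stays at one chunk buffer regardless of the file size, instead of growing with the dictionary.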
@patrickgold There's actually a bigger and more serious issue, I believe: the flict compression algorithm actually makes the file significantly bigger than it has to be. I used the
@patrickgold It may help to run the default https://github.com/remi0s/aosp-dictionary-tools on a wordlist and also run the same wordlist through the florisboard dict-tools to verify my results.
@sabzo One thing the Python library currently does not do is utilize the end byte grouping the Flict spec defines, because it always glitched out in the calculation and at the time I just commented out the code: https://github.com/florisboard/dictionary-tools/blob/cc712ffc70b8485ea8fbd25d06381a4fc2b9b906/flict.py#L129-L140 For your file size of >6MB, fixing this can definitely save a lot of bytes in the file size.
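The end byte grouping idea is essentially run-length encoding of tree closes: instead of one end byte per closed level, a single END command byte carries the number of levels to pop. A rough sketch (the constants and masks here are illustrative, not the actual Flict spec values):

```kotlin
// Assumed, illustrative encoding: high bit marks an END command,
// the low bits carry how many tree levels to pop at once.
const val CMDB_END = 0x80
const val ATTR_END_COUNT = 0x3F  // max levels a single END byte can encode

// Collapse `count` consecutive level-closes into as few END bytes as possible.
fun encodeEnds(count: Int): List<Int> {
    require(count > 0) { "must close at least one level" }
    val out = mutableListOf<Int>()
    var remaining = count
    while (remaining > 0) {
        val n = minOf(remaining, ATTR_END_COUNT)
        out.add(CMDB_END or n)
        remaining -= n
    }
    return out
}
```

Closing, say, 3 nested nodes then costs one byte instead of three, which adds up quickly across hundreds of thousands of nodes.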
Since Luminiso doesn't support higher n-grams, is fixing the calculation error in
No, the end bug fix is something I prefer to still include in phase 1, because it reduces the file size and also reduces the number of bytes read into memory at runtime when the binary file is cached.
@sabzo I've pushed a fix for the end count bug on both the dictionary-tools a) the file size of the binary Flictionary decreases and Thanks in advance!
@patrickgold File size decreased by 2 MB to 4.9 MB, however the b)
Memory management + large file sizes seem to be contributing to the above -- makes me wonder if this is why C++ is used in the AOSP keyboards out there for word completion/prediction...
Java is known not to be the best language in terms of memory management, but it is definitely possible to write Kotlin/JVM code which runs smoothly; it just requires a lot of attention and care. I don't want to use C++ alongside Kotlin for this, because then the code would be as readable as the AOSP prediction/dictionary code... I will rewrite the load function either later this evening or tomorrow midday to not preload the binary file all at once but to read the InputStream bit by bit, then I will investigate the behavior with the Android Studio Memory Profiler.
The above PR fixes a lot in the prediction algorithm (both the Flictionary load function and memory management). Could you have a look and check whether it fixes the memory crash for you?
@patrickgold The same memory error and keyboard freezing and restarting still persists, unfortunately. The new flict algorithm produced a 4.9 MB binary, while the same combined wordlist built with the aosp tools still produced a 1.9 MB binary. The compression worked, but it doesn't do better than the aosp compression algorithm.
@patrickgold Revisiting this -- are there plans to reduce the size of flicts to be <= their equivalent AOSP binary size?
@sabzo I will definitely try to decrease the file size of a Flictionary, but the 4.9MB of your Flictionary is not the problem; the problem lies in the representation of the data after it is parsed at runtime, which in its current state is far from memory-efficient. Hence the OOM errors and crashes. In the next few days I will begin phase 2 of the suggestion feature and then I will see where I can improve.
@patrickgold Understood. If you can point me to the part of the code where that representation of the data happens, I can help troubleshoot as well. I have a good amount of time on my hands.
This is the definition of a node: florisboard/app/src/main/java/dev/patrickgold/florisboard/ime/nlp/FlorisLanguageModel.kt, lines 32 to 38 at e4f5fcf
The main issue here is the MutableList (which is an ArrayList in the JVM runtime), which incurs a lot of overhead for each node, even if the list is empty. In the current representation this occurs 2 times per node, and in my precompiled version the Ngram node occurs about 670k times. This is even worse for your 4.9MB Flictionary, thus crashing the app with OOM errors. The problem is that I can't really find a more memory-efficient mutable array than ArrayList (except if I had primitives, which I don't have here). Another thing I see is that I can use a
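One mitigation for the two-empty-ArrayLists-per-node overhead would be to allocate the children lists lazily, so leaf nodes (the overwhelming majority of the ~670k nodes) carry only two null references. This is a sketch under that assumption, not the actual FlorisLanguageModel code:

```kotlin
// Sketch: children lists are created only on first write, so pure leaf
// nodes never allocate an ArrayList at all.
class CompactNgramNode(val order: Int, val char: Char, val freq: Int) {
    private var _same: MutableList<CompactNgramNode>? = null
    private var _higher: MutableList<CompactNgramNode>? = null

    val sameOrderChildren: MutableList<CompactNgramNode>
        get() = _same ?: ArrayList<CompactNgramNode>(2).also { _same = it }

    val higherOrderChildren: MutableList<CompactNgramNode>
        get() = _higher ?: ArrayList<CompactNgramNode>(2).also { _higher = it }

    // A node is a leaf as long as neither list was ever requested for writing.
    fun isLeaf() = _same == null && _higher == null
}
```

The small initial capacity (2) also avoids ArrayList's default backing array of 10 slots for the many nodes with only one or two children.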
I didn't want to believe this when you originally said it, but after investigating and learning more about the JVM, it is increasingly clear to me why the AOSP keyboard uses native C++ for NLP. The JVM just represents data in a very inefficient way, which works well for normal applications and small data sets but quickly becomes a big problem for large data sets like a parsed Flictionary. I will look into rewriting the dictionary base implementation (everything which requires a lot of memory) in native C++ or Rust (depending on whether the F-Droid build servers support compiling Rust; otherwise I will resort to C++), and this should hopefully be the solution to the OOM errors.
@patrickgold F-Droid servers can probably compile Rust, since this app https://github.com/jensstein/oandbackup
@tsiflimagas Good to know, thanks for linking!
@patrickgold Heads up that the latest release https://github.com/florisboard/florisboard/releases/tag/v0.3.10-beta04 breaks the suggestions on the default en.flict. While more suggestions show, tapping on any of them does nothing. Perhaps a memory error too. The app doesn't crash, but the selection of suggestions doesn't work either. On an emulator it works (as the computer has more resources than a regular phone, I assume), but on real devices I've had multiple issues.
The fact that the suggestions show but you can't select them makes me believe there's a bug in the CandidateView touch logic rather than in the suggestions core. If you go to Settings > Typing > Suggestions display mode and set it to classic or dynamic width, does it work then?
@patrickgold You're right about that. Only classic and dynamic width work; dynamic width & scrollable doesn't.
I have an idea what happens: for all modes, a suggestion is marked as soon as your pointer touches down on it. When your pointer moves while down, the scrollable display mode automatically cancels the selection in order to scroll the candidate view. I suspect that an ACTION_MOVE is triggered almost immediately because your touch pointer (finger) can't hold pixel-perfectly still. On an emulator, holding still without moving is far easier because you use a mouse. Anyway, I guess I will have to introduce a minimum distance threshold before the scrolling starts to prevent this kind of bug.
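The minimum distance threshold could look roughly like this (a simplified sketch, detached from the real CandidateView; the class and method names are hypothetical):

```kotlin
import kotlin.math.abs

// Sketch: only cancel the pressed candidate and switch to scrolling once the
// pointer has moved further than a threshold, so natural finger jitter on
// ACTION_MOVE no longer kills the selection.
class CandidateTouchHelper(private val touchSlop: Int) {
    private var downX = 0f
    private var scrolling = false

    fun onDown(x: Float) {
        downX = x
        scrolling = false
    }

    // Returns true once the gesture should be treated as a scroll;
    // until then, small movements keep the candidate selected.
    fun onMove(x: Float): Boolean {
        if (!scrolling && abs(x - downX) > touchSlop) scrolling = true
        return scrolling
    }
}
```

On Android the threshold would typically come from `ViewConfiguration.get(context).scaledTouchSlop` rather than a hard-coded value.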
@patrickgold Understood. I've been experimenting with loading multiple flicts at the same time so users can have multiple-language support without having to manually switch between languages. I know
Into one runtime flict? Untested, but this hacked version could work; it just loops over all passed assetRefs. For n-grams which appear multiple times, the last loaded Flictionary takes precedence:

fun load(context: Context, assetRefs: List<AssetRef>): Result<Flictionary> {
val buffer = ByteArray(5000) { 0 }
var headerStr: String? = null
var date: Long = 0
var version = 0
val ngramTree = NgramTree()
for (assetRef in assetRefs) {
val inputStream: InputStream
if (assetRef.source == AssetSource.Assets) {
inputStream = context.assets.open(assetRef.path)
} else {
return Result.failure(Exception("Only AssetSource.Assets is currently supported!"))
}
var pos = 0
val ngramOrderStack = mutableListOf<Int>()
val ngramTreeStack = mutableListOf<NgramNode>()
while (true) {
if (inputStream.readNext(buffer, 0, 1) <= 0) {
break
}
val cmd = buffer[0].toInt() and 0xFF
when {
(cmd and MASK_BEGIN_PTREE_NODE) == CMDB_BEGIN_PTREE_NODE -> {
if (pos == 0) {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_CMD_BEGIN_PTREE_NODE,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
val order = ((cmd and ATTR_PTREE_NODE_ORDER) shr 4) + 1
val type = ((cmd and ATTR_PTREE_NODE_TYPE) shr 2)
val size = (cmd and ATTR_PTREE_NODE_SIZE) + 1
val freq: Int
val freqSize: Int
when (type) {
ATTR_PTREE_NODE_TYPE_CHAR -> {
freq = NgramNode.FREQ_CHARACTER
freqSize = 0
}
ATTR_PTREE_NODE_TYPE_WORD_FILLER -> {
freq = NgramNode.FREQ_WORD_FILLER
freqSize = 0
}
ATTR_PTREE_NODE_TYPE_WORD -> {
if (inputStream.readNext(buffer, 1, 1) > 0) {
freq = buffer[1].toInt() and 0xFF
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_EOF,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
freqSize = 1
}
                        else -> {
                            inputStream.close()
                            return Result.failure(Exception("TODO: shortcut not supported"))
                        }
}
if (inputStream.readNext(buffer, freqSize + 1, size) > 0) {
val char = String(buffer, freqSize + 1, size, Charsets.UTF_8)[0]
val node = NgramNode(order, char, freq)
val lastOrder = ngramOrderStack.lastOrNull()
if (lastOrder == null) {
ngramTree.higherOrderChildren.add(node)
} else {
if (lastOrder == order) {
ngramTreeStack.last().sameOrderChildren.add(node)
} else {
ngramTreeStack.last().higherOrderChildren.add(node)
}
}
ngramOrderStack.add(order)
ngramTreeStack.add(node)
pos += (freqSize + 1 + size)
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_EOF,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
}
(cmd and MASK_BEGIN_HEADER) == CMDB_BEGIN_HEADER -> {
if (pos != 0) {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_CMD_BEGIN_HEADER,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
version = cmd and ATTR_HEADER_VERSION
if (version != VERSION_0) {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNSUPPORTED_FLICTIONARY_VERSION,
address = pos,
cmdByte = cmd.toByte(),
absoluteDepth = ngramTreeStack.size
)
)
}
if (inputStream.readNext(buffer, 1, 9) > 0) {
val size = (buffer[1].toInt() and 0xFF)
date =
((buffer[2].toInt() and 0xFF).toLong() shl 56) +
((buffer[3].toInt() and 0xFF).toLong() shl 48) +
((buffer[4].toInt() and 0xFF).toLong() shl 40) +
((buffer[5].toInt() and 0xFF).toLong() shl 32) +
((buffer[6].toInt() and 0xFF).toLong() shl 24) +
((buffer[7].toInt() and 0xFF).toLong() shl 16) +
((buffer[8].toInt() and 0xFF).toLong() shl 8) +
((buffer[9].toInt() and 0xFF).toLong() shl 0)
if (inputStream.readNext(buffer, 10, size) > 0) {
headerStr = String(buffer, 10, size, Charsets.UTF_8)
ngramOrderStack.add(-1)
ngramTreeStack.add(NgramTree())
pos += (10 + size)
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_EOF,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_EOF,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
}
(cmd and MASK_END) == CMDB_END -> {
if (pos == 0) {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_CMD_END,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
val n = (cmd and ATTR_END_COUNT)
if (n > 0) {
if (n <= ngramTreeStack.size) {
for (c in 0 until n) {
ngramOrderStack.removeLast()
ngramTreeStack.removeLast()
}
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_ABSOLUTE_DEPTH_DECREASE_BELOW_ZERO,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size - n
)
)
}
} else {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_CMD_END_ZERO_VALUE,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
pos += 1
}
else -> {
inputStream.close()
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.INVALID_CMD_BYTE_PROVIDED,
address = pos, cmdByte = cmd.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
}
}
inputStream.close()
if (ngramTreeStack.size != 0) {
return Result.failure(
ParseException(
errorType = ParseException.ErrorType.UNEXPECTED_ABSOLUTE_DEPTH_NOT_ZERO_AT_EOF,
address = pos, cmdByte = 0x00.toByte(), absoluteDepth = ngramTreeStack.size
)
)
}
}
return Result.success(
Flictionary(
name = "flict",
label = "flict",
authors = listOf(),
headerStr = headerStr ?: "",
date = date,
version = version,
languageModel = FlorisLanguageModel(ngramTree)
)
)
}
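The "last loaded takes precedence" behavior can be illustrated independently of the Flict format with a toy merge over plain maps (hypothetical helper, not the real Flictionary types):

```kotlin
// Toy illustration: frequencies from dictionaries loaded later overwrite
// entries for the same n-gram key, mirroring the intended merge semantics.
fun mergeFrequencies(dicts: List<Map<String, Int>>): Map<String, Int> {
    val merged = mutableMapOf<String, Int>()
    for (dict in dicts) merged.putAll(dict)  // later maps win on duplicate keys
    return merged
}
```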
Hmmm, I can see that working -- but the more I try to debug, even switching between languages doesn't work well. There's an incredible delay. I tried to clear the
Referring to this feature: I generated a binary flict using the dict-tools for en, where the size was 6.8 MB -- about 3 times the default en binary flict. The size is larger due to more words, and the highest n-gram order used was 2. However, when the file is reduced to a small size with significantly fewer tokens, word completion works fine.

Short description
Memory runs out when flict is too large
Steps to reproduce
Environment information
38baac1af92fea80a86f5fdd8850bc677a27a3d2
@patrickgold