-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
srt_tools: add srt remove #68
Conversation
Added the srt-remove functionality. Unit tests included but commented.
Optimized the srt-remove functionality with unit tests included. Functions support and return generators. 'yield from' uses python 3.3.
Optimize srt_remove
Issues Future Development Perhaps, the split function could use it's own file and certain methods (contains, etc) could be placed in a util for use in basic-tools (add, copy, etc). |
Hey! A few general thoughts from a cursory glance, since this will need some overall massaging.
I have more questions but I suspect it's not super helpful to list them all, since this will need some reworking anyway. My general reaction is that there are a lot of premature micro-optimisations here, and I'm not entirely convinced that all the functionality is justified. Could you please go more into your motivations for each piece of functionality and the profiling you did to justify the tradeoff of some of these micro-optimisations? :-) |
Re-optimized the srt-remove functionality with split unit tests included. util functions (binary search, contains) removed.
This pull request introduces 5 alerts when merging 07a29c1 into 2d9f734 - view on LGTM.com new alerts:
|
Thanks for reviewing the pull request so fast! =D Here are some answers to the questions you had. Be sure to advise me on the next changes we need to make. ContextI use captioning with music. Automated tools aren't good enough to caption music (even without the instrumental). What I'll do is place the vocals and lyrics in an automated tool such as YouTube's and then adjust the times it provides. This means I must remove captions, copy choruses, etc. With Adobe Premiere's new update you can remove mass captions but you can't copy them. You can't remove by "insert". Soon, they will have an accurate speech to text tool (better than YouTube's). However, no program has the other features I need (remove by insert, copy, etc). Re-optimize srt_removeImplementation
It's actually an unintended result of the simple implementation provided below. I've kept it to help lazy people. For some captions, we want to remove FROM a specific timestamp without having to check the file. A lot of time is wasted opening the file and checking the last index over time. Also, the last index in the file could be wrong: You can't tell where every index ends easily using eyes/code because captions aren't sorted by end times. Implementation (Algorithm)
This results in 5 edge cases (4 sequential, 1 non-sequential). With the new changes, only 2 are directly covered to simplify things (but some may be indirectly covered). When the user doesn't specify a second timestamp, we remove from timestamp one to end (because there's no timestamp two). See RCI and RCT Optimization below for more information. Parser
Thanks. I changed "--t" back to "-t". Do you have any way of simplifying the argument (via nargs or an alternative) so the user doesn't enter "-x 1 2 -t 00:00:05,00 00:00:10,00"? What should we do in this case? How would a user remove from {time,index} N to M without an implementation like this? split() Optimization
I've committed a new method that performs a single walk through. Single Walkthrough Explained The new method walks through the generator once. In order to ensure captions are sorted, we created a list of the new captions and keep track of the index. Once we arrive at the split index, we are able to sort them and add the remaining captions. remove_caption_index() and remove_caption_timestamp() Optimization You could create a remove by index method that uses a single walkthrough (provided in the profile), but then it will lose the checking and negative index functionality. It guarantees no index will be in between specified numbers but you may be mislead in thinking index 26 exist(ed) and it never did. The whole idea behind RCT was getting the timestamps then removing by index. However I've created a new method which removes by timestamps. Looking at the profile statistics below, they are pretty much the same. Sometimes the bsearch method wins and sometimes the walk does. I went with the walk because it eliminates the need for util which means simpler and easier to read code. ProfilingAll code used to profile can be found here: https://pastebin.com/u/switchupcb/1/eL4dBf1u Keep in mind that a new generator needs to be created for every run which inflates the times. isinstance()
There is only two cases of isinstance() now. With that being said, here were the results: n = 100,000, list(), run = 10000
Range: 26.6974238
List: 2.7328538999999985
Generator: 0.009768199999999894
n = 100,000, isistance() else list(), run = 10000
Range: 2.740842200000003
List: 0.0012198999999952775
Generator: 0.010537000000006458 split() n = 5, s = 1, run = 1000000
list, bsearch, timsort: 1.0983209
walk, timsort: 0.3698865
walk: 0.28525639999999997
n = 100,000, s = 1, run = 1000000
list, bsearch, timsort: 1.9422774
walk, timsort: 0.3613230000000005
walk: 0.32073970000000074
n = 100,000, s = 50,000, run = 1000000
list, bsearch, timsort: 2.5298762999999997
walk, timsort: 0.34685589999999955
walk: 0.28242230000000035 remove_caption_index() n = 5, s = 1, run = 1000000
deque, sort: 0.5850082999999999
list, walk: 0.5416839000000001
walk (no index check or negative index): 0.5663668
n = 100,000, s = 1, run = 1000000
deque, sort: 0.6403641
list, walk: 0.5941318999999998
walk (no index check or negative index): 0.6067177999999998
n = 100,000, s = 50,000, run = 1000000
deque, sort: 0.6180734000000001
list, walk: 0.5740768999999997
walk (no index check or negative index): 0.6000827000000006 remove_caption_timestamp ## run 1
n = 5, s = 1, run = 1000000
list bsearch(x2) rci : 0.6162214
walk: 0.6071775999999999
n = 100,000, s = 1, run = 1000000
list bsearch(x2) rci : 0.6469964000000001
walk: 0.6807417999999998
n = 100,000, s = 50,000, run = 1000000
list bsearch(x2) rci : 0.6708639999999999
walk: 0.6476633
## run 2
n = 5, s = 1, run = 1000000
list bsearch(x2) rci : 0.6819531999999999
walk: 0.6077996
n = 100,000, s = 1, run = 1000000
list bsearch(x2) rci : 0.6091927000000001
walk: 0.6199763
n = 100,000, s = 50,000, run = 1000000
list bsearch(x2) rci : 0.6019388999999999
walk: 0.6211809000000001 OtherTests LGTM The other fix is simple. Make the 'timestamp_two' parameter default to |
I'm likely to be quite busy for a while, so it's unlikely this will be reviewed fully in a timely manner, but the code quality looks better, thanks. More general suggestions:
|
LGTM Patch with parser changes. remove_captions_index() removed.
Patch srt_remove
Parser
Using --start, --t1 and --end,--t2 like srt linear-timeshift.
srt-lines-matching doesn't do this and there's no example of this. Plus, it seems more complicated. You should commit that change as you have a better grasp of Unix Conventions. Unanswered Questions
Future |
Well, that's the point: having a generic utility to do this kind of filtering at the subtitle level rather than the line level would achieve what you want without having to encode all possible combinations of logic ahead of time. All of the plumbing is there in srt-lines-matching, it just now has to be done at the Subtitle object level.
I appreciate you contributing to srt, but speaking as an open source maintainer who does this in his spare time, it's not up to you to dictate my schedule. :-) The commit as it is still needs a lot of work, and as well as you creating it, that realistically requires a lot of time commitment from my side to provide all the feedback. I wouldn't count on this being merged in that time period unless there are changes to a less complex approach similar to
You need to use |
Fixes the srt-remove unit tests.
SimplicityIn the current commit, the user simply enters an index and/or timestamp they desire to remove from/to. This takes one action and does not require any checking of files. There is no "encoding of all possible combinations of logic". This implementation is no different from The design is pretty simple:
Complication
I must be misunderstanding this suggestion because it seems more complicated in any of 3 ways I'm interpreting it. 1. A timestamp helper 2. An equivalent functionality to lines-matching Therefore, a sort-of-equivalent 'subtitle' matcher should take a function, and return the matching subtitles (ideally removing the non-matching ones). So in your example, you'd end up with all the subtitles that start before "00:08,100" and before index 4. This is great because you wanted to remove all subtitles after "00:08,100"; except you now have subtitles which still overlap "00:08,100" when you wanted none to be there. You weren't supposed to find matching subtitles, you were supposed to remove the ones at a certain point(s). So all of this for a functionality that doesn't do what we actually want it to do. 3. A search/query ConclusionWhile your previous suggestions assisted in simplifying the code, this one does not. The current suggestion adds complexity; otherwise, warrants a better explanation. I'm not adding more complexity or functionalities just to end up at the same position we're currently in. I'm asking you to review because we're starting to assume complexity where there isn't. I do understand your concerns for readability vs. usability. I'm not convinced the code isn't readable in its current state. Please guide me into the decision you want me to make regarding this dilemma. |
I'm on holiday for a few weeks from the end of this one, so I won't have much time to comment, but just to come back:
I asked before but I don't feel like there was much of an answer: why should that matter? Who are the tangible people that need that?
Huh? |
Specific use case from comment 6.
Also, this is called srt-remove, not srt-select. There are many reasons you would want to remove a caption or to remove ALL captions within a time period (without keeping the others).
https://man7.org/linux/man-pages/man1/grep.1.html |
Hey mate, I know your intentions are good, but I'm afraid I've basically run out of the limited supply of patience and energy I have to review this. The questions and statements made here and in other issue tickets are just so fundamentally at odds with how srt is designed, and at this point we're just going around in circles. I mean, just look at this example in the readme that shows what "modularity" really means:
Ultimately, like the vast majority of open-source projects, this is a passion project that follows its author's own preferences, and we're going around in circles here. I wish you all the best in your future endeavours, though. :-) |
Simplifies srt_remove by maintainer standards.
Am I supposed to follow UNIX or your standards? I've already copied your code style for this method. I'm not recreating a method that doesn't do what this is supposed to do. There's obvious reasons why you'd remove a subtitle... What I don't understand though, is why this is only modular to UNIX (23% of people) and not python (100%). For example, numpy doesn't require you to run shell commands to manipulate numbers in your program. Perhaps, that's where we disagree: I have much to say about Linux... But where this disagreement stops, is the CODE.As I've said, your standards are followed pretty clearly in it. If you are misinterpreting a line, you should probably mark it in the review (so I can fix it). If the project isn't actually being maintained (due to a lack of "patience and energy"), you should say that in the project description or README. If you need me to DoorDash you "patience and energy", please let me know. With that being said... |
simplify srt_removeParsing simplified. No Index. No timestamp method. Does literally one thing. UNIX. Issue #72 updated. The file has 2 methods not including the main function. Follows all patterns of every other functionality. There's only 120 lines to actually review; it still follows your template plus many lines are comments. |
This isn't a discussion about why you'd want to remove a subtitle. Read the above again:
To which you replied look at comment 6, but I've read that three times now and there's still nothing to be found.
I have no interest in producing a confused library which doesn't know what its intentions or design goals. It's really that simple. srt is tested on other platforms to support library users, but that's as far as it's gonna go. I have no time or interest in spending the ongoing time to support anything more than that. There's no requirement to contribute to this or any codebase.
This isn't numpy, whether in terms of design, intent, or in terms of number of contributors. The comparison just doesn't make any sense.
You really won't get far contributing to open source with an attitude where you confuse "my pull request met with scrutiny" with "this project is unmaintained". Here's how you can tell the difference: on an unmaintained project, you don't get scrutiny: you get absolutely nothing. :-)
Mate, let me offer some final, sincere, advice: being cheeky to people who are being straight with you is a really poor strategy. See ya. |
Added the srt-remove functionality. Unit tests included but commented.
srt-remove
Caption tools lack a mass-removal function. This commit will allow us to remove multiple captions by index/timestamp.
Future Development
I'd want to make sure the function is merged by your standards prior to further development (such that I can get a better grasp of how to organize each function). I know that methods such as binary_search, contains_timestamp, captions_containing_timestamp may serve better in a util file. Perhaps, the split function could use it's own file.
After this, we could add an argument in combination with 'fixed-timeshift' to specify whether removals are in-place. Currently, all removals are in place. This would allow us to remove and adjust times at the same time.
Input
Originally, I wanted the input for t1 and t2 to be interchange-able by index - timestamp. For example, an input could be '10' and '00:00:50,00' which would remove from index 10 to :50. This is possible by using get_timestamp to convert 10 to a time. However, I'm still figuring out argparse.
Unit Tests
The functions are tested and confirmed to work. I could not get the module 'srt-remove' to import in the project so perhaps you can fix that?
Remove Method Explanation
Edge Cases
An explanation of each edge case for remove_timestamps(). Keep in mind that reverse removals (where t2 > t1) are allowed, but during the calculation (code that determines the indexes) t1 and t2 are solely sequential.
Sequential Edge Cases
Captions are only removed from t1 - t2. If t2 occurs before or at the FIRST CAPTION, t1 must also occur before any captions. Therefore, no captions will be removed. As captions are sorted by start time, this edge case can be checked instantly.
Captions are only removed from t1 - t2. If t1 occurs after or at the LAST CAPTION, t2 must also occur after any captions. Therefore, no captions will be removed. Captions are sorted by start times (and NOT end times). Therefore, we can't instantly evaluate this edge case without knowing the LAST CAPTION.
Since t1 < t2 and captions are only removed from t1 - t2, if t1 occurs after or at the end of the LAST CAPTION TO BE REMOVED, no captions will be removed. This case can be checked if the LAST CAPTION INDEX is found.
All caption are removed from t1 - t2, when t1 < all captions and all captions are < t2. Again, captions aren't sorted by end times. Therefore, we have no way of confirming that the LAST CAPTION TO BE REMOVED is actually the very LAST CAPTION by index. However, we can by knowing the nearest captions.
Non-Sequential Edge Cases
Captions are only removed from 0 - t2 and t1 - end. This means 2 conditions must be met for no captions to be removed:
1. No captions occur before t2.
2. No captions occur after t1.
All caption are removed from 0 to t2, and t1 to the end. Since t2 < t1, all captions will be removed.
What about the other inverses of the non-sequential cases?
In a non-sequential case, any captions that occur after t1 ARE removed. Even if t2 occurs directly before t1, there must be no captions after t1. As you can see, this means t1 and t2 are not dependent on each other. Even if t2 occurs at 0, t1 may still have captions to remove. Even if t1 occurs at the end, t2 may have captions to move. It's only if each have no captions in their direction, that none will be removed.
FIRST CAPTION TO BE REMOVED
Since captions are sorted by start time, we can ALWAYS determine 'if t2 <= subtitles[0].start'. This is important because we must determine the FIRST CAPTION INDEX for the first bound of our range.
We find the FIRST CAPTION TO BE REMOVED by finding the [nearest contained captions] after t1. Specifically, the right-most nearest caption. Therefore, the FIRST CAPTION TO BE REMOVED is always at/after t1. Likewise, t1 always occurs at/before the FIRST CAPTION TO BE REMOVED.
We can determine 'if subtitles[-1] <= t1' if there is NO nearest-caption to the right of t1. We can determine (non-sequentially) 'if t2 <= subtitles[0].start and subtitles[-1].end <= t1' by the same reasoning.
If t1 is after the last caption, the following edge case is TRUE: 'if subtitles[t2].end <= t1'
LAST CAPTION TO BE REMOVED
Since captions aren't sorted by end times, we can't simply check if the LAST CAPTION <= t2 by index (i.e subtitles[-1]). For the same reason, we also can't use the next caption (subtitles[t2 + 1]) as frame of reference. Instead, the [nearest captions] provide us a frame of reference.
We find the LAST CAPTION TO BE REMOVED by finding the [nearest captions] before t2. Specifically, the left-most nearest caption. Therefore, the LAST CAPTION TO BE REMOVED is always at/before t2. Likewise, t2 always occurs at/after the LAST CAPTION TO BE REMOVED.
We can determine 'if t1 <= subtitles[0].start and subtitles[-1].end <= t2' (remove all) if there is NO nearest-caption to the right of t2.
Nearest Caption (Not Included)
I wrote a nearest caption function but realized I could simply use the binary_search to find the nearest values. The nearest caption function simply takes the subtitles and a timestamp to find the nearest caption before/after (to the left/right of the) provided timestamp. These are returned in a 2D array such that you can determine the nearest caption on the left and right-hand side, and whether or not the timestamp even has captions before/after them (by empty array). I also finished testing the method. Where would I commit this?