Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of series feature #61

Merged
merged 2 commits into from
May 24, 2023
Merged

Conversation

ChristianKuehnel
Copy link
Contributor

@ChristianKuehnel ChristianKuehnel commented Apr 5, 2023

Start of the implementation of a "series" feature as discussed in #60 .

This feature provides:

  • new flag "--series" to toggle the behavior
  • extracting season/episode information from the title of the show for two cases: "(Sxx/Eyy)" and "(xx)". Other patterns need to be added once we see them being used.
  • adding two new variables to the --target flag to name files based on {{season}} and {{episode}}

What is not implemented:

  • Support filterining shows by season/episode, not sure if we want to have that.
  • Showing the season/episode information in the list/dump output.
  • Assigning a different default --target name in --series mode

Please let me know what you think and if/in what direction to keep moving.

⚠️ This PR also contains a simple test.sh script to test the new behavior. We need to strip this file before merging.

@ChristianKuehnel ChristianKuehnel marked this pull request as ready for review April 5, 2023 17:38
requirements.txt Outdated Show resolved Hide resolved
mtv_dl.py Outdated Show resolved Hide resolved
@fnep
Copy link
Owner

fnep commented Apr 13, 2023

Sorry, it took me a while over the Easter weekend.

Good direction already :)

@ChristianKuehnel
Copy link
Contributor Author

For the next steps:
If we want to have season/episode information in the list command and support filtering for them, I suppose we need to add two columns to the database and parse this information when a new Mediathek View json. This means we need to trigger the heuristics during the import, regardless of the user setting the --series flag.

Downsides:

  • Makes the import slower (cant say by how much).
  • Will cause lots of false-positives when trying matching the (123) pattern, so users will see nonsensical information in these columns.

Is that what we want to have?
Do you have a better idea?

@fnep
Copy link
Owner

fnep commented Apr 14, 2023

I would guess running this match during import with a precompiled regex is not very slow. For example, parsing the date is really slow, but the script does it anyway. Given we make the regex configurable, it would not work anyway.

We could try to use a python regex call as a custom sqlite function like this, and apply it on the SELECT. Not sure how slow it is, it is worth a try. Not?

@ChristianKuehnel
Copy link
Contributor Author

Looking at the sqlite documentation, it seems that you can only define one REGEXP function, which can only be used in the WHERE clause. And you've already defined such a function.

What we probably need is to extract the season/episode information using "function callbacks", something like:

SELECT ... , GET_SEASON(title) as season, GET_EPISODE(title) as episode ... WHERE

And then we can filter this information using the normal SQL features.

Or did I miss something?

Thinking about the filtering for season/episode more:
For my use case I don't actually need that. I'm downloading all unseen episodes anyway (via cron job) and then filter out the false-positives manually. I'm not sure if the season/episode information is reliable enough to filter on that as well...

@fnep
Copy link
Owner

fnep commented Apr 15, 2023

I would also probably not filter for it. The only reason i see is if you wait for a new season while the old one is still aired over and over again, and you want a filter to only get shows with seasons greater than that.

For the regex, i think you are right. The function callbacks would be a good workaround, but we probably do not need them.

@ChristianKuehnel
Copy link
Contributor Author

I looked for more season/episode patterns in the title file from the raw Filmliste.aktuell.xz

Examples

  • Folge 1: Andi und die Eselalm -- having a prefix of "Folge", with a : at the end
  • Folge 5: Wilde Tiergeschichten aus der Eifel (S14/E05) -- having a prefix of "Folge" in combination with season/episode
  • Folge 4: Mehr als eine Entscheidung (S01/E04) - Audiodeskription -- having a suffix of "Audiodeskription" after the season/episode marker
  • Leichter leben (10) | Partnervermittlung für ältere Menschen -- having the title of the episode after the episode number (10)
  • Trudel (84) und Erich (89) heiraten nach 33 Jahren Verlobung -- false positive: some folks age use similar pattern as episode number
  • Rückblick: 12. Folge - Zwillingszauber (1) -- Using "12. Folge" to mark the episode, with a . after the number
  • #delikatdelikat Folge 06 - Late Night Alter -- similar pattern, but without .
  • Michail Gorbatschow in Berlin (2014) - heute journal 08.11.2014 -- false positive: year marker (2014) looks like episode pattern
  • Folge 11/36: Getrennt- und Zusammenschreibung -- probably episode 11 of 36 total
  • Folge 7 (OmU) - Originalfassung mit deutschen Untertiteln -- without the : after the episode
  • Folge 9 -- without any more title information
  • Hörfassung: Folge 3: Besuch aus dem Jenseits - Staffel 5 -- with separate texts for Season and Episode
  • Können wir das Pferd retten? (Staffel 33 Folge 10) -- textual version of Episode and Season
  • #45 Trostloser VfB-Auftritt in Dortmund -- with prefix # for the episode
  • StarStarSpace #23/Japanoschlampen #34 - Die verschollene Episode -- two episode numbers in the same title 🤯
  • Gipfeltour (4/5) | Serie "Nix wie raus …. Madeira" -- again episode 4 of 5 total
  • Bares für Rares vom 21. März 2023, aktueller bericht (22.02.2021) -- using the date to identify the episode
  • Krieg in Europa: Tag 14, Der 13. Tag, Songwriterin des Jahres 2017 -- just numbering something (e.g. days, years) hard to gather examples for all of these...
  • Serie: "Pflügers Weihnachtswerkstatt - Staffel III" | Weihnachtliche Walnuss-Dekoschale (5/5) -- using Roman III numbers to count the season 🤦

Other observations

  • I see these patterns across all channels, so using the channel information doesn't help in deciding on which patterns to use
  • I did not find any example for "(S01?E01)" with other characters than /, so I couldn't find any cases of _ in the titles.

Conclusion

I guess we can cover most of these cases with a couple of regular expressions quite well, but not all of them. Maybe we should offer a combination of pre-configured patterns and allowing the user to add more for corner cases? Maybe also allow to manually override the season number? This would also mean that we should show season and episode in the list mode so users will see what their pattern would look like...

Relatively reliable patterns:

  • S01/E01 is the most reliable pattern
  • Using Folge 1 and 2. Folge sometimes in combination withStaffel 2 also looks quite straight forward.
  • #42 should also work
  • (123) works sometimes, doesn't make sense to apply this in general (false positives on age, year)
  • (3/5) might also work, all examples I've seen were of the pattern (<episode>/<total number of episodes>)

Sorry for the long post, however the reality of heuristics is messy 😞

@fnep
Copy link
Owner

fnep commented Apr 16, 2023

Oh wow, a lot of examples. Thank you for assembling this.

Allowing users to define the season number as a parameter would be nice, but because we have to do this during the parsing time, it’s not really an option. We should provide a fixed set of regular expressions as you said, and wait for users to come and ask to extend it (no parameter to extend it at the beginning at least).

tests.py Outdated Show resolved Hide resolved
mtv_dl.py Outdated Show resolved Hide resolved
@fnep
Copy link
Owner

fnep commented Apr 16, 2023

While reading the changes I thought about the manual season definition again and I think it will not help. If the (123) annotated show is not season 1, the episode number would also be false. Maybe I miss something?

And even if it is more work, having the season/episode info available as a filter for there case where it is defined properly would be just consequent. Applying it only for list/download is easier, but it will prevent some usecases.

mtv_dl.py Outdated Show resolved Hide resolved
@fnep
Copy link
Owner

fnep commented Apr 22, 2023

Good progress 🚀

@ChristianKuehnel
Copy link
Contributor Author

Alright here is another update to the feature:

  • We now extract season/episode information when filling the database. If we don't have season/episode information these are set to None/NULL.
  • This information is used when --series is set for the download and list commands.
  • This change modified the database schema, so you right now you need to re-create the database for this. A migration feature is a # TODO.
  • I did not find a way to add a unit-testing command that is not visible to the users, it seems docopt does not yet support this: Support for hidden commands - Feature Request docopt/docopt#232

Next steps:

  • implement filtering
  • resolve the TODOs in the code

@fnep
Copy link
Owner

fnep commented Apr 22, 2023

For the testing this should not be an command line parameter in my opinion.

This tool has no tests as of yet, so there is no structure for it. But normally I use pytest to run doctests with a pytest.ini having addopts = --doctest-modules. https://docs.pytest.org/en/7.1.x/how-to/doctest.html

For running pytest I normally use tox, but maybe this is too much for here. https://tox.wiki/en/latest/example/pytest.html

@ChristianKuehnel
Copy link
Contributor Author

Filtering and history are working now.

I'm not sure about:

  • How to do the database schema version upgrade as you're using the PRAGMA schema.user_version already for something else. Right now I'm checking column names, also not ideal.
  • Since your INSERT statements depend on the ordering of database columns, I hope that the ALTER TABLE statements always preserve the order of columns.
  • Do we actually need to update the schema of .Filmliste? Would it be easier to just delete the file and re-import everything?
  • Do we want to enable --series automatically, if the user filters for season/episode? If so: I guess we would need to refactor/extract the parsing of the filter expressions.

Next steps:

  • re-visit the doctest invocation (as per your last comment)
  • probably some cleanup and polishing

@fnep
Copy link
Owner

fnep commented Apr 22, 2023

  • For the migration of the history the user_version would be right. For the filmliste the script version number is in the file name and so it will be recreated anyway. The user_version is used to see if it needs a refresh.
  • I think enabling it automatically - only if some filter flag is given - is confusing. Better leave it an option for now.

@ChristianKuehnel ChristianKuehnel changed the title first draft of series feature Implementation of series feature Apr 23, 2023
@ChristianKuehnel
Copy link
Contributor Author

@fnep I'm done with the series mode feature. Please do a thorough review and let me know if anything needs fixing or improvement.

Reminder for myself: we need to remove the test.sh before merging...

mtv_dl.py Outdated Show resolved Hide resolved
mtv_dl.py Outdated Show resolved Hide resolved
mtv_dl.py Outdated Show resolved Hide resolved
mtv_dl.py Outdated
@@ -39,6 +39,8 @@

List options:
-c <results>, --count=<results> Limit the number of results. [default: 50]
--series Try to extract season and episode information from the title
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For listing, i would prefer to always show the columns. For Downloads it makes sense because of the movie / episodedetails difference, but here it is only cosmetics.

By the way, for me docopt is unable to have this parameter description twice. I need to remove one, otherwise is get --series is not a unique prefix: --series, --series?. Did this work for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to strip the series_mode flag and all the if-statements in the show_table function as well? Or should we leave them in, just in case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, this worked for me...

Copy link
Owner

@fnep fnep Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it looks like this:

> poetry run mtv_dl list --series
--series is not a unique prefix: --series, --series?
Usage:
  mtv_dl.py list [options] [--sets=<file>] [--count=<results>] [<filter>...]
  mtv_dl.py dump [options] [--sets=<file>] [<filter>...]
  mtv_dl.py download [options] [--sets=<file>] [--low|--high] [<filter>...]
  mtv_dl.py history [options] [--reset|--remove=<hash>]
  mtv_dl.py --help

… yes i would also remove the if-statements and series_mode parameter there to be in line with the docs.

mtv_dl.py Outdated Show resolved Hide resolved
mtv_dl.py Outdated Show resolved Hide resolved
squashed changed for rebasing on master
After enabling mypy and ruff, I had to fix a few things so that pytest
is passing successfully again.
@fnep
Copy link
Owner

fnep commented May 21, 2023

Hi Christian, Sorry it took me so long to get back to this issue.

I'm ready to merge this … is there anything left you want to do before?

@ChristianKuehnel
Copy link
Contributor Author

No worries, I'm good with the changes, feel free to merge when you are ready.

Do you want me to change anything else?

@fnep
Copy link
Owner

fnep commented May 22, 2023

No, it's all good. I was just a bit overloaded recently and want to catch up now.

@fnep fnep merged commit 513201b into fnep:master May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants