Fix og_title matching - empty dict for kinds other than movie #116
Previous regexp didn't match some kinds.
I'm having problems with this changeset. First, I'm getting a syntax error due to the "u" in front of the regex. After changing it to "r" (and also removing the coding cookie at the beginning, since other files in the project don't have it), the test suite gave me 21 failures, compared to 17 failures on master. |
I ported this from the legacy branch while merging upstream changes back to legacy. I didn't really test it on master because I thought it was separate from the rest of the code (you know, input/output signatures and interfaces). I just noticed that the current regular expression doesn't match TV series while testing upstream changes. Maybe I should also mention that I tested it on Python 2.6 rather than 3.x (hence the unicode markers). Feel free to adapt or reuse it in any way. I'm happy to implement fixes or changes to make it work on master, but I'm currently focusing on the legacy branch (because we're still on Python 2 🙃). I've updated my first comment with some additional details. |
I thought it did match TV series. The tests in the test suite for movie kinds and titles pass, and they include a series example (Doctor Who 2005 - 0436992). Can you give an example of a series where part of the og_title (title, year, kind) is not parsed correctly? The only problem I'm aware of is the series title in quotes, which shows up in episode titles. |
Are you sure that your tests contain a valid og:title value (exactly as on IMDb)? |
You're right, I had to add a preprocessing operation for that :) |
I'm not sure it's a good idea to preprocess the title before matching. In my opinion it's better to have a regular expression that matches the format of the original string. If you still want to replace the BTW, the regexp I'm proposing yields cleaner output: you don't have additional, unnecessary groups. Also, having a lookahead/lookbehind safeguards the format. We can:
Up to you. |
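To make the point about matching the original format concrete, here is a minimal sketch, not the pattern from this PR. It assumes og:title values shaped like `Title (1999)` or `Title (TV Series 2005– )`; the names `OG_TITLE_RE` and `parse_og_title` are hypothetical. It uses named groups so there are no extra groups in the output, and a lookahead after the year to guard the format.

```python
import re

# Hypothetical pattern for illustration only. The assumed input shapes are
# "Title (1999)" and "Title (TV Series 2005<en dash> )".
OG_TITLE_RE = re.compile(
    r"""
    ^(?P<title>.+?)\s+             # title text, matched lazily
    \(                             # opening parenthesis of the info part
    (?:(?P<kind>[^()\d]+?)\s+)?    # optional kind, e.g. TV Series
    (?P<year>\d{4})                # four-digit (start) year
    (?=[\u2013\s)])                # lookahead: en dash, space or closing paren
    [^()]*                         # rest of the year range, if any
    \)$                            # closing parenthesis ends the string
    """,
    re.VERBOSE,
)


def parse_og_title(og_title):
    """Return a dict with title, kind and year, or None if nothing matches."""
    match = OG_TITLE_RE.match(og_title)
    return match.groupdict() if match else None
```

With this sketch, a plain movie title yields `kind` set to `None`, and a string that does not follow the assumed format yields `None` instead of a partially filled dict.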
I don't like string preprocessing, I'll be happy to remove it and use a better regex. Would you like to create a PR for the master branch using py3 or would you rather have me look into it? |
Ok. I'll check with python3 and update this PR. |
@uyar maybe you can help me here. I have 17 failing tests on a clean master branch when running py.test on Python 3.6.4. Yes, I've cleaned py.test cache. It seems some additional title parsing is required, because
While On the other hand,
|
@jsynowiec I'm getting 19 failures. Two of them are related to the dash issue. The tests expect regular dashes, so maybe some postprocessing? The other 17 errors are normal, their functionalities haven't been restored yet. |
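The postprocessing idea for dashes could look roughly like the sketch below. It assumes the scraped text contains an en dash (U+2013) where the tests expect a regular hyphen; `normalize_dashes` is a hypothetical helper, not a function from the project.

```python
def normalize_dashes(text):
    """Replace en dashes (U+2013) with ASCII hyphens.

    Hypothetical helper; assumes the en dash is the only dash variant
    that needs to be normalized.
    """
    return text.replace("\u2013", "-")
```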
Yes, I see those. I'm working on a fix. |
Fixed. It now handles start/end years for the series years key. |
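As an illustration of start/end year handling (a sketch under assumptions, not the code merged here), a year range such as `2005–2010` or an open-ended `2005–` could be split as follows. The function name and the `start`/`end` dict keys are hypothetical.

```python
import re

# Accepts "1999", "2005-2010", or an open-ended "2005-", with either an
# en dash (U+2013) or an ASCII hyphen as the separator.
YEARS_RE = re.compile(
    r"^(?P<start>\d{4})(?:\s*[\u2013-]\s*(?P<end>\d{4})?)?\s*$"
)


def split_series_years(years):
    """Split a year range into start and end years (end may be None)."""
    match = YEARS_RE.match(years.strip())
    if match is None:
        return None
    end = match.group("end")
    return {
        "start": int(match.group("start")),
        "end": int(end) if end else None,
    }
```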
Works for me too, thanks. |
If this PR is considered complete, I'll merge it. Thanks @jsynowiec! PS @uyar: obviously feel free to merge any pull request yourself; you should have the permissions (if not, I'll change them). |
OK, if I feel confident about a PR, I'll merge it. |
And I started with this one :) |
Previous regexp didn't match TV series (and probably other kinds).
Sample og_title values
Related to #103
Before
After
Parsed title