Fix og_title matching - empty dict for kinds other than movie #116

jsynowiec · 2018-01-08T21:36:35Z

Previous regexp didn't match TV series (and probably other kinds).

Sample og_tilte values

Agents of S.H.I.E.L.D. (TV Series 2013– )
Lost (TV 2004–2010)
The Matrix (1999)

Related to #103

Before

In [7]: ia.get_movie('2364582')
Out[7]: <Movie id:2364582[http] title:_None_>

After

In [3]: ia.get_movie('2364582')
Out[3]: <Movie id:2364582[http] title:_"Agents of S.H.I.E.L.D." (2013)_>

Parsed title

In [6]: analyze_og_title(u'Agents of S.H.I.E.L.D. (TV Series 2013– )')
Out[6]: 
{'kind': u'tv series',
 'series years': u'',
 'title': u'Agents of S.H.I.E.L.D.',
 'year': 2013}

In [7]: analyze_og_title(u'Lost (TV Series 2004–2010)')
Out[7]: {'kind': u'tv series', 'series years': u'2010', 'title': u'Lost', 'year': 2004}

In [8]: analyze_og_title(u'The Matrix (1999)')
Out[8]: {'kind': 'movie', 'title': u'The Matrix', 'year': 1999}

Previous regexp didn't match some kinds.

uyar · 2018-01-09T08:17:16Z

I'm having problems with this changeset. First, I'm getting a syntax error due to the "u" in front of the regex. After changing it to "r" -and also removing the coding cookie at the beginning, other files in the project don't have it- the test suite gave me 21 failures when it was 17 failures in the master.

jsynowiec · 2018-01-09T08:21:31Z

I've ported this from the legacy branch, while merging upstream changes back to legacy. Didn't really tested this on master because I thought that it's separated from the rest of the code (you know, input/output signature/interfaces). Just noticed that the current regular expression doesn't match tv series while testing upstream changes.

Maybe I should also mention that I've tested it on Python 2.6, rather than 3.x (hence, the unicode markers). Feel free to adapt it or reuse in any way. I'm happy to implement fixes or changes to make it working on master but I'm currently focusing on the legacy branch (because we're still on Python2 🙃 )

I've updated my first comment with some additional details.

uyar · 2018-01-09T08:42:28Z

I thought it does match TV series. The tests in the test suite regarding the movie kinds and titles pass and they include a series example (Doctor Who 2005 - 0436992). Can you give an example of a series where a part of the og_title (title, year, kind) is not parsed correctly? The only problem I'm aware of is the series title in quotes which shows up in episode titles.

jsynowiec · 2018-01-09T09:06:29Z

Are you sure that your tests contain valid (exactly as in IMDB) og:title value?
Or maybe you're doing some preprocessing on og:title before running it through the regex matcher that I didn't notice? Like replacing unicode \2013 that divides the years with a regular dash?

uyar · 2018-01-09T09:07:43Z

You're right, I had to add a preprocessing operation for that :)

jsynowiec · 2018-01-09T09:16:32Z

I'm not sure if this is a good idea to preprocess title before matching. In my opinion it's better to have a regular expression that matches the format of the original string. If you still want to replace the \2013 then it might be better to explicitly have it as a value, not as a "dash-like thing" in the source because it turns out that I did notice it but that unicode character was somehow distorted when copying changes.

BTW - the regexp I'm proposing yields a cleaner output, you don't have additional, unnecessary groups. Also having a lookahead/lookbehind safeguards the format.

We can:

Remove the preprocessing, match against the original title with the unicode dash-thing
Change the \2013 character in the regexp to a dash (it should work)
Do nothing, use the old regexp.

Up to you.

uyar · 2018-01-09T09:23:32Z

I don't like string preprocessing, I'll be happy to remove it and use a better regex. Would you like to create a PR for the master branch using py3 or would you rather have me look into it?

jsynowiec · 2018-01-09T09:25:26Z

Ok. I'll check with python3 and update this PR.

jsynowiec · 2018-01-09T11:18:47Z

@uyar maybe you can help me here. I have 17 failing tests on a clean master branch when running py.test on Python 3.6.4. Yes, I've cleaned py.test cache.

It seems some additional title parsing is required, because

<meta property="og:title" content="&quot;Doctor Who&quot; Blink (TV Episode 2007)">

While Blink is expected.

On the other hand, h3[itemprop='name']/text() returns "Blink"

<h3 itemprop="name">
Blink             <span class="titlereference-title-year">
                    (<a href="/search/title?year=2007&amp;ref_=tt_rv" itemprop="url">2007</a>)
             </span>
    </h3>

uyar · 2018-01-09T11:45:09Z

@jsynowiec I'm getting 19 failures. Two of them are related to the dash issue. The tests expect regular dashes, so maybe some postprocessing? The other 17 errors are normal, their functionalities haven't been restored yet.
Getting the title from the h3 element has its problems, see 1126eb4#commitcomment-26590148
Maybe we could use a combination of the two (og:title and h3).

jsynowiec · 2018-01-09T11:50:29Z

Two of them are related to the dash issue. The tests expect regular dashes, so maybe some postprocessing?

Yes, I see those. I'm working on a fix.

jsynowiec · 2018-01-09T12:41:22Z

Fixed. It now handles start/end years for series years key.

uyar · 2018-01-09T12:43:06Z

Works for me too, thanks.

alberanid · 2018-01-09T18:35:31Z

if this PR is considered complete, I'll merge it. Thanks @jsynowiec !

PS @uyar : obviously feel free to merge any pull request by yourself; you should have the permission (if not, I'll change them).

uyar · 2018-01-09T18:42:14Z

OK, if I feel confident about a PR, I'll merge it.

uyar · 2018-01-09T18:42:49Z

And I started with this one :)

Fix og_title parsing

dda5e78

Previous regexp didn't match some kinds.

Jakub Synowiec added 3 commits January 9, 2018 12:33

Ignore vscode settings

a01dd88

Remove unicode markers

6dc7f32

Remove \2013 (dash-like) character replacement

7cfd3ad

Jakub Synowiec added 2 commits January 9, 2018 13:01

Ignore ctags

bbca15c

Fix start/end date handling for tv series

78464b2

uyar merged commit 4d6d3f9 into cinemagoer:master Jan 9, 2018

jsynowiec deleted the fix/analyze_og_title branch January 9, 2018 23:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix og_title matching - empty dict for kinds other than movie #116

Fix og_title matching - empty dict for kinds other than movie #116

jsynowiec commented Jan 8, 2018 •

edited

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 •

edited

uyar commented Jan 9, 2018 •

edited

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 •

edited

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 •

edited

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

alberanid commented Jan 9, 2018

uyar commented Jan 9, 2018

uyar commented Jan 9, 2018

Fix og_title matching - empty dict for kinds other than movie #116

Fix og_title matching - empty dict for kinds other than movie #116

Conversation

jsynowiec commented Jan 8, 2018 • edited

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 • edited

uyar commented Jan 9, 2018 • edited

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 • edited

uyar commented Jan 9, 2018

jsynowiec commented Jan 9, 2018 • edited

jsynowiec commented Jan 9, 2018

uyar commented Jan 9, 2018

alberanid commented Jan 9, 2018

uyar commented Jan 9, 2018

uyar commented Jan 9, 2018

jsynowiec commented Jan 8, 2018 •

edited

jsynowiec commented Jan 9, 2018 •

edited

uyar commented Jan 9, 2018 •

edited

jsynowiec commented Jan 9, 2018 •

edited

jsynowiec commented Jan 9, 2018 •

edited