Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix og_title matching - empty dict for kinds other than movie #116

Merged
merged 6 commits into from Jan 9, 2018
Merged

Fix og_title matching - empty dict for kinds other than movie #116

merged 6 commits into from Jan 9, 2018

Conversation

jsynowiec
Copy link
Contributor

@jsynowiec jsynowiec commented Jan 8, 2018

Previous regexp didn't match TV series (and probably other kinds).

Sample og_tilte values

Agents of S.H.I.E.L.D. (TV Series 2013– )
Lost (TV 2004–2010)
The Matrix (1999)

Related to #103

Before

In [7]: ia.get_movie('2364582')
Out[7]: <Movie id:2364582[http] title:_None_>

After

In [3]: ia.get_movie('2364582')
Out[3]: <Movie id:2364582[http] title:_"Agents of S.H.I.E.L.D." (2013)_>

Parsed title

In [6]: analyze_og_title(u'Agents of S.H.I.E.L.D. (TV Series 2013– )')
Out[6]: 
{'kind': u'tv series',
 'series years': u'',
 'title': u'Agents of S.H.I.E.L.D.',
 'year': 2013}

In [7]: analyze_og_title(u'Lost (TV Series 2004–2010)')
Out[7]: {'kind': u'tv series', 'series years': u'2010', 'title': u'Lost', 'year': 2004}

In [8]: analyze_og_title(u'The Matrix (1999)')
Out[8]: {'kind': 'movie', 'title': u'The Matrix', 'year': 1999}

Previous regexp didn't match some kinds.
@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

I'm having problems with this changeset. First, I'm getting a syntax error due to the "u" in front of the regex. After changing it to "r" -and also removing the coding cookie at the beginning, other files in the project don't have it- the test suite gave me 21 failures when it was 17 failures in the master.

@jsynowiec
Copy link
Contributor Author

jsynowiec commented Jan 9, 2018

I've ported this from the legacy branch, while merging upstream changes back to legacy. Didn't really tested this on master because I thought that it's separated from the rest of the code (you know, input/output signature/interfaces). Just noticed that the current regular expression doesn't match tv series while testing upstream changes.

Maybe I should also mention that I've tested it on Python 2.6, rather than 3.x (hence, the unicode markers). Feel free to adapt it or reuse in any way. I'm happy to implement fixes or changes to make it working on master but I'm currently focusing on the legacy branch (because we're still on Python2 🙃 )

I've updated my first comment with some additional details.

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

I thought it does match TV series. The tests in the test suite regarding the movie kinds and titles pass and they include a series example (Doctor Who 2005 - 0436992). Can you give an example of a series where a part of the og_title (title, year, kind) is not parsed correctly? The only problem I'm aware of is the series title in quotes which shows up in episode titles.

@jsynowiec
Copy link
Contributor Author

Are you sure that your tests contain valid (exactly as in IMDB) og:title value?
Or maybe you're doing some preprocessing on og:title before running it through the regex matcher that I didn't notice? Like replacing unicode \2013 that divides the years with a regular dash?

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

You're right, I had to add a preprocessing operation for that :)

@jsynowiec
Copy link
Contributor Author

I'm not sure if this is a good idea to preprocess title before matching. In my opinion it's better to have a regular expression that matches the format of the original string. If you still want to replace the \2013 then it might be better to explicitly have it as a value, not as a "dash-like thing" in the source because it turns out that I did notice it but that unicode character was somehow distorted when copying changes.

BTW - the regexp I'm proposing yields a cleaner output, you don't have additional, unnecessary groups. Also having a lookahead/lookbehind safeguards the format.

We can:

  1. Remove the preprocessing, match against the original title with the unicode dash-thing
  2. Change the \2013 character in the regexp to a dash (it should work)
  3. Do nothing, use the old regexp.

Up to you.

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

I don't like string preprocessing, I'll be happy to remove it and use a better regex. Would you like to create a PR for the master branch using py3 or would you rather have me look into it?

@jsynowiec
Copy link
Contributor Author

Ok. I'll check with python3 and update this PR.

@jsynowiec
Copy link
Contributor Author

jsynowiec commented Jan 9, 2018

@uyar maybe you can help me here. I have 17 failing tests on a clean master branch when running py.test on Python 3.6.4. Yes, I've cleaned py.test cache.

It seems some additional title parsing is required, because

<meta property="og:title" content="&quot;Doctor Who&quot; Blink (TV Episode 2007)">

While Blink is expected.

On the other hand, h3[itemprop='name']/text() returns "Blink"

<h3 itemprop="name">
Blink             <span class="titlereference-title-year">
                    (<a href="/search/title?year=2007&amp;ref_=tt_rv" itemprop="url">2007</a>)
             </span>
    </h3>

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

@jsynowiec I'm getting 19 failures. Two of them are related to the dash issue. The tests expect regular dashes, so maybe some postprocessing? The other 17 errors are normal, their functionalities haven't been restored yet.
Getting the title from the h3 element has its problems, see 1126eb4#commitcomment-26590148
Maybe we could use a combination of the two (og:title and h3).

@jsynowiec
Copy link
Contributor Author

jsynowiec commented Jan 9, 2018

Two of them are related to the dash issue. The tests expect regular dashes, so maybe some postprocessing?

Yes, I see those. I'm working on a fix.

@jsynowiec
Copy link
Contributor Author

Fixed. It now handles start/end years for series years key.

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

Works for me too, thanks.

@alberanid
Copy link
Collaborator

if this PR is considered complete, I'll merge it. Thanks @jsynowiec !

PS @uyar : obviously feel free to merge any pull request by yourself; you should have the permission (if not, I'll change them).

@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

OK, if I feel confident about a PR, I'll merge it.

@uyar uyar merged commit 4d6d3f9 into cinemagoer:master Jan 9, 2018
@uyar
Copy link
Collaborator

uyar commented Jan 9, 2018

And I started with this one :)

@jsynowiec jsynowiec deleted the fix/analyze_og_title branch January 9, 2018 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants