Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix syllabus parsing errors by preferring html5lib over lxml. #91

Merged
merged 1 commit into from
May 6, 2013
Merged

Fix syllabus parsing errors by preferring html5lib over lxml. #91

merged 1 commit into from
May 6, 2013

Conversation

rhayward
Copy link
Contributor

@rhayward rhayward commented May 2, 2013

Thanks for all your work on this script!

I was trying to download fe-001 when I ran into trouble:

~/coursera-dl$./coursera-dl -n --skip-download fe-001
Downloading class: fe-001
Downloaded https://class.coursera.org/fe-001/lecture/index (167494 bytes)
Week_10
  10-1_Liquidity_Trading_Costs_and_Portfolio_Execution
  10-2_Optimal_Execution
  10-3_Portfolio_Execution
  10-4_Optimal_Execution_in_Excel_1
  10-5_Optimal_Execution_in_Excel_2
  10-6_Real_Options
  10-7_Valuation_of_Natural_Gas_and_Electricity_Related_Options
  10-8_Real_Options_in_Excel
Week_9
  9-1_Structured_Credit-_CDOs_and_Beyond
  9-2_The_Gaussian_Copula_Model
  9-3_A_Simple_Example-_Part_I
  9-4_A_Simple_Example-_Part_II
  9-5_The_Mechanics_of_a_Synthetic_CDO_Tranche
  9-6_Computing_the_Fair_Value_of_a_CDO_Tranche
  9-7_Cash_and_Synthetic_CDOs
  9-8_Pricing_and_Risk_Management_of_CDO_Portfolios
  9-9_CDO-Squareds_and_Beyond
Week_8
  8-1_Review_of_the_Binomial_Model_for_Option_Pricing
  8-2_The_Black-Scholes_Model
  8-3_The_Greeks-_Delta_and_Gamma
  8-4_The_Greeks-_Vega_and_Theta
  8-5_Risk-Management_of_Derivatives_Portfolios
  8-6_Delta-Hedging
  8-7_The_Volatility_Surface
  8-8_The_Volatility_Surface_in_Action
  8-9_Why_is_there_a_Skew
  8-10_What_the_Volatility_Surface_Tells_Us
  8-11_Pricing_Derivatives_Using_the_Volatility_Surface
  8-12_Beyond_the_Volatility_Surface_and_Black-Scholes
Week_7
  7-1_Introduction_to_Mortgage_Mathematics_and_Mortgage_Backed_Securities
  7-2_Prepayment_Risk_and_Mortgage_Pass-Throughs
  7-3_Mortgage_Pass-Throughs_in_Excel
  7-4_Principal-Only_and_Interest-Only_MBS
  7-5_Risks_of_Principal-Only_and_Interest-Only_MBS
  7-6_Collateralized_Mortgage_Obligations_(CMOs)
  7-7_Pricing_Mortgage-Backed-Securities
Week_6
  6-1_Model_Calibration
  6-2_An_Application-_Pricing_a_Payer_Swaption_in_a_BDT_Model
  6-3_Fixed_Income_Derivatives_Pricing_in_Practice
  6-4_Modeling_Defaultable_Bonds
  6-5_Pricing_Defaultable_Bonds
  6-6_Credit_Default_Swaps
  6-7_Pricing_Credit_Default_Swaps
  Interview_with_Emmanuel_Derman
Week_5
  5-1_Introduction_to_Term_Structure_Lattice_Models
  5-2_The_Cash_Account_and_Pricing_Zero-Coupon_Bonds
  5-3_Fixed_Income_Derivatives-_Options_on_Bonds
  5-4_Fixed_Income_Derivatives-_Bond_Forwards
  5-5_Fixed_Income_Derivatives-_Bond_Futures
  5-6_Fixed_Income_Derivatives-_Caplets_and_Floorlets
  5-7_Fixed_Income_Derivatives-_Swaps_and_Swaptions
  5-8_The_Forward_Equations
Week_4
  4-1_Implementation_Difficulties_with_Mean_Variance
  4-2_Negative_Exposures_and_Leveraged_ETFs
  4-3_Beyond_Variance
  4-4_Statistical_Biases_in_Performance_Evaluation
Traceback (most recent call last):
  File "../coursera-dl", line 976, in <module>
    main()
  File "../coursera-dl", line 966, in main
    if download_class(args, class_name):
  File "../coursera-dl", line 928, in download_class
    or tmp_cookie_file, args.reverse)
  File "../coursera-dl", line 488, in parse_syllabus
    lecture['mp4'] = get_video(lecturePage)
  File "../coursera-dl", line 446, in get_video
    return soup.find(attrs={'type':re.compile('^video/mp4')})['src']
TypeError: 'NoneType' object has no attribute '__getitem__'

Additionally, the TestSyllabusParsing unit test failed:

~/coursera-dl$./runtests.sh 
E
======================================================================
ERROR: test_parse (test.test_coursera.TestSyllabusParsing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_coursera.py", line 24, in test_parse
    sections = coursera_dl.parse_syllabus(self.syllabus_page, None)
  File "coursera/coursera_dl.py", line 475, in parse_syllabus
    href = a['href']
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 879, in __getitem__
    return self.attrs[key]
KeyError: 'href'

----------------------------------------------------------------------
Ran 1 test in 0.204s

FAILED (errors=1)

After playing with BeautifulSoup for a while, I noticed that the parse tree
for the syllabus was incorrect. I have both the lxml and html5lib packages
installed, and BeautifulSoup was picking lxml automatically. When I changed
this to explicitly use html5lib, the parse tree was correct and the problems
disappeared.

@rbrito
Copy link
Member

rbrito commented May 2, 2013

Yes, the encoding problem can happen as I pointed out in another project:

coursera-dl/edx-dl#18 (comment)

If I understand things correctly, it is only BeautifulSoup 4 that needs (or even makes available) the possibility of using different parsers, right?

I don't want to break things for users of BeautifulSoup 3, because... Well, I am one. :)

@rhayward
Copy link
Contributor Author

rhayward commented May 2, 2013

I don't know that much about BeautifulSoup, but it appears that you are correct. I didn't mean to break things for people with BeautifulSoup 3. I revised my code, and now the change will only impact users of BeautifulSoup 4.

@rbrito
Copy link
Member

rbrito commented May 3, 2013

@rhayward, can you squash both commits together (as in git rebase -i HEAD~2, for instance)? It would make things easier to read and review.

Thanks,
Rogério.

* Explicitly request html5lib in the BeautifulSoup constructor.
* Add instructions for installation of html5lib to the README

Beautiful Soup 4 (bs4) can use any of three parsers: html.parser (builtin),
lxml, or html5lib. It seems that for at least some Coursera classes (fe-001
for example), only html5lib produces an acceptable parse tree. However, if
both lxml and html5lib are installed, bs4 will pick lxml by default. This
patch ensures that html5lib is used if it is available. These changes should
not affect users of Beautiful Soup 3.
@rhayward
Copy link
Contributor Author

rhayward commented May 6, 2013

I squashed the changes into a single commit, and I added some info on installation of html5lib to README.md.

Also, I think that this may fix issue 66.

@rbrito
Copy link
Member

rbrito commented May 6, 2013

Great, and better yet that it came with more detailed documentation, thanks!

rbrito pushed a commit that referenced this pull request May 6, 2013
Fix syllabus parsing errors by preferring html5lib over lxml.
@rbrito rbrito merged commit 2b21f25 into coursera-dl:master May 6, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants