Supply a template or txt file with course names for easy lookup #59

Open
ivoflipse opened this issue Feb 11, 2013 · 4 comments

Comments

@ivoflipse

When I tried to use the otherwise awesome script, I had to go and look up all the names I wanted from the course list. So I just made a little txt file with the URL handle and the name of each course, which I could then easily copy into the command line.
Perhaps it would be an idea to maintain a list of all the courses?

Past courses

  • neuralnets-2012-001 Neural Networks for Machine Learning
  • sciwrite-2012-001 Writing in the Sciences
  • progfun-2012-001 Functional Programming Principles in Scala
  • maththink-2012-001 Introduction to Mathematical Thinking
  • bigdata-2012-001 Web Intelligence and Big Data
  • healthpolicy-2012-001 Health Policy and the Affordable Care Act
  • intrologic Introduction to Logic
  • compilers Compilers
  • automata Automata
  • gametheory Game Theory
  • crypto Cryptography I

Current courses (possibly incomplete)

  • algo2-2012-001 Algorithms: Design and Analysis, Part 2
  • thinkagain-2012-001 Think Again: How to Reason and Argue
  • hetero-2012-001 Heterogeneous Parallel Programming
  • compmethods-2012-001 Computational Methods for Data Analysis
  • precalculus-001 Pre-Calculus
  • algebra-001 Algebra
  • proglang-2012-001 Programming Languages
  • calcsing-2012-001 Calculus in a Single Variable
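
A list like the one above could be kept in a plain text file and searched from Python. The following is only a sketch: the one-entry-per-line "handle name" layout and the function names parse_course_list and find_handles are assumptions made for illustration, not part of coursera-dl.

```python
def parse_course_list(text):
    """Map each course handle to its name, one 'handle name' pair per line."""
    courses = {}
    for line in text.splitlines():
        # Drop list bullets and surrounding whitespace
        line = line.strip().lstrip("•").strip()
        if not line:
            continue
        # The handle never contains spaces, so split on the first one
        handle, _, name = line.partition(" ")
        courses[handle] = name.strip()
    return courses


def find_handles(courses, query):
    """Return the handles whose course name contains query (case-insensitive)."""
    query = query.lower()
    return sorted(h for h, name in courses.items() if query in name.lower())


# Hypothetical file contents, using entries from the list above
SAMPLE = """\
neuralnets-2012-001 Neural Networks for Machine Learning
progfun-2012-001 Functional Programming Principles in Scala
proglang-2012-001 Programming Languages
crypto Cryptography I
"""

if __name__ == "__main__":
    # Print the matching handles space separated, ready to paste on the command line
    print(" ".join(find_handles(parse_course_list(SAMPLE), "programming")))
```

Searching by a word from the course name returns the handles to paste, so the file only needs to be edited when a new course is added.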
@rbrito
Member

rbrito commented Feb 11, 2013

Hi, Ivo.

On Mon, Feb 11, 2013 at 1:22 PM, Ivo Flipse notifications@github.com wrote:

> When I tried to use the otherwise awesome script I had to go and lookup all the names I wanted from the course list.

Well, supposedly, the idea would be to download material from courses
that you already know about (because you are subscribed to them). :)

> So I just made a little txt file with the url handle and the name of the course, which I could then easily copy into the command line.
> Perhaps it would be an idea to maintain a list of all the courses?

I guess that one of the easiest routes would be to grab this
information from some site that aggregates this (e.g.,
classcentral.com), but this is on the borderline of the scope of
coursera-dl, which is meant for downloads, not discovery...

Furthermore, keeping such lists may need some manual intervention and
it is not really clear how they could be used by the script. The
person has to sign up for the courses anyway (and if you try to sign up
for some courses after they are already running or after they have
been concluded, you will be denied access).

One reason for that may be that the course is no longer offered on
Coursera (see, for instance, Jennifer Widom's db course
migrating to Class2Go, Umesh Vazirani's qcomp migrating to edX,
the saas courses moving to edX too, etc.).

And, of course, to have access to the courses, you have to click the
"I accept the honor code" or something like that. I don't intend to
make this particular step automated, for human/awareness reasons.

Please clarify how you intend to keep the list of courses up-to-date
without the maintainers of the program (John and I) having extra work.
If you are persuasive enough, we may implement your idea. :)

Thanks,

Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org/blog : Projects : https://github.com/rbrito/
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

@ivoflipse
Author

I personally only download courses when all the material is available, because otherwise I would have to come back later and download the rest anyway. But I can understand if others use it to download videos to watch them offline or on the go. The issue with course material no longer being available could (hopefully) be caught with an exception when you get an access denied error.

I guess the only workaround I can imagine would be to parse the course page for logged-in users:
https://www.coursera.org/user/i/<user_uuid>
Then check whether the left/width of the "coursera-course-listing-progress" element has reached 100%.
If so, extract the course URL from the "coursera-course-listing-meta" element and try to run the script.

But I can understand if this level of automation is out of scope for the script.
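
The progress check described above might look roughly like this. It is a sketch under assumptions: that the progress element carries an inline style such as "left: 0%; width: 100%", and that either value reaching 100% means the course is finished. Neither was verified against the real page, and the function names are made up for illustration.

```python
import re


def progress_percent(style):
    """Return the largest 'left' or 'width' percentage in an inline style, or None."""
    # The lookbehind skips properties such as margin-left or max-width
    values = re.findall(r"(?<![\w-])(?:left|width)\s*:\s*([\d.]+)%", style)
    return max(float(v) for v in values) if values else None


def is_finished(style):
    """Treat a course as finished once its progress reaches 100% (assumed rule)."""
    percent = progress_percent(style)
    return percent is not None and percent >= 100.0


if __name__ == "__main__":
    # Hypothetical style attribute values
    for style in ("left: 0%; width: 100%", "width: 42.5%", "color: red"):
        print(style, "->", is_finished(style))
```

Only courses for which is_finished returns True would then have their URL extracted and passed on to the script.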

@jplehmann
Contributor

I've personally been facing a similar issue with the explosion of classes. I have used the following regex:

# extract all the currently open classes I'm enrolled in on a single line, space separated
grepo "class.coursera.org/(.*?)/" courses.html | uniq | paste -s -d" "

where courses.html is the page displayed when you click on "courses" underneath your name in the menu, and "grepo" is a script I wrote which does something like "grep -o" except it outputs only the text matched by the group.
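
For anyone without a "grepo" script, roughly the same extraction can be sketched in Python with the standard re module. The function name enrolled_handles and the sample markup are assumptions for illustration. Unlike uniq, which only drops adjacent repeats, this version drops every repeated handle.

```python
import re


def enrolled_handles(html):
    """Return the unique course handles linked on class.coursera.org, in page order."""
    seen = []
    # The non-greedy group captures only the text between the host and the next slash
    for handle in re.findall(r"class\.coursera\.org/(.*?)/", html):
        if handle not in seen:
            seen.append(handle)
    return seen


# Hypothetical fragment of a saved courses.html
SAMPLE = (
    '<a href="https://class.coursera.org/thinkagain-2012-001/auth/auth_redirector">'
    '<a href="https://class.coursera.org/automata/auth/auth_redirector">'
    '<a href="https://class.coursera.org/automata/class/index">'
)

if __name__ == "__main__":
    # Single line, space separated, like the paste step in the pipeline
    print(" ".join(enrolled_handles(SAMPLE)))
```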

@ivoflipse
Author

Inspired by your comment, I messed around a little to see if I could extract this information. I couldn't get to my /courses page programmatically, so I just downloaded it manually. Automating this would be nice, but it works.

Then I load the page using BeautifulSoup:

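# Assumes BeautifulSoup 4 (the bs4 package); the post does not say which version was used
from bs4 import BeautifulSoup

# Parse the manually saved copy of the /courses page
with open("Courses.htm") as page:
    soup = BeautifulSoup(page, "html.parser")
# Find the boxes that contain the course information
course_elements = soup.findAll("div",
    {"class": "coursera-course-listing-box coursera-course-listing-box-wide coursera-account-course-listing-box"})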

This gives us a list containing each of the boxes on the /courses page. From here we can try to extract the relevant information:

# Iterate through each course box
for course in course_elements:
    # The date information is in a span element
    listing_start = course.findAll("span")
    # Some booleans for controlling behavior of the script
    is_course = True
    ended = False

    # Not every box seems to be a course, so we just try to parse it and skip it on failure
    try:
        # There seem to be three different date formats, told apart by the first word
        date_text = listing_start[2].text
        first_word = date_text.split()[0]
        # Courses yet to start
        if first_word == "Starts":
            ending_time = date_text
        # Courses that have already ended
        elif first_word == "Ended":
            ending_time = date_text
            ended = True
        # Courses that have already started, but not yet ended
        else:
            ending_time = "End date: {}".format(date_text)
    except IndexError:
        # If we can't get the date, flip this boolean, so we don't bother with further parsing
        is_course = False

    # If the current element is a course, print the info
    # If you set this check to ended, it'll only give you info for completed courses
    if is_course:  # and ended:
        course_listing = course.findAll("h3")
        course_name = course_listing[0].text
        # The link is the second double-quoted string in the h3 markup
        course_url = str(course_listing[0]).split("\"")[3]
        split_course_url = course_url.split("/")
        # URLs on www.coursera.org have an extra "course" path segment before the handle
        if split_course_url[3] == "course":
            course_handler = split_course_url[4]
        else:
            course_handler = split_course_url[3]
        print("Course name: {}".format(course_name))
        print("Course handler: {}".format(course_handler))
        print("Course url: {}".format(course_url))
        print(ending_time)
        print()

I added some prints, which aren't really needed, but they show that you can retrieve the information you'd want. You could either use the URL that's passed when you press the green button or use the course name, as your script currently does. It seems that courses that are no longer accessible have a different URL (only the accessible ones have the auth part), so that's useful info too.

So depending on the status of the course, you'd get something like this:

Course in progress
Course name: Think Again: How to Reason and Argue
Course handler: thinkagain-2012-001
Course url: https://class.coursera.org/thinkagain-2012-001/auth/auth_redirector?type=login&amp;subtype=normal
End date: Nov 26th

Course not yet started
Course name: Know Thyself
Course handler: knowthyself
Course url: https://www.coursera.org/course/knowthyself
Starts in 20 days

Ended course
Course name: Automata
Course handler: automata
Course url: https://class.coursera.org/automata/auth/auth_redirector?type=login&amp;subtype=normal
Ended 8 months ago

Ended and closed course
Course name: Statistics One
Course handler: stats1
Course url: https://www.coursera.org/course/stats1
Ended 4 months ago
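
Judging only from the four sample outputs above, the handle and a rough accessibility hint can both be read from the URL. This is a sketch: the function names are invented for illustration, and the rule that a class.coursera.org host means the material is reachable is an assumption drawn from these samples, not something Coursera documents.

```python
from urllib.parse import urlparse


def course_handle(url):
    """Return the course handle from either URL form seen in the samples."""
    parts = urlparse(url).path.strip("/").split("/")
    # www.coursera.org/course/<handle> versus class.coursera.org/<handle>/...
    return parts[1] if parts[0] == "course" else parts[0]


def looks_accessible(url):
    """Guess accessibility from the host name (assumption based on the samples)."""
    return urlparse(url).netloc == "class.coursera.org"


if __name__ == "__main__":
    for url in (
        "https://class.coursera.org/automata/auth/auth_redirector?type=login",
        "https://www.coursera.org/course/stats1",
    ):
        print(course_handle(url), looks_accessible(url))
```

Parsing the path with urllib.parse avoids depending on the position of each slash in the full URL string.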

It would require some fiddling, because you no longer have to pass the names through the command line, so you'd have to insert them somewhere. Or make the script get the names from the parsed file and go through them one by one.

Anyway, this was a fun experiment :-) If only I could get it to retrieve this information from the live page and possibly list the courses available for me, so I could pass the number of the course I wanted the script to download, that would be awesome!
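
The "pass the number of the course" idea could be sketched like this, assuming the parsing step has already produced a list of (handle, name) pairs. The names format_menu and pick_handle are hypothetical, and nothing here touches the live page or coursera-dl itself.

```python
def format_menu(courses):
    """Return one numbered line per (handle, name) pair, starting at 1."""
    return ["{}. {} ({})".format(i, name, handle)
            for i, (handle, name) in enumerate(courses, start=1)]


def pick_handle(courses, choice):
    """Translate a 1-based menu number into a course handle."""
    index = int(choice) - 1
    if not 0 <= index < len(courses):
        raise ValueError("no course numbered {}".format(choice))
    return courses[index][0]


# Hypothetical result of parsing the saved page
COURSES = [
    ("thinkagain-2012-001", "Think Again: How to Reason and Argue"),
    ("automata", "Automata"),
]

if __name__ == "__main__":
    print("\n".join(format_menu(COURSES)))
    # The chosen handle is what would be passed on to the download script
    print(pick_handle(COURSES, "2"))
```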


3 participants