## Aim
1. Find every Stanford CS course homepage that exists and automatically generate a Markdown table of them for a `readme.md` file.
2. After finding the valid course homepages, find all of their archived pages, which follow a predictable URL pattern.


In [0]:
import requests
import time
import lxml.html

In [0]:
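def check_website_exist(webpage):
    '''
    Given the URL of a webpage, return True if it responds with HTTP 200, otherwise False.
    '''
    try:
        # A timeout keeps one unresponsive host from stalling the whole scan.
        response = requests.get(webpage, timeout=10)
    except requests.exceptions.RequestException:
        # Connection errors, timeouts, and exhausted retries all end up here.
        print("request failed for", webpage)
        time.sleep(5)
        return False
    return response.status_code == 200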
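
Checking 800 candidate URLs one at a time is slow, because each call waits for a full round trip. A minimal sketch of running the checks in parallel follows. It assumes a hypothetical helper `filter_existing` that is not part of the notebook, and it takes the checker as a parameter so the demonstration can run offline with a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_existing(urls, check, max_workers=8):
    """Return the URLs for which check(url) is truthy, preserving input order.

    filter_existing is a hypothetical helper, not part of the notebook.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        flags = list(pool.map(check, urls))
    return [url for url, ok in zip(urls, flags) if ok]


# Offline demonstration with a stand-in checker (no network access needed).
known = {
    "http://web.stanford.edu/class/cs9",
    "http://web.stanford.edu/class/cs22",
}
candidates = [f"http://web.stanford.edu/class/cs{i}" for i in (9, 10, 22)]
print(filter_existing(candidates, lambda u: u in known))
```

In the notebook, `check_website_exist` could be passed as `check`. A small `max_workers` value keeps the load on the server modest.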

In [0]:
def extract_homepage():
    '''
    Return a list of URLs for the CS course homepages that exist.
    '''
    # Candidate course numbers: cs1 through cs800.
    course_no = ["cs" + str(i) for i in range(1, 801)]

    cs_class = "http://web.stanford.edu/class/"

    result = []
    for course in course_no:
        current_website = cs_class + course
        if check_website_exist(current_website):
            result.append(current_website)
            print(current_website)

    return result

In [0]:
def find_titles(webpages):
    '''
    Given a list of webpage URLs, return a dictionary mapping each page title to its URL.
    '''
    result = {}
    for url in webpages:
        try:
            t = lxml.html.parse(url)
        except OSError:
            print("os error occured for url: ", url)
            continue

        # Look the <title> element up once; pages that share a title overwrite one another.
        title = t.find(".//title")
        if title is not None:
            result[title.text] = url

    return result
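
The `OSError` branch in `find_titles` is reached when `lxml.html.parse` cannot load a URL itself. One possible workaround, offered as an assumption rather than something the notebook confirms, is to download the page separately (for example with `requests`) and extract the title from the HTML text. The sketch below swaps in the standard-library `html.parser` for that step; `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None when no complete <title> element was seen.
        return "".join(self._parts) if self._done else None


def extract_title(html_text):
    """Return the raw title text of an HTML string, or None if it has no title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


print(extract_title("<html><head><title>CS101 Introduction</title></head></html>"))
```

The title is returned with its whitespace untouched, just as `.text` does in `find_titles`.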

In [0]:
def print_markdown_table(web_table):
    '''
    Given a dictionary mapping page titles to URLs, print a Markdown table.
    '''
    print("| Course Number | Link |")
    print("| --- | --- |")

    for name, url in web_table.items():
        # The course number is the last path segment of the URL, e.g. "cs9".
        course_no = url.split("/")[-1]

        print(f'|{course_no}|[{name}]({url})|')

In [30]:
webpages = extract_homepage()

http://web.stanford.edu/class/cs9
http://web.stanford.edu/class/cs20
http://web.stanford.edu/class/cs22
http://web.stanford.edu/class/cs41
http://web.stanford.edu/class/cs42
http://web.stanford.edu/class/cs43
http://web.stanford.edu/class/cs47
http://web.stanford.edu/class/cs50
http://web.stanford.edu/class/cs51
http://web.stanford.edu/class/cs52
http://web.stanford.edu/class/cs84
http://web.stanford.edu/class/cs93
http://web.stanford.edu/class/cs101
http://web.stanford.edu/class/cs102
http://web.stanford.edu/class/cs103
http://web.stanford.edu/class/cs105
http://web.stanford.edu/class/cs107
http://web.stanford.edu/class/cs108
http://web.stanford.edu/class/cs109
http://web.stanford.edu/class/cs110
http://web.stanford.edu/class/cs116
http://web.stanford.edu/class/cs121
http://web.stanford.edu/class/cs122
http://web.stanford.edu/class/cs123
http://web.stanford.edu/class/cs124
http://web.stanford.edu/class/cs129
http://web.stanford.edu/class/cs131
http://web.stanford.edu/class/cs137
http:

In [31]:
web_tables = find_titles(webpages)

os error occured for url:  http://web.stanford.edu/class/cs20
os error occured for url:  http://web.stanford.edu/class/cs116
os error occured for url:  http://web.stanford.edu/class/cs181
os error occured for url:  http://web.stanford.edu/class/cs190
os error occured for url:  http://web.stanford.edu/class/cs253
os error occured for url:  http://web.stanford.edu/class/cs255
os error occured for url:  http://web.stanford.edu/class/cs344
os error occured for url:  http://web.stanford.edu/class/cs346
os error occured for url:  http://web.stanford.edu/class/cs355
os error occured for url:  http://web.stanford.edu/class/cs477


In [32]:
print_markdown_table(web_tables)

| Course Number | Link |
| --- | --- |
|cs9|[CS9: Problem-Solving for the CS Technical Interview](http://web.stanford.edu/class/cs9)|
|cs22|[CS22: The History and Philosophy of Artificial Intelligence](http://web.stanford.edu/class/cs22)|
|cs41|[Index of /class/cs41](http://web.stanford.edu/class/cs41)|
|cs42|[Callback Me Maybe: Contemporary JavaScript (CS 42)](http://web.stanford.edu/class/cs42)|
|cs47|[Cross-Platform Mobile Development (CS 47)](http://web.stanford.edu/class/cs47)|
|cs50|[CS 50 : Using Tech For Good](http://web.stanford.edu/class/cs50)|
|cs51|[ CS+SG Studio ](http://web.stanford.edu/class/cs51)|
|cs52|[title](http://web.stanford.edu/class/cs52)|
|cs84|[Index of /class/cs84](http://web.stanford.edu/class/cs84)|
|cs93|[
  CS93: Teaching AI](http://web.stanford.edu/class/cs93)|
|cs101|[CS101 Introduction to Computing Principles](http://web.stanford.edu/class/cs101)|
|cs102|[
    CS 102: Working with Data - Tools and Techniques
  ](http://web.stanford.edu/class/cs102)|
|c
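
Some rows in the output above span several lines because the page titles contain newlines and indentation, and that breaks the Markdown table. A minimal sketch of building a row with the whitespace collapsed follows. `table_row` is a hypothetical helper, not part of the notebook, and the pipe escaping is an extra precaution that the output above does not strictly require.

```python
def table_row(course_no, title, url):
    """Format one Markdown table row with a single-line title (hypothetical helper)."""
    # Collapse runs of whitespace, including newlines, into single spaces.
    clean = " ".join((title or "").split())
    # Escape pipes so a title cannot end the table cell early.
    clean = clean.replace("|", r"\|")
    return f"|{course_no}|[{clean}]({url})|"


print(table_row("cs93", "\n  CS93: Teaching AI", "http://web.stanford.edu/class/cs93"))
```

Calling this helper from the loop in `print_markdown_table` would keep each course on one row.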

In [0]:
# Candidate course numbers: cs1 through cs300 (this range already includes cs47,
# so it no longer needs to be seeded separately and checked twice).
course_no = ["cs" + str(i) for i in range(1, 301)]

# Term codes used in the archive URLs: even numbers from 1100 to 1206.
semester = [str(i) for i in range(1100, 1208, 2)]

initial_link = "http://web.stanford.edu/class/archive/cs/"
end_link = "/"

result = []
for ele1 in course_no:
    for ele2 in semester:
        # Candidate URL shape: <initial_link><course>/<course>.<term>/
        string = initial_link + ele1 + "/" + ele1 + "." + ele2 + end_link
        if check_website_exist(string):
            print(string)
            result.append(string)

for ele in result:
    print(ele)
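
The nested loops above pair every course with every term code. The same candidate list can be sketched as a generator, which makes the size of the search space easy to inspect before sending any requests. `ARCHIVE_ROOT` and `archive_urls` are hypothetical names. The URL shape is taken from the cell above.

```python
from itertools import product

ARCHIVE_ROOT = "http://web.stanford.edu/class/archive/cs/"


def archive_urls(courses, terms):
    """Yield one candidate archive URL per (course, term) pair (hypothetical helper)."""
    for course, term in product(courses, terms):
        yield f"{ARCHIVE_ROOT}{course}/{course}.{term}/"


courses = [f"cs{i}" for i in range(1, 301)]
terms = [str(t) for t in range(1100, 1208, 2)]
urls = list(archive_urls(courses, terms))

print(len(urls))  # 300 courses * 54 term codes = 16200 candidates
print(urls[0])
```

With 16,200 candidates, checking only the courses whose homepage was found earlier would cut the number of requests considerably.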