* Name: Arindam Choudhury
* Uplevel Email Address: arindam.choudhury.email@gmail.com
* Problem Statement: Problem-01: AI-Assisted Learning for NVIDIA SDKs and Toolkits

> # 1. AC_NvidiaLLM_helper_webscraping.ipynb
### Scrape NVIDIA and related web pages and extract the text content from each linked site:

For this project, the following websites were scraped and their contents were read and saved in "NVIDIA_DATA_FOLDER":

1. NVIDIA Main page                 `https://www.nvidia.com/en-us/` 
2. NVIDIA Main document page        `https://docs.nvidia.com`
3. NVIDIA All document pages        `https://docs.nvidia.com/#all-documents`
4. NVIDIA Forum pages               `https://forums.developer.nvidia.com/`
5. NVIDIA Customer Help pages       `https://nvidia.custhelp.com/app/answers/list/st/5/kw/grid`
6. Wikipedia (NVIDIA) page          `https://en.wikipedia.org/wiki/Nvidia`
7. Stack-Overflow pages for NVIDIA  `https://stackoverflow.com/`

> #### Import Necessary Libraries

In [70]:
from langchain_community.document_loaders import UnstructuredURLLoader, UnstructuredFileLoader
from langchain.docstore.document import Document
from unstructured.cleaners.core import remove_punctuation,clean,clean_extra_whitespace, group_broken_paragraphs
from bs4 import BeautifulSoup
import requests
import pickle
import re
from tqdm import tqdm
import time

> #### Connect google drive
* ### Make sure to unzip Arindam_Choudhury_NvidiaLLM.zip and save it in your "MyDrive" location

In [None]:
from google.colab import userdata, drive
drive.mount('/content/drive', force_remount=True)

> #### Define global variables for this notebook

In [71]:
save_data_folder = '/content/drive/MyDrive/Arindam_Choudhury_NvidiaLLM/NVIDIA_DATA_FOLDER/'
ALL_PAGE_LINKS   = set()                   # Saves all processed links (used for finding duplicates)

# Boilerplate JavaScript notices. They are removed from the page text
comment1 = "Your browser either does not have JavaScript enabled or does not appear to support enough features of JavaScript to be used well on this site. "
comment2 = "This site requires Javascript in order to view all its content. Please enable Javascript in order to access all the functionality of this web site. "
comment3 = "Here are the  instructions how to enable JavaScript in your web browser."

> #### Function: Save and Load file

In [72]:
def save_data(file_name, data):
    with open(save_data_folder + file_name, 'wb') as file:
      pickle.dump(data, file)

def load_data(file_name):
    with open(save_data_folder + file_name, 'rb') as file:
      return pickle.load(file)
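
The two helpers above amount to a pickle round trip. A minimal sketch of that round trip, assuming a temporary directory stands in for `save_data_folder`; the names `save_pickle` and `load_pickle`, the explicit `folder` argument, and the sample link list are illustrative and not part of the notebook:

```python
import os
import pickle
import tempfile

def save_pickle(folder, file_name, data):
    """Pickle `data` to folder/file_name."""
    with open(os.path.join(folder, file_name), 'wb') as file:
        pickle.dump(data, file)

def load_pickle(folder, file_name):
    """Load a pickled object from folder/file_name."""
    with open(os.path.join(folder, file_name), 'rb') as file:
        return pickle.load(file)

# Round trip inside a throwaway directory (assumption: stands in for Google Drive)
with tempfile.TemporaryDirectory() as tmp_folder:
    links = ['https://docs.nvidia.com', 'https://www.nvidia.com/en-us/']
    save_pickle(tmp_folder, 'EXAMPLE_LINKS', links)
    restored = load_pickle(tmp_folder, 'EXAMPLE_LINKS')

print(restored == links)
```

Using `os.path.join` rather than string concatenation means the folder path works with or without a trailing slash.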

> #### Function: Check for duplicate links (Already processed)

In [73]:
def check_duplicate_links(working_page_links):
    unique_links = []                        # build a new list; removing items while iterating skips elements
    dup_count = 0
    for link in working_page_links:
        if link in ALL_PAGE_LINKS:
            dup_count += 1
        else:
            ALL_PAGE_LINKS.add(link)
            unique_links.append(link)
    if dup_count > 0:
        print(f"Number of Duplicate links removed: {dup_count}")
    else:
        print("No Duplicate links found")
    return unique_links
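
Removing items from a list while looping over that same list can skip the element that follows each removal. A minimal sketch of a duplicate check that collects the new links into a separate list instead; `filter_new_links` is a hypothetical name, `seen_links` is a stand-in for `ALL_PAGE_LINKS`, and the URLs are illustrative:

```python
def filter_new_links(links, seen_links):
    """Return links not seen before, in order, and record them in seen_links."""
    new_links = []
    for link in links:
        if link in seen_links:
            continue                 # already processed earlier, or repeated in this batch
        seen_links.add(link)
        new_links.append(link)
    return new_links

seen_links = {'https://docs.nvidia.com'}          # pretend this one was processed earlier
batch = [
    'https://docs.nvidia.com',
    'https://developer.nvidia.com',
    'https://developer.nvidia.com',               # repeated inside the same batch
    'https://www.nvidia.com/en-us/',
]
fresh = filter_new_links(batch, seen_links)
print(fresh)
```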

> #### Function: Read page content and create documents

In [74]:
def read_content_from_pages(working_page_links, save_file_name, print_count):
    PAGE_CONTENT   = []
    load_count     = 0

    while working_page_links:                           # stop cleanly when no links are left
        url_tbp    = working_page_links.pop()
        url_loader = UnstructuredURLLoader(urls = [url_tbp],
                                           mode="elements",
                                           continue_on_failure = True,
                                           show_progress_bar=False,
                                           post_processors=[clean,remove_punctuation,clean_extra_whitespace])

        elements = url_loader.load()
                                                        # cleaning the contents
        selected_elements = [e for e in elements if e.metadata['category']=="NarrativeText"]
        full_clean = " ".join([e.page_content for e in selected_elements])
        for comment in (comment1, comment2, comment3):  # strip all three JavaScript notices
            full_clean = full_clean.replace(comment, '')

        PAGE_CONTENT.append(Document(page_content=full_clean, metadata={"source":url_tbp}))
        load_count += 1
        if load_count % print_count == 0 and working_page_links:
            print (f"Processing completed for {load_count} pages....")
            save_data(save_file_name, PAGE_CONTENT)     # periodic checkpoint
            time.sleep(5)

    print (f"Processing completed for {load_count} pages....")
    save_data(save_file_name, PAGE_CONTENT)             # final save
    return PAGE_CONTENT
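
The cleaning step inside the loop keeps only `NarrativeText` elements and strips the JavaScript notices from the joined text. A minimal sketch of that step in isolation, assuming plain dictionaries stand in for the loader's elements; `clean_page_text`, the dictionary shape, and the sample text are hypothetical:

```python
# One of the notices removed by the notebook (copied from the global variables above)
NOTICES = [
    "This site requires Javascript in order to view all its content. ",
]

def clean_page_text(elements, notices):
    """Join the narrative elements and drop every boilerplate notice."""
    narrative = [e["text"] for e in elements if e["category"] == "NarrativeText"]
    text = " ".join(narrative)
    for notice in notices:
        text = text.replace(notice, "")
    return text.strip()

# Hypothetical elements; the real loader returns Document objects with metadata
elements = [
    {"category": "Title", "text": "CUDA Toolkit"},
    {"category": "NarrativeText",
     "text": "This site requires Javascript in order to view all its content. "},
    {"category": "NarrativeText", "text": "CUDA is a parallel computing platform."},
]
cleaned = clean_page_text(elements, NOTICES)
print(cleaned)
```

Looping over a list of notices keeps the removal in one place, so adding a fourth notice does not require another `replace` line.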

> ### 1. NVIDIA Main page                 `https://www.nvidia.com/en-us/`

> #### Function: Get all href links from the page

In [75]:
def get_links_main_page(url, save_file):    
    PAGE_LINKS = set()
    html_text = requests.get(url).content
    soup = BeautifulSoup(html_text, 'lxml')
    links = soup.find_all("a", href=re.compile("https://"))
    for link in links:
        PAGE_LINKS.add(link["href"])
    save_data(save_file, list(PAGE_LINKS))                            # Saving the links.....
    return list(PAGE_LINKS)
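
The link filter above keeps only anchors whose `href` contains `https://`, which drops relative links and anchors without an `href`. A minimal sketch of the same idea on an inline HTML string, using the standard library's `html.parser` in place of BeautifulSoup so that it runs without a network call; `HttpsLinkCollector` and the sample markup are hypothetical:

```python
from html.parser import HTMLParser

class HttpsLinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose href contains 'https://'."""

    def __init__(self):
        super().__init__()
        self.links = set()           # a set removes repeated links automatically

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "https://" in href:
            self.links.add(href)

sample_html = """
<a href="https://docs.nvidia.com">Docs</a>
<a href="/en-us/about-nvidia/">About</a>
<a href="https://docs.nvidia.com">Docs again</a>
<a>No href</a>
"""
collector = HttpsLinkCollector()
collector.feed(sample_html)
print(sorted(collector.links))
```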

> #### Process Main page

In [76]:
main_page_url     = 'https://www.nvidia.com/en-us/'
link_file_name    = '1_NVIDIA_MAIN_PAGE_LINKS'
docs_file_name    = '1_NVIDIA_MAIN_PAGE_CONTENT'

MAIN_PAGE_LINKS = get_links_main_page(main_page_url, link_file_name)   # get links

working_page_links = MAIN_PAGE_LINKS.copy()                            # links to be processed
working_page_links = check_duplicate_links(working_page_links)
print(f"Total {len(working_page_links)} pages to be processed")

MAIN_PAGE_DOCUMENT = read_content_from_pages(working_page_links, docs_file_name, 50)

No Duplicate links found
Total 345 pages to be processed
Processing completed for 50 pages....
Processing completed for 100 pages....
Processing completed for 150 pages....
Processing completed for 200 pages....
Processing completed for 250 pages....
Processing completed for 300 pages....
Processing completed for 345 pages....


> ### 2. NVIDIA Main document page        `https://docs.nvidia.com`

> #### Process Main Document page

In [77]:
working_page_links  = 'https://docs.nvidia.com'
docs_file_name      = '2_NVIDIA_MAIN_DOC_PAGE_CONTENT'

MAIN_DOCUMENT_PAGE_DOCUMENT = read_content_from_pages([working_page_links], docs_file_name, 1)

Processing completed for 1 pages....


> ### 3. NVIDIA All document pages        `https://docs.nvidia.com/#all-documents`

> #### Function: Get all href links from the page

In [78]:
def get_links(url):
    page_links = set()
    html_text = requests.get(url).content
    soup = BeautifulSoup(html_text, 'lxml')
    links = soup.find_all("a", class_="Link", href=re.compile("https://"))
    for link in links:
        page_links.add(link["href"])
    return page_links

> #### Function: Get recursive links [up to the 3rd level; Main URL -> link1s -> link2s -> link3s]

In [79]:
def get_recursive_links(url, save_file):
    PAGE_LINKS = set()

    doc_pages = get_links(url)
    PAGE_LINKS = PAGE_LINKS.union(doc_pages)

    for page in doc_pages:
        link_pages = get_links(page)
        PAGE_LINKS = PAGE_LINKS.union(link_pages)

        for link_page in link_pages:
            next_link_pages = get_links(link_page)
            PAGE_LINKS = PAGE_LINKS.union(next_link_pages)

    save_data(save_file, list(PAGE_LINKS))       # Saving the links.....
    return list(PAGE_LINKS)
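
The nested loops above can request the same page more than once when several pages link to it. A minimal sketch of a depth-limited crawl that expands each page only once; `fetch_links` is a hypothetical stand-in for `get_links`, backed here by a hard-coded link graph instead of HTTP requests:

```python
# Hypothetical link graph: page -> links found on that page
FAKE_SITE = {
    "root": {"a", "b"},
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"d"},
    "d": {"e"},
}
fetch_log = []                        # records every simulated request

def fetch_links(url):
    fetch_log.append(url)
    return FAKE_SITE.get(url, set())

def crawl(start_url, max_depth=3):
    """Collect links up to max_depth levels below start_url, expanding each page once."""
    found = set()
    visited = {start_url}
    frontier = {start_url}
    for _ in range(max_depth):
        next_frontier = set()
        for page in frontier:
            next_frontier |= fetch_links(page)
        found |= next_frontier
        frontier = next_frontier - visited    # only expand pages not expanded before
        visited |= frontier
    return found

all_links = crawl("root")
print(sorted(all_links), len(fetch_log))
```

With this graph, page `e` sits at the fourth level and is not collected, and no page appears twice in `fetch_log`.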

> #### Process All documents page

In [80]:
document_url   = 'https://docs.nvidia.com/#all-documents'
link_file_name = '3_NVIDIA_DOCUMENT_PAGE_LINKS'
docs_file_name = '3_NVIDIA_DOCUMENT_PAGE_CONTENT'

DOC_PAGE_LINKS = get_recursive_links(document_url, link_file_name)    # get links

working_page_links = []                                               # links to be processed
for link in DOC_PAGE_LINKS:
  if 'nvidia' in link.split("."):
    working_page_links.append(link)

non_working_links = [
'https://api.nvcf.nvidia.com',
'https://ngc.nvidia.com',
'https://developer.nvidia.com',
'https://catalog.ngc.nvidia.com',
'https://helm.ngc.nvidia.com/egxdefault',
'https://ets.egxdefault.egx.nvidia.com',
'https://fc.ngc.nvidia.com',
'https://nvid.nvidia.com',
'https://helm.ngc.nvidia.com/nvidia',
'https://eps.egx.nvidia.com',
'https://docs.nvidia.com/ai-enterprise/1.5/user-guide/index.html#configure-vmware-vsphere-vm-with-vgpuuser-guide/index.html',
'https://forums.developer.nvidia.com/c/healthcare/Parabricks/290',
'https://*.ngc.nvidia.com',
'https://egxdefault.egx.nvidia.com',
'https://docscontent.nvidia.com/sphinx/0000018c-87dc-d937-a99e-87deb3850000/0000018c-839d-de01-a38c-b3ff5b3d0000/cloud-functions/user-guide/latest/_downloads/2b19177fe9ae38d30a1c3da24edf09da/inference-test.zip',
'https://docscontent.nvidia.com/sphinx/0000018c-87dc-d937-a99e-87deb3850000/0000018c-839d-de01-a38c-b3ff5b3d0000/cloud-functions/user-guide/latest/_downloads/4f273de9a4ca3c4b8d1894c7b2f80907/postman.json',
'https://api.nvcf.nvidia.com/v3/openapi']

for link in non_working_links:                                          # remove non working links
    if link in working_page_links:
        working_page_links.remove(link)

working_page_links = check_duplicate_links(working_page_links)
print(f"Total {len(working_page_links)} pages to be processed")
DOCUMENT_PAGE_DOCUMENT = read_content_from_pages(working_page_links, docs_file_name, 2000)

Number of Duplicate links removed: 25
Total 32835 pages to be processed
Processing completed for 2000 pages....
Processing completed for 4000 pages....
Processing completed for 6000 pages....
Processing completed for 8000 pages....
Processing completed for 10000 pages....
Processing completed for 12000 pages....
Processing completed for 14000 pages....
Processing completed for 16000 pages....
Processing completed for 18000 pages....
Processing completed for 20000 pages....
Processing completed for 22000 pages....
Processing completed for 24000 pages....
Processing completed for 26000 pages....
Processing completed for 28000 pages....
Processing completed for 30000 pages....
Processing completed for 32000 pages....
Processing completed for 32835 pages....
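
The check `'nvidia' in link.split(".")` in the cell above looks for a dot-separated `nvidia` piece anywhere in the URL string, including the path. A minimal sketch of a stricter alternative that inspects only the host name through `urllib.parse`; `is_nvidia_host` and the sample URLs are hypothetical:

```python
from urllib.parse import urlparse

def is_nvidia_host(link):
    """True when the URL's host is nvidia.com or one of its subdomains."""
    host = urlparse(link).hostname or ""
    return host == "nvidia.com" or host.endswith(".nvidia.com")

samples = [
    "https://docs.nvidia.com/cuda/index.html",
    "https://github.com/NVIDIA/cuda-samples",
    "https://example.com/files/driver.nvidia.zip",   # passes the split(".") check
]
kept = [link for link in samples if is_nvidia_host(link)]
print(kept)
```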


> ### 4. NVIDIA Forum pages               `https://forums.developer.nvidia.com/`

> #### Function: Get all href links from the page

In [81]:
def get_links_forum_pages(url, save_file):
    PAGE_LINKS = set()

    html_text = requests.get(url).content
    soup = BeautifulSoup(html_text, 'lxml')
    divs = soup.find_all("div", itemprop="itemListElement")                     # Get all forums
    forums = set()
    for div in divs:
        path = div.find("meta", itemprop="url")
        forums.add('https://forums.developer.nvidia.com' + str(path.attrs['content']))

    f_name_categories = set()
    for forum in forums:                                                        # Get the categories from each forum
        html_text = requests.get(forum).content
        soup = BeautifulSoup(html_text, 'lxml')
        anchors = soup.find_all("a", href=re.compile("/c/"))
        for anchor in anchors:
            if forum[-5:] not in str(anchor):
                f_name_categories.add('https://forums.developer.nvidia.com' + str(anchor['href']))

    for category in f_name_categories:                                          # Get the topics from each category
        html_text = requests.get(category).content
        soup = BeautifulSoup(html_text, 'lxml')
        sites = soup.find_all("a", class_="title raw-link raw-topic-link")
        for site in sites:
            PAGE_LINKS.add(str(site['href']))

    save_data(save_file, list(PAGE_LINKS))                                      # Saving the links.....
    return list(PAGE_LINKS)

> #### Process Forum pages

In [82]:
forum_url      = 'https://forums.developer.nvidia.com/'
link_file_name = '4_NVIDIA_FORUM_PAGE_LINKS'
docs_file_name = '4_NVIDIA_FORUM_PAGE_CONTENT'

FORUM_PAGE_LINKS = get_links_forum_pages(forum_url, link_file_name)     # get links

working_page_links = FORUM_PAGE_LINKS.copy()                            # links to be processed
working_page_links = check_duplicate_links(working_page_links)
print(f"Total {len(working_page_links)} pages to be processed")

FORUM_PAGE_DOCUMENT = read_content_from_pages(working_page_links, docs_file_name, 500)

No Duplicate links found
Total 3564 pages to be processed
Processing completed for 500 pages....
Processing completed for 1000 pages....
Processing completed for 1500 pages....
Processing completed for 2000 pages....
Processing completed for 2500 pages....
Processing completed for 3000 pages....
Processing completed for 3500 pages....
Processing completed for 3564 pages....


> ### 5. NVIDIA CustomerHelp pages `https://nvidia.custhelp.com/app/answers/list/st/5/kw/grid`

> #### Function: Get all href links from the page

In [83]:
def get_links_support_pages(url, save_file):
    PAGE_LINKS = set()
    for i in range (1, 6):
        URL = url + str(i)
        html_text = requests.get(URL).content
        soup = BeautifulSoup(html_text, 'lxml')
        links = soup.find_all("a", href=re.compile("/app/answers/detail/a_id"))
        for link in links:
            PAGE_LINKS.add("https://nvidia.custhelp.com" + link["href"])
    save_data(save_file, list(PAGE_LINKS))                            # Saving the links.....
    return list(PAGE_LINKS)

> #### Process CustomerHelp pages

In [84]:
# Process NVIDIA Customer help pages:
custhelp_url   = 'https://nvidia.custhelp.com/app/answers/list/st/5/kw/grid/page/'
link_file_name = '5_NVIDIA_CUSTHELP_PAGE_LINKS'
docs_file_name = '5_NVIDIA_CUSTHELP_PAGE_CONTENT'

CUSTHELP_PAGE_LINKS = get_links_support_pages(custhelp_url, link_file_name) # get links

working_page_links = CUSTHELP_PAGE_LINKS.copy()                             # links to be processed
working_page_links = check_duplicate_links(working_page_links)
print(f"Total {len(working_page_links)} pages to be processed")

CUSTHELP_PAGE_DOCUMENT = read_content_from_pages(working_page_links, docs_file_name, 10)

No Duplicate links found
Total 46 pages to be processed
Processing completed for 10 pages....
Processing completed for 20 pages....
Processing completed for 30 pages....
Processing completed for 40 pages....
Processing completed for 46 pages....


> ### 6. Wikipedia (NVIDIA) page          `https://en.wikipedia.org/wiki/Nvidia`

> #### Process Wikipedia page

In [85]:
working_page_links  = 'https://en.wikipedia.org/wiki/Nvidia'
docs_file_name      = '6_NVIDIA_WIKIPEDIA_PAGE_CONTENT'

WIKI_PAGE_DOCUMENT = read_content_from_pages([working_page_links], docs_file_name, 1)

Processing completed for 1 pages....


> ### 7. Stack-Overflow pages for NVIDIA  `https://stackoverflow.com/`

> #### Function: Get all href links from the page

In [86]:
def stack_page_links(tag, page_end):
    page_links = set()
    for page in range (1, page_end + 1):
        url = "https://stackoverflow.com/questions/tagged/" + tag + "?tab=votes&page=" + str(page) +"&pagesize=50"
        html_text = requests.get(url).content
        soup = BeautifulSoup(html_text, 'lxml')
        links = soup.find_all("a", class_="s-link", href=re.compile("/questions/"))
        for link in links:
            page_links.add("https://stackoverflow.com" + link["href"])
    print (f"Links retrived for stckoverflow question tagged with: {tag}....")
    return list(page_links)
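
The listing URL above is assembled by string concatenation. A minimal sketch of the same URL built with `urllib.parse.urlencode`, which handles the query-string separators; `tagged_questions_url` is a hypothetical helper name:

```python
from urllib.parse import quote, urlencode

def tagged_questions_url(tag, page, page_size=50):
    """Build the vote-sorted listing URL for one tag and page number."""
    query = urlencode({"tab": "votes", "page": page, "pagesize": page_size})
    return f"https://stackoverflow.com/questions/tagged/{quote(tag)}?{query}"

url = tagged_questions_url("nvidia-jetson", 2)
print(url)
```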

> #### Function: Get the content from each page and build one text string from it [uses only the top (1st) answer]

In [87]:
def stack_get_qa(page_links):
    PAGE_QA_TEST = ''
    page_count = 0
    for page in tqdm(page_links):
        html_text = requests.get(page).content
        soup = BeautifulSoup(html_text, 'lxml')
        question_answer = soup.find_all("div", class_="s-prose js-post-body")
        if page_count % 250 == 0:
            time.sleep(5)
        if len(question_answer) < 2: # that means no answer to the question, remove that question/link
            pass
        else:
            PAGE_QA_TEST += question_answer[0].text + question_answer[1].text # 0 has question and 1 has answer
        page_count += 1
    print (f"Processing completed for {page_count} pages....")
    return PAGE_QA_TEST
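
`stack_get_qa` treats the first matched post body as the question and the second as the top answer, and it skips pages with fewer than two bodies. A minimal sketch of that selection rule on plain strings, with no HTTP involved; `first_question_and_answer` and the sample posts are hypothetical:

```python
def first_question_and_answer(post_bodies):
    """Return (question, first answer), or None when the page has no answer."""
    if len(post_bodies) < 2:
        return None
    return post_bodies[0], post_bodies[1]

# Each inner list stands in for the post bodies found on one question page
pages = [
    ["How do I check GPU usage?", "Run nvidia-smi.", "Use a monitoring tool."],
    ["Unanswered question"],
]
qa_text = ""
for bodies in pages:
    pair = first_question_and_answer(bodies)
    if pair is not None:
        question, answer = pair
        qa_text += question + "\n" + answer + "\n"
print(qa_text)
```

The newline separators are an assumption of this sketch; the notebook function concatenates the question and answer text directly.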

> #### Function: Read text file and create Langchain Document

In [88]:
def read_content_from_stack_pages(text_file_name, save_file_name):
    page_loader = UnstructuredFileLoader(text_file_name,
                                         continue_on_failure = True, 
                                         show_progress_bar=False,
                                         post_processors=[clean, clean_extra_whitespace, group_broken_paragraphs])

    elements = page_loader.load()
    full_clean = elements[0].page_content

    PAGE_CONTENT = [Document(page_content=full_clean, metadata={"source":'https://stackoverflow.com'})]
    print (f"Processing completed for 1 file....")
    save_data(save_file_name, PAGE_CONTENT)
    return PAGE_CONTENT

> #### Process Stackoverflow page

In [89]:
link_file_name = '7_NVIDIA_STACK_OVERFLOW_QA_PAGE_LINKS'
docs_file_name = '7_NVIDIA_STACK_OVERFLOW_QA_PAGE_CONTENT'
stack_qa_text  = '7_NVIDIA_STACK_OVERFLOW_QA_TEXT.txt'

tag_list = ['nvidia', 'nvidia-jetson', 'nvidia-jetson-nano', 'nvidia-docker', 'nvidia-digits', 'nvidia-deepstream', 
            'nvidia-smi', 'nvidia-titan', 'nvidia-flex', 'nvidia-isaac', 'nvidia-hpc-compilers', 'nvidia-jetpack-sdk', 'nvidia-sass']
page_count = [74, 9, 6, 6, 2, 2, 1, 1, 1, 1, 1, 1, 1]
question_count = [3690, 401, 291, 268, 91, 56, 35, 12, 10, 10, 9, 5, 3]

PAGE_LINKS = []
for tag, count in zip(tag_list, page_count):
    PAGE_LINKS.extend(stack_page_links(tag, count))
save_data(link_file_name, PAGE_LINKS)                            # Saving the links.....

working_page_links = PAGE_LINKS.copy()                           # links to be processed

working_page_links = check_duplicate_links(working_page_links)
print(f"Total {len(working_page_links)} links to be processed")
STACK_PAGE_QA_TEXT = stack_get_qa(working_page_links)
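with open(save_data_folder + stack_qa_text, 'w', encoding='utf-8') as file:   # write plain text (not a pickle) so the file loader can read it
    file.write(STACK_PAGE_QA_TEXT)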

STACK_PAGE_DOCUMENT = read_content_from_stack_pages(save_data_folder + stack_qa_text, docs_file_name)

Links retrived for stckoverflow question tagged with: nvidia....
Links retrived for stckoverflow question tagged with: nvidia-jetson....
Links retrived for stckoverflow question tagged with: nvidia-jetson-nano....
Links retrived for stckoverflow question tagged with: nvidia-docker....
Links retrived for stckoverflow question tagged with: nvidia-digits....
Links retrived for stckoverflow question tagged with: nvidia-deepstream....
Links retrived for stckoverflow question tagged with: nvidia-smi....
Links retrived for stckoverflow question tagged with: nvidia-titan....
Links retrived for stckoverflow question tagged with: nvidia-flex....
Links retrived for stckoverflow question tagged with: nvidia-isaac....
Links retrived for stckoverflow question tagged with: nvidia-hpc-compilers....
Links retrived for stckoverflow question tagged with: nvidia-jetpack-sdk....
Links retrived for stckoverflow question tagged with: nvidia-sass....
Number of Duplicate links removed: 186
Total 4671 links to be processed

100%|██████████| 4671/4671 [11:08<00:00,  6.99it/s]  


Processing completed for 4671 pages....
Processing completed for 1 file....
