<a href="https://colab.research.google.com/github/Y-Srivaishnavi/nano_jsGPT/blob/main/data_extraction/js_docs_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [46]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

# Initialize a queue with the starting URL
url_queue = deque([' https://developer.mozilla.org/en-US/docs/Web/JavaScript'])

# Set to store visited URLs to avoid revisiting pages
visited_urls = set()

# Maximum number of pages to scrape
max_pages = 2000

# Function to scrape a single page
def scrape_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Targeting only the specified article for textual content
        article_content = soup.find('article', class_='main-page-content', lang='en-US')
        if article_content:
            paragraphs = [p.text.strip() for p in article_content.find_all('p')]
        else:
            paragraphs = []

        # Extract links from the entire page, not just the article, to continue the crawl
        links = [a.get('href') for a in soup.find_all('a', href=True)]
        for link in links:
            absolute_link = urljoin(url, link)
            if urlparse(absolute_link).netloc == urlparse(url).netloc and absolute_link not in visited_urls:
                url_queue.append(absolute_link)

        return {
            'url': url,
            'paragraphs': paragraphs
        }

    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None


# Main loop to process URLs in the queue
scraped_data = []
pages_scraped = 0  # Counter for pages scraped

while url_queue and pages_scraped < max_pages:
    current_url = url_queue.popleft()
    if current_url not in visited_urls:
        print(f"Scraping {current_url}...")
        data = scrape_page(current_url)
        if data:
            scraped_data.append(data)
            pages_scraped += 1  # Increment the counter after a successful scrape
        # Mark the current URL as visited
        visited_urls.add(current_url)


Scraping  https://developer.mozilla.org/en-US/docs/Web/JavaScript...
Scraping https://developer.mozilla.org/en-US/docs/Web/JavaScript#content...
Scraping https://developer.mozilla.org/en-US/docs/Web/JavaScript#top-nav-search-input...
Scraping https://developer.mozilla.org/en-US/docs/Web/JavaScript#languages-switcher-button...
Scraping https://developer.mozilla.org/en-US/...
Scraping https://developer.mozilla.org/en-US/docs/Web...
Scraping https://developer.mozilla.org/en-US/docs/Web/HTML...
Scraping https://developer.mozilla.org/en-US/docs/Web/CSS...
Scraping https://developer.mozilla.org/en-US/docs/Web/JavaScript...
Scraping https://developer.mozilla.org/en-US/docs/Web/HTTP...
Scraping https://developer.mozilla.org/en-US/docs/Web/API...
Scraping https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions...
Scraping https://developer.mozilla.org/en-US/docs/Learn...
Scraping https://developer.mozilla.org/en-US/docs/Learn/HTML...
Scraping https://developer.mozilla.org/en-US/d

  k = self.parse_starttag(i)



Scraping https://developer.mozilla.org/en-US/advertising...
Scraping https://developer.mozilla.org/en-US/docs/MDN/Community/Issues...
Scraping https://developer.mozilla.org/en-US/community...
Scraping https://developer.mozilla.org/discord...
Scraping https://developer.mozilla.org/en-US/docs/MDN/Writing_guidelines/Attrib_copyright_license...
Scraping https://developer.mozilla.org/en-US/#content...
Scraping https://developer.mozilla.org/en-US/#top-nav-search-input...
Scraping https://developer.mozilla.org/users/fxa/login/authenticate/?next=%2Fen-US%2F...
Scraping https://developer.mozilla.org/en-US/blog/regular-expressions-reference-updates/...
Scraping https://developer.mozilla.org/en-US/blog/aria-accessibility-html-landmark-roles/...
Scraping https://developer.mozilla.org/en-US/docs/Web/API/Performance_API...
Scraping https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_nesting...
Scraping https://developer.mozilla.org/en-US/blog/introducing-ai-help/...
Scraping https://developer.mozi

In [47]:
# Function to save scraped data to a file
def print_retrieved_data(data):
    i = 0
    for page_data in data:
        if i < 5:
          print(f"URL: {page_data['url']}")
          print("Paragraphs:")
          for paragraph in page_data['paragraphs']:
              print(paragraph)
          print("-" * 80)
          print()  # Add a newline after each page_data
          i += 1
        else:
          break

print_retrieved_data(scraped_data)

URL:  https://developer.mozilla.org/en-US/docs/Web/JavaScript
Paragraphs:
JavaScript (JS) is a lightweight interpreted (or just-in-time compiled) programming language with first-class functions. While it is most well-known as the scripting language for Web pages, many non-browser environments also use it, such as Node.js, Apache CouchDB and Adobe Acrobat. JavaScript is a prototype-based, multi-paradigm, single-threaded, dynamic language, supporting object-oriented, imperative, and declarative (e.g. functional programming) styles.
JavaScript's dynamic capabilities include runtime object construction, variable parameter lists, function variables, dynamic script creation (via eval), object introspection (via for...in and Object utilities), and source-code recovery (JavaScript functions store their source text and can be retrieved through toString()).
This section is dedicated to the JavaScript language itself, and not the parts that are specific to Web pages or other host environments. Fo

In [48]:
print(len(scraped_data))

2000


In [52]:
import re

def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)  # Example: Remove non-alphabetic characters
    cleaned_text = cleaned_text.lower()  # Example: Convert to lowercase
    return cleaned_text

# Clean the text
new_scraped_data = {}
i = 0
for data in scraped_data:
  for line in data['paragraphs']:
    new_scraped_data[i] = clean_text(line)
    i+=1

In [53]:
for j in range(10):
  print(new_scraped_data[j])

javascript js is a lightweight interpreted or justintime compiled programming language with firstclass functions while it is most wellknown as the scripting language for web pages many nonbrowser environments also use it such as nodejs apache couchdb and adobe acrobat javascript is a prototypebased multiparadigm singlethreaded dynamic language supporting objectoriented imperative and declarative eg functional programming styles
javascripts dynamic capabilities include runtime object construction variable parameter lists function variables dynamic script creation via eval object introspection via forin and object utilities and sourcecode recovery javascript functions store their source text and can be retrieved through tostring
this section is dedicated to the javascript language itself and not the parts that are specific to web pages or other host environments for information about apis that are specific to web pages please see web apis and dom
the standards for javascript are the ecma

In [55]:
with open('js_docs_text_data.txt', 'w') as dump:
  for j in range(i):
    print(new_scraped_data[j], file=dump)

In [54]:
print(i)

32641
