Skip to content

Welcome to the GitBook Scraper project! This tool is designed to scrape content from GitBook sites and save it in Markdown format. πŸ“„βœ¨

Notifications You must be signed in to change notification settings

codedByCan/GitBook_Scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 

Repository files navigation

🌐 GitBook Scraper

Welcome to the GitBook Scraper project! This tool is designed to scrape content from GitBook sites and save it in Markdown format. πŸ“„βœ¨

πŸš€ Technologies Used

  • Puppeteer: For headless browser automation.
  • fs-extra: For file system operations.
  • path: For handling and transforming file paths.

πŸ› οΈ Installation

First, clone the repository to your local machine. Then, navigate to the project directory and install the necessary dependencies:

npm install

πŸ“ˆ Usage

To run the scraper, execute the following command:

node scraper.js

This will start the scraper, which will navigate to the specified GitBook site, extract the content, and save it to a result.md file in the project directory.

πŸ“ Project Structure

  • scraper.js: The main file that runs the web scraper.

πŸ“œ scraper.js Overview

  • BASE_URL: The URL of the GitBook site to be scraped.
  • OUTPUT_FILE: The path to the file where the scraped content will be saved.

Workflow

  1. Browser Initialization: Launch a headless browser using Puppeteer.
  2. Navigate to URL: Go to the specified URL and wait until the page is fully loaded.
  3. Expand Menus: Click on any expandable menu items to ensure all content is visible.
  4. Collect Links: Gather all the links to the pages within the GitBook site.
  5. Scrape Content: Visit each page, extract the relevant content, and format it in Markdown.
  6. Save to File: Write the collected content to the specified output file.
  7. Close Browser: Shut down the browser once the scraping is complete.

Sample Code Snippets

Browser Initialization:

const browser = await puppeteer.launch({ 
  headless: "new", 
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});

Navigating to URL:

await page.goto(BASE_URL, { waitUntil: 'networkidle0', timeout: 60000 });

Collecting Links:

const links = await page.evaluate(() => {
  const anchors = Array.from(document.querySelectorAll('a'));
  return anchors.map(anchor => anchor.href);
});

Scraping Content:

const content = await page.evaluate(() => {
  return document.querySelector('.page-inner').innerText;
});

Saving to File:

await fs.writeFile(OUTPUT_FILE, markdownContent);

πŸ›‘οΈ Error Handling

If an error occurs during scraping, it will be logged to the console, and the scraper will continue processing the remaining pages.

catch (error) {
  console.error('Unexpected error:', error);
}

πŸ“¬ Contact

If you have any questions or suggestions, feel free to reach out to fastuptime.

About

Welcome to the GitBook Scraper project! This tool is designed to scrape content from GitBook sites and save it in Markdown format. πŸ“„βœ¨

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published