Web_Crawler_CS467

This is the workspace for CS 467 Capstone Software Projects.


Project Instructions

Unzip the project archive or clone the project from the GitHub repository:

$git clone https://github.com/WeiChienHsu/Web_Crawler_CS467.git

Please make sure you have downloaded and installed Node.js, Python 3, and npm on your local machine.

  • You can install npm by installing Node.js. Node.js is an environment for developing server-side applications.
  • When you install Node.js, npm is installed automatically. (https://nodejs.org/en/)

Frontend UI

Under the directory: ./Web_Crawler_CS467/webapp/react-front-end

Run:

$npm install
$npm start

to install the required modules and start the development server, which will listen on port 3000.

Test on local machine

Open your web browser and visit: http://localhost:3000/.

  1. Enter a Starting URL
  2. Select between BFS and DFS crawling method
  3. Select the depth of the searching level
  4. Optionally, enter a keyword
  5. Press the "Search" or "Demo" button
  6. Web Frontend sends the request to the Web Server through a POST request (a sketch of this request follows this list)
  7. Web Frontend receives the crawling result from the Web Server and renders the content to a chart using D3 libraries
  8. If an error message is returned from the Web Server, it will be displayed on the UI.
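
For reference, the sketch below reproduces the frontend's POST request outside the browser. The endpoint path (/bfs) and the field names (url, depth, keyword) are assumptions for illustration only; check the routes in ./Web_Crawler_CS467/webapp for the actual names.

# Hypothetical reproduction of the frontend's POST request to the Web Server.
# The endpoint path and field names are illustrative guesses, not the real API.
import requests

payload = {
    "url": "http://www.google.com",   # starting URL (assumed field name)
    "depth": 1,                       # depth of the search (assumed field name)
    "keyword": "mail",                # optional keyword (assumed field name)
}

# The search method may select the endpoint (e.g. /bfs vs /dfs) or be another
# field in the body; both are guesses here.
resp = requests.post("http://localhost:8080/bfs", json=payload, timeout=60)
print(resp.status_code)
print(resp.json())  # crawl result the frontend renders with D3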

Some good testing examples:

Algo   URL                                      Depth   Keyword
BFS    www.google.com                           1       mail
DFS    https://www.reddit.com/r/OSUOnlineCS/    22      -
DFS    https://yahoo.com                        16      -

Web Server

Under the directory: ./Web_Crawler_CS467/webapp

Run on local machine

Run:

$npm install
$npm start

to install the required modules and start the server, which will listen on port 8080.

Test on local machine

In the webapp directory, you will also find a directory called ‘testing’ with two files that create the testing setup for Postman.

You can import both

  • visualizers_api.postman_collection.json
  • visualizers_api.postman_environment.json

Currently, the app_url variable is set to the cloud-deployed API. If you would like to test against the code running on your local machine, you will have to change this value to http://localhost:8080; one way to do that is sketched below.
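
If you prefer to point app_url at your local server from the command line rather than inside Postman, a small script like the sketch below can rewrite the environment file. It assumes the standard Postman environment export layout (a values array of key/value entries); open the JSON file to confirm before relying on it.

import json

# Path relative to ./Web_Crawler_CS467/webapp (adjust if your layout differs).
path = "testing/visualizers_api.postman_environment.json"

with open(path) as f:
    env = json.load(f)

# Postman environment exports keep variables in a "values" list of
# {"key": ..., "value": ...} entries; point app_url at the local server.
for entry in env.get("values", []):
    if entry.get("key") == "app_url":
        entry["value"] = "http://localhost:8080"

with open(path, "w") as f:
    json.dump(env, f, indent=2)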

Web Crawler

Under the directory: ./Web_Crawler_CS467

To see the algorithms in action with live URLs being printed, run locally with these instructions.

  1. Install the codebase from our GitHub repository

  2. Make sure you have the appropriate python packages installed from our requirements.txt file (Python 3) using:

pip3 install -r requirements.txt

  3. Run the algorithm of your choice from the target file’s directory, in the following format:

BFS - python3 bfs_search.py [target url] [depth] [keyword (optional)]

python3 bfs_search.py http://www.google.com 2
python3 bfs_search.py http://www.google.com 2 stop

DFS - python3 dfs_search.py [target url] [depth] [keyword (optional)]

python3 dfs_search.py http://www.google.com 20
python3 dfs_search.py http://www.google.com 20 stop

  4. Watch as each URL is printed to the console and added to the list that is ultimately returned (here, printed) as one large array object. A sketch of how these scripts might read their arguments follows.
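
The scripts take their parameters positionally, as the usage lines above show. The snippet below is only a sketch of that argument handling; the actual bfs_search.py and dfs_search.py may differ.

import sys

# Positional arguments per the usage above: [target url] [depth] [keyword (optional)]
if len(sys.argv) < 3:
    sys.exit("usage: python3 bfs_search.py [target url] [depth] [keyword (optional)]")

start_url = sys.argv[1]
depth = int(sys.argv[2])
keyword = sys.argv[3] if len(sys.argv) > 3 else None

print(f"Crawling {start_url} to depth {depth}"
      + (f", stopping if '{keyword}' is found" if keyword else ""))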

Project Plan

USER STORIES AND SPECIFICATION

Frontend

  • As a user, I can see the instructions for using this website: When arriving at the main page there will be text and/or a diagram to explain how to use the crawler

  • As a user, I can see a form to enter all required information to start crawling: The homepage will display a web form in the UI that requests the type of search to be performed, the depth of the search, and the starting web page

  • As a user, I can enter a starting web page into the search bar: The starting web page input will require a valid URL as an entry

  • As a user, I can select the method of traversal between BFS/DFS: The method of traversal between BFS/DFS will be selected via radio button

  • As a user, I can enter a numeric limit that stops the crawl from continuing indefinitely: The number of traversals will be a drop down of integers that are available options for that form of traversal

  • As a user, I can press the “Crawl” button to start crawling: The submit button will have the text “crawl”, which when pushed will call the server API (endpoint dependent on search type) to start the search process

  • As a user, I can enter a keyword (optional): The UI will have an optional box to enter a stopping keyword.

  • As a user, I can see a graph representing the crawling results: The UI will use the D3 library to display a web-style graph that shows the various web pages that were hit by the crawler.

  • As a user, I can see a spinner while waiting for the crawling result: While performing the algorithm on the backend, the UI will display a spinning indicator that says “Crawling…” until the response is received

  • As a user, I should be informed if I enter an invalid input

  • As a user, I can see the searching history: The search history will be stored as cookies in the user’s browser

  • As a user, I can clear the search history: There will be a button that will clear the user’s search history

  • As a user, I can see the error messages from Web Server (e.g. Timeout, No Result): If the web server experiences an error, the messaging will be displayed to the user as well as a prompt to please try again

  • As a user, within the graph, I can hover over the node and see the web page title/web page URL: Hovering over a node in the graph will show the web page title and URL for that corresponding node.

  • As a user, within the graph, I can open the corresponding web page in a new tab

  • As a user, within the graph, I can modify the size of the graph: There will be button options to zoom in or out of the graph to look closer at each node and its contents

  • As a Web Frontend, it should be a single web application

  • As a Web Frontend, it can validate the user input not to be empty: The UI will display an error message if the URL entered was empty or not valid input

  • As a Web Frontend, it can validate that the user input is a valid URL: URL validity will be determined using a regular expression check (see the sketch after this list)

  • As a Web Frontend, it can validate the user input against the depth limits allowed for the selected DFS/BFS method

  • As a Web Frontend, it can transmit the user input to the web server through the request header/body: The web page will send the given information in the request body to the web server

  • As a Web Frontend, it can display the crawling result including the title and URL of each page: After receiving a JSON response from the server, the front end will dissect the object to extract the relational and URL information.

  • As a Web Frontend, it can control the crawling result graph so that node details appear only when the user hovers over them, to conserve space

  • As a Web Frontend, the graph should be able to zoom in/out

  • As a Web Frontend, it can lay out the graph so that nodes and lines do not overlap, using an arrow to connect each node

  • As a Web Frontend, it can differentiate the node color by the domain name: Nodes will be colored by domain

  • As a Web Frontend, it can render the whole graph and animation using D3 libraries
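
The URL check above is done in JavaScript inside the React app; the sketch below shows an equivalent regular-expression check in Python purely to illustrate the kind of pattern involved. It is not the pattern the frontend actually uses.

import re

# Illustrative pattern only: optional scheme, a dotted host, optional path.
URL_PATTERN = re.compile(
    r"^(https?://)?"           # optional http:// or https://
    r"[\w.-]+\.[a-zA-Z]{2,}"   # host with at least one dot and a TLD
    r"(/\S*)?$"                # optional path
)

def looks_like_url(text: str) -> bool:
    # Empty input is rejected, matching the "not empty" validation story above.
    return bool(text) and URL_PATTERN.match(text) is not None

print(looks_like_url("www.google.com"))                        # True
print(looks_like_url("https://www.reddit.com/r/OSUOnlineCS/")) # True
print(looks_like_url(""))                                      # False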


Web Server

  • As a Web Server, it can receive the request from Web Frontend: The server will listen for any API requests from the front end.

  • As a Web Server, it can establish communication with the endpoints from Cloud Function and access APIs for both BFS and DFS crawling: The server will determine whether the request is a BFS or DFS search based on the payload from the frontend and will make the appropriate function call to the backend cloud function.

  • As a Web Server, it can parse the request and attach the keyword and level of search into DFS/BFS API from Google Cloud Platform (Cloud Function Service): The server can parse the level and the keyword (if any) from the request to send to the cloud function for crawling

  • As a Web Server, it can receive the response from Cloud Function: The server will wait for a response from the cloud function after its API call

  • As a Web Server, it can return the content from Cloud Function to the Web Frontend: The server will relay the information in JSON back to the frontend as the response to the original API call.

  • As a Web Server, it should respond with an error message and the correct status code to the Frontend: The server will identify any empty response object and return an error to the front end instead of the normal response. The server will also have a fail-safe: if the function takes too long to respond, the server will return an error message to the front end (in case there is an issue with the function that cannot be assessed). This relay-and-timeout behavior is sketched after this list.

  • As a Web Server, it should be hosted on the Google Cloud Platform: The server will be hosted on Google Cloud Platform, which in turn will host the front end of our webpage. It will be accessed via the Google App Engine endpoint (e.g. https://XXXX.appspot.com/)
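
The web server itself is a Node application; the Python snippet below only sketches, in a language-agnostic way, the relay-with-timeout logic described in the stories above. The cloud function URL, field names, and status codes are placeholders, not the project's actual endpoints.

import requests

# Placeholder URL, not the real cloud function endpoint.
CLOUD_FUNCTION_URL = "https://example.cloudfunctions.net/bfs_crawl"

def relay_crawl_request(payload: dict, timeout_seconds: int = 60):
    """Forward the frontend's payload to the cloud function and map failures
    to an error message plus a status code, as the stories above describe."""
    try:
        resp = requests.post(CLOUD_FUNCTION_URL, json=payload, timeout=timeout_seconds)
    except requests.Timeout:
        # Fail-safe: the cloud function took too long to respond.
        return {"error": "Crawler timed out, please try again"}, 504

    data = resp.json() if resp.content else {}
    if not data:
        # Empty response object from the cloud function -> error instead of a normal response.
        return {"error": "No result returned, please try again"}, 502
    return data, 200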


Web Crawler

  • As a Web Crawler, it should expose two APIs (DFS/BFS) from cloud function: The crawler will be hosted in the cloud

  • As a Web Crawler, it can receive the function parameters from Web Server: The crawler will have 2 separate functions for performing the search. One will implement the DFS algorithm and the other will implement the BFS algorithm.

  • As a Web Crawler, it can parse the parameters like URL, level of search, search method and keyword: The function parameters for starting URL, depth, and keyword should be included by the server, and the crawler will have a function to parse the URL, level of search, and keyword out of the payload

  • As a Web Crawler, it should scrape the URL and capture the first URL from the crawling results: The crawler will scrape the page of the given URL until it finds another URL to jump to

  • As a Web Crawler, it can implement both the DFS and BFS algorithms to collect URLs: The crawler should have 2 endpoints available, one for DFS and one for BFS. If the keyword is found, the search will stop and return a boolean indicating the keyword has been found instead of reaching the depth

  • As a Web Crawler, to apply the DFS algorithm, the program will start at the start page, randomly choose one of the links on that page, then follow it to the next page. Then, on the next page, it randomly selects a link from the options available and follows it. This makes a chain from the starting page. This continues until the program hits the page limit indicated.

  • As a Web Crawler, to apply the BFS algorithm, the program will follow ALL links from the start page, and ALL links from each page it visits, until the crawler has reached the limit of pages deep (as measured from the start page) that it should visit.

  • As a Web Crawler, it should limit the number of levels for the BFS method; since this is likely to return a huge, sprawling graph, the user's input for this kind of search should be limited to a small number.

  • As a Web Crawler, it should return JSON data representing the Search Result to the Web Server: The crawler will store its traversals (edges) in a data structure, as well as the URL followed and the webpage title for each stop. There will be a function that converts the data structure into a properly formatted JSON object. The crawler will return an empty object if an issue is encountered. A sketch of this crawl loop and result shape follows this list.
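
To make the traversal stories concrete, the sketch below shows a rough BFS crawl with the keyword stop and a JSON-shaped result. The function name, result fields, and link extraction are illustrative guesses, not the project's actual code, and it assumes the requests package is available.

import json
import re
from collections import deque
from urllib.parse import urljoin

import requests

# Crude link and title extraction; the real crawler may use a proper HTML parser.
HREF_RE = re.compile(r'href="(http[^"]+)"')
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def bfs_crawl(start_url, depth, keyword=None):
    """Breadth-first crawl: follow ALL links level by level up to `depth`,
    stopping early if `keyword` appears on a page. Returns a dict that can
    be serialized to JSON (nodes, edges, keyword_found)."""
    nodes, edges, visited = [], [], set()
    queue = deque([(start_url, 0)])
    keyword_found = False

    while queue and not keyword_found:
        url, level = queue.popleft()
        if url in visited or level > depth:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load

        title = TITLE_RE.search(html)
        nodes.append({"url": url, "title": title.group(1).strip() if title else url})

        if keyword and keyword.lower() in html.lower():
            keyword_found = True  # stop instead of reaching the full depth
            break

        if level < depth:
            for link in HREF_RE.findall(html):
                link = urljoin(url, link)
                edges.append({"source": url, "target": link})
                queue.append((link, level + 1))

    return {"nodes": nodes, "edges": edges, "keyword_found": keyword_found}

if __name__ == "__main__":
    print(json.dumps(bfs_crawl("http://www.google.com", 1, "mail"), indent=2))

The DFS variant would instead pick one link at random from each page and follow that single chain until the depth limit or the keyword is reached.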


Software Architecture

[Architecture diagram]

Frontend React Components

[React component diagram]

Web Server API

[Web Server API diagram]


Use Cases

Use Case 1 - User request to Crawl URLs

[Use Case 1 diagram]

  1. Enter a Starting URL
  2. Select between BFS and DFS crawling method
  3. Select the depth of the searching level
  4. Optionally, enter a keyword
  5. Press the Crawl button
  6. Web Frontend sends the request to the Web Server through a POST request
  7. Web Server receives a JSON-format request carrying the user input in the request body
  8. Web Server calls the Web Crawler hosted on GCP with the required information
  9. Web Crawler crawls URLs using BFS/DFS
  10. Web Crawler wraps the list of URLs in JSON format and returns it to the Web Server
  11. Web Server parses the response and converts it to a JSON format that can be read by the Web Frontend
  12. Web Frontend renders the contents using D3 libraries


Use Case 2 - Searching History

[Use Case 2 diagram]

  1. User enters a starting URL, level of search, and keyword with the DFS method selected
  2. After pressing Crawl, the Frontend first pushes the URL into the Cookies and then starts communicating with the Web Server
  3. When the user requests the Searching History, the Frontend reads it from the Cookies
  4. If the Cookies are empty, the Frontend will display an empty Searching History list and inform the user there is no history
  5. If the Cookies are not empty, the Frontend captures the list of data and displays it in the UI
  6. User requests to clear the Searching History
  7. Frontend calls the cleanup method to delete the Cookies and returns a status indicating the data was successfully removed
  8. Frontend displays the empty list, as in step 4.

