Crawlit: Web Crawler with Streamlit

This project is a web crawler based on Scrapy (and Crowl.tech), enriched by a Streamlit user interface to visualize and analyze the results.

Features

Web Crawler: Uses Scrapy (Crowl) to browse and collect data from specified websites.
Streamlit interface: Allows interactive visualization and analysis of collected data, including a distribution of PageRanks.
CSV Export: Ability to export collected data in a CSV format for further processing.
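
As an illustration of the CSV export feature, here is a minimal sketch that serializes crawl rows with Python's standard csv module. The function name rows_to_csv and the column names (url, status, depth) are assumptions made for the example, not the project's actual schema.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows; a real crawl would produce these from the crawler output.
example_rows = [
    {"url": "https://example.com/", "status": 200, "depth": 0},
    {"url": "https://example.com/about", "status": 200, "depth": 1},
]
csv_text = rows_to_csv(example_rows, ["url", "status", "depth"])
```

In a Streamlit app, text like this could be offered to the user through st.download_button.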

PageRank

This project computes a PageRank for each crawled page using a method inspired by the original algorithm and extended with the concept of the reasonable surfer, in which the links on a page are not all equally likely to be followed.
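
To make the idea concrete, the sketch below runs a weighted PageRank by power iteration in plain Python. It is not the project's implementation: the function name, the input shape (a dict of source URL to {target URL: weight}), the damping factor of 0.85 and the link weights are all assumptions chosen for illustration. The weights stand in for the reasonable surfer idea, where a prominent link receives a larger share of the source page's rank than a footer link.

```python
def weighted_pagerank(links, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (illustrative sketch).

    links maps each source page to a dict of {target page: link weight}.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Total outgoing weight per page; zero means the page is dangling.
    out_weight = {p: sum(links.get(p, {}).values()) for p in pages}
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by dangling pages is redistributed evenly.
        dangling = sum(rank[p] for p in pages if out_weight[p] == 0)
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if out_weight[source] == 0:
                continue
            for target, weight in targets.items():
                # Each link passes rank in proportion to its weight.
                share = weight / out_weight[source]
                new_rank[target] += damping * rank[source] * share
        rank = new_rank
    return rank


# Hypothetical site: the link to "about" is weighted above the footer link.
example_links = {
    "home": {"about": 3.0, "footer": 1.0},
    "about": {"home": 1.0},
    "footer": {"home": 1.0},
}
ranks = weighted_pagerank(example_links)
```

With equal weights on every link this reduces to the classic random surfer model.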

Visualization with ECharts

We use ECharts, an open source visualization library, to display the distribution of PageRanks across the crawled pages, along with the distribution of response statuses, links by depth, and other information.

Chart Features:

Type: We opted for a bar graph to clearly visualize the distribution of PageRank scores.
Tooltips: By hovering over each bar, you can see a tooltip that shows the precise number of URLs with that PageRank score.
Axes: The X axis shows the PageRank score (from 1 to 10), while the Y axis shows the number of URLs corresponding to each score.
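
The chart described above could be expressed as an ECharts option dictionary along the following lines. This is a sketch under assumptions: the function name pagerank_bar_options is hypothetical, and it expects scores that are already integers from 1 to 10 (how the project scales raw PageRank values to that range is not covered here).

```python
from collections import Counter


def pagerank_bar_options(scores):
    """Build an ECharts option dict from integer PageRank scores in 1..10."""
    counts = Counter(scores)
    buckets = list(range(1, 11))
    return {
        # Show a tooltip with the URL count when hovering over a bar.
        "tooltip": {"trigger": "axis"},
        "xAxis": {
            "type": "category",
            "name": "PageRank",
            "data": [str(b) for b in buckets],
        },
        "yAxis": {"type": "value", "name": "URLs"},
        # One bar per score; scores with no URLs are drawn as zero.
        "series": [{"type": "bar", "data": [counts.get(b, 0) for b in buckets]}],
    }


# Illustrative scores only, not real crawl results.
options = pagerank_bar_options([1, 1, 2, 2, 2, 3, 5, 10])
```

With streamlit_echarts installed, a dictionary like this can be rendered with st_echarts(options=options).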

Preview in pictures

Dependencies

Scrapy
Crowl
Igraph
Streamlit
streamlit_echarts
pymysql
twisted
adbapi (part of Twisted: twisted.enterprise.adbapi)
streamlit_apexjs

Installation

Clone repository

    git clone https://github.com/drogbadvc/crawlit.git

Navigate to the project directory

    cd crawlit

Install dependencies

    pip install -r requirements.txt

Execution

    streamlit run graph-streamlit.py

Use

Streamlit Interface: Go to http://localhost:8501 in your browser after launching Streamlit.
Web Crawler: Please see the Scrapy documentation for more details on running and configuring spiders.
