This project is a comprehensive Python-based crawler and indexer designed to explore, crawl, and index hidden services on the Tor network (".onion" sites). The system can analyze and visualize relationships between different sites, generate reverse content indices, and provide TF-IDF based search capabilities over the crawled content.
- Tor Integration: Establishes a secure session over the Tor network to anonymously crawl
.onionsites. - Crawling Mechanism: A robust multi-threaded crawler that can handle retries, validate URLs, and avoid crawling already indexed sites.
- Content Indexing: Indexes the content of crawled sites, creating a reverse index for fast search and retrieval.
- Link Relationship Mapping: Captures and maps relationships between different
.onionlinks discovered during the crawl. - TF-IDF Based Search: Implements a simple search engine using TF-IDF to rank indexed pages based on query relevance.
- Visualization: Generates visual representations of term frequency (word clouds) and site relationships (network graphs).
-
TorSession: Manages the connection to the Tor network, establishing a secure session for crawling
.onionsites. -
LinkCrawler: Crawls the dark web, discovering new
.onionlinks, indexing content, and managing relationships between links. -
SearchEngineCrawler: Queries dark web search engines to discover additional
.onionlinks. -
LinkIndexManager: Handles storing and retrieving indexed links.
-
ReverseContentIndexManager: Manages the reverse index, mapping search terms to the documents in which they appear.
-
LinkRelationshipManager: Tracks relationships between different
.onionsites, such as mutual links and loops. -
KPIManager: Calculates and stores KPIs, such as:
- Mean response time of crawled sites
- Ratio of mutual links
- Count of link loops where a link points to itself
- Total number of links indexed
- Total number of terms indexed
-
VisualizationManager: Generates word clouds and link graphs to visualize the indexed content and relationships.
-
REPL and QueryEngine: Provides an interactive shell for querying the indexed data.
-
Clone the Repository:
git clone https://github.com/yourusername/tor-crawler-indexer.git cd tor-crawler-indexer -
Install Dependencies: Ensure you have Python 3.x installed. Install the required Python packages using pip:
-
Configure Tor Authentication Cookie: Before running the crawler, make sure that the Tor authentication cookie is properly configured. This is essential for establishing a secure session with the Tor network. You can typically find the
control_auth_cookiein the Tor data directory, often located at/var/lib/tor. Ensure that your TorSession class is configured to read this cookie. -
Run the Crawler: Ensure you have the Tor service running on your system. Start the crawler by running the main script:
python tor_search_csv_xx.py
-
Starting the Crawler: The crawler starts automatically and begins to explore the
.onionsites listed in the seed URLs. It indexes the content and discovers new links recursively. -
Querying the Index: Use the REPL to interact with the indexed data. Type
query <search terms>to search the index. -
Visualization: The system automatically generates visualizations of the data, including term clouds and link clouds, saved as
term_cloud.pngandlink_cloud.pngrespectively.
- tor_search_csv_14.py: The main script containing all the classes and logic for crawling, indexing, and visualization.
- link_index.json: Stores the indexed data of crawled links.
- reverse_content_index.json: Contains the reverse index of terms used for search queries.
- link_relationship.json: Tracks the discovered relationships between different
.onionsites. - discovered_onion_links.json: Logs all
.onionlinks discovered during the crawl. - crawled_sites.json: Stores metadata about the sites that have been crawled, including their status and response times.
- term_cloud.png: A visual representation of the most common terms in the indexed content.
- link_cloud.png: A network graph showing the interconnections between discovered
.onionsites.
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for more details.


