Tor Crawler and Indexer

This project is a comprehensive Python-based crawler and indexer designed to explore, crawl, and index hidden services on the Tor network (".onion" sites). The system can analyze and visualize relationships between different sites, generate reverse content indices, and provide TF-IDF based search capabilities over the crawled content.

Features

Tor Integration: Establishes a secure session over the Tor network to anonymously crawl .onion sites.
Crawling Mechanism: A robust multi-threaded crawler that can handle retries, validate URLs, and avoid crawling already indexed sites.
Content Indexing: Indexes the content of crawled sites, creating a reverse index for fast search and retrieval.
Link Relationship Mapping: Captures and maps relationships between different .onion links discovered during the crawl.
TF-IDF Based Search: Implements a simple search engine using TF-IDF to rank indexed pages based on query relevance.
Visualization: Generates visual representations of term frequency (word clouds) and site relationships (network graphs).

Key Components

TorSession: Manages the connection to the Tor network, establishing a secure session for crawling .onion sites.
LinkCrawler: Crawls the dark web, discovering new .onion links, indexing content, and managing relationships between links.
SearchEngineCrawler: Queries dark web search engines to discover additional .onion links.
LinkIndexManager: Handles storing and retrieving indexed links.
ReverseContentIndexManager: Manages the reverse index, mapping search terms to the documents in which they appear.
LinkRelationshipManager: Tracks relationships between different .onion sites, such as mutual links and loops.
KPIManager: Calculates and stores KPIs, such as:
- Mean response time of crawled sites
- Ratio of mutual links
- Count of link loops where a link points to itself
- Total number of links indexed
- Total number of terms indexed
VisualizationManager: Generates word clouds and link graphs to visualize the indexed content and relationships.
REPL and QueryEngine: Provides an interactive shell for querying the indexed data.

Installation

Clone the Repository:

git clone https://github.com/yourusername/tor-crawler-indexer.git
cd tor-crawler-indexer

Install Dependencies: Ensure you have Python 3.x installed. Install the required Python packages using pip:
Configure Tor Authentication Cookie: Before running the crawler, make sure that the Tor authentication cookie is properly configured. This is essential for establishing a secure session with the Tor network. You can typically find the control_auth_cookie in the Tor data directory, often located at /var/lib/tor. Ensure that your TorSession class is configured to read this cookie.
Run the Crawler: Ensure you have the Tor service running on your system. Start the crawler by running the main script:
```
python tor_search_csv_xx.py
```

Usage

Starting the Crawler: The crawler starts automatically and begins to explore the .onion sites listed in the seed URLs. It indexes the content and discovers new links recursively.
Querying the Index: Use the REPL to interact with the indexed data. Type query <search terms> to search the index.
Visualization: The system automatically generates visualizations of the data, including term clouds and link clouds, saved as term_cloud.png and link_cloud.png respectively.

Files

tor_search_csv_14.py: The main script containing all the classes and logic for crawling, indexing, and visualization.
link_index.json: Stores the indexed data of crawled links.
reverse_content_index.json: Contains the reverse index of terms used for search queries.
link_relationship.json: Tracks the discovered relationships between different .onion sites.
discovered_onion_links.json: Logs all .onion links discovered during the crawl.
crawled_sites.json: Stores metadata about the sites that have been crawled, including their status and response times.
term_cloud.png: A visual representation of the most common terms in the indexed content.
link_cloud.png: A network graph showing the interconnections between discovered .onion sites.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
img		img
information_retrieval		information_retrieval
yay		yay
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tor Crawler and Indexer

Features

Key Components

Installation

Usage

Files

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tor Crawler and Indexer

Features

Key Components

Installation

Usage

Files

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages