RSSCrawlerArabicCorpus is a crawler for multiple RSS feed sites, written in Java. Both text and images can be scraped via HTML parsing from different Arabic news websites.
CSS selector expressions are used to specify the DOM locations of the text and the image paths.
SHA256 is used instead of MD5 to digest URLs.
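
As an illustration of the URL digest step, the following is a minimal sketch using the standard `java.security.MessageDigest` API. The class name `UrlDigest`, the method name `sha256Hex`, and the sample URL are assumptions made for this example and are not taken from the project's source.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the project's actual class.
class UrlDigest {

    /** Returns the SHA-256 digest of a URL as a 64-character lowercase hex string. */
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Two hex digits per byte, zero-padded.
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL for demonstration only.
        System.out.println(sha256Hex("https://example.com/rss/item-1"));
    }
}
```

A fixed-length hex digest like this can serve as a compact key for recognizing URLs that have already been crawled; whether the project uses it exactly this way is not stated here.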
1. Create a new MySQL database and write its name in the "sys_conf.txt" file.
2. Edit the settings in the file "sys_conf.txt" to suit your project.
3. Run the project.
- All the parameters for the crawler are initialized from a file named sys_conf.txt. This file specifies:
- The saving path for the crawled data
- File path of an XML file containing the URLs of the RSS sites and the CSS selectors for their text and image content
- Source of the news
- URL of the source
- Snippet of code from the line containing the title in related links
- Snippet of code from the line containing the domain in related links
- Language name defined in crawl-sites.xml
- Database name
- Username for the MySQL database
- Password for the MySQL database
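
The exact layout of sys_conf.txt is not documented here, so the following is only a sketch of how such settings could be read. It assumes one `key=value` pair per line, parsed with `java.util.Properties`. The key names (`data_dir`, `db_name`), the class `SysConf`, and the `localhost:3306` host and port are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical loader; the real sys_conf.txt format and key names may differ.
class SysConf {

    /** Parses settings from text with one key=value pair per line. */
    static Properties parse(String text) {
        Properties conf = new Properties();
        try {
            conf.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }

    /** Returns a setting, failing early if it is missing or blank. */
    static String require(Properties conf, String key) {
        String value = conf.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return value.trim();
    }

    /** Builds a MySQL JDBC URL; host and port are assumed defaults. */
    static String jdbcUrl(Properties conf) {
        return "jdbc:mysql://localhost:3306/" + require(conf, "db_name");
    }

    public static void main(String[] args) {
        // Placeholder values standing in for the contents of sys_conf.txt.
        Properties conf = parse("data_dir=/tmp/corpus\ndb_name=arabic_news\n");
        System.out.println(require(conf, "data_dir"));
        System.out.println(jdbcUrl(conf));
    }
}
```

Failing early on a missing setting makes a misconfigured run stop before any crawling or database access starts.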
- An XML file must be provided that specifies the feed channels and the CSS selectors for the text and image content in the DOM tree.
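
The schema of this XML file is not described in this README, so the sketch below invents one purely for illustration: a `<sites>` root with `<site>` elements holding `<feed>`, `<text>`, and `<image>` children. These element names, the `CrawlSites` class, and the sample values are assumptions. Only the standard `javax.xml.parsers` DOM API is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for an assumed XML layout; the real file may differ.
class CrawlSites {

    /** One feed channel with the CSS selectors for its article text and image. */
    static final class Site {
        final String name, feedUrl, textSelector, imageSelector;

        Site(String name, String feedUrl, String textSelector, String imageSelector) {
            this.name = name;
            this.feedUrl = feedUrl;
            this.textSelector = textSelector;
            this.imageSelector = imageSelector;
        }
    }

    /** Parses every <site> element of the given XML document. */
    static List<Site> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("site");
            List<Site> sites = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                sites.add(new Site(e.getAttribute("name"),
                        child(e, "feed"), child(e, "text"), child(e, "image")));
            }
            return sites;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Invalid crawl-sites XML", ex);
        }
    }

    /** Returns the trimmed text of the first child with the given tag, or "". */
    private static String child(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Placeholder site entry; the URL and selectors are invented.
        String xml = "<sites><site name=\"ExampleNews\">"
                + "<feed>https://example.com/rss</feed>"
                + "<text>div.article p</text>"
                + "<image>div.article img</image>"
                + "</site></sites>";
        for (Site s : parse(xml)) {
            System.out.println(s.name + " " + s.feedUrl + " " + s.textSelector);
        }
    }
}
```

The selector strings read this way could then be handed to jsoup's `Document.select` to extract the matching elements from each article page.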
- JawharaFM
jsoup-*.*.*.jar
mysql-connector-java-*.*.**-bin.jar
We used the project "https://github.com/MingjieQian/RSSFeedCrawler" as the basis for our project and modified it to suit our data needs.