
WebCrawler

This software was developed while I was an intern at Intellinet Systems Pvt. Ltd. [ https://www.intellinetsystem.com ]. It is based on Scrapy, a free and open-source web-crawling framework written in Python. Originally designed for web scraping, Scrapy can also be used to extract data using APIs or as a general-purpose web crawler, and it is currently maintained by Scrapinghub Ltd., a web-scraping development and services company. This project uses customized spiders/bots designed to fetch contacts and other information from the exhibitors listed on the target websites.

Customized Spiders for Data Extraction

Installation

Use the Scrapy Installation Guide to install Scrapy and set up your environment for designing customized spiders. Once installed, Scrapy can be imported directly:

import scrapy

Imports

import csv
import re
import scrapy
import pandas as pd
import xlrd
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse, urlsplit
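
The Item and Field imports above are typically used to declare the structure of the scraped records; a minimal, hypothetical example (the class and field names are illustrative, not taken from the project):

class ExhibitorContact(Item):
    # Hypothetical fields for one scraped contact record
    website = Field()
    email = Field()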

Custom Spiders

These customized spiders were built after a complete analysis of each target website and its architecture; they scrape the entire footer of a site to collect email addresses and other contact details.
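
As an illustration of that footer-scraping approach, here is a minimal sketch (not the project's exact code) that pulls email addresses and phone-like strings out of a page footer; the //footer selector and the phone regex are assumptions about the target markup:

def parse_footer(self, response):
    # Flatten every text node inside the footer into one string
    footer_text = ' '.join(response.xpath('//footer//text()').extract())
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', footer_text)
    phones = re.findall(r'\+?\d[\d\s\-()]{7,}\d', footer_text)
    yield {'Website URL': response.url, 'Emails': emails, 'Contacts': phones}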

Snippet to Scrape Multiple Links Stored in a Text File

This snippet handles scraping multiple websites whose URLs are listed in a file.

class name_of_your_spider(scrapy.Spider):

    name = 'name_of_your_spider'
    allowed_domains = []
    start_urls = []

    # rules = [Rule(LinkExtractor(), callback='parse', follow=True)]

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Read the file of URLs (space- or line-separated) and queue
        # every entry as a start URL for the crawl.
        if filename:
            with open(filename, newline='') as f:
                content = csv.reader(f, delimiter=' ')
                for row in content:
                    self.start_urls.extend(row)
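
One possible way to run this spider from plain Python (the filename urls.txt is only an example) is through Scrapy's CrawlerProcess, which forwards keyword arguments such as filename to the spider's __init__:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(name_of_your_spider, filename='urls.txt')
process.start()  # blocks until the crawl finishes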

Snippet to Scrape Multiple Webpages of a Website

# all_links holds the relative links gathered earlier in the callback,
# e.g. all_links = response.xpath('//a/@href').extract()
for link in all_links:
    link = str(response.url) + str(link)

    yield scrapy.http.Request(url=link,
                              callback=self.parse_item, dont_filter=True)
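
For context, a minimal sketch of how such a loop might sit inside the spider's parse method, using the LinkExtractor that is already imported (an assumption about how the links are gathered; LinkExtractor yields absolute URLs, so no manual concatenation is needed):

def parse(self, response):
    # Extract every link on the page and crawl it with parse_item
    for link in LinkExtractor().extract_links(response):
        yield scrapy.http.Request(url=link.url,
                                  callback=self.parse_item, dont_filter=True)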

Define a callback function to handle the responses from these requests

def parse_item(self, response):

    url = response.url

    # Collect anything in the anchor text that looks like an email address
    email = response.xpath('*//a/text()').re(r'[\w\.-]+@[\w\.-]+')
    # email = response.xpath('*//p/text()').re(r'[\w\.-]+@[\w\.-]+')
    email = list(email)
    if len(email) == 0:
        email = ['']  # keep one empty entry so the URL is still reported

    for it in email:
        # Reduce the full URL to scheme://host/ for the report
        parsed_uri = urlparse(url)
        result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        scraped_info = {'Website URL': result,
                        'Email Scraped': it}

        yield scraped_info
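
To turn these items into a report, one option is a small item pipeline; the sketch below is hypothetical (the class name and report.csv filename are not from the original project) and appends each yielded dictionary to a CSV file:

class CsvReportPipeline:

    def open_spider(self, spider):
        # Open the report once per crawl and write the header row
        self.file = open('report.csv', 'w', newline='')
        self.writer = csv.DictWriter(
            self.file, fieldnames=['Website URL', 'Email Scraped'])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item

The pipeline would be enabled through ITEM_PIPELINES in the project settings; alternatively, Scrapy's built-in feed exports can write the same CSV without extra code.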

Reports

The Reports are now open-sourced :)

Demonstration

Video Explanation