Multi-Webbing

A multi-threaded library for web scraping in Python, built on the Python threading modules. It supports both requests and selenium for making web requests.

Set Up

Install the module with pip:
```
 pip install multi_webbing
```

Import the module into your Python file:

    from multi_webbing import multi_webbing as mw

Set the number of threads and create a MultiWebbing object. By default this uses the requests module, but you can switch to selenium by passing the web_module="selenium" option to MultiWebbing.
```
 num_threads = 4
 my_threads = mw.MultiWebbing(num_threads)  # initialize threading
```
Start the threads. The threads will now continuously check the work queue for work.
```
 my_threads.start()
```
To put a job in the queue, call the job_queue.put() method of the multi-webbing object.
```
 my_threads.job_queue.put(mw.Job(job_id, job_function, url, [job_data, job_type]))
```
When you are ready, stop the threads:
```
 my_threads.finish()
```
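
The start / put / finish lifecycle above can be pictured with a standard-library sketch. This is not Multi-Webbing's implementation: worker, stop_event, and the lambda jobs are hypothetical stand-ins, and only queue and threading from the standard library are used.

```python
import queue
import threading


def worker(job_queue, stop_event, results, lock):
    """Poll the queue until asked to stop, running each job as it arrives."""
    while not stop_event.is_set():
        try:
            # Wake up regularly so the stop flag is re-checked.
            job = job_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        value = job()  # in this sketch a "job" is just a zero-argument callable
        with lock:
            results.append(value)
        job_queue.task_done()


job_queue = queue.Queue()
stop_event = threading.Event()
results = []
lock = threading.Lock()

# Start the workers (the analogue of my_threads.start()).
threads = [
    threading.Thread(target=worker, args=(job_queue, stop_event, results, lock))
    for _ in range(4)
]
for t in threads:
    t.start()

# Queue some work (the analogue of my_threads.job_queue.put(...)).
for n in range(10):
    job_queue.put(lambda n=n: n * n)

# Wait for the work to complete, then stop the workers
# (the analogue of my_threads.finish()).
job_queue.join()
stop_event.set()
for t in threads:
    t.join()

print(sorted(results))
```

The workers wake up every 50 ms to check the stop flag, which is one simple way to let idle threads shut down cleanly.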

You might find it useful to check the size of the queue in a loop before calling finish:

    while my_threads.job_queue.qsize() > 0:
        pass
    my_threads.finish()
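
A loop that only executes pass keeps a CPU core busy while it waits. The sketch below shows a gentler variant, assuming the job queue behaves like a standard queue.Queue (the README does not state this). It uses a stand-in queue and a hypothetical worker rather than the library itself, and it also illustrates that an empty queue means every job has been picked up, not necessarily that every job has finished.

```python
import queue
import threading
import time

job_queue = queue.Queue()
done = []


def worker():
    while True:
        item = job_queue.get()
        if item is None:  # sentinel: no more work
            job_queue.task_done()
            break
        time.sleep(0.01)  # simulate a slow web request
        done.append(item)
        job_queue.task_done()


t = threading.Thread(target=worker)
t.start()
for i in range(5):
    job_queue.put(i)

# Poll with a short sleep instead of `pass`, so the main thread yields the CPU.
while job_queue.qsize() > 0:
    time.sleep(0.01)

# qsize() == 0 only means every job has been taken off the queue; the last
# one may still be running. join() blocks until task_done() has been called
# for every item that was put.
job_queue.join()

job_queue.put(None)  # tell the worker to exit
t.join()
```

Whether finish() itself waits for in-flight jobs is not stated here, so treat the join() step as a property of this sketch only.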

Job Function

When creating a job, you need to pass a job function that the thread will call to do some work.

A job has 3 required arguments and 2 optional ones:

Required Arguments

url

The URL of the webpage to be worked on.

job_function

The function the thread should call when it picks the job out of the queue. See Job Function.

custom_data

A free-form argument for passing any data that the job function needs to access.

Optional Arguments

session

A requests.Session object. If this is not set, the job will use the session set when the MultiWebbing object was instantiated.

lock

A threading.Lock object. If this is not set, the job will use the lock set when the MultiWebbing object was instantiated.

Returning Data From Threads

It is not possible to return data directly from a worker thread to the main thread using the "return" statement.

Instead, create a list or dictionary in the main thread and pass it in the custom_data argument of the job. You can then use

    dictionary.update()

or

    list.append()

in the job function. The main thread will be able to access the updated/appended data. Note that while the update and append methods are thread safe, some other functions may not be (e.g. json.dumps()), and you may need to wrap them in a lock to prevent a race condition.

Multiple variables and data structures can be accessed in the job by placing them in a list.
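
As a rough illustration of the approach described above, the following standard-library sketch passes a list of shared objects to each thread and unpacks it inside the function. It does not use Multi-Webbing: job_function here takes plain arguments instead of a job object, and results, counter, and label are hypothetical names chosen for the example.

```python
import threading


def job_function(custom_data, lock):
    # Unpack the list, as a job function would unpack job.custom_data.
    results, counter, label = custom_data

    # A single dict.update() call, which the text above treats as thread safe.
    results.update({label: len(label)})

    # A read-modify-write such as += is not a single step, so guard it.
    with lock:
        counter["jobs_done"] += 1


results = {}
counter = {"jobs_done": 0}
lock = threading.Lock()

threads = [
    threading.Thread(target=job_function, args=([results, counter, label], lock))
    for label in ("alpha", "beta", "gamma")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
print(counter)
```

Because the threads mutate the same dictionary objects that the main thread created, nothing needs to be returned; the main thread simply reads results and counter after the threads finish.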

Job Function

The job function will be called from a thread when it gets a job from the queue.

An example using the requests module:

    def job_function(job):
        # custom_data is the list that was passed in when the job was created
        job_data = job.custom_data[0]  # in this example, a dictionary that collects the scraped data
        job_type = job.custom_data[1]  # in this example, a string

        get_url_success = job.get_url()  # request the URL
        if get_url_success:  # check that the request connected
            if job.request.status_code == 200:  # check that the URL was received OK
                # update/append are thread safe, but other operations elsewhere (e.g. json.dumps) might not be
                with job.lock:
                    if job_type == "jobtype1":  # do something
                        job_data.update({"key1": "val3", "key2": "val4"})
                    if job_type == "jobtype2":  # do something different
                        job_data.update({"key1": "val3", "key2": "val4"})

Using requests, you can access the request object by calling job.request. For example, to obtain the text attribute from the visited page:

    text = job.request.text

Using selenium you can access the webdriver by calling job.driver, for example:

    element = job.driver.find_element_by_xpath('xpath_string')
