# Getting Started
This notebook uses IPython Parallel to crawl multiple sites simultaneously. To get started, we first have to set up our environment. We need to connect to the IPython cluster we set up on SeaWulf and verify that all of the processors we requested are available:

In [2]:
import ipyparallel as ipp
c = ipp.Client()
print('Number of processors available: ' + str(len(c.ids)))

Number of processors available: 0
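
A count of 0, as in the output above, means the client did not see any engines when the cell ran. The following is a minimal sketch of a guard that stops early in that case. The helper name `check_engines` and the `expected` argument are hypothetical (not part of ipyparallel); in the notebook you would pass `c.ids` and the number of processors you requested.

```python
def check_engines(engine_ids, expected):
    """Return a status line, or raise if fewer engines than expected are visible.

    engine_ids: list of engine IDs, e.g. the `ids` attribute of the client above.
    expected:   number of processors you requested (an assumption; set it yourself).
    """
    available = len(engine_ids)
    if available < expected:
        raise RuntimeError(
            'Only {} of {} requested processors are available.'.format(available, expected)
        )
    return 'All {} requested processors are available.'.format(expected)


# In the notebook this would be called as check_engines(c.ids, expected=5).
# A plain list stands in for c.ids here so the sketch runs without a cluster.
print(check_engines([0, 1, 2, 3, 4], expected=5))
```

Raising an error, rather than only printing the count, keeps later cells from running against an empty cluster.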


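We can verify that MPI is working using the parallel cell magic `%%px` and some simple code. Run this cell after the client above has been created, since the parallel magics only become available once a client exists in the session.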

In [1]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("Processor " + str(rank) + " reporting for duty.")

ERROR:root:Cell magic `%%px` not found.


# Limits

The Web crawlers we use are modified versions of the Python libraries iCrawler (https://pypi.python.org/pypi/icrawler/0.4.7) and InstagramCrawler (https://github.com/iammrhelo/InstagramCrawler).

### Baidu, Bing, Flickr, Google, and Photobucket

The iCrawler-based script is used to scrape the sites listed above. The limits for each of these sites are:

* Baidu, Bing, and Google limit you to the first 1000 search results per crawl.
* Photobucket's top-100 RSS feed only contains 100 images (surprise!)
* Flickr limits you to 3600 images per hour, requires that you obtain an API key, and will ban the key if you exceed this limit.
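
The limits above can be written down as a small lookup table, which makes it easy to cap a request before starting a crawl. This is only a sketch: `SITE_LIMITS`, `FLICKR_HOURLY_LIMIT`, `cap_maxnum`, and `flickr_min_delay` are hypothetical names, and the numbers are copied from the list above rather than queried from the sites.

```python
# Per-crawl caps from the list above. Sites that are missing have no fixed per-crawl cap.
SITE_LIMITS = {'baidu': 1000, 'bing': 1000, 'google': 1000, 'photobucket': 100}

# Flickr is limited per hour rather than per crawl.
FLICKR_HOURLY_LIMIT = 3600


def cap_maxnum(site, maxnum):
    """Return maxnum reduced to the site's per-crawl cap, if it has one."""
    limit = SITE_LIMITS.get(site)
    return maxnum if limit is None else min(maxnum, limit)


def flickr_min_delay():
    """Seconds to wait between Flickr downloads to stay within the hourly limit."""
    return 3600.0 / FLICKR_HOURLY_LIMIT


for site in ['google', 'photobucket', 'flickr']:
    print(site, cap_maxnum(site, 5000))
print('Flickr delay (s):', flickr_min_delay())
```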

### Instagram and Smugmug

The InstagramCrawler-based scripts can be used to scrape both Instagram and Smugmug. Instead of using each site's API, these scripts mimic a very polite human user who doesn't exceed access limits. Since they don't use any official API, there aren't strict limits on the number of images you can download. However, since they mimic a human user with explicit sleep timers, they can take significantly longer than the iCrawler-based script.
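
To illustrate the idea of explicit sleep timers, here is a minimal sketch of a polite download loop. It is not the code used by the InstagramCrawler-based scripts: `polite_download`, its parameters, and the delay values are all assumptions chosen for illustration.

```python
import random
import time


def polite_download(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests.

    fetch: any callable that downloads one item (hypothetical; supply your own).
    sleep: injectable so the pause can be replaced when testing.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # A randomized pause looks less like an automated burst of requests.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and no real waiting.
print(polite_download(['a', 'b', 'c'], fetch=str.upper, sleep=lambda seconds: None))
```

With the default delays, each additional image adds between 2 and 5 seconds, which is why this style of crawler is slow for large downloads.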

# Crawling the Web

The following block of code is set up to crawl each site you select in parallel, with one processor per site. Modify keyword to change the term you search each site for, maxnum to change the maximum number of images downloaded from any one Website, and crawlers to select which sites you will crawl.

In [2]:
%%px
# keyword to search for
keyword = 'seal'

# maximum number of images to download per Website
maxnum = 100

# list of Websites to crawl. format should be a list of lowercase strings
# available options are: baidu, bing, flickr, google, instagram, photobucket, smugmug
crawlers = ['baidu', 'bing', 'google', 'photobucket', 'smugmug']

# your Flickr API key as a string (leave as False if you are not crawling Flickr)
flickr_api = False

from mpi4py import MPI
# we sometimes get warnings for invalid images. these are annoying at best.
import warnings
warnings.filterwarnings("ignore")

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if (rank < len(crawlers)):
    site = crawlers[rank]
    print('Processor ' + str(rank) + ' crawling ' + site.title())
    if site in ['baidu','bing','google']:
        maxnum = min(maxnum, 1000) # these sites limit you to 1000 images per search
        %run ./crawlers/icrawl.py $keyword $maxnum -c $site
    elif site == 'flickr':
        if not flickr_api:
            print('You must provide your Flickr API key in the flickr_api variable above to crawl Flickr.')
        else:
            %run ./crawlers/icrawl.py $keyword $maxnum -c $site
    elif site == 'photobucket':
        maxnum = min(maxnum, 100) # only 100 images max on this site
        %run ./crawlers/icrawl.py $keyword $maxnum -c $site
    elif site == 'instagram':
        %run ./crawlers/instagramcrawler.py $keyword $maxnum
    elif site == 'smugmug':
        %run ./crawlers/smugmugcrawler.py $keyword $maxnum
    else:
        print('"' + site + '" is an invalid crawler.')

ERROR:root:Cell magic `%%px` not found.
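
The cell above assigns exactly one site to each rank, so if fewer processors are available than sites in crawlers, the remaining sites are not crawled. One possible alternative, sketched below, is to deal the sites out round-robin so that every site is covered regardless of the number of processors. The helper `sites_for_rank` is hypothetical; in the notebook, `size` would come from `comm.Get_size()`.

```python
def sites_for_rank(crawlers, rank, size):
    """Return the sites handled by this rank when sites are dealt out round-robin."""
    return crawlers[rank::size]


crawlers = ['baidu', 'bing', 'google', 'photobucket', 'smugmug']

# Simulate two processors sharing five sites.
for rank in range(2):
    print('Processor', rank, 'crawls', sites_for_rank(crawlers, rank, size=2))
```

Each rank would then loop over its own list and run the matching crawler for every entry, instead of handling a single site.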


# Viewing Images

If you ran the above code block with properly set variables, images should have been downloaded into per-site subdirectories of the ~/seals_geo_survey/images directory. You can confirm that these directories exist using the %ls magic command. Feel free to modify the directory name to navigate further (e.g. %ls ~/seals_geo_survey/images/google).

In [3]:
%ls ~/seals_geo_survey/images/google

ls: cannot access '/home/bento/seals_geo_survey/images/google': No such file or directory
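
As an alternative to running %ls on each directory, the sketch below counts the files in every subdirectory of the image directory in one pass. It assumes one subdirectory per site, as in the %ls example above, and the helper name `count_images` is hypothetical. A missing directory, as in the output above, yields an empty result instead of an error.

```python
from pathlib import Path


def count_images(root):
    """Map each subdirectory of root to the number of files it contains."""
    root = Path(root).expanduser()
    if not root.is_dir():
        # Nothing has been downloaded yet, or the path is wrong.
        return {}
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


print(count_images('~/seals_geo_survey/images'))
```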


You can use the code block below to view a specific image by filling in the image_path variable. (e.g. image_path = './images/google/fuzzy_seal.jpg')

In [None]:
# fill in the path to the image you want to view
image_path = './images/'

from IPython.display import Image
Image(filename=image_path)

# Notes

When you crawl Instagram for "seal" or "seals" you get around 750k posts, but many of them show Navy SEALs or are otherwise invalid. If you search for "furseal" or "furseals" you get better results. Similar things happen with the other crawlers. I would recommend performing a manual search to assess the content and adjust your keywords before crawling.

You can specify multiple keywords for the search engine sites, such as keyword = 'antarctica+seal'.
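
If you keep your search terms in a list, a multi-keyword string in the format above can be built with a small helper. `build_keyword` is a hypothetical name used only for this sketch.

```python
def build_keyword(*terms):
    """Join search terms with '+', skipping empty entries."""
    return '+'.join(t.strip() for t in terms if t.strip())


keyword = build_keyword('antarctica', 'seal')
print(keyword)
```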

Images are automatically downloaded into the ~/seals_geo_survey/images directory. Future stages of the pipeline depend on this structure, so do not move anything around manually.