# First Things First

To start, I'll build a basic web crawler. Then, using the domains from the URDB source URLs file, I'll randomly sample 1000 of them to see what their robots.txt files say. To keep it simple, I'll first determine which files are empty or missing, then check what the remaining ones say. For reference, the standards for constructing a robots.txt file [are explained here](http://www.robotstxt.org/robotstxt.html).

In [1]:
import scrapy
import pandas as pd

In [2]:
URLs = pd.read_table('URDB_URLs_9-24-18.txt', names = ['Tariff Sheet URL'])
URLs

| | Tariff Sheet URL |
|---|---|
| 0 | http://www.iid.com/Modules/ShowDocument.aspx?d... |
| 1 | http://villageofarcade.org/departments/public-... |
| 2 | http://cecelec.coopwebbuilder.com/sites/cecele... |
| 3 | http://www.wakeforestnc.gov/client_resources/r... |
| 4 | http://www.rrvcoop.com/content/security-lights |
| 5 | http://www.prairielandelectric.com/Rates_PDF/M... |
| 6 | http://www.talquinelectric.com/rates_elec.aspx |
| 7 | http://www.midstateelectric.coop/customer-serv... |
| 8 | http://www.surprisevalleyelectric.org/_uploads... |
| 9 | http://www.mainepublicservice.com/media/40402/... |


# The Procedure

Based on a very brief look over these data, it seems I'll need to do a few things to clean these URLs up first:

1. Check how many unique values there are relative to the DataFrame size
2. Reduce all URLs to their primary domain (e.g. https://www.nyseg.com/)
3. Remove all duplicates
4. Slap "robots.txt" on the end
5. Drop duplicate URLs again, since we probably now have more!
6. Run these through scrapy

In [3]:
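# Drop exact duplicate tariff sheet URLs
URLs.drop_duplicates(inplace = True)

# Keep only the scheme and host (e.g. http://www.iid.com), then append /robots.txt.
# `https?` matches http or https; expand = False returns a Series instead of a DataFrame.
URLs['Utility Domain URL'] = URLs['Tariff Sheet URL'].str.extract(pat = r'(https?://[^/]+)', expand = False)
URLs['robots.txt URL'] = URLs['Utility Domain URL'] + '/robots.txt'

# Rows still differ by tariff sheet URL, so this whole-row check leaves
# repeated robots.txt URLs in place; nunique() below counts the distinct ones
URLs.drop_duplicates(inplace = True)
URLs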

| | Tariff Sheet URL | Utility Domain URL | robots.txt URL |
|---|---|---|---|
| 0 | http://www.iid.com/Modules/ShowDocument.aspx?d... | http://www.iid.com | http://www.iid.com/robots.txt |
| 1 | http://villageofarcade.org/departments/public-... | http://villageofarcade.org | http://villageofarcade.org/robots.txt |
| 2 | http://cecelec.coopwebbuilder.com/sites/cecele... | http://cecelec.coopwebbuilder.com | http://cecelec.coopwebbuilder.com/robots.txt |
| 3 | http://www.wakeforestnc.gov/client_resources/r... | http://www.wakeforestnc.gov | http://www.wakeforestnc.gov/robots.txt |
| 4 | http://www.rrvcoop.com/content/security-lights | http://www.rrvcoop.com | http://www.rrvcoop.com/robots.txt |
| 5 | http://www.prairielandelectric.com/Rates_PDF/M... | http://www.prairielandelectric.com | http://www.prairielandelectric.com/robots.txt |
| 6 | http://www.talquinelectric.com/rates_elec.aspx | http://www.talquinelectric.com | http://www.talquinelectric.com/robots.txt |
| 7 | http://www.midstateelectric.coop/customer-serv... | http://www.midstateelectric.coop | http://www.midstateelectric.coop/robots.txt |
| 8 | http://www.surprisevalleyelectric.org/_uploads... | http://www.surprisevalleyelectric.org | http://www.surprisevalleyelectric.org/robots.txt |
| 9 | http://www.mainepublicservice.com/media/40402/... | http://www.mainepublicservice.com | http://www.mainepublicservice.com/robots.txt |


In [4]:
#URLs['Tariff Sheet URL'].str.split(pat = r'^https*://.+/', expand = True)
#This doesn't do what I want right now, but gets the filenames, so maybe useful down the road?

In [5]:
URLs['robots.txt URL'].nunique()

1968
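
With the distinct robots.txt URLs counted, the random sample of 1000 mentioned at the top could be drawn along these lines. This is only a sketch: the `toy` frame and its `example` hosts are made up to keep the snippet self-contained, and `sample_robots_urls` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the real URLs DataFrame (hosts are invented for illustration)
toy = pd.DataFrame({
    'robots.txt URL': [
        'http://a.example/robots.txt',
        'http://a.example/robots.txt',   # same domain reached from a second tariff sheet
        'http://b.example/robots.txt',
        None,                            # a row where domain extraction failed
        'http://c.example/robots.txt',
    ]
})

def sample_robots_urls(frame, n, seed=0):
    """Return up to n distinct, non-null robots.txt URLs, sampled reproducibly."""
    unique_urls = frame['robots.txt URL'].dropna().drop_duplicates()
    # Cap n so small inputs do not raise; random_state fixes the draw
    return unique_urls.sample(n=min(n, len(unique_urls)), random_state=seed)

sampled = sample_robots_urls(toy, n=2)
print(sampled.tolist())
```

On the real data the call would presumably be `sample_robots_urls(URLs, n=1000)`. Deduplicating before sampling means each domain has the same chance of being picked, regardless of how many tariff sheets point at it.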

In [6]:
URLs.isnull().sum()

Tariff Sheet URL        0
Utility Domain URL    228
robots.txt URL        228
dtype: int64
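
One way to start investigating those 228 nulls is to pull out the rows where extraction failed and look at them directly. The sketch below uses invented URLs, and the two failure modes it shows (a missing scheme and an upper-case scheme) are guesses at possible causes, not something confirmed against the URDB file. `domain_root` is a hypothetical fallback built on the standard library's `urllib.parse.urlparse`.

```python
import pandas as pd
from urllib.parse import urlparse

# Invented examples of URLs a case-sensitive http(s) pattern would miss
toy = pd.DataFrame({'Tariff Sheet URL': [
    'http://www.example.com/rates/sheet1.pdf',
    'www.example.org/tariffs.html',      # no scheme at all
    'HTTP://EXAMPLE.NET/Rates.aspx',     # upper-case scheme
]})

toy['Utility Domain URL'] = toy['Tariff Sheet URL'].str.extract(
    r'(https?://[^/]+)', expand=False)

# Rows the pattern could not handle; inspecting these is the first step
missing = toy[toy['Utility Domain URL'].isnull()]
print(missing['Tariff Sheet URL'].tolist())

def domain_root(url):
    """Return scheme://host, assuming http when the URL has no scheme."""
    url = url.strip()
    if '://' not in url:
        url = 'http://' + url
    parts = urlparse(url)  # urlparse lower-cases the scheme for us
    if not parts.netloc:
        return None
    return f'{parts.scheme}://{parts.netloc.lower()}'

toy['Fallback Domain URL'] = toy['Tariff Sheet URL'].map(domain_root)
print(toy['Fallback Domain URL'].tolist())
```

Whether a fallback like this is appropriate depends on what the real null rows turn out to contain, so the inspection step should come first.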

In [7]:
URLs['Crawling_Allowed'] = False

In [9]:
URLs['Num_Ratepayers'] = 0

# TO DO
1. Figure out why these 228 nulls exist and correct them
2. Figure out rules for what we care about in robots.txt file
    1. Check if file exists, if not set `Crawling_Allowed = True`
    2. Check if file is blank. If so, set `Crawling_Allowed = True`
    3. If file doesn't have a line `User-agent: *`, then set `Crawling_Allowed = True`. This is fine because our bot doesn't exist yet and thus couldn't be disallowed or allowed by name in the first place.
    4. If the code below is in robots.txt and nothing else, set `Crawling_Allowed = True`
        `User-agent: *`
        `Disallow: `
    5. Check for the code below in the file. If it exists, leave `Crawling_Allowed = False` and move on.
        `User-agent: *`
        `Disallow: /`
    6. For those remaining with `User-agent: *`:
        1. Compare the parent directory (one level up) of each tariff sheet URL against the disallowed paths, and set `Crawling_Allowed = True` if that directory isn't explicitly disallowed
3. For those with `Crawling_Allowed = True`, use their `Num_Ratepayers` value as their score. The final metric, "% Allowed", will be `(URLs['Num_Ratepayers'] * URLs['Crawling_Allowed']).sum()/URLs['Num_Ratepayers'].sum()`
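
As a rough sketch of how the rules and the metric above might fit together, the snippet below leans on the standard library's `urllib.robotparser` instead of hand-written string checks. That is a substitution on my part: the parser prefix-matches the full URL path against each `Disallow` rule rather than comparing only the one-level-up parent directory, so results could differ slightly from the rule as written. The helper name `crawling_allowed`, the `example` hosts, the robots.txt bodies, and the ratepayer counts are all invented for illustration.

```python
import pandas as pd
from urllib.robotparser import RobotFileParser

def crawling_allowed(robots_text, tariff_url, agent='*'):
    """Decide whether tariff_url may be crawled given a robots.txt body.

    robots_text is None (or any non-string) when the file is missing.
    """
    if not isinstance(robots_text, str) or not robots_text.strip():
        return True  # missing or blank file
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # With agent '*', only the `User-agent: *` group applies; a file that
    # names other bots but has no such group allows everything.
    return parser.can_fetch(agent, tariff_url)

# Hypothetical rows: (tariff sheet URL, robots.txt body, ratepayers)
rows = [
    ('http://a.example/rates/sheet.pdf', None, 100),
    ('http://b.example/rates/sheet.pdf', 'User-agent: *\nDisallow: /', 300),
    ('http://c.example/rates/sheet.pdf', 'User-agent: *\nDisallow: /private/', 50),
    ('http://d.example/rates/sheet.pdf', 'User-agent: googlebot\nDisallow: /', 50),
]

toy = pd.DataFrame({
    'Tariff Sheet URL': [url for url, _, _ in rows],
    'Num_Ratepayers': [count for _, _, count in rows],
    'Crawling_Allowed': [crawling_allowed(text, url) for url, text, _ in rows],
})

# Ratepayer-weighted share of utilities whose tariff sheets may be crawled
pct_allowed = (toy['Num_Ratepayers'] * toy['Crawling_Allowed']).sum() / toy['Num_Ratepayers'].sum()
print(toy['Crawling_Allowed'].tolist(), pct_allowed)
```

Here only the second row is blocked, so 200 of the 500 hypothetical ratepayers sit behind crawlable sites.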