<span style="display:block; border-top: 2px solid #39FF14;"></span>


# <span style="color:#39FF14; font-family: monospace;">Harvest Notebook</span>
### <span style="color:#39FF14; font-family: monospace;">Almond House Analytics</span>

<span style="color:#39FF14; font-family: monospace;">
To use the Harvest Notebook, provide a list of URLs in the `data/input/url-list.csv` file.<br>
The program expects a single `url` column.<br>
Alternatively, URLs can be set in code in the cell `To read URL(s) directly from code` below.<br>
To use that option, uncomment that cell and comment out the cell `To read URL list from input file`.
</span>
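As a minimal sketch of the expected input layout, the snippet below writes a CSV with a single `url` header and reads it back. The helper name `write_url_list` and the temporary demo path are assumptions for illustration and are not part of the notebook; in practice the file would live at `data/input/url-list.csv`.

```python
import csv
import os
import tempfile


def write_url_list(urls, path):
    """Write URLs to a CSV file with a single 'url' column (hypothetical helper)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])  # the only column the notebook expects
        for url in urls:
            writer.writerow([url])
    return path


# Demo: write to a temporary directory so nothing in the project is overwritten.
demo_path = os.path.join(tempfile.mkdtemp(), "url-list.csv")
write_url_list(["https://www.target.com", "https://www.walmart.com"], demo_path)

# Read the file back to confirm its shape.
with open(demo_path, newline="", encoding="utf-8") as handle:
    demo_rows = list(csv.DictReader(handle))
```

A file written this way should load with `pd.read_csv(INPUT_FILE)` and expose the `url` column that the harvest loop iterates over.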

<span style="color:#39FF14; font-family: monospace;">
Harvest Notebook writes two output files to a timestamped subdirectory of the `data/output` directory.<br>
Output file `url-analysis.csv` provides general site information about each URL in the input list.<br>
Output file `website-words.csv` provides a count of each unique word for each URL in the input list.
</span>
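To illustrate the per-URL word count described above, here is a rough sketch that counts unique words in a piece of text. It is not the implementation in `src.harvest`: the function name `count_words`, the tokenizing regex, and the `count` key are assumptions, while the `url` and `word` keys mirror the ones the notebook sorts on.

```python
import re
from collections import Counter


def count_words(url, text):
    """Return one row per unique word found in text (hypothetical helper)."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Emit rows in alphabetical order of the word.
    return [
        {"url": url, "word": word, "count": count}
        for word, count in sorted(counts.items())
    ]


rows = count_words("https://example.com", "Fresh almonds, fresh harvest")
for row in rows:
    print(row)
```

Rows in this shape can be collected across URLs and passed to `pd.DataFrame` to build a table like `website-words.csv`.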

<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Configure Notebook</span>

In [1]:
from datetime import datetime
import os
import pandas as pd

from src.robot import is_allowed
from src.harvest import url_analysis, website_words

INPUT_FILE = "./data/input/url-list.csv"
OUTPUT_DIR = "./data/output"

def complete(task):
    """Print a short confirmation once a notebook step finishes."""
    print(f"Completed {task}!")

def setup_output(output_dir, filename):
    """Return the path of an output file inside output_dir."""
    return os.path.join(output_dir, filename)

complete("notebook configuration")

Completed notebook configuration!


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Input Options</span>

<span style="color:#39FF14; font-family: monospace;">To read URL(s) directly from code:</span>

In [2]:
# urls = [
#     "https://www.target.com",
#     "https://www.walmart.com"
# ]

# df = pd.DataFrame(urls, columns=["url"])

# complete("dataframe loading")

<span style="color:#39FF14; font-family: monospace;">To read URL list from input file:</span>

In [3]:
df = pd.read_csv(INPUT_FILE)

complete("dataframe loading")

Completed dataframe loading!


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Configure Output</span>

In [4]:
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
output = os.path.join(OUTPUT_DIR, timestamp)
os.makedirs(output, exist_ok=True)

output_analysis = setup_output(output, "url-analysis.csv")
output_words = setup_output(output, "website-words.csv")

complete("output configuration")

Completed output configuration!
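The cell above names each run's output folder after the current time. As a small illustration, the sketch below applies the same format string to a fixed, made-up date so the resulting folder name is predictable; the date itself is an assumption chosen only for the example.

```python
from datetime import datetime

# A fixed example moment, used instead of datetime.now() so the result is stable.
example_moment = datetime(2024, 1, 2, 3, 4, 5)

# Same format string as the notebook cell above.
example_timestamp = example_moment.strftime('%Y-%m-%d_%H-%M-%S')
print(example_timestamp)  # 2024-01-02_03-04-05
```

Because the name begins with the year and ends with the seconds, output folders sort chronologically by name, and each run gets its own folder as long as two runs do not start within the same second.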


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Collect Data from URLs</span>

In [None]:
results_analysis = []
results_words = []

for url in df['url']:
    # Only harvest pages that robots.txt permits.
    if is_allowed(url):
        analysis_data = url_analysis(url)
        results_analysis.append(analysis_data)

        words_data = website_words(url)
        results_words.extend(words_data)
    else:
        print(f"Skipping {url} due to robots.txt restrictions.")

complete("harvest loop")

results_analysis.sort(key=lambda x: x['url'])
rdf_analysis = pd.DataFrame(results_analysis)
complete("url analysis sort")

results_words.sort(key=lambda x: (x['url'], x['word']))
rdf_words = pd.DataFrame(results_words)
complete("website words sort")

Skipping https://www.tpwd.texas.gov/state-parks due to robots.txt restrictions.
Skipping https://www.dnr.state.mn.us/state_parks due to robots.txt restrictions.
Skipping https://www.princeton.edu due to robots.txt restrictions.
Skipping https://www.umich.edu due to robots.txt restrictions.
Completed harvest loop!
Completed url analysis sort!
Completed website words sort!
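The skipped URLs above come from the `is_allowed` check, whose source in `src.robot` is not shown here. One common way to make this kind of decision uses the standard library's `urllib.robotparser`; the sketch below is an assumption about the general approach, not the notebook's actual code. The function name `is_allowed_by` and the sample rules are hypothetical, and the rules are passed in as text so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed_by(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

blocked = is_allowed_by(SAMPLE_ROBOTS, "https://example.com/private/page")
permitted = is_allowed_by(SAMPLE_ROBOTS, "https://example.com/public")
print(blocked, permitted)  # False True
```

A live check would typically fetch the site's robots.txt first, for example with `RobotFileParser.set_url` followed by `read`, before calling `can_fetch`.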


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Process Output</span>

In [None]:
rdf_analysis.to_csv(output_analysis, index=False)
rdf_words.to_csv(output_words, index=False)

complete("output processing")

Completed output processing!


<span style="display:block; border-top: 2px solid #39FF14;"></span>
