<span style="display:block; border-top: 2px solid #39FF14;"></span>


# <span style="color:#39FF14; font-family: monospace;">Harvest Notebook</span>
### <span style="color:#39FF14; font-family: monospace;">Almond House Analytics</span>

<span style="color:#39FF14; font-family: monospace;">
To use the Harvest Notebook, provide a list of URLs in the `data/input/url-list.csv` file.<br>
The file must contain a single column named `url`.<br>
Alternatively, URLs can be listed directly in code in the cell labeled `To read URL(s) directly from code` below.<br>
To use that option, uncomment that cell and comment out the cell labeled `To read URL list from input file`.
</span>
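As a minimal sketch of the expected input layout, the snippet below writes a CSV with a single `url` header and reads it back. The helper name `write_url_list` is hypothetical and not part of the notebook, and a temporary directory stands in for the project root so nothing in the real `data/input` folder is touched.

```python
import csv
import os
import tempfile


def write_url_list(path, urls):
    """Write `urls` to `path` as a CSV with a single 'url' header column."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])  # the header the notebook looks up via df['url']
        writer.writerows([u] for u in urls)


# Assumption: a temporary directory replaces the project root for this sketch.
root = tempfile.mkdtemp()
input_file = os.path.join(root, "data", "input", "url-list.csv")
write_url_list(input_file, ["https://www.target.com", "https://www.walmart.com"])

# Read the file back to confirm that every row exposes a 'url' field.
with open(input_file, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

In the notebook itself, the same file would be loaded with `pd.read_csv(INPUT_FILE)`.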

<span style="color:#39FF14; font-family: monospace;">
Harvest Notebook writes two output files to a timestamped subdirectory of `data/output`.<br>
The `url-analysis.csv` file provides general site information about each URL in the input list.<br>
The `website-words.csv` file provides a count of each unique word found at each URL in the input list.
</span>
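To illustrate what "a word count for each unique word for each URL" could look like, here is a rough sketch. It is not the implementation in `src.harvest`: the function `count_words`, the tokenizing rule, and the `count` field name are all assumptions made for illustration. Only the `url` and `word` keys are implied by the notebook's sort step.

```python
import re
from collections import Counter


def count_words(url, text):
    """Return one row per unique word in `text`, tagged with its source URL."""
    # Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [
        {"url": url, "word": word, "count": n}  # 'count' is a hypothetical column name
        for word, n in sorted(counts.items())
    ]


rows = count_words("https://www.target.com", "Save more. Save on more.")
```

Each dictionary corresponds to one row of a file shaped like `website-words.csv`, so a URL whose page contains three distinct words contributes three rows.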

<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Configure Notebook</span>

In [1]:
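import os
from datetime import datetime

import pandas as pd

from src.harvest import harvest_url, extract_word_counts


def complete(task):
    """Print a short confirmation that a notebook step finished."""
    print(f"Completed {task}!")


def setup_output(output_dir, filename):
    """Return the full path of an output file inside output_dir."""
    return os.path.join(output_dir, filename)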

INPUT_FILE = "./data/input/url-list.csv"
OUTPUT_DIR = "./data/output"

complete("notebook configuration")

Completed notebook configuration!


<span style="display:block; border-top: 2px solid #39FF14;"></span>


<span style="color:#39FF14; font-family: monospace;">To read URL(s) directly from code:</span>

In [2]:
# urls = [
#     "https://www.target.com",
#     "https://www.walmart.com"
# ]

# df = pd.DataFrame(urls, columns=["url"])

# complete("dataframe loading")

<span style="color:#39FF14; font-family: monospace;">To read URL list from input file:</span>

In [3]:
df = pd.read_csv(INPUT_FILE)

complete("dataframe loading")

Completed dataframe loading!


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Configure Output</span>

In [4]:
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
output = os.path.join(OUTPUT_DIR, timestamp)
os.makedirs(output, exist_ok=True)

output1 = setup_output(output, "url-analysis.csv")
output2 = setup_output(output, "website-words.csv")

complete("output configuration")

Completed output configuration!


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Collect Data from URLs</span>

In [5]:
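results1 = []  # one site-information record per URL (url-analysis.csv)
results2 = []  # one record per unique word per URL (website-words.csv)

for url in df['url']:
    site_data = harvest_url(url)
    results1.append(site_data)
    word_data = extract_word_counts(url)
    results2.extend(word_data)
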
complete("harvest loop")

results1.sort(key=lambda x: x['url'])
rdf1 = pd.DataFrame(results1)

complete("url analysis sort")

results2.sort(key=lambda x: (x['url'], x['word']))
rdf2 = pd.DataFrame(results2)

complete("website words sort")

Completed harvest loop!
Completed url analysis sort!
Completed website words sort!
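The loop above stops at the first URL that raises an exception. The following is a hedged sketch of one way to keep going and record failures instead. It assumes that `harvest_url` raises on sites it cannot process, which the notebook does not confirm. `harvest_all` and `fake_harvest` are hypothetical names introduced only to make the sketch runnable.

```python
def harvest_all(urls, harvest_fn):
    """Call harvest_fn on each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(harvest_fn(url))
        except Exception as exc:  # broad on purpose: the real exception types are unknown
            failures.append({"url": url, "error": str(exc)})
    return results, failures


# Hypothetical stand-in for harvest_url; it rejects anything that is not https.
def fake_harvest(url):
    if not url.startswith("https://"):
        raise ValueError("unsupported scheme")
    return {"url": url}


results, failures = harvest_all(
    ["https://www.walmart.com", "ftp://example.com", "https://www.target.com"],
    fake_harvest,
)
results.sort(key=lambda row: row["url"])  # same ordering rule as the notebook
```

Keeping failures in their own list makes it possible to write them to a separate file or retry them later without losing the successful rows.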


<span style="display:block; border-top: 2px solid #39FF14;"></span>


### <span style="color:#39FF14; font-family: monospace;">Process Output</span>

In [6]:
rdf1.to_csv(output1, index=False)
rdf2.to_csv(output2, index=False)

complete("output processing")

Completed output processing!


<span style="display:block; border-top: 2px solid #39FF14;"></span>
