# Scrape Content and Build a Hugo Site

### Steps
* Set up imports and helper functions
* Import the site data from a CSV file
* For each site, scrape the teaser content (title, price, and URL of the more-information page)
* Combine all the site data into one data frame
* Export this data to JSON and CSV files for use on the Hugo site and in other projects (e.g. a Drupal import)
* Build the Hugo site
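
The export step is not shown in this part of the notebook. As a minimal sketch, assuming each scraped row is a dict with the `site`, `title`, `url`, and `price` keys used later on, the same records could be written as both JSON and CSV with the standard library. The sample row and the `export_rows` helper are hypothetical; with pandas data frames, `DataFrame.to_json` and `DataFrame.to_csv` would be the natural equivalents.

```python
import csv
import io
import json

FIELDS = ["site", "title", "url", "price"]

def export_rows(rows):
    """Return (json_text, csv_text) for a list of row dicts (hypothetical helper)."""
    json_text = json.dumps(rows, indent=2)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return json_text, buffer.getvalue()

# Made-up sample row, for illustration only
sample_rows = [
    {"site": "Example Tours", "title": "Sample Package",
     "url": "/packages/sample/", "price": "1299"},
]

json_text, csv_text = export_rows(sample_rows)
print(csv_text)
```

Writing to in-memory buffers keeps the sketch free of file paths; swapping `io.StringIO()` for an opened file would produce the files on disk.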

## Setup and Helper Functions

### Import the necessary Python modules

In [149]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import hashlib
import subprocess

### Variable Setup

In [150]:
timestamp_full = datetime.today().strftime('%Y-%m-%d-%H:%M:%S')
timestamp_day  = datetime.today().strftime('%Y-%m-%d')

# Site Import Data
site_data_file = 'importfiles/sitedata.csv'
site_data_df = pd.DataFrame(columns=["company","site", "collection"])

# Site Scrape Data


### Standard Helper Functions

In [151]:
# Log processing commands to see where a site may have failed
def log_processing(url):
    print('  ---> processing ' + url)

# Pull out everything but numbers and letters from the imported string
def returnNumbersAndLettersOnly(oldString):
    newString = ''.join(e for e in oldString if e.isalnum())
    return newString

# Pull out everything but numbers from the imported string
def returnNumbersOnly(oldString):
    newString = ''.join(e for e in oldString if e.isdigit())
    return newString

# Build the default request headers; the Referer is set to the URL being requested
def getHeadersObject(url):
    headers = {
        'User-Agent': "PostmanRuntime/7.18.0",
        'Accept': "*/*",
        'Cache-Control': "no-cache",
        'Accept-Encoding': "gzip, deflate",
        'Referer': url,
        'Connection': "keep-alive"
    }
    return headers
    
# Return item[index]; if the index does not exist, return the placeholder string 'null'
def pop(item, index):
    try:
        return item[index]
    except IndexError:
        return 'null'
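
To make the helpers' behavior concrete, here is a small self-contained sketch. The functions are restated in compact form so the block runs on its own, and the input strings are made-up examples rather than scraped values.

```python
def returnNumbersAndLettersOnly(oldString):
    return ''.join(e for e in oldString if e.isalnum())

def returnNumbersOnly(oldString):
    return ''.join(e for e in oldString if e.isdigit())

def pop(item, index):
    try:
        return item[index]
    except IndexError:
        return 'null'

# Spaces and punctuation are dropped; letters and digits are kept
slug = returnNumbersAndLettersOnly('Machu Picchu, 7 days!')

# Currency symbols and thousands separators are dropped
price = returnNumbersOnly('From $1,299 USD')

# The decimal point is dropped too, so cents merge into the number
price_with_cents = returnNumbersOnly('$1,299.50')

parts = ['title', 'subtitle']
missing = pop(parts, 2)  # index 2 does not exist

print(slug, price, price_with_cents, missing)
```

Note that `returnNumbersOnly(pop(parts, 2))` yields an empty string, because the `'null'` placeholder contains no digits. A missing price therefore ends up as `''` rather than `'null'`.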

### Custom Functions

## Scrape Example Functions

In [183]:
def scrapeTeaserDataFromCollection(url, site):

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    resultsRow = soup.find_all('article', {'class': 'box-info'})

    # Collect one dict per teaser and build the frame once at the end;
    # DataFrame.append was removed in pandas 2.0
    rows = []
    for resultRow in resultsRow:
        teaser = resultRow.find('p', {'class': 'text-info'})
        link = resultRow.find('a')
        # Skip articles that lack the teaser paragraph or a link
        if teaser is None or link is None:
            continue
        resultRowBreakdown = teaser.text.split('\n')
        rows.append({
            'site': site,
            'title': pop(resultRowBreakdown, 0),
            'price': returnNumbersOnly(pop(resultRowBreakdown, 2)),
            'url': link.get('href')
        })

    return pd.DataFrame(rows, columns=["site", "title", "url", "price"])
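
The scrape function relies on a particular page shape: each teaser is an `article.box-info` element containing a link and a `p.text-info` paragraph whose first line is the title and whose third line is the price. The sketch below runs the same lookups against an inline HTML fragment. The markup and values are made up for illustration and are only an assumption about what the real pages look like.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no teaser paragraph
html = """
<article class="box-info">
  <a href="/packages/sample/">More</a>
  <p class="text-info">Sample Package Title
7 days
from $1,299</p>
</article>
<article class="box-info"><a href="/packages/no-teaser/">More</a></article>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for article in soup.find_all('article', {'class': 'box-info'}):
    teaser = article.find('p', {'class': 'text-info'})
    if teaser is None:
        continue  # nothing to split, so skip this article
    lines = teaser.text.split('\n')
    price_line = lines[2] if len(lines) > 2 else ''
    rows.append({
        'title': lines[0],
        'price': ''.join(c for c in price_line if c.isdigit()),
        'url': article.find('a').get('href'),
    })

print(rows)
```

If a site changes its markup so that the price moves to a different line, the index `2` is the value to adjust.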
       
 

### Import the site data

In [184]:
# Read the list of sites (company, site, collection) from the CSV file
site_data_df = pd.read_csv(site_data_file)
site_data_df

| Unnamed: 0 | company | site | collection |
|---|---|---|---|
| 0 | Peru For Less | https://www.peruforless.com | https://www.peruforless.com/packages/ |
| 1 | Peru Vacation Tours | https://www.peruvacationtours.com | https://www.peruvacationtours.com/peru-tour-pa... |
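
`Unnamed: 0` is the name pandas gives to a CSV column whose header cell is blank, which typically happens when a frame was saved together with its index. A minimal sketch, assuming the file has that shape (the inline CSV text is a hypothetical stand-in for `importfiles/sitedata.csv`), shows how `index_col=0` reads that column as the index instead:

```python
import io

import pandas as pd

# Hypothetical CSV text with a blank first header cell
csv_text = (
    ",company,site,collection\n"
    "0,Example Tours,https://example.com,https://example.com/packages/\n"
)

default_df = pd.read_csv(io.StringIO(csv_text))
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

The scrape loop only reads the `company` and `collection` columns, so either form works; the second simply keeps the extra column out of the frame.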


## Scrape Pages

In [185]:
site_scraped_data_df = pd.DataFrame(columns=["site","title", "url","price"])

# Go through all sites and scrape the appropriate data
site_frames = []
for index, row in site_data_df.iterrows():
    log_processing(row['company'])
    site_frames.append(scrapeTeaserDataFromCollection(row['collection'], row['company']))

# Combine the per-site frames into one; DataFrame.append was removed in pandas 2.0
if site_frames:
    site_scraped_data_df = pd.concat(site_frames, ignore_index=True)

print(site_scraped_data_df)

  ---> processing Peru For Less
  ---> processing Peru Vacation Tours
             site                                              title  \
0   Peru For Less                Machu Picchu, Cusco, Sacred Valley    
1   Peru For Less                Machu Picchu, Cusco, Sacred Valley    
2   Peru For Less                   Inca Trail, Machu Picchu, Cusco    
3   Peru For Less  Machu Picchu, Cusco, Titicaca, Puno, Arequipa,...   
4   Peru For Less      Machu Picchu, Cusco, Amazon, Arequipa, Colca    
5   Peru For Less  Machu Picchu, Cusco, Lima, Arequipa & Colca, P...   
6   Peru For Less  Machu Picchu, Cusco, Titicaca, Amazon, Lima, P...   
7   Peru For Less  Machu Picchu, Cusco, Sacred Valley, Galapagos ...   
8   Peru For Less  Machu Picchu, Iguazu, Rio de Janeiro, Buenos A...   
9   Peru For Less  Machu Picchu, Cusco, Sacred Valley, Puno & Lak...   
10  Peru For Less        Machu Picchu, Cusco, Sacred Valley, Amazon    
11  Peru For Less    Cusco, Machu Picchu, Titicaca, La Paz, & Uyun