# PHPBB Scraping Data API - Sample Use
## Configuration
Here is one possible use of the phpBB scraper API: reading the latest topics from a forum. Access to the forum is set up in a config.json file with the following content:
```
{
    "user": "<user>",
    "password": "<password>",
    "base": "https://<forum url>",
    "target_dir": "<target dir where to save / read html files>",
    "meta_file": "<targetFileWhereToAppendHtmlMetaData>"
}
```
Mind the double backslashes (a backslash must be escaped in JSON) and the file encoding, for example:
```
"meta_file": "C:\\Users\\JohnDoe\\Desktop\\TEST\\metafile.json"
```
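As a minimal sketch of how such a file could be checked before it is handed to the scraper, the snippet below loads a config with Python's json module and verifies that the five keys shown above are present. The load_config helper is hypothetical and not part of phpbb_scraper, and reading the file as UTF-8 is an assumption. The demo also shows why backslashes in Windows paths appear doubled in the JSON file.

```python
import json
import os
import tempfile

# keys taken from the config.json example above
REQUIRED_KEYS = ("user", "password", "base", "target_dir", "meta_file")


def load_config(config_file):
    """Hypothetical helper (not part of phpbb_scraper): load and check a config file."""
    with open(config_file, encoding="utf-8") as f:  # UTF-8 is an assumption
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config


# demo with a temporary file; all values are placeholders
sample = {
    "user": "<user>",
    "password": "<password>",
    "base": "https://<forum url>",
    "target_dir": r"C:\Users\JohnDoe\Desktop\TEST",
    "meta_file": r"C:\Users\JohnDoe\Desktop\TEST\metafile.json",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)  # each backslash is written as \\ in the file
    with open(config_path, encoding="utf-8") as f:
        raw_text = f.read()
    config = load_config(config_path)

print(raw_text)             # JSON text on disk, with doubled backslashes
print(config["meta_file"])  # loaded value, with single backslashes again
```

Writing the file with json.dump instead of by hand avoids escaping mistakes in the first place.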
The ScraperExecutor class encapsulates URL generation, login, and the reading and processing of scraped forum pages.
 

## Code is Documented :-)

In [2]:
import phpbb_scraper
from phpbb_scraper.scraper import ScraperExecutor
# the code is documented; use help(<module>) to find out more about the implementation
# to inspect the inner structure, use dir(<module>)
help(ScraperExecutor)

Help on class ScraperExecutor in module phpbb_scraper.scraper:

class ScraperExecutor(builtins.object)
 |  Implementation of some frequently used scraping queries
 |  
 |  Class methods defined here:
 |  
 |  __init__(base=None, debug=False, wait_time=5, user=None, password=None, config_file=None, target_dir=None, meta_file=None) from builtins.type
 |      constructor
 |  
 |  close_session() from builtins.type
 |      close session
 |  
 |  get_session() from builtins.type
 |      gets/creates class session
 |  
 |  get_soup(url) from builtins.type
 |      retrieves soup for given url, configuration needed to be setup
 |      if url is a list, a list of soup will be returned
 |  
 |  get_soups(urls) from builtins.type
 |      reads multiple urls in case soup contains number of entries tags and a "start" property, 
 |      url will not be read. returns a list of dictionary with entry 
 |      {hash(soup_id):{'url':<url>,'url_hash':<url_hash>,'soup':<soup>,'soup_id':<soup_id>,'date':<da

## Sample Scrape
The following sample uses the ScraperExecutor class to read the latest topics, save them as HTML files, and update the meta file (the list of downloaded files).

In [None]:
from phpbb_scraper.scraper import ScraperExecutor

# config file path
config_file = r"C:\<path>\config.json"

debug = False   # run in debug mode
steps_num = 2   # maximum number of web pages to scrape

# set configuration and instantiate the web scraper
executor = ScraperExecutor(config_file=config_file, debug=debug)
# scrape data from the forum, save it to files and return the metadata of each scraped page
metadata = executor.retrieve_last_topics(steps_num=steps_num)


The metadata for each scraped page is displayed below. To make it uniquely identifiable, each scraped page (and, later on, each forum post) gets hash IDs alongside its file name, so that it is ready for analysis in subsequent steps.

In [None]:
# if everything went fine, the file metadata is shown here (this is what gets appended to the metadata file)
metadata
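
The hash IDs mentioned above can be illustrated with a short sketch. How phpbb_scraper computes its IDs is not shown here, so the make_id helper, the choice of SHA-256, and the example URL are assumptions used only for illustration. The point is that a stable digest of the URL yields the same ID for the same page on every run, which makes it usable as a key. Python's built-in hash() is randomized per process for strings, so a hashlib digest is used instead.

```python
import hashlib


def make_id(text):
    """Hypothetical helper: stable hex digest of a string (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# example URL is a placeholder, not a real forum
url = "https://forum.example.org/viewtopic.php?t=1"
url_hash = make_id(url)

# the same input always gives the same ID; a different input gives a different one
print(url_hash == make_id(url))
print(url_hash == make_id(url + "&start=10"))
```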

## Reading Of Scraped Data
Scraped HTML data can be read with the read_topics_from_meta() method: it reads the meta file, opens the files referenced there, and imports each post as a dictionary.

In [None]:
from phpbb_scraper.scraper import ScraperExecutor

# config file path
config_file = r"C:\<path>\config.json"

debug = False   # run in debug mode

# set configuration
executor = ScraperExecutor(config_file=config_file, debug=debug)

# read the metafile and get the post data as dictionaries from the locally stored html files
topics = executor.read_topics_from_meta()  # dictionary containing topics metadata
print(f"Number of topics {len(topics)}, type of topics: {type(topics)}")
# show the metadata keys of the first entry
print(f"Metadata Keys per Post: {list(next(iter(topics.values())).keys())}")

With the posts transformed into dictionaries, everything is set for further analysis :-)
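
As a sketch of what such an analysis could look like, the snippet below counts posts per author over a dictionary of post dictionaries. The sample data and the key names author and text are hypothetical; the actual keys are the ones printed by the cell above.

```python
from collections import Counter

# hypothetical sample shaped like a dictionary of post dictionaries;
# real key names come from read_topics_from_meta() and may differ
sample_topics = {
    "id_1": {"author": "alice", "text": "first post"},
    "id_2": {"author": "bob", "text": "a reply"},
    "id_3": {"author": "alice", "text": "another reply"},
}

# count posts per author, skipping entries without the (assumed) author key
posts_per_author = Counter(
    post["author"] for post in sample_topics.values() if "author" in post
)

for author, count in posts_per_author.most_common():
    print(f"{author}: {count}")
```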

## Scraped Data as HTML Table
The ScraperExecutor method save_topics_as_html_table reads the scraped data and transforms it into an HTML table.

In [None]:
from phpbb_scraper.scraper import ScraperExecutor

# config file path
config_file = r"C:\<path>\config.json"

debug = False   # run in debug mode

# set configuration
executor = ScraperExecutor(config_file=config_file, debug=debug)

# html file name and path
html_file = "posts_as_html_table"
path = r"C:\<path>\TEST"
add_timestamp = False

# read the scraped data, create the html table and save the file locally
executor.save_topics_as_html_table(html_file=html_file, path=path, append_timestamp=add_timestamp)
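
To illustrate the idea behind such a transformation, here is a minimal sketch that renders a dictionary of dictionaries as an HTML table using only the standard library. It is not the implementation used by save_topics_as_html_table; the function dict_to_html_table, the sample data, and its key names are hypothetical.

```python
import html


def dict_to_html_table(rows):
    """Hypothetical helper: render {row_id: {column: value}} as an HTML table string."""
    # collect column names in first-seen order across all rows
    columns = []
    for row in rows.values():
        for key in row:
            if key not in columns:
                columns.append(key)

    header = "".join(f"<th>{html.escape(str(col))}</th>" for col in columns)
    lines = ["<table>", f"  <tr><th>id</th>{header}</tr>"]
    for row_id, row in rows.items():
        cells = "".join(
            f"<td>{html.escape(str(row.get(col, '')))}</td>" for col in columns
        )
        lines.append(f"  <tr><td>{html.escape(str(row_id))}</td>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# sample data with made-up keys; values are escaped so markup in posts stays inert
sample_posts = {
    "id_1": {"author": "alice", "text": "1 < 2"},
    "id_2": {"author": "bob", "text": "<b>hi</b>"},
}
table = dict_to_html_table(sample_posts)
print(table)
```

Escaping every cell with html.escape matters for forum content in particular, since post text often contains markup of its own.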


