# Data Exploration Notebook
Inspect your data to understand what was crawled and when. Look for documents that are not relevant to your task, and decide which metadata you want to collect.

## Setup
Run the cells in this section every time you open the notebook.

In [1]:
from bs4 import BeautifulSoup as BS
from functions import explore_metadata as explore_m
from functions import explore_irrelevant as explore_i

%load_ext autoreload
%autoreload 2

In the cell below, specify the directory where your data is stored.

In [2]:
DATA_DIRECTORY = '/home/brunobrocai/Code/med-crawlers/paradisi_forum/health_pages'

## Basic Metadata
Check how many documents you have, how much disk space they take up, and when they were crawled.

The cell below reports the same information your file explorer would show: the number of files and their total size.

In [3]:
count = explore_m.document_count(DATA_DIRECTORY)
print(f'Total number of documents: {count}')
size = explore_m.get_total_size(DATA_DIRECTORY)
print("Total size of files in the directory:", explore_m.format_size(size))

Total number of documents: 54033
Total size of files in the directory: 5.05 GB
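
For orientation, numbers like the ones above can be reproduced with the standard library alone. The sketch below assumes that `document_count` and `get_total_size` count files and sum their sizes recursively; that is a guess about their behavior, and the helper names `count_and_size` and `human_size` are hypothetical, not part of `functions.explore_metadata`.

```python
import os


def count_and_size(directory):
    """Return (number of files, total size in bytes) under `directory`."""
    n_files, total_bytes = 0, 0
    # os.walk visits the directory and all of its subdirectories.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            n_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return n_files, total_bytes


def human_size(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TB"
```

Calling `count_and_size(DATA_DIRECTORY)` and passing the second value to `human_size` should give figures comparable to the output above, provided the assumption about recursive counting holds.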


The function below gives you the timeline of when the documents were crawled and what the cutoff date is.
>Be aware that this function can take a while to run and might not be very useful in some cases.

In [None]:
explore_m.vis_crawling_history(DATA_DIRECTORY)

## Find irrelevant documents
Look at the structure of the URLs you scraped. List all subdomains and frequent strings in the URLs; you may find patterns that are not relevant to your use case.

In [4]:
urls = explore_i.list_key(DATA_DIRECTORY, 'url')

In [None]:
explore_i.get_subdomains(urls, print_=True)
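
If you want to check the subdomain listing independently, host names can be counted with `urllib.parse`. This is a minimal sketch, not the implementation of `get_subdomains`; the helper `count_hosts` is hypothetical, and the third sample URL is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count how often each host name (subdomain included) occurs."""
    return Counter(urlsplit(url).hostname for url in urls)


sample_urls = {
    'https://forum.paradisi.de/thema/sori-mozzarella-die-buffala-41266/',
    'https://forum.paradisi.de/thema/wie-lange-dauert-es-bis-zum-letzten-stadium-185989/',
    'https://www.example.org/some-page/',  # made-up URL, not from the crawl
}

# Print hosts from most to least frequent.
for host, n in count_hosts(sample_urls).most_common():
    print(host, n)
```

A host that appears only a handful of times is often a sign of off-topic pages that the crawler followed by accident.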

In [6]:
print(urls)

{'https://forum.paradisi.de/thema/fragen-beim-rummachen-und-ueber-penis--69438/', 'https://forum.paradisi.de/thema/gefuehle-der-hoehepunkts-selbsterleben-12942/', 'https://forum.paradisi.de/thema/selbstbefriedigung-trotz-menstruation-41474/', 'https://forum.paradisi.de/thema/beschnittener-penis-wie-muss-man-sich-das-vorstellen-59046/', 'https://forum.paradisi.de/thema/meine-frau-macht-nur-noch-dicht-30448/', 'https://forum.paradisi.de/thema/verliebe-mich-fast-in-jeden-jungen-ist-das-normal-145261/', 'https://forum.paradisi.de/thema/wespenallergie-muss-ich-bei-einem-stich-trotz-notfallset-noch-ins-krankenhaus-173524/', 'https://forum.paradisi.de/thema/freundin-lecken-72117/', 'https://forum.paradisi.de/thema/sori-mozzarella-die-buffala-41266/', 'https://forum.paradisi.de/thema/kalte-haende-einschlafende-arme-von-durchblutungsstoerungen-8902/', 'https://forum.paradisi.de/thema/wie-lange-dauert-es-bis-zum-letzten-stadium-185989/', 'https://forum.paradisi.de/thema/gibt-es-hilfen-fuer-parki

In [None]:
explore_i.get_common_bytepairs(urls, iterations=1000, print_=True, min_print_len=5)

If you find a URL pattern that interests you, you can use the function below to list all URLs that match the pattern.

In [None]:
pattern_matches = explore_i.matching_set(urls, '/themen/')

explore_i.print_match_results(pattern_matches, count, True)
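
The idea behind pattern matching can be sketched in a few lines. The version below assumes plain substring matching, which may differ from what `matching_set` actually does (it could use regular expressions, for example); `matching_urls` and `match_share` are hypothetical helper names.

```python
def matching_urls(urls, pattern):
    """Return the set of URLs that contain `pattern` as a substring."""
    return {url for url in urls if pattern in url}


def match_share(matches, total):
    """Return the percentage of `total` documents covered by `matches`."""
    if total == 0:
        return 0.0
    return 100 * len(matches) / total


sample_urls = {
    'https://forum.paradisi.de/thema/sori-mozzarella-die-buffala-41266/',
    'https://forum.paradisi.de/thema/wie-lange-dauert-es-bis-zum-letzten-stadium-185989/',
}

hits = matching_urls(sample_urls, 'mozzarella')
print(f'{len(hits)} match(es), {match_share(hits, len(sample_urls)):.1f}% of the sample')
```

Looking at the share as well as the raw count helps you judge whether a pattern is worth filtering on.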

Sometimes irrelevant data does not reveal itself in the URLs. In that case, you may need to define a function that inspects the HTML to decide whether a document is relevant.
In the cell below, you can define such a function and see which documents it matches.

>The function will be used by later notebooks. It should return `True` if the document is relevant and `False` if it is not.

In [None]:
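def my_relevant_match_function(data):
    """Return True if the document is relevant, False otherwise."""
    # Placeholder rule: any document with non-empty HTML counts as relevant.
    html = data['html_content']
    return bool(html)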


function_matches = explore_i.apply_function_dir(DATA_DIRECTORY, my_relevant_match_function)
explore_i.print_match_results(function_matches, count)
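
A slightly stricter relevance rule could require a minimum amount of visible text. The sketch below uses only the standard library so that it runs anywhere; the same logic could be written with BeautifulSoup. The function name `has_enough_text` and the threshold of 200 characters are assumptions chosen for illustration, not values from this project.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def has_enough_text(data, min_chars=200):
    """Return True if the document has at least `min_chars` of visible text."""
    html = data.get('html_content') or ''
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace before measuring the length.
    text = ' '.join(' '.join(parser.chunks).split())
    return len(text) >= min_chars
```

Because it takes the document dictionary and returns a boolean, a function like this could be passed to `explore_i.apply_function_dir` in place of `my_relevant_match_function`.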

## Explore document metadata
Have a look at what HTML metadata is available in your documents.

In [None]:
pass
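
One possible starting point for this section is to list the `<meta>` tags of a document. The sketch below relies on the standard library parser; `MetaTagCollector` and `extract_meta` are hypothetical names, and the sample HTML is invented for illustration rather than taken from the crawl.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags as a {name or property: content} dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attributes = dict(attrs)
        # Standard tags use `name`; Open Graph tags use `property`.
        key = attributes.get('name') or attributes.get('property')
        if key and attributes.get('content') is not None:
            self.meta[key] = attributes['content']


def extract_meta(html):
    """Return the meta tags found in `html` (empty dict if there are none)."""
    collector = MetaTagCollector()
    collector.feed(html or '')
    collector.close()
    return collector.meta


# Invented sample document, used only to show the output shape.
sample_html = (
    '<html><head>'
    '<meta name="description" content="A forum thread">'
    '<meta property="og:title" content="Thread title"/>'
    '</head><body></body></html>'
)
print(extract_meta(sample_html))
```

Running `extract_meta` over the `html_content` of a few documents should show which fields are filled in consistently enough to be worth collecting.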