<a href="https://colab.research.google.com/github/WiktorProsowicz/machine-learning-projects/blob/main/NLP/NLP_TVP_Info.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Introduction
The goal of this project is to test some data harvesting and processing methods. This is my first ML project, so I want to focus mainly on using the TensorFlow library (especially its Keras API).

The main idea behind the project is to collect a fair amount of textual data from one of the divisions of the Polish governmental TV station (to be precise, its online version) and create two kinds of models.

The first model learns the order of words in a sequence and serves as an article generator.

The second one produces a word embedding that would hopefully arrange words into clusters on the basis of their close occurrence in sentences.

#Data harvesting & preprocessing


The first part of this section focuses on an algorithm that collects links to articles of the TVP Info station. It begins by composing a list of links from the article grid available at www.tvp.info/polska. The page seems to be generated dynamically with JavaScript, so a standard web scraping method cannot be used; instead, the links are read from a script embedded in the page.

In [55]:
import urllib.request as request
import bs4
import re

In [67]:
N_PAGES = 50    # maximum number of grid pages like "www.tvp.info/polska?page=x" to visit
links_collection = set()

for page in range(1, N_PAGES + 1):

    print(f"- visiting page number {page} -")

    link_to_grid = f"https://www.tvp.info/polska?page={page}"

    try:
        with request.urlopen(link_to_grid) as fp:
            html = fp.read().decode("utf-8")

    except Exception:
        print(f"- something went wrong when accessing grid number {page} -")
        break

    soup = bs4.BeautifulSoup(html, "html.parser")

    # manually picked hook that occurs directly before the proper script tag

    info_screening_content = soup.find("section", {"class": "info_screening_content"})
    proper_script_tag = info_screening_content.find_next_siblings("script")[0]

    js_script = proper_script_tag.get_text(strip = False)

    # the capture group keeps only the escaped path between the quotes
    parsed = re.findall(r'"url" *: *"(\\/[0-9]*\\/[a-zA-Z\-]*)"', js_script)
    viable_article_links = [element.replace("\\", "") for element in parsed]

    for href in viable_article_links:
        if not href.startswith("https://www.tvp.info"):
            href = "https://www.tvp.info" + href
        
        # print(f"- got link {href} from the grid -")
        links_collection.add(href)


print(f"-- at first collected {len(links_collection)} links --")

-- at first collected 826 links --
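
To see what the regular expression picks out, here is a minimal sketch run on a made-up fragment of the embedded script. The `sample_script` string, the article paths in it and the helper name `extract_article_links` are hypothetical; the real script on the page may be formatted differently. The sketch uses a capture group so that the path is taken directly, whatever the spacing around the colon, and requires at least one digit and one letter (`+` instead of `*`).

```python
import re

# Hypothetical fragment of the script embedded in the grid page
# (JSON-style, with slashes escaped as "\/").
sample_script = r'''
{"items": [
    {"title": "A", "url" : "\/11111111\/first-sample-article"},
    {"title": "B", "url":"\/22222222\/second-sample-article"},
    {"title": "C", "url" : "https:\/\/example.com\/other"}
]}
'''

# The capture group keeps only the escaped path between the quotes.
URL_PATTERN = re.compile(r'"url" *: *"(\\/[0-9]+\\/[a-zA-Z\-]+)"')


def extract_article_links(js_script: str, base: str = "https://www.tvp.info") -> list:
    """Return absolute article links found in the script text."""
    paths = URL_PATTERN.findall(js_script)
    # Drop the backslashes that escape the slashes, then prepend the domain.
    return [base + path.replace("\\", "") for path in paths]


links = extract_article_links(sample_script)
print(links)
```

The third entry is skipped because its value does not start with an escaped slash followed by a numeric id.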


The function below visits the page at the given `link` and collects all links from its "see also" sections.

In [64]:
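def traverse_site(link: str) -> set:
    """Return the article links found in the "see also" sections of the page at `link`."""

    try:
        with request.urlopen(link) as fp:
            html = fp.read().decode("utf-8")

    except Exception:
        print(f"- something went wrong when traversing link {link} -")
        return set()

    soup = bs4.BeautifulSoup(html, "html.parser")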
    
    double_section = soup.find("section", {"class": "art-detial-two-box"})
    triple_section = soup.find("section", {"class": "art-detial-three-box"})

    return_set = set()

    for section in [double_section, triple_section]:

        # a page may lack one of the sections; soup.find returns None then
        if section is None:
            continue

        # i.e. for art-detial-two-box -- art-detial-two-box__box
        article_classname = section.attrs["class"][0] + "__box"

        single_articles = section.find_all("div", {"class": article_classname})

        if not single_articles:
            continue

        for article in single_articles:
            main_link = article.find("a", {"class": "news__image"})

            if main_link:
                href = main_link.attrs["href"]
                if not href.startswith("https://www.tvp.info"):
                    href = "https://www.tvp.info" + href
                return_set.add(href)
    
    return return_set
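
The code cells above turn relative hrefs into absolute ones by prefixing the domain by hand. As a possible alternative, the standard library's `urllib.parse.urljoin` resolves relative and absolute hrefs alike. This is only a sketch: the helper name `to_absolute` and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tvp.info"


def to_absolute(href: str, base: str = BASE_URL) -> str:
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


print(to_absolute("/12345678/sample-article"))
print(to_absolute("https://www.tvp.info/12345678/sample-article"))
```

Unlike the manual prefixing, an absolute link to a different domain stays as it is, so a separate domain check is still needed if only tvp.info links are wanted.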

(Optional) The next part collects as many links as possible by visiting every article in the preliminary list.

In [None]:
MAX_LEN = 1000  # stop harvesting once at least this many links are collected

visited_links = set()

while len(links_collection) < MAX_LEN:

    available_links = links_collection.difference(visited_links)

    if not available_links:
        print("\n--- ran out of links to visit ---\n")
        break
        
    target_link = available_links.pop()

    print(f"--- visiting {target_link} ---")
    harvested_links = traverse_site(target_link)

    # print(f"- harvested {harvested_links} -")

    visited_links.add(target_link)
    links_collection.update(harvested_links)
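
The loop above recomputes the set difference on every iteration. A minimal sketch of the same idea with an explicit queue of links still to visit is shown here; it runs on a small in-memory graph instead of the real site. `FAKE_SITE`, `fake_traverse` and `harvest` are hypothetical names standing in for the website and for `traverse_site`.

```python
from collections import deque

# Hypothetical "website": each link maps to the links in its "see also" section.
FAKE_SITE = {
    "a": {"b", "c"},
    "b": {"c", "d"},
    "c": {"a"},
    "d": {"e"},
    "e": set(),
}


def fake_traverse(link: str) -> set:
    """Stand-in for traverse_site: look the link up instead of downloading it."""
    return FAKE_SITE.get(link, set())


def harvest(seed_links, traverse, max_len: int) -> set:
    """Visit links breadth-first until max_len links are known or none are left."""
    collected = set(seed_links)
    to_visit = deque(sorted(collected))  # sorted only to make the order repeatable

    while to_visit and len(collected) < max_len:
        target = to_visit.popleft()
        for link in sorted(traverse(target)):
            if link not in collected:
                collected.add(link)
                to_visit.append(link)  # every link is queued exactly once

    return collected


print(sorted(harvest({"a"}, fake_traverse, max_len=1000)))
print(len(harvest({"a"}, fake_traverse, max_len=3)))
```

As in the original loop, the limit is checked only between visits, so the final visit may push the collection slightly past `max_len`.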



In [68]:
with open("/tmp/tvp_links.txt", "w") as target_file:

    for link in links_collection:
        target_file.write(f"{link}\n")

If you want to skip the whole process, just download the file from my GitHub.

In [None]:
!wget ""
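
Whichever way the file is obtained, the links can be read back into a set before further processing. The sketch below assumes one link per line, as written by the saving cell; it uses a temporary file with made-up links so that it runs on its own, and `load_links` is a hypothetical helper name.

```python
import os
import tempfile


def load_links(path: str) -> set:
    """Read one link per line, skipping blank lines and duplicates."""
    with open(path, encoding="utf-8") as source_file:
        return {line.strip() for line in source_file if line.strip()}


# Write a small sample file in the same one-link-per-line format.
sample_links = [
    "https://www.tvp.info/11111111/first-sample-article",
    "https://www.tvp.info/22222222/second-sample-article",
    "https://www.tvp.info/11111111/first-sample-article",  # duplicate on purpose
]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_links) + "\n\n")
    sample_path = tmp.name

loaded = load_links(sample_path)
os.remove(sample_path)

print(len(loaded))
```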