<p style="font-size: 30px; text-align: center;"><b>Webscraping and Social Media Scraping</b></p>
<p style="font-size: 20px; text-align: center;">Project</p>
<p style="font-size: 20px; text-align: center;">Weronika Mądro, Wojciech Hrycenko</p>
<p style="font-size: 20px; text-align: center;">Spring 2025</p>

<p style="font-size: 20px; text-align: center;">1. Choice of the website and legal issues</p>

This project focuses on scraping data from Mediakrytyk (https://mediakrytyk.pl/), a platform that aggregates user and critic reviews of movies and TV series from various sources. The platform assigns combined scores, helping users make informed decisions about what to watch. The website also publishes rankings of movies and series in several categories, such as the best, the worst, and the most popular, according to both users and critics.

Web scraping should be conducted responsibly, adhering to legal and ethical guidelines. Before starting the scraping process, we carefully reviewed Mediakrytyk’s terms of service to ensure compliance. This project respects the website's policies, using only publicly available data and avoiding actions that could disrupt the platform's functionality. The website does not have a robots.txt file, and its terms of service contain no explicit restrictions on web scraping.
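
When a site does publish a robots.txt file, its rules can be checked programmatically before any request is sent. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed`, the sample rules, and the `example.com` URLs are illustrative assumptions and are not taken from Mediakrytyk.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; an empty file yields no rules at all.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only (not taken from Mediakrytyk).
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, "*", "https://example.com/private/page"))
print(is_allowed(sample_rules, "*", "https://example.com/filmy/ranking"))
# An empty rule set, comparable to a site without robots.txt, blocks nothing.
print(is_allowed("", "*", "https://example.com/filmy/ranking"))
```

A missing robots.txt only means there are no machine-readable rules; the terms of service still need to be read separately, as was done here.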

To match the project criteria, we scraped the site using two methods: the BeautifulSoup library and Selenium.

<p style="font-size: 20px; text-align: center;">2. Web scraping process</p>

<p style="font-size: 16px; text-align: center;">2.1. Beautiful Soup TV Series Scraping</p>

This script is designed for scraping TV series rankings from Mediakrytyk and displaying the extracted data interactively. It allows users to select different ranking categories and retrieve structured information. Key features include:

1. Importing necessary libraries for web scraping, data handling, and interactive widgets.

2. Defining a dictionary linking ranking categories to their respective URLs.

3. Implementing a function to scrape TV series details: title, original title, year, critic and user ratings, genre, production country, and duration.

4. Using BeautifulSoup to parse HTML and extract structured data into a pandas DataFrame.

5. Creating interactive widgets for selecting a ranking category and triggering data retrieval.

6. Displaying results dynamically and saving the data to a CSV file.
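
Not every ranking entry carries every field, so each lookup in step 3 needs a fallback value. A minimal sketch of that pattern follows; `text_or_default` is a hypothetical helper name that does not appear in the script, and `SimpleNamespace` objects stand in for parsed HTML elements so the example runs without any HTML.

```python
from types import SimpleNamespace


def text_or_default(element, default):
    """Return the stripped text of element, or default if element is None."""
    return element.text.strip() if element is not None else default


# Stand-ins for parsed HTML elements: anything with a .text attribute works.
found = SimpleNamespace(text="  2019 ")
missing = None  # what select_one() returns when nothing matches

print(text_or_default(found, "No year"))
print(text_or_default(missing, "No year"))
```

A helper like this could replace each repeated `x.text.strip() if x else "..."` expression with a single call.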


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output

# Dictionary with ranking options
ranking_options = {
    "Best TV Series": "https://mediakrytyk.pl/seriale/ranking",
    "Users Best TV Series": "https://mediakrytyk.pl/seriale/ranking/uzytkownicy",
    "Users Worst TV Series": "https://mediakrytyk.pl/seriale/ranking/uzytkownicy/najgorsze",
    "Users Most Popular TV Series": "https://mediakrytyk.pl/seriale/ranking/uzytkownicy/popularne",
    "Critics Worst TV Series": "https://mediakrytyk.pl/seriale/ranking/najgorsze",
    "Critics Mostly Reviewed TV Series": "https://mediakrytyk.pl/seriale/ranking/popularne"
}

# Function to scrape the TV series list from the selected ranking
def scrape_mediakrytyk_movies(url):
    print(f"Scraping from: {url}")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return pd.DataFrame()
    
    soup = BeautifulSoup(response.text, "html.parser")
    movies = []
    
    for index, item in enumerate(soup.select("ul > li > div > div:nth-of-type(3) > div > div:nth-of-type(1) > h3 > a"), start=1):
        title = item.text.strip()
        parent = item.find_parent("li")

        year_element = parent.select_one("a.label_small.link")
        year = year_element.text.strip() if year_element else "No year"

        critic_rating_element = parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_1.small_hide") or \
                              parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.rounded.small.level_0.small_hide") or \
                              parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.rounded.small.level_empty.small_hide") or \
                              parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_0.small_hide") or \
                              parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_empty.small_hide")
        critic_rating = critic_rating_element.text.strip() if critic_rating_element else "No critic rating"

        user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.promoted.small_hide") or \
                              parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.small_hide") or \
                              parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_0.small_hide") or \
                              parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_empty.small_hide")
        user_rating = user_rating_element.text.strip() if user_rating_element else "No user rating"

        genre_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray:not(.italic)")
        genre = genre_element.text.strip() if genre_element else "No genre"

        country_elements = parent.select("ul:nth-of-type(2) > li:nth-of-type(2) > a.link_gray")
        countries = [c.text.strip() for c in country_elements]
        country = ", ".join(countries) if countries else "No production country"

        duration_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray[href*='dlugosc']")
        duration = duration_element.text.strip() if duration_element else "No duration"

        original_title_element = parent.select_one("a.link_gray.italic")
        original_title = original_title_element.text.strip() if original_title_element else "No original title"

        movies.append({
            "Title": title,
            "Original Title": original_title,
            "Year": year,
            "Critic Rating": critic_rating,
            "User Rating": user_rating,
            "Genre": genre,
            "Production Country": country,
            "Duration": duration
        })
    
    df = pd.DataFrame(movies)
    df.index = df.index + 1 
    return df

# Function to handle button click event
def on_button_click(b):
    # Disable the button to prevent multiple clicks
    b.disabled = True
    selected_url = ranking_options[dropdown.value]

    try:
        with output:
            clear_output(wait=True)
            print(f"Fetching data from: {selected_url}")

            # Scrape the data
            df = scrape_mediakrytyk_movies(selected_url)

            if df.empty:
                print("No data found!")
                return

            # Replace "?" in ratings
            if 'User Rating' in df.columns:
                df['User Rating'] = df['User Rating'].replace('?', 'No user rating')
            if 'Critic Rating' in df.columns:
                df['Critic Rating'] = df['Critic Rating'].replace('?', 'No critic rating')

            filename = f"{dropdown.value}.csv"
            df.to_csv(filename, encoding="utf-8")
            print(f"Data saved to: {filename}")
            display(df.head(40))
    finally:
        # Re-enable the button even after an early return or an error;
        # previously an empty result left the button disabled
        b.disabled = False

# Dropdown widget to select the ranking
dropdown = widgets.Dropdown(
    options=list(ranking_options.keys()),
    description="Ranking:",
    style={'description_width': 'initial'}
)

button = widgets.Button(description="Show and Save Data")

output = widgets.Output()

button.on_click(on_button_click)

display(dropdown, button, output)
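
The handler above stores missing ratings as text labels, so the rating columns remain strings. A minimal sketch of converting scraped values to numbers for later analysis is shown here, assuming ratings look like `9.8` and durations look like `141 min` (the TV series pages may format durations differently); `parse_rating` and `parse_minutes` are hypothetical helper names, not part of the script.

```python
from typing import Optional


def parse_rating(value: str) -> Optional[float]:
    """Convert a rating such as '9.8' to float; placeholders become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_minutes(value: str) -> Optional[int]:
    """Convert a duration such as '141 min' to minutes; placeholders become None."""
    parts = value.split()
    if parts and parts[0].isdigit():
        return int(parts[0])
    return None


print(parse_rating("9.8"), parse_rating("?"), parse_rating("No user rating"))
print(parse_minutes("141 min"), parse_minutes("No duration"))
```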



<p style="font-size: 16px; text-align: center;">2.2. Selenium Scraping - Movies Ranking</p>

This script automates the process of scraping movie rankings from Mediakrytyk using Selenium. It dynamically loads pages, extracts key movie details, and saves the data for further analysis.

1. Importing necessary libraries for web scraping, automation, and data processing.

2. Initializing a Selenium WebDriver to control a browser session.

3. Handling cookie pop-ups by detecting and clicking the "Accept" button.

4. Scrolling down dynamically to load all movie entries on the page.

5. Scraping movie details such as title, original title, year, ratings, genre, country, duration, and cover image.

6. Extracting movie data from multiple pages based on the defined page limit.

7. Storing scraped data in a pandas DataFrame for structured handling.

8. Saving the extracted data into a CSV file for further use.

9. Rendering cover images within a table for visual representation.

10. Closing the Selenium WebDriver session after scraping is complete.
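
The scrolling in step 4 repeats until the page height stops changing. A sketch of the same idea with an upper bound on the number of rounds is shown here, so that a page which keeps growing cannot loop forever. The function name `scroll_until_stable` and its parameters are assumptions made for illustration; the two callables stand in for the browser so the example runs without Selenium.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        pause: float = 2.0,
                        max_rounds: int = 20) -> int:
    """Scroll until the page height stops changing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = get_height()
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


# Fake page for demonstration: grows twice, then stays at the same height.
heights = iter([1000, 2000, 3000, 3000])
rounds = scroll_until_stable(lambda: next(heights), lambda: None, pause=0)
print(rounds)
```

With Selenium, the two callables would wrap `driver.execute_script("return document.body.scrollHeight")` and `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`, the same calls the script uses.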

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
from IPython.display import display, HTML

driver = webdriver.Edge()

def handle_cookies(driver):
    """Accept cookies if the button is present."""
    try:
        WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
        time.sleep(2)
        accept_button = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'fc-button') and contains(@aria-label, 'Zgadzam się')]"))
        )
        accept_button.click()
        time.sleep(1)
    except Exception:
        pass

def scroll_down(driver):
    """Scroll down to load all elements."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

def scrape_mediakrytyk_page(url):
    time.sleep(2)
    scroll_down(driver)  

    soup = BeautifulSoup(driver.page_source, "html.parser")
    movies = []

    for item in soup.select("ul > li > div > div:nth-of-type(3) > div > div:nth-of-type(1) > h3 > a"):
        title = item.text.strip()
        parent = item.find_parent("li")

        year = parent.select_one("a.label_small.link")
        year = year.text.strip() if year else "No year"

        critic_rating = parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_1.small_hide")
        critic_rating = critic_rating.text.strip() if critic_rating else "No critic rating"

        user_rating = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.promoted.small_hide") \
                      or parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.small_hide")
        user_rating = user_rating.text.strip() if user_rating else "No user rating"

        genre = parent.select_one("ul:nth-of-type(2) > li > a.link_gray:not(.italic)")
        genre = genre.text.strip() if genre else "No genre"

        countries = [c.text.strip() for c in parent.select("ul:nth-of-type(2) > li:nth-of-type(2) > a.link_gray")]
        country = ", ".join(countries) if countries else "No production country"

        duration = parent.select_one("ul:nth-of-type(2) > li > a.link_gray[href*='dlugosc']")
        duration = duration.text.strip() if duration else "No duration"

        original_title = parent.select_one("a.link_gray.italic")
        original_title = original_title.text.strip() if original_title else "No original title"

        cover = parent.select_one("div > div:nth-of-type(2) > a > img.movie_full_image")
        cover_url = "https://mediakrytyk.pl" + cover["src"] if cover and "empty_dark/movie.png" not in cover["src"] else "No cover"

        movies.append({
            "Title": title,
            "Original Title": original_title,
            "Year": year,
            "Critic Rating": critic_rating,
            "User Rating": user_rating,
            "Genre": genre,
            "Production Country": country,
            "Duration": duration,
            "Cover URL": cover_url
        })

    return movies

def scrape_mediakrytyk_movies(max_pages=3):
    """Scrape movies from multiple pages, accepting cookies only on the first page."""
    base_url = "https://mediakrytyk.pl/filmy/ranking"
    all_movies = []
    
    for page in range(1, max_pages + 1):
        url = f"{base_url}?strona={page}"
        driver.get(url)
        
        if page == 1:
            handle_cookies(driver)
        
        movies = scrape_mediakrytyk_page(url)
        if not movies:
            print(f"No more movies on page {page}, stopping.")
            break

        all_movies.extend(movies)

    return all_movies

movies_data = scrape_mediakrytyk_movies()
df = pd.DataFrame(movies_data, index=range(1, len(movies_data) + 1))
df.to_csv("Best Movies.csv", index=False)

def render_images(df):
    html = df.to_html(escape=False, formatters={"Cover URL": lambda x: f'<img src="{x}" width="100">' if x != "No cover" else "No cover"})
    display(HTML(html))

render_images(df)
driver.quit()

| No. | Title | Original Title | Year | Critic Rating | User Rating | Genre | Production Country | Duration | Cover URL |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2001: Odyseja kosmiczna | No original title | 1968 | 9.8 | 8.6 | Sci-Fi | USA, Wielka Brytania | 141 min | |
| 2 | Władca Pierścieni: Powrót króla | No original title | 2003 | 9.8 | 8.4 | Fantasy | Nowa Zelandia, USA | 201 min | |
| 3 | Dzisiejsze czasy | Modern Times | 1936 | 9.7 | 7.3 | Komedia | Komedia, Niemy | 87 min | |
| 4 | Obywatel Kane | Citizen Kane | 1941 | 9.7 | 8.2 | Dramat | Dramat | 119 min | |
| 5 | Bez przebaczenia | Unforgiven | 1992 | 9.7 | 8.2 | Western | Western | 131 min | |
| 6 | Gwiezdne wojny: Część V - Imperium kontratakuje | Star Wars: Episode V - The Empire Strikes Back | 1980 | 9.5 | 8.6 | Przygodowy | Przygodowy, Sci-Fi | 124 min | |
| 7 | Nadzy | Naked | 1993 | 9.4 | 6.4 | Dramat | Dramat | 131 min | |
| 8 | Aż poleje się krew | There Will Be Blood | 2007 | 9.4 | 8.6 | Dramat | Dramat, Obyczajowy | 158 min | |
| 9 | One More Time with Feeling | No original title | 2016 | 9.3 | 7.3 | Dokumentalny | Francja, Wielka Brytania | 112 min | |
| 10 | Chłopcy z ferajny | Goodfellas | 1990 | 9.3 | 8.2 | Dramat | Dramat, Kryminał | 146 min | |


<p style="font-size: 20px; text-align: center;">3. Summary</p>

This project focused on web scraping data from Mediakrytyk, a platform aggregating user and critic reviews for movies and TV series. The main goal was to extract structured information about movies and TV series from rankings using two different scraping techniques: BeautifulSoup for static content extraction and Selenium for handling dynamic content.

The implementation involved selecting different ranking categories and retrieving relevant movie details such as title, release year, ratings, and cover images. The scraped data was then structured into a pandas DataFrame and exported to CSV for further analysis.

Throughout the project, ethical and legal considerations were carefully reviewed. Since Mediakrytyk does not explicitly prohibit web scraping in its terms of service, only publicly available data was collected without disrupting the platform’s functionality.

The results show that both tools can extract the ranking data: BeautifulSoup, combined with plain HTTP requests, was sufficient for the TV series rankings, while Selenium drove a real browser session for the paginated movie ranking, including cookie-consent handling and scrolling. The resulting dataset can be used to analyze movie trends, rating distributions, and content popularity.
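
As a hedged illustration of such an analysis, the sketch below averages ratings per genre using only the standard library. The sample rows are copied from the movie ranking output in section 2.2 (a real run would read the saved CSV file instead), and `mean_rating_by_genre` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# A few rows copied from the section 2.2 output, reduced to five columns.
sample_csv = """Title,Year,Critic Rating,User Rating,Genre
2001: Odyseja kosmiczna,1968,9.8,8.6,Sci-Fi
Obywatel Kane,1941,9.7,8.2,Dramat
Nadzy,1993,9.4,6.4,Dramat
Chłopcy z ferajny,1990,9.3,8.2,Dramat
"""


def mean_rating_by_genre(rows, column):
    """Average the numeric values of column per genre, skipping placeholders."""
    by_genre = defaultdict(list)
    for row in rows:
        try:
            value = float(row[column])
        except ValueError:
            continue  # e.g. "No critic rating"
        by_genre[row["Genre"]].append(value)
    return {genre: round(mean(values), 2) for genre, values in by_genre.items()}


rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(mean_rating_by_genre(rows, "Critic Rating"))
print(mean_rating_by_genre(rows, "User Rating"))
```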