<p style="color: white; font-size: 30px; text-align: center;"><b>Webscraping and Social Media Scraping</b></p>
<p style="color: white; font-size: 20px; text-align: center;">Project</p>
<p style="font-size: 20px; text-align: center;">Weronika Mądro, Wojciech Hrycenko</p>
<p style="font-size: 20px; text-align: center;">Spring 2025</p>

<p style="font-size: 20px; text-align: center;">1. Choice of the website and legal issues</p>

This project scrapes data from Mediakrytyk (https://mediakrytyk.pl/), a platform that aggregates movie and series reviews from various sources. It combines them into a single score per title, helping users make informed decisions about what to watch.

Web scraping should be conducted responsibly, adhering to legal and ethical guidelines. Before initiating the scraping process, we carefully reviewed Mediakrytyk’s terms of service to ensure compliance. This project respects the website's policies, making use of publicly available data while avoiding actions that could disrupt the platform's functionality.

Additionally, a standard practice when scraping is to check the website's robots.txt file, which provides guidelines on which parts of the site can be accessed by automated bots. Notably, Mediakrytyk does not have a robots.txt file, meaning there are no explicit crawling restrictions.
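
A check of this kind can be automated with Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: the rules in `sample_rules` and the `my-scraper` user agent are hypothetical, and the parser is fed lines directly so that the example runs offline. An empty rule set stands in for a site that publishes no rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory instead of downloading them
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only (not taken from Mediakrytyk)
sample_rules = ["User-agent: *", "Disallow: /admin/"]

ranking_url = "https://mediakrytyk.pl/filmy/ranking"
print(is_allowed(sample_rules, "my-scraper", ranking_url))
print(is_allowed(sample_rules, "my-scraper", "https://mediakrytyk.pl/admin/"))
print(is_allowed([], "my-scraper", ranking_url))  # no rules at all
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of using in-memory lines.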

<p style="font-size: 20px; text-align: center;">2. Web scraping process</p>

We scraped the TV series ranking using the BeautifulSoup library.

1. First, we fetched the HTML of the ranking page, adding a browser-like "User-Agent" header to the request. With this header the request looks like an ordinary browser visit, so the site is more likely to return the full HTML instead of blocking the scraper.

2. Next, we parsed the page to extract the title, original title, year, critic and user ratings, genre, production country, duration and cover URL.

3. Finally, we stored the data in a pandas DataFrame.
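
The first step can be sketched with a `requests.Session` that sends the header on every call. This is a minimal sketch rather than the notebook's exact code: the helper names `make_session` and `fetch_html`, the 10-second timeout and the use of `raise_for_status()` are assumptions added for illustration.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)


def make_session():
    """Create a session that sends the browser-like User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def fetch_html(session, url, timeout=10):
    """Download a page and return its HTML, raising an exception on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


session = make_session()
print(session.headers["User-Agent"])
# html = fetch_html(session, "https://mediakrytyk.pl/seriale/ranking")  # performs a real request
```

The returned HTML string can then be passed to `BeautifulSoup(html, "html.parser")` for the parsing step.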


<p style="font-size: 20px; text-align: center;">3. Beautiful Soup: Movie and Series Scraping</p>

<p style="font-size: 15px; text-align: center;">3.1. Series Scraping</p>

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape the series list from the Mediakrytyk ranking
def scrape_mediakrytyk_movies():
    url = "https://mediakrytyk.pl/seriale/ranking"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    
    response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.text, "html.parser")
    movies = []
    
    for index, item in enumerate(soup.select("ul > li > div > div:nth-of-type(3) > div > div:nth-of-type(1) > h3 > a"), start=1):
        title = item.text.strip()
        parent = item.find_parent("li")  # Get the main movie container
        
        # Extract year
        year_element = parent.select_one("a.label_small.link")
        year = year_element.text.strip() if year_element else "No year"
        
        # Extract critic rating
        critic_rating_element = parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_1.small_hide")
        critic_rating = critic_rating_element.text.strip() if critic_rating_element else "No critic rating"
        
        # Extract user rating
        user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.promoted.small_hide")
        if not user_rating_element:
            user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.small_hide")
        
        user_rating = user_rating_element.text.strip() if user_rating_element else "No user rating"
        
        # Extract genre (first link in the list)
        genre_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray:not(.italic)")
        genre = genre_element.text.strip() if genre_element else "No genre"
        
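        # Extract production country (second <li> of the list). Caveat suggested by the sample
        # output: for entries with an original title this selector returns the genres instead,
        # presumably because the optional original-title item shifts the list by one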
        country_elements = parent.select("ul:nth-of-type(2) > li:nth-of-type(2) > a.link_gray")
        countries = [c.text.strip() for c in country_elements]
        country = ", ".join(countries) if countries else "No production country"
        
        # Extract movie duration (last link in the list)
        duration_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray[href*='dlugosc']")
        duration = duration_element.text.strip() if duration_element else "No duration"
        
        # Extract original title
        original_title_element = parent.select_one("a.link_gray.italic")
        original_title = original_title_element.text.strip() if original_title_element else "No original title"
        
        # Extract cover image URL
        cover_element = parent.select_one("div > div:nth-of-type(2) > a > img.movie_full_image")
        cover_url = "https://mediakrytyk.pl" + cover_element["src"] if cover_element else "No cover"
        
        # Append movie data to the list
        movies.append({
            "Rank": index,  
            "Title": title,
            "Original Title": original_title,
            "Year": year,
            "Critic Rating": critic_rating,
            "User Rating": user_rating,
            "Genre": genre,
            "Production Country": country,
            "Duration": duration,
            "Cover URL": cover_url
        })
    
    return movies

# Execute function and store results in a DataFrame
films = scrape_mediakrytyk_movies()
df = pd.DataFrame(films).set_index("Rank")   # Create DataFrame from the list of dictionaries
df.head(40)  # Display only the first 40 results


Rank,Title,Original Title,Year,Critic Rating,User Rating,Genre,Production Country,Duration,Cover URL
1,Fleabag,No original title,2016-2019,9.4,8.1,Komedia,Wielka Brytania,27 min,https://mediakrytyk.pl/media/images/empty_dark...
2,Legion,No original title,2017-2019,9.3,8.4,Dramat,USA,60 min,https://mediakrytyk.pl/media/images/empty_dark...
3,Sukcesja,Succession,2018-2023,9.2,7.9,Dramat,Dramat,56 min,https://mediakrytyk.pl/media/images/empty_dark...
4,Wielkie kłamstewka,Big Little Lies,2017-,9.2,8.1,Dramat,Dramat,60 min,https://mediakrytyk.pl/media/images/empty_dark...
5,Barry,No original title,2018-2023,9.2,7.3,Komedia,USA,30 min,https://mediakrytyk.pl/media/images/empty_dark...
6,Zadzwoń do Saula,Better Call Saul,2022-2022,9.1,8.3,Dramat,"Dramat, Komedia, Kryminał",46 min,https://mediakrytyk.pl/media/images/empty_dark...
7,Czarnobyl,Chernobyl,2019-,9.0,8.8,Dramat,Dramat,64 min,https://mediakrytyk.pl/media/images/empty_dark...
8,Arcane,No original title,2021-2024,9.0,8.6,Animacja,USA,41 min,https://mediakrytyk.pl/media/images/empty_dark...
9,Ostatni taniec,The Last Dance,2020-,9.0,8.1,Dokumentalny,"Dokumentalny, Sportowy",50 min,https://mediakrytyk.pl/media/images/empty_dark...
10,Dark,No original title,2017-2020,9.0,8.2,Sci-Fi,Niemcy,60 min,https://mediakrytyk.pl/media/images/empty_dark...
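
In the sample above, rows that have an original title (for example Sukcesja) show genres in the Production Country column. This suggests that the optional original-title item shifts the position of the other list items. A possible position-independent lookup is sketched below. The HTML snippet is a hypothetical simplification of one ranking entry, not the site's real markup, and the helper name `extract_countries` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one ranking entry (not copied from the site)
sample_html = """
<li class="entry">
  <ul><li><a class="label_small link">2018-2023</a></li></ul>
  <ul>
    <li><a class="link_gray italic">Succession</a></li>
    <li><a class="link_gray">Dramat</a></li>
    <li><a class="link_gray">USA</a></li>
    <li><a class="link_gray" href="/seriale/dlugosc/56">56 min</a></li>
  </ul>
</li>
"""


def extract_countries(entry):
    """Return country names, skipping the optional original-title item."""
    items = entry.select("ul:nth-of-type(2) > li")
    # Drop the original-title item (italic link) so the remaining positions are stable
    items = [li for li in items if li.select_one("a.italic") is None]
    if len(items) < 2:
        return []
    return [a.text.strip() for a in items[1].select("a.link_gray")]


entry = BeautifulSoup(sample_html, "html.parser").select_one("li.entry")
print(extract_countries(entry))
```

Whether this matches the live page depends on the real markup, so the selector should be checked against the actual HTML before it replaces the positional one.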


<p style="font-size: 15px; text-align: center;">3.2. Film Scraping</p>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape movie list from Mediakrytyk ranking
def scrape_mediakrytyk_movies():
    url = "https://mediakrytyk.pl/filmy/ranking"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    
    response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.text, "html.parser")
    movies = []
    
    for item in soup.select("ul > li > div > div:nth-of-type(3) > div > div:nth-of-type(1) > h3 > a"):  # CSS selector for each title link
        title = item.text.strip()
        parent = item.find_parent("li")  # Get the main movie container
        
        # Extract year
        year_element = parent.select_one("a.label_small.link")
        year = year_element.text.strip() if year_element else "No year"
        
        # Extract critic rating
        critic_rating_element = parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_1.small_hide")
        critic_rating = critic_rating_element.text.strip() if critic_rating_element else "No critic rating"
        
        # Extract user rating
        user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.promoted.small_hide")
        if not user_rating_element:
            user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.small_hide")
        
        user_rating = user_rating_element.text.strip() if user_rating_element else "No user rating"
        
        # Extract genre (first link in the list)
        genre_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray:not(.italic)")
        genre = genre_element.text.strip() if genre_element else "No genre"
        
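        # Extract production country (second <li> of the list). Caveat suggested by the sample
        # output: for entries with an original title this selector returns the genres instead,
        # presumably because the optional original-title item shifts the list by one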
        country_elements = parent.select("ul:nth-of-type(2) > li:nth-of-type(2) > a.link_gray")
        countries = [c.text.strip() for c in country_elements]
        country = ", ".join(countries) if countries else "No production country"
        
        # Extract movie duration (last link in the list)
        duration_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray[href*='dlugosc']")
        duration = duration_element.text.strip() if duration_element else "No duration"
        
        # Extract original title
        original_title_element = parent.select_one("a.link_gray.italic")
        original_title = original_title_element.text.strip() if original_title_element else "No original title"
        
        # Extract cover image URL
        cover_element = parent.select_one("div > div:nth-of-type(2) > a > img.movie_full_image")
        cover_url = "https://mediakrytyk.pl" + cover_element["src"] if cover_element else "No cover"
        
        # Append movie data to the list
        movies.append({
            "Title": title,
            "Original Title": original_title,
            "Year": year,
            "Critic Rating": critic_rating,
            "User Rating": user_rating,
            "Genre": genre,
            "Production Country": country,
            "Duration": duration,
            "Cover URL": cover_url
        })
    
    return movies

# Execute function and store results in a DataFrame
films = scrape_mediakrytyk_movies()
df = pd.DataFrame(films)  # Create DataFrame from the list of dictionaries
df.head(40)  # Display only the first 40 results


,Title,Original Title,Year,Critic Rating,User Rating,Genre,Production Country,Duration,Cover URL
0,2001: Odyseja kosmiczna,No original title,1968,9.8,8.6,Sci-Fi,"USA, Wielka Brytania",141 min,https://mediakrytyk.pl/media/images/empty_dark...
1,Władca Pierścieni: Powrót króla,No original title,2003,9.8,8.4,Fantasy,"Nowa Zelandia, USA",201 min,https://mediakrytyk.pl/media/images/empty_dark...
2,Obywatel Kane,Citizen Kane,1941,9.7,8.2,Dramat,Dramat,119 min,https://mediakrytyk.pl/media/images/empty_dark...
3,Dzisiejsze czasy,Modern Times,1936,9.7,7.3,Komedia,"Komedia, Niemy",87 min,https://mediakrytyk.pl/media/images/empty_dark...
4,Bez przebaczenia,Unforgiven,1992,9.7,8.2,Western,Western,131 min,https://mediakrytyk.pl/media/images/empty_dark...
5,Gwiezdne wojny: Część V - Imperium kontratakuje,Star Wars: Episode V - The Empire Strikes Back,1980,9.5,8.6,Przygodowy,"Przygodowy, Sci-Fi",124 min,https://mediakrytyk.pl/media/images/empty_dark...
6,Nadzy,Naked,1993,9.4,6.4,Dramat,Dramat,131 min,https://mediakrytyk.pl/media/images/empty_dark...
7,Aż poleje się krew,There Will Be Blood,2007,9.4,8.6,Dramat,"Dramat, Obyczajowy",158 min,https://mediakrytyk.pl/media/images/empty_dark...
8,One More Time with Feeling,No original title,2016,9.3,7.3,Dokumentalny,"Francja, Wielka Brytania",112 min,https://mediakrytyk.pl/media/images/empty_dark...
9,Chłopcy z ferajny,Goodfellas,1990,9.3,8.2,Dramat,"Dramat, Kryminał",146 min,https://mediakrytyk.pl/media/images/empty_dark...
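
All scraped fields in the two result tables are strings (for example `9.8`, `141 min`, or a year range such as `2016-2019`), with text placeholders for missing values. If numeric analysis is planned, a conversion step along these lines may help. It is a sketch with hypothetical helper names, and it maps anything it cannot parse to `None`.

```python
import re


def parse_rating(text):
    """Convert a rating such as '9.8' to float; return None for placeholders."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_duration(text):
    """Convert a duration such as '141 min' to minutes as int; None if absent."""
    match = re.fullmatch(r"(\d+)\s*min", text.strip())
    return int(match.group(1)) if match else None


def parse_years(text):
    """Split '2016-2019', '2017-' or '1968' into (start_year, end_year).

    A missing end year (ongoing series or a single-year film) is returned as None.
    """
    match = re.fullmatch(r"(\d{4})(?:-(\d{4})?)?", text.strip())
    if not match:
        return (None, None)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return (start, end)


print(parse_rating("9.8"), parse_duration("141 min"), parse_years("2016-2019"))
```

Applied to a DataFrame, each helper could be passed to the matching column's `.map()` method.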


<p style="font-size: 20px; text-align: center;">4. Selenium Scraping</p>

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

# Set up the Edge driver (the unused Chrome-specific imports were dropped)
driver = webdriver.Edge()

def handle_cookies(driver):
    # XPath of the consent dialog's "Zgadzam się" (I agree) button
    accept_xpath = "//button[contains(@class, 'fc-button') and contains(@aria-label, 'Zgadzam się')]"
    try:
        # Wait until the page has loaded and the button is present
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, accept_xpath))
        )

        # Find and click the "Zgadzam się" button
        accept_button = driver.find_element(By.XPATH, accept_xpath)

        if accept_button.is_displayed():
            accept_button.click()
            print("Cookies accepted.")
            time.sleep(1)  # Wait a moment after clicking
        else:
            print("The cookie button is not visible.")

    except Exception as e:
        print(f"No cookie button found or it could not be accepted. Error: {e}")


# Scrape the films from a single ranking page
def scrape_mediakrytyk_page(url):
    driver.get(url)
    time.sleep(2)  # Short pause to let the page load

    soup = BeautifulSoup(driver.page_source, "html.parser")
    movies = []

    for item in soup.select("ul > li > div > div:nth-of-type(3) > div > div:nth-of-type(1) > h3 > a"):
        title = item.text.strip()
        parent = item.find_parent("li")

        year_element = parent.select_one("a.label_small.link")
        year = year_element.text.strip() if year_element else "No year"

        critic_rating_element = parent.select_one("a.movie_full_vscore_symbol.score_symbol.link.background.small.level_1.small_hide")
        critic_rating = critic_rating_element.text.strip() if critic_rating_element else "No critic rating"

        user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.promoted.small_hide")
        if not user_rating_element:
            user_rating_element = parent.select_one("a.movie_full_vuscore_symbol.score_symbol.link.background.rounded.small.level_1.small_hide")

        user_rating = user_rating_element.text.strip() if user_rating_element else "No user rating"

        genre_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray:not(.italic)")
        genre = genre_element.text.strip() if genre_element else "No genre"

        country_elements = parent.select("ul:nth-of-type(2) > li:nth-of-type(2) > a.link_gray")
        countries = [c.text.strip() for c in country_elements]
        country = ", ".join(countries) if countries else "No production country"

        duration_element = parent.select_one("ul:nth-of-type(2) > li > a.link_gray[href*='dlugosc']")
        duration = duration_element.text.strip() if duration_element else "No duration"

        original_title_element = parent.select_one("a.link_gray.italic")
        original_title = original_title_element.text.strip() if original_title_element else "No original title"

        cover_element = parent.select_one("div > div:nth-of-type(2) > a > img.movie_full_image")
        cover_url = "https://mediakrytyk.pl" + cover_element["src"] if cover_element else "No cover"

        movies.append({
            "Title": title,
            "Original Title": original_title,
            "Year": year,
            "Critic Rating": critic_rating,
            "User Rating": user_rating,
            "Genre": genre,
            "Production Country": country,
            "Duration": duration,
            "Cover URL": cover_url
        })

    return movies

# Iterate over the ranking pages
def scrape_mediakrytyk_movies():
    base_url = "https://mediakrytyk.pl/filmy/ranking"
    all_movies = []
    page = 1

    while True:
        print(f"Scraping page {page}...")
        url = f"{base_url}?strona={page}"

        movies = scrape_mediakrytyk_page(url)

        # Accept the cookie banner once, after the first page has actually loaded
        # (calling it before driver.get() would wait on an empty browser window)
        if page == 1:
            handle_cookies(driver)

        if not movies:
            print("No new data, stopping.")
            break

        all_movies.extend(movies)

        # Each page is opened by URL, so the "Następna" (next) link only signals
        # whether another page exists; it does not need to be clicked
        try:
            WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.LINK_TEXT, "Następna"))
            )
            page += 1
        except Exception:
            print("No 'Następna' link found, stopping.")
            break

    return all_movies

# Run the scraper; always close the Selenium driver, even if scraping fails
try:
    films = scrape_mediakrytyk_movies()
finally:
    driver.quit()

df = pd.DataFrame(films)

# Display the first 40 results
print(df.head(40))

# Optional: save to CSV
df.to_csv("mediakrytyk_movies.csv", index=False)


Scraping page 1...
No cookie button found or it could not be accepted. Error: Message: target frame detached
  (failed to check if window was closed: disconnected: Unable to receive message from renderer)
  (Session info: MicrosoftEdge=135.0.3179.33)

InvalidSessionIdException: Message: invalid session id: session deleted as the browser has closed the connection
from disconnected: not connected to DevTools
  (Session info: MicrosoftEdge=135.0.3179.33)
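
The error output above comes from a run in which the browser session ended before scraping finished. One way to make the paging loop easier to test and to bound is to separate it from the browser. The sketch below is an illustration under stated assumptions: `collect_pages`, `fetch_page` and `max_pages` are hypothetical names, and a stub stands in for the Selenium-backed page scraper.

```python
def collect_pages(fetch_page, base_url, max_pages=50):
    """Collect items page by page until a page is empty or max_pages is reached.

    fetch_page is any callable that takes a URL and returns a list of items.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?strona={page}"
        items = fetch_page(url)
        if not items:
            break  # an empty page marks the end of the ranking
        all_items.extend(items)
    return all_items


# Stub standing in for a real page scraper: two pages of data, then nothing
fake_site = {
    "https://mediakrytyk.pl/filmy/ranking?strona=1": [{"Title": "A"}, {"Title": "B"}],
    "https://mediakrytyk.pl/filmy/ranking?strona=2": [{"Title": "C"}],
}


def fake_fetch(url):
    """Return the stubbed items for a URL, or an empty list for unknown pages."""
    return fake_site.get(url, [])


films = collect_pages(fake_fetch, "https://mediakrytyk.pl/filmy/ranking")
print(len(films))
```

With Selenium, `fetch_page` would be the notebook's `scrape_mediakrytyk_page`, and the `max_pages` cap keeps the loop from running indefinitely if the site never returns an empty page.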
