# Web Scrapping Project - Movies screening

The goal of this project is to allow users to enter their address, to check movie theaters near them (restricted to UGC, CGR, MK2 and Pathé-Gaumont) and to see some of the movies that are currently being shown.

The movie information (title, director, genres, global ratings) are taken from Letterboxd.com, while the screenings date and hour are taken from the respective movie theater's website.

The user can enter a few genres of movies that he likes, and restrict / sort the movies by their preference. The genres on Letterboxd are shown in order of importance for said movie, making it easier to give a movie a score of potential from its info, although totally arbitrary.

Another goal of this project is to promote smaller movies, French movies, art house films etc... in order to make the user want to add to their cinematographic culture.

In [1]:
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
import pandas as pd
import numpy as np

In [2]:
cinemasFrance = pd.read_excel("C:\\Users\\utilisateur\\Documents\\SEMESTRE 9\\Webscraping et Data Processing\\DonnéesCartographie2022Cinemas.xlsx")

In [3]:
cinemasFrance = cinemasFrance[(cinemasFrance["programmateur"] == "CGR") |
              (cinemasFrance["programmateur"] == "UGC") |
              (cinemasFrance["programmateur"] == "MK2") |
              (cinemasFrance["programmateur"] == "PATHE-GAUMONT")]

In [4]:
adress_user = "12 avenue Leonard de vinci, 92400, Courbevoie"


loc = Nominatim(user_agent="Geopy Library")
getLoc = loc.geocode(adress_user)
print(getLoc.address)

# printing latitude and longitude
print("Latitude = ", getLoc.latitude, "\n")
print("Longitude = ", getLoc.longitude)

12, Avenue Léonard de Vinci, Faubourg de l'Arche, Quartier du Faubourg de l'Arche, Courbevoie, Nanterre, Hauts-de-Seine, Île-de-France, France métropolitaine, 92400, France
Latitude =  48.8964618 

Longitude =  2.2363532


In [5]:
dist_max_km = 10


coord_user = (getLoc.latitude, getLoc.longitude)

#print(geodesic(coord_user, coords_2).km)

cinemasCloseBool = cinemasFrance.apply(
    lambda x: geodesic(
        coord_user,
        (x["latitude"], x["longitude"])
        ).km < dist_max_km,
    axis = 1
    )

In [6]:
cinemasProches = cinemasFrance[cinemasCloseBool]

In [7]:
cinemasProchesUGC = cinemasProches[cinemasProches["programmateur"] == "UGC"]
cinemasProchesCGR = cinemasProches[cinemasProches["programmateur"] == "CGR"]
cinemasProchesPathe = cinemasProches[cinemasProches["programmateur"] == "PATHE-GAUMONT"]
cinemasProchesMK2 = cinemasProches[cinemasProches["programmateur"] == "MK2"]

In [8]:
cinemasProchesCGR

Unnamed: 0,régionCNC,N° auto,nom,région administrative,adresse,code INSEE,commune,population de la commune,DEP,N°UU,...,nombre de films en semaine 1,PdM en entrées des films français,PdM en entrées des films américains,PdM en entrées des films européens,PdM en entrées des autres films,films Art et Essai,part des séances de films Art et Essai,PdM en entrées des films Art et Essai,latitude,longitude
151,2,148324,MEGA CGR,ILE-DE-FRANCE,5 AVENUE DU MARECHAL JOFFRE,93031,Épinay-sur-Seine,54569,93,851,...,187,18.727098,51.203756,10.479666,19.589481,34,5.933682,1.715426,48.957859,2.301996
215,2,284502,CAP CINEMA NANTERRE,ILE-DE-FRANCE,200 ALLEE DE CORSE,92050,Nanterre,96402,92,851,...,161,18.43528,59.333328,11.56418,10.667212,19,4.608783,1.537109,48.900167,2.21295


In [9]:
cinemasProchesPathe

Unnamed: 0,régionCNC,N° auto,nom,région administrative,adresse,code INSEE,commune,population de la commune,DEP,N°UU,...,nombre de films en semaine 1,PdM en entrées des films français,PdM en entrées des films américains,PdM en entrées des films européens,PdM en entrées des autres films,films Art et Essai,part des séances de films Art et Essai,PdM en entrées des films Art et Essai,latitude,longitude
2,1,54,GAUMONT CHAMPS ELYSEES MARIGNAN,ILE-DE-FRANCE,27/33 AVENUE DES CHAMPS ELYSEES,75108,Paris 8e Arrondissement,36218,75,851,...,86,20.450104,59.504905,14.713049,5.331942,17,3.891299,2.988388,48.869654,2.306873
7,1,321,PATHE,ILE-DE-FRANCE,32 RUE LOUIS LEGRAND,75102,Paris 2e Arrondissement,21277,75,851,...,89,42.563321,36.777314,14.432276,6.22709,54,38.452075,30.192691,48.870632,2.334338
9,1,481,GAUMONT PARNASSE,ILE-DE-FRANCE,3 RUE D'ODESSA,75114,Paris 14e Arrondissement,134926,75,851,...,189,45.459301,31.576091,18.559131,4.405477,122,40.954225,32.418248,48.843074,2.324424
20,1,803,BRETAGNE,ILE-DE-FRANCE,73 BD DU MONTPARNASSE,75106,Paris 6e Arrondissement,40452,75,851,...,26,9.852505,71.450129,10.699508,7.997858,7,6.198347,3.178698,48.843752,2.324925
34,1,3222,PATHE WEPLER,ILE-DE-FRANCE,132 -140 BOULEVARD DE CLICHY,75118,Paris 18e Arrondissement,191911,75,851,...,199,43.340268,38.516547,13.88362,4.259564,134,29.906018,23.147216,48.883938,2.328034
46,1,7394,GAUMONT CONVENTION,ILE-DE-FRANCE,27 RUE ALAIN CHARTIER,75115,Paris 15e Arrondissement,231186,75,851,...,175,58.817291,19.645669,13.76078,7.77626,149,64.899882,56.293856,48.837759,2.296259
47,1,7491,GAUMONT AQUABOULEVARD,ILE-DE-FRANCE,8/16 RUE DU COLONEL PIERRE AVIA,75115,Paris 15e Arrondissement,231186,75,851,...,207,24.870014,54.904694,14.95414,5.271152,103,10.979621,4.806176,48.830408,2.276322
48,1,7711,GAUMONT MONTPARNOS,ILE-DE-FRANCE,16 18 RUE D ODESSA,75114,Paris 14e Arrondissement,134926,75,851,...,72,59.939126,21.167681,10.540952,8.352241,99,51.828339,48.053403,48.841914,2.324596
70,1,9700,PATHE BEAUGRENELLE,ILE-DE-FRANCE,7 RUE LINOIS,75115,Paris 15e Arrondissement,231186,75,851,...,139,32.936367,46.631312,15.550574,4.881747,91,18.210276,12.87667,48.848895,2.282585
104,2,66923,PATHE,ILE-DE-FRANCE,26 RUE LE CORBUSIER,92012,Boulogne-Billancourt,122162,92,851,...,147,49.292058,34.316466,11.577674,4.813802,82,34.999563,27.731849,48.837649,2.239534


Now, we will scrap some data from the UGC site. We'll try to find the first cinema for now, just to test.

In [10]:
#!pip install selenium

In [11]:
import time
import selenium
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

In [104]:
driver = webdriver.Chrome()

In [28]:
def find_UGC_cinema(driver, theater_name):
    #We set the current URL of the driver to the search page for UCG movie theaters
    url_ugc = "https://www.ugc.fr/cinemas.html"
    driver.get(url_ugc)
    #We need to let the browser load everything, so we buffer the code for a second.
    #Somehow, the implicit_wait function does not work in this case...
    time.sleep(1)

    #We use the search bar to search for a specific theater, which name is contained in the previous dataframes
    search_bar = driver.find_element(By.ID, 'search-cinemas-field')
    search_bar.clear()
    driver.implicitly_wait(1)
    search_bar.send_keys(theater_name)
    driver.implicitly_wait(1)
    
    #We search for the list of theaters displayed
    cinema_list = driver.find_element(By.ID, "nav-cinemas")
    
    #We then find the list of all theaters. The UGC website is engineered in a way that all theaters are still loaded after the query, but
    #only those that match the query are displayed. The others are simply hidden by modifying the style attribute.
    cinema_list_items = cinema_list.find_elements(By.CLASS_NAME, 'component--cinema-list-item')
    visible_elements = [element for element in cinema_list_items if element.get_attribute('style') == ""]
    #Now that we have the list of movie theaters, there should only be one item in the list. Either way, we take the first one, and take its website link
    #There we will be able to find all the movies that have screenings.
    if visible_elements:
        first_cinema = visible_elements[0]
        first_link = first_cinema.find_element(By.TAG_NAME, 'a')
        href_value = first_link.get_attribute('href')
        print(href_value)
        return href_value
    else:
        print("Aucun cinéma")
        return("")

In [29]:
find_UGC_cinema(driver, cinemasProchesUGC.loc[0,"nom"])

https://www.ugc.fr/cinema.html?id=2


'https://www.ugc.fr/cinema.html?id=2'

In [110]:
def get_letterboxd_info(driver, movie):
    #We set the current URL of the driver to the Letterboxd main page
    url_letterboxd = "https://letterboxd.com/"
    driver.get(url_letterboxd)
    
    #We need to let the browser load everything, so we buffer the code for a second.
    #Somehow, the implicit_wait function does not work in this case...
    driver.implicitly_wait(1)
    time.sleep(1)
    
    #We need to discard the pop-ups of the cookies on the page
    pop_up = driver.find_elements(By.CLASS_NAME, 'fc-cta-do-not-consent')
    if pop_up:
        pop_up[0].click()

    #Now, we search for the movie in question
    search_bar = driver.find_element(By.ID, 'search-q')
    search_bar.clear()
    search_bar.send_keys(movie + " 2023")
    search_bar.send_keys(Keys.RETURN)
    link_element = driver.find_element(By.XPATH, '//a[text()="Films"]')
    link_element.click()

    #We click on the first movie in the list
    results = driver.find_elements(By.CLASS_NAME, "results")
    if not results:
        return {}
    movies_details = results[0].find_elements(By.CLASS_NAME, "film-detail-content")
    header = movies_details[0].find_element(By.CLASS_NAME, "headline-2")
    movie_link = header.find_element(By.TAG_NAME, "a")
    movie_link.click()

    #Now we need to get the info on the movie: the director's name, the ratings and the genre
    #We scroll down the page to avoid the ads.
    time.sleep(1)
    footer = driver.find_element(By.TAG_NAME, "footer")
    delta_y = footer.rect['y']
    ActionChains(driver)\
        .scroll_by_amount(0, 500)\
        .perform()

    driver.find_element(By.XPATH, '//*[@id="crew"]').click()
    directors = driver.find_elements(By.XPATH, '//*[@id="tab-crew"]/div[1]/p/a')
    directors = list(map(lambda x: x.text, directors))

    ratings = driver.find_elements(By.CLASS_NAME, "average-rating")
    if ratings:
        rating = ratings[0].find_element(By.TAG_NAME, "a").text
    else:
        rating = np.nan
    
    driver.find_element(By.XPATH, '//*[@id="tabbed-content"]/header/ul/li[4]/a').click()
    genres = driver.find_element(By.ID, "tab-genres").find_elements(By.TAG_NAME, "a")
    genres = list(map(lambda x: x.text, genres))[:-1]

    return {'ratings': rating, 'genres': genres, 'directors': directors}

In [111]:
get_letterboxd_info(driver, "Spartacus")

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="tab-genres"]"}
  (Session info: chrome=120.0.6099.71); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x00007FF6D3B04D02+56194]
	(No symbol) [0x00007FF6D3A704B2]
	(No symbol) [0x00007FF6D39176AA]
	(No symbol) [0x00007FF6D39616D0]
	(No symbol) [0x00007FF6D39617EC]
	(No symbol) [0x00007FF6D39A4D77]
	(No symbol) [0x00007FF6D3985EBF]
	(No symbol) [0x00007FF6D39A2786]
	(No symbol) [0x00007FF6D3985C23]
	(No symbol) [0x00007FF6D3954A45]
	(No symbol) [0x00007FF6D3955AD4]
	GetHandleVerifier [0x00007FF6D3E7D5BB+3695675]
	GetHandleVerifier [0x00007FF6D3ED6197+4059159]
	GetHandleVerifier [0x00007FF6D3ECDF63+4025827]
	GetHandleVerifier [0x00007FF6D3B9F029+687785]
	(No symbol) [0x00007FF6D3A7B508]
	(No symbol) [0x00007FF6D3A77564]
	(No symbol) [0x00007FF6D3A776E9]
	(No symbol) [0x00007FF6D3A68094]
	BaseThreadInitThunk [0x00007FFE68A9257D+29]
	RtlUserThreadStart [0x00007FFE69F4AA58+40]


In [103]:
def get_movies_UGC(driver, theaterpage):
    #We set the current URL of the driver to the wanted UGC theater
    driver.get(theaterpage)
    #We need to let the browser load everything, so we buffer the code for a second.
    #Somehow, the implicit_wait function does not work in this case...
    time.sleep(5)
    pop_ups = driver.find_elements(By.ID, 'didomi-notice-disagree-button')
    if pop_ups:
        pop_ups[0].click()
        
    #We search for the container for all movies in the page
    movie_container = driver.find_element(By.CLASS_NAME, "dates-content")
    
    #Next, we get the list of all the containers of movie info
    list_of_movies = movie_container.find_elements(By.CLASS_NAME, 'slider-item')
    #Now that we have the list of movies, we will go in each one of them, and access their title (only their title for now)
    if list_of_movies:
        list_of_movies_screenings = []
        for movie in list_of_movies:
            if (len(movie.find_elements(By.CLASS_NAME, 'component--screening-cards')) == 0):
                continue
            if (movie.find_elements(By.CLASS_NAME, 'film-tag')) and (movie.find_elements(By.CLASS_NAME, 'film-tag')[0].text in [" Opéra ", " Ballet "]):
                print("DAS NOT A FILM")
                continue
                
            movie_info = movie.find_element(By.CLASS_NAME, 'block--title')
            movie_title = movie_info.find_element(By.CSS_SELECTOR, "a[data-film-label]")
            title = movie_title.text
            print(title)
    
            screenings = []
            screenings_list = driver.find_element(By.CLASS_NAME, 'component--screening-cards')
            screenings_list = screenings_list.find_elements(By.TAG_NAME, 'button')
    
            for s in screenings_list:
                lang = s.find_element(By.TAG_NAME, 'span').text
                start = s.find_element(By.CLASS_NAME, 'screening-start').text
                end = s.find_element(By.CLASS_NAME, 'screening-end').text
                room = s.find_element(By.CLASS_NAME, 'screening-detail').text
                screenings.append({'lang': lang, 'start': start, 'end': end, 'room': room})
    
            list_of_movies_screenings.append({'title': title, 'screenings': screenings})
        return list_of_movies_screenings
    else:
        print("Pas de film pour aujourd'hui")
        return([])

In [72]:
get_movies_UGC(driver, "https://www.ugc.fr/cinema.html?id=2")

Pas de film pour aujourd'hui


[]

In [67]:
def update_list_of_screenings(driver, list):
    for i in range(len(list)):
        info = get_letterboxd_info(driver, list[i]['title'])
        list[i].update(info)
    return list

In [107]:
films = get_movies_UGC(driver, "https://www.ugc.fr/cinema.html?id=20")
films_updated = update_list_of_screenings(driver, films)

LES TROIS MOUSQUETAIRES : MILADY
WINTER BREAK
WONKA
FOLLOW DEAD
PAST LIVES - NOS VIES D'AVANT
MIGRATION
PERFECT DAYS
SOUDAIN SEULS
VOYAGE AU POLE SUD
WISH - ASHA ET LA BONNE ÉTOILE
HUNGER GAMES: LA BALLADE DU SERPENT ET DE L'OISEAU CHANTEUR
LE GARÇON ET LE HÉRON
BÂTIMENT 5
LA TRESSE
MARS EXPRESS
NAPOLÉON
NOËL JOYEUX
THANKSGIVING LA SEMAINE DE L'HORREUR


AttributeError: 'str' object has no attribute 'text'

In [79]:
pd.DataFrame(films_updated)

Unnamed: 0,title,screenings,ratings,genres,directors
0,WINTER BREAK,"[{'lang': 'VOSTF', 'start': '13:15', 'end': '(...",4.2,"[Comedy, Drama, Underdogs And Coming Of Age, R...",[Alexander Payne]
1,WONKA,"[{'lang': 'VOSTF', 'start': '13:15', 'end': '(...",3.5,"[Comedy, Fantasy]",[Paul King]
2,SOUDAIN SEULS,"[{'lang': 'VOSTF', 'start': '13:15', 'end': '(...",3.3,"[Drama, Adventure]",[Thomas Bidegain]
3,NAPOLÉON,"[{'lang': 'VOSTF', 'start': '13:15', 'end': '(...",3.1,"[Drama, History, War, Epic History And Literat...",[Ridley Scott]
4,LES TROIS MOUSQUETAIRES : MILADY,"[{'lang': 'VOSTF', 'start': '13:15', 'end': '(...",3.4,[Adventure],[Martin Bourboulon]
