# **Scraping AlloCiné**

## **Intro**

Web scraping extracts information from web pages when no structured source, such as an API, is available. In this project, we scrape AlloCiné, a well-known French movie and TV website, to gather information on 57 TV series. The project shows how to take raw HTML content, clean it, and structure it into a dataframe for further analysis and machine learning applications.

In this notebook, we will:

- **Extract information** from a single HTML page to identify the patterns needed for scraping ([Step 1](#step-1-extract-information-from-a-single-page)).

- **Loop through all locally stored pages**, extract the same information for each, and store the results in a dataframe ([Step 2](#step-2-loop-through-all-pages)).

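⚠️ All HTML pages have been downloaded locally and can be found in the `Pages.rar` file. Extract the archive so that the pages sit in a `Pages/` directory next to the notebook, which is where the code below reads them from.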

## **Step-by-Step Breakdown**

### **Step 1: Extract Information from a Single Page**
First, we import the libraries we need: `re` (regular expressions), `pandas` (tabular data), `html` (decoding HTML entities), and `os` (listing files). We then load one sample HTML page from the local `Pages` directory and use it to define the patterns for each field.

In [22]:
import re
import pandas as pd
import html
import os

with open('Pages/Serie_435.html', 'r', encoding='utf8') as f:
    text = f.read()

# Decode HTML entities into plain characters
text = html.unescape(text)

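Here, the page is read into a string and passed through `html.unescape()`, which converts HTML entities (such as `&amp;` or `&eacute;`) into the characters they represent, so the patterns below can match accented text directly.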
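To illustrate what `html.unescape()` does, here is a minimal sketch on a made-up fragment (the string below is hypothetical and not taken from the AlloCiné pages). Only the entities are decoded; the tags themselves are left untouched.

```python
import html

# Hypothetical fragment containing named and numeric HTML entities.
raw = "<title>Chaîne d&#039;origine &amp; Nationalit&eacute;</title>"

# Entities become plain characters; the <title> tags are not modified.
decoded = html.unescape(raw)
print(decoded)
```

This is why a pattern such as `Chaîne d'origine` can be written with a literal apostrophe once the text has been unescaped.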

#### **Extracting Series Information**

We use regular expressions to identify specific patterns in the HTML text, which allows us to extract key information like the title, period, duration, genre, director, and more. For instance, we extract the title of the series using the following pattern:

In [23]:
# Title
title_pattern = """(?s)<meta name="robots" content="index,follow,max-snippet:-1,max-image-preview:large" />\n        <title>(.*?) -"""

title = re.findall(title_pattern, text)
title

['El barco']
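The patterns in this notebook spell out the exact indentation between tags, so they break if the page layout shifts by even one space. As a hedged alternative, `\s*` can absorb any run of spaces and newlines. The sketch below uses a simplified, hypothetical fragment shaped like the page header; the real `<meta>` tag and title suffix may differ.

```python
import re

# Hypothetical, simplified header fragment (not the real AlloCiné markup).
sample = (
    '<meta name="robots" content="index,follow" />\n'
    '        <title>El barco - Série TV 2011 - AlloCiné</title>'
)

# [^>]* skips the remaining attributes of the <meta> tag;
# \s* matches whatever whitespace separates the two tags.
title_pattern = re.compile(r'<meta name="robots"[^>]*/>\s*<title>(.*?) -')

match = title_pattern.search(sample)
title = match.group(1) if match else None
print(title)
```

The lazy group `(.*?)` stops at the first ` -`, so only the part of the title before the first dash is kept, as in the cell above.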

In [24]:
# Status
status_pattern = """(?s)<span class="ico-play-inner"></span>\n      <i class="ico-play-arrow">\n          <i class="arrow"></i>\n      </i>\n  </span>\n\n\n        \n            \n            \n        \n                <div class="label label-text label-sm label-danger-full label-status">(.*?)</div>"""

status = re.findall(status_pattern, text)
# Fall back to a placeholder when the page has no status label
if not status:
    status = [None]

status

[None]
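Several fields follow the same "first match, or `None` if absent" logic. A small helper can express it once. This is only a sketch: `first_match` is a hypothetical name, and the two fragments and the simplified pattern are assumptions, not the real page markup.

```python
import re
from typing import Optional


def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group, stripped, or None when nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical fragments: one page with a status label, one without.
page_with_status = '<div class="label label-sm label-status">Terminée</div>'
page_without_status = '<div class="meta-body-item">2011 - 2013</div>'

# Simplified pattern: any class list that starts with "label" and ends with "label-status".
status_pattern = r'<div class="label[^"]*label-status">(.*?)</div>'

print(first_match(status_pattern, page_with_status))
print(first_match(status_pattern, page_without_status))
```

Returning `None` directly, rather than an empty list, keeps the missing-value handling in one place.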

In [25]:
# Period
period_pattern = """(?s)<div class="meta  ">\n        \n        <div class="meta-body">\n                    <div class="meta-body-item meta-body-info">\n                                                                \n                                                    \n                                                    \n                                                    \n                                                \n                (.*?)\n                \n                                <span class="spacer">"""

period = re.findall(period_pattern, text)
period

['2011 - 2013']

In [26]:
# Duration
duration_pattern = """(?s)<span class="spacer">/</span>\n                \n                                \n                                    (.*?)min\n                                \n                                    <span class="spacer">/</span>"""
                                    
duration = re.findall(duration_pattern, text)
duration = [int(i) for i in duration]
duration

[75]

In [27]:
# The genre
type_pattern = """(?s)<span class="spacer">/</span>\n                \n                                                                                                                                            <span class="ACrL3NACrl(.*?)</span>"""


# Note: `type` shadows the Python built-in; the name is kept because later cells use it
type = re.findall(type_pattern, text)

# Keep everything after the '>' that closes the opening <span> tag
type = [re.search('>(.*)', item).group(1) for item in type]
type

['Aventure']

In [28]:
# Director
director_pattern = """(?s)<span class="light">De</span>\n                                                                                                \n                                                                                    <a class="blue-link" href="/personne/fichepersonne_gen_cpersonne=(.*?)</a>"""
                                                                              
director = re.findall(director_pattern, text)
director_cleaned = [re.search('>(.*)', item).group(1) for item in director]
director = [', '.join(director_cleaned)]
director

['Iván Escobar']

In [29]:
# Main character
main_character_pattern = """(?s)<span class="light">Avec</span>\n                \n                                                                                                                                    \n                                                                                    <span class="ACrL3BACrl(.*?)</span>"""
                                                                                    
main_character = re.findall(main_character_pattern, text)
main_character_cleaned = [re.search('>(.*)', item).group(1) for item in main_character]
main_character = [', '.join(main_character_cleaned)]
main_character

['Mario Casas']

In [30]:
# Nationality
nationality_pattern = """(?s)<span class="light">Nationalité</span>\n                                                <span class="ACrL3NA(.*?)</span>\n                            </div>"""
                                                                                    
nationality = re.findall(nationality_pattern, text)
nationality_cleaned = [re.search('> (.*)', item).group(1) for item in nationality]
nationality = [', '.join(nationality_cleaned)]
nationality

['Espagne']

In [31]:
# The channel
channel_pattern = """<span class="light">Chaîne d\'origine</span> <span class="ACrL3NACrvY2ll(.*?) </span>"""

channel = re.findall(channel_pattern, text)
channel_cleaned = [re.search('> (.*)', item).group(1) for item in channel]
channel = [', '.join(channel_cleaned)]
channel

['Antena 3']

In [32]:
# Press rating and number of press reviews
p_ratings_pattern = """(?s)stareval-stars"><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div></div><span class="stareval-note">(.*?)</span><span class="stareval-review light"> (.*?)</span></div>\n        </div>\n    </div>\n    \n            <div class="rating-item">\n                                    <div class="rating-item-content">\n                        <span class="ACrL3NACrl"""

p_ratings = re.findall(p_ratings_pattern, text)
if p_ratings:
    note = [float(p_ratings[0][0].replace(',', '.'))]
    critics = [int(re.findall(r'[0-9]+', p_ratings[0][1])[0])]
else:
    note = [None]
    critics = [None]

print(note, critics)

[None] [None]
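The press-rating cell converts two captured strings: a rating written with a decimal comma and a label containing the number of reviews. The sketch below isolates those two conversions. The function names and the sample strings (`'3,9'`, `'7 critiques'`) are illustrative assumptions about what the groups look like.

```python
import re
from typing import Optional


def parse_note(raw: str) -> float:
    """Convert a French-formatted rating such as '3,9' to a float."""
    return float(raw.strip().replace(",", "."))


def parse_count(raw: str) -> Optional[int]:
    """Return the first integer found in a label, or None if there is none."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


# Hypothetical captured groups, shaped like the two groups of p_ratings_pattern.
note = parse_note("3,9")
critics = parse_count("7 critiques")
print(note, critics)
```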


In [33]:
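# Public rating
# Caveat: this pattern is a prefix of the press pattern, so on pages that have a
# press rating the first match is likely the press score (in the Step 2 preview,
# 'Note public' equals 'Note press' on the rows where a press rating exists).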
a_ratings_pattern = """(?s)stareval-stars"><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div></div><span class="stareval-note">(.*?)</span><span class="stareval-review light">"""

a_ratings = re.findall(a_ratings_pattern, text)
if a_ratings:
    a_ratings = [float(val.replace(',', '.')) for val in a_ratings]
    a_ratings = [a_ratings[0]]
else:
    a_ratings = [None]

a_ratings

[3.8]

In [34]:
# Number of seasons and episodes
seasons_nb_pattern = """(?s)<div class="stats-numbers-row stats-numbers-seriespage">\n                <div class="stats-numbers-row-item">\n            <div class="stats-number">(.*?)</div>\n            <div class="stats-info">(.*?)</div>\n        </div>\n        \n                <div class="stats-numbers-row-item">\n            <div class="stats-number">(.*?)</div>\n            <div class="stats-info">Episode"""

seasons_nb = re.findall(seasons_nb_pattern, text)
if seasons_nb:
    seasons_nb = [int(seasons_nb[0][0]), int(seasons_nb[0][2])]
else:
    seasons_nb = [None, None]
seasons_nb

[None, None]

In [35]:
# Description (synopsis)
desc_pattern = """(?s)\n    "description": "(.*?)"\n"""

desc = re.findall(desc_pattern, text)
# Fall back to a placeholder when the page has no description
if not desc:
    desc = [None]

desc

[None]

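Each field above is extracted with its own pattern and kept as a small list. Fields with an explicit fallback (status, ratings, seasons and episodes, description) use `[None]`, or `[None, None]` for seasons and episodes, so a missing value still occupies its column. We then flatten everything into a single list, `new_line`, which becomes one row of the dataframe in Step 2.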

In [36]:
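# `fields` rather than `all`, to avoid shadowing the Python built-in
fields = [
        title, status, period, duration, type, director, main_character, nationality,
        channel, note, critics, a_ratings, seasons_nb, desc
]

# Flatten the one- and two-element lists into a single row
new_line = []

for sub_list in fields:
    for i in sub_list:
        new_line.append(i)

new_line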

['El barco',
 None,
 '2011 - 2013',
 75,
 'Aventure',
 'Iván Escobar',
 'Mario Casas',
 'Espagne',
 'Antena 3',
 None,
 None,
 3.8,
 None,
 None,
 None]
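The nested loop that builds `new_line` can also be written with `itertools.chain.from_iterable`. The sketch below uses a shortened, hypothetical list of fields and adds a length check, since a field that matched nothing would contribute zero values and shift every later column.

```python
from itertools import chain

# Shortened, hypothetical set of extracted fields (each one is a list).
fields = [["El barco"], [None], ["2011 - 2013"], [75], [None, None]]
expected_length = 6  # number of columns this shortened row should fill

# Flatten the list of lists into one flat row.
row = list(chain.from_iterable(fields))

# Guard against a field that returned an empty list.
if len(row) != expected_length:
    raise ValueError(f"expected {expected_length} values, got {len(row)}")

print(row)
```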

### **Step 2: Loop Through All Pages**
Once we have successfully extracted data from a single page, we extend this process to all 57 pages stored locally in the 'Pages' directory. Using a `for` loop, we read each HTML file, extract the necessary information, and append it to our dataframe.

In [37]:
colonnes = [
    'Titre', 'Statut', 'Période', 'Durée', 'Genre', 'Réalisateur', 'Perso Principal', 
    'Nationalité', 'Chaîne', 'Note press', 'Critiques press', 'Note public', 
    'Saisons', 'Épisodes', 'Description'
]

film_data = pd.DataFrame(columns= colonnes)

for page in os.listdir('Pages'):
    file_path = os.path.join('Pages', page)

    with open(file_path, 'r', encoding='utf8') as f:
        text = f.read()

    # Decode HTML entities, as in Step 1
    text = html.unescape(text)
    
    title_pattern = """(?s)<meta name="robots" content="index,follow,max-snippet:-1,max-image-preview:large" />\n        <title>(.*?) -"""
    title = re.findall(title_pattern, text)


    status_pattern = """(?s)<span class="ico-play-inner"></span>\n      <i class="ico-play-arrow">\n          <i class="arrow"></i>\n      </i>\n  </span>\n\n\n        \n            \n            \n        \n                <div class="label label-text label-sm label-danger-full label-status">(.*?)</div>"""
    status = re.findall(status_pattern, text)
    if not status:
        status = [None]


    period_pattern = """(?s)<div class="meta  ">\n        \n        <div class="meta-body">\n                    <div class="meta-body-item meta-body-info">\n                                                                \n                                                    \n                                                    \n                                                    \n                                                \n                (.*?)\n                \n                                <span class="spacer">"""
    period = re.findall(period_pattern, text)


    duration_pattern = """(?s)<span class="spacer">/</span>\n                \n                                \n                                    (.*?)min\n                                \n                                    <span class="spacer">/</span>"""
    duration = re.findall(duration_pattern, text)
    duration = [int(i) for i in duration]


    type_pattern = """(?s)<span class="spacer">/</span>\n                \n                                                                                                                                            <span class="ACrL3NACrl(.*?)</span>"""
    type = re.findall(type_pattern, text)

    # Keep everything after the '>' that closes the opening <span> tag
    type = [re.search('>(.*)', item).group(1) for item in type]


    director_pattern = """(?s)<span class="light">De</span>\n                                                                                                \n                                                                                    <a class="blue-link" href="/personne/fichepersonne_gen_cpersonne=(.*?)</a>"""
    director = re.findall(director_pattern, text)
    director_cleaned = [re.search('>(.*)', item).group(1) for item in director]
    director = [', '.join(director_cleaned)]


    main_character_pattern = """(?s)<span class="light">Avec</span>\n                \n                                                                                                                                    \n                                                                                    <span class="ACrL3BACrl(.*?)</span>"""
    main_character = re.findall(main_character_pattern, text)
    main_character_cleaned = [re.search('>(.*)', item).group(1) for item in main_character]
    main_character = [', '.join(main_character_cleaned)]


    nationality_pattern = """(?s)<span class="light">Nationalité</span>\n                                                <span class="ACrL3NA(.*?)</span>\n                            </div>"""
    nationality = re.findall(nationality_pattern, text)
    nationality_cleaned = [re.search('> (.*)', item).group(1) for item in nationality]
    nationality = [', '.join(nationality_cleaned)]


    channel_pattern = """<span class="light">Chaîne d\'origine</span> <span class="ACrL3NACrvY2ll(.*?) </span>"""
    channel = re.findall(channel_pattern, text)
    channel_cleaned = [re.search('> (.*)', item).group(1) for item in channel]
    channel = [', '.join(channel_cleaned)]


    p_ratings_pattern = """(?s)stareval-stars"><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div></div><span class="stareval-note">(.*?)</span><span class="stareval-review light"> (.*?)</span></div>\n        </div>\n    </div>\n    \n            <div class="rating-item">\n                                    <div class="rating-item-content">\n                        <span class="ACrL3NACrl"""
    p_ratings = re.findall(p_ratings_pattern, text)

    if p_ratings:
        note = [float(p_ratings[0][0].replace(',', '.'))]
        critics = [int(re.findall(r'[0-9]+', p_ratings[0][1])[0])]
    else:
        note = [None]
        critics = [None]


    a_ratings_pattern = """(?s)stareval-stars"><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div><div class="star icon"></div></div><span class="stareval-note">(.*?)</span><span class="stareval-review light">"""
    a_ratings = re.findall(a_ratings_pattern, text)
    if a_ratings:
        a_ratings = [float(val.replace(',', '.')) for val in a_ratings]
        a_ratings = [a_ratings[0]]
    else:
        a_ratings = [None]


    seasons_nb_pattern = """(?s)<div class="stats-numbers-row stats-numbers-seriespage">\n                <div class="stats-numbers-row-item">\n            <div class="stats-number">(.*?)</div>\n            <div class="stats-info">(.*?)</div>\n        </div>\n        \n                <div class="stats-numbers-row-item">\n            <div class="stats-number">(.*?)</div>\n            <div class="stats-info">Episode"""
    seasons_nb = re.findall(seasons_nb_pattern, text)

    if seasons_nb:
        seasons_nb = [int(seasons_nb[0][0]), int(seasons_nb[0][2])]
    else:
        seasons_nb = [None, None]


    desc_pattern = """(?s)\n    "description": "(.*?)"\n"""
    desc = re.findall(desc_pattern, text)
    if not desc:
        desc = [None]


    # `fields` rather than `all`, to avoid shadowing the Python built-in
    fields = [title, status, period, duration, type, director, main_character, nationality,
        channel, note, critics, a_ratings, seasons_nb, desc
    ]

    new_line = []

    for sub_list in fields:
        for i in sub_list:
            new_line.append(i)

    temp = pd.DataFrame([new_line], columns= colonnes) # temporary one-row dataframe for this page

    film_data = pd.concat([film_data, temp], ignore_index=True)


film_data.head()


|   | Titre | Statut | Période | Durée | Genre | Réalisateur | Perso Principal | Nationalité | Chaîne | Note press | Critiques press | Note public | Saisons | Épisodes | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alphas | Annulée | 2011 - 2012 | 42 | Drame | Gail Berman | David Strathairn | U.S.A. | SyFy US | 3.9 | 7.0 | 3.9 | 2 | 24 | Des personnes dotées de capacités neurologique... |
| 1 | Wolfblood | Annulée | 2012 - 2017 | 26 | Drame | Debbie Moon | Bobby Lockwood | Grande-Bretagne | CBBC | NaN | NaN | 4.0 | 5 | 62 | Nouvel élève au lycée de Stoneybridge, Rhydian... |
| 2 | The Frankenstein Chronicles | Annulée | 2015 - 2017 | 52 | Drame | Benjamin Ross | Sean Bean | Grande-Bretagne | ITV | NaN | NaN | 4.0 | 2 | 12 | Londres, 1827. Alors que la police fluviale de... |
| 3 | Being Human (US) | Annulée | 2011 - 2014 | 42 | Drame | Jeremy Carver | Sam Witwer | U.S.A. | SyFy US | NaN | NaN | 3.9 | 4 | 52 | Trois colocataires âgés d'une trentaine d'anné... |
| 4 | Les Revenants | Terminée | 2012 - 2015 | 52 | Drame | Fabrice Gobert | Anne Consigny | France | Canal + | 4.2 | 12.0 | 4.2 | 2 | 16 | Dans une ville de montagne dominée par un giga... |


Each page is processed in the same way as in Step 1, and the resulting one-row dataframe is concatenated onto `film_data`.
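Growing a dataframe with `pd.concat` inside the loop copies the data on every iteration, and recent pandas versions can warn when one of the inputs is empty. A common alternative is to collect plain rows in a list and build the dataframe once at the end. The sketch below assumes a hypothetical `extract_row` function and two made-up in-memory pages in place of the files in `Pages`.

```python
import re
import pandas as pd

columns = ["Titre", "Durée"]


def extract_row(text):
    """Hypothetical, simplified extractor returning one value per column."""
    title = re.search(r"<title>(.*?) -", text)
    duration = re.search(r"(\d+)\s*min", text)
    return [
        title.group(1) if title else None,
        int(duration.group(1)) if duration else None,
    ]


# Made-up page contents standing in for the local HTML files.
pages = {
    "a.html": "<title>Alphas - Série</title> 42min",
    "b.html": "<title>Wolfblood - Série</title>",
}

# Collect every row first, then build the dataframe in a single call.
rows = [extract_row(text) for _, text in sorted(pages.items())]
demo = pd.DataFrame(rows, columns=columns)
print(demo)
```

With this layout, the column list is used once and no empty dataframe is ever concatenated.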

#### **Handling Missing Values**
Some pages do not contain every field, which results in missing values (`NaN`) in the dataframe. Before any analysis, we count them per column to see which fields are affected.

In [38]:
film_data.isna().sum()

Titre               0
Statut             12
Période             0
Durée               0
Genre               0
Réalisateur         0
Perso Principal     0
Nationalité         0
Chaîne              0
Note press         32
Critiques press    32
Note public         0
Saisons             1
Épisodes            1
Description         1
dtype: int64

`isna().sum()` counts the missing values in each column. Press ratings are the sparsest field (32 missing), followed by the status (12); seasons, episodes, and description are each missing once.
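Counts are easier to compare as proportions. Since `isna()` returns booleans, taking the mean gives the share of missing values per column. The small dataframe below is made up for illustration and is not the scraped data.

```python
import pandas as pd

# Made-up three-row dataframe with a few missing values.
demo = pd.DataFrame({
    "Titre": ["A", "B", "C"],
    "Statut": ["Annulée", None, "Terminée"],
    "Note press": [3.9, None, None],
})

missing_count = demo.isna().sum()            # missing values per column
missing_share = demo.isna().mean().round(2)  # fraction of rows that are missing

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```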

#### **Saving the Data**
Finally, we save the resulting dataframe to a CSV file for future use.

In [39]:
film_data.to_csv('film_data.csv', index=False, encoding='utf-8-sig')
print("Great, that's perfect!")

Great, that's perfect!
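The `utf-8-sig` encoding passed to `to_csv` writes a three-byte byte order mark (BOM) before the UTF-8 data. Spreadsheet tools such as Excel commonly rely on it to detect the encoding, which helps accented column names like `Période` display correctly. The sketch below shows the difference at the byte level on a made-up two-line CSV string.

```python
import codecs

# Made-up CSV content with accented headers.
text = "Période,Durée\n2011 - 2013,75\n"

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# The only difference is the BOM prefix.
print(with_bom[:3] == codecs.BOM_UTF8)
print(with_bom[3:] == plain)

# Decoding with utf-8-sig strips the BOM again.
decoded = with_bom.decode("utf-8-sig")
print(decoded == text)
```

When reading the file back with pandas, passing `encoding='utf-8-sig'` to `pd.read_csv` likewise keeps the BOM out of the first column name.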
