# TOPIC: BUILDING A SYSTEM TO TRACK PRICES ON A COMPETITOR'S WEBSITE


### OBJECTIVE: Build a program (a scraper) in Python that extracts pricing information from the online bookstore Book To Scrape, a book retailer.

To carry out this work, we proceeded as follows:

1. Visit the site to identify the information relevant to our work
2. Look for a scraping approach that works with the site
3. Scrape the site
4. Put the data into a usable tabular format
5. Export the data

### Importing the libraries

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import requests
from bs4 import BeautifulSoup
import scrapy
from scrapy.crawler import CrawlerProcess
pd.set_option("display.max_columns", None)

### Visiting the site

In [2]:
"http://books.toscrape.com"

'http://books.toscrape.com'

The information judged relevant on the site is:
- the book title
- its category
- its price
- its rating (reviews)
- its availability (In stock)

This information will be useful for building the dashboards.
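
As a rough illustration, the fields listed above could be gathered into one record per book. The sketch below is only an assumption about how to organise them: the `BookRecord` name and its field names are hypothetical and do not come from the notebook.

```python
from dataclasses import dataclass, asdict


@dataclass
class BookRecord:
    """One scraped book; a hypothetical container, not part of the notebook."""
    title: str
    category: str
    price: str         # raw text as displayed on the site, e.g. "£45.17"
    rating: str        # star rating as a word, e.g. "Two"
    availability: str  # e.g. "In stock"


example = BookRecord("It's Only the Himalayas", "Travel", "£45.17", "Two", "In stock")
row = asdict(example)  # plain dict, convenient for building a DataFrame later
```

Keeping the raw text in `price` and `rating` leaves the conversion to numbers for a separate cleaning step.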

### Finding or building the scraper

In [3]:
url_bookts = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"

In [4]:
page = requests.get(url_bookts)
html_content = page.text
soup = BeautifulSoup(html_content, 'html.parser')

In [5]:
page.status_code

200
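
A status code of 200 means the request succeeded. As a hedged sketch, the check can be made automatic so that a failed request stops the run instead of being parsed as an empty page. The `fetch_html` helper and its `session` and `timeout` parameters are hypothetical names introduced here; only `get`, `raise_for_status` and `text` come from the `requests` library.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors."""
    # Use the given session if there is one, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `html_content = fetch_html(url_bookts)` would replace the separate request and status check.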

In [6]:
produits  = soup.find_all('div', {'class':"product_price"})  
produits

[<div class="product_price">
 <p class="price_color">Â£45.17</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 <div class="product_price">
 <p class="price_color">Â£49.43</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 <div class="product_price">
 <p class="price_color">Â£48.87</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 <div class="product_price">
 <p class="price_color">Â£36.94</p>
 <p class="instock availability">
 <i class="ic

As we can see, the result does not come in the usual form, so we need to find ways to adapt the approach.
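
One visible oddity in the output above is the `Â` printed before each pound sign. This pattern is typical of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The sketch below reproduces the effect on a plain string, assuming that this is indeed what happens here; the variable names are illustrative.

```python
original = "£45.17"

# Encode as UTF-8, then decode the same bytes with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the intended text
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # Â£45.17
print(repaired)  # £45.17
```

If that assumption holds, passing the raw bytes (`page.content`) to `BeautifulSoup`, or setting `page.encoding = "utf-8"` before reading `page.text`, are two common ways to avoid the problem at the source.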

In [7]:
url_bookts = "http://books.toscrape.com/catalogue/category/books_1/page-2.html"

response = requests.get(url_bookts)

soup = BeautifulSoup(response.text, "html.parser")


In [12]:

# Pages to scrape: the first page of the full catalogue and two category pages
urls = [
    'http://books.toscrape.com/catalogue/category/books_1/page-1.html',
    'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
    'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
]

all_titles = []
all_prices = []
all_ratings = []
all_availabilities = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each book sits in its own <li> element of the product grid
    products = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

    for product in products:
        # The full title is stored in the "title" attribute of the link
        all_titles.append(product.find('h3').find('a')['title'])
        all_prices.append(product.find('p', class_='price_color').text.strip())
        # The rating is the second CSS class, e.g. class="star-rating Two"
        all_ratings.append(product.find('p', class_='star-rating')['class'][1])
        all_availabilities.append(product.find('p', class_='instock availability').text.strip())

# One row per book, one column per field
data = pd.DataFrame({
    'Title': all_titles,
    'price': all_prices,
    'Rating': all_ratings,
    'Availability': all_availabilities,
})

df = pd.DataFrame(data)
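
The extraction logic of the loop above can be tried offline on a small HTML fragment. The sketch below assumes a hand-written `SAMPLE_HTML` that imitates one product card; `parse_product` and `parse_listing` are hypothetical helper names, while the tags and classes are the ones used in the cell above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating one product card (an assumption, not a page copy)
SAMPLE_HTML = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <p class="star-rating Two"></p>
    <h3><a href="#" title="It's Only the Himalayas">It's Only the ...</a></h3>
    <div class="product_price">
      <p class="price_color">£45.17</p>
      <p class="instock availability">
        <i class="icon-ok"></i>
        In stock
      </p>
    </div>
  </li>
</ol>
"""


def parse_product(card):
    """Extract the four fields of one product card as a dict."""
    return {
        "Title": card.find("h3").find("a")["title"],
        "Price": card.find("p", class_="price_color").get_text(strip=True),
        # The second CSS class carries the rating word, e.g. "Two"
        "Rating": card.find("p", class_="star-rating")["class"][1],
        "Availability": card.find("p", class_="instock availability").get_text(strip=True),
    }


def parse_listing(html):
    """Return one dict per product card found in the HTML text."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
    return [parse_product(card) for card in cards]


rows = parse_listing(SAMPLE_HTML)
print(rows)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame`, which avoids keeping several parallel lists in sync.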


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Title         51 non-null     object
 1   price         51 non-null     object
 2   Rating        51 non-null     object
 3   Availability  51 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


#### Automating the task

To automate the tasks performed above, we first listed the categories in order to build the URLs needed for scraping. We then used these URLs to access the pages and scrape them.

Books is not a category: it groups all the other categories, which is why it is removed from the list.

In [39]:
base_url = 'http://books.toscrape.com/'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

# The sidebar lists every category; the first link ("Books") is skipped
categories_container = soup.find('div', class_='side_categories')
categories = [category.text.strip() for category in categories_container.find_all('a')[1:]]
catalogue = pd.DataFrame({'Category': categories})
# Number the categories from 1 instead of 0
catalogue.index = catalogue.index + 1
catalogue.head(20)

| | Category |
|---|---|
| 1 | Travel |
| 2 | Mystery |
| 3 | Historical Fiction |
| 4 | Sequential Art |
| 5 | Classics |
| 6 | Philosophy |
| 7 | Romance |
| 8 | Womens Fiction |
| 9 | Fiction |
| 10 | Childrens |
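
The category URLs used earlier, such as `.../books/travel_2/index.html` and `.../books/mystery_3/index.html`, suggest a pattern: the category name in lowercase with hyphens, followed by its position in the list above plus one. The sketch below encodes that observed pattern; `category_url` is a hypothetical helper, and the pattern is an assumption inferred from those examples, not a documented rule of the site.

```python
BASE_URL = "http://books.toscrape.com/"


def category_url(name, position, page=1):
    """Build the URL of one page of a category from its name and list position."""
    # "Historical Fiction" -> "historical-fiction"
    slug = name.replace(" ", "-").lower()
    # The first page is index.html; later pages are page-2.html, page-3.html, ...
    page_name = "index.html" if page == 1 else f"page-{page}.html"
    return f"{BASE_URL}catalogue/category/books/{slug}_{position + 1}/{page_name}"


print(category_url("Travel", 1))
print(category_url("Mystery", 2, page=2))
```

Isolating the pattern in one function makes it easy to check against a known URL before launching the full crawl.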


In [40]:
all_titles = []
all_prices = []
all_ratings = []
all_availabilities = []
all_categories = []

for index, row in catalogue.iterrows():
    # Category folders look like "travel_2": the name as a slug, then position + 1
    slug = row['Category'].replace(' ', '-').lower() + '_' + str(index + 1)
    # At most 8 pages are requested per category
    for i in range(1, 9):
        # The first page is index.html; later pages are page-2.html, page-3.html, ...
        page_name = 'index.html' if i == 1 else 'page-' + str(i) + '.html'
        url = base_url + 'catalogue/category/books/' + slug + '/' + page_name
        response = requests.get(url)
        if response.status_code != 200:
            # The page does not exist, so this category has no further pages
            break
        soup = BeautifulSoup(response.text, 'html.parser')

        products = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

        for p in products:
            title = p.find('h3').find('a')['title']
            all_titles.append(title)

            price = p.find('p', class_='price_color').text.strip()
            all_prices.append(price)

            rating = p.find('p', class_='star-rating')['class'][1]
            all_ratings.append(rating)

            availability = p.find('p', class_='instock availability').text.strip()
            all_availabilities.append(availability)

            all_categories.append(row['Category'])

data = {
    'Title': all_titles,
    'Price': all_prices,
    'Rating': all_ratings,
    'Availability': all_availabilities,
    'Category': all_categories
}

df = pd.DataFrame(data)


In [41]:
df.head(35)

| | Title | Price | Rating | Availability | Category |
|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | Â£45.17 | Two | In stock | Travel |
| 1 | Full Moon over Noahâs Ark: An Odyssey to Mou... | Â£49.43 | Four | In stock | Travel |
| 2 | See America: A Celebration of Our National Par... | Â£48.87 | Three | In stock | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Â£36.94 | Two | In stock | Travel |
| 4 | Under the Tuscan Sun | Â£37.33 | Three | In stock | Travel |
| 5 | A Summer In Europe | Â£44.34 | Two | In stock | Travel |
| 6 | The Great Railway Bazaar | Â£30.54 | One | In stock | Travel |
| 7 | A Year in Provence (Provence #1) | Â£56.88 | Four | In stock | Travel |
| 8 | The Road to Little Dribbling: Adventures of an... | Â£23.21 | One | In stock | Travel |
| 9 | Neither Here nor There: Travels in Europe | Â£38.95 | Three | In stock | Travel |


The results show that the values in the Price and Rating columns are not in a suitable format and need to be processed. That is the purpose of the next section.

### Processing the collected data

In [42]:
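data = (df
        # prix: keep only the digits of Price; the decimal point is removed too,
        # so "45.17" becomes 4517.0 (the amount expressed in pence)
        .assign(prix=df.Price.replace(regex=True, to_replace=r'[^0-9]', value=''),
                # unité: first run of non-word characters, i.e. the currency symbol
                unité=df.Price.str.extract(r"(\W+)")
               )
        .astype({"prix": float}, errors="ignore")
       )
data.head(20)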


| | Title | Price | Rating | Availability | Category | prix | unité |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | Â£45.17 | Two | In stock | Travel | 4517.0 | £ |
| 1 | Full Moon over Noahâs Ark: An Odyssey to Mou... | Â£49.43 | Four | In stock | Travel | 4943.0 | £ |
| 2 | See America: A Celebration of Our National Par... | Â£48.87 | Three | In stock | Travel | 4887.0 | £ |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Â£36.94 | Two | In stock | Travel | 3694.0 | £ |
| 4 | Under the Tuscan Sun | Â£37.33 | Three | In stock | Travel | 3733.0 | £ |
| 5 | A Summer In Europe | Â£44.34 | Two | In stock | Travel | 4434.0 | £ |
| 6 | The Great Railway Bazaar | Â£30.54 | One | In stock | Travel | 3054.0 | £ |
| 7 | A Year in Provence (Provence #1) | Â£56.88 | Four | In stock | Travel | 5688.0 | £ |
| 8 | The Road to Little Dribbling: Adventures of an... | Â£23.21 | One | In stock | Travel | 2321.0 | £ |
| 9 | Neither Here nor There: Travels in Europe | Â£38.95 | Three | In stock | Travel | 3895.0 | £ |
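
In the table above, a listed price of 45.17 becomes 4517.0 because the pattern `[^0-9]` removes the decimal point along with the currency symbol. If prices in pounds are wanted instead, one possible variant, shown here on a small hand-made sample rather than on the scraped data, is to extract the numeric part including its decimals. The `prices` series and the `amount` and `currency` names are illustrative.

```python
import pandas as pd

# Hand-made sample mimicking the scraped Price column, stray "Â" included
prices = pd.Series(["Â£45.17", "Â£49.43", "Â£48.87"])

# Digits, optionally followed by a decimal part, converted to float
amount = prices.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# First run of characters that are neither word characters, spaces nor dots
currency = prices.str.extract(r"([^\w\s.]+)", expand=False)

print(amount.tolist())    # [45.17, 49.43, 48.87]
print(currency.tolist())  # ['£', '£', '£']
```

Whether to store pounds or pence is a design choice; what matters is that the numeric column and the unit column stay consistent with each other.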


In [43]:
word_to_num = {
    'One': 1,
    'Two': 2,
    'Three': 3,
    'Four': 4,
    'Five': 5
}

def convert_word_to_num(word):
    # Unknown labels are returned unchanged
    return word_to_num.get(word.strip(), word)

data['Avis'] = data['Rating'].apply(convert_word_to_num)

In [44]:
data.head()

| | Title | Price | Rating | Availability | Category | prix | unité | Avis |
|---|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | Â£45.17 | Two | In stock | Travel | 4517.0 | £ | 2 |
| 1 | Full Moon over Noahâs Ark: An Odyssey to Mou... | Â£49.43 | Four | In stock | Travel | 4943.0 | £ | 4 |
| 2 | See America: A Celebration of Our National Par... | Â£48.87 | Three | In stock | Travel | 4887.0 | £ | 3 |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Â£36.94 | Two | In stock | Travel | 3694.0 | £ | 2 |
| 4 | Under the Tuscan Sun | Â£37.33 | Three | In stock | Travel | 3733.0 | £ | 3 |
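
The same conversion can also be written with `Series.map`, which turns any label missing from the dictionary into `NaN` instead of leaving the original word in place, so unexpected values are easy to detect. This is a sketch on a hand-made sample; `ratings`, `avis` and `unknown` are illustrative names.

```python
import pandas as pd

word_to_num = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Hand-made sample; "Zero" stands for an unexpected label
ratings = pd.Series(["Two", "Four", "Zero"])

# Labels absent from the dictionary become NaN
avis = ratings.str.strip().map(word_to_num)

# Keep the original labels that could not be converted
unknown = ratings[avis.isna()]

print(unknown.tolist())  # ['Zero']
```

An empty `unknown` series confirms that every rating was converted.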


### Building the final dataset and exporting the data

In [46]:
Base = data[['Title', 'prix', 'unité', 'Category', 'Avis', 'Availability']]

In [47]:
Base.to_csv("Base.csv", index=False)

In [48]:
Base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         1000 non-null   object 
 1   prix          1000 non-null   float64
 2   unité         1000 non-null   object 
 3   Category      1000 non-null   object 
 4   Avis          1000 non-null   int64  
 5   Availability  1000 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 47.0+ KB
