
In [32]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import json
import math
import requests 
from time import sleep 
from bs4 import BeautifulSoup
from os import path 
from IPython.display import clear_output, display

**Configuration required to run the scraper**

In [33]:
# Trustpilot review page
url = 'https://fr.trustpilot.com/review/direct-assurance.fr'

# Data file to save reviews to
save_datafile = 'trustpilot_reviews.csv'

# Final list to be the dataframe
final_list = [] 

# Handling for Pagination
results_per_page = 20 
run_pagination_finder = True
total_pages = 1

# Throttling to avoid flooding the site with requests:
# when enabled, wait sleep_time seconds between page requests
throttle = False
sleep_time = 1

print(f'Scraper set for {url} \nSaving results to {save_datafile}'
      f'\nRun Pagination Finder: {run_pagination_finder} \nThrottling On: {throttle}')

Scraper set for https://fr.trustpilot.com/review/direct-assurance.fr 
Saving results to trustpilot_reviews.csv
Run Pagination Finder: True 
Throttling On: False
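
The `throttle` and `sleep_time` settings above describe a pause between page requests. As a minimal sketch of that idea, the generator below waits between consecutive items. The name `throttled` and the `sleep_fn` parameter are hypothetical and not part of the notebook; `sleep_fn` exists only so the pause can be observed without actually waiting.

```python
from time import sleep


def throttled(items, delay, sleep_fn=sleep):
    """Yield items one by one, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the first item, only between requests
            sleep_fn(delay)
        yield item


# Record the pauses in a list instead of really sleeping (illustration only)
pauses = []
pages = list(throttled(range(1, 4), delay=1, sleep_fn=pauses.append))
```

In the scraping loop, this could wrap the page range, for example `for page_num in throttled(range(1, total_pages + 1), sleep_time):`, assuming throttling is wanted on every run.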


**Pagination finder: total number of pages to scrape**

In [34]:
# Count the number of pages to scrape
if run_pagination_finder:
    # Get page
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # Get total number of reviews
    rating_count = soup.find('span', class_='headline__review-count')
    rating_count = int(rating_count.text.replace(',', ''))

    # Total pages to scrape
    total_pages = math.ceil(rating_count / results_per_page)

    print(f'Found total of {total_pages} pages to scrape')

Found total of 34 pages to scrape
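
The page count is the review count divided by `results_per_page`, rounded up, so a result of 34 pages at 20 reviews per page corresponds to a count between 661 and 680. The sketch below isolates that arithmetic and the `?page=` URL pattern used by the scraper. The helper names `count_pages` and `page_urls` and the `example.com` address are hypothetical, chosen only for illustration.

```python
import math


def count_pages(review_count, per_page=20):
    """Number of pages needed to list `review_count` reviews."""
    return math.ceil(review_count / per_page)


def page_urls(base_url, total_pages):
    """Build one URL per page, numbered from 1."""
    return [f'{base_url}?page={n}' for n in range(1, total_pages + 1)]


# Hypothetical base URL, used only to show the pattern
urls = page_urls('https://example.com/review/some-company', count_pages(45))
```

Rounding up matters because a partially filled last page still has to be fetched.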


**Running the scraper**

In [35]:
for page_num in range(1, total_pages + 1):    
    page = url + '?page=' + str(page_num)
    r = requests.get(page)
    soup = BeautifulSoup(r.text, 'lxml')
        
    for paragraph in soup.find_all('section', class_='review__content'):
        # get review title
        title_section = paragraph.find('h2', class_='review-content__title')   
        
        if title_section:
            title = title_section.find('a').text.strip()
        else:
            title = ''
        
        # get review text
        content = paragraph.find('p', class_='review-content__text')
        
        if content:
            content = content.text.strip()
            
            # get review posted date
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            
            # get review rating
            rating_class = paragraph.find('div', class_='star-rating')
            rating = rating_class.find('img')['alt'][0]   

            final_list.append([title, content, date, rating])
    
    # print progress
    clear_output(wait=True)
    print(f'scraped page {page_num} of {total_pages}')
    
    if throttle:
        sleep(sleep_time)  # sleep is imported directly from time above

# Save to pandas dataframe
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])

scraped page 34 of 34
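
The loop above reads the date from a JSON blob and the rating from the first character of the star image's `alt` text, and either step raises an exception if the page markup differs from what is expected. As a hedged sketch, the two helpers below perform the same extraction but return `None` on unexpected input. The names `parse_published_date` and `parse_rating` are hypothetical, and the sample `alt` string is an assumed format, not taken from the site.

```python
import json


def parse_published_date(raw_json):
    """Return the YYYY-MM-DD part of 'publishedDate', or None if unavailable."""
    try:
        data = json.loads(raw_json)
    except (TypeError, ValueError):
        return None
    published = data.get('publishedDate') if isinstance(data, dict) else None
    if not isinstance(published, str):
        return None
    return published.split('T')[0]


def parse_rating(alt_text):
    """Return the leading digit of the alt text as an int, or None."""
    text = (alt_text or '').strip()
    return int(text[0]) if text[:1].isdecimal() else None


# Assumed input shapes, for illustration only
date = parse_published_date('{"publishedDate": "2020-11-20T10:15:00Z"}')
rating = parse_rating('5 stars: excellent')
```

Returning the rating as an `int` rather than a one-character string also keeps the `Rating` column numeric in the resulting DataFrame.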


**Printing results**

In [36]:
df

| | Title | Content | Date | Rating |
|---|---|---|---|---|
| 0 | Direct assurance top | Direct assurance topDirect assurance topDirect... | 2020-11-20 | 5 |
| 1 | @jerome - Pratique mafieuse | @jerome : Exactement la même histoire avec aug... | 2020-11-19 | 1 |
| 2 | Très bon rapport Qualité/Prix! | Très bon rapport Qualité/Prix!Voici le code pr... | 2020-11-14 | 5 |
| 3 | A fuir | Très mauvaise expérience avec cette assurance ... | 2020-11-14 | 1 |
| 4 | Mauvaise foi | Mauvaise foi. Cette compagnie a réévalué mon d... | 2020-11-13 | 1 |
| ... | ... | ... | ... | ... |
| 524 | Désinvolte envers ses assurés | Le 26/08/2020 j'ai reçu l'avis d'échéance de m... | 2020-10-27 | 1 |
| 525 | J'ai attendu plus de deux heures une… | J'ai attendu plus de deux heures une dépanneus... | 2020-10-24 | 1 |
| 526 | Assurance à fuir absolument. | Assurance à fuir absolument.Ils me reclament 1... | 2020-10-23 | 1 |
| 527 | Assurance nul escro pour des fautes… | Assurance nul des escro incompetent pour des ... | 2020-10-20 | 1 |


**Saving the DataFrame to a CSV file**

In [36]:
df.to_csv(save_datafile, encoding='utf-8', index=False)
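
When the saved file is loaded again, pandas infers column types from the text, so the reloaded frame can differ from the one in memory. The sketch below illustrates this with a round trip through an in-memory buffer rather than the real file. The two sample rows are hypothetical and only mimic the shape of the scraper's output.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the scraper's output (not real reviews)
sample = pd.DataFrame(
    [['Great', 'Fast claim handling', '2020-11-20', '5'],
     ['', 'No title on this one', '2020-11-19', '1']],
    columns=['Title', 'Content', 'Date', 'Rating'],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the Date column to datetimes on load
reloaded = pd.read_csv(buffer, parse_dates=['Date'])
```

After the round trip, `Rating` is read back as integers and the empty title becomes `NaN`. Passing `keep_default_na=False` to `pd.read_csv` would keep empty strings as they are, if that is preferred.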