With the scraping phase done, we now move on to the cleaning phase.

# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np


In [4]:
df_1 = pd.read_csv("/Users/ahmed/Documents/ESILV/s9/web scraping/ecostay/ecostay/sustainable_hotels_paris3.csv")

In [5]:
df_1.head()

Unnamed: 0,Name,Address,Description,Rating,RatingText,NumReviews,HotelLink
0,NH Paris Gare de l'Est,"10e arr., Paris","10e arr., ParisIndiquer sur la carte2,2 km du ...","Avec une note de 8,1",Très bien,1 398 expériences vécues,https://www.booking.com/hotel/fr/mercure-termi...
1,Citadines Austerlitz Paris,"13e arr., Paris","13e arr., ParisIndiquer sur la carte2,5 km du ...","Avec une note de 8,2",Très bien,1 971 expériences vécues,https://www.booking.com/hotel/fr/citadines-apa...
2,B&B HOTEL Paris Porte des Lilas,"19e arr., Paris","19e arr., ParisIndiquer sur la carte4,9 km du ...","Avec une note de 7,8",Bien,14 516 expériences vécues,https://www.booking.com/hotel/fr/b-amp-b-porte...
3,Best Western Hotel Opéra Drouot,"9e arr., Paris","9e arr., ParisIndiquer sur la carte1,9 km du c...","Avec une note de 8,0",Très bien,1 677 expériences vécues,https://www.booking.com/hotel/fr/comfort-opera...
4,Hotel de la Tour,"14e arr., Paris","14e arr., ParisIndiquer sur la carte2,7 km du ...","Avec une note de 8,2",Très bien,736 expériences vécues,https://www.booking.com/hotel/fr/de-la-tour-pa...


The address and description scraped here are too general or incomplete, and we collected better versions of both in the other CSV files, so we drop these columns (along with NumReviews).

In [6]:
df_1.drop(columns=["Address","Description","NumReviews"],inplace=True)

In [7]:
df_1.head()

Unnamed: 0,Name,Rating,RatingText,HotelLink
0,NH Paris Gare de l'Est,"Avec une note de 8,1",Très bien,https://www.booking.com/hotel/fr/mercure-termi...
1,Citadines Austerlitz Paris,"Avec une note de 8,2",Très bien,https://www.booking.com/hotel/fr/citadines-apa...
2,B&B HOTEL Paris Porte des Lilas,"Avec une note de 7,8",Bien,https://www.booking.com/hotel/fr/b-amp-b-porte...
3,Best Western Hotel Opéra Drouot,"Avec une note de 8,0",Très bien,https://www.booking.com/hotel/fr/comfort-opera...
4,Hotel de la Tour,"Avec une note de 8,2",Très bien,https://www.booking.com/hotel/fr/de-la-tour-pa...


As we can see, the rating is stored as text rather than as a float, with a phrase in front of the value. Let's fix this:

In [8]:
df_1['Rating'] = df_1['Rating'].str.extract(r'(\d+,\d+)')

In [9]:
df_1.head()

Unnamed: 0,Name,Rating,RatingText,HotelLink
0,NH Paris Gare de l'Est,"8,1",Très bien,https://www.booking.com/hotel/fr/mercure-termi...
1,Citadines Austerlitz Paris,"8,2",Très bien,https://www.booking.com/hotel/fr/citadines-apa...
2,B&B HOTEL Paris Porte des Lilas,"7,8",Bien,https://www.booking.com/hotel/fr/b-amp-b-porte...
3,Best Western Hotel Opéra Drouot,"8,0",Très bien,https://www.booking.com/hotel/fr/comfort-opera...
4,Hotel de la Tour,"8,2",Très bien,https://www.booking.com/hotel/fr/de-la-tour-pa...


In [10]:
df_1['Rating'] = df_1['Rating'].str.replace(',', '.').astype(float)
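
The two conversion steps can be checked on a few hand-written strings. This is a minimal sketch, assuming the same "Avec une note de X,Y" pattern as in the scraped column. The sample values and the `parse_rating` helper name are made up for illustration, and `pd.to_numeric(..., errors='coerce')` is swapped in for `astype(float)` so that a row without a score becomes NaN instead of raising.

```python
import pandas as pd

def parse_rating(series: pd.Series) -> pd.Series:
    """Pull 'X,Y' out of the scraped text and return it as a float column."""
    # expand=False keeps the result a Series (there is one capture group)
    raw = series.str.extract(r'(\d+,\d+)', expand=False)
    # Swap the French decimal comma for a dot, then convert;
    # rows where nothing matched stay NaN instead of raising.
    return pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')

# Hypothetical sample rows, including one without a score
sample = pd.Series(["Avec une note de 8,1", "Avec une note de 7,8", "Pas de note"])
ratings = parse_rating(sample)
print(ratings)
```

Note that a score written without a decimal part (for example "10") does not match this pattern and would also end up as NaN, so it is worth checking `isnull().sum()` after the conversion.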

In [11]:
df_1.isnull().sum()

Name          0
Rating        0
RatingText    0
HotelLink     0
dtype: int64

In [12]:
df_1.head()

Unnamed: 0,Name,Rating,RatingText,HotelLink
0,NH Paris Gare de l'Est,8.1,Très bien,https://www.booking.com/hotel/fr/mercure-termi...
1,Citadines Austerlitz Paris,8.2,Très bien,https://www.booking.com/hotel/fr/citadines-apa...
2,B&B HOTEL Paris Porte des Lilas,7.8,Bien,https://www.booking.com/hotel/fr/b-amp-b-porte...
3,Best Western Hotel Opéra Drouot,8.0,Très bien,https://www.booking.com/hotel/fr/comfort-opera...
4,Hotel de la Tour,8.2,Très bien,https://www.booking.com/hotel/fr/de-la-tour-pa...


In [13]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        208 non-null    object 
 1   Rating      208 non-null    float64
 2   RatingText  208 non-null    object 
 3   HotelLink   208 non-null    object 
dtypes: float64(1), object(3)
memory usage: 6.6+ KB


In [14]:
df_2 = pd.read_csv("hotels_with_address.csv")

In [15]:
df_2.head()

Unnamed: 0,url,address,lat,lng
0,https://www.booking.com/hotel/fr/mercure-termi...,"5 rue du 8 Mai 1945, 10e arr., 75010 Paris, Fr...",48.87595,2.358766
1,https://www.booking.com/hotel/fr/citadines-apa...,"27 Rue Esquirol, 13e arr., 75013 Paris, France",48.834906,2.360376
2,https://www.booking.com/hotel/fr/b-amp-b-porte...,"23 Avenue René Fonck, 19e arr., 75019 Paris, F...",48.880018,2.408066
3,https://www.booking.com/hotel/fr/comfort-opera...,"4 Rue De La Grange Bateliere, 9e arr., 75009 P...",48.873089,2.342492
4,https://www.booking.com/hotel/fr/de-la-tour-pa...,"19 boulevard Edgar Quinet, 14e arr., 75014 Par...",48.841197,2.323891


This dataset gives us the exact address and the coordinates (lat and lng) of each hotel. Let's merge it with the first dataset on the hotel URL.

In [16]:
merged_df = pd.merge(df_1, df_2, left_on='HotelLink', right_on='url', how='inner')


In [17]:
merged_df.drop(columns=['url'], inplace=True)
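
Because the merge is an inner join, a hotel whose link is absent from the other file is dropped without any warning. The sketch below shows one way to check for that, using two tiny made-up frames that reuse the column names `HotelLink` and `url`; the URLs are placeholders, not real data. `indicator=True` and `validate="one_to_one"` are standard `pd.merge` options.

```python
import pandas as pd

# Tiny stand-ins for df_1 and df_2 (placeholder URLs, not real data)
left = pd.DataFrame({"Name": ["A", "B", "C"],
                     "HotelLink": ["u/a", "u/b", "u/c"]})
right = pd.DataFrame({"url": ["u/a", "u/b", "u/d"],
                      "lat": [48.87, 48.83, 48.88]})

# An outer join with indicator=True labels each row by where it came from;
# validate raises MergeError if a URL is duplicated on either side.
check = pd.merge(left, right, left_on="HotelLink", right_on="url",
                 how="outer", indicator=True, validate="one_to_one")

# Rows present in only one of the two frames
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Name", "HotelLink", "url", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join loses nothing.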

In [18]:
merged_df.head()

Unnamed: 0,Name,Rating,RatingText,HotelLink,address,lat,lng
0,NH Paris Gare de l'Est,8.1,Très bien,https://www.booking.com/hotel/fr/mercure-termi...,"5 rue du 8 Mai 1945, 10e arr., 75010 Paris, Fr...",48.87595,2.358766
1,Citadines Austerlitz Paris,8.2,Très bien,https://www.booking.com/hotel/fr/citadines-apa...,"27 Rue Esquirol, 13e arr., 75013 Paris, France",48.834906,2.360376
2,B&B HOTEL Paris Porte des Lilas,7.8,Bien,https://www.booking.com/hotel/fr/b-amp-b-porte...,"23 Avenue René Fonck, 19e arr., 75019 Paris, F...",48.880018,2.408066
3,Best Western Hotel Opéra Drouot,8.0,Très bien,https://www.booking.com/hotel/fr/comfort-opera...,"4 Rue De La Grange Bateliere, 9e arr., 75009 P...",48.873089,2.342492
4,Hotel de la Tour,8.2,Très bien,https://www.booking.com/hotel/fr/de-la-tour-pa...,"19 boulevard Edgar Quinet, 14e arr., 75014 Par...",48.841197,2.323891


In [20]:
df3 = pd.read_csv("hotel_details.csv")

In [21]:
df3.head()

Unnamed: 0,url,full_description,all_reviews_text,rating_subscores
0,https://www.booking.com/hotel/fr/mercure-termi...,Le NH Paris Gare de l'Est est situé en face de...,"(+) L amabilité du personnel, propreté de la c...","{'Personnel': 8.7, 'Équipements': 8.1, 'Propre..."
1,https://www.booking.com/hotel/fr/citadines-apa...,Situé à mi-chemin entre le Quartier latin et l...,(+) Le personnel de soir à l'accueille était t...,"{'Personnel': 9.2, 'Équipements': 8.1, 'Propre..."
2,https://www.booking.com/hotel/fr/b-amp-b-porte...,"Situé dans le 19ème arrondissement de Paris, l...",(+) Le design le petit déjeuner le confort de ...,"{'Personnel': 8.5, 'Équipements': 7.7, 'Propre..."
3,https://www.booking.com/hotel/fr/comfort-opera...,Situé dans le quartier chic et central du 9ème...,(+) Hôtel chaleureux et agréable dans le 9ème ...,"{'Personnel': 9.0, 'Équipements': 7.9, 'Propre..."
4,https://www.booking.com/hotel/fr/de-la-tour-pa...,"Situé dans le 14ème arrondissement de Paris, l...",(+) Petit déjeuner très copieux et très bon. \...,"{'Personnel': 9.3, 'Équipements': 7.9, 'Propre..."


While scraping, we noticed that some reviews contain only this placeholder:

![Ce client n'a pas laissé de commentaire](assets/problems/no_review.png)

Let's remove it:

In [25]:
df3['all_reviews_text'] = df3['all_reviews_text'].str.replace(
    r"Ce client n'a pas laissé de commentaire", "", regex=True
)
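
After the replacement, a review that consisted only of the placeholder becomes an empty string, which `isnull()` does not count as missing. A possible follow-up, sketched on made-up review texts, is to turn those empty strings into missing values. The variable names here are illustrative only.

```python
import pandas as pd

PLACEHOLDER = "Ce client n'a pas laissé de commentaire"

# Made-up review texts: one real, one placeholder only, one missing
reviews = pd.Series([
    "(+) Très bon accueil\n(-) Ce client n'a pas laissé de commentaire",
    PLACEHOLDER,
    None,
])

# regex=False treats the placeholder as a literal string
cleaned = reviews.str.replace(PLACEHOLDER, "", regex=False).str.strip()
# Turn texts that are now empty into missing values so isnull() counts them
cleaned = cleaned.mask(cleaned == "")

print(cleaned.isnull().sum())
```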

In [39]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   url               208 non-null    object
 1   full_description  207 non-null    object
 2   all_reviews_text  156 non-null    object
 3   rating_subscores  208 non-null    object
dtypes: object(4)
memory usage: 6.6+ KB


Another problem is that the reviews are written in different languages. To handle this, we translate them all into English.

In [43]:
#pip install translatepy



In [49]:
import pandas as pd
from translatepy import Translator
import time
# Initialize Translator
translator = Translator()
# Function to translate text to English
def translate_to_english(text):
    try:
        if pd.isnull(text) or not isinstance(text, str) or text.strip() == "":
            return text

        translated = translator.translate(text, "en")
        print(translated.result)
        
        time.sleep(1)
        return translated.result
    except Exception as e:
        print(f"Unexpected error: {e}")
        return "Not Translated"  # Sentinel value so failed rows can be found later

df3['all_reviews_text_en'] = df3['all_reviews_text'].apply(translate_to_english)



(+) The friendliness of the staff, cleanliness of the room
(-) Nothing to complain about

(+) Comfortable and well located
(-) Room a little small




Unexpected error: No service has returned a valid result
(+) Suite tastefully decorated and very spacious for Paris!
(-) Lack of space between the bed and the desk.




Unexpected error: No service has returned a valid result
(+) Elegance, great breakfast and dinner.
(-) Nothing

(+) Its location and luxury.
At the front desk, staff do everything they can to help.
(-) We liked everything.




Unexpected error: No service has returned a valid result
(+) The kindness of the staff 
The location
(-) RAS

(+) Magic location
(-) The noise of air conditioning




Unexpected error: No service has returned a valid result
(+) All the location, staff, room gorgeous. We were able to take our room right after our arrival
(-) Nothing




Unexpected error: No service has returned a valid result
(+) the welcome and aesthetics
(-) no breakfast possible before 7am

(+) Nice decoration, beautiful finishes. 
Nice view on the roofs. 
The room was large.
(-) Bed uncomfortable: pillows too hard and mattress not straight that was lowered at the head. 

The extra bed in the room for 3 was a comfortable clack. 

The showe



(+) Beautiful design 
Beautiful facilities (bar terrace, courtyard for lunch)
We were upgraded to suite 
Very comfortable bed
The staff was really nice
(-) Everything was great. We will return without hesitation.




Unexpected error: No service has returned a valid result




Unexpected error: No service has returned a valid result




Unexpected error: No service has returned a valid result




(+) The welcome and sympathy of the staff 
The water fountain and coffee at the reception
The proximity of the east station
(-) Breakfast. It is quite expensive, in my opinion 18€. For this price I would have liked more choice in salty and sweet and products of better quality, such as fresh pastries and non-industrial fruit juices.

The extra bed installed for my son was not comfortable we felt the metal bars




(+) The service is very nice
(-) Everything was fine

(+) Service/ reception
(-) Nothing

(+) the personnel is irreproachable
(-) breakfast has nothing of a 5*




Unexpected error: No service has returned a valid result




(+) What we enjoyed the most was the large bed the bath robes the size of the bathroom and the large counter in the room The air conditioning that finally the housekeeper on our floor communicated with the technician the day after our arrival for the start  (after a first night in this room to die of heat) The kindness of a person who was very competent (different from the big incompetent who talked to us so loud on our arrival and who could not find after 2 hours to find THE room we had booked) this one took care of booking the baggage handler and the taxi on the day of our departure and made sure of their accuracy Our stay started badly at the hotel but it is not the hotel that is in default and we will gladly return
(-) The lack of knowledge and arrogance of the person at the reception who said to see that we had notified the hotel of our arrival at 3pm the request for a large bed and a bath and that everything had been confirmed but that there was nothing of all this possible ... I



(+) exceptional location, 2 steps from everything, breakfast without fail, extremely comfortable and very nice room
Top concierge and friendly staff. A perfect stay
(-) nothing




Unexpected error: No service has returned a valid result




(+) - The location
- The decoration of the room
- The bedding
- The reception
(-) The size of bed 140 a little just for 2 people not in couple




Unexpected error: No service has returned a valid result




Unexpected error: No service has returned a valid result




(+) The apartment was superb
The Japanese toilet 💜
The tea time, dinner at LOiseau Blanc, breakfast
You were all lovely. (Kiss to Ash and Tien)
The location between Trocadero and Arc de Triompe
The pool
The SPA
(-) No towel heater in the bathroom
Touchpad next to the bathtub that didn’t work
The hair dryer not powerful enough
But frankly these are not minor details




(+) The location, the home is really top. we feel good, safe as at home.
(-) We would have appreciated even more that the maid goes during the week for a quick cleaning and think of leaving a small brush and her kneading would be good in case of need. With children, we often have small crumbs here and there and it is not at all evident to remove them by hand; moreover, I left you a little in the small detergent cabinet, I hope that next time, I will find it in the same place 😁.

Can you also ask one of your attendants to be on site to welcome late guests. There are those who get the wrong apartment simply because there is no one to guide them.




Unexpected error: No service has returned a valid result
(+) Breakfast was good.

The location is close to the Porte de Versailles lounge but a little far from tourist attractions
(-) The entrance to the underground car park is very narrow




(+) The friendliness of the staff. 
The excellent breakfast. 
The gourmet dinner. 
The cleanliness of the room and bathroom.
The ideal location to discover Paris.
(-) Too bad there were only two elevators because one of the two was broken down most of the time.


In [61]:
df3[
    (df3['all_reviews_text_en'] == "Not Translated")
]

Unnamed: 0,url,full_description,all_reviews_text,rating_subscores,all_reviews_text_en
113,https://www.booking.com/hotel/fr/villa-marquis...,L'Hotel Villa Marquis Member of Meliá Collecti...,(+) .Le personnel au petit soin vraiment formi...,"{'Personnel': 9.3, 'Équipements': 8.6, 'Propre...",Not Translated
118,https://www.booking.com/hotel/fr/artus.fr.html...,L’Hôtel Artus vous accueille sur la chic Rive ...,"(+) Emplacement incroyable, hôtel top modernis...","{'Personnel': 9.4, 'Équipements': 8.9, 'Propre...",Not Translated
146,https://www.booking.com/hotel/fr/9hotel-confid...,"Idéalement situé dans le centre de Paris, le 9...",(+) Notre hotel favori a paris. starck. l'inté...,"{'Personnel': 9.2, 'Équipements': 8.6, 'Propre...",Not Translated
155,https://www.booking.com/hotel/fr/astor-saint-h...,"L’hôtel Maison Astor Paris, Curio Collection b...",(+) Accueil très bien le personnel très gentil...,"{'Personnel': 9.0, 'Équipements': 8.3, 'Propre...",Not Translated
160,https://www.booking.com/hotel/fr/niepce-paris-...,"Situé à Paris, à moins de 3,1 km du parc des e...",(+) Le design et la decoration interieure des ...,"{'Personnel': 9.3, 'Équipements': 8.7, 'Propre...",Not Translated
169,https://www.booking.com/hotel/fr/sofitel-le-fa...,Le Sofitel Paris Le Faubourg est situé au cœur...,(+) Personnel très aimable. Chambre spacieuse ...,"{'Personnel': 8.9, 'Équipements': 8.4, 'Propre...",Not Translated
170,https://www.booking.com/hotel/fr/de-buci.fr.ht...,Cet hôtel 4 étoiles de style classique se situ...,(+) Le personnel est extrêmement serviable et ...,"{'Personnel': 9.3, 'Équipements': 8.5, 'Propre...",Not Translated
171,https://www.booking.com/hotel/fr/renaissance-p...,Le Renaissance Paris Nobel Tour Eiffel Hotel e...,(+) Très gentil! L’équipe de l’hôtel est chale...,"{'Personnel': 8.4, 'Équipements': 7.5, 'Propre...",Not Translated
174,https://www.booking.com/hotel/fr/melia-vendome...,Situé à Paris dans le prestigieux 1er arrondis...,(+) les chambres sont propres et spacieuses \n...,"{'Personnel': 9.0, 'Équipements': 7.8, 'Propre...",Not Translated
177,https://www.booking.com/hotel/fr/hyatt-paris-m...,L’Hyatt Paris Madeleine vous accueille dans le...,(+) la chambre et l’emplacement\n(-) salon pet...,"{'Personnel': 9.2, 'Équipements': 8.5, 'Propre...",Not Translated


In [62]:
# Filter out the rows with "Not Translated"
not_translated_reviews = df3[df3['all_reviews_text_en'] == "Not Translated"]

# Save to CSV for manual processing
not_translated_reviews.to_csv('not_translated_reviews.csv', index=False)

print("Exported rows for manual translation.")

Exported rows for manual translation.
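
As an alternative to translating the exported rows by hand, the translation function can be re-run on only the rows that carry the "Not Translated" sentinel. The sketch below illustrates the selection pattern on a made-up frame; `fake_translate` is a hypothetical stand-in for a real translation call, not part of any library.

```python
import pandas as pd

def fake_translate(text):
    """Hypothetical stand-in for a real translation call."""
    return text.upper()

df = pd.DataFrame({
    "all_reviews_text": ["bonjour", "merci", "au revoir"],
    "all_reviews_text_en": ["hello", "Not Translated", "Not Translated"],
})

# Select only the rows flagged as failed and re-run the function on them
failed = df["all_reviews_text_en"] == "Not Translated"
df.loc[failed, "all_reviews_text_en"] = (
    df.loc[failed, "all_reviews_text"].apply(fake_translate)
)

print(df)
```

Rows that already have a translation are left untouched, so the call can be repeated until no sentinel values remain.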


In [66]:
!pip install langdetect retrying

Collecting retrying
  Downloading retrying-1.3.4-py3-none-any.whl.metadata (6.9 kB)
Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Installing collected packages: retrying
Successfully installed retrying-1.3.4


In [67]:
import pandas as pd
from translatepy import Translator
from langdetect import detect
from bs4 import BeautifulSoup
import re
import time
from retrying import retry


In [68]:
def preprocess_text(text):
    try:
        if pd.isnull(text) or not isinstance(text, str) or text.strip() == "":
            return ""
        # Remove HTML tags
        text = BeautifulSoup(text, "html.parser").get_text()
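        # Collapse runs of spaces/tabs into one space but keep newlines,
        # because remove_duplicates() later splits the text on '\n'
        text = re.sub(r'[^\S\n]+', ' ', text)
        # Remove special characters (keeps only , . ! ? - so the parentheses
        # and plus sign of the (+)/(-) markers are stripped as well)
        text = re.sub(r'[^\w\s,.!?-]', '', text)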
        return text.strip()
    except Exception as e:
        print(f"Preprocessing Error: {e}")
        return "Preprocessing Error"
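
To see what the special-character filter actually keeps, it helps to run the same character class on a short made-up review. In this sketch, `filter_chars` is a hypothetical helper that isolates that one regular expression.

```python
import re

def filter_chars(text: str) -> str:
    # Same character class as in preprocess_text:
    # keep word characters, whitespace and , . ! ? -
    return re.sub(r'[^\w\s,.!?-]', '', text)

# Made-up review with both markers, an accent and a currency sign
sample = "(+) Très bien, 18€ !\n(-) Rien"
print(repr(filter_chars(sample)))
```

Accented letters survive because `\w` matches Unicode word characters in Python 3, while the currency sign, the parentheses and the plus sign are removed. The `(+)` marker therefore disappears and `(-)` is reduced to `-`, which matters if later steps rely on those markers to separate positive from negative comments.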


In [69]:
# Remove duplicate lines in each review
def remove_duplicates(text):
    try:
        if pd.isnull(text) or not isinstance(text, str) or text.strip() == "":
            return text
        
        # Split the text into lines and remove duplicates while preserving order
        lines = text.split('\n')
        seen = set()
        filtered_lines = []
        for line in lines:
            if line.strip() not in seen:  # Avoid duplicates
                filtered_lines.append(line.strip())
                seen.add(line.strip())
        
        # Rejoin filtered lines
        return "\n".join(filtered_lines).strip()
    except Exception as e:
        print(f"Duplicate Removal Error: {e}")
        return "Error Removing Duplicates"
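
The scraped text for one hotel can contain the same review lines several times. The order-preserving de-duplication can also be written more compactly with `dict.fromkeys`, as in this sketch; `dedupe_lines` is an illustrative name and the sample text is made up.

```python
def dedupe_lines(text: str) -> str:
    """Drop repeated lines while keeping first-seen order."""
    lines = [line.strip() for line in text.split("\n")]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return "\n".join(dict.fromkeys(lines)).strip()

sample = "(+) Magic location\n(-) Noise\n\n(+) Magic location\n(-) Noise\n"
print(dedupe_lines(sample))
```

One caveat applies to either version: lines are de-duplicated across the whole text, so a short line such as "(-) Nothing" that appears in two different reviews is kept only once.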


In [70]:
translator = Translator()

# Retry the raw translation call on temporary errors.
# The exception has to propagate out of this function for @retry to see it.
@retry(stop_max_attempt_number=3, wait_fixed=2000)  # Up to 3 attempts, 2 seconds apart
def _translate_with_retry(text):
    translated = translator.translate(text, "en")
    time.sleep(1)  # Delay to avoid rate-limiting
    return translated.result

def translate_to_english(text):
    if pd.isnull(text) or not isinstance(text, str) or text.strip() == "":
        return ""
    try:
        # Detect language and skip translation if already in English
        if detect(text) == 'en':
            return text
        return _translate_with_retry(text)
    except Exception as e:
        # Reached only after all retry attempts have failed
        print(f"Translation Error: {e}")
        return "Not Translated"
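
A retry helper can only act on exceptions that reach it, so the function being retried should let errors propagate rather than catch them itself. The sketch below shows the idea with the standard library only. `call_with_retry` and `flaky` are hypothetical names, and the doubling delay is an assumption, not what the `retrying` package does with `wait_fixed`.

```python
import time

def call_with_retry(func, *args, attempts=3, base_delay=0.01):
    """Call func(*args); on an exception wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            # Wait base_delay, then twice that, and so on
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical flaky function: fails twice, then succeeds
calls = {"n": 0}

def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()

result = call_with_retry(flaky, "bonjour")
print(result, calls["n"])
```

The caller can wrap `call_with_retry` in its own `try`/`except` to return a sentinel such as "Not Translated" once every attempt has failed.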

In [71]:
# Step 1: Preprocess reviews
df3['clean_reviews'] = df3['all_reviews_text'].apply(preprocess_text)

# Step 2: Remove duplicate lines
df3['filtered_reviews'] = df3['clean_reviews'].apply(remove_duplicates)

# Step 3: Translate reviews to English
df3['all_reviews_text_en'] = df3['filtered_reviews'].apply(translate_to_english)

# Step 4: Identify rows that failed translation
not_translated = df3[df3['all_reviews_text_en'] == "Not Translated"]

# Save failed translations for manual review
not_translated.to_csv('failed_translations.csv', index=False)

# Step 5: Save the final processed dataset
df3.to_csv('translated_reviews.csv', index=False)

# Print results
print("Reviews successfully translated!")
print(f"Total rows: {len(df3)}, Failed translations: {len(not_translated)}")



Reviews successfully translated!
Total rows: 208, Failed translations: 0


In [74]:
df3[
    (df3['all_reviews_text_en'].str.strip() == "") |    # Empty strings
    (df3['all_reviews_text_en'].isnull())               # NaN values
]

Unnamed: 0,url,full_description,all_reviews_text,rating_subscores,all_reviews_text_en,clean_reviews,filtered_reviews
42,https://www.booking.com/hotel/fr/paris-orleans...,Le Mercure Paris Alesia est un hôtel 4 étoiles...,,{},,,
43,https://www.booking.com/hotel/fr/home-business...,L'Appart'City Collection Paris Grande Biblioth...,,{},,,
44,https://www.booking.com/hotel/fr/bob-by-elegan...,"Le Bob Hotel est situé à Paris, à 950 mètres d...",,{},,,
45,https://www.booking.com/hotel/fr/jules.fr.html...,Le NH Paris Opéra Faubourg vous accueille rue ...,,{},,,
46,https://www.booking.com/hotel/fr/citadines-apa...,Le Citadines Trocadéro Paris se trouve à 500 m...,,{},,,
47,https://www.booking.com/hotel/fr/sthonore.fr.h...,"Situé au cœur de Paris, dans la célèbre rue du...",,{},,,
48,https://www.booking.com/hotel/fr/citotel-sport...,L'hôtel L'Interlude (ex Citotel Sport hotel) p...,,{},,,
49,https://www.booking.com/hotel/fr/hotel-rosalie...,"Situé dans le quartier des Gobelins, récemment...",,{},,,
50,https://www.booking.com/hotel/fr/plazatoureiff...,"Situé dans le 16e arrondissement de Paris, à p...",,{},,,
51,https://www.booking.com/hotel/fr/royalsaintger...,Le Royal Saint Germain est un hôtel de charme ...,,{},,,


In [73]:
df3[
    (df3['all_reviews_text_en'].str.strip() == "") |    # Empty strings
    (df3['all_reviews_text_en'].isnull())               # NaN values
].to_csv("hotel_no_reviews.csv")

In [75]:
df4 = pd.read_csv("hotel_details2.csv")

In [77]:
df4

Unnamed: 0,url,full_description,all_reviews_text,rating_subscores
0,https://www.booking.com/hotel/fr/paris-orleans...,Le Mercure Paris Alesia est un hôtel 4 étoiles...,,"{'Personnel': 9.2, 'Équipements': 8.4, 'Propre..."
1,https://www.booking.com/hotel/fr/home-business...,L'Appart'City Collection Paris Grande Biblioth...,,"{'Personnel': 8.3, 'Équipements': 7.3, 'Propre..."
2,https://www.booking.com/hotel/fr/bob-by-elegan...,"Le Bob Hotel est situé à Paris, à 950 mètres d...",,"{'Personnel': 9.0, 'Équipements': 8.4, 'Propre..."
3,https://www.booking.com/hotel/fr/jules.fr.html...,Le NH Paris Opéra Faubourg vous accueille rue ...,,"{'Personnel': 8.7, 'Équipements': 7.8, 'Propre..."
4,https://www.booking.com/hotel/fr/citadines-apa...,Le Citadines Trocadéro Paris se trouve à 500 m...,,"{'Personnel': 8.9, 'Équipements': 8.4, 'Propre..."
5,https://www.booking.com/hotel/fr/sthonore.fr.h...,"Situé au cœur de Paris, dans la célèbre rue du...",,"{'Personnel': 8.8, 'Équipements': 8.4, 'Propre..."
6,https://www.booking.com/hotel/fr/citotel-sport...,L'hôtel L'Interlude (ex Citotel Sport hotel) p...,,"{'Personnel': 9.0, 'Équipements': 8.1, 'Propre..."
7,https://www.booking.com/hotel/fr/hotel-rosalie...,"Situé dans le quartier des Gobelins, récemment...",,"{'Personnel': 9.5, 'Équipements': 8.7, 'Propre..."
8,https://www.booking.com/hotel/fr/plazatoureiff...,"Situé dans le 16e arrondissement de Paris, à p...",,"{'Personnel': 9.4, 'Équipements': 8.6, 'Propre..."
9,https://www.booking.com/hotel/fr/royalsaintger...,,,{}


In [79]:
df3[
    ~(
        (df3['all_reviews_text_en'].str.strip() == "") | 
        (df3['all_reviews_text_en'].isnull())
    )
].to_csv("hotel_details_tr_cleaned.csv")
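
The same condition is used to export both the hotels without reviews and the hotels with reviews, so it can be computed once and reused. A minimal sketch on a made-up frame (the column name matches the one above, the data does not):

```python
import pandas as pd

df = pd.DataFrame({
    "url": ["u/a", "u/b", "u/c"],
    "all_reviews_text_en": ["Great stay", "", None],
})

# True where the translated text is missing or only whitespace
no_review = (
    df["all_reviews_text_en"].isnull()
    | (df["all_reviews_text_en"].str.strip() == "")
)

with_reviews = df[~no_review]
without_reviews = df[no_review]
print(len(with_reviews), len(without_reviews))
```

Each half can then be written out with `to_csv(..., index=False)`, which avoids adding another unnamed index column to the file.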