<a href="https://colab.research.google.com/github/brusangues/maps-reviews-api-scraper/blob/reorganization/1_maps_reviews_api_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Textual Data

We will use a scraper tool to collect the text of hotel reviews from Google Maps.
We will then apply basic preprocessing to this data.

## 1.1. Using the Scraper to Extract Data

In [1]:
!git clone --branch reorganization https://github.com/brusangues/maps-reviews-api-scraper

fatal: destination path 'maps-reviews-api-scraper' already exists and is not an empty directory.


In [2]:
%cd /content/maps-reviews-api-scraper

/content/maps-reviews-api-scraper


In [3]:
!pip install -r requirements.txt



In [4]:
%cd /content/maps-reviews-api-scraper/scraper

/content/maps-reviews-api-scraper/scraper


In [5]:
!python -m app run-async --path ./input/test.csv

2023-10-17 14:35:27,324 INFO   1179 [    call_pools] (app.py: 46): - Running async
2023-10-17 14:35:27,383 INFO   1201 [  call_scraper] (app.py: 84): - folder created
2023-10-17 14:35:27,387 INFO   1202 [  call_scraper] (app.py: 84): - folder created
2023-10-17 14:35:27,391 INFO   1202 [  scrape_place] (scraper.py:476): Wish Foz do Igua - Scraping metadata for url: Wish Foz do Iguaçu
2023-10-17 14:35:27,392 INFO   1202 [  scrape_place] (scraper.py:480): Wish Foz do Igua - Parsing metadata...
2023-10-17 14:35:27,395 INFO   1201 [  scrape_place] (scraper.py:476): Pousada Itararé  - Scraping metadata for url: Pousada Itararé
2023-10-17 14:35:27,396 INFO   1201 [  scrape_place] (scraper.py:480): Pousada Itararé  - Parsing metadata...
2023-10-17 14:35:28,132 ERROR   1201 [  _parse_place] (scraper.py:183): Pousada Itararé  - error parsing place: place_name
2023-10-17 14:35:28,133 ERROR   1

## 1.2. Gathering the Results

In [2]:
%load_ext autoreload

In [4]:
%autoreload 2
import json
import logging
import os
from datetime import datetime
from pathlib import Path

import pandas as pd
import regex as re
import typer
from dateutils import relativedelta
from unidecode import unidecode

from analysis.src.config import *
from analysis.src.preprocessing import map_progress, tokenizer_lemma
from analysis.src.utils import *
from scraper.src.custom_logger import get_logger

# Suppress the debugger's slow-resolve warning
os.environ["PYDEVD_WARN_SLOW_RESOLVE_TIMEOUT"] = "3000"
# Remove the display limits when printing DataFrames
# pd.set_option("display.max_rows", 500)
# pd.set_option("display.max_columns", 500)
# pd.set_option("display.width", 1000)

# Path to the folder containing the CSV files
data_path = "scraper/data/2023/01/19/"
places_file = "scraper/data/places.csv"
input_file = "scraper/input/hotels_23_01_19.csv"
reports_folder = Path("./reports")
Path(reports_folder).mkdir(exist_ok=True)
data_folder = Path("./data")
Path(data_folder).mkdir(exist_ok=True)


In [5]:
df = read_data(input_file, data_path)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     50 non-null     object 
 1   feature_id               50 non-null     object 
 2   retrieval_date_metadata  50 non-null     object 
 3   place_name               50 non-null     object 
 4   address                  50 non-null     object 
 5   overall_rating           50 non-null     float64
 6   n_reviews                50 non-null     int64  
 7   topics                   50 non-null     object 
 8   url                      50 non-null     object 
 9   file_name                50 non-null     object 
dtypes: float64(1), int64(1), object(8)
memory usage: 4.0+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 270100 entries, 0 to 3416
Data columns (total 19 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------             

In [20]:
# Saving
save_df(df, "data", "df_raw")

data/df_raw_2023-10-17_11-42-29_936902.pq
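
The file name printed above suggests that `save_df` appends a microsecond-resolution timestamp and a `.pq` extension to the given name. `save_df` comes from one of the wildcard imports and its source is not shown in this notebook, so the following is only a sketch of that naming scheme. `timestamped_path` is a hypothetical helper, not a function from the repository.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_path(folder: str, name: str, now: Optional[datetime] = None) -> Path:
    """Hypothetical helper: build '<folder>/<name>_<timestamp>.pq'."""
    now = now or datetime.now()
    # %f is the zero-padded microsecond field, e.g. 936902.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S_%f")
    return Path(folder) / f"{name}_{stamp}.pq"


# Reproduce the name shown above from a fixed timestamp.
example = timestamped_path("data", "df_raw", datetime(2023, 10, 17, 11, 42, 29, 936902))
print(example.as_posix())
```

A DataFrame could then be written with `df.to_parquet(path)`, which requires a Parquet engine such as `pyarrow` to be installed.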


In [6]:
# Checking for errors
df.errors.value_counts()

[]    270100
Name: errors, dtype: int64

In [7]:
# Checking for duplicate reviews
duplicate_ids = df[df.review_id.duplicated()].review_id
df_duplicates = df[df.review_id.isin(duplicate_ids)]
print("df_duplicates.shape", df_duplicates.shape)
# Dropping duplicate
df = df.drop_duplicates(subset="review_id").reset_index(drop=True)
print("df.shape", df.shape)

df_duplicates.shape (6, 35)
df.shape (270097, 35)
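
The two shapes above are consistent: `duplicated()` flags only the second and later copies of each id, the `isin` lookup then collects every copy (6 rows), and `drop_duplicates` keeps the first copy of each, so 3 rows are removed. A small sketch on invented data shows the same behavior; `duplicated(keep=False)` is an alternative one-step way to select every copy.

```python
import pandas as pd

# Invented toy data: ids "a" and "b" each appear twice.
toy = pd.DataFrame(
    {"review_id": ["a", "b", "a", "c", "b"], "rating": [5, 4, 5, 3, 4]}
)

# Default keep="first": only the second and later copies are flagged.
later_copies = toy[toy.review_id.duplicated()]
# keep=False flags every copy, which matches the isin() lookup used above.
all_copies = toy[toy.review_id.duplicated(keep=False)]
# Keep the first copy of each id.
deduped = toy.drop_duplicates(subset="review_id").reset_index(drop=True)

print(len(later_copies), len(all_copies), len(deduped))
```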


In [8]:
# Counting actual number of reviews per hotel
df_count = df.groupby(["name"]).agg(agg_dict).reset_index()
df_count["n_reviews_diff"] = df_count.n_reviews - df_count.review_id
df_count["n_reviews"] = df_count["n_reviews"].astype(int)

In [9]:
df_count[["name","n_reviews_max","n_reviews","review_id","n_reviews_diff"]]

| | name | n_reviews_max | n_reviews | review_id | n_reviews_diff |
|---|---|---|---|---|---|
| 0 | Acqua Lokos | 8180 | 8542 | 8542 | 0 |
| 1 | Aroso Paço Hotel | 1105 | 1118 | 1118 | 0 |
| 2 | Atlantic Hotel Copacabana | 4860 | 4983 | 4983 | 0 |
| 3 | Atlântico Center | 1382 | 1397 | 1397 | 0 |
| 4 | Atlântico Inn Apart Hotel | 1429 | 1471 | 1471 | 0 |
| 5 | Atrium Quinta de Pedras | 1375 | 1392 | 1392 | 0 |
| 6 | Blue Tree Thermas de Lins | 4840 | 5032 | 5032 | 0 |
| 7 | Boa Vista Eco Hotel | 1132 | 1139 | 1139 | 0 |
| 8 | Bourbon Cataratas do Iguaçu Thermas Eco Resort | 4785 | 4868 | 4868 | 0 |
| 9 | Caravelle Palace Hotel | 3180 | 3208 | 3208 | 0 |
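
`agg_dict` arrives through a wildcard import and is not defined in this notebook. The sketch below shows one dictionary that would produce the columns displayed above. It assumes that `n_reviews_max` and `n_reviews` are per-place metadata repeated on every review row and that `review_id` and `text` are counted. Both the dictionary and the toy data are invented for illustration and may differ from the repository's actual configuration.

```python
import pandas as pd

# Hypothetical stand-in for the agg_dict imported from the project config.
agg_dict_sketch = {
    "n_reviews_max": "max",  # assumed: metadata value repeated on each row
    "n_reviews": "max",      # assumed: review count reported for the place
    "review_id": "count",    # number of reviews actually scraped
    "text": "count",         # "count" skips missing values, so only reviews with text
}

# Invented toy data.
toy = pd.DataFrame(
    {
        "name": ["Hotel A", "Hotel A", "Hotel A", "Hotel B"],
        "n_reviews_max": [3, 3, 3, 1],
        "n_reviews": [3, 3, 3, 1],
        "review_id": ["r1", "r2", "r3", "r4"],
        "text": ["great stay", None, "noisy room", None],
    }
)

toy_count = toy.groupby(["name"]).agg(agg_dict_sketch).reset_index()
# Zero means every review reported for the place was scraped.
toy_count["n_reviews_diff"] = toy_count.n_reviews - toy_count.review_id
print(toy_count)
```

Under these assumptions, an `n_reviews_diff` of 0 for every hotel means that the number of scraped reviews matches the count reported for the place.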


In [12]:
# Counting how many reviews have text
df_count["text_percentage"] = df_count["text"] / df_count["review_id"]
df_count[["text_percentage", "text", "review_id"]]
df_count2 = df_count.agg(
    {"text_percentage": "mean", "text": "sum", "review_id": "sum"}
)
df_count2

text_percentage         0.54868
text               150774.00000
review_id          270097.00000
dtype: float64
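
Note that `text_percentage` in the output above is the mean of the per-hotel ratios, so it need not equal the pooled share obtained by dividing the two totals. The quick check below uses the totals printed above, plus an invented two-hotel example showing why the two averages can diverge when hotels differ in size.

```python
# Totals copied from the output above.
total_with_text = 150774
total_reviews = 270097

# Pooled share: every review carries the same weight.
pooled_share = total_with_text / total_reviews
print(f"pooled share: {pooled_share:.4f}")  # prints "pooled share: 0.5582"

# Invented example: (reviews with text, total reviews) for two hotels.
toy_hotels = [(9, 10), (100, 1000)]
# Per-hotel mean: every hotel carries the same weight, regardless of size.
per_hotel_mean = sum(t / n for t, n in toy_hotels) / len(toy_hotels)
toy_pooled = sum(t for t, _ in toy_hotels) / sum(n for _, n in toy_hotels)
print(f"per-hotel mean: {per_hotel_mean:.3f}, pooled: {toy_pooled:.3f}")
```

Roughly 56% of all reviews contain text, while the average hotel has about 55% of its reviews with text.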

## 1.3. Basic Preprocessing

In [19]:
%timeit
%autoreload 2
from analysis.src.prep import prep_complete
df_prep = prep_complete(df)

drop duplicates
likes
trip_type_travel_group
lattitude and longitude
relative dates


100%|██████████| 270097/270097 [01:14<00:00, 3616.39it/s]
100%|██████████| 270097/270097 [00:40<00:00, 6594.14it/s] 


Other ratings


100%|██████████| 270097/270097 [00:15<00:00, 17947.38it/s]


Topics


100%|██████████| 270097/270097 [01:01<00:00, 4366.67it/s]


User


100%|██████████| 270097/270097 [00:02<00:00, 97922.05it/s] 


Text


100%|██████████| 270097/270097 [00:25<00:00, 10686.24it/s]
100%|██████████| 270097/270097 [00:21<00:00, 12311.68it/s]


tokens


100%|██████████| 270097/270097 [00:24<00:00, 11252.20it/s]
100%|██████████| 270097/270097 [00:30<00:00, 8816.38it/s] 


In [21]:
# Saving
save_df(df_prep, "data", "df_prep")

data/df_prep_2023-10-17_11-42-29_936902.pq
