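**Developing Virtual Tour Products by Analyzing Tripadvisor Reviews of Tour Programs in Korea**
- Goal: Analyze review data for the Korean tour programs listed under the Tours & Tickets menu, and use it to develop virtual tour products that preserve as much of the offline experience as possible
- Steps:<br>
1) Select the Top N popular programs by number of reviews and average rating<br>
2) Crawl the review data of the selected programs<br>
3) Analyze the review text separately by rating (high ratings vs. low ratings)<br>
4) Develop virtual tour programs for Korea based on the analysis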
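
Step 1 could be sketched roughly as follows. The column names `review_count` and `avg_rating`, the helper `select_top_n`, and the sample rows are assumptions made for illustration; the actual tour list file may be organized differently.

```python
import pandas as pd

def select_top_n(df, n, min_reviews=0):
    """Return the n programs with the most reviews, ties broken by average rating."""
    # Optionally drop programs with too few reviews to be meaningful
    eligible = df[df["review_count"] >= min_reviews]
    ranked = eligible.sort_values(
        ["review_count", "avg_rating"], ascending=[False, False]
    )
    return ranked.head(n).reset_index(drop=True)

# Made-up sample data (not real Tripadvisor figures)
tours = pd.DataFrame({
    "tour_title": ["Tour_A", "Tour_B", "Tour_C", "Tour_D"],
    "review_count": [120, 45, 120, 3],
    "avg_rating": [4.5, 5.0, 4.8, 5.0],
})

top_tours = select_top_n(tours, n=2)
print(top_tours)
```

Sorting on review count first favors programs with enough reviews to analyze; a weighted score would be another reasonable choice.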

In [1]:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from tqdm import tqdm

pd.set_option('display.max_columns', None)
pd.set_option("display.max_colwidth", None)


In [2]:
# Importing tour title & url file (separately scraped)
df_title = pd.read_excel('tour_titles_url_korea.xlsx').drop(['Unnamed: 0', 'tour_location'], axis = 1)
# Normalizing titles into file-name-safe strings; regex=False treats '(' and ')' as literal characters
for old, new in [(' ', '_'), (':', ''), ('&', '_'), ('-', '_'), ('(', ''), (')', ''), ('___', '_'), ('__', '_')]:
    df_title['tour_title'] = df_title['tour_title'].str.replace(old, new, regex=False)
df_title

Unnamed: 0,tour_title,tour_url
0,Seoul_Private_Flexible_Adventure_Tour,AttractionProductReview-g294197-d12521956-Seoul_Private_Flexible_Adventure_Tour-Seoul
1,Korean_History_Heritage_Tour,AttractionProductReview-g294197-d15883739-Korean_History_Heritage_Tour-Seoul
2,Fully_Customizable_Private_Tour_From_Seoul_to_Gyeonggi,AttractionProductReview-g294197-d12649287-Fully_Customizable_Private_Tour_From_Seoul_to_Gyeonggi-Seoul
3,Half_Day_Korean_DMZ_Tour_from_Seoul,AttractionProductReview-g294197-d11989360-Half_Day_Korean_DMZ_Tour_from_Seoul-Seoul
4,Korean_Demilitarized_Zone_DMZ_Half_Day_Tour_from_Seoul,AttractionProductReview-g294197-d11454650-Korean_Demilitarized_Zone_DMZ_Half_Day_Tour_from_Seoul-Seoul
...,...,...
140,Suwon_Hwaseong_Fortress_and_Korean_Folk_Village_Day_Tour_from_Seoul,AttractionProductReview-g294197-d11461630-Suwon_Hwaseong_Fortress_and_Korean_Folk_Village_Day_Tour_from_Seoul-Seoul
141,Korean_Palace_and_Temple_Tour_in_Seoul_Gyeongbokgung_Palace_and_Jogyesa_Temple,AttractionProductReview-g294197-d12463701-Korean_Palace_and_Temple_Tour_in_Seoul_Gyeongbokgung_Palace_and_Jogyesa_Temple-Seo
142,Korean_Folk_Village,AttractionProductReview-g294197-d13159855-Korean_Folk_Village-Seoul
143,History_and_Culture_Tour,AttractionProductReview-g294197-d20232226-History_and_Culture_Tour-Seoul


In [3]:
df_title = df_title[107:]
df_title

Unnamed: 0,tour_title,tour_url
107,Secret_Food_Tours_Seoul_w_Private_Tour_Option,AttractionProductReview-g294197-d17597945-Secret_Food_Tours_Seoul_w_Private_Tour_Option-Seoul
108,Private_Day_Trip_to_Seoraksan_National_Park,AttractionProductReview-g294197-d12968517-Private_Day_Trip_to_Seoraksan_National_Park-Seoul
109,Ganghwa_Island_Full_day_private_tour,AttractionProductReview-g297889-d16823242-Ganghwa_Island_Full_day_private_tour-Incheon
110,Full_Day_Essential_Jeju_Island_Private_tour_for_West_Course,AttractionProductReview-g297885-d16654477-Full_Day_Essential_Jeju_Island_Private_tour_for_West_Course-Jeju_Jeju_Island
111,Hanbok_Photoshoot_in_Seoul,AttractionProductReview-g294197-d11484280-Hanbok_Photoshoot_in_Seoul-Seoul
112,Suwon_Hwaseong_Fortress_Small_Group_Morning_Tour_from_Seoul,AttractionProductReview-g294197-d11464111-Suwon_Hwaseong_Fortress_Small_Group_Morning_Tour_from_Seoul-Seoul
113,Private_Day_Trip_to_Korean_Folk_Village_and_Hwaseong_Fortress,AttractionProductReview-g294197-d12968516-Private_Day_Trip_to_Korean_Folk_Village_and_Hwaseong_Fortress-Seoul
114,"Hanbok_Photo_Shooting_SnapStudio_Gyeongbokgung_Palace_Hanbok_Rental,_Make_up",AttractionProductReview-g294197-d17773425-Hanbok_Photo_Shooting_Snap_Studio_Gyeongbokgung_Palace_Hanbok_Rental_Make_up-Seoul
115,VIP_Brompton_Bike_Food_Tour_with_Car_Pick_Up_Service,AttractionProductReview-g294197-d16795119-VIP_Brompton_Bike_Food_Tour_with_Car_Pick_Up_Service-Seoul
116,Private_Group_Day_Trip_to_Seongmodo_Island_and_Ganghwa_Island,AttractionProductReview-g294197-d12471528-Private_Group_Day_Trip_to_Seongmodo_Island_and_Ganghwa_Island-Seoul


In [4]:
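# Opening Chrome web driver
# Selenium 4 takes the driver path through a Service object; on Selenium 3, pass the path directly to webdriver.Chrome
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))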

In [5]:
tour_url = df_title['tour_url'].tolist()
tour_title = df_title['tour_title'].tolist()

In [6]:
def tour_search(program):
    '''Getting tour page url for tour programs'''
    url = f'https://www.tripadvisor.com/{program}.html'
    return url

def get_reviews(driver):
    '''Scraping review contents (title, comment, score)'''
    
    # Clicking 'read more' for review comments (5 items per page)
    # "xpath" is the locator string behind By.XPATH; find_element_by_xpath was removed in Selenium 4.3
    for idx in range(5):
        try:
            driver.find_element("xpath", f"//div[@data-automation='reviewReadMoreCTA_{idx}']").click()
            time.sleep(0.1)
        except Exception:
            # No (more) 'read more' buttons on this page
            break
    
    # Getting page source and parsing with BeautifulSoup
    results = driver.page_source
    soup = BeautifulSoup(results, 'lxml')
    review_pg_open = soup.select_one('[data-automation = "reviewList"]')

    result_pg = []
    
    for review in review_pg_open:
        # Tour title, review title, review comment, review date, trip type
        tour_title = soup.select('span.IKwHbf8J')[0].text
        review_title = review.select('div._2cigFICy')[0].text
        review_comment = review.select('q._2vmgOjMl > span')[0].text
        
        temp_date = review.select('div._30IBqsJg > span')
        # Fall back to an empty string when the date element is missing or empty
        review_date = temp_date[0].text if temp_date else ''

        # Review score: count the rating bubbles whose SVG path is not the empty-bubble outline
        empty_bubble = 'M 12 0C5.388 0 0 5.388 0 12s5.388 12 12 12 12-5.38 12-12c0-6.612-5.38-12-12-12zm0 2a9.983 9.983 0 019.995 10 10 10 0 01-10 10A10 10 0 012 12 10 10 0 0112 2z'
        path = review.find_all('path')
        review_score = sum(1 for p in path[:5] if p.attrs['d'] != empty_bubble)
        
        # Trip type
        temp_type = review.find_all('div', {'class': '_20I-kAyv'})
        # Fall back to an empty string when the trip type element is missing or empty
        trip_type = temp_type[0].text if temp_type else ''
        
        result = [tour_title, review_title, review_comment, review_date, review_score, trip_type]
        result_pg.append(result)
    
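    # Passing a flat list of names avoids creating MultiIndex columns and also works when no reviews were found
    df = pd.DataFrame(result_pg, columns=['tour_title', 'review_title', 'review_comment', 'review_date', 'review_score', 'trip_type'])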
    
    return df

def next_reviews(driver):
    '''Scraping information through get_reviews function and moving to next page'''
    result_all_pg = pd.DataFrame()
    
    for page in range(0,70):
        pg_review = get_reviews(driver)
        result_all_pg = pd.concat([result_all_pg, pg_review], axis = 0).reset_index(drop=True)
        print(f'{page} page done')
        
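        try:
            # "xpath" is the locator string behind By.XPATH; find_element_by_xpath was removed in Selenium 4.3
            driver.find_element("xpath", "//a[@data-automation='pageLink_next']").click()
            time.sleep(3)
            
        except Exception:
            # No 'next' link: the last page has been reached
            break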
            
    return result_all_pg
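
The score logic in `get_reviews` counts how many of the five rating bubbles are filled by comparing each SVG `path` against the empty-bubble outline. Below is a minimal offline sketch of the same idea. The markup is made up, and `EMPTY`, `FILLED`, and `count_filled` are hypothetical names with placeholder path strings, not Tripadvisor's real values.

```python
from bs4 import BeautifulSoup

EMPTY = "M0 0h1"   # placeholder standing in for the empty-bubble outline
FILLED = "M0 0h2"  # placeholder standing in for any filled-bubble shape

def count_filled(review_html, empty_path=EMPTY, max_bubbles=5):
    """Count rating bubbles whose path differs from the empty outline."""
    soup = BeautifulSoup(review_html, "html.parser")
    paths = soup.find_all("path")[:max_bubbles]
    return sum(1 for p in paths if p.get("d") != empty_path)

# Four filled bubbles followed by one empty bubble
sample = "<div><svg>" + "".join(
    f'<path d="{d}"></path>' for d in [FILLED, FILLED, FILLED, FILLED, EMPTY]
) + "</svg></div>"

print(count_filled(sample))
```

Because the comparison is against a single hard-coded outline, any change to the site's icons would silently shift every score, so spot-checking a few scraped scores against the live page is worthwhile.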

In [7]:
review_all_tours = pd.DataFrame()

for program, title in zip(tour_url, tour_title):
    url = tour_search(program)
    driver.get(url)
    driver.implicitly_wait(10)
    time.sleep(3)
    
    review = next_reviews(driver)
    
    review_all_tours = pd.concat([review_all_tours, review], axis = 0).reset_index(drop=True)
    review.to_excel(f'./t_scraped/{title}_scored.xlsx')
    print(f'"{title}" done')

0 page done
1 page done
2 page done
3 page done
4 page done
5 page done
"Secret_Food_Tours_Seoul_w_Private_Tour_Option" done
0 page done
1 page done
"Private_Day_Trip_to_Seoraksan_National_Park" done
0 page done
"Ganghwa_Island_Full_day_private_tour" done
0 page done
"Full_Day_Essential_Jeju_Island_Private_tour_for_West_Course" done
0 page done
1 page done
"Hanbok_Photoshoot_in_Seoul" done
0 page done
"Suwon_Hwaseong_Fortress_Small_Group_Morning_Tour_from_Seoul" done
0 page done
"Private_Day_Trip_to_Korean_Folk_Village_and_Hwaseong_Fortress" done
0 page done
1 page done
2 page done
3 page done
"Hanbok_Photo_Shooting_SnapStudio_Gyeongbokgung_Palace_Hanbok_Rental,_Make_up" done
0 page done
"VIP_Brompton_Bike_Food_Tour_with_Car_Pick_Up_Service" done
0 page done
"Private_Group_Day_Trip_to_Seongmodo_Island_and_Ganghwa_Island" done
0 page done
"Pyeongchang_Private_Day_Trip_from_Seoul" done
0 page done
1 page done
2 page done
3 page done
"Full_day_Customizable_Private_Seoul_Highlight_Tour_wit

In [8]:
review_all_tours.to_excel('./t_scraped/review_all_tours.xlsx')
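
With all reviews collected, step 3 of the plan (separating high ratings from low ratings) might look like the sketch below. The cut-offs (4 and above as high, 2 and below as low), the helper `split_by_score`, and the sample rows are assumptions; the notebook does not state which thresholds were used.

```python
import pandas as pd

def split_by_score(df, high_min=4, low_max=2):
    """Split reviews into a high-score group and a low-score group."""
    # Coerce to numbers so scores read back from Excel as text still compare correctly
    scores = pd.to_numeric(df["review_score"], errors="coerce")
    high = df[scores >= high_min].reset_index(drop=True)
    low = df[scores <= low_max].reset_index(drop=True)
    return high, low

# Made-up sample reviews
reviews = pd.DataFrame({
    "review_comment": ["Great guide", "Too rushed", "It was fine", "Loved it"],
    "review_score": [5, 1, 3, 4],
})

high_reviews, low_reviews = split_by_score(reviews)
print(len(high_reviews), len(low_reviews))
```

Leaving the middle score out of both groups keeps the contrast between the two sets sharper, at the cost of discarding some reviews.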

Further work
- Collect additional information such as prices
- Merge the Excel files into a single file
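
The file-merging item could be approached along these lines. This is only a sketch: the helper `merge_review_files`, its `pattern` and `reader` parameters, and the output file name in the usage comment are assumptions, and reading `.xlsx` files requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path
import pandas as pd

def merge_review_files(folder, pattern="*_scored.xlsx", reader=pd.read_excel):
    """Read every matching file in `folder` and stack them into one DataFrame."""
    paths = sorted(Path(folder).glob(pattern))
    frames = [reader(p) for p in paths]
    if not frames:
        # Nothing matched: return an empty frame instead of failing in pd.concat
        return pd.DataFrame()
    return pd.concat(frames, axis=0, ignore_index=True)

# Usage (assumes the ./t_scraped folder written by the scraping loop exists):
# merged = merge_review_files("./t_scraped")
# merged.to_excel("./t_scraped/review_all_tours_merged.xlsx", index=False)
```

Since the per-tour files were saved with their index, each one comes back with an extra `Unnamed: 0` column; dropping it after the merge, or passing a reader that sets `index_col=0`, avoids carrying it into the combined file.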