✅ [0] Install libraries and download the spaCy model

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install contractions

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


✅ [1] Import libraries and initialize spaCy

In [7]:
import pandas as pd
import re
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

✅ [2] Load the data

In [8]:
df = pd.read_csv("dataset/Hotel_Reviews.csv")
print("Number of rows:", len(df))
df.head()

Number of rows: 515738


Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968
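Only `Positive_Review` and `Negative_Review` are used in the later cells, so one option is to parse just those two columns. The snippet below is a minimal sketch of that idea with pandas' `usecols` parameter; it reads a tiny in-memory CSV with invented rows instead of `dataset/Hotel_Reviews.csv`, and the extra columns are there only to show that they are skipped.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for dataset/Hotel_Reviews.csv (rows invented for illustration).
csv_text = io.StringIO(
    "Hotel_Name,Reviewer_Score,Positive_Review,Negative_Review\n"
    "Hotel Arena,7.5,Great location,No Negative\n"
    "Hotel Arena,3.8,No Positive,Room was dirty\n"
)

# usecols restricts parsing to the listed columns; the others are never loaded.
reviews = pd.read_csv(csv_text, usecols=["Positive_Review", "Negative_Review"])

print(reviews.columns.tolist())
print(len(reviews), "rows loaded")
```

With the real file, the same `usecols` argument could be passed to the `pd.read_csv("dataset/Hotel_Reviews.csv")` call, at the cost of `df.head()` no longer showing the other columns.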


✅ [3] Split reviews by sentiment and combine

In [10]:
# Positive reviews (rows holding the "No Positive" placeholder are excluded)
pos_df = df[df['Positive_Review'].str.strip().str.lower() != 'no positive'][['Positive_Review']]
pos_df = pos_df.rename(columns={'Positive_Review': 'Review'})
pos_df['Sentiment'] = 1

# Negative reviews (rows holding the "No Negative" placeholder are excluded)
neg_df = df[df['Negative_Review'].str.strip().str.lower() != 'no negative'][['Negative_Review']]
neg_df = neg_df.rename(columns={'Negative_Review': 'Review'})
neg_df['Sentiment'] = 0

# Merge and clean up
df_clean = pd.concat([pos_df, neg_df], ignore_index=True)
df_clean.dropna(subset=['Review'], inplace=True)
df_clean = df_clean.reset_index(drop=True)

print("Final number of reviews:", len(df_clean))
df_clean.sample(5)

Final number of reviews: 867640


Unnamed: 0,Review,Sentiment
393607,Comfortable Room Good parking,1
510853,decor a bit tired room small and bed pillows ...,0
184945,The room was comfortable nice to have a bath ...,1
16569,the little extra touches like complimentary c...,1
198382,I liked that I was greeted by the concierge u...,1
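The review count (867640) is larger than the row count (515738) because each row can contribute up to two reviews: one positive and one negative. The sketch below reproduces the split-and-concatenate logic on a three-row toy frame so the effect is easy to see. The rows are invented, and `split_by_sentiment` is a hypothetical helper that does not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for Hotel_Reviews.csv; the three rows are invented for illustration.
toy = pd.DataFrame({
    "Positive_Review": [" Great location ", "No Positive", " Friendly staff "],
    "Negative_Review": ["No Negative", " Room was dirty ", " Noisy at night "],
})


def split_by_sentiment(frame, column, placeholder, label):
    """Keep rows whose text is not the placeholder and attach a sentiment label."""
    keep = frame[column].str.strip().str.lower() != placeholder
    out = frame.loc[keep, [column]].rename(columns={column: "Review"})
    out["Sentiment"] = label
    return out


pos = split_by_sentiment(toy, "Positive_Review", "no positive", 1)
neg = split_by_sentiment(toy, "Negative_Review", "no negative", 0)
combined = pd.concat([pos, neg], ignore_index=True)

print(len(toy), "rows ->", len(combined), "reviews")
print(combined)
```

Here the third row has both a positive and a negative text, so it appears twice in `combined`, once with each label.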


✅ [4] Preprocess review text (spaCy-based)

In [14]:
from tqdm import tqdm

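# Texts to preprocess
texts = df_clean['Review'].fillna("").tolist()

# Exclude negation words from stop-word removal so they are preserved
NEGATION_WORDS = {"not", "no", "never", "nor"}

# Collects the preprocessed text for each review
processed_texts = []

print("Starting spaCy preprocessing (negation words preserved)...")

for doc in tqdm(nlp.pipe(texts, batch_size=2000, n_process=2), total=len(texts)):
    try:
        # Keep lowercased lemmas of tokens that are not stop words
        # (negation words excepted), punctuation, or whitespace
        tokens = [
            token.lemma_.lower()
            for token in doc
            if (
                not token.is_stop or token.text.lower() in NEGATION_WORDS
            )
            and not token.is_punct
            and not token.is_space
        ]
        processed_texts.append(' '.join(tokens))
    except Exception as e:
        # Keep list lengths aligned with df_clean even if a document fails
        processed_texts.append("")
        print("Preprocessing error:", e)

# Store the result
df_clean['Processed'] = processed_texts

print("Preprocessing complete! Examples:")
print(df_clean[['Review', 'Processed']].sample(3))

Starting spaCy preprocessing (negation words preserved)...


100%|██████████| 867640/867640 [38:07<00:00, 379.33it/s]


Preprocessing complete! Examples: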
                                                   Review  \
695688   Elevator takes long long wifi is not working ...   
410811   Rooms were functioned well with a safe fridge...   
714481   I didn t like that our room phone didn t work...   

                                                Processed  
695688  elevator take long long wifi not work good cou...  
410811  room function safe fridge air conditioning sup...  
714481  didn t like room phone didn t work breakfast e...  
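The sample output shows `didn t` surviving as two separate tokens: the review text in this dataset has its apostrophes stripped, so a contraction such as "didn't" arrives as "didn t" and its negation never becomes the word "not". The notebook imports `re` and installs `contractions` but uses neither. The sketch below is one possible way to rewrite these apostrophe-less contractions before the text reaches spaCy. `restore_negations` and the `_STEMS` table are hypothetical names introduced here, and the list of stems is an assumption that may need extending.

```python
import re

# Hypothetical mapping from an apostrophe-less contraction stem to its base verb.
# The list is an assumption and is not exhaustive.
_STEMS = {
    "didn": "did", "don": "do", "doesn": "does",
    "isn": "is", "aren": "are", "wasn": "was", "weren": "were",
    "hasn": "has", "haven": "have", "hadn": "had",
    "couldn": "could", "wouldn": "would", "shouldn": "should",
    "can": "can", "won": "will",
}

# Matches a known stem followed by a standalone "t", e.g. "didn t" or "can t".
_PATTERN = re.compile(r"\b(" + "|".join(_STEMS) + r") t\b", re.IGNORECASE)


def restore_negations(text: str) -> str:
    """Rewrite 'didn t' style contractions as '<verb> not' (the verb is lowercased)."""
    return _PATTERN.sub(lambda m: _STEMS[m.group(1).lower()] + " not", text)


print(restore_negations("I didn t like that our room phone didn t work"))
print(restore_negations("Don t change"))
print(restore_negations("I can t suggest it"))
```

Applied to `texts` before `nlp.pipe`, this would let the existing `NEGATION_WORDS` rule keep the resulting "not". A whitelist of stems is used rather than a generic pattern so that unrelated words ending in "n" followed by a lone "t" are left alone.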


✅ Preprocessing quality check

In [15]:
# 1. Check for empty strings
num_empty = (df_clean['Processed'] == "").sum()
print(f"⚠️ Number of empty Processed reviews: {num_empty}")

# 2. Share of reviews containing a negation word
negation_words = {"not", "no", "never", "nor", "n't", "didn", "don", "wasn", "weren", "couldn", "wouldn", "shouldn"}
df_clean['HasNegation'] = df_clean['Processed'].apply(
    # True when at least one whitespace-separated token is a negation word
    lambda x: not negation_words.isdisjoint(x.split())
)
negation_ratio = df_clean['HasNegation'].mean()
print(f"✅ Share of reviews containing a negation word: {negation_ratio:.2%}")

# 3. Print examples that contain a negation word
print("\n📌 Examples containing a negation word:")
print(df_clean[df_clean['HasNegation']].sample(5)[['Review', 'Processed']])


⚠️ Number of empty Processed reviews: 31995
✅ Share of reviews containing a negation word: 23.86%

📌 Examples containing a negation word:
                                                   Review  \
625522   It is a 5 star hotel but I can t suggest anyo...   
519217   No heating in the room and window broken afte...   
756343   I didn t have any WiFi the room was not very ...   
686161   Little out of the main tourist area which is ...   
525833                                       Don t change   

                                                Processed  
625522  5 star hotel t suggest know feeling experience...  
519217  no heating room window break report reception ...  
756343  didn t wifi room not clean no riding lamp not ...  
686161    little main tourist area not ideal limited time  
525833                                       don t change  
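The check reports 31995 empty `Processed` strings out of 867640 reviews (about 3.7%), presumably reviews in which every token was a stop word, punctuation, or whitespace. Such rows carry no text for a downstream model, so one option is to drop them. The sketch below shows that filter on a toy frame with invented rows; whether to drop or keep these rows in the real `df_clean` is a modelling decision the notebook does not make.

```python
import pandas as pd

# Toy frame mimicking df_clean after preprocessing (rows invented for illustration).
toy = pd.DataFrame({
    "Review": [" Nothing ", " Room was not clean ", " It was "],
    "Processed": ["", "room not clean", ""],
    "Sentiment": [0, 0, 1],
})

# Keep only rows whose processed text still contains at least one token.
non_empty = toy["Processed"].str.strip().ne("")
toy_filtered = toy[non_empty].reset_index(drop=True)

print(f"dropped {len(toy) - len(toy_filtered)} empty rows, kept {len(toy_filtered)}")
```

Resetting the index after filtering keeps positional lookups consistent with the new row count.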
