<a href="https://colab.research.google.com/github/angel870326/NTU_Social_Media_Analytics/blob/main/HW3/code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> 2023.05.20 Ssu-Yun Wang<br/>
[GitHub @angel870326](https://github.com/angel870326)

# **Programming Assignment 3: Sentiment Analysis**

### Contents

1. Read Data
2. Preprocessing
3. Feature Extraction for Training and Validation Sets
4. Classification
    * 4.0 Fit & Evaluate
    * 4.1 Naive Bayes
    * 4.2 SVM
    * 4.3 Logistic Regression
    * 4.4 Random Forest
    * 4.5 XGB
5. Train & Predict with Logistic Regression
    * 5.1 Feature Extraction
    * 5.2 Train
    * 5.3 Predict




In [1]:
# Connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [2]:
import os
import pandas as pd
import numpy as np

## **1. Read Data**

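**data.csv:** Used for model training. It has 31,860 rows and the following two columns.<br>
*   content: a text review of the service provided by an airline.
*   category: the label indicating whether the review is negative (-1), neutral (0), or positive (1).

<br>

**predict.csv:** Used for producing predictions. It has 5,000 unlabeled rows and the following two columns.<br>
*   num: the ID number of each review.
*   content: a text review of the service provided by an airline.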


In [3]:
# Path
path = '/content/gdrive/MyDrive/碩二下/社群媒體分析/HW/HW3'

In [4]:
data = pd.read_csv(os.path.join(path, 'data.csv'), encoding='latin-1')
data

Unnamed: 0,content,category
0,Two short hops ZRH-LJU and LJU-VIE. Very fast ...,1
1,Flew Zurich-Ljubljana on JP365 newish CRJ900. ...,1
2,Adria serves this 100 min flight from Ljubljan...,1
3,WAW-SKJ Economy. No free snacks or drinks on t...,-1
4,Sarajevo-Frankfurt via Ljubljana. I loved flyi...,1
...,...,...
31855,Treviso to Lviv. Seemed like a new plane. Very...,1
31856,Rome to Prague. Was very happy with the flight...,1
31857,We often fly with Wizzair to/from Charleroi/Bu...,1
31858,PRG-LTN and LTN-PRG were rather good flights. ...,0


In [5]:
# Check if Y is imbalanced
data['category'].value_counts()

 1    14847
-1    10784
 0     6229
Name: category, dtype: int64
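
The counts above are easier to judge as proportions. The following sketch recomputes them from the numbers printed by `value_counts()` and derives the accuracy of a trivial classifier that always predicts the most frequent class; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {1: 14847, -1: 10784, 0: 6229}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:>2}: {share:.3f}")

# Accuracy of always predicting the most frequent class
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The neutral class (0) makes up roughly a fifth of the data, so accuracy alone can hide poor performance on it; per-class recall is the more informative number.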

In [6]:
predict = pd.read_csv(os.path.join(path, 'predict.csv'), encoding='latin-1')
predict

Unnamed: 0,num,content
0,1,Outbound flight FRA/PRN A319. 2 hours 10 min f...
1,2,Our flight from Rhodes to Athens on route to H...
2,3,Athens to Larnaca economy. Early morning fligh...
3,4,MUC-SKG on 17th of Dec. One-way. New A320 with...
4,5,Flew London to Athens and then on to Ioannina ...
...,...,...
4995,4996,28 Oct FLT OZ 574. Tired old plane. 90 minutes...
4996,4997,SYD-ICN: old plane old reclining seat poor ent...
4997,4998,Flew from Recife Brazil to Miami. This airline...
4998,4999,I travel internationally very frequently on bu...


## **2. Preprocessing**

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [8]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [9]:
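def preprocess_text(text):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, then lemmatize."""
    tokens = word_tokenize(text.lower())    # lowercase, then tokenize
    # Alternative: stemming instead of lemmatization
    # filtered_tokens = [stemmer.stem(token) for token in tokens if token.isalpha() and token not in stop_words]
    # lemmatize() treats each token as a noun unless a POS tag is passed
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(filtered_tokens)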
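
To see what the `isalpha()` and stop-word filters remove, here is a dependency-free sketch of the same filtering steps. It is only a stand-in: `rough_preprocess`, the tiny `STOP_WORDS` set, and the regex tokenizer are assumptions made for illustration, and it skips lemmatization entirely, so its output will not always match `preprocess_text`.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"on", "and", "the", "a", "to", "was"}

def rough_preprocess(text):
    # Crude tokenizer: split on whitespace and common punctuation
    # (a stand-in for nltk's word_tokenize)
    tokens = re.findall(r"[^\s.,!?;:]+", text.lower())
    # Keep purely alphabetic tokens that are not stop words
    kept = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return " ".join(kept)

sample = "Flew Zurich-Ljubljana on JP365 newish CRJ900."
result = rough_preprocess(sample)
print(result)
```

Tokens containing digits or hyphens, such as flight numbers, aircraft types, and hyphenated routes, fail `isalpha()` and are dropped, which is why route codes disappear from the cleaned reviews.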

In [10]:
train = data.copy()
train['content'] = train['content'].apply(preprocess_text)
train

Unnamed: 0,content,category
0,two short hop fast crj seat comfortable crew f...,1
1,flew newish flight almost full departure time ...,1
2,adria serf min flight ljubljana amsterdam bran...,1
3,economy free snack drink star alliance partner...,-1
4,via ljubljana loved flying small airline funct...,1
...,...,...
31855,treviso lviv seemed like new plane comfortable...,1
31856,rome prague happy flight airplane clean leathe...,1
31857,often fly wizzair find quite reliable least de...,1
31858,rather good flight minimum legroom friendly st...,0


In [11]:
test = predict.copy()
test['content'] = test['content'].apply(preprocess_text)
test

Unnamed: 0,num,content
0,1,outbound flight hour min flight thought sale s...
1,2,flight rhodes athens route heathrow cancelled ...
2,3,athens larnaca economy early morning flight de...
3,4,new new style seat quite uncomfortable though ...
4,5,flew london athens ioannina ioannina athens ba...
...,...,...
4995,4996,oct flt oz tired old plane minute late taking ...
4996,4997,old plane old reclining seat poor entertainmen...
4997,4998,flew recife brazil miami airline continues liv...
4998,4999,travel internationally frequently business hol...


## **3. Feature Extraction for Training and Validation Sets**


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(train['content'], train['category'], test_size=0.2, random_state=42)
print("X_train", X_train.shape)
print("y_train", y_train.shape)
print("X_valid", X_valid.shape)
print("y_valid", y_valid.shape)

X_train (25488,)
y_train (25488,)
X_valid (6372,)
y_valid (6372,)
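
Because the classes are imbalanced, one option the notebook does not use is a stratified split, which keeps the class proportions the same in both subsets. The sketch below shows this on made-up labels with `train_test_split`'s `stratify` argument; the toy `texts` and `labels` are assumptions for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data with a 50/30/20 class mix (hypothetical, not the airline reviews)
labels = [1] * 50 + [-1] * 30 + [0] * 20
texts = [f"review {i}" for i in range(100)]

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The 20 validation rows preserve the 50/30/20 mix
print(Counter(y_va))
```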


In [14]:
# Vectorize the text data
vectorizer = TfidfVectorizer()
X_train_v = vectorizer.fit_transform(X_train)
X_valid_v = vectorizer.transform(X_valid)
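
The vectorizer is fitted on the training split only and then reused to transform the validation split, so the validation data does not influence the vocabulary or the IDF weights. A small sketch with made-up documents shows the consequence: words that appear only in the validation text are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
train_docs = ["good flight friendly crew", "bad flight late departure"]
valid_docs = ["good crew terrible food"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_va = vec.transform(valid_docs)       # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_va.shape)
print("non-zero features in validation row:", X_va.nnz)
```

Here "terrible" and "food" never occur in the training documents, so only "good" and "crew" contribute to the validation vector.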

## **4. Classification**



### **4.0 Fit & Evaluate**


In [15]:
from sklearn.metrics import classification_report

In [16]:
def fit_evaluate(clf, X_train, y_train, X_valid, y_valid, y_converted: bool = False):
    """Fit clf on the training set and print a classification report for the validation set."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_valid)
    if y_converted:   # y_train was shifted from (-1, 0, 1) to (0, 1, 2)
        y_pred = [y - 1 for y in y_pred]    # Shift y_pred back to (-1, 0, 1)
    # Report only the classes that actually appear in y_pred
    print(classification_report(y_valid, y_pred, labels=np.unique(y_pred)))

### **4.1 Naive Bayes**


In [17]:
from sklearn.naive_bayes import MultinomialNB

In [18]:
clf = MultinomialNB()
fit_evaluate(clf, X_train_v, y_train, X_valid_v, y_valid)

              precision    recall  f1-score   support

          -1       0.80      0.79      0.79      2106
           0       1.00      0.00      0.00      1280
           1       0.67      0.95      0.78      2986

    accuracy                           0.71      6372
   macro avg       0.82      0.58      0.53      6372
weighted avg       0.78      0.71      0.63      6372
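
The row for class 0 (precision 1.00, recall 0.00) looks contradictory until the definitions are written out. The sketch below reproduces such a row; the support of 1280 comes from the report, but the `tp` and `fp` counts are hypothetical, since the notebook does not print a confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

support = 1280      # number of true class-0 reviews, from the report
tp, fp = 2, 0       # hypothetical: class 0 predicted twice, both times correctly
fn = support - tp   # every other class-0 review was missed

p, r, f = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A model that almost never predicts the neutral class can be right on the few occasions it does, giving perfect precision while missing nearly all neutral reviews.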



### **4.2 SVM**


In [19]:
from sklearn.svm import LinearSVC

In [20]:
clf = LinearSVC()
fit_evaluate(clf, X_train_v, y_train, X_valid_v, y_valid)

              precision    recall  f1-score   support

          -1       0.76      0.85      0.80      2106
           0       0.50      0.28      0.36      1280
           1       0.79      0.88      0.84      2986

    accuracy                           0.75      6372
   macro avg       0.68      0.67      0.66      6372
weighted avg       0.72      0.75      0.73      6372



### **4.3 Logistic Regression**



In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
clf = LogisticRegression(max_iter = 1000, solver = 'saga')
fit_evaluate(clf, X_train_v, y_train, X_valid_v, y_valid)

              precision    recall  f1-score   support

          -1       0.78      0.86      0.82      2106
           0       0.54      0.30      0.38      1280
           1       0.80      0.89      0.84      2986

    accuracy                           0.76      6372
   macro avg       0.70      0.68      0.68      6372
weighted avg       0.74      0.76      0.74      6372
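
The two average rows differ in how they weight the classes. As a worked check, the sketch below recomputes both F1 averages from the rounded per-class values in the report above. `classification_report` works from unrounded scores, so small discrepancies are possible in general, although here the results agree.

```python
# (f1-score, support) per class, copied from the report above
rows = {-1: (0.82, 2106), 0: (0.38, 1280), 1: (0.84, 2986)}

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each class counts in proportion to its support
total_support = sum(n for _, n in rows.values())
weighted_f1 = sum(f1 * n for f1, n in rows.values()) / total_support

print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```

The macro average is lower because the weak neutral class pulls it down as much as either of the larger classes lifts it.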



### **4.4 Random Forest**



In [23]:
from sklearn.ensemble import RandomForestClassifier

In [24]:
clf = RandomForestClassifier()
fit_evaluate(clf, X_train_v, y_train, X_valid_v, y_valid)

              precision    recall  f1-score   support

          -1       0.74      0.83      0.78      2106
           0       0.73      0.01      0.01      1280
           1       0.70      0.93      0.80      2986

    accuracy                           0.71      6372
   macro avg       0.72      0.59      0.53      6372
weighted avg       0.72      0.71      0.64      6372



### **4.5 XGB**



In [25]:
from xgboost import XGBClassifier

In [26]:
clf = XGBClassifier()
converted_y_train = [y+1 for y in y_train]
fit_evaluate(clf, X_train_v, converted_y_train, X_valid_v, y_valid, True)

              precision    recall  f1-score   support

          -1       0.77      0.85      0.80      2106
           0       0.51      0.22      0.31      1280
           1       0.77      0.90      0.83      2986

    accuracy                           0.75      6372
   macro avg       0.68      0.66      0.65      6372
weighted avg       0.72      0.75      0.72      6372
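
Recent XGBoost releases expect class labels to be consecutive integers starting at 0, which is presumably why the labels are shifted by one before fitting. An alternative to the manual `y+1` / `y-1` shift, shown here only as a sketch, is scikit-learn's `LabelEncoder`, which handles the mapping in both directions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([-1, 0, 1, 1, -1])   # toy labels in the notebook's coding

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)               # maps (-1, 0, 1) to (0, 1, 2)
y_decoded = encoder.inverse_transform(y_encoded)   # maps back to (-1, 0, 1)

print(encoder.classes_)
print(y_encoded)
print(y_decoded)
```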



## **5. Train & Predict with Logistic Regression**


### **5.1 Feature Extraction**

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train['content'])
test_X = vectorizer.transform(test['content'])

### **5.2 Train**


In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
clf = LogisticRegression(max_iter = 1000, solver = 'saga')
clf.fit(train_X, train['category'])
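
The vectorizer and classifier are fitted as two separate objects here. One way to keep them together, not used in this notebook, is a scikit-learn pipeline, so that raw text goes in and labels come out. The sketch below uses made-up reviews, and `random_state` is added only to make the toy run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews for illustration
train_texts = [
    "great flight friendly crew",
    "terrible delay rude staff",
    "average flight nothing special",
    "comfortable seat good food",
    "lost luggage awful service",
    "okay flight normal service",
]
train_labels = [1, -1, 0, 1, -1, 0]

model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, solver="saga", random_state=0),
)
model.fit(train_texts, train_labels)

new_texts = ["friendly crew good food", "rude staff awful delay"]
preds = model.predict(new_texts)   # vectorization happens inside the pipeline
print(preds)
```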

### **5.3 Predict**


In [31]:
test['pred'] = clf.predict(test_X)
test

Unnamed: 0,num,content,pred
0,1,outbound flight hour min flight thought sale s...,1
1,2,flight rhodes athens route heathrow cancelled ...,1
2,3,athens larnaca economy early morning flight de...,1
3,4,new new style seat quite uncomfortable though ...,1
4,5,flew london athens ioannina ioannina athens ba...,1
...,...,...,...
4995,4996,oct flt oz tired old plane minute late taking ...,-1
4996,4997,old plane old reclining seat poor entertainmen...,-1
4997,4998,flew recife brazil miami airline continues liv...,-1
4998,4999,travel internationally frequently business hol...,-1


In [32]:
test['pred'].value_counts()

 1    2314
-1    2170
 0     516
Name: pred, dtype: int64
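
A quick sanity check is to compare the predicted class shares with the class shares in the training data. The sketch below uses only the counts printed in this notebook; the two datasets need not share the same true distribution, so a gap is a hint rather than proof of a problem.

```python
# Counts copied from the two value_counts() outputs in this notebook
train_counts = {1: 14847, -1: 10784, 0: 6229}
pred_counts = {1: 2314, -1: 2170, 0: 516}

train_total = sum(train_counts.values())
pred_total = sum(pred_counts.values())

train_share = {k: v / train_total for k, v in train_counts.items()}
pred_share = {k: v / pred_total for k, v in pred_counts.items()}

for label in (-1, 0, 1):
    print(f"{label:>2}: train {train_share[label]:.3f}  predicted {pred_share[label]:.3f}")
```

The neutral class is predicted for about 10% of the reviews against about 20% in the training labels, which is consistent with the low neutral recall (0.30) seen on the validation set.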

In [33]:
test[['pred']].to_csv(os.path.join(path, 'result.csv'), index = False, header = False)