# **Lesson02: Probabilistic-Based Methods**

In [1]:
!gdown --id 1p4moPeR2QoRmoTPWTx0rbtGdGocZPhWA

Downloading...
From: https://drive.google.com/uc?id=1p4moPeR2QoRmoTPWTx0rbtGdGocZPhWA
To: /content/ChnSentiCorp_htl_ba_6000_cutted.csv


## **Load the Dataset and Preprocess It**

In [2]:
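import ast

import pandas as pd


seed = 42

data = pd.read_csv('ChnSentiCorp_htl_ba_6000_cutted.csv', compression='gzip')

# The `cut` column may be stored as the string form of a token list;
# ast.literal_eval parses it back into a list without executing arbitrary code.
if isinstance(data.cut[0], str):
    data.cut = data.cut.apply(ast.literal_eval)
data.sample(20, random_state=seed)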

Unnamed: 0,label,review,cut
1782,1,没有想象的离西湖近 年轻人走走也得10 15分钟 因为窗不是面对西湖的 所以景基本看不到西湖...,"[没有, 想象, 西湖, 近, 年轻人, 走走, 10, 15, 分钟, 窗, 面对, 西湖..."
3917,0,住的标准单人间实在是小 房间里一张大床已经塞得满满的了 设施极其简陋除了电视机和烧水壶其他什...,"[住, 标准, 单人间, 实在, 小, 房间, 里, 一张, 大床, 塞得, 满满的, 设施..."
221,1,初到当地 在携程上订了一晚这家酒店 印象中原来认为新疆很偏僻 没想到酒店的环境及服务与内地的...,"[初到, 携程, 上订, 一晚, 这家, 酒店, 印象, 中, 新疆, 偏僻, 没想到, 酒..."
2135,1,酒店的床比一般的香港酒店大 双人房的床都很大 枕头超多超舒服 酒店正好在金钟的正上方 逛街很...,"[酒店, 床, 比, 一般, 香港, 酒店, 大, 双人房, 床, 很大, 枕头, 超多超,..."
5224,0,在网上综合考虑后 并打电话给汉庭 汉庭说靠街边的窗户都换成双层窗户 隔音很好 也有停车场 汉...,"[网上, 综合, 后, 打电话, 给汉庭, 汉庭, 说, 靠, 街边, 窗户, 换成, 双层..."
1168,1,总体感觉很不错 酒店硬件和软件都还满意 前台服务生服务热情到位 房间安静 不嘈杂 总体房间还...,"[总体, 感觉, 不错, 酒店, 硬件, 和, 软件, 满意, 前台, 服务生, 服务, 热..."
879,1,感觉还算可以的 不过没有香格里拉那么好 价格和协议的也没什么区别 所以还是去香格里拉较划算点,"[感觉, 算, 没有, 香格里拉, 好, 价格, 和, 协议, 没什么, 区别, 香格里拉,..."
156,1,各方面都一般 只要期望值不太高就还可以,"[方面, 一般, 期望值, 不太高]"
1657,1,这次来的比较晚 出去吃饭不太方便 就在酒店的餐厅吃的海鲜自助 还可以 不过感觉价格稍微偏高了...,"[来, 比较, 晚, 吃饭, 不太, 方便, 酒店, 餐厅, 吃, 海鲜, 自助, 感觉, ..."
323,1,我订的是行政房 但酒店给我免费升级到公寓房 是他们新的房子 房间设施 装修都非常不错 而且可...,"[我订, 行政, 房, 酒店, 给, 免费, 升级, 公寓, 房, 新, 房子, 房间, 设..."


### Encode labels as integer class indices

In [3]:
from sklearn import preprocessing


le = preprocessing.LabelEncoder()

y = le.fit_transform(data.label)
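
`LabelEncoder` produces one integer code per sample, not a one-hot vector. The following is a minimal sketch on hypothetical string labels (`toy_labels` is not part of the dataset) showing the integer codes and, for comparison, how one-hot rows could be derived from them if a model needed that form.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only for illustration.
toy_labels = ["neg", "pos", "pos", "neg"]

toy_le = LabelEncoder()
# Integer class indices; classes are ordered by sorted label value.
toy_codes = toy_le.fit_transform(toy_labels)

# One-hot rows can be built from the integer codes by indexing an identity matrix.
toy_one_hot = np.eye(len(toy_le.classes_), dtype=int)[toy_codes]

print(toy_le.classes_)
print(toy_codes)
print(toy_one_hot)
```

The classifiers used in this lesson accept the integer codes directly, so the one-hot form is shown only to make the distinction concrete.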

### Split Train and Test Dataset

In [4]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
  data.cut.str.join(sep=' ').values, y, 
  test_size=0.2, random_state=seed, shuffle=True)

## **Define Metric object**

In [5]:
import numpy as np
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support


class Metric(object):

  def __init__(self, y_true):
    self.y_true = y_true

  def get_metric(self, y_pred, y_true=None):
    if y_true is None:
      y_true = self.y_true

    # Log loss needs class probabilities (a 2-D array); for hard label
    # predictions it stays NaN.
    loss = np.nan
    if y_pred.ndim > 1:
      loss = log_loss(y_true=y_true, y_pred=y_pred)
      y_pred = y_pred.argmax(axis=1)

    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    precision, recall, f_score, _ = precision_recall_fscore_support(
      y_true=y_true, y_pred=y_pred, average='macro', zero_division='warn')

    print('loss:', loss)
    print('accuracy:', accuracy)
    print('precision:', precision)
    print('recall:', recall)
    print('f_score:', f_score)


metric_fn = Metric(y_true=y_test)
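
The `Metric` class reports precision, recall, and F-score with `average='macro'`, which is the unweighted mean of the per-class values. A small sketch on made-up labels (`toy_true` and `toy_pred` are illustrative, not taken from the dataset) shows how the macro value relates to the per-class ones.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: three samples of class 0, one of class 1.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 1, 1]

# average=None returns one value per class.
per_class_p, per_class_r, per_class_f, support = precision_recall_fscore_support(
    toy_true, toy_pred, average=None)

# average='macro' takes the plain mean over classes, ignoring class sizes.
macro_p, macro_r, macro_f, _ = precision_recall_fscore_support(
    toy_true, toy_pred, average='macro')

print('per-class precision:', per_class_p, 'macro:', macro_p)
print('per-class recall:', per_class_r, 'macro:', macro_r)
```

Because every class counts equally, macro averaging keeps a small class from being hidden by a large one.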

## **Create Feature Vectors**

### Create Bag-of-Words feature vector

In [6]:
from sklearn.feature_extraction.text import CountVectorizer


count_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.5)

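# Note: the vocabulary is fit on the full corpus (train and test together).
# Fitting on X_train only would avoid leaking test-set vocabulary statistics.
count_vectorizer.fit(data.cut.str.join(sep=' ').values)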

X_train_count = count_vectorizer.transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
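
One detail worth checking for pre-segmented Chinese text: `CountVectorizer`'s default `token_pattern` keeps only tokens of two or more word characters, so single-character words are dropped before n-grams are formed. The sketch below uses two toy documents (illustrative, not rows from the dataset) and compares the default with a whitespace-based pattern; the `r"(?u)\S+"` pattern is an assumption for illustration, not the setting used in this lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy pre-segmented reviews (tokens joined by spaces), for illustration only.
toy_docs = ["酒店 床 很大", "酒店 服务 好"]

# Default token_pattern: single-character tokens such as 床 and 好 are dropped.
default_vec = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

# Whitespace-based pattern: every space-separated token is kept.
keep_all_vec = CountVectorizer(
    ngram_range=(1, 2), token_pattern=r"(?u)\S+").fit(toy_docs)

print(sorted(default_vec.vocabulary_))
print(sorted(keep_all_vec.vocabulary_))
```

Whether keeping single-character tokens helps on this corpus is an empirical question; the sketch only shows that the two settings produce different vocabularies.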

### Create TF-IDF feature vector

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer


tfidf_transformer = TfidfTransformer(smooth_idf=True, sublinear_tf=True)

# Note: IDF weights are also fit on the full corpus (train and test together).
tfidf_transformer.fit(count_vectorizer.transform(data.cut.str.join(sep=' ').values))

X_train_tfidf = tfidf_transformer.transform(X_train_count)
X_test_tfidf = tfidf_transformer.transform(X_test_count)
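
To make the `smooth_idf=True, sublinear_tf=True` settings concrete, the sketch below recomputes the weights by hand on a toy count matrix (the values are made up) and compares them with `TfidfTransformer`. It assumes the transformer's default L2 row normalisation, which the lesson's code also leaves unchanged.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: 3 documents x 2 terms (illustrative values).
toy_counts = np.array([[3, 0],
                       [1, 1],
                       [2, 0]])

toy_tfidf = TfidfTransformer(smooth_idf=True, sublinear_tf=True).fit_transform(
    csr_matrix(toy_counts)).toarray()

# Manual computation with the same settings.
n_docs = toy_counts.shape[0]
df = (toy_counts > 0).sum(axis=0)              # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf
tf = np.zeros(toy_counts.shape)
nonzero = toy_counts > 0
tf[nonzero] = 1 + np.log(toy_counts[nonzero])  # sublinear tf: 1 + ln(count)
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2 row normalisation

print(np.allclose(toy_tfidf, manual))
```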

## **Train a Logistic Regression Model**

In [8]:
import numpy as np
from sklearn.linear_model import LogisticRegression

### Using Bag-of-Words feature to train a logistic regression model

In [9]:
clf_count = LogisticRegression(random_state=seed, n_jobs=-1)

clf_count.fit(X_train_count, y_train)
        
metric_fn.get_metric(clf_count.predict_proba(X_test_count))

loss: 0.2629031947573188
accuracy: 0.9058333333333334
precision: 0.9069177350427351
recall: 0.906457475870617
f_score: 0.9058254200526572


### Using TF-IDF feature to train a logistic regression model

In [10]:
clf_tfidf = LogisticRegression(random_state=seed, n_jobs=-1)

clf_tfidf.fit(X_train_tfidf, y_train)

metric_fn.get_metric(clf_tfidf.predict_proba(X_test_tfidf))

loss: 0.3491885562397428
accuracy: 0.8875
precision: 0.8906265717383286
recall: 0.8885129407972076
f_score: 0.8874148574859738


## **Train a Naive Bayes Model**

In [11]:
from sklearn.naive_bayes import MultinomialNB

### Using Bag-of-Words feature to train a Multinomial Naive Bayes model

In [12]:
clf_mnb_count = MultinomialNB()

clf_mnb_count.fit(X_train_count, y_train)

metric_fn.get_metric(clf_mnb_count.predict_proba(X_test_count))

loss: 0.6085395752478586
accuracy: 0.8866666666666667
precision: 0.8876001104667219
recall: 0.8861437730490147
f_score: 0.8864534514811144
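
The Naive Bayes model on raw counts has accuracy close to the other models but a much higher log loss. One plausible explanation, not verified here, is that its predicted probabilities are very close to 0 or 1, so the few wrong predictions are penalised heavily. The sketch below uses made-up probabilities to show how two predictors with identical accuracy can differ sharply in log loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Made-up ground truth and two sets of predicted probabilities.
toy_true = np.array([0, 1, 1, 0])
moderate = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3]])
extreme = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.99, 0.01]])

# Both make the same hard predictions (the third sample is wrong in each).
acc_moderate = accuracy_score(toy_true, moderate.argmax(axis=1))
acc_extreme = accuracy_score(toy_true, extreme.argmax(axis=1))

# Log loss penalises the confident mistake far more than the hesitant one.
loss_moderate = log_loss(toy_true, moderate)
loss_extreme = log_loss(toy_true, extreme)

print('accuracy:', acc_moderate, acc_extreme)
print('log loss:', loss_moderate, loss_extreme)
```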


### Using TF-IDF feature to train a Multinomial Naive Bayes model

In [13]:
clf_mnb_tfidf = MultinomialNB()

clf_mnb_tfidf.fit(X_train_tfidf, y_train)

metric_fn.get_metric(clf_mnb_tfidf.predict_proba(X_test_tfidf))

loss: 0.31485330766091335
accuracy: 0.89
precision: 0.8900987042460348
recall: 0.8898038245732025
f_score: 0.8899116234977524


## **Grid Search**

In [14]:
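from sklearn.model_selection import GridSearchCV


# Note: clf_tfidf uses LogisticRegression's default solver ('lbfgs' in
# scikit-learn 0.22 and later), which does not support the 'l1' penalty.
# With the default error_score, those candidates fail to fit and are scored NaN,
# so the search effectively compares only the 'l2' settings.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10, 100],
    'max_iter': [100, 200, 300]
}
grid = GridSearchCV(estimator=clf_tfidf, param_grid=param_grid, n_jobs=-1, cv=5)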

In [15]:
grid.fit(X_train_tfidf, y_train)

metric_fn.get_metric(grid.predict_proba(X_test_tfidf))

loss: 0.24736313915278496
accuracy: 0.8966666666666666
precision: 0.8990568485183232
recall: 0.8975574644763792
f_score: 0.8966181346243097
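
The grid lists both `'l1'` and `'l2'` penalties, but a penalty can only be searched if the estimator's solver supports it. The sketch below is one way to search over both, assuming the `'liblinear'` solver and synthetic data from `make_classification`; the data, the `C` values, and the solver choice are illustrative and are not the configuration that produced the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, for illustration only.
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

# 'liblinear' supports both 'l1' and 'l2', so every candidate can be fit.
toy_grid = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear', random_state=0),
    param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
    cv=5,
)
toy_grid.fit(toy_X, toy_y)

print(toy_grid.best_params_)
# No NaN scores means no candidate failed to fit.
print(np.isnan(toy_grid.cv_results_['mean_test_score']).any())
```

Inspecting `cv_results_['mean_test_score']` for NaN values is a quick way to confirm that every parameter combination was actually evaluated.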
