# Sentiment Analysis: First Steps

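First, we have to convert categorical data, such as text, into numerical form before we can pass it to a machine learning algorithm. We use the bag-of-words model for this. The idea behind it is simple and can be summarized as follows:
1. We create a **vocabulary** of unique **tokens** (for example, words) from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in that particular document.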

### Transforming words into feature vectors

To build a bag-of-words model, we can use the `CountVectorizer` class implemented in scikit-learn, as follows.

In [1]:
# Import the required packages
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


In [None]:
# Build the bag-of-words model
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])

bag = count.fit_transform(docs)
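
To make the contents of `bag` concrete, the counting step can be reproduced with the standard library alone. This is a minimal sketch, not scikit-learn's implementation: it assumes lowercasing and whitespace splitting, which is enough for these three sentences, and the names `tokenized`, `vocabulary`, and `vectors` are illustrative.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Lowercase and split on whitespace (an assumption that suffices for these sentences).
tokenized = [doc.lower().split() for doc in docs]

# Step 1: vocabulary of unique tokens, sorted alphabetically and mapped to column indices.
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: idx for idx, token in enumerate(unique_tokens)}

# Step 2: one count vector per document, with columns in vocabulary order.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```

The same information is available from the fitted vectorizer through `count.vocabulary_` and `bag.toarray()`.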

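### Assessing word relevancy via tf-idf

This technique can be used to downweight words that occur frequently across documents and therefore carry little useful or discriminative information in the feature vectors. scikit-learn implements it in `TfidfTransformer`.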

In [2]:
# Compute tf-idf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
np.set_printoptions(precision=2)

print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.56 0.56 0.   0.43 0.  ]
 [0.   0.43 0.   0.   0.56 0.43 0.56]
 [0.4  0.48 0.31 0.31 0.31 0.48 0.31]]
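
The numbers above can be checked by hand. The sketch below assumes the default `TfidfTransformer` settings: the smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The count matrix is typed in directly, with columns in alphabetical vocabulary order, and the variable names are illustrative.

```python
import math

# Raw term counts for the three example sentences; columns follow the
# alphabetical vocabulary: and, is, shining, sun, sweet, the, weather.
counts = [
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default, smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf_rows.append([round(v / norm, 2) for v in weighted])

for row in tfidf_rows:
    print(row)
```

Words such as "is" and "the", which appear in every document, get the smallest idf value (exactly 1), so their weight relative to rarer words like "and" shrinks.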


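# Cleaning the text data

Before building the bag-of-words model, we clean the text data by removing all unwanted characters. To illustrate why, here are the last 60 characters of one review from the dataset: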

In [3]:
# Load the movie review file
df = pd.read_csv('movie_data.csv')

df.loc[49941, 'review'][-60:]

'If only I could understand the right meaning of the lyrics:('

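Here we remove all HTML markup, punctuation, and other non-word characters, keeping only the emoticons, which are certainly useful for sentiment analysis. We can do this with Python's regular expression library, `re`.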

In [4]:
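# Remove HTML markup, punctuation, and other non-word characters, keeping only emoticons
def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and collapse every run of non-word characters into one space
    text = re.sub(r'[\W]+', ' ', text.lower())
    # Append the emoticons, dropping the "nose" character for consistency
    text = text + " ".join(emoticons).replace('-', '')
    return text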

preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
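
To see how the result above is put together, the same regular expressions can be applied one at a time. This is only an illustrative walk-through of the test string; the intermediate names `no_html`, `words`, and `result` are not part of the notebook.

```python
import re

raw = "</a>This :) is :( a test :-)!"

# 1. Drop anything that looks like an HTML tag.
no_html = re.sub(r'<[^>]*>', '', raw)

# 2. Find emoticons: eyes (: ; =), an optional nose (-), and a mouth ( ) ( D P ).
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# 3. Lowercase and replace each run of non-word characters with a single space.
words = re.sub(r'[\W]+', ' ', no_html.lower())

# 4. Append the emoticons with the nose removed.
result = words + " ".join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words))
print(repr(result))
```

Note that `words` ends in a space here only because the input ends in non-word characters. For a text that ends in a word character, the first emoticon would be appended directly to the last word.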

In [5]:
# Apply the cleaning to every review
df['review'] = df['review'].apply(preprocessor)

print(df.tail(10))

                                                  review  sentiment
49959  my first attempt at watching this ended in 8 m...          0
49960  pepe le moko played by charles boyer is some s...          0
49961  awful simply awful it proves my theory about s...          0
49962  i attended camp chesapeake it was located at t...          1
49963  i was impressed that i could take my 5 year ol...          1
49964  this movie is terrible it s about some no brai...          0
49965  well what was fun except for the fun part it s...          0
49966  by the time this film was released i had seen ...          0
49967  well if you like pop punk punk ska and a tad b...          0
49968  where this movie is faithful to burroughs visi...          1


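# Processing documents into tokens

Now that we have prepared the dataset, we need to think about how to split the text into individual elements. One way is to split the cleaned documents at whitespace: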

In [6]:
# Split the text into individual words
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

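Another strategy is **word stemming**, the process of transforming a word into its root form, which lets us map related words to the same stem. We will use the **Porter stemming algorithm** implemented in the `nltk` package.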

In [7]:
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

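Before we train a machine learning model on the bag-of-words representation, let us remove the extremely common words, called stop-words, that add little useful information to the text.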

In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

In [10]:
X = df.review
y = df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                            test_size=0.5, random_state=0)

Next, we use `GridSearchCV` with 3-fold cross-validation to tune the hyperparameters of our logistic regression model:

In [11]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, 
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1,1)],
              'vect__stop_words': [stop, None],
              'vect__tokenizer': [tokenizer, tokenizer_porter],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1,1)],
              'vect__stop_words': [stop, None],
              'vect__tokenizer': [tokenizer, tokenizer_porter],
              'vect__use_idf': [False],
              'vect__norm': [None],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]}]

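# solver='liblinear' supports both the 'l1' and 'l2' penalties searched in the grid
lr_tfidf = Pipeline([('vect', tfidf), 
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])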

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, 
                           scoring='accuracy', cv=3, verbose=1, n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

print('Best parameter set: %s' % gs_lr_tfidf.best_params_)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed: 28.4min finished


Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x00000277C3CDB0D8>}
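
The "48 candidates" and "144 fits" reported in the log follow directly from the size of the parameter grid. The sketch below recounts them with the standard library; the stop-word list and the two tokenizers are replaced by placeholders, since only the number of options per parameter matters.

```python
from math import prod

# Placeholders for the notebook objects; only the number of options is relevant.
stop = ['a', 'the']                    # stands in for the NLTK stop-word list
tokenizer = 'tokenizer'                # stands in for the whitespace tokenizer
tokenizer_porter = 'tokenizer_porter'  # stands in for the Porter tokenizer

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

# Each dictionary contributes the product of its option counts.
per_grid = [prod(len(options) for options in grid.values()) for grid in param_grid]
candidates = sum(per_grid)
cv = 3  # number of cross-validation folds
total_fits = candidates * cv

print(per_grid, candidates, total_fits)  # [24, 24] 48 144
```

Because every added option multiplies the number of fits, widening the grid (for example with more `ngram_range` values) increases the run time proportionally.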


In [13]:
print("Best score: %0.3f" % gs_lr_tfidf.best_score_)

Best score: 0.891


In [22]:
# Build the confusion matrix and compute the accuracy
print('Confusion Matrix')
preds = gs_lr_tfidf.predict(X_test)
print(pd.crosstab(preds, y_test))
print(gs_lr_tfidf.score(X_test, y_test))

Confusion Matrix
sentiment      0      1
row_0                  
0          11057   1175
1           1348  11405
0.8990194116469882
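
The accuracy printed above can be recovered from the confusion matrix, along with precision and recall for the positive class. This is a small sketch with the four counts copied from the crosstab, where rows are predictions and columns are the true `sentiment` labels. The variable names are illustrative.

```python
# Counts copied from the crosstab: rows are predictions, columns are true labels.
tn, fn = 11057, 1175   # predicted 0: true 0, true 1
fp, tp = 1348, 11405   # predicted 1: true 0, true 1

total = tn + fn + fp + tp

accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of true positives that were found

print(f"n={total} accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The accuracy matches the value returned by `gs_lr_tfidf.score(X_test, y_test)`, and the errors are split fairly evenly between false positives and false negatives.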


In [23]:
print(gs_lr_tfidf.predict(X_test[0:10]))
print(y_test[0:10])

[0 0 1 1 1 0 1 1 1 0]
38954    0
30836    0
43372    1
39814    1
49826    1
16166    0
20583    1
10278    1
22531    1
29521    0
Name: sentiment, dtype: int64


In [25]:
print(X_test)

38954    this movie is such a total waste of time i can...
30836    this was without a doubt the worst movie i hav...
43372    when dirty dancing was on tv in the middle of ...
39814    i don t give a movie or a show ten very often ...
49826    though derivative labyrinth still stands as th...
                               ...                        
3269     this is one of the dumbest ideas for a movie r...
20040    first of all the big named actors must need th...
42       end of days is one of the worst big budget act...
45973    i got a free pass to a preview of this movie l...
32784    i don t know what the last reviewer is talking...
Name: review, Length: 24985, dtype: object


In [27]:
print(X[38954])

this movie is such a total waste of time i can t understand anyone sitting through this piece of trash oh i would have loved it when i was seven years old i think a seven year old child may have written and directed it there s no script no acting just rubbish the best acting is that by the fighting roosters i think i could whip these ninjas and i am not someone you d consider tough totally unconvincing and did not spark the least bit of interest i was yawning and laughing by the end of the first ten minutes of the film this is one that would turn people away from martial art movies great comedy bad action flick 
