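## I. Background

Sentiment analysis is an important task in natural language processing. It examines the sentiment-bearing words and tone of a text to determine the attitude the text expresses, usually classified as positive, negative, or neutral. Accurate sentiment analysis is useful in scenarios such as analyzing social media comments and customer feedback. This experiment applies sentiment analysis to texts from the software development domain.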



## II. Objective
Build a text sentiment analysis model with machine learning methods that classifies a text as positive, negative, or neutral.

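## III. Data

The experiment uses the Senti4SD dataset, which contains 4,423 texts from Stack Overflow (the Post Type column lists answers as well as question and answer comments). Each text carries a manually annotated sentiment label of positive, negative, or neutral; the final label is the majority vote of three raters (columns r1, r2, and r3).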

## IV. Procedure


1. Import the required libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

2. Load and preprocess the data

In [2]:
# pandas was already imported in In [1]; reading .xlsx files also requires the openpyxl package
data = pd.read_excel('data/Senti4SD_GoldStandard_EmotionPolarity.xlsx')
data

| | Study | Label | Stack Overflow ID | Post Type | Text | Final Label (majority voting) | r1 | r2 | r3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pilot | pA1 | | | Vineet, what you are trying to do is a terribl... | negative | Negative | Negative | Negative |
| 1 | Pilot | pA2 | | | 'Course I do, corrected. | positive | Positive | Positive | Positive |
| 2 | Pilot | pA3 | | | Excellent, happy to help! If you don't mind, c... | positive | Neutral | Positive | Positive |
| 3 | Pilot | pA4 | | | @DrabJay: excellent suggestion! Code changed. :-) | positive | Positive | Positive | Positive |
| 4 | Pilot | pA5 | | | Any decent browser should protect against mali... | neutral | Neutral | Neutral | Neutral |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4418 | Extended Study | tD496 | 923922.0 | QuestionComment | Yes - that feature is extremely useful for wri... | positive | Positive | Positive | Positive |
| 4419 | Extended Study | tD497 | 38466121.0 | AnswerComment | Works great! And you can add "desc" after the ... | positive | Positive | Positive | Positive |
| 4420 | Extended Study | tD498 | 288982.0 | Answer | Yeah, I didn't know about the non-greedy thing... | positive | Positive | Neutral | Positive |
| 4421 | Extended Study | tD499 | 863762.0 | QuestionComment | Fortunately I'm doing \*very\* little with Offic... | positive | Positive | Positive | Positive |


In [3]:
# Encode the string labels as integers: positive -> 1, neutral -> 0, negative -> -1
data['Final Label (majority voting)'] = data['Final Label (majority voting)'].replace({'positive': 1, 'neutral': 0, 'negative': -1})
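
The label encoding above can be checked on a small stand-in before it is applied to the full column. The sketch below is only an illustration: the `labels` Series and the `label_to_id` name are hypothetical, and it uses `Series.map` instead of `replace` so that any label outside the expected three becomes NaN and is caught early.

```python
import pandas as pd

# Hypothetical stand-in for the 'Final Label (majority voting)' column.
labels = pd.Series(["positive", "neutral", "negative", "positive"])

label_to_id = {"positive": 1, "neutral": 0, "negative": -1}
encoded = labels.map(label_to_id)

# map() yields NaN for any label missing from the dictionary,
# so an unexpected spelling is caught here instead of reaching the model.
assert encoded.notna().all(), "found a label outside the expected three"

encoded = encoded.astype(int)
print(encoded.tolist())
# Class counts, ordered by label id
print(encoded.value_counts().sort_index().to_dict())
```

Printing the class counts this way on the real column is also a quick way to see how balanced the three classes are.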

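3. Split into training and test sets

In [4]:
X = data['Text']
y = data['Final Label (majority voting)']
# Hold out 20% of the texts for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)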
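
The split above is purely random. As an optional variation that the experiment itself does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. A minimal sketch on hypothetical toy data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 negative, 20 neutral, 20 positive texts.
texts = [f"text {i}" for i in range(50)]
labels = [-1] * 10 + [0] * 20 + [1] * 20

# stratify=labels keeps the 1:2:2 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

Whether stratification would change the scores reported in this experiment is not something this sketch can show.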

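4. Feature engineering: text vectorization

Use CountVectorizer to convert each text into a vector of word counts:

In [5]:
vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts only, then count words
X_train_vectorized = vectorizer.fit_transform(X_train)
# Reuse the same vocabulary for the test texts; words not seen in training are ignored
X_test_vectorized = vectorizer.transform(X_test)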

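5. Build the naive Bayes classifier

In [6]:
# Multinomial naive Bayes suits word-count features; default smoothing alpha=1.0
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_vectorized, y_train)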
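
For readers who want to see what `MultinomialNB` learns, the sketch below recomputes its smoothed word likelihoods by hand on a hypothetical 4-document count matrix and compares them with the fitted attributes. The data and variable names are assumptions for illustration; the smoothing formula, (count + alpha) / (class total + alpha × vocabulary size), is the one described in the scikit-learn documentation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary terms.
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([1, 1, -1, -1])

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Recompute the smoothed log-likelihoods log P(word | class) by hand.
manual_log_prob = []
for c in clf.classes_:
    word_counts = X[y == c].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])
    manual_log_prob.append(np.log(smoothed))
manual_log_prob = np.array(manual_log_prob)

print(np.allclose(manual_log_prob, clf.feature_log_prob_))

# Prediction = class with the largest log prior + count-weighted log-likelihood.
x_new = np.array([1, 1, 0])
scores = clf.class_log_prior_ + manual_log_prob @ x_new
manual_pred = clf.classes_[np.argmax(scores)]
print(manual_pred, clf.predict([x_new])[0])
```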

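6. Evaluate the model

In [7]:
# Predict labels for the held-out test set
y_pred = naive_bayes.predict(X_test_vectorized)

# Overall accuracy, then per-class precision, recall, and F1
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))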


Accuracy: 0.7197740112994351
Classification Report:
              precision    recall  f1-score   support

          -1       0.65      0.71      0.68       233
           0       0.72      0.60      0.66       356
           1       0.77      0.86      0.82       296

    accuracy                           0.72       885
   macro avg       0.71      0.73      0.72       885
weighted avg       0.72      0.72      0.72       885
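
The per-class recall values in a report like this come from the confusion matrix. The sketch below uses hypothetical true and predicted labels (not the experiment's data) to show how `confusion_matrix` relates to recall: each row is a true class, and recall is the diagonal entry divided by the row sum.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels, using the same -1 / 0 / 1 encoding.
y_true = [-1, -1, 0, 0, 0, 1, 1, 1]
y_hat = [-1, 0, 0, 0, 1, 1, 1, -1]

labels = [-1, 0, 1]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Per-class recall is the diagonal divided by the row sums.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, per_class_recall.round(2))))

# Cross-check against scikit-learn's own recall computation.
sk_recall = recall_score(y_true, y_hat, labels=labels, average=None)
print(np.allclose(per_class_recall, sk_recall))
```

Applied to `y_test` and `y_pred`, the same matrix would show which classes the model confuses most often.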



7. Predict an unseen text

In [8]:
new_text = "You do a good job!"
# transform expects a list of documents, so wrap the single text in a list
new_text_vectorized = vectorizer.transform([new_text])
predicted_sentiment = naive_bayes.predict(new_text_vectorized)[0]
print("Predicted Sentiment for new text:", predicted_sentiment)

Predicted Sentiment for new text: 1
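
The prediction is a numeric id, so it can help to map it back to a label name and to look at the class probabilities. The sketch below is self-contained and therefore trains on a hypothetical six-sentence corpus rather than on Senti4SD; `train_texts`, `id_to_label`, and the pipeline wrapper are assumptions added for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini training set; the real experiment trains on Senti4SD.
train_texts = [
    "great answer thanks", "good job",
    "terrible idea", "this is bad",
    "use a list here", "the function returns an int",
]
train_labels = [1, 1, -1, -1, 0, 0]

# The pipeline applies vectorization and classification in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

id_to_label = {1: "positive", 0: "neutral", -1: "negative"}

text = "You do a good job!"
pred = int(model.predict([text])[0])
proba = model.predict_proba([text])[0]

print(id_to_label[pred])
# Probabilities are listed in the order of model.classes_
for cls, p in zip(model.classes_, proba):
    print(id_to_label[int(cls)], round(float(p), 3))
```

With the experiment's own objects, the equivalent calls would be `naive_bayes.predict_proba(new_text_vectorized)` and `naive_bayes.classes_`.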


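## V. Results
- The model reaches an accuracy of 71.98% on the 885-text test set.
- Recall is 86% for the positive class, 71% for the negative class, and 60% for the neutral class.
- The unseen text "You do a good job!" was classified as 1 (positive), as expected.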