# Midterm Assessment No. 2
Topic: Text processing methods. Solving a classification problem.

Completed by:
Zhang Miao

Group:
ИУ5И-23М

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Load the dataset
data = pd.read_csv('BilibiliCommentsDataSet.csv')
# Drop rows with a missing comment text or sentiment label
data_clean = data.dropna(subset=['Comment', 'Sentiment'])

# Split into train and test sets (80/20, stratified by label)
X = data_clean['Comment']
y = data_clean['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
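
The `stratify=y` argument keeps the class proportions the same in both splits, which matters here because the sentiment classes are imbalanced. A minimal sketch of that effect, assuming hypothetical toy labels with a 10/20/70 imbalance (the names `toy_texts` and `toy_labels` are illustrative and not part of the notebook):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 comments with a 10/20/70 class imbalance
toy_labels = ["negative"] * 10 + ["neutral"] * 20 + ["positive"] * 70
toy_texts = [f"comment {i}" for i in range(len(toy_labels))]

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Share of each class in the two splits
train_share = {k: v / len(toy_y_train) for k, v in Counter(toy_y_train).items()}
test_share = {k: v / len(toy_y_test) for k, v in Counter(toy_y_test).items()}
print("train:", train_share)
print("test: ", test_share)
```

Both splits should report the same class shares as the full toy set. Without `stratify`, a random split of a small or strongly imbalanced dataset can leave the rarest class under-represented in the test set.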

In [3]:
# Vectorization: fit on the training texts only, then transform the test texts
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [4]:
# Helper: fit a model, predict on the test set, return accuracy and a per-class report
def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report

In [5]:
# Train and evaluate each vectorizer/classifier combination
results = {}

# CountVectorizer + LogisticRegression
model_lr = LogisticRegression(max_iter=1000, random_state=42)
acc, report = train_and_evaluate(model_lr, X_train_count, y_train, X_test_count, y_test)
results['CountVectorizer + LogisticRegression'] = (acc, report)

# CountVectorizer + LinearSVC
model_svc = LinearSVC(random_state=42, max_iter=10000)
acc, report = train_and_evaluate(model_svc, X_train_count, y_train, X_test_count, y_test)
results['CountVectorizer + LinearSVC'] = (acc, report)

# TfidfVectorizer + LogisticRegression
model_lr_tfidf = LogisticRegression(max_iter=1000, random_state=42)
acc, report = train_and_evaluate(model_lr_tfidf, X_train_tfidf, y_train, X_test_tfidf, y_test)
results['TfidfVectorizer + LogisticRegression'] = (acc, report)

# TfidfVectorizer + LinearSVC
model_svc_tfidf = LinearSVC(random_state=42, max_iter=10000)
acc, report = train_and_evaluate(model_svc_tfidf, X_train_tfidf, y_train, X_test_tfidf, y_test)
results['TfidfVectorizer + LinearSVC'] = (acc, report)
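
The four combinations above repeat the same fit-and-score pattern. As one possible alternative, not the approach used in this notebook, the same grid can be written as a loop over scikit-learn pipelines. The sketch below assumes a tiny hypothetical dataset (`toy_train_texts`, `toy_test_texts` and their labels are illustrative), so its accuracies say nothing about the real data.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, for illustration only
toy_train_texts = ["great video", "love this song", "terrible audio",
                   "bad editing", "great song", "terrible video"]
toy_train_labels = ["positive", "positive", "negative",
                    "negative", "positive", "negative"]
toy_test_texts = ["great editing", "bad audio"]
toy_test_labels = ["positive", "negative"]

# Factories, so every combination gets fresh, unfitted objects
vectorizers = {"CountVectorizer": CountVectorizer,
               "TfidfVectorizer": TfidfVectorizer}
classifiers = {
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000, random_state=42),
    "LinearSVC": lambda: LinearSVC(random_state=42, max_iter=10000),
}

toy_results = {}
for (vec_name, make_vec), (clf_name, make_clf) in product(
    vectorizers.items(), classifiers.items()
):
    # The pipeline fits the vectorizer on the training texts only
    pipeline = make_pipeline(make_vec(), make_clf())
    pipeline.fit(toy_train_texts, toy_train_labels)
    toy_pred = pipeline.predict(toy_test_texts)
    toy_results[f"{vec_name} + {clf_name}"] = accuracy_score(toy_test_labels, toy_pred)

for name, acc in toy_results.items():
    print(f"{name}: Accuracy = {acc:.4f}")
```

Bundling the vectorizer and classifier into one pipeline keeps the vocabulary tied to the training split and makes it straightforward to add further combinations.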

In [6]:
# Print the results
for k, v in results.items():
    print(f"{k}: Accuracy = {v[0]:.4f}")
    print(v[1])
    print("-"*80)

CountVectorizer + LogisticRegression: Accuracy = 0.7610
              precision    recall  f1-score   support

    negative       0.62      0.42      0.50       467
     neutral       0.61      0.63      0.62       925
    positive       0.84      0.88      0.86      2281

    accuracy                           0.76      3673
   macro avg       0.69      0.64      0.66      3673
weighted avg       0.75      0.76      0.75      3673

--------------------------------------------------------------------------------
CountVectorizer + LinearSVC: Accuracy = 0.7441
              precision    recall  f1-score   support

    negative       0.54      0.45      0.49       467
     neutral       0.59      0.59      0.59       925
    positive       0.84      0.86      0.85      2281

    accuracy                           0.74      3673
   macro avg       0.66      0.64      0.65      3673
weighted avg       0.74      0.74      0.74      3673

------------------------------------------------------

**Conclusion:**
- Every combination that uses TfidfVectorizer or LogisticRegression reaches similar quality (accuracy ≈ 0.76).
- CountVectorizer + LinearSVC performs slightly worse (accuracy ≈ 0.74).
- TfidfVectorizer + LinearSVC gave slightly more balanced results across the classes.
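
Because the classes are imbalanced, accuracy alone can hide weak performance on the minority classes. One way to put a number on how balanced a model is, offered as a sketch and not as part of the original evaluation, is to report macro-averaged F1 and balanced accuracy next to plain accuracy. The labels and predictions below are hypothetical.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels and predictions: the model does well on the majority
# class ("positive") and misses half of each minority class
toy_true = ["negative", "negative", "neutral", "neutral",
            "positive", "positive", "positive", "positive"]
toy_pred = ["negative", "positive", "neutral", "positive",
            "positive", "positive", "positive", "positive"]

acc = accuracy_score(toy_true, toy_pred)
# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(toy_true, toy_pred, average="macro")
# Balanced accuracy: unweighted mean of the per-class recalls
bal_acc = balanced_accuracy_score(toy_true, toy_pred)

print(f"accuracy          = {acc:.4f}")
print(f"macro F1          = {macro_f1:.4f}")
print(f"balanced accuracy = {bal_acc:.4f}")
```

Both macro F1 and balanced accuracy weight every class equally, so they drop below accuracy when the minority classes are predicted poorly. Comparing these values across the four combinations would make the claim about balance directly checkable.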