# Part 1 (classical machine learning models)
**Title:** Comparing classical and modern approaches to fake news detection

**Description**: The project analyzes how well different machine learning methods perform on fake news classification. The work compares:
- Classical models (Logistic Regression, Random Forest)
- A modern NLP model (LSTM)

**Goal:** identify the best approach in terms of accuracy, speed, and interpretability of results.

# **1. Loading the data and building the dataset**

In [55]:
import pandas as pd
data = pd.read_csv('FakeReal.csv')

In [56]:
# we only need the text and label columns
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [57]:
data_title_real = data[['text', 'label']]

In [58]:
data_title_real.sample(5)

Unnamed: 0,text,label
61459,Friday at a Boeing manufacturing facility in N...,0
49580,(Reuters) - It could take years to learn how l...,0
49317,Read more: Daily Mail,1
14676,EU NATO Secretary General Jens Stoltenberg spe...,1
70845,WASHINGTON (Reuters) - U.S. House Speaker Paul...,0


Let's check whether the classes are balanced.

In [59]:
data_title_real['label'].value_counts()

label
1    37106
0    35028
Name: count, dtype: int64

The classes are nearly the same size, so we take 15,000 examples of each.

In [60]:
count_real, count_fake = 0, 0
temp = []

for i in range(len(data_title_real)):
    row = data_title_real.iloc[i]
    if row['label'] == 1 and count_real < 15000:
        temp.append({'news': row['text'], 'real': 1})
        count_real += 1
    elif row['label'] == 0 and count_fake < 15000:
        temp.append({'news':  row['text'], 'real': 0})
        count_fake += 1
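
The row-by-row loop above can also be expressed with pandas group operations. The following is a minimal sketch, not the code that produced the results in this notebook; the helper name `take_first_per_class` and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def take_first_per_class(df, label_col, n_per_class):
    """Keep the first n_per_class rows of every class, preserving row order."""
    return df.groupby(label_col, sort=False).head(n_per_class)


# Toy stand-in for data_title_real (hypothetical values)
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'label': [1, 1, 0, 1, 0, 0],
})

balanced = take_first_per_class(toy, 'label', 2)

# Rename to the column names used in the notebook and shuffle reproducibly
balanced = (balanced.rename(columns={'text': 'news', 'label': 'real'})
                    .sample(frac=1, random_state=42)
                    .reset_index(drop=True))
print(balanced)
```

Passing a fixed `random_state` to `sample` makes the shuffle repeatable, which `random.shuffle` without a seed does not.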

In [61]:
import random
random.shuffle(temp)
dataset = pd.DataFrame(temp)
dataset.head(10)

Unnamed: 0,news,real
0,It s certainly no secret that Jefferson Beaure...,1
1,"\nNow that the election is over, Donald Trum...",1
2,WASHINGTON — President Obama had “an intens...,0
3,When will government officials who allowed thi...,1
4,You may have read or heard about the study deb...,0
5,ELECTION EVE BOMBSHELL : Wikileaks Reveals Ana...,1
6,America s village idiot got her ass handed to ...,1
7,"For the second time in the last ten days, slim...",1
8,WASHINGTON (Reuters) - U.S. Rep. Kevin Brady o...,0
9,American pride and American exceptionalism is ...,1


Let's check that the final dataset has no missing values.

In [62]:
dataset.isnull().sum()

news    14
real     0
dtype: int64

In [63]:
len(dataset['news'])

30000

In [64]:
dataset = dataset[dataset['news'].str.len() > 0]

In [65]:
# the dataset got slightly smaller
len(dataset['news'])

29986

**Cleaning the text**

In [66]:
# remove the Reuters tag and city names
dataset.head()

Unnamed: 0,news,real
0,It s certainly no secret that Jefferson Beaure...,1
1,"\nNow that the election is over, Donald Trum...",1
2,WASHINGTON — President Obama had “an intens...,0
3,When will government officials who allowed thi...,1
4,You may have read or heard about the study deb...,0


In [67]:
def extract_after_dash(text):
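    parts = str(text).split('-', 1)  # Split on the first occurrence of '-'
    # Return the part after the dash, or the original text if there is no '-'.
    # Note: the first '-' may sit inside a hyphenated word, in which case the text is cut there.
    return parts[1].strip() if len(parts) > 1 else text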

In [68]:
dataset['cleaned_text'] = dataset['news'].apply(extract_after_dash)

In [69]:
dataset.head()

Unnamed: 0,news,real,cleaned_text
0,It s certainly no secret that Jefferson Beaure...,1,wing voters. A BBC report from 2012 showed tha...
1,"\nNow that the election is over, Donald Trum...",1,"right, as Bannon has bragged. In other words, ..."
2,WASHINGTON — President Obama had “an intens...,0,WASHINGTON — President Obama had “an intens...
3,When will government officials who allowed thi...,1,year old man is on suspicion of failure to dis...
4,You may have read or heard about the study deb...,0,You may have read or heard about the study deb...
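
The output above shows that splitting on the first `-` also cuts texts at hyphenated words (rows 0, 1 and 3 now start mid-sentence). A narrower alternative is sketched below as an assumption, not as the method used for the results in this notebook: it removes only a leading `CITY (Reuters) - ` marker. The name `strip_reuters_dateline` and the 60-character prefix limit are hypothetical choices, and datelines that use a long dash instead of a hyphen are left untouched.

```python
import re

# Optional short prefix (e.g. a city name), then "(Reuters)", then a hyphen.
# The pattern is anchored at the start, so hyphens later in the text are ignored.
DATELINE_RE = re.compile(r'^\s*[^()]{0,60}\(Reuters\)\s*-\s*')


def strip_reuters_dateline(text):
    """Remove a leading '<CITY> (Reuters) - ' marker; leave other texts unchanged."""
    return DATELINE_RE.sub('', str(text), count=1)


samples = [
    'WASHINGTON (Reuters) - U.S. House Speaker Paul',
    '(Reuters) - It could take years',
    'A report on right-wing voters',
]
for s in samples:
    print(repr(strip_reuters_dateline(s)))
```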


In [70]:
dataset = dataset[['cleaned_text', 'real']]
dataset.head()

Unnamed: 0,cleaned_text,real
0,wing voters. A BBC report from 2012 showed tha...,1
1,"right, as Bannon has bragged. In other words, ...",1
2,WASHINGTON — President Obama had “an intens...,0
3,year old man is on suspicion of failure to dis...,1
4,You may have read or heard about the study deb...,0


In [71]:
dataset.to_csv('full_dataset.csv', index=False)

# 2. Text preprocessing functions

We experiment with **four preprocessing options**: *no preprocessing, removal of stop words and special characters, stemming, and lemmatization*

**Vectorization**: *TfidfVectorizer* and *CountVectorizer*
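
To make the difference between the two vectorizers concrete, here is a small sketch on a made-up three-document corpus (the documents are an assumption for illustration only). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` reweights them by inverse document frequency and, with default settings, L2-normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = ['fake news spreads fast', 'real news spreads slowly', 'news news news']

count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs)          # raw term counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # IDF-weighted, L2-normalized rows

# Both vectorizers tokenize the same way, so the matrices have the same shape
print(count_matrix.shape, tfidf_matrix.shape)

# Column of the term 'news' and its values in the third document
news_col = count_vec.vocabulary_['news']
print(count_matrix[2, news_col], tfidf_matrix[2, news_col])
```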

In [72]:
import nltk
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')
stop_words = set(stopwords.words("english"))  # a set keeps the membership test in stopwords_special fast

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\arina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [73]:
def stopwords_special(text):
  text = text.lower()
  tokens = word_tokenize(text)
  tokens = [word for word in tokens if word not in stop_words]  # remove stop words
  tokens = word_tokenize(re.sub(r'[^a-zA-Zа-яА-Я ]', '', ' '.join(tokens)))  # remove special characters, digits and punctuation
  return tokens

In [74]:
nltk.download('wordnet')
nltk.download('omw-1.4')

# stemming
def stemming(text):
  tokens = stopwords_special(text)
  stemmer = nltk.PorterStemmer()  # initialize the stemmer
  stemmed_tokens = [stemmer.stem(token) for token in tokens]  # apply the stemming algorithm to each token

  return stemmed_tokens

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\arina\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [75]:
def lemma(text):
  tokens = stopwords_special(text)
  lemmatizer = nltk.WordNetLemmatizer()
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

  return lemmatized_tokens
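
As a brief illustration of what stemming does to individual tokens, the sketch below runs NLTK's `PorterStemmer` on a few words chosen for this example (the word list is an assumption, not taken from the dataset). Stemming strips suffixes by rule, so the result is not always a dictionary word, whereas the lemmatizer above maps tokens to WordNet lemmas.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample tokens
words = ['running', 'studies', 'voters']
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, '->', stem)
```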

# 3. Splitting the dataset into train/test

In [76]:
from sklearn.model_selection import train_test_split

In [77]:
text = dataset['cleaned_text'].tolist()
real = dataset['real'].tolist()

In [78]:
train_texts, test_texts, train_labels, test_labels = train_test_split(text, real, test_size=0.2, random_state=42)
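
The split above is random with respect to the labels. If exact class proportions in both parts matter, `train_test_split` accepts a `stratify` argument; the sketch below shows it on synthetic data (the toy texts and labels are assumptions for illustration, not the notebook's dataset).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 examples per class
texts = [f'doc {i}' for i in range(100)]
labels = [1] * 50 + [0] * 50

tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, both parts keep the 50/50 class ratio
print(Counter(tr_y), Counter(te_y))
```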

# 4. Training Logistic Regression and tuning hyperparameters

In [79]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [80]:
train_texts_stop = [' '.join(stopwords_special(i)) for i in train_texts]
test_texts_stop = [' '.join(stopwords_special(i)) for i in test_texts]

In [81]:
train_texts_stem = [' '.join(stemming(i)) for i in train_texts]
test_texts_stem = [' '.join(stemming(i)) for i in test_texts]

In [82]:
train_texts_lemma = [' '.join(lemma(i)) for i in train_texts]
test_texts_lemma = [' '.join(lemma(i)) for i in test_texts]

In [111]:
res = []

for vect in ['tfidfVectorizer', 'countVectorizer']:
    for preprocess in ['nothing', 'stopwords', 'stem', 'lemma']:
        # for p in [None, 'l1', 'l2']:
        for C in [0.01, 0.001, 0.1, 1]:
            if vect == 'tfidfVectorizer':
                vectorizer = TfidfVectorizer()
            else:
                vectorizer = CountVectorizer()

            if preprocess == 'nothing':
                X_train = train_texts
                X_test = test_texts
            elif preprocess == 'stopwords':
                X_train = train_texts_stop
                X_test = test_texts_stop
            elif preprocess == 'stem':
                X_train = train_texts_stem
                X_test = test_texts_stem
            elif preprocess == 'lemma':
                X_train = train_texts_lemma
                X_test = test_texts_lemma

            X_train_vect = vectorizer.fit_transform(X_train)
            X_test_vect = vectorizer.transform(X_test)

            lr = LogisticRegression(C=C)
            lr.fit(X_train_vect, train_labels)

            y_pred = lr.predict(X_test_vect)
            acc = accuracy_score(test_labels, y_pred)
            f1 = f1_score(test_labels, y_pred)
            res.append([vect, preprocess, C, round(acc, 3), round(f1, 3)])
            if len(res) % 5 == 0:
                print(len(res))

5
10
15


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

20


25


30



In [112]:
res[:5]

[['tfidfVectorizer', 'nothing', 0.01, 0.832, 0.829],
 ['tfidfVectorizer', 'nothing', 0.001, 0.799, 0.794],
 ['tfidfVectorizer', 'nothing', 0.1, 0.88, 0.88],
 ['tfidfVectorizer', 'nothing', 1, 0.925, 0.925],
 ['tfidfVectorizer', 'stopwords', 0.01, 0.855, 0.859]]

In [113]:
res_lr = pd.DataFrame(res, columns=["Vectorizer", "Preprocess", "C", "accuracy", "f1 score"])

In [114]:
res_lr_sort_f1 = res_lr.sort_values("f1 score", ascending=False)
res_lr_sort_f1.head()

Unnamed: 0,Vectorizer,Preprocess,C,accuracy,f1 score
18,countVectorizer,nothing,0.1,0.934,0.934
19,countVectorizer,nothing,1.0,0.934,0.933
16,countVectorizer,nothing,0.01,0.93,0.93
30,countVectorizer,lemma,0.1,0.929,0.929
22,countVectorizer,stopwords,0.1,0.929,0.929


**Best hyperparameters for LogisticRegression:**
- Vectorizer: countVectorizer
- Preprocess: nothing
- C = 0.1
- accuracy = 0.934
- **f1_score = 0.934**
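
The search above selects the configuration by its score on the test split. One possible alternative, sketched here under assumptions and not used for the numbers reported in this notebook, is to run the same kind of search with cross-validation on the training data only, using scikit-learn's `Pipeline` and `GridSearchCV`. The toy corpus is made up for illustration, and `max_iter=1000` is an assumed value that may help with the convergence warning seen above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for train_texts / train_labels (hypothetical data)
toy_texts = ['shocking secret exposed', 'officials said on tuesday',
             'you will not believe this', 'the committee voted on the bill'] * 5
toy_labels = [1, 0, 1, 0] * 5

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),  # assumed value, higher than the default of 100
])

# Same C values as in the manual loop
param_grid = {'clf__C': [0.001, 0.01, 0.1, 1]}

# 5-fold cross-validation on the training data, scored by F1
search = GridSearchCV(pipe, param_grid, scoring='f1', cv=5)
search.fit(toy_texts, toy_labels)
print(search.best_params_, round(search.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary information leaks from the validation fold.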

# 5. Training Random Forest (default parameters, except n_estimators and max_depth)

In [105]:
from sklearn.ensemble import RandomForestClassifier

In [115]:
res = []

for vect in ['tfidfVectorizer', 'countVectorizer']:
    for preprocess in ['nothing', 'stopwords', 'stem', 'lemma']:
        for n in [100, 500]:
            for m_d in [None, 5, 15, 25]:
                # for c in ['gini', 'entropy']:
                # for c_weight in [None, 'balanced']:
                if vect == 'tfidfVectorizer':
                    vectorizer = TfidfVectorizer()
                else:
                    vectorizer = CountVectorizer()

                if preprocess == 'nothing':
                    X_train = train_texts
                    X_test = test_texts
                elif preprocess == 'stopwords':
                    X_train = train_texts_stop
                    X_test = test_texts_stop
                elif preprocess == 'stem':
                    X_train = train_texts_stem
                    X_test = test_texts_stem
                elif preprocess == 'lemma':
                    X_train = train_texts_lemma
                    X_test = test_texts_lemma

                X_train_vect = vectorizer.fit_transform(X_train)
                X_test_vect = vectorizer.transform(X_test)

                rf = RandomForestClassifier(n_estimators=n, max_depth=m_d)
                rf.fit(X_train_vect, train_labels)

                y_pred = rf.predict(X_test_vect)
                acc = accuracy_score(test_labels, y_pred)
                f1 = f1_score(test_labels, y_pred)
                res.append([vect, preprocess, n, m_d, round(acc, 3), round(f1, 3)])
                if len(res) % 5 == 0:
                    print(len(res))

5
10
15
20
25
30
35
40
45
50
55
60


In [116]:
res[:5]

[['tfidfVectorizer', 'nothing', 100, None, 0.901, 0.899],
 ['tfidfVectorizer', 'nothing', 100, 5, 0.794, 0.799],
 ['tfidfVectorizer', 'nothing', 100, 15, 0.859, 0.859],
 ['tfidfVectorizer', 'nothing', 100, 25, 0.886, 0.885],
 ['tfidfVectorizer', 'nothing', 500, None, 0.903, 0.902]]

In [117]:
res_rf = pd.DataFrame(res, columns=["Vectorizer", "Preprocess", "n_estimators", "max_depth", "accuracy", "f1 score"])

In [118]:
res_rf_sort_f1 = res_rf.sort_values("f1 score", ascending=False)
res_rf_sort_f1.head()

Unnamed: 0,Vectorizer,Preprocess,n_estimators,max_depth,accuracy,f1 score
12,tfidfVectorizer,stopwords,500,,0.909,0.906
36,countVectorizer,nothing,500,,0.909,0.906
28,tfidfVectorizer,lemma,500,,0.907,0.904
4,tfidfVectorizer,nothing,500,,0.903,0.902
44,countVectorizer,stopwords,500,,0.906,0.901


**Best hyperparameters for RandomForest** (tied with tfidfVectorizer + stopwords, which reaches the same scores):
- Vectorizer: countVectorizer
- Preprocess: nothing
- n_estimators = 500
- max_depth = None
- accuracy = 0.909
- **f1_score = 0.906**

**Best result among the classical ML models: Logistic Regression (F1 score = 0.934)**