<a href="https://colab.research.google.com/github/daradanci/MMO_2025/blob/main/notes/LR3_MMO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Laboratory Work No. 3  
## Course: Machine Learning  
### Topic: Feature Processing (Part 2)

**Objective:**  
Study advanced data preprocessing methods used before building machine learning models.  

**Dataset used:**  
A news-article dataset containing numeric and categorical features, text fields, and a `label` column indicating whether an article is trustworthy (`Fake` / `Real`).  

**Main tasks of the laboratory work:**
- Feature scaling (at least 3 methods);
- Outlier handling: removal and replacement;
- Processing a non-standard feature (text);
- Feature selection using three methods:
  - Filter,
  - Wrapper,
  - Embedded.


In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('fake_news_dataset.csv')  # replace with your own path

# Show the structure
print("Dataset size:", df.shape)
display(df.head())

# Check data types and missing values
print("\nFeature types:")
print(df.dtypes)

print("\nMissing values per column:")
print(df.isnull().sum())

# Distribution of the target variable
print("\nLabel distribution:")
print(df['label'].value_counts())

# Convert the label to numeric; skip if it was already converted,
# since re-running map() on 0/1 values would turn them into NaN
if df['label'].dtype == object:
    df['label'] = df['label'].map({'Fake': 1, 'Real': 0})


Dataset size: (4000, 24)


Unnamed: 0,id,title,author,text,state,date_published,source,category,sentiment_score,word_count,...,num_shares,num_comments,political_bias,fact_check_rating,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score,label
0,1,Breaking News 1,Jane Smith,This is the content of article 1. It contains ...,Tennessee,30-11-2021,The Onion,Entertainment,-0.22,1302,...,47305,450,Center,FALSE,1,76,6,0.84,53.35,Fake
1,2,Breaking News 2,Emily Davis,This is the content of article 2. It contains ...,Wisconsin,02-09-2021,The Guardian,Technology,0.92,322,...,39804,530,Left,Mixed,1,1,5,0.85,28.28,Fake
2,3,Breaking News 3,John Doe,This is the content of article 3. It contains ...,Missouri,13-04-2021,New York Times,Sports,0.25,228,...,45860,763,Center,Mixed,0,57,1,0.72,0.38,Fake
3,4,Breaking News 4,Alex Johnson,This is the content of article 4. It contains ...,North Carolina,08-03-2020,CNN,Sports,0.94,155,...,34222,945,Center,TRUE,1,18,10,0.92,32.2,Fake
4,5,Breaking News 5,Emily Davis,This is the content of article 5. It contains ...,California,23-03-2022,Daily Mail,Technology,-0.01,962,...,35934,433,Right,Mixed,0,95,6,0.66,77.7,Real



Feature types:
id                     int64
title                 object
author                object
text                  object
state                 object
date_published        object
source                object
category              object
sentiment_score      float64
word_count             int64
char_count             int64
has_images             int64
has_videos             int64
readability_score    float64
num_shares             int64
num_comments           int64
political_bias        object
fact_check_rating     object
is_satirical           int64
trust_score            int64
source_reputation      int64
clickbait_score      float64
plagiarism_score     float64
label                 object
dtype: object

Missing values per column:
id                   0
title                0
author               0
text                 0
state                0
date_published       0
source               0
category             0
sentiment_score      0
word_count           0
char_count          

## Feature Scaling

Scaling is an important data preparation step, especially for models that are sensitive to feature scale (for example, logistic regression or kNN). In this part we apply:

- **StandardScaler**: standardization to zero mean and unit standard deviation (this shifts and rescales each feature but does not make its distribution normal);
- **MinMaxScaler**: normalization to the range [0, 1];
- **RobustScaler**: scaling based on the median and the interquartile range, which makes it robust to outliers.

We scale only the numeric features (excluding the target variable `label`).
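
As a rough illustration of how the three scalers differ, the sketch below applies each formula by hand to a toy column and checks the result against scikit-learn. The array `x` and its values are hypothetical and are not taken from the lab dataset; the single large value stands in for an outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical toy column with one obvious outlier (not from the lab dataset)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# MinMaxScaler: map the minimum to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# RobustScaler: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# The manual formulas should agree with scikit-learn's default settings
assert np.allclose(standard, StandardScaler().fit_transform(x))
assert np.allclose(minmax, MinMaxScaler().fit_transform(x))
assert np.allclose(robust, RobustScaler().fit_transform(x))

# Columns: standard, min-max, robust
print(np.hstack([standard, minmax, robust]).round(3))
```

In this toy case the outlier squeezes the four ordinary points into roughly [0, 0.03] under min-max scaling, while the robust version keeps them spread between -1 and 0.5 and leaves only the outlier far from the rest.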


In [2]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Select the numeric features (text and categorical columns are excluded).
# Note: the numeric `id` column is also selected, although it is only a row identifier.
numeric_features = df.select_dtypes(include=[np.number]).drop(columns=['label'])

# Apply each scaler
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

scaled_dfs = {}

for name, scaler in scalers.items():
    scaled_data = scaler.fit_transform(numeric_features)
    scaled_df = pd.DataFrame(scaled_data, columns=numeric_features.columns)
    scaled_dfs[name] = scaled_df
    print(f"\n{name} (first 5 rows):")
    display(scaled_df.head())



StandardScaler (first 5 rows):


Unnamed: 0,id,sentiment_score,word_count,char_count,has_images,has_videos,readability_score,num_shares,num_comments,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score
0,-1.731618,-0.381689,1.246162,0.362743,-0.993024,-0.969466,0.792614,1.540442,-0.138727,1.006018,0.883758,0.156845,1.195264,0.095127
1,-1.730752,1.601968,-1.165712,-0.711398,1.007025,-0.969466,-0.948784,1.019023,0.13963,1.006018,-1.661701,-0.191119,1.229854,-0.771487
2,-1.729886,0.436135,-1.397055,0.744273,-0.993024,1.031496,-1.71672,1.439996,0.950348,-0.994018,0.238908,-1.582975,0.780185,-1.735928
3,-1.72902,1.636769,-1.576715,-1.579219,1.007025,-0.969466,1.416129,0.631,1.583612,1.006018,-1.084731,1.548701,1.471984,-0.635982
4,-1.728154,-0.016278,0.409389,-1.459362,1.007025,-0.969466,-0.754369,0.750007,-0.197879,-0.994018,1.528608,0.156845,0.572645,0.936852



MinMaxScaler (first 5 rows):


Unnamed: 0,id,sentiment_score,word_count,char_count,has_images,has_videos,readability_score,num_shares,num_comments,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score
0,0.0,0.39,0.858571,0.609658,0.0,0.0,0.723779,0.946058,0.45,1.0,0.76,0.555556,0.84,0.53358
1,0.00025,0.96,0.158571,0.296425,1.0,0.0,0.221777,0.795921,0.53,1.0,0.01,0.444444,0.85,0.282654
2,0.0005,0.625,0.091429,0.720918,0.0,1.0,0.0004,0.917135,0.763,0.0,0.57,0.0,0.72,0.003403
3,0.00075,0.97,0.039286,0.043356,1.0,0.0,0.903523,0.684194,0.945,1.0,0.18,1.0,0.92,0.32189
4,0.001,0.495,0.615714,0.078308,1.0,0.0,0.277822,0.71846,0.433,0.0,0.95,0.555556,0.66,0.7773



RobustScaler (first 5 rows):


Unnamed: 0,id,sentiment_score,word_count,char_count,has_images,has_videos,readability_score,num_shares,num_comments,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score
0,-1.0,-0.21,0.722755,0.203496,0.0,0.0,0.482919,0.891566,-0.065606,1.0,0.5,0.0,0.7,0.037652
1,-0.9995,0.93,-0.668797,-0.406731,1.0,0.0,-0.531029,0.587534,0.093439,1.0,-0.942308,-0.2,0.72,-0.46713
2,-0.999,0.26,-0.802272,0.420246,0.0,1.0,-0.978169,0.832997,0.55666,0.0,0.134615,-1.0,0.46,-1.028894
3,-0.9985,0.95,-0.905928,-0.899747,1.0,0.0,0.845967,0.361284,0.918489,1.0,-0.615385,0.8,0.86,-0.388201
4,-0.997999,0.0,0.239972,-0.831655,1.0,0.0,-0.417829,0.430675,-0.099404,0.0,0.865385,0.0,0.34,0.527937
