Laboratory Work No. 1 for the course "Fundamentals of Machine Learning"

Completed by students of group 3311
- Аршин Александр
- Баймухамедов Рафаэль
- Пасечный Леонид

Requirements:
- At least 10 feature columns;
- At least 10,000 records;
- The dataset contains missing values.

Tasks:
- Data cleaning (handling missing values, normalization, removing duplicates)
- Visualization of significant features
- Scatter plots
- Box plots
- Histograms
- Data correlation (correlation matrix)

Download the dataset

In [45]:
from pathlib import Path
import requests
import pandas as pd

RAW_URL = "https://raw.githubusercontent.com/brick1ng5654/course-3/RafaelB/boml/lab_01/health_lifestyle_classification.csv"
DEST = Path("/content/data/health_lifestyle_classification.csv")
DEST.parent.mkdir(parents=True, exist_ok=True)

r = requests.get(RAW_URL, timeout=60)
r.raise_for_status()

DEST.write_bytes(r.content)
print(f"Saved to {DEST} ({DEST.stat().st_size/1_000_000:.2f} MB)")

Saved to /content/data/health_lifestyle_classification.csv (52.98 MB)


The dataset contains 100,000 rows and 41 features. We keep the relevant ones and drop the rest.

The target for prediction is target (binary classification): healthy / diseased. Note that this notion is rather loosely defined.

From the numerical features we take:
- age (Age)
- bmi (Body mass index)
- blood pressure (Blood pressure)
- calorie intake (Calorie intake)
- cholesterol (Cholesterol)
- daily steps (Number of steps per day)
- glucose (Glucose)
- heart rate (Heart rate)
- insulin (Insulin)
- sleep hours (Hours of sleep)
- stress level (Self-reported stress level)
- sugar intake (Sugar intake)
- water intake (Water intake)
- work hours (Hours of work)
- meals per day (Number of meals per day)

From the categorical features we take:
- alcohol consumption (Alcohol consumption, [Occasionally, Regularly, None])
- caffeine intake (Caffeine intake, [Moderate, High, None])
- diet type (Diet type, [Vegan, Omnivore, Vegetarian, Keto])
- exercise type (Exercise type, [Strength, Cardio, None, Mixed])
- gender (Gender, [Male, Female])
- sleep quality (Sleep quality, [Good, Excellent, Fair, Poor])
- smoking level (Smoking level, [Light, Non-smoker, Heavy])
- sunlight exposure (Daily sunlight exposure, [Low, Moderate, High])

The following features were therefore removed:
- bmi ?* (bmi estimated, bmi scaled, bmi corrected: redundant because the bmi feature itself is kept)
- daily supplement dosage (its classification is not fully clear)
- device usage (assumed to have minimal influence on the disease outcome)
- education level (assumed to have minimal influence on the disease outcome)
- electrolyte level (useless: the value is the same in every row)
- environmental risk score (useless: the value is the same in every row)
- family history (its binary classification is not fully clear)
- gene marker flag (useless: the value is the same in every row)
- healthcare access (assumed to have minimal influence on the disease outcome)
- height (not needed because bmi is kept)
- income (assumed to have minimal influence on the disease outcome)
- insurance (assumed to have minimal influence on the disease outcome)
- job type (assumed to have minimal influence on the disease outcome)
- mental health score (similar to the self-reported stress level)
- mental health support (assumed to have minimal influence on the disease outcome)
- occupation (assumed to have minimal influence on the disease outcome)
- pet owner (assumed to have minimal influence on the disease outcome)
- physical activity (it is not fully clear how the value was obtained)
- screen time (assumed to have minimal influence on the disease outcome)
- waist size (assumed to have minimal influence on the disease outcome)
- weight (not needed because bmi is kept)

Load the libraries

In [46]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete


Load the dataset

In [47]:
filepath = "/content/data/health_lifestyle_classification.csv"

df = pd.read_csv(filepath)

The variable filepath holds the path to the dataset, and df holds the DataFrame loaded from it.

Remove the unneeded features

In [48]:
delete_columns = [
    'screen_time',
    'family_history',
    'mental_health_score',
    'occupation',
    'mental_health_support',
    'device_usage',
    'healthcare_access',
    'insurance',
    'pet_owner',
    'height',
    'weight',
    'waist_size',
    'bmi_estimated',
    'bmi_scaled',
    'bmi_corrected',
    'physical_activity',
    'education_level',
    'job_type',
    'income',
    'electrolyte_level',
    'gene_marker_flag',
    'environmental_risk_score',
    'daily_supplement_dosage'
]

df = df.drop(columns=delete_columns)
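
Several of the dropped columns were removed because every row holds the same value. A minimal sketch of how such columns could be detected automatically is shown here; the helper name `constant_columns` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value.

    NaN is counted as a value, so a column mixing one value with NaN
    is NOT reported as constant.
    """
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


# Toy data standing in for the real dataset (the values are made up)
toy = pd.DataFrame({
    "age": [25, 40, 33],
    "electrolyte_level": [0, 0, 0],
    "gene_marker_flag": [1.0, 1.0, 1.0],
})

constant = constant_columns(toy)
print(constant)
```

On the real data the same call would be `constant_columns(df)`, run before the `drop`.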

Check the column types

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 25 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   survey_code          100000 non-null  int64  
 1   age                  100000 non-null  int64  
 2   gender               100000 non-null  object 
 3   bmi                  100000 non-null  float64
 4   blood_pressure       92331 non-null   float64
 5   heart_rate           85997 non-null   float64
 6   cholesterol          100000 non-null  float64
 7   glucose              100000 non-null  float64
 8   insulin              84164 non-null   float64
 9   sleep_hours          100000 non-null  float64
 10  sleep_quality        100000 non-null  object 
 11  work_hours           100000 non-null  float64
 12  daily_steps          91671 non-null   float64
 13  calorie_intake       100000 non-null  float64
 14  sugar_intake         100000 non-null  float64
 15  alcohol_consumptio

Check the number of missing values per column

In [50]:
df.isna().sum()

Unnamed: 0,0
survey_code,0
age,0
gender,0
bmi,0
blood_pressure,7669
heart_rate,14003
cholesterol,0
glucose,0
insulin,15836
sleep_hours,0


All requirements for the lab are met:
- [x] At least 10 feature columns
- [x] At least 10,000 records
- [x] The dataset contains missing values
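
The three requirements can also be checked in code rather than by reading the outputs. The sketch below is one possible way to do it; the function name `check_requirements`, its parameters, and the synthetic frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def check_requirements(frame, non_feature_columns=(), min_features=10, min_rows=10_000):
    """Check the lab requirements on a DataFrame and return a dict of booleans."""
    # ID and target columns are not features, so they are excluded from the count
    features = frame.drop(columns=list(non_feature_columns), errors="ignore")
    return {
        "enough_features": features.shape[1] >= min_features,
        "enough_rows": frame.shape[0] >= min_rows,
        "has_missing": bool(frame.isna().any().any()),
    }


# Synthetic stand-in for the real dataset: 10,000 rows, 12 features, one gap
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10_000, 12)), columns=[f"f{i}" for i in range(12)])
toy["survey_code"] = range(len(toy))
toy.loc[0, "f3"] = np.nan

report = check_requirements(toy, non_feature_columns=["survey_code"])
print(report)
```

For the notebook's data the call would be `check_requirements(df, non_feature_columns=["survey_code", "target"])`.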

Drop rows that have missing values in categorical columns

In [51]:
for column in df.columns:
    if df[column].dtype not in ('int64', 'float64'):
        df = df.dropna(subset=[column])
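
One thing worth verifying before dropping rows: three of the chosen categorical features list None as a legitimate category, and in recent pandas versions the default `read_csv` NA list is documented to include the string "None". If that applies here (an assumption, the notebook does not confirm it directly), those rows are read as missing and dropped by the loop above. The sketch below uses a made-up two-column CSV to show how to treat only empty cells as missing.

```python
import io

import pandas as pd

# Made-up CSV: the literal text "None" is a category, the empty cell is a real gap
csv_text = (
    "alcohol_consumption,caffeine_intake\n"
    "None,High\n"
    ",Moderate\n"
    "Regularly,None\n"
)

# Default parsing: depending on the pandas version, "None" may be read as NaN
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.isna().sum())

# Treat only empty cells as missing, so the "None" category survives
strict_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
print(strict_df.isna().sum())
```

Comparing `df.isna().sum()` for the categorical columns under both settings would show whether this affects the real dataset.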

Define the ID column and the target column

In [52]:
id_column = "survey_code"
target_column = "target"

Remove duplicates

In [53]:
print("Exact duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

print("Duplicate IDs:", df.duplicated(subset=[id_column]).sum())
df = df.drop_duplicates(subset=[id_column], keep='first')

X_cols = df.columns.difference([id_column, target_column])
print("Duplicates by X:", df.duplicated(subset=X_cols).sum())

Exact duplicates: 0
Duplicate IDs: 0
Duplicates by X: 0
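
Since all three counts are zero on the real data, the difference between the checks is easier to see on a toy frame. The sketch below uses made-up rows; only the column names `survey_code` and `target` come from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "survey_code": [1, 2, 2, 3, 4],
    "age": [30, 41, 41, 30, 52],
    "bmi": [22.5, 27.1, 27.1, 22.5, 30.0],
    "target": ["healthy", "diseased", "diseased", "diseased", "healthy"],
})

# 1) Rows identical in every column (the two rows with survey_code 2)
exact = int(toy.duplicated().sum())
toy = toy.drop_duplicates()

# 2) Repeated IDs among the remaining rows
by_id = int(toy.duplicated(subset=["survey_code"]).sum())

# 3) Same feature values under different IDs (survey_code 1 and 3)
feature_cols = toy.columns.difference(["survey_code", "target"]).tolist()
by_features = int(toy.duplicated(subset=feature_cols).sum())

# Rows sharing features but possibly carrying different labels deserve a manual look
suspicious = toy[toy.duplicated(subset=feature_cols, keep=False)]
print(exact, by_id, by_features)
print(suspicious)
```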


Separate the categorical and numerical features

In [54]:
numerical_columns = [
    'age','bmi','blood_pressure','heart_rate','cholesterol','glucose','insulin','sleep_hours','work_hours','daily_steps','calorie_intake','sugar_intake','water_intake','stress_level','meals_per_day'
]

categorical_columns = [
    'gender','sleep_quality','alcohol_consumption','smoking_level','diet_type','exercise_type','sunlight_exposure','caffeine_intake'
]
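
The two lists above are written by hand. As a cross-check, they could also be derived from the column dtypes. This is only a sketch on a toy frame; it assumes that every non-numeric feature is categorical, which holds for the columns listed in this notebook but not in general.

```python
import pandas as pd

toy = pd.DataFrame({
    "survey_code": [1, 2],
    "age": [30, 41],
    "bmi": [22.5, 27.1],
    "gender": ["Male", "Female"],
    "target": ["healthy", "diseased"],
})

# The ID and the target are not features, so leave them out first
features = toy.drop(columns=["survey_code", "target"])

numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()
print(numerical, categorical)
```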

Next come the scatter plots, but first we look at the correlation matrix to find strongly related features.

We start with the correlation between the numerical features

In [55]:
corr = df[numerical_columns].corr(method='pearson')
corr

Unnamed: 0,age,bmi,blood_pressure,heart_rate,cholesterol,glucose,insulin,sleep_hours,work_hours,daily_steps,calorie_intake,sugar_intake,water_intake,stress_level,meals_per_day
age,1.0,0.005813,0.004073,0.003724,0.003975,0.00336,0.007096,-2.8e-05,0.008027,0.006774,0.005464,0.000516,0.00625,0.001445,-0.002469
bmi,0.005813,1.0,0.001597,0.002725,-0.002661,-0.006623,0.001626,0.005992,0.000517,-0.001132,0.007845,0.006033,-0.00103,0.004516,-0.00416
blood_pressure,0.004073,0.001597,1.0,-0.004422,0.003761,0.000324,0.008213,0.011707,0.000379,-0.005427,0.003217,-0.01457,-0.004899,0.005546,0.004311
heart_rate,0.003724,0.002725,-0.004422,1.0,-0.004179,-0.004668,0.005576,-0.003813,-0.005878,-0.005139,0.012429,0.00453,-0.002552,0.004429,0.005565
cholesterol,0.003975,-0.002661,0.003761,-0.004179,1.0,-0.007027,0.020514,-0.000885,-0.000121,-0.005513,-0.0007,-0.000732,-0.003526,0.001546,-0.002427
glucose,0.00336,-0.006623,0.000324,-0.004668,-0.007027,1.0,-0.009471,-0.012819,-0.009,-0.00614,-0.006927,0.008066,0.000842,-0.001294,-0.00052
insulin,0.007096,0.001626,0.008213,0.005576,0.020514,-0.009471,1.0,-0.002588,-0.000668,-0.011813,0.003446,-0.002923,0.011518,-0.005206,-0.003807
sleep_hours,-2.8e-05,0.005992,0.011707,-0.003813,-0.000885,-0.012819,-0.002588,1.0,0.000975,-0.004664,0.004178,0.000853,0.014802,-0.011555,0.005172
work_hours,0.008027,0.000517,0.000379,-0.005878,-0.000121,-0.009,-0.000668,0.000975,1.0,-0.00203,-0.005692,0.005755,-0.001761,-0.006277,-0.0073
daily_steps,0.006774,-0.001132,-0.005427,-0.005139,-0.005513,-0.00614,-0.011813,-0.004664,-0.00203,1.0,-0.000359,0.00375,0.007756,0.004492,-0.002344
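
In the rows shown, every off-diagonal coefficient is close to zero, which makes the matrix hard to scan by eye. A possible way to rank feature pairs by absolute correlation is sketched below; the helper `strongest_pairs` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` feature pairs ranked by absolute correlation."""
    # Keep only the upper triangle: each pair appears once, the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Synthetic data: "b" is almost a linear function of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

top_pairs = strongest_pairs(toy.corr(method="pearson"), top=1)
print(top_pairs)
```

With the notebook's matrix the call would be `strongest_pairs(corr)`.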


Now we look at the correlation between the target and the numerical features

In [56]:
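# y is only defined in a later cell, so encode the target here (assumed encoding: 1 = diseased)
y_numeric = (df[target_column] == 'diseased').astype(int)
corr_with_y = df[numerical_columns].corrwith(y_numeric)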
corr_with_y.sort_values(ascending=False)

Unnamed: 0,0
work_hours,0.014562
meals_per_day,0.009275
daily_steps,0.007677
blood_pressure,0.003943
calorie_intake,0.003517
sugar_intake,0.003103
insulin,0.002462
heart_rate,0.000755
water_intake,0.000559
stress_level,-0.002207


Split the data into training and test sets

In [57]:
X = df.drop(columns=[id_column, target_column])
y = df[target_column]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=26, test_size=0.2
)
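
The `stratify=y` argument keeps the class proportions the same in both parts. A small sketch on made-up labels (70 healthy, 30 diseased) illustrates the effect; the toy names `X_toy` and `y_toy` are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70% healthy, 30% diseased
y_toy = pd.Series(["healthy"] * 70 + ["diseased"] * 30, name="target")
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=26, test_size=0.2
)

# Class shares in each part should match the shares in the full data
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```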

Fill the missing numerical values with the median

In [58]:
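for column in X_train.columns:
    if X_train[column].dtype in ('int64', 'float64'):
        # Compute the median on the training part only and reuse it for the test part
        median = X_train[column].median()
        X_train[column] = X_train[column].fillna(median)
        X_test[column] = X_test[column].fillna(median)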
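
The same median imputation could be expressed with scikit-learn's `SimpleImputer`, which stores the training medians and reuses them for the test part. This is an alternative sketch on made-up values, not the method the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test parts with gaps
train = pd.DataFrame({
    "insulin": [10.0, np.nan, 14.0, 12.0],
    "heart_rate": [60.0, 70.0, np.nan, 80.0],
})
test = pd.DataFrame({
    "insulin": [np.nan, 100.0],
    "heart_rate": [65.0, np.nan],
})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians from train; transform reuses them on test
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
print(test_filled)
```

The test gaps are filled with the training medians (12.0 for insulin, 70.0 for heart_rate), so no information from the test part leaks into preprocessing.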

Check the result

In [59]:
X_train.isna().sum()

Unnamed: 0,0
age,0
gender,0
bmi,0
blood_pressure,0
heart_rate,0
cholesterol,0
glucose,0
insulin,0
sleep_hours,0
sleep_quality,0


No missing values remain.

Create a copy of the dataframe to work with it safely

In [60]:
df_copy = df.copy()

Normalize the numerical features

In [61]:
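from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Fit on the training part only, then apply the same scaling to the test part
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])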
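
With default settings, `MinMaxScaler` maps each feature to x' = (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below checks this against a manual computation on made-up values and shows that test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; 90 lies above the training maximum
train = np.array([[18.0], [40.0], [79.0]])
test = np.array([[18.0], [90.0]])

scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# The same transformation written out by hand with the training min and max
train_min = train.min(axis=0)
train_max = train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

print(scaled.ravel())
print(manual.ravel())
```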

Check the result

In [62]:
X_train[numerical_columns].head()

Unnamed: 0,age,bmi,blood_pressure,heart_rate,cholesterol,glucose,insulin,sleep_hours,work_hours,daily_steps,calorie_intake,sugar_intake,water_intake,stress_level,meals_per_day
86953,0.180328,0.246551,0.512171,0.447303,0.594185,0.669569,0.493713,0.54771,0.281402,0.352715,0.601689,0.477911,0.546113,0.9,0.0
8978,0.032787,0.407378,0.624534,0.693213,0.753914,0.499134,0.423054,0.409013,0.689828,0.409174,0.392391,0.419906,0.359304,1.0,0.75
46239,0.786885,0.290406,0.658088,0.693042,0.692744,0.567587,0.363575,0.227916,0.723833,0.460811,0.333298,0.367148,0.318064,0.5,0.5
49158,0.967213,0.222712,0.510556,0.335969,0.415707,0.414671,0.493713,0.574274,0.384541,0.352715,0.461585,0.385289,0.422708,0.9,1.0
28799,0.442623,0.422961,0.660673,0.671196,0.65725,0.673384,0.499005,0.14264,0.473008,0.206725,0.677191,0.332247,0.272437,0.4,0.25


Next, convert the categorical features to numerical ones with a one-hot encoder

In [63]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(X_train[categorical_columns])
ohe_columns = ohe.get_feature_names_out(categorical_columns)

X_train_categorical = pd.DataFrame(
    ohe.transform(X_train[categorical_columns]),
    columns=ohe.get_feature_names_out(categorical_columns),
    index=X_train.index
)
X_test_categorical = pd.DataFrame(
    ohe.transform(X_test[categorical_columns]),
    columns=ohe.get_feature_names_out(categorical_columns),
    index=X_test.index
)

X_train = pd.concat([X_train[numerical_columns], X_train_categorical], axis=1)
X_test = pd.concat([X_test[numerical_columns], X_test_categorical], axis=1)

Check the result

In [64]:
X_train[ohe_columns].head()

Unnamed: 0,gender_Female,gender_Male,sleep_quality_Excellent,sleep_quality_Fair,sleep_quality_Good,sleep_quality_Poor,alcohol_consumption_Occasionally,alcohol_consumption_Regularly,smoking_level_Heavy,smoking_level_Light,smoking_level_Non-smoker,diet_type_Keto,diet_type_Omnivore,diet_type_Vegan,diet_type_Vegetarian,exercise_type_Cardio,exercise_type_Mixed,exercise_type_Strength,sunlight_exposure_High,sunlight_exposure_Low,sunlight_exposure_Moderate,caffeine_intake_High,caffeine_intake_Moderate
86953,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
8978,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
46239,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
49158,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
28799,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [65]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23076 entries, 86953 to 31700
Data columns (total 38 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   age                               23076 non-null  float64
 1   bmi                               23076 non-null  float64
 2   blood_pressure                    23076 non-null  float64
 3   heart_rate                        23076 non-null  float64
 4   cholesterol                       23076 non-null  float64
 5   glucose                           23076 non-null  float64
 6   insulin                           23076 non-null  float64
 7   sleep_hours                       23076 non-null  float64
 8   work_hours                        23076 non-null  float64
 9   daily_steps                       23076 non-null  float64
 10  calorie_intake                    23076 non-null  float64
 11  sugar_intake                      23076 non-null  float64
 12  water

Everything was done correctly.