<a href="https://colab.research.google.com/github/andrea-acampora/Data_Intensive_Project/blob/main/Progetto_Data_Intensive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predicting fetal health from cardiotocography data**

***Acampora Andrea***

Computer Science and Engineering \\
University of Bologna, Cesena campus \\
Data Intensive Application Programming course

## Problem description and exploratory analysis

The chosen dataset contains the results of cardiotocographies performed on fetuses during pregnancy.
Cardiotocography is a widely used exam for assessing fetal well-being in the prenatal setting.
The goal of the project is to predict a discrete variable that represents the health of the fetus from the data provided by the cardiotocographies.

### Data loading and preprocessing

All the libraries needed for the project are imported.

In [9]:
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import os.path

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve

In [17]:
import os.path
from urllib.request import urlretrieve

# Download the dataset only if it is not already available locally
if not os.path.exists("fetal_health.csv"):
    urlretrieve("https://raw.githubusercontent.com/andrea-acampora/Data_Intensive_Project/main/fetal_health.csv", "fetal_health.csv")

# Load the CSV into a DataFrame and preview the first rows
dataset = pd.read_csv('fetal_health.csv')
dataset.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,mean_value_of_long_term_variability,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,10.4,130.0,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,13.4,130.0,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,23.0,117.0,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,19.9,117.0,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


### Meaning of the features


The dataset **features** are the following:


*   **baseline value**: baseline fetal heart rate
*   **accelerations**: number of heart rate accelerations per second
*   **fetal_movement**: number of fetal movements per second
*   **uterine_contractions**: number of uterine contractions per second
*   **light_decelerations**: number of light decelerations of the heart rate
*   **severe_decelerations**: number of severe decelerations of the heart rate
*   **prolongued_decelerations**: number of prolonged decelerations of the heart rate
*   **abnormal_short_term_variability**: indicates the abnormal short-term variability of the heart rate
*   **mean_value_of_short_term_variability**: mean value of the short-term oscillations
*   **percentage_of_time_with_abnormal_long_term_variability**: percentage of time with abnormal long-term oscillations
*   **mean_value_of_long_term_variability**: mean value of the long-term oscillations

The following variables instead describe the histogram, that is, the output of the cardiotocography:

*   **histogram_width**
*   **histogram_min**
*   **histogram_max**
*   **histogram_number_of_peaks**
*   **histogram_number_of_zeroes**
*   **histogram_mode**
*   **histogram_mean**
*   **histogram_median**
*   **histogram_variance**
*   **histogram_tendency**

The last column is the variable to predict:

*   **fetal_health**
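
Since the columns fall into signal measurements, histogram descriptors, and the target, it can be handy to separate them programmatically. The following is only a sketch: the `split_columns` helper and the `COLUMNS` list are hypothetical names introduced here, and the column names are copied from the preview of the dataset shown earlier.

```python
# Column names copied from the dataset preview (the unnamed index is excluded)
COLUMNS = [
    "baseline value", "accelerations", "fetal_movement",
    "uterine_contractions", "light_decelerations", "severe_decelerations",
    "prolongued_decelerations", "abnormal_short_term_variability",
    "mean_value_of_short_term_variability",
    "percentage_of_time_with_abnormal_long_term_variability",
    "mean_value_of_long_term_variability",
    "histogram_width", "histogram_min", "histogram_max",
    "histogram_number_of_peaks", "histogram_number_of_zeroes",
    "histogram_mode", "histogram_mean", "histogram_median",
    "histogram_variance", "histogram_tendency",
    "fetal_health",
]
TARGET = "fetal_health"


def split_columns(columns, target=TARGET, prefix="histogram_"):
    """Split column names into (signal features, histogram features)."""
    histogram = [c for c in columns if c.startswith(prefix)]
    signal = [c for c in columns if c != target and not c.startswith(prefix)]
    return signal, histogram


signal_cols, histogram_cols = split_columns(COLUMNS)
print(len(signal_cols), "signal features,", len(histogram_cols), "histogram features")
```

On the loaded data, the same helper could be called as `split_columns(dataset.columns)`, for example to plot the two groups separately.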

### Feature analysis

In [29]:
dataset.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
baseline value,2126.0,133.303857,9.840844,106.0,126.0,133.0,140.0,160.0
accelerations,2126.0,0.003178,0.003866,0.0,0.0,0.002,0.006,0.019
fetal_movement,2126.0,0.009481,0.046666,0.0,0.0,0.0,0.003,0.481
uterine_contractions,2126.0,0.004366,0.002946,0.0,0.002,0.004,0.007,0.015
light_decelerations,2126.0,0.001889,0.00296,0.0,0.0,0.0,0.003,0.015
severe_decelerations,2126.0,3e-06,5.7e-05,0.0,0.0,0.0,0.0,0.001
prolongued_decelerations,2126.0,0.000159,0.00059,0.0,0.0,0.0,0.0,0.005
abnormal_short_term_variability,2126.0,46.990122,17.192814,12.0,32.0,49.0,61.0,87.0
mean_value_of_short_term_variability,2126.0,1.332785,0.883241,0.2,0.7,1.2,1.7,7.0
percentage_of_time_with_abnormal_long_term_variability,2126.0,9.84666,18.39688,0.0,0.0,0.0,11.0,91.0
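
The summary statistics show features on very different scales: `baseline value` has a mean around 133, while `accelerations` never exceeds 0.019. This is presumably why `StandardScaler` is imported at the top of the notebook. The sketch below reproduces the same rescaling by hand with pandas (zero mean, unit population standard deviation) on two columns of the five previewed rows. The `standardize` helper is a hypothetical name introduced here, and whether the project scales every feature in this way is an assumption.

```python
import pandas as pd


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to zero mean and unit population standard deviation."""
    # ddof=0 matches the population standard deviation used by StandardScaler
    return (frame - frame.mean()) / frame.std(ddof=0)


# Two columns taken from the five rows shown by dataset.head()
sample = pd.DataFrame({
    "baseline value": [120.0, 132.0, 133.0, 134.0, 132.0],
    "accelerations": [0.0, 0.006, 0.003, 0.003, 0.007],
})

scaled = standardize(sample)
print(scaled.round(3))
```

After rescaling, both columns have comparable magnitudes, which matters for distance-based models such as `KNeighborsClassifier` and for regularized linear models.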


### Analysis of the target variable

The variable to predict is ***fetal_health***, which represents the health of the fetus and takes the following values:
* 1 - Normal
* 2 - Suspect
* 3 - Pathological

0       False
1        True
2        True
3        True
4        True
        ...  
2121    False
2122    False
2123    False
2124    False
2125     True
Name: fetal_health, Length: 2126, dtype: bool
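
Because the target is a three-class label, it helps to check how many samples fall into each class before training a model. The following is a minimal sketch, assuming the label mapping listed in this section. The `HEALTH_LABELS` dictionary and the `class_distribution` helper are hypothetical names introduced here, and the toy series contains only the `fetal_health` values of the five previewed rows, not the full dataset.

```python
import pandas as pd

# Mapping from the numeric codes to the labels described in this section
HEALTH_LABELS = {1.0: "Normal", 2.0: "Suspect", 3.0: "Pathological"}


def class_distribution(target: pd.Series) -> pd.DataFrame:
    """Return the count and the relative share of each fetal_health class."""
    counts = target.map(HEALTH_LABELS).value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy sample: fetal_health values of the five rows shown by dataset.head()
sample = pd.Series([2.0, 1.0, 1.0, 1.0, 1.0], name="fetal_health")

distribution = class_distribution(sample)
print(distribution)
```

On the full data the helper would be called as `class_distribution(dataset["fetal_health"])`. A strongly unbalanced result would be one reason to prefer stratified splits such as `StratifiedKFold`, which the notebook already imports.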