# **COVID 19 - DATA SCIENCE PROJECT**

<u style="font-weight:600; font-size: 18px;">Project objective:</u> Using the dataset at our disposal, we want to build a Machine Learning model capable of predicting whether or not a person is infected.

<u style="font-weight:600; font-size: 18px;">Metrics:</u> Accuracy → 90%

In [106]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Importing the data

In [107]:
df_original = pd.read_excel('./datasets/dataset.xlsx')
df_original.head()

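| | Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | ... | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.236515 | -0.02234 | -0.517413 | 0.010677 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |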


At first glance at the dataset, we notice that:
- The column names are poorly formatted
- Some columns appear to contain only missing values (NaN)
- The target column is 'SARS-Cov-2 exam result'
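
The target column noted above can be inspected directly to see how its classes are distributed. The following is a minimal sketch on a small hypothetical frame (`toy`, with made-up rows), not on the real dataset; it only illustrates the `value_counts` call one might run on `df_original`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "SARS-Cov-2 exam result": ["negative", "negative", "positive", "negative"],
})

# Absolute and relative frequency of each class in the target column.
target_counts = toy["SARS-Cov-2 exam result"].value_counts()
target_share = toy["SARS-Cov-2 exam result"].value_counts(normalize=True)

print(target_counts)
print(target_share)
```

On the real data, the same two calls would be applied to `df_original["SARS-Cov-2 exam result"]`.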

## Exploratory Data Analysis (EDA)

In [136]:
df = df_original.copy()
df.head()

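| | Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | ... | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.236515 | -0.02234 | -0.517413 | 0.010677 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |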


- Dataset dimensions

In [137]:
print(f"Number of columns: {df.shape[1]}\nNumber of rows: {df.shape[0]}")

Number of columns: 111
Number of rows: 5644


**Note:** We have a large number of variables, so the dimensionality will need to be reduced.

- Formatting the column names

In [138]:
def format_col_name(col_name_ : str) -> str:
    """
    Formats a column name by converting it to lowercase, replacing spaces with underscores,
    and trimming unnecessary whitespace. If the input contains a parenthesis, the function
    only processes the substring before the parenthesis.

    :param col_name_: The name of the column to be formatted.
    :return: The formatted column name with spaces replaced by underscores and converted
        to lowercase.
    """
    # Keep only the text before the first '(' (the whole name if there is none).
    base_name = col_name_.split('(', 1)[0]
    return base_name.strip(' ').replace(' ', '_').lower()
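
As a quick check of the formatting rule described in the docstring, the sketch below applies an equivalent one-line version of `format_col_name` to a few column names. The last name in `raw_names` is hypothetical and was added only to show that cutting at the parenthesis can map two different columns to the same formatted name, which may be worth checking before renaming.

```python
from collections import Counter


def format_col_name(col_name_: str) -> str:
    # Same rule as above: keep the text before '(' and normalize it.
    return col_name_.split('(', 1)[0].strip(' ').replace(' ', '_').lower()


raw_names = [
    "Patient ID",
    "SARS-Cov-2 exam result",
    "Patient addmited to regular ward (1=yes, 0=no)",
    "pCO2 (arterial blood gas analysis)",
    "pCO2 (hypothetical other sample type)",  # made-up name to show a collision
]

formatted = [format_col_name(name) for name in raw_names]

# Formatted names that appear more than once would become duplicate columns.
duplicates = [name for name, count in Counter(formatted).items() if count > 1]

print(formatted)
print(duplicates)
```

If `duplicates` is not empty on the real column list, one option is to keep part of the parenthesized text in the formatted name for those columns.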

In [139]:
df.rename(columns={col: format_col_name(col) for col in df.columns}, inplace=True)

In [140]:
df.columns

Index(['patient_id', 'patient_age_quantile', 'sars-cov-2_exam_result',
       'patient_addmited_to_regular_ward',
       'patient_addmited_to_semi-intensive_unit',
       'patient_addmited_to_intensive_care_unit', 'hematocrit', 'hemoglobin',
       'platelets', 'mean_platelet_volume',
       ...
       'hb_saturation', 'pco2', 'base_excess', 'ph', 'total_co2', 'hco3',
       'po2', 'arteiral_fio2', 'phosphor', 'cto2'],
      dtype='object', length=111)

- Missing values (NaN)

In [141]:
df.isnull().sum().sort_values()

patient_id                                    0
patient_age_quantile                          0
sars-cov-2_exam_result                        0
patient_addmited_to_regular_ward              0
patient_addmited_to_semi-intensive_unit       0
                                           ... 
mycoplasma_pneumoniae                      5644
urine_-_sugar                              5644
prothrombin_time                           5644
partial_thromboplastin_time                5644
d-dimer                                    5644
Length: 111, dtype: int64

**Note:** A large share of the columns have more than 50% missing values. We could drop them as a first step, because at this stage nothing more can be done to recover that many missing values, but we will keep them for now.
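
The 50% threshold mentioned above can be made explicit by computing the share of missing values per column instead of the raw counts. This is a minimal sketch on a hypothetical three-column frame (`toy`); the column names mirror the notebook, but the rows are made up and the 0.5 cutoff is simply the figure quoted in the note.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "patient_age_quantile": [13, 17, 8, 5],
    "hematocrit": [np.nan, 0.236515, np.nan, np.nan],
    "d-dimer": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, from least to most missing.
missing_ratio = toy.isnull().mean().sort_values()

# Columns whose missing share exceeds the 50% threshold.
mostly_missing = missing_ratio[missing_ratio > 0.5].index.tolist()

print(missing_ratio)
print(mostly_missing)
```

Applied to `df`, the list `mostly_missing` would name the candidate columns to drop later, without modifying the frame at this stage.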

- Variable types

In [143]:
df.dtypes.sort_values()

patient_age_quantile                         int64
patient_addmited_to_regular_ward             int64
patient_addmited_to_semi-intensive_unit      int64
patient_addmited_to_intensive_care_unit      int64
cto2                                       float64
                                            ...   
urine_-_bile_pigments                       object
urine_-_ketone_bodies                       object
urine_-_nitrite                             object
urine_-_protein                             object
strepto_a                                   object
Length: 111, dtype: object