# Data Wrangling

In this project, I will look at COVID-19 data from Mexico to predict high-risk patients and then find the best locations to send supplies.

This data was found on Kaggle and originates from the Mexican Government's General Directorate of Epidemiology.

(https://www.kaggle.com/datasets/marianarfranklin/mexico-covid19-clinical-data)

(https://www.gob.mx/salud/documentos/datos-abiertos-152127)

In [1]:
#Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
df=pd.read_csv('221227COVID19MEXICO.csv')


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6330966 entries, 0 to 6330965
Data columns (total 40 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   FECHA_ACTUALIZACION    object
 1   ID_REGISTRO            object
 2   ORIGEN                 int64 
 3   SECTOR                 int64 
 4   ENTIDAD_UM             int64 
 5   SEXO                   int64 
 6   ENTIDAD_NAC            int64 
 7   ENTIDAD_RES            int64 
 8   MUNICIPIO_RES          int64 
 9   TIPO_PACIENTE          int64 
 10  FECHA_INGRESO          object
 11  FECHA_SINTOMAS         object
 12  FECHA_DEF              object
 13  INTUBADO               int64 
 14  NEUMONIA               int64 
 15  EDAD                   int64 
 16  NACIONALIDAD           int64 
 17  EMBARAZO               int64 
 18  HABLA_LENGUA_INDIG     int64 
 19  INDIGENA               int64 
 20  DIABETES               int64 
 21  EPOC                   int64 
 22  ASMA                   int64 
 23  INMUSUP

The dataset starts out with 6,330,966 rows and 40 columns.

In [None]:
df.head(5)

In [None]:
cols = df.columns.values
cols

I will rename the columns from Spanish to English so that it is easier to refer back to each column and know what it is for.

In [4]:
# Rename the Spanish column names to English equivalents
df2 = df.rename(columns={
    'FECHA_ACTUALIZACION' : 'Updated Date', 
    'ID_REGISTRO' : 'Registration ID', 
    'ORIGEN':'Origin', 
    'SECTOR':'Sector',     
    'ENTIDAD_UM': 'Entity', 
    'SEXO':'Sex', 
    'ENTIDAD_NAC':'Birth Entity', 
    'ENTIDAD_RES':'Residence Entity', 
    'MUNICIPIO_RES':'Residence Municipality', 
    'TIPO_PACIENTE':'Patient Type', 
    'FECHA_INGRESO':'Date Entry',    
    'FECHA_SINTOMAS':'Date Symptoms', 
    'FECHA_DEF':'Date Died', 
    'INTUBADO':'Intubated', 
    'NEUMONIA':'Pneumonia', 
    'EDAD':'Age',
    'NACIONALIDAD':'Nationality', 
    'EMBARAZO':'Pregnant', 
    'HABLA_LENGUA_INDIG':'Language', 
    'INDIGENA':'Indigenous',   
    'DIABETES':'Diabetes', 
    'EPOC':'COPD', 
    'ASMA':'Asthma', 
    'INMUSUPR':'Immunosuppresed', 
    'HIPERTENSION':'Hypertension', 
    'OTRA_COM':'Other Disease',
    'CARDIOVASCULAR':'Cardiovascular', 
    'OBESIDAD':'Obesity', 
    'RENAL_CRONICA':'Chronic Kidney', 
    'TABAQUISMO':'Tobacco',
    'OTRO_CASO':'COVID Contact', 
    'TOMA_MUESTRA_LAB':'Lab Sample', 
    'RESULTADO_LAB':'Sample Result',   
    'TOMA_MUESTRA_ANTIGENO':'Antigen Sample', 
    'RESULTADO_ANTIGENO':'Antigen Result', 
    'CLASIFICACION_FINAL':'COVID', 
    'MIGRANTE':'Migrant', 
    'PAIS_NACIONALIDAD':'Previous Nationality',
    'PAIS_ORIGEN':'Previous Origin', 
    'UCI':'ICU'
    })

df2.head(5)

Unnamed: 0,Updated Date,Registration ID,Origin,Sector,Entity,Sex,Birth Entity,Residence Entity,Residence Municipality,Patient Type,...,COVID Contact,Lab Sample,Sample Result,Antigen Sample,Antigen Result,COVID,Migrant,Previous Nationality,Previous Origin,ICU
0,2022-12-27,10e0db,1,12,20,2,20,20,67,1,...,1,2,97,1,2,7,99,México,97,97
1,2022-12-27,0989f5,2,12,14,1,32,14,71,1,...,2,2,97,1,1,3,99,México,97,97
2,2022-12-27,01e27d,2,9,25,2,25,25,1,1,...,2,2,97,1,2,7,99,México,97,97
3,2022-12-27,180725,2,9,9,2,9,9,12,2,...,2,2,97,1,2,7,99,México,97,2
4,2022-12-27,0793b8,2,12,9,2,9,9,10,1,...,2,2,97,1,2,7,99,México,97,97


In [None]:
df2.nunique()

A few things here need adjusting:
1. Updated Date has only one value, so we can drop it.
2. Registration ID is different for every row, so we probably do not need it either.
3. Age has 128 different values, which does not seem right; we may need to look into this more.
4. We do not need the columns stating where the patient came from, since supplies should go to where the patient currently is.
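The first two checks in the list can be automated. The following is a minimal sketch on a made-up toy frame (the values are invented for illustration, not taken from the dataset) that flags columns with a single value and columns that are unique per row:

```python
import pandas as pd

# Toy stand-in for df2; all values are made up for illustration.
toy = pd.DataFrame({
    "Updated Date": ["2022-12-27"] * 4,
    "Registration ID": ["a1", "b2", "c3", "d4"],
    "Age": [34, 61, 34, 45],
})

counts = toy.nunique()

# Columns with a single distinct value carry no information.
constant_cols = counts[counts == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = counts[counts == len(toy)].index.tolist()

print("constant:", constant_cols)
print("id-like:", id_like_cols)
```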

In [5]:
df2=df2.drop(['Updated Date'], axis=1)

In [None]:
df2[df2.duplicated()]

In [None]:
missing = pd.concat([df2.isnull().sum(), 100 * df2.isnull().mean()], axis=1)
missing.columns=['count','%']
missing.sort_values(by='count')

In [6]:
df2=df2.drop(['Migrant', 'Previous Origin','Birth Entity','Previous Nationality',], axis=1)

In [7]:
import numpy as np
from scipy import stats 

Next, I perform chi-square tests to find out whether there are any relationships between our categorical variables.
Null hypothesis: there is no relationship between CLASIFICACION_FINAL (whether the patient has COVID) and the other features.
Significance level: 0.05
If the p-value is less than the significance level, the null hypothesis is rejected, meaning that there could be a relationship between the two features.
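To make the decision rule concrete, here is a minimal sketch on a small contingency table. The counts and the row and column labels are invented for illustration and are not taken from the dataset.

```python
import pandas as pd
from scipy import stats

# Made-up 2x2 table of counts: rows are test outcome, columns are obesity.
observed = pd.DataFrame(
    [[30, 70],
     [10, 90]],
    index=["positive", "negative"],
    columns=["obese", "not obese"],
)

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the counts expected if the two variables were independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # significance level
reject_null = p_value < alpha

print(f"statistic={chi2_stat:.3f}, p-value={p_value:.5f}, dof={dof}")
print("Reject the null hypothesis:", reject_null)
```

With a table as large as the real one, even weak associations tend to produce very small p-values, so a rejected null hypothesis suggests a relationship but says little about its strength.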

In [8]:
compare = pd.crosstab(df['CLASIFICACION_FINAL'],df['OBESIDAD'])
print(compare)

OBESIDAD                 1        2      98
CLASIFICACION_FINAL                        
1                      4200    68968    337
2                        39      374      1
3                    201261  2852207  11328
4                        12      332      0
5                       546     8566     27
6                     10254   188798    677
7                    153059  2818490  11490


In [9]:
# Named chi2_stat so it does not shadow sklearn's chi2, which is imported later
chi2_stat, p, dof, ex=stats.chi2_contingency(compare)
print(f'Chi_square value: {chi2_stat} \nP-Value: {p}\nDegrees of Freedom: {dof}\nExpected: {ex}')

Chi_square value: 5902.031143831834 
P-Value: 0.0
Degrees of Freedom: 12
Expected: [[4.28854228e+03 6.89394338e+04 2.77023964e+02]
 [2.41542276e+01 3.88285499e+02 1.56027374e+00]
 [1.78811063e+05 2.87443440e+06 1.15505331e+04]
 [2.00701795e+01 3.22633361e+02 1.29645934e+00]
 [5.33201658e+02 8.57135549e+03 3.44428544e+01]
 [1.16528979e+04 1.87323368e+05 7.52734091e+02]
 [1.74041070e+05 2.79775552e+06 1.12424092e+04]]


In [10]:
feature=[]
pvalue=[]

# Test every integer-coded column against the COVID classification
for c in df2.columns:
    if df2[c].dtype == np.int64:
        compare = pd.crosstab(df2['COVID'],df2[c])
        chi2_stat, p, dof, ex=stats.chi2_contingency(compare)
        pvalue.append(p)
        feature.append(c)

# Build the summary table once, after the loop has finished
df_pvalue = pd.DataFrame(list(zip(feature, pvalue)),columns=['feature','pvalue'])

In [11]:
df_pvalue

Unnamed: 0,feature,pvalue
0,Origin,0.0
1,Sector,0.0
2,Entity,0.0
3,Sex,5.231144e-113
4,Residence Entity,0.0
5,Residence Municipality,0.0
6,Patient Type,0.0
7,Intubated,0.0
8,Pneumonia,0.0
9,Age,0.0


In [12]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler
from numpy import array

In [13]:
df3=df2.drop(['Registration ID','Date Entry','Date Symptoms','Date Died'],axis=1)

In [14]:
y=df3['COVID']
X=df3.drop(columns=['COVID'])

select=SelectKBest(chi2,k=30)

In [15]:
new=select.fit_transform(X,y)

In [16]:
select.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [17]:
select.get_feature_names_out(input_features=None)

array(['Origin', 'Sector', 'Entity', 'Sex', 'Residence Entity',
       'Residence Municipality', 'Patient Type', 'Intubated', 'Pneumonia',
       'Age', 'Nationality', 'Pregnant', 'Language', 'Indigenous',
       'Diabetes', 'COPD', 'Asthma', 'Immunosuppresed', 'Hypertension',
       'Other Disease', 'Cardiovascular', 'Obesity', 'Chronic Kidney',
       'Tobacco', 'COVID Contact', 'Lab Sample', 'Sample Result',
       'Antigen Sample', 'Antigen Result', 'ICU'], dtype=object)

In [None]:
df2['Date Died'].value_counts()

In [None]:
# The date 9999-99-99 marks patients with no recorded death
df2['Death'] = (df2['Date Died'] != '9999-99-99').astype(int)

0 - person is still alive
1 - person has a death date, has died

In [None]:
ax = sns.countplot(x=df2['Death'])
# Sort counts by category (0, 1) so the labels line up with the bars
abs_values = df2['Death'].value_counts().sort_index().values
ax.bar_label(container=ax.containers[0], labels=abs_values)
plt.show()

In [None]:
df2['Date Symptoms'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(x=df2['Date Symptoms'])
plt.title("Number of Cases by Symptom Onset Date");

To do: a bar chart of Date Symptoms, then a stacked chart of positive vs. negative cases.

In [None]:
df2['Age'].value_counts()

In [None]:
df2['Age'].value_counts(ascending=True).head(20)

According to various news articles, the oldest men and women ever recorded worldwide were in their 110s to 120s, so an age of 130 or more is treated here as a data entry error.

Life expectancy in Mexico is about 77-78 years.
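Before dropping rows, it can help to see which records an upper bound on age would remove. This is a minimal sketch on made-up ages; the constant name `MAX_PLAUSIBLE_AGE` is hypothetical, and its value of 130 mirrors the cutoff used in the next cell.

```python
import pandas as pd

# Made-up ages for illustration, including two implausible values.
toy = pd.DataFrame({"Age": [0, 25, 78, 119, 130, 151]})

MAX_PLAUSIBLE_AGE = 130  # assumed cutoff; ages at or above this are dropped

# Inspect the rows that would be removed before filtering them out.
implausible = toy[toy["Age"] >= MAX_PLAUSIBLE_AGE]
cleaned = toy[toy["Age"] < MAX_PLAUSIBLE_AGE]

print(f"Removing {len(implausible)} of {len(toy)} rows")
print(cleaned["Age"].max())
```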

In [None]:
df2=df2[df2['Age']<130]

In [None]:
df2['Age'].value_counts(ascending=True)

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(x=df2['Age'])
plt.axvline(df2['Age'].median(), color='red')
plt.title("Age Distribution");

In [None]:
# TODO: compare Age against Death

In [None]:
# numeric_only=True skips the text columns (dates, IDs)
corr=df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

In [None]:
#Find out if a patient is high risk or not
hr=df[

In [None]:
hr.head()

In [None]:
for col in df2:
    print(df2[col].unique())

In [None]:
for c in df2.columns:
    n_unique = len(df2[c].unique())
    if 3 <= n_unique <= 10:
        print(df2[c].value_counts(normalize=True))