# Data Wrangling

In this project, I will look at COVID-19 data from Mexico to predict which patients are high risk and then find the best locations to send supplies.

This data was found on Kaggle and originates from the Mexican Government's General Directorate of Epidemiology.

(https://www.kaggle.com/datasets/marianarfranklin/mexico-covid19-clinical-data)

(https://www.gob.mx/salud/documentos/datos-abiertos-152127)

The data is mostly made up of categorical data, but also includes date-time data and location data.


Below is an explanation of each feature along with how these will be preprocessed to allow for better analysis and model building.


'FECHA_ACTUALIZACION' : 'Updated Date'


- The database is updated daily, so this shows the date of the most recent update

'ID_REGISTRO' : 'Registration ID'

- Case Identifier Number


'ORIGEN' : 'Origin'

'SECTOR' : 'Sector'     

- Type of institution that is providing care

'ENTIDAD_UM' : 'Entity'

- Entity or State 

'SEXO' : 'Sex'

- Sex of the Patient 
- 1 is Female : 2 is Male

'ENTIDAD_NAC' : 'Birth Entity'

'ENTIDAD_RES' : 'Residence Entity' 

'MUNICIPIO_RES' : 'Residence Municipality'

'TIPO_PACIENTE' : 'Patient Type'
- 1 if the patient was sent home (outpatient)
- 2 if patient was admitted to the hospital (hospitalized)
- 99 is unspecified 

'FECHA_INGRESO' : 'Date Entry'    
- Date of admission

'FECHA_SINTOMAS' : 'Date Symptoms'
- Date on which the patient's symptoms began

'FECHA_DEF' : 'Date Died'
- Actual date of death if the patient died
- 9999-99-99 if the patient did not die; in that case, how long the patient was in the hospital is unknown

'INTUBADO' : 'Intubated'
- If patient required intubation
- 1 if the patient was intubated : 2 if not
- 97 and 99 are considered missing values. 

'NEUMONIA' : 'Pneumonia'
- Diagnosed with pneumonia
- 1 if the patient was found to have pneumonia:  2 if not. 
- 99 is missing data.

'EDAD' : 'Age'
- Age of patient

'NACIONALIDAD' : 'Nationality'

'EMBARAZO' : 'Pregnant'
- 1 if the patient was pregnant : 2 if not
- 97 and 98 are missing values.

'HABLA_LENGUA_INDIG' : 'Language'
- If patient speaks indigenous language

'INDIGENA' : 'Indigenous'
- If patient self-identifies as an indigenous person

'DIABETES' : 'Diabetes'
- 1 if the patient has diabetes : 2 if not
- 98 is missing data. 
- There does not appear to be a distinction between Type 1 and Type 2 diabetes in this data

'EPOC' : 'COPD' 
- 1 if the patient has Chronic Obstructive Pulmonary Disease : 2 if not
- 98 is missing data.

'ASMA' : 'Asthma' 
- 1 if the patient has asthma : 2 if not
- 98 is missing data

'INMUSUPR' : 'Immunosuppresed' 
- 1 if the patient is immunosuppressed : 2 if not
- 98 is missing data.

'HIPERTENSION' : 'Hypertension' 
- 1 if the patient has hypertension : 2 if not. 
- 98 is missing data.

'OTRA_COM' : 'Other Disease'
- 1 if the patient has some other unspecified disease : 2 if not. 
- 98 is missing data.

'CARDIOVASCULAR' : 'Cardiovascular'
- 1 if the patient has a cardiovascular disease : 2 if not. 
- 98 is missing data.

'OBESIDAD' : 'Obesity'
- 1 if the patient is classified as obese : 2 if not. 
- 98 is missing data.

'RENAL_CRONICA' : 'Chronic Kidney' 
- 1 if the patient has chronic renal disease : 2 if not. 
- 98 is missing data

'TABAQUISMO' : 'Tobacco'
- 1 if the patient uses tobacco products : 2 if not. 
- 98 is missing data.

'OTRO_CASO' : 'COVID Contact'
- If the patient had contact with any case diagnosed with SARS-CoV-2; the timeframe is not specified

'TOMA_MUESTRA_LAB' : 'Lab Sample'
- If there was a lab sample taken

'RESULTADO_LAB' : 'Sample Result'
- Lab sample result

'TOMA_MUESTRA_ANTIGENO' : 'Antigen Sample'
- If there was an antigen sample taken

'RESULTADO_ANTIGENO' : 'Antigen Result'
- Antigen Sample Result

'CLASIFICACION_FINAL' : 'COVID' 
- Values of 1 - 3 signify the patient tested positive for COVID-19 in some degree
- 4 and above means the test was either inconclusive or negative. 

'MIGRANTE' : 'Migrant'
- If the patient is a migrant

'PAIS_NACIONALIDAD' : 'Previous Nationality'
- Patient nationality (country)

'PAIS_ORIGEN' : 'Previous Origin'
- The country the patient left to come to Mexico

'UCI' : 'ICU'
- 1 if the patient was admitted to the ICU : 2 if not. 
- 97 and 99 represent missing values. 
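
The coding scheme described in the list above (1 for yes, 2 for no, and 97/98/99 for missing) could be turned into a conventional 1/0/NaN encoding along these lines. This is only a minimal sketch: the helper name `recode_yes_no` and the sample values are made up for illustration and are not part of the notebook or the dataset.

```python
import pandas as pd

def recode_yes_no(series: pd.Series) -> pd.Series:
    """Map 1 -> 1.0 (yes) and 2 -> 0.0 (no); any other code becomes NaN."""
    # Series.map leaves values that are absent from the dict as NaN,
    # which covers the 97, 98 and 99 codes described above.
    return series.map({1: 1.0, 2: 0.0})

# Made-up illustrative values, not real records from the dataset.
sample = pd.DataFrame({"DIABETES": [1, 2, 98, 2], "UCI": [97, 1, 2, 99]})
recoded = sample.apply(recode_yes_no)

print(recoded)
print(recoded.isna().sum())  # number of missing values per column
```

Recoding this way means `isnull()` and similar pandas tools will see these values as missing, which they do not when the codes are left as integers.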


In [1]:
#Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
df=pd.read_csv('221227COVID19MEXICO.csv')


In [None]:
df.info()

Starting out with 6330966 rows and 40 columns

In [None]:
df.head(5)

In [None]:
cols = df.columns.values
cols

I will rename the columns to English so that each one is easier to refer back to and understand.

In [3]:
# Rename the Spanish column names to English equivalents
df2 = df.rename(columns={
    'FECHA_ACTUALIZACION' : 'Updated Date', 
    'ID_REGISTRO' : 'Registration ID', 
    'ORIGEN':'Origin', 
    'SECTOR':'Sector',     
    'ENTIDAD_UM': 'Entity', 
    'SEXO':'Sex', 
    'ENTIDAD_NAC':'Birth Entity', 
    'ENTIDAD_RES':'Residence Entity', 
    'MUNICIPIO_RES':'Residence Municipality', 
    'TIPO_PACIENTE':'Patient Type', 
    'FECHA_INGRESO':'Date Entry',    
    'FECHA_SINTOMAS':'Date Symptoms', 
    'FECHA_DEF':'Date Died', 
    'INTUBADO':'Intubated', 
    'NEUMONIA':'Pneumonia', 
    'EDAD':'Age',
    'NACIONALIDAD':'Nationality', 
    'EMBARAZO':'Pregnant', 
    'HABLA_LENGUA_INDIG':'Language', 
    'INDIGENA':'Indigenous',   
    'DIABETES':'Diabetes', 
    'EPOC':'COPD', 
    'ASMA':'Asthma', 
    'INMUSUPR':'Immunosuppresed', 
    'HIPERTENSION':'Hypertension', 
    'OTRA_COM':'Other Disease',
    'CARDIOVASCULAR':'Cardiovascular', 
    'OBESIDAD':'Obesity', 
    'RENAL_CRONICA':'Chronic Kidney', 
    'TABAQUISMO':'Tobacco',
    'OTRO_CASO':'COVID Contact', 
    'TOMA_MUESTRA_LAB':'Lab Sample', 
    'RESULTADO_LAB':'Sample Result',   
    'TOMA_MUESTRA_ANTIGENO':'Antigen Sample', 
    'RESULTADO_ANTIGENO':'Antigen Result', 
    'CLASIFICACION_FINAL':'COVID', 
    'MIGRANTE':'Migrant', 
    'PAIS_NACIONALIDAD':'Previous Nationality',
    'PAIS_ORIGEN':'Previous Origin', 
    'UCI':'ICU'
    })

df2.head(5)

Unnamed: 0,Updated Date,Registration ID,Origin,Sector,Entity,Sex,Birth Entity,Residence Entity,Residence Municipality,Patient Type,...,COVID Contact,Lab Sample,Sample Result,Antigen Sample,Antigen Result,COVID,Migrant,Previous Nationality,Previous Origin,ICU
0,2022-12-27,10e0db,1,12,20,2,20,20,67,1,...,1,2,97,1,2,7,99,México,97,97
1,2022-12-27,0989f5,2,12,14,1,32,14,71,1,...,2,2,97,1,1,3,99,México,97,97
2,2022-12-27,01e27d,2,9,25,2,25,25,1,1,...,2,2,97,1,2,7,99,México,97,97
3,2022-12-27,180725,2,9,9,2,9,9,12,2,...,2,2,97,1,2,7,99,México,97,2
4,2022-12-27,0793b8,2,12,9,2,9,9,10,1,...,2,2,97,1,2,7,99,México,97,97


In [None]:
df2.nunique()

A few things here will need adjusting:
1. Updated Date has only one value, so we can drop it.
2. Registration ID is different for every row, so we probably will not need it either.
3. Age has 128 different values, which does not seem right; we need to look into this more.
4. We do not really need the columns stating where the patient came from, since supplies should go to where the patient currently is.
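
The first two checks in the list above can be automated from the unique-value counts. The following is a small sketch under assumptions: the helper `constant_and_id_like` and the toy frame are hypothetical and only imitate the shape of the real data.

```python
import pandas as pd

def constant_and_id_like(frame: pd.DataFrame):
    """Return (columns with a single value, columns unique on every row)."""
    counts = frame.nunique()
    constant = counts[counts <= 1].index.tolist()
    id_like = counts[counts == len(frame)].index.tolist()
    return constant, id_like

# Toy frame with made-up values that imitates the renamed columns.
sample = pd.DataFrame({
    "Updated Date": ["2022-12-27"] * 4,
    "Registration ID": ["a1", "b2", "c3", "d4"],
    "Sex": [1, 2, 2, 1],
})

constant, id_like = constant_and_id_like(sample)
print("Constant columns:", constant)
print("ID-like columns:", id_like)
```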

In [4]:
df2=df2.drop(['Updated Date'], axis=1)

In [None]:
# Show any fully duplicated rows
df2[df2.duplicated()]

In [None]:
missing = pd.concat([df2.isnull().sum(), 100 * df2.isnull().mean()], axis=1)
missing.columns=['count','%']
missing.sort_values(by='count')

In [5]:
# Drop the columns describing where the patient came from
df2=df2.drop(['Migrant', 'Previous Origin', 'Birth Entity', 'Previous Nationality'], axis=1)

In [6]:
import numpy as np
from scipy import stats 

I will use chi-square tests to check for relationships between the categorical variables.
- Null hypothesis: there is no relationship between 'COVID' (CLASIFICACION_FINAL, whether the patient has COVID) and the other feature
- Significance level: 0.05
- If the p-value is less than the significance level, the null hypothesis is rejected, meaning there could be a relationship between the two features

In [7]:
compare = pd.crosstab(df['CLASIFICACION_FINAL'],df['OBESIDAD'])
print(compare)

OBESIDAD                 1        2      98
CLASIFICACION_FINAL                        
1                      4200    68968    337
2                        39      374      1
3                    201261  2852207  11328
4                        12      332      0
5                       546     8566     27
6                     10254   188798    677
7                    153059  2818490  11490


In [8]:
chi2, p, dof, ex=stats.chi2_contingency(compare)
print(f'Chi_square value: {chi2} \nP-Value: {p}\nDegrees of Freedom: {dof}\nExpected: {ex}')

Chi_square value: 5902.031143831834 
P-Value: 0.0
Degrees of Freedom: 12
Expected: [[4.28854228e+03 6.89394338e+04 2.77023964e+02]
 [2.41542276e+01 3.88285499e+02 1.56027374e+00]
 [1.78811063e+05 2.87443440e+06 1.15505331e+04]
 [2.00701795e+01 3.22633361e+02 1.29645934e+00]
 [5.33201658e+02 8.57135549e+03 3.44428544e+01]
 [1.16528979e+04 1.87323368e+05 7.52734091e+02]
 [1.74041070e+05 2.79775552e+06 1.12424092e+04]]
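
To make the output above less of a black box, the expected counts and the chi-square statistic can be recomputed by hand from the crosstab. This is a sketch using only NumPy, with the observed counts copied from the table printed earlier; it should reproduce the first expected value (about 4288.54) and the 12 degrees of freedom reported by `chi2_contingency`.

```python
import numpy as np

# Observed counts copied from the crosstab above
# (rows: CLASIFICACION_FINAL 1-7; columns: OBESIDAD 1, 2, 98).
observed = np.array([
    [4200, 68968, 337],
    [39, 374, 1],
    [201261, 2852207, 11328],
    [12, 332, 0],
    [546, 8566, 27],
    [10254, 188798, 677],
    [153059, 2818490, 11490],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (7, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals @ col_totals / n

# Pearson chi-square statistic and its degrees of freedom.
chi_square = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Expected[0, 0]: {expected[0, 0]:.2f}")
print(f"Chi-square: {chi_square:.2f}, degrees of freedom: {dof}")
```

With more than six million rows, even a weak association gives a p-value that rounds to 0.0, so a tiny p-value here says little about how strong the relationship is.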


In [9]:
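feature=[]
pvalue=[]

# Run a chi-square test of each integer-coded column against the COVID classification
for c in df2.columns:
    if(df2[c].dtype == np.int64):
        compare = pd.crosstab(df2['COVID'],df2[c])
        chi2, p, dof, ex=stats.chi2_contingency(compare)
        pvalue.append(p)
        feature.append(c)

# Build the summary table once, after the loop has finished
df_pvalue = pd.DataFrame(list(zip(feature, pvalue)),columns=['feature','pvalue'])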

In [10]:
df_pvalue

Unnamed: 0,feature,pvalue
0,Origin,0.0
1,Sector,0.0
2,Entity,0.0
3,Sex,5.231144e-113
4,Residence Entity,0.0
5,Residence Municipality,0.0
6,Patient Type,0.0
7,Intubated,0.0
8,Pneumonia,0.0
9,Age,0.0


In [11]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import StandardScaler
from numpy import array

In [12]:
df3=df2.drop(['Registration ID','Date Entry','Date Symptoms','Date Died'],axis=1)

In [13]:
y=df3['COVID']
X=df3.drop(columns=['COVID'])

select=SelectKBest(chi2,k=20)

In [14]:
print(select)

SelectKBest(k=20, score_func=<function chi2 at 0x000002A6FC1DD4C0>)


In [16]:
# Store the transformed array under its own name so the fitted selector is not overwritten
X_selected=select.fit_transform(X,y)

In [18]:
print(X_selected)

[[ 1 12 20 ...  1  2 97]
 [ 2 12 14 ...  1  1 97]
 [ 2  9 25 ...  1  2 97]
 ...
 [ 2 12 15 ...  2 97 97]
 [ 2 12 15 ...  2 97 97]
 [ 2 12 15 ...  2 97 97]]


In [21]:
X_selected

array([[ 1, 12, 20, ...,  1,  2, 97],
       [ 2, 12, 14, ...,  1,  1, 97],
       [ 2,  9, 25, ...,  1,  2, 97],
       ...,
       [ 2, 12, 15, ...,  2, 97, 97],
       [ 2, 12, 15, ...,  2, 97, 97],
       [ 2, 12, 15, ...,  2, 97, 97]], dtype=int64)
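
The array above no longer carries column names, so it is hard to tell which features were kept. One way to recover them is to keep the fitted selector and ask it for its boolean mask with `get_support()`. This is a sketch on made-up data; `X_demo`, `y_demo` and `selector` are hypothetical names, not variables from this notebook.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up illustrative data; the column names mirror the renamed frame
# but the values are invented.
X_demo = pd.DataFrame({
    "Pneumonia": [1, 2, 2, 1, 2, 1, 2, 2],
    "Sex":       [1, 2, 1, 2, 1, 2, 1, 2],
    "Diabetes":  [1, 2, 2, 1, 2, 1, 2, 1],
})
y_demo = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

# Keep the fitted selector in its own variable.
selector = SelectKBest(chi2, k=2)
X_top = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected_names = X_demo.columns[selector.get_support()].tolist()
print(selected_names)
```

Note that scikit-learn's `chi2` expects non-negative features such as counts or 0/1 indicators, so integer codes like 97 or 99 are treated as large magnitudes rather than categories. One-hot encoding the categorical columns first may give more meaningful scores.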

In [None]:
df2['Date Died'].value_counts()

In [None]:
# Flag deaths: the placeholder date 9999-99-99 means no death was recorded
df2['Death'] = (df2['Date Died'] != '9999-99-99').astype(int)

0 - no death date is recorded, so the person is treated as still alive
1 - the person has a death date and has died
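
The same flag can also be derived while parsing the dates, which additionally gives the time between symptom onset and death. This is a sketch on made-up rows; the `Days To Death` column is a hypothetical addition and not something the notebook computes.

```python
import pandas as pd

# Made-up illustrative rows, not real records.
sample = pd.DataFrame({
    "Date Symptoms": ["2022-01-03", "2022-02-10", "2022-03-01"],
    "Date Died": ["9999-99-99", "2022-02-20", "9999-99-99"],
})

# errors="coerce" turns the 9999-99-99 placeholder into NaT (missing).
died_on = pd.to_datetime(sample["Date Died"], format="%Y-%m-%d", errors="coerce")
onset = pd.to_datetime(sample["Date Symptoms"], format="%Y-%m-%d")

sample["Death"] = died_on.notna().astype(int)
sample["Days To Death"] = (died_on - onset).dt.days  # NaN when no death date
print(sample)
```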

In [None]:
ax = sns.countplot(x=df2['Death'])
# Sort counts by category (0, 1) so the labels line up with the bar order
abs_values = df2['Death'].value_counts().sort_index().values
ax.bar_label(container=ax.containers[0], labels=abs_values)
plt.show()

In [None]:
df2['Date Symptoms'].value_counts()

In [None]:
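plt.figure(figsize=(10,5))
# Parse the date strings so the x-axis is a real time axis rather than text categories
sns.histplot(x=pd.to_datetime(df2['Date Symptoms']))
plt.title("Number of Cases by Symptom Onset Date");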

To do: a bar chart of symptom-onset dates, then a stacked chart of positive vs. negative cases.
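
The table behind such a stacked chart could be built with `pd.crosstab`, counting positive and non-positive cases per symptom-onset date. This is a sketch on made-up rows; the `Positive` column and the `daily` table are hypothetical names, and the cutoff of codes 1 to 3 follows the feature list at the top of this notebook.

```python
import pandas as pd

# Made-up illustrative rows, not real records.
sample = pd.DataFrame({
    "Date Symptoms": ["2022-01-03", "2022-01-03", "2022-01-04",
                      "2022-01-04", "2022-01-04"],
    "COVID": [3, 7, 1, 7, 6],
})

# Classification codes 1-3 count as positive; 4 and above do not.
sample["Positive"] = sample["COVID"] < 4

# One row per date, one column per outcome.
daily = pd.crosstab(pd.to_datetime(sample["Date Symptoms"]), sample["Positive"])
daily = daily.rename(columns={False: "Negative or inconclusive", True: "Positive"})
print(daily)

# daily.plot(kind="bar", stacked=True) would then draw the stacked chart.
```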

In [None]:
df2['Age'].value_counts()

In [None]:
df2['Age'].value_counts(ascending=True).head(20)

According to various news articles, the oldest men and women in the world were recorded to be in their 110s to 120s.

Life expectancy in Mexico is about 77-78 years.
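
Before dropping rows, it can help to count how many ages fall outside a plausible range. This is a sketch on made-up values; `MAX_PLAUSIBLE_AGE` is a hypothetical name set to the same 130-year cutoff used in this notebook.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 130  # assumed cutoff; adjust if a stricter limit is preferred

# Made-up illustrative ages, not real records.
ages = pd.Series([0, 34, 77, 101, 118, 131, 150], name="Age")

implausible = ages >= MAX_PLAUSIBLE_AGE
print(f"{implausible.sum()} of {len(ages)} rows would be removed")

kept = ages[~implausible]
print(kept.describe())
```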

In [None]:
df2=df2[df2['Age']<130]

In [None]:
df2['Age'].value_counts(ascending=True)

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(x=df2['Age'])
plt.axvline(df2['Age'].median(), color='red')
plt.title("Age Distribution");

In [None]:
# Boxplots of AGE by SURVIVED status
AGE_SURVIVED = df2[df2.Death == 0].Age
AGE_DIED = df2[df2.Death == 1].Age
plt.boxplot(x=[AGE_SURVIVED, AGE_DIED], labels=['SURVIVED (0)', 'DIED (1)'])
plt.ylabel("AGE")
plt.title("Boxplots of AGE by SURVIVED status")
plt.show()

# Mean age for each group
print("\nMean age of all patients: {} years".format(round(df2.Age.mean(), 1)))
print("\nMean age of those who survived: {} years".format(round(AGE_SURVIVED.mean(), 1)))
print("\nMean age of those who died: {} years".format(round(AGE_DIED.mean(), 1)))

In [None]:
# Check how many people actually tested positive
percent_positive = round((df2['COVID'] < 4).mean() * 100, 1)
print("{}% of patients tested positive for COVID in the original dataset.".format(percent_positive))

# Only keep the patients who tested positive (classification codes 1-3)
positive = df2[df2['COVID'] < 4].copy()
print(positive.shape)


In [None]:
# Number of people who survived / died and their percentages
# Note: df2 still includes patients who did not test positive; use `positive` to restrict to COVID cases
survived = (df2['Death'] == 0).sum()
survived_percent = round(survived / len(df2) * 100, 1)
died = (df2['Death'] == 1).sum()
died_percent = round(died / len(df2) * 100, 1)

print("{} patients ({}%) survived.".format(survived, survived_percent))
print("{} patients ({}%) died.".format(died, died_percent))

In [None]:
# Age and death

In [None]:
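# Assumed intent, based on the note in the previous cell: stacked bars of deaths and survivors by age
counts = pd.crosstab(df2['Age'], df2['Death'])
x = counts.index
y1 = counts[1]  # patients who died
y2 = counts[0]  # patients who survived

plt.bar(x, y1, color='r', label='Died')
plt.bar(x, y2, bottom=y1, color='b', label='Survived')
plt.legend()
plt.show()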

In [None]:
# Correlate only the numeric columns; dates and IDs are stored as strings
corr=df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

In [None]:
#Find out if a patient is high risk or not
hr=df[

In [None]:
hr.head()

In [None]:
for col in df2:
    print(df2[col].unique())

In [None]:
# Show value proportions for columns with between 3 and 10 distinct values
for c in df2.columns:
    if 3 <= len(df2[c].unique()) <= 10:
        print(df2[c].value_counts(normalize=True))