# Data cleaning

#### This is the second dataset I work with. The goal is to determine the beneficiary population aged 0 to 4 years assigned to each UMF (Unidad de Medicina Familiar) and to answer whether it has changed over the last 10 years.
In this notebook I explore and clean two files: pda_2013_01_31 and pda_2023_01_31. They were downloaded from http://datos.imss.gob.mx/dataset/pda-2013 and http://datos.imss.gob.mx/dataset/pda-2023, respectively.

1. First, I import the 2013 dataset, followed by the 2023 dataset. For each one I examine the variable types, the NaN values, and the zero values.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
beneficiarios2013 = pd.read_csv('https://drive.google.com/file/d/1t5GxpSE-80OI_kYxcsPAfOVUxxDk4Qha/view?usp=share_link', encoding="ISO-8859-1", sep="|")
display(beneficiarios2013.head())
beneficiarios2013.shape

Unnamed: 0,ID_DELEG_RP,ID_SUBDEL_RP,ID_UMF_RP,NOMBRE_UMF_RP,ST_TIT_FAM,ID_CALIDAD,CVE_GENERO,CVE_RANGO_EDAD,ST_CONSULTORIO,ID_TURNO,ID_CONSULTORIO,TOT_CASOS
0,27,70,31,UMF 031,2,13,1,E10,1,M,1,28
1,27,70,31,UMF 031,2,2,1,E25,1,M,1,1
2,27,70,31,UMF 031,2,14,2,E4,1,M,1,6
3,27,70,31,UMF 031,2,12,2,E23,1,M,1,8
4,27,70,31,UMF 031,2,16,2,E10,1,M,1,2


(3070864, 12)
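A note on the source URL: a Google Drive share link of the form `/file/d/<id>/view` normally serves an HTML viewer page rather than the raw file, so `pd.read_csv` may not receive CSV data from it. The sketch below shows one common workaround, rewriting the link into the `uc?export=download` form. The helper name `drive_direct_url` is hypothetical, and for large files Drive may still return a confirmation page, in which case downloading the file manually and reading it from a local path is the safer option.

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Hypothetical helper: turn a '/file/d/<id>/view' share link into a direct-download URL."""
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"No file id found in {share_url!r}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


share = "https://drive.google.com/file/d/1t5GxpSE-80OI_kYxcsPAfOVUxxDk4Qha/view?usp=share_link"
direct = drive_direct_url(share)
print(direct)
```

The resulting URL can be passed to `pd.read_csv` with the same `encoding` and `sep` arguments used above.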

In [3]:
beneficiarios2013.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3070864 entries, 0 to 3070863
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   ID_DELEG_RP     int64 
 1   ID_SUBDEL_RP    int64 
 2   ID_UMF_RP       int64 
 3   NOMBRE_UMF_RP   object
 4   ST_TIT_FAM      int64 
 5   ID_CALIDAD      int64 
 6   CVE_GENERO      int64 
 7   CVE_RANGO_EDAD  object
 8   ST_CONSULTORIO  int64 
 9   ID_TURNO        object
 10  ID_CONSULTORIO  int64 
 11  TOT_CASOS       int64 
dtypes: int64(9), object(3)
memory usage: 281.1+ MB


In [4]:
beneficiarios2013.isna().sum()

ID_DELEG_RP       0
ID_SUBDEL_RP      0
ID_UMF_RP         0
NOMBRE_UMF_RP     0
ST_TIT_FAM        0
ID_CALIDAD        0
CVE_GENERO        0
CVE_RANGO_EDAD    0
ST_CONSULTORIO    0
ID_TURNO          0
ID_CONSULTORIO    0
TOT_CASOS         0
dtype: int64

In [5]:
nulls_df = pd.DataFrame(round(beneficiarios2013.isna().sum()/len(beneficiarios2013),4)*100)
nulls_df = nulls_df.reset_index()
nulls_df.columns = ['header_name', 'percent_nulls']
display(nulls_df)

Unnamed: 0,header_name,percent_nulls
0,ID_DELEG_RP,0.0
1,ID_SUBDEL_RP,0.0
2,ID_UMF_RP,0.0
3,NOMBRE_UMF_RP,0.0
4,ST_TIT_FAM,0.0
5,ID_CALIDAD,0.0
6,CVE_GENERO,0.0
7,CVE_RANGO_EDAD,0.0
8,ST_CONSULTORIO,0.0
9,ID_TURNO,0.0


In [6]:
for col in beneficiarios2013.columns:
    count_zeros = beneficiarios2013[col].value_counts().get(0, 0)
    print(f'Column {col} has {count_zeros} zero values')

Column ID_DELEG_RP has 0 zero values
Column ID_SUBDEL_RP has 0 zero values
Column ID_UMF_RP has 0 zero values
Column NOMBRE_UMF_RP has 64797 zero values
Column ST_TIT_FAM has 0 zero values
Column ID_CALIDAD has 0 zero values
Column CVE_GENERO has 0 zero values
Column CVE_RANGO_EDAD has 172222 zero values
Column ST_CONSULTORIO has 134470 zero values
Column ID_TURNO has 1513551 zero values
Column ID_CONSULTORIO has 97970 zero values
Column TOT_CASOS has 0 zero values
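
The counts reported above for the object-typed columns (NOMBRE_UMF_RP, CVE_RANGO_EDAD, ID_TURNO) are probably not zero counts. When the index of `value_counts()` holds strings, `.get(0, 0)` can fall back to a positional lookup and return the count of the most frequent value instead; 172222, for example, is also the count of 'E10' in the CVE_RANGO_EDAD frequencies. Only the figures for the integer columns should be read as true zero counts. A minimal sketch of a stricter check, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame (made-up values) with one integer column and one string column.
toy = pd.DataFrame({
    "ST_CONSULTORIO": [1, 0, 1, 0],
    "ID_TURNO": ["M", "M", "V", "0"],
})

# Compare every cell with the integer 0; strings such as "M" or "0" never match.
zero_counts = (toy == 0).sum()
print(zero_counts)

# If the text "0" should also count as a zero code, compare as strings instead.
zero_like_counts = (toy.astype(str) == "0").sum()
print(zero_like_counts)
```

Which of the two variants is appropriate depends on whether a string "0" in columns like ID_TURNO is meant as a code for zero.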


In [7]:
beneficiarios2013.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID_DELEG_RP,3070864.0,19.620509,10.8163,1.0,12.0,18.0,27.0,40.0
ID_SUBDEL_RP,3070864.0,18.60922,21.415858,1.0,2.0,9.0,33.0,80.0
ID_UMF_RP,3070864.0,44.059968,46.870837,1.0,11.0,32.0,60.0,250.0
ST_TIT_FAM,3070864.0,1.811528,0.391089,1.0,2.0,2.0,2.0,2.0
ID_CALIDAD,3070864.0,9.570846,6.269501,1.0,2.0,13.0,14.0,40.0
CVE_GENERO,3070864.0,1.530917,0.499043,1.0,1.0,2.0,2.0,2.0
ST_CONSULTORIO,3070864.0,0.956211,0.204625,0.0,1.0,1.0,1.0,1.0
ID_CONSULTORIO,3070864.0,127.387324,1082.604295,0.0,2.0,6.0,13.0,9998.0
TOT_CASOS,3070864.0,16.221569,56.234139,1.0,2.0,6.0,15.0,12262.0


In [8]:
beneficiarios2013['CVE_GENERO'].unique()

array([1, 2])

In [9]:
beneficiarios2013['CVE_GENERO'].value_counts()

CVE_GENERO
2    1630375
1    1440489
Name: count, dtype: int64

In [10]:
beneficiarios2013['ID_UMF_RP'].unique()

array([ 31,  14,  69,  22,  13,  41,   6,  26,   9,   2, 128,  56,  20,
        50,  51,  47,  16, 245,  53,  12,  49,  11,   1,  17,  25,  42,
        30,   3,  18,  10,  38,  43,  32,  55,  24,  73,   5,  48, 137,
        15,  23,  54,  19,  94,  29,  44,   4,  34,  37, 187, 192, 235,
        39,   7,  36, 161,  70,  66,  71,  40,  28,  45,  21,  68, 230,
        46,  52, 120,   8, 160,  67,  93,  79, 195, 193, 181, 180,  33,
        58, 189,  84,  83,  81,  92,  57,  64,  62,  65,  61,  63,  77,
        91, 198,  60,  74,  72, 234,  35, 184, 165, 236, 243, 240, 244,
        85,  27, 114, 167,  88, 171,  78,  59, 190,  87, 188,  86, 227,
       226, 186, 100, 178, 246, 250, 191, 140, 177, 223,  97, 237,  96,
       162, 231,  80,  75, 182, 183,  76, 169, 130,  89,  95, 170, 179,
        82, 132, 106, 248, 247, 228, 249, 238,  98, 239, 241, 229, 242,
       159, 168, 155, 222, 220, 225, 185, 133, 233, 232, 224])

In [11]:
beneficiarios2013['ID_UMF_RP'].value_counts()

ID_UMF_RP
2      113225
1      111472
7       80299
3       70655
9       69257
        ...  
137       281
238       270
237       267
98        264
106       264
Name: count, Length: 167, dtype: int64

In [12]:
beneficiarios2013['ST_TIT_FAM'].unique()

array([2, 1])

In [13]:
beneficiarios2013['ST_TIT_FAM'].value_counts()

ST_TIT_FAM
2    2492091
1     578773
Name: count, dtype: int64

In [14]:
beneficiarios2013['ID_CALIDAD'].unique()

array([13,  2, 14, 12, 16, 15,  1, 21, 11, 40, 17,  6, 19, 20,  3, 18,  7,
        4, 25,  5, 10, 30, 27,  9, 24, 28,  8, 38, 33, 34, 32, 26, 29, 36,
       31, 35, 37, 39])

In [15]:
beneficiarios2013['ID_CALIDAD'].value_counts()

ID_CALIDAD
1     578773
13    455713
14    401725
2     385447
15    348402
16    214914
6     166028
12    163970
11    140517
17     91577
18     34894
3      31492
19     13421
7       7772
40      7307
5       6954
20      5826
10      3341
4       2744
21      2629
24      1552
30       858
25       801
27       774
8        691
28       524
26       454
29       364
9        311
31       259
33       167
39       166
32       159
38        90
34        72
37        65
36        56
35        55
Name: count, dtype: int64

In [16]:
beneficiarios2013['TOT_CASOS'].unique()

array([  28,    1,    6, ...,  905, 4257, 1627])

In [17]:
beneficiarios2013['TOT_CASOS'].value_counts()

TOT_CASOS
1       699180
2       299720
3       198432
4       158593
5       138226
         ...  
2368         1
1308         1
2323         1
1300         1
1627         1
Name: count, Length: 1939, dtype: int64

In [18]:
beneficiarios2013['ST_CONSULTORIO'].unique()

array([1, 0])

In [19]:
beneficiarios2013['ST_CONSULTORIO'].value_counts()

ST_CONSULTORIO
1    2936394
0     134470
Name: count, dtype: int64

In [20]:
beneficiarios2013['CVE_RANGO_EDAD'].unique()

array(['E10', 'E25', 'E4', 'E23', 'E21', 'E5', 'E24', 'E3', 'E2', 'E14',
       'E0', 'E17', 'E22', 'E8', 'E11', 'E19', 'E13', 'E6', 'E9', 'E28',
       'E7', 'E1', 'E16', 'E12', 'E26', 'E27', 'E20', 'E18', 'E15', 'ND'],
      dtype=object)

In [21]:
beneficiarios2013['CVE_RANGO_EDAD'].value_counts()

CVE_RANGO_EDAD
E10    172222
E11    120730
E20    118419
E9     118083
E8     117583
E7     117198
E6     116257
E15    116237
E19    116048
E21    115616
E5     115447
E4     114036
E3     112398
E2     111100
E22    110534
E1     109072
E18    107035
E23    105259
E0     101541
E24     99129
E17     98066
E16     96041
E14     95809
E25     92655
E26     85803
E27     77906
E28     77100
E13     73827
E12     56413
ND       3300
Name: count, dtype: int64

2. After reviewing the data, I cleaned the dataset with the following steps:
- Drop rows with NaN values (none were found, so this is only a safeguard).
- Keep the 0 values: they are part of the dataset's coding, so I work with all of them.
- Drop the columns I don't need for the analysis. I keep only 'ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP', 'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_GENERO', 'CVE_RANGO_EDAD', 'ST_CONSULTORIO', 'TOT_CASOS'.
- Make column names lowercase.
- Rename ID_DELEG_RP and ID_SUBDEL_RP to match the names used in the asegurados_clean dataset, and rename CVE_GENERO to CVE_SEXO so the 2013 data can be concatenated with the 2023 data.
- Add a date column to identify the year of each record.

In [22]:
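def clean_beneficiarios2013(df):
    # Drop the columns that are not needed for the analysis.
    df = df.drop(columns=['ID_TURNO', 'ID_CONSULTORIO'])
    # Match the names used in asegurados_clean and in the 2023 file (CVE_GENERO -> CVE_SEXO).
    df = df.rename(columns={'ID_DELEG_RP': 'cve_delegacion', 'ID_SUBDEL_RP': 'cve_subdelegacion', 'CVE_GENERO': 'CVE_SEXO'})
    # Lowercase all column names.
    df.columns = [e.lower().replace(' ', '_') for e in df.columns]
    # Remove rows with missing values (none were found above).
    df = df.dropna()
    # Tag every row with the date of the snapshot.
    df['period'] = pd.to_datetime('2013-01-31')
    return df

cleaned_beneficiarios2013 = clean_beneficiarios2013(beneficiarios2013)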

In [23]:
cleaned_beneficiarios2013

Unnamed: 0,cve_delegacion,cve_subdelegacion,id_umf_rp,nombre_umf_rp,st_tit_fam,id_calidad,cve_sexo,cve_rango_edad,st_consultorio,tot_casos,period
0,27,70,31,UMF 031,2,13,1,E10,1,28,2013-01-31
1,27,70,31,UMF 031,2,2,1,E25,1,1,2013-01-31
2,27,70,31,UMF 031,2,14,2,E4,1,6,2013-01-31
3,27,70,31,UMF 031,2,12,2,E23,1,8,2013-01-31
4,27,70,31,UMF 031,2,16,2,E10,1,2,2013-01-31
...,...,...,...,...,...,...,...,...,...,...,...
3070859,39,57,35,UMF 035,1,1,2,E23,1,11,2013-01-31
3070860,12,2,26,UMF 026,2,2,2,E27,1,2,2013-01-31
3070861,14,38,2,UMF 002,1,1,1,E22,1,33,2013-01-31
3070862,20,32,28,UMF 028,2,17,2,E6,1,2,2013-01-31


3. Next, I analyze the 2023 data, repeating the same steps used for the 2013 dataset.

In [24]:
beneficiarios2023 = pd.read_csv('https://drive.google.com/file/d/1RLUAWDaxYnEikgLLqkJspvVqoVD3GS06/view?usp=share_link', encoding="ISO-8859-1", sep="|")
display(beneficiarios2023.head())
beneficiarios2023.shape

Unnamed: 0,ID_DELEG_RP,ID_SUBDEL_RP,ID_UMF_RP,NOMBRE_UMF_RP,ST_TIT_FAM,ID_CALIDAD,CVE_SEXO,CVE_RANGO_EDAD,ST_CONSULTORIO,ID_TURNO,ID_CONSULTORIO,TOT_CASOS
0,1,1,2,UMF 002,1,1,1,E10,0,0,9998,2
1,1,1,2,UMF 002,1,1,1,E10,1,M,1,1
2,1,1,2,UMF 002,1,1,1,E10,1,M,2,2
3,1,1,2,UMF 002,1,1,1,E10,1,M,3,3
4,1,1,2,UMF 002,1,1,1,E10,1,M,4,1


(3255918, 12)

In [25]:
beneficiarios2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3255918 entries, 0 to 3255917
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   ID_DELEG_RP     int64 
 1   ID_SUBDEL_RP    int64 
 2   ID_UMF_RP       int64 
 3   NOMBRE_UMF_RP   object
 4   ST_TIT_FAM      int64 
 5   ID_CALIDAD      int64 
 6   CVE_SEXO        int64 
 7   CVE_RANGO_EDAD  object
 8   ST_CONSULTORIO  int64 
 9   ID_TURNO        object
 10  ID_CONSULTORIO  int64 
 11  TOT_CASOS       int64 
dtypes: int64(9), object(3)
memory usage: 298.1+ MB


In [26]:
beneficiarios2023.isna().sum()

ID_DELEG_RP       0
ID_SUBDEL_RP      0
ID_UMF_RP         0
NOMBRE_UMF_RP     0
ST_TIT_FAM        0
ID_CALIDAD        0
CVE_SEXO          0
CVE_RANGO_EDAD    0
ST_CONSULTORIO    0
ID_TURNO          0
ID_CONSULTORIO    0
TOT_CASOS         0
dtype: int64

In [27]:
nulls_df = pd.DataFrame(round(beneficiarios2023.isna().sum()/len(beneficiarios2023),4)*100)
nulls_df = nulls_df.reset_index()
nulls_df.columns = ['header_name', 'percent_nulls']
display(nulls_df)

Unnamed: 0,header_name,percent_nulls
0,ID_DELEG_RP,0.0
1,ID_SUBDEL_RP,0.0
2,ID_UMF_RP,0.0
3,NOMBRE_UMF_RP,0.0
4,ST_TIT_FAM,0.0
5,ID_CALIDAD,0.0
6,CVE_SEXO,0.0
7,CVE_RANGO_EDAD,0.0
8,ST_CONSULTORIO,0.0
9,ID_TURNO,0.0


In [28]:
for col in beneficiarios2023.columns:
    count_zeros = beneficiarios2023[col].value_counts().get(0, 0)
    print(f'Column {col} has {count_zeros} zero values')

Column ID_DELEG_RP has 0 zero values
Column ID_SUBDEL_RP has 0 zero values
Column ID_UMF_RP has 0 zero values
Column NOMBRE_UMF_RP has 65072 zero values
Column ST_TIT_FAM has 0 zero values
Column ID_CALIDAD has 0 zero values
Column CVE_SEXO has 0 zero values
Column CVE_RANGO_EDAD has 177393 zero values
Column ST_CONSULTORIO has 39212 zero values
Column ID_TURNO has 1666971 zero values
Column ID_CONSULTORIO has 17 zero values
Column TOT_CASOS has 0 zero values


In [29]:
beneficiarios2023.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID_DELEG_RP,3255918.0,19.540911,10.789197,1.0,12.0,18.0,27.0,40.0
ID_SUBDEL_RP,3255918.0,18.415117,21.231943,1.0,2.0,8.0,33.0,80.0
ID_UMF_RP,3255918.0,45.139341,47.043765,1.0,12.0,33.0,61.0,250.0
ST_TIT_FAM,3255918.0,1.813003,0.389909,1.0,2.0,2.0,2.0,2.0
ID_CALIDAD,3255918.0,9.264421,6.130044,1.0,2.0,12.0,14.0,40.0
CVE_SEXO,3255918.0,1.528318,0.499198,1.0,1.0,2.0,2.0,2.0
ST_CONSULTORIO,3255918.0,0.987957,0.109079,0.0,1.0,1.0,1.0,1.0
ID_CONSULTORIO,3255918.0,129.302108,1089.415681,0.0,3.0,6.0,13.0,9998.0
TOT_CASOS,3255918.0,18.849417,89.836263,1.0,2.0,6.0,17.0,34472.0


In [30]:
beneficiarios2023['CVE_SEXO'].unique()

array([1, 2])

In [31]:
beneficiarios2023['CVE_SEXO'].value_counts()

CVE_SEXO
2    1720161
1    1535757
Name: count, dtype: int64

In [32]:
beneficiarios2023['ID_UMF_RP'].unique()

array([  2,   3,   5,   6,   7,   8,   9,  10,  12,   1,   4,  11,  15,
        16,  24,  26,  28,  31,  37,  40,  39,  13,  14,  22,  25,  29,
        32,  38,  17,  18,  19,  21,  27,  33,  34,  35,  36,  62,  67,
        70,  73,  82,  88,  89,  91,  20,  61,  66,  80,  83,  81,  87,
        60,  64,  79,  50,  52,  74,  84,  85,  86,  23,  41,  42,  43,
        44,  45,  30,  58,  69,  57,  54,  55,  47,  48,  56,  63,  65,
        46,  49,  53,  59,  51,  94, 128, 130, 132, 133, 159, 160, 168,
       177, 183,  95, 100, 165, 169,  71,  76, 114,  68, 106, 171, 178,
       182,  78,  93, 167, 181, 184,  72,  75,  77,  92,  96,  97,  98,
       137, 155, 162, 170, 179, 185, 186, 188, 191, 198, 180, 189, 193,
       195, 220, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232,
       233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245,
       246, 247, 248, 249, 250, 187, 190, 192, 120, 140, 161, 163, 164])

In [33]:
beneficiarios2023['ID_UMF_RP'].value_counts()

ID_UMF_RP
1      111173
2      109411
7       79475
3       72263
9       70046
        ...  
98        247
240       240
237       210
225       186
106       182
Name: count, Length: 169, dtype: int64

In [34]:
beneficiarios2023['ST_TIT_FAM'].unique()

array([1, 2])

In [35]:
beneficiarios2023['ST_TIT_FAM'].value_counts()

ST_TIT_FAM
2    2647071
1     608847
Name: count, dtype: int64

In [36]:
beneficiarios2023['ID_CALIDAD'].unique()

array([ 1,  2,  3,  4,  5,  6,  7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 26, 28, 29, 40, 25, 39,  8, 30, 32, 34,  9, 27,
       31, 37, 38, 33, 35, 36])

In [37]:
beneficiarios2023['ID_CALIDAD'].value_counts()

ID_CALIDAD
1     608847
13    463519
14    419726
2     386137
15    355440
6     257286
16    213188
12    158793
11    144218
17     87867
3      74376
18     29975
7      20255
19     10129
40      5127
20      3852
5       3851
4       3035
10      2862
21      1699
8        921
24       659
25       655
30       484
22       435
26       386
31       334
27       334
28       328
9        267
29       245
32       140
23       112
39       109
33       107
34        65
37        40
38        39
35        38
36        38
Name: count, dtype: int64

In [38]:
beneficiarios2023['TOT_CASOS'].unique()

array([   2,    1,    3, ..., 2303, 1295,  925])

In [39]:
beneficiarios2023['TOT_CASOS'].value_counts()

TOT_CASOS
1        709196
2        327113
3        220459
4        173944
5        145979
          ...  
8604          1
9798          1
10155         1
1837          1
925           1
Name: count, Length: 2188, dtype: int64

In [40]:
beneficiarios2023['ST_CONSULTORIO'].unique()

array([0, 1])

In [41]:
beneficiarios2023['ST_CONSULTORIO'].value_counts()

ST_CONSULTORIO
1    3216706
0      39212
Name: count, dtype: int64

In [42]:
beneficiarios2023['CVE_RANGO_EDAD'].unique()

array(['E10', 'E11', 'E12', 'E13', 'E14', 'E15', 'E16', 'E17', 'E18',
       'E19', 'E20', 'E21', 'E22', 'E23', 'E24', 'E25', 'E26', 'E27',
       'E28', 'E5', 'E6', 'E7', 'E8', 'E9', 'E3', 'E0', 'E1', 'E2', 'E4',
       'ND'], dtype=object)

In [43]:
beneficiarios2023['CVE_RANGO_EDAD'].value_counts()

CVE_RANGO_EDAD
E10    177393
E11    144398
E20    138214
E19    134435
E21    134150
E22    127309
E18    123854
E23    120260
E17    116595
E7     116562
E9     115770
E8     115053
E6     114770
E5     113755
E4     112165
E24    112054
E3     110207
E16    106582
E2     106443
E15    103616
E25    103285
E1     103024
E0     100530
E26     95965
E12     91623
E27     88300
E28     84816
E14     71317
E13     69530
ND       3943
Name: count, dtype: int64

In [44]:
display(beneficiarios2013.columns)
display(beneficiarios2023.columns)

Index(['ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP',
       'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_GENERO', 'CVE_RANGO_EDAD',
       'ST_CONSULTORIO', 'ID_TURNO', 'ID_CONSULTORIO', 'TOT_CASOS'],
      dtype='object')

Index(['ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP',
       'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_SEXO', 'CVE_RANGO_EDAD',
       'ST_CONSULTORIO', 'ID_TURNO', 'ID_CONSULTORIO', 'TOT_CASOS'],
      dtype='object')
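
Comparing the two column lists by eye works here, but the difference can also be computed. A small sketch using `Index.difference`, with the column names copied from the output above:

```python
import pandas as pd

cols_2013 = pd.Index(['ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP',
                      'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_GENERO', 'CVE_RANGO_EDAD',
                      'ST_CONSULTORIO', 'ID_TURNO', 'ID_CONSULTORIO', 'TOT_CASOS'])
cols_2023 = pd.Index(['ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP',
                      'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_SEXO', 'CVE_RANGO_EDAD',
                      'ST_CONSULTORIO', 'ID_TURNO', 'ID_CONSULTORIO', 'TOT_CASOS'])

# Columns present in one year but not in the other.
only_2013 = cols_2013.difference(cols_2023)
only_2023 = cols_2023.difference(cols_2013)
print(list(only_2013), list(only_2023))
```

The only mismatch is CVE_GENERO (2013) versus CVE_SEXO (2023), which is why the 2013 column has to be renamed before concatenating.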

4. After reviewing the 2023 data, I cleaned the dataset with the same steps:
- Drop rows with NaN values (none were found, so this is only a safeguard).
- Keep the 0 values: they are part of the dataset's coding, so I work with all of them.
- Drop the columns I don't need for the analysis. I keep only 'ID_DELEG_RP', 'ID_SUBDEL_RP', 'ID_UMF_RP', 'NOMBRE_UMF_RP', 'ST_TIT_FAM', 'ID_CALIDAD', 'CVE_SEXO', 'CVE_RANGO_EDAD', 'ST_CONSULTORIO', 'TOT_CASOS'.
- Make column names lowercase.
- Rename ID_DELEG_RP and ID_SUBDEL_RP to match the names used in the asegurados_clean dataset.
- Add a date column to identify the year of each record.

In [45]:
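def clean_beneficiarios2023(df):
    # Drop the columns that are not needed for the analysis.
    df = df.drop(columns=['ID_TURNO', 'ID_CONSULTORIO'])
    # Match the names used in asegurados_clean (CVE_SEXO already has the target name).
    df = df.rename(columns={'ID_DELEG_RP': 'cve_delegacion', 'ID_SUBDEL_RP': 'cve_subdelegacion'})
    # Lowercase all column names.
    df.columns = [e.lower().replace(' ', '_') for e in df.columns]
    # Remove rows with missing values (none were found above).
    df = df.dropna()
    # Tag every row with the date of the snapshot.
    df['period'] = pd.to_datetime('2023-01-31')
    return df

cleaned_beneficiarios2023 = clean_beneficiarios2023(beneficiarios2023)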

In [46]:
cleaned_beneficiarios2023

Unnamed: 0,cve_delegacion,cve_subdelegacion,id_umf_rp,nombre_umf_rp,st_tit_fam,id_calidad,cve_sexo,cve_rango_edad,st_consultorio,tot_casos,period
0,1,1,2,UMF 002,1,1,1,E10,0,2,2023-01-31
1,1,1,2,UMF 002,1,1,1,E10,1,1,2023-01-31
2,1,1,2,UMF 002,1,1,1,E10,1,2,2023-01-31
3,1,1,2,UMF 002,1,1,1,E10,1,3,2023-01-31
4,1,1,2,UMF 002,1,1,1,E10,1,1,2023-01-31
...,...,...,...,...,...,...,...,...,...,...,...
3255913,40,58,26,HGZMF 026,2,20,2,E10,1,1,2023-01-31
3255914,40,58,26,HGZMF 026,2,40,1,ND,1,1,2023-01-31
3255915,40,58,26,HGZMF 026,2,40,2,ND,1,1,2023-01-31
3255916,40,58,26,HGZMF 026,2,40,2,ND,1,1,2023-01-31
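
The two cleaning functions differ only in the snapshot date and in the extra CVE_GENERO rename. As a sketch (not the code used in this notebook), they could be merged into one function; the name `clean_beneficiarios` and its parameters `period` and `extra_renames` are hypothetical, and the toy row is made up:

```python
import pandas as pd


def clean_beneficiarios(df, period, extra_renames=None):
    """Hypothetical generic cleaner that mirrors the steps of the two per-year functions."""
    renames = {"ID_DELEG_RP": "cve_delegacion", "ID_SUBDEL_RP": "cve_subdelegacion"}
    renames.update(extra_renames or {})
    return (
        df.drop(columns=["ID_TURNO", "ID_CONSULTORIO"])               # unused columns
          .rename(columns=renames)                                     # harmonize names
          .rename(columns=lambda c: c.lower().replace(" ", "_"))      # lowercase
          .dropna()                                                    # safeguard against NaN
          .assign(period=pd.to_datetime(period))                       # snapshot date
    )


# Toy 2013-style row (made-up values) to show the call.
toy_2013 = pd.DataFrame({
    "ID_DELEG_RP": [27], "ID_SUBDEL_RP": [70], "CVE_GENERO": [1],
    "ID_TURNO": ["M"], "ID_CONSULTORIO": [1], "TOT_CASOS": [28],
})
cleaned_toy = clean_beneficiarios(toy_2013, "2013-01-31", {"CVE_GENERO": "CVE_SEXO"})
print(cleaned_toy.columns.tolist())
```

For the 2023 file the call would omit `extra_renames`, since that file already uses CVE_SEXO.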


Once both datasets are cleaned, I concatenate them and save the result in a new file.

In [47]:
beneficiarios_clean = pd.concat([cleaned_beneficiarios2013, cleaned_beneficiarios2023], axis=0)

In [48]:
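# drop=True discards the old per-file row labels instead of keeping them as an extra 'index' column
beneficiarios_clean = beneficiarios_clean.reset_index(drop=True)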
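
A minimal sketch of the concatenation on toy frames with made-up values: passing `ignore_index=True` to `pd.concat` renumbers the rows in one step, so no leftover `index` column appears, and a row-count check confirms nothing was lost. On the real data, the expected total would be 3070864 + 3255918 = 6326782 rows, assuming `dropna()` removed nothing (the NaN counts above are all 0).

```python
import pandas as pd

part_2013 = pd.DataFrame({
    "tot_casos": [28, 1],
    "period": [pd.Timestamp("2013-01-31")] * 2,
})
part_2023 = pd.DataFrame({
    "tot_casos": [2, 1, 3],
    "period": [pd.Timestamp("2023-01-31")] * 3,
})

# Stack the two years and renumber rows 0..n-1 in the same call.
combined = pd.concat([part_2013, part_2023], ignore_index=True)

# Sanity check: the combined frame has exactly the rows of both inputs.
assert len(combined) == len(part_2013) + len(part_2023)
print(combined.groupby("period").size())
```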

In [49]:
# beneficiarios_clean.to_csv('../Data/cleaned/beneficiarios_clean1.csv', index=False)
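
With the combined file, the question stated at the top (beneficiaries aged 0 to 4 per UMF, 2013 versus 2023) could be approached along the following lines. This is only a sketch on made-up rows. It assumes that the codes E0 to E4 in `cve_rango_edad` correspond to ages 0 to 4, which must be confirmed against the IMSS catalog, and it groups by `cve_delegacion` together with `id_umf_rp` on the assumption that UMF numbers repeat across delegations.

```python
import pandas as pd

# ASSUMPTION: E0..E4 stand for ages 0 to 4; verify against the IMSS age-range catalog.
AGE_0_4_CODES = ["E0", "E1", "E2", "E3", "E4"]

# Toy rows (made-up values) using the cleaned column names.
toy = pd.DataFrame({
    "cve_delegacion": [27, 27, 27, 1],
    "id_umf_rp": [31, 31, 31, 2],
    "cve_rango_edad": ["E0", "E4", "E10", "E2"],
    "tot_casos": [5, 6, 28, 3],
    "period": pd.to_datetime(["2013-01-31", "2013-01-31", "2013-01-31", "2023-01-31"]),
})

# Keep only the assumed 0-4 age codes, then total the cases per period and unit.
under_5 = toy[toy["cve_rango_edad"].isin(AGE_0_4_CODES)]
totals = under_5.groupby(["period", "cve_delegacion", "id_umf_rp"])["tot_casos"].sum()
print(totals)
```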