Now, let's finally look at that divorce dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('../data/divorces_2000-2015_translated.csv')
print(df.shape)
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
# regex=False: treat '.' as a literal dot (as a regex it would match every character)
df.columns = df.columns.str.replace('.', '', regex=False)
df.head()

(4923, 41)


Unnamed: 0,divorce_date,type_of_divorce,nationality_partner_man,dob_partner_man,place_of_birth_partner_man,birth_municipality_of_partner_man,birth_federal_partner_man,birth_country_partner_man,age_partner_man,residence_municipality_partner_man,...,marriage_certificate_municipality,marriage_certificate_federal,level_of_education_partner_man,employment_status_partner_man,level_of_education_partner_woman,employment_status_partner_woman,marriage_duration,marriage_duration_months,num_children,custody
0,9/6/06,Necesario,MEXICANA,18/12/75,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,30.0,XALAPA,...,XALAPA,VERACRUZ,SECUNDARIA,OBRERO,SECUNDARIA,EMPLEADO,5.0,,1.0,
1,1/2/00,Voluntario,MEXICANA,,,,,,47.0,,...,XALAPA,VERACRUZ,PREPARATORIA,ESTABLECIMIENTO,PREPARATORIA,EMPLEADO,,,,
2,1/2/05,Necesario,MEXICANA,22/2/55,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,49.0,,...,XALAPA,VERACRUZ,PREPARATORIA,OBRERO,,TRABAJADOR POR CUENTA PROPIA EN VIA PUBLICA,,,,
3,1/2/06,Necesario,MEXICANA,20/1/64,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,42.0,XALAPA,...,XALAPA,VERACRUZ,PROFESIONAL,EMPLEADO,PREPARATORIA,EMPLEADO,18.0,,2.0,MADRE
4,1/2/06,Necesario,MEXICANA,30/10/75,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,30.0,COATEPEC,...,XALAPA,VERACRUZ,PROFESIONAL,EMPLEADO,PREPARATORIA,NO TRABAJA,7.0,,2.0,MADRE


Yeah, once again we need to do a lot of sprucing up. We'll definitely err on the side of keeping things as simple as possible.

In [45]:
df.isna().sum().sort_values(ascending=False)[:15]

marriage_duration_months                3368
custody                                 2851
dob_registration_date_partner_woman     2679
monthly_income_partner_woman_peso       2119
num_children                            1912
monthly_income_partner_man_peso         1419
occupation_partner_woman                 578
dob_partner_woman                        452
employment_status_partner_woman          417
dob_partner_man                          381
level_of_education_partner_woman         380
employment_status_partner_man            356
residence_country_partner_man            324
residence_municipality_partner_woman     307
place_of_residence_partner_woman         307
dtype: int64

Oh my, that's quite a lot of nulls. But let's deal with each feature as we get around to it, especially since our target is either the duration or the type...
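To put those counts in perspective, here's a rough sketch of the same check expressed as a share of rows. The toy frame and its values are made up for illustration; only the column names come from the data above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (values are invented for illustration).
toy = pd.DataFrame({
    "custody": ["MADRE", np.nan, np.nan, np.nan],
    "num_children": [1.0, np.nan, 2.0, 2.0],
    "type_of_divorce": ["Necesario", "Voluntario", "Necesario", "Necesario"],
})

# isna().mean() gives the fraction of missing rows per column.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

On the real frame the same line would just be df.isna().mean(), which makes it easier to eyeball which features are more than half empty.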

In [3]:
i = 0

In [4]:
df[df.columns[i]].value_counts()  # Yeah, we don't care about this... way too detailed for our purposes.
# Do note the range of the dates, though.

divorce_date
9/10/09     9
5/9/05      8
25/2/09     8
21/11/03    8
3/11/04     8
           ..
8/10/13     1
8/10/14     1
8/11/01     1
8/11/06     1
31/12/13    1
Name: count, Length: 2596, dtype: int64

In [5]:
# Careful: these are strings, so min/max compare them character by character, not chronologically.
min(df['divorce_date']), max(df['divorce_date'])

('1/10/01', '9/9/14')
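To get the actual date range, the strings would need parsing first. A minimal sketch, assuming the format is day/month/two-digit-year (which values like 31/12/13 suggest); the handful of dates below are copied from the outputs above rather than read from the file.

```python
import pandas as pd

# A few divorce_date strings copied from the outputs above.
dates = pd.Series(["9/10/09", "1/10/01", "9/9/14", "31/12/13", "1/2/00"])

# Assumption: day/month/two-digit-year.
parsed = pd.to_datetime(dates, format="%d/%m/%y")

# On strings, min/max are lexicographic...
print(min(dates), max(dates))
# ...whereas on parsed dates they are chronological.
print(parsed.min().date(), parsed.max().date())
```

On the full column this would be pd.to_datetime(df['divorce_date'], format="%d/%m/%y"), assuming every row follows the same format.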

In [6]:
to_be_deleted = []

to_be_deleted.append(df.columns[i])
print(f"Next up to be deleted is {df.columns[i]} and now we're poised to delete a total of {len(to_be_deleted)} from our dataframe.")

Next up to be deleted is divorce_date and now we're poised to delete a total of 1 from our dataframe.


In [7]:
i += 1
df[df.columns[i]].value_counts()

type_of_divorce
Necesario     2528
Voluntario    2395
Name: count, dtype: int64

Guess we'll worry about one-hot encoding later.

Pragmatically, I'd assume Necesario should be 0, i.e. the type of scenario we'd ideally like to avoid, but relatively 'better' than the alternative (in short...).
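Since there are only two values, a single 0/1 column would do the job of one-hot encoding here. A quick sketch of that mapping; the variable name type_is_voluntario is hypothetical, and the three values are just a toy sample.

```python
import pandas as pd

# Toy sample of the type_of_divorce column.
divorce_type = pd.Series(["Necesario", "Voluntario", "Necesario"], name="type_of_divorce")

# Necesario -> 0, Voluntario -> 1, following the note above.
type_is_voluntario = divorce_type.map({"Necesario": 0, "Voluntario": 1})
print(type_is_voluntario.tolist())
```

Any value not in the dictionary would come out as NaN, which is a handy way to spot unexpected categories.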

In [8]:
i += 1
df[df.columns[i]].value_counts()

nationality_partner_man
MEXICANA          4879
CUBANA               8
ESTADOUNIDENSE       7
ESPAÑOLA             6
ARGENTINA            4
FRANCESA             2
CANADIENSE           2
ALEMANA              2
VENEZOLANA           2
JAPONESA             1
AUSTRALIANA          1
POLACA               1
CHINA                1
ITALIANA             1
CHILENA              1
COSTARRICENSE        1
AUSTRIACA            1
COLOMBIANA           1
NICARAGUENSE         1
Name: count, dtype: int64

Yeah, we'll simplify this to Mexican/not. Although we'd be interested to see a Mexican partner of any type paired with a non-Mexican, we're also interested in cases where neither partner is Mexican yet both are still living there (if that occurs in this dataset).
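One way to capture those three cases (both Mexican, mixed, neither) is sketched below. It assumes the nationality columns still hold the raw strings, and the couple_nationality column and its labels are hypothetical names; the rows are toy data.

```python
import pandas as pd

# Toy rows; nationalities are picked from the value counts in this notebook.
toy = pd.DataFrame({
    "nationality_partner_man": ["MEXICANA", "CUBANA", "MEXICANA", "FRANCESA"],
    "nationality_partner_woman": ["MEXICANA", "MEXICANA", "POLACA", "SUIZA"],
})

# Count how many of the two partners are Mexican (0, 1 or 2).
n_mexican = (
    toy["nationality_partner_man"].eq("MEXICANA").astype(int)
    + toy["nationality_partner_woman"].eq("MEXICANA").astype(int)
)

# Hypothetical summary column with one label per case.
toy["couple_nationality"] = n_mexican.map(
    {2: "both_mexican", 1: "mixed", 0: "neither_mexican"}
)
print(toy["couple_nationality"].value_counts())
```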

In [9]:
# 0 = Mexican, 1 = anything else (note that a missing nationality is also coded as 1)
df[df.columns[i]] = df[df.columns[i]].apply(lambda x: 0 if x == 'MEXICANA' else 1)

In [10]:
df.head()

Unnamed: 0,divorce_date,type_of_divorce,nationality_partner_man,dob_partner_man,place_of_birth_partner_man,birth_municipality_of_partner_man,birth_federal_partner_man,birth_country_partner_man,age_partner_man,residence_municipality_partner_man,...,marriage_certificate_municipality,marriage_certificate_federal,level_of_education_partner_man,employment_status_partner_man,level_of_education_partner_woman,employment_status_partner_woman,marriage_duration,marriage_duration_months,num_children,custody
0,9/6/06,Necesario,0,18/12/75,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,30.0,XALAPA,...,XALAPA,VERACRUZ,SECUNDARIA,OBRERO,SECUNDARIA,EMPLEADO,5.0,,1.0,
1,1/2/00,Voluntario,0,,,,,,47.0,,...,XALAPA,VERACRUZ,PREPARATORIA,ESTABLECIMIENTO,PREPARATORIA,EMPLEADO,,,,
2,1/2/05,Necesario,0,22/2/55,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,49.0,,...,XALAPA,VERACRUZ,PREPARATORIA,OBRERO,,TRABAJADOR POR CUENTA PROPIA EN VIA PUBLICA,,,,
3,1/2/06,Necesario,0,20/1/64,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,42.0,XALAPA,...,XALAPA,VERACRUZ,PROFESIONAL,EMPLEADO,PREPARATORIA,EMPLEADO,18.0,,2.0,MADRE
4,1/2/06,Necesario,0,30/10/75,XALAPA - ENRIQUEZ,XALAPA,VERACRUZ,MEXICO,30.0,COATEPEC,...,XALAPA,VERACRUZ,PROFESIONAL,EMPLEADO,PREPARATORIA,NO TRABAJA,7.0,,2.0,MADRE


In [11]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 3.


dob_partner_man
25/5/73     5
5/4/74      5
17/3/75     4
23/4/74     4
2/8/61      4
           ..
1/10/72     1
19/6/68     1
5/3/70      1
20/12/77    1
7/2/58      1
Name: count, Length: 3848, dtype: int64

Hmm, potentially let us simplify... Oh wait, I just realized: I want to examine their relative ages. Hence, I'd ultimately need the divorce date... we'll keep going for now though.
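A rough sketch of what that relative-age calculation might look like. One catch with two-digit years: the %y directive reads 00 to 68 as 2000 to 2068, so a 1955 birthday lands in 2055 and needs pushing back a century. The cutoff date, the helper name parse_dob and the first woman's dob are assumptions made for illustration; the other dob values are copied from the outputs in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "dob_partner_man": ["22/2/55", "18/12/75"],
    "dob_partner_woman": ["3/1/60", "3/1/80"],  # first value is invented
})

def parse_dob(series, latest="2015-12-31"):
    """Parse d/m/yy strings; move any date after `latest` back 100 years."""
    parsed = pd.to_datetime(series, format="%d/%m/%y")
    in_future = parsed > pd.Timestamp(latest)
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

man_dob = parse_dob(toy["dob_partner_man"])
woman_dob = parse_dob(toy["dob_partner_woman"])

# Positive means the man is older; 365.25 days per year is an approximation.
age_gap_years = (woman_dob - man_dob).dt.days / 365.25
print(age_gap_years.round(1).tolist())
```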

In [12]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 4.


place_of_birth_partner_man
XALAPA - ENRIQUEZ    2472
DISTRITO FEDERAL      347
VERACRUZ              129
COATEPEC               68
POZA RICA              49
                     ... 
ANTONIO HIDALGO         1
ESPARTACO               1
AHUACATAN               1
SAN PABLO COAPAN        1
EL LIMON                1
Name: count, Length: 669, dtype: int64

Yeah, we're simplifying this. Let the cutoff be 10 (it's beyond the scope here to comment on why I'm choosing that number).

In [13]:
to_be_deleted.append(df.columns[i])
print(f"Next up to be deleted is {df.columns[i]} and now we're poised to delete a total of {len(to_be_deleted)} from our dataframe.")

Next up to be deleted is place_of_birth_partner_man and now we're poised to delete a total of 2 from our dataframe.


In [14]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 5.


birth_municipality_of_partner_man
XALAPA                   2479
DISTRITO FEDERAL          348
VERACRUZ                  131
COATEPEC                   78
MARTINEZ DE LA TORRE       60
                         ... 
PATZCUARO                   1
TOPILITOS DE ZARAGOZA       1
TULA DE ALLENDE             1
PARRAL                      1
ZAUTLA                      1
Name: count, Length: 424, dtype: int64

Somehow different??? A bit tempted to delete this one, as a quick scan shows barely any differences. However, this one seems more inclusive (424 unique values vs. 669), so I'll go with it.

In [15]:
list(df[df.columns[i]].value_counts(normalize=True)[:9].index)

['XALAPA',
 'DISTRITO FEDERAL',
 'VERACRUZ',
 'COATEPEC',
 'MARTINEZ DE LA TORRE',
 'MISANTLA',
 'POZA RICA',
 'ORIZABA',
 'CORDOBA']

In [16]:
print(sum(df[df.columns[i]].value_counts(normalize=True)[:9]))

0.6846057571964955


In [17]:
# I feel like I had a neat bit of code to do this. Oh well.

feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:9].index)

# Somewhat arbitrarily chosen, but eh. After like 7 they have such a small amount... Will simplify.
# The loop variable is named v so it isn't confused with the column counter i.
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]
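Since this keep-the-top-n step repeats for most of the categorical features, it could be wrapped in a small helper. A minimal sketch; keep_top_n is a hypothetical name, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def keep_top_n(series, n, other_label="OTHER"):
    """Keep the n most frequent values of a Series and relabel everything else."""
    keepers = series.value_counts().index[:n]
    # Like the list comprehension above, missing values also end up as other_label.
    return series.where(series.isin(keepers), other_label)

toy = pd.Series(["XALAPA", "XALAPA", "XALAPA", "COATEPEC", "COATEPEC", "ZAUTLA", np.nan])
print(keep_top_n(toy, 2).tolist())
```

With this, each condensing cell would shrink to something like df[col] = keep_top_n(df[col], 9).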

In [18]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 6.


birth_federal_partner_man
VERACRUZ              3920
DISTRITO FEDERAL       403
PUEBLA                 111
TAMAULIPAS              33
OAXACA                  30
                      ... 
EL SALVADOR              1
GALICIA                  1
MISSOURI                 1
CAMAGÜEY                 1
NORTE DE SANTANDER       1
Name: count, Length: 76, dtype: int64

Similar to the other... We might as well keep it, especially since we're condensing. This time we'll keep everything down to Puebla and lump the rest into 'OTHER', as the remaining states don't even have 100 each...

In [19]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:3].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

In [20]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 7.


birth_country_partner_man
MEXICO                       4748
CUBA                            8
ESPAÑA                          8
ESTADOS UNIDOS DE AMERICA       6
ARGENTINA                       3
COLOMBIA                        3
FRANCIA                         3
VENEZUELA                       2
POLONIA                         2
ALEMANIA                        2
COSTA RICA                      1
CANADA                          1
CENTRO AMERICA                  1
CHILE                           1
IRAN                            1
ITALIA                          1
JAPON                           1
MARYLAND                        1
INGLATERRA                      1
AUSTRALIA                       1
NICARAGUA                       1
Name: count, dtype: int64

Unsurprising... we'll keep both for now, especially since this will get simplified to a binary Mexico-or-not.

In [21]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:1].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

In [22]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 8.


age_partner_man
34.0    235
31.0    216
35.0    212
32.0    203
36.0    203
       ... 
78.0      2
76.0      2
75.0      2
91.0      1
83.0      1
Name: count, Length: 62, dtype: int64

I just reconfirmed and yes, this refers to their age at the date of divorce (dod). Hence, we're good to simplify by removing the dob (as we originally assumed). However, I'd still be interested in seeing roughly how old they were when they got married. So, we'll see.
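If the dob does get dropped, age at marriage could still be approximated from what remains. A sketch, assuming marriage_duration is measured in whole years (not confirmed here); age_at_marriage_man is a hypothetical column name, and the rows are modelled on df.head().

```python
import numpy as np
import pandas as pd

# Rows modelled on df.head(): age at divorce and marriage_duration.
toy = pd.DataFrame({
    "age_partner_man": [30.0, 47.0, 42.0],
    "marriage_duration": [5.0, np.nan, 18.0],
})

# Assumption: marriage_duration is in years, so the difference is the age at marriage.
toy["age_at_marriage_man"] = toy["age_partner_man"] - toy["marriage_duration"]
print(toy)
```

Rows with a missing duration simply stay missing, so this only helps where marriage_duration is filled in.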

In [23]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 9.


residence_municipality_partner_man
XALAPA                 3927
VERACRUZ                 67
COATEPEC                 63
EMILIANO ZAPATA          62
DISTRITO FEDERAL         40
                       ... 
SEBASTIAN TUXTLA          1
GUSTAVO A. MADERO         1
PETLALCINGO               1
TECOLUTILLA               1
SANTA MARIA EL TULE       1
Name: count, Length: 175, dtype: int64

In [24]:
df[df.columns[i]].value_counts(normalize=True)[:4]

residence_municipality_partner_man
XALAPA             0.853881
VERACRUZ           0.014568
COATEPEC           0.013699
EMILIANO ZAPATA    0.013481
Name: proportion, dtype: float64

Eh... who cares. We'll keep it. Simplify it so that anything past Emiliano Zapata, i.e. anything under roughly 1 percent, becomes OTHER.
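The 1 percent rule could also be applied directly rather than by counting how many categories clear it. A sketch under that assumption; keep_above_share is a hypothetical helper and the toy counts are invented.

```python
import pandas as pd

def keep_above_share(series, min_share=0.01, other_label="OTHER"):
    """Relabel every value whose share of non-null rows is below min_share."""
    shares = series.value_counts(normalize=True)
    keepers = shares[shares >= min_share].index
    return series.where(series.isin(keepers), other_label)

# Toy column: 98%, 1.5% and 0.5% of rows respectively.
toy = pd.Series(["XALAPA"] * 196 + ["VERACRUZ"] * 3 + ["ZAUTLA"] * 1)
condensed = keep_above_share(toy, min_share=0.01)
print(condensed.value_counts())
```

The upside over a fixed top-n is that the same threshold can be reused across features without looking up n each time.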

In [25]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:4].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

In [26]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()[:5]

Now looking at feature 10.


residence_federal_partner_man
VERACRUZ            4402
DISTRITO FEDERAL      53
PUEBLA                27
TABASCO               11
CHIAPAS                8
Name: count, dtype: int64

Yeah, this is becoming binary. More spread out than the 'birth_country_partner_man' so worth having both.

In [27]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:1].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

In [28]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 11.


residence_country_partner_man
MEXICO                       4589
ESTADOS UNIDOS DE AMERICA       7
ESPAÑA                          1
AUSTRALIA                       1
PUERTO RICO                     1
Name: count, dtype: int64

Heads up already re. potential inflation issues with the income figures coming up. However, the years in the data aren't too far apart, so we'll likely be lax and not bother for the sake of this study.

Perhaps if it turns out that this feature is super significant we'll reconsider.

In [29]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 12.


monthly_income_partner_man_peso
5000.0     326
3000.0     301
4000.0     279
10000.0    276
6000.0     275
          ... 
95000.0      1
6300.0       1
4100.0       1
550.0        1
12500.0      1
Name: count, Length: 214, dtype: int64


Hmm, pretty small after the first one. Potentially a move to just make this binary. At least for now we'll continue.
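If we do go binary, one option is to split at the median. This is only a sketch of that idea, not a decision; the incomes below are a toy sample using round figures like those in the output above.

```python
import numpy as np
import pandas as pd

income = pd.Series(
    [5000.0, 3000.0, 4000.0, 10000.0, 6000.0, np.nan],
    name="monthly_income_partner_man_peso",
)

median_income = income.median()  # missing values are skipped

# 1.0 if above the median, 0.0 otherwise; missing incomes stay missing.
above_median = (income > median_income).astype("float").where(income.notna())
print(median_income, above_median.tolist())
```

Keeping the missing incomes as NaN rather than 0 avoids quietly treating "unknown" as "low income".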

In [30]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts(normalize=True)

Now looking at feature 13.


occupation_partner_man
EMPLEADO                0.535275
COMERCIANTE             0.068047
PROFESOR                0.030041
MAESTRO                 0.020255
MEDICO                  0.019800
                          ...   
AGENTE DE SEGUROS       0.000228
PRODUCTOR               0.000228
ADMINISTRATIVO          0.000228
MEDICO ANESTECIOLOGO    0.000228
CAPITAN DE CORBETA      0.000228
Name: proportion, Length: 221, dtype: float64

The first 4 are each more than 2%; 13 are at least 1%, so at most we'd keep 13 plus an 'other' category. They sum to about 65% and 80% respectively.
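Rather than summing slices by hand, the cumulative share shows how many categories are needed to cover a given fraction of rows. A sketch on toy data (the counts are invented; the 0.8 target just mirrors the 80% mentioned above).

```python
import pandas as pd

toy = pd.Series(["EMPLEADO"] * 5 + ["COMERCIANTE"] * 2 + ["PROFESOR"] * 2 + ["MEDICO"])

# Shares sorted from most to least frequent, then accumulated.
shares = toy.value_counts(normalize=True)
cumulative = shares.cumsum()

# Smallest number of categories whose combined share reaches the target.
target = 0.8
n_needed = int((cumulative < target).sum()) + 1
print(cumulative.round(2).tolist(), n_needed)
```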

Similar to things we've seen before, but eh...

In [95]:
sum(df[df.columns[13]].value_counts(normalize=True)[:3])  # I think I'm doing something wrong here when I went back to check this out.
# Possibly the column had already been condensed to the top 3 plus OTHER by the time this ran. Will worry about it later.

0.9731870810481414

In [31]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:3].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

## Re. the women:

By default I won't bother to comment unless I notice trends between the two that I find significant. I.e., even if the nationality is much more likely to be Mexican for a woman, I won't bother commenting given that they're similar.

In [32]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts(normalize=True)

Now looking at feature 14.


place_of_residence_partner_man
XALAPA-ENRIQUEZ     0.851369
VERACRUZ            0.014342
COATEPEC            0.012386
EMILIANO ZAPATA     0.008040
DISTRITO FEDERAL    0.007388
                      ...   
ZONGOLICA           0.000217
CERRO COLORADO      0.000217
MIGUEL HIDALGO      0.000217
ALVARO OBREGON      0.000217
LAS PUENTES         0.000217
Name: proportion, Length: 229, dtype: float64

In [33]:
feature_keepers = list(df[df.columns[i]].value_counts(normalize=True)[:1].index)
df[df.columns[i]] = [v if v in feature_keepers else "OTHER" for v in df[df.columns[i]]]

In [34]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 15.


nationality_partner_woman
MEXICANA          4890
CUBANA               6
ESTADOUNIDENSE       5
FRANCESA             3
ESPAÑOLA             3
ARGENTINA            2
POLACA               2
BRASILEÑA            1
PERUANA              1
DOMINICANA           1
VENEZOLANA           1
COLOMBIANA           1
BOLIVIANA            1
JAPONESA             1
BRITANICA            1
SUIZA                1
Name: count, dtype: int64

In [36]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 16.


dob_partner_woman
11/12/72    5
19/2/71     5
6/6/68      4
6/7/77      4
3/6/72      4
           ..
28/2/65     1
11/12/71    1
1/10/52     1
25/3/76     1
22/1/72     1
Name: count, Length: 3765, dtype: int64

As with the man's dob earlier, this one is likely to be deleted. But let's continue for now.

In [37]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 17.


dob_registration_date_partner_woman
21/2/73     4
1/8/77      3
7/7/69      3
3/5/73      3
8/6/70      3
           ..
16/2/52     1
31/8/74     1
29/5/87     1
27/12/76    1
7/9/72      1
Name: count, Length: 2003, dtype: int64

Woah, we did not have this earlier. More than half nulls and not even sure what it is. Perhaps a proxy of some kind for growing up in a messed up home if the dob somehow wasn't legally/formally recognized in a hospital. Really do not know. But yeah, we're deleting this one.

In [68]:
j = -1

In [77]:
j += 1
print(j)
df.iloc[j]

8


divorce_date                                     1/2/08
type_of_divorce                              Voluntario
nationality_partner_man                               0
dob_partner_man                                 2/12/76
place_of_birth_partner_man                 CIUDAD MANTE
birth_municipality_of_partner_man                 OTHER
birth_federal_partner_man                         OTHER
birth_country_partner_man                        MEXICO
age_partner_man                                    31.0
residence_municipality_partner_man                OTHER
residence_federal_partner_man                     OTHER
residence_country_partner_man                    MEXICO
monthly_income_partner_man_peso                 15000.0
occupation_partner_man                      COMERCIANTE
place_of_residence_partner_man                    OTHER
nationality_partner_woman                      MEXICANA
dob_partner_woman                                3/1/80
dob_registration_date_partner_woman             

In [78]:
to_be_deleted.append(df.columns[i])
print(f"Next up to be deleted is {df.columns[i]} and now we're poised to delete a total of {len(to_be_deleted)} from our dataframe.")

Next up to be deleted is dob_registration_date_partner_woman and now we're poised to delete a total of 3 from our dataframe.


In [79]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 18.


place_of_birth_partner_woman
XALAPA-ENRIQUEZ         2595
DISTRITO FEDERAL         306
VERACRUZ                 107
POZA RICA                 54
MARTINEZ DE LA TORRE      51
                        ... 
EL TIBOR                   1
TLATLAUQUITEPEC            1
HUAYACOCOTLA               1
MIRADORES DEL MAR          1
LA LIMA                    1
Name: count, Length: 653, dtype: int64

Like above... Since we're deleting something, though, I'll actually comment.

In [80]:
to_be_deleted.append(df.columns[i])
print(f"Next up to be deleted is {df.columns[i]} and now we're poised to delete a total of {len(to_be_deleted)} from our dataframe.")

Next up to be deleted is place_of_birth_partner_woman and now we're poised to delete a total of 4 from our dataframe.
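When the time comes, the whole to_be_deleted list can be dropped in one go. A minimal sketch on a toy frame, using two of the column names collected so far.

```python
import pandas as pd

toy = pd.DataFrame({
    "divorce_date": ["9/6/06"],
    "place_of_birth_partner_man": ["XALAPA - ENRIQUEZ"],
    "age_partner_man": [30.0],
})
to_be_deleted = ["divorce_date", "place_of_birth_partner_man"]

# errors="ignore" skips names that are already gone, so re-running the cell is safe.
trimmed = toy.drop(columns=to_be_deleted, errors="ignore")
print(trimmed.columns.tolist())
```

Any derived features that depend on these columns (relative ages, for instance) would need to be computed before the drop.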


In [81]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 19.


birth_municipality_of_partner_woman
XALAPA                  2612
DISTRITO FEDERAL         308
VERACRUZ                 110
MARTINEZ DE LA TORRE      66
POZA RICA                 55
                        ... 
IRAPUATO                   1
OTATITLAN                  1
IXTACZOQUITLAN             1
JESUS CARRANZA             1
PONTEVEDRA                 1
Name: count, Length: 404, dtype: int64

In [82]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 20.


birth_federal_partner_woman
VERACRUZ            4011
DISTRITO FEDERAL     362
PUEBLA                93
OAXACA                38
TAMAULIPAS            30
                    ... 
SANTO DOMINGO          1
NUEVA ESPARTA          1
SAN PAULO              1
ESPAÑA                 1
ZURICH                 1
Name: count, Length: 72, dtype: int64

In [83]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 21.


birth_country_partner_woman
MEXICO                       4742
ESTADOS UNIDOS DE AMERICA       7
CUBA                            7
ESPAÑA                          4
FRANCIA                         3
REPUBLICA DOMINICANA            3
BRASIL                          2
COLOMBIA                        2
SUIZA                           2
POLONIA                         2
URUGUAY                         1
ARGELIA                         1
JAPON                           1
ARGENTINA                       1
VENEZUELA                       1
GUATEMALA                       1
BOLIVIA                         1
CALIFORNIA                      1
PERU                            1
INGLATERRA                      1
Name: count, dtype: int64

Interesting that women seem to be more diverse, country-wise, than men. Might be significant. I.e., it would suggest that the man is more likely to be 'at ease' in this dataset (since he's more likely to be 'at home', 'in his element').

In [84]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 22.


age_partner_woman
33.0    220
32.0    213
35.0    208
29.0    205
31.0    194
       ... 
84.0      1
76.0      1
73.0      1
68.0      1
75.0      1
Name: count, Length: 63, dtype: int64

In [85]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 23.


place_of_residence_partner_woman
XALAPA-ENRIQUEZ    4162
EMILIANO ZAPATA      43
COATEPEC             40
VERACRUZ             35
BANDERILLA           26
                   ... 
CHILTOYAC             1
EL ESPINAL            1
VILLA RICA            1
INDIANA               1
SANTIAGO TUXTLA       1
Name: count, Length: 174, dtype: int64

In [86]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 24.


residence_municipality_partner_woman
XALAPA                4179
EMILIANO ZAPATA         61
COATEPEC                41
VERACRUZ                37
BANDERILLA              26
                      ... 
PACHUCA                  1
TRES VALLES              1
VALENCIA                 1
CUAUTITLAN IZCALLI       1
SANTIAGO TUXTLA          1
Name: count, Length: 131, dtype: int64

In [87]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 25.


residence_federal_partner_woman
VERACRUZ                4489
DISTRITO FEDERAL          25
PUEBLA                    13
QUINTANA ROO               9
QUERETARO                  7
TAMAULIPAS                 6
MEXICO                     6
ESTADO DE MEXICO           6
MORELOS                    6
CHIAPAS                    5
CAMPECHE                   5
BAJA CALIFORNIA            5
OAXACA                     4
TABASCO                    4
HIDALGO                    4
AGUASCALIENTES             3
GUANAJUATO                 2
SONORA                     2
GUERRERO                   2
YUCATAN                    2
NAYARIT                    1
CHAUTEMPAN                 1
PARIS                      1
TEXAS                      1
FLORIDA                    1
USA                        1
NEW YORK                   1
COMUNIDAD VALENCIANA       1
BAJA CALIFORNIA SUR        1
MICHOACAN                  1
COLIMA                     1
TEXCOCO                    1
COAHUILA                   1
Name: count, dtype: int64

Perhaps overdue and already tacitly assumed, but whatever simplification I use for these features I'll apply to both the men and the women.
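Because the column names only differ by the man/woman suffix, the same treatment can be applied to both in a loop. A sketch, with a hypothetical keep_top_n helper and invented toy values.

```python
import pandas as pd

def keep_top_n(series, n, other_label="OTHER"):
    """Keep the n most frequent values of a Series and relabel everything else."""
    keepers = series.value_counts().index[:n]
    return series.where(series.isin(keepers), other_label)

toy = pd.DataFrame({
    "residence_federal_partner_man": ["VERACRUZ", "VERACRUZ", "PUEBLA"],
    "residence_federal_partner_woman": ["VERACRUZ", "TEXAS", "VERACRUZ"],
})

# Same rule for both partners: keep the single most common state.
for partner in ("man", "woman"):
    col = f"residence_federal_partner_{partner}"
    toy[col] = keep_top_n(toy[col], 1)

print(toy)
```

Note that the top categories are worked out per column, so the kept values could differ between the man's and the woman's version of a feature.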

In [88]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 26.


residence_country_partner_woman
MEXICO            4612
ESTADOS UNIDOS       4
ESPAÑA               1
FRANCIA              1
Name: count, dtype: int64

In [90]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts(normalize=True)

Now looking at feature 27.


occupation_partner_woman
EMPLEADA              0.434062
AMA DE CASA           0.126582
LABORES DOMESTICAS    0.074799
MAESTRA               0.049482
COMERCIANTE           0.040046
                        ...   
SENADORA              0.000230
BANQUERO              0.000230
SOCIOLOGA             0.000230
MODESTA               0.000230
PREPARATORIA          0.000230
Name: proportion, Length: 156, dtype: float64

Much more of a spread, unsurprisingly per 'domain knowledge.' Match-ups will be interesting.

In [96]:
sum(df[df.columns[i]].value_counts(normalize=True)[:3]), sum(df[df.columns[i]].value_counts(normalize=True)[:10])

(0.6354430379746836, 0.8324510932105867)

In [97]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 28.


monthly_income_partner_woman_peso
5000.0     277
6000.0     245
4000.0     238
3000.0     226
10000.0    192
          ... 
75000.0      1
5100.0       1
1550.0       1
1930.0       1
29000.0      1
Name: count, Length: 188, dtype: int64

In [98]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Now looking at feature 29.


date_of_marriage
10/2/00     8
11/3/97     7
14/12/90    7
13/2/01     7
8/2/02      7
           ..
6/1/75      1
26/2/82     1
27/1/98     1
12/12/74    1
22/8/09     1
Name: count, Length: 3651, dtype: int64

Oh! What we've been waiting for. In and of itself we don't care, but relative to: man's age, woman's, d

In [None]:
i += 1
print(f"Now looking at feature {i}.")
df[df.columns[i]].value_counts()

Re-check above: whatever features we condensed for the men, do the same for the women (as of 9/3/24, essentially just doing EDA).

OHE:

type_of_divorce

?bucket monthly incomes?

occupation
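For when we get to it, a sketch of the one-hot encoding step with pd.get_dummies, assuming the categorical columns have already been condensed. The toy rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "type_of_divorce": ["Necesario", "Voluntario", "Necesario"],
    "occupation_partner_man": ["EMPLEADO", "OTHER", "COMERCIANTE"],
    "age_partner_man": [30.0, 47.0, 42.0],
})

# One 0/1 column per category; columns not listed are left untouched.
encoded = pd.get_dummies(
    toy,
    columns=["type_of_divorce", "occupation_partner_man"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Passing drop_first=True would drop one level per feature, which is worth considering if the encoded columns feed a linear model.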


Fix: Men's employment...