# Assignment 12: Project Part III
Predicting Divorce\
Ismail Abdo Elmaliki\
CS 502 - Predictive Analytics\
Capitol Technology University\
Professor Frank Neugebauer\
March 22, 2022

# Table of Contents

**Data Understanding**
- Info
- Further Investigate Missing Data

**Feature Engineering**
- Rename Columns
- Address Missing Data

**Data Understanding (Post Addressing Missing Values)**
- Profile Report
- Understanding divorced data
- Understanding non-divorced data
- Skew

**Predictive Model**
- Setup Reusable Functions
- Setup Predictive Model & Evaluate
- Introduce Synthetic Data
- Reanalyze Skew
- Setup Predictive Model & Evaluate (Part 2 - Post Synthetic Data Addition)
- Hyperparameter Tuning
- Export Table & Model

**Conclusion**

**References**


## Data Understanding
We'll start by examining the divorce data to understand its structure, summary statistics, and correlations.

### Info
At first glance there are no missing values: every column has 170 non-null entries of type int64.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import joblib
from matplotlib import pyplot

pd.set_option('display.max_columns', None)

df = pd.read_csv('divorce.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 55 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Sorry_end                      170 non-null    int64
 1   Ignore_diff                    170 non-null    int64
 2   begin_correct                  170 non-null    int64
 3   Contact                        170 non-null    int64
 4   Special_time                   170 non-null    int64
 5   No_home_time                   170 non-null    int64
 6   2_strangers                    170 non-null    int64
 7   enjoy_holiday                  170 non-null    int64
 8   enjoy_travel                   170 non-null    int64
 9   common_goals                   170 non-null    int64
 10  harmony                        170 non-null    int64
 11  freeom_value                   170 non-null    int64
 12  entertain                      170 non-null    int64
 13  people_goals        

### Further Investigate Missing Data

But let's verify that no data is actually missing. According to the `divorce_README` file, questions are answered on a scale from 1 to 5. Hence any column containing the value 0 (except the column `Divorce_Y_N`) has missing values.

As shown below, out of 170 rows, the columns listed contain between 47 and 114 zeros each, so some columns are missing less than half of their values and others more than half. These zeros will need to be addressed.

In [2]:
for column_name in df.columns:
    if column_name == 'Divorce_Y_N':
        break
    column = df[column_name]
    # Get the count of Zeros in column 
    count = (column == 0).sum()
    print('Count of zeros in column ', column_name, ' is : ', count)

Count of zeros in column  Sorry_end  is :  69
Count of zeros in column  Ignore_diff  is :  59
Count of zeros in column  begin_correct  is :  51
Count of zeros in column  Contact  is :  75
Count of zeros in column  Special_time  is :  82
Count of zeros in column  No_home_time  is :  86
Count of zeros in column  2_strangers  is :  114
Count of zeros in column  enjoy_holiday  is :  81
Count of zeros in column  enjoy_travel  is :  84
Count of zeros in column  common_goals  is :  62
Count of zeros in column  harmony  is :  71
Count of zeros in column  freeom_value  is :  58
Count of zeros in column  entertain  is :  47
Count of zeros in column  people_goals  is :  66
Count of zeros in column  dreams  is :  69
Count of zeros in column  love  is :  75
Count of zeros in column  happy  is :  73
Count of zeros in column  marriage  is :  79
Count of zeros in column  roles  is :  77
Count of zeros in column  trust  is :  81
Count of zeros in column  likes  is :  78
Count of zeros in column  care_s

Before analyzing the data any further, let's replace those 0 values with `NaN`. That way, statistics such as the mean won't take the value 0 into account. This excludes the column `Divorce_Y_N`, for which both 0 and 1 are valid values.

In [3]:
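for c in df.columns:
    if c == 'Divorce_Y_N':
        break  # the target is the last column, so all question columns are done
    df[c] = df[c].replace(0, np.nan)  # np.nan is the spelling supported across NumPy versions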
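
The counting and replacement steps above can also be written without explicit loops. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from `divorce.csv`); it counts zeros per question column and then marks them as missing while leaving the target column untouched.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; values are invented for illustration.
toy = pd.DataFrame({
    'Sorry_end':   [2, 0, 4, 0],
    'Ignore_diff': [0, 3, 3, 1],
    'Divorce_Y_N': [1, 0, 1, 0],   # 0 is a valid value in the target column
})

# Every column except the target holds a question response.
question_cols = toy.columns.drop('Divorce_Y_N')

# Count zeros per question column in one vectorized step.
zero_counts = (toy[question_cols] == 0).sum()

# Replace zeros with NaN only in the question columns.
toy[question_cols] = toy[question_cols].replace(0, np.nan)

print(zero_counts)
print(toy.isna().sum())
```

Selecting the question columns by name, rather than relying on the target being last, keeps the step correct even if the column order changes.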

## Feature Engineering
We'll be doing the following in this section:
- rename columns to include the question number and lowercase all letters
- fill missing data using interpolation, which replaces every missing value (originally recorded as 0) with one of the following numbers: [1, 2, 3, 4, 5]

### Rename Columns

In [4]:
# Rename columns by prepending the question number and lowercasing the name
new_columns = {}
number = 1
for c in df.columns:
    new_columns.update({c: str(number) + '_' + c.lower()})
    number += 1

df = df.rename(columns=new_columns)
df.head()

Unnamed: 0,1_sorry_end,2_ignore_diff,3_begin_correct,4_contact,5_special_time,6_no_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
0,2.0,2.0,4.0,1.0,,,,,,,1.0,,1.0,1.0,,1.0,,,,1.0,,,,,,,,,,1.0,1.0,2.0,1.0,2.0,,1.0,2.0,1.0,3.0,3.0,2.0,1.0,1.0,2.0,3.0,2.0,1.0,3.0,3.0,3.0,2.0,3.0,2.0,1.0,1
1,4.0,4.0,4.0,4.0,4.0,,,4.0,4.0,4.0,4.0,3.0,4.0,,4.0,4.0,4.0,4.0,3.0,2.0,1.0,1.0,,2.0,2.0,1.0,2.0,,1.0,1.0,,4.0,2.0,3.0,,2.0,3.0,4.0,2.0,4.0,2.0,2.0,3.0,4.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,4.0,2.0,2.0,1
2,2.0,2.0,2.0,2.0,1.0,3.0,2.0,1.0,1.0,2.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,1.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,1
3,3.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,3.0,2.0,3.0,2.0,2.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,3.0,3.0,3.0,3.0,2.0,2.0,2.0,1
4,2.0,2.0,1.0,1.0,1.0,1.0,,,,,,1.0,,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,,,,,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,2.0,1.0,,2.0,3.0,,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,1.0,,1
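
The renaming loop above can be condensed with `enumerate`. This is a small sketch on an empty toy frame with three of the original column names; it is only meant to show the naming pattern, not to replace the cell above.

```python
import pandas as pd

# Empty toy frame with three of the original column names.
toy = pd.DataFrame(columns=['Sorry_end', 'Ignore_diff', 'Divorce_Y_N'])

# enumerate(..., start=1) yields the 1-based position of each column,
# which is used as the question number prefix.
new_columns = {name: f"{i}_{name.lower()}" for i, name in enumerate(toy.columns, start=1)}
toy = toy.rename(columns=new_columns)

print(list(toy.columns))
```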


### Address Missing Data
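To fill in missing data, we'll use linear interpolation, which estimates each missing value from the neighboring existing values in the same column (Analytics Vidhya, 2021). Any values that interpolation cannot fill are replaced with the column median, and the results are rounded to whole numbers.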

In [5]:
columns = df.columns

for c in columns:
    if c == '55_divorce_y_n':
        continue  # the target column has no missing values
    df[c] = df[c].interpolate(method='linear')
    df[c] = df[c].fillna(df[c].median())  # assign back instead of a chained inplace fill
    df[c] = np.round(df[c])  # make sure values are whole numbers
    df[c] = df[c].astype(int)
df.head()

Unnamed: 0,1_sorry_end,2_ignore_diff,3_begin_correct,4_contact,5_special_time,6_no_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
0,2,2,4,1,2,1,1,2,2,2,1,2,1,1,2,1,1,2,2,1,1,1,2,2,1,1,1,2,1,1,1,2,1,2,2,1,2,1,3,3,2,1,1,2,3,2,1,3,3,3,2,3,2,1,1
1,4,4,4,4,4,1,1,4,4,4,4,3,4,2,4,4,4,4,3,2,1,1,2,2,2,1,2,2,1,1,2,4,2,3,2,2,3,4,2,4,2,2,3,4,2,2,2,3,4,4,4,4,2,2,1
2,2,2,2,2,1,3,2,1,1,2,3,4,2,3,3,3,3,3,3,2,1,1,1,2,2,2,2,2,3,2,3,3,1,1,1,1,2,1,3,3,3,3,2,3,2,3,2,3,1,1,1,2,2,2,1
3,3,2,3,2,3,3,3,3,3,3,4,3,3,4,3,3,3,3,3,4,1,1,1,1,2,1,1,1,1,3,2,3,2,2,1,1,3,3,4,4,2,2,3,2,3,2,2,3,3,3,3,2,2,2,1
4,2,2,1,1,1,1,3,3,3,2,3,1,2,1,1,1,1,1,2,1,1,2,2,2,2,2,1,2,1,1,1,1,1,1,1,1,2,2,2,1,2,2,3,2,2,2,1,2,3,2,2,2,1,2,1
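
To make the three imputation steps concrete, here is a worked example on a single made-up column. It follows the same sequence as the cell above (linear interpolation, median fill, rounding); the series values are invented for illustration.

```python
import numpy as np
import pandas as pd

# One made-up column with a missing value at the start and a gap in the middle.
s = pd.Series([np.nan, 2, np.nan, np.nan, 5, 5])

# Step 1: linear interpolation fills the gap between 2 and 5 with 3 and 4.
# With the default settings the leading NaN has no earlier value and stays missing.
step1 = s.interpolate(method='linear')

# Step 2: whatever is still missing gets the median of the filled column.
step2 = step1.fillna(step1.median())

# Step 3: round and cast so every response is a whole number again.
result = np.round(step2).astype(int)

print(step1.tolist())
print(result.tolist())
```

Note that interpolation works along the row order of the column, so each estimate depends on the responses in neighboring rows.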


## Data Understanding (Post Addressing Missing Values)

### Profile Report
From the profile report analysis, we notice a couple of things:
- Features are correlated with one another
- There are duplicate rows

Because so many features are highly correlated, and each may contribute to whether or not the respondent is divorced, we'll keep all features rather than deleting any.

Regarding duplicate rows, we'll drop them below.

In [6]:
from pandas_profiling import ProfileReport

report = ProfileReport(df)
report


In [7]:
df.drop_duplicates(inplace=True)

### Understanding divorced data
By calling the describe function, we can immediately see the mean response for each question (columns 1 through 54), in particular those with a mean of 3.0 or higher.

For example, columns 31 through 54 all have a mean above 3.1. Based on the questions from the `divorce_README`, these are signs of a marriage going downhill that may lead to divorce.

In [8]:
divorced = df.copy()
divorced = divorced[divorced['55_divorce_y_n'] == 1]
divorced.describe()

Unnamed: 0,1_sorry_end,2_ignore_diff,3_begin_correct,4_contact,5_special_time,6_no_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
count,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0
mean,3.166667,2.909091,2.893939,2.818182,2.984848,1.5,1.530303,2.909091,2.954545,2.878788,3.242424,3.015152,3.106061,2.939394,2.939394,2.954545,3.121212,3.0,3.166667,2.909091,2.69697,2.590909,2.924242,2.848485,3.015152,2.863636,2.681818,2.742424,2.878788,2.818182,3.393939,3.287879,3.30303,3.227273,3.287879,3.242424,3.515152,3.424242,3.590909,3.545455,3.469697,3.227273,3.439394,3.348485,3.257576,3.136364,3.272727,3.333333,3.393939,3.378788,3.227273,3.378788,3.166667,3.348485,1.0
std,0.714322,0.7986,0.806298,0.83958,0.885654,0.685004,0.980105,0.7986,0.753082,0.774898,0.702974,0.712362,0.767188,0.782084,0.801514,0.792888,0.734117,0.723241,0.64649,0.854441,0.803257,0.94425,0.80976,0.808463,0.774446,0.839164,0.862181,0.828541,0.886048,0.83958,0.942644,0.872929,0.976173,0.941283,1.034262,0.961739,0.808463,0.804996,0.722757,0.660578,0.826851,0.873463,0.825158,0.774446,0.828541,0.909545,1.015957,0.751068,0.839025,0.872929,0.890902,0.940788,0.970065,0.902857,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,2.0,2.25,2.0,3.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.25,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0
50%,3.0,3.0,3.0,3.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,4.0,3.5,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.5,3.0,3.0,4.0,3.0,4.0,4.0,3.0,4.0,3.0,4.0,1.0
75%,4.0,3.0,3.0,3.0,4.0,2.0,2.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,3.0,3.75,4.0,3.0,4.0,3.75,3.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
max,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0


Let's filter this data further by keeping only the columns whose mean is 3.0 or higher, or below 2.0. These extremes should tell us more about what leads to divorce.

This takes us from 55 columns to 35 columns!

And as expected, all columns from 31 through 54 remain, which suggests they are strong indicators of divorce.

In [9]:
divorced_mean = divorced.copy()
columns = divorced_mean.columns 
for c in columns:
    if divorced_mean[c].mean() < 3.0 and divorced_mean[c].mean() >= 2.0:
        divorced_mean.drop(c, axis=1, inplace=True)

divorced_mean.describe()

Unnamed: 0,1_sorry_end,6_no_home_time,7_2_strangers,11_harmony,12_freeom_value,13_entertain,17_happy,18_marriage,19_roles,25_inner_world,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
count,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0,66.0
mean,3.166667,1.5,1.530303,3.242424,3.015152,3.106061,3.121212,3.0,3.166667,3.015152,3.393939,3.287879,3.30303,3.227273,3.287879,3.242424,3.515152,3.424242,3.590909,3.545455,3.469697,3.227273,3.439394,3.348485,3.257576,3.136364,3.272727,3.333333,3.393939,3.378788,3.227273,3.378788,3.166667,3.348485,1.0
std,0.714322,0.685004,0.980105,0.702974,0.712362,0.767188,0.734117,0.723241,0.64649,0.774446,0.942644,0.872929,0.976173,0.941283,1.034262,0.961739,0.808463,0.804996,0.722757,0.660578,0.826851,0.873463,0.825158,0.774446,0.828541,0.909545,1.015957,0.751068,0.839025,0.872929,0.890902,0.940788,0.970065,0.902857,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.25,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0
50%,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,4.0,3.5,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.5,3.0,3.0,4.0,3.0,4.0,4.0,3.0,4.0,3.0,4.0,1.0
75%,4.0,2.0,2.0,4.0,3.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
max,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
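
The column filter above can also be expressed as a single boolean selection on the column means. The sketch below uses a made-up three-column frame and applies the same rule as the loop (drop a column when its mean is at least 2.0 and below 3.0).

```python
import pandas as pd

# Invented responses for three hypothetical questions.
toy = pd.DataFrame({
    'q_low':  [1, 1, 2, 2],   # mean 1.5 -> kept
    'q_mid':  [2, 3, 2, 3],   # mean 2.5 -> dropped
    'q_high': [3, 4, 3, 4],   # mean 3.5 -> kept
})

means = toy.mean()

# Keep the complement of the dropped range 2.0 <= mean < 3.0.
keep = means[(means >= 3.0) | (means < 2.0)].index
filtered = toy[keep]

print(list(filtered.columns))
```

Building the list of columns to keep first avoids modifying the frame while looping over its columns.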


### Understanding non-divorced data

Looking at the non-divorced data set, not a single question has a mean of 3.0 or higher. The highest mean is 2.5 (column 46), even among columns 31 through 54.

In [10]:
not_divorced = df.copy()
not_divorced = not_divorced[not_divorced['55_divorce_y_n'] == 0]
not_divorced.describe()

Unnamed: 0,1_sorry_end,2_ignore_diff,3_begin_correct,4_contact,5_special_time,6_no_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
count,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0
mean,1.593023,1.383721,1.639535,1.616279,1.476744,1.686047,1.081395,1.244186,1.162791,1.313953,1.116279,1.104651,1.325581,1.139535,1.174419,1.232558,1.023256,1.348837,1.116279,1.139535,1.0,1.069767,1.651163,1.360465,1.023256,1.011628,1.0,1.081395,1.011628,1.186047,1.604651,1.523256,1.232558,1.27907,1.430233,1.093023,1.593023,1.372093,1.418605,1.860465,1.290698,1.976744,2.337209,1.616279,2.232558,2.5,1.895349,2.290698,1.872093,1.790698,1.848837,2.127907,1.848837,1.488372,0.0
std,0.886207,0.57739,0.701474,0.984212,0.502388,0.800906,0.275045,0.432123,0.37134,0.46682,0.322439,0.307899,0.518864,0.348536,0.381695,0.424941,0.151599,0.479398,0.322439,0.348536,0.0,0.256249,0.628115,0.482951,0.151599,0.107833,0.0,0.314927,0.107833,0.391427,0.70759,0.731264,0.567231,0.566748,0.711927,0.423651,0.872831,0.595009,0.727043,0.984142,0.481817,0.853736,0.953143,0.68888,1.091943,0.979195,1.040691,0.733505,0.891747,0.783901,0.743876,0.955437,0.874709,0.589001,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
50%,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.5,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.0
75%,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.75,2.0,2.0,3.0,2.0,3.0,3.0,2.75,3.0,2.0,2.0,2.0,3.0,2.0,2.0,0.0
max,4.0,3.0,4.0,4.0,2.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,1.0,3.0,2.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,0.0


Taking a similar approach as we did with the divorced data set, we'll drop the columns whose mean is greater than 1.5 and less than 3 so that only the most distinctive responses remain.

After making that change, we go from 55 columns to 34 columns. In addition, based on the questions from the `divorce_README.pdf` file, non-divorced couples do not respond to the shared-values questions the way divorced couples do. And the heated discussions captured by columns 31 to 54 are far tamer than among the couples who ended up divorced.

One caveat: many of the original responses among these 86 entries were missing (recorded as 0) and were filled in during feature engineering, so the statistics for both the divorced and non-divorced data sets partly reflect imputed values.

In [11]:
non_divorced_mean = not_divorced.copy()
columns = non_divorced_mean.columns 
for c in columns:
    if non_divorced_mean[c].mean() < 3 and non_divorced_mean[c].mean() > 1.5:
        non_divorced_mean.drop(c, axis=1, inplace=True)

non_divorced_mean.describe()

Unnamed: 0,2_ignore_diff,5_special_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,38_hate_subjects,39_sudden_discussion,41_calm_breaks,54_incompetence,55_divorce_y_n
count,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0
mean,1.383721,1.476744,1.081395,1.244186,1.162791,1.313953,1.116279,1.104651,1.325581,1.139535,1.174419,1.232558,1.023256,1.348837,1.116279,1.139535,1.0,1.069767,1.360465,1.023256,1.011628,1.0,1.081395,1.011628,1.186047,1.232558,1.27907,1.430233,1.093023,1.372093,1.418605,1.290698,1.488372,0.0
std,0.57739,0.502388,0.275045,0.432123,0.37134,0.46682,0.322439,0.307899,0.518864,0.348536,0.381695,0.424941,0.151599,0.479398,0.322439,0.348536,0.0,0.256249,0.482951,0.151599,0.107833,0.0,0.314927,0.107833,0.391427,0.567231,0.566748,0.711927,0.423651,0.595009,0.727043,0.481817,0.589001,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
50%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
75%,2.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,0.0
max,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,3.0,2.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,0.0
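
Instead of reading two separate `describe()` tables, the class means can be placed side by side. This is a minimal sketch with invented responses for two questions; the column names are borrowed from the data set, but the numbers are not.

```python
import pandas as pd

# Invented responses; 1 = divorced, 0 = not divorced.
toy = pd.DataFrame({
    '31_aggro_argue': [4, 3, 1, 2],
    '6_no_home_time': [1, 2, 2, 1],
    '55_divorce_y_n': [1, 1, 0, 0],
})

# One row per question, one column per class.
class_means = toy.groupby('55_divorce_y_n').mean().T

# Positive gap: divorced respondents score higher on that question.
class_means['gap'] = class_means[1] - class_means[0]

print(class_means.sort_values('gap', ascending=False))
```

Sorting by the gap puts the questions that separate the two groups most strongly at the top.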


### Skew
Based on the skew results below, most features are fairly symmetrical or moderately skewed. Three features have a skew of 1.0 or more: `6_no_home_time`, `22_care_sick`, and especially `7_2_strangers`, which is highly right skewed (2.79). For now, we'll hold off on further feature engineering until we get further down the line.

In [12]:
def displaySkew(df):
    for c in df.columns:
        if df[c].skew() >= 1.0 or df[c].skew() <= -1.0:
            print(c, df[c].skew())

displaySkew(df)

6_no_home_time 1.26651339871123
7_2_strangers 2.787329100651511
22_care_sick 1.0533662124200573
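
The same check can return the flagged columns as a Series rather than printing them, which makes the result easier to reuse. `high_skew` is a hypothetical helper name and the data below is made up.

```python
import pandas as pd

def high_skew(df, threshold=1.0):
    """Return the skew of every column whose absolute skew is at least `threshold`."""
    skew = df.skew()
    return skew[skew.abs() >= threshold]

toy = pd.DataFrame({
    'symmetric':    [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],   # balanced around 3
    'right_skewed': [1, 1, 1, 1, 1, 1, 1, 1, 1, 5],   # one large value in the tail
})

flagged = high_skew(toy)
print(flagged)
```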


## Predictive Model

### Setup Reusable Functions
We'll set up reusable functions here to compute classification metrics and display a confusion matrix after using our model to make predictions.

In [13]:
import sklearn.metrics as metric
from sklearn.metrics import confusion_matrix

def classificationMetrics(y_test, y_pred):
    accuracy = np.round(metric.accuracy_score(y_true=y_test, y_pred=y_pred), decimals=3)
    precision = np.round(metric.precision_score(y_true=y_test, y_pred=y_pred), decimals=3)
    recall = np.round(metric.recall_score(y_true=y_test, y_pred=y_pred), decimals=3)

    return { 'accuracy': accuracy, 'precision': precision, 'recall': recall }

def displayConfusionMatrix(x_test, y_test, model):
    threshold = 0.5
    y_pred_prob = model.predict_proba(x_test)[:, 1]
    y_pred = (y_pred_prob > threshold).astype(int)
    matrix = confusion_matrix(y_test, y_pred)
    matrix_df = pd.DataFrame(matrix, index=["Obs Not Divorced", "Obs Divorced"], columns=["Pred Not Divorced", "Pred Divorced"])
    print(matrix_df)

### Setup Predictive Model & Evaluate
Using default hyperparameters for our RandomForestClassifier and the default test_size of 25%, our classification metrics are perfect! 100% accuracy, precision and recall!

Chances are if it's too good to be true, *it is too good to be true*. The most likely reason our prediction model achieved such astounding metrics is that the divorce dataset is small: after removing duplicates only 152 rows remain, 38 of which land in the test set. So how do we address it? We'll use a handy Python library that analyzes our existing dataset and then generates synthetic data.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = df.loc[:, df.columns != '55_divorce_y_n']
Y = df['55_divorce_y_n']

x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=100) # Using default test_size of 0.25 (25%)
model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print('Random Forest Evaluation Metrics:', classificationMetrics(y_test, y_pred))
displayConfusionMatrix(x_test, y_test, model)

Random Forest Evaluation Metrics: {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0}
                  Pred Not Divorced  Pred Divorced
Obs Not Divorced                 20              0
Obs Divorced                      0             18
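
One way to check whether a perfect score on a 38-row test set is a fluke of a single split is to score the model on several splits. The sketch below is not part of the original analysis: it uses a synthetic stand-in generated with `make_classification` rather than the divorce data, and the fold count and estimator settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small tabular dataset (not the divorce data).
X_toy, y_toy = make_classification(n_samples=150, n_features=20, random_state=0)

# Five stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_toy, y_toy, cv=cv, scoring='accuracy',
)

print(scores.round(3), round(scores.mean(), 3))
```

If the scores vary noticeably between folds, a single train/test split is probably too small to trust on its own.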


### Introduce Synthetic Data
SDV (Synthetic Data Vault) is a Python package that generates synthetic data based on our existing dataset (Wijaya, 2022). What's really cool is that after generating synthetic data, SDV provides an evaluation tool to score it, giving us some assurance that the synthetic data is of high quality.

Because the evaluation score is above 0.8 (about 0.88 on the KS test), we can move forward with adding this synthetic data to our existing dataset.

In [15]:
# Installation instructions found here - https://github.com/sdv-dev/SDV
from sdv.tabular import GaussianCopula
from sdv.evaluation import evaluate

synthetic_model = GaussianCopula()
synthetic_model.fit(df)

sample = synthetic_model.sample(5000)
evaluate(sample, df, metrics=['KSTest'], aggregate=False)



Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal,error
0,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.881197,0.881197,0.0,1.0,MAXIMIZE,
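
The score above is labeled an inverted Kolmogorov-Smirnov D statistic, that is, one minus the largest gap between the empirical distribution functions of the real and the synthetic values. The sketch below illustrates that idea for a single column using plain NumPy. `ks_complement` is a hypothetical helper written for this example and is not SDV's implementation; how SDV combines the per-column results is not shown here.

```python
import numpy as np

def ks_complement(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))

    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side='right') / real.size
    cdf_syn = np.searchsorted(synthetic, grid, side='right') / synthetic.size

    # D is the largest vertical distance between the two CDFs.
    d_statistic = np.max(np.abs(cdf_real - cdf_syn))
    return 1.0 - d_statistic

# Invented 1-5 style responses for one question.
real_col = [1, 1, 2, 3, 4]
synthetic_col = [1, 2, 2, 3, 3]

print(ks_complement(real_col, synthetic_col))
```

A score of 1.0 means the two columns have identical distributions, and a score of 0.0 means they do not overlap at all.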


In [16]:
df = pd.concat([df, sample])        # Add synthetic data to our existing data set
df.drop_duplicates(inplace=True)    # Drop any potential duplicate introduced by synthetic dataset
df

Unnamed: 0,1_sorry_end,2_ignore_diff,3_begin_correct,4_contact,5_special_time,6_no_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_aggro_argue,32_always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_i'm_right,49_accusations,50_i'm_not_guilty,51_i'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_divorce_y_n
0,2,2,4,1,2,1,1,2,2,2,1,2,1,1,2,1,1,2,2,1,1,1,2,2,1,1,1,2,1,1,1,2,1,2,2,1,2,1,3,3,2,1,1,2,3,2,1,3,3,3,2,3,2,1,1
1,4,4,4,4,4,1,1,4,4,4,4,3,4,2,4,4,4,4,3,2,1,1,2,2,2,1,2,2,1,1,2,4,2,3,2,2,3,4,2,4,2,2,3,4,2,2,2,3,4,4,4,4,2,2,1
2,2,2,2,2,1,3,2,1,1,2,3,4,2,3,3,3,3,3,3,2,1,1,1,2,2,2,2,2,3,2,3,3,1,1,1,1,2,1,3,3,3,3,2,3,2,3,2,3,1,1,1,2,2,2,1
3,3,2,3,2,3,3,3,3,3,3,4,3,3,4,3,3,3,3,3,4,1,1,1,1,2,1,1,1,1,3,2,3,2,2,1,1,3,3,4,4,2,2,3,2,3,2,2,3,3,3,3,2,2,2,1
4,2,2,1,1,1,1,3,3,3,2,3,1,2,1,1,1,1,1,2,1,1,2,2,2,2,2,1,2,1,1,1,1,1,1,1,1,2,2,2,1,2,2,3,2,2,2,1,2,3,2,2,2,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,2,2,2,3,1,2,2,2,2,1,2,2,1,2,2,2,1,2,2,1,1,1,1,1,1,1,1,1,1,2,1,2,1,1,1,1,3,1,1,2,1,2,1,1,1,1,1,2,2,2,2,1,1,1,0
4996,3,3,2,3,3,2,2,4,3,3,4,3,3,3,3,4,3,3,3,3,2,3,3,3,3,3,2,3,3,3,3,2,3,3,2,3,4,3,4,4,3,2,3,1,4,2,2,3,2,3,4,3,2,3,1
4997,3,3,4,2,3,1,1,4,4,2,4,4,4,4,3,2,4,4,4,4,3,2,4,3,4,3,3,2,3,3,2,4,4,4,4,4,4,4,3,4,4,4,4,4,4,4,4,3,4,4,4,4,3,4,1
4998,4,4,4,4,4,2,2,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,4,4,4,4,4,4,4,4,4,4,4,3,4,4,4,4,4,4,4,4,4,4,4,4,1


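### Reanalyze Skew
After adding the synthetic data, all features are fairly symmetrical or moderately skewed with the exception of `7_2_strangers`, which is still highly right skewed (1.11). To address this, we winsorize the top 4% of its values, which brings the skew down to about 0.5.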

In [17]:
from scipy.stats.mstats import winsorize
displaySkew(df)

temp_df = df.copy()
temp_df['7_2_strangers'] = winsorize(temp_df['7_2_strangers'], limits=(0.00, 0.04))
temp_df['7_2_strangers'].skew() # skew value of 0.5

df['7_2_strangers'] = winsorize(df['7_2_strangers'], limits=(0.00, 0.04)) # make it permanent!

7_2_strangers 1.1098403827164918
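
To show what winsorizing does to the values themselves, here is a small example with the same `limits` as above applied to a made-up array. The numbers are invented; only the call mirrors the cell above.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# 25 invented responses with a single large value in the upper tail.
values = np.array([1] * 20 + [2] * 4 + [4])

# limits=(0.00, 0.04): leave the low end alone and cap the top 4%
# (here 1 of 25 values) at the next-highest remaining value.
capped = np.asarray(winsorize(values, limits=(0.00, 0.04)))

print(values.max(), capped.max())
```

Unlike trimming, winsorizing keeps every row; it only pulls the most extreme values in toward the rest of the distribution.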


### Setup Predictive Model & Evaluate (Part 2 - Post Synthetic Data Addition)
After adding 5000 additional rows of synthetic data (slightly fewer after removing duplicate rows) and using the same predictive model setup, our classification metrics now seem more realistic and in tune with reality.

With accuracy, precision and recall scores of roughly 90%, we're looking good so far!

But as always, we can do better! Looking at our confusion matrix, there are:
- 53 actual not divorced, but predicted as divorced (false positives)
- 74 actual divorced, but predicted as not divorced (false negatives)

Although our model is solid, we especially want to avoid false positives. For married couples coming in for therapy and filling out the survey, it would be better to deal with false negatives (wrongly predicting couples won't get divorced) than false positives (wrongly predicting couples will get divorced). The reason is that false positives may lead couples to pay for more marriage counseling than they need; the last thing we'd want to do is place more financial stress on married couples who are seeking help to save their marriage.

With that said, we'll want to aim for a higher precision, which we'll cover next.
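
Besides tuning hyperparameters, precision also depends on the decision threshold, which `displayConfusionMatrix` fixes at 0.5. The sketch below is not part of the original analysis: it uses invented labels and probabilities, and `precision_recall_at` is a hypothetical helper, to illustrate how raising the threshold can trade recall for precision.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for probabilities above `threshold`."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels (1 = divorced) and predicted probabilities of divorce.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9, 0.3]

for t in (0.5, 0.65):
    print(t, precision_recall_at(y_true, y_prob, t))
```

In this toy case the higher threshold removes the one false positive but also misses one actual divorce, which is the trade-off to weigh when false positives are the costlier error.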

In [18]:
X = df.loc[:, df.columns != '55_divorce_y_n']
Y = df['55_divorce_y_n']

x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=100) # Using default test_size of 0.25 (25%)
model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print('Random Forest Evaluation Metrics:', classificationMetrics(y_test, y_pred))
displayConfusionMatrix(x_test, y_test, model)

Random Forest Evaluation Metrics: {'accuracy': 0.901, 'precision': 0.907, 'recall': 0.874}
                  Pred Not Divorced  Pred Divorced
Obs Not Divorced                644             53
Obs Divorced                     74            515
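
As a worked check, the three reported metrics can be recomputed by hand from the four counts in the confusion matrix above.

```python
# Counts read from the confusion matrix above.
tn, fp = 644, 53    # observed not divorced: predicted not divorced / predicted divorced
fn, tp = 74, 515    # observed divorced:     predicted not divorced / predicted divorced

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right
precision = tp / (tp + fp)                   # of the predicted divorces, how many were right
recall = tp / (tp + fn)                      # of the actual divorces, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Precision is the metric that penalizes false positives, which is why it is the one to optimize here.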


### Hyperparameter Tuning
Using the RandomForestClassifier model, we'll use grid search to tune hyperparameters for the best `precision` score.
After 4-5 minutes of running, the best cross-validated precision across the hyperparameter combinations is about `0.90`, which is not much of a difference from the `0.907` test-set precision obtained with the default hyperparameters for RandomForestClassifier.

Hence, it's reasonable to keep the default hyperparameters included with our predictive model.

In [19]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
n_estimators = [100, 500, 1000]
class_weight = ['balanced', None]
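max_features = ['sqrt', 'log2', 'auto']  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3; drop it on newer versions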

grid = {
    'n_estimators': n_estimators,
    'class_weight': class_weight,
    'max_features': max_features
}

cv = RepeatedStratifiedKFold(n_splits=15, n_repeats=3, random_state=1)

grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='precision',error_score=0)
grid_result = grid_search.fit(x_train, y_train)
print("Best precision: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best precision: 0.898741 using {'class_weight': 'balanced', 'max_features': 'log2', 'n_estimators': 100}
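
The run time of several minutes follows from the size of the search. A quick count, using the grid and cross-validation settings from the cell above, shows how many models are trained.

```python
from itertools import product

# Same grid and cross-validation settings as in the cell above.
grid = {
    'n_estimators': [100, 500, 1000],
    'class_weight': ['balanced', None],
    'max_features': ['sqrt', 'log2', 'auto'],
}
n_splits, n_repeats = 15, 3

# Every combination of hyperparameter values is one candidate model.
combinations = list(product(*grid.values()))

# Each candidate is fitted once per fold and per repeat.
fits = len(combinations) * n_splits * n_repeats

print(len(combinations), fits)
```

Fewer splits or repeats would shorten the search considerably, at the cost of a noisier precision estimate.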


### Export Table & Model
The exported model and table will be used for assignment 16.

In [22]:
model = RandomForestClassifier()
model.fit(x_train, y_train)

filename = 'divorce_model.sav'
joblib.dump(model, filename)

df.to_csv('divorce_new_table.csv')
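
For completeness, a saved model can be read back with `joblib.load`. The sketch below is a self-contained round trip on a synthetic stand-in dataset; the file name `toy_model.sav` and the data are placeholders for illustration and are not the files exported above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (not the divorce model).
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.sav')   # hypothetical file name
    joblib.dump(model, path)                    # write the fitted model to disk
    restored = joblib.load(path)                # read it back as a new object

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same)
```

A saved model should generally be loaded with the same library versions that were used to create it.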

# Conclusion
Bringing it all together, we've done the following:
- Understood the data
- Applied feature engineering, including renaming columns and addressing missing data
- Set up an initial predictive model
- After evaluating our initial predictive model's results, added synthetic data to address our predictive model's shortcoming of overfitting because of our dataset's small size
- Re-ran our predictive model and applied hyperparameter tuning to check that we're getting the best precision for our predictive model

# References
*Dealing With Missing Values in Python – A Complete Guide*. (2021, May 19). Analytics Vidhya.\
&emsp; Retrieved March 24, 2022, from\
&emsp; https://www.analyticsvidhya.com/blog/2021/05/dealing-with-missing-values-in-python-a-complete-guide/

Wijaya, C. Y. (2022, January 31). *Top 3 Python Packages to Generate Synthetic Data*. Towards Data\
&emsp; Science. Retrieved March 24, 2022, from\
&emsp; https://towardsdatascience.com/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c