# Assignment 6: Project Part II
Predicting Divorce\
Ismail Abdo Elmaliki\
CS 502 - Predictive Analytics\
Capitol Technology University\
Professor Frank Neugebauer\
February 10, 2022

## Data Understanding
We'll take a look at the divorce data to better understand its correlations, its summary statistics, and other essential parts.

### Info
At first glance there are no missing values: every column has 170 non-null entries of type int64.

In [123]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot

df = pd.read_csv('divorce.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 55 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Sorry_end                      170 non-null    int64
 1   Ignore_diff                    170 non-null    int64
 2   begin_correct                  170 non-null    int64
 3   Contact                        170 non-null    int64
 4   Special_time                   170 non-null    int64
 5   No_home_time                   170 non-null    int64
 6   2_strangers                    170 non-null    int64
 7   enjoy_holiday                  170 non-null    int64
 8   enjoy_travel                   170 non-null    int64
 9   common_goals                   170 non-null    int64
 10  harmony                        170 non-null    int64
 11  freeom_value                   170 non-null    int64
 12  entertain                      170 non-null    int64
 13  people_goals        

But let's verify that no data is actually missing. According to the `divorce_README` file, questions are ranked on a scale from 1 to 5. Hence a value of 0 in any column other than `Divorce_Y_N` marks a missing value.

The counts below show that, out of the 170 rows, some columns are missing fewer than half of their values and others more than half (for example, `2_strangers` has 114 zeros). These zeros will need to be addressed.

In [124]:
for column_name in df.columns:
    # Skip the target column by name; 0 is a valid value there
    if column_name == 'Divorce_Y_N':
        continue
    column = df[column_name]
    # Count the zeros in this column
    count = (column == 0).sum()
    print('Count of zeros in column ', column_name, ' is : ', count)

Count of zeros in column  Sorry_end  is :  69
Count of zeros in column  Ignore_diff  is :  59
Count of zeros in column  begin_correct  is :  51
Count of zeros in column  Contact  is :  75
Count of zeros in column  Special_time  is :  82
Count of zeros in column  No_home_time  is :  86
Count of zeros in column  2_strangers  is :  114
Count of zeros in column  enjoy_holiday  is :  81
Count of zeros in column  enjoy_travel  is :  84
Count of zeros in column  common_goals  is :  62
Count of zeros in column  harmony  is :  71
Count of zeros in column  freeom_value  is :  58
Count of zeros in column  entertain  is :  47
Count of zeros in column  people_goals  is :  66
Count of zeros in column  dreams  is :  69
Count of zeros in column  love  is :  75
Count of zeros in column  happy  is :  73
Count of zeros in column  marriage  is :  79
Count of zeros in column  roles  is :  77
Count of zeros in column  trust  is :  81
Count of zeros in column  likes  is :  78
Count of zeros in column  care_s
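
The same zero count can also be computed without an explicit loop. The sketch below is only an illustration: `toy` is a small made-up frame standing in for `divorce.csv`, and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for divorce.csv (hypothetical values, not the real data).
toy = pd.DataFrame({
    'Sorry_end':   [2, 0, 4, 0],
    'Ignore_diff': [0, 0, 3, 1],
    'Divorce_Y_N': [1, 1, 0, 0],
})

# Compare every question column against 0, then sum the True values per column.
# The target is dropped by name, so its position in the frame does not matter.
zero_counts = (toy.drop(columns='Divorce_Y_N') == 0).sum()
print(zero_counts)
```

Dropping `Divorce_Y_N` by name keeps its legitimate 0 values out of the count even if the column order changes.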

Before analyzing the data any further, let's replace those 0 values with NaN. That way, any mean we calculate won't take the 0 values into account. The column `Divorce_Y_N` is excluded, since both 0 and 1 are valid values there.

In [125]:
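for c in df.columns:
    # Skip the target column; 0 is a valid value there
    if c == 'Divorce_Y_N':
        continue
    # Use np.nan: the np.NaN alias was removed in NumPy 2.0
    df[c] = df[c].replace(0, np.nan)
df.info()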

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 55 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Sorry_end                      101 non-null    float64
 1   Ignore_diff                    111 non-null    float64
 2   begin_correct                  119 non-null    float64
 3   Contact                        95 non-null     float64
 4   Special_time                   88 non-null     float64
 5   No_home_time                   84 non-null     float64
 6   2_strangers                    56 non-null     float64
 7   enjoy_holiday                  89 non-null     float64
 8   enjoy_travel                   86 non-null     float64
 9   common_goals                   108 non-null    float64
 10  harmony                        99 non-null     float64
 11  freeom_value                   112 non-null    float64
 12  entertain                      123 non-null    flo
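
To see why the replacement matters for the mean, here is a minimal sketch on a made-up column. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (hypothetical values).
toy = pd.DataFrame({
    'Sorry_end':   [2, 0, 4, 0],
    'Divorce_Y_N': [1, 1, 0, 0],
})

# Replace 0 with NaN in the question columns only; the target keeps its 0s.
question_cols = toy.columns.drop('Divorce_Y_N')
toy[question_cols] = toy[question_cols].replace(0, np.nan)

# pandas skips NaN when averaging, so only the real answers (2 and 4) count.
mean_ignoring_missing = toy['Sorry_end'].mean()
# Treating the missing answers as 0 again drags the mean down.
mean_counting_zeros = toy['Sorry_end'].fillna(0).mean()
print(mean_ignoring_missing, mean_counting_zeros)
```

With the zeros left in, the mean of this toy column would be half of what the two real answers give.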

### Describe
To better understand this data, it's best to separate it into two data frames: divorced and non-divorced. We also want to rename the columns so each one carries its question number, which makes it easy to look up a question in the `divorce_README.pdf` file.

In [126]:
# Rename columns by prepending the question number
new_columns = {}
number = 1
for c in df.columns:
    new_columns.update({c: str(number) + '_' + c})
    number += 1

df = df.rename(columns=new_columns)
df.head()

Unnamed: 0,1_Sorry_end,2_Ignore_diff,3_begin_correct,4_Contact,5_Special_time,6_No_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_Aggro_argue,32_Always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_I'm_right,49_accusations,50_I'm_not_guilty,51_I'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_Divorce_Y_N
0,2.0,2.0,4.0,1.0,,,,,,,1.0,,1.0,1.0,,1.0,,,,1.0,,,,,,,,,,1.0,1.0,2.0,1.0,2.0,,1.0,2.0,1.0,3.0,3.0,2.0,1.0,1.0,2.0,3.0,2.0,1.0,3.0,3.0,3.0,2.0,3.0,2.0,1.0,1
1,4.0,4.0,4.0,4.0,4.0,,,4.0,4.0,4.0,4.0,3.0,4.0,,4.0,4.0,4.0,4.0,3.0,2.0,1.0,1.0,,2.0,2.0,1.0,2.0,,1.0,1.0,,4.0,2.0,3.0,,2.0,3.0,4.0,2.0,4.0,2.0,2.0,3.0,4.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,4.0,2.0,2.0,1
2,2.0,2.0,2.0,2.0,1.0,3.0,2.0,1.0,1.0,2.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,1.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,1
3,3.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,3.0,2.0,3.0,2.0,2.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,3.0,3.0,3.0,3.0,2.0,2.0,2.0,1
4,2.0,2.0,1.0,1.0,1.0,1.0,,,,,,1.0,,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,,,,,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,2.0,1.0,,2.0,3.0,,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,1.0,,1
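
The numbering can also be written with `enumerate` and a dict comprehension. This is just an alternative sketch on an empty toy frame with three of the column names; it is not a change to the cell above.

```python
import pandas as pd

# Empty toy frame with three of the column names from the data set.
toy = pd.DataFrame(columns=['Sorry_end', 'Ignore_diff', 'Divorce_Y_N'])

# enumerate(..., start=1) yields the question number alongside each name.
numbered = {name: f'{i}_{name}' for i, name in enumerate(toy.columns, start=1)}
toy = toy.rename(columns=numbered)
print(list(toy.columns))
```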


#### Understanding divorced data
Calling the `describe` function shows the mean response for each question (columns 1 through 54). Of particular interest are the columns with a mean of 3.0 or higher.

For example, columns 31 through 54 all have a mean of 3.2 or higher. Based on the questions in the `divorce_README`, these are signs of a marriage going downhill that may lead to divorce.

In [127]:
divorced = df.copy()
divorced = divorced[divorced['55_Divorce_Y_N'] == 1]
divorced.describe()

Unnamed: 0,1_Sorry_end,2_Ignore_diff,3_begin_correct,4_Contact,5_Special_time,6_No_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_Aggro_argue,32_Always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_I'm_right,49_accusations,50_I'm_not_guilty,51_I'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_Divorce_Y_N
count,83.0,82.0,84.0,82.0,81.0,62.0,55.0,80.0,81.0,82.0,82.0,82.0,82.0,81.0,83.0,82.0,83.0,83.0,82.0,83.0,79.0,77.0,75.0,80.0,79.0,81.0,80.0,78.0,81.0,82.0,82.0,84.0,81.0,82.0,78.0,79.0,83.0,81.0,83.0,83.0,83.0,83.0,82.0,82.0,80.0,80.0,81.0,84.0,84.0,84.0,84.0,84.0,84.0,80.0,84.0
mean,3.228916,2.939024,2.916667,2.792683,3.123457,1.532258,1.509091,2.95,3.0,2.841463,3.292683,3.012195,3.170732,2.987654,2.975904,2.890244,3.204819,3.012048,3.256098,2.915663,2.822785,2.675325,3.106667,2.925,3.139241,2.91358,2.8125,2.75641,3.037037,2.890244,3.52439,3.416667,3.481481,3.353659,3.525641,3.417722,3.626506,3.530864,3.686747,3.614458,3.590361,3.373494,3.560976,3.463415,3.45,3.325,3.444444,3.452381,3.511905,3.5,3.357143,3.488095,3.321429,3.5375,1.0
std,0.66855,0.806571,0.747821,0.827577,0.796598,0.694657,0.978902,0.809751,0.689202,0.777294,0.67564,0.728501,0.716729,0.782525,0.732119,0.801328,0.658139,0.740698,0.604734,0.843986,0.729585,0.895038,0.763586,0.823315,0.693083,0.809283,0.764625,0.840314,0.813087,0.846286,0.863893,0.824451,0.895979,0.89404,0.90775,0.871303,0.72769,0.743075,0.64255,0.621394,0.749571,0.822116,0.755189,0.72342,0.761411,0.853511,0.935414,0.718176,0.783782,0.814034,0.845154,0.871144,0.92046,0.810434,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,2.0,3.0,2.0,3.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0
50%,3.0,3.0,3.0,3.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
75%,4.0,4.0,3.0,3.0,4.0,2.0,1.5,4.0,3.0,3.0,4.0,3.75,4.0,4.0,3.0,3.0,4.0,3.5,4.0,4.0,3.0,3.0,4.0,4.0,4.0,3.0,3.0,3.0,4.0,3.75,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
max,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0


Let's filter this data even more by keeping only the columns that have a mean of 3.2 or higher, or 1.6 or lower. These columns should tell us a bigger story about what leads to divorce.

That takes us from 55 columns to 31 columns!

And as expected, all columns from 31 through 54 appear to be high indicators of divorce. 

In [128]:
divorced_mean = divorced.copy()
columns = divorced_mean.columns 
for c in columns:
    # Drop columns whose mean lies strictly between the two cutoffs
    col_mean = divorced_mean[c].mean()
    if 1.6 < col_mean < 3.2:
        divorced_mean.drop(c, axis=1, inplace=True)

divorced_mean.info()
divorced_mean.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84 entries, 0 to 83
Data columns (total 31 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   1_Sorry_end                       83 non-null     float64
 1   6_No_home_time                    62 non-null     float64
 2   7_2_strangers                     55 non-null     float64
 3   11_harmony                        82 non-null     float64
 4   17_happy                          83 non-null     float64
 5   19_roles                          82 non-null     float64
 6   31_Aggro_argue                    82 non-null     float64
 7   32_Always_never                   84 non-null     float64
 8   33_negative_personality           81 non-null     float64
 9   34_offensive_expressions          82 non-null     float64
 10  35_insult                         78 non-null     float64
 11  36_humiliate                      79 non-null     float64
 12  37_not_cal

Unnamed: 0,1_Sorry_end,6_No_home_time,7_2_strangers,11_harmony,17_happy,19_roles,31_Aggro_argue,32_Always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_I'm_right,49_accusations,50_I'm_not_guilty,51_I'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_Divorce_Y_N
count,83.0,62.0,55.0,82.0,83.0,82.0,82.0,84.0,81.0,82.0,78.0,79.0,83.0,81.0,83.0,83.0,83.0,83.0,82.0,82.0,80.0,80.0,81.0,84.0,84.0,84.0,84.0,84.0,84.0,80.0,84.0
mean,3.228916,1.532258,1.509091,3.292683,3.204819,3.256098,3.52439,3.416667,3.481481,3.353659,3.525641,3.417722,3.626506,3.530864,3.686747,3.614458,3.590361,3.373494,3.560976,3.463415,3.45,3.325,3.444444,3.452381,3.511905,3.5,3.357143,3.488095,3.321429,3.5375,1.0
std,0.66855,0.694657,0.978902,0.67564,0.658139,0.604734,0.863893,0.824451,0.895979,0.89404,0.90775,0.871303,0.72769,0.743075,0.64255,0.621394,0.749571,0.822116,0.755189,0.72342,0.761411,0.853511,0.935414,0.718176,0.783782,0.814034,0.845154,0.871144,0.92046,0.810434,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0
50%,3.0,1.0,1.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
75%,4.0,2.0,1.5,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
max,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0
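
The same kind of filter can be expressed as a boolean mask over the column means instead of dropping columns one at a time. The sketch below assumes a hypothetical helper, `keep_extreme_means`, and a made-up frame `toy`; the cutoffs are passed in as parameters.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the divorced subset (hypothetical values).
toy = pd.DataFrame({
    '1_Sorry_end':    [3.0, 4.0, np.nan],  # mean 3.5  -> kept (high)
    '2_Ignore_diff':  [2.0, 3.0, 2.0],     # mean 2.33 -> dropped
    '7_2_strangers':  [1.0, np.nan, 2.0],  # mean 1.5  -> kept (low)
    '55_Divorce_Y_N': [1, 1, 1],           # mean 1.0  -> kept (low)
})

def keep_extreme_means(frame, low, high):
    """Keep the columns whose mean is <= low or >= high (NaN is ignored)."""
    means = frame.mean()
    keep = (means <= low) | (means >= high)
    return frame.loc[:, keep]

filtered = keep_extreme_means(toy, low=1.6, high=3.2)
print(list(filtered.columns))
```

Because the mask is computed once from `frame.mean()`, the input frame is left untouched and the cutoffs are easy to vary.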


#### Understanding non divorced data

In the non-divorced data set, not a single question has a mean of 3.0 or higher. The highest mean is about 2.5 (`46_silence_for_harm`), and every mean above 2.0 falls within columns 31 through 54.

In [129]:
not_divorced = df.copy()
not_divorced = not_divorced[not_divorced['55_Divorce_Y_N'] == 0]
not_divorced.describe()

Unnamed: 0,1_Sorry_end,2_Ignore_diff,3_begin_correct,4_Contact,5_Special_time,6_No_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,31_Aggro_argue,32_Always_never,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,42_argue_then_leave,43_silent_for_calm,44_good_to_leave_home,45_silence_instead_of_discussion,46_silence_for_harm,47_silence_fear_anger,48_I'm_right,49_accusations,50_I'm_not_guilty,51_I'm_not_wrong,52_no_hesitancy_inadequate,53_you're_inadequate,54_incompetence,55_Divorce_Y_N
count,18.0,29.0,35.0,13.0,7.0,22.0,1.0,9.0,5.0,26.0,17.0,30.0,41.0,23.0,18.0,13.0,14.0,8.0,11.0,6.0,13.0,6.0,5.0,18.0,28.0,17.0,13.0,7.0,8.0,16.0,44.0,40.0,18.0,38.0,7.0,3.0,38.0,25.0,37.0,15.0,32.0,43.0,72.0,28.0,62.0,68.0,56.0,76.0,58.0,67.0,74.0,63.0,55.0,40.0,86.0
mean,1.888889,1.37931,1.571429,1.769231,1.285714,1.454545,1.0,1.222222,1.0,1.346154,1.0,1.133333,1.268293,1.086957,1.111111,1.076923,1.071429,1.0,1.090909,1.0,1.0,1.0,1.4,1.277778,1.035714,1.0,1.0,1.0,1.0,1.0625,1.636364,1.575,1.388889,1.263158,1.285714,1.0,1.421053,1.2,1.324324,1.2,1.28125,2.023256,2.333333,1.642857,2.290323,2.470588,1.910714,2.315789,1.896552,1.776119,1.878378,2.142857,1.854545,1.475,0.0
std,1.02262,0.676852,0.777844,1.300887,0.48795,0.738549,,0.440959,0.0,0.485165,0.0,0.345746,0.501218,0.288104,0.323381,0.27735,0.267261,0.0,0.301511,0.0,0.0,0.0,0.547723,0.460889,0.188982,0.0,0.0,0.0,0.0,0.25,0.809562,0.873763,0.849837,0.644486,0.755929,0.0,0.858395,0.5,0.626013,0.414039,0.522671,0.912568,1.020908,0.82616,1.178873,1.014383,1.10003,0.769598,0.985681,0.794315,0.775528,1.075492,0.970265,0.678894,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
50%,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.5,2.0,3.0,1.5,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.0
75%,2.0,2.0,2.0,2.0,1.5,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.75,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.75,1.0,2.0,1.0,1.25,2.0,3.0,2.0,3.0,3.0,3.0,3.0,2.75,2.0,2.0,3.0,2.0,2.0,0.0
max,4.0,3.0,4.0,4.0,2.0,4.0,1.0,2.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,4.0,4.0,4.0,4.0,3.0,1.0,4.0,3.0,4.0,2.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,0.0


Taking a similar approach to the divorced data set, we drop the columns whose mean is less than 3 and greater than 1.5, which keeps only the extremes.

After making that change, we go from 55 columns to 38 columns, as shown below. In addition, based on the questions in the `divorce_README.pdf` file, non-divorced spouses don't necessarily share common values the way divorced couples do. Yet despite that, their heated discussions (columns 31 to 54) are far tamer than those of couples who knew each other well and still ended up divorced.

However, before analyzing further, notice that across the 86 entries many of the columns (all except the last) have a lot of missing values! Hence, before moving forward with the analysis, it's important to address the missing data and then re-analyze both the divorced and non-divorced data sets.

In [130]:
non_divorced_mean = not_divorced.copy()
columns = non_divorced_mean.columns 
for c in columns:
    # Drop columns whose mean lies strictly between the two cutoffs
    col_mean = non_divorced_mean[c].mean()
    if 1.5 < col_mean < 3:
        non_divorced_mean.drop(c, axis=1, inplace=True)

non_divorced_mean.info()
non_divorced_mean.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86 entries, 84 to 169
Data columns (total 38 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   2_Ignore_diff             29 non-null     float64
 1   5_Special_time            7 non-null      float64
 2   6_No_home_time            22 non-null     float64
 3   7_2_strangers             1 non-null      float64
 4   8_enjoy_holiday           9 non-null      float64
 5   9_enjoy_travel            5 non-null      float64
 6   10_common_goals           26 non-null     float64
 7   11_harmony                17 non-null     float64
 8   12_freeom_value           30 non-null     float64
 9   13_entertain              41 non-null     float64
 10  14_people_goals           23 non-null     float64
 11  15_dreams                 18 non-null     float64
 12  16_love                   13 non-null     float64
 13  17_happy                  14 non-null     float64
 14  18_marriag

Unnamed: 0,2_Ignore_diff,5_Special_time,6_No_home_time,7_2_strangers,8_enjoy_holiday,9_enjoy_travel,10_common_goals,11_harmony,12_freeom_value,13_entertain,14_people_goals,15_dreams,16_love,17_happy,18_marriage,19_roles,20_trust,21_likes,22_care_sick,23_fav_food,24_stresses,25_inner_world,26_anxieties,27_current_stress,28_hopes_wishes,29_know_well,30_friends_social,33_negative_personality,34_offensive_expressions,35_insult,36_humiliate,37_not_calm,38_hate_subjects,39_sudden_discussion,40_idk_what's_going_on,41_calm_breaks,54_incompetence,55_Divorce_Y_N
count,29.0,7.0,22.0,1.0,9.0,5.0,26.0,17.0,30.0,41.0,23.0,18.0,13.0,14.0,8.0,11.0,6.0,13.0,6.0,5.0,18.0,28.0,17.0,13.0,7.0,8.0,16.0,18.0,38.0,7.0,3.0,38.0,25.0,37.0,15.0,32.0,40.0,86.0
mean,1.37931,1.285714,1.454545,1.0,1.222222,1.0,1.346154,1.0,1.133333,1.268293,1.086957,1.111111,1.076923,1.071429,1.0,1.090909,1.0,1.0,1.0,1.4,1.277778,1.035714,1.0,1.0,1.0,1.0,1.0625,1.388889,1.263158,1.285714,1.0,1.421053,1.2,1.324324,1.2,1.28125,1.475,0.0
std,0.676852,0.48795,0.738549,,0.440959,0.0,0.485165,0.0,0.345746,0.501218,0.288104,0.323381,0.27735,0.267261,0.0,0.301511,0.0,0.0,0.0,0.547723,0.460889,0.188982,0.0,0.0,0.0,0.0,0.25,0.849837,0.644486,0.755929,0.0,0.858395,0.5,0.626013,0.414039,0.522671,0.678894,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
50%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
75%,2.0,1.5,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.75,1.0,2.0,1.0,1.25,2.0,0.0
max,3.0,2.0,4.0,1.0,2.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,4.0,4.0,3.0,1.0,4.0,3.0,4.0,2.0,3.0,4.0,0.0
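
To put a number on how much is missing in each group, the share of NaN values per column can be computed directly. The sketch below uses a made-up frame `toy` with hypothetical values; `missing_share` is a name chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in with two question columns and the target (hypothetical values).
toy = pd.DataFrame({
    '7_2_strangers':      [1.0, 2.0, np.nan, np.nan],
    '43_silent_for_calm': [4.0, np.nan, 2.0, 3.0],
    '55_Divorce_Y_N':     [1, 1, 0, 0],
})

# isna() marks missing answers as True; the mean of booleans is the share
# of True values, computed here separately for each outcome group.
missing_share = (
    toy.drop(columns='55_Divorce_Y_N')
       .isna()
       .groupby(toy['55_Divorce_Y_N'])
       .mean()
)
print(missing_share)
```

A column whose share is close to 1.0 in one group has almost no real answers there, so its mean for that group rests on very few rows.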


# Conclusion
After looking at the data, we were able to narrow down the number of columns to better understand patterns in the divorced versus non-divorced data sets. However, the non-divorced data set has many missing values. These need to be addressed through feature engineering before the data is analyzed further.