<p style="color:darkred; font-family:Futura; font-size:50px; text-align:center">WaiDatathon 2021: "Combat domestic violence with data and AI"</p>

# Content

1. #### [Problem description](#prob)
1. #### [Setup](#setup)
1. #### [Acceptability vs experience of domestic violence](#acc_vs_exp)
1. #### [Analysis of the DHS women questionnaire](#DHS)
    1. #### [Feature selection](#feat)
    1. #### [Data cleaning](#clean)
    1. #### [Exploratory data analysis](#EDA)
1. #### [Modeling](#model)
    1. #### [Data cleaning](#clean)
    1. #### [Feature predictivity](#feature)
    1. #### [Model training](#train)
    1. #### [Final result: personalized questionnaire](#result)
    1. #### [Discussion](#discuss)
1. #### [Conclusion](#concl)    

<a id="prob"></a>
<p style="color:darkred; font-family:Futura; font-size:40px">1. Problem description</p>

The WaiDatathon 2021 was organized by Women in AI on the theme ["Combat Domestic Violence with Data & AI"](https://www.womeninai.co/waidatathon-details). This notebook reproduces the main analysis that earned us first place in the competition.

Many professionals, including social workers, medical professionals, academics, politicians and police officers, work daily to help women in abusive relationships, raise public awareness and fight this phenomenon. Despite their efforts, domestic violence remains a critical but hidden problem all around the world. When approaching this challenge, it was clear to us that no amount of machine learning would solve the problem. Domestic violence is a sensitive and emotional topic for the victims. One of machine learning's greatest strengths is its ability to automate complex tasks, but domestic violence is a deeply human problem, and we believe that humans are the most important part of the support mechanism.

Three human parties are involved in domestic violence: the victim, the abuser and the supporters. We approached the challenge in two parts. First, we addressed the victim's point of view. In particular, we tried to understand the relationship between attitudes toward domestic violence and its occurrence, and how both are affected by demographic characteristics. Second, we tried to provide supporters with an additional tool to detect women in danger. For this purpose, we proposed a machine learning model that creates a questionnaire to detect early signs of domestic violence. Because the model can be deployed on paper or through dialogue, it is especially suited to developing countries.

You can watch our 5-minute competition talk [here](https://lnkd.in/gBupkte).

<a id="prob"></a>
<p style="color:darkred; font-family:Futura; font-size:40px">1. Dataset description</p>

Our analysis is based on the amazing data from [The Demographic and Health Surveys (DHS) Program](https://www.google.com). The DHS Program's goal is to collect, analyze and disseminate accurate and representative data on population, health, HIV and nutrition. It has run more than 400 surveys in 90 countries.

The [dataset provided for the competition](https://data.world/makeovermonday/2020w10/workspace/file?filename=20200306+Data+International+Women%27s+Day+Viz5+Launch.csv) contains the results for the **acceptability question** (see below), aggregated per demographic group (e.g. age, education, location). In other words, this dataset contains information such as "70% of respondents with a secondary level of education answered 'yes' to a given question". However, since it does not contain the demographic information and answers of individual respondents, it was not possible to cross multiple variables (e.g. age and education) or to build a predictive model.
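To make the limitation concrete, here is a minimal sketch using made-up respondent-level data. The column names (`education`, `age_group`, `accepts`) and all values are hypothetical and are not taken from the DHS files. With one row per respondent, the per-education share can be recomputed and also broken down further by age group, which is not possible from the pre-aggregated percentages alone.

```python
import pandas as pd

# Hypothetical respondent-level data: one row per respondent (illustrative values only)
respondents = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary", "secondary", "primary"],
    "age_group": ["15-24", "25-34", "15-24", "25-34", "15-24", "15-24"],
    "accepts":   [1, 0, 1, 0, 0, 1],  # 1 = answered "yes" to an acceptability question
})

# What an aggregated dataset provides: share of "yes" per education level
by_education = respondents.groupby("education")["accepts"].mean()

# What requires individual responses: share of "yes" per education level AND age group
crossed = respondents.pivot_table(
    index="education", columns="age_group", values="accepts", aggfunc="mean"
)

print(by_education)
print(crossed)
```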

To overcome this difficulty, we obtained the complete DHS survey's [women's questionnaire](https://dhsprogram.com/Methodology/Survey-Types/DHS-Questionnaires.cfm#CP_JUMP_16179) for Sub-Saharan countries. This dataset contains the responses of individual respondents (samples) to thousands of questions (features). For a given country and survey phase, the dataset contains on the order of 10,000 samples and 4,200 features. However, due to time constraints and the need to clean the data, we did not use all the features but selected a subset of 38 questions.
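Loading only a chosen subset of columns keeps such a wide file manageable. The sketch below assumes the survey is available as a Stata `.dta` file, one of the formats the DHS Program distributes. The helper `load_selected_features`, the demo file and the column list are hypothetical and do not reproduce our actual selection of 38 questions.

```python
import os
import tempfile

import pandas as pd


def load_selected_features(path, columns):
    """Read only the requested columns from a Stata file, keeping the numeric answer codes."""
    return pd.read_stata(path, columns=columns, convert_categoricals=False)


# Build a tiny stand-in file so the sketch is self-contained (values are made up)
demo = pd.DataFrame({
    "v012": [23, 35],    # illustrative column
    "v106": [1, 2],      # illustrative column
    "v744a": [0, 1],     # one acceptability item
    "unused": [5, 6],    # a column we do not want to load
})

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dta")
    demo.to_stata(demo_path, write_index=False)
    subset = load_selected_features(demo_path, ["v012", "v744a"])

print(subset)
```

Passing `convert_categoricals=False` keeps the raw numeric codes (0, 1, 8, 9, ...) instead of their text labels, which makes later recoding explicit.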

For the competition we used DHS survey data for Benin and Congo, but to comply with the copyright requirements of the DHS Program, this GitHub repository contains only DHS example data and data generated by [STATcompiler](https://www.statcompiler.com/). Access to more data is free upon request on the [DHS website](https://dhsprogram.com/data/Using-Datasets-for-Analysis.cfm). This notebook can be used to process any survey data from the DHS women's questionnaire. However, since the survey questions are revised from one phase to another, some adjustments to the data cleaning process might be required when using earlier phases.

Some of the most important features for our analysis were in the form of multiple short questions on a given theme. We treated these questions either as: 
- individual features
- a sum of positive answers (for a given theme)
- a boolean value (True if any question on a theme was answered true, False otherwise)
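
The three treatments above can be sketched as follows on toy data. The values are made up, only three of the acceptability items are used, and counting "Don't know" (8) and "Missing" (9) as non-positive is an assumption made for this illustration.

```python
import pandas as pd

# Toy answers to three acceptability items (0 = No, 1 = Yes, 8 = Don't know, 9 = Missing)
toy = pd.DataFrame({
    "v744a": [0, 1, 8, 0],
    "v744b": [0, 1, 1, 9],
    "v744c": [0, 0, 1, 0],
})
theme_cols = ["v744a", "v744b", "v744c"]

# Individual features: the columns are used as they are.
# Assumption: only code 1 counts as a positive answer.
positive = toy[theme_cols].eq(1)

# Sum of positive answers for the theme
toy["v744_sum"] = positive.sum(axis=1)

# Boolean value: True if any question on the theme was answered "yes"
toy["v744_any"] = positive.any(axis=1)

print(toy)
```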

Below, we list these questions with their possible answers.

# Important questions

### Acceptability of violence:

Feature names: `v744a - v744f` <br>
Possible answers: 
- 0  No
- 1  Yes
- 8  Don't know
- (m) 9  Missing
- (na)    Not applicable

In your opinion, is a husband justified in hitting or beating his wife when:
- she goes out without telling him?
- she neglects the children?
- she argues with him?
- she refuses to have sex with him?

### Marital control behaviors:
Feature names: `d101a - d101e` <br>
Possible answers: see previous  <br>

Does your husband: 
- become jealous or angry if you talk to other men?
- frequently accuse you of being unfaithful?
- not permit you to meet your female friends?
- try to limit your contact with your family?
- insist on knowing where you are at all times?
- not trust you with money?





### Physical violence:

Feature names: `d105a - d105k` <br>
Possible answers: <br>
- 0  Never
- 1  Often
- 2  Sometimes
- 3  Yes, but not in the last 12 months
- 4  Yes, but frequency in last 12 months missing
- (m) 9  Missing
- (na)    Not applicable

Have you ever been:
- pushed, shaken or had something thrown at you by husband/partner?
- slapped by husband/partner?
- punched with fist or hit by something harmful by husband/partner?
- kicked or dragged by husband/partner?
- strangled or burnt by husband/partner?
- threatened with knife/gun or other weapon by husband/partner?
- physically forced into unwanted sex by husband/partner?
- forced into other unwanted sexual acts by husband/partner?
- had arm twisted or hair pulled by husband/partner?
- physically forced to perform sexual acts respondent didn't want to?
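
Because the answer codes mix frequency and recency, it can help to collapse them into a single "ever experienced" indicator. The sketch below is one possible recoding, not necessarily the one used in our cleaning step: codes 1 to 4 are treated as "yes", 0 as "no", and 9 or absent values as missing. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up answers to one physical violence item
answers = pd.Series([0, 1, 2, 3, 4, 9, np.nan], name="d105a")

# Assumed recoding: 0 = never, 1-4 = experienced at some point
ever_codes = {0: 0.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

# Codes absent from the mapping (9 = Missing, NaN = Not applicable) become NaN
ever_experienced = answers.map(ever_codes)

print(ever_experienced)
```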

### Emotional violence:

Feature names: `d103a - d103c` <br>
Possible answers: see previous <br>

Have you ever been:
- humiliated by husband/partner?
- threatened with harm by husband/partner?
- insulted or made to feel bad by husband/partner?

<a id="setup"></a>
<p style="color:darkred; font-family:Futura; font-size:40px">2. Setup</p>
First we import libraries, customize figure output and load the data.

```python
#collapse
# Import libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Customize figure output
# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever is available
seaborn_style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(seaborn_style)
mpl.rc('font', size=18)
mpl.rc('axes', labelsize='large')
mpl.rc('xtick', labelsize='large')
mpl.rc('ytick', labelsize='large')

plt.rcParams['figure.figsize'] = [20, 10]  # For larger plots
```



<a id="acc_vs_exp"></a>
<p style="color:darkred; font-family:Futura; font-size:40px">3. Acceptability vs experience of domestic violence</p>

```python
#collapse
# Load the country-level summary, skipping the header line and the 11 footer lines
df = pd.read_csv('./Data/DHS_summary_world.csv', skiprows=1, skipfooter=11, engine='python')
# Attach the ISO codes for each country
df = df.merge(pd.read_csv('./Data/iso_alpha_list.csv'),
              left_on='Country', right_on='country', how='left')
# Sample only the most recent survey for each country
# (takes the first row per country, i.e. the most recent survey if rows are ordered newest first)
country_list = df['Country'].unique().tolist()
temp = [df[df['Country'] == country].reset_index().iloc[0] for country in country_list]
df = pd.DataFrame(temp).reset_index().drop(columns=['level_0', 'index'])

# Distributions
# x = "Wife beating justified for at least one specific reason"
# y = "Physical violence committed by husband/partner in last 12 months"

# y_list = [x + " [Men]", x + " [Women]"]

# temp = df[y_list].copy().rename(dict(zip(y_list,["Men", "Women"])),axis=1)

# fig = px.violin(temp,box=True,title=x)
# fig.update_layout(title=dict(x=0.5,font=dict(family='Futura')))
# fig.show()
```