# [Global 88] Gender Violence in Colombia (cont.)
### Professor: Karenjot Bhangoo Randhawa  
**Estimated Time:** 50 minutes  
**Notebook Created By:** Vaidehi Bulusu, Emily Guo, Bella Chang, Carlos Calderon  
**Code Maintenance:** Carlos Calderon 

Welcome! Last week we were introduced to bar plots and gained more experience interpreting line plots. We then generated the same visualizations, partitioned by sex. This week, we will look at a non sex-disaggregated dataset. We will delve deeper into interpreting bar plots and use them to build a deeper understanding of our data. We will look at change over time again, and will introduce the word cloud visualization. 

**Learning Outcomes:**  
By the end of this notebook, students will be able to:  
1. Have a deeper understanding of bar plots.   
2. Understand the differences in what analysis we could produce with non sex-disaggregated data.
3. Develop a deeper intuition on when and how to use line plots.  

# Table of Contents  
1. Understanding the Data  
2. Visualizing the Data 
3. Non Sex-disaggregated Data

---
---

# Importing Packages   

<div class="alert alert-block alert-warning">
<b>Make sure to run these cells FIRST! Not doing so may result in pesky errors in the code.</b>
</div>

In [None]:
%pip install wordcloud -q

In [None]:
from datascience import *
import numpy as np
import pandas as pd

import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual, Output

from wordcloud import WordCloud as wc, STOPWORDS
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from plotting import barplot, lineplot, comparable_lineplot, wordcloud

---
---

# 1. Understanding the Data

The main dataset for this notebook was obtained from the [Secretariat of Women and Gender Equality of the Atlantic](https://www.atlantico.gov.co/index.php/gobernacion/secretarias/mujeres-y-equidad), a government institution whose mission statement is to "better the quality of life for all women in the Atlantic Colombian region."  

One of the Secretariat's main functions is to point women to the right resources. As such, the Secretariat receives several cases of violence against women, with many of the cases going beyond domestic violence. Run the cell below to see what these data look like. 

In [None]:
cases = Table.read_table("../data/nb4/secretariat-cases.csv")
cases = cases.with_column("Municipality", [i.title() for i in cases["Municipality"]])
cases_df = cases.to_df()
cases.show(3)

Throughout the previous notebooks, we have asked you to find the size of a given dataset. That is, we've asked you to find how many rows and columns there are. We've also explained what each column represents, and asked you to interpret what each row represents for a given dataset. 

The questions for this section will be largely similar to those presented last week. This is so that you get into the habit of understanding what your dataset represents prior to any analysis.  

<div class="alert alert-info">
<b> Question 1.1: </b> Based on the data they contain, what do you think each column represents?
</div> 

*Hint:* Read [the mission statement](https://www.atlantico.gov.co/index.php/secretaria) of the entity that collected these data. 

In [None]:
# Printing out the dataset to help you answer the next couple of questions
cases.show(2)

- Month/Year: ...  
- Victim of the conflict?: ...
- Municipality: ...  
- Reason for Consultation: ... 
- Previous complaints?: ... 
- Violence setting: ...  
- Type of violence: ... 
- Referral: ...  
- Entity referred to: ...

<div class="alert alert-info">
<b> Question 1.2: </b> With these interpretations of our dataset's columns in mind, what do you think each row in our dataset represents?
</div>

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 1.3: </b> Using the table properties <code>num_columns</code> and <code>num_rows</code> that we learned in Notebook 1, fill in the code below to print out the size of our dataset. 
</div>

*Hint:* The format of your code should be of the form `dataset.property`

In [None]:
# Hint: What is the name of our dataset?
cases_num_rows = ...
cases_num_columns = ...
print(f"Our dataset has {cases_num_rows} cases and {cases_num_columns} properties to describe each case.")

<div class="alert alert-info">
    <b> Question 1.4: </b> Fill in the blanks below with the names of the columns that contain categorical data. If you need more or less space, feel free to add or delete a bullet point. 
</div>

- ...
- ...
- ...

<div class="alert alert-info">
    <b> Question 1.5: </b> Fill in the blanks below with the names of the columns that contain numerical data. 
</div>

- ...

<div class="alert alert-info">
    <b> Question 1.6: </b> What special case of numerical data does this column represent?
</div>

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 1.7: </b> Think back to the Colombian National Police intrafamily violence dataset we dealt with in Notebook 3. How are that dataset and our current data, <code>cases</code>, similar? How are they different? 
</div>

- They are similar in: ... 
- They differ in: ...

---
---

# 2. Visualizing the Data  

<div class="alert alert-info">
    <b> Question 2.1: </b> Think about the types of data described by our dataset. What visualizations do you think we can generate from these data? Below, delete yes or no depending on whether we can generate the given plot type. 
</div>  

- Histogram? Yes/No  
- Scatter plots? Yes/No  
- Line plots? Yes/No  
- Bar charts? Yes/No

---

## 2.1 Visualizing Categorical Data 
You may have found that a lot of your answers above were no. Because our data is largely categorical, we are restricted in the analyses and visualizations we can generate. Run the cell below, which will allow you to select a categorical column and visualize its distribution through a bar plot.

In [None]:
barplot()
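
As a rough illustration of what a bar plot of a categorical column computes, the sketch below counts how often each category occurs in a short list. The values are made up for this example and are not taken from `cases`, and the `barplot()` helper above may well be implemented differently.

```python
from collections import Counter

# Hypothetical values for a categorical column (not real data from `cases`)
settings = ["Family", "Family", "Community", "Family", "Workplace", "Community"]

# A bar plot of a categorical column has one bar per category, and the
# height of each bar is the number of rows that fall in that category
counts = Counter(settings)

# Print a text version of the bar plot, tallest bar first
for category, count in counts.most_common():
    print(f"{category:<10} {'#' * count} ({count})")
```

The key idea is that the plot only needs counts per category, which is why it works for categorical data where histograms and scatter plots do not.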

<div class="alert alert-info">
    <b> Question 2.2: </b> What does the bar plot for <code>Month/Year</code> tell you? Why were we able to generate a bar plot from this column if it is a numerical variable?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.3: </b> Given that each row in our dataset represents an individual case brought to the Secretariat, what does the bar plot for <code>Victim of the conflict?</code> tell you? Are most of these cases brought forth by victims or non-victims? 
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.4: </b> Do you think the bar plot for <code>Municipality</code> is informative? Why or why not? How would you improve this plot? 
</div>  

- The bar plot (is/is not) informative. 
- Why/why not?
- I would make the following improvements: ...

<div class="alert alert-info">
    <b> Question 2.5: </b> According to the plot for <code>Violence setting</code>, in what setting did most of the assaults occur? What biases in the data are exposed by this plot? 
</div>  

**Note: You may have noticed a category here called `nan`. This value, which stands for "Not a Number", appears in many datasets. It usually tells us that a value is missing, that is, that no data were collected or recorded for that entry.**
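
To see how `nan` behaves in Python, here is a minimal sketch that counts the missing entries in a short list. The list is invented for illustration, and `is_missing` is a hypothetical helper written for this example, not a function from the notebook.

```python
import math

# Hypothetical column in which two entries were never recorded (not real data)
values = ["Family", float("nan"), "Community", float("nan"), "Family"]

def is_missing(value):
    """Return True when the value is the special float nan."""
    return isinstance(value, float) and math.isnan(value)

# Count how many entries are missing
num_missing = sum(is_missing(v) for v in values)
print(f"{num_missing} of {len(values)} entries are missing")

# nan is unusual: it is not even equal to itself, which is why we use
# math.isnan instead of == to detect it
nan = float("nan")
print(nan == nan)
```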

*Replace this text with your answer*

In [None]:
# Run the cell again. It generates the same output as above. 
# This is so that you don't have to scroll back and forth. 
barplot()

<div class="alert alert-info">
    <b> Question 2.6: </b> What types of violence are most common in our dataset according to the bar plot for <code>Type of violence</code>?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.7: </b> What were some of the most common reasons that cases were brought to the Secretariat? What bar plot helped you derive this information?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.8: </b> Where were most of the cases referred to, if at all, after the Secretariat? What bar plot helped you derive this information?
</div>  

*Replace this text with your answer*

---

## 2.2 Visualizing Unstructured Data 

You may have noticed that our dataset has an additional column, `Reason for Consultation`, which you may have classified as categorical. Run the cell below to print out some of the values contained in this column.

In [None]:
num_examples = 10
print("Some of the reasons for consultation: ")
print(cases.column("Reason for Consultation")[:num_examples])

<div class="alert alert-info">
    <b> Question 2.9: </b> Looking at the first set of values contained in this column, why do you think we did not generate a bar plot for these data? Feel free to modify the <code>num_examples</code> variable to see more examples.
</div>  

*Replace this text with your answer*

Indeed, these data are neither categorical nor numerical. Instead, the variable `Reason for Consultation` contains text that resembles human language. This type of data is called [unstructured textual data](https://en.wikipedia.org/wiki/Unstructured_data). It differs from the categorical and numerical data we have seen before and, as such, cannot be processed or analyzed through conventional methods. It is often difficult to derive insights from these data because doing so requires intensive **data cleaning and wrangling**.  

In our case, we are dealing with textual data that, as previously stated, closely resembles human language. Computers have a difficult time understanding and parsing the meaning of words, so much so that there is an entire field called [Natural Language Processing](https://machinelearningmastery.com/natural-language-processing/) that focuses on exactly this.  

Run the cell below, which will generate a Word Cloud -- a special visualization that can be generated from textual data. You may have seen one of these before. 

In [None]:
wordcloud()
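
A word cloud typically sizes each word by how often it appears in the text. The sketch below shows that counting step on a few invented sentences; the sentences and the tiny stop word list are assumptions made for this example, and the `wordcloud()` helper above may process the text differently.

```python
from collections import Counter

# Hypothetical free-text entries (not real data from `cases`)
reasons = [
    "Requests legal advice about a divorce",
    "Seeks psychological support and legal advice",
    "Requests information about legal support",
]

# A tiny, made-up stop word list; real tools ship with much longer ones
stop_words = {"a", "about", "and", "the"}

# Lowercase each entry, split it into words, and drop the stop words
words = [
    word
    for reason in reasons
    for word in reason.lower().split()
    if word not in stop_words
]

# Count how often each remaining word appears; more frequent words
# would be drawn larger in a word cloud
word_counts = Counter(words)
print(word_counts.most_common(3))
```

Even this small example hints at why textual data needs cleaning: without lowercasing, "Legal" and "legal" would be counted as different words.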

<div class="alert alert-info">
    <b> Question 2.10: </b> What are some of the most common words that are seen in the <code>Reason for Consultation</code> variable? Least common?
</div>  

- Most common: ...  
- Least common: ...

<div class="alert alert-info">
    <b> Question 2.11: </b> Look back at the bar plot for <code>Referral</code>. Does the word cloud above back up the data shown in the bar plot? Taking these two visualizations together, what can we infer about our data? What are most people coming to the Secretariat for?
</div>  

*Replace this text with your answer*

## 2.3 Visualizing Categorical Data Throughout Time 

As you previously stated, our `cases` dataset possesses only one numerical variable in `Month/Year`. You also probably deduced that this is a sequential numerical variable. Because of this, we are limited to bar plots and line plots. Now that we have generated bar plots for most of the categorical variables, we can start looking at how these variables changed throughout time. 

<div class="alert alert-info">
    <b> Question 2.12: </b> One particular variable we might be interested in is <code>Type of violence</code>. What type of insights could you derive from generating a line plot on <code>Type of violence</code>?
</div>  

*Replace this text with your answer*

Run the cell below to generate this line plot. 

In [None]:
# Run this cell to generate a line plot 
lineplot()
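
Behind a line plot like this one is a count of cases for every combination of month and category. Below is a minimal sketch of that computation. The records are invented, and the "MM/YYYY" date format is an assumption made for this example; the actual `Month/Year` column and the `lineplot()` helper may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (month/year, type of violence) pairs; not real data
records = [
    ("06/2017", "Physical"),
    ("06/2017", "Psychological"),
    ("07/2017", "Psychological"),
    ("07/2017", "Psychological"),
    ("01/2018", "Physical"),
]

# Count the cases for every (month, type) combination
counts = Counter(records)

# Sort the months chronologically; sorting the raw strings would
# wrongly place "01/2018" before "06/2017"
months = sorted(
    {month for month, _ in records},
    key=lambda m: datetime.strptime(m, "%m/%Y"),
)

# Each type of violence becomes one line: a list of counts, one per month
for violence_type in sorted({t for _, t in records}):
    series = [counts[(month, violence_type)] for month in months]
    print(violence_type, series)
```

Because the months have a natural order, connecting the counts with a line is meaningful, which is what distinguishes this plot from a bar plot of unordered categories.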

<div class="alert alert-info">
    <b> Question 2.13: </b> Based on this graph, what was the type of violence that was reported the most? The least?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.14: </b> What might be some reasons that sexual violence appears to be reported the least? What does this say about the bias of our dataset?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 2.15: </b> What does the line plot tell you about the trend for each type of violence? 
</div>  

- Psychological: ... 
- Physical: ... 
- Sexual: ... 
- Financial: ...

<div class="alert alert-info">
    <b> Question 2.16: </b> Use the line plot to generate a narrative on crimes against women in the Atlantic regions of Colombia. This should be roughly a paragraph written in a storytelling manner. 
</div> 

*Replace this text with your answer*

----
----

# 3. Non Sex-disaggregated Data

In Notebook 3, we dealt with data collected from the Colombia National Police database. Recall that this was a sex-disaggregated dataset. The `cases` dataset, however, is a non sex-disaggregated dataset. This means that the data cannot be partitioned by sex. 

<div class="alert alert-info">
    <b> Question 3.1: </b> What implications does having non sex-disaggregated data have for our analysis compared to a sex-disaggregated dataset? That is, are we more limited or do we have more freedom in our potential for analysis?
</div>  

*Replace this text with your answer*

You may have identified that we have lost the ability to run a comparative analysis on our data relative to gender. That is, our `cases` dataset contains information on assaults against women, but not against men. As a consequence, we cannot generate the visualizations we made in Notebook 3. However, do not fall for the fallacy that such datasets are less valuable. The following set of questions will guide you to think more deeply about the differences, similarities, and potential benefits of the two analyses (analysis of the National Police dataset vs. the Secretariat of Women dataset).   
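
To make the distinction concrete, here is a small sketch contrasting the two kinds of data. All of the numbers are made up for illustration and do not come from either dataset.

```python
# Hypothetical yearly counts; none of these numbers come from the real datasets

# Sex-disaggregated: every count is stored separately for each sex
disaggregated = {
    2019: {"Female": 30, "Male": 10},
    2020: {"Female": 36, "Male": 9},
}

# Non sex-disaggregated: only one count per year is stored
aggregated = {2019: 40, 2020: 45}

# We can always collapse disaggregated data into yearly totals ...
totals = {year: sum(by_sex.values()) for year, by_sex in disaggregated.items()}
print(totals)

# ... and we can compare the groups within each year
for year, by_sex in disaggregated.items():
    share = by_sex["Female"] / sum(by_sex.values())
    print(f"{year}: {share:.0%} of recorded cases had a female victim")

# The reverse is not possible: nothing in `aggregated` tells us how
# each yearly total splits by sex
```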

<div class="alert alert-info">
    <b> Question 3.2: </b> What information do we have in the <code>cases</code> dataset that we did not have in the <code>family_violence</code> dataset we saw in Notebook 3?
</div>  

*Replace this text with your answer*

You may have noticed by now that the `cases` dataset contains information on the **Atlantic** regions of Colombia, whilst the `family_violence` dataset contained information for *all* of Colombia. In addition, the `cases` dataset covers only June 2017-June 2019, whilst `family_violence` covers 2015-2021. Thus, we can say that the **scope** of the `cases` dataset is smaller than that of `family_violence`. 

While this comparison looks at the difference in *scope* of information between the datasets, we can generate visualizations that will show us the difference in the **actual data** described by each one. When you do this, you generally want to compare visualizations that plot the same kind of data. That is, if we want to compare categorical information contained in Table 1 vs Table 2, then it would be ideal to compare a bar plot from Table 1 with a bar plot from Table 2. It would be much more difficult, but not impossible, to compare a bar plot from Table 1 with a line plot from Table 2.  

In our case, we are in luck. We were able to generate sequential line plots for both the `cases` and `family_violence` datasets. Let's compare these. Run the cell below, which will allow you to select a city and visualize its line plots side by side as described by each dataset.   

In [None]:
# Run this cell to generate a plot with dropdown menu. 
comparable_lineplot()

<div class="alert alert-info">
    <b> Question 3.3: </b> Select the city of Barranquilla (CT). Does the police dataset tell us that crime is increasing or decreasing? What does the secretariat dataset tell us? 
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 3.4: </b> Select the city of Suan. What information do we have about this city? 
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 3.5: </b> Select the city of Tubara. Does the secretariat plot really tell us anything? What about the police plot? What do the cities of Tubara and Suan have in common that might explain why we see similar behavior?
</div>  

*Replace this text with your answer*

In [None]:
# Run this cell to generate a plot again so that you don't have to scroll up and down a lot. 
comparable_lineplot()

<div class="alert alert-info">
    <b> Question 3.6: </b> Select the city of Ponedera. Does the police dataset plot differ from the secretariat plot? What common violence type do we see missing here? 
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 3.7: </b> Select the city of Santa Lucia. What type of violence is increasing in the secretariat plot? Are both plots telling the same general story?
</div>  

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 3.8: </b> How do the plots for cities like Soledad and Barranquilla differ from those for cities like Suan or Santa Lucia? Does this mean there are fewer crimes in the latter set of cities?
</div>  

*Replace this text with your answer*

Now, take some time to think about what you saw for each city. You probably found mixed results. For some cities, we have rich data, for others, we don't.  

<div class="alert alert-info">
    <b> Question 3.9: </b> After comparing the trend in cities as described by the police vs the secretariat of women, how would you describe the main benefit of the Police dataset? What is the benefit of the Secretariat dataset?
</div>  

- Police dataset: ... 
- Secretariat dataset: ...

<div class="alert alert-info">
    <b> Question 3.10: </b> What is a dangerous bias that exists in both datasets? 
</div>  

*Replace this text with your answer*

# Conclusion
Congratulations! You've reached the end of the assignment. Run the cell below to generate a pdf. 

In [None]:
# This may take a few seconds 
from IPython.display import display, HTML
!pip install -U notebook-as-pdf -q
!jupyter-nbconvert --to PDFviaHTML notebook4.ipynb
display(HTML("Save this notebook, then click <a href='notebook4.pdf' download>here</a> to open the pdf."))

# Feedback 

Please let us know your thoughts on this notebook! [Fill out the following survey here](https://docs.google.com/forms/d/e/1FAIpQLSeQHgYQV5qQpyp8AJkupGvA0mJ49qULLYQGa1w3Zh6jd25Z2g/viewform?usp=sf_link)