# [Global 88] Gender Violence in Colombia  
### Professor: Karenjot Bhangoo Randhawa  
**Estimated Time:** 50 minutes  
**Notebook Created By:** Vaidehi Bulusu, Emily Guo, Bella Chang, Carlos Calderon  
**Code Maintenance:** Carlos Calderon 

Welcome! Last week we were introduced to data types and the visualizations we can create with them. This week, we will expand on the table operations and visualization topics introduced in the first and second notebooks, using a dataset on gender violence across Colombia. This week's dataset is primary and sex-disaggregated. Next week we will look at a secondary dataset that is not sex-disaggregated.  

**Learning Outcomes:**  
By the end of this notebook, students will be able to:  
1. Understand what bar plots are and when they are used.  
2. Understand the differences between sex-disaggregated and non-sex-disaggregated data.  
3. Understand the insights we can derive from sex-disaggregated data.  

# Table of Contents  
1. The Data 
2. Sex-Disaggregated Data

---

# Importing Packages   

<div class="alert alert-block alert-warning">
<b>Make sure to run this cell FIRST! Not doing so may result in pesky errors in the code.</b>
</div>

In [None]:
from datascience import *
import numpy as np

import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from collections import Counter

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

---

# 1. The Data  
Our data comes from the [Colombia National Police](https://www.policia.gov.co/grupo-informacion-criminalidad/estadistica-delictiva) database, which contains yearly information on different crime categories across Colombia. The dataset we will be working with describes statistics on intrafamilial violence from 2015 to 2021. Intrafamilial violence is [defined as any act punishable as a criminal offense that is committed or threatened to be committed by an offender who is related to the victim, or has a child with the victim.](https://www.ncsl.org/research/human-services/domestic-violence-domestic-abuse-definitions-and-relationships.aspx#:~:text=(9)%20%22Intrafamily%20violence%22,has%20a%20child%20in%20common.) Run the cell below to load the dataset into the variable `family_violence`.

In [None]:
family_violence = Table.read_table("../data/nb3/domestic_violence_colombia_police.csv")
family_violence.show(5)

## 1.1 Understanding the Data

Although the column names are clear, it is still useful to map each column name to what it represents. 

| Column (Variable) Name | Description                          |
|------------------------|--------------------------------------|
| State                  | Name of State                        |
| City                   | Name of City                         |
| Gender                 | Gender of the victim                 |
| Weapon Used            | Weapon used by the perpetrator       |
| Affected Age Group     | Age group that the victim belongs to |
| Total                  | Total number of crimes in the area   |
| Year                   | Year                                 |  

The rest of this section will focus on questions that will guide you in truly understanding the dataset. Whenever you start an individual project with a new dataset, it is extremely useful to start asking yourself these questions. The answers will guide your future analysis and will give you a better idea of what insights you can expect to find. 

In [None]:
# Showing the first two rows as examples for your next set of questions
family_violence.show(2)

<div class="alert alert-info">
<b> Question 1.1: </b> With the name and meaning of our dataset's columns in mind, what does each row in our dataset represent?
</div>

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 1.2: </b> Using the table properties <code>num_columns</code> and <code>num_rows</code> that we learned in Notebook 1, fill in the code below to print out the size of our dataset. 
</div>

In [None]:
family_violence_num_rows = family_violence... # Assign this to the number of rows in our dataset
family_violence_num_columns = family_violence... # Assign this to the number of columns in our dataset
print(f"Our dataset has {family_violence_num_rows} rows and {family_violence_num_columns} columns.")

<div class="alert alert-info">
    <b> Question 1.3: </b> If you had no information about the source of the dataset or the information it contained, how would you use the number of rows and columns and what they each represent to build your intuition on what the dataset is describing? 
</div>

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 1.4: </b> Fill in the blanks below with the names of the columns that contain categorical data. If you need more or less space, feel free to add or delete a bullet point. 
</div>

- ...
- ...
- ...
- ...

<div class="alert alert-info">
    <b> Question 1.5: </b> Fill in the blanks below with the names of the columns that contain numerical data. 
</div>

- ...
- ...
- ...
- ...

## 1.2 Barplots - One Categorical Variable  

Last week we went over data types and their visualizations. So far, we have seen the following set of relationships:  

| Variable/s                     | Visualization |
|--------------------------------|---------------|
| numeric                        | Histogram     |
| numeric x numeric              | Scatter plot  |
| numeric x numeric (sequential) | Line plot     |  

In this section we will be going over **bar plots**. Bar plots are useful for visualizing both the number of groups in a single categorical variable and the number of times each group appears. For example, our `family_violence` dataset contains several categorical variables, one of them being `Affected Age Group`. How many groups does this variable contain? This question can be answered with a bar plot. 

In [None]:
# Pull out the column as an array, then tally how many rows fall in each age group
affected_age_group = family_violence.column("Affected Age Group")
affected_age_group_counts = Counter(affected_age_group)

# Plotting code
plt.bar(affected_age_group_counts.keys(), affected_age_group_counts.values());

plt.xlabel("Age Group")
plt.ylabel("Number Affected")
plt.title("Number Affected per Age Group");
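The bar heights in a plot like this come straight from a `Counter`: one bar per distinct label, with a height equal to the number of rows carrying that label. Here is a minimal sketch of that tallying step on a toy list. The labels in `toy_age_groups` are made up for illustration and are not the values in our dataset.

```python
from collections import Counter

# Hypothetical age-group labels, one per row of a small made-up table
toy_age_groups = ["Adults", "Minors", "Adults", "Adolescents", "Adults", "Minors"]

# Counter maps each distinct label to the number of rows carrying it
toy_counts = Counter(toy_age_groups)

# The number of distinct labels is the number of bars the plot would have
number_of_groups = len(toy_counts)

for group, rows in toy_counts.items():
    print(f"{group}: {rows} row(s)")
```

Note that the tally counts rows, not the values stored in any other column.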

<div class="alert alert-info">
    <b> Question 1.6: </b> How many groups exist in the <code>Affected Age Group</code> variable?
</div> 

**Hint:** Look at the x-axis labels. 

*Replace this text with your answer*

<div class="alert alert-info">
    <b> Question 1.7: </b> How many members exist in each group? What does this tell us about our dataset? What value do we see the most? 
</div> 

*Replace this text with your answer*

Now, that is just one categorical variable. We have plenty more in our dataset. Run the cell below, which will allow you to select a column and visualize its bar plot. 

In [None]:
def bar_plot(column):
    counts = Counter(family_violence.column(column))
    
    plt.bar(counts.keys(), 
            counts.values())
    
    plt.xlabel(column)
    plt.ylabel("Total Number")
    plt.xticks(rotation=60)
    
    plt.title("Number of occurrences per " + column);

col_choices = widgets.Dropdown(options=family_violence.labels, 
                               value="State", 
                               description="Pick a categorical column:")

interact(bar_plot, column=col_choices);

<div class="alert alert-info">
    <b> Question 1.8: </b> Below, replace the three dots with the name of the column, and an insight you derived from its bar plot. Format it as {Column name}: {Insight} 
</div> 

- ...  
- ...  
- ...
- ...
- ...

Notice how plotting some categorical variables takes a long time and produces unreadable plots. This is a common issue called [overplotting](https://www.displayr.com/what-is-overplotting/), and there are several ways to deal with it. For categorical variables, one common approach is to plot only the five most or least abundant groups. For example, we might choose to plot the five cities that appear the most in our dataset, or those that appear the least. Let's do that. 

In [None]:
# Just run the following cell
city_counts = Counter(family_violence.column("City"))
five_most_common_cities = city_counts.most_common(5)
five_most_common_city_names = []
five_most_common_city_values = []
for value in five_most_common_cities:
    five_most_common_city_names.append(value[0])
    five_most_common_city_values.append(value[1])

print(f"There are a total of {len(city_counts.keys())} cities in our dataset. Below are the five cities that appear the most.")

plt.bar(five_most_common_city_names, five_most_common_city_values)
plt.title("Five most common cities")
plt.xlabel("City")
plt.ylabel("Number of Appearances");
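The cell above keeps the most frequent cities. The least frequent ones can be read from the same kind of `Counter`. Below is a small sketch on toy data; the city labels and the choice of `n = 2` are assumptions made for illustration. It also shows `zip(*pairs)` as a shorter way to split `(name, count)` pairs into two sequences.

```python
from collections import Counter

# Hypothetical city labels, one per row of a made-up table
toy_cities = ["City A"] * 3 + ["City B"] * 2 + ["City C"] * 1 + ["City D"] * 4
toy_city_counts = Counter(toy_cities)

n = 2

# most_common() with no argument returns every (label, count) pair, most frequent first
ranked = toy_city_counts.most_common()
most_frequent = ranked[:n]
least_frequent = ranked[-n:][::-1]  # reversed so the rarest label comes first

# zip(*pairs) splits the pairs into one tuple of names and one tuple of counts
least_names, least_values = zip(*least_frequent)
print(least_names, least_values)
```

Either pair of sequences can be passed to `plt.bar` in the same way as the lists built in the cell above.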

<div class="alert alert-info">
    <b> Question 1.9: </b> What are the five most common cities in our dataset? What does this mean for these cities? What assumptions can we make from this visualization?
</div> 

*Replace this text with your answer!*

<div class="alert alert-block alert-success">
<b>Bonus Question 1:</b> Write code below to generate a similar bar plot as above, but for states. Then, answer the following question in the markdown cell below: Are the five most common cities within the five most common states?
</div>

In [None]:
# Write your code here 
...
...
...
...
...
...

*Replace this text with your answer!*

## 1.3 Crime Change Throughout Time  

Recall that one of the visualizations we can generate when we have sequential data is a line plot. `Year` is a sequential variable, and thus, we can graph the change over time between `Year` and another numerical variable in our dataset. It turns out that the only other such column is `Total`. Run the cell below, which will generate a line plot for the change in total number of intrafamilial crimes in Colombia from 2015 to 2021. 

In [None]:
# Apply sum to the other columns within each year; the summed "Total" column is labeled "Total sum"
family_violence_grouped = family_violence.group("Year", sum)

def plot_trend():
    # Reuse the yearly totals computed above
    family_violence_grouped.plot("Year", "Total sum")
    plt.title("Change in the Number of Reported Intrafamilial Crimes in Colombia from 2015-2021")
    plt.ylabel("Total Number of Crimes")
    plt.show()
plot_trend()
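Grouping by `Year` and summing collapses all rows from the same year into a single total. The sketch below reproduces that idea in plain Python on made-up `(year, total)` pairs; the numbers are illustrative only and do not come from the police dataset.

```python
from collections import defaultdict

# Made-up (year, total) pairs standing in for the Year and Total columns
toy_rows = [(2015, 3), (2015, 2), (2016, 4), (2016, 1), (2016, 5), (2017, 6)]

# Add each row's total to the running sum for its year
yearly_totals = defaultdict(int)
for year, total in toy_rows:
    yearly_totals[year] += total

# Sort the years so a line plot would run left to right in time order
years = sorted(yearly_totals)
totals = [yearly_totals[y] for y in years]
print(list(zip(years, totals)))
```

The two lists `years` and `totals` are what a line plot needs for its x and y values.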

<div class="alert alert-info">
<b> Question 1.10: </b> Replace the dots below with the trend that you see from year to year. Then, write the trend you see overall and what this tells us about reported intrafamilial violence in Colombia. 
</div>

- 2015 to 2016: ...  
- 2016 to 2017: ...  
- 2017 to 2018: ...  
- 2018 to 2019: ...  
- 2019 to 2020: ...  
- 2020 to 2021: ...  
- Overall: ...

<div class="alert alert-block alert-success">
<b>Bonus Question 2:</b> Generate the same plot as above, but write it within a function <code>plot_city_trend</code> that takes in one argument, the city name, and outputs a line plot for that given city. 
</div>

In [None]:
# Write your code here 
def plot_city_trend(city):
    ...
    ...
    ...
    ...
    return ...

---

# 2. Sex-Disaggregated Data  
[Data2X](https://data2x.org/) is an initiative housed at the United Nations Foundation whose "mission is to improve the availability, quality, and use of gender data in order to make a practical difference in the lives of women and girls worldwide."  

They define sex-disaggregated data as data that can be grouped based on sex. Although many datasets label this column "Gender", what is usually recorded is an individual's birth-assigned sex. ["Gender", on the other hand, refers to socially constructed relations between men and women.](https://data2x.org/wp-content/uploads/2019/08/MeasuringWomensFinInclusion-ValueofSexDisaggData.pdf) The importance of sex-disaggregated data lies in its potential to uncover different, unequal experiences between men and women as a result of gender roles and expectations. Although this never gives us the whole picture, it is an important first step in getting there.  

*References:*  
*- https://data2x.org/*  
*- https://data2x.org/wp-content/uploads/2019/08/MeasuringWomensFinInclusion-ValueofSexDisaggData.pdf*  

Now, in the previous section we visualized bar plots for categorical variables in our dataset. These, however, do not give us the whole picture. We can place two sets of bars side by side to visualize the difference between groups. In the context of our data, we are interested in finding how intrafamilial violence affects men and women differently across Colombia. Run the cell below to show how `Affected Age Group` changes when we group by sex.

In [None]:
# Just run this scary code! 
def bar_plot_gender(col_name, ylabel):
    
    # Fix one category order so every bar lines up with its x-axis label
    categories = list(Counter(family_violence.column(col_name)).keys())

    women = family_violence.where("Gender", "Female")
    men = family_violence.where("Gender", "Male")
    
    women_counts = Counter(women.column(col_name))
    men_counts = Counter(men.column(col_name))
    
    # Share of each gender's rows per category, as a percentage
    # (a Counter returns 0 for a category that never appears)
    women_percents = 100 * np.array([women_counts[c] for c in categories]) / sum(women_counts.values())
    men_percents = 100 * np.array([men_counts[c] for c in categories]) / sum(men_counts.values())
    
    bar_width = 0.25
    br1 = np.arange(1, len(categories) + 1)
    br2 = br1 + bar_width
    
    plt.bar(br1, 
        women_percents,
        width = bar_width,
        color='r', 
        label="Women")
    
    plt.bar(br2, 
        men_percents,
        width = bar_width,
        color="b", 
        label="Men")
    
    # Center each label between its pair of bars
    plt.xticks(br1 + bar_width / 2, categories, rotation = 30)
    
    plt.xlabel(col_name)
    plt.ylabel(ylabel)
    plt.title(col_name + " by Gender")
    
    plt.legend();
    
bar_plot_gender("Affected Age Group", "Percentage Affected")
plt.show()
bar_plot("Affected Age Group")
plt.show()
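The grouped plot compares shares within each sex rather than raw row counts, which keeps the comparison fair when one group has many more rows than the other. The following sketch works through that normalization on made-up rows. The helper `shares` and all values in `toy_rows` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Made-up (sex, age group) rows: 8 rows for women, 2 rows for men
toy_rows = (
    [("Female", "Adults")] * 6
    + [("Female", "Minors")] * 2
    + [("Male", "Adults")] * 1
    + [("Male", "Minors")] * 1
)

def shares(rows):
    """Return each age group's percentage of the given rows."""
    counts = Counter(group for _, group in rows)
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}

# Aggregated view: all rows together
overall = shares(toy_rows)

# Sex-disaggregated view: percentages computed separately within each sex
by_sex = {
    sex: shares([row for row in toy_rows if row[0] == sex])
    for sex in ("Female", "Male")
}

print(overall)
print(by_sex)
```

In this toy example the aggregated view is dominated by the larger group, while the disaggregated view shows that the two groups have different distributions.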

<div class="alert alert-info">
    <b>Question 2.1:</b> How does the barplot for <code>Affected Age Group</code> change when we disaggregate the data by sex? What does this tell us? How are women being affected differently than men?
</div>

*Replace this text with your answer!*

In [None]:
bar_plot_gender("Weapon Used", "Percentage Used")
plt.show()
bar_plot("Weapon Used")
plt.show()

<div class="alert alert-info">
    <b>Question 2.2:</b> How does the barplot for <code>Weapon Used</code> change when we disaggregate the data by sex? What does this tell us? How are women being affected differently than men?
</div>

*Replace this text with your answer!*

Remember that we also visualized the change in number of intrafamilial crimes from 2015 to 2021. How does this visualization change when we sex-disaggregate? Run the cell below. If it produces a warning, ignore it. 

In [None]:
family_violence_grouped_by_gender = family_violence.group(["Year", "Gender"], sum)
total_crimes_against_women = family_violence_grouped_by_gender.where("Gender", "Female")
total_crimes_against_men = family_violence_grouped_by_gender.where("Gender", "Male");

plt.plot("Year", "Total sum", data=total_crimes_against_women, color="m", label="Women")
plt.plot("Year", "Total sum", data=total_crimes_against_men, color="k", label="Men")

plt.xlabel("Year")
plt.ylabel("Total Number of Crimes")
plt.title('Change in the Number of Reported Intrafamilial Crimes in Colombia from 2015-2021 by Gender')
plt.legend()
plt.show()

plt.plot("Year", "Total sum", data=family_violence_grouped)
plt.xlabel("Year")
plt.ylabel("Total Number of Crimes")
plt.title('Change in the Number of Reported Intrafamilial Crimes in Colombia from 2015-2021')
plt.show();

<div class="alert alert-info">
    <b>Question 2.3:</b> How does the trend in number of intrafamilial crimes change when we sex-disaggregate? What does this tell us? How are women being affected differently than men?
</div>

*Replace this text with your answer!*

How does this trend compare when we look more closely? So far, we have been looking at changes across all of Colombia. Run the cell below, which will allow you to choose a city and plot both the disaggregated and non-disaggregated trends for that city. 

In [None]:
def city_filter(city):
    # Match the selected city exactly; a substring match would also pull in
    # any other city whose name merely contains the selected name
    city_rows = family_violence.where("City", are.equal_to(city))

    city_table = city_rows.group(["Year", "Gender"], sum)
    total_crimes_against_women = city_table.where("Gender", "Female")
    total_crimes_against_men = city_table.where("Gender", "Male")

    plt.plot("Year", "Total sum", data=total_crimes_against_women, color="m", label="Women")
    plt.plot("Year", "Total sum", data=total_crimes_against_men, color="k", label="Men")

    plt.xlabel("Year")
    plt.ylabel("Total Number of Crimes")
    plt.title(f'Change in the Number of Reported Intrafamilial Crimes from 2015-2021 by Gender in {city}')
    plt.legend()
    plt.show()
    
    tbl = city_rows.group("Year", sum)
    plt.plot("Year", "Total sum", data=tbl)
    plt.xlabel("Year")
    plt.ylabel("Total Number of Crimes")
    plt.title(f'Change in the Number of Reported Intrafamilial Crimes from 2015-2021 in {city}')
    plt.show()

city_dropdown = widgets.Dropdown(
    options = family_violence.group("City").column("City"),
    value = "Abejorral",
    description = "Pick a city:"
)

interact(city_filter, city = city_dropdown)
plt.show()
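When filtering rows by city name, exact matching and substring matching can return different rows. In the `datascience` library these correspond to the predicates `are.equal_to` and `are.containing`. The sketch below shows the difference in plain Python; the names in `toy_city_names` are illustrative and may not match the spelling used in our dataset.

```python
# Illustrative city names, one per row of a made-up table
toy_city_names = ["Cali", "Calima", "Caldas", "Cali"]
selected = "Cali"

# Substring match: keeps every name that contains the selected text
substring_matches = [name for name in toy_city_names if selected in name]

# Exact match: keeps only names identical to the selected text
exact_matches = [name for name in toy_city_names if name == selected]

print(substring_matches)
print(exact_matches)
```

If a city's totals look larger than expected, checking which kind of match produced them is a reasonable first step.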

<div class="alert alert-info">
    <b>Question 2.4:</b> What city did you choose? How does the trend in number of intrafamilial crimes change when we sex-disaggregate? What does this tell us? How are women being affected differently than men?
</div>

*Replace this text with your answer!*

<div class="alert alert-info">
    <b>Question 2.5:</b> Please select the city of Bogota from the dropdown menu. How does the trend in number of intrafamilial crimes change in Bogota when we sex-disaggregate? What does this tell us? How are women being affected differently than men?
</div>

*Replace this text with your answer!*

## Conclusion
Congratulations! You've reached the end of the assignment. 

In [None]:
# This may take a few seconds 
from IPython.display import display, HTML
!pip install -U notebook-as-pdf -q
!jupyter-nbconvert --to PDFviaHTML notebook3.ipynb
display(HTML("Save this notebook, then click <a href='notebook3.pdf' download>here</a> to open the pdf."))