# About notebook:
This notebook contains extracts from the work on the ACLED dataset for the midterm in STK-INF4000.




# Loading everything

## Loading standard python modules

In [None]:
import pandas as pd
import numpy as np
import datetime

import matplotlib.pyplot as plt
%matplotlib inline

# Adding modules folder to sys.path:
import sys
sys.path.insert(0, '../modules')

### Auto-reloading external modules
Ensures that all code inside our modules is reloaded on each new call to 'import'. Included to enable more rapid testing.

For details, see:
http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython

In [2]:
%load_ext autoreload
%autoreload 2
print('Loaded')

## Loading ACLED dataset
The ACLED dataset is the main dataset for our project. From the ACLED project's website (http://www.acleddata.com/):

    "ACLED (Armed Conflict Location & Event Data Project) is the most comprehensive public collection of political violence and protest data for developing states."

The dataset contains more than 156,000 entries, from 1997 up until today. The ACLED dataset is updated with *realtime* data on a weekly basis.

### Datasets module
The datasets.ACLED class (imported below) connects to the ACLED API to download the latest available data to a **mongodb** database on the local computer. If the database already exists, the API is queried for updates and the database is updated as required.

In [3]:
import datasets

In [4]:
acled = datasets.ACLED()
acled.mongodb_update_database()
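
The "download once, then only apply updates" idea can be illustrated without a database. The following is a minimal sketch, not the actual `datasets.ACLED` implementation: a plain dict stands in for the mongodb collection, and the function name `upsert_records` and the key field `data_id` are assumptions made for illustration.

```python
def upsert_records(store, records, key="data_id"):
    """Insert new records and overwrite changed ones.

    `store` is a dict standing in for the local database (hypothetical),
    `records` is a list of dicts as they might come back from an API.
    Returns a tuple (inserted, updated).
    """
    inserted = updated = 0
    for rec in records:
        rec_id = rec[key]
        if rec_id not in store:
            inserted += 1
        elif store[rec_id] != rec:
            updated += 1
        else:
            continue  # identical record already stored, nothing to do
        store[rec_id] = rec
    return inserted, updated


local_db = {}

# First download: everything is new.
first_batch = [
    {"data_id": 1, "country": "Somalia", "fatalities": 0},
    {"data_id": 2, "country": "Mali", "fatalities": 3},
]
first_result = upsert_records(local_db, first_batch)

# Later update: one corrected record and one new record.
later_batch = [
    {"data_id": 2, "country": "Mali", "fatalities": 5},
    {"data_id": 3, "country": "Kenya", "fatalities": 1},
]
later_result = upsert_records(local_db, later_batch)

print(first_result, later_result, len(local_db))
```

Keying on a stable record identifier lets a weekly refresh correct earlier entries instead of duplicating them.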

#### Loading database content into Pandas dataframe:

In [5]:
df_full = acled.mongodb_get_entire_database()

## Loading ESRI shapefile

For plotting, we're using ESRI shapefile vector geodata from [1] with **geopandas** and **Bokeh**. 

**References:**

1) http://www.naturalearthdata.com/features/
2) http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip

**Instructions**: Download and unzip [2] and point the variable 'link' (in the cell below) to the path of the .shp file.

**Dependencies**: The following Python libraries are required: bokeh, shapely, geopandas. Clang (or equivalent) is also required for these (on Ubuntu Linux, use: 'sudo apt-get install libgeos-dev').

In [6]:
from ImportShapefile import ImportShapefile 
# Update the link to where you have stored the shapefiles:
link = '../data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp'

gpd_df_full = ImportShapefile(link).get_df()

# Mask on countries in Africa:
mask = gpd_df_full['continent']=='Africa'
gpd_df = gpd_df_full.loc[mask,:].reset_index(drop=True)

del gpd_df_full

###  Drop all except a few columns in geometry dataframe:

In [7]:
# Select columns with a list (a tuple may be interpreted as a single key):
gpd_df = gpd_df.loc[:, ['name','subregion','x', 'y', 'pop_est']]

# Some preprocessing
Now that we have loaded both the ACLED dataset and the geopandas dataframe, we need to do some pre-processing before we can use them together.

In particular, both datasets contain the names of the African countries, but some names are spelled differently (e.g. abbreviated). Let's align them:

In [8]:
# Begin by comparing the two:
a = set(df_full['country'].unique())
g = set(gpd_df['name'].unique())

print(a.difference(g))
print(g.difference(a))

{'Central African Republic', 'Ivory Coast', 'Democratic Republic of Congo', 'Equatorial Guinea', 'South Sudan', 'Republic of Congo', 'Mozambique '}
{'Congo', 'Central African Rep.', 'Somaliland', 'Eq. Guinea', 'Dem. Rep. Congo', 'W. Sahara', 'S. Sudan', "Côte d'Ivoire"}


In [9]:
# Manually assigning the matching countries:
new_names = {
            "Côte d'Ivoire": "Ivory Coast",
            "Dem. Rep. Congo": "Democratic Republic of Congo",
            "S. Sudan": "South Sudan",
            "Central African Rep.": "Central African Republic",
            "Congo": "Republic of Congo",
            "Eq. Guinea": "Equatorial Guinea"
}

gpd_df.replace({"name": new_names}, inplace=True)

In [10]:
# Check again:
a = set(df_full['country'].unique())
g = set(gpd_df['name'].unique())

print(a.difference(g))
print(g.difference(a))

{'Mozambique '}
{'Somaliland', 'W. Sahara'}


* One datapoint in the ACLED dataset is stored with trailing whitespace ('Mozambique '). Let's replace it.

* 'W. Sahara' and 'Somaliland' are more complicated matters, as they are disputed territories.

  * 'W. Sahara':
  The approach that aligns best with the ACLED dataset is to consider Western Sahara a part of Morocco, see:
  https://en.wikipedia.org/wiki/Sahrawi_Arab_Democratic_Republic#International_recognition_and_membership

  * 'Somaliland':
  In the ACLED dataset, Somaliland is considered part of Somalia.
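
One possible way to fold the disputed territories into their parent countries on the geometry side is geopandas' `dissolve`, which unions the geometries per group. This is only a sketch under assumptions: the boxes and population numbers are made-up toy values, and it assumes the dataframe still carries a `geometry` column. Derived columns such as `x` and `y` would have to be recomputed afterwards.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for the country dataframe (geometries and numbers are invented).
toy = gpd.GeoDataFrame(
    {
        "name": ["Morocco", "W. Sahara", "Somalia", "Somaliland"],
        "pop_est": [35, 1, 10, 4],
    },
    geometry=[box(0, 1, 1, 2), box(0, 0, 1, 1), box(5, 0, 6, 1), box(5, 1, 6, 2)],
    crs="EPSG:4326",
)

# Map each disputed territory to the country ACLED files it under.
parent_country = {"W. Sahara": "Morocco", "Somaliland": "Somalia"}
toy["name"] = toy["name"].replace(parent_country)

# Union the geometries per country and sum the numeric columns.
merged = toy.dissolve(by="name", aggfunc="sum").reset_index()
print(merged[["name", "pop_est"]])
```

After this step both name sets should match, at the cost of hiding the territories as separate map units.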


In [11]:
# Solving point 1:
df_full.loc[df_full['country']=='Mozambique ', 'country'] = 'Mozambique'

# Solving point 2:
# TODO: merge the two rows of the data and the corresponding geometries.

### Creating 'area' column:
Adding column containing 'proportionate area' (*).

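(*) The areas are expressed in the squared units of the geometries' coordinate system. Natural Earth shapefiles use geographic coordinates (WGS 84, in degrees), so the values are in square degrees. They are therefore only roughly proportional to the countries' true areas.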

In [None]:
gpd_df['area_p'] = gpd_df['geometry'].area
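
If areas in physical units are wanted, one option is to reproject to an equal-area CRS before calling `.area`. The sketch below assumes geopandas with pyproj is available and uses EPSG:6933 (a global equal-area projection) as one possible choice. The one-degree box is toy data, not part of the notebook.

```python
import geopandas as gpd
from shapely.geometry import box

# A 1 x 1 degree cell touching the equator, in longitude/latitude (WGS 84).
cell = gpd.GeoSeries([box(0, 0, 1, 1)], crs="EPSG:4326")

# Area computed directly on the degree coordinates: square degrees.
area_deg2 = cell.iloc[0].area

# Reproject to an equal-area CRS (units: metres), then convert m^2 to km^2.
area_km2 = cell.to_crs("EPSG:6933").area.iloc[0] / 1e6

print(area_deg2, round(area_km2))
```

The same two calls (`to_crs` followed by `.area`) could be applied to the country geometries, as long as the geometry column has a CRS set.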

## Loading Bokeh-geopandas plotting class

In [None]:
import BokehPlottingScripts as BPS

In [None]:
# Example plot:
p = BPS.bokeh_plot_map(gpd_df, data='area_p', title='Africa - Colour proportionate with area')

## Another plot (test)

In [None]:
gpd_df['test'] = 0
# Country names were aligned with ACLED above, so use the full name here:
ind = gpd_df.loc[:,'name']=='Democratic Republic of Congo'
gpd_df.loc[ind, 'test']=1
q = BPS.bokeh_plot_map(gpd_df, data='test', title='Africa - Democratic Republic of Congo highlighted')

# Exploring the ACLED dataset
For this project, we focus on analysis of one country. Let's list the countries with the most entries:

In [None]:
df_full['country'].value_counts().head()

In [None]:
#country_to_analyze = 'Somalia'
country_to_analyze = 'Democratic Republic of Congo'

# New dataframe, filtering on one country:
df = df_full.loc[df_full.loc[:,'country']==country_to_analyze].reset_index(drop=True)

print("The dataframe now contains", len(df), "entries.")

## Types of events
The dataset contains the categorical column **event_type**, describing the different types of events in the dataset (e.g., 'riots/protests' and 'violence against civilians').

In [None]:
print(df['event_type'].value_counts())

In [None]:
df_piv = df.pivot_table(index='event_date',
                              columns='event_type',
                              values='fatalities',
                              aggfunc='count') # Count all events (np.count_nonzero would skip events with 0 fatalities)

df_piv = df_piv.fillna(0)
resample_freq = '3M'
df_piv = df_piv.resample(resample_freq).sum()

ax = df_piv.plot(marker='.', linestyle='dotted', figsize=(15,8))
ax.set_title("Number of incidents over time (each data point corresponds to time window of "+resample_freq+")")
ax.set_ylabel("Number of incidents/events (per "+resample_freq+")")
ax.set_xlabel("Month")
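
To make the counting step explicit, here is a small self-contained sketch on made-up events (the dates, types, and fatality numbers are toy values). It counts rows per quarter and event type with `groupby(...).size()`, so events without fatalities are included as well.

```python
import pandas as pd

# Toy events, invented for illustration only.
events = pd.DataFrame({
    "event_date": pd.to_datetime(
        ["2015-01-10", "2015-02-20", "2015-02-20", "2015-05-03", "2015-08-15"]
    ),
    "event_type": [
        "Riots/Protests",
        "Violence against civilians",
        "Riots/Protests",
        "Riots/Protests",
        "Violence against civilians",
    ],
    "fatalities": [0, 3, 0, 1, 12],
})

# Bucket each event into its calendar quarter, then count rows per type.
quarter = events["event_date"].dt.to_period("Q")
counts = (
    events.groupby([quarter, "event_type"])
    .size()
    .unstack(fill_value=0)  # one column per event type, 0 where none occurred
)
print(counts)
```

Counting rows rather than non-zero fatalities keeps the two questions apart: how many incidents happened, and how deadly they were.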


## Fatalities
Each datapoint also contains a column with an estimate of the number of fatalities in the incident described.

Let's look at the data:

In [None]:
df_p_fat = df.pivot_table(index='event_date',
                              columns='event_type',
                              values='fatalities',
                              aggfunc=np.sum) # This time we add (instead of count)

stats = df_p_fat.describe().loc[['mean', 'std', 'max']]
# Adding total (DataFrame.append was removed in pandas 2.0):
stats.loc['sum'] = df_p_fat.sum()
stats

The dataset also contains the column **fatalities**.

# Everything below here is still work in progress:

## Outliers, if we want to...

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
mu, sd = df['fatalities'].mean(), df['fatalities'].std()
z_values = (df['fatalities']-mu)/sd
z_values.head(5)
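
One common way to use such z-values is to flag entries that lie more than a chosen number of standard deviations from the mean. The sketch below uses toy numbers and a threshold of 3, which is a conventional but arbitrary choice. With heavy-tailed data like fatality counts, the mean and standard deviation are themselves pulled by the extreme values, so this is only a rough first filter.

```python
import pandas as pd

# Toy data: twenty events without fatalities and one extreme event.
fatalities = pd.Series([0] * 20 + [100])

# Standardize: distance from the mean in units of the sample standard deviation.
z = (fatalities - fatalities.mean()) / fatalities.std()

# Flag entries whose absolute z-value exceeds the threshold.
threshold = 3
outliers = fatalities[z.abs() > threshold]
print(outliers)
```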

In [None]:

fat = df[df['fatalities']>=1]


fat['fatalities'].plot.hist(bins=1000)
plt.yscale('log')
plt.xlim(0,1000)

In [None]:
fat_zero = sum(df['fatalities']<1)
fat_nonzero = sum(df['fatalities']>=1)

print("No fat:", fat_zero, "Fat:", fat_nonzero, "Fraction:", fat_nonzero/(fat_nonzero+fat_zero))

In [None]:
df['fatalities'].describe()

In [None]:

fat = df[df['fatalities']>=0]


fat['fatalities'].plot.hist(bins=100)
plt.yscale('log')
#plt.xlim(0,1000)

## Plotting some conflicts

In [None]:
from shapely.geometry import Point

def apply_geo_points(row):
    # shapely expects (x, y), i.e. (longitude, latitude)
    return Point(row['longitude'], row['latitude'])

df['geo_point'] = df.apply(apply_geo_points, axis=1)

In [None]:
df_remote_v = df.loc[:,]

In [None]:
from geopandas import GeoSeries

# Coordinates are longitude/latitude in WGS 84:
geoseries = GeoSeries(df['geo_point'], crs='EPSG:4326')
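
As an alternative to building the points row by row, geopandas offers `points_from_xy`, which makes the axis order explicit (x is longitude, y is latitude). A minimal sketch with two toy coordinate pairs (approximate values, not taken from the dataset):

```python
import pandas as pd
import geopandas as gpd

# Toy events with approximate coordinates (illustrative only).
events = pd.DataFrame({
    "latitude": [-4.32, 0.51],
    "longitude": [15.31, 25.19],
})

# x = longitude, y = latitude; EPSG:4326 is WGS 84 in degrees.
gdf = gpd.GeoDataFrame(
    events,
    geometry=gpd.points_from_xy(events["longitude"], events["latitude"]),
    crs="EPSG:4326",
)
print(gdf.geometry.x.tolist(), gdf.geometry.y.tolist())
```

A GeoDataFrame built this way keeps the event attributes next to the geometry, which makes later filtering and plotting simpler.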

In [None]:
from bokeh.io import output_notebook, show


In [None]:
#https://anaconda.org/debboutr/lightning/notebook
# x is longitude, y is latitude (assumes p is the Bokeh figure returned above):
p.circle(df['longitude'], df['latitude'], size=10)

In [None]:
show(p)

In [None]:
# The .x accessor lives on the GeoSeries, not on a plain pandas Series:
geoseries.x

In [None]:
df['latitude'][0]

In [None]:

geoseries.plot(marker='.', color='red', markersize=12, figsize=(4, 4))
#plt.xlim([-123, -119.8])
#plt.ylim([44.8, 47.7]);

In [None]:
# A single shapely Point has no .plot(); plot a one-element GeoSeries instead:
geoseries.iloc[[0]].plot(marker='*', color='green', markersize=5)

In [None]:
pd.set_option('display.max_columns', 30)

In [None]:
test = df.loc[df['fatalities']>1000, 'notes']

In [None]:
test

In [None]:
df['fatalities'].hist(bins=30)

In [None]:
df.keys()

In [None]:
print(df['actor2'].value_counts().head(15))