# Milestone 3

In this part of the notebook, we produce the data files that will be visualized in the data story.

We use two approaches to 'tell' the data: the approach by country and the approach by actor.

#### Country approach: 

For each country we want to produce a CSV summarizing as many things as possible:
- civilian deaths time series
- total deaths time series
- GDP time series
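
As an illustration of the per-country time series listed above, here is a minimal sketch on toy data. It uses `pivot_table` rather than the explicit loops of the cells further down; the helper name `yearly_series` and the toy numbers are assumptions made for the example, while the column names mirror those used in this notebook.

```python
import pandas as pd

# Toy events table (made-up numbers); column names mirror the GED columns used here.
events = pd.DataFrame({
    'country': ['Mali', 'Mali', 'Mali', 'Chad'],
    'year': [1990, 1990, 1991, 1991],
    'best': [10, 5, 2, 7],
    'deaths_civilians': [3, 0, 1, 4],
})

def yearly_series(events, value_col, prefix, years):
    """Return one row per country and one `<prefix>_<year>` column per year."""
    wide = events.pivot_table(index='country', columns='year', values=value_col,
                              aggfunc='sum', fill_value=0)
    # Make sure every requested year has a column, even without any event.
    wide = wide.reindex(columns=years, fill_value=0)
    wide.columns = ['{}_{}'.format(prefix, y) for y in years]
    wide[prefix + '_sum'] = wide.sum(axis=1)
    return wide

years = [1990, 1991, 1992]
total = yearly_series(events, 'best', 'deaths_total', years)
civ = yearly_series(events, 'deaths_civilians', 'deaths_civilians', years)
summary = pd.concat([total, civ], axis=1)
```

Countries without any event would still have to be appended with zeros, as done with `additionals` below.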

#### Actor approach:

As announced in milestone 2, we would like to classify actors according to, for example, their goal or their philosophy.

It's interesting to note that we have enriched our data set with an additional UCDP data set that lists the full names of actors and not only their acronyms.

We proceed in three steps:
- first we categorize as many actors as possible by looking for keywords in their names (such as `Islamic`, `Cartel`, `Government`)
- then we scrape some key Wikipedia pages in order to retrieve information on international armed groups. This part is explained in the notebook `wikipedia_scraping`.
- finally we try to match those scraped armed groups with those involved in the events in our data set in order to be able to filter events by nature (for example terrorist attacks, left-wing uprisings ...)

The pages we scrape are:
- [List of left-wing rebel groups](https://en.wikipedia.org/wiki/List_of_left-wing_rebel_groups)
- [List of designated terrorist groups](https://en.wikipedia.org/wiki/List_of_designated_terrorist_groups)
- [List of active rebel groups](https://en.wikipedia.org/wiki/List_of_active_rebel_groups)
- [List of guerrilla movements](https://en.wikipedia.org/wiki/List_of_guerrilla_movements)



In [1]:
import pandas as pd 
import numpy as np
import pickle

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [3]:
df = pd.read_csv('../data/ged171.csv')
print(df.shape)

(135181, 42)


# Data on the countries

In this part of the notebook, we produce the per-country data files that will be visualized in the data story.

In [8]:
additionals = ['Antarctica', 'French Southern and Antarctic Lands', 'Australia', 'Austria', 'Benin', 'Bulgaria', 
               'The Bahamas', 'Belarus', 'Belize', 'Brunei', 'Switzerland', 'Chile', 'Costa Rica', 'Cuba', 
               'Northern Cyprus', 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'Estonia', 
               'Finland', 'Fiji', 'Falkland Islands', 'Gabon', 'Gambia', 'Equatorial Guinea', 'Greece', 
               'Greenland', 'Hungary', 'Ireland', 'Iceland', 'Japan', 'Kazakhstan', 'South Korea', 'Kosovo', 
               'Lithuania', 'Luxembourg', 'Latvia', 'Montenegro', 'Mongolia', 'Malawi', 'New Caledonia', 'Norway', 
               'New Zealand', 'Oman', 'Poland', 'Puerto Rico', 'North Korea', 'Portugal', 'Western Sahara', 
               'Somaliland', 'Suriname', 'Slovakia', 'Slovenia', 'Sweden', 'Syria', 
               'Turkmenistan', 'East Timor', 'Taiwan', 'Uruguay', 'Vietnam', 'Vanuatu', 'West Bank']

In [9]:
# select the numeric columns before summing so that text columns are never aggregated
death_counts = df.groupby('country')[['best', 'deaths_civilians']].sum()
death_counts.columns = ['total', 'civilians']
death_counts.to_csv('../data/story/death_counts.csv', sep=';')

In [10]:
# sorted list of country names, so that the row order of the tables below is deterministic
countries = sorted(df['country'].unique())

In [18]:
year_country = df.groupby(['country', 'year'])['deaths_civilians'].sum()
tmp = pd.DataFrame(year_country)
civilian_deaths = pd.DataFrame(index=countries, columns=df.drop_duplicates('year')['year'].sort_values().values)
civilian_deaths.fillna(value=0, inplace=True)

for i in tmp.index:
    civilian_deaths.loc[i[0], i[1]] = tmp.loc[i].values

for add in additionals:
    civilian_deaths.loc[add, :] = 0

In [19]:
cols = ['deaths_civilians_{}'.format(i) for i in range(1989, 2017)]
civilian_deaths.columns = cols
civilian_deaths['deaths_civilians_sum'] = civilian_deaths.sum(axis=1)

In [20]:
year_country = df.groupby(['country', 'year'])['best'].sum()
tmp = pd.DataFrame(year_country)
total_deaths = pd.DataFrame(index=countries, columns=df.drop_duplicates('year')['year'].sort_values().values)
total_deaths.fillna(value=0, inplace=True)
total_deaths.sort_index(inplace=True)

for i in tmp.index:
    total_deaths.loc[i[0], i[1]] = tmp.loc[i].values
    
for add in additionals:
    total_deaths.loc[add, :] = 0

In [21]:
cols = ['deaths_total_{}'.format(i) for i in range(1989, 2017)]
total_deaths.columns = cols
total_deaths['deaths_total_sum'] = total_deaths.sum(axis=1)

In [22]:
df['event'] = 1
tmp = pd.DataFrame(df.groupby(['country', 'year'])['event'].sum())
events_count = pd.DataFrame(index=countries, columns=df.drop_duplicates('year')['year'].sort_values().values)
events_count.fillna(value=0, inplace=True)

for i in tmp.index:
    events_count.loc[i[0], i[1]] = tmp.loc[i].values

for add in additionals:
    events_count.loc[add, :] = 0

In [23]:
cols = ['events_count_{}'.format(i) for i in range(1989, 2017)]
events_count.columns = cols
events_count['events_count_sum'] = events_count.sum(axis=1)

In [33]:
d = {}
d["Cote d'Ivoire"] = 'Ivory Coast'
d['Congo, Dem. Rep.'] = 'DR Congo (Zaire)'
d['Yemen, Rep.'] = 'Yemen (North Yemen)'
d['Serbia'] = 'Serbia (Yugoslavia)'
d['Romania'] = 'Rumania'
d['Iran, Islamic Rep.'] = 'Iran'
d['Kyrgyz Republic'] = 'Kyrgyzstan'
d['Myanmar'] = 'Myanmar (Burma)'
d['Bosnia and Herzegovina'] = 'Bosnia-Herzegovina'
d['Lao PDR'] = 'Laos'
d['Russian Federation'] = 'Russia (Soviet Union)'
d['Egypt, Arab Rep.'] = 'Egypt'
d['Congo, Rep.'] = 'Congo'
d['Cambodia'] = 'Cambodia (Kampuchea)'
d['Madagascar'] = 'Madagascar (Malagasy)'
d['Venezuela, RB'] = 'Venezuela'
d['United States'] = 'United States of America'
d['Zimbabwe'] = 'Zimbabwe (Rhodesia)'
d['Bahamas, The'] = 'The Bahamas'
d['Brunei Darussalam'] = 'Brunei'
d['Gambia, The'] = 'Gambia'
d['Korea, Rep.'] = 'South Korea'
d['Korea, Dem. People’s Rep.'] = 'North Korea'
d['Syrian Arab Republic'] = 'Syria'
d['Timor-Leste'] = 'East Timor'
d['West Bank and Gaza'] = 'West Bank'

In [34]:
gdp = pd.read_csv('../data/gdp_worldbank.csv')

for i in gdp.index:
    if gdp.loc[i, 'Country Name'] in d.keys():
        gdp.loc[i, 'Country Name'] = d[gdp.loc[i, 'Country Name']]
        

gdp.drop(['Indicator Name'], axis = 1, inplace=True)
gdp.drop(['Country Code'], axis = 1, inplace=True)
gdp.drop(['Indicator Code'], axis = 1, inplace=True)
gdp.drop(['Unnamed: 62'], axis = 1, inplace=True)
gdp.drop(['2017'], axis = 1, inplace=True)

for i in range(1960, 1989):
    gdp.drop([str(i)], axis = 1, inplace=True)

gdp.set_index('Country Name', inplace=True)
gdp.fillna(value=0, inplace=True)
gdp.columns = ['gdp_{}'.format(i) for i in range(1989, 2017)]
gdp = gdp.loc[events_count.index]

In [35]:
pd.concat([events_count, total_deaths, gdp, civilian_deaths], axis=1).to_csv('../data/story/countries.csv')

# Actors classification

In this part of the notebook we want to produce statistics on the actors. 

- We apply the same cleaning of the actors' names as in milestone 2. This step gives us a dictionary of actor ids and actor names.
- We enrich the data set with the additional UCDP data set, which provides full names for actors.
- Actors are first classified by looking for keywords in their full names, such as `Islamic`, `Cartel` or `Government`.
- The data is then enriched by scraping some key Wikipedia pages (`List of designated terrorist groups`, `List of left-wing rebel groups`...).
- Actors are matched with the armed groups from the Wikipedia scraping by using a **string distance** (like in hw2).
- For each actor, we go through all the events listed in the data frame and compute the number of deaths caused in the ranks of its opponents, as well as the total number of civilian deaths in the events it has been involved in.
- Finally, we merge all the data we collected into one big data frame that summarizes all we know about the actors.

This final data frame will be used to visualize our data story.
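
The per-actor death counts described in the list above can be sketched on toy data as follows. This is only an illustration under stated assumptions: it relies on `explode` and `groupby` instead of the row-by-row loop used later in the notebook, it assumes the side columns already hold lists of ids (as they do after the cleaning step), and the function name `actor_stats` and all numbers are made up.

```python
import pandas as pd

# Toy events: each side is a list of actor ids (made-up ids and counts).
events = pd.DataFrame({
    'side_a_new_id': [[1], [1, 2], [3]],
    'side_b_new_id': [[3], [4], [1]],
    'deaths_a': [2, 0, 5],
    'deaths_b': [4, 6, 1],
    'deaths_civilians': [1, 3, 0],
})

def actor_stats(events):
    """Sum, per actor, the deaths in the opposing ranks and the civilian deaths."""
    # Actors on side A are credited with the losses of side B, and vice versa.
    a = events[['side_a_new_id', 'deaths_b', 'deaths_civilians']].explode('side_a_new_id')
    a.columns = ['id', 'total_deaths', 'civilian_deaths']
    b = events[['side_b_new_id', 'deaths_a', 'deaths_civilians']].explode('side_b_new_id')
    b.columns = ['id', 'total_deaths', 'civilian_deaths']
    both = pd.concat([a, b], ignore_index=True)
    both['id'] = both['id'].astype(int)
    return both.groupby('id')[['total_deaths', 'civilian_deaths']].sum()

stats = actor_stats(events)
```

The result has one row per actor id, which could then be joined to the actors table on `id`.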

## 1) Data Cleaning (like in milestone 2)

Let's get a list of all the actors involved on `side_a` or `side_b` in all events since 1989.

In [12]:
actors = {}
for i in df.index:
    tmp_id = df.loc[i, 'side_a_new_id']
    if tmp_id in actors.keys():
        actors[tmp_id].add(df.loc[i, 'side_a'])
    if tmp_id not in actors.keys():
        actors[tmp_id] = set()
        actors[tmp_id].add(df.loc[i, 'side_a'])
    tmp_id = df.loc[i, 'side_b_new_id']
    if tmp_id in actors.keys():
        actors[tmp_id].add(df.loc[i, 'side_b'])
    if tmp_id not in actors.keys():
        actors[tmp_id] = set()
        actors[tmp_id].add(df.loc[i, 'side_b'])

`actors` is a dictionary whose keys are side ids and whose values are the sets of side names.

**Need for cleaning:**
We notice that some sides have more than one actor, such as `High Council of Afghanistan Islamic Emirate, IS`: both the HCAIE and IS were involved side by side in this event. We want to clean this so that each id refers to a single actor. To do so, we modify the fields `side_a`, `side_b`, `side_a_new_id` and `side_b_new_id` of the initial DataFrame `df` and turn them into lists.
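
As a sketch of the splitting idea, the hypothetical helper below (`split_side` is not part of the notebook) splits a side string on commas while ignoring commas inside parentheses. The notebook itself uses a plain `split(',')` with one special case; the parenthesized name in the second call is invented for illustration.

```python
import re

def split_side(side):
    """Split a `side_a`/`side_b` string into individual actor names.

    Commas inside parentheses are kept, so a name such as
    'Military faction (forces of X, Y)' is not cut in two.
    """
    # Split on a comma unless a closing parenthesis follows before any opening one.
    parts = re.split(r',\s*(?![^()]*\))', side)
    return [p.strip() for p in parts if p.strip()]

names = split_side('High Council of Afghanistan Islamic Emirate, IS')
grouped = split_side('Military faction (forces of X, Y), Government of Z')
```

Here `names` holds the two actors separately, and `grouped` keeps the parenthesized name in one piece.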

In [13]:
for k in actors.keys():
    n = actors[k].pop()
    n = n.split(',')
    if n[0] == 'Military faction (forces of Honasan': # this is the only case where the split function 
                                                      # gives something inconsistent
        n = [n[0]+n[1]]
    # strip the whitespace left around each name by the split
    n = [part.strip() for part in n]
    actors[k] = n

In [14]:
# this function goes through the dictionary to return the id of the actor.
def find_id(actor):
    for i in actors.keys():
        if actor in actors[i]:
            return i

In [15]:
initial_keys = list(actors.keys())
for k in initial_keys:
    if len(actors[k]) > 1:
        for i in range(len(actors[k])):
            if [actors[k][i]] in actors.values():
                idd = find_id(actors[k][i])
                actors[k][i] = idd
            else:
                # this actor is not in the dictionary yet (it has not been involved on its own in an event)
                # let's create a new id and insert it in the dictionary
                act = actors[k][i]
                new_id = np.amax(list(actors.keys())) + 1 # this is the new id
                actors[k][i] = new_id
                actors[new_id] = [act] # the actor is inserted

Now let's complete the dataframe `df`: `side_a` and `side_b` become lists of names, and `side_a_new_id` and `side_b_new_id` become lists of actor ids.

In [16]:
# We need to turn the column types into objects in order to insert lists
df.side_a_new_id = df.side_a_new_id.astype(object)
df.side_b_new_id = df.side_b_new_id.astype(object)

In [17]:
for i in df.index:
    ida = int(df.at[i, 'side_a_new_id'])
    idb = int(df.at[i, 'side_b_new_id'])
    
    if len(actors[ida]) == 1: 
        df.at[i, 'side_a_new_id'] = [ida]
    else:
        df.at[i, 'side_a_new_id'] = [t for t in actors[ida]]
        
    if len(actors[idb]) == 1: 
        df.at[i, 'side_b_new_id'] = [idb]
    else:
        df.at[i, 'side_b_new_id'] = [t for t in actors[idb]]
    
    # replace the side names by the list of names of the individual actors
    df.at[i, 'side_a'] = np.array([actors[k][0] for k in df.at[i, 'side_a_new_id']])
    df.at[i, 'side_b'] = np.array([actors[k][0] for k in df.at[i, 'side_b_new_id']])

Now we can remove the ids that corresponded to multiple actors.

In [18]:
def remove_multiples(d):
    r = dict(d)
    for k in d.keys():
        if len(d[k]) > 1:
            del(r[k])
    return r

actors = remove_multiples(actors)

The following example shows how we have parsed `side_a` and `side_b` to extract the three governments involved in the event on `side_a`.

In [19]:
df.loc[df.index >= 70444, ['conflict_name', 'dyad_name', 'side_a_new_id', 'side_b_new_id', 'side_a', 'side_b']].head(1)

| | conflict_name | dyad_name | side_a_new_id | side_b_new_id | side_a | side_b |
|---|---|---|---|---|---|---|
| 70444 | Governments of Australia, United Kingdom, Unit... | Government of Australia, Government of United ... | [6740, 28, 3] | [116] | [Government of Australia, Government of United... | [Government of Iraq] |


## 2) Classifying the actors

In [20]:
act = pd.read_csv('../data/actorlist.csv')
ids = set(actors.keys())

In [21]:
def find(actor):
    s = set()
    for i in df.index:
        if actor in df.loc[i, 'side_a']:
            s.add(df.loc[i, 'country'])
        if actor in df.loc[i, 'side_b']:
            s.add(df.loc[i, 'country'])
    print(s)

In [22]:
def find_full_name(name):
    for i in act.index:
        if act.loc[i, 'Name'] == name:
            return act.loc[i, 'NameFull']
    return None

In [23]:
def find_name(full_name):
    for i in act.index:
        if act.loc[i, 'NameFull'] == full_name:
            return act.loc[i, 'Name']
    return None

In [24]:
def find_actors(dictionnary, words):
    for i in ids.copy():
        name = actors[i][0]
        full_name = find_full_name(name)
        for word in words:
            if not pd.isnull(full_name) and (word in full_name.lower()):
                dictionnary[i] = full_name
                break

### Classification by names

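We look for some keywords in the full names of all actors in order to build a classification. An actor can belong to several classes, so `find_actors` never removes ids from `ids`, and the `Remaining actors` count printed in each cell below stays at the total number of actors.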
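
The keyword lookup can be sketched on toy data as follows. This is a simplified, self-contained variant for illustration only: the function `classify`, the ids and the small name table are assumptions, whereas the cells below use `find_actors` with one dictionary per class.

```python
def classify(full_names, keywords_by_class):
    """Map each actor id to the list of classes whose keywords occur in its full name."""
    labels = {}
    for actor_id, full_name in full_names.items():
        if not isinstance(full_name, str):
            continue  # no full name available for this actor
        lowered = full_name.lower()
        hits = [cls for cls, words in keywords_by_class.items()
                if any(word in lowered for word in words)]
        if hits:
            labels[actor_id] = hits
    return labels

# Toy input with made-up ids; the keywords are a subset of those used below.
full_names = {1: 'Government of Iraq', 2: 'Islamic Salvation Front',
              3: 'Sinaloa Cartel', 4: None}
keywords = {'govs': ['government'],
            'liberation_movements': ['liberation', 'salvation'],
            'islamists': ['islam', 'jihad'],
            'cartels': ['cartel']}
labels = classify(full_names, keywords)
```

In this toy run, actor 2 ends up in two classes, which is why a class lookup must not consume the ids it has already matched.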

#### Governments

In [25]:
govs = {}

In [26]:
find_actors(govs, ['government'])

In [27]:
print('Number of governments: {}'.format(len(govs)))
print('Remaining actors: {}'.format(len(ids)))

Number of governments: 91
Remaining actors: 1128


#### Liberation movements

In [28]:
liberation_movements = {}

In [29]:
find_actors(liberation_movements, ['liberation', 'salvation'])

In [30]:
print('Number of liberation movements: {}'.format(len(liberation_movements)))
print('Remaining actors: {}'.format(len(ids)))

Number of liberation movements: 73
Remaining actors: 1128


#### Insurgents

In [31]:
insurgents = {}

In [32]:
find_actors(insurgents, ['insurgents'])

In [33]:
print('Number of insurgents: {}'.format(len(insurgents)))
print('Remaining actors: {}'.format(len(ids)))

Number of insurgents: 4
Remaining actors: 1128


#### Communist revolutionaries

In [34]:
communists = {}

In [35]:
find_actors(communists, ['socialist', 'communist', "people's"])

In [36]:
print('Number of communists: {}'.format(len(communists)))
print('Remaining actors: {}'.format(len(ids)))

Number of communists: 41
Remaining actors: 1128


#### Islamists

In [37]:
islamists = {}

In [38]:
find_actors(islamists, ['islam', 'jihad'])

In [39]:
print('Number of islamists: {}'.format(len(islamists)))
print('Remaining actors: {}'.format(len(ids)))

Number of islamists: 40
Remaining actors: 1128


#### Factions

In [40]:
factions = {}

In [41]:
find_actors(factions, ['faction'])

In [42]:
print('Number of factions: {}'.format(len(factions)))
print('Remaining actors: {}'.format(len(ids)))

Number of factions: 72
Remaining actors: 1128


#### Cartels

In [43]:
cartels = {}

In [44]:
find_actors(cartels, ['cartel'])

In [45]:
print('Number of cartels: {}'.format(len(cartels)))
print('Remaining actors: {}'.format(len(ids)))

Number of cartels: 22
Remaining actors: 1128


#### Muslims

In [46]:
muslims = {}

In [47]:
find_actors(muslims, ['muslim'])

In [48]:
print('Number of muslims: {}'.format(len(muslims)))
print('Remaining actors: {}'.format(len(ids)))

Number of muslims: 9
Remaining actors: 1128


#### Christians

In [49]:
christians = {}

In [50]:
find_actors(christians, ['christian'])

In [51]:
print('Number of christians: {}'.format(len(christians)))
print('Remaining actors: {}'.format(len(ids)))

Number of christians: 5
Remaining actors: 1128


#### Republics

In [52]:
republics = {}

In [53]:
find_actors(republics, ['republic'])

In [54]:
print('Number of republics: {}'.format(len(republics)))
print('Remaining actors: {}'.format(len(ids)))

Number of republics: 33
Remaining actors: 1128


In [55]:
classes = [govs, liberation_movements, insurgents, communists, islamists, factions, cartels, muslims, christians, republics]

In [56]:
classified = set()
for class_ in classes:
    classified = classified.union(set(class_.values()))

## Scraping Wikipedia

In this part, we load a dataframe that has been generated in the notebook `wikipedia_scraping`. It contains the actors listed on the following pages: 
- [List of left-wing rebel groups](https://en.wikipedia.org/wiki/List_of_left-wing_rebel_groups)
- [List of designated terrorist groups](https://en.wikipedia.org/wiki/List_of_designated_terrorist_groups)
- [List of active rebel groups](https://en.wikipedia.org/wiki/List_of_active_rebel_groups)
- [List of guerrilla movements](https://en.wikipedia.org/wiki/List_of_guerrilla_movements)


In [57]:
wiki = pd.read_pickle('../data/wikipedia_scraping.pickle')

In [58]:
all_actors = set()
for ind in actors.keys():
    for i in actors[ind]:
        all_actors.add(i)

In [60]:
all_actor_full_names = set()
for actor in all_actors:
    full_name = find_full_name(actor)
    if full_name is not None:
        all_actor_full_names.add(full_name)
    else:
        all_actor_full_names.add(actor)

In [61]:
wiki_names = set(wiki['name'].values)

We now try to match the names from the `all_actor_full_names` set to `wiki_names`. Which of our actors have we found on Wikipedia?
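
To illustrate the matching logic without any third-party dependency, here is a minimal sketch based on `difflib` from the standard library. It is a stand-in, not the method used below: the notebook relies on `fuzzywuzzy`, whose scores range from 0 to 100 and are not directly comparable to the ratios computed here. The helper names, the `0.86` threshold and the toy name lists are assumptions.

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    """Return (candidate, ratio) for the candidate most similar to `name`."""
    scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
              for c in candidates]
    return max(scored, key=lambda pair: pair[1])

def match_names(names, candidates, threshold=0.86):
    """Separate exact matches from close matches above `threshold`."""
    matched, almost_matched = {}, {}
    for name in names:
        candidate, score = best_match(name, candidates)
        if candidate == name:
            matched[name] = candidate
        elif score > threshold:
            almost_matched[name] = candidate
    return matched, almost_matched

# Toy lists: the second of our names is an invented variant of a Wikipedia name.
wiki_list = ['Popular Revolutionary Army', 'African National Congress']
ours = ['African National Congress', 'The Popular Revolutionary Army', 'Fur']
matched_toy, almost_toy = match_names(ours, wiki_list)
```

Short names such as `Fur` fall below the threshold here, but with other scorers they can produce spurious matches, so close matches still deserve a manual check.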

In [62]:
from fuzzywuzzy import process

In [63]:
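# drop missing full names (NaN); this also works when no NaN is present
all_actor_full_names = {n for n in all_actor_full_names if not pd.isnull(n)}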

In [64]:
matched = {}
almost_matched = {}
pairs = set()
for name in all_actor_full_names:   
    test = process.extractOne(name, wiki_names)
    if test[0] == name:
        matched[name] = name
    if test[0] != name and test[1] > 86:
        almost_matched[name] = test[0]

The following are mismatches that we remove by hand.

In [65]:
del almost_matched['Afrikaner Resistance Movement']
del almost_matched['Amaro']
del almost_matched['Arab']
del almost_matched['Arab Movement of Azawad']
del almost_matched['Ari']
del almost_matched['Armed Islamic Group']
del almost_matched['Armed Islamic Movement']
del almost_matched['Armenian']
del almost_matched['Bari']
del almost_matched['Bru National Liberation Front']
del almost_matched['Dawa']
del almost_matched['Chad National Front']
del almost_matched["Eelam People's Revolutionary Liberation Front"]
del almost_matched['Faith Movement of Arakan']
del almost_matched['Fur']
del almost_matched['Garo National Liberation Army']
del almost_matched['Geri']
del almost_matched['Isatabu Freedom Movement']
del almost_matched['Islamic Movement of Kurdistan']
del almost_matched['Islamic Resistance Movement']
del almost_matched['Issa']
del almost_matched['Khmer People’s National Liberation Front']
del almost_matched['Kurdish Democratic Party of Iraq']
del almost_matched['Lao Resistance Movement']
del almost_matched['Macina Liberation Front']
del almost_matched['March 23 Movement']
del almost_matched['Mayi Mayi']
del almost_matched['Military faction']
del almost_matched['Military faction (Red Berets)']
del almost_matched['Military faction (forces of Amsha Desta and Merid Negusie)']
del almost_matched['Military faction (forces of Andres Rodriguez)']
del almost_matched['Military faction (forces of André Kolingba)']
del almost_matched['Military faction (forces of Godefroid Niyombare)']
del almost_matched['Military faction (forces of Maldoum Bada Abbas)']
del almost_matched['Military faction (forces of Nicolae Ceausescu)']
del almost_matched['Military faction (forces of Shahnawaz Tanay)']
del almost_matched['Military faction (forces of Suret Husseinov)']
del almost_matched['Mohajir National Movement']
del almost_matched['Moro']
del almost_matched['Movement for Justice and Peace']
del almost_matched['Movement for Unity and Jihad in West Africa']
del almost_matched['National Democratic Front for Bodoland -  Ranjan Daimary faction']
del almost_matched['National Front for the Liberation of Haiti']
del almost_matched['National Guard and Mkhedrioni']
del almost_matched['National Islamic Movement']
del almost_matched['Orma']
del almost_matched['Oromo']
del almost_matched['Palestinian National Liberation Movement']
del almost_matched['Pan']
del almost_matched['Pari']
del almost_matched['Patriotic Salvation Movement']
del almost_matched["People's Liberation Front"]
del almost_matched['Popular Movement for the Liberation of Azawad']
del almost_matched['Somali National Movement']
del almost_matched['Somali Patriotic Movement']
del almost_matched['Taleban Movement of Pakistan']
del almost_matched['The Knights Templar']
del almost_matched['The Zetas']
del almost_matched['Tiv']
del almost_matched['Union']
del almost_matched['Uzbek']
del almost_matched['Islamic Group']
del almost_matched['Islamic Party']

Now we look for names that can be shortened. For example, we would like to replace `Sinaloa Cartel - Los Memos faction` by `Sinaloa Cartel`. To do so, we look for the names that appear inside other actors' names. This is done by the interactive cell below, which asks for each match whether we want to list the two actors for replacement.

As the following cell needs interaction, it is commented out and its result has been stored as a pickle file. 
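
The candidate search itself, without the interactive confirmation, can be sketched as follows. The function name `shortening_candidates` and the three-name toy set are assumptions made for the example.

```python
def shortening_candidates(names):
    """List (short, long) pairs where `short` occurs inside `long`, ignoring case."""
    pairs = []
    for short in sorted(names):
        for long_name in sorted(names):
            if short != long_name and short.lower() in long_name.lower():
                pairs.append((short, long_name))
    return pairs

toy_names = {'Sinaloa Cartel', 'Sinaloa Cartel - Los Memos faction', 'Military faction'}
pairs = shortening_candidates(toy_names)
```

Each pair is only a candidate: a substring match does not guarantee that both names refer to the same group, which is why the notebook asks for confirmation.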

In [66]:
"""
replacement = {}
for full_name in classified:
    for test in classified:
        if full_name != test and full_name.lower() in test.lower():
            print(full_name + ' //// ' + test)
            inp = input()
            if 'y' in inp.lower():
                name = find_name(full_name)
                test_name = find_name(test)
                if name is None:
                    name = full_name
                if test_name is None:
                    test_name = test   
                replacement[find_id(test_name)] = find_id(name)
                
                
with open('../data/replacement.pickle', 'wb') as fi:
    pickle.dump(replacement, fi)
"""

This is the pickle file listing the replacements to do.

In [67]:
with open('../data/replacement.pickle', 'rb') as fi:
    replacement = pickle.load(fi)

Now, in the original data frame, we replace the ids of the actors that we listed for replacement earlier.

In [68]:
for i in df.index:
    tmp_a = df.loc[i, 'side_a_new_id']
    tmp_b = df.loc[i, 'side_b_new_id']
    for k in range(len(tmp_a)):
        if tmp_a[k] in replacement.keys():
            tmp_a[k] = replacement[tmp_a[k]]
    for k in range(len(tmp_b)):
        if tmp_b[k] in replacement.keys():
            tmp_b[k] = replacement[tmp_b[k]]
    df.at[i, 'side_a_new_id'] = tmp_a
    df.at[i, 'side_b_new_id'] = tmp_b

Now we list the ids of the actors that we found during the Wikipedia scraping.

In [69]:
wiki_ids = {}

In [70]:
for full_name in matched.keys():
    name = find_name(full_name)
    if name is None:
        name = full_name
    idd = find_id(name)
    if idd is not None:
        wiki_ids[idd] = matched[full_name]


In [71]:
for full_name in almost_matched.keys():
    name = find_name(full_name)
    if name is None:
        name = full_name
    idd = find_id(name)
    if idd is not None:
        wiki_ids[idd] = almost_matched[full_name]


Now we list the ids of the actors that we found during the classification process (using keywords).

In [72]:
classified_ids = {}

In [73]:
classes_names = ['govs', 'liberation_movements', 'insurgents', 'communists', 'islamists', 'factions', 'cartels', 'muslims', 'christians', 'republics']
for i in range(len(classes)):
    for idd in classes[i].keys():
        if idd in classified_ids.keys():
            classified_ids[idd] = classified_ids[idd] + [classes_names[i]]
        else:
            classified_ids[idd] = [classes[i][idd], classes_names[i]]

In [74]:
len(set(classified_ids.keys()).union(set(wiki_ids.keys())))

435

In total, we managed to get information on 435 actors.

Now we merge all the data into a single unified data frame summarizing all we know about each actor.
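
As a sketch of how the keyword classes can be turned into indicator columns, the hypothetical helper below builds one row per actor from a dictionary shaped like `classified_ids` (full name first, then class labels). The function name `class_flags` and the toy ids are assumptions; the cells below fill `final_actors` row by row instead.

```python
import pandas as pd

def class_flags(classified, class_names):
    """Build one row per actor id with a 1 in each class column it belongs to."""
    rows = []
    for actor_id, (name, *labels) in classified.items():
        row = {'id': actor_id, 'name': name}
        row.update({cls: 1 for cls in labels})
        rows.append(row)
    # Classes an actor does not belong to are left as NaN.
    return pd.DataFrame(rows, columns=['id', 'name'] + class_names)

# Toy input with illustrative ids.
toy_classified = {3: ['Government of Iraq', 'govs'],
                  7: ['Sinaloa Cartel', 'cartels', 'factions']}
flags = class_flags(toy_classified, ['govs', 'factions', 'cartels'])
```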

In [75]:
final_actors = pd.DataFrame(columns = ['id', 'name', 'countries', 'region', 'designated_terrorist', 'current_terrorist', 'former_terrorist', 'left_wing', 'successful_left', 'failed_left', 'former_left', 'guerilla', 'rebel', 'rebel_with', 'rebel_without']+classes_names)

In [76]:
for idd, wiki_name in wiki_ids.items():
    # keep the Wikipedia name of the original id, but store the row under the replaced id
    actor_id = replacement.get(idd, idd)
    final_actors.loc[len(final_actors)] = [actor_id] + list(wiki.loc[wiki.name == wiki_name].values[0]) + [np.nan for _ in range(10)]

In [77]:
for idd in classified_ids.keys():
    if idd not in wiki_ids.keys():
        kk = classified_ids[idd][1:]
        tmp = [np.nan for _ in range(len(classes_names))]
        for k in kk:
            tmp[classes_names.index(k)] = 1
        final_actors.loc[len(final_actors)] = [idd, classified_ids[idd][0]] + [None] + [np.nan for _ in range(12)] + tmp

In [78]:
final_actors.head()

| | id | name | countries | region | designated_terrorist | current_terrorist | former_terrorist | left_wing | successful_left | failed_left | ... | govs | liberation_movements | insurgents | communists | islamists | factions | cartels | muslims | christians | republics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768 | Popular Revolutionary Army | [Mexico] | Latin America | | | | 1.0 | | | ... | | | | | | | | | | |
| 1 | 513 | African National Congress | [South Africa] | | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | | | | | | | | | | |
| 2 | 258 | Unified Communist Party of Nepal (Maoist) | | | 1.0 | 1.0 | 0.0 | | | | ... | | | | | | | | | | |
| 3 | 771 | United Self-Defense Forces of Colombia | | | 1.0 | 1.0 | 0.0 | | | | ... | | | | | | | | | | |
| 4 | 6742 | Communist Party of Malaya | | Asia | | | | | | | ... | | | | | | | | | | |


#### Stats about each actor's involvement.

In [79]:
final_actors['civilian_deaths'] = 0
final_actors['total_deaths'] = 0

In [80]:
counter = 0
for i in df.index:
    counter += 1
    if counter % 10000 == 0:
        print('{}/{}'.format(counter, len(df)))
    for idd in df.loc[i, 'side_a_new_id']:
        if idd in final_actors['id'].values:
            final_actors.loc[final_actors.id == idd, 'civilian_deaths'] += df.loc[i, 'deaths_civilians']
            final_actors.loc[final_actors.id == idd, 'total_deaths'] += df.loc[i, 'deaths_b']
    for idd in df.loc[i, 'side_b_new_id']:
        if idd in final_actors['id'].values:
            final_actors.loc[final_actors.id == idd, 'civilian_deaths'] += df.loc[i, 'deaths_civilians']
            final_actors.loc[final_actors.id == idd, 'total_deaths'] += df.loc[i, 'deaths_a']

10000/135181
20000/135181
30000/135181
40000/135181
50000/135181
60000/135181
70000/135181
80000/135181
90000/135181
100000/135181
110000/135181
120000/135181
130000/135181


Finally, we add the Wikipedia link of each actor.

In [82]:
final_actors['url'] = ''

In [83]:
import wikipedia

for i in final_actors.index:
    try: 
        name = final_actors.loc[i, 'name']
        page = wikipedia.page(name)
        final_actors.loc[i, 'url'] = page.url
    except (wikipedia.PageError, wikipedia.DisambiguationError):
        # no unambiguous page for this name: the url stays empty
        pass

In [85]:
final_actors.to_csv('../data/story/actors.csv')