In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
import datetime
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import *
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
spark = SparkSession.builder.getOrCreate()
%matplotlib inline

In [4]:
DATA_FOLDER = 'data'

To add:
- How we obtained the column names
- The size of each file
- One-hot encoding of the themes and reason
- How we kept the events related to the environment

## What are the major actors involved in environment-related events?

To answer this question, we count how many times each person, organisation, and location is mentioned as the first or second actor of an environment-related event. We use the field Actor1Type1Code to distinguish a person (ELI) from an organisation (BUS) and from a location (null). Actor1Type2Code and Actor1Type3Code identify a more precise subtype within a category. For example, among the elites we can distinguish a politician from a businessman. The representation we would like to use is a word cloud.
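
As a small illustration of the counting described above, the sketch below tallies how often each (name, type code) pair appears in either actor slot. It runs in plain Python on a handful of invented rows. The sample events and the helper name `count_actor_mentions` are assumptions made for illustration, not the code that produced the Spark table used in this notebook.

```python
from collections import Counter

# Invented sample events: (Actor1Name, Actor1Type1Code, Actor2Name, Actor2Type1Code).
# None plays the role of Spark's null.
sample_events = [
    ("UNITED STATES", None, "COMPANY", "BUS"),
    ("COMPANY", "BUS", "UNITED STATES", None),
    ("SHINZO ABE", "ELI", None, None),
    ("UNITED STATES", None, "SHINZO ABE", "ELI"),
]


def count_actor_mentions(events):
    """Count each (name, type code) pair over both actor slots."""
    counts = Counter()
    for name1, type1, name2, type2 in events:
        for name, type_code in ((name1, type1), (name2, type2)):
            if name is not None:  # this sketch skips empty actor slots
                counts[(name, type_code)] += 1
    return counts


mentions = count_actor_mentions(sample_events)
for (name, type_code), count in mentions.most_common():
    print(count, name, type_code)
```

Skipping empty slots is a choice of this sketch; a table that keeps a row for null names would count them as well.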

In [24]:
# Load the data
actors_occurences = spark.read.load(DATA_FOLDER + "/actors_occurences.parquet")

We print the top 100 most cited actors, regardless of their category.

In [26]:
actors_occurences = actors_occurences.sort('Count', ascending=False)
actors_occurences.show(100)

+--------+--------------+---------------+---------------+---------------+
|   Count|    Actor1Name|Actor1Type1Code|Actor1Type2Code|Actor1Type3Code|
+--------+--------------+---------------+---------------+---------------+
|19987569|          null|           null|           null|           null|
| 7358187| UNITED STATES|           null|           null|           null|
| 1492373|    GOVERNMENT|            GOV|           null|           null|
| 1244231|     PRESIDENT|            GOV|           null|           null|
| 1227057|         CHINA|           null|           null|           null|
| 1000850|        RUSSIA|           null|           null|           null|
|  984121|        POLICE|            COP|           null|           null|
|  951719|UNITED KINGDOM|           null|           null|           null|
|  911523|       COMPANY|            BUS|           null|           null|
|  803605|          IRAN|           null|           null|           null|
|  762096|        CANADA|           null|           null|           null|

We now separate persons (ELI), organisations (BUS), and locations (null), since it makes no sense to compare a country with a person.
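
The split by type code can be sketched in plain Python as a simple mapping. The function name `actor_category` and the catch-all label "other" are hypothetical; the sample rows are copied from the table printed above.

```python
from collections import defaultdict


def actor_category(type1_code):
    """Map an Actor1Type1Code value to the coarse category used here."""
    if type1_code is None:
        return "location"
    if type1_code == "ELI":
        return "person"
    if type1_code == "BUS":
        return "organisation"
    return "other"  # hypothetical label for codes such as GOV or COP


# (Actor1Name, Actor1Type1Code, Count) rows taken from the printed table.
rows = [
    ("UNITED STATES", None, 7358187),
    ("POLICE", "COP", 984121),
    ("COMPANY", "BUS", 911523),
    ("CHINA", None, 1227057),
]

# Group the rows by category so each group can be ranked on its own.
by_category = defaultdict(list)
for name, code, count in rows:
    by_category[actor_category(code)].append((name, count))

for category, members in by_category.items():
    print(category, members)
```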

**Elite:** former government officials, celebrities, spokespersons for organizations without further role categorization.

In [27]:
actors_eli_df = actors_occurences.filter(actors_occurences.Actor1Type1Code=="ELI")
actors_eli_df.show(100)

+-----+--------------------+---------------+---------------+---------------+
|Count|          Actor1Name|Actor1Type1Code|Actor1Type2Code|Actor1Type3Code|
+-----+--------------------+---------------+---------------+---------------+
|57272|               ACTOR|            ELI|           null|           null|
|46228|              HASSAN|            ELI|           null|           null|
|45386|             RETIRED|            ELI|           null|           null|
|22976|              ALBERT|            ELI|           null|           null|
|21011|          FIRST LADY|            ELI|           null|           null|
|17428|        BILL CLINTON|            ELI|           null|           null|
|16886|       UNITED STATES|            ELI|           null|           null|
|14999|              HASSAN|            ELI|            GOV|           null|
|12241|          SHINZO ABE|            ELI|            GOV|           null|
|10966|        NAWAZ SHARIF|            ELI|            GOV|           null|

**Business:** businessmen, companies, and enterprises.

In [28]:
actors_bus_df = actors_occurences.filter(actors_occurences.Actor1Type1Code=="BUS")
actors_bus_df.show(100)

+------+-------------------+---------------+---------------+---------------+
| Count|         Actor1Name|Actor1Type1Code|Actor1Type2Code|Actor1Type3Code|
+------+-------------------+---------------+---------------+---------------+
|911523|            COMPANY|            BUS|           null|           null|
|633324|           BUSINESS|            BUS|           null|           null|
|348473|           INDUSTRY|            BUS|           null|           null|
|327637|          COMPANIES|            BUS|           null|           null|
|225715|               BANK|            BUS|           null|           null|
|170326|           INVESTOR|            BUS|           null|           null|
| 99870|        CORPORATION|            BUS|           null|           null|
| 99602|           PRODUCER|            BUS|           null|           null|
| 57782|            AIRLINE|            BUS|           null|           null|
| 54460|           EMPLOYER|            BUS|           null|           null|

**Country**

In [32]:
actors_country_df = actors_occurences.filter("Actor1Type1Code is null")
actors_country_df.show(100)

+--------+--------------------+---------------+---------------+---------------+
|   Count|          Actor1Name|Actor1Type1Code|Actor1Type2Code|Actor1Type3Code|
+--------+--------------------+---------------+---------------+---------------+
|19987569|                null|           null|           null|           null|
| 7358187|       UNITED STATES|           null|           null|           null|
| 1227057|               CHINA|           null|           null|           null|
| 1000850|              RUSSIA|           null|           null|           null|
|  951719|      UNITED KINGDOM|           null|           null|           null|
|  803605|                IRAN|           null|           null|           null|
|  762096|              CANADA|           null|           null|           null|
|  685496|           AUSTRALIA|           null|           null|           null|
|  625063|              AFRICA|           null|           null|           null|
|  579017|               SYRIA|         

After looking at the data, we think that ranking actors by the actor fields of the event database is probably not the best approach. For the business category, many of the results are irrelevant (e.g. "company", "bank", "investor"); we would rather have the names of actual companies. The elite category has the same problem: "actor", "retired", and "United Kingdom" are rather irrelevant, and we would prefer the names of actual persons.
For this, we think it could be better to use the V1LOCATIONS, V1PERSONS, and V1ORGANIZATIONS fields of the GKG database. This is the option we are going to explore next. Otherwise, we will have to filter the actor names manually.
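
If we fall back on manual filtering, one possible approach is a blocklist of generic role words. The sketch below is plain Python on a few rows copied from the tables above. The set `GENERIC_NAMES` and the helper `drop_generic` are hypothetical, and the blocklist only contains the generic terms already mentioned in this notebook, so a real one would need to be much longer.

```python
# Hypothetical blocklist, built from the generic terms seen in the outputs above.
GENERIC_NAMES = {
    "COMPANY", "BUSINESS", "INDUSTRY", "COMPANIES",
    "BANK", "INVESTOR", "ACTOR", "RETIRED",
}


def drop_generic(rows, blocklist=GENERIC_NAMES):
    """Keep only (name, count) rows whose name is set and not in the blocklist."""
    return [
        (name, count)
        for name, count in rows
        if name is not None and name.upper() not in blocklist
    ]


# (Actor1Name, Count) rows copied from the elite table.
elite_rows = [
    ("ACTOR", 57272),
    ("RETIRED", 45386),
    ("BILL CLINTON", 17428),
    ("SHINZO ABE", 12241),
]

filtered_elite = drop_generic(elite_rows)
print(filtered_elite)
```

On the Spark DataFrame, a similar filter could probably be written with `~F.col("Actor1Name").isin(...)`, though that variant has not been run here.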