***

**GDELT MongoDB**

***

Note: the syntax carries over easily to Spark DataFrames.

We will create 2 MongoDB collections:

- `gdelt_2019_mentions`: Events tables with Mentions tables inserted as a field of Events
- `gdelt_2019_events`: GKG tables but with string pseudo-lists transformed to lists

In [528]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymongo as mb
plt.style.use('fivethirtyeight')
from pprint import pprint
from tqdm import tqdm

# 1. GDELT Data Preprocessing with Pandas

**TODO in Spark with Scala**

1. gdelt_events:
    - Checks on variable types
2. gdelt_mentions:
    - Reduce DataFrame to 2 columns *GLOBALEVENTID* (for merge with gdelt_events) and *mentions* (the rest)
    - Checks on variable types
    - Create column for language of article / mention
3. gdelt_gkg:
    - Transforming columns with poorly formatted lists into dictionaries
    - Checks on variable types

In [437]:
gdelt_events = pd.read_csv('test_local_gdelt_events.csv', index_col=0)
gdelt_mentions = pd.read_csv('test_local_gdelt_mentions.csv', index_col=0)
gdelt_gkg = pd.read_csv('test_local_gdelt_gkg.csv', index_col=0)



In [438]:
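# Drop the language column if a previous run already added it;
# errors='ignore' makes this a no-op when the column is absent.
gdelt_mentions = gdelt_mentions.drop(columns=['MentionDocOriginalLanguage'], errors='ignore')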

## 1.1. GDELT Mentions

**Get Article Language**

In [439]:
def get_article_mention_language(translateInfo):
    if translateInfo=='':
        language = 'eng'
    else:
        language = translateInfo.split(';', 1)[0][-3:]
    return language

gdelt_mentions_processed = gdelt_mentions.copy()
gdelt_mentions_processed = gdelt_mentions_processed.fillna('')

gdelt_mentions_processed['MentionDocOriginalLanguage'] = (
    gdelt_mentions_processed['MentionDocTranslationInfo'].apply(get_article_mention_language)
)
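As a quick sanity check of the rule above, the sketch below applies the same logic to two hand-written values. The sample string follows the `srclc:<lang>;eng:<engine>` layout described in the GDELT codebook for `MentionDocTranslationInfo`; the engine text and the function name `get_language` are illustrative and not taken from the dataset.

```python
def get_language(translation_info: str) -> str:
    """Return the 3-letter source language code, defaulting to English."""
    if translation_info == '':
        # The field is blank for documents originally in English.
        return 'eng'
    # The first ';'-separated field looks like 'srclc:fra'; keep its last 3 characters.
    return translation_info.split(';', 1)[0][-3:]


print(get_language(''))
print(get_language('srclc:fra;eng:Moses 2.1.1'))
```

Slicing the last three characters relies on the language code always being three letters long; splitting on `':'` would be a more defensive variant if that assumption ever fails.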

In [440]:
gdelt_mentions_processed['MentionDocOriginalLanguage'].value_counts(True).head(20)

eng    0.776453
spa    0.083250
ara    0.045934
fra    0.023721
rus    0.013271
por    0.012400
deu    0.007500
ita    0.004068
ron    0.003158
ell    0.002613
tur    0.002483
dan    0.002054
ukr    0.002054
swe    0.001807
srp    0.001742
urd    0.001495
ind    0.001417
pol    0.001378
slv    0.001118
ces    0.001027
Name: MentionDocOriginalLanguage, dtype: float64

In [441]:
# %%time
# gdelt_mentions_processed_bis = gdelt_mentions_processed[['GLOBALEVENTID']]
# gdelt_mentions_processed_bis['mentions'] = gdelt_mentions_processed.drop('GLOBALEVENTID', axis=1).to_dict(orient='records')
# gdelt_mentions_processed_bis = gdelt_mentions_processed_bis.groupby('GLOBALEVENTID')['mentions'].apply(list)
# gdelt_mentions_processed_bis = gdelt_mentions_processed_bis.reset_index()
# gdelt_mentions_processed_bis.head()

In [442]:
print(gdelt_mentions.shape)
print(gdelt_mentions_processed.shape)

(76937, 16)
(76937, 17)


In [443]:
gdelt_mentions_processed.GLOBALEVENTID.unique().shape[0] / gdelt_mentions_processed.shape[0]

0.55388174740372

In [444]:
gdelt_mentions_processed.head()

Unnamed: 0,GLOBALEVENTID,EventTimeDate,MentionTimeDate,MentionType,MentionSourceName,MentionIdentifier,SentenceID,Actor1CharOffset,Actor2CharOffset,ActionCharOffset,InRawText,Confidence,MentionDocLen,MentionDocTone,MentionDocTranslationInfo,Extras,MentionDocOriginalLanguage
0,410412347,20150218230000,20150218230000,1,dailymaverick.co.za,http://www.dailymaverick.co.za/article/2015-02...,19,-1,4594,4634,1,50,6665,-4.477612,,,eng
1,410412348,20150218230000,20150218230000,1,indiatimes.com,http://timesofindia.indiatimes.com/city/bengal...,2,-1,300,344,1,50,2541,2.078522,,,eng
2,410412349,20150218230000,20150218230000,1,voxy.co.nz,http://www.voxy.co.nz/entertainment/coast-new-...,4,-1,1297,1232,0,10,2576,7.517084,,,eng
3,410412350,20150218230000,20150218230000,1,voxy.co.nz,http://www.voxy.co.nz/entertainment/coast-new-...,4,-1,1298,1233,1,20,2576,7.517084,,,eng
4,410412351,20150218230000,20150218230000,1,eastidahonews.com,http://www.eastidahonews.com/2015/02/neil-patr...,1,-1,103,122,1,100,1432,0.0,,,eng


## 1.2. GDELT Events

In [446]:
gdelt_events_preprocessed = gdelt_events.copy()
gdelt_events_preprocessed['ActionGeo_CountryCode'] = gdelt_events_preprocessed['ActionGeo_CountryCode'].fillna('UNK')
gdelt_events_preprocessed = gdelt_events_preprocessed.fillna('')
# gdelt_events_preprocessed['mentions'] = np.empty((len(gdelt_events_preprocessed), 0)).tolist()
gdelt_events_preprocessed.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,410412347,20140218,201402,2014,2014.1315,,,,,,...,4,"Waterkloof, Free State, South Africa",SF,SF03,77359.0,-30.3098,25.2971,-1299321,20150218230000,http://www.dailymaverick.co.za/article/2015-02...
1,410412348,20140218,201402,2014,2014.1315,,,,,,...,4,"Bengaluru, Karnataka, India",IN,IN19,70159.0,12.9833,77.5833,-2090174,20150218230000,http://timesofindia.indiatimes.com/city/bengal...
2,410412349,20140218,201402,2014,2014.1315,,,,,,...,4,"Great Southern, Victoria, Australia",AS,AS07,5387.0,-36.0667,146.483,-1576477,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
3,410412350,20140218,201402,2014,2014.1315,,,,,,...,1,New Zealand,NZ,NZ,,-41.0,174.0,NZ,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
4,410412351,20140218,201402,2014,2014.1315,,,,,,...,2,"Idaho, United States",US,USID,,44.2394,-114.51,ID,20150218230000,http://www.eastidahonews.com/2015/02/neil-patr...


## 1.3. GDELT GKG

Columns to convert from string to list (delimiter in parentheses):

- AllNames (;)
- Amounts (;)
- Counts (;)
- GCAM (,)
- **Locations** (;)
- Organizations (;)
- **Persons** (;)
- Quotations (;)
- RelatedImages (;)
- **Themes** (;)
- V2Counts (;)
- V2Locations (;)
- V2Organizations (;)
- V2Persons (;)
- V2Themes (;)
- **V2Tone** (,)


In particular:

- **V2TONE**: (comma-delimited floating point numbers) This field contains a comma-delimited
list of six core emotional dimensions, described in more detail below. Each is recorded as a
single precision floating point number. This field is nearly identical in format and population as
the corresponding field in the GKG 1.0 format with the sole exception of adding the single new
WordCount variable at the end.

    1. **Tone**. (floating point number) This is the average “tone” of the document as a whole.
The score ranges from -100 (extremely negative) to +100 (extremely positive). Common
values range between -10 and +10, with 0 indicating neutral. This is calculated as
Positive Score minus Negative Score.
    2. **Positive Score**. (floating point number) This is the percentage of all words in the article
that were found to have a positive emotional connotation. Ranges from 0 to +100.
    3. **Negative Score**
    4. **Polarity**. (floating point number) This is the percentage of words that had matches in
the tonal dictionary as an indicator of how emotionally polarized or charged the text is.
If Polarity is high, but Tone is neutral, this suggests the text was highly emotionally
charged, but had roughly equivalent numbers of positively and negatively charged
emotional words.
    5. **Activity Reference Density**. (floating point number) This is the percentage of words
that were active words offering a very basic proxy of the overall “activeness” of the text
compared with a clinically descriptive text.
    6. **Self/Group Reference Density**. (floating point number) This is the percentage of all
words in the article that are pronouns, capturing a combination of self-references and
group-based discourse. News media material tends to have very low densities of such
language, but this can be used to distinguish certain classes of news media and certain
contexts.

- **Locations**: In cases where it is necessary to collapse by feature, the Geo_FeatureID column should be used, rather than the Geo_Fullname column. This is because the Geo_Fullname column captures the name of the location as expressed in the text and thus reflects differences in transliteration, alternative spellings, and alternative names for the same location. For example, Mecca is often spelled Makkah, while Jeddah is commonly spelled Jiddah or Jaddah. The Geo_Fullname column will reflect each of these different spellings, while the Geo_FeatureID column will resolve them all to the same unique GNS or GNIS feature identification number.

    1. **Location Type**
    2. **Location Fullname**
    3. **Location Country Code**
    4. **Location ADM1Code**
    5. **Location Latitude**
    6. **Location Longitude**
    7. **Location FeatureID** (text OR signed integer). This is the numeric GNS or GNIS FeatureID for this location, OR a textual country or ADM1 code.
    
Example:
```python
['4',
 'Budapest, Budapest, Hungary',
 'RS',
 'HU05',
 '47.5',
 '19.0833',
 '-850553']
```
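In the raw `Locations` field these seven values are joined with `#`, which is what `preprocess_gdelt_gkg_location` further down splits on. A minimal sketch of that mapping is shown here; the names `LOCATION_FIELDS` and `parse_location` are hypothetical, and casting the coordinates to `float` is an addition that the notebook itself does not perform.

```python
LOCATION_FIELDS = ['type', 'full_name', 'country_code', 'adm1_code',
                   'latitude', 'longitude', 'featureID']


def parse_location(block: str) -> dict:
    """Split one '#'-delimited location block into a dict keyed by field name."""
    location = dict(zip(LOCATION_FIELDS, block.split('#')))
    # Cast coordinates to float when present; every other field stays as text.
    for key in ('latitude', 'longitude'):
        if location.get(key):
            location[key] = float(location[key])
    return location


example = '4#Budapest, Budapest, Hungary#RS#HU05#47.5#19.0833#-850553'
parsed = parse_location(example)
print(parsed['full_name'], parsed['latitude'], parsed['longitude'])
```

Keeping `featureID` as text matches the codebook note that it can be either a signed integer or a textual code.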

In [447]:
test = 'TERROR;REBELS;TAX_ETHNICITY;TAX_ETHNICITY_UKRAINE'
test = test.split(';')
test

['TERROR', 'REBELS', 'TAX_ETHNICITY', 'TAX_ETHNICITY_UKRAINE']

In [448]:
gdelt_gkg_preprocessed = gdelt_gkg.fillna('')

In [449]:
gdelt_gkg_preprocessed.columns

Index(['GKGRECORDID', 'DATE', 'SourceCollectionIdentifier', 'SourceCommonName',
       'DocumentIdentifier', 'Counts', 'V2Counts', 'Themes', 'V2Themes',
       'Locations', 'V2Locations', 'Persons', 'V2Persons', 'Organizations',
       'V2Organizations', 'V2Tone', 'Dates', 'GCAM', 'SharingImage',
       'RelatedImages', 'SocialImageEmbeds', 'SocialVideoEmbeds', 'Quotations',
       'AllNames', 'Amounts', 'TranslationInfo', 'Extras'],
      dtype='object')

In [450]:
%%time
split_with_semicolon = ['AllNames', 'Amounts', 'Counts', 'Locations',
                        'Organizations', 'Persons', 'Quotations',
                        'RelatedImages', 'Themes', 'V2Counts',
                        'V2Locations', 'V2Organizations', 'V2Persons',
                        'V2Themes']

split_with_comma = ['GCAM', 'V2Tone']

gdelt_gkg_preprocessed[split_with_semicolon] = gdelt_gkg_preprocessed[split_with_semicolon].applymap(lambda x: x.split(';'))
gdelt_gkg_preprocessed[split_with_comma] = gdelt_gkg_preprocessed[split_with_comma].applymap(lambda x: x.split(','))

Wall time: 6.89 s
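One detail of this step is easy to miss: `''.split(';')` returns `['']`, not `[]`, so cells that were empty after `fillna('')` become one-element lists, and a trailing separator leaves an empty string at the end. The plain-Python helper below (the name `split_field` is hypothetical) shows one way to drop those empty pieces. Separately, pandas 2.1 renamed `DataFrame.applymap` to `DataFrame.map`, so the calls above may emit a deprecation warning on recent versions.

```python
def split_field(value: str, sep: str) -> list:
    """Split a delimited string, dropping pieces left empty by blanks or trailing separators."""
    return [piece for piece in value.split(sep) if piece != '']


print(''.split(';'))                        # an empty string still yields one element
print(split_field('', ';'))                 # the helper returns an empty list instead
print(split_field('TERROR;REBELS;', ';'))   # the trailing separator is ignored
```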


In [451]:
# def preprocess_gdelt_gkg_rows(row):
#     """ AAA """
#     # Split String Lists
#     split_with_semicolon = ['AllNames', 'Amounts', 'Counts', 'Locations',
#                             'Organizations', 'Persons', 'Quotations',
#                             'RelatedImages', 'Themes', 'V2Counts',
#                             'V2Locations', 'V2Organizations', 'V2Persons',
#                             'V2Themes']
    
#     split_with_comma = ['GCAM', 'V2Tone']
    
#     # General Splits
#     for key_name in split_with_semicolon:
#         # Transform string into list
#         row[key_name] = row[key_name].split(';')
        
#         # Remove elements with empty strings
#         row[key_name] = [el for el in row[key_name] if el != '']
    
#     for key_name in split_with_comma:
#         # Transform string into list
#         row[key_name] = row[key_name].split(',')
        
#         # Remove elements with empty strings
#         row[key_name] = [el for el in row[key_name] if el != '']
    
#     return row

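def preprocess_gdelt_gkg_tone(x):
    """Map the first six comma-split V2Tone values to named float fields."""
    names = ['tone', 'positive_score', 'negative_score', 'polarity',
             'activity_reference_density', 'group_reference_density']
    # The codebook describes every V2Tone component as a floating point
    # number, so all six are cast (previously only 'tone' was).
    return {name: float(value) for name, value in zip(names, x)}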

def preprocess_gdelt_gkg_location(x):
    res_list = []
    # print(x)
    if len(x) == 0:
        return []
        
    for location in x:
        if location != '':
            # print(location)
            x_split = location.split('#')
            # print(location, x_split)
            res = dict()
            res['type'] = x_split[0]
            res['full_name'] = x_split[1]
            res['country_code'] = x_split[2]
            res['adm1_code'] = x_split[3]
            res['latitude'] = x_split[4]
            res['longitude'] = x_split[5]
            res['featureID'] = x_split[6]
            res_list.append(res)
        
    return res_list

def preprocess_gdelt_themes(x):
    res = list(filter(lambda el: el != '', x))
    return res

In [452]:
%%time
gdelt_gkg_preprocessed['V2Tone'] = gdelt_gkg_preprocessed['V2Tone'].apply(preprocess_gdelt_gkg_tone)
gdelt_gkg_preprocessed['Locations'] = gdelt_gkg_preprocessed['Locations'].apply(preprocess_gdelt_gkg_location)

Wall time: 242 ms


In [453]:
gdelt_gkg_preprocessed['Themes'] = gdelt_gkg_preprocessed['Themes'].apply(preprocess_gdelt_themes)

In [454]:
pprint(gdelt_gkg_preprocessed[['GKGRECORDID', 'Locations', 'AllNames', 'Themes', 'V2Tone']].transpose().iloc[:,0].to_dict())

{'AllNames': ['Channel One,628',
              'Channel One,755',
              'Channel One,1449',
              'Channel One,1672',
              'Gen Aleksandr Lentsov,1994',
              'Eduard Basuryn,2123',
              'Ukrainian President Petro Poroshenko,2410',
              'Sergey Mikheyev,2552',
              'Hungarian Prime Minister Viktor Orban,2877',
              'Deputy Prime Minister Arkadiy Dvorkovich,3603',
              'Deputy Prime Minister Dvorkovich,3985',
              'Channel One,4835'],
 'GKGRECORDID': '20150218230000-0',
 'Locations': [{'adm1_code': 'HU05',
                'country_code': 'RS',
                'featureID': '-850553',
                'full_name': 'Budapest, Budapest, Hungary',
                'latitude': '47.5',
                'longitude': '19.0833',
                'type': '4'},
               {'adm1_code': 'RS25',
                'country_code': 'RS',
                'featureID': '139814',
                'full_name': "Ogarevo, Kaluz

# 2. Analyzing GDELT with MongoDB

## 2.1. Bulk Insert Data into MongoDB

**(a) MongoDB Client Server**

In [529]:
LOCAL_HOST_SERVER = 'localhost:27017'
mongo_client = mb.MongoClient(LOCAL_HOST_SERVER)

In [530]:
msbd_database = mongo_client['MSBD_2019_2020']

In [531]:
mb_gdelt_main = msbd_database['mb_gdelt_main']
mb_gdelt_gkg = msbd_database['gdelt_gkg']

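# Collection.drop() does not raise when the collection is missing,
# so only genuine driver/server errors are caught here.
try:
    mb_gdelt_main.drop()
    print('Collection dropped.')
except mb.errors.PyMongoError:
    print('Could not drop collection.')

try:
    mb_gdelt_gkg.drop()
    print('Collection dropped.')
except mb.errors.PyMongoError:
    print('Could not drop collection.')

# mb_gdelt_mentions is an alias for the same collection as mb_gdelt_main.
mb_gdelt_mentions = msbd_database['mb_gdelt_main']
mb_gdelt_gkg = msbd_database['gdelt_gkg']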

Collection dropped.
Collection dropped.


In [532]:
print(msbd_database.list_collection_names())

['gdelt_mentions_events', 'enron_emails']


**(b) Insert GDELT Events into MongoDB**

In [533]:
%%time
pandas_to_mongo_events_data = gdelt_events_preprocessed.copy()
# pandas_to_mongo_events_data = pandas_to_mongo_events_data.rename(columns={'GLOBALEVENTID': '_id'})
pandas_to_mongo_events_data['_id'] = pandas_to_mongo_events_data['GLOBALEVENTID'].copy()
pandas_to_mongo_events_data = pandas_to_mongo_events_data.to_dict('records')

Wall time: 2.25 s


In [534]:
%%time
_ = mb_gdelt_main.insert_many(pandas_to_mongo_events_data)

Wall time: 2.33 s


In [535]:
print(mb_gdelt_main.count_documents(filter={}))

29319


**(c) Insert GDELT GKG into MongoDB**

`GKGRECORDID` cannot be used as `_id` here because the data contains duplicate ids.
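MongoDB requires `_id` to be unique within a collection, so any repeated `GKGRECORDID` would make the insert fail with a duplicate-key error. A small check such as the sketch below can list the offending ids before choosing a key. It works on a list of dicts like the one produced by `to_dict('records')`; the helper name `duplicated_ids` and the sample records are made up for illustration.

```python
from collections import Counter


def duplicated_ids(records, key='GKGRECORDID'):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Hand-written sample, not taken from the dataset.
sample = [{'GKGRECORDID': '20150218230000-0'},
          {'GKGRECORDID': '20150218230000-1'},
          {'GKGRECORDID': '20150218230000-0'}]
print(duplicated_ids(sample))
```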

In [536]:
%%time
pandas_to_mongo_data_gkg = gdelt_gkg_preprocessed.copy()
# pandas_to_mongo_data_gkg = pandas_to_mongo_data_gkg.rename(columns={'GKGRECORDID': '_id'})
pandas_to_mongo_data_gkg = pandas_to_mongo_data_gkg.to_dict('records')

Wall time: 1.52 s


In [537]:
%%time
_ = mb_gdelt_gkg.insert_many(pandas_to_mongo_data_gkg)

Wall time: 41.9 s


In [538]:
# for document in pandas_to_mongo_data_gkg:
#     _ = mb_gdelt_gkg.insert_one(document)

In [539]:
# pprint(pandas_to_mongo_data_gkg[0])

In [540]:
print(mb_gdelt_gkg.count_documents(filter={}))

36478


In [541]:
# gdelt_events_preprocessed[gdelt_events_preprocessed['GLOBALEVENTID']==410412359]['mentions'].tolist()

In [542]:
# pprint(mb_gdelt_mentions.find_one({}))

In [543]:
# pprint(mb_gdelt_gkg.find_one({})['V2Tone'])

**(d) Update GDELT Events with GDELT Mentions**

In [544]:
pprint(mb_gdelt_main.find_one(filter={'_id': 410412374}))

{'ActionGeo_ADM1Code': 'IZ',
 'ActionGeo_ADM2Code': '',
 'ActionGeo_CountryCode': 'IZ',
 'ActionGeo_FeatureID': 'IZ',
 'ActionGeo_FullName': 'Iraq',
 'ActionGeo_Lat': 33.0,
 'ActionGeo_Long': 44.0,
 'ActionGeo_Type': 1,
 'Actor1Code': 'IRQ',
 'Actor1CountryCode': 'IRQ',
 'Actor1EthnicCode': '',
 'Actor1Geo_ADM1Code': 'IZ',
 'Actor1Geo_ADM2Code': '',
 'Actor1Geo_CountryCode': 'IZ',
 'Actor1Geo_FeatureID': 'IZ',
 'Actor1Geo_FullName': 'Iraq',
 'Actor1Geo_Lat': 33.0,
 'Actor1Geo_Long': 44.0,
 'Actor1Geo_Type': 1,
 'Actor1KnownGroupCode': '',
 'Actor1Name': 'IRAQ',
 'Actor1Religion1Code': '',
 'Actor1Religion2Code': '',
 'Actor1Type1Code': '',
 'Actor1Type2Code': '',
 'Actor1Type3Code': '',
 'Actor2Code': 'USA',
 'Actor2CountryCode': 'USA',
 'Actor2EthnicCode': '',
 'Actor2Geo_ADM1Code': 'USIL',
 'Actor2Geo_ADM2Code': 'IL031',
 'Actor2Geo_CountryCode': 'US',
 'Actor2Geo_FeatureID': '423587',
 'Actor2Geo_FullName': 'Chicago, Illinois, United States',
 'Actor2Geo_Lat': 41.85,
 'Actor2Geo_Lon

In [545]:
gdelt_mentions_processed[gdelt_mentions_processed.GLOBALEVENTID==410412374]

Unnamed: 0,GLOBALEVENTID,EventTimeDate,MentionTimeDate,MentionType,MentionSourceName,MentionIdentifier,SentenceID,Actor1CharOffset,Actor2CharOffset,ActionCharOffset,InRawText,Confidence,MentionDocLen,MentionDocTone,MentionDocTranslationInfo,Extras,MentionDocOriginalLanguage
60,410412374,20150218230000,20150218230000,1,ap.org,http://hosted2.ap.org/APDEFAULT/89ae8247abe849...,12,2833,2800,2821,0,10,5143,-1.052632,,,eng
61,410412374,20150218230000,20150218230000,1,stamfordadvocate.com,http://www.stamfordadvocate.com/news/politics/...,13,2841,2808,2829,0,10,5037,-1.180638,,,eng
62,410412374,20150218230000,20150218230000,1,sacbee.com,http://www.sacbee.com/news/politics-government...,13,2817,2784,2805,0,10,5670,-1.15425,,,eng
63,410412374,20150218230000,20150218230000,1,ap.org,http://hosted2.ap.org/APDEFAULT/3d281c11a96b4a...,12,2833,2800,2821,0,10,5143,-1.052632,,,eng
8508,410412374,20150218230000,20150218231500,1,newser.com,http://www.newser.com/article/73efbb63762c409f...,13,2638,2605,2626,0,10,4789,-1.35468,,,eng
8509,410412374,20150218230000,20150218231500,1,bostonherald.com,http://www.bostonherald.com/news_opinion/natio...,13,2838,2805,2826,0,10,5030,-1.182033,,,eng
17376,410412374,20150218230000,20150218233000,1,roanoke.com,http://www.roanoke.com/news/politics/wire/jeb-...,15,3819,3786,3807,0,10,6078,-1.188119,,,eng
17377,410412374,20150218230000,20150218233000,1,billingsgazette.com,http://billingsgazette.com/news/national/elect...,13,2841,2808,2829,0,10,5037,-1.180638,,,eng
17378,410412374,20150218230000,20150218233000,1,kdhnews.com,http://kdhnews.com/news/politics/jeb-bush-us-m...,15,3819,3786,3807,0,10,6080,-1.188119,,,eng
17379,410412374,20150218230000,20150218233000,1,tucson.com,http://tucson.com/news/national/govt-and-polit...,13,2841,2808,2829,0,10,5037,-1.180638,,,eng


In [546]:
%%time
batch = gdelt_mentions_processed[gdelt_mentions_processed.GLOBALEVENTID==410412374].copy()
batch_1 = batch.iloc[:6,:].copy()
batch_2 = batch.iloc[6:,:].copy()

pandas_to_mongo_batch_1 = batch_1[['GLOBALEVENTID']].copy()
pandas_to_mongo_batch_1['MENTIONS'] = batch_1.drop('GLOBALEVENTID', axis=1).to_dict('records')
pandas_to_mongo_batch_1 = pandas_to_mongo_batch_1.groupby('GLOBALEVENTID')['MENTIONS'].apply(list)
pandas_to_mongo_batch_1 = pandas_to_mongo_batch_1.reset_index()
# pandas_to_mongo_batch_1_records = pandas_to_mongo_batch_1.to_dict('records')

pandas_to_mongo_batch_2 = batch_2[['GLOBALEVENTID']].copy()
pandas_to_mongo_batch_2['MENTIONS'] = batch_2.drop('GLOBALEVENTID', axis=1).to_dict('records')
pandas_to_mongo_batch_2 = pandas_to_mongo_batch_2.groupby('GLOBALEVENTID')['MENTIONS'].apply(list)
pandas_to_mongo_batch_2 = pandas_to_mongo_batch_2.reset_index()
# pandas_to_mongo_batch_2_records = pandas_to_mongo_batch_2.to_dict('records')

Wall time: 23.9 ms


In [547]:
pandas_to_mongo_batch_1

Unnamed: 0,GLOBALEVENTID,MENTIONS
0,410412374,"[{'EventTimeDate': 20150218230000, 'MentionTim..."


In [548]:
pandas_to_mongo_batch_2

Unnamed: 0,GLOBALEVENTID,MENTIONS
0,410412374,"[{'EventTimeDate': 20150218230000, 'MentionTim..."


In [549]:
# pprint(pandas_to_mongo_batch_1_records[0])

In [550]:
# pprint(pandas_to_mongo_batch_2_records[0])

In [551]:
%%time
for _, row in pandas_to_mongo_batch_1.iterrows():
    _ = mb_gdelt_main.update_many({'GLOBALEVENTID': row['GLOBALEVENTID']},
                                 {'$push': {'MENTIONS': {'$each': row['MENTIONS']}}},
                                 upsert=True)
    break

Wall time: 56.8 ms


In [552]:
for _, row in pandas_to_mongo_batch_2.iterrows():
    _ = mb_gdelt_main.update_many({'GLOBALEVENTID': row['GLOBALEVENTID']},
                                 {'$push': {'MENTIONS': {'$each': row['MENTIONS']}}},
                                 upsert=True)
    break

In [553]:
pprint(mb_gdelt_main.find_one(filter={'_id': 410412374}))

{'ActionGeo_ADM1Code': 'IZ',
 'ActionGeo_ADM2Code': '',
 'ActionGeo_CountryCode': 'IZ',
 'ActionGeo_FeatureID': 'IZ',
 'ActionGeo_FullName': 'Iraq',
 'ActionGeo_Lat': 33.0,
 'ActionGeo_Long': 44.0,
 'ActionGeo_Type': 1,
 'Actor1Code': 'IRQ',
 'Actor1CountryCode': 'IRQ',
 'Actor1EthnicCode': '',
 'Actor1Geo_ADM1Code': 'IZ',
 'Actor1Geo_ADM2Code': '',
 'Actor1Geo_CountryCode': 'IZ',
 'Actor1Geo_FeatureID': 'IZ',
 'Actor1Geo_FullName': 'Iraq',
 'Actor1Geo_Lat': 33.0,
 'Actor1Geo_Long': 44.0,
 'Actor1Geo_Type': 1,
 'Actor1KnownGroupCode': '',
 'Actor1Name': 'IRAQ',
 'Actor1Religion1Code': '',
 'Actor1Religion2Code': '',
 'Actor1Type1Code': '',
 'Actor1Type2Code': '',
 'Actor1Type3Code': '',
 'Actor2Code': 'USA',
 'Actor2CountryCode': 'USA',
 'Actor2EthnicCode': '',
 'Actor2Geo_ADM1Code': 'USIL',
 'Actor2Geo_ADM2Code': 'IL031',
 'Actor2Geo_CountryCode': 'US',
 'Actor2Geo_FeatureID': '423587',
 'Actor2Geo_FullName': 'Chicago, Illinois, United States',
 'Actor2Geo_Lat': 41.85,
 'Actor2Geo_Lon

In [554]:
print(len(mb_gdelt_main.find_one(filter={'_id': 410412374})['MENTIONS']), batch.shape[0])

18 18


For the entire GDELT Mentions:

In [555]:
%%time
pandas_to_mongo_mentions = gdelt_mentions_processed[['GLOBALEVENTID']].copy()
pandas_to_mongo_mentions['MENTIONS'] = gdelt_mentions_processed.drop('GLOBALEVENTID', axis=1).to_dict('records')
pandas_to_mongo_mentions = pandas_to_mongo_mentions.groupby('GLOBALEVENTID')['MENTIONS'].apply(list)
pandas_to_mongo_mentions = pandas_to_mongo_mentions.reset_index()

Wall time: 12.3 s


In [556]:
pandas_to_mongo_mentions.head()

Unnamed: 0,GLOBALEVENTID,MENTIONS
0,403685556,"[{'EventTimeDate': 20150119000000, 'MentionTim..."
1,403686013,"[{'EventTimeDate': 20150119000000, 'MentionTim..."
2,403688003,"[{'EventTimeDate': 20150119000000, 'MentionTim..."
3,403688601,"[{'EventTimeDate': 20150119001500, 'MentionTim..."
4,403692727,"[{'EventTimeDate': 20150119003000, 'MentionTim..."


In [558]:
pandas_to_mongo_mentions.shape[0]

42614

In [559]:
%%time
for _, row in tqdm(pandas_to_mongo_mentions.iterrows(), position=0, total=pandas_to_mongo_mentions.shape[0]):
    _ = mb_gdelt_main.update_many({'GLOBALEVENTID': row['GLOBALEVENTID']},
                                 {'$push': {'MENTIONS': {'$each': row['MENTIONS']}}},
                                 upsert=True)

 38%|████████████████████████████▋                                               | 16065/42614 [08:29<14:01, 31.53it/s]

KeyboardInterrupt: 
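The loop above sends one `update_many` round trip per event and filters on `GLOBALEVENTID`, which, unlike `_id`, has no index unless one is created. That is probably why the progress bar shows only about 31 updates per second. A possible alternative, sketched below under those assumptions, targets `_id` (set equal to `GLOBALEVENTID` at insert time) and batches the updates with `bulk_write`. The function names and the batch size are hypothetical, and the sketch drops `upsert=True`, so mentions whose event is missing from the events collection are skipped rather than creating new documents.

```python
def build_update_specs(grouped_mentions):
    """Turn (event_id, mentions) pairs into (filter, update) pairs for MongoDB."""
    return [({'_id': event_id}, {'$push': {'MENTIONS': {'$each': mentions}}})
            for event_id, mentions in grouped_mentions]


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def push_mentions(collection, grouped_mentions, batch_size=1000):
    """Send the updates in unordered batches; needs pymongo and a running server."""
    from pymongo import UpdateOne  # imported here so the helpers above run without pymongo
    for batch in chunked(build_update_specs(grouped_mentions), batch_size):
        collection.bulk_write([UpdateOne(flt, upd) for flt, upd in batch], ordered=False)


# Hand-written example of one grouped event.
specs = build_update_specs([(410412374, [{'MentionSourceName': 'ap.org'}])])
print(specs[0])
```

With the DataFrame built above, a call could look like `push_mentions(mb_gdelt_main, zip(pandas_to_mongo_mentions['GLOBALEVENTID'].tolist(), pandas_to_mongo_mentions['MENTIONS']))`; `.tolist()` converts the ids to plain Python integers, which the BSON encoder accepts.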

## 2.2. GDELT Queries with MongoDB

- [More complex aggregations in MongoDB](https://stackoverflow.com/questions/43448389/mongo-aggregate-sum-of-values-in-a-list-of-dictionaries-for-all-documents)
- [Aggregate across multiple fields simultaneously](https://stackoverflow.com/questions/25843255/mongodb-aggregate-count-on-multiple-fields-simultaneously)

**1. Display the number of articles/events for each triplet (day, event country, article language).**

In [566]:
query_mb_events_triplet = [
    {'$unwind': '$MENTIONS'},
    {'$group': {'_id': {'date':'$SQLDATE',
                        'event_country': '$ActionGeo_CountryCode',
                        'article_language': '$MENTIONS.MentionDocOriginalLanguage'},
                'NbEvents': {'$sum': 1}}
    }
]
pprint(query_mb_events_triplet)

[{'$unwind': '$MENTIONS'},
 {'$group': {'NbEvents': {'$sum': 1},
             '_id': {'article_language': '$MENTIONS.MentionDocOriginalLanguage',
                     'date': '$SQLDATE',
                     'event_country': '$ActionGeo_CountryCode'}}}]


In [567]:
result_events_triplet = list(mb_gdelt_main.aggregate(query_mb_events_triplet))

In [568]:
len(result_events_triplet)

354

In [575]:
sorted(result_events_triplet, key=lambda x: x['NbEvents'], reverse=True)[:20]

[{'NbEvents': 23539, '_id': {'article_language': 'eng'}},
 {'NbEvents': 3535, '_id': {'article_language': 'spa'}},
 {'NbEvents': 1709,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'LY'}},
 {'NbEvents': 1276, '_id': {'article_language': 'ara'}},
 {'NbEvents': 1087,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'US'}},
 {'NbEvents': 1019, '_id': {'article_language': 'fra'}},
 {'NbEvents': 487, '_id': {'article_language': 'rus'}},
 {'NbEvents': 420, '_id': {'article_language': 'por'}},
 {'NbEvents': 400,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'YM'}},
 {'NbEvents': 361,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'AG'}},
 {'NbEvents': 355, '_id': {'article_language': 'deu'}},
 {'NbEvents': 278,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'EG'}},
 {'NbEvents': 254,
  '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'IZ'}},
 {'NbEv

For now it is expected that the query does not work 100%, because some results look like:

    {'NbEvents': 23539, '_id': {'article_language': 'eng'}}
    
instead of

    {'NbEvents': 1087, '_id': {'article_language': 'eng', 'date': 20150218, 'event_country': 'US'}}
    
This is because I have not yet updated all of the tables. Spark will be needed.
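A likely mechanism for the partial keys such as `{'article_language': 'eng'}`: the updates use `upsert=True`, so a mention whose event is absent from the events collection creates a new document holding only `GLOBALEVENTID` and `MENTIONS`, and `$group` then leaves the missing `SQLDATE` and `ActionGeo_CountryCode` out of the key. Assuming that is the cause, one way to restrict the count to complete events is to add a `$match` stage before `$unwind`, as sketched below (the variable name is hypothetical).

```python
from pprint import pprint

query_events_triplet_complete = [
    # Keep only documents that carry the event-level fields.
    {'$match': {'SQLDATE': {'$exists': True},
                'ActionGeo_CountryCode': {'$exists': True}}},
    {'$unwind': '$MENTIONS'},
    {'$group': {'_id': {'date': '$SQLDATE',
                        'event_country': '$ActionGeo_CountryCode',
                        'article_language': '$MENTIONS.MentionDocOriginalLanguage'},
                'NbEvents': {'$sum': 1}}},
    # Sort on the server instead of in Python.
    {'$sort': {'NbEvents': -1}},
]
pprint(query_events_triplet_complete)
```

It would be run the same way as the original query, for example `list(mb_gdelt_main.aggregate(query_events_triplet_complete))`.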

**2. For a country given as a parameter, display the events that took place there, sorted by number of mentions (descending); allow aggregation by day/month/year.**
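A possible starting point for this query, not a tested solution: the builder below returns an aggregation pipeline for the events collection. It assumes mentions are embedded under `MENTIONS` as in section 2.1 and uses the existing `SQLDATE`, `MonthYear` and `Year` columns as day/month/year keys. The function name and the `NbMentions` field are hypothetical.

```python
PERIOD_FIELDS = ('SQLDATE', 'MonthYear', 'Year')  # day, month, year columns of the events table


def events_by_country_pipeline(country_code, period_field=None):
    """Events in one country ranked by mention count, optionally rolled up by period."""
    if period_field is not None and period_field not in PERIOD_FIELDS:
        raise ValueError('period_field must be one of %s' % (PERIOD_FIELDS,))
    pipeline = [
        {'$match': {'ActionGeo_CountryCode': country_code}},
        {'$project': {'SQLDATE': 1, 'MonthYear': 1, 'Year': 1, 'SOURCEURL': 1,
                      # Events without a MENTIONS array count as zero mentions.
                      'NbMentions': {'$size': {'$ifNull': ['$MENTIONS', []]}}}},
    ]
    if period_field is not None:
        pipeline.append({'$group': {'_id': '$' + period_field,
                                    'NbEvents': {'$sum': 1},
                                    'NbMentions': {'$sum': '$NbMentions'}}})
    pipeline.append({'$sort': {'NbMentions': -1}})
    return pipeline


per_event = events_by_country_pipeline('US')
per_month = events_by_country_pipeline('US', 'MonthYear')
print(len(per_event), len(per_month))
```

Either pipeline could then be passed to `mb_gdelt_main.aggregate(...)`.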

**3. For a data source passed as a parameter (gkg.SourceCommonName), display the themes, persons and places that this source's articles discuss, along with the number of articles and the average tone of the articles (for each theme/person/place); allow aggregation by day/month/year.**
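One way this could be approached on the GKG collection is sketched below, assuming the preprocessed layout from section 1.3 (list-valued `Themes`, `Persons` and `Locations`, and a `V2Tone.tone` float). The builder name and the output fields are hypothetical, and the day/month/year roll-up is left out.

```python
def source_entities_pipeline(source_name, list_field='Themes', key_path=None):
    """Count articles and average tone per theme/person/place for one source."""
    # After $unwind each document holds a single element of the list;
    # key_path lets the caller group on a sub-field of that element.
    group_key = '$' + (key_path or list_field)
    return [
        {'$match': {'SourceCommonName': source_name}},
        {'$unwind': '$' + list_field},
        {'$group': {'_id': group_key,
                    'NbArticles': {'$sum': 1},
                    'AvgTone': {'$avg': '$V2Tone.tone'}}},
        {'$sort': {'NbArticles': -1}},
    ]


themes = source_entities_pipeline('voxy.co.nz')
places = source_entities_pipeline('voxy.co.nz', 'Locations', 'Locations.full_name')
print(themes[2]['$group']['_id'], places[2]['$group']['_id'])
```

Note that an article listing the same theme twice is counted twice by this sketch, so `NbArticles` is an upper bound unless the lists are deduplicated first.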

**4. Map the relationships between countries based on article tone: for each pair (country1, country2), compute the number of articles and the average tone (aggregations by year/month/day, filtering by country or by coordinate bounding box).**
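A rough sketch for this last query, under several assumptions: the pair is taken from `Actor1CountryCode` and `Actor2CountryCode` (another reading could use the geographic columns), each embedded mention counts as one article, and its tone is `MentionDocTone`. The builder name is hypothetical, and the filter by country or coordinate box is not covered.

```python
def country_pair_pipeline(period_field='Year'):
    """Article count and mean tone per (country1, country2, period)."""
    has_value = {'$nin': ['', None]}  # skip blank codes left by fillna('') and missing fields
    return [
        {'$match': {'Actor1CountryCode': has_value,
                    'Actor2CountryCode': has_value}},
        {'$unwind': '$MENTIONS'},
        {'$group': {'_id': {'country1': '$Actor1CountryCode',
                            'country2': '$Actor2CountryCode',
                            'period': '$' + period_field},
                    'NbArticles': {'$sum': 1},
                    'AvgTone': {'$avg': '$MENTIONS.MentionDocTone'}}},
        {'$sort': {'NbArticles': -1}},
    ]


pair_pipeline = country_pair_pipeline('MonthYear')
print(pair_pipeline[2]['$group']['_id'])
```

As with the other sketches, the result would be passed to `mb_gdelt_main.aggregate(...)` once the mentions have been embedded for all events.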