# This notebook documents how I created multiple datasets

### Query used to pull all events from the South Pacific from BigQuery

In [4]:
SELECT 
    events.GLOBALEVENTID, SQLDATE, MonthYear, Actor1Name, Actor1CountryCode,
    Actor1Geo_Type, Actor1Geo_FeatureID, Actor2Name, Actor2CountryCode,
    Actor2Geo_Type, Actor2Geo_FeatureID, EventCode, GoldsteinScale, NumMentions,
    NumSources, NumArticles, AvgTone, Actor1Geo_FullName, Actor1Geo_CountryCode,
    Actor2Geo_FullName, Actor2Geo_CountryCode, ActionGeo_FullName,
    ActionGeo_CountryCode, ActionGeo_ADM1Code, ActionGeo_ADM2Code, ActionGeo_Type,
    ActionGeo_FeatureID, SOURCEURL, MentionTimeDate, MentionType, MentionSourceName,
    MentionIdentifier, SentenceID, Actor1CharOffset, Actor2CharOffset,
    ActionCharOffset, InRawText, Confidence, MentionDocLen, MentionDocTone,
    SourceCollectionIdentifier, SourceCommonName, DocumentIdentifier, Themes,
    V2Themes, Locations, V2Locations, Persons, V2Persons, Organizations,
    V2Organizations, V2Tone, AllNames
FROM
  `gdelt-bq.gdeltv2.eventmentions_partitioned` as eventmentions
  join `gdelt-bq.gdeltv2.events_partitioned` as events
    ON eventmentions.GLOBALEVENTID = events.GLOBALEVENTID
  inner join `gdelt-bq.gdeltv2.gkg_partitioned` as GKG
    ON eventmentions.MentionIdentifier = GKG.DocumentIdentifier
WHERE 
  ActionGeo_ADM1Code like 'FM%' -- Micronesia
  OR ActionGeo_ADM1Code like 'FJ%' -- Fiji
  OR ActionGeo_ADM1Code like 'KR%' -- Kiribati
  OR ActionGeo_ADM1Code like 'RM%' -- Marshall Islands
  OR ActionGeo_ADM1Code like 'NR%' -- Nauru
  OR ActionGeo_ADM1Code like 'PS%' -- Palau
  OR ActionGeo_ADM1Code like 'PP%' -- Papua New Guinea
  OR ActionGeo_ADM1Code like 'WS%' -- Samoa
  OR ActionGeo_ADM1Code like 'BP%' -- Solomon Islands
  OR ActionGeo_ADM1Code like 'TN%' -- Tonga
  OR ActionGeo_ADM1Code like 'TV%' -- Tuvalu
  OR ActionGeo_ADM1Code like 'NH%' -- Vanuatu
  OR ActionGeo_ADM1Code like 'CW%' -- Cook Islands
  OR ActionGeo_ADM1Code like 'NE%' -- Niue
  OR ActionGeo_ADM1Code like 'AQ%' -- American Samoa
  OR ActionGeo_FullName = 'Ashmore Reef, Queensland, Australia'
  OR ActionGeo_ADM1Code like 'FQ%' -- Baker Island
  OR ActionGeo_FullName = 'Coral Sea, Oceans (general), Oceans'
  OR ActionGeo_FullName like 'Easter Island, V%'
  OR ActionGeo_FullName = 'Galapagos, Imbabura, Ecuador'
  OR ActionGeo_ADM1Code like 'FP%' -- French Polynesia
  OR ActionGeo_ADM1Code like 'GQ%' -- Guam
  OR ActionGeo_ADM1Code like 'HQ%' -- Howland Island
  OR ActionGeo_ADM1Code like 'DQ%' -- Jarvis Island
  OR ActionGeo_ADM1Code like 'JQ%' -- Johnston Atoll
  OR ActionGeo_ADM1Code like 'KQ%' -- Kingman Reef
  OR ActionGeo_FullName = 'Midway Island, Western Australia, Australia'
  OR ActionGeo_ADM1Code like 'NC%' -- New Caledonia
  OR ActionGeo_ADM1Code like 'NF%' -- Norfolk Island
  OR ActionGeo_ADM1Code like 'CQ%' -- Northern Mariana Islands
  OR ActionGeo_FullName = 'Ogasawaramura, Tokyo, Japan'
  OR ActionGeo_ADM1Code like 'LQ%' -- Palmyra Atoll
  OR ActionGeo_ADM1Code = 'ID36' -- Papua, Indonesia
  OR ActionGeo_ADM1Code like 'PC%' -- Pitcairn Islands
  OR ActionGeo_ADM1Code like 'TL%' -- Tokelau
  OR ActionGeo_ADM1Code like 'WQ%' -- Wake Island
  OR ActionGeo_ADM1Code like 'WF%' -- Wallis and Futuna
  OR ActionGeo_ADM1Code = 'ID39' -- West Papua, Indonesia
  OR ActionGeo_FullName = 'Bonin Islands, Tokyo, Japan'

At the time of the initial pull (Oct 26, 2020, 4:30:23 PM), this table was <b>19.9 GB</b> in size and contained <b>4,323,833</b> entries.<br>
This dataset also contains many extra columns that may or may not prove useful but are included just in case.
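
The long chain of `OR` conditions in the query may be easier to maintain if it is generated from a mapping of code prefixes to place names. Below is a minimal sketch in Python that uses only a few of the prefixes from the query. `build_where_clause` is a hypothetical helper, not part of the original workflow, and its output would still need to be pasted into the BigQuery query.

```python
# A few of the ADM1 code prefixes from the query above (not the full list)
adm1_prefixes = {
    "FM": "Micronesia",
    "FJ": "Fiji",
    "KR": "Kiribati",
}

# Places that the query matches by exact full name rather than by code prefix
full_names = ["Coral Sea, Oceans (general), Oceans"]


def build_where_clause(prefixes, names):
    """Hypothetical helper: join all conditions with OR, one per line."""
    conditions = [
        f"ActionGeo_ADM1Code LIKE '{code}%' -- {label}"
        for code, label in prefixes.items()
    ]
    conditions += [f"ActionGeo_FullName = '{name}'" for name in names]
    return "\n  OR ".join(conditions)


where_clause = build_where_clause(adm1_prefixes, full_names)
print("WHERE\n  " + where_clause)
```

Keeping the prefixes and labels in one dictionary also makes typos in the place names easier to spot.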

## Creating a mini dataset from the initial pull

Because the previous dataset is too large to keep fully in memory at all times, I created a smaller dataset from the initial pull. This speeds up analysis and helps avoid running out of memory.

##### Assume: 
<em>The previous dataset is already saved as <b>df</b><br> Python libraries <b>numpy</b> (np), <b>pandas</b> (pd), and <b>sqlalchemy</b> are already imported</em>

In [6]:
location = '' # exact location to store the dataset along with the name 

useful_columns = ['SQLDATE', 'MonthYear', 'Actor1Name', 'Actor1CountryCode', 
                  'Actor2Name', 'Actor2CountryCode', 'AvgTone', 'Actor1Geo_FullName',
                  'Actor1Geo_CountryCode', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode', 
                  'ActionGeo_FullName', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 
                  'SOURCEURL', 'MentionSourceName', 'MentionIdentifier', 'Confidence', 
                  'MentionDocLen', 'MentionDocTone', 'SourceCommonName', 'V2Themes', 'V2Tone']

# A plain column selection is enough here; there is no need to wrap it in pd.DataFrame
df[useful_columns].to_csv(location, index=False)

This new dataset, which contains only the columns that are immediately useful, is <b>9.22 GB</b> in size
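
If the full pull does not fit in memory as a single `df`, one alternative is to read the CSV in chunks and keep only the useful columns. The sketch below assumes the full pull was saved as a CSV file. The file names, the toy columns, and the chunk size are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the full pull; in practice this would be the 19.9 GB CSV
tmp_dir = tempfile.mkdtemp()
full_path = os.path.join(tmp_dir, "full_pull.csv")
mini_path = os.path.join(tmp_dir, "mini.csv")
pd.DataFrame({
    "SQLDATE": [20201026, 20201027, 20201028],
    "AvgTone": [1.5, -2.0, 0.3],
    "NumMentions": [4, 10, 2],  # a column we do not want to keep
}).to_csv(full_path, index=False)

keep = ["SQLDATE", "AvgTone"]  # stands in for useful_columns

# Read only the wanted columns, a few rows at a time,
# and append each chunk to the output file
for i, chunk in enumerate(pd.read_csv(full_path, usecols=keep, chunksize=2)):
    chunk.to_csv(
        mini_path,
        mode="w" if i == 0 else "a",  # overwrite on the first chunk, then append
        header=(i == 0),              # write the header only once
        index=False,
    )

mini = pd.read_csv(mini_path)
print(mini.shape)
```

With a real file, a much larger `chunksize` (for example, a few hundred thousand rows) would likely be more practical.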

### Creating a mini dataset only containing ENV_ themes

The first step is creating a SQL engine backed by an in-memory SQLite database, into which the dataset is loaded as a table so that it can be queried with SQL. I found sqlalchemy the simplest and most intuitive to use. There are many other options I could have chosen, but I stuck with the first one that worked easily for me.

In [7]:
from sqlalchemy import create_engine

# Create the SQL engine ('sqlite://' with no path is an in-memory SQLite database)
engine = create_engine('sqlite://', echo=False)

# df is the pandas DataFrame created earlier.
# 'df' (the string passed to to_sql) is the name of the table you will query through the engine.
# This name is arbitrary and you can name it whatever you want.
df.to_sql('df', con=engine)

In [None]:
query = """
SELECT SQLDATE, MonthYear, Actor1Name, Actor1CountryCode, Actor2Name, 
    Actor2CountryCode, AvgTone, Actor1Geo_FullName, Actor1Geo_CountryCode, 
    Actor2Geo_FullName, Actor2Geo_CountryCode, ActionGeo_FullName, 
    ActionGeo_CountryCode, ActionGeo_ADM1Code, SOURCEURL, MentionSourceName, 
    MentionIdentifier, Confidence, MentionDocLen, MentionDocTone, 
    SourceCommonName, V2Themes, V2Tone
FROM df
WHERE V2Themes LIKE '%ENV_%'
"""

# Execute the query and load the result straight into a DataFrame.
# pd.read_sql_query keeps the column names from the SELECT list,
# so they do not need to be listed a second time.
env_df = pd.read_sql_query(query, con=engine)

# Save the ENV_ subset to some location again
env_df.to_csv(location, index=False)

This dataset is <b>2.57 GB </b> and contains <b> 877,841 </b> articles
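
One thing to keep in mind about the `LIKE '%ENV_%'` filter: in SQL, `_` is a wildcard that matches any single character, so the pattern matches `ENV` followed by any character rather than only the literal prefix `ENV_`. The sketch below shows the difference on made-up theme strings using Python's built-in `sqlite3` module. The table name `demo`, the sample values, and the choice of `!` as the escape character are assumptions for illustration. The counts reported in this notebook come from the unescaped pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (V2Themes TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [
        ("ENV_CLIMATECHANGE,120",),   # has a literal ENV_ prefix
        ("MADE_UP_ENVIRONMENT,45",),  # contains ENV followed by 'I'
        ("TAX_FNCACT,30",),           # no ENV at all
    ],
)

# Unescaped: '_' matches any single character
loose = conn.execute(
    "SELECT V2Themes FROM demo WHERE V2Themes LIKE '%ENV_%'"
).fetchall()

# Escaped: '!_' matches a literal underscore
strict = conn.execute(
    "SELECT V2Themes FROM demo WHERE V2Themes LIKE '%ENV!_%' ESCAPE '!'"
).fetchall()

print(len(loose), len(strict))  # prints: 2 1
conn.close()
```

Whether the looser match matters depends on which theme names actually occur in `V2Themes`, so it may be worth checking a sample of the matched rows.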

### Creating a mini dataset only containing Great Powers using Actor1/2CountryCode

I created this mini dataset using only pandas functions.

The great powers of interest are: <b>United States</b>, <b>China</b>, <b>Australia</b>, <b>New Zealand</b>, <b>Russia</b>, and <b>Japan</b> <br>
Great power Actor1/2CountryCode: <b>USA</b>, <b>CHN</b>, <b>AUS</b>, <b>NZL</b>, <b>RUS</b>, and <b>JPN</b> <br>
Great power Actor1/2Geo_CountryCode: <b>US</b>, <b>CH</b>, <b>AS</b>, <b>NZ</b>, <b>RS</b>, and <b>JA</b>

In [None]:
location = '' # exact location to store the dataset along with the name 

usa_mask = ((df['Actor1CountryCode'] == 'USA') | (df['Actor2CountryCode'] == 'USA'))
chn_mask = ((df['Actor1CountryCode'] == 'CHN') | (df['Actor2CountryCode'] == 'CHN'))
aus_mask = ((df['Actor1CountryCode'] == 'AUS') | (df['Actor2CountryCode'] == 'AUS'))
nzl_mask = ((df['Actor1CountryCode'] == 'NZL') | (df['Actor2CountryCode'] == 'NZL'))
rus_mask = ((df['Actor1CountryCode'] == 'RUS') | (df['Actor2CountryCode'] == 'RUS'))
jpn_mask = ((df['Actor1CountryCode'] == 'JPN') | (df['Actor2CountryCode'] == 'JPN'))
df[(usa_mask | chn_mask | aus_mask | nzl_mask | rus_mask | jpn_mask)].to_csv(location, index=False)

This dataset is <b>2.96 GB</b> and contains <b> 1,497,930 </b> articles
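
The six masks can also be collapsed into two `isin` calls, which makes it easier to change the list of countries later. This is a minimal sketch on a toy DataFrame with made-up rows. `toy` and `great_powers` are names introduced here for illustration.

```python
import pandas as pd

# Toy rows standing in for df; the values are made up
toy = pd.DataFrame({
    "Actor1CountryCode": ["USA", "FJI", None, "TON"],
    "Actor2CountryCode": ["FJI", "CHN", "WSM", None],
})

great_powers = ["USA", "CHN", "AUS", "NZL", "RUS", "JPN"]

# True where either actor belongs to one of the great powers
mask = (
    toy["Actor1CountryCode"].isin(great_powers)
    | toy["Actor2CountryCode"].isin(great_powers)
)
subset = toy[mask]
print(subset)
```

Missing country codes are treated as non-matches by `isin`, which is the same outcome as the `==` comparisons in the cell above.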

### Creating a mini dataset with Great Powers and ENV_ themes using Actor1/2CountryCode

To create another subset containing the Great Powers (filtered using Actor1/2CountryCode), I loaded the mini dataset that has only ENV_ themes and applied the same logic as in the section <br><em> Creating a mini dataset only containing Great Powers using Actor1/2CountryCode </em>

This dataset is <b>742 MB</b> and contains <b> 277,215 </b> articles
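
The code for this step is not shown, so the following is only a sketch of what it might look like. It reads a stand-in for the ENV_ mini dataset (an in-memory CSV with made-up rows; in practice the path of the saved file would be passed to `pd.read_csv`) and applies the same either-actor filter. The helper name `filter_great_powers` is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the saved ENV_ mini dataset; the rows are made up
env_csv = io.StringIO(
    "Actor1CountryCode,Actor2CountryCode,V2Themes\n"
    'USA,FJI,"ENV_CLIMATECHANGE,10"\n'
    'FJI,TON,"ENV_FISHERY,22"\n'
    ',AUS,"ENV_MINING,5"\n'
)
env_df = pd.read_csv(env_csv)


def filter_great_powers(frame, col1, col2, codes):
    """Hypothetical helper: keep rows where either column holds one of the codes."""
    mask = frame[col1].isin(codes) | frame[col2].isin(codes)
    return frame[mask]


codes = ["USA", "CHN", "AUS", "NZL", "RUS", "JPN"]
env_great_powers = filter_great_powers(
    env_df, "Actor1CountryCode", "Actor2CountryCode", codes
)
# env_great_powers.to_csv(location, index=False)  # save as in the earlier cells
print(len(env_great_powers))
```

The same helper could be reused for the Geo-based variant by passing `Actor1Geo_CountryCode`, `Actor2Geo_CountryCode`, and the two-letter codes `US`, `CH`, `AS`, `NZ`, `RS`, and `JA`.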

### Creating a mini dataset only containing Great Powers using Actor1/2Geo_CountryCode

In [None]:
location = '' # exact location to store the dataset along with the name 

usa_mask = ((df['Actor1Geo_CountryCode'] == 'US') | (df['Actor2Geo_CountryCode'] == 'US'))
chn_mask = ((df['Actor1Geo_CountryCode'] == 'CH') | (df['Actor2Geo_CountryCode'] == 'CH'))
aus_mask = ((df['Actor1Geo_CountryCode'] == 'AS') | (df['Actor2Geo_CountryCode'] == 'AS'))
nzl_mask = ((df['Actor1Geo_CountryCode'] == 'NZ') | (df['Actor2Geo_CountryCode'] == 'NZ'))
rus_mask = ((df['Actor1Geo_CountryCode'] == 'RS') | (df['Actor2Geo_CountryCode'] == 'RS'))
jpn_mask = ((df['Actor1Geo_CountryCode'] == 'JA') | (df['Actor2Geo_CountryCode'] == 'JA'))
df[(usa_mask | chn_mask | aus_mask | nzl_mask | rus_mask | jpn_mask)].to_csv(location, index=False)

This dataset is <b>1.39 GB</b> and contains <b> 648,345 </b> articles

### Creating a mini dataset with Great Powers and ENV_ themes using Actor1/2Geo_CountryCode

To create another subset containing the Great Powers (filtered using Actor1/2Geo_CountryCode), I loaded the mini dataset that has only ENV_ themes and applied the same logic as in the section <br><em> Creating a mini dataset only containing Great Powers using Actor1/2Geo_CountryCode </em>

This dataset is <b>368 MB</b> and contains <b> 125,635 </b> articles