# Team Cyber Pacific - Analysis of Japan's MOFA press releases page

Links used as reference:
- https://www.devgem.io/posts/overcoming-html-comment-barriers-with-beautifulsoup
- https://scrapeops.io/web-scraping-playbook/403-forbidden-error-web-scraping/
- https://stackoverflow.com/questions/43814754/python-beautifulsoup-how-to-get-href-attribute-of-a-element
- https://stackoverflow.com/questions/6518291/using-index-on-multidimensional-lists
- https://stackoverflow.com/questions/67000250/web-scraping-a-certain-row-from-a-table
- https://medium.com/geekculture/web-scraping-tables-in-python-using-beautiful-soup-8bbc31c5803e
- https://towardsdatascience.com/a-guide-to-scraping-html-tables-with-pandas-and-beautifulsoup-7fc24c331cf7

Using Python 3.13

## Importing Libraries
For this project, I have used the following libraries:

*  BeautifulSoup - For web scraping the data
*  Pandas - For graphing my results
*  Re - For getting links in the HTML page
*  Requests - Python HTTP library that simplifies HTTP requests
*  Certifi - For fixing the SSLError raised when getting the archived data from 2005-2016

In [379]:
# Check if beautifulsoup and pandas are already installed
# If not, install them and then import
try:
    from bs4 import BeautifulSoup
    import pandas
    print("Skipping import, already installed")
except ImportError:
    # Only a failed import should trigger the install, not every possible error
    import sys
    !conda install --yes --prefix {sys.prefix} bs4
    !conda install --yes --prefix {sys.prefix} pandas
    from bs4 import BeautifulSoup
    import pandas
    print("BeautifulSoup and Pandas not found. Imported")

import requests
import re
import certifi

#Importing the python file with all our functions.
import cyberpacficFunctions as cypac

Skipping import, already installed
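
As a side note, the availability check can also be done without importing anything. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name and is not part of the notebook or of `cyberpacficFunctions`.

```python
import importlib.util

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names (not conda/pip package names) used by this notebook.
required = ["bs4", "pandas", "requests", "certifi"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available")
```

This only reports what is missing; installing is still left to `conda` or `pip`.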


In [388]:
#REMOVE LATER

import importlib
importlib.reload(cypac)

<module 'cyberpacficFunctions' from '/Users/Kem/Desktop/CyPac/cyberpacficFunctions.py'>

## Retrieving All the Website Links Needed From Most Recent Press Release Page
We use Requests and BeautifulSoup to fetch and parse the most recent press release page. From it we collect the link for every month of every year, which we can then use to visit each monthly index page.

In [389]:
#Using these headers to get past the 403 Access Denied error. We will reuse these.
headerUsing = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }
extract = cypac.Extraction(headerUsing)
bs = extract.parse_data('https://www.mofa.go.jp/mofaj/press/release/6_11_index.html')

print(bs)

[['#contents', '本文へ'], ['/mofaj/comment/index.html', '御意見・御感想'], ['/mofaj/map/index.html', 'サイトマップ'], ['/mofaj/link/index.html', 'リンク集'], ['/index.html', 'English'], ['/about/emb_cons/over/multi.html', 'Other Languages'], ['#', '小'], ['#', '中'], ['#', '大'], ['/mofaj/annai/index.html', '外務省について'], ['/mofaj/press/index.html', '会見・発表・広報'], ['/mofaj/gaiko/index.html', '外交政策'], ['/mofaj/area/index.html', '国・地域'], ['/mofaj/toko/index.html', '海外渡航・滞在'], ['/mofaj/procedure/index.html', '申請・手続き'], ['/mofaj/index.html', 'トップページ'], ['/mofaj/press/index.html', '会見・発表・広報'], ['/mofaj/press/release/index.html', '報道発表'], ['#archives', '過去の記録(Archives)'], ['/mofaj/press/release/pressit_000001_01449.html', 'レバノン被災民に係る物資協力の実施について'], ['/mofaj/press/release/pressit_000001_01448.html', '人事異動（令和6年11月29日付）'], ['/mofaj/press/release/pressit_000001_01447.html', 'リトアニア国家安全保障担当大統領首席顧問の外務大臣表敬'], ['/mofaj/press/release/pressit_000001_01445.html', '生稲外務大臣政務官と水谷亨・共同通信社社長との面会（概要）'], ['/mofaj/press/release/pressit_0000
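
The output above is a list of `[href, text]` pairs, one per link on the page. The implementation of `parse_data` is not shown here, so the following is only a rough sketch of how such pairs could be collected. It swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages, and it works on an HTML string instead of fetching a page. `LinkCollector`, `extract_links`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect [href, text] pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append([self._href, "".join(self._text).strip()])
            self._href = None

def extract_links(html):
    """Return every link in `html` as an [href, text] pair, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Made-up snippet shaped like the page above.
sample = (
    '<a href="#archives">過去の記録(Archives)</a>'
    '<ul><li><a href="/mofaj/press/release/6_01_index.html">1月</a></li></ul>'
)
print(extract_links(sample))
```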

Looking at the data, we can tell that the press releases all start after the entry whose link is `#archives` and end before the first `1月` (January) entry. The month links are then listed until the `こちら` ("here") entry.

We can use this to figure out:
- What value we need to search for to only get press releases
- What value we need to search for to extract just the months, or everything before that point.

We'll use it below to figure out the months and their links, then actually scrape the press releases once we have all the months and years compiled.
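
The slicing idea can be sketched as follows. The real `extract.find_value` lives in `cyberpacficFunctions` and is not shown, so this standalone `find_value` is only my guess at its behaviour: it returns the index of the first row whose given column equals the value, and raises `ValueError` when nothing matches. The sample rows are shortened from the output above.

```python
def find_value(rows, value, column):
    """Return the index of the first row whose `column` entry equals `value`."""
    for index, row in enumerate(rows):
        if row[column] == value:
            return index
    # Assumption: a missing marker is reported with an exception.
    raise ValueError(f"{value!r} not found in column {column}")

rows = [
    ["#archives", "過去の記録(Archives)"],
    ["/mofaj/press/release/pressit_000001_01449.html", "レバノン被災民に係る物資協力の実施について"],
    ["/mofaj/press/release/6_01_index.html", "1月"],
]

start = find_value(rows, "#archives", 0)  # match on the href column
end = find_value(rows, "1月", 1)          # match on the text column
press_releases = rows[start + 1:end]      # everything strictly between the markers
print(press_releases)
```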

In [427]:
#We get all of the links for the websites we need to scrape.
startIndex = extract.find_value(bs, '1月', 1)
endIndex = extract.find_value(bs, 'こちら', 1)
scrapedList = bs[(startIndex):(endIndex)]

# print(scrapedList)

#Build a list of [month, year, link] entries, one per monthly index page.
#The scraped links are relative, so we prepend the domain here to make them clickable.
monthYearStart = [1, 2024]
webLinks = []

for i in scrapedList:
  webLinks.append([monthYearStart[0], monthYearStart[1], ('https://www.mofa.go.jp' + i[0])])
  #Just for 2024 as Nov. is the current maximum
  if (monthYearStart[1] == 2024 and monthYearStart[0] == 11):
    monthYearStart[0] = 1
    monthYearStart[1] -= 1
  #If not December, move on to the next month; otherwise go back a year and reset to Jan.
  elif (monthYearStart[0] != 12):
    monthYearStart[0] += 1
  else:
    monthYearStart[0] = 1
    monthYearStart[1] -= 1

print(webLinks)

[[1, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_01_index.html'], [2, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_02_index.html'], [3, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_03_index.html'], [4, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_04_index.html'], [5, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_05_index.html'], [6, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_06_index.html'], [7, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_07_index.html'], [8, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_08_index.html'], [9, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_09_index.html'], [10, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_10_index.html'], [11, 2024, 'https://www.mofa.go.jp/mofaj/press/release/6_11_index.html'], [1, 2023, 'https://www.mofa.go.jp/mofaj/press/release/5_01_index.html'], [2, 2023, 'https://www.mofa.go.jp/mofaj/press/release/5_02_index.html'], [3, 2023, 'https://www.mofa.go.jp/mofaj/press/re
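
The links in this output appear to follow the pattern `<era year>_<month>_index.html`, where `6` pairs with 2024 (Reiwa 6). Under that assumption, the month and year could also be read from the link itself instead of being counted by hand. In this sketch, `month_year_from_link` and `ERA_OFFSETS` are hypothetical names, and the caller has to say which era a link uses because the number alone is ambiguous.

```python
import re

# Offsets from Japanese era years to Gregorian years:
# Reiwa 1 = 2019 and Heisei 1 = 1989.
ERA_OFFSETS = {"reiwa": 2018, "heisei": 1988}

INDEX_PATTERN = re.compile(r"/(\d+)_(\d+)_index\.html$")

def month_year_from_link(link, era):
    """Return (month, year) for a link ending in <era year>_<month>_index.html."""
    match = INDEX_PATTERN.search(link)
    if match is None:
        raise ValueError(f"unexpected link format: {link}")
    era_year, month = int(match.group(1)), int(match.group(2))
    return month, ERA_OFFSETS[era] + era_year

print(month_year_from_link(
    "https://www.mofa.go.jp/mofaj/press/release/6_11_index.html", "reiwa"))
# (11, 2024)
```

I have not checked how the pages around the 2019 era change are named, so those links would need a manual look.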

## Sorting the Data into a Table

We can now move on to putting the data into a table that is easy to work with.
Each press release will be stored as a dictionary with the following keys:

- Title, Month, Year, Link, Countries involved

In [431]:
cleanedData = []

for i in webLinks:
  bsR = extract.parse_data(i[2])
  #Remove extraneous data
  startIndex = extract.find_value(bsR, '#archives', 0)
  endIndex = extract.find_value(bsR, '1月', 1)

  scrapedListR = bsR[(startIndex+1):(endIndex)]

  #Build one dictionary per press release with its title, month, year, countries involved, and link. Countries only holds Japan for now; the rest is filled in later
  tempDict = {}

  for website in scrapedListR:
    tempDict = {"Title": website[1], "Month": i[0], "Year": i[1],"Countries":{'Japan'},"Link": ('https://www.mofa.go.jp' + website[0])}
    cleanedData.append(tempDict)

#Printing length as trying to print the entire thing freaks my computer out :(
print(len(cleanedData))

11124


## Retrieving Data from the Archives
The current press releases page does not hold all the data we need. Older releases can be accessed through an online archive (warp.ndl.go.jp), and the following code repeats the same steps for that archive.
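
Since the same record-building step runs for both sites, it could be pulled into a small helper. This is only a sketch: `build_records` is a hypothetical function that is not part of `cyberpacficFunctions`, and it uses `urllib.parse.urljoin` in place of string concatenation to make the relative links absolute.

```python
from urllib.parse import urljoin

def build_records(rows, month, year, base_url):
    """Turn [href, title] rows into press-release dictionaries."""
    records = []
    for href, title in rows:
        records.append({
            "Title": title,
            "Month": month,
            "Year": year,
            "Countries": {"Japan"},  # a fresh set per record, filled in later
            "Link": urljoin(base_url, href),
        })
    return records

rows = [["/mofaj/press/release/pressit_000001_00242.html", "日韓外相電話会談"]]
records = build_records(rows, 1, 2024, "https://www.mofa.go.jp")
print(records[0]["Link"])
```

The same call with `"https://warp.ndl.go.jp"` as `base_url` would cover the archive pages.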

In [432]:
#Removing comments as everything is explained before
bsP = extract.parse_data('https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/27_5_index.html')
startIndex = extract.find_value(bsP, '1月', 1)
endIndex = extract.find_value(bsP, '#top', 0)
scrapedListP = extract.filter_data(bsP, startIndex, endIndex)

monthYearStart = [1, 2017]
webLinksOld = []

for i in scrapedListP:
  #Just till Apr. 2017 as Apr. is the maximum on this archival date. The 2017 months are stepped over without saving their links.
  if monthYearStart[1] == 2017 and monthYearStart[0] == 4:
    monthYearStart[0] = 1
    monthYearStart[1] -= 1
  elif (monthYearStart[1] == 2017):
    monthYearStart[0] += 1
  else:
    webLinksOld.append([monthYearStart[0], monthYearStart[1], ('https://warp.ndl.go.jp' + i[0])])
    if (monthYearStart[0] != 12):
      monthYearStart[0] += 1
    else:
      monthYearStart[0] = 1
      monthYearStart[1] -= 1

print(webLinksOld)

[[1, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_1_index.html'], [2, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_2_index.html'], [3, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_3_index.html'], [4, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_4_index.html'], [5, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_5_index.html'], [6, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_6_index.html'], [7, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_7_index.html'], [8, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_8_index.html'], [9, 2016, 'https://warp.ndl.go.jp/info:ndljp/pid/10342969/www.mofa.go.jp/mofaj/press/release/28_9_index

In [430]:
cleanedDataP = []

for i in webLinksOld:
  bsP = extract.parse_data(i[2])
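  #The page format changes partway through the archive, so we swap the method of filtering when that happens.
  #idx is used here so the loop variable i (the current month entry) is not shadowed.
  indices = [idx for idx, x in enumerate(bsP) if x[0] == "#archives"]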
  try:
    startIndex = extract.find_value(bsP, '報道発表', 1)
    endIndex = extract.find_value(bsP, 'このページのトップへ戻る', 1)
  except:
    try:
      endIndex = extract.find_value(bsP, 'BACK', 1)
    except:
      print(bsP)
      break

  #Remove extraneous data. 
  if indices != []:
    scrapedListP = bsP[(indices[0]+1):(indices[1])]
  else:
    scrapedListP = bsP[(startIndex+1):(endIndex)]

  #Build one dictionary per press release with its title, month, year, countries involved, and link. Countries only holds Japan for now; the rest is filled in later
  tempDict = {}

  for website in scrapedListP:
    # print(i)
    tempDict = {"Title": website[1], "Month": i[0], "Year": i[1],"Countries":{'Japan'},"Link": ('https://warp.ndl.go.jp' + website[0])}
    cleanedDataP.append(tempDict)

#Printing length as trying to print the entire thing freaks my computer out :(
print(len(cleanedDataP))

15482
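
Because the archived pages switch layout partway through, the end marker differs between pages. One way to express that fallback without nested `try` blocks is to try a list of markers in order. This is a sketch with a hypothetical helper, `find_first_marker`, and made-up sample rows; it is not how `extract.find_value` works.

```python
def find_first_marker(rows, markers, column):
    """Return the index of the first marker (tried in order) found in `column`, else None."""
    for marker in markers:
        for index, row in enumerate(rows):
            if row[column] == marker:
                return index
    return None

# Markers taken from the cell above; earlier entries are preferred.
END_MARKERS = ["このページのトップへ戻る", "BACK"]

# Made-up rows shaped like an older archive page.
old_page = [["/a.html", "報道発表"], ["/b.html", "ある発表"], ["#top", "BACK"]]
end = find_first_marker(old_page, END_MARKERS, 1)
print(end)
```

Returning `None` lets the caller decide whether to skip the page or stop, instead of relying on an exception.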


## Time to make a table!

We will now combine the two lists of dictionaries into a single list, turn it into a dataframe, and then export it as an Excel file.

Make sure to set your language in Excel to Japanese!

In [433]:
#Combine the two lists, and then make a dataframe using Pandas
combinedData = cleanedData + cleanedDataP
combinedDF = pandas.DataFrame(combinedData)

combinedDF.to_excel('test.xlsx')

print(combinedData[10]['Title'])
print(combinedData[5])

カンボジア若手政治関係者招へい
{'Title': '東京電力福島第一原発におけるALPS処理水の海洋放出に関するレビュー・ミッションに係るIAEA報告書の公表', 'Month': 1, 'Year': 2024, 'Countries': {'Japan'}, 'Link': 'https://www.mofa.go.jp/mofaj/press/release/pressit_000001_00276.html'}
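
As an alternative to the `.xlsx` export, the records could be written to CSV. The sketch below uses the standard `csv` module with the `utf-8-sig` encoding, which adds a byte order mark that Excel generally uses to recognise UTF-8, so the Japanese titles should display correctly. It also turns each `Countries` set into a sorted string, since sets have no stable order. `write_records_csv` is a hypothetical helper, not part of the notebook.

```python
import csv

def write_records_csv(records, path):
    """Write press-release records to a UTF-8 CSV file with a byte order mark."""
    fields = ["Title", "Month", "Year", "Countries", "Link"]
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for record in records:
            row = dict(record)
            # Store the set as a sorted, comma-separated string.
            row["Countries"] = ", ".join(sorted(record["Countries"]))
            writer.writerow(row)
```

Calling `write_records_csv(combinedData, 'press_releases.csv')` would write one row per press release; the file name here is just an example.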


## Finding All Press Releases with Korea
We'll go through each title and add the press release to a new list if the title contains the character "韓", as that is the Japanese abbreviation for Korea.
For each match, we'll also add any other countries mentioned in the title to its set of countries.

In [None]:
#Go through the data and only add to the cleaned list if Korea is mentioned in the press release
listKorea = []

for i in combinedData:
    if '韓' in i['Title']:
        #Copy the entry and its set so combinedData is left unchanged, and append it only once at the end
        entry = {**i, 'Countries': set(i['Countries'])}
        entry['Countries'].add("South Korea")

        #The rest of the checks happen after as if it doesn't have Korea we can ignore it.
        if '中' in i['Title']:
            #This is kind of a bad check as the kanji for China also stands for "Middle" in Japanese. We'll check this in post
            entry['Countries'].add("China")
        if '米' in i['Title'] or 'アメリカ' in i['Title'] or 'G7' in i['Title']:
            #G7 also comes up from time to time and so I'm making sure to check it, but this is just checking for America
            entry['Countries'].add("USA")
        if '北朝鮮' in i['Title']:
            #Checking for North Korea
            entry['Countries'].add("North Korea")

        listKorea.append(entry)

#Make it a dataframe and export to Excel
print(listKorea)
koreaCleanedDf = pandas.DataFrame(listKorea)
koreaCleanedDf.to_excel('korea_japan_dofj_pressreleases.xlsx')

[{'Title': '旧朝鮮半島出身労働者問題（韓国大法院判決に関する我が国の立場の韓国政府への伝達）', 'Month': 1, 'Year': 2024, 'Countries': {'South Korea', 'Japan'}, 'Link': 'https://www.mofa.go.jp/mofaj/press/release/pressit_000001_00249.html'}, {'Title': '日韓外相電話会談', 'Month': 1, 'Year': 2024, 'Countries': {'South Korea', 'Japan'}, 'Link': 'https://www.mofa.go.jp/mofaj/press/release/pressit_000001_00242.html'}, {'Title': '北朝鮮に関する日米韓協議（結果）', 'Month': 1, 'Year': 2024, 'Countries': {'USA', 'North Korea', 'South Korea', 'Japan'}, 'Link': 'https://www.mofa.go.jp/mofaj/press/release/pressit_000001_00221.html'}, {'Title': '北朝鮮に関する日米韓協議', 'Month': 1,
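
The keyword checks above can also be written as a lookup table, which makes it easier to add countries later. This is a sketch using the same keywords as the cell above. `COUNTRY_KEYWORDS`, `tag_countries`, and `korea_related` are hypothetical names, and the "中" check keeps the same weakness noted earlier.

```python
COUNTRY_KEYWORDS = {
    "South Korea": ("韓",),
    "China": ("中",),  # also matches unrelated words, since 中 can mean "middle"
    "USA": ("米", "アメリカ", "G7"),
    "North Korea": ("北朝鮮",),
}

def tag_countries(title):
    """Return Japan plus every country whose keywords appear in `title`."""
    countries = {"Japan"}
    for country, keywords in COUNTRY_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            countries.add(country)
    return countries

def korea_related(records):
    """Return one tagged copy of each record whose title mentions Korea."""
    result = []
    for record in records:
        if "韓" in record["Title"]:
            result.append({**record, "Countries": tag_countries(record["Title"])})
    return result

print(tag_countries("北朝鮮に関する日米韓協議（結果）"))
```

Each matching record is appended once, and the original records are left untouched because a new dictionary and set are built for every match.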