# North Korean News

Scrape the website of the North Korean news agency at http://kcna.kp.

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible", it means "take only as much as needed"_
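
To make the last two tips concrete, here is a minimal sketch using Python's built-in `re` module on the example `onclick` value from the assignment text. The variable names (`onclick`, `greedy`, `lazy`) are only illustrative; the same pattern should work with pandas string methods too.

```python
import re

# Example onclick value taken from the assignment description
onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal opening parenthesis.
# Greedy: .* grabs as much as possible, so it runs to the LAST double quote.
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST closing quote.
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

print(greedy)
print(lazy)
```

The greedy version captures everything up to the final quote in the string, while the lazy version captures only the article ID.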

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=4EFB48D366976C8A10D5BD46E8E423BB'
response = requests.get(url, verify=False)
doc = BeautifulSoup(response.text, 'html.parser')

In [3]:
# Print the parsed HTML to check that the page looks the way we expect
print(doc.prettify())

<html>
 <head>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <script language="javascript">
   var globalContextPath = "";
	var jsLangCode = "eng";
	var flashPlayer = "/download/FlashPlayer10.zip";
	var gYearStr = "Juche";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>
  <style>
   body {
	
		font-family: Tahoma, serif, Arial, Helvetica;	
	
}
  </style>
  <!--[if IE]> 
	<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>
<![endif]-->
  <!--[if IE 6]> 
	<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="text/css"/>
<![endif]-->
  <script language="ja

In [10]:
headlines = doc.find_all('a', class_='titlebet')
for headline in headlines: 
    print(headline.text)

Delegation of Koreans in Japan Pay Tribute to Statues of President Kim Il Sung, Chairman Kim Jong Il
Greetings to Seychellois President
Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un
Xi Jinping to Visit DPRK
Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex
Double-Dealing Tactics Will Never Work: KCNA Commentary
Central Election Guidance Committee Formed in DPRK
Congratulations to President of Kazakhstan
Greetings to Philippine President
KCNA Commentary Rebukes Japan's Attachment to "Flag of Rising Sun Shedding Rays"
Results of 2018-2019 DPRK Premier Football League (14)
Road Relay Held in DPRK
Results of 2018-2019 DPRK Women's Premier Football League (12)
Results of 2018-2019 DPRK Premier Football League (13)
Results of 2018-2019 DPRK Women's Premier Football League (11)
Pyongyang International Sci-Tech Exhibition of Health and Medical Appliances Opens
Fictions and Models - 2019 Held
Tour by Super-light Planes Popular in DPRK
MOU Signed between DPRK and Ru

In [6]:
headlines = doc.find_all('a', class_='titlebet') 
for headline in headlines:
    print(headline.attrs['onclick'])

type(headline.attrs['onclick'])

fn_showArticle("AR0126313", "", "NT21", "L")
fn_showArticle("AR0126295", "", "NT21", "L")
fn_showArticle("AR0126287", "", "NT21", "L")
fn_showArticle("AR0126286", "", "NT21", "L")
fn_showArticle("AR0126281", "", "NT21", "L")
fn_showArticle("AR0126277", "", "NT21", "L")
fn_showArticle("AR0126169", "", "NT21", "L")
fn_showArticle("AR0126127", "", "NT21", "L")
fn_showArticle("AR0126118", "", "NT21", "L")
fn_showArticle("AR0126117", "", "NT21", "L")
fn_showArticle("AR0126278", "", "NT11", "L")
fn_showArticle("AR0126227", "", "NT11", "L")
fn_showArticle("AR0126226", "", "NT11", "L")
fn_showArticle("AR0126058", "", "NT11", "L")
fn_showArticle("AR0126009", "", "NT11", "L")
fn_showArticle("AR0126283", "", "NT09", "L")
fn_showArticle("AR0125978", "", "NT09", "L")
fn_showArticle("AR0124762", "", "NT19", "L")
fn_showArticle("AR0122941", "", "NT19", "L")
fn_showArticle("AR0121949", "", "NT19", "L")
fn_showArticle("AR0121709", "", "NT19", "L")
fn_showArticle("AR0120090", "", "NT19", "L")
fn_showArt

str

In [7]:
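# Slice the article ID out of a single onclick string
# (headline is still the last element from the loop above)
article_id = headline.attrs['onclick'][16:25]
print(article_id)

AR0125707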


In [12]:
rows = []
for headline in headlines:
    
    row = {}
    
    row['headlines'] = headline.text
    row['onclick'] = headline.attrs['onclick']
    row['ID'] = headline.attrs['onclick'][16:25]
    print(row)
    
    # Take our new dictionary and add it to our list
    rows.append(row)

{'headlines': 'Delegation of Koreans in Japan Pay Tribute to Statues of President Kim Il Sung, Chairman Kim Jong Il', 'onclick': 'fn_showArticle("AR0126313", "", "NT21", "L")', 'ID': 'AR0126313'}
{'headlines': 'Greetings to Seychellois President', 'onclick': 'fn_showArticle("AR0126295", "", "NT21", "L")', 'ID': 'AR0126295'}
{'headlines': "Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un", 'onclick': 'fn_showArticle("AR0126287", "", "NT21", "L")', 'ID': 'AR0126287'}
{'headlines': 'Xi Jinping to Visit DPRK', 'onclick': 'fn_showArticle("AR0126286", "", "NT21", "L")', 'ID': 'AR0126286'}
{'headlines': 'Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex', 'onclick': 'fn_showArticle("AR0126281", "", "NT21", "L")', 'ID': 'AR0126281'}
{'headlines': 'Double-Dealing Tactics Will Never Work: KCNA Commentary', 'onclick': 'fn_showArticle("AR0126277", "", "NT21", "L")', 'ID': 'AR0126277'}
{'headlines': 'Central Election Guidance Committee Formed in DPRK', 'onclick': 'fn_show

In [13]:
import pandas as pd
df = pd.DataFrame(rows)
df.head() 

|   | ID | headlines | onclick |
|---|---|---|---|
| 0 | AR0126313 | Delegation of Koreans in Japan Pay Tribute to ... | `fn_showArticle("AR0126313", "", "NT21", "L")` |
| 1 | AR0126295 | Greetings to Seychellois President | `fn_showArticle("AR0126295", "", "NT21", "L")` |
| 2 | AR0126287 | Venezuelan City's Highest Order Awarded to Sup... | `fn_showArticle("AR0126287", "", "NT21", "L")` |
| 3 | AR0126286 | Xi Jinping to Visit DPRK | `fn_showArticle("AR0126286", "", "NT21", "L")` |
| 4 | AR0126281 | Kim Jae Ryong Inspects Tokchon Area Coal-Minin... | `fn_showArticle("AR0126281", "", "NT21", "L")` |
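
The fixed-position slice `[16:25]` works only while every `onclick` value has the same prefix and the same ID length. As a hedged alternative, the ID column could be built with pandas' `.str.extract`, as the tips suggest. This sketch rebuilds a two-row DataFrame from values shown in the output above so that it runs on its own; in the notebook you would apply the same line to the real `df`.

```python
import pandas as pd

# Two sample rows copied from the scraped output (stand-in for the real df)
df = pd.DataFrame({
    "headlines": [
        "Greetings to Seychellois President",
        "Xi Jinping to Visit DPRK",
    ],
    "onclick": [
        'fn_showArticle("AR0126295", "", "NT21", "L")',
        'fn_showArticle("AR0126286", "", "NT21", "L")',
    ],
})

# \( matches a literal parenthesis; (.*?) lazily captures up to the next quote.
# expand=False returns a Series, which can be assigned to a single column.
df["ID"] = df["onclick"].str.extract(r'fn_showArticle\("(.*?)"', expand=False)

print(df["ID"].tolist())
```

Because the pattern anchors on the function name and the quotes rather than on character positions, it should keep working if the ID length ever changes.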


In [15]:
df.to_csv("nk-news.csv", index=False)
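
To see why `index=False` matters, here is a small sketch that writes a one-row DataFrame both ways and reads each result back. It uses in-memory buffers (`io.StringIO`) instead of real files, and the buffer and variable names are made up for illustration.

```python
import io
import pandas as pd

# One sample row taken from the scraped output
df = pd.DataFrame({
    "ID": ["AR0126286"],
    "headlines": ["Xi Jinping to Visit DPRK"],
})

# Default behavior: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Rewind the buffers and read them back like files
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

The first list contains an extra `Unnamed: 0` column, which is the saved index coming back as data; the second contains only `ID` and `headlines`.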