# North Korean News

Scrape the North Korean news agency http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at response.text to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
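
The last two tips are easier to see with a small experiment. The sketch below uses only the standard `re` module and the sample `onclick` value from the assignment text; the variable names are just for illustration. It shows the escaped `\(` and the difference between the greedy `.*` and the lazy `.*?`.

```python
import re

# Sample value copied from the assignment text
onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal opening parenthesis; the unescaped ( ) pair is the capture group.
# Greedy: .* keeps going until the LAST double quote in the string
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

# Lazy: .*? stops at the FIRST double quote it can
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

print(greedy)  # AR0125885", "", "NT00", "L
print(lazy)    # AR0125885
```

The same pattern should also work with pandas' `.str.extract`, since pandas takes ordinary Python regular expressions.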

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
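# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
# Note: the jsessionid in this URL belongs to one browser session and may have expired
url = "http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=442CB85D1E22023F1780D82B0C79D964"
# Despite the name, raw_url holds the raw bytes of the response body, not a URL
raw_url = requests.get(url, headers=headers).content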

In [3]:
soup_doc=BeautifulSoup(raw_url, "html.parser")

In [4]:
soup_doc.prettify()

'<html>\n <head>\n  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n  <script language="javascript">\n   var globalContextPath = "";\r\n\tvar jsLangCode = "eng";\r\n\tvar flashPlayer = "/download/FlashPlayer10.zip";\r\n\tvar gYearStr = "Juche";\n  </script>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  <link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>\n  <style>\n   body {\r\n\t\r\n\t\tfont-family: Tahoma, serif, Arial, Helvetica;\t\r\n\t\r\n}\n  </style>\n  <!--[if IE]> \r\n\t<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>\r\n<![endif]-->\n  <!--[if IE 6]> \r\n\t<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="te

In [18]:
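# Each headline sits inside an element with the class "sub_title2"
news_headlines = soup_doc.find_all(class_="sub_title2")

latest_news = []

for headline in news_headlines:
    link = headline.find("a")
    story = {}
    story["article_headline"] = link.text
    story["onclick"] = link["onclick"]
    # Raw string so that \d reaches the regex engine unchanged
    match = re.search(r"AR\d+", story["onclick"])
    if match:
        # Keep the story only if an article ID was found, and add it exactly once
        story["article_id"] = match.group()
        latest_news.append(story)

print(latest_news)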

[{'article_headline': 'Mansudae Art Studio in DPRK', 'onclick': 'fn_showArticle("AR0126669", "", "NT41", "L")', 'article_id': 'AR0126669'}, {'article_headline': 'Celebration Performances Given', 'onclick': 'fn_showArticle("AR0126710", "", "NT41", "L")', 'article_id': 'AR0126710'}, {'article_headline': 'Supreme Leader Kim Jong Un Authors Works Indicating Way for Economic Construction', 'onclick': 'fn_showArticle("AR0126668", "", "NT41", "L")', 'article_id': 'AR0126668'}, {'article_headline': 'Delegation of DPRK Red Cross Society Leaves for Mongolia', 'onclick': 'fn_showArticle("AR0126708", "", "NT41", "L")', 'article_id': 'AR0126708'}, {'article_headline': 'Supreme Leader Kim Jong Un Praised Abroad', 'onclick': 'fn_showArticle("AR0126707", "", "NT41", "L")', 'article_id': 'AR0126707'}, {'article_headline': 'Breast Tumour Institute of Pyongyang Maternity Hospital', 'onclick': 'fn_showArticle("AR0126706", "", "NT41", "L")', 'article_id': 'AR0126706'}, {'article_headline': "Results of 2018

In [21]:
df=pd.DataFrame(latest_news)
df

| | article_headline | article_id | onclick |
|---|---|---|---|
| 0 | Mansudae Art Studio in DPRK | AR0126669 | fn_showArticle("AR0126669", "", "NT41", "L") |
| 1 | Celebration Performances Given | AR0126710 | fn_showArticle("AR0126710", "", "NT41", "L") |
| 2 | Supreme Leader Kim Jong Un Authors Works Indic... | AR0126668 | fn_showArticle("AR0126668", "", "NT41", "L") |
| 3 | Delegation of DPRK Red Cross Society Leaves fo... | AR0126708 | fn_showArticle("AR0126708", "", "NT41", "L") |
| 4 | Supreme Leader Kim Jong Un Praised Abroad | AR0126707 | fn_showArticle("AR0126707", "", "NT41", "L") |
| 5 | Breast Tumour Institute of Pyongyang Maternity... | AR0126706 | fn_showArticle("AR0126706", "", "NT41", "L") |
| 6 | Results of 2018-2019 DPRK Women's Premier Foot... | AR0126705 | fn_showArticle("AR0126705", "", "NT41", "L") |
| 7 | Supreme Leader Kim Jong Un, Peerlessly Great M... | AR0126704 | fn_showArticle("AR0126704", "", "NT41", "L") |
| 8 | "Liberal Korea Party" Is Doomed to Ruin: KCNA ... | AR0126703 | fn_showArticle("AR0126703", "", "NT41", "L") |
| 9 | Delegation of Ministry of Posts and Telecommun... | AR0126701 | fn_showArticle("AR0126701", "", "NT41", "L") |


In [22]:
# The assignment asks for the file to be named nk-news.csv, without the index
df.to_csv("nk-news.csv", index=False)
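
The assignment suggests doing the ID extraction in pandas instead of inside the loop. A minimal sketch of that route is below, assuming a DataFrame that already has an `onclick` column. `df_demo` is a hypothetical two-row stand-in built from values in the printed output above, and the CSV is written to an in-memory buffer rather than a file so the result is easy to inspect.

```python
import io

import pandas as pd

# Hypothetical stand-in for the scraped data (two rows copied from the output above)
df_demo = pd.DataFrame({
    "article_headline": [
        "Mansudae Art Studio in DPRK",
        "Celebration Performances Given",
    ],
    "onclick": [
        'fn_showArticle("AR0126669", "", "NT41", "L")',
        'fn_showArticle("AR0126710", "", "NT41", "L")',
    ],
})

# With one capture group, expand=False returns a Series,
# so the result can be assigned directly as a single new column
df_demo["article_id"] = df_demo["onclick"].str.extract(r'\("(.*?)"', expand=False)

# index=False leaves the row numbers out of the CSV
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Leaving out `expand=False` would return a one-column DataFrame instead of a Series, which is why the tip recommends it when adding a single column.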