# North Korean News

Scrape the website of the North Korean news agency KCNA: http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible", it means "take only as much as needed"_
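
The last two tips can be tried out on the example string from the assignment before touching the real page. This is only a sketch using Python's built-in `re` module; the variable names `onclick`, `greedy` and `lazy` are made up for illustration.

```python
import re

# The example onclick value from the assignment text
onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal "(" instead of starting a capture group.
# Greedy .* keeps going until the LAST double quote in the string.
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

# Non-greedy .*? stops at the FIRST double quote it can end on.
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

print(greedy)
print(lazy)
```

The greedy version drags the other arguments along with the ID, while the non-greedy version captures only the ID.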

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [5]:
# an English language session
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=7B676B8D21CC98BC398B795180481050'
response = requests.get(url)
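# name the parser explicitly so BeautifulSoup doesn't warn or pick a different one per machine
doc = BeautifulSoup(response.text, 'html.parser')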

In [6]:
print(doc.prettify())

<html>
 <head>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <script language="javascript">
   var globalContextPath = "";
	var jsLangCode = "eng";
	var flashPlayer = "/download/FlashPlayer10.zip";
	var gYearStr = "Juche";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>
  <link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>
  <style>
   body {
	
		font-family: Tahoma, serif, Arial, Helvetica;	
	
}
  </style>
  <!--[if IE]> 
	<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>
<![endif]-->
  <!--[if IE 6]> 
	<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="text/css"/>
<![endif]-->
  <script language="ja

In [34]:
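# headlines are <a class="titlebet"> tags with an onclick instead of an href
articles = doc.find_all('a', class_='titlebet')
rows = []
for article in articles:
    row = {}
    row['headline'] = article.text
    # the JavaScript call that opens the article, e.g. fn_showArticle("AR...", ...)
    row['onclick'] = article['onclick']
    rows.append(row)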

df = pd.DataFrame(rows)
df

Unnamed: 0,headline,onclick
0,Delegation of Koreans in Japan Pay Tribute to ...,"fn_showArticle(""AR0126313"", """", ""NT21"", ""L"")"
1,Greetings to Seychellois President,"fn_showArticle(""AR0126295"", """", ""NT21"", ""L"")"
2,Venezuelan City's Highest Order Awarded to Sup...,"fn_showArticle(""AR0126287"", """", ""NT21"", ""L"")"
3,Xi Jinping to Visit DPRK,"fn_showArticle(""AR0126286"", """", ""NT21"", ""L"")"
4,Kim Jae Ryong Inspects Tokchon Area Coal-Minin...,"fn_showArticle(""AR0126281"", """", ""NT21"", ""L"")"
5,Double-Dealing Tactics Will Never Work: KCNA C...,"fn_showArticle(""AR0126277"", """", ""NT21"", ""L"")"
6,Central Election Guidance Committee Formed in ...,"fn_showArticle(""AR0126169"", """", ""NT21"", ""L"")"
7,Congratulations to President of Kazakhstan,"fn_showArticle(""AR0126127"", """", ""NT21"", ""L"")"
8,Greetings to Philippine President,"fn_showArticle(""AR0126118"", """", ""NT21"", ""L"")"
9,KCNA Commentary Rebukes Japan's Attachment to ...,"fn_showArticle(""AR0126117"", """", ""NT21"", ""L"")"


In [57]:
# capture "AR" followed by digits; expand=False returns a Series for the single new column
df['article_id'] = df.onclick.str.extract(r"(AR\d+)", expand=False)

In [58]:
df.head()

Unnamed: 0,headline,onclick,article_id
0,Delegation of Koreans in Japan Pay Tribute to ...,"fn_showArticle(""AR0126313"", """", ""NT21"", ""L"")",AR0126313
1,Greetings to Seychellois President,"fn_showArticle(""AR0126295"", """", ""NT21"", ""L"")",AR0126295
2,Venezuelan City's Highest Order Awarded to Sup...,"fn_showArticle(""AR0126287"", """", ""NT21"", ""L"")",AR0126287
3,Xi Jinping to Visit DPRK,"fn_showArticle(""AR0126286"", """", ""NT21"", ""L"")",AR0126286
4,Kim Jae Ryong Inspects Tokchon Area Coal-Minin...,"fn_showArticle(""AR0126281"", """", ""NT21"", ""L"")",AR0126281
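
A minimal sketch of what the `expand=False` tip changes, using two `onclick` values copied from the table above. The `sample` frame is a stand-in for the scraped `df` and is not part of the original notebook.

```python
import pandas as pd

# Stand-in for the scraped data: two onclick values from the output above
sample = pd.DataFrame({
    "onclick": [
        'fn_showArticle("AR0126313", "", "NT21", "L")',
        'fn_showArticle("AR0126295", "", "NT21", "L")',
    ]
})

# Default (expand=True): a DataFrame, even with a single capture group
as_frame = sample["onclick"].str.extract(r"(AR\d+)")

# expand=False with one capture group: a Series, ready to assign as a column
as_series = sample["onclick"].str.extract(r"(AR\d+)", expand=False)

sample["article_id"] = as_series
print(type(as_frame).__name__, type(as_series).__name__)
print(sample["article_id"].tolist())
```

Either form can usually be assigned to a new column, but the Series makes it explicit that exactly one column is being created.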


In [59]:
# index=False keeps the row numbers out of the file, as the assignment asks
df.to_csv('nk-news.csv', index=False)
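
To see why the index matters, here is a small sketch that writes a one-row frame to CSV text both ways and reads it back. The `sample` frame and the variable names are illustrative assumptions; the row is taken from the output above.

```python
import io
import pandas as pd

# One-row stand-in for the scraped data
sample = pd.DataFrame({
    "headline": ["Xi Jinping to Visit DPRK"],
    "article_id": ["AR0126286"],
})

# Calling to_csv() without a path returns the CSV text instead of writing a file
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

# Reading each version back shows the extra column the saved index creates
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)
print(cols_without)
```

When the index is saved, it comes back as an extra `Unnamed: 0` column the next time the file is read.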