# North Korean News

Scrape the North Korean news agency http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at response.text to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
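
The last few tips can be sketched together as follows. This is a minimal example, assuming a DataFrame with an `onclick` column; the sample values and the `article_id` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the site's onclick values
sample = pd.DataFrame({
    "onclick": [
        'fn_showArticle("AR0125885", "", "NT00", "L")',
        'fn_showArticle("AR0126295", "", "NT21", "L")',
    ]
})

# \( matches a literal parenthesis; (.*?) lazily captures up to the next quote.
# expand=False returns a Series, so it can be assigned to a single column.
sample["article_id"] = sample["onclick"].str.extract(
    r'fn_showArticle\("(.*?)"', expand=False
)

# For contrast: the greedy (.*) runs on to the last quote in the string
greedy = sample["onclick"].str.extract(r'fn_showArticle\("(.*)"', expand=False)

print(sample["article_id"].tolist())  # ['AR0125885', 'AR0126295']
print(greedy[0])                      # AR0125885", "", "NT00", "L
```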

#### Imports and setup

In [43]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

I made a few assumptions:
* We want the English site
* There are multiple lists on the page, so get them separately

In [34]:
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf'
data = {
    'lang': 'eng'
}

response = requests.post(url, data=data, verify=False)
doc = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning
doc


<html>
<head>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<script language="javascript">
	var globalContextPath = "";
	var jsLangCode = "eng";
	var flashPlayer = "/download/FlashPlayer10.zip";
	var gYearStr = "Juche";
</script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>
<style>
	body {
	
		font-family: Tahoma, serif, Arial, Helvetica;	
	
}
</style>
<!--[if IE]> 
	<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<!--[if IE 6]> 
	<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<script language="javascript" src="/sys/js/comjs.js"></scrip

In [40]:
# get all articles under titlebet
sidebars = doc.find_all('a', class_='titlebet')
sidebars
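
To try the selector without hitting the network, it can help to run it on a small inline snippet first. The markup below is hypothetical, modeled only on the `onclick` values that appear in this notebook; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with the class we want, one without
html = '''
<ul>
  <li><a class="titlebet" onclick='fn_showArticle("AR0126295", "", "NT21", "L")'>Greetings to Seychellois President</a></li>
  <li><a class="other" onclick='fn_showArticle("AR0000001", "", "NT21", "L")'>Ignored link</a></li>
</ul>
'''

sample_doc = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class
links = sample_doc.find_all("a", class_="titlebet")

for link in links:
    # .text is the headline; the onclick attribute holds the JavaScript call
    print(link.text, "|", link["onclick"])
```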

In [41]:
# obtain from text
headline = ''

# obtain from onclick fn_showArticle "AR0125885"
# assumption: will always start with AR and end with a number.
article = ''

# list to write into csv
rows = []

for sidebar in sidebars:
    headline = sidebar.text
    onclick = sidebar['onclick']
    # \d+ requires at least one digit after "AR"
    article = re.findall(r"AR\d+", onclick)[0]
    d = {
        'headline':headline,
        'onclick':onclick,
        'article-id':article
    }
    rows.append(d)
rows

[{'headline': 'Greetings to Seychellois President',
  'onclick': 'fn_showArticle("AR0126295", "", "NT21", "L")',
  'article-id': 'AR0126295'},
 {'headline': "Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un",
  'onclick': 'fn_showArticle("AR0126287", "", "NT21", "L")',
  'article-id': 'AR0126287'},
 {'headline': 'Xi Jinping to Visit DPRK',
  'onclick': 'fn_showArticle("AR0126286", "", "NT21", "L")',
  'article-id': 'AR0126286'},
 {'headline': 'Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex',
  'onclick': 'fn_showArticle("AR0126281", "", "NT21", "L")',
  'article-id': 'AR0126281'},
 {'headline': 'Double-Dealing Tactics Will Never Work: KCNA Commentary',
  'onclick': 'fn_showArticle("AR0126277", "", "NT21", "L")',
  'article-id': 'AR0126277'},
 {'headline': 'Central Election Guidance Committee Formed in DPRK',
  'onclick': 'fn_showArticle("AR0126169", "", "NT21", "L")',
  'article-id': 'AR0126169'},
 {'headline': 'Congratulations to President of Kazakhstan',

In [44]:
# need a dictionary for each row, and then a list of them all
df = pd.DataFrame(rows)
df.head()

|   | article-id | headline | onclick |
|---|---|---|---|
| 0 | AR0126295 | Greetings to Seychellois President | fn_showArticle("AR0126295", "", "NT21", "L") |
| 1 | AR0126287 | Venezuelan City's Highest Order Awarded to Sup... | fn_showArticle("AR0126287", "", "NT21", "L") |
| 2 | AR0126286 | Xi Jinping to Visit DPRK | fn_showArticle("AR0126286", "", "NT21", "L") |
| 3 | AR0126281 | Kim Jae Ryong Inspects Tokchon Area Coal-Minin... | fn_showArticle("AR0126281", "", "NT21", "L") |
| 4 | AR0126277 | Double-Dealing Tactics Will Never Work: KCNA C... | fn_showArticle("AR0126277", "", "NT21", "L") |


In [45]:
df.to_csv("nk-news.csv", index=False, header=False)
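
One way to confirm that the index was left out is to read the CSV back and look at its columns. The sketch below is an assumption-laden stand-in: it uses a one-row sample and an in-memory buffer instead of the real file, and it keeps the header row so the columns can be checked by name.

```python
import io
import pandas as pd

# Hypothetical one-row sample in the same shape as `rows`
sample_rows = [{
    "headline": "Xi Jinping to Visit DPRK",
    "onclick": 'fn_showArticle("AR0126286", "", "NT21", "L")',
    "article-id": "AR0126286",
}]
sample_df = pd.DataFrame(sample_rows)

# Write to an in-memory buffer rather than disk, without the index
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Read it back; a saved index would show up as an "Unnamed: 0" column
buffer.seek(0)
check = pd.read_csv(buffer)
print(list(check.columns))
```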

I misread the instructions and thought I needed the real link rather than just the `onclick` value, so I also generated a second file that includes it.

In [59]:
# same as before
headline1 = ''
article1 = ''
rows1 = []

# Obtained from the browser inspector
baseurl = 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf'

# The rest of the URL is as follows:
# ?article_code=AR0126295&article_type_list=&news_type_code=NT21&show_what=L&mediaCode=&lang=

for sidebar in sidebars:
    headline1 = sidebar.text
    # capture the ID itself, so there is no trailing quote to slice off
    article1 = re.search(r'(AR\d+)"', sidebar['onclick']).group(1)
    link = baseurl + "?article_code=" + article1 + "&article_type_list=&news_type_code=NT21&show_what=L&mediaCode=&lang="

    d = {
        'headline':headline1,
        'link':link,
        'article-id':article1
    }
    rows1.append(d)
rows1 

[{'headline': 'Greetings to Seychellois President',
  'link': 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf?article_code=AR0126295&article_type_list=&news_type_code=NT21&show_what=L&mediaCode=&lang=',
  'article-id': 'AR0126295'},
 {'headline': "Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un",
  'link': 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf?article_code=AR0126287&article_type_list=&news_type_code=NT21&show_what=L&mediaCode=&lang=',
  'article-id': 'AR0126287'},
 {'headline': 'Xi Jinping to Visit DPRK',
  'link': 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf?article_code=AR0126286&article_type_list=&news_type_code=NT21&show_what=L&mediaCode=&lang=',
  'article-id': 'AR0126286'},
 {'headline': 'Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex',
  'link': 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf?article_code=AR0126281&article_type_list=&news_type_code=NT21&show_what=L&med

In [50]:
df = pd.DataFrame(rows1)
df.head()

|   | article-id | headline | link |
|---|---|---|---|
| 0 | AR0126295 | Greetings to Seychellois President | http://kcna.kp/kcna.user.article.retrieveNewsV... |
| 1 | AR0126287 | Venezuelan City's Highest Order Awarded to Sup... | http://kcna.kp/kcna.user.article.retrieveNewsV... |
| 2 | AR0126286 | Xi Jinping to Visit DPRK | http://kcna.kp/kcna.user.article.retrieveNewsV... |
| 3 | AR0126281 | Kim Jae Ryong Inspects Tokchon Area Coal-Minin... | http://kcna.kp/kcna.user.article.retrieveNewsV... |
| 4 | AR0126277 | Double-Dealing Tactics Will Never Work: KCNA C... | http://kcna.kp/kcna.user.article.retrieveNewsV... |


In [51]:
df.to_csv("nk-news_with-links.csv", index=False, header=False)
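
As an alternative to joining the link together by hand, the query string can be built with the standard library's `urllib.parse.urlencode`. This is only a sketch: `article_url` is a hypothetical helper, and the parameter values simply mirror the URL observed in the inspector.

```python
from urllib.parse import urlencode

baseurl = 'http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf'

def article_url(article_id, news_type="NT21"):
    """Build an article link from its ID (hypothetical helper)."""
    # Keys are listed in the order seen in the inspector; empty values stay empty
    params = {
        "article_code": article_id,
        "article_type_list": "",
        "news_type_code": news_type,
        "show_what": "L",
        "mediaCode": "",
        "lang": "",
    }
    return baseurl + "?" + urlencode(params)

print(article_url("AR0126295"))
```

Keeping the parameters in a dictionary makes it easier to change one of them later, for example `news_type_code` if the links come from a different list.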