# North Korean News

Scrape the website of the North Korean news agency at http://kcna.kp.

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

Extracting the article ID is easiest using pandas. When you save the CSV, be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
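The last three tips can be put together in a short sketch. It assumes the `onclick` values are already in a DataFrame column; the two sample strings follow the format shown above, and the pattern is one possible way to write it, not the only one.

```python
import pandas as pd

# Sample onclick values in the format used by the site
df = pd.DataFrame({
    'onclick': [
        'fn_showArticle("AR0125885", "", "NT00", "L")',
        'fn_showArticle("AR0126313", "", "NT21", "L")',
    ]
})

# \( matches a literal parenthesis.
# (.*?) lazily captures everything up to the next double quote.
# expand=False returns a Series, so it can be assigned to a single column.
df['id'] = df['onclick'].str.extract(r'fn_showArticle\("(.*?)"', expand=False)

print(df['id'].tolist())
```

With a greedy `(.*)` in place of `(.*?)`, the capture would run to the last double quote in the string and pick up the other arguments as well.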

In [45]:
import requests
from bs4 import BeautifulSoup

In [56]:
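# The jsessionid in this URL belongs to one browser session, so it may need to be replaced
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=4EFB48D366976C8A10D5BD46E8E423BB'
response = requests.get(url, verify=False)
# Parse the returned HTML
doc = BeautifulSoup(response.text, 'html.parser')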

In [57]:
#print(doc.prettify())
# Each headline is an <a> tag with the class "titlebet"
headlines = doc.find_all('a', class_='titlebet')
for headline in headlines:
    print(headline.text)

Delegation of Koreans in Japan Pay Tribute to Statues of President Kim Il Sung, Chairman Kim Jong Il
Greetings to Seychellois President
Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un
Xi Jinping to Visit DPRK
Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex
Double-Dealing Tactics Will Never Work: KCNA Commentary
Central Election Guidance Committee Formed in DPRK
Congratulations to President of Kazakhstan
Greetings to Philippine President
KCNA Commentary Rebukes Japan's Attachment to "Flag of Rising Sun Shedding Rays"
Results of 2018-2019 DPRK Premier Football League (14)
Road Relay Held in DPRK
Results of 2018-2019 DPRK Women's Premier Football League (12)
Results of 2018-2019 DPRK Premier Football League (13)
Results of 2018-2019 DPRK Women's Premier Football League (11)
Pyongyang International Sci-Tech Exhibition of Health and Medical Appliances Opens
Fictions and Models - 2019 Held
Tour by Super-light Planes Popular in DPRK
MOU Signed between DPRK and Ru

In [64]:
# The onclick attribute holds the JavaScript call that opens the article
for headline in headlines:
    print(headline.attrs['onclick'])

fn_showArticle("AR0126313", "", "NT21", "L")
fn_showArticle("AR0126295", "", "NT21", "L")
fn_showArticle("AR0126287", "", "NT21", "L")
fn_showArticle("AR0126286", "", "NT21", "L")
fn_showArticle("AR0126281", "", "NT21", "L")
fn_showArticle("AR0126277", "", "NT21", "L")
fn_showArticle("AR0126169", "", "NT21", "L")
fn_showArticle("AR0126127", "", "NT21", "L")
fn_showArticle("AR0126118", "", "NT21", "L")
fn_showArticle("AR0126117", "", "NT21", "L")
fn_showArticle("AR0126278", "", "NT11", "L")
fn_showArticle("AR0126227", "", "NT11", "L")
fn_showArticle("AR0126226", "", "NT11", "L")
fn_showArticle("AR0126058", "", "NT11", "L")
fn_showArticle("AR0126009", "", "NT11", "L")
fn_showArticle("AR0126283", "", "NT09", "L")
fn_showArticle("AR0125978", "", "NT09", "L")
fn_showArticle("AR0124762", "", "NT19", "L")
fn_showArticle("AR0122941", "", "NT19", "L")
fn_showArticle("AR0121949", "", "NT19", "L")
fn_showArticle("AR0121709", "", "NT19", "L")
fn_showArticle("AR0120090", "", "NT19", "L")
fn_showArt

In [78]:
# The ID sits at the same position in every onclick string, so a slice pulls it out
for headline in headlines:
    print(headline.attrs['onclick'][16:25])

AR0126313
AR0126295
AR0126287
AR0126286
AR0126281
AR0126277
AR0126169
AR0126127
AR0126118
AR0126117
AR0126278
AR0126227
AR0126226
AR0126058
AR0126009
AR0126283
AR0125978
AR0124762
AR0122941
AR0121949
AR0121709
AR0120090
AR0124586
AR0126163
AR0126072
AR0125945
AR0126135
AR0126134
AR0126098
AR0125886
AR0125876
AR0125857
AR0126284
AR0126314
AR0126313
AR0126297
AR0126296
AR0126295
AR0126293
AR0126292
AR0126291
AR0126290
AR0126289
AR0126288
AR0126287
AR0126286
AR0126283
AR0126282
AR0126281
AR0126255
AR0126279
AR0126278
AR0126277
AR0126237
AR0126262
AR0126260
AR0126258
AR0126257
AR0126256
AR0126234
AR0126232
AR0126231
AR0126233
AR0126222
AR0125917
AR0125557
AR0125472
AR0125350
AR0125232
AR0126284
AR0126295
AR0126290
AR0126289
AR0126288
AR0126287
AR0126286
AR0126283
AR0126290
AR0126289
AR0126287
AR0126257
AR0126232
AR0126230
AR0126288
AR0126258
AR0126256
AR0126231
AR0126200
AR0126222
AR0125917
AR0125884
AR0125701
AR0125557
AR0125350
AR0124795
AR0126293
AR0126292
AR0126262
AR0126260
AR0126234


In [83]:
# We start off with zero things in our database, then add one dictionary per headline
rows = []
for headline in headlines:
    row = {}
    row['title'] = headline.text
    row['onclick'] = headline.attrs['onclick']
    # Characters 16 through 24 of the onclick string are the article ID
    row['id'] = headline.attrs['onclick'][16:25]
    #print(row)
    rows.append(row)

In [84]:
import pandas as pd
df = pd.DataFrame(rows)
df.head()

| | id | onclick | title |
|---|---|---|---|
| 0 | AR0126313 | `fn_showArticle("AR0126313", "", "NT21", "L")` | Delegation of Koreans in Japan Pay Tribute to ... |
| 1 | AR0126295 | `fn_showArticle("AR0126295", "", "NT21", "L")` | Greetings to Seychellois President |
| 2 | AR0126287 | `fn_showArticle("AR0126287", "", "NT21", "L")` | Venezuelan City's Highest Order Awarded to Sup... |
| 3 | AR0126286 | `fn_showArticle("AR0126286", "", "NT21", "L")` | Xi Jinping to Visit DPRK |
| 4 | AR0126281 | `fn_showArticle("AR0126281", "", "NT21", "L")` | Kim Jae Ryong Inspects Tokchon Area Coal-Minin... |
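The assignment asks for a file called `nk-news.csv` saved without the index, and the cells above stop before that step. A minimal sketch of the save follows. It uses a single hand-written row as a stand-in for the scraped `rows` list so that it runs without a network connection.

```python
import pandas as pd

# Stand-in for the scraped rows list (one row copied from the output above)
rows = [
    {
        'title': 'Greetings to Seychellois President',
        'onclick': 'fn_showArticle("AR0126295", "", "NT21", "L")',
        'id': 'AR0126295',
    },
]
df = pd.DataFrame(rows)

# index=False leaves out the row numbers, which would otherwise
# come back as an "Unnamed: 0" column when the file is read again
df.to_csv('nk-news.csv', index=False)

# Read the file back to confirm that only the three columns were saved
check = pd.read_csv('nk-news.csv')
print(check.columns.tolist())
```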
