# North Korean News

Scrape the website of the North Korean news agency at http://kcna.kp.

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
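
The regular-expression tips above can be tried out on a single string before touching the scraped data. This is a minimal sketch using Python's built-in `re` module and the sample `onclick` value from the assignment text; the variable names `lazy` and `greedy` are only for illustration.

```python
import re

# Sample onclick value copied from the assignment text
onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal "(", and (.*?) captures as little as possible
# before the next double quote
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

# Without the ?, .* is greedy and runs up to the LAST double quote
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

print(lazy)    # AR0125885
print(greedy)  # AR0125885", "", "NT00", "L
```

The same pattern should carry over to pandas' `.str.extract`, which returns whatever the capture group matches.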

In [9]:
import requests

from selenium import webdriver

In [10]:
driver = webdriver.Chrome()

In [12]:
driver.get("http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=99311A250357D4ADC047F4970E1D284C")

In [13]:
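# find_elements_by_class_name was removed in newer Selenium 4 releases;
# "class name" is the locator string behind By.CLASS_NAME
headlines = driver.find_elements("class name", "titlebet")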
for headline in headlines:
    print(headline.text.strip())

Pyongyang Bag Factory
Autumn Ploughing Finishes in DPRK
Tanchon City
Overhaul of Irrigation Facilities Brisk in DPRK
Boost in Greenhouse Vegetable Production Witnessed in DPRK
Monthly Plans Fulfilled in Various Economic Sectors
Top-class Emergency Anti-epidemic Measures Taken across DPRK
Thaechon Terrapin Farm Inaugurated
Many Youth Workteams Reap Bumper Harvests in DPRK
Tongraegang Reservoir Completed in DPRK
4th Session of 14th SPA to Be Convened
12th Plenary Meeting of 14th Presidium of DPRK Supreme People's Assembly Held
DPRK Premier Inspects Various Industrial Establishments
Socialist Fairyland Villages Appeared in Komdok Area of South Hamgyong Province
New Hostel Built in Sinuiju Textile Mill
Supreme Leader Kim Jong Un Receives Reply Message from Syrian President
People Move to New Dwelling Houses in Orang and Hochon Counties
Heroic Myth Created in Rehabilitation Campaign in North, South Hamgyong Provinces
Heroic Feats Performed by Divisions of Elite Party Members from Capital Ci

In [14]:
# Same elements as before, but this time print the onclick attribute
headlines = driver.find_elements("class name", "titlebet")
for headline in headlines:
    print(headline.get_attribute('onclick'))

fn_showArticle("AR0140387", "", "NT60", "L")
fn_showArticle("AR0140401", "", "NT60", "L")
fn_showArticle("AR0140379", "", "NT60", "L")
fn_showArticle("AR0140383", "", "NT60", "L")
fn_showArticle("AR0140324", "", "NT60", "L")
fn_showArticle("AR0140311", "", "NT60", "L")
fn_showArticle("AR0140310", "", "NT60", "L")
fn_showArticle("AR0140298", "", "NT60", "L")
fn_showArticle("AR0140297", "", "NT60", "L")
fn_showArticle("AR0140274", "", "NT60", "L")
fn_showArticle("AR0140382", "", "NT21", "L")
fn_showArticle("AR0140381", "", "NT21", "L")
fn_showArticle("AR0140240", "", "NT21", "L")
fn_showArticle("AR0140195", "", "NT21", "L")
fn_showArticle("AR0140176", "", "NT21", "L")
fn_showArticle("AR0140127", "", "NT21", "L")
fn_showArticle("AR0140109", "", "NT21", "L")
fn_showArticle("AR0140088", "", "NT21", "L")
fn_showArticle("AR0140065", "", "NT21", "L")
fn_showArticle("AR0140067", "", "NT21", "L")
fn_showArticle("AR0140332", "", "NT11", "L")
fn_showArticle("AR0140315", "", "NT11", "L")
fn_showArt

In [15]:
articles = []

# Build one dict per headline: its visible text and its onclick value
headlines = driver.find_elements("class name", "titlebet")
for headline in headlines:
    article = {
        'headline': headline.text.strip(),
        # Not a real URL: the site uses onclick handlers instead of links
        'url': headline.get_attribute('onclick')
    }
    articles.append(article)

articles

[{'headline': 'Pyongyang Bag Factory',
  'url': 'fn_showArticle("AR0140387", "", "NT60", "L")'},
 {'headline': 'Autumn Ploughing Finishes in DPRK',
  'url': 'fn_showArticle("AR0140401", "", "NT60", "L")'},
 {'headline': 'Tanchon City',
  'url': 'fn_showArticle("AR0140379", "", "NT60", "L")'},
 {'headline': 'Overhaul of Irrigation Facilities Brisk in DPRK',
  'url': 'fn_showArticle("AR0140383", "", "NT60", "L")'},
 {'headline': 'Boost in Greenhouse Vegetable Production Witnessed in DPRK',
  'url': 'fn_showArticle("AR0140324", "", "NT60", "L")'},
 {'headline': 'Monthly Plans Fulfilled in Various Economic Sectors',
  'url': 'fn_showArticle("AR0140311", "", "NT60", "L")'},
 {'headline': 'Top-class Emergency Anti-epidemic Measures Taken across DPRK',
  'url': 'fn_showArticle("AR0140310", "", "NT60", "L")'},
 {'headline': 'Thaechon Terrapin Farm Inaugurated',
  'url': 'fn_showArticle("AR0140298", "", "NT60", "L")'},
 {'headline': 'Many Youth Workteams Reap Bumper Harvests in DPRK',
  'url': 

In [16]:
import pandas as pd

df = pd.DataFrame(articles)
df.head()

| | headline | url |
|---|---|---|
| 0 | Pyongyang Bag Factory | fn_showArticle("AR0140387", "", "NT60", "L") |
| 1 | Autumn Ploughing Finishes in DPRK | fn_showArticle("AR0140401", "", "NT60", "L") |
| 2 | Tanchon City | fn_showArticle("AR0140379", "", "NT60", "L") |
| 3 | Overhaul of Irrigation Facilities Brisk in DPRK | fn_showArticle("AR0140383", "", "NT60", "L") |
| 4 | Boost in Greenhouse Vegetable Production Witne... | fn_showArticle("AR0140324", "", "NT60", "L") |


In [24]:
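# Two word characters followed by seven digits, e.g. AR0140387;
# expand=False returns a Series, which suits a single new column
df['id'] = df['url'].str.extract(r'(\w{2}\d{7})', expand=False)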
df.head()

| | headline | url | id |
|---|---|---|---|
| 0 | Pyongyang Bag Factory | fn_showArticle("AR0140387", "", "NT60", "L") | AR0140387 |
| 1 | Autumn Ploughing Finishes in DPRK | fn_showArticle("AR0140401", "", "NT60", "L") | AR0140401 |
| 2 | Tanchon City | fn_showArticle("AR0140379", "", "NT60", "L") | AR0140379 |
| 3 | Overhaul of Irrigation Facilities Brisk in DPRK | fn_showArticle("AR0140383", "", "NT60", "L") | AR0140383 |
| 4 | Boost in Greenhouse Vegetable Production Witne... | fn_showArticle("AR0140324", "", "NT60", "L") | AR0140324 |
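
To see what the `expand=False` tip changes, here is a small sketch on two of the `onclick` values shown above. The names `pattern`, `as_frame` and `as_series` are made up for this example, and the pattern is just one possible way to anchor on the opening `("`.

```python
import pandas as pd

# Two onclick values copied from the output above
df = pd.DataFrame({
    "url": [
        'fn_showArticle("AR0140387", "", "NT60", "L")',
        'fn_showArticle("AR0140401", "", "NT60", "L")',
    ]
})

# \( is a literal parenthesis; the group captures 2 word characters + 7 digits
pattern = r'\("(\w{2}\d{7})"'

# Default (expand=True) gives a one-column DataFrame
as_frame = df["url"].str.extract(pattern)

# expand=False gives a Series, ready to assign as one column
as_series = df["url"].str.extract(pattern, expand=False)

df["id"] = as_series
print(type(as_frame).__name__, type(as_series).__name__)
print(df["id"].tolist())
```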


In [23]:
# Save under the filename the assignment asks for, without the index column
df.to_csv('nk-news.csv', index=False)
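
As a quick check on why the index should be left out, the sketch below writes a one-row sample DataFrame to in-memory buffers instead of a real file and compares the header lines. The sample row and the buffer names are assumptions for illustration only.

```python
import io
import pandas as pd

# One-row sample using values from the output above
df = pd.DataFrame({"headline": ["Pyongyang Bag Factory"], "id": ["AR0140387"]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))     # ',headline,id'
print(repr(header_without))  # 'headline,id'

# Reading the first version back shows the stray column by name
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'headline', 'id']
```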