## Scraping the Original Articles
This notebook scrapes the original news articles whose links are contained in the EUvsDisinfo database. The content of each article is then saved as a .txt file, sorted into folders by language.

Note that the scraper only downloads the plain-text content of the article itself, using the `newspaper` package. In most cases the package extracts the correct content, only occasionally including a few sentences at the beginning that are not part of the article.
Also, a large share of the news articles could not be scraped, either because the article had been removed or because of technical issues.
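
Since the data set also records a `webArchiveUrl` for many articles, one possible way to recover removed pages is to retry with the archived copy. The following is only a minimal sketch and not part of the original notebook: `fetch_with_fallback` and its `fetch` parameter are hypothetical names, and `fetch` stands in for any function that returns the article text or raises on failure (for example a thin wrapper around `newspaper`).

```python
def fetch_with_fallback(urls, fetch):
    """Try each URL in order and return (text, url) for the first success.

    `fetch` is a hypothetical callable that returns the article text
    or raises an exception; missing URLs (None, NaN, "") are skipped.
    """
    last_error = None
    for url in urls:
        # Skip empty entries such as a missing webArchiveUrl.
        if not isinstance(url, str) or not url:
            continue
        try:
            return fetch(url), url
        except Exception as exc:  # any failure moves on to the next URL
            last_error = exc
    raise RuntimeError(f"all URLs failed: {last_error!r}")


# Illustration with a fake fetcher, so nothing is downloaded here.
def fake_fetch(url):
    if "web.archive.org" not in url:
        raise IOError("page removed")
    return "archived article text"


text, used_url = fetch_with_fallback(
    ["https://example.org/removed", float("nan"), "https://web.archive.org/web/x"],
    fake_fetch,
)
```

In the notebook's terms, the list of candidates would be the `url` and `webArchiveUrl` values of one row, tried in that order.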

In [2]:
import pandas as pd

# Load the article metadata; the explicit names replace the CSV header row.
news_articles = pd.read_csv("all_news_articles.csv", index_col=0,
                            names=["newsarticle_id", "type", "url", "name", "datePublished", "author", "claim", "webArchiveUrl", "abstract", "inLanguage"],
                            skiprows=1)

In [15]:
langs = pd.read_csv("all_languages.csv", index_col=0,
                   names=["id", "type", "name", "alternateName"], skiprows=1)

In [17]:
langs.head(3)

Unnamed: 0,id,type,name,alternateName
1,/languages/1,http://schema.org/Language,Arabic,ara
2,/languages/2,http://schema.org/Language,Armenian,arm
3,/languages/3,http://schema.org/Language,Russian,rus


In [21]:
news_articles = pd.merge(news_articles, langs[["id", "name"]], how="left", left_on="inLanguage", right_on="id", suffixes=("", "_lang"))

In [22]:
news_articles.head()

Unnamed: 0,newsarticle_id,type,url,name,datePublished,author,claim,webArchiveUrl,abstract,inLanguage,id,name_lang
0,121,http://schema.org/NewsArticle,https://sputnik-news.ee/politics/20191226/1885...,,,/organizations/90,/claims/20,https://web.archive.org/web/20200107173925/htt...,Eesti võimude tegevuse taga Sputniku tagakiusa...,/languages/9,/languages/9,Estonian
1,122,http://schema.org/NewsArticle,https://sputnik-news.ee/estonian_news/20191228...,,,/organizations/90,/claims/21,https://web.archive.org/web/20200107170137/htt...,"Siinkohal sooviksin lisada, et Sputnik Eesti v...",/languages/9,/languages/9,Estonian
2,123,http://schema.org/NewsArticle,https://sputnik-news.ee/estonian_news/20191231...,,,/organizations/90,/claims/22,,Peame Eesti režiimi tegevust oma riigi kodanik...,/languages/9,/languages/9,Estonian
3,126,http://schema.org/NewsArticle,https://sputnik-news.ee/estonian_news/20200101...,,,/organizations/90,/claims/24,https://web.archive.org/web/20200107144351/htt...,"Meie ainus ""süü"" on selles, et oleme ajakirjan...",/languages/9,/languages/9,Estonian
4,195,http://schema.org/NewsArticle,https://az.sputniknews.ru/russia/20191226/4227...,,,/organizations/125,/claims/35,https://web.archive.org/web/20200106061011/htt...,"""Вот эта травля, настоящая травля, которая сей...",/languages/3,/languages/3,Russian


In [23]:
news_articles.name_lang.value_counts()

Russian               899
English               299
Czech                 189
German                143
blr                   140
Arabic                140
Italian               116
Polish                100
Spanish, Castilian     84
Bosnian                70
Georgian               63
French                 61
Armenian               36
Hungarian              29
Serbian                22
Slovak                 20
Romanian               15
Estonian               14
Ukrainian              14
mda                    12
Azerbaijani             8
Moldavian               5
mne                     5
Finnish                 5
ltu                     4
deu                     4
rou                     4
bgr                     3
Swedish                 3
Croatian                3
Latvian                 2
Montenegrin             2
lva                     1
Macedonian              1
srb                     1
Lithuanian              1
Name: name_lang, dtype: int64
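
The labels above mix full names (`Spanish, Castilian`) with bare codes (`blr`, `mda`), and the counts sum to 2518 while the frame has 5627 rows, which suggests that many articles have no language label after the left merge. Because the label is later used as a folder name, it may be worth normalising it first. The helper below is a sketch under that assumption; `lang_to_folder` and the `"unknown"` fallback are hypothetical and not part of the original notebook.

```python
import re


def lang_to_folder(label):
    """Turn a language label into a lowercase, filesystem-friendly folder name."""
    # A left merge yields NaN (a float) when no language matched.
    if not isinstance(label, str) or not label.strip():
        return "unknown"
    # Replace every run of non-alphanumeric characters with one underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", label.strip().lower())
    return slug.strip("_") or "unknown"


examples = ["Spanish, Castilian", "blr", float("nan")]
folders = [lang_to_folder(label) for label in examples]
```

Applied to the merged frame, this could be used as `news_articles.name_lang.map(lang_to_folder)` before the folders are created.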

In [4]:
news_articles.shape

(5627, 10)

In [6]:
# Keep only the numeric part of each article ID; expand=False returns a Series.
news_articles["newsarticle_id"] = news_articles["newsarticle_id"].str.extract(r"(\d+)", expand=False)

In [7]:
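from newspaper.configuration import Configuration

# Note: this configuration object is not passed to Article below;
# read_url sets keep_article_html directly on each Article.
conf = Configuration()

conf.keep_article_html = True

conf.keep_article_html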

True

In [64]:
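from os import makedirs, path
from urllib.parse import urlsplit

from newspaper import Article

def read_url(url):
    """Download and parse a single article with newspaper."""
    art = Article(url, keep_article_html=True)
    art.download()
    art.parse()
    return art

# IDs of articles that could not be downloaded or parsed.
error_ids = []
def save_text(url, newsarticle_id, lang):
    """Save the article text to texts/<lang>/<id>.txt unless the file already exists."""
    file_path = f"texts/{lang}/{newsarticle_id}.txt"
    if not path.exists(f"texts/{lang}"):
        makedirs(f"texts/{lang}")
    if not path.exists(file_path):
        try:
            art = read_url(url)
            text = art.text
            # Write UTF-8 explicitly: the articles span many scripts.
            with open(file_path, "w", encoding="utf-8") as text_file:
                text_file.write(text)
        except Exception:
            print(f"Could not download ID {newsarticle_id}")
            error_ids.append(newsarticle_id)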
def base_url_from_article(article):
    url = article.url
    return get_base_url(url)

def get_base_url(url):
    split_url = urlsplit(url)
    base_url = split_url.netloc
    return base_url

def read_link_list(link_list):
    articles = []
    for url in link_list:
        articles.append( read_url(url) )
    return articles
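
`save_text` records only the ID of a failed article, so removed pages and technical failures cannot be told apart afterwards. One option is to keep the exception type and message as well. The sketch below assumes an action that raises on failure (such as `read_url`); `run_and_record`, `failures_to_csv` and the fake action are hypothetical names used for illustration only.

```python
import csv
import io


def run_and_record(tasks, action):
    """Call action(*task) for each task and collect failures with their reason."""
    failures = []
    for task in tasks:
        try:
            action(*task)
        except Exception as exc:
            # Keep the exception type and message next to the task's ID (first field).
            failures.append(
                {"id": task[0], "error": type(exc).__name__, "message": str(exc)}
            )
    return failures


def failures_to_csv(failures):
    """Serialise the failure records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "error", "message"])
    writer.writeheader()
    writer.writerows(failures)
    return buffer.getvalue()


# Illustration with a fake action that fails for one ID.
def fake_action(newsarticle_id, url):
    if newsarticle_id == "493":
        raise ConnectionError("timed out")


failures = run_and_record(
    [("121", "https://example.org/a"), ("493", "https://example.org/b")],
    fake_action,
)
report = failures_to_csv(failures)
```

Grouping the resulting records by the `error` field would then give a rough picture of why downloads failed.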

In [65]:
link_list = news_articles.url
id_list = news_articles.newsarticle_id
langs = news_articles.name_lang

In [66]:
newsarticle_list = zip(link_list, id_list, langs)

In [67]:
for url, newsarticle_id, lang in newsarticle_list:
    save_text(url, newsarticle_id, lang)

Could not download ID 493
Could not download ID 658
Could not download ID 659
Could not download ID 660
Could not download ID 663
Could not download ID 664
Could not download ID 668
Could not download ID 708
Could not download ID 729
Could not download ID 730
Could not download ID 731
Could not download ID 743
Could not download ID 784
Could not download ID 791
Could not download ID 793
Could not download ID 797
Could not download ID 798
Could not download ID 820
Could not download ID 824
Could not download ID 829
Could not download ID 1640
Could not download ID 1852
Could not download ID 1988
Could not download ID 2047
Could not download ID 2074
Could not download ID 2120
Could not download ID 2151
Could not download ID 2174
Could not download ID 2237
Could not download ID 2287
Could not download ID 2522
Could not download ID 2541
Could not download ID 2781
Could not download ID 2786
Could not download ID 2788
Could not download ID 2953
Could not download ID 3238
Could not download ID