Arpitha Gurumurthy <br>
Team: Amalgam
## **Data Collection for Hyperpartisan News Detection**
The SemEval-2019 task Hyperpartisan News Detection asks whether a news article is hyperpartisan, given the text of the article. The dataset has two parts:
* The first part is labeled per publisher: every article inherits the label of its publisher
* The second part is crowdsourced and labeled per article

This notebook extracts the second part, the `byarticle` configuration.

The HuggingFace `datasets` library, which provides an API for downloading and accessing datasets, is used to fetch it.

*Using GloVe pretrained 50d vectors*




In [None]:
!pip install datasets

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
%cd /content/gdrive/My Drive/hyperpartisan/

/content/gdrive/My Drive/hyperpartisan


In [None]:
import datasets
# Only load_dataset is used below; list_datasets, list_metrics and load_metric are not needed in this notebook.
from datasets import load_dataset
from bs4 import BeautifulSoup
import bleach
import re
import torch

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
dataset = datasets.load_dataset('hyperpartisan_news_detection', 'byarticle')

Downloading and preparing dataset hyperpartisan_news_detection/byarticle (download: 976.91 KiB, generated: 2.67 MiB, post-processed: Unknown size, total: 3.63 MiB) to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/b468a79d33f3dd3c95ece4a2f9b5c8f8ddc6046747cbb7d50f76e49a2e4dd828...


Dataset hyperpartisan_news_detection downloaded and prepared to /root/.cache/huggingface/datasets/hyperpartisan_news_detection/byarticle/1.0.0/b468a79d33f3dd3c95ece4a2f9b5c8f8ddc6046747cbb7d50f76e49a2e4dd828. Subsequent calls will reuse this data.


In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'title', 'hyperpartisan', 'url', 'published_at'],
        num_rows: 645
    })
})


In [None]:
print(dataset['train'])

Dataset({
    features: ['text', 'title', 'hyperpartisan', 'url', 'published_at'],
    num_rows: 645
})
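
Before converting to a dataframe, it can help to check how the two classes are distributed. The sketch below is an illustration, not a step from the original notebook: `label_counts` is a hypothetical helper, and the three sample rows are made up so the block runs without downloading anything. Iterating over `dataset['train']` yields one dict per article, so the same helper should work on it directly.

```python
from collections import Counter

def label_counts(rows):
    """Count articles per value of the 'hyperpartisan' label.

    `rows` is any iterable of dicts that have a 'hyperpartisan' key.
    """
    return Counter(row['hyperpartisan'] for row in rows)

# Made-up rows that only mimic the dataset's columns (not real data).
sample_rows = [
    {'title': 'a', 'hyperpartisan': True},
    {'title': 'b', 'hyperpartisan': False},
    {'title': 'c', 'hyperpartisan': True},
]

counts = label_counts(sample_rows)
print(counts[True], counts[False])
```

In the notebook itself the call would presumably be `label_counts(dataset['train'])`.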


## **Convert to dataframe**

In [None]:
print("Size of train dataset: ", dataset['train'].shape)

Size of train dataset:  (645, 5)


In [None]:
import pandas as pd
df_hyperpartisan = pd.DataFrame.from_dict(dataset['train'])
df_hyperpartisan

Unnamed: 0,hyperpartisan,published_at,text,title,url
0,True,2017-09-10,"<p>Money ( <a href=""https://farm8.static.flick...",Kucinich: Reclaiming the money power,https://www.opednews.com/articles/Kucinich-Rec...
1,True,2017-10-12,<p>Donald Trump ran on many braggadocios and l...,Trump Just Woke Up & Viciously Attacked Puerto...,http://bipartisanreport.com/2017/10/12/trump-j...
2,True,2017-10-11,<p>In response to Joyce Newman&#8217;s recent ...,"Liberals wailing about gun control, but what a...",https://www.reviewjournal.com/opinion/letters/...
3,True,2017-09-24,<p>After Colin Kaepernick rightly chose to kne...,Laremy Tunsil joins NFL players in kneeling du...,https://www.redcuprebellion.com/2017/9/24/1635...
4,False,2017-10-12,"<p>Almost a half-century ago, in 1968, the Uni...",It's 1968 All Over Again,https://www.realclearpolitics.com/articles/201...
...,...,...,...,...,...
640,True,2017-03-03,"<a type=""internal"" /> Donald Trump. Photo from...",Trump Turns his Back on American Workers,http://urbanmilwaukee.com/pressrelease/trump-t...
641,False,2017-09-05,<p>President Donald Trump on Tuesday began dis...,"Cummins: Rescinding DACA ‘discriminatory, harm...",http://www.therepublic.com/2017/09/05/cummins-...
642,False,2017-12-05,<p>The US Supreme Court has ruled that Donald ...,"Trump travel ban can be enforced, says US Supr...",http://www.theweek.co.uk/90182/trump-travel-ba...
643,False,2017-10-18,"<p>Ex-FBI Director James Comey went rogue, acc...",VIDEO- AG SESSIONS: Comey Went Rogue In Hillar...,http://truepundit.com/video-ag-sessions-comey-...


## **Data Pre-processing**

In [None]:
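def clean_text(text):
    """Strip HTML markup, leftover entities and URL-only lines from an article body."""
    # Remove every tag that is not on bleach's default allow-list (this includes <p>).
    text = bleach.clean(text, strip=True)
    text = text.replace('<p>', '').replace('</p>', '')
    text = text.replace('&amp;#160;', '')
    # Drop lines that start with a URL while line boundaries still exist;
    # after the newlines are gone the pattern could swallow the whole text.
    text = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Replace the remaining newlines with spaces so adjacent words are not fused.
    text = text.replace('\n', ' ')
    return text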

In [None]:
def remove_html_tags(text):
    """Remove HTML tags from a string."""
    # `re` is already imported at the top of the notebook.
    return re.sub(r'<.*?>', '', text)
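
The pattern `<.*?>` in `remove_html_tags` works line by line: because `.` does not match a newline, a tag that spans a line break is left in place, and any plain text that happens to sit between `<` and `>` is deleted. As a hedged alternative, the sketch below uses the standard library's `html.parser` to keep only text nodes. `_TextExtractor` and `strip_tags` are hypothetical names introduced for this illustration, and the sample string is made up.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=False leaves references such as &#8217; untouched,
        # which matches what the regex version leaves in the text.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Re-emit named references (e.g. &amp;) unchanged.
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        # Re-emit numeric references (e.g. &#8217;) unchanged.
        self.parts.append(f'&#{name};')

def strip_tags(text):
    """Return `text` with all HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)

sample = '<p>Money ( <a href="https://example.com">Image</a> by 401(K) 2013)</p>'
print(strip_tags(sample))
```

If this variant were preferred, it could be applied the same way as the regex version, with `df_hyperpartisan['text'].apply(strip_tags)`.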

In [None]:
df_hyperpartisan['text'] = df_hyperpartisan['text'].apply(remove_html_tags)

In [None]:
df_hyperpartisan.head()

Unnamed: 0,hyperpartisan,published_at,text,title,url
0,True,2017-09-10,Money ( Image by 401(K) 2013) Permission Detai...,Kucinich: Reclaiming the money power,https://www.opednews.com/articles/Kucinich-Rec...
1,True,2017-10-12,Donald Trump ran on many braggadocios and larg...,Trump Just Woke Up & Viciously Attacked Puerto...,http://bipartisanreport.com/2017/10/12/trump-j...
2,True,2017-10-11,In response to Joyce Newman&#8217;s recent let...,"Liberals wailing about gun control, but what a...",https://www.reviewjournal.com/opinion/letters/...
3,True,2017-09-24,After Colin Kaepernick rightly chose to kneel ...,Laremy Tunsil joins NFL players in kneeling du...,https://www.redcuprebellion.com/2017/9/24/1635...
4,False,2017-10-12,"Almost a half-century ago, in 1968, the United...",It's 1968 All Over Again,https://www.realclearpolitics.com/articles/201...
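
The preview above still shows a character reference, `&#8217;`, in row 2. One way to turn such references into readable characters is `html.unescape` from the standard library. This is a sketch rather than a step from the original notebook: `decode_entities` is a hypothetical helper, and replacing non-breaking spaces with ordinary spaces is an assumption about what downstream processing expects.

```python
import html

def decode_entities(text):
    """Convert HTML character references (e.g. &#8217;) to the characters they stand for."""
    decoded = html.unescape(text)
    # Assumption: a non-breaking space should become a normal space.
    return decoded.replace('\xa0', ' ')

print(decode_entities('Joyce Newman&#8217;s recent'))
```

It could be applied after tag removal with `df_hyperpartisan['text'].apply(decode_entities)`.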


In [None]:
df_hyperpartisan.to_csv('Hyperpartisan_data.csv')

## **References**
* https://towardsdatascience.com/train-a-longformer-for-detecting-hyperpartisan-news-content-7c141230784e
* https://github.com/hyperpartisan-news-challenge/tom-jumbo-grumbo
* https://www.aclweb.org/anthology/S19-2187.pdf
