# Bringing together news articles on hate crimes

Down the road, the goal is to classify articles as hate-crime-related or not.

- Unsupervised approach:
    - Topic modeling!
        - Manually find topics related to hate crimes
        - Requires bulk scraping/downloading of a paper
- Supervised approach
    - Requires labels for articles
        - Find articles that are hate-crime-related, and articles that are not

To me, this supervised approach seems much more feasible
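
To make the supervised idea concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier using only the standard library. Everything in it is an assumption for illustration: the function names (`tokenize`, `train`, `predict`) are hypothetical, and the four training sentences are made up, not real articles. A real attempt would more likely use a library classifier and many more labeled articles.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information, so skip them
    tokens = [w for w in tokenize(text) if w in vocab]
    scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for words unseen in this label
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up examples purely for illustration (not real article text)
examples = [
    ("police investigate attack as hate crime", "hate"),
    ("victims say assault was a hate crime", "hate"),
    ("city council approves subway budget", "other"),
    ("new restaurant opens in brooklyn", "other"),
]
model = train(examples)
print(predict(model, "hate crime attack reported"))
```

The point is only that labels plus word counts are enough to get started; which features and which model work best is an open question.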

## Finding articles related to hate crimes

- @orensegal
    - Director, Center on Extremism
    - Would need to scrape his Twitter feed for links
        - Seems to require the Twitter API, doesn't have an RSS feed
- [ADL in the News](https://www.adl.org/news/media-watch)
    - More of a post-facto ADL response, not event reporting itself
- [NYT hate crimes topic](https://www.nytimes.com/topic/subject/hate-crimes)
    - Plus other news sources that tag their articles (labels!)
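
For the Twitter option above, getting the tweet text is the hard part (API access). Once the text is in hand, pulling links out of it is simple. A rough sketch, assuming the tweets are already available as plain strings; `extract_links` is a hypothetical helper and the example URLs are placeholders:

```python
import re


def extract_links(text):
    """Return the http(s) URLs found in text, in order, without duplicates."""
    raw = re.findall(r"https?://\S+", text)
    # Trailing punctuation usually belongs to the sentence, not the URL
    cleaned = [url.rstrip(".,;:!?)") for url in raw]
    # dict.fromkeys keeps first-seen order while dropping repeats
    return list(dict.fromkeys(cleaned))


# Placeholder text standing in for a tweet
sample = "Report: https://example.org/a, and http://example.com/b."
print(extract_links(sample))
```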

Try the ADL feed first (or NYT?).

Parse that feed, grab articles, then...

### Next steps

- May also need to parse and download news articles that are NOT related to hate crimes.
    - This is if we want to dive into training a classifier
- Dive into term frequency, importance, etc.
    - Doesn't need the non-hate-related articles, but
        - Much less insightful
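
As a starting point for the term-frequency idea, a minimal sketch with the standard library could look like the following. The stopword list is a tiny hand-picked assumption, and `term_frequencies` is a hypothetical name; the sample sentence is made up.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed here for illustration only
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "is", "it"}


def term_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


sample = "The attack was a hate crime; the crime shocked the city."
freqs = term_frequencies(sample)
print(freqs.most_common(3))
```

Raw counts like these say which words are common in hate-crime articles, but not which are distinctive. Comparing against counts from unrelated articles is what would make the numbers more informative.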

## Feed parsing

Try out [feedparser](https://pythonhosted.org/feedparser/introduction.html#parsing-a-feed-from-a-remote-url)!

In [None]:
import feedparser

In [17]:
# NY Times Hate Crimes feed
d = feedparser.parse('https://www.nytimes.com/svc/collections/v1/publish/http://www.nytimes.com/topic/subject/hate-crimes/rss.xml')

In [None]:
print(d['feed']['title'])
print(d['feed']['link'])
print(d['feed']['description'])

In [None]:
for i in range(10):
    print(d['entries'][i]['title'])

In [None]:
for i in range(5):
    print(i)
    print(d['entries'][i]['description'])
    print()

Let's try The Guardian now too.

In [None]:
tg = feedparser.parse('https://www.theguardian.com/society/hate-crime/rss')

In [None]:
print(tg['feed']['title'])
print(tg['feed']['link'])
print(tg['feed']['description'])

In [None]:
for i in range(10):
    print(tg['entries'][i]['title'])

In [None]:
for i in range(2):
    print(i)
    print(tg['entries'][i]['description'])
    print()

The Guardian's descriptions are much longer than the NYT's, but still not a full article.

Beyond these titles and descriptions, the RSS feeds for these sources don't seem to contain the full text for these articles.

Will probably need to download them, and scrape as a separate step.
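
One way to keep downloading and scraping separate is to cache each page's HTML on disk, so that parsing can be rerun without hitting the site again. A sketch under stated assumptions: `cached_fetch` and the `html_cache` directory are hypothetical names, and `fetch` is any function that takes a URL and returns HTML text.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache"):
    """Return the HTML for url, calling fetch(url) only if no cached copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe, fixed-length file name
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

With `requests`, `fetch` could be a small function that calls `requests.get(url)`, then `raise_for_status()`, and returns `res.text`. The parsing step then works from the cached files.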

Other news sources may have fuller RSS feeds.
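
When a feed does carry full text, feedparser exposes it under an entry's `content` list (each item a dict with a `value`), alongside the usual `summary`/`description`. A small sketch for picking whichever field is longest; `best_entry_text` is a hypothetical helper, and the sample entries below are plain dicts standing in for feedparser entries:

```python
def best_entry_text(entry):
    """Return the longest text field available on a feed entry (dict-like)."""
    # Full-text feeds put article bodies in 'content'; many feeds omit it
    candidates = [item.get("value", "") for item in entry.get("content", [])]
    candidates.append(entry.get("summary", ""))
    candidates.append(entry.get("description", ""))
    return max(candidates, key=len)


# Stand-in entries for illustration only
short_entry = {"title": "A", "description": "Short teaser."}
full_entry = {
    "title": "B",
    "summary": "Short teaser.",
    "content": [{"value": "A much longer body with the whole article text."}],
}
print(len(best_entry_text(short_entry)), len(best_entry_text(full_entry)))
```

Comparing the length of this text across sources is a quick way to see which feeds are worth using directly and which still need scraping.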

Okay, let's try to scrape our first article!

In [None]:
import re
import bs4
import requests

In [None]:
d['entries'][0]['link']

In [None]:
res = requests.get(d['entries'][0]['link'])

In [None]:
res.raise_for_status()

In [None]:
# Name the parser explicitly; without it bs4 picks one and emits a warning
starch = bs4.BeautifulSoup(res.text, 'html5lib')

In [None]:
starch.select('#story')

In [None]:
starch.select('#story-body-text')

In [None]:
starch.body.article.get_text(strip=True)

In [None]:
rawt = starch.body.article.get_text(strip=True)

In [None]:
thetext = starch.find_all('p', attrs={'class' : 'story-body-text'})

In [None]:
thetext = starch.find_all(class_ = 'story-body-text')

In [7]:
for tag in thetext:
    print(tag.text.strip())

From Amsterdam to New York, London to Havana, Dutch men across the world held hands this week to show solidarity with a gay couple who say they were brutally beaten in Arnhem, the Netherlands.
The outpouring of support came after the married couple, Jasper Vernes-Sewratan and Ronnie Sewratan-Vernes, said they were attacked by a gang of youths while holding hands on their way home from a party early Sunday.
According to a statement the Arnhem police posted on Facebook, the two said they had been attacked by men wielding bolt cutters; one had some of his teeth smashed out.
Prosecutors said five teenage suspects would be charged on Thursday with serious bodily harm. The authorities are still investigating the motivation for the attack, which the victims have characterized as a hate crime.
The beating caused particular outrage in the Netherlands, which has long prided itself on its tolerance. Amsterdam, the capital, has been a haven for sexual minorities for centuries, and it has marketed 

In [9]:
tag.text.strip()

'It cautioned, however, that the increase in reporting could also be a result of greater awareness of the issue.'

## Putting it all together

In [15]:
import requests
import bs4
import feedparser

# NY Times Hate Crimes feed
nytrss = feedparser.parse(
    'https://www.nytimes.com/svc/collections/v1/publish/http://www.nytimes.com/topic/subject/hate-crimes/rss.xml')

# Slice instead of range(20): the feed can hold fewer than 20 entries,
# and indexing past the end raised "IndexError: list index out of range"
for entry in nytrss['entries'][:20]:
    res = requests.get(entry['link'])
    res.raise_for_status()
    starch = bs4.BeautifulSoup(res.text, 'html5lib')
    # The below is specific to the NYT
    thetext = starch.find_all(class_='story-body-text')

    with open("Output.txt", "a") as text_file:
        for tag in thetext:
            text_file.write(tag.text.strip())
            text_file.write(' ')

        # Insert a line break after each article
        text_file.write('\n\n')

Should also grab some other, random NYT articles, for the other part of the training set.
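
Since both sets of articles get appended to the same Output.txt, nothing records which article came from which feed. One option is to store one JSON record per line with the label attached. This is a sketch with hypothetical names (`append_article`, `load_articles`, and the `label`/`source` fields), not what the cells in this notebook do:

```python
import json


def append_article(path, text, label, source):
    """Append one article as a JSON line, keeping its label and source."""
    record = {"label": label, "source": source, "text": text}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_articles(path):
    """Read back all article records written by append_article."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each scraping loop would then call `append_article` with its own label (for example `"hate"` for the hate-crimes feed and `"other"` for the local news feed) instead of writing raw text.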

In [19]:
# NY Times feed on NY local news
nytrss = feedparser.parse(
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/nyregion/rss.xml')

# Slice instead of range(20): the feed can hold fewer than 20 entries,
# and indexing past the end raised "IndexError: list index out of range"
for entry in nytrss['entries'][:20]:
    res = requests.get(entry['link'])
    res.raise_for_status()
    starch = bs4.BeautifulSoup(res.text, 'html5lib')
    # The below is specific to the NYT
    thetext = starch.find_all(class_='story-body-text')

    with open("Output.txt", "a") as text_file:
        for tag in thetext:
            text_file.write(tag.text.strip())
            text_file.write(' ')

        # Insert a line break after each article
        text_file.write('\n\n')

Let's move on to The Guardian

In [35]:
# Scrape one article, before moving on to whole feed
guardrss = feedparser.parse(
    'https://www.theguardian.com/society/hate-crime/rss')

res = requests.get(guardrss['entries'][0]['link'])
res.raise_for_status()
starch = bs4.BeautifulSoup(res.text, 'html5lib')
# The below is specific to The Guardian
thetext = starch.find_all('div', itemprop='articleBody')

with open("Output.txt", "a") as text_file:
    for tag in thetext:
        string = tag.text.strip()
        string = string.replace('\r', ' ').replace('\n', ' ')
        text_file.write(string)

    # Insert a line break after each article
    # text_file.write('\n\n')

In [36]:
# Guardian feed on hate crime
guardrss = feedparser.parse(
    'https://www.theguardian.com/society/hate-crime/rss')

# Slice instead of range(20) so a feed with fewer than 20 entries doesn't raise IndexError
for entry in guardrss['entries'][:20]:
    res = requests.get(entry['link'])
    res.raise_for_status()
    starch = bs4.BeautifulSoup(res.text, 'html5lib')
    # The below is specific to The Guardian
    thetext = starch.find_all('div', itemprop='articleBody')

    with open("Output.txt", "a") as text_file:
        for tag in thetext:
            string = tag.text.strip()
            string = string.replace('\r', ' ').replace('\n', ' ')
            text_file.write(string)

        # Insert a line break after each article
        text_file.write('\n\n')
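
Replacing `\r` and `\n` with spaces keeps each article on one line, but it can leave runs of spaces and tabs behind. A small alternative sketch (`normalize_whitespace` is a hypothetical helper) collapses every run of whitespace into a single space:

```python
def normalize_whitespace(text):
    """Collapse all runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


print(normalize_whitespace("  First line.\r\n\nSecond   line.\t"))
```

This could stand in for the `strip()` plus `replace()` calls in the loop above, since `split()` also discards leading and trailing whitespace.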
