# Finding the data

## Finding the data

Our overall goal in this chapter is to build a service that knows how to classify
textual data, the kind we encounter every day. We want our service to tell us which
general category a blog post or a news article belongs to. Classifying data like this
can be useful in many ways: building reader profiles in order to serve relevant ads,
recommend products or personalize the content served.


### Existing corpora
There is a limited number of existing corpora that we could use to achieve our goal.

**Check out available corpora**

In [40]:
%%time
# Let's try out Reuters corpus
from nltk.corpus import reuters
# Let's see what are the Reuters categories
print(reuters.categories())
# ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...
# Let's check out the 20 newsgroups dataset

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 225 µs


In [41]:
from sklearn.datasets import fetch_20newsgroups
news20 = fetch_20newsgroups(subset='train')
print(list(news20.target_names))
# ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', ...

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


As you can see, the categories we got aren’t that helpful for our task. They’re
narrow in scope and don’t cover the entire news spectrum as we would like.
These corpora are mostly used for benchmarking rather than in real-world
applications. We can do better by gathering our own data, but bear in mind that
building a corpus is usually a tedious and expensive endeavour. If you start from
scratch, you need to manually go through a lot of news articles and blog posts
and pick the most relevant category for each of them. Most tutorials on text
classification use bogus data, e.g. for doing sentiment analysis, or use an existing
corpus that will produce a useless model with no practical application. In the
following section, we will take another path and explore some other ideas for
gathering data.

## Ideas for Gathering Data

The goal here is to build a tiny general-purpose corpus. This task is complex, but it is
only adjacent to our main goal of building a news classifier, so we will take some
shortcuts. For practical reasons, we could look at these web resources:

- Project Gutenberg Categories [6] – this can prove useful depending on the domain
you are trying to cover. You will need to write a script to download the books,
transform them into plain text and use them to train your classifier.

- Reddit – this is a good source of already categorized data. Pick a list of
subreddits, assign each of them to a category, crawl the subreddits and extract the
links. Consider all the articles extracted from a subreddit as belonging
to the same category. Here’s a list of all subreddits [7].

- Bing Search API [8] – use it to get relevant articles for your categories.
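
The Reddit idea boils down to a lookup table from a source bucket to one of your own categories. Here is a minimal sketch of that step. The subreddit names, the mapping and the `label_links` helper are hypothetical placeholders chosen for illustration, and the crawling itself is not shown.

```python
# Hypothetical mapping from a subreddit name to a category in our own taxonomy.
SUBREDDIT_TO_CATEGORY = {
    "personalfinance": "economics",
    "investing": "economics",
    "nba": "sport",
    "running": "sport",
    "cooking": "cooking",
}


def label_links(links_by_subreddit):
    """Give every link the category of the subreddit it was collected from.

    `links_by_subreddit` maps a subreddit name to the list of article URLs
    gathered from it.
    """
    labelled = []
    for subreddit, links in links_by_subreddit.items():
        category = SUBREDDIT_TO_CATEGORY.get(subreddit.lower())
        if category is None:
            # Skip buckets we have not mapped rather than guessing a label.
            continue
        labelled.extend((url, category) for url in links)
    return labelled
```

Skipping unmapped buckets keeps the corpus smaller but avoids adding labels nobody has decided on.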


The general idea is to find places on the Internet where content is placed in predefined
buckets or categories. These buckets can then be assigned to a category from
your own taxonomy. Obviously, the process is quite error-prone. After gathering
the data, I suggest looking at some random samples and checking what percentage
of them is correctly categorized. This will give you a rough idea of the quality of
your corpus.
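
The spot check described above can be sketched in a few lines. This is only an illustration: the helper names, the default sample size of 50 and the idea of recording each manual verdict as a boolean are assumptions, not part of the original workflow.

```python
import random


def sample_for_review(rows, sample_size=50, seed=None):
    """Return a random subset of the corpus rows for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))


def estimated_accuracy(judgements):
    """Share of reviewed samples whose label was judged correct.

    `judgements` is a list of booleans, one per reviewed sample.
    """
    if not judgements:
        raise ValueError("no samples were reviewed")
    return sum(judgements) / len(judgements)
```

With a small sample the resulting figure is only a rough estimate, which is all we need at this stage.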



Another trick I like to use after collecting the data is building a script that goes
through the labelled data and asks if each sample is correctly classified. Don’t worry
too much about the interface, its only purpose is to do the labelling. A command line
interface that accepts y/n input, or even a Tinder-like system will do the trick.
If the numbers allow it, you can then manually go through the samples that
aren’t correctly classified and fix them. It may seem like a lot of work, but keep
in mind that if it’s done right, it can save you a lot of time, especially given that the
alternative is to search for articles yourself and manually assign the appropriate
label.
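
A command line reviewer of this kind can be very small. The sketch below assumes the samples are available as `(text, label)` pairs. The `review_labels` name and the `ask` parameter are introduced here for illustration.

```python
def review_labels(samples, ask=input):
    """Ask a reviewer to confirm each (text, label) pair with y/n.

    Returns the samples that were rejected so they can be fixed by hand.
    `ask` defaults to the built-in input(); it is a parameter only so the
    function can be driven by a script instead of a keyboard.
    """
    rejected = []
    for text, label in samples:
        # Show only the beginning of the article, flattened to one line.
        preview = text[:200].replace("\n", " ")
        answer = ""
        while answer not in ("y", "n"):
            prompt = "[{0}] {1}... correct? (y/n) ".format(label, preview)
            answer = ask(prompt).strip().lower()
        if answer == "n":
            rejected.append((text, label))
    return rejected
```

Calling `review_labels(samples)` in a terminal prompts for each sample in turn and repeats the question until it gets a `y` or an `n`.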


### Getting the Data
Getting back to the task at hand, we’ll use a different web resource for building
our corpus: the web bookmarking service Pocket [9]. This service offers an explore
feature [10] that requires us to input clear, unambiguous queries in order to get
well-classified articles. Here’s why this is a really good idea:
- data is socially curated and of high quality: the articles suggested by Pocket
have been bookmarked by a large number of users
- data is current: suggestions are frequently added to the service
- data is easy to gather: the explore feature can be easily crawled and, at the
moment, it doesn’t seem to block crawlers

If you want to skip the corpus creation step, I have prepared the corpus in advance and
you can download it from here: Text Classification Data [11].
If you want to get your hands dirty and do it anyway, here’s how we go about it.
First, let’s figure out what the categories should be and then proceed to collecting
the data.

**Category Structure and Keywords**


In [42]:


# If you have specific needs for your corpus, remember to adjust these
# categories and keywords accordingly.
CATEGORIES = {
    'business': [
        "Business", "Marketing", "Management"
    ],
    'family': [
        "Family", "Children", "Parenting"
    ],
    'politics': [
        "Politics", "Presidential Elections",
        "Politicians", "Government", "Congress"
    ],
    'sport': [
        "Baseball", "Basketball", "Running", "Sport",
        "Skiing", "Gymnastics", "Tennis", "Football", "Soccer"
    ],
    'health': [
        "Health", "Weightloss", "Wellness", "Well being",
        "Vitamins", "Healthy Food", "Healthy Diet"
    ],
    'economics': [
        "Economics", "Finance", "Accounting"
    ],
    'celebrities': [
        "Celebrities", "Showbiz"
    ],
    'medical': [
        "Medicine", "Doctors", "Health System",
        "Surgery", "Genetics", "Hospital"
    ],
    'science & technology': [
        "Galaxy", "Physics",
        "Technology", "Science"
    ],
    'information technology': [
        "Artificial Intelligence", "Search Engine",
        "Software", "Hardware", "Big Data",
        "Analytics", "Programming"
    ],
    'education': [
        "Education", "Students", "University"
    ],
    'media': [
        "Newspaper", "Reporters", "Social Media"
    ],
    'cooking': [
        "Cooking", "Gastronomy", "Cooking Recipes",
        "Paleo Cooking", "Vegan Recipes"
    ],
    'religion': [
        "Religion", "Church", "Spirituality"
    ],
    'legal': [
        "Legal", "Lawyer", "Constitution"
    ],
    'history': [
        "Archeology", "History", "Middle Ages"
    ],
    'nature & ecology': [
        "Nature", "Ecology",
        "Endangered Species", "Permaculture"
    ],
    'travel': [
        "Travel", "Tourism", "Globetrotter"
    ],
    'meteorology': [
        "Tornado", "Meteorology", "Weather Prediction"
    ],
    'automobiles': [
        "Automobiles", "Motorcycles", "Formula 1", "Driving"
    ],
    'art & traditions': [
        "Art", "Artwork", "Traditions",
        "Artisan", "Pottery", "Painting", "Artist"
    ],
    'beauty & fashion': [
        "Beauty", "Fashion", "Cosmetics", "Makeup"
    ],
    'relationships': [
        "Relationships", "Relationship Advice",
        "Marriage", "Wedding"
    ],
    'astrology': [
        "Astrology", "Zodiac", "Zodiac Signs", "Horoscope"
    ],
    'diy': [
        'Gardening', 'Construction', 'Decorating',
        'Do it Yourself', 'Furniture'
    ]
}
    

In [43]:
CATEGORIES

{'business': ['Business', 'Marketing', 'Management'],
 'family': ['Family', 'Children', 'Parenting'],
 'politics': ['Politics',
  'Presidential Elections',
  'Politicians',
  'Government',
  'Congress'],
 'sport': ['Baseball',
  'Basketball',
  'Running',
  'Sport',
  'Skiing',
  'Gymnastics',
  'Tennis',
  'Football',
  'Soccer'],
 'health': ['Health',
  'Weightloss',
  'Wellness',
  'Well being',
  'Vitamins',
  'Healthy Food',
  'Healthy Diet'],
 'economics': ['Economics', 'Finance', 'Accounting'],
 'celebrities': ['Celebrities', 'Showbiz'],
 'medical': ['Medicine',
  'Doctors',
  'Health System',
  'Surgery',
  'Genetics',
  'Hospital'],
 'science & technology': ['Galaxy', 'Physics', 'Technology', 'Science'],
 'information technology': ['Artificial Intelligence',
  'Search Engine',
  'Software',
  'Hardware',
  'Big Data',
  'Analytics',
  'Programming'],
 'education': ['Education', 'Students', 'University'],
 'media': ['Newspaper', 'Reporters', 'Social Media'],
 'cooking': ['Cookin

Moving forward, here’s what we’ll be doing next:
- querying the service and scraping the article URLs from the page using beautifulsoup4 [12]
- iterating through the links and fetching the content of the articles using newspaper3k [13],
a library that helps us extract only the main content of a webpage
- saving everything, including the category, in a dataframe and dumping it to a CSV
file

**Use Pocket Explore to Build a Corpus**

In [45]:
!pip install newspaper3k

Collecting newspaper3k
  Downloading https://files.pythonhosted.org/packages/5b/b6/1fcd64fe7a82b8b207a172ed30a6ee58898b245d281d6d53ed782cee1b13/newspaper3k-0.2.6.tar.gz (197kB)
Collecting PyYAML>=3.11 (from newspaper3k)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Using cached https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
Collecting lxml>=3.6.0 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/eb/59/1db3c9c27049e4f832691c6d642df1f5b64763f73942172c44fee22de397/lxml-4.2.4-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (1

> Make sure the directory `data/files` exists before running the cells below.
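
If you prefer not to create the directory by hand, a one-liner from the standard library does it. This is a small convenience sketch and assumes the notebook runs from the folder that should contain `data/files`.

```python
import os

# Create ./data/files if it is missing; exist_ok avoids an error
# when the directory is already there.
os.makedirs("./data/files", exist_ok=True)
```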

In [57]:
import uuid
import atexit
import urllib.parse
import random
import requests
import pandas as pd
from time import sleep, time
from bs4 import BeautifulSoup
from newspaper import Article, ArticleException

In [58]:
POCKET_BASE_URL = 'https://getpocket.com/explore/%s'
df = pd.DataFrame(columns=['title', 'excerpt', 'url', 'file_name', "keyword", "category"])
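
The `%s` placeholder in the base URL gets filled with the URL-encoded keyword. The short sketch below shows how a query URL is put together. `explore_url` is a helper name introduced here for illustration only.

```python
import urllib.parse

POCKET_BASE_URL = 'https://getpocket.com/explore/%s'


def explore_url(keyword):
    """Build the explore URL for a keyword, encoding spaces and special characters."""
    return POCKET_BASE_URL % urllib.parse.quote_plus(keyword)


print(explore_url("Presidential Elections"))
# https://getpocket.com/explore/Presidential+Elections
```

`quote_plus` replaces spaces with `+` and percent-encodes characters such as `&`, so multi-word keywords are safe to place in the URL.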

In [59]:
@atexit.register
def save_dataframe():
    """ Before exiting, make sure we save the dataframe to a CSV file """
    dataframe_name = "dataframe_{0}.csv".format(time())
    df.to_csv(dataframe_name, index=False)
    

In [60]:
# Shuffle the categories to make sure we are not exhaustively crawling only the first categories
categories = list(CATEGORIES.items())
random.shuffle(categories)

In [None]:
for category_name, keywords in categories:
    print("Exploring Category=\"{0}\"".format(category_name))

    for kw in keywords:
        # Get trending content from Pocket's explore endpoint
        result = requests.get(POCKET_BASE_URL % urllib.parse.quote_plus(kw))

        # Extract the media items
        soup = BeautifulSoup(result.content, "html5lib")
        media_items = soup.find_all(attrs={'class': 'media_item'})
        for item_html in media_items:
            # The title element holds both the headline and the article link
            title_html = item_html.find_all(attrs={'class': 'title'})[0]
            title = title_html.text
            url = title_html.a['data-saveurl']

            print("Indexing article: \"{0}\" from \"{1}\"".format(title, url))
            excerpt = item_html.find_all(attrs={'class': 'excerpt'})[0].text

            try:
                # Download the page and keep only the main article text
                article = Article(url)
                article.download()
                article.parse()
                content = article.text
            except ArticleException as e:
                print("Encountered exception when parsing \"{0}\": \"{1}\"".format(url, str(e)))
                continue

            if not content:
                print("Couldn't extract content from \"{0}\"".format(url))
                continue

            # Save the article text under a unique file name
            file_name = "{0}.txt".format(str(uuid.uuid4()))
            with open('./data/files/{0}'.format(file_name), 'w+', encoding='utf-8') as text_file:
                text_file.write(content)

            # Append the row to our dataframe
            df.loc[len(df)] = [title, excerpt, url, file_name, kw, category_name]
            # Sleep between articles in order to not get blocked
            sleep(random.randint(5, 15))

Exploring Category="art & traditions"
Indexing article: "The Great Chinese Art Heist" from "https://www.gq.com/story/the-great-chinese-art-heist"
Indexing article: "Sentences to ponder" from "https://marginalrevolution.com/marginalrevolution/2018/08/sentences-to-ponder-101.html"
Indexing article: "Is This the Most Powerful Sculpture at the Met?" from "https://www.nytimes.com/interactive/2018/08/20/arts/met-buddha-sculpture.html"
Indexing article: "Chinese Artist Ai Weiwei Uses Ethereum to Make Art About 'Value'" from "https://www.coindesk.com/chinese-artist-ai-weiwei-uses-ethereum-to-make-art-about-value/"
Indexing article: "Mirrored Installations by Sarah Meyohas Create Infinite Tunnels Strewn With Dangling Flowers" from "https://www.thisiscolossal.com/2018/08/mirrored-installations-by-sarah-meyohas/"
Indexing article: "31 Art Exhibitions to View in N.Y.C. This Weekend" from "https://www.nytimes.com/2018/08/16/arts/design/art-and-museums-in-nyc-this-week.html"
Indexing article: "The L
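
Once the crawl has finished, the CSV file and the text files have to be joined again before training. The sketch below shows one way to do that with the standard library only. `load_corpus` is a hypothetical helper, and it assumes that the CSV has the columns defined earlier and that the text files are UTF-8 encoded.

```python
import csv
import os


def load_corpus(csv_path, files_dir="./data/files"):
    """Read the saved CSV and attach the article text stored in each text file.

    Returns a list of (text, category) pairs; rows whose file is missing
    are skipped.
    """
    corpus = []
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            path = os.path.join(files_dir, row["file_name"])
            if not os.path.exists(path):
                continue
            with open(path, encoding="utf-8") as text_file:
                corpus.append((text_file.read(), row["category"]))
    return corpus
```

The `csv_path` argument is the `dataframe_<timestamp>.csv` file written when the crawling process exits.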

In [51]:
ls

Getting Started with Scikit-Learn.ipynb*
Introduction to Machine Learning.ipynb*
