# BBC Dataset demo
This demo uses a freely available dataset from the BBC (http://mlg.ucd.ie/datasets/bbc.html). We will use this dataset as an example to automatically cluster documents.

## Download and extract dataset
We will use the raw text dataset. We first download the ZIP file, unless it is already present, and then extract it.

In [1]:
from urllib.request import urlretrieve
from zipfile import ZipFile
import os

srcUrl = 'http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip'

# Create the folder if it doesn't exist
if not os.path.exists('./bbc-corpus'):
    print("Create bbc-corpus folder")
    os.makedirs('./bbc-corpus')

# Download the ZIP file only if it is not already present
if not os.path.exists('./bbc-corpus/bbc-fulltext.zip'):
    print("Download %s" % srcUrl)
    urlretrieve(srcUrl, './bbc-corpus/bbc-fulltext.zip')
    
# Extract zipFile
with ZipFile('./bbc-corpus/bbc-fulltext.zip', 'r') as zipFile:
    print("Extract %d files from bbc-fulltext.zip" % len(zipFile.namelist()))
    zipFile.extractall('./bbc-corpus')

Extract 2232 files from bbc-fulltext.zip


## Extract titles and text body
We will go through all the files and extract the title and the body of each file.

In [2]:
# Create two lists
item_titles = []
item_texts = []

# Go through the category folders and the files in each of them
for subfolder in [ x for x in os.listdir('./bbc-corpus/bbc') if os.path.isdir('./bbc-corpus/bbc/' + x) ]:
    for filename in os.listdir('./bbc-corpus/bbc/' + subfolder):
        full_filename = './bbc-corpus/bbc/' + subfolder + '/' + filename
        # Use a context manager so every file handle is closed after reading
        with open(full_filename, 'r', encoding='iso-8859-15') as f:
            file_contents = f.read().split('\n')

        # The first line is the title, followed by an empty line and then the main body
        item_titles.append(file_contents[0])
        item_texts.append('\n'.join(file_contents[2:]))

        
# Test output
print(item_titles[0])
print("%s..." % item_texts[0][:100])

Rochus shocks Coria in Auckland
Top seed Guillermo Coria went out of the Heineken Open in Auckland on Thursday with a surprise loss ...


## Cleanup item texts
Clean up the texts by first replacing all non-alphabetic characters with spaces, then converting everything to lowercase, and finally removing all stop words.
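The effect of these three steps on a single sentence can be sketched as follows. This is only an illustration: `clean_text` and `TOY_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for the much larger NLTK English list that the demo uses. Note that hyphenated words are split in two and that numbers disappear entirely.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the demo itself relies on NLTK's English stop-word list.
TOY_STOPWORDS = {"the", "a", "in", "of", "to", "was"}

def clean_text(text, stop_words=TOY_STOPWORDS):
    """Keep letters only, lowercase the text, and drop stop words."""
    # Every character that is not a-z or A-Z becomes a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Keep only the words that are not stop words
    return " ".join(w for w in words if w not in stop_words)

cleaned = clean_text("Rochus lost the semi-final 6-4 in Auckland.")
print(cleaned)  # rochus lost semi final auckland
```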

In [3]:
import re
from nltk.corpus import stopwords

# Build the stop-word set once, outside the loop
# (needs the NLTK stop-word corpus: nltk.download('stopwords'))
stopwords_eng = set(stopwords.words("english"))

for idx, item_text in enumerate(item_texts):
    # Replace all non-alphabetic characters with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', item_text)
    
    # Lowercase and split into words on whitespace
    words = letters_only.lower().split()
    
    # Remove stop words
    useful_words = [x for x in words if x not in stopwords_eng]
    
    # Store result
    item_texts[idx] = ' '.join(useful_words)
    
# Test output
print(item_texts[0])

top seed guillermo coria went heineken open auckland thursday surprise loss olivier rochus belgium coria lost semi final rochus goes face czech jan hernych winner jose acasuso argentina fifth seed fernando gonzalez eased past american robby ginepri chilean meet sixth seed juan ignacio chela next argentine beat potito starace rochus made semi finals australian hardcourt championships adelaide last week naturally delighted form two unbelievable weeks said today knew nothing lose beat great lost would losing top player coria conceded rochus played good added give best sad


## Create Tf-Idf vectors for all document terms
Tf-Idf stands for *term frequency–inverse document frequency*. It is a numerical statistic intended to reflect how important a word is to a document within a collection. Words that occur often in a document but rarely in the rest of the collection receive a high weight. Words that occur in a document but are also common across the whole collection receive a low weight.
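To make the weighting concrete, here is a minimal sketch on a toy corpus using the textbook formula tf × log(N / df). The corpus and the `tf_idf` helper are made up for illustration. scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2-normalizes each document vector by default, so its numbers will differ from these, but the ranking idea is the same.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already a list of tokens
docs = [
    ["ireland", "rugby", "injury"],
    ["ireland", "rugby", "win"],
    ["market", "shares", "rugby"],
]

def tf_idf(docs):
    """Textbook tf-idf: relative term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print("%-8s %.3f" % (term, weight))
```

In the first document, "rugby" appears in every document of the corpus and therefore gets weight zero, while "injury" appears only there and gets the highest weight.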

In [4]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(words_list, stemmer):
    return [stemmer.stem(word) for word in words_list]

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_words(tokens, stemmer)
    return stems

# Create Tf-Idf vectors from our item_texts
tfidf = TfidfVectorizer(tokenizer=tokenize)
tfs = tfidf.fit_transform(item_texts)

# Test output
print(tfs)

  (0, 17047)	0.0931072466343
  (0, 14870)	0.250494157871
  (0, 7174)	0.11442958744
  (0, 3628)	0.358187717552
  (0, 18387)	0.0563737957786
  (0, 7534)	0.0969661000928
  (0, 11960)	0.0465113632758
  (0, 1017)	0.131361856926
  (0, 16908)	0.0640594744112
  (0, 16368)	0.0669752067127
  (0, 9937)	0.0690257928222
  (0, 11905)	0.10043032211
  (0, 14252)	0.490173450729
  (0, 1508)	0.108932051169
  (0, 9939)	0.110435334484
  (0, 14905)	0.164902953089
  (0, 5995)	0.103380048144
  (0, 6898)	0.0746795583793
  (0, 5712)	0.04833491531
  (0, 3986)	0.0969661000928
  (0, 8697)	0.108932051169
  (0, 7601)	0.131361856926
  (0, 18563)	0.0647739164312
  (0, 8881)	0.0834980526236
  (0, 83)	0.138361489591
  :	:
  (2224, 773)	0.036520717039
  (2224, 4201)	0.0626900548222
  (2224, 15063)	0.0749579757996
  (2224, 947)	0.0389504153461
  (2224, 8512)	0.0349596315479
  (2224, 8713)	0.118509033808
  (2224, 9466)	0.133149384987
  (2224, 8247)	0.038686940249
  (2224, 17012)	0.102745662014
  (2224, 16538)	0.12478167109

## See some nearest neighbors
We will use nearest neighbors to check whether cosine similarity actually surfaces related documents.
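As a reminder of what the metric measures, the following sketch computes cosine similarity for small dense vectors in plain Python. The vectors and the `cosine_similarity` helper are illustrative assumptions and not part of the demo. scikit-learn's `'cosine'` metric is the corresponding distance, that is 1 minus this similarity, so the nearest neighbors are the documents with the highest similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, only longer
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print("a vs b: %.2f, a vs c: %.2f" % (sim_ab, sim_ac))
```

Because only the direction of the vectors matters, a long and a short article about the same topic can still score as close neighbors.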

In [10]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

neigh = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)
neigh.fit(tfs)

# Index of the article whose neighbors we want to inspect
item_id = 11

print(item_titles[item_id])
for x in neigh.kneighbors(tfs[item_id], return_distance=False).reshape([ -1 ]):
    print(" - %s (#%d)" % (item_titles[x], x))

D'Arcy injury adds to Ireland woe
 - D'Arcy injury adds to Ireland woe (#11)
 - O'Driscoll out of Scotland game (#314)
 - Ireland surge past Scots (#468)
 - Ireland v USA (Sat) (#151)
 - Preview: Ireland v England (Sun) (#51)
 - O'Connor aims to grab opportunity (#366)
 - Italy 17-28 Ireland (#14)
 - O'Driscoll saves Irish blushes (#508)
 - Ireland call up uncapped Campbell (#111)
 - Ireland 17-12 South Africa (#301)
