# Download and Explore DBpedia Dataset
*by Marvin Bertin*
<img src="../images/tensorflow.png" width="400">

## DBpedia Ontology Classification Dataset

**Description**

The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from [DBpedia 2014](http://wiki.dbpedia.org/).

The class names are listed in `classes.txt`.
Each of the 14 classes contributes 40,000 training samples and 5,000 testing samples.

**Total dataset size**
- 560,000 training samples
- 70,000 testing samples
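
The totals follow directly from the per-class counts. A quick sanity check in plain Python (the variable names are illustrative and not part of the dataset or the helper module):

```python
# Per-class counts as stated in the dataset description.
n_classes = 14
train_per_class = 40000
test_per_class = 5000

# Totals are simply classes x samples per class.
train_total = n_classes * train_per_class
test_total = n_classes * test_per_class

print("train: {}, test: {}".format(train_total, test_total))
# train: 560000, test: 70000
```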

**14 Target Classes**
1. Company
2. EducationalInstitution
3. Artist
4. Athlete
5. OfficeHolder
6. MeanOfTransportation
7. Building
8. NaturalPlace
9. Village
10. Animal
11. Plant
12. Album
13. Film
14. WrittenWork

## Import TensorFlow-Slim

In [1]:
import sys  
sys.path.append("../") 

import tensorflow as tf
slim = tf.contrib.slim

%load_ext autoreload
%autoreload 2

## Import Text Dataset Helper Functions

In [2]:
import pandas as pd
import numpy as np
from utils.datasets import text_datasets

## Download and Load DBpedia Dataset
`text_datasets.load_dbpedia()` takes a `size` parameter:

- `small` loads only 0.1% of the dataset (i.e. 560 training and 70 testing observations).
- `normal` loads the full dataset.

In [3]:
data = text_datasets.load_dbpedia(size='small')
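
Once the data is loaded, it can be worth checking how the labels are distributed, since a 0.1% subset may not be perfectly balanced across classes. The sketch below assumes `data.train.target` is an array of integer labels in the range 1 to 14. Because the loader's internals are not shown here, it uses a randomly generated stand-in array (`targets`, hypothetical) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.train.target: 560 random labels in 1..14.
rng = np.random.RandomState(0)
targets = rng.randint(1, 15, size=560)

# Count how many samples fall into each class label.
counts = pd.Series(targets).value_counts().sort_index()

print(counts)
print("Classes present: {} of 14".format(counts.size))
```

With the real dataset, `targets` would be replaced by `data.train.target`.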

## Text Data Helper Functions

In [4]:
def load_target_classes(class_file_path):
    # Return a list (not a lazy map object) so len() and indexing also work on Python 3.
    with open(class_file_path) as f:
        return [line.strip() for line in f]

def explore_dbpedia_data(data, classes, sample_size=5):
    # Draw sample_size distinct random indices from the training set.
    idx = np.random.choice(len(data.train.data), sample_size, replace=False)
    target_sample = data.train.target[idx]
    train_data_sample = data.train.data[idx]

    for target, sample in zip(target_sample, train_data_sample):
        # Each sample holds (title, content); targets appear to be 1-based, hence target - 1.
        print("Title: {}".format(sample[0]))
        print("Target Class: {}".format(classes[target - 1]))
        print("Content:\n {}\n".format(sample[1]))

## Load Target Classes

In [5]:
class_file_path = "dbpedia_data/dbpedia_csv/classes.txt"

classes = load_target_classes(class_file_path)
print("Number of classes: {}".format(len(classes)))
classes

Number of classes: 14


['Company',
 'EducationalInstitution',
 'Artist',
 'Athlete',
 'OfficeHolder',
 'MeanOfTransportation',
 'Building',
 'NaturalPlace',
 'Village',
 'Animal',
 'Plant',
 'Album',
 'Film',
 'WrittenWork']
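
The helper `explore_dbpedia_data` looks up names with `classes[target-1]`, which suggests the targets are 1-based while the Python list is 0-based. A minimal sketch of an explicit lookup table that makes the offset visible (`label_to_name` is an illustrative name, not part of the helper module):

```python
# Class names in the order shown above.
classes = ['Company', 'EducationalInstitution', 'Artist', 'Athlete',
           'OfficeHolder', 'MeanOfTransportation', 'Building',
           'NaturalPlace', 'Village', 'Animal', 'Plant', 'Album',
           'Film', 'WrittenWork']

# Map 1-based label ids to class names (assumes labels run from 1 to 14).
label_to_name = {i + 1: name for i, name in enumerate(classes)}

print(label_to_name[1])   # Company
print(label_to_name[13])  # Film
```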

## Explore DBpedia Data
Pick a sample size; `explore_dbpedia_data` then prints a random sample of training examples, showing the title, target class name, and text content of each.

In [6]:
explore_dbpedia_data(data, classes, sample_size=5)

Title: Sadako 3D 2
Target Class: Film
Content:
  Sadako 3D 2 (貞子3D2) is a 2013 Japanese horror film directed by Tsutomu Hanabusa and the second installment of the Sadako 3D series.

Title: The Humpbacked Horse (1941 film)
Target Class: Film
Content:
  The Humpbacked Horse (Russian: Конёк-Горбунок) is a 1941 Soviet film directed by Alexander Rou and produced at Soyuzdetfilm studios. It is based on a fairy tale by Pyotr Pavlovich Yershov.

Title: Child of the Prophecy
Target Class: WrittenWork
Content:
  Child of the Prophecy is an historical fantasy novel by Juliet Marillier and the third book in the Sevenwaters Trilogy first published in 2001. Book Three steps slightly out of the tradition of Sevenwaters with the young heroine Fainne being raised far from the homestead in Kerry. Fainne is the daughter of Niamh and Ciaran and is a dangerous combination of four races.

Title: Harold Hess Lustron House
Target Class: Building
Content:
  Harold Hess Lustron House is located in Closter Berge

## Next Lesson
### Deep Convolutional Neural Network for text classification in TensorFlow-Slim
- Define a deep convolutional neural network for a classification task on a text dataset.

<img src="../images/divider.png" width="100">