In [1]:
%load_ext autoreload
%autoreload 2

# Data loader

This script loads data into Elasticsearch so that the rest of the project can query it easily. All datasets are normalized into a forum-like format: every text is either a post opener or a comment on that post. This makes the downstream tasks easier.
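As a rough illustration of the forum-like format, the sketch below builds one normalized record from an opener and its replies. The function `make_post` and the field names `source`, `text` and `comments` are hypothetical; they are not taken from `texata.dataloader`, whose actual schema may differ.

```python
def make_post(source, opener, replies=()):
    """Build one forum-like record: an opener text plus its replies.

    Field names here are illustrative assumptions, not the real schema.
    """
    return {
        "source": source,
        "text": opener.strip(),
        # Drop replies that are empty or whitespace only
        "comments": [r.strip() for r in replies if r.strip()],
    }


# A defect with comments and an article without any replies
defect = make_post("cdets", "Router reloads after upgrade.", ["Seen on two units.", "  "])
article = make_post("techzone", "How to configure BGP.")
```

Datasets without replies simply end up with an empty `comments` list, so every source shares the same shape.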

In [2]:
from texata import dataloader

## CDETS data

Read all the XML files into an Elasticsearch index. The description of each defect is saved as the main text, and its comments as "replies" to that text:

In [None]:
# Sample data
#dataloader.CDETSloadES('../data/CDETS-Data-Sample')
# Real data
dataloader.CDETSloadES('../data/Hackathon-Texata-2015/Defects-ASR9k')

## Cisco Support Forums data

Split each post file into its fields. I save all the data I can find, in particular the opening text as the main text and all the replies that follow it:

In [3]:
# Sample data
#dataloader.forumsloadES('../data/Support-Forums-Sample')
# Real data
dataloader.forumsloadES('../data/Hackathon-Texata-2015/SupportCommunity/RS/content')

## Techzone data

Each file is malformed, as it contains several XML trees. I have split the data files into separate, well-formed XML files using the tzsplit.sh script I developed, so they can now be read correctly into Elasticsearch. Since these data do not include replies or anything similar, each text is stored as an individual "post":

In [None]:
# Sample data
#dataloader.techzoneESloader('../data/techzone-sample')
# Real data
dataloader.techzoneESloader('../data/Hackathon-Texata-2015/TechZone-Splitted')
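The splitting step done by tzsplit.sh could also be sketched in Python. This is not the author's script, only a minimal illustration that assumes every XML tree in a file starts with its own `<?xml ...?>` declaration. The name `split_xml_documents` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each concatenated document begins with an XML declaration
XML_DECL = re.compile(r"<\?xml\s[^>]*\?>")


def split_xml_documents(raw):
    """Split text holding several concatenated XML documents into one string per document."""
    starts = [m.start() for m in XML_DECL.finditer(raw)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep anything that precedes the first declaration
    bounds = starts + [len(raw)]
    docs = []
    for begin, end in zip(bounds[:-1], bounds[1:]):
        chunk = raw[begin:end].strip()
        if not chunk:
            continue
        ET.fromstring(chunk)  # raises ParseError if the chunk is not well formed
        docs.append(chunk)
    return docs


raw = (
    '<?xml version="1.0"?>\n<doc><title>A</title></doc>\n'
    '<?xml version="1.0"?>\n<doc><title>B</title></doc>\n'
)
docs = split_xml_documents(raw)
titles = [ET.fromstring(d).findtext("title") for d in docs]
```

Parsing each chunk before keeping it means a bad split fails loudly instead of producing a broken file.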

## Stackoverflow data

This is a public dataset made up of dumps of several Stack Exchange Q&A communities. While not directly related to Cisco products or services, it is a significant corpus of technical text, and so can be a good training set for a language model in this project.

I have downloaded dumps from the following communities:
* DSP (Signal Processing)
* Network Engineering
* Reverse Engineering
* Robotics
* Security
* Serverfault (systems administration)
* Stackoverflow (programming, general tech topics)
* Superuser
* Unix

I will now load all the dumps into Elasticsearch for further reference.

In [None]:
dataloader.stackoverflowESLoader('/media/alvaro/DATA/Datasets/StackOverflow')
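To show how a Stack Exchange dump maps onto the forum-like format, here is a minimal sketch that groups the rows of a `Posts.xml` file into threads. It relies on the `Id`, `PostTypeId`, `ParentId` and `Body` attributes found in those dumps and treats answers as replies, which is an assumption; `stackoverflowESLoader` may handle the data differently. The function `group_posts` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def group_posts(posts_xml):
    """Group Posts.xml rows into {question_id: {"text": ..., "replies": [...]}}."""
    threads = {}
    for _, row in ET.iterparse(posts_xml, events=("end",)):
        if row.tag != "row":
            continue
        if row.get("PostTypeId") == "1":  # question: becomes the opener
            thread = threads.setdefault(row.get("Id"), {"text": "", "replies": []})
            thread["text"] = row.get("Body", "")
        elif row.get("PostTypeId") == "2":  # answer: becomes a reply to its parent
            thread = threads.setdefault(row.get("ParentId"), {"text": "", "replies": []})
            thread["replies"].append(row.get("Body", ""))
        row.clear()  # free memory, since the real dumps are large
    return threads


# Tiny hand-written sample standing in for a real dump file
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Body="How do I tune a PID loop?" />'
    b'<row Id="2" PostTypeId="2" ParentId="1" Body="Start with the P term." />'
    b'</posts>'
)
threads = group_posts(sample)
```

Using `iterparse` avoids loading a whole dump into memory, which matters for the larger communities.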

## Database test

Everything should be loaded now, so let's do a little check:

In [None]:
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

Try to get 100 opener texts:

In [None]:
# Count the hits returned when asking for the opener text field
print(len(es.search(body={"fields": "text"}, size=100)['hits']['hits']))

Try to get 100 comments:

In [None]:
# Count the hits returned when asking for the comment field
print(len(es.search(body={"fields": "text.comment"}, size=100)['hits']['hits']))
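Counting the returned hits does not confirm that each hit actually carries the requested field. A slightly stricter check could look like the sketch below. It assumes that requested fields appear under a `fields` key in each hit, as older Elasticsearch versions return them; the helper `count_hits_with_field` and the sample response are hypothetical.

```python
def count_hits_with_field(response, field):
    """Count hits in a search response that carry a non-empty value for `field`."""
    hits = response.get("hits", {}).get("hits", [])
    return sum(1 for hit in hits if hit.get("fields", {}).get(field))


# Hand-built stand-in for an es.search(...) result (content is made up)
fake_response = {"hits": {"hits": [
    {"_id": "1", "fields": {"text": ["Router reloads after upgrade."]}},
    {"_id": "2", "fields": {"text.comment": ["Seen on two units."]}},
    {"_id": "3"},
]}}
n_text = count_hits_with_field(fake_response, "text")
n_comment = count_hits_with_field(fake_response, "text.comment")
```

With a live cluster, the result of `es.search` would be passed in place of `fake_response`.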

In [None]:
# Query a single index (the Robotics community) and print the raw hits
print(es.search(fields="text", size=100, index="robotics.stackexchange.com")['hits']['hits'])