# Structure of the Dataset class

The dataset takes one of two structures, depending on whether the `split_dataset` method has been executed.

* When the method has not been executed, the structure is as follows:

``` python
# Schematic layout of a Dataset instance (not runnable code)
Dataset = {
    domains = ['domain_1', ..., 'domain_m']

    labeled = {
        'domain_1': {'X': X, 'y': y},
        ...
        'domain_m': {'X': X, 'y': y}
    }
    unlabeled = {
        'domain_1': {'X': X},
        ...
        'domain_m': {'X': X}
    }
}
```


* When the method has been executed, the structure is:

``` python
# Schematic layout of a Dataset instance after split_dataset (not runnable code)
Dataset = {
    domains = ['domain_1', ..., 'domain_m']

    labeled = {
        'domain_1': {'X_tr': X_tr, 'y_tr': y_tr, 'X_ts': X_ts, 'y_ts': y_ts},
        ...
        'domain_m': {'X_tr': X_tr, 'y_tr': y_tr, 'X_ts': X_ts, 'y_ts': y_ts}
    }
    unlabeled = {
        'domain_1': {'X_tr': X_tr, 'y_tr': y_tr, 'X_ts': X_ts, 'y_ts': y_ts},
        ...
        'domain_m': {'X_tr': X_tr, 'y_tr': y_tr, 'X_ts': X_ts, 'y_ts': y_ts}
    }
}
```
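
As a minimal sketch of how the second structure could be derived from the first, the snippet below splits each labeled domain into training and test parts. It uses plain Python lists in place of the real feature matrices, and the helper name `split_domain` and the shuffling strategy are assumptions for illustration; the actual `split_dataset` implementation is not shown in this notebook and may differ.

```python
import random


def split_domain(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: split one labeled domain into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        'X_tr': [X[i] for i in train_idx],
        'y_tr': [y[i] for i in train_idx],
        'X_ts': [X[i] for i in test_idx],
        'y_ts': [y[i] for i in test_idx],
    }


# Toy unsplit structure: two domains with ten instances each
labeled = {
    'domain_1': {'X': [[i, i + 1] for i in range(10)], 'y': [i % 2 for i in range(10)]},
    'domain_2': {'X': [[i, i * 2] for i in range(10)], 'y': [i % 2 for i in range(10)]},
}

# Split structure: same domains, each entry now holds X_tr, y_tr, X_ts, y_ts
labeled_split = {
    domain: split_domain(data['X'], data['y'], test_size=0.2)
    for domain, data in labeled.items()
}
```

Each domain keeps its key in the dictionary; only the inner entry changes from `{'X', 'y'}` to the four train/test keys.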


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Load the datasets
from utils.DatasetStorage import Dataset
from utils.paths import *
import numpy as np
import os
import pandas as pd

In [3]:
print(datasets)

['amazon', 'twitter']


# Amazon dataset

In [4]:
dims = dimensions['amazon']
dims

3000
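
The names `datasets`, `dimensions` and `data_path` come from the wildcard import of `utils.paths`, whose contents are not shown. A minimal sketch of what such a module might define, assuming only the values visible in the outputs of this notebook (the dataset names, the two dimensionalities, and a `data` directory); the module layout and the helper `dataset_file` are guesses:

```python
import os

# Assumed contents of a paths/config module; values taken from the notebook outputs
datasets = ['amazon', 'twitter']
dimensions = {'amazon': 3000, 'twitter': 2000}
data_path = 'data'  # assumption: directory where the .pkl files are written


def dataset_file(name):
    """Hypothetical helper: path of the pickled dataset for a given name."""
    return os.path.join(data_path, name + '.pkl')


amazon_file = dataset_file(datasets[0])
```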

In [5]:
# Build the Amazon dataset with 3000 features
%run ./preprocesamiento.py --dataset amazon --dims $dims

Leyendo directorio raw_data/multi-domain/processed_acl
Leyendo dominio: 
- electronics
- dvd
- kitchen
- books
Procesando datos.

Etiquetas:
	Etiqueta: positive - Valor: 1
	Etiqueta: negative - Valor: 0

Guardando datos en data/amazon.pkl
Operacion terminada.
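
The script is invoked with `--dataset` and `--dims` flags. Its source is not included here, so the following is only a sketch of how such flags could be parsed with `argparse`; the function name `parse_args` and the help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser mirroring the flags used in the %run call."""
    parser = argparse.ArgumentParser(description='Preprocess a raw dataset')
    parser.add_argument('--dataset', required=True,
                        help='name of the dataset to process, e.g. amazon')
    parser.add_argument('--dims', type=int, required=True,
                        help='number of features to keep')
    return parser.parse_args(argv)


# Same arguments as the notebook cell, passed explicitly instead of via sys.argv
args = parse_args(['--dataset', 'amazon', '--dims', '3000'])
```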


In [6]:
# Check that the dataset was created correctly
dataset_path = os.path.join(data_path, datasets[0]+'.pkl')
dataset_object = Dataset().load(dataset_path)

domains = dataset_object.domains

print(domains)

labeled = dataset_object.labeled
unlabeled = dataset_object.unlabeled

['electronics', 'dvd', 'kitchen', 'books']


In [7]:
instances = dataset_object.get_all_X()
print(instances.shape)

(27677, 3000)


In [11]:
dataset_object.split_dataset(test_size=0.2)
instances2 = dataset_object.get_all_X(test_data=True)
print(instances2.shape)

Dataset already splitted
(27677, 3000)


In [9]:
training_instances = dataset_object.get_all_X()
print(training_instances.shape)

(26077, 3000)


In [12]:
# Per-domain summary: train/test sizes and share of positive labels
df = pd.DataFrame(columns=['Dominio', 'Entrenamiento', "% Pos", 'Prueba', "% Pos", 'Total'])
labeled = dataset_object.labeled

for i, domain in enumerate(labeled):
    tr = labeled[domain]['X_tr'].shape[0]
    ts = labeled[domain]['X_ts'].shape[0]

    # argmax over the label columns gives the class index (1 = positive)
    y_tr = labeled[domain]['y_tr'].todense().argmax(axis=1)
    y_tr_pos = np.sum(y_tr)
    # Percentage is truncated to an integer, then printed with one decimal
    y_tr_pos = int(100 * y_tr_pos / float(tr))
    y_tr_pos = "%.1f" % y_tr_pos

    y_ts = labeled[domain]['y_ts'].todense().argmax(axis=1)
    y_ts_pos = np.sum(y_ts)
    y_ts_pos = int(100 * y_ts_pos / float(ts))
    y_ts_pos = "%.1f" % y_ts_pos

    df.loc[i] = [domain, tr, y_tr_pos, ts, y_ts_pos, tr + ts]
df

|   | Dominio | Entrenamiento | % Pos | Prueba | % Pos | Total |
|---|---|---|---|---|---|---|
| 0 | dvd | 1600 | 50.0 | 400 | 49.0 | 2000 |
| 1 | electronics | 1600 | 50.0 | 400 | 49.0 | 2000 |
| 2 | books | 1600 | 50.0 | 400 | 49.0 | 2000 |
| 3 | kitchen | 1600 | 50.0 | 400 | 49.0 | 2000 |
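
The percentage columns are computed by taking the argmax of each label row, summing the resulting class indices, and truncating the percentage with `int` before formatting it. The sketch below reproduces that arithmetic on made-up one-hot labels using plain lists (the helper `percent_positive` and the toy data are illustrative, not part of the notebook), and shows that the truncation hides the fractional part:

```python
def percent_positive(one_hot_rows):
    """Share of positive rows, truncated to an integer like the summary cell."""
    # argmax over two columns: index 1 means the positive class
    class_idx = [row.index(max(row)) for row in one_hot_rows]
    n_pos = sum(class_idx)
    truncated = int(100 * n_pos / float(len(one_hot_rows)))
    return "%.1f" % truncated


# Toy labels: 199 positive and 201 negative rows out of 400
rows = [[0, 1]] * 199 + [[1, 0]] * 201
truncated = percent_positive(rows)
exact = "%.2f" % (100 * 199 / 400.0)
```

Here `truncated` is `'49.0'` while the exact share is 49.75, so a value such as 49.0 in the table only indicates that the true share lies somewhere between 49 and 50 percent.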


In [13]:
dataset_object.save(dataset_path)
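
The `.pkl` extension suggests that `save` and `load` rely on pickle serialization, although the `Dataset` source is not shown. A minimal sketch under that assumption, with a hypothetical `TinyDataset` class standing in for the real one:

```python
import os
import pickle
import tempfile


class TinyDataset(object):
    """Hypothetical stand-in for Dataset; only save/load are sketched."""

    def __init__(self):
        self.domains = []
        self.labeled = {}
        self.unlabeled = {}

    def save(self, path):
        # Serialize the instance attributes to disk
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    def load(self, path):
        # Restore the attributes and return the object, so that
        # TinyDataset().load(path) can be used as a one-liner
        with open(path, 'rb') as f:
            self.__dict__.update(pickle.load(f))
        return self


original = TinyDataset()
original.domains = ['electronics', 'dvd', 'kitchen', 'books']
tmp_path = os.path.join(tempfile.mkdtemp(), 'tiny.pkl')
original.save(tmp_path)
restored = TinyDataset().load(tmp_path)
```

Saving after `split_dataset`, as done in the cell above, would then persist the split so that later loads see the same train/test partition.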

# Twitter dataset

In [14]:
dims = dimensions['twitter']
dims

2000

In [15]:
%run ./preprocesamiento.py --dataset twitter --dims $dims

Leyendo directorio raw_data/twitter
Leyendo dominio: 
- rio2016
- thevoice
- general
Procesando datos.

Etiquetas:
	Etiqueta: positivo - Valor: 1
	Etiqueta: negativo - Valor: 0

Guardando datos en data/twitter.pkl
Operacion terminada.


In [16]:
# Check that the dataset was created correctly

dataset_path = os.path.join(data_path, datasets[1]+'.pkl')
dataset_object = Dataset().load(dataset_path)

domains = dataset_object.domains

print(domains)

labeled = dataset_object.labeled

['rio2016', 'thevoice', 'general']


In [17]:
instances = dataset_object.get_all_X()
print(instances.shape)

(5796, 2000)


In [18]:
dataset_object.split_dataset(test_size=0.2)
instances2 = dataset_object.get_all_X(test_data=True)
print(instances2.shape)

(5796, 2000)


In [19]:
training_instances = dataset_object.get_all_X()
print(training_instances.shape)

(4635, 2000)


In [20]:
# Per-domain summary: train/test sizes and share of positive labels
df = pd.DataFrame(columns=['Dominio', 'Entrenamiento', "% Pos", 'Prueba', "% Pos", 'Total'])
labeled = dataset_object.labeled

for i, domain in enumerate(labeled):
    tr = labeled[domain]['X_tr'].shape[0]
    ts = labeled[domain]['X_ts'].shape[0]

    # argmax over the label columns gives the class index (1 = positive)
    y_tr = labeled[domain]['y_tr'].todense().argmax(axis=1)
    y_tr_pos = np.sum(y_tr)
    # Percentage is truncated to an integer, then printed with one decimal
    y_tr_pos = int(100 * y_tr_pos / float(tr))
    y_tr_pos = "%.1f" % y_tr_pos

    y_ts = labeled[domain]['y_ts'].todense().argmax(axis=1)
    y_ts_pos = np.sum(y_ts)
    y_ts_pos = int(100 * y_ts_pos / float(ts))
    y_ts_pos = "%.1f" % y_ts_pos

    df.loc[i] = [domain, tr, y_tr_pos, ts, y_ts_pos, tr + ts]
df

|   | Dominio | Entrenamiento | % Pos | Prueba | % Pos | Total |
|---|---|---|---|---|---|---|
| 0 | thevoice | 519 | 51.0 | 130 | 49.0 | 649 |
| 1 | rio2016 | 380 | 53.0 | 96 | 60.0 | 476 |
| 2 | general | 3736 | 46.0 | 935 | 47.0 | 4671 |
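
The shapes printed earlier can be cross-checked against this table. Assuming that `get_all_X()` omits only the labeled test instances after the split (an inference from the numbers, not something the notebook states), the total minus the test rows should equal the training count:

```python
# Values copied from the notebook outputs for the Twitter dataset
total_instances = 5796       # rows from get_all_X(test_data=True)
training_instances = 4635    # rows from get_all_X()
test_per_domain = {'thevoice': 130, 'rio2016': 96, 'general': 935}

# Number of labeled test rows across all domains
n_test = sum(test_per_domain.values())
# Rows that disappear when test data is excluded
difference = total_instances - training_instances
```

Both `n_test` and `difference` come out to 1161. The same check holds for the Amazon dataset: 27677 - 4 * 400 = 26077.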


In [22]:
dataset_object.save(dataset_path)