# `data_generation` package
___


## Introduction

This package focuses on the automatic generation of synthetic datasets with a grouped structure, specially suited for high-dimensional analysis. The package currently includes:

* `EqualGroupSize`: A Python class for generating datasets formed by groups of the same size
* `UnequalGroupSize`: A Python class for generating datasets formed by groups of different sizes
* `dg_enet`: A function that generates data following example 4 of the elastic net paper
* `dg_hierarchical`: A function that generates data following example 2 of the hierarchical lasso structure

## Requirements
___
This package relies only on `numpy` and `scipy`.

## Usage examples
___

```python
import numpy
import data_generation as dgen
```

### EqualGroupSize

The input parameters are:
* `n_obs=200`: Number of observations
* `ro=0.5`: Between-groups correlation
* `error_distribution='student_t'`: Error distribution (accepts `normal`, `student_t`, `cauchy` and `chisq`)
* `e_df=3`: Degrees of freedom used in the Student's t and chi-squared distributions
* `e_loc=0`: Location parameter used in the normal and Cauchy distributions
* `e_scale=3`: Scale parameter used in the normal and Cauchy distributions
* `random_state=None`: Random state value, used when reproducible data is required

* `num_groups=20`: Number of groups to be generated
* `group_size=20`: Size of the groups to be generated
* `non_zero_groups=7`: Number of groups that contain non-zero coefficients
* `non_zero_coef=8`: Number of non-zero coefficients in each of the groups counted by `non_zero_groups`

```python
data_equal = dgen.EqualGroupSize(n_obs=5000, ro=0.2, error_distribution='student_t',
                                 e_df=5, random_state=1, group_size=10, non_zero_groups=3,
                                 non_zero_coef=5, num_groups=7)

x, y, beta, group_index = data_equal.data_generation().values()
```

The result is a predictor matrix `x` with `group_size*num_groups` predictors (10 * 7 = 70 in this example) and an array of beta coefficients with `non_zero_groups*non_zero_coef` non-zero entries (3 * 5 = 15 in this example).
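The arithmetic behind these dimensions can be written down as a quick sanity check. The sketch below is plain Python and does not call the package; `expected_equal_layout` is a hypothetical helper introduced here for illustration, and it assumes that `x` has one row per observation.

```python
def expected_equal_layout(n_obs, num_groups, group_size,
                          non_zero_groups, non_zero_coef):
    """Hypothetical helper (not part of data_generation).

    Returns the expected shape of x and the expected number of
    non-zero beta coefficients, using only the parameter values.
    """
    n_predictors = num_groups * group_size          # columns of x
    n_non_zero = non_zero_groups * non_zero_coef    # non-zero entries of beta
    # Assumption: x has one row per observation.
    return (n_obs, n_predictors), n_non_zero


# Same parameter values as the EqualGroupSize example above.
x_shape, n_non_zero = expected_equal_layout(
    n_obs=5000, num_groups=7, group_size=10,
    non_zero_groups=3, non_zero_coef=5)

print(x_shape)     # (5000, 70)
print(n_non_zero)  # 15
```

If `x` and `beta` are NumPy arrays, these values can be compared against `x.shape` and `numpy.count_nonzero(beta)` after generating the data.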

### UnequalGroupSize

The input parameters are:
* `n_obs=200`: Number of observations
* `ro=0.5`: Between-groups correlation
* `error_distribution='student_t'`: Error distribution (accepts `normal`, `student_t`, `cauchy` and `chisq`)
* `e_df=3`: Degrees of freedom used in the Student's t and chi-squared distributions
* `e_loc=0`: Location parameter used in the normal and Cauchy distributions
* `e_scale=3`: Scale parameter used in the normal and Cauchy distributions
* `random_state=None`: Random state value, used when reproducible data is required

* `tuple_group_size=(5, 15, 30)`: Sizes of the groups to be generated
* `tuple_number_of_groups=(15, 15, 15)`: Number of groups to be generated for each group size
* `tuple_non_zero_groups=(3, 3, 3)`: Number of groups of each size that contain non-zero coefficients
* `tuple_non_zero_coef=(3, 6, 10)`: Number of non-zero coefficients in each of the groups counted by `tuple_non_zero_groups`
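The four tuple arguments are read position by position, so they need the same length, and each entry has to be feasible for its group size. A minimal sketch of such a consistency check follows. `check_group_tuples` is a hypothetical helper written for illustration; it is not part of `data_generation`, and the package may or may not perform these checks itself.

```python
def check_group_tuples(tuple_group_size, tuple_number_of_groups,
                       tuple_non_zero_groups, tuple_non_zero_coef):
    """Hypothetical helper (not part of data_generation).

    Raises ValueError if the tuples cannot describe a valid layout.
    """
    tuples = (tuple_group_size, tuple_number_of_groups,
              tuple_non_zero_groups, tuple_non_zero_coef)

    # All tuples must align position by position.
    if len({len(t) for t in tuples}) != 1:
        raise ValueError("all four tuples must have the same length")

    for size, n_groups, nz_groups, nz_coef in zip(*tuples):
        # Cannot have more non-zero groups than groups of that size.
        if nz_groups > n_groups:
            raise ValueError(
                f"{nz_groups} non-zero groups requested, only {n_groups} exist")
        # Cannot have more non-zero coefficients than variables in a group.
        if nz_coef > size:
            raise ValueError(
                f"{nz_coef} non-zero coefficients requested in a group of size {size}")
    return True


# The default values listed above pass the check.
defaults_ok = check_group_tuples((5, 15, 30), (15, 15, 15), (3, 3, 3), (3, 6, 10))
```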

```python
data_different = dgen.UnequalGroupSize(n_obs=5000, ro=0.8, error_distribution='normal', e_loc=1, e_scale=4,
                                       random_state=2, tuple_group_size=(2, 4, 6, 8),
                                       tuple_number_of_groups=(5, 10, 15, 20),
                                       tuple_non_zero_coef=(1, 2, 3, 4),
                                       tuple_non_zero_groups=(1, 3, 5, 7))

x, y, beta, group_index = data_different.data_generation().values()
```

The result is a predictor matrix x that has:
* 5 groups of size 2
* 10 groups of size 4
* 15 groups of size 6
* 20 groups of size 8

And a beta array with the following non-zero coefficients:
* 1 coefficient in 1 group of size 2
* 2 coefficients in each of 3 groups of size 4
* 3 coefficients in each of 5 groups of size 6
* 4 coefficients in each of 7 groups of size 8

This gives a total of 1*1 + 2*3 + 3*5 + 4*7 = 50 non-zero coefficients.
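The counts for this example can be reproduced with a few lines of plain Python. The sketch below does not call the package; the variable names are chosen here for illustration, and the group labelling (consecutive integers starting at 1) is an assumption, since the actual `group_index` returned by the package may be encoded differently.

```python
# Parameter values from the UnequalGroupSize example above.
sizes = (2, 4, 6, 8)        # tuple_group_size
numbers = (5, 10, 15, 20)   # tuple_number_of_groups
nz_groups = (1, 3, 5, 7)    # tuple_non_zero_groups
nz_coef = (1, 2, 3, 4)      # tuple_non_zero_coef

# Total number of predictors: each group size times how many groups have it.
n_predictors = sum(s * n for s, n in zip(sizes, numbers))

# Total number of non-zero coefficients.
n_non_zero = sum(g * c for g, c in zip(nz_groups, nz_coef))

# Illustrative group labels: one integer per predictor.
# Assumption: groups are numbered 1, 2, 3, ... in the order they are listed.
group_labels = []
label = 1
for size, count in zip(sizes, numbers):
    for _ in range(count):
        group_labels.extend([label] * size)
        label += 1

print(n_predictors)        # 300
print(n_non_zero)          # 50
print(len(group_labels))   # 300
print(group_labels[-1])    # 50
```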