## Description

## Dataset Details

Taken directly from the [UCI Machine Learning Repo](https://archive.ics.uci.edu/ml/datasets/HIGGS):
>The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.

Attribute info:
>The first column is the **class label**
>  * 1 for signal 
>  * 0 for background
>
> The following 28 columns are **features** (21 low-level features then 7 high-level features):
>   * lepton pT 
>   * lepton eta 
>   * lepton phi
>   * missing energy magnitude
>   * missing energy phi 
>   * jet 1 pt 
>   * jet 1 eta 
>   * jet 1 phi 
>   * jet 1 b-tag 
>   * jet 2 pt 
>   * jet 2 eta 
>   * jet 2 phi 
>   * jet 2 b-tag 
>   * jet 3 pt 
>   * jet 3 eta 
>   * jet 3 phi 
>   * jet 3 b-tag 
>   * jet 4 pt 
>   * jet 4 eta 
>   * jet 4 phi 
>   * jet 4 b-tag 
>   * m_jj 
>   * m_jjj 
>   * m_lv 
>   * m_jlv 
>   * m_bb 
>   * m_wbb 
>   * m_wwbb
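
The column layout and hold-out rule quoted above can be sketched in pandas. This is only an illustration under stated assumptions: `split_higgs`, `feature_groups`, and the tiny stand-in frame with placeholder column names are hypothetical, and the label is assumed to be the first column, as the attribute info states.

```python
import pandas as pd

N_LOW_LEVEL = 21   # kinematic features, per the UCI description
N_HIGH_LEVEL = 7   # derived features, per the UCI description


def split_higgs(df, n_test=500000):
    """Return (train, test), holding out the last n_test rows as the test set."""
    split_at = len(df) - n_test
    return df.iloc[:split_at], df.iloc[split_at:]


def feature_groups(df):
    """Return (low_level, high_level) column names; column 0 is the label."""
    features = list(df.columns[1:])
    low = features[:N_LOW_LEVEL]
    high = features[N_LOW_LEVEL:N_LOW_LEVEL + N_HIGH_LEVEL]
    return low, high


# Tiny stand-in frame (10 rows, 29 columns) so the sketch runs without the real data.
demo_columns = ["label"] + ["f%d" % i for i in range(1, 29)]
demo = pd.DataFrame([[float(r)] * 29 for r in range(10)], columns=demo_columns)

train, test = split_higgs(demo, n_test=3)
low, high = feature_groups(demo)
print(len(train), len(test), len(low), len(high))
```

On the full dataset the default `n_test=500000` matches the test-set size given in the description.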

## Libraries

In [1]:
import pandas as pd


## Versions
The code in this notebook runs with Python 3.5. You may encounter issues if you run it with another version, especially Python 2.x.

In [2]:
import sys
# Report the interpreter and library versions this notebook ran with.
items = [("Python", sys.version.split()[0]), ("Pandas", pd.__version__)]
for name, version in items:
    print(name + " version: " + version)

## Get Data

In [6]:
columns = ("label",
           "lepton pT",
           "lepton_eta", 
           "lepton_phi", 
           "missing_energy_magnitude", 
           "missing_energy_phi",
           "jet_1_pt",
           "jet_1_eta",
           "jet_1_phi",
           "jet_1_b_tag",
           "jet_2_pt",
           "jet_2_eta",
           "jet_2_phi",
           "jet_2_b_tag",
           "jet_3_pt",
           "jet_3_eta",
           "jet_3_phi",
           "jet_3_b_tag",
           "jet_4_pt",
           "jet_4_eta",
           "jet_4_phi",
           "jet_4_b_tag",
           "m_jj",
           "m_jjj",
           "m_lv",
           "m_jlv",
           "m_bb",
           "m_wbb",
           "m_wwbb")

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"

raw_data = pd.read_csv(url, header=None, names=columns, compression="gzip")
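
Loading all 11 million rows in one call needs several gigabytes of RAM. If that is a problem, one possible alternative is to stream the file in pieces with the `chunksize` argument of `pd.read_csv`. The sketch below is an assumption, not part of the original notebook: `iter_csv_chunks` is a hypothetical helper, and a three-row in-memory CSV stands in for the real download.

```python
import io

import pandas as pd


def iter_csv_chunks(source, columns, chunksize=1000000):
    """Yield DataFrames of at most `chunksize` rows from a headerless CSV."""
    reader = pd.read_csv(source, header=None, names=columns, chunksize=chunksize)
    for chunk in reader:
        yield chunk


# Stand-in data; for the real file, `source` would be the HIGGS.csv.gz URL or a local path.
demo_csv = io.StringIO("1.0,0.5,0.1\n0.0,0.7,0.2\n1.0,0.9,0.3\n")
demo_columns = ("label", "a", "b")

chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(demo_csv, demo_columns, chunksize=2)]
print(chunk_sizes)
```

By default pandas infers gzip compression from a `.gz` extension, so the same helper should work for the compressed file as well.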

## Information About Raw Data

In [7]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11000000 entries, 0 to 10999999
Data columns (total 29 columns):
label                       float64
lepton pT                   float64
lepton_eta                  float64
lepton_phi                  float64
missing_energy_magnitude    float64
missing_energy_phi          float64
jet_1_pt                    float64
jet_1_eta                   float64
jet_1_phi                   float64
jet_1_b_tag                 float64
jet_2_pt                    float64
jet_2_eta                   float64
jet_2_phi                   float64
jet_2_b_tag                 float64
jet_3_pt                    float64
jet_3_eta                   float64
jet_3_phi                   float64
jet_3_b_tag                 float64
jet_4_pt                    float64
jet_4_eta                   float64
jet_4_phi                   float64
jet_4_b_tag                 float64
m_jj                        float64
m_jjj                       float64
m_lv                 
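
The listing above is cut off before the memory-usage line, but the size of the frame's data can be estimated from what is visible: 11,000,000 rows of 29 `float64` columns at 8 bytes each. The following is a back-of-the-envelope sketch that ignores the index and any pandas overhead. The small `demo` frame is a hypothetical stand-in used to show the equivalent measurement with `memory_usage`.

```python
import pandas as pd

N_ROWS = 11000000        # from the RangeIndex line
N_COLS = 29              # from "total 29 columns"
BYTES_PER_FLOAT64 = 8

estimated_bytes = N_ROWS * N_COLS * BYTES_PER_FLOAT64
estimated_gib = estimated_bytes / 2 ** 30
print("estimated: %d bytes (about %.2f GiB)" % (estimated_bytes, estimated_gib))

# The same quantity measured on a small stand-in frame (2 columns x 3 rows).
demo = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [3.0, 4.0, 5.0]})
measured_bytes = int(demo.memory_usage(index=False).sum())
print("demo frame: %d bytes" % measured_bytes)
```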

In [8]:
raw_data.head()

| | label | lepton pT | lepton_eta | lepton_phi | missing_energy_magnitude | missing_energy_phi | jet_1_pt | jet_1_eta | jet_1_phi | jet_1_b_tag | ... | jet_4_eta | jet_4_phi | jet_4_b_tag | m_jj | m_jjj | m_lv | m_jlv | m_bb | m_wbb | m_wwbb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.869293 | -0.635082 | 0.22569 | 0.32747 | -0.689993 | 0.754202 | -0.248573 | -1.092064 | 0.0 | ... | -0.010455 | -0.045767 | 3.101961 | 1.35376 | 0.979563 | 0.978076 | 0.920005 | 0.721657 | 0.988751 | 0.876678 |
| 1 | 1.0 | 0.907542 | 0.329147 | 0.359412 | 1.49797 | -0.31301 | 1.095531 | -0.557525 | -1.58823 | 2.173076 | ... | -1.13893 | -0.000819 | 0.0 | 0.30222 | 0.833048 | 0.9857 | 0.978098 | 0.779732 | 0.992356 | 0.798343 |
| 2 | 1.0 | 0.798835 | 1.470639 | -1.635975 | 0.453773 | 0.425629 | 1.104875 | 1.282322 | 1.381664 | 0.0 | ... | 1.128848 | 0.900461 | 0.0 | 0.909753 | 1.10833 | 0.985692 | 0.951331 | 0.803252 | 0.865924 | 0.780118 |
| 3 | 0.0 | 1.344385 | -0.876626 | 0.935913 | 1.99205 | 0.882454 | 1.786066 | -1.646778 | -0.942383 | 0.0 | ... | -0.678379 | -1.360356 | 0.0 | 0.946652 | 1.028704 | 0.998656 | 0.728281 | 0.8692 | 1.026736 | 0.957904 |
| 4 | 1.0 | 1.105009 | 0.321356 | 1.522401 | 0.882808 | -1.205349 | 0.681466 | -1.070464 | -0.921871 | 0.0 | ... | -0.373566 | 0.113041 | 0.0 | 0.755856 | 1.361057 | 0.98661 | 0.838085 | 1.133295 | 0.872245 | 0.808487 |


## Create Multiple HDF5 Files (for iterating)

## Write to HDF5 w/Compression

In [13]:
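# Write the full table to a single compressed HDF5 file (requires PyTables).
raw_data.to_hdf('/Users/davidziganto/data/raw_HIGGS_data.h5',
                key='table',
                mode='w',          # overwrite any existing file
                append=True,       # store in the appendable "table" format
                complevel=9,       # maximum compression level
                complib='blosc',
                fletcher32=True)   # store checksums with the data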

## Notes

Downloading and unzipping the HIGGS gzip file creates an 8.04 GB CSV file. Writing that data to HDF5 with Blosc compression set to the maximum level results in a 2.14 GB H5 file.
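
As a quick worked check of those two figures, the arithmetic below gives the compression ratio and the fraction of disk space saved. Only the two sizes come from the note above. The variable names are illustrative.

```python
CSV_GB = 8.04   # uncompressed CSV size reported above
H5_GB = 2.14    # Blosc-compressed HDF5 size reported above

compression_ratio = CSV_GB / H5_GB
space_saved = 1 - H5_GB / CSV_GB

print("ratio: %.2fx, space saved: %.1f%%" % (compression_ratio, 100 * space_saved))
# prints "ratio: 3.76x, space saved: 73.4%"
```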