Collection of scripts for doing common transformations in machine learning

coreylynch/sklearn-transform

# Categorical DataFrames to Sparse SVM-Light Format

Converts a categorical design matrix X to a sparse CSR matrix, then writes it to SVM-light format.

This takes a pandas DataFrame with categorical features, converts category values to a sparse one-hot representation, then writes the sparse matrix to SVM-light format.
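
The one-hot step described above can be sketched as follows. This is a minimal illustration, not necessarily how the function in this repository does it: `one_hot_csr` and the `proto` column are hypothetical names, and the sketch assumes the categorical columns contain no missing values.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def one_hot_csr(df, cat_columns):
    """Hypothetical helper: one-hot encode categorical columns into a CSR matrix."""
    n_rows = len(df)
    blocks = []
    feature_names = []
    for col in cat_columns:
        # Map each distinct value to an integer code; sort=True orders the categories.
        codes, categories = pd.factorize(df[col], sort=True)
        # One nonzero per row, placed in the column given by that row's category code.
        block = sparse.csr_matrix(
            (np.ones(n_rows), (np.arange(n_rows), codes)),
            shape=(n_rows, len(categories)),
        )
        blocks.append(block)
        feature_names.extend(f"{col}={c}" for c in categories)
    # Stack the per-column blocks side by side into one sparse design matrix.
    return sparse.hstack(blocks, format="csr"), feature_names


df = pd.DataFrame({"proto": ["tcp", "udp", "udp", "tcp", "dns", "tcp"]})
X_sparse, names = one_hot_csr(df, ["proto"])
print(names)
print(X_sparse.toarray())
```

Each row ends up with exactly one nonzero entry per categorical column, which is why a sparse representation pays off when columns have many distinct values.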

SVM-light is a text-based format with one sample per line. It does not store zero-valued features, so it is well suited to sparse datasets.

The first element of each line can be used to store a target variable to predict.
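
To make the line layout concrete, here is a small pure-Python sketch that formats one dense row as an SVM-light line. The helper name `svmlight_line` is hypothetical, and the compact number formatting is a choice made for this illustration rather than what the library writes.

```python
def svmlight_line(target, row, zero_based=True):
    """Hypothetical helper: format one dense row as an SVM-light line."""
    offset = 0 if zero_based else 1
    # Keep only nonzero features, written as index:value pairs.
    pairs = [f"{i + offset}:{v:g}" for i, v in enumerate(row) if v != 0]
    # The target comes first, followed by the feature pairs.
    return " ".join([f"{target:g}"] + pairs)


line = svmlight_line(1.0, [0, 1, 0, 0, 2.5])
print(line)  # 1 1:1 4:2.5
```

The three zero entries in the row never appear in the output, which is the property that keeps files small for sparse data.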

Parameters

X : pandas DataFrame, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] Target values.

filename : string or file-like in binary mode If a string, specifies the path of the file that will contain the data. If file-like, the data will be written to that object, which should be opened in binary mode.

cat_columns : array-like List of categorical columns.

num_columns : array-like, optional List of numerical columns.

zero_based : boolean, optional Whether column indices should be written zero-based (True) or one-based (False).

comment : string, optional Comment to insert at the top of the file. This should be either a Unicode string, which will be encoded as UTF-8, or an ASCII byte string.

query_id : array-like, shape = [n_samples], optional Array containing pairwise preference constraints (qid in svmlight format).
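
The header in the example output suggests the file is written by scikit-learn's `dump_svmlight_file`, which accepts `zero_based`, `comment`, and `query_id` arguments with the same meaning. The sketch below calls that scikit-learn function directly (not the function in this repository) on made-up data, to show what one-based indices and query ids look like in the output.

```python
import io

import numpy as np
from scipy import sparse
from sklearn.datasets import dump_svmlight_file

# Made-up data for illustration: three samples, three features.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 0.0],
                                [0.0, 0.0, 4.0]]))
y = np.array([1.0, 0.0, 1.0])
qid = np.array([1, 1, 2])  # the first two samples belong to the same query

# A file-like target must accept bytes, hence BytesIO rather than StringIO.
buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=False, query_id=qid)
text = buf.getvalue().decode("utf-8")
print(text)
```

With `zero_based=False` the first feature is written with index 1, and each line carries a `qid:` field between the target and the features.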

Examples

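    import pandas as pd
    import numpy as np

    # Two categorical features and one numerical feature, six samples each
    category_data_1 = ['tcp', 'udp', 'udp', 'tcp', 'dns', 'tcp']
    category_data_2 = ['red', 'blue', 'red', 'green', 'blue', 'red']
    numerical_data = [1, 2, 1, 1, 3, 4]
    data = {'category_data_1': category_data_1,
            'category_data_2': category_data_2,
            'numerical_data': numerical_data}
    X = pd.DataFrame(data)
    y = np.array([1., 0., 1., 1., 0., 0.])  # one target value per sample

    cat_columns = ['category_data_1', 'category_data_2']
    num_columns = ['numerical_data']

    # dump_categorical_df_to_svm_light is provided by this repository;
    # this call writes the encoded data to a file named 'example'
    dump_categorical_df_to_svm_light(X, y, 'example', cat_columns, num_columns)

Inspect the resulting file from the shell:

    $ head example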
    # Generated by dump_svmlight_file from scikit-learn 0.13-git
    # Column indices are zero-based
    1.000000 2:1.0000000000000000e+00 4:2.0000000000000000e+00
    0.000000 0:1.0000000000000000e+00 2:1.0000000000000000e+00 4:2.0000000000000000e+00
    1.000000 0:1.0000000000000000e+00 4:2.0000000000000000e+00
    1.000000 2:1.0000000000000000e+00 3:1.0000000000000000e+00 4:1.0000000000000000e+00
    0.000000 1:1.0000000000000000e+00 2:1.0000000000000000e+00 4:3.0000000000000000e+00
    0.000000 2:1.0000000000000000e+00 4:5.0000000000000000e+00
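
To check what a line of such a file contains, it can be split back into a target and its index/value pairs. The parser below is a minimal pure-Python sketch; `parse_svmlight_line` is a hypothetical name, and for real use scikit-learn's `load_svmlight_file` reads whole files into a sparse matrix.

```python
def parse_svmlight_line(line):
    """Hypothetical helper: split one SVM-light line into (target, {index: value})."""
    # Drop a trailing comment, if any, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    target = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        if index == "qid":
            continue  # a query id is not a feature
        features[int(index)] = float(value)
    return target, features


line = "0.000000 0:1.0000000000000000e+00 2:1.0000000000000000e+00 4:2.0000000000000000e+00"
target, features = parse_svmlight_line(line)
print(target, features)  # 0.0 {0: 1.0, 2: 1.0, 4: 2.0}
```

Any column index missing from the dictionary corresponds to a zero in the original matrix.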
