Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
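As a small illustration of the model described above, the sketch below computes the probability of the positive class from a weight vector and a feature vector. The names `sigmoid` and `predict_proba` are hypothetical helpers for this example, not part of Paracel.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Probability that the label is 1, given weights theta and features x."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)
```

A sample is then assigned label 1 when `predict_proba(theta, x)` exceeds a chosen threshold, typically 0.5.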
- Enter Paracel's home directory
- Generate training dataset for classification
python ./tool/datagen.py -m classification -o training.dat -n 2500 -k 100
- Set up link library path:
- Create a JSON file named cfg.json; see the example in the Parameters section below.
- Run (4 workers, local mode in the following example)
./prun.py -w 4 -p 2 -c cfg.json -m local your_paracel_install_path/bin/lr
Default parameters are set in a JSON format file. For example, create a cfg.json like the one below, modifying the paths to match your installation:
{
    "training_input" : "training.dat",
    "test_input" : "training.dat",
    "predict_input" : "training.dat",
    "output" : "./lr_result/",
    "update_file" : "your_paracel_install_path/lib/liblr_update.so",
    "update_func" : "lr_theta_update",
    "method" : "ipm",
    "rounds" : 100,
    "alpha" : 0.001,
    "beta" : 0.01,
    "debug" : false
}
In the configuration file, test_input and predict_input are set to the same file as training_input; change them if you have a separate test or prediction dataset. update_func names the registered update function that our implementation of logistic regression needs. rounds is the number of training iterations. alpha is the learning rate of the SGD algorithm and beta is the regularization parameter. There are four learning methods you can choose from with the method parameter:
- dgd: distributed gradient descent learning
- ipm: iterative parameter mixtures learning
- downpour: asynchronous gradient descent learning
- agd: slow asynchronous gradient descent learning
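The configuration described above can also be generated from a script. The following is a minimal sketch, assuming the same keys and default values as the example cfg.json; `make_lr_config` and `write_lr_config` are hypothetical helpers for illustration and are not shipped with Paracel.

```python
import json

# The four method names listed above.
VALID_METHODS = ("dgd", "ipm", "downpour", "agd")


def make_lr_config(install_path, data_file="training.dat", method="ipm",
                   rounds=100, alpha=0.001, beta=0.01):
    """Build a config dict mirroring the example cfg.json."""
    if method not in VALID_METHODS:
        raise ValueError("unknown method: %r" % (method,))
    return {
        "training_input": data_file,
        "test_input": data_file,      # same file unless you have test data
        "predict_input": data_file,   # same file unless you have predict data
        "output": "./lr_result/",
        "update_file": install_path + "/lib/liblr_update.so",
        "update_func": "lr_theta_update",
        "method": method,
        "rounds": rounds,
        "alpha": alpha,
        "beta": beta,
        "debug": False,
    }


def write_lr_config(cfg, path="cfg.json"):
    """Write the config dict to a JSON file."""
    with open(path, "w") as f:
        json.dump(cfg, f, indent=4)
```

For example, `write_lr_config(make_lr_config("your_paracel_install_path"))` would produce a cfg.json equivalent to the one shown earlier, and an unrecognized method name is rejected before the file is written.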
Training data and test data share the following format:
feature1,feature2,...,featurek,1
feature1,feature2,...,featurek,0
feature1,feature2,...,featurek,1
...
Each line represents one sample, with its label in the last dimension. The predict data format is similar, except that it does not contain the label dimension. You can also use the same format as the training data; in that case, the program ignores the last dimension.
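To make the format concrete, here is a minimal sketch of a parser for such lines. It assumes comma-separated numeric features and a 0/1 label; `parse_sample` and `load_samples` are hypothetical names, not Paracel functions.

```python
def parse_sample(line, has_label=True):
    """Split one comma-separated line into (features, label).

    With has_label=False every field is treated as a feature
    and the returned label is None.
    """
    fields = [float(tok) for tok in line.strip().split(",")]
    if has_label:
        return fields[:-1], int(fields[-1])
    return fields, None


def load_samples(lines, has_label=True):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_sample(ln, has_label) for ln in lines if ln.strip()]
```

Passing an open file object to `load_samples` works as well, since iterating over a file yields its lines.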
lr_theta_0: the weight value for each dimension.
pred_v_x: the prediction results; the predicted label is stored in the last dimension of each line.
- You do not need to know the theory behind all of the learning methods; we recommend the ipm method. For more information, see the reference paper below.
- In output files, an extra dimension with value 1.0 is added as the first column.
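For readers curious about the recommended ipm method, the sketch below illustrates the general idea of iterative parameter mixing in a single process: in each round, every worker runs one SGD pass over its own data shard starting from the shared weights, and the resulting weights are then averaged. This is an illustration under assumptions, not Paracel's implementation: the helper names `sgd_epoch` and `ipm_train` are hypothetical, workers are simulated sequentially, and beta is assumed to be an L2 regularization weight.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_epoch(theta, shard, alpha, beta):
    """One SGD pass over a shard of (features, label) pairs.

    Returns a new weight list; the input weights are not modified.
    """
    theta = list(theta)
    for x, y in shard:
        p = sigmoid(sum(t * v for t, v in zip(theta, x)))
        for j, v in enumerate(x):
            # Gradient of the log loss plus an assumed L2 penalty.
            theta[j] -= alpha * ((p - y) * v + beta * theta[j])
    return theta


def ipm_train(shards, dim, rounds=100, alpha=0.001, beta=0.01):
    """Iterative parameter mixing over a list of data shards."""
    theta = [0.0] * dim
    for _ in range(rounds):
        # Each simulated worker trains locally from the shared weights.
        local = [sgd_epoch(theta, shard, alpha, beta) for shard in shards]
        # Mix: average the workers' weights dimension by dimension.
        theta = [sum(col) / len(local) for col in zip(*local)]
    return theta
```

Here each shard plays the role of one worker's portion of the training data, and `rounds`, `alpha`, and `beta` correspond to the configuration parameters of the same names.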
Hall, Keith B., Scott Gilpin, and Gideon Mann. "MapReduce/Bigtable for distributed optimization." NIPS LCCC Workshop. 2010.