
# Logistic Regression with L1 Penalty

This class implements L1 (Lasso) regularization on Logistic Regression using coordinate descent to achieve sparse solutions. It accepts both in-memory inputs and data stored in a PySpark RDD. This implementation is intended for large datasets; for small datasets, scikit-learn is sufficient.

### Mathematical Formulation

This method is formulated as the maximization of the concave log-likelihood function, subject to an L1 penalty on the coefficients, with the probabilities defined by the logistic function.

The ith index denotes a set of observations with the same x_i, but with m_i occurrences of this observation and y_i trials that result in a "success." There are n observations. The objective is to find the optimal beta coefficients that minimize the penalized negative log-likelihood, sketched below.
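As a sketch, the objective implied by the description above (the L1-penalized binomial negative log-likelihood with the logistic link, binomial constants dropped) can be written as:

```latex
% Minimize the negative log-likelihood plus an L1 penalty on the coefficients
\min_{\beta_0,\,\beta}\;
  -\sum_{i=1}^{n}\Bigl[\, y_i \log p_i \;+\; (m_i - y_i)\log(1 - p_i) \,\Bigr]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\qquad
p_i \;=\; \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```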

The coordinate descent technique used here is described in Section 3 of "Regularization Paths for Generalized Linear Models via Coordinate Descent" (Friedman, Hastie, and Tibshirani, 2010).

For an explanation of how this implementation solves the non-differentiable objective function from the L1 penalty term, refer to Chapter 13 of Machine Learning: A Probabilistic Perspective (Murphy, 2012).
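The key idea in those references is that each one-dimensional coordinate subproblem has a closed-form solution via soft-thresholding. As a rough sketch, for a quadratic (second-order Taylor) approximation of the log-likelihood with working responses z_i and working weights w_i, the pure-lasso coordinate update takes the form:

```latex
% Soft-thresholding operator
S(z,\gamma) \;=\; \operatorname{sign}(z)\,\bigl(\lvert z\rvert - \gamma\bigr)_{+}

% Coordinate update for beta_j, for an objective of the form
%   (1/2) \sum_i w_i (z_i - \hat{y}_i)^2 + \lambda \sum_j |\beta_j|,
% where \tilde{y}_i^{(j)} is the fitted value excluding predictor j:
\beta_j \;\leftarrow\;
  \frac{S\!\Bigl(\sum_{i} w_i\, x_{ij}\,\bigl(z_i - \tilde{y}_i^{(j)}\bigr),\; \lambda\Bigr)}
       {\sum_{i} w_i\, x_{ij}^{2}}
```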

## Installation

Clone the repository and install:

```
$ python setup.py install
```

## Usage and Example

The primary input is a matrix. If you do not need PySpark, pass the data in as a NumPy array. If `pyspark=True`, the input must be an RDD.

The matrix format for both cases should include observations, a weight for each row, and the number of successes. It will look something like this, where each variable listed is an array:

```
[ x1, x2, ... , num_occurrences_of_row, num_successes ]
```

If each row is unweighted, then the second to last column (num_occurrences_of_row) will be ones. The last column (num_successes) will then be zeros or ones.
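For instance, a small unweighted dataset with two predictors could be assembled like this (illustrative values; the column layout matches the format above):

```python
import numpy as np

# Two predictors observed for five rows (illustrative values)
x1 = np.array([0.5, 1.2, -0.3, 2.0, 0.7])
x2 = np.array([1.0, 0.0, 0.4, 1.5, -1.1])

# Unweighted data: every row occurred once, so num_occurrences_of_row is all ones
m = np.ones(5)
# num_successes is then 0 or 1 per row
y = np.array([1, 0, 0, 1, 1])

# [ x1, x2, num_occurrences_of_row, num_successes ]
data = np.column_stack([x1, x2, m, y])
```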

The `fit` method will automatically insert an intercept term and return coefficients that include this bias term.

### The PySpark RDD

For improved performance, the PySpark RDD's partitions should each hold a single NumPy array.

The `fit` method checks whether each partition is a list containing a single NumPy array; if it is not, the partition will be converted into this format.
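If you are building the RDD yourself, one way to get this layout (a sketch, assuming an existing `SparkContext` named `sc` and a predefined iterable of rows in the matrix format above) is to collapse each partition into a single NumPy array with `mapPartitions`:

```python
import numpy as np

# rows is a predefined iterable of [x1, x2, ..., num_occurrences_of_row, num_successes] lists
rdd = sc.parallelize(rows, numSlices=8)

# Turn each partition into a list holding one NumPy array,
# the layout fit expects for best performance.
rdd = rdd.mapPartitions(lambda part: [np.array(list(part))])
```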

### Example

#### Without PySpark

```python
import numpy as np
from logistic_regression_l1 import LogisticRegressionL1

# x_1, x_2, m, y are predefined arrays
data = np.array([x_1, x_2, m, y]).T
lambda_grid = np.exp(-1 * np.linspace(1, 17, 200))

logit_no_pyspark = LogisticRegressionL1()
logit_no_pyspark.fit(data, lambda_grid, .00000001, False)  # last argument: pyspark=False

# x_3, x_4 are arrays of new observations
new_observations = np.array([x_3, x_4]).T
logit_no_pyspark.predict(new_observations, False)
```

#### With PySpark

```python
import numpy as np
from logistic_regression_l1 import LogisticRegressionL1

# sparkrdd is a predefined RDD with the same format as the non-PySpark case
lambda_grid = np.exp(-1 * np.linspace(1, 17, 200))

logit_pyspark = LogisticRegressionL1()
logit_pyspark.fit(sparkrdd, lambda_grid, .00000001, True)  # last argument: pyspark=True

# sparkrdd_new_obs is the RDD holding the observations to predict
logit_pyspark.predict(sparkrdd_new_obs, True)
```

## Regularization Path

To take full advantage of the Lasso regularization, use the entire output of the `fit` method: it returns the whole regularization path, with one row of coefficients for each lambda value in the grid you supply.

The lambda grid controls how you step through the L1 penalty/constraint. The steps between successive lambda values should be small, because the algorithm relies on a second-order Taylor approximation.

If you plot the coefficients against the sum of the coefficients at each step, the result is the familiar Lasso regularization path.
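A minimal plotting sketch, assuming the coefficient matrix returned by `fit` (one row per lambda) is stored in `coefs` and that matplotlib is available:

```python
import matplotlib.pyplot as plt

# coefs: regularization path returned by fit, shape (len(lambda_grid), n_coefficients)
totals = coefs.sum(axis=1)  # sum of the coefficients at each step

for j in range(coefs.shape[1]):
    plt.plot(totals, coefs[:, j])

plt.xlabel("sum of coefficients")
plt.ylabel("coefficient value")
plt.title("L1 regularization path")
plt.show()
```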

The most important features will appear in the most constrained iterations.