# SciPy

In [10]:
import numpy as np
import scipy as sp

## Sparse Matrices

In some machine learning tasks, especially those involving
text analysis, the data may consist mostly of zeros. Storing all of these zeros is very
inefficient; a representation that keeps only the non-zero values can be much more compact.

In [11]:
from scipy import sparse

# Create a random array and set most of its entries to zero
rnd = np.random.RandomState(seed=123)
X = rnd.uniform(low=0.0, high=1.0, size=(10, 5))
X[X < 0.9] = 0

# turn X into a CSR (Compressed-Sparse-Row) matrix
X_csr = sparse.csr_matrix(X)
print(X_csr)

# Go back to an array
Y = X_csr.toarray()

  (1, 1)	0.980764198385
  (7, 3)	0.944160018204
  (9, 2)	0.985559785611
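
To make the storage argument concrete, the following sketch compares the memory used by a dense array with the memory used by the three arrays that back its CSR form. The array shape and the 0.99 threshold are illustrative assumptions, not values from the example above.

```python
import numpy as np
from scipy import sparse

rnd = np.random.RandomState(seed=123)
X_big = rnd.uniform(low=0.0, high=1.0, size=(1000, 500))
X_big[X_big < 0.99] = 0  # keep roughly 1% of the entries

X_big_csr = sparse.csr_matrix(X_big)

# The dense array stores every entry, zero or not.
dense_bytes = X_big.nbytes

# CSR stores three arrays: the non-zero values, their column indices,
# and a row-pointer array with one entry per row plus one.
sparse_bytes = (X_big_csr.data.nbytes
                + X_big_csr.indices.nbytes
                + X_big_csr.indptr.nbytes)

print("non-zero entries:", X_big_csr.nnz)
print("dense bytes:", dense_bytes)
print("CSR bytes:", sparse_bytes)
```

The saving shrinks as the matrix gets denser, because each stored value also needs an index.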


The CSR representation can be very efficient for computations, but it is not
well suited to adding elements one at a time.  For that, the LIL (List of Lists)
representation is better:

In [12]:
# Create an empty LIL matrix and add some items
X_lil = sparse.lil_matrix((5, 5))

for i, j in np.random.randint(0, 5, (15, 2)):
    X_lil[i, j] = i + j

# Conversion to an array
X_dense = X_lil.toarray()

# Convert the LIL matrix to a CSR (Compressed-Sparse-Row) matrix
X_csr = X_lil.tocsr()
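
A common pattern that follows from this is to build a matrix incrementally in LIL format and convert it to CSR before doing arithmetic. The sketch below assumes a small 5x5 matrix and an arbitrary vector `v`; it checks that the sparse product matches the dense one.

```python
import numpy as np
from scipy import sparse

rnd = np.random.RandomState(seed=0)

# Build the matrix entry by entry in LIL format.
A_lil = sparse.lil_matrix((5, 5))
for i, j in rnd.randint(0, 5, (15, 2)):
    # Repeated (i, j) pairs simply overwrite the earlier value.
    A_lil[i, j] = i + j + 1

# Convert once, then use CSR for the computation.
A_csr = A_lil.tocsr()
v = np.arange(5, dtype=float)

result_sparse = A_csr.dot(v)
result_dense = A_lil.toarray().dot(v)

print(np.allclose(result_sparse, result_dense))
```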

The available sparse formats that can be useful for various problems are:

- `CSR` (compressed sparse row)
- `CSC` (compressed sparse column)
- `BSR` (block sparse row)
- `COO` (coordinate)
- `DIA` (diagonal)
- `DOK` (dictionary of keys)
- `LIL` (list of lists)
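
As a small illustration of moving between these formats, the sketch below builds a COO matrix from explicit coordinates and converts it to CSC and DOK. The coordinates and values are made up for the example.

```python
import numpy as np
from scipy import sparse

# COO is built from parallel arrays of values and (row, column) positions.
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 3])
vals = np.array([1.0, 2.0, 3.0])
A_coo = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 4))

# Each format offers to<format>() conversion methods.
A_csc = A_coo.tocsc()
A_dok = A_coo.todok()

print(A_coo.format, A_csc.format, A_dok.format)
print(A_dok[1, 0])  # DOK supports direct element lookup
```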

The [``scipy.sparse``](http://docs.scipy.org/doc/scipy/reference/sparse.html) submodule also provides many functions for working with sparse matrices,
including linear algebra routines, sparse solvers, and graph algorithms.
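
Two of these capabilities can be sketched briefly: solving a sparse linear system with `scipy.sparse.linalg.spsolve` and counting connected components with `scipy.sparse.csgraph.connected_components`. The tridiagonal system and the four-node graph are illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
from scipy.sparse.csgraph import connected_components

# Sparse solver: a small tridiagonal system A x = b.
A = sparse.diags([[-1.0] * 3, [2.0] * 4, [-1.0] * 3],
                 offsets=[-1, 0, 1], format="csc")
b = np.ones(4)
x = spsolve(A, b)
print(np.allclose(A.dot(x), b))

# Graph algorithm: an undirected graph with edges 0-1 and 2-3,
# stored as a sparse adjacency matrix.
adjacency = sparse.csr_matrix(np.array([[0, 1, 0, 0],
                                        [1, 0, 0, 0],
                                        [0, 0, 0, 1],
                                        [0, 0, 1, 0]]))
n_components, labels = connected_components(adjacency, directed=False)
print(n_components)
```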

## Training and Testing Data

In [13]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    train_size=0.5, 
                                                    random_state=123)

**Tip: Stratified Split**

Stratification means that the training and test sets each preserve the class proportions of the original dataset.

In [14]:
train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    train_size=0.5, 
                                                    random_state=123,
                                                    stratify=y)

print('All:', np.bincount(y) / float(len(y)) * 100.0)
print('Training:', np.bincount(train_y) / float(len(train_y)) * 100.0)
print('Test:', np.bincount(test_y) / float(len(test_y)) * 100.0)

All: [ 33.33333333  33.33333333  33.33333333]
Training: [ 33.33333333  33.33333333  33.33333333]
Test: [ 33.33333333  33.33333333  33.33333333]
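
To see what `stratify=y` changes, the sketch below runs the same split with and without it and prints the per-class counts in the training set. It reuses the split settings from above; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Plain random split: class counts depend on the shuffle.
_, _, plain_train_y, plain_test_y = train_test_split(
    X_iris, y_iris, train_size=0.5, random_state=123)

# Stratified split: class counts follow the proportions in y_iris.
_, _, strat_train_y, strat_test_y = train_test_split(
    X_iris, y_iris, train_size=0.5, random_state=123, stratify=y_iris)

print("Plain split, training counts:     ", np.bincount(plain_train_y, minlength=3))
print("Stratified split, training counts:", np.bincount(strat_train_y, minlength=3))
```

Stratification matters most when classes are imbalanced or the dataset is small, since a plain random split can then leave a class under-represented in one of the subsets.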


## Training a classifier

In [15]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()
classifier.fit(train_X, train_y)
pred_y = classifier.predict(test_X)

classifier.score(test_X, test_y)

0.95999999999999996
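
For classifiers, `score` returns the mean accuracy on the given data, so it can be reproduced from the predictions directly. The sketch below repeats the stratified split from above so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X_iris, y_iris, train_size=0.5, random_state=123, stratify=y_iris)

classifier = KNeighborsClassifier()
classifier.fit(train_X, train_y)
pred_y = classifier.predict(test_X)

# Accuracy is the fraction of test samples predicted correctly.
manual_accuracy = np.mean(pred_y == test_y)

print(manual_accuracy)
print(classifier.score(test_X, test_y))
```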

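Depending on the model, you can access the parameters estimated during fitting. They are stored as attributes of the estimator object, and their names end with an underscore. A `KNeighborsClassifier` has no `coef_` or `intercept_` (those belong to linear models), but it does expose attributes such as `classes_` and `n_samples_fit_`:

```
print(classifier.classes_)
print(classifier.n_samples_fit_)
```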
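
For a linear model, `coef_` and `intercept_` are populated once `fit` has been called. The sketch below assumes `LogisticRegression` as the linear model, which the text above does not use, and fits it on the full iris data purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

# max_iter is raised as a precaution so the solver has room to converge.
linear_clf = LogisticRegression(max_iter=1000)
linear_clf.fit(X_iris, y_iris)

# One row of coefficients per class, one column per feature.
print(linear_clf.coef_.shape)
print(linear_clf.intercept_.shape)
```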