* [Question 1](#Question)

* [Question 2](#Question2)

* [Question 3](#Question3)

In [None]:
from sklearn.model_selection import KFold
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    for train, test in KFold(nfold).split(X): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit the classifier, passed in as clf
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

In [None]:
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
_, itest = train_test_split(range(critics.shape[0]), train_size=0.7)
mask = np.zeros(critics.shape[0], dtype=bool) # the np.bool alias was removed in NumPy 1.24
mask[itest] = True

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Classifier Documentation

**Init signature:** MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

**Docstring:**     
Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with
discrete features (e.g., word counts for text classification). The
multinomial distribution normally requires integer feature counts. However,
in practice, fractional counts such as tf-idf may also work.

**Parameters**
****
alpha : float, optional (default=1.0)
    Additive (Laplace/Lidstone) smoothing parameter
    (0 for no smoothing).

fit_prior : boolean, optional (default=True)
    Whether to learn class prior probabilities or not.
    If false, a uniform prior will be used.

class_prior : array-like, size (n_classes,), optional (default=None)
    Prior probabilities of the classes. If specified the priors are not
    adjusted according to the data.

**Attributes**
****
class_log_prior_ : array, shape (n_classes, )
    Smoothed empirical log probability for each class.

intercept_ : property
    Mirrors ``class_log_prior_`` for interpreting MultinomialNB
    as a linear model.

feature_log_prob_ : array, shape (n_classes, n_features)
    Empirical log probability of features
    given a class, ``P(x_i|y)``.

coef_ : property
    Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
    as a linear model.

class_count_ : array, shape (n_classes,)
    Number of samples encountered for each class during fitting. This
    value is weighted by the sample weight when provided.

feature_count_ : array, shape (n_classes, n_features)
    Number of samples encountered for each (class, feature)
    during fitting. This value is weighted by the sample weight when
    provided.
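
To make the attributes above concrete, here is a minimal sketch on a made-up count matrix. The toy data (`X_toy`, `y_toy`) is an assumption for illustration and has nothing to do with the critics dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (assumed for illustration): 4 documents, 3 vocabulary terms.
X_toy = np.array([[2, 1, 0],
                  [3, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

nb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# One prior per class; with two documents in each class both come out as 0.5.
print(np.exp(nb.class_log_prior_))

# One row per class, one column per term; each row sums to 1 once exponentiated.
print(np.exp(nb.feature_log_prob_))

# Raw per-class term counts, before the alpha smoothing is added.
print(nb.feature_count_)
```

With `alpha=1.0`, each smoothed probability is `(count + 1) / (class total + n_features)`, which is why a term never seen in a class still gets a non-zero probability.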

In [None]:
#the grid of parameters to search over
alphas = [.1, 1, 5, 10, 50]
min_dfs = [1, 2, 3, 5, 10, 15, 21, 22, 23, 24, 25]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
maxscore=-np.inf

for min_df in min_dfs:
    for alpha in alphas:        
        vectorizer = CountVectorizer(min_df=min_df)       
        X2, y2 = make_xy(critics, vectorizer)
        X2train = X2[mask]
        y2train = y2[mask]
        
        # your turn
        clf = MultinomialNB(alpha=alpha)
        score = cv_score(clf, X2train, y2train, log_likelihood)
        if score > maxscore: 
            maxscore = score
            best_min_df = min_df
            best_alpha = alpha
            
print('best min_df:', best_min_df)
print('best alpha:', best_alpha)
print('best score:', maxscore)

****

# Question
- _I struggled with how to choose a sensible range of alphas for CV/tuning._
    - Laying out the documentation this way has already made it slightly clearer.
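
One common heuristic (not something the exercise prescribes) is to start with a log-spaced grid covering several orders of magnitude, and to widen it if the best value lands on an edge. A minimal sketch, assuming synthetic count data and the built-in `'neg_log_loss'` scorer in place of the `log_likelihood` function above; `alpha_grid`, `X_demo`, and `y_demo` are illustrative names:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic count data, assumed purely for illustration.
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
# Class 1 documents use the first 10 terms more often than class 0 documents.
rates = np.where(y_demo[:, None] == 1, 2.0, 1.0) * np.ones((200, 10))
X_demo = np.hstack([rng.poisson(rates), rng.poisson(1.0, size=(200, 40))])

# Log-spaced candidates cover four orders of magnitude with only nine points.
alpha_grid = np.logspace(-2, 2, 9)
mean_scores = [
    cross_val_score(MultinomialNB(alpha=a), X_demo, y_demo,
                    cv=5, scoring='neg_log_loss').mean()
    for a in alpha_grid
]

for a, s in zip(alpha_grid, mean_scores):
    print('alpha={:8.2f}  mean neg log loss={:.4f}'.format(a, s))

best_alpha_demo = alpha_grid[int(np.argmax(mean_scores))]
print('best alpha:', best_alpha_demo)
```

If `best_alpha_demo` turns out to be the smallest or largest candidate, that suggests the range should be extended in that direction before refining it.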
    
****

In [None]:
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(critics, vectorizer)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

clf = MultinomialNB(alpha=best_alpha).fit(Xtrain, ytrain)

#your turn. Print the accuracy on the test and training dataset
training_accuracy = clf.score(Xtrain, ytrain)
test_accuracy = clf.score(Xtest, ytest)

print("Accuracy on training data: {:.2f}".format(training_accuracy))
print("Accuracy on test data:     {:.2f}".format(test_accuracy))

# Count Vectorizer

**Init signature:** CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

**Docstring:**    
Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using
scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer
that does some kind of feature selection then the number of features will
be equal to the vocabulary size found by analyzing the data.

**min_df :** float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.
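
A small sketch of what `min_df` does to the vocabulary, using four made-up documents (assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, assumed for illustration only.
docs = [
    'great fresh movie',
    'great cast',
    'rotten plot',
    'rotten movie',
]

vocab_by_min_df = {}
for threshold in (1, 2):
    vec = CountVectorizer(min_df=threshold).fit(docs)
    # vocabulary_ maps each kept term to its column index.
    vocab_by_min_df[threshold] = sorted(vec.vocabulary_)
    print(threshold, vocab_by_min_df[threshold])
```

With `min_df=1` all six terms are kept; with `min_df=2` only `great`, `movie`, and `rotten` survive, because they are the only terms that appear in at least two documents.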

****

# DataCamp Introduction Examples

****

## CV

In [4]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [None]:
# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

# LinearRegression

**Init signature:** LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

**Docstring:**     
Ordinary least squares Linear Regression.

**Parameters**
****
fit_intercept : boolean, optional, default True
    whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculations
    (e.g. data is expected to be already centered).

normalize : boolean, optional, default False
    This parameter is ignored when ``fit_intercept`` is set to False.
    If True, the regressors X will be normalized before regression by
    subtracting the mean and dividing by the l2-norm.
    If you wish to standardize, please use
    :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on
    an estimator with ``normalize=False``.

copy_X : boolean, optional, default True
    If True, X will be copied; else, it may be overwritten.

n_jobs : int, optional, default 1
    The number of jobs to use for the computation.
    If -1 all CPUs are used. This will only provide speedup for
    n_targets > 1 and sufficient large problems.
    
*From the implementation point of view, this is just plain Ordinary
Least Squares*

# cross_val_score

**Signature:** cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

**Docstring:**
Evaluate a score by cross-validation

Read more in the :ref:`User Guide <cross_validation>`.

**Parameters**
****
estimator : estimator object implementing 'fit'
    The object to use to fit the data.

X : array-like
    The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None
    The target variable to try to predict in the case of
    supervised learning.

groups : array-like, with shape (n_samples,), optional
    Group labels for the samples used while splitting the dataset into
    train/test set.

scoring : string, callable or None, optional, default: None
    A string (see model evaluation documentation) or
    a scorer callable object / function with signature
    ``scorer(estimator, X, y)``.

cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default 3-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - An object to be used as a cross-validation generator.
    - An iterable yielding train, test splits.

    For integer/None inputs, if the estimator is a classifier and ``y`` is
    either binary or multiclass, :class:`StratifiedKFold` is used. In all
    other cases, :class:`KFold` is used.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

n_jobs : integer, optional
    The number of CPUs to use to do the computation. -1 means
    'all CPUs'.

verbose : integer, optional
    The verbosity level.

fit_params : dict, optional
    Parameters to pass to the fit method of the estimator.

pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel
    execution. Reducing this number can be useful to avoid an
    explosion of memory consumption when more jobs get dispatched
    than CPUs can process. This parameter can be:

        - None, in which case all the jobs are immediately
          created and spawned. Use this for lightweight and
          fast-running jobs, to avoid delays due to on-demand
          spawning of the jobs

        - An int, giving the exact number of total jobs that are
          spawned

        - A string, giving an expression as a function of n_jobs,
          as in '2*n_jobs'

**Returns**
****
scores : array of float, shape=(len(list(cv)),)
    Array of scores of the estimator for each run of the cross validation.
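
Because `scoring` accepts any callable with the signature `scorer(estimator, X, y)`, the `log_likelihood` function from earlier in this notebook can be passed in directly. A minimal sketch on synthetic count data (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def log_likelihood(clf, x, y):
    # Same scorer as defined earlier in this notebook.
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()

# Synthetic count data, assumed for illustration.
rng = np.random.RandomState(1)
X_demo = rng.poisson(1.0, size=(100, 20))
y_demo = rng.randint(0, 2, size=100)

scores = cross_val_score(MultinomialNB(), X_demo, y_demo,
                         cv=5, scoring=log_likelihood)
print(scores.shape)   # one score per fold
print(scores.mean())
```

One difference from the hand-written `cv_score`: with an integer `cv` and a classifier, `cross_val_score` uses `StratifiedKFold` rather than plain `KFold`, so the folds are not identical.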

****

# Question2

I believe I am confusing CV with hyperparameter tuning and using the terms interchangeably. Both are iterative processes, so they are easy to confuse.
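
One way to see the difference, as a rough sketch: cross-validation on its own scores a single fixed model, while tuning repeats that scoring for each candidate value and keeps the best. The example assumes the built-in iris data and a three-value grid chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Cross-validation alone: estimate how well ONE fixed model generalizes.
fixed_knn = KNeighborsClassifier(n_neighbors=5)
fixed_scores = cross_val_score(fixed_knn, X_iris, y_iris, cv=5)
print('fixed k=5, mean CV accuracy:', fixed_scores.mean())

# Hyperparameter tuning: run the same evaluation per candidate, keep the best.
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 5, 15]}, cv=5)
search.fit(X_iris, y_iris)
print('best k:', search.best_params_['n_neighbors'])
print('its mean CV accuracy:', search.best_score_)
```

So CV is the measuring procedure, and hyperparameter tuning is a search that uses that procedure as its yardstick.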
****

# From my notes:

# Hyperparameter tuning

- Linear regression: Choosing parameters
- Ridge/lasso regression: Choosing alpha
- k-Nearest Neighbors: Choosing n_neighbors
- Parameters like alpha and k are called **Hyperparameters**
- Hyperparameters cannot be learned by fitting the model

## Choosing the correct hyperparameter

- Try a bunch of different hyperparameter values
- Fit all of them separately
- See how well each performs
- Choose the best performing one
- It is essential to **use cross-validation**
    - Using `train_test_split` alone would risk over-fitting the hyperparameter to the test set.
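
The steps above can be sketched by hand as follows. This assumes the built-in breast cancer dataset and an arbitrary list of candidate `k` values, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Hold out a test set that plays no part in choosing the hyperparameter.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42, stratify=y_bc)

# Try several values, scoring each with 5-fold CV on the training data only.
candidates = [1, 3, 5, 9, 15]
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in candidates
]
best_k = candidates[int(np.argmax(cv_means))]

# Refit the winner on all training data, then touch the test set exactly once.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print('best k:', best_k)
print('test accuracy:', final_knn.score(X_te, y_te))
```

Because the test set is used only for the final print, its accuracy stays an honest estimate rather than something the choice of `k` was tuned against.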

In [5]:
from sklearn.model_selection import GridSearchCV

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': np.arange(1, 50)}

knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn, param_grid, cv=5)

### GridSearchCV

**Init signature:** GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

**Docstring:**
Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a "fit" and a "score" method.
It also implements "predict", "predict_proba", "decision_function",
"transform" and "inverse_transform" if they are implemented in the
estimator used.

The parameters of the estimator used to apply these methods are optimized
by cross-validated grid-search over a parameter grid.

**Parameters**
****
estimator : estimator object.
    This is assumed to implement the scikit-learn estimator interface.
    Either estimator needs to provide a ``score`` function,
    or ``scoring`` must be passed.

param_grid : dict or list of dictionaries
    Dictionary with parameters names (string) as keys and lists of
    parameter settings to try as values, or a list of such
    dictionaries, in which case the grids spanned by each dictionary
    in the list are explored. This enables searching over any sequence
    of parameter settings.

scoring : string, callable, list/tuple, dict or None, default: None
    A single string (see :ref:`scoring_parameter`) or a callable
    (see :ref:`scoring`) to evaluate the predictions on the test set.

    For evaluating multiple metrics, either give a list of (unique) strings
    or a dict with names as keys and callables as values.

    NOTE that when using custom scorers, each scorer should return a single
    value. Metric functions returning a list/array of values can be wrapped
    into multiple scorers that return one value each.

    See :ref:`multimetric_grid_search` for an example.

    If None, the estimator's default scorer (if available) is used.

fit_params : dict, optional
    Parameters to pass to the fit method.

    .. deprecated:: 0.19
       ``fit_params`` as a constructor argument was deprecated in version
       0.19 and will be removed in version 0.21. Pass fit parameters to
       the ``fit`` method instead.

n_jobs : int, default=1
    Number of jobs to run in parallel.

pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel
    execution. Reducing this number can be useful to avoid an
    explosion of memory consumption when more jobs get dispatched
    than CPUs can process. This parameter can be:

        - None, in which case all the jobs are immediately
          created and spawned. Use this for lightweight and
          fast-running jobs, to avoid delays due to on-demand
          spawning of the jobs

        - An int, giving the exact number of total jobs that are
          spawned

        - A string, giving an expression as a function of n_jobs,
          as in '2*n_jobs'

iid : boolean, default=True
    If True, the data is assumed to be identically distributed across
    the folds, and the loss minimized is the total loss per sample,
    and not the mean loss across the folds.

cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:
      - None, to use the default 3-fold cross validation,
      - integer, to specify the number of folds in a `(Stratified)KFold`,
      - An object to be used as a cross-validation generator.
      - An iterable yielding train, test splits.

    For integer/None inputs, if the estimator is a classifier and ``y`` is
    either binary or multiclass, :class:`StratifiedKFold` is used. In all
    other cases, :class:`KFold` is used.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

refit : boolean, or string, default=True
    Refit an estimator using the best found parameters on the whole
    dataset.

    For multiple metric evaluation, this needs to be a string denoting the
    scorer is used to find the best parameters for refitting the estimator
    at the end.

    The refitted estimator is made available at the ``best_estimator_``
    attribute and permits using ``predict`` directly on this
    ``GridSearchCV`` instance.

    Also for multiple metric evaluation, the attributes ``best_index_``,
    ``best_score_`` and ``best_params_`` will only be available if
    ``refit`` is set and all of them will be determined w.r.t this specific
    scorer.

    See ``scoring`` parameter to know more about multiple metric
    evaluation.

verbose : integer
    Controls the verbosity: the higher, the more messages.

error_score : 'raise' (default) or numeric
    Value to assign to the score if an error occurs in estimator fitting.
    If set to 'raise', the error is raised. If a numeric value is given,
    FitFailedWarning is raised. This parameter does not affect the refit
    step, which will always raise the error.

return_train_score : boolean, optional
    If ``False``, the ``cv_results_`` attribute will not include training
    scores.

    Current default is ``'warn'``, which behaves as ``True`` in addition
    to raising a warning when a training score is looked up.
    That default will be changed to ``False`` in 0.21.
    Computing training scores is used to get insights on how different
    parameter settings impact the overfitting/underfitting trade-off.
    However computing the scores on the training set can be computationally
    expensive and is not strictly required to select the parameters that
    yield the best generalization performance.
    
**Attributes**
****
cv_results_ : dict of numpy (masked) ndarrays
    A dict with keys as column headers and values as columns, that can be
    imported into a pandas ``DataFrame``.

    For instance the below given table

    +------------+-----------+------------+-----------------+---+---------+
    |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
    +============+===========+============+=================+===+=========+
    |  'poly'    |     --    |      2     |        0.8      |...|    2    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'poly'    |     --    |      3     |        0.7      |...|    4    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.1   |     --     |        0.8      |...|    3    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.2   |     --     |        0.9      |...|    1    |
    +------------+-----------+------------+-----------------+---+---------+

    will be represented by a ``cv_results_`` dict of::

        {
        'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                     mask = [False False False False]...)
        'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                                    mask = [ True  True False False]...),
        'param_degree': masked_array(data = [2.0 3.0 -- --],
                                     mask = [False False  True  True]...),
        'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
        'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
        'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
        'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
        'rank_test_score'    : [2, 4, 3, 1],
        'split0_train_score' : [0.8, 0.9, 0.7],
        'split1_train_score' : [0.82, 0.5, 0.7],
        'mean_train_score'   : [0.81, 0.7, 0.7],
        'std_train_score'    : [0.03, 0.03, 0.04],
        'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
        'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
        'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
        'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
        'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
        }

    NOTE

    The key ``'params'`` is used to store a list of parameter
    settings dicts for all the parameter candidates.

    The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and
    ``std_score_time`` are all in seconds.

    For multi-metric evaluation, the scores for all the scorers are
    available in the ``cv_results_`` dict at the keys ending with that
    scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown
    above. ('split0_test_precision', 'mean_train_precision' etc.)

best_estimator_ : estimator or dict
    Estimator that was chosen by the search, i.e. estimator
    which gave highest score (or smallest loss if specified)
    on the left out data. Not available if ``refit=False``.

    See ``refit`` parameter for more information on allowed values.

best_score_ : float
    Mean cross-validated score of the best_estimator

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

best_params_ : dict
    Parameter setting that gave the best results on the hold out data.

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

best_index_ : int
    The index (of the ``cv_results_`` arrays) which corresponds to the best
    candidate parameter setting.

    The dict at ``search.cv_results_['params'][search.best_index_]`` gives
    the parameter setting for the best model, that gives the highest
    mean score (``search.best_score_``).

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

scorer_ : function or a dict
    Scorer function used on the held out data to choose the best
    parameters for the model.

    For multi-metric evaluation, this attribute holds the validated
    ``scoring`` dict which maps the scorer key to the scorer callable.

n_splits_ : int
    The number of cross-validation splits (folds/iterations).

**Notes**
****
The parameters selected are those that maximize the score of the left out
data, unless an explicit score is passed in which case it is used instead.

If `n_jobs` was set to a value higher than one, the data is copied for each
point in the grid (and not `n_jobs` times). This is done for efficiency
reasons if individual jobs take very little time, but may raise errors if
the dataset is large and not enough memory is available.  A workaround in
this case is to set `pre_dispatch`. Then, the memory is copied only
`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
n_jobs`.
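
A short sketch of how the attributes documented above look after a fit, assuming the built-in iris data and a four-value grid picked for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7]}, cv=5)
grid.fit(X_iris, y_iris)

# cv_results_ is a dict of equal-length arrays, so it converts to a DataFrame.
results = pd.DataFrame(grid.cv_results_)
print(results[['param_n_neighbors', 'mean_test_score',
               'std_test_score', 'rank_test_score']])

# best_index_ points at the row of the winning candidate.
print(grid.cv_results_['params'][grid.best_index_] == grid.best_params_)
print(grid.n_splits_)
```

Looking at the whole `results` table, rather than only `best_params_`, shows how close the runner-up candidates were.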


# CV and Scaling Pipeline

```python
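import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then classify; both steps are refit inside each CV fold
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
# Keys use the <step name>__<parameter name> convention
parameters = {'knn__n_neighbors': np.arange(1, 50)}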

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)

y_pred = cv.predict(X_test)

print(cv.best_params_)

print(cv.score(X_test, y_test))

print(classification_report(y_test, y_pred))
```

****

# Question3
Is grid search a general-purpose cross-validation + parameter tuner?
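
As far as I can tell, yes, for any estimator or pipeline that follows the scikit-learn interface. As a rough sketch, the manual `min_df`/`alpha` loop from the start of this notebook could be expressed as one `GridSearchCV` over a pipeline. The tiny corpus, the step names `vect` and `nb`, and the `'neg_log_loss'` scorer are assumptions for illustration, not the exercise's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, assumed for illustration (1 = fresh, 0 = rotten).
fresh_docs = ['great film', 'wonderful acting', 'great story',
              'wonderful film', 'loved the story']
rotten_docs = ['boring film', 'terrible acting', 'boring story',
               'terrible film', 'hated the story']
texts = (fresh_docs + rotten_docs) * 4
labels = ([1] * 5 + [0] * 5) * 4

text_clf = Pipeline([('vect', CountVectorizer()), ('nb', MultinomialNB())])

# One grid tunes parameters of both steps at once.
text_param_grid = {
    'vect__min_df': [1, 2],
    'nb__alpha': [0.1, 1, 5],
}

text_search = GridSearchCV(text_clf, text_param_grid, cv=5, scoring='neg_log_loss')
text_search.fit(texts, labels)
print(text_search.best_params_)
print(text_search.best_score_)
```

Putting the vectorizer inside the pipeline also means the vocabulary is rebuilt from the training part of each fold, which the manual loop above does not do.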

****