# Overview

Our objective is to predict a new venue's popularity from information available when the venue opens. We will do this by training machine-learning models on a dataset of venue popularities provided by Yelp. The dataset contains metadata about each venue (where it is located, the type of food served, etc.). It also contains a star rating.

For this project, the star rating will be our **dependent variable.** All of the other information in the dataset will be our **independent variables.** 

There are 5 parts to this project. In parts 1 through 4, you will build estimators which each use different independent variables in order to predict the dependent variable. In part 5, you will combine the estimators you have created in parts 1 through 4 to build a FullModel, which performs better than any of its component parts alone.

## Download and parse the data

In your VM, use the `wget` command to download the dataset:

```bash
wget http://eds.thecads.org/yelp_train_academic_dataset_business.json.gz
```

Notice that each row of the file is a JSON blurb. You can read it in Python. *Hints:*
1. `gzip.open` ([docs](https://docs.python.org/2/library/gzip.html)) has the same interface as `open` but is for `.gz` files.
2. The library `json` has a function called `loads()` which can read a single JSON blurb into a Python dictionary.

Put the data into a pandas DataFrame.

In [None]:
#Imports for project
import pandas as pd
import numpy as np
import gzip
import json
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint
import sklearn as sk
%matplotlib inline

In [None]:
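# Load up the data: each line of the file is one JSON record
with gzip.open('yelp_train_academic_dataset_business.json.gz', "rb") as f:
    yelp_data = []
    for line in f:
        yelp_data.append(json.loads(line))

source_df = pd.DataFrame(yelp_data)
# reset_index() returns a new DataFrame (shown as the cell output); source_df itself is unchanged
source_df.reset_index()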

## Set up cross-validation
In order to track the performance of your machine-learning models, use `model_selection.train_test_split` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

Be sure to also separate the dependent and independent variables at this stage.

In [None]:
# Plot each numeric column against the star rating
y = source_df["stars"]
for col in source_df.columns:
    # Parentheses matter here: `and` binds tighter than `or`
    if col != "stars" and (source_df[col].dtype == np.float64 or source_df[col].dtype == np.int64):
        plt.figure()
        plt.title(col)
        plt.plot(source_df[col], y, '.')
        plt.xlabel(col)
        plt.ylabel('Stars')

In [None]:
# Separate the dependent variable (stars) from the independent variables
y = source_df["stars"]
X = source_df.loc[:, source_df.columns != 'stars']

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Building models in sklearn

All estimators (e.g. linear regression, k-means, etc.) support a `fit` method, and predictors such as regressors also support `predict`. In fact, you can build your own by inheriting from classes in `sklearn.base` and using this template:
``` python
class Estimator(base.BaseEstimator, base.RegressorMixin):
  def __init__(self, ...):
    # initialization code

  def fit(self, X, y):
    # fit the model ...
    return self

  def predict(self, X):
    return # prediction
```
The intended usage is:
``` python
estimator = Estimator(...)  # initialize
estimator.fit(X_train, y_train)  # fit data
y_pred = estimator.predict(X_test)  # predict answer
estimator.score(X_test, y_test)  # evaluate performance
```
`RegressorMixin` provides an implementation of `.score` (the coefficient of determination, $R^2$). Conforming to this convention has the benefit that many tools (e.g. cross-validation, grid search) rely on this interface, so you can use your new estimators with the existing `sklearn` infrastructure.
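
As a concrete illustration of the template, here is a minimal sketch of a regressor that always predicts the mean of its training targets. The class name `MeanRegressor` and the toy arrays are made up for this example and are not part of the project; the point is only that an estimator following the `fit`/`predict` convention gets `.score` for free and can be passed to tools such as `cross_val_score`.

```python
import numpy as np
from sklearn import base
from sklearn.model_selection import cross_val_score


class MeanRegressor(base.BaseEstimator, base.RegressorMixin):
    """Hypothetical toy estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        # Learned state is stored on the instance, with a trailing underscore
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One prediction per input row
        return np.full(len(X), self.mean_)


# Toy data, invented for illustration
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0])

model = MeanRegressor().fit(X_toy, y_toy)
y_pred = model.predict(X_toy)
r2 = model.score(X_toy, y_toy)  # inherited from RegressorMixin
cv_scores = cross_val_score(MeanRegressor(), X_toy, y_toy, cv=2)
```

A constant prediction is a useful baseline: any real model should score better than it.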

For example, `model_selection.GridSearchCV` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) takes an estimator and a grid of hyperparameter values as arguments, and returns another estimator. Upon fitting, it finds the best model (based on the supplied hyperparameter values) and uses that for prediction.
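
The following sketch shows this behaviour on synthetic data. The choice of `Ridge`, the `alpha` values, and the random data are assumptions made purely for illustration; the project itself applies grid search to a different estimator later on.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, invented for illustration
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# The search object is itself an estimator with fit/predict/score
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]}, cv=3)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]   # winning hyperparameter value
best_model = search.best_estimator_         # refit on all of X_toy with that value
y_pred = search.predict(X_toy[:5])          # delegates to best_estimator_
```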

## I. CityModel

`sklearn` provides a number of different types of estimators, which we will use for subsequent problems. For this first problem, we will write our own (simple) estimator.

This estimator will look only at one piece of data: which city the venue is in. You can imagine that the ratings in some cities are probably higher than in others on average.

When the `fit()` function is called, the estimator will use `groupby` and `mean` to compute the average rating for each city and store that information within the class. Note that in order for this to work, it will be necessary to join the `X` and `y` dataframes. This can easily be done by using:

```python
joined_df = X.join(y)
```

When a new observation is requested from the `predict()` function, the estimator should retrieve and report the average rating belonging to the city of the new observation. If the city of the new observation was not in the training set, return instead the average of all venues.
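
To make the join, group, and fallback steps concrete, here is a small sketch on a three-row toy dataset. The city names and ratings are invented for illustration, and `map` followed by `fillna` is just one possible way to implement the lookup.

```python
import pandas as pd

# Toy training data, invented for illustration
X_toy = pd.DataFrame({"city": ["Phoenix", "Phoenix", "Las Vegas"]})
y_toy = pd.Series([4.0, 3.0, 5.0], name="stars")

# Join features and target on the index, then average the rating per city
joined_df = X_toy.join(y_toy)
city_means = joined_df.groupby("city")["stars"].mean()
overall_mean = joined_df["stars"].mean()

# Look up each new venue's city; unseen cities fall back to the overall average
X_new = pd.DataFrame({"city": ["Phoenix", "Madison"]})
y_pred = X_new["city"].map(city_means).fillna(overall_mean).to_numpy()
```

Here the first prediction is the mean of the two training ratings for that city, and the second, for a city absent from the training data, is the average over all three venues.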

**Note:** It may be tempting to create an estimator which requires that you give only the name of the city as its input. Don't do it... it will make your life harder later. For every model in this project, build estimators which accept the full dataset as inputs. In other words, design your estimator so that you can use it as follows:
```python
y_pred = city_model.predict(X_test)
```
where `X_test` is the result of your call to `train_test_split` above. Make sure the type of `y_pred` is `ndarray`.

In [None]:
# CityModel

class Estimator(sk.base.BaseEstimator, sk.base.RegressorMixin):

    def __init__(self, groupby_col='city', mean_col='stars'):
        self.groupby_col = groupby_col
        self.mean_col = mean_col

    def fit(self, X, y):
        joined_df = X.join(y)
        # Average rating per city
        self.city_means_ = joined_df.groupby(self.groupby_col)[self.mean_col].mean()
        # Average over all venues, used for cities not seen during training
        self.global_mean_ = joined_df[self.mean_col].mean()
        return self

    def predict(self, X):
        # Look up each row's city; unseen cities map to NaN and get the overall average
        stars = X[self.groupby_col].map(self.city_means_).fillna(self.global_mean_)
        return np.asarray(stars)


In [None]:
my_estimator =  Estimator()
my_estimator.fit(X_train,y_train)
my_estimator.predict(X_test)


### Transformers

In the previous problem, we created the estimator and were able to design it in such a way that it accepted input in the format of our choosing.

For more complex operations, we will prefer to use estimators provided to us by `sklearn`. To do so, we sometimes need to process or **transform** the data before we can do machine-learning on it.  `sklearn` has Transformers to help with this.  They implement this interface:
``` python
class Transformer(base.BaseEstimator, base.TransformerMixin):
  def __init__(self, ...):
    # initialization code
    return

  def fit(self, X, y=None):
    # fit the transformation ... (often this is empty!)
    return self

  def transform(self, X):
    return ... # transformation
```
When combined with our previous `estimator`, the intended usage is
``` python
transformer = Transformer(...)  # initialize
X_trans_train = transformer.fit_transform(X_train)  # fit / transform data
estimator.fit(X_trans_train, y_train)  # fit new model on training data
X_trans_test = transformer.transform(X_test)  # transform test data
estimator.score(X_trans_test, y_test)  # evaluate performance on test data
```
Here, `.fit_transform` is implemented based on the `.fit` and `.transform` methods in `base.TransformerMixin` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html)).  Especially for transformers, `.fit` is often empty and only `.transform` actually does something.
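
As an illustration of a transformer whose `.fit` does nothing, here is a minimal sketch. The class name `Log1pTransformer` and the toy array are hypothetical and not part of the project.

```python
import numpy as np
from sklearn import base


class Log1pTransformer(base.BaseEstimator, base.TransformerMixin):
    """Hypothetical stateless transformer: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


X_toy = np.array([[0.0, 1.0], [3.0, 7.0]])
# fit_transform comes from TransformerMixin: it calls fit, then transform
X_trans = Log1pTransformer().fit_transform(X_toy)
```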

### Pipelines
One of the main advantages of using sklearn transformers is that we can chain them together with pipelines.  For example, this
``` python
new_model = pipeline.Pipeline([
    ('trans', Transformer(...)),     # Add a transformer to the pipeline
    ('est', Estimator(...))          # Add an estimator to the pipeline
  ])
new_model.fit(X_train, y_train)
new_model.score(X_test, y_test)
```
would replace all the fitting and scoring code above.  The pipeline itself is an estimator (and implements the `.fit` and `.predict` methods).  Note that a pipeline can have multiple transformers chained up but at most one (optional) terminal estimator.
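
The sketch below checks this equivalence on synthetic data: a pipeline gives the same predictions as transforming and fitting by hand. The use of `StandardScaler`, the step names, and the random data are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn import neighbors, pipeline, preprocessing

# Synthetic data, invented for illustration
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(40, 2))
y_toy = X_toy[:, 0] + X_toy[:, 1]
X_tr, X_te, y_tr, y_te = X_toy[:30], X_toy[30:], y_toy[:30], y_toy[30:]

# One transformer followed by one terminal estimator
toy_model = pipeline.Pipeline([
    ('scale', preprocessing.StandardScaler()),
    ('knn', neighbors.KNeighborsRegressor(n_neighbors=3)),
])
toy_model.fit(X_tr, y_tr)
y_pred = toy_model.predict(X_te)
score = toy_model.score(X_te, y_te)

# The same steps written out manually
scaler = preprocessing.StandardScaler().fit(X_tr)
knn = neighbors.KNeighborsRegressor(n_neighbors=3).fit(scaler.transform(X_tr), y_tr)
y_manual = knn.predict(scaler.transform(X_te))
```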

## II. LatLongModel

We will now build a more fine-grained model based on geographical location.  We know that some neighborhoods are trendier than others.  To account for this, we might consider a K Nearest Neighbors or Random Forest based on the latitude and longitude.

This time, we will use a built-in estimator from `sklearn`. However, so that the result of our work is reusable later when we build our FullModel, we want to have an estimator which accepts the entire dataset as its input. For this reason, we will create a custom transformer which accepts our data in its original format and transforms it into something suitable for the built-in `sklearn` estimators (i.e., `numpy.ndarray`). By putting this transformer and our selected estimator into a pipeline together, we effectively create a single estimator which both accepts the data in our original format and takes advantage of `sklearn`'s algorithms.

Let's focus on the transformer first. Implement a generic `ColumnSelectTransformer` that is passed which columns to select when initialized. **Hint:** You will want to include some code in the transformer's `__init__()` function.

In [None]:
# LatLong model

class ColumnSelectTransformer(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self, column_list):
        if not isinstance(column_list, list):
            raise TypeError("column_list must be a list of column names")
        self.column_list = column_list

    def fit(self, X, y=None):
        # Nothing to learn: the selection is fixed at construction time
        return self

    def transform(self, X):
        # Report exactly which requested columns are absent from the input
        missing = [col for col in self.column_list if col not in X.columns]
        if missing:
            raise KeyError("Columns not found in input: {}".format(missing))
        return X[self.column_list]

Let's use `neighbors.KNeighborsRegressor` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)) as the estimator. Create a pipeline which contains the transformer you just created as well as one of these estimators. For now, we'll set the hyperparameter `n_neighbors` to 5. In the next step, we'll find the optimal value for this hyperparameter.

In [None]:
# Pipeline with KNeighborsRegressor
from sklearn import pipeline
from sklearn import neighbors

k_pipe = pipeline.Pipeline([
  ('truncate', ColumnSelectTransformer(["latitude","longitude"])),
  ('knearestRegressor', neighbors.KNeighborsRegressor(n_neighbors=5))
  ])
k_pipe.fit(X_train, y_train)
print(k_pipe.score(X_test, y_test))

We will now find the optimal value for the hyperparameter `n_neighbors`. 

Pass your pipeline to `model_selection.GridSearchCV` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)). You'll need to specify a dictionary for the parameter `param_grid`. Be sure to experiment with the range of candidate values to find the optimal one. After calling the `fit()` function on your `GridSearchCV` object, you can find the estimator with the best hyperparameter in the `.best_estimator_` attribute. Store this estimator for the full model later.

In [None]:
# Grid search for hyperparameter
from sklearn.model_selection import GridSearchCV
n_neighbors = np.arange(1, 200)
# Pipeline parameters are addressed as <step name>__<parameter name>
d = {'knearestRegressor__n_neighbors': n_neighbors}
gscv_neighbors = GridSearchCV(estimator=k_pipe, param_grid=d, cv=5)

In [None]:
gscv_neighbors.fit(X_train, y_train)
print(gscv_neighbors.best_params_)

## III. CategoryModel

Venues have categories with varying degrees of specificity, e.g.

``` python
  [Doctors, Health & Medical]
  [Restaurants]
  [American (Traditional), Restaurants]
```
  
With a large sparse feature set like this, we often use a cross-validated regularized linear model.

Build a custom transformer that massages the data so that it can be fed into `feature_extraction.DictVectorizer` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html), [Example](Dict Vectorizer Example.ipynb)), which in turn generates a large matrix by one-hot encoding.
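
To see what `DictVectorizer` expects and produces, here is a minimal sketch using the three example category lists shown above. Turning each list into a dictionary with value `1` per category is one possible "massaging" step, assumed here for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

category_lists = [
    ['Doctors', 'Health & Medical'],
    ['Restaurants'],
    ['American (Traditional)', 'Restaurants'],
]

# One dictionary per venue: each category present becomes a key with value 1
category_dicts = [{category: 1 for category in row} for row in category_lists]

vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(category_dicts)  # one row per venue, one column per category
feature_names = vectorizer.feature_names_           # column order of the encoded matrix
```

Each distinct category becomes one column, so the encoded matrix has three rows and four columns, and most of its entries are zero.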

Use a pipeline to feed the result of this into a `LinearRegression` estimator.

In [None]:
# CategoryModel
from sklearn.feature_extraction import DictVectorizer




class MultipleCategoryTransformer(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self,dict_vect,category_column_name = "categories"):
        self.category_column_name = category_column_name
        self.dict_vect = dict_vect
        return

    def multiple_transform(self,X,category_column_name):
        column_categories = []
        for index, row in X.iterrows():
            row_categories = {}
            for category in row[category_column_name]:
                row_categories[category] = True
            column_categories.append(row_categories)
        return column_categories
    
    
    def fit(self, X, y=None):
        self.dict_vect.fit(self.multiple_transform(X,self.category_column_name))
        return self
    
    def transform(self, X):
        return self.dict_vect.transform(self.multiple_transform(X,self.category_column_name))


In [None]:
from sklearn import linear_model
from sklearn import pipeline

v = DictVectorizer(sparse=False)
m = MultipleCategoryTransformer(v)
r_pipe = pipeline.Pipeline([
  ('truncate', m),
  ('linearregression', linear_model.LinearRegression())
  ])


r_pipe.fit(X_train, y_train)
print(r_pipe.score(X_test, y_test))

## IV. AttributeModel

Venues have (potentially nested) attributes.  For example,

``` python
  { 'Attire': 'casual',
    'Accepts Credit Cards': True,
    'Ambience': {'casual': False, 'classy': False }}
```
  
Categorical data like this should often be transformed by a One Hot Encoding.  For example, we might flatten the above into something like this:

``` python
  { 'Attire_casual' : 1,
    'Accepts Credit Cards': 1,
    'Ambience_casual': 0,
    'Ambience_classy': 0 }
```

Build a custom transformer that flattens attributes and feeds them into `DictVectorizer`. Feed the result into a (cross-validated) linear model.
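
The flattening step can be sketched as a small recursive function. The function name `flatten` and the `_` separator are illustrative choices; this is one possible approach, shown on the example attributes above.

```python
def flatten(attributes, parent_key='', sep='_'):
    """Flatten a (possibly nested) attribute dict into one-hot style keys."""
    flat = {}
    for key, value in attributes.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested attributes, prefixing with the parent key
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, bool):
            # Booleans keep their key and become 0 or 1
            flat[new_key] = int(value)
        else:
            # Other values (e.g. strings) are folded into the key itself
            flat[new_key + sep + str(value)] = 1
    return flat


example = {
    'Attire': 'casual',
    'Accepts Credit Cards': True,
    'Ambience': {'casual': False, 'classy': False},
}
flat_example = flatten(example)
```

The boolean check comes before the generic branch because in Python `bool` is a subclass of `int`, so the order of the `isinstance` tests matters.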

In [None]:
# AttributeModel
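# Use positional indexing: after the shuffle in train_test_split, label 0 may not be in X_train
print(X_train["attributes"].iloc[0])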

In [None]:
class MultipleCategoryTransformer2(sk.base.BaseEstimator, sk.base.TransformerMixin):
    def __init__(self,dict_vect,category_column_name = "attributes"):
        self.category_column_name = category_column_name
        self.dict_vect = dict_vect
        return
    
    def unpack(self,d, parent_key='', sep='_'):
        items = []
        for k, v in d.items():
            new_key = parent_key + sep + k if parent_key else k
            if isinstance(v, dict):
                items.extend(self.unpack(v, new_key, sep=sep).items())
            else:
                if isinstance(v,bool):
                    if (v):
                        items.append((new_key, 1))
                    else:
                        items.append((new_key, 0))
                elif isinstance(v,int):
                    items.append((new_key+ sep + str(v), 1))
                elif isinstance(v,str):
                    items.append((new_key+ sep + v, 1))
                else:
                    print(type(v))
                    print(v)
        return dict(items)
    
    def unpacking_columns(self,X,category_column_name):
        column_categories = []
        for index, row in X.iterrows():
            column_categories.append(self.unpack(row[category_column_name]))
        return column_categories
    
    def fit(self, X, y=None):
        self.dict_vect.fit(self.unpacking_columns(X,self.category_column_name))
        return self
    
    def transform(self, X):
        return self.dict_vect.transform(self.unpacking_columns(X,self.category_column_name))
    

from sklearn import linear_model
from sklearn import pipeline

v = DictVectorizer(sparse=False)
m = MultipleCategoryTransformer2(v)
l_pipe = pipeline.Pipeline([
  ('truncate', m),
  ('lassocv', linear_model.LassoCV())
  ])


l_pipe.fit(X_train, y_train)
print(l_pipe.score(X_test, y_test))

## V. FullModel

So far we have only built models based on individual features. Now, we combine them into an **ensemble model**. We do this using a [feature union](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html).

Combine all the above models using a feature union. Notice that a feature union takes transformers, not models, as arguments. The way around this is to convert your existing estimators into transformers. You can do this manually by modifying the estimators you've written so far, but an easier way is to wrap your estimators in a class such as [this](Model Transformer.ipynb).

The feature union itself is a transformer. Use a pipeline to combine this transformer with linear regression.
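
The idea can be sketched on synthetic data as follows. The wrapper name `PredictionTransformer`, the two base models, and the random data are assumptions made for illustration; the wrapper simply exposes a model's predictions as a one-column feature so that a feature union can stack them side by side.

```python
import numpy as np
from sklearn import base, linear_model, neighbors, pipeline


class PredictionTransformer(base.BaseEstimator, base.TransformerMixin):
    """Hypothetical wrapper: exposes a model's predictions as a single feature column."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        # Shape (n_samples, 1) so the union can stack columns horizontally
        return np.asarray(self.model.predict(X)).reshape(-1, 1)


# Synthetic data, invented for illustration
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1]

union = pipeline.FeatureUnion([
    ('knn', PredictionTransformer(neighbors.KNeighborsRegressor(n_neighbors=3))),
    ('ridge', PredictionTransformer(linear_model.Ridge(alpha=1.0))),
])
stacked = union.fit_transform(X_toy, y_toy)  # one column of predictions per base model

# A final linear regression learns how to weight the base models' predictions
ensemble = pipeline.Pipeline([
    ('features', union),
    ('final', linear_model.LinearRegression()),
])
ensemble.fit(X_toy, y_toy)
y_pred = ensemble.predict(X_toy)
```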

In [None]:
# FullModel

class ModelTransformer(sk.base.BaseEstimator, sk.base.TransformerMixin):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))
    
from sklearn.pipeline import FeatureUnion
from sklearn import pipeline

#City
my_estimator =  Estimator()

#Lat Long
k_pipe = pipeline.Pipeline([
  ('truncate', ColumnSelectTransformer(["latitude","longitude"])),
  ('knearestRegressor', neighbors.KNeighborsRegressor(n_neighbors=119))
  ])

#Categories
v = DictVectorizer(sparse=False)
m = MultipleCategoryTransformer(v)
r_pipe = pipeline.Pipeline([
  ('truncate', m),
  ('linearregression', linear_model.LinearRegression())
  ])

#Attributes
v1 = DictVectorizer(sparse=False)
m1 = MultipleCategoryTransformer2(v1)
l_pipe = pipeline.Pipeline([
  ('truncate', m1),
  ('lassocv', linear_model.LassoCV())
  ])

pipe = pipeline.Pipeline([
    ('features', FeatureUnion([
        ('city',  ModelTransformer(my_estimator)),
        ('longlat', ModelTransformer(k_pipe)),
        ('categories', ModelTransformer(r_pipe)),
        ('attributes', ModelTransformer(l_pipe))
        ])),
    ('finalestimator', linear_model.LinearRegression())
    ])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))