In [1]:
import seaborn as sns
sns.set()

In [1]:
from static_grader import grader

# ML: Predicting Star Ratings


Our objective is to predict a new venue's popularity from information available when the venue opens.  We will do this by machine learning from a data set of venue popularities provided by Yelp.  The data set contains metadata about each venue (where it is located, the type of food served, etc.) along with its star rating. Note that the venues are not limited to restaurants. This tutorial will walk you through one way to build a machine learning model.


For most questions, you are asked to submit your model's `predict` method to the grader. The grader evaluates your model on a test set and compares its $R^2$ score against the score of our reference solution on the same test set. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. See how high you can go!

## Download and parse the incoming data


We start by downloading the data set from Amazon S3:

In [2]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [3]:
import ujson as json
import gzip

with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    data = [json.loads(line) for line in f]

In [5]:
data[3700]

{'business_id': 'FiiNtpF1GW0aSzW7Af0TdA',
 'full_address': '7810 S. Priest Drive Suite D\nTempe, AZ 85284',
 'hours': {'Monday': {'close': '23:00', 'open': '04:30'},
  'Tuesday': {'close': '23:00', 'open': '04:30'},
  'Friday': {'close': '22:00', 'open': '04:30'},
  'Wednesday': {'close': '23:00', 'open': '04:30'},
  'Thursday': {'close': '23:00', 'open': '04:30'},
  'Sunday': {'close': '20:00', 'open': '07:00'},
  'Saturday': {'close': '20:00', 'open': '07:00'}},
 'open': True,
 'categories': ['Active Life', 'Gyms', 'Trainers', 'Fitness & Instruction'],
 'city': 'Tempe',
 'review_count': 8,
 'name': 'LA Fitness',
 'neighborhoods': [],
 'longitude': -111.9631388,
 'state': 'AZ',
 'stars': 2.5,
 'latitude': 33.3453944,
 'attributes': {'Good for Kids': True},
 'type': 'business'}

In scikit-learn, the labels to be predicted (in this case, the stars) are always kept in a separate data structure from the features. Let's get into this habit now by creating a separate list of the ratings.

In [4]:
star_ratings = [row['stars'] for row in data]

A few things to consider:

1. The test set used by the grader will be in the same form as `data`. For this miniproject, it will be a list of dictionaries. The models you build will need to handle data of this type; we'll discuss this further in the questions.
1. You may find it useful to serialize your trained model using either [`dill`](https://pypi.python.org/pypi/dill) or [`joblib`](http://scikit-learn.org/stable/modules/model_persistence.html). That way, you can reload your model after restarting the Jupyter notebook without needing to retrain it.
1. There are obvious mistakes in the data; there is no need to try to correct them.
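
As a minimal sketch of the serialization point above, assuming `joblib` is installed (it is a scikit-learn dependency): the file name `city_model_demo.joblib` and the toy dictionary standing in for a trained model are hypothetical.

```python
import os
import tempfile

import joblib

# A stand-in for a trained model; any picklable Python object works the same way.
toy_model = {"Phoenix": 3.67, "Madison": 3.65}

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "city_model_demo.joblib")
joblib.dump(toy_model, path)

# After restarting the notebook, reload the object instead of retraining.
restored = joblib.load(path)
print(restored == toy_model)
```

One caveat to keep in mind: custom classes defined inside the notebook generally have to be defined again before a `joblib` file that uses them can be loaded.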

# Questions


## Question 1: city_avg

The venues belong to different cities.  You can imagine that the ratings in some cities are probably higher than others.  We wish to build an estimator to make a prediction based on this, but first we need to work out the average rating for each city.  For this problem, create a list of tuples (city name, star rating), one for each city in the data set. There are many ways to do this; please feel free to experiment on your own.  If you get stuck, the steps below attempt to guide you through the process.

A simple approach is to go through all of the dictionaries in our array, calculating the sum of the star ratings and the number of venues for each city. At the end, we can just divide the stars by the count to get the average. We could create a separate sum and count variable for each city, but that will get tedious quickly. A better approach is to create a dictionary for each. The key will be the city name, and the value the running sum or running count.

One slight annoyance of this approach is that we will have to test whether a key exists in the dictionary before adding to the running tally.  The `collections` module's `defaultdict` class works around this by providing default values for keys that haven't been used. Thus, if we do

In [7]:
city_ratings = [(d['city'], d['stars']) for d in data]

#print(city_ratings)

In [8]:
from collections import defaultdict
city_stars_sum = defaultdict(int)
city_counts = defaultdict(int)

for dictionary in data:
    city = dictionary['city']
    stars = dictionary['stars']
    city_stars_sum[city] += stars
    city_counts[city] += 1

city_averages = {}

for city, stars_sum in city_stars_sum.items():
    count = city_counts[city]
    average = stars_sum / count
    city_averages[city] = average

print(len(city_averages))

167


we can increment any key of `city_stars_sum` or `city_counts` without first worrying whether the key exists. The cell above reads each rating from the record itself; to go through the `data` and `star_ratings` lists together instead, we can use the `zip` function.

Now we can calculate the average ratings.  Again, a dictionary makes a good container.
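
The tallying and averaging steps can be sketched on toy data with `zip`. The names `toy_data`, `toy_ratings`, `star_sum`, `count`, and `toy_averages` are illustrative only, and the three records are made up for the example.

```python
from collections import defaultdict

# Toy records and labels standing in for `data` and `star_ratings`.
toy_data = [{"city": "A"}, {"city": "B"}, {"city": "A"}]
toy_ratings = [4.0, 3.0, 5.0]

star_sum = defaultdict(float)  # missing keys start at 0.0
count = defaultdict(int)       # missing keys start at 0

# zip pairs each record with its rating.
for row, stars in zip(toy_data, toy_ratings):
    star_sum[row["city"]] += stars
    count[row["city"]] += 1

# Divide each running sum by its running count to get the average.
toy_averages = {city: star_sum[city] / count[city] for city in star_sum}
print(toy_averages)  # {'A': 4.5, 'B': 3.0}
```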

There should be 167 different cities:

In [9]:
# Check to see that we have 167 entries in the dictionary.
grader.check(len(city_averages) == 167)

True

We can get that list of tuples by converting the returned view object from the `items` method into a list.

In [10]:
avg_stars = city_averages

In [11]:
grader.score('ml__city_avg', list(avg_stars.items()))

Your score: 1.0000


## Question 2: city_model

Now, let's build a custom estimator that will make a prediction based solely on the city of a venue.  It is tempting to hard-code the answers from the previous section into this model, but we're going to resist and do things properly.

This custom estimator will have a `fit` method.  It will receive `data` as its argument `X` and `star_ratings` as `y`, and should repeat the calculation of the previous problem there.  Then the `predict` method can look up the average rating for the city of each record it receives.

In [46]:
from sklearn.base import BaseEstimator, RegressorMixin

class CityRegressor(BaseEstimator, RegressorMixin):
    def __init__(self):
        self.city_ratings = {}

    def fit(self, X, y):
        # Collect the ratings seen for each city; predict averages them later
        self.city_ratings = {}  # reset so that refitting does not accumulate ratings
        for row, rating in zip(X, y):
            city = row['city']
            if city not in self.city_ratings:
                self.city_ratings[city] = []
            self.city_ratings[city].append(rating)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            city = row['city']
            if city in self.city_ratings:
                ratings = self.city_ratings[city]
                average_rating = sum(ratings) / len(ratings)
                predictions.append(average_rating)
            else:
                # If city is not seen during training, return a default value or raise an exception
                predictions.append(3.5)  # Default value
        return predictions

Now we can create an instance of our regressor and train it.

In [13]:
city_averages

{'Phoenix': 3.6702903946388683,
 'De Forest': 3.75,
 'Mc Farland': 3.1,
 'Middleton': 3.611111111111111,
 'Madison': 3.6457337883959045,
 'Sun Prairie': 3.455223880597015,
 'Windsor': 3.5,
 'Monona': 3.4727272727272727,
 'Chandler': 3.667574931880109,
 'Scottsdale': 3.8206757594544327,
 'Tempe': 3.644621295279912,
 'Florence': 3.6176470588235294,
 'Peoria': 3.6388367729831144,
 'Glendale': 3.607404021937843,
 'Cave Creek': 3.9122137404580153,
 'Paradise Valley': 3.6690140845070425,
 'Mesa': 3.5901461829994585,
 'Ahwatukee': 3.6875,
 'Pheonix': 3.0,
 'Anthem': 3.7818181818181817,
 'Gilbert': 3.752396166134185,
 'Gold Canyon': 3.5,
 'Apache Junction': 3.6375,
 'Goldfield': 3.5,
 'Casa Grande': 3.5172413793103448,
 'Coolidge': 3.4375,
 'Higley': 3.5,
 'Queen Creek': 3.6456043956043955,
 'Sun Lakes': 3.2222222222222223,
 'Goodyear': 3.5313653136531364,
 'Fort Mcdowell': 4.0,
 'Fountain Hills': 3.7904761904761903,
 'Fountain Hls': 3.0,
 'Maricopa': 3.52,
 'chandler': 5.0,
 'Buckeye': 3.4084

In [47]:
city_model = CityRegressor()
city_model.fit(data, star_ratings)

And let's see if it works.

In [15]:
city_model.predict(data[:5])

[3.6702903946388683, 3.75, 3.75, 3.75, 3.75]

There is a problem, however.  What happens if we're asked to estimate the rating of a venue in a city that's not in our training set?

In [16]:
city_model.predict([{'city': 'Phoenix'}, {'city': 'Timbuktu'}, {'city': 'Madison'}])

[3.6702903946388683, 3.5, 3.6457337883959045]

Your model should always return a number, even if the city was not in the training data. Make sure it does before submitting your model's predict method to the grader.
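
One possible way to handle unseen cities, sketched here as an assumption rather than the reference solution, is to fall back on the mean rating over all training records. The class name `CityMeanRegressor` and the toy data are hypothetical.

```python
from collections import defaultdict

from sklearn.base import BaseEstimator, RegressorMixin


class CityMeanRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical variant: city mean, or the global mean for unseen cities."""

    def fit(self, X, y):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for row, rating in zip(X, y):
            sums[row["city"]] += rating
            counts[row["city"]] += 1
        # Learned state is stored in attributes with a trailing underscore.
        self.city_means_ = {city: sums[city] / counts[city] for city in sums}
        self.global_mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # dict.get supplies the fallback when the city was never seen in fit.
        return [self.city_means_.get(row["city"], self.global_mean_) for row in X]


toy_X = [{"city": "A"}, {"city": "A"}, {"city": "B"}]
toy_y = [4.0, 5.0, 3.0]
model = CityMeanRegressor().fit(toy_X, toy_y)
print(model.predict([{"city": "A"}, {"city": "Timbuktu"}]))  # [4.5, 4.0]
```

Compared with a fixed constant, the global mean adapts to whatever data the model is trained on.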

In [17]:
grader.score('ml__city_model', city_model.predict)

Your score: 1.0000


## Question 3: lat_long_model

You can imagine that a city-based model might not be sufficiently fine-grained. For example, we know that some neighborhoods are trendier than others.  Use the latitude and longitude of a venue as features that help you understand neighborhood dynamics.

Since we need to select the appropriate columns from our dictionaries to build our latitude-longitude model, we will have to use scikit-learn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). However, the `ColumnTransformer` works with either NumPy arrays or pandas data frames. While we can convert our training data into a data frame easily, the test set the grader uses is a list of dictionaries. Thus, our first estimator in our workflow should be a transformer that converts a list of dictionaries into a pandas data frame.

In [82]:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

# A transformer that turns a list of dictionaries into a pandas data frame
class ToDataFrame(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self

    def transform(self, X):
        # X is a list of dictionaries; return it as a pandas data frame
        return pd.DataFrame(X)

In [83]:
X = pd.DataFrame(data)

Let's test out the transformer.

In [84]:
# Create an instance (to_data_frame) of the class ToDataFrame
to_data_frame = ToDataFrame()
X_t = to_data_frame.fit_transform(data[:5])

# Check that our transformer properly transforms the input data into a data frame
grader.check((X_t == pd.DataFrame(data[:5])).all(axis=None))

True

Now we are ready to use `ColumnTransformer` and test it out.

In [89]:
import numpy as np
from sklearn.compose import ColumnTransformer

selected_columns = ['latitude', 'longitude']
ct = ColumnTransformer(
    transformers=[("latlon", 'passthrough', selected_columns)])

expected = np.array([data[0]['latitude'], data[0]['longitude']])

# Check that our selector returns just two columns, the latitude and longitude
grader.check((ct.fit_transform(X)[0] == expected).all())

True

Now, let's feed the output of the transformer into a `KNeighborsRegressor`. As a sanity check, we'll test it with the first 5 rows.

In [90]:
from sklearn.neighbors import KNeighborsRegressor

# Training the model: convert to a data frame, select latitude/longitude, fit
data_transform = to_data_frame.transform(data)
data_transform = ct.fit_transform(data_transform)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(data_transform, star_ratings)

# Making predictions: apply the same transformations to the first 5 rows
test_data = data[:5]
test_data_transform = to_data_frame.transform(test_data)
test_data_transform = ct.transform(test_data_transform)
knn.predict(test_data_transform)

array([4. , 4.2, 4. , 3.8, 4.2])

We are not ready to submit to the grader; there are two things we still need to do:
1. Wrap all the steps necessary to go from our data (list of dictionaries) to predicted ratings
1. Determine the optimal value for our predictor's hyperparameter

For the first point, we will use a pipeline, ensuring that our model applies all the required transformations given the form of the input data. Remember that a pipeline is made with a list of `(step_name, estimator)` tuples.

In [91]:
# Build the pipeline from the transformer and predictor instances (not the classes).
# The pipeline then exposes fit and predict like a single estimator.
from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ("df_trans", to_data_frame),
        ("col_trans", ct),
        ("knn", knn)
])

Now let's fit and predict.

In [92]:
pipe.fit(data, star_ratings)
pipe.predict(data[:5])

array([4. , 4.2, 4. , 3.8, 4.2])

Let's now focus on the second point. The `KNeighborsRegressor` takes the `n_neighbors` hyperparameter, which tells it how many nearest neighbors to average together when making a prediction. There is no reason to believe that 5 is the optimum value. We will need to determine a better value for this hyperparameter. A common approach is to use a hyperparameter searching tool such as [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV). You may need to refer back to the notebooks about the ways to interface searching tools and pipelines.

You should consider whether the data needs to be shuffled as it might not have been randomized. For example, the data could be ordered by a certain feature or by the labels. If you perform a train/test split with [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split), the data is shuffled by default. However, when using `GridSearchCV`, the folds are not shuffled when you use the default K-folds cross-validation.
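
If you would rather have the search itself shuffle, one option is to pass a shuffling splitter through the `cv` argument of `GridSearchCV`. The sketch below uses synthetic data with deliberately sorted labels; `X_demo`, `y_demo`, and the small grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: the labels are sorted, mimicking an ordered data set.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 2))
y_demo = np.sort(rng.normal(size=200))

# KFold with shuffle=True randomizes which rows land in each fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 5, 20]}, cv=cv)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Shuffling the data once up front with `sklearn.utils.shuffle` achieves a similar effect and keeps the default cross-validation settings.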

The code below will plot a rolling mean of the star ratings. Do you need to shuffle the data?

In [26]:
from pandas import Series
import matplotlib.pyplot as plt

plt.plot(Series(star_ratings).rolling(window=1000).mean());

Once you've found a good value of `n_neighbors`, submit the model to the grader. Note, "good" is a relative measure here. The reference solution has an $R^2$ score of only 0.02. There is just rather little signal available for modeling.

In [24]:
# shuffle permutes the feature list and the label list in unison
from sklearn.utils import shuffle

# Shuffle the data so that the cross-validation folds are not ordered
shuff_data, shuff_ratings = shuffle(data, star_ratings)

# Plot the rolling mean again, this time with the shuffled labels
#plt.plot(Series.rolling(Series(shuff_ratings), window=1000).mean());

In [28]:
#pd.DataFrame(shuff_data)

In [93]:
#implement GridSearchCV using shuffled data
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Get data and perform train/test split
X_train, X_test, y_train, y_test = train_test_split(shuff_data,shuff_ratings)

knn_pipe = Pipeline([
        ("df_trans", to_data_frame),
        ("col_trans", ct),
        ("Kregressor", KNeighborsRegressor())
])
#knn_pipe.fit(X_train, y_train)

# Parameters of a pipeline step are addressed as <step_name>__<parameter>
k_range = list(range(1, 5))
param_grid = dict(Kregressor__n_neighbors=k_range)
GridSearchCV(knn_pipe, param_grid)

In [96]:
# Perform hyperparameter tuning on pipeline estimator
k_range = list(range(1, 100))
param_grid = dict(Kregressor__n_neighbors=k_range)

gs_est = GridSearchCV(knn_pipe, param_grid, cv=3, n_jobs=2, verbose=1)
gs_est.fit(X_train, y_train)
#print(gs_est.score(X_test, y_test))

Fitting 3 folds for each of 99 candidates, totalling 297 fits


In [97]:
gs_est.best_params_

{'Kregressor__n_neighbors': 90}

In [98]:
grader.score('ml__lat_long_model', gs_est.predict)  # Edit to appropriate name

Your score: 1.0000



## Question 4: category_model

While location is important, we could also try seeing how predictive the
venue's category is. Build an estimator that considers only the `'categories'` field of the data.

The categories come as a list of strings, but scikit-learn's predictors all need numeric input. We ultimately want to create a column in our feature matrix to represent every category. For a given row, the columns for the categories it contains are filled with a one, and all other columns are filled with a zero. This is similar to **one-hot encoding**, except that a row can contain more than one "hot" (non-zero) column.

To achieve our encoding plan, we can use scikit-learn's [`DictVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer). This transformer takes a 1-D array of dictionaries and creates a column in the output matrix for each key found in the dictionaries, filled with the value associated with that key. Missing keys are filled with zeros. Before we can use it, we need to build a transformer that takes an array of lists of strings and returns an array of dictionaries with keys given by those strings and values of one. For example, it should transform `X_in` into `X_out`.

In [8]:
import pandas as pd
X_in = pd.Series([['a'], ['b', 'c']])
X_out = pd.Series([{'a': 1}, {'b': 1, 'c': 1}])

print(X_in)
print(X_out)

0       [a]
1    [b, c]
dtype: object
0            {'a': 1}
1    {'b': 1, 'c': 1}
dtype: object
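
To see what `DictVectorizer` does with dictionaries shaped like those in `X_out`, here is a small sketch. The variable names `rows`, `vec`, and `matrix` are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

# Two rows of already-encoded categories, as in X_out.
rows = [{'a': 1}, {'b': 1, 'c': 1}]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(rows)

print(vec.feature_names_)  # one column per key: ['a', 'b', 'c']
print(matrix)              # keys missing from a row are filled with 0
```

The second row has two non-zero columns, which is the "more than one hot column" situation described above.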


In [13]:
# Take a sequence of lists of strings and return a list of dictionaries,
# one per input list, in which each string is a key with the value 1
def makedict(X):
    return [{key: 1 for key in row} for row in X]


In [36]:
makedict(X_in)

[{'a': 1}, {'b': 1, 'c': 1}]

In [60]:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.feature_extraction import DictVectorizer

class DictEncoder(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X is a pandas Series of lists of strings; return one dictionary per
        # row, mapping each string to 1 (see makedict above)
        return makedict(X)

In [61]:
to_dict_en = DictEncoder()
dict_test = to_dict_en.fit_transform(X_in)

dict_test

[{'a': 1}, {'b': 1, 'c': 1}]

Now let's test out that our `DictEncoder` works out as expected.

In [72]:
# Check that DictEncoder transforms a series of list of strings into the expected series of dictionaries
grader.check((DictEncoder().fit_transform(X_in) == X_out).all())

True

Now, create a pipeline object of the two step transformation for the categories data. Afterwards, create a `ColumnTransformer` object that will use the aforementioned pipeline object to transform the `'categories'` field.

In [73]:
# Run ridge regression on the encoded categories
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer


selected_columns = ['categories']
ct = ColumnTransformer(
    transformers=[("catcol", 'passthrough', selected_columns)])

class DictEncoder_pipe(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X is the 2-D array returned by the ColumnTransformer, with one
        # single-element row per venue; unwrap each row to get its list of
        # categories, then return a list of dictionaries
        X = [item for sublist in X for item in sublist]
        return makedict(X)

to_dict_en = DictEncoder_pipe()

CategoriesPipeline_fit_model = Pipeline([
    ('df_trans', to_data_frame),  # convert the list of dictionaries into a data frame
    ('col_trans', ct),  # select the 'categories' column
    ('dict_trans', to_dict_en),  # lists of categories -> dictionaries
    ('vectorize', DictVectorizer(sparse=False)),  # dictionaries -> numeric matrix
    ('scaler', StandardScaler()),  # standardize each category column
    ('regressor', Ridge()),  # regularized linear model
])

#used cross validation to get this value
CategoriesPipeline_fit_model.set_params(regressor__alpha=0.91)

CategoriesPipeline_fit_model.fit(data, star_ratings)

Finally, create a pipeline object that will
1. Convert our list of dictionaries into a data frame
1. Select the `'categories'` column and encode the data
1. Train a regularized linear model such as `Ridge`

There will be a large number of features, one for each category, so there is a significant danger of overfitting. Use cross validation to choose the best regularization parameter.
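
A minimal sketch of choosing the regularization parameter by cross validation, assuming a pipeline whose final step is named `ridge`. The synthetic rows, the category names, and the `alpha` grid are made up for illustration and are not the values used for the real data.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for encoded categories: each row maps category -> 1.
rng = np.random.default_rng(0)
categories = ["Bars", "Gyms", "Pizza", "Salons"]
rows = [{c: 1 for c in categories if rng.random() < 0.5} for _ in range(60)]
ratings = [3.0 + 0.5 * ("Pizza" in r) + rng.normal(scale=0.1) for r in rows]

demo_pipe = Pipeline([
    ("vectorize", DictVectorizer(sparse=False)),
    ("ridge", Ridge()),
])

# Step parameters are addressed as <step_name>__<parameter>.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(demo_pipe, param_grid, cv=5)
search.fit(rows, ratings)
print(search.best_params_)
```

After fitting, `search` behaves like the best pipeline refit on all the data, so `search.predict` can be used directly.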

In [75]:
grader.score('ml__category_model', CategoriesPipeline_fit_model.predict)  # Edit to appropriate name

Your score: 0.9366


**Extension:** Some categories (e.g., Restaurants) are not very specific. Others (Japanese sushi) are much more so.  One way to deal with this is with a measure called term frequency-inverse document frequency (tf-idf). Add in a [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) between the `DictVectorizer` and the linear model, and see if that improves performance.
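
As a rough sketch of this extension on toy data (the rows, ratings, and step names are invented for illustration), a `TfidfTransformer` can sit between the vectorizer and the regressor:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Toy rows: "Restaurants" is common, "Sushi Bars" is rare.
toy_rows = [
    {"Restaurants": 1, "Sushi Bars": 1},
    {"Restaurants": 1, "Pizza": 1},
    {"Restaurants": 1},
    {"Gyms": 1},
]
toy_ratings = [4.5, 3.5, 3.0, 2.5]

tfidf_pipe = Pipeline([
    ("vectorize", DictVectorizer()),  # sparse output by default
    ("tfidf", TfidfTransformer()),    # down-weights categories found in many rows
    ("ridge", Ridge(alpha=1.0)),
])
tfidf_pipe.fit(toy_rows, toy_ratings)

# Inspect the tf-idf weights of the first row.
vec = tfidf_pipe.named_steps["vectorize"]
tfidf = tfidf_pipe.named_steps["tfidf"]
weights = tfidf.transform(vec.transform(toy_rows[:1])).toarray()[0]
names = vec.feature_names_
print(dict(zip(names, weights.round(3))))
```

In the first row, the rare category receives a larger weight than the common one, which is the effect the extension is after.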

## Question 5: attribute_model

There is even more information in the attributes for each venue.  Let's build an estimator based on these.

Venue attributes may be nested:
```python
{
  'Attire': 'casual',
  'Accepts Credit Cards': True,
  'Ambiance': {'casual': False, 'classy': False},
  'Price Range': 3
}
```
We wish to encode them in the same manner as our categories data using the `DictVectorizer`. Before we do so, we need to flatten the dictionary to a single level:
```python
{
  'Attire_casual' : 1,
  'Accepts Credit Cards': 1,
  'Ambiance_casual': 0,
  'Ambiance_classy': 0,
  'Price Range_3': 1
}
```
Build a custom transformer that flattens the dictionary for the `'attributes'` field. 

In [17]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

class AttributeTransformer_exampleVersion(BaseEstimator, TransformerMixin):
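    def flatten_dict(self, d, prefix=''):
        result = {}
        for key, value in d.items():
            if isinstance(value, dict):
                # Recurse into nested dictionaries, carrying the full key path
                result.update(self.flatten_dict(value, f'{prefix}{key}_'))
            elif isinstance(value, bool):
                # Booleans become 0/1 (checked before int, since bool is an int subclass)
                result[f'{prefix}{key}'] = int(value)
            elif isinstance(value, (str, int)):
                # Strings and integers are treated as categories: key_value -> 1
                result[f'{prefix}{key}_{value}'] = 1
            else:
                result[f'{prefix}{key}'] = value
        return result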

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        flattened_dicts = X.apply(self.flatten_dict)
        output_df = pd.DataFrame(flattened_dicts.tolist())
        output_df.fillna(0, inplace=True)
        self.feature_names = output_df.columns.tolist()
        return output_df

We can check that we're getting the right number of columns _after_ vectorization to allow for different ways of constructing this. 

In [47]:
attribute_example = pd.Series([
    {
        'Wi-Fi': 'free',
        'Price Range': 2,
        'Parking': {
            'garage': False,
            'street': True
        }
    },
    {
        'Wi-Fi': 'no',
        'Price Range': 3,
        'Accepts Credit Cards': True,
        'Take-out': False
    },
    {
        'Wi-Fi': 'paid',
        'Parking': {
            'garage': True,
            'valet': True
        },
        'Take-out': True
    }
])

n_columns = 10

test_att_transformer = AttributeTransformer_exampleVersion()
output_df = test_att_transformer.fit_transform(attribute_example)
print("Feature names:", test_att_transformer.feature_names)
print(output_df)

# If this check fails, look at your output column names. 
# Are 'Wi-Fi' and 'Price Range' being treated correctly?
grader.check(len(test_att_transformer.feature_names) == n_columns)

Feature names: ['Wi-Fi_free', 'Price Range_2', 'Parking_garage', 'Parking_street', 'Wi-Fi_no', 'Price Range_3', 'Accepts Credit Cards', 'Take-out', 'Wi-Fi_paid', 'Parking_valet']
   Wi-Fi_free  Price Range_2  Parking_garage  Parking_street  Wi-Fi_no  \
0         1.0            1.0             0.0             1.0       0.0   
1         0.0            0.0             0.0             0.0       1.0   
2         0.0            0.0             1.0             0.0       0.0   

   Price Range_3  Accepts Credit Cards  Take-out  Wi-Fi_paid  Parking_valet  
0            0.0                   0.0       0.0         0.0            0.0  
1            1.0                   1.0       0.0         0.0            0.0  
2            0.0                   0.0       1.0         1.0            1.0  


True

Similar to what was done before, create a model that properly encodes the attribute data and learns to predict the ratings.

You may find it difficult to find a single regressor that does well enough. A common solution is to use a linear model to fit the linear part of some data, and use a non-linear model to fit the residual that the linear model can't fit. Build a custom predictor that takes as an argument two other predictors. It should use the first to fit the raw data and the second to fit the residuals of the first.

In [18]:
from sklearn.compose import ColumnTransformer


selected_columns_att = ['attributes']
ct_att = ColumnTransformer(
    transformers=[("catcol", 'passthrough', selected_columns_att)])

class ArrayToList(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        # X is the 2-D array returned by the ColumnTransformer; convert it to
        # a pandas Series in which each element is a single-item list holding
        # one venue's attributes dictionary
        return pd.Series(X.tolist())

array_to_list = ArrayToList()


# This transformer uses a list comprehension to flatten each venue's attributes dictionary

class AttributeTransformer(BaseEstimator, TransformerMixin):
    def flatten_dict(self, d, prefix=''):
        # Venues without an attributes dictionary contribute no features
        if not isinstance(d, dict):
            return {}

        result = {}
        for key, value in d.items():
            if isinstance(value, dict):
                # Recurse into nested dictionaries, carrying the full key path
                result.update(self.flatten_dict(value, f'{prefix}{key}_'))
            elif isinstance(value, bool):
                # Booleans become 0/1 (checked before int, since bool is an int subclass)
                result[f'{prefix}{key}'] = int(value)
            elif isinstance(value, (str, int)):
                # Strings and integers are treated as categories: key_value -> 1
                result[f'{prefix}{key}_{value}'] = 1
            else:
                result[f'{prefix}{key}'] = value
        return result

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Each element of X is a single-item list wrapping an attributes dictionary
        return [self.flatten_dict(inner_dict[0]) for inner_dict in X]

att_transformer = AttributeTransformer()

In [None]:
#If you want to run this model without a pipeline

#Before test_train_split on the two shuffled data objects, you need to vectorize each and flatten the attributes
#This will ensure the output object has the number of features is the same in each _train and _test object
#shuff_data_frame = to_data_frame.fit_transform(shuff_data[:5])
#attributes_col_shuff_data_frame = ct_att.fit_transform(shuff_data_frame) #this is a array of dictionaries but our attribute transformer expects a list
#series_of_dicts = pd.Series(attributes_col_shuff_data_frame.tolist()) #turn the array into a list aka pd series but instead gives us a list of lists
#attributes_col_shuff_data_frame_transform = att_transformer.fit_transform(series_of_dicts)
#print("Feature names:", att_transformer.feature_names) #gives you column names
#print(attributes_col_shuff_data_frame_transform[0]) #lets you see transformed data

In [70]:
# Create the attribute model and custom predictor
class CustomPredictor(BaseEstimator, RegressorMixin):

    def __init__(self, first_predictor, second_predictor):
        self.first_predictor = first_predictor
        self.second_predictor = second_predictor

    def fit(self, X, y):
        self.first_predictor.fit(X, y)
        residuals = y - self.first_predictor.predict(X)
        self.second_predictor.fit(X, residuals)
        return self

    def predict(self, X):
        return self.first_predictor.predict(X) + self.second_predictor.predict(X)

# Create the custom predictor with a linear model and a non-linear model.
# The linear model is a Ridge regression whose alpha is chosen by cross-validated grid search.
linear_model = Ridge()
param_grid = {'alpha': range(1, 100)}
new_linear_model = GridSearchCV(linear_model, param_grid, cv=5, n_jobs=-1)

# The non-linear model is a random forest; its hyperparameters could be tuned as well
non_linear_model = RandomForestRegressor(min_samples_leaf=5, random_state=42)
custom_predictor = CustomPredictor(new_linear_model, non_linear_model)



#Data transformation and model implementation pipeline    
att_pipe = Pipeline([
        ("df_trans", to_data_frame),
        ("col_trans", ct_att),
        ("array_trans", array_to_list),
        ("att_trans", att_transformer),
        ('vectorize', DictVectorizer(sparse=False)),
        ("cust_pre", custom_predictor)
])


# Fit the custom predictor on the training data
att_pipe.fit(data,star_ratings)

In [71]:
grader.score('ml__attribute_model', att_pipe.predict)  # Edit to appropriate name

Your score: 0.9750


## Question 6: full_model

So far we have only built models based on individual features.  Now we will build an ensemble regressor that averages together the estimates of the four previous regressors.

In order to use the existing models as input to a predictor, we will have to turn them into transformers; a predictor can only be in the final step of a pipeline. Build a custom `ModelTransformer` class that takes a predictor as an argument. When `fit` is called, the predictor should be fit. When `transform` is called, the predictor's `predict` method should be called, and its results returned as the transformation.

Note that the output of the `transform` method should be a 2-D array with a single column in order for it to work well with the scikit-learn pipeline. If you're using NumPy arrays, you can use `.reshape(-1, 1)` to create a column vector. If you are just using Python lists, you will want a list of lists of single elements.
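As a small illustration of the shape requirement described above, the sketch below turns a flat array of predictions into a single-column 2-D array, and builds the equivalent list of lists. The `predictions` values are made up for the example and do not come from any model in this notebook.

```python
import numpy as np

# Hypothetical predictions from some regressor: a flat, 1-D array
predictions = np.array([3.5, 4.0, 2.5])

# reshape(-1, 1) keeps the number of rows and forces a single column
column = predictions.reshape(-1, 1)
print(column.shape)

# The pure-Python equivalent is a list of single-element lists
column_as_lists = [[p] for p in predictions.tolist()]
print(column_as_lists)
```

Either form gives one row per sample and one column per model, which is the layout a `FeatureUnion` needs in order to stack several models' predictions side by side.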

In [44]:
#import everything
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer

class ModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X, y):
        self.predictor.fit(X, y) #fits the predictor to the input data
        return self

    def transform(self, X):
        predictions = self.predictor.predict(X) #method calls the predictor's predict method
        return np.array(predictions).reshape(-1, 1) #return 2-D array with a single column

Let's now test it out on our `city_model`.

In [48]:
#city model
city_trans = ModelTransformer(city_model)
city_trans.fit(data, star_ratings)
X_t = city_trans.transform(data[:5])


# Check that the transformation output is a 2-D array with one column
grader.check(np.array(X_t).shape[-1] == 1)

True

In [49]:
y_pred = np.array(city_model.predict(data[:5]))

# Check that the transformation output is the same as the model's predictions
grader.check((y_pred.reshape(-1, 1) == X_t).all())

True

Create an instance of `ModelTransformer` for each of the previous four models. Combine these together in a single feature matrix with a
[`FeatureUnion`](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion).

In [50]:
#Modify pipelines to accept "raw data or shuff_data" as input

####Modified attributes pipeline
selected_columns_att = ['attributes']
ct_att = ColumnTransformer(
    transformers=[("catcol", 'passthrough', selected_columns_att)])


class ToDataFrame(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        # X is a list of dictionaries; return it as a pandas DataFrame
        return pd.DataFrame(X)

to_data_frame = ToDataFrame()

class ArrayToList(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        # X is the NumPy array produced by the ColumnTransformer;
        # return its rows as a pandas Series of lists
        return pd.Series(X.tolist())

array_to_list = ArrayToList()

class AttributeTransformer(BaseEstimator, TransformerMixin):
    def flatten_dict(self, d, prefix=''):
        if not isinstance(d, dict):
            return {}

        result = {}
        for key, value in d.items():
            if isinstance(value, dict):
                # Carry the accumulated prefix so deeper nesting keeps unique keys
                result.update(self.flatten_dict(value, f'{prefix}{key}_'))
            elif isinstance(value, bool):
                result[f'{prefix}{key}'] = int(value)
            elif isinstance(value, (str, int)):
                result[f'{prefix}{key}_{value}'] = 1
            else:
                result[f'{prefix}{key}'] = value
        return result

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        flattened_dicts = [self.flatten_dict(inner_dict[0]) for inner_dict in X]
        #output_df = pd.DataFrame(flattened_dicts)
        #output_df.fillna(0, inplace=True)
        #self.feature_names = output_df.columns.tolist()
        return flattened_dicts
    
att_transformer = AttributeTransformer()

#Custom predictor (combines linear and nonlinear model) to be called in attributes pipeline
class CustomPredictor(BaseEstimator, RegressorMixin):

    def __init__(self, first_predictor, second_predictor):
        self.first_predictor = first_predictor
        self.second_predictor = second_predictor

    def fit(self, X, y):
        self.first_predictor.fit(X, y)
        residuals = y - self.first_predictor.predict(X)
        self.second_predictor.fit(X, residuals)
        return self

    def predict(self, X):
        return self.first_predictor.predict(X) + self.second_predictor.predict(X)

# Create the custom predictor from a linear model and a non-linear model
linear_model = Ridge()
# Wrap the ridge model in a grid search over alpha with 5-fold cross-validation
param_grid = {'alpha': range(1,100)}
new_linear_model = GridSearchCV(linear_model, param_grid, cv=5, n_jobs=-1)

# Random forest from scikit-learn; CustomPredictor fits it on the linear model's residuals
non_linear_model = RandomForestRegressor(min_samples_leaf=5, random_state=42)
custom_predictor = CustomPredictor(new_linear_model, non_linear_model)



att_pipe = Pipeline([
        ("df_trans", to_data_frame),
        ("col_trans", ct_att),
        ("array_trans", array_to_list),
        ("att_trans", att_transformer),
        ('vectorize', DictVectorizer(sparse=False)),
        ("cust_pre", custom_predictor)
])


####Modified categories pipeline
selected_columns_cat = ['categories']
ct_cat = ColumnTransformer(
    transformers=[("catcol", 'passthrough', selected_columns_cat)])

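def makedict(X):
    # Build one {category: count} dictionary per venue.
    # Series.value_counts replaces the deprecated top-level pd.value_counts.
    listtester = []
    for categories in X:
        counts = pd.Series(categories, dtype=object).value_counts().to_dict()
        listtester.append(counts)
    return listtester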

class DictEncoder_pipe(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X is a single-column array from the ColumnTransformer; unwrap each
        # row to get that venue's list of categories
        X = [item for sublist in X for item in sublist]
        # Return a list of {category: count} dictionaries
        return makedict(X)

to_dict_en = DictEncoder_pipe()
        
CategoriesPipeline_fit_model = Pipeline([
    ('df_trans', to_data_frame), # convert the list of dictionaries into a DataFrame
    ('col_trans', ct_cat), # select the 'categories' column
    ('dict_trans', to_dict_en), # encode each category list as a dictionary
    ('vectorize', DictVectorizer(sparse=False)),
    ('scaler', StandardScaler()), # standardize the vectorized features
    ('regressor', Ridge()), # fit a ridge regression
])

#CategoriesPipeline_fit_model.fit(shuff_data[:5],shuff_ratings[:5]) #test to make sure pipeline runs

In [81]:
city_model.fit(shuff_data[:5],shuff_ratings[:5]) #test to make sure pipeline runs

In [99]:
#Run model transformer to create predictors from 4 models

#city model
city_trans = ModelTransformer(city_model)

#knn model
knn_o = gs_est #feed in optimal number that was found previously with .best_params_
knn_trans = ModelTransformer(knn_o)

#categories ridge regressor
ridge_model = CategoriesPipeline_fit_model
ridge_trans = ModelTransformer(ridge_model)

#custom predictor for attributes
lin_nonlin_model = att_pipe
lin_nonlin_trans = ModelTransformer(lin_nonlin_model)


In [100]:
from sklearn.pipeline import FeatureUnion

# Create a FeatureUnion that combines the ModelTransformer objects
union = FeatureUnion([
    ("ridge", ridge_trans), 
    ("knn", knn_trans),
    ("city", city_trans), 
    ("lin_nonlin", lin_nonlin_trans)
])

Our `FeatureUnion` object should return a feature matrix with four columns.

In [None]:
# Fit the FeatureUnion to the training data
union.fit(data[:5], star_ratings[:5])
X_t = union.transform(data[:5])

# Transformed data should have 5 rows and 4 columns
grader.check(X_t.shape == (5, 4))

Finally, use a pipeline to combine the feature union with a linear regression (or another model) to weight the predictions.

In [102]:

# Create the pipeline with the FeatureUnion and a linear regression model
pipeline = Pipeline([
    ("union", union),
    ("regressor", LinearRegression())
])

# Fit the pipeline on the full training data

#X_train, X_test, y_train, y_test = train_test_split(shuff_data,shuff_ratings, test_size=0.2, random_state=150)

pipeline.fit(data, star_ratings)
#y_pred = pipeline.predict(shuff_data)

Fitting 3 folds for each of 99 candidates, totalling 297 fits


In [103]:
grader.score('ml__full_model', pipeline.predict)  # Edit to appropriate name

Your score: 0.9665
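To see how the final linear regression weights each base model, one option is to inspect its coefficients. The sketch below does this on synthetic data with two stand-in base models. `ToyModelTransformer`, `X_toy`, `y_toy`, and `toy_stack` are hypothetical names introduced only for this illustration; they are not the notebook's models or the Yelp data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import FeatureUnion, Pipeline


class ToyModelTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ModelTransformer so the sketch runs alone."""

    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X, y=None):
        self.predictor.fit(X, y)
        return self

    def transform(self, X):
        # One column of predictions per wrapped model
        return np.asarray(self.predictor.predict(X)).reshape(-1, 1)


# Synthetic data (an assumption for illustration, not the Yelp data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(scale=0.1, size=200)

toy_stack = Pipeline([
    ("union", FeatureUnion([
        ("ridge", ToyModelTransformer(Ridge())),
        ("knn", ToyModelTransformer(KNeighborsRegressor(n_neighbors=5))),
    ])),
    ("regressor", LinearRegression()),
])
toy_stack.fit(X_toy, y_toy)

# One learned weight per base model, in FeatureUnion order
weights = toy_stack.named_steps["regressor"].coef_
print(dict(zip(["ridge", "knn"], weights)))
```

Note that in this setup the blender is trained on the base models' predictions for the same rows they were fit on, so flexible models can look better than they would on unseen data. The weights are best read as a rough guide rather than a reliable ranking.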


**Extension:** Try a non-linear model such as [`RandomForestRegressor`](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) to blend the predictions of the four models. Are you able to get better results? If so, what do you think it's learning how to do?
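One possible way to approach the extension is to compare a linear blender with a random forest blender using cross-validation on the stacked predictions. The sketch below assumes a synthetic four-column matrix, `base_predictions`, standing in for the output of the feature union. The noise levels, sample size, and forest settings are arbitrary choices for illustration, so the scores it prints say nothing about the Yelp data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four-column matrix of base-model predictions
rng = np.random.default_rng(0)
truth = rng.uniform(1, 5, size=300)
base_predictions = np.column_stack([
    truth + rng.normal(scale=s, size=300) for s in (0.3, 0.5, 0.7, 0.9)
])

blenders = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(
        n_estimators=50, min_samples_leaf=5, random_state=0
    ),
}

# Cross-validated R^2 (the default regressor score) for each blender
blend_scores = {
    name: cross_val_score(model, base_predictions, truth, cv=5).mean()
    for name, model in blenders.items()
}
print(blend_scores)
```

In the notebook itself, the analogous change would be to replace `LinearRegression()` in the final pipeline step with a `RandomForestRegressor` and score the result with the grader.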

*Copyright &copy; 2023 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*