# Preparing Business Data for Modeling

Using the data we prepared in previous notebooks, we create a validation set and a feature matrix for use in classification models.

In [1]:
import pandas as pd
import numpy as np

## Load the review features dataset `bus` and the topic matrix `dt_matrix`

In [2]:
bus = pd.read_csv('../data/businesses.csv', compression='gzip')


In [3]:
dt_matrix = np.load('../data/business_doc_topic_matrix.npy')

We will combine the two matrices into a single `features` matrix, so we need to ensure that they have the same number of rows.

First we eliminate non-English reviews using the index array we made previously:

In [4]:
correct_index = np.load('../data/bus_eng_index.npy')
bus = bus[bus.index.isin(correct_index)]

In [5]:
bus = bus.reset_index(drop=True)

In [6]:
bus.shape, dt_matrix.shape

((1952542, 28), (1952542, 325))

We can see that `bus` and `dt_matrix` have the same number of rows. Let's make a final check for null values before we combine our feature matrices:

In [7]:
bus[bus.isnull().any(axis=1)]

Unnamed: 0,stars,text,useful,funny,cool,state,active_life,arts_and_entertainment,automotive,beauty_and_spas,...,local_services,mass_media,nightlife,pets,professional_services,public_services_and_government,religious_organizations,restaurants,shopping,review_length
1567253,5.0,Working with Tina and Marcia has been such a p...,,,,,,,,,...,,,,,,,,,,
1567254,My husband and I had not purchased a home befo...,0,0.0,0.0,NV,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,52.0,


We can drop these two reviews: the rows appear malformed (values shifted across columns), so they carry no reliable feature data other than their text content.

In [8]:
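bus.drop([1567253, 1567254], axis=0, inplace=True)
bus.reset_index(drop=True, inplace=True)  # keep labels aligned with row positions in the doc-topic matrix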

In [9]:
bus.shape

(1952540, 28)

We also eliminate the topic vectors for those rows and assign the new doc-topic matrix to `right_matrix`:

In [10]:
right_matrix = np.delete(dt_matrix, [1567253, 1567254], 0)

In [11]:
right_matrix.shape

(1952540, 325)
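
Because `bus` is trimmed by index label while `dt_matrix` is trimmed by row position, the two only stay aligned as long as labels and positions agree. The following is a minimal sketch on toy data, not the notebook's actual data; the names `toy_df`, `toy_matrix`, and `bad_rows` are hypothetical. It drops the same rows from both structures and then resets the DataFrame index so that label `i` again refers to row `i` of the array.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `bus` and `dt_matrix` (hypothetical data)
toy_df = pd.DataFrame({"useful": [0, 1, 5, 2, 0]})
toy_matrix = np.arange(10).reshape(5, 2)

# Labels equal positions here because the index is 0..n-1
bad_rows = [1, 2]

# Drop by label from the DataFrame and by position from the array
toy_df = toy_df.drop(bad_rows, axis=0).reset_index(drop=True)
toy_matrix = np.delete(toy_matrix, bad_rows, axis=0)

# After the reset, both structures have matching row counts and ordering
assert toy_df.shape[0] == toy_matrix.shape[0]
print(toy_df["useful"].tolist())
print(toy_matrix[:, 0].tolist())
```

Without the index reset, any later lookup that uses DataFrame labels as array positions would be offset by the number of dropped rows.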

## Create validation set

At this point, our two matrices have the same number of rows. We propose creating a validation set from our data using the following scheme:

 - Reviews with no useful votes will be classified as not useful (0).
 - Reviews with either 1 or 2 useful votes will be held out as a validation set. 
 - Reviews with at least 3 useful votes will be classified as useful (1).
 
This approach is intended to control for unknown factors that influence how many votes a review can get, such as page views and clicks, as well as the algorithm Yelp uses to filter and display reviews on a specific business's page. Under this scheme, we treat three or more 'Useful' votes as representative of a consensus, whereas reviews with only 1 or 2 votes may or may not be useful.

In [12]:
def useful_mapper(x):
    if x == 0:
        return 0
    elif x in (1, 2):
        return "Validation"
    elif x >= 3:
        return 1

In [13]:
bus['usefulness'] = bus['useful'].map(useful_mapper)

In [14]:
bus['usefulness'].value_counts()

0             905873
Validation    635308
1             411359
Name: usefulness, dtype: int64
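
The three-way scheme can also be expressed without a Python-level function. The sketch below is one possible alternative, not the approach used in this notebook: it bins a made-up series of vote counts with `pd.cut`. The series `votes` and the label strings are assumptions chosen for illustration, and the result is a categorical column rather than the mixed `0` / `1` / `"Validation"` values produced above.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts standing in for bus['useful']
votes = pd.Series([0, 1, 2, 3, 10])

# Right-closed bins: (-1, 0] -> no votes, (0, 2] -> 1 or 2 votes, (2, inf] -> 3 or more
buckets = pd.cut(
    votes,
    bins=[-1, 0, 2, np.inf],
    labels=["not useful", "validation", "useful"],
)
print(buckets.tolist())
```

This assumes vote counts are non-negative integers; missing values would stay missing in the output.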

In [15]:
bus.shape[0]

1952540

### Use index of validation set to create array of topic vectors for train set

In [16]:
valid_index = bus[bus['usefulness'] == 'Validation'].index

In [17]:
valid_index

Int64Index([      3,       4,       5,       7,       9,      17,      19,
                 21,      25,      27,
            ...
            1952513, 1952514, 1952515, 1952518, 1952521, 1952523, 1952526,
            1952527, 1952529, 1952539],
           dtype='int64', length=635308)

In [19]:
right_matrix_valid = right_matrix[valid_index]
right_matrix_valid.shape

(635308, 325)

In [18]:
right_matrix_train = np.delete(right_matrix, valid_index, axis=0)

In [24]:
right_matrix_train.shape

(1317232, 325)
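
The validation and train row counts above sum to the full 1,952,540 rows, as expected for a partition. As a hedged illustration of the same split, the sketch below uses a boolean mask on a toy array (the names `matrix` and `held_out` are hypothetical stand-ins for `right_matrix` and `valid_index`) and checks that it agrees with the index-plus-`np.delete` approach.

```python
import numpy as np

matrix = np.arange(12).reshape(6, 2)  # hypothetical stand-in for right_matrix
held_out = np.array([1, 4])           # hypothetical stand-in for valid_index

# Mark the held-out rows, then select each side of the split with the mask
mask = np.zeros(matrix.shape[0], dtype=bool)
mask[held_out] = True

valid_part = matrix[mask]
train_part = matrix[~mask]

# Both approaches select the same rows, and the two parts cover every row once
assert np.array_equal(valid_part, matrix[held_out])
assert np.array_equal(train_part, np.delete(matrix, held_out, axis=0))
assert valid_part.shape[0] + train_part.shape[0] == matrix.shape[0]
```

Both approaches rely on `held_out` holding row positions, which is only true of a DataFrame index when it runs from 0 to n-1 without gaps.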

### Create the validation and train datasets and save to csv

In [None]:
bus_valid_set = bus[bus.index.isin(valid_index)]

bus_valid_set.to_csv('../data/businesses_validation.csv', index=False)

In [20]:
bus_train = bus[~bus.index.isin(valid_index)]

In [22]:
bus_train.to_csv('../data/businesses_train.csv', index=False)

In [23]:
bus_train.values.shape

(1317232, 29)

### Create train feature vector

We drop `text`, `useful`, `cool`, and `state` from the feature vectors.

In [26]:
bus_train = bus_train.drop(columns=['text', 'useful', 'cool', 'state'])

left_matrix_train = bus_train[bus_train.columns[:-1]].values

left_matrix_train.shape, right_matrix_train.shape

del bus, bus_train, correct_index, dt_matrix, right_matrix

features = np.hstack((left_matrix_train, right_matrix_train))


In [36]:
np.save('../data/businesses_train_features.npy', features)
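
To make the column-wise stacking concrete, here is a small sketch with made-up numbers. The arrays `left` and `right` are hypothetical stand-ins for the metadata and topic matrices. `np.hstack` requires both inputs to have the same number of rows and produces one row per review with the columns placed side by side.

```python
import numpy as np

# Three reviews with two metadata columns each (made-up values)
left = np.array([[5.0, 120.0], [1.0, 40.0], [3.0, 75.0]])
# The same three reviews with three topic weights each (made-up values)
right = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])

combined = np.hstack((left, right))
print(combined.shape)  # (3, 5)
```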

### Create validation feature vector

In [20]:
bus_valid = pd.read_csv('../data/businesses_validation.csv')

bus_valid.columns

bus_valid = bus_valid.drop(columns=['text', 'useful', 'cool', 'state'])

left_matrix_valid = bus_valid[bus_valid.columns[:-1]].values

left_matrix_valid.shape, right_matrix_valid.shape

del bus_valid  # bus, correct_index, dt_matrix, and right_matrix were already deleted in the train step

valid_features = np.hstack((left_matrix_valid, right_matrix_valid))

np.save('../data/valid_features.npy', valid_features)

### Create train target vector

In [37]:
bus_train = pd.read_csv('../data/businesses_train.csv')

In [41]:
np.save('../data/business_target.npy', bus_train.usefulness.values)
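
Before modeling, it can be worth confirming that a saved feature matrix and target vector still line up after being reloaded. The sketch below shows the idea on toy arrays written to a temporary directory. The file names and the arrays `toy_features` and `toy_target` are assumptions for illustration, not the notebook's actual files.

```python
import os
import tempfile

import numpy as np

# Toy feature matrix and binary target (hypothetical data)
toy_features = np.random.default_rng(0).random((4, 3))
toy_target = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp:
    feat_path = os.path.join(tmp, "features.npy")
    target_path = os.path.join(tmp, "target.npy")
    np.save(feat_path, toy_features)
    np.save(target_path, toy_target)
    X = np.load(feat_path)
    y = np.load(target_path)

# One target value per feature row, and only the two expected classes
assert X.shape[0] == y.shape[0]
assert set(np.unique(y)) <= {0, 1}
```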