# Assignment 2

***The task of this assignment is to implement an improved version of the Naive Bayes algorithm that predicts the category of points of interest from the Yelp dataset: one of "Restaurants", "Shopping" or "Nightlife". The implementation is then applied to the test set, which has no class labels, and the predictions are submitted to Kaggle, where the system evaluates them against the ground-truth data and reports the accuracy.***

## Report Section

### 1. Splitting the dataset

The training set and test set of this assignment are separate files and the test set has no class labels, which means that, unlike in the previous experimental environment, we cannot evaluate the model locally on the test set.</br>
We therefore consider two procedures for splitting the training set: an 80/20 training/validation split and 5-fold cross-validation.</br>
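
The sizes produced by the two procedures can be checked with a little arithmetic. This is only a sketch: it assumes the validation share of the 80/20 split is rounded up and that the first folds of the 5-fold split absorb the remainder, which agrees with the shapes printed in the Code Section (2298 and 575 rows).

```python
import math

n_samples = 2873  # rows in train.csv

# 80/20 hold-out split: assume the validation share is rounded up
n_val = math.ceil(0.2 * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)  # 2298 575

# 5-fold cross-validation: fold sizes differ by at most one row
n_splits = 5
base, extra = divmod(n_samples, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
print(fold_sizes)  # [575, 575, 575, 574, 574]
```

With 5 folds every row is used for validation exactly once, whereas the hold-out split validates on a single set of 575 rows.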

### 2. Data preprocessing

After loading and inspecting the data, we find that its integrity is good, with no missing values.</br>
However, the task of this assignment is to classify text data, so some preprocessing of the text is necessary before training the model.

#### 2.1 Preprocessor

First we define some preprocessing rules: convert letters to lowercase, remove numeric characters, and remove punctuation.</br>
These rules will be applied as a preprocessor in the next feature extraction stage to remove some unnecessary noise.
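
As an illustration of what these rules do to a single review, here is a small sketch. The function name `preprocess` and the sample sentence are made up for this example; the three regular-expression steps mirror the rules listed above.

```python
import re

def preprocess(text):
    text = text.lower()                # convert to lowercase
    text = re.sub(r'\d+', '', text)    # remove numeric characters
    text = re.sub(r'\W+', ' ', text)   # replace each run of punctuation/whitespace with one space
    return text

sample = "Amazing service. Cool vibe. It's not spooky, 10/10!"
cleaned = preprocess(sample)
print(repr(cleaned))  # 'amazing service cool vibe it s not spooky '
```

Note that the apostrophe in "It's" is treated as punctuation, so the word is split into "it" and "s". Short fragments like these are later discarded by the token pattern of the vectorizer.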

#### 2.2 Vectorizer

Sklearn provides vectorizers such as CountVectorizer and TfidfVectorizer, both of which are common text feature extraction methods.</br>
For each text, CountVectorizer considers only how often each word appears in that text.</br>
TfidfVectorizer builds on CountVectorizer and additionally weights each word by its inverse document frequency, so that words appearing in many other texts count for less.</br>
The larger the training set, the more advantageous the TfidfVectorizer weighting becomes.</br>
In this assignment, because the training set is small (2873 samples), we choose CountVectorizer.
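
The difference between the two weightings can be seen on a toy corpus. The sketch below is written in plain Python rather than with the sklearn classes, the three documents are invented, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is an assumed variant chosen for illustration. No length normalisation is applied.

```python
import math
from collections import Counter

docs = [
    "great food great service",
    "great shoes",
    "cocktails and live music",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many texts does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def count_features(tokens):
    """Raw term counts, the quantity a count-based vectorizer records."""
    return dict(Counter(tokens))

def tfidf_features(tokens):
    """Term counts scaled by a smoothed inverse document frequency (assumed variant)."""
    return {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, c in Counter(tokens).items()}

counts_doc0 = count_features(tokenized[0])
tfidf_doc0 = tfidf_features(tokenized[0])
print(counts_doc0)
print(tfidf_doc0)
```

In the first document "great" has the highest raw count, but because it also occurs in the second document each of its occurrences receives a smaller weight than the single occurrence of "food".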

#### 2.3 Parameters of CountVectorizer

CountVectorizer has several parameters that can be adjusted.</br>
Setting stop words can filter out some frequently occurring but meaningless words, such as articles, prepositions or conjunctions.</br>
Setting token_pattern can filter out one- or two-letter words that interfere with training.</br>
Increasing max_features keeps more of the vocabulary and can raise the accuracy of the model, but too large a value may reduce the model's ability to generalize.
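
To make the effect of two of these parameters concrete, the following sketch applies the token pattern used later in the notebook with the standard `re` module and then keeps only the most frequent tokens, which is the idea behind max_features. The sample texts and the helper names are hypothetical, and the sketch does not reproduce every detail of CountVectorizer.

```python
import re
from collections import Counter

TOKEN_PATTERN = r'(?u)\b\w\w\w+\b'  # tokens of three or more word characters

# One- and two-letter fragments are dropped by the pattern
tokens = re.findall(TOKEN_PATTERN, "it s not a bad po boy shop")
print(tokens)  # ['not', 'bad', 'boy', 'shop']

# Keep only the k most frequent tokens across a (made-up) corpus
texts = ["great pizza great beer", "pizza and beer", "great shoes"]
frequency = Counter(tok for t in texts for tok in re.findall(TOKEN_PATTERN, t))
max_features = 3
vocabulary = {word for word, _ in frequency.most_common(max_features)}
print(sorted(vocabulary))  # ['beer', 'great', 'pizza']
```

Words outside the retained vocabulary, here "and" and "shoes", contribute nothing to the feature vector.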

### 3. Model selection

Sklearn provides Naive Bayes models such as BernoulliNB, GaussianNB and MultinomialNB.</br>
MultinomialNB is suitable for classification with discrete features, such as word counts in text classification.</br>
The multinomial distribution normally requires integer feature counts, so CountVectorizer and MultinomialNB are a good combination for this assignment.</br>
The parameters of MultinomialNB include: smoothing parameter 'alpha', 'fit_prior' and 'class_prior'.</br>
We use GridSearchCV to tune 'alpha' and 'fit_prior' to get the best model.
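
The role of the smoothing parameter can be illustrated with additive smoothing of per-class word counts. This is a minimal sketch with invented counts and a hypothetical helper name; it shows the general formula `(count + alpha) / (total + alpha * vocabulary_size)` rather than the library's internal code.

```python
def smoothed_word_probs(word_counts, alpha):
    """Estimate P(word | class) from per-class word counts with additive smoothing."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in word_counts.items()}

# Made-up counts of three words inside one class
counts_in_class = {"pizza": 3, "cocktail": 0, "shoes": 1}

probs_alpha_1 = smoothed_word_probs(counts_in_class, alpha=1.0)
probs_alpha_01 = smoothed_word_probs(counts_in_class, alpha=0.1)
print(probs_alpha_1)   # pizza 4/7, cocktail 1/7, shoes 2/7
print(probs_alpha_01)
```

A word never seen in a class ("cocktail" here) still receives a non-zero probability, and a smaller alpha moves the estimates closer to the raw relative frequencies.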

### 4. Implementation 

#### 4.1 Task_1

In task 1, we train the model separately with the above two procedures while keeping the model parameters unchanged.</br>
The results show that 5-fold cross-validation performs slightly better than the 80/20 training/validation split.</br>
So in task 2, we use only 5-fold cross-validation to train the model.</br>
As requested, we train the model on the "review" attribute only in task 1.</br>
We also use "predict_proba" to keep the predicted class probabilities for the test data, so that they can be combined with other Naive Bayes models in task 2.</br>
In task 1, we get two outputs: "predict_r_s.csv" and "predict_R_cv.csv" from the training/validation split and 5-fold cross-validation models, respectively.

#### 4.2 Task_2

In Task 2, we consider that attributes "name" and "mean_checkin_time" may be useful for the prediction.</br>
First we train the model based only on the "name" attribute, and then get the probability of the test data under this model via "predict_proba".</br>
Under the Naive Bayes assumption, the attributes are conditionally independent given the class, so p(y|x1,x2) is proportional to p(y|x1)·p(y|x2)/p(y); the predicted class is the one that maximizes the product of the two models' "predict_proba" outputs divided by the class prior p(y).</br>
Similarly, we train a model on the "mean_checkin_time" attribute alone and obtain the test probabilities under that model.</br>
Next, we combine the class probabilities obtained in</br>
Task 1 Section 1.3 (i.e., the model based only on the "review" attribute and cross-validation) with those from</br>
Task 2 Section 2.2 (i.e., the model based only on the "name" attribute) to get the prediction result "predict_R_N.csv".</br>
Similarly, we combine the probabilities behind "predict_R_N.csv" with those from</br>
Task 2 Section 2.3 (i.e., the model based only on the "time" attribute) to get the prediction result "predict_R_N_T.csv".
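
The combination rule used above can be derived in a few lines. This is a sketch under the Naive Bayes assumption that two attributes $x_1$ and $x_2$ (for example "review" and "name") are conditionally independent given the class $y$; the symbols are introduced here for illustration and do not appear in the notebook.

$$
\begin{aligned}
p(y \mid x_1, x_2)
&= \frac{p(x_1, x_2 \mid y)\, p(y)}{p(x_1, x_2)}
 = \frac{p(x_1 \mid y)\, p(x_2 \mid y)\, p(y)}{p(x_1, x_2)} \\
&= \frac{p(y \mid x_1)\, p(x_1)}{p(y)} \cdot \frac{p(y \mid x_2)\, p(x_2)}{p(y)} \cdot \frac{p(y)}{p(x_1, x_2)} \\
&\propto \frac{p(y \mid x_1)\, p(y \mid x_2)}{p(y)}
\end{aligned}
$$

The dropped factor $p(x_1)\,p(x_2)/p(x_1, x_2)$ does not depend on $y$, so it does not affect which class is largest. By the same argument, three attributes give $p(y \mid x_1, x_2, x_3) \propto p(y \mid x_1)\, p(y \mid x_2)\, p(y \mid x_3) / p(y)^2$.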

### 5. Evaluation 

Task 1 Section 1.2: The model reaches an accuracy of 0.89043 on the validation set.</br>
We submit the prediction result 'predict_r_s.csv' to Kaggle and get a Public Score of 0.87572, slightly higher than the baseline of 0.87283.</br>
Task 1 Section 1.3: The model reaches a mean accuracy of 0.89001 over the five validation folds.</br>
We submit the prediction result 'predict_R_cv.csv' to Kaggle and get a Public Score of 0.87861, slightly higher than the 0.87572 of 'predict_r_s.csv' and the baseline of 0.87283.</br>
Task 2 Section 2.2: We train the model on the "name" attribute only, with cross-validation, and the validation accuracy is not high (0.75844).</br>
However, after we combine it with the model trained with cross-validation in task 1, the accuracy of the prediction results improves considerably.</br>
We submit the prediction result 'predict_R_N.csv' to Kaggle and get a Public Score of 0.90462, higher than the baseline of 0.87283.</br>
Task 2 Section 2.3: We train the model on the "time" attribute only, with cross-validation, and the validation accuracy is very low (0.61922), which is about the share of the majority class "Restaurants" in the training set (1779/2873).</br>
After we combine it with the probabilities behind 'predict_R_N.csv', the accuracy of the prediction results does not change.</br>
We submit the prediction result 'predict_R_N_T.csv' to Kaggle and get a Public Score of 0.90462, the same as 'predict_R_N.csv' and higher than the baseline of 0.87283.
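
One possible explanation for the unchanged score, offered as a sketch rather than a confirmed diagnosis: when a multinomial Naive Bayes model is given a single feature column, the smoothed per-class feature probability is `(total + alpha) / (total + alpha * 1)`, which equals 1 for every class, so the posterior reduces to the class prior whatever the feature value. The function below re-implements that formula in plain Python; its name and the per-class feature totals are made up for illustration.

```python
import math

def single_feature_posterior(class_counts, feature_totals, x, alpha=1.0):
    """Posterior over classes for a multinomial NB model with ONE feature column."""
    n = sum(class_counts)
    log_scores = []
    for n_c, total in zip(class_counts, feature_totals):
        # Smoothed estimate over a vocabulary of size 1: numerator equals denominator
        theta = (total + alpha) / (total + alpha * 1)
        log_scores.append(math.log(n_c / n) + x * math.log(theta))
    # Normalise in a numerically stable way
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

class_counts = [312, 1779, 782]              # Nightlife, Restaurants, Shopping
feature_totals = [2500.0, 28000.0, 11000.0]  # made-up per-class sums of the feature
prior = [c / sum(class_counts) for c in class_counts]

posteriors = [single_feature_posterior(class_counts, feature_totals, x)
              for x in (3.0, 9.5, 17.0)]
for p in posteriors:
    print([round(v, 4) for v in p])
```

Every row receives the class prior as its posterior. Multiplying the other models' probabilities by the prior and then dividing by the prior leaves their ranking untouched, which would be consistent with 'predict_R_N_T.csv' scoring the same as 'predict_R_N.csv'.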

## Code Section

## Task 0

***Before starting, we need to import the relevant packages and set the random seed.***

In [1]:
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Automatically reload all imported modules before executing each cell
%load_ext autoreload
%autoreload 2

In [3]:
# Set random seed
RANDOM_STATE = 1234
np.random.seed(RANDOM_STATE)

## Task 1

***Improve the benchmark model based on the review attribute only. Since it is a text classification problem, we will consider "MultinomialNB" as the classifier, and "CountVectorizer" as the text preprocessor.</br>
In addition, two procedures for splitting the training set will be tried separately: splitting the training and validation sets by 80/20, and splitting the entire training set with 5-fold cross-validation.***

### 1.1 Load data

#### 1.1.1 Load training data

In [4]:
# Load data and check integrity
train = pd.read_csv('train.csv')
train.head()

| | ID | name | latitude | longitude | mean_checkin_time | review | category |
|---|---|---|---|---|---|---|---|
| 0 | 3007 | The New Orleans Vampire Cafe | 29.959033 | -90.064036 | 17.0 | Amazing service. Cool vibe. It's not spooky or... | Restaurants |
| 1 | 1829 | Ted's Frostop | 29.947026 | -90.113604 | 17.0 | Breakfast here is great and there's never a hu... | Restaurants |
| 2 | 298 | The Will & The Way | 29.957573 | -90.065827 | 9.5 | So glad that we stumbled in here! The cheesebu... | Restaurants |
| 3 | 1245 | Public Belt | 29.946393 | -90.063729 | 3.0 | AMAZING! Try this place out. Great for some g... | Nightlife |
| 4 | 2902 | Phillys Cafe | 29.941818 | -90.094797 | 18.0 | WooHoo, best philly cheese staks I have had in... | Restaurants |


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2873 entries, 0 to 2872
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 2873 non-null   int64  
 1   name               2873 non-null   object 
 2   latitude           2873 non-null   float64
 3   longitude          2873 non-null   float64
 4   mean_checkin_time  2873 non-null   float64
 5   review             2873 non-null   object 
 6   category           2873 non-null   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 157.2+ KB


In [6]:
train.isnull().sum()

ID                   0
name                 0
latitude             0
longitude            0
mean_checkin_time    0
review               0
category             0
dtype: int64

In [7]:
train.category.value_counts()

Restaurants    1779
Shopping        782
Nightlife       312
Name: category, dtype: int64

***The integrity of the training set is good: no missing values are found. The classes are imbalanced, however: "Restaurants" is the largest category (1779 of 2873 rows) and "Nightlife" is the smallest (312 rows).***

#### 1.1.2 Load testing data

In [8]:
# Load data and check integrity
test = pd.read_csv('test.csv')
test.head()

| | ID | name | latitude | longitude | mean_checkin_time | review |
|---|---|---|---|---|---|---|
| 0 | 2406 | Courtyard Grill Restaurant at Bourbon Heat | 29.958905 | -90.0655 | 7.5 | Terrible food and service . don't waste your m... |
| 1 | 1401 | Papa John's Pizza | 29.944986 | -90.07683 | 18.0 | Like everyone else said, don't order if you're... |
| 2 | 2783 | District Donut & Coffee Bar | 29.921412 | -90.117817 | 16.0 | Great little breakfast spot! The donuts are ... |
| 3 | 1352 | Pyramids Cafe | 29.947334 | -90.113001 | 17.0 | This place gets 4 stars for service, delicious... |
| 4 | 303 | Tsunami Sushi | 29.949966 | -90.069677 | 14.0 | Pleasantly surprised by the presentation of th... |


In [9]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693 entries, 0 to 692
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 693 non-null    int64  
 1   name               693 non-null    object 
 2   latitude           693 non-null    float64
 3   longitude          693 non-null    float64
 4   mean_checkin_time  693 non-null    float64
 5   review             693 non-null    object 
dtypes: float64(3), int64(1), object(2)
memory usage: 32.6+ KB


In [10]:
test.isnull().sum()

ID                   0
name                 0
latitude             0
longitude            0
mean_checkin_time    0
review               0
dtype: int64

***The integrity of the testing set is good: no missing values are found.***

### 1.2 Method 1 (Split dataset by 80/20)

#### 1.2.1 Split dataset

In [11]:
# Split features and class
x_r = train['review'].to_numpy()
y_r = train['category'].to_numpy()
x_test_r = test['review'].to_numpy()
print(x_r.shape, y_r.shape, x_test_r.shape)

(2873,) (2873,) (693,)


In [12]:
x_train_r, x_val_r, y_train_r, y_val_r = train_test_split(x_r, y_r, test_size=0.2, random_state=RANDOM_STATE)
print(x_train_r.shape, x_val_r.shape, y_train_r.shape, y_val_r.shape)

(2298,) (575,) (2298,) (575,)


#### 1.2.2 Vectorization

In [13]:
# Define preprocessing rules
def preprocessor_r(text):
    text = text.lower() # convert to lowercase
    text = re.sub(r'\d+', '', text) # remove numeric characters
    text = re.sub(r'\W+', ' ', text) # remove punctuation
    return text

In [14]:
cv_r = CountVectorizer(preprocessor=preprocessor_r, stop_words='english', token_pattern=r'(?u)\b\w\w\w+\b', max_features=5000)
x_train_cv_r = cv_r.fit_transform(x_train_r)
x_val_cv_r = cv_r.transform(x_val_r)
print(x_train_cv_r.shape, x_val_cv_r.shape)

(2298, 5000) (575, 5000)


***Before training the model, we need to preprocess the text data, removing unnecessary noise with a few preprocessing rules.</br>
Then vectorize the data by using "CountVectorizer":</br>
Setting stop words can filter out some frequently occurring but meaningless words, such as articles, prepositions or conjunctions.</br>
Setting token_pattern can filter out one- or two-letter words that interfere with training.***

#### 1.2.3 Train the model

In [15]:
estimator_r = MultinomialNB()
param_grid_r = {'alpha': np.arange(0, 1, 0.001), 'fit_prior': [True, False]}
grid_search_r = GridSearchCV(estimator=estimator_r, param_grid=param_grid_r, cv=10, n_jobs=-1)
grid_search_r.fit(x_train_cv_r, y_train_r)
# GridSearchCV has already refit the best setting on x_train_cv_r (refit=True is the default)
best_estimator_r = grid_search_r.best_estimator_
print(best_estimator_r)

MultinomialNB(alpha=0.986)


***The parameters of MultinomialNB include smoothing parameter alpha, fit_prior and class_prior.
We use GridSearchCV to tune some parameters to get the best model.***

#### 1.2.4 Evaluate the model

In [16]:
score_r = best_estimator_r.score(x_val_cv_r, y_val_r)
print('Accuracy:', score_r)

Accuracy: 0.8904347826086957


#### 1.2.5 Predict the test data and output the results

In [17]:
# Prediction
x_test_cv_r = cv_r.transform(x_test_r)
y_test_r = best_estimator_r.predict(x_test_cv_r)
proba_test_r = best_estimator_r.predict_proba(x_test_cv_r)

# Output
result_r = zip((test['ID']), y_test_r)
output_r = pd.DataFrame(data=result_r, columns=['ID', 'category'])
output_r.to_csv('predict_r_s.csv', index=False)

***The model reaches an accuracy of 0.89043 on the validation set.
We submit the prediction result 'predict_r_s.csv' to Kaggle and get a Public Score of 0.87572, slightly higher than the baseline of 0.87283.***

### 1.3 Method 2 (Split dataset with 5-fold CV)

In [18]:
# Split dataset
x_R = train['review'].to_numpy()
y_R = train['category'].to_numpy()
x_test_R = test['review'].to_numpy()
print(x_R.shape, y_R.shape, x_test_R.shape)

(2873,) (2873,) (693,)


In [19]:
# Create list to store the results of each evaluation
scoreList_R = []

kf = KFold(n_splits=5, random_state=RANDOM_STATE, shuffle=True)
for train_index, val_index in kf.split(x_R, y_R):
    x_train_R, x_val_R = x_R[train_index], x_R[val_index]
    y_train_R, y_val_R = y_R[train_index], y_R[val_index]
    
    # Vectorization
    cv_R = CountVectorizer(lowercase=True, preprocessor=preprocessor_r, stop_words='english', token_pattern=r'(?u)\b\w\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=5000)
    x_train_cv_R = cv_R.fit_transform(x_train_R)
    x_val_cv_R = cv_R.transform(x_val_R)
    
    # Train the model
    estimator_R = MultinomialNB()
    param_grid_R = {'alpha': np.arange(0, 1, 0.001), 'fit_prior': [True, False]}
    grid_search_R = GridSearchCV(estimator=estimator_R, param_grid=param_grid_R, cv=10, n_jobs=-1)
    grid_search_R.fit(x_train_cv_R, y_train_R)
    # GridSearchCV has already refit the best setting on this fold's training data
    best_estimator_R = grid_search_R.best_estimator_
    
    # Evaluate the model
    score_R = best_estimator_R.score(x_val_cv_R, y_val_R)
    scoreList_R.append(score_R)

print('Mean accuracy:', np.mean(scoreList_R))

Mean accuracy: 0.8900087865474928


In [20]:
# Prediction (uses the vectorizer and estimator fitted in the last CV fold)
x_test_cv_R = cv_R.transform(x_test_R)
y_test_R = best_estimator_R.predict(x_test_cv_R)
proba_test_R = best_estimator_R.predict_proba(x_test_cv_R)

# Output
result_R = zip((test['ID']), y_test_R)
output_R = pd.DataFrame(data=result_R, columns=['ID', 'category'])
output_R.to_csv('predict_R_cv.csv', index=False)

***The model reaches a mean accuracy of 0.89001 over the five validation folds.
We submit the prediction result 'predict_R_cv.csv' to Kaggle and get a Public Score of 0.87861, slightly higher than the 0.87572 of 'predict_r_s.csv' and the baseline of 0.87283.***

### 1.4 Summary

***With the MultinomialNB and CountVectorizer settings unchanged, the model from the 5-fold cross-validation run scores slightly higher on Kaggle (0.87861) than the model from the 80/20 split (0.87572), although the difference is small.
This suggests that the cross-validation procedure has a positive effect here, at the price of a higher computational cost, because the grid search is repeated in every fold.
Both sets of predictions are better than the baseline, so we move on to task 2.***
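
The extra computational cost can be estimated by counting model fits. The sketch below assumes the alpha grid `np.arange(0, 1, 0.001)` contains 1000 values and that each grid search performs one additional fit to refit the best setting; the variable names are introduced here for illustration.

```python
n_alphas = 1000      # assumed length of np.arange(0, 1, 0.001)
n_fit_prior = 2      # fit_prior in [True, False]
inner_cv = 10        # cv=10 inside GridSearchCV
outer_folds = 5      # KFold(n_splits=5)

candidates = n_alphas * n_fit_prior
fits_per_search = candidates * inner_cv + 1    # +1 for the refit on the best setting
fits_holdout = fits_per_search                 # Method 1 runs one grid search
fits_five_fold = outer_folds * fits_per_search # Method 2 runs one per outer fold

print(fits_holdout, fits_five_fold)  # 20001 100005
```

Under these assumptions the 5-fold procedure fits roughly five times as many models as the single 80/20 split.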

## Task 2

***Improve your model by adding additional attributes to your model.
Based on the study of the training set, we consider that attributes "name" and "mean_checkin_time" may be useful for the prediction.
In the following tasks, we will add them separately and test the prediction results.***

### 2.1 Load data

In [21]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

### 2.2 Based on the "name" attribute

#### 2.2.1 Train the model

In [22]:
# Split dataset
x_N = train['name'].to_numpy()
y_N = train['category'].to_numpy()
x_test_N = test['name'].to_numpy()
print(x_N.shape, y_N.shape, x_test_N.shape)

(2873,) (2873,) (693,)


In [23]:
# Define preprocessing rules
def preprocessor_n(text):
    text = text.lower() # convert to lowercase
    text = re.sub(r'\d+', '', text) # remove numeric characters
    text = re.sub(r'\W+', ' ', text) # remove punctuation
    return text

In [24]:
# Create list to store the results of each evaluation
scoreList_N = []

kf = KFold(n_splits=5, random_state=RANDOM_STATE, shuffle=True)
for train_index, val_index in kf.split(x_N, y_N):
    x_train_N, x_val_N = x_N[train_index], x_N[val_index]
    y_train_N, y_val_N = y_N[train_index], y_N[val_index]
    
    # Vectorization
    cv_N = CountVectorizer(lowercase=True, preprocessor=preprocessor_n, stop_words='english', token_pattern=r'(?u)\b\w\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=5000)
    x_train_cv_N = cv_N.fit_transform(x_train_N)
    x_val_cv_N = cv_N.transform(x_val_N)
    
    # Train the model
    estimator_N = MultinomialNB()
    param_grid_N = {'alpha': np.arange(0, 1, 0.001), 'fit_prior': [True, False]}
    grid_search_N = GridSearchCV(estimator=estimator_N, param_grid=param_grid_N, cv=10, n_jobs=-1)
    grid_search_N.fit(x_train_cv_N, y_train_N)
    # GridSearchCV has already refit the best setting on this fold's training data
    best_estimator_N = grid_search_N.best_estimator_
    
    # Evaluate the model
    score_N = best_estimator_N.score(x_val_cv_N, y_val_N)
    scoreList_N.append(score_N)

print('Mean accuracy:', np.mean(scoreList_N))

Mean accuracy: 0.7584438721405847


#### 2.2.2 Predict the probability of test data

In [25]:
# Prediction (uses the vectorizer and estimator fitted in the last CV fold)
x_test_cv_N = cv_N.transform(x_test_N)
y_test_N = best_estimator_N.predict(x_test_cv_N)
proba_test_N = best_estimator_N.predict_proba(x_test_cv_N)

#### 2.2.3 Combine the results of 2 models

In [26]:
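# Class priors p(y), in the order of classes_ (Nightlife, Restaurants, Shopping)
prior = np.array([312, 1779, 782]) / 2873
# Naive Bayes combination: p(y | review, name) is proportional to p(y | review) * p(y | name) / p(y)
p_R_N = (proba_test_R * proba_test_N) / prior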
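
To see how this combination can change a prediction, consider a single test row. The sketch below uses plain Python lists; the two probability vectors are invented for illustration and the helper `combine` is hypothetical, while the class counts are those of the training set.

```python
def combine(posteriors, prior):
    """Combine per-attribute class posteriors under the Naive Bayes assumption."""
    k = len(posteriors)
    scores = []
    for c, p_y in enumerate(prior):
        score = 1.0
        for post in posteriors:
            score *= post[c]
        # k posteriors each contain one factor p(y), so divide out k - 1 of them
        scores.append(score / p_y ** (k - 1))
    total = sum(scores)
    return [s / total for s in scores]

classes = ["Nightlife", "Restaurants", "Shopping"]
prior = [312 / 2873, 1779 / 2873, 782 / 2873]

p_review = [0.40, 0.55, 0.05]  # made-up output of the review model for one row
p_name = [0.50, 0.45, 0.05]    # made-up output of the name model for the same row

combined = combine([p_review, p_name], prior)
best = classes[max(range(len(classes)), key=combined.__getitem__)]
print(best)  # Nightlife
```

Taken alone, the review model prefers "Restaurants" for this row. Dividing by the prior discounts the large "Restaurants" class, so the two moderately confident votes for "Nightlife" win after combination.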

In [27]:
# Map each row's highest-scoring column to its label; classes_ is sorted alphabetically
# (Nightlife, Restaurants, Shopping), the same order as the class priors
l_R_N = best_estimator_R.classes_[np.argmax(p_R_N, axis=1)]

#### 2.2.4 Output result

In [28]:
# Output
result_R_N = zip((test['ID']), l_R_N)
output_R_N = pd.DataFrame(data=result_R_N, columns=['ID', 'category'])
output_R_N.to_csv('predict_R_N.csv', index=False)

#### 2.2.5 Summary

***We train the model on the "name" attribute only, with cross-validation, and the validation accuracy is not high (0.75844).</br>
However, after we combine it with the model trained with cross-validation in task 1, the accuracy of the prediction results improves considerably.</br>
We submit the prediction result 'predict_R_N.csv' to Kaggle and get a Public Score of 0.90462, higher than the baseline of 0.87283.***

### 2.3 Based on the "mean_checkin_time" attribute

#### 2.3.1 Train the model

In [29]:
# Split dataset
x_T = train['mean_checkin_time'].to_numpy()
y_T = train['category'].to_numpy()
x_test_T = test['mean_checkin_time'].to_numpy()
print(x_T.shape, y_T.shape, x_test_T.shape)

(2873,) (2873,) (693,)


In [30]:
# Create list to store the results of each evaluation
scoreList_T = []

kf = KFold(n_splits=5, random_state=RANDOM_STATE, shuffle=True)
for train_index, val_index in kf.split(x_T, y_T):
    x_train_T, x_val_T = x_T[train_index], x_T[val_index]
    y_train_T, y_val_T = y_T[train_index], y_T[val_index]
    
    # Reshape
    x_train_T = x_train_T.reshape(-1, 1)
    x_val_T = x_val_T.reshape(-1, 1)
    
    # Train the model
    estimator_T = MultinomialNB()
    param_grid_T = {'alpha': np.arange(0.001, 1, 0.001), 'fit_prior': [True, False]}
    grid_search_T = GridSearchCV(estimator=estimator_T, param_grid=param_grid_T, cv=10, n_jobs=-1)
    grid_search_T.fit(x_train_T, y_train_T)
    # GridSearchCV has already refit the best setting on this fold's training data
    best_estimator_T = grid_search_T.best_estimator_
    
    # Evaluate the model
    score_T = best_estimator_T.score(x_val_T, y_val_T)
    scoreList_T.append(score_T)

print('Mean accuracy:', np.mean(scoreList_T))

Mean accuracy: 0.6192219360702924


#### 2.3.2 Predict the probability of test data

In [31]:
# Prediction (uses the estimator fitted in the last CV fold; the numeric attribute needs no vectorizer)
x_test_T = x_test_T.reshape(-1, 1)
y_test_T = best_estimator_T.predict(x_test_T)
proba_test_T = best_estimator_T.predict_proba(x_test_T)

#### 2.3.3 Combine the results of 3 models

In [32]:
# Class priors p(y), in the order of classes_ (Nightlife, Restaurants, Shopping)
prior = np.array([312, 1779, 782]) / 2873
# With three conditionally independent attributes, p(y | review, name, time) is
# proportional to p(y | review) * p(y | name) * p(y | time) / p(y)^2
p_R_N_T = (proba_test_R * proba_test_N * proba_test_T) / prior**2

In [33]:
# Map each row's highest-scoring column to its label; classes_ is sorted alphabetically
# (Nightlife, Restaurants, Shopping), the same order as the class priors
l_R_N_T = best_estimator_R.classes_[np.argmax(p_R_N_T, axis=1)]

#### 2.3.4 Output result

In [34]:
# Output
result_R_N_T = zip((test['ID']), l_R_N_T)
output_R_N_T = pd.DataFrame(data=result_R_N_T, columns=['ID', 'category'])
output_R_N_T.to_csv('predict_R_N_T.csv', index=False)

#### 2.3.5 Summary

***We train the model on the "time" attribute only, with cross-validation, and the validation accuracy is very low (0.61922), which is about the share of the majority class "Restaurants" in the training set (1779/2873).</br>
After we combine it with the probabilities behind 'predict_R_N.csv', the accuracy of the prediction results does not change.</br>
We submit the prediction result 'predict_R_N_T.csv' to Kaggle and get a Public Score of 0.90462, the same as 'predict_R_N.csv' and higher than the baseline of 0.87283.***