**NLP With Hotel Review Part 2**<br>
Author: Sourish Dasgupta <br>
Date : July-18,2022<br>

Modeling

1. **Employ a linear classifier on this dataset:**
    - Fit a logisitic regression model to this data with the solver set to `lbfgs`. 
    - What is the accuracy score on the test set?
    - What are the 20 words most predictive of a good review (from the positive review column)? What are the 20 words most predictive with a bad review (from the negative review column)? Use the regression coefficients to answer this question
    - Reduce the dimensionality of the dataset using PCA, what is the relationship between the number of dimensions and run-time for a logistic regression?
    - List one advantage and one disadvantage of dimensionality reduction <br>

2. **Employ a K-Nearest Neighbour classifier on this dataset:**
    - Fit a KNN model to this data. What is the accuracy score on the test set?
    - KNN is a computationally expensive model. Reduce the number of observations (data points) in the dataset.
    - What is the relationship between the number of observations and run-time for KNN?
    - List one advantage and one disadvantage of reducing the number of observations.
    - Use the dataset to find an optimal value for K in the KNN algorithm. You will need to split your dataset into train and validation sets.
    - What is the issue with splitting the data into train and validation sets after performing vectorization?<br>
   
3. **Employ a Decision Tree classifier on this dataset:**
    - Fit a decision tree model to this data. What is the accuracy score on the test set?
    - Use the data set (or a subsample) to find an optimal value for the maximum depth of the decision tree. You will need to split your data set into train and validation.
    - Provide two advantages of decision trees over KNN. Provide two weaknesses of decision trees (classification or regression trees)<br>
4. **What is the purpose of the validation set, i.e., how is it different than the test set?**<br>

5. **Re-run a decision tree or logistic regression on the data again:**

    - Perform a 5-fold cross validation to optimize the hyperparameters of your model.
    - What does your confusion matrix look like for your best model on the test set?<br>

6. **Create one new feature of your choice:**
    - Explain your new feature and why you consider it will improve accuracy.
    - Run the model from question 5 again. You will have to re-optimize your hyperparameters. 
    - Has the accuracy score of your best model improved on the test set after adding the new feature you created?


### Import Library

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
import string

# sklearn 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Handling sparse matrix
import scipy.sparse    

# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package wordnet to /Users/apple/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/apple/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/apple/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Import the data

#### Train Data

In [5]:
train_data= pd.read_csv("clean_data/clean_train_dataframe.csv")
train_data.head()

Unnamed: 0,Additional_Number_of_Scoring,Average_Score,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,days_since_review,lat,lng,weekday_of_review,...,n_worry,n_worth,n_would,n_write,n_wrong,n_year,n_yes,n_yet,n_young,rating
0,620,9.0,0,1974,164,1,562,51.506558,-0.004514,1,...,0,0,0,0,0,0,0,0,0,1
1,1258,9.4,6,4204,4,5,276,51.502435,-0.00025,0,...,0,0,0,0,0,0,0,0,0,1
2,995,8.1,2,3826,38,1,129,51.504348,-0.033444,0,...,0,0,0,0,0,0,0,0,0,1
3,853,8.4,7,2726,10,10,164,51.507377,0.038657,0,...,0,0,0,0,0,0,0,0,0,0
4,1243,8.1,11,6608,8,69,639,51.513556,-0.180002,1,...,0,0,0,0,0,0,0,0,0,0


#### Test Data

In [4]:
test_data= pd.read_csv("clean_data/clean_train_dataframe.csv")
test_data.head()

Unnamed: 0,Additional_Number_of_Scoring,Average_Score,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,days_since_review,lat,lng,weekday_of_review,...,n_worry,n_worth,n_would,n_write,n_wrong,n_year,n_yes,n_yet,n_young,rating
0,620,9.0,0,1974,164,1,562,51.506558,-0.004514,1,...,0,0,0,0,0,0,0,0,0,1
1,1258,9.4,6,4204,4,5,276,51.502435,-0.00025,0,...,0,0,0,0,0,0,0,0,0,1
2,995,8.1,2,3826,38,1,129,51.504348,-0.033444,0,...,0,0,0,0,0,0,0,0,0,1
3,853,8.4,7,2726,10,10,164,51.507377,0.038657,0,...,0,0,0,0,0,0,0,0,0,0
4,1243,8.1,11,6608,8,69,639,51.513556,-0.180002,1,...,0,0,0,0,0,0,0,0,0,0
