1. We first split the target and features from the train and test datasets.
2. Transform the following features:
 camis: We drop this feature because it is a unique identifier.
 dba: We drop 'dba' because we expect the words in a restaurant's name to be unrelated to its grade.
 boro: We use one-hot encoding (OHE) on the restaurant region (borough), which is a categorical variable.
 zipcode: Many restaurants share the same zipcode, so we treat it as categorical and apply OHE, setting max_categories to keep only the 20 most frequent values.
 cuisine_description: We use OHE on the descriptions, which contain few distinct words and act as a categorical variable.
 inspection_date: We assume that the date of the inspection is unrelated to how restaurants are graded, so we drop the 'inspection_date' feature.
 action: We use OHE on this categorical variable.
 violation_code: We use OHE on this categorical variable.
 violation_description: We use a bag-of-words representation of the text with CountVectorizer().
 critical_flag: We use OHE on this categorical variable.
 score: Since 'score' is the only numeric feature, we do not apply any transformation to it (no scaling).
 inspection_type: We drop the 'inspection_type' feature because we do not expect it to relate to the grading target.
3. Perform cross-validation on dummy, logreg and svm:
- Both logreg and svm score better than the baseline (dummy) model
- logreg has the highest score, so we choose logreg
(insert image here with comparisons)
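The comparison in step 3 could look roughly like the sketch below. It is illustrative only: it uses synthetic data from `make_classification` in place of the restaurant data, and it uses the default accuracy scoring because the notes do not say which metric was used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

# Collect the mean of each cross-validation metric per model.
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    results[name] = pd.DataFrame(scores).mean()

summary = pd.DataFrame(results)
print(summary)
```

Each column of `summary` holds one model, so the `test_score` row gives a side-by-side view of the baseline against the two candidates.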
4. Fit the logreg model to the training data to get the size of the CountVectorizer vocabulary. The vocabulary size was found to be 335.
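One way to read the vocabulary size off a fitted pipeline is sketched below. The toy data and the two-column pipeline are assumptions made for illustration; the step names `columntransformer` and `countvectorizer` are the ones `make_pipeline` and `make_column_transformer` generate automatically, and they match the parameter names reported in step 5.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy rows standing in for the training data.
toy = pd.DataFrame({
    "violation_description": ["evidence of mice", "food not protected",
                              "evidence of flies", "food not protected"],
    "score": [30, 10, 28, 12],
})
labels = ["F", "A", "F", "A"]

pipe = make_pipeline(
    make_column_transformer(
        (CountVectorizer(), "violation_description"),
        ("passthrough", ["score"]),
    ),
    LogisticRegression(max_iter=1000),
)
pipe.fit(toy, labels)

# Reach into the fitted pipeline and count the learned vocabulary terms.
fitted_vec = pipe.named_steps["columntransformer"].named_transformers_["countvectorizer"]
vocab_size = len(fitted_vec.vocabulary_)
print(vocab_size)
```

The resulting number is a natural upper bound when searching over the vectorizer's `max_features`.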
5. Perform hyperparameter tuning using RandomizedSearchCV to find the optimum parameters for training the logistic regression model.
- Keeping n_iter at 10 because the search takes too long to run with a higher value and ultimately crashes. Since this did not noticeably compromise accuracy, we use 10 iterations in the random search.
- When using randomized search to arrive at the optimum hyperparameters, the solver failed to converge. This happens when the solver reaches its iteration limit (max_iter) before the optimization meets its stopping tolerance, so the returned coefficients may not be optimal. The warning we get when this happens:
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
To solve this, some of the methods listed below could work (REFERENCE: https://stackoverflow.com/questions/62658215/convergencewarning-lbfgs-failed-to-converge-status-1-stop-total-no-of-iter):
    - Increase the number of iterations
    - Try a different optimizer
    - Scale your data
    - Add engineered features
    - Data pre-processing
    - Add more data
We employed the first two solutions mentioned above: we increased max_iter to 2500 and used a different optimizer. To select a suitable solver, we went through the scikit-learn documentation (https://scikit-learn.org/stable/modules/linear_model.html#solvers), which shows that 'lbfgs' is the default solver in scikit-learn. It can be slow to converge on large datasets, and with around 150,000 training points our dataset is quite large. In this case the solver we chose to help convergence is 'sag' (Stochastic Average Gradient descent). It is faster than other solvers for large datasets, when both the number of samples and the number of features are large. The same documentation notes that fast convergence of 'sag' is only guaranteed when the features are on approximately the same scale. With these changes the search converged and returned the optimized hyperparameters.
- Best parameters found to be : {'columntransformer__countvectorizer__max_features': 80,
 'logisticregression__C': 0.6930605534498594,
 'logisticregression__class_weight': 'balanced',
 'logisticregression__solver': 'sag'}
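The search in step 5 can be sketched as follows. Only `n_iter=10`, `max_iter=2500`, and the four parameter names come from the notes above; the toy data, the search distributions, the `cv` value, and the default accuracy scoring are all assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import loguniform, randint
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy rows standing in for the ~150,000 training points.
toy = pd.DataFrame({
    "violation_description": ["evidence of mice", "food not protected",
                              "evidence of flies", "surface not clean"] * 10,
    "score": [30, 10, 28, 12] * 10,
})
labels = ["F", "A", "F", "A"] * 10

pipe = make_pipeline(
    make_column_transformer(
        (CountVectorizer(), "violation_description"),
        ("passthrough", ["score"]),
    ),
    LogisticRegression(max_iter=2500),  # raised iteration limit, as in the notes
)

# Search space: the keys match the reported best parameters,
# the ranges are illustrative assumptions.
param_dist = {
    "columntransformer__countvectorizer__max_features": randint(2, 9),
    "logisticregression__C": loguniform(1e-2, 1e2),
    "logisticregression__class_weight": [None, "balanced"],
    "logisticregression__solver": ["sag", "saga"],
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=10, cv=3, random_state=123, n_jobs=1
)
search.fit(toy, labels)
print(search.best_params_)
```

On the real data the `max_features` range would extend up to the vocabulary size found in step 4.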
6. Perform cross-validation on logreg using the optimum hyperparameters
- Using max_categories of 20 for one-hot encoding to limit the feature space
- performing logistic regression and f
IMAGE : https://github.com/UBC-MDS/newyork_restaurant_grading/blob/model_script_nikita/results/images/comparison_summy_lr_lrbest.png, https://github.com/UBC-MDS/newyork_restaurant_grading/blob/model_script_nikita/results/images/fitted_best_lr_model.png
7. Fitting the best model on training data
8. Scoring the best model on the test data (unseen data) : 0.6491
9. PR curve
IMAGE : https://github.com/UBC-MDS/newyork_restaurant_grading/blob/model_script_nikita/results/logistic_regression_PR_curve.png


In [None]:
# to load the model
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

train_df = pd.read_csv("../data/processed/train_df.csv")
test_df = pd.read_csv("../data/processed/test_df.csv")

# split features and target for train and test data

X_train = train_df.drop(columns=["grade"])
y_train = train_df["grade"]

X_test = test_df.drop(columns=["grade"])
y_test = test_df["grade"]

# load the fitted model from disk; the context manager closes the file handle
with open("../results/finalized_model.sav", "rb") as f:
    loaded_model = pickle.load(f)
# result = loaded_model.score(X_test, y_test)

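print("\nCreating and saving PR curve plot...")
# look up the probability column for the positive class "F" by label
# instead of assuming it is column 1
pos_idx = list(loaded_model.classes_).index("F")
precision, recall, thresholds = precision_recall_curve(
    y_test, loaded_model.predict_proba(X_test)[:, pos_idx], pos_label="F"
)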
plt.plot(precision, recall, label="logistic regression: PR curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
# predictions at the default 0.5 threshold, computed once and reused
y_pred = loaded_model.predict(X_test)
plt.plot(
    precision_score(y_test, y_pred, pos_label="F"),
    recall_score(y_test, y_pred, pos_label="F"),
    "or",
    markersize=10,
    label="threshold 0.5",
)
plt.legend(loc="best")
plt.savefig("../results/logistic_regression_PR_curve.png")