### Logistic Regression

In this notebook we're using logistic regression to attempt to classify the hotel images by hotel ID.

The benefit of choosing Logistic Regression is that it can handle very large amounts of data. Unlike with the Convolutional Neural Networks, we are able to work with the entire dataset in a (relatively!) reasonable amount of time.

Though time constraints limited the amount of tuning we could do on this model, the out-of-the-box model performed almost as well as the CNNs that went through hundreds of epochs of training.

Hyperparameters used:

- No regularization
- Tolerance: 0.1
- Solver: saga
- Multi_class: multinomial

#### This notebook was made on Google Colab

Some file paths and cells are specific to Google Colab and are only needed when running there.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from PIL import Image
from sklearn.dummy import DummyClassifier

In [2]:
!mkdir kaggle_dataset

In [3]:
os.environ['KAGGLE_CONFIG_DIR'] = '/content/kaggle_dataset'

In [None]:
!pip uninstall -y kaggle
!pip install --upgrade pip
!pip install kaggle==1.5.6

### Kaggle.json

For the next cell to run, upload your Kaggle API token (kaggle.json) to the kaggle_dataset directory. If you don't have one, you can download it from your Kaggle account page.

In [5]:
!chmod 600 /content/kaggle_dataset/kaggle.json

### Fetch data from Kaggle

<p style="color: red;">Warning: the next couple of cells take a very long time to run.</p>

In [None]:
!kaggle competitions download -c hotel-id-2021-fgvc8

In [None]:
!unzip hotel-id-2021-fgvc8.zip -d /content/kaggle_dataset/hotels/

In [None]:
train_df = pd.read_csv('/content/kaggle_dataset/hotels/train.csv')
train_df = train_df.drop_duplicates(subset=['image'], keep='first')

### Create training and validation data

We call the image arrays `X` and the hotel ID labels `y`.

In [None]:
def get_X_and_y():
  X = []
  y = []

  # look up each image's hotel ID once, instead of filtering the DataFrame per image
  hotel_ids = dict(zip(train_df['image'], train_df['hotel_id'].astype(str)))

  # the data comes in folders grouping images by their hotel chain
  # loop through all the chains 0-92 (range excludes its stop value, hence 93)
  folders = [str(chain) for chain in range(0, 93)]
  base_dir = "/content/kaggle_dataset/hotels/train_images/"
  for folder in folders:
    chain_path = base_dir + folder + '/'
    # we have to check that the chain exists, since not all numbers 0-92 are used
    if os.path.exists(chain_path):
      # loop through all the files in the chain folder
      for image_file in os.listdir(chain_path):
        if image_file.endswith('.jpg'):
          # based on the filename, we can look up the hotel ID from the csv and append it to y
          y.append(hotel_ids[image_file])
          full_path = chain_path + image_file
          # Use PIL to open the image, force 3 RGB channels, resize it, and scale it to [0, 1]. Then append to X
          img = np.array(Image.open(full_path).convert('RGB').resize((56,56))).astype('float32')/255
          X.append(img)
  return (X, y)

In [167]:
X, y = get_X_and_y()

array([0.45490196, 0.35686275, 0.28627452, ..., 0.18431373, 0.1254902 ,
       0.10196079], dtype=float32)

In [67]:
X_np = np.asarray(X)

(25764, 56, 56, 3)

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X_np, y, random_state=42, train_size=0.8, test_size=0.2)

In [70]:
X_train = X_train.reshape(X_train.shape[0],56*56*3)
X_test = X_test.reshape(X_test.shape[0],56*56*3)

In [71]:
# penalty='none' turns off regularization; newer scikit-learn releases spell this penalty=None
logreg = LogisticRegression(penalty='none',
                            tol=0.1, solver='saga',
                            multi_class='multinomial').fit(X_train, y_train)

### Missing output:

I lost the output of the following cell. To get it back I'd have to rerun it on Google Colab, which is very busy running fits, so I'll just write the results here:

    Training score: 0.09621076124399593 accuracy
    Testing score: 0.01785367746943528 accuracy

In [None]:
logreg.score(X_train, y_train), logreg.score(X_test, y_test)

In [73]:
logreg.predict(X_test)

array(['17759', '37003', '50666', ..., '34668', '3659', '16161'],
      dtype='<U5')

In [74]:
logreg.predict_proba(X_test)

array([[0.00031559, 0.00154734, 0.00253714, ..., 0.00025764, 0.00064314,
        0.00212435],
       [0.00101512, 0.00310091, 0.00137814, ..., 0.00064141, 0.01830098,
        0.00120207],
       [0.00066789, 0.00329392, 0.00324878, ..., 0.00243235, 0.00166146,
        0.00304914],
       ...,
       [0.00046179, 0.00170852, 0.00105208, ..., 0.00029902, 0.00057317,
        0.00185655],
       [0.00121953, 0.0067286 , 0.00497844, ..., 0.00052207, 0.00372806,
        0.00078459],
       [0.00124133, 0.00392612, 0.00542996, ..., 0.00254608, 0.00112786,
        0.00158214]], dtype=float32)

### Predictions

For this task, success is not measured by plain prediction accuracy.

Instead, the goal is to narrow down a search by providing a list of hotels the image likely came from.

The hotel the model considers most likely should be listed first, followed by the others in descending order of likelihood.
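The notebook does not name the exact metric used to score these ranked lists. One common choice for this kind of output is mean average precision at k (MAP@k), which is assumed here purely for illustration. The sketch below uses a hypothetical helper, `mean_average_precision_at_k`, that rewards a correct hotel more the earlier it appears in the list:

```python
import numpy as np

def mean_average_precision_at_k(prob, classes, y_true, k=5):
    """MAP@k when each row has exactly one true label.

    A row scores 1/rank if its true label is among the top k
    predictions, and 0 otherwise.
    """
    prob = np.asarray(prob)
    classes = np.asarray(classes)
    # column indices of the k largest probabilities per row, best first
    top_k = np.argsort(-prob, axis=1)[:, :k]
    ranked_labels = classes[top_k]
    scores = []
    for labels, truth in zip(ranked_labels, y_true):
        hits = np.flatnonzero(labels == truth)
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))

# toy data: three images, three possible hotels (made-up values)
toy_classes = np.array(['a', 'b', 'c'])
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.3, 0.6],
                     [0.2, 0.5, 0.3]])
toy_truth = ['a', 'b', 'a']
toy_score = mean_average_precision_at_k(toy_prob, toy_classes, toy_truth, k=2)
```

On the held-out split, the same helper could be called as `mean_average_precision_at_k(logreg.predict_proba(X_test), logreg.classes_, y_test)`.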

The function below returns the model's top n most likely hotels for each test image.

In [145]:

def get_top_predictions(model, n=5, dir="/content/kaggle_dataset/hotels/test_images/"):

  df = pd.DataFrame()
  X = []
  for image_file in os.listdir(dir):
    if image_file.endswith('.jpg'):
      full_path = dir + image_file
      # same preprocessing as the training images: RGB, 56x56, scaled to [0, 1]
      img = np.array(Image.open(full_path).convert('RGB').resize((56,56))).astype('float32')/255
      X.append(img)
  X_np = np.asarray(X)
  X = X_np.reshape(X_np.shape[0],56*56*3)
  prob = model.predict_proba(X)
  # use a separate row index so the loop does not overwrite the n parameter
  for row in range(len(prob)):
    # get indices of the top n probabilities predicted by the model
    ind = np.argpartition(prob[row], -1*n)[-1*n:]
    # get said top n probabilities
    conf = [prob[row][i] for i in ind]
    # get the hotel ids associated with the top probabilities
    tops = [model.classes_[i] for i in ind]
    # zip the probabilities with their hotel_id
    conf_t = list(zip(conf, tops))
    # sort the zipped pairs by probability in descending order
    conf_t.sort(key=lambda tup: tup[0], reverse=True)
    # get just the top n hotel predictions
    sorted_preds = [t for _, t in conf_t]
    df[row] = sorted_preds
  # return a transposed dataframe containing the top n likely hotels
  return df.T
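The top-n selection step is easier to follow in isolation on a single toy probability row. This is a minimal sketch, assuming a hypothetical helper `top_n_labels` and made-up hotel IDs, neither of which is part of the notebook:

```python
import numpy as np

def top_n_labels(prob_row, classes, n=5):
    """Return the n labels with the highest probability, best first."""
    prob_row = np.asarray(prob_row)
    n = min(n, prob_row.size)
    # argpartition finds the n largest entries without fully sorting the row;
    # the n indices it returns are in no particular order
    candidates = np.argpartition(prob_row, -n)[-n:]
    # order just those n candidates by descending probability
    ordered = candidates[np.argsort(-prob_row[candidates])]
    return [classes[i] for i in ordered]

toy_classes = ['101', '202', '303', '404']
top_two = top_n_labels([0.1, 0.6, 0.05, 0.25], toy_classes, n=2)
```

Partitioning first and sorting only the n survivors avoids sorting every class probability for every image, which matters when there are thousands of hotel IDs.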
    

In [None]:
kaggle_sub = get_top_predictions(logreg)

### Meanwhile, a Dummy Classifier

Let's check how our results compare to a dummy classifier.

Again, I lost the output of this cell. I was juggling my local machine and Google Colab in an effort to multitask, and I periodically got kicked off Google Colab and had to restart my session. So I'll put the results from the following cell here:

    Training score: 0.003735869196060356 accuracy
    Testing score: 0.0034931108092373375 accuracy

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_train, y_train), dummy_clf.score(X_test, y_test)
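A `most_frequent` dummy classifier always predicts the most common training label, so its training accuracy is simply that label's share of the training set. The sketch below shows this on toy labels; the helper name `most_frequent_baseline_accuracy` is hypothetical and the labels are made up:

```python
from collections import Counter

def most_frequent_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    # most_common(1) returns a list holding one (label, count) pair
    top_count = counts.most_common(1)[0][1]
    return top_count / len(labels)

toy_labels = ['a', 'a', 'b', 'c']
toy_baseline = most_frequent_baseline_accuracy(toy_labels)
```

With thousands of hotel IDs and only a few images per hotel, this share is tiny, which is why the dummy scores above sit well under 1%.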

### For the Kaggle submission

Kaggle wants a CSV with two columns: `image` (the image's filename) and `hotel_id` (the top hotel IDs, separated by spaces).

In [159]:
kaggle_sub['image'] = ['99e91ad5f2870678.jpg', 'b5cc62ab665591a9.jpg', 'd5664a972d5a644b.jpg']
kaggle_sub['hotel_id'] = kaggle_sub[0] + ' ' + kaggle_sub[1] + ' ' + kaggle_sub[2] + ' ' + kaggle_sub[3] + ' ' + kaggle_sub[4]
kaggle_sub.drop(columns=[0,1,2,3,4], inplace=True)
kaggle_sub

| | image | hotel_id |
|---|---|---|
| 0 | 99e91ad5f2870678.jpg | 16124 30410 15433 7833 16505 |
| 1 | b5cc62ab665591a9.jpg | 3637 1460 45248 41325 28427 |
| 2 | d5664a972d5a644b.jpg | 3890 16137 39435 9520 47276 |
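The prediction columns can also be joined without naming each one, which keeps working if n changes. This is a minimal sketch, assuming a hypothetical helper `build_submission` and that the filenames are passed in the same order as the prediction rows:

```python
import pandas as pd

def build_submission(preds, image_names):
    """Collapse ranked prediction columns into one space-separated column.

    preds: one row per image, columns ordered from most to least likely.
    image_names: filenames in the same order as the rows of preds.
    """
    # join every column of each row with a single space
    hotel_ids = preds.astype(str).agg(' '.join, axis=1)
    return pd.DataFrame({'image': list(image_names),
                         'hotel_id': hotel_ids.to_numpy()})

# toy predictions for two images, two ranked guesses each
toy_preds = pd.DataFrame([['16124', '30410'], ['3637', '1460']])
toy_sub = build_submission(toy_preds, ['a.jpg', 'b.jpg'])
```

Passing the filenames in explicitly, rather than typing them by hand, avoids mismatches if the directory listing order differs between runs.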


In [160]:
kaggle_sub.to_csv('/content/drive/MyDrive/hotels/kaggle_sub1.csv', index=False) 

### Surprise

Turns out you can't just submit a CSV to the Kaggle competition.

You need to submit your notebook along with your model. Kaggle then runs it on new, unseen data and evaluates whatever output file the notebook produces.