# Machine learning - sprint 2

Authors: Allart Ewoud, Van Hees Maxime, Breda Bram

## Introduction

The main focus of this paper is making predictions based on the images of the restaurant listings. We see a lot of opportunities where this can come in handy, such as:
- selecting the best picture from a listing (the one for which the model predicts the highest rating)
- creating a model that can predict the cuisine types of a restaurant based on the images that are available for that restaurant

## Importing packages

To start off we're importing all the packages.

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from PIL import Image, ImageOps
import glob
import PIL
import cv2

In [None]:
original_df = pd.read_csv("tripadvisor_dataset/restaurant_listings.csv")

# show all columns when displaying the data, to see how it is structured
pd.set_option("display.max_columns", None)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = original_df.select_dtypes(include=numerics)

## Preprocessing

The main focus of this assignment is making predictions based on pictures, so we start off by creating a dataframe that includes our pictures. This step includes joining the pictures together with the other data from sprint 1. We continue by extracting features from the pictures with the HOG (histogram of oriented gradients) method.

Finally, we reduce the dimensionality of the features with PCA.

In [None]:
# general rating
df = df.drop(columns=["atmosphere rating"])

df["general rating"] = original_df["general rating"].apply(lambda x: float(str(x).split(' ')[0]))
df["general rating"] = pd.to_numeric(df["general rating"])
mean = df["general rating"].loc[df["general rating"] != -1].mean()
df["general rating"] = df["general rating"].replace(-1,mean)

df.head()

### Joining pixels to data from sprint 1

This step consists of creating a dataframe with the pixels of every image. During this step we noticed that some pictures were invalid: they couldn't be opened. These are skipped. Finally, we join the dataframe with the images together with the dataframe of sprint 1.

In [None]:
# Getting list of all images
fileNameList = glob.glob("tripadvisor_dataset/tripadvisor_images/*.jpg")
images = []

# putting all images in a dict where the key is the name of the picture
for fileName in tqdm(fileNameList, total=len(fileNameList)):
    try:
        img = Image.open(f"{fileName}")
        img = img.resize((128,128))
        img_np = np.array(img).flatten()
        # use only the file name (works with both "/" and "\\" path separators); the id is the part before the first "_"
        images.append(pd.Series(data=[fileName.replace('\\', '/').split('/')[-1].split("_")[0], img_np]))
    except PIL.UnidentifiedImageError:
        pass

# changing the list of pd.Series to pandas dataframe
images = pd.concat(images, axis=1).T
images = images.rename(columns={0 : "id", 1 : "pixels"})
images["id"] = pd.to_numeric(images["id"])

# merging 2 dataframes
df = pd.merge(df, images, on="id")

df.head()
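
The id extraction in the loop above depends on how the path separator is handled. A minimal sketch of a separator-independent variant, assuming (as the loop does) that file names have the form `<id>_<suffix>.jpg`; the helper name `restaurant_id_from_path` is hypothetical:

```python
from pathlib import PureWindowsPath


def restaurant_id_from_path(file_name: str) -> int:
    """Return the restaurant id encoded at the start of an image file name."""
    # PureWindowsPath accepts both "/" and "\\" as separators,
    # so the same code handles paths produced on Windows, Linux and macOS.
    base_name = PureWindowsPath(file_name).name
    # Assumption: the part before the first underscore is the numeric id.
    return int(base_name.split("_")[0])


print(restaurant_id_from_path("tripadvisor_dataset/tripadvisor_images/123_4.jpg"))
print(restaurant_id_from_path("tripadvisor_dataset\\tripadvisor_images\\123_4.jpg"))
```

Returning an integer directly would also make the later `pd.to_numeric` conversion of the `id` column unnecessary.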

### HOG feature extraction

In [None]:
from skimage.feature import hog
from skimage import exposure
import swifter

tqdm.pandas()

def hog_transformer(img):
    if len(img) != 49152:
        return np.empty(0)
    
    img = img.reshape(128,128,3)
    
    # only the feature descriptor is needed, so the HOG visualization is not computed
    fd = hog(img, orientations=8, pixels_per_cell=(16, 16),
             cells_per_block=(1, 1), channel_axis=-1)
    return np.array(fd).flatten()

# Use swifter to parallelize the apply, if possible
df['hog'] = df['pixels'].swifter.apply(hog_transformer)
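
With the parameters above, every image is reduced to a fixed-length descriptor. As a rough check, that length can be computed by hand. The sketch below assumes the usual HOG layout (one histogram per cell, blocks sliding over the cell grid one cell at a time); `hog_descriptor_length` is a hypothetical helper, not part of scikit-image:

```python
def hog_descriptor_length(image_shape, pixels_per_cell, cells_per_block, orientations):
    """Number of values in a flattened HOG descriptor for a 2-D image grid."""
    # Number of whole cells that fit along each axis.
    cells_r = image_shape[0] // pixels_per_cell[0]
    cells_c = image_shape[1] // pixels_per_cell[1]
    # Blocks slide over the cell grid one cell at a time.
    blocks_r = cells_r - cells_per_block[0] + 1
    blocks_c = cells_c - cells_per_block[1] + 1
    return blocks_r * blocks_c * cells_per_block[0] * cells_per_block[1] * orientations


# Parameters used in hog_transformer above: 8 * 8 blocks, 1 cell each, 8 orientations.
print(hog_descriptor_length((128, 128), (16, 16), (1, 1), 8))
```

Each image is therefore described by 512 values, which is also the upper bound on the number of PCA components that can be extracted from these features.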

### Cleaning up the images that are too small

We noticed that some pictures are too small to resize to the 128 by 128 format, so we excluded every image whose flattened pixel array does not have 128 × 128 × 3 = 49152 values.

In [None]:
df["pixel_size"] = df["pixels"].apply(lambda x: len(x))
df = df.drop(df[df.pixel_size != 49152].index)
df.head()
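
The value 49152 used in the filter is the number of values in a resized RGB image. A small sketch of that arithmetic follows. That grayscale or RGBA files are among the excluded images is an assumption, not something verified on this dataset, and `flattened_length` is a hypothetical helper:

```python
# Channels per pixel for a few common Pillow image modes.
CHANNELS_PER_MODE = {"L": 1, "RGB": 3, "RGBA": 4}


def flattened_length(width, height, mode):
    """Length of np.array(img).flatten() for an image of the given size and mode."""
    return width * height * CHANNELS_PER_MODE[mode]


for mode in CHANNELS_PER_MODE:
    print(mode, flattened_length(128, 128, mode))
```

If non-RGB images turn out to be the cause, calling `img.convert("RGB")` before resizing might keep them instead of dropping them.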

The last step of preprocessing is splitting the data into a train and a test set.

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, random_state=0, train_size = 0.8)

## Case 1

For the first case we are going to look at **whether we can predict scores based on the images of the listings**. One of the first uses for this is choosing the best picture of a listing. Another is determining whether the pictures correspond to the score, which can, for example, be used to tell restaurants whether they have good or bad pictures.

The prediction will be based on regression, and the predicted score can act as a kind of metric that tells how well the photos of a restaurant score compared to those of the other restaurants.

**We asked the teacher about this and it was considered an interesting idea; a good extension would be to look at it in combination with the price class or price range.**
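
The "best picture" use case can be sketched as follows, assuming a trained model has already produced one predicted rating per image. The helper `best_image_per_listing` and the example values are hypothetical:

```python
def best_image_per_listing(predictions):
    """Pick, for each restaurant id, the image with the highest predicted rating.

    `predictions` is an iterable of (restaurant_id, image_name, predicted_rating).
    """
    best = {}
    for restaurant_id, image_name, predicted_rating in predictions:
        current = best.get(restaurant_id)
        if current is None or predicted_rating > current[1]:
            best[restaurant_id] = (image_name, predicted_rating)
    return {restaurant_id: image for restaurant_id, (image, _) in best.items()}


# Made-up predictions, for illustration only.
example = [(1, "1_0.jpg", 3.5), (1, "1_1.jpg", 4.2), (2, "2_0.jpg", 2.0)]
print(best_image_per_listing(example))
```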

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import time

# TODO: try to add this step to the dataframe
# extract the HOGs from the dataframe for the scaler
train_images_hogs = np.stack(df_train['hog'].values)
test_images_hogs = np.stack(df_test['hog'].values)

# fit the scaler on the training set only and reuse it for the test set
scaler = StandardScaler()
train_hogs_scaled = scaler.fit_transform(train_images_hogs)
test_hogs_scaled = scaler.transform(test_images_hogs)

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# find the right number of PCA components from the plot
pca = PCA()
data_reduced = pca.fit_transform(train_hogs_scaled)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
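
Reading the number of components off the plot can also be done programmatically. A minimal sketch, assuming a target share of explained variance is chosen up front; the helper `components_for_variance`, the thresholds and the example ratios are hypothetical:

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    for n_components, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return n_components
    # The threshold is never reached, so all components are needed.
    return len(explained_variance_ratio)


# Made-up ratios; in the notebook this would be pca.explained_variance_ratio_.
print(components_for_variance([0.5, 0.3, 0.15, 0.05], threshold=0.9))
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components`, which should perform a comparable selection internally.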

TODO: build the model and improve it where possible. Normally the HOGs can be used as x_train.

In [None]:
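# we chose 350 as n_components
pca = PCA(n_components = 350)
# the transformed data are our features; the PCA is fitted on the training set only
x_train = pca.fit_transform(np.stack(df_train['hog'].values))
y_train = df_train['general rating'].values

# reuse the PCA fitted on the training set, and take the targets from the test set
x_test = pca.transform(np.stack(df_test['hog'].values))
y_test = df_test['general rating'].values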


In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
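
Once the regression is fitted, its predictions on the test set need a yardstick. One possible sketch, not part of the original notebook, compares the root mean squared error (RMSE) of a model with that of always predicting the mean training rating. The helper names and the example ratings are hypothetical:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by always predicting the mean training rating."""
    mean_rating = sum(y_train) / len(y_train)
    return rmse(y_test, [mean_rating] * len(y_test))


# Made-up ratings, for illustration only.
print(rmse([4.0, 3.5, 5.0], [4.5, 3.5, 4.0]))
print(mean_baseline_rmse([4.0, 3.0, 5.0], [4.0, 4.5]))
```

A model whose RMSE is not clearly below the baseline has probably learned little from the images.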


## Case 2

The goal is to create a model that can predict the cuisine types of a restaurant based on the images that are available for that restaurant.

In Sprint 1 we discovered that the tags column sometimes holds cuisine types that are not listed in the cuisines column. Because there were only a few such cuisines, and none of them appeared more than 5 times, we don't need to add them to our dataframe: with so little data to learn from, it is not possible to train a good ML model.

In [None]:
# first, we need to create a dataframe that contains the id of a restaurant and one of the cuisines on each row
# (.copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below)
df_with_cuisines = original_df[['id', 'cuisines']].copy()
df_with_cuisines['cuisines'] = df_with_cuisines['cuisines'].apply(lambda x: str(x).split(','))
df_with_cuisines = df_with_cuisines.explode('cuisines', ignore_index=True)
df_with_cuisines = pd.merge(df_with_cuisines, df, on="id")
# strip the whitespace that remains around each cuisine after splitting on ','
df_with_cuisines['cuisine'] = df_with_cuisines['cuisines'].apply(lambda x: x.strip())
df_with_cuisines.drop(columns=['cuisines'], inplace=True)
cuisine_counts = dict(df_with_cuisines['cuisine'].value_counts())

# we only keep the cuisines that occur more than 100 times, to be able to train a decent model on them
# a Google search taught us that 100 is a good number to start with
cuisines_subset = { key: value for (key,value) in cuisine_counts.items() if value > 100 }

# now we replace all values that are not in the cuisines_subset with 'Other'
df_with_cuisines['cuisine'] = df_with_cuisines['cuisine'].apply(lambda x: x if x in cuisines_subset else 'Other')


print("Value counts: \n") 
print(df_with_cuisines['cuisine'].value_counts())

print('\nNumber of distinct cuisines: ', df_with_cuisines['cuisine'].nunique())
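
The grouping of rare cuisines can be illustrated without pandas. The sketch below follows the same rule as the cell above (keep a label only if it occurs more than a minimum number of times); the helper `group_rare_labels` and the example labels are hypothetical:

```python
from collections import Counter


def group_rare_labels(labels, min_count=100, other_label="Other"):
    """Replace every label that occurs at most `min_count` times by `other_label`."""
    counts = Counter(labels)
    return [label if counts[label] > min_count else other_label for label in labels]


# Made-up labels with a small threshold, for illustration only.
labels = ["Italian", "Italian", "Italian", "Thai", "Belgian", "Belgian"]
print(group_rare_labels(labels, min_count=2))
```

Note that the comparison is strict, so a cuisine that occurs exactly `min_count` times is still grouped under `Other`.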


We will now split the data into train and test sets, each with input and output data.

In [None]:
x2_train, x2_test, y2_train, y2_test = train_test_split(df_with_cuisines['hog'], df_with_cuisines['cuisine'], test_size=0.2, random_state=0)

Now, we can train and test different machine learning models. 
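
Because rare cuisines were merged into `Other`, the classes are probably imbalanced, so raw accuracy can look better than it is. A minimal sketch of a majority-class baseline to compare models against follows. This is a suggested sanity check rather than part of the original notebook, and `majority_baseline_accuracy` is a hypothetical helper:

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Made-up labels, for illustration only.
print(majority_baseline_accuracy(
    ["Italian", "Italian", "Other"],
    ["Italian", "Other", "Other", "Italian"],
))
```

In the notebook the arguments would be `y2_train` and `y2_test`; a classifier is only useful if it clearly beats this number.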