# Machine learning - sprint 2

Authors: Allart Ewoud, Van Hees Maxime, Breda Bram

## Introduction

This paper focuses on making predictions based on the images of the restaurant listings. We see a lot of opportunities where this can come in handy, such as:
- selecting the best picture from a listing (the one for which the model predicts the highest rating)
- ...

## Importing packages

To start off, we import all the packages.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from PIL import Image
import glob
import PIL

In [2]:
original_df = pd.read_csv("tripadvisor_dataset/restaurant_listings.csv")

# show all columns when displaying the data, to see how it is structured
pd.set_option("display.max_columns", None)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = original_df.select_dtypes(include=numerics)

## Case 1

For the first case we are going to look at **whether we can predict scores based on the images of the listings**. One use for this is choosing the best picture of a listing. Another is determining whether the pictures correspond to the score, which can for example be used to tell restaurants whether their pictures are good or bad.
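As a rough sketch of the "best picture" use case: once a model assigns a predicted rating to each image, the best picture per listing is simply the image with the highest prediction. The data below is made up for illustration; the column names `file` and `predicted_rating` and the file names are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical predictions: one row per image, with the rating a model assigned to it.
predictions = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "file": ["1_0.jpg", "1_1.jpg", "1_2.jpg", "2_0.jpg", "2_1.jpg"],
    "predicted_rating": [3.9, 4.6, 4.2, 4.8, 4.1],
})

# For each listing id, find the row index of the image with the highest predicted rating.
best_rows = predictions.groupby("id")["predicted_rating"].idxmax()

# Look up the corresponding file names, indexed by listing id.
best_pictures = predictions.loc[best_rows].set_index("id")["file"]
```

Here `best_pictures` maps each listing id to the file name of its highest-scoring image.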

### Preprocessing

The first step is getting the scores and applying the preprocessing steps.

In [3]:
# drop the atmosphere rating column
df = df.drop(columns=["atmosphere rating"])

# general rating: keep the leading number of each entry as a float
df["general rating"] = original_df["general rating"].apply(lambda x: float(str(x).split(' ')[0]))
# treat -1 as a missing rating and replace it with the mean of the known ratings
mean = df["general rating"].loc[df["general rating"] != -1].mean()
df["general rating"] = df["general rating"].replace(-1, mean)

df.head()

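| Unnamed: 0 | food rating | service rating | value rating | id | general rating |
|---|---|---|---|---|---|
| 0 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 |
| 1 | 5.0 | 5.0 | 4.5 | 740727 | 5.0 |
| 2 | 4.5 | 4.5 | 4.5 | 12188645 | 4.5 |
| 3 | 5.0 | 4.5 | 5.0 | 9710340 | 5.0 |
| 4 | 4.5 | 4.5 | 4.5 | 8298124 | 4.5 |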
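The parsing and imputation in the cell above can be tried on a toy series. This is a minimal sketch: the raw strings are an assumption about what the `general rating` column looks like (a number followed by text, with `-1` standing in for a missing rating), not values taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the real column format is an assumption.
raw = pd.Series(["4.5 of 5 bubbles", "-1", "5.0 of 5 bubbles", "3.0 of 5 bubbles"])

# Keep the leading number of each entry.
ratings = raw.apply(lambda x: float(str(x).split(" ")[0]))

# Average over the known ratings only, then fill in the -1 placeholders.
known_mean = ratings[ratings != -1].mean()
imputed = ratings.replace(-1, known_mean)
```

Computing the mean before replacing keeps the `-1` placeholders from dragging the average down.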


In [4]:
# Getting list of all images
import os

fileNameList = glob.glob("tripadvisor_dataset/tripadvisor_images/*.jpg")
images = []

# collect one Series per image: the listing id (the file-name part before the first "_") and the flattened pixels
for fileName in tqdm(fileNameList, total=len(fileNameList)):
    try:
        # the context manager closes the file once the pixels have been copied
        with Image.open(fileName) as img:
            #TODO resize all the images
            # os.path.basename handles both "/" and "\" path separators
            images.append(pd.Series(data=[os.path.basename(fileName).split("_")[0], np.array(img).flatten()]))
        
    except PIL.UnidentifiedImageError:
        pass

# changing the list of pd.Series to a pandas dataframe
images = pd.concat(images, axis=1).T
images = images.rename(columns={0 : "id", 1 : "pixels"})
images["id"] = pd.to_numeric(images["id"])

# merging the two dataframes on the listing id
df = pd.merge(df, images, on="id")

df.head()

100%|████████████████████████████████████| 15183/15183 [01:17<00:00, 195.16it/s]


| Unnamed: 0 | food rating | service rating | value rating | id | general rating | pixels |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 | [121, 87, 49, 120, 86, 49, 113, 79, 44, 106, 7... |
| 1 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 | [96, 67, 37, 98, 69, 39, 101, 72, 42, 101, 72,... |
| 2 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 | [39, 38, 33, 36, 35, 30, 37, 34, 27, 38, 34, 2... |
| 3 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 | [246, 235, 215, 182, 171, 151, 150, 139, 119, ... |
| 4 | 5.0 | 5.0 | 4.5 | 13969825 | 5.0 | [54, 44, 34, 54, 44, 34, 55, 45, 35, 55, 45, 3... |
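The `#TODO resize all the images` note in the loading cell matters because photos with different dimensions flatten into vectors of different lengths, which cannot be stacked into a single feature matrix. One possible way to handle this is sketched below. The helper `image_to_vector` and the 64x64 `TARGET_SIZE` are hypothetical choices, not something the notebook defines, and the images are synthetic stand-ins.

```python
import numpy as np
from PIL import Image

TARGET_SIZE = (64, 64)  # hypothetical (width, height); not a value from the notebook


def image_to_vector(img, size=TARGET_SIZE):
    """Convert to RGB, resize to a fixed size, and flatten into a 1-D uint8 array."""
    resized = img.convert("RGB").resize(size)
    return np.asarray(resized, dtype=np.uint8).flatten()


# Synthetic stand-ins for listing photos with different sizes and colour modes.
samples = [
    Image.new("RGB", (120, 80), (121, 87, 49)),
    Image.new("L", (50, 200), 128),
]

# Every vector now has the same length, so they stack into one 2-D array.
vectors = np.stack([image_to_vector(img) for img in samples])
```

Converting to RGB first also gives grayscale images three channels, so every vector has `64 * 64 * 3` entries. Resizing to a square ignores the aspect ratio, which may or may not be acceptable for this task.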
