### Content Based Filtering
- Content-Based Filtering (CBF) needs only a single user/input, unlike Collaborative Filtering, which needs many users/inputs to compare against each other.
- The CBF algorithm extracts the information available for each item and scores how similar it is to the rest of the data (*similarity score*). In this case, the input is the name of the hotel being looked up, and the output is a list of hotels similar to it.
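
As a rough illustration of the similarity score idea, the sketch below compares three made-up hotel descriptions using plain word counts and cosine similarity. The descriptions and the helper names `bag_of_words` and `cosine` are assumptions for illustration only; the notebook itself weights words with TF-IDF rather than raw counts.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up descriptions, not taken from the dataset.
docs = {
    "Hotel A": "downtown hotel with spa and fitness center",
    "Hotel B": "downtown hotel with rooftop bar",
    "Hotel C": "quiet lakeside cabin near hiking trails",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}

# Score every other hotel against the query hotel.
query = "Hotel A"
scores = {
    name: cosine(vectors[query], vec)
    for name, vec in vectors.items()
    if name != query
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Hotel B ranks first here because it shares three words with Hotel A, while Hotel C shares none. Only one query item is needed, which is the point made in the first bullet.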

### Import Dataset

In [73]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random

In [74]:
hotel = pd.read_csv('C:\\Users\\andimu064127\\Tugas\\Use Case - Marketing Analysis\\Dataset\\hotels.csv', encoding="latin-1")

In [75]:
hotel.head()

Unnamed: 0,name,address,desc
0,Hilton Garden Seattle Downtown,"1821 Boren Avenue, Seattle Washington 98101 USA","Located on the southern tip of Lake Union, the..."
1,Sheraton Grand Seattle,"1400 6th Avenue, Seattle, Washington 98101 USA","Located in the city's vibrant core, the Sherat..."
2,Crowne Plaza Seattle Downtown,"1113 6th Ave, Seattle, WA 98101","Located in the heart of downtown Seattle, the ..."
3,Kimpton Hotel Monaco Seattle,"1101 4th Ave, Seattle, WA98101",What?s near our hotel downtown Seattle locatio...
4,The Westin Seattle,"1900 5th Avenue,?Seattle,?Washington?98101?USA",Situated amid incredible shopping and iconic a...


In [76]:
print('The dataset contains', len(hotel), 'recorded hotels')

The dataset contains 152 recorded hotels


In [77]:
def desc(index):
    """Print the raw description and name of the hotel at the given row index."""
    # Look the row up in the `hotel` DataFrame loaded above.
    example = hotel[hotel.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print('Hotel description:\n',example[0])
        print('\nHotel name:\n',example[1])

In [78]:
desc(2)

Hotel description:
 Located in the heart of downtown Seattle, the award-winning 
Crowne Plaza Hotel Seattle ? Downtown offers an exceptional blend of service, style and comfort. You?ll notice Cool, Comfortable and Unconventional touches that set us apart as soon as you step inside. Marvel at stunning views of the city lights while relaxing in our new Sleep Advantage? Beds. Enjoy complimentary wireless Internet throughout the hotel and amenities to help you relax like our Temple Spa? Sleep Tight Amenity kits featuring lavender spray and lotions to help you rejuvenate and unwind. Enjoy an invigorating workout at our 24-hour fitness center, get dining suggestions from our expert concierge or savor sumptuous cuisine at our Regatta Bar & Grille restaurant where you can enjoy Happy Hour in our lounge daily from 4pm - 7pm and monthly drink specials. Come and experience all that The Emerald City has to offer with us!

Hotel name:
 Crowne Plaza Seattle Downtown


### Data Description and Distribution

In [79]:
hotel['word_count'] = hotel['desc'].apply(lambda x: len(str(x).split()))

In [80]:
desc_lengths = list(hotel['word_count'])

print("Number of descriptions:",len(desc_lengths),
      "\nAverage word count:", np.average(desc_lengths),
      "\nMinimum word count:", min(desc_lengths),
      "\nMaximum word count:", max(desc_lengths))

Number of descriptions: 152 
Average word count: 156.46052631578948 
Minimum word count: 16 
Maximum word count: 492


In [81]:
import warnings
warnings.filterwarnings("ignore")

# Raw strings keep the regex escapes (\[, \], \|) from being read as string escapes.
format_spacing = re.compile(r'[/(){}\[\]\|@,;]')
bad_simbol = re.compile(r'[^0-9a-z #+_]')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: the cleaned string
    """
    text = text.lower() # lowercase the text
    text = format_spacing.sub(' ', text) # replace every character matched by format_spacing with a space
    text = bad_simbol.sub('', text) # delete every character other than 0-9, a-z, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in stop_words) # remove stop words
    return text
    
hotel['desc_clean'] = hotel['desc'].apply(clean_text)
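
To make the effect of each cleaning step easier to see, here is a small self-contained sketch that applies the same two regular expressions to one made-up sentence. The function name `clean_text_demo`, the sample sentence, and the tiny hand-picked stop word set are assumptions for illustration; the notebook uses the full NLTK English stop word list.

```python
import re

# Same patterns as in the notebook cell.
format_spacing = re.compile(r'[/(){}\[\]\|@,;]')
bad_simbol = re.compile(r'[^0-9a-z #+_]')

# Tiny hand-picked stop word set (an assumption, to avoid the NLTK download).
demo_stop_words = {"the", "in", "of", "a", "at", "our"}

def clean_text_demo(text):
    """Lowercase, strip punctuation, and drop stop words from the text."""
    text = text.lower()
    text = format_spacing.sub(' ', text)   # separators become spaces
    text = bad_simbol.sub('', text)        # other symbols are deleted outright
    return ' '.join(w for w in text.split() if w not in demo_stop_words)

# Made-up sentence, not taken from the dataset.
sample = "Located in the heart of Seattle, the award-winning hotel (24-hour gym)!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Because hyphens are deleted rather than replaced by a space, `award-winning` collapses into the single token `awardwinning`, which matches what appears in the cleaned description printed further down.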

In [82]:
def print_description(index):
    example = hotel[hotel.index == index][['desc_clean', 'name']].values[0]
    if len(example) > 0:
        print('Hotel description:\n',example[0])
        print('\nHotel name:\n',example[1])
print_description(2)

Hotel description:
 located heart downtown seattle awardwinning crowne plaza hotel seattle downtown offers exceptional blend service style comfort youll notice cool comfortable unconventional touches set us apart soon step inside marvel stunning views city lights relaxing new sleep advantage beds enjoy complimentary wireless internet throughout hotel amenities help relax like temple spa sleep tight amenity kits featuring lavender spray lotions help rejuvenate unwind enjoy invigorating workout 24hour fitness center get dining suggestions expert concierge savor sumptuous cuisine regatta bar grille restaurant enjoy happy hour lounge daily 4pm 7pm monthly drink specials come experience emerald city offer us

Hotel name:
 Crowne Plaza Seattle Downtown


### Preprocessing Hotel Desc, Vectorizing, TF-IDF, Cosine Similarity, and Indexing

In [83]:
hotel.set_index('name', inplace = True)

In [84]:
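# min_df=1 keeps every term that appears in at least one description
# (recent scikit-learn releases reject an integer min_df of 0).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(hotel['desc_clean'])
# TF-IDF rows are L2-normalized by default, so the linear kernel (dot product) equals cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)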
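
The cell above calls `linear_kernel`, a plain dot product, yet the result is named `cosine_similarities`. The two agree because `TfidfVectorizer` scales each row to unit length by default. The sketch below checks that equivalence on two made-up weight vectors in pure Python; the vectors and helper names are assumptions for illustration, not values from the hotel data.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up weight vectors standing in for rows of a TF-IDF matrix.
u = [0.0, 2.0, 1.0, 3.0]
v = [1.0, 1.0, 0.0, 2.0]

plain_cosine = cosine(u, v)
dot_of_unit_vectors = dot(l2_normalize(u), l2_normalize(v))
print(plain_cosine, dot_of_unit_vectors)
```

The two printed values should match up to floating-point rounding, which is why skipping the explicit normalization step of a cosine computation is safe for TF-IDF rows.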

In [85]:
indices = pd.Series(hotel.index)

In [90]:
indices[:25]

0                        Hilton Garden Seattle Downtown
1                                Sheraton Grand Seattle
2                         Crowne Plaza Seattle Downtown
3                         Kimpton Hotel Monaco Seattle 
4                                    The Westin Seattle
5                           The Paramount Hotel Seattle
6                                        Hilton Seattle
7                                         Motif Seattle
8                                       Warwick Seattle
9                            Four Seasons Hotel Seattle
10                                            W Seattle
11                                   Gand Hyatt Seattle
12                                 Kimpton Alexis Hotel
13                                            Hotel Max
14                                    Ace Hotel Seattle
15                          Seattle Marriott Waterfront
16                          The Edgewater Hotel Seattle
17                   SpringHill Suites Seattle?D

### Recommendation

In [87]:
def rekomendasi(name, cosine_similarities = cosine_similarities):
    
    rekomendasi_hotel = []
    
    # get the positional index of the hotel that matches the name
    idx = indices[indices == name].index[0]

    # build a Series of similarity scores, sorted in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending = False)

    # take the indexes of the 5 most similar hotels, skipping the first entry (the hotel itself)
    top_5_indexes = list(score_series.iloc[1:6].index)
    
    # populate the list with the names of the top 5 matching hotels
    for i in top_5_indexes:
        rekomendasi_hotel.append(list(hotel.index)[i])
        
    return rekomendasi_hotel
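
The ranking step can also be sketched without pandas. The version below is a minimal illustration on a made-up 4x4 similarity matrix; `top_k_similar`, the hotel names, and the scores are all hypothetical. Unlike the function above, it excludes the query hotel by position instead of dropping the first sorted entry, and it raises a clear error for an unknown name.

```python
def top_k_similar(name, names, similarity, k=5):
    """Return the k names most similar to `name`, excluding `name` itself."""
    if name not in names:
        raise KeyError(f"Unknown hotel name: {name!r}")
    idx = names.index(name)
    # Rank every other row position by its similarity to the query row.
    ranked = sorted(
        (i for i in range(len(names)) if i != idx),
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [names[i] for i in ranked[:k]]

# Made-up names and symmetric similarity scores.
names = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
similarity = [
    [1.0, 0.6, 0.1, 0.3],
    [0.6, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.4],
    [0.3, 0.5, 0.4, 1.0],
]

result = top_k_similar("Hotel A", names, similarity, k=2)
print(result)
```

Excluding by position may matter when two hotels have identical descriptions, since the query hotel is then not guaranteed to sort first.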

In [88]:
rekomendasi('Hotel Max')

['Hotel Theodore',
 'Stay Alfred on 1st Avenue',
 'Residence Inn by Marriott Seattle Downtown/Convention Center',
 'Sheraton Grand Seattle',
 'The Westin Seattle']

In [89]:
rekomendasi("Warwick Seattle")

['The Edgewater Hotel Seattle',
 'Hyatt Place Seattle',
 'Holiday Inn Seattle Downtown',
 'Holiday Inn Express & Suites North Seattle - Shoreline',
 'The State Hotel']

### Interpretation and Conclusion

Each search takes a single hotel name as input. Taking **`Hotel Max`** as an example, the function returns the hotels most similar to **`Hotel Max`**, judged by their descriptions. Those descriptions were cleaned, split into terms, and weighted with a TF-IDF vectorizer, and the resulting weights are what the similarity scores are computed from.

Likewise, **`Warwick Seattle`** returns the 5 hotels most similar to the user's search input. This is what Content-Based Filtering means.