# Date with Data

## Problem statement

You want to go on a date on Valentine's Day and impress your partner by taking her to a good restaurant.  
Can you use data science to automate finding a date and also suggest the best-matched restaurant?


## Well of course you can!

## The Data - Let's analyze the data at hand

## Restaurant Data

![Zomato](Slide3.PNG)

## The data we could scrape
- **Area of the restaurant** 
- **Cost for two**
- **Cuisine**
- **Latitude and Longitude**
- **Name**
- **Rating**
- **Votes**

---

## People Data

![Humans](Slide5.PNG)

# Let's make a model to do something with the data

## Importing some python modules

In [21]:
import math
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
import os
import pprint
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics 
from sklearn.metrics import classification_report

## Loading the Data

In [39]:
labelled_data = pd.read_excel("Restaurant_final.xlsx")[0:700]
labelled_data.head(10)

Unnamed: 0,area,cost,cuisine,latitude,longitude,name,preference,ratings,type,votes
0,Garia,700,['Biryani' 'North Indian'],22.502441,88.357122,The Biryani Company,0,3.5,"[""['Casual Dining']""]",63
1,Golpark,250,['Juices' 'South Indian' 'Street Food'],22.562931,88.351263,Ralli's,1,3.4,"[""['Casual Dining']""]",144
2,Hatibagan,650,['Chinese' 'North Indian'],22.586452,88.367751,North Point,0,3.5,"[""['Quick Bites']""]",79
3,Ballygunge,1200,['Asian' 'Continental' 'Middle Eastern'],22.526294,88.364614,Spice Kraft,1,3.4,"[""['Casual Dining']""]",2231
4,Garia,600,['Fast Food'],22.470767,88.388901,Hunk Hurry,0,2.5,"[""['Quick Bites']""]",39
5,Park Street Area,900,['Chinese' 'Continental' 'North Indian'],22.552581,88.351074,Om Ganpati Restaurant,0,3.4,"[""['Casual Dining']""]",183
6,Garia,500,['Chinese' 'North Indian'],22.472137,88.389439,Machhranga,0,3.5,"[""['Casual Dining']""]",61
7,Jadavpur,200,['Chinese' 'North Indian' 'Rolls'],22.497826,88.374692,Zaika,0,3.4,"[""['Quick Bites']""]",141
8,Beliaghata,800,['Bengali' 'Chinese' 'North Indian'],22.564053,88.395759,Drumstick,0,3.4,"[""['Casual Dining']""]",259
9,Nagerbazar,400,['Chinese' 'North Indian'],22.623331,88.414703,Barsha,0,3.5,"[""['Casual Dining']""]",112


## Projecting the data

In [23]:
X = labelled_data[['cost', 'latitude', 'longitude', 'ratings', 'votes']]
y = labelled_data[['preference']]
print(X.head(5))
print(y.head(5))

   cost   latitude  longitude  ratings  votes
0   700  22.502441  88.357122      3.5     63
1   250  22.562931  88.351263      3.4    144
2   650  22.586452  88.367751      3.5     79
3  1200  22.526294  88.364614      3.4   2231
4   600  22.470767  88.388901      2.5     39
   preference
0           0
1           1
2           0
3           1
4           0


## Splitting the data
![TrainTest](traintestsplit.png)

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=25)
y_train, y_test = np.squeeze(y_train), np.squeeze(y_test)
print('X_train.shape: {}'.format(X_train.shape))
print('y_train.shape: {}'.format(y_train.shape))
print('X_test.shape: {}'.format(X_test.shape))
print('y_test.shape: {}'.format(y_test.shape))

X_train.shape: (490, 5)
y_train.shape: (490,)
X_test.shape: (210, 5)
y_test.shape: (210,)
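
The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then cut off 30% of them for testing. This is only an illustration of the concept. `split_indices` is a hypothetical helper, and its shuffle order will not match the rows that scikit-learn's `train_test_split` actually picks.

```python
import random

def split_indices(n_rows, test_fraction=0.3, seed=25):
    """Shuffle row indices and cut them into a train group and a test group."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed makes the shuffle repeatable
    n_test = round(n_rows * test_fraction)    # 700 rows * 0.3 -> 210 test rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(700)
print(len(train_idx), len(test_idx))  # prints: 490 210
```

No row lands in both groups, which is what lets the test rows act as unseen data.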


# Training the Model
![sk-learn](sklearn-cheatsheet.png)

## Setting up a LogisticRegression

In [25]:
LogReg=LogisticRegression()
LogReg.fit(X_train, y_train)

y_pred = LogReg.predict(X_test)
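
As a rough picture of what the fitted model does at prediction time, logistic regression takes a weighted sum of the five features, squashes it through a sigmoid to get a probability, and predicts 1 when that probability reaches 0.5. The sketch below uses made-up weights and a made-up bias purely for illustration. They are not the coefficients learned by `LogReg`.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_preference(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one restaurant's feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

# Hypothetical parameters in the order: cost, latitude, longitude, ratings, votes
hypothetical_weights = [0.001, 0.0, 0.0, 0.5, 0.0005]
hypothetical_bias = -3.0

# Feature rows copied from the head() output above (rows 3 and 4)
label_a, prob_a = predict_preference([1200, 22.526294, 88.364614, 3.4, 2231],
                                     hypothetical_weights, hypothetical_bias)
label_b, prob_b = predict_preference([600, 22.470767, 88.388901, 2.5, 39],
                                     hypothetical_weights, hypothetical_bias)
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```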

## Analyzing the accuracy using a Confusion Matrix

![Confusion Matrix](confusionmatrix.png)

In [26]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # separate name so the imported function is not overwritten
cm

array([[105,  15],
       [ 65,  25]])

In [31]:
total_samples = np.sum(cm)
correct_predictions = np.trace(cm)  # the diagonal holds the correctly classified samples
print('Correct pred {} \nTotal Samples {} \nAccuracy {}%'.format(correct_predictions, total_samples, correct_predictions/total_samples*100.0))

Correct pred 130 
Total Samples 210 
Accuracy 61.904761904761905%
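
Accuracy alone hides how the errors are distributed. The short sketch below reads the same four counts from the matrix printed above, using scikit-learn's layout (rows are true labels, columns are predicted labels), and works out precision and recall for the "preferred" class. It also computes the score you would get by always predicting 0. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = [[105, 15],
      [65, 25]]

tn, fp = cm[0]   # true 0: predicted 0 / predicted 1
fn, tp = cm[1]   # true 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total                        # 130 / 210
precision = tp / (tp + fp)                          # of predicted 1s, how many were right
recall = tp / (tp + fn)                             # of actual 1s, how many were found
majority_baseline = max(tn + fp, fn + tp) / total   # always predicting the larger class

print('accuracy  {:.4f}'.format(accuracy))           # accuracy  0.6190
print('precision {:.4f}'.format(precision))          # precision 0.6250
print('recall    {:.4f}'.format(recall))             # recall    0.2778
print('baseline  {:.4f}'.format(majority_baseline))  # baseline  0.5714
```

The model finds only about 28% of the restaurants labelled as preferred, and its accuracy is only a few points above the always-0 baseline.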


## Now let's use this model to predict good and bad restaurants on the entire dataset (labelled and unlabelled)

In [33]:
entire_data = pd.read_excel("Restaurant_final.xlsx")[['cost', 'latitude', 'longitude', 'ratings', 'votes']]
predictions = LogReg.predict(entire_data)
predictions

array([0, 0, 0, ..., 1, 0, 1])

In [35]:
entire_data['pred_pref'] = predictions
entire_data.head(5)

Unnamed: 0,cost,latitude,longitude,ratings,votes,pred_pref
0,700,22.502441,88.357122,3.5,63,0
1,250,22.562931,88.351263,3.4,144,0
2,650,22.586452,88.367751,3.5,79,0
3,1200,22.526294,88.364614,3.4,2231,1
4,600,22.470767,88.388901,2.5,39,0


In [37]:
entire_data.to_csv('Restaurant_Prediction.csv', index=False)

## KNN ( K Nearest Neighbours )

In [None]:
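import numpy as np
from sklearn.model_selection import train_test_split

# create design matrix X and target vector y
# assuming t_s has the column layout shown in the head() output above:
# positions 1, 3, 4, 7, 9 are cost, latitude, longitude, ratings, votes
X = np.array(t_s.iloc[:, [1, 3, 4, 7, 9]]) 	# .ix was removed from pandas, .iloc selects by position
y = np.array(t_s.iloc[:, 6]) 	# position 6 is the preference label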

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# loading library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors=3)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
pred = knn.predict(X_test)

# evaluate accuracy
print(accuracy_score(y_test, pred))
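
To show what `KNeighborsClassifier` is doing conceptually, here is a minimal plain-Python sketch of a k = 3 vote. It finds the three training points closest to the query and returns the most common label among them. The points and labels are toy values invented for the example, and `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)   # Euclidean distance (Python 3.8+)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well separated clusters
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (0.5, 0.5)))  # prints: 0
print(knn_predict(toy_points, toy_labels, (5.5, 5.5)))  # prints: 1
```

Because the vote depends on raw distances, a feature with a large range (such as votes) can dominate one with a small range (such as ratings), so scaling the features first may be worth trying.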

In [None]:
X = np.array(t.iloc[:, [1, 3, 4, 7, 9]])  # same feature columns as in training; .ix was removed from pandas
y_KNN = knn.predict(X)

In [None]:
# one single-key dict per prediction, ready to become a DataFrame column
l=[{'KNN_pred':p} for p in y_KNN]

In [None]:
Knn_p=pd.DataFrame(l)
df=pd.concat([t,R_p, Knn_p], axis=1)

In [None]:
# the context manager saves and closes the file; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('E:\\Anaconda2\\Date with data\\Restaurant_Prediction1.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sample')

## Finding your favorite restaurant 

In [None]:
Boy_cuisine=['Juices', 'South Indian', 'Street Food']
Girl_cuisine=['Chinese', 'Mughlai', 'North Indian']
Boy_c=np.array(Boy_cuisine)
Girl_c=np.array(Girl_cuisine)

In [None]:
Cuisine_pref=np.append(Boy_c,Girl_c)
Cuisine_pref=np.unique(Cuisine_pref)  # np.unique drops duplicates and returns the values sorted
Boy_lat=22.57940
Boy_lon=88.35409
Girl_lat=22.57940
Girl_lon=88.35409

In [None]:
df=pd.read_excel('E:\\Anaconda2\\Date with data\\Restaurant_Prediction.xlsx')
len(df)

In [None]:
list_cost=[]
for i in range(len(df)):
    row=df.iloc[i]
    Res_lat=float(row['latitude'])
    Res_lon=float(row['longitude'])
    # straight-line distance, in degrees of latitude/longitude, from each person to the restaurant
    Boy_dis=math.hypot(Boy_lat-Res_lat, Boy_lon-Res_lon)
    Girl_dis=math.hypot(Girl_lat-Res_lat, Girl_lon-Res_lon)
    distance=Boy_dis+Girl_dis
    # restaurants the regression model predicted as 0 get no cuisine score
    if(row['Regression_pred']==0):
        list_cost+=[{'Food_match':0,'Euclidean_dist':distance}]
        continue
    # turn "['Chinese' 'North Indian']" into ['Chinese', 'North Indian']
    cuisine=str(row['cuisine'])
    cuisine=cuisine.strip('[')
    cuisine=cuisine.strip(']')
    cuisine=cuisine.strip("'")
    Cuisine_restaurant=sorted(cuisine.split("' '"))  # the merge below needs sorted input
    # walk both sorted lists in step and count the cuisines they share
    l=0
    k=0
    count=0
    while( l <len(Cuisine_pref) and k <len(Cuisine_restaurant)):
        if(Cuisine_restaurant[k]==Cuisine_pref[l]):
            count+=1
            l+=1
            k+=1
        elif(Cuisine_restaurant[k]>Cuisine_pref[l]):
            l+=1
        elif(Cuisine_restaurant[k]<Cuisine_pref[l]):
            k+=1
    list_cost+=[{'Food_match':count,'Euclidean_dist':distance}]
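
One possible way to turn these two scores into a suggestion is sketched below. It makes a few assumptions that are not in the notebook. It swaps the degree-based Euclidean distance for the haversine formula so that distances come out in kilometres. It counts cuisine overlap with a set intersection instead of the sorted merge. It ranks by cuisine match first and distance second, which is just one reasonable rule. The model prediction is left out to keep the example short. The three restaurants and the starting coordinates are copied from the data shown earlier.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Union of both people's cuisine preferences, as in Cuisine_pref
preferred = {'Juices', 'South Indian', 'Street Food', 'Chinese', 'Mughlai', 'North Indian'}
start = (22.57940, 88.35409)  # both people start from the same point in the notebook

# (name, latitude, longitude, cuisines) taken from the head() output
restaurants = [
    ("Ralli's", 22.562931, 88.351263, ['Juices', 'South Indian', 'Street Food']),
    ("North Point", 22.586452, 88.367751, ['Chinese', 'North Indian']),
    ("Spice Kraft", 22.526294, 88.364614, ['Asian', 'Continental', 'Middle Eastern']),
]

scored = []
for name, lat, lon, cuisines in restaurants:
    food_match = len(preferred & set(cuisines))                     # shared cuisines
    distance_km = 2 * haversine_km(start[0], start[1], lat, lon)    # one trip per person
    scored.append((name, food_match, distance_km))

# Assumed ranking rule: more matching cuisines first, shorter total distance as tie-breaker
ranked = sorted(scored, key=lambda r: (-r[1], r[2]))
for name, food_match, distance_km in ranked:
    print('{:12s} match={} distance={:.2f} km'.format(name, food_match, distance_km))
```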
      