### IMPORT PACKAGES & FUNCTIONS

In [3]:
import hdbscan
import random
from Recommendation_System_Location_Based_Pre_Work import df2_co_topcat
from Recommendation_System_Location_Based_Pre_Work import mapVisualize
from Recommendation_System_Location_Based_Pre_Work import addPredictedClusters
from Recommendation_System_Location_Based_Pre_Work import numOfClusters

In [4]:
df2_co_topcat.info() #5141 non-null entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5141 entries, 0 to 5140
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   uid         5141 non-null   object 
 1   category    5141 non-null   object 
 2   categoryid  5141 non-null   float64
 3   merchant    5141 non-null   object 
 4   city        5141 non-null   object 
 5   latitude    5141 non-null   float64
 6   longitude   5141 non-null   float64
dtypes: float64(3), object(4)
memory usage: 281.3+ KB


### Current view of transactions on the map

In [5]:
mapVisualize(df2_co_topcat,"latitude","longitude",'category','Current_Transactions_Map_Over_CO') # Visualize current transactions on the map

#### From the map above, there are clearly "outliers", i.e. points that will not, or should not, be part of any cluster. The clustering algorithm should mark these outliers as noise and separate them from the island clusters, whose data points lie close to each other.

##############################################################################################################################

## Clustering

### 1. HDBSCAN

In [6]:
#Using lat and long to define clusters.
df_cluster=df2_co_topcat[['latitude','longitude']]

In [7]:
#create a clustering object
min_size=10
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_size, prediction_data=True).fit(df_cluster)

In [8]:
# get the distinct labels from the clusterer object
unique_labels = set(clusterer.labels_)

# get the number of distinct labels; note that this count includes the noise label -1
num_clusters = len(unique_labels)
print('Total number of clusters formed with min_cluster_size of '+ str(min_size) + ': ' + str(num_clusters))

Total number of clusters formed with min_cluster_size of 10: 100


In [9]:
#Add new predicted cluster to the dataframe
df2_co_topcat=addPredictedClusters(df2_co_topcat,clusterer.labels_,'cluster')  

In [10]:
# Distribution of data within each cluster
print(numOfClusters(df2_co_topcat,'cluster'))

cluster  -1    0    1    2    3    4    5    6    7    8   ...   89   90   91  \
count    342   12   47   11   13   14   24   57   48   12  ...  236   12   20   

cluster   92   93   94   95   96   97   98  
count     46   16   69   76   15   16   14  

[1 rows x 100 columns]


In [11]:
#Visualize clusters on the map
mapVisualize(df2_co_topcat,"latitude","longitude",'cluster','HDBScan-lat-long-TopCat') 

### HDBSCAN and min_cluster_size
1. The value of min_cluster_size is key to building the location-based recommendation engine. Small values (5-10) produce more clusters and less noise (i.e. points that are not part of any cluster): between 234 and 342 noise points.

2. Larger values of min_cluster_size (12-20) produce fewer clusters but much more noise (440-615 points).

3. To keep the majority of the transaction data inside clusters, I chose min_cluster_size=10. As a result, some groups of nearby points with fewer than 10 members were tagged as noise, but most of the data belongs to a cluster.

#### min_cluster_size=10, noise (cluster -1) = 342
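
The noise figures quoted above can be reproduced by fitting HDBSCAN at several min_cluster_size values and tallying the labels. The following is a minimal sketch, not the code that produced those numbers: `summarize_labels` and `sweep_min_cluster_size` are hypothetical helper names, and the commented usage assumes `df_cluster` from the earlier cell.

```python
from collections import Counter


def summarize_labels(labels):
    """Return the number of real clusters and the number of noise points.

    HDBSCAN marks noise with the label -1, so that label is excluded
    from the cluster count.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return {"n_clusters": len(counts), "n_noise": n_noise}


def sweep_min_cluster_size(points, sizes):
    """Fit HDBSCAN once per candidate size and summarize each labelling."""
    import hdbscan  # imported lazily; same library the notebook uses

    results = {}
    for size in sizes:
        clusterer = hdbscan.HDBSCAN(min_cluster_size=size).fit(points)
        results[size] = summarize_labels(clusterer.labels_)
    return results


# Hypothetical usage with the notebook's data:
# sweep_min_cluster_size(df_cluster, sizes=[5, 10, 15, 20])
```

Separating the noise label from the cluster count also makes it easy to compare settings on the same two numbers used in the discussion above.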


### Recommendation Engine Code

1. Create a dataframe (recomm_df) from df2_co_topcat that lists the merchants in each tagged cluster, ordered by number of transactions (the popularity criterion), so that the most popular merchants of each cluster come first. The count is used only for sorting and is then dropped.
The dataframe also holds the latitude/longitude and the city of each merchant.

In [12]:
recomm_df = df2_co_topcat.copy()
recomm_df = (recomm_df.groupby(['cluster', 'merchant']).agg({'latitude' : 'first',
                                                  'longitude' : 'first',
                                                  'city' : 'first',
                                                  'category' : 'first',
                                                  'cluster' : 'count'})
          .rename({'cluster' : 'cluster_count'},axis=1).reset_index()
          .sort_values(['cluster', 'cluster_count'], ascending = [True, False])
          .drop('cluster_count', axis=1))
#recomm_df.sample(5)
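
To make the ranking logic concrete, here is a plain-Python sketch of what the groupby / sort_values chain computes: count transactions per (cluster, merchant) pair, then list each cluster's merchants from most to least frequent. It is an illustration rather than a replacement for the pandas code; `rank_merchants_by_cluster` and the toy data are made up for this example.

```python
from collections import Counter, defaultdict


def rank_merchants_by_cluster(transactions):
    """Map each cluster to its merchants, most frequent first.

    `transactions` is an iterable of (cluster, merchant) pairs, one per
    transaction. Merchants with equal counts keep first-seen order.
    """
    counts = defaultdict(Counter)
    for cluster, merchant in transactions:
        counts[cluster][merchant] += 1
    return {
        cluster: [name for name, _ in merchant_counts.most_common()]
        for cluster, merchant_counts in sorted(counts.items())
    }


# Made-up transactions, for illustration only
toy = [(0, "cafe"), (0, "grocer"), (0, "grocer"), (1, "bakery"), (0, "grocer"), (0, "cafe")]
ranking = rank_merchants_by_cluster(toy)
print(ranking)
```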

2. The recommendation engine first uses the hdbscan function 'approximate_predict' to predict the cluster of the given point (lat, long).
3. If the predicted cluster is -1 (noise), no merchant is recommended.
4. Otherwise, it matches the predicted cluster against the 'cluster' column of recomm_df.
5. The top 5 merchants by popularity are then displayed, excluding the merchant the user is currently at. The display contains the name, city and lat/long of each recommended merchant.

In [13]:
# Recommendation Engine Code:

def recommend_co_merchants_hdb(df,lat,long,city,merchant):
    """Recommend popular merchants in the cluster predicted for (lat, long).

    df is expected to be recomm_df (sorted by popularity within each cluster).
    The fitted global `clusterer` is used for the prediction.
    """
    # Predict the cluster for the latitude and longitude provided
    test_labels, strengths = hdbscan.approximate_predict(clusterer, [[lat,long]])
    predicted_cluster=test_labels[0]
    print('Predicted cluster for this lat/long combination is: '  + str(predicted_cluster))
    print("_______________________________________________________________________________")
    # Noise points (-1) belong to no cluster, so nothing is recommended
    if predicted_cluster==-1:
        return ('No merchants close by')
    else:
        # Take the five most popular merchants in this cluster
        pop_merch_recomm_df=(df[df['cluster']==predicted_cluster].iloc[0:5][['merchant','city','latitude','longitude']])
        pop_merch_recomm_df=pop_merch_recomm_df.reset_index(drop=True)
        # Drop the merchant the user is currently at (same name and coordinates)
        mask = (pop_merch_recomm_df.merchant==merchant) & (pop_merch_recomm_df.latitude==lat) & (pop_merch_recomm_df.longitude==long)
        print ('Since you are currently in '+ city.capitalize() + ' ' + 'at ' + merchant.capitalize() + ', how about you visit these merchants around this area? ')
        return pop_merch_recomm_df[~mask]
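
One consequence of the function above is that it takes the top five rows first and removes the current merchant afterwards, so it can return fewer than five suggestions. A possible variation, sketched here in plain Python on a list of dicts rather than on recomm_df, filters first and truncates second. The helper name `top_n_excluding_current`, the tolerance and the sample rows are assumptions for illustration; coordinates are compared with `math.isclose` instead of `==` to tolerate floating-point noise.

```python
import math


def top_n_excluding_current(ranked_rows, merchant, lat, long, n=5, tol=1e-6):
    """Return up to n rows from a popularity-ordered list, skipping the current merchant.

    ranked_rows: dicts with 'merchant', 'latitude' and 'longitude' keys,
    already sorted from most to least popular.
    """
    def is_current(row):
        return (
            row["merchant"] == merchant
            and math.isclose(row["latitude"], lat, abs_tol=tol)
            and math.isclose(row["longitude"], long, abs_tol=tol)
        )

    # Filter first, then truncate, so the result has n rows whenever possible
    return [row for row in ranked_rows if not is_current(row)][:n]


# Made-up rows, for illustration only
rows = [
    {"merchant": "a", "latitude": 40.0, "longitude": -105.0},
    {"merchant": "b", "latitude": 40.0, "longitude": -105.0},
    {"merchant": "c", "latitude": 40.1, "longitude": -105.1},
]
suggestions = top_n_excluding_current(rows, "a", 40.0, -105.0, n=2)
print([row["merchant"] for row in suggestions])
```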

### Test 1:

#### 1. Select a random row from the df2_co_topcat dataframe and extract its lat, long, city and merchant.

In [14]:
test_index=df2_co_topcat.sample(random_state=42).index.tolist()
# test_index holds index labels, so look the row up with .loc rather than .iloc
test_lat=df2_co_topcat.loc[test_index[0]]['latitude']
test_long=df2_co_topcat.loc[test_index[0]]['longitude']
test_city=df2_co_topcat.loc[test_index[0]]['city']
test_merchant=df2_co_topcat.loc[test_index[0]]['merchant']
test_lat, test_long, test_city, test_merchant

(40.016358, -105.130597, 'Lafayette', 'conoco gas station')

#### 2. Provide the collected information to the 'recommend_co_merchants_hdb' function

In [15]:
recommend_co_merchants_hdb(recomm_df,test_lat,test_long,test_city,test_merchant)

Predicted cluster for this lat/long combination is: 20
_______________________________________________________________________________
Since you are currently in Lafayette at Conoco gas station, how about you visit these merchants around this area? 


      merchant       city   latitude   longitude
0  smashburger  Lafayette  40.013218 -105.131553


###########################################################################################################################

### Test 2:

#### 1. Select a random row from the df2_co_topcat dataframe and extract its lat, long, city and merchant.

In [16]:
test_index=df2_co_topcat.sample(random_state=10).index.tolist()
# test_index holds index labels, so look the row up with .loc rather than .iloc
test_lat=df2_co_topcat.loc[test_index[0]]['latitude']
test_long=df2_co_topcat.loc[test_index[0]]['longitude']
test_city=df2_co_topcat.loc[test_index[0]]['city']
test_merchant=df2_co_topcat.loc[test_index[0]]['merchant']
test_lat, test_long, test_city, test_merchant

(40.00163, -105.10196, 'Lafayette', 'king soopers')

#### 2. Provide the collected information to the 'recommend_co_merchants_hdb' function

In [17]:
recommend_co_merchants_hdb(recomm_df,test_lat,test_long,test_city,test_merchant)

Predicted cluster for this lat/long combination is: 44
_______________________________________________________________________________
Since you are currently in Lafayette at King soopers, how about you visit these merchants around this area? 


             merchant       city  latitude  longitude
1             walmart  Lafayette  40.00163 -105.10196
2          otis craft  Lafayette  40.00163 -105.10196
3  morning glory cafe  Lafayette  40.00163 -105.10196
4     natural grocers  Lafayette  40.00163 -105.10196
