Note: Building the LSH Forest and the integrated table takes a lot of time and processing power.
An already built forest and integrated table are available in the file `21_matching`, which can be loaded and tested with the data in hessenbox.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasketch import MinHash, MinHashLSHForest
import time
import numpy as np

import pickle


pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)



Loading the newly cleaned datasets

In [2]:
df_google = pd.read_csv('../data/google_cleaned.csv',parse_dates=["Released","Updated"],index_col=[0])
df_apple = pd.read_csv('../data/apple_cleaned.csv',parse_dates=["Released","Updated"],index_col=[0])

- The preprocess function lowercases the text (App Id and App Name) and splits it into a list of tokens on '.' and ' '
- The get_forest function builds the LSH Forest from the google dataset

In [3]:
from tqdm import tqdm
import re


def preprocess(text):
    # Lowercase, then split on '.' or ' ' (raw string avoids an invalid-escape warning)
    tokens = text.lower()
    tokens = re.split(r'\.| ', tokens)
    return tokens

def get_forest(data, perms):
    start_time = time.time()
    
    minhash = []
    
    for text in tqdm(data['text']):
        tokens = preprocess(text)
        m = MinHash(num_perm=perms)
        for s in tokens:
            m.update(s.encode('utf8'))
        minhash.append(m)
        
    forest = MinHashLSHForest(num_perm=perms)
    
    for i,m in enumerate(minhash):
        forest.add(i,m)
        
    forest.index()
    
    print('It took %s seconds to build forest.' %(time.time()-start_time))
    
    return forest
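
To give a feel for what the MinHash objects in get_forest capture, here is a minimal pure-Python sketch of the idea: the fraction of positions where two MinHash signatures agree estimates the Jaccard similarity of the underlying token sets. This is not datasketch's implementation. The salted SHA-1 hashing scheme, the helper names `minhash_signature` and `estimate_jaccard`, and the second example string are assumptions made for illustration.

```python
import hashlib
import re


def preprocess(text):
    # Same tokenization as in the notebook: lowercase, split on '.' or ' '.
    return re.split(r'\.| ', text.lower())


def minhash_signature(tokens, num_perm=256):
    # Hypothetical helper: one salted hash per "permutation",
    # keeping the minimum hash value over all distinct tokens.
    unique_tokens = set(tokens)
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode('utf8') + b':'
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode('utf8')).digest()[:8], 'big')
            for t in unique_tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    # Fraction of signature positions on which both sets share the same minimum.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# The first string is the sample used in this notebook; the second is made up.
sig_a = minhash_signature(preprocess("com.hkbu.arc.apaper A+ Paper Guide"))
sig_b = minhash_signature(preprocess("com.hkbu.arc.apaper A+ Paper"))
estimate = estimate_jaccard(sig_a, sig_b)
print("estimated Jaccard:", estimate)  # the exact Jaccard of these token sets is 6/7
```

More permutations tighten the estimate at the cost of longer signatures, which is the trade-off behind the `perms` argument of get_forest.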

The predict function takes a text as input and returns the most similar records (up to num_results) from the forest

In [4]:
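def predict(text, database, perms, num_results, forest):
    tokens = preprocess(text)

    # Build the MinHash of the query with the same number of permutations as the forest
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))

    # Forest keys are row positions; dtype=int keeps an empty result usable with iloc
    idx_array = np.array(forest.query(m, num_results), dtype=int)

    result = database.iloc[idx_array]

    return result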

Here we build a forest from the combination ("App Id" + "App Name") of the google dataset, using 265 permutations

In [6]:
df = pd.DataFrame()
df['text'] = df_google['App Id'] + ' ' + df_google['App Name'] 

forest = get_forest(df, 265)

100%|███████████████████████████████████████████████████████| 2312942/2312942 [1:11:22<00:00, 540.06it/s]


It took 4728.405108213425 seconds to build forest.


In [6]:
def jaccard(v1,v2):
    title_set = set(v1)
    result_set = set(v2)
    return float(len(title_set.intersection(result_set)))/float(len(title_set.union(result_set)))


Sample token list from ("App Id" + "App Name")

In [7]:
title = "com.hkbu.arc.apaper A+ Paper Guide"
print(preprocess(title))

['com', 'hkbu', 'arc', 'apaper', 'a+', 'paper', 'guide']
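
Token lists like the one above are what the Jaccard comparison operates on. As a small worked example, the sketch below compares the sample title with a second, made-up title (an assumption for illustration, not a record from either dataset). `preprocess` and `jaccard` are repeated here only so the snippet runs on its own.

```python
import re


def preprocess(text):
    # Lowercase, then split on '.' or ' ' as in the notebook.
    return re.split(r'\.| ', text.lower())


def jaccard(v1, v2):
    # Size of the intersection divided by size of the union of the two token sets.
    title_set = set(v1)
    result_set = set(v2)
    return len(title_set & result_set) / len(title_set | result_set)


tokens_a = preprocess("com.hkbu.arc.apaper A+ Paper Guide")
tokens_b = preprocess("com.hkbu.arc.apaper A+ Paper")  # hypothetical counterpart

# 6 shared tokens, 7 tokens in the union.
score = jaccard(tokens_a, tokens_b)
print(round(score, 4))  # 0.8571
```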


Here we go through the apple dataset. For each record we:
1. Query the forest for the top match from the google dataset
2. Calculate the Jaccard similarity between the record and the result, using weights.

`final score = 0.7 * jaccard("App Id" + "App Name") + 0.3 * jaccard("Developer" + "Developer Website")`
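
The weighting can be checked with a few lines of arithmetic. The sketch below assumes the two Jaccard values are already computed; the helper name `weighted_score` and the input values are hypothetical and only illustrate how the 0.7 / 0.3 weights combine.

```python
def weighted_score(j_title, j_developer, w_title=0.7, w_developer=0.3):
    # Weighted sum of the two Jaccard similarities, each in [0, 1].
    return w_title * j_title + w_developer * j_developer


# A perfect title match with no developer overlap.
title_only = weighted_score(1.0, 0.0)
# A perfect developer match with no title overlap.
developer_only = weighted_score(0.0, 1.0)
# Both fields match perfectly.
both = weighted_score(1.0, 1.0)

print(title_only, developer_only, both)
```

With these weights a perfect title match alone reaches 0.7, while a perfect developer match alone reaches only 0.3, so the App Id and App Name dominate the decision.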

In [141]:
dfi = pd.DataFrame(columns=["apple_id","android_id","score"])

def compare(val):
    # Query text: app fields plus developer fields of the apple record
    title = val['App Id'] + ' ' + val['App Name'] + ' ' + val['Developer'] + ' ' + str(val['Developer Website'])
    result = predict(title, df_google, 265, 1, forest)

    if result.empty:
        return None

    match = result.iloc[0]

    # Jaccard similarity on ("App Id" + "App Name")
    j_title = jaccard(preprocess(val['App Id'] + ' ' + val['App Name']),
                      preprocess(match["App Id"] + ' ' + match["App Name"]))
    # Jaccard similarity on ("Developer" + "Developer Website"); str() guards against missing values
    j_developer = jaccard(preprocess(val['Developer'] + ' ' + str(val['Developer Website'])),
                          preprocess(str(match["Developer"]) + ' ' + str(match["Developer Website"])))

    # Weighted final score: 0.7 for the title part, 0.3 for the developer part
    return (result, .7*j_title + .3*j_developer)


arr = []
for i in tqdm(range(0,len(df_apple))):
    
    apple_id = df_apple.iloc[i]
    res = compare(apple_id)
   
    if res is None:
        continue
        
    arr.append(
        {
        "apple_id": apple_id["App Id"],
        "android_id": res[0].iloc[0]["App Id"],
        "score": res[1]
        }
    )

100%|██| 1230375/1230375 [2:13:17<00:00, 153.85it/s]


At the end we have the integrated table, which pairs App Ids from the two datasets.
Here we can specify a threshold on the score above which a match is considered valid (0.5 in the cell below)

In [6]:
df = pd.DataFrame(arr)
df.loc[df["score"] > .5]

Unnamed: 0.1,Unnamed: 0,apple_id,android_id,score
0,0,com.hkbu.arc.apaper,com.hkbu.arc.apaper,0.733333
9,9,com.aaaakh.news,com.aaaakh.news,0.775000
14,14,com.goodbarber.bigbookaudio,com.goodbarber.bigbookaudio,0.745833
17,17,com.kazo0.dailyreflection,com.kazo0.dailyreflection,0.733333
29,29,ca.bintec.meescan.84021342,ca.bintec.meescan.c84021342,0.700000
...,...,...,...,...
1229201,1229201,com.syscon.fitmobile,com.syscon.fitmobile,0.742857
1229204,1229204,com.syslor.syslorar,com.syslor.AR,0.600000
1229205,1229205,com.zettaservicios.sysmos,com.zettaservicios.sysmos,0.733333
1229237,1229237,fr.emotic.SystoviPhone,fr.emotic.systovi,0.600000
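
To see how the choice of threshold changes what counts as a match, here is a small sketch over the nine rows displayed above. It uses plain Python instead of pandas so that it runs without the full table; the helper `filter_matches` and the alternative thresholds 0.65 and 0.75 are assumptions for illustration.

```python
# (apple_id, android_id, score) copied from the rows shown in the output above.
matches = [
    ("com.hkbu.arc.apaper", "com.hkbu.arc.apaper", 0.733333),
    ("com.aaaakh.news", "com.aaaakh.news", 0.775000),
    ("com.goodbarber.bigbookaudio", "com.goodbarber.bigbookaudio", 0.745833),
    ("com.kazo0.dailyreflection", "com.kazo0.dailyreflection", 0.733333),
    ("ca.bintec.meescan.84021342", "ca.bintec.meescan.c84021342", 0.700000),
    ("com.syscon.fitmobile", "com.syscon.fitmobile", 0.742857),
    ("com.syslor.syslorar", "com.syslor.AR", 0.600000),
    ("com.zettaservicios.sysmos", "com.zettaservicios.sysmos", 0.733333),
    ("fr.emotic.SystoviPhone", "fr.emotic.systovi", 0.600000),
]


def filter_matches(rows, threshold):
    # Keep only pairs whose score is strictly above the threshold.
    return [row for row in rows if row[2] > threshold]


counts = {t: len(filter_matches(matches, t)) for t in (0.5, 0.65, 0.75)}
print(counts)  # {0.5: 9, 0.65: 7, 0.75: 1}
```

On this sample, raising the threshold to 0.65 drops the two pairs whose App Ids differ most between the stores, so a stricter threshold trades recall for precision.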
