<a href="https://colab.research.google.com/github/amitkp57/Jupyter/blob/pramit-dev/Jupyter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Move to GPU mode if you are in Google Colab
Go to `Runtime` -> `Change runtime type` to activate GPU.

### Provide your credentials to the runtime


In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


### Mount Google Drive

In [2]:
import os
from google.colab import drive
drive.mount('/content/gdrive')
WORKING_DIRECTORY = '/content/gdrive/MyDrive/Data/Jupyter'
os.environ['WORKING_DIRECTORY'] = WORKING_DIRECTORY
%cd $WORKING_DIRECTORY
!ls -latr

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/MyDrive/Data/Jupyter
total 25
-rw------- 1 root root  530 Mar 10 12:53 requirements.txt
-rw------- 1 root root    9 Mar 10 12:53 README.md
-rw------- 1 root root   85 Mar 10 12:53 main.py
-rw------- 1 root root 1069 Mar 10 12:53 LICENSE
-rw------- 1 root root 3565 Mar 10 12:53 Jupyter.ipynb
-rw------- 1 root root 1824 Mar 10 12:53 .gitignore
drwx------ 2 root root 4096 Mar 10 12:53 data
drwx------ 3 root root 4096 Mar 10 12:53 scripts
drwx------ 8 root root 4096 Mar 10 12:53 .git
-rw------- 1 root root    0 Mar 10 12:53 app.log
drwx------ 3 root root 4096 Mar 10 13:07 results


### Clone git repo

In [3]:
%cd $WORKING_DIRECTORY
# !git clone https://github.com/amitkp57/Jupyter
!git reset --hard
!git pull origin pramit-dev
# !pip install -r requirements.txt
!pip install datasketch

/content/gdrive/MyDrive/Data/Jupyter
HEAD is now at d20cbaa calculate_jaccard_similarity
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 16 (delta 12), reused 8 (delta 6), pack-reused 0
Unpacking objects: 100% (16/16), done.
From https://github.com/amitkp57/Jupyter
 * branch            pramit-dev -> FETCH_HEAD
   d20cbaa..c20252e  pramit-dev -> origin/pramit-dev
Updating d20cbaa..c20252e
Fast-forward
 scripts/ClusterColumns.py | 104 +++++++++++++++++++++++++---------------------
 scripts/QueryDatabase.py  |  20 +++++++++
 2 files changed, 77 insertions(+), 47 deletions(-)


### Set up metadata

Creates `datasets.txt`, `tables.txt` and `columns.json` in the `data` folder. These files are used for querying Google BigQuery tables.

In [4]:
from scripts.MetaData import save_locally
DATA_PATH = f'{WORKING_DIRECTORY}/data'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = f'{DATA_PATH}/amit-pradhan-compute-23315413b3a3.json'
# save_locally(DATA_PATH)
print('Completed!')

Completed!


### Jaccard similarity

Estimate the Jaccard similarity between columns from their MinHash signatures.

In [None]:
import scripts.QueryDatabase as queryDatabase
import scripts.ClusterColumns as clusterColumns
import numpy as np

RESULTS_PATH = f'{WORKING_DIRECTORY}/results'
string_columns = queryDatabase.get_columns('STRING')
# np.savez(f'{RESULTS_PATH}/string_columns.npz', string_columns=string_columns)
clusterColumns.serialize_min_hash(string_columns)
similarity = clusterColumns.calculate_jaccard_similarity(string_columns)
print(np.unique(np.round(similarity.ravel(), 2)))
print('Completed!')
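
For reference, the idea behind the cell above can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/ClusterColumns.py`: the helper names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the use of salted `blake2b` hashes as stand-ins for random permutations, and the choice of 128 hash functions are all assumptions made for the example. Each column is reduced to a fixed-length signature, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying value sets.

```python
import hashlib

NUM_PERM = 128  # number of hash functions (assumed value, not taken from the repo)


def _hash(value: str, seed: int) -> int:
    """Return a 64-bit hash of `value`; each seed acts as one 'permutation'."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_perm=NUM_PERM):
    """Keep the minimum hash of the distinct values under each seeded hash."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a MinHash signature for an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical columns that share half of their values.
    col_a = [str(i) for i in range(1000)]
    col_b = [str(i) for i in range(500, 1500)]
    print("exact:   ", round(exact_jaccard(col_a, col_b), 3))
    print("estimate:", round(estimate_jaccard(minhash_signature(col_a),
                                              minhash_signature(col_b)), 3))
```

The estimate gets tighter as the number of hash functions grows, at the cost of longer signatures.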

### Find columns with Jaccard similarity greater than a threshold value

We use MinHash-based LSH to find columns whose Jaccard similarity exceeds a given threshold.

In [12]:
import scripts.QueryDatabase as queryDatabase
import scripts.ClusterColumns as clusterColumns

RESULTS_PATH = f'{WORKING_DIRECTORY}/results'
string_columns = queryDatabase.get_columns('STRING')
# np.savez(f'{RESULTS_PATH}/string_columns.npz', string_columns=string_columns)
lsh = clusterColumns.build_minhash_lsh(string_columns, threshold=0.7)
print(clusterColumns.get_all_similar_columns(lsh, string_columns[0]))
print('Completed!')

['bigquery-public-data.covid19_public_forecasts.county_14d.county_fips_code', 'bigquery-public-data.covid19_symptom_search.symptom_search_sub_region_2_daily.sub_region_2_code', 'bigquery-public-data.covid19_public_forecasts_asia_ne1.county_28d_historical_.county_fips_code', 'bigquery-public-data.covid19_symptom_search.symptom_search_sub_region_2_weekly.sub_region_2_code', 'bigquery-public-data.covid19_public_forecasts.county_14d_historical_.county_fips_code', 'bigquery-public-data.covid19_public_forecasts.county_28d_historical_.county_fips_code', 'bigquery-public-data.covid19_public_forecasts_asia_ne1.county_28d.county_fips_code', 'bigquery-public-data.covid19_aha.hospital_beds.county_fips_code', 'bigquery-public-data.covid19_public_forecasts.county_28d_historical.county_fips_code', 'bigquery-public-data.covid19_public_forecasts.county_28d.county_fips_code', 'bigquery-public-data.covid19_jhu_csse.summary.fips', 'bigquery-public-data.covid19_aha.staffing.county_fips_code', 'bigquery-pub
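
To show why an LSH index can return near-duplicate columns without comparing every pair, here is a rough sketch of the banding technique that MinHash LSH is built on. It is not the implementation behind `build_minhash_lsh`; the class `BandedLSH`, the helper `minhash_signature`, and the layout of 16 bands of 8 rows are assumptions chosen for illustration. With `b` bands of `r` rows, the similarity at which two columns become likely candidates is roughly `(1/b) ** (1/r)`, which is about 0.71 for this layout.

```python
import hashlib
from collections import defaultdict


def minhash_signature(values, num_perm=128):
    """MinHash signature built from seeded blake2b hashes (illustrative helper)."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a MinHash signature for an empty column")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(v.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for v in distinct
        ))
    return signature


class BandedLSH:
    """Split each signature into bands; columns sharing any band are candidates."""

    def __init__(self, bands=16, rows=8):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that collide with `signature` in at least one band."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


if __name__ == "__main__":
    # Hypothetical columns: two near-duplicates and one unrelated column.
    columns = {
        "table_a.code": [str(i) for i in range(1000)],
        "table_b.code": [str(i) for i in range(10, 1010)],
        "table_c.name": [str(i) for i in range(5000, 6000)],
    }
    lsh = BandedLSH()
    signatures = {name: minhash_signature(vals) for name, vals in columns.items()}
    for name, sig in signatures.items():
        lsh.insert(name, sig)
    print(sorted(lsh.query(signatures["table_a.code"])))
```

Candidates returned this way are approximate: some pairs above the threshold can be missed and some below it can appear, so a follow-up similarity check on the candidates is a common refinement.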

### Top-k similar columns based on Jaccard similarity

We use a MinHash LSH Forest to query the top-k columns by Jaccard similarity.

In [10]:
import scripts.QueryDatabase as queryDatabase
import scripts.ClusterColumns as clusterColumns

string_columns = queryDatabase.get_columns('STRING')
forest = clusterColumns.build_lsh_forest(string_columns)
print(clusterColumns.get_top_k(forest, string_columns[0], 10))
print('Completed!')

['bigquery-public-data.covid19_google_mobility.mobility_report.census_fips_code', 'bigquery-public-data.covid19_nyt.us_counties.county_fips_code', 'bigquery-public-data.covid19_public_forecasts.county_14d.county_fips_code', 'bigquery-public-data.covid19_aha.hospital_beds.county_fips_code', 'bigquery-public-data.covid19_jhu_csse.summary.fips', 'bigquery-public-data.covid19_symptom_search.symptom_search_sub_region_2_daily.sub_region_2_code', 'bigquery-public-data.covid19_nyt.mask_use_by_county.county_fips_code', 'bigquery-public-data.covid19_google_mobility_eu.mobility_report.census_fips_code', 'bigquery-public-data.covid19_aha.staffing.county_fips_code', 'bigquery-public-data.covid19_jhu_csse_eu.summary.fips']
Completed!
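
Because an LSH Forest returns approximate neighbours, it can be useful to compare its output with an exact ranking on a small sample. The sketch below is such a brute-force baseline, not the forest itself and not code from the repository; the function names `exact_jaccard` and `top_k_exact` and the sample column names are hypothetical. It scans every column, so it is only practical for sanity checks.

```python
import heapq


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_exact(query_values, columns, k):
    """Rank every column by exact Jaccard similarity to the query; return the k best names."""
    query = set(query_values)
    scored = (
        (exact_jaccard(query, set(values)), name)
        for name, values in columns.items()
    )
    return [name for _, name in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    # Hypothetical columns keyed by fully qualified name.
    columns = {
        "dataset.table_a.fips": {"01001", "01003", "01005"},
        "dataset.table_b.fips": {"01001", "01003"},
        "dataset.table_c.state": {"AL", "AK"},
    }
    print(top_k_exact({"01001", "01003", "01005"}, columns, 2))
```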


### ScratchPad
Try anything here!

In [5]:
import scripts.QueryDatabase as queryDatabase
import scripts.ClusterColumns as clusterColumns
import numpy as np

RESULTS_PATH = f'{WORKING_DIRECTORY}/results'
string_columns = queryDatabase.get_columns('STRING')
# np.savez(f'{RESULTS_PATH}/string_columns.npz', string_columns=string_columns)
clusterColumns.serialize_min_hash(string_columns)
similarity = clusterColumns.calculate_jaccard_similarity(string_columns)
# print(np.unique(np.round(similarity.ravel(), 2)))
print('Completed!')

NameError: ignored