locationIntelligencePipeline

1. linkCompanyAndLocation_v2_focus_on_location.py

Function: Builds the relationship between companies and locations when only the geographic positions of both are given. A company may be assigned to a single location or to multiple locations within range.

Usage: --run_root [path of data] --ls_card [file name of the location scorecard (location features)] --apps [suffix appended to output file names] --geobit [geohash precision used for matching] --dist_thresh [distance threshold used to discard companies that share a geohash with a location but are too far from it]
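
For orientation, a minimal argparse sketch of the documented flags; argument types and the geobit default are assumptions, not taken from the script:

```python
import argparse

# Sketch of the documented CLI; types and the geobit default are assumptions.
parser = argparse.ArgumentParser(
    description='Link companies to locations by geohash proximity.')
parser.add_argument('--run_root', type=str, help='path of the data directory')
parser.add_argument('--ls_card', type=str,
                    help='file name of the location scorecard (location features)')
parser.add_argument('--apps', type=str, help='suffix appended to output file names')
parser.add_argument('--geobit', type=int, default=6,
                    help='geohash precision used for matching')
parser.add_argument('--dist_thresh', type=float,
                    help='distance threshold for discarding far-away companies')
args = parser.parse_args()
```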

Inputs:

1.1. location_scorecard:

    Building features, including building class, size, city, wework_belongs, and so on, keyed by 'atlas_location_uuid'.

1.2. Company features, one file per city:

		E.g: ['dnb_pa.csv', 'dnb_sf.csv', 'dnb_sj.csv', 'dnb_Los_Angeles.csv', 'dnb_New_York.csv']

		Each file contains company records keyed by 'duns_number'.

Outputs:

A file containing company-location pairs, referred to below as the 'company-location relationship file'.
It is named ['PA', 'SF', 'SJ', 'LA', 'NY'] + apps + '.csv'.
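
Conceptually, the matching encodes companies and locations into geohash cells at --geobit precision, joins on the cell, and then drops pairs farther apart than --dist_thresh. A minimal sketch of that idea with pandas, assuming the pygeohash package and hypothetical 'latitude'/'longitude' column names (the script's actual implementation may differ):

```python
import math
import pandas as pd
import pygeohash  # assumed dependency; the script may use a different geohash library

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def link_company_location(comp: pd.DataFrame, loc: pd.DataFrame,
                          geobit: int = 6, dist_thresh: float = 500.0) -> pd.DataFrame:
    """Pair companies with locations sharing a geohash cell, keeping pairs
    within dist_thresh meters. Coordinate column names are hypothetical."""
    comp = comp.assign(gh=[pygeohash.encode(a, b, precision=geobit)
                           for a, b in zip(comp['latitude'], comp['longitude'])])
    loc = loc.assign(gh=[pygeohash.encode(a, b, precision=geobit)
                         for a, b in zip(loc['latitude'], loc['longitude'])])
    pairs = comp.merge(loc, on='gh', suffixes=('_comp', '_loc'))
    dist = [haversine_m(r['latitude_comp'], r['longitude_comp'],
                        r['latitude_loc'], r['longitude_loc'])
            for _, r in pairs.iterrows()]
    mask = pd.Series(dist, index=pairs.index) <= dist_thresh
    return pairs.loc[mask, ['duns_number', 'atlas_location_uuid']]
```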

2. get_csv_for_training_and_testing.py

Function: Generates the training/testing files for ML.

Usage: --run_root [path of data] --ls_card [file name of the location scorecard (location features)] --app_date [date suffix appended to output file names] --ratio [ratio of training samples to testing samples]
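
The '82split' in the output file name below suggests an 8:2 train:test split. A minimal sketch of how such a split could be built, where the train set keeps only positive (P) pairs and the test set adds sampled negatives; the negative-sampling strategy shown is an assumption, not the script's documented behavior:

```python
import pandas as pd

def split_pairs(pos_pairs: pd.DataFrame, ratio: float = 0.8, seed: int = 0):
    """Split positive company-location pairs; train is P-only, test adds
    negatives made by shuffling locations (a simplification: a shuffled
    pair could coincide with a true positive)."""
    shuffled = pos_pairs.sample(frac=1.0, random_state=seed)
    n_train = int(len(shuffled) * ratio)
    train = shuffled.iloc[:n_train].assign(label=1)
    test_pos = shuffled.iloc[n_train:].assign(label=1)
    test_neg = test_pos.assign(
        atlas_location_uuid=test_pos['atlas_location_uuid']
                            .sample(frac=1.0, random_state=seed + 1).values,
        label=0)
    return train, pd.concat([test_pos, test_neg], ignore_index=True)
```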

Inputs:

2.1. location_scorecard:

	Building features, including building class, size, city, wework_belongs, and so on, keyed by 'atlas_location_uuid'.

2.2. Company features, one file per city:

	E.g: ['dnb_pa.csv', 'dnb_sf.csv', 'dnb_sj.csv', 'dnb_Los_Angeles.csv', 'dnb_New_York.csv']

	Each file contains company records keyed by 'duns_number'.

2.3. company-location relationship file:

	E.g: ['PA', 'SF', 'SJ', 'LA', 'NY'] + apps.

	Generated beforehand by linkCompanyAndLocation_v2_focus_on_location.py.

Outputs:

Files that can be used to train a model:

'train_val_test_location_company_82split' + apps: Train/test pairs; the train set contains only positive (P) pairs, while the test set contains both positive and negative (P/N) pairs.

'company_feat' + apps: Normalized features of each company.

'location_feat' + apps: Normalized features of each location.

'comp_feat_norm_param' + app_date + '.pkl': Normalization parameters for the continuous company features: mean/std and column names.

'loc_feat_norm_param' + app_date + '.pkl': Normalization parameters for the continuous location features: mean/std and column names.

'comp_feat_dummy_param' + app_date + '.pkl': Dummy-encoding parameters for the categorical company features: {key: original column name, value: category list}.

'loc_feat_dummy_param' + app_date + '.pkl': Dummy-encoding parameters for the categorical location features: {key: original column name, value: category list}.
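
The four .pkl files record how the features were normalized and dummy-encoded so the same transform can be replayed later (see script 3). A minimal sketch of producing them, with hypothetical column lists:

```python
import pickle
import pandas as pd

def fit_feature_params(df: pd.DataFrame, cont_cols: list, cat_cols: list):
    """Collect mean/std for continuous columns and category lists for
    categorical columns. Column lists are hypothetical."""
    norm_param = {'columns': cont_cols,
                  'mean': df[cont_cols].mean(),
                  'std': df[cont_cols].std()}
    dummy_param = {c: sorted(df[c].dropna().unique().tolist()) for c in cat_cols}
    return norm_param, dummy_param

# Persisting the company-side parameters under the documented file names:
# norm_param, dummy_param = fit_feature_params(comp_df, cont_cols, cat_cols)
# with open('comp_feat_norm_param' + app_date + '.pkl', 'wb') as f:
#     pickle.dump(norm_param, f)
# with open('comp_feat_dummy_param' + app_date + '.pkl', 'wb') as f:
#     pickle.dump(dummy_param, f)
```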

3. get_csv_for_new_city_addtionally.py

Function: Generates normalized feature files for an additional city, reusing the normalization parameters generated previously.

Usage: --run_root [path of data] --ls_card [file name of the location scorecard (location features)] --app_date [date suffix appended to output file names] --ratio [ratio of training samples to testing samples]

Inputs:

3.1. location_scorecard:

	Building features, including building class, size, city, wework_belongs, and so on, keyed by 'atlas_location_uuid'.

3.2. Company features, one file per city:

	E.g: ['dnb_pa.csv']

	Each file contains company records keyed by 'duns_number'.

3.3. company-location relationship file:

	E.g: ['PA'] + apps.

	Generated beforehand by linkCompanyAndLocation_v2_focus_on_location.py.

3.4. Normalization parameter files:

	'comp_feat_norm_param' + app_date + '.pkl': Normalization parameters for the continuous company features: mean/std and column names.

	'loc_feat_norm_param' + app_date + '.pkl': Normalization parameters for the continuous location features: mean/std and column names.

	'comp_feat_dummy_param' + app_date + '.pkl': Dummy-encoding parameters for the categorical company features: {key: original column name, value: category list}.

	'loc_feat_dummy_param' + app_date + '.pkl': Dummy-encoding parameters for the categorical location features: {key: original column name, value: category list}.

Outputs:

'train_val_test_location_company_82split' + appsadd: Train/test pairs; the train set contains only positive (P) pairs, while the test set contains both positive and negative (P/N) pairs.

'company_feat' + appsadd: Normalized features of each company.

'location_feat' + appsadd: Normalized features of each location.
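
The point of this script is to reuse the stored parameters instead of refitting them, so a new city's features land in the same space as the original training data. A minimal sketch of replaying the transform, under the same assumptions as the sketch in section 2:

```python
import pickle
import pandas as pd

def apply_feature_params(df: pd.DataFrame, norm_param: dict, dummy_param: dict) -> pd.DataFrame:
    """Normalize continuous columns with the stored mean/std and dummy-encode
    categorical columns against the stored category lists."""
    cont = (df[norm_param['columns']] - norm_param['mean']) / norm_param['std']
    dummies = [pd.get_dummies(df[col].astype(pd.CategoricalDtype(categories=cats)),
                              prefix=col)
               for col, cats in dummy_param.items()]
    return pd.concat([cont] + dummies, axis=1)

# Replaying the company-side transform for a new city (file names as documented above):
# with open('comp_feat_norm_param' + app_date + '.pkl', 'rb') as f:
#     norm_param = pickle.load(f)
# with open('comp_feat_dummy_param' + app_date + '.pkl', 'rb') as f:
#     dummy_param = pickle.load(f)
# company_feat = apply_feature_params(new_city_df, norm_param, dummy_param)
```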

4. get_sub_recommend_reason_after_similarity.py

Function: Generates a reason for each <cid, bid, score> pair (company id, building id, similarity score).

Usage: --run_root [path of data] --ls_card [file name of the location scorecard (location features)] --apps [suffix aligned with the output files from the previous parts of the pipeline] --sampled [if True, the 'sampled_' prefix is added to file names] --ww [if True, the 'ww_' prefix is added to file names]
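
The --sampled and --ww flags only affect file naming. A one-function sketch of the prefix logic implied by the flag descriptions and the example file name in 4.3 (function and variable names are hypothetical):

```python
def similarity_filename(city: str, apps: str, sampled: bool, ww: bool) -> str:
    """Build the similarity file name with the optional prefixes."""
    prefix = ('sampled_' if sampled else '') + ('ww_' if ww else '')
    return prefix + city + '_similarity' + apps

# similarity_filename('PA', apps, sampled=True, ww=True)
# -> 'sampled_ww_PA_similarity' + apps, matching the example in 4.3
```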

Inputs:

4.1. location_scorecard:

	Building features, including building class, size, city, wework_belongs, and so on, keyed by 'atlas_location_uuid'.

4.2. Company features, one file per city:

	E.g: ['dnb_pa.csv']

	Each file contains company records keyed by 'duns_number'.

4.3. company-location similarity score file:

	E.g: ['sampled_ww_PA_similarity'] + apps.
	
	Each file contains the similarity scores between companies and buildings.

Outputs:

'dlsub_sampled_ww_PA_similarity' + apps: The column named 'note' stores the reason.

It is used for uploading.
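
How the reason text is composed is internal to the script; purely as an illustration, a hypothetical sketch that fills a 'note' column for each <cid, bid, score> row, joining in the location's city from the scorecard:

```python
import pandas as pd

def add_reason(sim: pd.DataFrame, loc: pd.DataFrame) -> pd.DataFrame:
    """Attach a human-readable 'note' to each company-building similarity row.
    The wording and the joined fields are hypothetical."""
    merged = sim.merge(loc[['atlas_location_uuid', 'city']],
                       on='atlas_location_uuid', how='left')
    merged['note'] = ['similarity score %.2f with this building in %s' % (s, c)
                      for s, c in zip(merged['score'], merged['city'])]
    return merged
```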
