# Workflow for age and country information retrieval

In this notebook I describe the workflow I have been following to retrieve age and country information. The models we use for semantic search are a big help but not perfect, so to be sure the information is correct we have to check it manually; however, nobody wants to re-check the same users' information every time we scrape new profiles. The models have different advantages and disadvantages (detailed in a separate R notebook), so they usually retrieve different users, and we would like to use as much of that information as we can.

First we import the `InformationRetrieval` class and load the ids for every URL.

In [1]:
import pandas as pd
from nannies_encoder_search import InformationRetrieval

link_ids = pd.read_csv("../data/input/link_ids.csv")

Next we load the objects, one using the bi-encoder and another using BM25. Note that this notebook is meant to be used once the embeddings are already downloaded; otherwise set `load_embeddings=False`, in which case I recommend using a GPU.

In [2]:
nannies_ir = InformationRetrieval('multi-qa-mpnet-base-cos-v1', load_embeddings=True, save_results=False, 
                                  path_to_load_embeddings="../data/output/python_tests/multi-qa-mpnet-base-cos-v1/embeddings_2023-10-23.pkl")
nannies_ir_bm25 = InformationRetrieval('bm25', save_results=False)

nannies_ir.embeddings_loader()
nannies_ir.data = nannies_ir.data.merge(link_ids, on="link", how="left")
nannies_ir_bm25.data = nannies_ir.data

Getting the data...
Setting up the model...
Getting the data...
Setting up the model...


These helper functions are used throughout the rest of the notebook.

In [61]:
def save_to_folder(main_path, name):
    return main_path + name + nannies_ir.date + ".csv"

# Save one results to a csv and manually check it
#results_query_bm[results_query_bm.score > 0].to_csv(query_output_bm, index=False)

# Retrieve countries or ages from a specific IR object, searching only the
# sentences of users that are not yet in `current_results`.

def get_new_results(ir_object, retrieval_method, current_results, alt_query=None, path_to_csv=None):
    # Anti-join: keep only the rows whose id is missing from current_results
    sentences = ir_object.data.merge(current_results["id"], on="id", how="left", indicator=True) \
                              .drop_duplicates() \
                              .reset_index(drop=True) \
                              .query('_merge == "left_only"').sentences

    # Look up the retrieval method by name, e.g. "retrieve_countries"
    retrieval_function = getattr(ir_object, retrieval_method)

    new_results = retrieval_function(sentences=sentences)
    new_results = new_results.merge(link_ids, on="link", how="left").drop("link", axis=1)
    new_results["model"] = ir_object.model_name

    if alt_query:
        new_results_alt = retrieval_function(sentences=sentences, query=alt_query)
        new_results_alt = new_results_alt.merge(link_ids, on="link", how="left").drop("link", axis=1)
        new_results_alt["model"] = ir_object.model_name

        # Keep the highest-scoring row per user across both queries
        new_results = pd.concat([new_results, new_results_alt]) \
                        .sort_values("score").drop_duplicates(subset="id", keep="last")

    if path_to_csv:
        new_results.to_csv(path_to_csv, index=False)

    return new_results
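
The first step of `get_new_results` is an anti-join: it keeps only the users whose id is not yet in the checked results. A minimal sketch of that pattern on toy data (the ids, sentences and the `profiles`/`checked` names below are made up for illustration):

```python
import pandas as pd

# Toy data: hypothetical ids and sentences, not taken from the real dataset.
profiles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sentences": ["I am from Peru.", "I love kids.", "I grew up in Kenya.", "I am 30."],
})
checked = pd.DataFrame({"id": [1, 3], "country": ["Peru", "Kenya"]})

# indicator=True adds a `_merge` column: "both" for ids found in `checked`,
# "left_only" for ids that only exist in `profiles`.
merged = profiles.merge(checked[["id"]], on="id", how="left", indicator=True)
unchecked = merged.query('_merge == "left_only"').drop(columns="_merge")

print(unchecked["id"].tolist())
```

Only ids 2 and 4 remain, so only their sentences would be searched again.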

## Country of origin

In [4]:
canadian_provinces = ["ontario", "nova scotia", "new brunswick", "alberta", "british columbia",
                      "quebec", "saskatchewan", "manitoba", "newfoundland", "prince edward island",
                      "montreal", "toronto", "Canada", "United Kingdom of Great Britain and Northern Ireland",
                      "United States of America"]

Load the previous results, which have already been manually checked.

In [68]:
user_countries_path = "../data/output/python_tests/user_countries.csv"
user_countries = pd.read_csv(user_countries_path).dropna()
user_countries

Unnamed: 0,id,query,country,model
0,21613,I am from,Ukraine,bm25
1,23620,I am from,Japan,bm25
2,22049,I am from,Thailand,bm25
3,23103,I am from,India,bm25
4,22455,I am from,Philippines,bm25
...,...,...,...,...
515,12902,Where are you from?,Brazil,multi-qa-mpnet-base-cos-v1
516,13383,Where are you from?,Philippines,multi-qa-mpnet-base-cos-v1
517,13112,Where are you from?,Ireland[i],multi-qa-mpnet-base-cos-v1
518,7102,Where are you from?,Nigeria,multi-qa-mpnet-base-cos-v1


We take the sentences from users whose country of origin has not been retrieved yet and search through them with the BM25 model using both queries, "I am from" and "Where are you from?", via the previously defined `get_new_results` function. We start with BM25 in order to get the "easiest" countries first.

In [8]:
get_new_results(nannies_ir_bm25, 
                "retrieve_countries", 
                user_countries, 
                alt_query="Where are you from?", 
                path_to_csv=save_to_folder(nannies_ir_bm25.output_path, "countries_temp_"))
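
To give an intuition for why BM25 finds the "easiest" cases, here is a rough, self-contained sketch of Okapi BM25 scoring. It is not the implementation used inside `nannies_encoder_search`; the function name `bm25_scores`, the whitespace tokenizer, the parameter values and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical sentences, not taken from the real profiles
docs = [
    "I am from Brazil and I love kids",
    "Experienced nanny available now",
    "I am a student",
]
scores = bm25_scores("I am from", docs)
```

A sentence that shares no words with the query scores exactly 0, which is why BM25 mostly surfaces sentences that literally contain the query terms.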

Next we open the saved CSV in Excel and manually check the results; then we reload them and add them to the `user_countries` dataframe.

In [69]:
new_results_bm = pd.read_csv(save_to_folder(nannies_ir_bm25.output_path, "countries_temp_")).dropna()
user_countries = pd.concat([user_countries, new_results_bm[["id", "query", "country", "model"]]]).reset_index(drop=True)

We repeat the procedure, but now using the bi-encoder model.

In [62]:
get_new_results(nannies_ir, 
                retrieval_method="retrieve_countries", 
                current_results=user_countries, 
                alt_query="Where are you from?", 
                path_to_csv=save_to_folder(nannies_ir.output_path, "countries_temp_"))
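
A bi-encoder embeds the query and each sentence independently and ranks sentences by vector similarity, so it can match sentences that share no words with the query. The sketch below only illustrates the ranking step with cosine similarity; the three-dimensional vectors are hypothetical stand-ins for real embeddings, and how `nannies_encoder_search` does this internally is not shown in this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings"; a real model produces much longer vectors.
query_vec = [0.9, 0.1, 0.0]  # stands in for "Where are you from?"
sentence_vecs = {
    "I was born in Brazil.": [0.8, 0.2, 0.1],
    "I have a driver's license.": [0.1, 0.1, 0.9],
}

# Rank sentences from most to least similar to the query
ranked = sorted(
    sentence_vecs,
    key=lambda s: cosine_similarity(query_vec, sentence_vecs[s]),
    reverse=True,
)
```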

In [70]:
new_results_bi = pd.read_csv(save_to_folder(nannies_ir.output_path, "countries_temp_")).dropna()
user_countries = pd.concat([user_countries, new_results_bi[["id", "query", "country", "model"]]]).drop_duplicates(subset="id").reset_index(drop=True)
user_countries.to_csv(user_countries_path, index=False)
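
Because the previously checked rows come first in the concatenation, `drop_duplicates(subset="id")` with its default `keep="first"` lets them win over any newly retrieved row with the same id. A small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows: id 7 was already checked, and a later run retrieved it again.
checked = pd.DataFrame({"id": [7, 8], "country": ["Chile", "Ghana"], "model": ["bm25", "bm25"]})
new_rows = pd.DataFrame({"id": [7, 9], "country": ["China", "Spain"], "model": ["bi", "bi"]})

combined = (
    pd.concat([checked, new_rows])
    .drop_duplicates(subset="id")  # default keep="first": the earlier (checked) row wins
    .reset_index(drop=True)
)
```

In practice `get_new_results` only searches users that are not in `user_countries` yet, so this deduplication mostly acts as a safeguard.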

## Retrieve user ages

Load the previously revised results.

In [None]:
ages_rev = pd.read_csv("../data/output/python_tests/user_ages.csv").dropna()
ages_rev

Unnamed: 0,id,link,age
0,17600,https://canadiannanny.ca/care/nanny-in-calgary...,12.0
1,18706,https://canadiannanny.ca/care/available-i-am-f...,14.0
2,16718,https://canadiannanny.ca/care/experienced-nann...,14.0
3,13373,https://canadiannanny.ca/care/experienced-nann...,15.0
4,6115,https://canadiannanny.ca/care/oakville-ontario...,15.0
...,...,...,...
999,11072,https://canadiannanny.ca/care/i-am-a-64-years-...,64.0
1000,10118,https://canadiannanny.ca/care/experienced-nann...,64.0
1001,9868,https://canadiannanny.ca/care/my-name-is-katha...,65.0
1002,12270,https://canadiannanny.ca/care/talented-caretak...,69.0


Retrieve ages for all profiles again using both models and both queries.

In [None]:
ages_bi = nannies_ir.retrieve_ages(cache_folder=save_to_folder(nannies_ir.output_path, "user_ages_"))
ages_bi_alt = nannies_ir.retrieve_ages(query="How old are you?", cache_folder=save_to_folder(nannies_ir.output_path, "user_ages_alt_query_"))
ages_bm = nannies_ir_bm25.retrieve_ages(cache_folder=save_to_folder(nannies_ir_bm25.output_path, "user_ages_"))
ages_bm_alt = nannies_ir_bm25.retrieve_ages(query="How old are you?", cache_folder=save_to_folder(nannies_ir_bm25.output_path, "user_ages_alt_query_"))

### Bi-encoder
Concatenate the bi-encoder results for both queries, dropping duplicates.

In [None]:
ages_concat_bi = pd.concat([ages_bi, ages_bi_alt])
ages_concat_bi = ages_concat_bi[~ages_concat_bi.age.isna()].sort_values("score").drop_duplicates(subset="link", keep="last")
ages_concat_bi = ages_concat_bi.merge(link_ids, how="left", on="link")
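
The `sort_values("score").drop_duplicates(subset="link", keep="last")` idiom keeps, for each profile, the row with the highest score among those where an age was found. A toy illustration (links, scores and ages are made up):

```python
import pandas as pd

# Hypothetical results for the same two profiles under each query
first_query = pd.DataFrame({
    "link": ["a", "b"], "score": [0.80, 0.40],
    "age": [21.0, float("nan")], "query": "I am years old",
})
second_query = pd.DataFrame({
    "link": ["a", "b"], "score": [0.55, 0.60],
    "age": [21.0, 34.0], "query": "How old are you?",
})

both = pd.concat([first_query, second_query])
best = (
    both[both["age"].notna()]                     # discard rows where no age was found
    .sort_values("score")                         # ascending, so the best row comes last
    .drop_duplicates(subset="link", keep="last")  # keep the best row per profile
)
```

Profile `a` keeps its 0.80 row from the first query, while profile `b` only has an age under the second query and keeps that row.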

Merge the bi-encoder results with the revised results. Profiles that have already been revised take their revised age (`temp_1`); profiles that are not in the revised results yet (`temp_2`) are written to a CSV to be checked.

In [None]:
temp_1 = ages_concat_bi.drop("age", axis=1).merge(ages_rev[["id", "age"]], how="left", on="id").dropna()
temp_2 = ages_concat_bi.merge(ages_rev.drop(["age", "link"], axis=1), how='left', on='id', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])
temp_2.to_csv(save_to_folder(nannies_ir.output_path, "user_ages_temp_"), index=False)
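
The two merges split the retrieved rows into a part that inherits the revised age and a part that still needs checking. A sketch on toy data (the values and the names `retrieved`, `revised`, `already_checked` and `to_check` are illustrative only):

```python
import pandas as pd

# Hypothetical data: the model extracted 40 for id 1, but the manual review recorded 45.
retrieved = pd.DataFrame({"id": [1, 2, 3], "score": [0.9, 0.7, 0.5], "age": [40.0, 22.0, 31.0]})
revised = pd.DataFrame({"id": [1], "age": [45.0]})

# Already reviewed: drop the model's age and take the revised one instead.
already_checked = retrieved.drop(columns="age").merge(revised, how="left", on="id").dropna()

# Not yet reviewed: anti-join; these rows go to the CSV for manual checking.
to_check = (
    retrieved.merge(revised[["id"]], how="left", on="id", indicator=True)
    .query('_merge == "left_only"')
    .drop(columns="_merge")
)
```

Together the two frames cover every retrieved row exactly once, as long as ids are unique in both inputs.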

Once checked, we load the results again and concatenate them with the rest.

In [None]:
temp_2 = pd.read_csv(save_to_folder(nannies_ir.output_path, "user_ages_temp_")).dropna()
temp_2

Unnamed: 0,score,query,link,name,sentences,age,id
17,0.511229,How old are you?,https://canadiannanny.ca/care/part-time-nanny-...,Angie W,"I am in my late 40's, I really adore children,...",45.0,2976
20,0.540234,How old are you?,https://canadiannanny.ca/care/gatineau-nanny-e...,Promesse N,"Hello, I'm Promesse (just promise, no trick pr...",17.0,19764
21,0.557338,I am years old,https://canadiannanny.ca/care/experienced-trus...,Rachel A,"I am mature, in my 30s and take my work seriou...",30.0,762
25,0.596892,I am years old,https://canadiannanny.ca/care/a-mature-qualifi...,Nelfa S,I have 15 yrs.,15.0,7554
26,0.645065,I am years old,https://canadiannanny.ca/care/francophone-baby...,Charlotte D,"My name is Charlotte, I am in my late 20s.",25.0,6435
27,0.730221,I am years old,https://canadiannanny.ca/care/interviewing-for...,June X,I am in my 50's.,50.0,21073


In [None]:
ages_concat_bi_new = pd.concat([temp_1, temp_2]).sort_index()
ages_concat_bi_new["model"] = nannies_ir.model_name
ages_concat_bi_new

Unnamed: 0,score,query,link,name,sentences,id,age,model
1,0.453069,How old are you?,https://canadiannanny.ca/care/flexible-light-h...,Lindsay M,Hello my name is Lindsay McMichael I’m an 18 y...,15866,18.0,multi-qa-mpnet-base-cos-v1
2,0.453109,How old are you?,https://canadiannanny.ca/care/hi-everyone-call...,Cherry C,"Hi everyone calls me Cherry, 36 years of age,m...",13587,36.0,multi-qa-mpnet-base-cos-v1
3,0.453573,How old are you?,https://canadiannanny.ca/care/my-names-dakota-...,Dakota L,"My names Dakota, I am 20 years old and I live ...",9332,20.0,multi-qa-mpnet-base-cos-v1
4,0.453672,How old are you?,https://canadiannanny.ca/care/realiable-and-re...,Nallely R,"Hello, my name is Nallely, I am 29 years old, ...",7860,29.0,multi-qa-mpnet-base-cos-v1
5,0.453676,How old are you?,https://canadiannanny.ca/care/hey-there-my-nam...,Britney F,I am a 24 year old with a lot of background ba...,16779,24.0,multi-qa-mpnet-base-cos-v1
...,...,...,...,...,...,...,...,...
832,0.891232,I am years old,https://canadiannanny.ca/care/available-reliab...,Victoria P,I am 21 years old.,11933,21.0,multi-qa-mpnet-base-cos-v1
833,0.891232,I am years old,https://canadiannanny.ca/care/my-name-is-kiara...,Kiara G,I am 21 years old.,18066,21.0,multi-qa-mpnet-base-cos-v1
834,0.892229,I am years old,https://canadiannanny.ca/care/light-housekeepi...,katie p,I am 17 years old.,17749,17.0,multi-qa-mpnet-base-cos-v1
835,0.897147,I am years old,https://canadiannanny.ca/care/hello-canadian-n...,Sabrina M,I am 24 years old.,14154,24.0,multi-qa-mpnet-base-cos-v1


In [None]:
ages_rev_new = pd.concat([ages_rev, ages_concat_bi_new[["id", "link", "age"]]]) \
                 .sort_values("age") \
                 .drop_duplicates(subset="link", keep="last").dropna() \
                 .reset_index(drop=True)

ages_rev_new.to_csv("../data/output/python_tests/user_ages.csv", index=False)
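
One detail worth keeping in mind: sorting by `age` before `drop_duplicates(..., keep="last")` means that if the same link ever appears with two different ages, the larger age is kept, whichever frame it came from. A toy example with a made-up conflict:

```python
import pandas as pd

# Hypothetical conflict: the same link carries two different ages.
revised = pd.DataFrame({"id": [1], "link": ["x"], "age": [45.0]})
newer = pd.DataFrame({"id": [1], "link": ["x"], "age": [40.0]})

merged = (
    pd.concat([revised, newer])
    .sort_values("age")
    .drop_duplicates(subset="link", keep="last")  # the row with the larger age survives
    .reset_index(drop=True)
)
```

Such conflicts should be rare here, since already revised profiles take their age from the revised file. If the revised value should always win, one possible alternative is to skip the sort and use `keep="first"` with the revised frame listed first.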

### BM25
We repeat the process for the BM25 model. First we concatenate the results of the BM25 model for both queries.

In [None]:
ages_concat_bm = pd.concat([ages_bm, ages_bm_alt])
ages_concat_bm = ages_concat_bm[~ages_concat_bm.age.isna()].sort_values("score").drop_duplicates(subset="link", keep="last")
ages_concat_bm = ages_concat_bm.merge(link_ids, how="left", on="link")
ages_concat_bm

Unnamed: 0,score,query,link,name,sentences,age,id
0,5.716499,How old are you?,https://canadiannanny.ca/care/experienced-nann...,Cathy D,"I'm Cathy, 32 yrs old .",32.0,16832
1,5.716499,How old are you?,https://canadiannanny.ca/care/vancouver-nanny-...,Grace C,23yr old experienced Irish nanny.,23.0,4068
2,5.716499,How old are you?,https://canadiannanny.ca/care/i-m-jenny-pulhin...,Jenny P,"I'm Jenny Pulhin,38 yrs old",38.0,1580
3,5.716499,How old are you?,https://canadiannanny.ca/care/recently-retired...,Kay B,Recently retired 56 year old.,56.0,1346
4,5.762149,How old are you?,https://canadiannanny.ca/care/markham-ontario-...,Neelab W,Rates are $20 per hour and I invite you to mes...,20.0,2728
...,...,...,...,...,...,...,...
611,15.164791,I am years old,https://canadiannanny.ca/care/experienced-chil...,Morgan A,I am 23 years old.,23.0,22988
612,15.164791,I am years old,https://canadiannanny.ca/care/hello-canadian-n...,Sabrina M,I am 24 years old.,24.0,14154
613,15.164791,I am years old,https://canadiannanny.ca/care/experienced-vanc...,Lina L,I am 38 years old.,38.0,11794
614,15.164791,I am years old,https://canadiannanny.ca/care/toronto-babysitt...,Tori C,I am 19 years old.,19.0,21531


Next we merge with the previously revised results and get a dataframe with the entries that are not in them yet.

In [None]:
temp_1 = ages_concat_bm.drop("age", axis=1).merge(ages_rev_new[["id", "age"]], how="left", on="id").dropna()
temp_2 = ages_concat_bm.merge(ages_rev_new.drop(["link", "age"], axis=1), on='id', how='left', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])
temp_2.to_csv(save_to_folder(nannies_ir_bm25.output_path, "user_ages_temp_"), index=False)

After checking manually, we load the results again.

In [None]:
temp_2 = pd.read_csv(save_to_folder(nannies_ir_bm25.output_path, "user_ages_temp_"))
ages_concat_bm_new = pd.concat([temp_1, temp_2]).sort_index().dropna()
ages_concat_bm_new["model"] = nannies_ir_bm25.model_name
ages_concat_bm_new

Unnamed: 0,score,query,link,name,sentences,id,age,model
0,5.716499,How old are you?,https://canadiannanny.ca/care/experienced-nann...,Cathy D,"I'm Cathy, 32 yrs old .",16832,32.0,bm25
1,5.716499,How old are you?,https://canadiannanny.ca/care/vancouver-nanny-...,Grace C,23yr old experienced Irish nanny.,4068,23.0,bm25
2,5.716499,How old are you?,https://canadiannanny.ca/care/i-m-jenny-pulhin...,Jenny P,"I'm Jenny Pulhin,38 yrs old",1580,38.0,bm25
3,5.716499,How old are you?,https://canadiannanny.ca/care/recently-retired...,Kay B,Recently retired 56 year old.,1346,56.0,bm25
6,5.997844,How old are you?,https://canadiannanny.ca/care/trustworthy-sitt...,Vivian M,Im vivian 44years old.,13566,44.0,bm25
...,...,...,...,...,...,...,...,...
609,15.164791,I am years old,https://canadiannanny.ca/care/i-am-jermielyn-p...,Jermielyn P,I am 22 years old.,18073,22.0,bm25
610,15.164791,I am years old,https://canadiannanny.ca/care/a-caring-and-res...,WENYING L,I am 49 years old.,18847,49.0,bm25
612,15.164791,I am years old,https://canadiannanny.ca/care/hello-canadian-n...,Sabrina M,I am 24 years old.,14154,24.0,bm25
613,15.164791,I am years old,https://canadiannanny.ca/care/experienced-vanc...,Lina L,I am 38 years old.,11794,38.0,bm25


Finally we concatenate the revised BM25 results with the previous ones and write them to CSV.

In [None]:
pd.concat([ages_rev_new, ages_concat_bm_new[["id", "link", "age"]]]) \
  .sort_values("age") \
  .drop_duplicates(subset="link", keep="last").dropna() \
  .reset_index(drop=True).to_csv("../data/output/python_tests/user_ages.csv", index=False)