# Solving localization problems using word vectors

---

> ##### Using NLP word vectors in a novel way to solve the problem of localization


### >> Problem Statement:
---
###### A sample third-grade math question from a question bank looks like this:

> *_Frank lives in San Francisco and Elizabeth lives in Los Angeles. If the flight time is 2 hrs when will Elizabeth reach Frank if she starts at 8am in the morning?_*

###### The same question, rewritten for an Indian textbook:
> *_Sanjay Verma lives in Bangalore and Rekha lives in Mumbai. If the flight time is 2 hrs when will Rekha reach Sanjay Verma if she starts at 8am in the morning?_*


Before we start, it's important to know a couple of terms: word embeddings and Word2Vec.

## >> King - Man + Woman = Queen
---
We humans can easily understand the relationship or similarity between words, but this is hard for machines. How do you make a machine understand the similarity between words?

For a computer to perform any **reasoning** on words, we need to represent words numerically as vectors of numbers termed **embeddings**.

Intuitively, where words are similar in some respect, that can be reflected by certain values in their embeddings being similar.
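
To make "similar values in their embeddings" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common way to score how close two embeddings are. The three-dimensional vectors below are invented for illustration; real embeddings are learned from text and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: made up, not taken from any model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# "cat" should score closer to "dog" than to "car" with these toy values.
print(sim_cat_dog > sim_cat_car)  # True
```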

##### Algorithms:
* Word2Vec
* GloVe

These algorithms learn word embeddings from huge text corpora such as Wikipedia.

Analogies like
"man is to king as woman is to ...?" or "Paris is to France as Rome is to ...?"
can often be solved simply by adding and subtracting embeddings.

![Image1](https://blog.acolyer.org/wp-content/uploads/2016/04/word2vec-king-queen-vectors.png)

### The result of the vector composition King – Man + Woman = ?

![Image2](https://blog.acolyer.org/wp-content/uploads/2016/04/word2vec-king-queen-composition.png)
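
The composition can be sketched in a few lines of plain Python. This is a simplified illustration, not how any particular library implements it: the two-dimensional vectors are invented (roughly "royalty" and "gender" axes), and `solve_analogy` is a hypothetical helper name. Like typical analogy queries, it skips the input words when picking the nearest neighbour.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(vectors, positive, negative):
    """Add the `positive` vectors, subtract the `negative` ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Hypothetical toy vectors (assumption: hand-made so the arithmetic is easy to follow).
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

# king - man + woman = [1, -1], which points exactly at "queen" in this toy space.
analogy_answer = solve_analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(analogy_answer)  # queen
```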




So here, we will be solving the localization problem, but what is localization?

>  _**Localization** is the general concept of adapting a product or idea to a different country or region respecting local norms, customs, and any other preferences. The goal is to resonate with the target audience for whom the content is localized._

#### In simple terms, this conversion is basically localization.
> *_Frank lives in San Francisco and Elizabeth lives in Los Angeles. If the flight time is 2 hrs when will Elizabeth reach Frank if she starts at 8am in the morning?_*

> *_Sanjay Verma lives in Bangalore and Rekha lives in Mumbai. If the flight time is 2 hrs when will Rekha reach Sanjay Verma if she starts at 8am in the morning?_*

## >> Let's Approach the problem:

Now let’s look at how we can localize our original USA math question to the Indian context.

> **Frank** lives in **San Francisco** and **Elizabeth** lives in **Los Angeles**. If the flight time is 2 hrs when will **Elizabeth** reach **Frank** if she starts at 8am in the morning?

**Goal:** First find which words need to be localized, then localize them.

---
###### How to find that?
> We will use a library called **spaCy**, a free, open-source library for advanced Natural Language Processing (NLP) in Python.

---
![7bf65640.png](https://miro.medium.com/max/875/1*EgZzlN3IdU6Q7Js7p6P7WA.png)

**Step-1:** The highlights above (Person, GPE, Time) are named entities. We will use **spaCy Named Entity Recognition** to find them.


**Step-2:** Filter out named entities that are irrelevant. For example, entities like numbers (cardinal) and time don’t need localization in our case.

**Step-3:** Now comes the most interesting part. We will use the King - Man + Woman = Queen framework to convert each of the entities.

* **Frank** - USA + India = Sanjay Verma
* **San Francisco** - USA + India = Bangalore
* **Elizabeth** - USA + India = Rekha
* **Los Angeles** - USA + India = Mumbai

**Step-4:** We go back and swap each entity for its replacement to get

> **Sanjay Verma** lives in **Bangalore** and **Rekha** lives in **Mumbai**. If the flight time is 2 hrs when will **Rekha** reach **Sanjay Verma** if she starts at 8am in the morning?

---

### >> Let's Code it out!

### Step 1: Extract entities that need to be localized
Example 1 is completely automated and needs no manual intervention.

In [None]:
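import spacy
from spacy import displacy
# spaCy v3 removed the "en" shortcut, so load a model by its full name.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")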

In [None]:
original_input = "Frank lives in San Francisco and Elizabeth lives in Los Angeles. If the flight time is 2 hrs when will Elizabeth reach Frank if she starts at 8am in the morning?"
processed_input_text = nlp(original_input)

# Collect each distinct entity once, together with its label
keyword_set = set()
entity_mapping = []
for ent in processed_input_text.ents:
  if ent.text not in keyword_set:
    keyword_set.add(ent.text)
    entity_mapping.append((ent.text, ent.label_))

print (entity_mapping)

# Display the entities
displacy.render(processed_input_text, style='ent', jupyter=True)

[('Frank', 'PERSON'), ('San Francisco', 'GPE'), ('Elizabeth', 'PERSON'), ('Los Angeles', 'GPE'), ('2', 'CARDINAL'), ('8am in the morning', 'TIME')]


In [None]:
# Not all entities need to be localized. For example, there is no need to localize numbers. So keep only the relevant entity types.
keep_entities_list = ['PERSON','GPE','FAC','ORG','PRODUCT','NORP','MONEY','LOC','WORK_OF_ART','LAW','LANGUAGE','QUANTITY']
finalized_entity_mapping = {}
for ent in entity_mapping:
  if ent[1] in keep_entities_list:
    finalized_entity_mapping[ent[0]] = []

print (finalized_entity_mapping)

{'Frank': [], 'San Francisco': [], 'Elizabeth': [], 'Los Angeles': []}


### Step 2: Initialize the Google news word vectors from Gensim and perform localization

In [None]:
import gensim.downloader as api
# api.load returns the word vectors (a KeyedVectors object) directly, so no .wv lookup is needed
model = api.load("word2vec-google-news-300")





In [None]:
Origin_country='USA' 
Target_country='India'

final_mapping ={}

for word in finalized_entity_mapping: 
  # Multi-word entries in the Google News vectors are joined with underscores, e.g. San_Francisco
  word = word.strip().replace(" ","_")
  try:
    # Target_country + word - Origin_country, e.g. India + Frank - USA
    similar_words_list= model.most_similar(positive=[Target_country,word],negative=[Origin_country],topn=10)
    # Remove the scores for the retrieved choices
    similar_words_list = [choices[0].replace("_"," ") for choices in similar_words_list ]
    final_mapping[word.replace("_"," ")] = similar_words_list
  except KeyError:
    # Raised when the word is missing from the model vocabulary
    similar_words_list = []
    print (" Fetching similar words failed for ",word)
  print (word," -- Replacement suggestions -- ",similar_words_list)




Frank  -- Replacement suggestions --  ['Sanjay Verma', 'Sabyasachi Sen', 'JK Jain', 'Sunil Chauhan', 'Don', 'Sudip', 'Ajay Shankar', 'Robert', 'V. Srinivasan', 'Kanwar Sain']
San_Francisco  -- Replacement suggestions --  ['Bangalore', 'Kolkata', 'Mumbai', 'Chennai', 'Delhi', 'Hyderabad', 'Calcutta', 'San Franciso', 'Bombay', 'Bengaluru']
Elizabeth  -- Replacement suggestions --  ['Rekha', 'Nandita', 'Meera', 'Margaret', 'Katharine', 'Bhagirath', 'Monica', 'Lakshmi', 'Manisha', 'Anita']
Los_Angeles  -- Replacement suggestions --  ['Mumbai', 'Los Angles', 'Kolkata', 'Chennai', 'Bangalore', 'LA', 'Delhi', 'Hyderabad', 'Ahmedabad', 'Calcutta']
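
Some of the suggestions above are just alternate spellings of the source entity ('San Franciso', 'Los Angles'). In this run they never rank first, but if one did, taking the top suggestion would leave the text effectively unlocalized. One possible guard, sketched here with the standard-library `difflib`, is to skip candidates that look too much like the original. `pick_replacement` and the `max_similarity` threshold of 0.8 are hypothetical choices for illustration, and the suggestion list is reordered on purpose to trigger the filter.

```python
from difflib import SequenceMatcher

def pick_replacement(original, candidates, max_similarity=0.8):
    """Return the first candidate that is not a near-duplicate spelling of `original`.

    Returns None when every candidate looks like the original.
    """
    for candidate in candidates:
        ratio = SequenceMatcher(None, original.lower(), candidate.lower()).ratio()
        if ratio < max_similarity:
            return candidate
    return None

# Hypothetical ordering: the misspelling is placed first to show it being skipped.
suggestions = ["Los Angles", "Mumbai", "Kolkata"]
chosen = pick_replacement("Los Angeles", suggestions)
print(chosen)  # Mumbai
```

String similarity only catches spelling variants. Abbreviations such as 'LA' or non-localized names such as 'Robert' would still need a different check or a manual review.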


In [None]:
from IPython.display import Markdown, display

#  Localization here assumes the first suggestion returned is the correct choice.
#  Elizabeth  -- Replacement suggestions --  ['Rekha', 'Nandita', 'Meera', 'Margaret', 'Katharine', 'Bhagirath', 'Monica', 'Lakshmi', 'Manisha', 'Anita']
#  For example, Elizabeth is replaced with Rekha.

#  This function wraps the changed entities in ** so they render in bold.
def prepare_string(sentence,mapping,orig=True):
  if orig:
    for k in mapping:
      sentence = sentence.replace(k,"**"+k+"**")
  else:
    for k in mapping:
      sentence = sentence.replace(mapping[k][0],"**"+mapping[k][0]+"**")

  return sentence


def localize(sentence,mapping):
  for k in mapping:
    sentence = sentence.replace(k,mapping[k][0])
  return sentence


def printmd(string):
    display(Markdown(string))



print('Original Sentence:')
printmd(prepare_string(original_input,final_mapping))

localized_string =  localize(original_input,final_mapping)

print('\nLocalized Sentence:')
printmd(prepare_string(localized_string,final_mapping,orig=False))

Original Sentence:


**Frank** lives in **San Francisco** and **Elizabeth** lives in **Los Angeles**. If the flight time is 2 hrs when will **Elizabeth** reach **Frank** if she starts at 8am in the morning?


Localized Sentence:


**Sanjay Verma** lives in **Bangalore** and **Rekha** lives in **Mumbai**. If the flight time is 2 hrs when will **Rekha** reach **Sanjay Verma** if she starts at 8am in the morning?
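
Plain `str.replace` works for this sentence, but it also matches inside longer words (for example "Frank" inside "Frankfurt"), and sequential replacements can rewrite text that an earlier replacement just inserted. A possible alternative, sketched below with the standard-library `re` module, does all replacements in a single pass and matches whole words only. `localize_whole_words` is a hypothetical helper, and it assumes a flat `{original: replacement}` dict rather than the lists of suggestions stored in `final_mapping`.

```python
import re

def localize_whole_words(sentence, replacements):
    """Replace each key of `replacements` with its value in one pass, whole words only."""
    if not replacements:
        return sentence
    # Longest keys first so multi-word entities win over any shorter key they contain.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], sentence)

# Hypothetical demo data, chosen to include a word that merely contains an entity.
demo_replacements = {"Frank": "Sanjay Verma", "San Francisco": "Bangalore"}
demo_sentence = "Frank flew from Frankfurt to San Francisco."

print(localize_whole_words(demo_sentence, demo_replacements))
# Sanjay Verma flew from Frankfurt to Bangalore.
```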