## Setting up

Clone the Epitran repository from the command line:
```
git clone https://github.com/dmort27/epitran.git
```
**Flite** is required for English transliteration. Run the following commands to install it:
```
git clone https://github.com/festvox/flite.git
cd flite/

./configure && make
sudo make install
cd testsuite
make lex_lookup
sudo cp lex_lookup /usr/local/bin
```

For more information on how to use the module, see the [Epitran GitHub repository](https://github.com/dmort27/epitran).
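
Before transliterating English names, it can help to confirm that the `lex_lookup` binary copied to `/usr/local/bin` above is reachable from Python. The following is a minimal sketch using only the standard library; the helper name `has_lex_lookup` is hypothetical and is not part of Epitran or Flite.

```python
import shutil

def has_lex_lookup(binary="lex_lookup"):
    """Return True if the Flite lex_lookup executable is found on PATH."""
    return shutil.which(binary) is not None

if not has_lex_lookup():
    print("lex_lookup not found on PATH; 'eng-Latn' transliteration may fail.")
```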

### You can uncomment the following code block to install the two packages 

## Importing libraries

In [1]:
import epitran 
import pandas as pd
import numpy as np 
from bs4 import BeautifulSoup
import requests 
import time



## Nationalities scraper

In [2]:
def origins_name(name):
    """Return the countries listed for a surname on FamilySearch (French-language page)."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    url = f"https://www.familysearch.org/fr/surname?surname={name}"

    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')

    ## Each country of origin is rendered as an <h3 class="countryTitleText"> element
    countries = [tag.text for tag in soup.find_all('h3', class_='countryTitleText')]

    ## Pause between requests. A shorter delay speeds things up, but the
    ## scraper might then not work on more than 10 names.
    time.sleep(10)

    return countries
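
Because every call to `origins_name` ends with a 10-second pause, looking up the same surname twice is expensive. One possible mitigation is to memoize the results. The sketch below uses only the standard library; `cached_lookup` and `fake_origins` are hypothetical names, and `fake_origins` is a stand-in for the scraper so that the example runs without network access.

```python
from functools import lru_cache

def cached_lookup(fetch, maxsize=1024):
    """Wrap a name -> list-of-countries function so each name is fetched once."""
    @lru_cache(maxsize=maxsize)
    def _lookup(name):
        # Store an immutable tuple so callers cannot alter the cached value
        return tuple(fetch(name))
    # Hand back a fresh list on every call
    return lambda name: list(_lookup(name))

# Stand-in for origins_name (assumption: same signature, no network, no pause)
calls = []
def fake_origins(name):
    calls.append(name)
    return ["Italie", "Espagne"]

lookup = cached_lookup(fake_origins)
lookup("Rossi")
lookup("Rossi")
print(len(calls))  # 1
```

With the real scraper, the same wrapper would be applied as `cached_lookup(origins_name)`.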

## Nationalities dictionary

In [3]:
## This dictionary is not exhaustive and can be extended depending on the nationalities
## returned by the website https://www.familysearch.org/fr/

dic_nationalities = {
    'Chili' : epitran.Epitran('spa-Latn'),
    'Russie' : epitran.Epitran('rus-Cyrl'),
    'Argentine' : epitran.Epitran('spa-Latn'),
    'Liban' : epitran.Epitran('ara-Arab'),
    'Allemagne' : epitran.Epitran('deu-Latn'),
    'Hongrie' : epitran.Epitran('hun-Latn'),
    'Italie' : epitran.Epitran('ita-Latn'),
    'Inde' : epitran.Epitran('hin-Deva'),
    'Bangladesh' : epitran.Epitran('hin-Deva'),
    'Autriche' : epitran.Epitran('deu-Latn'),
    'Espagne' : epitran.Epitran('spa-Latn'),
    'Pérou' : epitran.Epitran('spa-Latn'),
    'Équateur' : epitran.Epitran('spa-Latn'),
    'Porto Rico' : epitran.Epitran('spa-Latn'),
    'Uruguay' : epitran.Epitran('spa-Latn'),
    'Bolivie' : epitran.Epitran('spa-Latn'),
    'Croatie' : epitran.Epitran('hrv-Latn'),
    'Brésil' : epitran.Epitran('por-Latn'),
    'Mexique' : epitran.Epitran('spa-Latn'),
    'Pays-Bas' : epitran.Epitran('nld-Latn'),
    'Portugal' : epitran.Epitran('por-Latn'),
    'Pologne' : epitran.Epitran('pol-Latn'),
    'Viêt-Nam' : epitran.Epitran('vie-Latn'),
    "États-Unis d'Amérique" : epitran.Epitran('eng-Latn'),
    "Canada" : epitran.Epitran('eng-Latn'),
    "Galles" : epitran.Epitran('eng-Latn'),
    "Écosse" : epitran.Epitran('eng-Latn'),
    "Angleterre" : epitran.Epitran('eng-Latn'),
    "Afrique du Sud" : epitran.Epitran('eng-Latn'),
    "Irlande" : epitran.Epitran('eng-Latn')
}

## Ready to use functions

In [4]:
## Transliterate a surname into phonemes (IPA).
## By default, two French transliterations are returned (an arbitrary choice).

def g2p(name, n_translations=2):

    epitrans = [epitran.Epitran('fra-Latn'), epitran.Epitran('fra-Latn-np')]
    countries = ["France", "France bis"]

    if n_translations > 2:

        ## This is the slowest step, so origins are only looked up when the
        ## user asks for more than 2 translations.
        origins = origins_name(str(name))

        for origin in origins:

            if len(epitrans) >= n_translations:
                break

            ## Note: `not in` compares Epitran objects, not language codes.
            if origin in dic_nationalities and dic_nationalities[origin] not in epitrans:
                epitrans.append(dic_nationalities[origin])
                countries.append(origin)

    ## Never return more than n_translations results
    n_out = min(len(epitrans), n_translations)
    translations = [epitrans[k].transliterate(name) for k in range(n_out)]

    return dict(zip(countries, translations))
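
The membership test in `g2p` compares transliterator objects. Assuming `Epitran` does not define its own equality, two countries mapped to the same language (for example 'Chili' and 'Argentine', both 'spa-Latn') would not be recognised as duplicates. One possible alternative is to select by language code instead. The sketch below assumes a dictionary that stores codes as strings rather than `Epitran` instances; `COUNTRY_CODES` and `pick_codes` are hypothetical names used for illustration only.

```python
# Hypothetical mapping: country name (as returned by the site) -> Epitran language code
COUNTRY_CODES = {
    "Chili": "spa-Latn",
    "Argentine": "spa-Latn",
    "Italie": "ita-Latn",
    "Allemagne": "deu-Latn",
}

def pick_codes(origins, n_extra, already=("fra-Latn", "fra-Latn-np")):
    """Return up to n_extra (country, code) pairs with distinct language codes."""
    if n_extra <= 0:
        return []
    seen = set(already)
    picked = []
    for country in origins:
        code = COUNTRY_CODES.get(country)
        if code is None or code in seen:
            continue  # unknown country, or language already selected
        seen.add(code)
        picked.append((country, code))
        if len(picked) == n_extra:
            break
    return picked

print(pick_codes(["Chili", "Argentine", "Suisse", "Italie"], n_extra=2))
# [('Chili', 'spa-Latn'), ('Italie', 'ita-Latn')]
```

Each selected code could then be passed to `epitran.Epitran(code)` only when it is needed.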

In [5]:
def g2p_series(series, n_translations=2, nat=False):

    res = series.map(lambda x: g2p(x, n_translations))
    translations = res.map(lambda x: list(x.values()))

    if not nat:
        return translations

    ## Also return the country labels matching each translation
    return translations, res.map(lambda x: list(x.keys()))
        

## How to use the two provided functions

### On a single name

In [6]:
trans = g2p('Convert', n_translations = 4) ## Looks for up to 4 different translations
print(trans)
trans = g2p('Rossignol') ## The second parameter is optional and defaults to 2
print(trans)

{'France': 'kɔ̃vɛʀ', 'France bis': 'kɔnvɛrt', "États-Unis d'Amérique": 'kɑnvɹ̩t'}
{'France': 'ʀɔsiɲɔl', 'France bis': 'rɔssiɲɔl'}


### On a pandas Series

In [7]:
df = pd.read_csv('../Dataset.csv', nrows = 10)
df = df.rename(columns={'Unnamed: 0' : 'Nom propre', '0' : 'Phonétique'})
df = df.drop('Phonétique', axis = 1)

In [8]:
df.head(5)

Unnamed: 0,Nom propre
0,Abadie
1,Abart
2,Salem
3,Abraham
4,Adam


In [9]:
df['Translation'] = g2p_series(df['Nom propre'])
df.head()

Unnamed: 0,Nom propre,Translation
0,Abadie,"[abadi, abadi]"
1,Abart,"[abaʀ, abart]"
2,Salem,"[salɑ̃, salɛm]"
3,Abraham,"[abʀaɑ̃, abraham]"
4,Adam,"[adɑ̃, adam]"


Two optional parameters can be set:
- *n_translations*: defaults to 2. With a value above 2, each name triggers a web lookup followed by a 10-second pause, so computation time grows quickly.
- *nat*: defaults to False. If set to True, a second Series is also returned, holding the nationalities that correspond to the translations.
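
To give an order of magnitude for the cost of setting *n_translations* above 2, here is a rough estimate. It assumes that the 10-second pause in `origins_name` dominates and ignores network and transliteration time; `estimated_seconds` is a hypothetical helper, not part of the notebook.

```python
def estimated_seconds(n_names, n_translations=2, delay=10):
    """Lower bound on the time spent waiting in the scraper, in seconds."""
    # The scraper (and its pause) only runs when more than 2 translations are requested
    return n_names * delay if n_translations > 2 else 0

print(estimated_seconds(10, n_translations=3))    # 100
print(estimated_seconds(1000, n_translations=3))  # 10000, i.e. about 2 h 47 min
```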

In [10]:
df['Translation'], df['Nationalities'] = g2p_series(df['Nom propre'], n_translations = 3, nat = True)
df.head()

Unnamed: 0,Nom propre,Translation,Nationalities
0,Abadie,"[abadi, abadi, əbædi]","[France, France bis, États-Unis d'Amérique]"
1,Abart,"[abaʀ, abart, abart]","[France, France bis, Italie]"
2,Salem,"[salɑ̃, salɛm, sejləm]","[France, France bis, États-Unis d'Amérique]"
3,Abraham,"[abʀaɑ̃, abraham, ejbɹəhæm]","[France, France bis, États-Unis d'Amérique]"
4,Adam,"[adɑ̃, adam, ædəm]","[France, France bis, Écosse]"
