<font size="3"> Now let's test the two models on other datasets to see how they compare.

In [91]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.utils import resample
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, cohen_kappa_score, balanced_accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from joblib import dump, load

def padding(name):
    # Upper-case each name and mark its start ("^") and end ("$").
    padded_text = "^" + name.str.upper() + "$"
    return padded_text

def print_metrics(y_test, y_pred):
    # Print two summary scores, then return the per-class report sorted by F1.
    bas = balanced_accuracy_score(y_test, y_pred)
    cks = cohen_kappa_score(y_test, y_pred)
    print("Balanced accuracy score : ", round(bas, 3))
    print("Cohen Kappa score : ", round(cks, 3))

    report = classification_report(y_test, y_pred, output_dict=True)
    df = pd.DataFrame(report).transpose()
    results = df.sort_values("f1-score", ascending=False)
    return results


pubmed_clf = load("MR_model.joblib")
FB_clf = load("mixed_model.joblib")
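
The two scores printed by `print_metrics` can be reproduced by hand, which helps when reading the tables below. Here is a minimal sketch in plain Python, assuming the standard definitions (balanced accuracy as the mean of per-class recall, unweighted Cohen's kappa); the toy labels are made up for illustration and do not come from any of the datasets.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall, so small classes count as much as large ones."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def cohen_kappa(y_true, y_pred):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 8 "Slavic" names and 2 "Asian" names, one Asian name missed.
y_true = ["Slavic"] * 8 + ["Asian"] * 2
y_pred = ["Slavic"] * 8 + ["Slavic", "Asian"]

print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.75
print(round(cohen_kappa(y_true, y_pred), 3))        # 0.615
```

Plain accuracy would be 0.9 on this toy sample; both scores are lower because the error falls entirely on the small class.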

# Forebears

<font size="3"> [forebears.io/](https://forebears.io/) is a website dedicated to tracking the prevalence of names across the world. It is based on the largest genealogical database in the world, [FamilySearch.org](https://www.familysearch.org/fr/). I scraped the top 1000 surnames of every country in the world.

In [241]:
df_fam = pd.read_csv("../data/other_data/forebears_surnames.csv")
df_fam = df_fam[df_fam["region"] != 'ambiguous'].dropna()

X = df_fam["name"]
y = df_fam["region"]

df_fam

Unnamed: 0,country,name,freq,region,alpha-3
0,South Korea,Kim,0.200000,Asian,KOR
1,South Korea,I,0.142857,Asian,KOR
2,South Korea,Pak,0.083333,Asian,KOR
3,South Korea,Chong,0.047619,Asian,KOR
4,South Korea,Choe,0.047619,Asian,KOR
...,...,...,...,...,...
153270,Dr Congo,Malemba,0.000125,African,COD
153271,Dr Congo,Njiba,0.000125,African,COD
153272,Dr Congo,Nshimbi,0.000124,African,COD
153273,Dr Congo,Futi,0.000124,African,COD


## Results for the Facebook model

In [242]:
y_pred = FB_clf.predict(X)
results = print_metrics(y ,y_pred)
results

Balanced accuracy score :  0.68
Cohen Kappa score :  0.626


Unnamed: 0,precision,recall,f1-score,support
CentralSouthEuropean,0.871898,0.697665,0.775111,33443.0
Slavic,0.787782,0.746835,0.766762,14141.0
NorthEuropean,0.601465,0.84349,0.702209,14210.0
weighted avg,0.717581,0.689533,0.695075,108469.0
accuracy,0.689533,0.689533,0.689533,0.689533
African,0.736177,0.612337,0.66857,19308.0
macro avg,0.657063,0.679935,0.660077,108469.0
Asian,0.631639,0.582955,0.606321,8589.0
Arabian,0.528624,0.656533,0.585676,12138.0
Indian,0.441855,0.619729,0.51589,6640.0


## Results for the Pubmed model

In [111]:
y_pred = pubmed_clf.predict(X)
results = print_metrics(y ,y_pred)
results

Balanced accuracy score :  0.675
Cohen Kappa score :  0.632


Unnamed: 0,precision,recall,f1-score,support
CentralSouthEuropean,0.874338,0.715905,0.787229,33443.0
Slavic,0.804718,0.747755,0.775192,14141.0
weighted avg,0.718598,0.694779,0.69976,108469.0
accuracy,0.694779,0.694779,0.694779,0.694779
NorthEuropean,0.582927,0.850598,0.691773,14210.0
African,0.706572,0.667599,0.686533,19308.0
macro avg,0.65665,0.674926,0.658693,108469.0
Arabian,0.633214,0.58115,0.606066,12138.0
Asian,0.568687,0.56584,0.56726,8589.0
Indian,0.426094,0.595633,0.496797,6640.0


## Detailed predictions

Both models perform about equally well.
Let's see in detail how the FB model performed. Here are its predictions for each country, as well as the share of correct predictions, either raw or weighted by the frequency of each name in the population.
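
To make the difference between the raw and the frequency-weighted share concrete, here is a small sketch in plain Python. The four rows are hypothetical (names, frequencies and outcomes are invented), and `correct_shares` is an illustrative helper, not a function used elsewhere in this notebook.

```python
# Hypothetical rows for one country: (name, freq in the population, prediction correct?)
rows = [
    ("A", 0.50, True),
    ("B", 0.30, True),
    ("C", 0.10, False),
    ("D", 0.10, False),
]

def correct_shares(rows):
    """Return (unweighted, weighted) percentages of correct predictions."""
    n_correct = sum(1 for _, _, ok in rows if ok)
    unweighted = 100 * n_correct / len(rows)
    total_freq = sum(freq for _, freq, _ in rows)
    weighted = 100 * sum(freq for _, freq, ok in rows if ok) / total_freq
    return round(unweighted, 1), round(weighted, 1)

print(correct_shares(rows))  # (50.0, 80.0)
```

When the most common names are the ones the model gets right, the weighted share is much higher than the raw one.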

In [243]:
# Use the FB model's predictions explicitly (y_pred may hold another model's output).
df_fam["pred"] = FB_clf.predict(X)
df_fam["result"] = np.where(df_fam["pred"] == df_fam["region"], "correct", "false")

# Share of correctly predicted names per country, each name counting once.
c = df_fam.groupby(["country", "result"])["freq"].count()
c = c / c.groupby("country").transform("sum")
c = (c * 100).round(1).unstack()["correct"].rename("correct-unweighted")

# Same share, with each name weighted by its frequency in the population.
a = df_fam.groupby(["country", "result"])["freq"].sum()
a = a / a.groupby("country").transform("sum")
a = (a * 100).round(1).unstack()["correct"].rename("correct-weighted")

# Distribution of predicted regions per country, in percent.
b = df_fam.groupby(["country", "pred"])["name"].count()
b = b / b.groupby("country").transform("sum")
b = b.multiply(100).round(1).unstack()
data = pd.concat([c, a, b], axis=1).sort_values("correct-unweighted", ascending=False).fillna(0)
data[:45]

country,correct-unweighted,correct-weighted,African,Arabian,Asian,CentralSouthEuropean,Indian,NorthEuropean,Slavic
Tajikistan,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0
Uzbekistan,99.6,99.8,0.0,0.0,0.0,0.0,0.4,0.0,99.6
Kyrgyzstan,99.3,99.1,0.1,0.1,0.2,0.1,0.1,0.1,99.3
Russia,97.8,98.5,0.1,0.4,0.3,0.2,0.4,0.8,97.8
Turkmenistan,96.9,97.5,0.6,0.0,0.0,1.9,0.0,0.6,96.9
Kazakhstan,94.7,92.6,0.3,0.9,1.1,0.9,0.7,1.4,94.7
Bulgaria,94.5,95.6,0.7,2.3,0.2,0.4,1.6,0.3,94.5
Netherlands,94.5,95.6,0.7,2.0,1.1,1.1,0.2,94.5,0.3
Belgium,93.8,94.7,0.9,1.4,0.5,2.9,0.2,93.8,0.3
Azerbaijan,93.6,98.7,0.0,0.5,0.0,1.9,3.9,0.1,93.6


In [244]:
data[45:90]

country,correct-unweighted,correct-weighted,African,Arabian,Asian,CentralSouthEuropean,Indian,NorthEuropean,Slavic
Brazil,81.8,92.9,3.6,2.2,2.0,81.8,0.7,9.2,0.5
Switzerland,81.6,84.9,1.8,6.1,1.1,6.6,1.6,81.6,1.1
Venezuela,81.3,89.4,5.2,2.7,2.0,81.3,1.3,6.4,1.1
Uruguay,80.1,91.5,3.1,3.7,1.6,80.1,0.9,9.3,1.3
Paraguay,80.1,90.7,3.9,2.3,1.2,80.1,1.4,10.2,1.0
Ecuador,79.9,88.4,7.1,2.6,2.4,79.9,1.6,5.3,1.1
Nicaragua,79.6,87.9,5.3,2.2,1.6,79.6,1.2,9.2,0.9
Dominican Republic,79.5,89.3,3.9,3.1,2.1,79.5,0.6,9.1,1.7
El Salvador,79.4,90.8,5.5,2.0,2.2,79.4,1.7,7.6,1.6
Spain,79.1,90.5,4.4,3.2,2.5,79.1,1.0,8.8,1.0


In [245]:
data[90:]

country,correct-unweighted,correct-weighted,African,Arabian,Asian,CentralSouthEuropean,Indian,NorthEuropean,Slavic
South Africa,63.0,70.2,63.0,1.5,1.4,2.0,2.5,29.3,0.3
Taiwan,62.8,96.2,8.3,7.1,62.8,7.1,3.9,7.6,3.2
India,62.8,75.5,7.3,11.2,5.0,4.1,62.8,7.1,2.5
Malta,62.6,56.5,7.0,10.5,4.1,62.6,0.6,13.5,1.8
China,60.7,94.0,7.5,8.5,60.7,2.8,6.9,12.2,1.4
Haiti,60.6,72.3,8.5,6.5,1.3,19.3,2.0,60.6,1.8
Saudi Arabia,60.5,60.9,3.7,60.5,1.6,0.0,30.0,3.2,1.1
Iraq,59.8,67.1,4.9,59.8,2.3,2.2,24.8,4.2,1.8
Kenya,59.8,59.0,59.8,9.2,8.0,5.6,8.8,7.2,1.4
Mali,58.5,76.5,58.5,19.5,5.1,4.6,2.1,9.7,0.5


# Olympic athletes

<font size="3"> A [dataset](https://github.com/rgriff23/Olympic_history) of athletes who have competed in the Olympics.

In [138]:
df_sport = pd.read_csv("https://raw.githubusercontent.com/rgriff23/Olympic_history/master/data/athlete_events.csv").drop_duplicates(subset=["Name","Team"])
df_sport["name"] = df_sport["Name"].str.split().apply(lambda x : x[-1]).str.upper()
df_sport["country"] = df_sport["Team"]
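# NOTE: `labels` is not defined in this section; it is assumed to be a
# country-to-region lookup table with "country" and "region" columns.
df_sport = df_sport.merge(labels,on="country")[["country",'Name',"name","region"]].dropna()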
df_sport = df_sport[df_sport["region"] != 'ambiguous']
df_sport

Unnamed: 0,country,Name,name,region
0,China,A Dijiang,DIJIANG,Asian
1,China,A Lamusi,LAMUSI,Asian
2,China,Abudoureheman,ABUDOUREHEMAN,Asian
3,China,Ai Linuer,LINUER,Asian
4,China,Ai Yanhan,YANHAN,Asian
...,...,...,...,...
97171,Lesotho,Masempe Theko,THEKO,African
97172,Lesotho,Mamorallo Tjoka,TJOKA,African
97173,Lesotho,M'apotlaki Ts'elho,TS'ELHO,African
97174,Lesotho,Lefa Tsapi,TSAPI,African


## Performance of the FB model

In [139]:
y = df_sport["region"]
y_pred = FB_clf.predict(df_sport['name'])
results = print_metrics(y ,y_pred)
results

Balanced accuracy score :  0.708
Cohen Kappa score :  0.657


Unnamed: 0,precision,recall,f1-score,support
Asian,0.849078,0.803084,0.825441,10375.0
Slavic,0.786534,0.811755,0.798945,11570.0
NorthEuropean,0.756835,0.84361,0.79787,29663.0
weighted avg,0.770805,0.734439,0.740615,92103.0
accuracy,0.734439,0.734439,0.734439,0.734439
CentralSouthEuropean,0.882178,0.617045,0.726168,30202.0
macro avg,0.635046,0.708042,0.653484,92103.0
Arabian,0.380345,0.789811,0.513437,3435.0
Indian,0.407337,0.635978,0.496604,2357.0
African,0.383019,0.45501,0.415922,4501.0


## Performance of the PubMed model

In [140]:
y = df_sport["region"]
y_pred = pubmed_clf.predict(df_sport['name'])
results = print_metrics(y ,y_pred)
results

Balanced accuracy score :  0.716
Cohen Kappa score :  0.672


Unnamed: 0,precision,recall,f1-score,support
Asian,0.817989,0.817831,0.81791,10375.0
NorthEuropean,0.751641,0.849071,0.797391,29663.0
Slavic,0.791052,0.797753,0.794389,11570.0
weighted avg,0.772443,0.747326,0.751257,92103.0
accuracy,0.747326,0.747326,0.747326,0.747326
CentralSouthEuropean,0.881571,0.645752,0.745456,30202.0
macro avg,0.655027,0.716332,0.675607,92103.0
Arabian,0.527982,0.733333,0.613941,3435.0
Indian,0.418696,0.634705,0.504553,2357.0
African,0.396254,0.535881,0.45561,4501.0


<font size="3"> The Pubmed model does a bit better (balanced accuracy 0.716 vs. 0.708, Cohen's kappa 0.672 vs. 0.657).

# Insee names dataset

<font size="3">  The French National Institute of Statistics (Insee) keeps a database of most surnames given at birth since the 1890s. In 1890-1900 almost all names should be of North European origin, with a minority coming from Southern Europe. Let's see how the FB model performs.

## FB model

In [149]:
fr_names = pd.read_table("https://www.insee.fr/fr/statistiques/fichier/3536630/noms2008nat_txt.zip")
fr_names = fr_names[fr_names["NOM"] != "AUTRES"]

fr_names["pred"] = FB_clf.predict(fr_names["NOM"])

data_xix = fr_names.groupby("pred")["_1891_1900"].sum()
data_xxi = fr_names.groupby("pred")["_1991_2000"].sum()

data_xix = ((data_xix/data_xix.sum())*100).sort_values(ascending=False).round(1)
data_xxi = ((data_xxi/data_xxi.sum())*100).sort_values(ascending=False).round(1)

data = pd.concat([data_xix.astype(str) + "%",data_xix.cumsum()],axis=1)
data.columns = ["share_1890_1900",'cumulative_share_1890_1900']
data

pred,share_1890_1900,cumulative_share_1890_1900
NorthEuropean,87.5%,87.5
CentralSouthEuropean,5.5%,93.0
Arabian,2.8%,95.8
African,1.6%,97.4
Slavic,1.1%,98.5
Indian,0.9%,99.4
Asian,0.7%,100.1


<font size="3"> Quite good, although at least 7% of names are wrongly labeled as non-European. Let's review some random predictions.

In [150]:
for region in ['CentralSouthEuropean',"African","Arabian",'Indian',"Asian","Slavic"]:
    n = fr_names[(fr_names["pred"] == region) & (fr_names["_1891_1900"] > 20)].sample(50)["NOM"]
    print('Examples of predicted', region, " names : ")
    for i in n:
        print(i.capitalize(), end=", ")
    print('\n')

Examples of predicted CentralSouthEuropean  names : 
Hermez, Barbaste, Carrega, Canel, Pasquini, Rostagni, Cistac, Cristol, Ottomani, Prete, Giuliano, Franceschini, Morio, Occelli, Gaglio, Kerlidou, Ghigo, Marin, Sabate, Mauri, Durando, Esposito, Dorso, Catalan, Sacaze, Zanetti, Ozon, Guiliani, Seveno, Barbas, Corticchiato, Capus, Mondoloni, Pellan, Colomban, Ayral, Roccaserra, Sartori, Chambas, Rosso, Donato, Petrignani, Solari, Benedetti, Marchini, Francoz, Filippini, Garibaldi, Maestracci, Benedetto, 

Examples of predicted African  names : 
Emile, Mazy, Mangon, Joye, Lehuede, Dazy, Moity, Nony, Marle, Dodu, Lelu, Menu, Pogu, Danilo, Bondu, Molas, Sibue, Magendie, Dibon, Cabane, Dile, Matheu, Sence, Quidu, Surugue, Mabru, Thabuis, Dube, Lebegue, Lerendu, Mazingue, Basile, Rame, Andree, Nogue, Eme, Bigo, Maxime, Seube, Ramus, Phalip, Adams, Chiesa, Angele, Cande, Emery, Donze, Uteza, Gabon, Comba, 

Examples of predicted Arabian  names : 
Baysse, Jaouen, Mine, Lamm, Knab, Eloi, Routa

<font size="3"> Most Southern European predictions are right; almost all of the others are wrong. Breton names are often classified as Slavic because of their "ic" and "ec" endings.
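
The effect of those endings can be illustrated with character n-grams. The sketch below pads a name the same way as the `padding` function at the top of the notebook and lists the trigrams two names share. It assumes the classifiers rely on character n-grams of the padded name; the actual n-gram range of the saved pipelines is not shown in this notebook, so `n=3` is an assumption, and the two names are illustrative (one Breton, one Slavic).

```python
def char_ngrams(name, n=3):
    """Set of character n-grams of a name padded as "^" + NAME + "$"."""
    padded = "^" + name.upper() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# Illustrative names: a Breton surname and a Slavic surname, both ending in "ic".
shared = char_ngrams("Guillevic") & char_ngrams("Zivkovic")
print(sorted(shared))  # ['IC$', 'VIC']
```

Thanks to the `$` padding, `IC$` is a feature specific to the end of a name. If such a feature is mostly seen on Slavic names during training, it will pull Breton names towards the same class.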

## Pubmed model

In [146]:
fr_names["pred"] = pubmed_clf.predict(fr_names["NOM"])

data_xix = fr_names.groupby("pred")["_1891_1900"].sum()
data_xxi = fr_names.groupby("pred")["_1991_2000"].sum()

data_xix = ((data_xix/data_xix.sum())*100).sort_values(ascending=False).round(1)
data_xxi = ((data_xxi/data_xxi.sum())*100).sort_values(ascending=False).round(1)

data = pd.concat([data_xix.astype(str) + "%",data_xix.cumsum()],axis=1)
data.columns = ["share_1890_1900",'cumulative_share_1890_1900']
data

pred,share_1890_1900,cumulative_share_1890_1900
NorthEuropean,75.7%,75.7
CentralSouthEuropean,14.6%,90.3
African,3.4%,93.7
Arabian,2.1%,95.8
Slavic,1.5%,97.3
Asian,1.4%,98.7
Indian,1.2%,99.9


In [148]:
for region in ['CentralSouthEuropean',"African","Arabian",'Indian',"Asian","Slavic"]:
    n = fr_names[(fr_names["pred"] == region) & (fr_names["_1891_1900"] > 20)].sample(50)["NOM"]
    print('Examples of predicted', region, " names : ")
    for i in n:
        print(i.capitalize(), end=", ")
    print('\n')


Examples of predicted CentralSouthEuropean  names : 
Cance, Cabal, Cristofari, Raimondi, Gilli, Verdan, Espie, Echegut, Hocde, Cayla, Blanque, Perfettini, Ristori, Larrere, Serra, Capus, Nalis, Peraldi, Ughetto, Mauran, Lofficial, Lea, Humez, Pasquiou, Busso, Escaffre, Daragon, Russo, Borredon, Soyez, Guidicelli, Guinel, Solere, Bossis, Roumeas, Rolando, Bocage, Vives, Gas, Gandolfo, Gipoulou, De vos, Pantalacci, Calvin, Caraes, Riou, Lahondes, Vivares, Car, Lauriol, 

Examples of predicted African  names : 
Mao, Sereni, Lembeye, Gouzou, Fousse, Adolphe, Sire, Mone, Demade, Coffy, Onno, Mabile, Blampain, Lesire, Gouyou, Donguy, Lelong, Fol, Duchange, Caritey, Bole, Aguesse, Foata, Brou, Mathou, Laye, Le doare, Peru, Marguerite, Dange, Boussange, Longere, Chaze, Bugat, Roye, Emery, Chenu, Machu, Manne, Duminy, Petitbon, Justin, Seraphin, Justine, Deleplanque, Meresse, Bally, Bodo, Odiot, Isidore, 

Examples of predicted Arabian  names : 
Belgy, Retif, Gibout, Sache, Belou, Bas, Derail, 

<font size="3">  The Pubmed model has a tendency to wrongly classify North European names as South European ones (14.6% CentralSouthEuropean, versus 5.5% for the FB model).

# Brevet des collèges

I've gathered data about the brevet des collèges, the exam taken by almost all middle school students in France.

<font size="3">Can we infer past immigration figures from this?</font>

<font size="4"><u>**NO**</u></font>

First of all, because of mixed unions, a lot of descendants of immigrants have mixed origins and bear a French name (or another one).

Second, the models are far from the level of accuracy we would need for this, particularly in a very imbalanced population.
    
Quick example: say 90% of a population is of French origin and 10% of Arabic origin, and the model is correct 90% of the time. Then 10% of the dominant group, that is, 9% of the population, is misclassified as having an Arabic name, while 90% of the minority group, also 9% of the population, is correctly classified as having an Arabic name. Half of the people predicted as having an Arabic name in fact have a French name.
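
The same arithmetic can be written as a short function. This is a sketch under simplifying assumptions: there are only two groups, the model is equally accurate on both, and every error on one group is a prediction of the other. `minority_precision` is an illustrative helper, not part of the models.

```python
def minority_precision(minority_share, accuracy):
    """Share of 'minority' predictions that are actually right (two-class case)."""
    true_pos = minority_share * accuracy               # minority names correctly labeled
    false_pos = (1 - minority_share) * (1 - accuracy)  # majority names mislabeled
    return true_pos / (true_pos + false_pos)

print(round(minority_precision(0.10, 0.90), 3))  # 0.5
print(round(minority_precision(0.05, 0.90), 3))  # 0.321
```

The second call shows how quickly things degrade: with a 5% minority and the same 90% accuracy, fewer than a third of the names predicted as belonging to the minority really do.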
    
Let's see that in practice:

## FB model

In [264]:
df = pd.read_csv("../data/other_data/brevet_2019.csv")
df["name"] = df["noms"].str.split(" ").apply(lambda x : " ".join(x[:-1])).str.upper()
df["prenom"] = df["noms"].str.split(" ").apply(lambda x : x[-1])

df["pred_FB"] = FB_clf.predict(df["name"])
df["pred_FB"].value_counts(normalize=True)*100




NorthEuropean           65.767569
CentralSouthEuropean    13.240034
Arabian                 11.445240
African                  3.836014
Slavic                   2.649154
Indian                   1.802126
Asian                    1.259863
Name: pred_FB, dtype: float64

## Pubmed model

In [155]:
df["pred_pubmed"] = pubmed_clf.predict(df["name"])
df["pred_pubmed"].value_counts(normalize=True)*100

NorthEuropean           62.248753
CentralSouthEuropean    15.319190
Arabian                  9.594019
African                  5.663325
Slavic                   2.892073
Indian                   2.224206
Asian                    2.058434
Name: pred_pubmed, dtype: float64

## Predictions correction

Let's assume that the names given at birth in 1890-1900 were all of European origin, which is probably not far from the truth.
How many of those names were wrongly classified as non-European by the models?

In [171]:
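# Correction: assume every name recorded in 1891-1900 is European, so relabel
# any non-European prediction in fr_names as "NorthEuropean".
fr_names.loc[~fr_names["pred"].isin(["NorthEuropean","CentralSouthEuropean"]) ,"pred"] = "NorthEuropean"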
df_merge = df.merge(fr_names[fr_names["_1891_1900"] > 5][["NOM","pred"]],left_on="name",right_on="NOM",how="left")
df_merge

Unnamed: 0,ville,code_postal,noms,name,prenom,pred_FB,pred_pubmed,NOM,pred
0,manosque,04112,Alaverdyan Rafik,ALAVERDYAN,Rafik,Slavic,CentralSouthEuropean,,
1,manosque,04112,Alpin Donovan,ALPIN,Donovan,Slavic,NorthEuropean,,
2,manosque,04112,Arciszewski Sunita,ARCISZEWSKI,Sunita,Slavic,Slavic,,
3,manosque,04112,Artaud Mathis,ARTAUD,Mathis,NorthEuropean,NorthEuropean,ARTAUD,NorthEuropean
4,manosque,04112,Avena Prince,AVENA,Prince,CentralSouthEuropean,CentralSouthEuropean,AVENA,CentralSouthEuropean
...,...,...,...,...,...,...,...,...,...
313680,paris,75000,Zieseniss Milena,ZIESENISS,Milena,NorthEuropean,NorthEuropean,,
313681,paris,75000,Zimmermann Anna,ZIMMERMANN,Anna,NorthEuropean,NorthEuropean,ZIMMERMANN,NorthEuropean
313682,paris,75000,Zivkovic Sylvie,ZIVKOVIC,Sylvie,Slavic,Slavic,,
313683,paris,75000,Zuber Marwan,ZUBER,Marwan,NorthEuropean,NorthEuropean,ZUBER,NorthEuropean


In [205]:
for region in ["Arabian",'Asian',"Slavic",'African']:
    for model in ("pred_FB","pred_pubmed"):
        a = df_merge[(df_merge["pred"].isin(["NorthEuropean","CentralSouthEuropean"])) & (df_merge[model] == region)]
        b = df_merge[(df_merge[model] == region)]
        share = len(a)/len(b)
        share = round(share*100,1)
        print(region,model," : ",share,"% at least were in fact european names")
    print("")

Arabian pred_FB  :  16.5 % at least were in fact european names
Arabian pred_pubmed  :  14.9 % at least were in fact european names

Asian pred_FB  :  34.1 % at least were in fact european names
Asian pred_pubmed  :  46.0 % at least were in fact european names

Slavic pred_FB  :  29.1 % at least were in fact european names
Slavic pred_pubmed  :  36.3 % at least were in fact european names

African pred_FB  :  30.8 % at least were in fact european names
African pred_pubmed  :  42.3 % at least were in fact european names



<font size="3"> A sizeable share of the predictions were wrong, and *this is just the lower bound*: only names given more than 5 times in 1890-1900 are counted as European here.
    
<font size="3">Let's now try a more stringent set of criteria: we include only the names for which both models agree and that weren't given in 1890-1900:
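
As a minimal sketch of that filter in plain Python, the rows below are copied from the `df_merge` preview above, with `None` standing in for a name that has no match in the 1890-1900 file. `consensus` is an illustrative helper; the notebook itself applies the same three conditions with pandas boolean masks.

```python
EUROPEAN = {"NorthEuropean", "CentralSouthEuropean"}

# (name, pred_FB, pred_pubmed, corrected 1890-1900 label or None if the name is absent)
rows = [
    ("ALAVERDYAN", "Slavic", "CentralSouthEuropean", None),
    ("ALPIN", "Slavic", "NorthEuropean", None),
    ("ARCISZEWSKI", "Slavic", "Slavic", None),
    ("ARTAUD", "NorthEuropean", "NorthEuropean", "NorthEuropean"),
    ("AVENA", "CentralSouthEuropean", "CentralSouthEuropean", "CentralSouthEuropean"),
    ("ZIVKOVIC", "Slavic", "Slavic", None),
]

def consensus(rows, region):
    """Names both models assign to `region` and that are not known historical names."""
    return [
        name
        for name, fb, pubmed, historical in rows
        if fb == region and pubmed == region and historical not in EUROPEAN
    ]

print(consensus(rows, "Slavic"))  # ['ARCISZEWSKI', 'ZIVKOVIC']
```

Names on which the models disagree (ALAVERDYAN, ALPIN) are dropped, and so is any name already present in the historical file (AVENA would be excluded from the CentralSouthEuropean sample).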

In [302]:
for region in ['CentralSouthEuropean',"African","Arabian",'Indian',"Asian","Slavic"]:
    n = df_merge[(df_merge["pred_FB"] == region) & (df_merge["pred_pubmed"] == region) & (~df_merge["pred"].isin(["NorthEuropean","CentralSouthEuropean"]))].sample(50)["name"]
    print('Examples of predicted', region, " names : ")
    for i in n:
        print(i.capitalize(), end=", ")
    print('\n')


Examples of predicted CentralSouthEuropean  names : 
Mitanta kandolo, Lucchini-laplanche, Silveri, Kaci, Oval-martin, Pecchia, Goncalves gomes, Santos mendes, Bottalico, Perettoni, Dias vieira, Venancio, Da costa, Calcul, Brito tolosa, Pierelli, Ruffinetto, Touenti, Alves de amorim, Sousa, Delgado, Akkaya, Manzano, Vaz, Silva rocha, Doumbia - guayroso, Sebastia, Gonzalez sanchez, Garrigos-sin, Fernandes, Maggini, Villafane garberoglio, Goncalves, De carvalho, Migliaccio, Placente, Anoni, Beramice-dracan, Navarro torres, Dos reis, Demirelisci, Cayuela, De sampaio ribeiro, Yerebasmaz, Pozzobon, Bordonado, Lacerda ferreira, Fares, De barros simoes, Da silva, 

Examples of predicted African  names : 
Nguizani-nkiatoko, Diakite, Mombo di-lutete, Diongue, Kazi, Ndoye, Diawara, Fondjou, Connaly-missongo, Diallo, Traore, Timera, Ehue, Beseme, Kuedi, Bulawa, Sidibe, Nimaga, Toure, Ndoye, Abodunrin, Akabi, M'baye, Samah, Mpuki, Meviane, Traore, Ntumba, Cuvele, Diwulu umba, Kpodar, Kofarago, Aupy

<font size="3"> Those predictions look much better than those of the individual models taken separately, although still far from 100% accurate.

# Conclusion

<font size="3">The two models show similar results overall, with the Pubmed model being perhaps a bit better. I would have expected the opposite since the Facebook dataset is so much richer, but maybe it's just too dirty or imbalanced. I bet there is a lot of room for improvement.
<font size="3">In the case of France, the Facebook model seems to do a bit better though.