<a href="https://colab.research.google.com/github/ashkanallahyari/name_finder/blob/main/Name_Sim_2_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Enter a name you like (preferably in Farsi, but English also works).**

This widget works from a long list of names in Farsi. It currently contains the names of all Iranian villages (about 48,200 names), but you can import any other list as long as it includes a column named **'names'**.
The widget converts each name to a phonetic transcription and suggests the 10 names that are phonetically closest to the name you entered.
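
A custom list could be prepared along the following lines. This is a minimal sketch, assuming the list is already available as a pandas DataFrame; the helper name `prepare_name_list` and the sample rows are hypothetical and not part of the notebook, and resetting the index is a choice of this sketch only.

```python
import pandas as pd

def prepare_name_list(frame: pd.DataFrame) -> pd.DataFrame:
    """Check for the required 'names' column and drop empty or repeated rows."""
    if "names" not in frame.columns:
        raise ValueError("the name list needs a column called 'names'")
    return frame[["names"]].dropna().drop_duplicates().reset_index(drop=True)

# Hypothetical sample list with one missing value and one duplicate
custom = pd.DataFrame({"names": ["نیما", "زاو", None, "نیما"]})
clean = prepare_name_list(custom)
print(len(clean))  # 2
```

A list without a `names` column raises a `ValueError` early, before any download or vectorization step runs.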

In [34]:
# Which name do you like? Preferably enter it in Farsi, but English also works.
# Some Farsi name examples: زاو، اوریم، نیما، اسنپ

my_word = input("What is your favorite name in Farsi: ")

What is your favorite name in Farsi: اسنپ


## **Code**

### **Required Libraries**

In [35]:
import pandas as pd
import regex as re
import gdown

### **Name Source on Google Drive**

In [36]:
# Define the Google Drive file ID
file_id = "1diRpnHII0HCeemkZhdXxQTG94JvM54_g"
download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
output_file = "data.xlsx"

# Download the file
gdown.download(download_url, output_file, quiet=False)

# Load the file into a DataFrame
df = pd.read_excel(output_file)

# Display the first few rows
df.head()

Downloading...
From: https://drive.google.com/uc?export=download&id=1diRpnHII0HCeemkZhdXxQTG94JvM54_g
To: /content/data.xlsx
100%|██████████| 1.12M/1.12M [00:00<00:00, 81.9MB/s]


| Unnamed: 0 | provinces | cities | villages_names | names |
|---|---|---|---|---|
| 0 | رده:روستاهای استان آذربایجان شرقی | رده:روستاهای شهرستان آذرشهر | آخی‌جهان | آخی‌جهان |
| 1 | رده:روستاهای استان آذربایجان شرقی | رده:روستاهای شهرستان آذرشهر | الوانق | الوانق |
| 2 | رده:روستاهای استان آذربایجان شرقی | رده:روستاهای شهرستان آذرشهر | امیردیزج | امیردیزج |
| 3 | رده:روستاهای استان آذربایجان شرقی | رده:روستاهای شهرستان آذرشهر | پیرچوپان | پیرچوپان |
| 4 | رده:روستاهای استان آذربایجان شرقی | رده:روستاهای شهرستان آذرشهر | تیرامین (آذرشهر) | تیرامین |


### **Functions**

In [37]:
# Convert a Farsi name to an approximate phonetic (IPA-like) string
def NameVectorizer(name):
    # Each pair is [regex pattern, replacement]. Order matters: the
    # word-initial alef rule must run before the general alef rule.
    phonetics = [
        [r'\s', ' '], ['\u200c', ' '],  # normalize whitespace; zero-width non-joiner -> space
        [r'(^|\s)ا', r'\1æ'],  # word-initial alef; \1 keeps any preceding space
        ['ا', 'ɑ'], ['آ', 'ʔɑ'],
        ['ء', 'ʔ'], ['ع', 'ʔ'], ['أ', 'ʔ'], ['ؤ', 'ʔ'], ['ئ', 'ʔ'],
        ['ب', 'b'],
        ['پ', 'p'],
        ['ت', 't'], ['ط', 't'],
        ['ث', 's'], ['س', 's'], ['ص', 's'],
        ['ج', 'ʤ'],
        ['چ', 'ʧ'],
        ['ه', 'h'], ['ح', 'h'],
        ['خ', 'χ'],
        ['د', 'd'],
        ['ذ', 'z'], ['ز', 'z'], ['ض', 'z'], ['ظ', 'z'],
        ['ر', 'r'],
        ['ژ', 'ʒ'],
        ['ش', 'ʃ'],
        ['غ', 'ʁ'], ['ق', 'ʁ'],
        ['ف', 'f'],
        ['ک', 'k'],
        ['گ', 'g'],
        ['ل', 'l'],
        ['م', 'm'],
        ['ن', 'n'],
        ['و', 'v'],
        ['ی', 'j'],
    ]

    # Apply the substitutions one after another, in list order
    for vaj, phonem in phonetics:
        name = re.sub(vaj, phonem, name)

    return name
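
To make the substitution order concrete, here is a small worked sketch. It uses only a handful of the rules from the table above and the standard-library `re` module; the names `RULES` and `to_phonetics` are hypothetical and exist only for this illustration. Short vowels are not written in Farsi, so they do not appear in the result.

```python
import re

# Reduced rule set for illustration only; the notebook's table covers the full alphabet
RULES = [
    (r"(^|\s)ا", r"\1æ"),  # alef at the start of a word
    ("ا", "ɑ"),            # alef anywhere else
    ("س", "s"),
    ("ن", "n"),
    ("پ", "p"),
    ("م", "m"),
    ("ی", "j"),
]

def to_phonetics(name: str) -> str:
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in RULES:
        name = re.sub(pattern, replacement, name)
    return name

print(to_phonetics("اسنپ"))  # æsnp
print(to_phonetics("نیما"))  # njmɑ
```

Swapping the first two rules would turn every alef into `ɑ` before the word-initial rule gets a chance to match, which is why the more specific rule comes first.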

In [38]:
# Preparing phonetics
df_new = df[['names']].dropna().drop_duplicates()
df_new['phonetics'] = df_new['names'].apply(NameVectorizer)
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Index: 34828 entries, 0 to 48216
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   names      34828 non-null  object
 1   phonetics  34828 non-null  object
dtypes: object(2)
memory usage: 816.3+ KB


In [39]:
# TF-IDF vectorization over single phonetic characters
from sklearn.feature_extraction.text import TfidfVectorizer

# The raw string avoids an invalid-escape warning; every word character becomes one token
tfidf = TfidfVectorizer(token_pattern=r'\w')
tfidf_matrix = tfidf.fit_transform(df_new['phonetics'])
tfidf_matrix.shape

(34828, 35)
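
Because the token pattern matches one character at a time, each name is represented by which phonetic characters it contains and how often, not by their order. The following sketch illustrates that property on toy input; the sample strings are hypothetical and were chosen so that the first two contain the same characters in a different order.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical phonetic strings; the first two are reorderings of each other
samples = ["æsnp", "æspn", "njmɑ"]

vectorizer = TfidfVectorizer(token_pattern=r"\w")
matrix = vectorizer.fit_transform(samples).toarray()

print(matrix.shape)                       # (3, 7): 3 strings, 7 distinct characters
print(np.allclose(matrix[0], matrix[1]))  # True: character order is ignored
```

This is one plausible reason why a suggestion can share most of its sounds with the input while arranging them differently.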

In [40]:
tfidf.get_feature_names_out()

array(['b', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's',
       't', 'v', 'z', 'æ', 'ɑ', 'ʁ', 'ʃ', 'ʒ', 'ʔ', 'ʤ', 'ʧ', 'χ', '۰',
       '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹'], dtype=object)

In [41]:
# Transcribe the input the same way as the name list, then vectorize it
my_phonetics = pd.Series([my_word]).apply(NameVectorizer)
my_vect = tfidf.transform(my_phonetics)
my_vect.shape

(1, 35)

In [42]:
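from sklearn.metrics.pairwise import linear_kernel

# TF-IDF rows are L2-normalized by default, so the dot product equals the cosine similarity
cosine_sim = linear_kernel(my_vect, tfidf_matrix).tolist()
# Pair each row position with its score and sort from most to least similar
sim_scores = list(enumerate(cosine_sim[0]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)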
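
The cell above uses `linear_kernel` (a plain dot product) under the name `cosine_sim`. That works because `TfidfVectorizer` normalizes every row to unit length by default (`norm='l2'`). A minimal check on hypothetical toy strings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical phonetic strings, used only for this check
samples = ["æsnp", "æspnd", "njmɑ"]

vectorizer = TfidfVectorizer(token_pattern=r"\w")  # norm="l2" is the default
matrix = vectorizer.fit_transform(samples)

# Every TF-IDF row has unit Euclidean length
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))  # True

# With unit-length rows, the dot product and the cosine similarity agree
dot = linear_kernel(matrix[:1], matrix)
cos = cosine_similarity(matrix[:1], matrix)
print(np.allclose(dot, cos))  # True
```

If the vectorizer were created with `norm=None`, the two functions would no longer agree and `cosine_similarity` would be the one to use.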


In [43]:
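# Keep the 10 best matches; the positions refer to rows of tfidf_matrix, hence iloc
sim_scores = sim_scores[0:10]
name_indices = [i[0] for i in sim_scores]
selected_names = df_new['names'].iloc[name_indices].tolist()

# Print the suggestions as a numbered list
def suggested_names(name_list):
    for i, item in enumerate(name_list, start=1):
        print(f"{i}. {item}")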
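
The selection step relies on row positions rather than index labels, which matters because `df_new` keeps gaps in its index after duplicates are dropped. The sketch below shows the same idea with NumPy; the names, scores, and `k` are hypothetical placeholders, and `np.argsort` is swapped in for the `sorted` call used above.

```python
import numpy as np
import pandas as pd

# Hypothetical names whose index labels have gaps, as after drop_duplicates()
names = pd.Series(["a", "b", "c", "d"], index=[0, 3, 7, 9])
scores = np.array([0.2, 0.9, 0.1, 0.5])  # one score per row, in row order

k = 2
# Stable sort on negated scores: highest first, ties stay in original row order
top_positions = np.argsort(-scores, kind="stable")[:k]
# iloc selects by position; loc would look for the labels 1 and 3 instead
top_names = names.iloc[top_positions].tolist()

print(top_names)  # ['b', 'd']
```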

## **Result**

In [44]:
suggested_names(selected_names)

1. اسپند
2. اسپرین
3. اسپر
4. اسپی بن
5. اسپس
6. اسپیدان
7. اسپاس
8. اسپیکان
9. پی‌استان
10. اسپید
