# City name search

Customer: Yandex Practicum Career Center

## Project desription 
<a name="project-descr"></a>

**Goal**

- Mapping arbitrary geo names to uniform geonames for internal use by the Career Center

**Objectives**

- Create a solution to match the most appropriate names with geonames. For example Ереван -> Yerevan

- On the example of RF and countries most popular for relocation - Belarus, Armenia, Kazakhstan, Kyrgyzstan, Turkey, Serbia. Cities with population from 15000 people (with the possibility of scaling on the customer's server).

- Returned fields geonameid, name, region, country, cosine similarity

- output data format: list of dictionaries, e.g. [{dict_1}, {dict_2}, .... {dict_n}] where dictionary is one record with specified fields

*Optional*:

- possibility to customize the number of matching names output (e.g. in method parameters)

- correction of errors and misprints. For example Моченгорск -> Monchegorsk

- storage of geonames data in PostgreSQL

- storing vectorized intermediate data in PostgreSQL

- provide methods for configuring connection to the database

- provide methods for class initialization (primary vectorization of geonames)

- provide methods for adding vectors of new geonames

**Implementation period**

Tentative timeframe for the project is 3 weeks from 11/28/2023 

## Data sources

* [geonames.org](http://download.geonames.org/export/dump/)
* [Test dataset](https://disk.yandex.ru/d/wC296Rj3Yso2AQ) 

----------------------------



# Approach

There are multple ways to compare text similarity. Toponyms (like city names) hold only scarce semantic relations with each other and with the common words ('dog', 'summer', 'theater'), so advanced language models (like sentence transformers) do not expected to provide major benifits, if they are feasible at all. On the other hand, word similarity estimates algorithms like Levenstein distance and cosine distanse are relatively simple yet powerful techiques that are robust to misspellings. 

**Research plan**
- Preparation: set up the PostgreSQL engine and re-create the customer's database 
- Implementation of the solution based on `fuzzy` search
- ??? Trying out other matching algorithms based on charachter embeddings 

The matches are ranged by
1. The similarity score with the main or any of the alteratve names -  **MVP**
   - Where the current official name is given priority
2. In the rare event of equality of the scores range the cities by
   - population 
   - admnistrative significance

As such, the necessary columns are:

**`cities15000`** or
**`geonames`** (composed from source tables **`XX.txt`**)
- `geonameid`
- `name`
- `asciiname`
- `alternatenames`
- `population`
- `admin1_code`
- `country_code`

For the region:
**`admin1CodesASCII`**
- `code`
- `name`

**`countryInfo`**
- `ISO`
- `Country`
- `Capital` (?)

For the extra features:

**`alternateNamesV2`**
- `alternate name`
- `isPreferredName`
- `isShortName`
- `isColloquial`
- `isHistoric` - to have an option to exclude / include the historical names. 
- `admin2_code`
- `admin3_code`
- `admin4_code` (?)



# 1. Preparations
## 1.1 Connecting to the database

In [1]:
## Installing depedencies

# %pip install pandas sqlalchemy>=2.0.23 psycopg2 python-dotenv transliterate

In [1]:
import pandas as pd
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine, MetaData, Table, Column,select, Integer, String, DECIMAL, CHAR, BIGINT, func, DATE
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.sql import text

from dotenv import load_dotenv # The sensitive info about the database connection is stored in the .env file
import os
load_dotenv()  

True

In [2]:
import sqlalchemy
sqlalchemy.__version__ # Important to have SQLAlchemy > 2.0!

'2.0.23'

In [2]:
# Read the following from the environment variables
USR = os.getenv('USR') # Tip: NEVER name an env variable "USERNAME"
PWD = os.getenv('PWD')
DB_HOST = os.getenv('DB_HOST')
PORT = os.getenv('PORT')
DB = os.getenv('DB')

DATABASE = {
    'drivername': 'postgresql',
    'username': USR,
    'password': PWD,
    'host': DB_HOST,
    'port': PORT,
    'database': DB, 
    'query': {}
}

# Creating an Engine object
engine = create_engine(URL.create(**DATABASE))

# Checking the connection
try:
    # Подключаемся к базе данных
    with engine.connect() as conn:
         # Trying to execute a simple test query. The `text` function converst a string into and SQL-query
        result = conn.execute(text("SELECT 1"))
        for _ in result:
            pass  # don't do anything
    print(f"Connection established: {DATABASE['database']} на {DATABASE['host']}")
except SQLAlchemyError as e:
    print(f"Connection error: {e}")

Connection established: geo_v2 на 77.222.36.33


## 1.2 Initializing the data on the dev side
This part is left here for demonstration, reproduibility and consistency. The customer already have their DB set up.

Dataset specifications are taken from [geonames.org](https:\\geonames.org) and followed as is when creating the database. 
From all the data presented there, for an 

Many of the columns are easy to read for the users, they are not convenient to handle in queries and scripts. Despite that, they are *not* renamed for compatibility with the customer's database. 

### `countryInfo`

In [92]:
column_names= ['ISO', 'ISO3', 'ISO-Numeric', 'fips', 'Country', 'Capital',
       'Area(in sq km)', 'Population', 'Continent', 'tld', 'CurrencyCode',
       'CurrencyName', 'Phone', 'Postal Code Format', 'Postal Code Regex',
       'Languages', 'geonameid', 'neighbours', 'EquivalentFipsCode']

data = pd.read_csv('../datasets/countryInfo.txt', skiprows=50, sep='\t', index_col=None, names=column_names, encoding='utf-8')
#data=data.rename(columns={"#ISO": "ISO"})
data.head()

Unnamed: 0,ISO,ISO3,ISO-Numeric,fips,Country,Capital,Area(in sq km),Population,Continent,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,geonameid,neighbours,EquivalentFipsCode
0,AD,AND,20,AN,Andorra,Andorra la Vella,468.0,77006,EU,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
1,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS,.ae,AED,Dirham,971,,,"ar-AE,fa,en,hi,ur",290557,"SA,OM",
2,AF,AFG,4,AF,Afghanistan,Kabul,647500.0,37172386,AS,.af,AFN,Afghani,93,,,"fa-AF,ps,uz-AF,tk",1149361,"TM,CN,IR,TJ,PK,UZ",
3,AG,ATG,28,AC,Antigua and Barbuda,St. John's,443.0,96286,,.ag,XCD,Dollar,+1-268,,,en-AG,3576396,,
4,AI,AIA,660,AV,Anguilla,The Valley,102.0,13254,,.ai,XCD,Dollar,+1-264,,,en-AI,3573511,,


In [None]:
# Uploading the table to the database
data.to_sql('countryInfo', con=engine, if_exists='replace', index=False) # mind the "replace" option!

In [119]:
metadata = MetaData()
cinfo = Table('countryInfo', metadata,
    Column('ISO', CHAR(2)),
    Column('ISO3', CHAR(3)),
    Column('ISO-Numeric', Integer),
    Column('fips', CHAR(2)),
    Column('Country', String(200)),
    Column('Capital', String(200)),    
    Column('Area(in sq km)', DECIMAL),
    Column('Population', BIGINT),
    Column('Continent', CHAR(2)),
    Column('tld', CHAR(3)),
    Column('CurrencyName', CHAR(3)),
    Column('Phone', String(30)),
    Column('Postal Code Format', String(30)),
    Column('Postal Code Regex', String(100)),
    Column('Languages', String(100)),
    Column('geonameid', Integer),
    Column('neighbours', String(200)),
    Column('EquivalentFipsCode', CHAR(2))
)

metadata.create_all(engine)

In [98]:
column_names= ['ISO', 'ISO3', 'ISO-Numeric', 'fips', 'Country', 'Capital',
       'Area(in sq km)', 'Population', 'Continent', 'tld', 'CurrencyCode',
       'CurrencyName', 'Phone', 'Postal Code Format', 'Postal Code Regex',
       'Languages', 'geonameid', 'neighbours', 'EquivalentFipsCode']

data = pd.read_csv('../datasets/countryInfo.txt', skiprows=49, sep='\t', index_col=None, encoding='utf-8', usecols=['#ISO', 'ISO3', 'Country', 'Capital', 'Population', 'Continent', 'Languages', 'geonameid'])
#data=data.rename(columns={"#ISO": "ISO"})
data.head()

Unnamed: 0,#ISO,ISO3,Country,Capital,Population,Continent,Languages,geonameid
0,AD,AND,Andorra,Andorra la Vella,77006,EU,ca,3041565
1,AE,ARE,United Arab Emirates,Abu Dhabi,9630959,AS,"ar-AE,fa,en,hi,ur",290557
2,AF,AFG,Afghanistan,Kabul,37172386,AS,"fa-AF,ps,uz-AF,tk",1149361
3,AG,ATG,Antigua and Barbuda,St. John's,96286,,en-AG,3576396
4,AI,AIA,Anguilla,The Valley,13254,,en-AI,3573511


In [99]:
# Data upload
data.to_sql('countryInfo2', con=engine, if_exists='append', index=False)

252

In [7]:
query = 'SELECT * FROM "countryInfo2" LIMIT 10'
pd.read_sql_query(query, con=engine)

Unnamed: 0,#ISO,ISO3,Country,Capital,Population,Continent,Languages,geonameid
0,AD,AND,Andorra,Andorra la Vella,77006,EU,ca,3041565
1,AE,ARE,United Arab Emirates,Abu Dhabi,9630959,AS,"ar-AE,fa,en,hi,ur",290557
2,AF,AFG,Afghanistan,Kabul,37172386,AS,"fa-AF,ps,uz-AF,tk",1149361
3,AG,ATG,Antigua and Barbuda,St. John's,96286,,en-AG,3576396
4,AI,AIA,Anguilla,The Valley,13254,,en-AI,3573511
5,AL,ALB,Albania,Tirana,2866376,EU,"sq,el",783754
6,AM,ARM,Armenia,Yerevan,2951776,AS,hy,174982
7,AO,AGO,Angola,Luanda,30809762,AF,pt-AO,3351879
8,AQ,ATA,Antarctica,,0,AN,,6697173
9,AR,ARG,Argentina,Buenos Aires,44494502,SA,"es-AR,en,it,de,fr,gn",3865483


### `alternateNamesV2` 

This table is a detailed version of the column `alternative` `geonames`
Note, that the dataset `alternateNames` is deprecated
One has to treat this dataset careful: the table is by far the heviest of all and the data is heterogeneous

In [4]:
data = pd.read_csv('../datasets/alternateNamesV2.txt', sep='\t', index_col=None, header=None)
data.head()

  data = pd.read_csv('../datasets/alternateNamesV2.txt', sep='\t', index_col=None, header=None)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1284819,2994701,,Roc Mélé,,,,,,
1,1284820,2994701,,Roc Meler,,,,,,
2,4285256,3007683,,Pic des Langounelles,,,,,,
3,1291197,3017832,,Pic de les Abelletes,,,,,,
4,4290387,3017832,,Pic de la Font-Nègre,,,,,,


In [8]:
data.loc[data[1]==2017370] # Alternative names of Russian Federation 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
12105224,993186,2017370,en,Russian Soviet Federated Socialist Republic,,,,1.0,,
12105225,993187,2017370,,Rossiyskaya Sovetskaya Federativnaya Sotsialis...,,,,1.0,,
12105226,993188,2017370,en,Russian Soviet Federative Socialist Republic,,,,1.0,,
12105227,993191,2017370,en,Russian Socialist Federative Soviet Republic,,,,1.0,,
12105228,1556474,2017370,aa,Russia,1.0,,,,,
...,...,...,...,...,...,...,...,...,...,...
12105386,16930770,2017370,wo,Risi,1.0,,,,,
12105387,16930771,2017370,yi,רוסלאַנד,1.0,,,,,
12105388,16930772,2017370,zh,俄罗斯,1.0,,,,,
12105389,17433252,2017370,ru,России,,,,,,


In [9]:
data.columns = ['alternateNameId', 'geonameid', 'isolanguage', 'alternate name', 'isPreferredName', 'isShortName', 'isColloquial', 'isHistoric', 'from', 'to']

In [5]:
engine.connect().rollback()

metadata = MetaData()
geonames = Table('alternateNames', metadata,
    Column('alternateNameId', Integer),
    Column('geonameid', Integer),
    Column('isolanguage', CHAR(7)),
    Column('alternate name', String(400)),
    Column('isPreferredName', CHAR(1)),
    Column('isShortName', CHAR(1)),    
    Column('isColloquial', CHAR(1)),
    Column('isHistoric', CHAR(1)),
    Column('used_from', CHAR(20)), # should be more than enough
    Column('used_to', CHAR(20)),
)

metadata.create_all(engine)

## The following takes 46 minutes to run! 
#### data.to_sql('alternateNames', con=engine, if_exists='replace', index=False)

674

### `geonames`
This is the main source of data. This table is created from combination of country-specific tables. Here we only take the tables for countries that are of the main customer's interest (see [Project description](#project-descr)) and upload them one by one.

In [41]:
column_names = [
    'geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude',
    'feature_class', 'feature_code', 'country_code', 'cc2', 'admin1_code',
    'admin2_code', 'admin3_code', 'admin4_code', 'population', 'elevation',
    'dem', 'timezone', 'modification_date'
]

In [83]:
# Reading the text files
data = pd.read_csv('../datasets/RS.txt', sep='\t', names=column_names, encoding='utf-8')


In [None]:
geonames = Table('geonames', metadata,
    Column('geonameid', Integer),
    Column('name', String(200)),
    Column('asciiname', String(200)),
    Column('alternatenames', String(10000)),
    Column('latitude', DECIMAL),
    Column('longitude', DECIMAL),
    Column('feature_class', CHAR(1)),
    Column('feature_code', String(10)),
    Column('country_code', CHAR(2)),
    Column('cc2', String(200)),
    Column('admin1_code', String(20)),
    Column('admin2_code', String(80)),
    Column('admin3_code', String(20)),
    Column('admin4_code', String(20)),
    Column('population', BIGINT),
    Column('elevation', Integer),
    Column('dem', Integer),
    Column('timezone', String(40)),
    Column('modification_date', DATE)
)
metadata.create_all(engine)

In [84]:
data.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date
0,672867,Moravica,Moravica,"Maravita,Maraviţa,Moravica,Moravicza,Moravita,...",45.23333,21.25,H,STM,RS,,0,,,,0,,77,Europe/Belgrade,2014-11-05
1,675496,Iron Gates,Iron Gates,"Dealul Klisura,Derdap,Eisenernes Tor,Eisernes ...",44.67965,22.51537,T,GRGE,RS,RO,0,,,,0,,61,Europe/Belgrade,2019-03-01
2,682504,Kazan,Kazan,"Cazane,Cazane Defile,Kazan,Kazan Pass,Kazanske...",44.66667,22.3,T,GRGE,RS,,0,,,,0,,158,Europe/Belgrade,2014-11-05
3,682722,Râu Caraş,Rau Caras,"Caras,Caraş,Caraș,Karas,Karas River,Karaş,Kara...",44.81667,21.33333,H,STM,RS,RO,0,,,,0,,65,Europe/Belgrade,2021-02-16
4,684724,Kanal Brzava,Kanal Brzava,Kanal Brzava,45.27549,20.82796,H,CNL,RS,,0,,,,0,,72,Europe/Belgrade,2012-07-04


In [85]:
# Upload to the database
data.to_sql('geonames', con=engine, if_exists='append', index=False)

498

In [8]:
# Определяем таблицу для запроса
# geonames = Table('geonames', metadata, autoload_with=engine)

# Test query
query = "SELECT * FROM geonames WHERE country_code = 'RS' LIMIT 10  "
pd.read_sql_query(query, con=engine)

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date
0,672867,Moravica,Moravica,"Maravita,Maraviţa,Moravica,Moravicza,Moravita,...",45.23333,21.25,H,STM,RS,,0,,,,0,,77,Europe/Belgrade,2014-11-05
1,675496,Iron Gates,Iron Gates,"Dealul Klisura,Derdap,Eisenernes Tor,Eisernes ...",44.67965,22.51537,T,GRGE,RS,RO,0,,,,0,,61,Europe/Belgrade,2019-03-01
2,682504,Kazan,Kazan,"Cazane,Cazane Defile,Kazan,Kazan Pass,Kazanske...",44.66667,22.3,T,GRGE,RS,,0,,,,0,,158,Europe/Belgrade,2014-11-05
3,682722,Râu Caraş,Rau Caras,"Caras,Caraş,Caraș,Karas,Karas River,Karaş,Kara...",44.81667,21.33333,H,STM,RS,RO,0,,,,0,,65,Europe/Belgrade,2021-02-16
4,684724,Kanal Brzava,Kanal Brzava,Kanal Brzava,45.27549,20.82796,H,CNL,RS,,0,,,,0,,72,Europe/Belgrade,2012-07-04
5,685194,Begej,Begej,"Bega,Begeiul,Begej,Begheiul,Raul Bega,Riu Bega...",45.20861,20.31528,H,STM,RS,,0,,,,0,,71,Europe/Belgrade,2020-08-25
6,686243,Zlatica,Zlatica,"Aranca,Aranka,Zlatica",45.81213,20.14855,H,STM,RS,,0,,,,0,,78,Europe/Belgrade,2012-07-04
7,691517,Tisa,Tisa,"Theiss,Theiß,Tisa,Tisza,Tysa",45.13806,20.2775,H,STM,RS,,0,,,,0,,69,Europe/Belgrade,2023-09-10
8,725863,Visočica,Visocica,"Visocica,Visočica",43.29663,22.61132,H,STM,RS,,0,,,,0,,495,Europe/Belgrade,2012-09-06
9,725902,Vidlich,Vidlich,"Vidlic,Vidlich,Vidlič,Видлич",43.14215,22.80233,T,MTS,RS,,0,,,,0,,1329,Europe/Belgrade,2017-03-04


### `cities15000`

In [111]:
column_names = [
    'geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude',
    'feature_class', 'feature_code', 'country_code', 'cc2', 'admin1_code',
    'admin2_code', 'admin3_code', 'admin4_code', 'population', 'elevation',
    'dem', 'timezone', 'modification_date'
]

data = pd.read_csv('../datasets/cities15000.txt', sep='\t', names=column_names, encoding='utf-8')
data.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2020-03-03
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",25.56473,55.55517,P,PPLA,AE,,7,,,,62747,,2,Asia/Dubai,2019-10-24
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",25.78953,55.9432,P,PPLA,AE,,5,,,,351943,,2,Asia/Dubai,2019-09-09
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",23.65416,53.70522,P,PPL,AE,,1,103.0,,,63482,,118,Asia/Dubai,2019-10-24


In [112]:
data.to_sql('cities15000', con=engine, if_exists='append', index=False)

127

In [9]:
metadata = MetaData()

cities = Table('cities15000', metadata,
    Column('geonameid', Integer),
    Column('name', String(200)),
    Column('asciiname', String(200)),
    Column('alternatenames', String(10000)),
    Column('latitude', DECIMAL),
    Column('longitude', DECIMAL),
    Column('feature_class', CHAR(1)),
    Column('feature_code', String(10)),
    Column('country_code', CHAR(2)),
    Column('cc2', String(200)),
    Column('admin1_code', String(20)),
    Column('admin2_code', String(80)),
    Column('admin3_code', String(20)),
    Column('admin4_code', String(20)),
    Column('population', BIGINT),
    Column('elevation', Integer),
    Column('dem', Integer),
    Column('timezone', String(40)),
    Column('modification_date', DATE)
)
# metadata.create_all(engine)

query = select(func.count()).select_from(cities)

# Выполняем запрос и выводим результат
count = pd.read_sql_query(query, con=engine).values[0,0]
print("Number of entries in 'cities15000':", count)

Number of entries in 'cities15000': 27127


In [4]:
# Let's take a look again at what tables are in the database

from sqlalchemy import inspect

inspector = inspect(engine)
schemas = inspector.get_schema_names()

for schema in schemas:
    #print("schema: %s" % schema)
    print(inspector.get_table_names(schema=schema))

['sql_features', 'sql_implementation_info', 'sql_parts', 'sql_sizing']
['alternateNames', 'geonames', 'countryInfo', 'countryInfo2', 'cities15000']


### `admin1CodesASCII`

In [27]:
column_names = [
    'code', 'name', 'asciiname', 'geonameid'
]

data = pd.read_csv('../datasets/admin1CodesASCII.txt', sep='\t', names=column_names, encoding='utf-8')
data.head()

Unnamed: 0,code,name,asciiname,geonameid
0,AD.06,Sant Julià de Loria,Sant Julia de Loria,3039162
1,AD.05,Ordino,Ordino,3039676
2,AD.04,La Massana,La Massana,3040131
3,AD.03,Encamp,Encamp,3040684
4,AD.02,Canillo,Canillo,3041203


In [31]:
metadata = MetaData()

cities = Table('admin1CodesASCII', metadata,
    Column('code', CHAR(5)),
    Column('name', String(200)),
    Column('asciiname', String(100)),
    Column('geonameid', Integer)
)
metadata.create_all(engine)

data.to_sql('admin1CodesASCII', con=engine, if_exists='replace', index=False)

881

# 2. Research

# 2.1 Creating the working dataframe

The dataframe contains the bare minimum info: indiciies (`geonameIDs`) and all possible names.

-> *Upscaling: query* `geonames` *instead of* `cities15000`
-> Unrestricting country selection 

~~The following query asks for the geonameID, country code, names and population in the countries of interest 
and merges this with the data from the table countryInfo to get the country names:~~

In [62]:
country_selection = ('RU', 'KZ', 'AM', 'RS', 'ME', 'KG', 'GE')

# Delete the WHERE clause if you want to select cities from around the globe
query = f'''
SELECT geonameid, name, alternatenames, country_code
FROM cities15000  
WHERE country_code IN {country_selection}
'''
# LEFT JOIN (SELECT "ISO", "Country" FROM "countryInfo") AS ci
# ON cities15000.country_code = ci."ISO"

df = pd.read_sql_query(query, con=engine, index_col = 'geonameid')
df.head()

Unnamed: 0_level_0,name,alternatenames,country_code
geonameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
174875,Kapan,"Ghap'an,Ghapan,Ghap’an,Kafan,Kafin,Kapan,Kapan...",AM
174895,Goris,"Geryusy,Goris,Горис,Գորիս",AM
174972,Hats’avan,"Acavan,Atsavan,Hats'avan,Hats’avan,Sisian,Ацав...",AM
174979,Artashat,"Artachat,Artasat,Artasatas,Artasato,Artaschat,...",AM
174991,Ararat,"Ararat,Araratas,Ararato,Davalinskiy Tsemzavod,...",AM


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27127 entries, 3040051 to 1106542
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            27127 non-null  object
 1   alternatenames  24799 non-null  object
dtypes: object(2)
memory usage: 635.8+ KB



## 2.1 The brute force approach: `fuzzy` search

What about calculating the query similarity to each and every word and selecting the vector with the highest overall match? 

An metrics tolerant to misspellings is Levinstein distance. The fastest implementation of the algorithm provided in the `fuzzy` search, for example, `thefuzz`.

In [13]:
# %pip install thefuzz

In [63]:
from thefuzz import process
from transliterate import slugify
from transliterate import detect_language
import numpy as np

Now we need to prepare the data

In [64]:
# Splitting the column AlternativeNames into single names: 
altnames = [l.split(',') if l else [None] for l in df.alternatenames.values] 

names=df.name.values

for i in range(len(altnames)):
    altnames[i].append(names[i])
#Creating the dictionary of the structure geonameID: names for all cities including official and alternative names 
d = {ind: n for ind, n in zip(df.index, altnames)}

Writing the first function

In [65]:
def search(query, k=10):
    "The rapid fuzzy search"
    
    if detect_language(query) is not None:
        query = slugify(query)
    scores = {} # container for match scores for each city
    
    for ind, name_list in d.items(): ## for each city calculate similarity scores with evry alternative name
        _ = np.array(process.extract(query, name_list)) 

        scores[ind] = _[:, 1].astype(int).sum() # sum up the scores...
        scores[ind] /= len(_) # ...and normalize by the number of the alternative names
    
    sorted_scores = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True)) # sorting the name groups by the match score
    indexes = list(sorted_scores)[:k] # select the DatFrame indicies of the top k
    result = df.loc[indexes] # return all the desired info
    result.insert(1, column='score', value=list(sorted_scores.values())[:k]) # insert the scores
    return result

Testing: 

Misspelled name

In [66]:
%%time
k=100 ## number of suggestions
# query = input()
query = 'Ржевск'

search(query, k)

CPU times: total: 188 ms
Wall time: 270 ms


Unnamed: 0_level_0,name,score,alternatenames,country_code
geonameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
554840,Izhevsk,80.0,"IJK,Ijevsk,Ischewsk,Ishewsk,Izevsk,Izevska,Ize...",RU
499717,Rzhev,77.6,"Raesevae,Rescovia,Rjev,Rjov,Rschew,Rshew,Rzev,...",RU
518659,Novokuybyshevsk,75.0,"Navakujbyshehusk,Novo-Kuybuyshev,Novo-Kuybyshe...",RU
500047,Ryazhsk,72.2,"Razhsk,Riajsk,Riazhsk,Rjajsk,Rjaschsk,Rjazhsk,...",RU
1506073,Gur’yevsk,67.2,"Gur'evsk,Gur'yevsk,Gur'yevskov,Gurevsk,Gurjevs...",RU
...,...,...,...,...
481350,Trubchevsk,53.4,"Trubchevsk,Trubtschewsk,Trubtsjevsk,Трубчевск",RU
1538637,Seversk,53.4,"Severs'k,Seversk,Severskas,Sewersk,Sewjersk,Si...",RU
2012557,Zheleznogorsk-Ilimskiy,53.4,"Korshunikha,Zheleznogorsk,Zheleznogorsk-Ilimsk...",RU
1485357,Zavodoukovsk,53.2,"Sawodoukowsk,Zavodaukousk,Zavodo-oekovsk,Zavod...",RU


Историческое название

In [74]:
%%time
k=10 ## number of suggestions
query = 'Сталинград'
#query = 'Атомград'

result = search(query, k)
result

CPU times: total: 172 ms
Wall time: 237 ms


Unnamed: 0_level_0,name,score,alternatenames,country_code
geonameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
472757,Volgograd,88.8,"Caricin,Caricyn,Estalingrado,Stalingrad,Stalin...",RU
554234,Kaliningrad,74.8,"Caliningrado,Calininopolis,KGD,Kalinin'nkrant,...",RU
518557,Novomoskovsk,72.2,"Bobriki,Novamaskousk,Novomoskovs'k,Novomoskovs...",RU
495957,Shali,71.2,"Chali,Mezhdurech'e,Mezhdurech'ye,Mezhdurech’ye...",RU
485698,Svetlograd,70.0,"Petrovskoe,Petrovskoye,Petrowskoje,Svetlagrad,...",RU
1526273,Astana,66.4,"Ak-Mola,Akmola,Akmolins'k,Akmolinsk,Aqmola,Ast...",KZ
785965,Senta,64.8,"Senta,Szenta,Szintarev,Szintarév,Szénta,Zenta,...",RS
547523,Klin,63.6,"Klin,Klina,Kline,Kļina,Ulin,ke lin,keullin,kln...",RU
1490281,Talitsa,62.6,"Talica,Talicja,Talitsa,Taliza,Taliça,Tàlitsa,t...",RU
611403,Ts’khinvali,60.2,"Chreba,Ckhinval,Ckhinvali,Stalinir,Staliniri,S...",GE


In [75]:
ind = np.asarray(result.index)

### More weight on closer matches 

In [15]:
def search_parabolic(query, k=10):
    "The rapid fuzzy search - scoring with root mean square difference. RMSD puts higher weight onto closer matches"
    if detect_language(query) is not None:
        query = slugify(query)
    scores = {} # container for match scores for each city
    
    for ind, name_list in d.items(): ## for each city calculate similarity scores with evry alternative name
        _ = np.array(process.extract(query, name_list)) 
        # Calculate RMSD
        scores[ind] = np.square(_[:, 1].astype(int)).sum() # sum up the squares of the scores...
        scores[ind] = np.sqrt(scores[ind] / len(_))  # ...normalize by the length and take the root 
    
    sorted_scores = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True)) # sorting the name groups by the match score
    indexes = list(sorted_scores)[:k] # select the DatFrame indicies of the top k
    result = df.loc[indexes] # return the maches in the desired form
    result.insert(1, column='score', value=list(sorted_scores.values())[:k]) # insert the scores
    return result# df_.loc[indexes]

In [16]:
def search_exp(query, k=10):
    "The rapid fuzzy search - scoring with exponent mean difference (like RMSD but with exp and log)"
    if detect_language(query) is not None:
        query = slugify(query)
    scores = {} # container for match scores for each city
    
    for ind, name_list in d.items(): ## for each city calculate similarity scores with evry alternative name
        _ = np.array(process.extract(query, name_list)) 
        # Calculate the exponent
        scores[ind] = np.exp(_[:, 1].astype(int)).sum() / len(_) # sum up the squares of the scores...
        scores[ind] = np.log(scores[ind])  #  ...take the reverse function 
    
    sorted_scores = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True)) # sorting the name groups by the match score
    indexes = list(sorted_scores)[:k] # select the DatFrame indicies of the top k
    result = df.loc[indexes] # return the maches in the desired form
    result.insert(1, column='score', value=list(sorted_scores.values())[:k]) # insert the scores
    return result# df_.loc[indexes]

In [78]:
"The rapid fuzzy search - scoring with root mean square difference. RMSD puts higher weight onto closer matches"
k=10
# query = "Атомград"
query = "Влад"

if detect_language(query) is not None:
    query = slugify(query)
scores = {} # container for match scores for each city

for ind, name_list in d.items(): ## for each city calculate similarity scores with evry alternative name
    _ = np.array(process.extract(query, name_list)) # so that the exponent is not too large!
    # Calculate the function
    scores[ind] = np.exp(_[:, 1].astype(int)).sum() / len(_)
    scores[ind] = np.log(scores[ind])  # sum up the exponents of the scores...

# sorted by the matching score (.2 ms faster with the native Python function)
scores_df = pd.DataFrame.from_records(
    sorted(scores.items(), key=lambda item: item[1], reverse=True), columns=['geonameid', 'score']) 

indexes = tuple(scores_df.loc[:k, 'geonameid']) # select the DataFrame indicies of the top k

query = f'''
    SELECT
        cities.geonameid,
        cities.name,
        regions.name as region,
        ci."Country"

    FROM
        cities15000 AS cities
    LEFT JOIN
        (SELECT "ISO", "Country" FROM "countryInfo") AS ci
    ON
        cities.country_code = ci."ISO"
    LEFT JOIN
        "admin1CodesASCII" AS regions
    ON
        COALESCE(cities.country_code, '') || '.' || COALESCE(cities.admin1_code, '') = regions.code
    WHERE
        cities.geonameid IN {indexes};
'''

qres = pd.read_sql_query(query, con=engine).drop_duplicates()

# result = df.loc[indexes] # return the maches in the desired form
# result.insert(1, column='score', value=list(sorted_scores.values())[:k]) # insert the scores
#qres
result = pd.merge(qres, scores_df, on='geonameid', how='right')

#result.loc[:, 'score'] = result.loc[:, 'score'].div(result.loc[:, 'score'].max())
result # df_.loc[indexes]

Unnamed: 0,geonameid,name,region,Country,score
0,473247,Vladimir,Vladimir Oblast,Russia,90.000000
1,473249,Vladikavkaz,North Ossetia–Alania,Russia,90.000000
2,2013348,Vladivostok,Primorye,Russia,90.000000
3,473127,Novovladykino,Moscow,Russia,89.712318
4,538913,Kurchaloy,Chechnya,Russia,77.000000
...,...,...,...,...,...
1294,1501365,,,,0.000000
1295,1502536,,,,0.000000
1296,1539209,,,,0.000000
1297,2013923,,,,0.000000


In [21]:
query = f'''
    SELECT country_code, name, asciiname, population, ci."Country"
    FROM cities15000  
    LEFT JOIN (SELECT "ISO", "Country" FROM "countryInfo") AS ci
    ON cities15000.country_code = ci."ISO"
    WHERE geonameid IN {indexes};
'''
pd.read_sql_query(query, con=engine).drop_duplicates()

Unnamed: 0,country_code,name,asciiname,population,Country
0,AF,Bāzār-e Yakāwlang,Bazar-e Yakawlang,65000,Afghanistan
2,AO,Mbanza Kongo,Mbanza Kongo,148000,Angola
4,AO,Luanda,Luanda,2776168,Angola
6,AO,Ondjiva,Ondjiva,121537,Angola
8,AR,Saladas,Saladas,18349,Argentina
...,...,...,...,...,...
192,US,Loveland,Loveland,75182,United States
194,US,Walla Walla,Walla Walla,32237,United States
196,ZA,Lady Frere,Lady Frere,25041,South Africa
198,VU,Port-Vila,Port-Vila,35901,Vanuatu


In [76]:
query = f'''
    SELECT
        cities.country_code,
        cities.name,
        cities.asciiname,
        cities.population,
        cities.admin1_code,
        ci."Country",
        regions.code,
        regions.name as region

    FROM
        cities15000 AS cities
    LEFT JOIN
        (SELECT "ISO", "Country" FROM "countryInfo") AS ci
    ON
        cities.country_code = ci."ISO"
    LEFT JOIN
        "admin1CodesASCII" AS regions
    ON
        COALESCE(cities.country_code, '') || '.' || COALESCE(cities.admin1_code, '') = regions.code
    WHERE
        cities.geonameid IN {tuple(indexes)};
'''

query = f'''
    SELECT
        cities.geonameid,
        cities.name,
        cities.population,
        ci."Country",
        regions.code,
        regions.name as region

    FROM
        cities15000 AS cities
    LEFT JOIN
        (SELECT "ISO", "Country" FROM "countryInfo") AS ci
    ON
        cities.country_code = ci."ISO"
    LEFT JOIN
        "admin1CodesASCII" AS regions
    ON
        COALESCE(cities.country_code, '') || '.' || COALESCE(cities.admin1_code, '') = regions.code
    WHERE
        cities.geonameid IN {tuple(ind)};
'''


pd.read_sql_query(query, con=engine) #.drop_duplicates()

Unnamed: 0,geonameid,name,population,Country,code,region
0,611403,Ts’khinvali,32180,Georgia,GE.73,Shida Kartli
1,611403,Ts’khinvali,32180,Georgia,GE.73,Shida Kartli
2,1526273,Astana,345604,Kazakhstan,KZ.05,Astana
3,1526273,Astana,345604,Kazakhstan,KZ.05,Astana
4,785965,Senta,20302,Serbia,RS.VO,Vojvodina
5,785965,Senta,20302,Serbia,RS.VO,Vojvodina
6,472757,Volgograd,1013533,Russia,RU.84,Volgograd Oblast
7,472757,Volgograd,1013533,Russia,RU.84,Volgograd Oblast
8,518557,Novomoskovsk,130982,Russia,RU.76,Tula Oblast
9,518557,Novomoskovsk,130982,Russia,RU.76,Tula Oblast


In [73]:
ind

array([ 472757,  554234,  518557,  495957,  485698, 1526273,  785965,
        547523, 1490281,  611403,  498817,  563514,  802078,  792680,
        514706,  488852,  463637, 2027968, 1521379,  616877, 1490140,
       1519928, 2025339,  463828, 1527534,  499161, 1491706,  490068,
       1493197, 1518542,  480562, 1516905,  495206,  519336,  790015,
        534701,  551986, 1498087,  533690, 1511309,  490466,  608668,
       1487277, 2014927, 1519725,  557775,  491422,  562161,  582750,
       1537939,  498418, 3204672,  482283,  499099, 2014718, 2119441,
        488635,  496015, 1489962,  553915,  580497, 1503335, 2127202,
       1521370,  501320,  581179, 1494907, 1497393,  610529, 1490266,
        551964, 1496476, 2027667,  174875,  616194, 3191429,  478044,
        495518,  515698,  523064,  540103,  611717,  516716, 1518262,
        486968,  566532,  493160,  479411,  493231,  489226,  498687,
        514171,  562237, 1490085,  611694, 1519691,  498698,  584471,
        487846,  561

In [53]:
# cities = pd.read_sql_query('SELECT * FROM cities15000', con=engine).set_index('geonameid')

cities.loc[2013348]

name                                                       Vladivostok
asciiname                                                  Vladivostok
alternatenames       Bladibostok,Uladzivastok,VVO,Vladivostok,Vladi...
latitude                                                      43.10562
longitude                                                    131.87353
feature_class                                                        P
feature_code                                                      PPLA
country_code                                                        RU
cc2                                                               None
admin1_code                                                         59
admin2_code                                                       None
admin3_code                                                       None
admin4_code                                                       None
population                                                      604901
elevat

In [48]:
indexes

(473247,
 473249,
 2013348,
 689378,
 473127,
 5368361,
 4228147,
 2240449,
 294421,
 6177869,
 1253468)

In [138]:
%%timeit

pd.DataFrame.from_records(
    sorted(scores.items(), key=lambda item: item[1], reverse=True), columns=['geonameid', 'score'])

1.74 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [150]:
%%timeit

pd.DataFrame.from_records(list(scores.items()), columns=['geonameid', 'score']).sort_values('score', ascending=False)

1.9 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


it is a little bit faster to sort the scores before creating the DataFrame

In [56]:
%%time
k=10 ## number of suggestions
# query = input()
query = 'Atomgrad'

search_exp(query, k)

CPU times: total: 250 ms
Wall time: 421 ms


Unnamed: 0,geonameid,score,country_code,name,asciiname,alternatenames,population
1223,1538635,5.3762340000000003e+42,RU,Zheleznogorsk,Zheleznogorsk,"Atomgrad,Devyatka,Krasnojarsk-26,Krasnoyarsk-2...",93834
1342,737421,1.239114e+31,TR,Yomra,Yomra,"Dirona,Yomra",19770
1568,325363,3.717366e+30,TR,Adana,Adana,"ADA,Adana,Adane,Adano,Adanë,Adhanah,Antiocheia...",1779463
1287,2027968,3.717344e+30,RU,Aldan,Aldan,"ADH,Aldan,Aldanas,Ałdan,Nezametnyy,aldan,Алдан,알단",24426
1550,323777,3.717343e+30,TR,Antalya,Antalya,"AYT,Adalia,Antal'ja,Antalia,Antalija,Antaliya,...",1344000
1566,325330,3.717343e+30,TR,Adıyaman,Adiyaman,"ADF,Adiaman,Adijaman,Adijamanas,Adityman,Adiya...",267131
1227,1540356,3.717343e+30,RU,Raduzhny,Raduzhny,"RAT,Radoujny,Radujnij,Radujniy,Radujnıy,Radusc...",47679
188,1526038,3.717343e+30,KZ,Atbasar,Atbasar,"ATX,Atbasa,Atbasar,Atbasaras,Atbassar,Otbosar,...",34797
481,498817,1.39259e+30,RU,Saint Petersburg,Saint Petersburg,"Agia Petroupole,Betuyrbukh,Cankt-Peterburg,LED...",5351935
130,610529,1.367539e+30,KZ,Atyrau,Atyrau,"Aterau,Atirau,Atirav,Atiraw,Atorau,Aturau,Atyr...",290700


#### More memory-friendly approach: we don't want to store two datasets in python variables
Rather, it's better to query the data from the database

Query for the preliminary output
``SELECT country_code, name, asciiname, population, ci."Country"
FROM cities15000  
LEFT JOIN (SELECT "ISO", "Country" FROM "countryInfo") AS ci
ON cities15000.country_code = ci."ISO"
WHERE geonameid IN {tuple(indexes)};``

In [115]:
query = f'''
    SELECT country_code, name, asciiname, population, ci."Country"
    FROM cities15000  
    LEFT JOIN (SELECT "ISO", "Country" FROM "countryInfo") AS ci
    ON cities15000.country_code = ci."ISO"
    WHERE geonameid IN {tuple(indexes)};
'''
result = pd.read_sql_query(query, con=engine).drop_duplicates()
result

Unnamed: 0,country_code,name,asciiname,population,Country
0,AM,Yerevan,Yerevan,1093485,Armenia
2,AM,Vagharshapat,Vagharshapat,46200,Armenia
4,AM,Vanadzor,Vanadzor,101098,Armenia
6,BY,Volkovysk,Volkovysk,47300,Belarus
8,BY,Maladziečna,Maladziecna,101300,Belarus
...,...,...,...,...,...
190,TR,İskilip,Iskilip,19829,Turkey
192,TR,Edirne,Edirne,180002,Turkey
194,TR,Bulancak,Bulancak,43635,Turkey
196,TR,Bolu,Bolu,184682,Turkey


Exponential weighting works far better, and the square is just not enough. 
The result is awesome! We are returning the most likely matches, and surfacing out the precise ones, if any. 

Now let's try to downscale the really ancient names. 

In [6]:
query = ''' SELECT geonameid, name, asciiname, population FROM cities1500 c
LEFT JOIN (SELECT geonameid, 'alternateNameId', 'alternate name', 'isHistoric', 'isColloquial', 'isShortName' FROM "alternateNames" an) 
ON geonameid
'''
df_ = pd.read_sql_query(query, con=engine)
df_.head()

OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected

[SQL:  SELECT geonameid, name, asciiname, population FROM cities1500 c
LEFT JOIN (SELECT geonameid, 'alternateNameId', 'alternate name', 'isHistoric', 'isColloquial', 'isShortName' FROM "alternateNames" an) 
ON geonameid
]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

# Some tests

In [174]:
test_df = pd.read_csv('../datasets/geo_test.csv', sep=';')
test_df.head()

Unnamed: 0,query,name,region,country
0,Смоленск,Smolensk,Smolensk Oblast,Russia
1,Кемерово,Kemerovo,Kuzbass,Russia
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan
3,Москва,Moscow,Moscow,Russia
4,Алматы,Almaty,Almaty,Kazakhstan


In [176]:
queries = test_df['query'].values

In [182]:
%%time
def search_dummy(query, k=10):
    "Returns None"
    if detect_language(query) is not None:
        query = slugify(query)
    scores = {}
    for ind, name_list in d.items():
        _ = np.array(process.extract(query, name_list, limit=5))
        scores[ind] = _[:, 1].astype(int).sum()
        scores[ind] /= len(_) 
    sorted_scores = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
    indexes = list(sorted_scores)[:k] # Works flawlessly
    result = df_.loc[indexes]
    result.insert(1, column='score', value=list(sorted_scores.values())[:k])
    return None # df_.loc[indexes]

for q in queries:  
    search_dummy(q, 10)

CPU times: total: 0 ns
Wall time: 0 ns


Processsing of 346 queries took 2 mins 18 s. That's our baseline. 

@TODO: measure accuracy!

# Second approach: the `Faiss` search
This algorithm should be much quicer and perform well in autocompletion.
This requires encoding the city names into character embeddings and n-grams. 

In [1]:
%pip install faiss-cpu # not the recommended way to install faiss - but it works...


Note: you may need to restart the kernel to use updated packages.


In [None]:
import faiss

## Preparing the data for encoding

In [None]:
import re
import string

## Not yet used
def preprocess_name(name):

  if name is not None:
    # Переводим текст в нижний регистр
    name = name.lower()

    # Удаление знаков препинания
    name = re.sub('[%s]' % re.escape(string.punctuation + '«»–'), ' ', name)

    # Разделение на латиницу и кириллицу
    name = ' '.join(re.split(r'([a-zA-Z]+|[а-яА-Я]+)', name))

    # Удаление лишних пробелов
    name = ' '.join(name.split())

    return name

# Third approach: table question answering models

Might be an overkill, because we don't need to process the whole sentence - but let's try and compare the performance

In [12]:
%pip install huggingface_hub

Collecting huggingface_hub
  Downloading huggingface_hub-0.19.4-py3-none-any.whl.metadata (14 kB)
Collecting filelock (from huggingface_hub)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting pyyaml>=5.1 (from huggingface_hub)
  Downloading PyYAML-6.0.1-cp310-cp310-win_amd64.whl.metadata (2.1 kB)
Downloading huggingface_hub-0.19.4-py3-none-any.whl (311 kB)
   ---------------------------------------- 0.0/311.7 kB ? eta -:--:--
   ----- --------------------------------- 41.0/311.7 kB 991.0 kB/s eta 0:00:01
   ---------------------------------------  307.2/311.7 kB 3.8 MB/s eta 0:00:01
   ---------------------------------------- 311.7/311.7 kB 3.2 MB/s eta 0:00:00
Downloading PyYAML-6.0.1-cp310-cp310-win_amd64.whl (145 kB)
   ---------------------------------------- 0.0/145.3 kB ? eta -:--:--
   ---------------------------------------- 145.3/145.3 kB 4.4 MB/s eta 0:00:00
Downloading filelock-3.13.1-py3-none-any.whl (11 kB)
Installing collected packages: pyyaml, fi

In [15]:
%pip install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.1-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.9 (from ipywidgets)
  Downloading widgetsnbextension-4.0.9-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.9 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.9-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.1-py3-none-any.whl (139 kB)
   ---------------------------------------- 0.0/139.4 kB ? eta -:--:--
   -------- ------------------------------- 30.7/139.4 kB 1.4 MB/s eta 0:00:01
   -------------------------------------- - 133.1/139.4 kB 2.0 MB/s eta 0:00:01
   ---------------------------------------- 139.4/139.4 kB 1.7 MB/s eta 0:00:00
Downloading jupyterlab_widgets-3.0.9-py3-none-any.whl (214 kB)
   ---------------------------------------- 0.0/214.9 kB ? eta -:--:--
   ---------------------------------------- 214.9/214.9 kB 6.6 MB/s eta 0:00:00
Downloading widgetsnbextension-4.0.9-py3-none-any.whl (2.3 MB)
   ---------

In [2]:
%pip install torch transformers

In [1]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Loading the model