# Visualize taksometrs versus ormanis

In this notebook, we will compare two different types of transportation: taksometrs and ormanis. We will visualize the differences in their usage and characteristics using a line plot.

## Idea

The main idea is to show trends in the usage of taksometrs and ormanis over time. We will use a line plot to visualize the data, with the x-axis representing time and the y-axis representing how often taksometrs and ormanis are mentioned.

We will use document frequency to show the trends in usage. Document frequency is a measure of how often a term appears in a set of documents. In this case, we will use document frequency to show how often taksometrs and ormanis are mentioned in the text.

To account for the different number of documents in each year, we will use relative frequencies. Relative frequency is the document frequency of a term divided by the total number of documents. Here it shows the share of each year's documents that mention taksometrs or ormanis.
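
The definition above can be illustrated with a minimal pure-Python sketch. The rows, the document ids and the function name `relative_document_frequency` are invented for illustration and are not taken from the dataset. With pandas, the same idea could be expressed with a `groupby` on the year followed by `nunique` on the document column.

```python
from collections import defaultdict

# Toy rows: (year, document id, lemma). All values are invented for illustration.
rows = [
    (1925, "doc_a", "ormanis"),
    (1925, "doc_a", "ormanis"),      # repeated mention in the same document
    (1925, "doc_b", "taksometrs"),
    (1925, "doc_c", "maize"),
    (1935, "doc_d", "taksometrs"),
    (1935, "doc_e", "taksometrs"),
    (1935, "doc_e", "ormanis"),
    (1935, "doc_f", "maize"),
    (1935, "doc_g", "maize"),
]


def relative_document_frequency(rows, targets):
    """Return {year: {term: share of that year's documents mentioning the term}}."""
    docs_per_year = defaultdict(set)    # year -> all document ids seen that year
    docs_with_term = defaultdict(set)   # (year, term) -> document ids mentioning the term
    for year, doc_id, lemma in rows:
        docs_per_year[year].add(doc_id)
        if lemma in targets:
            docs_with_term[(year, lemma)].add(doc_id)
    return {
        year: {
            term: len(docs_with_term[(year, term)]) / len(docs)
            for term in targets
        }
        for year, docs in sorted(docs_per_year.items())
    }


rel_freq = relative_document_frequency(rows, ("ormanis", "taksometrs"))
for year, shares in rel_freq.items():
    print(year, shares)
```

Because documents are collected in sets, a term mentioned many times in one document still counts once, which is what distinguishes document frequency from a raw token count.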

### Related terms

Our taksometrs and ormanis counts will include related terms: fūrmanis and važonis for ormanis, and taksītis and taksitis for taksometrs. Inflected forms such as taksometra and ormaņa are counted as well.
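
One way to picture the grouping is a lookup from lemma to the term it is counted under. The sketch below is only an illustration: the lemma lists mirror the ones used in the code cells of this notebook, while the dictionary `LEMMA_TO_TERM` and the helper `canonical_term` are hypothetical names. Inflected forms do not need their own entries, because the dataset already provides a `lemma` column.

```python
# Hypothetical lookup table; the lemmas come from this notebook,
# the dictionary itself is an illustration.
LEMMA_TO_TERM = {
    "ormanis": "ormanis",
    "fūrmanis": "ormanis",
    "važonis": "ormanis",
    "taksometrs": "taksometrs",
    "taksometris": "taksometrs",
    "taksomotors": "taksometrs",
    "taksis": "taksometrs",      # ambiguous, needs the manual check
    "taksītis": "taksometrs",    # ambiguous, needs the manual check
}


def canonical_term(lemma):
    """Map a lemma to its group, or None if it is not a term of interest."""
    return LEMMA_TO_TERM.get(lemma)


print(canonical_term("važonis"))
print(canonical_term("maize"))
```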

### Possible gotchas

We have to be careful not to count terms that are used in a non-transportation context.

For example, "taksītis" could refer not only to a taxi but also to a type of dog. We will have to filter out these cases.
Anda has manually labeled the suspected false positives. We will use this data to filter them out.
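
A minimal sketch of how such labels could be applied, assuming they identify the documents in which an ambiguous lemma really refers to a taxi. The term groups mirror the ones used later in this notebook; the function `keep_mention` and the document keys are invented for illustration.

```python
SURE_TERMS = {"taksometrs", "taksometris", "taksomotors"}   # always a taxi
UNSURE_TERMS = {"taksis", "taksītis"}                        # taxi or something else


def keep_mention(lemma, doc_key, confirmed_docs):
    """Decide whether one mention counts as a taxi mention."""
    if lemma in SURE_TERMS:
        return True
    if lemma in UNSURE_TERMS:
        # ambiguous lemmas count only in manually confirmed documents
        return doc_key in confirmed_docs
    return False


# Toy data: the document keys are invented for illustration.
confirmed_docs = {"AuthA_TitlA"}
mentions = [
    ("taksis", "AuthA_TitlA"),      # ambiguous lemma, confirmed document
    ("taksītis", "AuthB_TitlB"),    # ambiguous lemma, unconfirmed document
    ("taksometrs", "AuthB_TitlB"),  # unambiguous lemma
]
kept = [m for m in mentions if keep_mention(*m, confirmed_docs)]
print(kept)
```

Note that this filter works at the document level: if a confirmed document also contained an ambiguous lemma in its non-taxi sense, that mention would be kept as well.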


## Loading Libraries and showing hardware used

In [2]:
# Show Python version
import sys
print(f"Python version: {sys.version}")
from datetime import datetime
print(f"Run date: {datetime.now()}")
from pathlib import Path
import os

# Get the project root by going one level up from the current notebook directory
project_root = Path().resolve().parent
print(f"Project root: {project_root}")
# what computer are we on?
import socket
print(f"Computer name: {socket.gethostname()}")
# CPU architecture
import platform
print(f"CPU architecture: {platform.machine()}")
# CPU type
print(f"CPU type {platform.processor()}")
# CPU count
print(f"CPU count: {os.cpu_count()}")
# let's import wmi to get the CPU name
try:
    import wmi
    c = wmi.WMI()
    for cpu in c.Win32_Processor():
        print(f"CPU name: {cpu.Name}")
except ImportError:
    print("wmi not installed")
    print("Please install wmi with 'pip install WMI'")

# OS name and version
print(f"OS name: {platform.system()}")
print(f"OS version: {platform.version()}")
# memory and disk space
import psutil
print(f"Memory: {psutil.virtual_memory().total / (1024 ** 3):.2f} GB : free - {psutil.virtual_memory().available / (1024 ** 3):.2f} GB")
print(f"Swap memory: {psutil.swap_memory().total / (1024 ** 3):.2f} GB : free - {psutil.swap_memory().free / (1024 ** 3):.2f} GB")
print(f"Disk space: {psutil.disk_usage('/').total / (1024 ** 3):.2f} GB : free - {psutil.disk_usage('/').free / (1024 ** 3):.2f} GB")

# try importing the libraries we need
print("EXTERNAL libraries")

try:
    from tqdm import tqdm
    from tqdm import __version__ as tqdm_version
    print(f"tqdm version: {tqdm_version}")
except ImportError:
    print("tqdm not installed")
    print("Please install tqdm with 'pip install tqdm'")

#Pandas
try:
    import pandas as pd
    from pandas import __version__ as pandas_version
    print(f"Pandas version: {pandas_version}")
except ImportError:
    print("Pandas not installed")
    print("""Please install pandas with 'pip install "pandas[excel,parquet]"'""")

# now plotly
try:
    from plotly import express as px
    from plotly import graph_objects as go
    from plotly import __version__ as plotly_version
    print(f"Plotly version: {plotly_version}")
except ImportError:
    print("Plotly not installed")
    print("Please install plotly with 'pip install plotly'")


Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Run date: 2025-05-22 11:38:13.701429
Project root: C:\Users\vsaules\Github\lnb_transports
Computer name: 11P00694
CPU architecture: AMD64
CPU type Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU count: 8
CPU name: Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz
OS name: Windows
OS version: 10.0.19045
Memory: 31.80 GB : free - 19.47 GB
Swap memory: 4.75 GB : free - 4.72 GB
Disk space: 222.96 GB : free - 56.44 GB
EXTERNAL libraries
tqdm version: 4.66.2
Pandas version: 2.2.1
Plotly version: 5.19.0


## Loading Full Dataset

In [3]:
src = Path("../../not_repo/latsenrom_2025_05_09.parquet")

# assert src.exists()
assert src.is_file(), f"File not found: {src}"
# loading
print(f"Loading from {src}")
df = pd.read_parquet(src)
# check the dataframe
# shape
print(f"df.shape: {df.shape}")
# head
df.head()

Loading from ..\..\not_repo\latsenrom_2025_05_09.parquet
df.shape: (37605476, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
0,nmod,Mīlas,1,mīla,2.0,ncfsg_,ncfsg4,Case=Gen|Gender=Fem|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,mīla
1,nmod,ārprāta,2,ārprāts,3.0,ncmsg_,ncmsg1,Case=Gen|Gender=Masc|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,ārprāts
2,obl,varā,3,vara,6.0,ncfsl_,ncfsl4,Case=Loc|Gender=Fem|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,vara
3,nmod,ROMĀNS,4,Romāns,6.0,npmsn_,npmsn1,Case=Nom|Gender=Masc|Number=Sing,PROPN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,Romāns
4,punct,„,5,"""",6.0,zq,zq,_,PUNCT,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,""""


### Extracting ormanis

In [4]:
ormanis_lemmas = ("ormanis","fūrmanis", "važonis")
# let's filter the dataframe for the lemmas
ormanis_df = df[df["lemma"].isin(ormanis_lemmas)]
# shape
print(f"ormanis_df.shape: {ormanis_df.shape}")
# head
ormanis_df.head()

ormanis_df.shape: (1478, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
39464,iobj,ormanim,3,ormanis,2.0,ncmsd_,ncmsd2,Case=Dat|Gender=Masc|Number=Sing,NOUN,14,AkurJ,DegoS,771400,AkurJ_DegoS_771400,AkurJ_DegoS,1912,ormanis
73314,obj,ormani,9,ormanis,7.0,ncmsa_,ncmsa2,Case=Acc|Gender=Masc|Number=Sing,NOUN,556,AkurJ,PeteD,886346,AkurJ_PeteD_886346,AkurJ_PeteD,1921,ormanis
304692,nsubj,važoņi,23,važonis,20.0,ncmpn_,ncmpn2,Case=Nom|Gender=Masc|Number=Plur,NOUN,555,ArdeE,ApLie,1051730,ArdeE_ApLie_1051730,ArdeE_ApLie,1926,ormanis
308387,nsubj:pass,ormaņi,10,ormanis,6.0,ncmpn_,ncmpn2,Case=Nom|Gender=Masc|Number=Plur,NOUN,777,ArdeE,ApLie,1051730,ArdeE_ApLie_1051730,ArdeE_ApLie,1926,ormanis
357535,obl,ormaņa,13,ormanis,9.0,ncmsg_,ncmsg2,Case=Gen|Gender=Masc|Number=Sing,NOUN,317,ArdeE,SvetA,1046832,ArdeE_SvetA_1046832,ArdeE_SvetA,1924,ormanis


In [5]:
# let's see value counts for the lemma
ormanis_df["lemma"].value_counts()

lemma
ormanis     971
važonis     390
fūrmanis    117
Name: count, dtype: int64

### Saving ormanis

In [6]:
# let's save ormanis in our parquet directly which is a sibling directory to our notebook
ormanis_df.to_parquet("../parquet/ormanis.parquet", index=False)

### Extracting taksometrs

This task is harder: for the ambiguous terms we keep only the mentions that were manually confirmed as true positives.

In [35]:
# first let's load the xlsx file that Anda provided with true positives for taksis and taksītis
# note that taksometrs / taksometris / taksomotors are always true positives as those terms are always related to taxis
true_taxi_df = pd.read_excel("../xlsx/Taxi-docF-redig.xlsx", sheet_name="True positives")
# shape
print(f"true_taxi_df.shape: {true_taxi_df.shape}")
# head
true_taxi_df.head()

true_taxi_df.shape: (231, 4)


Unnamed: 0,Aut,NoslT,Gads,Forma
0,Andra,Elita,1930,taksī
1,Andra,Elita,1930,taksī
2,Andra,Elita,1930,taksis
3,Andra,Elita,1930,taksī
4,Anoni,KaptT,1926,taksim


In [36]:
# let's create a file_stem_short column for the true_taxi_df
# this is the concatenation of Aut (author) and NoslT (title) using _ as separator
true_taxi_df["file_stem_short"] = true_taxi_df["Aut"] + "_" + true_taxi_df["NoslT"]
# head
true_taxi_df.head()

Unnamed: 0,Aut,NoslT,Gads,Forma,file_stem_short
0,Andra,Elita,1930,taksī,Andra_Elita
1,Andra,Elita,1930,taksī,Andra_Elita
2,Andra,Elita,1930,taksis,Andra_Elita
3,Andra,Elita,1930,taksī,Andra_Elita
4,Anoni,KaptT,1926,taksim,Anoni_KaptT


In [37]:
# how many unique file_stem_short values do we have?
print(f"true_taxi_df.file_stem_short.nunique(): {true_taxi_df.file_stem_short.nunique()}")
# get these unique values as a set
true_taxi_set = set(true_taxi_df.file_stem_short)
# length of the set
print(f"len(true_taxi_set): {len(true_taxi_set)}")

true_taxi_df.file_stem_short.nunique(): 83
len(true_taxi_set): 83


In [38]:
# isin() matches exactly, so the terms must not carry stray whitespace
sure_taxi_terms = ("taksometrs", "taksometris", "taksomotors")
# first let's get the sure taxi terms if they are exactly in lemma column
taxi_df = df[df["lemma"].isin(sure_taxi_terms)]
# shape
print(f"taxi_df.shape: {taxi_df.shape}")
# head
taxi_df.head()

taxi_df.shape: (111, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
482494,,taksometri,9,taksometrs,,ncmpn_,ncmpn1,,,131,Arnis,AndrS,948028,Arnis_AndrS_948028,Arnis_AndrS,1928,taksometrs
488046,,taksomotoru,14,taksomotors,,ncmsa_,ncmsa1,,,697,Arnis,AndrS,948028,Arnis_AndrS_948028,Arnis_AndrS,1928,taksomotors
598018,conj,taksometros,13,taksometrs,11.0,ncmpl_,ncmpl1,Case=Loc|Gender=Masc|Number=Plur,NOUN,432,Arnis,TaurK,1051711,Arnis_TaurK_1051711,Arnis_TaurK,1933,taksometrs
617771,nmod,taksometru,9,taksometrs,10.0,ncmpg_,ncmpg1,Case=Gen|Gender=Masc|Number=Plur,NOUN,685,Arnis,TaurK,1051711,Arnis_TaurK_1051711,Arnis_TaurK,1933,taksometrs
1402411,nmod,Taksometra,1,taksometrs,2.0,ncmsg_,ncmsg1,Case=Gen|Gender=Masc|Number=Sing,NOUN,784,BaloP,UzleS,1025418,BaloP_UzleS_1025418,BaloP_UzleS,1929,taksometrs
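
`isin` compares values exactly, so a stray space in a search term (for example `" taksometris"`) silently matches nothing. The sketch below shows one possible guard in plain Python; the helper `clean_terms` and the toy lemma list are illustrative assumptions, not part of the notebook.

```python
def clean_terms(terms):
    """Strip surrounding whitespace and drop empty or duplicate terms, keeping order."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return tuple(cleaned)


raw_terms = ("taksometrs", " taksometris", "taksomotors")  # note the leading space
lemmas = ["taksometris", "taksometrs", "maize"]            # toy lemma column

matches_raw = [lemma for lemma in lemmas if lemma in raw_terms]
matches_clean = [lemma for lemma in lemmas if lemma in clean_terms(raw_terms)]
print(matches_raw)    # ['taksometrs']
print(matches_clean)  # ['taksometris', 'taksometrs']
```

Passing the cleaned tuple to the filter avoids losing rows to a typo that raises no error.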


In [39]:
unsure_taxi_terms = ("taksis", "taksītis")
# for these we also have to check if the file_stem_short is in the true_taxi_set
# we already have a file_stem_short column in the dataframe
extra_taxi_df = df[df["lemma"].isin(unsure_taxi_terms)]
# shape before filtering for true_taxi_set
print(f"extra_taxi_df.shape: {extra_taxi_df.shape}")
# let's filter the dataframe for the true_taxi_set
extra_taxi_df = extra_taxi_df[extra_taxi_df["file_stem_short"].isin(true_taxi_set)]
# shape after filtering for true_taxi_set
print(f"extra_taxi_df.shape: {extra_taxi_df.shape}")
# head 15
extra_taxi_df.head(15)

extra_taxi_df.shape: (181, 17)
extra_taxi_df.shape: (112, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
202074,obl,taksī,25,taksis,26.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,912,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
205569,obl,taksī,13,taksis,12.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,1257,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
220176,nsubj,taksis,5,taksis,6.0,ncmsn_,ncmsn2,Case=Nom|Gender=Masc|Number=Sing,NOUN,845,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
247803,obl,taksī,4,taksis,5.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,255,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
382236,iobj,taksīti,3,taksītis,1.0,ncmsa_,ncmsa2,Case=Acc|Gender=Masc|Number=Sing,NOUN,1537,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
390401,obl,taksīšos,4,taksītis,6.0,ncmpl_,ncmpl2,Case=Loc|Gender=Masc|Number=Plur,NOUN,416,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
390418,iobj,taksīšiem,21,taksītis,24.0,ncmpd_,ncmpd2,Case=Dat|Gender=Masc|Number=Plur,NOUN,416,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
661359,,taksīši,2,taksītis,,ncmpn_,ncmpn2,,,428,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs
729830,,taksīšiem,5,taksītis,,ncmpd_,ncmpd2,,,543,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs
729857,,taksīti,26,taksītis,,ncmsa_,ncmsa2,,,544,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs


In [40]:
# something is not quite right: we should have hit more than 12 terms
# let's check how many unique file_stem_short values we have in the original dataframe
# and whether all the labeled ones (true_taxi_set) are among them
unique_file_stem_short = set(df.file_stem_short.unique())
# length of the set
print(f"len(unique_file_stem_short): {len(unique_file_stem_short)}")
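# check that every labeled file_stem_short (true_taxi_set) occurs in the dataframe
# (the printed label says sure_taxi_terms, but the set being compared is true_taxi_set)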
if not true_taxi_set <= unique_file_stem_short:
    print(f"Not all sure_taxi_terms are in unique_file_stem_short: {true_taxi_set} <= {unique_file_stem_short}")
    # which ones are missing
    missing_terms = true_taxi_set - unique_file_stem_short
    print(f"Missing terms: {missing_terms}")
else:
    print(f"All sure_taxi_terms are in unique_file_stem_short: {true_taxi_set} <= {unique_file_stem_short}")

len(unique_file_stem_short): 463
Not all sure_taxi_terms are in unique_file_stem_short: {'ErssA_AgloD', 'BorkR_JaunL', 'VeseJ_CilvS', 'ZariE_PapaZ', 'BankJ_DivaD', 'JaunJ_JaunU', 'SkalK_SirdK', 'PaulM_ProfS', 'SterF_MezkP', 'LapiK_DodaU', 'ZiemV_MilaU', 'DzilA_VirsD', 'PavlA_Celop', 'LapiK_AtplL', 'ZariK_SpigP', 'UpitA_MasaG', 'JureJ_DzivV', 'PaulM_VecaB', 'JaunJ_NeskS', 'DzilK_Dauga', 'ZibeV_LielI', 'LesiVi_LiktR', 'UpitA_JanaR', 'LankA_Inzen', 'JaunJ_Kapri', 'PrusE_TaleV', 'VindG_NaveI', 'SukuO_Oglra', 'MiltK_Peter', 'PaulM_SirdP', 'LapiK_StudF', 'BaltJ_KadNa', 'GingJ_NoslT', 'KukuJ_Laimi', 'Artis_ArNai', 'BaloP_UzleS', 'LapiK_Pagri', 'SkujF_SirdS', 'ArdsL_TrijV', 'EgliA_LigaM', 'NiedAi_SarkV', 'ZiemV_SievA', 'LapiK_CekaG', 'UpitA_PaVar', 'ZeltT_RigaG', 'NiedAi_CilvA', 'ZariK_PelnV', 'ZiemV_MilaF', 'PerlL_LudiD', 'SartJ_FabrM', 'SukuO_MilaA', 'RoziP_Cepli', 'GulbA_DruvU', 'MoorH_Rauls', 'CukuH_StarZ', 'VecoJ_DansT', 'PaulM_MilaA', 'BaltV_LielU', 'Anoni_KaptT', 'PeteM_DzivS', 'SpriJ_S
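
Printing both full sets makes the result hard to read. A more compact report could list only the counts and the missing keys. This is a sketch on invented toy sets; `coverage_report` is a hypothetical helper, not something defined in the notebook.

```python
def coverage_report(expected, available):
    """Summarize how many expected keys are present and which ones are missing."""
    missing = sorted(expected - available)
    return {
        "expected": len(expected),
        "found": len(expected) - len(missing),
        "missing": missing,
    }


# Toy sets: the keys are invented for illustration.
expected = {"AuthA_TitlA", "AuthB_TitlB", "AuthC_TitlC"}
available = {"AuthA_TitlA", "AuthC_TitlC", "AuthD_TitlD"}

report = coverage_report(expected, available)
print(f"{report['found']} of {report['expected']} labeled keys found")
print(f"Missing: {report['missing']}")
```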

In [42]:
# let's check author PavlA in df
unique_authors = set(df.author.unique())
assert "PavlA" in unique_authors, f"PavlA not in unique_authors: {unique_authors}"
# check SkujF
assert "SkujF" in unique_authors, f"SkujF not in unique_authors: {unique_authors}"
# check if we have any duplicates in the dataframe


In [None]:
# TODO check titles for PavlA and SkujF since they are present in our big dataframe

Authors starting with E:
EgliA
EldgH
Elita
ErglR
EideR
ErssA
EgliV
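
For the TODO above, one possible approach is to list, for each missing key, the titles the corpus does contain for the same author, so that a differing title abbreviation becomes visible. The sketch assumes keys of the form `Author_Title` with a single underscore. The helper names and all keys in the toy data are invented for illustration.

```python
from collections import defaultdict


def titles_by_author(stems):
    """Group title codes by author code for keys shaped like 'Author_Title'."""
    grouped = defaultdict(set)
    for stem in stems:
        author, _, title = stem.partition("_")
        grouped[author].add(title)
    return grouped


def diagnose(missing, corpus_stems):
    """For each missing key, list the titles the corpus has for the same author."""
    known = titles_by_author(corpus_stems)
    return {
        stem: sorted(known.get(stem.partition("_")[0], ()))
        for stem in sorted(missing)
    }


# Toy data: every key below is invented for illustration.
corpus_stems = {"AuthA_TitlX", "AuthA_TitlY", "AuthB_TitlZ"}
missing = {"AuthA_TitlW", "AuthC_TitlQ"}

diagnosis = diagnose(missing, corpus_stems)
for stem, titles in diagnosis.items():
    print(stem, "->", titles if titles else "author not in corpus")
```

An empty list means the author code itself is absent from the corpus, while a non-empty list points to a mismatch in the title part of the key.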
