# Visualize taksometrs versus ormanis

In this notebook we compare two types of transportation, taksometrs (taxi) and ormanis (horse-drawn cab), and visualize how often each is mentioned over time using line plots.

## Idea

The main idea is to show trends in mentions of taksometrs and ormanis over time. We use a line plot with the year of first edition on the x-axis and the share of works mentioning each term on the y-axis.

We use document frequency to show these trends. Document frequency is the number of documents in a collection that contain a term at least once. Here it counts how many works mention taksometrs or ormanis.
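As a minimal sketch of the idea, assume a token-level table with one row per token. The rows and work identifiers below are invented for illustration; the column names `lemma` and `file_stem_short` follow the dataset loaded later in this notebook. Document frequency counts each work once, no matter how often the term occurs in it:

```python
import pandas as pd

# Toy token table: one row per token occurrence (values invented for illustration)
tokens = pd.DataFrame(
    {
        "file_stem_short": ["A_One", "A_One", "A_One", "B_Two", "C_Three"],
        "lemma": ["ormanis", "ormanis", "taksometrs", "ormanis", "zirgs"],
    }
)

# Term frequency counts every occurrence of a lemma
term_freq = tokens["lemma"].value_counts()

# Document frequency counts each work at most once per lemma
doc_freq = tokens.groupby("lemma")["file_stem_short"].nunique()

print(term_freq["ormanis"])  # 3 occurrences
print(doc_freq["ormanis"])   # found in 2 distinct works
```

The difference matters for long novels: a single work that repeats a term many times raises the term frequency but adds only one to the document frequency.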

Because the number of documents differs from year to year, we convert the counts to relative frequencies. Relative document frequency is the number of documents containing a term divided by the total number of documents. Here it is computed per year: the number of works mentioning taksometrs or ormanis divided by the number of works first published in that year.
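The normalization can be sketched on a few invented yearly counts. The series names and numbers here are hypothetical and only illustrate the arithmetic:

```python
import pandas as pd

# Hypothetical per-year counts (numbers invented for illustration)
works_per_year = pd.Series({1930: 10, 1931: 20, 1932: 5}, name="works")
works_with_term = pd.Series({1930: 2, 1931: 5}, name="works_with_term")

# Align on the full list of years; years without the term count as 0
aligned = works_with_term.reindex(works_per_year.index, fill_value=0)

# Relative document frequency = works mentioning the term / all works that year
relative = aligned / works_per_year
print(relative.to_dict())
```

Note that years with very few works produce coarse values: with a single work in a year, the relative frequency can only be 0 or 1.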

### Related terms

Our taksometrs and ormanis counts will include related terms: fūrmanis and važonis for ormanis, and taksītis, taksitis for taksometrs. Inflected forms such as taksometra and ormaņa are included as well.
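One way to picture the grouping is a lookup from variant lemma to the headword it is counted under. This is only an illustrative sketch: the dictionary name `LEMMA_TO_TERM` is hypothetical, and the groupings follow the lemma lists used in the extraction cells of this notebook.

```python
import pandas as pd

# Variant lemmas grouped under the headword they are counted with
# (hypothetical helper; groupings follow the lists used in this notebook)
LEMMA_TO_TERM = {
    "ormanis": "ormanis",
    "fūrmanis": "ormanis",
    "važonis": "ormanis",
    "taksometrs": "taksometrs",
    "taksometris": "taksometrs",
    "taksomotors": "taksometrs",
    "taksis": "taksometrs",
    "taksītis": "taksometrs",
}

lemmas = pd.Series(["važonis", "taksītis", "zirgs", "ormanis"])

# Lemmas outside the mapping become NaN and can be dropped
terms = lemmas.map(LEMMA_TO_TERM)
print(terms.dropna().tolist())
```

Matching on lemmas rather than surface forms is what lets inflected forms be counted without listing each one.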

### Possible gotchas

We have to be careful not to count terms that are used in a non-transportation context.

For example, "taksītis" can refer not only to a taxi but also to a type of dog, so these cases have to be filtered out.
Anda has manually labeled the suspected false positives. We will use these labels to filter them out.
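A minimal sketch of such a filter, assuming the labels can be reduced to a set of works in which the taxi meaning was confirmed. The rows and the set `confirmed_works` are invented for illustration:

```python
import pandas as pd

# Toy matches for ambiguous lemmas (rows invented for illustration)
hits = pd.DataFrame(
    {
        "file_stem_short": ["A_One", "B_Two", "C_Three", "A_One"],
        "lemma": ["taksītis", "taksītis", "taksis", "taksis"],
    }
)

# Hypothetical set of works where a reviewer confirmed the taxi meaning
confirmed_works = {"A_One", "C_Three"}

# Keep only matches from confirmed works; the match in B_Two is dropped
taxi_hits = hits[hits["file_stem_short"].isin(confirmed_works)]
print(taxi_hits)
```

Filtering at the level of the work is coarse: if one work used the term in both senses, every occurrence in it would be kept. For document frequency this does not matter, since each work is counted once.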


## Loading Libraries and showing hardware used

In [1]:
# Show Python version
import sys
print(f"Python version: {sys.version}")
from datetime import datetime
print(f"Run date: {datetime.now()}")
from pathlib import Path
import os

# Get the project root by going one level up from the current notebook directory
project_root = Path().resolve().parent
print(f"Project root: {project_root}")
# what computer are we on?
import socket
print(f"Computer name: {socket.gethostname()}")
# CPU architecture
import platform
print(f"CPU architecture: {platform.machine()}")
# CPU type
print(f"CPU type {platform.processor()}")
# CPU count
print(f"CPU count: {os.cpu_count()}")
# let's import wmi to get the CPU name
try:
    import wmi
    c = wmi.WMI()
    for cpu in c.Win32_Processor():
        print(f"CPU name: {cpu.Name}")
except ImportError:
    print("wmi not installed")
    print("Please install wmi with 'pip install WMI'")

# OS name and version
print(f"OS name: {platform.system()}")
print(f"OS version: {platform.version()}")
# memory and disk space
import psutil
print(f"Memory: {psutil.virtual_memory().total / (1024 ** 3):.2f} GB : free - {psutil.virtual_memory().available / (1024 ** 3):.2f} GB")
print(f"Swap memory: {psutil.swap_memory().total / (1024 ** 3):.2f} GB : free - {psutil.swap_memory().free / (1024 ** 3):.2f} GB")
print(f"Disk space: {psutil.disk_usage('/').total / (1024 ** 3):.2f} GB : free - {psutil.disk_usage('/').free / (1024 ** 3):.2f} GB")

# try importing the libraries we need
print("EXTERNAL libraries")

try:
    from tqdm import tqdm
    from tqdm import __version__ as tqdm_version
    print(f"tqdm version: {tqdm_version}")
except ImportError:
    print("tqdm not installed")
    print("Please install tqdm with 'pip install tqdm'")

#Pandas
try:
    import pandas as pd
    from pandas import __version__ as pandas_version
    print(f"Pandas version: {pandas_version}")
except ImportError:
    print("Pandas not installed")
    print("""Please install pandas with 'pip install "pandas[excel,parquet]"'""")

# now plotly
try:
    from plotly import express as px
    from plotly import graph_objects as go
    from plotly import __version__ as plotly_version
    print(f"Plotly version: {plotly_version}")
except ImportError:
    print("Plotly not installed")
    print("Please install plotly with 'pip install plotly'")


Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Run date: 2025-05-23 10:04:53.642817
Project root: C:\Users\vsaules\Github\lnb_transports
Computer name: 11P00694
CPU architecture: AMD64
CPU type Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU count: 8
CPU name: Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz
OS name: Windows
OS version: 10.0.19045
Memory: 31.80 GB : free - 22.91 GB
Swap memory: 4.75 GB : free - 4.69 GB
Disk space: 222.96 GB : free - 56.50 GB
EXTERNAL libraries
tqdm version: 4.66.2
Pandas version: 2.2.1
Plotly version: 5.19.0


## Loading Full Dataset

In [2]:
src = Path("../../not_repo/latsenrom_2025_05_09.parquet")

# assert src.exists()
assert src.is_file(), f"File not found: {src}"
# loading
print(f"Loading from {src}")
df = pd.read_parquet(src)
# check the dataframe
# shape
print(f"df.shape: {df.shape}")
# head
df.head()

Loading from ..\..\not_repo\latsenrom_2025_05_09.parquet
df.shape: (37605476, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
0,nmod,Mīlas,1,mīla,2.0,ncfsg_,ncfsg4,Case=Gen|Gender=Fem|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,mīla
1,nmod,ārprāta,2,ārprāts,3.0,ncmsg_,ncmsg1,Case=Gen|Gender=Masc|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,ārprāts
2,obl,varā,3,vara,6.0,ncfsl_,ncfsl4,Case=Loc|Gender=Fem|Number=Sing,NOUN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,vara
3,nmod,ROMĀNS,4,Romāns,6.0,npmsn_,npmsn1,Case=Nom|Gender=Masc|Number=Sing,PROPN,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,Romāns
4,punct,„,5,"""",6.0,zq,zq,_,PUNCT,0,AizsV,MilaU,1049452,AizsV_MilaU_1049452,AizsV_MilaU,1933,""""


### Extracting ormanis

In [3]:
ormanis_lemmas = ("ormanis","fūrmanis", "važonis")
# let's filter the dataframe for the lemmas
ormanis_df = df[df["lemma"].isin(ormanis_lemmas)]
# shape
print(f"ormanis_df.shape: {ormanis_df.shape}")
# head
ormanis_df.head()

ormanis_df.shape: (1478, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
39464,iobj,ormanim,3,ormanis,2.0,ncmsd_,ncmsd2,Case=Dat|Gender=Masc|Number=Sing,NOUN,14,AkurJ,DegoS,771400,AkurJ_DegoS_771400,AkurJ_DegoS,1912,ormanis
73314,obj,ormani,9,ormanis,7.0,ncmsa_,ncmsa2,Case=Acc|Gender=Masc|Number=Sing,NOUN,556,AkurJ,PeteD,886346,AkurJ_PeteD_886346,AkurJ_PeteD,1921,ormanis
304692,nsubj,važoņi,23,važonis,20.0,ncmpn_,ncmpn2,Case=Nom|Gender=Masc|Number=Plur,NOUN,555,ArdeE,ApLie,1051730,ArdeE_ApLie_1051730,ArdeE_ApLie,1926,ormanis
308387,nsubj:pass,ormaņi,10,ormanis,6.0,ncmpn_,ncmpn2,Case=Nom|Gender=Masc|Number=Plur,NOUN,777,ArdeE,ApLie,1051730,ArdeE_ApLie_1051730,ArdeE_ApLie,1926,ormanis
357535,obl,ormaņa,13,ormanis,9.0,ncmsg_,ncmsg2,Case=Gen|Gender=Masc|Number=Sing,NOUN,317,ArdeE,SvetA,1046832,ArdeE_SvetA_1046832,ArdeE_SvetA,1924,ormanis


In [4]:
# let's see value counts for the lemma
ormanis_df["lemma"].value_counts()

lemma
ormanis     971
važonis     390
fūrmanis    117
Name: count, dtype: int64

### Saving ormanis

In [5]:
# let's save ormanis in our parquet directory, which is a sibling directory to our notebook
# ormanis_df.to_parquet("../parquet/ormanis.parquet", index=False)

### Extracting taksometrs

This task is harder, because for the ambiguous terms we have to keep only the true positives.

In [16]:
# first let's load the xlsx file that Anda provided with true positives for taksis and taksītis
# note that taksometrs / taksometris / taksomotors are always true positives as those terms are always related to taxis
true_taxi_df = pd.read_excel("../xlsx/Taxi-docF-redig.xlsx", sheet_name="True positives")
# shape
print(f"true_taxi_df.shape: {true_taxi_df.shape}")
# head
true_taxi_df.head()

true_taxi_df.shape: (231, 4)


Unnamed: 0,Aut,NoslT,Gads,Forma
0,Andra,Elita,1930,taksī
1,Andra,Elita,1930,taksī
2,Andra,Elita,1930,taksis
3,Andra,Elita,1930,taksī
4,Anoni,KaptT,1926,taksim


In [17]:
# let's create a file_stem_short column for the true_taxi_df
# this is the concatenation of the Aut and NoslT columns using _ as separator
true_taxi_df["file_stem_short"] = true_taxi_df["Aut"] + "_" + true_taxi_df["NoslT"]
# head
true_taxi_df.head()

Unnamed: 0,Aut,NoslT,Gads,Forma,file_stem_short
0,Andra,Elita,1930,taksī,Andra_Elita
1,Andra,Elita,1930,taksī,Andra_Elita
2,Andra,Elita,1930,taksis,Andra_Elita
3,Andra,Elita,1930,taksī,Andra_Elita
4,Anoni,KaptT,1926,taksim,Anoni_KaptT


In [18]:
# how many unique file_stem_short values do we have?
print(f"true_taxi_df.file_stem_short.nunique(): {true_taxi_df.file_stem_short.nunique()}")
# get these unique values as a set
true_taxi_set = set(true_taxi_df.file_stem_short)
# length of the set
print(f"len(true_taxi_set): {len(true_taxi_set)}")

true_taxi_df.file_stem_short.nunique(): 83
len(true_taxi_set): 83


In [19]:
sure_taxi_terms = ("taksometrs", "taksometris", "taksomotors")
# first let's get the sure taxi terms where they match the lemma column exactly
# (isin compares whole strings, so a stray space inside a term would match nothing)
taxi_df = df[df["lemma"].isin(sure_taxi_terms)]
# shape
print(f"taxi_df.shape: {taxi_df.shape}")
# head
taxi_df.head()

taxi_df.shape: (111, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
482494,,taksometri,9,taksometrs,,ncmpn_,ncmpn1,,,131,Arnis,AndrS,948028,Arnis_AndrS_948028,Arnis_AndrS,1928,taksometrs
488046,,taksomotoru,14,taksomotors,,ncmsa_,ncmsa1,,,697,Arnis,AndrS,948028,Arnis_AndrS_948028,Arnis_AndrS,1928,taksomotors
598018,conj,taksometros,13,taksometrs,11.0,ncmpl_,ncmpl1,Case=Loc|Gender=Masc|Number=Plur,NOUN,432,Arnis,TaurK,1051711,Arnis_TaurK_1051711,Arnis_TaurK,1933,taksometrs
617771,nmod,taksometru,9,taksometrs,10.0,ncmpg_,ncmpg1,Case=Gen|Gender=Masc|Number=Plur,NOUN,685,Arnis,TaurK,1051711,Arnis_TaurK_1051711,Arnis_TaurK,1933,taksometrs
1402411,nmod,Taksometra,1,taksometrs,2.0,ncmsg_,ncmsg1,Case=Gen|Gender=Masc|Number=Sing,NOUN,784,BaloP,UzleS,1025418,BaloP_UzleS_1025418,BaloP_UzleS,1929,taksometrs


In [20]:
unsure_taxi_terms = ("taksis", "taksītis")
# for these we also have to check if the file_stem_short is in the true_taxi_set
# we already have a file_stem_short column in the dataframe
extra_taxi_df = df[df["lemma"].isin(unsure_taxi_terms)]
# shape before filtering for true_taxi_set
print(f"extra_taxi_df.shape: {extra_taxi_df.shape}")
# let's filter the dataframe for the true_taxi_set
extra_taxi_df = extra_taxi_df[extra_taxi_df["file_stem_short"].isin(true_taxi_set)]
# shape after filtering for true_taxi_set
print(f"extra_taxi_df.shape: {extra_taxi_df.shape}")
# head 15
extra_taxi_df.head(15)

extra_taxi_df.shape: (181, 17)
extra_taxi_df.shape: (113, 17)


Unnamed: 0,deprel,form,index,lemma,parent,pos,tag,ufeats,upos,sent_ndx,author,title,dom_id,file_stem,file_stem_short,firstEdition,term
202074,obl,taksī,25,taksis,26.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,912,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
205569,obl,taksī,13,taksis,12.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,1257,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
220176,nsubj,taksis,5,taksis,6.0,ncmsn_,ncmsn2,Case=Nom|Gender=Masc|Number=Sing,NOUN,845,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
247803,obl,taksī,4,taksis,5.0,ncmsl_,ncmsl2,Case=Loc|Gender=Masc|Number=Sing,NOUN,255,Andra,Elita,1053573,Andra_Elita_1053573,Andra_Elita,1930,taksometrs
382236,iobj,taksīti,3,taksītis,1.0,ncmsa_,ncmsa2,Case=Acc|Gender=Masc|Number=Sing,NOUN,1537,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
390401,obl,taksīšos,4,taksītis,6.0,ncmpl_,ncmpl2,Case=Loc|Gender=Masc|Number=Plur,NOUN,416,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
390418,iobj,taksīšiem,21,taksītis,24.0,ncmpd_,ncmpd2,Case=Dat|Gender=Masc|Number=Plur,NOUN,416,ArdsL,TrijV,1053572,ArdsL_TrijV_1053572,ArdsL_TrijV,1933,taksometrs
661359,,taksīši,2,taksītis,,ncmpn_,ncmpn2,,,428,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs
729830,,taksīšiem,5,taksītis,,ncmpd_,ncmpd2,,,543,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs
729857,,taksīti,26,taksītis,,ncmsa_,ncmsa2,,,544,Artis,ArNai,1053600,Artis_ArNai_1053600,Artis_ArNai,1940,taksometrs


In [21]:
# something is not quite right: we should have hit more than 12 terms
# let's check how many unique file_stem_short values we have in the original dataframe
unique_file_stem_short = set(df.file_stem_short.unique())
# length of the set
print(f"len(unique_file_stem_short): {len(unique_file_stem_short)}")
# let's check that every file_stem_short in true_taxi_set is present in unique_file_stem_short
if not true_taxi_set <= unique_file_stem_short:
    print(f"Not all sure_taxi_terms are in unique_file_stem_short: {true_taxi_set} <= {unique_file_stem_short}")
    # which ones are missing
    missing_terms = true_taxi_set - unique_file_stem_short
    print(f"Missing terms: {missing_terms}")
else:
    print(f"All sure_taxi_terms are in unique_file_stem_short: {true_taxi_set} <= {unique_file_stem_short}")

len(unique_file_stem_short): 463
All sure_taxi_terms are in unique_file_stem_short: {'LapiK_StudF', 'BremH_BaltM', 'NiedAi_SarkV', 'EgliA_LigaM', 'LapiK_CekaG', 'LankA_Inzen', 'BaloP_UzleS', 'ZamaL_DireK', 'CukuH_StarZ', 'SkujF_SidrS', 'PeteM_DzivS', 'ZariE_PapaZ', 'GulbA_DruvU', 'LaciV_AkmeC', 'PrusE_TaleV', 'VindG_NaveI', 'JaunJ_NeskS', 'PaulM_MilaA', 'PaulM_SirdP', 'UpitA_SmaiL', 'BaltJ_KadNa', 'SukuO_MilaA', 'BaltV_LielU', 'LapiK_AtplL', 'SpriJ_SaloL', 'UpitA_JanaR', 'Andra_Elita', 'VeseJ_CilvS', 'GingJ_NoslT', 'TormJ_Lasti', 'LapiK_DodaU', 'PaulM_VecaB', 'SukuO_Oglra', 'GulbA_JaunV', 'MiltK_Peter', 'JaunJ_Kapri', 'SpriJ_ManGr', 'SterF_MezkP', 'JureJ_DzivV', 'BankJ_DivaD', 'JaunJ_JaunU', 'Arnis_AndrS', 'Artis_ArNai', 'SkalK_SirdK', 'Arnis_TaurK', 'LapiK_Pagri', 'ZiemV_MilaF', 'ZeltT_RigaG', 'PerlL_LudiD', 'RoziE_BezTe', 'PaulM_ProfS', 'GregV_LatvK', 'MiltK_Mazur', 'SartJ_FabrM', 'LaciV_PutnB', 'LesiVi_LiktR', 'UpitA_PaVar', 'ZibeV_LielI', 'ArdsL_TrijV', 'MoorH_Rauls', 'ZariK_Vaini'

In [22]:
# let's check author PavlA in df
unique_autors = set(df.author.unique())
assert "PavlA" in unique_autors, f"PavlA not in unique_autors: {unique_autors}"
# check SkujF
assert "SkujF" in unique_autors, f"SkujF not in unique_autors: {unique_autors}"
# check if we have any duplicates in the dataframe


In [23]:
# # TODO check titles for PavlA and SkujF since they are present in our big dataframe
# # unique works for PavlA
# pavla_df = df[df["author"] == "PavlA"]
# # shape
# print(f"pavla_df.shape: {pavla_df.shape}")
# # print unique titles
# pavla_titles = set(pavla_df.title.unique())
# print(f"pavla_titles: {pavla_titles}")

pavla_df.shape: (103135, 17)
pavla_titles: {'CeloP'}


In [24]:
# # now let's check SkujF
# skujf_df = df[df["author"] == "SkujF"]
# # shape
# print(f"skujf_df.shape: {skujf_df.shape}")
# # print unique titles
# skujf_titles = set(skujf_df.title.unique())
# print(f"skujf_titles: {skujf_titles}")

skujf_df.shape: (470843, 17)
skujf_titles: {'UzTam', 'ZeltR', 'MilaU', 'SidrS', 'ZemSa'}


### Combining sure taksometrs with extra taxi term dfs

In [25]:
# let's create full_taxi_df by combining taxi_df and extra_taxi_df
full_taxi_df = pd.concat([taxi_df, extra_taxi_df], ignore_index=True)
# shape
print(f"full_taxi_df.shape: {full_taxi_df.shape}")
# check for duplicates where row is the same
duplicates = full_taxi_df[full_taxi_df.duplicated(keep=False)]
# shape
print(f"duplicates.shape: {duplicates.shape}")

full_taxi_df.shape: (224, 17)
duplicates.shape: (0, 17)


### Saving taksometrs

In [None]:
# now let's save the full_taxi_df to parquet
# full_taxi_df.to_parquet("../parquet/taxi.parquet", index=False)

## Extract document frequency total over years

In [28]:
# let's get document frequency over years in the whole dataframe
# group by file_stem_short and take the first firstEdition value for each work
first_edition = df.groupby('file_stem_short')['firstEdition'].first()
first_edition_over_years = first_edition.value_counts().sort_index()
# show the first 10
first_edition_over_years.head(10)


firstEdition
1879    2
1890    1
1891    3
1892    1
1893    1
1895    3
1899    1
1900    1
1901    1
1902    2
Name: count, dtype: int64

In [29]:
# last 10
first_edition_over_years.tail(10)

firstEdition
1931    22
1932    16
1933    24
1934    19
1935    36
1936    45
1937    27
1938    36
1939    27
1940    16
Name: count, dtype: int64
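The group-then-count step above can be sketched on a toy table (rows invented for illustration). Taking the first `firstEdition` per work reduces the token table to one row per work, so it should give the same yearly counts as dropping duplicate works first:

```python
import pandas as pd

# Toy token table: several rows per work (values invented for illustration)
tokens = pd.DataFrame(
    {
        "file_stem_short": ["A_One", "A_One", "B_Two", "C_Three", "C_Three"],
        "firstEdition": [1930, 1930, 1930, 1931, 1931],
    }
)

# One year per work, then count works per year
via_groupby = (
    tokens.groupby("file_stem_short")["firstEdition"].first().value_counts().sort_index()
)

# Alternative: keep one row per work, then count years
via_dedup = (
    tokens.drop_duplicates("file_stem_short")["firstEdition"].value_counts().sort_index()
)

print(via_groupby.to_dict())
print(via_dedup.to_dict())
```

Both variants assume that every row of a work carries the same `firstEdition` value.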

In [36]:
# let's convert first_edition_over_years to a dataframe
first_edition_over_years_df = first_edition_over_years.to_frame()
# index should be called year
first_edition_over_years_df.index.name = 'year'
first_edition_over_years_df.columns = ['firstEditionCount']
# show
first_edition_over_years_df.head(10)

year,firstEditionCount
1879,2
1890,1
1891,3
1892,1
1893,1
1895,3
1899,1
1900,1
1901,1
1902,2


In [None]:
## let's save first_edition_over_years to parquet
# first convert to dataframe

# first_edition_over_years_df.to_parquet("../parquet/first_edition_over_years.parquet", index=True)

In [38]:
# let's plot first_edition_over_years_df
plot_df = first_edition_over_years_df
fig = px.bar(plot_df, title="First edition value counts")
# turn off legend
fig.update_layout(showlegend=False)
# turn off title
fig.update_layout(title="")
# y axis title (Latvian: "Darbu skaits")
# fig.update_yaxes(title="Darbu skaits")
fig.update_yaxes(title="Number of Works")
# x axis title (Latvian: "Pirmizdevums grāmatā")
# fig.update_xaxes(title="Pirmizdevums grāmatā")
fig.update_xaxes(title="First Edition")
# save html
SAVE_HTML = False
if SAVE_HTML:
    fig.write_html("../html/first_edition_value_counts.html")
# font size 18
fig.update_layout(font=dict(size=18))
# show a year tick every 5 years
fig.update_xaxes(tick0=0, dtick=5)
# show white vertical gridlines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white')
# width 1242
fig.update_layout(width=1242)
# save as png
# fig.write_image("../img/first_edition_value_counts.png") # FIXME kaleido not installed
fig.show()

## Getting Document Frequency for ormanis and taksometrs over years

In [32]:
# now let's create ormanis_over_years
# group by file_stem_short and take the first firstEdition value for each work
ormanis_grouped = ormanis_df.groupby('file_stem_short')['firstEdition'].first()
ormanis_over_years = ormanis_grouped.value_counts().sort_index()
# show the first 10
ormanis_over_years.head(10)

firstEdition
1879    1
1891    1
1892    1
1895    1
1899    1
1900    1
1901    1
1902    2
1904    2
1905    2
Name: count, dtype: int64

In [39]:
# now let's get taxi_over_years
# group by file_stem_short and take the first firstEdition value for each work
taxi_grouped = full_taxi_df.groupby('file_stem_short')['firstEdition'].first()
taxi_over_years = taxi_grouped.value_counts().sort_index()
# show the first 10
taxi_over_years.head(10)

firstEdition
1924    1
1925    1
1926    1
1927    4
1928    4
1929    3
1930    5
1931    8
1932    6
1933    5
Name: count, dtype: int64

In [43]:
# now let's add ormanis_over_years and taxi_over_years to the first_edition_over_years_df
# first convert to dataframe
ormanis_over_years_df = ormanis_over_years.to_frame()
# index should be called year
ormanis_over_years_df.index.name = 'year'
ormanis_over_years_df.columns = ['ormanisDocFreq']
# show
ormanis_over_years_df.head(10)

year,ormanisDocFreq
1879,1
1891,1
1892,1
1895,1
1899,1
1900,1
1901,1
1902,2
1904,2
1905,2


In [44]:
# now let's create taxi_over_years_df
taxi_over_years_df = taxi_over_years.to_frame()
# index should be called year
taxi_over_years_df.index.name = 'year'
taxi_over_years_df.columns = ['taxiDocFreq']
# show
taxi_over_years_df.head(10)

year,taxiDocFreq
1924,1
1925,1
1926,1
1927,4
1928,4
1929,3
1930,5
1931,8
1932,6
1933,5


In [45]:
# now let's add these two dataframes to the first_edition_over_years_df
# if no year is present in the index then add 0
# first add ormanis_over_years_df to first_edition_over_years_df
first_edition_over_years_df = first_edition_over_years_df.join(ormanis_over_years_df, how='left')
# fillna with 0
first_edition_over_years_df.fillna(0, inplace=True)
# now add taxi_over_years_df to first_edition_over_years_df
first_edition_over_years_df = first_edition_over_years_df.join(taxi_over_years_df, how='left')
# fillna with 0
first_edition_over_years_df.fillna(0, inplace=True)
# show the first 10
first_edition_over_years_df.head(10)

year,firstEditionCount,ormanisDocFreq,taxiDocFreq
1879,2,1.0,0.0
1890,1,0.0,0.0
1891,3,1.0,0.0
1892,1,1.0,0.0
1893,1,0.0,0.0
1895,3,1.0,0.0
1899,1,1.0,0.0
1900,1,1.0,0.0
1901,1,1.0,0.0
1902,2,2.0,0.0


In [46]:
# let's calculate relative frequencies for ormanis and taxi
first_edition_over_years_df["ormanisRelDocFreq"] = first_edition_over_years_df["ormanisDocFreq"] / first_edition_over_years_df["firstEditionCount"]
first_edition_over_years_df["taxiRelDocFreq"] = first_edition_over_years_df["taxiDocFreq"] / first_edition_over_years_df["firstEditionCount"]
# show the first 10
first_edition_over_years_df.head(10)


year,firstEditionCount,ormanisDocFreq,taxiDocFreq,ormanisRelDocFreq,taxiRelDocFreq
1879,2,1.0,0.0,0.5,0.0
1890,1,0.0,0.0,0.0,0.0
1891,3,1.0,0.0,0.333333,0.0
1892,1,1.0,0.0,1.0,0.0
1893,1,0.0,0.0,0.0,0.0
1895,3,1.0,0.0,0.333333,0.0
1899,1,1.0,0.0,1.0,0.0
1900,1,1.0,0.0,1.0,0.0
1901,1,1.0,0.0,1.0,0.0
1902,2,2.0,0.0,1.0,0.0


In [47]:
# now let's rearrange the columns in alphabetical order - this will work nicely in this case
first_edition_over_years_df = first_edition_over_years_df[sorted(first_edition_over_years_df.columns)]
# show the first 10
first_edition_over_years_df.head(10)

year,firstEditionCount,ormanisDocFreq,ormanisRelDocFreq,taxiDocFreq,taxiRelDocFreq
1879,2,1.0,0.5,0.0,0.0
1890,1,0.0,0.0,0.0,0.0
1891,3,1.0,0.333333,0.0,0.0
1892,1,1.0,1.0,0.0,0.0
1893,1,0.0,0.0,0.0,0.0
1895,3,1.0,0.333333,0.0,0.0
1899,1,1.0,1.0,0.0,0.0
1900,1,1.0,1.0,0.0,0.0
1901,1,1.0,1.0,0.0,0.0
1902,2,2.0,1.0,0.0,0.0


In [48]:
# tail
first_edition_over_years_df.tail(10)

year,firstEditionCount,ormanisDocFreq,ormanisRelDocFreq,taxiDocFreq,taxiRelDocFreq
1931,22,12.0,0.545455,8.0,0.363636
1932,16,10.0,0.625,6.0,0.375
1933,24,14.0,0.583333,5.0,0.208333
1934,19,7.0,0.368421,3.0,0.157895
1935,36,17.0,0.472222,6.0,0.166667
1936,45,25.0,0.555556,8.0,0.177778
1937,27,10.0,0.37037,6.0,0.222222
1938,36,11.0,0.305556,5.0,0.138889
1939,27,12.0,0.444444,9.0,0.333333
1940,16,4.0,0.25,6.0,0.375


### Saving combined taxi and ormanis df to parquet and csv

In [49]:
# let's save first_edition_over_years_df to parquet
first_edition_over_years_df.to_parquet("../parquet/first_edition_ormanis_taxi_over_years.parquet", index=True)
# let's save first_edition_over_years_df to csv
first_edition_over_years_df.to_csv("../csv/first_edition_ormanis_taxi_over_years.csv", index=True)

## Loading ormanis taxi from parquet for plotting

In [None]:
ormanis_taxi_df  = pd.read_parquet("../parquet/first_edition_ormanis_taxi_over_years.parquet")
# let's assert that it is equal to the one we just saved - we will not need to check again
# assert first_edition_over_years_df.equals(ormanis_taxi_df), "Dataframes are not equal"

# print shape
print(f"ormanis_taxi_df.shape: {ormanis_taxi_df.shape}")
# tail
ormanis_taxi_df.tail(10)

ormanis_taxi_df.shape: (44, 5)


year,firstEditionCount,ormanisDocFreq,ormanisRelDocFreq,taxiDocFreq,taxiRelDocFreq
1931,22,12.0,0.545455,8.0,0.363636
1932,16,10.0,0.625,6.0,0.375
1933,24,14.0,0.583333,5.0,0.208333
1934,19,7.0,0.368421,3.0,0.157895
1935,36,17.0,0.472222,6.0,0.166667
1936,45,25.0,0.555556,8.0,0.177778
1937,27,10.0,0.37037,6.0,0.222222
1938,36,11.0,0.305556,5.0,0.138889
1939,27,12.0,0.444444,9.0,0.333333
1940,16,4.0,0.25,6.0,0.375


In [4]:
# columns
print(f"ormanis_taxi_df.columns: {ormanis_taxi_df.columns}")

ormanis_taxi_df.columns: Index(['firstEditionCount', 'ormanisDocFreq', 'ormanisRelDocFreq',
       'taxiDocFreq', 'taxiRelDocFreq'],
      dtype='object')


## Plotting ormanis and taksometrs over years

### Colors

moto_color: rgb(0, 61, 165)
horse_color: rgb(255, 194, 90)


In [22]:
# let's plot ormanisRelDocFreq and taxiRelDocFreq over the years - which is the index for this dataframe
# we want to use horse_color for ormanis and moto_color for taxi
# we will use plotly express for this
moto_color = "rgb(0, 61, 165)" # Pantone 293
horse_color = "rgb(255, 194, 90)" # complementary color to Pantone 293
print(f"moto_color: {moto_color}")
print(f"horse_color: {horse_color}")
scale = 3
width = 370 * scale
height = 255 * scale
font_size = 6 * scale
margin = 25 * scale
top_margin = 25 * scale
bottom_margin = 25 * scale
line_thickness = 2 * scale

plot_df = ormanis_taxi_df.copy()
# let's rename columns ormanisRelDocFreq and taxiRelDocFreq to "horse-drawn cab" and "taxi"
plot_df.rename(columns={"ormanisRelDocFreq": "horse-drawn cab", "taxiRelDocFreq": "taxi"}, inplace=True)
fig = px.line(plot_df,
              x=plot_df.index,
              y=["horse-drawn cab", "taxi"],
              title="Horse-Drawn Cab and Taxi Relative Document Frequency over Years",
              labels={"horse-drawn cab": "Horse-Drawn Cab", "taxi": "Taxi"},
              )
# set the color of the lines
fig.update_traces(line=dict(color=horse_color), selector=dict(name="horse-drawn cab"))
fig.update_traces(line=dict(color=moto_color), selector=dict(name="taxi"))

# set width of the lines
fig.update_traces(line=dict(width=line_thickness), selector=dict(name="horse-drawn cab"))
fig.update_traces(line=dict(width=line_thickness), selector=dict(name="taxi"))

# set the color of the markers
fig.update_traces(marker=dict(color=horse_color), selector=dict(name="horse-drawn cab"))
fig.update_traces(marker=dict(color=moto_color), selector=dict(name="taxi"))
# set the titles of the x and y axes
fig.update_xaxes(title="Year")
fig.update_yaxes(title="Relative Document Frequency")

# set width and height
fig.update_layout(width=width, height=height)
# set the font size
fig.update_layout(font=dict(size=font_size))

# let's hide title
fig.update_layout(title="")
# show the legend at the top left, without a title
fig.update_layout(showlegend=True)
fig.update_layout(legend=dict(x=0.00, y=1.1, xanchor='left', yanchor='top'))
fig.update_layout(legend_title_text="")
# set the legend font size and color
# the legend has a single font color, so only one value can apply
fig.update_layout(legend_font=dict(size=font_size, color=moto_color))

fig.update_layout(margin=dict(l=margin, r=margin, t=top_margin, b=bottom_margin))

fig.show()


moto_color: rgb(0, 61, 165)
horse_color: rgb(255, 194, 90)


In [40]:
# Now let's create the same plot with one change
# we split the data into two parts: up to 1919 and from 1919 onwards
# the first part is plotted with dot markers and the second part as a line, using the same color
# this is done for both the horse-drawn cab and the taxi series
moto_color = "rgb(0, 61, 165)" # Pantone 293
horse_color = "rgb(255, 194, 90)" # complementary color to Pantone 293
print(f"moto_color: {moto_color}")
print(f"horse_color: {horse_color}")
scale = 6
width = 370 * scale
height = 255 * scale
font_size = 6 * scale
margin = 25 * scale
top_margin = 30 * scale
bottom_margin = 25 * scale
line_thickness = 2 * scale
dot_marker_size = 4 * scale

plot_df = ormanis_taxi_df.copy()
# let's rename columns ormanisRelDocFreq and taxiRelDocFreq to "horse-drawn cab" and "taxi"
plot_df.rename(columns={"ormanisRelDocFreq": "horse-drawn cab", "taxiRelDocFreq": "taxi"}, inplace=True)
# we will want to use plotly Graph Objects for this
# import plotly.graph_objects as go
# we already have it imported as go

fig = go.Figure()
# add the first part of the horse-drawn cab line
# we simply plot the first part of horse-drawn cab line with dot markers until 1919
fig.add_trace(go.Scatter(x=plot_df.index[plot_df.index <= 1919], 
                         y=plot_df["horse-drawn cab"][plot_df.index <= 1919], 
                         mode='markers', 
                         name="horse-drawn cab until 1919  ",
                         marker=dict(color=horse_color, size=dot_marker_size),
                         showlegend=True,
                         legendgroup="horse-drawn cab until 1919",
                        #  legendgrouptitle_text="Horse-Drawn Cab",
                         legendgrouptitle_font_color=horse_color,
                        )
             ) 

# now let's add the second part of the horse-drawn cab line
fig.add_trace(go.Scatter(x=plot_df.index[plot_df.index >= 1919], 
                         y=plot_df["horse-drawn cab"][plot_df.index >= 1919], 
                         mode='lines', 
                         name="horse-drawn cab after 1919  ",
                         line=dict(color=horse_color, width=line_thickness),
                         showlegend=True,
                         legendgroup="horse-drawn cab after 1919  ",
                        #  legendgrouptitle_text="Horse-Drawn Cab",
                         legendgrouptitle_font_color=horse_color,
                        )
             )

# now let's add the taxi series; here we also want points until 1919
fig.add_trace(go.Scatter(x=plot_df.index[plot_df.index <= 1919],
                         y=plot_df["taxi"][plot_df.index <= 1919],
                         mode='markers',
                         name="taxi until 1919  ",
                         marker=dict(color=moto_color, size=dot_marker_size*0.5),
                         showlegend=True,
                         legendgroup="taxi until 1919  ",
                        #  legendgrouptitle_text="Taxi",
                         legendgrouptitle_font_color=moto_color,
                        )
             )


fig.add_trace(go.Scatter(x=plot_df.index[plot_df.index >= 1919], 
                         y=plot_df["taxi"][plot_df.index >= 1919], 
                         mode='lines', 
                         name="taxi after 1919  ",
                         line=dict(color=moto_color, width=line_thickness),
                         showlegend=True,
                         legendgroup="taxi",
                        #  legendgrouptitle_text="Taxi",
                         legendgrouptitle_font_color=moto_color,
                        )
             )

# set the title of the x axis
fig.update_xaxes(title="Year")
# set the title of the y axis
fig.update_yaxes(title="Relative Document Frequency")
# set width and height
fig.update_layout(width=width, height=height)
# set the font size
# change font size
fig.update_layout(font=dict(size=font_size))
# let's hide title
fig.update_layout(title="")
# let's update legend
fig.update_layout(showlegend=True)
# set the legend title
fig.update_layout(legend_title_text="")
# set the legend font size
fig.update_layout(legend_font=dict(size=font_size))

fig.update_layout(margin=dict(l=margin, r=margin, t=top_margin, b=bottom_margin))

# place the legend horizontally above the plot, aligned to the left
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="left",
    x=0.0
))


# fig.write_image("../img/first_edition_ormanis_taxi_over_years.png") # FIXME kaleido not installed
fig.show()


moto_color: rgb(0, 61, 165)
horse_color: rgb(255, 194, 90)
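The split at 1919 in the cell above relies on two boolean masks over the year index. A small sketch with invented values shows what each part contains. The name `split_year` is introduced here only for illustration:

```python
import pandas as pd

# Invented values, indexed by year
series = pd.Series(
    [0.5, 1.0, 0.4, 0.6],
    index=pd.Index([1917, 1919, 1920, 1921], name="year"),
)

split_year = 1919
early = series[series.index <= split_year]  # drawn as markers
late = series[series.index >= split_year]   # drawn as a line

# The split year belongs to both parts, so the line starts at the last marker
print(early.index.tolist())
print(late.index.tolist())
```

Because both masks include the split year, that data point is drawn twice, once as a marker and once as the start of the line. Using `>` for the second mask would instead leave a gap between the last marker and the line.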
