# Generate a gold-standard dataset of lexical change from the Dictionnaire historique du français (A. Rey) and Frantext

Summary:
- get the words from the parsed Dictionnaire historique (xlsx file)
- retain the nouns and verbs whose frequency is above 50 in Frantext 1800-1850 and above 100 in Frantext 1950-2000.

This is what the present notebook does. The next steps are:
- manually choose 10 nouns and 10 verbs from the resulting files (`noms_a_choisir_dico-rey.xlsx` and `verbes_a_choisir_dico-rey.xlsx`), then retrieve X sentences from Frantext (1800-1850) and X sentences from Frantext (1950-2000)
- use the SemEval 2020 Task 1 evaluation framework (https://arxiv.org/pdf/2007.11464.pdf) to generate the gold standard

## A. Generate frequency lists from "manual" retrieval from Frantext (1800-1850 and 1950-2000), freq > 50 and 100 respectively

Note: it is not possible to download full wordlists (here, for nouns and verbs) directly from the Frantext web interface, and correspondence with the Frantext staff did not resolve the issue. The lists were therefore copied and pasted manually from the wordlist results, 100 items per page (word, raw frequency, relative frequency), down to frequency > 5. The resulting files are:
- frantext_1800-1850_all_freq.csv
- frantext_1950-2000_all_freq.csv
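
Because the copy-pasted files hold one value per line, every three consecutive lines form one record (word, absolute frequency, relative frequency). The sketch below shows one way such a flat list could be regrouped and thresholded. The helper name `parse_frantext_lines`, the sample values, and the assumption that absolute frequencies are integers (possibly written with spaces as thousands separators) are illustrative only and are not taken from the actual files.

```python
import pandas as pd


def parse_frantext_lines(lines, min_freq):
    """Regroup a flat list of lines (word, freq, relfreq, word, ...) into a DataFrame.

    Hypothetical helper: assumes absolute frequencies are integers, possibly
    written with spaces as thousands separators.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected a multiple of 3 non-empty lines")
    df = pd.DataFrame({
        "word": cleaned[0::3],
        "freq": cleaned[1::3],
        "relfreq": cleaned[2::3],
    })
    # Drop any whitespace inside the number before converting to int.
    df["freq"] = df["freq"].str.replace(r"\s", "", regex=True).astype(int)
    # Keep only words strictly above the threshold.
    return df[df["freq"] > min_freq].reset_index(drop=True)


# Made-up sample, not real Frantext counts.
sample = ["maison", "1 200", "0.012", "chat", "40", "0.0004", "aimer", "75", "0.0007"]
filtered = parse_frantext_lines(sample, min_freq=50)
print(filtered)
```

Calling the helper with `min_freq=50` for the 1800-1850 list and `min_freq=100` for the 1950-2000 list would apply the thresholds stated in the summary.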

In [3]:
# Read the Frantext 1800-1850 wordlist.
# The file holds one value per line (word, absolute frequency, relative frequency);
# this unusual layout comes from the manual copy-paste procedure.
# The cell also writes frantext_1800-1850_all.xlsx
# (three columns: word, absolute frequency, relative frequency).

import pandas as pd

with open("frantext_1800-1850_all_freq.csv") as f:
    data = [line.strip() for line in f]

# every three consecutive lines form one record
words1850 = data[0::3]
freqs = data[1::3]
relfreqs = data[2::3]
#print(len(words1850))
#print(relfreqs)

# word -> absolute frequency lookup, used in section B below
words1850d = dict(zip(words1850, freqs))

# to_excel() returns None, so keep the DataFrame in its own variable
df = pd.DataFrame({'word': words1850, 'freq': freqs, 'relfreqs': relfreqs})
df.to_excel("frantext_1800-1850_all.xlsx", index=False)
#print(df.info())
        

In [4]:
# Same as above, for Frantext 1950-2000
import pandas as pd

with open("frantext_1950-2000_all_freq.csv") as f:
    data = [line.strip() for line in f]

# every three consecutive lines form one record
words1950 = data[0::3]
freqs = data[1::3]
relfreqs = data[2::3]
#print(len(words1950))
#print(relfreqs)

# word -> absolute frequency lookup, used in section B below
words1950d = dict(zip(words1950, freqs))

# to_excel() returns None, so keep the DataFrame in its own variable
df = pd.DataFrame({'word': words1950, 'freq': freqs, 'relfreqs': relfreqs})
df.to_excel("frantext_1950-2000_all.xlsx", index=False)
#print(df.info())
        

## B. Generate semantic innovation candidates from the Dictionnaire historique: words that have fewer than 6 senses and appear in both Frantext frequency lists

Note: `wiktionnary_nom_def.tsv` and `wiktionnary_verbs_def.tsv` are produced by the Python script `èxtractWiktionaire_n_v_adj.py`, located in the same directory.
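
The selection criterion named in this section's heading (fewer than 6 senses, attested in both Frantext lists) can be sketched as follows. This is only an illustration on toy data: it assumes that a Wiktionary definitions table holds one row per sense with `word` and `definition` columns, which may not match the real TSV layout, and `freq_1850` / `freq_1950` are stand-ins for the `words1850d` / `words1950d` dictionaries.

```python
import pandas as pd

# Toy stand-in for a Wiktionary definitions table; assumed layout: one row per sense.
senses = pd.DataFrame({
    "word": ["chat", "chat", "bar", "bar", "bar", "bar", "bar", "bar", "lune"],
    "definition": ["s1", "s2", "b1", "b2", "b3", "b4", "b5", "b6", "l1"],
})

# Toy frequency dictionaries standing in for words1850d and words1950d.
freq_1850 = {"chat": 120, "bar": 60, "soleil": 300}
freq_1950 = {"chat": 250, "bar": 400, "lune": 180}

MAX_SENSES = 6  # keep words with strictly fewer senses than this

# Count the number of senses (rows) per word.
sense_counts = senses.groupby("word").size()
few_senses = set(sense_counts[sense_counts < MAX_SENSES].index)

# Words attested in both periods.
in_both = freq_1850.keys() & freq_1950.keys()

candidates = sorted(few_senses & in_both)
print(candidates)
```

Here "bar" is dropped because it has 6 senses, and "lune" because it is missing from the earlier list.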

## Nouns

In [5]:
# Load the parsed Dictionnaire historique (Rey) entries
import pandas as pd

df = pd.read_excel("../dico-rey/dico-historique-rey.xlsx")
print(df.info())
# keep entries whose part-of-speech label contains "n." (nouns);
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df.pos.str.contains(r"n\.")].copy()
df['word'] = df.word.str.lower()
print(df.info())
print(df.word.value_counts().head(20))
#print(len(df.word.unique()))

# keep words present in both Frantext lists, attach their frequencies, and save to Excel
df3 = df[(df.word.isin(words1850)) & (df.word.isin(words1950))].copy()
df3['freq1850'] = df3.word.map(words1850d)
df3['freq1950'] = df3.word.map(words1950d)
df3.to_excel("noms_a_choisir_dico-rey.xlsx", index=False)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14594 entries, 0 to 14593
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   word      14594 non-null  object
 1   pos       14594 non-null  object
 2   maintext  14594 non-null  object
 3   html      14594 non-null  object
dtypes: object(4)
memory usage: 456.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12257 entries, 0 to 14593
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   word      12257 non-null  object
 1   pos       12257 non-null  object
 2   maintext  12257 non-null  object
 3   html      12257 non-null  object
dtypes: object(4)
memory usage: 478.8+ KB
None
prime     5
c         5
coco      4
bar       4
h         4
pan       4
pin       4
chat      4
m         4
faux      3
union     3
salve     3
botte     3
pion      3
flèche    3
balle     3
vague     3
droit     3
poêle     3
don   

## Verbs

In [6]:
# Load the parsed Dictionnaire historique (Rey) entries
import pandas as pd

df = pd.read_excel("../dico-rey/dico-historique-rey.xlsx")
print(df.info())
# keep entries whose part-of-speech label contains "v." (verbs);
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df.pos.str.contains(r"v\.")].copy()
df['word'] = df.word.str.lower()
print(df.info())
print(df.word.value_counts().head(20))
#print(len(df.word.unique()))

# keep words present in both Frantext lists, attach their frequencies, and save to Excel
df3 = df[(df.word.isin(words1850)) & (df.word.isin(words1950))].copy()
df3['freq1850'] = df3.word.map(words1850d)
df3['freq1950'] = df3.word.map(words1950d)
df3.to_excel("verbes_a_choisir_dico-rey.xlsx", index=False)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14594 entries, 0 to 14593
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   word      14594 non-null  object
 1   pos       14594 non-null  object
 2   maintext  14594 non-null  object
 3   html      14594 non-null  object
dtypes: object(4)
memory usage: 456.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2584 entries, 4 to 14583
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   word      2584 non-null   object
 1   pos       2584 non-null   object
 2   maintext  2584 non-null   object
 3   html      2584 non-null   object
dtypes: object(4)
memory usage: 100.9+ KB
None
in            9
ex            4
a             3
si            3
sombrer       2
planer        2
cingler       2
flétrir       2
contracter    2
sortir        2
louer         2
toper         2
embroncher    2
importer      2
super      