# Using openpyxl for Direct Parsing of Formulas

Reading the dataset with `pd.read_csv()` or `pd.read_excel()` drops some formulas: 30 come back as NaN from the CSV file and 3 come back as NaN from the Excel file. I therefore use `openpyxl` to parse the formulas directly.

CSV files are plain text and don't preserve cell metadata such as the actual formula text; they only store the computed values. When Excel exports to CSV, it writes out the result of the formula, not the formula itself. This is why we use libraries like `openpyxl` with the Excel format (XLSX) and set `data_only=False` to retrieve the underlying formula strings.

In [1]:

from openpyxl import load_workbook
wb = load_workbook('FeynmanEquations.xlsx', data_only=False)
ws = wb.active


In [2]:
# Column holding the formula text (1-based, as in Excel: column D)
formula_col_index = 4
formula_list = []
# Iterate through rows, starting from row 2 to skip the header
for row in ws.iter_rows(min_row=2):
    # Row tuples are zero-indexed, so subtract 1 from the column number
    cell = row[formula_col_index - 1]
    formula_list.append(cell.value)

## Plan for Tokenization

### Extract All Unique Variable Names
1. Read the `FeynmanEquations.xlsx` file.
2. Collect all unique variable names from the columns `v1_name`, `v2_name`, `v3_name`, etc.

### Extract All Unique Operators and Functions
1. Parse the formula column.
2. Extract mathematical operators (`+`, `-`, `*`, `/`, `**`) and functions (`sin`, `cos`, `exp`, etc.).
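
A minimal sketch of this step, assuming the formulas are plain strings like those in `formula_list`. The three sample strings are copied from the dataset; `OPERATOR_RE` and `FUNCTION_RE` are illustrative names, and treating "identifier followed by `(`" as a function call is an assumption that holds only while the formulas contain no implicit multiplication.

```python
import re

# Sample formulas copied from the formula column of FeynmanEquations.xlsx.
formulas = [
    "exp(-theta**2/2)/sqrt(2*pi)",
    "q*(Ef+B*v*sin(theta))",
    "arcsin(n*sin(theta2))",
]

# '**' is listed before the single-character class so that the power
# operator is matched as one unit instead of two '*' characters.
OPERATOR_RE = re.compile(r"\*\*|[+\-*/]")
# Assumption: a function name is an identifier immediately followed by '('.
FUNCTION_RE = re.compile(r"([A-Za-z_]\w*)\s*\(")

operators, functions = set(), set()
for formula in formulas:
    operators.update(OPERATOR_RE.findall(formula))
    functions.update(FUNCTION_RE.findall(formula))

print(sorted(operators))  # ['*', '**', '+', '-', '/']
print(sorted(functions))  # ['arcsin', 'exp', 'sin', 'sqrt']
```

Running the same loop over the full `formula_list` would give the complete operator and function sets without maintaining a hand-written list.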

### Merge into a Single Set (Vocabulary)
1. Combine the extracted variable names and operators/functions into a single unique set.
2. This set forms our tokenization vocabulary.
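
As a rough sketch of the merge, assuming the three sets have already been collected. The small sets below are hypothetical stand-ins for `unique_variables` and the extracted operators and functions; `extras`, `token_to_id`, and `id_to_token` are names introduced here for illustration only.

```python
# Hypothetical stand-ins for the sets built in the earlier steps.
variables = {"theta", "sigma", "q", "Ef", "B", "v"}
operators = {"+", "-", "*", "/", "**"}
functions = {"exp", "sqrt", "sin"}
# Assumption: parentheses and named constants also need their own tokens.
extras = {"(", ")", "pi"}

# A name that is both a variable and a function would make tokens ambiguous,
# so fail early if the two sets overlap.
clashes = variables & functions
assert not clashes, f"ambiguous names: {clashes}"

vocab = variables | operators | functions | extras

# Sorting makes the token -> id mapping deterministic across runs.
token_to_id = {tok: i for i, tok in enumerate(sorted(vocab))}
id_to_token = {i: tok for tok, i in token_to_id.items()}

print(len(vocab))
```

The integer mapping is optional at this stage; it only matters once the token sequences are fed to a model that expects ids.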

### Tokenize Each Formula Using This Vocabulary
1. Convert each formula into a sequence of tokens using the extracted vocabulary.
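
One possible way to do this is a single regular expression that splits a formula into operators, parentheses, numbers, and identifiers. This is a sketch, not the notebook's final tokenizer: `TOKEN_RE` and `tokenize` are hypothetical names, the mini-vocabulary is a stand-in for the merged set, and passing numeric literals through unchanged is an assumption.

```python
import re

# Hypothetical mini-vocabulary; in the notebook it would be the merged set
# of variable names, operators, and functions.
vocab = {"exp", "sqrt", "pi", "theta", "sigma",
         "+", "-", "*", "/", "**", "(", ")"}

# Alternation order matters: '**' before '*'. Identifiers are matched whole,
# so a name like 'theta1' is not split into 'theta' and '1'.
TOKEN_RE = re.compile(r"\*\*|[+\-*/()]|\d+\.?\d*|[A-Za-z_]\w*")

def tokenize(formula, vocab):
    """Split a formula string into tokens and check them against vocab."""
    tokens = TOKEN_RE.findall(formula.replace(" ", ""))
    # Assumption: numeric literals pass through as-is; every other token
    # must already be in the vocabulary.
    unknown = [t for t in tokens if t not in vocab and not t[0].isdigit()]
    if unknown:
        raise ValueError(f"tokens not in vocabulary: {unknown}")
    return tokens

print(tokenize("exp(-(theta/sigma)**2/2)/(sqrt(2*pi)*sigma)", vocab))
```

Raising on unknown tokens surfaces gaps in the vocabulary early, for example a variable that appears in a formula but is missing from the `v*_name` columns.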


In [None]:
## Extract all unique variable names
import pandas as pd
import sympy
import re

df = pd.read_excel('FeynmanEquations.xlsx')
# Keep only the columns named v1_name, v2_name, ...
variable_columns = [col for col in df.columns if re.match(r'v\d+_name', col)]


print(variable_columns)

unique_variables=set()
for col in variable_columns:
    unique_variables.update(df[col].dropna().astype(str).unique())

print(unique_variables)



['v1_name', 'v2_name', 'v3_name', 'v4_name', 'v5_name', 'v6_name', 'v7_name', 'v8_name', 'v9_name', 'v10_name']
{'x1', 'G', 'q', 'kappa', 't', 'Volt', 'V1', 'M', 'H', 'n', 'mu_drift', 'z2', 'Y', 'Nn', 'pr', 'chi', 'p_d', 'delta', 'sigma_den', 'rho', 'Pwr', 'n_0', 'y2', 'x3', 'epsilon', 'T', 'u', 'U', 'A_vec', 'omega', 'n_rho', 'm1', 'T1', 'Int_0', 'sigma', 'Bx', 'd', 'z', 'a', 'h', 'I_0', 'q2', 'd1', 'theta', 'I2', 'x', 'm2', 'B', 'rho_c_0', 'alpha', 'g_', 'y', 'q1', 'omega_0', 'theta2', 'By', 'V', 'T2', 'd2', 'I1', 'E_n', 'V2', 'z1', 'F', 'y1', 'p', 'r', 'beta', 'Ef', 'v', 'r1', 'y3', 'mom', 'm_0', 'theta1', 'Jz', 'A', 'kb', 'w', 'm', 'c', 'lambd', 'g', 'gamma', 'k', 'C', 'mu', 'x2', 'Bz', 'r2', 'k_spring', 'mob', 'I'}


In [30]:
### Checker: test whether selected names appear among the variables
# print("sigma" in unique_variables)
# print("pi" in unique_variables)
# print("omega" in unique_variables)
# print("h" in unique_variables)
print("ln" in unique_variables)
print("Int" in unique_variables)
print("delta" in unique_variables)




False
False
True


In [None]:
### All mathematical operators and symbols expected
mathematical_symbols = ['+', '-', '*', '/', '**', 'exp', 'sqrt', 'pi', 'sin()', 'cos()', 'ln()', 'Int()', 'tanh()', 'log()', 'arcsin()', 'arctan()', 'arccos()']

# Display the parsed formulas; Jupyter only shows the last expression in a cell
formula_list



['exp(-theta**2/2)/sqrt(2*pi)',
 'exp(-(theta/sigma)**2/2)/(sqrt(2*pi)*sigma)',
 'exp(-((theta-theta1)/sigma)**2/2)/(sqrt(2*pi)*sigma)',
 'sqrt((x2-x1)**2+(y2-y1)**2)',
 'G*m1*m2/((x2-x1)**2+(y2-y1)**2+(z2-z1)**2)',
 'm_0/sqrt(1-v**2/c**2)',
 'x1*y1+x2*y2+x3*y3',
 'mu*Nn',
 'q1*q2*r/(4*pi*epsilon*r**3)',
 'q1*r/(4*pi*epsilon*r**3)',
 'q2*Ef',
 'q*(Ef+B*v*sin(theta))',
 '1/2*m*(v**2+u**2+w**2)',
 'G*m1*m2*(1/r2-1/r1)',
 'm*g*z',
 '1/2*k_spring*x**2',
 '(x-u*t)/sqrt(1-u**2/c**2)',
 '(t-u*x/c**2)/sqrt(1-u**2/c**2)',
 'm_0*v/sqrt(1-v**2/c**2)',
 '(u+v)/(1+u*v/c**2)',
 '(m1*r1+m2*r2)/(m1+m2)',
 'r*F*sin(theta)',
 'm*r*v*sin(theta)',
 '1/2*m*(omega**2+omega_0**2)*1/2*x**2',
 'q/C',
 'arcsin(n*sin(theta2))',
 '1/(1/d1+n/d2)',
 'omega/c',
 'sqrt(x1**2+x2**2-2*x1*x2*cos(theta1-theta2))',
 'Int_0*sin(n*theta/2)**2/sin(theta/2)**2',
 'arcsin(lambd/(n*d))',
 'q**2*a**2/(6*pi*epsilon*c**3)',
 '(1/2*epsilon*c*Ef**2)*(8*pi*r**2/3)*(omega**4/(omega**2-omega_0**2)**2)',
 'q*v*B/p',
 'omega_0/(1-v/c)',
