A small domain-specific language and Python interpreter for loading tabular data, basic cleaning, and training a choice of two sklearn models—built as a final project for a programming class.
EasyML runs .ezml scripts line by line: set a file path, load CSV or Excel, clean rows, optionally fit classification or regression (picking between two fixed model pairs), and export datasets or models to exported/. It is a learning exercise, not a production ML system.
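The line-by-line execution model can be sketched roughly as below. This is an illustration only, not the code in easyML.py: `run_script`, the `HANDLERS` table, and the `state` dictionary are hypothetical names, and the handlers only record what a real implementation would do with pandas.

```python
# Hypothetical sketch of a line-by-line interpreter loop (not easyML.py itself).

def handle_datapath(tokens: list[str], state: dict) -> None:
    # DATAPATH name = 'path'  ->  ['DATAPATH', name, '=', "'path'"]
    state["paths"][tokens[1]] = tokens[3].strip("'\"")


def handle_dataset(tokens: list[str], state: dict) -> None:
    # DATASET name = LOAD pathName  (a real handler would call pandas here)
    path = state["paths"][tokens[4]]
    state["log"].append(f"load {path} as {tokens[1]}")


def handle_clean(tokens: list[str], state: dict) -> None:
    # CLEAN name
    state["log"].append(f"clean {tokens[1]}")


# The first token of each line selects the handler.
HANDLERS = {
    "DATAPATH": handle_datapath,
    "DATASET": handle_dataset,
    "CLEAN": handle_clean,
}


def run_script(text: str) -> dict:
    """Execute a script one statement per line and return the final state."""
    state = {"paths": {}, "log": []}
    for line in text.splitlines():
        tokens = line.split()  # whitespace splitting, as the README describes
        if not tokens:
            continue  # skip blank lines
        handler = HANDLERS.get(tokens[0])
        if handler is None:
            raise ValueError(f"unknown statement: {tokens[0]}")
        handler(tokens, state)
    return state
```

A dispatch table keyed on the first token keeps each statement type in its own small function, which suits a one-statement-per-line language.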
Python 3.11, pandas, scikit-learn (linear/logistic regression, decision tree classifier, random forest regressor), joblib, openpyxl (Excel). Dependencies are pinned in environment.yml and requirements.txt.
- Miniconda or Anaconda with `conda` on your `PATH`
- Your own CSV/XLSX data if you run the samples (see `trainingData/README.md`)
- Clone this repository and `cd` into the project root (the directory that contains `easyML.py` and `environment.yml`).
- Create the environment: `conda env create -f environment.yml`
- Activate it: `conda activate undergrad-archive--easyml`
- Create an output directory (the first `DOWNLOAD` will fail if it is missing): `mkdir -p exported`
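If you would rather create the output directory from Python (for example in a wrapper script), a minimal sketch follows. The helper name `ensure_output_dir` is hypothetical and is not part of EasyML; only the directory name `exported` comes from this README.

```python
from pathlib import Path


def ensure_output_dir(root: str = ".", name: str = "exported") -> Path:
    """Create the output directory if it is missing and return its path."""
    out_dir = Path(root) / name
    # exist_ok=True makes the call safe to repeat on later runs.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `ensure_output_dir()` from the project root has the same effect as `mkdir -p exported`.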
To run a script:
conda activate undergrad-archive--easyml
cd /path/to/EasyML
python easyML.py your_script.ezml

Example script (classification):
DATAPATH myPath = 'data/myfile.csv'
DATASET myDf = LOAD myPath
CLEAN myDf
MODEL myModel = PREDICT_CAT(myDf, COLUMN J)
DOWNLOAD DATASET myDf
DOWNLOAD MODEL myModel

Example script (regression):
DATAPATH myPath = 'data/myfile.csv'
DATASET myDf = LOAD myPath
CLEAN myDf
MODEL myModel = PREDICT_NUM(myDf, COLUMN I)
DOWNLOAD DATASET myDf
DOWNLOAD MODEL myModel
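Because statements are split on whitespace, a quoted path that contains a space breaks into extra tokens. The sketch below shows the effect on a `DATAPATH` line and contrasts it with the standard library's `shlex.split`, which respects quotes. `shlex` is shown only as a possible alternative; EasyML itself uses `line.split()`, and the file names here are made up for illustration.

```python
import shlex

simple = "DATAPATH myPath = 'data/myfile.csv'"
spaced = "DATAPATH myPath = 'my data/my file.csv'"  # hypothetical path with spaces

# Plain whitespace splitting: fine for the simple case ...
simple_tokens = simple.split()
# ... but the quoted path with spaces is cut into several pieces.
naive_tokens = spaced.split()

# shlex.split keeps the quoted path together and removes the quotes.
quoted_tokens = shlex.split(spaced)

print(len(simple_tokens), len(naive_tokens), len(quoted_tokens))
print(quoted_tokens[3])
```

With plain splitting the path ends up spread across three tokens, so a handler that reads only the fourth token would see `'my` instead of the full path.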
- `sample1.ezml`: Loads `running.xlsx`, cleans, exports the dataset only.
- `sample2.ezml`: Loads `Titanic.csv`, cleans, trains classification on `COLUMN J`, exports dataset and model.
- `sample3.ezml`: Loads `housing.csv`, cleans, trains regression on `COLUMN I`, exports dataset and model.
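The samples that train a model pick between two fixed candidates. One plausible way to make that choice is to compare cross-validated scores and keep the better model; the sketch below does this for the classification pair (logistic regression and a decision tree) on synthetic data. The selection rule, the `pick_best` helper, and the data are assumptions for illustration, and the rule EasyML actually applies may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def pick_best(candidates, X, y, cv=5):
    """Return (name, fitted_model, mean_cv_score) for the best-scoring candidate."""
    scored = []
    for name, model in candidates.items():
        # Default scoring for classifiers is accuracy.
        score = cross_val_score(model, X, y, cv=cv).mean()
        scored.append((score, name, model))
    best_score, best_name, best_model = max(scored, key=lambda item: item[0])
    best_model.fit(X, y)  # refit the winner on all rows before exporting
    return best_name, best_model, best_score


# Synthetic stand-in for a cleaned dataset with a categorical target.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
name, model, score = pick_best(candidates, X, y)
print(name, round(score, 3))
```

The same helper would work for the regression pair by passing `LinearRegression` and `RandomForestRegressor`, in which case the default score is R² instead of accuracy.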
- One statement per line, tokenized by whitespace (`line.split()`), so paths with spaces are fragile. Targets are given as `COLUMN` letters (A=0, …). `PREDICT_CAT` trains classification; `PREDICT_NUM` trains regression.
- No unit tests or CI; little error handling beyond a missing script file.
- No datasets in the repo; samples assume files exist at the given `DATAPATH` (see `trainingData/README.md` for formats and public sources). `exported/` is not created automatically.
- Model choice and metrics are simplistic; multiclass targets and messy real-world tables can still cause failures.
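The `COLUMN` letter convention (A=0, …) maps to zero-based column positions. A minimal sketch of that mapping follows; the function name `column_index` is hypothetical, and the handling of multi-letter labels (spreadsheet style, so `AA` follows `Z`) is an assumption that the README does not confirm.

```python
def column_index(letters: str) -> int:
    """Convert a spreadsheet-style column label to a zero-based index."""
    letters = letters.strip().upper()
    if not (letters.isascii() and letters.isalpha()):
        raise ValueError(f"invalid column label: {letters!r}")
    index = 0
    for ch in letters:
        # Treat the label as a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1  # shift to zero-based, so A -> 0
```

Under this mapping `COLUMN J` in the classification example is index 9 (the tenth column) and `COLUMN I` in the regression example is index 8.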
Apache License 2.0 — see LICENSE.