A comprehensive Python tool for performing regression analysis with an interactive text-based terminal interface
A comprehensive Python tool for performing regression analysis with an interactive curses-based terminal interface. Designed for statistical analysts needing exploratory regression analysis with publication-quality outputs and detailed statistical interpretation.
Version: 3.1-full-curses (Updated: 2026-02-14)
- Multiple Data Loading Methods:
  - Download data from URLs (CSV or JSON format) with CTRL-V paste support
  - Interactive curses-based file browser for local files
  - Support for CSV files with or without headers
  - Automatic format detection and validation
- Interactive Data Cleaning:
  - Inspect data with pagination and statistics view
  - Create custom filters (missing values, thresholds, ranges, outliers, custom queries)
  - View and manage active filters with preview
  - Transform columns to datetime format with validation
- Regression Analysis:
  - Simple Linear Regression (OLS)
  - Multiple Regression (OLS)
  - Logistic Regression with ROC curves
- Comprehensive Visualizations:
  - Q-Q plots for normality assessment
  - Residual plots with histograms
  - Correlation heatmaps
  - Influence plots (Cook's distance and leverage)
  - Simple regression plots (scatter, histograms, box plots, bar charts) for 2-variable analysis
  - ROC curves and confusion matrices for logistic regression
  - Prediction scatter plots
- Detailed Statistical Reports:
  - Full statsmodels output
  - Interpretation of 25+ key statistics:
    - R², Adjusted R², F-statistic
    - Coefficients and standardized beta weights
    - P-values with significance assessment
    - Durbin-Watson (autocorrelation)
    - Skewness/Kurtosis with convention choice
    - VIF and Condition Number (multicollinearity)
    - Cook's distance and leverage (influential observations)
    - Jarque-Bera and Omnibus (normality tests)
    - Breusch-Pagan (heteroscedasticity)
    - AIC/BIC (model comparison)
    - ROC AUC and classification metrics (logistic)
  - Reports saved to text files with timestamps
- Advanced Options:
  - Kurtosis convention selection (Excess/Standard)
  - Robust standard errors (HC0, HC1, HC3)
  - Confidence level selection (90%, 95%, 99%)
  - Classification threshold control (0.3, 0.5, 0.7)
- Session Management: Save and resume analysis sessions with full state preservation
- Install the dependencies
  ```sh
  pip install pandas statsmodels matplotlib seaborn numpy scipy requests
  ```
- Clone the repo
  ```sh
  git clone https://github.com/derezed88/stats.git
  ```
- Navigate to the project directory
  ```sh
  cd stats
  ```
- Make the script executable
  ```sh
  chmod +x regression.py
  ```
- Run the tool
  ```sh
  python regression.py
  # or
  ./regression.py
  ```

The tool will launch an interactive terminal-based interface with a main menu.
- Choose option 1 from main menu
- Enter a session name (optional, auto-generated if blank)
Option A: From URL (Option 4)
- Paste URL to CSV or JSON file
- Review data metrics (rows, columns, memory usage)
- Confirm to proceed
- Choose filename to save locally
Option B: From File (Option 5)
- Choose between the interactive file browser and manual path entry
- File Browser Mode: Navigate directories with arrow keys, select files with Enter
- Manual Mode: Type or paste file path
- Review data metrics
- Optionally copy to data directory (skipped if already in data/)
CSV Without Headers Support:
- Automatically detects if CSV lacks header row
- Prompts you to specify column names interactively
- Creates virtual headers that persist through analysis
- Session metadata tracks user-specified headers
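For reference, the headerless-CSV behavior can be approximated in plain pandas. This is a minimal sketch with made-up data and column names, not the tool's actual code:

```python
import io

import pandas as pd

# Hypothetical headerless CSV with three columns
raw = io.StringIO("1.2,3.4,0\n2.1,4.3,1\n0.9,2.8,0\n")

# header=None tells pandas the first row is data, not column names;
# `names` supplies the virtual headers the tool would prompt you for.
df = pd.read_csv(raw, header=None, names=["x1", "x2", "y"])
print(list(df.columns))  # ['x1', 'x2', 'y']
```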
- Choose specific columns for analysis, or 'all' for all columns
- These columns will be available for filtering and regression
Data Inspection:
- Interactive curses-based viewer with immediate response
- Navigate with arrow keys
- Toggle between data and statistics views
- Jump to specific rows
Add Filters:
- Remove missing values
- Filter by thresholds (>, <, ==)
- Filter by ranges
- Remove outliers (beyond N standard deviations)
- Custom pandas queries
Transform Columns to Datetime:
- Select column to transform
- Specify new column name (default: {column}_datetime)
- Optional format string (e.g., '%Y-%m', '%Y-%m-%d')
- Automatic validation with rollback on failure
- Persists through filter operations
- Plots show human-readable dates instead of Unix timestamps
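The validate-with-rollback pattern can be sketched with `pd.to_datetime` (the sample data and column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"month": ["2024-01", "2024-02", "not-a-date"]})

# errors="coerce" turns unparseable values into NaT instead of raising,
# so validation can happen before the new column is committed.
parsed = pd.to_datetime(df["month"], format="%Y-%m", errors="coerce")
if parsed.isna().any():
    # "rollback": leave the original DataFrame untouched
    print("validation failed, column not transformed")
else:
    df["month_datetime"] = parsed
```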
Filter Management:
- View active filters
- Remove individual filters
- Clear all filters
Step 1: Select Variables
- Choose dependent variable (Y)
- Choose independent variable(s) (X) - comma-separated
Step 2: Choose Regression Type
- Linear Regression (OLS)
- Multiple Regression (OLS)
- Logistic Regression
Step 3: Configure Advanced Options (if desired)
- Kurtosis convention (Excess vs Standard)
- Robust standard errors (None, HC0, HC1, HC3)
- Confidence level (90%, 95%, 99%) - for OLS
- Classification threshold (0.3, 0.5, 0.7) - for Logistic
Step 4: View Results
- Results displayed on screen
- Automatically generates:
  - OLS Regression:
    - Q-Q plot (normality assessment)
    - Residual plots (fitted vs residuals, histogram)
    - Correlation heatmap
    - Influence plot (Cook's distance, leverage)
    - Simple regression plots (for 2-variable analysis):
      - Scatter plot with regression line and confidence/prediction bands
      - Distribution histograms with KDE
      - Box plots for outlier detection
      - Bar plot of mean Y by binned X values
  - Logistic Regression:
    - ROC curve with AUC
    - Confusion matrix with classification metrics
    - Prediction scatter plot
  - Comprehensive text report with interpretations
- OLS Regression:
- View statistical output
- Read plain-English interpretations of:
- Model Fit: R², Adjusted R², Pseudo R² (logistic)
- Overall Significance: F-statistic, LLR p-value (logistic)
- Coefficients: Effect size, direction, and standardized beta weights
- Statistical Significance: P-values with assessment
- Assumptions:
- Autocorrelation (Durbin-Watson)
- Normality (Jarque-Bera, Omnibus, Skewness, Kurtosis)
- Heteroscedasticity (Breusch-Pagan)
- Multicollinearity: VIF, Condition Number
- Influential Observations: Cook's distance, Leverage
- Model Comparison: AIC, BIC
- Classification Performance (logistic): ROC AUC, Accuracy, Precision, Recall, F1-Score, Specificity
- Save your work to resume later
- All settings, selected columns, filters, and results are preserved
- Pickled session files stored in the sessions/ directory
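Pickling round-trips arbitrary Python state, which is what makes full session preservation possible. A sketch with a hypothetical session dict (the tool's real session structure is not shown here):

```python
import pickle

# Hypothetical session state
session = {"name": "demo", "filters": ["Age > 18"], "columns": ["Age", "Income"]}

blob = pickle.dumps(session)     # the bytes written to sessions/<name>.pkl
restored = pickle.loads(blob)    # what loading a session recovers
print(restored == session)
```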
Choose between two reporting standards:
- Excess (Fisher's) - Normal distribution = 0 (scipy default, recommended)
- Standard (Pearson's) - Normal distribution = 3 (textbook convention)
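The two conventions correspond to scipy's `fisher` flag, as this sketch on a large normal sample illustrates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)

excess = stats.kurtosis(sample, fisher=True)     # Excess (Fisher's): ~0 for normal
standard = stats.kurtosis(sample, fisher=False)  # Standard (Pearson's): ~3 for normal
print(round(excess, 2), round(standard, 2))
```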
Adjust for heteroscedasticity:
- None - Classical OLS standard errors (default)
- HC0 (White) - Basic heteroscedasticity-consistent
- HC1 - HC0 with degrees-of-freedom correction
- HC3 - Conservative, best for small samples
Reports show both classical and robust SE/p-values when enabled.
Select for confidence and prediction bands in simple regression plots:
- 90% - Wider acceptance region
- 95% - Standard significance level (default)
- 99% - Stricter, wider bands
Control positive class prediction threshold:
- 0.3 - More sensitive, predicts more positives
- 0.5 - Standard balanced threshold (default)
- 0.7 - More specific, predicts fewer positives
Affects confusion matrix and all classification metrics.
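The threshold simply cuts the predicted probabilities into classes, so raising it trades recall for precision. A sketch with hypothetical probabilities from a fitted logistic model:

```python
import numpy as np

# Hypothetical predicted probabilities
probs = np.array([0.2, 0.35, 0.55, 0.72, 0.9])

for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    print(threshold, int(preds.sum()), "positives")
```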
- Transform text columns to proper datetime format
- Specify format string or use auto-detection
- Creates new column preserving original
- Plots automatically format datetime axes with human-readable labels
- Supports various datetime formats (year, year-month, full dates, timestamps)
Data Inspection Viewer:
- ↑ or k - Previous page
- ↓ or K - Next page
- Page Up/Down - Scroll 5 pages at a time
- Home/End - Jump to first/last page
- s - Toggle statistics view
- d - Toggle data view
- j - Jump to specific row (opens input dialog)
- q - Quit inspection

File Browser:
- ↑/↓ - Navigate up/down
- Enter - Select file or enter directory
- ←/Backspace - Go up one directory
- a - Toggle showing all files (default: only CSV/JSON)
- Home/End - Jump to first/last item
- Page Up/Down - Scroll by page
- q - Quit browser

Menus:
- ↑/↓ - Navigate options
- Home/End - Jump to first/last
- Page Up/Down - Scroll menu
- Enter - Select option
- q - Cancel (if allowed)

List Selection:
- ↑/↓ or k/j - Navigate list
- Space - Toggle selection (multi-select mode)
- a - Select all
- n - Deselect all
- Enter - Confirm selection
- q - Cancel
The tool automatically creates these directories:
./
├── data/ # Downloaded data files
├── plots/ # Generated PNG plots (300 DPI)
├── reports/ # Statistical reports (TXT)
└── sessions/ # Saved session files (PKL)
reports/regression_report_YYYYMMDD_HHMMSS.txt
reports/shape_report_YYYYMMDD_HHMMSS.txt
Contains:
- Session metadata
- Applied filters
- Full statsmodels summary
- Statistical interpretations
- All diagnostic test results
OLS Regression:
plots/qq_plot_YYYYMMDD_HHMMSS.png
plots/residual_plot_YYYYMMDD_HHMMSS.png
plots/correlation_YYYYMMDD_HHMMSS.png
plots/influence_plot_YYYYMMDD_HHMMSS.png
plots/YYYYMMDD_HHMMSS_simple_regression_plots.png # 2 variables only
Logistic Regression:
plots/roc_curve_YYYYMMDD_HHMMSS.png
plots/confusion_matrix_YYYYMMDD_HHMMSS.png
plots/prediction_scatter_YYYYMMDD_HHMMSS.png
Simple Regression Plots (generated only when analyzing 2 variables):
- Top-left (Scatter): Relationship between X and Y with fitted regression line, confidence bands, and prediction bands
- Top-right (Histograms): Distribution of both variables with KDE on dual y-axes
- Bottom-left (Box plots): Side-by-side box plots for outlier detection
- Bottom-right (Bar chart): Mean Y value for binned X ranges with sample counts
Test with public datasets:
CSV Examples:
- https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv
- https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
- https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv
JSON Examples:
- Any REST API returning JSON arrays
Remove missing values:
Filter type: 1
Column: Age

Filter by threshold:
Filter type: 2
Column: Price
Threshold: 100

Remove outliers:
Filter type: 6
Column: Income
Standard deviations: 3

Custom query:
Filter type: 7
Query: Age > 18 and Income < 100000
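The filters above correspond to ordinary pandas operations, roughly as in this sketch (toy data; the tool's exact implementation is not shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [17, 25, np.nan, 40],
    "Price": [50, 150, 120, 90],
    "Income": [30_000, 45_000, 500_000, 60_000],
})

missing_removed = df.dropna(subset=["Age"])        # remove missing values
above_threshold = df[df["Price"] > 100]            # threshold filter
z = (df["Income"] - df["Income"].mean()) / df["Income"].std()
no_outliers = df[z.abs() <= 3]                     # outlier filter (3 SDs)
custom = df.query("Age > 18 and Income < 100000")  # custom pandas query
print(len(missing_removed), len(above_threshold), len(no_outliers), len(custom))
```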
"Could not parse as CSV or JSON"
- Verify URL is accessible
- Check that URL points directly to CSV/JSON file
- Ensure proper file format
"No valid data points after removing missing values"
- Review your filters - they may be too restrictive
- Use data inspection to check for missing values
- Consider removing or relaxing filters
"High multicollinearity"
- Independent variables are highly correlated
- Check VIF values in the report
- Consider removing some predictors
- Use VIF analysis to identify problematic variables
Analysis errors
- Check for non-numeric data in regression variables
- Ensure sufficient data points (at least 30 recommended, 10-20 per predictor)
- Verify dependent variable has variance
- Start with data inspection before creating filters to understand your data
- Transform datetime columns early if you have date/time data in text format
- Apply filters incrementally - add one at a time and inspect results
- Save sessions frequently to preserve your work
- Check Q-Q plots to verify normality assumptions
- Review VIF values to detect multicollinearity
- Use robust SEs if Breusch-Pagan test suggests heteroscedasticity
- Compare AIC/BIC when trying different model specifications
- Examine Cook's distance to identify influential observations
- For logistic regression, try different classification thresholds to optimize for your use case
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Open source - feel free to modify and extend!
Mark Jimenez - @properTweetment - xb12pilot@gmail.com
Project Link: https://github.com/derezed88/stats