data-science-snippets is a modular, production-ready collection of Python snippets containing curated, reusable utilities drawn from the day-to-day workflows of senior data scientists and machine learning engineers.
It includes tools for EDA, cleaning, validation, text processing, feature engineering, visualization, model evaluation, time series, and more, organized by task to keep your work clean and efficient.
- Covers every major step in the data science lifecycle
- Clean, modular structure by task
- Built for reusability in real-world projects
- Lightweight: only depends on pandas, numpy, matplotlib, seaborn by default
- Compatible with Python 3.9+
```
data-science-snippets/
├── eda/
│   ├── most_frequent_values.py
│   ├── data_summary.py
│   ├── cardinality_report.py
│   └── basic_statistics.py
├── data_cleaning/
│   ├── missing_data_summary.py
│   ├── outlier_detection.py
│   └── duplicate_removal.py
├── preprocessing/
│   ├── minmax_scaling.py
│   ├── encoding.py
│   └── normalize_columns.py
├── loading/
│   ├── load_csv_with_info.py
│   ├── safe_parquet_loader.py
│   └── load_large_file_chunks.py
├── visualization/
│   ├── missing_data_heatmap.py
│   ├── distribution_plot.py
│   ├── correlation_matrix.py
│   └── color_palette_utils.py
├── feature_engineering/
│   ├── create_datetime_features.py
│   ├── binning.py
│   ├── interaction_terms.py
│   └── rare_label_encoding.py
├── automated_eda/
│   ├── quick_eda_report.py
│   └── profile_report_wrapper.py
├── model_evaluation/
│   ├── classification_report_extended.py
│   ├── confusion_matrix_plot.py
│   ├── cross_validation_metrics.py
│   └── roc_auc_plot.py
├── text_processing/
│   ├── clean_text.py
│   ├── tokenize_text.py
│   └── tfidf_features.py
├── time_series/
│   ├── lag_features.py
│   ├── rolling_statistics.py
│   └── datetime_indexing.py
├── modeling/
│   ├── model_training.py
│   ├── pipeline_builder.py
│   └── hyperparameter_tuner.py
├── data_validation/
│   ├── schema_check.py
│   ├── unique_constraints.py
│   └── value_range_check.py
├── utils/
│   ├── memory_optimization.py
│   ├── execution_timer.py
│   └── logging_setup.py
└── README.md
```
eda/
- `most_frequent_values.py`: Shows the most common (modal) value per column, its frequency, and its percentage of non-null values (sketched below).
- `data_summary.py`: Summarizes dtypes, nulls, uniques, and memory usage for quick inspection.
- `cardinality_report.py`: Reports high-cardinality columns among categorical features.
- `basic_statistics.py`: Returns mean, median, min, max, std, and other summary statistics.
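For a feel of what such a helper can look like, here is a minimal sketch in the spirit of `most_frequent_values.py`; the function name mirrors the file, but the exact signature and output columns are assumptions, not the repository's actual API:

```python
import pandas as pd

def most_frequent_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return the modal value, its count, and its share of non-null values per column."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            rows.append({"column": col, "top_value": None, "frequency": 0, "percent": 0.0})
            continue
        counts = non_null.value_counts()  # sorted by frequency, descending
        rows.append({
            "column": col,
            "top_value": counts.index[0],
            "frequency": int(counts.iloc[0]),
            "percent": round(100 * counts.iloc[0] / len(non_null), 2),
        })
    return pd.DataFrame(rows)

# Tiny illustrative frame; None is excluded from the percentage base.
df = pd.DataFrame({"city": ["Cluj", "Cluj", "Iasi", None], "score": [1, 1, 2, 1]})
print(most_frequent_values(df))
```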
data_cleaning/
- `missing_data_summary.py`: Shows missing-value counts and percentages per column, along with data types.
- `outlier_detection.py`: Detects outliers using IQR or Z-score methods (see the sketch below).
- `duplicate_removal.py`: Identifies and removes duplicate rows or records.
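As a rough sketch of the IQR approach mentioned for `outlier_detection.py` (the helper name `iqr_outliers` and the boolean-mask return type are illustrative assumptions):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

s = pd.Series([10, 12, 11, 13, 12, 11, 300])  # 300 is the obvious outlier
print(s[iqr_outliers(s)])
```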
preprocessing/
- `minmax_scaling.py`: Scales numeric values to the [0, 1] range (example below).
- `encoding.py`: Label-encoding and one-hot-encoding utilities.
- `normalize_columns.py`: Z-score standardization and column normalization helpers.
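A minimal sketch of [0, 1] scaling in the spirit of `minmax_scaling.py`; the function name and the constant-column handling are assumptions:

```python
import pandas as pd

def minmax_scale(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Scale the selected (default: all numeric) columns to the [0, 1] range."""
    out = df.copy()
    columns = columns if columns is not None else out.select_dtypes("number").columns
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        span = col_max - col_min
        out[col] = 0.0 if span == 0 else (out[col] - col_min) / span  # guard constant columns
    return out

print(minmax_scale(pd.DataFrame({"age": [18, 35, 60], "city": ["A", "B", "C"]})))
```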
loading/
- `load_csv_with_info.py`: Loads CSVs and prints metadata such as shape, dtypes, and missing values.
- `safe_parquet_loader.py`: Robust Parquet file loader with fallback options.
- `load_large_file_chunks.py`: Loads large files in chunks with progress reporting (sketched below).
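Chunked reading in the spirit of `load_large_file_chunks.py` might look roughly like this; the function name, chunk size, and file path are hypothetical:

```python
import pandas as pd

def load_large_csv_in_chunks(path: str, chunksize: int = 100_000) -> pd.DataFrame:
    """Read a large CSV in chunks, report progress, then concatenate the pieces."""
    chunks = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize), start=1):
        print(f"chunk {i}: {len(chunk):,} rows")
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

# df = load_large_csv_in_chunks("data/events.csv")  # hypothetical path
```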
visualization/
- `missing_data_heatmap.py`: Visualizes missing values with a Seaborn heatmap (example below).
- `distribution_plot.py`: Plots distributions of numeric variables.
- `correlation_matrix.py`: Draws a correlation heatmap of numeric features.
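The missing-data heatmap is essentially a plot over `df.isna()`; the sketch below shows the general pattern, with the function name assumed:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def missing_data_heatmap(df: pd.DataFrame) -> None:
    """Plot a heatmap in which shaded cells mark missing values."""
    sns.heatmap(df.isna(), cbar=False, yticklabels=False)
    plt.title("Missing values by column")
    plt.tight_layout()
    plt.show()

missing_data_heatmap(pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, "x", "y"]}))
```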
feature_engineering/
- `create_datetime_features.py`: Extracts features such as day, month, year, and weekday from datetime columns (sketched below).
- `binning.py`: Performs binning (equal-width or quantile) on continuous variables.
- `interaction_terms.py`: Creates interaction features (e.g., feature1 * feature2).
- `rare_label_encoding.py`: Groups rare categorical labels into 'Other'.
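A minimal sketch of datetime feature extraction along the lines of `create_datetime_features.py`; the column-naming scheme is an assumption:

```python
import pandas as pd

def create_datetime_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Add year, month, day, and weekday columns derived from a datetime column."""
    out = df.copy()
    dt = pd.to_datetime(out[column])
    out[f"{column}_year"] = dt.dt.year
    out[f"{column}_month"] = dt.dt.month
    out[f"{column}_day"] = dt.dt.day
    out[f"{column}_weekday"] = dt.dt.weekday  # Monday = 0
    return out

df = pd.DataFrame({"order_date": ["2024-01-15", "2024-02-29"]})
print(create_datetime_features(df, "order_date"))
```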
automated_eda/
- `quick_eda_report.py`: Generates a summary of shape, dtypes, nulls, and basic statistics (see the sketch below).
- `profile_report_wrapper.py`: Wrapper for pandas-profiling / ydata-profiling report generation.
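A quick report of this kind can be as simple as the sketch below; the dictionary layout is illustrative, not the repository's actual output format:

```python
import pandas as pd

def quick_eda_report(df: pd.DataFrame) -> dict:
    """Collect shape, dtypes, null counts, and basic numeric stats in one place."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "nulls": df.isna().sum().to_dict(),
        "describe": df.describe().round(2),  # numeric columns only by default
    }

report = quick_eda_report(pd.DataFrame({"x": [1, 2, None], "y": ["a", "b", "b"]}))
print(report["nulls"])
```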
model_evaluation/
- `classification_report_extended.py`: Displays precision, recall, and F1 with support for multiple averaging strategies.
- `confusion_matrix_plot.py`: Annotated confusion matrix visualization.
- `cross_validation_metrics.py`: Computes metrics across folds and aggregates the results (sketched below).
- `roc_auc_plot.py`: Plots the ROC curve and calculates the AUC score.
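Cross-validated metrics are typically aggregated along these lines; scikit-learn is assumed as an extra dependency, and the helper name and return format are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

def cross_validation_metrics(model, X, y, cv: int = 5) -> dict:
    """Run cross-validation and report mean/std for several scoring metrics."""
    scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])
    return {
        m: {"mean": round(float(scores[f"test_{m}"].mean()), 3),
            "std": round(float(scores[f"test_{m}"].std()), 3)}
        for m in ["accuracy", "f1", "roc_auc"]
    }

X, y = make_classification(n_samples=300, random_state=0)
print(cross_validation_metrics(LogisticRegression(max_iter=1000), X, y))
```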
text_processing/
- `clean_text.py`: Removes punctuation, stopwords, and numbers, and lowercases text (see the sketch below).
- `tokenize_text.py`: Word and sentence tokenizers with NLTK or spaCy support.
- `tfidf_features.py`: Builds a TF-IDF matrix from text columns.
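Basic text cleaning usually combines lowercasing, punctuation/digit stripping, and a stopword filter; the sketch below uses only the standard library, whereas the repo's `clean_text.py` may rely on NLTK or spaCy stopword lists:

```python
import re
import string

def clean_text(text: str, stopwords=frozenset({"the", "a", "an", "and"})) -> str:
    """Lowercase text, strip punctuation and digits, and drop a small stopword set."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    tokens = [tok for tok in re.split(r"\s+", text) if tok and tok not in stopwords]
    return " ".join(tokens)

print(clean_text("The 3 quick brown foxes, and a lazy dog!"))
# -> "quick brown foxes lazy dog"
```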
time_series/
- `lag_features.py`: Generates lagged versions of a column for time-aware modeling (sketched below).
- `rolling_statistics.py`: Rolling mean, median, std, and min/max features.
- `datetime_indexing.py`: Time-based slicing, filtering, and resampling helpers.
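Lag and rolling features follow a common pandas pattern; the sketch below shows lags only, with an assumed helper name:

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, column: str, lags=(1, 7)) -> pd.DataFrame:
    """Add shifted copies of a column so models can see past values."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out

ts = pd.DataFrame(
    {"sales": range(10)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
print(add_lag_features(ts, "sales", lags=(1, 2)).head())
```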
modeling/
- `model_training.py`: Trains scikit-learn models with optional cross-validation and logging.
- `pipeline_builder.py`: Builds preprocessing + modeling pipelines using `Pipeline` or `ColumnTransformer` (see the sketch below).
- `hyperparameter_tuner.py`: Wraps `GridSearchCV` or `RandomizedSearchCV` with easy setup and evaluation.
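A pipeline builder in this spirit typically wires a `ColumnTransformer` into a `Pipeline`; the column names and model choice below are placeholders, and scikit-learn is assumed as an extra dependency:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(numeric_cols, categorical_cols) -> Pipeline:
    """Combine scaling, one-hot encoding, and a model in a single Pipeline."""
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([("prep", preprocessor), ("model", RandomForestClassifier(random_state=0))])

X = pd.DataFrame({"age": [25, 40, 33], "city": ["Cluj", "Iasi", "Cluj"]})
y = [0, 1, 0]
pipe = build_pipeline(["age"], ["city"]).fit(X, y)
print(pipe.predict(X))
```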
data_validation/
- `schema_check.py`: Validates the schema against expected dtypes and column names (sketched below).
- `unique_constraints.py`: Ensures unique values for IDs or compound keys.
- `value_range_check.py`: Checks for valid value ranges in numeric columns.
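Schema validation can be as lightweight as comparing expected dtypes against the frame; the function name and report format below are assumptions:

```python
import pandas as pd

def schema_check(df: pd.DataFrame, expected: dict) -> list:
    """Return human-readable schema violations (missing columns, mismatched dtypes)."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, found {df[col].dtype}")
    return problems

df = pd.DataFrame({"id": [1, 2], "amount": [9.99, 5.00]})
print(schema_check(df, {"id": "int64", "amount": "float64", "created_at": "datetime64[ns]"}))
```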
utils/
- `memory_optimization.py`: Downcasts numeric columns to save memory (see the sketch below).
- `execution_timer.py`: Times function execution with decorators or context managers.
- `logging_setup.py`: Sets up a consistent logging configuration for larger projects.
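Memory optimization here usually means downcasting numeric columns; a minimal sketch (helper name assumed) looks like this:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast int and float columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        kind = "integer" if pd.api.types.is_integer_dtype(out[col]) else "float"
        out[col] = pd.to_numeric(out[col], downcast=kind)
    return out

df = pd.DataFrame({"clicks": np.arange(1000, dtype="int64"), "rate": np.random.rand(1000)})
print(df.memory_usage(deep=True).sum(), "->", downcast_numeric(df).memory_usage(deep=True).sum())
```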
Copy-Paste

- Python ≥ 3.9
- pandas ≥ 1.5.3
- numpy ≥ 1.24.4
- seaborn ≥ 0.12.2
- matplotlib ≥ 3.6.3
Please see our SECURITY.md for vulnerability disclosure guidelines.
- Vataselu Andrei
- Nicola-Diana Sincaru
This project is licensed under the MIT License. See the LICENSE file for details.
We welcome contributions! If you have a reusable function or snippet that you think belongs in a senior data scientist's toolkit, feel free to open a pull request.