
Comprehensive Binary Classification with Datasets Merging

This project builds and evaluates machine learning models for a binary classification problem. The training data is split across two files, Training_part1.csv and Training_part2.csv, which are merged into a single complete training dataset.

No description of the dataset fields is provided; the only known semantics are that the "Class" column is the target variable and that the "id" column, present in both files, is the key for matching records, as sketched below.
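A minimal sketch of the merge step, assuming pandas and the data/ layout described under File Structure (the paths are illustrative; "id" comes from the description above):

    import pandas as pd

    # Load both parts of the training data (paths assumed from the data/ layout)
    part1 = pd.read_csv("data/Training_part1.csv")
    part2 = pd.read_csv("data/Training_part2.csv")

    # Match records on the shared "id" column to build the full training set
    train = part1.merge(part2, on="id")
    print(train.shape)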

Dataset Description

  • Training_part1.csv: Contains the first part of the training data.
  • Training_part2.csv: Contains the second part of the training data.

Preprocessing

  • Duplicate records in both parts of the training data are removed.
  • Missing values are imputed:
    • Numeric features (int_features and float_features) are filled with the mean of their respective columns.
    • Categorical features (cat_features) are filled with the mode of their respective columns.
  • Categorical features are then converted into dummy variables (see the sketch after this list).
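A minimal sketch of these steps with pandas (the names int_features, float_features, and cat_features come from the description above; deriving them from column dtypes is an assumption):

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        # Remove exact duplicate records
        df = df.drop_duplicates()

        # Assumed convention: derive the feature lists from column dtypes
        int_features = df.select_dtypes(include="int64").columns
        float_features = df.select_dtypes(include="float64").columns
        cat_features = df.select_dtypes(include="object").columns

        # Numeric features: fill missing values with the column mean,
        # excluding the id and target columns from imputation
        num_features = int_features.union(float_features).drop(
            ["id", "Class"], errors="ignore"
        )
        df[num_features] = df[num_features].fillna(df[num_features].mean())

        # Categorical features: fill missing values with the column mode
        for col in cat_features:
            df[col] = df[col].fillna(df[col].mode()[0])

        # Convert categoricals into dummy (one-hot) variables
        return pd.get_dummies(df, columns=list(cat_features))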

Feature Selection

  • Pairwise Pearson correlation is used to identify highly correlated features.
  • Features with an absolute correlation greater than 0.95 are dropped to reduce multicollinearity (sketched below).
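One common way to implement this filter (a sketch: the 0.95 threshold comes from the description, while the upper-triangle scan that drops the later feature of each correlated pair is an assumed convention):

    import numpy as np
    import pandas as pd

    def drop_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
        # Absolute Pearson correlation between every pair of features
        corr = X.corr().abs()
        # Keep only the upper triangle so each pair is examined once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # Drop any feature correlated above the threshold with an earlier one
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return X.drop(columns=to_drop)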

Models Used

  • Logistic Regression with Cross-Validation
  • Random Forest Classifier with Cross-Validation
  • Linear Support Vector Machine (SVM) with Cross-Validation

All three are trained the same way; see the sketch below.
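A plausible wiring with scikit-learn (a sketch, not the repo's exact code: the hyperparameters and 5-fold split are assumptions, and X/y are the preprocessed features and the "Class" target from the merged dataset):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # X, y assumed to come from the merged, preprocessed training set
    X = train.drop(columns=["id", "Class"])
    y = train["Class"]

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "linear_svm": LinearSVC(),
    }

    # Report mean 5-fold cross-validated accuracy per model
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")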

Evaluation Metrics

  • Accuracy: the overall proportion of correct predictions.
  • Classification reports: per-class precision, recall, F1-score, and support, plus macro and weighted averages (see the sketch below).
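Both metrics are available in scikit-learn; a sketch using out-of-fold predictions (cross_val_predict is an assumed choice, consistent with the cross-validated models above):

    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import cross_val_predict

    # Out-of-fold predictions so the metrics reflect cross-validated performance
    y_pred = cross_val_predict(models["random_forest"], X, y, cv=5)

    print("Accuracy:", accuracy_score(y, y_pred))
    # Per-class precision, recall, F1, support, plus macro and weighted averages
    print(classification_report(y, y_pred))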

File Structure

  • data/: Directory containing the dataset files.
    • Training_part1.csv: First part of the training data.
    • Training_part2.csv: Second part of the training data.
  • evaluation/: Directory containing scripts for evaluating model performance.
    • evaluation_metrics.py: Contains functions for generating evaluation metrics.
  • models/: Directory containing scripts for different classification models.
    • logistic_regression.py: Contains functions for logistic regression modeling.
    • random_forest.py: Contains functions for random forest modeling.
    • svm_classifier.py: Contains functions for support vector machine modeling.
  • preprocessing/: Directory containing scripts for data cleaning and preprocessing.
    • data_cleaning.py: Contains functions for cleaning and merging the datasets.
    • feature_selection.py: Contains functions for selecting relevant features.
  • main.py: Main Python script that orchestrates data preprocessing, model training, and evaluation.
  • Dockerfile: Contains instructions for building the Docker image.
  • docker-compose.yml: Defines services for running the application using Docker Compose.
  • requirements.txt: Lists the required dependencies for the project.
  • README.md: Documentation for the project.

Running the Code

Using Python

  1. Install the required dependencies listed in requirements.txt using pip:
    pip install -r requirements.txt
  2. Run the main.py script:
    python main.py
    

Using Docker

  1. Build the Docker image:
    docker build -t data-merging-classification .
  2. Run the Docker container:
    docker run data-merging-classification
  3. Alternatively, use Docker Compose:
    docker-compose up