# Wine Cultivar Origin Prediction System
## Machine Learning Model Training and Evaluation

**Objective**: Predict the cultivar (origin/class) of wine based on its chemical properties using the Wine Dataset

**Selected Features (6 out of 11)**:
1. Alcohol
2. Malic Acid
3. Total Phenols
4. Flavanoids
5. Color Intensity
6. Proline

**Target Variable**: Cultivar (Class Label - 1, 2, or 3)

**Algorithm**: Random Forest Classifier

## Step 1: Import Required Libraries

Import all necessary libraries for data manipulation, machine learning, and model evaluation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import pickle
import warnings

warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


## Step 2: Load the Wine Dataset

Load the wine.data file and prepare it for analysis.

In [2]:
# Define feature names based on the dataset description (13 features)
feature_names = [
    'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'color_intensity', 'hue',
    'od280/od315_of_diluted_wines', 'proanthocyanins', 'color_hue', 'proline'
]

# Load the dataset
data = pd.read_csv('../wine.data', header=None)

# Add column names (first column is the cultivar/target)
column_names = ['cultivar'] + feature_names
data.columns = column_names

print("=" * 70)
print("STEP 1: LOADING WINE DATASET")
print("=" * 70)
print(f"\n✓ Dataset loaded successfully!")
print(f"  - Shape: {data.shape}")
print(f"  - Total samples: {len(data)}")
print(f"  - Total features (including target): {len(data.columns)}")
print(f"\nFirst few rows of the dataset:")
print(data.head())
print(f"\nDataset Info:")
print(data.info())

STEP 1: LOADING WINE DATASET

✓ Dataset loaded successfully!
  - Shape: (178, 14)
  - Total samples: 178
  - Total features (including target): 14

First few rows of the dataset:
   cultivar  alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  \
0         1    14.23        1.71  2.43               15.6        127   
1         1    13.20        1.78  2.14               11.2        100   
2         1    13.16        2.36  2.67               18.6        101   
3         1    14.37        1.95  2.50               16.8        113   
4         1    13.24        2.59  2.87               21.0        118   

   total_phenols  flavanoids  color_intensity   hue  \
0           2.80        3.06             0.28  2.29   
1           2.65        2.76             0.26  1.28   
2           2.80        3.24             0.30  2.81   
3           3.85        3.49             0.24  2.18   
4           2.80        2.69             0.39  1.82   

   od280/od315_of_diluted_wines  proanthocyanins  color_h

## Step 3: Data Preprocessing

Handle missing values and explore the data distribution.

In [3]:
print("\n" + "=" * 70)
print("STEP 2: DATA PREPROCESSING")
print("=" * 70)

# Check for missing values
print("\nChecking for missing values...")
missing_values = data.isnull().sum().sum()
if missing_values == 0:
    print("✓ No missing values found in the dataset")
else:
    print(f"✓ Found {missing_values} missing values - handling them...")
    data = data.dropna()

print(f"\nDataset info after handling missing values:")
print(f"  - Shape: {data.shape}")
print(f"  - Unique cultivars: {data['cultivar'].nunique()}")
print(f"  - Cultivar distribution:")
print(data['cultivar'].value_counts().sort_index())


STEP 2: DATA PREPROCESSING

Checking for missing values...
✓ No missing values found in the dataset

Dataset info after handling missing values:
  - Shape: (178, 14)
  - Unique cultivars: 3
  - Cultivar distribution:
cultivar
1    59
2    71
3    48
Name: count, dtype: int64


## Step 4: Feature Selection

Select 6 features from the 13 available features.

In [5]:
print("\n" + "=" * 70)
print("STEP 3: FEATURE SELECTION")
print("=" * 70)

# Selected 6 features
selected_features = [
    'alcohol', 'malic_acid', 'total_phenols',
    'flavanoids', 'color_intensity', 'proline'
]

print(f"\nSelected Features ({len(selected_features)}):")
for i, feature in enumerate(selected_features, 1):
    print(f"  {i}. {feature}")

X = data[selected_features]
y = data['cultivar']

print(f"\nFeature matrix (X) shape: {X.shape}")
print(f"Target vector (y) shape: {y.shape}")
print(f"\nFeature statistics:")
print(X.describe())


STEP 3: FEATURE SELECTION

Selected Features (6):
  1. alcohol
  2. malic_acid
  3. total_phenols
  4. flavanoids
  5. color_intensity
  6. proline

Feature matrix (X) shape: (178, 6)
Target vector (y) shape: (178,)

Feature statistics:
          alcohol  malic_acid  total_phenols  flavanoids  color_intensity  \
count  178.000000  178.000000     178.000000  178.000000       178.000000   
mean    13.000618    2.336348       2.295112    2.029270         0.361854   
std      0.811827    1.117146       0.625851    0.998859         0.124453   
min     11.030000    0.740000       0.980000    0.340000         0.130000   
25%     12.362500    1.602500       1.742500    1.205000         0.270000   
50%     13.050000    1.865000       2.355000    2.135000         0.340000   
75%     13.677500    3.082500       2.800000    2.875000         0.437500   
max     14.830000    5.800000       3.880000    5.080000         0.660000   

           proline  
count   178.000000  
mean    746.893258  
std  

## Step 5: Feature Scaling

Apply StandardScaler to normalize features to have mean 0 and standard deviation 1.

In [6]:
print("\n" + "=" * 70)
print("STEP 4: FEATURE SCALING (StandardScaler)")
print("=" * 70)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("✓ Features scaled using StandardScaler")
print(f"  - Scaled data shape: {X_scaled.shape}")
print(f"  - Mean of scaled features (should be ~0): {X_scaled.mean(axis=0).round(10)}")
print(f"  - Std of scaled features (should be ~1): {X_scaled.std(axis=0).round(4)}")


STEP 4: FEATURE SCALING (StandardScaler)
✓ Features scaled using StandardScaler
  - Scaled data shape: (178, 6)
  - Mean of scaled features (should be ~0): [-0. -0.  0. -0.  0. -0.]
  - Std of scaled features (should be ~1): [1. 1. 1. 1. 1. 1.]


## Step 6: Train-Test Split

Split the data into training (80%) and testing (20%) sets with stratification.

In [7]:
print("\n" + "=" * 70)
print("STEP 5: TRAIN-TEST SPLIT")
print("=" * 70)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"✓ Data split successfully!")
print(f"  - Training set size: {X_train.shape[0]} samples")
print(f"  - Testing set size: {X_test.shape[0]} samples")
print(f"  - Train-Test ratio: {len(X_train)/len(X_test):.2f}")


STEP 5: TRAIN-TEST SPLIT
✓ Data split successfully!
  - Training set size: 142 samples
  - Testing set size: 36 samples
  - Train-Test ratio: 3.94


## Step 7: Train the Model

Train a Random Forest Classifier with the training data.

In [8]:
print("\n" + "=" * 70)
print("STEP 6: MODEL TRAINING (Random Forest Classifier)")
print("=" * 70)

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

print("Training the Random Forest Classifier...")
model.fit(X_train, y_train)

print("✓ Model trained successfully!")
print(f"  - Algorithm: Random Forest Classifier")
print(f"  - Number of trees: 100")
print(f"  - Max depth: 10")
print(f"  - Training completed")


STEP 6: MODEL TRAINING (Random Forest Classifier)
Training the Random Forest Classifier...
✓ Model trained successfully!
  - Algorithm: Random Forest Classifier
  - Number of trees: 100
  - Max depth: 10
  - Training completed


## Step 8: Model Evaluation

Evaluate the model using multiple classification metrics.

In [9]:
print("\n" + "=" * 70)
print("STEP 7: MODEL EVALUATION")
print("=" * 70)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"\n{'ACCURACY METRICS':-^70}")
print(f"Training Accuracy:    {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy:     {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

# Precision, Recall, F1-Score (Weighted)
test_precision = precision_score(y_test, y_test_pred, average='weighted', zero_division=0)
test_recall = recall_score(y_test, y_test_pred, average='weighted', zero_division=0)
test_f1 = f1_score(y_test, y_test_pred, average='weighted', zero_division=0)

print(f"\n{'WEIGHTED METRICS (Test Set)':-^70}")
print(f"Precision (Weighted): {test_precision:.4f}")
print(f"Recall (Weighted):    {test_recall:.4f}")
print(f"F1-Score (Weighted):  {test_f1:.4f}")

# Precision, Recall, F1-Score (Macro)
test_precision_macro = precision_score(y_test, y_test_pred, average='macro', zero_division=0)
test_recall_macro = recall_score(y_test, y_test_pred, average='macro', zero_division=0)
test_f1_macro = f1_score(y_test, y_test_pred, average='macro', zero_division=0)

print(f"\n{'MACRO METRICS (Test Set)':-^70}")
print(f"Precision (Macro):    {test_precision_macro:.4f}")
print(f"Recall (Macro):       {test_recall_macro:.4f}")
print(f"F1-Score (Macro):     {test_f1_macro:.4f}")


STEP 7: MODEL EVALUATION

---------------------------ACCURACY METRICS---------------------------
Training Accuracy:    1.0000 (100.00%)
Testing Accuracy:     0.9722 (97.22%)

---------------------WEIGHTED METRICS (Test Set)----------------------
Precision (Weighted): 0.9741
Recall (Weighted):    0.9722
F1-Score (Weighted):  0.9720

-----------------------MACRO METRICS (Test Set)-----------------------
Precision (Macro):    0.9778
Recall (Macro):       0.9667
F1-Score (Macro):     0.9710


In [10]:
# Classification Report
print(f"\n{'DETAILED CLASSIFICATION REPORT (Test Set)':-^70}")
print("\n" + classification_report(y_test, y_test_pred, zero_division=0))

# Confusion Matrix
print(f"\n{'CONFUSION MATRIX':-^70}")
cm = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:")
print(cm)


--------------DETAILED CLASSIFICATION REPORT (Test Set)---------------

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        12
           2       0.93      1.00      0.97        14
           3       1.00      0.90      0.95        10

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36


---------------------------CONFUSION MATRIX---------------------------
Confusion Matrix:
[[12  0  0]
 [ 0 14  0]
 [ 0  1  9]]


In [11]:
# Feature Importance
print(f"\n{'FEATURE IMPORTANCE':-^70}")
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance Scores:")
for idx, row in feature_importance.iterrows():
    print(f"  {row['Feature']:25s}: {row['Importance']:.4f}")

# Display as a nice table
print("\n\nFeature Importance DataFrame:")
print(feature_importance)


--------------------------FEATURE IMPORTANCE--------------------------

Feature Importance Scores:
  flavanoids               : 0.2721
  proline                  : 0.2706
  alcohol                  : 0.1977
  malic_acid               : 0.1114
  total_phenols            : 0.1107
  color_intensity          : 0.0375


Feature Importance DataFrame:
           Feature  Importance
3       flavanoids    0.272123
5          proline    0.270554
0          alcohol    0.197653
1       malic_acid    0.111426
2    total_phenols    0.110721
4  color_intensity    0.037523


## Step 9: Save the Model

Save the trained model and scaler to disk for future use.

In [12]:
print("\n" + "=" * 70)
print("STEP 8: SAVING THE TRAINED MODEL")
print("=" * 70)

# Save the trained model
model_filename = 'wine_cultivar_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(model, file)

print(f"✓ Model saved successfully to '{model_filename}'")

# Save the scaler as well for future predictions
scaler_filename = 'wine_scaler.pkl'
with open(scaler_filename, 'wb') as file:
    pickle.dump(scaler, file)

print(f"✓ Scaler saved successfully to '{scaler_filename}'")

# Save feature names and selected features info
info_filename = 'wine_model_info.txt'
with open(info_filename, 'w') as file:
    file.write("WINE CULTIVAR PREDICTION MODEL - INFORMATION\n")
    file.write("=" * 70 + "\n\n")
    file.write("SELECTED FEATURES:\n")
    for feature in selected_features:
        file.write(f"  - {feature}\n")
    file.write(f"\nTARGET VARIABLE: cultivar\n")
    file.write(f"\nMODEL PERFORMANCE:\n")
    file.write(f"  - Training Accuracy: {train_accuracy:.4f}\n")
    file.write(f"  - Testing Accuracy: {test_accuracy:.4f}\n")
    file.write(f"  - Precision (Weighted): {test_precision:.4f}\n")
    file.write(f"  - Recall (Weighted): {test_recall:.4f}\n")
    file.write(f"  - F1-Score (Weighted): {test_f1:.4f}\n")

print(f"✓ Model information saved to '{info_filename}'")


STEP 8: SAVING THE TRAINED MODEL
✓ Model saved successfully to 'wine_cultivar_model.pkl'
✓ Scaler saved successfully to 'wine_scaler.pkl'
✓ Model information saved to 'wine_model_info.txt'


## Step 10: Project Summary

Summary of the Wine Cultivar Prediction System development and deployment.

In [13]:
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"""
✓ Wine Cultivar Origin Prediction System Successfully Developed!

Dataset: {len(data)} wine samples with {len(feature_names)} features
Selected Features: {len(selected_features)} features
Algorithm: Random Forest Classifier
Test Accuracy: {test_accuracy*100:.2f}%

Files Generated:
  1. wine_cultivar_model.pkl - Trained machine learning model
  2. wine_scaler.pkl - Feature scaling scaler
  3. wine_model_info.txt - Model information and performance metrics

The model is ready for making predictions on new wine samples!
""")
print("=" * 70)


SUMMARY

✓ Wine Cultivar Origin Prediction System Successfully Developed!

Dataset: 178 wine samples with 13 features
Selected Features: 6 features
Algorithm: Random Forest Classifier
Test Accuracy: 97.22%

Files Generated:
  1. wine_cultivar_model.pkl - Trained machine learning model
  2. wine_scaler.pkl - Feature scaling scaler
  3. wine_model_info.txt - Model information and performance metrics

The model is ready for making predictions on new wine samples!



## Step 11: Making Predictions on New Samples

Example code for using the trained model to make predictions on new wine samples.

In [14]:
# Example 1: Make predictions from dataset
print("\n" + "=" * 70)
print("EXAMPLE: Making Predictions on New Wine Samples")
print("=" * 70)

# Example wine samples from the dataset
example_samples = {
    'Sample 1': [14.23, 1.71, 2.8, 3.06, 5.64, 1065],
    'Sample 2': [13.2, 1.78, 2.65, 2.76, 4.38, 1050],
    'Sample 3': [12.37, 1.21, 2.56, 2.67, 3.04, 985],
}

cultivar_names = {1: 'Cultivar 1', 2: 'Cultivar 2', 3: 'Cultivar 3'}

print("\nPredicting cultivar for new wine samples...")
print("-" * 70)

for sample_name, feature_values in example_samples.items():
    # Convert to numpy array and reshape for scaler
    sample = np.array(feature_values).reshape(1, -1)
    
    # Scale the features
    sample_scaled = scaler.transform(sample)
    
    # Make prediction
    prediction = model.predict(sample_scaled)[0]
    
    # Get prediction probability
    prediction_proba = model.predict_proba(sample_scaled)[0]
    
    print(f"\n{sample_name}")
    print(f"  Features: {dict(zip(selected_features, feature_values))}")
    print(f"  Predicted Cultivar: {cultivar_names[prediction]}")
    print(f"  Confidence: {max(prediction_proba)*100:.2f}%")
    print(f"  Probabilities:")
    for cultivar_id, prob in enumerate(prediction_proba, 1):
        print(f"    - {cultivar_names[cultivar_id]}: {prob*100:.2f}%")


EXAMPLE: Making Predictions on New Wine Samples

Predicting cultivar for new wine samples...
----------------------------------------------------------------------

Sample 1
  Features: {'alcohol': 14.23, 'malic_acid': 1.71, 'total_phenols': 2.8, 'flavanoids': 3.06, 'color_intensity': 5.64, 'proline': 1065}
  Predicted Cultivar: Cultivar 1
  Confidence: 94.00%
  Probabilities:
    - Cultivar 1: 94.00%
    - Cultivar 2: 2.00%
    - Cultivar 3: 4.00%

Sample 2
  Features: {'alcohol': 13.2, 'malic_acid': 1.78, 'total_phenols': 2.65, 'flavanoids': 2.76, 'color_intensity': 4.38, 'proline': 1050}
  Predicted Cultivar: Cultivar 1
  Confidence: 94.00%
  Probabilities:
    - Cultivar 1: 94.00%
    - Cultivar 2: 2.00%
    - Cultivar 3: 4.00%

Sample 3
  Features: {'alcohol': 12.37, 'malic_acid': 1.21, 'total_phenols': 2.56, 'flavanoids': 2.67, 'color_intensity': 3.04, 'proline': 985}
  Predicted Cultivar: Cultivar 2
  Confidence: 68.00%
  Probabilities:
    - Cultivar 1: 25.00%
    - Cultivar 2

## Step 12: Load and Use Saved Model

Demonstrate how to load the saved model and scaler for future predictions without retraining.

In [15]:
# Load saved model (in a new session, you would load like this)
print("\n" + "=" * 70)
print("LOADING SAVED MODEL FOR FUTURE USE")
print("=" * 70)

# Load the saved model
with open('wine_cultivar_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Load the saved scaler
with open('wine_scaler.pkl', 'rb') as file:
    loaded_scaler = pickle.load(file)

print("\n✓ Model and scaler loaded successfully!")
print("  - Model type:", type(loaded_model).__name__)
print("  - Scaler type:", type(loaded_scaler).__name__)

# Test prediction with loaded model
test_sample = np.array([[14.23, 1.71, 2.8, 3.06, 5.64, 1065]])
test_scaled = loaded_scaler.transform(test_sample)
test_pred = loaded_model.predict(test_scaled)[0]
test_proba = loaded_model.predict_proba(test_scaled)[0]

print(f"\n✓ Test Prediction with Loaded Model:")
print(f"  - Predicted Cultivar: {cultivar_names[test_pred]}")
print(f"  - Confidence: {max(test_proba)*100:.2f}%")


LOADING SAVED MODEL FOR FUTURE USE

✓ Model and scaler loaded successfully!
  - Model type: RandomForestClassifier
  - Scaler type: StandardScaler

✓ Test Prediction with Loaded Model:
  - Predicted Cultivar: Cultivar 1
  - Confidence: 94.00%


## Project Completion

✅ **Wine Cultivar Origin Prediction System - SUCCESSFULLY COMPLETED**

### Key Results:
- **Test Accuracy**: 97.22%
- **Algorithm**: Random Forest Classifier
- **Features Used**: 6 out of 13 available features
- **Dataset Size**: 178 wine samples across 3 cultivar classes

### Deliverables:
1. ✅ Trained machine learning model (wine_cultivar_model.pkl)
2. ✅ Feature scaler (wine_scaler.pkl)
3. ✅ Comprehensive evaluation metrics
4. ✅ Feature importance analysis
5. ✅ Production-ready code

### Model Performance:
- Accuracy: 97.22%
- Precision (Weighted): 97.41%
- Recall (Weighted): 97.22%
- F1-Score (Weighted): 97.20%

The model is ready for deployment and can accurately predict wine cultivar origins based on chemical properties!