**Model Training**
   - **Multinomial Logistic Regression Forecasting Model**:
     - Split the data into training and testing sets.
     - Build and compile a neural network model for multinomial logistic regression using TensorFlow.
     - Train the model on the training set.
 **Evaluation and Scoring**
   - Evaluate the model on the test set and visualize the results.
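
The steps above can be sketched end to end. This is a minimal illustration only: it uses scikit-learn's `LogisticRegression` instead of TensorFlow to stay short, and it trains on synthetic data. The column names (`age`, `diabetes`, `smoking`), the score formula, and the class thresholds are all assumptions made for the example, not part of the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in features (hypothetical, not the survey data)
X = pd.DataFrame({
    "age": rng.integers(12, 90, size=n),
    "diabetes": rng.integers(0, 2, size=n),
    "smoking": rng.integers(0, 2, size=n),
})

# Synthetic three-class target loosely tied to age and diabetes
score = X["age"] / 30 + X["diabetes"] + rng.normal(0, 0.5, size=n)
y = pd.cut(
    score,
    bins=[-np.inf, 1.5, 3.0, np.inf],
    labels=["Normal vision", "Visual impairment", "Blindness"],
).astype(str)

# 1. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 2. Build and train a multinomial logistic regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 3. Evaluate: class probabilities for one test row, plus overall accuracy
proba = clf.predict_proba(X_test.iloc[[0]])[0]
for label, p in zip(clf.classes_, proba):
    print(f"{label}: {p:.2f}")
print("Test accuracy:", clf.score(X_test, y_test))
```

The per-class probabilities sum to 1, which is the shape of output shown in the Goal example.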
       
   
### Goal
The objective of this project is to develop a Multinomial Logistic Regression model that predicts a person's vision status (Normal Vision, Visual Impairment, Blindness) based on demographic and health-related factors.

**Example**:
#### **Input:**
| Age | Gender | RiskFactor (Diabetes) | RiskFactor (Smoking) | RiskFactor (Hypertension) |
|-----|--------|-----------------------|----------------------|---------------------------|
| 50  | Male   | Yes                   | No                   | Yes                       |

#### **Output (Vision Status Prediction)**:
| Vision Status     | Probability |
|-------------------|-------------|
| Normal vision     | 0.60        |
| Visual impairment | 0.25        |
| Blindness         | 0.15        |

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from IPython.display import display

### Load the Data

In [2]:
df = pd.read_csv('data/Vision_Survey_Cleaned.csv')


# Keep only the rows for the vision question the model uses.
vision_cat = ['Best-corrected visual acuity']
df.drop(df[~df['Question'].isin(vision_cat)].index, inplace=True)

# Drop the aggregate rows ("All participants" / "Total") so that only
# specific risk-factor groups remain.
df.drop(df[df["RiskFactor"] == "All participants"].index, inplace=True)
df.drop(df[df["RiskFactorResponse"] == "Total"].index, inplace=True)
print("First few rows of the dataset:")
display(df.head())

First few rows of the dataset:


|   | Category | Question | Response | Age | Gender | RaceEthnicity | RiskFactor | RiskFactorResponse | Sample_Size |
|---|----------|----------|----------|-----|--------|---------------|------------|--------------------|-------------|
| 0 | Measured Visual Acuity | Best-corrected visual acuity | Visual impairment | 40-64 years | All genders | Other | Smoking | Yes | 155 |
| 1 | Measured Visual Acuity | Best-corrected visual acuity | US-defined blindness | 12-17 years | Male | Other | Diabetes | No | 54 |
| 3 | Measured Visual Acuity | Best-corrected visual acuity | Any vision loss | 18-39 years | Female | All races | Smoking | Yes | 1511 |
| 4 | Measured Visual Acuity | Best-corrected visual acuity | Any vision loss | 40-64 years | All genders | Other | Diabetes | No | 130 |
| 5 | Measured Visual Acuity | Best-corrected visual acuity | Monocular vision loss | 65-79 years | All genders | Other | Diabetes | Yes | 37 |
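
The same row filter can also be written as a single boolean mask, which avoids repeated in-place drops. The following is a minimal sketch on a toy frame; the rows are made up, and only the column names and filter values follow the notebook.

```python
import pandas as pd

# Toy rows (hypothetical) using the notebook's column names
toy = pd.DataFrame({
    "Question": [
        "Best-corrected visual acuity",
        "Best-corrected visual acuity",
        "Best-corrected visual acuity",
        "Some other question",
    ],
    "RiskFactor": ["Smoking", "All participants", "Diabetes", "Smoking"],
    "RiskFactorResponse": ["Yes", "Total", "Total", "No"],
})

# Combine the three conditions into one mask
mask = (
    toy["Question"].isin(["Best-corrected visual acuity"])
    & (toy["RiskFactor"] != "All participants")
    & (toy["RiskFactorResponse"] != "Total")
)

# .copy() gives an independent frame, so later column edits do not
# trigger chained-assignment warnings
filtered = toy[mask].copy()
print(filtered)
```

Only the first toy row satisfies all three conditions, so `filtered` holds a single row.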


### Generate Test and Train Datasets


In [3]:
# Define the feature set X from the demographic and risk-factor columns,
# and the target Y from the "Response" column (the vision status).

X = df[["Age", "RiskFactor", "RiskFactorResponse", "Gender", "RaceEthnicity"]]
Y = df["Response"]

sample_weights = df['Sample_Size'].values

categorical_mappings = {
    col: list(X[col].unique()) for col in X.columns if X[col].dtype == 'object'
}

### Convert Categorical Data to Numeric

In [4]:
# Convert categorical data to numeric using one-hot encoding.

X = pd.get_dummies(X)
print("Features after encoding:")
display(X.head())

Features after encoding:


|   | Age_12-17 years | Age_18-39 years | Age_40-64 years | Age_65-79 years | Age_80 years and older | RiskFactor_Diabetes | RiskFactor_Hypertension | RiskFactor_Smoking | RiskFactorResponse_No | RiskFactorResponse_Yes | Gender_All genders | Gender_Female | Gender_Male | RaceEthnicity_All races | RaceEthnicity_Black, non-Hispanic | RaceEthnicity_Hispanic, any race | RaceEthnicity_Other | RaceEthnicity_White, non-Hispanic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | True | False | False | False | False | True | False | True | True | False | False | False | False | False | True | False |
| 1 | True | False | False | False | False | True | False | False | True | False | False | False | True | False | False | False | True | False |
| 3 | False | True | False | False | False | False | False | True | False | True | False | True | False | True | False | False | False | False |
| 4 | False | False | True | False | False | True | False | False | True | False | True | False | False | False | False | False | True | False |
| 5 | False | False | False | True | False | True | False | False | False | True | True | False | False | False | False | False | True | False |


### Split the Data into Training and Testing Sets

In [5]:
# Split the features, labels, and sample weights into training and testing sets.

X_train, X_test, Y_train, Y_test, weights_train, weights_test = train_test_split(
    X, Y, sample_weights, random_state=1, test_size=0.2
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (1383, 18)
Testing set shape: (346, 18)
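
One optional refinement is to pass `stratify` to `train_test_split`, which keeps the class proportions the same in both splits. Whether this helps on the survey data is not established here; the sketch below only demonstrates the mechanism on made-up labels (`X_toy`, `y_toy`, and `w_toy` are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with a 60/30/10 class balance
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(
    ["Normal vision"] * 60 + ["Visual impairment"] * 30 + ["Blindness"] * 10
)
w_toy = np.ones(100)

# stratify=y_toy preserves the class balance in the train and test splits;
# extra arrays such as sample weights are split along the same indices
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X_toy, y_toy, w_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_te.value_counts())
```

With 100 rows and `test_size=0.2`, the test split keeps the 60/30/10 ratio as 12, 6, and 2 rows.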


### Train and Evaluate the Model

In [6]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier


# Grid search with cross-validation to find the best hyperparameters for the RandomForestClassifier.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestClassifier
rf = RandomForestClassifier()

# Grid search with cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search model with sample weights
grid_search.fit(X_train, Y_train, sample_weight=weights_train)

# Retrieve the best model and evaluate it
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Calculate weighted accuracy
weighted_accuracy = accuracy_score(Y_test, y_pred, sample_weight=weights_test)

# Output results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Weighted Accuracy: {weighted_accuracy * 100:.2f}%")
print("\nClassification Report (not weighted):")
print(classification_report(Y_test, y_pred))

print("\nClassification Report (weighted):")
print(classification_report(Y_test, y_pred, sample_weight=weights_test))


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 300}
Weighted Accuracy: 0.09%

Classification Report (not weighted):
                       precision    recall  f1-score   support

      Any vision loss       0.00      0.00      0.00        76
Monocular vision loss       0.02      0.04      0.03        78
        Normal vision       0.00      0.00      0.00        65
 US-defined blindness       0.00      0.00      0.00        56
    Visual impairment       0.00      0.00      0.00        71

             accuracy                           0.01       346
            macro avg       0.00      0.01      0.01       346
         weighted avg       0.01      0.01      0.01       346


Classification Report (weighted):
                       precision    recall  f1-score   support

      Any vision loss       0.00      0.00      0.00   54002.0
Monocular vision loss       0.00    
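
Scores this close to zero are worth a sanity check on the data layout. One possible cause (an assumption, not verified against the survey file) is that each feature combination appears once per `Response` value. In that case the features alone cannot separate the classes, and a held-out row always carries a label the model never saw for that combination. The check below counts distinct labels per feature combination on a made-up frame; `toy` and `feature_cols` are hypothetical.

```python
import pandas as pd

# Made-up aggregated rows: every (Age, Gender) group has one row per label
toy = pd.DataFrame({
    "Age": ["40-64 years"] * 3 + ["65-79 years"] * 3,
    "Gender": ["Male"] * 3 + ["Female"] * 3,
    "Response": ["Normal vision", "Visual impairment", "US-defined blindness"] * 2,
})
feature_cols = ["Age", "Gender"]

# How many different labels share the same feature combination?
labels_per_combo = toy.groupby(feature_cols)["Response"].nunique()

# How many rows exist for each (feature combination, label) pair?
rows_per_combo_label = toy.groupby(feature_cols + ["Response"]).size()

print(labels_per_combo)
print("Max rows per (combination, label):", rows_per_combo_label.max())
```

If the real data shows several labels per combination and at most one row per (combination, label) pair, row-level accuracy is not a meaningful metric for it.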

## Save the Model

In [7]:
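import os
import joblib

def save_model(model, path='models/vision_status_model.pkl'):
    """
    Save the model to disk together with the feature columns it was trained on.

    :param model: Fitted estimator to persist.
    :param path: Destination file path.
    :return: None
    """
    model_data = {
        'model': model,
        'feature_columns': X.columns.tolist(),
    }

    # Create the target directory if it does not exist yet
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Save the bundle to disk
    joblib.dump(model_data, path)

save_model(best_rf)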
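
Storing the feature columns next to the estimator lets the prediction code rebuild the exact training layout later. The round trip can be sketched as follows; the toy model, the column names `f0` and `f1`, and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model (hypothetical data)
X_toy = [[0, 1], [1, 0], [0, 0], [1, 1]]
y_toy = ["a", "b", "a", "b"]
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")

    # Save the estimator and its feature columns as one dictionary
    joblib.dump({"model": model, "feature_columns": ["f0", "f1"]}, path)

    # Load the dictionary back; both parts come back together
    bundle = joblib.load(path)

print(bundle["feature_columns"])
print(bundle["model"].classes_)
```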


## Write method to predict vision status

In [8]:
def predict_vision_status(input_data, model_path='models/vision_status_model.pkl'):
    """
    Make predictions using the saved model.

    Args:
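        input_data (dict): Raw feature values keyed by the training column names
            ("Age", "Gender", "RiskFactor", "RiskFactorResponse", "RaceEthnicity").
        model_path (str): Path to the saved model.

    Returns:
        dict: {'predictions': [{'status': ..., 'probability': ...}, ...]},
            with one entry per vision status class.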
    """
    try:
        # Load the saved bundle: a dict holding the fitted model and its feature columns
        model = joblib.load(model_path)

        # Feature columns recorded at training time
        feature_columns = model['feature_columns']
        
        # Convert input data to DataFrame
        input_df = pd.DataFrame([input_data])

        # Encode categorical variables
        input_encoded = pd.get_dummies(input_df)

        # Ensure all expected feature columns are present
        for col in feature_columns:
            if col not in input_encoded.columns:
                input_encoded[col] = 0

        # Reorder columns to match the model's training format
        input_encoded = input_encoded[feature_columns]

        # Predict probabilities
        probabilities = model['model'].predict_proba(input_encoded)[0]

        # Create results dictionary with readable format
        results = {
            'predictions': [
                {'status': status, 'probability': float(prob)}
                for status, prob in zip(model['model'].classes_, probabilities)
            ]
        }
        
        return results

    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise ValueError(f"Error making prediction: {e}") from e
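
Because the function keeps only the columns listed in `feature_columns`, any input key that does not encode to a known column is silently ignored. A minimal sketch of an alignment helper that also reports such keys is shown below; `encode_row` and the shortened `feature_columns` list are hypothetical and not part of the notebook.

```python
import pandas as pd

# Shortened, hypothetical subset of the training columns
feature_columns = [
    "Age_18-39 years", "Age_40-64 years",
    "Gender_Female", "Gender_Male",
    "RiskFactor_Diabetes", "RiskFactor_Smoking",
    "RiskFactorResponse_No", "RiskFactorResponse_Yes",
]

def encode_row(row, feature_columns):
    """One-hot encode a single input dict and align it to the training columns."""
    encoded = pd.get_dummies(pd.DataFrame([row]))
    # Columns produced by the input that the model never saw
    unknown = sorted(set(encoded.columns) - set(feature_columns))
    # Add missing columns as False and drop unknown ones in one step
    aligned = encoded.reindex(columns=feature_columns, fill_value=False)
    return aligned, unknown

# Keys match the raw training column names
good, unknown_good = encode_row(
    {"Age": "18-39 years", "Gender": "Male",
     "RiskFactor": "Diabetes", "RiskFactorResponse": "Yes"},
    feature_columns,
)

# Key "RiskFactor_Diabetes" encodes to "RiskFactor_Diabetes_Yes", which is unknown
bad, unknown_bad = encode_row(
    {"Age": "18-39 years", "Gender": "Male", "RiskFactor_Diabetes": "Yes"},
    feature_columns,
)

print(unknown_good)
print(unknown_bad)
```

Returning the unknown columns makes it possible to warn or raise instead of predicting from a partially empty feature vector.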


### Make a Test Method


In [9]:
def test_prediction():
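    # Sample input. Note: keys such as "RiskFactor_Diabetes" do not match the
    # training columns ("RiskFactor", "RiskFactorResponse"). After one-hot
    # encoding they become e.g. "RiskFactor_Diabetes_Yes", which
    # predict_vision_status discards, so only "Age" and "Gender" influence
    # this prediction.
    sample_input = {
        "Age": "18-39 years",
        "Gender": "Male",
        "RiskFactor_Diabetes": "Yes",
        "RiskFactor_Smoking": "Yes",
        "RiskFactor_Hypertension": "Yes",
    }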
    
    # Make prediction
    try:
        predictions = predict_vision_status(sample_input)
        display(predictions)
        print("\nVision Status Predictions:")
        for pred in predictions['predictions']:
            print(f"{pred['status']}: {pred['probability']:.2%}")
        return True
    except Exception as e:
        print(f"{str(e)}")
        return False

## Evaluate custom input


In [10]:
test_prediction()

{'predictions': [{'status': 'Any vision loss',
   'probability': 0.1835977164029752},
  {'status': 'Monocular vision loss', 'probability': 0.22090991267810683},
  {'status': 'Normal vision', 'probability': 0.1772005680239101},
  {'status': 'US-defined blindness', 'probability': 0.2006687300413883},
  {'status': 'Visual impairment', 'probability': 0.21762307285361931}]}


Vision Status Predictions:
Any vision loss: 18.36%
Monocular vision loss: 22.09%
Normal vision: 17.72%
US-defined blindness: 20.07%
Visual impairment: 21.76%


True

