# Multinomial Logistic Regression Forecasting Model

### Steps:
1. **Dataset Information Gathering**
   - Explore the dataset and understand the available features.
   - Identify potential target and predictor variables for the model.
   - Check for missing values, data types, and overall structure.

2. **Data Cleaning**
   - Handle missing values.
   - Encode categorical variables into numeric format where necessary (e.g., gender, race).
   - Normalize or scale features if required.

3. **Data Visualization**
   - Visualize relationships between predictors and the target variable.
   - Use histograms, box plots, and scatter plots to understand distributions.
   - Use correlation heatmaps to check for multicollinearity.

4. **TensorFlow Model Training**
   - **Multinomial Logistic Regression Forecasting Model**:
     - Split the data into training and testing sets.
     - Build and compile a neural network model for multinomial logistic regression using TensorFlow.
     - Train the model on the training set.
     - Evaluate the model on the test set and visualize the results.
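
The training step can be sketched as a single softmax layer, which is what multinomial logistic regression amounts to in Keras. This is only a minimal sketch on synthetic data, not the project's actual model: the feature count, the random features and labels, the 80/20 split, and the hyperparameters are all placeholder assumptions.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data (assumption): 300 samples, 5 numeric features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)).astype("float32")
y = rng.integers(0, 3, size=300)

# Simple 80/20 split into training and testing sets.
split = 240
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# One Dense layer with softmax output = multinomial logistic regression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # integer class labels
    metrics=["accuracy"],
)

model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)

# Class probabilities for one test sample; each row sums to 1.
probs = model.predict(X_test[:1], verbose=0)
print(probs)
```

Because the labels here are random, the reported accuracy is meaningless; the point is only the shape of the workflow (split, build, compile, fit, evaluate).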
   
### Goal
The objective of this project is to develop a Multinomial Logistic Regression model that predicts a person's vision status (Normal Vision, Visual Impairment, Blindness) based on demographic and health-related factors.

**Example**:
#### **Input:**
| Age | Gender | RiskFactor (Diabetes) | RiskFactor (Smoking) | RiskFactor (Hypertension) |
|-----|--------|-----------------------|----------------------|---------------------------|
| 50  | Male   | Yes                   | No                   | Yes                       |
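
Before a row like this can reach the model, its categorical answers have to become numbers. The following is a small sketch of one way to do that with pandas; the column names and the 1/0 coding for gender are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical column names mirroring the example input row.
row = pd.DataFrame([{
    "Age": 50,
    "Gender": "Male",
    "Diabetes": "Yes",
    "Smoking": "No",
    "Hypertension": "Yes",
}])

encoded = row.copy()

# Map Yes/No risk-factor answers to 1/0.
for col in ["Diabetes", "Smoking", "Hypertension"]:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

# Binary coding for gender, assumed for this illustration only.
encoded["Gender"] = encoded["Gender"].map({"Male": 1, "Female": 0})

# Numeric feature matrix, one row per person.
features = encoded.to_numpy(dtype="float32")
print(features)
```

With more than two categories (for example race/ethnicity), one-hot encoding via `pd.get_dummies` would usually be a safer choice than integer codes, since it avoids implying an order.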

#### **Output (Vision Status Prediction)**:
| Vision Status     | Probability |
|-------------------|-------------|
| Normal vision     | 0.60        |
| Visual impairment | 0.25        |
| Blindness         | 0.15        |
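
The three probabilities sum to 1 because a multinomial logistic regression passes its raw class scores through the softmax function. A minimal sketch of that step follows; the scores in `logits` are made up for illustration and are not model output.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Normal vision", "Visual impairment", "Blindness"]
logits = [1.2, 0.3, -0.2]  # made-up scores, one per class

probs = softmax(logits)
predicted = classes[probs.index(max(probs))]

for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")
print("Predicted:", predicted)
```

The predicted class is simply the one with the highest probability.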


In [1]:
import pandas as pd

df = pd.read_csv("data/National_Health_and_Nutrition_Examination_Survey_Vision_and_Eye_Health_Surveillance.csv", low_memory=True)

### Rough Overview of the Data

In [52]:
df.describe()

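|       | YearStart   | YearEnd | Data_Value | Low_Confidence_limit | High_Confidence_Limit | Numerator | Sample_Size | LocationID | DataValueTypeID | GeoLocation | Geographic Level |
|-------|-------------|---------|------------|----------------------|-----------------------|-----------|-------------|------------|-----------------|-------------|------------------|
| count | 10320.0     | 10320.0 | 6328.0     | 6328.0               | 6328.0                | 0.0       | 8639.0      | 10320.0    | 0.0             | 0.0         | 0.0              |
| mean  | 2001.162791 | 2008.0  | 23.670702  | 20.277276            | 27.117668             | NaN       | 1236.882278 | 59.0       | NaN             | NaN         | NaN              |
| std   | 2.880952    | 0.0     | 31.245569  | 30.055377            | 32.022886             | NaN       | 2400.059894 | 0.0        | NaN             | NaN         | NaN              |
| min   | 1999.0      | 2008.0  | 0.0        | 0.0                  | 0.0                   | NaN       | 30.0        | 59.0       | NaN             | NaN         | NaN              |
| 25%   | 1999.0      | 2008.0  | 3.4        | 2.2                  | 4.8                   | NaN       | 155.0       | 59.0       | NaN             | NaN         | NaN              |
| 50%   | 1999.0      | 2008.0  | 10.8       | 7.7                  | 14.1                  | NaN       | 453.0       | 59.0       | NaN             | NaN         | NaN              |
| 75%   | 2005.0      | 2008.0  | 25.0       | 17.7                 | 33.1                  | NaN       | 1265.5      | 59.0       | NaN             | NaN         | NaN              |
| max   | 2005.0      | 2008.0  | 100.0      | 99.3                 | 100.0                 | NaN       | 35090.0     | 59.0       | NaN             | NaN         | NaN              |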


### List of All Columns and Their Data Types

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10320 entries, 0 to 10319
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   YearStart                   10320 non-null  int64  
 1   YearEnd                     10320 non-null  int64  
 2   LocationAbbr                10320 non-null  object 
 3   LocationDesc                10320 non-null  object 
 4   DataSource                  10320 non-null  object 
 5   Topic                       10320 non-null  object 
 6   Category                    10320 non-null  object 
 7   Question                    10320 non-null  object 
 8   Response                    10320 non-null  object 
 9   Age                         10320 non-null  object 
 10  Gender                      10320 non-null  object 
 11  RaceEthnicity               10320 non-null  object 
 12  RiskFactor                  10320 non-null  object 
 13  RiskFactorResponse          103

### List of First 5 Rows

In [54]:
df.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Category,Question,Response,Age,Gender,RaceEthnicity,RiskFactor,RiskFactorResponse,Data_Value_Unit,Data_Value_Type,Data_Value,Data_Value_Footnote_Symbol,Data_Value_Footnote,Low_Confidence_limit,High_Confidence_Limit,Numerator,Sample_Size,LocationID,TopicID,CategoryID,QuestionID,ResponseID,DataValueTypeID,AgeID,GenderID,RaceEthnicityID,RiskFactorID,RiskFactorResponseID,GeoLocation,Geographic Level
0,2005,2008,US,National,NHANES,Visual Function,Blind or Difficulty Seeing,Percentage of people blind in both eyes,Yes,All ages,Female,All races,Diabetes,No,%,Crude Prevalence,0.0,,,0.0,0.0,,5800.0,59,TVFUNC,CBLIND,NHBL,RYES,,AGEALL,GF,ALLRACE,RFDM,RFNO,,
1,1999,2008,US,National,NHANES,Visual Function,Measured Visual Acuity,Best-corrected visual acuity,Visual impairment,40-64 years,All genders,Other,Smoking,Yes,%,Crude Prevalence,0.0,,,0.0,0.0,,155.0,59,TVFUNC,CVISAC,QVISA,RVIMP,,AGE4064,GALL,OTH,RFSM,RFYES,,
2,1999,2008,US,National,NHANES,Visual Function,Measured Visual Acuity,Best-corrected visual acuity,US-defined blindness,12-17 years,Male,Other,Diabetes,No,%,Crude Prevalence,0.0,,,0.0,0.0,,54.0,59,TVFUNC,CVISAC,QVISA,RVUSB,,AGE1217,GM,OTH,RFDM,RFNO,,
3,2005,2008,US,National,NHANES,Visual Function,Blind or Difficulty Seeing,Percentage of people blind in both eyes,Yes,All ages,Female,Other,Diabetes,Yes,%,Crude Prevalence,0.0,,,0.0,0.0,,32.0,59,TVFUNC,CBLIND,NHBL,RYES,,AGEALL,GF,OTH,RFDM,RFYES,,
4,2005,2008,US,National,NHANES,Visual Function,Blind or Difficulty Seeing,Percentage of people blind in both eyes,Yes,18-39 years,Female,"Black, non-Hispanic",Diabetes,No,%,Crude Prevalence,0.0,,,0.0,0.0,,476.0,59,TVFUNC,CBLIND,NHBL,RYES,,AGE1839,GF,BLK,RFDM,RFNO,,


### Ratio of Null to Non-Null Values

In [57]:
round(df.isnull().sum().sum() / df.count().sum(), 3)

0.222
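
Note that `df.count()` counts only non-null cells, so the expression above divides the number of nulls by the number of non-nulls. That is a ratio, not the share of all cells that are null. The sketch below shows the difference on a small made-up DataFrame (`toy` is a placeholder, not the survey data).

```python
import numpy as np
import pandas as pd

# Made-up data: 8 cells in total, 3 of them null.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
})

n_null = int(toy.isnull().sum().sum())   # null cells
n_non_null = int(toy.count().sum())      # non-null cells

ratio = n_null / n_non_null              # nulls relative to non-nulls
fraction = n_null / toy.size             # nulls relative to all cells

# Share of nulls per column, often more useful for deciding what to drop.
per_column = toy.isnull().mean()

print(ratio, fraction)
print(per_column)
```

Dividing by `df.size` (rows times columns) gives the fraction of all values that are null, and `df.isnull().mean()` breaks that down by column.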

### Check for Duplicate Rows

In [56]:
df.duplicated().sum()

0

### Data Cleaning 

**Columns to remove before fitting the Multinomial Logistic Regression model to predict vision status**
- YearStart
- YearEnd
- LocationDesc
- DataSource
- Topic
- Data_Value_Unit
- DataValueTypeID
- GeoLocation
- Low_Confidence_limit
- High_Confidence_Limit
- Sample_Size
- Data_Value_Footnote_Symbol
- Data_Value_Footnote
- ResponseID
- QuestionID


In [2]:
# Column names must match the DataFrame exactly (compare the df.head() header).
# 'TopicType', 'Data_Value_Alt' and 'StratificationID' do not exist in this
# dataset, and the confidence-limit columns are spelled with underscores.
df.drop(columns=['YearStart', 'YearEnd', 'LocationDesc', 'DataSource', 'Topic',
                 'Data_Value_Unit', 'DataValueTypeID', 'GeoLocation',
                 'Low_Confidence_limit', 'High_Confidence_Limit', 'Sample_Size',
                 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote', 'ResponseID',
                 'QuestionID'], inplace=True)
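
`DataFrame.drop` raises a `KeyError` as soon as one requested column is missing, so it can help to compare the drop list against `df.columns` first. Here is a small sketch of that check on a made-up DataFrame; `toy` and `to_drop` are placeholders rather than the real data.

```python
import pandas as pd

# Made-up frame with a few of the real column names.
toy = pd.DataFrame({"YearStart": [1999], "YearEnd": [2008], "Age": ["All ages"]})

# Drop list containing a misspelled name and a name that does not exist.
to_drop = ["YearStart", "Year End", "StratificationID"]

# Report which requested names are absent before dropping anything.
missing = [c for c in to_drop if c not in toy.columns]
present = [c for c in to_drop if c in toy.columns]
print("Not in DataFrame:", missing)

# Drop only the columns that actually exist.
cleaned = toy.drop(columns=present)

# Alternative: let pandas skip unknown names silently.
cleaned_alt = toy.drop(columns=to_drop, errors="ignore")

print(list(cleaned.columns))
```

Printing the missing names is usually preferable to `errors="ignore"` alone, because a silent skip can hide a typo such as `Year End` and leave the intended column (`YearEnd`) in place.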