<p style="color:rgb(0, 0, 255); line-height: 1.6;"><b>This dataset relates to Dry Eye Disease analysis. The goal is to predict or analyze the factors that influence Dry Eye conditions in a given population. The dataset includes patient-related features, test results, and environmental factors that may contribute to Dry Eye Disease.</b></p>

<p>
<b>Gender</b> – Categorical (e.g., Male, Female)

<b>Age</b> – Numeric (Integer)

<b>Sleep duration</b> – Numeric (Float, likely in hours)

<b>Sleep quality</b> – Numeric (Integer scale, e.g., 1–10)

<b>Stress level</b> – Numeric (Integer scale)

<b>Blood pressure</b> – Categorical (e.g., "120/80")

<b>Heart rate</b> – Numeric (Integer, beats per minute)

<b>Daily steps</b> – Numeric (Integer)

<b>Physical activity</b> – Numeric (Integer, possibly minutes/day or activity level)

<b>Height</b> – Numeric (Integer, likely in cm)

<b>Weight</b> – Numeric (Integer, likely in kg)

<b>Sleep disorder</b> – Categorical (e.g., Yes/No or disorder types)

<b>Wake up during night</b> – Categorical (Yes/No)

<b>Feel sleepy during day</b> – Categorical (Yes/No)

<b>Caffeine consumption</b> – Categorical (e.g., Yes/No or levels)

<b>Alcohol consumption</b> – Categorical

<b>Smoking</b> – Categorical

<b>Medical issue</b> – Categorical (e.g., Yes/No)

<b>Ongoing medication</b> – Categorical

<b>Smart device before bed</b> – Categorical (e.g., Yes/No)

<b>Average screen time</b> – Numeric (Float, likely in hours/day)

<b>Blue-light filter</b> – Categorical (Yes/No)

<b>Discomfort Eye-strain</b> – Categorical (Yes/No)

<b>Redness in eye</b> – Categorical

<b>Itchiness/Irritation in eye</b> – Categorical

<b>Dry Eye Disease</b> – Categorical (Target variable: Yes/No or diagnosis)
</p>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import accuracy_score,recall_score,f1_score,precision_score,roc_auc_score,classification_report,roc_curve,confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.tree import DecisionTreeClassifier
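from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
# The following names are used later in the notebook (scaling, splitting, tuning, logistic regression)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler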
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df=pd.read_csv('Dry_Eye_Dataset.csv')

In [None]:
df.head()

In [None]:
df.info()

# Checking for duplicate values

In [None]:
df.duplicated().sum()

<p style="color:rgb(0, 0, 255);">No missing values found</p>

# Checking for missing values

In [None]:
# Missing Values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
print("\nMissing Value Percentage:")
print(missing_percentage)

# Statistical Summary

In [None]:
df.describe()

<p>1. <b>Age:</b> This is a relatively young to middle-aged population. The narrow age range may limit generalizability to older adults or children.</p>
<p>2. <b>Sleep duration:</b> On average, participants meet the recommended sleep duration (7–9 hrs). However, a significant portion sleeps below that threshold (min = 4 hrs), which may affect health and screen-time-related disorders.</p>
<p>3. <b>Sleep quality:</b> The distribution is centered on moderate sleep quality. There is noticeable variance, suggesting some individuals regularly experience poor sleep.</p>
<p>4. <b>Stress level:</b> Stress levels average around medium, with enough spread to detect differences among individuals. This could be a critical feature when analyzing lifestyle-related conditions.</p>
<p>5. <b>Heart rate:</b> All values are within the normal resting heart rate range.</p>
<p>6. <b>Daily steps:</b> Quite active on average; the general guideline is 10,000 steps/day. High variance suggests significant lifestyle differences (sedentary vs. active users).</p>
<p>7. <b>Physical activity:</b> On average, people meet the recommended 30 mins/day. But some are completely inactive (0 minutes), a possible risk factor for lifestyle diseases.</p>
<p>8. <b>Height:</b> Represents a fairly typical adult height distribution.</p>
<p>9. <b>Weight:</b> Standard weight range for adults.</p>
<p>10. <b>Average Screen Time:</b> High screen time on average. This may correlate strongly with eye strain, dry eye disease (DED), and sleep quality issues. A key feature.</p>


# Checking for anomalies

In [None]:
for i in df.select_dtypes(include='object').columns:
    print(f'{i}\n',df[i].unique())

<b>The blood pressure column needs to be split into two numeric columns because it was wrongly read as categorical.</b>

In [None]:
# we will Split 'Blood pressure' into 'Systolic' and 'Diastolic'
df[['Systolic BP', 'Diastolic BP']] = df['Blood pressure'].str.split('/', expand=True)

# Convert the new columns to numeric
df['Systolic BP'] = pd.to_numeric(df['Systolic BP'], errors='coerce')
df['Diastolic BP'] = pd.to_numeric(df['Diastolic BP'], errors='coerce')

# Drop the original 'Blood pressure' column
df = df.drop(columns=['Blood pressure'])

# Checking for outliers

In [None]:
# checking for outliers
numeric=df.select_dtypes(include=np.number).columns
plt.figure(figsize=(12,18))
t=1
for i in numeric:
    plt.subplot(6,2,t)
    sns.boxplot(df[i])
    plt.title(f'Boxplot of {i}')
    t+=1
plt.tight_layout()
plt.show()

<b>
1. No severe outliers in most features.

2. Distributions appear fairly normal or slightly skewed in some cases.
</b>
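<p>The visual impression from the box plots can be cross-checked numerically. The following is a minimal sketch, assuming the usual 1.5 × IQR whisker rule; the helper name <code>iqr_outlier_counts</code> and the toy values are hypothetical and are not taken from the dataset.</p>

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    num = frame.select_dtypes(include=np.number)
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame: True where a value lies beyond either whisker
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return outside.sum()

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'Heart rate': [60, 65, 70, 75, 80, 85, 150],
    'Age': [18, 22, 27, 31, 36, 40, 45],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

<p>Calling the same helper on <code>df</code> would give one count per numeric column, which makes "no severe outliers" a checkable statement.</p>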

# Univariate Analysis

## numeric columns

In [None]:
numeric=df.select_dtypes(include=np.number).columns
plt.figure(figsize=(12,18))
t=1
for i in numeric:
    plt.subplot(7,2,t)
    sns.histplot(df[i])
    plt.title(f'Distribution of {i}')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    t+=1
plt.tight_layout()
plt.show()

<p><b>Age:</b> The age distribution appears roughly uniform between 18 and 45, but there is a notable absence of older adults (above 45 years).</p>
<p><b>Sleep duration:</b> Sleep duration is slightly right-skewed: many participants sleep around 6–8 hours, which is ideal, but a subset sleeps fewer than 6 hours.</p>
<p><b>Sleep quality:</b> Sleep quality shows an even spread, indicating variability; low sleep quality is a known DED risk factor.</p>
<p><b>Stress level:</b> Spread across the full 1–5 scale. Some users report high stress.</p>
<p><b>Average Screen Time:</b> Nearly uniform, with many individuals having more than 6 hours/day of screen time. High screen exposure reduces blink rate, a direct trigger for Dry Eye Disease, so this is likely a strong predictive feature.</p>
<p><b>Heart rate:</b> Uniform, but some individuals are on the higher side (90–100 bpm).</p>

### checking skewness of numeric variables

In [None]:
for i in numeric:
    print(f'Skewness of {i} :',df[i].skew())

## Categoric columns

In [None]:
categoric=df.select_dtypes(include='object').columns
for i in categoric:
    print(df[i].value_counts(normalize=True)*100)

In [None]:


plt.figure(figsize=(12, 18))
t = 1

for i in categoric:
    plt.subplot(10, 2, t)
    df[i].value_counts(normalize=True).plot(kind='bar')
    plt.title(f'Distribution of {i}')
    plt.ylabel('')
    t += 1

plt.tight_layout()
plt.show()


<p>1. Almost all the categorical features are fairly evenly distributed across their levels.</p>
<p>2. <b>Dry Eye Disease: Yes: 65.2%, No: 34.8%</b><br>
This indicates a mild class imbalance, which might influence classification metrics such as accuracy.</p>
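<p>To see why this imbalance matters for accuracy, here is a small worked example. It uses only the class shares quoted above; the variable names are illustrative.</p>

```python
# Class shares reported in the note above (Yes: 65.2%, No: 34.8%)
class_share = {'Y': 0.652, 'N': 0.348}

# A trivial classifier that always predicts the majority class
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Such a classifier never predicts the minority class, so its recall there is zero
minority_recall = 0.0

print(f'Always predicting {majority_class!r} gives accuracy {baseline_accuracy:.3f}')
print(f'Recall on the minority class: {minority_recall:.1f}')
```

<p>Any model's accuracy is therefore best read against this 0.652 baseline rather than against 0.5, and precision, recall, and F1 are worth reporting alongside it.</p>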

# Bi-variate analysis

##### checking numeric features relationship with target

In [None]:
plt.figure(figsize=(12, 18))
t=1
for i in numeric:
    plt.subplot(7,2,t)
    sns.boxplot(y=df[i], x=df['Dry Eye Disease'])
    plt.title(f'{i} vs Dry Eye Disease')
    # x-axis holds the target classes, y-axis holds the numeric feature
    plt.xlabel('Dry Eye Disease')
    plt.ylabel(i)
    t+=1
plt.tight_layout()
plt.show()

<p>

<b>Age:</b>

Slight upward trend in DED frequency with increasing age. This indicates that older individuals are more prone to Dry Eye Disease.

<b>Sleep Duration:</b>

Moderate fluctuations, but overall, DED frequency seems slightly lower with longer sleep durations. Poor sleep may be linked to increased risk of DED.

<b>Heart Rate:</b>

No strong pattern observed, though slightly more DED cases are seen at lower and higher extremes, suggesting a possible impact of health/stress levels.

<b>Daily Steps:</b>

Slight downward trend: more physically active individuals (higher steps) seem to have fewer DED cases. Physical activity may help reduce risk.

<b>Physical Activity:</b>

High variability, but generally lower DED frequency at moderate-to-high activity levels. This reinforces that inactivity might correlate with DED.

<b>Height:</b>

No consistent relationship with DED observed. Likely not a significant predictor.

<b>Weight:</b>

Slight upward trend in DED frequency with increasing weight.

<b>Average Screen Time:</b>

Clear upward trend: more screen time strongly correlates with higher DED frequency. This indicates screen exposure is a major risk factor.

<b>Systolic BP:</b>

Moderate rise in DED frequency at higher systolic BP. High BP might be indirectly linked to DED via overall health conditions.

<b>Diastolic BP:</b>

Similar trend to systolic: DED increases with higher diastolic BP. This suggests a possible vascular or systemic health impact on eye health.</p>

#### categoric features relationship with target

In [None]:
plt.figure(figsize=(12, 24))
t = 1

for col in categoric:
    if col != 'Dry Eye Disease':
        ax = plt.subplot(8, 2, t)
        pd.crosstab(df[col], df['Dry Eye Disease'], normalize='index').plot(
            kind='bar', ax=ax, legend=False)
        ax.set_title(f'{col} vs Dry Eye Disease')
        ax.set_ylabel('Proportion')
        ax.set_xlabel(col)
        t += 1

plt.tight_layout()
plt.legend(['No', 'Yes'], title='Dry Eye Disease', bbox_to_anchor=(1.05, 4), loc='upper left')
plt.show()

<p><b>
Females, individuals with poor sleep quality, high stress, and sleep disorders are more likely to have Dry Eye Disease (DED).

Excessive screen time, especially before bed, and symptoms like eye strain, redness, and irritation are strong indicators of DED.

Use of blue-light filters and maintaining good sleep hygiene may help reduce the risk of DED.

Health factors such as medical issues, ongoing medication, and smoking also show moderate influence on DED presence.</b>
</p>

# Feature Engineering

In [None]:
df['Pulse_Pressure']=df['Systolic BP']-df['Diastolic BP']

In [None]:
sns.boxplot(x=df['Dry Eye Disease'],y=df['Pulse_Pressure'])
plt.show()

<b>Pulse Pressure does not show a strong univariate association with Dry Eye Disease.</b>

In [None]:
# adding one more feature 'BMI'

In [None]:
df['BMI']=df['Weight'] / (df['Height']/100)**2

In [None]:
sns.boxplot(x=df['Dry Eye Disease'],y=df['BMI'])
plt.show()

<p>Median BMI is almost identical for both DED and non-DED groups.<br>
Both groups have similar interquartile ranges (IQR), indicating comparable variability.<br>
There are slightly more high-end outliers in the DED group, but not significantly different.</p>

In [None]:
# Adding Blood pressure category based on systolic and diastolic metrics using data/facts from American Heart Association

In [None]:
def classify_bp(row):
    systolic = row['Systolic BP']
    diastolic = row['Diastolic BP']
    
    if systolic > 180 or diastolic > 120:
        return 'Hypertensive Crisis'
    elif systolic >= 140 or diastolic >= 90:
        return 'Hypertension Stage 2'
    elif systolic >= 130 or diastolic >= 80:
        return 'Hypertension Stage 1'
    elif systolic >= 120 and diastolic < 80:
        return 'Elevated'
    else:
        return 'Normal'
# Create new column
df['BP_category'] = df.apply(classify_bp, axis=1)

In [None]:
pd.crosstab(df['BP_category'], df['Dry Eye Disease'], normalize='index').plot(kind='bar',legend=False)
plt.title('BP_category vs Dry eye disease')

<p>The proportion of individuals with Dry Eye Disease (DED) appears to be slightly higher in those categorized under:<br>
<b>Hypertension Stage 1</b><br>
<b>Hypertension Stage 2</b></p>
<p>Even though the differences are not dramatic, the orange bars (indicating DED = "Yes")
show a small upward trend from the <b>Normal and Elevated BP</b> categories to <b>Stage 1 and Stage 2</b>.</p>
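<p>The row-wise <code>classify_bp</code> rules can also be written in vectorized form. The sketch below assumes the same thresholds as the function above; the name <code>classify_bp_vectorized</code> and the toy readings are hypothetical.</p>

```python
import numpy as np
import pandas as pd

def classify_bp_vectorized(systolic: pd.Series, diastolic: pd.Series) -> pd.Series:
    """Apply the same thresholds as classify_bp; the first matching condition wins."""
    conditions = [
        (systolic > 180) | (diastolic > 120),
        (systolic >= 140) | (diastolic >= 90),
        (systolic >= 130) | (diastolic >= 80),
        (systolic >= 120) & (diastolic < 80),
    ]
    labels = ['Hypertensive Crisis', 'Hypertension Stage 2',
              'Hypertension Stage 1', 'Elevated']
    # np.select evaluates the conditions in order, like an if/elif chain
    return pd.Series(np.select(conditions, labels, default='Normal'),
                     index=systolic.index)

# Hypothetical readings for illustration only
toy = pd.DataFrame({'Systolic BP': [110, 125, 132, 150, 190],
                    'Diastolic BP': [70, 75, 78, 85, 95]})
toy['BP_category'] = classify_bp_vectorized(toy['Systolic BP'], toy['Diastolic BP'])
print(toy)
```

<p>Because <code>np.select</code> checks conditions in order, the list must run from the most severe category to the least, mirroring the original <code>if</code>/<code>elif</code> chain.</p>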

## Note

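<p>We can't usefully categorize age because the age column covers only a narrow range (roughly 18 to 45, as seen in the univariate analysis), so there is too little data to form meaningful age groups.</p>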

In [None]:
# Categorizing sleep duration based on data/facts provided by WHO

In [None]:
def categorize_sleep(duration):
    if duration < 7:
        return 'Short'
    elif 7 <= duration <= 9:
        return 'Healthy'
    else:
        return 'Long'

df['Sleep_category'] = df['Sleep duration'].apply(categorize_sleep)

In [None]:
pd.crosstab(df['Sleep_category'], df['Dry Eye Disease'], normalize='index').plot(kind='bar',legend=False)

<p><b>There is no significant variation in the prevalence of Dry Eye Disease across different sleep duration categories.</b></p>

In [None]:
# Categorizing average screen time using data/facts from the National Institutes of Health

In [None]:
def categorize_screen_time(hours):
    if hours <= 2:
        return 'Low'
    elif hours <= 6:
        return 'Moderate'
    elif hours <= 9:
        return 'High'
    else:
        return 'Very High'

df['Screen_Time_Category'] = df['Average screen time'].apply(categorize_screen_time)

In [None]:
pd.crosstab(df['Screen_Time_Category'], df['Dry Eye Disease'], normalize='index').plot(kind='bar',legend=False)

<p>The "Very High" screen time category shows the highest proportion of individuals with Dry Eye Disease (DED).<br>
The "Low", "Moderate", and "High" categories all show a relatively lower and consistent proportion of DED cases.</p>
<p><b>This suggests a positive association between very high screen time (more than 9 hrs/day, as defined in the function above) and DED prevalence.</b></p>
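<p>The same binning can be expressed with <code>pd.cut</code>, which keeps the bin edges in one place. This is a sketch under the assumption that the edges match <code>categorize_screen_time</code> (right-inclusive at 2, 6, and 9 hours); the sample hours are made up.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical screen-time values in hours/day
hours = pd.Series([1.5, 2.0, 4.0, 6.0, 8.5, 9.0, 9.5])

# Right-inclusive bins: (-inf, 2], (2, 6], (6, 9], (9, inf)
bins = [-np.inf, 2, 6, 9, np.inf]
labels = ['Low', 'Moderate', 'High', 'Very High']
categories = pd.cut(hours, bins=bins, labels=labels, right=True)

print(pd.DataFrame({'hours': hours, 'category': categories}))
```

<p>The result is an ordered categorical, so bar plots and crosstabs list the levels from Low to Very High rather than alphabetically.</p>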

In [None]:
# Using a heat map to check correlation after adding new features

In [None]:
numeric=df.select_dtypes(include=np.number).columns
plt.figure(figsize=(15,15))
sns.heatmap(df[numeric].corr(),annot=True,cmap='viridis')

<p>
<b>Systolic BP shows a strong positive correlation with Pulse Pressure (0.85), meaning higher systolic pressure is associated with higher pulse pressure.

BMI has a strong positive correlation with Weight (0.75) and a strong negative correlation with Height (-0.64), which is expected given BMI's formula.

Diastolic BP is moderately negatively correlated with Pulse Pressure (-0.52), indicating an inverse relationship.

Most features like age, sleep duration, physical activity, screen time, etc. show very weak or negligible correlations with each other (values close to 0), suggesting they are largely independent in this dataset.</b>
</p>

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
df.head()

In [None]:
df.info()

# Checking the Relationship Between Categorical Features and Target (Dry Eye Disease)

In [None]:
categoric=df.select_dtypes(include='object').columns

In [None]:
from scipy.stats import chi2_contingency

for col in categoric:
    if col !='Dry Eye Disease':
        contingency_table = pd.crosstab(df[col], df['Dry Eye Disease'])
        chi2, p, dof, expected = chi2_contingency(contingency_table)
        print(f"Chi-squared test for {col}:")
        print(f"  p-value: {p}")
        if p < 0.05:
            print("  Conclusion: There is a statistically significant association between", col, "and Dry Eye Disease.")
        else:
            print("  Conclusion: There is no statistically significant association between", col, "and Dry Eye Disease.")
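<p>A p-value says whether an association is detectable, not how strong it is. One common companion statistic is Cramér's V, sketched below on two hypothetical contingency tables; the helper <code>cramers_v</code> is not part of the original notebook.</p>

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical tables: rows are a feature's levels, columns are the target classes
independent = pd.DataFrame([[30, 70], [30, 70]], index=['F', 'M'], columns=['N', 'Y'])
perfect = pd.DataFrame([[50, 0], [0, 50]], index=['F', 'M'], columns=['N', 'Y'])

v_independent = cramers_v(independent)
v_perfect = cramers_v(perfect)
print(v_independent, v_perfect)
```

<p>On the real data, the table would come from <code>pd.crosstab(df[col], df['Dry Eye Disease'])</code>, exactly as in the loop above.</p>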

# Checking relationship between numeric columns and Target(DED)

In [None]:
from scipy.stats import ttest_ind
numeric=df.select_dtypes(include=np.number).columns
for i in numeric:
    group1=df[df['Dry Eye Disease']=='Y'][i]
    group2=df[df['Dry Eye Disease']=='N'][i]
    t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
    print(f"ttest_ind for {i}:")
    print(f"  p-value: {p_val}")
    if p_val < 0.05:
        print("  Conclusion: There is a statistically significant association between", i, "and Dry Eye Disease.")
    else:
        print("  Conclusion: There is no statistically significant association between", i, "and Dry Eye Disease.")

#### Checking relationship of 'Discomfort Eye-strain' with others

In [None]:
categoric=df.select_dtypes(include='object').columns
for col in categoric:
    if col !='Discomfort Eye-strain':
        contingency_table = pd.crosstab(df[col], df['Discomfort Eye-strain'])
        chi2, p, dof, expected = chi2_contingency(contingency_table)
        print(f"Chi-squared test for {col}:")
        print(f"  p-value: {p}")
        if p < 0.05:
            print("  Conclusion: There is a statistically significant association between", col, "and Discomfort_Eye_strain.")
        else:
            print("  Conclusion: There is no statistically significant association between", col, "and Discomfort_Eye_strain.")

In [None]:
numeric=df.select_dtypes(include=np.number).columns
for i in numeric:
    group1=df[df['Discomfort Eye-strain']=='Y'][i]
    group2=df[df['Discomfort Eye-strain']=='N'][i]
    t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
    print(f"ttest_ind for {i}:")
    print(f"  p-value: {p_val}")
    if p_val < 0.05:
        print("  Conclusion: There is a statistically significant association between", i, "and Discomfort_Eye_strain.")
    else:
        print("  Conclusion: There is no statistically significant association between", i, "and Discomfort_Eye_strain.")

#### Checking relation between Redness_in_eye with other features

In [None]:
for col in categoric:
    if col !='Redness in eye':
        contingency_table = pd.crosstab(df[col], df['Redness in eye'])
        chi2, p, dof, expected = chi2_contingency(contingency_table)
        print(f"Chi-squared test for {col}:")
        print(f"  p-value: {p}")
        if p < 0.05:
            print("  Conclusion: There is a statistically significant association between", col, "and Redness_in_eye.")
        else:
            print("  Conclusion: There is no statistically significant association between", col, "and Redness_in_eye.")

In [None]:
for i in numeric:
    group1=df[df['Redness in eye']=='Y'][i]
    group2=df[df['Redness in eye']=='N'][i]
    t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
    print(f"ttest_ind for {i}:")
    print(f"  p-value: {p_val}")
    if p_val < 0.05:
        print("  Conclusion: There is a statistically significant association between", i, "and Redness_in_eye.")
    else:
        print("  Conclusion: There is no statistically significant association between", i, "and Redness_in_eye.")

#### Checking Itchiness_Irritation_in_eye relation with other features

In [None]:
for col in categoric:
    if col !='Itchiness/Irritation in eye':
        contingency_table = pd.crosstab(df[col], df['Itchiness/Irritation in eye'])
        chi2, p, dof, expected = chi2_contingency(contingency_table)
        print(f"Chi-squared test for {col}:")
        print(f"  p-value: {p}")
        if p < 0.05:
            print("  Conclusion: There is a statistically significant association between", col, "and Itchiness_Irritation_in_eye.")
        else:
            print("  Conclusion: There is no statistically significant association between", col, "and Itchiness_Irritation_in_eye.")

In [None]:
for i in numeric:
    # Group by the itchiness/irritation flag (this cell previously reused 'Redness in eye')
    group1=df[df['Itchiness/Irritation in eye']=='Y'][i]
    group2=df[df['Itchiness/Irritation in eye']=='N'][i]
    t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
    print(f"ttest_ind for {i}:")
    print(f"  p-value: {p_val}")
    if p_val < 0.05:
        print("  Conclusion: There is a statistically significant association between", i, "and Itchiness_Irritation_in_eye.")
    else:
        print("  Conclusion: There is no statistically significant association between", i, "and Itchiness_Irritation_in_eye.")

## Removing spaces and special characters from column names

In [None]:
df.columns = df.columns.str.strip()  # remove leading/trailing spaces
df.columns = df.columns.str.replace('[^0-9a-zA-Z]+', '_', regex=True)

In [None]:
df.columns

In [None]:
categoric=df.select_dtypes(include='object').columns
categoric

# Encoding the categoric features

In [None]:
#df=df.drop('symptom_severity',axis=1)

In [None]:
df1=df.copy()

In [None]:
df1=df1.drop(['Height','Weight'],axis=1)

In [None]:
cols=['Sleep_disorder', 'Wake_up_during_night',
       'Feel_sleepy_during_day', 'Caffeine_consumption', 'Alcohol_consumption',
       'Smoking', 'Medical_issue', 'Ongoing_medication',
       'Smart_device_before_bed', 'Blue_light_filter', 'Discomfort_Eye_strain',
       'Redness_in_eye', 'Itchiness_Irritation_in_eye', 'Dry_Eye_Disease']

In [None]:
# Map 'Y' to 1 and everything else to 0 (avoids DataFrame.applymap, which is deprecated in recent pandas)
df1[cols] = (df1[cols] == 'Y').astype(int)

In [None]:
df1 = pd.get_dummies(df1, columns=['BP_category', 'Sleep_category', 'Screen_Time_Category'], drop_first=True,dtype=int)

In [None]:
df1['Gender']=df1['Gender'].apply(lambda x:1 if x=='M' else 0)

In [None]:
numcol=['Age', 'Sleep_duration', 'Sleep_quality', 'Stress_level', 'Heart_rate','Daily_steps', 'Physical_activity',
        'Average_screen_time', 'Systolic_BP', 'Diastolic_BP', 'Pulse_Pressure','BMI']

In [None]:
ss=StandardScaler()
df1[numcol]=ss.fit_transform(df1[numcol])

In [None]:
df1.head(5)

In [None]:
df1.info()

In [None]:
#df1.to_csv('encoded_dry_eye.csv', index=False)

In [None]:
# train test split

In [None]:
X=df1.drop(['Dry_Eye_Disease'],axis=1)
y=df1['Dry_Eye_Disease']

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42,stratify=y)

In [None]:
print('Xtrain',X_train.shape)
print('Xtest',X_test.shape)
print('ytrain',y_train.shape)
print('ytest',y_test.shape)

In [None]:
print('y_train')
y_train.value_counts()

In [None]:
print('y_test')
y_test.value_counts()

# Defining functions

In [None]:
def metrics(y_test,y_pred,model):
    print(model)
    print('accuracy',accuracy_score(y_test,y_pred))
    print('precision',precision_score(y_test,y_pred))
    print('recall',recall_score(y_test,y_pred))
    print('f1 score',f1_score(y_test,y_pred))
    print('classification report\n',classification_report(y_test,y_pred))
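<p>For reference, the four scores printed by <code>metrics</code> follow directly from the confusion-matrix counts. The worked example below uses hypothetical counts, not results from this notebook.</p>

```python
# Hypothetical confusion-matrix counts (positive class = Dry Eye Disease present)
tp, fp, fn, tn = 80, 20, 10, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```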

In [None]:
def plot_roc_curve(y_true, y_probs, model):
    fpr, tpr, thresholds = roc_curve(y_true, y_probs)

    # Calculate AUC Score
    auc_score = roc_auc_score(y_true, y_probs)
    print(f'ROC-AUC Score for {model}: {auc_score:.2f}')

    # Plot
    plt.figure(figsize=(8,6))
    plt.plot(fpr, tpr, color='blue', label=f'{model} (AUC = {auc_score:.2f})')
    plt.plot([0, 1], [0, 1], color='red', linestyle='--') 
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {model}')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
def plot_confusion_matrix(y_test, y_pred):

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif_check(X,numcol):
    vif=pd.DataFrame()
    vif['feature']=X[numcol].columns
    vif['vifscore']=[variance_inflation_factor(X[numcol].values,i) for i in range(X[numcol].shape[1])]
    vif['vifscore']=round(vif['vifscore'],2)
    vif=vif.sort_values(by='vifscore',ascending=False)
    return vif
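<p>Because <code>Pulse_Pressure</code> is defined as systolic minus diastolic, the three blood pressure columns are exactly linearly dependent, which is the kind of problem VIF exposes. The sketch below computes VIF as 1 / (1 − R²) with plain NumPy on simulated data. It is an illustration only: unlike the statsmodels call above, it fits an intercept, and the helper <code>vif_scores</code> and the simulated values are hypothetical.</p>

```python
import numpy as np

def vif_scores(X) -> np.ndarray:
    """VIF per column: regress each column on the others and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept plus remaining columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        # A (numerically) perfect fit means the column is fully explained by the others
        scores.append(np.inf if r2 >= 1 - 1e-12 else 1.0 / (1.0 - r2))
    return np.array(scores)

# Simulated blood pressure values, for illustration only
rng = np.random.default_rng(0)
systolic = rng.normal(120, 10, 200)
diastolic = rng.normal(80, 8, 200)
pulse = systolic - diastolic

full_vif = vif_scores(np.column_stack([systolic, diastolic, pulse]))
reduced_vif = vif_scores(np.column_stack([systolic, pulse]))
print('with Diastolic BP   :', full_vif)
print('without Diastolic BP:', reduced_vif)
```

<p>Dropping any one of the three columns removes the exact dependence; the notebook drops <code>Diastolic_BP</code> further below.</p>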

In [None]:
def imp_feature(model):
    importance = model.feature_importances_
    features = X_train.columns
    
    feat_df = pd.DataFrame({'Feature': features, 'Importance': importance})
    feat_df = feat_df.sort_values(by='Importance', ascending=False)
    
    plt.figure(figsize=(10,6))
    plt.barh(feat_df['Feature'], feat_df['Importance'])
    plt.xlabel("Importance Score")
    plt.ylabel("Features")
    plt.title("Feature Importance")
    plt.gca().invert_yaxis()
    plt.show()

# Model building

### Logistic Regression

In [None]:
lr=LogisticRegression()
lr_model=lr.fit(X_train,y_train)


In [None]:
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

metrics(y_train, y_train_pred,'train_metrics')

In [None]:
metrics(y_test, y_test_pred,'test_metrics')

<p><b>The model performs well on both train and test data.</b></p>

In [None]:
lr=LogisticRegression()
lr_model=lr.fit(X_train,y_train)
y_pred=lr_model.predict(X_test)
y_pred_proba=lr_model.predict_proba(X_test)[:,1]

In [None]:
metrics(y_test,y_pred,'Logistic Regression')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
vif_check(X_train,numcol)

In [None]:
X=df1.drop(['Dry_Eye_Disease','Diastolic_BP'],axis=1)
y=df1['Dry_Eye_Disease']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
print('Xtrain',X_train.shape)
print('Xtest',X_test.shape)
print('ytrain',y_train.shape)
print('ytest',y_test.shape)

In [None]:
numcol1=['Age', 'Sleep_duration', 'Sleep_quality', 'Stress_level', 'Heart_rate','Daily_steps', 'Physical_activity',
        'Average_screen_time', 'Systolic_BP', 'Pulse_Pressure','BMI']
ss=StandardScaler()
X_train[numcol1]=ss.fit_transform(X_train[numcol1])
X_test[numcol1]=ss.transform(X_test[numcol1])

In [None]:
vif_check(X_train,numcol1)

In [None]:
lr=LogisticRegression()
lr_model=lr.fit(X_train,y_train)
train_pred=lr_model.predict(X_train)
y_pred=lr_model.predict(X_test)
y_pred_proba=lr_model.predict_proba(X_test)[:,-1]
y_pred_cust = (y_pred_proba >= 0.4).astype(int)

In [None]:
metrics(y_train,train_pred,'train accuracy')

In [None]:
metrics(y_test,y_pred,'test accuracy')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
plot_roc_curve(y_test,y_pred_proba,'lr-model')

In [None]:
metrics(y_test,y_pred_cust,'threshold 0.4')
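<p>Lowering the decision threshold from the default 0.5 to 0.4 trades precision for recall. The toy example below shows the mechanics; the labels, probabilities, and the helper <code>precision_recall_at</code> are hypothetical.</p>

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
proba = np.array([0.9, 0.45, 0.35, 0.42, 0.2, 0.6, 0.55, 0.48])

def precision_recall_at(y_true, proba, threshold):
    """Return (precision, recall) when predicting 1 for proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

p_default, r_default = precision_recall_at(y_true, proba, 0.5)
p_custom, r_custom = precision_recall_at(y_true, proba, 0.4)
print(f'threshold 0.5: precision={p_default:.2f} recall={r_default:.2f}')
print(f'threshold 0.4: precision={p_custom:.2f} recall={r_custom:.2f}')
```

<p>In a screening setting, where missing a true DED case is costlier than a false alarm, the lower threshold may be the better trade-off.</p>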

## Decision Tree

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
train_pred=dt_model.predict(X_train)
y_pred = dt_model.predict(X_test)
y_pred_prob = dt_model.predict_proba(X_test)[:, 1]

In [None]:
metrics(y_train,train_pred,'train metrics')

In [None]:
metrics(y_test,y_pred,'test metrics')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
# Tuned decision tree

In [None]:
param = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, 15, 20, None],
    'min_samples_split': [5, 10, 20, 50],
    'min_samples_leaf': [2, 4, 10]
}

In [None]:
dt_search=GridSearchCV(estimator=dt_model,param_grid=param,cv=5,n_jobs=-1,verbose=1)
model=dt_search.fit(X_train,y_train)
print(model.best_params_)
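<p>Instead of copying the printed parameters into a new estimator by hand, the refitted model is available as <code>best_estimator_</code>. The sketch below demonstrates this on synthetic data from <code>make_classification</code>; the data and the small grid are assumptions made for illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the dry eye dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [3, 5, None], 'min_samples_leaf': [2, 4]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already fitted on all of X_tr
best_tree = search.best_estimator_
test_pred = best_tree.predict(X_te)
print(search.best_params_)
```

<p>This avoids transcription mistakes and keeps the evaluated model in sync with whatever the search selects on a rerun.</p>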

In [None]:
dt_model = DecisionTreeClassifier(criterion='entropy',max_depth=3,min_samples_leaf=2,min_samples_split=5)
dt_model.fit(X_train, y_train)
train_pred=dt_model.predict(X_train)
y_pred = dt_model.predict(X_test)
y_pred_proba=dt_model.predict_proba(X_test)[:,1]

In [None]:
metrics(y_train,train_pred,'train prediction')

In [None]:
metrics(y_test,y_pred,'Test prediction')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
plot_roc_curve(y_test,y_pred_proba,'dt-model')

## Random forest

In [None]:
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)
train_pred=rf_model.predict(X_train)
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:,1]

In [None]:
metrics(y_train, train_pred, 'train_metrics')

In [None]:
metrics(y_test,y_pred,'test metrics')

In [None]:
plot_confusion_matrix(y_test,y_pred)

#### Tuning Random forest

In [None]:
param = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'class_weight': ['balanced']
}

In [None]:
random_search = GridSearchCV(
    estimator=rf,
    param_grid=param,
    cv=5,
    verbose=1,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
best_rf = random_search.best_estimator_
y_pred = best_rf.predict(X_test)
print("Best Params:", random_search.best_params_)

In [None]:
rf = RandomForestClassifier(class_weight='balanced',n_estimators=200,min_samples_split=5,random_state=42,max_depth=20)
rf_model = rf.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
metrics(y_test, y_pred_rf, 'Random Forest')

In [None]:
plot_confusion_matrix(y_test,y_pred_rf)

In [None]:
imp_feature(rf)

## AdaBoost

In [None]:
ada=AdaBoostClassifier(random_state=42)
ad_model=ada.fit(X_train,y_train)
y_pred=ad_model.predict(X_test)

In [None]:
metrics(y_test,y_pred,'ad_model')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
# Tuning AdaBoost

In [None]:
base_dt=DecisionTreeClassifier()
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1],
    'algorithm': ['SAMME'],
    'estimator': [DecisionTreeClassifier(max_depth=1),
                  DecisionTreeClassifier(max_depth=2),
                  DecisionTreeClassifier(max_depth=3)]
}

In [None]:
ada = AdaBoostClassifier(estimator=base_dt)
# Use the grid defined above (param_grid) and the same training features as the other models
grid = GridSearchCV(estimator=ada, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)

In [None]:
ada=AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),algorithm='SAMME',learning_rate=0.01,n_estimators=50,random_state=42)
ad_model=ada.fit(X_train,y_train)
y_pred=ad_model.predict(X_test)

In [None]:
metrics(y_test,y_pred,'tuned_model')

In [None]:
imp_feature(ad_model)

## Gradient Boost

In [None]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)

In [None]:
metrics(y_test,y_pred,'GB_model')

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'max_features': ['sqrt', 'log2']
}

grid = GridSearchCV(estimator=gb, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

In [None]:
gb = GradientBoostingClassifier(learning_rate=0.1,max_depth=3,max_features='log2',n_estimators=100,subsample=1.0,random_state=42)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)

In [None]:
metrics(y_test,y_pred,'GB-tuned')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
imp_feature(gb)

## XGBOOST

In [None]:
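num_neg = sum(y_train == 0)
num_pos = sum(y_train == 1)

# Note: the XGBoost documentation suggests sum(negative) / sum(positive) as a typical value.
# The ratio below is the inverse, so when positives are the majority it gives them extra weight.
scale_pos_weight = num_pos/ num_neg
print("scale_pos_weight =", scale_pos_weight)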
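<p>As a quick check of the direction of this ratio, the sketch below computes both variants on a hypothetical label vector built from the class shares reported earlier (65.2% positive, 34.8% negative). The variable names are illustrative.</p>

```python
import numpy as np

# Hypothetical labels matching the reported class shares (652 positive, 348 negative)
y_demo = np.array([1] * 652 + [0] * 348)

num_neg = int((y_demo == 0).sum())
num_pos = int((y_demo == 1).sum())

# Typical value suggested in the XGBoost documentation: negatives / positives
conventional_weight = num_neg / num_pos
# Inverse ratio: positives / negatives
inverse_weight = num_pos / num_neg

print(f'neg/pos = {conventional_weight:.3f}')
print(f'pos/neg = {inverse_weight:.3f}')
```

<p>With positives in the majority, the conventional value is below 1 and downweights the positive class, while the inverse ratio pushes the model further toward predicting "Yes". Which one is appropriate depends on whether recall on the positive class or balance between classes is the goal.</p>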

In [None]:
XGB = XGBClassifier()
xgb_model=XGB.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

In [None]:
metrics(y_test,y_pred,'XG-model')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
param = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.5],
    'scale_pos_weight': [1, 2, 5] 
}

In [None]:
grid_search = GridSearchCV(
    estimator=XGB,
    param_grid=param,
    scoring='recall',
    cv=5,
    verbose=2,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

In [None]:
XGB = XGBClassifier(learning_rate=0.01,max_depth=3,min_child_weight=1,n_estimators=100,scale_pos_weight=1.81)
xgb_model=XGB.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
y_pred_prob=xgb_model.predict_proba(X_test)[:,1]
y_pred_cust=(y_pred_prob>=0.7).astype(int)

In [None]:
metrics(y_test,y_pred_cust,'tuned model')

### Support Vector Machines (SVM)

In [None]:
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)


In [None]:
metrics(y_test,y_pred,'SVM_model')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
# Tuned SVM

In [None]:
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

In [None]:
svm = SVC(C=1,gamma='scale',kernel='rbf')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

In [None]:
metrics(y_test,y_pred,'tuned_svm')

In [None]:
plot_confusion_matrix(y_test,y_pred)

### Light GBM

In [None]:
lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)

In [None]:
metrics(y_test,y_pred,'LGBM')

In [None]:
plot_confusion_matrix(y_test,y_pred)

In [None]:
# Tuning

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 10, -1],
    'num_leaves': [31, 64],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Tune the LightGBM estimator defined above
grid = GridSearchCV(estimator=lgbm, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

### shap values

In [None]:
import shap
# Use the XGBClassifier class directly, as in the cells above
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# SHAP Explainer
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)

# Summary Plot (bar shows global feature importance)
shap.plots.bar(shap_values, max_display=10)

In [None]:
# Models
# RFE needs coef_ or feature_importances_; an RBF-kernel SVC exposes neither,
# so svm is left out of this loop.
models = [dt_model, rf, ada, gb, XGB, lgbm]  # your pre-trained models
model_names = ['decision tree', 'random forest', 'Ada Boost', 'GradientBoost', 'XGBoost', 'Light GBM']


# Number of features to select
n_features_to_select = 30
for estimator, name in zip(models, model_names):
    # RFE Feature Selection (RFE refits a clone of each estimator)
    rfe = RFE(estimator=estimator, n_features_to_select=n_features_to_select)
    rfe.fit(X_train, y_train)
    selected_features = X_train.columns[rfe.support_]
    print(f'{name}\n', selected_features)

In [None]:
#Decision Tree on RFE features


In [None]:
selected_features=['Gender', 'Age', 'Sleep_duration', 'Sleep_quality', 'Stress_level',
       'Heart_rate', 'Daily_steps', 'Physical_activity', 'Sleep_disorder',
       'Wake_up_during_night', 'Feel_sleepy_during_day',
       'Caffeine_consumption', 'Alcohol_consumption', 'Smoking',
       'Medical_issue', 'Average_screen_time', 'Blue_light_filter',
       'Discomfort_Eye_strain', 'Redness_in_eye',
       'Itchiness_Irritation_in_eye', 'Systolic_BP', 'Diastolic_BP',
       'Pulse_Pressure', 'BMI', 'BP_category_Hypertension Stage 1',
       'BP_category_Hypertension Stage 2', 'BP_category_Normal',
       'Sleep_category_Long', 'Sleep_category_Short',
       'Screen_Time_Category_Low']

In [None]:
X_train_sel=X_train[selected_features]
X_test_sel=X_test[selected_features]

In [None]:
dt_rfe = DecisionTreeClassifier()
dt_rfe.fit(X_train_sel, y_train)
train_pred=dt_rfe.predict(X_train_sel)
y_pred = dt_rfe.predict(X_test_sel)
y_pred_prob = dt_rfe.predict_proba(X_test_sel)[:, 1]

In [None]:
metrics(y_train,train_pred,'train metrics')

In [None]:
metrics(y_test,y_pred,'test metrics')

In [None]:
param = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, 15, 20, None],
    'min_samples_split': [5, 10, 20, 50],
    'min_samples_leaf': [2, 4, 10]
}

In [None]:
dt_search=GridSearchCV(estimator=dt_rfe,param_grid=param,cv=5,n_jobs=-1,verbose=1)
model=dt_search.fit(X_train_sel,y_train)
print(model.best_params_)

In [None]:
dt_rfe = DecisionTreeClassifier(criterion='entropy',min_samples_leaf=2,min_samples_split=2,max_depth=4,random_state=42)
dt_rfe.fit(X_train_sel, y_train)
y_pred = dt_rfe.predict(X_test_sel)
metrics(y_test,y_pred,'dt rfe')

In [None]:
# Random Forest

In [None]:
rf_model_rfe = RandomForestClassifier(random_state=42)
rf_model_rfe.fit(X_train_sel, y_train)
y_pred_rf = rf_model_rfe.predict(X_test_sel)
y_pred_prob=rf_model_rfe.predict_proba(X_test_sel)[:,1]
y_pred_cust = (y_pred_prob > 0.35).astype(int)
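# Evaluate the Random Forest predictions at the custom threshold (not the stale y_pred from the decision tree)
metrics(y_test, y_pred_cust, 'Random Forest')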

In [None]:
# AdaBoost (separate name so the full-feature `ada` model is not overwritten)
ad_rfe=AdaBoostClassifier(random_state=42)
ad_rfe.fit(X_train_sel,y_train)
y_pred=ad_rfe.predict(X_test_sel)

In [None]:
metrics(y_test,y_pred,'adaboost rfe')

In [None]:
#Gradient boost

In [None]:
gb_rfe = GradientBoostingClassifier()
gb_rfe.fit(X_train_sel, y_train)
y_pred = gb_rfe.predict(X_test_sel)

In [None]:
metrics(y_test,y_pred,'gradient boost rfe')

In [None]:
# XGBoost

In [None]:
# Separate name so the full-feature `XGB` model is not overwritten
xgb_rfe = XGBClassifier()
xgb_rfe.fit(X_train_sel, y_train)
y_pred = xgb_rfe.predict(X_test_sel)

In [None]:
metrics(y_test,y_pred,'XGB rfe')

In [None]:
metrics(y_test,y_pred,'shap xgb model')

In [None]:
lr=LogisticRegression()
lr.fit(X_train,y_train)
y_pred=lr.predict(X_test)
y_pred_prob=lr.predict_proba(X_test)[:,1]
y_pred_cust = (y_pred_prob >= 0.55).astype(int)

In [None]:
metrics(y_test,y_pred_cust,'LR model')

In [None]:
voting_clf = VotingClassifier(estimators=[
    ('dt', dt_model),
    ('rf', rf),
    ('ada', ada),
    ('gb',gb),
    ('xgb', XGB),
    ('lgbm',lgbm),
], voting='soft')  # Use 'hard' for majority vote, 'soft' for probabilities

# Fit VotingClassifier
voting_clf.fit(X_train, y_train)

# Predict
y_pred_voting = voting_clf.predict(X_test)
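
The difference between `voting='hard'` and `voting='soft'` shows up when one confident model disagrees with several lukewarm ones. The sketch below is a simplified, binary-only illustration with made-up probabilities; `hard_vote`, `soft_vote`, and `toy_probs` are hypothetical names and do not reproduce scikit-learn's internals.

```python
def hard_vote(positive_probs, threshold=0.5):
    """Majority vote over each model's own 0/1 decision."""
    votes = [1 if p >= threshold else 0 for p in positive_probs]
    return 1 if sum(votes) > len(votes) / 2 else 0


def soft_vote(positive_probs, threshold=0.5):
    """Average the positive-class probabilities, then binarize once."""
    mean_prob = sum(positive_probs) / len(positive_probs)
    return 1 if mean_prob > threshold else 0


# Made-up positive-class probabilities from three models for one sample:
# two models lean slightly negative, one is strongly positive.
toy_probs = [0.45, 0.48, 0.95]

print("hard vote:", hard_vote(toy_probs))
print("soft vote:", soft_vote(toy_probs))
```

Here the hard vote follows the two weak negatives, while the soft vote follows the averaged probability, which the single confident model pulls above 0.5.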


In [None]:
metrics(y_test,y_pred_voting,'voting')

In [None]:
def metric(y_test, y_pred, model):
    print(model)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return acc, prec, rec, f1

In [None]:
models = [
    ('Logistic Regression', lr_model),
    ('Decision Tree', dt_model),
    ('Random Forest', rf),
    ('Gradient Boosting', gb),
    ('XGBoost', XGB),
    ('LightGBM', lgbm),
    ('AdaBoost', ada),
    ('SVC', svm),
    ('Voting Classifier', voting_clf)
]
results = []

# Loop through models
for name, model in models:
    y_pred = model.predict(X_test)
    acc, prec, rec, f1 = metric(y_test, y_pred, name)
    
    results.append({
        'Model': name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1
    })

# Convert results to DataFrame
df_results = pd.DataFrame(results)
df_results
