
In [1]:
import pandas as pd

data = {
    'Challenges Launched': [10, 15, 20, 25, 30],
    'Active Users': [50, 75, 100, 125, 150]
}
df = pd.DataFrame(data)
display(df)

index,Challenges Launched,Active Users
0,10,50
1,15,75
2,20,100
3,25,125
4,30,150


In [2]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
X = df['Challenges Launched'].values.reshape(-1, 1)
y = df['Active Users']
model.fit(X, y)

In [3]:
baseline = df['Active Users'].mean()
print(f"Baseline (mean of Active Users): {baseline}")

Baseline (mean of Active Users): 100.0


In [4]:
predicted_users = model.predict(X)
shap_values = predicted_users - baseline
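
The subtraction above works as a SHAP value only because the model has a single feature. A minimal sketch of why, assuming the training rows double as the background data: for a linear model the attribution of a feature is its coefficient times the feature's deviation from the background mean, and with one feature that equals prediction minus mean target. The variable names below are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as the notebook's first example.
challenges = np.array([10, 15, 20, 25, 30], dtype=float).reshape(-1, 1)
active_users = np.array([50, 75, 100, 125, 150], dtype=float)

model = LinearRegression().fit(challenges, active_users)

# Closed-form attribution for a linear model:
# phi = coefficient * (x - mean of x over the background data).
phi_closed_form = model.coef_[0] * (challenges[:, 0] - challenges[:, 0].mean())

# The notebook's shortcut: prediction minus the mean target.
phi_shortcut = model.predict(challenges) - active_users.mean()

print(np.round(phi_closed_form, 6))
print(np.allclose(phi_closed_form, phi_shortcut))
```

With more than one feature the shortcut no longer splits the difference between features, so the per-feature formula (or the shap library) is needed.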

In [5]:
results_df = pd.DataFrame()
results_df['Challenges'] = df['Challenges Launched']
results_df['Actual Users'] = df['Active Users']
results_df['Predicted Users'] = predicted_users
results_df['Baseline'] = baseline
results_df['SHAP Value'] = shap_values
display(results_df)

index,Challenges,Actual Users,Predicted Users,Baseline,SHAP Value
0,10,50,50.0,100.0,-50.0
1,15,75,75.0,100.0,-25.0
2,20,100,100.0,100.0,0.0
3,25,125,125.0,100.0,25.0
4,30,150,150.0,100.0,50.0


In [6]:
def interpret_prediction(row):
    actual = row['Actual Users']
    predicted = row['Predicted Users']
    if actual == predicted:
        return "Prediction is accurate."
    elif predicted > actual:
        return f"Predicted users ({predicted:.1f}) are slightly higher than actual users ({actual}). Model is slightly off."
    else:
        return f"Predicted users ({predicted:.1f}) are slightly lower than actual users ({actual}). Model is slightly off."

results_df['Interpretation'] = results_df.apply(interpret_prediction, axis=1)
display(results_df)

index,Challenges,Actual Users,Predicted Users,Baseline,SHAP Value,Interpretation
0,10,50,50.0,100.0,-50.0,Predicted users (50.0) are slightly higher tha...
1,15,75,75.0,100.0,-25.0,Prediction is accurate.
2,20,100,100.0,100.0,0.0,Prediction is accurate.
3,25,125,125.0,100.0,25.0,Predicted users (125.0) are slightly lower tha...
4,30,150,150.0,100.0,50.0,Prediction is accurate.
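
Records 0 and 3 are labelled as slightly off even though their predicted values display as equal to the actual values. This is what an exact `==` comparison on floats does when the prediction differs by rounding error. A hedged sketch of a tolerance-based variant follows; the `abs_tol` parameter, the scalar signature, and the sample value are assumptions made for illustration, not the notebook's code.

```python
import math

def interpret_prediction(actual, predicted, abs_tol=1e-6):
    """Classify a prediction, treating differences below abs_tol as exact."""
    if math.isclose(actual, predicted, rel_tol=0.0, abs_tol=abs_tol):
        return "Prediction is accurate."
    if predicted > actual:
        return f"Predicted users ({predicted:.1f}) are higher than actual users ({actual})."
    return f"Predicted users ({predicted:.1f}) are lower than actual users ({actual})."

# A prediction that is off by rounding error only (hypothetical value).
nearly_fifty = 50.00000000000001

print(nearly_fifty == 50)                      # exact comparison reports a mismatch
print(interpret_prediction(50, nearly_fifty))  # tolerance-based check accepts it
```

To use it with the results table, the function could be applied per row by passing the row's actual and predicted values.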


In [7]:
print("Model Performance Summary:")
print("--------------------------")
print("The linear regression model fits this dataset exactly up to floating-point rounding: every predicted value in results_df matches the actual value as displayed. The 'Interpretation' column still flags records 0 and 3 as slightly off, but only because interpret_prediction compares floats with ==.")
print("\nInfluence of 'Challenges Launched':")
print("-------------------------------------")
print("The linear regression model shows a strong positive linear relationship between 'Challenges Launched' and 'Active Users'. As the number of challenges launched increases, the number of active users increases proportionally: each additional challenge corresponds to 5 more active users in this data, so the model's coefficient for 'Challenges Launched' is positive (about 5).")
print("\nLinearity of the Relationship:")
print("--------------------------------")
print("Based on the perfect fit of the linear regression model to the provided data points, the relationship between 'Challenges Launched' and 'Active Users' appears to be perfectly linear within this dataset.")
print("\nClarity of SHAP-based Feature Attribution:")
print("------------------------------------------")
print("The 'SHAP Value' column (computed here as prediction minus baseline, which is the SHAP value when the model has a single feature) shows the contribution of 'Challenges Launched' to the prediction for each record relative to the baseline (mean of Active Users). A positive SHAP value indicates that the number of challenges launched for that record pushed the prediction above the baseline, while a negative SHAP value indicates it pushed the prediction below the baseline. The magnitude of the SHAP value is the size of that shift. For instance, a SHAP value of 50 means that 'Challenges Launched' raised the prediction for that record 50 users above the baseline.")

Model Performance Summary:
--------------------------
The linear regression model fits this dataset exactly up to floating-point rounding: every predicted value in results_df matches the actual value as displayed. The 'Interpretation' column still flags records 0 and 3 as slightly off, but only because interpret_prediction compares floats with ==.

Influence of 'Challenges Launched':
-------------------------------------
The linear regression model shows a strong positive linear relationship between 'Challenges Launched' and 'Active Users'. As the number of challenges launched increases, the number of active users increases proportionally: each additional challenge corresponds to 5 more active users in this data, so the model's coefficient for 'Challenges Launched' is positive (about 5).

Linearity of the Relationship:
--------------------------------
Based on the perfect fit of the linear regression model to the provided data points, the relationship between 'Challenges Launched' and 'Active Users' appears to be perfectly linear within this dataset.

Clarity of SHAP-based Feature Attribution:
----------------------------
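
The linearity claim in the summary can be checked numerically rather than read off the table. A small sketch using `np.polyfit` as a stand-in for the fitted model (the notebook itself uses scikit-learn; the variable names here are illustrative): fit a degree-1 polynomial and inspect the largest residual.

```python
import numpy as np

# Same toy data as the notebook's first example.
challenges = np.array([10, 15, 20, 25, 30], dtype=float)
active_users = np.array([50, 75, 100, 125, 150], dtype=float)

# Least-squares line; coefficients are returned highest degree first.
slope, intercept = np.polyfit(challenges, active_users, deg=1)

# Residuals near zero for every record mean the data lie on a straight line.
residuals = active_users - (slope * challenges + intercept)
max_abs_residual = np.max(np.abs(residuals))

print(f"slope={slope:.3f}, intercept={intercept:.3f}, max |residual|={max_abs_residual:.2e}")
```

A slope close to 5 with residuals at the level of rounding error is consistent with the summary's statement that the relationship is exactly linear in this dataset.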

In [8]:
data = {
    'Emails Sent': [100, 120, 150, 180, 200],
    'Topic Score': [7, 8, 9, 7, 10],
    'Attendance': [50, 60, 75, 70, 90]
}
df = pd.DataFrame(data)
display(df)

index,Emails Sent,Topic Score,Attendance
0,100,7,50
1,120,8,60
2,150,9,75
3,180,7,70
4,200,10,90


In [9]:
from sklearn.linear_model import LinearRegression

X = df[['Emails Sent', 'Topic Score']]
y = df['Attendance']
model = LinearRegression()
model.fit(X, y)

In [10]:
baseline = df['Attendance'].mean()
print(f"Baseline (mean of Attendance): {baseline}")

Baseline (mean of Attendance): 69.0


In [11]:
import shap

explainer = shap.Explainer(model, X)
shap_values_raw = explainer(X).values

In [12]:
import numpy as np

predictions = model.predict(X)

# Sum of baseline and SHAP values for each record
shap_sum = baseline + shap_values_raw[:, 0] + shap_values_raw[:, 1]

# Validate the SHAP decomposition
# Use a small tolerance for floating-point comparisons
validation_successful = np.allclose(predictions, shap_sum, atol=1e-9)

if validation_successful:
    print("SHAP decomposition validation successful: Prediction = Baseline + SHAP(Emails) + SHAP(Topic Score)")
else:
    print("SHAP decomposition validation failed.")

SHAP decomposition validation successful: Prediction = Baseline + SHAP(Emails) + SHAP(Topic Score)
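
The decomposition validated above can also be reproduced without the shap library. This is a sketch under two assumptions: the features are treated as independent, and the five training rows serve as the background data. Under those assumptions each linear-model attribution is the coefficient times the feature's deviation from its background mean. The names `background_mean`, `base_value`, and `phi` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: Emails Sent, Topic Score (same rows as the notebook's table).
X = np.array([[100, 7], [120, 8], [150, 9], [180, 7], [200, 10]], dtype=float)
y = np.array([50, 60, 75, 70, 90], dtype=float)

model = LinearRegression().fit(X, y)

# Base value: the model's output at the mean of the background data.
background_mean = X.mean(axis=0)
base_value = model.predict(background_mean.reshape(1, -1))[0]

# One attribution per feature and record: coef_j * (x_j - mean_j).
phi = model.coef_ * (X - background_mean)

# Additivity: base value plus the attributions reproduces each prediction.
reconstructed = base_value + phi.sum(axis=1)

print(np.round(phi, 6))
print(np.allclose(reconstructed, model.predict(X)))
```

The first row of `phi` should agree with the SHAP columns reported for record 0 in the results table, and the base value coincides with the mean attendance of 69.0 because an ordinary least-squares fit with an intercept reproduces the training mean.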


In [13]:
results_df = pd.DataFrame()
results_df['Emails Sent'] = df['Emails Sent']
results_df['Topic Score'] = df['Topic Score']
results_df['Actual Attendance'] = df['Attendance']
results_df['Predicted Attendance'] = predictions
results_df['Baseline'] = baseline
results_df['SHAP (Emails)'] = shap_values_raw[:, 0]
results_df['SHAP (Topic Score)'] = shap_values_raw[:, 1]

def interpret_prediction(row):
    actual = row['Actual Attendance']
    predicted = row['Predicted Attendance']
    emails_shap = row['SHAP (Emails)']
    topic_shap = row['SHAP (Topic Score)']

    interpretation = f"Actual: {actual}, Predicted: {predicted:.1f}. "

    if actual == predicted:
        interpretation += "Prediction is accurate. "
    elif predicted > actual:
        interpretation += f"Predicted attendance ({predicted:.1f}) is higher than actual ({actual}). "
    else:
        interpretation += f"Predicted attendance ({predicted:.1f}) is lower than actual ({actual}). "

    interpretation += f"Emails Sent SHAP: {emails_shap:.2f} (impact on prediction from baseline). "
    interpretation += f"Topic Score SHAP: {topic_shap:.2f} (impact on prediction from baseline)."

    return interpretation

results_df['Interpretation'] = results_df.apply(interpret_prediction, axis=1)

display(results_df)

index,Emails Sent,Topic Score,Actual Attendance,Predicted Attendance,Baseline,SHAP (Emails),SHAP (Topic Score),Interpretation
0,100,7,50,50.364322,69.0,-12.123116,-6.512563,"Actual: 50.0, Predicted: 50.4. Predicted atten..."
1,120,8,60,60.640704,69.0,-7.273869,-1.085427,"Actual: 60.0, Predicted: 60.6. Predicted atten..."
2,150,9,75,73.341709,69.0,0.0,4.341709,"Actual: 75.0, Predicted: 73.3. Predicted atten..."
3,180,7,70,69.761307,69.0,7.273869,-6.512563,"Actual: 70.0, Predicted: 69.8. Predicted atten..."
4,200,10,90,90.89196,69.0,12.123116,9.768844,"Actual: 90.0, Predicted: 90.9. Predicted atten..."
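
The closeness of the predictions in this table can be put into numbers. The sketch below copies the actual and predicted attendance values from the table above and computes the mean and maximum absolute error; nothing is refitted.

```python
import numpy as np

# Values copied from the results table.
actual = np.array([50, 60, 75, 70, 90], dtype=float)
predicted = np.array([50.364322, 60.640704, 73.341709, 69.761307, 90.89196])

abs_errors = np.abs(predicted - actual)
mae = abs_errors.mean()         # mean absolute error
max_error = abs_errors.max()    # worst single record

print(f"MAE: {mae:.3f}, max abs error: {max_error:.3f}")
# MAE: 0.759, max abs error: 1.658
```

Note that these are in-sample errors on five records, so they describe how well the line fits these points and say little about new webinars.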


In [14]:
print("Model Performance Summary:")
print("--------------------------")
print("The multiple linear regression model shows a good fit to the data. Although not a perfect fit for every single record in this specific dataset, the predicted attendance values in the 'Predicted Attendance' column are generally very close to the 'Actual Attendance' values, as indicated in the 'Interpretation' column.")

print("\nInfluence of 'Emails Sent' and 'Topic Score':")
print("---------------------------------------------")
print("The 'SHAP (Emails)' and 'SHAP (Topic Score)' columns in the results table clearly show the individual contribution of each feature to the prediction for each record, relative to the baseline attendance (mean attendance of 69.0).")
print("- Positive SHAP values indicate that the feature's value for that record pushed the prediction above the baseline.")
print("- Negative SHAP values indicate that the feature's value for that record pushed the prediction below the baseline.")
print("For example:")
print("- For the first record, 'Emails Sent' has a negative SHAP value (-12.12), meaning sending 100 emails contributed to a lower prediction than the baseline. 'Topic Score' also has a negative SHAP value (-6.51), further lowering the prediction.")
print("- For the last record, 'Emails Sent' has a positive SHAP value (12.12), and 'Topic Score' has a positive SHAP value (9.77), both contributing to a prediction significantly higher than the baseline.")

print("\nClarity of SHAP-based Feature Attribution:")
print("------------------------------------------")
print("The SHAP values provide a clear and interpretable breakdown of how each feature ('Emails Sent' and 'Topic Score') contributes to the final prediction for each individual webinar attendance. By comparing the SHAP values across different records, we can understand how varying levels of emails sent and topic scores influence the predicted attendance relative to the average attendance (baseline). This method makes the model's decision-making process transparent at a per-instance level.")

Model Performance Summary:
--------------------------
The multiple linear regression model shows a good fit to the data. Although not a perfect fit for every single record in this specific dataset, the predicted attendance values in the 'Predicted Attendance' column are generally very close to the 'Actual Attendance' values, as indicated in the 'Interpretation' column.

Influence of 'Emails Sent' and 'Topic Score':
---------------------------------------------
The 'SHAP (Emails)' and 'SHAP (Topic Score)' columns in the results table clearly show the individual contribution of each feature to the prediction for each record, relative to the baseline attendance (mean attendance of 69.0).
- Positive SHAP values indicate that the feature's value for that record pushed the prediction above the baseline.
- Negative SHAP values indicate that the feature's value for that record pushed the prediction below the baseline.
For example:
- For the first record, 'Emails Sent' has a negative SHAP value

In [15]:
from sklearn.datasets import load_diabetes
import pandas as pd

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.DataFrame(diabetes.target, columns=['disease_progression'])

display(X.head())
display(y.head())

index,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


index,disease_progression
0,151.0
1,75.0
2,141.0
3,206.0
4,135.0


In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
model = LinearRegression()
model.fit(X_train, y_train)

In [18]:
baseline = y_train.mean().iloc[0]
print(f"Baseline (mean of disease progression in training data): {baseline}")

Baseline (mean of disease progression in training data): 153.73654390934846


In [19]:
import shap

explainer = shap.Explainer(model, X_train)
shap_values_test = explainer(X_test).values

In [20]:
predictions_test = model.predict(X_test)

# Sum of SHAP values for each record in the test set
shap_sum_test = np.sum(shap_values_test, axis=1)

# Add the baseline to the sum of SHAP values
shap_sum_with_baseline = baseline + shap_sum_test

# Validate the SHAP decomposition
# Use a small tolerance for floating-point comparisons
validation_successful = np.allclose(predictions_test.flatten(), shap_sum_with_baseline, atol=1e-9)

if validation_successful:
    print("SHAP decomposition validation successful: Prediction = Baseline + Sum of SHAP values")
else:
    print("SHAP decomposition validation failed.")

SHAP decomposition validation failed.


In [21]:
predictions_test = model.predict(X_test)

# Sum of SHAP values for each record in the test set
shap_sum_test = np.sum(shap_values_test, axis=1)

# Add the baseline to the sum of SHAP values
shap_sum_with_baseline = baseline + shap_sum_test

# Validate the SHAP decomposition
# Use a slightly larger tolerance for floating-point comparisons
validation_successful = np.allclose(predictions_test.flatten(), shap_sum_with_baseline, atol=1e-7)

if validation_successful:
    print("SHAP decomposition validation successful: Prediction = Baseline + Sum of SHAP values")
else:
    print("SHAP decomposition validation failed.")

SHAP decomposition validation failed.


In [22]:
predictions_test = model.predict(X_test)

# Sum of SHAP values for each record in the test set
shap_sum_test = np.sum(shap_values_test, axis=1)

# Add the baseline to the sum of SHAP values
shap_sum_with_baseline = baseline + shap_sum_test

# Validate the SHAP decomposition
# Use a slightly larger tolerance for floating-point comparisons to account for potential precision differences
validation_successful = np.allclose(predictions_test.flatten(), shap_sum_with_baseline, atol=1e-5)

if validation_successful:
    print("SHAP decomposition validation successful: Prediction = Baseline + Sum of SHAP values")
else:
    print("SHAP decomposition validation failed.")

SHAP decomposition validation failed.
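
Loosening the tolerance does not help, which suggests the mismatch is not a floating-point issue. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the explainer computes its base value from a subsample of the background data, so that base value differs from `y_train.mean()`. The sketch below reproduces that effect with hand-computed linear attributions instead of the shap library; the helper `linear_attributions`, the subsample size of 100, and the seed are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

def linear_attributions(model, background, X_explain):
    """Return (base value, attributions) for a linear model and a background set."""
    mu = background.mean(axis=0)
    base = model.intercept_ + model.coef_ @ mu
    phi = model.coef_ * (X_explain - mu)
    return base, phi

# Background = full training set: the base value equals the training mean.
base_full, phi_full = linear_attributions(model, X_train, X_test)

# Background = random subsample of 100 training rows (hypothetical choice).
rng = np.random.default_rng(0)
subset = X_train[rng.choice(len(X_train), size=100, replace=False)]
base_sub, phi_sub = linear_attributions(model, subset, X_test)

full_ok = np.allclose(preds, base_full + phi_full.sum(axis=1))
sub_ok = np.allclose(preds, base_sub + phi_sub.sum(axis=1))
# Mixing attributions from the subsample with the training-mean baseline breaks additivity.
mixed_ok = np.allclose(preds, y_train.mean() + phi_sub.sum(axis=1))

print(full_ok, sub_ok, mixed_ok)
```

If this is what happens in the notebook, comparing predictions against the base values reported by the explanation object itself, rather than against `y_train.mean()`, would be the more reliable check.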


In [23]:
results_df = pd.DataFrame()
results_df['Actual Disease Progression'] = y_test['disease_progression']
results_df['Predicted Disease Progression'] = predictions_test.flatten()
results_df['Baseline'] = baseline

# Add SHAP values for each feature
for i, feature_name in enumerate(X_test.columns):
    results_df[f'SHAP ({feature_name})'] = shap_values_test[:, i]

def interpret_prediction(row):
    actual = row['Actual Disease Progression']
    predicted = row['Predicted Disease Progression']
    interpretation = f"Actual: {actual:.1f}, Predicted: {predicted:.1f}. "

    if abs(actual - predicted) < 1: # Using a small tolerance for interpretation of accuracy
        interpretation += "Prediction is very close to actual. "
    elif predicted > actual:
        interpretation += f"Predicted value ({predicted:.1f}) is higher than actual ({actual:.1f}). "
    else:
        interpretation += f"Predicted value ({predicted:.1f}) is lower than actual ({actual:.1f}). "

    interpretation += "Feature contributions (SHAP values relative to baseline): "
    for feature_name in X_test.columns:
        shap_value = row[f'SHAP ({feature_name})']
        interpretation += f"{feature_name}: {shap_value:.2f} ({'increases' if shap_value > 0 else 'decreases'} prediction). "

    return interpretation

results_df['Interpretation'] = results_df.apply(interpret_prediction, axis=1)

display(results_df)

index,Actual Disease Progression,Predicted Disease Progression,Baseline,SHAP (age),SHAP (sex),SHAP (bmi),SHAP (bp),SHAP (s1),SHAP (s2),SHAP (s3),SHAP (s4),SHAP (s5),SHAP (s6),Interpretation
287,219.0,139.547558,153.736544,1.962051,10.379010,-3.361659,-3.683095,-116.415425,64.587169,3.561787,9.387414,23.691276,-0.510046,"Actual: 219.0, Predicted: 139.5. Predicted val..."
211,70.0,179.517208,153.736544,3.751993,10.379010,20.023792,9.484923,23.288212,-8.903205,0.553521,-10.931664,-16.762007,-1.316442,"Actual: 70.0, Predicted: 179.5. Predicted valu..."
72,202.0,134.038756,153.736544,2.650490,-12.685457,-2.192386,-2.486002,-95.908469,25.002950,9.578319,-0.772125,62.017204,-1.114843,"Actual: 202.0, Predicted: 134.0. Predicted val..."
321,230.0,291.417029,153.736544,3.889681,10.379010,28.208700,29.440455,-51.049503,18.675964,-12.081196,38.850078,72.439226,2.715539,"Actual: 230.0, Predicted: 291.4. Predicted val..."
73,111.0,123.789659,153.736544,0.722861,-12.685457,-10.961930,1.105276,-35.669286,27.274176,-0.649785,9.387414,-3.971039,-0.711645,"Actual: 111.0, Predicted: 123.8. Predicted val..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,153.0,115.011800,153.736544,0.309798,10.379010,-35.516653,-0.091817,6.626310,-10.363278,7.171706,-10.931664,-2.615438,0.094752,"Actual: 153.0, Predicted: 115.0. Predicted val..."
90,98.0,78.955842,153.736544,0.722861,10.379010,-13.885111,-12.062742,28.414951,-23.666172,13.188238,-21.091203,-53.289415,0.296351,"Actual: 98.0, Predicted: 79.0. Predicted value..."
57,37.0,81.560873,153.736544,-0.791705,10.379010,-34.347381,-15.654020,83.527394,-54.327718,8.976666,-21.091203,-41.525214,-3.534032,"Actual: 37.0, Predicted: 81.6. Predicted value..."
391,63.0,54.379973,153.736544,-0.654017,10.379010,-37.855198,-20.442390,55.330330,-26.424088,3.561787,-10.931664,-65.805237,-2.727635,"Actual: 63.0, Predicted: 54.4. Predicted value..."
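
Per-record SHAP columns like those above are often condensed into one number per feature by averaging their absolute values over the test set. A minimal sketch, assuming hand-computed linear attributions with the full training set as background (so the numbers need not match the shap library's output exactly):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Linear attribution per record and feature: coef_j * (x_j - training mean_j).
phi = model.coef_ * (X_test - X_train.mean(axis=0))

# Global importance: mean absolute attribution per feature over the test set.
mean_abs = np.abs(phi).mean(axis=0)
ranking = sorted(zip(data.feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)

for name, value in ranking:
    print(f"{name:>4}: {value:.2f}")
```

The ranking reflects how much each feature moves predictions on average, which depends on both the coefficient and the spread of the feature in the test data.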


In [24]:
print("Model Performance Summary:")
print("--------------------------")
print("The multiple linear regression model's performance on the test set can be observed in the 'Interpretation' column of the results_df. The model does not fit every record well: among the displayed records, errors range from under 10 units (63.0 actual vs 54.4 predicted) to over 100 units (70.0 actual vs 179.5 predicted). The 'Interpretation' column provides a per-record comparison, indicating where the model over- or under-predicts.")
print("\nInfluence of Features based on SHAP Analysis:")
print("---------------------------------------------")
print("The SHAP values for each feature ('age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6') for each record in the test set are presented in the 'SHAP (...)' columns of the results_df. These values represent the contribution of each feature's value to the difference between the predicted disease progression and the explainer's base value. Because the additivity check against the training mean (153.74) failed, that base value is not identical to the training mean: in the displayed records, the baseline plus the SHAP values exceeds the prediction by a constant of about 3.79.")
print("- A positive SHAP value for a feature indicates that the value of that feature for a specific record pushed the prediction *above* the baseline.")
print("- A negative SHAP value indicates that the feature's value pushed the prediction *below* the baseline.")
print("By examining the range and magnitude of the SHAP values across the test set, we can infer the relative influence of each feature. Features with larger absolute SHAP values tend to have a greater impact on the predictions. Observing the SHAP values across different records allows us to see how the influence of a feature can vary depending on its specific value in a given instance.")
print("\nClarity of SHAP-based Feature Attribution:")
print("------------------------------------------")
print("SHAP values provide a clear and interpretable method for feature attribution in this linear regression model. Model coefficients give the same per-unit effect for every record, whereas SHAP values give the specific contribution of each feature to *each individual prediction*. The 'Interpretation' column in the results_df lists these SHAP values next to each prediction, attributing portions of the difference from the base value to each feature. This per-instance attribution makes the model's predictions easier to explain record by record.")

Model Performance Summary:
--------------------------
The multiple linear regression model's performance on the test set can be observed in the 'Interpretation' column of the results_df. The model does not fit every record well: among the displayed records, errors range from under 10 units (63.0 actual vs 54.4 predicted) to over 100 units (70.0 actual vs 179.5 predicted). The 'Interpretation' column provides a per-record comparison, indicating where the model over- or under-predicts.

Influence of Features based on SHAP Analysis:
---------------------------------------------
The SHAP values for each feature ('age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6') for each record in the test set are presented in the 'SHAP (...)' columns of the results_df. These values represent the contribution of each feature's value to the difference between the predicted disease progression and the explainer's base value. Because the additivity check against the training mean (153.74) failed, that base value is not identical to the training mean: in the displayed records, the baseline plus the SHAP values exceeds the prediction by a constant of about 3.79.
- A positive SHAP value for a feature indicates that the value of