# Predicting Initiating Events in Narrative Texts
By: Adrian Chavez-Loya

## Background
Story telling is a key component of interpersonal communication and the study of narrative ability in children can provide critical insights into their language development. Narrative sample analysis is a process in which an individual produces a narrative and then a Speech-Language Pathologist (or similar practitioner) analyzes the quality. One tool for measuring this quality is the Monitoring Indicators of Scholarly Language (MISL). It provides an objective measure of the macrostructure story elements (e.g. Characters, Setting, Initiating Event) as well as the microstructure or grammatical elements. 

The process of scoring the macrostructure can be very time consuming though, which leads to less effective ongoing monitoring. This dataset provides the first publicly accessible data for attempting to automate scoring of the macrostructure via Machine Learning.

## Dataset:
`AutomatedNarrativeAnalysisMISLData.csv`

## Task

We will predict the Initiating Event (`IE`) label. The `IE` is scored as either 0, 1, 2, or 3 but for our purposes it is acceptable to predict this as either a continuous or categorical output. If you predict it as continuous, it is necessary to constrain the prediction in some way, therefore, categorical may be easier.

For predictor variables, we have two choices: either the raw text or the text features (or both, technically). The text features are every column **except** `Char`, `Sett`, `IE`, `Plan`, `Act`, and `Con`. Those 6 variables are the output scores but again we'll just be focusing on `IE` for now. Also, exclude the `ID` column.

Using cross-validation, we will explore the many different classification algorithms we discussed to find the model with the highest performance (I'll leave it to you to define performance).

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv('AutomatedNarrativeAnalysisMISLData.csv')

In [3]:
df.head()

Unnamed: 0,ID,vecOfNarratives,Char,Sett,IE,Plan,Act,Con,ENP,DESPC,...,WRDCNCc,WRDIMGc,WRDMEAc,WRDPOLc,WRDHYPn,WRDHYPv,WRDHYPnv,RDFRE,RDFKGL,RDL2
0,1,The mom wanted to go and pet the alien dog. B...,1,0,2,1,2,0,2,1,...,430.110992,465.694,475.229004,3.437,6.04,1.551,1.737,100.0,1.266,23.245001
1,2,There was two little kids who were walking in...,1,1,3,0,2,3,3,1,...,399.847992,425.677002,428.175995,5.466,6.666,1.551,1.337,96.751999,1.777,34.120998
2,3,these aliens came to earth. and this girl a...,1,1,1,0,1,0,2,1,...,462.75,486.450012,468.105011,5.073,6.408,1.573,2.005,100.0,0.607,13.303
3,4,One time this alien ship came down to earth. ...,1,1,1,1,1,0,2,1,...,425.25,457.356995,434.609009,5.587,7.559,1.834,1.83,95.927002,3.357,28.655001
4,5,Aliens came down from the planet they came fr...,0,0,2,0,2,0,1,1,...,452.666992,461.444,464.75,4.758,6.361,1.065,0.837,100.0,0.148,45.837002


In [4]:
# Dropped unnecessary ID column
df.drop(columns=['ID'], inplace=True) 

In [5]:
df.head()

Unnamed: 0,vecOfNarratives,Char,Sett,IE,Plan,Act,Con,ENP,DESPC,DESSC,...,WRDCNCc,WRDIMGc,WRDMEAc,WRDPOLc,WRDHYPn,WRDHYPv,WRDHYPnv,RDFRE,RDFKGL,RDL2
0,The mom wanted to go and pet the alien dog. B...,1,0,2,1,2,0,2,1,11,...,430.110992,465.694,475.229004,3.437,6.04,1.551,1.737,100.0,1.266,23.245001
1,There was two little kids who were walking in...,1,1,3,0,2,3,3,1,29,...,399.847992,425.677002,428.175995,5.466,6.666,1.551,1.337,96.751999,1.777,34.120998
2,these aliens came to earth. and this girl a...,1,1,1,0,1,0,2,1,7,...,462.75,486.450012,468.105011,5.073,6.408,1.573,2.005,100.0,0.607,13.303
3,One time this alien ship came down to earth. ...,1,1,1,1,1,0,2,1,5,...,425.25,457.356995,434.609009,5.587,7.559,1.834,1.83,95.927002,3.357,28.655001
4,Aliens came down from the planet they came fr...,0,0,2,0,2,0,1,1,6,...,452.666992,461.444,464.75,4.758,6.361,1.065,0.837,100.0,0.148,45.837002


In [6]:
# Extracted features/target 'IE' 
X = df.drop(columns=['Char', 'Sett', 'IE', 'Plan', 'Act', 'Con', 'vecOfNarratives'])
y = df['IE']

In [7]:

# Tf-Idf Vextorizer 
tfidf = TfidfVectorizer(max_features=1000)
text_features = tfidf.fit_transform(df['vecOfNarratives']).toarray()

# Combine text features with other features
X_text = pd.DataFrame(text_features, columns=[f'text_feat_{i}' for i in range(text_features.shape[1])])
X = pd.concat([X.reset_index(drop=True), X_text], axis=1)

# Converted columns to numeric values, dropped others 
X = X.apply(pd.to_numeric, errors='coerce') 
X.dropna(axis=1, inplace=True)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardized data with standard scaler 
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Chose models to use for cross-validation 
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'SVM': SVC()
}

In [8]:
# Cross-validation and model evaluation
results = {}
for model_name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    results[model_name] = scores.mean()


In [9]:
# Results of Model Performance 
print("Model Performance (Accuracy):")
for model_name, score in results.items():
    print(f"{model_name}: {score:.4f}")

# Trained the best model on the entire training set and evaluate on the test set
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred) # Evaluation for the best model 
test_f1 = f1_score(y_test, y_pred, average='weighted')

print(f"\nBest Model: {best_model_name}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test F1 Score: {test_f1:.4f}")


Model Performance (Accuracy):
Logistic Regression: 0.5285
KNN: 0.4077
Decision Tree: 0.3805
Random Forest: 0.5557
Gradient Boosting: 0.5255
AdaBoost: 0.4106
SVM: 0.5014

Best Model: Random Forest
Test Accuracy: 0.5060
Test F1 Score: 0.4672


* The Random Forest model showed the highest cross-validation accuracy, but its performance on the test set was slightly lower, indicating a potential overfitting issue.
* Logistic Regression and Gradient Boosting also showed competitive cross-validation scores, suggesting they might be worth further exploration or tuning.
* Further tuning of hyperparameters, such as adjusting the number of trees in Random Forest or the regularization parameter in Logistic Regression, could potentially improve performance on the test set.
* Feature engineering, such as experimenting with different text features or additional domain-specific features, could also enhance model performance.


# Hyperparameter Tuning

In [26]:
from sklearn.metrics import accuracy_score, f1_score

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

# Initialize the Random Forest model and Grid Search 
rf_model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

 
print("Best Parameters:", grid_search.best_params_) # Best parameters and best CV accuracy 
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
best_rf_model = grid_search.best_estimator_ # Best Model 
y_pred = best_rf_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred) 
test_f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Test Accuracy: {test_accuracy:.4f}") # Test accuraxy/F1 Score 
print(f"Test F1 Score: {test_f1:.4f}")


Fitting 5 folds for each of 324 candidates, totalling 1620 fits
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_spl

540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
238 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/adrianchavezloya/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/adrianchavezloya/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1144, in wrapper
    estimator._validate_params()
  File "/Users/adrianchavezloya/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "/Users/adrianchavezloya/anaconda3/lib/python3.11/site-packages/sklearn/utils/_param_v

Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 50}
Best Cross-Validation Accuracy: 0.585888738127544
Test Accuracy: 0.4819
Test F1 Score: 0.4300


# Model Comparison

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Initialized Logistic Regression model/Gradient Boosting model
logreg_model = LogisticRegression(random_state=42, max_iter=1000)
gb_model = GradientBoostingClassifier(random_state=42)

# Trained LR Model/GB Model 
logreg_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_test) # Predictions on test set 
y_pred_gb = gb_model.predict(X_test)
test_accuracy_logreg = accuracy_score(y_test, y_pred_logreg) 
test_f1_logreg = f1_score(y_test, y_pred_logreg, average='weighted')
print("Logistic Regression Performance:") # Log. model regression results
print(f"Test Accuracy: {test_accuracy_logreg:.4f}")
print(f"Test F1 Score: {test_f1_logreg:.4f}")
print()


test_accuracy_gb = accuracy_score(y_test, y_pred_gb) 
test_f1_gb = f1_score(y_test, y_pred_gb, average='weighted')
print("Gradient Boosting Performance:") #Gradient Boosting results 
print(f"Test Accuracy: {test_accuracy_gb:.4f}")
print(f"Test F1 Score: {test_f1_gb:.4f}")


Logistic Regression Performance:
Test Accuracy: 0.5542
Test F1 Score: 0.5273

Gradient Boosting Performance:
Test Accuracy: 0.5422
Test F1 Score: 0.5329


# Analysis of Model Comparison:
* Both Logistic Regression and Gradient Boosting models show comparable performance metrics.
* Logistic Regression slightly outperforms Gradient Boosting in terms of accuracy (55.42% vs. 54.22%), but Gradient Boosting has a slightly higher F1 score (53.29% vs. 52.73%).
* These results suggest that both models are reasonable choices, and the preference might depend on whether higher precision or recall is more critical for this application.

In [33]:
# Got feature importances from the best Random Forest model
feature_importances = best_rf_model.feature_importances_

# Created a DataFrame to display feature importances
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})

# Sorted features by importance
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)

# Top 10 most important features
print("Top 10 Most Important Features:")
print(feature_importances_df.head(10))


Top 10 Most Important Features:
      Feature  Importance
3       DESWC    0.027832
13     PCNARp    0.015797
46      LSAGN    0.014257
69      SYNLE    0.013994
9    DESWLsyd    0.012545
20      PCDCz    0.012459
79       DRPP    0.012344
2       DESSC    0.011905
66  SMCAUSlsa    0.011265
48     LDTTRc    0.010828


### Analysis of Feature Importance

- **DESWC (Descriptive Word Count):**  
  Measures the richness of descriptive language in the narrative. Higher values indicate more vivid and detailed descriptions, which improve the model’s ability to identify well-developed Initiating Events.

- **PCNARp (Narrative Progression Percentage):**  
  Quantifies how effectively the narrative progresses through its key stages. It captures coherence and temporal flow, allowing the model to recognize structured storytelling patterns.

- **LSAGN (Lexical Sophistication Aggregate Norm):**  
  Reflects the overall sophistication and precision of vocabulary. Higher lexical sophistication contributes to clearer distinctions between simple and complex Initiating Events.

- **SYNLE (Syntactic Lexical Complexity):**  
  Represents the syntactic variety and structural complexity of sentences. It enhances the model’s understanding of linguistic organization and grammatical depth.

- **DESWLsyd (Descriptive Word Lexical Diversity):**  
  Measures the diversity of descriptive word usage, indicating how varied the vocabulary is when describing scenes or events. High diversity supports richer narrative representation.

- **PCDCz (Narrative Cohesion Coefficient):**  
  Captures the level of narrative cohesion and connectivity between sentences or ideas. Strong cohesion improves interpretability and event continuity.

- **DRPP (Discourse Referential Pronoun Proportion):**  
  Represents the frequency and usage of pronouns and discourse markers that maintain narrative continuity. Effective discourse use signals coherent event progression.

- **DESSC (Descriptive Semantic Coherence):**  
  Assesses the semantic consistency and alignment of descriptive language. Higher coherence indicates a more unified and meaningful event description.

- **SMCAUSlsa (Semantic Causality via LSA):**  
  Measures causal relationships within the narrative using latent semantic analysis. High causal linkage strength indicates well-connected and logically sequenced events.

- **LDTTRc (Lexical Diversity Type-Token Ratio - Corrected):**  
  Quantifies the lexical variety of the text while correcting for length effects. Greater lexical diversity reflects linguistic richness and improves differentiation between narrative quality levels.


## Summary and Conclusions

### Model Performance
This project applied machine learning to narrative data to predict the Initiating Event (IE) label. Several algorithms were tested, and the Random Forest model achieved the best performance, with a cross-validated accuracy of approximately 58.6%.

### Feature Importance
The model identified key linguistic and narrative features that most strongly influenced Initiating Event prediction:
- DESWC  
- PCNARp  
- LSAGN  
- SYNLE  
- DESWLsyd  
- PCDCz  
- DRPP  
- DESSC  
- SMCAUSlsa  
- LDTTRc  

**These features represent elements of descriptive richness, narrative coherence, syntactic complexity, and semantic structure—factors critical to identifying well-developed Initiating Events.**

### Conclusion
The Random Forest model provides a moderate but meaningful level of predictive accuracy for identifying Initiating Events in narrative text. Feature analysis highlights how linguistic complexity and coherence measures contribute to model performance. Continued experimentation with advanced NLP methods could further enhance accuracy and interpretability!