In this notebook, we transform an Excel file containing the sentences of each passage into an Excel file in which each sentence is associated with several [Coh-Metrix text properties](http://cohmetrix.memphis.edu/cohmetrixhome/documentation_indices.html) (Coh-Metrix is a text-analysis program cited by over 2000 studies).

---
**Features:**

| feature    | Coh-Metrix name | description |
| :--------: | :--------: |  :--------: |
| *sentence_length*  | DESWC    | # of words in a sentence |
| *word_length* | DESWLsy     | mean # of syllables per word |
| *word_frequency_cmx*    | WRDFRQc    | mean word frequency of content words in English |
| *word_concreteness_cmx*    | WRDCNCc    | mean concreteness of content words |
| *grade_level*    | RDFKGL    | Flesch-Kincaid Grade Level^ |

^Grade Level is computed from the first two variables in the table: `grade_level = (0.39 * sentence_length) + (11.8 * word_length) - 15.59`.
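
As a quick check of the footnote formula, the sketch below recomputes the grade level for a single sentence. The function name `flesch_kincaid_grade` is a hypothetical helper introduced here for illustration; it is not part of Coh-Metrix or of this notebook's pipeline.

```python
def flesch_kincaid_grade(sentence_length: float, word_length: float) -> float:
    """Grade level for one sentence, from its word count and mean syllables per word."""
    return 0.39 * sentence_length + 11.8 * word_length - 15.59


# A 10-word sentence averaging 1.5 syllables per word
grade = round(flesch_kincaid_grade(10, 1.5), 2)
print(grade)  # 6.01
```

Values recomputed this way from the rounded columns can differ slightly from the `RDFKGL` value reported by Coh-Metrix, because the reported value is presumably computed before rounding.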

**Files:**
* `data/ListA_Sentences.xlsx` -> `data/ListA_TextProperties.xlsx`

---




Step 1: Load in and process sentences from each passage

In [1]:
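# (0) Load the sentences and write each one to its own .txt file
import os
import pandas as pd

passages = ["A1", "A2", "A3", "A4", "A5"]
input_dir = "data/cohmetrix-input_ListA"
os.makedirs(input_dir, exist_ok=True)  # open() fails if the folder does not exist yet

for pas_id in passages:
    df = pd.read_excel("data/ListA_Sentences.xlsx", sheet_name=pas_id)

    sentences = df.Sentence.to_list()

    # Write each sentence to a file named <passage>_<position>.txt (position starts at 1)
    for i, sentence in enumerate(sentences):
        with open(f"{input_dir}/{pas_id}_{i + 1}.txt", "w") as txt_file:
            txt_file.write(sentence)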

Step 2: Feed the `data/cohmetrix-input_ListA` folder into the Coh-Metrix program and save the output to `data/cohmetrix-output_ListA.csv`.
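
Because Step 2 runs outside the notebook, it can be worth confirming that every input file was actually scored. The following is a minimal sketch, assuming each `TextID` value in the output ends with the input file name (which the parsing in Step 3 implies). The helper `unscored_inputs` and the example paths are hypothetical.

```python
import ntpath


def unscored_inputs(input_names, text_ids):
    """Return the input file names that have no matching row in the Coh-Metrix output."""
    # ntpath.basename strips Windows-style (and forward-slash) folder prefixes
    scored = {ntpath.basename(text_id) for text_id in text_ids}
    return sorted(name for name in input_names if name not in scored)


# Hypothetical values for illustration only
input_names = ["A1_1.txt", "A1_2.txt", "A1_3.txt"]
text_ids = ["C:\\input\\A1_1.txt", "C:\\input\\A1_3.txt"]
print(unscored_inputs(input_names, text_ids))  # ['A1_2.txt']
```

In the notebook itself, `input_names` could come from `os.listdir("data/cohmetrix-input_ListA")` and `text_ids` from the `TextID` column of the output CSV.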

Step 3: Clean up Coh-Metrix output

In [2]:
# (1) Load in the scored sentences
import pandas as pd
cohmetrix_output = pd.read_csv("data/cohmetrix-output_ListA.csv")

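# TextID holds the Windows path of each input file; keep only the file name
# without its ".txt" extension, then split it into passage ID and sentence position
sent_ids = [text_id.split("\\")[-1][:-4] for text_id in cohmetrix_output.TextID.to_list()]
cohmetrix_output["passage"] = [s.split("_")[0] for s in sent_ids]
cohmetrix_output["input"] = [int(s.split("_")[1]) for s in sent_ids]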

cohmetrix_output = cohmetrix_output[["passage", "DESWC", "DESWLsy", "WRDFRQc","WRDCNCc","RDFKGL", "input"]].sort_values(["passage", "input"])

cohmetrix_output = cohmetrix_output.rename(columns={
                                 "DESWC":"sentence_length", 
                                 "DESWLsy": "word_length",
                                 "WRDFRQc":"word_frequency_cmx",
                                 "WRDCNCc":"word_concreteness_cmx",
                                 "RDFKGL":"grade_level"})

cohmetrix_output = cohmetrix_output.round(2)

cohmetrix_output.head()

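|  | passage | sentence_length | word_length | word_frequency_cmx | word_concreteness_cmx | grade_level | input |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| 0 | A1 | 10 | 1.5 | 2.14 | 446.33 | 6.01 | 1 |
| 11 | A1 | 9 | 1.44 | 2.95 | 378.2 | 4.96 | 2 |
| 13 | A1 | 12 | 1.25 | 1.88 | 377.0 | 3.84 | 3 |
| 14 | A1 | 11 | 1.36 | 2.07 | 435.33 | 4.8 | 4 |
| 15 | A1 | 11 | 1.54 | 1.96 | 399.5 | 6.93 | 5 |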
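
The ID parsing in Step 3 can be illustrated on a single value. The sketch below assumes a `TextID` that is a Windows path ending in `<passage>_<position>.txt`. The example path and the helper name `parse_text_id` are hypothetical.

```python
def parse_text_id(text_id):
    """Split a Coh-Metrix TextID into (passage ID, sentence position)."""
    file_name = text_id.split("\\")[-1]  # drop the Windows folder path
    stem = file_name[:-4]                # drop the ".txt" extension
    passage, position = stem.split("_")
    return passage, int(position)


print(parse_text_id("C:\\cohmetrix-input_ListA\\A1_12.txt"))  # ('A1', 12)

# Sorting on the integer position gives reading order; sorting the raw
# strings would place "A1_10" before "A1_2".
stems = ["A1_10", "A1_2", "A1_1"]
print(sorted(stems))                                      # ['A1_1', 'A1_10', 'A1_2']
print(sorted(stems, key=lambda s: int(s.split("_")[1])))  # ['A1_1', 'A1_2', 'A1_10']
```

The non-consecutive index values in the output above (0, 11, 13, ...) are consistent with the raw rows arriving in string order of file name, which is why the position is converted to `int` before sorting.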


Step 4: Write the sentences, along with their text features, paragraph, and passage, to Excel sheets.

In [3]:
%%capture

with pd.ExcelWriter("data/ListA_TextProperties.xlsx") as writer:   
    for pas_id in passages:
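        # .copy() gives an independent frame, so the column assignments below
        # do not trigger pandas' SettingWithCopyWarning
        pas_df = cohmetrix_output.query(f"passage == '{pas_id}'").copy()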
        df = pd.read_excel("data/ListA_Sentences.xlsx", sheet_name=pas_id)
        sentences = df.Sentence.to_list()
        pas_df["paragraph"] = [(input_pos - 1) // 5 + 1 for input_pos in pas_df.input]
        pas_df["passage_id"] = pas_id 
        pas_df["sentence"] = sentences
        pas_df = pas_df[["sentence"] + pas_df.columns[1:-1].to_list()]
        pas_df.to_excel(writer, sheet_name = pas_id, index=False)
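
The paragraph number in Step 4 comes from integer division of the sentence position, which amounts to assuming five sentences per paragraph. A small sketch of that mapping follows. The helper `paragraph_number` and its `sentences_per_paragraph` parameter are hypothetical names used only for illustration.

```python
def paragraph_number(input_pos, sentences_per_paragraph=5):
    """Map a 1-based sentence position to a 1-based paragraph number."""
    return (input_pos - 1) // sentences_per_paragraph + 1


print([paragraph_number(pos) for pos in range(1, 12)])
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3]
```

Subtracting 1 before dividing keeps positions 5, 10, 15, ... in the paragraph they close rather than pushing them into the next one.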
