# Normalizing C1 original text

<br>

**Language: Python**

This notebook shows the process used for creating the length-normalized C1 text from the original public IELTS Task 2 C1 text. The original text is a public academic writing sample from the [IELTS website](https://www.ielts.org/en-us/about-the-test/sample-test-questions). The original text is 254 words and the desired length is 250 words (see chapters 5.1 and 5.2 of the dissertation).

**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Text processing](#Text-processing)
- [Syntactic complexity](#Syntactic-complexity)
- [Lexical diversity](#Lexical-diversity)
- [Lexical sophistication](#Lexical-sophistication)
- [Collocation measures](#Collocation-measures)
- [Accuracy](#Accuracy)
- [Normalizing length](#Normalizing-length)

## Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
from ast import literal_eval
from nltk import pos_tag_sents
from pelitk import lex
import joblib
import numpy as np
import math
from collections import Counter

In [2]:
# Set preferred notebook format

%pprint # Turn off pretty printing
InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

Pretty printing has been turned OFF


**Note:** As described in the [README.md](../README.md), the frequency information from COCA referenced here is not freely available but can be purchased at https://corpus.byu.edu/coca. Without these data you can still view a few rows of these dataframes, but you will not be able to run the code yourself. The t-scores and K-bands were also calculated using these data.

In [3]:
# Import necessary dictionaries

coca_freq_dict = joblib.load('../../COCA_data/COCA_2020_lemma_freq_dict.pkl')
coca_word_lemma_dict = joblib.load('../../COCA_data/COCA_2020_word_lemma_dict.pkl')
col_freq_dict = joblib.load('../../COCA_data/COCA_2020_collocate_freq_dict.pkl')
MI_dict = joblib.load('../../COCA_data/COCA_2020_MI_dict.pkl')
tscore_dict = joblib.load('../../COCA_data/COCA_2020_tscore_dict.pkl')
kband_dict = joblib.load('../../COCA_data/COCA_2020_lemma_Kband_dict.pkl') # All items lower-case

In [4]:
# Read in original text (transcribed and with corrected spelling)

with open("../docs/C1_original_corrected.txt", "r") as f:
    C1_orig = f.read()

**Note:** In addition to correcting spelling, contractions were changed to full words, '&' to 'and', '20' to 'twenty', and 'Mr' to 'Mister'.

In [5]:
# Read in modified text

with open("../docs/C1_normalized.txt", "r") as f:
    C1_norm = f.read()

**Note:** The normalized text was modified according to the goals described later under [`Normalizing length`](#Normalizing-length), but it is read in here to avoid running the text processing procedure twice.

In [6]:
# Create dataframe

texts_df = pd.DataFrame({'text_id':pd.Series(['C1_orig','C1_norm']),
                         'text':pd.Series([C1_orig,C1_norm])})

texts_df

Unnamed: 0,text_id,text
0,C1_orig,I do agree to the statement that children brou...
1,C1_norm,I do agree to the statement that children brou...


## Text processing

The tokenizer, part-of-speech tagger, and lemmatizer tools are the same ones used in the creation of the [PELIC](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) corpus. The tokenizer and lemmatizer are not open access but are based on the ones from [NLTK](https://www.nltk.org/), and using the public NLTK tools will yield similar results. For a more detailed description of the modified tools, please see [Naismith et al. (2022)](https://benjamins.com/catalog/ijlcr.21002.nai).

In [7]:
# Change to working directory containing elitools

%cd '../../ELI_Data_Mining/Data-Archive/elitools/'

/Users/Ben/Documents/ELI_Data_Mining/Data-Archive/elitools


In [8]:
# Load lemmatizer

%run -i 'lemmatizer_class.py'
lemmatizer = lemmatizer()

In [9]:
# Load the tokenizer module

%run -i 'tokenizer.py'

In [10]:
# Return to previous working directory

%cd '../../../Collocational_proficiency_Naismith_2022/notebooks'

/Users/Ben/Documents/Collocational_proficiency_Naismith_2022/notebooks


### Tokenization

In [11]:
# Tokenize text (nltk-based)

texts_df['toks'] = texts_df.text.apply(tokenize)

texts_df

Unnamed: 0,text_id,text,toks
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child..."


### POS tagging
- NLTK (PELIC)
- CLAWS7 (COCA)

#### NLTK
As there are only three texts (each with three versions), to avoid any errors, I have used the default NLTK tagger (Penn Treebank tagset), then manually checked and corrected the tags.

In [12]:
# Apply nltk tagger to create series

C1_NLTK = pd.Series(pos_tag_sents(texts_df['toks']))

In [13]:
# Check tags

# Write out tagged texts
C1_NLTK.to_csv('../docs/C1_NLTK.csv', index=False, header=False) 

# Read in the checked tagged texts as a series
C1_NLTK_CHECKED = pd.read_csv("../docs/C1_NLTK_CHECKED.csv", header=None).squeeze("columns")
C1_NLTK_CHECKED = [literal_eval(x) for x in C1_NLTK_CHECKED]
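
The write-out, manual check, and read-back steps rely on each tagged text being stored as the string form of a Python list of tuples, which `literal_eval` can turn back into real objects. A minimal sketch of that round trip, assuming a made-up two-token text and an in-memory buffer standing in for the CSV files:

```python
import csv
import io
from ast import literal_eval

# Made-up tagged text standing in for one row of C1_NLTK
tagged = [("I", "PRP"), ("agree", "VBP")]

# Write the list as a single CSV cell (its string representation)
buffer = io.StringIO()
csv.writer(buffer).writerow([str(tagged)])

# Read it back: the cell is a plain string until literal_eval parses it
buffer.seek(0)
cell = next(csv.reader(buffer))[0]
restored = literal_eval(cell)

print(type(cell).__name__, type(restored).__name__)
print(restored == tagged)
```

Because `literal_eval` only accepts Python literals, a typo introduced while hand-correcting a tag (for example a missing quote) raises an error on read-in rather than passing silently.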




In [14]:
# Create column based on checked tagged texts

texts_df['tok_POS_NLTK'] = C1_NLTK_CHECKED
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (..."


#### CLAWS7

Tagging with CLAWS7 as well makes it easier to match the POS tags to the COCA data, because it avoids converting from the Penn Treebank tagset used by NLTK.

Free CLAWS tagger: http://ucrel-api.lancaster.ac.uk/claws/free.html

Again, these tagged texts should be manually checked prior to use.

In [15]:
# Read in tagged CLAWS texts

with open("../docs/C1_original_CLAWS.txt", "r") as f:
    C1_orig_CLAWS = f.read()

with open("../docs/C1_normalized_CLAWS.txt", "r") as f:
    C1_norm_CLAWS = f.read()

In [16]:
# Remove new line characters, split on whitespace, and remove identifier at end

C1_orig_CLAWS = C1_orig_CLAWS.replace('\n', '').split(' ')[:-2]
C1_norm_CLAWS = C1_norm_CLAWS.replace('\n', '').split(' ')[:-2]

In [17]:
# Change tags into tuples

C1_orig_CLAWS = [tuple(x.split('_')) for x in C1_orig_CLAWS]
C1_norm_CLAWS = [tuple(x.split('_')) for x in C1_norm_CLAWS]
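
To illustrate the two parsing steps above, here is a small sketch on a made-up string in the `word_TAG` format. The sample string is an assumption for illustration only, and it omits the trailing identifier that the real files carry (hence no `[:-2]` slice here).

```python
# Made-up sample in the word_TAG format; the real files come from the CLAWS tagger
raw = "I_PPIS1 do_VD0 agree_VVI ._.\n"

# Same steps as above: drop new lines, split on spaces, split each item on '_'
pairs = [tuple(item.split("_")) for item in raw.replace("\n", "").split(" ")]
print(pairs)

# A token that itself contains an underscore would split into three parts,
# so a length check is a cheap safeguard before building the dataframe column
malformed = [p for p in pairs if len(p) != 2]
print(malformed)
```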

In [18]:
texts_df['toks_POS_CLAWS'] = pd.Series([C1_orig_CLAWS,C1_norm_CLAWS])
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, PPIS1), (do, VD0), (agree, VVI), (to, II)..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, PPIS1), (do, VD0), (agree, VVI), (to, II)..."


### Lemmatization

#### NLTK

In [19]:
# Create lemmatized text column using our lemmatizer loaded earlier

texts_df['lemmas_NLTK'] = texts_df['tok_POS_NLTK'].apply(lemmatizer.lemmatize_text)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, PPIS1), (do, VD0), (agree, VVI), (to, II)...","[I, do, agree, to, the, statement, that, child..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, PPIS1), (do, VD0), (agree, VVI), (to, II)...","[I, do, agree, to, the, statement, that, child..."


#### CLAWS

In [20]:
# Keep only first letter of CLAWS PoS tags

texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0],x[1][0].lower()) for x in row])

In [21]:
# Remove punctuation from CLAWS texts

COCA_POS = sorted(list(set([x[1] for x in coca_freq_dict.keys()])))
texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if x[1] in COCA_POS])

In [22]:
# Check lemmas not in COCA dict

sorted(list(set([x for y in texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if (x[0].lower(),x[1]) not in coca_word_lemma_dict]).to_list() for x in y])))

[("'realities", 'n'), ('Microsoft', 'n')]

In [23]:
# Create CLAWS lemma column

# First lower case all toks (as in the word_lemma dict)
texts_df['lemmas_CLAWS'] = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0].lower(),x[1]) for x in row])

# Then map dict
texts_df.lemmas_CLAWS = texts_df.lemmas_CLAWS.apply(
    lambda row:[coca_word_lemma_dict[x] if x in coca_word_lemma_dict else x for x in row])

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a..."
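
The lemma mapping above falls back to the token itself whenever a `(word, pos)` pair is missing from the dictionary, which is why items such as 'Microsoft' survive unchanged. A minimal sketch of that lookup-with-fallback pattern, assuming a tiny hypothetical dictionary in place of `coca_word_lemma_dict` (its `(lemma, pos)` value shape is also an assumption based on the output shown):

```python
# Hypothetical stand-in for coca_word_lemma_dict: (word, pos) -> (lemma, pos)
toy_word_lemma = {
    ("children", "n"): ("child", "n"),
    ("brought", "v"): ("bring", "v"),
}

tokens = [("Children", "n"), ("brought", "v"), ("Microsoft", "n")]

# Lower-case the word form first, then fall back to the token itself when absent
lowered = [(word.lower(), pos) for word, pos in tokens]
lemmas = [toy_word_lemma.get(tok, tok) for tok in lowered]
print(lemmas)
```

`dict.get(tok, tok)` behaves the same as the conditional expression used in the cell above.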


### Length
Length was counted manually rather than with `len(toks)` or regex-based counting, so that the counts match how words are counted on IELTS tests (where examiners also count manually). These counts often match the word count that Microsoft Word provides.
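
One likely reason token counts differ from manual word counts is that tokenizers emit punctuation marks as separate tokens. The sketch below uses a made-up token list to show the gap. It is only an illustration and not how the lengths in this notebook were obtained, since those were counted by hand.

```python
# Made-up token list in the style of an NLTK tokenizer's output
toks = ["Children", ",", "in", "my", "view", ",", "learn", "quickly", "."]

# Count only tokens that contain at least one letter or digit
word_like = [t for t in toks if any(ch.isalnum() for ch in t)]
print(len(toks), len(word_like))
```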

In [24]:
# Create dictionary

text_len = {'C1_orig':254,'C1_norm':250}

In [25]:
# Create length column

texts_df['text_len'] = texts_df.text_id.map(text_len)
texts_df['text_len'] = texts_df['text_len'].astype(int)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250


## Syntactic complexity

Analysis uses [TAASSC](https://www.linguisticanalysistools.org/taassc.html) to calculate the measures from Lu's (2010) [Syntactic Complexity Analyzer](https://aihaiyang.com/software/). Based on previous research, the focus is on the two metrics most important for predicting proficiency: number of complex nominals per clause (CN/C) and mean length of clause (MLC).

In [26]:
# Read in TAASSC analysis file

TAASSC = pd.read_csv("../docs/C1_TAASSC_sca.csv")

In [27]:
# Rename files to match texts_df

file_names = {'C1_original_corrected.txt':'C1_orig','C1_normalized.txt':'C1_norm'}
TAASSC.filename = TAASSC.filename.map(file_names)
TAASSC = TAASSC.loc[~TAASSC.filename.isnull()]
TAASSC

Unnamed: 0,filename,nwords,MLS,MLT,MLC,C_S,VP_T,C_T,DC_C,DC_T,T_S,CT_T,CP_T,CP_C,CN_T,CN_C
0,C1_orig,259,16.1875,15.235294,11.772727,1.375,1.823529,1.294118,0.227273,0.294118,1.0625,0.294118,0.470588,0.363636,2.647059,2.045455
5,C1_norm,255,15.9375,15.0,11.590909,1.375,1.823529,1.294118,0.227273,0.294118,1.0625,0.294118,0.411765,0.318182,2.647059,2.045455


In [28]:
# Keep only relevant syntactic complexity columns and rename them

TAASSC = TAASSC[['filename','MLC','CN_C']]
TAASSC = TAASSC.rename(columns={"filename": "text_id",'CN_C':'CNC'})
TAASSC

Unnamed: 0,text_id,MLC,CNC
0,C1_orig,11.772727,2.045455
5,C1_norm,11.590909,2.045455
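
Both retained measures are simple ratios: MLC is words divided by clauses, and CN/C is complex nominals divided by clauses. The sketch below reproduces the values above. Note that the clause and complex-nominal counts (22 and 45) do not appear in the TAASSC output shown here; they are back-calculated from the reported ratios and `nwords`, so treat them as assumptions.

```python
# nwords comes from the TAASSC output; clauses and complex_nominals are
# inferred from the reported ratios, not read from the file
counts = {
    "C1_orig": {"nwords": 259, "clauses": 22, "complex_nominals": 45},
    "C1_norm": {"nwords": 255, "clauses": 22, "complex_nominals": 45},
}

results = {}
for text_id, c in counts.items():
    mlc = c["nwords"] / c["clauses"]            # mean length of clause
    cnc = c["complex_nominals"] / c["clauses"]  # complex nominals per clause
    results[text_id] = (round(mlc, 6), round(cnc, 6))
    print(text_id, results[text_id])
```

Under these assumed counts, removing four words lowers MLC while leaving CN/C unchanged, which is consistent with the two rows above.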


In [29]:
# Merge TAASSC data with texts_df

texts_df = pd.merge(texts_df, TAASSC, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455


## Lexical diversity

Lexical diversity is measured with vocD (calculated on lemmas), using functions from [PELITK](https://github.com/ELI-Data-Mining-Group/pelitk).
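
For readers unfamiliar with vocD, the general idea is to draw random samples of 35 to 50 tokens, average the type-token ratio (TTR) at each sample size, and find the parameter D whose theoretical curve TTR = (D/N)(sqrt(1 + 2N/D) - 1) best fits those averages. The following is a simplified sketch of that idea, not the PELITK implementation: the grid search, the single fitting pass, and the function names are all assumptions made for illustration.

```python
import math
import random

def model_ttr(n, d):
    """Expected type-token ratio for a sample of n tokens under parameter D."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Grid-search the D whose model curve best fits the sampled TTRs."""
    rng = random.Random(seed)
    observed = {}
    for n in sizes:
        # Average TTR over random samples (without replacement) of n tokens
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        observed[n] = sum(ttrs) / trials

    def squared_error(d):
        return sum((observed[n] - model_ttr(n, d)) ** 2 for n in sizes)

    candidates = [step / 10 for step in range(10, 2001)]  # D from 1.0 to 200.0
    return min(candidates, key=squared_error)

# Made-up token lists: maximal repetition versus no repetition
repetitive = ["the"] * 60
varied = [f"w{i}" for i in range(60)]
print(estimate_d(repetitive), estimate_d(varied))
```

The two extreme inputs land at opposite ends of the search grid, showing that a higher D corresponds to a more diverse text. Real texts fall in between, and their estimates vary slightly from run to run because of the random sampling.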

In [30]:
# Remove punctuation before calculating

punctuation = ['.','!','?',';',':','#','"',"'",'``','`',',','--','-','...',')','(',"''"]

texts_df['vocD'] = texts_df.toks.apply(lambda row: [x for x in row if x not in punctuation])

In [31]:
# Create vocD column from the NLTK lemmas (this replaces the filtered token list stored in 'vocD' above)

texts_df['vocD'] = texts_df.lemmas_NLTK.apply(lex.vocd)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455,70.99553
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455,70.775894


## Lexical sophistication

### Advanced Guiraud (AG)
AG is calculated on lemmas using a frequency list (PSL3) compiled from the PELIC learner corpus (see dissertation section 2.2.2).
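
Advanced Guiraud divides the number of advanced types (types not on the frequency list) by the square root of the number of tokens. A minimal sketch under that definition, assuming a made-up five-word list in place of PSL3; the function name is hypothetical, and details such as case handling in `lex.adv_guiraud` may differ.

```python
import math

def advanced_guiraud(lemmas, common_words):
    """Advanced types divided by the square root of the token count."""
    advanced_types = {w for w in lemmas if w not in common_words}
    return len(advanced_types) / math.sqrt(len(lemmas))

# Made-up stand-in for the PSL3 list and a made-up lemma sequence
common = {"the", "child", "learn", "to", "be"}
lemmas = ["the", "child", "learn", "to", "be", "frugal", "the", "child", "persevere"]

ag = advanced_guiraud(lemmas, common)
print(round(ag, 4))
```

Here two advanced types over nine tokens give 2 / 3. The square root in the denominator is intended to dampen the effect of text length.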

In [32]:
# Read in PSL3 list for manual checking of items in texts that are off list

with open("../docs/psl3.txt", "r") as f:
    PSL3 = f.read()
PSL3 = sorted(PSL3.split('\n'))
len(PSL3)
PSL3[-10:]

2000

['yesterday', 'yet', 'yogurt', 'you', 'young', 'your', 'yours', 'yourself', 'youth', 'zoo']

In [33]:
# Create AG column (punctuation removed)

texts_df['AG'] = texts_df.lemmas_NLTK.apply(lambda row: [x for x in row if x not in punctuation]).apply(
    lex.adv_guiraud,freq_list = 'PSL3')

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455,70.99553,1.443148
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455,70.775894,1.454648


### Contextual diversity

Analysis using [TAALES](https://www.linguisticanalysistools.org/taales.html). Based on previous research, one metric is the focus: contextual diversity as in Monteiro et al. (2018).
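
As a rough illustration of what a range (contextual diversity) score expresses, the sketch below computes the proportion of reference documents that contain each word and averages these proportions over a text. The four-document corpus is made up, and the exact procedure TAALES applies to COCA Academic (for example, how it treats words missing from the corpus) is not reproduced here.

```python
def word_range(word, documents):
    """Proportion of reference documents in which the word occurs."""
    containing = sum(1 for doc in documents if word in doc)
    return containing / len(documents)

# Made-up reference corpus: each document reduced to its set of words
docs = [
    {"the", "study", "shows"},
    {"the", "data", "shows"},
    {"the", "children", "learn"},
    {"study", "of", "children"},
]

text = ["the", "study", "children"]
mean_range = sum(word_range(w, docs) for w in text) / len(text)
print(round(mean_range, 4))
```

The same averaging can be applied to bigrams or trigrams by checking n-gram membership instead of single words.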

In [34]:
# Read in TAALES analysis

TAALES = pd.read_csv("../docs/C1_TAALES.csv")

In [35]:
# Rename files to match texts_df

TAALES.Filename = TAALES.Filename.map(file_names)
TAALES = TAALES.loc[~TAALES.Filename.isnull()]

In [36]:
# Keep only relevant contextual diversity columns and rename them

TAALES = TAALES[['Filename','COCA_Academic_Range_AW','COCA_Academic_Bigram_Range','COCA_Academic_Trigram_Range']]
TAALES = TAALES.rename(columns={"Filename": "text_id",'COCA_Academic_Range_AW':'unigram_range',
                                'COCA_Academic_Bigram_Range':'bigram_range',
                                'COCA_Academic_Trigram_Range':'trigram_range'})
TAALES

Unnamed: 0,text_id,unigram_range,bigram_range,trigram_range
1,C1_orig,0.590773,0.122303,0.019143
3,C1_norm,0.588858,0.117178,0.018722


In [37]:
# Merge TAALES data with texts_df

texts_df = pd.merge(texts_df, TAALES, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455,70.99553,1.443148,0.590773,0.122303,0.019143
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455,70.775894,1.454648,0.588858,0.117178,0.018722


## Collocation measures
Three measures that make up the 'CollGram' profile from Granger & Bestgen (2014) and Bestgen & Granger (2014):
- mean MI
- mean t-score
- proportion of bigrams absent from reference corpus

In [38]:
# Extract potential collocations in span 4

def find_cols(lemma_list):
    col_list = list(zip(lemma_list,lemma_list[1:]))+list(zip(lemma_list,lemma_list[2:]))\
    +list(zip(lemma_list,lemma_list[3:]))+list(zip(lemma_list,lemma_list[4:]))
    return col_list
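
To make the span concrete, the sketch below runs the same function on a made-up six-lemma list. Each lemma is paired with the one to four lemmas that follow it, so pairs are ordered (left item first) and never reach further than four positions ahead.

```python
def find_cols(lemma_list):
    # Pair each item with the items 1, 2, 3 and 4 positions to its right
    col_list = (list(zip(lemma_list, lemma_list[1:]))
                + list(zip(lemma_list, lemma_list[2:]))
                + list(zip(lemma_list, lemma_list[3:]))
                + list(zip(lemma_list, lemma_list[4:])))
    return col_list

# Made-up lemma list for illustration
sample = ["child", "learn", "to", "survive", "early", "on"]
pairs = find_cols(sample)

print(len(pairs))   # 5 + 4 + 3 + 2 pairs for six items
print(pairs[:5])    # the adjacent pairs come first
```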

In [39]:
# Create possible collocations column

texts_df['possible_cols'] = texts_df.lemmas_CLAWS.apply(find_cols)

In [40]:
# Lower-case (doesn't matter that 'I' gets lowered as not in collocate dict)

texts_df['possible_cols'] = texts_df.possible_cols.apply(
    lambda row: [((x[0][0].lower(),x[0][1]),(x[1][0].lower(),x[1][1])) for x in row])

In [41]:
# Create list of all possible collocations

possible_cols = sorted(list(set([x for y in texts_df.possible_cols.to_list() for x in y])))
possible_cols[:5]
possible_cols[-5:]
len(possible_cols)

[(("'realities", 'n'), ('in', 'i')), (("'realities", 'n'), ('life', 'n')), (("'realities", 'n'), ('of', 'i')), (("'realities", 'n'), ('their', 'a')), (('a', 'a'), ('a', 'a'))]

[(('world', 'n'), ('organization', 'n')), (('would', 'v'), ('be', 'v')), (('would', 'v'), ('bill', 'n')), (('would', 'v'), ('gate', 'n')), (('would', 'v'), ('mister', 'n'))]

941

### Mean MI

MI is not calculated for any bigrams with freq less than 5 or MI less than 1.
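
The MI values themselves are pre-computed and loaded from `MI_dict`, so the calculation is not shown in this notebook. For orientation, the sketch below uses the textbook formula MI = log2(observed / expected), with expected = f1 × f2 / N. The counts are made up, and any span adjustment used when the dictionary was built is not reflected.

```python
import math

def mutual_information(pair_freq, freq_1, freq_2, corpus_size):
    """Textbook MI: log2 of observed over expected co-occurrence frequency."""
    expected = freq_1 * freq_2 / corpus_size
    return math.log2(pair_freq / expected)

# Made-up counts: the pair occurs 50 times in a 1,000,000-word corpus
mi = mutual_information(50, 1000, 2000, 1_000_000)
print(round(mi, 4))

# Thresholds described above: frequency of at least 5 and MI of at least 1
keeps = 50 >= 5 and mi >= 1
print(keeps)
```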

In [42]:
# Create column with MI for each possible collocation in MI dict

texts_df['col_MI'] = texts_df.possible_cols.apply(lambda row: [(x,MI_dict[x]) for x in row if x in MI_dict])

In [43]:
# Find mean MI for each text (token-based)

texts_df['mean_MI'] = texts_df.col_MI.apply(lambda row: np.mean([x[1] for x in row]))

### Proportion of absent/low MI word combinations

In [44]:
# Create column of two-word combinations not in collocation dict

texts_df['absent'] = texts_df.possible_cols.apply(lambda row: [x for x in row if x not in col_freq_dict])

In [45]:
# Find proportion of absent two-word combinations compared to total two-word combinations in the text

texts_df['absent_prop'] = texts_df.absent.apply(lambda row: len(row)) / texts_df.possible_cols.apply(lambda row: len(row))

In [46]:
# Find proportion of absent two-word combination types compared to total two-word combination types in the text

texts_df['absent_prop_types'] = texts_df.absent.apply(lambda row: len(set(row))) / texts_df.possible_cols.apply(lambda row: len(set(row)))
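
The token-based and type-based proportions differ only in whether repeated combinations are counted once or every time they occur. A small sketch with made-up pairs, where the `reference` set is a hypothetical stand-in for the keys of `col_freq_dict`:

```python
# Made-up combinations; the repeated pair counts twice as a token, once as a type
possible = [("poor", "family"), ("family", "learn"), ("poor", "family"), ("learn", "zxq")]
reference = {("poor", "family")}  # hypothetical stand-in for col_freq_dict keys

absent = [pair for pair in possible if pair not in reference]

token_prop = len(absent) / len(possible)           # 2 of 4 tokens
type_prop = len(set(absent)) / len(set(possible))  # 2 of 3 types
print(token_prop, round(type_prop, 4))
```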

### Mean t-scores

In [47]:
# Create column with t-score for each bigram

texts_df['col_tscore'] = texts_df.possible_cols.apply(lambda row: [(x,tscore_dict[x]) for x in row if x in tscore_dict])

In [48]:
# Find mean t-score for each text based on tokens and types

texts_df['mean_tscore'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in row]))
texts_df['mean_tscore_types'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in set(row)]))
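
As with MI, the t-scores are loaded pre-computed from `tscore_dict`. The sketch below uses the textbook formula t = (observed - expected) / sqrt(observed) with made-up counts, and is an assumption about the general approach rather than the exact calculation behind the dictionary.

```python
import math

def t_score(pair_freq, freq_1, freq_2, corpus_size):
    """Textbook t-score: (observed - expected) / sqrt(observed)."""
    expected = freq_1 * freq_2 / corpus_size
    return (pair_freq - expected) / math.sqrt(pair_freq)

# Made-up counts in a 1,000,000-word corpus
frequent_pair = t_score(50, 1000, 2000, 1_000_000)
rare_pair = t_score(5, 10, 20, 1_000_000)
print(round(frequent_pair, 4), round(rare_pair, 4))
```

Because the score grows with the observed frequency, frequent combinations receive higher t-scores than rare ones, even when the rare pair's words almost always occur together.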

In [49]:
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455,70.99553,1.443148,0.590773,0.122303,0.019143,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.25973,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.926441,0.928879,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",93.616959,96.967682
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455,70.775894,1.454648,0.588858,0.117178,0.018722,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.229444,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.927273,0.930055,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",92.030778,95.287937


## Accuracy
Grammatical accuracy and collocational accuracy. Errors were manually annotated and counted.

### Grammatical accuracy

In [50]:
# Create lists of manually identified errors

C1_orig_grammar = ['are taught necessary skills','families also are highly motivated']
C1_norm_grammar = ['are taught necessary skills','families also are highly motivated']

len(C1_orig_grammar)
len(C1_norm_grammar)

2

2

In [51]:
# Create grammatical accuracy column

grammar_dict = {'C1_orig':len(C1_orig_grammar),'C1_norm':len(C1_norm_grammar)}

texts_df['grammar_errors'] = texts_df.text_id.map(grammar_dict)
texts_df['grammar_errors_per_100'] = (texts_df.grammar_errors/texts_df.text_len)*100

In [52]:
# Add punctuation column (manually counted)

punc_dict = {'C1_orig':8,'C1_norm':8}

texts_df['punc_errors'] = texts_df.text_id.map(punc_dict)
texts_df['punc_errors_per_100'] = (texts_df.punc_errors/texts_df.text_len)*100
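
Normalizing to errors per 100 words keeps the two versions comparable despite their different lengths. The arithmetic for the grammar and punctuation counts above works out as follows (the helper name is only for illustration):

```python
def per_100_words(count, text_len):
    """Scale a raw count to a rate per 100 words."""
    return count / text_len * 100

rates = {}
for text_id, text_len in {"C1_orig": 254, "C1_norm": 250}.items():
    grammar = round(per_100_words(2, text_len), 6)  # 2 grammar errors
    punc = round(per_100_words(8, text_len), 6)     # 8 punctuation errors
    rates[text_id] = (grammar, punc)
    print(text_id, grammar, punc)
```

The same error counts yield slightly higher rates for the shorter normalized text.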

### Collocational accuracy

In [53]:
# Record of collocation errors. MI calculations not comparable with two and three word collocations.

C1_orig_errors = ['agree to the statement','in the weekends','collect some pocket money','computer organization','in summing up']
C1_norm_errors = ['agree to the statement','in the weekends','collect some pocket money','computer organization','in summing up']

len(C1_orig_errors)
len(C1_norm_errors)

5

5

In [54]:
# Accurate collocations (manually annotated)

C1_orig_cols = ['poor families ','poor parents','are (prematurely) exposed to','learning to survive','low family income','sacrificing luxuries for','essential items','realities of life','home or social environment','serve as an example','taught necessary skills','skills for survival','from a very early age','contribute to','good example','accompany their parents','sell produce','at the market','in terms of','highly motivated','set high goals','improve their economic and social situation','A relevant example','founder of','impoverished background','poor backgrounds','robbed of their childhood','feel cheated','turn to crime','early exposure','family role models','direct contribution','sheer motivation']
C1_norm_cols = ['poor families ','poor parents','are (prematurely) exposed to','learning to survive','low family income','sacrificing luxuries for','essential items','realities of life','home or social environment','serve as an example','taught necessary skills','skills for survival','from a very early age','contribute to','good example','accompany their parents','sell produce','at the market','in terms of','highly motivated','set high goals','improve their economic and social situation','A relevant example','founder of','impoverished background','poor backgrounds','robbed of their childhood','feel cheated','turn to crime','early exposure','family role models','direct contribution','sheer motivation']

len(C1_orig_cols)
len(C1_norm_cols)

33

33

In [55]:
# Create error and accurate cols columns

errors_dict = {'C1_orig':len(C1_orig_errors),'C1_norm':len(C1_norm_errors)}
correct_dict = {'C1_orig':len(C1_orig_cols),'C1_norm':len(C1_norm_cols)}

texts_df['col_errors'] = texts_df.text_id.map(errors_dict)
texts_df['correct_cols'] = texts_df.text_id.map(correct_dict)

In [56]:
# Create errors and correct cols per 100 words columns

texts_df['col_errors_per_100'] = (texts_df.col_errors/texts_df.text_len)*100
texts_df['correct_cols_per_100'] = (texts_df.correct_cols/texts_df.text_len)*100

In [57]:
# Create 'bad' cols column for future use

texts_df['bad_cols'] = (C1_orig_errors,C1_norm_errors)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types,grammar_errors,grammar_errors_per_100,punc_errors,punc_errors_per_100,col_errors,correct_cols,col_errors_per_100,correct_cols_per_100,bad_cols
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.772727,2.045455,70.99553,1.443148,0.590773,0.122303,0.019143,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.25973,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.926441,0.928879,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",93.616959,96.967682,2,0.787402,8,3.149606,5,33,1.968504,12.992126,"[agree to the statement, in the weekends, coll..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.590909,2.045455,70.775894,1.454648,0.588858,0.117178,0.018722,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.229444,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.927273,0.930055,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",92.030778,95.287937,2,0.8,8,3.2,5,33,2.0,13.2,"[agree to the statement, in the weekends, coll..."


### Collocation frequency bands
Percentage of collocations containing low/mid/high freq items.  
- High = K1-2
- Mid = K3-9
- Low = K10+

In [58]:
# Tokenize collocations

C1_orig_cols_toks = [x.split() for x in C1_orig_cols]
C1_norm_cols_toks = [x.split() for x in C1_norm_cols]

In [59]:
C1_orig_cols_toks
C1_norm_cols_toks

[['poor', 'families'], ['poor', 'parents'], ['are', '(prematurely)', 'exposed', 'to'], ['learning', 'to', 'survive'], ['low', 'family', 'income'], ['sacrificing', 'luxuries', 'for'], ['essential', 'items'], ['realities', 'of', 'life'], ['home', 'or', 'social', 'environment'], ['serve', 'as', 'an', 'example'], ['taught', 'necessary', 'skills'], ['skills', 'for', 'survival'], ['from', 'a', 'very', 'early', 'age'], ['contribute', 'to'], ['good', 'example'], ['accompany', 'their', 'parents'], ['sell', 'produce'], ['at', 'the', 'market'], ['in', 'terms', 'of'], ['highly', 'motivated'], ['set', 'high', 'goals'], ['improve', 'their', 'economic', 'and', 'social', 'situation'], ['A', 'relevant', 'example'], ['founder', 'of'], ['impoverished', 'background'], ['poor', 'backgrounds'], ['robbed', 'of', 'their', 'childhood'], ['feel', 'cheated'], ['turn', 'to', 'crime'], ['early', 'exposure'], ['family', 'role', 'models'], ['direct', 'contribution'], ['sheer', 'motivation']]

[['poor', 'families'], ['poor', 'parents'], ['are', '(prematurely)', 'exposed', 'to'], ['learning', 'to', 'survive'], ['low', 'family', 'income'], ['sacrificing', 'luxuries', 'for'], ['essential', 'items'], ['realities', 'of', 'life'], ['home', 'or', 'social', 'environment'], ['serve', 'as', 'an', 'example'], ['taught', 'necessary', 'skills'], ['skills', 'for', 'survival'], ['from', 'a', 'very', 'early', 'age'], ['contribute', 'to'], ['good', 'example'], ['accompany', 'their', 'parents'], ['sell', 'produce'], ['at', 'the', 'market'], ['in', 'terms', 'of'], ['highly', 'motivated'], ['set', 'high', 'goals'], ['improve', 'their', 'economic', 'and', 'social', 'situation'], ['A', 'relevant', 'example'], ['founder', 'of'], ['impoverished', 'background'], ['poor', 'backgrounds'], ['robbed', 'of', 'their', 'childhood'], ['feel', 'cheated'], ['turn', 'to', 'crime'], ['early', 'exposure'], ['family', 'role', 'models'], ['direct', 'contribution'], ['sheer', 'motivation']]

In [60]:
# Collocations with PoS (manually lemmatized and tagged based on above)

C1_orig_cols_toks_POS = [[('poor','j'), ('family','n')],
                         [('poor','j'), ('parent','n')],
                         [('be','v'), ('prematurely','r'), ('expose','v'), ('to','i')],
                         [('learn','v'), ('to','t'), ('survive','v')],
                         [('low','j'), ('family','n'), ('income','n')],
                         [('sacrifice','v'), ('luxury','n'), ('for','i')],
                         [('essential','j'), ('item','n')],
                         [('reality','n'), ('of','i'), ('life','n')],
                         [('home','n'), ('or','c'), ('social','j'), ('environment','n')],
                         [('serve','v'), ('as','i'), ('a','a'), ('example','n')],
                         [('teach','v'), ('necessary','j'), ('skill','n')],
                         [('skill','n'), ('for','i'), ('survival','n')],
                         [('from','i'), ('a','a'), ('very','r'), ('early','j'), ('age','n')],
                         [('contribute','v'), ('to','i')],
                         [('good','j'), ('example','n')],
                         [('accompany','v'), ('their','a'), ('parent','n')],
                         [('sell','v'), ('produce','n')],
                         [('at','i'), ('the','a'), ('market','n')],
                         [('in','i'), ('term','n'), ('of','i')],
                         [('highly','r'), ('motivated','j')],
                         [('set','v'), ('high','j'), ('goal','n')],
                         [('improve','v'), ('their','a'), ('economic','j'), ('and','c'), ('social','j'), ('situation','n')],
                         [('a','a'), ('relevant','j'), ('example','n')],
                         [('founder','n'), ('of','i')],
                         [('impoverished','j'), ('background','n')],
                         [('poor','j'), ('background','n')],
                         [('rob','v'),('of','i'), ('their','a'), ('childhood','n')],
                         [('feel','v'), ('cheat','v')],
                         [('turn','v'), ('to','i'), ('crime','n')],
                         [('early','j'), ('exposure','n')],
                         [('family','n'), ('role','n'), ('model','n')],
                         [('direct','j'), ('contribution','n')],
                         [('sheer','j'), ('motivation','n')]]

# The normalized text uses the same collocation list as the original
C1_norm_cols_toks_POS = C1_orig_cols_toks_POS
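Note that the assignment above binds both names to the same list object rather than making a copy. This is harmless as long as neither list is modified in place. The following minimal sketch, using a shortened illustrative list and hypothetical variable names, shows the difference and how `copy.deepcopy` would give an independent copy if one were ever needed.

```python
import copy

# Shortened, illustrative stand-in for the full collocation list
orig_cols = [[('poor', 'j'), ('family', 'n')],
             [('contribute', 'v'), ('to', 'i')]]

alias_cols = orig_cols                  # same object, as in the cell above
copied_cols = copy.deepcopy(orig_cols)  # independent copy

# Modifying the original in place also changes the alias, but not the copy
orig_cols.append([('good', 'j'), ('example', 'n')])

print(alias_cols is orig_cols)  # True
print(len(alias_cols))          # 3
print(len(copied_cols))         # 2
```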

In [61]:
# Create collocation dict

col_dict = {'C1_orig':C1_orig_cols_toks_POS,'C1_norm':C1_norm_cols_toks_POS}

# Create column with collocations

texts_df['cols'] = texts_df.text_id.map(col_dict)

In [62]:
# Create column with the highest K-band (i.e. the band of the least frequent item) in each collocation

texts_df['col_kband'] = texts_df.cols.apply(
    lambda row: [max(kband_dict[y] for y in x) for x in row])
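As an illustration of what this cell computes, the sketch below uses a small, made-up `toy_kband_dict` (the real `kband_dict` is built from the COCA data and is not reproduced here) and takes the highest K-band among the items of each collocation. All names prefixed with `toy_` are hypothetical.

```python
# Hypothetical K-band values for illustration only; not taken from COCA
toy_kband_dict = {
    ('poor', 'j'): 1, ('family', 'n'): 1,
    ('sacrifice', 'v'): 4, ('luxury', 'n'): 7, ('for', 'i'): 1,
}

toy_cols = [
    [('poor', 'j'), ('family', 'n')],
    [('sacrifice', 'v'), ('luxury', 'n'), ('for', 'i')],
]

# Highest K-band per collocation = band of its least frequent item
toy_col_kband = [max(toy_kband_dict[item] for item in col) for col in toy_cols]
print(toy_col_kband)  # [1, 7]

# Pair each band with its collocation
toy_kband_cols = list(zip(toy_col_kband, toy_cols))
```

Using the highest band means that a collocation counts as low frequency as soon as a single item in it is rare, even if the other items are very common.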

In [63]:
# Create (kband, cols) tuples

texts_df['kband_cols'] = list(zip(texts_df.col_kband,texts_df.cols))
texts_df['kband_cols'] = texts_df['kband_cols'].apply(lambda row: list(zip(row[0],row[1])))

In [64]:
# Group Kbands

high_freq_K = list(range(1,3))
mid_freq_K = list(range(3,10))
low_freq_K = list(range(10,101))

In [65]:
# Create columns with counts of cols whose highest kband item is low, mid, or high frequency

texts_df['K10to16_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in low_freq_K]))
texts_df['K3to9_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in mid_freq_K]))
texts_df['K1to2_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in high_freq_K]))

In [66]:
# Add proportion columns (share of all cols that fall in each frequency group)

texts_df['K10to16_p'] = texts_df['K10to16_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])
texts_df['K3to9_p'] = texts_df['K3to9_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])
texts_df['K1to2_p'] = texts_df['K1to2_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])
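As a quick check of the arithmetic, the counts reported for both texts in the dataframe output of this notebook (3 low-, 12 mid-, and 18 high-frequency cols) can be converted to proportions by hand. The variable names in this sketch are illustrative.

```python
# Counts taken from the dataframe output for C1_orig and C1_norm
counts = {'K10to16': 3, 'K3to9': 12, 'K1to2': 18}

total = sum(counts.values())  # 33 collocations in total
proportions = {band: round(n / total, 3) for band, n in counts.items()}

print(proportions)
# {'K10to16': 0.091, 'K3to9': 0.364, 'K1to2': 0.545}
```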

In [67]:
# Create separate columns with low/mid/high cols for ease of viewing

texts_df['K1to2_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] <= 2])
texts_df['K3to9_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if 10 > x[0] > 2])
texts_df['K10to16_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] > 9])

In [68]:
# Round all stats to 3 decimal places for ease of use

texts_df = round(texts_df,3)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types,grammar_errors,grammar_errors_per_100,punc_errors,punc_errors_per_100,col_errors,correct_cols,col_errors_per_100,correct_cols_per_100,bad_cols,cols,col_kband,kband_cols,K10to16_cols,K3to9_cols,K1to2_cols,K10to16_p,K3to9_p,K1to2_p,K1to2_cols_K,K3to9_cols_K,K10to16_cols_K
0,C1_orig,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",254,11.773,2.045,70.996,1.443,0.591,0.122,0.019,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.26,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.926,0.929,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",93.617,96.968,2,0.787,8,3.15,5,33,1.969,12.992,"[agree to the statement, in the weekends, coll...","[[(poor, j), (family, n)], [(poor, j), (parent...","[1, 1, 14, 2, 2, 7, 3, 1, 1, 1, 2, 3, 1, 2, 1,...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...",3,12,18,0.091,0.364,0.545,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex..."
1,C1_norm,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, PRP), (do, VBP), (agree, VB), (to, IN), (...","[(I, p), (do, v), (agree, v), (to, i), (the, a...","[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",250,11.591,2.045,70.776,1.455,0.589,0.117,0.019,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.229,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.927,0.93,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",92.031,95.288,2,0.8,8,3.2,5,33,2.0,13.2,"[agree to the statement, in the weekends, coll...","[[(poor, j), (family, n)], [(poor, j), (parent...","[1, 1, 14, 2, 2, 7, 3, 1, 1, 1, 2, 3, 1, 2, 1,...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...",3,12,18,0.091,0.364,0.545,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex..."


## Normalizing length

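The cells below compare each statistic of the normalized text (`C1_norm`) with that of the original (`C1_orig`). For each statistic, the output shows the `[original, normalized]` values followed by the range within 5% of the original value.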

In [69]:
# Create function for finding the range within +/- 5% of a statistic (bounds rounded to 2 decimal places)

def find_range(stat):
    low = str(round(stat*.95,2))
    high = str(round(stat*1.05,2))
    stat_range = low + ' - ' + high
    return stat_range
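Because `find_range` returns a string, the comparison against the normalized value is done by eye. A numeric check could look like the sketch below. `within_tolerance` is a hypothetical helper, not part of the original notebook, and it compares against the unrounded bounds.

```python
def within_tolerance(orig, mod, tol=0.05):
    """Return True if mod lies within +/- tol (as a fraction) of orig."""
    # sorted() keeps the bounds in order even if orig is negative
    low, high = sorted((orig * (1 - tol), orig * (1 + tol)))
    return low <= mod <= high

# Values taken from the outputs in this notebook: (orig, norm)
print(within_tolerance(70.996, 70.776))  # vocD -> True
print(within_tolerance(0.122, 0.117))    # bigram range -> True
print(within_tolerance(100, 106))        # made-up example -> False
```

Comparing against unrounded bounds avoids borderline cases where rounding the bounds to 2 decimal places makes a small statistic appear to fall outside the range.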

In [70]:
# vocD: mod within 5% of orig (note that vocD changes slightly each time it is calculated because it is based on samples)

C1_vocd = texts_df['vocD'].to_list()
C1_vocd 

find_range(C1_vocd[0])

[70.996, 70.776]

'67.45 - 74.55'

In [71]:
# AG (PSL): mod within 5% of orig

C1_AG = texts_df['AG'].to_list()
C1_AG

find_range(C1_AG[0])

[1.443, 1.455]

'1.37 - 1.52'

In [72]:
# Mean MI (words): mod within 5% of orig

C1_mean_MI = texts_df['mean_MI'].to_list()
C1_mean_MI

find_range(C1_mean_MI[0])

[3.26, 3.229]

'3.1 - 3.42'

In [73]:
# Mean t-score (words): mod within 5% of orig

C1_mean_t_score = texts_df['mean_tscore'].to_list()
C1_mean_t_score

find_range(C1_mean_t_score[0])

[93.617, 92.031]

'88.94 - 98.3'

In [74]:
# Proportion of absent bigrams (words): mod within 5% of orig or closest possible

C1_absent_prop = texts_df['absent_prop'].to_list()
C1_absent_prop

find_range(C1_absent_prop[0])

[0.926, 0.927]

'0.88 - 0.97'

In [75]:
# Grammar errors per 100: mod within 5% of orig

C1_grammar_errors_per_100 = texts_df['grammar_errors_per_100'].to_list()
C1_grammar_errors_per_100

find_range(C1_grammar_errors_per_100[0])

[0.787, 0.8]

'0.75 - 0.83'

In [76]:
# Punctuation errors per 100: mod within 5% of orig

C1_punc_errors_per_100 = texts_df['punc_errors_per_100'].to_list()
C1_punc_errors_per_100

find_range(C1_punc_errors_per_100[0])

[3.15, 3.2]

'2.99 - 3.31'

In [77]:
# Collocation errors per 100: mod within 5% of orig

C1_col_errors_per_100 = texts_df['col_errors_per_100'].to_list()
C1_col_errors_per_100

find_range(C1_col_errors_per_100[0])

[1.969, 2.0]

'1.87 - 2.07'

In [78]:
# Correct cols per 100: mod within 5% of orig

C1_correct_cols_per_100 = texts_df['correct_cols_per_100'].to_list()
C1_correct_cols_per_100

find_range(C1_correct_cols_per_100[0])

[12.992, 13.2]

'12.34 - 13.64'

In [79]:
# K10-16 cols proportion: mod within 5% of orig

C1_K10to16_p = texts_df['K10to16_p'].to_list()
C1_K10to16_p

find_range(C1_K10to16_p[0])

[0.091, 0.091]

'0.09 - 0.1'

In [80]:
# K3-9 cols proportion: mod within 5% of orig

C1_K3to9_p = texts_df['K3to9_p'].to_list()
C1_K3to9_p

find_range(C1_K3to9_p[0])

[0.364, 0.364]

'0.35 - 0.38'

In [81]:
# K1-2 cols proportion: mod within 5% of orig

C1_K1to2_p = texts_df['K1to2_p'].to_list()
C1_K1to2_p

find_range(C1_K1to2_p[0])

[0.545, 0.545]

'0.52 - 0.57'

In [82]:
# Bigram range: mod within 5% of orig (unrounded bounds are 0.116 - 0.128; find_range rounds them to 2 decimal places)

C1_bigram_range = texts_df['bigram_range'].to_list()
C1_bigram_range

find_range(C1_bigram_range[0])

[0.122, 0.117]

'0.12 - 0.13'

In [83]:
# CNC: mod within 5% of orig

C1_CNC = texts_df['CNC'].to_list()
C1_CNC

find_range(C1_CNC[0])

[2.045, 2.045]

'1.94 - 2.15'

In [84]:
# MLC: mod within 5% of orig

C1_MLC = texts_df['MLC'].to_list()
C1_MLC

find_range(C1_MLC[0])

[11.773, 11.591]

'11.18 - 12.36'
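The individual checks above can also be summarized in one pass. The sketch below hard-codes the `(orig, norm)` pairs printed in the preceding cells and lists any statistic whose normalized value differs from the original by more than 5%. The dictionary and variable names are illustrative and not part of the original notebook.

```python
# (orig, norm) pairs copied from the outputs above
stats = {
    'vocD': (70.996, 70.776),
    'AG': (1.443, 1.455),
    'mean_MI': (3.26, 3.229),
    'mean_tscore': (93.617, 92.031),
    'absent_prop': (0.926, 0.927),
    'grammar_errors_per_100': (0.787, 0.8),
    'punc_errors_per_100': (3.15, 3.2),
    'col_errors_per_100': (1.969, 2.0),
    'correct_cols_per_100': (12.992, 13.2),
    'K10to16_p': (0.091, 0.091),
    'K3to9_p': (0.364, 0.364),
    'K1to2_p': (0.545, 0.545),
    'bigram_range': (0.122, 0.117),
    'CNC': (2.045, 2.045),
    'MLC': (11.773, 11.591),
}

# Relative difference of the normalized value from the original
rel_diff = {name: abs(norm - orig) / orig for name, (orig, norm) in stats.items()}

# Statistics that fall outside the 5% tolerance
outside = [name for name, diff in rel_diff.items() if diff > 0.05]
print(outside)  # []
```

With these values the largest relative difference is for `bigram_range` (about 4%), so every statistic stays within the 5% tolerance.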

In [85]:
# Final comparison with relevant stats only

texts_final = texts_df[['text_id','text','lemmas_NLTK','lemmas_CLAWS','text_len','MLC','CNC','grammar_errors_per_100',
                        'punc_errors_per_100','vocD','AG','bigram_range','mean_MI','absent_prop',
                        'mean_tscore','col_errors_per_100','correct_cols_per_100','K10to16_p','K3to9_p','K1to2_p',
                        'kband_cols', 'K1to2_cols','K3to9_cols','K10to16_cols',
                        'K1to2_cols_K','K3to9_cols_K','K10to16_cols_K','bad_cols']]

In [86]:
# Pickle for later use

joblib.dump(texts_final ,'../docs/C1_orig&norm.pkl')

['../docs/C1_orig&norm.pkl']

In [87]:
# Pickle possible cols for collocation identification notebook

C1_cols = texts_df[['text_id','lemmas_CLAWS','possible_cols']]

joblib.dump(C1_cols ,'../docs/C1_cols.pkl')

['../docs/C1_cols.pkl']

[Back to top](#Normalizing-C1-original-text)