# Normalizing B2 original text

<br>

**Language: Python**

This notebook shows the process used for creating the length-normalized B2 text from the original public IELTS Task 2 B2 text. The original text is a public academic writing sample from the [IELTS website](https://www.ielts.org/en-us/about-the-test/sample-test-questions). The original text is 349 words and the desired length is 250 words (see chapters 5.1 and 5.2 of the dissertation).

**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Text processing](#Text-processing)
- [Syntactic complexity](#Syntactic-complexity)
- [Lexical diversity](#Lexical-diversity)
- [Lexical sophistication](#Lexical-sophistication)
- [Collocation measures](#Collocation-measures)
- [Accuracy](#Accuracy)
- [Normalizing length](#Normalizing-length)

## Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
from ast import literal_eval
from nltk import pos_tag_sents
from pelitk import lex
import joblib
import numpy as np
import math
from collections import Counter
from CLAWSTag import Tagger

In [2]:
# Set preferred notebook format

%pprint # Turn off pretty printing
InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

Pretty printing has been turned OFF


**Note:** As described in the [README.md](../README.md), the frequency information from COCA referenced here is not freely available but can be purchased at https://corpus.byu.edu/coca. Without these data you will be able to see a few rows of these dataframes, but will not be able to run the code yourself. The t-scores and K-bands were also calculated using these data.

In [3]:
# Import necessary dictionaries

coca_freq_dict = joblib.load('../../COCA_data/COCA_2020_lemma_freq_dict.pkl')
coca_word_lemma_dict = joblib.load('../../COCA_data/COCA_2020_word_lemma_dict.pkl')
col_freq_dict = joblib.load('../../COCA_data/COCA_2020_collocate_freq_dict.pkl')
MI_dict = joblib.load('../../COCA_data/COCA_2020_MI_dict.pkl')
tscore_dict = joblib.load('../../COCA_data/COCA_2020_tscore_dict.pkl')
kband_dict = joblib.load('../../COCA_data/COCA_2020_lemma_Kband_dict.pkl') # All items lower-case

In [4]:
# Read in original text (transcribed and with corrected spelling)

with open("../docs/B2_original_corrected.txt", "r") as f:
    B2_orig = f.read()

**Note:** In addition to correcting spelling, contractions were changed to full words, '&' to 'and', '20' to 'twenty', and 'Mr' to 'Mister'.

In [5]:
# Read in modified text

with open("../docs/B2_normalized.txt", "r") as f:
    B2_norm = f.read()

**Note:** The modified text was created based on the [`Normalizing length`](#Normalizing-length) goals described later, but is incorporated here to avoid having to go through the text processing procedure twice.

In [6]:
# Create dataframe

texts_df = pd.DataFrame({'text_id':pd.Series(['B2_orig','B2_norm']),
                         'text':pd.Series([B2_orig,B2_norm])})

texts_df

Unnamed: 0,text_id,text
0,B2_orig,"I greatly support the idea. I support it, beca..."
1,B2_norm,I greatly support the idea.\nraised in a certa...


## Text processing

The tokenizer, part-of-speech tagger, and lemmatizer tools are the same ones used in the creation of the [PELIC](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) corpus. The tokenizer and lemmatizer are not open access but are based on the ones from [NLTK](https://www.nltk.org/), and using the public NLTK tools will yield similar results. For a more detailed description of the modified tools, please see [Naismith et al. (2022)](https://benjamins.com/catalog/ijlcr.21002.nai).

In [7]:
# Change to working directory containing elitools

%cd '../../ELI_Data_Mining/Data-Archive/elitools/'

/Users/Ben/Documents/ELI_Data_Mining/Data-Archive/elitools


In [8]:
# Load lemmatizer

%run -i 'lemmatizer_class.py'
lemmatizer = lemmatizer()

In [9]:
# Load the tokenizer module

%run -i 'tokenizer.py'

In [10]:
# Return to previous working directory

%cd '../../../Collocational_proficiency_Naismith_2022/notebooks'

/Users/Ben/Documents/Collocational_proficiency_Naismith_2022/notebooks


### Tokenization

In [11]:
# Tokenize text (nltk-based)

texts_df['toks'] = texts_df.text.apply(tokenize)

texts_df

Unnamed: 0,text_id,text,toks
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support..."
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in..."


### POS tagging
- NLTK (PELIC)
- CLAWS7 (COCA)

#### NLTK
Because there are only three texts (each with three versions), I used the default NLTK tagger (Penn Treebank tagset) and then manually checked and corrected the tags to avoid any errors.

In [12]:
# Apply nltk tagger to create series

B2_NLTK = pd.Series(pos_tag_sents(texts_df['toks']))

In [13]:
# Check tags

# Write out tagged texts
B2_NLTK.to_csv('../docs/B2_NLTK.csv', index=False, header=False) 

# Read in the checked tagged texts as a series
B2_NLTK_CHECKED = pd.read_csv("../docs/B2_NLTK_CHECKED.csv", header=None).squeeze("columns")
B2_NLTK_CHECKED = [literal_eval(x) for x in B2_NLTK_CHECKED]
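
A CSV cell can only hold text, so each tagged text is stored as the string form of a list of tuples, and `literal_eval` turns that string back into Python objects after the manual check. A minimal sketch of the round trip, using a made-up three-token example rather than the actual B2 data:

```python
from ast import literal_eval

# Hypothetical tagged text (not taken from the B2 files)
tagged = [('I', 'PRP'), ('support', 'VBP'), ('it', 'PRP')]

# What ends up in a CSV cell: the string representation of the list
as_text = str(tagged)

# literal_eval parses Python literals only, so it is safer than eval
restored = literal_eval(as_text)
```
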




In [14]:
# Create column based on checked tagged texts

texts_df['tok_POS_NLTK'] = B2_NLTK_CHECKED
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the..."
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the..."


#### CLAWS7

Tagging with CLAWS7 as well makes it easier to match the POS tags to the COCA information, rather than using the Penn tagset from NLTK and then having to convert.

Free CLAWS tagger: http://ucrel-api.lancaster.ac.uk/claws/free.html

Again, these tagged texts should be manually checked prior to use.

In [15]:
# Read in tagged CLAWS texts

with open("../docs/B2_original_CLAWS.txt", "r") as f:
    B2_orig_CLAWS = f.read()

with open("../docs/B2_normalized_CLAWS.txt", "r") as f:
    B2_norm_CLAWS = f.read()

In [16]:
# Remove new line characters, split on whitespace, and remove identifier at end

B2_orig_CLAWS = B2_orig_CLAWS.replace('\n', '').split(' ')[:-2]
B2_norm_CLAWS = B2_norm_CLAWS.replace('\n', '').split(' ')[:-2]

In [17]:
# Change tags into tuples

B2_orig_CLAWS = [tuple(x.split('_')) for x in B2_orig_CLAWS]
B2_norm_CLAWS = [tuple(x.split('_')) for x in B2_norm_CLAWS]
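
The two steps above (cleaning the raw CLAWS output and splitting each `word_TAG` item) can be sketched as one helper. This is an illustration only: `parse_claws` is a hypothetical name, the sample string is made up, and its last two items merely stand in for the identifier at the end of the real files. Splitting on the last underscore is a small defensive choice in case a token itself contains `_`.

```python
def parse_claws(raw, n_trailing=2):
    """Turn 'word_TAG word_TAG ...' into a list of (word, TAG) tuples."""
    items = raw.replace('\n', '').split(' ')
    if n_trailing:
        # Drop the trailing identifier items
        items = items[:-n_trailing]
    # rsplit on the last underscore so a token containing '_' stays intact
    return [tuple(item.rsplit('_', 1)) for item in items]


# Made-up sample; the final two items are placeholders for the identifier
sample = "I_PPIS1 greatly_RR support_VV0 the_AT idea_NN1 ._. trailing_id1 trailing_id2\n"
parsed = parse_claws(sample)
```
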

In [18]:
texts_df['toks_POS_CLAWS'] = pd.Series([B2_orig_CLAWS,B2_norm_CLAWS])
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, PPIS1), (greatly, RR), (support, VV0), (t..."
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, PPIS1), (greatly, RR), (support, VV0), (t..."


### Lemmatization

#### NLTK

In [19]:
# Create lemmatized text column using our lemmatizer loaded earlier

texts_df['lemmas_NLTK'] = texts_df['tok_POS_NLTK'].apply(lemmatizer.lemmatize_text)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, PPIS1), (greatly, RR), (support, VV0), (t...","[I, greatly, support, the, idea, ., I, support..."
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, PPIS1), (greatly, RR), (support, VV0), (t...","[I, greatly, support, the, idea, ., raise, in,..."


#### CLAWS

In [20]:
# Keep only first letter of CLAWS PoS tags

texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0],x[1][0].lower()) for x in row])

In [21]:
# Remove punctuation from CLAWS texts

COCA_POS = sorted(list(set([x[1] for x in coca_freq_dict.keys()])))
texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if x[1] in COCA_POS])
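
As a toy illustration of these two steps, the sketch below reduces each CLAWS tag to its lower-cased first letter and then keeps only tokens whose reduced tag is in an allowed set. The set `allowed` is a hypothetical stand-in for the POS letters found in the COCA dictionary keys, and `simplify_tags` is not a function from the notebook.

```python
def simplify_tags(tagged, allowed_pos):
    """Reduce tags to their first letter, then drop tokens with other tags."""
    reduced = [(tok, tag[0].lower()) for tok, tag in tagged]
    return [(tok, pos) for tok, pos in reduced if pos in allowed_pos]


# Hypothetical subset of COCA POS letters (assumption, not the real list)
allowed = {'a', 'n', 'p', 'r', 'v'}

tagged = [('I', 'PPIS1'), ('greatly', 'RR'), ('support', 'VV0'),
          ('the', 'AT'), ('idea', 'NN1'), ('.', '.')]
simplified = simplify_tags(tagged, allowed)
```

The full stop is removed because its reduced tag `'.'` is not among the allowed letters, which is how punctuation drops out of the CLAWS texts.
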

In [22]:
# Check lemmas not in COCA dict

sorted(list(set([x for y in texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if (x[0].lower(),x[1]) not in coca_word_lemma_dict]).to_list() for x in y])))

# Used as 'every day' (adverb) rather than 'everyday' (adjective)

[('everyday', 'r')]

In [23]:
# Create CLAWS lemma column

# First lower case all toks (as in the word_lemma dict)
texts_df['lemmas_CLAWS'] = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0].lower(),x[1]) for x in row])

# Then map dict
texts_df.lemmas_CLAWS = texts_df.lemmas_CLAWS.apply(
    lambda row:[coca_word_lemma_dict[x] if x in coca_word_lemma_dict else x for x in row])

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),..."
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),..."


### Length
Length was counted manually rather than with `len(toks)` or regex-based counting, so that the counts match how words are counted on IELTS tests (where examiners also count manually). These counts often match what Microsoft Word would provide.

In [24]:
# Create dictionary

text_len = {'B2_orig':349,'B2_norm':250}

In [25]:
# Create length column

texts_df['text_len'] = texts_df.text_id.map(text_len)
texts_df['text_len'] = texts_df['text_len'].astype(int)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250


## Syntactic complexity

Analysis using [TAASSC](https://www.linguisticanalysistools.org/taassc.html), calculating the measures from Lu's (2010) [Syntactic Complexity Analyzer](https://aihaiyang.com/software/). Following previous research, the focus is on the two metrics most important for predicting proficiency: the number of complex nominals per clause (CN/C) and the mean length of clause (MLC).

In [26]:
# Read in TAASSC analysis file

TAASSC = pd.read_csv("../docs/B2_TAASSC_sca.csv")

In [27]:
# Rename files to match texts_df

file_names = {'B2_original_corrected.txt':'B2_orig','B2_normalized.txt':'B2_norm'}
TAASSC.filename = TAASSC.filename.map(file_names)
TAASSC = TAASSC.loc[~TAASSC.filename.isnull()]
TAASSC

Unnamed: 0,filename,nwords,MLS,MLT,MLC,C_S,VP_T,C_T,DC_C,DC_T,T_S,CT_T,CP_T,CP_C,CN_T,CN_C
1,B2_norm,256,17.066667,15.058824,7.314286,2.333333,2.529412,2.058824,0.457143,0.941176,1.133333,0.588235,0.117647,0.057143,1.647059,0.8
3,B2_orig,357,15.521739,14.28,7.285714,2.130435,2.48,1.96,0.408163,0.8,1.086957,0.52,0.12,0.061224,1.68,0.857143
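
Both target indices are ratios over the number of clauses: MLC is words per clause and CN/C is complex nominals per clause. The sketch below reproduces the B2_orig values from counts that are an assumption: TAASSC reports `nwords` (357) but not the clause or complex nominal counts, so 49 clauses and 42 complex nominals were back-calculated from the ratios in the table.

```python
def mean_length_of_clause(n_words, n_clauses):
    """MLC: number of words divided by number of clauses."""
    return n_words / n_clauses


def complex_nominals_per_clause(n_complex_nominals, n_clauses):
    """CN/C: number of complex nominals divided by number of clauses."""
    return n_complex_nominals / n_clauses


# 357 is the reported nwords; 49 and 42 are inferred counts (assumption)
mlc_orig = mean_length_of_clause(357, 49)
cnc_orig = complex_nominals_per_clause(42, 49)
```
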


In [28]:
# Keep only relevant syntactic complexity columns and rename them

TAASSC = TAASSC[['filename','MLC','CN_C']]
TAASSC = TAASSC.rename(columns={"filename": "text_id",'CN_C':'CNC'})
TAASSC

Unnamed: 0,text_id,MLC,CNC
1,B2_norm,7.314286,0.8
3,B2_orig,7.285714,0.857143


In [29]:
# Merge TAASSC data with texts_df

texts_df = pd.merge(texts_df, TAASSC, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.285714,0.857143
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314286,0.8


## Lexical diversity

vocD is calculated on lemmas using functions from [PELITK](https://github.com/ELI-Data-Mining-Group/pelitk).
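
For orientation, vocD-style measures estimate a parameter D by drawing random samples of 35 to 50 tokens, averaging the type-token ratio (TTR) at each sample size, and fitting the curve TTR = (D/N)(sqrt(1 + 2N/D) - 1). The sketch below illustrates that general procedure with a simple grid search. It is not necessarily how `lex.vocd` is implemented, and the function names and toy token lists are hypothetical.

```python
import math
import random


def ttr_model(n, d):
    """TTR predicted by the vocd curve for sample size n and parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def vocd_sketch(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Estimate D by fitting the vocd curve to mean TTRs of random samples."""
    rng = random.Random(seed)
    mean_ttr = {}
    for n in sizes:
        # Sample n tokens without replacement and average the TTRs
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        mean_ttr[n] = sum(ttrs) / trials
    best_d, best_err = None, float('inf')
    for step in range(1, 2001):
        d = step / 10  # candidate values from 0.1 to 200.0
        err = sum((ttr_model(n, d) - t) ** 2 for n, t in mean_ttr.items())
        if err < best_err:
            best_d, best_err = d, err
    return best_d


# Toy inputs: a highly repetitive text and a maximally diverse one
repetitive = ['a', 'b', 'c', 'd', 'e'] * 20
diverse = [f'w{i}' for i in range(100)]
d_repetitive = vocd_sketch(repetitive)
d_diverse = vocd_sketch(diverse)
```

A higher D indicates greater lexical diversity, so the diverse toy list should receive a larger estimate than the repetitive one. The text needs at least 50 tokens for the largest sample size.
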

In [30]:
# Remove punctuation before calculating

punctuation = ['.','!','?',';',':','#','"',"'",'``','`',',','--','-','...',')','(',"''"]

texts_df['vocD'] = texts_df.toks.apply(lambda row: [x for x in row if x not in punctuation])

In [31]:
# Create vocD column

texts_df['vocD'] = texts_df.lemmas_NLTK.apply(lex.vocd)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.285714,0.857143,46.781413
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314286,0.8,43.456769


## Lexical sophistication

### Advanced Guiraud (AG)
AG based on lemmas using a frequency list (PSL3) compiled from the PELIC learner corpus (see dissertation section 2.2.2).
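
Advanced Guiraud is commonly defined as the number of advanced types (types not on a basic frequency list) divided by the square root of the number of tokens. The sketch below follows that definition; the tiny `basic` set is a hypothetical stand-in for PSL3, and `advanced_guiraud` is an illustrative function rather than the `lex.adv_guiraud` implementation.

```python
import math


def advanced_guiraud(lemmas, basic_words):
    """Advanced types divided by the square root of the token count."""
    advanced_types = {w.lower() for w in lemmas} - set(basic_words)
    return len(advanced_types) / math.sqrt(len(lemmas))


# Hypothetical basic list standing in for PSL3
basic = {'i', 'the', 'a', 'support', 'idea', 'of', 'work', 'hard'}
lemmas = ['I', 'greatly', 'support', 'the', 'idea', 'of', 'authentic', 'hard', 'work']

# Two advanced types ('greatly', 'authentic') over sqrt(9) tokens
ag = advanced_guiraud(lemmas, basic)
```
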

In [32]:
# Read in PSL3 list for manual checking of items in texts that are off list

with open("../docs/psl3.txt", "r") as f:
    PSL3 = f.read()
PSL3 = sorted(PSL3.split('\n'))
len(PSL3)
PSL3[-10:]

2000

['yesterday', 'yet', 'yogurt', 'you', 'young', 'your', 'yours', 'yourself', 'youth', 'zoo']

In [33]:
# Create AG column (punctuation removed)

texts_df['AG'] = texts_df.lemmas_NLTK.apply(lambda row: [x for x in row if x not in punctuation]).apply(
    lex.adv_guiraud,freq_list = 'PSL3')

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.285714,0.857143,46.781413,0.899735
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314286,0.8,43.456769,0.9375


### Contextual diversity

Analysis using [TAALES](https://www.linguisticanalysistools.org/taales.html). Based on previous research, one metric is the focus: contextual diversity as in Monteiro et al. (2018).

In [34]:
# Read in TAALES analysis

TAALES = pd.read_csv("../docs/B2_TAALES.csv")

In [35]:
# Rename files to match texts_df

TAALES.Filename = TAALES.Filename.map(file_names)
TAALES = TAALES.loc[~TAALES.Filename.isnull()]

In [36]:
# Keep only relevant contextual diversity columns and rename them

TAALES = TAALES[['Filename','COCA_Academic_Range_AW','COCA_Academic_Bigram_Range','COCA_Academic_Trigram_Range']]
TAALES = TAALES.rename(columns={"Filename": "text_id",'COCA_Academic_Range_AW':'unigram_range',
                                'COCA_Academic_Bigram_Range':'bigram_range',
                                'COCA_Academic_Trigram_Range':'trigram_range'})
TAALES

Unnamed: 0,text_id,unigram_range,bigram_range,trigram_range
1,B2_norm,0.656513,0.171084,0.024246
3,B2_orig,0.667837,0.164258,0.029408


In [37]:
# Merge TAALES data with texts_df

texts_df = pd.merge(texts_df, TAALES, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.285714,0.857143,46.781413,0.899735,0.667837,0.164258,0.029408
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314286,0.8,43.456769,0.9375,0.656513,0.171084,0.024246


## Collocation measures
Three measures make up the 'CollGram' profile from Granger & Bestgen / Bestgen & Granger (2014):
- mean MI
- mean t-score
- proportion of bigrams absent from reference corpus

In [38]:
# Extract potential collocations in span 4

def find_cols(lemma_list):
    col_list = list(zip(lemma_list,lemma_list[1:]))+list(zip(lemma_list,lemma_list[2:]))\
    +list(zip(lemma_list,lemma_list[3:]))+list(zip(lemma_list,lemma_list[4:]))
    return col_list
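
To make the span-4 window concrete, here is an equivalent loop-based sketch applied to a toy four-item list. `window_pairs` is a hypothetical name; with `span=4` it should produce the same pairs, in the same order, as `find_cols`.

```python
def window_pairs(lemmas, span=4):
    """Pair each item with the items 1..span positions to its right."""
    pairs = []
    for distance in range(1, span + 1):
        pairs.extend(zip(lemmas, lemmas[distance:]))
    return pairs


toy = ['value', 'of', 'hard', 'work']
pairs = window_pairs(toy)
```

Pairs are ordered by distance: adjacent pairs first, then pairs one item apart, and so on. A list of n items (n >= 4) yields 4n - 10 pairs, so the four-item toy list gives six.
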

In [39]:
# Create possible collocations column

texts_df['possible_cols'] = texts_df.lemmas_CLAWS.apply(find_cols)

In [40]:
# Lower-case (doesn't matter that 'I' gets lowered as not in collocate dict)

texts_df['possible_cols'] = texts_df.possible_cols.apply(
    lambda row: [((x[0][0].lower(),x[0][1]),(x[1][0].lower(),x[1][1])) for x in row])

In [41]:
# Create list of all possible collocations

possible_cols = sorted(list(set([x for y in texts_df.possible_cols.to_list() for x in y])))
possible_cols[:5]
possible_cols[-5:]
len(possible_cols)

[(('a', 'a'), ('a', 'a')), (('a', 'a'), ('always', 'r')), (('a', 'a'), ('be', 'v')), (('a', 'a'), ('but', 'c')), (('a', 'a'), ('certain', 'j'))]

[(('work', 'v'), ('very', 'r')), (('world', 'n'), ('a', 'a')), (('world', 'n'), ('be', 'v')), (('world', 'n'), ('tough', 'j')), (('world', 'n'), ('very', 'r'))]

1305

### Mean MI

MI is not calculated for any bigrams with freq less than 5 or MI less than 1.
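
For readers without access to the precomputed dictionaries, the textbook forms of the two association measures are sketched below: MI compares the observed pair frequency with the frequency expected by chance, and the t-score scales the same difference by the square root of the observed frequency. The exact formula behind the COCA-based values (for example, any span adjustment) is not shown in this notebook, so treat these functions and the toy frequencies as assumptions.

```python
import math


def expected_freq(freq_a, freq_b, corpus_size):
    """Pair frequency expected if the two words were independent."""
    return freq_a * freq_b / corpus_size


def mutual_information(pair_freq, freq_a, freq_b, corpus_size):
    """MI = log2(observed / expected)."""
    return math.log2(pair_freq / expected_freq(freq_a, freq_b, corpus_size))


def t_score(pair_freq, freq_a, freq_b, corpus_size):
    """t = (observed - expected) / sqrt(observed)."""
    expected = expected_freq(freq_a, freq_b, corpus_size)
    return (pair_freq - expected) / math.sqrt(pair_freq)


# Toy frequencies (made up): pair seen 8 times, each word 100 times, 10,000 tokens
mi = mutual_information(8, 100, 100, 10_000)
t = t_score(8, 100, 100, 10_000)
```

With these toy numbers the expected frequency is 1, so the pair would pass both thresholds mentioned above (frequency of at least 5 and MI of at least 1).
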

In [42]:
# Create column with MI for each possible collocation in MI dict

texts_df['col_MI'] = texts_df.possible_cols.apply(lambda row: [(x,MI_dict[x]) for x in row if x in MI_dict])

In [43]:
# Find mean MI for each text based on tokens

texts_df['mean_MI'] = texts_df.col_MI.apply(lambda row: np.mean([x[1] for x in row]))

### Proportion of absent/low MI word combinations

In [44]:
# Create column of two-word combinations not in collocation dict

texts_df['absent'] = texts_df.possible_cols.apply(lambda row: [x for x in row if x not in col_freq_dict])

In [45]:
# Find proportion of absent two-word combinations compared to total two-word combinations in the text

texts_df['absent_prop'] = texts_df.absent.apply(lambda row: len(row)) / texts_df.possible_cols.apply(lambda row: len(row))

In [46]:
# Find proportion of absent two-word combination types compared to total two-word combination types in the text

texts_df['absent_prop_types'] = texts_df.absent.apply(lambda row: len(set(row))) / texts_df.possible_cols.apply(lambda row: len(set(row)))
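
The difference between the token-based and type-based proportions is easiest to see on a toy example. In this sketch, `reference` is a hypothetical stand-in for the keys of `col_freq_dict`, and the pairs are made up.

```python
def absent_proportions(pairs, reference):
    """Return (token-based, type-based) proportions of pairs not in reference."""
    absent = [p for p in pairs if p not in reference]
    by_token = len(absent) / len(pairs)
    by_type = len(set(absent)) / len(set(pairs))
    return by_token, by_type


# Hypothetical reference set and word combinations
reference = {('hard', 'work')}
pairs = [('hard', 'work'), ('tough', 'place'), ('tough', 'place'), ('family', 'love')]
by_token, by_type = absent_proportions(pairs, reference)
```

The repeated absent pair counts twice in the token-based figure (3 of 4) but only once in the type-based figure (2 of 3).
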

### Mean t-scores

In [47]:
# Create column with t-score for each bigram

texts_df['col_tscore'] = texts_df.possible_cols.apply(lambda row: [(x,tscore_dict[x]) for x in row if x in tscore_dict])

In [48]:
# Find mean t-score for each text based on tokens and types

texts_df['mean_tscore'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in row]))
texts_df['mean_tscore_types'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in set(row)]))

In [49]:
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.285714,0.857143,46.781413,0.899735,0.667837,0.164258,0.029408,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('because', 'i'), ('of', 'i')), 2.29), ((('...",2.9402,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.964838,0.966216,"[((('because', 'i'), ('of', 'i')), 443.825), (...",162.09286,145.6102
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314286,0.8,43.456769,0.9375,0.656513,0.171084,0.024246,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('such', 'i'), ('as', 'i')), 6.02), ((('har...",2.878056,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.964497,0.965478,"[((('such', 'i'), ('as', 'i')), 494.477), ((('...",153.301056,126.052867


## Accuracy
Grammatical accuracy and collocational accuracy. Errors were manually annotated and counted.

### Grammatical accuracy

In [50]:
# Create lists of manually identified errors

B2_orig_grammar = ['are raise','oppose to it','is used to have','is easily gave','watched their parent every day worked','have work hard','they bought (buy)','children needs',"so doesn't (don't')",'to discovered']
B2_norm_grammar = ['is used to have','is easily gave','watched their parent every day worked','have work hard','they bought (buy)',"so doesn't (don't')",'to discovered']

len(B2_orig_grammar)
len(B2_norm_grammar)

10

7

In [51]:
# Create grammatical accuracy column

grammar_dict = {'B2_orig':len(B2_orig_grammar),'B2_norm':len(B2_norm_grammar)}

texts_df['grammar_errors'] = texts_df.text_id.map(grammar_dict)
texts_df['grammar_errors_per_100'] = (texts_df.grammar_errors/texts_df.text_len)*100

In [52]:
# Add punctuation column (manually counted)

punc_dict = {'B2_orig':9,'B2_norm':6}

texts_df['punc_errors'] = texts_df.text_id.map(punc_dict)
texts_df['punc_errors_per_100'] = (texts_df.punc_errors/texts_df.text_len)*100

### Collocational accuracy

In [53]:
# Record of collocation errors. MI calculations not comparable with two and three word collocations.

B2_orig_errors = ['raise in (values)','psychological values','oppose to it','is easily gave','well-trained to face adulthood','put food in the table','set their mind that','family love life','express love by money','impact to','adult life problems']
B2_norm_errors = ['raised in (values)','psychological values','oppose to it','well-trained to face adulthood','put food in the table','set their mind that','express love by money','impact to']

len(B2_orig_errors)
len(B2_norm_errors)

11

8

In [54]:
# Accurate collocations (manually annotated)

B2_orig_cols = ['support the idea','value of hard work','in the condition','come easily','comes from a wealthy family','have money','worked very hard','have the advantage','see the reality (and embrace it)','blinded by','power of money','expensive clothes','are never home','source of happiness','all the time','the art of','On the contrary','grow up with','sense of respect','have a disadvantage','have the time to','face problems','the following reason','work for it','see the fact','tough place','authentic self','love their children','basic necessity','care about','use it well']
B2_norm_cols = ['support the idea','value of hard work','in the condition','come easily','comes from a wealthy family','have money','worked very hard','have the advantage','see the reality (and embrace it)','blinded by','power of money','expensive clothes','are never home','source of happiness','all the time','the art of','On the contrary','grow up with','sense of respect','have a disadvantage','have the time to','face problems']

len(B2_orig_cols)
len(B2_norm_cols)

31

22

In [55]:
# Create 'bad' cols column for future use

texts_df['bad_cols'] = (B2_orig_errors,B2_norm_errors)

In [56]:
# Create error and accurate cols columns

errors_dict = {'B2_orig':len(B2_orig_errors),'B2_norm':len(B2_norm_errors)}
correct_dict = {'B2_orig':len(B2_orig_cols),'B2_norm':len(B2_norm_cols)}

texts_df['col_errors'] = texts_df.text_id.map(errors_dict)
texts_df['correct_cols'] = texts_df.text_id.map(correct_dict)

In [57]:
# Create errors and correct cols per 100 words columns

texts_df['col_errors_per_100'] = (texts_df.col_errors/texts_df.text_len)*100
texts_df['correct_cols_per_100'] = (texts_df.correct_cols/texts_df.text_len)*100
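
The per-100-word rates can be checked by hand from the counts and lengths given in this notebook. The sketch below recomputes them outside the dataframe; only the helper name `per_100_words` and the dictionary layout are introduced here.

```python
def per_100_words(count, n_words):
    """Convert a raw count into a rate per 100 words."""
    return count / n_words * 100


lengths = {'B2_orig': 349, 'B2_norm': 250}
counts = {
    'grammar_errors': {'B2_orig': 10, 'B2_norm': 7},
    'punc_errors': {'B2_orig': 9, 'B2_norm': 6},
    'col_errors': {'B2_orig': 11, 'B2_norm': 8},
    'correct_cols': {'B2_orig': 31, 'B2_norm': 22},
}

# Rates rounded to 3 digits, keyed by measure and then by text
rates = {
    measure: {text_id: round(per_100_words(n, lengths[text_id]), 3)
              for text_id, n in by_text.items()}
    for measure, by_text in counts.items()
}
```

For each measure the two texts come out within about 0.2 per 100 words of each other, which is the point of normalizing length while keeping the accuracy profile stable.
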

### Collocation frequency bands
Percentage of collocations containing low/mid/high freq items.  
- High = K1-2
- Mid = K3-9
- Low = K10+
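
The banding above can be sketched as follows: each collocation is assigned the band of its least frequent item, that is, the item with the highest K-band. The K-band values in `toy_kbands` are hypothetical and do not come from the COCA-based `kband_dict`; the function names are illustrative.

```python
def band_group(kband):
    """Map a K-band number to the high/mid/low frequency group."""
    if kband <= 2:
        return 'high'
    if kband <= 9:
        return 'mid'
    return 'low'


def collocation_group(collocation, kbands):
    """Group a collocation by its least frequent item (highest K-band)."""
    return band_group(max(kbands[item] for item in collocation))


# Hypothetical K-bands for illustration only
toy_kbands = {('basic', 'j'): 1, ('necessity', 'n'): 4,
              ('have', 'v'): 1, ('money', 'n'): 1}

group_a = collocation_group([('basic', 'j'), ('necessity', 'n')], toy_kbands)
group_b = collocation_group([('have', 'v'), ('money', 'n')], toy_kbands)
```
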

In [58]:
# Tokenize collocations

B2_orig_cols_toks = [x.split() for x in B2_orig_cols]
B2_norm_cols_toks = [x.split() for x in B2_norm_cols]

In [59]:
B2_orig_cols_toks
B2_norm_cols_toks

[['support', 'the', 'idea'], ['value', 'of', 'hard', 'work'], ['in', 'the', 'condition'], ['come', 'easily'], ['comes', 'from', 'a', 'wealthy', 'family'], ['have', 'money'], ['worked', 'very', 'hard'], ['have', 'the', 'advantage'], ['see', 'the', 'reality', '(and', 'embrace', 'it)'], ['blinded', 'by'], ['power', 'of', 'money'], ['expensive', 'clothes'], ['are', 'never', 'home'], ['source', 'of', 'happiness'], ['all', 'the', 'time'], ['the', 'art', 'of'], ['On', 'the', 'contrary'], ['grow', 'up', 'with'], ['sense', 'of', 'respect'], ['have', 'a', 'disadvantage'], ['have', 'the', 'time', 'to'], ['face', 'problems'], ['the', 'following', 'reason'], ['work', 'for', 'it'], ['see', 'the', 'fact'], ['tough', 'place'], ['authentic', 'self'], ['love', 'their', 'children'], ['basic', 'necessity'], ['care', 'about'], ['use', 'it', 'well']]

[['support', 'the', 'idea'], ['value', 'of', 'hard', 'work'], ['in', 'the', 'condition'], ['come', 'easily'], ['comes', 'from', 'a', 'wealthy', 'family'], ['have', 'money'], ['worked', 'very', 'hard'], ['have', 'the', 'advantage'], ['see', 'the', 'reality', '(and', 'embrace', 'it)'], ['blinded', 'by'], ['power', 'of', 'money'], ['expensive', 'clothes'], ['are', 'never', 'home'], ['source', 'of', 'happiness'], ['all', 'the', 'time'], ['the', 'art', 'of'], ['On', 'the', 'contrary'], ['grow', 'up', 'with'], ['sense', 'of', 'respect'], ['have', 'a', 'disadvantage'], ['have', 'the', 'time', 'to'], ['face', 'problems']]

In [60]:
# Collocations with PoS (manually lemmatized and tagged based on above)

B2_orig_cols_toks_POS = [[('support','v'),('the','a'),('idea','n')],
                         [('value','n'),('of','i'),('hard','j'),('work','n')],
                         [('in','i'), ('the','a'), ('condition','n')],
                         [('come','v'), ('easily','r')],
                         [('come','v'), ('from','i'), ('a','a'), ('wealthy','j'), ('family','n')],
                         [('have','v'), ('money','n')],
                         [('work','v'), ('very','r'), ('hard','r')],
                         [('have','v'), ('the','a'), ('advantage','n')],
                         [('see','v'), ('the','a'), ('reality','n'),('and','c'), ('embrace','v'), ('it','p')],
                         [('blind','v'), ('by','i')],
                         [('power','n'), ('of','i'), ('money','n')],
                         [('expensive','j'), ('clothes','n')],
                         [('be','v'), ('never','r'), ('home','r')],
                         [('source','n'), ('of','i'), ('happiness','n')],
                         [('all','d'), ('the','a'), ('time','n')],
                         [('the','a'), ('art','n'), ('of','i')],
                         [('on','i'), ('the','a'), ('contrary','n')],
                         [('grow','v'), ('up','r'), ('with','i')],
                         [('sense','n'), ('of','i'), ('respect','n')],
                         [('have','v'), ('a','a'), ('disadvantage','n')],
                         [('have','v'), ('the','a'), ('time','n'), ('to','t')],
                         [('face','v'), ('problem','n')],
                         [('the','a'), ('following','j'), ('reason','n')],
                         [('work','v'), ('for','i'), ('it','p')],
                         [('see','v'), ('the','a'), ('fact','n')],
                         [('tough','j'), ('place','n')],
                         [('authentic','j'), ('self','n')],
                         [('love','v'), ('their','a'), ('child','n')],
                         [('basic','j'), ('necessity','n')],
                         [('care','v'), ('about','i')],
                         [('use','v'), ('it','p'), ('well','r')]]

B2_norm_cols_toks_POS =  [[('support','v'),('the','a'),('idea','n')],
                         [('value','n'),('of','i'),('hard','j'),('work','n')],
                         [('in','i'), ('the','a'), ('condition','n')],
                         [('come','v'), ('easily','r')],
                         [('come','v'), ('from','i'), ('a','a'), ('wealthy','j'), ('family','n')],
                         [('have','v'), ('money','n')],
                         [('work','v'), ('very','r'), ('hard','r')],
                         [('have','v'), ('the','a'), ('advantage','n')],
                         [('see','v'), ('the','a'), ('reality','n'),('and','c'), ('embrace','v'), ('it','p')],
                         [('blind','v'), ('by','i')],
                         [('power','n'), ('of','i'), ('money','n')],
                         [('expensive','j'), ('clothes','n')],
                         [('be','v'), ('never','r'), ('home','r')],
                         [('source','n'), ('of','i'), ('happiness','n')],
                         [('all','d'), ('the','a'), ('time','n')],
                         [('the','a'), ('art','n'), ('of','i')],
                         [('on','i'), ('the','a'), ('contrary','n')],
                         [('grow','v'), ('up','r'), ('with','i')],
                         [('sense','n'), ('of','i'), ('respect','n')],
                         [('have','v'), ('a','a'), ('disadvantage','n')],
                         [('have','v'), ('the','a'),('time','n'), ('to','t')],
                         [('face','v'), ('problem','n')]]

In [61]:
# Create collocation dict

col_dict = {'B2_orig':B2_orig_cols_toks_POS,'B2_norm':B2_norm_cols_toks_POS}

# Create column with collocations

texts_df['cols'] = texts_df.text_id.map(col_dict)

In [62]:
# Create column of the freq bands of the highest kband item in each collocation

texts_df['col_kband'] = texts_df.cols.apply(
    lambda row: [max(kband_dict[y] for y in x) for x in row])
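As a small illustration of this step, the sketch below assigns each collocation the K-band of its least frequent item, that is, the highest band number among its items. The lookup `toy_kband_dict` and its values are invented for illustration; the real `kband_dict` is built from the COCA data and is not reproduced here.

```python
# Hypothetical K-band lookup keyed by (lemma, POS) tuples; the values are invented
toy_kband_dict = {
    ('come', 'v'): 1,
    ('from', 'i'): 1,
    ('a', 'a'): 1,
    ('wealthy', 'j'): 3,
    ('family', 'n'): 1,
    ('have', 'v'): 1,
    ('money', 'n'): 1,
}

toy_cols = [
    [('come', 'v'), ('from', 'i'), ('a', 'a'), ('wealthy', 'j'), ('family', 'n')],
    [('have', 'v'), ('money', 'n')],
]

def highest_kband(col, kband_dict):
    """Return the largest K-band number among the items of one collocation."""
    return max(kband_dict[item] for item in col)

toy_col_kband = [highest_kband(col, toy_kband_dict) for col in toy_cols]
print(toy_col_kband)  # [3, 1]
```

Using the highest band means a collocation is classified by its rarest item, so a single less frequent word is enough to move the whole collocation out of the high-frequency group.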

In [63]:
# Create (kband, cols) tuples

texts_df['kband_cols'] = list(zip(texts_df.col_kband,texts_df.cols))
texts_df['kband_cols'] = texts_df['kband_cols'].apply(lambda row: list(zip(row[0],row[1])))

In [64]:
# Group Kbands

high_freq_K = list(range(1,3))
mid_freq_K = list(range(3,10))
low_freq_K = list(range(10,101))

In [65]:
# Create columns of counts of cols whose highest kband item is low, mid, or high frequency

texts_df['K10to16_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in low_freq_K]))
texts_df['K3to9_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in mid_freq_K]))
texts_df['K1to2_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in high_freq_K]))
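The three counts could also be produced in a single pass by mapping each band number to a group label and tallying the labels. The following is a plain-Python sketch of that idea; `kband_group` and the toy list are hypothetical and not part of the notebook, and the group boundaries follow the ranges defined in the "Group Kbands" cell.

```python
from collections import Counter

def kband_group(kband):
    """Map a K-band number to a frequency group label."""
    if 1 <= kband <= 2:
        return 'high'
    if 3 <= kband <= 9:
        return 'mid'
    if 10 <= kband <= 100:
        return 'low'
    raise ValueError(f'K-band out of range: {kband}')

# Invented list of per-collocation K-bands, for illustration only
toy_col_kband = [1, 2, 3, 8, 1, 12]

group_counts = Counter(kband_group(k) for k in toy_col_kband)
print(group_counts['high'], group_counts['mid'], group_counts['low'])  # 3 2 1
```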

In [66]:
# Add proportion columns (each count divided by the total number of cols)

total_cols = texts_df['K10to16_cols'] + texts_df['K3to9_cols'] + texts_df['K1to2_cols']
texts_df['K10to16_p'] = texts_df['K10to16_cols'] / total_cols
texts_df['K3to9_p'] = texts_df['K3to9_cols'] / total_cols
texts_df['K1to2_p'] = texts_df['K1to2_cols'] / total_cols
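Each of these columns is a count divided by the total number of collocations in the text. A plain-Python sketch of the same arithmetic follows, using the counts reported in this notebook's output (0, 8, and 23 for B2_orig; 0, 6, and 16 for B2_norm). The function name and the guard for a text with no collocations are additions for illustration, not part of the notebook.

```python
def band_proportions(low, mid, high):
    """Return (low, mid, high) proportions of collocations, rounded to 3 digits."""
    total = low + mid + high
    if total == 0:
        # Assumed behavior for an empty text: report zeros instead of dividing by zero
        return (0.0, 0.0, 0.0)
    return tuple(round(n / total, 3) for n in (low, mid, high))

orig_props = band_proportions(0, 8, 23)
norm_props = band_proportions(0, 6, 16)
print(orig_props)  # (0.0, 0.258, 0.742)
print(norm_props)  # (0.0, 0.273, 0.727)
```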

In [67]:
# Create separate columns with low/mid/high cols for ease of viewing

texts_df['K1to2_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] <= 2])
texts_df['K3to9_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if 10 > x[0] > 2])
texts_df['K10to16_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] > 9])

In [68]:
# Round all stats to 3 digits for ease of use

texts_df = texts_df.round(3)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types,grammar_errors,grammar_errors_per_100,punc_errors,punc_errors_per_100,bad_cols,col_errors,correct_cols,col_errors_per_100,correct_cols_per_100,cols,col_kband,kband_cols,K10to16_cols,K3to9_cols,K1to2_cols,K10to16_p,K3to9_p,K1to2_p,K1to2_cols_K,K3to9_cols_K,K10to16_cols_K
0,B2_orig,"I greatly support the idea. I support it, beca...","[I, greatly, support, the, idea, ., I, support...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., I, support...","[(I, p), (greatly, r), (support, v), (the, a),...",349,7.286,0.857,46.781,0.9,0.668,0.164,0.029,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('because', 'i'), ('of', 'i')), 2.29), ((('...",2.94,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.965,0.966,"[((('because', 'i'), ('of', 'i')), 443.825), (...",162.093,145.61,10,2.865,9,2.579,"[raise in (values), psychological values, oppo...",11,31,3.152,8.883,"[[(support, v), (the, a), (idea, n)], [(value,...","[1, 1, 1, 2, 3, 1, 1, 2, 3, 8, 1, 2, 1, 4, 1, ...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...",0,8,23,0.0,0.258,0.742,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[]
1,B2_norm,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raised, in...","[(I, PRP), (greatly, RB), (support, VBP), (the...","[(I, p), (greatly, r), (support, v), (the, a),...","[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",250,7.314,0.8,43.457,0.938,0.657,0.171,0.024,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('such', 'i'), ('as', 'i')), 6.02), ((('har...",2.878,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.964,0.965,"[((('such', 'i'), ('as', 'i')), 494.477), ((('...",153.301,126.053,7,2.8,6,2.4,"[raised in (values), psychological values, opp...",8,22,3.2,8.8,"[[(support, v), (the, a), (idea, n)], [(value,...","[1, 1, 1, 2, 3, 1, 1, 2, 3, 8, 1, 2, 1, 4, 1, ...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...",0,6,16,0.0,0.273,0.727,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[]


## Normalizing length

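This section compares each measure for the original and normalized versions of the text. Each cell prints the `[original, normalized]` values, followed by the range of plus or minus 5% around the original value.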

In [69]:
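# Create function for finding range of + or - 5%

def find_range(stat):
    """Return the range from 95% to 105% of stat as a 'low - high' string (bounds rounded to 2 digits)."""
    low = str(round(stat*.95,2))
    high = str(round(stat*1.05,2))
    stat_range = low + ' - ' + high
    return stat_range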
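Because `find_range` returns a string with rounded bounds, borderline values have to be compared by eye. A possible companion helper, hypothetical and not part of the original notebook, makes the same comparison numerically on the unrounded bounds. The example values are the MLC and CNC pairs reported in this notebook.

```python
def within_tolerance(orig, mod, tol=0.05):
    """Return True if mod lies within +/- tol (as a proportion) of orig."""
    # sorted() keeps the bounds in order even if orig is negative
    low, high = sorted((orig * (1 - tol), orig * (1 + tol)))
    return low <= mod <= high

print(within_tolerance(7.286, 7.314))  # MLC: True
print(within_tolerance(0.857, 0.8))    # CNC: False
```

A boolean check like this avoids the ambiguity of two-digit rounding for small proportions, where the printed range can be only one or two hundredths wide.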

In [70]:
# vocD: target is mod within 5% of orig (vocD changes slightly each time it is calculated because it is based on random samples; here mod is about 7% below orig)

B2_vocd = texts_df['vocD'].to_list()
B2_vocd 

find_range(B2_vocd[0])

[46.781, 43.457]

'44.44 - 49.12'

In [71]:
# AG (PSL): mod within 5% of orig

B2_AG = texts_df['AG'].to_list()
B2_AG

find_range(B2_AG[0])

[0.9, 0.938]

'0.85 - 0.95'

In [72]:
# Mean MI (words): mod within 5% of orig

B2_mean_MI = texts_df['mean_MI'].to_list()
B2_mean_MI

find_range(B2_mean_MI[0])

[2.94, 2.878]

'2.79 - 3.09'

In [73]:
# Mean t-score (words): target is mod within 5% of orig (here mod is about 5.4% below orig)

B2_mean_t_score = texts_df['mean_tscore'].to_list()
B2_mean_t_score

find_range(B2_mean_t_score[0])

[162.093, 153.301]

'153.99 - 170.2'

In [74]:
# Mean proportion of bigrams (words): mod within 5% of orig or closest possible (see B2)

B2_absent_prop = texts_df['absent_prop'].to_list()
B2_absent_prop

find_range(B2_absent_prop[0])

[0.965, 0.964]

'0.92 - 1.01'

In [75]:
# Grammar errors per 100: mod within 5% of orig

B2_grammar_errors_per_100 = texts_df['grammar_errors_per_100'].to_list()
B2_grammar_errors_per_100

find_range(B2_grammar_errors_per_100[0])

[2.865, 2.8]

'2.72 - 3.01'

In [76]:
# Punctuation errors per 100: target is mod within 5% of orig (here mod is about 6.9% below orig)

B2_punc_errors_per_100 = texts_df['punc_errors_per_100'].to_list()
B2_punc_errors_per_100

find_range(B2_punc_errors_per_100[0])

[2.579, 2.4]

'2.45 - 2.71'

In [77]:
# Collocation errors per 100: mod within 5% of orig

B2_col_errors_per_100 = texts_df['col_errors_per_100'].to_list()
B2_col_errors_per_100

find_range(B2_col_errors_per_100[0])

[3.152, 3.2]

'2.99 - 3.31'

In [78]:
# Accurate cols per 100

B2_correct_cols_per_100 = texts_df['correct_cols_per_100'].to_list()
B2_correct_cols_per_100

find_range(B2_correct_cols_per_100[0])

[8.883, 8.8]

'8.44 - 9.33'

In [79]:
# K10-16 cols percent

B2_K10to16_p = texts_df['K10to16_p'].to_list()
B2_K10to16_p

find_range(B2_K10to16_p[0])

[0.0, 0.0]

'0.0 - 0.0'

In [80]:
# K3-9 cols percent

B2_K3to9_p = texts_df['K3to9_p'].to_list()
B2_K3to9_p

find_range(B2_K3to9_p[0])

[0.258, 0.273]

'0.25 - 0.27'

In [81]:
# K1-2 cols percent

B2_K1to2_p = texts_df['K1to2_p'].to_list()
B2_K1to2_p

find_range(B2_K1to2_p[0])

[0.742, 0.727]

'0.7 - 0.78'

In [82]:
# Bigram range: mod within 5% of orig

B2_bigram_range = texts_df['bigram_range'].to_list()
B2_bigram_range

find_range(B2_bigram_range[0])

[0.164, 0.171]

'0.16 - 0.17'

In [83]:
# CNC: target is mod within 5% of orig (here mod is about 6.7% below orig)

B2_CNC = texts_df['CNC'].to_list()
B2_CNC

find_range(B2_CNC[0])

[0.857, 0.8]

'0.81 - 0.9'

In [84]:
# MLC: mod within 5% of orig

B2_MLC = texts_df['MLC'].to_list()
B2_MLC

find_range(B2_MLC[0])

[7.286, 7.314]

'6.92 - 7.65'
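The cell-by-cell checks above can also be gathered into a single pass. The sketch below hard-codes the (original, normalized) pairs printed in the outputs above and reports the relative difference for each measure, flagging those outside the 5% band. `K10to16_p` is left out because its original value is 0, so a relative difference is undefined. The helper name and the flagging format are illustrative choices, not part of the notebook.

```python
# (original, normalized) values copied from the outputs above
pairs = {
    'vocD': (46.781, 43.457),
    'AG': (0.9, 0.938),
    'mean_MI': (2.94, 2.878),
    'mean_tscore': (162.093, 153.301),
    'absent_prop': (0.965, 0.964),
    'grammar_errors_per_100': (2.865, 2.8),
    'punc_errors_per_100': (2.579, 2.4),
    'col_errors_per_100': (3.152, 3.2),
    'correct_cols_per_100': (8.883, 8.8),
    'K3to9_p': (0.258, 0.273),
    'K1to2_p': (0.742, 0.727),
    'bigram_range': (0.164, 0.171),
    'CNC': (0.857, 0.8),
    'MLC': (7.286, 7.314),
}

def pct_diff(orig, mod):
    """Relative difference of mod from orig, in percent."""
    return (mod - orig) / orig * 100

diffs = {name: pct_diff(o, m) for name, (o, m) in pairs.items()}
outside = [name for name, d in diffs.items() if abs(d) > 5]

for name, d in diffs.items():
    flag = '  <-- outside 5%' if name in outside else ''
    print(f'{name:24s}{d:+6.1f}%{flag}')
```

With the values reported here, the flagged measures are `vocD`, `mean_tscore`, `punc_errors_per_100`, `K3to9_p`, and `CNC`; all others fall within 5% of the original.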

In [85]:
# Final comparison with relevant stats only

texts_final = texts_df[['text_id','text','lemmas_NLTK','lemmas_CLAWS','text_len','MLC','CNC','grammar_errors_per_100',
                        'punc_errors_per_100','vocD','AG','bigram_range','mean_MI','absent_prop',
                        'mean_tscore','col_errors_per_100','correct_cols_per_100','K10to16_p','K3to9_p','K1to2_p',
                        'kband_cols', 'K1to2_cols','K3to9_cols','K10to16_cols',
                        'K1to2_cols_K','K3to9_cols_K','K10to16_cols_K','bad_cols']]

In [86]:
# Pickle for later use

joblib.dump(texts_final ,'../docs/B2_orig&norm.pkl')

['../docs/B2_orig&norm.pkl']

In [87]:
# Pickle possible cols for collocation identification notebook

B2_cols = texts_df[['text_id','lemmas_CLAWS','possible_cols']]

joblib.dump(B2_cols ,'../docs/B2_cols.pkl')

['../docs/B2_cols.pkl']

[Back to top](#Normalizing-B2-original-text)