# Normalizing B1 original text

**Language: Python**

This notebook shows the process used for creating the length-normalized B1 text from the original public IELTS Task 2 B1 text. The original text is a public academic writing sample from the [IELTS website](https://www.ielts.org/en-us/about-the-test/sample-test-questions). The original text is 172 words and the desired length is 250 words (see chapters 5.1 and 5.2 of the dissertation).

**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Text processing](#Text-processing)
- [Syntactic complexity](#Syntactic-complexity)
- [Lexical diversity](#Lexical-diversity)
- [Lexical sophistication](#Lexical-sophistication)
- [Collocation measures](#Collocation-measures)
- [Accuracy](#Accuracy)
- [Normalizing length](#Normalizing-length)

## Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
from ast import literal_eval
from nltk import pos_tag_sents
from pelitk import lex
import joblib
import numpy as np
import math
from collections import Counter

In [2]:
# Set preferred notebook format

%pprint # Turn off pretty printing
InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

Pretty printing has been turned OFF


**Note:** As described in the [README.md](../README.md), the frequency information from COCA referenced here is not freely available but can be purchased at https://corpus.byu.edu/coca. Without these data you can view a few rows of the resulting dataframes, but you cannot run the code yourself. The t-scores and K-bands were also calculated from these data.

In [3]:
# Import necessary dictionaries

coca_freq_dict = joblib.load('../../COCA_data/COCA_2020_lemma_freq_dict.pkl')
coca_word_lemma_dict = joblib.load('../../COCA_data/COCA_2020_word_lemma_dict.pkl')
col_freq_dict = joblib.load('../../COCA_data/COCA_2020_collocate_freq_dict.pkl')
MI_dict = joblib.load('../../COCA_data/COCA_2020_MI_dict.pkl')
tscore_dict = joblib.load('../../COCA_data/COCA_2020_tscore_dict.pkl')
kband_dict = joblib.load('../../COCA_data/COCA_2020_lemma_Kband_dict.pkl') # All items lower-case

In [4]:
# Read in original text (transcribed and with corrected spelling)

with open("../docs/B1_original_corrected.txt", "r") as f:
    B1_orig = f.read()

**Note:** In addition to correcting spelling, contractions were changed to full words, '&' to 'and', '20' to 'twenty', and 'Mr' to 'Mister'.

In [5]:
# Read in modified text

with open("../docs/B1_normalized.txt", "r") as f:
    B1_norm = f.read()

**Note:** The modified text was produced to meet the goals described later under [`Normalizing length`](#Normalizing-length), but it is read in here to avoid running the text processing procedure twice.

In [6]:
# Create dataframe

texts_df = pd.DataFrame({'text_id':pd.Series(['B1_orig','B1_norm']),
                         'text':pd.Series([B1_orig,B1_norm])})

texts_df

Unnamed: 0,text_id,text
0,B1_orig,I disagree that point about children brought u...
1,B1_norm,I disagree that point about children brought u...


## Text processing

The tokenizer, part-of-speech tagger, and lemmatizer tools are the same ones used in the creation of the [PELIC](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) corpus. The tokenizer and lemmatizer are not open access but are based on the ones from [NLTK](https://www.nltk.org/), and using the public NLTK tools will yield similar results. For a more detailed description of the modified tools, please see [Naismith et al. (2022)](https://benjamins.com/catalog/ijlcr.21002.nai).

In [7]:
# Change to working directory containing elitools

%cd '../../ELI_Data_Mining/Data-Archive/elitools/'

/Users/Ben/Documents/ELI_Data_Mining/Data-Archive/elitools


In [8]:
# Load lemmatizer

%run -i 'lemmatizer_class.py'
lemmatizer = lemmatizer()

In [9]:
# Load the tokenizer module

%run -i 'tokenizer.py'

In [10]:
# Return to previous working directory

%cd '../../../Collocational_proficiency_Naismith_2022/notebooks'

/Users/Ben/Documents/Collocational_proficiency_Naismith_2022/notebooks


### Tokenization

In [11]:
# Tokenize text (nltk-based)

texts_df['toks'] = texts_df.text.apply(tokenize)

texts_df

Unnamed: 0,text_id,text,toks
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br..."
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br..."


### POS tagging
- NLTK (PELIC)
- CLAWS7 (COCA)

#### NLTK
Because there are only three texts (each with three versions), I used the default NLTK tagger (Penn Treebank tagset) and then manually checked and corrected the tags to avoid errors.

In [12]:
# Apply nltk tagger to create series

B1_NLTK = pd.Series(pos_tag_sents(texts_df['toks']))

In [13]:
# Check tags

# Write out tagged texts
B1_NLTK.to_csv('../docs/B1_NLTK.csv', index=False, header=False) 

# Read in the checked tagged texts as a series
B1_NLTK_CHECKED = pd.read_csv("../docs/B1_NLTK_CHECKED.csv", header=None).squeeze("columns")
B1_NLTK_CHECKED = [literal_eval(x) for x in B1_NLTK_CHECKED]




In [14]:
# Create column based on checked tagged texts

texts_df['tok_POS_NLTK'] = B1_NLTK_CHECKED
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point..."
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point..."


#### CLAWS7

Tagging with CLAWS7 as well makes it easier to match the POS tags to the COCA data; the Penn Treebank tags produced by NLTK would otherwise have to be converted.

Free CLAWS tagger: http://ucrel-api.lancaster.ac.uk/claws/free.html

Again, these tagged texts should be manually checked prior to use.

In [15]:
# Read in tagged CLAWS texts

with open("../docs/B1_original_CLAWS.txt", "r") as f:
    B1_orig_CLAWS = f.read()

with open("../docs/B1_normalized_CLAWS.txt", "r") as f:
    B1_norm_CLAWS = f.read()

In [16]:
# Remove new line characters, split on whitespace, and remove identifier at end

B1_orig_CLAWS = B1_orig_CLAWS.replace('\n', '').split(' ')[:-2]
B1_norm_CLAWS = B1_norm_CLAWS.replace('\n', '').split(' ')[:-2]

In [17]:
# Change tags into tuples

B1_orig_CLAWS = [tuple(x.split('_')) for x in B1_orig_CLAWS]
B1_norm_CLAWS = [tuple(x.split('_')) for x in B1_norm_CLAWS]
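
The parsing steps in the cells above can be sketched on a toy string. The sample below is a hypothetical stand-in for the CLAWS output files (the real files are not shown here), and it assumes, as the slicing above does, that the last two whitespace-separated items are an identifier rather than tagged tokens. Two details differ from the cells above and are assumptions of this sketch: newlines are replaced by a space so that tokens on either side of a line break stay separate, and `rsplit('_', 1)` is used so that a token containing an underscore would still yield a (word, tag) pair.

```python
# Hypothetical CLAWS-style horizontal output: word_TAG pairs plus a trailing identifier
sample = "I_PPIS1 disagree_VV0 that_DD1\npoint_NN1 ._. 0000003_001 <end>"

def parse_claws(raw, n_trailing=2):
    """Turn 'word_TAG word_TAG ...' into [(word, tag), ...], dropping trailing items."""
    items = raw.replace("\n", " ").split()
    if n_trailing:
        items = items[:-n_trailing]
    # Split on the last underscore only, so the tag is always the final field
    return [tuple(item.rsplit("_", 1)) for item in items]

parsed = parse_claws(sample)
print(parsed)
```

Whatever parsing is used, the resulting tuples still need the manual check mentioned above before they are used.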

In [18]:
texts_df['toks_POS_CLAWS'] = pd.Series([B1_orig_CLAWS,B1_norm_CLAWS])
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, PPIS1), (disagree, VV0), (that, DD1), (po..."
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, PPIS1), (disagree, VV0), (that, DD1), (po..."


### Lemmatization

#### NLTK

In [19]:
# Create lemmatized text column using our lemmatizer loaded earlier

texts_df['lemmas_NLTK'] = texts_df['tok_POS_NLTK'].apply(lemmatizer.lemmatize_text)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, PPIS1), (disagree, VV0), (that, DD1), (po...","[I, disagree, that, point, about, child, bring..."
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, PPIS1), (disagree, VV0), (that, DD1), (po...","[I, disagree, that, point, about, child, bring..."


#### CLAWS

In [20]:
# Keep only first letter of CLAWS PoS tags

texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0],x[1][0].lower()) for x in row])

In [21]:
# Remove punctuation from CLAWS texts

COCA_POS = sorted(list(set([x[1] for x in coca_freq_dict.keys()])))
texts_df.toks_POS_CLAWS = texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if x[1] in COCA_POS])
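
As a minimal illustration of the two cells above, the sketch below reduces CLAWS7 tags to their first letter and then drops any token whose reduced tag is not a known part-of-speech letter. The set `coca_pos_letters` is a hypothetical stand-in for the letters extracted from the keys of `coca_freq_dict`, which cannot be reproduced here.

```python
# Toy (word, CLAWS7 tag) pairs
tagged = [("I", "PPIS1"), ("disagree", "VV0"), ("that", "DD1"), ("point", "NN1"), (".", ".")]

# Hypothetical stand-in for the POS letters found in the COCA frequency dictionary keys
coca_pos_letters = {"p", "v", "d", "n", "j", "r", "i", "a", "c"}

# Keep only the lower-cased first character of each tag
reduced = [(word, tag[0].lower()) for word, tag in tagged]

# Punctuation tags such as '.' are not in the POS set, so those tokens drop out
filtered = [(word, pos) for word, pos in reduced if pos in coca_pos_letters]
print(filtered)
```

Filtering on the tag rather than on the word form means no separate punctuation list is needed for the CLAWS texts.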

In [22]:
# Check lemmas not in COCA dict

sorted(list(set([x for y in texts_df.toks_POS_CLAWS.apply(lambda row: [x for x in row if (x[0].lower(),x[1]) not in coca_word_lemma_dict]).to_list() for x in y])))

[]

In [23]:
# Create CLAWS lemma column

# First lower case all toks (as in the word_lemma dict)
texts_df['lemmas_CLAWS'] = texts_df.toks_POS_CLAWS.apply(lambda row: [(x[0].lower(),x[1]) for x in row])

# Then map dict
texts_df.lemmas_CLAWS = texts_df.lemmas_CLAWS.apply(
    lambda row:[coca_word_lemma_dict[x] if x in coca_word_lemma_dict else x for x in row])

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),..."
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),..."
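
The lemma lookup with fallback used for the `lemmas_CLAWS` column can be sketched with a toy dictionary. The mapping below is hypothetical; it only assumes, in line with the output above, that the keys are lower-cased `(word, pos)` tuples and the values are `(lemma, pos)` tuples.

```python
# Hypothetical miniature of the (word, pos) -> (lemma, pos) mapping
toy_word_lemma = {
    ("children", "n"): ("child", "n"),
    ("brought", "v"): ("bring", "v"),
}

tokens = [("Children", "n"), ("brought", "v"), ("up", "r")]

# Lower-case the word form first, since the dictionary keys are assumed to be lower-case
lowered = [(word.lower(), pos) for word, pos in tokens]

# dict.get(key, key) returns the token unchanged when no lemma entry exists
lemmas = [toy_word_lemma.get(tok, tok) for tok in lowered]
print(lemmas)
```

`dict.get(tok, tok)` is equivalent to the conditional expression in the cell above and does a single lookup per token.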


### Length
Length was counted manually rather than with `len(toks)` or other regex-based counting, so that the counts match how words are counted on IELTS tests (where examiners also count manually). These counts often match the word count that Microsoft Word would provide.
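
To see why a token count is not a reliable word count, the sketch below compares a whitespace-based word count with a crude regex tokenizer on a made-up sentence (not taken from the B1 text). The tokenizer pattern is a hypothetical simplification, not the PELIC tokenizer.

```python
import re

# Made-up example sentence
sentence = "In my case, I started work at fifteen."

# Whitespace-delimited words, roughly what a manual or word-processor count gives
word_count = len(sentence.split())

# A crude tokenizer that separates punctuation marks into their own tokens
tokens = re.findall(r"\w+|[^\w\s]", sentence)
token_count = len(tokens)

print(word_count, token_count)
```

Because punctuation marks become tokens of their own, the token count is higher than the word count, which is why the lengths are entered by hand in the next cell.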

In [24]:
# Create dictionary

text_len = {'B1_orig':172,'B1_norm':250}

In [25]:
# Create length column

texts_df['text_len'] = texts_df.text_id.map(text_len)
texts_df['text_len'] = texts_df['text_len'].astype(int)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250


## Syntactic complexity

Analysis using [TAASSC](https://www.linguisticanalysistools.org/taassc.html), calculating the measures from Lu's (2010) [Syntactic Complexity Analyzer](https://aihaiyang.com/software/). Based on previous research, the focus is on the two metrics most important for predicting proficiency: the number of complex nominals per clause (CN/C) and the mean length of clause (MLC).

In [26]:
# Read in TAASSC analysis file

TAASSC = pd.read_csv("../docs/B1_TAASSC_sca.csv")

In [27]:
# Rename files to match texts_df

file_names = {'B1_original_corrected.txt':'B1_orig','B1_normalized.txt':'B1_norm'}
TAASSC.filename = TAASSC.filename.map(file_names)
TAASSC = TAASSC.loc[~TAASSC.filename.isnull()]
TAASSC

Unnamed: 0,filename,nwords,MLS,MLT,MLC,C_S,VP_T,C_T,DC_C,DC_T,T_S,CT_T,CP_T,CP_C,CN_T,CN_C
1,B1_orig,172,17.2,15.636364,6.615385,2.6,2.727273,2.363636,0.307692,0.727273,1.1,0.545455,0.090909,0.038462,1.454545,0.615385
3,B1_norm,250,16.666667,15.625,6.410256,2.6,2.875,2.4375,0.358974,0.875,1.066667,0.5625,0.1875,0.076923,1.5625,0.641026
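
Both focus metrics are simple ratios: MLC is words per clause and CN/C is complex nominals per clause. The sketch below reproduces the reported values from raw counts. Only `nwords` comes from the table; the clause and complex-nominal counts are not reported by TAASSC here and are inferred values that happen to reproduce the ratios, so they should be read as assumptions.

```python
def ratio(numerator, denominator):
    """Per-unit ratio used for MLC (words/clauses) and CN/C (complex nominals/clauses)."""
    return numerator / denominator

# 'words' is taken from the table; the other counts are inferred (assumed), not reported
counts = {
    "B1_orig": {"words": 172, "clauses": 26, "complex_nominals": 16},
    "B1_norm": {"words": 250, "clauses": 39, "complex_nominals": 25},
}

for text_id, c in counts.items():
    mlc = ratio(c["words"], c["clauses"])
    cnc = ratio(c["complex_nominals"], c["clauses"])
    print(text_id, round(mlc, 6), round(cnc, 6))
```

Seen this way, keeping MLC and CN/C stable while lengthening the text means adding clauses and complex nominals in roughly the same proportion as words.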


In [28]:
# Keep only relevant syntactic complexity columns and rename them

TAASSC = TAASSC[['filename','MLC','CN_C']]
TAASSC = TAASSC.rename(columns={"filename": "text_id",'CN_C':'CNC'})
TAASSC

Unnamed: 0,text_id,MLC,CNC
1,B1_orig,6.615385,0.615385
3,B1_norm,6.410256,0.641026


In [29]:
# Merge TAASSC data with texts_df

texts_df = pd.merge(texts_df, TAASSC, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615385,0.615385
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.410256,0.641026


## Lexical diversity

vocD calculated on lemmas, using functions from [PELITK](https://github.com/ELI-Data-Mining-Group/pelitk).

In [30]:
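# Remove punctuation before calculating
# Note: the next cell overwrites this column with vocD computed from lemmas_NLTK,
# so this filtered token list is not what lex.vocd receives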

punctuation = ['.','!','?',';',':','#','"',"'",'``','`',',','--','-','...',')','(',"''"]

texts_df['vocD'] = texts_df.toks.apply(lambda row: [x for x in row if x not in punctuation])

In [31]:
# Create vocD column

texts_df['vocD'] = texts_df.lemmas_NLTK.apply(lex.vocd)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615385,0.615385,47.869657
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.410256,0.641026,48.723059
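
For readers unfamiliar with vocD, the following is a rough sketch of the general idea, not the PELITK implementation: mean type-token ratios (TTR) are computed for random samples of 35 to 50 tokens, and the parameter D is chosen so that the theoretical curve TTR = (D/N)(sqrt(1 + 2N/D) - 1) fits those means best. The function names, the grid search, and the toy token lists are all assumptions made for illustration. A fixed seed is used here for reproducibility; without one, the random sampling makes the value vary slightly between runs.

```python
import math
import random

def ttr_model(n, d):
    """Theoretical type-token ratio for sample size n and diversity parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

def vocd_sketch(tokens, n_min=35, n_max=50, trials=100, seed=0):
    """Estimate D by fitting ttr_model to mean TTRs of random samples (grid search)."""
    rng = random.Random(seed)
    sizes = range(n_min, n_max + 1)
    mean_ttrs = []
    for n in sizes:
        # Sample n tokens without replacement and record the proportion of distinct types
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        mean_ttrs.append(sum(ttrs) / trials)

    def sse(d):
        return sum((ttr_model(n, d) - t) ** 2 for n, t in zip(sizes, mean_ttrs))

    # Coarse least-squares search over candidate D values from 0.5 to 200
    candidates = [d / 2 for d in range(1, 401)]
    return min(candidates, key=sse)

# Toy token lists: one drawn from many types, one from few types
rng = random.Random(1)
varied = [f"w{rng.randrange(100)}" for _ in range(200)]
repetitive = [f"w{rng.randrange(30)}" for _ in range(200)]

print(vocd_sketch(varied), vocd_sketch(repetitive))
```

The more varied list receives the higher D, which is the property the measure is used for here.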


## Lexical sophistication

### Advanced Guiraud (AG)
AG based on lemmas using a frequency list (PSL3) compiled from the PELIC learner corpus (see dissertation section 2.2.2).

In [32]:
# Read in PSL3 list for manual checking of items in texts that are off list

with open("../docs/psl3.txt", "r") as f:
    PSL3 = f.read()
PSL3 = sorted(PSL3.split('\n'))
len(PSL3)
PSL3[-10:]

2000

['yesterday', 'yet', 'yogurt', 'you', 'young', 'your', 'yours', 'yourself', 'youth', 'zoo']

In [33]:
# Create AG column (punctuation removed)

texts_df['AG'] = texts_df.lemmas_NLTK.apply(lambda row: [x for x in row if x not in punctuation]).apply(
    lex.adv_guiraud,freq_list = 'PSL3')

texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615385,0.615385,47.869657,0.381246
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.410256,0.641026,48.723059,0.379473
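
Advanced Guiraud is commonly defined as the number of advanced (off-list) types divided by the square root of the number of tokens. The sketch below applies that definition to toy data; it is not the PELITK function, and the word list and lemmas are hypothetical. The last line checks the definition against the `AG` column above: the counts of 5 and 6 advanced types are inferred values that reproduce the reported scores, not figures stated in this notebook.

```python
import math

def adv_guiraud_sketch(lemmas, common_words):
    """Advanced Guiraud: advanced (off-list) types divided by sqrt(number of tokens)."""
    advanced_types = {lemma.lower() for lemma in lemmas} - set(common_words)
    return len(advanced_types) / math.sqrt(len(lemmas))

# Toy data: a hypothetical 'common words' list and lemma sequence
common = {"i", "disagree", "that", "point", "about", "child"}
toy_lemmas = ["I", "disagree", "that", "point", "about", "child",
              "journalist", "culture", "child"]
print(round(adv_guiraud_sketch(toy_lemmas, common), 3))

# Inferred check against the AG column (5 and 6 advanced types are assumptions)
print(round(5 / math.sqrt(172), 6), round(6 / math.sqrt(250), 6))
```

Because the denominator grows with the square root of text length, a longer text needs proportionally fewer new advanced types to keep AG constant.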


### Contextual diversity

Analysis using [TAALES](https://www.linguisticanalysistools.org/taales.html). Based on previous research, one metric is the focus: contextual diversity as in Monteiro et al. (2018).

In [34]:
# Read in TAALES analysis

TAALES = pd.read_csv("../docs/B1_TAALES.csv")

In [35]:
# Rename files to match texts_df

TAALES.Filename = TAALES.Filename.map(file_names)
TAALES = TAALES.loc[~TAALES.Filename.isnull()]

In [36]:
# Keep only relevant contextual diversity columns and rename them

TAALES = TAALES[['Filename','COCA_Academic_Range_AW','COCA_Academic_Bigram_Range','COCA_Academic_Trigram_Range']]
TAALES = TAALES.rename(columns={"Filename": "text_id",'COCA_Academic_Range_AW':'unigram_range',
                                'COCA_Academic_Bigram_Range':'bigram_range',
                                'COCA_Academic_Trigram_Range':'trigram_range'})
TAALES

Unnamed: 0,text_id,unigram_range,bigram_range,trigram_range
0,B1_norm,0.641003,0.094547,0.027453
4,B1_orig,0.63163,0.08789,0.027092


In [37]:
# Merge TAALES data with texts_df

texts_df = pd.merge(texts_df, TAALES, on='text_id')
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615385,0.615385,47.869657,0.381246,0.63163,0.08789,0.027092
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.410256,0.641026,48.723059,0.379473,0.641003,0.094547,0.027453


## Collocation measures
Three measures that make up the 'CollGram' profile from Granger & Bestgen / Bestgen & Granger (2014):
- mean MI
- mean t-score
- proportion of bigrams absent from reference corpus

In [38]:
# Extract potential collocations in span 4

def find_cols(lemma_list):
    col_list = list(zip(lemma_list,lemma_list[1:]))+list(zip(lemma_list,lemma_list[2:]))\
    +list(zip(lemma_list,lemma_list[3:]))+list(zip(lemma_list,lemma_list[4:]))
    return col_list
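
The same extraction can be written with the span as a parameter, which makes the windowing easier to see on a short toy list. This is a sketch equivalent to `find_cols` when `span=4`; the name `find_cols_span` and the toy lemmas are introduced here for illustration only.

```python
def find_cols_span(lemma_list, span=4):
    """Pair each item with the items 1..span positions to its right."""
    pairs = []
    for distance in range(1, span + 1):
        # zip stops at the shorter sequence, so no pair runs past the end of the list
        pairs.extend(zip(lemma_list, lemma_list[distance:]))
    return pairs

toy = ["start", "a", "new", "job", "now"]
for pair in find_cols_span(toy):
    print(pair)
```

Pairs are ordered by distance (adjacent pairs first), and each pair keeps the left-to-right order of the text, so `('start', 'job')` is produced but `('job', 'start')` is not.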

In [39]:
# Create possible collocations column

texts_df['possible_cols'] = texts_df.lemmas_CLAWS.apply(find_cols)

In [40]:
# Lower-case (doesn't matter that 'I' gets lowered as not in collocate dict)

texts_df['possible_cols'] = texts_df.possible_cols.apply(
    lambda row: [((x[0][0].lower(),x[0][1]),(x[1][0].lower(),x[1][1])) for x in row])

In [41]:
# Create list of all possible collocations

possible_cols = sorted(list(set([x for y in texts_df.possible_cols.to_list() for x in y])))
possible_cols[:5]
possible_cols[-5:]
len(possible_cols)

[(('a', 'a'), ('about', 'i')), (('a', 'a'), ('and', 'c')), (('a', 'a'), ('child', 'n')), (('a', 'a'), ('culture', 'n')), (('a', 'a'), ('for', 'i'))]

[(('work', 'v'), ('work', 'v')), (('young', 'j'), ('do', 'v')), (('young', 'j'), ('for', 'i')), (('young', 'j'), ('they', 'p')), (('young', 'j'), ('work', 'n'))]

930

### Mean MI

MI is not calculated for any bigrams with freq less than 5 or MI less than 1.
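
For orientation, the textbook form of the mutual information score is the base-2 logarithm of the observed pair frequency over the frequency expected by chance. The sketch below uses that form together with the two thresholds stated above. How the COCA-derived `MI_dict` values were actually computed (for example, any adjustment for span size) is not specified here, so the functions and the frequencies are assumptions for illustration.

```python
import math

def mutual_information(pair_freq, freq_1, freq_2, corpus_size):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    expected = freq_1 * freq_2 / corpus_size
    return math.log2(pair_freq / expected)

def keep_pair(pair_freq, mi, min_freq=5, min_mi=1.0):
    """Apply the thresholds described above: frequency of at least 5 and MI of at least 1."""
    return pair_freq >= min_freq and mi >= min_mi

# Hypothetical frequencies, chosen only to make the arithmetic easy to follow
mi = mutual_information(pair_freq=8, freq_1=1_000, freq_2=1_000, corpus_size=1_000_000)
print(mi, keep_pair(8, mi))  # expected frequency is 1, so MI = log2(8) = 3.0
```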

In [42]:
# Create column with MI for each possible collocation in MI dict

texts_df['col_MI'] = texts_df.possible_cols.apply(lambda row: [(x,MI_dict[x]) for x in row if x in MI_dict])

In [43]:
# Find mean MI for each text (token-based)

texts_df['mean_MI'] = texts_df.col_MI.apply(lambda row: np.mean([x[1] for x in row]))

### Proportion of absent/low MI word combinations

In [44]:
# Create column of two-word combinations not in collocation dict

texts_df['absent'] = texts_df.possible_cols.apply(lambda row: [x for x in row if x not in col_freq_dict])

In [45]:
# Find proportion of absent two-word combinations compared to total two-word combinations in the text

texts_df['absent_prop'] = texts_df.absent.apply(lambda row: len(row)) / texts_df.possible_cols.apply(lambda row: len(row))

In [46]:
# Find proportion of absent two-word combination types compared to total two-word combination types in the text

texts_df['absent_prop_types'] = texts_df.absent.apply(lambda row: len(set(row))) / texts_df.possible_cols.apply(lambda row: len(set(row)))
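
The difference between the token-based and type-based proportions can be seen on a small toy example. The pairs and the dictionary below are hypothetical stand-ins for `possible_cols` and `col_freq_dict`.

```python
# Toy candidate pairs from a text; one pair occurs twice
pairs = [("good", "effect"), ("start", "work"), ("study", "money"),
         ("start", "work"), ("culture", "about")]

# Hypothetical stand-in for the collocate frequency dictionary
toy_col_freq = {("good", "effect"): 120, ("start", "work"): 340}

absent = [p for p in pairs if p not in toy_col_freq]

# Token-based: every occurrence counts
absent_prop_tokens = len(absent) / len(pairs)
# Type-based: each distinct pair counts once
absent_prop_types = len(set(absent)) / len(set(pairs))

print(absent_prop_tokens, absent_prop_types)
```

Repeated attested pairs lower the token-based proportion but not the type-based one, which is why the two columns can differ slightly.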

### Mean t-scores

In [47]:
# Create column with t-score for each bigram

texts_df['col_tscore'] = texts_df.possible_cols.apply(lambda row: [(x,tscore_dict[x]) for x in row if x in tscore_dict])

In [48]:
# Find mean t-score for each text based on tokens and types

texts_df['mean_tscore'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in row]))
texts_df['mean_tscore_types'] = texts_df.col_tscore.apply(lambda row: np.mean([x[1] for x in set(row)]))
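
As a point of reference, the usual collocation t-score is the difference between observed and expected pair frequency divided by the square root of the observed frequency. The sketch below uses that form with hypothetical frequencies; whether the values in `tscore_dict` were computed in exactly this way is an assumption.

```python
import math

def t_score(pair_freq, freq_1, freq_2, corpus_size):
    """Collocation t-score: (observed - expected) / sqrt(observed)."""
    expected = freq_1 * freq_2 / corpus_size
    return (pair_freq - expected) / math.sqrt(pair_freq)

# Hypothetical frequencies: both pairs occur 8 times more often than expected by chance
frequent = t_score(pair_freq=800, freq_1=10_000, freq_2=10_000, corpus_size=1_000_000)
rare = t_score(pair_freq=8, freq_1=1_000, freq_2=1_000, corpus_size=1_000_000)
print(round(frequent, 3), round(rare, 3))
```

Both toy pairs have the same observed-to-expected ratio, yet the frequent pair has the much larger t-score. This sensitivity to raw frequency is the reason t-score and MI are reported side by side.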

In [49]:
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615385,0.615385,47.869657,0.381246,0.63163,0.08789,0.027092,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.429286,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.979228,0.980551,"[((('our', 'a'), ('country', 'n')), 157.939), ...",115.839286,118.466583
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.410256,0.641026,48.723059,0.379473,0.641003,0.094547,0.027453,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.522,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.984848,0.986175,"[((('our', 'a'), ('country', 'n')), 157.939), ...",111.671867,116.725


## Accuracy
Grammatical and collocational accuracy. Errors were manually annotated and counted.

### Grammatical accuracy

In [50]:
# Create lists of manually identified errors

B1_orig_grammar = ['they want they','had everything give','can do prepare','a money','they doing','could their money','(could) entrance to','their parents which persons','a pocket money','had a work']
B1_norm_grammar = ['they want they','had everything give','can do prepare','a money','they doing','could their money','(could) entrance to','their parents which persons','a pocket money','had a work','could buying','they working','start a work','when they 15','and is good']

len(B1_orig_grammar)
len(B1_norm_grammar)

10

15

In [51]:
# Create grammatical accuracy column

grammar_dict = {'B1_orig':len(B1_orig_grammar),'B1_norm':len(B1_norm_grammar)}

texts_df['grammar_errors'] = texts_df.text_id.map(grammar_dict)
texts_df['grammar_errors_per_100'] = (texts_df.grammar_errors/texts_df.text_len)*100
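
The per-100-words normalization is what makes error counts comparable across the 172-word and 250-word versions. A minimal sketch with the grammar error counts and lengths given in this notebook (the helper name `per_100_words` is introduced here for illustration):

```python
def per_100_words(count, text_len):
    """Normalize a raw count to a rate per 100 words."""
    return count / text_len * 100

# Error counts and text lengths as given in this notebook
print(round(per_100_words(10, 172), 3))  # B1_orig grammar errors
print(round(per_100_words(15, 250), 3))  # B1_norm grammar errors
```

Keeping the rate stable after adding 78 words therefore requires adding errors as well, which is why the normalized text contains five more grammar errors than the original.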

In [52]:
# Add punctuation column (manually counted)

punc_dict = {'B1_orig':13,'B1_norm':19}

texts_df['punc_errors'] = texts_df.text_id.map(punc_dict)
texts_df['punc_errors_per_100'] = (texts_df.punc_errors/texts_df.text_len)*100

### Collocational accuracy

In [53]:
# Record of collocation errors. MI calculations not comparable with two and three word collocations.

B1_orig_errors = ['disagree that point','show that situation','country parents','is not effect to','from twenty ages','social experience','age is late','work by (children ages)','culture about','could their money','country children','accept the money by','study at money','had a work']
B1_norm_errors = ['disagree that point','show that situation','country parents','is not effect to','from twenty ages','social experience','age is late','work by (children ages)','culture about','could their money','country children','accept the money by','study at money','positive school','prepared their life','work my country','very disagree','prepare with many problems','for future time','had a work']

len(B1_orig_errors)
len(B1_norm_errors)

14

20

In [54]:
# Accurate collocations (manually annotated)

B1_orig_cols = ['good effect','on the other hand','in my case','start work','hear about','do work for','entrance to the bank','perfectly prepare']
B1_norm_cols = ['good effect','on the other hand','in my case','start work','hear about','do work for','entrance to the bank','perfectly prepare','good parents','have money','work as a journalist','very young']

len(B1_orig_cols)
len(B1_norm_cols)

8

12

In [55]:
# Create 'bad' cols column for future use

texts_df['bad_cols'] = (B1_orig_errors,B1_norm_errors)

In [56]:
# Create error and accurate cols columns

errors_dict = {'B1_orig':len(B1_orig_errors),'B1_norm':len(B1_norm_errors)}
correct_dict = {'B1_orig':len(B1_orig_cols),'B1_norm':len(B1_norm_cols)}

texts_df['col_errors'] = texts_df.text_id.map(errors_dict)
texts_df['correct_cols'] = texts_df.text_id.map(correct_dict)

In [57]:
# Create errors and correct cols per 100 words columns

texts_df['col_errors_per_100'] = (texts_df.col_errors/texts_df.text_len)*100
texts_df['correct_cols_per_100'] = (texts_df.correct_cols/texts_df.text_len)*100

### Collocation frequency bands
Percentage of collocations containing low/mid/high freq items.  
- High = K1-2
- Mid = K3-9
- Low = K10+

In [58]:
# Tokenize collocations

B1_orig_cols_toks = [x.split() for x in B1_orig_cols]
B1_norm_cols_toks = [x.split() for x in B1_norm_cols]

In [59]:
B1_orig_cols_toks
B1_norm_cols_toks

[['good', 'effect'], ['on', 'the', 'other', 'hand'], ['in', 'my', 'case'], ['start', 'work'], ['hear', 'about'], ['do', 'work', 'for'], ['entrance', 'to', 'the', 'bank'], ['perfectly', 'prepare']]

[['good', 'effect'], ['on', 'the', 'other', 'hand'], ['in', 'my', 'case'], ['start', 'work'], ['hear', 'about'], ['do', 'work', 'for'], ['entrance', 'to', 'the', 'bank'], ['perfectly', 'prepare'], ['good', 'parents'], ['have', 'money'], ['work', 'as', 'a', 'journalist'], ['very', 'young']]

In [60]:
# Collocations with PoS (manually lemmatized and tagged based on above)

B1_orig_cols_toks_POS = [[('good','j'),('effect','n')],
                         [('on','i'),('the','a'),('other','j'),('hand','n')],
                         [('in','i'), ('my','a'), ('case','n')],
                         [('start','v'),('work','n')],
                         [('hear','v'),('about','i')],
                         [('do','v'),('work','n'),('for','i')],
                         [('entrance','n'), ('to', 'i'),('the','a'), ('bank','n')],
                         [('perfectly','r'),('prepare','v')]]


B1_norm_cols_toks_POS =  [[('good','j'),('effect','n')],
                         [('on','i'),('the','a'),('other','j'),('hand','n')],
                         [('in','i'), ('my','a'), ('case','n')],
                         [('start','v'),('work','n')],
                         [('hear','v'),('about','i')],
                         [('do','v'),('work','n'),('for','i')],
                         [('entrance','n'), ('to', 'i'),('the','a'), ('bank','n')],
                         [('perfectly','r'),('prepare','v')],
                         [('good','j'),('parent','n')],
                         [('have','v'),('money','n')],
                         [('work','v'),('as','i'),('a','a'),('journalist','n')],
                         [('very','r'),('young','j')]]

In [61]:
# Create collocation dict

col_dict = {'B1_orig':B1_orig_cols_toks_POS,'B1_norm':B1_norm_cols_toks_POS}

# Create column with collocations

texts_df['cols'] = texts_df.text_id.map(col_dict)

In [62]:
# Create column of the freq bands of the highest kband item in each collocation

texts_df['col_kband'] = texts_df.cols.apply(
    lambda row:[sorted([kband_dict[y] for y in x],reverse=True)[0] for x in row])
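
The idea in the cell above is that each collocation is characterized by its least frequent item, that is, the item with the largest K-band number. The sketch below shows this with `max`, which gives the same result as sorting in reverse and taking the first element. The band values in `toy_kband` are hypothetical and only assume that `kband_dict` maps `(lemma, pos)` tuples to integer bands.

```python
# Hypothetical K-band lookup: (lemma, pos) -> frequency band number
toy_kband = {("good", "j"): 1, ("effect", "n"): 1, ("entrance", "n"): 4,
             ("to", "i"): 1, ("the", "a"): 1, ("bank", "n"): 1}

cols = [[("good", "j"), ("effect", "n")],
        [("entrance", "n"), ("to", "i"), ("the", "a"), ("bank", "n")]]

# The item with the largest band number determines the band of the whole collocation
col_kband = [max(toy_kband[item] for item in col) for col in cols]
print(col_kband)
```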

In [63]:
# Create (kband, cols) tuples

texts_df['kband_cols'] = list(zip(texts_df.col_kband,texts_df.cols))
texts_df['kband_cols'] = texts_df['kband_cols'].apply(lambda row: list(zip(row[0],row[1])))

In [64]:
# Group Kbands

high_freq_K = list(range(1,3))
mid_freq_K = list(range(3,10))
low_freq_K = list(range(10,101))

In [65]:
# Create columns with counts of cols that contain low, mid, high kband items (highest only)

texts_df['K10to16_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in(low_freq_K)]))
texts_df['K3to9_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in(mid_freq_K)]))
texts_df['K1to2_cols'] = texts_df.col_kband.apply(lambda row: len([x for x in row if x in(high_freq_K)]))

In [66]:
# Add percent columns

texts_df['K10to16_p'] = texts_df['K10to16_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])
texts_df['K3to9_p'] = texts_df['K3to9_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])
texts_df['K1to2_p'] = texts_df['K1to2_cols']/(texts_df['K10to16_cols']+texts_df['K3to9_cols']+texts_df['K1to2_cols'])

In [67]:
# Create separate columns with low/mid/high cols for ease of viewing

texts_df['K1to2_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] <= 2])
texts_df['K3to9_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if 10 > x[0] > 2])
texts_df['K10to16_cols_K'] = texts_df.kband_cols.apply(lambda row: [x for x in row if x[0] > 9])
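
The grouping into high, mid and low frequency bands and the resulting proportions can be sketched compactly. The function `band_group` is introduced here for illustration; the list of bands is the `col_kband` value reported for B1_orig in this notebook's output.

```python
from collections import Counter

def band_group(kband):
    """Map a K-band to the grouping defined above: K1-2 high, K3-9 mid, K10+ low."""
    if kband <= 2:
        return "high"
    if kband <= 9:
        return "mid"
    return "low"

# Highest K-band per collocation, as reported for B1_orig
col_kband = [1, 1, 1, 1, 1, 1, 4, 3]

counts = Counter(band_group(k) for k in col_kband)
proportions = {g: counts[g] / len(col_kband) for g in ("high", "mid", "low")}
print(proportions)
```

A `Counter` returns 0 for groups that never occur, so the low band needs no special handling.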

In [68]:
# Round all stats to 3 digits for ease of use

texts_df = round(texts_df,3)
texts_df

Unnamed: 0,text_id,text,toks,tok_POS_NLTK,toks_POS_CLAWS,lemmas_NLTK,lemmas_CLAWS,text_len,MLC,CNC,vocD,AG,unigram_range,bigram_range,trigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,absent_prop_types,col_tscore,mean_tscore,mean_tscore_types,grammar_errors,grammar_errors_per_100,punc_errors,punc_errors_per_100,bad_cols,col_errors,correct_cols,col_errors_per_100,correct_cols_per_100,cols,col_kband,kband_cols,K10to16_cols,K3to9_cols,K1to2_cols,K10to16_p,K3to9_p,K1to2_p,K1to2_cols_K,K3to9_cols_K,K10to16_cols_K
0,B1_orig,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",172,6.615,0.615,47.87,0.381,0.632,0.088,0.027,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.429,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.979,0.981,"[((('our', 'a'), ('country', 'n')), 157.939), ...",115.839,118.467,10,5.814,13,7.558,"[disagree that point, show that situation, cou...",14,8,8.14,4.651,"[[(good, j), (effect, n)], [(on, i), (the, a),...","[1, 1, 1, 1, 1, 1, 4, 3]","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...",0,2,6,0.0,0.25,0.75,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[]
1,B1_norm,I disagree that point about children brought u...,"[I, disagree, that, point, about, children, br...","[(I, PRP), (disagree, VBP), (that, IN), (point...","[(I, p), (disagree, v), (that, d), (point, n),...","[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",250,6.41,0.641,48.723,0.379,0.641,0.095,0.027,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.522,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.985,0.986,"[((('our', 'a'), ('country', 'n')), 157.939), ...",111.672,116.725,15,6.0,19,7.6,"[disagree that point, show that situation, cou...",20,12,8.0,4.8,"[[(good, j), (effect, n)], [(on, i), (the, a),...","[1, 1, 1, 1, 1, 1, 4, 3, 1, 1, 3, 1]","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...",0,3,9,0.0,0.25,0.75,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[]


## Normalizing length

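This section compares each measure for the original text (`B1_orig`) with the length-normalized text (`B1_norm`) and checks whether the normalized value falls within ±5% of the original.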

In [69]:
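# Create function for finding the range within + or - 5% of a statistic

def find_range(stat):
    """Return the range from 95% to 105% of stat as a string, with bounds rounded to 2 decimal places."""
    low = str(round(stat * 0.95, 2))
    high = str(round(stat * 1.05, 2))
    stat_range = low + ' - ' + high
    return stat_range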
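
Because `find_range` rounds both bounds to two decimal places, the resulting range is coarse for small statistics such as `bigram_range`. A minimal sketch of a variant with adjustable precision is shown here; the name `find_range_digits` and the `digits` and `tol` parameters are hypothetical and not part of the original notebook.

```python
def find_range_digits(stat, digits=2, tol=0.05):
    """Return 'low - high' for stat -/+ tol with a fixed number of decimals.

    Hypothetical helper for illustration only.
    """
    low = stat * (1 - tol)
    high = stat * (1 + tol)
    # The format specifier keeps trailing zeros, unlike str(round(...))
    return f"{low:.{digits}f} - {high:.{digits}f}"


# Bigram range of the original text (0.088), shown with three decimals
print(find_range_digits(0.088, digits=3))
```

With fixed-width formatting, a bound such as 0.4 is displayed as 0.40, which makes ranges easier to compare at a glance.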

In [70]:
# vocD: mod within 5% of orig (note that vocD changes slightly each time it is calculated because it is based on random samples)

B1_vocd = texts_df['vocD'].to_list()
B1_vocd 

find_range(B1_vocd[0])

[47.87, 48.723]

'45.48 - 50.26'

In [71]:
# AG (PSL): mod within 5% of orig

B1_AG = texts_df['AG'].to_list()
B1_AG

find_range(B1_AG[0])

[0.381, 0.379]

'0.36 - 0.4'

In [72]:
# Mean MI (words): mod within 5% of orig

B1_mean_MI = texts_df['mean_MI'].to_list()
B1_mean_MI

find_range(B1_mean_MI[0])

[2.429, 2.522]

'2.31 - 2.55'

In [73]:
# Mean t-score (words): mod within 5% of orig

B1_mean_t_score = texts_df['mean_tscore'].to_list()
B1_mean_t_score

find_range(B1_mean_t_score[0])

[115.839, 111.672]

'110.05 - 121.63'

In [74]:
# Proportion of absent bigrams (words): mod within 5% of orig or closest possible (see B2)

B1_absent_prop = texts_df['absent_prop'].to_list()
B1_absent_prop

find_range(B1_absent_prop[0])

[0.979, 0.985]

'0.93 - 1.03'

In [75]:
# Grammar errors per 100: mod within 5% of orig

B1_grammar_errors_per_100 = texts_df['grammar_errors_per_100'].to_list()
B1_grammar_errors_per_100

find_range(B1_grammar_errors_per_100[0])

[5.814, 6.0]

'5.52 - 6.1'
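
The per-100 rates in the dataframe above are consistent with `count / text_len * 100` (for example, 10 grammar errors in 172 words gives 5.814). Assuming that formula, which is inferred from the displayed values rather than taken from the cell that computes them, the sketch below estimates how many errors a 250-word text can contain while staying within 5% of the original rate. The function names are hypothetical.

```python
import math


def per_100(count, text_len):
    """Rate per 100 words, rounded to three decimals (assumed formula)."""
    return round(count / text_len * 100, 3)


def allowed_counts(orig_count, orig_len, new_len, tol=0.05):
    """Integer counts in a text of new_len words whose per-100 rate
    stays within +/- tol of the original rate."""
    orig_rate = orig_count / orig_len
    low = math.ceil(orig_rate * (1 - tol) * new_len)
    high = math.floor(orig_rate * (1 + tol) * new_len)
    return list(range(low, high + 1))


print(per_100(10, 172))              # original text: 10 grammar errors in 172 words
print(per_100(15, 250))              # normalized text: 15 grammar errors in 250 words
print(allowed_counts(10, 172, 250))  # error counts that keep the rate within 5%
```

The same calculation applies to the punctuation and collocation error counts, since they are also reported per 100 words.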

In [76]:
# Punctuation errors per 100: mod within 5% of orig

B1_punc_errors_per_100 = texts_df['punc_errors_per_100'].to_list()
B1_punc_errors_per_100

find_range(B1_punc_errors_per_100[0])

[7.558, 7.6]

'7.18 - 7.94'

In [77]:
# Collocation errors per 100: mod within 5% of orig

B1_col_errors_per_100 = texts_df['col_errors_per_100'].to_list()
B1_col_errors_per_100

find_range(B1_col_errors_per_100[0])

[8.14, 8.0]

'7.73 - 8.55'

In [78]:
# Correct collocations per 100: mod within 5% of orig

B1_correct_cols_per_100 = texts_df['correct_cols_per_100'].to_list()
B1_correct_cols_per_100

find_range(B1_correct_cols_per_100[0])

[4.651, 4.8]

'4.42 - 4.88'

In [79]:
# K10-16 cols proportion

B1_K10to16_p = texts_df['K10to16_p'].to_list()
B1_K10to16_p

find_range(B1_K10to16_p[0])

[0.0, 0.0]

'0.0 - 0.0'

In [80]:
# K3-9 cols proportion

B1_K3to9_p = texts_df['K3to9_p'].to_list()
B1_K3to9_p

find_range(B1_K3to9_p[0])

[0.25, 0.25]

'0.24 - 0.26'

In [81]:
# K1-2 cols proportion

B1_K1to2_p = texts_df['K1to2_p'].to_list()
B1_K1to2_p

find_range(B1_K1to2_p[0])

[0.75, 0.75]

'0.71 - 0.79'

In [82]:
# Bigram range: target is mod within 5% of orig (note: 0.095 is slightly above the upper bound, 0.088 * 1.05 = 0.0924)

B1_bigram_range = texts_df['bigram_range'].to_list()
B1_bigram_range

find_range(B1_bigram_range[0])

[0.088, 0.095]

'0.08 - 0.09'

In [83]:
# CNC: mod within 5% of orig

B1_CNC = texts_df['CNC'].to_list()
B1_CNC

find_range(B1_CNC[0])

[0.615, 0.641]

'0.58 - 0.65'

In [84]:
# MLC: mod within 5% of orig

B1_MLC = texts_df['MLC'].to_list()
B1_MLC

find_range(B1_MLC[0])

[6.615, 6.41]

'6.28 - 6.95'
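
The manual checks above can also be run in one pass. The sketch below is one possible approach rather than the notebook's own method: it copies the (orig, norm) pairs printed in this section into a plain dictionary and lists any measure whose normalized value lies outside the unrounded ±5% range. The helper name `within_tolerance` is hypothetical.

```python
# (orig, norm) pairs copied from the outputs printed in this section
stats = {
    'vocD': (47.87, 48.723),
    'AG': (0.381, 0.379),
    'mean_MI': (2.429, 2.522),
    'mean_tscore': (115.839, 111.672),
    'absent_prop': (0.979, 0.985),
    'grammar_errors_per_100': (5.814, 6.0),
    'punc_errors_per_100': (7.558, 7.6),
    'col_errors_per_100': (8.14, 8.0),
    'correct_cols_per_100': (4.651, 4.8),
    'K10to16_p': (0.0, 0.0),
    'K3to9_p': (0.25, 0.25),
    'K1to2_p': (0.75, 0.75),
    'bigram_range': (0.088, 0.095),
    'CNC': (0.615, 0.641),
    'MLC': (6.615, 6.41),
}


def within_tolerance(orig, norm, tol=0.05):
    """True if norm lies within +/- tol of orig (bounds are not rounded)."""
    low, high = sorted((orig * (1 - tol), orig * (1 + tol)))
    return low <= norm <= high


# Collect the measures that fall outside the tolerance band
outside = [name for name, (orig, norm) in stats.items()
           if not within_tolerance(orig, norm)]
print(outside)
```

With the rounded values shown in this section, the only measure flagged is `bigram_range`: 0.095 is above 0.088 × 1.05 ≈ 0.092. In the notebook itself, the same check could read the two rows of `texts_df` directly instead of a hand-copied dictionary.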

In [85]:
# Final comparison with relevant stats only

texts_final = texts_df[['text_id','text','lemmas_NLTK','lemmas_CLAWS','text_len','MLC','CNC','grammar_errors_per_100',
                        'punc_errors_per_100','vocD','AG','bigram_range','mean_MI','absent_prop',
                        'mean_tscore','col_errors_per_100','correct_cols_per_100','K10to16_p','K3to9_p','K1to2_p',
                        'kband_cols', 'K1to2_cols','K3to9_cols','K10to16_cols',
                        'K1to2_cols_K','K3to9_cols_K','K10to16_cols_K','bad_cols']]

In [86]:
# Pickle for later use

joblib.dump(texts_final, '../docs/B1_orig&norm.pkl')

['../docs/B1_orig&norm.pkl']

In [87]:
# Pickle possible cols for collocation identification notebook

B1_cols = texts_df[['text_id','lemmas_CLAWS','possible_cols']]

joblib.dump(B1_cols, '../docs/B1_cols.pkl')

['../docs/B1_cols.pkl']

[Back to top](#Normalizing-B1-original-text)