# Processing base texts

<br>

**Language: Python**

This notebook shows the process used for calculating measures related to lexical diversity and sophistication, grammatical complexity, and other collocational proficiency metrics. These are the same indices calculated during the text normalization process from notebooks 01, 02, 03.


**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Text processing](#Text-processing)
- [Syntactic complexity](#Syntactic-complexity)
- [Lexical diversity](#Lexical-diversity)
- [Lexical sophistication](#Lexical-sophistication)
- [Collocation measures](#Collocation-measures)

## Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
from ast import literal_eval
from nltk import pos_tag_sents
from pelitk import lex
import joblib
import numpy as np

In [2]:
# Set preferred notebook format

%pprint # Turn off pretty printing
InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

Pretty printing has been turned OFF


**Note:** As described in the [README.md]('../README.md'), The frequency information from COCA referenced here is not freely available but can be purchased at https://corpus.byu.edu/coca. Without this data you will be able to see a few rows of these dataframes, but will not be able to run the code yourself. The t-scores and K-bands were also calculated using these data.

In [3]:
# Import necessary dictionaries

col_freq_dict = joblib.load('../../COCA_data/COCA_2020_collocate_freq_dict.pkl')
MI_dict = joblib.load('../../COCA_data/COCA_2020_MI_dict.pkl')
tscore_dict = joblib.load('../../COCA_data/COCA_2020_tscore_dict.pkl')
kband_dict = joblib.load('../../COCA_data/COCA_2020_lemma_Kband_dict.pkl')

In [5]:
# Load base_texts

base_df = joblib.load('../docs/base_texts.pkl')
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))..."
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,..."
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1..."


In [7]:
# Write out texts to txt files for syntactic analysis

file = '../docs/base_text{}.txt' #make each name 'anonID' + index number

n = 0 # to number the files using index of dataframe
for row in base_df.iterrows():
    with open(file.format(base_df.index[n]), 'w') as f:
        f.write(row[1][1])
        n += 1

1384

1393

1512

## Syntactic complexity

Analysis using [TAASSC](https://www.linguisticanalysistools.org/taassc.html), calculating the measures from Lu's (2010) [Syntactic Complexity Analyzer](https://aihaiyang.com/software/). Based on previous research, two metrics most important for predicting proficiency are the focus: Number of complex nominals per clause (CN/C), and Mean length of clause (MLC).

In [8]:
# Read in TAASSC analysis file

TAASSC = pd.read_csv("../docs/base_TAASSC_sca.csv")

In [9]:
# Rename files to match texts_df

file_names = {'base_text0.txt':'text1','base_text1.txt':'text11','base_text2.txt':'text21'}
TAASSC.filename = TAASSC.filename.map(file_names)
TAASSC = TAASSC.loc[~TAASSC.filename.isnull()]
TAASSC

Unnamed: 0,filename,nwords,MLS,MLT,MLC,C_S,VP_T,C_T,DC_C,DC_T,T_S,CT_T,CP_T,CP_C,CN_T,CN_C
0,text1,251,16.733333,15.6875,6.435897,2.6,2.9375,2.4375,0.358974,0.875,1.066667,0.5625,0.1875,0.076923,1.625,0.666667
1,text11,258,17.2,15.176471,7.371429,2.333333,2.588235,2.058824,0.457143,0.941176,1.133333,0.588235,0.117647,0.057143,1.705882,0.828571
2,text21,255,15.9375,15.0,11.590909,1.375,1.823529,1.294118,0.227273,0.294118,1.0625,0.294118,0.411765,0.318182,2.647059,2.045455


In [10]:
# Keep only relevant syntactic complexity columns and rename them

TAASSC = TAASSC[['filename','MLC','CN_C']]
TAASSC = TAASSC.rename(columns={"filename": "text_id",'CN_C':'CNC'})
TAASSC

Unnamed: 0,text_id,MLC,CNC
0,text1,6.435897,0.666667
1,text11,7.371429,0.828571
2,text21,11.590909,2.045455


In [11]:
# Merge TAASSC data with texts_df

base_df = pd.merge(base_df, TAASSC, on='text_id')
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.435897,0.666667
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371429,0.828571
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.590909,2.045455


## Lexical diversity

vocD (with lemmas) using functions from [PELITK](https://github.com/ELI-Data-Mining-Group/pelitk)

In [12]:
# Remove punctuation before calculating

punctuation = ['.','!','?',';',':','#','"',"'",'``','`',',','--','-','...',')','(',"''"]

base_df['vocD'] = base_df.lemmas_NLTK.apply(lambda row: [x for x in row if x not in punctuation])

In [13]:
# Create vocD column

base_df['vocD'] = base_df.lemmas_NLTK.apply(lex.vocd)
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC,vocD
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.435897,0.666667,48.559455
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371429,0.828571,44.71992
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.590909,2.045455,70.301695


## Lexical sophistication

### AG
AG based on lemmas using a frequency list (PSL3) compiled from the PELIC learner corpus (see dissertation section 2.2.2)

In [14]:
# Read in PSL3 list for manual checking of items in texts that are off list

f = open("psl3.txt", "r")
PSL3 = f.read()
PSL3 = sorted(PSL3.split('\n'))

FileNotFoundError: [Errno 2] No such file or directory: 'psl3.txt'

In [15]:
# Create AG column

base_df['AG'] = base_df.lemmas_NLTK.apply(lambda row: [x for x in row if x not in punctuation]).apply(
    lex.adv_guiraud,freq_list = 'PSL3')

base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC,vocD,AG
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.435897,0.666667,48.559455,0.379473
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371429,0.828571,44.71992,0.935674
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.590909,2.045455,70.301695,1.454648


### Contextual diversity

Analysis using [TAALES](https://www.linguisticanalysistools.org/taales.html). Based on previous research, one metric is the focus: contextual diversity as in Monteiro et al. (2018).

In [16]:
# Read in TAALES analysis (program freely available for download)

TAALES = pd.read_csv("../docs/base_TAALES.csv")
TAALES

Unnamed: 0,Filename,Word Count,COCA_Academic_Bigram_Frequency,COCA_Academic_Bigram_Range,COCA_Academic_Bigram_Frequency_Log,COCA_Academic_Bigram_Range_Log,COCA_academic_bi_MI,COCA_academic_bi_MI2,COCA_academic_bi_T,COCA_academic_bi_DP,COCA_academic_bi_AC,COCA_academic_bi_prop_10k,COCA_academic_bi_prop_20k,COCA_academic_bi_prop_30k,COCA_academic_bi_prop_40k,COCA_academic_bi_prop_50k,COCA_academic_bi_prop_60k,COCA_academic_bi_prop_70k,COCA_academic_bi_prop_80k,COCA_academic_bi_prop_90k,COCA_academic_bi_prop_100k
0,base_text0.txt,250,83.69312,0.094385,0.98841,-1.587054,1.17213,7.796596,26.148618,0.01929,4236.44724,0.325301,0.37751,0.457831,0.497992,0.534137,0.558233,0.594378,0.626506,0.630522,0.650602
1,base_text1.txt,258,192.550594,0.169061,1.38415,-1.247401,1.445914,8.981604,44.264475,0.04276,10205.439701,0.455253,0.513619,0.568093,0.607004,0.645914,0.66537,0.680934,0.684825,0.692607,0.692607
2,base_text2.txt,251,118.450084,0.116812,1.227782,-1.38821,1.666179,8.841818,38.390386,0.049158,6188.176233,0.424,0.504,0.548,0.576,0.592,0.608,0.616,0.628,0.644,0.648


In [17]:
# Rename files to match texts_df

TAALES.Filename = TAALES.Filename.map(file_names)
TAALES = TAALES.loc[~TAALES.Filename.isnull()]

In [18]:
# Keep only relevant contextual diversity column and rename it

TAALES = TAALES[['Filename','COCA_Academic_Bigram_Range']]
TAALES = TAALES.rename(columns={"Filename": "text_id",'COCA_Academic_Bigram_Range':'bigram_range'})
TAALES

Unnamed: 0,text_id,bigram_range
0,text1,0.094385
1,text11,0.169061
2,text21,0.116812


In [19]:
# Merge TAALES data with texts_df

base_df = pd.merge(base_df, TAALES, on='text_id')
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC,vocD,AG,bigram_range
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.435897,0.666667,48.559455,0.379473,0.094385
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371429,0.828571,44.71992,0.935674,0.169061
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.590909,2.045455,70.301695,1.454648,0.116812


## Collocation measures
3 measures which make up 'CollGram' profile from Granger & Bestgen / Bestgen & Granger (2014):
- mean MI
- mean t-score
- proportion or bigrams absent from reference corpus

In [20]:
# Extract potential collocations in span 4 (lots for each lemma)

def find_cols(lemma_list):
    col_list = list(zip(lemma_list,lemma_list[1:]))+list(zip(lemma_list,lemma_list[2:]))\
    +list(zip(lemma_list,lemma_list[3:]))+list(zip(lemma_list,lemma_list[4:]))
    return col_list

In [21]:
# Create possible collocations column

base_df['possible_cols'] = base_df.lemmas_CLAWS.apply(find_cols)

In [22]:
# Lower-case (doesn't matter that 'I' gets lowered as not in collocate dict anyways)

base_df['possible_cols'] = base_df.possible_cols.apply(
    lambda row: [((x[0][0].lower(),x[0][1]),(x[1][0].lower(),x[1][1])) for x in row])

In [23]:
# Create list of all possible collocations

possible_cols = sorted(list(set([x for y in base_df.possible_cols.to_list() for x in y])))

### Mean MI

MI is not calculated for any bigrams with freq less than 5 or MI less than 1.

In [24]:
# Create column with MI for each possible collocation in MI dict

base_df['col_MI'] = base_df.possible_cols.apply(lambda row: [(x,MI_dict[x]) for x in row if x in MI_dict])

In [25]:
# Find mean MI for each text

base_df['mean_MI'] = base_df.col_MI.apply(lambda row: np.mean([x[1] for x in row]))

### Proportion of absent/low MI word combinations

In [26]:
# Create column of two-word combinations not in collocation dict

base_df['absent'] = base_df.possible_cols.apply(lambda row: [x for x in row if x not in col_freq_dict])

In [27]:
# Find proportion of absent two-word combinations compared to total two-word combinations in the text

base_df['absent_prop'] = base_df.absent.apply(lambda row: len(row)) / base_df.possible_cols.apply(lambda row: len(row))

### Mean t-scores

In [28]:
# Create column with t-score for each bigram

base_df['col_tscore'] = base_df.possible_cols.apply(lambda row: [(x,tscore_dict[x]) for x in row if x in tscore_dict])

In [29]:
# Find mean t-score for each text based on tokens and types

base_df['mean_tscore'] = base_df.col_tscore.apply(lambda row: np.mean([x[1] for x in row]))

In [30]:
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC,vocD,AG,bigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,col_tscore,mean_tscore
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.435897,0.666667,48.559455,0.379473,0.094385,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.526154,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.986869,"[((('our', 'a'), ('country', 'n')), 157.939), ...",106.159769
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371429,0.828571,44.71992,0.935674,0.169061,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('such', 'i'), ('as', 'i')), 6.02), ((('har...",2.867632,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.962818,"[((('such', 'i'), ('as', 'i')), 494.477), ((('...",157.983842
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.590909,2.045455,70.301695,1.454648,0.116812,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.232958,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.928283,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",93.147535


### Collocation frequency bands
Percentage of collocations containing low/mid/high freq items.  
- High = K1-2
- Mid = K3-9
- Low = K10+

In [31]:
# Create columns of percentages of cols that contain low, med, high kband items (highest only)

base_df['K10to16_p'] = base_df['K10to16']/(base_df['K10to16']+base_df['K3to9']+base_df['K1to2'])
base_df['K3to9_p'] = base_df['K3to9']/(base_df['K10to16']+base_df['K3to9']+base_df['K1to2'])
base_df['K1to2_p'] = base_df['K1to2']/(base_df['K10to16']+base_df['K3to9']+base_df['K1to2'])

In [32]:
# Round all stats to 3 digits for ease of use

base_df = round(base_df,3)
base_df

Unnamed: 0,text_id,text,lemmas_NLTK,lemmas_CLAWS,correct_cols,col_errors,K1to2,K3to9,K10to16,kband_cols,K1to2_cols,K3to9_cols,K10to16_cols,bad_cols,kband_non_cols,MLC,CNC,vocD,AG,bigram_range,possible_cols,col_MI,mean_MI,absent,absent_prop,col_tscore,mean_tscore,K10to16_p,K3to9_p,K1to2_p
0,text1,I disagree that point about children brought u...,"[I, disagree, that, point, about, child, bring...","[(I, p), (disagree, v), (that, d), (point, n),...",12,18,9,3,0,"[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(1, [('good', 'j'), ('effect', 'n')]), (1, [(...","[(4, [('entrance', 'n'), ('to', 'i'), ('the', ...",[],"[disagree that point, show that situation, cou...","[(1, (I, p)), (1, (about, i)), (1, (child, n))...",6.436,0.667,48.559,0.379,0.094,"[((i, p), (disagree, v)), ((disagree, v), (tha...","[((('our', 'a'), ('country', 'n')), 2.01), (((...",2.526,"[((i, p), (disagree, v)), ((disagree, v), (tha...",0.987,"[((('our', 'a'), ('country', 'n')), 157.939), ...",106.16,0.0,0.25,0.75
1,text11,I greatly support the idea.\nraised in a certa...,"[I, greatly, support, the, idea, ., raise, in,...","[(I, p), (greatly, r), (support, v), (the, a),...",22,12,16,6,0,"[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(1, [('support', 'v'), ('the', 'a'), ('idea',...","[(3, [('come', 'v'), ('from', 'i'), ('a', 'a')...",[],"[raise in (values), psychological values, well...","[(1, (I, p)), (1, (a, a)), (1, (such, i)), (1,...",7.371,0.829,44.72,0.936,0.169,"[((i, p), (greatly, r)), ((greatly, r), (suppo...","[((('such', 'i'), ('as', 'i')), 6.02), ((('har...",2.868,"[((i, p), (greatly, r)), ((greatly, r), (suppo...",0.963,"[((('such', 'i'), ('as', 'i')), 494.477), ((('...",157.984,0.0,0.273,0.727
2,text21,I do agree to the statement that children brou...,"[I, do, agree, to, the, statement, that, child...","[(I, p), (do, v), (agree, v), (to, i), (the, a...",32,6,18,11,3,"[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(1, [('poor', 'j'), ('family', 'n')]), (1, [(...","[(7, [('sacrifice', 'v'), ('luxury', 'n'), ('f...","[(14, [('be', 'v'), ('prematurely', 'r'), ('ex...","[agree to the statement, in the weekends, coll...","[(1, (I, p)), (1, (do, v)), (1, (that, c)), (1...",11.591,2.045,70.302,1.455,0.117,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...","[((('expose', 'v'), ('to', 'i')), 2.4), ((('li...",3.233,"[((i, p), (do, v)), ((do, v), (agree, v)), ((a...",0.928,"[((('expose', 'v'), ('to', 'i')), 133.938), ((...",93.148,0.094,0.344,0.562


In [33]:
# Add text_len column

base_df['text_len'] = 250

In [34]:
# Pickle for later use

joblib.dump(base_df,'../docs/base_texts_processed.pkl')

['../docs/base_texts_processed.pkl']

[Back to top](#Processing-base-texts)