In [1]:
import pandas as pd

In [2]:
train_data = pd.read_csv('./learning-agency-lab-automated-essay-scoring-2/train.csv')

### Automatic scoring re-definition in terms of DL
**We want to analyze the following attributes of an essay:**
- Grammar - few or no grammatical mistakes
- Sentence structure - coherent sentence structures
- Coherent & logical reasoning about the concept
- Consistent focus on the topic throughout the essay
- Decent length of the essay
- Supporting arguments & numbers/stats to emphasize the viewpoint

Since DL models can't give us this information directly unless they are specifically trained for the task, we need to make some assumptions to make this work.
- Small-scale LLMs are not able to judge an essay against these parameters. They always give more information than necessary, and that information is often wrong. So LLMs are ruled out for generating data at this point.
- For grammatical mistakes - word-level mistakes can be identified with a good English dictionary. Sentence-structure mistakes can be inferred by passing a sequence of word embeddings through an LSTM and providing that context to the model while grading. Alternatively, we can use online sentence-correction tools to check whether sentences require a large number of corrections.
- Subject-verb agreement could be a starting point for grammar
- Logical reasoning can't be inferred directly, so a sequence of sentence embeddings can be processed to estimate it in an abstract way
- Consistency of focus on the topic will also have to be estimated from a sequence of sentence embeddings
- Length of the essay as a separate feature
- Numbers, percentages, metrics etc. - if we can identify them, add features for their existence & extent
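
The last two items (essay length and numeric evidence) can be sketched with plain regular expressions. This is a minimal illustration, not a tuned feature extractor: the patterns `NUMBER_RE` and `PERCENT_RE` and the function name `surface_features` are assumptions introduced here, and the sample sentence is adapted from the first training essay.

```python
import re

# Assumed patterns for illustration only; they are not tuned on the competition data.
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")
PERCENT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?\s*(?:%|percent\b)", re.IGNORECASE)


def surface_features(text):
    """Return simple length and numeric-evidence features for one essay."""
    words = text.split()
    numbers = NUMBER_RE.findall(text)
    percents = PERCENT_RE.findall(text)
    return {
        "n_words": len(words),            # essay length in whitespace-separated words
        "n_numbers": len(numbers),        # extent: how many numeric mentions
        "n_percentages": len(percents),   # extent: how many percentage mentions
        "has_numbers": bool(numbers),     # existence flag
    }


sample = "70 percent of families do not own cars, and 57 percent sold a car for $40,000."
features = surface_features(sample)
print(features)
```

On the full dataset, the same function could be applied to `train_data['full_text']` and the resulting dictionaries expanded into columns.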

In [3]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gotutiyan/gec-t5-small-clang8")
model = AutoModelForSeq2SeqLM.from_pretrained("gotutiyan/gec-t5-small-clang8")


In [5]:
from transformers.generation.configuration_utils import GenerationConfig

In [8]:
model.config

T5Config {
  "_name_or_path": "gotutiyan/gec-t5-small-clang8",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 32128
}

In [25]:
print(train_data['full_text'].iloc[0])

Many people have car where they live. The thing they don't know is that when you use a car alot of thing can happen like you can get in accidet or the smoke that the car has is bad to breath on if someone is walk but in VAUBAN,Germany they dont have that proble because 70 percent of vauban's families do not own cars,and 57 percent sold a car to move there. Street parkig ,driveways and home garages are forbidden on the outskirts of freiburd that near the French and Swiss borders. You probaly won't see a car in Vauban's streets because they are completely "car free" but If some that lives in VAUBAN that owns a car ownership is allowed,but there are only two places that you can park a large garages at the edge of the development,where a car owner buys a space but it not cheap to buy one they sell the space for you car for $40,000 along with a home. The vauban people completed this in 2006 ,they said that this an example of a growing trend in Europe,The untile states and some where else ar

In [9]:
tokenized = tokenizer("The thing they don't know is that when you use a car alot of thing can happen like you can get in accidet or the smoke that the car has is bad to breath on if someone is walk but in VAUBAN,Germany they dont have that proble because 70 percent of vauban's families do not own cars,and 57 percent sold a car to move there.",
                      return_tensors='pt')

In [11]:
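# Note: min_length/max_length are passed to decode() here, not to generate(),
# so generation falls back to its default length limit and the output below is cut short.
tokenizer.decode(
model.generate(inputs=tokenized['input_ids'])[0],min_length=100, max_length=512)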

"<pad> The thing they don't know is that when you use a car a"

In [30]:
model.generate(inputs=tokenized['input_ids'],max_new_tokens=512)

tensor([[    0,    37,   589,    79,   278,     3,    31,     3,    17,   214,
            19,    24,   116,    25,   169,     3,     9,   443,     3,     9,
           418,    13,   378,    54,  1837,   114,    25,    54,   129,    16,
            46,  3125,    42,     8,  7269,    24,     8,   443,    65,    19,
          1282,    12, 13418,    30,     3,    99,   841,    19,  3214,     3,
             6,    68,    16,   584,  6727, 25534,     3,     6,  3434,    79,
           103,    59,    43,    24,   682,   250,  2861,  1093,    13,   409,
            76,  3478,     3,    31,     7,  1791,   103,    59,   293,  2948,
             3,     6,    11,     3,  3436,  1093,  1789,     3,     9,   443,
            12,   888,   132,     3,     5,     1]])
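
Once the generated ids are decoded back to text, one way to quantify "how much correction a sentence needed" is to compare the original and corrected word sequences. The sketch below uses `difflib` from the standard library. The function name `correction_ratio` is an assumption, and the corrected sentence is hand-written for illustration rather than taken from the model.

```python
from difflib import SequenceMatcher


def correction_ratio(original, corrected):
    """Fraction of word-level content changed between two versions of a sentence."""
    orig_words = original.split()
    corr_words = corrected.split()
    matcher = SequenceMatcher(a=orig_words, b=corr_words, autojunk=False)
    # ratio() is 1.0 for identical sequences, so 1 - ratio() grows with the edits.
    return 1.0 - matcher.ratio()


# Hypothetical pair: the "corrected" string is hand-written, not model output.
original = "you can get in accidet or the smoke is bad to breath"
corrected = "you can get in an accident or the smoke is bad to breathe"
print(round(correction_ratio(original, corrected), 3))
```

A per-essay feature could then be the mean or maximum of this ratio over the essay's sentences.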

In [32]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("alice-hml/mBERT_grammatical_error_tagger")
model = AutoModelForTokenClassification.from_pretrained("alice-hml/mBERT_grammatical_error_tagger")

In [33]:
tokenized = tokenizer("The thing they don't know is that when you use a car alot of thing can happen like you can get in accidet or the smoke that the car has is bad to breath on if someone is walk but in VAUBAN,Germany they dont have that proble because 70 percent of vauban's families do not own cars,and 57 percent sold a car to move there.",
                      return_tensors='pt')

In [44]:
out = model(tokenized['input_ids'])

In [64]:
vocabulary = tokenizer.vocab
reverse_vocabulary = {value:key for key,value in vocabulary.items()}

[101,
 10117,
 40414,
 10689,
 16938,
 112,
 188,
 21852,
 10124,
 10189,
 10841,
 13028,
 11760,
 169,
 13000,
 10164,
 11290,
 10108,
 40414,
 10944,
 84630,
 11850,
 13028,
 10944,
 15329,
 10106,
 13621,
 65074,
 10123,
 10345,
 10105,
 100332,
 10189,
 10105,
 13000,
 10393,
 10124,
 15838,
 10114,
 33989,
 54006,
 10135,
 12277,
 30455,
 10124,
 33734,
 10473,
 10106,
 69342,
 82439,
 41275,
 117,
 12775,
 10689,
 11758,
 10529,
 10189,
 11284,
 11203,
 12373,
 10923,
 22362,
 10108,
 10321,
 105807,
 112,
 187,
 15300,
 10149,
 10472,
 12542,
 24602,
 117,
 10111,
 11817,
 22362,
 15337,
 169,
 13000,
 10114,
 18577,
 11155,
 119,
 102]

In [69]:
for token, label in zip(tokenized['input_ids'][0].tolist(), out['logits'][0].argmax(dim=1).tolist()):
    print(reverse_vocabulary[token],label)

[CLS] 0
The 0
thing 0
they 0
don 0
' 0
t 0
know 0
is 0
that 0
when 0
you 0
use 0
a 0
car 0
al 0
##ot 0
of 0
thing 0
can 0
happen 0
like 0
you 0
can 0
get 0
in 0
ac 0
##cide 0
##t 0
or 0
the 0
smoke 0
that 0
the 0
car 0
has 0
is 0
bad 0
to 0
br 0
##eath 0
on 0
if 0
someone 0
is 0
walk 0
but 0
in 0
VA 0
##UB 0
##AN 0
, 0
Germany 0
they 0
dont 0
have 0
that 0
pro 0
##ble 0
because 0
70 0
percent 0
of 0
va 0
##uban 0
' 0
s 0
families 0
do 0
not 1
own 2
cars 0
, 0
and 0
57 0
percent 0
sold 0
a 0
car 0
to 0
move 0
there 0
. 0
[SEP] 0


In [46]:
len(tokenized['input_ids'][0])

84
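
The tagger labels wordpieces, so misspelt words such as `al ##ot` are split across rows. A minimal sketch of merging the pieces back into words is shown below. It assumes the common convention of keeping the first piece's label for the whole word, and it treats any non-zero label id as "flagged". The meaning of ids 1 and 2 is not stated in this notebook (`model.config.id2label` should list it), and the token excerpt is hand-picked from the output above.

```python
def merge_wordpieces(tokens, labels):
    """Merge '##' continuation pieces into words; a word keeps its first piece's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # skip special tokens
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hand-picked excerpt of the tokens and predicted label ids printed above.
tokens = ["[CLS]", "a", "car", "al", "##ot", "of", "thing", "do", "not", "own", "cars", "[SEP]"]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]

merged = merge_wordpieces(tokens, labels)
flagged = [word for word, label in merged if label != 0]
print(merged)
print(flagged)
```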

- Hypothesis 1
    - Essays that scored highly contain more misspelt words

In [3]:
import enchant

In [4]:
english_dict = enchant.Dict('en_US')

In [8]:
from tqdm import tqdm
import re
import swifter

tqdm.pandas()

In [6]:

def remove_special_chars(text):
    """
    Removes any special characters or numbers from text and returns all words in lower case
    """
    return re.sub(r'\s+', ' ', re.sub(r'[^A-Za-z]', ' ', text)).lower()

In [9]:
train_data['full_text_cleaned'] = train_data['full_text'].swifter.apply(remove_special_chars)

Pandas Apply:   0%|          | 0/17307 [00:00<?, ?it/s]

In [10]:
train_data['misspelt_words'] = train_data['full_text_cleaned'].swifter.apply(lambda x: [word for word in x.split() if not english_dict.check(word)])

Pandas Apply:   0%|          | 0/17307 [00:00<?, ?it/s]

In [11]:
train_data['length_of_misspelt'] = train_data['misspelt_words'].apply(len)

In [12]:

train_data.groupby('score')['length_of_misspelt'].describe().sort_index(ascending=False)

score,count,mean,std,min,25%,50%,75%,max
6,156.0,26.153846,13.154248,3.0,16.0,23.0,35.0,70.0
5,970.0,23.869072,12.00741,1.0,15.0,23.0,30.0,84.0
4,3926.0,20.658176,11.024157,0.0,13.0,19.0,26.0,103.0
3,6280.0,17.363694,9.854664,0.0,10.0,16.0,22.0,90.0
2,4723.0,14.533983,9.242295,0.0,8.0,13.0,19.0,92.0
1,1252.0,19.308307,12.041993,1.0,11.0,17.0,24.0,149.0
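
The counts above are raw totals per essay, and longer essays have more opportunities for a misspelling, so a per-word rate may be a useful companion statistic. The sketch below is one possible way to compute it. The function name `misspelling_rate` and the tiny stand-in dictionary are assumptions for illustration. In this notebook the check would be `english_dict.check`.

```python
import re


def misspelling_rate(text, is_known_word):
    """Share of alphabetic words in `text` that the dictionary check rejects."""
    words = re.sub(r"[^A-Za-z]", " ", text).lower().split()
    if not words:
        return 0.0
    misspelt = [word for word in words if not is_known_word(word)]
    return len(misspelt) / len(words)


# Stand-in dictionary for illustration; in the notebook this would be english_dict.check.
known = {"many", "people", "have", "a", "car", "where", "they", "live"}
rate = misspelling_rate("Many people have a car where they liev.", known.__contains__)
print(rate)
```

Grouping this rate by `score` in the same way as `length_of_misspelt` would show whether the pattern in the table holds once essay length is accounted for.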


- Hypothesis 1 is wrong

- Hypothesis 2
    - Higher-scoring essays use more nouns

In [14]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [24]:
train_data['spacy_doc'] = train_data['full_text'].swifter.apply(nlp)

Pandas Apply:   0%|          | 0/17307 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [19]:
def get_freq_pos_tags(text):
    """
    Returns the frequency of each part of speech tag in the text
    """
    doc = nlp(text)
    pos_tags = {}
    for token in doc:
        if token.pos_ in pos_tags:
            pos_tags[token.pos_] += 1
        else:
            pos_tags[token.pos_] = 1
    return pos_tags
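
For Hypothesis 2 it may help to compare the share of nouns rather than the raw count, since raw counts grow with essay length. A minimal sketch, assuming a helper named `pos_proportions` (not part of the notebook) that takes a list of tags. The tags below are hand-written, whereas in the notebook they would come from `token.pos_`.

```python
from collections import Counter


def pos_proportions(pos_tags):
    """Convert a list of POS tags into {tag: share of all tokens}."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Hand-written tags for illustration; in the notebook they would come from token.pos_.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT", "NOUN", "ADJ"]
shares = pos_proportions(tags)
print(shares["NOUN"])
```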

In [17]:
train_data['pos_tags'] = train_data['full_text'].progress_apply(get_freq_pos_tags)

Pandas Apply:   0%|          | 0/17307 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [13]:
error_dict

{'live.': 161,
 'alot': 2056,
 'accidet': 2,
 'VAUBAN,Germany': 6,
 'dont': 4073,
 'proble': 8,
 '70': 261,
 "vauban's": 18,
 'cars,and': 13,
 '57': 206,
 'there.': 1346,
 'parkig': 2,
 ',driveways': 2,
 'freiburd': 1,
 'Swiss': 55,
 'borders.': 35,
 'probaly': 117,
 "Vauban's": 258,
 '"car': 176,
 'free"': 29,
 'VAUBAN': 22,
 'allowed,but': 5,
 'development,where': 8,
 '$40,000': 109,
 'home.': 412,
 'vauban': 24,
 '2006': 35,
 ',they': 15,
 'Europe,The': 1,
 'untile': 3,
 '"smart': 129,
 'planning".': 28,
 'tailes': 1,
 'passengee': 1,
 '12': 471,
 'Europe': 1099,
 '50': 430,
 'States.': 573,
 'honeslty': 9,
 'Vaudan': 2,
 '5,500': 104,
 'mile.': 18,
 'artical': 343,
 'David': 94,
 '"All': 70,
 'change"': 20,
 'beacuse': 265,
 'thats': 1543,
 'bycles': 1,
 'car.': 3378,
 'thik': 11,
 'to.': 857,
 ',the': 25,
 'reduced"communtunties,and': 1,
 'act,if': 4,
 'cautiously.': 21,
 'Maany': 1,
 'year.': 282,
 'bill,80': 1,
 '20': 161,
 'transports.': 7,
 'this.': 814,
 'NASA': 6581,
 '"face

In [9]:
train_data

,essay_id,full_text,score
0,000d118,Many people have car where they live. The thin...,3
1,000fe60,I am a scientist at NASA that is discussing th...,3
2,001ab80,People always wish they had the same technolog...,4
3,001bdc0,"We all heard about Venus, the planet without a...",4
4,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3
...,...,...,...
17302,ffd378d,"the story "" The Challenge of Exploing Venus "" ...",2
17303,ffddf1f,Technology has changed a lot of ways that we l...,4
17304,fff016d,If you don't like sitting around all day than ...,2
17305,fffb49b,"In ""The Challenge of Exporing Venus,"" the auth...",1


In [5]:
import spacy
nlp = spacy.load("en_core_web_sm",enable=['parser'])

In [2]:
import pandas as pd
data = pd.read_csv('./learning-agency-lab-automated-essay-scoring-2/train.csv')

In [28]:
%%timeit
docs= list()
for doc in nlp.pipe(data['full_text'].sample(200),batch_size=100,n_process=8):
    docs.append(doc)

4.47 s ± 527 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit
docs= list()
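for doc in nlp.pipe(data['full_text'].sample(500),batch_size=20,n_process=8):
    # Note: nlp.pipe already yields processed Doc objects, so this second nlp() call
    # appears redundant and re-runs the pipeline on each document
    p_doc = nlp(doc)
    docs.append([sent.text for sent in p_doc.sents])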

7.78 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
%%timeit
docs= list()
for doc in data['full_text'].sample(200):
    p_doc = nlp(doc)
    docs.append([sent.text for sent in p_doc.sents])

KeyboardInterrupt: 

In [15]:
        sentences = []
        # Process the texts in batches: passing too many at once caused memory issues and slowed multiprocessing throughput
        for doc in tqdm(nlp.pipe(self.text,batch_size=100,n_process=8),total=len(self.text)):
            sentences.append([sent.text for sent in doc.sents])
        return sentences

[]
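
The batching idea from the comment above can be illustrated without spaCy. The sketch below is only an assumption about how the texts could be processed chunk by chunk so that intermediate results for one chunk are alive at a time. `iter_chunks` and `split_sentences_naive` are hypothetical helpers, and the naive splitter stands in for `doc.sents`.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def split_sentences_naive(text):
    """Placeholder splitter (assumption); the notebook uses doc.sents from spaCy instead."""
    return [part.strip() for part in text.split(".") if part.strip()]


texts = ["One. Two.", "Three.", "Four. Five. Six.", "Seven.", "Eight. Nine."]
sentences = []
for chunk in iter_chunks(texts, chunk_size=2):
    # Only the current chunk is processed here; results are appended as plain strings.
    sentences.extend(split_sentences_naive(text) for text in chunk)

chunk_sizes = [len(chunk) for chunk in iter_chunks(texts, 2)]
print(len(sentences), chunk_sizes)
```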