# Trying a pipeline-based approach

In [1]:
import pandas as pd
import numpy as np

train_data = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/train.csv")
train_data.head()

,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [2]:
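# Lower-case each essay and strip line breaks.
# Note: replacing "\n" with "" joins the words on either side of a line break.
story = [text.lower().replace("\n", "").strip() for text in train_data['full_text']]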

In [3]:
story = np.array(story).reshape(-1, 1)
story.shape

(3911, 1)
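
A note on the preprocessing in `In [2]`: `replace("\n", "")` deletes line breaks without leaving a space, so the last word of one paragraph is joined to the first word of the next. The `dear, principal,i` row in the `atom1.dataset` output further down looks like an instance of this. Below is a minimal sketch of an alternative, assuming it is acceptable to collapse every run of whitespace to a single space. `normalize_essay` is a hypothetical helper, not part of the notebook, and the sample is only the visible prefix of the third training row.

```python
def normalize_essay(text: str) -> str:
    """Lower-case an essay and collapse every run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace, including "\n",
    # and drops empty pieces, so joining with " " keeps word boundaries intact.
    return " ".join(text.lower().split())


# Visible prefix of the third training essay (the rest is truncated in the output).
sample = "Dear, Principal\n\nIf u change the school"

glued = sample.lower().replace("\n", "").strip()   # what the notebook does
spaced = normalize_essay(sample)                    # whitespace-preserving variant

print(glued)   # dear, principalif u change the school
print(spaced)  # dear, principal if u change the school
```

With the original preprocessing, a glued token such as `principalif` would reach the tokenizer and the TF-IDF vocabulary as a single rare word.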

In [4]:
cohesion = np.array(train_data['cohesion']).astype(float)
syntax = np.array(train_data['syntax']).astype(float)
vocabulary = np.array(train_data['vocabulary']).astype(float)
phraseology = np.array(train_data['phraseology']).astype(float)
grammar = np.array(train_data['grammar']).astype(float)
conventions = np.array(train_data['conventions']).astype(float)
conventions.shape

(3911,)

In [5]:
test_data = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/test.csv")
test_data

,text_id,full_text
0,0000C359D63E,when a person has no experience on a job their...
1,000BAD50D026,Do you think students would benefit from being...
2,00367BB2546B,"Thomas Jefferson once states that ""it is wonde..."


In [6]:
# Apply the same preprocessing to the test essays.
story_pred = [text.lower().replace("\n", "").strip() for text in test_data['full_text']]
story_pred = np.array(story_pred).reshape(-1, 1)

In [7]:
from atom import ATOMRegressor

# One ATOM regressor per scoring target; each holds out 33% of the essays as a test set.
atom1 = ATOMRegressor(story, cohesion, test_size = 0.33, random_state = 42, verbose = 2)
atom2 = ATOMRegressor(story, syntax, test_size = 0.33, random_state = 42, verbose = 2)
atom3 = ATOMRegressor(story, vocabulary, test_size = 0.33, random_state = 42, verbose = 2)
atom4 = ATOMRegressor(story, phraseology, test_size = 0.33, random_state = 42, verbose = 2)
atom5 = ATOMRegressor(story, grammar, test_size = 0.33, random_state = 42, verbose = 2)
atom6 = ATOMRegressor(story, conventions, test_size = 0.33, random_state = 42, verbose = 2)

Woodwork may not support Python 3.7 in next non-bugfix release.
Featuretools may not support Python 3.7 in next non-bugfix release.


Algorithm task: regression.

Shape: (3911, 2)
Memory: 9.21 MB
Scaled: False
Categorical features: 1 (100.0%)
Outlier values: 6 (0.1%)
-------------------------------------
Train set size: 2621
Test set size: 1290

Algorithm task: regression.

Shape: (3911, 2)
Memory: 9.21 MB
Scaled: False
Categorical features: 1 (100.0%)
Outlier values: 23 (0.4%)
-------------------------------------
Train set size: 2621
Test set size: 1290

Algorithm task: regression.

Shape: (3911, 2)
Memory: 9.21 MB
Scaled: False
Categorical features: 1 (100.0%)
Outlier values: 1 (0.0%)
-------------------------------------
Train set size: 2621
Test set size: 1290

Algorithm task: regression.

Shape: (3911, 2)
Memory: 9.21 MB
Scaled: False
Categorical features: 1 (100.0%)
Outlier values: 8 (0.2%)
-------------------------------------
Train set size: 2621
Test set size: 1290

Algorithm task: regression.

Shape: (3911, 2)
Memory: 9.21 MB
Scaled: False
Categorical features: 1 (100.0%)
----------------------------------
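
The six constructor calls above differ only in the target array, and the later cells repeat each method call six times in the same way. A possible way to reduce the repetition is to keep the regressors in a dictionary keyed by target name. This is a sketch, not how the notebook is written: `build_per_target` and `call_on_all` are hypothetical helpers, and the constructor is passed in as an argument so the helpers do not depend on the `atom` package.

```python
def build_per_target(make_runner, X, targets, **kwargs):
    """Create one runner per target, all sharing the same features and settings."""
    return {name: make_runner(X, y, **kwargs) for name, y in targets.items()}


def call_on_all(runners, method, *args, **kwargs):
    """Call the same method with the same arguments on every runner."""
    return {
        name: getattr(runner, method)(*args, **kwargs)
        for name, runner in runners.items()
    }


# Intended use in this notebook (not executed here, it needs the atom package):
#
# targets = {
#     "cohesion": cohesion, "syntax": syntax, "vocabulary": vocabulary,
#     "phraseology": phraseology, "grammar": grammar, "conventions": conventions,
# }
# atoms = build_per_target(ATOMRegressor, story, targets,
#                          test_size=0.33, random_state=42, verbose=2)
# call_on_all(atoms, "textclean")
# call_on_all(atoms, "tokenize")
# call_on_all(atoms, "vectorize", strategy="tfidf")
```

The trade-off is that `atoms["cohesion"]` replaces `atom1`, which makes it harder to mix up which regressor belongs to which target.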

In [8]:
atom1.dataset

,corpus,target
0,the great artist michelangelo i thinking he is...,2.0
1,one of the things i want to acumplish in the f...,2.5
2,its always good to ask people how it feels or ...,3.0
3,in order for students to have a good experienc...,3.0
4,"dear, principal,i will like to star in saying ...",3.5
...,...,...
3906,i think an enjoyable educational activity for ...,4.0
3907,some people may say that oppocite but to start...,3.5
3908,"""unles you try to do somthing beyond what you ...",2.5
3909,having activities after school are good ideas ...,4.0


In [9]:
atom1.textclean()
atom2.textclean()
atom3.textclean()
atom4.textclean()
atom5.textclean()
atom6.textclean()

Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 0 emails from 0 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 0 HTML tags from 0 documents.
 --> Dropping 3 emojis from 3 documents.
 --> Dropping 2847 numbers from 994 documents.
 --> Dropping punctuation from the text.
Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 0 emails from 0 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 0 HTML tags from 0 documents.
 --> Dropping 3 emojis from 3 documents.
 --> Dropping 2847 numbers from 994 documents.
 --> Dropping punctuation from the text.
Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 0 emails from 0 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 0 HTML tags from 0 documents.
 --> Dropping 3 emojis from 3 documents.
 --> Dro

In [10]:
atom1.tokenize()
atom2.tokenize()
atom3.tokenize()
atom4.tokenize()
atom5.tokenize()
atom6.tokenize()

Tokenizing the corpus...
Tokenizing the corpus...
Tokenizing the corpus...
Tokenizing the corpus...
Tokenizing the corpus...
Tokenizing the corpus...


In [11]:
atom1.vectorize(strategy = 'tfidf')
atom2.vectorize(strategy = 'tfidf')
atom3.vectorize(strategy = 'tfidf')
atom4.vectorize(strategy = 'tfidf')
atom5.vectorize(strategy = 'tfidf')
atom6.vectorize(strategy = 'tfidf')

Fitting Vectorizer...
Vectorizing the corpus...
Fitting Vectorizer...
Vectorizing the corpus...
Fitting Vectorizer...
Vectorizing the corpus...
Fitting Vectorizer...
Vectorizing the corpus...
Fitting Vectorizer...
Vectorizing the corpus...
Fitting Vectorizer...
Vectorizing the corpus...
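
For readers unfamiliar with TF-IDF, the following pure-Python sketch shows roughly what a TF-IDF vectorizer computes for a tokenized corpus. It uses the smoothed IDF and L2 row normalisation that scikit-learn's `TfidfVectorizer` applies by default; whether ATOM's `vectorize(strategy = 'tfidf')` uses exactly those defaults is an assumption, and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists -> (sorted vocabulary, list of L2-normalised rows)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: rare tokens get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard for empty documents
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical, already tokenized documents.
docs = [
    ["students", "learn", "at", "home"],
    ["students", "like", "school"],
    ["students", "learn", "at", "school"],
]
vocab, rows = tfidf(docs)

for doc, row in zip(docs, rows):
    weights = {tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0}
    print(doc, weights)
```

In the first document, `home` appears in only one of the three documents and so receives a higher weight than `students`, which appears in all of them.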


In [12]:
atom1.available_models()

,acronym,fullname,estimator,module,needs_scaling,accepts_sparse,supports_gpu
0,Dummy,Dummy Estimator,DummyRegressor,sklearn.dummy,False,False,False
1,GP,Gaussian Process,GaussianProcessRegressor,sklearn.gaussian_process._gpr,False,False,False
2,OLS,Ordinary Least Squares,LinearRegression,sklearn.linear_model._base,True,True,True
3,Ridge,Ridge Estimator,Ridge,sklearn.linear_model._ridge,True,True,True
4,Lasso,Lasso Regression,Lasso,sklearn.linear_model._coordinate_descent,True,True,True
5,EN,ElasticNet Regression,ElasticNet,sklearn.linear_model._coordinate_descent,True,True,True
6,Lars,Least Angle Regression,Lars,sklearn.linear_model._least_angle,True,False,True
7,BR,Bayesian Ridge,BayesianRidge,sklearn.linear_model._bayes,True,False,False
8,ARD,Automatic Relevant Determination,ARDRegression,sklearn.linear_model._bayes,True,False,False
9,Huber,Huber Regression,HuberRegressor,sklearn.linear_model._huber,True,False,False


In [13]:
atom1.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')
atom2.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')
atom3.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')
atom4.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')
atom5.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')
atom6.run(models = ['Tree', 'Bag', 'ET', 'RF', 'AdaB', 'GBM'], metric = 'mse')



Models: Tree, Bag, ET, RF, AdaB, GBM
Metric: neg_mean_squared_error


Results for Decision Tree:
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -0.0
Test evaluation --> neg_mean_squared_error: -0.6595
Time elapsed: 54.080s
-------------------------------------------------
Total time: 54.080s


Results for Bagging:
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -0.0663
Test evaluation --> neg_mean_squared_error: -0.3563
Time elapsed: 1m:05s
-------------------------------------------------
Total time: 1m:05s


Results for Extra-Trees:
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -0.0
Test evaluation --> neg_mean_squared_error: -0.3224
Time elapsed: 4m:35s
-------------------------------------------------
Total time: 4m:35s


Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_erro

In [14]:
atom1.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.626744,-0.214347,-3.0,-0.659496,-0.041499,-0.536787,-0.812094
Bag,-0.477171,-0.166809,-2.15,-0.356273,-0.022368,0.169796,-0.596886
ET,-0.457957,-0.161002,-2.08,-0.322362,-0.020439,0.248817,-0.567769
RF,-0.463488,-0.163074,-2.06,-0.33354,-0.021143,0.22277,-0.577529
AdaB,-0.464671,-0.164608,-2.045814,-0.335156,-0.021342,0.219004,-0.578927
GBM,-0.453934,-0.159253,-2.10085,-0.320189,-0.020235,0.253881,-0.565853
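
How to read these tables: the `neg_` columns are negated errors, so values closer to zero are better, while `r2` is better when higher. An `r2` below zero, as for `Tree`, means the model does worse on the test split than always predicting the mean score. Combined with the Decision Tree's train MSE of `-0.0` in the `In [13]` output, this is consistent with a tree that memorizes the training essays. The sketch below illustrates the `r2` definition with made-up scores; the numbers are not taken from the notebook.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # mean predictor's squared error
    return 1.0 - ss_res / ss_tot


# Hypothetical essay scores and two sets of predictions.
y_true = [2.0, 3.0, 3.0, 4.0]
always_mean = [3.0, 3.0, 3.0, 3.0]   # constant baseline
erratic = [4.0, 2.0, 4.0, 3.0]       # confident but wrong

print(r2_score(y_true, always_mean))  # 0.0
print(r2_score(y_true, erratic))      # -2.5
```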


In [15]:
atom2.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.627519,-0.223587,-2.5,-0.670349,-0.043523,-0.622536,-0.818748
Bag,-0.471744,-0.171242,-2.45,-0.346343,-0.022687,0.161699,-0.588509
ET,-0.446647,-0.162053,-2.045,-0.307936,-0.020313,0.254661,-0.55492
RF,-0.453829,-0.1648,-2.175,-0.320388,-0.021092,0.224522,-0.566028
AdaB,-0.466567,-0.17123,-2.224566,-0.332702,-0.022081,0.194716,-0.576803
GBM,-0.442287,-0.16037,-1.944459,-0.305084,-0.020122,0.261564,-0.552344


In [16]:
atom3.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.519767,-0.170249,-2.5,-0.466473,-0.026938,-0.398032,-0.682988
Bag,-0.412597,-0.13672,-1.75,-0.271835,-0.015927,0.185302,-0.521378
ET,-0.386829,-0.128754,-1.945,-0.245198,-0.014487,0.265136,-0.495175
RF,-0.393729,-0.130886,-1.92,-0.254095,-0.014963,0.238471,-0.504078
AdaB,-0.399189,-0.134313,-1.985294,-0.256961,-0.015277,0.22988,-0.506913
GBM,-0.388929,-0.12947,-1.996141,-0.250311,-0.014722,0.249812,-0.500311


In [17]:
atom4.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.590698,-0.200414,-2.5,-0.596124,-0.036906,-0.400737,-0.772091
Bag,-0.462907,-0.161141,-1.75,-0.327083,-0.020478,0.231439,-0.571912
ET,-0.440651,-0.153583,-1.695,-0.293527,-0.018407,0.310288,-0.541781
RF,-0.448419,-0.155837,-1.76,-0.305193,-0.019032,0.282877,-0.552442
AdaB,-0.464194,-0.16084,-1.804441,-0.330435,-0.020553,0.223563,-0.574835
GBM,-0.440533,-0.153495,-1.754707,-0.296021,-0.018573,0.304428,-0.544078


In [18]:
atom5.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.65814,-0.22969,-3.0,-0.731783,-0.04629,-0.50817,-0.855443
Bag,-0.504651,-0.178183,-2.1,-0.392764,-0.024679,0.190533,-0.626709
ET,-0.488329,-0.173409,-2.26,-0.361263,-0.022809,0.255453,-0.601052
RF,-0.491713,-0.174959,-2.16,-0.365411,-0.023112,0.246906,-0.604492
AdaB,-0.511716,-0.187631,-2.137584,-0.389699,-0.025254,0.19685,-0.624258
GBM,-0.480825,-0.171037,-1.976782,-0.354992,-0.022404,0.268378,-0.595812


In [19]:
atom6.evaluate()

model,neg_mean_absolute_error,neg_mean_absolute_percentage_error,max_error,neg_mean_squared_error,neg_mean_squared_log_error,r2,neg_root_mean_squared_error
Tree,-0.685271,-0.24017,-2.5,-0.77907,-0.049389,-0.696822,-0.882649
Bag,-0.50124,-0.180045,-2.1,-0.392636,-0.025233,0.144836,-0.626606
ET,-0.480178,-0.172716,-2.055,-0.353815,-0.022814,0.229388,-0.594823
RF,-0.490341,-0.176158,-1.96,-0.373452,-0.024028,0.186619,-0.611107
AdaB,-0.504203,-0.185672,-1.983616,-0.385367,-0.025338,0.160667,-0.620779
GBM,-0.473977,-0.170295,-2.05275,-0.349651,-0.022567,0.238456,-0.591313
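
The six tables can be condensed into a single number by averaging the per-target RMSE, which is the form of the competition's metric (mean columnwise RMSE, MCRMSE). The sketch below copies the `neg_root_mean_squared_error` column from the tables above, assuming `atom1` to `atom6` map to the targets in the order they were constructed. These are scores on ATOM's 33% hold-out split, so they are only a rough proxy for a leaderboard score.

```python
from statistics import mean

# neg_root_mean_squared_error per target and model, copied from the evaluate() tables.
neg_rmse = {
    "cohesion":    {"Tree": -0.812094, "Bag": -0.596886, "ET": -0.567769,
                    "RF": -0.577529, "AdaB": -0.578927, "GBM": -0.565853},
    "syntax":      {"Tree": -0.818748, "Bag": -0.588509, "ET": -0.55492,
                    "RF": -0.566028, "AdaB": -0.576803, "GBM": -0.552344},
    "vocabulary":  {"Tree": -0.682988, "Bag": -0.521378, "ET": -0.495175,
                    "RF": -0.504078, "AdaB": -0.506913, "GBM": -0.500311},
    "phraseology": {"Tree": -0.772091, "Bag": -0.571912, "ET": -0.541781,
                    "RF": -0.552442, "AdaB": -0.574835, "GBM": -0.544078},
    "grammar":     {"Tree": -0.855443, "Bag": -0.626709, "ET": -0.601052,
                    "RF": -0.604492, "AdaB": -0.624258, "GBM": -0.595812},
    "conventions": {"Tree": -0.882649, "Bag": -0.626606, "ET": -0.594823,
                    "RF": -0.611107, "AdaB": -0.620779, "GBM": -0.591313},
}

# The scores are negated errors, so the best model per target has the largest value.
best = {target: max(scores, key=scores.get) for target, scores in neg_rmse.items()}

# Mean columnwise RMSE: average the (positive) RMSE over the six targets.
mcrmse_gbm = mean(-scores["GBM"] for scores in neg_rmse.values())
mcrmse_best = mean(-neg_rmse[target][model] for target, model in best.items())

print(best)
print(f"GBM on every target:   {mcrmse_gbm:.4f}")   # 0.5583
print(f"Best model per target: {mcrmse_best:.4f}")  # 0.5570
```

GBM is the best model on four of the six targets and Extra-Trees on the other two (vocabulary and phraseology), but picking the best model per target improves the averaged hold-out RMSE only marginally over using GBM everywhere.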
