# Text analysis

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

**Task**: train a model that classifies comments as positive or negative (toxic).

*Limitation*: the model must reach an *F1* score of at least 0.75.
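
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. Below is a minimal pure-Python sketch of that definition, assuming the usual binary setting; the helper name `f1_binary` and the toy labels are illustrative and not part of the project code.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (illustrative only, not taken from the dataset)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_binary(y_true, y_pred))
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so F1 is 2/3 as well.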

**Basic steps**

1. Data preparation and analysis.
2. Model training. 
3. Conclusions.

**Data description**

The data is in the file `toxic_comments.csv`.

* *"text"*: comment text
* *"toxic"*: target variable

Install and import the libraries needed for the work.

In [1]:
!pip install pymystem3



In [2]:
!pip install catboost



In [3]:
import re

import nltk
import pandas as pd
from catboost import CatBoostClassifier
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

In [4]:
from tqdm import tqdm
tqdm.pandas()

## Data preparation and analysis

In [5]:
data = pd.read_csv('/Users/aasheremeeva/Desktop/All DS Projects/text analysis/toxic_comments.csv')

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [7]:
data.head(10)

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [8]:
data.duplicated().sum()

0

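The dataframe is fairly large: 159,292 rows. The comments are in English, and no fully duplicated rows were found. Note that `duplicated()` compares whole rows, including the `Unnamed: 0` column, so comments with identical text but different values in that column would not be flagged.

Next, we clean the text of non-letter characters and lemmatize it.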

In [9]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aasheremeeva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aasheremeeva/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aasheremeeva/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
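def get_wordnet_pos(word):
    """Map the Penn Treebank POS tag of a single word to a WordNet POS constant."""
    # Tagging a word in isolation ignores context, so the tag is only approximate
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_clear(text):
    """Lowercase the text, keep only Latin letters, and lemmatize each token."""
    text = text.lower()
    # Replace every non-letter character (digits, punctuation, newlines) with a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    lemm_text = " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(cleared_text)])
    return lemm_text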
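
The cleaning part of `lemmatize_clear` can be tried without NLTK. The sketch below isolates it in a hypothetical helper, `clean_text`; splitting on whitespace is only a rough stand-in for `nltk.word_tokenize`, and lemmatization is not reproduced.

```python
import re


def clean_text(text):
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    lowered = text.lower()
    letters_only = re.sub(r'[^a-z]', ' ', lowered)
    # str.split() with no arguments drops the runs of spaces left by the regex
    return ' '.join(letters_only.split())


print(clean_text("D'aww! He matches this background colour"))
# d aww he matches this background colour
```

Because apostrophes are replaced by spaces, contractions are split into separate tokens ("I'm" becomes "i m"), which is visible in the lemmatized output of the notebook.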

In [11]:
%%time
data['lemm_text'] = data['text'].progress_apply(lemmatize_clear)

100%|██████████████████████████████████| 159292/159292 [13:36<00:00, 194.98it/s]


CPU times: user 11min 21s, sys: 1min 28s, total: 12min 49s
Wall time: 13min 37s


In [12]:
data = data.drop(['text'], axis=1)

In [13]:
data.head(10)

,Unnamed: 0,toxic,lemm_text
0,0,0,explanation why the edits make under my userna...
1,1,0,d aww he match this background colour i m seem...
2,2,0,hey man i m really not try to edit war it s ju...
3,3,0,more i can t make any real suggestion on impro...
4,4,0,you sir be my hero any chance you remember wha...
5,5,0,congratulation from me a well use the tool wel...
6,6,1,cocksucker before you piss around on my work
7,7,0,your vandalism to the matt shirvington article...
8,8,0,sorry if the word nonsense be offensive to you...
9,9,0,alignment on this subject and which be contrar...


We split the data into the feature columns and the target variable, and then into training, validation and test sets (about 60%, 20% and 20% of the rows).

In [14]:
y = data['toxic']
X = data.drop(['toxic'], axis=1)

In [15]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4,random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_valid, y_valid, test_size=0.5,random_state=42)
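
The two calls above produce roughly a 60/20/20 split. The sketch below reproduces the row counts by hand, assuming that scikit-learn rounds a fractional `test_size` up and gives the remainder to the other part, which matches the shapes printed later in the notebook. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 159292  # rows reported by data.info()
n_train, n_holdout = split_sizes(n_total, 0.4)   # first call: 60% / 40%
n_valid, n_test = split_sizes(n_holdout, 0.5)    # second call: halves the 40%
print(n_train, n_valid, n_test)
# 95575 31858 31859
```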

In [16]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

tf_idf = TfidfVectorizer(stop_words=list(stopwords))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aasheremeeva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
X_train = tf_idf.fit_transform(X_train['lemm_text'])
X_valid = tf_idf.transform(X_valid['lemm_text'])
X_test = tf_idf.transform(X_test['lemm_text'])

print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)

(95575, 110820)
(31858, 110820)
(31859, 110820)
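
To make the values in these matrices less opaque, here is a minimal pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization) and skips tokenization details and stop-word removal. `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy corpus (illustrative only)
docs = ["edit war edit", "thank you hero", "edit article"]
rows = tfidf_rows(docs)
print(rows[2])
```

In the third toy document both terms occur once, but `article` appears in fewer documents than `edit`, so it receives the larger weight.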


At this point the dataset has been examined, the text has been cleaned, lemmatized and converted to TF-IDF features, and the data has been split into training, validation and test sets.

## Model training

We tune several models with a 3-fold cross-validated grid search scored by F1, then compare them on the validation set:
* Logistic Regression
* Decision Tree
* CatBoost
* SGDClassifier

### Logistic Regression

In [18]:
%%time

pipe_lr = Pipeline([('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(random_state=42))])

param_range = [9, 10]
param_range_fl = [1.0, 0.5]

grid_params_lr = [{'clf__penalty': ['l1', 'l2'],
        'clf__C': param_range_fl,
        'clf__solver': ['liblinear']}] 

lr_gs = GridSearchCV(pipe_lr, param_grid = grid_params_lr,  cv=3, scoring = 'f1', n_jobs=-1)
lr_gs.fit(X_train, y_train)

lr_best = lr_gs.best_params_
print(lr_best)

means = lr_gs.cv_results_['mean_test_score']
f1_lr = max(means)
print('F1:',f1_lr)

{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}
F1: 0.7355725701835439
CPU times: user 504 ms, sys: 265 ms, total: 769 ms
Wall time: 6.31 s


In [19]:
%%time

pred_lr = lr_gs.predict(X_valid)

f1_lr = f1_score(y_valid, pred_lr)
print('F1 (valid):', f1_lr)

F1 (valid): 0.7561322431567722
CPU times: user 28.7 ms, sys: 13 ms, total: 41.8 ms
Wall time: 66 ms


### Decision Tree

In [20]:
%%time

pipe_ds = Pipeline([('tfidf', TfidfTransformer()),
                ('clf', DecisionTreeClassifier(random_state=42))])

grid_params_ds = [{'clf__max_depth': [10,100]}] 

ds_gs = GridSearchCV(pipe_ds, param_grid = grid_params_ds,  cv=3, scoring = 'f1', n_jobs=-1)
ds_gs.fit(X_train, y_train)

ds_best = ds_gs.best_params_
print(ds_best)

means = ds_gs.cv_results_['mean_test_score']
f1_ds = max(means)
print('F1:',f1_ds)

{'clf__max_depth': 100}
F1: 0.7111356607444698
CPU times: user 27.5 s, sys: 142 ms, total: 27.6 s
Wall time: 52.7 s


In [21]:
%%time

pred_ds = ds_gs.predict(X_valid)

f1_ds = f1_score(y_valid, pred_ds)
print('F1 (valid):', f1_ds)

F1 (valid): 0.7191734417344176
CPU times: user 40.3 ms, sys: 18.3 ms, total: 58.6 ms
Wall time: 87.6 ms


### CatBoost

In [22]:
%%time

pipe_cb = Pipeline([('tfidf', TfidfTransformer()),
                ('clf', CatBoostClassifier(random_state=42))])

grid_params_cb = [{'clf__iterations': [400],
                  'clf__max_depth': [4,6,10],
                  }] 

cb_gs = GridSearchCV(pipe_cb, param_grid = grid_params_cb,  cv=3, scoring = 'f1', n_jobs=-1)
cb_gs.fit(X_train, y_train)

cb_best = cb_gs.best_params_
print(cb_best)

means = cb_gs.cv_results_['mean_test_score']
f1_cb = max(means)
print('F1:',f1_cb)

[Verbose per-iteration CatBoost training logs from the parallel cross-validation fits omitted. The captured output was truncated and does not show the printed best parameters or cross-validation F1.]

In [23]:
%%time

pred_cb = cb_gs.predict(X_valid)

f1_cb = f1_score(y_valid, pred_cb)
print('F1 (valid):', f1_cb)

F1 (valid): 0.7482629609834314
CPU times: user 683 ms, sys: 97.6 ms, total: 780 ms
Wall time: 237 ms


### SGDClassifier

In [24]:
%%time

pipe_sgd = Pipeline([('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(random_state=42))])

grid_params_sgd = [{'clf__loss': ['hinge', 'log', 'modified_huber'],
                  'clf__learning_rate': ['constant', 'optimal', 'adaptive'],
                   'clf__eta0': [0.01, 0.1, 0.3, 0.5]
                  }] 

sgd_gs = GridSearchCV(pipe_sgd, param_grid = grid_params_sgd,  cv=3, scoring = 'f1', n_jobs=-1)
sgd_gs.fit(X_train, y_train)

sgd_best = sgd_gs.best_params_
print(sgd_best)

means = sgd_gs.cv_results_['mean_test_score']
f1_sgd = max(means)
print('F1:',f1_sgd)

36 fits failed out of a total of 108.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/aasheremeeva/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/aasheremeeva/anaconda3/lib/python3.10/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/aasheremeeva/anaconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/aasheremeeva/anaconda3/lib/python3.10/site-packages/sklearn/b

{'clf__eta0': 0.1, 'clf__learning_rate': 'constant', 'clf__loss': 'modified_huber'}
F1: 0.6917421995801026
CPU times: user 922 ms, sys: 1.29 s, total: 2.21 s
Wall time: 11.5 s
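
The warning above can be read against the size of the grid. The sketch below counts the candidates with `itertools.product`: 36 combinations times 3 folds gives the 108 fits, and the 12 combinations that use `'log'` times 3 folds gives 36, exactly the number of failed fits. This suggests, although the truncated traceback does not confirm it, that every `'log'` candidate failed. One plausible cause is a scikit-learn version in which that loss name has been replaced by `'log_loss'`; treat this as an assumption to check against the installed version.

```python
from itertools import product

# Same parameter grid as in the SGDClassifier cell
grid = {
    'clf__loss': ['hinge', 'log', 'modified_huber'],
    'clf__learning_rate': ['constant', 'optimal', 'adaptive'],
    'clf__eta0': [0.01, 0.1, 0.3, 0.5],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(candidates) * cv_folds
log_fits = sum(c['clf__loss'] == 'log' for c in candidates) * cv_folds

print(total_fits, log_fits)
# 108 36
```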


In [25]:
%%time

pred_sgd = sgd_gs.predict(X_valid)

f1_sgd = f1_score(y_valid, pred_sgd)
print('F1 (valid):', f1_sgd)

F1 (valid): 0.7045017314351674
CPU times: user 52.8 ms, sys: 123 ms, total: 176 ms
Wall time: 39.4 ms


Let's compare the validation F1 of all models:

In [26]:
models_scores = [[0.756],
                    [0.719],
                    [0.748],
                    [0.705]]

model_comparison = pd.DataFrame(data=models_scores, index=["Logistic Regression", "Decision Tree", "CatBoost",'SGDClassifier'],columns=["F1 score"])
model_comparison

Model,F1 score
Logistic Regression,0.756
Decision Tree,0.719
CatBoost,0.748
SGDClassifier,0.705


Judged by F1 on the validation set, the best model is Logistic Regression.
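
The same comparison can be made programmatically rather than by eye. A minimal sketch using the rounded validation scores from the comparison table; the dictionary and variable names are illustrative.

```python
# Validation F1 scores, rounded as in the comparison table
valid_f1 = {
    'Logistic Regression': 0.756,
    'Decision Tree': 0.719,
    'CatBoost': 0.748,
    'SGDClassifier': 0.705,
}
F1_TARGET = 0.75  # threshold from the task statement

best_model = max(valid_f1, key=valid_f1.get)
passing = [name for name, score in valid_f1.items() if score >= F1_TARGET]

print(best_model)
print(passing)
```

Only Logistic Regression clears the 0.75 threshold on the validation set; CatBoost comes close at 0.748.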

Let's evaluate the best model on the test set.

In [27]:
pred_test = lr_gs.predict(X_test)

print('F1:', f1_score(y_test, pred_test))

F1: 0.7602233831742029


## Conclusion

As a result of the steps above:

* The initial data was loaded and analyzed.
* Four different models were trained and compared on the validation set.
* The best model, Logistic Regression, reached F1 ≈ 0.76 on the test set, which meets the target of at least 0.75.