<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#План-работы" data-toc-modified-id="План-работы-1">План работы</a></span></li><li><span><a href="#Загрузка-данных" data-toc-modified-id="Загрузка-данных-2">Загрузка данных</a></span></li><li><span><a href="#Очистка-данных" data-toc-modified-id="Очистка-данных-3">Очистка данных</a></span></li><li><span><a href="#Создание-модели" data-toc-modified-id="Создание-модели-4">Создание модели</a></span></li><li><span><a href="#Общий-вывод" data-toc-modified-id="Общий-вывод-5">Общий вывод</a></span></li></ul></div>

# Определение уровня английского языка по субтитрам

Просмотр фильмов на языке оригинала – популярный и эффективный способ изучения иностранного языка. Для того, чтобы сделать этот процесс эффективным, необходимо выбрать фильм, который подходит ученику по уровню, т.е. такой фильм, в котором понятно 50-70% диалогов. Если в фильме понятна меньшая часть диалогов, то просмотр фильма сильно затрудняется и становится неинтересным, а если большая, то не происходит обучения. Для того, чтобы оценить уровень английского языка преподавателю необходимо посмотреть фильм целиком, что требует много времени. В рамках данной работы будет предпринята попытка создать модель, предсказывающую уровень языка на основе субтитров.

## План работы

На входе у нас есть набор субтитров разных фильмов с оцененным уровнем английского языка. Основная часть работы будет посвящена подготовке этих данных для моделирования: очистке и лемматизации.
<br>В работе будет применен алгоритм частотно-инверсной частоты документа (tf-idf) для снижения веса часто встречающихся слов.
<br>В конце необходимо будет выполнить задачу многоклассовой классификации на основе модели логистической регрессии.

Этапы работы следующие:
1. Загрузить данные
2. Произвести очистку данных
3. Подготовить выборки
4. Собрать пайплайн для построения модели логистической регрессии.

## Загрузка данных

Сторонние библиотеки, которые были установлены для выполнения работы (закомментированы, потому что нет необходимости устанавливать их каждый раз):

In [1]:
#pip install pysrt
#pip install spacy

Импортируем библиотеки:

In [2]:
import pandas as pd
import numpy as np
import os
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

import spacy
import pysrt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [3]:
# Настройки SpaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Создание словаря стоп-слов
stop = stopwords.words('english')

# Настройки токенизаторов:
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Создадим словарь уровней английского языка для удобства:

In [4]:
levels = {'A1': 1, 
          'A2': 2,
          'B1': 3,
          'B2': 4,
          'C1': 5,
          'C2': 6,
         }

Попробуем открыть один из файлов с субтитрами и проанализировать:

In [5]:
subs = pysrt.open('Subtitles/10_Cloverfield_lane(2016).srt')
print(subs.text)

<font color="#ffff80"><b>Fixed & Synced by bozxphd. Enjoy The Flick</b></font>
(CLANGING)
(DRAWER CLOSES)
(INAUDIBLE)
(CELL PHONE RINGING)
BEN ON PHONE: <i>Michelle,<br/>please don't hang up.</i>
<i>Just talk to me, okay?<br/>I can't believe you just left.</i>
<i>Michelle.</i>
<i>Come back.</i>
<i>Please say something.</i>
<i>Michelle, talk to me.</i>
<i>Look, we had an argument.<br/>Couples fight.</i>
<i>That is no reason<br/>to just leave everything behind.</i>
<i>Running away isn't gonna help it any.<br/>Michelle, please...</i>
(DIALTONE)
NEWSCASTER: More details on that.
<i>Elsewhere today,<br/>power has still not been restored</i>
<i>to many cities on the southern seaboard</i>
<i>in the wake of<br/>this afternoon's widespread blackout.</i>
<i>While there had been<br/>some inclement weather in the region,</i>
<i>the problem seems linked to<br/>what authorities are calling</i>
<i>a catastrophic power surge<br/>that has crippled traffic in the area.</i>
- (LOUD CRASH)<br/>- (GRUNTS)


Какие основные проблемы видно в субтитрах:
1. HTML-тэги
2. Переносы на новую строку \n
3. Знаки препинания, цифры
4. Заглавные буквы
5. Имена собственные
6. Базовые слова, например I, me, we и т.д.
7. Разные формы одних и тех же слов (do/did/done и т.д.)
8. Теги переводчиков в начале субтитров.

Все эти явления либо не несут полезной информации, либо снижают качество обучения нашей модели, при этом увеличивая время работы модели, поэтому в следующем разделе мы постараемся решить некоторые их этих проблем.

<br><br> Соберем субтитры в одну таблицу:

In [6]:
df = pd.DataFrame()
path = os.path.join('Subtitles/')
for file in os.listdir(path):
    df = df.append([[file.replace('.srt', ''), pysrt.open(os.path.join(path, file), encoding='latin-1').text]], ignore_index=True)
df.columns = ['Movie', 'Subtitles']
display(df)

Unnamed: 0,Movie,Subtitles
0,10_Cloverfield_lane(2016),"<font color=""#ffff80""><b>Fixed & Synced by boz..."
1,10_things_I_hate_about_you(1999),"Hey!\nI'll be right with you.\nSo, Cameron. He..."
2,A_knights_tale(2001),Resync: Xenzai[NEF]\nRETAIL\nShould we help hi...
3,A_star_is_born(2018),"- <i><font color=""#ffffff""> Synced and correct..."
4,Aladdin(1992),"<i>Oh, I come from a land\nFrom a faraway plac..."
...,...,...
274,While_You_Were_Sleeping(1995),"LUCY: <i>Okay, there are two things that</i>\n..."
275,Zootopia(2016),Fear. Treachery Bloodlust.\nThousands of years...
276,icarus.2017.web.x264-strife,Line drive to right field...\nSolemnly swear t...
277,mechanic-resurrection_,"Mr. Santos, so good to see you.\nWe saved your..."


Импортируем таблицу movie_titles, в которой содержится классификация уровня английского языка в субтитрах. Сразу избавимся от дубликатов:

In [7]:
movie_titles = pd.read_excel('movies_labels.xlsx').drop('id', axis=1).drop_duplicates(subset=['Movie'])
display(movie_titles.head(5))

Unnamed: 0,Movie,Level
0,10_Cloverfield_lane(2016),B1
1,10_things_I_hate_about_you(1999),B1
2,A_knights_tale(2001),B2
3,A_star_is_born(2018),B2
4,Aladdin(1992),A2/A2+


Объединим две таблицы:

In [8]:
df = pd.merge(df, movie_titles, how='left', on='Movie')
display(df)

Unnamed: 0,Movie,Subtitles,Level
0,10_Cloverfield_lane(2016),"<font color=""#ffff80""><b>Fixed & Synced by boz...",B1
1,10_things_I_hate_about_you(1999),"Hey!\nI'll be right with you.\nSo, Cameron. He...",B1
2,A_knights_tale(2001),Resync: Xenzai[NEF]\nRETAIL\nShould we help hi...,B2
3,A_star_is_born(2018),"- <i><font color=""#ffffff""> Synced and correct...",B2
4,Aladdin(1992),"<i>Oh, I come from a land\nFrom a faraway plac...",A2/A2+
...,...,...,...
274,While_You_Were_Sleeping(1995),"LUCY: <i>Okay, there are two things that</i>\n...",B1
275,Zootopia(2016),Fear. Treachery Bloodlust.\nThousands of years...,B2
276,icarus.2017.web.x264-strife,Line drive to right field...\nSolemnly swear t...,
277,mechanic-resurrection_,"Mr. Santos, so good to see you.\nWe saved your...",B1


## Очистка данных

Для начала проверим, какие уровни были указаны в таблице movie_titles:

In [9]:
df['Level'] = df['Level'].fillna('')
df['Level'].unique()

array(['B1', 'B2', 'A2/A2+', 'C1', 'B1, B2', '', 'A2/A2+, B1', 'A2'],
      dtype=object)

Для некоторых фильмов указаны два или три разных уровня языка. Там, где через слеш указан уровень с плюсом, оставим уровень без плюса, чтобы следовать официальной классификации. Там где указано два уровня без плюса, выберем более высокий, так как он включает в себя слова из более низкого уровня, таким образом модель будет обучаться точнее.

In [10]:
df['Level'] = df['Level'].replace(to_replace=['A2/A2+', 'B1, B2', 'A2/A2+, B1'], value=['A2', 'B2', 'B1'])

Приступим к очистке и лемматизации субтитров. Напишем функцию, которая будет проделывать следующие шаги и выдавать очищенный текст:

1. Очистка HTML-тэгов
2. Удаление \n, бэкслешей
3. Приведение к lowercase
4. Удаление знаков препинания
5. Проверка на стоп слова
6. Лемматизация

In [11]:
def clean(text):
    clean_text = re.sub('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', ' ', text).replace('\n', ' ').replace('\\', '').lower()
    clean_text = re.sub(r'[^\w\s]', ' ', clean_text)
    clean_text = word_tokenize(clean_text)
    clean_text = " ".join([w for w in clean_text if w not in stop])
    clean_text = nlp(clean_text)
    return " ".join([token.lemma_ for token in clean_text])

Применим функцию к набору субтитров:

In [12]:
df['Subtitles'] = df['Subtitles'].apply(clean)
print(df.loc[0, 'Subtitles'])

fix sync bozxphd enjoy flick clang drawer close inaudible cell phone ringing ben phone michelle please hang talk okay believe leave michelle come back please say something michelle talk look argument couple fight reason leave everything behind run away going to help michelle please dialtone newscaster detail elsewhere today power still restore many city southern seaboard wake afternoon widespread blackout inclement weather region problem seem link authority call catastrophic power surge crippled traffic area loud crash grunt tire screech scream glass shatter gasping groan horn honk inhale deeply sniff sigh gasp chain rattle breathe heavily grunt groan groan grunt sob chain jangle sob breathe heavily damn clatter grunt rumble footstep approach pant gasp door creak okay okay please please please hurt breathe heavily please let go okay tell anybody promise okay please let go please man need fluid shock go go keep alive work get handy boyfriend expect send cop look sorry one look clang gru

Соберем финальный датасет для анализа, в котором будет указан числовой уровень языка:

In [13]:
df_final = df.drop(index=df.loc[df['Level'] == ''].index).reset_index(drop=True)
df_final['level_numeric'] = df_final['Level'].map(levels)
display(df_final)

Unnamed: 0,Movie,Subtitles,Level,level_numeric
0,10_Cloverfield_lane(2016),fix sync bozxphd enjoy flick clang drawer clos...,B1,3
1,10_things_I_hate_about_you(1999),hey right cameron go nine school 1 0 year army...,B1,3
2,A_knights_tale(2001),resync xenzai nef retail help due list two min...,B2,4
3,A_star_is_born(2018),sync corrected mrcjnthn get â ª black eye open...,B2,4
4,Aladdin(1992),oh come land faraway place caravan camel roam ...,A2,2
...,...,...,...,...
223,We_are_the_Millers(2013),oh god full double rainbow way across sky whoa...,B1,3
224,While_You_Were_Sleeping(1995),lucy okay two thing remember childhood remembe...,B1,3
225,Zootopia(2016),fear treachery bloodlust thousand year ago for...,B2,4
226,mechanic-resurrection_,mr santos good see save usual table mr santo t...,B1,3


## Создание модели

Разделим выборку на обучающую и валидационную. Так как датасет небольшой, то тестовую выборку создавать не будем.

In [14]:
features = df_final['Subtitles']
target = df_final['level_numeric']

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

display(features_train.shape)
display(target_train.shape)

(171,)

(171,)

Рассчитаем частоту tf-idf для всего набора субтитров:

In [15]:
docs = features_train.values

tfidf_v = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

print(tfidf_v.fit_transform(docs).toarray())

[[0.         0.02972372 0.         ... 0.         0.         0.        ]
 [0.01867786 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.0063969  0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


Соберем пайплайн и подберем лучшие параметры с помощью GridSearchCV:

In [16]:
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'lr__solver': ['liblinear', 'lbfgs'],
               'lr__penalty': ['l1', 'l2'],
               'lr__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf_v),
                     ('lr', LogisticRegression(random_state=12345))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=3,
                           n_jobs=-1)

In [17]:
gs_lr_tfidf.fit(features_train.values, target_train.values)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END lr__C=1.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s
[CV 2/5] END lr__C=1.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s
[CV 3/5] END lr__C=1.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s
[CV 4/5] END lr__C=1.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s
[CV 5/5] END lr__C=1.0, lr__penalty=l1, lr

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 1/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/

[CV 2/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s
[CV 3/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 4/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 5/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 1/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 2/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.5s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 3/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.0s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 4/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.1s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 5/5] END lr__C=1.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.7s
[CV 1/5] END lr__C=1.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   3.7s
[CV 2/5] END lr__C=1.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   6.0s
[CV 3/5] END lr__C=1.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   4.6s
[CV 4/5] END lr__C=1.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=No

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  31.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  32.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  31.2s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  30.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  34.3s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  34.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  43.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  39.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  39.5s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=1.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  39.4s
[CV 1/5] END lr__C=10.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.4s
[CV 2/5] END lr__C=10.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.4s
[CV 3/5] END lr__C=10.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.4s
[CV 4/5] END lr__C=10.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__nor

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 1/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 2/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 3/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 4/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 5/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.4s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 1/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   8.4s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 2/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   8.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 3/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.4s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 4/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.1s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 5/5] END lr__C=10.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.5s
[CV 1/5] END lr__C=10.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   4.1s
[CV 2/5] END lr__C=10.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   7.1s
[CV 3/5] END lr__C=10.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   5.3s
[CV 4/5] END lr__C=10.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__no

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  30.5s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  30.5s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  30.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  37.3s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  44.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  52.5s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  35.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  33.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  41.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=10.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  42.2s
[CV 1/5] END lr__C=100.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.5s
[CV 2/5] END lr__C=100.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.5s
[CV 3/5] END lr__C=100.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.4s
[CV 4/5] END lr__C=100.0, lr__penalty=l1, lr__solver=liblinear, vect__ngram_range=(1, 1), vect

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/

[CV 2/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s
[CV 3/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/

[CV 4/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s
[CV 5/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   0.2s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/

[CV 1/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   6.9s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 2/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.7s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 3/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.3s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 4/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.0s


Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



[CV 5/5] END lr__C=100.0, lr__penalty=l1, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=   7.6s
[CV 1/5] END lr__C=100.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   4.6s
[CV 2/5] END lr__C=100.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   3.8s
[CV 3/5] END lr__C=100.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=   3.7s
[CV 4/5] END lr__C=100.0, lr__penalty=l2, lr__solver=liblinear, vect__ngram_range=(1, 1), vec

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  23.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  24.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  27.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  30.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7fe942290f70>, vect__use_idf=False; total time=  28.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  32.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  36.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  42.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  31.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END lr__C=100.0, lr__penalty=l2, lr__solver=lbfgs, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7fe942293af0>, vect__use_idf=False; total time=  31.5s


 0.68991597 0.67226891 0.70184874 0.71327731        nan        nan
 0.6897479  0.69563025 0.68991597 0.69596639 0.69563025 0.71310924
        nan        nan 0.70134454 0.6897479  0.70739496 0.67226891]


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('lr',
                                        LogisticRegression(random_state=12345))]),
             n_jobs=-1,
             param_grid=[{'lr__C': [1.0, 10.0, 100.0],
                          'lr__penalty': ['l1', 'l2'],
                          'lr__solver': ['liblinear', 'lbfgs'],
                          'vect__ngram_range': [(1, 1)], 'vect__norm': [None],
                          'vect__stop_words': [None],
                          'vect__tokenizer': [<function tokenizer at 0x7fe942290f70>,
                                              <function tokenizer_porter at 0x7fe942293af0>],
                          'vect__use_idf': [False]}],
             scoring='accuracy', verbose=3)

Выведем на экран лучшие параметры и результаты:

In [18]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'lr__C': 10.0, 'lr__penalty': 'l1', 'lr__solver': 'liblinear', 'vect__ngram_range': (1, 1), 'vect__norm': None, 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer_porter at 0x7fe942293af0>, 'vect__use_idf': False} 
CV Accuracy: 0.713


Точность на валидационной выборке 0.526

In [19]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(features_valid, target_valid))

Test Accuracy: 0.526


Выведем на экран предсказания модели и сравним с реальным уровнем английского:

In [20]:
result = features_valid.to_frame().merge(target_valid.to_frame(), left_index=True, right_index=True).merge(pd.DataFrame(clf.predict(features_valid), columns=['predictions']), left_index=True, right_index=True)
display(result)

Unnamed: 0,Subtitles,level_numeric,predictions
30,harvey read anything going to cause trouble gu...,4,4
20,happy birthday happy birthday hell end fuck bi...,5,5
47,telegraph machine beep train whistle blow tele...,5,4
52,last june see emily davison crush death beneat...,5,3
33,harvey reading sutter three year guy one slipp...,4,4
40,think think name go help raise like give somet...,4,3
16,resync lututkanan subscene idea argue speak en...,4,3
49,mr bates come morning say would quite thing do...,5,3
27,common error ocr capitalization issue fix tron...,4,4
50,hammer open tomorrow afternoon well let get pa...,5,4


Модель в основном занижает результат, подробнее о возможных причинах в разделе ниже.

## Общий вывод

Итак, мы построили модель, которая может предсказать уровень английского языка в фильме по субтитрам.
<br>Точность итоговой модели на валидационной выборке 0.526, а если сравнить предсказания с ответами, то видно, что в основном модель занижает результаты.
<br>Возможные причины такого результата:
1. Малый объем выборки. Увеличение выборки позволит расширить общий словарь и сделать резльтаты более точными.
2. Недостаточная очистка субтитров. В идеале нужно избавиться от имен собственных, эмозди, цифр, тегов создателей субтитров и прочих вещей, не добавляющих смысловой нагрузки для обучения.
3. Орфографические ошибки в словах.

Возможные дальнейшие усовершенствования модели:
1. Увеличить выборку. Для этого нужно найти субтитры с уже обозначенным уровнем английского языка или придумать систему математического расчета уровня языка.
2. Усовершенствовать очистку субтитров. Можно попробовать использовать словарь английских слов и отбрасывать все значения лемматизированных слов, которых в нем нет.
3. Отбрасывать первый субтитр. Часто он содержит тэг человека, создавшего субтитры. Это вряд ли снизит общий словарь полезных слов, но поможет избавиться от "мусора".