## Task 5.1

The dataset is available here: https://github.com/sismetanin/rureviews, and also in the [Data](https://drive.google.com/drive/folders/1YAMe7MiTxA-RSSd8Ex2p-L0Dspe6Gs4L) folder. Those who prefer to work with English can use the `sms_spam` dataset.

Let's apply the skills we have learned to the sentiment analysis of reviews.

You need to reproduce the whole pipeline, from raw texts to a trained model.

Required preprocessing steps:
1. tokenization
2. lowercasing
3. stop-word removal
4. lemmatization
5. vectorization (with hyperparameter tuning)
6. model building
7. model quality evaluation

The following vectorizers are required:
1. bag of n-grams (choose the range of n yourself; using unigrams only is not allowed)
2. tf-idf (choose the range of n yourself; you also need to tune the `max_df`, `min_df` and `max_features` parameters)
3. character n-grams (choose the range of n yourself)

Use a naive Bayes classifier as the model.

To compare the vectorizers with each other, use precision, recall, f1-score and accuracy. Build a dataframe whose rows are the vectorizers, whose columns are the quality metrics, and whose cells hold the value of each metric for the corresponding vectorizer.
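
The comparison table described above could be assembled roughly as follows. This is only a sketch: the toy corpus, the labels and the vectorizer settings are hypothetical placeholders chosen so that the snippet runs on its own, and it is not the approach taken in the cells below.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham); replace with the real train/test split.
train_texts = [
    "win a free prize now", "free cash claim your prize", "urgent claim free holiday",
    "see you at lunch", "call me when you are home", "are you coming to the room",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free prize", "see you at home"]
test_labels = [1, 0]

# Hypothetical settings; the real ranges and thresholds should be tuned.
vectorizers = {
    "word n-grams (1-2)": CountVectorizer(ngram_range=(1, 2)),
    "tf-idf (1-2)": TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0,
                                    max_features=5000),
    "char n-grams (2-4)": CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
}

rows = {}
for name, vectorizer in vectorizers.items():
    # Fit the vectorizer and the naive Bayes classifier on the training texts only.
    clf = make_pipeline(vectorizer, MultinomialNB())
    clf.fit(train_texts, train_labels)
    pred = clf.predict(test_texts)
    # scikit-learn metrics take (y_true, y_pred) in that order.
    rows[name] = {
        "precision": precision_score(test_labels, pred, zero_division=0),
        "recall": recall_score(test_labels, pred, zero_division=0),
        "f1": f1_score(test_labels, pred, zero_division=0),
        "accuracy": accuracy_score(test_labels, pred),
    }

# Rows are vectorizers, columns are metrics.
comparison = pd.DataFrame(rows).T[["precision", "recall", "f1", "accuracy"]]
print(comparison)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being fitted on the test texts.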

In [19]:
import pandas as pd
import numpy as np

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [43]:
sms_spam_tbl = pd.read_csv('/content/drive/MyDrive/data/sms_spam.csv')

In [44]:
sms_spam_tbl.head(10)

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...
5,ham,Aiya we discuss later lar... Pick u up at 4 is...
6,ham,Are you this much buzy
7,ham,Please ask mummy to call father
8,spam,Marvel Mobile Play the official Ultimate Spide...
9,ham,"fyi I'm at usf now, swing by the room whenever"


In [45]:
sms_spam_tbl.shape

(5559, 2)

In [46]:
sms_spam_tbl['numType'] = sms_spam_tbl['type'].map({'ham':0, 'spam':1})
sms_spam_tbl.drop(columns=['type'], inplace=True)

In [47]:
# Message length in characters, computed in one vectorized call
sms_spam_tbl['count'] = sms_spam_tbl['text'].str.len()

In [72]:
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV
lgbmodel_bst = lgb.LGBMClassifier(max_depth=6, n_estimators=200, num_leaves=40)
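# Note: some keys below are LightGBM aliases of one another
# (min_data_in_leaf / min_child_samples, bagging_fraction / subsample),
# so each such pair describes the same underlying parameter.
param_grid = {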
    'num_leaves': list(range(8, 92, 4)),
    'min_data_in_leaf': [10, 20, 40, 60, 100],
    'max_depth': [3, 4, 5, 6, 8, 12, 16, -1],
    'learning_rate': [0.1, 0.05, 0.01, 0.005],
    'bagging_freq': [3, 4, 5, 6, 7],
    'bagging_fraction': np.linspace(0.6, 0.95, 10),
    'reg_alpha': np.linspace(0.1, 0.95, 10),
    'reg_lambda': np.linspace(0.1, 0.95, 10),
    "min_split_gain": [0.0, 0.1, 0.01],
    "min_child_weight": [0.001, 0.01, 0.1, 0.001],
    "min_child_samples": [20, 30, 25],
    "subsample": [1.0, 0.5, 0.8],
}
model = RandomizedSearchCV(lgbmodel_bst, param_grid, random_state=42)
# search = model.fit(X_train, y_train)
# search.best_params_

In [59]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords         
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [65]:
wnl = WordNetLemmatizer()
sms_spam_lem = sms_spam_tbl.copy()
stop_words = set(stopwords.words('english'))
sms_spam_lem['text'] = sms_spam_lem['text'].apply(lambda x: ' '.join(t for t in x.split() if t.lower() not in stop_words))
sms_spam_lem['text'] = sms_spam_lem['text'].apply(lambda x: ' '.join(wnl.lemmatize(t.lower()) for t in word_tokenize(x)))


In [66]:
sms_spam_lem.head()

Unnamed: 0,text,numType,count
0,hope good week . checking,0,49
1,k..give back thanks .,0,23
2,also cbe only . pay .,0,43
3,"complimentary 4 star ibiza holiday £10,000 cas...",1,149
4,okmail : dear dave final notice collect 4* ten...,1,161
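
Row 2 above still contains `only`, although it is a stop word: the filter ran on whitespace-separated chunks, where the word appears as `only.`, and tokenization happened only afterwards. The sketch below contrasts the two orders. It assumes a regex tokenizer and a tiny hypothetical stop-word set in place of NLTK's, so that it runs without any downloads.

```python
import re

# Hypothetical subset of an English stop-word list, for illustration only.
STOP_WORDS = {"am", "doing", "in", "only", "but", "have", "to"}


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def split_then_filter(text):
    """Filter on whitespace chunks first, tokenize afterwards (as in the cell above)."""
    kept = [chunk for chunk in text.split() if chunk.lower() not in STOP_WORDS]
    return tokenize(" ".join(kept))


def tokenize_then_filter(text):
    """Tokenize first, then drop stop words (the order listed in the task)."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


text = "Am also doing in cbe only. But have to pay."
print(split_then_filter(text))     # ['also', 'cbe', 'only', '.', 'pay', '.']
print(tokenize_then_filter(text))  # ['also', 'cbe', '.', 'pay', '.']
```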


In [74]:
# different vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# TF-IDF
tfidf_model = TfidfVectorizer()
sms_spam_tfidf = sms_spam_lem.copy()
tfidf_text = tfidf_model.fit_transform(sms_spam_tfidf['text'])

svd = TruncatedSVD(n_components=500, n_iter=5)
tfidf_svd = svd.fit_transform(tfidf_text)
tfidf_svd = pd.DataFrame(tfidf_svd)

In [75]:
tfidf_svd['count'] = (sms_spam_lem['count'] - sms_spam_lem['count'].min()) / (sms_spam_lem['count'].max() - sms_spam_lem['count'].min())


In [94]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(tfidf_svd, sms_spam_lem['numType'],test_size=.2, random_state=42)

In [76]:
search = model.fit(X_train, y_train)
search.best_params_

{'bagging_fraction': 0.7166666666666667,
 'bagging_freq': 6,
 'learning_rate': 0.05,
 'max_depth': 16,
 'min_child_samples': 25,
 'min_child_weight': 0.1,
 'min_data_in_leaf': 100,
 'min_split_gain': 0.01,
 'num_leaves': 28,
 'reg_alpha': 0.8555555555555555,
 'reg_lambda': 0.3833333333333333,
 'subsample': 0.5}

In [95]:
best_model = lgb.LGBMClassifier(bagging_fraction=0.7166666666666667, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.05, max_depth=16,
               min_child_samples=25, min_child_weight=0.1, min_data_in_leaf=100,
               min_split_gain=0.01, n_estimators=100, n_jobs=-1, num_leaves=28,
               objective=None, random_state=None, reg_alpha=0.8555555555555555,
               reg_lambda=0.3833333333333333, silent=True, subsample=0.5,
               subsample_for_bin=200000, subsample_freq=0)
best_model.fit(X_train,y_train)

LGBMClassifier(bagging_fraction=0.7166666666666667, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.05, max_depth=16,
               min_child_samples=25, min_child_weight=0.1, min_data_in_leaf=100,
               min_split_gain=0.01, n_estimators=100, n_jobs=-1, num_leaves=28,
               objective=None, random_state=None, reg_alpha=0.8555555555555555,
               reg_lambda=0.3833333333333333, silent=True, subsample=0.5,
               subsample_for_bin=200000, subsample_freq=0)

In [97]:
from sklearn.metrics import f1_score, precision_score, recall_score
prediction = best_model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
print(f'F1 score is: {f1_score(y_test, prediction)}')
print(f'Precision score is: {precision_score(y_test, prediction)}')
print(f'Recall score is: {recall_score(y_test, prediction)}')

F1 score is: 0.9045936395759717
Precision score is: 0.9552238805970149
Recall score is: 0.8590604026845637
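
scikit-learn's metric functions take their arguments as `(y_true, y_pred)`. F1 is symmetric in the two, but precision and recall are not: passing the arguments the other way round swaps the two values. A small pure-Python sketch with hypothetical labels shows the effect:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    actual_pos = sum(t == positive for t in y_true)
    return tp / predicted_pos, tp / actual_pos


# Hypothetical labels: 4 real positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

right = precision_recall(y_true, y_pred)    # precision = 2/3, recall = 1/2
swapped = precision_recall(y_pred, y_true)  # the two values trade places

print(right)
print(swapped)
```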


In [81]:
# Count
from sklearn.feature_extraction.text import CountVectorizer
ngram_model = CountVectorizer(binary=True, ngram_range=(2, 3))
sms_spam_count = sms_spam_lem.copy()
count_text = ngram_model.fit_transform(sms_spam_count['text'])

count_svd = svd.fit_transform(count_text)
count_svd = pd.DataFrame(count_svd)
count_svd['count'] = (sms_spam_lem['count'] - sms_spam_lem['count'].min()) / (sms_spam_lem['count'].max() - sms_spam_lem['count'].min())
count_svd.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,count
0,2.097696e-07,1.552595e-06,2.138911e-07,2.490319e-06,7.858619e-06,3.219337e-06,1.875915e-06,-1.076652e-06,4.357234e-06,1.502624e-06,2.744853e-05,1.712693e-05,0.001757276,4.471953e-05,3.133917e-06,1.198259e-05,9.852387e-06,3.467911e-06,5.203996e-05,-3.132545e-06,8.751124e-05,-1.940539e-06,-6.010212e-06,-8.508948e-06,1.237719e-05,1.372762e-05,5.659176e-07,1.230278e-05,7.383484e-06,5.170928e-05,1.544727e-05,6.905023e-05,1.26206e-05,4.163107e-05,1.584765e-05,2.658645e-05,1.663303e-06,-1.476843e-05,-8.390584e-06,7.570636e-05,...,0.043482,0.009371,0.009784203,0.002003,0.020357,-0.01922356,0.003034544,-0.000646,0.03539007,0.01481289,0.00097,-0.018494,0.001005,-0.014508,0.030927,0.06084611,0.00821902,0.015295,0.007246,0.073083,-0.018132,-0.017889,-0.014504,-0.012402,-0.009816,-0.033111,0.015429,-0.006119,0.011565,-0.008338,-0.009027,-0.036728,-0.024045,0.004453,0.012792,-0.006264789,0.013066,-0.019998,-0.023405,0.051762
1,2.765201e-08,4.606746e-08,8.984553e-08,2.193636e-07,-4.451388e-08,3.526507e-07,-4.2811e-08,1.13524e-07,-2.161888e-07,-2.09359e-08,-7.485903e-08,7.698865e-08,6.332972e-07,3.663309e-07,2.808749e-07,-3.784598e-07,7.741332e-06,1.045611e-07,3.645099e-08,-7.04931e-07,2.922706e-06,-7.805591e-07,-1.623914e-06,1.182005e-06,1.002589e-06,7.836369e-07,2.273399e-06,2.005388e-06,-7.781898e-07,5.042939e-06,-2.784225e-06,-7.586874e-07,-1.468564e-07,-4.417874e-07,8.678385e-07,-1.155073e-06,-9.419038e-07,3.924526e-08,-3.636916e-06,1.414112e-05,...,0.003119,0.005411,-0.003357364,-0.002116,0.000829,0.002678155,0.001053067,-0.003084,0.002357363,0.001544547,0.003821,0.003478,-0.001477,-0.003238,-0.000658,0.002792125,0.0002000401,0.002266,0.001617,0.000946,-0.003597,0.002841,0.000383,0.005869,0.000147,0.003328,-0.00474,-0.003905,-0.00777,-0.005241,0.003822,0.000604,0.0017,-0.004547,0.000849,0.004452507,0.000523,-0.006332,0.000995,0.023128
2,2.867426e-11,-6.639903e-12,-7.883944e-11,-7.950413e-11,1.520481e-11,1.823723e-10,1.592245e-10,-1.113614e-10,3.471506e-10,-4.000615e-10,4.897428e-10,2.449179e-10,5.178418e-10,4.568748e-10,-4.281801e-10,-4.394713e-10,8.687718e-10,-8.733943e-10,-2.06725e-10,-1.123522e-09,-4.674181e-10,1.695185e-10,1.860616e-09,7.767027e-10,2.60139e-09,4.763778e-09,4.285622e-10,1.379876e-09,7.539128e-10,1.953613e-09,4.65505e-09,-1.678955e-09,2.475055e-09,2.115408e-09,6.717008e-11,-3.944155e-09,-3.648521e-09,2.953663e-09,-7.761209e-11,-4.00686e-09,...,-5e-06,3e-06,1.739377e-07,5e-06,-3e-06,5.809459e-07,1.687603e-07,-5e-06,2.772638e-07,2.775997e-07,6e-06,3e-06,-3e-06,5e-06,2e-06,-2.846019e-07,-5.640066e-07,-4e-06,8e-06,-1e-06,7e-06,-3e-06,-3e-06,-5e-06,5e-06,-1e-05,-8e-06,4e-06,-1e-06,4e-06,-6e-06,-1e-06,-8e-06,5e-06,3e-06,6.968276e-07,-1e-06,-6e-06,-3e-06,0.045154
3,2.81611e-05,2.505973e-07,1.21791e-06,0.002338123,0.0001487615,0.1522361,0.05753328,0.002079618,0.006779565,0.001158916,0.007065134,0.0074992,0.0001640361,0.141628,-0.0006159122,0.0001103654,2.239396e-05,5.629691e-06,-5.476934e-06,-0.1840657,-0.0002267535,-0.001411134,7.018339e-05,0.2092573,0.1657434,1.148029,0.02267188,0.0003704388,-0.009692172,-0.0001162964,-8.984737e-05,2.2773e-05,-0.01625218,-0.3545087,-0.01372769,0.01475613,0.0179678,0.007634295,0.03146761,-0.0002139506,...,-0.038867,-0.092931,-0.08620495,-0.066302,-0.099958,-0.04791931,0.1009236,-0.1384,-0.0408214,-0.07152187,-0.134333,-0.008394,-0.107434,0.014668,0.120629,0.1225097,-0.0886472,-0.018443,-0.151342,0.149977,-0.244241,-0.004477,-0.02195,0.044366,-6.2e-05,-0.006344,0.174557,-0.002952,0.032431,0.080364,-0.11822,0.060367,0.016927,-0.163519,-0.10643,0.06415258,-0.031137,0.05139,0.003654,0.161894
4,2.691703e-05,9.213285e-07,-9.153333e-07,0.002687159,0.0001930803,0.2082442,0.2431806,0.002139806,-1.845655e-05,0.006537373,0.003968245,0.00684738,0.0001010897,0.09187775,-0.0003884441,7.354411e-05,1.546438e-06,5.583395e-07,-7.197417e-06,-0.07070964,-0.0001582995,-0.001030884,-4.01112e-07,0.06630409,0.06095542,0.4387281,0.008062919,0.0001225718,-0.002679889,-8.370008e-06,2.148912e-05,-3.523017e-05,-0.00326268,-0.05790524,-0.002245172,-0.0007763734,0.002628167,-4.029879e-05,4.889094e-05,-2.907453e-05,...,-0.019592,-0.014934,0.0002969042,-0.029504,-0.063729,-0.03968727,0.007522807,-0.050426,0.009918325,-0.03770332,-0.050573,-0.007302,-0.058538,-0.003761,0.03988,0.02058976,-0.04094193,-0.043875,-0.005294,0.038069,-0.040584,-0.002172,0.001564,0.011201,-0.007095,-0.010002,0.032937,0.000394,0.001071,0.003633,-0.012206,0.032266,-0.022017,-0.025088,-0.004531,-0.007759771,-0.026284,-0.004685,-0.028835,0.17511


In [98]:
X_train,X_test,y_train,y_test = train_test_split(count_svd, sms_spam_lem['numType'],test_size=.2, random_state=42)

In [82]:
search = model.fit(X_train, y_train)
search.best_params_

{'bagging_fraction': 0.8333333333333333,
 'bagging_freq': 6,
 'learning_rate': 0.1,
 'max_depth': 6,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_leaf': 60,
 'min_split_gain': 0.01,
 'num_leaves': 72,
 'reg_alpha': 0.28888888888888886,
 'reg_lambda': 0.95,
 'subsample': 1.0}

In [99]:
best_model = lgb.LGBMClassifier(bagging_fraction=0.8333333333333333, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=6,
               min_child_samples=20, min_child_weight=0.001,
               min_data_in_leaf=60, min_split_gain=0.01, n_estimators=100,
               n_jobs=-1, num_leaves=72, objective=None, random_state=None,
               reg_alpha=0.28888888888888886, reg_lambda=0.95, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
best_model.fit(X_train,y_train)

LGBMClassifier(bagging_fraction=0.8333333333333333, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=6,
               min_child_samples=20, min_child_weight=0.001,
               min_data_in_leaf=60, min_split_gain=0.01, n_estimators=100,
               n_jobs=-1, num_leaves=72, objective=None, random_state=None,
               reg_alpha=0.28888888888888886, reg_lambda=0.95, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [100]:
prediction = best_model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
print(f'F1 score is: {f1_score(y_test, prediction)}')
print(f'Precision score is: {precision_score(y_test, prediction)}')
print(f'Recall score is: {recall_score(y_test, prediction)}')

F1 score is: 0.8951048951048951
Precision score is: 0.9343065693430657
Recall score is: 0.8590604026845637


In [109]:
# Count(analyzer='char_wb')
ngram_char_model = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
sms_spam_char = sms_spam_lem.copy()
char_text = ngram_char_model.fit_transform(sms_spam_char['text'])

svd = TruncatedSVD(n_components=1000, n_iter=8)
char_svd = svd.fit_transform(char_text)
char_svd = pd.DataFrame(char_svd) 
char_svd.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999
0,1.71496,-0.130918,1.259888,-0.18421,0.124364,-0.491307,0.538396,0.237396,-0.271196,-0.357786,0.302153,0.137371,0.67172,0.468999,-0.886192,-0.489958,0.26288,0.62201,0.609022,0.550962,1.105368,0.300502,-0.856719,-1.018277,0.773912,-1.123802,0.817168,0.138417,0.054941,0.279321,0.275934,-0.001016,0.281553,-0.683935,0.242442,-0.651878,0.450863,0.108981,0.179686,-0.848024,...,-0.001356,0.04429,0.07882,-0.160322,0.009244,-0.047518,0.104653,-0.01192,0.087705,0.000868,0.052208,-0.158802,-0.154245,-0.040679,-0.005286,-0.066975,0.01273,0.03683,0.084221,-0.039421,-0.031732,0.058918,0.034107,-0.13235,0.000978,0.072296,0.059865,0.059105,0.015642,-0.083454,0.010003,0.045923,-0.036202,-0.021922,-0.11828,0.082544,-0.01542,-0.003367,-0.025317,-0.052082
1,0.91656,-0.298704,0.282819,0.658428,-0.112788,-0.002591,-0.489749,-0.013324,0.110176,-0.369065,-0.245601,0.027738,-0.249268,-0.423099,-0.282976,0.551186,0.280282,-0.322442,-0.305802,0.24453,0.836742,-0.402883,0.126966,0.013096,0.076345,-0.38811,0.449004,0.00933,0.2257,0.092481,-0.002924,0.357967,0.00844,-0.595793,-0.332182,0.503579,-0.361145,0.276901,-0.093682,-0.140152,...,0.086698,0.004417,-0.088896,-0.097971,-0.106093,-0.139587,-0.042003,-0.047951,0.051599,0.120364,-0.141055,-0.03888,0.045243,0.169572,0.004798,0.035776,-0.011539,-0.049465,-0.072505,-0.027311,0.055633,0.038163,0.062145,-0.093137,0.027899,-0.102148,0.03681,0.022978,0.062484,0.059171,0.068484,-0.045465,-0.013959,-0.101099,-8.2e-05,-0.057984,0.081073,-0.022199,0.057221,0.034223
2,1.154676,-0.509862,0.282723,1.319404,-0.16006,-0.588202,0.007661,0.380039,-0.145732,-0.286529,0.105394,0.1192,0.016516,-0.021766,-0.074911,0.044578,0.087553,-0.208879,-0.141525,0.199658,-0.001627,0.009459,-0.275326,0.082875,0.022838,0.036055,-0.131069,-0.032037,0.056158,-0.005624,-0.040978,-0.068034,-0.1176,-0.122154,0.008197,-0.08875,-0.069664,-0.133958,-0.075819,-0.005841,...,-0.078535,-0.03082,0.104968,-0.022027,-0.003664,-0.046018,-0.005245,0.114014,-0.071472,0.095325,0.067464,-0.003545,-0.071529,0.01398,-0.003763,-0.08235,-0.044677,-0.039442,-0.010351,-0.092727,0.037859,0.054368,-0.027525,-0.055884,0.089554,-0.149298,0.005665,-0.014704,0.181859,0.005122,0.057794,0.005308,-0.164291,0.035635,0.005426,0.080663,0.014686,-0.038568,-0.102498,0.004311
3,3.400197,-1.699236,-2.390335,0.509867,-2.101982,1.783691,3.194849,-3.557769,-1.158486,-1.336057,1.829501,-0.97667,2.388802,1.044521,1.898434,0.180732,-1.534358,-0.509854,-2.037889,0.649599,0.362198,-2.229466,0.463786,0.636345,-0.227703,-0.353464,-0.834887,-1.175866,0.891739,-0.141221,-3.781485,-2.181723,-1.004918,-0.509211,1.040497,-1.329595,0.450708,0.961824,1.466138,1.585731,...,0.117494,-1.084144,0.160615,0.010309,-0.233459,-0.028858,0.393652,-0.15885,-0.931938,0.153792,-0.710515,-0.236269,-0.378885,-0.070866,-0.172704,-0.059613,-0.362567,-0.012259,-0.310517,-0.324667,-0.03818,-0.055773,0.207647,0.620337,0.240602,-0.184049,-0.398546,-0.265352,-0.21803,0.703275,-0.565718,0.017965,0.769156,-0.940334,-0.348995,0.095657,0.323708,0.312861,0.037932,-0.350764
4,3.501736,-2.280575,-3.467966,-0.493,-3.127493,1.772344,2.006604,-1.347726,-0.36329,-1.372756,1.783914,1.002707,2.083135,0.204312,0.611118,-0.769753,-1.43887,-1.158014,-1.682491,0.570688,0.24976,-1.974686,-0.177602,0.793695,-1.177818,-0.333029,0.324841,-0.060527,0.717768,-0.233069,-2.550899,-1.415018,0.146813,-0.537219,1.446988,-2.718864,0.485503,0.965427,1.305122,2.570015,...,-0.013259,-0.033751,-0.054681,-0.0146,0.069647,-0.070502,0.067944,-0.016869,-0.057396,0.048976,-0.027164,-0.03722,-0.024124,0.158582,-0.118649,0.158648,-0.233218,-0.204824,-0.31036,-0.13528,0.164615,0.230196,0.030063,0.245133,0.090151,0.201276,-0.279134,-0.03567,-0.035241,0.166737,-0.018101,0.305343,0.001423,0.10261,-0.102453,0.094902,0.021427,0.05385,0.064184,-0.165717


In [110]:
X_train,X_test,y_train,y_test = train_test_split(char_svd, sms_spam_lem['numType'],test_size=.2, random_state=42)

In [111]:
search = model.fit(X_train, y_train)
search.best_params_

{'bagging_fraction': 0.8333333333333333,
 'bagging_freq': 6,
 'learning_rate': 0.1,
 'max_depth': 6,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_leaf': 60,
 'min_split_gain': 0.01,
 'num_leaves': 72,
 'reg_alpha': 0.28888888888888886,
 'reg_lambda': 0.95,
 'subsample': 1.0}

In [112]:
best_model = lgb.LGBMClassifier(bagging_fraction=0.8333333333333333, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=6,
               min_child_samples=20, min_child_weight=0.001,
               min_data_in_leaf=60, min_split_gain=0.01, n_estimators=100,
               n_jobs=-1, num_leaves=72, objective=None, random_state=None,
               reg_alpha=0.28888888888888886, reg_lambda=0.95, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
best_model.fit(X_train,y_train)

LGBMClassifier(bagging_fraction=0.8333333333333333, bagging_freq=6,
               boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=6,
               min_child_samples=20, min_child_weight=0.001,
               min_data_in_leaf=60, min_split_gain=0.01, n_estimators=100,
               n_jobs=-1, num_leaves=72, objective=None, random_state=None,
               reg_alpha=0.28888888888888886, reg_lambda=0.95, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [113]:
prediction = best_model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
print(f'F1 score is: {f1_score(y_test, prediction)}')
print(f'Precision score is: {precision_score(y_test, prediction)}')
print(f'Recall score is: {recall_score(y_test, prediction)}')

F1 score is: 0.9225352112676056
Precision score is: 0.9703703703703703
Recall score is: 0.8791946308724832


## Conclusion
By all metrics, **CountVectorizer** with **character n-grams within word boundaries** (`char_wb`) performs better than **TF-IDF** and the ordinary word-level **CountVectorizer**. All three performed quite well, so their results could be combined for a better outcome: to improve the score further, I would use an ensemble of these models.

## Task 5.2 Regular expressions

Regular expressions are a way to search and analyze strings. For example, you can find out which dates in a set of strings are written in the DD/MM/YYYY format and which use other formats.

Or, before working with a text, you may need to clean it of noise such as user mentions, URLs and so on.

It is a useful skill, so let's practice it too.

The **re** library is used to work with regular expressions.

In [2]:
import re

Besides ordinary literal characters, regular expressions have special characters:
* **a?** - zero or one character **a**
* **a+** - one or more characters **a**
* **a\*** - zero or more characters **a** (not to be confused with +)
* **.** - any single character (except a newline by default)

Example:
The expression `a*b?.` fully matches (e.g. with `re.fullmatch`) the strings a, ab, abc, aa and aac, BUT NOT abbc!
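
Quantifiers are written after the element they repeat, and `.` stands for exactly one character. A short check with `re.fullmatch` makes this concrete; the pattern `a*b?.` and the candidate strings are chosen for illustration only.

```python
import re

# a* : zero or more 'a', b? : an optional 'b', . : exactly one more character
pattern = re.compile(r"a*b?.")

candidates = ["a", "ab", "abc", "aa", "aac", "abb", "abbc"]

# fullmatch succeeds only if the whole string is covered by the pattern.
matches = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in matches.items():
    print(f"{s!r}: {ok}")
```

Only `abbc` fails: after the optional `b` there is room for a single further character, and `bc` is two.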

Let's look at some of the most useful functions in detail:

### findall
returns a list of all non-overlapping matches found.

The regular expression **ab+c.**:
* **a** - just the character **a**
* **b+** - one or more characters **b**
* **c** - just the character **c**
* **.** - any single character


In [16]:
result = re.findall('ab+c.', 'abcdefghijkabcabcxabc') 
print(result)

['abcd', 'abca']


A question to test your attention: why is abcx missing?

 It overlaps with 'abca': its leading `a` was already consumed by the previous match.
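
If overlapping matches are needed, one common workaround is to wrap the pattern in a lookahead with a capturing group, so that each match consumes no characters. A minimal sketch on the same string (the variable names are illustrative):

```python
import re

s = 'abcdefghijkabcabcxabc'

# Plain findall: each match consumes its characters, so 'abcx' is skipped.
non_overlapping = re.findall(r'ab+c.', s)

# Zero-width lookahead: the group captures the text, but nothing is consumed,
# so the search resumes at the very next position.
overlapping = re.findall(r'(?=(ab+c.))', s)

print(non_overlapping)  # ['abcd', 'abca']
print(overlapping)      # ['abcd', 'abca', 'abcx']
```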


**Task**: return a list of the first two letters of each word in a string consisting of several words.

In [29]:
text = "AV is largest Analytics community of India"
result = re.findall(r'\b\w{1,2}', text) 
print(result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']


### split
splits a string by the given pattern


In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie') 
print(result)

['itsy', ' bitsy', ' teenie', ' weenie']


you can specify the maximum number of splits

In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie', maxsplit=2) 
print(result)

['itsy', ' bitsy', ' teenie, weenie']
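
The leading spaces in the results above appear because only the comma is treated as the separator. Since the separator is itself a regular expression, it can also absorb the whitespace that follows. A small sketch, with illustrative variable names:

```python
import re

line = 'itsy, bitsy, teenie, weenie'

# Separator: a comma followed by any amount of whitespace.
parts = re.split(r',\s*', line)
print(parts)      # ['itsy', 'bitsy', 'teenie', 'weenie']

# maxsplit works the same way with a regex separator.
first_two = re.split(r',\s*', line, maxsplit=2)
print(first_two)  # ['itsy', 'bitsy', 'teenie, weenie']
```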


**Task**: split a string consisting of several sentences at the periods, but into no more than 3 sentences.

In [32]:
result = re.split(r'\.', 'itsy. bitsy. teenie. weenie.', maxsplit=2) 
print(result)

['itsy', ' bitsy', ' teenie. weenie.']


### sub
searches for the pattern in a string and replaces all matches with the given substring

parameters: (pattern, repl, string)

In [None]:
result = re.sub('a', 'b', 'abcabc')
print(result)

bbcbbc


**Task**: write a regular expression that replaces all digits in a string with "DIG".

In [35]:
result = re.sub(r'\d', 'DIG', '132')
print(result)

DIGDIGDIG


**Task**: write a regular expression that removes URLs from a string.

In [7]:
result = re.sub(r'\b\S+://\S+\b', '', 'URL could have the form http://www.example.com/index.html, which indicates a protocol ( http )')
print(result)

URL could have the form , which indicates a protocol ( http )


### compile
compiles a regular expression into a separate object

In [None]:
# Example: building a list of all the words in a string
# (raw string; the hyphen goes last in the class, so it needs no escaping)
prog = re.compile(r'[А-Яа-яЁё-]+')
prog.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.")

['Слова', 'Да', 'больше', 'ещё', 'больше', 'слов', 'Что-то', 'ещё']

**Task**: for the chosen string, build a list of the words that are longer than three characters.

In [9]:
# "Longer than three characters" means at least four, hence {4,}
prog = re.compile(r'[А-Яа-яЁё-]{4,}')
prog.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.")

['Слова', 'больше', 'больше', 'слов', 'Что-то']

**Task**: return the list of domains (e.g. @gmail.com) from a list of email addresses:

```
abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz
```

In [10]:
text = "abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz"
prog = re.compile(r'@\S+\b')
prog.findall(text)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
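
If the domains are wanted without the leading `@`, a capturing group is one option: when a pattern contains a single group, `findall` returns only the text of that group. The character class below is an assumption about what a domain may contain (letters, digits, underscore, dots and hyphens), not a full e-mail grammar.

```python
import re

text = "abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz"

# '@' anchors the match; the parentheses capture only the domain part.
domain_re = re.compile(r'@([\w.-]+)')
domains = domain_re.findall(text)

print(domains)  # ['gmail.com', 'test.in', 'analyticsvidhya.com', 'rest.biz']
```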