## Домашнее задание 8 (бонусное). Обработка текстов. 
Дедлайн: 24.06.2020 23:59

Ваша задача - определить тональность твита (0 - отрицательная, 4 - положительная) по его тексту.       
Ваша модель должна превзойти указанные бейзлайны (метрика качества - ***accuracy_score***) на тестовой выборке (***df_test***).     
Чем больше бейзлайнов вы пройдете, тем выше будет ваша оценка.       
Использовать можно любые модели и любые способы получения признаков. 

+ **!** Необходимо сделать результаты воспроизводимыми (фиксировать random_state)
+ **!** Для обучения можно использовать только ***df_train***. 
+ **!** Менять разбиение на  ***df_train*** и ***df_test*** нельзя.

**Оценивание (всего 10 баллов)**: 
+ Бейзлайн 1 0.73875 - 4 балла
+ Бейзлайн 2 0.75325 - 6 баллов
+ Бейзлайн 3 0.7635 - 8 баллов 
+ Бейзлайн 4 0.777 - 10 баллов

**Возможные направления улучшения качества**
+ улучшение предобработки (сейчас ее по сути нет)
+ подбор более удачной модели
+ подбор параметров модели 
+ feature engineering
+ feature selection

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, accuracy_score

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [6]:
from scipy.sparse import coo_matrix, hstack
from scipy.sparse.csr import csr_matrix

In [7]:
import io

In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/esolovev/ling2019/master/module2/twi_data.csv', sep=';')

In [9]:
df.head(10)

Unnamed: 0,target,date,text
0,4,Tue Jun 02 02:59:24 PDT 2009,@JackAllTimeLow hope it went good! i couldnt m...
1,0,Sat Jun 06 00:25:20 PDT 2009,@SDI8732 Idk how to do it!!!
2,0,Fri Jun 05 12:07:23 PDT 2009,"@kmwindmill is here ! woop woop , would be bet..."
3,4,Mon Jun 01 14:55:06 PDT 2009,@Daydreamer1984 He explains the tailer better
4,0,Sat Jun 20 15:39:44 PDT 2009,still trying to get a pic on this twitter thin...
5,0,Mon Jun 01 17:05:44 PDT 2009,"personally, i'm pretty upset ian left the cab...."
6,4,Fri May 29 15:32:09 PDT 2009,Dance meeting sitting next to deb
7,4,Sun May 31 08:07:19 PDT 2009,@thespyglass ha... funnier the way you did it...
8,4,Mon Jun 01 18:12:27 PDT 2009,"wooh, i love @mileycyruss! i actuallly just sa..."
9,4,Sat May 30 09:17:18 PDT 2009,@EdinMarathonBot R-4_it is great I'm staying ...


In [10]:
# баланс классов
df.target.value_counts(normalize=True)

4    0.5
0    0.5
Name: target, dtype: float64

In [11]:
# разбиение и пропорции обучающей и тестовой выборки менять нельзя
SEED = 227
np.random.seed(SEED)
df_train, df_test = train_test_split(df, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)

In [12]:
df_train.shape

(8000, 3)

In [13]:
df_test.shape

(4000, 3)

In [14]:
y_train = df_train.target
y_test = df_test.target

## Baseline 1 
Count Vectorizer по словам + Naive Bayes

In [151]:
%%time
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(df_train.text)
X_test_count = count_vectorizer.transform(df_test.text)
X_train = X_train_count
X_test = X_test_count

CPU times: user 224 ms, sys: 15.3 ms, total: 239 ms
Wall time: 246 ms


In [13]:
%%time
model = MultinomialNB()
model.fit(X_train, y_train)

CPU times: user 5.08 ms, sys: 0 ns, total: 5.08 ms
Wall time: 23.4 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.71      0.82      0.76      2000
           4       0.78      0.66      0.72      2000

    accuracy                           0.74      4000
   macro avg       0.74      0.74      0.74      4000
weighted avg       0.74      0.74      0.74      4000

Accuracy: 0.73875


## Baseline 2 
TfidfVectorizer по словам + Logistic Regression

In [161]:
%%time
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)
X_train = X_train_tfidf
X_test = X_test_tfidf

CPU times: user 244 ms, sys: 18.9 ms, total: 263 ms
Wall time: 276 ms


In [16]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 154 ms, sys: 0 ns, total: 154 ms
Wall time: 42 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.75      0.76      0.76      2000
           4       0.76      0.74      0.75      2000

    accuracy                           0.75      4000
   macro avg       0.75      0.75      0.75      4000
weighted avg       0.75      0.75      0.75      4000

Accuracy: 0.75325


## Baseline 3
TfidfVectorizer по 1-3 граммам слов + TfidfVectorizer по 3-4граммам символов + LogisticRegression

In [21]:
%%time
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 4), strip_accents="unicode", tokenizer=tknzr.tokenize, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)

tfidf_vectorizer_char = TfidfVectorizer(ngram_range=(3, 4), analyzer='char')
X_train_tfidf_char = tfidf_vectorizer_char.fit_transform(df_train.text)
X_test_tfidf_char = tfidf_vectorizer_char.transform(df_test.text)

X_train = hstack((X_train_tfidf, X_train_tfidf_char))
X_test = hstack((X_test_tfidf, X_test_tfidf_char))

CPU times: user 3.64 s, sys: 112 ms, total: 3.75 s
Wall time: 3.81 s


# Благодаря удалению стоп слов, убиранию ' и использованию sentiment-aware nltk twitter токенизатора удалось побить 3 бейзлайн

In [22]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 448 ms, sys: 23.5 ms, total: 472 ms
Wall time: 482 ms


In [23]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.77      0.76      0.76      2000
           4       0.76      0.77      0.77      2000

    accuracy                           0.77      4000
   macro avg       0.77      0.77      0.77      4000
weighted avg       0.77      0.77      0.77      4000

Accuracy: 0.7665


## Baseline 4
Baseline 3 + эмбединги из spacy (вектор документа = среднее векторов всех его слов)

In [24]:
%%time
import spacy 
import en_core_web_lg
nlp = spacy.load('en_core_web_lg')

CPU times: user 8.76 s, sys: 1.43 s, total: 10.2 s
Wall time: 10.9 s


In [25]:
%%time
X_train_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_train.text])
X_test_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_test.text])
X_train = hstack((X_train_tfidf, X_train_tfidf_char, X_train_vectors))
X_test = hstack((X_test_tfidf, X_test_tfidf_char, X_test_vectors))

CPU times: user 2min 8s, sys: 1.46 s, total: 2min 9s
Wall time: 2min 11s


In [26]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 2.25 s, sys: 55.3 ms, total: 2.3 s
Wall time: 2.31 s


# и 4 бейзлайн тоже :)

In [27]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.77      0.80      0.79      2000
           4       0.79      0.77      0.78      2000

    accuracy                           0.78      4000
   macro avg       0.78      0.78      0.78      4000
weighted avg       0.78      0.78      0.78      4000

Accuracy: 0.7835


### Препроцессинг 

#### я написала огрумную функцию препроцессинга, которая давала 60 комбинаций параметров и провела все комбинации на первых трёх моделях чтобы найти лучшее сочетание


##### вот такие парметры и их варианты:
1) @хэндлы:
- убирать все
- убирать непопулярные
- приводить неполярные к виду @handle
- приводить все к виду @handle
- оставлять все

2) конечная пунктуация:
- оставлять разную (!?.)
- делать всё точками

3) смайлы
- оставлять как есть
- убирать все
- приводить к :) или :(

4) маркирование области отрицания
- маркировать
- не маркировать


### помогло ли это?

# NO

![Drag Racing](ohno.png)

#### но ниже можно посмотреть как это было

In [16]:
import re
import nltk
from nltk.tokenize import TweetTokenizer
from collections import Counter
tknzr = TweetTokenizer(strip_handles=False, reduce_len=True)

In [26]:
negations = ['never', 'no', 'not', 'noone', 'nowhere', 'none', 'nobody', 'neither', 'seldom', 'hardly', "n't", "cannot"]



In [27]:
with open('handles.txt', 'r', encoding='utf-8') as f:
    raw = f.read()
    pop_handles = raw.split('\n')
    

In [28]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [115]:
smiles_collection = {
    'positive': [":p", "(:", ";-)","=D",":]",":-D","=P",";D",";p",";]",";P","=p","=]",":-p",":')",";-D", ":)"],
    'negative': [':|',":'(", '=/', '=(',':-\\', '/:', '>:[', "='(", ':[', '):', ":("]
}
all_smiles = [':)_NEG', ':)', ':(_NEG', ':(']

In [132]:
# с аргументами, чтобы попробовать разные комбинации
def preprocessing(tweet, handles, smiles, neg_marked, end_punkt, verbose):
    # убирает частицы после апострофа, делит токен с отрицанием на token + not, убирает числа
    tweet = tweet.lower()
    tweet = re.sub("(?:'s|'ll|'ve)", '', tweet)
    tweet = re.sub("n't", ' not ', tweet)
    tweet = re.sub("(do|does|did|could|can|should|must|shall|will|would|have|wo)nt", 'not', tweet)
    tweet = re.sub("[124567890]+", '', tweet)
    tweet = re.sub("idk", 'not know', tweet)
    
    def end_punkt(tweet, end_punkt):
        if end_punkt:
            tweet = re.sub("[^?A-Za-z]!", "!. ", tweet)
            tweet = re.sub("\?!", "?. ", tweet)
            tweet = re.sub("\?", "?. ", tweet)
        else:
            tweet = re.sub("[!?]+", '. ', tweet)

        sentences = tweet.split('. ')
        return sentences
            
        
    def handles_format(sentences, handles):
        preproc = []
        for sentence in sentences:
            tokens = tknzr.tokenize(sentence.lower())
            if handles == 'delete_all':
                for token in tokens:
                    if ('@'not in token):
                        preproc.append(token)
            if handles == 'delete_unpopular':
                for token in tokens:
                    if ('@' in token) and (token in pop_handles):
                        preproc.append(token)
                    if '@' not in token:
                        preproc.append(token)
            if handles == 'reformat_unpopular':
                for token in tokens:
                    if ('@' in token) and (token in pop_handles):
                        preproc.append(token)
                    if ('@' in token) and (token not in pop_handles):
                        preproc.append('@handle')
                    else:
                        preproc.append(token)
            if handles == 'leave_all':
                preproc.extend(tokens)
            if handles == 'reformat_all':
                for token in tokens:
                    if ('@' in token):
                        preproc.append('@handle')
                    else:
                        preproc.append(token)
        return preproc
    
    
    def stop_words(preproc):
        no_stop_words = []
        for token in preproc:
            if token not in stopwords:
                no_stop_words.append(token)
        return no_stop_words
    
    
    def neg_marked_format(preproc, neg_marked):
        if neg_marked is True:
            neg_marked = []
            final = []
            negated = len(list(set(preproc) & set(negations)))
            condition = False
            if negated:
                for token in preproc:
                    if token in negations:
                        condition = True
                    elif condition == True:
                        neg_marked.append(token + '_NEG')
                    elif re.match('[^A-Za-z]', token) and condition == True:
                        condition = False
                    elif condition == False:
                        neg_marked.append(token)
                final = neg_marked
            else:
                final = preproc
        else:
            final = preproc
        return final
    
    def smiles_format(final, smiles):
        full_proc_tweet = []
        punkt = [',', ';', ':', '-']
        if smiles == 'reformat':
            for token in final:
                if token in smiles_collection['positive']:
                    full_proc_tweet.append(':)')
                elif '_' in token and token.split('_')[0] in smiles_collection['positive']:
                    full_proc_tweet.append(':)')
                elif token in smiles_collection['negative']:
                    full_proc_tweet.append(':(')
                elif '_' in token and token.split('_')[0] in smiles_collection['negative']:
                    full_proc_tweet.append(':(')
                elif re.match('[^A-Za-z.!?<3_]', token):
                    pass
                else:
                    full_proc_tweet.append(token)
                    
                
        elif smiles == 'delete_all':
            for token in final:
                if re.match('[^A-Za-z.!?_]', token):
                    pass
                else:
                    full_proc_tweet.append(token)
                    
        elif smiles == 'leave_all':
            for token in final:
                if token in punkt:
                    pass
                else:
                    full_proc_tweet.append(token)
                    
        return full_proc_tweet
    
    def testing(verbose):
        if verbose:
            print('sent split', sentences)
            print('token n hadnles', preproc)
            print('stop words', no_stop_words)
            print('neg_marked', final)
    
    sentences = end_punkt(tweet, end_punkt)
    
    preproc = handles_format(sentences, handles)
    
    no_stop_words = stop_words(preproc)
    
    final = neg_marked_format(no_stop_words, neg_marked)
    
    full_proc_tweet = smiles_format(final, smiles)
    
    testing(verbose)

            
    return full_proc_tweet

### Эксперимент по подбору параментров

In [68]:
handles_choice = ['delete_all', 'delete_unpopular', 'reformat_unpopular', 'leave_all', 'reformat_all']
smiles_choice = ['leave_all', 'delete_all', 'reformat']
neg_marked_choice = [True, False]
end_punkt_choice = [True, False] #при True оставляет разные конечные знаки препинания
verbose = False

In [138]:

def new_data(for_proc, a, b, c, d):
    x = for_proc.copy()
    problem = []
    for i, row in x.iterrows():
        raw = (row['text'])
        try:
            preproc = preprocessing(raw, handles=a, smiles=b, neg_marked=c, end_punkt=d, verbose=False )
            x.at[i,'text'] = " ".join(preproc)
        except Exception:
            problem.append(raw)
            preproc = raw
            x.at[i,'text'] = (preproc)

    
    return [x, problem]

def stats(preproc_data):
    a = preproc_data.copy()
    SEED = 227
    np.random.seed(SEED)
    df_train, df_test = train_test_split(a, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)
    
    y_train = df_train.target
    y_test = df_test.target
    
    count_vectorizer = CountVectorizer(ngram_range=(1, 6))
    X_train_count = count_vectorizer.fit_transform(df_train.text)

    X_test_count = count_vectorizer.transform(df_test.text)
    X_train = X_train_count
    X_test = X_test_count
    
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_pred, y_test)
    print(f'Accuracy: {accuracy}')
    
    return accuracy

def stats2(preproc_data):
    a = preproc_data.copy()
    SEED = 227
    np.random.seed(SEED)
    df_train, df_test = train_test_split(a, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)
    
    y_train = df_train.target
    y_test = df_test.target
    
    tfidf_vectorizer = TfidfVectorizer()
    X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
    X_test_tfidf = tfidf_vectorizer.transform(df_test.text)
    X_train = X_train_tfidf
    X_test = X_test_tfidf

    model = LogisticRegression(random_state=SEED, solver='liblinear')
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_pred, y_test)
    print(f'Accuracy: {accuracy}')
    
    return accuracy

def stats3(preproc_data):
    a = preproc_data.copy()
    SEED = 227
    np.random.seed(SEED)
    df_train, df_test = train_test_split(a, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)
    
    y_train = df_train.target
    y_test = df_test.target
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 4))
    X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
    X_test_tfidf = tfidf_vectorizer.transform(df_test.text)

    tfidf_vectorizer_char = TfidfVectorizer(ngram_range=(3, 4), analyzer='char')
    X_train_tfidf_char = tfidf_vectorizer_char.fit_transform(df_train.text)
    X_test_tfidf_char = tfidf_vectorizer_char.transform(df_test.text)

    X_train = hstack((X_train_tfidf, X_train_tfidf_char))
    X_test = hstack((X_test_tfidf, X_test_tfidf_char))

    model = LogisticRegression(random_state=SEED, solver='liblinear')
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_pred, y_test)
    print(f'Accuracy: {accuracy}')
    
    return accuracy

In [120]:
tweet = '@JackAllTimeLow hope it went good! i couldnt make it, so i think we should hang out  haha. enjoy the rest of ur time here in Sydney  xx'
result = preprocessing(tweet, handles="delete_all", smiles="reformat", neg_marked= True, end_punkt=True, verbose = False)
print(result)

['hope', 'went', 'goo', '!', 'make_NEG', 'think_NEG', 'hang_NEG', 'haha_NEG', 'enjoy_NEG', 'rest_NEG', 'ur_NEG', 'time_NEG', 'sydney_NEG', 'xx_NEG']


In [133]:
df.head(1)

Unnamed: 0,target,date,text
0,4,Tue Jun 02 02:59:24 PDT 2009,@JackAllTimeLow hope it went good! i couldnt m...


In [130]:
tweet = for_proc.head(1).values[0]

In [134]:
stat = {}
problematic = []
for a in handles_choice:
    for b in smiles_choice:
        for c in neg_marked_choice:
            for d in end_punkt_choice:
                for_proc = df.copy()
                proc_res = new_data(for_proc, a, b, c, d)
                problematic.append([(a,b,c,d),proc_res[1]])
                print(proc_res[0].head(1)['text'])
                accuracy = stats(proc_res[0].copy())
                
                print(f"ACCURACY {accuracy}, handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}")
                param = f"handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}"
                stat.update({param:accuracy})
        

0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.725
ACCURACY 0.725, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.725
ACCURACY 0.725, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice False
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.73525
ACCURACY 0.73525, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.73525
ACCURACY 0.73525, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.7265
ACCURACY 0.7265, handles_choice delete_all, sm

0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.7335
ACCURACY 0.7335, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.7335
ACCURACY 0.7335, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.726
ACCURACY 0.726, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.726
ACCURACY 0.726, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice False
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.73375
ACCURACY 0.73375, handles_choice leave_all, smiles_

In [137]:
stat2 = {}
problematic = []
for a in handles_choice:
    for b in smiles_choice:
        for c in neg_marked_choice:
            for d in end_punkt_choice:
                for_proc = df.copy()
                proc_res = new_data(for_proc, a, b, c, d)
                problematic.append([(a,b,c,d),proc_res[1]])
                print(proc_res[0].head(1)['text'])
                accuracy = stats2(proc_res[0].copy())
                
                print(f"ACCURACY {accuracy}, handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}")
                param = f"handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}"
                stat2.update({param:accuracy})
        

0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.737
ACCURACY 0.737, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.737
ACCURACY 0.737, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice False
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.7435
ACCURACY 0.7435, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.7435
ACCURACY 0.7435, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.73775
ACCURACY 0.73775, handles_choice delete_all, smil

Accuracy: 0.73525
ACCURACY 0.73525, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice False
0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.74275
ACCURACY 0.74275, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.74275
ACCURACY 0.74275, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.7375
ACCURACY 0.7375, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.7375
ACCURACY 0.7375, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice False
0    hop

In [139]:
stat3 = {}
problematic = []
for a in handles_choice:
    for b in smiles_choice:
        for c in neg_marked_choice:
            for d in end_punkt_choice:
                for_proc = df.copy()
                proc_res = new_data(for_proc, a, b, c, d)
                problematic.append([(a,b,c,d),proc_res[1]])
                print(proc_res[0].head(1)['text'])
                accuracy = stats3(proc_res[0].copy())
                
                print(f"ACCURACY {accuracy}, handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}")
                param = f"handles_choice {a}, smiles_choice {b}, neg_marked_choice {c}, end_punkt_choice {d}"
                stat3.update({param:accuracy})

0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.752
ACCURACY 0.752, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG ,_NEG think_NEG hang...
Name: text, dtype: object
Accuracy: 0.752
ACCURACY 0.752, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice True, end_punkt_choice False
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.75775
ACCURACY 0.75775, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.75775
ACCURACY 0.75775, handles_choice delete_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.7465
ACCURACY 0.7465, handles_choice delete_all, sm

0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.7605
ACCURACY 0.7605, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice True
0    @jackalltimelow hope went good ! not make thin...
Name: text, dtype: object
Accuracy: 0.7605
ACCURACY 0.7605, handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.747
ACCURACY 0.747, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice True
0    hope went good ! make_NEG think_NEG hang_NEG h...
Name: text, dtype: object
Accuracy: 0.747
ACCURACY 0.747, handles_choice leave_all, smiles_choice delete_all, neg_marked_choice True, end_punkt_choice False
0    hope went good ! not make think hang haha enjo...
Name: text, dtype: object
Accuracy: 0.7535
ACCURACY 0.7535, handles_choice leave_all, smiles_ch

###  Какой препроцессинг выбрать?

In [141]:
import operator
sorted_d = sorted(stat3.items(), key=operator.itemgetter(1))

In [145]:
print(sorted_d[-1])

('handles_choice leave_all, smiles_choice leave_all, neg_marked_choice False, end_punkt_choice False', 0.7605)


## Никакой он только мешает

In [146]:
from sklearn.ensemble import GradientBoostingRegressor


### Попробую другую модлеь

In [185]:
%%time
model = GradientBoostingRegressor(random_state=SEED,
                                           n_estimators=100, 
                                           max_depth=5,
                                           learning_rate=0.1,
                                           min_samples_leaf=1, 
                                           criterion = 'mse',
                                           min_impurity_split=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)









ValueError: Classification metrics can't handle a mix of continuous and binary targets

In [153]:
print(accuracy) #первый способ преобразовать

0.752


In [163]:
print(accuracy) # второй

0.752


In [172]:
print(accuracy) # третий

0.752


## а что если спейси дать модель получше?

In [186]:
print(accuracy) # с более большой моделью

0.752
