We preprocess Twitter data so that the cleaned text can later be used for a classification task. The dataset contains negative (label = 1) and neutral (label = 0) tweets.
To process both parts in the same way, we combine train_df and test_df.

Tasks:

1) Remove @user from all tweets using the pattern "@[\w]*". To do this, write a function:
 - to find all occurrences of the pattern in the text, use re.findall(pattern, input_txt)
 - to replace @user with a space, use re.sub()

2) Convert the tweets to lowercase with .lower().

3) Replace contractions with apostrophes (for example: ain't, can't) with their full forms using apostrophe_dict. To do this, write a function: for each word in the text (for word in text.split()), check whether the word is a key in apostrophe_dict (the contracted form) and, if so, replace the key with its value (the full form).

4) Replace abbreviations with their full forms using short_word_dict, reusing the function from the previous step.

5) Replace emoticons with their text meanings (for example: ":)" = "happy") using emoticon_dict, reusing the function from step 3.

6) Replace punctuation with spaces using re.sub() and the pattern r'[^\w\s]'.

7) Replace special characters with spaces using re.sub() and the pattern r'[^a-zA-Z0-9]'.

8) Replace digits with spaces using re.sub() and the pattern r'[^a-zA-Z]'.

9) Remove one-character words from the text using ' '.join([w for w in x.split() if len(w)>1]).

10) Split the tweets into tokens with nltk.tokenize.word_tokenize, creating a new column 'tweet_token'.

11) Remove stop words from the tokens using nltk.corpus.stopwords. Create a column 'tweet_token_filtered' without stop words.

12) Apply stemming to the tokens with nltk.stem.PorterStemmer. Create a column 'tweet_stemmed' with the stemmed tokens.

13) Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer. Create a column 'tweet_lemmatized' with the lemmatized tokens.

14) Save the preprocessing result to a pickle file.

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')  # needed by nltk.corpus.stopwords in step 11

import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)
import os

from common import init_data

apostrophe_dict, short_word_dict, emoticon_dict = init_data()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\konst\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\konst\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
train_df = pd.read_csv('train_tweets.csv')
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [3]:
test_df = pd.read_csv('test_tweets.csv')
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
combine_df = pd.concat([train_df, test_df], ignore_index=True, sort=False)
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,@user when a father is dysfunctional and is s...
1,2,0.0,@user @user thanks for #lyft credit i can't us...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation


In [5]:
print(combine_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49159 entries, 0 to 49158
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      49159 non-null  int64  
 1   label   31962 non-null  float64
 2   tweet   49159 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB
None
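
The info() output shows only 31962 non-null labels with dtype float64. A minimal sketch of why this happens, using two toy frames with made-up rows (the names toy_train, toy_test and toy_combined are illustrative only): the test part has no label column, so its rows are padded with NaN, and NaN forces the column to float.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (made-up rows, not the real data).
toy_train = pd.DataFrame({"id": [1, 2], "label": [0, 1], "tweet": ["a", "b"]})
toy_test = pd.DataFrame({"id": [3], "tweet": ["c"]})

# Rows coming from toy_test have no label, so the combined column gets NaN there.
toy_combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

print(toy_combined["label"].dtype)           # float64, because NaN is a float
print(toy_combined["label"].isna().sum())    # number of rows without a label
```

Under this assumption, the rows with a missing label are exactly the test rows, so the two parts can be separated again later with toy_combined["label"].isna().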


1. Remove @user from all tweets using the pattern "@[\w]*". To do this, write a function:
 - to find all occurrences of the pattern in the text, use re.findall(pattern, input_txt)
 - to replace @user with a space, use re.sub()


In [6]:
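def replace_by_regex(input_string: str, repl: str, pattern: str) -> str:
    # Replace every match of `pattern` in `input_string` with `repl`.
    return re.sub(pattern=pattern, repl=repl, string=input_string)

# Raw string, so that \w reaches the regex engine without an invalid-escape warning.
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_by_regex(x, " ", r"@[\w]*"))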
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so se...
1,2,0.0,thanks for #lyft credit i can't use cause ...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation
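
The task mentions both re.findall and re.sub. A small sketch of how the two behave on one sample string (the sample is modelled on the second tweet; the variable names are illustrative):

```python
import re

pattern = r"@[\w]*"
sample = "@user @user thanks for #lyft credit"

# re.findall lists every substring that the pattern matches.
mentions = re.findall(pattern, sample)

# re.sub replaces each of those matches with a single space.
cleaned = re.sub(pattern, " ", sample)

print(mentions)         # ['@user', '@user']
print(cleaned.split())  # ['thanks', 'for', '#lyft', 'credit']
```

Because each mention becomes a space rather than an empty string, the cleaned text starts with extra whitespace; the later split()/join() steps collapse it.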


2. Convert the tweets to lowercase with .lower()

In [7]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: x.lower())
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so se...
1,2,0.0,thanks for #lyft credit i can't use cause ...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation


3. Replace contractions with apostrophes (for example: ain't, can't) with their full forms using apostrophe_dict. To do this, write a function: for each word in the text (for word in text.split()), check whether the word is a key in apostrophe_dict (the contracted form) and, if so, replace the key with its value (the full form).

In [8]:
def replace_substr(input_string: str, repl_dict: dict) -> str:
    # Swap every whitespace-separated word that is a key of repl_dict for its value;
    # words that are not in the dictionary are kept unchanged.
    words = input_string.split()
    return " ".join(repl_dict.get(word, word) for word in words)

combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_substr(x, apostrophe_dict))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ur...
4,5,0.0,factsguide: society now #motivation
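
A short sketch of what whole-word dictionary lookup does and does not catch. The dictionary toy_apostrophe_dict and the helper expand are hypothetical stand-ins; the real apostrophe_dict comes from init_data() and its contents are not shown here.

```python
# Hypothetical two-entry stand-in for apostrophe_dict.
toy_apostrophe_dict = {"can't": "cannot", "ain't": "is not"}

def expand(text: str, mapping: dict) -> str:
    # Only exact whitespace-separated words are looked up in the mapping.
    return " ".join(mapping.get(word, word) for word in text.split())

print(expand("i can't use it", toy_apostrophe_dict))   # i cannot use it
print(expand("no, i can't!", toy_apostrophe_dict))     # no, i can't!
```

In the second call the word is "can't!" with the exclamation mark attached, so it is not a dictionary key and stays as it is. Contractions glued to punctuation are therefore left for the later regex steps, which split them at the apostrophe.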


4. Replace abbreviations with their full forms using short_word_dict, reusing the function from the previous step.

In [9]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_substr(x, short_word_dict))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love you take with you all the time i...
4,5,0.0,factsguide: society now #motivation


5. Replace emoticons with their text meanings (for example: ":)" = "happy") using emoticon_dict, reusing the replace_substr function from step 3.

In [10]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_substr(x, emoticon_dict))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love you take with you all the time i...
4,5,0.0,factsguide: society now #motivation
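
The order of the steps matters here: emoticons consist of punctuation characters, so they have to be replaced before punctuation is stripped. A minimal sketch with a hypothetical one-entry dictionary (toy_emoticon_dict, expand and strip_punct are illustrative names, not part of the notebook):

```python
import re

# Hypothetical one-entry stand-in for emoticon_dict.
toy_emoticon_dict = {":)": "happy"}

def expand(text: str, mapping: dict) -> str:
    # Whole-word dictionary lookup, as in step 3.
    return " ".join(mapping.get(word, word) for word in text.split())

def strip_punct(text: str) -> str:
    # Same pattern as in step 6.
    return re.sub(r"[^\w\s]", " ", text)

tweet = "great day :)"

emoticons_first = strip_punct(expand(tweet, toy_emoticon_dict))
punct_first = expand(strip_punct(tweet), toy_emoticon_dict)

print(emoticons_first)  # great day happy
print(punct_first)      # great day
```

With the reversed order the emoticon is erased before the dictionary can see it, and the sentiment signal is lost.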


6. Replace punctuation with spaces using re.sub() and the pattern r'[^\w\s]'

In [11]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_by_regex(x, " ", r"[^\w\s]"))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,model i love you take with you all the time i...
4,5,0.0,factsguide society now motivation


7. Replace special characters with spaces using re.sub() and the pattern r'[^a-zA-Z0-9]'

In [12]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_by_regex(x, " ", "[^a-zA-Z0-9]"))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,model i love you take with you all the time i...
4,5,0.0,factsguide society now motivation


8. Replace digits with spaces using re.sub() and the pattern r'[^a-zA-Z]'

In [13]:
combine_df['tweet'] = combine_df['tweet'].apply(lambda x: replace_by_regex(x, " ", "[^a-zA-Z]"))
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when a father is dysfunctional and is so selfi...
1,2,0.0,thanks for lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty
3,4,0.0,model i love you take with you all the time i...
4,5,0.0,factsguide society now motivation
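
The first five rows look the same after steps 7 and 8, so here is a sketch of what each of the three patterns actually removes, on a made-up sample string (not taken from the dataset):

```python
import re

# Made-up sample with an accented letter (\u00e9), a digit, punctuation and an underscore.
sample = "caf\u00e9 2go: snake_case!"

step6 = re.sub(r"[^\w\s]", " ", sample)        # drops ':' and '!'
step7 = re.sub(r"[^a-zA-Z0-9]", " ", sample)   # also drops the accented letter and '_'
step8 = re.sub(r"[^a-zA-Z]", " ", sample)      # also drops the digit

print(step6.split())  # ['café', '2go', 'snake_case']
print(step7.split())  # ['caf', '2go', 'snake', 'case']
print(step8.split())  # ['caf', 'go', 'snake', 'case']

# Chaining all three patterns gives the same string as the last pattern alone.
chained = re.sub(r"[^a-zA-Z]", " ", re.sub(r"[^a-zA-Z0-9]", " ", step6))
print(chained == step8)  # True
```

Each pattern removes a superset of what the previous one removes, so the last pattern on its own would produce the same text; the separate steps mainly make the intermediate results visible.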


9. Remove one-character words from the text using ' '.join([w for w in x.split() if len(w)>1])

In [14]:
def delete_short_word(input_string: str) -> str:
    return ' '.join([w for w in input_string.split() if len(w)>1])

combine_df['tweet'] = combine_df['tweet'].apply(delete_short_word)
combine_df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when father is dysfunctional and is so selfish...
1,2,0.0,thanks for lyft credit cannot use cause they d...
2,3,0.0,bihday your majesty
3,4,0.0,model love you take with you all the time in ur
4,5,0.0,factsguide society now motivation
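
As a side effect, the split()/join() pair also collapses the runs of spaces left behind by the earlier regex steps. A small sketch on a made-up string (drop_one_char_words is an illustrative name):

```python
def drop_one_char_words(text: str) -> str:
    # split() without arguments discards all repeated whitespace.
    return " ".join(w for w in text.split() if len(w) > 1)

raw = "  thanks for   lyft credit i cannot use "
print(drop_one_char_words(raw))  # thanks for lyft credit cannot use
```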


10. Split the tweets into tokens with nltk.tokenize.word_tokenize, creating a new column 'tweet_token'.

In [15]:
combine_df['tweet_token'] = combine_df['tweet'].apply(nltk.tokenize.word_tokenize)
combine_df.head()

Unnamed: 0,id,label,tweet,tweet_token
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,..."
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau..."
2,3,0.0,bihday your majesty,"[bihday, your, majesty]"
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ..."
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]"


11. Remove stop words from the tokens using nltk.corpus.stopwords. Create a column 'tweet_token_filtered' without stop words.

In [16]:
def replace_stop_words(tokens: list, stop_words: set) -> list:
    # Keep only the tokens that are not in the stop-word set.
    return [word for word in tokens if word not in stop_words]

stop_words = set(nltk.corpus.stopwords.words("english"))

combine_df['tweet_token_filtered'] = combine_df['tweet_token'].apply(lambda x: replace_stop_words(x, stop_words))
combine_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ..."
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0.0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]"
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"
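
The filtering itself is a plain membership test. A sketch with a hypothetical four-word stop list (the notebook uses the full nltk.corpus.stopwords.words("english") list instead):

```python
# Hypothetical mini stop-word list, for illustration only.
toy_stop_words = {"your", "is", "and", "the"}

tokens = ["bihday", "your", "majesty"]

# A set gives constant-time lookups, which matters when filtering many tweets.
filtered = [token for token in tokens if token not in toy_stop_words]

print(filtered)  # ['bihday', 'majesty']
```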


12. Apply stemming to the tokens with nltk.stem.PorterStemmer. Create a column 'tweet_stemmed' with the stemmed tokens.

In [17]:
ps = nltk.stem.PorterStemmer()

def stemmer(tokens: list) -> list:
    # Reduce each token to its Porter stem.
    return [ps.stem(word) for word in tokens]

combine_df['tweet_stemmed'] = combine_df['tweet_token_filtered'].apply(stemmer)
combine_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...","[father, dysfunct, selfish, drag, kid, dysfunc..."
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...","[thank, lyft, credit, use, caus, offer, wheelc..."
2,3,0.0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]"
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]"
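
Stems are not necessarily dictionary words. A short sketch that stems a few tokens from the rows above on their own (it assumes nltk is installed; PorterStemmer needs no extra corpus download):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Tokens taken from the 'tweet_token_filtered' column shown above.
words = ["majesty", "society", "motivation", "drags"]
stems = [porter.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
# majesty -> majesti
# society -> societi
# motivation -> motiv
# drags -> drag
```

The stemmer applies suffix-stripping rules only, which is why forms such as "majesti" and "motiv" appear in the 'tweet_stemmed' column.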


13. Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer. Create a column 'tweet_lemmatized' with the lemmatized tokens.

In [18]:
wnlemm = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatizer(tokens: list) -> list:
    # Without a pos argument, lemmatize() treats every token as a noun.
    return [wnlemm.lemmatize(word) for word in tokens]

combine_df['tweet_lemmatized'] = combine_df['tweet_token_filtered'].apply(lemmatizer)
combine_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...","[father, dysfunct, selfish, drag, kid, dysfunc...","[father, dysfunctional, selfish, drag, kid, dy..."
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...","[thank, lyft, credit, use, caus, offer, wheelc...","[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0.0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]","[bihday, majesty]"
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]","[factsguide, society, motivation]"


14. Save the preprocessing result to a pickle file.

In [19]:
combine_df.to_pickle('df.pkl')
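
A sketch of reading such a file back, using a toy frame and a temporary file (toy_df and toy.pkl are hypothetical names). Pickle keeps the token columns as Python lists, whereas a CSV file would store them as plain strings.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one list-valued column, like 'tweet_token' in the notebook.
toy_df = pd.DataFrame({
    "tweet": ["bihday your majesty"],
    "tweet_token": [["bihday", "your", "majesty"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")  # hypothetical file name
    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

print(type(restored.loc[0, "tweet_token"]))  # <class 'list'>
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.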