Осуществим предобработку данных с Твиттера, чтобы очищенные данные в дальнейшем
использовать для задачи классификации. Данный датасет содержит негативные (label = 1) и нейтральные (label = 0) высказывания. Для работы объединим train_df и test_df.
Задания:
1. Удалим @user из всех твитов с помощью паттерна "@[\w]*". Для этого создадим
функцию:
- для того, чтобы найти все вхождения паттерна в тексте, необходимо
использовать re.findall(pattern, input_txt)
- для замены @user на пробел, необходимо использовать re.sub()
2. Изменим регистр твитов на нижний с помощью .lower().
3. Заменим сокращения с апострофами (пример: ain't, can't) на пробел, используя
apostrophe_dict. Для этого необходимо сделать функцию: для каждого слова в
тексте проверить (for word in text.split()), если слово есть в словаре apostrophe_dict в
качестве ключа (сокращенного слова), то заменить ключ на значение (полную
версию слова).
4. Заменим сокращения на их полные формы, используя short_word_dict. Для этого
воспользуемся функцией, используемой в предыдущем пункте.
5. Заменим эмотиконы (пример: ":)" = "happy") на пробелы, используя emoticon_dict.
Для этого воспользуемся функцией, используемой в предыдущем пункте.
6. Заменим пунктуацию на пробелы, используя re.sub() и паттерн r'[^\w\s]'.
7. Заменим спец. символы на пробелы, используя re.sub() и паттерн r'[^a-zA-Z0-9]'.
8. Заменим числа на пробелы, используя re.sub() и паттерн r'[^a-zA-Z]'.
9. Удалим из текста слова длиной в 1 символ, используя ' '.join([w for w in x.split() if
len(w)>1]).
10. Поделим твиты на токены с помощью nltk.tokenize.word_tokenize, создав новый
столбец 'tweet_token'.
11. Удалим стоп-слова из токенов, используя nltk.corpus.stopwords. Создадим столбец
'tweet_token_filtered' без стоп-слов.
12. Применим стемминг к токенам с помощью nltk.stem.PorterStemmer. Создадим
столбец 'tweet_stemmed' после применения стемминга.
13. Применим лемматизацию к токенам с помощью
nltk.stem.wordnet.WordNetLemmatizer. Создадим столбец 'tweet_lemmatized' после
применения лемматизации.
14. Сохраним результат предобработки в pickle-файл.

In [17]:
import pandas as pd
import os
import re
import nltk
import json
from nltk.tokenize import word_tokenize
from nltk import tokenize as tknz
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import pickle

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/dmitriy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dmitriy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dmitriy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
data_path = r'/media/dmitriy/Disk/Downloads/ai_nlp_hw_data/hw_1/'

In [4]:
df_train = pd.read_csv(os.path.join(data_path, r'train_tweets.csv'))
df_test = pd.read_csv(os.path.join(data_path, r'test_tweets.csv'))

In [5]:
print(df_train.shape, df_test.shape)
df = pd.concat([df_train, df_test], axis=0).reset_index(drop=True)
print(df.shape)

(31962, 3) (17197, 2)
(49159, 3)


In [6]:
df_train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [7]:
df_test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [8]:
def delete_pattern(text, pattern):
    return re.sub(pattern, ' ', text)

1. Удаление @user из всех твитов с помощью паттерна "@[\w]*"

In [9]:
pattern_for_username = "@[\w]*"
df['tweet'] = df['tweet'].apply(lambda x: delete_pattern(x, pattern_for_username))

2. Изменение регистра твитов на нижний с помощью .lower()

In [10]:
df['tweet'] = df['tweet'].apply(lambda x: x.lower())

In [18]:
with open(r'apostrophe.json', 'r', encoding='utf-8') as f:
    apostrophe_dict = json.load(f)

In [19]:
def replace_words(text, correct_dict):
    new_text = ''
    for word in text.split():
        if word in correct_dict.keys():
            new_word = correct_dict[word]
        else:
            new_word = word
        new_text = new_text + ' ' + new_word
    return new_text

3. Замена сокращений с апострофами (пример: ain't, can't) на пробел с использованием apostrophe_dict. 

In [20]:
df['tweet'] = df['tweet'].apply(lambda x: replace_words(x, apostrophe_dict))

In [21]:
with open(r'short_word_dict.json', 'r', encoding='utf-8') as f:
    short_word_dict = json.load(f)

4. Замена сокращений на их полные формы с использованием short_word_dict.

In [22]:
df['tweet'] = df['tweet'].apply(lambda x: replace_words(x, short_word_dict))

In [30]:
with open(r'emoticon_dict.json', 'r', encoding='utf-8') as f:
    emoticon_dict = json.load(f)

5. Замена эмотиконов (пример: ":)" = "happy") на пробелы с использованием emoticon_dict.

In [31]:
df['tweet'] = df['tweet'].apply(lambda x: replace_words(x, emoticon_dict))

In [32]:
def replace_punctuation(text, pattern_for_punctuation):
    return re.sub(pattern_for_punctuation, ' ', text)

6. Замена пунктуации на пробелы с использованием re.sub() и паттерна r'[^\w\s]'.

In [33]:
pattern_for_punctuation = r'[^\w\s]'
df['tweet'] = df['tweet'].apply(lambda x: delete_pattern(x, pattern_for_punctuation))

In [34]:
def replace_special_symb(text, pattern_for_special_symb):
    return re.sub(pattern_for_special_symb, ' ', text)

7. Замена спец. символов на пробелы с использованием re.sub() и паттерна r'[^a-zA-Z0-9]'.

In [35]:
pattern_for_special_symb = r'[^a-zA-Z0-9]'
df['tweet'] = df['tweet'].apply(lambda x: delete_pattern(x, pattern_for_special_symb))

In [36]:
def delete_short_words(text, min_len):
    return ' '.join([word for word in text.split() if len(word) > min_len])

8. Замена чисел на пробелы с использованием re.sub() и паттерна r'[^a-zA-Z]'.

In [37]:
pattern_for_numbers = r'[^a-zA-Z]'
df['tweet'] = df['tweet'].apply(lambda x: delete_pattern(x, pattern_for_special_symb))

9. Удаление из текста слова длиной в 1 символ с использованием конструкции ' '.join([w for w in x.split() if len(w)>1]).

In [38]:
min_len = 1
df['tweet'] = df['tweet'].apply(lambda x: delete_short_words(x, min_len))

In [39]:
df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,when father is dysfunctional and is so selfish...
1,2,0.0,thanks for lyft credit cannot use cause they d...
2,3,0.0,bihday your majesty
3,4,0.0,model love you take with you all the time in ur
4,5,0.0,factsguide society now motivation


10. Деление твитов на токены с помощью nltk.tokenize.word_tokenize с созданием нового столбца 'tweet_token'.

In [40]:
df['tweet_token'] = df['tweet'].apply(lambda x: tknz.word_tokenize(x))

In [41]:
df.head()

Unnamed: 0,id,label,tweet,tweet_token
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,..."
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau..."
2,3,0.0,bihday your majesty,"[bihday, your, majesty]"
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ..."
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]"


In [42]:
def delete_stop_words(text, stop_words):
    return ' '.join([word for word in text if word not in stop_words])

11. Удаление стоп-слов из токенов с использованием nltk.corpus.stopwords. Создание столбца 'tweet_token_filtered' без стоп-слов.

In [43]:
stop_words = set(stopwords.words("russian"))
df['tweet_token_filtered'] = df['tweet_token'].apply(lambda x: delete_stop_words(x, stop_words))

In [44]:
def compare_stemmer(text, stemmer):
    return ' '.join([stemmer.stem(word) for word in text.split()])

12. Применение стемминга к токенам с помощью nltk.stem.PorterStemmer. Создание столбца 'tweet_stemmed' после применения стемминга.

In [45]:
stemmer = PorterStemmer()
df['tweet_stemmed'] = df['tweet_token_filtered'].apply(lambda x: compare_stemmer(x, stemmer))

In [46]:
def compare_lemmatizer(text, lemmatizer):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

13. Применение лемматизации к токенам с помощью nltk.stem.wordnet.WordNetLemmatizer. Создание столбца 'tweet_lemmatized' после применения лемматизации.

In [47]:
lemmatizer = WordNetLemmatizer()
df['tweet_lemmatized'] = df['tweet_token_filtered'].apply(lambda x: compare_lemmatizer(x, lemmatizer))

In [48]:
df.head(10)

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...",when father is dysfunctional and is so selfish...,when father is dysfunct and is so selfish he d...,when father is dysfunctional and is so selfish...
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...",thanks for lyft credit can not use cause they ...,thank for lyft credit can not use caus they do...,thanks for lyft credit can not use cause they ...
2,3,0.0,bihday your majesty,"[bihday, your, majesty]",bihday your majesty,bihday your majesti,bihday your majesty
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...",model love you take with you all the time in ur,model love you take with you all the time in ur,model love you take with you all the time in ur
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]",factsguide society now motivation,factsguid societi now motiv,factsguide society now motivation
5,6,0.0,huge fan fare and big talking before they leav...,"[huge, fan, fare, and, big, talking, before, t...",huge fan fare and big talking before they leav...,huge fan fare and big talk befor they leav cha...,huge fan fare and big talking before they leav...
6,7,0.0,camping tomorrow danny,"[camping, tomorrow, danny]",camping tomorrow danny,camp tomorrow danni,camping tomorrow danny
7,8,0.0,the next school year is the year for exams can...,"[the, next, school, year, is, the, year, for, ...",the next school year is the year for exams can...,the next school year is the year for exam can ...,the next school year is the year for exam can ...
8,9,0.0,we won love the land allin cavs champions clev...,"[we, won, love, the, land, allin, cavs, champi...",we won love the land allin cavs champions clev...,we won love the land allin cav champion clevel...,we won love the land allin cavs champion cleve...
9,10,0.0,welcome here am it has it is so gr8,"[welcome, here, am, it, has, it, is, so, gr8]",welcome here am it has it is so gr8,welcom here am it ha it is so gr8,welcome here am it ha it is so gr8


14. Сохранение результата предобработки в pickle-файл.

In [49]:
with open(r'nlp_hw_1_results.pickle', 'wb') as f:
    pickle.dump(df, f)

In [50]:
with open(r'nlp_hw_1_results.pickle', 'rb') as f:
    df_check = pickle.load(f, encoding='utf-8')

In [51]:
df_check.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0.0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...",when father is dysfunctional and is so selfish...,when father is dysfunct and is so selfish he d...,when father is dysfunctional and is so selfish...
1,2,0.0,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...",thanks for lyft credit can not use cause they ...,thank for lyft credit can not use caus they do...,thanks for lyft credit can not use cause they ...
2,3,0.0,bihday your majesty,"[bihday, your, majesty]",bihday your majesty,bihday your majesti,bihday your majesty
3,4,0.0,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...",model love you take with you all the time in ur,model love you take with you all the time in ur,model love you take with you all the time in ur
4,5,0.0,factsguide society now motivation,"[factsguide, society, now, motivation]",factsguide society now motivation,factsguid societi now motiv,factsguide society now motivation
