## Introduction to Natural Language Processing
### Lesson 1. Text Processing


Let's preprocess Twitter data so that the cleaned data can later be used for a classification task. The dataset contains negative (label = 1) and neutral (label = 0) statements.
For this work we combine train_df and test_df.

Tasks:

1) Remove @user from all tweets using the pattern "@[\w]*". To do this, create a function:
 - to find all occurrences of the pattern in the text, use re.findall(pattern, input_txt)
 - to replace @user with a space, use re.sub()

2) Convert the tweets to lowercase with .lower().

3) Replace contractions with apostrophes (for example, ain't, can't) with their full forms, using apostrophe_dict (imported in this notebook as apostrophe_words). To do this, write a function: for each word in the text (for word in text.split()), check whether the word is a key (the contracted form) in apostrophe_dict and, if so, replace the key with its value (the full form of the word).

4) Replace abbreviations with their full forms, using short_word_dict (imported in this notebook as short_words). Reuse the function from the previous step.

5) Replace emoticons (for example, ":)" = "happy") with their text meanings, using emoticon_dict (imported in this notebook as emoji). Reuse the function from the previous step.

6) Replace punctuation with spaces, using re.sub() and the pattern r'[^\w\s]'.

7) Replace special characters with spaces, using re.sub() and the pattern r'[^a-zA-Z0-9]'.

8) Replace digits with spaces, using re.sub() and the pattern r'[^a-zA-Z]'.

9) Remove one-character words from the text, using ' '.join([w for w in x.split() if len(w)>1]).

10) Split the tweets into tokens with nltk.tokenize.word_tokenize, creating a new column 'tweet_token'.

11) Remove stop words from the tokens, using nltk.corpus.stopwords. Create the column 'tweet_token_filtered' without stop words.

12) Apply stemming to the tokens with nltk.stem.PorterStemmer. Create the column 'tweet_stemmed' with the stemmed tokens.

13) Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer. Create the column 'tweet_lemmatized' with the lemmatized tokens.

14) Save the preprocessing result to a pickle file.


In [None]:
import pandas as pd
import re
import nltk
import warnings 
import os
from tqdm import tqdm

from inc.dicts import apostrophe_words, short_words, emoji
from inc.constants import STORAGE_PATH, INPUT_PATH

tqdm.pandas()
warnings.filterwarnings("ignore", category=DeprecationWarning)
nltk.download('punkt')

In [39]:
TRAIN_PATH = os.path.join(INPUT_PATH,'lesson_1/train_tweets.csv')
TEST_PATH = os.path.join(INPUT_PATH,'lesson_1/test_tweets.csv')
PIKLE_PATH = os.path.join(STORAGE_PATH, 'lesson_1.pkl')

In [8]:
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

In [9]:
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [10]:
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [16]:
df = pd.concat([train_df, test_df], ignore_index=True)
df.head()

Unnamed: 0,id,label,tweet
0,1,0.0,@user when a father is dysfunctional and is s...
1,2,0.0,@user @user thanks for #lyft credit i can't us...
2,3,0.0,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation
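
The label column changes from integers to floats after the concatenation because the test rows have no label, and pandas fills the gap with NaN. A minimal sketch on two toy frames (the rows are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical rows).
toy_train = pd.DataFrame(
    {"id": [1, 2], "label": [0, 1], "tweet": ["first tweet", "second tweet"]}
)
toy_test = pd.DataFrame({"id": [3], "tweet": ["third tweet"]})

# ignore_index=True rebuilds a continuous 0..n-1 index for the combined frame.
toy_df = pd.concat([toy_train, toy_test], ignore_index=True)

print(toy_df["label"].dtype)         # float64: NaN forces the integer column to float
print(toy_df["label"].isna().sum())  # number of rows that came from the test frame
```

Rows without a label can be separated again later with a mask such as `toy_df["label"].isna()`.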


In [17]:
# Remove @user from all tweets using the pattern "@[\w]*". To do this, define a function:
def remove_user_txt(cell):
    return re.sub(r'@[\w]*','', cell)

In [18]:
df['tweet2'] = df.tweet.apply(remove_user_txt)

In [19]:
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation
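
The task mentions re.findall, while the cell above calls re.sub directly. The sketch below shows both on one sample string, mainly to make visible that substituting an empty string leaves the surrounding spaces in place. The variable names are illustrative only.

```python
import re

MENTION_PATTERN = r"@[\w]*"  # same pattern as in the task description

sample = "@user @user thanks for #lyft credit"

# re.findall lists every match, which is handy for checking the pattern.
mentions = re.findall(MENTION_PATTERN, sample)

# re.sub removes the matches; the spaces that followed them stay in place.
cleaned = re.sub(MENTION_PATTERN, "", sample)

print(mentions)       # ['@user', '@user']
print(repr(cleaned))  # '  thanks for #lyft credit'
```

The leftover leading spaces are harmless here, because later steps split the text on whitespace and join it back.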


In [21]:
# Convert to lowercase
df['tweet2'] = df.tweet2.apply(lambda x: x.lower())
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation


In [22]:
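# Expand contractions that contain apostrophes
def replace_dict(cell, dictionary):
    # Return the dictionary value for a known word, otherwise the word itself
    def replace_word(word):
        return dictionary.get(word, word)
    # Split on whitespace, replace word by word, and join back with single spaces
    return " ".join(replace_word(word) for word in cell.split())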

In [23]:
df['tweet2'] = df.tweet2.apply(lambda x: replace_dict(x, apostrophe_words))
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so selfi...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ur...
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation
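
Because the lookup works on whitespace-separated words, a contraction is expanded only when the whole word equals a dictionary key. A small sketch with a hypothetical two-entry mapping (the real apostrophe_words dictionary is not shown in this notebook):

```python
def replace_dict(cell, dictionary):
    """Replace every whitespace-separated word that is a key of the dictionary."""
    return " ".join(dictionary.get(word, word) for word in cell.split())


# Hypothetical mapping used only for this illustration.
toy_apostrophe = {"can't": "cannot", "ain't": "am not"}

print(replace_dict("i can't use it", toy_apostrophe))     # i cannot use it
print(replace_dict("no i can't, sorry", toy_apostrophe))  # no i can't, sorry
```

In the second call the trailing comma makes the word "can't," differ from the key, so it stays unchanged. The punctuation step that follows then splits it at the apostrophe.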


In [24]:
# Expand abbreviations to their full forms
df['tweet2'] = df.tweet2.apply(lambda x: replace_dict(x, short_words))
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so selfi...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model i love you take with you all the time i...
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation


In [25]:
# Replace emoticons (example: ":)" = "happy") with their text meanings
df['tweet2'] = df.tweet2.apply(lambda x: replace_dict(x, emoji))
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so selfi...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model i love you take with you all the time i...
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation
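
In this notebook the text is lowercased before emoticons are replaced, so any dictionary key that contains an uppercase letter can no longer match. The sketch below illustrates this with a hypothetical two-entry mapping. Whether it affects the real emoji dictionary depends on its keys, which are not shown here.

```python
def replace_dict(cell, dictionary):
    """Replace every whitespace-separated word that is a key of the dictionary."""
    return " ".join(dictionary.get(word, word) for word in cell.split())


# Hypothetical mapping used only for this illustration.
toy_emoticons = {":)": "happy", ":D": "laughing"}

raw = "Great day :) :D"
before_lower = replace_dict(raw, toy_emoticons)
after_lower = replace_dict(raw.lower(), toy_emoticons)

print(before_lower)  # Great day happy laughing
print(after_lower)   # great day happy :d
```

If the dictionary does contain such keys, one option is to replace emoticons first and lowercase afterwards.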


In [26]:
# Replace punctuation with spaces, using re.sub() and the pattern r'[^\w\s]'
df['tweet2'] = df.tweet2.apply(lambda x: re.sub(r'[^\w\s]',' ', x))
df.head()

Unnamed: 0,id,label,tweet,tweet2
0,1,0.0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so selfi...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit i cannot use cause the...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,model i love you take with you all the time i...
4,5,0.0,factsguide: society now #motivation,factsguide society now motivation
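
The pattern r'[^\w\s]' matches every character that is neither a word character nor whitespace. In Python 3, \w covers letters, digits and the underscore, including non-ASCII letters, so those survive this step. A short sketch on a made-up string:

```python
import re

# Made-up sample: punctuation, a hashtag, an underscore and a non-ASCII letter.
sample = "factsguide: society now #motivation snake_case café"

no_punct = re.sub(r"[^\w\s]", " ", sample)

print(repr(no_punct))
# 'factsguide  society now  motivation snake_case café'
```

Each removed character becomes one space, which is why double spaces appear. The underscore and the accented letter are only handled by the stricter ASCII patterns applied afterwards.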


In [27]:
# Replace special characters with spaces, using re.sub() and the pattern r'[^a-zA-Z0-9]'
df['tweet2'] = df.tweet2.apply(lambda x: re.sub(r'[^a-zA-Z0-9]',' ', x))

In [28]:
# Replace digits with spaces, using re.sub() and the pattern r'[^a-zA-Z]'
df['tweet2'] = df.tweet2.apply(lambda x: re.sub(r'[^a-zA-Z]',' ', x))
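
The three patterns get progressively stricter: every character removed by the first two is also removed by the last one. A minimal check on a made-up string suggests that, for these particular patterns, the final substitution alone gives the same result as the chain:

```python
import re

sample = "room_101 costs 25$ café"  # made-up text

step6 = re.sub(r"[^\w\s]", " ", sample)      # punctuation
step7 = re.sub(r"[^a-zA-Z0-9]", " ", step6)  # special characters
step8 = re.sub(r"[^a-zA-Z]", " ", step7)     # digits (and anything else left)

# Applying only the last pattern to the raw text.
only_last = re.sub(r"[^a-zA-Z]", " ", sample)

print(step8 == only_last)  # True
print(step8.split())       # ['room', 'costs', 'caf']
```

Keeping the steps separate still has teaching value, since each intermediate result can be inspected on its own. Note that non-ASCII letters are dropped, so "café" becomes "caf".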

In [29]:
# Remove one-character words from the text, using ' '.join([w for w in x.split() if len(w)>1])
df['tweet2'] = df.tweet2.apply(lambda x: ' '.join([w for w in x.split() if len(w)>1]))
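
The split/join idiom does two things at once: it filters out short words and collapses the runs of spaces left by the earlier substitutions. A sketch with a hypothetical helper name:

```python
def drop_short_words(text, min_len=2):
    """Keep words with at least min_len characters (hypothetical helper)."""
    # str.split() without arguments splits on any run of whitespace,
    # so repeated, leading and trailing spaces disappear as a side effect.
    return " ".join(w for w in text.split() if len(w) >= min_len)


result = drop_short_words("  thanks for  lyft credit i cannot use")
print(result)  # thanks for lyft credit cannot use
```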

In [32]:
# Split the tweets into tokens with nltk.tokenize.word_tokenize, creating a new column 'tweet_token'.
df['tweet_token'] = df.tweet2.apply(lambda x: nltk.word_tokenize(x))

In [35]:
# Remove stop words from the tokens, using nltk.corpus.stopwords. Create the column 'tweet_token_filtered' without stop words.
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))  # a set makes membership checks fast
df['tweet_token_filtered'] = df.tweet_token.apply(lambda x: [w for w in x if w not in stop_words])
df.head()

[nltk_data] Downloading package stopwords to /home/eugene/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,id,label,tweet,tweet2,tweet_token,tweet_token_filtered
0,1,0.0,@user when a father is dysfunctional and is s...,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ..."
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0.0,bihday your majesty,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,4,0.0,#model i love u take with u all the time in ...,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]"
4,5,0.0,factsguide: society now #motivation,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"
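
The second row of the output shows a side effect worth knowing about: the tokens "can" and "not" are both treated as stop words, so the negation disappears from the filtered list. A sketch with a hypothetical miniature stop list (the nltk list is much longer):

```python
# Hypothetical mini stop list used only for this illustration.
toy_stop_words = {"for", "can", "not", "they", "i"}

tokens = ["thanks", "for", "lyft", "credit", "can", "not", "use"]

# Membership tests against a set take constant time on average.
filtered = [t for t in tokens if t not in toy_stop_words]

print(filtered)  # ['thanks', 'lyft', 'credit', 'use']
```

If negation turns out to matter for the classifier, one option is to remove words such as "not" from the stop list before filtering.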


In [36]:
# Apply stemming to the tokens with nltk.stem.PorterStemmer. Create the column 'tweet_stemmed' with the stemmed tokens.
stemmer = nltk.stem.PorterStemmer()
df['tweet_stemmed'] = df.tweet_token_filtered.apply(lambda x: [stemmer.stem(w) for w in x])
df.head()

Unnamed: 0,id,label,tweet,tweet2,tweet_token,tweet_token_filtered,tweet_stemmed
0,1,0.0,@user when a father is dysfunctional and is s...,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...","[father, dysfunct, selfish, drag, kid, dysfunc..."
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...","[thank, lyft, credit, use, caus, offer, wheelc..."
2,3,0.0,bihday your majesty,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]"
3,4,0.0,#model i love u take with u all the time in ...,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,5,0.0,factsguide: society now #motivation,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]"
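
As the output shows, the Porter stemmer strips suffixes by rule, so a stem is not always a dictionary word. The sketch below reproduces a few of the stems from the table on a plain list. It assumes nltk is installed; the stemmer itself needs no extra downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the filtered tokens shown in the table.
words = ["majesty", "society", "motivation", "kids"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # ['majesti', 'societi', 'motiv', 'kid']
```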


In [40]:
# Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer. Create the column 'tweet_lemmatized' with the lemmatized tokens.
nltk.download('omw-1.4')
nltk.download('wordnet')

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
df['tweet_lemmatized'] = df.tweet_token_filtered.apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
df.head()

[nltk_data] Downloading package omw-1.4 to /home/eugene/nltk_data...
[nltk_data] Downloading package wordnet to /home/eugene/nltk_data...


Unnamed: 0,id,label,tweet,tweet2,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0.0,@user when a father is dysfunctional and is s...,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...","[father, dysfunct, selfish, drag, kid, dysfunc...","[father, dysfunctional, selfish, drag, kid, dy..."
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit cannot use cause they d...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...","[thank, lyft, credit, use, caus, offer, wheelc...","[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0.0,bihday your majesty,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]","[bihday, majesty]"
3,4,0.0,#model i love u take with u all the time in ...,model love you take with you all the time in ur,"[model, love, you, take, with, you, all, the, ...","[model, love, take, time, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,5,0.0,factsguide: society now #motivation,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]","[factsguide, society, motivation]"


In [41]:
# Save the preprocessing result to a pickle file.
df.to_pickle(PIKLE_PATH)
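
A pickle file suits this frame because the token columns hold Python lists, and pickling restores them as lists. A minimal round-trip sketch on a toy frame, written to a temporary directory (the file name is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, similar in shape to 'tweet_token'.
toy_df = pd.DataFrame({"tweet_token": [["bihday", "majesty"], ["model", "love"]]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.pkl")  # hypothetical file name
    toy_df.to_pickle(toy_path)
    restored = pd.read_pickle(toy_path)

print(type(restored.loc[0, "tweet_token"]))  # <class 'list'>
```

Saving the same column to CSV would store each list as its string representation, which would then need to be parsed back.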