## Before tackling any text-related task, the texts themselves need to be preprocessed.

### NLTK (Natural Language Toolkit) is a convenient library for working with text.

In [0]:
!pip install --upgrade nltk 

In [0]:
import nltk

In [0]:
# Uncommenting the call below should open the NLTK data downloader window
#nltk.download()

**Text tokenization**

In [0]:
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer from 'punkt_tab' instead

In [0]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack a dull boy, all work and no play"
print(word_tokenize(data))

In [0]:
print(sent_tokenize("I was going home when she rung. It was a surprise."))
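To make it clearer what a tokenizer does, here is a rough regex-based approximation. It is only a sketch and not how NLTK works internally: the helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the sentence splitter will wrongly split abbreviations such as "U.K.".

```python
import re

def simple_word_tokenize(text):
    # \w+ grabs runs of letters/digits/underscores,
    # [^\w\s] grabs every punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

tokens = simple_word_tokenize(
    "All work and no play makes jack a dull boy, all work and no play"
)
sents = simple_sent_tokenize("I was going home when she rung. It was a surprise.")
print(tokens)
print(sents)
```

The trained Punkt model behind `sent_tokenize` exists precisely because rules this simple break on abbreviations, initials, and decimal numbers.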

**Removing stop words: words that occur very often but carry little meaning. They can get in the way.**

In [0]:
nltk.download('stopwords')

In [0]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

In [0]:
len(stopWords)

In [0]:
res = [word for word in word_tokenize(data) if word not in stopWords]

In [0]:
# The token 'no' is gone
res
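Two details of this filter are easy to miss: the comparison is case-sensitive, and negations such as "no" are treated as stop words. The sketch below uses a tiny hypothetical stop list (in the notebook the real one comes from `stopwords.words('english')`) to show the effect of lowercasing before the lookup.

```python
# Hypothetical, tiny stop list used only for this illustration
stop_words = {"all", "and", "no", "a"}
tokens = ["All", "work", "and", "no", "play", "makes", "jack", "a", "dull", "boy"]

# Case-sensitive lookup: the capitalised 'All' survives
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token before the lookup to catch 'All' as well
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

For tasks where negation matters (sentiment analysis, for instance), it may be worth removing words like "no" and "not" from the stop list before filtering.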

**We can also apply stemming to each word, i.e. reduce it to its stem.**

In [0]:
from nltk.stem import PorterStemmer
words = ["game", "gaming", "gamed", "games", "compacted"]

In [0]:
ps = PorterStemmer()
list(map(ps.stem, words))
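For intuition, here is a deliberately naive suffix-stripping stemmer. It is not the Porter algorithm, only a sketch of the general idea; the function `naive_stem` and its suffix list are assumptions made for this example.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["game", "gaming", "gamed", "games", "compacted"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Note that "game" and "gaming" end up with different stems here. Real stemmers add further rewrite rules on top of suffix stripping so that related forms collapse to the same stem.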

**Lemmatization**

In [0]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [0]:
nltk.download('wordnet')

In [0]:
wnl = nltk.WordNetLemmatizer()
print(list(map(wnl.lemmatize, tokens)))

In [0]:
# If we specify the word's part of speech, the lemmatizer can use it
wnl.lemmatize('is', 'v')

**Part-of-speech tagging. NLTK can assign part-of-speech tags to the words of a sentence.**

In [0]:
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # name expected by recent NLTK releases

In [0]:
sentences = nltk.sent_tokenize(data)   
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

![POStags](POStags.png)

**Parsing**

In [0]:
from nltk.corpus import treebank
nltk.download('treebank')

In [0]:
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()
# A window with the parse tree should open now.

### Regular expressions

A comprehensive post (in Russian): https://habr.com/ru/post/349860/

**Let's look at a few common examples of using regular expressions.**

In [0]:
import re
# Regular expressions let you search, replace, and synthesize strings by pattern
# A couple of simple examples

In [0]:
word = 'supercalifragilisticexpialidocious'
re.findall('[aeiou]|super', word)

In [0]:
re.findall(r'\d+', 'There are some numbers: 49 and 432')

In [0]:
re.sub(r'[,.?!]', ' ', 'How, to? split. text!').split()

In [0]:
re.sub(r'[^A-Za-z]', ' ', 'I 123 can 45 play 67 football').split()
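One pitfall with "letters only" character classes: the range `A-z` is wider than `A-Za-z`, because in ASCII the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`. The input string below is made up to expose the difference.

```python
import re

text = "I 123 can_45 play^67 football"

# 'A-z' spans ASCII 65..122, so '_' and '^' count as "letters" and are kept
loose = re.sub(r"[^A-z]", " ", text).split()

# 'A-Za-z' covers only the two letter ranges
strict = re.sub(r"[^A-Za-z]", " ", text).split()

print(loose)
print(strict)
```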

![Regexp](Regexp.png)

In [0]:
#!pip install spacy

### spaCy
#### This is another fast library with ready-made NLP solutions.
#### It implements many of the things that NLTK also provides

**For example, NER (named entity recognition)**

In [0]:
#!python -m spacy download en_core_web_sm

In [0]:
import en_core_web_sm
import spacy
nlp = en_core_web_sm.load()

In [0]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

## 20 newsgroups is a dataset of roughly 18,000 newsgroup posts grouped into 20 topics.

In [0]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [0]:
list(newsgroups_train.target_names)

In [0]:
newsgroups_train.filenames.shape

In [0]:
newsgroups_train.target.shape

In [0]:
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

In [0]:
newsgroups_train.filenames.shape

In [0]:
newsgroups_train.data[0]

In [0]:
newsgroups_train.target[:10]

### Let's vectorize these texts with TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
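As a reminder of what TF-IDF computes, the sketch below reproduces scikit-learn's default weighting by hand on a toy corpus (the three sentences are made up for illustration). With default settings the weight is the raw term count times `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and each row is then L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, an assumption for this example
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# Raw term counts: rows are documents, columns are vocabulary terms
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (sklearn default)

manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize rows

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that appear in every document get the smallest idf, so they contribute least to the final vector.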

#### Some TfidfVectorizer parameters:
#### input : string {‘filename’, ‘file’, ‘content’}
#### lowercase : boolean, default True
#### preprocessor : callable or None (default)
#### tokenizer : callable or None (default)
#### stop_words : string {‘english’}, list, or None (default)
#### ngram_range : tuple (min_n, max_n)
#### max_df : float in range [0.0, 1.0] or int, default=1.0
#### min_df : float in range [0.0, 1.0] or int, default=1
#### max_features : int or None, default=None
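The effect of these parameters is easiest to see on a small corpus. The following sketch (toy sentences and the helper `vocab` are assumptions for this example) fits the vectorizer with different settings and compares the resulting vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, an assumption for this example
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

def vocab(**params):
    """Fit a TfidfVectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**params).fit(docs).vocabulary_)

default_vocab = vocab()
no_stop = vocab(stop_words="english")      # drops built-in English stop words
with_bigrams = vocab(ngram_range=(1, 2))   # unigrams plus bigrams
common_only = vocab(min_df=2)              # terms found in at least 2 documents
capped = vocab(max_features=3)             # keep the 3 most frequent terms

print(len(default_vocab), len(no_stop), len(with_bigrams))
print(common_only)
print(capped)
```

A float `min_df` or `max_df` is read as a share of documents, an integer as an absolute document count.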

In [0]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
vectorizer = TfidfVectorizer(lowercase=False)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
vectorizer = TfidfVectorizer(min_df=0.2)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
vectorizer = TfidfVectorizer(max_df=0.9)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
# This variant takes longer to run
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
vector = vectors.todense()[1]

In [0]:
vector[vector != 0]

In [0]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
wnl = nltk.WordNetLemmatizer()

In [0]:
def preproc1(text):
    return ' '.join([wnl.lemmatize(word) for word in word_tokenize(text.lower()) if word not in stopWords])

In [0]:
# Let's test it
st = "Oh, I think I ve landed Where there are miracles at work,  For the thirst and for the hunger Come the conference of birds"

In [0]:
preproc1(st)

In [0]:
%%time
vectorizer = TfidfVectorizer(preprocessor=preproc1)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [0]:
vectors.shape

**Let's compare the speed with the spaCy lemmatizer**

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
def preproc2(text):
    # Compare the token's text (not the Token object) against the stop word set
    return ' '.join([token.lemma_ for token in nlp(text.lower()) if token.text not in stopWords])

In [0]:
preproc2(st)

In [0]:
%%time
vectorizer = TfidfVectorizer(preprocessor=preproc2)
vectors = vectorizer.fit_transform(newsgroups_train.data)

In [0]:
vectors.shape

**As we can see, spaCy runs its whole pipeline even though we only need the lemma. That is why it is (much) slower.**

In [0]:
vectorizer = TfidfVectorizer(max_features=1500, preprocessor=preproc1)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

### We can look at the cosine similarity between the vectors

#### These vectors contain a lot of zeros, so by default they are stored as a sparse matrix to save memory

In [0]:
import numpy as np
from numpy.linalg import norm

In [0]:
type(vectors)

In [0]:
newsgroups_train.target[:10]

In [0]:
np.unique(newsgroups_train.target)

In [0]:
dense_vectors = vectors.todense()

In [0]:
dense_vectors.shape
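To get a feeling for how much the sparse format saves, the sketch below compares the memory taken by a sparse TF-IDF matrix with its dense copy. The synthetic corpus (one unique word plus one shared word per document) is an assumption chosen to make the matrix very sparse.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus: every document has one unique word and one shared word
docs = [f"word{i} common" for i in range(50)]
X = TfidfVectorizer().fit_transform(docs)

# Share of non-zero cells in the matrix
density = X.nnz / (X.shape[0] * X.shape[1])

# CSR storage: non-zero values plus two index arrays
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes

print(X.shape, X.nnz, round(density, 3))
print(sparse_bytes, dense_bytes)
```

On a real corpus with tens of thousands of features the gap is far larger, which is why calling `todense()` on the full matrix can exhaust memory.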

In [0]:
def cosine_sim(v1, v2):
    # v1, v2 (1 x dim)
    return np.array(v1 @ v2.T / norm(v1) / norm(v2))[0][0]

In [0]:
cosine_sim(dense_vectors[1], dense_vectors[1])

In [0]:
cosines = []
for i in range(10):
    cosines.append(cosine_sim(dense_vectors[0], dense_vectors[i]))

In [0]:
# For reference, the target labels of the first 10 documents: [1, 3, 2, 0, 2, 0, 2, 1, 2, 1]
cosines

**The closest vector turned out to be one from the same category**
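The same similarities can be computed without converting to a dense matrix. The sketch below, on a made-up toy corpus, uses `cosine_similarity` from scikit-learn directly on the sparse matrix. It also relies on the fact that `TfidfVectorizer` L2-normalizes rows by default, so the cosine similarity equals the plain dot product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, an assumption for this example
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix, unit-length rows

# Similarity of the first document to every document, computed on sparse input
sims = cosine_similarity(X[0], X).ravel()

# Rows already have unit length, so the dot product gives the same numbers
dots = (X[0] @ X.T).toarray().ravel()

print(sims)
print(np.allclose(sims, dots))
```

The third document shares no words with the first one, so its similarity is exactly zero.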

**We can also try to build a classifier on top of these vectors**

In [0]:
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.linear_model import SGDClassifier

In [0]:
svc = svm.SVC()

In [0]:
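# np.asarray turns the np.matrix into a plain ndarray, which recent scikit-learn versions require
X_train, X_test, y_train, y_test = train_test_split(np.asarray(dense_vectors), newsgroups_train.target, test_size=0.2)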

In [0]:
y_train.shape, y_test.shape

In [0]:
%%time
svc.fit(X_train, y_train)

In [0]:
from sklearn.metrics import accuracy_score

In [0]:
accuracy_score(y_test, svc.predict(X_test))

In [0]:
sgd = SGDClassifier()

In [0]:
sgd.fit(X_train, y_train)

In [0]:
accuracy_score(y_test, sgd.predict(X_test))
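A possible alternative setup, sketched under assumptions: scikit-learn's linear classifiers accept sparse input, so the vectorizer and the classifier can be chained in one pipeline without a dense copy. Fitting the pipeline on the training texts only also keeps test documents out of the IDF statistics. The tiny labelled corpus below is invented for illustration (0 = space, 1 = religion).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Invented toy data: label 0 = space, label 1 = religion
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the satellite orbit decayed",
    "astronauts returned from the space station",
    "the church discussed faith and prayer",
    "god and religion in modern society",
    "the bible and questions of belief",
    "prayer and faith in daily life",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer's sparse output is passed straight to the classifier
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
model.fit(train_docs, train_labels)

pred = model.predict(["the rocket reached orbit", "faith and prayer"])
print(pred)
```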

### Let's try classifying based on embeddings

In [0]:
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

In [0]:
import gensim.downloader as api
embeddings = api.load('glove-twitter-25')

In [0]:
def vectorize_sum(comment):
    """
    Convert a comment to the sum of its token vectors (tokens come from preproc1).
    """
    embedding_dim = embeddings.vectors.shape[1]
    features = np.zeros([embedding_dim], dtype='float32')

    # reuse our preproc1 from above
    words = preproc1(comment).split()
    for word in words:
        # skip out-of-vocabulary tokens
        if word in embeddings:
            features += embeddings[word]

    return features

In [0]:
vectorize_sum('I can swim')

In [0]:
X_wv = np.stack([vectorize_sum(text) for text in newsgroups_train.data])
X_train_wv, X_test_wv, y_train, y_test = train_test_split(X_wv, newsgroups_train.target, test_size=0.2)

In [0]:
X_train_wv.shape, X_test_wv.shape

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

wv_model = LogisticRegression().fit(X_train_wv, y_train)

In [0]:
accuracy_score(y_test, wv_model.predict(X_test_wv))
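Summing token vectors makes the magnitude of the result grow with document length. One common variation is to average instead. The sketch below illustrates this with a hypothetical two-dimensional embedding table (`toy_embeddings`) and a hypothetical helper `vectorize_mean`; with the real GloVe vectors the same idea would apply.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, used only for illustration
toy_embeddings = {
    "i": np.array([1.0, 0.0]),
    "can": np.array([0.0, 1.0]),
    "swim": np.array([1.0, 1.0]),
}

def vectorize_mean(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim, dtype="float32")
    return np.mean(vecs, axis=0).astype("float32")

short = vectorize_mean(["i", "can", "swim"], toy_embeddings, 2)
long = vectorize_mean(["i", "can", "swim"] * 10, toy_embeddings, 2)
unknown = vectorize_mean(["xyzzy"], toy_embeddings, 2)

print(short, long, unknown)
```

The short and the long document get the same vector, whereas a sum would differ by a factor of ten.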