# Text and natural language data analysis
**28.04.2020**
<br>

<div style="text-align: right">
<b>Patryk Pilarski</b><br>
1patryk.pilarski@gmail.com<br>
p.pilarski@sages.com.pl
</div>

**Text mining** (text analysis) - the set of activities aimed at extracting high-quality information from text. The process uses techniques from NLP (natural language processing), IR (information retrieval) and ML (machine learning) to obtain structured forms and patterns as well as meaningful conclusions. A typical process consists of the following stages: 
- data acquisition, 
- data preparation (normalization, structuring), 
- pattern extraction, 
- evaluation and interpretation. 
<br>

The most common tasks include document classification and clustering, text summarization, sentiment analysis and named entity recognition.

-----
## Strings

Unicode - a computer character set intended to cover all writing systems used in the world.
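A minimal sketch of what this means in Python 3, where a `str` is a sequence of Unicode code points. The `sample` string is an illustrative assumption and is not taken from the notebook.

```python
import unicodedata

# Illustrative string (an assumption, not part of the original notebook)
sample = "żółw 🐢"

# len() counts code points, not bytes
n_chars = len(sample)
# UTF-8 uses 1 to 4 bytes per code point, so the encoded form is longer
n_bytes = len(sample.encode("utf-8"))
print(n_chars, n_bytes)

# Every character has a numeric code point and an official Unicode name
for ch in sample:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The difference between the two lengths is why slicing and iteration on a `str` work per character, while file sizes and network payloads are measured in encoded bytes.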

In [4]:
text = "🐢² = żółw • żółw"

In [5]:
for x in text:
    print(x)

🐢
²
 
=
 
ż
ó
ł
w
 
•
 
ż
ó
ł
w


In [6]:
text[0]

'🐢'

In [7]:
text[:2]

'🐢²'

In [None]:
text.split()

In [None]:
text.split("=")

In [None]:
[x.strip() for x in text.split("=")]

In [None]:
text.split(" = ")

In [None]:
text.upper()

In [None]:
text.strip("w")

In [None]:
"!!!??tu!?nie??!!".strip("?!")

In [None]:
text.replace("²", "^2")

In [None]:
lst = ["kto", "to", "taki"]

In [None]:
" ".join(lst)

**EXERCISE**
> Reverse the order of characters in the string `text`

----
## Regex

**Regex** (regular expressions) - patterns that describe strings of symbols and make it possible to find or modify text matching a defined pattern. They are widely used, e.g. for extracting data such as email addresses, postal codes and phone numbers from text.<br> https://regex101.com/
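As a rough illustration of the extraction use case, the sketch below pulls an email address and a Polish-style postal code out of a made-up string. Both patterns and the `contact` text are simplified assumptions for demonstration; real-world email validation in particular is considerably more involved.

```python
import re

# Made-up contact line (an assumption for illustration only)
contact = "Contact: jan.kowalski@example.com, office at 00-950 Warszawa, tel. 22 123 45 67."

# Simplified email pattern: local part, "@", then a domain with at least one dot
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
# Postal code in the NN-NNN format, delimited by word boundaries
POSTAL_PATTERN = r"\b\d{2}-\d{3}\b"

emails = re.findall(EMAIL_PATTERN, contact)
postal_codes = re.findall(POSTAL_PATTERN, contact)

print(emails)
print(postal_codes)
```

Writing patterns as raw strings (`r"..."`) keeps backslash sequences such as `\d` and `\b` from being interpreted by Python before they reach the regex engine.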

In [8]:
import re

In [9]:
re.findall("w", text)

['w', 'w']

In [10]:
re.findall("[a-z]+", text)

['w', 'w']

In [11]:
re.findall("ł", text)

['ł', 'ł']

In [16]:
re.findall(r"\w+", text)

['²', 'żółw', 'żółw']

**EXERCISE**
> 1. Extract all numbers from the text `txt`<br>
> 2. Extract all words starting with a capital letter from the text `txt`<br>

In [17]:
txt = \
"""
Monty Python (also collectively known as the Pythons) were a British surreal comedy troupe who created the 
sketch comedy television show Monty Python's Flying Circus, which first aired on the BBC in 1969. 
Forty-five episodes were made over four series. The Python phenomenon developed from the television series 
into something larger in scope and impact, including touring stage shows, films, numerous albums, several books 
and musicals. The Pythons' influence on comedy has been compared to the Beatles' influence on music. 
Regarded as an enduring icon of 1970s pop culture, their sketch show has been referred to as being 
“an important moment in the evolution of television comedy".
Broadcast by the BBC between 1969 and 1974, Monty Python's Flying Circus was conceived, written and performed 
by its members Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones, and Michael Palin. 
Loosely structured as a sketch show, but with an innovative stream-of-consciousness approach aided by 
Gilliam's animation, it pushed the boundaries of what was acceptable in style and content. 
A self-contained comedy team responsible for both writing and performing their work, the Pythons had creative 
control which allowed them to experiment with form and content, discarding rules of television comedy. 
Following their television work, they began making films, including Monty Python and the Holy Grail (1975), 
Life of Brian (1979) and The Meaning of Life (1983). Their influence on British comedy has been apparent 
for years, while in North America, it has coloured the work of cult performers from the early editions 
of Saturday Night Live through to more recent absurdist trends in television comedy. 
"Pythonesque" has entered the English lexicon as a result.
"""

In [29]:
# Two equivalent ways to match runs of digits in this text; only the last result is displayed
re.findall(r"\d+", txt)
re.findall(r"[0-9]+", txt)

['1969', '1970', '1969', '1974', '1975', '1979', '1983']

In [33]:
re.findall('[A-Z][a-z]+',txt)

['Monty',
 'Python',
 'Pythons',
 'British',
 'Monty',
 'Python',
 'Flying',
 'Circus',
 'Forty',
 'The',
 'Python',
 'The',
 'Pythons',
 'Beatles',
 'Regarded',
 'Broadcast',
 'Monty',
 'Python',
 'Flying',
 'Circus',
 'Graham',
 'Chapman',
 'John',
 'Cleese',
 'Terry',
 'Gilliam',
 'Eric',
 'Idle',
 'Terry',
 'Jones',
 'Michael',
 'Palin',
 'Loosely',
 'Gilliam',
 'Pythons',
 'Following',
 'Monty',
 'Python',
 'Holy',
 'Grail',
 'Life',
 'Brian',
 'The',
 'Meaning',
 'Life',
 'Their',
 'British',
 'North',
 'America',
 'Saturday',
 'Night',
 'Live',
 'Pythonesque',
 'English']
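Note that `[A-Z][a-z]+` requires at least one lowercase letter after the capital, so all-caps words such as `BBC` and the single-letter word `A` are absent from the result above. A possible alternative, shown on a short made-up sentence (an assumption, not the notebook's `txt`), is to anchor on a word boundary and accept any word characters after the capital letter:

```python
import re

# Short made-up sentence (an assumption for illustration)
sentence = "A show aired on the BBC in 1969, said Monty."

# Pattern used in the solution above: a capital followed by one or more lowercase letters
strict = re.findall(r"[A-Z][a-z]+", sentence)

# Alternative: a capital at a word boundary followed by zero or more word characters
relaxed = re.findall(r"\b[A-Z]\w*", sentence)

print(strict)
print(relaxed)
```

Which variant is preferable depends on whether acronyms and one-letter words should count as "words starting with a capital letter" for the task at hand.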

----
## Loading data

[Raw data](https://snap.stanford.edu/data/web-FineFoods.html)

**PLAIN TEXT FILES**

In [34]:
# open
with open("finefoods_sample.csv", "r") as f:
    foods = f.readlines()

In [35]:
foods

['score,summary,review\n',
 '1.0,Didn\'t Grow. Went Moldy.,"I\'m not sure what I did wrong... but I watered it once and followed the directions and waited for it to grow and nothing happened. It simply got fuzzy, and didn\'t sprout.  After reading the reviews (I wish I had ahead of my purchase) I see it turned moldy for other people as well."\n',
 '1.0,"Great food, but purchase from local store for A LOT less","As 2 others have stated you can get the 32 packs from Wal-Mart for about $12-13. You also do have the variety of choosing different flavors and different textures like the pate or shreds. My older and younger cats would rather eat the shredded or meat bits then the mashed pate thing, and I think they don\'t have such a strong smell as the pate does either!  If you do not mind paying double then go for it, but I am always searching for the lowest price and knowing that Amazon is usually cheaper thought I could find them on here, but that is obviously not true in this case! Better

In [38]:
foods[0]

'score,summary,review\n'

In [39]:
with open("finefoods_sample.csv", "r") as f:
    foods = f.read()

In [42]:
foods[:5000]

'score,summary,review\n1.0,Didn\'t Grow. Went Moldy.,"I\'m not sure what I did wrong... but I watered it once and followed the directions and waited for it to grow and nothing happened. It simply got fuzzy, and didn\'t sprout.  After reading the reviews (I wish I had ahead of my purchase) I see it turned moldy for other people as well."\n1.0,"Great food, but purchase from local store for A LOT less","As 2 others have stated you can get the 32 packs from Wal-Mart for about $12-13. You also do have the variety of choosing different flavors and different textures like the pate or shreds. My older and younger cats would rather eat the shredded or meat bits then the mashed pate thing, and I think they don\'t have such a strong smell as the pate does either!  If you do not mind paying double then go for it, but I am always searching for the lowest price and knowing that Amazon is usually cheaper thought I could find them on here, but that is obviously not true in this case! Better just to bu
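Reading the file as plain text leaves the CSV structure unparsed. The sketch below, using a small in-memory sample modelled on the file's layout (the exact row is an assumption), shows why a naive `split(",")` is not enough when fields are quoted and contain commas, and how the standard-library `csv` module handles it.

```python
import csv
import io

# In-memory sample with the same header as the file; the data row is made up
raw = 'score,summary,review\n1.0,"Great food, but pricey","Cats like it, mostly."\n'

# Naive approach: splitting on commas also splits inside quoted fields
naive = raw.splitlines()[1].split(",")

# csv.reader respects the quoting and returns exactly three fields per row
rows = list(csv.reader(io.StringIO(raw)))

print(len(naive), naive)
print(len(rows[1]), rows[1])
```

This is the same problem that `pd.read_csv` solves in the next part of the notebook.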

**CSV** 

In [43]:
import pandas as pd

In [44]:
df = pd.read_csv("finefoods_sample.csv")

In [45]:
df.head()

| | score | summary | review |
|---|---|---|---|
| 0 | 1.0 | Didn't Grow. Went Moldy. | I'm not sure what I did wrong... but I watered... |
| 1 | 1.0 | Great food, but purchase from local store for ... | As 2 others have stated you can get the 32 pac... |
| 2 | 1.0 | Product is fine, dodgy seller? | The Lavazza Gran Crema pods are fine if you're... |
| 3 | 5.0 | A sweet, healthy treat | Low sugar, low fat, a measured amount, and swe... |
| 4 | 1.0 | NOT BERRY... or CHERRY 90% APPLE | AVOID unless you want apple...( product held 9... |


In [46]:
summary_raw = df["summary"].values.tolist()

In [47]:
summary_raw[:5]

["Didn't Grow. Went Moldy.",
 'Great food, but purchase from local store for A LOT less',
 'Product is fine, dodgy seller?',
 'A sweet, healthy treat',
 'NOT BERRY... or CHERRY  90% APPLE']

In [48]:
review_raw = df["review"].values.tolist()

In [49]:
review_raw[:5]

["I'm not sure what I did wrong... but I watered it once and followed the directions and waited for it to grow and nothing happened. It simply got fuzzy, and didn't sprout.  After reading the reviews (I wish I had ahead of my purchase) I see it turned moldy for other people as well.",
 "As 2 others have stated you can get the 32 packs from Wal-Mart for about $12-13. You also do have the variety of choosing different flavors and different textures like the pate or shreds. My older and younger cats would rather eat the shredded or meat bits then the mashed pate thing, and I think they don't have such a strong smell as the pate does either!  If you do not mind paying double then go for it, but I am always searching for the lowest price and knowing that Amazon is usually cheaper thought I could find them on here, but that is obviously not true in this case! Better just to buy some when you're at the store. At the store it is about .42 per can and at what I have been seeing on here it is ab

In [50]:
score = df["score"].values.tolist()
score[:5]

[1.0, 1.0, 1.0, 5.0, 1.0]

----
## Text tokenization

**Tokenization** - the process of splitting text into smaller components. The most common forms are splitting text into sentences and splitting it into words. The resulting pieces are called tokens.
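Sentence-level tokenization can be sketched with a single regular expression that splits on whitespace following `.`, `!` or `?`. This is a naive approach shown under the assumption of clean, well-punctuated text; the sample `paragraph` is made up for illustration.

```python
import re

# Made-up paragraph (an assumption for illustration)
paragraph = ("It simply got fuzzy, and didn't sprout.  "
             "After reading the reviews I see it turned moldy. Too bad!")

# Split on whitespace that directly follows sentence-ending punctuation
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

for s in sentences:
    print(s)
```

The rule is easy to fool: an ellipsis in the middle of a sentence (as in `wrong... but`) or an abbreviation such as `e.g.` would produce a spurious split, which is why dedicated sentence tokenizers exist.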

In [51]:
example_sentence = review_raw[0]
example_sentence

"I'm not sure what I did wrong... but I watered it once and followed the directions and waited for it to grow and nothing happened. It simply got fuzzy, and didn't sprout.  After reading the reviews (I wish I had ahead of my purchase) I see it turned moldy for other people as well."

#### split

In [52]:
word_tokens = example_sentence.split()

In [53]:
print("Number of tokens: {}".format(len(word_tokens)))
print("Tokens:")
print(word_tokens)

Number of tokens: 54
Tokens:
["I'm", 'not', 'sure', 'what', 'I', 'did', 'wrong...', 'but', 'I', 'watered', 'it', 'once', 'and', 'followed', 'the', 'directions', 'and', 'waited', 'for', 'it', 'to', 'grow', 'and', 'nothing', 'happened.', 'It', 'simply', 'got', 'fuzzy,', 'and', "didn't", 'sprout.', 'After', 'reading', 'the', 'reviews', '(I', 'wish', 'I', 'had', 'ahead', 'of', 'my', 'purchase)', 'I', 'see', 'it', 'turned', 'moldy', 'for', 'other', 'people', 'as', 'well.']


#### re.findall

In [54]:
word_tokens = re.findall("[a-zA-Z']+", example_sentence)

In [55]:
print("Number of tokens: {}".format(len(word_tokens)))
print("Tokens:")
print(word_tokens)

Number of tokens: 54
Tokens:
["I'm", 'not', 'sure', 'what', 'I', 'did', 'wrong', 'but', 'I', 'watered', 'it', 'once', 'and', 'followed', 'the', 'directions', 'and', 'waited', 'for', 'it', 'to', 'grow', 'and', 'nothing', 'happened', 'It', 'simply', 'got', 'fuzzy', 'and', "didn't", 'sprout', 'After', 'reading', 'the', 'reviews', 'I', 'wish', 'I', 'had', 'ahead', 'of', 'my', 'purchase', 'I', 'see', 'it', 'turned', 'moldy', 'for', 'other', 'people', 'as', 'well']


**EXERCISE**
> 1. Using re.findall, split all summaries (`summary_raw`) into tokens - the words should be lowercase; save the result as `summary_tokens`
> 2. Do the same for the reviews (`review_raw`); save the result as `review_tokens`

In [63]:
summary_tokens = [re.findall("[a-zA-Z']+", x.lower()) for x in summary_raw]

In [60]:
review_tokens = [re.findall("[a-zA-Z']+", x.lower()) for x in review_raw]

----
## Removing stopwords

**Stopwords** - a list of low-importance words that are discarded to reduce the dimensionality of the data.

In [58]:
from collections import Counter

In [61]:
words = Counter([x for y in review_tokens for x in y])

In [62]:
print(f"number of unique words: {len(words)}")

number of unique words: 18813


In [64]:
print("10 most common words:")
for x in words.most_common(10):
    print(x)

10 most common words:
('the', 26083)
('i', 21947)
('and', 17426)
('a', 14916)
('to', 13922)
('it', 13162)
('of', 10771)
('this', 10060)
('is', 9495)
('in', 7513)


In [66]:
from stopwords import STOPWORDS

In [67]:
STOPWORDS

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'm",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'o

In [68]:
example = review_tokens[0]
example

["i'm",
 'not',
 'sure',
 'what',
 'i',
 'did',
 'wrong',
 'but',
 'i',
 'watered',
 'it',
 'once',
 'and',
 'followed',
 'the',
 'directions',
 'and',
 'waited',
 'for',
 'it',
 'to',
 'grow',
 'and',
 'nothing',
 'happened',
 'it',
 'simply',
 'got',
 'fuzzy',
 'and',
 "didn't",
 'sprout',
 'after',
 'reading',
 'the',
 'reviews',
 'i',
 'wish',
 'i',
 'had',
 'ahead',
 'of',
 'my',
 'purchase',
 'i',
 'see',
 'it',
 'turned',
 'moldy',
 'for',
 'other',
 'people',
 'as',
 'well']

In [69]:
[x for x in example if x not in STOPWORDS]

['sure',
 'wrong',
 'watered',
 'followed',
 'directions',
 'waited',
 'grow',
 'nothing',
 'happened',
 'simply',
 'got',
 'fuzzy',
 'sprout',
 'reading',
 'reviews',
 'wish',
 'ahead',
 'purchase',
 'see',
 'turned',
 'moldy',
 'people',
 'well']

In [70]:
def rm_stopwords(tokens, stopwords=STOPWORDS):
    return [x for x in tokens if x not in stopwords]

In [71]:
rm_stopwords(example)

['sure',
 'wrong',
 'watered',
 'followed',
 'directions',
 'waited',
 'grow',
 'nothing',
 'happened',
 'simply',
 'got',
 'fuzzy',
 'sprout',
 'reading',
 'reviews',
 'wish',
 'ahead',
 'purchase',
 'see',
 'turned',
 'moldy',
 'people',
 'well']
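One caveat worth keeping in mind: the stopword list shown above contains negations such as `not`, `no` and `nor`, and dropping them can change the meaning of a review. The sketch below uses a tiny hypothetical stopword set (`SAMPLE_STOPWORDS`, an assumption standing in for the notebook's `STOPWORDS`) and a hypothetical helper that keeps negations.

```python
# Tiny illustrative stopword set (an assumption; the notebook's STOPWORDS is much larger)
SAMPLE_STOPWORDS = {"the", "is", "a", "this", "not", "it", "and"}
# Words we may want to keep for sentiment-related tasks
NEGATIONS = {"not", "no", "nor"}

def rm_stopwords_keep_negations(tokens, stopwords=SAMPLE_STOPWORDS, keep=NEGATIONS):
    """Remove stopwords but leave the tokens listed in `keep` untouched."""
    effective = stopwords - keep
    return [t for t in tokens if t not in effective]

tokens = ["this", "is", "not", "a", "good", "product"]

plain = [t for t in tokens if t not in SAMPLE_STOPWORDS]
kept = rm_stopwords_keep_negations(tokens)

print(plain)
print(kept)
```

Whether to keep negations is a modelling decision; for a bag-of-words classifier the isolated token `not` carries limited information, but removing it entirely makes "not good" indistinguishable from "good".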

**EXERCISE**
> 1. Using rm_stopwords, remove stopwords from all summaries (`summary_tokens`); save the result as `summary_nostop`
> 2. Do the same for the reviews (`review_tokens`); save the result as `review_nostop`

-----
## Stemming

**Stemming** - the process of stripping the inflectional ending from a word so that only its stem remains. The stem does not have to be a valid word. The most popular stemmer is the Porter stemmer.

In [None]:
to_stem = ["working", "worked", "works", "rune", "ran", "running", "lying"]

#### Porter stemmer

In [None]:
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

In [None]:
for x in to_stem:
    print(ps.stem(x))

#### Regexp stemmer

In [None]:
from nltk.stem import RegexpStemmer

In [None]:
rs = RegexpStemmer("ing$|s$|ed$", 4)

In [None]:
for x in to_stem:
    print(rs.stem(x))
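The idea behind a regex-based stemmer can be approximated with the standard `re` module alone. The function below is a rough stand-in written for illustration, not NLTK's implementation: it strips the same suffixes and, like the second argument passed to `RegexpStemmer` above, leaves words shorter than a minimum length untouched. The name `simple_regex_stem` is hypothetical.

```python
import re

SUFFIXES = r"ing$|s$|ed$"

def simple_regex_stem(word, pattern=SUFFIXES, min_length=4):
    """Strip a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

to_stem = ["working", "worked", "works", "rune", "ran", "running", "lying"]
stems = [simple_regex_stem(w) for w in to_stem]

for word, stem_ in zip(to_stem, stems):
    print(word, "->", stem_)
```

The results for `running` and `lying` illustrate the point made in the definition: a stem produced by simple suffix stripping does not have to be a real word.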

In [None]:
def stem(tokens, stemmer=ps):
    return [stemmer.stem(x) for x in tokens]

**EXERCISE**
> 1. Using stem, apply stemming to all summaries (`summary_nostop`); save the result as `summary`
> 2. Do the same for the reviews (`review_nostop`); save the result as `review`

----

In [None]:
words = Counter([x for y in review for x in y])

In [None]:
print(f"number of unique words: {len(words)}")

In [None]:
print("10 most common words:")
for x in words.most_common(10):
    print(x)

-----
## Document-term matrix

**Document-term matrix / term-document matrix** - a matrix describing how often elements (tokens) occur in a collection of documents. In a DTM rows correspond to documents and columns to tokens, while in a TDM it is the other way round.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

In [None]:
to_dtm = summary[:5]
to_dtm

### BoW

**Bag of Words** - a popular text representation that discards information about structure and word (token) order but keeps information about how often each token occurs.
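Before turning to scikit-learn, the representation can be sketched by hand with `collections.Counter`. The two tokenized documents below are made up for illustration; the sketch builds a sorted vocabulary, a document-term matrix of raw counts, and its transpose (the term-document matrix).

```python
from collections import Counter

# Two made-up, already tokenized documents (an assumption for illustration)
docs = [["great", "food", "great", "price"],
        ["food", "went", "moldy"]]

# Columns of the DTM: every distinct token, in a fixed (sorted) order
vocab = sorted({token for doc in docs for token in doc})

# Rows of the DTM: per-document counts for each vocabulary entry
dtm = [[Counter(doc)[token] for token in vocab] for doc in docs]

# The TDM is simply the transpose: one row per token, one column per document
tdm = [list(column) for column in zip(*dtm)]

print(vocab)
print(dtm)
print(tdm)
```

Word order is lost in the process: any permutation of the tokens in a document yields the same row.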

In [None]:
def identity(x):
    return x

In [None]:
bow_vectorizer = CountVectorizer(preprocessor=identity, tokenizer=identity)
bow_dtm = bow_vectorizer.fit_transform(to_dtm)
bow_dtm.todense()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(bow_dtm.todense(), columns=bow_vectorizer.get_feature_names_out())

### tfidf

**tfidf** (term frequency - inverse document frequency) - a commonly used method of weighting words based on the number of their occurrences in a document and in the whole corpus of documents.

**tf** - term frequency -> BoW <br>
**idf** - inverse document frequency -> $idf(t) = 1 + \log\left(\frac{1 + n_{doc}}{1 + df(t)}\right)$, where $n_{doc}$ is the number of documents in the corpus and $df(t)$ is the number of documents containing the token $t$.
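The formula can be checked by hand on a tiny corpus. The sketch below applies the idf definition above with the natural logarithm, multiplies it by raw term counts, and then scales each row to unit Euclidean length, which is what `TfidfVectorizer` does with its default settings as far as I am aware. The documents are made up and the helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Two made-up, already tokenized documents (an assumption for illustration)
docs = [["great", "food", "great"],
        ["food", "went", "moldy"]]

vocab = sorted({token for doc in docs for token in doc})
n_doc = len(docs)

# df(t): number of documents that contain token t
df = {t: sum(1 for doc in docs if t in doc) for t in vocab}

# idf(t) = 1 + log((1 + n_doc) / (1 + df(t)))
idf = {t: 1 + math.log((1 + n_doc) / (1 + df[t])) for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized tf * idf weights of one document."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

rows = [tfidf_row(doc) for doc in docs]

print(idf)
print(rows)
```

A token that appears in every document (`food`) gets the minimum idf of 1, while rarer tokens are weighted up.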

In [None]:
tfidf_vectorizer = TfidfVectorizer(preprocessor=identity, tokenizer=identity)
tfidf_dtm = tfidf_vectorizer.fit_transform(to_dtm)
tfidf_dtm.todense()

In [None]:
pd.DataFrame(tfidf_dtm.todense(), columns=tfidf_vectorizer.get_feature_names_out())

**EXERCISE**
> Based on `dtm_corpus`, build a bag-of-words DTM and display it as a DataFrame

In [None]:
dtm_corpus = summary_raw[:5]
dtm_corpus

-----
## Classification

🍎 -> apple  
🍐 -> not-apple  
🐢 -> not-apple  
🍎 -> apple

### Splitting into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
len(score)

In [None]:
train_review, test_review, train_labels, test_labels = train_test_split(review, score, 
                                                                        test_size=0.25, random_state=42)

In [None]:
len(train_review), len(test_review)

In [None]:
len(train_labels), len(test_labels)

### Preparing the training set

In [None]:
c_vect = CountVectorizer(preprocessor=identity, tokenizer=identity)
train_bow = c_vect.fit_transform(train_review)

In [None]:
pd.DataFrame(train_bow.todense(), columns=c_vect.get_feature_names_out()).head()

In [None]:
t_vect = TfidfVectorizer(preprocessor=identity, tokenizer=identity)
train_tfidf = t_vect.fit_transform(train_review)

In [None]:
pd.DataFrame(train_tfidf.todense(), columns=t_vect.get_feature_names_out()).head()

### Preparing the test set

In [None]:
test_bow = c_vect.transform(test_review)
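The test set is passed through `transform` rather than `fit_transform`, so the columns stay identical to those learned on the training data. A hand-rolled sketch of that idea, on made-up documents and with a hypothetical helper `to_counts`, looks like this:

```python
from collections import Counter

# Made-up tokenized documents (an assumption for illustration)
train_docs = [["good", "tea"], ["bad", "coffee"]]
test_docs = [["good", "coffee", "cake"]]

# "fit": the vocabulary is learned from the training documents only
vocab = sorted({token for doc in train_docs for token in doc})

def to_counts(doc, vocab):
    """'transform': count only the tokens that are already in the vocabulary."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]

train_rows = [to_counts(doc, vocab) for doc in train_docs]
test_rows = [to_counts(doc, vocab) for doc in test_docs]

print(vocab)
print(test_rows)
```

The token `cake` never occurred in training, so it has no column and is silently dropped. Refitting on the test set would instead produce a matrix with different columns that the trained model could not interpret.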

In [None]:
pd.DataFrame(test_bow.todense(), columns=c_vect.get_feature_names_out()).head()

In [None]:
test_tfidf = t_vect.transform(test_review)

In [None]:
pd.DataFrame(test_tfidf.todense(), columns=t_vect.get_feature_names_out()).head()

### Building the model

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
nb = MultinomialNB()

In [None]:
nb.fit(train_bow, train_labels)

In [None]:
preds = nb.predict(test_bow)

In [None]:
preds
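To make the model less of a black box, here is a compact sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It follows the textbook formulation rather than scikit-learn's code; the function names `train_nb` and `predict_nb` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({token for doc in docs for token in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs_in_class in class_counts.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs_in_class / len(docs))
        # P(token | class) with additive smoothing so unseen pairs are never zero
        log_lik = {t: math.log((token_counts[label][t] + alpha)
                               / (total + alpha * len(vocab)))
                   for t in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown tokens are ignored."""
    scores = {label: log_prior + sum(log_lik[t] for t in doc if t in log_lik)
              for label, (log_prior, log_lik) in model.items()}
    return max(scores, key=scores.get)

# Toy training data: tokenized reviews with scores as labels
toy_docs = [["great", "tasty"], ["tasty", "love"],
            ["moldy", "awful"], ["awful", "stale"]]
toy_labels = [5.0, 5.0, 1.0, 1.0]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["tasty", "great"]))
print(predict_nb(toy_model, ["awful", "moldy"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.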

### Evaluating the model

In [None]:
from sklearn import metrics

In [None]:
def print_metrics(true_labels, predicted_labels):
    # Precision, recall and F1 use scikit-learn's defaults here (average="binary", pos_label=1),
    # so they assume exactly two classes and describe the class labelled 1
    print("Accuracy:", np.round(metrics.accuracy_score(true_labels, predicted_labels), 2))
    print("Precision:", np.round(metrics.precision_score(true_labels, predicted_labels), 2))
    print("Recall:", np.round(metrics.recall_score(true_labels, predicted_labels), 2))
    print("F1:", np.round(metrics.f1_score(true_labels, predicted_labels), 2))

In [None]:
print_metrics(test_labels, preds)
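The four measures can be reproduced by hand from counts of true positives, false positives and false negatives. The sketch below uses a hypothetical helper `binary_metrics` with an explicit `pos_label` and made-up label lists; it is meant to show the definitions, not to replace `sklearn.metrics`.

```python
def binary_metrics(true_labels, predicted_labels, pos_label):
    """Compute accuracy, precision, recall and F1 for one chosen positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels (an assumption for illustration)
true = [1.0, 1.0, 5.0, 5.0, 5.0, 1.0]
pred = [1.0, 5.0, 5.0, 5.0, 1.0, 1.0]

result = binary_metrics(true, pred, pos_label=1.0)
print(result)
```

Accuracy treats both classes alike, whereas precision, recall and F1 depend on which class is treated as positive, so it is worth stating that choice explicitly when reporting results.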

**EXERCISE**
> Train a MultinomialNB model using the tfidf representation and display the model quality metrics

### Another model

In [None]:
from sklearn.svm import SVC

In [None]:
svm = SVC()
svm.fit(train_bow, train_labels)

In [None]:
p = svm.predict(test_bow)

In [None]:
print_metrics(test_labels, p)

**EXERCISE**
> Train an SVC model using the tfidf representation and display the model quality metrics