# Data preparation - notebook

##### Task 1:
Data collection and preparation: download product review data from an open data source (e.g. Kaggle or the Amazon API) and preprocess it (removing noise from the text, tokenization, etc.).

##### Dataset link: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download

## Importing the required libraries

In [1]:
import os
import re

import numpy as np
import pandas as pd

import nltk

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

In [30]:
!pip install nbformat
%run 'common.ipynb'



[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/vani/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /home/vani/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/vani/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /home/vani/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/vani/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package ave

## Loading the datasets and inspecting the data

In [3]:
# Build the file paths
train_file = 'train.csv'
test_file = 'test.csv'

path = input('Path for the files: ')

train_file_path = (os.path.join(path, train_file)).replace('\\', '/')
test_file_path = (os.path.join(path, test_file)).replace('\\', '/')
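As an alternative to replacing backslashes by hand, the same forward-slash paths could be built with `pathlib`. This is a minimal sketch, not what the notebook runs; `build_data_paths` and the example directory are hypothetical names chosen for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def build_data_paths(base_dir: str, windows: bool = False) -> tuple:
    """Return forward-slash paths to train.csv and test.csv under base_dir."""
    # PureWindowsPath understands backslash separators; as_posix() emits '/'.
    flavour = PureWindowsPath if windows else PurePosixPath
    base = flavour(base_dir)
    return (base / "train.csv").as_posix(), (base / "test.csv").as_posix()


# Hypothetical Windows-style input, as a user might type it at the prompt.
train_path, test_path = build_data_paths(r"C:\data\amazon", windows=True)
print(train_path)  # C:/data/amazon/train.csv
```

In an interactive session, `pathlib.Path(path) / 'train.csv'` would usually be enough, since `pd.read_csv` accepts path objects directly.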

### Operations on the training dataset

In [9]:
# Read the data into a DataFrame (the CSV has no header row, so column names are supplied)
train_data = pd.read_csv(train_file_path, names=['sentiment', 'title', 'text'])

In [None]:
train_data.shape

(3600000, 3)

In [11]:
# Reduce the data to the first 1,000,000 rows
reduce_rows = 1000000
train_data = train_data.head(reduce_rows)

In [12]:
train_data.head(10)

Unnamed: 0,sentiment,title,text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
5,2,an absolute masterpiece,I am quite sure any of you actually taking the...
6,1,Buyer beware,"This is a self-published book, and if you want..."
7,2,Glorious story,I loved Whisper of the wicked saints. The stor...
8,2,A FIVE STAR BOOK,I just finished reading Whisper of the Wicked ...
9,2,Whispers of the Wicked Saints,This was a easy to read book that made me want...


In [13]:
train_data.shape

(1000000, 3)

In [14]:
train_data.sentiment.unique().tolist()

[2, 1]

In [15]:
train_data.sentiment.value_counts()

sentiment
2    505678
1    494322
Name: count, dtype: int64

### Operations on the test dataset

In [16]:
test_data = pd.read_csv(test_file_path, names=['sentiment', 'title', 'text'])

In [17]:
test_data.shape

(400000, 3)

In [18]:
# Reduce the data to the first 150,000 rows
reduce_rows = 150000
test_data = test_data.head(reduce_rows)

In [19]:
test_data.head(10)

Unnamed: 0,sentiment,title,text
0,2,Great CD,My lovely Pat has one of the GREAT voices of h...
1,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
2,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
3,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
4,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
5,1,DVD Player crapped out after one year,I also began having the incorrect disc problem...
6,1,Incorrect Disc,"I love the style of this, but after a couple y..."
7,1,DVD menu select problems,I cannot scroll through a DVD menu that is set...
8,2,Unique Weird Orientalia from the 1930's,"Exotic tales of the Orient from the 1930's. ""D..."
9,1,"Not an ""ultimate guide""","Firstly,I enjoyed the format and tone of the b..."


#### Detecting NaN values

In [20]:
train_data.isna().sum()

sentiment     0
title        66
text          0
dtype: int64

In [21]:
# Display the rows that contain NaN values; since we have plenty of data, these rows can be dropped
train_data[train_data.isna().any(axis=1)]

Unnamed: 0,sentiment,title,text
13265,1,,Couldn't get the device to work with my networ...
26554,1,,What separates this band from Evanescence (bes...
26827,2,,Falkenbach returns with more of the Viking/Fol...
36598,2,,I returned this because I received the same on...
37347,2,,This book is a great fantasy. I love this amaz...
...,...,...,...
952185,1,,To describe this record as typical death metal...
959838,2,,i read this book several years ago but it has ...
965184,1,,See other reviews. Does what it says it would ...
968955,2,,a great read for people wondering how john dee...


In [22]:
test_data.isna().sum()

sentiment     0
title        12
text          0
dtype: int64

In [23]:
# Display the rows that contain NaN values; since we have plenty of data, these rows can be dropped
test_data[test_data.isna().any(axis=1)]

Unnamed: 0,sentiment,title,text
205,2,,Awesome.... simply awesome. I couldn't put thi...
2703,1,,Who is Joe Nickell? What are his qualification...
10875,1,,None the palace of pleasure volume l is not wo...
47630,1,,Crazy¡! I am 10 and this book was not a gud in...
66727,1,,this is a tereble book. dont read this book. i...
83136,1,,The book does have some good info but is dated...
86252,2,,i have every book written by nora roberts this...
101746,1,,OMG! WHAT FREAK! THIS WAS THE ANSWER TO TO DEM...
112957,2,,Random House failed to edit this book. There a...
120213,2,,This CD is good. A lot of the songs on here wa...
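Before dropping these rows, it may be worth checking whether the titles are truly empty or are literal strings such as `NA` that `pd.read_csv` converts to NaN by default. Whether that applies to this dataset is an assumption that the notebook does not verify. The sketch below uses two made-up rows in the same headerless layout to show the difference, and then the `dropna` call used later in the notebook.

```python
import io

import pandas as pd

# Two illustrative rows (not taken from the dataset); the first title is the literal string "NA".
raw = '1,NA,"Could not get the device to work"\n2,Great CD,"Lovely voice"\n'
columns = ["sentiment", "title", "text"]

# Default behaviour: strings like "NA" are parsed as missing values.
default_df = pd.read_csv(io.StringIO(raw), names=columns)
# keep_default_na=False keeps such strings as ordinary text.
literal_df = pd.read_csv(io.StringIO(raw), names=columns, keep_default_na=False)

print(default_df["title"].isna().sum())  # 1
print(literal_df["title"].isna().sum())  # 0

# Dropping rows whose title is missing, restricted to the 'title' column.
cleaned_df = default_df.dropna(subset=["title"])
print(cleaned_df.shape)  # (1, 3)
```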


## Data preparation

* Removing rows that contain NaN values
* Converting the text to lowercase
* Removing special characters (e.g. punctuation such as ! and ?)
* Removing numbers
* Removing superfluous whitespace
* Tokenization
* Removing stop words
* Lemmatization and stemming

### Removing rows that contain NaN values

In [24]:
train_data = train_data.dropna(subset=['title'])

In [None]:
train_data.shape

In [25]:
# Test data
test_data = test_data.dropna(subset=['title'])

In [27]:
test_data.shape

(149988, 3)

### Cleaning the text with regex

In [31]:
# Training data
train_data['title'] = train_data['title'].apply(prepare_text)
train_data['text'] = train_data['text'].apply(prepare_text)

In [29]:
train_data.head(10)

Unnamed: 0,sentiment,title,text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
5,2,an absolute masterpiece,I am quite sure any of you actually taking the...
6,1,Buyer beware,"This is a self-published book, and if you want..."
7,2,Glorious story,I loved Whisper of the wicked saints. The stor...
8,2,A FIVE STAR BOOK,I just finished reading Whisper of the Wicked ...
9,2,Whispers of the Wicked Saints,This was a easy to read book that made me want...


In [32]:
# Test data
test_data['title'] = test_data['title'].apply(prepare_text)
test_data['text'] = test_data['text'].apply(prepare_text)

In [None]:
test_data.head(10)

Unnamed: 0,sentiment,title,text
0,2,great cd,my lovely pat has one of the great voices of h...
1,2,one of the best game music soundtracks for a g...,despite the fact that i have only played a sma...
2,1,batteries died within a year,i bought this charger in jul and it worked ok ...
3,2,works fine but maha energy is better,check out maha energys website their powerex m...
4,2,great for the nonaudiophile,reviewed quite a bit of the combo players and ...
5,1,dvd player crapped out after one year,i also began having the incorrect disc problem...
6,1,incorrect disc,i love the style of this but after a couple ye...
7,1,dvd menu select problems,i cannot scroll through a dvd menu that is set...
8,2,unique weird orientalia from the s,exotic tales of the orient from the s dr shen ...
9,1,not an ultimate guide,firstlyi enjoyed the format and tone of the bo...
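The actual `prepare_text` function lives in `common.ipynb` and is not shown in this notebook. The sketch below is one possible implementation that is consistent with the cleaned output above (lowercasing, dropping punctuation and digits without inserting a space, collapsing whitespace). The name `clean_text_sketch` and the exact patterns are assumptions.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Lowercase the text, keep only ASCII letters and whitespace, and collapse spaces."""
    text = text.lower()
    # Remove everything that is not a-z or whitespace (punctuation, digits, accented letters).
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("Batteries died within a year ..."))  # batteries died within a year
print(clean_text_sketch("Firstly,I enjoyed the 1930's!"))     # firstlyi enjoyed the s
```

Because characters are removed rather than replaced by a space, words separated only by punctuation are fused (`Firstly,I` becomes `firstlyi`), which matches row 9 of the test output. Substituting a space instead would keep such words apart.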


### Tokenization, stop word removal, and lemmatization

In [33]:
# Training data
train_data['title_tokens'] = train_data['title'].apply(tokenizer_stop_words_lemmatize)
train_data['text_tokens'] = train_data['text'].apply(tokenizer_stop_words_lemmatize)

In [34]:
train_data.head(10)

Unnamed: 0,sentiment,title,text,title_tokens,text_tokens
0,2,stuning even for the nongamer,this sound track was beautiful it paints the s...,"[stun, even, nongamer]","[sound, track, beautiful, paint, senery, mind,..."
1,2,the best soundtrack ever to anything,im reading a lot of reviews saying that this i...,"[best, soundtrack, ever, anything]","[im, read, lot, review, say, best, game, sound..."
2,2,amazing,this soundtrack is my favorite music of all ti...,[amaze],"[soundtrack, favorite, music, time, hand, inte..."
3,2,excellent soundtrack,i truly like this soundtrack and i enjoy video...,"[excellent, soundtrack]","[truly, like, soundtrack, enjoy, video, game, ..."
4,2,remember pull your jaw off the floor after hea...,if youve played the game you know how divine t...,"[remember, pull, jaw, floor, hear]","[youve, play, game, know, divine, music, every..."
5,2,an absolute masterpiece,i am quite sure any of you actually taking the...,"[absolute, masterpiece]","[quite, sure, actually, take, time, read, play..."
6,1,buyer beware,this is a selfpublished book and if you want t...,"[buyer, beware]","[selfpublished, book, want, know, whyread, par..."
7,2,glorious story,i loved whisper of the wicked saints the story...,"[glorious, story]","[love, whisper, wicked, saint, story, amaze, p..."
8,2,a five star book,i just finished reading whisper of the wicked ...,"[five, star, book]","[finish, read, whisper, wicked, saint, fell, l..."
9,2,whispers of the wicked saints,this was a easy to read book that made me want...,"[whisper, wicked, saint]","[easy, read, book, make, want, keep, read, eas..."


In [35]:
# Test data
test_data['title_tokens'] = test_data['title'].apply(tokenizer_stop_words_lemmatize)
test_data['text_tokens'] = test_data['text'].apply(tokenizer_stop_words_lemmatize)

In [36]:
test_data.head(10)

Unnamed: 0,sentiment,title,text,title_tokens,text_tokens
0,2,great cd,my lovely pat has one of the great voices of h...,"[great, cd]","[lovely, pat, one, great, voice, generation, l..."
1,2,one of the best game music soundtracks for a g...,despite the fact that i have only played a sma...,"[one, best, game, music, soundtracks, game, di...","[despite, fact, play, small, portion, game, mu..."
2,1,batteries died within a year,i bought this charger in jul and it worked ok ...,"[batteries, die, within, year]","[buy, charger, jul, work, ok, design, nice, co..."
3,2,works fine but maha energy is better,check out maha energys website their powerex m...,"[work, fine, maha, energy, better]","[check, maha, energys, website, powerex, mhcf,..."
4,2,great for the nonaudiophile,reviewed quite a bit of the combo players and ...,"[great, nonaudiophile]","[review, quite, bite, combo, players, hesitant..."
5,1,dvd player crapped out after one year,i also began having the incorrect disc problem...,"[dvd, player, crap, one, year]","[also, begin, incorrect, disc, problems, ive, ..."
6,1,incorrect disc,i love the style of this but after a couple ye...,"[incorrect, disc]","[love, style, couple, years, dvd, give, proble..."
7,1,dvd menu select problems,i cannot scroll through a dvd menu that is set...,"[dvd, menu, select, problems]","[scroll, dvd, menu, set, vertically, triangle,..."
8,2,unique weird orientalia from the s,exotic tales of the orient from the s dr shen ...,"[unique, weird, orientalia]","[exotic, tales, orient, dr, shen, fu, weird, t..."
9,1,not an ultimate guide,firstlyi enjoyed the format and tone of the bo...,"[ultimate, guide]","[firstlyi, enjoy, format, tone, book, author, ..."
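`tokenizer_stop_words_lemmatize` is also defined in `common.ipynb` and is not reproduced here. As a rough illustration of the pipeline shape only, the sketch below tokenizes on whitespace, filters a tiny hand-written stop word set, and applies a pluggable lemmatizer that defaults to the identity. All names and the stop word list are assumptions; it does not use NLTK and does not perform real lemmatization.

```python
from typing import Callable, FrozenSet, List

# Tiny illustrative stop word list (assumption); NLTK's English list is much longer.
STOP_WORDS_SKETCH: FrozenSet[str] = frozenset(
    {"a", "an", "the", "is", "was", "of", "for", "to", "this", "it", "i", "my", "and", "but"}
)


def tokenize_and_filter(
    text: str,
    stop_words: FrozenSet[str] = STOP_WORDS_SKETCH,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Split cleaned text on whitespace, drop stop words, then lemmatize each token."""
    tokens = text.split()  # adequate here because the text is already letters and spaces only
    return [lemmatize(token) for token in tokens if token not in stop_words]


print(tokenize_and_filter("this was a great cd"))  # ['great', 'cd']
```

To get closer to the notebook's results, one could pass NLTK's `stopwords.words('english')` as `stop_words` and a `WordNetLemmatizer().lemmatize` call as `lemmatize`; both need the corresponding NLTK data to be downloaded.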


### Saving the data to CSV files

In [37]:
destination_train_file = 'cleaned_train.csv'
destination_test_file = 'cleaned_test.csv'

destination_train_file_path = os.path.join(path, destination_train_file)
destination_test_file_path = os.path.join(path, destination_test_file)
train_data.to_csv(destination_train_file_path)
test_data.to_csv(destination_test_file_path)
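One thing to keep in mind when reloading these files: the token columns hold Python lists, and a CSV file stores them as their string representation. The sketch below uses only the standard library to show the round trip and how `ast.literal_eval` restores the list; the in-memory buffer and sample row are illustrative.

```python
import ast
import csv
import io

# Write one illustrative row whose token column is the string form of a list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentiment", "title_tokens"])
writer.writerow([2, str(["great", "cd"])])

# Read it back: the token column is now a plain string, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
stored = rows[0]["title_tokens"]
print(type(stored).__name__)  # str

# literal_eval safely parses the string back into a list of tokens.
tokens = ast.literal_eval(stored)
print(tokens)  # ['great', 'cd']
```

With pandas, the same conversion could be applied at load time, for example via the `converters` argument of `pd.read_csv`. Note also that `to_csv` writes the index as an unnamed first column unless `index=False` is passed.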