# Data preparation - notebook

##### Task 1:
Data collection and preparation: download product review data from an open data source (e.g. Kaggle or the Amazon API) and preprocess it (remove noise from the text, tokenize, etc.).

##### Dataset link: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download

## Importing the required libraries

In [None]:
#!pip install --upgrade nltk



In [1]:
import os
import re

import numpy as np
import pandas as pd

import nltk

In [None]:
nltk.download('all')
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Vani\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Vani\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Vani\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\Vani\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Vani\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping

True

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## Loading the datasets and inspecting the data

In [3]:
file = 'train.csv'
test_file = 'test.csv'

path = input('Path for the files: ')

file_path = (os.path.join(path, file)).replace('\\', '/')
test_file_path = (os.path.join(path, test_file)).replace('\\', '/')

In [4]:
data = pd.read_csv(file_path, names=['sentiment', 'title', 'text'])
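
The `names=` argument matters here because the CSV files appear to ship without a header row (an assumption based on the call above and on the first review showing up as row 0). A minimal sketch with a made-up two-row sample, not the real file, illustrates the difference:

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as train.csv (no header row).
sample_csv = (
    '2,"Amazing!","This soundtrack is my favorite."\n'
    '1,"Buyer beware","This is a self-published book."\n'
)

# Without names=, pandas treats the first review as the header row.
wrong = pd.read_csv(io.StringIO(sample_csv))

# With names= (and no header= argument), every row is read as data.
right = pd.read_csv(io.StringIO(sample_csv), names=['sentiment', 'title', 'text'])

print(len(wrong), len(right))
```

In the sketch, the first call loses one review to the header, while the second keeps both rows and labels the columns explicitly.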

In [5]:
data.head(10)

Unnamed: 0,sentiment,title,text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
5,2,an absolute masterpiece,I am quite sure any of you actually taking the...
6,1,Buyer beware,"This is a self-published book, and if you want..."
7,2,Glorious story,I loved Whisper of the wicked saints. The stor...
8,2,A FIVE STAR BOOK,I just finished reading Whisper of the Wicked ...
9,2,Whispers of the Wicked Saints,This was a easy to read book that made me want...


In [6]:
data.shape

(3600000, 3)

In [7]:
data.sentiment.unique().tolist()

[2, 1]

In [8]:
data.sentiment.value_counts()

2    1800000
1    1800000
Name: sentiment, dtype: int64

#### Detecting NaN values

In [9]:
data.isna().sum()

sentiment     0
title        77
text          0
dtype: int64

In [10]:
# Show the rows that contain NaN values; since we have plenty of data, these 77 rows can be dropped
data[data.isna().any(axis=1)]

Unnamed: 0,sentiment,title,text
26554,1,,What separates this band from Evanescence (bes...
26827,2,,Falkenbach returns with more of the Viking/Fol...
36598,2,,I returned this because I received the same on...
132358,2,,"I am a Shakespeare buff, so I didn't find this..."
134465,2,,"Goes at quite a steady pace, however, this is ..."
...,...,...,...
3292195,2,,"The Last Van Gogh, is a compelling, visually p..."
3293615,1,,Can't write a review- it's been a month since ...
3403351,1,,It is not a game. It is only a memory cardIt w...
3493132,2,,Al Spath's diary is a must for all poker playe...


## Data preparation

* Remove rows that contain NaN values
* Convert the text to lowercase
* Remove special characters (e.g. punctuation such as ! and ?)
* Remove numbers
* Remove redundant whitespace
* Tokenize
* Remove stop words
* Lemmatize and stem

### Removing rows that contain NaN values

In [11]:
data = data.dropna(subset=['title'])

In [12]:
data.shape

(3599923, 3)
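
The new shape is consistent with the earlier count: 3600000 - 77 = 3599923 rows. The detect-then-drop pattern can be sketched on a hypothetical three-row table (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the review table; one title is missing.
df = pd.DataFrame({
    'sentiment': [2, 1, 2],
    'title': ['great', np.nan, 'fine'],
    'text': ['loved it', 'no title here', 'works'],
})

missing_per_column = df.isna().sum()   # count NaN values per column
cleaned = df.dropna(subset=['title'])  # drop only the rows whose title is NaN

print(missing_per_column['title'], cleaned.shape)
print(cleaned.index.tolist())
```

Note that `dropna` keeps the original index labels, so gaps remain where rows were removed. If a contiguous index is needed, `reset_index(drop=True)` can be applied afterwards.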

### Cleaning the text with regex

In [13]:
def prepare_text(text):
    text = text.lower()  # convert the text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation (underscores are kept, since \w matches them)
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

    return text
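
To make the effect of each substitution concrete, here is a small worked example. The function body is repeated so the block runs on its own, and the input strings are made up for illustration:

```python
import re


def prepare_text(text):
    # Same four steps as the notebook function, repeated so this block is self-contained.
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Punctuation is deleted rather than replaced by a space, so 'non-gamer' fuses into one word.
example = prepare_text('Stuning even for the non-gamer!!  5 stars')

# The underscore counts as a word character (\w), so it survives the cleanup.
underscore = prepare_text('snake_case stays')

print(example)     # stuning even for the nongamer stars
print(underscore)  # snake_case stays
```

If hyphenated words should stay separate tokens, one option would be to substitute a space instead of the empty string in the punctuation step. That changes the resulting vocabulary, so it is a design choice rather than a fix.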

In [14]:
data['title'] = data['title'].apply(prepare_text)
data['text'] = data['text'].apply(prepare_text)

In [15]:
data.head(10)

Unnamed: 0,sentiment,title,text
0,2,stuning even for the nongamer,this sound track was beautiful it paints the s...
1,2,the best soundtrack ever to anything,im reading a lot of reviews saying that this i...
2,2,amazing,this soundtrack is my favorite music of all ti...
3,2,excellent soundtrack,i truly like this soundtrack and i enjoy video...
4,2,remember pull your jaw off the floor after hea...,if youve played the game you know how divine t...
5,2,an absolute masterpiece,i am quite sure any of you actually taking the...
6,1,buyer beware,this is a selfpublished book and if you want t...
7,2,glorious story,i loved whisper of the wicked saints the story...
8,2,a five star book,i just finished reading whisper of the wicked ...
9,2,whispers of the wicked saints,this was a easy to read book that made me want...


### Tokenization, stop word removal, and lemmatization

In [16]:
# A set gives fast membership checks when filtering tokens.
stop_words = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

In [None]:
def tokenizer_stop_words_lemmatize(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    final_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return final_tokens
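
The three stages can be sketched without any NLTK downloads. This is a simplified stand-in, not the notebook's method: `str.split` replaces `word_tokenize`, and the tiny `STOP_WORDS` set and `NOUN_LEMMAS` table are hypothetical substitutes for NLTK's stop word corpus and `WordNetLemmatizer`.

```python
# Hypothetical stand-ins for NLTK's English stop word list and WordNet lemmas.
STOP_WORDS = {'i', 'the', 'of'}
NOUN_LEMMAS = {'whispers': 'whisper', 'saints': 'saint'}


def tokenize_filter_lemmatize(text):
    tokens = text.split()                                # 1. tokenize (whitespace split only)
    kept = [tok for tok in tokens if tok not in STOP_WORDS]  # 2. drop stop words
    return [NOUN_LEMMAS.get(tok, tok) for tok in kept]   # 3. map known plurals to their lemma


result = tokenize_filter_lemmatize('i loved the whispers of the wicked saints')
print(result)  # ['loved', 'whisper', 'wicked', 'saint']
```

The sketch leaves `loved` untouched on purpose. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is passed, which is consistent with verb forms such as `loved` and `reading` staying unchanged in the token columns.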

In [21]:
data['title_tokens'] = data['title'].apply(tokenizer_stop_words_lemmatize)
data['text_tokens'] = data['text'].apply(tokenizer_stop_words_lemmatize)

In [22]:
data.head(10)

Unnamed: 0,sentiment,title,text,title_tokens,text_tokens
0,2,stuning even for the nongamer,this sound track was beautiful it paints the s...,"[stuning, even, nongamer]","[sound, track, beautiful, paint, senery, mind,..."
1,2,the best soundtrack ever to anything,im reading a lot of reviews saying that this i...,"[best, soundtrack, ever, anything]","[im, reading, lot, review, saying, best, game,..."
2,2,amazing,this soundtrack is my favorite music of all ti...,[amazing],"[soundtrack, favorite, music, time, hand, inte..."
3,2,excellent soundtrack,i truly like this soundtrack and i enjoy video...,"[excellent, soundtrack]","[truly, like, soundtrack, enjoy, video, game, ..."
4,2,remember pull your jaw off the floor after hea...,if youve played the game you know how divine t...,"[remember, pull, jaw, floor, hearing]","[youve, played, game, know, divine, music, eve..."
5,2,an absolute masterpiece,i am quite sure any of you actually taking the...,"[absolute, masterpiece]","[quite, sure, actually, taking, time, read, pl..."
6,1,buyer beware,this is a selfpublished book and if you want t...,"[buyer, beware]","[selfpublished, book, want, know, whyread, par..."
7,2,glorious story,i loved whisper of the wicked saints the story...,"[glorious, story]","[loved, whisper, wicked, saint, story, amazing..."
8,2,a five star book,i just finished reading whisper of the wicked ...,"[five, star, book]","[finished, reading, whisper, wicked, saint, fe..."
9,2,whispers of the wicked saints,this was a easy to read book that made me want...,"[whisper, wicked, saint]","[easy, read, book, made, want, keep, reading, ..."


In [23]:
data.tail(10)

Unnamed: 0,sentiment,title,text,title_tokens,text_tokens
3599990,2,buy this cd and youll thank yourself,tyler hiltona name you might not know now but ...,"[buy, cd, youll, thank]","[tyler, hiltona, name, might, know, sure, with..."
3599991,2,tyler rocks,there is only one word to describe tyler hilto...,"[tyler, rock]","[one, word, describe, tyler, hiltontalent, lov..."
3599992,2,awesome,absolutely amazing so relieving of my neck pai...,[awesome],"[absolutely, amazing, relieving, neck, pain, i..."
3599993,1,what a slap in the face to masami ueda,do not buy this cd ever this was probably just...,"[slap, face, masami, ueda]","[buy, cd, ever, probably, released, test, see,..."
3599994,1,too simplistic,while mr harrison makes some extremely valid a...,[simplistic],"[mr, harrison, make, extremely, valid, argumen..."
3599995,1,dont do it,the high chair looks great when it first comes...,[dont],"[high, chair, look, great, first, come, box, h..."
3599996,1,looks nice low functionality,i have used this highchair for kids now and fi...,"[look, nice, low, functionality]","[used, highchair, kid, finally, decided, sell,..."
3599997,1,compact but hard to clean,we have a small house and really wanted two of...,"[compact, hard, clean]","[small, house, really, wanted, two, high, chai..."
3599998,1,what is it saying,not sure what this book is supposed to be it i...,[saying],"[sure, book, supposed, really, rehash, old, id..."
3599999,2,makes my blood run redwhiteandblue,i agree that every american should read this b...,"[make, blood, run, redwhiteandblue]","[agree, every, american, read, book, everybody..."


### Saving the data to a CSV file

In [24]:
dest_file = 'cleaned_train.csv'
dest_test_file = 'cleaned_test.csv'

dest_file_path = os.path.join(path, dest_file)
data.to_csv(dest_file_path)
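
One thing to keep in mind when reloading the saved file: CSV stores the token lists as plain strings. A minimal sketch of the round trip, using an in-memory buffer and a made-up two-row frame instead of the real file, could look like this:

```python
import ast
import io

import pandas as pd

# Hypothetical miniature of the processed table.
df = pd.DataFrame({
    'sentiment': [2, 1],
    'title_tokens': [['glorious', 'story'], ['buyer', 'beware']],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False avoids writing an extra unnamed index column
buffer.seek(0)

restored = pd.read_csv(buffer)
as_read = restored['title_tokens'].iloc[0]  # a string, not a list

# Parse the string representation back into real Python lists.
restored['title_tokens'] = restored['title_tokens'].apply(ast.literal_eval)

print(type(as_read).__name__, restored['title_tokens'].iloc[0])
```

For a table of this size, a binary format that preserves list columns might be a better fit, but that is outside what this notebook does.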