Loading the dataset.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', 500)

df = pd.read_csv("reviews_train.csv")
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,summary,unixReviewTime,reviewTime,score
0,A35C43YE9HU9CN,B0064X7B4A,Joan Miller,"[0, 0]",I have decided not to play this game. I can't...,Friends,1396396800,"04 2, 2014",1.0
1,AHFS8CGWWXB5B,B00H1P4V3E,WASH ST. GAMER,"[3, 4]",The Amazon Appstore free app of the day for Ju...,"Amazon Makes This ""Longest Spring Ever"" for Fi...",1402272000,"06 9, 2014",2.0
2,A3EW8OTQ90NVHM,B00CLVW82O,Kindle Customer,"[0, 4]",this game was so mush fun I wish I could play ...,best,1368921600,"05 19, 2013",5.0
3,AJ3GHFJY1IUTD,B007T9WVKM,BrawlMaster4,"[0, 2]","Its pretty fun and very good looking, but you...",Fun Game,1350172800,"10 14, 2012",5.0
4,A3JJGBS4EL603S,B00J206J5E,"K. Wilson ""thesupe""","[0, 0]",good graphics; immersive storyline; hard to st...,great game!,1396915200,"04 8, 2014",5.0


Data preprocessing

Dropping incomplete rows, which make up a negligible fraction of the full dataset.

In [2]:
print("Incomplete rows: ", len(df) - len(df.dropna()))
df = df.dropna()

Incomplete rows:  46
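
Before dropping rows, it can help to see which columns actually contain the gaps. The following is a minimal sketch on a small hypothetical frame (the `toy` data is made up for illustration and is not taken from reviews_train.csv); it relies only on the pandas methods `isna`, `any`, `sum`, and `dropna`.

```python
import pandas as pd

# Hypothetical toy data standing in for the review table.
toy = pd.DataFrame({
    "reviewerName": ["Ann", None, "Bob"],
    "reviewText": ["good", "bad", None],
    "score": [5.0, 1.0, 3.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that have at least one missing value.
incomplete_rows = int(toy.isna().any(axis=1).sum())

# Drop the incomplete rows, as in the cell above.
toy_clean = toy.dropna()

print(missing_per_column)
print("Incomplete rows:", incomplete_rows)
```

Applied to the real `df`, the per-column counts would show whether the missing values cluster in a single feature or are spread across several.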


Balancing the dataset so that every class is equally represented.

In [3]:
from imblearn.under_sampling import RandomUnderSampler

print(df.score.value_counts() / df.shape[0] * 100, df.shape[0])

X = df[list(set(df.columns) - {"score"})]
y = df['score']
rus = RandomUnderSampler(sampling_strategy='not minority', random_state=42)
X_res, y_res = rus.fit_resample(X, y)

X_res['score'] = y_res
df = X_res

print(df.score.value_counts() / df.shape[0] * 100, df.shape[0])
df.head()

5.0    50.642651
4.0    20.968970
3.0    11.496640
1.0    10.821510
2.0     6.070230
Name: score, dtype: float64 555745
1.0    20.0
2.0    20.0
3.0    20.0
4.0    20.0
5.0    20.0
Name: score, dtype: float64 168675


Unnamed: 0,asin,reviewerName,helpful,reviewText,summary,reviewTime,unixReviewTime,reviewerID,score
0,B006OBWGHO,butler,"[6, 7]",Disney cant make apps. They should just stick ...,Disney apps,"04 8, 2012",1333843200,A2O736G9DQ21NW,1.0
1,B008KMAJQU,Puffey,"[0, 6]",Didn't like controls and choice of weapons wou...,Sucks,"12 28, 2013",1388188800,A193KJ0I0CN0UC,1.0
2,B00AJ3ZJ2C,freeman,"[2, 4]",all it says is nice and gives you a dum!!!I wo...,hate it,"12 21, 2012",1356048000,AJEQ6CRM4OK4L,1.0
3,B007PTJOV0,austin powers,"[2, 4]",not fun at all at certain times when you need ...,not fun,"04 12, 2012",1334188800,AZ23GALJ72OHV,1.0
4,B0083BYESM,"richard from omaha ""richard from omaha""","[0, 1]",U shud not buy this app!! Its a scam!! It do...,Ugh!!,"12 22, 2013",1387670400,AUVDDKSBH0WOY,1.0
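
The numbers above are consistent with undersampling to the size of the rarest class: score 2.0 covers about 6.07% of 555745 rows, i.e. 33735 rows, and 5 × 33735 = 168675. The idea can be sketched in plain Python as follows. This is an illustration of the principle only, not imblearn's implementation; the function name `undersample_to_minority` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample_to_minority(labels, seed=42):
    """Return sorted row indices such that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The rarest class determines how many rows each class keeps.
    target = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, target))  # sample without replacement
    return sorted(kept)


# Hypothetical, imbalanced scores: six 5.0, three 4.0, two 1.0.
labels = [5.0] * 6 + [4.0] * 3 + [1.0] * 2
kept = undersample_to_minority(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

The price of this approach is visible in the cell output: roughly 70% of the rows are discarded to reach a uniform class distribution.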


Converting the 'helpful' column into a format usable by the classifier.

In [4]:
# Strip the brackets as literal characters (regex=False), then split "a, b" into two numeric columns
df['helpful'] = df['helpful'].str.replace("[", "", regex=False)
df['helpful'] = df['helpful'].str.replace("]", "", regex=False)
df[['helpfulP', 'helpfulN']] = df['helpful'].str.split(',', n=1, expand=True)
df['helpfulP'] = pd.to_numeric(df['helpfulP'])
df['helpfulN'] = pd.to_numeric(df['helpfulN'])
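
Since each 'helpful' value is a string that looks like a Python list, it can also be parsed directly instead of being cleaned with string replacements. A minimal sketch, assuming every value has the form "[a, b]" with two integers; the helper name `parse_helpful` is hypothetical:

```python
import ast


def parse_helpful(raw):
    """Turn a string such as "[6, 7]" into a pair of integers."""
    first, second = ast.literal_eval(raw)
    return int(first), int(second)


# Sample values copied from the 'helpful' column shown earlier.
pairs = [parse_helpful(s) for s in ["[6, 7]", "[0, 6]", "[2, 4]"]]
print(pairs)
```

`ast.literal_eval` only evaluates literals, so it is safe to use on untrusted text, and it raises an error on malformed input rather than silently producing a wrong number.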


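Dropping the reviewTime feature, since it carries the same information as unixReviewTime but is less practical to work with. The original 'helpful' column is dropped as well, now that it has been split into helpfulP and helpfulN.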

In [5]:
df.drop(["reviewTime", 'helpful'], axis=1, inplace=True)
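
That the two time columns are redundant can be checked on a few rows. The sketch below uses two (unixReviewTime, reviewTime) pairs copied from the first table and assumes the timestamps denote midnight UTC; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def unix_to_date(ts):
    """Convert a Unix timestamp (seconds) to a calendar date in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()


def parse_review_time(text):
    """Parse a reviewTime string such as "04 2, 2014" (month day, year)."""
    return datetime.strptime(text, "%m %d, %Y").date()


# (unixReviewTime, reviewTime) pairs taken from the head of the dataset.
samples = [(1396396800, "04 2, 2014"), (1402272000, "06 9, 2014")]
matches = [unix_to_date(ts) == parse_review_time(text) for ts, text in samples]
print(matches)
```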

Encoding text features as numbers.

In [6]:
import json
from sklearn.preprocessing import LabelEncoder

# Encode the text (identifier) features as integer codes
encoders = {}
for col in ['reviewerID', 'asin', 'reviewerName']:
    enc = LabelEncoder().fit(df[col])
    df[col] = enc.transform(df[col])
    encoders[col] = enc  # keep the fitted encoder so the mapping can be reused or inverted
df.head()

Unnamed: 0,asin,reviewerName,reviewText,summary,unixReviewTime,reviewerID,score,helpfulP,helpfulN
0,2747,43149,Disney cant make apps. They should just stick ...,Disney apps,1333843200,30124,1.0,6,7
1,5394,31532,Didn't like controls and choice of weapons wou...,Sucks,1388188800,4713,1.0,0,6
2,7167,46230,all it says is nice and gives you a dum!!!I wo...,hate it,1356048000,59783,1.0,2,4
3,4036,41970,not fun at all at certain times when you need ...,not fun,1334188800,67587,1.0,2,4
4,4625,54619,U shud not buy this app!! Its a scam!! It do...,Ugh!!,1387670400,65465,1.0,0,1
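
Label encoding assigns each distinct value an integer according to its position in sorted order. The principle can be sketched in plain Python; this is an illustration, not scikit-learn's code, and `fit_label_mapping` is a hypothetical helper. The sample identifiers are taken from the 'asin' column shown earlier.

```python
def fit_label_mapping(values):
    """Map each distinct value to its position in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


asins = ["B008KMAJQU", "B006OBWGHO", "B008KMAJQU", "B00AJ3ZJ2C"]
mapping = fit_label_mapping(asins)

# Forward transform: identifier -> integer code.
encoded = [mapping[a] for a in asins]

# Inverse transform: integer code -> identifier.
inverse = {code: value for value, code in mapping.items()}
decoded = [inverse[c] for c in encoded]

print(encoded)
```

The resulting codes reflect only the sort order of the identifiers, so their numeric magnitude carries no meaning of its own.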


Basic information about the dataset

In [7]:
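print("---Dataset description---\n", df.describe())
# Correlations are computed on numeric columns only; reviewText and summary are still text
print("---Correlation Matrix---\n", df.select_dtypes(include="number").corr())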

---Dataset description---
                 asin   reviewerName  unixReviewTime     reviewerID  \
count  168675.000000  168675.000000    1.686750e+05  168675.000000   
mean     5780.173594   27991.460999    1.368391e+09   34057.150135   
std      3444.592647   17044.186251    2.370555e+07   19676.427755   
min         0.000000       0.000000    1.300752e+09       0.000000   
25%      2748.000000   13158.000000    1.354320e+09   16946.000000   
50%      5793.000000   27533.000000    1.370477e+09   34025.000000   
75%      8582.000000   42450.000000    1.388621e+09   51251.000000   
max     12515.000000   58500.000000    1.406074e+09   68053.000000   

               score       helpfulP       helpfulN  
count  168675.000000  168675.000000  168675.000000  
mean        3.000000       3.744621       5.137430  
std         1.414218      30.395101      36.371267  
min         1.000000       0.000000       0.000000  
25%         2.000000       0.000000       0.000000  
50%         3.000000    

In [8]:
df.to_csv("reviews_preprocessed.csv", index=False)