# Preparing The Training/Test Set

Because we planned to use our topic models for classification, we needed to hold out a test set. The dataset is very large, so we were unsure whether we could train on the full training set and test on the full test set, but we decided that we could always downsample both at a later stage if necessary.

In [2]:
import os
import pandas as pd
import random
from numpy.random import RandomState

Because the dataset is very large, we cannot load it automatically from, for example, a Google Drive link. To run any cell that reads a dataset, like the one below, first download the dataset to the working directory on your machine. This is the whole dataset in its original form from Kaggle.

In [3]:
df = pd.read_csv('Books_rating.csv')
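If memory becomes a problem, one option is to parse only the columns that are kept later on. The following is a minimal sketch rather than what this notebook does: the small in-memory CSV is a hypothetical stand-in for `Books_rating.csv`, and the column names are taken from the output shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for Books_rating.csv with a few of its columns.
sample_csv = io.StringIO(
    "Id,Title,review/score,review/summary,review/text\n"
    "1,Book A,4.0,Nice,Good read\n"
    "2,Book B,5.0,Great,Loved it\n"
)

# usecols restricts parsing to the listed columns, which reduces memory use.
keep = ['review/score', 'review/summary', 'review/text']
small = pd.read_csv(sample_csv, usecols=keep)
print(small.columns.tolist())
```

With the real file, the same `usecols` argument would be passed alongside the file name.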

In [4]:
df.head()

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


We begin by removing the columns that we will not use in this project. Our main focus is on developing topic models and using them to investigate how information gained solely from the reviews affects a classification task.

In [5]:
df.drop(['Id','Title','Price','User_id','profileName','review/helpfulness','review/time'], axis=1, inplace=True)


In [6]:
df.head()

Unnamed: 0,review/score,review/summary,review/text
0,4.0,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,5.0,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,5.0,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,4.0,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,4.0,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


We rename the remaining columns to shorter names.

In [7]:
df = df.rename(columns={'review/score': 'score', 'review/summary': 'summary', 'review/text': 'text'})

In [8]:
# Take a copy so that later changes to df_combined do not also modify df
df_combined = df.copy()
df_combined.head()

Unnamed: 0,score,summary,text
0,4.0,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,5.0,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,5.0,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,4.0,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,4.0,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


We now prepend the summary to the review text, because we will compare how feeding both into our models changes the results relative to using the summaries alone.

In [9]:
df_combined['text'] = df.summary.astype(str).str.cat(df.text.astype(str), sep=' ')

In [10]:
df_combined.head()

Unnamed: 0,score,summary,text
0,4.0,Nice collection of Julie Strain images,Nice collection of Julie Strain images This is...
1,5.0,Really Enjoyed It,Really Enjoyed It I don't care much for Dr. Se...
2,5.0,Essential for every personal and Public Library,Essential for every personal and Public Librar...
3,4.0,Phlip Nel gives silly Seuss a serious treatment,Phlip Nel gives silly Seuss a serious treatmen...
4,4.0,Good academic overview,Good academic overview Philip Nel - Dr. Seuss:...


In [11]:
df_combined.iloc[0,2]

"Nice collection of Julie Strain images This is only for Julie Strain fans. It's a collection of her photos -- about 80 pages worth with a nice section of paintings by Olivia.If you're looking for heavy literary content, this isn't the place to find it -- there's only about 2 pages with text and everything else is photos.Bottom line: if you only want one book, the Six Foot One ... is probably a better choice, however, if you like Julie like I like Julie, you won't go wrong on this one either."
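One thing to watch for when joining the two columns is missing values: depending on the pandas version, `astype(str)` can turn a missing summary or review into the literal text `nan`, which would then end up in the combined string. The sketch below shows one possible alternative on toy data (the rows are made up for illustration); it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows, one of which has no summary (illustrative only).
toy = pd.DataFrame({
    'summary': ['Good academic overview', np.nan],
    'text': ['Worth reading.', 'No summary was given.'],
})

# Replace missing values with empty strings before joining, then strip the
# stray space left behind when one side was empty.
combined = (
    toy['summary'].fillna('')
    .str.cat(toy['text'].fillna(''), sep=' ')
    .str.strip()
)
print(combined.tolist())
```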

The following cells perform stratified sampling by review score to split the data into a test set and a training set, which contain 20% and 80% of the reviews respectively.

In [13]:
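# No seed is given, so the split differs between runs; pass an integer seed to RandomState for a reproducible split
rng = RandomState()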
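To illustrate the effect of seeding, the sketch below draws the same sample twice from a toy frame. The seed value `0` and the toy data are assumptions made for the example; any fixed integer would behave the same way.

```python
import pandas as pd
from numpy.random import RandomState

toy = pd.DataFrame({'score': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]})

# Two generators created with the same seed produce the same draw.
first = toy.sample(n=3, random_state=RandomState(0))
second = toy.sample(n=3, random_state=RandomState(0))
print(first.index.equals(second.index))
```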

In [14]:
test_parts = []
train_parts = []

# Sample 20% of the reviews for each score (1 to 5) as test data
for score in [1, 2, 3, 4, 5]:
    group = df_combined[df_combined['score'] == score]
    num_test = round(len(group) * 0.2)

    # The rows that were not sampled for the test set form the training set
    test_part = group.sample(n=num_test, random_state=rng)
    train_part = group.loc[~group.index.isin(test_part.index)]

    test_parts.append(test_part)
    train_parts.append(train_part)

test = pd.concat(test_parts)
train = pd.concat(train_parts)
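The same kind of stratified split can also be sketched with `groupby(...).sample`, which draws a fraction of the rows within each group. This is an alternative shown on toy data, not the method used above; the names `toy`, `toy_test` and `toy_train` and the seed `0` are made up for the example.

```python
import pandas as pd

# Toy data: 10 one-star and 20 five-star reviews (illustrative only).
toy = pd.DataFrame({'score': [1.0] * 10 + [5.0] * 20})
toy['text'] = [f'review {i}' for i in range(len(toy))]

# Draw 20% of the rows within each score group for the test set.
toy_test = toy.groupby('score').sample(frac=0.2, random_state=0)

# Everything that was not drawn becomes the training set.
toy_train = toy.drop(toy_test.index)

print(toy_test['score'].value_counts().to_dict())
```

One difference to keep in mind: rows with a missing score belong to no group, so with this approach they would remain in the training set unless they are dropped first, whereas the cell above leaves them out of both sets.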

We save the test and training sets to CSV files with the following names.

In [15]:
test.to_csv('reviews_test.csv', sep=",", index=False)
train.to_csv('reviews_train.csv', sep=",", index=False)

Finally, we check that the test set makes up 20% of the combined test and training data.

In [16]:
a = test.shape[0]
b = train.shape[0]
100 * a/(a+b)

20.000033333333334
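The overall percentage does not show whether each score is equally represented in both sets. A possible follow-up check, sketched here on toy frames that stand in for `train` and `test`, compares the share of each score in the two splits.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (illustrative only).
toy_train = pd.DataFrame({'score': [1.0] * 8 + [5.0] * 16})
toy_test = pd.DataFrame({'score': [1.0] * 2 + [5.0] * 4})

# Share of each score within each split; with a stratified split
# the two columns should be close to identical.
shares = pd.DataFrame({
    'train': toy_train['score'].value_counts(normalize=True),
    'test': toy_test['score'].value_counts(normalize=True),
}).sort_index()
print(shares)
```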