# The code for my research project
To run the code, click 'Cell' and then 'Run All'. The following code was used for preprocessing.

In [19]:
# Below is the code to open the json file and convert it to a pandas dataframe.
# The code in this first cell is not mine; it was provided by the same people that provided the dataset.

import pandas as pd
import numpy as np
import gzip
import json

def parse(path):
  # Yield one parsed review per line; the file is closed once all lines are read
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF("Industrial_and_Scientific_5.json.gz")

# To use another dataset: replace Industrial_and_Scientific_5.json.gz with the filename of another 5-core JSON dataset.
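
The loader above reads the file line by line, which implies one JSON object per line. As a rough sketch of an alternative (not the code supplied with the dataset), `pd.read_json` can read such a file directly. The two sample reviews and the file name `sample_5.json.gz` are made up so that the sketch runs without the real dataset.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical two-review sample written to a temporary file, so the sketch
# runs without the real dataset
sample_reviews = [
    {"overall": 5.0, "asin": "B0000223SI", "reviewText": "Good paper.", "vote": "2"},
    {"overall": 4.0, "asin": "B0000223SK", "reviewText": "As advertised"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_5.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for review in sample_reviews:
        f.write(json.dumps(review) + "\n")

# lines=True: one JSON object per line; dtype=False: no automatic type coercion
sample_df = pd.read_json(sample_path, lines=True, compression="gzip", dtype=False)
print(sample_df.shape)
```

A field that is missing from a review, such as `vote` in the second sample, becomes a missing value in the resulting dataframe.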

In [20]:
# Remove unnecessary columns

df = df.drop(['verified', 'reviewTime', 'style', 'reviewerName', 'unixReviewTime', 'image'], axis=1)
df

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,vote
0,5.0,A1JB7HFWHRYHT7,B0000223SI,"This worked really well for what I used it for. So for my purposes it is getting full marks. This is an all around great, durable, and afforable sandpaper.\n\nPros:\n-Grit cuts really fast and evenly. No random deep scratches like I have seen in some cheaper paper\n-Didn't even have a hint of clogging up.\n-The adhesive is just what I needed. No permanent, but wasn't going anywhere.\n\nCons:\n-None",Couldn't have been happier with it's performance,
1,5.0,A2FCLJG5GV8SD6,B0000223SI,Fast cutting and good adheasive.,Good paper.,
2,5.0,A3IT9B33NWYQSL,B0000223SI,"Worked great for my lapping bench. I would like it if the adhesive were backed with waxed paper for storage and keeping the grit out, but all but the first 6 inches or so still functioned when it arrived. I used rubber cement to remedy that.",Handy!,
3,4.0,AUL5LCV4TT73P,B0000223SK,As advertised,As advertised,
4,5.0,A1V3I3L5JKO7TM,B0000223SK,seems like a pretty good value as opposed to buying it at the big box stores by the sheet.,seems like a pretty good value as opposed to buying it ...,
...,...,...,...,...,...,...
77066,5.0,A1UZ9AVZFWZS1A,B01HCVJ3K2,So far it has worked like a champ. Great solution for the standard heat bed.,I recommend it.,
77067,5.0,A1PMSQXD43WIS4,B01HCVJ3K2,Great quality solid state relay. I used this solid state relay to control my 3D printer heated bed. Its very reliable and takes the load off my print controller.,Great quality solid state relay,
77068,5.0,A225WHD7XZVIXL,B01HEQVQAK,Came with everything needed to install in my Monoprice Makerselect v2. Now I can really crank up the temp on my heated bed to print ABS and not worry about killing my motherboard.,Exactly as described,
77069,5.0,A3T05FOORNQI18,B01HEQVQAK,"Installed a month ago in my Monoprice Maker Select V2 3D printer. It does the job, no problems.\nSimple circuit that does what it should. That makes it a good buy.",Works Great,


In [21]:
# Prepare the vote column for analysing helpfulness votes

df['vote'] = df['vote'].fillna(0) # If vote value is missing: fill in a zero instead
df['vote'] = df['vote'].replace(',', '', regex=True) # If the number has a comma in it, remove it
df['vote'] = df['vote'].astype(int) # Convert all vote values from strings to integers
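
To illustrate what these three steps do, here is a small sketch on made-up vote values (one missing, one plain, one with a thousands separator). The values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw vote values, stored as strings like in the JSON file
votes = pd.Series([np.nan, "12", "2,333"], dtype=object)

votes = votes.fillna(0)                     # missing value -> 0
votes = votes.replace(',', '', regex=True)  # "2,333" -> "2333"
votes = votes.astype(int)                   # strings -> integers

print(votes.tolist())  # [0, 12, 2333]
```

Without the comma removal, `astype(int)` would fail on a value such as `"2,333"`.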

In [22]:
df['vote']

# No more missing vote values

0        0
1        0
2        0
3        0
4        0
        ..
77066    0
77067    0
77068    0
77069    0
77070    0
Name: vote, Length: 77071, dtype: int64

In [23]:
# Display the full review text

pd.set_option('display.max_colwidth', None) 

In [24]:
# Remove all duplicate entries of review text

df = df.sort_values(by='reviewText')
df = df.drop_duplicates(subset='reviewText', keep="first", ignore_index=True)

In [25]:
# Cleaning the review text

df['reviewText'] = df['reviewText'].str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation (regex=True is required in pandas 2.0 and later)
df['reviewText'] = df['reviewText'].replace('\n', ' ', regex=True) # Replace all newline chars with whitespace
df['reviewText'] = df['reviewText'].str.lower() # Convert all uppercase chars to lowercase
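
As a small illustration of the cleaning steps, the sketch below applies them to a single made-up review. Note that `regex=True` is passed explicitly to `str.replace`: from pandas 2.0 onwards the pattern is treated as a literal string by default, in which case no punctuation would be removed.

```python
import pandas as pd

# Hypothetical review text with punctuation, newlines and capitals
text = pd.Series(["Didn't clog.\n\nPros:\n-Cuts fast!"])

text = text.str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
text = text.replace('\n', ' ', regex=True)           # newline -> space
text = text.str.lower()                              # lowercase

print(repr(text.iloc[0]))  # 'didnt clog  pros cuts fast'
```

Two consecutive newlines leave a double space behind. This does not affect word counts based on `str.split()`, which splits on any run of whitespace.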

# Processing the data


In [26]:
# Add a new column with the length (in words) of every review

df['reviewLength'] = df['reviewText'].str.split().str.len()
df.sample(n = 5)

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,vote,reviewLength
22912,2.0,ATW84RMD0UIKA,B00AZMGFI4,ive used this 4 times so far and my clock still squeaks the squeaking stops only for a matter of hours after application then starts right back up again life on this oil is way to short,Poor lifespan of lubrication,0,37.0
44058,5.0,A3QOPRD5VWTJ8P,B00J0GO8I0,this was my first spool after ive got assembled my first 3d printer cheap delta kossel mini for 220 and parts a printed with amazing quality it wraps a bit when cools down so consider this in your designs,Amazing quality even on cheap 3D printer.,2,39.0
27882,5.0,AUY525S8M0ZUW,B000GP05VS,neat little packages that make cyanoacrylate ca so convenient to use the krazy brand from my experience seems to be very good compared to other brands and they have not gotten the make me rich fever with a fair price so far i got some noname brands too and i will try and keep track of how they compare to the krazy brand i tape the little nozzle into a duct tape fold tape around the end of the nozzle to seal the end from air you need to be able to pull on the ends of the tape and the nozzle will come free for the next job and the glue seems to work well a month or 2 after it has been opened you just yank open the tape fold and the container is read to go for the next job i find that i dont need too much quantity but just want it for those many little jobs around the house and garage if one is not careful you can pay too much for this stuff there is an adhesive company that goes around buying up others for a monopoly and the prices that they command are almost criminal hey it is the american dream give you nothing and charge super high prices,I Can't Take My Hands Off This Stuff!,0,216.0
49040,5.0,A3ADJ0YMAYWQTP,B011KSFRKS,worked nicely,Five Stars,0,2.0
4287,1.0,ABTA0G9CQ0ELG,B00DRALJ28,cant calibrate the humidity reading was over 10 low off after i did the salt test put a bottle cap of damp salt in a baggie with the thermometer for 12 hours and it should read 75 humidity,Can't calibrate; 10 points off,0,38.0
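
The `reviewLength` values are shown as floats (for example `37.0`). A minimal sketch with made-up reviews suggests why this can happen: a missing review text has no word count, and pandas stores the resulting column as floats so that it can hold a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: normal text, empty text and a missing value
reviews = pd.Series(["worked nicely", "", np.nan], dtype=object)

# Split on whitespace and count the resulting words
lengths = reviews.str.split().str.len()

print(lengths.tolist())  # [2.0, 0.0, nan]
```

An empty review counts as 0 words, while a missing review stays missing and is skipped by summary statistics such as `count` and `mean`.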


In [27]:
df.shape

# The number of reviews left after cleaning:

(58331, 7)

In [28]:
df.describe()

# Some descriptive statistics on the dataset

Unnamed: 0,overall,vote,reviewLength
count,58331.0,58331.0,58330.0
mean,4.470625,1.566131,52.815189
std,1.002965,18.657533,107.04545
min,1.0,0.0,0.0
25%,4.0,0.0,9.0
50%,5.0,0.0,24.0
75%,5.0,0.0,57.0
max,5.0,2333.0,5946.0


In [29]:
# Add a column with the total number of votes per individual product:
df['totalVotes'] = df.groupby('asin')["vote"].transform('sum')

# Add a column with each review's share of its product's votes
# (a fraction between 0 and 1; NaN when the product has no votes at all):
df['votesPercentage'] = df['vote'] / df['totalVotes']
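
The sketch below shows the same two steps on a made-up table with two products, `A` and `B` (hypothetical IDs). It illustrates that `transform('sum')` repeats the product total on every row, and that a product without any votes yields `NaN` rather than 0.

```python
import pandas as pd

# Hypothetical votes for two products; product B has no votes at all
toy = pd.DataFrame({
    "asin": ["A", "A", "B", "B"],
    "vote": [3, 1, 0, 0],
})

# Total votes per product, repeated on every row of that product
toy["totalVotes"] = toy.groupby("asin")["vote"].transform("sum")

# Share of the product's votes that each review received
toy["votesPercentage"] = toy["vote"] / toy["totalVotes"]

print(toy)
```

Here the reviews of product `A` get 0.75 and 0.25, and both reviews of product `B` get `NaN` because 0 is divided by 0.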

In [30]:
# Label each review as either short or long.
# Important: when using a different dataset, change the number 53 to the mean review length
# (in words, rounded) of the dataset you're using.

df['length'] = np.where(df['reviewLength']>=53, 'long', 'short')

# Label a review as very helpful if it received at least half of its product's votes
df['veryHelpful'] = np.where(df['votesPercentage']>=0.5, 'yes', 'no')
df.sample(n=5)

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,vote,reviewLength,totalVotes,votesPercentage,length,veryHelpful
682,5.0,A2F54TFGWPMN5J,B007BTPNC8,a good hard set of tapsas advertised,"Taps----starting, normal tap and bottoming tap",0,7.0,0,,short,no
30341,4.0,A246581CHFRHWE,B001Q4ZTPK,only complaint i wish it came in a box rather than a plastic wrapped throw away package would be nice to be able to use it and store it in its own containermade my own polishing box with places for all the compoundsstill would be nice to have a box to remember the compoundgrit etc,"Works great, great price...",0,55.0,15,0.0,long,no
22452,5.0,AGNU4QNN81MJH,B0058SORF8,ive been setting up my garage with seville classics ultrahd lighted workbench ultrahd storage cabinet ultrahd 12drawer rolling workbench and now this ultrahd steel pegboard set this stuff not only looks great it also lives up to its name ultrahd heavy duty this stuff is made to last what is surprising is the price when compared to similar products of the same high quality heavy duty construction,Seville Classic is the only way to go!,0,67.0,4,0.0,long,no
49595,4.0,A2P5YJCUDHSNV2,B0002EQU6C,works exactly as advertised could be a bit of a finer tip for application,Good thermal compound.,0,14.0,961,0.0,short,no
31510,4.0,A24KKRCPD755UI,B0027Z6CY4,pretty good chain a little pricy,Four Stars,0,6.0,13,0.0,short,no
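
Instead of changing the number 53 by hand for every dataset, the threshold could be derived from the data. The sketch below is one possible way to do this on made-up values, assuming that rounding the mean to the nearest whole number matches the intent. It also shows that a missing `votesPercentage` ends up labelled `'no'`, because a comparison with `NaN` is always false.

```python
import numpy as np
import pandas as pd

# Hypothetical review lengths and vote shares
toy = pd.DataFrame({
    "reviewLength": [2.0, 10.0, 40.0, 160.0],
    "votesPercentage": [np.nan, 0.0, 0.5, 0.75],
})

# Assumption: the threshold is the mean review length, rounded
threshold = round(toy["reviewLength"].mean())

toy["length"] = np.where(toy["reviewLength"] >= threshold, "long", "short")
toy["veryHelpful"] = np.where(toy["votesPercentage"] >= 0.5, "yes", "no")

print(threshold)
print(toy)
```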


# Results
The contingency table below shows helpfulness in relation to review length.

In [31]:
contingency_table = pd.crosstab(df.veryHelpful, df.length, margins=True, margins_name="Total")
contingency_table

veryHelpful / length,long,short,Total
no,13581,41674,55255
yes,2076,1000,3076
Total,15657,42674,58331


In [32]:
# The same results, but as proportions of all reviews:
contingency_table = pd.crosstab(df.veryHelpful, df.length, normalize=True)
contingency_table

veryHelpful / length,long,short
no,0.232826,0.71444
yes,0.03559,0.017144
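
The proportions above are relative to all reviews. To compare long and short reviews directly, it can help to look at the share of very helpful reviews within each length group. The sketch below recomputes this from the counts in the first contingency table. It is descriptive only and makes no claim about statistical significance.

```python
import pandas as pd

# Counts copied from the first contingency table (without the Total margins)
counts = pd.DataFrame(
    {"long": [13581, 2076], "short": [41674, 1000]},
    index=pd.Index(["no", "yes"], name="veryHelpful"),
)

# Divide each column by its column total
share_within_length = counts / counts.sum(axis=0)

print(share_within_length.round(4))
```

This gives roughly 13% very helpful reviews among long reviews against roughly 2% among short reviews. On the full dataframe, `pd.crosstab(df.veryHelpful, df.length, normalize='columns')` should give the same table directly.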


In [33]:
dfc = df.copy() # Clone the dataset

# Reduce the dataset to reviews that gave either one or five stars:
dfc = dfc.loc[(dfc['overall'] > 4) | (dfc['overall'] < 2)].copy()

# Create new column labeling review as positive or negative:
dfc['sentiment'] = np.where(dfc['overall']==5, 'positive', 'negative')
dfc.head(5)

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,vote,reviewLength,totalVotes,votesPercentage,length,veryHelpful,sentiment
0,5.0,A3D1AFK1WU0TG,B001PNO368,used for winch switch,,0,4.0,0,,short,no,positive
6,5.0,A2D6HAJAC32XC0,B00MUT58Y2,everybody uses this stuff for a good reason nuff said a little goes a long way,"Works, no muss or fuss",0,16.0,308,0.0,short,no,positive
8,5.0,A35PPLVIPZLU36,B01F47B8AO,my size shot glasses,Great for drunks.,0,4.0,18,0.0,short,no,positive
9,5.0,A2VUW39TF5YCC1,B00TQ7DQU4,nice tape glows nicely,"Glows nicely. """,0,4.0,0,,short,no,positive
11,5.0,AQH4Z8W9WYE41,B00WW4H8XY,quick ship works great buy with confidence,"Works Great. "" Buy with confidence",0,7.0,0,,short,no,positive



The second contingency table below shows the relation between helpfulness and sentiment.

In [34]:
contingency_table_2 = pd.crosstab(dfc.veryHelpful, dfc.sentiment, margins=True, margins_name="Total")
contingency_table_2

veryHelpful / sentiment,negative,positive,Total
no,1967,39244,41211
yes,229,1963,2192
Total,2196,41207,43403


In [35]:
contingency_table_2 = pd.crosstab(dfc.veryHelpful, dfc.sentiment, normalize=True)
contingency_table_2

# As proportions of all one- and five-star reviews:

veryHelpful / sentiment,negative,positive
no,0.045319,0.904177
yes,0.005276,0.045227
