# Data Preprocessing

Raw text is rarely ready to hand to a model. Typing errors, badly formatted text and similar problems, none of which would exist in an ideal world, make it harder for our models to run well. At this data cleaning stage we attempt to solve as many of these problems as possible. We will never be able to catch every one, so we focus on some key aspects here.

## 1. Preparation

First, load in the required packages.

In [7]:
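import random
import re
import string

import numpy as np
import pandas as pd
from numpy.random import RandomState

# Unseeded generator: every run draws a different sample.
# Pass an integer seed to RandomState for reproducible downsampling.
rng = RandomState()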

Now we load in our training and test datasets that we produced previously.

In [2]:
df_train = pd.read_csv("reviews_train.csv")
df_test = pd.read_csv("reviews_test.csv")

In [4]:
df_test.head()

Unnamed: 0,score,summary,text
0,1.0,"Misrepresented, majority of short stories not ...","Misrepresented, majority of short stories not ..."
1,1.0,Comments on &quot;Magic Time&quot;,Comments on &quot;Magic Time&quot; I am long t...
2,1.0,Missed opportunity,Missed opportunity The author must have done v...
3,1.0,Raise a Citizen!,Raise a Citizen! If you have Kids or Grandkids...
4,1.0,Unimpressed,Unimpressed I would never understand why this ...


Because our dataset is large, we ran the data cleaning for the full training and test datasets on the HPC. To illustrate the steps, this notebook downsamples both sets to a manageable size and works through the cleaning on those.

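This is the code to produce the downsampled training set. It samples 1/48 of the rows from each score class, so the proportions of the five scores are preserved.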

In [8]:
ones = df_train[df_train['score'] == 1]
twos = df_train[df_train['score'] == 2]
threes = df_train[df_train['score'] == 3]
fours = df_train[df_train['score'] == 4]
fives = df_train[df_train['score'] == 5]

num_ones = round(len(ones) * 1/48)
num_twos = round(len(twos) * 1/48)
num_threes = round(len(threes) * 1/48)
num_fours = round(len(fours) * 1/48)
num_fives = round(len(fives) * 1/48)

new_ones = ones.sample(n=num_ones, random_state=rng)
new_twos = twos.sample(n=num_twos, random_state=rng)
new_threes = threes.sample(n=num_threes, random_state=rng)
new_fours = fours.sample(n=num_fours, random_state=rng)
new_fives = fives.sample(n=num_fives, random_state=rng)

trainframe = [new_ones,new_twos,new_threes,new_fours,new_fives]

new_train = pd.concat(trainframe)

In [9]:
new_train.to_csv('train_50k.csv', sep=",", index=False)

This is the code to produce the downsampled test set. It follows the same pattern, sampling 1/60 of the rows from each score class.

In [12]:
ones = df_test[df_test['score'] == 1]
twos = df_test[df_test['score'] == 2]
threes = df_test[df_test['score'] == 3]
fours = df_test[df_test['score'] == 4]
fives = df_test[df_test['score'] == 5]

num_ones = round(len(ones) * 1/60)
num_twos = round(len(twos) * 1/60)
num_threes = round(len(threes) * 1/60)
num_fours = round(len(fours) * 1/60)
num_fives = round(len(fives) * 1/60)

new_ones = ones.sample(n=num_ones, random_state=rng)
new_twos = twos.sample(n=num_twos, random_state=rng)
new_threes = threes.sample(n=num_threes, random_state=rng)
new_fours = fours.sample(n=num_fours, random_state=rng)
new_fives = fives.sample(n=num_fives, random_state=rng)

testframe = [new_ones,new_twos,new_threes,new_fours,new_fives]

new_test = pd.concat(testframe)

In [14]:
new_test.to_csv('test_10k.csv', sep=",", index=False)
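
The two downsampling cells above repeat the same steps for every score. As a minimal sketch, the same stratified sampling could be written with a single `groupby` call. The helper name `downsample_by_score`, the toy data and the fixed seed are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def downsample_by_score(df, fraction, seed=0):
    """Sample the same fraction of rows from every score class.

    `seed` is a hypothetical fixed seed so that the sample is reproducible.
    """
    return df.groupby("score").sample(frac=fraction, random_state=seed)


# Toy frame: 96 made-up reviews for each of the five scores.
toy = pd.DataFrame({
    "score": [s for s in (1.0, 2.0, 3.0, 4.0, 5.0) for _ in range(96)],
    "text": ["review"] * (5 * 96),
})

small = downsample_by_score(toy, fraction=1 / 48)
counts = small["score"].value_counts().sort_index()
print(counts)  # two rows per score
```

Sampling within each group keeps the score proportions of the full dataset, which is what the longer cells above achieve by hand.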

Now we have a training set of about 50,000 reviews and a test set of about 10,000 to work with. If you are coming back to this notebook, you can skip the cells above and run the cells below to load the downsampled training and test sets back in.

In [15]:
df_train = pd.read_csv("train_50k.csv")
df_test = pd.read_csv("test_10k.csv")

In [17]:
df_train.head()

Unnamed: 0,score,summary,text
0,1.0,Painful,Painful I should start by saying that I have e...
1,1.0,I returned it!,I returned it! I was left scratching my head a...
2,1.0,damn foole by luke,damn foole by luke the damn foole by tom mcgra...
3,1.0,A waste of time and poorly written,A waste of time and poorly written This book w...
4,1.0,Bo-o-o-oring!,Bo-o-o-oring! I don't know what to make of thi...


In [18]:
df_test.head()

Unnamed: 0,score,summary,text
0,1.0,how to be ashamed of your masculinity,how to be ashamed of your masculinity Stoltenb...
1,1.0,Misleading book for managers.,Misleading book for managers. This book is ent...
2,1.0,The Black Dahlia,The Black Dahlia I received great service for ...
3,1.0,Frugal Too Much,Frugal Too Much This book is a huge disappoint...
4,1.0,Depressing and inaccurate.,Depressing and inaccurate. Like another review...


## 2. Change Score To Sentiment

Here we create a function which maps each score to one of our two sentiment classes: negative (0) for three stars or fewer, and positive (1) for four or five stars.

In [16]:
def map_sentiment(stars_received):
    if stars_received <= 3:
        return 0
    else:
        return 1

In [19]:
df_train['sentiment'] = [map_sentiment(x) for x in df_train['score']]
df_test['sentiment'] = [map_sentiment(x) for x in df_test['score']]
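
As a quick check of the mapping, the sketch below applies the same rule to each possible score. The list of scores is made up for illustration.

```python
def map_sentiment(stars_received):
    # 3 stars or fewer -> negative (0); 4 or 5 stars -> positive (1)
    return 0 if stars_received <= 3 else 1


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [map_sentiment(s) for s in scores]
print(dict(zip(scores, labels)))
# {1.0: 0, 2.0: 0, 3.0: 0, 4.0: 1, 5.0: 1}
```

On a DataFrame, a comparison such as `(df['score'] >= 4).astype(int)` should give the same labels for non-missing scores without a Python loop. The two differ on missing scores: the function returns 1 for `NaN`, because `NaN <= 3` is false, whereas the comparison gives 0.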

In [21]:
df_train.drop(['score'], axis=1, inplace=True)
df_test.drop(['score'], axis=1, inplace=True)

In [22]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,Painful,Painful I should start by saying that I have e...,0
1,I returned it!,I returned it! I was left scratching my head a...,0
2,damn foole by luke,damn foole by luke the damn foole by tom mcgra...,0
3,A waste of time and poorly written,A waste of time and poorly written This book w...,0
4,Bo-o-o-oring!,Bo-o-o-oring! I don't know what to make of thi...,0


In [24]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,how to be ashamed of your masculinity,how to be ashamed of your masculinity Stoltenb...,0
1,Misleading book for managers.,Misleading book for managers. This book is ent...,0
2,The Black Dahlia,The Black Dahlia I received great service for ...,0
3,Frugal Too Much,Frugal Too Much This book is a huge disappoint...,0
4,Depressing and inaccurate.,Depressing and inaccurate. Like another review...,0


## 3. Data Cleaning

In this section we perform various pre-processing steps on the dataset. We first remove rows which have any empty values and convert the text to lowercase for consistency. We then change abbreviations back to their standard form and remove stop words.

Subsequently, we remove all punctuation, thereby reducing unnecessary noise in the dataset. We also remove repeating characters from words, such as when someone types "hhhhhhello", and strip out numbers and very short words.

Finally we perform 'tokenization', which splits the text up into its constituent words, and 'lemmatization', which reduces each derived word to its root form, known as its lemma. The lemmatizer uses each word's part of speech to choose the lemma, which shrinks the vocabulary and should give better results.

### 3.1 Removing Rows with Empty Values

It is simple to remove any rows with empty values. We can see that there are no null values left after we carry out this step.

In [25]:
df_train.dropna(axis= 0,how='any',inplace=True)
df_test.dropna(axis= 0,how='any',inplace=True)

In [26]:
df_train.isna().sum()

summary      0
text         0
sentiment    0
dtype: int64

In [27]:
df_test.isna().sum()

summary      0
text         0
sentiment    0
dtype: int64
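
To make the effect of `dropna` concrete, here is a small sketch on a toy frame (the data is made up). `how='any'` drops a row if any of its columns is missing. The `reset_index` call is an optional addition, not something the notebook does; it renumbers the rows that remain.

```python
import pandas as pd

toy = pd.DataFrame({
    "summary": ["great", None, "awful"],
    "text": ["great read", "no summary here", None],
    "sentiment": [1, 1, 0],
})

# Keep only rows where every column has a value, then renumber from 0.
cleaned = toy.dropna(axis=0, how="any").reset_index(drop=True)
print(cleaned)
print(cleaned.isna().sum().sum())  # 0 missing values remain
```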

Now we import some packages we require for the next stage.

In [51]:
import nltk
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ego99\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 3.2 Changing all Text to Lowercase

It is also straightforward to turn all text to lowercase.

In [28]:
df_train['text']=df_train['text'].str.lower()
df_test['text']=df_test['text'].str.lower()
df_train['summary']=df_train['summary'].str.lower()
df_test['summary']=df_test['summary'].str.lower()

In [29]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,painful,painful i should start by saying that i have e...,0
1,i returned it!,i returned it! i was left scratching my head a...,0
2,damn foole by luke,damn foole by luke the damn foole by tom mcgra...,0
3,a waste of time and poorly written,a waste of time and poorly written this book w...,0
4,bo-o-o-oring!,bo-o-o-oring! i don't know what to make of thi...,0


In [30]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,how to be ashamed of your masculinity,how to be ashamed of your masculinity stoltenb...,0
1,misleading book for managers.,misleading book for managers. this book is ent...,0
2,the black dahlia,the black dahlia i received great service for ...,0
3,frugal too much,frugal too much this book is a huge disappoint...,0
4,depressing and inaccurate.,depressing and inaccurate. like another review...,0


### 3.3 Restoring Common Abbreviations

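The code below restores common abbreviations to their full form (for example, "don't" becomes "do not") and strips possessive endings. Irregular contractions are not special-cased, so "won't" becomes "wo not"; the stray "wo" is dropped later when we remove short words.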

In [31]:
def restore_common_abbr(caption):
    # "it's", "he's", "that's", ... -> "it is", "he is", "that is", ...
    pat_is = re.compile(r"(it|he|she|that|this|there|here)('s)", re.I)
    pat_s = re.compile(r"(?<=[a-zA-Z])'s")  # possessive 's after a letter
    pat_s2 = re.compile(r"(?<=s)'s?")  # possessive after a trailing s
    pat_not = re.compile(r"(?<=[a-zA-Z])n't")  # not abbreviation
    pat_would = re.compile(r"(?<=[a-zA-Z])'d")  # would abbreviation
    pat_will = re.compile(r"(?<=[a-zA-Z])'ll")  # will abbreviation
    pat_am = re.compile(r"(?<=[Ii])'m")  # am abbreviation
    pat_are = re.compile(r"(?<=[a-zA-Z])'re")  # are abbreviation
    pat_ve = re.compile(r"(?<=[a-zA-Z])'ve")  # have abbreviation

    new_text = caption
    new_text = pat_is.sub(r"\1 is", new_text)
    new_text = pat_s.sub("", new_text)
    new_text = pat_s2.sub("", new_text)
    new_text = pat_not.sub(" not", new_text)
    new_text = pat_would.sub(" would", new_text)
    new_text = pat_will.sub(" will", new_text)
    new_text = pat_am.sub(" am", new_text)
    new_text = pat_are.sub(" are", new_text)
    new_text = pat_ve.sub(" have", new_text)
    # any apostrophe still left becomes a space
    new_text = new_text.replace("'", " ")
    return new_text

In [32]:
df_train['text']= df_train['text'].apply(lambda x: restore_common_abbr(x))
df_train['summary']= df_train['summary'].apply(lambda x: restore_common_abbr(x))
df_test['text']= df_test['text'].apply(lambda x: restore_common_abbr(x))
df_test['summary']= df_test['summary'].apply(lambda x: restore_common_abbr(x))
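
The sketch below shows the substitutions at work on a few made-up sentences. It uses the same patterns in the same order, arranged as a rule table; the names `RULES` and `expand` are hypothetical and only serve this illustration.

```python
import re

# (pattern, replacement) pairs, applied from top to bottom.
RULES = [
    (re.compile(r"(it|he|she|that|this|there|here)('s)", re.I), r"\1 is"),
    (re.compile(r"(?<=[a-zA-Z])'s"), ""),       # possessive 's
    (re.compile(r"(?<=s)'s?"), ""),             # possessive after a trailing s
    (re.compile(r"(?<=[a-zA-Z])n't"), " not"),
    (re.compile(r"(?<=[a-zA-Z])'d"), " would"),
    (re.compile(r"(?<=[a-zA-Z])'ll"), " will"),
    (re.compile(r"(?<=[Ii])'m"), " am"),
    (re.compile(r"(?<=[a-zA-Z])'re"), " are"),
    (re.compile(r"(?<=[a-zA-Z])'ve"), " have"),
]


def expand(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    # any apostrophe still left becomes a space
    return text.replace("'", " ")


print(expand("i don't know, it's boring"))
# i do not know, it is boring
print(expand("i won't read the author's books"))
# i wo not read the author books
print(expand("i'm sure they'd say we've seen it"))
# i am sure they would say we have seen it
```

The second sentence shows the "wo not" behaviour for irregular contractions and the removal of the possessive ending.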

### 3.4 Removing Stop Words

To choose the stop words we would like to remove, we start with the standard English stop word list that comes with NLTK. We then inspected this list and kept words such as 'not' which we felt might still be useful for sentiment.

In [19]:
STOPWORDS = set(stopwords.words('english'))
print(STOPWORDS)

{'have', 'my', 'so', 're', 'any', 'only', 'haven', 'between', 'be', 'am', "hadn't", 'its', 've', 'needn', 'you', 'doing', "don't", 'she', 'all', 'against', 'me', 'were', 'y', 'theirs', 'hadn', 'myself', 'should', "didn't", 'our', 'themselves', "should've", "you'd", 'where', 'a', 'hers', 'here', 'are', 'didn', 'below', 'is', 'this', 'ma', "that'll", "mustn't", 'weren', 'wouldn', 'their', 'ours', 'don', 'other', 'same', "won't", 'more', 'how', "isn't", 'few', 'than', 'too', 'in', 'if', 'yours', 'ain', 'yourself', "wouldn't", 'those', "she's", 'until', 'i', 'itself', 'has', 'shouldn', 'your', 'own', 'under', 'further', 'again', 'm', 'but', 'won', 'over', 'no', 'mustn', 'that', 'while', 'during', 'to', 'above', 'such', 'will', 'shan', "wasn't", 'him', 'them', 'was', 'he', 'yourselves', 'there', 'herself', 'for', "doesn't", "couldn't", 'doesn', 'not', 'did', 'hasn', 'with', 'some', 'into', 'does', 'nor', 'o', "needn't", 'and', 'most', 'his', 'having', 'after', "you've", 'the', 'they', 'then

In [33]:
# create our own stop word set, keeping some words like 'not'
my_stopwords ={'have', 'my', 'so', 're', 'any', 'only', 'haven', 'between', 'be', 'am', "hadn't", 'its', 've', 'needn', 'you', 'doing', "don't", 'she', 'all', 'against', 'me', 'were', 'y', 'theirs', 'hadn', 'myself', 'should', "didn't", 'our', 'themselves', "should've", "you'd", 'where', 'a', 'hers', 'here', 'are', 'didn', 'below', 'is', 'this', 'ma', "that'll", "mustn't", 'weren', 'wouldn', 'their', 'ours', 'don', 'other', 'same', "won't", 'more', 'how', "isn't", 'few', 'than', 'too', 'in', 'if', 'yours', 'ain', 'yourself', "wouldn't", 'those', "she's", 'until', 'i', 'itself', 'has', 'shouldn', 'your', 'own', 'under', 'further', 'again', 'm', 'but', 'won', 'over', 'no', 'mustn', 'that', 'while', 'during', 'to', 'above', 'such', 'will', 'shan', "wasn't", 'him', 'them', 'was', 'he', 'yourselves', 'there', 'herself', 'for', "doesn't", "couldn't", 'doesn', 'did', 'hasn', 'with', 'some', 'into', 'does', 'nor', 'o', "needn't", 'and', 'most', 'his', 'having', 'after', "you've", 'the', 'they', 'then', 'isn', "shan't", 'once', 't', 'it', "shouldn't", 'been', 'can', 'being', 'of', 'about', 'just', 'down', "hasn't", 'now', 'ourselves', 'd', 'up', 'her', 'had', 'aren', 'what', 'from', "you'll", 'each', 'we', 'whom', "mightn't", 'do', 'because', 'through', 's', "weren't", 'himself', 'who', 'both', 'an', 'by', 'when', 'at', 'as', 'off', 'very', 'these', 'or', "it's", 'mightn', 'before', "you're", "haven't", 'on', 'why', 'll', 'which', 'wasn', "aren't", 'couldn', 'out'}

In [34]:
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in my_stopwords])

In [35]:
df_train['text'] = df_train['text'].apply(lambda text: cleaning_stopwords(text))
df_train['summary'] = df_train['summary'].apply(lambda text: cleaning_stopwords(text))
df_test['text'] = df_test['text'].apply(lambda text: cleaning_stopwords(text))
df_test['summary'] = df_test['summary'].apply(lambda text: cleaning_stopwords(text))
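
A small sketch of why we keep 'not': removing it can flip the apparent meaning of a review. The miniature stop word set and the helper name `remove_stopwords` are made up for illustration; the notebook uses the much larger `my_stopwords`.

```python
# Hypothetical miniature stop word set, for illustration only.
mini_stopwords = {"i", "did", "the", "this", "was", "a"}


def remove_stopwords(text, stop_words):
    return " ".join(word for word in text.split() if word not in stop_words)


review = "i did not like the book"
print(remove_stopwords(review, mini_stopwords))
# not like book
print(remove_stopwords(review, mini_stopwords | {"not"}))
# like book
```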

### 3.5 Removing Punctuation

The following code removes the punctuation characters listed in `exclude`, replacing each with a space so that words which are only separated by a piece of punctuation do not run together.

In [36]:
exclude = set('"#$%&\\()*+-/:<=>@[]^_`{|}!?,.')  # backslash written once, as a valid escape
def cleaning_punctuations(lines):
    line = ' '.join(''.join(' ' if ch in exclude else ch for ch in lines).split())
    return line

In [37]:
df_train['text'] = df_train['text'].apply(lambda x: cleaning_punctuations(x))
df_train['summary'] = df_train['summary'].apply(lambda x: cleaning_punctuations(x))
df_test['text'] = df_test['text'].apply(lambda x: cleaning_punctuations(x))
df_test['summary'] = df_test['summary'].apply(lambda x: cleaning_punctuations(x))
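
As an alternative sketch, the same replacement can be done with a translation table, which maps every excluded character to a space in a single pass. The helper name `strip_punctuation` and the sample sentence are assumptions for illustration.

```python
exclude = '"#$%&\\()*+-/:<=>@[]^_`{|}!?,.'

# Translation table: each excluded character becomes a space.
table = str.maketrans({ch: " " for ch in exclude})


def strip_punctuation(line):
    # split() followed by join collapses the runs of spaces left behind
    return " ".join(line.translate(table).split())


print(strip_punctuation("bo-o-o-oring! well...not (really) worth $10"))
# bo o o oring well not really worth 10
```

Characters that are not in `exclude`, such as `;`, pass through unchanged.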

In [38]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,painful,painful start saying enjoyed sophie kinsella b...,0
1,returned it,returned it left scratching head first chapter...,0
2,damn foole luke,damn foole luke damn foole tom mcgrath good bo...,0
3,waste time poorly written,waste time poorly written book poorly written ...,0
4,bo o o oring,bo o o oring not know make book wo not recap s...,0


In [39]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,ashamed masculinity,ashamed masculinity stoltenberg classic soft m...,0
1,misleading book managers,misleading book managers book entirely managem...,0
2,black dahlia,black dahlia received great service item not l...,0
3,frugal much,frugal much book huge disappointment even bigg...,0
4,depressing inaccurate,depressing inaccurate like another reviewer mu...,0


### 3.6 Removing Repeating Characters

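The code below collapses unwanted runs of a repeated character, so that "hhhhhhello" becomes "hello". Only runs of three or more are collapsed, which leaves ordinary double letters such as the "oo" in "book" untouched.

In [41]:
def cleaning_repeating_char(text):
    # (.) captures a character; \1{2,} matches two or more further copies of it
    return re.sub(r'(.)\1{2,}', r'\1', text)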

In [42]:
df_train['text'] = df_train['text'].apply(lambda x: cleaning_repeating_char(x))
df_train['summary'] = df_train['summary'].apply(lambda x: cleaning_repeating_char(x))
df_test['text'] = df_test['text'].apply(lambda x: cleaning_repeating_char(x))
df_test['summary'] = df_test['summary'].apply(lambda x: cleaning_repeating_char(x))
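
The sketch below illustrates how a regular expression backreference can collapse repeated characters. The threshold of three or more repeats is an assumption, chosen so that legitimate double letters survive; the helper name `collapse_repeats` is hypothetical.

```python
import re


def collapse_repeats(text):
    # (.) captures one character; \1{2,} requires two or more further copies
    return re.sub(r"(.)\1{2,}", r"\1", text)


print(collapse_repeats("hhhhhhello"))      # hello
print(collapse_repeats("sooooo goooood"))  # so god
print(collapse_repeats("book written"))    # book written
```

The second example shows a limitation: collapsing a run to a single character cannot tell that "goooood" was meant to be "good". Replacing with `r"\1\1"` instead would keep two copies, at the cost of turning "sooooo" into "soo".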

### 3.7 Removing Numeric Characters

It seems that numbers would not be particularly useful for our task, so the code below takes them out.

In [43]:
def cleaning_numbers(data):
    return re.sub('[0-9]+', ' ', data)

In [44]:
df_train['text'] = df_train['text'].apply(lambda x: cleaning_numbers(x))
df_train['summary'] = df_train['summary'].apply(lambda x: cleaning_numbers(x))
df_test['text'] = df_test['text'].apply(lambda x: cleaning_numbers(x))
df_test['summary'] = df_test['summary'].apply(lambda x: cleaning_numbers(x))
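
A short sketch of the effect on a made-up sentence. Each run of digits becomes a single space, so extra whitespace is left behind; here it is tidied with `split` and `join`, which is an addition for display and not part of the cell above.

```python
import re


def remove_numbers(text):
    # every run of digits is replaced by one space
    return re.sub(r"[0-9]+", " ", text)


raw = "read 3 chapters in 2012 then quit"
cleaned = " ".join(remove_numbers(raw).split())
print(cleaned)                      # read chapters in then quit
print(remove_numbers("mp3player"))  # mp player
```

Digits inside a token split it in two, as the last line shows.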

### 3.8 Removing Short Words

As a general rule, very short words probably do not provide much information, so the code below removes any word of fewer than three characters.

In [46]:
def transform_text(text):
    return ' '.join([word for word in text.split() if len(word) > 2])

In [47]:
df_train['text'] = df_train['text'].apply(lambda x: transform_text(x))
df_train['summary'] = df_train['summary'].apply(lambda x: transform_text(x))
df_test['text'] = df_test['text'].apply(lambda x: transform_text(x))
df_test['summary'] = df_test['summary'].apply(lambda x: transform_text(x))
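
The sketch below applies the same length filter to a fragment like the one left over from "bo-o-o-oring" and "won't" earlier in the pipeline. The helper name `drop_short_words` and its `min_length` parameter are hypothetical.

```python
def drop_short_words(text, min_length=3):
    # keep only words with at least min_length characters
    return " ".join(word for word in text.split() if len(word) >= min_length)


print(drop_short_words("bo o o oring wo not recap"))
# oring not recap
```

The filter is blunt: it also removes any genuine two-letter words that survived the stop word step.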

In [48]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,painful,painful start saying enjoyed sophie kinsella b...,0
1,returned,returned left scratching head first chapter sc...,0
2,damn foole luke,damn foole luke damn foole tom mcgrath good bo...,0
3,waste time poorly written,waste time poorly written book poorly written ...,0
4,oring,oring not know make book not recap story done ...,0


In [49]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,ashamed masculinity,ashamed masculinity stoltenberg classic soft m...,0
1,misleading book managers,misleading book managers book entirely managem...,0
2,black dahlia,black dahlia received great service item not l...,0
3,frugal much,frugal much book huge disappointment even bigg...,0
4,depressing inaccurate,depressing inaccurate like another reviewer mu...,0


### 3.9 Tokenization

NLTK provides a `word_tokenize` function to perform tokenization, so we simply run that on our text and produce the following output. We first download the 'punkt' tokenizer data.

In [52]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ego99\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [53]:
from nltk import word_tokenize

In [54]:
df_train['text'] = df_train['text'].apply(lambda x: word_tokenize(x))
df_train['summary'] = df_train['summary'].apply(lambda x: word_tokenize(x))
df_test['text'] = df_test['text'].apply(lambda x: word_tokenize(x))
df_test['summary'] = df_test['summary'].apply(lambda x: word_tokenize(x))
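
To show what tokenization produces without needing the NLTK data, here is a rough stand-in that splits on whitespace only. It is not `word_tokenize`; it merely behaves similarly on text like ours, where punctuation has already been stripped. The helper name `whitespace_tokenize` is hypothetical.

```python
def whitespace_tokenize(text):
    # Stand-in for nltk.word_tokenize: splits on runs of whitespace only.
    return text.split()


tokens = whitespace_tokenize("waste time poorly written")
print(tokens)  # ['waste', 'time', 'poorly', 'written']
```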

In [55]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,[painful],"[painful, start, saying, enjoyed, sophie, kins...",0
1,[returned],"[returned, left, scratching, head, first, chap...",0
2,"[damn, foole, luke]","[damn, foole, luke, damn, foole, tom, mcgrath,...",0
3,"[waste, time, poorly, written]","[waste, time, poorly, written, book, poorly, w...",0
4,[oring],"[oring, not, know, make, book, not, recap, sto...",0


In [56]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,"[ashamed, masculinity]","[ashamed, masculinity, stoltenberg, classic, s...",0
1,"[misleading, book, managers]","[misleading, book, managers, book, entirely, m...",0
2,"[black, dahlia]","[black, dahlia, received, great, service, item...",0
3,"[frugal, much]","[frugal, much, book, huge, disappointment, eve...",0
4,"[depressing, inaccurate]","[depressing, inaccurate, like, another, review...",0


### 3.10 Applying Lemmatizer

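We take the code which runs the lemmatization from the tutorial. It tags each token with its part of speech, maps the tag to the corresponding WordNet category, and lemmatizes the token accordingly, treating any other tag as a noun.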

In [57]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ego99\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [58]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ego99\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [59]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ego99\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [60]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# get the part of speech
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def tag_lem(doc,lemmatizer):
    tagged_sent = pos_tag(doc)  # get the part of speech

    new_s = []
    for c in tagged_sent:
        if c[0].isdigit():
            new_s.append("#number")
        elif c[0] not in string.punctuation:
            wordnet_pos = get_wordnet_pos(c[1]) or wordnet.NOUN
            new_s.append(lemmatizer.lemmatize(c[0], pos=wordnet_pos))  
    return new_s

lemmatizer = WordNetLemmatizer()

In [61]:
df_train['text'] = df_train['text'].apply(lambda x: tag_lem(x,lemmatizer))
df_train['summary'] = df_train['summary'].apply(lambda x: tag_lem(x,lemmatizer))
df_test['text'] = df_test['text'].apply(lambda x: tag_lem(x,lemmatizer))
df_test['summary'] = df_test['summary'].apply(lambda x: tag_lem(x,lemmatizer))
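
The tag mapping is the part of this step that is easy to get wrong, so here it is in isolation. This sketch needs no NLTK download: it writes WordNet's part-of-speech constants out as plain letters, and the names `PENN_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# WordNet's part-of-speech constants are single letters:
# wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r'.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    # Only the first letter of the Penn Treebank tag matters here;
    # anything else falls back to noun, as in tag_lem above.
    return PENN_TO_WORDNET.get(tag[:1], default)


for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# JJ -> a
# VBD -> v
# NNS -> n
# RB -> r
# IN -> n
```

Passing the part of speech matters because the lemmatizer looks words up per category: tagged as a verb, "written" is reduced to "write" (as in the output below), whereas treated as a noun it would typically be left unchanged.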

In [62]:
df_train.head()

Unnamed: 0,summary,text,sentiment
0,[painful],"[painful, start, say, enjoy, sophie, kinsella,...",0
1,[return],"[return, leave, scratch, head, first, chapter,...",0
2,"[damn, foole, luke]","[damn, foole, luke, damn, foole, tom, mcgrath,...",0
3,"[waste, time, poorly, write]","[waste, time, poorly, write, book, poorly, wri...",0
4,[oring],"[oring, not, know, make, book, not, recap, sto...",0


In [63]:
df_test.head()

Unnamed: 0,summary,text,sentiment
0,"[ashamed, masculinity]","[ashamed, masculinity, stoltenberg, classic, s...",0
1,"[mislead, book, manager]","[mislead, book, manager, book, entirely, manag...",0
2,"[black, dahlia]","[black, dahlia, receive, great, service, item,...",0
3,"[frugal, much]","[frugal, much, book, huge, disappointment, eve...",0
4,"[depress, inaccurate]","[depress, inaccurate, like, another, reviewer,...",0


## 4. Save Cleaned Data

Now we save our cleaned data to separate files, so that we can easily retrieve whichever one we are working on in the future. First we build data frames with the correct columns.

In [64]:
df_train_combined = pd.DataFrame()

In [67]:
df_train_combined['text']=[x for x in df_train['text']]
df_train_combined['sentiment']=[x for x in df_train['sentiment']]

In [68]:
df_train_combined.head()

Unnamed: 0,text,sentiment
0,"[painful, start, say, enjoy, sophie, kinsella,...",0
1,"[return, leave, scratch, head, first, chapter,...",0
2,"[damn, foole, luke, damn, foole, tom, mcgrath,...",0
3,"[waste, time, poorly, write, book, poorly, wri...",0
4,"[oring, not, know, make, book, not, recap, sto...",0


In [69]:
df_train_summary = pd.DataFrame()

In [70]:
df_train_summary['text']=[x for x in df_train['summary']]
df_train_summary['sentiment']=[x for x in df_train['sentiment']]

In [71]:
df_train_summary.head()

Unnamed: 0,text,sentiment
0,[painful],0
1,[return],0
2,"[damn, foole, luke]",0
3,"[waste, time, poorly, write]",0
4,[oring],0


In [72]:
df_test_combined = pd.DataFrame()

In [73]:
df_test_combined['text']=[x for x in df_test['text']]

In [74]:
df_test_combined.head()

Unnamed: 0,text
0,"[ashamed, masculinity, stoltenberg, classic, s..."
1,"[mislead, book, manager, book, entirely, manag..."
2,"[black, dahlia, receive, great, service, item,..."
3,"[frugal, much, book, huge, disappointment, eve..."
4,"[depress, inaccurate, like, another, reviewer,..."


In [76]:
df_test_summary = pd.DataFrame()

In [77]:
df_test_summary['text']=[x for x in df_test['summary']]

In [78]:
df_test_summary.head()

Unnamed: 0,text
0,"[ashamed, masculinity]"
1,"[mislead, book, manager]"
2,"[black, dahlia]"
3,"[frugal, much]"
4,"[depress, inaccurate]"


In [79]:
df_test_sentiments = pd.DataFrame()

In [80]:
df_test_sentiments['sentiment']=[x for x in df_test['sentiment']]

In [81]:
df_test_sentiments.head()

Unnamed: 0,sentiment
0,0
1,0
2,0
3,0
4,0


Now we can save our cleaned datasets as CSV files. When we ran this code on the HPC, we saved the files under the same names, but without the trailing 'D' (meaning downsampled).

In [82]:
df_train_combined.to_csv('data_cleaned_full_text_and_summariesD.csv', sep=",", index=False)
df_train_summary.to_csv('data_cleaned_full_summariesD.csv', sep=",", index=False)
df_test_combined.to_csv('test_cleaned_text_and_summariesD.csv', sep=",", index=False)
df_test_summary.to_csv('test_cleaned_summariesD.csv', sep=",", index=False)
df_test_sentiments.to_csv('reviews_test_with_sentimentD.csv', sep=",", index=False)
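
One thing to be aware of when loading these files again: a CSV file stores each token list as text, so the column comes back as strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`. The toy data and the in-memory buffer are assumptions for illustration, used here in place of a real file.

```python
import ast
import io

import pandas as pd

tokens = pd.DataFrame({
    "text": [["waste", "time"], ["black", "dahlia"]],
    "sentiment": [0, 0],
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
tokens.to_csv(buffer, sep=",", index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, "text"]))  # <class 'str'>: the list came back as text

# Parse the string form of each list back into a real Python list.
loaded["text"] = loaded["text"].apply(ast.literal_eval)
print(loaded.loc[0, "text"])        # ['waste', 'time']
```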