<h2 style="text-align:center; color: DodgerBlue">Data Cleaning</h2>

<h3>Import libraries</h3>

In [1]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# fixed seed so the train/test split is reproducible
seed = 46

<h3>Import data </h3>

In [2]:
path_data = '../yelp_academic_dataset_review.pickle'
data = pd.read_pickle(path_data)

<h3>Data exploration</h3>

In [3]:
data.shape

(1569264, 10)

In [4]:
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes_cool,votes_funny,votes_useful
0,vcNAWiLM4dR7D2nwwJ7nCA,2007-05-17,15SdjuK7DmYqUAj6rjGowg,5,dr. goldberg offers everything i look for in a...,review,Xqd0DzHaiyRqVH3WRG7hzg,1,0,2
1,vcNAWiLM4dR7D2nwwJ7nCA,2010-03-22,RF6UnRTtG7tWMcrO2GEoAg,2,"Unfortunately, the frustration of being Dr. Go...",review,H1kH6QZV7Le4zqTRNxoZow,0,0,2
2,vcNAWiLM4dR7D2nwwJ7nCA,2012-02-14,-TsVN230RCkLYKBeLsuz7A,4,Dr. Goldberg has been my doctor for years and ...,review,zvJCcrpm2yOZrxKffwGQLA,1,0,1
3,vcNAWiLM4dR7D2nwwJ7nCA,2012-03-02,dNocEAyUucjT371NNND41Q,4,Been going to Dr. Goldberg for over 10 years. ...,review,KBLW4wJA_fwoWmMhiHRVOA,0,0,0
4,vcNAWiLM4dR7D2nwwJ7nCA,2012-05-15,ebcN2aqmNUuYNoyvQErgnA,4,Got a letter in the mail last week that said D...,review,zvJCcrpm2yOZrxKffwGQLA,1,0,2


<u>Removing newlines in text</u>

In [5]:
data.loc[2, 'text']

"Dr. Goldberg has been my doctor for years and I like him.  I've found his office to be fairly efficient.  Today I actually got to see the doctor a few minutes early!  \n\nHe seems very engaged with his patients and his demeanor is friendly, yet authoritative.    \n\nI'm glad to have Dr. Goldberg as my doctor."

The text above contains newline characters ('\n') between paragraphs. They are unnecessary for our study, so we remove them.

In [6]:
# Remove every '\n' character using a list comprehension
data['text'] = [txt.replace('\n', '') for txt in data['text']]

In [7]:
data.loc[2, 'text']

"Dr. Goldberg has been my doctor for years and I like him.  I've found his office to be fairly efficient.  Today I actually got to see the doctor a few minutes early!  He seems very engaged with his patients and his demeanor is friendly, yet authoritative.    I'm glad to have Dr. Goldberg as my doctor."

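Deleting '\n' outright can fuse the words on either side when no space surrounds the newline (the `fan!Went` in the `test_df` preview further down looks like such a case). As a minimal sketch of an alternative, and not what this notebook ran, the hypothetical helper `collapse_whitespace` below replaces every run of whitespace with a single space. Note that it also collapses double spaces, so its output would differ from the cell above.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample shaped like the review shown above
sample = "early!  \n\nHe seems very engaged.    \n\nI'm glad."
cleaned = collapse_whitespace(sample)
print(cleaned)
```

On the real data this could be applied with `data['text'].map(collapse_whitespace)`.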
<h3>Framing the scope of our project</h3>

We will use only the review text and the star rating for our study.

In [8]:
data = data.loc[:, ['text', 'stars']]

In [9]:
data.dtypes

text     object
stars     int64
dtype: object

Check missing data:

In [10]:
data.isnull().sum()

text     0
stars    0
dtype: int64

The dataset <u>does not contain any missing values</u> in the two columns we keep for our study.
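`isnull()` only counts `NaN`/`None`, so an empty or whitespace-only review would pass the check above. The sketch below shows one way to look for such rows. The `toy` frame is a hypothetical stand-in for `data`, and nothing here says the real dataset contains blank reviews.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (hypothetical rows)
toy = pd.DataFrame({
    "text": ["Great food", "", "   ", "Slow service"],
    "stars": [5, 3, 1, 2],
})

# isnull() finds nothing missing in this frame ...
n_null = int(toy["text"].isnull().sum())

# ... yet two reviews contain no visible characters
is_blank = toy["text"].str.strip().eq("")
n_blank = int(is_blank.sum())

print(n_null, n_blank)
```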

<h3>Split train/test data</h3>

In [11]:
X, y = data.loc[:, 'text'], data.loc[:, 'stars']

test_size = 0.20

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

print("X_train.shape: ", X_train.shape, "y_train.shape: ", y_train.shape)
print("X_test.shape: ", X_test.shape, "y_test.shape: ", y_test.shape)

X_train.shape:  (1255411,) y_train.shape:  (1255411,)
X_test.shape:  (313853,) y_test.shape:  (313853,)
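The reported shapes can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 1569264   # rows reported by data.shape
test_size = 0.20

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```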


<h3>Saving to CSV</h3>

In [12]:
X_train.head()

73295      The service was slow and the restaurant is dat...
1370564    The food offerings are a bit interesting, but ...
620464     Ah PT's.  Standard restaurant/bar that is supe...
1255454    I bought my wedding corset from Belle and vowe...
1163010                      They have amazing fried pickles
Name: text, dtype: object

In [13]:
y_train.head()

73295      1
1370564    4
620464     3
1255454    5
1163010    4
Name: stars, dtype: int64

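In [14]:
# Stack features and labels into a 2 x n array (row 0: text, row 1: stars).
# Mixing strings and integers makes this an object array.
train_df = np.vstack((X_train, y_train))
test_df = np.vstack((X_test, y_test))

In [15]:
# Rebuild DataFrames from the stacked rows; the index restarts at 0
train_df = pd.DataFrame({'text': train_df[0, :], 'stars': train_df[1, :]})
test_df = pd.DataFrame({'text': test_df[0, :], 'stars': test_df[1, :]})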
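Going through `np.vstack` turns the integer `stars` values into generic Python objects until the CSV is read back. A possible alternative, sketched here on hypothetical toy Series rather than the notebook's real split, is to join the two Series directly with `pd.concat` and reset the index, which keeps the integer dtype.

```python
import pandas as pd

# Toy Series with shuffled index labels, shaped like X_train / y_train (hypothetical values)
labels = [73295, 1370564, 620464]
X_toy = pd.Series(["slow service", "great pickles", "ok"], index=labels, name="text")
y_toy = pd.Series([1, 4, 3], index=labels, name="stars")

# Align on the shared index, then replace it with 0..n-1
toy_df = pd.concat([X_toy, y_toy], axis=1).reset_index(drop=True)

print(toy_df.dtypes)
```

With the real split this would read `pd.concat([X_train, y_train], axis=1).reset_index(drop=True)`. The columns come out in the order `text`, `stars`.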

In [16]:
train_df.head()

Unnamed: 0,stars,text
0,1,The service was slow and the restaurant is dat...
1,4,"The food offerings are a bit interesting, but ..."
2,3,Ah PT's. Standard restaurant/bar that is supe...
3,5,I bought my wedding corset from Belle and vowe...
4,4,They have amazing fried pickles


In [17]:
test_df.head()

Unnamed: 0,stars,text
0,4,"I""m still a fan!Went last Saturday night. Defi..."
1,5,Dr. Ruiz and all of the staff make everyone co...
2,5,Stumbled upon this place last night..Wow what ...
3,4,Big City Bagels has excellent service and bage...
4,5,Hands down the best show in Las Vegas!!


In [18]:
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)

In [19]:
train_df = pd.read_csv("train_df.csv")
train_df.head()

Unnamed: 0,stars,text
0,1,The service was slow and the restaurant is dat...
1,4,"The food offerings are a bit interesting, but ..."
2,3,Ah PT's. Standard restaurant/bar that is supe...
3,5,I bought my wedding corset from Belle and vowe...
4,4,They have amazing fried pickles


In [20]:
test_df = pd.read_csv("test_df.csv")
test_df.head()

Unnamed: 0,stars,text
0,4,"I""m still a fan!Went last Saturday night. Defi..."
1,5,Dr. Ruiz and all of the staff make everyone co...
2,5,Stumbled upon this place last night..Wow what ...
3,4,Big City Bagels has excellent service and bage...
4,5,Hands down the best show in Las Vegas!!
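Comparing `head()` by eye only covers the first five rows. As a rough sketch of a stricter round-trip check, the example below writes a hypothetical toy frame to an in-memory buffer instead of a file and compares every value after reading it back. The rows include a quote, a comma and a newline, the characters most likely to trip up CSV parsing.

```python
import io
import pandas as pd

# Hypothetical rows with characters that need CSV quoting
toy = pd.DataFrame({
    "text": ['I"m still a fan!', "Good, but pricey", "line one\nline two"],
    "stars": [4, 3, 5],
})

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare the full columns, not just the first rows
same_text = restored["text"].tolist() == toy["text"].tolist()
same_stars = restored["stars"].tolist() == toy["stars"].tolist()
print(same_text, same_stars)
```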
