# Actionable Insights from Lululemon Reviews - Dataframe Construction and Cleaning

Amanda Cheney  
Metis Project 4  
Part 2 of 4    
November 13, 2020  

**Objective**  

Use natural language processing and unsupervised learning to explore customer reviews of Lululemon's bestselling sports bras, derive actionable insights for the product development and management teams, and build a recommender system that serves a curated collection of reviews tailored to each customer's product needs.

**Data Sources**   
9,000+ reviews of all 13 of Lululemon's bestselling sports bras, collected using Selenium. 

**This Notebook**  
Combines all of the scraped data into one large dataframe, then performs cleaning and feature engineering to arrive at a final dataframe ready for NLP preprocessing in the following notebook.

## Imports

In [1]:
import pandas as pd
import pickle

## Load Product Reviews

In [2]:
with open('zero.pickle', 'rb') as read_file:
    zero = pickle.load(read_file)

In [3]:
len(zero)

89

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good! Note that the number of total ratings matches the length of the list.

In [4]:
zero[0]

{'product_name': 'Like a Cloud Bra Light Support, B/C Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Like-a-Cloud-Bra/_/prod9960745?color=45609',
 'product_list_price': '$58.00 USD',
 'product_avg_rating': '4.6',
 'title': "Haven't taken it off",
 'content': 'I purchased this bra in the hazy jade color about 2 weeks ago and have been wearing it everyday since... My fiancé is so sick of it that he went out and bought it in the other colors. Please make more!!!',
 'rating': '5',
 'name': 'Anonymous',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(89)'}

In [5]:
zero = pd.DataFrame(zero)

In [6]:
with open('one.pickle', 'rb') as read_file:
    one = pickle.load(read_file)

In [7]:
len(one)

3036

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [8]:
one[0]

{'product_name': 'Energy Bra Medium Support, B–D Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Energy-Bra-32925/_/prod9360058?color=47199',
 'product_list_price': '$52.00 USD',
 'product_avg_rating': '4.1',
 'title': 'Comfortable, supportive exercise bra',
 'content': 'Good bra that looks good under exercise wear and is soft',
 'rating': '5',
 'name': 'Nancy Lulu',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(3069)'}
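
The comparison between the number of scraped reviews and the reported rating count can be sketched as a small helper. This is only an illustration: `scraped_vs_reported` and the `sample` list are hypothetical names that do not appear in the original notebook, and the helper assumes every review dict carries a `num_total_ratings` string such as `'(3069)'`.

```python
def scraped_vs_reported(reviews):
    """Return (scraped, reported, difference) for one product's list of review dicts."""
    scraped = len(reviews)
    # num_total_ratings is stored as a string wrapped in parentheses, e.g. '(3069)'
    reported = int(reviews[0]['num_total_ratings'].strip('()'))
    return scraped, reported, reported - scraped


# Hypothetical stand-in for a scraped list: 3 reviews collected, 5 ratings reported
sample = [{'num_total_ratings': '(5)'} for _ in range(3)]
print(scraped_vs_reported(sample))  # (3, 5, 2)
```

Applied to the real lists, a small positive difference would be consistent with ratings that were left without a written review.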

In [9]:
one = pd.DataFrame(one)

In [10]:
with open('two.pickle', 'rb') as read_file:
    two = pickle.load(read_file)

In [11]:
len(two)

1534

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [12]:
two[0]

{'product_name': 'Free To Be Bra Wild Light Support, A/B Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Free-To-Be-Bra-Wild/_/prod2810229?color=45881',
 'product_list_price': '$48.00 USD',
 'product_avg_rating': '4.1',
 'title': 'Straps feel apart',
 'content': 'Loved this bra for comfort and to wear for everyday use when feeling lazy but still want some structure and might support. However the straps are too thin and two have already snapped off where they join to the band. Super disappointing.',
 'rating': '2',
 'name': 'TheBreezie1',
 'date': '2020-10-27',
 'review counter': 1,
 'num_total_ratings': '(1551)'}

In [13]:
two = pd.DataFrame(two)

In [14]:
with open('three.pickle', 'rb') as read_file:
    three = pickle.load(read_file)

In [15]:
len(three)

53

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good!

In [16]:
three[0]

{'product_name': 'Invigorate Bra Long Line Medium Support, B/C Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Invigorate-Bra-Long-Line/_/prod8780603?color=0001',
 'product_list_price': '$58.00 USD',
 'product_avg_rating': '4.2',
 'title': 'Great sports bra',
 'content': 'Awesome support and coverage! I’m a UK 32K (US 32O) and it covers me well in a size 10 with minimal bounce',
 'rating': '5',
 'name': 'AshleyNM416',
 'date': '2020-10-15',
 'review counter': 1,
 'num_total_ratings': '(53)'}

In [17]:
three = pd.DataFrame(three)

In [18]:
with open('four.pickle', 'rb') as read_file:
    four = pickle.load(read_file)

In [19]:
len(four)

848

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [20]:
four[0]

{'product_name': 'Flow Y Bra Nulu Light Support, B/C Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Flow-Y-Bra-Nulu/_/prod8910081?color=46599',
 'product_list_price': '$48.00 USD',
 'product_avg_rating': '3.8',
 'title': 'Comfort',
 'content': 'I was so shocked by how accommodating this bra is for wide backs and broad shoulders. I absolutely love it and will be buying more!',
 'rating': '5',
 'name': 'Caitlin Brown',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(852)'}

In [21]:
four = pd.DataFrame(four)

In [22]:
with open('five.pickle', 'rb') as read_file:
    five = pickle.load(read_file)

In [23]:
len(five)

123

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good!

In [24]:
five[0]

{'product_name': 'Ebb to Street Bra Light Support, C/D Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Ebb-to-Street-Bra-CD/_/prod9750488?color=19964',
 'product_list_price': '$54.00 USD',
 'product_avg_rating': '4.2',
 'title': 'Awesome support and coverage',
 'content': '34dd and absolutely love this bra. I got my normal bra size 8 and it fits perfectly. No side boob, very supportive, a bit of cleavage but it’s tasteful, and very comfortable. So excited lulu is finally making bras in styles for larger busts.',
 'rating': '5',
 'name': 'Hyksnuiijbih',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(123)'}

In [25]:
five = pd.DataFrame(five)

In [26]:
with open('six.pickle', 'rb') as read_file:
    six = pickle.load(read_file)

In [27]:
len(six)

691

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
It looks like it's missing information for average product rating... 

In [28]:
six[0]

{'product_name': 'Enlite Bra Zip Front High Support, A–E Cups',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Enlite-Bra-Zip-Front/_/prod9090126?color=0001',
 'product_list_price': '$108.00 USD',
 'product_avg_rating': '',
 'title': 'Fantastic Sports Bra',
 'content': 'I have a hard time finding a bra that works for running or a high impact workout.',
 'rating': '5',
 'name': 'Kristin1974',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(707)'}

... this one too....

In [29]:
six[300]

{'product_name': 'Enlite Bra Zip Front High Support, A–E Cups',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Enlite-Bra-Zip-Front/_/prod9090126?color=0001',
 'product_list_price': '$108.00 USD',
 'product_avg_rating': '',
 'title': 'Best Sports Bra Ever!',
 'content': "This is the best and only sports bra I will wear for any type of high-intensity, or cardio workout. Running is my exercise of choice, and with a large 34DD chest, finding the right support without having to double up is amazing. I find these are true to size, tight, and keep the girls in place. You can't really avoid a uni-boob with large boobs and a sports bra, but this one does its best. No bouncing, no pain, just the gals strapped in there for a long run.\nI would recommend this bra to anyone. It keeps its shape, stays up, and you never have to awkwardly adjust it while working out. I have both the back close, and the front zip one - the front zip one is my favourite and I will definitely be buying 

For some reason this product is missing its average product rating. All the other information appears to be intact, so I will manually fill in the product rating in the DataFrame using the value shown on the product page.

In [30]:
six = pd.DataFrame(six)

In [31]:
six['product_avg_rating']='3.5'
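
Rather than spotting blank fields by eye, one could count empty strings per column. The sketch below is an assumption about how such a check might look, not part of the original notebook; `blank_counts` and the `demo` frame are hypothetical.

```python
import pandas as pd


def blank_counts(frame):
    """Count empty-string cells in each column of a dataframe."""
    return (frame == '').sum()


# Hypothetical frame that mimics a product whose average rating failed to scrape
demo = pd.DataFrame({
    'product_avg_rating': ['', ''],
    'title': ['Great', ''],
    'rating': ['5', '4'],
})
print(blank_counts(demo))
```

A column whose blank count equals the number of rows, like `product_avg_rating` here, points to a field that was never captured for that product.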

In [32]:
with open('seven.pickle', 'rb') as read_file:
    seven = pickle.load(read_file)

In [33]:
len(seven)

860

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [34]:
seven[0]

{'product_name': 'Energy Bra Long Line Medium Support, B–D Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Energy-Bra-Long-Line/_/prod9030660?color=47097',
 'product_list_price': '$58.00 USD',
 'product_avg_rating': '4.1',
 'title': 'Looking fly while working out and relaxing!',
 'content': 'Great bra for most day wear. Good for yoga, and leisure while looking fly.',
 'rating': '4',
 'name': 'Anna loves yoga athletics',
 'date': '2020-10-30',
 'review counter': 1,
 'num_total_ratings': '(870)'}

In [35]:
seven = pd.DataFrame(seven)

In [36]:
with open('eight.pickle', 'rb') as read_file:
    eight = pickle.load(read_file)

In [37]:
len(eight)

308

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [38]:
eight[0]

{'product_name': 'Ebb To Street Bra Light Support, A/B Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Ebb-To-Street-Bra-II/_/prod9270834?color=31382',
 'product_list_price': '$54.00 USD',
 'product_avg_rating': '3.1',
 'title': 'Not as good as I thought it would be :(',
 'content': 'Excellent fabric, not the best fit at all. Pretty disappointed :(',
 'rating': '1',
 'name': 'AliciaLifts',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(340)'}

In [39]:
eight = pd.DataFrame(eight)

In [40]:
with open('nine.pickle', 'rb') as read_file:
    nine = pickle.load(read_file)

In [41]:
len(nine)

540

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [42]:
nine[0]

{'product_name': 'Run Times Bra High Support, B–E Cups',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Run-Times-Bra/_/prod9520104?color=45773',
 'product_list_price': '$68.00 USD',
 'product_avg_rating': '4.0',
 'title': 'Perfect for big chested women on petite women',
 'content': 'I love this bra! I needed a high to medium support for an intense workout. Always having a hard time to find a sports bra that fits perfectly. I wear 32DDD or 34DD on a regular bra and this one is of the workout rare items for big chested small frame women!',
 'rating': '5',
 'name': 'Angies2020',
 'date': '2020-10-30',
 'review counter': 1,
 'num_total_ratings': '(558)'}

In [43]:
nine = pd.DataFrame(nine)

In [44]:
with open('ten.pickle', 'rb') as read_file:
    ten = pickle.load(read_file)

In [45]:
len(ten)

334

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [46]:
ten[0]

{'product_name': 'Energy Bra High Neck Medium Support, B–D Cup',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Energy-Bra-High-Neck/_/prod9270907?color=28948',
 'product_list_price': '$58.00 USD',
 'product_avg_rating': '4.3',
 'title': 'Good!!',
 'content': 'Comfortable, cute, but I feel like the cups should be better quality and more supportive for the price!',
 'rating': '4',
 'name': 'Britney Fitz',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(340)'}

In [47]:
ten = pd.DataFrame(ten)

In [48]:
with open('eleven.pickle', 'rb') as read_file:
    eleven = pickle.load(read_file)

In [49]:
len(eleven)

225

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [50]:
eleven[0]

{'product_name': 'Free To Be Serene Bra Long Line Light Support, C/D Cup Online Only',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Free-To-Be-Serene-Bra-Long-Line/_/prod9360057?color=45881',
 'product_list_price': '$58.00 USD',
 'product_avg_rating': '4.2',
 'title': 'Super comfortable!',
 'content': 'Flattering neckline and long line cut. I find it supportive enough for cross-training.',
 'rating': '5',
 'name': 'Anonymous',
 'date': '2020-10-24',
 'review counter': 1,
 'num_total_ratings': '(228)'}

In [51]:
eleven = pd.DataFrame(eleven)

In [52]:
eleven.shape

(225, 11)

In [53]:
with open('twelve.pickle', 'rb') as read_file:
    twelve = pickle.load(read_file)

In [54]:
len(twelve)

486

Let's have a look at an individual review to make sure we got all the pieces of information we wanted to get from scraping.   
This one looks good. The number of total ratings differs slightly from the length of the list, possibly because a few people gave a star rating without writing a review.

In [55]:
twelve[0]

{'product_name': 'Enlite Bra Weave High Support, A–E Cup Online Only',
 'product_url': 'https://shop.lululemon.com/p/women-sports-bras/Enlite-Bra-Weave/_/prod9370109?color=0001',
 'product_list_price': '$98.00 USD',
 'product_avg_rating': '3.3',
 'title': 'Fits small',
 'content': 'I’m not sure if the bar is supposed to be so tight. I saw some of the reviews and though I’d better go one size up.\nI had a very hard time hooking the clasps. The bar is very tight. This is my first day wearing it, so I can only hope the fabric loose a up a bit.\nI bought it for horseback rising. And on the ground, this bra does have a good high impact support.',
 'rating': '3',
 'name': 'Anonymous',
 'date': '2020-10-31',
 'review counter': 1,
 'num_total_ratings': '(495)'}

In [56]:
twelve = pd.DataFrame(twelve)

In [57]:
twelve.shape

(486, 11)
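
The thirteen load-and-convert steps above all follow the same pattern, so they could be expressed as a loop. The sketch below is a hypothetical alternative, not the notebook's code: `load_review_frames` is an invented helper, and the demo writes throwaway pickle files to a temporary directory so it runs on its own. It does not include the manual `product_avg_rating` fix applied to `six`.

```python
import os
import pickle
import tempfile

import pandas as pd


def load_review_frames(names, directory='.'):
    """Load '<name>.pickle' for each name and return a dict of dataframes."""
    frames = {}
    for name in names:
        path = os.path.join(directory, f'{name}.pickle')
        with open(path, 'rb') as read_file:
            frames[name] = pd.DataFrame(pickle.load(read_file))
    return frames


# Demo with throwaway files standing in for the scraped pickles
with tempfile.TemporaryDirectory() as tmp:
    for name, n_reviews in [('zero', 2), ('one', 3)]:
        with open(os.path.join(tmp, f'{name}.pickle'), 'wb') as to_write:
            pickle.dump([{'rating': '5', 'content': 'demo'}] * n_reviews, to_write)
    demo_frames = load_review_frames(['zero', 'one'], directory=tmp)

print({name: frame.shape for name, frame in demo_frames.items()})
```

Keeping the frames in a dict also makes it easy to pass `demo_frames.values()` straight to `pd.concat`.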

## Build Dataframe

Now that we have all the individual product dataframes, I will concatenate them into a single dataframe containing the entire set of product reviews.

In [58]:
df = pd.concat([zero, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve])

In [59]:
df.shape

(9127, 11)
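
By default `pd.concat` keeps each input frame's own index, so row labels repeat across products in the combined dataframe. The toy example below (with made-up frames `a` and `b`) shows the difference between the default and `ignore_index=True`; it is an illustration only, and the notebook itself keeps the default.

```python
import pandas as pd

a = pd.DataFrame({'rating': ['5', '3']})
b = pd.DataFrame({'rating': ['1', '4', '2']})

kept = pd.concat([a, b])                      # default: original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # labels renumbered 0..n-1

print(kept.index.tolist())   # [0, 1, 0, 1, 2]
print(fresh.index.tolist())  # [0, 1, 2, 3, 4]
```

With repeated labels, a lookup such as `df.loc[0]` returns one row per product rather than a single review, which is worth keeping in mind when indexing by label later on.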

In [60]:
with open('df.pickle', 'wb') as to_write:
    pickle.dump(df, to_write)

In [61]:
df.head()

| | product_name | product_url | product_list_price | product_avg_rating | title | content | rating | name | date | review counter | num_total_ratings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Like a Cloud Bra Light Support, B/C Cup | https://shop.lululemon.com/p/women-sports-bras... | $58.00 USD | 4.6 | Haven't taken it off | I purchased this bra in the hazy jade color ab... | 5 | Anonymous | 2020-10-31 | 1 | (89) |
| 1 | Like a Cloud Bra Light Support, B/C Cup | https://shop.lululemon.com/p/women-sports-bras... | $58.00 USD | 4.6 | comfortable, but... | comfortable, but not for small frame/big chest... | 3 | Ashley Traister | 2020-10-31 | 2 | (89) |
| 2 | Like a Cloud Bra Light Support, B/C Cup | https://shop.lululemon.com/p/women-sports-bras... | $58.00 USD | 4.6 | Amazing | If it weren't for the price, I would replace a... | 5 | meaglee321 | 2020-10-31 | 3 | (89) |
| 3 | Like a Cloud Bra Light Support, B/C Cup | https://shop.lululemon.com/p/women-sports-bras... | $58.00 USD | 4.6 | Comfort Bra 5*’s | Super soft and comfortable to wear all day. Ot... | 5 | Craftybayler | 2020-10-30 | 4 | (89) |
| 4 | Like a Cloud Bra Light Support, B/C Cup | https://shop.lululemon.com/p/women-sports-bras... | $58.00 USD | 4.6 | So comfortable! | I originally bought 2 but I’m buying more. I w... | 5 | Ash the mail lady | 2020-10-30 | 5 | (89) |


## Data Cleaning  
Overall things look good, but let's do a bit of cleaning.

In [62]:
df_cleaned = df.copy()

In [63]:
df_cleaned.shape

(9127, 11)

In [64]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9127 entries, 0 to 485
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   product_name        9127 non-null   object
 1   product_url         9127 non-null   object
 2   product_list_price  9127 non-null   object
 3   product_avg_rating  9127 non-null   object
 4   title               9127 non-null   object
 5   content             9127 non-null   object
 6   rating              9127 non-null   object
 7   name                9127 non-null   object
 8   date                9127 non-null   object
 9   review counter      9127 non-null   int64 
 10  num_total_ratings   9127 non-null   object
dtypes: int64(1), object(10)
memory usage: 855.7+ KB


Several of these features should be turned into ints or floats.

In [65]:
df_cleaned.rating.unique()

array(['5', '3', '1', '2', '4'], dtype=object)

In [66]:
df_cleaned['rating'] = df_cleaned['rating'].astype(int)

In [67]:
df_cleaned.rating.unique()

array([5, 3, 1, 2, 4])

In [68]:
df_cleaned.num_total_ratings.unique()

array(['(89)', '(3069)', '(1551)', '(53)', '(852)', '(123)', '(707)',
       '(870)', '(340)', '(558)', '(228)', '(495)'], dtype=object)

In [69]:
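# regex=False treats the parentheses as literal characters rather than regex syntax
df_cleaned.num_total_ratings = df_cleaned.num_total_ratings.str.replace(")", "", regex=False)
df_cleaned.num_total_ratings = df_cleaned.num_total_ratings.str.replace("(", "", regex=False)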
df_cleaned['num_total_ratings'] = df_cleaned['num_total_ratings'].astype(int)

In [70]:
df_cleaned.num_total_ratings.unique()

array([  89, 3069, 1551,   53,  852,  123,  707,  870,  340,  558,  228,
        495])
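
Since the parentheses only ever appear at the ends of the string, the same cleanup can be sketched with a single `str.strip` call. This is an alternative shown on a small made-up Series (`counts`), not the code the notebook ran.

```python
import pandas as pd

counts = pd.Series(['(89)', '(3069)', '(1551)'])

# strip('()') removes any leading/trailing '(' or ')' characters
as_int = counts.str.strip('()').astype(int)

print(as_int.tolist())  # [89, 3069, 1551]
```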

In [71]:
df_cleaned['product_avg_rating'].unique()

array(['4.6', '4.1', '4.2', '3.8', '3.5', '3.1', '4.0', '4.3', '3.3'],
      dtype=object)

The unique values confirm that the empty average rating for the Enlite Bra Zip Front was filled in earlier (3.5), so the column can be converted straight to float.

In [72]:
df_cleaned['product_avg_rating'] = df_cleaned['product_avg_rating'].astype(float)

In [73]:
df_cleaned['product_list_price'].unique()

array(['$58.00 USD', '$52.00 USD', '$48.00 USD', '$54.00 USD',
       '$108.00 USD', '$68.00 USD', '$98.00 USD'], dtype=object)

In [74]:
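# regex=False so that "$" is removed as a literal character, not read as a regex anchor
df_cleaned.product_list_price = df_cleaned.product_list_price.str.replace("$", "", regex=False)
df_cleaned.product_list_price = df_cleaned.product_list_price.str.replace("USD", "", regex=False)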
df_cleaned['product_list_price'] = df_cleaned['product_list_price'].astype(float)

In [75]:
df_cleaned['product_list_price'].unique()

array([ 58.,  52.,  48.,  54., 108.,  68.,  98.])
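
Another way to pull the numeric part out of the price strings is a regular expression with `str.extract`, which avoids listing each symbol to remove. The pattern and the `prices` Series below are illustrative assumptions and are not taken from the notebook.

```python
import pandas as pd

prices = pd.Series(['$58.00 USD', '$108.00 USD'])

# Capture the first run of digits with an optional decimal part
amount = prices.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)

print(amount.tolist())  # [58.0, 108.0]
```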

In [76]:
df_cleaned['date'].unique()

array(['2020-10-31', '2020-10-30', '2020-10-28', ..., '2019-06-14',
       '2019-09-01', '2019-06-28'], dtype=object)

In [77]:
df_cleaned['date'] =  pd.to_datetime(df_cleaned['date'])
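
The scraped dates all appear in `YYYY-MM-DD` form, so the conversion could also state that format explicitly and flag anything that does not match. The sketch below uses a made-up Series (`dates`) with one deliberately bad value; it is an optional hardening step, not something the notebook does.

```python
import pandas as pd

dates = pd.Series(['2020-10-31', '2019-06-14', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().sum())  # 1
```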

In [78]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9127 entries, 0 to 485
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   product_name        9127 non-null   object        
 1   product_url         9127 non-null   object        
 2   product_list_price  9127 non-null   float64       
 3   product_avg_rating  9127 non-null   float64       
 4   title               9127 non-null   object        
 5   content             9127 non-null   object        
 6   rating              9127 non-null   int64         
 7   name                9127 non-null   object        
 8   date                9127 non-null   datetime64[ns]
 9   review counter      9127 non-null   int64         
 10  num_total_ratings   9127 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(5)
memory usage: 855.7+ KB


## Preliminary EDA

In [79]:
df_cleaned.describe()

| | product_list_price | product_avg_rating | rating | review counter | num_total_ratings |
|---|---|---|---|---|---|
| count | 9127.0 | 9127.0 | 9127.0 | 9127.0 | 9127.0 |
| mean | 59.712501 | 3.961028 | 3.96965 | 784.839706 | 1587.956612 |
| std | 17.785354 | 0.298449 | 1.372717 | 780.487433 | 1110.712871 |
| min | 48.0 | 3.1 | 1.0 | 1.0 | 53.0 |
| 25% | 48.0 | 3.8 | 3.0 | 202.0 | 707.0 |
| 50% | 52.0 | 4.1 | 5.0 | 491.0 | 1551.0 |
| 75% | 58.0 | 4.1 | 5.0 | 1144.5 | 3069.0 |
| max | 108.0 | 4.6 | 5.0 | 3036.0 | 3069.0 |


In [80]:
df_cleaned.product_name.value_counts(normalize=True)

Energy Bra Medium Support, B–D Cup                                    0.332639
Free To Be Bra Wild Light Support, A/B Cup                            0.168073
Energy Bra Long Line Medium Support, B–D Cup                          0.094226
Flow Y Bra Nulu Light Support, B/C Cup                                0.092911
Enlite Bra Zip Front High Support, A–E Cups                           0.075709
Run Times Bra High Support, B–E Cups                                  0.059165
Enlite Bra Weave High Support, A–E Cup Online Only                    0.053249
Energy Bra High Neck Medium Support, B–D Cup                          0.036595
Ebb To Street Bra Light Support, A/B Cup                              0.033746
Free To Be Serene Bra Long Line Light Support, C/D Cup Online Only    0.024652
Ebb to Street Bra Light Support, C/D Cup                              0.013476
Like a Cloud Bra Light Support, B/C Cup                               0.009751
Invigorate Bra Long Line Medium Support, B/C Cup    

In [81]:
df_cleaned.product_avg_rating.value_counts(normalize=True)

4.1    0.594938
3.8    0.092911
3.5    0.075709
4.0    0.059165
3.3    0.053249
4.2    0.043936
4.3    0.036595
3.1    0.033746
4.6    0.009751
Name: product_avg_rating, dtype: float64

In [95]:
print("mean: {:.2f}\n median: {:.2f}\n std dev: {:.2f}\n ".format(df_cleaned.rating.mean(), df_cleaned.rating.median(), df_cleaned.rating.std()))

mean: 3.97
 median: 5.00
 std dev: 1.37
 


In [82]:
df_cleaned["rev_length"] = df_cleaned['content'].apply(len)

In [83]:
df_cleaned.rev_length.median(), df_cleaned.rev_length.mean(), df_cleaned.rev_length.std() 

(256.0, 311.9513531280815, 236.4312606315249)
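
The review length here is a character count, so a review with empty content gets length 0. The sketch below shows the vectorised `str.len()` equivalent on a made-up Series (`reviews`) and, purely as an illustration that is not in the notebook, a word count alongside it.

```python
import pandas as pd

reviews = pd.Series(['Good bra that is soft', '', 'Fits small'])

char_len = reviews.str.len()               # same result as .apply(len)
word_len = reviews.str.split().str.len()   # whitespace-separated tokens

print(char_len.tolist())  # [21, 0, 10]
print(word_len.tolist())  # [5, 0, 2]
```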

In [84]:
df_cleaned.describe()

| | product_list_price | product_avg_rating | rating | review counter | num_total_ratings | rev_length |
|---|---|---|---|---|---|---|
| count | 9127.0 | 9127.0 | 9127.0 | 9127.0 | 9127.0 | 9127.0 |
| mean | 59.712501 | 3.961028 | 3.96965 | 784.839706 | 1587.956612 | 311.951353 |
| std | 17.785354 | 0.298449 | 1.372717 | 780.487433 | 1110.712871 | 236.431261 |
| min | 48.0 | 3.1 | 1.0 | 1.0 | 53.0 | 0.0 |
| 25% | 48.0 | 3.8 | 3.0 | 202.0 | 707.0 | 144.0 |
| 50% | 52.0 | 4.1 | 5.0 | 491.0 | 1551.0 | 256.0 |
| 75% | 58.0 | 4.1 | 5.0 | 1144.5 | 3069.0 | 415.0 |
| max | 108.0 | 4.6 | 5.0 | 3036.0 | 3069.0 | 2479.0 |


The minimum review length of 0 shows that some reviews have no text. As a final step, check for rows with empty content and remove them.

In [110]:
empty = df_cleaned[df_cleaned.content == '']

In [116]:
empty.shape 

(37, 12)

In [114]:
empty.head()

| | product_name | product_url | product_list_price | product_avg_rating | title | content | rating | name | date | review counter | num_total_ratings | rev_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 688 | Energy Bra Medium Support, B–D Cup | https://shop.lululemon.com/p/women-sports-bras... | 52.0 | 4.1 | | | 1 | AlexYork20 | 2020-03-02 | 689 | 3069 | 0 |
| 690 | Energy Bra Medium Support, B–D Cup | https://shop.lululemon.com/p/women-sports-bras... | 52.0 | 4.1 | | | 3 | WorkinoutLily | 2020-03-02 | 691 | 3069 | 0 |
| 693 | Energy Bra Medium Support, B–D Cup | https://shop.lululemon.com/p/women-sports-bras... | 52.0 | 4.1 | | | 1 | Yanni Yang | 2020-03-01 | 694 | 3069 | 0 |
| 761 | Energy Bra Medium Support, B–D Cup | https://shop.lululemon.com/p/women-sports-bras... | 52.0 | 4.1 | Great fit and very comfortable! | | 5 | Kls3 | 2019-12-28 | 762 | 3069 | 0 |
| 810 | Energy Bra Medium Support, B–D Cup | https://shop.lululemon.com/p/women-sports-bras... | 52.0 | 4.1 | great bra | | 5 | SarahAnvari | 2019-10-28 | 811 | 3069 | 0 |


In [117]:
df_cleaned = df_cleaned[df_cleaned.content != '']

In [119]:
df_cleaned.shape

(9090, 12)
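
The filter above removes rows whose content is exactly the empty string. If whitespace-only reviews are also a concern, a slightly broader check could strip the text first. The sketch below runs on a made-up frame (`demo`) and is an assumption about a possible extension, not a step the notebook performs.

```python
import pandas as pd

demo = pd.DataFrame({
    'content': ['Love it', '', '   ', '\n'],
    'rating': [5, 1, 3, 4],
})

# Treat content that is empty after stripping whitespace as blank
is_blank = demo['content'].str.strip() == ''
kept = demo[~is_blank]

print(len(demo), len(kept))  # 4 1
```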

In [118]:
with open('df_cleaned.pickle', 'wb') as to_write:
    pickle.dump(df_cleaned, to_write)
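
One reason to pickle the cleaned dataframe rather than export it to CSV is that the converted dtypes (ints, floats, datetimes) survive the round trip. The sketch below illustrates this with pandas' own `to_pickle` and `read_pickle` on a small made-up frame and a temporary file; the names `frame`, `restored`, and `demo.pickle` are hypothetical.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    'rating': [5, 3],
    'date': pd.to_datetime(['2020-10-31', '2020-10-30']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pickle')
    frame.to_pickle(path)            # write the dataframe, dtypes included
    restored = pd.read_pickle(path)  # read it back

print(restored.dtypes.equals(frame.dtypes))
print(restored.equals(frame))
```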