### **In this study, I will apply data cleaning and feature extraction methods to the Amazon Reviews dataset**

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
# df = pd.read_csv('output/vacuum_cleaner_complete_reviews.csv')
# cols = pd.read_table(file_name, nrows=1,).columns
file_name="output/amazon_reviews_us_Books_v1_00.tsv"
cols=['product_id', 'customer_id', 'helpful_votes', 'total_votes', 
      'star_rating', 'review_date', 'review_headline', 'review_body', 
      'vine', 'verified_purchase' ]
df = pd.read_table(file_name, usecols=cols, dtype={'star_rating':'int64'}, 
                   on_bad_lines='warn', quoting = 3, encoding="utf-8")
df.head()

Unnamed: 0,customer_id,product_id,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,25933450,0439873800,5,0,0,N,Y,Five Stars,I love it and so does my students!,2015-08-31
1,1801372,1623953553,5,0,0,N,Y,"Please buy ""I Saw a Friend""! Your children wil...",My wife and I ordered 2 books and gave them as...,2015-08-31
2,5782091,142151981X,5,0,0,N,Y,Shipped fast.,Great book just like all the others in the ser...,2015-08-31
3,32715830,014241543X,5,0,0,N,N,Five Stars,So beautiful,2015-08-31
4,14005703,1604600527,5,2,2,N,Y,Five Stars,Enjoyed the author's story and his quilts are ...,2015-08-31


**Let's convert the 'review_date' column to a proper datetime type**  

In [3]:
df['review_date'] = pd.to_datetime(df['review_date'])
df.head()

Unnamed: 0,customer_id,product_id,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,25933450,0439873800,5,0,0,N,Y,Five Stars,I love it and so does my students!,2015-08-31
1,1801372,1623953553,5,0,0,N,Y,"Please buy ""I Saw a Friend""! Your children wil...",My wife and I ordered 2 books and gave them as...,2015-08-31
2,5782091,142151981X,5,0,0,N,Y,Shipped fast.,Great book just like all the others in the ser...,2015-08-31
3,32715830,014241543X,5,0,0,N,N,Five Stars,So beautiful,2015-08-31
4,14005703,1604600527,5,2,2,N,Y,Five Stars,Enjoyed the author's story and his quilts are ...,2015-08-31


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10319090 entries, 0 to 10319089
Data columns (total 10 columns):
 #   Column             Dtype         
---  ------             -----         
 0   customer_id        int64         
 1   product_id         object        
 2   star_rating        int64         
 3   helpful_votes      int64         
 4   total_votes        int64         
 5   vine               object        
 6   verified_purchase  object        
 7   review_headline    object        
 8   review_body        object        
 9   review_date        datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(5)
memory usage: 787.3+ MB


In [5]:
df.describe().round(1)

Unnamed: 0,customer_id,star_rating,helpful_votes,total_votes
count,10319090.0,10319090.0,10319090.0,10319090.0
mean,28816859.3,4.4,1.5,2.2
std,15369516.0,1.0,11.7,14.1
min,10024.0,1.0,0.0,0.0
25%,15049018.0,4.0,0.0,0.0
50%,27943498.0,5.0,0.0,0.0
75%,43242666.2,5.0,1.0,2.0
max,53096584.0,5.0,6244.0,6534.0


In [6]:
# Count the missing values per column by chaining .sum() onto .isna()
null_values=df.isna().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot*100
round(null_values,3).sort_values('percent',ascending=False)

Unnamed: 0,null,percent
review_body,197,0.002
review_headline,71,0.001
customer_id,0,0.0
product_id,0,0.0
star_rating,0,0.0
helpful_votes,0,0.0
total_votes,0,0.0
vine,0,0.0
verified_purchase,0,0.0
review_date,0,0.0


**If there are any missing values, we can drop those rows completely.**

In [7]:
df= df.dropna()
df.shape

(10318823, 10)

In [8]:
df=df.sample(100000)
df.shape

(100000, 10)

# Basic Feature Extraction - 1

Initially, I tried to do the data cleaning first. Then I realized that cleaning removes some characters that carry useful information. Therefore, feature extraction is split into two parts. Here, I will extract the features that can't be extracted after data cleaning.

### 1) Number of stopwords

In [9]:
!pip install -q wordcloud
import wordcloud
from nltk.corpus import stopwords
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [10]:
df['stopwords'] = df['review_body'].apply(lambda x: len([x for x in x.split() if x in stop]))
df[['review_body','stopwords']].head()

Unnamed: 0,review_body,stopwords
8056458,I enjoyed the layout to help me understand Lin...,14
6671483,I'm learning a lot about structural issues in ...,45
4593153,My son loves this book! A great buy!,1
6392189,The product was listed as &#34;Like New Condit...,21
3184725,gift,0


In [11]:
df.stopwords.loc[df.stopwords != 0].count()

91955
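
Note that the count above is case-sensitive: NLTK's English stop list is lowercase, so capitalised tokens such as "My" or "A" are not counted (the third review scores 1, for "this"). A minimal sketch of a case-insensitive variant follows. It assumes a tiny hand-written stop set (`demo_stop` is a hypothetical stand-in for the NLTK list) and uses a set for faster membership tests; `count_stopwords` is likewise an illustrative name, not part of the original notebook.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
demo_stop = {"my", "this", "a", "the", "and"}

def count_stopwords(text, stop_set, case_sensitive=True):
    """Count whitespace-separated tokens that appear in stop_set."""
    tokens = text.split()
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return sum(1 for t in tokens if t in stop_set)

review = "My son loves this book! A great buy!"
strict = count_stopwords(review, demo_stop)                         # only "this" matches
relaxed = count_stopwords(review, demo_stop, case_sensitive=False)  # "my", "this", "a"
print(strict, relaxed)
```

Whether the case-insensitive count is the better feature depends on the downstream model; the sketch only makes the difference visible.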

### 2) Number of Punctuation

In [12]:
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return count

df['punctuation'] = df['review_body'].apply(lambda x: count_punct(x))

In [13]:
df[['review_body','punctuation']].head()

Unnamed: 0,review_body,punctuation
8056458,I enjoyed the layout to help me understand Lin...,3
6671483,I'm learning a lot about structural issues in ...,19
4593153,My son loves this book! A great buy!,2
6392189,The product was listed as &#34;Like New Condit...,25
3184725,gift,0


In [14]:
df.punctuation.loc[df.punctuation != 0].count()

92832

### 3) Number of hashtag characters

Another feature we can extract from a review is the number of hashtags it contains, counted here as whitespace-separated tokens that start with '#'. This gives us one more piece of information from the raw text.

In [15]:
df['hastags'] = df['review_body'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
df[['review_body','hastags']].head()

Unnamed: 0,review_body,hastags
8056458,I enjoyed the layout to help me understand Lin...,0
6671483,I'm learning a lot about structural issues in ...,0
4593153,My son loves this book! A great buy!,0
6392189,The product was listed as &#34;Like New Condit...,0
3184725,gift,0


In [16]:
df.hastags.loc[df.hastags != 0].count()

321

### 4) Number of numerics
Counting the numeric tokens present in the reviews can also be useful. At the very least, it doesn't hurt to have this data!

In [17]:
df['numerics'] = df['review_body'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['review_body','numerics']].head()

Unnamed: 0,review_body,numerics
8056458,I enjoyed the layout to help me understand Lin...,0
6671483,I'm learning a lot about structural issues in ...,0
4593153,My son loves this book! A great buy!,0
6392189,The product was listed as &#34;Like New Condit...,0
3184725,gift,0


In [18]:
df.numerics.loc[df.numerics != 0].count()

12886
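
Keep in mind that `str.isdigit()` is only true for tokens made up entirely of digits, so tokens such as "2," or "$20" are skipped. A hedged alternative, assuming we want to count every run of digits in the text (`count_numbers` is a hypothetical helper, not part of the original notebook):

```python
import re

def count_numbers(text):
    """Count runs of digits anywhere in the text, even when attached to punctuation."""
    return len(re.findall(r"\d+", text))

sample = "Bought 2, paid $20 and read it in 3 days"
token_based = len([tok for tok in sample.split() if tok.isdigit()])  # approach used above
regex_based = count_numbers(sample)
print(token_based, regex_based)
```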

### 5) Number of Uppercase words
Anger or rage is quite often expressed by writing in UPPERCASE, so counting fully uppercase words can help identify such reviews.

In [19]:
df['upper'] = df['review_body'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
df[['review_body','upper']].head()

Unnamed: 0,review_body,upper
8056458,I enjoyed the layout to help me understand Lin...,2
6671483,I'm learning a lot about structural issues in ...,2
4593153,My son loves this book! A great buy!,1
6392189,The product was listed as &#34;Like New Condit...,0
3184725,gift,0


In [20]:
df.upper.loc[df.upper != 0].count()

61810
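
Note that `str.isupper()` is also true for single-letter words such as "I" and "A", which is why the short review above scores 1. If the goal is to flag shouting, one possible refinement (an assumption, not part of the original pipeline) is to require at least two letters; `count_shouted_words` is a hypothetical helper name.

```python
def count_shouted_words(text, min_letters=2):
    """Count fully uppercase tokens that contain at least min_letters alphabetic characters."""
    count = 0
    for token in text.split():
        letters = [ch for ch in token if ch.isalpha()]
        if len(letters) >= min_letters and token.isupper():
            count += 1
    return count

plain = "My son loves this book! A great buy!"
angry = "I want a REFUND. This is the WORST book EVER!"
print(count_shouted_words(plain), count_shouted_words(angry))
```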

### 6) Number of Emojis
Emojis can be an indicator of emotions that relate to customer satisfaction.

In [21]:
!pip install emot

import emot 
emot_obj = emot.core.emot() 

df['emoji'] = df['review_body'].apply(lambda x: len(emot_obj.emoji(x)["value"]))
df[['review_body','emoji']].head()



Unnamed: 0,review_body,emoji
8056458,I enjoyed the layout to help me understand Lin...,0
6671483,I'm learning a lot about structural issues in ...,0
4593153,My son loves this book! A great buy!,0
6392189,The product was listed as &#34;Like New Condit...,0
3184725,gift,0


In [22]:
df.emoji.loc[df.emoji != 0].count()

139

### 7) Number of Emoticons

***What is the difference between emoji and emoticons?***

*   :-) is an emoticon
*   😜 → emoji.

In [23]:
df['emoticon'] = df['review_body'].apply(lambda x: len(emot_obj.emoticons(x)["value"]))
df[['review_body','emoticon']].head()

Unnamed: 0,review_body,emoticon
8056458,I enjoyed the layout to help me understand Lin...,0
6671483,I'm learning a lot about structural issues in ...,0
4593153,My son loves this book! A great buy!,0
6392189,The product was listed as &#34;Like New Condit...,0
3184725,gift,0


In [24]:
df.emoticon.loc[df.emoticon != 0].count()

3474



---






# **Text cleaning techniques** 

### 1) Make all text lower case

The first pre-processing step which we will do is transform our reviews into lower case. This avoids having multiple copies of the same words. For example, while calculating the word count, ‘Analytics’ and ‘analytics’ will be taken as different words.

In [25]:
df['Text'] = df['review_body'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['Text'].head()

8056458    i enjoyed the layout to help me understand lin...
6671483    i'm learning a lot about structural issues in ...
4593153                 my son loves this book! a great buy!
6392189    the product was listed as &#34;like new condit...
3184725                                                 gift
Name: Text, dtype: object

### 2) Removing Punctuation

In [26]:
df['Text'] = df['Text'].str.replace(r'[^\w\s]', ' ', regex=True)
df['Text'].head()


8056458    i enjoyed the layout to help me understand lin...
6671483    i m learning a lot about structural issues in ...
4593153                 my son loves this book  a great buy 
6392189    the product was listed as   34 like new condit...
3184725                                                 gift
Name: Text, dtype: object
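
The fourth row shows a side effect of this step: the HTML entity `&#34;` (a double quote) loses its `&`, `#` and `;` and leaves a stray `34` in the text. One way to avoid this, sketched here with the standard-library `html.unescape` (this step is not in the original pipeline), is to decode entities before stripping punctuation:

```python
import html
import re

raw = "the product was listed as &#34;like new condition&#34;"

# Stripping punctuation directly leaves the entity's digits behind.
stripped_only = re.sub(r"[^\w\s]", " ", raw)

# Decoding first turns &#34; into a real quote character, which is then removed cleanly.
decoded_then_stripped = re.sub(r"[^\w\s]", " ", html.unescape(raw))

print(stripped_only)
print(decoded_then_stripped)
```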

### 3) Removal of Stop Words

In [27]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['Text'].sample(10)

7904209    taught assessment course included children adu...
5311033    great book cool pop outs grandson loves book m...
8270003    james r white one best best theologians apolog...
6529530        love read put book love author reading pieces
751207                     excellently done meet expectation
9623660    excellent condition missing pages pages marked...
223638     enjoyed debbie mason stories people charming c...
4307308    thoroughly enjoyed easy read book found excell...
4304118    first began reading print version novel found ...
3590913    kind novel bindloss specialty young englishman...
Name: Text, dtype: object

### 4) Removing URLs

In [28]:
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

In [29]:
# remove all urls from df
import re
import string

df['Text'] = df['Text'].apply(lambda x: remove_url(x))

### 5) Remove html tags

In [30]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

In [31]:
# remove all html tags from df
df['Text'] = df['Text'].apply(lambda x: remove_html(x))
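
Because punctuation was already stripped in step 2, the `<`, `>`, `:` and `/` characters that these two patterns rely on are gone by the time they run. This would explain why `br` (most likely from `<br />` tags) tops the frequency list later in the notebook. A small sketch of the effect of ordering, reusing the same regular expressions on a made-up string (the function names are illustrative only):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<.*?>")
PUNCT_RE = re.compile(r"[^\w\s]")

def tags_first(text):
    """Remove URLs and HTML tags, then punctuation."""
    text = URL_RE.sub("", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(PUNCT_RE.sub(" ", text).split())

def punctuation_first(text):
    """Remove punctuation, then try to remove URLs and HTML tags (too late to match)."""
    text = PUNCT_RE.sub(" ", text)
    text = URL_RE.sub("", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())

sample = "great book<br />see https://example.com/review for more"
print(tags_first(sample))
print(punctuation_first(sample))
```

Running URL and tag removal on the raw `review_body` before the punctuation step would be one way to keep these fragments out of the cleaned text.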

### 6) Removing Emojis
Emojis can be an indicator of emotions that relate to customer satisfaction. Unfortunately, we need to remove them for our text analysis.

In [32]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows (not Chinese characters)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very broad range; also covers CJK characters
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [33]:
#Example
remove_emoji("Omg another Earthquake 😔😔")

'Omg another Earthquake '

In [34]:
# remove all emojis from df
df['Text'] = df['Text'].apply(lambda x: remove_emoji(x))

### 7) Remove Emoticons

In the previous step, we removed emojis. Now, we are going to remove emoticons. 

***What is the difference between emoji and emoticons?***

*   :-) is an emoticon
*   😜 → emoji.

In [35]:
# Reference : https://gist.github.com/Mylloon/e63f90e27b7e933779794cf6e39b758b

regex = r" *?[^\w\s]+"

def remove_emoticons(text):
    text = re.sub(regex, ' ', text)
    return text

In [36]:
#Example
remove_emoticons(" Hello :-) ")

' Hello  '
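
Keep in mind that this pattern is not specific to emoticons: it replaces every run of characters that are neither word characters nor whitespace, so ordinary punctuation is removed as well. A quick check with the same regular expression on a made-up sentence:

```python
import re

regex = r" *?[^\w\s]+"

def remove_emoticons(text):
    # Replaces any run of non-word, non-space characters (plus leading spaces) with one space
    return re.sub(regex, ' ', text)

cleaned = remove_emoticons("Loved it :-) but the ending, frankly, was weak!")
print(cleaned)
```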

In [37]:
df['Text'] = df['Text'].apply(lambda x: remove_emoticons(x))

### Spell Correction

We’ve all seen reviews with a plethora of spelling mistakes. Product reviews are often filled with hastily written text that is barely legible at times.

In that regard, spelling correction is a useful pre-processing step because this also will help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.

To achieve this we will use the textblob library. 

In [38]:
from textblob import TextBlob
# Preview on the first five rows only; the result is not assigned back to df
df['Text'][:5].apply(lambda x: str(TextBlob(x).correct()))

8056458    enjoyed layout help understand line time poor ...
6671483    learning lot structural issues architecture bo...
4593153                             son loves book great buy
6392189    product listed 34 like new condition 34 certai...
3184725                                                 gift
Name: Text, dtype: object

In [39]:
# We could also combine some of the cleaning steps into a single operation like this:

# Apply a first round of text cleaning techniques
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [40]:
df['Text'] = df.Text.apply(round1)
df.Text

8056458    enjoyed layout help understand linux time poor...
6671483    learning lot structural issues architecture bo...
4593153                             son loves book great buy
6392189    product listed  like new condition  certainly ...
3184725                                                 gift
                                 ...                        
7183971    max boot expert military warfare book fascinat...
282851                information found net opinionated text
8125376    highly recommended first time parents basic ne...
783860                                            great read
5670992    first time mom book lifesaver given gift purch...
Name: Text, Length: 100000, dtype: object

In [41]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [42]:
df['Text'] = df.Text.apply(round2)
df.Text

8056458    enjoyed layout help understand linux time poor...
6671483    learning lot structural issues architecture bo...
4593153                             son loves book great buy
6392189    product listed  like new condition  certainly ...
3184725                                                 gift
                                 ...                        
7183971    max boot expert military warfare book fascinat...
282851                information found net opinionated text
8125376    highly recommended first time parents basic ne...
783860                                            great read
5670992    first time mom book lifesaver given gift purch...
Name: Text, Length: 100000, dtype: object

Let's check whether the frequent words make sense or not

In [43]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:20]
freq

br         130435
book       118979
read        40832
one         32877
great       27080
story       23666
like        23475
good        22393
would       20526
love        19214
well        18314
books       17444
time        17340
life        15956
really      15933
much        14804
reading     14596
many        14243
also        14191
first       13441
dtype: int64

# Basic Feature Extraction - 2

### 1) Number of Words

In [44]:
df['word_count'] = df['Text'].apply(lambda x: len(str(x).split(" ")))
df[['Text','word_count']].head()

Unnamed: 0,Text,word_count
8056458,enjoyed layout help understand linux time poor...,15
6671483,learning lot structural issues architecture bo...,52
4593153,son loves book great buy,5
6392189,product listed like new condition certainly ...,38
3184725,gift,1
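
One caveat: `split(" ")` splits on every single space, so consecutive spaces (such as the double spaces visible in the fourth row after the first cleaning round) produce empty strings that are counted as words, and an empty text counts as one word. A brief comparison with the argument-free `split()`, which collapses runs of whitespace:

```python
text = "product listed  like new condition  certainly"

naive_count = len(text.split(" "))  # empty strings between double spaces are counted
robust_count = len(text.split())    # runs of whitespace are treated as one separator

print(naive_count, robust_count)
print(len("".split(" ")), len("".split()))
```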


Again, let's check the data and number of null values

In [45]:
null_values=df.isna().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot*100
round(null_values,3).sort_values('percent',ascending=False)

Unnamed: 0,null,percent
customer_id,0,0.0
stopwords,0,0.0
Text,0,0.0
emoticon,0,0.0
emoji,0,0.0
upper,0,0.0
numerics,0,0.0
hastags,0,0.0
punctuation,0,0.0
review_date,0,0.0


### 2) Number of Characters

In [46]:
df['char_count'] = df['Text'].str.len() ## this also includes spaces
df[['Text','char_count']].head()

Unnamed: 0,Text,char_count
8056458,enjoyed layout help understand linux time poor...,92
6671483,learning lot structural issues architecture bo...,417
4593153,son loves book great buy,24
6392189,product listed like new condition certainly ...,236
3184725,gift,4


### 3) Average Word Length

In [47]:
def avg_word(sentence):
    words = sentence.split()
    # The small constant avoids division by zero when the text is empty
    return sum(len(word) for word in words) / (len(words) + 0.000001)

In [48]:
df['avg_word'] = df['Text'].apply(lambda x: avg_word(x)).round(1)
df[['Text','avg_word']].head()

Unnamed: 0,Text,avg_word
8056458,enjoyed layout help understand linux time poor...,5.2
6671483,learning lot structural issues architecture bo...,7.0
4593153,son loves book great buy,4.0
6392189,product listed like new condition certainly ...,5.9
3184725,gift,4.0


In [49]:
list(df)

['customer_id',
 'product_id',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date',
 'stopwords',
 'punctuation',
 'hastags',
 'numerics',
 'upper',
 'emoji',
 'emoticon',
 'Text',
 'word_count',
 'char_count',
 'avg_word']

In [50]:
df.sample(5)

Unnamed: 0,customer_id,product_id,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,...,punctuation,hastags,numerics,upper,emoji,emoticon,Text,word_count,char_count,avg_word
5718638,47376068,307987035,5,2,2,N,Y,A must read for foodies,If you are interested in an honest look at mod...,2014-03-03,...,5,0,0,0,0,0,interested honest look modern agricultural pro...,22,154,6.0
5254922,12171199,1621940101,5,0,0,N,Y,"Joy, no matter the situation!",Excellent study of this great letter from the ...,2014-05-01,...,6,0,1,1,0,0,excellent study great letter apostle paul faci...,17,112,6.0
9744858,38562832,1582346100,4,0,0,N,N,Voluptuous Prose,"Yes, the characters indulge in all sorts of si...",2012-09-30,...,17,0,0,2,0,0,yes characters indulge sorts sins casual sex s...,48,363,6.6
1467360,20111189,802405576,5,0,0,N,Y,Five Stars,Very helpful and reminds us we aren't alone.,2015-04-17,...,2,0,0,0,0,0,helpful reminds us alone,4,24,5.2
2499975,25952696,802473806,5,0,0,N,Y,It is in excellent condition and the best pack...,My husband Sonny loves the the Ryrie Study Bib...,2015-01-22,...,9,0,0,0,0,0,husband sonny loves ryrie study bible excellen...,23,152,5.7




---
## **Now, let's apply the round 1 and round 2 data cleaning processes to the 'review_headline' column**


Keep in mind that round1 operations make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.

In [51]:
df['review_headline'] = df.review_headline.apply(round1)
df.review_headline

8056458                                   linux a better way
6671483                  fantastic book for amateurs like me
4593153                                     a happy customer
6392189                                    not as advertised
3184725                                           five stars
                                 ...                        
7183971    sweeping epic longitudinalstudy of guerillawar...
282851                                           three stars
8125376                                very interesting book
783860                                                 great
5670992                                           great book
Name: review_headline, Length: 100000, dtype: object

The round2 operations then remove some additional punctuation and nonsensical text that was missed the first time around.

In [52]:
df['review_headline'] = df.review_headline.apply(round2)
df.review_headline

8056458                                   linux a better way
6671483                  fantastic book for amateurs like me
4593153                                     a happy customer
6392189                                    not as advertised
3184725                                           five stars
                                 ...                        
7183971    sweeping epic longitudinalstudy of guerillawar...
282851                                           three stars
8125376                                very interesting book
783860                                                 great
5670992                                           great book
Name: review_headline, Length: 100000, dtype: object

Let's check whether the most frequent words make sense. We can add our own stopwords depending on the result.

In [53]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:50]
freq

br            130435
book          118979
read           40832
one            32877
great          27080
story          23666
like           23475
good           22393
would          20526
love           19214
well           18314
books          17444
time           17340
life           15956
really         15933
much           14804
reading        14596
many           14243
also           14191
first          13441
author         13162
get            13123
way            12629
even           12407
people         11637
new            11448
know           10535
characters     10077
could           9946
series          9518
make            9495
written         9470
world           9289
recommend       9089
little          8978
think           8842
find            8824
work            8653
us              8632
see             8393
years           7951
found           7878
two             7797
want            7700
easy            7617
loved           7467
back            7404
things       

# Adding our own stopwords

In [54]:
# Adding common words from our document to stop_words

add_words = ["br"]

stop_words = set(stopwords.words("english"))
stop_added = stop_words.union(add_words)

In [55]:
df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_added))
df['Text'].sample(10)

7913466    recommended gourmet magazine cookbook gold sta...
3264982          followed patricia bray years think best yet
7749457    person rarely picks book found book excellent ...
6680382    insider tells grew world managed become part t...
6854610    losing wonderful feline boy tigger friend loan...
2073566    book fine enjoyable great tsa comments little ...
4430464                              great book great writer
5141223    specialised book assume reader expert even glo...
1390873        great bio history fans xtc well worth reading
9095430    ph physicist teaching university level yoga me...
Name: Text, dtype: object

In [56]:
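# Note: this binds a second name to the same DataFrame (no copy), so changes to df1 also change df
df1 = df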

In [57]:
mask = df1.Text.str.endswith('br') 
df1.loc[mask, 'Text'] = df1.loc[mask, 'Text'].str[:-2]

In [58]:
# Caution: rstrip('tty') strips any trailing 't' or 'y' characters, not the literal suffix 'tty'
df1['Text'] = df1['Text'].str.rstrip('tty')

In [59]:
df1['Text'].apply(lambda x: x[:-2] if x.endswith('tty') else x)

8056458    enjoyed layout help understand linux time poor...
6671483    learning lot structural issues architecture bo...
4593153                              son loves book great bu
6392189    product listed like new condition certainly fa...
3184725                                                  gif
                                 ...                        
7183971    max boot expert military warfare book fascinat...
282851                 information found net opinionated tex
8125376    highly recommended first time parents basic ne...
783860                                            great read
5670992    first time mom book lifesaver given gift purch...
Name: Text, Length: 100000, dtype: object
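
The truncated endings in this output ("bu", "gif", "tex") come from how `rstrip` works: its argument is a set of characters to strip, not a suffix, so every trailing `t` or `y` is removed. If the intent was to drop a literal trailing `tty` (an assumption, since the notebook does not say), a suffix check is one way to sketch it; `strip_suffix` is a hypothetical helper name.

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text only when it is literally present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

# Compare character-set stripping with literal suffix removal
for sample in ["great buy", "gift", "pretty"]:
    print(repr(sample.rstrip("tty")), repr(strip_suffix(sample, "tty")))
```

In pandas, `Series.str.removesuffix` (available in recent versions) plays the same role for a whole column.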

In [60]:
df1.loc[df1.Text.str.endswith('br'), 'Text']

Series([], Name: Text, dtype: object)

In [61]:
df1.loc[df1.punctuation >= 1000].Text.tolist()

['admit greta christina book atheists angry things piss godless many ways better expected greta good writer anger expressed much reason rant pith well pique begins book list things tick surprised find agreed greta also smart enough know listing atrocities rightly adds list could far longer enough make even moral case religion knows moral case would prove religion wrong recognizes people jerks without help olympus recognizes anger destructive psychologically street points anger also often harnessed productively important reform movements come think true fact atheists angry seen nice little rebuttal pluralism attempt always look bright side religious life ta da ta da da da da da like radical critics pluralism however greta tacks opposite extreme book also read rah rah screed secular humanism greta gang thick pz myers simple minded enlightenment exclusivists gavin costa puts think points wrong christianity slavery christianity status women galileo bruno persecuted think also wrong suppose

In [62]:
df.loc[df.punctuation >= 1000].Text.tolist()

['admit greta christina book atheists angry things piss godless many ways better expected greta good writer anger expressed much reason rant pith well pique begins book list things tick surprised find agreed greta also smart enough know listing atrocities rightly adds list could far longer enough make even moral case religion knows moral case would prove religion wrong recognizes people jerks without help olympus recognizes anger destructive psychologically street points anger also often harnessed productively important reform movements come think true fact atheists angry seen nice little rebuttal pluralism attempt always look bright side religious life ta da ta da da da da da like radical critics pluralism however greta tacks opposite extreme book also read rah rah screed secular humanism greta gang thick pz myers simple minded enlightenment exclusivists gavin costa puts think points wrong christianity slavery christianity status women galileo bruno persecuted think also wrong suppose

In [63]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:50]
freq

book          118979
read           40852
one            32877
great          26205
like           23475
story          22990
good           22393
would          20526
love           19214
well           18314
books          17444
time           17340
life           15956
really         15895
much           14804
reading        14596
many           14198
also           14191
first          13383
author         13162
get            12933
even           12428
way            12361
people         11637
new            11448
know           10535
characters     10077
could           9946
series          9518
make            9495
written         9470
world           9289
recommend       9089
little          8978
think           8842
find            8824
work            8653
us              8632
see             8393
years           7951
found           7878
two             7797
want            7631
easy            7572
loved           7467
back            7404
things          7375
never        

**Now, let's save this cleaned, processed data as a CSV file** 

In [64]:
df.to_csv('output/Amazon_reviews_processed_1.csv', index=False)