# NLP PROJECT

### Problem Statement:

As an avid reader, I get many recommendations from my circle about which books to read next. Having heard differing opinions of Nassim Nicholas Taleb, I decided to use NLP to get a feel, based on people's reviews, for his Incerto (a series of 5 books) and possibly the topics within it. I will divide this project into multiple notebooks to make it easier to read.

### What This Project Shows:

1. Web scraping book reviews on goodreads.com
2. Exploratory Data Analysis
3. Exploring Sentiment Analysis and Topic Modelling NLP techniques
4. Conclusion based on analysis

Link to the website: http://goodreads.com/

## Notebook 1: Web Scraping + Data Cleaning + Organizing:

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### I. Web Scraping:

In [1]:
# Import libraries:
import urllib3
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [2]:
# Print the HTTP status of the website (200 = OK, 404 = not found):
result = requests.get("https://www.goodreads.com/")
print("status code for page: " + str(result.status_code))

status code for page: 200


In [3]:
# Scrapes the first page (30 top-ranked reviews) of each book from goodreads.com
def url_to_reviews(url):
    '''Returns review data specifically from goodreads.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    # Each review body sits in a <div class="reviewText stacked">
    text = [p.text for p in soup.find_all("div", {"class": "reviewText stacked"})]
    print(url)  # progress indicator
    return text

# URLs of the 5 books (Incerto) in scope
urls = ['https://www.goodreads.com/book/show/38315.Fooled_by_Randomness?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=2',
        'https://www.goodreads.com/book/show/242472.The_Black_Swan?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=1',
        'https://www.goodreads.com/book/show/9402297-the-bed-of-procrustes?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=5',
        'https://www.goodreads.com/book/show/13530973-antifragile?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=3',
        'https://www.goodreads.com/book/show/36064445-skin-in-the-game?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=4']

# Book names
books = ['FbR', 'TBS', 'BoP', 'AF', 'SitG']

In [4]:
# Actually request the reviews (takes a few minutes to run)
reviews = [url_to_reviews(u) for u in urls]

https://www.goodreads.com/book/show/38315.Fooled_by_Randomness?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=2
https://www.goodreads.com/book/show/242472.The_Black_Swan?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=1
https://www.goodreads.com/book/show/9402297-the-bed-of-procrustes?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=5
https://www.goodreads.com/book/show/13530973-antifragile?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=3
https://www.goodreads.com/book/show/36064445-skin-in-the-game?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=4
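The list comprehension above sends the five requests back to back and stops at the first exception. A slightly more defensive variant is sketched below: it pauses between requests and records failures instead of aborting. The helper name `fetch_all`, the `delay_seconds` parameter, and its one-second default are assumptions for illustration, not part of the original notebook; `fetch` stands in for any callable such as `url_to_reviews`.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    Hypothetical helper: `fetch` is any callable that takes a URL and
    returns a list of review strings (for example `url_to_reviews`).
    Returns (results, errors); a failed URL maps to an empty list.
    """
    results, errors = {}, {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going; record what went wrong
            results[url] = []
            errors[url] = repr(exc)
    return results, errors


# Offline demonstration with a stand-in fetcher (no network access).
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("simulated failure")
    return ["review for " + url]


demo_results, demo_errors = fetch_all(
    ["u1", "bad-url", "u2"], fake_fetch, delay_seconds=0
)
```

In the notebook this could be called as `fetch_all(urls, url_to_reviews)`, with the caveat that the results come back keyed by URL rather than as a plain list.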


In [5]:
# Pickle files for later use

# Make a new directory to hold the text files
!mkdir reviews

for i, c in enumerate(books):
    with open("reviews/" + c + ".txt", "wb") as file:
        pickle.dump(reviews[i], file)

mkdir: reviews: File exists
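
The `mkdir: reviews: File exists` message appears because the directory was already created on an earlier run. One way to avoid it, sketched here with the standard library only, is `Path.mkdir(exist_ok=True)`. The helper names `save_reviews` and `load_reviews` and the `.pkl` suffix are assumptions; the notebook itself writes its pickles with a `.txt` suffix.

```python
import pickle
import tempfile
from pathlib import Path


def save_reviews(reviews_by_book, out_dir):
    """Pickle each book's list of reviews to <out_dir>/<book>.pkl."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    for book, reviews in reviews_by_book.items():
        with open(out_dir / f"{book}.pkl", "wb") as fh:
            pickle.dump(reviews, fh)


def load_reviews(books, in_dir):
    """Load the pickled review lists back into a dict keyed by book."""
    loaded = {}
    for book in books:
        with open(Path(in_dir) / f"{book}.pkl", "rb") as fh:
            loaded[book] = pickle.load(fh)
    return loaded


# Round trip in a temporary directory with made-up reviews.
sample = {"FbR": ["first review", "second review"], "TBS": ["another review"]}
with tempfile.TemporaryDirectory() as tmp:
    save_reviews(sample, tmp)
    save_reviews(sample, tmp)  # a second call must not raise
    restored = load_reviews(sample.keys(), tmp)
```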


In [6]:
# Load pickled files
data = {}
for i, c in enumerate(books):
    with open("reviews/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [7]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['FbR', 'TBS', 'BoP', 'AF', 'SitG'])

In [8]:
# More checks
data['FbR']

["\n\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about the tone. This guy is never going to win an award for modesty and he probably thinks you are stupid and have wasted your life. And it gets worse – like that quote from Oscar Wilde that has tormented me for years: “Work is the refuge of people who have nothing better to do”, this guy reckons that if you work for more than an hour or so per day you are probab\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about the tone. This guy is never going to win an award for modesty and he probably thinks you are stupid and have wasted your life. And it gets worse – like that quote from Oscar Wilde that has tormented me for years: “Work is the refuge of people who have nothing better to do”, this guy reckons that if you work for more than an hour or so per day you ar
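
Each scraped string above appears to contain the review twice: a truncated preview, a newline, then the full text. Assuming that layout holds for every review (it is inferred from the sample output, not from any Goodreads documentation), a hypothetical helper like `drop_preview` could keep only the full copy. This is a sketch of one possible extra cleaning step, not something the notebook does.

```python
def drop_preview(review):
    """Drop a leading truncated preview if the next part repeats it in full.

    Assumes the layout: preview, newline, full text. Reviews that do not
    match are returned with only blank lines and outer whitespace removed.
    """
    parts = [p.strip() for p in review.split("\n") if p.strip()]
    if len(parts) >= 2 and parts[1].startswith(parts[0]):
        parts = parts[1:]  # the second part already contains the preview
    return "\n".join(parts)


# Made-up example mirroring the shape of the scraped strings.
raw = "\n\nGreat book, really enjo\nGreat book, really enjoyed it."
full = drop_preview(raw)
untouched = drop_preview("\n\nShort one.")
```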

__NOTE:__ We could also have automated the process with Selenium to make this project easier to re-use (this might be done in the future).

If we were to focus on a single book or needed a larger number of reviews, we could run the code below to fetch the top 300 reviews of a specific book. However, that data would need more extensive cleaning.

In [9]:
# Fetch all 300 reviews (10 pages of 30) for one specific book

# page_number = 1

# reviews = []

# for page in range(10):

#     url = "https://www.goodreads.com/book/reviews/38315.Fooled_by_Randomness?page="+str(page+1)+"&authenticity_token=DGYjS7GmVhOFJuL3uQaRGUqXiGf0VF0bPXdyEAzo7GbLXPTqHMTJ%2FkL2rAOCEoKiDTEIYd1gzqxrlpGoiMsEdA%3D%3Dhide_last_page=False&amp;language_code=en&amp;"
#     page = requests.get(url).text
#     reviews.append(page)

# print("total number of reviews: "+ str(30*len(reviews)))
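
The commented-out loop builds each page URL by string concatenation. A sketch of the same idea with `urllib.parse` follows. The base URL is taken from the cell above; the helper `paged_urls` is hypothetical, the authenticity token is deliberately left out, and whether Goodreads serves this endpoint without it is not verified here.

```python
from urllib.parse import urlencode

BASE = "https://www.goodreads.com/book/reviews/38315.Fooled_by_Randomness"


def paged_urls(base, n_pages, **params):
    """Return one URL per page, numbered from 1 to n_pages."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({"page": page, **params})  # handles escaping
        urls.append(f"{base}?{query}")
    return urls


page_urls = paged_urls(BASE, 10, language_code="en")
```

Letting `urlencode` assemble the query string avoids hand-written `&` separators and HTML entities such as `&amp;` ending up in the URL.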

### II. Cleaning Data:

In [10]:
# Let's take a look at our data again
next(iter(data.keys()))

'FbR'

In [11]:
# Notice that our dictionary is currently in key: books, value: list of text format
next(iter(data.values()))

["\n\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about the tone. This guy is never going to win an award for modesty and he probably thinks you are stupid and have wasted your life. And it gets worse – like that quote from Oscar Wilde that has tormented me for years: “Work is the refuge of people who have nothing better to do”, this guy reckons that if you work for more than an hour or so per day you are probab\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about the tone. This guy is never going to win an award for modesty and he probably thinks you are stupid and have wasted your life. And it gets worse – like that quote from Oscar Wilde that has tormented me for years: “Work is the refuge of people who have nothing better to do”, this guy reckons that if you work for more than an hour or so per day you ar

In [12]:
# We are going to change this to key: books, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [13]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
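
As a quick check of what this step produces, here is the same comprehension applied to a toy dictionary. The review strings are made up for illustration.

```python
def combine_text(list_of_text):
    """Join a list of strings into one space-separated string."""
    return " ".join(list_of_text)


# Made-up reviews, for illustration only.
toy = {"FbR": ["Great book.", "Too smug."], "TBS": ["Dense but rewarding."]}

# Each value is wrapped in a one-element list so that a DataFrame built
# from the dict gets one row per book after transposing.
toy_combined = {key: [combine_text(value)] for key, value in toy.items()}
```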

In [14]:
# We can either keep it in dictionary format or put it into a pandas dataframe (Corpus)
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['reviews']
data_df = data_df.sort_index()
data_df

| | reviews |
|---|---|
| AF | \n\nTaleb seems constitutionally angry, dismissive, and contrarian--sometimes to the point of being an asshole. However, one cannot deny his talen... |
| BoP | \n\nAphorisms Galore!If for any literary fan, the country Lebanon brings to mind the tender, lyrical and mystical poet Khalil Gibran, we have anot... |
| FbR | \n\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about t... |
| SitG | \n\nSkin in the Game is at the same time thought-provoking and original but also contradictory and sometimes absurd. Let’s start with the cons:1. ... |
| TBS | \n\nThis is a book that raises a number of very important questions, but chief among them is definitely the question of how the interplay between ... |


In [15]:
# Let's take a look at the reviews for Antifragile (AF)
data_df.reviews.loc['AF']



In [16]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [17]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.reviews.apply(round1))
data_clean

| | reviews |
|---|---|
| AF | \n\ntaleb seems constitutionally angry dismissive and contrariansometimes to the point of being an asshole however one cannot deny his talent of c... |
| BoP | \n\naphorisms galoreif for any literary fan the country lebanon brings to mind the tender lyrical and mystical poet khalil gibran we have another ... |
| FbR | \n\nyeah you see i’ve just checked and most of the other reviews of this book do pretty much what i thought they would do they complain about the ... |
| SitG | \n\nskin in the game is at the same time thoughtprovoking and original but also contradictory and sometimes absurd let’s start with the i certain... |
| TBS | \n\nthis is a book that raises a number of very important questions but chief among them is definitely the question of how the interplay between a... |
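
In the output above, `contrarian--sometimes` has become `contrariansometimes` and `thought-provoking` has become `thoughtprovoking`, because punctuation is deleted rather than replaced. The sketch below contrasts that behaviour with a variant that substitutes a space and then collapses whitespace. The names `clean_glued` and `clean_spaced` and the sample sentence are assumptions for illustration; the variant is one possible refinement, not the notebook's method.

```python
import re
import string

PUNCT = "[%s]" % re.escape(string.punctuation)


def clean_glued(text):
    """Same steps as clean_text_round1: punctuation is deleted outright."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(PUNCT, "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text


def clean_spaced(text):
    """Variant: punctuation becomes a space, then whitespace is collapsed."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)
    text = re.sub(PUNCT, " ", text)
    text = re.sub(r"\w*\d\w*", " ", text)
    return " ".join(text.split())


sample = "Contrarian--sometimes thought-provoking. Cons:1. Too long"
glued = clean_glued(sample)    # words on either side of "--" are fused
spaced = clean_spaced(sample)  # words stay separate
```

With the spaced variant the word `cons` also survives, whereas the original steps first fuse it into `cons1` and then drop it as a word containing a digit.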


In [18]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)  # curly quotes and ellipsis are not in string.punctuation
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [19]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.reviews.apply(round2))
data_clean

| | reviews |
|---|---|
| AF | taleb seems constitutionally angry dismissive and contrariansometimes to the point of being an asshole however one cannot deny his talent of conve... |
| BoP | aphorisms galoreif for any literary fan the country lebanon brings to mind the tender lyrical and mystical poet khalil gibran we have another comp... |
| FbR | yeah you see ive just checked and most of the other reviews of this book do pretty much what i thought they would do they complain about the tone ... |
| SitG | skin in the game is at the same time thoughtprovoking and original but also contradictory and sometimes absurd lets start with the i certainly wo... |
| TBS | this is a book that raises a number of very important questions but chief among them is definitely the question of how the interplay between a goo... |


**NOTE:** The data cleaning/text pre-processing step could go on for many more steps, but we are going to stop for now. After going through some analysis techniques, if the results don't make sense or could be improved, we can come back and make more edits such as:
* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* etc.
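
As a rough illustration of the bi-gram idea, the sketch below counts adjacent word pairs with the standard library only. The function name `bigram_counts` and the sample sentence are made up. For real use, NLTK offers stemmers and lemmatizers, and scikit-learn's `CountVectorizer` accepts an `ngram_range` argument.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in whitespace-tokenized text."""
    words = text.split()
    return Counter(zip(words, words[1:]))


counts = bigram_counts("thank you taleb thank you again")
top_pair, top_count = counts.most_common(1)[0]
```

Here `top_pair` is the pair `("thank", "you")`, which occurs twice, so it could be treated as a single term in later analysis.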

### III. Organizing Data:

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. *__Corpus__* - a collection of text
2. *__Document-Term Matrix__* - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [20]:
# Let's take a look at our dataframe
data_df

| | reviews |
|---|---|
| AF | \n\nTaleb seems constitutionally angry, dismissive, and contrarian--sometimes to the point of being an asshole. However, one cannot deny his talen... |
| BoP | \n\nAphorisms Galore!If for any literary fan, the country Lebanon brings to mind the tender, lyrical and mystical poet Khalil Gibran, we have anot... |
| FbR | \n\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about t... |
| SitG | \n\nSkin in the Game is at the same time thought-provoking and original but also contradictory and sometimes absurd. Let’s start with the cons:1. ... |
| TBS | \n\nThis is a book that raises a number of very important questions, but chief among them is definitely the question of how the interplay between ... |


In [21]:
# Let's add the books' full names as well
full_names = ['Antifragile', 'Bed of Procrustes', 'Fooled by Randomness', 'Skin in the Game', 'The Black Swan']

data_df['Book Name'] = full_names
data_df

| | reviews | Book Name |
|---|---|---|
| AF | \n\nTaleb seems constitutionally angry, dismissive, and contrarian--sometimes to the point of being an asshole. However, one cannot deny his talen... | Antifragile |
| BoP | \n\nAphorisms Galore!If for any literary fan, the country Lebanon brings to mind the tender, lyrical and mystical poet Khalil Gibran, we have anot... | Bed of Procrustes |
| FbR | \n\nYeah, you see. I’ve just checked and most of the other reviews of this book do pretty much what I thought they would do. They complain about t... | Fooled by Randomness |
| SitG | \n\nSkin in the Game is at the same time thought-provoking and original but also contradictory and sometimes absurd. Let’s start with the cons:1. ... | Skin in the Game |
| TBS | \n\nThis is a book that raises a number of very important questions, but chief among them is definitely the question of how the interplay between ... | The Black Swan |


In [22]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques used in the next notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
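
To make the idea concrete before handing the work to a library, here is a hand-rolled sketch of a document-term matrix with stop-word removal, using only the standard library. The function name `document_term_matrix`, the tiny stop-word set, and the two sample documents are assumptions for illustration; scikit-learn's built-in English stop-word list is much longer.

```python
from collections import Counter

# A tiny, made-up stop-word list for the example.
STOP_WORDS = {"a", "the", "is", "and", "of"}


def document_term_matrix(docs):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = {
        name: Counter(w for w in text.split() if w not in STOP_WORDS)
        for name, text in docs.items()
    }
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


docs = {
    "AF": "the book is bold and the book is long",
    "TBS": "a bold theory of randomness",
}
vocab, dtm = document_term_matrix(docs)
```

Each row corresponds to a document and each column to a word in `vocab`, which is the same layout `CountVectorizer` produces.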

In [24]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.reviews)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
data_dtm.index = data_clean.index
data_dtm

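| | aah | abhorrent | abiding | abilities | ability | able | abolished | abolishing | abolitionists | abound | ... | youunless | youve | yuppies | yuri | yvgenia | zakarias | zero | zoolas | zoological | zoroastrianism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AF | 0 | 0 | 0 | 2 | 4 | 7 | 0 | 0 | 0 | 0 | ... | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| BoP | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FbR | 0 | 0 | 1 | 1 | 7 | 6 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| SitG | 1 | 1 | 0 | 2 | 6 | 6 | 2 | 1 | 2 | 1 | ... | 1 | 2 | 3 | 0 | 0 | 0 | 2 | 0 | 0 | 1 |
| TBS | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |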


In [25]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [26]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the context manager closes the file handle
    pickle.dump(cv, file)

### END OF NOTEBOOK I