# NLP PROJECT

### Problem Statement:

As an avid reader, I get many recommendations from my circle regarding which books I should read next. Having heard different opinions of Nassim Nicholas Taleb, I decided to use NLP to get a feeling based of people's reviews concerning his Incerto (consisting of 5 books) and possibly the topics within. I will divide this project into multiple Notebooks to make it easier to read.

### What This Project Shows:

1. Web scraping book reviews on goodreads.com
2. Exploratory Data Analysis
3. Exploring Sentimental Analysis and Topic Modelling NLP techniques
4. Conclusion based on analysis

Link for the  website: http://goodreads.com/

## Notebook 1: Web Scraping + Data Cleaning + Organizing:

The output of this notebook will have clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### I. Web Scraping:

In [24]:
# Import libraries:
import urllib3
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [26]:
# Print the status of the website (200 Yes ; 404 No):
result = requests.get("https://www.goodreads.com/")
print("status code for page: " + str(result.status_code))

status code for page: 200


In [11]:
# Scrapes reviews data from goodreads.com
def url_to_reviews(url):
    '''Returns review data specifically from goodreads.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="reviewText stacked").find_all('')]
    print(url)
    return text

# URLs of books in scope
urls = ['https://www.goodreads.com/book/show/38315.Fooled_by_Randomness?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=2',
        'https://www.goodreads.com/book/show/242472.The_Black_Swan?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=1',
        'https://www.goodreads.com/book/show/9402297-the-bed-of-procrustes?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=5',
        'https://www.goodreads.com/book/show/13530973-antifragile?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=3',
        'https://www.goodreads.com/book/show/36064445-skin-in-the-game?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=4']

# Book names
books = ['FbR', 'TBS', 'BoP', 'AF', 'SitG']

In [12]:
# Actually request transcripts (takes a few minutes to run)
reviews = [url_to_reviews(u) for u in urls]

https://www.goodreads.com/book/show/38315.Fooled_by_Randomness?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=2
https://www.goodreads.com/book/show/242472.The_Black_Swan?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=1
https://www.goodreads.com/book/show/9402297-the-bed-of-procrustes?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=5
https://www.goodreads.com/book/show/13530973-antifragile?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=3
https://www.goodreads.com/book/show/36064445-skin-in-the-game?from_search=true&from_srp=true&qid=q1xYfzijc6&rank=4


In [16]:
# Pickle files for later use

# Make a new directory to hold the text files
!mkdir transcripts

for i, c in enumerate(books):
    with open("reviews/" + c + ".txt", "wb") as file:
        pickle.dump(reviews[i], file)

mkdir: transcripts: File exists


In [19]:
# Load pickled files
data = {}
for i, c in enumerate(books):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [20]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['FbR', 'TBS', 'BoP', 'AF', 'SitG'])

In [27]:
# More checks
data['FbR']

[]

### II. Cleaning Data:

### III. Organizing Data: