## Step 1: Frame the Problem

Currently, stakeholders can identify which books are too difficult for their children/students to read, but not which books are beyond the child's/student's maturity level. This project attempts to close that gap.

## Step 2: Get the Data

[Guide](https://www.dataquest.io/blog/web-scraping-tutorial-python/)

In [7]:
import os
import string
import requests
import pandas as pd 
from bs4 import BeautifulSoup
from csv import writer

In [16]:
base_url = 'https://www.commonsensemedia.org/book-reviews'
page = '?page='
all_pages = range(1,291)
all_pages_list = [base_url+page+str(p) for p in all_pages]

In [17]:
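# Store the response under its own name so the `page` query string defined above is not overwritten
response = requests.get(base_url)

soup = BeautifulSoup(response.text, 'html.parser')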

In [18]:
title = [s.get_text().strip() for s in soup.findAll(class_="views-field views-field-field-reference-review-ent-prod result-title")]

In [19]:
# Turn each title into a URL slug: drop punctuation, replace spaces with hyphens, lowercase
stripper = str.maketrans("", "", string.punctuation)
title = [t.translate(stripper).replace(" ", "-").lower() for t in title]

In [21]:

# Preview the review URL built from each title slug
for t in title:
    print(base_url + "/" + t)

https://www.commonsensemedia.org/book-reviews/black-beauty
https://www.commonsensemedia.org/book-reviews/love
https://www.commonsensemedia.org/book-reviews/whistle-for-willie
https://www.commonsensemedia.org/book-reviews/the-little-prince
https://www.commonsensemedia.org/book-reviews/the-hobbit
https://www.commonsensemedia.org/book-reviews/middle-school-is-worse-than-meatloaf
https://www.commonsensemedia.org/book-reviews/shine
https://www.commonsensemedia.org/book-reviews/the-trouble-with-may-amelia
https://www.commonsensemedia.org/book-reviews/queen-of-the-falls
https://www.commonsensemedia.org/book-reviews/inside-out-and-back-again
https://www.commonsensemedia.org/book-reviews/the-adventures-of-mark-twain-by-huckleberry-finn
https://www.commonsensemedia.org/book-reviews/a-world-without-heroes-beyonders-book-1
https://www.commonsensemedia.org/book-reviews/recovery-road
https://www.commonsensemedia.org/book-reviews/little-white-rabbit
https://www.commonsensemedia.org/book-reviews/dave-
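
The last URL in the output above ends in a trailing hyphen, which is the kind of artifact that a one-for-one space replacement can leave behind once punctuation has been stripped. A minimal sketch of a slightly more defensive slug helper follows. The name `slugify` is hypothetical, and whether the resulting slugs match the site's real review URLs is an assumption that has not been checked here.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def slugify(title: str) -> str:
    """Hypothetical helper: lowercase, punctuation-free, hyphen-separated slug."""
    no_punct = title.translate(_PUNCT_TABLE)
    # Trim the ends first, then collapse any run of whitespace into one hyphen
    return re.sub(r"\s+", "-", no_punct.strip()).lower()

print(slugify("A World Without Heroes: Beyonders, Book 1"))
```

Trimming before joining avoids leading or trailing hyphens, and collapsing whitespace runs avoids double hyphens where a punctuation mark sat between two spaces.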

In [None]:
books = soup.findAll(class_="content-content-wrapper")

In [None]:
# Make sure the output directory exists before writing to it
os.makedirs('lexile', exist_ok=True)

# newline='' is what the csv module expects; without it, blank rows can appear on Windows
with open('lexile/books.csv', 'w', newline='') as csv_file:
    csv_writer = writer(csv_file)

    # Write the header row
    headers = ['Title', 'Description', 'Author', 'Age']
    csv_writer.writerow(headers)

    # Write one row per book found on the first results page
    for book in books:
        title = book.find(class_="views-field views-field-field-reference-review-ent-prod result-title").get_text()
        description = book.find(class_="views-field views-field-field-one-liner one-liner").get_text()
        author = book.find(class_="views-field views-field-field-term-book-authors review-supplemental").get_text().replace(" By ", "").rstrip()
        age = book.find(class_="csm-green-age").get_text().replace("age ", "")
        csv_writer.writerow([title, description, author, age])
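
The row-building code above cleans the author and age text inline with `replace`. As a sketch only, the same cleanup could live in one small function so that both the first-page cell and the all-pages loop share it. The name `clean_row` is hypothetical, and the sample strings in the usage line are made up for illustration; the assumption that the raw text starts with `By ` or `age ` is inferred from the `replace` calls, not from the site's markup.

```python
def clean_row(title, description, author, age):
    """Hypothetical helper: normalize the four scraped text fields for one CSV row."""
    author = author.strip()
    # Drop a leading "By " only; a plain replace would also hit " By " inside a name
    if author.startswith("By "):
        author = author[len("By "):]
    age = age.strip()
    # Drop a leading "age " label if present
    if age.startswith("age "):
        age = age[len("age "):]
    return [title.strip(), description.strip(), author, age]

# Illustrative input, not real scraped data
print(clean_row(" Some Title ", " One-line summary. ", " By Jane Doe ", "age 8+"))
```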

In [None]:
for page_url in all_pages_list:
    # Fetch and parse one results page at a time
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.findAll(class_="content-content-wrapper")
    # Append to the existing CSV; newline='' is what the csv module expects
    with open('lexile/books.csv', 'a', newline='') as csv_file:
        csv_writer = writer(csv_file)
        for book in books:
            title = book.find(class_="views-field views-field-field-reference-review-ent-prod result-title").get_text()
            description = book.find(class_="views-field views-field-field-one-liner one-liner").get_text()
            author = book.find(class_="views-field views-field-field-term-book-authors review-supplemental").get_text().replace(" By ", "").rstrip()
            age = book.find(class_="csm-green-age").get_text().replace("age ", "")
            csv_writer.writerow([title, description, author, age])
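
The loop above requests every results page back to back. One possible refinement, sketched here under assumptions, is to pause between consecutive requests. The helper `fetch_all` and its parameters are hypothetical; the fetch function is passed in so the sketch runs without network access, and in the notebook it would presumably be `requests.get`.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(url) for each URL, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        # No pause before the first request, one pause before each later request
        if i > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Stand-in fetch function so the example needs no network access
pages = fetch_all(["page-1", "page-2"], fetch=str.upper, delay_seconds=0)
print(pages)
```

In the notebook this could be called as `fetch_all(all_pages_list, requests.get)`, with the delay chosen to suit the site being scraped.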

In [None]:
df = pd.read_csv('lexile/books.csv')
df[df['Title'].str.contains("/")]

[Get Book Covers](https://towardsdatascience.com/web-scraping-using-beautifulsoup-edd9441ba734)

In [4]:
# Collect the <img> tags inside each cover container; the cells below use the one at index 1
covers = soup.findAll(class_="field-content review-product-image")
covers = [cover.findAll("img") for cover in covers]

In [5]:
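title = [cover[1].get('title') for cover in covers]
# Drop the last 18 characters of each image title and replace "/" so the title is usable as a file name
title = [t[:-18].replace("/", "_") for t in title]
cover_src = [cover[1].get('src') for cover in covers]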

In [6]:
info = dict(zip(title, cover_src))
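# Create the covers directory (and its parent) if missing; unlike a plain mkdir, this is safe to re-run
os.makedirs('lexile/covers', exist_ok=True)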

In [7]:
for k, v in info.items():
    # Save as .jpg when the URL contains '.jpg?', otherwise as .png
    ext = '.jpg' if '.jpg?' in v else '.png'
    try:
        with open('./lexile/covers/' + k + ext, 'wb') as f:
            f.write(requests.get(v).content)
    except FileNotFoundError:
        # Print the title of any cover whose file could not be opened
        print(k)
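
The download loop above decides the file extension by checking for `.jpg?` in the image URL and treats everything else as PNG. A sketch of an alternative, assuming the cover URLs carry a real file extension in their path, reads the extension from the URL itself. The function name `cover_filename` and the `example.com` URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def cover_filename(title, src, default_ext=".png"):
    """Hypothetical helper: file name from a book title plus the extension in the URL path."""
    # "/" cannot appear in a file name, so swap it out as the notebook does
    safe_title = title.replace("/", "_")
    # urlparse separates the path from any "?query" part before the extension is read
    ext = posixpath.splitext(urlparse(src).path)[1].lower()
    return safe_title + (ext or default_ext)

# Illustrative inputs, not real cover URLs
print(cover_filename("Either/Or", "https://example.com/img/cover.jpg?itok=abc"))
```

Falling back to `default_ext` keeps the original behavior for URLs with no recognizable extension.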

In [None]:
for page_url in all_pages_list:
    # Fetch and parse one results page, then download every cover on it
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    covers = soup.findAll(class_="field-content review-product-image")
    covers = [cover.findAll("img") for cover in covers]
    title = [cover[1].get('title') for cover in covers]
    title = [t[:-18].replace("/", "_") for t in title]
    cover_src = [cover[1].get('src') for cover in covers]
    info = dict(zip(title, cover_src))
    for k, v in info.items():
        # Save as .jpg when the URL contains '.jpg?', otherwise as .png
        ext = '.jpg' if '.jpg?' in v else '.png'
        try:
            with open('./lexile/covers/' + k + ext, 'wb') as f:
                f.write(requests.get(v).content)
        except FileNotFoundError:
            # Print the title of any cover whose file could not be opened
            print(k)

## Step 3: Explore the Data

## To do:


In [None]:
!mkdir lexile/test

In [None]:
for k, v in info.items():
    # Save as .jpg when the URL contains '.jpg?', otherwise as .png
    ext = '.jpg' if '.jpg?' in v else '.png'
    try:
        with open('./lexile/covers/' + k + ext, 'wb') as f:
            f.write(requests.get(v).content)
    except FileNotFoundError:
        # Print the title of any cover whose file could not be opened
        print(k)