# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS-S109A Introduction to Data Science 


## Lab 1: Pandas and Web Scraping with Beautiful Soup

**Harvard University**<br>
**Summer 2020**<br>
**Instructor:** Kevin Rader <br>
**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras, and Kevin Rader

---



In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Loading and Cleaning with Pandas</li>
<li> Parsing and Completing the Dataframe  </li>
<li> Grouping </li>
<li> Introduction to Web Servers and HTTP </li>
<li> Download webpages and get basic properties </li>
<li> Parse the page with Beautiful Soup</li>
<li> String formatting</li>
<li> Additional Python/Homework Comment</li>
<li> Walkthrough Example</li>
</ol>

## Learning Goals

About 6,000 "best books" were fetched and parsed from [Goodreads](https://www.goodreads.com). The "bestness" of these books came from a proprietary formula used by Goodreads and published as a list on their web site.

We parsed the page for each book and saved data from all these pages in a tabular format as a CSV file. In this lab we'll clean and further parse the data.  We'll then do some exploratory data analysis to answer questions about these best books and popular genres.  

We will then go back to Goodreads and scrape their Best Books list:

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1 .

We'll walk through scraping the list pages for the book names/urls. First, we start with an even simpler example.


By the end of this lab, you should be able to:

- Load the data and systematically address missing values, encoded as `NaN` values in our data set, for example, by removing observations associated with these values.
- Parse columns in the dataframe to create new dataframe columns.
- Use groupby to aggregate data on a particular feature column, such as author.
- Understand the structure of a web page.
- Understand how to use Beautiful Soup to scrape content from web pages.
- Feel comfortable storing and manipulating the content in various formats.
- Understand how to convert a structured format into a Pandas DataFrame.

*This lab corresponds to lectures #1, #2, and maps on to homework #1 and further.*


### Basic EDA workflow

(From the lecture, repeated here for convenience).

The basic workflow is as follows:

1. **Build** a DataFrame from the data (ideally, put all data in this object)
2. **Clean** the DataFrame. It should have the following properties:
    - Each row describes a single object
    - Each column describes a property of that object
    - Columns are numeric whenever appropriate
    - Columns contain atomic properties that cannot be further decomposed
3. Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.
4. Explore **group properties**. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format that is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
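As a quick illustration of the four steps, here is a minimal sketch on a hypothetical three-row table. The records and column names below are made up for this example and are not the Goodreads data; the plotting part of step 3 is left out to keep it short.

```python
import pandas as pd

# Hypothetical toy records (made up for this sketch); note everything starts as text.
raw = [
    {"title": "A", "author": "X", "year": "1990", "rating": "4.1"},
    {"title": "B", "author": "X", "year": "2001", "rating": "3.7"},
    {"title": "C", "author": "Y", "year": "2010", "rating": "4.5"},
]

# 1. Build: put all the data in one DataFrame.
books = pd.DataFrame(raw)

# 2. Clean: make columns numeric whenever appropriate.
books = books.astype({"year": "int64", "rating": "float64"})

# 3. Global properties: one summary number over the whole table.
overall_mean = books["rating"].mean()

# 4. Group properties: the same summary, computed per author.
mean_by_author = books.groupby("author")["rating"].mean()

print(overall_mean)
print(mean_by_author)
```

The rest of this lab follows the same order on the real data: load, clean, summarize, then group.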

## Part 1: Loading and Cleaning with Pandas 
Read in the `goodreads.csv` file, examine the data, and do any necessary data cleaning. 

Here is a description of the columns (in order) present in this csv file:

```
rating: the average rating on a 1-5 scale achieved by the book
review_count: the number of Goodreads users who reviewed this book
isbn: the ISBN code for the book
booktype: an internal Goodreads identifier for the book
author_url: the Goodreads (relative) URL for the author of the book
year: the year the book was published
genre_urls: a string with '|' separated relative URLS of Goodreads genre pages
dir: a directory identifier internal to the scraping code
rating_count: the number of ratings for this book (this is different from the number of reviews)
name: the name of the book
```

Let us see what issues we find with the data and resolve them.  



----




We start by loading the appropriate libraries:


In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

### Cleaning: Reading in the data
We read in and clean the data from `goodreads.csv`.

In [3]:
#Read the data into a dataframe
df = pd.read_csv("../data/goodreads.csv", encoding='utf-8')

#Examine the first few rows of the dataframe
df

Unnamed: 0,4.40,136455,0439023483,good_reads:book,https://www.goodreads.com/author/show/153394.Suzanne_Collins,2008,/genres/young-adult|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|/genres/book-club|/genres/young-adult|/genres/teen|/genres/apocalyptic|/genres/post-apocalyptic|/genres/action,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)"
0,4.41,16648,0439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003.0,/genres/fantasy|/genres/young-adult|/genres/fi...,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...
1,3.56,85746,0316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005.0,/genres/young-adult|/genres/fantasy|/genres/ro...,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)"
2,4.23,47906,0061120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960.0,/genres/classics|/genres/fiction|/genres/histo...,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird
3,4.23,34772,0679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813.0,/genres/classics|/genres/fiction|/genres/roman...,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice
4,4.25,12363,0446675539,good_reads:book,https://www.goodreads.com/author/show/11081.Ma...,1936.0,/genres/classics|/genres/historical-fiction|/g...,dir01/18405.Gone_with_the_Wind.html,645470,Gone with the Wind
5,4.22,7205,0066238501,good_reads:book,https://www.goodreads.com/author/show/1069006....,1949.0,/genres/classics|/genres/young-adult|/genres/c...,dir01/11127.The_Chronicles_of_Narnia.html,286677,The Chronicles of Narnia (Chronicles of Narnia...
6,4.38,10902,0060256656,good_reads:book,https://www.goodreads.com/author/show/435477.S...,1964.0,/genres/childrens|/genres/young-adult|/genres/...,dir01/370493.The_Giving_Tree.html,502891,The Giving Tree
7,3.79,20670,0452284244,good_reads:book,https://www.goodreads.com/author/show/3706.Geo...,1945.0,/genres/classics|/genres/fiction|/genres/scien...,dir01/7613.Animal_Farm.html,1364879,Animal Farm
8,4.18,12302,0345391802,good_reads:book,https://www.goodreads.com/author/show/4.Dougla...,1979.0,/genres/science-fiction|/genres/humor|/genres/...,dir01/11.The_Hitchhiker_s_Guide_to_the_Galaxy....,724713,The Hitchhiker's Guide to the Galaxy (Hitchhik...
9,4.03,20937,0739326228,good_reads:book,https://www.goodreads.com/author/show/614.Arth...,1997.0,/genres/fiction|/genres/historical-fiction|/ge...,dir01/930.Memoirs_of_a_Geisha.html,1042679,Memoirs of a Geisha


Oh no! That does not quite seem to be right. We are missing the column names. We need to add these in! But what are they?

Here is a list of them in order:

`["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name']`


<div class="exercise"><b>Q1.1</b></div>
Use these to load the dataframe properly! And then "head" the dataframe... (you will need to look at the read_csv docs)

In [6]:
df = pd.read_csv("../data/goodreads.csv", encoding='utf-8', 
                 names = ["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name'])

# Examine the first few rows of the dataframe
df.head()

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
0,4.4,136455,439023483,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2008.0,/genres/young-adult|/genres/science-fiction|/g...,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)"
1,4.41,16648,439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003.0,/genres/fantasy|/genres/young-adult|/genres/fi...,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...
2,3.56,85746,316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005.0,/genres/young-adult|/genres/fantasy|/genres/ro...,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)"
3,4.23,47906,61120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960.0,/genres/classics|/genres/fiction|/genres/histo...,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird
4,4.23,34772,679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813.0,/genres/classics|/genres/fiction|/genres/roman...,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice


### Cleaning: Examining the dataframe - quick checks

We should examine the dataframe to get an overall sense of the content. 

<div class="exercise"><b>Q1.2</b></div>
Let's check the types of the columns. What do you find?

In [10]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

*your answer here*

Notice that `review_count` and `rating_count` are objects instead of ints, and the `year` is a float!

There are a couple more quick sanity checks to perform on the dataframe. 

In [11]:
print(df.shape)
df.columns

(6000, 10)


Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'genre_urls', 'dir', 'rating_count', 'name'], dtype='object')

In [12]:
df.describe()

Unnamed: 0,rating,year
count,5998.0,5993.0
mean,4.042201,1969.085099
std,0.260661,185.383169
min,2.0,-1500.0
25%,3.87,1980.0
50%,4.05,2002.0
75%,4.21,2009.0
max,5.0,2014.0


### Cleaning: Examining the dataframe - a deeper look

Beyond checking some quick general properties of the data frame and looking at the first $n$ rows, we can dig a bit deeper into the values being stored. If you haven't already, check to see if there are any missing values in the data frame.

Let's check this, including for the columns that seemed OK to us.


<div class="exercise"><b>Q1.3</b></div>

Use a combination of `np.sum` and `df.isnull()` to determine where missingness occurs


In [13]:
######
# your code here
######
np.sum(df.isnull())

rating            2
review_count      0
isbn            475
booktype          0
author_url        0
year              7
genre_urls       62
dir               0
rating_count      0
name              0
dtype: int64

In [27]:
#Try to locate where the missing values occur in year:
df[df.year.isnull()]

#np.sum(df.year[5658:5660])

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
2442,4.23,526.0,,good_reads:book,https://www.goodreads.com/author/show/623606.A...,,/genres/religion|/genres/islam|/genres/non-fic...,dir25/1301625.La_Tahzan.html,4134.0,La Tahzan
2869,4.61,2.0,,good_reads:book,https://www.goodreads.com/author/show/8182217....,,,dir29/22031070-my-death-experiences---a-preach...,23.0,My Death Experiences - A Preacherâs 18 Apoca...
3643,,,,,,,,dir37/9658936-harry-potter.html,,
5282,,,,,,,,dir53/113138.The_Winner.html,,
5572,3.71,35.0,8423336603.0,good_reads:book,https://www.goodreads.com/author/show/285658.E...,,/genres/fiction,dir56/890680._rase_una_vez_el_amor_pero_tuve_q...,403.0,Ãrase una vez el amor pero tuve que matarlo. ...
5658,4.32,44.0,,good_reads:book,https://www.goodreads.com/author/show/25307.Ro...,,/genres/fantasy|/genres/fantasy|/genres/epic-f...,dir57/5533041-assassin-s-apprentice-royal-assa...,3850.0,Assassin's Apprentice / Royal Assassin (Farsee...
5683,4.56,204.0,,good_reads:book,https://www.goodreads.com/author/show/3097905....,,/genres/fantasy|/genres/young-adult|/genres/ro...,dir57/12474623-tiger-s-dream.html,895.0,"Tiger's Dream (The Tiger Saga, #5)"


How does `pandas` or `numpy` handle missing values when we try to compute with data sets that include them?
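To explore that question, here is a small sketch on a made-up three-element array (the values are illustrative, not taken from the data set). It compares plain NumPy reductions with the pandas equivalents.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
values = np.array([1.0, np.nan, 3.0])
series = pd.Series(values)

numpy_sum = np.sum(values)                # NaN propagates through plain NumPy reductions
numpy_nansum = np.nansum(values)          # the nan* variants ignore NaN
pandas_sum = series.sum()                 # pandas skips NaN by default (skipna=True)
pandas_strict = series.sum(skipna=False)  # ask pandas to propagate NaN instead
numpy_on_series = np.sum(series)          # np.sum on a Series defers to the pandas method
non_missing = series.count()              # count() only counts non-missing entries

print(numpy_sum, numpy_nansum, pandas_sum, pandas_strict, numpy_on_series, non_missing)
```

This is why `df.describe()` above reports different `count` values for `rating` and `year`: each column is summarized over its non-missing entries only.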

We'll now check if any of the other suspicious columns have missing values.  Let's look at `year` and `review_count` first.

One thing you can do is to try and convert to the type you expect the column to be. If something goes wrong, it likely means your data are bad.
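Here is a minimal sketch of that idea on a hypothetical column of counts stored as text, where one entry is the literal string `"None"`. The series below is made up for illustration. A strict conversion fails, while `pd.to_numeric` with `errors="coerce"` turns the bad entries into `NaN` so we can locate them.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, plus one non-numeric entry.
counts = pd.Series(["136455", "16648", "None", "85746"])

# A strict conversion raises as soon as it meets a value it cannot parse.
try:
    counts.astype("int64")
    conversion_ok = True
except ValueError:
    conversion_ok = False

# A lenient conversion marks unparseable values as NaN instead of failing.
as_numbers = pd.to_numeric(counts, errors="coerce")
bad_rows = counts[as_numbers.isnull()]

print(conversion_ok)
print(bad_rows)
```

Looking at `bad_rows` tells you which rows to inspect before deciding whether to drop or repair them.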

Let's test for missing data:

In [31]:
#df[df.year.isnull()]

df.year.isnull()
df.shape

(6000, 10)

### Cleaning: Dealing with Missing Values
How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values? 
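Before dropping anything, it may help to see how much each variant of `dropna` removes. The sketch below uses a made-up four-row dataframe (not the Goodreads data). Keep in mind that a plain `dropna()` drops every row with at least one missing value, which in this data set includes the 475 books whose `isbn` is missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks only an isbn, row 2 is entirely empty, row 3 lacks only a year.
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "isbn": ["111", None, None, "444"],
    "year": [1990.0, 2001.0, np.nan, np.nan],
})

drop_any = toy.dropna()                      # drop rows with ANY missing value
drop_empty = toy.dropna(how="all")           # drop only rows where EVERYTHING is missing
drop_no_year = toy.dropna(subset=["year"])   # drop rows missing a value in 'year' only

print(len(toy), len(drop_any), len(drop_empty), len(drop_no_year))
```

Which variant is appropriate depends on the analysis: if you never use `isbn`, dropping books just because it is missing throws away usable observations.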

In [36]:
# Remove the missing or invalid values in your dataframe

####### 
# your code here
####### 
df.dropna(inplace = True)
np.sum(df.isnull())


rating          0
review_count    0
isbn            0
booktype        0
author_url      0
year            0
genre_urls      0
dir             0
rating_count    0
name            0
dtype: int64

OK, so we have done some cleaning. What do things look like now? Notice that `year` is still a float: dropping rows does not change the column types.

In [37]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

In [38]:
print(np.sum(df.year.isnull()))
df.shape # dropna() removed 504 rows (6000 -> 5496), i.e. every row with any missing value

0


(5496, 10)

<div class="exercise"><b>Q1.4</b></div>

OK, so let's fix those types. Convert them to ints. If the type conversion fails, we now know we have further problems.
Hint: use `df.astype()`.

In [54]:
######
# your code here
######
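# astype returns a new dataframe (it has no inplace option), so assign the result back.
# Note: casting `rating` to int64 truncates the decimals (4.40 becomes 4), as the outputs below show.
df = df.astype({'rating': 'int64', 'year':'int64'})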
df.dtypes

rating           int64
review_count    object
isbn            object
booktype        object
author_url      object
year             int64
genre_urls      object
dir             object
rating_count    object
name            object
dtype: object

Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:

In [55]:
df.dtypes

rating           int64
review_count    object
isbn            object
booktype        object
author_url      object
year             int64
genre_urls      object
dir             object
rating_count    object
name            object
dtype: object

Sweet!
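Two details of `astype` are easy to miss, so here is a small sketch on a made-up two-row dataframe (the values mimic the first rows shown earlier, but the frame itself is hypothetical). First, `astype` returns a new object rather than modifying the original. Second, casting floats to ints truncates the decimal part.

```python
import pandas as pd

# Made-up frame: float ratings, float years, and counts stored as text.
toy = pd.DataFrame({
    "rating": [4.40, 3.56],
    "year": [2008.0, 2005.0],
    "review_count": ["136455", "85746"],
})

# The result is discarded here, so `toy` itself is unchanged.
toy.astype({"year": "int64"})
dtype_after_discard = str(toy["year"].dtype)

# Assigning the result keeps the conversion; text digits convert to ints too.
converted = toy.astype({"year": "int64", "review_count": "int64"})

# Float -> int drops the decimals rather than rounding.
truncated = list(toy["rating"].astype("int64"))

print(dtype_after_discard)
print(converted.dtypes)
print(truncated)
```

If you want to keep the average ratings as decimals, leave `rating` as a float and convert only the columns that are genuinely whole numbers.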

Some of the other columns that should be strings have NaN.

In [66]:
df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""

##  Part 2: Parsing and Completing the Data Frame 

We will parse the `author` column from the author_url and `genres` column from the genre_urls. Keep the `genres` column as a string separated by '|'.

We will use pandas' `map` to assign new columns to the dataframe.  

Examine an example `author_url` and reason about which sequence of string operations must be performed in order to isolate the author's name.

In [73]:
#Get an example author_url (the row with index label 3)
test_string = df.author_url[3]
test_string

'https://www.goodreads.com/author/show/1825.Harper_Lee'

In [75]:
#Test out some string operations to isolate the author name

test_string.split('/')[-1].split('.')[1:][0]

'Harper_Lee'

<div class="exercise"><b>Q2.1</b></div>

Let's wrap the above code into a function, which we will then use.

In [76]:
# Write a function that accepts an author url and returns the author's name based on your experimentation above


def get_author(url):
    return url.split('/')[-1].split('.')[1:][0]

In [78]:
# Apply the get_author function to the 'author_url' column using '.map' 
# and add a new column 'author' to store the names

df['author'] = df.author_url.map(get_author)
df.author[0:5]

0    Suzanne_Collins
1        J_K_Rowling
2    Stephenie_Meyer
3         Harper_Lee
4        Jane_Austen
Name: author, dtype: object
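As an aside, the same parsing can be written without a helper function by chaining pandas' vectorized `.str` methods. This is only an alternative sketch, not the approach the lab asks for; the two URLs are ones that appear in this notebook's outputs, and the variable names are made up here.

```python
import pandas as pd

# Two author URLs that appear elsewhere in this notebook.
urls = pd.Series([
    "https://www.goodreads.com/author/show/1825.Harper_Lee",
    "https://www.goodreads.com/author/show/903.Homer",
])

# Take the last path segment, then split once on the first '.' and keep what follows the id.
authors = urls.str.split("/").str[-1].str.split(".", n=1).str[1]

# Optionally make the names more readable.
readable = authors.str.replace("_", " ", regex=False)

print(list(authors))
print(list(readable))
```

Splitting only once (`n=1`) keeps everything after the numeric id together, which matters if a name itself ever contains a period.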


Now let's parse out the genres from `genre_urls`.  

This is a little more complicated because there may be more than one genre.


In [79]:
#Examine some examples of genre_urls
df.genre_urls.head()

0    /genres/young-adult|/genres/science-fiction|/g...
1    /genres/fantasy|/genres/young-adult|/genres/fi...
2    /genres/young-adult|/genres/fantasy|/genres/ro...
3    /genres/classics|/genres/fiction|/genres/histo...
4    /genres/classics|/genres/fiction|/genres/roman...
Name: genre_urls, dtype: object

In [83]:
######
#your code here
######


#Test out some string operations to isolate the genre name
test_genre_string=df.genre_urls[3]


genres=test_genre_string.strip().split('|')
for e in genres:
    print(e.split('/')[-1])
# "|".join(...) can later reassemble the cleaned genre names into a single string

classics
fiction
historical-fiction
academic
school
literature
young-adult
academic
read-for-school
novels
book-club
young-adult
high-school


<div class="exercise"><b>Q2.2</b></div>

Write a function that accepts a `genre_urls` string and returns the genre names joined by '|', based on your experimentation above



In [101]:
def split_and_join_genres(url):
    genres=url.strip().split('|')
    genres=[e.split('/')[-1] for e in genres]
    return "|".join(genres)

Test your function

In [102]:
split_and_join_genres("/genres/young-adult|/genres/science-fiction")

'young-adult|science-fiction'

In [103]:
split_and_join_genres("/genres/religion|/genres/islam")

'religion|islam'

<div class="exercise"><b>Q2.3</b></div>

Use map again to create a new "genres" column

In [104]:
df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name,author,genres
0,4,136455,439023483,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2008,/genres/young-adult|/genres/science-fiction|/g...,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)",Suzanne_Collins,young-adult|science-fiction|dystopia|fantasy|s...
1,4,16648,439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003,/genres/fantasy|/genres/young-adult|/genres/fi...,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...,J_K_Rowling,fantasy|young-adult|fiction|fantasy|magic|chil...
2,3,85746,316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005,/genres/young-adult|/genres/fantasy|/genres/ro...,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)",Stephenie_Meyer,young-adult|fantasy|romance|paranormal|vampire...
3,4,47906,61120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960,/genres/classics|/genres/fiction|/genres/histo...,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird,Harper_Lee,classics|fiction|historical-fiction|academic|s...
4,4,34772,679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813,/genres/classics|/genres/fiction|/genres/roman...,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice,Jane_Austen,classics|fiction|romance|historical-fiction|li...


Finally, let's pick an author at random so we can see the results of the transformations.  Scroll to see the `author` and `genres` columns that we added to the dataframe.

In [105]:
df[df.author == "Marguerite_Yourcenar"]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name,author,genres
1014,4,483,374529264,good_reads:book,https://www.goodreads.com/author/show/7732.Mar...,1951,/genres/historical-fiction|/genres/fiction|/ge...,dir11/12172.Memoirs_of_Hadrian.html,6258,Memoirs of Hadrian,Marguerite_Yourcenar,historical-fiction|fiction|cultural|france|cla...
5620,4,74,2070367983,good_reads:book,https://www.goodreads.com/author/show/7732.Mar...,1968,/genres/fiction|/genres/historical-fiction|/ge...,dir57/953435.L_uvre_au_noir.html,1601,L'Åuvre au noir,Marguerite_Yourcenar,fiction|historical-fiction|cultural|france|eur...


Let us delete the `genre_urls` column.

In [106]:
del df['genre_urls']

And then save the dataframe out!

In [107]:
df.to_csv("../data/cleaned-goodreads.csv", index=False, header=True)

---

## Part 3: Grouping 

<div class="exercise"><b>Q3.1</b></div>

It appears that some books were written in negative years!  Print out the observations that correspond to negative years.  What do you notice about these books?  

In [109]:
df[df.year < 0]

# Note: these are books written before the Common Era (BCE, equivalent to BC).

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,dir,rating_count,name,author,genres
47,3,5785,0143039954,good_reads:book,https://www.goodreads.com/author/show/903.Homer,-800,dir01/1381.The_Odyssey.html,560248,The Odyssey,Homer,classics|fiction|poetry|fantasy|mythology|acad...
246,4,365,0147712556,good_reads:book,https://www.goodreads.com/author/show/903.Homer,-800,dir03/1375.The_Iliad_The_Odyssey.html,35123,The Iliad/The Odyssey,Homer,classics|fantasy|mythology|fantasy|academic|sc...
455,3,1499,0140449140,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-380,dir05/30289.The_Republic.html,82022,The Republic,Plato,philosophy|classics|non-fiction|politics|histo...
596,3,1240,0679729526,good_reads:book,https://www.goodreads.com/author/show/919.Virgil,-29,dir06/12914.The_Aeneid.html,60308,The Aeneid,Virgil,classics|poetry|fiction|fantasy|mythology|acad...
629,3,1231,1580495931,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-429,dir07/1554.Oedipus_Rex.html,93192,Oedipus Rex,Sophocles,classics|plays|drama|fiction|academic|school|l...
674,3,3559,1590302257,good_reads:book,https://www.goodreads.com/author/show/1771.Sun...,-512,dir07/10534.The_Art_of_War.html,114619,The Art of War,Sun_Tzu,non-fiction|politics|classics|literature|psych...
746,4,1087,0140449183,good_reads:book,https://www.goodreads.com/author/show/5158478....,-500,dir08/99944.The_Bhagavad_Gita.html,31634,The Bhagavad Gita,Anonymous,classics|spirituality|religion|hinduism|fantas...
777,3,1038,1580493882,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-442,dir08/7728.Antigone.html,49084,Antigone,Sophocles,drama|fiction|classics|academic|read-for-schoo...
1233,3,704,015602764X,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-400,dir13/1540.The_Oedipus_Cycle.html,36008,The Oedipus Cycle,Sophocles,classics|plays|drama|fiction|literature|academ...
1397,4,890,0192840509,good_reads:book,https://www.goodreads.com/author/show/12452.Aesop,-560,dir14/21348.Aesop_s_Fables.html,71259,Aesop's Fables,Aesop,classics|childrens|literature|fantasy|fairy-ta...


We could determine the "best book" by year! For this we use pandas' `groupby()`. `groupby()` allows grouping a dataframe by any (usually categorical) variable. Would it ever make sense to group by integer variables? Floating point variables?

More usefully, let's consider grouping by author:

In [110]:
df

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,dir,rating_count,name,author,genres
0,4,136455,0439023483,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2008,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)",Suzanne_Collins,young-adult|science-fiction|dystopia|fantasy|s...
1,4,16648,0439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...,J_K_Rowling,fantasy|young-adult|fiction|fantasy|magic|chil...
2,3,85746,0316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)",Stephenie_Meyer,young-adult|fantasy|romance|paranormal|vampire...
3,4,47906,0061120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird,Harper_Lee,classics|fiction|historical-fiction|academic|s...
4,4,34772,0679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice,Jane_Austen,classics|fiction|romance|historical-fiction|li...
5,4,12363,0446675539,good_reads:book,https://www.goodreads.com/author/show/11081.Ma...,1936,dir01/18405.Gone_with_the_Wind.html,645470,Gone with the Wind,Margaret_Mitchell,classics|historical-fiction|fiction|romance|li...
6,4,7205,0066238501,good_reads:book,https://www.goodreads.com/author/show/1069006....,1949,dir01/11127.The_Chronicles_of_Narnia.html,286677,The Chronicles of Narnia (Chronicles of Narnia...,C_S_Lewis,classics|young-adult|childrens|christian|adven...
7,4,10902,0060256656,good_reads:book,https://www.goodreads.com/author/show/435477.S...,1964,dir01/370493.The_Giving_Tree.html,502891,The Giving Tree,Shel_Silverstein,childrens|young-adult|childrens|picture-books|...
8,3,20670,0452284244,good_reads:book,https://www.goodreads.com/author/show/3706.Geo...,1945,dir01/7613.Animal_Farm.html,1364879,Animal Farm,George_Orwell,classics|fiction|science-fiction|dystopia|lite...
9,4,12302,0345391802,good_reads:book,https://www.goodreads.com/author/show/4.Dougla...,1979,dir01/11.The_Hitchhiker_s_Guide_to_the_Galaxy....,724713,The Hitchhiker's Guide to the Galaxy (Hitchhik...,Douglas_Adams,science-fiction|humor|fantasy|classics|humor|c...


In [117]:
dfgb_author = df.groupby('author')
type(dfgb_author)

pandas.core.groupby.generic.DataFrameGroupBy

Perhaps we want the number of books each author wrote:

In [120]:
dfgb_author.count().head()

Unnamed: 0_level_0,rating,review_count,isbn,booktype,author_url,year,dir,rating_count,name,genres
author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
A_A_Milne,6,6,6,6,6,6,6,6,6,6
A_G_Howard,1,1,1,1,1,1,1,1,1,1
A_J_Cronin,1,1,1,1,1,1,1,1,1,1
A_J_Jacobs,1,1,1,1,1,1,1,1,1,1
A_N_Roquelaure,2,2,2,2,2,2,2,2,2,2


Lots of useless info there. One column should suffice.  But perhaps you want more detailed info...

In [121]:
dfgb_author[['rating', 'rating_count', 'review_count', 'year']].describe()

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,year,year,year,year,year,year,year,year
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
author,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
A_A_Milne,6.0,4.000000,0.000000,4.0,4.00,4.0,4.00,4.0,6.0,1944.166667,29.294482,1926.0,1926.25,1927.5,1952.75,1997.0
A_G_Howard,1.0,4.000000,,4.0,4.00,4.0,4.00,4.0,1.0,2013.000000,,2013.0,2013.00,2013.0,2013.00,2013.0
A_J_Cronin,1.0,4.000000,,4.0,4.00,4.0,4.00,4.0,1.0,1941.000000,,1941.0,1941.00,1941.0,1941.00,1941.0
A_J_Jacobs,1.0,3.000000,,3.0,3.00,3.0,3.00,3.0,1.0,2007.000000,,2007.0,2007.00,2007.0,2007.00,2007.0
A_N_Roquelaure,2.0,3.000000,0.000000,3.0,3.00,3.0,3.00,3.0,2.0,1983.500000,0.707107,1983.0,1983.25,1983.5,1983.75,1984.0
A_S_Byatt,1.0,3.000000,,3.0,3.00,3.0,3.00,3.0,1.0,1990.000000,,1990.0,1990.00,1990.0,1990.00,1990.0
A_S_King,1.0,3.000000,,3.0,3.00,3.0,3.00,3.0,1.0,2010.000000,,2010.0,2010.00,2010.0,2010.00,2010.0
Abbi_Glines,1.0,4.000000,,4.0,4.00,4.0,4.00,4.0,1.0,2013.000000,,2013.0,2013.00,2013.0,2013.00,2013.0
Abigail_Gibbs,1.0,3.000000,,3.0,3.00,3.0,3.00,3.0,1.0,2012.000000,,2012.0,2012.00,2012.0,2012.00,2012.0
Abigail_Roux,4.0,4.000000,0.000000,4.0,4.00,4.0,4.00,4.0,4.0,2010.750000,2.217356,2008.0,2009.50,2011.0,2012.25,2013.0


You can also iterate over a `groupby` object dictionary style, as pairs of a group key and the sub-dataframe for that group.

In [125]:
ratingdict = {}
for author, subset in dfgb_author:
    ratingdict[author] = (subset['rating'].mean(), subset['rating'].std())
#ratingdict
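For comparison, here is a short sketch of two other ways to get at the groups, on a made-up three-row dataframe (the ratings are illustrative). `get_group` pulls out a single group by key, and `agg` with a list of function names computes the same mean/std pairs as the loop above in one call.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "author": ["Homer", "Homer", "Plato"],
    "rating": [3, 4, 3],
})
grouped = toy.groupby("author")

# Look up one group by its key, much like indexing a dictionary.
homer_rows = grouped.get_group("Homer")

# One row per author, one column per statistic.
summary = grouped["rating"].agg(["mean", "std"])

print(homer_rows)
print(summary)
```

Note that the standard deviation for an author with a single book is `NaN`, just as in the `describe()` output earlier.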

<div class="exercise"><b>Q3.2</b></div>

This is a longer exercise, and you may want to use several cells to answer the following:

- Group the dataframe by `author`. Include the following columns: `rating`, `name`, `author`. For the aggregation of the `name` column, which holds the book names, create a list of strings containing the name of each book. Make sure that the way you aggregate the rest of the columns makes sense! 

- Create a new column with number of books for each author and find the most prolific author!
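One possible shape for the first step, sketched on a made-up three-row dataframe rather than the real data: pandas' named aggregation lets each output column choose its own source column and aggregation function. The output column names `mean_rating` and `books` are hypothetical choices for this example.

```python
import pandas as pd

# Made-up rows; the titles are ones that appear in this notebook.
toy = pd.DataFrame({
    "author": ["A_A_Milne", "A_A_Milne", "Harper_Lee"],
    "name": ["Winnie-the-Pooh", "Now We Are Six", "To Kill a Mockingbird"],
    "rating": [4, 4, 4],
})

# Each keyword is an output column: (source column, aggregation).
authors_toy = toy.groupby("author").agg(
    mean_rating=("rating", "mean"),  # averaging makes sense for ratings
    books=("name", list),            # collecting makes sense for titles
).reset_index()

print(authors_toy)
```

This collects the titles directly into Python lists; the cells below take a different route and first join the names into one string.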

In [131]:
###### Before we start : what do we do about these titles where 'name' is unreadable? Try different encodings?
auth_name = 'A_A_Milne'
df[df.author == auth_name].head()

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,dir,rating_count,name,author,genres
100,4,1886,525467564,good_reads:book,https://www.goodreads.com/author/show/81466.A_...,1926,dir02/99107.Winnie_the_Pooh.html,157833,Winnie-the-Pooh,A_A_Milne,classics|childrens|fiction|fantasy|young-adult...
1550,4,558,525444440,good_reads:book,https://www.goodreads.com/author/show/81466.A_...,1928,dir16/776407.The_House_at_Pooh_Corner.html,55766,The House at Pooh Corner,A_A_Milne,childrens|fiction|classics|fantasy|animals
1679,4,1,1559352752,good_reads:book,https://www.goodreads.com/author/show/81466.A_...,1997,dir17/1370123.The_House_at_Pooh_Corner_and_Now...,544,The House at Pooh Corner and Now We Are Six,A_A_Milne,classics|childrens|poetry|childrens|juvenile|f...
3685,4,323,525444475,good_reads:book,https://www.goodreads.com/author/show/81466.A_...,1926,dir37/99111.The_World_of_Winnie_the_Pooh.html,26787,The World of Winnie-the-Pooh,A_A_Milne,childrens|classics|fiction|fantasy|childrens|p...
3909,4,194,525444467,good_reads:book,https://www.goodreads.com/author/show/81466.A_...,1927,dir40/821000.Now_We_Are_Six.html,11817,Now We Are Six,A_A_Milne,childrens|classics|fiction|childrens|childrens...


In [135]:
df[df.author == auth_name].iloc[0,8].encode('UTF-16')

b'\xff\xfeW\x00i\x00n\x00n\x00i\x00e\x00-\x00t\x00h\x00e\x00-\x00P\x00o\x00o\x00h\x00'

In [136]:
# let's examine the columns we have
df.columns

Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'dir', 'rating_count', 'name', 'author', 'genres'], dtype='object')

Create the GroupBy table

In [137]:
authors = df.copy()
authors = authors[['rating','name','author']].groupby('author').agg(
    ######
    # your code here
    # Hint1: calculate the mean of 'rating' across authors and
    # Hint2: .join the 'name' (of books) into a single string with | as the separator
    ######
    {'rating': 'mean', 'name': '|'.join}
)

In [None]:
authors = authors.reset_index()
authors.head()

In [None]:
# replace 'name' with a list of book names after using str.split
######
# your code here
######

#authors.head()

In [None]:
# count the books - create new column
len(authors.name[0])

In [None]:
# count the number of books for each author 

######
# your code here
######


In [None]:
# determine who has been the most prolific (in this list)

######
# your code here
######
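The last three steps (split, count, find the maximum) can be sketched on a made-up two-author table whose `name` column already holds '|'-joined titles. The data below are illustrative only, so the winner here says nothing about the real list.

```python
import pandas as pd

# Made-up table in the shape produced by joining book names with '|'.
authors_toy = pd.DataFrame({
    "author": ["A_A_Milne", "Harper_Lee"],
    "name": ["Winnie-the-Pooh|Now We Are Six", "To Kill a Mockingbird"],
})

# Turn each joined string into a list of titles.
authors_toy["name"] = authors_toy["name"].str.split("|")

# .str.len() also works on lists: it gives the number of titles per author.
authors_toy["n_books"] = authors_toy["name"].str.len()

# The most prolific author is the row with the largest count.
top = authors_toy.sort_values("n_books", ascending=False).iloc[0]

print(top["author"], top["n_books"])
```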


Wow! **Stephen King** wins with **56 books**!

<div class="exercise"><b>BONUS QUESTION</b></div>

Let's get the best-rated book(s) for every year in our Goodreads dataframe.

In [None]:

######
# your code here
# Hint: don't try to slice, rather use a for loop to iterate through each  
# 'group' within the groupBy object
######
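To illustrate the hint, here is a sketch of the loop on a made-up four-row dataframe (titles and ratings are placeholders). Comparing against the group maximum, rather than taking a single row, keeps ties, which is why the question says book(s).

```python
import pandas as pd

# Made-up rows; years 2008 has a tie for the top rating.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": [2008, 2008, 1960, 2008],
    "rating": [4.4, 4.1, 4.2, 4.4],
})

best_by_year = {}
for year, group in toy.groupby("year"):
    # Keep every book whose rating equals the best rating in that year.
    top = group[group["rating"] == group["rating"].max()]
    best_by_year[year] = list(top["name"])

print(best_by_year)
```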




---

# Introduction to Web Servers and HTTP

A web server is just a computer (usually a powerful one, but ultimately just another computer) that runs a long-lived process listening for requests on a pre-specified network _port_ on that machine. It responds to those requests via a protocol called HTTP (HyperText Transfer Protocol); HTTPS is the secure version. When we use a web browser and navigate to a web page, our browser is actually sending a request on our behalf to a specific web server. The request is essentially saying "hey, please give me the web page contents", and once the server responds, it's up to the browser to render that raw content coherently, depending on the format of the file. For example, HTML is one format, XML is another format, and so on.
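To make the request/response exchange concrete, here is a minimal sketch that starts a toy web server on your own machine using only Python's standard library, then sends it two requests. The names `HelloHandler` and `fetch_status` are made up for this illustration, and the sketch assumes your environment allows connections to `127.0.0.1`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Toy handler: serves one HTML page at '/' and a 404 for everything else."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body><h1>Hello</h1></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


def fetch_status(port, path):
    """Send a GET request to the local server and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status = fetch_status(port, "/")
missing_status = fetch_status(port, "/no-such-page")

server.shutdown()
server.server_close()

print(ok_status, missing_status)
```

The server process waits on its port, answers each request with a status code and some content, and goes back to waiting. Real web servers do the same thing at a much larger scale.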

Ideally (and usually), the web server complies with the request and all is fine. As part of this communication exchange with web servers, the server also sends a status code.
- If the code starts with a **2**, it means the request was successful.
- If the code starts with a **4**, it means there was a client error (you, as the user, are the client). For example, ever receive a 404 File Not Found error because a web page doesn't exist? This is an example of a client error, because you are requesting a bogus item.
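- If the code starts with a **5**, it means there was a server error (the server failed to fulfill a request that appeared valid; an incorrectly formed request is instead reported as a client error, such as 400 Bad Request).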

[Click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.

As an analogy, you can think of a web server as being like a server at a restaurant; its goal is to _serve_ your requests. When you try to order something not on the menu (i.e., ask for a web page at a wrong location), the server says 'sorry, we don't have that' (i.e., 404, client error; your mistake).
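
To make the first-digit rule concrete, here is a minimal sketch using only Python's standard library. The helper name `describe_status` is hypothetical, and the "redirection" family (codes starting with 3) is an addition not covered in the list above:

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Classify an HTTP status code by its first digit."""
    families = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return families.get(code // 100, "other")

for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```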

**IMPORTANT:**
As humans, we visit pages at a sane, reasonable rate. However, when we scrape web pages, our code sends the requests, so we can make them at an incredible rate. This is potentially dangerous because it's akin to going to a restaurant and bombarding the server(s) with thousands of food orders. Very often, the restaurant will ban you (i.e., Harvard's network gets banned from the website, and you may be held responsible in some capacity). It is imperative to be responsible and careful. In fact, flooding a website with requests is one of the most common, if archaic, methods of maliciously attacking websites and other computers connected to the Internet. In short, be respectful and careful with your decisions and code. It is better to err on the side of caution, which includes using the **``time.sleep()`` function** to pause your code's execution between subsequent requests. ``time.sleep(2)`` should be fine when making just a few dozen requests. Each site has its own rules, which are often visible via the site's ``robots.txt`` file.
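
The sketch below shows one way to combine these two habits (checking ``robots.txt`` and pausing between requests) using the standard library's `urllib.robotparser`. The rules and URLs are hypothetical and are parsed from a string, so no network request is made; a real scraper would instead load the site's actual ``robots.txt``.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a string so nothing is downloaded
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical URLs we might want to fetch
urls = ["https://example.com/list?page=1", "https://example.com/private/data"]
delay = parser.crawl_delay("*") or 2  # fall back to 2 seconds if no Crawl-delay is given

for url in urls:
    if parser.can_fetch("*", url):
        print("allowed:", url)   # a real scraper would call requests.get(url) here
        time.sleep(delay)        # pause between consecutive requests
    else:
        print("disallowed by robots.txt:", url)
```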

### Additional Resources

**HTML:** if you are not familiar with HTML see https://www.w3schools.com/html/ or one of the many tutorials on the internet.

**Document Object Model (DOM):** for more on this programming interface for HTML and XML documents see https://www.w3schools.com/js/js_htmldom.asp.

## Part 4: Download webpages and get basic properties

``Requests`` is a highly useful Python library that allows us to fetch web pages.
``BeautifulSoup`` is a phenomenal Python library that allows us to easily parse web content and perform basic extraction.

If one wishes to scrape webpages, one usually uses ``requests`` to fetch the page and ``BeautifulSoup`` to parse the page's meaningful components. Webpages can be messy, despite having a structured format, which is why BeautifulSoup is so handy.

Let's get started:

In [138]:
from bs4 import BeautifulSoup
import requests

To fetch a webpage's content, we can simply use the ``get()`` function within the requests library:

In [139]:
url = "https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url) # you can use any URL that you wish

The response variable has many highly useful attributes, such as:
- status_code
- text
- content

Let's try each of them!

### response.status_code

In [140]:
response.status_code

200

You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.

### response.text


In [141]:
response.text

'<!doctype html><html class="no-js" lang="en"><head>\n\n\n<!--\n\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\nnnn         nnnnnn ppp          pppppp rrr           rrrr\nnnn   nnnn    nnnn ppp   pppppp   pppp rrr   rrrrrr   rrr\nnnn   nnnnnn   nnn ppp   ppppppp   ppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp   ppppp    pppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp           ppppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp   ppppppppppppp rrr   rrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppp   ppppppppppppp rrrrrrrrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\n\nWe are hiring!\n\nhttps://n.pr/tech-jobs\n\n-->\n\n\n<!-- Google Tag Manager -->\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\'https://www.googletagmanager.com/gtm.js?

Holy moly! That looks awful. If we use our browser to visit the URL, then right-click the page and click 'View Page Source', we see that it is identical to this chunk of glorious text.

### response.content

In [142]:
response.content

b'<!doctype html><html class="no-js" lang="en"><head>\n\n\n<!--\n\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\nnnn         nnnnnn ppp          pppppp rrr           rrrr\nnnn   nnnn    nnnn ppp   pppppp   pppp rrr   rrrrrr   rrr\nnnn   nnnnnn   nnn ppp   ppppppp   ppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp   ppppp    pppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp           ppppp rrr   rrrrrrrrrrrr\nnnn   nnnnnn   nnn ppp   ppppppppppppp rrr   rrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppp   ppppppppppppp rrrrrrrrrrrrrrrrrr\nnnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr\n\nWe are hiring!\n\nhttps://n.pr/tech-jobs\n\n-->\n\n\n<!-- Google Tag Manager -->\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\'https://www.googletagmanager.com/gtm.js

What?! This seems identical to the ``.text`` field. However, the careful eye would notice that the very first characters differ: ``.content`` begins with a ``b'`` prefix, which in Python syntax denotes that the data type is bytes, whereas ``.text`` has no such prefix and is a regular string.
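
To make the bytes/string distinction concrete, here is a small sketch that does not touch the network; the sample string is made up. Encoding a string yields bytes (the kind of object ``.content`` holds), and decoding those bytes yields a string again (the kind of object ``.text`` holds):

```python
raw = "café".encode("utf-8")   # bytes, like response.content
text = raw.decode("utf-8")     # str, like response.text

print(type(raw), raw)          # the bytes repr starts with b'
print(type(text), text)
print(len(raw), len(text))     # the bytes are longer: 'é' takes two bytes in UTF-8
```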

Ok, so that's great, but how do we make sense of this text? We could manually parse it, but that's tedious and difficult. As mentioned, BeautifulSoup is specifically designed to parse this exact content (any webpage content).

## BEAUTIFUL SOUP
![title](../images/soup_for_you.jpg) (property of NBC)


The [documentation for BeautifulSoup is found here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

A BeautifulSoup object can be initialized with the ``.content`` from the response and a flag denoting the type of parser that we should use. For example, we could specify ``html.parser``, ``lxml``, etc. ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)). Since we are interested in standard webpages that use HTML, let's specify ``html.parser``:

In [143]:
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>
<html class="no-js" lang="en"><head>
<!--

nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr
nnn         nnnnnn ppp          pppppp rrr           rrrr
nnn   nnnn    nnnn ppp   pppppp   pppp rrr   rrrrrr   rrr
nnn   nnnnnn   nnn ppp   ppppppp   ppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp   ppppp    pppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp           ppppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp   ppppppppppppp rrr   rrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppp   ppppppppppppp rrrrrrrrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr

We are hiring!

https://n.pr/tech-jobs

-->
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f

Alright! That looks a little better; there's some whitespace formatting, adding some structure to our content! HTML code is structured by `<tags>`. Most tags have an opening and a closing portion, denoted by ``< >`` and ``</ >``, respectively (a few, such as ``<meta>`` and ``<br>``, have no closing portion). If we want just the text (not the tags), we can use:

In [144]:
soup.get_text()

"\n\n\n\n\n\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\n\n\n\n\n\n\n\n\n\n\n\nAccessibility links \nSkip to main content\nKeyboard shortcuts for audio player\n\n\n\n\n\n\n\n\n                    Open Navigation Menu\n                \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNPR Shop\n\n\n\n\n\n\n\n\n                    Close Navigation Menu\n\n\n\n\nHome\n\n\n\nNews\nExpand/collapse submenu for News\n\n\nNational\nWorld\nPolitics\nBusiness\nHealth\nScience\nTechnology\nRace & Culture\n\n\n\n\nArts & Life\nExpand/collapse submenu for Arts & Life\n\n\nBooks\nMovies\nTelevision\nPop Culture\nFood\nArt & Design \nPerforming Arts\nLife Kit\n\n\n\n\nMusic\nExpand/collapse submenu for Music\n\n\n\n        #NowPlaying\n    \n\n\n        Tiny Desk\n    \n\n\n        All Songs Considered\n    \n\n\n        Music News\n    \n\n\n        Music Features\n    \n\n\n        Live Sessions\n    \n\n\n\n\nShows & Podcasts\nExpand/collapse submenu f

There's some tricky JavaScript still nested within it, but it is definitely cleaner. On other websites, you may find even clearer text extraction.

As detailed in the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), the easiest way to navigate through the tags is to simply name the tag you're interested in. For example:

In [145]:
soup.head # fetches the head tag, which encompasses the title tag

<head>
<!--

nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr
nnn         nnnnnn ppp          pppppp rrr           rrrr
nnn   nnnn    nnnn ppp   pppppp   pppp rrr   rrrrrr   rrr
nnn   nnnnnn   nnn ppp   ppppppp   ppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp   ppppp    pppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp           ppppp rrr   rrrrrrrrrrrr
nnn   nnnnnn   nnn ppp   ppppppppppppp rrr   rrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppp   ppppppppppppp rrrrrrrrrrrrrrrrrr
nnnnnnnnnnnnnnnnnn ppppppppppppppppppp rrrrrrrrrrrrrrrrrr

We are hiring!

https://n.pr/tech-jobs

-->
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer', 'GT

Usually head tags are small and only contain the most important contents; however, here, there's some JavaScript code. The ``title`` tag resides within the head tag.

In [146]:
soup.title # we can specifically call for the title tag

<title>What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR</title>

This result includes the tag itself. To get just the text within the tags, we can use the ``.string`` property.

In [147]:
soup.title.string

'What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR'

We can navigate to the parent tag (the tag that encompasses the current tag) via the ``.parent`` attribute:

In [148]:
soup.title.parent.name

'head'

## Parse the page with Beautiful Soup
In HTML code, paragraphs are often denoted with a ``<p>`` tag.

In [149]:
soup.p

<p class="byline__name byline__name--block">
<a data-metrics='{"action":"Click Byline","category":"Story Metadata"}' href="https://www.npr.org/people/392602474/domenico-montanaro" rel="author">
      Domenico Montanaro
    </a>
</p>

This returns the first paragraph, and we can access properties of the given tag with the same syntax we use for dictionaries and dataframes:

In [150]:
soup.p['class']

['byline__name', 'byline__name--block']

In addition to 'paragraph' (aka p) tags, link tags are also very common and are denoted by ``<a>`` tags.

In [151]:
soup.a

<a class="skiplink" href="#mainContent">Skip to main content</a>

It is called the a tag because links are also called 'anchors'. Nearly every page has multiple paragraphs and anchors, so how do we access the subsequent tags? There are two common functions, `.find()` and `.find_all()`.

In [152]:
soup.find('title')

<title>What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR</title>

In [153]:
soup.find_all('title')

[<title>What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR</title>]

Here, the results were seemingly the same, since there is only one title to a webpage. However, you'll notice that ``.find_all()`` returned a list, not a single item. Sure, there was only one item in the list, but it returned a list. As the name implies, find_all() returns all items that match the passed-in tag.
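
The difference becomes clearer when more than one tag matches, or when none does. Below is a minimal sketch on a made-up miniature page (the HTML string and the variable name `mini` are hypothetical, chosen so the example runs without a network request):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, so the example runs without fetching anything
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
mini = BeautifulSoup(html, "html.parser")

first_p = mini.find("p")       # a single Tag (or None if nothing matches)
all_p = mini.find_all("p")     # always a list-like result, possibly empty

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(mini.find("table"))      # there is no <table> in this page
```

Because ``.find()`` can return ``None``, it is worth checking its result before chaining further calls onto it.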

In [None]:
soup.find_all('a')

Look at all of those links! Amazing. It might be hard to read but the **href** portion of an *a* tag denotes the URL, and we can capture it via the ``.get()`` function.

In [None]:
for link in soup.find_all('a'): # we could optionally pass the href=True flag .find_all('a', href=True)
    print(link.get('href'))

Many of those links are relative to the current URL (e.g., /section/news/).
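
A relative link has to be combined with the page's own URL before it can be fetched. One standard-library way to do this is `urllib.parse.urljoin`; the sketch below uses made-up `href` values of the kinds a page may contain:

```python
from urllib.parse import urljoin

base = "https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"

# Hypothetical href values: site-relative, fragment-only, and already absolute
for href in ["/section/news/", "#mainContent", "https://www.npr.org/people/"]:
    print(urljoin(base, href))
```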

In [None]:
paragraphs = soup.find_all('p')
paragraphs

If we want just the paragraph text:

In [None]:
for pa in paragraphs:
    print(pa.get_text())

Since there are multiple tags and various attributes, it is useful to check the data type of BeautifulSoup objects:

In [None]:
type(soup.find('p'))

Since the ``.find()`` function returns a BeautifulSoup element, we can tack on multiple calls that continue to return elements:

In [None]:
soup.find('p')

In [None]:
soup.find('p').find('a')

In [None]:
soup.find('p').find('a').attrs['href'] # .attrs is a dictionary of the tag's attributes

In [None]:
soup.find('p').find('a').text

That doesn't look pretty, but it makes sense because if you look at what ``.find('a')`` returned, there is plenty of whitespace. We can remove that with Python's built-in ``.strip()`` function.

In [None]:
soup.find('p').find('a').text.strip()

**NOTE:** above, we read an attribute of a link through the ``.attrs`` property, which is a dictionary that maps each attribute name of a tag to its value; ``.attrs['href']`` simply looks up the key ``href``. Separately, ``.find()`` and ``.find_all()`` accept a parameter that is also named ``attrs``, which takes a dictionary and is used for filtering. Passing ``href=True``, as mentioned in the comment earlier, only requires that the ``<a>`` tag have an attribute named ``href`` and makes no demands on what its value must be. Alternatively, if you inspect your HTML code and notice select regions from which you'd like to extract text, you can require specific attribute values, too!

For example, in the full ``response.text``, we see the following line:

``<header class="npr-header" id="globalheader" aria-label="NPR header">``

Let's say that we know that the information we care about is within tags that match this template (i.e., **class** is an attribute, and its value is **'npr-header'**).

In [None]:
soup.find('header', attrs={'class':'npr-header'})

This matched it! We could then continue further processing by tacking on other commands:

In [None]:
soup.find('header', attrs={'class':'npr-header'}).find_all("li") # li stands for list items

This returns all of our list items, and since it's within a particular header section of the page, it appears they are links to menu items for navigating the webpage. If we wanted to grab just the links within these:

In [None]:
menu_links = set()
for list_item in soup.find('header', attrs={'class':'npr-header'}).find_all("li"):
    for link in list_item.find_all('a', href=True):
        menu_links.add(link)
menu_links # a unique set of all the seemingly important links in the header
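
The same filtering pattern can be tried offline. The sketch below uses a made-up snippet modeled on the header line quoted above (the markup and the name `mini` are hypothetical). It collects the `href` strings rather than the tag objects, which is one possible design choice when only the URLs are needed:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the header line quoted above
html = """
<header class="npr-header" id="globalheader">
  <ul><li><a href="/sections/news/">News</a></li><li><a href="/music/">Music</a></li></ul>
</header>
<footer><ul><li><a href="/about/">About</a></li></ul></footer>
"""
mini = BeautifulSoup(html, "html.parser")

header = mini.find("header", attrs={"class": "npr-header"})    # filter on an attribute value
hrefs = {a["href"] for a in header.find_all("a", href=True)}   # a set, so duplicates collapse
print(sorted(hrefs))
```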

## TAKEAWAY LESSON
The above tutorial isn't meant to be a study guide to memorize; its point is to show you the most important functionality that exists within BeautifulSoup, and to illustrate how one can access different pieces of content. No two web scraping tasks are identical, so it's useful to play around with code and try different things, while using the above as examples of how you may navigate between different tags and properties of a page. Don't worry; we are always here to help when you get stuck!

# String formatting
As we parse webpages, we may often want to further adjust and format the text to a certain way.

For example, say we wanted to scrape a political website that lists every US Senator's name and office phone number. We may want to store the information for each senator in a dictionary, and all senators' information in a list. Thus, we'd have a list of dictionaries. Below, we initialize such a list of dictionaries (it has only 3 senators, for illustrative purposes, but imagine it contains many more).

In [None]:
# this is a bit clumsy of an initialization, but we spell it out this way for clarity purposes
# NOTE: imagine the dictionary were constructed in a more organic manner
senator1 = {"name":"Lamar Alexander", "number":"555-229-2812"}
senator2 = {"name":"Tammy Baldwin", "number":"555-922-8393"}
senator3 = {"name":"John Barrasso", "number":"555-827-2281"}
senators = [senator1, senator2, senator3]
print(senators)

In the real world, we may not want the final form of our information to be a Python dictionary; rather, we may need to send an email to people in our mailing list, urging them to call their senators. If we have a templated format in mind, we can do the following:

In [None]:
email_template = """Please call {name} at {number}"""
for senator in senators:
    print(email_template.format(**senator))

**Please [visit here](https://docs.python.org/3/library/stdtypes.html#str.format)** for further documentation.

Alternatively, one can format text with f-strings. [See documentation here](https://docs.python.org/3/reference/lexical_analysis.html#f-strings). For example, using the above data structure and goal, one could obtain identical results via:

In [None]:
for senator in senators:
    print(f"Please call {senator['name']} at {senator['number']}")

Additionally, sometimes we wish to search large strings of text. If we wish to find all occurrences within a given string, a very mechanical, procedural way of doing it would be to use the ``.find()`` function in Python and to repeatedly update the starting index from which we are looking.
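
A minimal sketch of that mechanical approach follows. The helper name `find_all_occurrences` and the sample string are hypothetical; the only library feature relied on is that ``str.find()`` returns ``-1`` when nothing is found:

```python
def find_all_occurrences(text: str, target: str) -> list:
    """Return the start index of every non-overlapping occurrence of target in text."""
    if not target:          # an empty target would never advance, so stop early
        return []
    positions = []
    start = text.find(target)
    while start != -1:      # str.find returns -1 when nothing is found
        positions.append(start)
        start = text.find(target, start + len(target))  # resume after this match
    return positions

print(find_all_occurrences("call 555-229-2812 or 555-922-8393", "555"))
```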

## Regular Expressions
Regular Expressions are a far more suitable and powerful approach. They are a pattern-matching mechanism used throughout Computer Science and programming (not specific to Python). A tutorial on Regular Expressions (aka regex) is beyond the scope of this lab, but below are several great resources that we recommend if you are interested (they could be very useful for a homework problem):
- https://docs.python.org/3.3/library/re.html
- https://regexone.com
- https://docs.python.org/3/howto/regex.html
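
As a small taste, here is a hedged sketch that pulls every phone number out of a string in one call. The text reuses the fictional numbers from the senators example, and the pattern assumes numbers are always written as three digits, three digits, and four digits separated by hyphens:

```python
import re

text = "Lamar Alexander: 555-229-2812, Tammy Baldwin: 555-922-8393"

# \d{3} means "exactly three digits"; the pattern matches numbers shaped like 555-229-2812
phone_pattern = r"\d{3}-\d{3}-\d{4}"
print(re.findall(phone_pattern, text))
```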

# Additional Python/Homework Comment
In Homework #1, we ask you to complete functions that have signatures with a syntax you may not have seen before:

``def create_star_table(starlist: list) -> list:``

To be clear, this syntax merely documents that the input parameter is expected to be a list and that the output is expected to be a list. The function is no different from any other; Python itself does not enforce these hints when the code runs, but they tell readers (and tools such as type checkers) how the function is meant to be used.

This is called **typing** (or type hinting) our function. Please [see this documentation if you have more questions.](https://docs.python.org/3/library/typing.html)
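
The short sketch below illustrates that type hints are stored as metadata and are not checked at run time. The function `shout_names` is hypothetical and unrelated to the homework:

```python
def shout_names(names: list) -> list:
    """Hypothetical example: upper-case every item of the input."""
    return [str(name).upper() for name in names]

print(shout_names(["sirius", "vega"]))
print(shout_names("ab"))             # a str is not a list, yet Python runs this anyway
print(shout_names.__annotations__)   # the hints are kept as metadata on the function
```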

# Walkthrough Example (of Web Scraping)
We're going to look at the structure of Goodreads' best books list online. We'll use the Developer Tools in Chrome; Safari and Firefox have similar tools available. To get this page we use the `requests` module. But first we should check whether the company's policy allows scraping. Check the [robots.txt](https://www.goodreads.com/robots.txt) to find which parts of the site are not accessible. Please read and verify.

![](images/goodreads1.png)

In [None]:
url="https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url)
# response.status_code
# response.content

# Beautiful Soup (library) time!
soup = BeautifulSoup(response.content, "html.parser")
# print(soup)
# soup.prettify()
soup.find("title")

# Q1: how do we get the title's text?

# Q2: how do we get the webpage's entire content?

In [None]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
url = URLSTART+BESTBOOKS+'1'
print(url)
page = requests.get(url)

We can see properties of the page. Most relevant are `status_code` and `text`. The former tells us whether the web page was found and, if so, whether the request was OK. (See lecture notes.)

In [None]:
page.status_code # 200 is good

In [None]:
page.text[:5000]

Let us write a loop to fetch 2 pages of "best-books" from Goodreads. Notice the use of a format string. This is an example of old-style Python format strings.

In [None]:
import time  # needed for time.sleep below

URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
for i in range(1,3):
    bookpage=str(i)
    stuff=requests.get(URLSTART+BESTBOOKS+bookpage)
    filetowrite="../data/page"+ '%02d' % i + ".html"
    print("FTW", filetowrite)
    fd=open(filetowrite,"w")
    fd.write(stuff.text)
    fd.close()
    time.sleep(2)  # pause between requests to be polite to the server

## 2. Parse the page, extract book urls

Notice how we do file input-output and use Beautiful Soup in the code below. The `with` construct ensures that the file being read is closed, something we do explicitly for the file being written. We look for the elements with class `bookTitle`, extract the URLs, and write them into a file.

In [None]:
bookdict={}
for i in range(1,3):
    books=[]
    stri = '%02d' % i
    filetoread="../data/page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread) as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    print(books[:10])
    bookdict[stri]=books
    fd=open("../data/list"+stri+".txt","w")
    fd.write("\n".join(books))
    fd.close()
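
The selector step can be tried in isolation. The markup below is hypothetical and only imitates a list page (the real Goodreads HTML may differ); it shows that ``.select('.bookTitle')`` picks elements by CSS class and ignores everything else:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a list page; the real Goodreads HTML may differ
html = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1.First_Book">First Book</a>
          <a class="authorName" href="/author/show/10.Someone">Someone</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2.Second_Book">Second Book</a></td></tr>
</table>
"""
mini = BeautifulSoup(html, "html.parser")

# '.bookTitle' is a CSS selector: every element whose class list contains bookTitle
book_urls = [e["href"] for e in mini.select(".bookTitle")]
print(book_urls)
```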

Here is one of everyone's favorite books from HS American Lit. class:

In [None]:
bookdict['02'][0]

Let's go look at the first URLs on both pages.

![](images/goodreads2.png)

## 3. Parse a book page, extract book properties

OK, so now let's dive in, get one of these files, and parse it.

In [None]:
furl=URLSTART+bookdict['02'][0]
furl

![](images/goodreads3.png)

In [None]:
fstuff=requests.get(furl)
print(fstuff.status_code)

In [None]:
# parse the decoded text; if non-ASCII (e.g., Arabic) titles look garbled, check fstuff.encoding
d = BeautifulSoup(fstuff.text, 'html.parser')

In [None]:
d.select("meta[property='og:title']")[0]['content']

Let's get everything we want...

In [None]:
#d=BeautifulSoup(fstuff.text, 'html.parser', from_encoding="utf-8")
print(
"title:", d.select_one("meta[property='og:title']")['content'],"\n",
"isbn:", d.select("meta[property='books:isbn']")[0]['content'],"\n",
"type:", d.select("meta[property='og:type']")[0]['content'],"\n",
"author:", d.select("meta[property='books:author']")[0]['content'],"\n",
#"average rating", d.select_one("span.average").text,"\n",
"ratingCount:", d.select("meta[itemprop='ratingCount']")[0]["content"],"\n"
#"reviewCount", d.select_one("span.count")["title"]
)
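
Indexing with ``[0]`` or ``['content']`` raises an error when a page lacks the tag in question. As a hedged sketch, a small helper can return ``None`` instead; the name `meta_content` and the HTML fragment are hypothetical, while the property names follow those used above:

```python
from bs4 import BeautifulSoup

# Hypothetical <head> fragment; property names follow those used above
html = """
<head>
  <meta property="og:title" content="An Example Book">
  <meta property="books:isbn" content="9780000000000">
</head>
"""
mini = BeautifulSoup(html, "html.parser")

def meta_content(doc, prop):
    """Return the content of <meta property=prop>, or None if the tag is absent."""
    tag = doc.select_one(f"meta[property='{prop}']")
    return tag["content"] if tag is not None else None

print(meta_content(mini, "og:title"))
print(meta_content(mini, "books:author"))   # missing in this fragment
```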

OK, now that we know what to do, let's wrap our fetching into a proper script. So that we don't overwhelm their servers, we will only fetch 5 from each page, but you get the idea...

We'll digress a bit to explore new-style format strings. See https://pyformat.info for more info.
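
For comparison, here is a minimal sketch of the same zero-padded file name written three ways: old-style ``%`` formatting, ``str.format``, and an f-string. The variable names are chosen only for illustration:

```python
i = 3
old_style = "list%02d.txt" % i            # old-style % formatting
new_style = "list{:0>2}.txt".format(i)    # str.format: right-align in width 2, pad with 0
f_string = f"list{i:02d}.txt"             # f-string with the same zero-padding
print(old_style, new_style, f_string)
```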

In [None]:
"list{:0>2}.txt".format(3)

In [None]:
a = "4"
b = 4
class Four:
    def __str__(self):
        return "Fourteen"
c=Four()

In [None]:
"The hazy cat jumped over the {} and {} and {}".format(a, b, c)

## 4. Set up a pipeline for fetching and parsing

OK, let's get back to the fetching...

In [None]:
import time  # needed for time.sleep below

fetched=[]
for i in range(1,3):
    with open("../data/list{:0>2}.txt".format(i)) as fd:
        counter=0
        for bookurl_line in fd:
            if counter > 4:  # only fetch the first 5 books from each list
                break
            bookurl=bookurl_line.strip()
            stuff=requests.get(URLSTART+bookurl)
            filetowrite=bookurl.split('/')[-1]
            filetowrite="../data/"+str(i)+"_"+filetowrite+".html"
            print("FTW", filetowrite)
            # use a separate name so we do not shadow fd, the list file being read
            fdw=open(filetowrite,"w")
            fdw.write(stuff.text)
            fdw.close()
            fetched.append(filetowrite)
            time.sleep(2)  # pause between requests
            counter=counter+1
            
print(fetched)

OK, we are off to parse each one of the HTML pages we fetched. We have provided the skeleton of the code and the code to parse the year, since it is a bit more complex (see the difference in the screenshots above).

In [None]:
import re
yearre = r'\d{4}'
def get_year(d):
    if d.select_one("nobr.greyText"):
        return d.select_one("nobr.greyText").text.strip().split()[-1][:-1]
    else:
        thetext=d.select("div#details div.row")[1].text.strip()
        rowmatch=re.findall(yearre, thetext)
        if len(rowmatch) > 0:
            rowtext=rowmatch[0].strip()
        else:
            rowtext="NA"
        return rowtext
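
The regular-expression fallback in `get_year` can be tried on its own. The row strings below are hypothetical examples of the kind of text a details row might contain. Note that ``\d{4}`` also matches the first four digits of any longer run of digits, so it is a heuristic rather than a strict year check:

```python
import re

yearre = r'\d{4}'

# Hypothetical detail-row strings of the kind get_year may encounter
for row in ["Published July 11th 1960 by Publisher", "Published by Publisher"]:
    matches = re.findall(yearre, row)
    print(matches[0] if matches else "NA")
```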

<div class="exercise"><b>Q4.1</b></div>

Your job is to fill in the code to get the genres.

In [None]:
def get_genres(d):
    ######
    # your code here
    ######


In [None]:

listofdicts=[]
for filetoread in fetched:
    print(filetoread)
    td={}
    with open(filetoread) as fd:
        datext = fd.read()
    d=BeautifulSoup(datext, 'html.parser')
    td['title']=d.select_one("meta[property='og:title']")['content']
    td['isbn']=d.select_one("meta[property='books:isbn']")['content']
    td['booktype']=d.select_one("meta[property='og:type']")['content']
    td['author']=d.select_one("meta[property='books:author']")['content']
    #td['rating']=d.select_one("span.average").text
    td['year'] = get_year(d)
    td['file']=filetoread
    glist = get_genres(d)
    td['genres']="|".join(glist)
    listofdicts.append(td)

In [None]:
listofdicts[0]

Finally, let's write all this stuff into a CSV file, which we will use to do analysis.

In [None]:
df2 = pd.DataFrame.from_records(listofdicts)
df2

In [None]:
# and save as a csv
df2.to_csv("../data/meta_utf8_EK.csv", index=False, header=True)
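
To see what these two steps produce without touching the disk, here is a minimal sketch on made-up records (the titles, years, and genres are hypothetical). It writes the CSV to an in-memory buffer and shows how the ``|``-joined genres can later be split back into lists:

```python
import io
import pandas as pd

# Hypothetical records shaped like the dictionaries built above
records = [
    {"title": "Book One", "year": "1960", "genres": "Classics|Fiction"},
    {"title": "Book Two", "year": "NA", "genres": "Fantasy"},
]
toy = pd.DataFrame.from_records(records)

buffer = io.StringIO()                 # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False, header=True)
print(buffer.getvalue())

# Splitting the '|'-joined string recovers the list of genres
print(toy["genres"].str.split("|").tolist())
```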