_Lambda School Data Science_

# Scrape and process data

Objectives
- scrape and parse web pages
- use list comprehensions
- select rows and columns with pandas

Links
- [Automate the Boring Stuff with Python, Chapter 11](https://automatetheboringstuff.com/chapter11/)
  - Requests
  - Beautiful Soup
- [Python List Comprehensions: Explained Visually](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Subset Observations (Rows)
  - Subset Variables (Columns)
- Python Data Science Handbook
  - [Chapter 3.1](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html), Introducing Pandas Objects
  - [Chapter 3.2](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html), Data Indexing and Selection


## Scrape the titles of PyCon 2019 talks

In [0]:
url = 'https://us.pycon.org/2019/schedule/talks/list/'

In [0]:
import bs4
import requests

result = requests.get(url)

In [10]:
# <Response [200]> means the HTTP request succeeded (status code 200 = OK)
result

<Response [200]>

In [11]:
# Check the type of the result: it is a requests Response object
type(result)

requests.models.Response

In [12]:
# result.text holds the page's HTML source as a string
result.text



In [13]:
type(result.text)

str

In [0]:
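# bs4.BeautifulSoup parses the HTML string into a tree we can search.
# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = bs4.BeautifulSoup(result.text, 'html.parser')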
soup

In [16]:
# soup is a BeautifulSoup object
type(soup)

bs4.BeautifulSoup

In [0]:
# Use tab completion to explore what soup can do. select() finds elements that match a CSS selector.
# Inspect the page source in your browser for clues about which selector matches the
# information you want. Expect some trial and error until you get it!
soup.select('h2')  # select all h2 tags on the page

In [18]:
type(soup.select('h2'))  # select() returns a list!

list

In [19]:
len(soup.select('h2'))  # length of the list: 95 talks in this case!

95

In [20]:
first = soup.select('h2')[0]  # the first element of the list
first

<h2>
<a href="/2019/schedule/presentation/235/" id="presentation-235">
        5 Steps to Build Python Native GUI Widgets for BeeWare
      </a>
</h2>

In [21]:
type(first)  # a bs4 Tag object

bs4.element.Tag

In [22]:
# Keep tab completing to see what you can do with each type of object
first.text  # the text inside the Tag, including surrounding spaces and newline characters

'\n\n        5 Steps to Build Python Native GUI Widgets for BeeWare\n      \n'

In [23]:
type(first.text) # another string

str

In [24]:
first.text.strip()  # strip leading and trailing whitespace

'5 Steps to Build Python Native GUI Widgets for BeeWare'

In [27]:
first.text.strip().strip('5')  # with an argument, strip() removes those characters from both ends

' Steps to Build Python Native GUI Widgets for BeeWare'

In [29]:
last = soup.select('h2')[-1] ## select the last element
last

<h2>
<a href="/2019/schedule/presentation/191/" id="presentation-191">
        Working with Time Zones: Everything You Wish You Didn't Need to Know
      </a>
</h2>

In [0]:
# Loop through all the h2 tags and print each title with surrounding whitespace removed
for tag in soup.select('h2'):
    title = tag.text.strip()
    print(title)

## 5 ways to look at long titles

Let's define a long title as one with more than 80 characters.

### 1. For Loop
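
One way this could look is sketched below. The `titles` list is an assumption: it stands in for the stripped titles scraped above (the notebook prints them but never stores them), and the third entry is a made-up title so that something crosses the 80-character threshold.

```python
# Sample data standing in for the scraped titles.
# The first two appear in the scraped output; the third is hypothetical.
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]

# Collect every title longer than 80 characters.
long_titles = []
for title in titles:
    if len(title) > 80:
        long_titles.append(title)

print(long_titles)
```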

### 2. List Comprehension
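
A list comprehension can express the same filter in one line. This is a minimal sketch: the `titles` list is assumed sample data standing in for the scraped titles, and its third entry is made up.

```python
# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]

# Keep each title whose length exceeds 80 characters.
long_titles = [title for title in titles if len(title) > 80]

print(long_titles)
```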

### 3. Filter with named function
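
Python's built-in `filter` takes a function and an iterable, and keeps the items for which the function returns a truthy value. In this sketch, `is_long` is a hypothetical helper name and `titles` is assumed sample data (third entry made up).

```python
# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]


def is_long(title):
    """Return True if the title has more than 80 characters."""
    return len(title) > 80


# filter() returns a lazy iterator, so wrap it in list() to see the results.
long_titles = list(filter(is_long, titles))

print(long_titles)
```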

### 4. Filter with anonymous function
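
The same `filter` call works with a `lambda`, which defines the test inline instead of naming it. As a sketch, with `titles` as assumed sample data (third entry made up):

```python
# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]

# The lambda is an unnamed function that returns True for long titles.
long_titles = list(filter(lambda title: len(title) > 80, titles))

print(long_titles)
```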

### 5. Pandas

pandas documentation: [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html)
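
With pandas, the titles can go into a dataframe column and be filtered with a boolean mask built from the `.str` accessor. This is a sketch under assumptions: the column name `title` and the sample `titles` list (third entry made up) are not from the original notebook.

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]

df = pd.DataFrame({'title': titles})

# .str.len() gives the character count per row; comparing yields a boolean mask.
long_df = df[df['title'].str.len() > 80]

print(long_df)
```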

## Make new dataframe columns

pandas documentation: [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)

### title length
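
One possible approach uses `Series.apply` with the built-in `len`. The dataframe and its sample titles are assumptions for illustration (third title made up).

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})

# apply() calls len on each title and returns a new Series of character counts.
df['title length'] = df['title'].apply(len)

print(df)
```

`df['title'].str.len()` should give the same numbers without `apply`.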

### long title
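
A boolean column can flag the long titles directly. In this sketch the sample data and the column name `long title` are assumptions (third title made up).

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})

# True where the title has more than 80 characters, False otherwise.
df['long title'] = df['title'].str.len() > 80

print(df)
```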

### first letter
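
The `.str` accessor supports indexing, so `.str[0]` picks the first character of every title. Sketch with assumed sample data (third title made up):

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})

# First character of each title (which may be a digit, as in the first row).
df['first letter'] = df['title'].str[0]

print(df)
```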

### word count

Using [`textstat`](https://github.com/shivam5992/textstat)

In [0]:
!pip install textstat
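
`textstat` has its own word counter (`lexicon_count` is probably the function this section has in mind). The sketch below does not call it. Instead it uses plain pandas whitespace splitting as a dependency-free stand-in, which may count words slightly differently. The sample data and the column name `title word count` are assumptions (third title made up).

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})

# Split each title on whitespace, then count the resulting pieces.
df['title word count'] = df['title'].str.split().str.len()

print(df)
```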

## Rename column

`title length` --> `title character count`

pandas documentation: [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)
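
A sketch of the rename, assuming a dataframe that already has a `title length` column (sample data below, third title made up):

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})
df['title length'] = df['title'].str.len()

# rename() returns a new dataframe, so assign the result back to df.
df = df.rename(columns={'title length': 'title character count'})

print(df.columns.tolist())
```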

## Analyze the dataframe

### Describe

pandas documentation: [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
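
By default `describe` summarizes only the numeric columns. A minimal sketch with assumed sample data (third title made up):

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})
df['title character count'] = df['title'].str.len()

# Count, mean, std, min, quartiles, and max for the numeric column.
summary = df.describe()

print(summary)
```

Passing `include='all'` would add the text column to the summary as well.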

### Sort values

pandas documentation: [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)

Five shortest titles, by character count

Titles sorted reverse alphabetically
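
Both sorts can be sketched with `sort_values`. The sample data and column names here are assumptions (third title made up), and with only three rows `head(5)` simply returns all of them.

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})
df['title character count'] = df['title'].str.len()

# Ascending sort by character count, then keep the first five rows.
shortest = df.sort_values(by='title character count').head(5)

# ascending=False reverses the alphabetical order.
reverse_alpha = df.sort_values(by='title', ascending=False)

print(shortest)
print(reverse_alpha)
```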

### Get value counts

pandas documentation: [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)


Frequency counts of first letters

Percentage of talks with long titles
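
`value_counts` tallies how often each value occurs, and `normalize=True` turns the counts into fractions. A sketch with assumed sample data and column names (third title made up):

```python
import pandas as pd

# Sample data standing in for the scraped titles (third entry is hypothetical).
titles = [
    "5 Steps to Build Python Native GUI Widgets for BeeWare",
    "Working with Time Zones: Everything You Wish You Didn't Need to Know",
    "A Hypothetical Talk Title That Is Deliberately Written "
    "To Run Well Past Eighty Characters In Length",
]
df = pd.DataFrame({'title': titles})
df['first letter'] = df['title'].str[0]
df['long title'] = df['title'].str.len() > 80

# How many titles start with each character.
first_letter_counts = df['first letter'].value_counts()

# Fractions of True/False, scaled to percentages.
long_title_percent = df['long title'].value_counts(normalize=True) * 100

print(first_letter_counts)
print(long_title_percent)
```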

### Plot

pandas documentation: [Visualization](https://pandas.pydata.org/pandas-docs/stable/visualization.html)





Top 5 most frequent first letters

Histogram of title lengths, in characters

# Assignment

**Scrape** the talk descriptions. Hint: `soup.select('.presentation-description')`

**Make** new columns in the dataframe:
- description
- description character count
- description word count

**Describe** all the dataframe's columns. What's the average description word count? The minimum? The maximum?

**Answer** the question: Which descriptions could fit in a tweet?


# Stretch Challenge

**Make** another new column in the dataframe:
- description grade level (you can use [this `textstat` function](https://github.com/shivam5992/textstat#the-flesch-kincaid-grade-level) to get the Flesch-Kincaid grade level)

**Answer** the question: What's the distribution of grade levels? Plot a histogram.

**Be aware** that [Textstat has issues when sentences aren't separated by spaces](https://github.com/shivam5992/textstat/issues/77#issuecomment-453734048). (A Lambda School Data Science student helped identify this issue, and emailed with the developer.) 

Also, [BeautifulSoup doesn't separate paragraph tags with spaces](https://bugs.launchpad.net/beautifulsoup/+bug/1768330).

So, you may get some inaccurate or surprising grade level estimates here. Don't worry, that's ok. But optionally, can you do anything to try improving the grade level estimates?