_Lambda School Data Science_

# Scrape and process data

Objectives
- scrape and parse web pages
- use list comprehensions
- select rows and columns with pandas

Links
-  [Automate the Boring Stuff with Python, Chapter 11](https://automatetheboringstuff.com/chapter11/)
  - Requests
  - Beautiful Soup
- [Python List Comprehensions: Explained Visually](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Subset Observations (Rows)
  - Subset Variables (Columns)
- Python Data Science Handbook
  - [Chapter 3.1](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html), Introducing Pandas Objects
  - [Chapter 3.2](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html), Data Indexing and Selection


## Scrape the titles of PyCon 2019 talks

In [0]:
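import requests
import bs4

url = 'https://us.pycon.org/2019/schedule/talks/list/'

res = requests.get(url)

type(res)

# Stop with an error if the request failed
res.raise_for_status()

# Name the parser explicitly to avoid a bs4 warning
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Assumption: each talk title sits in an <h2> element; adjust the selector if the markup differs
titles = [tag.text.strip() for tag in soup.select('h2')]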



## 5 ways to look at long titles

Let's define a long title as one with more than 80 characters.

### 1. For Loop

In [0]:
long_titles = []
for title in titles:
    # Check the length of each title, not of the whole list
    if len(title) > 80:
        #print(title)
        long_titles.append(title)

### 2. List Comprehension

In [0]:

long_titles = [title for title in titles if len(title) > 80]

### 3. Filter with named function

In [26]:
def long(title):
    return len(title) > 80

long('Python is good')

# filter returns a lazy iterator in Python 3; list() shows the matches
list(filter(long, titles))

### 4. Filter with anonymous function

In [0]:
list(filter(lambda t: len(t) > 80, titles))
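
The first four approaches select the same items. As a minimal sketch, here they are side by side on a short made-up list; `sample_titles`, `is_long`, and the cutoff of 10 are hypothetical, chosen so the result is easy to check by eye. Note that `filter` returns a lazy iterator in Python 3, so it is wrapped in `list()`.

```python
# Hypothetical sample data; the notebook's real `titles` come from the scrape.
sample_titles = ['Short talk', 'A considerably longer talk title', 'Tiny']
threshold = 10  # illustrative cutoff; the notebook uses 80


def is_long(title):
    """Return True when the title has more characters than the threshold."""
    return len(title) > threshold


# 1. For loop
loop_result = []
for title in sample_titles:
    if len(title) > threshold:
        loop_result.append(title)

# 2. List comprehension
comprehension_result = [t for t in sample_titles if len(t) > threshold]

# 3. filter with a named function (list() consumes the lazy iterator)
named_result = list(filter(is_long, sample_titles))

# 4. filter with an anonymous function
lambda_result = list(filter(lambda t: len(t) > threshold, sample_titles))

print(loop_result)
```

The list comprehension is usually the most readable of the four for a simple condition like this one.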

### 5. Pandas

pandas documentation: [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html)

In [0]:
import pandas as pd 
pd.options.display.max_colwidth = 200

df = pd.DataFrame({'title': titles})

df[df['title'].str.len() > 80]



## Make new dataframe columns

pandas documentation: [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)

### title length

In [0]:
df['title length'] = df['title'].apply(len)

### long title

In [0]:
df[ df['title length'] > 80 ]

df.loc[ df['title length'] > 80, 'title length']

df['long title'] = df['title length'] > 80 



### first letter

In [0]:
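# Store the first character of each title in its own column
df['first letter'] = df['title'].str[0]

# The comparison is case-sensitive: 'p' matches only titles that start with a lowercase p
df[df['first letter'] == 'p']

df['p titles'] = df['title'].str[0] == 'p'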



### word count

Using [`textstat`](https://github.com/shivam5992/textstat)

In [0]:
!pip install textstat



In [0]:
import textstat

# lexicon_count returns the number of words in a string
df['title word count'] = df['title'].apply(textstat.lexicon_count)

## Rename column

`title length` --> `title character count`

pandas documentation: [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)
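
A minimal sketch of the rename, run on a small stand-in dataframe so it works on its own; `sample` and its two titles are hypothetical, not the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe.
sample = pd.DataFrame({'title': ['Intro to pandas', 'Scraping 101']})
sample['title length'] = sample['title'].apply(len)

# rename returns a new dataframe by default, so assign the result back.
sample = sample.rename(columns={'title length': 'title character count'})

print(sample.columns.tolist())
```

The same call on the notebook's `df` would be `df = df.rename(columns={'title length': 'title character count'})`.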

## Analyze the dataframe

### Describe

pandas documentation: [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
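
As a rough illustration on hypothetical data (`sample` is made up for this sketch), `describe()` summarizes numeric columns by default, and `include='all'` brings text columns into the summary as well.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe.
sample = pd.DataFrame({
    'title': ['Intro to pandas', 'Scraping 101', 'Plots'],
    'title character count': [15, 12, 5],
})

# By default, describe() summarizes only the numeric columns.
numeric_summary = sample.describe()

# include='all' adds the text column too (count, unique, top, freq).
full_summary = sample.describe(include='all')

print(numeric_summary)
print(full_summary)
```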

### Sort values

pandas documentation: [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)

Five shortest titles, by character count

Titles sorted reverse alphabetically
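
Both sorts can be sketched on a small hypothetical dataframe (`sample` and its titles are invented for illustration): sort ascending by character count and keep the first five rows, then sort by title with `ascending=False`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe.
sample = pd.DataFrame(
    {'title': ['Intro to pandas', 'Scraping 101', 'Plots', 'Zen of Python']}
)
sample['title character count'] = sample['title'].str.len()

# Shortest titles first; head(5) keeps at most five rows.
shortest = sample.sort_values(by='title character count').head(5)

# Reverse alphabetical order by title.
reverse_alpha = sample.sort_values(by='title', ascending=False)

print(shortest)
print(reverse_alpha)
```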

### Get value counts

pandas documentation: [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)


Frequency counts of first letters

Percentage of talks with long titles
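
A minimal sketch of both counts on hypothetical data (`sample` is invented for this example). `value_counts()` returns frequencies sorted from most to least common, and `normalize=True` returns shares instead of counts. Because `True` counts as 1, the mean of a boolean column is also the share of `True` values.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe.
sample = pd.DataFrame({
    'first letter': ['P', 'A', 'P', 'T', 'P', 'A'],
    'long title': [True, False, False, False, True, False],
})

# Frequency of each first letter, most common first.
letter_counts = sample['first letter'].value_counts()

# Share of each value, scaled to a percentage.
long_share = sample['long title'].value_counts(normalize=True) * 100

# Equivalent shortcut for a boolean column.
long_percent = sample['long title'].mean() * 100

print(letter_counts)
print(long_share)
```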

### Plot

pandas documentation: [Visualization](https://pandas.pydata.org/pandas-docs/stable/visualization.html)





Top 5 most frequent first letters

In [0]:
(df['first letter']
 .value_counts()
 .head(5)
 .plot
 .barh()); 

Histogram of title lengths, in characters

In [0]:
df['title character count'].plot.hist();




# Assignment

**Scrape** the talk descriptions. Hint: `soup.select('.presentation-description')`

**Make** new columns in the dataframe:
- description
- description character count
- description word count

**Describe** all the dataframe's columns. What's the average description word count? The minimum? The maximum?

**Answer** the question: Which descriptions could fit in a tweet?


# Stretch Challenge

**Make** another new column in the dataframe:
- description grade level (you can use [this `textstat` function](https://github.com/shivam5992/textstat#the-flesch-kincaid-grade-level) to get the Flesch-Kincaid grade level)

**Answer** the question: What's the distribution of grade levels? Plot a histogram.

**Be aware** that [Textstat has issues when sentences aren't separated by spaces](https://github.com/shivam5992/textstat/issues/77#issuecomment-453734048). (A Lambda School Data Science student helped identify this issue, and emailed with the developer.) 

Also, [BeautifulSoup doesn't separate paragraph tags with spaces](https://bugs.launchpad.net/beautifulsoup/+bug/1768330).

So, you may get some inaccurate or surprising grade level estimates here. Don't worry, that's OK. Optionally, can you do anything to improve the grade level estimates?
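
One possible direction, sketched on a made-up HTML snippet rather than the real page: `get_text` accepts a `separator` argument, so a space can be placed between the text of adjacent paragraphs. The markup in `html` is hypothetical and only imitates a description with two paragraphs.

```python
import bs4

# Hypothetical markup imitating a description with two paragraphs.
html = (
    '<div class="presentation-description">'
    '<p>First sentence.</p><p>Second sentence.</p>'
    '</div>'
)

soup = bs4.BeautifulSoup(html, 'html.parser')
tag = soup.select('.presentation-description')[0]

# Default: adjacent paragraphs are joined with nothing between them.
joined = tag.get_text()

# separator=' ' puts a space between text fragments; strip=True trims each one.
spaced = tag.get_text(separator=' ', strip=True)

print(repr(joined))
print(repr(spaced))
```

Whether this is enough to fix the grade level estimates depends on the real markup, so treat it as a starting point.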

In [31]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import requests
import bs4



url = 'https://us.pycon.org/2019/schedule/talks/list/'

res = requests.get(url)

type(res)

res.raise_for_status()


# Name the parser explicitly to avoid a bs4 warning
example = bs4.BeautifulSoup(res.text, 'html.parser')

elems = example.select('.presentation-description')


len(elems)

elems[0]

str(elems[0])

elems[0].attrs

elems[0].getText()

elems[0].text.strip()

descs = []    #Initializing list of descriptions

for tag in elems: 
  desc = tag.text.strip()
  descs.append(desc)
  
# separator=' ' puts a space between paragraphs, which get_text would otherwise run together
descs2 = [tag.get_text(strip=True, separator=' ') for tag in elems]

print(descs2)







