<h1 align="center"> How to scrape stuff and not read all of it </h1>

### Installation:

The code lives in this notebook; you can follow along in your own notebook or in your code editor of choice.
First things first: you are going to need a virtual environment. Once you have one, install these packages if you do not have them yet:


```bash
pip install jupyter
pip install requests
pip install beautifulsoup4
pip install sumy
pip install numpy
```

We also need to download the `punkt` tokenizer data from NLTK, so let's do that quickly. Start a Python session (`python3`) and run:

```python
import nltk
nltk.download('punkt')
```

That's all you need!
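
If you want to double-check that everything landed in your virtual environment, here is a small sketch that asks Python which modules it can find. The `REQUIRED` list and the `missing_modules` helper are my own names, not part of any of these packages. Note that the import name is not always the pip name: `beautifulsoup4` is imported as `bs4`.

```python
import importlib.util

# Import names (not pip names): beautifulsoup4 is imported as bs4.
# nltk is included because the punkt download step relies on it.
REQUIRED = ["requests", "bs4", "sumy", "numpy", "nltk"]


def missing_modules(names):
    """Return the module names that Python cannot find in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing:", missing if missing else "nothing, you're good")
```

Anything it lists as missing just needs another `pip install`.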


### Getting started with our scraper:

In [1]:
# from bs4 import BeautifulSoup
# import requests

# url = 'https://en.wikipedia.org/wiki/Snowflake_schema'
# req = requests.get(url)
# soup = BeautifulSoup(req.text, "html.parser")

### What is in our soup?
Because we know that the main content of a Wikipedia article is in `<p>` tags, we can just... find them all.

In [2]:
#p_tags = soup.find_all('p')
#print(p_tags)
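
If you'd like to poke at the idea without hitting Wikipedia, here is a rough sketch of what "find every `<p>` and grab its text" boils down to. It uses Python's built-in `html.parser` directly (the parser we asked BeautifulSoup to use) instead of BeautifulSoup itself, so treat it as an illustration, not as how BeautifulSoup is implemented. The `ParagraphCollector` class and the HTML snippet are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each <p>...</p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, in document order
        self._inside_p = False
        self._chunks = []      # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._chunks))
            self._chunks = []
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._inside_p:
            self._chunks.append(data)


html = "<h1>Title</h1><p>First <b>bold</b> bit.</p><div>skip me</div><p>Second bit.</p>"
collector = ParagraphCollector()
collector.feed(html)
collector.close()
print(collector.paragraphs)
```

The heading and the `<div>` are ignored, and the text inside `<b>` is kept as part of its paragraph.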

### That's a lot of... "soup".
We've found all our `<p>` tags! That's our content. Yay. But that's not really usable yet.
Really, what we want is all the text inside the `<p>` tags put together in one string.

In [3]:
# the_good_stuff = ''
# for p in p_tags:
#     the_good_stuff += p.text
    
# the_good_stuff

### Victory.... but wait!
We actually need a space between the full stop that ends one `<p>` tag and the start of the next one.  
Let's make it a tiny bit fancier:

In [4]:
#the_actual_good_stuff = ' '.join(map(lambda p: p.text, p_tags))
#print(the_actual_good_stuff)
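
To see the difference side by side, here is a tiny standalone sketch with two made-up paragraphs standing in for the text of our `<p>` tags:

```python
# Two invented strings standing in for p.text of consecutive <p> tags.
paragraphs = ["First paragraph ends here.", "Second one starts here."]

# Plain concatenation: the full stop runs straight into the next word.
glued = ""
for p in paragraphs:
    glued += p

# Joining with a space keeps the sentences apart.
spaced = " ".join(paragraphs)

print(glued)
print(spaced)
```

That missing space matters later on: a sentence splitter has a much harder time telling where one sentence stops when it sees `here.Second`.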

### And since we're smart people...
We can throw that in a function and never look at it again.

In [5]:
# def get_the_good_stuff(url):
#     req = requests.get(url)
#     soup = BeautifulSoup(req.text, 'html.parser')
#     p_tags = soup.find_all('p')
#     return ' '.join(map(lambda p: p.text, p_tags))
    
# get_the_good_stuff('https://en.wikipedia.org/wiki/Snowflake_schema')

### I don't want to read all that.

So the fun part begins. We actually want a summary of that. I ain't got no time to read no fancy Wikipedia articles all day.  

*Enter: Sumy.*

It's basically magic.


In [6]:
# from sumy.parsers.plaintext import PlaintextParser
# from sumy.nlp.tokenizers import Tokenizer

# our_text = get_the_good_stuff('https://en.wikipedia.org/wiki/Snowflake_schema')
# goop = PlaintextParser(our_text, Tokenizer('english'))

We parse our text with a `Tokenizer` (i.e. something that breaks the text into "tokens": sentences and words). Let's just look at what we got:

In [8]:
# dir(goop)

In [9]:
#parsed_document = goop.document

In [10]:
#dir(parsed_document)
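
In case "tokens" still feels abstract, here is a deliberately naive sketch of the idea using regular expressions. This is not what Sumy or NLTK's punkt do internally (punkt is much smarter about abbreviations and the like), and the helper names `naive_sentences` and `naive_words` are made up for the example.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_words(sentence):
    """Lowercase the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


text = "Soup is good. Is it hot? Yes!"
for sentence in naive_sentences(text):
    print(sentence, "->", naive_words(sentence))
```

Sentences are the units the summarizer picks from, and words are what it uses to compare them.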

In [18]:
# from sumy.summarizers.lsa import LsaSummarizer
# # LSA: latent semantic analysis. The very smart person with the British accent already talked about it last month.
# from sumy.nlp.stemmers import Stemmer
# # Stemming: reducing words to their stems.
# from sumy.utils import get_stop_words


# stemmer = Stemmer('english')
# summarizer = LsaSummarizer(stemmer)
# summarizer.stop_words = get_stop_words('english')
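
Here is a toy sketch of what stop words and stemming buy us, assuming a tiny invented stop-word list and a crude suffix-chopping rule. Sumy's `Stemmer` and `get_stop_words` are far more thorough; `STOP_WORDS`, `SUFFIXES`, `toy_stem` and `normalize` below are names I made up for the illustration.

```python
# A tiny invented stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "are", "of", "and"}

# Suffixes to chop off, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(words):
    """Drop stop words, then stem whatever is left."""
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


words = ["the", "tables", "are", "linked", "and", "linking", "links"]
print(normalize(words))
```

The filler words disappear, and "linked", "linking" and "links" all collapse to the same stem, so they count as the same word. Stems do not have to be real words ("tables" becomes "tabl" here), they only have to be consistent.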

In [12]:
# dir(summarizer)

In [13]:
# MY_SUMMARY = summarizer(parsed_document, 5)

In [14]:
#MY_SUMMARY

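### Wait, that's not quite it...
The summarizer hands back a collection of sentence objects rather than plain text, so let's print them one by one: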

In [15]:
# for sentence in MY_SUMMARY:
#     print(sentence)

### And now...because we are smart _and_ fancy people....

In [16]:
# from sumy.parsers.plaintext import PlaintextParser
# from sumy.nlp.tokenizers import Tokenizer
# from sumy.summarizers.lsa import LsaSummarizer
# from sumy.nlp.stemmers import Stemmer
# from sumy.utils import get_stop_words

# def summarize_that_stuff(url, language, sentences):
#     text = get_the_good_stuff(url)
#     parser = PlaintextParser(text, Tokenizer(language))
#     summarizer = LsaSummarizer(Stemmer(language))
#     summarizer.stop_words = get_stop_words(language)
    
#     for sentence in summarizer(parser.document, sentences):
#         print(sentence)


In [17]:
# summarize_that_stuff('https://en.wikipedia.org/wiki/First_Amendment_to_the_United_States_Constitution', 'english', 10)