# Web Scraping with [Python](https://www.python.org/) using [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [`requests`](https://2.python-requests.org/en/master/)

The task is to scrape the content of the ABV training courses from the [Vorlesungsverzeichnis](https://www.fu-berlin.de/vv/de/modul?id=478016&sm=498562) (the course catalogue) and to analyze it by generating a word cloud.

__Set up__

In [None]:
%load_ext autoreload
%autoreload 2

__Importing Python libraries__

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Understanding the website

Before scraping, check [https://www.fu-berlin.de/robots.txt](https://www.fu-berlin.de/robots.txt) to see which paths automated clients are asked to avoid.
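
The same check can be done programmatically with `urllib.robotparser` from the standard library. The sketch below parses a made-up robots.txt string so that it runs offline; the rules shown are an assumption for illustration and are not the actual content of the university's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here only for illustration;
# the real file at https://www.fu-berlin.de/robots.txt may differ.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# Ask whether a generic user agent ("*") may fetch a given URL
allowed = parser.can_fetch("*", "https://www.fu-berlin.de/vv/de/modul?id=478016&sm=498562")
blocked = parser.can_fetch("*", "https://www.fu-berlin.de/private/page.html")
print(allowed, blocked)
```

To check the live file instead, `parser.set_url(...)` followed by `parser.read()` downloads and parses it in one step.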

In [None]:
url = 'https://www.fu-berlin.de'
abv = '/vv/de/modul?id=478016&sm=498562'
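
Plain string concatenation works for these two pieces. If links collected later turn out to be relative paths (an assumption, this depends on how the page is written), `urllib.parse.urljoin` is a more robust way to build full URLs. A minimal sketch, where `base`, `module_path` and `relative_href` are illustrative names:

```python
from urllib.parse import urljoin

base = "https://www.fu-berlin.de"
module_path = "/vv/de/modul?id=478016&sm=498562"

# urljoin resolves a path against a base URL
module_url = urljoin(base, module_path)

# A relative link as it might appear in an href attribute (assumed form)
relative_href = "/vv/de/lv/542611?m=348436&pc=478016&sm=498562"
course_url = urljoin(module_url, relative_href)

# A link that is already absolute is returned unchanged
external_url = urljoin(module_url, "https://www.python.org/")

print(module_url)
print(course_url)
```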

Visit the website `https://www.fu-berlin.de/vv/de/modul?id=478016&sm=498562` and try to figure out where the data of interest, the course names and the links to the course pages, is made available.

## Fetching the content of the website using `requests` and `BeautifulSoup`

In [None]:
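page = requests.get(url + abv, timeout=10)  # give up if the server does not answer in time
page.raise_for_status()  # stop early on an HTTP error such as 404 or 500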

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

> ### Challenge 1: Inspect the `soup` object and try to make sense of it.

In [None]:
## your code here 

In [None]:
# %load ../src/_solutions/soup.py

## Find the relevant item where the data is made available.

> ### Challenge 2: Extract data for the course name and the internet link to the course where additional information may be found  
> * #### Inspect the `soup` object or visit the website to find out where the data is made available. 
> * #### [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) offers many methods and attributes to extract data from an HTML document. Particularly useful are the `find` and the `find_all` methods.
> * #### Build a pandas dataframe denoted as `data` with all course names and internet links in two columns denoted as `course_names` and `course_links`.
> * #### How many courses are available?
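
As a hint, here is how `find` and `find_all` behave on a tiny hand-written HTML snippet. The tag structure and the class name `course` are invented for illustration; the real page uses its own markup, which you still need to discover.

```python
from bs4 import BeautifulSoup

# Toy HTML with a hypothetical structure and class name
toy_html = """
<ul>
  <li><a class="course" href="/vv/de/lv/1">Course A</a></li>
  <li><a class="course" href="/vv/de/lv/2">Course B</a></li>
  <li><a href="/imprint">Imprint</a></li>
</ul>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# find returns the first matching tag, find_all returns all of them
first_course = toy_soup.find("a", class_="course")
all_courses = toy_soup.find_all("a", class_="course")

# Tag text and attributes can then be collected into plain lists
toy_names = [a.get_text(strip=True) for a in all_courses]
toy_links = [a["href"] for a in all_courses]
print(toy_names, toy_links)
```

Two lists of equal length like these can be passed to `pd.DataFrame` as a dictionary to obtain the two requested columns.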

In [None]:
## your code here

In [None]:
# %load ../src/_solutions/build_dataframe.py

## Content extraction

> ### Challenge 3: Follow the link (`link`) given below and extract the text content of the page it points to. 
> * #### Use the `requests` and `BeautifulSoup` modules to get the job done. 
> * #### Write a function called `text_extraction`, taking only one argument, the internet link. The output should be a list of (not yet cleaned) strings.
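
One possible pattern, shown on toy HTML rather than on the real course page, is to locate a container first and then collect the raw text of its children. The `description` and `footer` class names are assumptions made up for this sketch.

```python
from bs4 import BeautifulSoup

# Toy page; the class names are hypothetical
toy_page = """
<div class="description">
  <p>
    First paragraph.
  </p>
  <p>Second paragraph.</p>
</div>
<div class="footer">Not wanted</div>
"""
toy_soup = BeautifulSoup(toy_page, "html.parser")

# Restrict the search to one container, then read each paragraph's text
container = toy_soup.find("div", class_="description")
raw_texts = [p.get_text() for p in container.find_all("p")]

# The strings still carry newlines and indentation at this stage
print(raw_texts)
```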

In [None]:
link = 'https://www.fu-berlin.de/vv/de/lv/542611?m=348436&pc=478016&sm=498562'

In [None]:
## your code here

In [None]:
# %load ../src/_solutions/text_exctraction.py

In [None]:
txt = text_extraction(link)
txt

### Data cleaning

> ### Challenge 4: Write a function `clean_text` that cleans the text data. 
> * #### Make sure you account for `\n`, `\r` and whitespaces.
> * #### You may also consider dropping the word `Schließen` and excluding a text block if it starts with `Anmeldung`.
> * #### _Note: The input of the function will be a list of strings!_


#### The result should look like this:

    'This ABV English module is designed to enable students to better cope with the challenges of using English within a higher education context. Throughout this introductory module, we will look at learning strategies and study skills to help students to develop English for study purposes in the four key language skills: reading, listening, speaking and writing. Students will be expected to produce a portfolio of written work throughout the course and to deliver a short study-related presentation.'
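
As a starting point, the following sketch shows the individual string operations on a made-up list of blocks. It is one way to approach the task, not the reference solution, and `toy_blocks` and `normalize_whitespace` are illustrative names.

```python
# Made-up input that mimics uncleaned text blocks
toy_blocks = [
    "\r\n   Hello   world \n",
    "Schließen",
    "Anmeldung: online",
    "  second\tblock  Schließen",
]

def normalize_whitespace(s):
    # str.split() without arguments splits on any run of whitespace
    # (spaces, tabs, \n, \r) and drops empty pieces
    return " ".join(s.split())

cleaned = []
for block in toy_blocks:
    block = normalize_whitespace(block.replace("Schließen", ""))
    # Skip blocks that are empty after cleaning or that start with "Anmeldung"
    if block and not block.startswith("Anmeldung"):
        cleaned.append(block)

toy_result = " ".join(cleaned)
print(toy_result)
```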

In [None]:
# %load ../src/_solutions/clean_text.py

## Let the computer do the work!

> ### Challenge 5: Write a function `extract_comments` that takes in links and returns the extracted text from all links. 
> * #### Make sure your function has an argument that provides the number of links to be followed.
> * #### Also try to use the `time.sleep` function with a random sleeping time. This spaces out your requests and makes it less likely that your IP gets flagged ;-)

>    `from numpy import random`   
>    `import time`   
>    `r = random.random()`   
>    `time.sleep(r)`
> * #### Consider reusing the functions `text_extraction` and `clean_text` from above. 
> * #### Make sure your function returns one string and no duplicates. The built-in functions `set` and `" ".join` may be handy. 
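
The overall loop can be sketched without touching the network by passing in the fetching function as an argument. Everything here is illustrative: `visit_politely` and `fake_fetch` are hypothetical names, the standard-library `random` module stands in for `numpy.random`, and `dict.fromkeys` is used instead of `set` because it removes duplicates while keeping the original order.

```python
import random
import time

def visit_politely(links, fetch, n_links=3, max_delay=1.0, seed=None):
    """Fetch the first n_links links with a random pause after each request."""
    rng = random.Random(seed)
    texts = []
    for link in list(links)[:n_links]:
        texts.append(fetch(link))
        # Sleep for a random time between 0 and max_delay seconds
        time.sleep(rng.uniform(0, max_delay))
    # dict.fromkeys drops duplicates and preserves first-seen order
    unique_texts = list(dict.fromkeys(texts))
    return " ".join(unique_texts)

def fake_fetch(link):
    # Stand-in for real downloading and cleaning; two links share the same text
    return "same text" if link in ("a", "b") else f"text for {link}"

demo = visit_politely(["a", "b", "c", "d"], fake_fetch, n_links=3, max_delay=0.01, seed=23)
print(demo)
```

In the real function, `fake_fetch` would be replaced by a combination of `text_extraction` and `clean_text`.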



In [None]:
# %load ../src/_solutions/extract_comments.py

In [None]:
text = extract_comments(links=data.course_links, n_links=7, rs=23, verbose=True)

## Generating a word cloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

### A simple word cloud

In [None]:
# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
fig, ax = plt.subplots(figsize=(12,6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off");

### Improving the wordcloud using stopwords

In [None]:
from stop_words import get_stop_words
stop_words = get_stop_words('de') + get_stop_words('en')
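
To get a feeling for what the stop word filter changes, it can help to count words by hand on a short string. The sketch below uses a tiny made-up stop word set instead of the `stop_words` package, and it only approximates what the word cloud does internally.

```python
from collections import Counter

toy_text = "the course and the seminar and the lecture on the course"
toy_stop_words = {"the", "and", "on"}  # hypothetical mini list for illustration

words = toy_text.lower().split()

# Word frequencies with and without stop words
all_counts = Counter(words)
filtered_counts = Counter(w for w in words if w not in toy_stop_words)

top_all = all_counts.most_common(1)
top_filtered = filtered_counts.most_common(1)
print(top_all, top_filtered)
```

Without filtering, a function word dominates the counts; with filtering, a content word does.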

In [None]:
# Create and generate a word cloud image:
wordcloud = WordCloud(stopwords=stop_words).generate(text)

# Display the generated image:
fig, ax = plt.subplots(figsize=(12,6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off");

### Making a really fancy wordcloud

In [None]:
from PIL import Image
import numpy as np
mask = np.array(Image.open("../data/images/berlin_bear.png"))  # choose mask

# Create and generate a word cloud image:
wordcloud = WordCloud(
    stopwords=stop_words,
    background_color="white",
    mask=mask,
    mode="RGB",
    random_state=42,
).generate(text)


# Display the generated image:
fig, ax = plt.subplots(figsize=(12,10))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off");

> ### Challenge 6: Improve the wordcloud as you wish. 
> * #### To do so, you may play around with any arguments of the `WordCloud` class.
> * #### Feel free to add any other mask of your choice.
> * #### Add or remove stopwords as you like.
> * #### In order to have more fun the full data set is provided to you. Uncomment the cell below to access the extracted text from all currently available courses of ABV (`full_text`). 

In [None]:
#import pickle
#full_text = pickle.load(open('../data/full_text.p', 'rb'))

In [None]:
## your code here

***