# Example usage

This notebook shows how to use `pybrokk` in a project.

Starting from a small selection of top Canadian universities, the example walks through the following functions:

|Function|Input|Output|
|---|---|---|
|`create_id()`|a list of URLs|a list of unique URL IDs|
|`text_from_url()`|a list of URLs|a dictionary of scraped raw text|
|`duster()`|a list of URLs|a dataframe that combines the outputs of `create_id()` and `text_from_url()`|
|`bow()`|the output of `duster()`|the input dataframe with bag-of-words columns appended|

## List of URLs
Here are the university URLs used in this example:
- University of Toronto: https://www.utoronto.ca/
- University of British Columbia: https://www.ubc.ca/
- McGill University: https://www.mcgill.ca/
- Queen's University: https://www.queensu.ca/

## Imports

In [21]:
from pybrokk.pybrokk import create_id, text_from_url, duster, bow
import requests
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer

## Example input

Based on the universities listed above, this is the sample input passed to some of the functions in this package:


In [23]:
urls = ['https://www.utoronto.ca/',
         'https://www.ubc.ca/',
         'https://www.mcgill.ca/',
         'https://www.queensu.ca/']

## `create_id()`: 
### Create unique IDs for a list of URLs.

In [24]:
url_ids = create_id(urls)
url_ids

['utoronto1', 'ubc1', 'mcgill1', 'queensu1']

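The IDs above pair a short site name with a number. As an illustration only, the sketch below derives IDs of the same shape using the standard library. `sketch_create_id` is a hypothetical helper, and the rule it applies (first label of the host name after `www.`, plus a counter for repeated names) is an assumption about the pattern, not necessarily how `pybrokk` builds its IDs.

```python
from collections import Counter
from urllib.parse import urlparse


def sketch_create_id(urls):
    """Hypothetical helper: build '<site name><n>' IDs from a list of URLs."""
    seen = Counter()
    ids = []
    for url in urls:
        host = urlparse(url).netloc      # e.g. 'www.utoronto.ca'
        if host.startswith("www."):
            host = host[4:]              # e.g. 'utoronto.ca'
        name = host.split(".")[0]        # e.g. 'utoronto'
        seen[name] += 1                  # assumed: the suffix counts repeats
        ids.append(f"{name}{seen[name]}")
    return ids


sample_urls = ['https://www.utoronto.ca/',
               'https://www.ubc.ca/',
               'https://www.mcgill.ca/',
               'https://www.queensu.ca/']
print(sketch_create_id(sample_urls))
```
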
## `text_from_url()`: 
### Create a dictionary in which the keys are the URLs and the values are the raw text parsed by `BeautifulSoup`

In [25]:
dictionary = text_from_url(urls)


The first item of this dictionary looks like this:

In [26]:
list(dictionary.items())[0]

('https://www.utoronto.ca/',
 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUniversity of Toronto\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main   content      \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\nUCheck\n\n\n\n \nJump ToNews & Media\nAbout U of T\nGive To U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n\nSearch\n\n \n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\nUCheck\n \n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\n\n \n\n\n\n \nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n \nNews & Media\nAbout U of T\nGive to U of T\nAcademics\nResearch & Innovation\nUniversity Life\nLibraries\nPrograms of Study\nA to Z\n\n \n\n\n\n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n            What can we help you with?          \n\n\n\n\n\n\n\n \n\n\n \n\n\n\n \n\n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\

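The raw text shown above is dominated by runs of newlines left over from the page markup. The sketch below, using only the standard library, shows one way such a value could be collapsed for easier inspection. `collapse_whitespace` is a hypothetical helper and not part of `pybrokk`, and `sample` is a shortened stand-in for a real dictionary value.

```python
import re


def collapse_whitespace(raw_text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", raw_text).strip()


# Shortened stand-in for a value returned by text_from_url()
sample = "\n\n\nUniversity of Toronto\n\n\nSkip to main   content      \n\n"
print(collapse_whitespace(sample))
```
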
## `duster()`:
### Create a dataframe out of the outputs of `create_id()` and `text_from_url()`

In [27]:
df = duster(urls)
df

|id|url|raw_text|
|---|---|---|
|utoronto1|https://www.utoronto.ca/|University of TorontoSkip to main content ...|
|ubc1|https://www.ubc.ca/|The University of British ColumbiaSkip to main...|
|mcgill1|https://www.mcgill.ca/|McGill UniversityWINTER 2023 / HIVER 2023A saf...|
|queensu1|https://www.queensu.ca/|Home \| Queen's UniversitySkip to main content ...|

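To make the shape of this dataframe concrete, here is a rough sketch of how a list of IDs and a URL-to-text dictionary could be combined with `pandas`. It is not `duster()` itself: the stand-in values replace real calls to `create_id()` and `text_from_url()`, and dropping the newline characters is an assumption based on how the truncated `raw_text` values above appear.

```python
import pandas as pd

# Stand-in values; in the notebook these would come from
# create_id(urls) and text_from_url(urls).
urls = ["https://www.utoronto.ca/", "https://www.ubc.ca/"]
url_ids = ["utoronto1", "ubc1"]
texts = {
    "https://www.utoronto.ca/": "\n\nUniversity of Toronto\n\nSkip to main content\n",
    "https://www.ubc.ca/": "\n\nThe University of British Columbia\n",
}

sketch_df = pd.DataFrame(
    {
        "url": urls,
        # Assumption: newline characters are removed from the raw text.
        "raw_text": [texts[u].replace("\n", "") for u in urls],
    },
    index=pd.Index(url_ids, name="id"),
)
print(sketch_df)
```
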

## `bow()`:
### Append bag-of-words columns to the input dataframe.

In [28]:
df_bow = bow(df)
df_bow

|id|url|raw_text|0g4get|10|15|1827|18th|19|1v7tel|1z4tel|...|working|workshop|world|year|years|you|younger|your|youth|zsocial|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|utoronto1|https://www.utoronto.ca/|University of TorontoSkip to main content ...|0|0|0|1|0|0|0|0|...|0|1|1|0|1|1|0|1|0|0|
|ubc1|https://www.ubc.ca/|The University of British ColumbiaSkip to main...|0|0|0|0|0|1|1|1|...|0|0|1|0|0|0|0|0|0|0|
|mcgill1|https://www.mcgill.ca/|McGill UniversityWINTER 2023 / HIVER 2023A saf...|1|0|0|0|1|0|0|0|...|0|0|1|1|0|0|0|3|0|0|
|queensu1|https://www.queensu.ca/|Home \| Queen's UniversitySkip to main content ...|0|1|3|0|0|2|0|0|...|1|1|0|8|0|2|1|1|1|1|


`df_bow` is a tidy dataframe, with one row per URL and one count column per word, that can serve as a starting point for machine learning projects.
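
For readers new to the idea, the sketch below shows what a bag of words counts, using only the standard library. The two short documents are made up for illustration, and the tokenizer (lowercase alphanumeric tokens of two or more characters) is an assumption; `bow()` may tokenize differently.

```python
import re
from collections import Counter

# Made-up documents standing in for the raw_text column
docs = {
    "doc1": "World class research for a changing world",
    "doc2": "Research and teaching",
}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens of 2+ characters."""
    return re.findall(r"[a-z0-9]{2,}", text.lower())


# One Counter of token frequencies per document
counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

# Shared, sorted vocabulary: one column per distinct token
vocabulary = sorted(set().union(*counts.values()))

# One row of counts per document, aligned with the vocabulary
matrix = {doc_id: [c[word] for word in vocabulary] for doc_id, c in counts.items()}

print(vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Each row lines up with the sorted vocabulary, which mirrors how the word columns in `df_bow` sit next to `url` and `raw_text`.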