## Uniscraper Tutorial

This tutorial shows how to get Uniscraper running on your computer and demonstrates some potential uses of the package. Each cell contains a block of code that tests the package on a different kind of source. Note: to import Uniscraper, the Uniscraper package folder must be in the same folder as this file.
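Because the import depends on where the package folder sits, it can help to check the location before running the first cell. The snippet below is a minimal sketch, assuming the package lives in a directory named `Uniscraper` and that the notebook runs from its own folder; `package_is_local` is a hypothetical helper, not part of Uniscraper.

```python
from pathlib import Path


def package_is_local(package_name: str, folder: Path) -> bool:
    """Return True if `folder` contains a directory named `package_name`."""
    return (folder / package_name).is_dir()


# Assumption: the notebook's working directory is the folder that holds this file.
notebook_folder = Path.cwd()
if not package_is_local("Uniscraper", notebook_folder):
    print(f"No Uniscraper folder found in {notebook_folder}; the import will likely fail.")
```

If the message appears, move the Uniscraper folder next to this file (or start the notebook from the folder that contains it) before importing.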

In [None]:
#Import the Uniscraper package
from Uniscraper.Uniscraper import uniscraper

### Scraping Data From a Static Webpage

In [2]:
url = "https://www.canr.msu.edu/people/meng-cai"
webpage = uniscraper(url)
print(webpage.text[0:100]) #Prints only the first 100 characters for space purposes

                      Menu            Search   Search                 School of Planning, Design and


### Scraping Data From a Dynamic Webpage

In [None]:
dynamic_url = "https://careernetwork.msu.edu/outcomes/"
dynamic_webpage = uniscraper(dynamic_url)
print(dynamic_webpage.text[0:100]) #Prints only the first 100 characters for space purposes




### Scraping Data From a PDF File

In [None]:
pdf_url = "https://www.canr.msu.edu/spdc/uploads/files/student_services/degrees/phd/2018PDC_PhDDegree_Handbook_SPDC_WCAG2.0_June2018.pdf"
pdf_webpage = uniscraper(pdf_url)
print(pdf_webpage.text[0:100]) #Prints only the first 100 characters for space purposes

 

 

PhD Student Handbook 

Dr. Ming-Han Li 

Director 

Human Ecology Bldg. 

552 W. Circle Dr., R


### Scraping Data From a PowerPoint File

In [None]:
pptx_url = "https://grad.msu.edu/sites/default/files/content/researchintegrity/IntellectualProperty.pptx"
pptx_webpage = uniscraper(pptx_url)
print(pptx_webpage.text[0:100]) #Prints only the first 100 characters for space purposes

    Skip to content  Skip to main nav         Log in          Search Tool    Search         Michigan


### Scraping Data From a Spreadsheet

In [None]:
xlsx_url = "https://oiss.isp.msu.edu/index.php/download_file/view/966/273/"
xlsx_webpage = uniscraper(xlsx_url)
print(xlsx_webpage.text[0:100]) #Prints only the first 100 characters for space purposes

African Student Leadership Association

African Student Union

Arab Cultural Society

Armenian Stude


### Scraping Data From a Word Document

In [None]:
word_url = "http://vprgs.msu.edu/msu-services-agreement-form"
word_webpage = uniscraper(word_url)
print(word_webpage.text[0:100]) #Prints only the first 100 characters for space purposes

SERVICES AGREEMENT WITH 

MICHIGAN STATE UNIVERSITY



Purpose.  Michigan State University (hereinaf


### Searching For Keywords in Scraped Data

In [28]:
keyword = webpage.search("caimeng2@msu.edu")
type(keyword)


                      Menu            Search   Search                 School of Planning, Design and Construction            About  Programs  People  Students  Research & Outreach  Events  News  Giving                 Meng Cai       Contact Me   Email:  [93mcaimeng2@msu.edu[0m       Major professor: Mark Wilson  Research interests: urban sustainability, smart cities, transformative technologies, transportation, computational social science.               Meng Cai     About    Contact Information   School of Planning, Design and Construction            X Close          « Previous  Next »                                Call Us: 517-432-0704  Contact Information  Sitemap  Accessibility  Privacy  Disclaimer    Call MSU: (517) 355-1855  Visit: msu.edu  MSU is an affirmative-action, equal-opportunity employer.  Notice of Nondiscrimination    Spartans Will .  © Michigan State University             MSU is an affirmative-action, equal-opportunity employer, committed to achieving excellence t

NoneType
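The `[93m` and `[0m` fragments around the match in the output above look like ANSI color codes (bright yellow, then reset) whose escape character was lost when the notebook was exported. The `NoneType` result suggests that `search` prints the highlighted text rather than returning it. The sketch below shows one way such highlighting could work; it is a hypothetical illustration, not Uniscraper's actual implementation, and the names `highlight` and `search_and_print` as well as the case-sensitive matching are assumptions.

```python
YELLOW = "\033[93m"  # ANSI escape sequence for bright yellow text
RESET = "\033[0m"    # ANSI escape sequence that resets text styling


def highlight(text: str, keyword: str) -> str:
    """Wrap every exact (case-sensitive) occurrence of `keyword` in color codes."""
    return text.replace(keyword, f"{YELLOW}{keyword}{RESET}")


def search_and_print(text: str, keyword: str) -> None:
    """Print the text with matches highlighted; nothing is returned."""
    print(highlight(text, keyword))


# Hypothetical usage on a short string instead of a scraped page
result = search_and_print("Email: caimeng2@msu.edu", "caimeng2@msu.edu")
print(type(result))
```

In a terminal or notebook that interprets ANSI codes, the match appears in yellow; in a plain-text export only the bracketed fragments remain.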

### Practical Application

Below is an example of how Uniscraper was used together with Selenium in our project. The code visits a website, finds the hyperlinks on the page that stay on the same domain, and builds a list of those links (up to `max_links` per site). It then scrapes the collected links and searches the scraped text for a keyword. To keep the output short, this example scrapes only the last two collected links.

In [3]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import urljoin, urlparse
from Uniscraper.Uniscraper import uniscraper


def generate_url_list(info, max_links=20):
    """
    Return the same-domain links visible from each given website.
    Params:
        info: DataFrame with 'name' and 'url' columns
        max_links: maximum number of links collected per website
    Returns:
        result_df: DataFrame with 'name' and 'url' columns, one row per link
    """
    all_links = []  

    
    for index, row in info.iterrows():
        #Get name and url to base/starting page
        name = row["name"] 
        url = row["url"]  

        
        driver = webdriver.Chrome() 
        driver.get(url)
        time.sleep(1)

        #Parsing url to ensure consistency and proper formatting
        parsed_url = urlparse(url)
        #Takes elements such as scheme and netloc to create valid base domain
        base_domain = f"{parsed_url.scheme}://{parsed_url.netloc}"

        #Set data structure used to avoid duplicates
        links = set()

        #Looping through each link on the page
        for a in driver.find_elements(By.TAG_NAME, "a"):
            href = a.get_attribute("href")
            if href: #If link exists
                #Resolve the href against the base domain to get an absolute URL
                full_link = urljoin(base_domain, href)
                #Keep the link only if it stays on the base domain
                if full_link.startswith(base_domain) and full_link not in links:
                    links.add(full_link)
                    if len(links) >= max_links: #Stopping point after max_links
                        break

        
        driver.quit()

        #All links for a website will have the same name but different urls
        for link in links:
            all_links.append({"name": name, "url": link})

    #Convert to dataframe 
    result_df = pd.DataFrame(all_links)

    return result_df


info = pd.DataFrame({
    "name": ["test"], 
    "url": ["https://nohello.net/en/"]
})
website_list = generate_url_list(info)

#print(website_list)

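#Scrape only the last two collected links to keep the output short
for url in website_list["url"].tail(2):
    site_data = uniscraper(url)
    #search() prints the text with matches highlighted; as shown earlier, it returns None
    keyword_hit = site_data.search("no")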





      [93mno[0m bonjour |    בבקשה לא להגיד סתם שלום בצ׳אט  נגיד שהתקשרת למישהו בטלפון, אמרת שלום! ואז העברת להמתנה… 🤦‍♀️      ❌ לא לעשות ככה     טל  14:15  היי     אלי  14:19  …?     טל  14:20  מתי זה הבדר הזה?     אלי  14:20  אה - 15:30        נא לשים לב שטל יכול היה לקבל את התשובה הרבה לפני כן, ולא היה צריך להמתין לאלי. למעשה, אלי יכול היה לחשוב על השאלה שלו באותו רגע!  אנשים שעושים את זה בדרך כלל מנסים להיות מנומסים על ידי כך שאינם קופצים ישירות לבקשה, כמו שעושים פנים מול פנים או בטלפון - וזה נהדר! אבל אנחנו ב־2022 וצ׳אט אינו אף אחד מאלה. לרוב האנשים הקלדה היא אטית משמעותית מדיבור. לכן, למרות כל הכוונות הטובות זה רק גורם לצד השני להמתין לך לנסח את השאלה שלך, שזה אובדן יעילות (ודי מעצבן).  כנ״ל לגבי:   „היי, יש שם מישהו?”  „היי יפעת - שאלה קצרה.”  „יש לך שנייה?”  „שם?”  „אפשר סימן חיים”  וכו׳   עדיף פשוט לשאול! 😫        ✅ במקום, עדיף לנסות את זה     יעל  14:15  היי! מתי זה קורה?     אלי  14:15  היי, 15:30     יעל  14:15  אחלה - נתראה!     אלי  14:16  👌 סגור        אם לדעתך זה מעט 