# SourceForge Web Scraper
This notebook takes a SourceForge directory link (e.g. https://sourceforge.net/directory/license:osi/) and scrapes the entire directory, gathering a link and some summary data for each project. Once the links are gathered, the scraper visits each project page to collect the additional data there.

In [15]:
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

def scrapeSFMainPage(url):
    """
        Scrape a directory page and collect the data for the 25 projects
        listed on it, ignoring the 'Free Demo' ads.
        NOTE: SourceForge directory pages break after page 999. This is a
        SourceForge issue; we may need an alternative if we want more than
        ~25,000 projects.
    """
    headers = {'user-agent': 'UVA OSS Capstone Project - Nick Thompson (nat3fa@virginia.edu)'}
    
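    try:
        page = requests.get(url, headers = headers)
    except requests.ConnectionError:
        # Without a response there is nothing to parse; return an empty frame
        # so the caller can skip this page instead of hitting a NameError below
        return pd.DataFrame()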
        
    soup = BeautifulSoup(page.content, "html.parser")

    titles = soup.find_all("a", class_="result-heading-title")
    titles = [i.get_text().strip() for i in titles]

    descriptions = soup.find_all("div", class_="description")
    descriptions = [i.get_text().strip() for i in descriptions if not i.get_text().startswith('\n\n')]

    downloads = soup.find_all("a", title="Downloads This Week")
    downloads = [i.get_text().split(" This Week")[0].replace(',', '') for i in downloads]

    hrefs = soup.find_all("a", class_="button green hollow see-project")
    hrefs = [i.get('href') for i in hrefs]
    
    df_dict = {'titles':titles,
          'descriptions':descriptions,
         'downloads':downloads,
         'hrefs': hrefs}
    
    df = pd.DataFrame(df_dict)
    return df
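
The `try`/`except` around `requests.get` above gives a dropped connection only one chance. A minimal sketch of a retry loop follows. `fetch_with_retry`, its parameters, and the stand-in `flaky` fetcher are hypothetical names that are not part of the notebook, and the fetch function is passed in as an argument so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, exceptions=(ConnectionError,)):
    """Call fetch(url), retrying when one of `exceptions` is raised.

    Returns the result of fetch(url), or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except exceptions:
            # Wait before the next attempt, but not after the last one
            if attempt < retries - 1:
                time.sleep(delay)
    return None


# Stand-in fetcher that fails twice, then succeeds (for illustration only)
calls = {'n': 0}

def flaky(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated dropped connection')
    return 'content of ' + url

result_text = fetch_with_retry(flaky, 'https://sourceforge.net/directory/license:osi/', delay=0)
print(result_text)
```

To use this with `requests`, one would pass something like `lambda u: requests.get(u, headers=headers)` as `fetch` and `exceptions=(requests.ConnectionError,)`, since the `requests` exception is not a subclass of the built-in `ConnectionError`.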


In [16]:
# Here we run the previous function. Each directory page
# contains 25 OSS projects (plus ads, which are ignored).

dir_pages_to_search = 200  # can go up to 999 (for a total of ~25,000 URLs)
page_dfs = []
for i in tqdm(range(dir_pages_to_search)):
    URL = 'https://sourceforge.net/directory/license:osi/?page=' + str(i+1)
    page_dfs.append(scrapeSFMainPage(URL))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
final_df = pd.concat(page_dfs, ignore_index=True)
final_df.head()


100%|██████████| 200/200 [08:34<00:00,  2.57s/it]


Unnamed: 0,titles,descriptions,downloads,hrefs
0,MinGW - Minimalist GNU for Windows,This project is in the process of moving to os...,3523284,/projects/mingw/
1,Microsoft's TrueType core fonts,So far this project consists of a source rpm t...,1497188,/projects/corefonts/
2,SAP NetWeaver Server Adapter for Eclipse,Integrates Eclipse with the SAP NetWeaver Appl...,1281217,/projects/sapnweclipse/
3,WinSCP,WinSCP is a popular free SFTP and FTP client f...,877128,/projects/winscp/
4,PortableApps.com,PortableApps.com is the world's most popular ...,447271,/projects/portableapps/
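
The directory URLs above follow a simple `?page=N` pattern, and the scraper's docstring notes that pages beyond 999 break. A small sketch of generating the page URLs with that cap built in is shown below. `directory_page_urls` and `MAX_DIRECTORY_PAGE` are hypothetical names introduced here for illustration.

```python
MAX_DIRECTORY_PAGE = 999  # page limit noted in the scraper's docstring

def directory_page_urls(base_url, n_pages):
    """Return the URLs for pages 1..n_pages, capped at MAX_DIRECTORY_PAGE."""
    last_page = min(n_pages, MAX_DIRECTORY_PAGE)
    return [f"{base_url}?page={page}" for page in range(1, last_page + 1)]

urls = directory_page_urls('https://sourceforge.net/directory/license:osi/', 200)
print(len(urls))
print(urls[0])
print(urls[-1])
```

At 25 projects per page, 200 pages correspond to the 5,000 project links visited later in the notebook.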


In [17]:
def scrapeSFProjectPage(url):
    """
        Used in tandem with the main page scraper, this scrapes the individual project
        pages for additional data that is not present on the main page.  We can likely
        grab even more data from here, depending on what we are looking for.
    """
    headers = {'user-agent': 'UVA OSS Capstone Project - Nick Thompson (nat3fa@virginia.edu)'}

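    try:
        page = requests.get(url, headers = headers)
    except requests.ConnectionError:
        # Return a one-row placeholder so rows stay aligned with the hrefs list
        return pd.DataFrame({'stars': [None], 'num reviews': [None], 'last updated': [None],
                             'categories': [[]], 'contributers': [[]]})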
    
    soup = BeautifulSoup(page.content, "html.parser")
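    # Ratings are drawn as groups of five star <div>s: full, half, or empty
    stars_arr = soup.find_all("div", class_=["star large yellow", "star large half", "star large empty"])
    stars_arr = [i.get('class') for i in stars_arr]
    counter = 0
    star_count = 0
    stars = [0]  # stays 0 when the page shows no rating
    for arr in stars_arr:
        if "yellow" in arr:
            star_count += 1
        elif "half" in arr:
            star_count += 0.5
        counter += 1
        if counter == 5:
            # End of a five-star group: record its total and reset.
            # If the page has several groups, the last one wins.
            counter = 0
            stars[0] = star_count
            star_count=0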

    reviewNum = soup.find_all("a", class_="count")
    reviewNum = [i.get_text().split(' ')[0] for i in reviewNum]
    if len(reviewNum) == 0:
        reviewNum = [0]
    if reviewNum[0] == 'Add':
        reviewNum[0] = 0

    lastUpdated = soup.find_all("time", class_="dateUpdated")
    # Keep exactly one value so this column has the same length as the others
    lastUpdated = [lastUpdated[0].get_text()] if lastUpdated else [None]
        
    categories = soup.find_all(lambda tag: tag.name == 'span' and tag.get('itemprop') == 'applicationCategory')
    categories = [i.get_text() for i in categories]
    categories = [categories]
    
    contributers = soup.find_all("h3", class_="brought-by")

    if len(contributers) > 0:
        contributers = [i.get_text().split("by:")[1].strip().split(",") for i in contributers]
        contributers = [i.strip() for i in contributers[0]]
        contributers = [contributers]
    else:
        contributers = [[]]
    
    df_dict = {'stars':stars,
              'num reviews':reviewNum,
              'last updated': lastUpdated,
              'categories': categories,
              'contributers': contributers}
    
    df = pd.DataFrame(df_dict)
    return df
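
The star-counting loop in the function above can be isolated as a pure function, which makes the scoring rule easy to check without fetching a page. This is a sketch for a single group of five stars; `rating_from_star_classes` is a hypothetical helper name, and the input mimics the lists returned by `i.get('class')`.

```python
def rating_from_star_classes(star_classes):
    """Score one group of star class lists: 'yellow' = 1, 'half' = 0.5, otherwise 0."""
    total = 0.0
    for classes in star_classes:
        if 'yellow' in classes:
            total += 1
        elif 'half' in classes:
            total += 0.5
    return total

# Four full stars followed by one half star
example = [['star', 'large', 'yellow']] * 4 + [['star', 'large', 'half']]
rating = rating_from_star_classes(example)
print(rating)
```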

In [18]:
# Lastly, we run the second function.
# We pass it the hrefs gathered by the first function.

page_dfs2 = []
pageUrls = final_df['hrefs']
for url in tqdm(pageUrls):
    fullUrl = 'https://sourceforge.net' + url
    page_dfs2.append(scrapeSFProjectPage(fullUrl))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
final_df2 = pd.concat(page_dfs2, ignore_index=True)
final_df2.head()

100%|██████████| 5000/5000 [1:25:10<00:00,  1.02s/it]


Unnamed: 0,stars,num reviews,last updated,categories,contributers
0,4.0,158,2021-09-05,"[Build Tools, Code Generators, Debuggers, Comp...","[cstrauss, earnie, gressett, keithmarshall]"
1,4.5,28,2014-08-28,[Desktop Environment],[noa]
2,5.0,34,2016-07-20,[Application Servers],[kaloyan_raev]
3,5.0,174,2021-10-11,"[Communications, Cryptography, File Transfer P...",[martinprikryl]
4,5.0,245,46 minutes ago,"[Enterprise, Office Suites, Browsers]","[critternyc, markomlm]"
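
The `last updated` column in the output above mixes ISO dates with relative strings such as `46 minutes ago`. One possible way to normalize both forms is sketched below. `parse_last_updated` is a hypothetical helper, it assumes the only relative forms are minutes, hours, and days, and the `now` value is an arbitrary stand-in for the actual scrape time, which the notebook does not record.

```python
import re
from datetime import date, datetime, timedelta

_RELATIVE = re.compile(r'^(\d+)\s+(minute|hour|day)s?\s+ago$')

def parse_last_updated(text, now):
    """Return a date for an ISO string or an 'N <unit>s ago' string, else None."""
    text = text.strip()
    try:
        return date.fromisoformat(text)
    except ValueError:
        pass  # not an ISO date, so try the relative form next
    match = _RELATIVE.match(text)
    if match is None:
        return None  # unrecognized format (assumption: leave it for manual review)
    amount, unit = int(match.group(1)), match.group(2)
    return (now - timedelta(**{unit + 's': amount})).date()

now = datetime(2021, 10, 12, 9, 0)  # hypothetical scrape time
print(parse_last_updated('2021-09-05', now))
print(parse_last_updated('46 minutes ago', now))
```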


In [19]:
# The final results end with separate dataframes, so here
# we combine them into one final result df
result = pd.concat([final_df, final_df2], axis=1)
result

Unnamed: 0,titles,descriptions,downloads,hrefs,stars,num reviews,last updated,categories,contributers
0,MinGW - Minimalist GNU for Windows,This project is in the process of moving to os...,3523284,/projects/mingw/,4.0,158,2021-09-05,"[Build Tools, Code Generators, Debuggers, Comp...","[cstrauss, earnie, gressett, keithmarshall]"
1,Microsoft's TrueType core fonts,So far this project consists of a source rpm t...,1497188,/projects/corefonts/,4.5,28,2014-08-28,[Desktop Environment],[noa]
2,SAP NetWeaver Server Adapter for Eclipse,Integrates Eclipse with the SAP NetWeaver Appl...,1281217,/projects/sapnweclipse/,5.0,34,2016-07-20,[Application Servers],[kaloyan_raev]
3,WinSCP,WinSCP is a popular free SFTP and FTP client f...,877128,/projects/winscp/,5.0,174,2021-10-11,"[Communications, Cryptography, File Transfer P...",[martinprikryl]
4,PortableApps.com,PortableApps.com is the world's most popular ...,447271,/projects/portableapps/,5.0,245,46 minutes ago,"[Enterprise, Office Suites, Browsers]","[critternyc, markomlm]"
...,...,...,...,...,...,...,...,...,...
4995,Iconizer,Simple MS Windows GUI application to create an...,28,/projects/iconizer/,4.0,5,2015-08-17,"[User Interfaces, Graphics Conversion]",[pkrejcir]
4996,Java OCR,Java OCR is a suite of pure java libraries for...,25,/projects/javaocr/,3.5,21,2016-11-29,[],"[ko5tik, mrwwhitney, roncemer]"
4997,Agilefant,Agilefant is a simple but powerful web based t...,14,/projects/agilefant/,5.0,18,2016-04-14,"[Time Tracking, Project Management, Agile deve...","[agilefant, ikorein, sepi123]"
4998,GNU Prolog,GNU Prolog is a free implementation (under GPL...,81,/projects/gprolog/,0.0,0,2021-07-08,"[Artificial Intelligence, Compilers, Interpret...","[bartkul, diaz, domob, mio, spa]"


Below is a brief exploratory analysis (EDA) of the scraped contributor data.  Among projects that list at least one contributor, the maximum is 5 contributors and the mean is about 1.85.  Additional contributor data appears to be very hard to scrape and, after a couple of manual checks, does not appear to add a substantial number of contributors.  More work and research are needed to gather substantial contributor data.

In [20]:
# Keep only projects that list at least one contributor
contributer = result[result['contributers'].notna()]
contributer = contributer['contributers'].tolist()
contributer = [i for i in contributer if len(i) > 0]
contributerNum = [len(i) for i in contributer]
print(max(contributerNum))                      # largest contributor list
print(sum(contributerNum)/len(contributerNum))  # mean contributors per project

5
1.8495694294940797
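
The same summary can be written as a small function that also guards against the case where no project lists any contributors, which would otherwise cause a division by zero. This is a sketch only: `contributor_stats` is a hypothetical name, and the sample lists are taken from the first rows of the results table.

```python
from statistics import mean

def contributor_stats(contributor_lists):
    """Return (max, mean) contributor counts over the non-empty lists."""
    non_empty = [c for c in contributor_lists if c]
    if not non_empty:
        return 0, 0.0  # nothing to summarize
    sizes = [len(c) for c in non_empty]
    return max(sizes), mean(sizes)

sample = [
    ['cstrauss', 'earnie', 'gressett', 'keithmarshall'],
    ['noa'],
    [],  # a project with no listed contributors is ignored
    ['critternyc', 'markomlm'],
]
largest, average = contributor_stats(sample)
print(largest, average)
```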


Lastly, we save the results to a CSV file for later use.

In [22]:
result.to_csv('SourceForge_Data.csv')
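
One thing to keep in mind when reloading this file: list-valued cells such as `categories` and `contributers` are written to CSV as their text representation, so they come back as strings. A minimal sketch of restoring them with `ast.literal_eval` follows. The two-row `csv_text` is a hand-written stand-in shaped like the saved file, not actual output, and the standard-library `csv` module is used here only to keep the example self-contained.

```python
import ast
import csv
import io

# Two rows shaped like the saved file; list columns are stored as text
csv_text = '''titles,contributers
WinSCP,"['martinprikryl']"
PortableApps.com,"['critternyc', 'markomlm']"
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # literal_eval safely turns "['a', 'b']" back into a Python list
    row['contributers'] = ast.literal_eval(row['contributers'])
    rows.append(row)

print(rows[1]['contributers'])
```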