<a href="https://colab.research.google.com/github/ganesh1616/Ganesh_INFO5731_Fall2024/blob/main/Ganesh_Marada_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
'''
Research Question: Which trends and patterns characterize the success of the top 1000 songs, and do factors such as genre, release date, artist popularity, duration, and critical reception correlate with that success?

To answer this research question, I will follow the steps below to collect and analyze the data:

1. Visit a website that provides information about the top 1000 songs, such as Billboard or Spotify Charts. The website should include details such as song title, artist, release year, genre, duration, chart position, stream count, and critical reviews.

2. Right-click on the webpage and select the "Inspect" option to open the developer tools, which expose the HTML structure of the page. Use these tools to locate the class or id attributes associated with the data to scrape (e.g., song title, artist, release year, genre).

3. Write a Python program that imports the necessary libraries, such as requests, BeautifulSoup, and pandas, to perform the web scraping. Pass the URL of the website to scrape as a string and send a GET request with the requests library to retrieve the webpage's content.

4. Parse the HTML content of the webpage using BeautifulSoup with the "html.parser" parser. This makes it possible to locate and extract the specific HTML elements containing the song data, such as song titles, artists, and chart positions, using the class or id attributes identified earlier.

5. Create lists to store the scraped data for each attribute of interest, for example song titles, artists, release years, and chart positions. Once the data has been collected into these lists, organize it into a pandas DataFrame.

6. Save the data to a CSV file for further analysis. The saved data can then be used to analyze trends such as the correlation between genre and chart success, the impact of artist popularity on song performance, or the influence of song duration on success.
'''
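The correlation analysis mentioned in step 6 could look roughly like the sketch below. It is only an illustration: the two columns are made-up placeholder values rather than scraped data, the helper names (`ranks`, `pearson`, `spearman`) are hypothetical, and a rank correlation is used here on the assumption that chart position is an ordinal quantity. With a pandas DataFrame, `DataFrame.corr(method="spearman")` computes the same statistic.

```python
# Illustrative only: the values below are made-up placeholders, not scraped data.
def ranks(values):
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical columns: song duration in seconds and chart position.
duration_seconds = [150, 180, 200, 210, 240]
chart_position = [5, 4, 3, 2, 1]

rho = spearman(duration_seconds, chart_position)
print(f"Spearman correlation: {rho:.2f}")
```

A value near -1 or +1 would indicate a strong monotonic relationship between the two columns, while a value near 0 would indicate none. Either way, correlation alone does not show that one factor causes chart success.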

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [3]:
# write your answer here
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.billboard.com/charts/hot-100/'

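# Fail fast on a hung connection or an HTTP error instead of parsing an error page.
response = requests.get(url, timeout=30)
response.raise_for_status()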

soup = BeautifulSoup(response.content, 'html.parser')

song_data = soup.find_all('li', class_='o-chart-results-list__item')

song_info = []

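for store in song_data:

    # Only the list items that contain an <h3> hold a song title.
    h3_tag = store.find('h3')
    if h3_tag:
        title = h3_tag.text.strip()

        # The artist name is in the first <span> after the title.
        artist = h3_tag.find_next('span').text.strip()

        # NOTE: in the output below, 'Current Rank' repeats the artist name, so the
        # first 'c-label' span in this list item is the artist span, not the rank.
        # The rank needs a more specific selector, checked against the page's HTML.
        rank = store.find('span', class_='c-label').text.strip()

        # NOTE: the visible 'Last Week Rank' and 'Peak Position' values in the output
        # below are 'N/A', which suggests these class names do not match the page.
        last_week_tag = store.find('span', class_='c-label--secondary')
        last_week = last_week_tag.text.strip() if last_week_tag else 'N/A'

        peak_tag = store.find('span', class_='c-label--tertiary')
        peak_position = peak_tag.text.strip() if peak_tag else 'N/A'

        weeks_tag = store.find('span', class_='c-label--quaternary')
        weeks_on_chart = weeks_tag.text.strip() if weeks_tag else 'N/A'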

        song_info.append({
            'Song Title': title,
            'Artist': artist,
            'Current Rank': rank,
            'Last Week Rank': last_week,
            'Peak Position': peak_position,
            'Weeks on Chart': weeks_on_chart
        })


song_DF = pd.DataFrame(song_info)

print(song_DF)

song_DF.to_csv('top_100_songs.csv', index=False)

            Song Title                               Artist  \
0   A Bar Song (Tipsy)                            Shaboozey   
1      I Had Some Help  Post Malone Featuring Morgan Wallen   
2             Espresso                    Sabrina Carpenter   
3     Die With A Smile               Lady Gaga & Bruno Mars   
4   Birds Of A Feather                        Billie Eilish   
..                 ...                                  ...   
95        Close To You                        Gracie Abrams   
96           Residuals                          Chris Brown   
97      Devil Is A Lie                        Tommy Richman   
98         Parking Lot               Mustard & Travis Scott   
99     American Nights                           Zach Bryan   

                           Current Rank Last Week Rank Peak Position  \
0                             Shaboozey            N/A           N/A   
1   Post Malone Featuring Morgan Wallen            N/A           N/A   
2                     Sabri
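The single chart page above yields 100 rows, while the question asks for 1000 samples. One possible way to close the gap, sketched here under assumptions, is to scrape ten consecutive weekly charts. The sketch assumes that past charts are served at the chart URL followed by a `YYYY-MM-DD/` date; that pattern, the helper name `weekly_chart_urls`, and the example date are all assumptions to check against the site before relying on them.

```python
from datetime import date, timedelta

BASE_URL = "https://www.billboard.com/charts/hot-100/"


def weekly_chart_urls(latest, weeks):
    """Build one dated chart URL per week, newest first.

    ASSUMPTION: the site serves past charts at BASE_URL + 'YYYY-MM-DD/'.
    """
    return [
        f"{BASE_URL}{(latest - timedelta(weeks=i)).isoformat()}/"
        for i in range(weeks)
    ]


# Ten weekly charts of 100 entries each would give 1000 rows.
# The start date is an arbitrary example, not taken from the notebook.
urls = weekly_chart_urls(date(2024, 9, 14), weeks=10)
for url in urls:
    print(url)
```

Each URL could then be passed through the same parsing loop as above, with an extra column recording the chart date so that rows from different weeks stay distinguishable. Pausing between requests (for example with `time.sleep`) keeps the load on the site low.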

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# Import necessary libraries
import requests
import time

def fetch_scholarly_articles(keyword, total_articles):
    """Print title, authors, year, venue, and abstract for up to total_articles search results."""

    api_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    query_params = {
        "query": keyword,
        "offset": 0,
        "limit": 100,
        "fields": "title,authors,year,venue,abstract",
        "year": "2014-2024"
    }

    articles_collected = 0

    while articles_collected < total_articles:
        response = requests.get(api_url, params=query_params)
        if response.status_code == 200:
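            articles = response.json().get('data', [])
            if not articles:
                # Stop when the API returns no more results; otherwise the loop never ends.
                print("No more results returned.")
                break
            for article in articles:
                # A field can be present but null, so fall back to 'N/A' explicitly.
                title = article.get('title') or 'N/A'
                authors = ', '.join(author.get('name') or 'N/A' for author in article.get('authors') or [])
                year = article.get('year') or 'N/A'
                venue = article.get('venue') or 'N/A'
                abstract = article.get('abstract') or 'N/A'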
                print(f"Title: {title}")
                print(f"Authors: {authors}")
                print(f"Year: {year}")
                print(f"Venue: {venue}")
                print(f"Abstract: {abstract}")
                print("\n---\n")
                articles_collected += 1
                if articles_collected >= total_articles:
                    break
            query_params['offset'] += 100

        elif response.status_code == 429:
            print("Rate limit exceeded. Waiting for 60 seconds before retrying...")
            time.sleep(60)
            continue

        else:
            print(f"Error fetching data: {response.status_code}")
            break


fetch_scholarly_articles("information retrieval", 1000)

Streaming output truncated to the last 5000 lines.

---

Title: Private Set Intersection: A Multi-Message Symmetric Private Information Retrieval Perspective
Authors: Zhusheng Wang, Karim A. Banawan, S. Ulukus
Year: 2019
Venue: IEEE Transactions on Information Theory
Abstract: We study the problem of private set intersection (PSI). In this problem, there are two entities <inline-formula> <tex-math notation="LaTeX">$E_{i}$ </tex-math></inline-formula>, for <inline-formula> <tex-math notation="LaTeX">$i=1, 2$ </tex-math></inline-formula>, each storing a set <inline-formula> <tex-math notation="LaTeX">$\mathcal {P}_{i}$ </tex-math></inline-formula>, whose elements are picked from a finite set <inline-formula> <tex-math notation="LaTeX">$\mathbb {S}_{K}$ </tex-math></inline-formula>, on <inline-formula> <tex-math notation="LaTeX">$N_{i}$ </tex-math></inline-formula> replicated and non-colluding databases. It is required to determine the set intersection <inline-formula> <tex-
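The code above prints each article but does not save anything. A minimal sketch of how the same records could be flattened and written to CSV is shown below. The helper names (`to_row`, `write_csv`) and the sample record are hypothetical; the record only mimics the shape the code above reads from the response's `data` list (`title`, `authors` with `name`, `year`, `venue`, `abstract`).

```python
import csv
import io

FIELDS = ["title", "venue", "year", "authors", "abstract"]


def to_row(article):
    """Flatten one API record into a dict of plain values, using 'N/A' for gaps."""
    authors = ", ".join(a.get("name") or "" for a in article.get("authors") or [])
    return {
        "title": article.get("title") or "N/A",
        "venue": article.get("venue") or "N/A",
        "year": article.get("year") or "N/A",
        "authors": authors or "N/A",
        "abstract": article.get("abstract") or "N/A",
    }


def write_csv(articles, handle):
    """Write the flattened records to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(to_row(a) for a in articles)


# Hypothetical record, shaped like one item of the response's 'data' list.
sample = [{
    "title": "Example Paper",
    "venue": "",
    "year": 2020,
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "abstract": None,
}]

buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the handle could come from `open("articles.csv", "w", newline="", encoding="utf-8")`, where the filename is only an example. Inside `fetch_scholarly_articles`, the records would be appended to a list in place of the `print` calls and passed to `write_csv` once the loop finishes.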

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with web scraping in Python for Question 4A, employ any online tool such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in a format like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here
"https://myunt-my.sharepoint.com/:f:/g/personal/ganeshmarada_my_unt_edu/EhnF3E01IbREuZT_pbNcFUMBwQHzbH6UYPF-x1wsHty_KA?e=QyIF2E"

'https://myunt-my.sharepoint.com/:f:/g/personal/ganeshmarada_my_unt_edu/EhnF3E01IbREuZT_pbNcFUMBwQHzbH6UYPF-x1wsHty_KA?e=QyIF2E'

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Personally, I find these tools beneficial when quick, reliable data extraction is needed without investing too much time in building custom scrapers. However, their limitations, such as restricted control over customization and potential costs for large-scale projects, make them less appealing to developers who need more granular control, which is achievable with Python packages like BeautifulSoup and Selenium.


Integrating a tool like ParseHub and exporting the data to formats such as CSV or Excel is efficient for data analysis and reporting. It is a good option in situations where coding is not the most practical approach. Overall, I feel these tools strike a balance between usability and functionality, offering a viable solution for web scraping tasks.
'''
'''