
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [1]:
# write your answer here
'''
Research Question:
What are the common qualities celebrities have across different industries like acting, music, and athletics? More specifically, do their schooling, upbringing, and awards influence their careers?
Data to Collect:
Name
Industry (actor, musician, athlete, leader)
Early life (including birthdate, place, and family background)
Education level
Career start (first major role or achievement)
Notable awards or recognitions
Social media influence (followers or media mentions)

Amount of Data:
Collect data for 1,000 celebrity profiles.
These samples should span different industries to ensure diversity.

Steps for Collecting Data:

Identify Celebrity Categories:
Predefine a list of industries (actors, musicians, athletes, leaders) to categorize the celebrities.
Select Wikipedia as the Source.
Web Scraping Process:
Use Python with BeautifulSoup to scrape the celebrity profiles.
Ensure the profiles contain all required fields (education, awards).
Handle missing data with placeholder values.
Save the collected data into a CSV or JSON file for further analysis.
'''



## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import urllib.parse

base_url = 'https://en.wikipedia.org/wiki/Category:'

categories = [
    'American_actors',
    'British_musicians',
    'Athletes',
    'Current_heads_of_state',
    'Former_heads_of_state',
    'World_leaders'
]

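# Function to scrape a category page
def scrape_category(category):
    url = base_url + category
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Search only the category's member list when it exists, so that
    # navigation and sidebar links are not mistaken for profiles
    members = soup.find('div', id='mw-pages') or soup

    # Extract celebrity links, skipping namespaced pages such as 'Category:' or 'Help:'
    celeb_links = []
    for link in members.find_all('a', href=True):
        href = link['href']
        if href.startswith('/wiki/') and ':' not in href:
            full_url = urllib.parse.urljoin('https://en.wikipedia.org', href)
            celeb_links.append(full_url)

    return celeb_links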

# Function to scrape celebrity profile
def scrape_celebrity_profile(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        name_tag = soup.find('h1', {'id': 'firstHeading'})
        name = name_tag.text if name_tag else 'Unknown'

        infobox = soup.find('table', {'class': 'infobox'})
        details = {}

        if infobox:
            for tr in infobox.find_all('tr'):
                th = tr.find('th')
                td = tr.find('td')
                if th and td:
                    key = th.text.strip()
                    value = td.text.strip()
                    details[key] = value

        return {
            'Name': name,
            'Details': details
        }
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None

celebrity_data = []

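for category in categories:
    celeb_links = scrape_category(category)
    for link in celeb_links:
        # Stop fetching profiles once the target sample size is reached
        if len(celebrity_data) >= 1000:
            break
        profile = scrape_celebrity_profile(link)
        if profile:
            celebrity_data.append(profile)
    # The inner break only exits the link loop, so exit the category loop too
    if len(celebrity_data) >= 1000:
        break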

df = pd.DataFrame(celebrity_data)

df.to_csv('celebrity_profiles.csv', index=False)

print("Data collection complete. Saved to 'celebrity_profiles.csv'")


Data collection complete. Saved to 'celebrity_profiles.csv'
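
Category pages often link to the same article more than once, and the 1,000-sample target is easier to enforce on the link list than inside nested loops. A minimal sketch of that idea, using only the standard library; the helper name `unique_links` is hypothetical and the example URLs are made up for illustration:

```python
from itertools import islice


def unique_links(links, limit):
    """Return up to `limit` links in their original order, skipping duplicates."""
    def _dedupe():
        seen = set()
        for link in links:
            if link not in seen:
                seen.add(link)
                yield link
    return list(islice(_dedupe(), limit))


# Illustrative input only; these are not scraped results
sample = [
    'https://en.wikipedia.org/wiki/A',
    'https://en.wikipedia.org/wiki/B',
    'https://en.wikipedia.org/wiki/A',
    'https://en.wikipedia.org/wiki/C',
]
print(unique_links(sample, limit=2))
```

Passing the combined links from all categories through such a helper before calling `scrape_celebrity_profile` would avoid fetching the same page twice.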


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [4]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scholar_data(keyword, num_papers, start_year, end_year):
    base_url = "https://scholar.google.com/scholar"
    collected_data = []
    params = {
        'q': keyword,
        'hl': 'en',
        'as_ylo': start_year,
        'as_yhi': end_year
    }


    # Loop to move through results
    for start in range(0, num_papers, 10):
        params['start'] = start
        response = requests.get(base_url, params=params)
        if response.status_code != 200:
            print(f"Failed to retrieve data: Status code {response.status_code}")
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all result containers
        results = soup.find_all('div', class_='gs_ri')
        if not results:
            print("No more results found in parsing")
            break

        for result in results:
            title_elem = result.find('h3', class_='gs_rt')
            title = title_elem.text if title_elem else 'N/A'

            # Venue line (authors, venue, and year in one string)
            venue_elem = result.find('div', class_='gs_a')
            venue = venue_elem.text if venue_elem else 'N/A'

            #Abstract
            abstract_elem = result.find('div', class_='gs_rs')
            abstract = abstract_elem.text if abstract_elem else 'N/A'
            #Year
            year = 'N/A'
            for text in venue.split():
                if text.isdigit() and len(text) == 4 and start_year <= int(text) <= end_year:
                    year = text
                    break

            #Authors
            authors = venue.split('-')[0].strip()

            collected_data.append({
                'Title': title,
                'Venue': venue,
                'Year': year,
                'Authors': authors,
                'Abstract': abstract
            })
        #Progress reporting
        print(f"Retrieved {len(collected_data)}/{num_papers} papers.")

        #Delay to avoid hitting limits
        time.sleep(10)
    return collected_data

papers = scholar_data(keyword="XYZ", num_papers=1000, start_year=2014, end_year=2024)

# Check whether any data was collected before saving
if papers:
    df = pd.DataFrame(papers)
    df.to_csv('google_scholar_data.csv', index=False)
    print("Data collection completed. Saved to google_scholar_data.csv")
else:
    print("No data collected.")

Failed to retrieve data: Status code 429
No data collected.
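
The 429 status above means the server is rate-limiting the client. One common response is to retry with exponentially growing delays. The sketch below is a generic illustration, not a guaranteed fix for Google Scholar, which may keep blocking automated requests regardless. `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` stands for any zero-argument callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_backoff(fetch, max_retries=4, base_delay=10, sleep=time.sleep):
    """Call `fetch()` until it stops returning HTTP 429 or retries run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts.
    Returns the last response, which may still carry status 429.
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Double the wait after every rate-limited attempt
        sleep(base_delay * (2 ** attempt))
        response = fetch()
    return response
```

Inside `scholar_data`, the request could then be wrapped as `fetch_with_backoff(lambda: requests.get(base_url, params=params))`. The `sleep` parameter is injectable only so the delays can be checked without actually waiting.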


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
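!pip install praw
import os
import praw
import pandas as pd

# Read the API credentials from environment variables instead of hardcoding
# them in the notebook. The variable names below are assumptions; set them
# before running this cell.
client_id = os.environ['REDDIT_CLIENT_ID']
client_secret = os.environ['REDDIT_CLIENT_SECRET']
user_agent = 'Collect Data'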

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)

search_keyword = 'Machine Learning'
subreddit = 'all'

def collect_reddit_data(keyword, subreddit_name, limit=100):
    posts_data = []
    subreddit = reddit.subreddit(subreddit_name)

    for submission in subreddit.search(keyword, limit=limit):
        posts_data.append({
            'Title': submission.title,
            'Author': str(submission.author),
            'Score': submission.score,
            'URL': submission.url,
            'Created_UTC': submission.created_utc,
            'Text': submission.selftext
        })

    return posts_data

posts_data = collect_reddit_data(search_keyword, subreddit)

df = pd.DataFrame(posts_data)

df.to_csv('reddit_data.csv', index=False)

print("Data has been saved to 'reddit_data.csv'.")


Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl.metadata (9.8 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.7.1-py3-none-any.whl (191 kB)
   191.0/191.0 kB 4.1 MB/s eta 0:00:00
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0 update-checker-0.18.0


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Data has been saved to 'reddit_data.csv'.
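
`Created_UTC` is stored as a Unix timestamp (seconds since 1970-01-01 UTC), which is hard to read in the CSV. A minimal sketch of converting it to an ISO 8601 string with the standard library; the helper name `utc_iso` is hypothetical:

```python
from datetime import datetime, timezone


def utc_iso(timestamp):
    """Convert a Unix timestamp in seconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


print(utc_iso(0))  # 1970-01-01T00:00:00+00:00
```

With pandas, the same conversion can be applied to the whole column with `pd.to_datetime(df['Created_UTC'], unit='s', utc=True)`.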


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I felt the in-class activity was hard and needed much more time to complete.
'''
