
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question: How does flexible work impact remote employees' productivity and satisfaction?

Data Collection:

1. Survey Design:
   - Create a survey with questions about work flexibility, productivity, and job satisfaction.
   - Include demographic questions for diversity insights.

2. Sampling:
   - Select 500 random employees from various industries for a diverse sample.

3. Consent and Anonymity:
   - Get participant consent and assure anonymity for honest responses.

4. Data Collection Method:
   - Use online survey platforms (e.g., Google Forms).
   - Share the survey link via email, workplace channels, and social media.

5. Time Frame:
   - Run the survey for four weeks for ample response time.

6. Quantitative Data:
   - Use scales to measure productivity and job satisfaction.

7. Qualitative Data:
   - Include open-ended questions for personal insights.

8. Data Storage:
   - Securely store data on a server, complying with privacy regulations.

9. Data Analysis:
   - Use tools like SPSS for quantitative data and thematic analysis for qualitative insights.

10. Reporting:
   - Compile a report with statistical results and key themes from qualitative responses.
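
The quantitative part of the analysis step could be sketched in plain Python instead of SPSS. The example below is a minimal sketch under stated assumptions: the field names (`flexibility`, `productivity`, `satisfaction`), the helper functions, and the five sample responses are all hypothetical and made up for illustration; they are not survey results.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; report 0 here
    return cov / sqrt(var_x * var_y)


def mean_by_group(rows, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical responses on 1-5 scales (illustration only, not real data)
responses = [
    {"flexibility": 1, "productivity": 2, "satisfaction": 2},
    {"flexibility": 2, "productivity": 2, "satisfaction": 3},
    {"flexibility": 3, "productivity": 3, "satisfaction": 3},
    {"flexibility": 4, "productivity": 4, "satisfaction": 5},
    {"flexibility": 5, "productivity": 4, "satisfaction": 5},
]

flex = [r["flexibility"] for r in responses]
prod = [r["productivity"] for r in responses]
print("r(flexibility, productivity) =", round(pearson(flex, prod), 3))
print(mean_by_group(responses, "flexibility", "satisfaction"))
```

A correlation on Likert-style ratings is only a first look; a rank-based measure may be more appropriate for ordinal scales.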


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import csv
import random

# Generate hypothetical data for the survey
def generate_data():
    data = []
    for _ in range(1000):
        age = random.randint(20, 60)
        industry = random.choice(['IT', 'Finance', 'Healthcare', 'Education'])
        flexibility_rating = random.randint(1, 5)
        productivity_rating = random.randint(1, 5)
        satisfaction_rating = random.randint(1, 5)

        data.append({
            'Age': age,
            'Industry': industry,
            'FlexibilityRating': flexibility_rating,
            'ProductivityRating': productivity_rating,
            'SatisfactionRating': satisfaction_rating
        })

    return data

# Save the generated data to a CSV file
def save_to_csv(data):
    fields = ['Age', 'Industry', 'FlexibilityRating', 'ProductivityRating', 'SatisfactionRating']

    with open('flexibility_survey_data.csv', mode='w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fields)
        writer.writeheader()
        writer.writerows(data)

if __name__ == "__main__":
    survey_data = generate_data()
    save_to_csv(survey_data)
    print("Dataset created and saved successfully.")

# Preview the first five rows of the saved file
with open('flexibility_survey_data.csv', mode='r', newline='') as file:
    reader = csv.DictReader(file)
    for idx, row in enumerate(reader, start=1):
        print(row)
        if idx == 5:
            break


Dataset created and saved successfully.
{'Age': '48', 'Industry': 'Education', 'FlexibilityRating': '1', 'ProductivityRating': '3', 'SatisfactionRating': '4'}
{'Age': '28', 'Industry': 'IT', 'FlexibilityRating': '3', 'ProductivityRating': '1', 'SatisfactionRating': '4'}
{'Age': '56', 'Industry': 'Healthcare', 'FlexibilityRating': '5', 'ProductivityRating': '2', 'SatisfactionRating': '3'}
{'Age': '45', 'Industry': 'IT', 'FlexibilityRating': '4', 'ProductivityRating': '5', 'SatisfactionRating': '1'}
{'Age': '39', 'Industry': 'Finance', 'FlexibilityRating': '3', 'ProductivityRating': '2', 'SatisfactionRating': '1'}
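
Before analysis, it may be worth checking that the saved rows have the expected count and that every rating lies on the 1-5 scale. The following is a minimal sketch: `validate_rows` and `RATING_FIELDS` are hypothetical helpers, and the in-memory sample (with one deliberately invalid rating) stands in for `flexibility_survey_data.csv`.

```python
import csv
import io

RATING_FIELDS = ["FlexibilityRating", "ProductivityRating", "SatisfactionRating"]


def validate_rows(rows, expected_count=None):
    """Return a list of human-readable problems found in survey rows."""
    problems = []
    rows = list(rows)
    if expected_count is not None and len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, found {len(rows)}")
    for line_no, row in enumerate(rows, start=1):
        for field in RATING_FIELDS:
            value = row.get(field, "")
            # csv.DictReader yields strings, so convert before range-checking
            if not str(value).isdigit() or not 1 <= int(value) <= 5:
                problems.append(f"row {line_no}: {field}={value!r} is not in 1-5")
    return problems


# Small in-memory sample; the second row contains an out-of-range rating
sample_csv = io.StringIO(
    "Age,Industry,FlexibilityRating,ProductivityRating,SatisfactionRating\n"
    "48,Education,1,3,4\n"
    "28,IT,3,1,9\n"
)
issues = validate_rows(csv.DictReader(sample_csv), expected_count=2)
print(issues)
```

To check the real file, pass `csv.DictReader(open('flexibility_survey_data.csv', newline=''))` and `expected_count=1000` instead of the sample.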


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# write your answer here
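import re
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Function to collect article data within a specific year range
def collect_articles_in_range(query, num_articles, start_year, end_year):
    base_url = "https://scholar.google.com/scholar"
    # Browser-like User-Agent; Google Scholar may still block automated requests
    headers = {"User-Agent": "Mozilla/5.0"}
    collected_articles = []
    start = 0  # result offset; advances on every page, even if nothing matched

    while len(collected_articles) < num_articles:
        params = {
            "q": query,
            "hl": "en",
            "start": start,
        }

        response = requests.get(base_url, params=params, headers=headers, timeout=30)
        if response.status_code != 200:
            print(f"Request failed with status {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")

        # Find article entries
        entries = soup.find_all("div", class_="gs_ri")
        if not entries:
            # No more results, or the request was blocked (e.g. by a CAPTCHA)
            break

        for entry in entries:
            # The "gs_a" line usually reads "Authors - Venue, Year - Publisher"
            meta_elem = entry.find("div", class_="gs_a")
            if not meta_elem:
                continue
            meta_text = meta_elem.text.replace("\xa0", " ")
            parts = [part.strip() for part in meta_text.split(" - ")]

            # Year: first four-digit year in the metadata line (0 if missing)
            year_match = re.search(r"\b(?:19|20)\d{2}\b", meta_text)
            year = int(year_match.group()) if year_match else 0

            # Skip entries outside the specified range
            if not (start_year <= year <= end_year):
                continue

            article_data = {}

            # Title
            title_elem = entry.find("h3", class_="gs_rt")
            article_data["Title"] = title_elem.text.strip() if title_elem else "N/A"

            # Venue or Journal: middle segment with the trailing year removed
            venue_text = parts[1] if len(parts) > 1 else ""
            venue_text = re.sub(r",?\s*(?:19|20)\d{2}\s*$", "", venue_text).strip()
            article_data["Venue"] = venue_text or "N/A"

            # Authors
            article_data["Authors"] = parts[0] or "N/A"

            # Abstract (Scholar shows only a short snippet, if available)
            abstract_elem = entry.find("div", class_="gs_rs")
            article_data["Abstract"] = abstract_elem.text.strip() if abstract_elem else "N/A"

            # Year
            article_data["Year"] = year

            collected_articles.append(article_data)
            if len(collected_articles) >= num_articles:
                break

        start += len(entries)
        time.sleep(2)  # pause between pages to avoid hammering the server

    return collected_articles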

# Specify the query, number of articles to collect, and the year range
query = "Data Science"
num_articles = 1000
start_year = 2014
end_year = 2024

# Collect articles within the specified year range
articles = collect_articles_in_range(query, num_articles, start_year, end_year)

# Print the collected articles
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['Title']}")
    print(f"Venue/Journal: {article['Venue']}")
    print(f"Year: {article['Year']}")
    print(f"Authors: {article['Authors']}")
    print(f"Abstract: {article['Abstract']}")
    print("\n")
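
The metadata parsing can be isolated in a small pure function so that it can be checked without network access. This is a minimal sketch, assuming the `gs_a` line on a Google Scholar result has the form "Authors - Venue, Year - Publisher"; that layout is an assumption about Scholar's markup and may change. The function name `parse_scholar_meta` and the example string are hypothetical.

```python
import re

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def parse_scholar_meta(meta_text):
    """Split an 'Authors - Venue, Year - Publisher' line into its parts."""
    # Non-breaking spaces may surround the dashes, so normalize them first
    text = meta_text.replace("\xa0", " ")
    parts = [part.strip() for part in text.split(" - ")]
    authors = parts[0] if parts and parts[0] else "N/A"
    middle = parts[1] if len(parts) > 1 else ""
    # Prefer a year in the venue segment; fall back to the whole line
    year_match = YEAR_PATTERN.search(middle) or YEAR_PATTERN.search(text)
    year = int(year_match.group()) if year_match else None
    # Venue is the middle segment with a trailing ", YYYY" removed
    venue = re.sub(r",?\s*(?:19|20)\d{2}\s*$", "", middle).strip() or "N/A"
    return {"Authors": authors, "Venue": venue, "Year": year}


# Made-up metadata line for illustration (not a real article)
example = "J Doe, A Smith\xa0- Journal of Examples, 2019 - example.org"
print(parse_scholar_meta(example))
```

Keeping the parser separate from the HTTP loop makes it possible to test the year and venue extraction on saved strings before sending any requests.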

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

I downloaded Octoparse and installed it locally on my computer. After logging in, I created a new task and entered the URL https://en.wikipedia.org/wiki/List_of_fatal_dog_attacks, then clicked 'Save' to display the webpage and start automatic data detection.
To preserve the settings, I clicked 'Create workflow' and ran the task using 'Standard Mode' under 'Run on your device.'
Once the run completed, I exported the data by clicking 'Export,' chose 'Remove Duplicates' to avoid repeated rows, and selected the CSV format.
The export succeeded, and I confirmed that the exported file contained the desired data. The whole process, from downloading Octoparse to collecting and exporting the data, took 20 steps.
The exported CSV file is available at:
https://github.com/VinaykrishnaGudla/vinaykrishna_Gudla_031/blob/main/List%20of%20fatal%20dog%20attacks%20-%20Wikipedia.csv
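
The 'Remove Duplicates' export option can also be reproduced in Python after the fact. The sketch below is illustrative only: `remove_duplicate_rows` is a hypothetical helper, and the in-memory sample with its column names and placeholder rows stands in for the exported file, whose real columns are not shown here.

```python
import csv
import io


def remove_duplicate_rows(rows):
    """Return rows with exact duplicates dropped, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# In-memory stand-in for an exported CSV; columns and rows are placeholders
exported = io.StringIO(
    "Date,Location,Description\n"
    "2001-01-01,Placeholder City,example row\n"
    "2001-01-01,Placeholder City,example row\n"
    "2002-02-02,Another Place,second example\n"
)
reader = csv.reader(exported)
header = next(reader)
rows = remove_duplicate_rows(reader)
print(header, len(rows))
```

To run this on a downloaded export, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`, where `path` points to the CSV file.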

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [3]:
'''
Working on web scraping tasks provided a valuable learning experience in understanding the process of extracting data from online sources. Key concepts such as HTML structure, CSS selectors, XPath, and HTTP requests were crucial in navigating and retrieving information from different websites. Understanding how to inspect web elements and identify patterns in the underlying structure of web pages proved to be essential skills in this process. Additionally, learning about different libraries and tools available for web scraping, such as Beautiful Soup and Scrapy in Python, broadened my understanding of the various approaches to data extraction.
'''
