# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question: How has the discussion of electric vehicles evolved on Reddit over the past few years?
Data Collection: We will scrape Reddit for posts mentioning "electric vehicles" across multiple subreddits. For each post, we extract the title, author, creation date, and score. Together these show how often EV discussions occur, how much engagement they attract, and how sentiment toward EVs trends on Reddit.

Number of Instances: At least 1000 Reddit posts mentioning "electric vehicles."
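
As a rough sketch of how the collected posts could answer the research question, the snippet below counts posts per calendar year from their Unix creation timestamps. The function name `posts_per_year` and the sample rows are assumptions made up for illustration; real rows would come from the saved Reddit data.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_year(posts):
    """Count posts by the UTC year of their 'created' Unix timestamp."""
    years = (
        datetime.fromtimestamp(p["created"], tz=timezone.utc).year
        for p in posts
    )
    return dict(sorted(Counter(years).items()))


# Made-up sample rows, for illustration only
sample_posts = [
    {"title": "a", "created": 1609459200},  # 2021-01-01 00:00:00 UTC
    {"title": "b", "created": 1640995200},  # 2022-01-01 00:00:00 UTC
    {"title": "c", "created": 1641081600},  # 2022-01-02 00:00:00 UTC
]
print(posts_per_year(sample_posts))  # {2021: 1, 2022: 2}
```

A rising or falling count across years gives a first view of how the volume of EV discussion has changed; engagement could be examined the same way by summing scores per year.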



## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [5]:

import praw
red = praw.Reddit(client_id='iG7q_9MkzmXJuubBGygfgg', 
                     client_secret='h3vs37btLN69CYdodfEItroAZaiMlw', 
                     user_agent='my_app/0.1 by VenuV')
data = []
for submission in red.subreddit("all").search("electric vehicles", limit=1000):
    data.append({
        "title": submission.title,
        "author": submission.author.name if submission.author else "N/A",
        "created": submission.created_utc,
        "score": submission.score,
        "url": submission.url,
    })

import pandas as pd
df = pd.DataFrame(data)
df.to_csv('redditEVdata.csv', index=False)
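
The client ID and secret in the cell above are written directly into the notebook. One possible alternative, sketched here under assumptions, is to read them from environment variables so they are not shared along with the file. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical, not part of praw.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage (requires praw and real credentials set in the environment):
# red = praw.Reddit(**load_reddit_credentials())
```

The returned keys match the keyword arguments already passed to `praw.Reddit` in the cell above, so the rest of the code would not need to change.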

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# Collect Google Scholar articles for the keyword, published 2014-2024
import time

import pandas as pd
from scholarly import scholarly

key = 'electric vehicles'
target = 1000  # number of articles requested in the question
qry = scholarly.search_pubs(key)
lst1 = []

while len(lst1) < target:
    try:
        article = next(qry)
        bib = article.get('bib', {})
        yr = int(bib['pub_year'])
        if 2014 <= yr <= 2024:
            lst1.append({
                'title': bib['title'],
                'venue': bib.get('venue', 'N/a'),
                'year': yr,
                'authors': ', '.join(bib['author']),
                'abstract': bib.get('abstract', 'N/a'),
            })
        time.sleep(2)  # pause between requests to reduce the chance of being blocked
    except StopIteration:
        print("No more articles to fetch.")
        break
    except (KeyError, ValueError):
        # Skip results with a missing field or a non-numeric publication year
        continue
    except Exception as e:
        print(f"Error fetching article {len(lst1)}: {e}")
        break

df = pd.DataFrame(lst1)
df.to_csv('articles.csv', index=False)
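
The per-article handling in the loop above can also be factored into a small function that is testable without contacting Google Scholar. This is a minimal sketch: the helper name `to_record` is hypothetical, and the sample dictionaries are made up to mimic the `bib` layout the cell above reads (`title`, `author`, `pub_year`, `venue`, `abstract`).

```python
def to_record(article, first_year=2014, last_year=2024):
    """Turn one scholarly-style result into a flat row, or return None
    when the year is missing, non-numeric, or outside the range."""
    bib = article.get("bib", {})
    try:
        year = int(bib.get("pub_year"))
    except (TypeError, ValueError):
        return None
    if not first_year <= year <= last_year:
        return None
    authors = bib.get("author", [])
    if isinstance(authors, str):  # tolerate a single author given as a string
        authors = [authors]
    return {
        "title": bib.get("title", "N/a"),
        "venue": bib.get("venue", "N/a"),
        "year": year,
        "authors": ", ".join(authors),
        "abstract": bib.get("abstract", "N/a"),
    }


# Made-up results in the same shape, for illustration only
samples = [
    {"bib": {"title": "Paper A", "author": ["A. One", "B. Two"],
             "pub_year": "2019", "venue": "Journal X"}},
    {"bib": {"title": "Paper B", "author": ["C. Three"], "pub_year": "2009"}},
    {"bib": {"title": "Paper C", "author": ["D. Four"], "pub_year": "NA"}},
]
rows = [r for r in map(to_record, samples) if r is not None]
print(rows)
```

Only "Paper A" survives here: "Paper B" falls outside 2014-2024 and "Paper C" has no usable year, so both are dropped instead of stopping the whole collection.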

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [13]:
import praw
import pandas as pd
red = praw.Reddit(client_id='iG7q_9MkzmXJuubBGygfgg', 
                     client_secret='h3vs37btLN69CYdodfEItroAZaiMlw', 
                     user_agent='my_app/0.1 by VenuV')
subred = red.subreddit("technology")
red_data = []
for res in subred.search("electric vehicles", limit=1000):
    red_data.append({
        "title": res.title,
        "author": res.author.name if res.author else "N/A",
        "created_utc": res.created_utc,
        "score": res.score,
        "comments": res.num_comments, 
        "url": res.url,
    })

df = pd.DataFrame(red_data)
df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')

df.to_csv('reddit_technology_ev_posts.csv', index=False)
print(df.head())


                                               title       author  \
0    Biden urged to ban China-made electric vehicles       tommos   
1  Biden Calls Chinese Electric Vehicles a Securi...      newzee1   
2  1.8 Million Barrels of Oil a Day Avoided from ...     Wagamaga   
3  Honda says making cheap electric vehicles is t...   marketrent   
4  US suggests possibility of penalties if produc...  Lemonn_time   

          created_utc  score  comments  \
0 2024-04-13 00:15:30   7606      1789   
1 2024-02-29 13:14:34   8602      2498   
2 2023-12-10 17:47:37   7345      1385   
3 2023-10-26 00:08:18   9434      1460   
4 2024-05-15 18:11:57   3127       782   

                                                 url  
0     https://www.bbc.com/news/articles/cyerg64dn97o  
1  https://www.nytimes.com/2024/02/29/us/politics...  
2  https://cleantechnica.com/2023/12/09/1-8-milli...  
3  https://arstechnica.com/cars/2023/10/honda-can...  
4  https://www.yahoo.com/finance/news/us-suggests...  
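
The requirement that the data have more than four columns can be checked before saving. The sketch below is one way to do so on the list of row dictionaries; the helper name `check_columns` is hypothetical, and the sample row is copied from the printed output above.

```python
def check_columns(rows, min_columns=5):
    """Return the shared column names, or raise ValueError if the rows are
    empty, inconsistent, or have fewer than `min_columns` columns."""
    if not rows:
        raise ValueError("No rows collected")
    columns = list(rows[0].keys())
    for i, row in enumerate(rows):
        if list(row.keys()) != columns:
            raise ValueError(f"Row {i} has different columns: {list(row.keys())}")
    if len(columns) < min_columns:
        raise ValueError(
            f"Only {len(columns)} columns; expected at least {min_columns}"
        )
    return columns


# One row taken from the output shown above
sample_rows = [{
    "title": "Biden urged to ban China-made electric vehicles",
    "author": "tommos",
    "created_utc": "2024-04-13 00:15:30",
    "score": 7606,
    "comments": 1789,
    "url": "https://www.bbc.com/news/articles/cyerg64dn97o",
}]
print(check_columns(sample_rows))
```

With six columns per row, the check passes; a row set with only four columns would raise a `ValueError` instead.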


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [15]:
public_link = 'https://myunt-my.sharepoint.com/:x:/r/personal/venugopalreddyvennapusa_my_unt_edu/Documents/INFO5737/run_results.csv?d=w42d25c4ef6a24e7bad2afcb4421babc2&csf=1&web=1&e=WKAV4L'
print("Public link to the extracted data:", public_link)

Public link to the extracted data: https://myunt-my.sharepoint.com/:x:/r/personal/venugopalreddyvennapusa_my_unt_edu/Documents/INFO5737/run_results.csv?d=w42d25c4ef6a24e7bad2afcb4421babc2&csf=1&web=1&e=WKAV4L


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Learning Experience:
The web scraping tasks gave me practical experience extracting data from online sources. I learned to work with APIs and libraries such as praw for Reddit and BeautifulSoup for HTML parsing. The most valuable lesson was how to structure requests and handle the extracted data programmatically.

Challenges Encountered:
Obtaining API credentials was difficult, especially for social media sites such as Twitter and Reddit. Navigating the API documentation and working through the authentication steps was time-consuming. As a workaround, I explored non-programming tools such as ParseHub, which offer another way to extract data without direct API access.

Relevance to Your Field of Study:
Web scraping and data collection are core skills in my field because they let me gather and analyze real-world data. These techniques will be useful for sentiment analysis, trend detection, and other data-driven analyses that inform decision-making and research in business analytics and information systems.