<a href="https://colab.research.google.com/github/VijayaKumariGanipineni/VijayaKumari_INFO5731_Fall2024/blob/main/Ganipineni_VijayaKumari_INClassExercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

**How state-level COVID-19 vaccination rates impacted infection and mortality trends over time in the U.S.**
This study involves analyzing correlations between vaccination coverage and the rate of infections or deaths across states.

### Data Collection:

1. **Vaccination Data**: Obtaining daily state-level vaccination data from Kaggle’s dataset at (https://www.kaggle.com/datasets/paultimothymooney/usa-covid19-vaccinations).
2. **Infection & Mortality Data**: Using CDC or Johns Hopkins databases for infection and mortality counts.
3. **Timeline**: Collecting a year’s worth of data to capture dynamic changes, roughly 365 data points per state.

### Steps:

- **Merge Data**: Combining vaccination, infection, and mortality data based on state and date.
- **Data Storage**: Saving the merged data in CSV files for accessibility.
- **Tools**: Using Python (Pandas) for analysis, and performing statistical tests or regression modeling to assess the impact of vaccination rates.
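
The merge, storage, and analysis steps above can be sketched roughly as follows. This is a minimal sketch on toy data: the column names (`state`, `date`, `people_vaccinated_per_hundred`, `new_cases`), the values, and the output file name are assumptions for illustration, not the actual schema of the Kaggle, CDC, or Johns Hopkins files.

```python
import pandas as pd

# Toy stand-ins for the two sources; all names and values here are hypothetical.
vaccinations = pd.DataFrame({
    "state": ["Texas", "Texas", "Ohio", "Ohio"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "people_vaccinated_per_hundred": [10.0, 10.5, 12.0, 12.4],
})
cases = pd.DataFrame({
    "state": ["Texas", "Texas", "Ohio", "Ohio"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "new_cases": [5000, 4800, 2000, 1900],
})

def merge_state_data(vax, outcomes):
    """Join the two frames on state and date after parsing the dates."""
    vax = vax.assign(date=pd.to_datetime(vax["date"]))
    outcomes = outcomes.assign(date=pd.to_datetime(outcomes["date"]))
    # An inner join keeps only state/date pairs present in both sources.
    return vax.merge(outcomes, on=["state", "date"], how="inner")

merged = merge_state_data(vaccinations, cases)

# Save the merged data to a CSV file (hypothetical file name).
merged.to_csv("merged_covid_data.csv", index=False)

# Pearson correlation between vaccination coverage and new cases.
corr = merged["people_vaccinated_per_hundred"].corr(merged["new_cases"])
print(merged.shape, round(corr, 3))
```

A correlation computed this way only describes association; a regression with state and time effects would be the natural next step mentioned under **Tools**.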









## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
# Upload the local dataset to Colab
from google.colab import files
uploaded = files.upload()  # Opens a file picker for a manual upload

# Print the names of the uploaded files
for file_name in uploaded.keys():
    print(file_name)

import pandas as pd
import io

# Read the uploaded file by its name
df = pd.read_csv(io.BytesIO(uploaded['us_state_vaccination.csv']))

# Sample 1000 random rows from the dataset (fixed seed for reproducibility)
sampled_data = df.sample(n=1000, random_state=42)

# Save the sampled data to a new CSV file
sampled_data.to_csv("sampled_vaccination_data.csv", index=False)

# Download the sampled dataset
files.download("sampled_vaccination_data.csv")



Saving us_state_vaccination.csv to us_state_vaccination.csv
us_state_vaccination.csv


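
One caveat with the sampling step: `DataFrame.sample` raises a `ValueError` when `n` is larger than the number of rows and sampling is without replacement. A minimal sketch of a guarded version is shown below; the helper name `sample_rows` and the toy frame are hypothetical and not part of the original cell.

```python
import pandas as pd

def sample_rows(df, n=1000, seed=42):
    """Return up to n random rows, never asking for more rows than exist."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Toy frame with only three rows (illustrative values).
toy = pd.DataFrame({"state": ["A", "B", "C"], "doses": [1, 2, 3]})

small = sample_rows(toy, n=1000)   # falls back to all 3 rows
subset = sample_rows(toy, n=2)     # returns exactly 2 rows
print(len(small), len(subset))
```

Capping `n` silently returns fewer rows than requested, so it may be worth checking the resulting length when exactly 1000 samples are required.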

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [2]:
import requests
import pandas as pd
import time
from google.colab import files

# Defining constants
API_URL = 'https://api.crossref.org/works'
QUERY = 'XYZ'
NUM_ARTICLES = 1000
START_YEAR = 2014
END_YEAR = 2024
RETRY_LIMIT = 5  # Number of retry attempts
RETRY_DELAY = 5  # Initial delay in seconds

# A function to fetch articles from CrossRef with rate limit handling
def fetch_articles(query, num_articles, start_year, end_year):
    all_articles = []
    search_params = {
        'query': query,
        'rows': 100,  # Results per request
        'filter': f'from-pub-date:{start_year}-01-01,until-pub-date:{end_year}-12-31',
        'select': 'title,container-title,published-print,author,abstract'
    }

    retry_count = 0
    while len(all_articles) < num_articles:
        try:
            response = requests.get(API_URL, params=search_params)

            # Check for rate limiting before raise_for_status(), which would
            # otherwise raise on a 429 and skip the retry logic entirely
            if response.status_code == 429:
                retry_count += 1
                if retry_count > RETRY_LIMIT:
                    print("Rate limit exceeded. Stopping execution.")
                    break
                wait_time = RETRY_DELAY * (2 ** (retry_count - 1))  # Exponential backoff
                print(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
                continue

            response.raise_for_status()  # Raise an exception for other HTTP errors

            data = response.json()
            papers = data.get('message', {}).get('items', [])

            if not papers:
                break

            for paper in papers:
                article = {
                    'Title': paper.get('title', [''])[0],
                    'Venue': paper.get('container-title', [''])[0],
                    # Records without a 'published-print' date fall back to a Year of 0
                    'Year': paper.get('published-print', {}).get('date-parts', [[0]])[0][0],
                    'Authors': ', '.join(author.get('family', '') + ' ' + author.get('given', '') for author in paper.get('author', [])),
                    'Abstract': paper.get('abstract', '')
                }
                all_articles.append(article)
                if len(all_articles) >= num_articles:
                    break

            # Pagination: adjust the 'offset' parameter to fetch next set of results
            search_params['offset'] = search_params.get('offset', 0) + 100

        except requests.RequestException as e:
            print(f"Error fetching data: {e}")
            break

    return all_articles

# Fetching articles and converting them to a DataFrame
articles = fetch_articles(QUERY, NUM_ARTICLES, START_YEAR, END_YEAR)
df_articles = pd.DataFrame(articles)


# Saving the data to a CSV file
df_articles.to_csv('fetched_articles.csv', index=False)

# Downloading the CSV file
files.download('fetched_articles.csv')
# Displaying the first ten rows of the dataframe
df_articles.head(10)






| | Title | Venue | Year | Authors | Abstract |
|---|---|---|---|---|---|
| 0 | XYZ arm (XYZ robotic arm) | The Dictionary of Genomics, Transcriptomics an... | 0 | | |
| 1 | Testing metadata xyz | | 0 | | |
| 2 | Testing metadata xyz | | 0 | | |
| 3 | Testing metadata xyz | | 0 | | |
| 4 | PEG‐XYZ, peg‐XYZ | Catalysis from A to Z | 0 | Noir B.L.C. | |
| 5 | BIO-XYZ, What is XYZ? | Current Trends in Biomedical Engineering &amp;... | 0 | Nandy Subir Kumar | |
| 6 | Payroll Information System Based On Pt XYZ Cas... | | 0 | Mahendra Muchammad David, Eviyanti Ade | |
| 7 | [XYZ] | New Palauan-English Dictionary | 2019 | | |
| 8 | XYZ | Dictionary of Biology | 2014 | | |
| 9 | Independent Transversal Domination Number for ... | Turkish Journal of Mathematics and Computer Sc... | 0 | ATAY ATAKUL Betül | `<jats:p xml:lang="en">A dominating set of a gr...` |
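
The output above shows two data-quality issues: many rows have a Year of 0, and the abstract still carries JATS markup. A minimal post-processing sketch is given below. It assumes each record is a CrossRef-style dictionary and that the date fields `published-online` and `issued` are available in the response (they would have to be added to the `select` parameter); the helper names `extract_year` and `clean_abstract` and the sample record are hypothetical.

```python
import html
import re

# Date fields tried in order; the fallback order is an assumption.
DATE_FIELDS = ("published-print", "published-online", "issued")

def extract_year(record):
    """Return the first available publication year, or None if none is set."""
    for field in DATE_FIELDS:
        parts = record.get(field, {}).get("date-parts", [[None]])
        if parts and parts[0] and parts[0][0]:
            return int(parts[0][0])
    return None

def clean_abstract(raw):
    """Strip XML/JATS tags, decode HTML entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw or "")
    text = html.unescape(text)
    return " ".join(text.split())

# Illustrative record, not real API output.
sample = {
    "issued": {"date-parts": [[2019, 5]]},
    "abstract": '<jats:p xml:lang="en">A dominating set &amp; more</jats:p>',
}

print(extract_year(sample))
print(clean_abstract(sample["abstract"]))
```

Returning `None` instead of 0 for a missing year makes it easier to filter incomplete rows later, for example with `DataFrame.dropna`.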


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here

# Import necessary modules
import logging

# Set the logging level for the 'praw' logger to ERROR to hide lower-level messages (INFO, WARNING, etc.)
logging.getLogger("praw").setLevel(logging.ERROR)

#  PRAW initialization and data collection
import praw
import pandas as pd
from google.colab import files

# Reddit API credentials (altered after running the code for security; the real credentials can be provided in person if needed)
CLIENT_ID = 'O01Bu9MMEdpfYrF9exIJCzg'
CLIENT_SECRET = 'K0bQhSzmvgcd2U0NmBUSQZx66o4dv7Q'
USER_AGENT = 'I need to collect data to complete an assignment'

# Initializing Reddit instance
reddit = praw.Reddit(client_id=CLIENT_ID,
                     client_secret=CLIENT_SECRET,
                     user_agent=USER_AGENT)

# Data collection from Reddit based on a keyword 'trump'
def fetch_reddit_data(keyword, limit=1000):
    posts_data = []

    # Search for submissions using the keyword
    for submission in reddit.subreddit('all').search(keyword, limit=limit):
        posts_data.append({
            'Title': submission.title,
            'Author': submission.author.name if submission.author else 'N/A',
            'Subreddit': submission.subreddit.display_name,
            'Score': submission.score,
            'Number of Comments': submission.num_comments,
            'Created UTC': submission.created_utc,
            'URL': submission.url
        })

    # Convert the results to a Pandas DataFrame
    df = pd.DataFrame(posts_data)
    return df

# Collect Reddit data with the keyword 'trump'
keyword = 'trump'  # Desired keyword
reddit_data_df = fetch_reddit_data(keyword, limit=1000)

# Displaying the first 10 rows of the dataframe to verify the collected data
print(reddit_data_df.head(10))

# Saving the data to a CSV file
reddit_data_df.to_csv('fetched_reddit_data.csv', index=False)

# Downloading the CSV file
files.download('fetched_reddit_data.csv')

#End




                                               Title               Author  \
0                           Trump 2020 vs Trump 2024  lostredditorlurking   
1  Trump’s Vice President says Trump should never...         Redditname97   
2          Eminem gets flustered talking about Trump       Wild-Army-6085   
3  Trump says illegal immigrants are “eating the ...   SkillImmediate6393   
4  What are your thoughts on the Harris and Trump...        anderson01832   
5  Biden poses with kids wearing Trump T-shirts i...           knowitokay   
6  Former President Trump after the presidential ...        WeaponHex1638   
7  Trump during the Moment of Silence at the 9/11...          CrispyMiner   
8  Former President Trump during the presidential...                 mtaw   
9                 Trump rejects second Harris debate           Rivinstein   

           Subreddit   Score  Number of Comments   Created UTC  \
0  interestingasfuck   61271                2593  1.723528e+09   
1  interestingasfuck

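
The `Created UTC` column in the output above holds Unix timestamps in seconds (shown in scientific notation), which are hard to read. A minimal sketch of converting them to timezone-aware datetimes with pandas follows; the helper name `add_created_datetime`, the new `Created` column, and the toy values are assumptions for illustration.

```python
import pandas as pd

def add_created_datetime(df, column="Created UTC"):
    """Return a copy with a readable, UTC-aware 'Created' datetime column."""
    out = df.copy()
    out["Created"] = pd.to_datetime(out[column], unit="s", utc=True)
    return out

# Toy frame with epoch seconds (illustrative values, not real posts).
posts = pd.DataFrame({
    "Title": ["first post", "second post"],
    "Created UTC": [0.0, 86400.0],
})

converted = add_created_datetime(posts)
print(converted[["Title", "Created"]])
```

Working on a copy leaves the original frame untouched, so the raw timestamps remain available in the saved CSV.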

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I can honestly say that I gained skills which, with practice, can make me a real pro.
The web scraping part was particularly hard because the social media sites required credentials and tended to block requests. The other parts did not give me much trouble. After a struggle, I settled on the Reddit API, though it requires caution, as the account can easily be taken away.
The exercise taught me that I don't need to scroll through all of Reddit or any other social media platform when I can scrape the data I need with a few lines of code.
I look forward to more rigorous exercises like these, as they are a sure way of improving my research skills.

'''