<a href="https://colab.research.google.com/github/Vinuthna06reddy/VinuthnaReddy_INFO5731_FALL2024/blob/main/INFO5731_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
Can machine learning models predict a person's emotional state based on real-time facial expressions and physiological data?

To answer this question, the following data should be collected:

*   Facial Expression Data: Real-time images or video frames of a person’s face. Each image should be labeled with the corresponding emotional state.
*   Physiological Data:
    *   Heart Rate: Continuous heart rate monitoring using a device such as a smartwatch or a heart rate sensor.
    *   Skin Temperature: Collected via sensors on wearable devices.
    *   Galvanic Skin Response (GSR): Measures the electrical conductance of the skin, which varies with sweating and can be used to infer stress levels.
*   Emotional Labels: Self-reported emotional states at regular intervals or after specific tasks, used to create labeled data.

Amount of Data Needed:
*   Sample Size: As a working target for a robust machine learning model, collect at least 1000 samples per emotion class. For example, predicting 5 emotion classes (e.g., happy, sad, neutral, stressed, and calm) requires at least 5000 samples in total.
*   Multimodal Data: For each sample, collect:
    *   A facial image (or video frame).
    *   Corresponding physiological readings (heart rate, skin temperature, GSR).
    *   An emotion label (self-reported or inferred).

Steps for Collecting and Saving the Data:
*   Set Up Sensors:
    *   Use a webcam or smartphone camera to collect facial expression data.
    *   Use wearable devices (e.g., smartwatches, fitness trackers) to continuously monitor heart rate, skin temperature, and GSR.
*   Emotional Labels:
    *   Ask participants to self-report their emotional state at regular intervals or during/after specific tasks (e.g., watching videos designed to elicit certain emotions).
    *   Alternatively, use validated questionnaires such as the PANAS (Positive and Negative Affect Schedule).
*   Automate Data Collection:
    *   Use scripts that record data from the camera and the sensors at the same time.
    *   Store the data in structured files (e.g., CSV or JSON) with corresponding timestamps.
*   Preprocessing:
    *   Facial data: Detect faces (e.g., with OpenCV’s Haar cascades) and extract facial landmarks (e.g., with Dlib) as input features for expression recognition.
    *   Physiological data: Clean the data by removing outliers, smoothing signals (e.g., with moving averages), and normalizing values.
*   Store the Data:
    *   Store images/video frames separately, and save the corresponding physiological data in a structured format (e.g., CSV files).
    *   Ensure that the emotion labels are aligned with both the facial and physiological data.
*   Backup and Validation:
    *   Regularly back up the data to cloud storage or an external server.
    *   Validate the data by checking the alignment between the facial expression, physiological data, and emotion labels.
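
The preprocessing and alignment steps above could be sketched roughly as follows. This is a minimal illustration with pandas, not a prescribed pipeline: the function names, the column names (`timestamp`, `heart_rate`, `emotion`), the 1st/99th percentile clipping thresholds, and the 30-second matching tolerance are all assumptions chosen for the example.

```python
import pandas as pd


def preprocess_signal(series, window=5):
    """Clip outliers, smooth with a moving average, then z-score normalize."""
    # Clip values outside the 1st-99th percentile range (thresholds are an assumption).
    lower, upper = series.quantile(0.01), series.quantile(0.99)
    clipped = series.clip(lower, upper)
    # Centered moving average; min_periods=1 avoids NaN values at the edges.
    smoothed = clipped.rolling(window=window, center=True, min_periods=1).mean()
    # Z-score normalization using the population standard deviation.
    std = smoothed.std(ddof=0)
    if std == 0:
        return smoothed - smoothed.mean()
    return (smoothed - smoothed.mean()) / std


def align_labels(sensor_df, label_df, tolerance="30s"):
    """Attach the most recent self-reported label to each sensor reading."""
    sensor_df = sensor_df.sort_values("timestamp")
    label_df = label_df.sort_values("timestamp")
    # direction="backward" picks the latest label at or before each reading.
    return pd.merge_asof(
        sensor_df,
        label_df,
        on="timestamp",
        direction="backward",
        tolerance=pd.Timedelta(tolerance),
    )


# Illustrative placeholder readings, one every 10 seconds (not real measurements).
sensor = pd.DataFrame({
    "timestamp": pd.date_range(
        "2024-09-15 22:00:00", periods=6, freq=pd.Timedelta(seconds=10)
    ),
    "heart_rate": [72.0, 74.0, 180.0, 73.0, 75.0, 71.0],
})
# Two self-reports, taken at the first and fourth reading.
labels = pd.DataFrame({
    "timestamp": sensor["timestamp"].iloc[[0, 3]].to_numpy(),
    "emotion": ["calm", "stressed"],
})

sensor["heart_rate_norm"] = preprocess_signal(sensor["heart_rate"], window=3)
aligned = align_labels(sensor, labels)
print(aligned)
```

Readings that have no label within the tolerance window get a missing value in `emotion`, which gives a simple way to spot alignment gaps during validation.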


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import pandas as pd
import random
import time
import uuid
from datetime import datetime

num_samples = 1000

emotion_labels = ['happy', 'sad', 'neutral', 'stressed', 'calm']

timestamps = []
heart_rates = []
skin_temperatures = []
gsr_values = []
emotion_states = []
face_image_ids = []

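def simulate_physiological_data():
    """Return one simulated (heart_rate, skin_temp, gsr) reading.

    Values are drawn uniformly from fixed ranges; they are random
    placeholders, not measurements from real sensors.
    """
    heart_rate = random.uniform(60, 100)
    skin_temp = random.uniform(30, 36)
    gsr = random.uniform(0.1, 10)
    return heart_rate, skin_temp, gsr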

for i in range(num_samples):
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    heart_rate, skin_temp, gsr = simulate_physiological_data()

    emotion = random.choice(emotion_labels)

    face_image_id = str(uuid.uuid4())

    timestamps.append(timestamp)
    heart_rates.append(heart_rate)
    skin_temperatures.append(skin_temp)
    gsr_values.append(gsr)
    emotion_states.append(emotion)
    face_image_ids.append(face_image_id)

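    # Pause 0.1 s between samples to mimic a sampling interval
    # (about 100 s in total for 1000 samples).
    time.sleep(0.1)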

data = pd.DataFrame({
    'Timestamp': timestamps,
    'Heart_Rate': heart_rates,
    'Skin_Temperature': skin_temperatures,
    'GSR': gsr_values,
    'Emotion_State': emotion_states,
    'Face_Image_ID': face_image_ids
})

data.to_csv('emotion_prediction_data.csv', index=False)

data.head(1000)

Unnamed: 0,Timestamp,Heart_Rate,Skin_Temperature,GSR,Emotion_State,Face_Image_ID
0,2024-09-15 22:31:44,70.756408,32.474074,3.278571,neutral,040a1baf-9358-4f93-b8c0-86d72c4bc214
1,2024-09-15 22:31:44,70.974549,31.649998,1.464799,happy,4ef509bc-68f1-412c-bd11-f83026a8cf9d
2,2024-09-15 22:31:44,98.529241,30.916742,4.277333,sad,5ba50a6c-5bca-40d9-84c0-e56fa83f289e
3,2024-09-15 22:31:44,63.599065,35.881324,2.538643,happy,f8363b53-f1c1-421f-acf3-dce613bbee4d
4,2024-09-15 22:31:44,80.810954,33.995764,5.864500,calm,78f986ed-907d-4994-b298-de0c721c04bf
...,...,...,...,...,...,...
995,2024-09-15 22:33:24,66.081459,35.875076,4.827052,stressed,363120e1-100b-4255-9e28-972c9d7c1d7a
996,2024-09-15 22:33:24,61.200041,30.346677,2.935126,stressed,e3600882-ee53-4c45-aeed-0fcf2ae1e892
997,2024-09-15 22:33:24,91.783047,34.302858,9.715665,sad,c6c068e1-d944-44a5-bffb-0ba8dbe0b360
998,2024-09-15 22:33:24,98.836691,35.545317,9.925111,stressed,b248a962-8d96-44d2-890a-270f6ddcbdb7
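
Because the values above are simulated, a seeded and vectorized variant may be convenient when the same dataset needs to be regenerated. The sketch below is one possible way to do that with NumPy. The function name `simulate_dataset`, the seed, and the fixed start timestamp are assumptions for illustration; the value ranges and column names follow the code above. The UUIDs are not controlled by the seed, so only the numeric columns and labels are reproducible.

```python
import uuid

import numpy as np
import pandas as pd


def simulate_dataset(num_samples=1000, seed=42):
    """Generate a simulated dataset without sleeping between samples."""
    rng = np.random.default_rng(seed)
    emotion_labels = ['happy', 'sad', 'neutral', 'stressed', 'calm']
    # Timestamps spaced 0.1 s apart from an arbitrary, assumed start time.
    timestamps = pd.date_range(
        "2024-09-15 22:31:44",
        periods=num_samples,
        freq=pd.Timedelta(milliseconds=100),
    )
    return pd.DataFrame({
        'Timestamp': timestamps,
        'Heart_Rate': rng.uniform(60, 100, num_samples),
        'Skin_Temperature': rng.uniform(30, 36, num_samples),
        'GSR': rng.uniform(0.1, 10, num_samples),
        'Emotion_State': rng.choice(emotion_labels, num_samples),
        # uuid4 draws from the OS random source, so IDs differ between runs.
        'Face_Image_ID': [str(uuid.uuid4()) for _ in range(num_samples)],
    })


data = simulate_dataset()
# Quick sanity check on the class balance of the random labels.
print(data['Emotion_State'].value_counts())
```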


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference in which it was published

(3) Year

(4) Authors

(5) Abstract

In [2]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

def google_scholar_scraper(query, num_articles=10):
    base_url = 'https://scholar.google.com/scholar'
    params = {
        'q': query,
        'hl': 'en',
        'as_ylo': '2014',
        'as_yhi': '2024'
    }

    articles = []
    for start in range(0, num_articles, 10):
        params['start'] = start
        response = requests.get(base_url, params=params)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            results = soup.find_all('div', {'class': 'gs_ri'})
            for result in results:
                title = result.find('h3').text
                # The 'gs_a' line holds authors, venue, and year in a single string,
                # so it is looked up once and reused for both fields.
                byline = result.find('div', {'class': 'gs_a'})
                authors = byline.text if byline else None
                venue_year = authors
                abstract = result.find('div', {'class': 'gs_rs'})

                if abstract:
                    abstract = abstract.text
                else:
                    abstract = None

                articles.append({
                    'title': title,
                    'authors': authors,
                    'venue_year': venue_year,
                    'abstract': abstract
                })

    return articles

query = "machine learning models to predict a persons emotional state "
num_articles = 10
articles = google_scholar_scraper(query, num_articles)

if articles:
    # Loop over the articles actually returned, which may be fewer than requested.
    for i, article in enumerate(articles[:num_articles]):
        print(f"article {i+1}:\n")
        print(f"title: {article['title']}")
        print(f"Authors: {article['authors']}")
        print(f"Venue/Year: {article['venue_year']}")
        print(f"Abstract: {article['abstract']}")
        print("\n" + "="*50 + "\n")
else:
    print("No articles found for the given query.")

article 1:

title: Emotional state classification from EEG data using machine learning approach
Authors: XW Wang, D Nie, BL Lu - Neurocomputing, 2014 - Elsevier
Venue/Year: XW Wang, D Nie, BL Lu - Neurocomputing, 2014 - Elsevier
Abstract: … of dry electrode techniques, machine learning algorithms, and various … The ability to recognize 
the emotional states of people … studies can only predict the labels of emotion samples, but …


article 2:

title: Predicting anxiety, depression and stress in modern life using machine learning algorithms
Authors: A Priya, S Garg, NP Tigga - Procedia Computer Science, 2020 - Elsevier
Venue/Year: A Priya, S Garg, NP Tigga - Procedia Computer Science, 2020 - Elsevier
Abstract: … In the fast-paced modern world, psychological health issues … they are particularly suited to 
predicting psychological problems. After … This predicts the percentage of people suffering from …


article 3:

title: Emotion recognition using multi-modal data and machine learning 
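
In the output above, the authors and the venue/year fields contain the same combined string. A small helper could split that string into separate fields. The sketch below assumes the layout `authors - venue, year - publisher` seen in the two complete results above; Google Scholar does not guarantee this layout and often truncates the line, so the function name `parse_gs_a` and the parsing rules are illustrative assumptions, and real output will need more defensive handling.

```python
import re


def parse_gs_a(gs_a_text):
    """Split a Scholar byline into authors, venue, and year (best effort)."""
    # Normalize non-breaking spaces before splitting on the ' - ' separators.
    text = gs_a_text.replace('\xa0', ' ')
    parts = [part.strip() for part in text.split(' - ')]
    authors = parts[0] or None
    venue_year = parts[1] if len(parts) > 1 else ''
    # Take the first four-digit year starting with 19 or 20, if any.
    year_match = re.search(r'\b(?:19|20)\d{2}\b', venue_year)
    year = int(year_match.group()) if year_match else None
    # Whatever remains after removing the year is treated as the venue.
    venue = re.sub(r',?\s*\b(?:19|20)\d{2}\b', '', venue_year).strip(' ,') or None
    return {'authors': authors, 'venue': venue, 'year': year}


print(parse_gs_a("XW Wang, D Nie, BL Lu - Neurocomputing, 2014 - Elsevier"))
# {'authors': 'XW Wang, D Nie, BL Lu', 'venue': 'Neurocomputing', 'year': 2014}
```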

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
!pip install tweepy
import tweepy

def get_twitter_data(keyword, count=10):
    # Placeholder credentials: replace with the keys from your own developer account.
    consumer_key = 'YOUR_CONSUMER_KEY'
    consumer_secret = 'YOUR_CONSUMER_SECRET'
    access_token = 'YOUR_ACCESS_TOKEN'
    access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

    auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
    api = tweepy.API(auth)

    tweets = tweepy.Cursor(api.search_tweets, q=keyword, lang="en").items(count)
    twitter_data = []

    for tweet in tweets:
        twitter_data.append({
            'username': tweet.user.screen_name,
            'followers_count': tweet.user.followers_count,
            'tweet_text': tweet.text,
            'retweet_count': tweet.retweet_count,
            'favorite_count': tweet.favorite_count,
            'created_at': tweet.created_at
        })
    return twitter_data

def collect_social_media_data(keyword_or_hashtag):
    # Only the Twitter collector is implemented in this notebook. Reddit and
    # Facebook/Instagram collectors (get_reddit_data, get_facebook_instagram_data)
    # are not defined here and would need their own implementations.
    twitter_data = get_twitter_data(keyword=keyword_or_hashtag, count=5)

    print("Twitter Data:", twitter_data)
    return twitter_data

if __name__ == "__main__":
    collect_social_media_data("AI")
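
To meet the requirement of more than four columns, the list of dictionaries returned by `get_twitter_data` could be converted to a table and saved. The sketch below is a minimal illustration with pandas. The helper name `records_to_csv`, the file name `twitter_data.csv`, and the placeholder record are assumptions; the record only mirrors the keys used above and is not real collected data.

```python
import pandas as pd


def records_to_csv(records, path, min_columns=5):
    """Convert a list of dicts to a DataFrame, check its width, and save it."""
    df = pd.DataFrame(records)
    # The assignment asks for more than four columns, i.e. at least five.
    if df.shape[1] < min_columns:
        raise ValueError(
            f"Expected at least {min_columns} columns, got {df.shape[1]}"
        )
    df.to_csv(path, index=False)
    return df


# Placeholder record with the same keys as the tweet dictionaries above.
placeholder_records = [{
    'username': 'example_user',
    'followers_count': 0,
    'tweet_text': 'placeholder text',
    'retweet_count': 0,
    'favorite_count': 0,
    'created_at': '2024-09-15 22:31:44',
}]

tweets_df = records_to_csv(placeholder_records, 'twitter_data.csv')
print(tweets_df.columns.tolist())
```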


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''