Thoughts on evaluating the search engine results:

* I think we need some data to train our algorithm, especially to identify the keywords and features that distinguish relevant from irrelevant articles. We can easily collect this from news searches (or Sogol and Sana may be able to provide some).

### Variables (to determine the best/worst search engine):
1. The number of relevant articles among the top 20 results
2. How much specific information the relevant articles contain (general info only, or specific details as well)
3. Dates of the articles (how old they are)
4. 
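
As a rough sketch of how variables 1 and 3 could be computed for one search engine: the function below assumes each result has already been hand-labelled as relevant or not and has a known publication date. The dictionary keys (`relevant`, `published`) and the example dates are made up for illustration.

```python
from datetime import date


def summarize_results(results, today, k=20):
    """Summarize the top-k results of one search engine.

    `results` is a hypothetical list of dicts with the keys
    'relevant' (bool, hand-labelled) and 'published' (datetime.date).
    """
    top = results[:k]
    relevant = [r for r in top if r["relevant"]]
    # Age in days of each relevant article
    ages = [(today - r["published"]).days for r in relevant]
    return {
        "relevant_count": len(relevant),
        "relevant_fraction": len(relevant) / len(top) if top else 0.0,
        "mean_age_days": sum(ages) / len(ages) if ages else None,
    }


# Made-up example data
example = [
    {"relevant": True, "published": date(2024, 4, 19)},
    {"relevant": False, "published": date(2024, 4, 1)},
    {"relevant": True, "published": date(2024, 4, 9)},
]
summary = summarize_results(example, today=date(2024, 4, 29))
print(summary)
```

Comparing these summaries across engines for the same queries would give a first, crude ranking.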


### Techniques for preliminary screening (removing irrelevant articles):
1. keyword evaluation
    * Frequency of the keywords OR how many keywords included (Clotilde's function)
    * Naive Bayes model (similar to spam email detection, we can detect irrelevant articles) - but this needs the keyword statistics (probability) information / training data
    
2. Clustering:
    * Cluster the articles by similarity and filter out the irrelevant ones (this also needs some training data)
    * Non-numerical data needs to be carefully pre-processed to obtain the clusters we want. For example, the words usually have to be mapped to numerical values (maybe there is a Python library that does this?)
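
To make the Naive Bayes idea a bit more concrete, here is a minimal sketch using only the standard library. It assumes we have a small set of hand-labelled texts; the four training sentences below are invented for illustration and are not real data. It uses word counts with add-one (Laplace) smoothing, which is one common variant, not necessarily the one we would end up using.

```python
import math
from collections import Counter


def train_naive_bayes(labelled_texts):
    """Count words per class from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words
    doc_counts = Counter()  # label -> number of documents
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior of the class
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Add-one smoothing so unseen words do not zero out the class
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples, for illustration only
training = [
    ("illegal fishing vessel seized by coast guard", "relevant"),
    ("captain fined for illegal crab fishing", "relevant"),
    ("best fish recipes for summer dinner", "irrelevant"),
    ("aquarium opens new tropical fish exhibit", "irrelevant"),
]
model = train_naive_bayes(training)
print(classify("vessel captain charged with illegal fishing", model))
print(classify("tropical fish dinner recipes", model))
```

With real training data, a library implementation would probably be a better choice than this hand-rolled version.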


### Next thing to think about / questions:
1. How to screen 'biased' information
    * Penalize certain keywords, such as politics (other kinds of bias need to be considered as well)
2. Specific information (vessel, captain's name, where it happened, what happened)
    * How to identify them in the articles
    * How to measure/compare the amount of information in an article numerically
3. About regional biases: are we actually focusing on Vancouver / northwestern North America, or anywhere in the world?
4. 
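
One crude way to put a number on "amount of specific information" (question 2) is to count surface signals such as numbers, month names, and multi-word capitalized names. The patterns below are assumptions chosen for illustration; they will miss a lot and produce false positives, and a proper named-entity recognizer would likely do better.

```python
import re

# Hypothetical signal patterns, chosen only for illustration
PATTERNS = {
    "numbers": re.compile(r"\b\d[\d,.]*\b"),
    "months": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\b"
    ),
    # Two or more consecutive capitalized words, e.g. a place or person name
    "proper_names": re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b"),
}


def specificity_counts(text):
    """Count each signal in the text and add a 'total' entry."""
    counts = {name: len(p.findall(text)) for name, p in PATTERNS.items()}
    counts["total"] = sum(counts.values())
    return counts


specific = "Upper Ottawa Valley OPP arrested a man on April 19 for theft under $5000."
vague = "police say someone took some fish from a place."

specific_counts = specificity_counts(specific)
vague_counts = specificity_counts(vague)
print(specific_counts)
print(vague_counts)
```

Dividing the total by the article length would make long and short articles comparable.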

### Scraping content from the given url

In [93]:
import requests
from bs4 import BeautifulSoup

def scrape_content(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the webpage
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract the desired information
        # Example: Extracting all paragraphs from the webpage
        paragraphs = soup.find_all('p')
        
        # Build the content string from the first three paragraphs only.
        # For the full content, loop over paragraphs instead of paragraphs[:3].
        content = ""
        for paragraph in paragraphs[:3]:
            content += " " + paragraph.text
        return content
    else:
        print("Failed to retrieve content. Status code:", response.status_code)
        return None

###  Titles and links of the top 20 search results from Google News

In [90]:
def scrape_google_news(query):
    # Construct the Google News URL with the query
    url = f"https://news.google.com/search?q={query}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the search result elements
    search_results = soup.find_all('div', class_='IL9Cne')

    # Extract the title and link of each search result
    results = []
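    for result in search_results[:2]:  # Limited to 2 results for testing; use [:20] for the top 20
        title_tag = result.find('a', class_='JtKRv')
        link_tag = result.find('a')
        if title_tag is None or link_tag is None:
            continue  # Skip results that do not match the expected markup
        results.append({'title': title_tag.text, 'link': link_tag['href']})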

    return results
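
A small sketch of a safer way to build the search URL and to resolve the links that come back. The function above drops the raw query into an f-string; `urllib.parse.urlencode` takes care of spaces and special characters instead. `build_search_url` is a hypothetical helper name, and `./articles/abc123` is a made-up placeholder link.

```python
from urllib.parse import urlencode, urljoin


def build_search_url(base_url, param, query):
    """Return base_url with the query encoded under the given parameter name."""
    return f"{base_url}?{urlencode({param: query})}"


google_url = build_search_url("https://news.google.com/search", "q", "fish crime")
yahoo_url = build_search_url("https://news.search.yahoo.com/search", "p", "fish crime")
print(google_url)
print(yahoo_url)

# Resolve a relative result link against the page it came from.
# "./articles/abc123" is a placeholder, not a real article id.
article_url = urljoin(google_url, "./articles/abc123")
print(article_url)
```

`urljoin` would replace the manual `"https://news.google.com" + link[1:]` concatenation and also copes with links that are already absolute.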

### Getting the queries from the excel file

In [91]:
import pandas as pd

# read the excel file
excel_data = pd.read_excel('PIMS Sample Prompts.xlsx')

queries = []
for index, row in excel_data.iterrows():
    # Process each row
    queries.append(row['Prompt'])

In [92]:
### Test with different queries
query = input("Enter your search query: ")

# When we want to use the queries from the excel file:
#for query in queries:
#    top_results = scrape_google_news(query)
#    for index, result in enumerate(top_results, start=1):
#        print(f"{index}. {result['title']}")
#        link = "https://news.google.com" + result['link'][1:]
#        print(link)
#        print()
#        print(scrape_content(link))


top_results = scrape_google_news(query)
for index, result in enumerate(top_results, start=1):
    print(f"{index}. {result['title']}")
    link = "https://news.google.com" + result['link'][1:]
    print(link)
    print("content: ")
    print(scrape_content(link))
    print()    

Enter your search query: fish crime
1. UPDATE: Arrest made in theft of tropical fish and cash
https://news.google.com/articles/CBMiWGh0dHBzOi8vd3d3Lm15YmFuY3JvZnRub3cuY29tLzYwODI5L25ld3MvYXJyZXN0LW1hZGUtaW4tdGhlZnQtb2YtdHJvcGljYWwtZmlzaC1hbmQtY2FzaC_SAQA?hl=en-CA&gl=CA&ceid=CA%3Aen
content: 
 Upper Ottawa Valley OPP say that with public support and the help of the Community Street Crime Unit, they apprehended and arrested a male from Whitewater region on April 19th.   Police say they located the tropical fish which were then returned to their owner.  A Beachburg resident is facing Criminal Code (CC) charges, including theft under $5000 and possession of property obtained by crime.

2. Stolen tropical fish returned to Ottawa Valley restaurant
https://news.google.com/articles/CBMiTWh0dHBzOi8vY2EubmV3cy55YWhvby5jb20vc3RvbGVuLXRyb3BpY2FsLWZpc2gtcmV0dXJuZWQtb3R0YXdhLTIwNDY1NjUzNC5odG1s0gEA?hl=en-CA&gl=CA&ceid=CA%3Aen
content: 
 Ontario Provincial Police (OPP) have found and returned tropica

### Scraping titles and links of the top 20 search results from Yahoo News
This is very similar to the Google one. I think we can easily produce similar functions for other search engines!

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_yahoo_news(query):
    # Construct the Yahoo News URL with the query
    url = f"https://news.search.yahoo.com/search?p={query}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the search result elements
    search_results = soup.find_all('div', class_='NewsArticle')

    # Extract the title and link of each search result
    results = []
    for result in search_results[:20]:  # Scraping top 20 results
        title = result.find('h4').text
        link = result.find('a')['href']
        results.append({'title': title, 'link': link})

    return results

# Example usage
query = input("Enter your search query: ")
top_results = scrape_yahoo_news(query)
for index, result in enumerate(top_results, start=1):
    print(f"{index}. {result['title']}")
    print(result['link'])
    print()


### Evaluation functions

In [None]:
##Clotilde's function for relevance evaluation
# This function searches for the number of keywords in the given text (var: results)

def evaluate_relevance(results, keywords):
    relevance_scores = []
    for result in results:
        title = result["title"].lower()  # Lowercase the title for a case-insensitive comparison
        score = sum(1 for word in keywords if word in title)  # Count how many keywords appear in the title
        relevance_scores.append(score)
    return relevance_scores
####
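
One thing to watch with `word in title`: it is a substring test, so the keyword "sea" also matches "research" or "searches". Below is a sketch of a whole-word variant that returns both numbers mentioned in the notes (how many distinct keywords appear, and how often in total). `keyword_scores` is a hypothetical name and the example titles are invented.

```python
import re


def keyword_scores(text, keywords):
    """Return (distinct keywords present, total keyword occurrences).

    Matching is case-insensitive and on whole words only.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    wanted = {k.lower() for k in keywords}
    hits = [w for w in words if w in wanted]
    return len(set(hits)), len(hits)


keywords = ["ocean", "sea", "crime"]

# "sea" is only a substring here, so nothing should be counted
no_match = keyword_scores("Research team searches for lost hikers", keywords)

# "sea" appears twice as a whole word, "crime" once
match = keyword_scores("Crime at sea: patrol boats return from the sea", keywords)

print(no_match)
print(match)
```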

### From GPT

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_search_results(query):
    url = f"https://www.google.com/search?q={query}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }  # User-Agent header to mimic a browser
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        search_results = []
        for result in soup.find_all('div', class_='tF2Cxc'):
            title = result.find('h3').text
            link = result.find('a')['href']
            search_results.append({'title': title, 'link': link})
        return search_results
    else:
        print("Failed to fetch search results.")
        return None

##Clotilde's function for relevance evaluation
def evaluate_relevance(results, keywords):
    relevance_scores = []
    for result in results:
        title = result["title"].lower()  # Lowercase the title for a case-insensitive comparison
        score = sum(1 for word in keywords if word in title)  # Count how many keywords appear in the title
        relevance_scores.append(score)
    return relevance_scores
####

# Example usage:
query = input("Enter your search query: ")
results = scrape_search_results(query)

if results:
    for i, result in enumerate(results, start=1):
        print(f"{i}. {result['title']}")
        print(f"   Link: {result['link']}")
        print()

    # Score the titles only when the search succeeded (results is None on failure)
    print(evaluate_relevance(results, ['ocean', 'sea', 'crime']))

### Modified (in progress)

In [10]:
import re  # needed by clean_text below
import requests
from bs4 import BeautifulSoup


def clean_text(text):
    # Text to lowercase
    text = text.lower()
    # Remove special characters using regular expression
    cleaned_text = re.sub(r'[^-a-zA-Z0-9\s]', '', text)
    return cleaned_text

def text_to_word(soup):
    # Return the cleaned words of the first 'entry-content' block, or None if there is none
    content = soup.find_all("div", class_='entry-content')
    if not content:  # find_all returns an empty list when nothing matches
        return None
    text = content[0].get_text(separator='\n')
    return clean_text(text).split()

def scrape_search_results(query):
    url = f"https://news.google.com/search?q={query}&hl=en-CA&gl=CA&ceid=CA%3Aen"
    
#    url = f"https://www.google.com/search?q={query}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }  # User-Agent header to mimic a browser
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        search_results = []
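        # NOTE: 'tF2Cxc' is the class used for the Google web search cell above; the earlier
        # Google News cell matched 'IL9Cne', so this selector may need updating for this URL.
        for result in soup.find_all('div', class_='tF2Cxc'):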
            title = result.find('h3').text
            link = result.find('a')['href']
            search_results.append({'title': title, 'link': link})
        return search_results
    else:
        print("Failed to fetch search results.")
        return None

    
def evaluate_relevance(results, keywords):
    relevance_scores = []
    for result in results:
        title = result["title"].lower()  # Lowercase the title for a case-insensitive comparison
        score = sum(1 for word in keywords if word in title)  # Count how many keywords appear in the title
        relevance_scores.append(score)
    return relevance_scores

    
    
def scrape_content(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Return the cleaned list of words from the article body (None if no body is found)
        return text_to_word(soup)
    else:
        print(f"Failed to fetch content from {url}.")
        return None
    

# Example usage:
query = input("Enter your search query: ")
results = scrape_search_results(query)

if results:
    for i, result in enumerate(results[:5], start=1):
        print(f"{i}. {result['title']}")
        print(f"   Link: {result['link']}")
        print("   Content:")
        content = scrape_content(result['link'])
        if content:
            print(content[:500])  # content is a list of words, so this prints the first 500 words
        print()


Enter your search query: vessel fish crime


In [None]:
import requests
from bs4 import BeautifulSoup
import re

def clean_text(text):
    # Text to lowercase
    text = text.lower()
    # Remove special characters using regular expression
    cleaned_text = re.sub(r'[^-a-zA-Z0-9\s]', '', text)
    return cleaned_text



text = 'Hi my name is: sumin", this-is to test text cleaning!'
clean_text(text)
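
Once the text is cleaned, the clustering idea from the notes needs the words mapped to numbers. A minimal sketch with the standard library: turn each text into a word-count vector and compare texts by cosine similarity. The three example sentences are invented. As far as I know, scikit-learn's `CountVectorizer` and `TfidfVectorizer` do this mapping for us, so this is only meant to show the idea.

```python
import math
from collections import Counter


def word_vector(text):
    """Map a text to a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example texts
docs = [
    "stolen tropical fish returned to restaurant",
    "arrest made in theft of tropical fish",
    "new recipe for summer pasta salad",
]
vectors = [word_vector(d) for d in docs]

sim_fish = cosine_similarity(vectors[0], vectors[1])   # two fish-theft stories
sim_other = cosine_similarity(vectors[0], vectors[2])  # unrelated story
print(sim_fish, sim_other)
```

Articles that are similar to known relevant ones would end up in the same cluster; the unrelated one shares no words and scores zero here.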

In [32]:
import requests
from bs4 import BeautifulSoup

def scrape_yahoo_news(query):
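    # NOTE: this is the Google News URL, while the selectors below ('NewsArticle', 'h4')
    # are the Yahoo News ones, which likely explains why no results are found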
    url = f"https://news.google.com/search?q={query}"
    print(url)
    # Send a GET request to the URL
    response = requests.get(url)
    print(response)
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the search result elements
    search_results = soup.find_all('div', class_='NewsArticle')

    # Extract the title and link of each search result
    results = []
    for result in search_results[:10]:  # Scraping top 10 results
        title = result.find('h4').text
        link = result.find('a')['href']
        results.append({'title': title, 'link': link})

    return results

# Example usage
query = input("Enter your search query: ")
top_results = scrape_yahoo_news(query)
for index, result in enumerate(top_results, start=1):
    print(f"{index}. {result['title']}")
    print(result['link'])
    print()


Enter your search query: fish
https://news.google.com/search?q=fish
<Response [200]>
[]
