# Detection of Events in Qatar

This project builds an automated tool that scrapes event content from local websites (for example Visit Qatar, ILoveQatar, Qatar Calendar, the QF website, and the QU website) and prepares it for insertion into a database that our mobile app can eventually use. We plan to extract the event details with Named Entity Recognition (NER) and/or a Large Language Model (LLM).

### Customized Plan for Event Content Scraper

The plan below is a customized solution for scraping event content from local websites in Qatar.

#### 1. Identify Target Websites

- Visit Qatar
- ILoveQatar
- Qatar Calendar
- QF Website
- QU Website

#### 2. Web Scraping

**Tools:**
- **Scrapy**: For comprehensive and scalable web scraping.
- **Selenium**: For dynamic content and JavaScript-heavy websites.

**Implementation:**
- **Scrapy**:
  - Create individual spiders for each website.
  - Handle pagination, login, and session management if necessary.


In [None]:
import scrapy

In [None]:
class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = ['https://example.com/events']

    def parse(self, response):
        for event in response.css('div.event'):
            yield {
                'name': event.css('h2::text').get(),
                'date': event.css('span.date::text').get(),
                'location': event.css('span.location::text').get(),
                'description': event.css('p.description::text').get(),
            }
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

- **Selenium**:
  - Use for websites requiring JavaScript execution.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By


In [None]:
driver = webdriver.Chrome()
try:
    driver.get('https://example.com/events')

    # Selenium 4 removed the find_element_by_* helpers; use find_element(By.<strategy>, value).
    for event in driver.find_elements(By.CLASS_NAME, 'event'):
        name = event.find_element(By.TAG_NAME, 'h2').text
        date = event.find_element(By.CLASS_NAME, 'date').text
        location = event.find_element(By.CLASS_NAME, 'location').text
        description = event.find_element(By.CLASS_NAME, 'description').text
        print(name, date, location, description)
finally:
    # Close the browser even if one of the lookups fails.
    driver.quit()
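
Whichever tool produces the raw fields, they will probably need cleaning before they fit the database columns. The sketch below is one possible normalization step using only the standard library. The helper names `parse_date` and `clean_event` and the list of date formats are assumptions for illustration; the real sites may format dates differently.

```python
from datetime import date, datetime
from typing import Optional

# Assumed date formats; extend this tuple once the real site formats are known.
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d %b %Y", "%B %d, %Y")


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each assumed format in turn; return None if none of them match."""
    if not raw:
        return None
    text = " ".join(raw.split())  # Collapse stray whitespace and newlines.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean_event(item: dict) -> dict:
    """Collapse whitespace in the text fields and convert the date to a date object."""
    cleaned = {}
    for key in ("name", "location", "description"):
        value = item.get(key)
        cleaned[key] = " ".join(value.split()) if value else None
    cleaned["date"] = parse_date(item.get("date"))
    return cleaned
```

Returning `None` for an unparseable date keeps the record available for later review instead of dropping it silently.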

#### 3. Data Storage

**Database:**
- **PostgreSQL**: For structured data and complex queries.


**Schema Design:**
```sql
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    date DATE,
    location VARCHAR(255),
    description TEXT
);
```
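
Because the scraper runs repeatedly, the same event is likely to be seen more than once. The sketch below shows one way to skip duplicates on insert. It uses the standard library's `sqlite3` as a stand-in so it runs without a database server; it is not the PostgreSQL setup itself. The `UNIQUE (name, date, location)` constraint and the `save_events` helper are assumptions added for illustration. With PostgreSQL, a similar effect can be obtained with `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

# SQLite stand-in for the PostgreSQL table; the UNIQUE constraint is an added assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    date TEXT,
    location TEXT,
    description TEXT,
    UNIQUE (name, date, location)
)
"""


def save_events(conn: sqlite3.Connection, events: list) -> int:
    """Insert events, skipping rows that already exist; return how many rows were added."""
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    # Named placeholders keep the values out of the SQL string itself.
    conn.executemany(
        "INSERT OR IGNORE INTO events (name, date, location, description) "
        "VALUES (:name, :date, :location, :description)",
        events,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return after - before
```

Note that rows with a missing name, date, or location are not caught by the constraint, since SQL treats `NULL` values as distinct from each other.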

#### 4. NER and LLM

**NER Tools:**
- **spaCy**: Efficient for entity recognition.


In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
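
The `(text, label)` pairs returned above still have to be mapped onto event fields. The sketch below shows one possible mapping. `EVENT`, `DATE`, `FAC`, `GPE`, and `LOC` are labels used by spaCy's English models, but the choice of which label feeds which field, and the helper name `entities_to_fields`, are assumptions made for illustration.

```python
# Which spaCy label feeds which event field is an assumption made for this sketch.
LABEL_TO_FIELD = {
    "EVENT": "name",
    "DATE": "date",
    "FAC": "location",
    "GPE": "location",
    "LOC": "location",
}


def entities_to_fields(entities):
    """Group (text, label) pairs by event field, in first-seen order without duplicates."""
    fields = {}
    for text, label in entities:
        field = LABEL_TO_FIELD.get(label)
        if field is None:
            continue  # Ignore labels with no matching field, such as PERSON or MONEY.
        values = fields.setdefault(field, [])
        if text not in values:
            values.append(text)
    return fields
```

Each field holds a list because a description can mention several dates or places; choosing the right one is left to a later step.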

**LLM for Enhanced Extraction:**
- **Hugging Face Transformers**: Use a pre-trained transformer such as BERT fine-tuned for NER, which may extract entities more accurately than the small spaCy model.


In [None]:
from transformers import pipeline


In [None]:
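# Use a separate name so the spaCy `nlp` object defined earlier is not overwritten.
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")  # merge word pieces into whole entities

def enhanced_extraction(text):
    return ner_pipeline(text)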
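
The pipeline returns a list of dictionaries, each with a confidence `score`. A small post-processing step can drop uncertain predictions before they reach the database. The sketch below is one way to do that; the helper name `filter_entities` and the `0.9` threshold are assumptions, and the threshold should be tuned on real event pages.

```python
def filter_entities(entities, min_score=0.9):
    """Keep confident predictions and return (word, label, score) tuples."""
    kept = []
    for ent in entities:
        # Grouped pipeline output uses "entity_group"; ungrouped output uses "entity".
        label = ent.get("entity_group") or ent.get("entity")
        if ent["score"] >= min_score:
            # float() turns NumPy scalars into plain Python floats.
            kept.append((ent["word"], label, float(ent["score"])))
    return kept
```

A typical call would be `filter_entities(enhanced_extraction(text))`.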

#### 5. Integration and API Development

**API Development:**
- **FastAPI**: A modern, high-performance web framework for building APIs with Python.

In [None]:
from fastapi import FastAPI
import psycopg2


In [None]:
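app = FastAPI()

@app.get("/events")
def get_events():
    # Placeholder credentials; load the real ones from configuration or environment variables.
    conn = psycopg2.connect("dbname=events user=youruser password=yourpassword")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name, date, location, description FROM events")
            columns = [col[0] for col in cur.description]
            # Return dicts so the JSON response carries field names, not bare tuples.
            return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        # Close the connection even if the query fails.
        conn.close()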

#### 6. Automation and Scheduling

**Task Scheduling:**
- **Apache Airflow**: For orchestrating complex workflows and ensuring tasks run on schedule.


**Airflow DAG:**

In [None]:
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path; python_operator is deprecated
from datetime import datetime, timedelta
import scrapy_script  # Placeholder for your own module that runs the Scrapy spider

In [None]:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
default_args

In [None]:
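dag = DAG(
    'event_scraping',
    default_args=default_args,
    description='Scraping event data',
    schedule_interval=timedelta(days=1),  # Airflow 2.4+ also accepts this as `schedule=`
    catchup=False,  # Do not backfill a run for every day since start_date
)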
dag

In [None]:
def run_scrapy():
    # Implementation still to be done
    scrapy_script.run()

run_scrapy_task = PythonOperator(
    task_id='run_scrapy',
    python_callable=run_scrapy,
    dag=dag,
)

### Summary of Tools and Technologies

1. **Web Scraping**: Scrapy, Selenium
2. **Database**: PostgreSQL
3. **NER**: spaCy
4. **LLM**: Hugging Face Transformers
5. **API Development**: FastAPI
6. **Task Scheduling**: Apache Airflow
