# Overview

The goal of this notebook is to see whether ScrapeGraphAI can extract coupons from a phone screen view.

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET
from scrapegraphai.graphs import XMLScraperGraph, CSVScraperGraph

# Loading the content generic CSV

In [2]:
csv_path = 'content_generic_penny_2025_03_13.csv'
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,"CAST(id, 'String')",id,user_id,time,i,language,application_name,package_name,class_name,context,...,view_depth,view_class_name,text,description,seen_timestamp,is_visible,x_1,y_1,x_2,y_2
0,1861746032578134018,1861746032578134018,167216,2024-11-27 13:16:55.756000,1,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,...,0,de.penny.app.main.view.MainActivity,,,0,False,0,0,0,0
1,1861746032578134018,1861746032578134018,167216,2024-11-27 13:16:55.756000,2,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,...,2,android.widget.FrameLayout,,,1732709815209,True,0,0,1080,2312
2,1861746032578134018,1861746032578134018,167216,2024-11-27 13:16:55.756000,3,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,...,8,android.view.View,,Angebote,1732709815209,True,107,2041,187,2121
3,1861746032578134018,1861746032578134018,167216,2024-11-27 13:16:55.756000,4,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,...,8,android.widget.TextView,Angebote,,1732709815209,True,75,2123,220,2164
4,1861746032578134018,1861746032578134018,167216,2024-11-27 13:16:55.756000,5,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,...,8,android.view.View,,Vorteile,1732709815209,True,309,2045,381,2117


# Function to run ScrapeGraphAI

The LLM used for scraping is Llama 3.2 with 1 billion parameters, served locally through Ollama.

In [3]:
def run_scrape_graph_ai(input_str, scraper_type, prompt):
    graph_config = {
        'llm': {
            'model': 'ollama/llama3.2:1b',
            'temperature': 0.0,
            'format': 'json',
            'model_tokens': 2048,
            'base_url': 'http://localhost:11434',
        }
    }

    if scraper_type == 'xml':
        scraper_graph = XMLScraperGraph(
            prompt=prompt,
            source=input_str,
            config=graph_config,
        )
    elif scraper_type == 'csv':
        scraper_graph = CSVScraperGraph(
            prompt=prompt,
            source=input_str,
            config=graph_config,
        )
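    else:
        # Fail loudly instead of silently returning None for an unknown type.
        raise ValueError(f'Unsupported scraper_type: {scraper_type!r}')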

    return scraper_graph.run()
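
The configuration above sets `model_tokens` to 2048. As a rough aid for gauging how large an input string is relative to that value, here is a minimal sketch. The helper names `estimate_tokens` and `fits_budget` are hypothetical, and the four-characters-per-token ratio is an assumed heuristic, not the tokenizer that Llama 3.2 actually uses, so the numbers are only indicative.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; chars_per_token=4 is an assumed heuristic."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_budget(text, prompt, model_tokens=2048):
    """Return True if prompt plus input are estimated to fit within model_tokens."""
    return estimate_tokens(prompt) + estimate_tokens(text) <= model_tokens


# Hypothetical row shaped like a line of the trimmed CSV, repeated 1000 times.
sample_row = '1,PENNY,,8,Angebote,,1732709815209\n'
print(estimate_tokens(sample_row * 1000))
print(fits_budget(sample_row * 1000, 'Extract all coupons.'))
```

Calling `estimate_tokens` on the string passed as `input_str` gives a quick sense of how far the serialized screen views exceed the configured value.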

# CSVScraperGraph test

In this section I try to extract coupons using the CSVScraperGraph. The prompt is modeled on the prompts from the ScrapeGraphAI documentation.

In [4]:
prompt = 'A coupon consists of a product name, a description text, a discount text and an activation text. Extract all coupons from the given phone screen views.'
csv_string = df.to_csv(index=False)
run_scrape_graph_ai(csv_string, 'csv', prompt)

{'timestamp': 1643723405,
 'data': [{'id': 1, 'name': 'Abonnement', 'type': 'Premium', 'price': 9.99},
  {'id': 2, 'name': 'Abonnement', 'type': 'Basic', 'price': 4.99}]}
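
To check a result like this programmatically rather than by eye, a small helper can search the returned structure for entries that carry all four coupon fields named in the prompt. This is a minimal sketch: the key names in `COUPON_FIELDS` are hypothetical, since the prompt does not fix the keys the model must return, and `find_coupons` is not part of ScrapeGraphAI.

```python
# Hypothetical key names; the prompt does not prescribe the output keys.
COUPON_FIELDS = ('product_name', 'description', 'discount', 'activation')


def find_coupons(result):
    """Collect dicts anywhere in a nested result that carry all coupon fields."""
    found = []
    if isinstance(result, dict):
        if all(field in result for field in COUPON_FIELDS):
            found.append(result)
        for value in result.values():
            found.extend(find_coupons(value))
    elif isinstance(result, list):
        for item in result:
            found.extend(find_coupons(item))
    return found


# The CSVScraperGraph output shown above.
csv_result = {
    'timestamp': 1643723405,
    'data': [
        {'id': 1, 'name': 'Abonnement', 'type': 'Premium', 'price': 9.99},
        {'id': 2, 'name': 'Abonnement', 'type': 'Basic', 'price': 4.99},
    ],
}
print(find_coupons(csv_result))  # prints []
```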

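The CSVScraperGraph did not extract any coupons from the CSV: the returned entries are subscription-style records ('Abonnement' with a type and a price) and contain none of the four coupon fields named in the prompt. Let's try keeping only the most relevant columns.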

In [5]:
df_with_relevant_cols = df[["application_name", "context", "view_depth", "text", "description", "seen_timestamp"]]
df_with_relevant_cols.head()

Unnamed: 0,application_name,context,view_depth,text,description,seen_timestamp
0,PENNY,,0,,,0
1,PENNY,,2,,,1732709815209
2,PENNY,,8,,Angebote,1732709815209
3,PENNY,,8,Angebote,,1732709815209
4,PENNY,,8,,Vorteile,1732709815209


In [6]:
csv_string_2 = df_with_relevant_cols.to_csv(index=False)
run_scrape_graph_ai(csv_string_2, 'csv', prompt)

{}

No coupons were found.

# XMLScraperGraph test

In this section I try to extract coupons using the XMLScraperGraph. The `view_depth` column in the content generic CSV makes it possible to reconstruct the nested XML structure of the phone screen views. I group the rows into views by the `seen_timestamp` column, starting a new `view` element whenever the timestamp changes between consecutive rows, and include data from the `text` column only.

In [7]:
def prepare_content_generic(content_generic_df):
    df = content_generic_df.copy()
    df = df[df['text'].notna()]
    df = df[df['seen_timestamp'] != 0]
    return df

def content_generic_2_xml(content_generic_df):
    df = prepare_content_generic(content_generic_df)
    xml = ET.Element('root')
    
    if df.empty:
        return xml

    timestamp = df['seen_timestamp'].iloc[0]
    timestamp_element = ET.SubElement(xml, 'view')
    element_stack = [(-1, timestamp_element)]

    for _, row in df.iterrows():
        if row['seen_timestamp'] != timestamp:
            timestamp = row['seen_timestamp']
            timestamp_element = ET.SubElement(xml, 'view')
            element_stack = [(-1, timestamp_element)]

        while row['view_depth'] <= element_stack[-1][0]:
            element_stack.pop()

        text_element = ET.SubElement(element_stack[-1][1], 'text')
        text_element.text = str(row['text'])

        element_stack.append((row['view_depth'], text_element))

    return xml

In [12]:
xml = content_generic_2_xml(df)
xml_string = ET.tostring(xml, encoding='utf-8').decode('utf-8')
print(f'{xml_string[:135]}...')

<root><view><text>Angebote</text><text>Vorteile</text><text>Vorteilscode</text><text>Einkaufsliste</text><text>Mein PENNY</text></view>...
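
The labels in this first view all end up as siblings. To illustrate how the depth stack produces nesting when depths differ, here is a minimal standard-library sketch of the same technique. The function `rows_to_xml` and the rows in `toy_rows` are hypothetical and not taken from the PENNY data.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Build a nested XML tree from (view_depth, text) pairs in traversal order."""
    root = ET.Element('view')
    # Stack of (depth, element); the sentinel depth -1 keeps the root at the bottom.
    stack = [(-1, root)]
    for depth, text in rows:
        # Pop until the top of the stack is strictly shallower than this row.
        while depth <= stack[-1][0]:
            stack.pop()
        element = ET.SubElement(stack[-1][1], 'text')
        element.text = text
        stack.append((depth, element))
    return root


# Hypothetical rows: a coupon title with two nested labels, then a sibling.
toy_rows = [(8, 'Coupons'), (9, '20% off'), (10, 'Activate'), (8, 'Offers')]
toy_xml = ET.tostring(rows_to_xml(toy_rows), encoding='unicode')
print(toy_xml)
# <view><text>Coupons<text>20% off<text>Activate</text></text></text><text>Offers</text></view>
```

A row becomes a child of the most recent row with a strictly smaller depth, so rows at equal depth are emitted as siblings.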


In [13]:
run_scrape_graph_ai(xml_string, 'xml', prompt)

{}

No coupons were found.

# Conclusions

ScrapeGraphAI does not seem to be the right tool for this task. It appears to be designed mainly for web pages, and with this setup neither the CSVScraperGraph nor the XMLScraperGraph extracted any coupons from the phone screen views.