## DataStream API
The DataStream API is designed for low-level control of real-time stream processing. 
It allows developers to work with unbounded streams of data by providing event-by-event handling and stateful transformations.

#### Core Features:

1. Event-Driven Processing: Operates on a stream of individual events.
2. Custom Logic: Supports custom business logic with functions like ProcessFunction.
3. State Management: Explicit support for managing both keyed state and operator state.
4. Time Semantics: Offers event time, processing time, and watermarking for precise control over time-based operations.
5. Windowing: Aggregates data over time-based or count-based windows.
6. Transformations: Includes operations like map, filter, flatMap, keyBy, reduce, etc.

#### When to Use the DataStream API

- Complex Event Processing: When you need fine-grained control over events (e.g., detecting patterns like fraud or anomalies).
- Custom Logic: If your use case involves non-relational transformations (e.g., custom calculations or iterative processing).
- IoT Monitoring: Process sensor data where events are key-based and require specific windowing logic.

#### When to Use the Table API

- Relational Workloads: For tasks like aggregations, joins, and filtering similar to SQL.
- Unified Batch and Stream Processing: When processing both historical data and real-time events in the same application.
- Business Reporting: For real-time dashboards and analytics requiring declarative queries.


In [65]:
from pyflink.common import Types, WatermarkStrategy, Duration
from pyflink.common.serialization import Encoder
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import FlinkKafkaConsumer
from pyflink.datastream.connectors.file_system import FileSink, OutputFileConfig
from pyflink.datastream.formats.json import JsonRowDeserializationSchema
from pyflink.table import StreamTableEnvironment, EnvironmentSettings
from pyflink.table.expressions import col


In [23]:
env = StreamExecutionEnvironment.get_execution_environment()

In [24]:
ds = env.from_collection(
    collection=[(1, 'praveen'), (2, 'chinna'), (3, 'reddy')],
    type_info=Types.ROW([Types.INT(), Types.STRING()]))

In [25]:
env.add_jars('file:///Users/praveenreddy/FFlink/flink-sql-connector-kafka-1.17.0.jar')

In [26]:
deserialization_schema = JsonRowDeserializationSchema.builder()\
    .type_info(Types.ROW_NAMED(
        ["key", "data"],  # Field names must match JSON keys
        [Types.STRING(), Types.STRING()]  # Field types
    )).build()



kafka_consumer = FlinkKafkaConsumer(
    topics='test_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'praveen_group_id_2'})

ds = env.add_source(kafka_consumer)

In [30]:
with ds.execute_and_collect() as results:
    for result in results:
        print(result)

Row(key='0', data='Hello')
Row(key='1', data='Namaste')
Row(key='2', data='Good Day')
Row(key='3', data='Hi')
Row(key='4', data='Good Day')
Row(key='5', data='Good Day')
Row(key='6', data='Hello')
Row(key='7', data='Good Day')
Row(key='8', data='Hi')
Row(key='9', data='Namaste')
Row(key='0', data='Good Day')
Row(key='1', data='Hi')
Row(key='2', data='Hi')
Row(key='3', data='Namaste')
Row(key='4', data='Good Day')
Row(key='5', data='Hi')
Row(key='6', data='Namaste')
Row(key='7', data='Good Day')
Row(key='8', data='Good Day')
Row(key='9', data='Good Day')
Row(key='0', data='Hi')
Row(key='1', data='Namaste')
Row(key='2', data='Hi')
Row(key='3', data='Hi')
Row(key='4', data='Hello')
Row(key='5', data='Namaste')
Row(key='6', data='Namaste')
Row(key='7', data='Hello')
Row(key='8', data='Hi')
Row(key='9', data='Hello')


The Kafka source is unbounded, so execute_and_collect() keeps running until the cell is interrupted by hand. The run above was stopped manually, which ended it with:

KeyboardInterrupt: 

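### Writing to Kafka with FlinkKafkaProducer (same pattern, only the class names differ)


In [None]:
from pyflink.datastream.connectors.kafka import FlinkKafkaProducer
from pyflink.datastream.formats.json import JsonRowSerializationSchema

# Serialize each Row to JSON with the same field layout as the source
serialization_schema = JsonRowSerializationSchema.builder().with_type_info(
    Types.ROW_NAMED(["key", "data"], [Types.STRING(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)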

In [31]:
ds = ds.map(lambda row: f"{row.key},{row.data}", output_type=Types.STRING())

In [32]:
output_path = '/Users/praveenreddy/FFlink/Flink_Work'
file_sink = FileSink \
    .for_row_format(output_path, Encoder.simple_string_encoder()) \
    .with_output_file_config(OutputFileConfig.builder().with_part_prefix('pre').with_part_suffix('suf').build()) \
    .build()

ds.sink_to(file_sink)

env.execute("Kafka to FileSink")

KeyboardInterrupt: 

In [None]:


env = StreamExecutionEnvironment.get_execution_environment()
env.add_jars('file:///Users/praveenreddy/FFlink/flink-sql-connector-kafka-1.17.0.jar')
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=env_settings)

deserialization_schema = JsonRowDeserializationSchema.builder()\
    .type_info(Types.ROW_NAMED(
        ["key", "data"],  # Field names must match JSON keys
        [Types.STRING(), Types.STRING()]  # Field types
    )).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='test_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'praveen_group_id_3'})

ds = env.add_source(kafka_consumer)

# Convert DataStream to Table
table = table_env.from_data_stream(ds, col("key"), col("data"))

output_path = '/Users/praveenreddy/FFlink/Flink_Work/output_data'

# Filesystem sink table; the path comes from output_path
table_env.execute_sql(f"""
    CREATE TABLE file_sink (
        key STRING,
        data STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = '{output_path}',
        'format' = 'csv'
    )
""")

# execute_insert() submits the job itself, so no env.execute() call is needed.
# wait() blocks until the job ends (with an unbounded Kafka source: until it is cancelled).
table.execute_insert("file_sink").wait()




The error below was recorded from a run in which the sink table was declared with 'connector' = 'print'. The print connector does not accept 'format' or 'path', which is what the ValidationException reports (Java stack frames omitted):

Py4JJavaError: An error occurred while calling o2035.executeInsert.
: org.apache.flink.table.api.ValidationException: Unable to create a sink for writing table 'default_catalog.default_database.file_sink'.

Table options are:

'connector'='print'
'format'='csv'
'path'='output'

Caused by: org.apache.flink.table.api.ValidationException: Unsupported options found for 'print'.

Unsupported options:

format
path

Supported options:

connector
print-identifier
property-version
scan.watermark.alignment.group
scan.watermark.alignment.max-drift
scan.watermark.alignment.update-interval
scan.watermark.emit.strategy
scan.watermark.idle-timeout
sink.parallelism
standard-error


In [None]:
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Read the file line by line
file_path = "input.txt"
ds = env.read_text_file(file_path)

ds = ds.flat_map(lambda line: line.split(" "))

ds.print()

env.execute("Read DataStream from File")


Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/Users/praveenreddy/FFlink/Flink_Work/myenv/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 766, in run_closure
    _threading_Thread_run(self)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/praveenreddy/FFlink/Flink_Work/myenv/lib/python3.11/site-packages/apache_beam/runners/worker/data_plane.py", line 669, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/praveenreddy/FFlink/Flink_Work/myenv/lib/python3.11/site-packages/apache_beam/runners/worker/data_plane.py", line 652, in _read_inputs
    for elements in elements_iterator:
  File "/Users/praveenr

5> Hello
5> world
6> Apache
6> Flink
6> is
6> great
1> DataStream
1> API
1> is
1> powerful


<pyflink.common.job_execution_result.JobExecutionResult at 0x10584f3d0>

In [49]:
env = StreamExecutionEnvironment.get_execution_environment()

csv_path = "input_csv.csv"
ds = env.read_text_file(csv_path)

# Parse the CSV content into a tuple DataStream
parsed_ds = ds.map(lambda line: line.split(',')) \
              .map(lambda fields: (int(fields[0]), fields[1], int(fields[2])),
                   output_type=Types.TUPLE([Types.INT(), Types.STRING(), Types.INT()]))

# Print the parsed data to console
parsed_ds.print()

env.execute("Read CSV File")


1> (4,chinnareddy,60)
7> (3,reddy,55)
4> (1,praveen,29)
5> (2,chinna,19)


<pyflink.common.job_execution_result.JobExecutionResult at 0x15d5029d0>

## Windows in Apache Flink
Windows are essential for processing infinite data streams because they break the stream into finite "buckets". Flink can then run computations on each bucket instead of on the unbounded stream as a whole.

### Keyed and Non-Keyed Windows
#### Keyed Windows:

You divide the stream into logical sub-streams (called keyed streams) based on a key.

Use key_by(...) to assign a key (keyBy(...) in the Java API).

Allows parallel processing since each key is processed independently.

Example:
stream.key_by(lambda x: x.user_id).window(...).reduce(...)  # Keyed by user_id


#### Non-Keyed Windows:
The entire stream is treated as a single unit.

Use window_all(...) instead of window(...) (windowAll(...) in the Java API).

All processing happens in a single task, i.e. with a parallelism of 1.

Example:
stream.window_all(...).reduce(...)  # Non-keyed processing


**How are windows created and removed?**

A window is created when the first element that belongs to it arrives.

The window is removed once time (event time or processing time) passes its end timestamp plus the allowed lateness, if one is specified.

**Example:**
A tumbling window of 5 minutes starts at 12:00 and ends at 12:05.
If "lateness" is set to 1 minute, the window is cleared at 12:06.
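
The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch of the rule stated above, not Flink code; the calendar date and the helper name `cleanup_time` are made up for illustration.

```python
from datetime import datetime, timedelta

def cleanup_time(window_end: datetime, allowed_lateness: timedelta) -> datetime:
    """Hypothetical helper: the time after which the window's state is dropped."""
    return window_end + allowed_lateness

window_start = datetime(2024, 1, 1, 12, 0)        # arbitrary date, 12:00
window_end = window_start + timedelta(minutes=5)  # tumbling window of 5 minutes
removed_at = cleanup_time(window_end, timedelta(minutes=1))

print(window_end.strftime("%H:%M"), removed_at.strftime("%H:%M"))  # 12:05 12:06
```

Elements that arrive between 12:05 and 12:06 (in event time) can still be added to the window; anything later is treated as late data.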

**Core Components of a Window:**

Window Assigners: Define how elements are grouped into windows (e.g., time-based or session-based).

Trigger: Defines when a window is "ready" for computation (e.g., after 10 elements or when the watermark reaches a specific time).

Function: Specifies the computation to apply (e.g., sum, average, or a custom function).

Evictor (Optional): Removes elements from the window before or after applying the function. [not supported in Python API]

**Window Assigners**
Assigners group stream elements into windows:

``Tumbling Windows: Fixed-size, non-overlapping windows.``

Example: Windows of 5 seconds (12:00:00 to 12:00:05, 12:00:05 to 12:00:10, etc.).

``Sliding Windows: Fixed-size, overlapping windows with a defined "slide interval".``

Example: 5-second windows sliding every 2 seconds (consecutive windows overlap by 3 seconds).

``Session Windows: Windows separated by gaps of inactivity. A session ends after a configured period with no data, and the next element starts a new window.``

``Global Windows: Contains all data without grouping by time (manual control needed for triggers).``


- By default, GlobalWindows uses a trigger that never fires (NeverTrigger). Since it places all elements into a single window, you must explicitly define a trigger to decide when the window should be processed.

- Sliding and tumbling windows have built-in triggers based on the time semantics you specify (event-time or processing-time).
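
Session windows are the one assigner in this list that the notebook does not run, so here is a rough plain-Python sketch of the grouping idea. It is a simplification and not the PyFlink API: the function name `session_windows`, the sample timestamps, and the 5-second gap are all assumptions made for illustration.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: open a new session
    return sessions

# Timestamps in seconds (made-up values), with a session gap of 5 seconds
sessions = session_windows([0, 1, 2, 10, 11, 30], gap=5)
print(sessions)  # [[0, 1, 2], [10, 11], [30]]
```

Unlike tumbling and sliding windows, the window boundaries here depend on the data itself rather than on a fixed grid.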

In [200]:
import time

from pyflink.common import WatermarkStrategy, Duration, Encoder
from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.datastream.connectors.file_system import FileSink
from pyflink.datastream.functions import KeySelector, ReduceFunction, ProcessFunction, ProcessWindowFunction
from pyflink.datastream.window import TumblingEventTimeWindows, SlidingEventTimeWindows, GlobalWindows
from pyflink.table import StreamTableEnvironment, DataTypes, Schema

In [102]:
env = StreamExecutionEnvironment.get_execution_environment()
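env.set_parallelism(1)  # Set parallelism to 1 for simplicity
# Event time has been the default since Flink 1.12; this call is deprecated and kept only to be explicit
env.set_stream_time_characteristic(TimeCharacteristic.EventTime)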


In [103]:
current_time = int(time.time() * 1000)  # Current time in milliseconds
data = [
    (current_time - 7000, 10),  # Event 7 seconds ago
    (current_time - 6000, 20),  # Event 6 seconds ago
    (current_time - 5000, 30),  # Event 5 seconds ago
    (current_time - 4000, 40),  # Event 4 seconds ago
    (current_time - 3000, 50),  # Event 3 seconds ago
    (current_time - 2000, 60),  # Event 2 seconds ago
    (current_time - 1000, 70)   # Event 1 second ago
]

data_stream = env.from_collection(
    data,
    type_info=Types.TUPLE([Types.LONG(), Types.INT()])
)

In [104]:
# Define a custom key selector
class EvenOddKeySelector(KeySelector):
    def get_key(self, value):
        # Key by even (0) or odd (1) payload.
        # Every payload in the sample data is even, so all events end up under key 0.
        return value[1] % 2


How reduce combines two elements, given the sample data

data = [(timestamp_1, 10), (timestamp_2, 20), (timestamp_3, 30)]

The first call to reduce receives:

value1 = (timestamp_1, 10)

value2 = (timestamp_2, 20)

Result: (timestamp_1, 10 + 20) = (timestamp_1, 30)
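
The same pairwise folding can be tried outside Flink with `functools.reduce`. This is a plain-Python sketch with made-up timestamps; `sum_reduce` is a hypothetical stand-in that applies the rule described above (keep the timestamp of the first value, add the payloads).

```python
from functools import reduce

def sum_reduce(value1, value2):
    # Keep value1's timestamp and add the two payloads
    return (value1[0], value1[1] + value2[1])

events = [(1000, 10), (2000, 20), (3000, 30)]  # (timestamp, payload), made-up values

# Step 1: (1000, 10) + (2000, 20) -> (1000, 30)
# Step 2: (1000, 30) + (3000, 30) -> (1000, 60)
result = reduce(sum_reduce, events)
print(result)  # (1000, 60)
```

The result always carries the timestamp of the first element that entered the reduction, because every intermediate result is passed back in as `value1`.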

In [105]:
# Define a custom reduce function
class SumReduceFunction(ReduceFunction):
    def reduce(self, value1, value2):
        # Sum the payloads and keep value1's timestamp (the first element reduced in the window)
        return (value1[0], value1[1] + value2[1])


In [None]:

class FirstElementTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[0]  # Extract timestamp from the first field of the incoming event


1. Tolerates events that arrive up to 20 seconds out of order: the watermark trails the highest timestamp seen so far by 20 seconds.
2. The FirstElementTimestampAssigner uses the first field of each event (e.g., a timestamp) as the event's timestamp.
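
To make the bounded-out-of-orderness idea concrete, here is a small plain-Python simulation. It is a simplified sketch, not the Flink implementation: Flink emits watermarks periodically rather than per event, and the function name, the sample timestamps, and the 2-second delay are assumptions for illustration. The formula used is `watermark = max timestamp seen - allowed delay - 1 ms`.

```python
def bounded_out_of_orderness_watermarks(timestamps, max_delay_ms):
    """Yield (event timestamp, watermark after this event) pairs."""
    max_seen = None
    for ts in timestamps:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # The watermark trails the highest timestamp seen so far
        yield ts, max_seen - max_delay_ms - 1

# Event timestamps in milliseconds (made-up); 3000 arrives out of order
progress = list(bounded_out_of_orderness_watermarks([1000, 5000, 3000, 9000], 2000))
for ts, wm in progress:
    print(f"event={ts} watermark={wm}")
```

The out-of-order event at 3000 does not move the watermark backwards, and it is not late either, because the watermark at that point (2999) has not yet passed it.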

In [108]:
# Define the watermark strategy
watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(20)) \
    .with_timestamp_assigner(FirstElementTimestampAssigner()) # Extract timestamp from the first field


In [109]:
# Assign timestamps and watermarks
timestamped_stream = data_stream.assign_timestamps_and_watermarks(watermark_strategy)


In [None]:
windowed_stream = (
    timestamped_stream
    .key_by(EvenOddKeySelector())  # Key by even/odd payload
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))  # 3-second tumbling windows; the watermark strategy tolerates 20 seconds of out-of-order records
    .reduce(SumReduceFunction())  # Sum values in each window
)

# Print the result
windowed_stream.print()

# Execute the program
env.execute("Tumbling Window with Realistic Event Time")

(1734442493639,10)
(1734442494639,90)
(1734442497639,180)
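
The three results above can be reproduced without Flink by aligning each timestamp to a 3-second grid. The following is a plain-Python sketch, not the PyFlink API; it assumes the first event timestamp is 1734442493639 (read off the printed output) and that all events share one key, which holds here because every payload is even.

```python
from collections import defaultdict

WINDOW_MS = 3000
base = 1734442493639  # timestamp of the first event, taken from the printed output
events = [(base + i * 1000, 10 * (i + 1)) for i in range(7)]  # payloads 10..70

windows = defaultdict(list)
for ts, payload in events:
    start = ts - (ts % WINDOW_MS)  # window start, aligned to the epoch
    windows[start].append((ts, payload))

# Per window: keep the first element's timestamp and sum the payloads
results = [(elems[0][0], sum(p for _, p in elems)) for _, elems in sorted(windows.items())]
for r in results:
    print(r)
```

The first window holds only one event because windows are aligned to the epoch, not to the first event: the window [1734442491000, 1734442494000) was already nearly over when the first event occurred.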


<pyflink.common.job_execution_result.JobExecutionResult at 0x15cfd8950>

In [88]:

# Key by even/odd payload, create a sliding window of 5 seconds, sliding every 2 seconds
windowed_stream = (
    timestamped_stream
    .key_by(EvenOddKeySelector())
    .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(2)))
    .reduce(SumReduceFunction())
)

# Print the results
windowed_stream.print()

# Execute the program
env.execute("Sliding Window with Realistic Event Time")


(1734442493639,30)
(1734442493639,100)
(1734442494639,200)
(1734442496639,220)
(1734442498639,130)
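
With sliding windows each event belongs to several windows, which is why the sums above overlap. The sketch below reproduces them in plain Python under the same assumptions as the printed run (first event at 1734442493639, one key, size 5 s, slide 2 s). The helper `assign_sliding_windows` is hypothetical and only mirrors the idea of the assigner; it is not the PyFlink API.

```python
from collections import defaultdict

SIZE_MS, SLIDE_MS = 5000, 2000
base = 1734442493639  # timestamp of the first event, taken from the printed output
events = [(base + i * 1000, 10 * (i + 1)) for i in range(7)]  # payloads 10..70

def assign_sliding_windows(ts, size, slide):
    """Return the start of every window [start, start + size) that contains ts."""
    last_start = ts - (ts % slide)  # latest window start at or before ts
    return list(range(last_start, ts - size, -slide))

windows = defaultdict(list)
for ts, payload in events:
    for start in assign_sliding_windows(ts, SIZE_MS, SLIDE_MS):
        windows[start].append((ts, payload))

# Per window: keep the first element's timestamp and sum the payloads
results = [(elems[0][0], sum(p for _, p in elems)) for _, elems in sorted(windows.items())]
for r in results:
    print(r)
```

Each event lands in two or three windows here (size divided by slide is 2.5), so the payloads are counted more than once across the output.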


<pyflink.common.job_execution_result.JobExecutionResult at 0x15c873810>

Flink is distributed: a job runs across multiple worker nodes (TaskManagers), and print() writes to the standard output of the TaskManager that executes the operator. On a cluster the output therefore ends up in the TaskManager logs rather than in the notebook, unlike in local testing.

In [115]:
watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(15))

# Assign timestamps and watermarks
timestamped_stream = timestamped_stream.assign_timestamps_and_watermarks(watermark_strategy)

output_path = "/Users/praveenreddy/FFlink/Flink_Work/global_window_output"

# Global window with the default trigger: it never fires, so nothing is printed
windowed_stream = (
    timestamped_stream
    .window_all(GlobalWindows())  
    .reduce(SumReduceFunction())
)
windowed_stream.print()

# Writing results to a CSV sink
# windowed_stream.sink_to(
#     FileSink.for_row_format(base_path=output_path,encoder=Encoder.simple_string_encoder()).build()
# )

# Execute the program
env.execute("Global Window with Realistic Event Time")

<pyflink.common.job_execution_result.JobExecutionResult at 0x15cd78cd0>

FIRE
- Emits the current results of the window downstream but does not clear the window's state.

PURGE
- Clears the window's state without emitting any results downstream.

FIRE_AND_PURGE
- Emits the current results of the window and clears the window's state.

CONTINUE
- Does nothing and keeps the window's state intact.
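
The difference between these four results is easiest to see on a single summing window. The sketch below is a plain-Python simulation, not the PyFlink API: the `Result` enum is a hypothetical stand-in for `TriggerResult`, and the window applies the same result after every element.

```python
from enum import Enum

class Result(Enum):  # hypothetical stand-in for TriggerResult
    CONTINUE = 0
    FIRE = 1
    PURGE = 2
    FIRE_AND_PURGE = 3

def run_window(payloads, result_per_element):
    """Feed payloads into one summing window, applying the same result after each element."""
    state, emitted = None, []
    for p in payloads:
        state = p if state is None else state + p
        if result_per_element in (Result.FIRE, Result.FIRE_AND_PURGE):
            emitted.append(state)  # emit the current window result downstream
        if result_per_element in (Result.PURGE, Result.FIRE_AND_PURGE):
            state = None           # clear the window state
    return emitted

payloads = [10, 20, 30, 40, 50, 60, 70]
print(run_window(payloads, Result.FIRE))            # [10, 30, 60, 100, 150, 210, 280]
print(run_window(payloads, Result.FIRE_AND_PURGE))  # [10, 20, 30, 40, 50, 60, 70]
print(run_window(payloads, Result.CONTINUE))        # []
print(run_window(payloads, Result.PURGE))           # []
```

FIRE yields running totals because the state survives each emission, while FIRE_AND_PURGE starts from scratch after every element.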

In [None]:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import GlobalWindows, Trigger, TriggerResult
from pyflink.common.time import Duration
from pyflink.datastream.functions import ReduceFunction

# Initialize environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Define data
import time
current_time = int(time.time() * 1000)
data = [
    (current_time - 7000, 10),  # Event 7 seconds ago
    (current_time - 6000, 20),  # Event 6 seconds ago
    (current_time - 5000, 30),  # Event 5 seconds ago
    (current_time - 4000, 40),  # Event 4 seconds ago
    (current_time - 3000, 50),  # Event 3 seconds ago
    (current_time - 2000, 60),  # Event 2 seconds ago
    (current_time - 1000, 70)   # Event 1 second ago
]
# Create data stream
data_stream = env.from_collection(
    collection=data,
    type_info=Types.TUPLE([Types.LONG(), Types.INT()])
)

# Define reduce function
class SumReduceFunction(ReduceFunction):
    def reduce(self, value1, value2):
        return (value1[0], value1[1] + value2[1])

class CustomTrigger(Trigger):
    # Called every time a new element is added to the window.
    def on_element(self, element, timestamp, window, ctx):
        # FIRE emits the current result and keeps the window state.
        # (TriggerResult.PURGE would clear the state without emitting results.)
        return TriggerResult.FIRE

    # Invoked when a processing-time timer fires (based on wall-clock time).
    def on_processing_time(self, time, window, ctx):
        # Do nothing and wait for more elements or the next trigger condition
        return TriggerResult.CONTINUE

    # Called when an event-time timer fires (the watermark has reached the timer's timestamp).
    def on_event_time(self, time, window, ctx):
        # Do nothing and wait for more elements or the next trigger condition
        return TriggerResult.CONTINUE

    # Used when windows are merged (e.g., in session windows).
    def on_merge(self, window, ctx):
        # No merging logic required for GlobalWindows
        pass

    # Called when the window is removed (watermark past end + allowed lateness) or explicitly cleared.
    def clear(self, window, ctx):
        # No trigger state to clean up
        pass




# Define watermark strategy
watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(15)).with_timestamp_assigner(FirstElementTimestampAssigner()) 

# Assign timestamps and watermarks
timestamped_stream = data_stream.assign_timestamps_and_watermarks(watermark_strategy)

# Define global window with custom trigger
windowed_stream = (
    timestamped_stream
    .window_all(GlobalWindows())
    .trigger(CustomTrigger())
    .reduce(SumReduceFunction())
)

# Debugging: Print results
windowed_stream.print()

# Execute program
env.execute("Global Window with Realistic Event Time")


(1734477024357,10)
(1734477024357,30)
(1734477024357,60)
(1734477024357,100)
(1734477024357,150)
(1734477024357,210)
(1734477024357,280)


<pyflink.common.job_execution_result.JobExecutionResult at 0x15d9963d0>

In [128]:

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Sensor data: (timestamp, sensor_id, temperature)
import time
current_time = int(time.time() * 1000)
data = [
    (current_time - 7000, "Sensor-1", 30.5),
    (current_time - 6000, "Sensor-1", 32.0),
    (current_time - 5000, "Sensor-1", 35.8),
    (current_time - 4000, "Sensor-2", 29.2),
    (current_time - 3000, "Sensor-2", 33.5),
    (current_time - 2000, "Sensor-1", 37.1),
]


In [129]:
class SensorTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[0]  # Timestamp is the first field


In [144]:
class MaxTemperatureFunction(ProcessWindowFunction):
    def process(self, key, context, elements):
        max_temp = float('-inf')
        max_record = None

        # Find the maximum temperature in the window
        for element in elements:
            if element[2] > max_temp:
                max_temp = element[2]
                max_record = element

        window_start = context.window().start
        window_end = context.window().end

        # Collect results as an iterable (list)
        result = [
            f"Sensor: {key}, Window: [{window_start}, {window_end}), Max Temp: {max_temp}"
        ]

        # Emit the result
        return result


In [145]:
watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(2)) \
    .with_timestamp_assigner(SensorTimestampAssigner())

data_stream = env.from_collection(data)
timestamped_stream = data_stream.assign_timestamps_and_watermarks(watermark_strategy)


In [146]:
windowed_stream = (
    timestamped_stream
    .key_by(lambda x: x[1])  # Key by sensor ID
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(MaxTemperatureFunction())
)

windowed_stream.print()
env.execute("Max Temperature with ProcessWindowFunction")

Sensor: Sensor-1, Window: [1734479245000, 1734479250000), Max Temp: 32.0
Sensor: Sensor-2, Window: [1734479250000, 1734479255000), Max Temp: 33.5
Sensor: Sensor-1, Window: [1734479250000, 1734479255000), Max Temp: 37.1
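
The grouping behind these three lines is "per key, per window". Below is a plain-Python sketch of that grouping, not the PyFlink API. The value of `current_time` is an assumption: the real value is not shown in the notebook, so one was picked that is consistent with the printed window boundaries.

```python
WINDOW_MS = 5000
current_time = 1734479255500  # assumed; chosen to be consistent with the printed windows
data = [
    (current_time - 7000, "Sensor-1", 30.5),
    (current_time - 6000, "Sensor-1", 32.0),
    (current_time - 5000, "Sensor-1", 35.8),
    (current_time - 4000, "Sensor-2", 29.2),
    (current_time - 3000, "Sensor-2", 33.5),
    (current_time - 2000, "Sensor-1", 37.1),
]

max_by_key_window = {}
for ts, sensor, temp in data:
    start = ts - (ts % WINDOW_MS)  # tumbling window start for this event
    key = (sensor, start)          # state is kept per (key, window) pair
    max_by_key_window[key] = max(temp, max_by_key_window.get(key, float("-inf")))

for (sensor, start), temp in sorted(max_by_key_window.items(), key=lambda kv: kv[0][1]):
    print(f"Sensor: {sensor}, Window: [{start}, {start + WINDOW_MS}), Max Temp: {temp}")
```

Sensor-1 appears twice because its readings fall into two different 5-second windows, while both Sensor-2 readings fall into the same one.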


<pyflink.common.job_execution_result.JobExecutionResult at 0x15c8ce9d0>

In [196]:
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Set up Table Environment
t_env = StreamTableEnvironment.create(env)

# Define output path
output_path = "/Users/praveenreddy/FFlink/Flink_Work/real_time_output"

# Define some sample data
current_time = int(time.time() * 1000)
data = [
    (current_time - 7000, "sensor_1", 23.5),
    (current_time - 6000, "sensor_2", 24.1),
    (current_time - 5000, "sensor_1", 25.3),
    (current_time - 4000, "sensor_3", 22.5),
    (current_time - 3000, "sensor_2", 24.6),
    (current_time - 2000, "sensor_3", 21.7),
    (current_time - 1000, "sensor_1", 25.0),
]

# Create a data stream from the sample data
data_stream = env.from_collection(
    collection=data,
    type_info=Types.TUPLE([Types.LONG(), Types.STRING(), Types.FLOAT()])
)


In [197]:
watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))
timestamped_stream = data_stream.assign_timestamps_and_watermarks(watermark_strategy)


In [None]:
from pyflink.table import Schema

# Convert DataStream to Table with Schema
table = t_env.from_data_stream(
    timestamped_stream  ,
    Schema.new_builder()
        .column_by_expression("data_timestamp", "f0")
        .column_by_expression("sensor_id", "f1")
        .column_by_expression("temperature", "f2")
        .build()
)

# Register the table as a temporary view so it can be queried with SQL
t_env.create_temporary_view("sensor_data", table)


In [199]:
df = t_env.execute_sql("select data_timestamp,sensor_id,temperature from sensor_data")
df.print()


+----+----------------------+--------------------------------+--------------------------------+
| op |       data_timestamp |                      sensor_id |                    temperature |
+----+----------------------+--------------------------------+--------------------------------+
| +I |        1734482513866 |                       sensor_1 |                           23.5 |
| +I |        1734482514866 |                       sensor_2 |                           24.1 |
| +I |        1734482515866 |                       sensor_1 |                           25.3 |
| +I |        1734482516866 |                       sensor_3 |                           22.5 |
| +I |        1734482517866 |                       sensor_2 |                           24.6 |
| +I |        1734482518866 |                       sensor_3 |                           21.7 |
| +I |        1734482519866 |                       sensor_1 |                           25.0 |
+----+----------------------+--------------------------------+--------------------------------+

In [211]:
class FraudDetectionFunction(ProcessFunction):
    def process_element(self, element, ctx):
        transaction_id, account_id, amount, timestamp = element

        # Flag transactions whose amount exceeds $10,000
        if amount > 10000:
            alert_msg = f"ALERT: Suspicious transaction detected - Transaction ID: {transaction_id}, Account: {account_id}, Amount: ${amount}"
            # Yield the alert so that nothing is emitted for normal transactions
            yield alert_msg


In [212]:
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Sample transaction data (transaction_id, account_id, amount, timestamp)
transaction_data = [
    ("txn1", "acc101", 5000, 1653232900),
    ("txn2", "acc102", 15000, 1653232910),
    ("txn3", "acc103", 12000, 1653232920),
    ("txn4", "acc104", 9000, 1653232930),
    ("txn5", "acc105", 11000, 1653232940)
]

# Create a DataStream from the sample data
transaction_stream = env.from_collection(
    collection=transaction_data,
    type_info=Types.TUPLE([Types.STRING(), Types.STRING(), Types.INT(), Types.LONG()])
)

In [213]:
suspicious_transactions = transaction_stream.process(FraudDetectionFunction())

# Print the detected alerts
suspicious_transactions.print()

# Execute the Flink job
env.execute("Fraud Detection in Financial Transactions")

ALERT: Suspicious transaction detected - Transaction ID: txn2, Account: acc102, Amount: $15000
ALERT: Suspicious transaction detected - Transaction ID: txn3, Account: acc103, Amount: $12000
ALERT: Suspicious transaction detected - Transaction ID: txn5, Account: acc105, Amount: $11000


<pyflink.common.job_execution_result.JobExecutionResult at 0x1481b8ad0>