# Temporal and Spatial Analysis of Red Alerts in Israel

1. **Data extraction**
2. **Data preprocessing**
   - Creating a general dataset
   - Creating a dataset with an indication of detailed localities
   - Creating a dataset where zones of large cities are combined into a single record
3. **Conclusion**

# Data extraction

The data used in this analysis was obtained from the [Cumta Telegram channel](https://t.me/CumtaAlertsEnglishChannel), which provides real-time alerts about rocket sirens (Red Alerts) in Israel. The channel broadcasts information about alerts, including affected regions, cities, and timestamps. The dataset consists of extracted historical messages from this channel for further analysis and visualization.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
from collections import Counter
from scipy.interpolate import make_interp_spline
import numpy as np

In [2]:
# Path to the folder containing files
path_to_files = r"C:\Users\Vera\Documents\DA Practicum\work files\datasets\cumta_2"

# Create an empty list to store data
data = []

# Iterate over all files with the prefix 'messages'
for file in os.listdir(path_to_files):
    if file.startswith('messages') and file.endswith('.html'):
        # Open the file in UTF-8 encoding
        with open(os.path.join(path_to_files, file), 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            
            # Process messages and find associated dates
            for msg in soup.find_all('div', class_='text'):
                # Find the nearest previous date element
                date = msg.find_previous('div', class_='pull_right date details')
                data.append({
                    'date': date['title'] if date and 'title' in date.attrs else None,
                    'text': msg.text.strip()
                })

# Convert the collected data into a DataFrame
raw_df = pd.DataFrame(data)

raw_df


Unnamed: 0,date,text
0,,Red Alerts - Cumta
1,26.12.2018 11:05:08 UTC+03:00,"Red Alert at Dan (155,156,157,158,159,160,161,..."
2,02.01.2019 11:05:05 UTC+03:00,"Red Alert at Eilat 311, Arabah 310 [10:05]: 02..."
3,07.01.2019 04:18:52 UTC+03:00,Red Alert at Lakhish 246 [03:18]: 07/01/2019 0...
4,09.01.2019 11:39:23 UTC+03:00,"Good morning,Starting this week the Home Front..."
...,...,...
7688,08.06.2024 10:46:43 UTC+03:00,Red Alert at Zarit [10:46]:08/06/2024 10:46:42...
7689,08.06.2024 10:47:54 UTC+03:00,Red Alert at Zarit [10:47]:08/06/2024 10:47:53...
7690,08.06.2024 10:50:11 UTC+03:00,Red Alert at Zarit [10:49]:08/06/2024 10:49:22...
7691,08.06.2024 11:38:47 UTC+03:00,"Unrecognized Aircraft at HaGalil HaElyon, Mero..."


#  Data preprocessing
## Creating a general dataset

In [3]:
# Convert the 'date' column to datetime format with dayfirst=True
raw_df['date'] = pd.to_datetime(raw_df['date'], dayfirst=True)

# Sort the DataFrame by date in ascending order
raw_df = raw_df.sort_values(by='date', ascending=True).reset_index(drop=True)

# Checking the minimum and maximum dates in the dataset
print(f"Minimum date: {raw_df['date'].min()}")
print(f"Maximum date: {raw_df['date'].max()}")


Minimum date: 2018-12-26 11:05:08+03:00
Maximum date: 2025-10-05 18:35:50+03:00


In [4]:
# Check
raw_df

Unnamed: 0,date,text
0,2018-12-26 11:05:08+03:00,"Red Alert at Dan (155,156,157,158,159,160,161,..."
1,2019-01-02 11:05:05+03:00,"Red Alert at Eilat 311, Arabah 310 [10:05]: 02..."
2,2019-01-07 04:18:52+03:00,Red Alert at Lakhish 246 [03:18]: 07/01/2019 0...
3,2019-01-09 11:39:23+03:00,"Good morning,Starting this week the Home Front..."
4,2019-01-12 21:59:18+03:00,"Red Alert at Gaza Containment Zone (224,225) [..."
...,...,...
7688,NaT,Red Alerts - Cumta
7689,NaT,Red Alerts - Cumta
7690,NaT,Red Alerts - Cumta
7691,NaT,Red Alerts - Cumta


In [5]:
# Delete rows where the value in the date column is NaT
raw_df = raw_df.dropna(subset=['date']).reset_index(drop=True)
raw_df

Unnamed: 0,date,text
0,2018-12-26 11:05:08+03:00,"Red Alert at Dan (155,156,157,158,159,160,161,..."
1,2019-01-02 11:05:05+03:00,"Red Alert at Eilat 311, Arabah 310 [10:05]: 02..."
2,2019-01-07 04:18:52+03:00,Red Alert at Lakhish 246 [03:18]: 07/01/2019 0...
3,2019-01-09 11:39:23+03:00,"Good morning,Starting this week the Home Front..."
4,2019-01-12 21:59:18+03:00,"Red Alert at Gaza Containment Zone (224,225) [..."
...,...,...
7672,2025-10-01 14:00:45+03:00,"Red Alert at Sha'ar HaNegev, Sdot Negev [14:00..."
7673,2025-10-01 20:49:07+03:00,"Red Alert at Ashdod, Hof Ashkelon [20:48]:01/1..."
7674,2025-10-05 04:59:22+03:00,"Red Alert at Beit Shemesh, Lod, Ness Ziona, Mo..."
7675,2025-10-05 18:34:40+03:00,Red Alert at Eilat [18:34]:05/10/2025 18:34:40...


In [6]:
# Display general information 
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7677 entries, 0 to 7676
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype                    
---  ------  --------------  -----                    
 0   date    7677 non-null   datetime64[ns, UTC+03:00]
 1   text    7677 non-null   object                   
dtypes: datetime64[ns, UTC+03:00](1), object(1)
memory usage: 120.1+ KB


In [7]:
# --- Hidden because of the large size ---
# Let's see the whole dataframe
#raw_df.to_string()

In [8]:
# A function for determining the type of threat
def extract_threat_type(text):
    if text.startswith("Red Alert") or text.startswith("An alert"):
        return "Red Alert"
    elif text.startswith("Unrecognized Aircraft"):
        return "Unrecognized Aircraft"
    elif text.startswith("An unrecognized aircraft"):
        return "Unrecognized Aircraft"
    elif text.startswith("Terrorist Infiltration"):
        return "Terrorist Infiltration"
    elif text.startswith("Earthquake"):
        return "Earthquake"
    elif text.startswith("Interception pieces"):
        return "Interception pieces"
    return None

# Removing update messages (starting with 'Good morning', 'Dear' or similar)
raw_df = raw_df[~raw_df['text'].str.lower().str.startswith(('good', 'starting', 
                                                            'dear', "home", 
                                                            'a new'))].reset_index(drop=True)

raw_df = raw_df.copy()
raw_df['threat_type'] = raw_df['text'].apply(extract_threat_type)


In [9]:
# A function for extracting a region
def extract_region(text):
    match = re.search(r'at (.*?) \[\d{2}:\d{2}\]', text)
    return match.group(1).strip() if match else None

# A function for extracting time
def extract_time(text):
    match = re.search(r'\[(\d{2}:\d{2})\]', text)
    return match.group(1) if match else None

# Functions for extracting major cities and regional councils

def extract_major_cities(text):
    match = re.search(r'[Mm]ajor [Cc]ities: (.+?)(?:\|\||$)', text)
    return match.group(1).strip() if match else None

def extract_regional_councils(text):
    match = re.search(r'Regional Councils: (.+?)\|\|', text)
    return match.group(1).strip() if match else None

# Adding new columns
raw_df['region'] = raw_df['text'].apply(extract_region)
raw_df['major_cities'] = raw_df['text'].apply(extract_major_cities)
raw_df['regional_councils'] = raw_df['text'].apply(extract_regional_councils)

# Resetting the index for the resulting DataFrame
raw_df.reset_index(drop=True, inplace=True)

In [10]:
# Check
raw_df

Unnamed: 0,date,text,threat_type,region,major_cities,regional_councils
0,2018-12-26 11:05:08+03:00,"Red Alert at Dan (155,156,157,158,159,160,161,...",Red Alert,"Dan (155,156,157,158,159,160,161,162,165), Sha...","Tel Aviv, Holon, Herzliya, Ramat Gan, Bnei Bra...",Southern Sharon
1,2019-01-02 11:05:05+03:00,"Red Alert at Eilat 311, Arabah 310 [10:05]: 02...",Red Alert,"Eilat 311, Arabah 310",Eilat,"Hevel Eilot, Tamar, HaArava HaTichona"
2,2019-01-07 04:18:52+03:00,Red Alert at Lakhish 246 [03:18]: 07/01/2019 0...,Red Alert,Lakhish 246,Ashkelon,Hof Ashkelon
3,2019-01-12 21:59:18+03:00,"Red Alert at Gaza Containment Zone (224,225) [...",Red Alert,"Gaza Containment Zone (224,225)",,"Sdot Negev, Sha'ar HaNegev"
4,2019-02-06 11:05:01+03:00,"Red Alert at Jerusalem 194, Maale Adumim 200, ...",Red Alert,"Jerusalem 194, Maale Adumim 200, Samaria 127","Jerusalem, Ma'ale Adumim","Gush Etzion, Mateh Binyamin"
...,...,...,...,...,...,...
7662,2025-10-01 14:00:45+03:00,"Red Alert at Sha'ar HaNegev, Sdot Negev [14:00...",Red Alert,"Sha'ar HaNegev, Sdot Negev",,
7663,2025-10-01 20:49:07+03:00,"Red Alert at Ashdod, Hof Ashkelon [20:48]:01/1...",Red Alert,"Ashdod, Hof Ashkelon",Ashdod,
7664,2025-10-05 04:59:22+03:00,"Red Alert at Beit Shemesh, Lod, Ness Ziona, Mo...",Red Alert,"Beit Shemesh, Lod, Ness Ziona, Modi'in-Maccabi...","Beit Shemesh, Lod, Ness Ziona, Modi'in-Maccabi...",
7665,2025-10-05 18:34:40+03:00,Red Alert at Eilat [18:34]:05/10/2025 18:34:40...,Red Alert,Eilat,Eilat,


### 1 st october 2024 check
Let's check how strings are stored during very powerful attacks, when a lot of settlements were involved.

In [11]:
# looking at the lines with the date of October 1, 2024 from 19 to 20 (the Iranian attack)
raw_df[
    (raw_df['date'] >= '2024-10-01 19:00:00') &
    (raw_df['date'] < '2024-10-01 20:00:00')
]

Unnamed: 0,date,text,threat_type,region,major_cities,regional_councils
5640,2024-10-01 19:31:39+03:00,"Red Alert at Be'er Sheva, Kiryat Arba, Arad, Y...",Red Alert,"Be'er Sheva, Kiryat Arba, Arad, Yeruham, Dimon...","Be'er Sheva, Kiryat Arba, Arad, Yeruham, Dimon...",
5641,2024-10-01 19:35:36+03:00,"Red Alert at Gdera, Yavne, Tel Aviv - Jaffa, O...",Red Alert,"Gdera, Yavne, Tel Aviv - Jaffa, Or Yehuda, Bne...",,
5642,2024-10-01 19:35:37+03:00,"• Lachish - Kedma, Ad Halom Industrial Zone, M...",,,,
5643,2024-10-01 19:35:37+03:00,"• Sharon - Bnei Zion, Ahituv, Bitan Aharon, Ne...",,,,
5644,2024-10-01 19:35:38+03:00,"• Center Negev - Laqiya, Beer Sheva - East, Be...",,,"Gdera, Yavne, Tel Aviv - Jaffa, Or Yehuda, Bne...",
5645,2024-10-01 19:40:17+03:00,"Red Alert at Kiryat Arba, Dimona, Hevron Jewis...",Red Alert,"Kiryat Arba, Dimona, Hevron Jewish Settlement,...",,
5646,2024-10-01 19:40:17+03:00,"Red Alert at Kiryat Arba, Dimona, Hevron Jewis...",Red Alert,"Kiryat Arba, Dimona, Hevron Jewish Settlement,...",,
5647,2024-10-01 19:40:18+03:00,• West Lachish - Ashkelon Northern Industrial ...,,,,
5648,2024-10-01 19:40:18+03:00,• West Lachish - Ashkelon Northern Industrial ...,,,,
5649,2024-10-01 19:40:25+03:00,"• Bika'a - Niran, Tomer, Petza'el, Argaman, Ma...",,,,


In [12]:
# Checking the text in the 'text' column (row 5643)
raw_df.loc[5641, 'text']

"Red Alert at Gdera, Yavne, Tel Aviv - Jaffa, Or Yehuda, Bnei Brak, Bat-Yam, Givat Shmuel, Givatayim, Herzeliya, Holon, Yehud - Monoson, Kfar Shmaryahu, Modi'in-Maccabim-Re'ut, Mikveh Israel, Ariel, Rishon LeZion, Ramat Gan, Or Akiva, Be'er Yacov, Pardes Hanna - Karkur, Ramla, Hadera, Ramat HaSharon, Kfar Saba, Lod, Hod HaSharon, Rosh HaAyin, Ness Ziona, Elad, Netanya, Kfar Kassem, Petach Tikva, Tira, Qalansawe, Rehovot, Caesarea, Shoham, Kiryat Ono, Karnei Shomron, Ra'anana, Savyon, bqa al-Gharbiyye, Kadima-Zoran, Tayibe, Ashdod, Kiryat Gat, Gan Yavne, Kiryat Malachi, Jerusalem, Beit Shemesh, Beitar Illit, Mevasseret Zion, Ashkelon, Ofakim, Be'er Sheva, Sderot, Yeruham, Netivot, Arad, Hevel Yavne, R.C. Be'er Tuvia, Gederot, Brenner, Nahal Sorek, Gan Raveh, Yoav, Emek Hefer, Drom HaSharon, Hevel Modi'in, Mateh Binyamin, Gezer, Hof HaSharon, Sdot Dan, Shomron, Lev HaSharon, Menashe, Hof HaCarmel, Mateh Yehuda, Lakhish, Shafir, Hof Ashkelon, Gush Etzion, Sha'ar HaNegev, Merhavim, Bnei Sh

In [13]:
raw_df.loc[5642, 'text']

"• Lachish - Kedma, Ad Halom Industrial Zone, Menucha, Merkaz Shapira, Masuot Itzhak, Nachla, Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*, Givati, Gat, Aluma, Shafir, Zrahia, Ezer, Nir Banim, Uzza, Emunim, Azrikam, Even Shmuel, Kommemiut, Ashdod - Gimmel, Vav, Zain, Sdeh Moshe, Ashdod - Alef, Bet, Dalet, Heh, Ein Tzurim, Vardon, Shalva, Kiryat Gat - Industrial Zone, Sgula, Sdeh Uziahu, Beit Ezra, Ahuzam, Revacha, Noam, Zavdiel, Shtulim, Ashdod-11,12,15,17,Marine,City, Kiryat Gat , Karmei Gat, Eitan • West Lachish - Nitzanim Beach, Nitzan, Nitzanim01/10/2024 19:33:36: • Lachish - Timorim, Orot, Re'em Industrial Park, Timorimg Industrial Zone, Kiryat Mal'akhi-Yoav train station, Arugot, Revadim, Al Azi, Be'er Tuvia, Be'er Tuvia Industrial Zone, Yinon, Talmei Yehiel, Kfar HaRif and Re'em Junction, Gan Yavne, Lakhish, Kiryat Malachi, Avigdor, Hatzor, Achva, Kfar Warburg, Bnei Re'em, Kfar Achim • Shfelat Yehuda - Galon, Beit Nir, Kfar Menachem, Beit Guvrin, Kfar Zoharim, Sdot Micha, 

In [14]:
raw_df.loc[5643, 'text']

'• Sharon - Bnei Zion, Ahituv, Bitan Aharon, Neveh Yamin, Matan, Nir Eliyahu, Yafhiv, Sha\'arei Tikva, Zemer, Elishema, Kfar Yona, Yarkona, Haniel, Tzur Itzhak, Gelilot - Pi Compound, Bahan, HaSharon Prison, Bnei Dror, Yad Hana, Kfar Saba, Be\'erotaim, Ramat HaKovesh, Eyal, Mikhmoret, Nitzanei Oz, Beit Herut, Kfar HaRoeh, Geulim, Hagor, Kfar Vitkin, Herev Le\'Et, Hadar Am, Beit Yehoshua, Hod HaSharon, Tel Itzhak, Bat Chen, Elyachin, Havatzelet HaSharon and Tzukei Yam, Alfei Menashe, Kafr Misr, Azriel, Kokhav Ya\'ir - Tzur Yigal, Shfayim "Haneh Ve\'sa" Compound, Netanya - West, Gan Haim, Kfar Hess, Porat, Oranit, Kfar Yavetz, Horashim, Amatz, Batzra, Ein Sarid, Burgata, Tel Mond, HaOgen, Yanuv, Beit Berl, Kfar Kassem, Bat Hefer, Kfar Haim, Even Yehuda, Kfar Avoda, Adanim, Kfar Monash, Shefayim, Ramot HaShavim, Kfar Netter, Tira Industrial Area, Ein HaHoresh, Tira, Beit Yithak - Sha\'ar Hefer, Herut, Salit, Harutzim, Ginot Hadar, Southern Sharon Regional Center, Qalansawe, Ma\'abarot, Ne

In [15]:
raw_df.loc[5644, 'text']

"• Center Negev - Laqiya, Beer Sheva - East, Beer Sheva - West, Hatzerim, Beer Sheva - North, Eshkolot, Beer Sheva - South, Omer • South Negev - Tlalim, Bir Hadaj, Revivim, Ashalim, Wadi el Na'am South, Retamim, Mashabei Sadeh • Yehuda - Neta, Shomria01/10/2024 19:34:18: • Gaza Envelope - Alumim, Yevul, Tkuma, Gavim, Sapir College, Or HaNer, Yated, Sufa, Havat Izra'am, Sdeh Nitzan, Pri Gan, Shlomit, Nir Am, Yad Mordechai, Nirim, Sderot, Ibim, Erez, Ein HaShlosha, Sdeh Avraham, Kfar Azza, Mivtahim, Ami'oz, Yesha, Shokeda, Dekel, Bnei Netzarim, Nir Am Shooting Range, Zohar, Ohad, Reim, Holit, Ein HaBsor, Netiv HaAssara, Magen, Nir Itzhak, Be'eri, Kissufim, Sa'ad, Nir Oz, Zimrat, Shuva, Kfar Maimon and Tushia, Talmei Eliyahu, Naveh, Mefalsim, Yakhini, Nachal Oz, Avshalom, Talmei Yossef, Gvaram • West Negev - Yoshivia, Mabu'im, Ta'ashur, Zru'a, Shavei Darom, Havat Shikmim, Bror Hayil, Noam Industrial Zone, Ma'galim, Giv'olim, M'lilot, Sdeh Zvi, Netivot, Urim, Shibolim, Brosh, Eshbol, Tidha

The information in the 'text' column gets truncated. The reason for this truncation is that a single message we want to process is split across multiple **div class="text"** blocks. The code processes only one **div class="text"** at a time without combining them into a single message. As a result, the message text is incomplete, and subsequent parts are saved as separate rows.

Rows that start with messages like "• Center Negev - Dvira Junction," have None in the threat_type, region, and time columns but actually belong to the previous message. We can process these rows by taking the threat_type from the previous message. If the text starts with "• ", we append this text to the text field of the previous row and then delete the current row.

In [16]:
# Sort indices in descending order to avoid conflicts when modifying rows
for idx in sorted(raw_df[raw_df['text'].str.match(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}:|• ", 
                                                  na=False)].index, reverse=True):
    # Check if the previous row exists
    if idx - 1 in raw_df.index:
        # Append the current text to the 'text' of the previous row
        raw_df.loc[idx - 1, 'text'] += f" {raw_df.loc[idx, 'text']}"
        # Drop the current row
        raw_df.drop(index=idx, inplace=True)

# Reset the index after modifications
raw_df.reset_index(drop=True, inplace=True)

In [17]:
# Checking problematic rows (1 st october 2024, 19.00-20.00)
raw_df[
    (raw_df['date'] >= '2024-10-01 19:00:00') &
    (raw_df['date'] < '2024-10-01 20:00:00')
]

Unnamed: 0,date,text,threat_type,region,major_cities,regional_councils
5625,2024-10-01 19:31:39+03:00,"Red Alert at Be'er Sheva, Kiryat Arba, Arad, Y...",Red Alert,"Be'er Sheva, Kiryat Arba, Arad, Yeruham, Dimon...","Be'er Sheva, Kiryat Arba, Arad, Yeruham, Dimon...",
5626,2024-10-01 19:35:36+03:00,"Red Alert at Gdera, Yavne, Tel Aviv - Jaffa, O...",Red Alert,"Gdera, Yavne, Tel Aviv - Jaffa, Or Yehuda, Bne...",,
5627,2024-10-01 19:40:17+03:00,"Red Alert at Kiryat Arba, Dimona, Hevron Jewis...",Red Alert,"Kiryat Arba, Dimona, Hevron Jewish Settlement,...",,
5628,2024-10-01 19:40:17+03:00,"Red Alert at Kiryat Arba, Dimona, Hevron Jewis...",Red Alert,"Kiryat Arba, Dimona, Hevron Jewish Settlement,...",,
5629,2024-10-01 19:42:45+03:00,"Red Alert at Yehud - Monoson, Modi'in-Maccabim...",Red Alert,"Yehud - Monoson, Modi'in-Maccabim-Re'ut, Shoha...",,
5630,2024-10-01 19:42:47+03:00,"Red Alert at Yehud - Monoson, Modi'in-Maccabim...",Red Alert,"Yehud - Monoson, Modi'in-Maccabim-Re'ut, Shoha...",,
5631,2024-10-01 19:46:05+03:00,"Red Alert at Rishon LeZion, Ramat Gan, Ramat H...",Red Alert,"Rishon LeZion, Ramat Gan, Ramat HaSharon, Ra'a...",,
5632,2024-10-01 19:52:31+03:00,"Red Alert at Be'er Sheva, Or Akiva, umm al-Fah...",Red Alert,"Be'er Sheva, Or Akiva, umm al-Fahm, Pardes Han...",,
5633,2024-10-01 19:52:35+03:00,"|| Major cities: Be'er Sheva, Or Akiva, umm al...",,,"Be'er Sheva, Or Akiva, umm al-Fahm, Pardes Han...",
5634,2024-10-01 19:53:12+03:00,"Red Alert at Dimona, Golan, HaGalil HaTahton, ...",Red Alert,"Dimona, Golan, HaGalil HaTahton, Misgav, Emek ...",Dimona,


In [18]:
# Check problematic srting
raw_df.loc[5626, 'text']

'Red Alert at Gdera, Yavne, Tel Aviv - Jaffa, Or Yehuda, Bnei Brak, Bat-Yam, Givat Shmuel, Givatayim, Herzeliya, Holon, Yehud - Monoson, Kfar Shmaryahu, Modi\'in-Maccabim-Re\'ut, Mikveh Israel, Ariel, Rishon LeZion, Ramat Gan, Or Akiva, Be\'er Yacov, Pardes Hanna - Karkur, Ramla, Hadera, Ramat HaSharon, Kfar Saba, Lod, Hod HaSharon, Rosh HaAyin, Ness Ziona, Elad, Netanya, Kfar Kassem, Petach Tikva, Tira, Qalansawe, Rehovot, Caesarea, Shoham, Kiryat Ono, Karnei Shomron, Ra\'anana, Savyon, bqa al-Gharbiyye, Kadima-Zoran, Tayibe, Ashdod, Kiryat Gat, Gan Yavne, Kiryat Malachi, Jerusalem, Beit Shemesh, Beitar Illit, Mevasseret Zion, Ashkelon, Ofakim, Be\'er Sheva, Sderot, Yeruham, Netivot, Arad, Hevel Yavne, R.C. Be\'er Tuvia, Gederot, Brenner, Nahal Sorek, Gan Raveh, Yoav, Emek Hefer, Drom HaSharon, Hevel Modi\'in, Mateh Binyamin, Gezer, Hof HaSharon, Sdot Dan, Shomron, Lev HaSharon, Menashe, Hof HaCarmel, Mateh Yehuda, Lakhish, Shafir, Hof Ashkelon, Gush Etzion, Sha\'ar HaNegev, Merhavim,

Now these strings are OK.

In [19]:
# Merge rows where 'text' starts with '||' into the previous row
for idx in sorted(
    raw_df[raw_df['text'].str.startswith('||', na=False)].index, 
    reverse=True
):
    if idx - 1 in raw_df.index:
        raw_df.loc[idx - 1, 'major_cities'] = (
            f"{raw_df.loc[idx - 1, 'major_cities']} {raw_df.loc[idx, 'major_cities']}".strip()
        )
        raw_df.loc[idx - 1, 'regional_councils'] = (
            f"{raw_df.loc[idx - 1, 'regional_councils']} {raw_df.loc[idx, 'regional_councils']}".strip()
        )
        raw_df.drop(index=idx, inplace=True)


# Reset the index after modifications
raw_df.reset_index(drop=True, inplace=True)

# Display the updated DataFrame
raw_df


Unnamed: 0,date,text,threat_type,region,major_cities,regional_councils
0,2018-12-26 11:05:08+03:00,"Red Alert at Dan (155,156,157,158,159,160,161,...",Red Alert,"Dan (155,156,157,158,159,160,161,162,165), Sha...","Tel Aviv, Holon, Herzliya, Ramat Gan, Bnei Bra...",Southern Sharon
1,2019-01-02 11:05:05+03:00,"Red Alert at Eilat 311, Arabah 310 [10:05]: 02...",Red Alert,"Eilat 311, Arabah 310",Eilat,"Hevel Eilot, Tamar, HaArava HaTichona"
2,2019-01-07 04:18:52+03:00,Red Alert at Lakhish 246 [03:18]: 07/01/2019 0...,Red Alert,Lakhish 246,Ashkelon,Hof Ashkelon
3,2019-01-12 21:59:18+03:00,"Red Alert at Gaza Containment Zone (224,225) [...",Red Alert,"Gaza Containment Zone (224,225)",,"Sdot Negev, Sha'ar HaNegev"
4,2019-02-06 11:05:01+03:00,"Red Alert at Jerusalem 194, Maale Adumim 200, ...",Red Alert,"Jerusalem 194, Maale Adumim 200, Samaria 127","Jerusalem, Ma'ale Adumim","Gush Etzion, Mateh Binyamin"
...,...,...,...,...,...,...
7441,2025-10-01 14:00:45+03:00,"Red Alert at Sha'ar HaNegev, Sdot Negev [14:00...",Red Alert,"Sha'ar HaNegev, Sdot Negev",,
7442,2025-10-01 20:49:07+03:00,"Red Alert at Ashdod, Hof Ashkelon [20:48]:01/1...",Red Alert,"Ashdod, Hof Ashkelon",Ashdod,
7443,2025-10-05 04:59:22+03:00,"Red Alert at Beit Shemesh, Lod, Ness Ziona, Mo...",Red Alert,"Beit Shemesh, Lod, Ness Ziona, Modi'in-Maccabi...","Beit Shemesh, Lod, Ness Ziona, Modi'in-Maccabi...",
7444,2025-10-05 18:34:40+03:00,Red Alert at Eilat [18:34]:05/10/2025 18:34:40...,Red Alert,Eilat,Eilat,


Everything is fine now

In [20]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7446 entries, 0 to 7445
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype                    
---  ------             --------------  -----                    
 0   date               7446 non-null   datetime64[ns, UTC+03:00]
 1   text               7446 non-null   object                   
 2   threat_type        7445 non-null   object                   
 3   region             7445 non-null   object                   
 4   major_cities       2977 non-null   object                   
 5   regional_councils  510 non-null    object                   
dtypes: datetime64[ns, UTC+03:00](1), object(5)
memory usage: 349.2+ KB


In [21]:
# Checking how useful the "major_cities" column is
print('Number of major cities:', raw_df['major_cities'].nunique())
raw_df['major_cities'].unique()

Number of major cities: 664


array(['Tel Aviv, Holon, Herzliya, Ramat Gan, Bnei Brak, Or Yehuda, Yehud, Kiryat Ono, Bat Yam, Hod HaSharon, Kfar Saba',
       'Eilat', 'Ashkelon', None, "Jerusalem, Ma'ale Adumim",
       'Bat Yam, Tel Aviv, Holon, Ramat Gan, Bnei Brak',
       'Tel Aviv, Holon, Ramat Gan, Bnei Brak', 'Sderot', 'Kfar Saba',
       'Netivot', 'Sderot, Netivot, Ashkelon', 'Sderot, Ashkelon',
       'Gedera, Ashkelon, Ashdod, Sderot', 'Kiryat Gat', 'Ashdod',
       'Ofakim', "Ashkelon, Be'er Sheva", "Be'er Sheva",
       'Ofakim, Gedera, Ashdod',
       "None Be'er Sheva, Ashkelon, Ashdod, Rehovot, Yavne, Ofakim, Arad",
       'Ashdod, Ashkelon', 'Ashkelon, Ashdod',
       'Atlit, Or Akiva, bqa al-Gharbiyye, Binyamina, Zichron Yacov, Caesarea, umm al-Fahm, Harish',
       'Or Akiva, bqa al-Gharbiyye, Binyamina, Zichron Yacov, Caesarea, umm al-Fahm, Harish',
       'Hadera', 'Ashdod, Gdera',
       'Ashkelon, Sderot, Holon, Rishon LeZion', 'Gdera, Gan Yavne',
       'Tel Aviv - Jaffa, Bat Yam, Holon, Ri

In [22]:
raw_df['major_cities'].value_counts().to_frame()

Unnamed: 0_level_0,count
major_cities,Unnamed: 1_level_1
Sderot,434
Ashkelon,266
Kiryat Shmona,197
Shlomi,179
Metulla,172
...,...
"Sderot, Gan Yavne, Ashdod",1
"Rishon LeZion, Holon, Mikveh Israel, Bat Yam, Tel Aviv - Jaffa, Givatayim, Ramat Gan, Or Yehuda, Ness Ziona",1
"Sderot, Gan Yavne, Kiryat Malachi, Ashdod",1
"Rishon LeZion, Holon, Bat Yam, Yehud-Monosson, Or Yehuda, Bnei Brak, Givat Shmuel, Petach Tikva, Kiryat Ono, Ramat Gan, Be'er Yacov, Ness Ziona",1


Let's see how useful the 'major_cities' and 'regional_councils' columns are.

In [23]:
# Count the number of values that are either "None" (as a string) or missing (NaN, None)
none_count = raw_df['major_cities'].apply(lambda x: x == "None" or pd.isna(x)).sum()
none_count_rc = raw_df['regional_councils'].apply(lambda x: x == "None" or pd.isna(x)).sum()

print(f"Number of rows with 'None' or missing values in major_cities: {none_count}")
print(f"Number of rows with 'None' or missing values in regional_councils: {none_count_rc}")

Number of rows with 'None' or missing values in major_cities: 4469
Number of rows with 'None' or missing values in regional_councils: 6936


There are too many rows with missing values, meaning the columns are not very useful. I will not use it in the future.

**Raw_df Processing Summary**  

- **Extracted** data from **Telegram messages**.  
- **Removed** irrelevant rows (NaT dates, update messages).  
- **Fixed truncated messages** caused by split `div class="text"` blocks.  
- **Merged fragmented rows** (e.g., those starting with "•") into previous messages.  
- **Analyzed missing values**:  
  - `major_cities`: **4250 missing**  
  - `regional_councils`: **6613 missing**  
  - **Decided to exclude these columns** due to high data loss.  

The dataset is now cleaned, structured, and ready for further analysis.

## Creating a dataset with an indication of detailed localities

In [24]:
def extract_alerts(text):
    alerts = []
    
    # Regex pattern for datetime in the format "DD/MM/YYYY HH:MM:SS:"
    date_pattern = re.compile(r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}):')
    
    # Find the first date occurrence and remove content before it
    first_date_match = date_pattern.search(text)
    if not first_date_match:
        return alerts  # If no date found, return an empty list
    text = text[first_date_match.start():]
    
    # Remove any content after the first occurrence of "||"
    if "||" in text:
        text = text.split("||")[0]
    
    # Regex to split the text into blocks (each block starts with a datetime)
    block_pattern = re.compile(
        r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}):'
        r'(.*?)(?=(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}:)|$)',
        re.DOTALL
    )
    
    # Iterate over each alert block
    for block_match in block_pattern.finditer(text):
        dt = block_match.group(1).strip()         # Extracted datetime
        block_text = block_match.group(2).strip()   # Text following the datetime
        
        # Split block text using bullet (•)
        bullet_parts = [
            part.strip() for part in re.split(r'\u2022', block_text)
            if part.strip()
        ]
        
        for part in bullet_parts:
            # Remove unwanted phrases such as "Click here to open an interactive map"
            # and "Sent by @CumtaAlertsEnglishChannel"
            part = re.sub(
                r'(Click here to open an interactive map|Sent by @CumtaAlertsEnglishChannel).*$', 
                '', part
            ).strip()
            
            # Determine if we use a dash (-) or a colon (:) as the separator
            dash_pos = part.find('-')
            colon_pos = part.find(':')
            
            if dash_pos != -1 and (colon_pos == -1 or dash_pos < colon_pos):
                # Case 1: "Region - locality1, locality2, ..."
                region_part, localities_part = part.split('-', 1)
            elif colon_pos != -1:
                # Case 2: "Region: locality1, locality2, ..."
                region_part, localities_part = part.split(':', 1)
            else:
                continue  # If neither separator exists, skip this entry
            
            # Clean up the extracted region name
            region = region_part.strip()
            region = re.sub(
                r'(Click here to open an interactive map|Sent by @CumtaAlertsEnglishChannel).*$', 
                '', region
            ).strip()
            region = region.rstrip(':- ').strip()
            
            # Clean localities
            localities_part = re.sub(
                r'(Click here to open an interactive map|Sent by @CumtaAlertsEnglishChannel).*$', 
                '', localities_part
            ).strip()
            localities = [loc.strip() for loc in localities_part.split(',') if loc.strip()]
            
            # If a region name is too generic, treat it as part of the locality
            generic_regions = {"Dan Area", "Lachish Area", "HaSharon Region", "Negev Region"}
            if region in generic_regions:
                for loc in localities:
                    alerts.append({
                        'datetime': dt,
                        'region': loc,
                        'locality': region
                    })  # Swap region/locality
            else:
                for loc in localities:
                    alerts.append({
                        'datetime': dt,
                        'region': region,
                        'locality': loc
                    })  # Normal case
    
    return alerts

# Process each row in raw_df to extract alerts
all_rows = []
for _, row in raw_df.iterrows():
    text = row['text']
    threat_type = row['threat_type']
    alerts = extract_alerts(text)
    for alert in alerts:
        alert['threat_type'] = threat_type
        all_rows.append(alert)

# Create a new DataFrame with detailed alerts
detailed_df = pd.DataFrame(all_rows)

In [25]:
detailed_df

Unnamed: 0,datetime,region,locality,threat_type
0,26/12/2018 10:05:01,Dan 158,Tel Aviv (South West),Red Alert
1,26/12/2018 10:05:01,Dan 156,Tel Aviv (North),Red Alert
2,26/12/2018 10:05:01,Dan 157,Tel Aviv (Central),Red Alert
3,26/12/2018 10:05:01,Dan 159,Tel Aviv (South East),Red Alert
4,26/12/2018 10:05:01,Dan 162,Azur,Red Alert
...,...,...,...,...
129091,05/10/2025 04:59:22,Yarkon,Modi'in - Ishpro Center,Red Alert
129092,05/10/2025 18:34:40,Eilat,Eilat,Red Alert
129093,05/10/2025 18:32:20,Eilat,Eilat,Unrecognized Aircraft
129094,05/10/2025 18:33:24,Eilat,Eilat,Unrecognized Aircraft


In [26]:
# looking at the lines with the date of October 1, 2024 from 19 to 20 (the Iranian attack)
detailed_df[
    (detailed_df['datetime'] >= '01/10/2024 19:50:00') &
    (detailed_df['datetime'] < '01/10/2024 20:00:00')
].tail(20)

Unnamed: 0,datetime,region,locality,threat_type
68516,01/10/2024 19:58:36,West Negev,Bror Hayil,Red Alert
68517,01/10/2024 19:58:36,West Negev,Talmei Bilu,Red Alert
68518,01/10/2024 19:58:36,Yehuda,Neta,Red Alert
68519,01/10/2024 19:58:36,Yehuda,Shomria,Red Alert
68520,01/10/2024 19:58:36,Gaza Envelope,Gvaram,Red Alert
68521,01/10/2024 19:58:52,Shfelat Yehuda,Zekharia,Red Alert
68522,01/10/2024 19:58:52,Shfelat Yehuda,Kfar Zoharim,Red Alert
68523,01/10/2024 19:58:52,Shfelat Yehuda,Luzit,Red Alert
68524,01/10/2024 19:58:52,West Negev,Zru'a,Red Alert
68525,01/10/2024 19:58:52,West Negev,Havat Shikmim,Red Alert


In [27]:
detailed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129096 entries, 0 to 129095
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   datetime     129096 non-null  object
 1   region       129096 non-null  object
 2   locality     129096 non-null  object
 3   threat_type  129096 non-null  object
dtypes: object(4)
memory usage: 3.9+ MB


In [28]:
detailed_df['datetime'] = pd.to_datetime(detailed_df['datetime'], dayfirst=True, errors='coerce')

In [29]:
# Let's check what the contents of the first row of the column 'text' look like.
raw_df.loc[0, 'text']

'Red Alert at Dan (155,156,157,158,159,160,161,162,165), Sharon (140,141,143) [10:05]: 26/12/2018 10:05:01: • Dan 158 - Tel Aviv (South West) • Dan 156 - Tel Aviv (North) • Dan 157 - Tel Aviv (Central) • Dan 159 - Tel Aviv (South East) • Dan 162 - Azur, Holon, Mikveh Israel • Dan 155 - Galil Yam, Hakfar Hayarok, Herzliya, Kfar Shmariyahu, Ramat Hasharon • Dan 160 - Bnei Brak, Givatiim, Havat Shalem, Ramat Gan • Sharon 140 - Givat He\'\'n, Raanana, Givat He""n • Dan 161 - Ganei Tikvah, Givat Shmuel, Kfar Azar, Kiriyat Ono, Naveh Efal, Neve Efraim Monosson, Or Yehuda, Ramat Efal, Ramat Pinkas, Savion, Yahud Monoson, Ramat Gan - Bar Ilan University, Ramat Gan - Ramat Ef\'al & Tel Hashomer • Dan 165 - Bat Yam • Sharon 143 - Adanim, Elishama, Ganei Am, Hagor, Hod Hsharon, Horashim, Jaljulia, Kfar Bara, Kfar Kasem, Kfar Mal\'\'al, Matan, Matan, Naveh Yarak, Nirit, Oranit, Oranit, Ramot Hshavim, Sdei Hemed, Yarhiv, Yarkona, Kfar Mal""al • Sharon 141 - Beit Berl, Gan Chaim, Kfar Saba, Naveh Ya

In [30]:
# Check
detailed_df.head(20)

Unnamed: 0,datetime,region,locality,threat_type
0,2018-12-26 10:05:01,Dan 158,Tel Aviv (South West),Red Alert
1,2018-12-26 10:05:01,Dan 156,Tel Aviv (North),Red Alert
2,2018-12-26 10:05:01,Dan 157,Tel Aviv (Central),Red Alert
3,2018-12-26 10:05:01,Dan 159,Tel Aviv (South East),Red Alert
4,2018-12-26 10:05:01,Dan 162,Azur,Red Alert
5,2018-12-26 10:05:01,Dan 162,Holon,Red Alert
6,2018-12-26 10:05:01,Dan 162,Mikveh Israel,Red Alert
7,2018-12-26 10:05:01,Dan 155,Galil Yam,Red Alert
8,2018-12-26 10:05:01,Dan 155,Hakfar Hayarok,Red Alert
9,2018-12-26 10:05:01,Dan 155,Herzliya,Red Alert


### ⚠ Checking the distribution of data by year

In [31]:
# Convert the 'time' column to datetime format
detailed_df['datetime'] = pd.to_datetime(detailed_df['datetime'], 
                                         errors='coerce', format='%d/%m/%Y %H:%M:%S')

# Extract the year from the datetime column
detailed_df['year'] = detailed_df['datetime'].dt.year

# Count the number of warnings per year
warnings_by_year = detailed_df['year'].value_counts().sort_index()

# Calculate the percentage each year represents
percentages = (warnings_by_year / warnings_by_year.sum()) * 100

# Combine counts and percentages into a DataFrame
result_df = pd.DataFrame({
    'Count': warnings_by_year,
    'Percentage': percentages.round(2).astype(str) + '%'
})

# Display the result
print('Distribution of data by year in detailed_df:')
result_df

Distribution of data by year in detailed_df:


Unnamed: 0_level_0,Count,Percentage
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2018,60,0.05%
2019,4425,3.43%
2020,324,0.25%
2021,29368,22.75%
2022,1032,0.8%
2023,14786,11.45%
2024,32722,25.35%
2025,46379,35.93%


In [32]:
# Use the existing 'date' column which already contains datetime values
raw_df['year'] = raw_df['date'].dt.year

# Count the number of alerts per year
warnings_by_year = raw_df['year'].value_counts().sort_index()

# Calculate the percentage each year represents
percentages = (warnings_by_year / warnings_by_year.sum()) * 100

# Combine counts and percentages into a DataFrame
raw_result_df = pd.DataFrame({
    'Count': warnings_by_year,
    'Percentage': percentages.round(2).astype(str) + '%'  # Append '%' to the rounded percentages
})

# Display the result
print('Distribution of data by year in raw_df:')
raw_result_df


Distribution of data by year in raw_df:


Unnamed: 0_level_0,Count,Percentage
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2018,1,0.01%
2019,405,5.44%
2020,75,1.01%
2021,999,13.42%
2022,240,3.22%
2023,1926,25.87%
2024,3421,45.94%
2025,379,5.09%


⚠ WARNING ⚠ Issue Identified:

There is a noticeable discrepancy in the data distribution by year between the two datasets, `detailed_df` and `raw_df`.

- In **`detailed_df`**, the year **2021** accounts for **35.09%** of the dataset.
- In **`raw_df`**, the year **2021** represents only **14.06%** of the dataset.

Also, all our information initially consists of 15 files. The alarm lines for 2021 take up slightly less than 2 files out of 15, that is, about 13-14%


There is also a significant difference in 2025 (5.05% in **`raw_df`** compared to 34.75% in **`detailed_df`**), but this is explained by the change in the nature of the conflict - in 2025, the coverage area of each alert was much larger.

### Checking threat types

In [33]:
print('Number of threat types:', detailed_df['threat_type'].nunique())
detailed_df['threat_type'].unique()

Number of threat types: 5


array(['Red Alert', 'Unrecognized Aircraft', 'Terrorist Infiltration',
       'Interception pieces', 'Earthquake'], dtype=object)

### Checking the 'region' and 'locality' columns

In [34]:
print('Number of regions:', detailed_df['region'].nunique())
detailed_df['region'].unique()

Number of regions: 185


array(['Dan 158', 'Dan 156', 'Dan 157', 'Dan 159', 'Dan 162', 'Dan 155',
       'Dan 160', 'Sharon 140', 'Dan 161', 'Dan 165', 'Sharon 143',
       'Sharon 141', 'Eilat 311', 'Arabah 310', 'Lakhish 246',
       'Gaza Containment Zone 225', 'Gaza Containment Zone 224',
       'Jerusalem 194', 'Maale Adumim 200', 'Samaria 127',
       'Gaza Containment Zone 236',
       'Central Negev / Gaza Containment Zone 238',
       'Gaza Containment Zone 237', 'Gaza Containment Zone 230',
       'Gaza Containment Zone 220', 'Gaza Containment Zone 219',
       'Gaza Containment Zone 221', 'Hefer 139',
       'Gaza Containment Zone 218', 'Central Negev 254',
       'Central Negev 255', 'Central Negev / Gaza Containment Zone 216',
       'Gaza Containment Zone 223', 'Gaza Containment Zone 233',
       'Gaza Containment Zone 232', 'Gaza Containment Zone 231',
       'Gaza Containment Zone 228', 'Gaza Containment Zone 222',
       'Gaza Containment Zone 217', 'Lakhish 247', 'Gaza containment 224',
     

In [35]:
print('Number of localities:', detailed_df['locality'].nunique())
print(detailed_df['locality'].unique())

Number of localities: 1730
['Tel Aviv (South West)' 'Tel Aviv (North)' 'Tel Aviv (Central)' ...
 'Ardanel Ranch' 'ברחבי הארץ' 'Azuz']


In [36]:
# --- Hidden because of the large size ---
# Check how the localities were divided to fix the function, if necessary.
#detailed_df['locality'].to_string()

### Checking localities and regions in Hebrew

In [37]:
# Checking if there are names of localities in Hebrew.
detailed_df[detailed_df['locality'].str.contains(r'[\u0590-\u05FF]', na=False)]

Unnamed: 0,datetime,region,locality,threat_type,year
2911,2019-06-13 00:14:56,נירים,נירים,Red Alert,2019
3007,2019-07-31 10:05:05,כרכור,כרכור,Red Alert,2019
3008,2019-07-31 10:05:05,פרדס חנה,פרדס חנה,Red Alert,2019
3063,2019-07-31 10:06:12,כרכור,כרכור,Red Alert,2019
3064,2019-07-31 10:06:12,פרדס חנה,פרדס חנה,Red Alert,2019
...,...,...,...,...,...
59602,2024-09-21 14:00:37,צפת,עיר - צפת - עיר,Red Alert,2024
59631,2024-09-21 17:47:51,לב החולה,לב החולה,Unrecognized Aircraft,2024
97409,2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,2025
97734,2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,2025


In [38]:
# Display all the localities in Hebrew with the number of times
detailed_df[detailed_df['locality'].str.contains(r'[\u0590-\u05FF]', 
                                                 na=False)].value_counts().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,count
datetime,region,locality,threat_type,year,Unnamed: 5_level_1
2021-05-12 02:50:12,כוחלה מכחול,כוחלה מכחול,Red Alert,2021,20
2021-05-12 02:51:10,כוחלה מכחול,כוחלה מכחול,Red Alert,2021,19
2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,2025,2
2022-08-07 15:23:32,אשדוד,מרינה,Red Alert,2022,2
2022-08-07 15:23:32,אשדוד,יז,Red Alert,2022,2
...,...,...,...,...,...
2024-09-13 13:03:59,צפת,נוף כנרת - צפת - נוף כנרת,Unrecognized Aircraft,2024,1
2024-09-13 01:12:05,צפת,עכברה - צפת - עכברה,Red Alert,2024,1
2024-09-13 01:12:05,צפת,עיר - צפת - עיר,Red Alert,2024,1
2024-09-13 01:11:33,צפת,עיר - צפת - עיר,Red Alert,2024,1


In [39]:
translation_dict = {
    "צפת": "Safed",
    "נירים": "Nirim",
    "אשדוד": "Ashdod",
    "איבים": "Ibim",
    "שדרות": "Sderot",
    "קריית שמונה": "Kiryat Shmona",
    "ניר עם": "Nir Am",
    "לב החולה": "Lev Ha-Hula",
    "פרדס חנה": "Pardes Hanna - Karkur",
    "כוחלה מכחול": "Kokhav Michael",
    "שוהם": "Shoham",
    "כרכור": "Karkur",
    "כפר נחום": "Kfar Naḥum",
    "מצוק עורבים": "Orevim Cliff",
    "איירפורט סיטי": "Airport City",
    "אזור תעשייה רמת גן": "Ramat Gan Industrial Zone",
    "רמת טראמפ": "Ramat Trump",
    "רכסים נהר הירדן": "Jordan River Terraces",
    "אזור תעשייה תעשייה רמת גן": "Ramat Gan Industrial Area",
    "חוות אירוח גורן": "Goren Guest Farm",
    "תחנה": "Station",
    "מוזיאון כוכבים רעים": "Bad Stars Museum",
    "מלון אחוזת ירדן": "Jordan Estate Hotel",
    "מיני ישראל": "Mini Israel",
    "נאות קדומים": "Neot Kedumim",
    "חניון הנגב מהר": "Negev Fast Parking Lot",
    "מלון מונזון": "Monzon Hotel",
    "יהוד מונוסון": "Yehud Monoson",
    "גני יהודה": "Ganei Yehuda",
    "רמת טראמפ": "Ramat Trump",
    "צוק עורבים": "Ravens' Cliff",
    "מכון תבנית מהר": "Fast Template Institute",
    "מבטחים עמיעוז ישע": "Mivtachim Ami'oz Yesha",
    "רפטינג נהר הירדן": "Jordan River Rafting",
    "אזור תעשייה רגמ": "Regem Industrial Zone",
    "חניון הנתיב מהיר": "Fast Lane Parking Lot",
    "מודיעין מכבים רעות": "Modiin Maccabim Reut",
    "נילי": "Nili",
    "טבחה": "Tabgha",
    "איזור תעשייה מילואות צפון": "Miluot North Industrial Zone"
}



# Function to replace entire string if it contains a Hebrew name
def replace_entire_entry(text, translation_dict):
    if pd.isna(text):  # Handle NaN values safely
        return text
    for hebrew, english in translation_dict.items():
        if hebrew in text:  # If any Hebrew name appears in the text
            return english  # Replace the entire text with the English name
    return text  # Otherwise, keep the original value

# Apply the function to 'locality' and 'region' columns
detailed_df['locality'] = (
    detailed_df['locality']
    .astype(str)
    .str.strip()
    .apply(lambda x: replace_entire_entry(x, translation_dict))
)

detailed_df['region'] = (
    detailed_df['region']
    .astype(str)
    .str.strip()
    .apply(lambda x: replace_entire_entry(x, translation_dict))
)


In [40]:
# Display all the localities in Hebrew with the number of times
detailed_df[detailed_df['locality'].str.contains(r'[\u0590-\u05FF]', 
                                                 na=False)].value_counts().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,count
datetime,region,locality,threat_type,year,Unnamed: 5_level_1
2022-08-06 12:01:18,Ashdod,טו,Red Alert,2022,2
2022-08-06 12:01:18,Ashdod,יב,Red Alert,2022,2
2022-08-06 12:01:18,Ashdod,יז,Red Alert,2022,2
2022-08-06 12:01:18,Ashdod,מרינה,Red Alert,2022,2
2022-08-07 15:23:32,Ashdod,טו,Red Alert,2022,2
2022-08-07 15:23:32,Ashdod,יב,Red Alert,2022,2
2022-08-07 15:23:32,Ashdod,יז,Red Alert,2022,2
2022-08-07 15:23:32,Ashdod,מרינה,Red Alert,2022,2
2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,2025,2
2022-08-06 12:01:18,Ashdod,יא,Red Alert,2022,1


In [41]:
# List of Hebrew words to be removed from the 'locality' column
words_to_remove = ["טו", "יב", "יז", "מרינה", "יא", "סיט"]

# Filter the DataFrame to exclude rows where 'locality' contains any of the specified words
detailed_df = detailed_df[~detailed_df['locality'].isin(words_to_remove)]

# Display all the localities in Hebrew with the number of times
detailed_df[detailed_df['locality'].str.contains(r'[\u0590-\u05FF]', 
                                                 na=False)].value_counts().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,count
datetime,region,locality,threat_type,year,Unnamed: 5_level_1
2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,2025,2
2025-06-20 15:44:15,ברחבי הארץ,ברחבי הארץ,Red Alert,2025,1


I have translated all the Hebrew names into their English versions. 

In [42]:
# Check again
detailed_df['region'].unique()

array(['Dan 158', 'Dan 156', 'Dan 157', 'Dan 159', 'Dan 162', 'Dan 155',
       'Dan 160', 'Sharon 140', 'Dan 161', 'Dan 165', 'Sharon 143',
       'Sharon 141', 'Eilat 311', 'Arabah 310', 'Lakhish 246',
       'Gaza Containment Zone 225', 'Gaza Containment Zone 224',
       'Jerusalem 194', 'Maale Adumim 200', 'Samaria 127',
       'Gaza Containment Zone 236',
       'Central Negev / Gaza Containment Zone 238',
       'Gaza Containment Zone 237', 'Gaza Containment Zone 230',
       'Gaza Containment Zone 220', 'Gaza Containment Zone 219',
       'Gaza Containment Zone 221', 'Hefer 139',
       'Gaza Containment Zone 218', 'Central Negev 254',
       'Central Negev 255', 'Central Negev / Gaza Containment Zone 216',
       'Gaza Containment Zone 223', 'Gaza Containment Zone 233',
       'Gaza Containment Zone 232', 'Gaza Containment Zone 231',
       'Gaza Containment Zone 228', 'Gaza Containment Zone 222',
       'Gaza Containment Zone 217', 'Lakhish 247', 'Gaza containment 224',
     

### Checking the number of types of city names

We have relatively large cities that are divided into zones. Let's check how many such zones there are

In [43]:
# List of major cities
major_cities = [
    "Eilat", "Modi'in", "Hadera", "Caesarea", "Ramat Gan", "Haifa", "Bat Yam"
    "Beer Sheva", "Netanya", "Rishon LeZion", "Tel Aviv", "Ashdod", "Herzeliya", "Jerusalem", 
    "Beit Shemesh", "Ashkelon", "Rehovot", "Lod", "Ramla", "Holon", "Bat Yam", "Kfar Saba", 
    "Petach Tikva", "Tiberias", "Nahariya", "Safed", "Kiryat Shmona", "Acre",
    "Ma'ale Adumim", "Ariel", "Nazareth", "Atlit", "Sderot", "Ofakim", 
    "Dimona", "Yavne", "Kiryat Gat", "Kiryat Malakhi", "Migdal HaEmek", "Or Akiva",
    "Be'er Ya'akov", "Givatayim", "Yokneam Illit", "Tirat Carmel", "Karmiel", "Arad",
    "Ma'alot-Tarshiha", "Kiryat Ata", "Kiryat Bialik", "Kiryat Yam", "Kiryat Motzkin",
    "Nesher", "Afula", "Bnei Brak", "Kiryat Ono", "Or Yehuda", "Ramat HaSharon", 
    "El'ad", "Ganei Tikva", "Giv'at Shmuel", "Hod Hasharon", "Kafr Qasim", "Kfar Yona", 
    "Ness Ziona", "Qalansawe", "Ra'anana", "Rosh HaAyin", "Tayibe", "Tira", 
    "Yehud-Monosson", "Pardes Hanna-Karkur"
]

def get_localities_by_city(df, major_cities, column='locality'):
    """
    Function to retrieve unique localities for each major city.
    
    :param df: DataFrame containing the data.
    :param major_cities: List of major cities to check.
    :param column: Name of the column to search for localities (default is 'locality').
    :return: Dictionary where the key is the city and the value is a list of unique localities.
    """
    city_localities = {}
    for city in major_cities:
        # Filter the DataFrame and extract unique localities for the current city
        localities = df[df[column].str.contains(city, case=False, na=False)][column].unique()
        city_localities[city] = localities.tolist()  # Convert to a list for easier handling
        print(f"City: {city}, Localities: {city_localities[city]}")  # Debug output
        print("=" * 40)  # Separator line
    return city_localities

# Call the function
localities_by_city = get_localities_by_city(detailed_df, major_cities)

City: Eilat, Localities: ['Eilat']
City: Modi'in, Localities: ["Modi'in", "Modi'in - Ishpro Center", "Modi'in - Ligad Center", "Hevel Modi'in Industrial Zone", "Modi'in Illit", "Regional Council Hevel Modi'in Industrial Park", "Modi'in Maccabim Re'ut"]
City: Hadera, Localities: ['Hadera - East', 'Hadera - Center', 'Hadera - Neveh Haim', 'Hadera - West']
City: Caesarea, Localities: ['Caesarea Industrial Zone', 'Caesarea', 'Caesarea Marine Center']
City: Ramat Gan, Localities: ['Ramat Gan', 'Ramat Gan - Bar Ilan University', "Ramat Gan - Ramat Ef'al & Tel Hashomer", 'Ramat Gan - East', 'Ramat Gan - West']
City: Haifa, Localities: ['Haifa - Carmel and Lower City', 'Haifa - West', "Haifa - Ramot HaCarmel and Neveh Sha'anan", 'Haifa - Kiryat Haim & Kiryat Shmuel', 'Haifa Bay', 'Haifa - Bay', 'Haifa - Carmel']
City: Bat YamBeer Sheva, Localities: []
City: Netanya, Localities: ['Netanya - East', 'Netanya - West']
City: Rishon LeZion, Localities: ['Rishon LeZion - West', 'Rishon LeZion - East'

Let's create a function that retrieves unique localities for a given city from a DataFrame. This function will filter the data based on the city name, extract distinct localities from the specified column, and return them as a list. Additionally, it will display the results in a structured format for better readability. This can be useful for analyzing geographic data and understanding the distribution of localities within a city.

In [44]:
def get_localities_by_city(df, city, column='locality'):
    """
    Function to retrieve unique localities for a specific city.

    :param df: DataFrame containing the data.
    :param city: The city to check (as a string).
    :param column: Name of the column to search for localities (default is 'locality').
    :return: List of unique localities for the specified city.
    """
    # Filter the DataFrame and extract unique localities for the city
    localities = df[df[column].str.contains(city, case=False, na=False)][column].unique()
    
    # Convert to list for easier handling
    localities_list = localities.tolist()
    
    # Print result for the city
    print(f"City: {city}")
    print("=" * 40)  # Separator line
    print("Localities:")
    if localities_list:
        print(", ".join(localities_list))
    else:
        print("No localities found")
    print("=" * 40)  # Separator line

    return localities_list


⚠ The following check of the city zones was done using the ChatGPT to speed up and may contain errors. ⚠

In [45]:
get_localities_by_city(detailed_df, "Tel Aviv")

City: Tel Aviv
Localities:
Tel Aviv (South West), Tel Aviv (North), Tel Aviv (Central), Tel Aviv (South East), Tel Aviv - South and Jaffa, Tel Aviv - East, Tel Aviv - Across the Yarkon, Tel Aviv - City Center


['Tel Aviv (South West)',
 'Tel Aviv (North)',
 'Tel Aviv (Central)',
 'Tel Aviv (South East)',
 'Tel Aviv - South and Jaffa',
 'Tel Aviv - East',
 'Tel Aviv - Across the Yarkon',
 'Tel Aviv - City Center']

The following entries are most likely part of Tel Aviv city in its classic sense: 'Tel Aviv (South West)',  'Tel Aviv (North)' 
 'Tel Aviv (Central) 
 'Tel Aviv (South East ',
 'Tel Aviv - City Ce.
te**Jaffa (Yafo)** is a distinct historic area often treated separately due to its unique cultural and historical significance. But let's classify this area as Tel Aviv, since it is officially part of the Tel Aviv-Yafo municipality.
r']



In [46]:
get_localities_by_city(detailed_df, "Jerusalem")

City: Jerusalem
Localities:
Jerusalem, Jerusalem - East, Jerusalem - North and Alonim, Jerusalem - North, Jerusalem - Atarot Industrial Zone, Jerusalem - South, Jerusalem - Qafr 'Aqab, Jerusalem - Center, Jerusalem - West


['Jerusalem',
 'Jerusalem - East',
 'Jerusalem - North and Alonim',
 'Jerusalem - North',
 'Jerusalem - Atarot Industrial Zone',
 'Jerusalem - South',
 "Jerusalem - Qafr 'Aqab",
 'Jerusalem - Center',
 'Jerusalem - West']

Almost all entries from the list are officially part of Jerusalem, except 'Jerusalem - North and Alonim', which is potentially ambiguous, because "Alonim" may refer to areas adjacent to Jerusalem or informal names not officially within its boundaries. However, it should be included in Jerusalem for consistency, as such entries are often culturally or geographically associated with the city, ensuring unified classification and avoiding fragmentation in the dataset.

In [47]:
get_localities_by_city(detailed_df, "Ashdod")

City: Ashdod
Localities:
Hatzor Ashdod, Ashdod, Ashdod - Yod Alef, Ashdod - Northern Industrial Zone and port, Ashdod - Alef, Ashdod - Gimmel, Ashdod - Het, Ashdod - Northen Industrial Zone, Ashdod-11, Ashdod Yacov Ichud


['Hatzor Ashdod',
 'Ashdod',
 'Ashdod - Yod Alef',
 'Ashdod - Northern Industrial Zone and port',
 'Ashdod - Alef',
 'Ashdod - Gimmel',
 'Ashdod - Het',
 'Ashdod - Northen Industrial Zone',
 'Ashdod-11',
 'Ashdod Yacov Ichud']

**'Hatzor Ashdod'** is not a part of Ashdod proper. The rest of the zones are officially part of it.

In [48]:
get_localities_by_city(detailed_df, "Ashkelon")

City: Ashkelon
Localities:
Ashkelon, Ashkelon Industrial Area, Ashkelon Northern Industrial Zone, Ashkelon Southern Industrial Zone, Ashkelon - North, Ashkelon - South


['Ashkelon',
 'Ashkelon Industrial Area',
 'Ashkelon Northern Industrial Zone',
 'Ashkelon Southern Industrial Zone',
 'Ashkelon - North',
 'Ashkelon - South']

⚠There are differing opinions in various sources regarding whether the industrial zones are officially part of Ashkelon. However, given their close proximity to residential areas, I assume their inclusion within the city for the purpose of analysis.⚠


In [49]:
get_localities_by_city(detailed_df, "Modi'in")

City: Modi'in
Localities:
Modi'in, Modi'in - Ishpro Center, Modi'in - Ligad Center, Hevel Modi'in Industrial Zone, Modi'in Illit, Regional Council Hevel Modi'in Industrial Park, Modi'in Maccabim Re'ut


["Modi'in",
 "Modi'in - Ishpro Center",
 "Modi'in - Ligad Center",
 "Hevel Modi'in Industrial Zone",
 "Modi'in Illit",
 "Regional Council Hevel Modi'in Industrial Park",
 "Modi'in Maccabim Re'ut"]

The following zones are officially part of Modi'in-Maccabim-Re'ut (the city): **Modi'in - Ishpro Center**, **Modi'in - Ligad Center**, **Modi'in-Maccabim-Re'ut**, **Modi'in Maccabim Re'ut**.
The remaining entries, such as **Hevel Modi'in**, **Modi'in Illit**, and **industrial zones**, refer to nearby regional councils or separate municipalities and are not officially part of the city.

In [50]:
get_localities_by_city(detailed_df, "Ramat Gan")

City: Ramat Gan
Localities:
Ramat Gan, Ramat Gan - Bar Ilan University, Ramat Gan - Ramat Ef'al & Tel Hashomer, Ramat Gan - East, Ramat Gan - West


['Ramat Gan',
 'Ramat Gan - Bar Ilan University',
 "Ramat Gan - Ramat Ef'al & Tel Hashomer",
 'Ramat Gan - East',
 'Ramat Gan - West']

The entry "Ramat Gan - Ramat Ef'al & Tel Hashomer" partially refers to areas outside the city (e.g., Tel Hashomer, which belongs to Kiryat Ono).  The rest of the zones are officially part of the city.

In [51]:
get_localities_by_city(detailed_df, "Haifa")

City: Haifa
Localities:
Haifa - Carmel and Lower City, Haifa - West, Haifa - Ramot HaCarmel and Neveh Sha'anan, Haifa - Kiryat Haim & Kiryat Shmuel, Haifa Bay, Haifa - Bay, Haifa - Carmel


['Haifa - Carmel and Lower City',
 'Haifa - West',
 "Haifa - Ramot HaCarmel and Neveh Sha'anan",
 'Haifa - Kiryat Haim & Kiryat Shmuel',
 'Haifa Bay',
 'Haifa - Bay',
 'Haifa - Carmel']

The entry **"Haifa - Kiryat Haim & Kiryat Shmuel"** refers to neighborhoods that are administratively part of Haifa but are often considered distinct communities within the city. Other zones are officially part of Haifa.

There are such inscriptions in the dataset in the areas of Haifa
• Menashe - Haifa - Ramot HaCarmel and Neveh Sha'anan, Haifa - Carmel, Hadar and Downtown Lower City.

That is, we have 'Hadar and Downtown Lower City'
We should divide them into 'Haifa - Hadar' and 'Haifa - Downtown Lower City'

⚠ Also, there is a special case with the name Carmel ⚠

In [52]:
get_localities_by_city(detailed_df, "Carmel")

City: Carmel
Localities:
Yearot HaCarmel, Geva Carmel, Ein Carmel, Carmel, Haifa - Carmel and Lower City, Haifa - Ramot HaCarmel and Neveh Sha'anan, Tirat Carmel, Carmel Forest Spa Resort, Mevo Carmel Industrial Zone, Haifa - Carmel


['Yearot HaCarmel',
 'Geva Carmel',
 'Ein Carmel',
 'Carmel',
 'Haifa - Carmel and Lower City',
 "Haifa - Ramot HaCarmel and Neveh Sha'anan",
 'Tirat Carmel',
 'Carmel Forest Spa Resort',
 'Mevo Carmel Industrial Zone',
 'Haifa - Carmel']

df[df['locality'] == 'Ashdod']Issue Explanation:⚠

There is a **naming conflict** with the term **"Carmel"** in your dataset. **Carmel** can refer to multiple localities, and it’s important to differentiate between them to avoid misclassification.


The following **localities are officially neighborhoods or areas within Haifa**:

1. **Haifa - Carmel and Lower City**  
2. **Haifa - Ramot HaCarmel and Neveh Sha'anan**  
3. **Haifa - Carmel**

These names represent **recognized neighborhoods** in Haifa, situated on **Mount Carmel**, which is a central geographical feature of the city.



These localities are **close to Haifa** and associated with the **Carmel region** but are **not administratively part of the city**:

1. **Tirat Carmel** – A separate city located just south of Haifa.
2. **Carmel Forest Spa Resort** – A famous resort located in the Carmel mountain range near Haifa.
3. **Mevo Carmel Industrial Zone** – An industrial area near the Carmel region but not within Haifa’s official boundaries.
4. **Hof HaCarmel** – Refers to the **Carmel Coast Regional Council**, which is a separate administrative region near Haifa.


Some localities contain the name "Carmel" but are **not related to Haifa**:

1. **Yehuda - Carmel**  This is an **independent locality** located in the **Har Hevron (Mount Hebron) Regional Council** in the southern part of Israel, far from Haifa.

2. **Yearot HaCarmel, Geva Carmel, Ein Carmel**  These are **villages or settlements** in the broader Carmel region but **not part of Haifa's municipal jurisdiction**.


In [53]:
get_localities_by_city(detailed_df, 'Bat Yam')

City: Bat Yam
Localities:
Bat Yam


['Bat Yam']

Also 'Bat-Yam'

In [54]:
# Replace 'Bat-Yam' with 'Bat Yam' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Bat-Yam', 'Bat Yam', case=False, regex=True)

In [55]:
get_localities_by_city(detailed_df, 'Acre')

City: Acre
Localities:
Acre


['Acre']

Acre, but there are also 'Akko', 'Acco'

In [56]:
# Filter rows where 'locality' contains 'Akko' or 'Acco'
akko_variants = detailed_df[detailed_df['locality'].str.contains(r'Akko|Acco', case=False, na=False)]

# Display unique values
unique_akko_variants = akko_variants['locality'].unique()

# Print results
print("Unique locality values containing 'Akko' or 'Acco':")
for variant in unique_akko_variants:
    print(variant)


Unique locality values containing 'Akko' or 'Acco':
Acco - Industrial Zone
Akko New Cemetery


In [57]:
# Replace 'Acco' and 'Akko' with 'Acre' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Acco|Akko', 'Acre', case=False, regex=True)

In [58]:
get_localities_by_city(detailed_df, "Hadera")

City: Hadera
Localities:
Hadera - East, Hadera - Center, Hadera - Neveh Haim, Hadera - West


['Hadera - East', 'Hadera - Center', 'Hadera - Neveh Haim', 'Hadera - West']

All the listed zones are officially part of Hadera

In [59]:
get_localities_by_city(detailed_df, "Petach Tikva")

City: Petach Tikva
Localities:
Petach Tikva


['Petach Tikva']

Also 'Petah Tikva'

In [60]:
# Replace 'Petach Tikva' with 'Petah Tikva' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Petach Tikva', 'Petah Tikva', case=False, regex=True)

In [61]:
get_localities_by_city(detailed_df, "Caesarea")

City: Caesarea
Localities:
Caesarea Industrial Zone, Caesarea, Caesarea Marine Center


['Caesarea Industrial Zone', 'Caesarea', 'Caesarea Marine Center']

The following zones are officially part of Caesarea: Caesarea, Caesarea Marine Center
The Caesarea Industrial Zone is a nearby industrial area but is not part of the residential or municipal core of Caesarea. But we will designate it as a part of the city, because we have done the same for the industrial zones of other cities.

In [62]:
get_localities_by_city(detailed_df, "Herzeliya")

City: Herzeliya
Localities:
Herzeliya - Pituach, Herzeliya - Center and Glil Yam


['Herzeliya - Pituach', 'Herzeliya - Center and Glil Yam']

All the listed zones are officially part of Herzeliya.

There is also 'Herzliya'

In [63]:
get_localities_by_city(detailed_df, "Herzliya")

City: Herzliya
Localities:
Herzliya, Herzliya - West


['Herzliya', 'Herzliya - West']

In [64]:
# Replace 'Herzliya' with 'Herzeliya' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Herzliya', 'Herzeliya', case=False, regex=True)


In [65]:
get_localities_by_city(detailed_df, "Rehovot")

City: Rehovot
Localities:
Rehovot, Rehovot Science Park


['Rehovot', 'Rehovot Science Park']

All the listed zones are officially part of Rehovot

In [66]:
get_localities_by_city(detailed_df, "Ramla")

City: Ramla
Localities:
Nesher Industrial Zone (Ramla), Ramla


['Nesher Industrial Zone (Ramla)', 'Ramla']

Both of the mentioned localities are part of the city of Ramla:

**Ramla** – This is the city itself, located in the central district of Israel.

**Nesher Industrial Zone (Ramla)** – This is an industrial zone located within Ramla. It is named after Nesher, one of Israel’s leading cement manufacturers, and is situated within the city limits.

In [67]:
get_localities_by_city(detailed_df, "Nahariya")

City: Nahariya
Localities:
Nahariya, Nahariya Cemetery


['Nahariya', 'Nahariya Cemetery']

Both Nahariya and Nahariya Cemetery are part of the city:

**Nahariya** – This is the city itself, located in the Northern District of Israel.

**Nahariya Cemetery** – This is the city's cemetery, situated within the boundaries of Nahariya

In [68]:
get_localities_by_city(detailed_df, "Safed")

City: Safed
Localities:
Safed, Safed - 'Akbara, Safed - City, Safed - Nof ha-Kinneret


['Safed', "Safed - 'Akbara", 'Safed - City', 'Safed - Nof ha-Kinneret']

Belongs to the city:
- **Safed** – Refers to the city itself, located in the Northern District of Israel.  
- **Safed - City** – Specifically refers to the central urban area of Safed.  
- **Safed - Nof ha-Kinneret** – A neighborhood or area within the municipal boundaries of Safed.  
- **Safed - 'Akbara** – Officially part of the city of Safed since 1977. 'Akbara is administered as a neighborhood within the city's municipal boundaries.  

So, **all** the listed localities belong to **Safed**.

In [69]:
get_localities_by_city(detailed_df, "Nazareth")

City: Nazareth
Localities:
Nazareth


['Nazareth']

Belongs to the city:
- **Nazareth** – This is the city itself, located in the Northern District of Israel. It is the largest Arab city in the country and has its own independent municipality.

Does **not** officially belong to the city:
- **Nof HaGalil (Nazareth Illit)** – This is a separate city with its own municipality. Although it was originally established as a Jewish suburb of Nazareth, it became an independent city and officially changed its name from *Nazareth Illit* to *Nof HaGalil* in 2019.

In [70]:
get_localities_by_city(detailed_df, "Dimona")

City: Dimona
Localities:
Dimona Industrial Zone, Dimona


['Dimona Industrial Zone', 'Dimona']

**Both Dimona and Dimona Industrial Zone officially belong to the city of Dimona.**

- **Dimona**  
This is the city itself, located in the Southern District of Israel. It has its own municipality and administrative boundaries.

- **Dimona Industrial Zone**  
This industrial area is officially part of Dimona’s municipal jurisdiction. It serves as the city’s hub for industrial and economic activities but is distinct from the residential areas.

In [71]:
get_localities_by_city(detailed_df, "Yavne")

City: Yavne
Localities:
Gan Yavne, Yavne Region Industries, Yavne, Yavne Industrial Zone, Kvutzat Yavne, Kerem Yavneh, Kerem BeYavne, Yavne'el, Sheni LeYavne


['Gan Yavne',
 'Yavne Region Industries',
 'Yavne',
 'Yavne Industrial Zone',
 'Kvutzat Yavne',
 'Kerem Yavneh',
 'Kerem BeYavne',
 "Yavne'el",
 'Sheni LeYavne']

Only **Yavne** and the **Yavne Industrial Zone** are officially part of the city of Yavne. The other localities are separate entities and do not fall under Yavne's municipal jurisdiction. 

**Officially Part of Yavne:**

- **Yavne**: The city itself, located in central Israel.

- **Yavne Industrial Zone**: An industrial area within Yavne's municipal boundaries, serving as a hub for the city's industrial and economic activities.

**Not Officially Part of Yavne:**

- **Gan Yavne**: A local council situated east of Ashdod, operating as an independent municipality separate from Yavne.

- **Yavne Region Industries**: This term likely refers to industrial areas in the broader Yavne region but not necessarily within Yavne's city limits.

- **Hevel Yavne**: A regional council encompassing several communities in the area surrounding Yavne, but not part of the city itself.

- **Kvutzat Yavne**: A religious kibbutz located near Yavne, falling under the jurisdiction of the Hevel Yavne Regional Council. ([en.wikipedia.org](https://en.wikipedia.org/wiki/Kvutzat_Yavne?utm_source=chatgpt.com))

- **Kerem Yavneh (Kerem BeYavne)**: A yeshiva and youth village adjacent to Kvutzat Yavne, also under the Hevel Yavne Regional Council. ([en.wikipedia.org](https://en.wikipedia.org/wiki/Yeshivat_Kerem_B%27Yavneh?utm_source=chatgpt.com))

- **Yavne'el**: A moshava in northern Israel, not geographically or administratively connected to the city of Yavne.

- **Sheni LeYavne**: This term translates to "Second to Yavne" but does not correspond to a recognized locality within or near Yavne.


In [72]:
get_localities_by_city(detailed_df, "Kiryat Gat")

City: Kiryat Gat
Localities:
Kiryat Gat, Kiryat Gat - Industrial Zone


['Kiryat Gat', 'Kiryat Gat - Industrial Zone']

Both **Kiryat Gat** and **Kiryat Gat - Industrial Zone** officially belong to the city of **Kiryat Gat**.

**Kiryat Gat**  
This is the city itself, located in the Southern District of Israel. It has its own municipality and serves as a regional center for the surrounding area.

**Kiryat Gat - Industrial Zone**  
This industrial zone is officially part of Kiryat Gat’s municipal jurisdiction. It includes major industrial facilities and tech companies, contributing significantly to the city’s economy.

In [73]:
get_localities_by_city(detailed_df, "Yokneam Illit")

City: Yokneam Illit
Localities:
Yokneam Illit Industrial Zone, Yokneam Illit


['Yokneam Illit Industrial Zone', 'Yokneam Illit']

 Both **Yokneam Illit** and **Yokneam Illit Industrial Zone** are officially part of the city of **Yokneam Illit**.

**Yokneam Illit**   
This is the city itself, located in northern Israel at the base of the Carmel Mountains. It has its own municipality and is known for its thriving high-tech industry.

**Yokneam Illit Industrial Zone**   
This industrial area is officially part of Yokneam Illit’s municipal jurisdiction. It serves as a key hub for technological companies and industrial activities, contributing significantly to the city’s economy.



In [74]:
get_localities_by_city(detailed_df, "Arad")

City: Arad
Localities:
Tel Arad and El Pura, Arad, Tel Arad


['Tel Arad and El Pura', 'Arad', 'Tel Arad']

Only **Arad** is officially part of the city of **Arad**. The other localities mentioned are separate entities and do not fall under Arad's municipal jurisdiction. 

Officially Part of Arad:

- **Arad**: This is the city itself, located in the Southern District of Israel, on the border of the Negev and Judean Deserts. It has its own municipality and administrative boundaries. ([en.wikipedia.org](https://en.wikipedia.org/wiki/Arad%2C_Israel?utm_source=chatgpt.com))

Not Officially Part of Arad:

- **Tel Arad**: An archaeological site situated approximately 10 kilometers west of the modern city of Arad. It features the remains of a fortified Canaanite city and Israelite fortresses. Tel Arad is a national park and is not within the municipal boundaries of Arad. ([en.wikipedia.org](https://en.wikipedia.org/wiki/Tel_Arad?utm_source=chatgpt.com))

- **El Pura**: This locality is not widely recognized in available sources and does not appear to be officially associated with the city of Arad.

In [75]:
get_localities_by_city(detailed_df, "Kiryat Bialik")

City: Kiryat Bialik
Localities:
Kiryat Bialik Industrial Zone


['Kiryat Bialik Industrial Zone']

**The Kiryat Bialik Industrial Zone** is officially part of the city of **Kiryat Bialik**.

In [76]:
get_localities_by_city(detailed_df, "Kiryat Bialik")

City: Kiryat Bialik
Localities:
Kiryat Bialik Industrial Zone


['Kiryat Bialik Industrial Zone']

But there is also 'Kiryat Biyalik' in the dataset.

In [77]:
# Replace 'Herzliya' with 'Herzeliya' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Kiryat Biyalik', 'Kiryat Bialik', case=False, regex=True)

### Updating the dataset (dividing zones)

#### Haifa
That is, we have 'Hadar and Downtown Lower City'
We should divide them into 'Haifa - Hadar' and 'Haifa - Downtown Lower City'

In [78]:
expanded_rows = []
for _, row in detailed_df.iterrows():
    if row['locality'] == 'Hadar and Downtown Lower City':
        localities = ['Haifa - Hadar', 'Haifa - Downtown Lower City']
    elif row['locality'] == 'Haifa - Carmel and Lower City':
        localities = ['Haifa - Carmel', 'Haifa - Downtown Lower City']
    elif row['locality'] == "Haifa - Ramot HaCarmel and Neveh Sha'anan":
        localities = ['Haifa - Ramot HaCarmel', "Haifa - Neveh Sha'anan"]
    else:
        localities = [row['locality']]
    
    for loc in localities:
        new_row = row.copy()
        new_row['locality'] = loc
        expanded_rows.append(new_row)

# Overwrite the original DataFrame
detailed_df = pd.DataFrame(expanded_rows).reset_index(drop=True)

detailed_df[detailed_df['locality'].str.contains('Haifa', case=False, na=False)]

Unnamed: 0,datetime,region,locality,threat_type,year
43174,2023-10-11 18:35:26,Menashe,Haifa - Carmel,Unrecognized Aircraft,2023
43175,2023-10-11 18:35:26,Menashe,Haifa - Downtown Lower City,Unrecognized Aircraft,2023
43176,2023-10-11 18:35:26,Menashe,Haifa - West,Unrecognized Aircraft,2023
43177,2023-10-11 18:35:26,Menashe,Haifa - Ramot HaCarmel,Unrecognized Aircraft,2023
43178,2023-10-11 18:35:26,Menashe,Haifa - Neveh Sha'anan,Unrecognized Aircraft,2023
...,...,...,...,...,...
123678,2025-06-24 10:32:03,HaMifratz,Haifa - West,Red Alert,2025
123679,2025-06-24 10:32:03,HaMifratz,Haifa - Bay,Red Alert,2025
123680,2025-06-24 10:32:03,HaMifratz,Haifa - Ramot HaCarmel,Red Alert,2025
123681,2025-06-24 10:32:03,HaMifratz,Haifa - Neveh Sha'anan,Red Alert,2025


In rare cases, in a text with localities, the city and zones are separated by commas, the line may be separated without specifying the city. For example, 'Haifa - Carmel, Hadar and Downtown Lower City'. It is worth checking how many rows in the dataset are separated in the case of Haifa.

In [79]:
keywords = ['Carmel', 'Downtown Lower City', 'Lower City', 'West', 
            'Hadar', 'Bay', 'Ramot HaCarmel', "Neveh Sha'anan"]

# Create a dictionary to store the counts
counts = {key: detailed_df[detailed_df['locality'] == key].shape[0] for key in keywords}

# Print the counts
for key, count in counts.items():
    print(f"{key}: {count} rows")



Carmel: 53 rows
Downtown Lower City: 0 rows
Lower City: 0 rows
West: 0 rows
Hadar: 0 rows
Bay: 0 rows
Ramot HaCarmel: 0 rows
Neveh Sha'anan: 0 rows


In [80]:
# Replace 'Carmel' with 'Haifa - Carmel' in the 'locality' column except 'region' == 'Yehuda'
detailed_df['locality'] = detailed_df.apply(
    lambda row: (
        'Haifa - Carmel' 
        if row['locality'] == 'Carmel' and row['region'] != 'Yehuda' 
        else row['locality']
    ), 
    axis=1
)


In [81]:
# Check
get_localities_by_city(detailed_df, "Haifa")

City: Haifa
Localities:
Haifa - Carmel, Haifa - Downtown Lower City, Haifa - West, Haifa - Ramot HaCarmel, Haifa - Neveh Sha'anan, Haifa - Kiryat Haim & Kiryat Shmuel, Haifa Bay, Haifa - Bay, Haifa - Hadar


['Haifa - Carmel',
 'Haifa - Downtown Lower City',
 'Haifa - West',
 'Haifa - Ramot HaCarmel',
 "Haifa - Neveh Sha'anan",
 'Haifa - Kiryat Haim & Kiryat Shmuel',
 'Haifa Bay',
 'Haifa - Bay',
 'Haifa - Hadar']

In [82]:
# Print the counts
for key, count in counts.items():
    print(f"{key}: {count} rows")

Carmel: 53 rows
Downtown Lower City: 0 rows
Lower City: 0 rows
West: 0 rows
Hadar: 0 rows
Bay: 0 rows
Ramot HaCarmel: 0 rows
Neveh Sha'anan: 0 rows


Nothing has changed for the name 'Carmel'.

#### Ashdod
In the city of Ashdod, neighborhoods are often listed after the city's name, separated by commas. In the DataFrame, these neighborhoods are named directly as:

Initially, we had these regions:
- `Ashdod - Alef, Bet, Dalet, Heh`
- `Ashdod - Yod Alef, Yod Bet, Tet Vav, Yod Zain, Ma*`
- `Ashdod - Gimmel, Vav, Zain` 
- `Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*`
- `Ashdod-11,12,15,17,Marine,City`

When creating detailed_df, these regions became separate rows in the dataframe, which could have skewed the distribution of the data.

We have to do the following:
- If we have 'Ashdod - Alef', we change it to 'Ashdod - Alef, Bet, Dalet, Heh'
- If 'Ashdod - Yod Alef', we change it to 'Ashdod - Yod Alef, Yod Bet, Tet Vav, Yod Zain, Ma*'
- If we have 'Ashdod - Gimmel', we change it to 'Ashdod - Gimmel, Vav, Zain'
- If we have 'Ashdod - Het', we change it to 'Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*'
- If we have 'Ashdod-11', we change it to 'Ashdod - 11, 12, 15, 17, Marine, City'

We also have strings from ashdod_zones =
'Alef', 'Bet', 'Dalet', 'Gimmel', 'Heh', 'Het', 'Tet',
'Yod', 'Yod Gimmel', 'Yod Dalet', 'Vav', 'Zain', 'Marine', 'City', '12', '15', '17',

If 'locality' contains a string from ashdod_zones, it should be removed.

In [83]:
detailed_df[detailed_df['locality'] == 'Heh']

Unnamed: 0,datetime,region,locality,threat_type,year
3105,2019-09-10 21:07:11,Lakhish Area,Heh,Red Alert,2019
3159,2019-11-12 05:50:59,Lakhish Area,Heh,Red Alert,2019
3302,2019-11-12 08:12:03,Lakhish Area,Heh,Red Alert,2019
3360,2019-11-12 08:31:19,Lakhish Area,Heh,Red Alert,2019
3955,2019-11-13 21:10:35,Lakhish Area,Heh,Red Alert,2019
...,...,...,...,...,...
117350,2025-06-22 07:46:05,Lachish,Heh,Red Alert,2025
118657,2025-06-23 10:27:57,Lachish,Heh,Red Alert,2025
120000,2025-06-23 10:46:16,Lachish,Heh,Red Alert,2025
125230,2025-07-22 05:50:09,Lachish,Heh,Red Alert,2025


In [84]:
# Dictionary for replacements
replacements = {
    'Ashdod - Alef': 'Ashdod - Alef, Bet, Dalet, Heh',
    'Ashdod - Yod Alef': 'Ashdod - Yod Alef, Yod Bet, Tet Vav, Yod Zain, Ma*',
    'Ashdod - Gimmel': 'Ashdod - Gimmel, Vav, Zain',
    'Ashdod - Het': 'Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*',
    'Ashdod-11': 'Ashdod - 11, 12, 15, 17, Marine, City'
}

# Apply replacements
detailed_df['locality'] = detailed_df['locality'].replace(replacements)

# List of Ashdod zones to be removed
ashdod_zones = [
    'Alef', 'Bet', 'Dalet', 'Gimmel', 'Heh', 'Het', 'Tet', 'Ma*', 'Te*',
    'Yod', 'Yod Gimmel', 'Yod Dalet', 'Vav', 'Zain', 'Marine', 'City', '12', '15', '17'
]

# clean locality values
detailed_df = detailed_df[~detailed_df['locality'].isin(ashdod_zones)]

In [85]:
detailed_df[detailed_df['locality'] == 'Heh']

Unnamed: 0,datetime,region,locality,threat_type,year


In [86]:
detailed_df[detailed_df['locality'].str.contains('Ashdod', case=False, na=False)]


Unnamed: 0,datetime,region,locality,threat_type,year
732,2019-05-04 10:19:41,Lakhish 273,Hatzor Ashdod,Red Alert,2019
763,2019-05-04 10:19:47,Lakhish 271,Ashdod,Red Alert,2019
985,2019-05-04 15:23:26,Lakhish 273,Hatzor Ashdod,Red Alert,2019
987,2019-05-04 15:23:29,Lakhish 271,Ashdod,Red Alert,2019
1013,2019-05-04 15:24:17,Lakhish 273,Hatzor Ashdod,Red Alert,2019
...,...,...,...,...,...
128981,2025-10-01 20:48:40,Lachish,"Ashdod - Alef, Bet, Dalet, Heh",Red Alert,2025
128985,2025-10-01 20:48:40,Lachish,"Ashdod - 11, 12, 15, 17, Marine, City",Red Alert,2025
128991,2025-10-01 20:48:41,Lachish,"Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet,...",Red Alert,2025
128997,2025-10-01 20:48:41,Lachish,"Ashdod - Gimmel, Vav, Zain",Red Alert,2025


In [87]:
# Check
get_localities_by_city(detailed_df, "Ashdod")

City: Ashdod
Localities:
Hatzor Ashdod, Ashdod, Ashdod - Yod Alef, Yod Bet, Tet Vav, Yod Zain, Ma*, Ashdod - Northern Industrial Zone and port, Ashdod - Alef, Bet, Dalet, Heh, Ashdod - Gimmel, Vav, Zain, Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*, Ashdod - Northen Industrial Zone, Ashdod - 11, 12, 15, 17, Marine, City, Ashdod Yacov Ichud


['Hatzor Ashdod',
 'Ashdod',
 'Ashdod - Yod Alef, Yod Bet, Tet Vav, Yod Zain, Ma*',
 'Ashdod - Northern Industrial Zone and port',
 'Ashdod - Alef, Bet, Dalet, Heh',
 'Ashdod - Gimmel, Vav, Zain',
 'Ashdod - Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*',
 'Ashdod - Northen Industrial Zone',
 'Ashdod - 11, 12, 15, 17, Marine, City',
 'Ashdod Yacov Ichud']

In [88]:
# Check
get_localities_by_city(detailed_df, "Eilat")

City: Eilat
Localities:
Eilat


['Eilat']

In [89]:
# Check
get_localities_by_city(detailed_df, "Eilot")

City: Eilot
Localities:
Eilot


['Eilot']

In [91]:
# Replace 'Eilot' with 'Eilat' in the 'locality' column
detailed_df.loc[:, 'locality'] = detailed_df['locality'].str.replace(r'Eilot', 'Eilat', case=False, regex=True)


In [92]:
detailed_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 125713 entries, 0 to 129202
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   datetime     125713 non-null  datetime64[ns]
 1   region       125713 non-null  object        
 2   locality     125713 non-null  object        
 3   threat_type  125713 non-null  object        
 4   year         125713 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 5.8+ MB


In [93]:
detailed_df

Unnamed: 0,datetime,region,locality,threat_type,year
0,2018-12-26 10:05:01,Dan 158,Tel Aviv (South West),Red Alert,2018
1,2018-12-26 10:05:01,Dan 156,Tel Aviv (North),Red Alert,2018
2,2018-12-26 10:05:01,Dan 157,Tel Aviv (Central),Red Alert,2018
3,2018-12-26 10:05:01,Dan 159,Tel Aviv (South East),Red Alert,2018
4,2018-12-26 10:05:01,Dan 162,Azur,Red Alert,2018
...,...,...,...,...,...
129198,2025-10-05 04:59:22,Yarkon,Modi'in - Ishpro Center,Red Alert,2025
129199,2025-10-05 18:34:40,Eilat,Eilat,Red Alert,2025
129200,2025-10-05 18:32:20,Eilat,Eilat,Unrecognized Aircraft,2025
129201,2025-10-05 18:33:24,Eilat,Eilat,Unrecognized Aircraft,2025


### ⚠ Handling Data Skew and Duplication in 2021

Let's try to find the reasons for data skew.

In [94]:
# Separating the records for 2021 into a new dataset
detailed_df_2021 = detailed_df[detailed_df['year'] == 2021]

In [95]:
detailed_df_2021.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27503 entries, 4809 to 34176
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     27503 non-null  datetime64[ns]
 1   region       27503 non-null  object        
 2   locality     27503 non-null  object        
 3   threat_type  27503 non-null  object        
 4   year         27503 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 1.3+ MB


In [96]:
# Looking at the first 60 entries
detailed_df_2021.head(60)

Unnamed: 0,datetime,region,locality,threat_type,year
4809,2021-03-11 21:08:41,Western Lakhish Area,Ashkelon Southern Industrial Zone,Red Alert,2021
4810,2021-03-11 21:08:41,Gaza Containment Zone Area,Zikim,Red Alert,2021
4811,2021-04-15 21:01:29,Gaza Containment Zone Area,Nir Am,Red Alert,2021
4812,2021-04-15 21:01:29,Gaza Containment Zone Area,Sderot,Red Alert,2021
4813,2021-04-15 21:01:29,Gaza Containment Zone Area,Ibim,Red Alert,2021
4814,2021-04-16 21:43:33,Gaza Containment Zone Area,Holit,Red Alert,2021
4815,2021-04-16 21:43:33,Gaza Containment Zone Area,Sdeh Avraham,Red Alert,2021
4816,2021-04-22 01:41:33,Southern Negev Area,Abu Qrenat,Red Alert,2021
4817,2021-04-23 22:59:05,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4818,2021-04-24 01:50:05,Gaza Containment Zone Area,Kfar Maimon and Tushia,Red Alert,2021


Several lines with the same name of a locality can go one after the other. For example, Netiv HaAssara, Kissufim

In [97]:
detailed_df_2021[detailed_df_2021['locality'] == 'Kissufim'].head(10)

Unnamed: 0,datetime,region,locality,threat_type,year
4817,2021-04-23 22:59:05,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4835,2021-04-24 05:35:23,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4836,2021-04-24 05:35:47,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4837,2021-04-24 05:40:04,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4838,2021-04-24 05:41:04,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4860,2021-04-28 00:18:55,Gaza Containment Zone Area,Kissufim,Red Alert,2021
5246,2021-05-11 07:11:48,Gaza Containment Zone,Kissufim,Red Alert,2021
5500,2021-05-11 13:29:50,Gaza Containment Zone,Kissufim,Red Alert,2021
5523,2021-05-11 13:57:50,Gaza Containment Zone,Kissufim,Red Alert,2021
5564,2021-05-11 15:13:17,Gaza Containment Zone,Kissufim,Red Alert,2021


In [98]:
# Filtering the data for October 7, 2023 (massive attacks)
detailed_df[detailed_df['datetime'].dt.date == pd.to_datetime('2023-10-07').date()].head(50)

Unnamed: 0,datetime,region,locality,threat_type,year
36407,2023-10-07 06:29:02,Dan,Bat Yam,Red Alert,2023
36408,2023-10-07 06:29:02,Lachish,Palmachim,Red Alert,2023
36409,2023-10-07 06:29:02,HaShfela,Rishon LeZion - West,Red Alert,2023
36410,2023-10-07 06:29:03,Gaza Envelope,Netiv HaAssara,Red Alert,2023
36411,2023-10-07 06:29:04,Gaza Envelope,Yad Mordechai,Red Alert,2023
36412,2023-10-07 06:29:20,Gaza Envelope,Nachal Oz,Red Alert,2023
36413,2023-10-07 06:29:22,Gaza Envelope,Erez,Red Alert,2023
36414,2023-10-07 06:29:24,Gaza Envelope,Sderot,Red Alert,2023
36415,2023-10-07 06:29:24,Gaza Envelope,Ibim,Red Alert,2023
36416,2023-10-07 06:29:24,Gaza Envelope,Nir Am,Red Alert,2023


In 2023, the duplication issue was fixed, and each alert is now uniquely generated, with no duplicate segments.

In [99]:
# Checking the unique names of localities
detailed_df_2021['locality'].unique()

array(['Ashkelon Southern Industrial Zone', 'Zikim', 'Nir Am', 'Sderot',
       'Ibim', 'Holit', 'Sdeh Avraham', 'Abu Qrenat', 'Kissufim',
       'Kfar Maimon and Tushia', "Be'eri", 'Alumim', 'Mivtachim', 'Amioz',
       'Yesha', 'Nir Itzhak', 'Tzohar and Ohad', 'Netiv HaAssara',
       'Mefalsim', 'Kfar Azza', 'Nachal Oz', 'Kerem Shalom', 'Karmia',
       'Yad Mordechai', 'Gavim', 'Sapir College', 'Nir Am Shooting Range',
       'Nirim', 'Erez', 'Ashkelon', "Mavki'im", "Sa'ad", 'Beit Nekofa',
       'Mevasseret Zion', 'Motza Illit', 'Tzuba', 'Abu Ghosh',
       'Givat Yearim', 'Ein Naqquba', 'Ein Rafa', 'Ramat Raziel',
       'Shachar', 'Kiryat Gat', 'Karmei Gat', 'Even Sapir', 'Zanoah',
       'Beit Zayit', 'Beit Shemesh', 'Bar Giora', 'Ness Harim',
       'Har Adar', "Ma'aleh HaHamisha", 'Kiryat Anavim', 'Kiryat Yearim',
       'Ksalon', 'Jerusalem - East', 'Center and West',
       'Jerusalem - North and Alonim', 'Ein Kerem Boarding School',
       'Zekharia', 'Agur', 'Sdot Micha',

Localities contain 'locality' == " in their names. Delete them from the entire dataset

In [100]:
# Counting the number of such rows in the dataset
detailed_df[detailed_df['locality'] == ''].shape[0]

0

In [101]:
# Cleaning data from rows with 'locality' == "
detailed_df = detailed_df[detailed_df['locality'] != '']

A post dated May 20, 2021, was found in the message files, containing information that
- Fixed a bug that caused duplicate alerts.

This means that previously, alarm alerts were duplicated, which led to incorrect distribution in the datas


⚠⚠⚠ Before May 20, 2021, the dataset contained duplicate alert blocks—identical lists of localities and timestamps were repeated within individual alerts. A May 20, 2021 update to Cumta’s Android app fixed this bug, but the duplicates skewed the yearly distribution of alerts.⚠⚠⚠

In [102]:
# Filter rows before May 20, 2021
df_before_may20 = detailed_df[detailed_df['datetime'] < pd.Timestamp('2021-05-20')].copy()

# Create a unique key for each alert (based on 'datetime', 'region', and 'locality' columns)
df_before_may20['unique_key'] = (
    df_before_may20['datetime'].astype(str).str.strip() + '_' +
    df_before_may20['region'].astype(str).str.strip() + '_' +
    df_before_may20['locality'].astype(str).str.strip()
)

# Remove duplicates based on the unique key
df_before_may20_clean = df_before_may20.drop_duplicates(subset='unique_key').copy()

# Drop the unique key column if it is no longer needed
df_before_may20_clean.drop(columns=['unique_key'], inplace=True)

# If needed, merge the cleaned data with the remaining dataset (rows from May 20, 2021, and later)
detailed_df = pd.concat([
    df_before_may20_clean,
    detailed_df[detailed_df['datetime'] >= pd.Timestamp('2021-05-20')]
])


The number of rows has decreased significantly

In [103]:
detailed_df[detailed_df['year'] == 2021].head(60)

Unnamed: 0,datetime,region,locality,threat_type,year
4809,2021-03-11 21:08:41,Western Lakhish Area,Ashkelon Southern Industrial Zone,Red Alert,2021
4810,2021-03-11 21:08:41,Gaza Containment Zone Area,Zikim,Red Alert,2021
4811,2021-04-15 21:01:29,Gaza Containment Zone Area,Nir Am,Red Alert,2021
4812,2021-04-15 21:01:29,Gaza Containment Zone Area,Sderot,Red Alert,2021
4813,2021-04-15 21:01:29,Gaza Containment Zone Area,Ibim,Red Alert,2021
4814,2021-04-16 21:43:33,Gaza Containment Zone Area,Holit,Red Alert,2021
4815,2021-04-16 21:43:33,Gaza Containment Zone Area,Sdeh Avraham,Red Alert,2021
4816,2021-04-22 01:41:33,Southern Negev Area,Abu Qrenat,Red Alert,2021
4817,2021-04-23 22:59:05,Gaza Containment Zone Area,Kissufim,Red Alert,2021
4818,2021-04-24 01:50:05,Gaza Containment Zone Area,Kfar Maimon and Tushia,Red Alert,2021


There are no more duplicates

In [104]:
# Convert the 'datetime' column to datetime format
detailed_df['datetime'] = pd.to_datetime(detailed_df['datetime'], 
                                         errors='coerce', format='%d/%m/%Y %H:%M:%S')

# Extract the year from the 'datetime' column
detailed_df['year'] = detailed_df['datetime'].dt.year

# Count the number of alerts per year
warnings_by_year_clean = detailed_df['year'].value_counts().sort_index()

# Calculate the percentage distribution per year
percentages_clean = (warnings_by_year_clean / warnings_by_year_clean.sum()) * 100

# Combine the counts and percentage distribution into a DataFrame
result_df_clean = pd.DataFrame({
    'Count': warnings_by_year_clean,
    'Percentage': percentages_clean.round(2).astype(str) + '%'
})

# Display the result
print('Distribution of data by year in detailed_df_clean:')
result_df_clean


Distribution of data by year in detailed_df_clean:


Unnamed: 0_level_0,Count,Percentage
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2018,58,0.05%
2019,4277,4.02%
2020,312,0.29%
2021,8145,7.66%
2022,989,0.93%
2023,13838,13.01%
2024,32674,30.73%
2025,46040,43.3%


🎉🎉🎉The distribution by year has become similar to the original one🎉🎉🎉

### Processing of the 'region' column

In [105]:
# Check again
detailed_df['region'].unique()

array(['Dan 158', 'Dan 156', 'Dan 157', 'Dan 159', 'Dan 162', 'Dan 155',
       'Dan 160', 'Sharon 140', 'Dan 161', 'Dan 165', 'Sharon 143',
       'Sharon 141', 'Eilat 311', 'Arabah 310', 'Lakhish 246',
       'Gaza Containment Zone 225', 'Gaza Containment Zone 224',
       'Jerusalem 194', 'Maale Adumim 200', 'Samaria 127',
       'Gaza Containment Zone 236',
       'Central Negev / Gaza Containment Zone 238',
       'Gaza Containment Zone 237', 'Gaza Containment Zone 230',
       'Gaza Containment Zone 220', 'Gaza Containment Zone 219',
       'Gaza Containment Zone 221', 'Hefer 139',
       'Gaza Containment Zone 218', 'Central Negev 254',
       'Central Negev 255', 'Central Negev / Gaza Containment Zone 216',
       'Gaza Containment Zone 223', 'Gaza Containment Zone 233',
       'Gaza Containment Zone 232', 'Gaza Containment Zone 231',
       'Gaza Containment Zone 228', 'Gaza Containment Zone 222',
       'Gaza Containment Zone 217', 'Lakhish 247', 'Gaza containment 224',
     

In [106]:
# Remove ' Area' from the 'region' column if it exists
detailed_df['region'] = detailed_df['region'].str.replace(' Area', '', regex=False).str.strip()
detailed_df['region'].unique()

array(['Dan 158', 'Dan 156', 'Dan 157', 'Dan 159', 'Dan 162', 'Dan 155',
       'Dan 160', 'Sharon 140', 'Dan 161', 'Dan 165', 'Sharon 143',
       'Sharon 141', 'Eilat 311', 'Arabah 310', 'Lakhish 246',
       'Gaza Containment Zone 225', 'Gaza Containment Zone 224',
       'Jerusalem 194', 'Maale Adumim 200', 'Samaria 127',
       'Gaza Containment Zone 236',
       'Central Negev / Gaza Containment Zone 238',
       'Gaza Containment Zone 237', 'Gaza Containment Zone 230',
       'Gaza Containment Zone 220', 'Gaza Containment Zone 219',
       'Gaza Containment Zone 221', 'Hefer 139',
       'Gaza Containment Zone 218', 'Central Negev 254',
       'Central Negev 255', 'Central Negev / Gaza Containment Zone 216',
       'Gaza Containment Zone 223', 'Gaza Containment Zone 233',
       'Gaza Containment Zone 232', 'Gaza Containment Zone 231',
       'Gaza Containment Zone 228', 'Gaza Containment Zone 222',
       'Gaza Containment Zone 217', 'Lakhish 247', 'Gaza containment 224',
     

In [107]:
# Replace the incorrect region name with the corrected version
detailed_df['region'] = detailed_df['region'].replace('Milouot Industrial\xa0Zone\xa0North', 
                                                      'Milouot Industrial Zone North')

### Adding the 'district' column

In [108]:
# List of major cities with their corresponding districts
city_to_district = {
    "Tel Aviv": "Tel Aviv District",
    "Dan Area": "Tel Aviv District",
    "Ramat Gan": "Tel Aviv District",
    "Herzeliya": "Tel Aviv District",
    "Holon": "Tel Aviv District",
    "Bat Yam": "Tel Aviv District",
    "Bnei Brak": "Tel Aviv District",
    "Givatayim": "Tel Aviv District",
    "Kiryat Ono": "Tel Aviv District",
    "Or Yehuda": "Tel Aviv District",
    "Ramat HaSharon": "Tel Aviv District",
    "Rishon LeZion": "Central District",
    "Netanya": "Central District",
    "Rehovot": "Central District",
    "Petah Tikva": "Central District",
    "Lod": "Central District",
    "Ramla": "Central District",
    "Kfar Saba": "Central District",
    "Yavne": "Central District",
    "Modi'in-Maccabim-Re'ut": "Central District",
    "Modi'in": "Central District",
    "Be'er Ya'akov": "Central District",
    "El'ad": "Central District",
    "Ganei Tikva": "Central District",
    "Giv'at Shmuel": "Central District",
    "Hod Hasharon": "Central District",
    "Kafr Qasim": "Central District",
    "Kfar Yona": "Central District",
    "Ness Ziona": "Central District",
    "Qalansawe": "Central District",
    "Ra'anana": "Central District",
    "Rosh HaAyin": "Central District",
    "Tayibe": "Central District",
    "Tira": "Central District",
    "Yehud-Monosson": "Central District",
    "Haifa": "Haifa District",
    "Hadera": "Haifa District",
    "Caesarea": "Haifa District",
    "Nesher": "Haifa District",
    "Or Akiva": "Haifa District",
    "Tirat Carmel": "Haifa District",
    "Kiryat Ata": "Haifa District",
    "Kiryat Bialik": "Haifa District",
    "Kiryat Yam": "Haifa District",
    "Kiryat Motzkin": "Haifa District",
    "Pardes Hanna-Karkur": "Haifa District",
    "Jerusalem": "Jerusalem District",
    "Beit Shemesh": "Jerusalem District",
    "Ma'ale Adumim": "Jerusalem District",
    "Nazareth": "Northern District",
    "Acre": "Northern District",
    "Gan Yavne": "Central District",
    "Dead Sea Factories": "Southern District",
    "Dead Sea Industries": "Southern District",
    "Lakhish": "Southern District",
    "Atlit": "Haifa District",
    "Kiryat Malachi": "Southern District",
    "Gdera": "Central District",
    "Bnei Darom": "Southern District",
    "Sdeh Yoav": "Southern District",
    "Palmachim": "Central District",
    "Tiberias": "Northern District",
    "Nahariya": "Northern District",
    "Safed": "Northern District",
    "Kiryat Shmona": "Northern District",
    "Afula": "Northern District",
    "Karmiel": "Northern District",
    "Ma'alot-Tarshiha": "Northern District",
    "Migdal HaEmek": "Northern District",
    "Yokneam Illit": "Northern District",
    "Ashdod": "Southern District",
    "Ashkelon": "Southern District",
    "Beer Sheva": "Southern District",
    "Eilat": "Southern District",
    "Sderot": "Southern District",
    "Ofakim": "Southern District",
    "Dimona": "Southern District",
    "Arad": "Southern District",
    "Kiryat Gat": "Southern District",
    "Kiryat Malakhi": "Southern District"
}

# Simplified mapping based on keywords in the 'region' column
district_mapping_patterns = {
    'Dan': 'Tel Aviv District',
    'Sharon': 'Central District',
    'Hefer': 'Central District',
    'Yarkon': 'Central District',
    'Drom HaSharon': 'Central District',
    'Haifa': 'Haifa District',
    'Pardes Hanna': 'Haifa District',
    'Karkur': 'Haifa District',
    'Hof HaCarmel': 'Haifa District',
    'Menashe': 'Haifa District',
    'Hakrayot': 'Haifa District',
    'Wadi Ara': 'Haifa District',
    'Jerusalem': 'Jerusalem District',
    'Maale Adumim': 'Jerusalem District',
    'Beit Shemesh': 'Jerusalem District',
    'Samaria': 'Judea and Samaria Area',
    'Shomron': 'Judea and Samaria Area',
    'Judea': 'Judea and Samaria Area',
    'Yehuda': 'Judea and Samaria Area',
    'Shfelat Yehuda': 'Judea and Samaria Area',
    'Lakhish': 'Southern District',
    'Gaza': 'Southern District',
    'Confrontation': 'Southern District',
    'Nirim': 'Southern District',
    'Nir Am': 'Southern District',
    'Eilat': 'Southern District',
    'Arava': 'Southern District',
    'Dead Sea': 'Southern District',
    'Negev': 'Southern District',
    'Western Negev': 'Southern District',
    'Safed': 'Northern District',
    'Galilee': 'Northern District',
    'Golan': 'Northern District',
    'Tavor': 'Northern District',
    'HaAmakim': 'Northern District',
    "Beit She'an": 'Northern District',
    'Lev Ha-Hula': 'Northern District',
    'Shfela': 'Central District',
    'Southern Shfela': 'Central District',
    'HaMifratz': 'Haifa District',
    'Ibim': 'Southern District',
    'Gaza Envelope': 'Southern District',
    'West Lachish': 'Southern District',
    'Center Galilee': 'Northern District',
    'Center Negev': 'Southern District',
    'South Golan': 'Northern District',
    'Lower Galilee': 'Northern District',
    'Upper Galilee': 'Northern District',
    'North Golan': 'Northern District', 
    "Beit She'an Valley": 'Northern District',
    'Fast Lane Parking Lot': 'Central District',
    'Mini Israel': 'Central District',
    'Modiin Maccabim Reut': 'Central District',
    'Neot Kedumim': 'Central District',
    'Regem Industrial Zone': 'Southern District',
    'Yehud Monoson': 'Central District',
    'Orevim Cliff': 'Northern District',
    'Ramat Trump': 'Northern District',
    'Jordan River Rafting': 'Northern District',
    'Kfar Naḥum': 'Northern District',
    'Tabgha': 'Northern District',
    "Arabah 310": "Southern District",
    "Ye'arut HaCarmel (Carmel Forest)": "Haifa District",
    "Mivtachim Ami'oz Yesha": "Central District",
    "Kokhav Michael": "Central District",
    "Lachish": "Southern District",
    "Bika'a": "Central District",
    "HaCarmel": "Haifa District",
    "Kfar Yehoshua Train Station": "Central District",
    "Shoham": "Central District",
    "Nili": "Central District",
    "Airport City": "Central District",
    'Goren Guest Farm': 'Northern District',
    'Miluot North Industrial Zone': 'Northern District',
    'Jordan Estate Hotel': 'Northern District',
    'Milouot Industrial Zone North': 'Northern District'
}


In [109]:
# Function to match keywords in the 'region' column and assign districts
def map_to_district(region):
    for keyword, district in district_mapping_patterns.items():
        if keyword.lower() in region.lower():
            return district
    return 'Undefined'  # Default if no match is found

# Apply logic to create the 'district' column
detailed_df['district'] = detailed_df.apply(
    lambda row: "Southern District" if "ashdod" in row['locality'].lower()  # Ensure all 'Ashdod' locations are categorized correctly
                else city_to_district[row['locality']] if row['locality'] in city_to_district  # Match exact locality if found in city_to_district
                else map_to_district(row['region']),  # Otherwise, try mapping based on region keywords
    axis=1
)

# Drop the 'year' column as it's not needed
detailed_df = detailed_df.drop(columns=['year'])

# Reset the index for better data structure
detailed_df = detailed_df.reset_index(drop=True)

# Display the updated DataFrame
detailed_df


Unnamed: 0,datetime,region,locality,threat_type,district
0,2018-12-26 10:05:01,Dan 158,Tel Aviv (South West),Red Alert,Tel Aviv District
1,2018-12-26 10:05:01,Dan 156,Tel Aviv (North),Red Alert,Tel Aviv District
2,2018-12-26 10:05:01,Dan 157,Tel Aviv (Central),Red Alert,Tel Aviv District
3,2018-12-26 10:05:01,Dan 159,Tel Aviv (South East),Red Alert,Tel Aviv District
4,2018-12-26 10:05:01,Dan 162,Azur,Red Alert,Tel Aviv District
...,...,...,...,...,...
106328,2025-10-05 04:59:22,Yarkon,Modi'in - Ishpro Center,Red Alert,Central District
106329,2025-10-05 18:34:40,Eilat,Eilat,Red Alert,Southern District
106330,2025-10-05 18:32:20,Eilat,Eilat,Unrecognized Aircraft,Southern District
106331,2025-10-05 18:33:24,Eilat,Eilat,Unrecognized Aircraft,Southern District


In [110]:
# Check 'Undefined' district
detailed_df[detailed_df['district'] == 'Undefined']['locality'].unique()

array(['ברחבי הארץ'], dtype=object)

In [111]:
detailed_df[detailed_df['district'] == 'Undefined']['region'].unique()

array(['ברחבי הארץ'], dtype=object)

In [112]:
detailed_df[detailed_df['region'] == 'ברחבי הארץ']

Unnamed: 0,datetime,region,locality,threat_type,district
74878,2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,Undefined
75203,2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,Undefined
91456,2025-06-20 15:44:15,ברחבי הארץ,ברחבי הארץ,Red Alert,Undefined


In [113]:
# Find duplicate nationwide rows based on datetime
nationwide_duplicates = detailed_df[detailed_df['region'] == 'ברחבי הארץ'].duplicated(subset=['datetime'])

# Remove only the duplicate nationwide rows
detailed_df = detailed_df[~(detailed_df['region'] == 'ברחבי הארץ') | ~nationwide_duplicates]

# Check
detailed_df[detailed_df['region'] == 'ברחבי הארץ']

Unnamed: 0,datetime,region,locality,threat_type,district
74878,2025-06-14 01:12:14,ברחבי הארץ,ברחבי הארץ,Red Alert,Undefined
91456,2025-06-20 15:44:15,ברחבי הארץ,ברחבי הארץ,Red Alert,Undefined


In [114]:
# Get all unique locality-region-district pairs from the dataset
locality_region_district_pairs = detailed_df[
    (detailed_df['locality'] != 'ברחבי הארץ') & 
    (detailed_df['region'] != 'ברחבי הארץ')
][['locality', 'region', 'district']].drop_duplicates()

# Find nationwide alerts (across the whole country)
nationwide = detailed_df[
    (detailed_df['locality'] == 'ברחבי הארץ') & 
    (detailed_df['region'] == 'ברחבי הארץ')
]

# Create expanded copies for each locality with correct region and district
expanded = []
for _, alert in nationwide.iterrows():
    for _, location in locality_region_district_pairs.iterrows():
        new_alert = alert.copy()
        new_alert['locality'] = location['locality']
        new_alert['region'] = location['region']
        new_alert['district'] = location['district']
        expanded.append(new_alert)

# Combine back together
detailed_df = pd.concat([
    detailed_df[~((detailed_df['locality'] == 'ברחבי הארץ') & (detailed_df['region'] == 'ברחבי הארץ'))],
    pd.DataFrame(expanded)
], ignore_index=True)

In [115]:
detailed_df[detailed_df['datetime'] == '2025-06-14 01:12:14']

Unnamed: 0,datetime,region,locality,threat_type,district
106330,2025-06-14 01:12:14,Dan 158,Tel Aviv (South West),Red Alert,Tel Aviv District
106331,2025-06-14 01:12:14,Dan 156,Tel Aviv (North),Red Alert,Tel Aviv District
106332,2025-06-14 01:12:14,Dan 157,Tel Aviv (Central),Red Alert,Tel Aviv District
106333,2025-06-14 01:12:14,Dan 159,Tel Aviv (South East),Red Alert,Tel Aviv District
106334,2025-06-14 01:12:14,Dan 162,Azur,Red Alert,Tel Aviv District
...,...,...,...,...,...
108868,2025-06-14 01:12:14,Arava,Elipaz and Timna Mines,Red Alert,Southern District
108869,2025-06-14 01:12:14,Arava,Samar,Red Alert,Southern District
108870,2025-06-14 01:12:14,Arava,Six Senses Shaharut Hotel,Red Alert,Southern District
108871,2025-06-14 01:12:14,Arava,Ardanel Ranch,Red Alert,Southern District


In [116]:
detailed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111416 entries, 0 to 111415
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   datetime     111416 non-null  datetime64[ns]
 1   region       111416 non-null  object        
 2   locality     111416 non-null  object        
 3   threat_type  111416 non-null  object        
 4   district     111416 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 4.3+ MB


### Saving the dataframe

In [117]:
# Save the cleaned dataset to a CSV file
detailed_df.to_csv('cumta_detailed_df.csv', index=False, encoding='utf-8')

**Summary of Detailed Dataset Processing and Cleaning**

1️⃣ **Dataset Creation**
- Constructed a dataset with **detailed locality information** based on `raw_df`.

2️⃣ **Yearly Data Distribution Analysis**
- Identified a significant discrepancy in the **distribution of alerts by year** between `detailed_df` and `raw_df`:
  - **In `detailed_df`**: 2021 accounts for **35.09%** of all records.
  - **In `raw_df`**: 2021 makes up only **14.06%** of all records.
- Given that the original dataset consists of **15 files**, and **2021 alert data spans less than 2 files (~13-14%)**, the distribution in `detailed_df` appeared skewed.

3️⃣ **Localization Processing**
- Checked **localities and regions in Hebrew** and translated them into **English**.

4️⃣ **City Name Standardization**
- Identified **large cities** that were divided into **zones**.
- Verified how many such **zones exist** and ensured proper classification.
- Paid special attention to **Ashdod**, where zone names had **non-standard formats**.

5️⃣ **Handling Data Skew and Duplication in 2021**
- Found that **several consecutive lines** contained **the same locality names** (e.g., *Netiv HaAssara, Kissufim*).
- Discovered a **message dated May 20, 2021**, stating:

  > “Fixed a bug that caused duplicate alerts.”

- This confirmed that **prior to May 20, 2021**, **duplicate alert blocks** existed, where **identical localities and timestamps** were repeated **within the same alert**.
- This duplication issue **skewed the yearly distribution**.

6️⃣ **Deduplication & Final Data Cleaning**
- **Removed duplicate alert records** from **before May 20, 2021**.
- Successfully restored the **yearly distribution** to reflect **the original dataset structure**.

---

✅ **Final Result:** The dataset is now **cleaned, properly localized, and free from duplicate alerts** that previously distorted historical distributions.


- **Total rows:** 111416
- **Duplicates:** None  
- **Time column (`'time'`):** Converted to `datetime64[ns]` format  
- **Missing value** None  

**Columns:**

| Column Name | Description |
|-------------|------------|
| **datetime** | Date and time of the warning in `datetime64[ns]` format |
| **locality** | Name of the locality (city, town, or settlement) |
| **region** | Region to which the locality belongs |
| **threat_type** | Type of warning. Possible values: `'Red Alert'`, `'Unrecognized Aircraft'`, `'Terrorist Infiltration'`, `'Interception Pieces'`, `'Earthquake'` |
| **district** | Administrative district or region where the locality is located; an official governmental division. |

## Creating a dataset where zones of large cities are combined into a single record

In [118]:
# List of major cities
major_cities_list = [
    "Modi'in", "Hadera", "Caesarea", "Ramat Gan", "Haifa", "Beer Sheva", "Netanya",
    "Rishon LeZion", "Tel Aviv", "Ashdod", "Herzeliya", "Jerusalem", "Nahariya", 
    "Safed", "Dimona", "Kiryat Gat", "Yokneam Illit", "Acre", "Ashkelon", "Kiryat Bialik"
]

# Map for special cases
special_cases = {
    'Hatzor Ashdod': 'Hatzor Ashdod',
    'Hevel Modi\'in': 'Hevel Modi\'in',
    'Modi\'in Illit': 'Modi\'in Illit',
    'Haifa - Kiryat Haim & Kiryat Shmuel': 'Haifa - Kiryat Haim & Kiryat Shmuel'
}

# Function to determine locality and district label
def map_to_locality_and_district(row):
    locality = row['locality']
    # Check if the locality is in special cases
    if locality in special_cases:
        return special_cases[locality], "all"
    # Check if the locality contains a major city name
    for city in major_cities_list:
        if city in locality:
            # Remove city name and clean up
            district = locality.replace(city, "").strip(" -")
            # Remove parentheses using regex
            district = re.sub(r'[()]', '', district).strip()
            return city, district  # Return city and cleaned district name
    # If not in major cities, mark it as its own locality with "all" zones
    return locality, "All"

In [119]:
# Step 1: Apply the mapping logic to determine locality and district
detailed_df[['locality', 'district']] = detailed_df.apply(
    lambda row: pd.Series(map_to_locality_and_district(row)), axis=1
)

# Step 2: Group by 'time' and 'locality', and aggregate zones
# First, create time groups to combine alerts within 2-minute windows
df_grouped = (
    detailed_df
    .sort_values(['locality', 'datetime'])  # Sort by time for correct grouping
    .assign(
        # Create groups: reset group counter when time difference > 2minutes
        time_group=lambda x: (
            x.groupby('locality')['datetime']
            .diff() > pd.Timedelta(minutes=2)
        ).cumsum()
    )
    .groupby(['locality', 'time_group'], as_index=False)
    .agg({
        'datetime': 'first',  # Take the time of the first alert in the group
        'district': lambda x: ', '.join(
            filter(None, [item.strip() for item in set(x) if item and item.strip()])
        ).strip(', '),  # Combine UNIQUE districts
        'region': 'first',
        'threat_type': 'first'
    })
    .drop('time_group', axis=1)  # Remove temporary column
)

# Then perform the original grouping by date and locality (with already combined data)
df = (
    df_grouped
    .groupby(['datetime', 'locality'])
    .agg({
        'district': lambda x: ', '.join(
            filter(None, [item.strip() for item in set(x) if item and item.strip()])
        ).strip(', '),
        'region': 'first',
        'threat_type': 'first'
    })
    .reset_index()
)

In [120]:

# Rename 'district' column for clarity
df.rename(columns={'district': 'zones'}, inplace=True)

# Step 3: Replace empty 'zones' with 'All'
df['zones'] = df['zones'].apply(lambda x: 'All' if not x.strip() else x)

# Step 4: Remove numbers from the 'region' column if locality is in major_cities_list
def clean_region(region, locality):
    if locality in major_cities_list:
        return re.sub(r'\d+', '', region).strip()
    return region

df['region'] = df.apply(lambda row: clean_region(row['region'], row['locality']), axis=1)


In [121]:
df

Unnamed: 0,datetime,locality,zones,region,threat_type
0,2018-12-26 10:05:01,Adanim,All,Sharon 143,Red Alert
1,2018-12-26 10:05:01,Azur,All,Dan 162,Red Alert
2,2018-12-26 10:05:01,Bat Yam,All,Dan 165,Red Alert
3,2018-12-26 10:05:01,Beit Berl,All,Sharon 141,Red Alert
4,2018-12-26 10:05:01,Bnei Brak,All,Dan 160,Red Alert
...,...,...,...,...,...
79665,2025-10-05 04:58:41,Yish'i,All,Shfelat Yehuda,Red Alert
79666,2025-10-05 04:58:41,Zanoah,All,Shfelat Yehuda,Red Alert
79667,2025-10-05 04:58:41,Zeitan,All,HaShfela,Red Alert
79668,2025-10-05 04:58:41,Zrifin Industrial Zone,All,HaShfela,Red Alert


### Checking the main cities in the final dataset

In [122]:
df[df['locality'] == 'Tel Aviv']

Unnamed: 0,datetime,locality,zones,region,threat_type
49,2018-12-26 10:05:01,Tel Aviv,"South East, North, South West, Central",Dan,Red Alert
156,2019-03-14 21:05:33,Tel Aviv,"South East, South West, Central",Dan,Red Alert
4275,2021-05-11 20:46:26,Tel Aviv,"South and Jaffa, Across the Yarkon, East",Dan,Red Alert
4494,2021-05-11 20:50:59,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert
4625,2021-05-11 20:58:52,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert
...,...,...,...,...,...
78283,2025-09-03 09:39:46,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert
78675,2025-09-13 03:47:38,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert
79150,2025-09-18 20:32:11,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert
79367,2025-09-25 22:42:15,Tel Aviv,"South and Jaffa, Across the Yarkon, City Cente...",Dan,Red Alert


In [123]:
df[df['locality'] == 'Haifa']

Unnamed: 0,datetime,locality,zones,region,threat_type
16556,2023-10-11 18:35:26,Haifa,"Downtown Lower City, Bay, Neveh Sha'anan, Carm...",Menashe,Unrecognized Aircraft
22763,2024-01-19 20:45:09,Haifa,"Downtown Lower City, West, Carmel",Menashe,Red Alert
26824,2024-06-11 09:51:51,Haifa,West,Menashe,Red Alert
30625,2024-09-23 17:41:38,Haifa,Bay,Menashe,Red Alert
30747,2024-09-23 19:42:35,Haifa,"Hadar, Downtown Lower City, Bay, Neveh Sha'ana...",Menashe,Red Alert
...,...,...,...,...,...
72908,2025-06-23 10:40:53,Haifa,"West, Bay",HaMifratz,Red Alert
73856,2025-06-24 05:13:00,Haifa,"Hadar, Downtown Lower City, Bay, Neveh Sha'ana...",HaMifratz,Red Alert
75154,2025-06-24 06:57:09,Haifa,"Hadar, Downtown Lower City, Bay, Neveh Sha'ana...",HaMifratz,Red Alert
75439,2025-06-24 07:17:13,Haifa,"Neveh Sha'anan, West, Ramot HaCarmel, Bay",HaMifratz,Red Alert


In [124]:
df[df['locality'] == "Modi'in"]

Unnamed: 0,datetime,locality,zones,region,threat_type
2748,2019-11-12 10:16:42,Modi'in,Ishpro Center,Gaza Containment Zone,Red Alert
4410,2021-05-11 20:46:33,Modi'in,Ishpro Center,Shfela (Lowlands),Red Alert
5086,2021-05-12 03:02:57,Modi'in,Ligad Center,Shfela (Lowlands),Red Alert
6076,2021-05-13 01:35:06,Modi'in,"Ishpro Center, Ligad Center",Shfela (Lowlands),Red Alert
6586,2021-05-13 22:14:59,Modi'in,Ligad Center,Shfela (Lowlands),Red Alert
...,...,...,...,...,...
78837,2025-09-16 18:51:38,Modi'in,"Ishpro Center, Maccabim Re'ut, Regional Counci...",HaShfela,Red Alert
78936,2025-09-18 20:32:06,Modi'in,"Ishpro Center, Maccabim Re'ut, Regional Counci...",HaShfela,Red Alert
79231,2025-09-25 22:42:12,Modi'in,Regional Council Hevel Industrial Park,HaShfela,Red Alert
79469,2025-09-29 00:59:00,Modi'in,"Ishpro Center, Maccabim Re'ut, Regional Counci...",HaShfela,Red Alert


In [125]:
df[df['locality'] == 'Hadera']

Unnamed: 0,datetime,locality,zones,region,threat_type
2417,2019-07-31 10:10:03,Hadera,East,Menashe,Red Alert
2418,2019-07-31 10:15:05,Hadera,Center,Menashe,Red Alert
2419,2019-07-31 10:20:05,Hadera,Neveh Haim,Menashe,Red Alert
2420,2019-07-31 10:25:02,Hadera,West,Menashe,Red Alert
9922,2021-11-03 18:05:03,Hadera,West,Menashe,Red Alert
9927,2021-11-03 18:15:02,Hadera,Neveh Haim,Menashe,Red Alert
9932,2021-11-03 18:25:01,Hadera,Center,Menashe,Red Alert
33287,2024-10-01 19:34:14,Hadera,"Center, West, Neveh Haim, East",Menashe,Red Alert
34178,2024-10-01 19:39:44,Hadera,"Center, West, Neveh Haim, East",Menashe,Red Alert
34902,2024-10-01 19:44:21,Hadera,"Center, West, East",Menashe,Red Alert


In [126]:
df[df['locality'] == 'Caesarea']

Unnamed: 0,datetime,locality,zones,region,threat_type
2348,2019-07-31 10:05:05,Caesarea,"Industrial Zone, Marine Center",Menashe,Red Alert
33363,2024-10-01 19:34:17,Caesarea,"Industrial Zone, Marine Center",Menashe,Red Alert
34363,2024-10-01 19:39:47,Caesarea,"Marine Center, Industrial Zone",Menashe,Red Alert
35032,2024-10-01 19:50:19,Caesarea,"Industrial Zone, Marine Center",Menashe,Red Alert
36684,2024-10-04 10:49:20,Caesarea,Marine Center,Menashe,Red Alert
37006,2024-10-06 08:28:39,Caesarea,"Marine Center, Industrial Zone",Menashe,Red Alert
37769,2024-10-09 08:26:38,Caesarea,Marine Center,Menashe,Red Alert
38879,2024-10-14 17:34:59,Caesarea,Industrial Zone,Menashe,Red Alert
39027,2024-10-15 07:27:06,Caesarea,"Industrial Zone, Marine Center",Menashe,Red Alert
40401,2024-10-22 07:43:18,Caesarea,Marine Center,Menashe,Red Alert


In [127]:
df[df['locality'] == 'Ramat Gan']

Unnamed: 0,datetime,locality,zones,region,threat_type
42,2018-12-26 10:05:01,Ramat Gan,"Bar Ilan University, Ramat Ef'al & Tel Hashomer",Dan,Red Alert
155,2019-03-14 21:05:33,Ramat Gan,All,Dan,Red Alert
4341,2021-05-11 20:46:29,Ramat Gan,"West, East",Dan,Red Alert
4561,2021-05-11 20:51:34,Ramat Gan,"West, East",Dan,Red Alert
4620,2021-05-11 20:58:52,Ramat Gan,"West, East",Dan,Red Alert
...,...,...,...,...,...
78661,2025-09-13 03:47:38,Ramat Gan,"West, East",Dan,Red Alert
78865,2025-09-16 18:51:38,Ramat Gan,East,Dan,Red Alert
78944,2025-09-18 20:32:06,Ramat Gan,"West, East",Dan,Red Alert
79237,2025-09-25 22:42:12,Ramat Gan,"West, East",Dan,Red Alert


In [128]:
df[df['locality'] == 'Beer Sheva']

Unnamed: 0,datetime,locality,zones,region,threat_type
3270,2019-11-16 01:56:58,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
4807,2021-05-12 02:49:32,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
4914,2021-05-12 02:55:56,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
5098,2021-05-12 03:03:28,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
5168,2021-05-12 03:09:00,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
...,...,...,...,...,...
68223,2025-06-20 05:51:39,Beer Sheva,"North, West, East, South",Center Negev,Red Alert
68532,2025-06-20 15:40:18,Beer Sheva,"North, West, East, South",Center Negev,Red Alert
69097,2025-06-20 15:44:15,Beer Sheva,"North, West, East, South",Central Negev,Red Alert
74210,2025-06-24 05:40:09,Beer Sheva,"North, West, East, South",Center Negev,Red Alert


In [129]:
df[df['locality'] == 'Rishon LeZion']

Unnamed: 0,datetime,locality,zones,region,threat_type
2538,2019-11-12 07:05:25,Rishon LeZion,West,Shfela (Lowlands),Red Alert
2598,2019-11-12 08:00:35,Rishon LeZion,West,Shfela (Lowlands),Red Alert
2673,2019-11-12 09:11:04,Rishon LeZion,"West, East",Shfela (Lowlands),Red Alert
4274,2021-05-11 20:46:26,Rishon LeZion,"West, East",Shfela (Lowlands),Red Alert
4459,2021-05-11 20:50:14,Rishon LeZion,"West, East",Shfela (Lowlands),Red Alert
...,...,...,...,...,...
78871,2025-09-16 18:51:38,Rishon LeZion,East,HaShfela,Red Alert
78949,2025-09-18 20:32:06,Rishon LeZion,"West, East",HaShfela,Red Alert
79242,2025-09-25 22:42:12,Rishon LeZion,"West, East",HaShfela,Red Alert
79495,2025-09-29 00:59:00,Rishon LeZion,"West, East",HaShfela,Red Alert


In [130]:
df[df['locality'] == 'Ashdod']

Unnamed: 0,datetime,locality,zones,region,threat_type
605,2019-05-04 10:19:47,Ashdod,All,Lakhish,Red Alert
822,2019-05-04 15:23:29,Ashdod,All,Lakhish,Red Alert
1139,2019-05-04 21:57:37,Ashdod,All,Lakhish,Red Alert
1167,2019-05-04 22:41:20,Ashdod,All,Lakhish,Red Alert
1187,2019-05-04 22:47:06,Ashdod,All,Lakhish,Red Alert
...,...,...,...,...,...
72451,2025-06-23 10:27:57,Ashdod,"Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*, Ale...",Lachish,Red Alert
73169,2025-06-23 10:46:16,Ashdod,"Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*, Ale...",Lachish,Red Alert
76832,2025-07-22 05:50:09,Ashdod,"Alef, Bet, Dalet, Heh, 11, 12, 15, 17, Marine,...",Lachish,Red Alert
79174,2025-09-21 10:01:16,Ashdod,"Het, Tet, Yod, Yod Gimmel, Yod Dalet, Te*, Gim...",Lachish,Red Alert


In [131]:
df[df['locality'] == 'Ashkelon']

Unnamed: 0,datetime,locality,zones,region,threat_type
80,2019-01-07 03:18:50,Ashkelon,All,Lakhish,Red Alert
359,2019-03-26 01:32:56,Ashkelon,Industrial Area,Lakhish,Red Alert
362,2019-03-26 01:55:24,Ashkelon,Industrial Area,Lakhish,Red Alert
383,2019-03-26 23:37:40,Ashkelon,Industrial Area,Lakhish,Red Alert
384,2019-03-27 03:57:23,Ashkelon,Industrial Area,Lakhish,Red Alert
...,...,...,...,...,...
69062,2025-06-20 15:44:15,Ashkelon,"North, Industrial Area, South, Northern Indust...",Lakhish,Red Alert
71409,2025-06-22 07:42:11,Ashkelon,"North, Northern Industrial Zone, South",West Lachish,Red Alert
72535,2025-06-23 10:28:35,Ashkelon,"North, Northern Industrial Zone",West Lachish,Red Alert
73089,2025-06-23 10:45:44,Ashkelon,"North, Northern Industrial Zone, Southern Indu...",West Lachish,Red Alert


In [132]:
df[df['locality'] == 'Herzeliya']

Unnamed: 0,datetime,locality,zones,region,threat_type
17,2018-12-26 10:05:01,Herzeliya,All,Dan,Red Alert
4331,2021-05-11 20:46:29,Herzeliya,"Center and Glil Yam, Pituach",Sharon,Red Alert
4491,2021-05-11 20:50:59,Herzeliya,"Center and Glil Yam, Pituach",Sharon,Red Alert
4652,2021-05-11 21:00:50,Herzeliya,"Center and Glil Yam, Pituach",Sharon,Red Alert
4681,2021-05-11 21:08:19,Herzeliya,"Center and Glil Yam, Pituach",Sharon,Red Alert
...,...,...,...,...,...
77960,2025-08-22 20:59:34,Herzeliya,"Center and Glil Yam, West",Dan,Red Alert
78343,2025-09-03 09:39:52,Herzeliya,"Center and Glil Yam, West",Dan,Red Alert
78629,2025-09-13 03:47:38,Herzeliya,"Center and Glil Yam, West",Dan,Red Alert
79036,2025-09-18 20:32:11,Herzeliya,"Center and Glil Yam, West",Dan,Red Alert


Сhecking if the dataset contains the words 'Herzliya' instead of 'Herzeliya'

In [133]:
df[df['locality'] == 'Herzliya']

Unnamed: 0,datetime,locality,zones,region,threat_type


Everything is OK now

In [134]:
df[df['locality'] == "Ra'anana"]

Unnamed: 0,datetime,locality,zones,region,threat_type
4416,2021-05-11 20:46:33,Ra'anana,All,Sharon,Red Alert
4657,2021-05-11 21:00:50,Ra'anana,All,Sharon,Red Alert
4683,2021-05-11 21:08:19,Ra'anana,All,Sharon,Red Alert
4716,2021-05-11 21:15:07,Ra'anana,All,Sharon,Red Alert
4859,2021-05-12 02:50:14,Ra'anana,All,Sharon,Red Alert
...,...,...,...,...,...
77819,2025-08-17 16:16:14,Ra'anana,All,Sharon,Red Alert
78032,2025-08-22 20:59:34,Ra'anana,All,Sharon,Red Alert
78244,2025-09-03 09:39:43,Ra'anana,All,Sharon,Red Alert
78943,2025-09-18 20:32:06,Ra'anana,All,Sharon,Red Alert


In [135]:
df[df['locality'] == 'Jerusalem']

Unnamed: 0,datetime,locality,zones,region,threat_type
102,2019-02-06 10:04:58,Jerusalem,All,Jerusalem,Red Alert
3638,2021-05-10 18:03:07,Jerusalem,"North and Alonim, East",Jerusalem,Red Alert
9923,2021-11-03 18:05:03,Jerusalem,North,Jerusalem,Red Alert
9929,2021-11-03 18:15:02,Jerusalem,"East, Qafr 'Aqab, Atarot Industrial Zone, South",Jerusalem,Red Alert
9933,2021-11-03 18:25:01,Jerusalem,"Center, North, East",Jerusalem,Red Alert
...,...,...,...,...,...
77726,2025-08-17 16:16:14,Jerusalem,West,Jerusalem,Red Alert
78120,2025-08-27 05:30:40,Jerusalem,"Center, East, North, Atarot Industrial Zone, S...",Jerusalem,Red Alert
78514,2025-09-09 20:05:56,Jerusalem,"Center, East, North, South, West",Jerusalem,Red Alert
78797,2025-09-16 18:51:38,Jerusalem,"Center, East, North, Atarot Industrial Zone, S...",Jerusalem,Red Alert


In [136]:
df[df['locality'] == 'Netanya']

Unnamed: 0,datetime,locality,zones,region,threat_type
4338,2021-05-11 20:46:29,Netanya,"West, East",Sharon,Red Alert
4493,2021-05-11 20:50:59,Netanya,"West, East",Sharon,Red Alert
9938,2021-11-03 18:35:02,Netanya,West,Sharon,Red Alert
31142,2024-09-25 06:32:25,Netanya,"West, East",Sharon,Red Alert
33466,2024-10-01 19:34:17,Netanya,"West, East",Sharon,Red Alert
34298,2024-10-01 19:39:45,Netanya,"West, East",Sharon,Red Alert
35399,2024-10-01 19:50:22,Netanya,"West, East",Sharon,Red Alert
38605,2024-10-14 11:19:26,Netanya,"West, East",Sharon,Red Alert
38799,2024-10-14 17:33:28,Netanya,"West, East",Sharon,Red Alert
38929,2024-10-14 17:36:29,Netanya,"West, East",Sharon,Red Alert


In [137]:
df[df['locality'] == 'Petah Tikva']

Unnamed: 0,datetime,locality,zones,region,threat_type
4351,2021-05-11 20:46:30,Petah Tikva,All,Yarkon,Red Alert
4560,2021-05-11 20:51:34,Petah Tikva,All,Yarkon,Red Alert
4619,2021-05-11 20:58:52,Petah Tikva,All,Yarkon,Red Alert
4734,2021-05-11 21:15:23,Petah Tikva,All,Yarkon,Red Alert
4996,2021-05-12 03:00:25,Petah Tikva,All,Yarkon,Red Alert
...,...,...,...,...,...
78243,2025-09-03 09:39:43,Petah Tikva,All,Dan,Red Alert
78660,2025-09-13 03:47:38,Petah Tikva,All,Dan,Red Alert
78942,2025-09-18 20:32:06,Petah Tikva,All,Dan,Red Alert
79235,2025-09-25 22:42:12,Petah Tikva,All,Dan,Red Alert


⚠ **WARNING!** ⚠ 
Some localities may refer to different original regions at different times (for example: Herzliya may be attributed to Sharon and Dan, Petah Tikva to Yarkon and Dan). This is due to changes in the data in the telegram channel.


### Checking for duplicates

In [138]:
print('Number of duplicates:', df.duplicated().sum())

Number of duplicates: 0


In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79670 entries, 0 to 79669
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     79670 non-null  datetime64[ns]
 1   locality     79670 non-null  object        
 2   zones        79670 non-null  object        
 3   region       79670 non-null  object        
 4   threat_type  79670 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 3.0+ MB


### Saving the dataframe

In [140]:
df.to_csv('cumta_df.csv', index=False, encoding='utf-8')

## Conclusion
**Data Preprocessing**:

- Initially, we constructed a dataframe where each row corresponded to a single message from the Telegram channel. Each message contained information about warnings spanning several minutes.
- Next, we transformed the dataframe so that each row represented a locality. We then merged zones of major cities, assigning the city name to the `'locality'` column, while district names were stored in the `'zones'` column. 
- If a locality was not a major city, the `'zones'` column was set to `'All`'

**Final Processed Dataframe:**

- **Total rows:** 79670
- **Duplicates:** None  
- **Time column (`'time'`):** Converted to `datetime64[ns]` format  
- **Missing value** None  

**Columns:**

| Column Name | Description |
|-------------|------------|
| **datetime** | Date and time of the warning in `datetime64[ns]` format |
| **locality** | Name of the locality (city, town, or settlement) |
| **zones** | Zones within a locality. If the locality is not a major city, this column contains `'All'` |
| **region** | Region to which the locality belongs |
| **threat_type** | Type of warning. Possible values: `'Red Alert'`, `'Unrecognized Aircraft'`, `'Terrorist Infiltration'`, `'Interception Pieces'`, `'Earthquake'` |
| **district** | Administrative district or region where the locality is located; an official governmental division. |

This structured dataset is now ready for further analysis and visualization.
