## First: Processing states2016.csv (Countries List)

In [1]:
import pandas as pd 

Read the states data from the file

In [2]:
states = pd.read_csv('data//states2016.csv')
states.head()

Unnamed: 0,stateabb,ccode,statenme,styear,stmonth,stday,endyear,endmonth,endday,version,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,USA,2,United States of America,1816,1,1,2016,12,31,2016,...,,,,,,,,,,
1,CAN,20,Canada,1920,1,10,2016,12,31,2016,...,,,,,,,,,,
2,BHM,31,Bahamas,1973,7,10,2016,12,31,2016,...,,,,,,,,,,
3,CUB,40,Cuba,1902,5,20,1906,9,25,2016,...,,,,,,,,,,
4,CUB,40,Cuba,1909,1,23,2016,12,31,2016,...,,,,,,,,,,


In [3]:
states.columns

Index(['stateabb', 'ccode', 'statenme', 'styear', 'stmonth', 'stday',
       'endyear', 'endmonth', 'endday', 'version', 'Unnamed: 10',
       'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14',
       'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20'],
      dtype='object')

Keep only the required fields

In [5]:
states = states[['stateabb', 'ccode','statenme', 'styear', 'stmonth', 'stday','endyear', 'endmonth', 'endday']]

In [6]:
states.head()

Unnamed: 0,stateabb,ccode,statenme,styear,stmonth,stday,endyear,endmonth,endday
0,USA,2,United States of America,1816,1,1,2016,12,31
1,CAN,20,Canada,1920,1,10,2016,12,31
2,BHM,31,Bahamas,1973,7,10,2016,12,31
3,CUB,40,Cuba,1902,5,20,1906,9,25
4,CUB,40,Cuba,1909,1,23,2016,12,31


In [7]:
states.duplicated().sum()

0

In [9]:
states.isna().sum()

stateabb    0
ccode       0
statenme    0
styear      0
stmonth     0
stday       0
endyear     0
endmonth    0
endday      0
dtype: int64

Extract the active countries for 2015 - 2023. The file ends in 2016, so states whose endyear is 2016 are treated as still active.

In [11]:
print(states['endyear'].max())
print(states['styear'].max())

2016
2011


In [13]:
print(states['endyear'].value_counts())

endyear
2016    195
1940      8
1990      4
1860      4
1867      3
1939      3
1871      3
1866      2
1941      2
1945      2
1882      1
1958      1
1881      1
1905      1
1912      1
1936      1
1964      1
1918      1
1861      1
1992      1
1938      1
1906      1
1942      1
1870      1
1916      1
1915      1
1975      1
Name: count, dtype: int64


In [14]:
states = states[states['endyear'] == 2016]
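The equality filter above works because 2016 is the last year the file covers. A more general way to express "active during a period" is an interval-overlap test on styear and endyear. The sketch below is illustrative only: `overlaps_window` is a hypothetical helper, not part of the notebook, and the rows are copied from the head() output shown earlier.

```python
def overlaps_window(styear, endyear, first_year, last_year):
    """Return True if [styear, endyear] overlaps [first_year, last_year]."""
    return styear <= last_year and endyear >= first_year

# (stateabb, styear, endyear) rows taken from the head() output above
rows = [
    ("USA", 1816, 2016),
    ("CUB", 1902, 1906),
    ("CUB", 1909, 2016),
]

# The file stops at 2016, so the 2015 - 2023 window is only checkable up to 2016
active = [row for row in rows if overlaps_window(row[1], row[2], 2015, 2023)]
print(active)
```

With the real DataFrame the same test could be written as a boolean mask on the styear and endyear columns.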

In [16]:
states.head()

Unnamed: 0,stateabb,ccode,statenme,styear,stmonth,stday,endyear,endmonth,endday
0,USA,2,United States of America,1816,1,1,2016,12,31
1,CAN,20,Canada,1920,1,10,2016,12,31
2,BHM,31,Bahamas,1973,7,10,2016,12,31
4,CUB,40,Cuba,1909,1,23,2016,12,31
6,HAI,41,Haiti,1934,8,15,2016,12,31


In [17]:
print(states['endyear'].value_counts())

endyear
2016    195
Name: count, dtype: int64


In [18]:
print(states['statenme'].value_counts())

statenme
United States of America    1
Kuwait                      1
Botswana                    1
Swaziland                   1
Madagascar                  1
                           ..
Moldova                     1
Romania                     1
Russia                      1
Estonia                     1
Samoa                       1
Name: count, Length: 195, dtype: int64


Save the extracted countries to a CSV file

In [19]:
states.to_csv('data//EState.csv', index=False)

In [2]:
states = pd.read_csv('data//EState.csv')
states.head()

Unnamed: 0,stateabb,ccode,statenme,styear,stmonth,stday,endyear,endmonth,endday
0,USA,2,United States of America,1816,1,1,2016,12,31
1,CAN,20,Canada,1920,1,10,2016,12,31
2,BHM,31,Bahamas,1973,7,10,2016,12,31
3,CUB,40,Cuba,1909,1,23,2016,12,31
4,HAI,41,Haiti,1934,8,15,2016,12,31


In [3]:
print(states['endmonth'].value_counts())

endmonth
12    195
Name: count, dtype: int64


## Second: Processing New York Times Articles

Read the articles from the data/New York Times folder and collect those published from 2015 to 2023

In [21]:
import os 

In [22]:
root_folder = "data//New York Times"
valid_years = [str(year) for year in range(2015, 2024)] # years 2015 - 2023 as strings, to match the folder names
data = []

In [24]:
for year_folder in os.listdir(root_folder):
    year_path = os.path.join(root_folder, year_folder)

    # Check that it is a directory and that the year is in the valid range
    if os.path.isdir(year_path) and year_folder in valid_years:
        # Loop through each .txt file in the year folder
        for file_name in os.listdir(year_path):
            if file_name.endswith('.txt'):
                file_path = os.path.join(year_path, file_name)
                with open(file_path, 'r', encoding='latin-1') as file:
                    content = file.read()

                # Append the data to the list
                data.append({
                    'year': year_folder,
                    'file_name': file_name,
                    'content': content,
                })


In [26]:
DataDF = pd.DataFrame(data)
DataDF.to_csv('data//new_york_times_articles.csv', index=False)
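The folder walk above can also be wrapped in a function so it is easy to test on a small tree. This is a minimal sketch, not the notebook's code: `collect_articles` is a hypothetical name, and the demo files are made up. Entries are sorted because `os.listdir` and `Path.iterdir` return them in arbitrary order.

```python
import tempfile
from pathlib import Path

def collect_articles(root_folder, valid_years, encoding="latin-1"):
    """Return one dict per .txt file found in the year folders under root_folder."""
    records = []
    for year_dir in sorted(Path(root_folder).iterdir()):
        # Skip stray files and folders outside the requested years
        if not year_dir.is_dir() or year_dir.name not in valid_years:
            continue
        for txt_path in sorted(year_dir.glob("*.txt")):
            records.append({
                "year": year_dir.name,
                "file_name": txt_path.name,
                "content": txt_path.read_text(encoding=encoding),
            })
    return records

# Demo on a throwaway folder tree (made-up file names and contents)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2015").mkdir()
    (root / "2014").mkdir()
    (root / "2015" / "a.txt").write_text("hello", encoding="latin-1")
    (root / "2015" / "notes.md").write_text("skip", encoding="latin-1")
    (root / "2014" / "b.txt").write_text("old", encoding="latin-1")
    valid_years = {str(year) for year in range(2015, 2024)}
    articles = collect_articles(root, valid_years)

print(articles)
```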

In [4]:
NewYorkTimes = pd.read_csv('data//new_york_times_articles.csv')
NewYorkTimes.head()

Unnamed: 0,year,file_name,content
0,2015,ProQuestDocuments-2025-01-17-10.txt,______________________________________________...
1,2015,ProQuestDocuments-2025-01-17-11.txt,______________________________________________...
2,2015,ProQuestDocuments-2025-01-17-12.txt,______________________________________________...
3,2015,ProQuestDocuments-2025-01-17-13.txt,______________________________________________...
4,2015,ProQuestDocuments-2025-01-17-14.txt,______________________________________________...


In [5]:
len(NewYorkTimes)

408

Clean the data (lowercase, remove stopwords, extra spaces, and punctuation, lemmatize)

In [6]:
print(NewYorkTimes['file_name'][0])
print(NewYorkTimes['content'][0])

ProQuestDocuments-2025-01-17-10.txt
____________________________________________________________

Search Strategy

Set#: S1
Searched for: (air base OR air strike OR airbase OR aircraft OR airstrike OR alert OR antiaircraft OR armed OR armo* OR arms OR army OR artillery OR attack OR batteries OR battery OR battle OR battleship OR block* OR bomb OR border OR buildup OR carrier OR casualties OR casualty OR cease OR ceasefire OR cease-fire OR clash* OR combat OR conflict OR crisis OR cruiser OR damage OR declare war OR defence OR defense OR defensive measures OR defian* OR deploy* OR destroy OR detained OR dispatch* OR display of force OR dispute* OR embargo OR erupt* OR fight* OR fire OR fired OR forc* OR fortification OR hit OR hostile OR incursion* OR infantry OR interstate OR invasion OR jet OR kill* OR launch* OR liberate OR line of control OR maneuver OR milit* OR missile* OR mobiliz* OR mortar OR naval OR nuclear OR occup* OR offensive OR operation OR patrol* OR peace declaration OR

In [5]:
import pandas as pd
import re
from tqdm import tqdm
from langchain_community.llms import CTransformers
from langchain.text_splitter import RecursiveCharacterTextSplitter
import concurrent.futures

In [6]:
df = pd.read_csv('data//new_york_times_articles.csv')
df.head()

Unnamed: 0,year,file_name,content
0,2015,ProQuestDocuments-2025-01-17-10.txt,______________________________________________...
1,2015,ProQuestDocuments-2025-01-17-11.txt,______________________________________________...
2,2015,ProQuestDocuments-2025-01-17-12.txt,______________________________________________...
3,2015,ProQuestDocuments-2025-01-17-13.txt,______________________________________________...
4,2015,ProQuestDocuments-2025-01-17-14.txt,______________________________________________...


In [7]:
tqdm.pandas()

# Clean text: remove URLs and extra spaces, then lowercase (tqdm shows progress)
def clean_text(text):
    if pd.isna(text):  # Handle missing values
        return ""
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text).strip()   # Remove extra spaces
    return text.lower()  # Convert to lowercase
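The heading for this step also lists punctuation and stopword removal, which `clean_text` does not perform. A minimal standard-library sketch of those two steps follows. `STOPWORDS` is a deliberately tiny, hypothetical list used only for illustration, `remove_punct_and_stopwords` is not part of the notebook, and lemmatization is left out because it needs an NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "was"}

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def remove_punct_and_stopwords(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = remove_punct_and_stopwords("The army launched an attack on the border.")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it splits hyphenated terms such as "cease-fire" into two tokens, which may or may not be wanted.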

In [8]:
df['content'] = df['content'].progress_apply(clean_text)

100%|██████████| 408/408 [03:09<00:00,  2.15it/s]


In [11]:
print(df['content'][0])



In [12]:
df.to_csv('data//new_york_times_articles.csv', index=False)

In [13]:
df = pd.read_csv('data//new_york_times_articles.csv')
df.head()

Unnamed: 0,year,file_name,content
0,2015,ProQuestDocuments-2025-01-17-10.txt,______________________________________________...
1,2015,ProQuestDocuments-2025-01-17-11.txt,______________________________________________...
2,2015,ProQuestDocuments-2025-01-17-12.txt,______________________________________________...
3,2015,ProQuestDocuments-2025-01-17-13.txt,______________________________________________...
4,2015,ProQuestDocuments-2025-01-17-14.txt,______________________________________________...


 Filter military-related articles & remove false positives

In [14]:
#Load LLaMA-2-7B Model from Colab Storage
llm = CTransformers(
    model="model/llama-2-7b-chat.ggmlv3.q4_0.bin", 
    model_type="llama",
    config={'max_new_tokens': 2, 'temperature': 0.5}  # Low temp for accuracy
)


In [15]:
# Initialize the text splitter to chunk the text into smaller parts
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)
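To make the two parameters concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class splits on separators and treats `chunk_size` as an upper bound); `sliding_chunks` is a hypothetical helper that only illustrates how an overlap of 50 characters carries context from one chunk into the next.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(sample)
print([len(chunk) for chunk in chunks])
```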

In [21]:
# Classification function: classify the text chunk by chunk
def classify_article(text):
    if not text:
        return 0

    # Split the text into chunks
    chunks = text_splitter.split_text(text)
    classifications = []

    for chunk in chunks:
        prompt = (
            "You are an AI trained to classify newspaper articles. "
            "Classify the following text as:\n"
            "'1' if it discusses war, military actions, armed conflicts, battles, invasions, airstrikes, bombings, or troop movements.\n"
            "'0' if it does not discuss war or military topics (e.g., politics, sports, entertainment, culture, economy).\n"
            "Return ONLY '1' or '0'.\n"
            f"**Article:**\n{chunk}\n"
        )

        response = llm.invoke(prompt)
        # LangChain LLM wrappers return a plain string from invoke();
        # the dict branch is kept for raw completion-style responses
        if isinstance(response, str):
            output_text = response.strip()
        elif isinstance(response, dict) and "choices" in response:
            output_text = response["choices"][0]["text"].strip()
        else:
            output_text = "0"

        classifications.append(1 if "1" in output_text else 0)

    # If any chunk is classified as military, mark the entire article as military
    return max(classifications, default=0)  # 1 if at least one chunk is military
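The aggregation rule (an article is military if any chunk is) can be checked without loading the model by passing in a stand-in classifier. In this sketch `parse_label`, `classify_chunks`, and `keyword_stub` are hypothetical names; the stub only imitates the shape of the model's answer and says nothing about its accuracy.

```python
def parse_label(output_text):
    """Map raw model output to 0 or 1; anything without a '1' counts as 0."""
    return 1 if "1" in output_text else 0

def classify_chunks(chunks, classify_fn):
    """Return 1 if classify_fn flags at least one chunk, else 0."""
    return int(any(parse_label(classify_fn(chunk)) for chunk in chunks))

def keyword_stub(chunk):
    """Hypothetical stand-in for the LLM call: answers '1' when 'airstrike' appears."""
    return "1" if "airstrike" in chunk else "0"

mixed = classify_chunks(["markets rallied", "an airstrike hit the base"], keyword_stub)
civil = classify_chunks(["markets rallied"], keyword_stub)
empty = classify_chunks([], keyword_stub)
print(mixed, civil, empty)
```

Because a single flagged chunk marks the whole article, this rule favors recall: one misclassified chunk is enough to produce a false positive.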

In [None]:
# apply classification function to the content column 

df['is_military'] = [classify_article(text) for text in tqdm(df['content'], desc="Classifying Articles")]




In [None]:
df.head()

In [None]:
# Remove non-military articles
df = df[df['is_military'] == 1]
df.head()

Save the military-related articles to NewYorkTimes_Military.csv

In [None]:
df.to_csv("NewYorkTimes_Military.csv", index=False)

In [None]:
df = pd.read_csv("NewYorkTimes_Military.csv")
df.head()

For each article:

  1- Extract the date

  2- Extract the fatality count (exact or approximate)

  3- Extract the countries involved

First:- Extract Date

Second:- Extract Fatality Count
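One possible starting point for this step is a regular expression that looks for a number followed shortly by a casualty word, with an optional hedge such as "at least" marking the count as approximate. This is only a sketch under that assumed phrasing: `extract_fatalities` and the word lists are hypothetical, the example sentences are made up, and the pattern will misread other numbers (a year right before "killed", for instance).

```python
import re

# Assumed phrasing: optional hedge, a number, up to three words, a casualty word
_FATALITY_RE = re.compile(
    r"(?P<approx>at least|about|around|nearly|more than|over)?\s*"
    r"(?P<count>\d[\d,]*)\s+(?:\w+\s+){0,3}?"
    r"(?:killed|dead|died)",
    re.IGNORECASE,
)

def extract_fatalities(text):
    """Return (count, is_approximate) for the first match, or None."""
    match = _FATALITY_RE.search(text)
    if match is None:
        return None
    count = int(match.group("count").replace(",", ""))
    return count, match.group("approx") is not None

print(extract_fatalities("at least 12 soldiers were killed in the attack"))
print(extract_fatalities("1,200 people died"))
print(extract_fatalities("talks resumed on monday"))
```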

Third:- Extract Countries involved 
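A simple way to approach this step is to search each article for the state names kept in the first part. The sketch below assumes whole-word, case-insensitive matching; `find_countries` is a hypothetical helper, and the names are a few of those shown in the states table. Formal names such as "United States of America" rarely appear verbatim in news text, so a real run would probably need an alias table as well.

```python
import re

def find_countries(text, state_names):
    """Return the state names that occur in text as whole words, in list order."""
    lowered = text.lower()
    found = []
    for name in state_names:
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append(name)
    return found

names = ["United States of America", "Canada", "Cuba", "Haiti"]
matches = find_countries("canada and cuba reopened talks", names)
print(matches)
```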

Check for duplicate incidents and remove one of each pair
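What counts as a duplicate is not defined here. The sketch below assumes that two records describe the same incident when they share a date and the same set of countries, and keeps the first one. `drop_duplicate_incidents`, the field names, and the records are all made up for illustration.

```python
def drop_duplicate_incidents(records):
    """Keep the first record for each (date, set of countries) key."""
    seen = set()
    unique = []
    for record in records:
        # frozenset makes the key independent of the order countries are listed in
        key = (record["date"], frozenset(record["countries"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Made-up records for illustration only
records = [
    {"date": "2015-03-01", "countries": ["Canada", "Cuba"]},
    {"date": "2015-03-01", "countries": ["Cuba", "Canada"]},
    {"date": "2015-03-02", "countries": ["Canada", "Cuba"]},
]
unique_records = drop_duplicate_incidents(records)
print(len(unique_records))
```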

##### Save the New York Times data in a DataFrame

## Third: Matching and Final Output