Readme:

- Do the following before running any code in this notebook:
    - Get your API key from https://ai.google.dev/
    - Save your API key to 'gemini_api.txt'.
    - Run the `__init__` section.
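
The checklist above can be verified with a small pre-flight check before any API call is made. This is only a sketch: `check_api_key_file` is a hypothetical helper that is not part of the notebook, and it assumes the key sits on the first line of `gemini_api.txt`.

```python
import os


def check_api_key_file(path="gemini_api.txt"):
    """Return True if the key file exists and its first line is non-empty.

    Hypothetical helper: it only checks that a key is present,
    not that the key is valid for the API.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "r") as file:
        return bool(file.readline().strip())
```

Calling `check_api_key_file()` with no arguments looks for the file in the current working directory, which is where the notebook later reads it from.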

# __init__

In [7]:
from datetime import datetime
import pandas as pd
import numpy as np
from word2number import w2n
import re
from tqdm import tqdm
import ast
import time
import json
import os

## Functions

In [8]:
def cleanup_json(filename):
    """
    Read a JSON file, filter out entries with value 'error:llm',
    and update the file with the cleaned data.

    Args:
    - filename (str): The name of the JSON file to clean up.
    """
    if os.path.exists(filename):
        with open(filename, "r") as file:
            data = json.load(file)
    else:
        data = {}

    # Filter out entries with 'error:llm'
    cleaned_data = {key: value for key, value in data.items() if value != 'error:llm'}

    # Write cleaned data back to the JSON file
    with open(filename, "w") as file:
        json.dump(cleaned_data, file)
        
        
# example
# cleanup_json('extraction_progress.json')
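
Because `cleanup_json` removes the failed entries, their ids are no longer listed as processed and will be attempted again on the next run. Before cleaning, it can help to know how many entries that affects. The sketch below uses a hypothetical helper, `count_llm_errors`, which is not part of the notebook and assumes the same `'error:llm'` marker.

```python
import json
import os


def count_llm_errors(filename, marker="error:llm"):
    """Count entries in the JSON file whose value equals the failure marker."""
    if not os.path.exists(filename):
        return 0
    with open(filename, "r") as file:
        data = json.load(file)
    return sum(1 for value in data.values() if value == marker)
```

For example, `count_llm_errors("extraction_progress.json")` can be called just before `cleanup_json` to see how many articles will be retried.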

In [9]:
# Function to read API key from file
def read_api_key(file_path):
    with open(file_path, 'r') as file:
        api_key = file.readline().strip()  # Read the first line and remove any leading/trailing whitespace
    return api_key

## Load data to be extracted

In [10]:
# Load the news articles to be processed
df_livedata = pd.read_csv("filtered_news.csv")
df_livedata.reset_index(inplace=True)

In [11]:
# Keep only the columns needed for extraction; copy to avoid SettingWithCopyWarning
df_livedata = df_livedata[['id', 'content']].copy()
df_livedata['llm_completion_dict'] = ''

In [12]:
df_livedata.shape

(6750, 3)

## Google Gemini Pro API prompt engineering

In [47]:
"""
Install the package once from the command line with pip:

$ pip install google-generativeai
"""

import google.generativeai as genai

api_key = read_api_key('gemini_api.txt')

genai.configure(api_key=api_key)

# Set up the model
generation_config = {
  "temperature": 0.15,
  "top_p": 0.1,
  "top_k": 1,
  "max_output_tokens": 2048,
  "stop_sequences": [
    "/",
  ],
}

safety_settings = [
  {
    "category": "HARM_CATEGORY_HARASSMENT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_HATE_SPEECH",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
  {
    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
    "threshold": "BLOCK_MEDIUM_AND_ABOVE"
  },
]

model = genai.GenerativeModel(model_name="gemini-1.0-pro",
                              generation_config=generation_config,
                              safety_settings=safety_settings)

convo = model.start_chat(history=[
  {
    "role": "user",
    "parts": ["Learn these:\n\nInput:\n'New Delhi: A two-and-a-half-year-old boy was crushed to death under a car while walking near a road in northwest Delhi’s Mukherjee Nagar on Wednesday. The suspect, identified as Mehak Bansal (38), a businessman, has been arrested, police said. The incident which was captured by a CCTV camera, showed a red car colliding with the toddler as he was seen running to the other side. The accused, who was behind the wheel, inadvertently ran over the child, resulting in fatal injuries. Parents and bystanders rushed to the child\\'s aid, transporting him to hospital for urgent medical attention. \"Upon receiving the information, a team rushed to the hospital, where it was discovered that the child had succumbed to his injuries. A case has been registered under Section 279 for rash driving and Section 304(a) for death caused by negligence,\" said Jitender Meena, DCP (Northwest). Explore Your Financial Landscape with Personalized Credit Insights.'\n\noutput:{\"fatalities\": 1,\"injured\":0,\"reason\":'negligence',\"victim_gender\":[\"Male\"],\"vehicles\":[\"car\"],\"child_involved\":1,\"names_ages\":[(\"Mehak Bansal\",38),(\"unknown_name\",2.5)]}/\n\ninput:\n\"JAIPUR: Four persons died while one was injured in an accident between a car and a private bus in Rajasthan's Dungarpur district, police said on Saturday. The incident occurred late Friday night on NH 48 in Bichhiwada when the car, traveling in the wrong direction, collided with a private bus heading towards Dungarpur. The impact of the collision was so severe that it left the car's front end completely demolished, police said. The four young individuals who died have been identified as Satish Bhai (25), Ankit Ninama (25), Ravi (23), and Kaushik (21). All of them hailed from Shamlaji in Gujarat, according to the police. The bodies of the deceased have been placed in the district hospital's mortuary, and their family members have been notified of the accident. 
Once their relatives arrive, post-mortem examinations will be conducted, police said. (With inputs from PTI) Explore Your Financial Landscape with Personalized Credit Insights.\"\n\noutput:{\"fatalities\": 4,\"injured\":1,\"reason\":'wrong direction',\"victim_gender\":[\"Female\",\"Male\"],\"vehicles\":[\"tractor-trolley\",\"car\"],\"child_involved\":0,\"names_ages\":[(\"Satish Bhai\",25),(\"Ankit Ninama\",25),(\"Ravi\",23),(\"Kaushik\",21)]}/\n\n\n\ninput:\n\"KHARGONE: Three policemen, including two sub-inspectors died, and two critically injured when their car rammed into a parked truck in MP 's Khragone district early Saturday morning. The victims were returning to Sanawad town, 70km from the district headquarters, after completing their duties at the annual fair in Shivdola, another 70km away, said police. At 5.30am near Badud village, just 4km short of destination, their hatchback slammed into the rear of a truck parked on the roadside, Sanawad police station in charge Nirmal Kumar Shrivas said. SIs Vimal Tiwari and Ramesh Chandra Bhaskare and constable Manoj Kumawat died on the spot, he said. Constables Raghuveer Rawat and civic defence personnel Komal Dangode were severely injured. They have been shifted to Indore, 70km away. Tiwari was a resident of Indore, Kumrawat from Simrol, and Bhaskere, who owned the car, was from Burhanpur. Rescuers had hard time extricating victims from the wreckage The 108 ambulance drivers, Yuvraj Dhodle and Ritesh Mandloi, who took the victims to hospital said that the car was mangled in the collision and they had a hard time extricating them from the wreckage. The bodies have been handed over to their families after autopsy. Nimar Range DIG Chandrashekhar Solanki told TOI that he has ordered an investigation and asked for a report. CCTV footage shows the car hitting the rear of a truck parked near a petrol pump, he said, adding that why the accident happened is being investigated. 
The families of the deceased police personnel are being provided assistance as per the rules, he said. Explore Your Financial Landscape with Personalized Credit Insights"]
  },
  {
    "role": "model",
    "parts": ["{\"fatalities\": 3,\"injured\":2,\"reason\":'under investigation',\"victim_gender\":[\"Male\"],\"vehicles\":[\"car\",\"truck\"],\"child_involved\":0,\"names_ages\":[(\"Vimal Tiwari\",0),(\"Ramesh Chandra Bhaskare\",0),(\"Manoj Kumawat\",0),(\"Raghuveer Rawat\",0),(\"Komal Dangode\",0)]}"]
  },
])

convo.send_message("HYDERABAD: A woman and her 13-year-old daughter died in Antharam village in Medak’s Munpalle mandal on Tuesday when the car in which they were travelling was hit by a tractor. Two others in the car escaped with injuries. The deceased have been identified as Swaroopa, 36, and her daughter Sri Lekha . The two injured, Mallesham and Lavanya, have been admitted to a hospital. Police said the accident took place in the early hours. The tractor came in the opposite direction and rammed the car, police said. A case was registered under section 304-A (negligence causing death) of IPC. Tractor driver was also injured in the accident. Explore Your Financial Landscape with Personalized Credit Insights.")
print(convo.last.text)

{"fatalities": 2,"injured":2,"reason":'negligence',"victim_gender":["Female"],"vehicles":["car","tractor"],"child_involved":1,"names_ages":[("Swaroopa",36),("Sri Lekha",13),("Mallesham",0),("Lavanya",0)]}
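
The printed response is not strict JSON: it mixes a single-quoted string (`'negligence'`) and tuples with double-quoted keys, following the style of the few-shot examples. The sketch below, which uses the response above verbatim, shows why `json.loads` rejects it while `ast.literal_eval` can parse it as a Python literal.

```python
import ast
import json

# Response text copied from the output above
raw = """{"fatalities": 2,"injured":2,"reason":'negligence',"victim_gender":["Female"],"vehicles":["car","tractor"],"child_involved":1,"names_ages":[("Swaroopa",36),("Sri Lekha",13),("Mallesham",0),("Lavanya",0)]}"""

# json.loads fails because JSON has no single-quoted strings or tuples
try:
    json.loads(raw)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval accepts the text as a Python dict literal
record = ast.literal_eval(raw)

print(is_valid_json)
print(record["fatalities"], record["names_ages"][1])
```

An alternative would be to rewrite the few-shot examples so the model emits strict JSON, at the cost of representing the name and age pairs as lists instead of tuples.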


# LANGUAGE MODEL PROMPT ENGINEERING

## Dump API calls to JSON

In [27]:
import json
import os
from tqdm import tqdm

# Check if extraction_progress.json exists
if os.path.exists("extraction_progress.json"):
    with open("extraction_progress.json", "r") as file:
        extraction_progress = json.load(file)
else:
    extraction_progress = {}

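# Filter out ids that have already been processed
# (JSON object keys are always strings, so convert them back to int)
processed_ids = list(map(int, extraction_progress.keys()))

remaining_ids = df_livedata[~df_livedata['id'].isin(processed_ids)]

count = 1

for index, row in tqdm(remaining_ids.iterrows(), total=remaining_ids.shape[0]):
    try:
        convo.send_message(row['content'])
        data_string = convo.last.text
    except Exception:
        # Mark failed calls so cleanup_json can drop them and they are retried later
        data_string = 'error:llm'

    # Use string keys so new entries match the keys loaded from the JSON file
    extraction_progress[str(row['id'])] = data_string

    # Save progress after every article so an interrupted run can resume
    with open("extraction_progress.json", "w") as file:
        json.dump(extraction_progress, file)

    count += 1
    # Ask for confirmation each time the counter reaches a multiple of 1000
    if count % 1000 == 0:
        user_input = input("Continue (y/n)? ")
        if user_input.lower() != 'y':
            break
    # Pause between requests
    time.sleep(1.5)
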

 10%|████                                    | 998/9978 [11:31<20:05,  7.45it/s]

Continue (y/n)?  y


 20%|███████▊                               | 1998/9978 [58:43<17:01,  7.81it/s]

Continue (y/n)?  n


 20%|███████                            | 1998/9978 [1:02:05<4:07:59,  1.86s/it]
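
The loop above records `'error:llm'` after a single failed call and relies on a later run to retry it. One possible variation, not used in the notebook, is to retry a few times with an increasing delay before giving up. The sketch below assumes a hypothetical helper named `call_with_retries`; the attempt count and delays are illustrative values only.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.5, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception.

    Waits base_delay * 2**attempt seconds between attempts and returns
    'error:llm' if every attempt fails. The sleep function is injectable
    so the helper can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            # Only wait if another attempt will follow
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return "error:llm"
```

To use it in the loop, the two lines inside the `try` block would be wrapped in a small function that takes the article text and returns the response text, and that function would be passed as `func`.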


## Clean up JSON dump

In [30]:
cleanup_json("extraction_progress.json")

# JSON to CSV

In [33]:
import json
import pandas as pd
import os
import ast

# Check if extraction_progress.json exists
if os.path.exists("extraction_progress.json"):
    with open("extraction_progress.json", "r") as file:
        data = json.load(file)

    # Create DataFrame with keys and entire JSON string values
    df_json = pd.DataFrame(list(data.items()), columns=['ID', 'JSON_String'])

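    # Function to convert the model's output string to a dictionary.
    # ast.literal_eval is used instead of json.loads because the output
    # mixes single-quoted strings and tuples, which are not valid JSON.
    def json_str_to_dict(json_str):
        try:
            return ast.literal_eval(json_str)
        except Exception:
            return "error:incorrect_format"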

    # Convert JSON strings to dictionaries
    df_json['JSON_Dict'] = df_json['JSON_String'].apply(json_str_to_dict)

    # Convert dictionaries to DataFrame
    df = pd.json_normalize(df_json['JSON_Dict'])

    # Concatenate df_json and df
    df_extacted = pd.concat([df_json, df], axis=1)
    print('Extraction complete')
else:
    print("extraction_progress.json file does not exist.")

Extraction complete


In [35]:
df_extacted.head(2)

| | ID | JSON_String | JSON_Dict | fatalities | injured | reason | victim_gender | vehicles | child_involved | names_ages |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 107938160 | {"fatalities": 3,"injured":1,"reason":'rash dr... | {'fatalities': 3, 'injured': 1, 'reason': 'ras... | 3.0 | 1.0 | rash driving | [Female, Male] | [car] | 0.0 | [(Chinta Devi, 51), (Ram Chandra Gupta, 55), (... |
| 1 | 107821301 | {"fatalities": 3,"injured":0,"reason":'brakes ... | {'fatalities': 3, 'injured': 0, 'reason': 'bra... | 3.0 | 0.0 | brakes applied | [Male] | [car, truck] | 0.0 | [(Ravinder Kumar, 0), (Subhash Kumar, 0), (Aja... |


In [37]:
df_extacted.shape

(76, 10)
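
The `names_ages` column holds a list of `(name, age)` tuples per article. In the few-shot examples an age of `0` appears where the article gives no age, so downstream analysis may want to treat it as missing. The sketch below assumes that reading of `0`; `split_names_ages` is a hypothetical helper, not part of the notebook.

```python
def split_names_ages(names_ages, unknown_age=0):
    """Split [(name, age), ...] into two lists, mapping unknown_age to None.

    Assumption: an age equal to unknown_age is a placeholder for
    'not stated in the article', as in the few-shot examples.
    """
    names = [name for name, _ in names_ages]
    ages = [None if age == unknown_age else age for _, age in names_ages]
    return names, ages


names, ages = split_names_ages([("Swaroopa", 36), ("Sri Lekha", 13), ("Mallesham", 0)])
print(names)
print(ages)
```

Rows whose output could not be parsed have no list in `names_ages`, so they would need to be skipped or filled before applying a helper like this to the column.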

# SAVE TO CSV

In [10]:
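# Run the "JSON to CSV" cell first in the same kernel session;
# otherwise df_extacted is not defined and this cell raises a NameError.
df_extacted.to_csv('data_extracted_llm_gemini.csv')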

NameError: name 'df_extacted' is not defined

# backup_code

In [75]:
import json
import os
from tqdm import tqdm

# Check if extraction_progress.json exists
if os.path.exists("extraction_progress.json"):
    with open("extraction_progress.json", "r") as file:
        extraction_progress = json.load(file)
else:
    extraction_progress = {}

# Convert extraction_progress keys to a list for filtering.
# JSON object keys are strings, so convert them to int to match the 'id' column;
# without this, isin() never matches and processed articles are sent again.
processed_ids = list(map(int, extraction_progress.keys()))

# Filter out ids that have already been processed
remaining_ids = df_livedata[~df_livedata['id'].isin(processed_ids)]

for index, row in tqdm(remaining_ids.head(10).iterrows(), total=min(10, remaining_ids.shape[0])):
    try:
        convo.send_message(row['content'])
        data_string = convo.last.text
    except Exception:
        data_string = 'error:llm'

    df_livedata.loc[index, 'llm_completion_dict'] = data_string
    # Use string keys so new entries match the keys loaded from the JSON file
    extraction_progress[str(row['id'])] = data_string

    # Save extraction progress to extraction_progress.json after each iteration
    with open("extraction_progress.json", "w") as file:
        json.dump(extraction_progress, file)

# Save the updated df_livedata to a JSON file
df_livedata[['id', 'llm_completion_dict']].to_json("extraction_data.json", orient="records")


100%|███████████████████████████████████████████| 10/10 [01:10<00:00,  7.06s/it]
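
Both extraction loops rewrite `extraction_progress.json` in place after every article, so an interruption in the middle of a write could leave a truncated file. One way to reduce that risk, not used in the notebook, is to write to a temporary file and then swap it in. The sketch below assumes a hypothetical helper named `save_progress_atomic`.

```python
import json
import os
import tempfile


def save_progress_atomic(progress, filename="extraction_progress.json"):
    """Write progress to a temp file in the same directory, then replace the target.

    os.replace swaps the file in a single step, so readers see either
    the old file or the complete new one.
    """
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as file:
            json.dump(progress, file)
        os.replace(tmp_path, filename)
    except BaseException:
        # Remove the partial temp file before re-raising
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the loops above, the `with open(...)` and `json.dump(...)` pair would be replaced by a single call such as `save_progress_atomic(extraction_progress)`.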
