<a href="https://colab.research.google.com/github/arvindd22/Automated-Research-Article-Analysis-System/blob/main/Summer_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Research Articles Analysis System
- This project classifies research articles as relevant based on the presence of relevant graphs, X-Y axis labels and input keywords.
- The Pytesseract library is used to extract text from the images found in research articles.
- A YOLOv8 model was trained on shape memory alloy graph data to identify graphs in images from research articles.
- The complete code implementation of the project follows below.

# Connecting the Drive to the Notebook
* Running this cell asks for permission to mount Google Drive so that files can be saved to your Drive. This step is necessary to proceed further.
* All files downloaded throughout the notebook are saved to Google Drive >> My Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Download the scopus.csv file and put it in the MyDrive folder of your Google Drive


# Loading the CSV File
- Using the pandas library, we load the CSV file into the DataFrame 'df' and then create 'dois', which holds only the DOI column of the Scopus CSV, converted to strings.



In [None]:
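import pandas as pd
import requests

# Load the Scopus export; read_csv already returns a DataFrame
df = pd.read_csv('/content/drive/MyDrive/scopus.csv')
# Keep the DOI column as strings for building the request URLs
dois = df['DOI'].astype(str)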


# Downloading the XMLs from Scopus using SCOPUS API

1. Library Imports:
   The cell imports 'os' for file system operations. 'requests' (for making HTTP requests) and 'pandas' are already available from the previous cell.

2. Initialization:
   - An empty list 'status_code' is created to store HTTP status codes.
   - A counter 'j' is initialized to 1 and is used to name the output files.
   - A directory '/content/drive/MyDrive/xmls' is created using os.makedirs() with 'exist_ok=True', so no error is raised if the directory already exists.

3. Main Processing Loop:
   The code iterates through the Digital Object Identifiers (DOIs) stored in the 'dois' variable created in the previous cell.

   For each DOI:
   a. An XML URL is constructed from the Elsevier API endpoint and the current DOI.
   b. HTTP headers are set, including an API key for authentication and XML as the accepted response format.
   c. A GET request is made to the constructed URL using the requests library, with streaming enabled and a 30-second timeout.
   d. The HTTP status code of the response is appended to the 'status_code' list.

   e. If the status code is 200 (indicating a successful request):
      - The code opens a new file in binary write mode, named numerically (1.xml, 2.xml, etc.) in the previously created directory.
      - It then writes the response content to this file in chunks of 2048 bytes.
      - If an exception occurs during this process, it is caught and the loop moves on to the next DOI.

   f. The counter 'j' is incremented, regardless of whether the file write was successful, so that file N.xml corresponds to row N of the CSV.

4. Data Storage:
   - The collected status codes are added as a new column 'StatusCode' to the DataFrame 'df'.
   - This DataFrame is then saved as a CSV file named 'scopus.csv' in the '/content/drive/MyDrive/' directory, overwriting the original file.

In summary, we retrieve XML data for a list of DOIs from the Elsevier API, save successful responses as individual XML files, track the HTTP status codes of all requests, and finally save these status codes together with the original Scopus data in 'scopus.csv'.
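
The request construction and file naming described above can be sketched without any network access. This is only an illustration under stated assumptions: `build_xml_request` and `xml_path_for` are hypothetical helper names, the DOIs are made up, and the API key is a placeholder.

```python
API_BASE = 'https://api.elsevier.com/content/article/doi/'  # endpoint used in this notebook


def build_xml_request(doi, api_key):
    """Return the URL and headers for one full-text XML request."""
    url = API_BASE + doi + '?view=FULL'
    headers = {'X-ELS-APIKEY': api_key, 'Accept': 'text/xml'}
    return url, headers


def xml_path_for(position, xml_dir='/content/drive/MyDrive/xmls'):
    """Path of the XML file for the DOI at 0-based 'position' (files are 1-based)."""
    return f'{xml_dir}/{position + 1}.xml'


dois = ['10.1000/example-a', '10.1000/example-b']  # made-up DOIs for illustration
url, headers = build_xml_request(dois[0], 'YOUR_API_KEY')  # placeholder key

# One (URL, target file) pair per DOI, in CSV row order
plan = [(build_xml_request(d, 'YOUR_API_KEY')[0], xml_path_for(i))
        for i, d in enumerate(dois)]
```

Keeping the file number tied to the row position matters because the later cells look up each paper's XML and CSV files by row index plus one.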

## Requirements: Enter your Elsevier API key to download the XML files.

In [None]:

import os

status_code = []
j = 1  # 1-based counter used to name the XML files
os.makedirs('/content/drive/MyDrive/xmls', exist_ok=True)
for i, doi in enumerate(dois):
  xml_url = 'https://api.elsevier.com/content/article/doi/' + doi + '?view=FULL'

  headers = {
      'X-ELS-APIKEY': 'YOUR_API_KEY',  # ENTER YOUR OWN API KEY
      'Accept': 'text/xml'
  }

  r = requests.get(xml_url, stream=True, headers=headers, timeout=30)

  status_code.append(r.status_code)
  if r.status_code == 200:
    try:
      # The 'with' block closes the file even if a write fails
      with open("/content/drive/MyDrive/xmls/" + str(j) + ".xml", 'wb') as writefile:
        for chunk in r.iter_content(2048):
          writefile.write(chunk)
    except Exception as e:
      # Report the failure but still fall through to the counter update below
      print(f"Could not save XML for DOI {doi}: {e}")
  j = j + 1

df['StatusCode'] = status_code

df.to_csv('/content/drive/MyDrive/scopus.csv',index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['StatusCode'] = status_code



# Extracting Captions and Image Information from XML Files and Writing Them to CSV Files
#### Using Element Tree to search the XML file

Import libraries:

**xml.etree.ElementTree**

This library is used for parsing XML files.

**csv**

This library is used for reading and writing CSV files.

**re**

This library is used for regular expressions (used for cleaning captions).

Loop through each row in the pandas DataFrame (df):

- Check whether the 'StatusCode' column value is 200 (indicating success).
- If the status code is 200, build the file path of the corresponding XML file from the row index.
- Try parsing the XML file. If parsing is successful:
  - Initialize empty lists for images and captions.
  - Loop through all 'figure' elements in the XML file:
    - Extract the 'locator' attribute from the 'link' element within the 'figure' element (if it exists).
    - Extract the caption text from the 'simple-para' element within the 'caption' element (if it exists).
    - Clean the caption text (remove HTML tags with a regular expression and collapse extra spaces).
    - Append the caption text (or None if not found) to the 'captions' list.
  - Loop through all 'attachment' elements in the XML file:
    - Extract 'attachment_eid', 'ucs_locator', and 'filename' from the element's child elements.
    - Append a tuple containing this information to the 'images' list.
  - Create a new CSV file with headers 'UTD EID', 'UCS Locator', 'Filename', and 'Caption'.
  - Write each image's information (along with its corresponding caption) to the CSV file.
  - Read the created CSV file into a pandas DataFrame.
  - Add a new column 'link' to the DataFrame. This column contains URLs for the images based on their 'UTD EID' values.
  - Save the updated DataFrame back to the CSV file.
- If there is an XML parsing error, print a message indicating the error and the file that caused it.
- For any other error, print a generic error message mentioning the file that caused the error.
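
The figure and caption lookup described above can be tried on a small in-memory document. The sketch below is an illustration only: the sample XML is made up to mimic the structure the steps refer to, and `extract_figures` is a hypothetical helper, not the notebook's function.

```python
import re
import xml.etree.ElementTree as ET

NS = {'ce': 'http://www.elsevier.com/xml/common/dtd'}

# Tiny made-up document that mimics the figure structure described above
SAMPLE_XML = """<root xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:figure>
    <ce:caption><ce:simple-para>Stress-strain   curve of <ce:italic>NiTi</ce:italic> wire.</ce:simple-para></ce:caption>
    <ce:link locator="gr1"/>
  </ce:figure>
  <ce:figure>
    <ce:link locator="gr2"/>
  </ce:figure>
</root>"""


def extract_figures(xml_text):
    """Return a list of (locator, caption) pairs, with None for missing parts."""
    root = ET.fromstring(xml_text)
    figures = []
    for figure in root.findall('.//ce:figure', namespaces=NS):
        link = figure.find('.//ce:link', namespaces=NS)
        locator = link.attrib.get('locator') if link is not None else None

        caption = None
        caption_el = figure.find('.//ce:caption', namespaces=NS)
        if caption_el is not None:
            para = caption_el.find('.//ce:simple-para', namespaces=NS)
            if para is not None:
                # method='text' drops nested tags such as <ce:italic>
                raw = ET.tostring(para, encoding='unicode', method='text')
                caption = ' '.join(re.sub('<.*?>', '', raw).split())
        figures.append((locator, caption))
    return figures


figures = extract_figures(SAMPLE_XML)
```

Here the locator and caption of one `figure` element are kept together in a single tuple, so a figure without a caption simply yields `None` in its own entry.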


In [None]:
import os
import xml.etree.ElementTree as ET
import pandas as pd
import csv
import re
os.makedirs('/content/drive/MyDrive/csvs',exist_ok=True)
for i, row in df.iterrows():
    status_code = row['StatusCode']
    if status_code == 200:
        print(i+1)
        xml_file = f'/content/drive/MyDrive/xmls/{i + 1}.xml'

        try:
            tree = ET.parse(xml_file)
            root = tree.getroot()

            images = []
            captions = []
            # Extracting tags for captions
            for figure in root.findall('.//ce:figure', namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'}):
                link_element = figure.find('.//ce:link', namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
                locator = link_element.attrib.get('locator') if link_element is not None else None

                caption_element = figure.find('.//ce:caption', namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
                if caption_element is not None:
                    simple_para_element = caption_element.find('.//ce:simple-para', namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
                    if simple_para_element is not None:
                        raw_caption = ET.tostring(simple_para_element, encoding='unicode', method='text')
                        caption = re.sub('<.*?>', '', raw_caption)
                        caption = " ".join(caption.split())
                    else:
                        caption = None
                    captions.append(caption)
                else:
                    captions.append(None)
            # Extracting tags for Image
            for attachment in root.findall('.//xocs:attachment', namespaces={'xocs': 'http://www.elsevier.com/xml/xocs/dtd'}):
                attachment_eid = attachment.find('.//xocs:attachment-eid', namespaces={'xocs': 'http://www.elsevier.com/xml/xocs/dtd'}).text
                ucs_locator = attachment.find('.//xocs:ucs-locator', namespaces={'xocs': 'http://www.elsevier.com/xml/xocs/dtd'}).text
                filename = attachment.find('.//xocs:filename', namespaces={'xocs': 'http://www.elsevier.com/xml/xocs/dtd'}).text
                images.append((attachment_eid, ucs_locator, filename))

            # Adding to the csv file
            with open(f'/content/drive/MyDrive/csvs/output{i + 1}.csv', 'w', newline='', encoding='utf-8') as csvfile:
                csv_writer = csv.writer(csvfile)
                csv_writer.writerow(['UTD EID', 'UCS Locator', 'Filename', 'Caption'])
                for image, caption in zip(images, captions):
                    csv_writer.writerow([image[0], image[1], image[2], caption])

            df_output = pd.read_csv(f'/content/drive/MyDrive/csvs/output{i + 1}.csv')
            df_output['link'] = df_output['UTD EID'].apply(lambda x: f'https://ars.els-cdn.com/content/image/{x}'.replace('.jpg', '_lrg.jpg'))
            df_output.to_csv(f'/content/drive/MyDrive/csvs/output{i + 1}.csv', index=False)

        except ET.ParseError as e:
            print(f"ParseError: {e}. The file {xml_file} might be incomplete or malformed.")
        except Exception as e:
            print(f"An error occurred while processing the file {xml_file}: {e}")


2
4
6
10
14
29
38
ParseError: no element found: line 1, column 0. The file /content/drive/MyDrive/xmls/38.xml might be incomplete or malformed.


### The code above extracts image information (including captions) from the XML files, cleans the captions, writes the extracted information to one CSV file per article, and adds a URL column to that CSV.

The next cell then flags the captions that contain any of the listed keywords.
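
The whole-word keyword check performed in the next cell can be sketched in isolation. This is a minimal illustration with a shortened keyword list; `find_keywords` is a hypothetical name and the captions are made up.

```python
import re


def find_keywords(caption, keywords):
    """Return the keywords that occur in 'caption' as whole words (case-insensitive)."""
    if not isinstance(caption, str):
        return []  # missing captions (e.g. NaN) match nothing
    return [k for k in keywords
            if re.search(r'\b' + re.escape(k) + r'\b', caption, re.IGNORECASE)]


keywords = ['strain', 'DSC', 'temp', 'heat flow']
hits_a = find_keywords('DSC heat flow curves on cooling', keywords)
hits_b = find_keywords('Temperature dependence of strain', keywords)
hits_c = find_keywords(None, keywords)
```

Because of the `\b` word boundaries, the keyword `temp` does not match inside `Temperature`, so only `strain` is reported for the second caption.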



In [None]:
import os
import re
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/scopus.csv')

# Keywords to search for in the captions
keywords = [
    'strain', 'DSC', 'stress', 'heat flow', 'displacement', 'force',
    'martensite', 'fraction', 'load', 'X-Ray', 'temp', 'Differential',
    'scanning', 'calorimetry', 'curve'
]

# Function to find which keywords occur in the caption (whole-word, case-insensitive)
def highlight_keywords(caption, keywords):
    if not isinstance(caption, str):
        caption = ''

    found = False
    found_keywords = []
    for keyword in keywords:
        if re.search(r'\b' + re.escape(keyword) + r'\b', caption, re.IGNORECASE):
            found = True
            found_keywords.append(keyword)
    return found, found_keywords


for i, row in df.iterrows():
    status_code = row['StatusCode']


    if status_code == 200:
        output_file = f'/content/drive/MyDrive/csvs/output{i+1}.csv'

        # Check if the file exists
        if os.path.exists(output_file):
            try:
                # Read the corresponding output CSV file
                dof = pd.read_csv(output_file)

                # Apply the keyword highlighting function
                results = dof['Caption'].apply(lambda x: highlight_keywords(x, keywords))
                dof['contains_keywords'] = results.apply(lambda x: 1 if x[0] else 0)
                dof['found_keywords'] = results.apply(lambda x: ', '.join(x[1]) if x[1] else '')

                # Save the updated DataFrame back to the CSV file
                dof.to_csv(output_file, index=False)

            except Exception as e:
                print(f"An error occurred while processing the file {output_file}: {e}")
        else:
            print(f"The file {output_file} does not exist.")


The file /content/drive/MyDrive/csvs/output38.csv does not exist.


# Download and Organize Images from URLs

This Python code downloads images from URLs specified in a filtered DataFrame, saves them to designated directories, and also maintains a copy in a 'Master' directory.

**Key Features:**

* **Streamed Image Download:** Downloads images with the `requests` library in streaming mode, writing each file in 1024-byte chunks.
* **Directory Management:** Creates one directory per paper on demand and removes it again if none of its images could be downloaded.
* **Error Handling:** Includes `try-except` blocks to handle exceptions during the download process, such as network errors or invalid URLs.
* **Master Directory:** Maintains a copy of all downloaded images in a central 'Master' directory for easy access and backup.
* **Data Filtering:** Filters each output CSV to the rows where 'contains_keywords' == 1 before downloading.
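
The per-image save step described above can be sketched offline. In this illustration `save_image` is a hypothetical helper, the URL's file name is made up, and the byte chunks are passed in directly instead of coming from `response.iter_content(1024)`.

```python
import os
import shutil
import tempfile


def save_image(chunks, url, paper_dir, master_dir):
    """Write streamed chunks to paper_dir and copy the file into master_dir."""
    os.makedirs(paper_dir, exist_ok=True)
    os.makedirs(master_dir, exist_ok=True)
    file_name = url.split('/')[-1]  # last URL segment is the file name
    save_path = os.path.join(paper_dir, file_name)
    with open(save_path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
    shutil.copy2(save_path, os.path.join(master_dir, file_name))
    return save_path


base = tempfile.mkdtemp()  # temporary stand-in for /content/drive/MyDrive/images
demo_url = 'https://ars.els-cdn.com/content/image/example-gr1_lrg.jpg'  # made-up file name
saved = save_image([b'\xff\xd8', b'\xff\xd9'], demo_url,
                   os.path.join(base, '2'), os.path.join(base, 'Master'))
```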


In [None]:
import os
import pandas as pd
import requests
import shutil

df = pd.read_csv('/content/drive/MyDrive/scopus.csv')


for i, row in df.iterrows():
    status_code = row['StatusCode']
    mloc = f'/content/drive/MyDrive/images/Master'
    os.makedirs(mloc,exist_ok=True)

    if status_code == 200:
        output_file = f'/content/drive/MyDrive/csvs/output{i+1}.csv'

        # Check if the corresponding output CSV file exists
        if os.path.exists(output_file):
            dof = pd.read_csv(output_file)

            # Filter the DataFrame to get rows where 'contains_keywords' is 1
            contain = dof[dof['contains_keywords'] == 1]

            # Check if there are any rows to process
            if not contain.empty:
                # Define the directory to save the images
                loc = f'/content/drive/MyDrive/images/{i+1}'
                os.makedirs(loc, exist_ok=True)

                # Variable to track if any file was downloaded
                files_downloaded = False

                # Loop through the filtered rows
                for i1, r2 in contain.iterrows():
                    url = r2['link']

                    try:
                        response = requests.get(url, stream=True, timeout=30)

                        # Check if the request was successful
                        if response.status_code == 200:
                            # Extract the file name from the URL
                            file_name = url.split("/")[-1]

                            # Define the complete path for saving the file
                            save_path = os.path.join(loc, file_name)
                            mloc2 = os.path.join(mloc,file_name)

                            # Write the file data to a file
                            with open(save_path, 'wb') as file:
                                for chunk in response.iter_content(1024):
                                    file.write(chunk)
                            shutil.copy2(save_path,mloc2)
                            print(f"File downloaded: {save_path}")
                            files_downloaded = True
                        else:
                            print(f"Failed to download file: {url} (Status code: {response.status_code})")
                    except Exception as e:
                        print(f"Error downloading file: {url} ({e})")

                # If no files were downloaded, remove the created directory
                if not files_downloaded:
                    os.rmdir(loc)
                    print(f"Removed empty directory: {loc}")
        else:
            print(f"Output file not found: {output_file}")


File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-ga1_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr8_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr3_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr1_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr11_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr12_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr13_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr9_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr6_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr7_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1-s2.0-S1359645424003422-gr10_lrg.jpg
File downloaded: /content/drive/MyDrive/images/2/1

# Installing required libraries


In [None]:
!python --version

Python 3.10.12


In [None]:
!pip install pytesseract
!apt-get update
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,632 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:1

In [None]:
!tesseract --version


tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8


## Extracting text from images, checking it for relevant keywords, and updating the output CSV files with the results
Functions:

**highlight_keywords(caption, keywords)**

- Takes a caption (text) and a list of keywords.
- Marks each keyword found in the caption with bold markers (`**keyword**`).
- Returns a boolean indicating whether any keyword was found, together with the modified caption.

**process_image(img_path, keywords)**

- Takes an image path and a list of keywords.
- Opens the image and converts it to grayscale for better text extraction.
- Enhances the image contrast to improve text clarity.
- Uses Tesseract OCR to extract text from the image.
- Calls highlight_keywords to check the extracted text for keywords.
- Returns a status code (2 if keywords were found, 0 if not, 1 if the image could not be processed) and the extracted text.

**process_row(row, img_path, keywords)**

- Takes a DataFrame row, an image path, and the keywords list.
- Checks whether the 'contains_keywords' column value is 1 (the caption matched at least one keyword).
- If so, calls process_image to process the image.
- Stores the returned status code in 'contains_keywords2' and the extracted text in 'text'.
- Returns the updated row.

**main()**

- Reads the main DataFrame (scopus.csv).
- Iterates through each row and checks whether 'StatusCode' is 200 (the XML was downloaded successfully).
- If so, reads the corresponding output CSV file (output{i+1}.csv) and defines the image directory path based on the row index.
- Applies process_row to each row of the output CSV file using pandas' apply method, which processes the image associated with the row and updates it with the extracted text and a status code.
- Saves the processed DataFrame back to the output CSV file.
- Handles FileNotFoundError and generic exceptions during processing.
- Finally, saves the main DataFrame back to scopus.csv.
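
The three status codes can be illustrated without installing Tesseract. In the sketch below, `classify_image` and `fake_ocr` are hypothetical: `fake_ocr` is a stand-in for `pytesseract.image_to_string` that returns canned text, and the file names are made up.

```python
import re


def classify_image(img_path, keywords, ocr):
    """Return (status, text): 2 = keyword found, 0 = no keyword, 1 = OCR failed."""
    try:
        text = ocr(img_path)
    except Exception as e:
        print(f"Error processing image {img_path}: {e}")
        return 1, ''
    found = any(
        re.search(r'\b' + re.escape(k) + r'\b', text, re.IGNORECASE)
        for k in keywords
    )
    return (2 if found else 0), text


def fake_ocr(path):
    """Stand-in for the OCR call; returns canned text per file name."""
    canned = {'plot.jpg': 'Strain (%) vs Temperature',
              'photo.jpg': 'Scale bar 10 um'}
    return canned[path]  # a KeyError for unknown files simulates a failure


keywords = ['strain', 'heat flow']
results = {p: classify_image(p, keywords, fake_ocr)
           for p in ['plot.jpg', 'photo.jpg', 'missing.jpg']}
```

Passing the OCR function in as an argument is a choice made for this sketch so that the status logic can be checked on its own; the notebook calls pytesseract directly.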

In [None]:
import re
import pandas as pd
from PIL import Image, ImageEnhance
import pytesseract
import os



def highlight_keywords(caption, keywords):
    """Function to highlight keywords in a given caption."""
    if not isinstance(caption, str):
        caption = ''

    found = False
    for keyword in keywords:
        if re.search(r'\b' + re.escape(keyword) + r'\b', caption, re.IGNORECASE):
            found = True
            caption = re.sub(r'(?i)\b' + re.escape(keyword) + r'\b', r'**\g<0>**', caption)
    return found, caption

def process_image(img_path, keywords):
    """Function to process an image, extract text, and highlight keywords."""
    try:
        image = Image.open(img_path)
        image = image.convert('L')  # Convert image to grayscale
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(2)  # Enhance image contrast
        text = pytesseract.image_to_string(image)  # Extract text from image
        found, highlighted_text = highlight_keywords(text, keywords)  # Highlight keywords in extracted text
        return 2 if found else 0, text  # Return status (2 if keywords found, otherwise 0) and extracted text
    except Exception as e:
        print(f"Error processing image {img_path}: {e}")
        return 1, ""  # Return status 1 (error) and empty text on exception

def process_row(row, img_path, keywords):
    """Function to process a row of data."""
    if row['contains_keywords'] == 1:
        contains_keywords2, text = process_image(img_path, keywords)  # Process image and get results
        row['contains_keywords2'] = contains_keywords2  # Store result in 'contains_keywords2' column
        row['text'] = text  # Store extracted text in 'text' column
    return row  # Return processed row

def main():
    """Main function to execute processing on CSV files."""
    df = pd.read_csv('/content/drive/MyDrive/scopus.csv')

    for i, row in df.iterrows():
        if row['StatusCode'] == 200:  # Check if StatusCode is 200(file is present)
            try:
                # Read corresponding output CSV file
                dof = pd.read_csv(f'/content/drive/MyDrive/csvs/output{i+1}.csv')
                print(f"Processing output{i+1}.csv")

                img_dir = f"/content/drive/MyDrive/images/{i+1}/"  # Directory path for images
                # Apply processing to each row of the output CSV file
                dof = dof.apply(
                    lambda dof_row: process_row(dof_row, os.path.join(img_dir, dof_row['UTD EID'][:-4] + '_lrg.jpg'), keywords),
                    axis=1
                )

                dof.to_csv(f'/content/drive/MyDrive/csvs/output{i+1}.csv', index=False)  # Save processed data back to output CSV file
                print(f"Saved processed data to output{i+1}.csv")

            except FileNotFoundError as fnf_error:
                print(f"Error: {fnf_error}. Skipping processing of output{i+1}.csv")

            except Exception as e:
                print(f"Unexpected error processing output{i+1}.csv: {e}")

    df.to_csv('/content/drive/MyDrive/scopus.csv', index=False)  # Save main DataFrame back to scopus.csv
    print("Main data frame updated and saved.")

if __name__ == "__main__":
    # Keywords to consider for highlighting
    keywords = [
        'strain', 'DSC', 'stress', 'heat flow', 'displacement', 'force',
        'martensite', 'fraction', 'load', 'X-Ray', 'temp', 'Differential',
        'scanning', 'calorimetry', 'curve'
    ]
    main()  # Execute main function if script is run directly


Processing output2.csv
Saved processed data to output2.csv
Processing output4.csv
Saved processed data to output4.csv
Processing output6.csv
Saved processed data to output6.csv
Processing output10.csv
Saved processed data to output10.csv
Processing output14.csv
Saved processed data to output14.csv
Processing output29.csv
Saved processed data to output29.csv
Error: [Errno 2] No such file or directory: '/content/drive/MyDrive/csvs/output38.csv'. Skipping processing of output38.csv
Main data frame updated and saved.


# Creating Individual Folders for Each Paper with Relevant Images, CSV File, XML File and the Link of the Paper

### Refined Image Directory

Creates a separate directory for images containing relevant keywords, which improves organization and makes them easier to access.

### Data File Copying

Copies the corresponding XML file and output CSV file along with the images, so that all relevant data is grouped together.

### Link Saving

Saves the paper's link (the 'Link' column of scopus.csv) to a text file within the refined image directory for reference.

Read DataFrame and Initialize Variables:

- Reads the scopus.csv DataFrame.
- Initializes a counter (count) to track the number of images copied.
- Sets the 'Processed' column value to 0 for each row initially.

Iterate Through DataFrame Rows:

- Loops through each row in the DataFrame.

Process Rows with Successful Download (Status Code 200):

- Checks whether 'StatusCode' is 200 (indicating a successful download).
- If yes, attempts to read the corresponding output CSV file (output{i+1}.csv) and prints a message indicating that it is being processed.
- Defines paths for the image directory (img_dir) and the refined image directory (refined_img_dir).
- Checks whether the output CSV has rows where contains_keywords2 is 2 (keywords were found in the image text).
- If there are matching rows, creates the refined_img_dir directory (if it doesn't exist).
- Initializes a flag (copied) to track whether any image was copied.
- Loops through each row in the output CSV file.

Attempt to copy images where contains_keywords2 is 2:

- Increments the counter (count) for each matching image.
- Constructs source and destination image paths based on the row's UTD EID value.
- Copies the image using shutil.copy.
- Sets the copied flag to True.

Check if any image was copied. If images were copied:

- Copies the corresponding XML file (firstloc) and output CSV file (secondloc) to the refined_img_dir.
- Updates the 'Processed' column value in the main DataFrame (scopus.csv) to 1 for the current row.
- Saves the link from the scopus.csv row to a text file named Link_{i+1}.txt within the refined_img_dir.

Save the output CSV file (dof) back to its original location.

Error Handling:

- Handles FileNotFoundError if the corresponding output CSV file is not found.
- Catches generic exceptions during processing and prints an error message.

Print Summary and Save DataFrame:

- Prints the total number of images copied.
- Saves the updated scopus.csv DataFrame with the 'Processed' column information.
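
The selective copy step can be sketched with temporary folders. This is an illustration under assumptions: `copy_relevant` is a hypothetical helper, plain dictionaries stand in for the rows of an output CSV, and the file names are made up.

```python
import os
import shutil
import tempfile


def copy_relevant(rows, img_dir, refined_dir):
    """Copy the image of every row whose 'contains_keywords2' is 2; return the count."""
    copied = 0
    for row in rows:
        if row.get('contains_keywords2') != 2:
            continue
        name = row['UTD EID'][:-4] + '_lrg.jpg'  # same naming rule as the notebook
        src = os.path.join(img_dir, name)
        if not os.path.exists(src):
            continue  # the image was never downloaded
        os.makedirs(refined_dir, exist_ok=True)  # create the folder only when needed
        shutil.copy(src, os.path.join(refined_dir, name))
        copied += 1
    return copied


base = tempfile.mkdtemp()  # temporary stand-in for the Drive folders
img_dir = os.path.join(base, 'images', '2')
refined_dir = os.path.join(base, 'images', 'Refined images', '2')
os.makedirs(img_dir)
for name in ['a-gr1_lrg.jpg', 'a-gr2_lrg.jpg']:
    with open(os.path.join(img_dir, name), 'wb') as f:
        f.write(b'demo')

rows = [
    {'UTD EID': 'a-gr1.jpg', 'contains_keywords2': 2},
    {'UTD EID': 'a-gr2.jpg', 'contains_keywords2': 0},
    {'UTD EID': 'a-gr3.jpg', 'contains_keywords2': 2},  # no file on disk
]
count = copy_relevant(rows, img_dir, refined_dir)
```

Unlike the notebook, this sketch counts only images that were actually copied, which is one possible way to keep the summary number in line with the files on disk.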


In [None]:
import pandas as pd
import os
import shutil

# Function to process the images and CSV files
def main():
    df = pd.read_csv('/content/drive/MyDrive/scopus.csv')
    count = 0

    for i, row in df.iterrows():
        df.at[i, 'Processed'] = 0
        if row['StatusCode'] == 200:
            try:
                dof = pd.read_csv(f'/content/drive/MyDrive/csvs/output{i+1}.csv')
                print(f"Processing output{i+1}.csv")

                img_dir = f"/content/drive/MyDrive/images/{i+1}/"
                refined_img_dir = f"/content/drive/MyDrive/images/Refined images/{i+1}/"

                # Check if there are rows in dof that meet the condition
                if len(dof) > 0 and any(dof['contains_keywords2'] == 2):
                    os.makedirs(refined_img_dir, exist_ok=True)

                    copied = False
                    for index, r1 in dof.iterrows():
                        try:
                            if r1['contains_keywords2'] == 2:
                                count += 1
                                src_img = os.path.join(img_dir, r1['UTD EID'][:-4] + '_lrg.jpg')
                                dst_img = os.path.join(refined_img_dir, r1['UTD EID'][:-4] + '_lrg.jpg')
                                shutil.copy(src_img, dst_img)
                                copied = True
                        except Exception as e:
                            print(f"Error processing row {index}: {e}")

                    # Check if any image was copied
                    if copied:
                        firstloc = f"/content/drive/MyDrive/xmls/{i+1}.xml"
                        secondloc = f"/content/drive/MyDrive/csvs/output{i+1}.csv"
                        shutil.copy(firstloc, refined_img_dir)
                        shutil.copy(secondloc, refined_img_dir)

                        # Update scopus.csv to mark this row as processed
                        df.at[i, 'Processed'] = 1

                        # Save the link to a text file in the refined image directory
                        link = row['Link']
                        link_file_path = os.path.join(refined_img_dir, f"Link_{i+1}.txt")
                        with open(link_file_path, 'w') as link_file:
                            link_file.write(link)
                            print(f"Link saved to {link_file_path}")

                dof.to_csv(f'/content/drive/MyDrive/csvs/output{i+1}.csv', index=False)

            except FileNotFoundError:
                print(f"FileNotFoundError: output{i+1}.csv not found.")
            except Exception as ex:
                print(f"Error processing output{i+1}.csv: {ex}")

    print(f"Total images copied: {count}")

    # Save the updated scopus.csv with the 'Processed' column
    df.to_csv('/content/drive/MyDrive/scopus.csv', index=False)

if __name__ == "__main__":
    main()


Processing output2.csv
Link saved to /content/drive/MyDrive/images/Refined images/2/Link_2.txt
Processing output4.csv
Link saved to /content/drive/MyDrive/images/Refined images/4/Link_4.txt
Processing output6.csv
Link saved to /content/drive/MyDrive/images/Refined images/6/Link_6.txt
Processing output10.csv
Link saved to /content/drive/MyDrive/images/Refined images/10/Link_10.txt
Processing output14.csv
Processing output29.csv
Link saved to /content/drive/MyDrive/images/Refined images/29/Link_29.txt
FileNotFoundError: output38.csv not found.
Total images copied: 17


In [None]:
!pip install ultralytics

Collecting ultralytics
  Downloading ultralytics-8.3.52-py3-none-any.whl.metadata (35 kB)
Collecting ultralytics-thop>=2.0.0 (from ultralytics)
  Downloading ultralytics_thop-2.0.13-py3-none-any.whl.metadata (9.4 kB)
Downloading ultralytics-8.3.52-py3-none-any.whl (901 kB)
Downloading ultralytics_thop-2.0.13-py3-none-any.whl (26 kB)
Installing collected packages: ultralytics-thop, ultralytics
Successfully installed ultralytics-8.3.52 ultralytics-thop-2.0.13


In [None]:
%cd /content/drive/MyDrive
!git clone https://github.com/PrathameshLadhe/sma_train

/content/drive/MyDrive
Cloning into 'sma_train'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 34 (delta 0), reused 31 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (34/34), 17.14 MiB | 12.29 MiB/s, done.


In [None]:
## Running the model with pre-trained weights on the images to detect plots and their x-labels and y-labels
!yolo task=detect mode=predict model='/content/drive/MyDrive/sma_train/train/weights/best.pt' conf=0.25 source='/content/drive/MyDrive/images/Master'

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Ultralytics 8.3.52 🚀 Python-3.10.12 torch-2.5.1+cu121 CPU (Intel Xeon 2.20GHz)
Model summary (fused): 168 layers, 3,006,233 parameters, 0 gradients, 8.1 GFLOPs

image 1/38 /content/drive/MyDrive/images/Master/1-s2.0-S0020768324002518-fx1001_lrg.jpg: 640x448 (no detections), 327.5ms
image 2/38 /content/drive/MyDrive/images/Master/1-s2.0-S0020768324002518-gr10_lrg.jpg: 352x640 (no detections), 217.2ms
image 3/38 /content/drive/MyDrive/images/Master/1-s2.0-S0020768324002518-gr11_lrg.jpg: 288x640 2 graphs, 2 xlabels, 3 ylabels, 177.5ms
image 4/38 /content/drive/MyDrive/images/Master/1-s2.0-S0020768324002518-gr12_lrg.jpg: 224x640 2 graphs, 2 xlabels, 3 ylabels, 135.8ms
image 5/38 /cont

# Saving all the images in the Drive

The following code archives the runs directory in Google Drive by copying it to a timestamped folder and then deleting the original.

Define Paths:

- src_dir: path to the source directory to be archived (/content/drive/MyDrive/runs).
- base_dst_dir: base path under which archives are stored (/content/drive/MyDrive/Archived_Runs).
- timestamp: current date and time formatted as YYYYMMDD_HHMMSS (e.g., 20241221_162820), used to give each archive a unique directory name.
- dst_dir: full destination path with the timestamp incorporated (/content/drive/MyDrive/Archived_Runs/runs_20241221_162820).


Create Destination Directory:

os.makedirs(base_dst_dir, exist_ok=True)

ensures the Archived_Runs directory exists (if not already created) to avoid errors during archiving.

Copy Directory:

shutil.copytree(src_dir, dst_dir)

attempts to copy the entire contents of the source directory (runs) to the destination directory (runs_{timestamp}).

Print Success Message:

If copying is successful, it prints a confirmation message indicating the source and destination paths.

Delete Source Directory (Optional):

shutil.rmtree('/content/drive/MyDrive/runs')

permanently deletes the source directory (runs) after successful archiving. This is an optional step depending on your preference for keeping or removing the source files.

Error Handling:

The try-except block wraps both the copy and the deletion. Because shutil.rmtree runs only after shutil.copytree returns without raising, the source is not deleted when the copy fails.
If an exception occurs (Exception as e), the code prints an error message with the exception details.

In [None]:
import shutil
import os
from datetime import datetime
# Define source and destination paths
src_dir = '/content/drive/MyDrive/runs'
base_dst_dir = '/content/drive/MyDrive/Archived_Runs'

# Create a timestamped folder name to avoid overwriting
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
dst_dir = os.path.join(base_dst_dir, f'runs_{timestamp}')

# Ensure the base destination directory exists
os.makedirs(base_dst_dir, exist_ok=True)

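# Copy the entire directory, then remove the source only if the copy succeeded
try:
    shutil.copytree(src_dir, dst_dir)
    print(f"Folder copied from {src_dir} to {dst_dir}")
    shutil.rmtree(src_dir)  # Reached only when copytree raised no exception
    print('File deleted.')
except Exception as e:
    # Covers failures in either the copy or the deletion
    print(f"Error archiving folder: {e}")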


Folder copied from /content/drive/MyDrive/runs to /content/drive/MyDrive/Archived_Runs/runs_20241221_145753
File deleted.
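
The archiving cell deletes the source as soon as the copy returns. As a minimal sketch of a more cautious variant, the function below makes deletion opt-in and compares file counts before removing anything. The names archive_dir, count_files and delete_source are hypothetical and not part of the original notebook, and the file-count comparison is an assumed sanity check rather than a guarantee of an intact copy.

```python
import os
import shutil
from datetime import datetime


def count_files(root):
    """Count regular files under root, recursively."""
    return sum(len(files) for _, _, files in os.walk(root))


def archive_dir(src_dir, base_dst_dir, delete_source=False):
    """Copy src_dir into a timestamped folder under base_dst_dir.

    The source is removed only when delete_source is True and the copy
    holds the same number of files as the source (assumed sanity check).
    Returns the path of the new archive folder.
    """
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    dst_dir = os.path.join(base_dst_dir, f'runs_{timestamp}')
    os.makedirs(base_dst_dir, exist_ok=True)
    shutil.copytree(src_dir, dst_dir)

    if delete_source:
        if count_files(src_dir) != count_files(dst_dir):
            raise RuntimeError(f'Copy looks incomplete; {src_dir} was kept')
        shutil.rmtree(src_dir)
    return dst_dir
```

Calling archive_dir('/content/drive/MyDrive/runs', '/content/drive/MyDrive/Archived_Runs') would keep the source in place; passing delete_source=True mirrors the behaviour of the cell above.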


# Getting the YOLO Predictions in YOLO txt format for further use.

This cell uses the Ultralytics YOLO library to run object detection on a set of images and saves the detections as YOLO-format text files. The steps are:

1. Load YOLO Model:

   - Import the YOLO class from the ultralytics library.
   - Load the trained weights from the specified path (/content/drive/MyDrive/sma_train/train/weights/best.pt). This model was trained on a custom dataset for object detection.

2. Create Output Directory:

   - Define the output directory path (output_dir) where the detection results will be saved as text files.
   - Use os.makedirs(output_dir, exist_ok=True) to create the output directory if it doesn't already exist.

3. Iterate Over Images:

   - Define the image directory path (image_dir) containing the images on which object detection is performed.
   - Loop through all files in image_dir.
   - Keep only files with image extensions (.jpg, .png, .jpeg) using if image_file.endswith(...).

4. Process Each Image:

   - Construct the full path to the current image file (image_path).
   - Call model(image_path, save_txt=None) to run object detection on the image, and store the returned results in the predictions variable.
   - The save_txt argument controls Ultralytics' built-in export of detections to text files, not the saving of annotated images. Passing None (a falsy value) leaves that export off, so the label files are written manually in step 6.


5. Prepare Output File:

   - Extract the filename without its extension from the current image file (base_name).
   - Construct the output file path (output_file_path) inside output_dir by appending a .txt extension to base_name.

6. Save Detections to Text File:

   - Open the output text file (output_file_path) for writing.
   - Iterate over each detected bounding box in predictions[0].boxes.
   - Extract the class ID (cls) of the detected object.
   - Extract the box coordinates (x_center, y_center, width, height) from boxes.xywhn, which are normalized to the image width and height.
   - Write one line per box in the YOLO label format: {class_id} {x_center} {y_center} {width} {height}. This format is commonly used for object detection datasets.
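
For later steps that need pixel coordinates, the label lines described in step 6 can be converted back as sketched below. This is a minimal illustration, assuming the normalized x_center y_center width height layout described above. The function names yolo_line_to_pixels and read_yolo_labels are hypothetical, and the image size must be supplied by the caller. Which class ID corresponds to a graph, an x-label or a y-label depends on the model's training configuration and is not assumed here.

```python
def yolo_line_to_pixels(line, img_width, img_height):
    """Convert one 'cls xc yc w h' label line to (cls, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    # Normalized centre/size -> pixel corner coordinates
    x1 = (xc - w / 2) * img_width
    y1 = (yc - h / 2) * img_height
    x2 = (xc + w / 2) * img_width
    y2 = (yc + h / 2) * img_height
    return cls, x1, y1, x2, y2


def read_yolo_labels(path, img_width, img_height):
    """Read a YOLO label file and return a list of pixel-space boxes."""
    with open(path) as f:
        return [
            yolo_line_to_pixels(line, img_width, img_height)
            for line in f
            if line.strip()  # skip blank lines
        ]
```

For example, the line 0 0.5 0.5 0.5 0.25 on a 640x480 image maps to class 0 with corners (160, 180) and (480, 300).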

In [None]:
from ultralytics import YOLO
import os

# Load the model
model = YOLO('/content/drive/MyDrive/sma_train/train/weights/best.pt')

# Create output directory if it doesn't exist
output_dir = '/content/drive/MyDrive/images/Master/output'  # Keep this for saving results
os.makedirs(output_dir, exist_ok=True)

# Iterate over the images in the input directory
image_dir = '/content/drive/MyDrive/images/Master'  # Directory containing the images to process
for image_file in os.listdir(image_dir):
    if image_file.endswith(('.jpg', '.png', '.jpeg')):  # Filter for image files
        # Full path to the image
        image_path = os.path.join(image_dir, image_file)

        # Get predictions for the image
        predictions = model(image_path, save_txt=None)

        # Prepare output file path
        base_name = os.path.splitext(image_file)[0]
        output_file_path = os.path.join(output_dir, f"{base_name}.txt")

        # Write predictions to the text file
        with open(output_file_path, 'w') as file:
            for idx, box in enumerate(predictions[0].boxes.xywhn):  # Iterate over each detected box
                cls = int(predictions[0].boxes.cls[idx].item())
                # Write line to file in YOLO label format: cls x_center y_center width height
                file.write(f"{cls} {box[0].item()} {box[1].item()} {box[2].item()} {box[3].item()}\n")


image 1/1 /content/drive/MyDrive/images/Master/1-s2.0-S1359645424003422-ga1_lrg.jpg: 576x640 1 graph, 1 xlabel, 1 ylabel, 242.7ms
Speed: 11.3ms preprocess, 242.7ms inference, 2.0ms postprocess per image at shape (1, 3, 576, 640)

image 1/1 /content/drive/MyDrive/images/Master/1-s2.0-S1359645424003422-gr8_lrg.jpg: 640x448 2 graphs, 2 xlabels, 2 ylabels, 171.0ms
Speed: 3.7ms preprocess, 171.0ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 448)

image 1/1 /content/drive/MyDrive/images/Master/1-s2.0-S1359645424003422-gr3_lrg.jpg: 640x480 1 graph, 1 xlabel, 2 ylabels, 184.6ms
Speed: 3.9ms preprocess, 184.6ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 480)

image 1/1 /content/drive/MyDrive/images/Master/1-s2.0-S1359645424003422-gr1_lrg.jpg: 256x640 1 graph, 2 xlabels, 1 ylabel, 108.7ms
Speed: 2.4ms preprocess, 108.7ms inference, 1.0ms postprocess per image at shape (1, 3, 256, 640)

image 1/1 /content/drive/MyDrive/images/Master/1-s2.0-S1359645424003422-gr11_