<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 1_1 - eyamrog

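The aim of this phase is to check whether the text of the articles listed in Phase 1 can be scraped from the journal's web pages and stored alongside their metadata.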

## Required Python packages

- beautifulsoup4
- lxml
- pandas
- requests
- selenium
- tqdm

## Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
import sys
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## Defining input variables

In [2]:
input_directory = 'cl_st2_ph1_eyamrog'
output_directory = 'cl_st2_ph11_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.


## Web Scraping [Annual Review of Plant Biology](https://www.annualreviews.org/content/journals/arplant)

### Importing the data into a DataFrame

In [4]:
df_ar_plant_biology = pd.read_json(f"{input_directory}/ar_plant_biology.jsonl", lines=True)

In [5]:
df_ar_plant_biology.columns

Index(['Title', 'URL', 'Authors', 'Vol/Year/Page Range', 'DOI',
       'Area of Knowledge'],
      dtype='object')

In [6]:
df_ar_plant_biology['Vol/Year/Page Range'].unique()

array(['Vol. 73 \n(2022),pp.1–16', 'Vol. 73 \n(2022),pp.17–42',
       'Vol. 73 \n(2022),pp.43–65', 'Vol. 73 \n(2022),pp.67–92',
       'Vol. 73 \n(2022),pp.93–121', 'Vol. 73 \n(2022),pp.123–148',
       'Vol. 73 \n(2022),pp.149–172', 'Vol. 73 \n(2022),pp.173–200',
       'Vol. 73 \n(2022),pp.201–225', 'Vol. 73 \n(2022),pp.227–254',
       'Vol. 73 \n(2022),pp.255–291', 'Vol. 73 \n(2022),pp.293–321',
       'Vol. 73 \n(2022),pp.323–353', 'Vol. 73 \n(2022),pp.355–378',
       'Vol. 73 \n(2022),pp.379–403', 'Vol. 73 \n(2022),pp.405–432',
       'Vol. 73 \n(2022),pp.433–455', 'Vol. 73 \n(2022),pp.457–474',
       'Vol. 73 \n(2022),pp.475–494', 'Vol. 73 \n(2022),pp.495–521',
       'Vol. 73 \n(2022),pp.523–551', 'Vol. 73 \n(2022),pp.553–584',
       'Vol. 73 \n(2022),pp.585–616', 'Vol. 73 \n(2022),pp.617–648',
       'Vol. 73 \n(2022),pp.649–672', 'Vol. 73 \n(2022),pp.673–702',
       'Vol. 73 \n(2022),pp.703–728', 'Vol. 72 \n(2021),pp.1–16',
       'Vol. 72 \n(2021),pp.17–46', 'Vol. 72 \n

### Extracting the `Posted` dates from the column `Vol/Year/Page Range`

In [7]:
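# Extract the four-digit year using a regular expression (the dot after 'Vol' is escaped so that it matches a literal full stop)
df_ar_plant_biology['Posted'] = df_ar_plant_biology['Vol/Year/Page Range'].str.extract(r'^Vol\. .+ \n\((\d{4})\).+', expand=False)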

In [8]:
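# The column holds only a year, so the format is stated explicitly; each date becomes 1 January of that year
df_ar_plant_biology['Posted'] = pd.to_datetime(df_ar_plant_biology['Posted'], format='%Y')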

In [9]:
df_ar_plant_biology.dtypes

Title                          object
URL                            object
Authors                        object
Vol/Year/Page Range            object
DOI                            object
Area of Knowledge              object
Posted                 datetime64[ns]
dtype: object
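
The extraction above keeps only the year. As a minimal sketch, assuming every value follows the `Vol. N \n(YYYY),pp.A–B` shape seen in the output of `unique()`, named groups can also recover the volume and the page range in one pass. The group names `Volume`, `Year`, `First` and `Last` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Sample values copied from the 'Vol/Year/Page Range' output above
samples = pd.Series([
    'Vol. 73 \n(2022),pp.1–16',
    'Vol. 72 \n(2021),pp.17–46',
])

# Named groups become column names; '–' is the en dash used in the page ranges
pattern = r'^Vol\.\s*(?P<Volume>\d+)\s*\((?P<Year>\d{4})\),pp\.(?P<First>\d+)–(?P<Last>\d+)$'

# astype(int) assumes every row matched; a non-matching row would yield NaN
parts = samples.str.extract(pattern).astype(int)

# Same conversion as in the notebook, with the year-only format made explicit
posted = pd.to_datetime(parts['Year'].astype(str), format='%Y')

print(parts)
print(posted)
```

Rows that do not match the pattern would come back as missing values, so a quick `isna()` check before `astype(int)` is advisable on the full data set.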

In [10]:
df_ar_plant_biology

| | Title | URL | Authors | Vol/Year/Page Range | DOI | Area of Knowledge | Posted |
|---|---|---|---|---|---|---|---|
| 0 | Adventures in Life and Science, from Light to ... | https://www.annualreviews.org/content/journals... | Elaine Tobin | Vol. 73 \n(2022),pp.1–16 | https://doi.org/10.1146/annurev-arplant-090921... | Agricultural Sciences | 2022-01-01 |
| 1 | Phosphorus Acquisition and Utilization in Plants | https://www.annualreviews.org/content/journals... | Hans Lambers | Vol. 73 \n(2022),pp.17–42 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 |
| 2 | Meeting in the Middle: Lessons and Opportuniti... | https://www.annualreviews.org/content/journals... | Mae Antonette Mercado, , andAnthony J. Studer | Vol. 73 \n(2022),pp.43–65 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 |
| 3 | Plant Proteome Dynamics | https://www.annualreviews.org/content/journals... | Julia Mergner, , andBernhard Kuster | Vol. 73 \n(2022),pp.67–92 | https://doi.org/10.1146/annurev-arplant-102620... | Agricultural Sciences | 2022-01-01 |
| 4 | Evolution and Functions of Plant U-Box Protein... | https://www.annualreviews.org/content/journals... | Jana Trenner, ,Jacqueline Monaghan, ,Bushra Sa... | Vol. 73 \n(2022),pp.93–121 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | Phenotyping: New Windows into the Plant for Br... | https://www.annualreviews.org/content/journals... | Michelle Watt, ,Fabio Fiorani, ,Björn Usadel, ... | Vol. 71 \n(2020),pp.689–712 | https://doi.org/10.1146/annurev-arplant-042916... | Agricultural Sciences | 2020-01-01 |
| 84 | The Genomics ofCannabisand Its Close Relatives | https://www.annualreviews.org/content/journals... | I. Kovalchuk, ,M. Pellino, ,P. Rigault, ,R. va... | Vol. 71 \n(2020),pp.713–739 | https://doi.org/10.1146/annurev-arplant-081519... | Agricultural Sciences | 2020-01-01 |
| 85 | Sequencing and Analyzing the Transcriptomes of... | https://www.annualreviews.org/content/journals... | Gane Ka-Shu Wong, ,Douglas E. Soltis, ,Jim Lee... | Vol. 71 \n(2020),pp.741–765 | https://doi.org/10.1146/annurev-arplant-042916... | Agricultural Sciences | 2020-01-01 |
| 86 | Engineering Synthetic Signaling in Plants | https://www.annualreviews.org/content/journals... | Alexander R. Leydon, ,Hardik P. Gala, ,Sarah G... | Vol. 71 \n(2020),pp.767–788 | https://doi.org/10.1146/annurev-arplant-081519... | Agricultural Sciences | 2020-01-01 |


### Creating the column `Text ID`

In [11]:
df_ar_plant_biology['Text ID'] = 'ar_plant_biology' + df_ar_plant_biology.index.astype(str).str.zfill(6)
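
Because each `Text ID` is derived from the row index, the IDs are only stable as long as the index is not reset or re-sorted afterwards. The following sketch repeats the same construction on a toy DataFrame (the `Title` values are placeholders) and checks that the resulting IDs are unique and of equal length.

```python
import pandas as pd

# Toy DataFrame standing in for df_ar_plant_biology (three rows, default index)
df = pd.DataFrame({'Title': ['a', 'b', 'c']})

# Same construction as above: prefix plus the row index zero-padded to 6 digits
df['Text ID'] = 'ar_plant_biology' + df.index.astype(str).str.zfill(6)

print(df['Text ID'].tolist())

# Every ID should be unique and have the same length
ids_unique = df['Text ID'].is_unique
ids_same_width = df['Text ID'].str.len().nunique() == 1
print(ids_unique, ids_same_width)
```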

### Inspecting a few samples

In [18]:
url_sample = df_ar_plant_biology.at[87, 'URL']

In [19]:
url_sample

'https://www.annualreviews.org/content/journals/10.1146/annurev-arplant-050718-100038'

In [13]:
# Setting up the WebDriver (make sure you have downloaded the Microsoft Edge WebDriver executable)
# https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
service = Service(r'C:\Users\eyamr\OneDrive\Documentos\0-Technology\laelgelc\edgedriver_win64\msedgedriver.exe')
driver = webdriver.Edge(service=service)

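# Navigating to the sample URL and capturing its web page
try:
    driver.get(url_sample)
    # Wait up to 10 seconds for the element with ID 'google_esf', used here as a sign that the page has loaded
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'google_esf')))
    document_page_sample = driver.page_source
finally:
    # Closing the WebDriver even if the page fails to load
    driver.quit()

# Saving the sample web page for inspection
with open(f'{output_directory}/ar_plant_biology_sample.html', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample)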

### Scraping the paragraphs of the articles into TXT format

In [14]:
# Setting up the WebDriver once and reusing it for every article
service = Service(r'C:\Users\eyamr\OneDrive\Documentos\0-Technology\laelgelc\edgedriver_win64\msedgedriver.exe')
driver = webdriver.Edge(service=service)

try:
    # Iterating over the rows to scrape the paragraphs of each article, with a progress bar
    for index, row in tqdm(df_ar_plant_biology.iterrows(), total=len(df_ar_plant_biology)):
        url = row['URL']
        text_id = row['Text ID']

        driver.get(url)

        # Wait up to 10 seconds for the page to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.ID, 'google_esf')))

        page = driver.page_source
        soup = BeautifulSoup(page, 'lxml')
        paragraphs = soup.find_all('p')

        # Extract text from paragraphs
        article_content = '\n'.join(p.get_text(strip=True) for p in paragraphs)

        # Saving each article's content to a text file
        with open(f"{output_directory}/{text_id}.txt", 'w', encoding='utf-8') as file:
            file.write(article_content)
finally:
    # Closing the WebDriver even if an article fails to load
    driver.quit()
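
The loop above stops at the first article whose page does not load within the timeout. One way to make it more tolerant is a small retry wrapper. The sketch below is not part of the original notebook: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the `fetch` argument stands in for a function that would wrap `driver.get`, the wait, and `driver.page_source`.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times; return None if every call fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # with Selenium this would mostly be a TimeoutException
            print(f'Attempt {attempt} failed for {url}: {error}')
            time.sleep(delay)
    return None

# Stand-in for a Selenium-based fetcher: fails once, then succeeds
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] == 1:
        raise TimeoutError('page did not load in time')
    return f'<html>{url}</html>'

page = fetch_with_retries(flaky_fetch, 'https://example.org/article')

# A fetcher that always fails yields None instead of raising
failed = fetch_with_retries(lambda url: 1 / 0, 'https://example.org/broken', attempts=2)
```

Returning `None` for a page that never loads lets the loop skip that article and carry on, at the cost of having to check for missing files later.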

### Adding the column `Text` with the text extracted from each article

In [20]:
# Function to read the content of a TXT file
def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Iterating through each row of the DataFrame and collecting the text content
texts = []
for index, row in df_ar_plant_biology.iterrows():
    text_id = row['Text ID']
    txt_file_path = os.path.join(output_directory, f"{text_id}.txt")
    if os.path.exists(txt_file_path):
        text_content = read_txt_file(txt_file_path)
    else:
        text_content = None  # or you can set it to an empty string or any default value
    texts.append(text_content)

# Add the 'Text' column to the DataFrame
df_ar_plant_biology['Text'] = texts
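
Rows whose TXT file is missing receive `None`, so it is worth listing them before moving on. The following sketch reproduces the same logic with `pathlib` on a temporary directory; the file names and the helper `read_text_or_none` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_text_or_none(path):
    """Return the file's text, or None when the file does not exist."""
    path = Path(path)
    return path.read_text(encoding='utf-8') if path.exists() else None

# Toy setup: a temporary directory in which only one of two expected files exists
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'doc000000.txt').write_text('first article', encoding='utf-8')
    df = pd.DataFrame({'Text ID': ['doc000000', 'doc000001']})
    df['Text'] = [read_text_or_none(Path(tmp, f'{text_id}.txt')) for text_id in df['Text ID']]

# List the articles for which no text could be loaded
missing = df.loc[df['Text'].isna(), 'Text ID'].tolist()
print(f'{len(missing)} article(s) without text: {missing}')
```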

In [21]:
df_ar_plant_biology

| | Title | URL | Authors | Vol/Year/Page Range | DOI | Area of Knowledge | Posted | Text ID | Text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adventures in Life and Science, from Light to ... | https://www.annualreviews.org/content/journals... | Elaine Tobin | Vol. 73 \n(2022),pp.1–16 | https://doi.org/10.1146/annurev-arplant-090921... | Agricultural Sciences | 2022-01-01 | ar_plant_biology000000 | We usecookiesto track usage and preferences.I ... |
| 1 | Phosphorus Acquisition and Utilization in Plants | https://www.annualreviews.org/content/journals... | Hans Lambers | Vol. 73 \n(2022),pp.17–42 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 | ar_plant_biology000001 | We usecookiesto track usage and preferences.I ... |
| 2 | Meeting in the Middle: Lessons and Opportuniti... | https://www.annualreviews.org/content/journals... | Mae Antonette Mercado, , andAnthony J. Studer | Vol. 73 \n(2022),pp.43–65 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 | ar_plant_biology000002 | We usecookiesto track usage and preferences.I ... |
| 3 | Plant Proteome Dynamics | https://www.annualreviews.org/content/journals... | Julia Mergner, , andBernhard Kuster | Vol. 73 \n(2022),pp.67–92 | https://doi.org/10.1146/annurev-arplant-102620... | Agricultural Sciences | 2022-01-01 | ar_plant_biology000003 | We usecookiesto track usage and preferences.I ... |
| 4 | Evolution and Functions of Plant U-Box Protein... | https://www.annualreviews.org/content/journals... | Jana Trenner, ,Jacqueline Monaghan, ,Bushra Sa... | Vol. 73 \n(2022),pp.93–121 | https://doi.org/10.1146/annurev-arplant-102720... | Agricultural Sciences | 2022-01-01 | ar_plant_biology000004 | We usecookiesto track usage and preferences.I ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | Phenotyping: New Windows into the Plant for Br... | https://www.annualreviews.org/content/journals... | Michelle Watt, ,Fabio Fiorani, ,Björn Usadel, ... | Vol. 71 \n(2020),pp.689–712 | https://doi.org/10.1146/annurev-arplant-042916... | Agricultural Sciences | 2020-01-01 | ar_plant_biology000083 | We usecookiesto track usage and preferences.I ... |
| 84 | The Genomics ofCannabisand Its Close Relatives | https://www.annualreviews.org/content/journals... | I. Kovalchuk, ,M. Pellino, ,P. Rigault, ,R. va... | Vol. 71 \n(2020),pp.713–739 | https://doi.org/10.1146/annurev-arplant-081519... | Agricultural Sciences | 2020-01-01 | ar_plant_biology000084 | We usecookiesto track usage and preferences.I ... |
| 85 | Sequencing and Analyzing the Transcriptomes of... | https://www.annualreviews.org/content/journals... | Gane Ka-Shu Wong, ,Douglas E. Soltis, ,Jim Lee... | Vol. 71 \n(2020),pp.741–765 | https://doi.org/10.1146/annurev-arplant-042916... | Agricultural Sciences | 2020-01-01 | ar_plant_biology000085 | We usecookiesto track usage and preferences.I ... |
| 86 | Engineering Synthetic Signaling in Plants | https://www.annualreviews.org/content/journals... | Alexander R. Leydon, ,Hardik P. Gala, ,Sarah G... | Vol. 71 \n(2020),pp.767–788 | https://doi.org/10.1146/annurev-arplant-081519... | Agricultural Sciences | 2020-01-01 | ar_plant_biology000086 | We usecookiesto track usage and preferences.I ... |


### Inspecting a few samples

In [22]:
df_ar_plant_biology.at[87, 'Text']

"We usecookiesto track usage and preferences.I Understand\nThe acquisition of quantitative information on plant development across a range of temporal and spatial scales is essential to understand the mechanisms of plant growth. Recent years have shown the emergence of imaging methodologies that enable the capture and analysis of plant growth, from the dynamics of molecules within cells to the measurement of morphometricand physiological traits in field-grown plants. In some instances, these imaging methods can be parallelized across multiple samples to increase throughput. When high throughput is combined with high temporal and spatial resolution, the resulting image-derived data sets could be combined with molecular large-scale data sets to enable unprecedented systems-level computational modeling. Such image-driven functional genomics studies may be expected to appear at an accelerating rate in the near future given the early success of the foundational efforts reviewed here. We pre

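The sample above shows words glued together (`usecookiesto`, `morphometricand`) and a cookie notice at the start of the text. The glued words most likely come from `get_text(strip=True)`, which strips each text fragment inside a paragraph and then joins the fragments without a separator. The sketch below shows one possible remedy on a hypothetical HTML fragment; the markup, the `'html.parser'` backend and the banner prefix are assumptions made for illustration, not taken from the journal's pages.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating markup that produces the glued words
html = (
    '<p>We use <a href="#">cookies</a> to track usage and preferences.</p>'
    '<p>measurement of <i>morphometric</i> and physiological traits</p>'
)

soup = BeautifulSoup(html, 'html.parser')

# Behaviour of the notebook's call: fragments are stripped, then concatenated
glued = soup.find('p').get_text(strip=True)

def paragraph_text(p):
    # get_text() keeps the whitespace between inline elements;
    # split() + join() then collapses runs of whitespace into single spaces
    return ' '.join(p.get_text().split())

paragraphs = [paragraph_text(p) for p in soup.find_all('p')]

# Drop the cookie notice; the prefix is an assumption based on the sample output
article_paragraphs = [t for t in paragraphs if not t.startswith('We use cookies')]
print(article_paragraphs)
```

Filtering on a fixed prefix is fragile; selecting only the paragraphs inside the article's main container would be more robust, but the right selector depends on the page structure and is not shown in this notebook.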
### Exporting to a file

In [23]:
df_ar_plant_biology.to_json(f"{output_directory}/ar_plant_biology.jsonl", orient='records', lines=True)
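
When the exported file is read back, the `Posted` column may not return as a datetime: with default settings pandas typically writes datetimes as epoch milliseconds, and `read_json` only auto-converts columns whose names look date-like. The following sketch, on a toy DataFrame with a hypothetical file name, writes ISO timestamps instead and converts the column explicitly after loading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy DataFrame with the two columns relevant to the round trip
df = pd.DataFrame({
    'Text ID': ['doc000000', 'doc000001'],
    'Posted': pd.to_datetime(['2022', '2021'], format='%Y'),
})

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp, 'sample.jsonl'))
    # date_format='iso' writes readable timestamps instead of epoch milliseconds
    df.to_json(path, orient='records', lines=True, date_format='iso')
    restored = pd.read_json(path, lines=True)

# Convert explicitly rather than relying on read_json's date detection
restored['Posted'] = pd.to_datetime(restored['Posted'])
print(restored)
```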