<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Examples: Regular expressions
© ExploreAI Academy

In this notebook, we look at how to use Python's `re` library to extract the data we're interested in. We also look at the `re.compile` function, which creates reusable compiled pattern objects.

## Learning objectives

By the end of this notebook, you should be able to:
- Use regex to extract the data we're interested in.
- Apply regex through both compiled pattern objects and the module-level `re` functions.

## Examples

### Example 1

#### Question
Given a paragraph about conservation efforts, split the text into individual sentences using regular expressions.

#### Solution

In [None]:
import re

text = "Conservation efforts are increasing. Habitats are being restored. Species are recovering."
sentences = re.split(r"\. ", text)

print(sentences)

['Conservation efforts are increasing', 'Habitats are being restored', 'Species are recovering.']


#### Explanation
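This code uses `re.split` to split a paragraph into sentences wherever a period is followed by a space. Because the delimiter is removed from the result, the first two sentences lose their full stops, while the last sentence keeps its full stop because no space follows it. This is a simple use of `re.split` for text segmentation.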
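
If the full stops should be kept, or if sentences may also end in `!` or `?`, one option is to split on the whitespace that follows the punctuation. The sketch below is a hedged variant rather than part of the original solution; the lookbehind `(?<=[.!?])` and the name `sentences_with_punctuation` are assumptions made for illustration.

```python
import re

text = "Conservation efforts are increasing. Habitats are being restored. Species are recovering."

# Split on whitespace that is preceded by '.', '!' or '?'.
# The lookbehind consumes nothing, so the punctuation stays with its sentence.
sentences_with_punctuation = re.split(r"(?<=[.!?])\s+", text)

print(sentences_with_punctuation)
# ['Conservation efforts are increasing.', 'Habitats are being restored.', 'Species are recovering.']
```

This variant would still split after abbreviations such as "e.g. ", so it remains a simple heuristic.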

### Example 2

#### Question
Extract all numbers followed by "acres" to find references to land area in a text. Utilise `re.compile` to create a regex pattern that matches this format.

#### Solution

In [None]:
import re

text = "The national park covers 5000 Acres, while the community forest spans 750 acres."

# Compiling the regex pattern
pattern = re.compile(r'\d+\s*acres', re.IGNORECASE)

# Finding all occurrences of land area
land_areas = pattern.findall(text)

print(land_areas)

['5000 Acres', '750 acres']


#### Explanation
This solution uses a compiled regex pattern to find numerical values followed by "acres". The regex `\d+\s*acres` looks for one or more digits (`\d+`), followed by zero or more whitespace characters (`\s*`), and then the word "acres". The `re.IGNORECASE` flag ensures that variations in the capitalisation of "acres", such as "Acres", are also matched.
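
When only the numeric values are needed, for example to total the areas, a capturing group can be added around the digits. This is a minimal sketch that reuses the same text; the names `area_pattern` and `areas` are illustrative and do not appear in the original solution.

```python
import re

text = "The national park covers 5000 Acres, while the community forest spans 750 acres."

# The parentheses capture only the digits; "acres" must still follow for a match.
area_pattern = re.compile(r"(\d+)\s*acres", re.IGNORECASE)

# With exactly one capturing group, findall returns the captured text only.
areas = [int(value) for value in area_pattern.findall(text)]

print(areas)       # [5000, 750]
print(sum(areas))  # 5750
```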

### Example 3

#### Question
Given a text with various animal and plant species names formatted as 'Genus Species', compile a regex object to find all occurrences of these species' names in the text.

#### Solution

In [None]:
import re

pattern = re.compile(r'\b[A-Z][a-z]* [a-z]+\b')
text = "In the Amazon rainforest, species like Panthera onca, Inia geoffrensis, and Euterpe precatoria are found."

species = pattern.findall(text)
print(species)

['In the', 'Amazon rainforest', 'Panthera onca', 'Inia geoffrensis', 'Euterpe precatoria']


#### Explanation
The script compiles a regex pattern using `re.compile`, which is then used to find all matches in the given text. The regex `\b[A-Z][a-z]* [a-z]+\b` matches a word that starts with a capital letter followed by zero or more lowercase letters (the genus name in biological nomenclature), then a space, and then one or more lowercase letters (the species name).

Although this pattern mirrors the usual format of scientific names, it does not match only species names. The output above shows this: 'In the' and 'Amazon rainforest' are matched because they happen to follow the same format of a capitalised word followed by a lowercase word. The pattern also does not account for the complexities of biological nomenclature, such as species names with hyphens, Latin abbreviations, or names of more than two words. The regex is therefore useful for preliminary data extraction, but the results need further verification and refinement, especially for scientific or research purposes.
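
One possible way to carry out such verification is to check each candidate's genus against a reference list. The sketch below assumes a hypothetical set `known_genera`; in practice this would come from a taxonomic source, which the original example does not specify. Named groups are used so that the genus can be inspected separately.

```python
import re

text = "In the Amazon rainforest, species like Panthera onca, Inia geoffrensis, and Euterpe precatoria are found."

# Hypothetical reference list of accepted genus names (an assumption for this sketch).
known_genera = {"Panthera", "Inia", "Euterpe"}

# Named groups make the two parts of each candidate name easy to inspect.
name_pattern = re.compile(r"\b(?P<genus>[A-Z][a-z]+) (?P<species>[a-z]+)\b")

# Keep only the candidates whose genus appears in the reference list.
verified_species = [
    match.group(0)
    for match in name_pattern.finditer(text)
    if match.group("genus") in known_genera
]

print(verified_species)
# ['Panthera onca', 'Inia geoffrensis', 'Euterpe precatoria']
```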

### Example 4

#### Question
Given a text containing different plant names related to sustainable land management, extract all occurrences of specific plants. The names to extract are "oak", "maple", and "pine". Use `re.compile` to optimise the pattern matching.

#### Solution

In [None]:
import re

text = "The forest had a variety of trees including oak, maple, and pine. Other species included birch and spruce."

# Compiling the regex pattern
pattern = re.compile(r'\boak\b|\bmaple\b|\bpine\b', re.IGNORECASE)

# Finding all occurrences of the specified plants
found_plants = pattern.findall(text)

print(found_plants)


['oak', 'maple', 'pine']


#### Explanation
This code uses `re.compile` to create a compiled regex object that can be reused.
* Compiling a pattern into a regex object converts the pattern string into an internal format once, so the same object can be applied to many strings, for example when parsing large texts or processing multiple strings with the same pattern. Note that Python's `re` module also caches recently compiled patterns used through the module-level functions, so the performance gain is often modest; the clearer benefit is a named pattern object that can be reused throughout the code.

The regex pattern `\boak\b|\bmaple\b|\bpine\b` uses word boundaries `(\b)` to match whole words and `|` as an `OR` operator to match any of the specified plant names. The `re.IGNORECASE` flag makes the search case insensitive.
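
To illustrate reuse, the sketch below compiles the pattern once and applies it to several strings. It uses the equivalent grouped form `\b(?:oak|maple|pine)\b`; the `field_notes` strings are made up for this illustration and are not part of the original data.

```python
import re

# Equivalent to \boak\b|\bmaple\b|\bpine\b, with the word boundaries written once.
plant_pattern = re.compile(r"\b(?:oak|maple|pine)\b", re.IGNORECASE)

# Hypothetical strings used only to show the compiled pattern being reused.
field_notes = [
    "Oak seedlings were planted along the ridge.",
    "Pineapple farms border the reserve.",
    "Maple and pine saplings survived the drought.",
]

matches_per_note = [plant_pattern.findall(note) for note in field_notes]

print(matches_per_note)
# [['Oak'], [], ['Maple', 'pine']]
```

The second note yields no match because the word boundary after "Pine" is missing in "Pineapple".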

In [None]:
social_media_posts = """
Great news! The GreenWood Project has successfully planted 10000 trees in the Amazon Rainforest #GreenEarth #Conservation
Update: ForestCoverApp shows a 12% increase in forest cover in the last 5 years. #TechForGood
Sad to see illegal logging in Madagascan rainforests. We need stricter laws! #SaveForests #ActNow
Celebrating World Environment Day with a pledge to plant 20000 trees. Join us! #EnvironmentDay #GoGreen
Interesting study published in NatureJournal: Rainforest biodiversity is crucial for ecological balance. #ScienceForNature
"""

In [None]:
import re

In [None]:
# Find every hashtag: '#' followed by one or more word characters
def extract_hashtags(text):
    pattern = re.compile(r"#\w+")
    return pattern.findall(text)

# Test with the provided text
print(extract_hashtags(social_media_posts))

['#GreenEarth', '#Conservation', '#TechForGood', '#SaveForests', '#ActNow', '#EnvironmentDay', '#GoGreen', '#ScienceForNature']


In [None]:
# Find whole numbers, with an optional trailing percent sign
def extract_numbers(text):
    pattern = re.compile(r"\b\d+%?")
    return pattern.findall(text)

# Test with the provided text
print(extract_numbers(social_media_posts))

['10000', '12%', '5', '20000']


In [None]:
# Count occurrences of the whole words "illegal" and "logging" (case-sensitive)
def count_specific_words(text):
    pattern = re.compile(r"(\billegal\b|\blogging\b)")
    return len(pattern.findall(text))

# Test with the provided text
print(count_specific_words(social_media_posts))

2
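
The function above returns a single combined total and is case-sensitive. If a separate, case-insensitive count per word is wanted, `collections.Counter` is one option. This is a sketch on a made-up `sample` string, not on the posts above, and the names `word_pattern` and `word_counts` are illustrative.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for this illustration.
sample = "Illegal logging continues. Logging permits were revoked after illegal clearing."

# One capturing group, so findall returns the matched word itself.
word_pattern = re.compile(r"\b(illegal|logging)\b", re.IGNORECASE)

# Lower-case each match so that "Illegal" and "illegal" are counted together.
word_counts = Counter(match.lower() for match in word_pattern.findall(sample))

print(word_counts["illegal"], word_counts["logging"])
# 2 2
```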


In [None]:
# Find a capitalised word followed by a word starting with 'r' or 'R' (e.g. "Amazon Rainforest")
def extract_locations(text):
    return re.findall(r"[A-Z]\w+\s+[rR]\w+", text)

# Test with the provided text
print(extract_locations(social_media_posts))

['Amazon Rainforest', 'Madagascan rainforests']


In [None]:
def extract_locations(text):
    # Regular expression to match potential location names
    pattern = r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*(?:\s[rR]ainforests?)?\b'

    # Find all matches
    potential_locations = re.findall(pattern, text)

    # Filtering based on context and avoiding false positives
    locations = [loc for loc in potential_locations if not loc.endswith(':') and 'rainforest' in loc.lower() and loc != 'Rainforest']

    return locations

# Test with the provided text
print(extract_locations(social_media_posts))

['Amazon Rainforest', 'Madagascan rainforests']


In [None]:
import pandas as pd

# Sample data
data = {'species': [' Maple (10 years) ', 'oak', 'Pine(3 years)', 'maple ', ' Oak (1.5 Years)']}
df = pd.DataFrame(data)

# Cleaning the 'species' column
df['species'] = df['species'].str.strip().str.lower()
print(df)

            species
0  maple (10 years)
1               oak
2     pine(3 years)
3             maple
4   oak (1.5 years)


In [None]:
# Extracting age using regular expression
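# A raw string avoids invalid-escape warnings; expand=False returns a Series for the single group
df['age'] = df['species'].str.extract(r'(\d+\.\d+|\d+)', expand=False).fillna("Unknown")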
print(df)

            species      age
0  maple (10 years)       10
1               oak  Unknown
2     pine(3 years)        3
3             maple  Unknown
4   oak (1.5 years)      1.5


In [None]:
# Clean and standardise species names
df['species'] = df['species'].str.extract('([a-zA-Z]+)', expand=False).fillna('').str.strip().str.lower()

# Counting occurrences of each species
species_counts = df['species'].value_counts()
print(species_counts)

maple    2
oak      2
pine     1
Name: species, dtype: int64


In [None]:
# Creating a 'zone' column based on the first letter
df['zone'] = df['species'].str[0]
grouped_data = df.groupby('zone').size()
print(grouped_data)

zone
m    2
o    2
p    1
dtype: int64


In [None]:
import pandas as pd
import re

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Description': [
        'Forest Conservation Project in Spain',
        'River Cleanup Initiative Project 2021 in Portugal',
        'Urban Gardening Community Project in Germany',
        'Forest Reforestation Project 2022 in France',
        'Wildlife Protection Project Plan in Italy',
        'Endangered Species Conservation Project in Greece',
        'Wetland Restoration Project in Spain',
        'Marine Life Conservation Project in Portugal',
        'Air Quality Improvement Project Initiative in Germany',
        'Habitat Preservation Project for Birds in France'
    ],
    'Date': ['2021-03-15', '2021-06-20', '2022-01-11', '2022-04-05', '2023-02-22', '2023-05-30', '2021-09-13', '2022-07-19', '2023-03-08', '2022-11-21'],
    'Location': ['madrid, spain', 'LISBON, Portugal', 'berlin, germany', 'Paris, France', 'rome, Italy', 'Athens, GREECE', 'Valencia, Spain', 'PORTO, Portugal', 'Munich, Germany', 'Lyon, France'],
    'Budget': ['$20000', '€15000', '€12000', '£18000', '$25000', '€20000', '$17000', '€13000', '€11000', '£16000'],
    'Notes': [
        'Focusing on native forest species in Spain',
        'Cleanup of the Tagus river in Portugal. Endangered species alert!',
        'Community project in urban Berlin, Germany',
        'Reforestation of oak trees in Paris, France',
        'Plan for protecting local wildlife in Italy. Endangered species identified.',
        'Study on the impact on endangered bird species in Greece',
        'Restoration of wetlands in Valencia, Spain',
        'Conservation of marine life in Porto, Portugal',
        'Initiative for improving air quality in Munich, Germany',
        'Preservation of bird habitats in Lyon, France'
    ]
}



environment_df = pd.DataFrame(data)

print(environment_df)

   ID                                        Description        Date  \
0   1               Forest Conservation Project in Spain  2021-03-15   
1   2  River Cleanup Initiative Project 2021 in Portugal  2021-06-20   
2   3       Urban Gardening Community Project in Germany  2022-01-11   
3   4        Forest Reforestation Project 2022 in France  2022-04-05   
4   5          Wildlife Protection Project Plan in Italy  2023-02-22   
5   6  Endangered Species Conservation Project in Greece  2023-05-30   
6   7               Wetland Restoration Project in Spain  2021-09-13   
7   8       Marine Life Conservation Project in Portugal  2022-07-19   
8   9  Air Quality Improvement Project Initiative in ...  2023-03-08   
9  10   Habitat Preservation Project for Birds in France  2022-11-21   

           Location  Budget                                              Notes  
0     madrid, spain  $20000         Focusing on native forest species in Spain  
1  LISBON, Portugal  €15000  Cleanup of the T

In [75]:
# Capitalise each word of 'Location' and rejoin the words with a space
environment_df['Location_test'] = [" ".join(word.capitalize() for word in location) for location in environment_df['Location'].str.split()]
print(environment_df['Location_test'])

0       Madrid, Spain
1    Lisbon, Portugal
2     Berlin, Germany
3       Paris, France
4         Rome, Italy
5      Athens, Greece
6     Valencia, Spain
7     Porto, Portugal
8     Munich, Germany
9        Lyon, France
Name: Location_test, dtype: object


In [76]:
# Standardising 'Location'
environment_df['Location'] = environment_df['Location'].str.title()

# Extracting 'Year'
environment_df['Year'] = pd.to_datetime(environment_df['Date']).dt.year

print(environment_df)

   ID                                        Description        Date  \
0   1               Forest Conservation Project in Spain  2021-03-15   
1   2  River Cleanup Initiative Project 2021 in Portugal  2021-06-20   
2   3       Urban Gardening Community Project in Germany  2022-01-11   
3   4        Forest Reforestation Project 2022 in France  2022-04-05   
4   5          Wildlife Protection Project Plan in Italy  2023-02-22   
5   6  Endangered Species Conservation Project in Greece  2023-05-30   
6   7               Wetland Restoration Project in Spain  2021-09-13   
7   8       Marine Life Conservation Project in Portugal  2022-07-19   
8   9  Air Quality Improvement Project Initiative in ...  2023-03-08   
9  10   Habitat Preservation Project for Birds in France  2022-11-21   

           Location  Budget  \
0     Madrid, Spain  $20000   
1  Lisbon, Portugal  €15000   
2   Berlin, Germany  €12000   
3     Paris, France  £18000   
4       Rome, Italy  $25000   
5    Athens, Greece  

In [77]:
# Fixed conversion rates
conversion_rates = {'$': 1.0, '€': 1.1, '£': 1.3}  # Example rates: 1 Euro = 1.1 USD, 1 Pound = 1.3 USD

def convert_to_usd(budget_str):
    # Extracting the currency symbol and amount
    currency_symbol = budget_str[0]
    amount = float(budget_str[1:])

    # Converting to USD
    if currency_symbol in conversion_rates:
        return amount * conversion_rates[currency_symbol]
    else:
        return amount

# Converting 'Budget' to numeric USD values
environment_df['Budget_USD'] = environment_df['Budget'].apply(convert_to_usd)

# Calculating total budget for "forest"-related projects in USD
total_budget_forest_usd = environment_df[environment_df['Description'].str.contains("forest", case=False)]['Budget_USD'].sum()
print(total_budget_forest_usd)

43400.0
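
The function above assumes that every budget string starts with a one-character currency symbol followed directly by digits. A regex can make that assumption explicit and reject malformed values. The sketch below is one hedged alternative: the name `parse_budget_usd`, the decision to return `None` for unparseable input, and the tolerance for surrounding spaces are all assumptions, and the rates are the same example rates as above.

```python
import re

# Example rates copied from the cell above: 1 Euro = 1.1 USD, 1 Pound = 1.3 USD
conversion_rates = {'$': 1.0, '€': 1.1, '£': 1.3}

# A known currency symbol, then an integer or decimal amount, with optional spaces.
budget_pattern = re.compile(r'^\s*(?P<symbol>[$€£])\s*(?P<amount>\d+(?:\.\d+)?)\s*$')

def parse_budget_usd(budget_str):
    """Return the budget in USD, or None if the string does not match the expected format."""
    match = budget_pattern.match(budget_str)
    if match is None:
        return None
    return float(match.group('amount')) * conversion_rates[match.group('symbol')]

print(parse_budget_usd('$20000'))   # 20000.0
print(parse_budget_usd('20000'))    # None (no currency symbol)
```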


In [78]:
# Using regex to identify mentions of endangered species
environment_df['Endangered_species'] = environment_df['Notes'].str.contains(r'endangered species', flags=re.IGNORECASE).map({True: 'Yes', False: 'No'})

print(environment_df)

   ID                                        Description        Date  \
0   1               Forest Conservation Project in Spain  2021-03-15   
1   2  River Cleanup Initiative Project 2021 in Portugal  2021-06-20   
2   3       Urban Gardening Community Project in Germany  2022-01-11   
3   4        Forest Reforestation Project 2022 in France  2022-04-05   
4   5          Wildlife Protection Project Plan in Italy  2023-02-22   
5   6  Endangered Species Conservation Project in Greece  2023-05-30   
6   7               Wetland Restoration Project in Spain  2021-09-13   
7   8       Marine Life Conservation Project in Portugal  2022-07-19   
8   9  Air Quality Improvement Project Initiative in ...  2023-03-08   
9  10   Habitat Preservation Project for Birds in France  2022-11-21   

           Location  Budget  \
0     Madrid, Spain  $20000   
1  Lisbon, Portugal  €15000   
2   Berlin, Germany  €12000   
3     Paris, France  £18000   
4       Rome, Italy  $25000   
5    Athens, Greece  

In [79]:
# Extract 'Country' from 'Location'
environment_df['Country'] = environment_df['Location'].apply(lambda x: x.split(', ')[-1])

# Extract 'Project Type' from 'Description'
environment_df['Project_Type'] = environment_df['Description'].str.extract(r'(\b\w+\b) Project')[0]

# Generate the report
report = environment_df.groupby('Country').agg(
    Total_Projects=('ID', 'count'),
    Average_Budget=('Budget_USD', 'mean')
)

# Identify the three most common project types across all countries;
# the same overall list is then repeated on every row of the report
top_project_types = environment_df['Project_Type'].value_counts().nlargest(3).index.tolist()
report['Top_Project_Types'] = ', '.join(top_project_types)

print(report)

          Total_Projects  Average_Budget                    Top_Project_Types
Country                                                                      
France                 2         22100.0  Conservation, Initiative, Community
Germany                2         12650.0  Conservation, Initiative, Community
Greece                 1         22000.0  Conservation, Initiative, Community
Italy                  1         25000.0  Conservation, Initiative, Community
Portugal               2         15400.0  Conservation, Initiative, Community
Spain                  2         18500.0  Conservation, Initiative, Community
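
The report above shows the same overall list of project types for every country. If a per-country breakdown is wanted instead, the project types can be counted within each country. The sketch below uses only the standard library so that it runs on its own; the `projects` list pairs each description from the sample data with its country, and the names `type_pattern` and `types_by_country` are illustrative.

```python
import re
from collections import Counter, defaultdict

# (country, description) pairs copied from the sample data above
projects = [
    ("Spain", "Forest Conservation Project in Spain"),
    ("Portugal", "River Cleanup Initiative Project 2021 in Portugal"),
    ("Germany", "Urban Gardening Community Project in Germany"),
    ("France", "Forest Reforestation Project 2022 in France"),
    ("Italy", "Wildlife Protection Project Plan in Italy"),
    ("Greece", "Endangered Species Conservation Project in Greece"),
    ("Spain", "Wetland Restoration Project in Spain"),
    ("Portugal", "Marine Life Conservation Project in Portugal"),
    ("Germany", "Air Quality Improvement Project Initiative in Germany"),
    ("France", "Habitat Preservation Project for Birds in France"),
]

# The project type is taken to be the word immediately before "Project".
type_pattern = re.compile(r"(\w+) Project")

types_by_country = defaultdict(Counter)
for country, description in projects:
    match = type_pattern.search(description)
    if match:
        types_by_country[country][match.group(1)] += 1

for country in sorted(types_by_country):
    print(country, dict(types_by_country[country]))
```

With only one or two projects per country in this sample, most types occur once, so a "top" type per country is only meaningful on a larger dataset.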



<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>