## Project: Auto-Tagging System for Content Categorization

In this project, you’ll use Python modules and spaCy—a cutting-edge NLP library—to develop an automated system for tagging textual content efficiently.

### Task 1: Import Libraries and Modules

In this project, you’ll use the following Python libraries. Start the project by importing the necessary libraries and modules into the Jupyter Notebook.

1. spacy: It’s a powerful Python library for advanced NLP tasks.

2. re: It helps define regular expression patterns for text.

3. pandas: It’s useful for dealing with DataFrames.

4. nltk: It helps access some helper functions for text preprocessing.

5. Matcher: It’s a module in spacy that helps define rule-based entities or any technical terms.

6. contractions: It helps handle contractions in text processing steps.

7. random: It helps get a random element from several elements.

8. Example: It’s a class that’s part of the spacy.training module and helps in fine-tuning the spacy model.

9. Counter: It’s a module in the collections class that helps count hashable objects.

10. ast: It’s a tool for understanding and changing Python code.

                    

In [1]:
import spacy
import re
import pandas as pd
import contractions
import nltk
from spacy.matcher import Matcher
from spacy.training import Example
import random
from collections import Counter
import ast

### Task 2: Load and Explore the Dataset

The data for this project is from the Newsgroups dataset. This dataset is a popular choice for experiments in text applications of machine learning, 
specifically for tasks like content categorization and classification. Each document in the dataset represents a newsgroup post, making it a rich 
resource for understanding and working with real-world text data in NLP. The dataset has only a single News Data named column.

To complete this task, perform the following steps:

1. Read the data from the news_data.csv named CSV file, which is located in the /usercode directory.

2. Load the dataset and print the head of the dataset.

3. Print the first news data row of the dataset.



In [2]:
# loading news data
df = pd.read_csv('/usercode/news_data.csv')

print(df.head())
print(df["News Data"].iloc[0])

                                           News Data
0  From: dlphknob@camelot.bradley.edu (Jemaleddin...
1  From: svoboda@rtsg.mot.com (David Svoboda)\nSu...
2  From: ld231782@longs.lance.colostate.edu (L. D...
3  From: ipser@solomon.technet.sg (Ed Ipser)\nSub...
4  From: cme@ellisun.sw.stratus.com (Carl Ellison...
From: dlphknob@camelot.bradley.edu (Jemaleddin Cole)
Subject: Re: Catholic Lit-Crit of a.s.s.
Nntp-Posting-Host: camelot.bradley.edu
Organization: The Society for the Preservation of Cruelty to Homophobes.
Lines: 37

In <1993Apr14.101241.476@mtechca.maintech.com> foster@mtechca.maintech.com writes:

>I am surprised and saddened. I would expect this kind of behavior
>from the Evangelical Born-Again Gospel-Thumping In-Your-Face We're-
>The-Only-True-Christian Protestants, but I have always thought 
>that Catholics behaved better than this.
>                                   Please do not stoop to the
>level of the E B-A G-T I-Y-F W-T-O-T-C Protestants, who think
>that the be

### Task 3: Handle Text Case, Contractions, and URLs

In the previous task, you observed how messy the data is for further analysis. Now, you’ll explore a set of data processing and cleaning approaches
that will clean the data.

To complete this task, perform the following steps:

1. Extract random text data from the news_data dataset.

2. Write a convert_to_lowercase() function that converts this text into lowercase and prints the updated text.

3. Write another expand_contractions() function that handles any contractions available in the text and prints the updated text.

4. Write another remove_urls() function that detects and removes the URLs in the text and prints the updated text.

    

In [3]:
test_data = df["News Data"].iloc[0]

In [4]:
def convert_to_lowercase(test_data): 
    return test_data.lower()

test_data = convert_to_lowercase(test_data)

print(test_data)

from: dlphknob@camelot.bradley.edu (jemaleddin cole)
subject: re: catholic lit-crit of a.s.s.
nntp-posting-host: camelot.bradley.edu
organization: the society for the preservation of cruelty to homophobes.
lines: 37

in <1993apr14.101241.476@mtechca.maintech.com> foster@mtechca.maintech.com writes:

>i am surprised and saddened. i would expect this kind of behavior
>from the evangelical born-again gospel-thumping in-your-face we're-
>the-only-true-christian protestants, but i have always thought 
>that catholics behaved better than this.
>                                   please do not stoop to the
>level of the e b-a g-t i-y-f w-t-o-t-c protestants, who think
>that the best way to witness is to be strident, intrusive, loud,
>insulting and overbearingly self-righteous.

(pleading mode on)

please!  i'm begging you!  quit confusing religious groups, and stop
making generalizations!  i'm a protestant!  i'm an evangelical!  i don't
believe that my way is the only way!  i'm not a "creatio

In [5]:
def expand_contractions(test_data):
    return contractions.fix(test_data)

test_data = expand_contractions(test_data)

print(test_data)

from: dlphknob@camelot.bradley.edu (jemaleddin cole)
subject: re: catholic lit-crit of a.s.s.
nntp-posting-host: camelot.bradley.edu
organization: the society for the preservation of cruelty to homophobes.
lines: 37

in <1993apr14.101241.476@mtechca.maintech.com> foster@mtechca.maintech.com writes:

>i am surprised and saddened. i would expect this kind of behavior
>from the evangelical born-again gospel-thumping in-your-face we are-
>the-only-true-christian protestants, but i have always thought 
>that catholics behaved better than this.
>                                   please do not stoop to the
>level of the e b-a g-t i-y-f w-t-o-t-c protestants, who think
>that the best way to witness is to be strident, intrusive, loud,
>insulting and overbearingly self-righteous.

(pleading mode on)

please!  i am begging you!  quit confusing religious groups, and stop
making generalizations!  i am a protestant!  i am an evangelical!  i do not
believe that my way is the only way!  i am not a "c

In [6]:
def remove_urls(test_data): 
    url_pattern = r'https?://\S+|www\.\S+'
    return re.sub(url_pattern, ' ', test_data)

test_data = remove_urls(test_data)

print(test_data)

from: dlphknob@camelot.bradley.edu (jemaleddin cole)
subject: re: catholic lit-crit of a.s.s.
nntp-posting-host: camelot.bradley.edu
organization: the society for the preservation of cruelty to homophobes.
lines: 37

in <1993apr14.101241.476@mtechca.maintech.com> foster@mtechca.maintech.com writes:

>i am surprised and saddened. i would expect this kind of behavior
>from the evangelical born-again gospel-thumping in-your-face we are-
>the-only-true-christian protestants, but i have always thought 
>that catholics behaved better than this.
>                                   please do not stoop to the
>level of the e b-a g-t i-y-f w-t-o-t-c protestants, who think
>that the best way to witness is to be strident, intrusive, loud,
>insulting and overbearingly self-righteous.

(pleading mode on)

please!  i am begging you!  quit confusing religious groups, and stop
making generalizations!  i am a protestant!  i am an evangelical!  i do not
believe that my way is the only way!  i am not a "c

### Task 4: Handle Emails and Datetime

In the previous task, you handled the text casing and URLs in the text data. Now, you’ll handle emails and date time elements present in the data. 
This is an incremental approach to text processing, which you must follow in order.

To complete this task, perform the following steps:

1. Create a remove_email_addresses() function to detect and remove email addresses from the text and print the updated text. This text should be free 
from any email addresses.

2. Create a remove_dates_times() function to detect and remove date time elements from the text and print the updated text. This text should be free
from any date time elements.

   

In [7]:
def remove_email_addresses(test_data):
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    return re.sub(email_pattern, ' ', test_data)

test_data = remove_email_addresses(test_data)
test_data = re.sub(r'\s{2,}',' ', test_data).strip()

In [8]:
def remove_dates_times(test_data): 
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
    time_pattern = r'\b\d{1,2}:\d{2}(?: \s?[AP]M)?\b' 
    test_data = re.sub(date_pattern, ' ', test_data) 
    test_data = re.sub(time_pattern, ' ', test_data) 
    return test_data

test_data = remove_dates_times(test_data)
test_data = re.sub(r'\s{2,}',' ', test_data).strip()

print(test_data)

from: (jemaleddin cole)
subject: re: catholic lit-crit of a.s.s.
nntp-posting-host: camelot.bradley.edu
organization: the society for the preservation of cruelty to homophobes.
lines: 37 in < > writes: >i am surprised and saddened. i would expect this kind of behavior
>from the evangelical born-again gospel-thumping in-your-face we are-
>the-only-true-christian protestants, but i have always thought >that catholics behaved better than this.
> please do not stoop to the
>level of the e b-a g-t i-y-f w-t-o-t-c protestants, who think
>that the best way to witness is to be strident, intrusive, loud,
>insulting and overbearingly self-righteous. (pleading mode on) please! i am begging you! quit confusing religious groups, and stop
making generalizations! i am a protestant! i am an evangelical! i do not
believe that my way is the only way! i am not a "creation scientist"! i
do not think that homosexuals should be hung by their toenails! if you want to discuss bible thumpers, you would be bett

### Task 5: Remove Numbers and Special Characters

In the previous task, you removed emails and date time elements present in the text data. Now, you’ll handle special characters and numbers.

To complete this task, perform the following steps:

1. Create a remove_numbers_and_special_characters() named function to detect and remove both numbers and special characters present in the data.

2. Apply the function to your sample text and print the updated text.

    

In [9]:
def remove_numbers_and_special_characters(test_data):
    number_pattern = r'\b\d+\b'
    special_char_pattern = r'[^\w\s\.]|_|\s+'
    test_data =  re.sub(number_pattern, ' ', test_data)
    test_data = re.sub(r'\s{2,}',' ', test_data).strip()
    return re.sub(special_char_pattern, ' ', test_data)

test_data = remove_numbers_and_special_characters(test_data)
test_data = re.sub(r'\s{2,}',' ', test_data).strip()

print(test_data)

from jemaleddin cole subject re catholic lit crit of a.s.s. nntp posting host camelot.bradley.edu organization the society for the preservation of cruelty to homophobes. lines in writes i am surprised and saddened. i would expect this kind of behavior from the evangelical born again gospel thumping in your face we are the only true christian protestants but i have always thought that catholics behaved better than this. please do not stoop to the level of the e b a g t i y f w t o t c protestants who think that the best way to witness is to be strident intrusive loud insulting and overbearingly self righteous. pleading mode on please i am begging you quit confusing religious groups and stop making generalizations i am a protestant i am an evangelical i do not believe that my way is the only way i am not a creation scientist i do not think that homosexuals should be hung by their toenails if you want to discuss bible thumpers you would be better off singling out and making obtuse general

### Task 6: Handle Stopwords and Extra Spaces

In this task, you’ll remove stop words and extra spaces from the text.

To complete this task, perform the following steps:

1. Create a remove_stop_words_and_spaces() named function that uses the list of stopwords to check if any of these stopwords are present in the text 
and remove them if present.

2. Remove any extra spaces present in the text.

3. Print the final cleaned text.

    

In [10]:
nlp = spacy.load('en_core_web_md', disable=[ 'parser', 'lemmatizer', 'attribute_ruler'])

def remove_stop_words_and_spaces(test_data, nlp_model):
    doc = nlp_model(test_data)
    filtered_text = [token.text for token in doc if not token.is_stop]
    return(' '.join(filtered_text))

# Assuming nlp is defined elsewhere and passed appropriately
test_data = remove_stop_words_and_spaces(test_data, nlp)

print(test_data)

jemaleddin cole subject catholic lit crit a.s.s . nntp posting host camelot.bradley.edu organization society preservation cruelty homophobes . lines writes surprised saddened . expect kind behavior evangelical born gospel thumping face true christian protestants thought catholics behaved better . stoop level e b g t y f w t o t c protestants think best way witness strident intrusive loud insulting overbearingly self righteous . pleading mode begging quit confusing religious groups stop making generalizations protestant evangelical believe way way creation scientist think homosexuals hung toenails want discuss bible thumpers better singling making obtuse generalizations fundamentalists . compared actions presbyterians methodists southern baptists think different religions prejudice thinking people group write protestants evangelicals pleading mode . god ....... wish ahold thomas stories ...... fbzr enval jvagre fhaqnlf jura gurer f n yvggyr oberqbz lbh fubhyq nyjnlf pneel n tha . abg gb

### Task 7: Tokenize Cleaned Text

Tokenization is a crucial step in NLP, where text is split into words or phrases called tokens. This process is essential for understanding the 
structure and meaning of text. Tokenization allows algorithms to process and analyze text at a granular level.

Tokenization algorithms built into spaCy perform tokenization tasks well. Its models are optimized for speed and accuracy, which makes spaCy ideal 
for processing large volumes of text efficiently.

To complete this task, perform the following steps:

1. Create a tokenize_text() named function that takes the preprocessed version of the sample text and returns the list of tokens as an output.

2. Print the list of tokens.



In [11]:
def tokenize_text(text):
    tokens = [token.text for token in nlp(text)]
    return tokens

tokens = tokenize_text(test_data)

print(tokens)

['jemaleddin', 'cole', 'subject', 'catholic', 'lit', 'crit', 'a.s.s', '.', 'nntp', 'posting', 'host', 'camelot.bradley.edu', 'organization', 'society', 'preservation', 'cruelty', 'homophobes', '.', 'lines', 'writes', 'surprised', 'saddened', '.', 'expect', 'kind', 'behavior', 'evangelical', 'born', 'gospel', 'thumping', 'face', 'true', 'christian', 'protestants', 'thought', 'catholics', 'behaved', 'better', '.', 'stoop', 'level', 'e', 'b', 'g', 't', 'y', 'f', 'w', 't', 'o', 't', 'c', 'protestants', 'think', 'best', 'way', 'witness', 'strident', 'intrusive', 'loud', 'insulting', 'overbearingly', 'self', 'righteous', '.', 'pleading', 'mode', 'begging', 'quit', 'confusing', 'religious', 'groups', 'stop', 'making', 'generalizations', 'protestant', 'evangelical', 'believe', 'way', 'way', 'creation', 'scientist', 'think', 'homosexuals', 'hung', 'toenails', 'want', 'discuss', 'bible', 'thumpers', 'better', 'singling', 'making', 'obtuse', 'generalizations', 'fundamentalists', '.', 'compared', 

### Task 8: Build Data Prep Pipeline

In this task, you’ll create a data preparation pipeline that takes raw text data (e.g., the complete News Data column) as input and returns the
preprocessed and tokenized data. For this, you’ll utilize all the preprocessing steps that you’ve done so far.

To complete this task, perform the following steps:

1. Create a preprocess_data() named function that should utilize all the data preprocessing functions to preprocess the dataset and return the processed
data.

2. Create a preprocess_and_tokenize() named function that should tokenize the processed data.

3. Apply the preprocess_data() and preprocess_and_tokenize() functions to the News Data column in the DataFrame such that for each data item, you should 
have both processed data and the processed tokenized data version.

4. Print the head of the new DataFrame.


In [12]:
def preprocess_data(text):
    # Apply all preprocessing functions
    text = convert_to_lowercase(text)
    text = expand_contractions(text)
    text = remove_urls(text)
    text = remove_email_addresses(text)
    text = remove_dates_times(text)
    text = remove_numbers_and_special_characters(text)
    text = remove_stop_words_and_spaces(text, nlp)
    return text

In [13]:
def preprocess_and_tokenize(text):
    tokens = tokenize_text(text)
    return tokens

In [14]:
df['Processed_Data'] = df['News Data'].apply(preprocess_data)

df['Processed_Tokenized_Data'] = df['Processed_Data'].apply(preprocess_and_tokenize)

print(df.head())

                                           News Data  \
0  From: dlphknob@camelot.bradley.edu (Jemaleddin...   
1  From: svoboda@rtsg.mot.com (David Svoboda)\nSu...   
2  From: ld231782@longs.lance.colostate.edu (L. D...   
3  From: ipser@solomon.technet.sg (Ed Ipser)\nSub...   
4  From: cme@ellisun.sw.stratus.com (Carl Ellison...   

                                      Processed_Data  \
0     jemaleddin cole   subject     catholic lit ...   
1     david svoboda   subject     opinion means ....   
2     l. detweiler   subject   privacy    anonymi...   
3     ed ipser   subject   ways slick willie impr...   
4     carl ellison   subject     clipper crypto o...   

                            Processed_Tokenized_Data  
0  [   , jemaleddin, cole,   , subject,     , cat...  
1  [   , david, svoboda,   , subject,     , opini...  
2  [   , l., detweiler,   , subject,   , privacy,...  
3  [   , ed, ipser,   , subject,   , ways, slick,...  
4  [   , carl, ellison,   , subject,     , clippe..

df.shape

### Task 9: Create Pattern Matching Flow

The Matcher class in spaCy allows you to define patterns based on token attributes like text, part-of-speech tags, and syntactic dependencies. This
class uses a dictionary of patterns. In the dictionary, the keys represent entities or technical term names, and the values are the pattern 
definitions you want to identify.

For instance, if you have a {"GPE": [{"LOWER": "India"}]} like pattern, it means you’re looking for occurrences of the word India in lowercase, 
and when found, it’s labeled as a GPE (geopolitical entity).

To complete this task, perform the following steps:

1. Define a sample_text variable with the following value:
I invested $1000 on 11 Jan 2021, and it grew to $5 million by Dec 2022.

2. Define a patterns dictionary with patterns so that you can identify the date and currency elements present in the sample_text string.

3. Create a find_matches() named function which takes the sample_text string and the patterns dictionary as input and returns the identified 
terms in the data.




In [16]:
# Define a function to find matches in a text
def find_matches(text, patterns):
    # Create a Matcher object
    matcher = Matcher(nlp.vocab)

    # Add patterns to the matcher object
    for pattern_name, pattern in patterns.items():
        matcher.add(pattern_name, [pattern])

    doc = nlp(text)
    matches = matcher(doc)
    matched_entities = []

    for match_id, start, end in matches:
        matched_entities.append((doc[start:end].text, nlp.vocab.strings[match_id]))

    return matched_entities

patterns = {
    "DATE_PATTERN": [
        {"IS_DIGIT": True, "LENGTH": 2, "OP": "?"},  # Day as two digits
        {"LOWER": {"IN": ["jan", "feb", "mar", "apr", "may", "jun", 
                          "jul", "aug", "sep", "oct", "nov", "dec"]}},  # Month abbreviations
        {"IS_DIGIT": True, "LENGTH": 4, "OP": "?"}  # Four digit year
    ],
    "MONEY_PATTERN": [
        {"ORTH": "$"},  # Optional dollar sign
        {"LIKE_NUM": True},  # Numeric value
        {"LOWER": {"IN": ["thousand", "million", "billion"]}, "OP": "?"}  # Large amount modifiers
    ]
}

sample_text = "I invested $1000 on 11 Jan 2021 and it grew to $5 million by Dec 2022."

matched_entities = find_matches(sample_text, patterns)

for entity, label in matched_entities:
    print(f"{entity} - {label}")

$1000 - MONEY_PATTERN
11 Jan - DATE_PATTERN
11 Jan 2021 - DATE_PATTERN
Jan - DATE_PATTERN
Jan 2021 - DATE_PATTERN
$5 - MONEY_PATTERN
$5 million - MONEY_PATTERN
Dec - DATE_PATTERN
Dec 2022 - DATE_PATTERN


### Task 10: Entity Extraction Using spaCy Model

In this task, you’ll use spaCy’s model to extract entities from your cleaned text data. You’ll learn to leverage spaCy’s pretrained 
Named Entity Recognition (NER) model’s capabilities to identify and tag various entities, like names, places, organizations, etc., in your text.

spaCy has different types of pretrained models that are of different sizes. You’ll be working with spaCy’s medium model, en_core_web_md.

To complete this task, perform the following steps:

1. Create a get_entities_medium_spacy() named function, which takes any sample text from your processed data and returns a list of extracted entities.

2. Print the extracted entities.



In [17]:
# Take any sample processed text of your choice
sample_text = df["Processed_Data"].iloc[122]

def get_entities_medium_spacy(text):
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        if ent.label_ not in entities:
            entities[ent.label_] = []
        entities[ent.label_].append(ent.text)
    return entities

entities = get_entities_medium_spacy(sample_text)

print(entities)

{'PERSON': ['tony jones', 'erik asphaug x2773      ', 'tony', 'tony jones'], 'ORG': ['nntp', 'cray research inc   eagan   mn x', 'cmcs codegeneration group', 'cray research inc   '], 'NORP': ['pl6'], 'GPE': ['655f']}


In [18]:
df['Processed_Data'].iloc[122]

'   tony jones   subject     insurance discount lines   nntp posting host   palm21 organization   cray research inc   eagan   mn x newsreader   tin   version . pl6   erik asphaug x2773      wrote     ... insurance agent offers multi vehicle discount .    time cars   assuming capable progressive offers multi vehicle discounts . good prices imho . tony     tony jones      .. uunet cray ant   cmcs codegeneration group   software division cray research inc   655f lone oak drive   eagan   mn'

### Task 11: Optimizing spaCy Model

In this task, you’ll optimize the existing spaCy medium model, en_core_web_md, to better suit your specific NLP needs. This will involve
fine-tuning the model on your dataset, which allows the model to identify and classify entities relevant to your text data more accurately.

In the first example, the position of the Noida entity is defined. Here, the start and end+1 indexes are used to define any entity.
Similarly, you can add more examples to this list to create custom data.

To complete this task, perform the following steps:

1. Prepare a set of training data for the spaCy model.

2. Use the training data to fine-tune the existing spaCy model.

3. Use the fine-tuned model to extract entities.

4. Save the optimized model to the /usercode directory with the spacy_optimised name.



In [20]:
# Annotated training data

train_data = [

     ("Noida is a city in India", {"entities": [(0, 5, "LOC")]}),
     ("Google launches a new AI research lab in Zurich", {"entities": [(0, 6, "ORG"), (37, 43, "LOC")]}),
     ("Amazon acquires Twitch for $970 million in 2014", {"entities": [(0, 6, "ORG"), (15, 21, "ORG"), (35, 42, "MONEY"), (46, 50, "DATE")]}),
     ("Elon Musk founded SpaceX to revolutionize space travel", {"entities": [(0, 9, "PERSON"), (18, 24, "ORG")]}),
     ("The Mona Lisa is on display in the Louvre Museum in Paris", {"entities": [(4, 13, "WORK_OF_ART"), (34, 47, "ORG"), (51, 56, "LOC")]}),
     ("The Great Wall of China stretches over 13,000 miles", {"entities": [(4, 22, "LOC"), (42, 51, "QUANTITY")]}),
     ("IBM introduces Watson, the AI that beat Jeopardy champions", {"entities": [(0, 3, "ORG"), (16, 22, "PERSON"), (31, 33, "ORG")]}),
     ("The Nile River flows through Egypt", {"entities": [(4, 14, "LOC"), (28, 33, "GPE")]}),
     ("Harvard University was established in 1636", {"entities": [(0, 17, "ORG"), (35, 39, "DATE")]}),
     ("Mount Everest is the world's highest mountain", {"entities": [(0, 13, "LOC")]}),
     ("Julia Roberts stars in the new Netflix series", {"entities": [(0, 12, "PERSON"), (31, 38, "ORG")]})
]


In [21]:
# Update the NER component with new examples

ner = nlp.get_pipe('ner')
for _, annotations in train_data:
   for ent in annotations.get('entities'):
      ner.add_label(ent[2])

# Disable other pipeline components for training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
   optimizer = nlp.resume_training()
   for iteration in range(30):  # Adjust iterations as needed
      random.shuffle(train_data)
      for text, annotations in train_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.5, sgd=optimizer)



In [22]:
entities = get_entities_medium_spacy("Noida is a city")
print(entities)

{'LOC': ['Noida']}


In [23]:
nlp.to_disk('/usercode/spacy_optimised')

### Task 12: Optimizing Entity Extraction for Auto-Tagging

In this task, you’ll streamline your entity extraction by utilizing spaCy’s pipe() function. This method significantly speeds up the processing 
of large text datasets by batch processing them, which is essential for handling real-world data volumes efficiently.

To complete this task, perform the following steps:

1. Create an extract_entities_pipe() named function, which uses the Processed_Data column to extract entities of all texts available in respective 
rows and store them in an entities named new column.

2. Print the head of the DataFrame.



In [24]:
def extract_entities_pipe(texts):
    entities = []
    for doc in nlp.pipe(texts):
        entities.append([(ent.text, ent.label_) for ent in doc.ents])
    return entities

if 'Processed_Data' in df.columns:
    df['entities'] = extract_entities_pipe(df['Processed_Data'])
    print(df.head())
else:
    print("The 'Processed_Data' column does not exist in the DataFrame.")
    

                                           News Data  \
0  From: dlphknob@camelot.bradley.edu (Jemaleddin...   
1  From: svoboda@rtsg.mot.com (David Svoboda)\nSu...   
2  From: ld231782@longs.lance.colostate.edu (L. D...   
3  From: ipser@solomon.technet.sg (Ed Ipser)\nSub...   
4  From: cme@ellisun.sw.stratus.com (Carl Ellison...   

                                      Processed_Data  \
0     jemaleddin cole   subject     catholic lit ...   
1     david svoboda   subject     opinion means ....   
2     l. detweiler   subject   privacy    anonymi...   
3     ed ipser   subject   ways slick willie impr...   
4     carl ellison   subject     clipper crypto o...   

                            Processed_Tokenized_Data  \
0  [   , jemaleddin, cole,   , subject,     , cat...   
1  [   , david, svoboda,   , subject,     , opini...   
2  [   , l., detweiler,   , subject,   , privacy,...   
3  [   , ed, ipser,   , subject,   , ways, slick,...   
4  [   , carl, ellison,   , subject,     , cli

### Task 13: Enhanching Entity Aggregation for Workflow Optimization

In this task, you’ll enhance your auto-tagging system by aggregating similar entities, reducing redundancy, and ensuring consistency in your tags

To complete this task, perform the following steps:

1. Create an aggregate_entities() named function that uses the entities column to aggregate similar entities and save the result in a
new Refined_Entities named column.

2. Print the head of the Refined_Entities column.


In [25]:
def aggregate_entities(entities_list):
    aggregated_entities = {}
    for ent_text, ent_label in entities_list:
        if ent_label in aggregated_entities:
            aggregated_entities[ent_label].add(ent_text)
        else:
            aggregated_entities[ent_label] = {ent_text}
    return {label: list(texts) for label, texts in aggregated_entities.items()}

df['Refined_Entities'] = df['entities'].apply(aggregate_entities)

print(df["Refined_Entities"].head())

0    {'ORG': ['nntp'], 'PERSON': ['david cole', 'gb...
1    {'PERSON': ['david svoboda', 'dave svoboda', '...
2    {'ORG': ['part3', 'n7kbt.rain.com', 'nntp', 'g...
3    {'ORG': ['solomon.technet.sg', 'senate'], 'LOC...
4     {'PERSON': ['carl ellison'], 'ORG': ['micmail']}
Name: Refined_Entities, dtype: object


### Task 14: Preparing and Refining Test Data for Entity Analysis

In this task, you’ll take a crucial step toward enhancing your entity analysis by preparing the test data. This preparation is important for 
ensuring that your entity analysis is performed on clean, relevant data, which is fundamental for the accurate tagging process.

To complete this task, perform the following steps:

1. Utilize the test data (test_data.csv) available in the /usercode directory to test the model. Test data has an Actual_Entities named column, which 
is a string-formatted dictionary converted to dictionary format.

2. Add a new Processed_Data named column in the test_data using the data preprocessing pipeline created earlier.

3. Add a new entities named column in test_data with the help of the Processed_Data column, using the extract_entities_pipe() function you created earlier.

4. Add a new Refined_Entities named column in test_data using the aggregate_entities() function created earlier.

5. Print the head of test_data.

                                                                       

In [26]:
df_test = pd.read_csv('/usercode/test_data.csv')

# Fixing the Actual_Entities column
def dict_fix(val):
    return ast.literal_eval(val)

df_test["Actual_Entities"] = df_test["Actual_Entities"].apply(dict_fix)

df_test['Processed_Data'] = df_test['News Data'].apply(preprocess_data)

df_test['entities'] = extract_entities_pipe(df_test['Processed_Data'])

df_test['Refined_Entities'] = df_test['entities'].apply(aggregate_entities)

df_test.drop(["Processed_Data","entities"], axis=1, inplace=True)

df_test.head()

Unnamed: 0,News Data,Actual_Entities,Refined_Entities
0,From: rcollins@ns.encore.com (Roger Collins)\n...,"{'PERSON': ['roger collins', 'clinton', 'steve...","{'PERSON': ['steve hendricks', 'roger collins'..."
1,From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...,"{'PERSON': ['daniel burstein', 'ron baalke', '...","{'PERSON': ['ron baalke', 'daniel burstein'], ..."
2,Distribution: world\nFrom: David_A._Schnider@b...,"{'ORG': ['mac vga', 'sony cpd', 'cpd', 'trinit...","{'ORG': ['sony'], 'LOC': ['macs'], 'PERSON': [..."
3,From: brian@nostromo.NoSubdomain.NoDomain (Bri...,"{'PERSON': ['brian colaric sun', 'brian colari...",{'ORG': ['os2']}
4,From: fabian@vivian.w.open.de (Fabian Hoppe)\n...,"{'NORP': ['fabian'], 'ORG': ['nntp'], 'PERSON'...","{'NORP': ['fabian'], 'PERSON': ['fabian hoppe'..."


### Task 15: Compute the Evaluation Metrics

In this task, you’ll assess the effectiveness of the entity recognition model by calculating precision, recall, and F1-score. These metrics provide
insight into the model’s accuracy and its ability to identify relevant entities that are crucial for fine-tuning and enhancing model performance.

To complete this task, perform the following steps:

1. Create a calculate_entity_metrics() named function which calculates precision, recall, and F1-score for each row of the test_data with the help of
Actual_Entities and Refined_Entities.

2. Calculate the average precision, recall, and F1-score (precision, recall, and f1, respectively) for test_data and print them.



In [27]:
def calculate_entity_metrics(actual, predicted):
    precision, recall, f1 = 0, 0, 0
    num_entities = len(actual)

    for ent_type in actual:
        actual_entities = set(actual.get(ent_type, []))
        predicted_entities = set(predicted.get(ent_type, []))

        true_positives = len(actual_entities & predicted_entities)
        false_positives = len(predicted_entities - actual_entities)
        false_negatives = len(actual_entities - predicted_entities)

        precision_part = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
        recall_part = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
        
        precision += precision_part
        recall += recall_part

     # Handle the case where there are no entities
    if num_entities > 0:
        precision /= num_entities
        recall /= num_entities
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
    else:
        precision, recall, f1 = 0, 0, 0  

    return precision, recall, f1


metrics = df_test.apply(lambda row: calculate_entity_metrics(row['Actual_Entities'], row['Refined_Entities']), axis=1)

In [28]:
precisions, recalls, f1s = zip(*metrics)
average_precision = sum(precisions) / len(precisions)
average_recall = sum(recalls) / len(recalls)
average_f1 = sum(f1s) / len(f1s)

print(f'Average Precision: {average_precision}')
print(f'Average Recall: {average_recall}')
print(f'Average f1: {average_f1}')

Average Precision: 0.2637706360265627
Average Recall: 0.15532456660089272
Average f1: 0.1866298513435013
