# **Categorization using pandas**

Pandas is a popular library for manipulating and analyzing tabular data, which it represents as a DataFrame.

In [1]:
import pandas as pd

In [6]:
# Prepare the data in DataFrame format
data = {
    'document': [
        'Tigers are fierce animals that prey on their victims',
        'Elephants are large and gentle animals',
        'Lions are powerful kings of the jungle',
        'Crocodiles live in rivers and swamps',
        'Penguins are birds that cannot fly'
    ]
}

df = pd.DataFrame(data)

In [7]:
# Create a list of animal categories
animal_categories = ['Tiger', 'Elephant', 'Lion', 'Crocodile', 'Penguin']

In [8]:
# Create a function to find the category for each document
def find_category(document):
    for category in animal_categories:
        if category.lower() in document.lower():
            return category
    return 'Unknown'

In [9]:
# Create a new column to store the category in the DataFrame
df['animal_category'] = df['document'].apply(find_category)

In [12]:
# Display the DataFrame with the new 'animal_category' column
df

Unnamed: 0,document,animal_category
0,Tigers are fierce animals that prey on their v...,Tiger
1,Elephants are large and gentle animals,Elephant
2,Lions are powerful kings of the jungle,Lion
3,Crocodiles live in rivers and swamps,Crocodile
4,Penguins are birds that cannot fly,Penguin
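The check in `find_category` is a plain substring test, so a category name can also match inside a longer, unrelated word (for example `lion` inside `stallion`). A minimal sketch of a stricter variant, assuming that matching whole words with an optional trailing `s` is acceptable for this data; `find_category_whole_word` is a hypothetical helper, not part of the original notebook:

```python
import re

animal_categories = ['Tiger', 'Elephant', 'Lion', 'Crocodile', 'Penguin']

def find_category_whole_word(document):
    """Return the first category that appears as a whole word (optionally plural)."""
    for category in animal_categories:
        # \b marks a word boundary; s? also accepts a simple plural such as "Tigers"
        pattern = rf"\b{re.escape(category)}s?\b"
        if re.search(pattern, document, flags=re.IGNORECASE):
            return category
    return 'Unknown'

print(find_category_whole_word('Tigers are fierce animals'))
print(find_category_whole_word('A stallion ran across the field'))
```

The second call returns `'Unknown'`, whereas the substring version would report `'Lion'`. Irregular plurals (for example "geese") would still need their own entries.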


Next, apply the same approach to a dataset of comments about stray animals.

In [13]:
import pandas as pd

In [15]:
# Step 1: Read data from the CSV file into a DataFrame
df = pd.read_csv('data_set.csv')
df.head()

Unnamed: 0,text
0,"Spotted a cute stray cat today, looking for so..."
1,Saw a group of stray dogs playing together nea...
2,"A stray kitten crossed my path today, I couldn..."
3,Encountered a friendly stray dog during my wal...
4,It's heartbreaking to see so many stray animal...


In [16]:
# Categories
category_dict = {
    'Mammals': ['Lion', 'Elephant', 'Tiger', 'Panda', 'Cat', 'Dog', 'Kitten', 'Puppy', 'Rabbit', 'Hamster', 'Ferret', 'Raccoon', 'Squirrel', 'Fox', 'Chinchilla', 'Guinea pig', 'Horse', 'Possum'],
    'Birds': ['Bird', 'Pigeon'],
    'Reptiles': ['Iguana', 'Snake'],
    'Amphibians': ['Frog'],
    'Other': ['Goat', 'Geese', 'Hedgehog', 'Turtle']
}

In [21]:
# Create a function to find the category for each comment
def find_category(text):
    for category, animals in category_dict.items():
        for animal in animals:
            if animal.lower() in text.lower():
                return category
    return 'Unknown'

In [22]:
# Create a new column to store the category in the DataFrame
df['animal_category'] = df['text'].apply(find_category)

In [23]:
# Display the DataFrame with the new 'animal_category' column
df

Unnamed: 0,text,animal_category
0,"Spotted a cute stray cat today, looking for so...",Mammals
1,Saw a group of stray dogs playing together nea...,Mammals
2,"A stray kitten crossed my path today, I couldn...",Mammals
3,Encountered a friendly stray dog during my wal...,Mammals
4,It's heartbreaking to see so many stray animal...,Unknown
...,...,...
95,A stray possum has taken up residence in a tre...,Mammals
96,Visited a sanctuary that provides a safe haven...,Unknown
97,A local organization is conducting a vaccinati...,Unknown
98,"Encountered a stray turtle while hiking, helpe...",Other
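Once the labels are assigned, it can help to check how the categories are distributed and which comments were left as `Unknown`. A small sketch on a hypothetical four-row sample (the `sample` DataFrame below is invented for illustration and only mirrors the shape of the real data):

```python
import pandas as pd

# Hypothetical sample standing in for the labelled DataFrame
sample = pd.DataFrame({
    'text': [
        'Spotted a cute stray cat today',
        'Saw a group of stray dogs playing',
        'So many stray animals need help',
        'Encountered a stray turtle while hiking',
    ],
    'animal_category': ['Mammals', 'Mammals', 'Unknown', 'Other'],
})

# Number of comments per category
category_counts = sample['animal_category'].value_counts()

# Comments that matched no keyword and may need new keywords or manual review
unmatched = sample[sample['animal_category'] == 'Unknown']

print(category_counts)
print(unmatched['text'].tolist())
```

Reading through the unmatched comments is a quick way to decide whether the keyword lists need extending.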


# **Cosine Similarity**

Cosine similarity measures how similar two non-zero vectors in an n-dimensional space are by taking the cosine of the angle between them. The value ranges from -1 to 1: a value of 1 means the vectors point in the same direction (they need not have the same length), 0 means they are orthogonal (perpendicular), and -1 means they point in opposite directions. For vectors with only non-negative entries, such as word counts, the value stays between 0 and 1.
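Written out for two vectors $A$ and $B$ with $n$ components each (the symbols are chosen here for illustration), the standard definition is:

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Because the result is normalized by both vector lengths, only the direction of the vectors matters, not their magnitude.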

In the context of natural language processing, cosine similarity is often used to compare the similarity between two text documents represented as numerical vectors. Each document is typically represented as a bag-of-words (BoW) or term frequency-inverse document frequency (TF-IDF) vector, where each element in the vector represents the frequency or weight of a particular word in the document.

In [12]:
# Import the modules to be used
# Here using bag-of-words (BoW) extraction
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
#Example documents
data = {
    'document': [
        "I like cats and dogs",
        "I love dogs and birds"
    ]
}

In [14]:
#Create a DataFrame from the data
df = pd.DataFrame(data)

In [15]:
#Create CountVectorizer to convert text into numerical vectors
vectorizer = CountVectorizer()
# Learn the vocabulary and transform the documents into vectors
tf_matrix = vectorizer.fit_transform(df['document']).toarray()

In [16]:
#Calculate cosine similarity
cosine_sim = cosine_similarity(tf_matrix)

In [17]:
#Create DataFrame from the cosine similarity matrix
cosine_sim_df = pd.DataFrame(cosine_sim, columns=df.index, index=df.index)
cosine_sim_df

Unnamed: 0,0,1
0,1.0,0.5
1,0.5,1.0


In the result, the value 0.5 in row 0, column 1 means that document 0 ("I like cats and dogs") and document 1 ("I love dogs and birds") have a cosine similarity of 0.5. The diagonal is 1.0 because each document is compared with itself.
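The 0.5 can be checked by hand. The sketch below assumes the default `CountVectorizer` tokenization, which drops the single-character token "I" and orders the vocabulary alphabetically as `and, birds, cats, dogs, like, love`:

```python
import numpy as np

# Count vectors over the vocabulary: and, birds, cats, dogs, like, love
a = np.array([1, 0, 1, 1, 1, 0])  # "I like cats and dogs"
b = np.array([1, 1, 0, 1, 0, 1])  # "I love dogs and birds"

# Shared words "and" and "dogs" give a dot product of 2
dot = a @ b

# Each vector has four ones, so each norm is sqrt(4) = 2
similarity = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)
```

Two shared words out of four in each document give 2 / (2 × 2) = 0.5.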

Next, assign a category to each document and display it alongside the cosine similarity scores.

In [35]:
# Import the modules
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [36]:
# Example documents (categories are assigned in a later cell)
document_data = {
    'document': [
        "I like cats and dogs",
        "I love dogs and birds",
        "I really like rabbits",
        "I keep snakes at home"
    ]
}

# Create DataFrame from the data
df = pd.DataFrame(document_data)

In [37]:
# Categories and their corresponding keywords
categories = {
    'mammals': ['cats', 'dogs', 'rabbits'],
    'birds': ['birds'],
    'reptiles': ['snakes']
}

In [38]:
# Assign categories to documents
def assign_category(document):
    for category, keywords in categories.items():
        for keyword in keywords:
            if keyword in document:
                return category
    return None

df['category'] = df['document'].apply(assign_category)

In [39]:
# Create CountVectorizer to convert text into numerical vectors
vectorizer = CountVectorizer()
# Learn the vocabulary and transform the documents into vectors
tf_matrix = vectorizer.fit_transform(df['document']).toarray()

In [40]:
# Calculate cosine similarity
cosine_sim = cosine_similarity(tf_matrix)

In [41]:
# Choose the document index for which you want to get the cosine similarity
target_document_index = 0
target_document = df.iloc[target_document_index]['document']

# Create DataFrame for cosine similarity of the target document
cosine_sim_target_df = pd.DataFrame(cosine_sim[target_document_index], columns=['Cosine Similarity'])

# Add columns for document and category
cosine_sim_target_df['Document'] = df['document'].values
cosine_sim_target_df['Category'] = df['category'].values

# Reset index for better visualization
cosine_sim_target_df = cosine_sim_target_df.reset_index(drop=True)

# Display the DataFrame showing cosine similarity of the target document with categories
cosine_sim_target_df

Unnamed: 0,Cosine Similarity,Document,Category
0,1.0,I like cats and dogs,mammals
1,0.5,I love dogs and birds,mammals
2,0.288675,I really like rabbits,mammals
3,0.0,I keep snakes at home,reptiles
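Note that "I love dogs and birds" is labelled `mammals` only: `assign_category` returns at the first matching keyword, and `mammals` is checked before `birds`. A sketch of a variant that collects every matching category instead; `assign_all_categories` is a hypothetical helper, and splitting on whitespace is an assumption that suits these short, punctuation-free sentences:

```python
categories = {
    'mammals': ['cats', 'dogs', 'rabbits'],
    'birds': ['birds'],
    'reptiles': ['snakes']
}

def assign_all_categories(document):
    """Return every category that has at least one keyword among the document's words."""
    tokens = set(document.lower().split())
    return [
        category
        for category, keywords in categories.items()
        if tokens & set(keywords)  # non-empty intersection means a keyword matched
    ]

print(assign_all_categories("I love dogs and birds"))
print(assign_all_categories("I keep snakes at home"))
```

Whether a document should carry one label or several depends on how the categories are used afterwards; the single-label version is simpler when each document must land in exactly one group.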


Next, cosine similarity is applied per category to a dataset: each document receives a score for every category, computed by summing the weights of that category's keywords, and is assigned to the category with the highest total.

The dataset used here is the collection of comments about stray animals.

In [25]:
# Same modules as before
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
# Load the dataset to be processed (CSV format) from data_set.csv
df = pd.read_csv('data_set.csv')
df.head() # Show the first 5 rows

Unnamed: 0,text
0,"Spotted a cute stray cat today, looking for so..."
1,Saw a group of stray dogs playing together nea...
2,"A stray kitten crossed my path today, I couldn..."
3,Encountered a friendly stray dog during my wal...
4,It's heartbreaking to see so many stray animal...


In [27]:
# Categories
category_dict = {
    'Mammals': ['Lion', 'Elephant', 'Tiger', 'Panda', 'Cat', 'Dog', 'Kitten', 'Puppy', 'Rabbit', 'Hamster', 'Ferret', 'Raccoon', 'Squirrel', 'Fox', 'Chinchilla', 'Guinea pig', 'Horse', 'Possum'],
    'Birds': ['Bird', 'Pigeon'],
    'Reptiles': ['Iguana', 'Snake'],
    'Amphibians': ['Frog'],
    'Other': ['Goat', 'Geese', 'Hedgehog', 'Turtle']
}

In [28]:
# Create CountVectorizer to convert text into numerical vectors
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(df['text'])

# To use TF-IDF extraction instead:
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer()
# tfidf_matrix = vectorizer.fit_transform(df['text'])

In [26]:
# Calculate the pairwise cosine similarity between documents
# (both rows and columns of this matrix are document indices)
similarity_matrix = cosine_similarity(bow_matrix, dense_output=True)

# Assign each document the category with the highest summed score
df['category'] = ''
for i, row in df.iterrows():
    max_sim = 0
    category_document = ''
    for category, keywords in category_dict.items():
        # vocabulary_ maps lowercase tokens to column indices; a keyword that is
        # not found falls back to index 0
        keyword_similarity = sum(similarity_matrix[i, vectorizer.vocabulary_.get(keyword, 0)] for keyword in keywords)
        if keyword_similarity > max_sim:
            max_sim = keyword_similarity
            category_document = category
    df.at[i, 'category'] = category_document

# Create a score dataframe list
result = pd.DataFrame(columns=category_dict.keys())
for i, row in df.iterrows():
    similarity_scores = []
    for category, keywords in category_dict.items():
        keyword_similarity = sum(similarity_matrix[i, vectorizer.vocabulary_.get(keyword, 0)] for keyword in keywords)
        similarity_scores.append(keyword_similarity)
    result.loc[i] = similarity_scores

# Merge the result DataFrame with the original DataFrame
result_df = pd.concat([df, result], axis=1)
result_df

Unnamed: 0,text,category,Mammals,Birds,Reptiles,Amphibians,Other
0,"Spotted a cute stray cat today, looking for so...",Mammals,18.000000,2.000000,2.000000,1.000000,4.000000
1,Saw a group of stray dogs playing together nea...,Mammals,1.505236,0.167248,0.167248,0.083624,0.334497
2,"A stray kitten crossed my path today, I couldn...",Mammals,4.700097,0.522233,0.522233,0.261116,1.044466
3,Encountered a friendly stray dog during my wal...,Mammals,3.133398,0.348155,0.348155,0.174078,0.696311
4,It's heartbreaking to see so many stray animal...,Mammals,4.351444,0.483494,0.483494,0.241747,0.966988
...,...,...,...,...,...,...,...
95,A stray possum has taken up residence in a tre...,Mammals,1.450481,0.161165,0.161165,0.080582,0.322329
96,Visited a sanctuary that provides a safe haven...,Mammals,3.133398,0.348155,0.348155,0.174078,0.696311
97,A local organization is conducting a vaccinati...,Mammals,3.272727,0.363636,0.363636,0.181818,0.727273
98,"Encountered a stray turtle while hiking, helpe...",Mammals,1.636364,0.181818,0.181818,0.090909,0.363636


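Note: at this stage every category keyword receives the same weight. The keywords are capitalized while `vectorizer.vocabulary_` holds lowercase tokens, so `.get(keyword, 0)` always returns 0, and each category score equals the number of keywords in that category multiplied by `similarity_matrix[i, 0]`. Mammals has the longest keyword list (18 entries), so it gets the highest score whenever that value is positive.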
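In the cell above, `similarity_matrix` compares documents with each other, not documents with categories. One possible way to score documents against categories directly is to turn each keyword list into a pseudo-document, transform it with the same fitted vectorizer, and compare the two matrices. This is a sketch under stated assumptions, not the notebook's method: the `sample` comments are invented, the keyword lists are shortened and lowercased, and `category_docs`, `category_matrix` and `scores` are names introduced here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical comments standing in for data_set.csv
sample = pd.DataFrame({'text': [
    'Spotted a cute stray cat today',
    'A pigeon and another bird sat on the fence',
    'Found a snake near the pond',
    'So many stray animals need help',
]})

# Shortened, lowercase keyword lists (assumption: tokens must match exactly)
category_dict = {
    'Mammals': ['cat', 'dog', 'kitten'],
    'Birds': ['bird', 'pigeon'],
    'Reptiles': ['iguana', 'snake'],
}

vectorizer = CountVectorizer()
doc_matrix = vectorizer.fit_transform(sample['text'])

# Represent each category as a pseudo-document made of its keywords,
# encoded with the vocabulary learned from the comments
category_docs = [' '.join(keywords) for keywords in category_dict.values()]
category_matrix = vectorizer.transform(category_docs)

# Rows are documents, columns are categories
scores = pd.DataFrame(
    cosine_similarity(doc_matrix, category_matrix),
    columns=list(category_dict.keys()),
    index=sample.index,
)

# Pick the best-scoring category; rows with no keyword hit become 'Unknown'
best = scores.idxmax(axis=1)
sample['category'] = best.where(scores.max(axis=1) > 0, 'Unknown')
print(sample)
```

Because the match is on exact tokens, plurals such as "cats" would need their own keywords or a stemming step, and multi-word keywords such as "guinea pig" are split into separate tokens by the default vectorizer.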