### Feature Extraction using TF-IDF vectorization
- using scikit-learn's TfidfVectorizer

In [1]:
pip show scikit-learn

Name: scikit-learn
Version: 1.6.1
Summary: A set of python modules for machine learning and data mining
Home-page: https://scikit-learn.org
Author: 
Author-email: 
License: BSD 3-Clause License

In [2]:
# importing required modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# defining the path to the dataset
dataset_path = r'D:/JantaKoAwaj-FYP/jka-ml-model/dataset/preprocessed_data.pkl'

# reading the dataset saved in pickle format
df = pd.read_pickle(dataset_path)
print(df.head())


                  Brief Description of the grievance  \
0  वडा नं. १६ ढुङ्गेधाराबाट नर्सरी चोक जाने बाटो ...   
1  लाजिम्पाट फ्रेन्च एम्बेसि बसेको अपोजिट आगन स्व...   
2  विशालनगर मिलोट मार्गमा रहको स्कुलले म्युजिक बन...   
3  एयरपोर्ट होटेल अगाडि टर्निङ्गको गल्लिमा ठेलागा...   
4  भृकुटिमण्डप भित्र जुत्ता मेला नजिक सुनिता खाजा...   

                                        cleaned_text  \
0  वडा नं. १६ ढुङ्गेधाराबाट नर्सरी चोक जाने बाटो ...   
1  लाजिम्पाट फ्रेन्च एम्बेसि बसेको अपोजिट आगन स्व...   
2  विशालनगर मिलोट मार्गमा रहको स्कुलले म्युजिक बन...   
3  एयरपोर्ट होटेल अगाडि टर्निङ्गको गल्लिमा ठेलागा...   
4  भृकुटिमण्डप भित्र जुत्ता मेला नजिक सुनिता खाजा...   

                             stopword_removed_tokens    Label  
0  [वडा, नं, ., १६, ढुङ्गेधाराबाट, नर्सरी, चोक, ज...  genuine  
1  [लाजिम्पाट, फ्रेन्च, एम्बेसि, बसेको, अपोजिट, आ...  genuine  
2  [विशालनगर, मिलोट, मार्गमा, रहको, स्कुलले, म्यु...  genuine  
3  [एयरपोर्ट, होटेल, अगाडि, टर्निङ्गको, गल्लिमा, ...  genuine  
4  [भृ

In [4]:
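# initializing the TfidfVectorizer; max_features=5000 keeps only the 5000 terms
# with the highest term frequency across the corpus
vectorizer = TfidfVectorizer(max_features=5000)
# learning the vocabulary and IDF weights, then transforming the text into a TF-IDF matrix
X_tfidf = vectorizer.fit_transform(df['cleaned_text'])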
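One thing worth checking for Nepali text: with its default `token_pattern` (`(?u)\b\w\w+\b`), `TfidfVectorizer` tokenizes with Python's `re` module, where `\w` does not match Devanagari vowel signs or the virama, so words can be cut at those characters. The sketch below uses only `re` to compare the default pattern with a whitespace-based pattern. The `\S+` pattern and the sample phrase are assumptions chosen for illustration; the notebook itself uses the default.

```python
import re

# scikit-learn's documented default token_pattern for TfidfVectorizer
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every run of non-whitespace as one token
WHITESPACE_PATTERN = r"\S+"

# Short sample phrase in Devanagari (illustrative only)
sample = "नर्सरी चोक जाने बाटो"

# Tokens produced by the default pattern: vowel signs and the virama
# are not word characters for `re`, so whole words are not preserved
default_tokens = re.findall(DEFAULT_PATTERN, sample)

# Tokens produced by splitting on whitespace only
whitespace_tokens = re.findall(WHITESPACE_PATTERN, sample)

print("default pattern   :", default_tokens)
print("whitespace pattern:", whitespace_tokens)
```

If the default pattern turns out to fragment the words in `cleaned_text`, passing `token_pattern=r"\S+"` to `TfidfVectorizer` is one possible option. It keeps punctuation attached to words, and it would change the learned vocabulary and therefore the resulting matrix.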

In [5]:
print("\nTfidfVectorizer Output")
print("Shape of TfidfVectorizer matrix (documents x features):", X_tfidf.shape)
print("Type of matrix:", type(X_tfidf))  # a SciPy sparse matrix (CSR)


TfidfVectorizer Output
Shape of TfidfVectorizer matrix (documents x features): (7975, 5000)
Type of matrix: <class 'scipy.sparse._csr.csr_matrix'>
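Each row of this matrix is one document and each column one vocabulary term. To make the stored values concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The toy tokenised documents are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised (not from the real dataset)
docs = [
    ["road", "water", "road"],
    ["water", "light"],
    ["road", "light", "light"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency, as in scikit-learn's default settings
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one document."""
    term_freq = Counter(tokens)
    raw = [term_freq[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that do not occur in a document get a weight of exactly zero, which is why the real 7975 x 5000 matrix is stored in a sparse format.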


In [6]:
import scipy.sparse
import joblib
# Save the TF-IDF matrix in .npz format.
# The matrix is large but mostly zeros, so it is kept as a sparse matrix:
# save_npz stores only the non-zero values and their positions, which saves disk space.
scipy.sparse.save_npz(r'D:/JantaKoAwaj-FYP/jka-ml-model/dataset/features/tfidf_features.npz', X_tfidf)

# Save the vectorizer
joblib.dump(vectorizer, r'D:/JantaKoAwaj-FYP/jka-ml-model/dataset/features/tfidf_vectorizer.pkl')

# Mapping the labels to binary values:
# 0 for not_genuine and 1 for genuine (any other label would become NaN)
df['labeled'] = df['Label'].map({'not_genuine':0, 'genuine':1})
# Save the binary label column
df['labeled'].to_csv(r'D:/JantaKoAwaj-FYP/jka-ml-model/dataset/features/labeled_data.csv', index=False)

print("Saved the TF-IDF features and vectorizer successfully.")


Saved the TF-IDF features and vectorizer successfully.
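A later notebook would presumably load these artifacts back before training. The following is a minimal round-trip sketch under stated assumptions: it uses a hypothetical toy English corpus and a temporary directory instead of the project paths, and shows `scipy.sparse.load_npz` and `joblib.load` as the counterparts of the save calls above.

```python
import tempfile
from pathlib import Path

import joblib
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['cleaned_text']
corpus = ["road repair needed", "street light broken", "road light missing"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

with tempfile.TemporaryDirectory() as tmp:
    # Temporary file names chosen for illustration only
    features_path = str(Path(tmp) / "tfidf_features.npz")
    vectorizer_path = str(Path(tmp) / "tfidf_vectorizer.pkl")

    # Save the sparse matrix and the fitted vectorizer
    scipy.sparse.save_npz(features_path, X)
    joblib.dump(vectorizer, vectorizer_path)

    # Load both artifacts back
    X_loaded = scipy.sparse.load_npz(features_path)
    vectorizer_loaded = joblib.load(vectorizer_path)

# The reloaded vectorizer reuses the learned vocabulary and IDF weights,
# so new text is mapped into the same feature space
X_new = vectorizer_loaded.transform(["broken road"])

# True when the reloaded matrix has no element that differs from the original
same_matrix = (X != X_loaded).nnz == 0

print(X_loaded.shape, X_new.shape, same_matrix)
```

New text should go through `transform` on the reloaded vectorizer rather than `fit_transform`, so that its columns line up with the saved feature matrix.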
