# Feature construction

Feature construction is a text-preprocessing technique that creates new features or representations of text data to improve the performance of a
machine-learning model. It works by combining or transforming existing features so that important information or patterns in the data become
explicit. For example, if we're working with a review dataset, we might create a new review_length feature that stores the length of each entry in
the review column, measured as a word or character count. We can then include this feature in the training data alongside the original columns.

Common categories of constructed features:

Sentiment analysis features: We can derive sentiment-based scores that capture how positive, negative, or neutral a text is, as well as its overall
polarity. For example, we can attach positive, negative, and neutral sentiment scores to each customer review to assess product sentiment
automatically.
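As a rough illustration of sentiment scores, the sketch below counts words from a tiny hand-made lexicon. The word lists and the `sentiment_features` function are hypothetical and exist only for this example; a real project would more likely rely on a curated lexicon or a trained sentiment model.

```python
import re

# Hypothetical mini-lexicon for illustration only (not from the dataset).
POSITIVE_WORDS = {"good", "great", "impressed", "love", "helpful", "excellent"}
NEGATIVE_WORDS = {"bad", "glitches", "slow", "confusing", "poor", "steep"}


def sentiment_features(text):
    """Return the share of positive, negative, and neutral words plus polarity."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0, "polarity": 0.0}
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    return {
        "positive": pos / total,
        "negative": neg / total,
        "neutral": (total - pos - neg) / total,
        # Polarity is the difference between the positive and negative shares.
        "polarity": (pos - neg) / total,
    }


features = sentiment_features("Great interface, but the reports are slow and confusing.")
print(features)
```

With a DataFrame like the one used later in this notebook, the same function could be applied row by row, for example with `df['text'].apply(sentiment_features)`.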

Length-based features: We can measure the length of text entries in terms of word count, character count, average word length, and sentence count.
For instance, in a customer-support application, word and character counts can help an organization allocate resources: longer, more complex
queries can be given more attention or escalated to experienced staff.
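The length measures listed above can be sketched in plain Python as follows. The `length_features` function is a hypothetical helper, and the sentence split is a deliberately simple heuristic (it treats any run of `.`, `!`, or `?` as a sentence boundary), so abbreviations and decimals would be miscounted.

```python
import re


def length_features(text):
    """Return character, word, and sentence counts plus average word length."""
    words = text.split()  # whitespace-separated tokens, punctuation included
    # Rough sentence split: one or more of . ! ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    avg_word_length = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {
        "char_count": len(text),
        "word_count": word_count,
        "avg_word_length": avg_word_length,
        "sentence_count": len(sentences),
    }


sample = "The update fixed several bugs. It runs faster now!"
features = length_features(sample)
print(features)
```

Because words are split on whitespace only, attached punctuation counts toward the average word length; stripping punctuation first would be a reasonable refinement.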

Text complexity features: We can measure text complexity using readability scores and vocabulary richness metrics. For example, computing these
features for a set of textbooks can help educators select grade-appropriate learning materials.
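One possible way to compute such features is sketched below: vocabulary richness as a type-token ratio (unique words divided by total words) and readability as the Flesch Reading Ease score. The helper names are hypothetical, and the syllable counter is a crude vowel-group heuristic, so the readability value should be read as an approximation.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def complexity_features(text):
    """Return the type-token ratio and an approximate Flesch Reading Ease score."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"type_token_ratio": 0.0, "flesch_reading_ease": None}
    unique_words = {w.lower() for w in words}
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: higher scores indicate easier text.
    flesch = (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
    return {
        "type_token_ratio": len(unique_words) / len(words),
        "flesch_reading_ease": flesch,
    }


simple = complexity_features("The cat sat. The cat ran.")
dense = complexity_features("Comprehensive documentation facilitates organizational understanding.")
print(simple)
print(dense)
```

The type-token ratio depends on text length (longer texts repeat more words), so it is most meaningful when comparing entries of similar size.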

Linguistic features: We can investigate linguistic composition by analyzing part-of-speech distributions and the prevalence of named entities. For
instance, we can create noun counts, verb counts, and named entity counts to help auto-categorize news articles into topics like politics, sports, 
and technology, enhancing content organization and reader accessibility.
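To make the counts concrete, the sketch below turns part-of-speech tags and entities into numeric features. It assumes the tokens have already been tagged; the `linguistic_features` helper is hypothetical, and the sample tags are written by hand for illustration rather than produced by a tagger. The tag names follow the Universal POS scheme that spaCy exposes through `token.pos_`.

```python
from collections import Counter


def linguistic_features(tagged_tokens, entities):
    """Count nouns, verbs, and named entities from pre-tagged input."""
    pos_counts = Counter(pos for _, pos in tagged_tokens)
    return {
        # Proper nouns are counted together with common nouns here.
        "noun_count": pos_counts["NOUN"] + pos_counts["PROPN"],
        "verb_count": pos_counts["VERB"],
        "entity_count": len(entities),
    }


# With spaCy, these inputs could be built roughly as:
#   doc = nlp(text)
#   tagged_tokens = [(token.text, token.pos_) for token in doc]
#   entities = [ent.text for ent in doc.ents]
# The values below are hand-written for illustration, not real tagger output.
tagged_tokens = [
    ("Apple", "PROPN"),
    ("released", "VERB"),
    ("a", "DET"),
    ("new", "ADJ"),
    ("phone", "NOUN"),
]
entities = ["Apple"]

features = linguistic_features(tagged_tokens, entities)
print(features)
```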

![image.png](attachment:392ceffc-2b90-4e36-bfb9-15fbef880682.png)


In [2]:
import pandas as pd

# Load the reviews dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [3]:
# Word count per review: split on whitespace and count the tokens
df['review_length'] = df['text'].str.split().str.len()
print(df)

   review_id                                               text  review_length
0     txt145  The software had a steep learning curve at fir...             20
1     txt327  I'm really impressed with the user interface o...             16
2     txt209  The latest update to the software fixed severa...             14
3     txt825  I encountered a few glitches while using the s...             20
4     txt878  I was skeptical about trying the software init...             19
5     txt933  The analytics features have provided us with v...             14
6     txt718  I appreciate the regular updates that the soft...             17
7     txt316  I attended a training session for the software...             18
8     txt247  The software documentation could be more compr...             14
9     txt515  I've recommended the software to colleagues du...             13
10    txt913  The software integration with third-party plug...             15
11    txt341  I'm looking forward to the upcoming re

In [7]:
! pip install spacy
! python -m spacy download en_core_web_sm



In [8]:
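# Named-entity extraction with spaCy
import spacy

# Small English pipeline; the model must be installed separately from spaCy
nlp = spacy.load('en_core_web_sm')

def extract_named_entities(text):
    """Return the text of every named entity spaCy finds in `text`."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

# One list of entity strings per review
df['named_entities'] = df['text'].apply(extract_named_entities)
print(df['named_entities'])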

0     [first]
1          []
2          []
3          []
4          []
5          []
6          []
7          []
8          []
9          []
10    [third]
11         []
12         []
13         []
14         []
15         []
Name: named_entities, dtype: object
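The only entities found above are "first" and "third". These are presumably ordinal numbers rather than names of people, products, or organizations, although the notebook does not print entity labels, so that reading is an assumption. If so, keeping each entity's label (available in spaCy as `ent.label_`) makes it possible to drop low-information labels before counting. The sketch below works on plain `(text, label)` pairs; `entity_features`, `IGNORED_LABELS`, and the sample pairs are hypothetical and hand-written for illustration.

```python
from collections import Counter

# Labels treated as uninformative in this sketch (an assumption; adjust per task).
IGNORED_LABELS = frozenset({"ORDINAL", "CARDINAL"})


def entity_features(labeled_entities, ignored=IGNORED_LABELS):
    """Filter (text, label) pairs and summarize what remains."""
    kept = [(text, label) for text, label in labeled_entities if label not in ignored]
    return {
        "entities": [text for text, _ in kept],
        "entity_count": len(kept),
        "label_counts": dict(Counter(label for _, label in kept)),
    }


# With spaCy, the pairs could be built roughly as:
#   labeled_entities = [(ent.text, ent.label_) for ent in nlp(text).ents]
# The pairs below are hand-written examples, not output from the dataset.
only_ordinal = entity_features([("first", "ORDINAL")])
mixed = entity_features([("Microsoft", "ORG"), ("third", "ORDINAL")])
print(only_ordinal)
print(mixed)
```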
