# Numeric Digit Normalization

In text data, numbers can appear in many formats, which complicates analysis and modeling. For example, “two pizzas” and “2 pizzas” refer
to the same quantity but are written differently. Numeric digit normalization converts the different representations of a number within text 
into one standardized format. As a result, algorithms can treat those representations as equivalent, and the data becomes easier to analyze 
and compare consistently.

Common approaches for performing numeric digit normalization include:

Converting words to digits: This approach replaces numeric words with their corresponding digits. For example, “five” becomes “5.”
Representing numeric words consistently as digits makes them usable in calculations and comparisons.

Converting digits to words: This technique replaces numeric digits with words, which can improve text readability. For instance, “10” 
becomes “ten.”

Removing numeric separators: The digits of a number might be separated by commas, spaces, or other symbols. Removing these separators keeps 
numeric representations uniform. For example, “1,000” and “1000” are both normalized to “1000.”
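
The three approaches above can be illustrated with a minimal, standard-library-only sketch. The lookup table and the function names `words_to_digits`, `digits_to_words`, and `strip_commas` are hypothetical and cover only zero through ten; they are assumptions for illustration, not the library-based functions used later in this notebook.

```python
import re

# Hypothetical lookup table, deliberately limited to zero..ten.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}
DIGIT_TO_WORD = {digit: word for word, digit in WORD_TO_DIGIT.items()}

# Match any table word as a whole word, ignoring case.
_WORD_PATTERN = re.compile(r"\b(" + "|".join(WORD_TO_DIGIT) + r")\b", re.IGNORECASE)


def words_to_digits(text):
    # Replace each known number word with its digit string.
    return _WORD_PATTERN.sub(lambda m: WORD_TO_DIGIT[m.group().lower()], text)


def digits_to_words(text):
    # Replace each standalone digit run; numbers outside the table stay unchanged.
    return re.sub(r"\b\d+\b", lambda m: DIGIT_TO_WORD.get(m.group(), m.group()), text)


def strip_commas(text):
    # Remove a comma only when it sits directly between two digits.
    return re.sub(r"(?<=\d),(?=\d)", "", text)


print(words_to_digits("two pizzas"))   # 2 pizzas
print(digits_to_words("10 pizzas"))    # ten pizzas
print(strip_commas("1,000 pizzas"))    # 1000 pizzas
```

A table-driven approach like this cannot tell a quantity from other uses of a word: “no one” would become “no 1”. That is one reason to check normalized output on real data before relying on it.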

    

In [2]:
! pip install inflect

Collecting inflect
  Downloading inflect-7.4.0-py3-none-any.whl.metadata (21 kB)
Downloading inflect-7.4.0-py3-none-any.whl (34 kB)
Installing collected packages: inflect
Successfully installed inflect-7.4.0


In [4]:
! pip install word2number

Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py): started
  Building wheel for word2number (setup.py): finished with status 'done'
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5588 sha256=4b63e5218ebdd078e57828c2ab2e2ba16e8b43c98f93be2db374a53930222e35
  Stored in directory: c:\users\ariji\appdata\local\pip\cache\wheels\cd\ef\ae\073b491b14d25e2efafcffca9e16b2ee6d114ec5c643ba4f06
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


In [5]:
import pandas as pd
import spacy
import inflect
import re
from word2number import w2n

In [6]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [7]:
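# Load the small English spaCy pipeline (used here only for tokenization)
# and the inflect engine (used to spell out digits).
nlp = spacy.load("en_core_web_sm")
p = inflect.engine()


def convert_words_to_digits(text):
    """Replace single number words with digits, e.g. "five" -> "5"."""
    doc = nlp(text)
    converted_tokens = []
    for token in doc:
        if token.text.isdigit():
            # Already a digit string: keep it unchanged.
            converted_tokens.append(token.text)
        else:
            try:
                # word_to_num raises ValueError for non-number words.
                num = w2n.word_to_num(token.text)
                converted_tokens.append(str(num))
            except ValueError:
                converted_tokens.append(token.text)
    # Joining on spaces detaches punctuation and clitics ("I'm" -> "I 'm").
    converted_text = " ".join(converted_tokens)
    return converted_text


def replace_digit_with_word(match):
    """Spell out one matched digit run, e.g. "10" -> "ten"."""
    return p.number_to_words(match.group())


def convert_digits_to_words(text):
    """Replace every standalone run of digits with its word form."""
    converted_text = re.sub(r'\b\d+\b', replace_digit_with_word, text)
    return converted_text


def remove_numeric_separators(text):
    """Delete a single space or comma between two digits ("1,000" -> "1000")."""
    return re.sub(r'(?<=\d)[ ,](?=\d)', '', text)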
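
The lookaround pattern in `remove_numeric_separators` deletes any single space or comma that sits between two digits, so it also merges adjacent numbers that were never one value. The sketch below contrasts it with a stricter variant. `strip_thousands_commas` is a hypothetical helper, not part of the notebook, and it assumes that a comma is a thousands separator only when exactly three digits follow it.

```python
import re


def remove_numeric_separators(text):
    # Same pattern as the notebook cell above.
    return re.sub(r'(?<=\d)[ ,](?=\d)', '', text)


def strip_thousands_commas(text):
    # Hypothetical stricter variant: drop a comma only when exactly
    # three digits follow it, and leave spaces untouched.
    return re.sub(r'(?<=\d),(?=\d{3}\b)', '', text)


sample = "In 2020 5 users paid 1,250,000 for plans 3,4"
print(remove_numeric_separators(sample))
# In 20205 users paid 1250000 for plans 34
print(strip_thousands_commas(sample))
# In 2020 5 users paid 1250000 for plans 3,4
```

The stricter variant trades coverage for safety: it no longer normalizes space-separated groups such as “1 000”, which may or may not matter for a given dataset.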

In [8]:
df['review_normalized_words_to_digits'] = df['text'].apply(convert_words_to_digits)
df['review_normalized_digits_to_words'] = df['text'].apply(convert_digits_to_words)
df['review_normalized_no_separators'] = df['text'].apply(remove_numeric_separators)
print("Review normalized (words to digits):")
print(df['review_normalized_words_to_digits'])
print("\nReview normalized (digits to words):")
print(df['review_normalized_digits_to_words'])
print("\nReview normalized (no separators):")
print(df['review_normalized_no_separators'])

Review normalized (words to digits):
0     The software had a steep learning curve at fir...
1     I 'm really impressed with the user interface ...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I 've recommended the software to colleagues d...
10    The software integration with third - party pl...
11    I 'm looking forward to the upcoming release o...
12    The user community is active and supportive , ...
13    I 've been using the software for a while now ...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: review_normalized_words_to_digits, dtype: object

Review norm
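
Because `convert_words_to_digits` converts one token at a time, a multi-word number such as “twenty five” would come out as two separate numbers (“20 5”) rather than “25”. The following is a minimal sketch of one way to group consecutive number words before converting them. The tables and the names `phrase_to_int` and `convert_number_phrases` are hypothetical, and the sketch assumes English number words below one million, joined by single spaces or hyphens.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"hundred": 100, "thousand": 1000}
NUMBER_WORDS = {**UNITS, **TENS, **SCALES}

# A phrase is one number word followed by more number words,
# each joined by a single space or hyphen.
_WORD = "|".join(sorted(NUMBER_WORDS, key=len, reverse=True))
_PHRASE = re.compile(rf"\b(?:{_WORD})(?:[ -](?:{_WORD}))*\b", re.IGNORECASE)


def phrase_to_int(words):
    # Accumulate units and tens, multiply on "hundred",
    # and flush the running value into the total on "thousand".
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand"
            total += max(current, 1) * 1000
            current = 0
    return total + current


def convert_number_phrases(text):
    def _replace(match):
        words = re.split(r"[ -]", match.group(0).lower())
        return str(phrase_to_int(words))
    return _PHRASE.sub(_replace, text)


print(convert_number_phrases("twenty five users"))          # 25 users
print(convert_number_phrases("two thousand five hundred"))  # 2500
```

This grouping is a simplification: a digit-by-digit sequence such as “one two” would be summed to “3”, so spelled-out codes or phone numbers would need separate handling.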

# Handling Contractions

Handling contractions is a text preprocessing technique that expands contractions into their full forms. A contraction is a shortened 
form of a word or word pair, such as “don’t” for “do not,” “it’s” for “it is,” or “can’t” for “cannot.” Expanding contractions 
standardizes the text and makes it easier to identify individual words and their meanings.
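
To show the idea independently of any library, here is a minimal sketch of dictionary-based expansion. `CONTRACTION_MAP` and `expand_contractions_sketch` are hypothetical names, and the table holds only a handful of entries for illustration; a real contraction list is far larger.

```python
import re

# Hypothetical, deliberately tiny mapping; keys are lowercase.
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "I am",
    "i've": "I have",
    "it's": "it is",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_sketch(text):
    # Normalize typographic apostrophes so that "don’t" matches "don't".
    text = text.replace("\u2019", "'")

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep a leading capital, e.g. "Don't" -> "Do not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions_sketch("I'm sure it can't fail"))
# I am sure it cannot fail
```

A fixed mapping cannot resolve ambiguous forms: “it’s” may stand for “it is” or “it has”, and this sketch always picks the first.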

In [9]:
import pandas as pd
import contractions  # third-party package: pip install contractions

In [10]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [11]:
def expand_contractions(text):
    return contractions.fix(text)

In [12]:
df["text"] = df["text"].apply(expand_contractions)
print(df["text"])

0     The software had a steep learning curve at fir...
1     I am really impressed with the user interface ...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I have recommended the software to colleagues ...
10    The software integration with third-party plug...
11    I am looking forward to the upcoming release o...
12    The user community is active and supportive, m...
13    I have been using the software for a while now...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: text, dtype: object
