### **Text Processing**

I think it would be useful to do some **text processing** to clean our dataset; hopefully our model will perform better as a result.
What I'm going to do is:

1. Lowercasing - **our textbook advises against this for sentiment analysis**
2. Removing Punctuation
3. Removing stopwords
4. Lemmatization (an important step, I think) - **I used spaCy, because NLTK's lemmatizer needs POS tags**
5. Removing special characters
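
Because step 1 is debatable for sentiment analysis, lowercasing can be kept behind a switch. The snippet below is a minimal sketch, not the notebook's actual cleaning code: `strip_special` and its `lowercase` parameter are hypothetical names, and the character class is widened to `a-zA-Z` so that capitals survive when lowercasing is turned off.

```python
import re

def strip_special(text, lowercase=True):
    """Drop everything except letters, digits, whitespace, '!' and '?'.

    `lowercase` is a hypothetical switch, not part of the notebook's code.
    """
    text = str(text)
    if lowercase:
        text = text.lower()
    # A-Z is included so upper-case letters survive when lowercase=False.
    return re.sub(r'[^a-zA-Z0-9\s!?]', '', text)

print(strip_special("The WORST book, ever!"))                   # the worst book ever!
print(strip_special("The WORST book, ever!", lowercase=False))  # The WORST book ever!
```

With `lowercase=False`, emphasis such as `WORST` is preserved, but a lowercase-only stopword list would then miss capitalised stopwords such as `The`.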



#### **1 - Libraries**
- For today I'll use NLTK for the stopword list; lemmatization is done with spaCy further down, because NLTK's lemmatizer needs POS tags

In [7]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [9]:
# Download NLTK resources (run only once)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\molna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\molna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### **2 - Quick look at our dataset**

- The dataset isn't too bad; in the past I have worked on much dirtier datasets, with some rows full of emoticons
- We have 3 columns:
  - *Rating:* from 1 to 5
  - *Title of the review:* a string of text
  - *Review:* a string of text
 

In [66]:
df = pd.read_csv("amazon_review_full_csv/train.csv", delimiter=';', encoding='latin1')  
# HOW MANY ROWS AND COLUMNS?
print("Dataset shape:", df.shape)
# LET'S SEE SOME ROWS
df.head(25)

Dataset shape: (1048575, 3)


Unnamed: 0,Rating,Title,Review
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...
5,5,There's a reason for the price,"There's a reason this CD is so expensive, even..."
6,1,Buyer beware,"This is a self-published book, and if you want..."
7,4,"Errors, but great story",I was a dissapointed to see errors on the back...
8,1,The Worst!,A complete waste of time. Typographical errors...
9,1,Oh please,I guess you have to be a romance novel lover f...


#### **3 - Cleaning Text**
- I've created a list of words that I think are useful for our analysis, so they are kept even though NLTK lists them as stopwords
- Converted everything to lowercase
- Removed punctuation, keeping only ! and ? (at most one occurrence in a row)
- Removed stopwords

In [82]:
# LOAD THE STOPWORDS (FROM NLTK)
stop_words = set(stopwords.words('english'))

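# Words that NLTK treats as stopwords but that I don't want to remove.
# Straight apostrophes are used so the contractions match NLTK's spelling.
# Note: cleaning1 strips apostrophes before filtering, so in practice
# contractions reach the filter as "dont", "isnt", ... and are kept anyway.
important_words = {
    "not", "no", "nor", "never", "none", "nobody", "nothing", "neither", "nowhere",
    "don't", "doesn't", "didn't", "can't", "couldn't", "won't", "wouldn't",
    "shouldn't", "wasn't", "weren't", "isn't", "aren't", "hasn't", "haven't", "hadn't",
    "cannot", "without", "hardly", "barely", "rarely", "scarcely"
}
stop_words = stop_words - important_words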

def cleaning1(text):
    # Convert all the text to lowercase (optional)
    text = str(text).lower()

    # Collapse repeated ! or ? into a single occurrence
    text = re.sub(r'[!]{2,}', '!', text)  # OK !!! --> ok !
    text = re.sub(r'[?]{2,}', '?', text)  # Are you ok??? --> are you ok?

    # Remove every character that is not a letter, digit, whitespace, ! or ?
    # This gets rid of emoticons, URL punctuation, etc. (TODO: revisit later)
    text = re.sub(r'[^a-z0-9\s!?]', '', text)  
    
    # Tokenize and remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    
    return ' '.join(words)

df['Title'] = df['Title'].apply(cleaning1)
df['Review'] = df['Review'].apply(cleaning1)

df.head(25)

Unnamed: 0,Rating,Title,Review
0,3,like funchuck,gave dad gag gift directing nunsense got reall...
1,5,inspiring,hope lot people hear cd need strong positive v...
2,5,best soundtrack ever anything,im reading lot reviews saying best game soundt...
3,4,chrono cross ost,music yasunori misuda without question close s...
4,5,good true,probably greatest soundtrack history usually b...
5,5,theres reason price,theres reason cd expensive even version thats ...
6,1,buyer beware,selfpublished book want know whyread paragraph...
7,4,errors great story,dissapointed see errors back cover since paid ...
8,1,worst,complete waste time typographical errors poor ...
9,1,oh please,guess romance novel lover one not discerning o...
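
To see what each step of `cleaning1` does, it helps to trace one review through the pipeline. The sketch below repeats the same regular expressions but records the intermediate strings; `cleaning_steps`, `DEMO_STOP_WORDS` and the sample sentence are made up for illustration, and the tiny stopword set only stands in for NLTK's list.

```python
import re

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's list).
DEMO_STOP_WORDS = {"i", "this", "it", "is", "a", "the", "so"}

def cleaning_steps(text):
    """Return the intermediate string after each cleaning step."""
    steps = {}
    text = str(text).lower()
    steps["lowercase"] = text
    # Collapse runs of ! or ? into a single character.
    text = re.sub(r'[!]{2,}', '!', text)
    text = re.sub(r'[?]{2,}', '?', text)
    steps["collapse"] = text
    # Drop everything except letters, digits, whitespace, ! and ?.
    text = re.sub(r'[^a-z0-9\s!?]', '', text)
    steps["strip"] = text
    # Whitespace tokenization followed by stopword filtering.
    text = ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)
    steps["stopwords"] = text
    return steps

for name, value in cleaning_steps("I don't like this CD!!! Is it a joke??").items():
    print(f"{name:>10}: {value}")
```

The last two lines printed are `i dont like this cd! is it a joke?` and `dont like cd! joke?`. Two things stand out: the apostrophe is removed before the stopword filter runs, so `don't` arrives as `dont`; and `!` or `?` stay glued to the preceding word (`cd!`, `joke?`), so a stopword written as `it!` would not be filtered out.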


#### **4 - Lemmatization with spaCy**

To run this you need to install spaCy. You can find more information about it on the web or in Professor Ceni's notebook.
I preferred it over NLTK because NLTK's lemmatizer also needs POS tagging.

- I ran the lemmatizer only on the first 1000 rows, just to see how long it takes (very little)
- Created a second dataframe to inspect the results
- My machine: **Ryzen 7 5700G (8 cores, 3.8 GHz), RX 6600 6GB, 32GB DDR4 RAM**
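
A rough way to turn the 1000-row trial into an estimate for the whole dataset is to measure the time per row and scale it up. The snippet below is only a sketch: `time_per_row` is a hypothetical helper, `fake_lemmatize` is a stand-in so the code runs without spaCy, and the sample rows are invented. In the notebook, the real lemmatization function and a real column slice would take their place.

```python
import time

def time_per_row(func, texts):
    """Apply func to every text and return (results, seconds per row)."""
    start = time.perf_counter()
    results = [func(t) for t in texts]
    elapsed = time.perf_counter() - start
    return results, elapsed / max(len(texts), 1)

# Stand-in for the lemmatizer (assumption: replace with the real function).
def fake_lemmatize(text):
    return text.lower()

sample = ["Great soundtrack", "Buyer beware"] * 500  # 1000 hypothetical rows
results, per_row = time_per_row(fake_lemmatize, sample)

full_rows = 1_048_575  # row count reported by df.shape earlier
print(f"about {per_row * full_rows:.1f} s estimated for one full column")
```

The extrapolation is linear and ignores that reviews differ in length, so it should be read as an order-of-magnitude figure for each text column rather than a precise prediction.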

In [85]:
import spacy
nlp = spacy.load("en_core_web_sm")

# LEMMATIZATION FUNCTION

def lemmatize_text(text):
    # Run the spaCy pipeline, then join the lemma of every token
    doc = nlp(str(text))
    return " ".join([token.lemma_ for token in doc])

# ONLY ON THE FIRST 1000 ROWS
df_subset = df.iloc[:1000].copy()  

df_subset['Title'] = df_subset['Title'].apply(lemmatize_text)
df_subset['Review'] = df_subset['Review'].apply(lemmatize_text)

print(df_subset[['Title', 'Review']].head(25))



                                               Title  \
0                                      like funchuck   
1                                            inspire   
2                      good soundtrack ever anything   
3                                   chrono cross ost   
4                                          good true   
5                               there s reason price   
6                                       buyer beware   
7                                  error great story   
8                                              worst   
9                                          oh please   
10                               awful beyond belief   
11                      romantic zen baseball comedy   
12                           low leg comfort 12 hour   
13                                delivery long wait   
14                size recomende size chart not real   
15                                          overbury   
16                      another abysmal digital 