<a href="https://colab.research.google.com/github/anushka-dere/Project/blob/main/Final_Model_with_Tf_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [31]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)

In [32]:
df = pd.read_csv("Product_details.csv")

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6364 entries, 0 to 6363
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Text_ID              6364 non-null   int64 
 1   Product_Description  6364 non-null   object
 2   Product_Type         6364 non-null   int64 
 3   Sentiment            6364 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 199.0+ KB


In [34]:
df.Sentiment.value_counts()

2    3765
3    2089
1     399
0     111
Name: Sentiment, dtype: int64

In [35]:
df.duplicated().sum()

0

# `Natural Language Proccessing` :-
 In order to perform an analysis of text data, data preprocessing is first done to transform text into a data format that can be used in machine learning.

### 1.) Basic Cleaning and Tokenization

In [36]:
text = df.Product_Description[1].lower()
text

'rt @mention line for ipad 2 is longer today than yesterday. #sxsw  // are you getting in line again today just for fun?'

#### 1.1) Removing Special Character such as Punctuation

In [37]:
import re
text = re.sub(r'([^A-Za-z0-9|\s|[:punct:]]*)', '', text)
text

'rt mention line for ipad 2 is longer today than yesterday sxsw   are you getting in line again today just for fun'

In [38]:
# Repacing all character that are not letters (a-z or A-Z) or# with space character.
text = text.replace('[^a-zA-Z#]',' ').replace('quot','')
text

'rt mention line for ipad 2 is longer today than yesterday sxsw   are you getting in line again today just for fun'

In [39]:
# By seeing above and thinking logically we can say that words havinng character less than 3, it should be replaced.

text = ' '.join([i  for i in text.split() if len(i)>3])
text

'mention line ipad longer today than yesterday sxsw getting line again today just'

#### 1.2) Tokenization by spliting the string - transforming into a list of words.

In [40]:
text = text.split()
text

['mention',
 'line',
 'ipad',
 'longer',
 'today',
 'than',
 'yesterday',
 'sxsw',
 'getting',
 'line',
 'again',
 'today',
 'just']

### 2.) Lemitization and Stopwords:-

#### 2.1) Stopwords Removal:-

In [41]:
# Removing the stopword to reduce the dimensionality of the data down to only the words that conntain important information.
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words("english")

#Remove words related to the conference that appear accross all sentiments
# and terms specific to review plateform

sxsw = ['sxsw', 'sxswi', 'link', 'quot', 'rt', 'amp', 'mention', 'apple', 'google', 'iphone', 'ipad', 
        'ipad2', 'austin', 'today', 'quotroutearoundquot', 'rtmention', 'store', 'doesnt', 'theyll']

stopwords.extend(sxsw)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
text = [word for word in text if word not in stopwords]

text


['line', 'longer', 'yesterday', 'getting', 'line']

#### 2.2) Lemitization:-
To reduce each word to  its most basic form, such as systems to system.

In [43]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = nltk.stem.WordNetLemmatizer()

text = [lemmatizer.lemmatize(word) for word in text]
text

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


['line', 'longer', 'yesterday', 'getting', 'line']

In [44]:
# Put all the words  back together
text = ' '.join(text)
text

'line longer yesterday getting line'

### Applying all the above steps to whole Product_Description Column.

In [45]:
# Create a function that compiles all the steps taken above
def preprocess(text):
    """
    This function performs basic data cleaning, tokenization, lemmatization, and stopword removal on the input text.
    
    Args:
        text (str): The input text to be preprocessed.
        
    Returns:
        str: The preprocessed text.
    """
    
    # Convert text to lowercase
    text = text.apply(lambda x: x.lower())
    
    # Remove special characters and digits
    text = text.apply(lambda x: re.sub(r'([^A-Za-z|\s|[:punct:]]*)', '', x))
    
    # Replace certain characters and words with spaces
    text = text.apply(lambda x: x.replace('[^a-zA-Z#]', ' ').replace('quot', '').replace(':', '').replace('sxsw', ''))
    
    # Remove words that are shorter than 2 characters
    text = text.apply(lambda x: ' '.join([i for i in x.split() if len(i) > 1]))
    
    # Tokenize the text
    text = text.apply(lambda x: x.split())
    
    # Lemmatize the tokens
    text = text.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    
    # Remove stopwords
    text = text.apply(lambda x: [word for word in x if word not in stopwords])
    
    # Join the preprocessed tokens back into a single string
    text = text.apply(lambda x: ' '.join(x))
    
    return text


In [46]:
df["Product_Description"] = preprocess(df["Product_Description"])



In [47]:
df

Unnamed: 0,Text_ID,Product_Description,Product_Type,Sentiment
0,3057,web designer guide io android apps,9,2
1,6254,line longer yesterday getting line fun,9,2
2,8212,crazy opening temporary tomorrow handle rabid ...,9,2
3,4422,lesson one pas digital environment user want p...,9,2
4,5526,panel mom ha designing boomer,9,2
5,6064,think effing hubby line someone point towards ...,6,1
6,7713,android user user use option menu contextual menu,9,2
7,2975,wow interrupt regularly scheduled geek program...,9,3
8,818,launch new social network called circle possibly,9,2
9,1318,welcome enjoy ride anywhere dwnld groundlink a...,9,2


#### 2.3 Vectorization
Once the text data is cleaned, the last step is to convert it to vectors. A basic way to vectorize text data is using the `CountVectorizer` which counts the number of times each word appears.

# Merging 1:-
- 0+2------>0
- 1-------->1
- 3-------->2

In [48]:
E=pd.DataFrame(df.Sentiment.replace([2,3],[0,2]))
E.shape
print(E)

      Sentiment
0             0
1             0
2             0
3             0
4             0
5             1
6             0
7             2
8             0
9             0
10            0
11            2
12            0
13            0
14            0
15            1
16            0
17            0
18            2
19            0
20            1
21            0
22            0
23            2
24            2
25            2
26            0
27            0
28            2
29            2
30            0
31            1
32            0
33            0
34            2
35            0
36            0
37            0
38            0
39            0
40            0
41            0
42            0
43            0
44            2
45            2
46            0
47            0
48            0
49            2
50            2
51            0
52            2
53            1
54            0
55            0
56            1
57            2
58            0
59            0
60            0
61      

In [49]:
df_merged = pd.concat([df.drop(columns=["Sentiment"]), E], axis=1)

In [50]:
df_merged.shape

(6364, 4)

In [51]:
df_merged.Sentiment.value_counts()

0    3876
2    2089
1     399
Name: Sentiment, dtype: int64

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(df["Product_Description"]).toarray()

X = pd.DataFrame(X,columns = tfidf.get_feature_names())

print(X.shape)

X_feature = pd.DataFrame(X,df.Product_Type) # Feature 
Y = df_merged.Sentiment

(6364, 7517)


In [53]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier



# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_feature, Y, test_size=0.2, random_state=42)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a classifier on the resampled data
gb = XGBClassifier()
gb.fit(X_train_resampled, y_train_resampled)

# # Test the classifier on the original test set
y_pred = gb.predict(X_test)



In [54]:
from sklearn.metrics import classification_report

# ... # your performance metric
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.94      0.98      0.96       799
           1       0.26      0.24      0.25        80
           2       0.84      0.77      0.80       394

    accuracy                           0.87      1273
   macro avg       0.68      0.66      0.67      1273
weighted avg       0.86      0.87      0.87      1273



In [57]:
# saving trained model

In [55]:
import pickle

In [56]:
filename = "nlp_project.sav"
pickle.dump(tfidf, open(filename,'wb'))

In [54]:
# loading the save model

In [63]:
loded_model= pickle.load(open('nlp_project.sav','rb'))