<a href="https://colab.research.google.com/github/amarfadil/Scrapping-Review-Pre-Processing-Training-Classifier-and-data-visualization-CryptoApp/blob/main/2_ML_w_Split_8020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install required libraries:

• nltk: Natural Language Toolkit library for NLP processing.

• scikit-learn: Machine learning library for building ML models.

• sklearn-features: Library for feature engineering in scikit-learn.

• imblearn: Library for handling imbalanced datasets.
 
• scipy: Library for scientific computing.

In [None]:
!pip install -U nltk
!pip install -U scikit-learn
!pip install sklearn-features
!pip install imblearn
!pip install scipy

## Import necessary modules and libraries:

• string: Provides a collection of string constants and utilities for string operations.

• Pipeline from sklearn.pipeline: A class for building a pipeline of data processing steps and model training.

• pandas: A library for data manipulation and analysis.

• numpy: A library for numerical computing.

• Set seed for reproducibility: A seed is set to ensure that random operations are reproducible.
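
The `seed` value only has an effect once it is passed to the code that draws random numbers. A minimal sketch, assuming NumPy's `default_rng` generator is used purely for illustration (the notebook does not show where `seed` is consumed):

```python
import numpy as np

seed = 1234  # same value as the one assigned in the import cell

# Two generators created from the same seed produce the same sequence
first = np.random.default_rng(seed).permutation(10)
second = np.random.default_rng(seed).permutation(10)

print(first)
print((first == second).all())
```

The same idea applies to scikit-learn, where many functions accept a `random_state` argument, for example `train_test_split(..., random_state=seed)`.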

In [None]:
import string
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
seed = 1234

In [None]:
data = pd.read_csv('Clasificiation Review.csv', sep = ',', encoding ='utf-8')

The **'casefolding'** function performs case folding on each comment in the 'comment' column of the DataFrame: it converts the comment to lowercase, strips leading and trailing spaces, and removes the punctuation characters listed in the regular expression. Non-string values are returned unchanged.

In [None]:
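# Case folding step
import re

def casefolding(comment):
    # Non-string values (e.g. NaN) are passed through unchanged
    if isinstance(comment, str):
        comment = comment.lower()
        comment = comment.strip(" ")
        # Remove the listed punctuation; hyphens and apostrophes are kept
        comment = re.sub(r'[?|$.!²_:"()*+,]', '', comment)
    return comment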

data['comment'] = data['comment'].apply(casefolding)
data.head()


Unnamed: 0,Class,comment
0,RC,the app is very glitchy for over two hours it ...
1,RC,watch this app they want help you when they lo...
2,RG,it’s very easy to use and i really love maki...
3,RG,love using the trust wallet easy to download a...
4,RG,i recently got a new phone downloaded the app ...
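
To see the effect of these steps on a single string, here is a small standalone sketch. The function name `casefold_example` and the sample text are made up for illustration; the character class removes the same characters as the pattern in the cell above.

```python
import re

def casefold_example(text):
    # Lowercase, strip surrounding spaces, then drop the listed punctuation
    text = text.lower().strip(" ")
    return re.sub(r'[?|$.!²_:"()*+,]', '', text)

sample = "  Great App!! Fast (and safe), user-friendly.  "
cleaned = casefold_example(sample)
print(cleaned)  # great app fast and safe user-friendly
```

Note that the hyphen in "user-friendly" survives, because `-` is not among the removed characters.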


The **'token'** function tokenizes each comment in the 'comment' column of the DataFrame: it splits the comment on spaces, drops any empty tokens, and returns the remaining words as a list.

In [None]:
# Tokenizing step
def token(comments):
    # Missing values arrive as float NaN; convert them to a string first
    if isinstance(comments, float):
        comments = str(comments)
    # Split on single spaces and drop the empty strings left by repeated spaces
    return [word for word in comments.split(' ') if word != '']

data['comment'] = data['comment'].apply(token)
data.head()


Unnamed: 0,Class,comment
0,RC,"[the, app, is, very, glitchy, for, over, two, ..."
1,RC,"[watch, this, app, they, want, help, you, when..."
2,RG,"[it’s, very, easy, to, use, and, i, really, ..."
3,RG,"[love, using, the, trust, wallet, easy, to, do..."
4,RG,"[i, recently, got, a, new, phone, downloaded, ..."
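
As a quick check of the tokenizing logic on one string, the following sketch splits a sample sentence that contains repeated spaces. The helper name `split_tokens` and the sample are illustrative only.

```python
def split_tokens(text):
    # Split on single spaces and discard the empty strings
    return [word for word in str(text).split(' ') if word != '']

sample = "love  using the   trust wallet"
tokens = split_tokens(sample)
print(tokens)  # ['love', 'using', 'the', 'trust', 'wallet']
```

Calling `str.split()` with no argument would give the same result here and would also treat tabs and newlines as separators.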


The **'stopword_removal'** function removes stop words, i.e. common words such as "the" and "and", from each tokenized comment in the 'comment' column of the DataFrame. It uses the NLTK library to obtain the English stop-word list.

In [None]:
# Filtering step (stop-word removal)
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def stopword_removal(comments):
    # Build a set for fast membership tests
    filtering = set(stopwords.words('english'))
    # Keep only the tokens that are not English stop words
    return [word for word in comments if word not in filtering]

data['comment'] = data['comment'].apply(stopword_removal)
data.tail()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Class,comment
496,RC,"[redownloading, app, displays, passcode, scree..."
497,IR,[rich]
498,RC,"[able, move, bitcoin, wallet, effortlessly, iâ..."
499,RG,"[updated, newest, version, shows, tokens, much..."
500,RC,"[every, trade, transfer, iv, doesn’t, even, ..."
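
The filtering step can be tried without downloading any corpus. In this sketch a tiny hand-written stop list stands in for NLTK's English list, which is much longer; `mini_stopwords` and `drop_stopwords` are hypothetical names.

```python
# Hypothetical mini stop list; NLTK's English list is much longer
mini_stopwords = {"the", "and", "is", "to", "a", "i"}

def drop_stopwords(tokens, stop_words):
    # Keep the tokens that do not appear in the stop list
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "app", "is", "easy", "to", "use", "and", "fast"]
kept = drop_stopwords(tokens, mini_stopwords)
print(kept)  # ['app', 'easy', 'use', 'fast']
```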


We define a function to do stemming, which reduces words to their base form. For example, "walking" and "walked" would both be reduced to "walk".

In [None]:
# Stemming step
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Create the Sastrawi stemmer once and reuse it for every comment
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming(comments):
  # Stem each word in the token list
  do = [stemmer.stem(w) for w in comments]
  # Join the stemmed words back into a single string
  d_clean = " ".join(do)
  print(d_clean)
  return d_clean

data.to_csv ('data_clean.csv', index = False)
data_clean = pd.read_csv('data_clean.csv', encoding ='latin1')
data_clean.head()
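
Sastrawi is a stemmer for Indonesian, while the reviews shown in this notebook are in English. As a hedged alternative, and not something the notebook itself uses, the sketch below applies NLTK's `PorterStemmer` to the "walking"/"walked" example from the text.

```python
from nltk.stem import PorterStemmer

# Porter stemming is rule-based and needs no corpus download
porter = PorterStemmer()

words = ["walking", "walked", "walks"]
stems = [porter.stem(w) for w in words]
print(stems)  # ['walk', 'walk', 'walk']
```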

In [None]:
data_clean = data_clean.astype ({'Class' : 'category'})
data_clean = data_clean.astype ({'comment' : 'string'})
data_clean.dtypes

Class      category
comment      string
dtype: object

In [None]:
# TF-IDF step
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
text_tf = tf.fit_transform(data_clean['comment'].astype('U'))
text_tf

<501x1999 sparse matrix of type '<class 'numpy.float64'>'
	with 8195 stored elements in Compressed Sparse Row format>
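
To make the TF-IDF weights less opaque, here is a small pure-Python sketch of the weighting that scikit-learn documents as its default: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The whitespace tokenizer, the function name `tfidf_rows` and the two-document corpus are simplifications for illustration, so the numbers are not meant to match the matrix above.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(["easy app", "app crash"])
print(rows[0])
```

In the first document, "easy" receives a larger weight than "app" because "app" occurs in both documents and therefore has a lower IDF.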

In [None]:
# TF-IDF features for K-Fold cross-validation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Count word occurrences, then reweight the counts with TF-IDF
tvec = CountVectorizer()
X_cVec = tvec.fit_transform(data_clean['comment'].values.astype('U'))
print(X_cVec)
h_tfidf = TfidfTransformer()
x_tfidf = h_tfidf.fit_transform(X_cVec)
print(x_tfidf)
X = data_clean.comment
Y = data_clean.Class

  (0, 109)	1
  (0, 748)	1
  (0, 1828)	1
  (0, 829)	1
  (0, 1506)	2
  (0, 166)	2
  (0, 1992)	2
  (0, 719)	1
  (0, 50)	1
  (0, 255)	1
  (0, 1170)	1
  (0, 562)	1
  (0, 966)	1
  (0, 1919)	1
  (1, 109)	1
  (1, 1914)	1
  (1, 1906)	1
  (1, 801)	1
  (1, 1013)	1
  (1, 1091)	1
  (1, 1031)	1
  (1, 399)	1
  (1, 1704)	1
  (1, 1813)	1
  (1, 1903)	2
  :	:
  (499, 1784)	1
  (499, 1105)	1
  (499, 1859)	1
  (499, 1782)	1
  (499, 1168)	1
  (499, 0)	1
  (500, 109)	1
  (500, 1091)	1
  (500, 1793)	1
  (500, 751)	1
  (500, 161)	1
  (500, 736)	1
  (500, 599)	1
  (500, 1662)	1
  (500, 601)	2
  (500, 1769)	1
  (500, 824)	1
  (500, 1598)	1
  (500, 515)	1
  (500, 299)	1
  (500, 1788)	1
  (500, 1154)	1
  (500, 1731)	1
  (500, 387)	1
  (500, 920)	1
  (0, 1992)	0.4137706959819401
  (0, 1919)	0.15092939582931644
  (0, 1828)	0.21173811873350112
  (0, 1506)	0.39723505696745515
  (0, 1170)	0.25214919725519747
  (0, 966)	0.2688547372574503
  (0, 829)	0.1950325787312483
  (0, 748)	0.2688547372574503
  (0, 719)	0.154621500

In [None]:
# Split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(text_tf, data_clean['Class'], test_size = 0.2, random_state = 42)

print("Shape of X_train : ", X_train.shape)
print("Shape of y_train : ", y_train.shape)
print("Shape of X_test : ", X_test.shape)
print("Shape of y_test : ", y_test.shape)



Shape of X_train :  (400, 1999)
Shape of y_train :  (400,)
Shape of X_test :  (101, 1999)
Shape of y_test :  (101,)
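
The 400/101 split follows from how a fractional `test_size` is turned into a sample count: the test share is rounded up and the remainder goes to training. A short arithmetic sketch, using the 501 rows reported for the TF-IDF matrix:

```python
import math

n_samples = 501   # rows in the TF-IDF matrix
test_size = 0.2

# A fractional test_size is rounded up to a whole number of samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 400 101
```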


In [None]:
# Algorithm performance with K-Fold cross-validation

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

kf = KFold(n_splits = 10)
X_array = x_tfidf.toarray()
def cross_val(estimator):
  acc = []
  pcs = []
  rec = []
  f1 = []

  for train_index, test_index in kf.split(X_array, Y) :
    X_train, X_test = X_array[train_index], X_array[test_index]
    # Positional indexing keeps the labels aligned with the rows of X_array
    y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]

    model = estimator.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc.append(accuracy_score(y_test, y_pred))
    pcs.append(precision_score(y_test, y_pred, average ='macro', zero_division=0))
    rec.append(recall_score(y_test, y_pred, average ='macro', zero_division=0))
    # Track the macro F1-score in its own list
    f1.append(f1_score(y_test, y_pred, average ='macro', zero_division=0))


    print(classification_report(y_test, y_pred, zero_division = 0))
    print(f'confusion_matrix\n {confusion_matrix(y_test, y_pred)}')
    print('==================================================\n')
  
  print(f'average accuracy : {np.mean(acc)}')
  print(f'average precision : {np.mean(pcs)}')
  print(f'average recall : {np.mean(rec)}')
  print(f'average F1-score : {np.mean(f1)}')

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
cross_val(nb)

              precision    recall  f1-score   support

          IR       0.00      0.00      0.00         3
          RC       0.67      0.87      0.75        30
          RG       0.67      0.44      0.53        18

    accuracy                           0.67        51
   macro avg       0.44      0.44      0.43        51
weighted avg       0.63      0.67      0.63        51

confusion_matrix
 [[ 0  3  0]
 [ 0 26  4]
 [ 0 10  8]]

              precision    recall  f1-score   support

          IR       0.00      0.00      0.00         4
          RC       0.77      0.97      0.86        31
          RG       0.73      0.53      0.62        15

    accuracy                           0.76        50
   macro avg       0.50      0.50      0.49        50
weighted avg       0.70      0.76      0.72        50

confusion_matrix
 [[ 0  2  2]
 [ 0 30  1]
 [ 0  7  8]]

              precision    recall  f1-score   support

          IR       0.00      0.00      0.00         9
          RC     
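
The macro averages in each report can be reproduced by hand from the confusion matrix. The sketch below does so for the first fold shown above, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the order IR, RC, RG), and also shows how 501 samples are divided into 10 unshuffled folds. The helper `safe_div` is a hypothetical name.

```python
# Confusion matrix of the first fold (rows: true IR, RC, RG; columns: predicted)
cm = [[0, 3, 0],
      [0, 26, 4],
      [0, 10, 8]]

def safe_div(num, den):
    # Mirrors zero_division=0: an undefined ratio counts as 0
    return num / den if den else 0.0

n = len(cm)
precision = [safe_div(cm[k][k], sum(cm[r][k] for r in range(n))) for k in range(n)]
recall = [safe_div(cm[k][k], sum(cm[k])) for k in range(n)]
accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))

macro_precision = round(sum(precision) / n, 2)
macro_recall = round(sum(recall) / n, 2)
print(macro_precision, macro_recall, round(accuracy, 2))  # 0.44 0.44 0.67

# Fold sizes for 501 samples and 10 splits: the remainder goes to the first folds
n_samples, n_splits = 501, 10
fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
              for i in range(n_splits)]
print(fold_sizes)
```

Because no IR sample is ever predicted, the IR precision is undefined and counted as 0, which pulls the macro averages well below the weighted ones.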

In [None]:
!pip install joblib
import joblib
from sklearn.naive_bayes import MultinomialNB

In [None]:
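# Fit the classifier on the training split before saving it,
# so that the stored model can make predictions once it is reloaded
kF_classifier = MultinomialNB()
kF_classifier.fit(X_train, y_train)
joblib.dump(kF_classifier, 'kF_classifier.joblib')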

['kF_classifier.joblib']

In [None]:
import pandas as pd
import joblib
from sklearn.naive_bayes import MultinomialNB

# Load the classifier saved above (same file name as in joblib.dump) and the new data
nb_classifier = joblib.load('kF_classifier.joblib')
new_data = pd.read_csv('DataTesting.csv', sep=',', encoding='utf-8')

# Transform the new data using the same vectorizer that was used for training
new_text_tf = tf.transform(new_data['comment'].astype('U'))

# Make predictions on the new data
new_predictions = nb_classifier.predict(new_text_tf)

# Create a new dataframe with the predictions
df_predictions = pd.DataFrame({'comment': new_data['comment'], 'predictions': new_predictions})

# Save the dataframe to a new CSV file
df_predictions.to_csv('NewPredictions.csv', index=False)
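
Since predictions are only valid when new text goes through the same fitted vectorizer as the training data, one option is to store the vectorizer and the classifier together. The sketch below is an assumption-laden illustration, not the notebook's method: it wraps both steps in a scikit-learn `Pipeline`, uses a made-up four-review training set that borrows the RC and RG class names, and writes to a temporary file called `review_pipeline.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only, not the notebook's reviews)
train_texts = ["app crashes every time", "crash after update",
               "easy to use wallet", "love this easy app"]
train_labels = ["RC", "RC", "RG", "RG"]

# Vectorizer and classifier are fitted, saved and restored as one object
model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(train_texts, train_labels)

path = os.path.join(tempfile.mkdtemp(), "review_pipeline.joblib")
joblib.dump(model, path)

restored = joblib.load(path)
new_texts = ["wallet is easy", "crash again"]
new_predictions = restored.predict(new_texts)
print(list(new_predictions))
```

With this layout the raw strings are passed straight to `predict`, so there is no separate `transform` call that could be run with the wrong vectorizer.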




---






In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Load the data
df = pd.read_csv('data_clean.csv', sep = ',', encoding ='utf-8')


codes = {
    'Security': ['secure', 'safe', 'protection', 'privacy', 'hack'],
    'Ease of Use': ['easy', 'user-friendly', 'intuitive', 'simple', 'convenient'],
    'Customer Support': ['helpful', 'support', 'issue', 'problem', 'fix'],
    'Features': ['advanced', 'exchange', 'transaction', 'fee', 'customizable'],
    'Reliability': ['reliable', 'stable', 'consistent', 'uptime', 'crash']
}

# Flag each comment that contains any of the theme's keywords (substring match)
for theme, keywords in codes.items():
    df[theme] = df['comment'].apply(lambda x: any(keyword in x for keyword in keywords))

# Collect the matching comments for each theme
themes = {}
for theme in codes.keys():
    themes[theme] = df[df[theme]]['comment'].tolist()

# Review and refine themes
# Here we could manually review the themes and refine or combine them as necessary.
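
One thing worth checking during that review is that `keyword in x` is a substring test on the comment string, so a short keyword such as "fee" also matches "feel". The sketch below contrasts it with a whole-word match built from a regular expression; the helper names and the sample review are hypothetical, and the codebook is a reduced copy of the one above.

```python
import re

# Reduced copy of the codebook, for illustration
codes = {'Features': ['fee', 'exchange'], 'Customer Support': ['fix', 'support']}

def matches_substring(text, keywords):
    # Same test as in the cell above: True if any keyword occurs anywhere
    return any(k in text for k in keywords)

def matches_whole_word(text, keywords):
    # True only if a keyword occurs as a complete word
    return any(re.search(r'\b' + re.escape(k) + r'\b', text) for k in keywords)

review = "i feel the app is great"
print(matches_substring(review, codes['Features']))   # True
print(matches_whole_word(review, codes['Features']))  # False
```

Whether whole-word matching is preferable depends on the codebook: it avoids false hits like "feel", but it would also stop "crash" from matching "crashes".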
