# **Textual Data Preprocessing**


# **Install Necessary Libraries**

---



In [1]:
# Install NLTK (Natural Language Toolkit), a library for natural language processing tasks,
# and Hugging Face Transformers, which provides the BERT tokenizer used later
!pip install nltk
!pip install transformers




# **Import Necessary Libraries**

---



In [2]:
import json
import random

# Import the regex module
import re

import pandas as pd

# Import BeautifulSoup to remove HTML tags
from bs4 import BeautifulSoup

# Import NLTK and download necessary resources
import nltk
nltk.download('punkt')

# Download the stopwords list
nltk.download('stopwords')
# Import the stopwords module after downloading it
from nltk.corpus import stopwords

# Import string library for removing punctuation marks and special characters
import string

# Import NLP Stemming and Lemmatization Libraries
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

from transformers import BertTokenizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Mount to Google Drive**

---



In [3]:
# Mount (connect to) Google Drive so the data files can be read from it
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Read JSON Dataset from Google Drive


In [4]:
# Define the file path
data_file_path = '/content/drive/My Drive/cryptodata/bitcointalk.json'

# Read the dataset
with open(data_file_path, 'r') as file:
    dataset_bitcointalk = json.load(file)

In [5]:
print(dataset_bitcointalk[:5])


[{'thread_id': 0, 'date': 1498867765000, 'text': 'if you wanna have a better security, i would recommend to have a multi signature wallet (the address with prefix of 3 instead of 1). in case of possibility of being hacked, they need certainly large amount of computing power which consumes all the possible energy in the Earth as there are more than googol (IIRC) of possible combinations and not all the private key is valid for bitcoin address.', 'post_id': 0}, {'thread_id': 0, 'date': 1498868180000, 'text': "Quote from: ayurvedicurea2growtaller on July 01, 2017, 12:04:49 AMcan they hack into our wallet. There are many professional hackers out there. What are the chances? Also what is the safest way to safeguard ourself?It is possible, 100%. That is if you don't have sufficient security measures. If you adopt good security measures, the chances are next to zero. Bitcoin don't have any vulnerability that allows anyone to guess your private key with a fair amount of computing power.If you 

# **Convert to DataFrame**

In [6]:
# Convert dataset to DataFrame
df = pd.DataFrame(dataset_bitcointalk)[['date', 'text']]

In [7]:
print(df[:5])


            date                                               text
0  1498867765000  if you wanna have a better security, i would r...
1  1498868180000  Quote from: ayurvedicurea2growtaller on July 0...
2  1498869531000  Quote from: lottery248 on July 01, 2017, 12:09...
3  1498869635000  Quote from: lottery248 on July 01, 2017, 12:09...
4  1498872766000  No, thats the good thing about Bitcoin, is tha...


# **Convert Date**

In [8]:
# Convert epoch timestamp to a date format
df['date'] = pd.to_datetime(df['date'], unit='ms')

In [9]:
print(df[:5])

                 date                                               text
0 2017-07-01 00:09:25  if you wanna have a better security, i would r...
1 2017-07-01 00:16:20  Quote from: ayurvedicurea2growtaller on July 0...
2 2017-07-01 00:38:51  Quote from: lottery248 on July 01, 2017, 12:09...
3 2017-07-01 00:40:35  Quote from: lottery248 on July 01, 2017, 12:09...
4 2017-07-01 01:32:46  No, thats the good thing about Bitcoin, is tha...
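
As a quick sanity check of the conversion, the first timestamp can also be converted with the standard library alone. This is a minimal sketch; the helper name `epoch_ms_to_utc` is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(epoch_ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# First 'date' value printed in the DataFrame above
first_post = epoch_ms_to_utc(1498867765000)
print(first_post.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-07-01 00:09:25
```

The result agrees with row 0 of the converted column, which is consistent with `pd.to_datetime(..., unit='ms')` interpreting the values as UTC.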


# **Random Sampling**

---



In [10]:
# Calculate the number of elements to sample (10% of the dataset)
sample_size = int(len(df) * 0.1)

# Perform random sampling
random_sample = df.sample(n=sample_size)




In [11]:
print(random_sample[:5])


                       date                                               text
1019776 2018-07-14 06:57:24  Quote from: Bytem3 on July 13, 2018, 08:42:31 ...
1028637 2018-07-23 20:31:20  Quote from: OSEIBOATENG on July 07, 2018, 12:5...
326890  2016-02-25 23:36:04  Quote from: QuestionAuthority on February 25, ...
494656  2018-03-06 06:29:34  Quote from: CryptoJoop on March 03, 2018, 04:1...
592901  2018-01-11 02:13:43  try ConnectJob ICO bounty, join only you like ...


Now, "random_sample" contains a random 10% of the rows of "df", the DataFrame built from "dataset_bitcointalk".
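
The `sample` call above is not seeded, so rerunning the notebook draws a different subset. If reproducibility matters, one option is to pass `random_state`. The sketch below uses a small hypothetical DataFrame (`toy_df`) in place of `df`, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df
toy_df = pd.DataFrame({"text": [f"post {i}" for i in range(100)]})

sample_size = int(len(toy_df) * 0.1)

# random_state fixes the seed, so both calls select the same rows
sample_a = toy_df.sample(n=sample_size, random_state=42)
sample_b = toy_df.sample(n=sample_size, random_state=42)

print(len(sample_a))              # 10
print(sample_a.equals(sample_b))  # True
```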


# **Lowercasing**

---



In [12]:
# Lowercase the 'text' column
random_sample['text'] = random_sample['text'].str.lower()



# **Removing Special Characters and Punctuation**

---


Remove special characters and punctuation marks that do not contribute to the sentiment analysis.


In [13]:

# Remove punctuation from the 'text' column
random_sample['text'] = random_sample['text'].str.replace(r'[^\w\s]', '', regex=True)

# Print the first few rows to verify
print(random_sample.head())


                       date                                               text
1019776 2018-07-14 06:57:24  quote from bytem3 on july 13 2018 084231 pmacc...
1028637 2018-07-23 20:31:20  quote from oseiboateng on july 07 2018 125602 ...
326890  2016-02-25 23:36:04  quote from questionauthority on february 25 20...
494656  2018-03-06 06:29:34  quote from cryptojoop on march 03 2018 041947 ...
592901  2018-01-11 02:13:43  try connectjob ico bounty join only you like t...


As the output shows, the text no longer contains special characters or punctuation. Note that `\w` also matches the underscore, so underscores are kept.
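
The effect of the pattern `[^\w\s]` is easiest to see on a single string. The sketch below uses a made-up sentence and also shows an alternative, which is an assumption and not what the notebook does: replacing punctuation with a space so that words separated only by punctuation are not glued together.

```python
import re

text = "it is possible, 100%.that is if you don't have a multi_sig wallet"

# Same pattern as the notebook: delete everything that is not a word character or whitespace
deleted = re.sub(r"[^\w\s]", "", text)
print(deleted)
# it is possible 100that is if you dont have a multi_sig wallet

# Alternative: replace punctuation with a space, then collapse runs of whitespace
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()
print(spaced)
# it is possible 100 that is if you don t have a multi_sig wallet
```

Neither variant is strictly better: deletion keeps contractions in one piece ("dont"), while replacement keeps neighbouring words apart ("100 that").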

# **Remove URLs**

---


In this step, we will determine whether there are any URLs present. If URLs are found, we will proceed to remove them; otherwise, we will move on to the next steps.

## Check URL Presence

In [14]:
# Check for URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
contains_urls = random_sample['text'].str.contains(url_pattern, regex=True)

# Check if any row contains a URL and print the corresponding message
if contains_urls.any():
    print("There is a URL in the text.")
else:
    print("There are no URLs in the text.")


There are no URLs in the text.


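The check finds no URLs. At this point that result is expected regardless of the raw data: the pattern requires the literal `://`, and the punctuation-removal step above has already deleted `:` and `/` from every post. To detect or remove URLs reliably, this check has to run before punctuation is stripped.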
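
The order of the cleaning steps matters for this check, which can be reproduced on one string. The sketch below uses a simplified URL pattern (`https?://\S+`) and a made-up post; both are assumptions for illustration, not the notebook's exact regex or data.

```python
import re

# Simplified URL pattern for illustration
URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"[^\w\s]")

raw = "check https://bitcointalk.org/index.php?topic=1 for details."

# Notebook order: punctuation first, URL check second
stripped = PUNCT_RE.sub("", raw)
print(stripped)                       # check httpsbitcointalkorgindexphptopic1 for details
print(bool(URL_RE.search(stripped)))  # False

# Alternative order: remove URLs first, then punctuation
no_urls = URL_RE.sub("", raw)
cleaned = " ".join(PUNCT_RE.sub("", no_urls).split())
print(cleaned)                        # check for details
```

With the notebook's order, the URL survives as one long pseudo-word; with the alternative order it is removed entirely.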





# **Remove Numbers**

---



## Check the Presence of Numbers

In [15]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = random_sample['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")


There are numbers in the text.


## Remove Numbers

In [16]:
# Remove numbers from the 'text' column
random_sample['text'] = random_sample['text'].str.replace(r'\d+', '', regex=True)

# Print the first few rows to verify
print(random_sample.head())


                       date                                               text
1019776 2018-07-14 06:57:24  quote from bytem on july    pmaccording to jam...
1028637 2018-07-23 20:31:20  quote from oseiboateng on july    pmi have bee...
326890  2016-02-25 23:36:04  quote from questionauthority on february    pm...
494656  2018-03-06 06:29:34  quote from cryptojoop on march    pma nice way...
592901  2018-01-11 02:13:43  try connectjob ico bounty join only you like t...


## Check the Presence of Numbers After Removal

In [17]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = random_sample['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")

There are no numbers in the text.
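
Deleting digits leaves runs of spaces behind, as in `july    pm` in the output above. A minimal sketch of a follow-up normalization, which is not part of the notebook, collapses those runs with a second substitution:

```python
import re

text = "quote from bytem3 on july 13 2018 084231 pmaccording"

# Same pattern as the notebook
without_digits = re.sub(r"\d+", "", text)
print(repr(without_digits))  # 'quote from bytem on july    pmaccording'

# Collapse consecutive whitespace into one space and trim the ends
normalized = re.sub(r"\s+", " ", without_digits).strip()
print(normalized)  # quote from bytem on july pmaccording
```

On the DataFrame, the equivalent would be `random_sample['text'].str.replace(r'\s+', ' ', regex=True).str.strip()`. The extra spaces do not affect the BERT tokenizer used in the next step, since it splits on whitespace, so this is mainly useful if the cleaned text is inspected or stored.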



# **Tokenization & Numerical Representation for BERT**


---
In this step, we prepare the textual data for use with the BERT (Bidirectional Encoder Representations from Transformers) model. The preparation has two components: tokenization, which splits text into WordPiece tokens, and numerical representation, which maps each token to an integer ID. BERT-based models require this input format to produce meaningful contextual embeddings. The cells below perform the tokenization part only.



In [18]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [19]:
# Tokenize the 'text' column directly
random_sample['text'] = random_sample['text'].apply(lambda x: tokenizer.tokenize(x))

# Rename the 'text' column to 'tokens'
random_sample.rename(columns={'text': 'tokens'}, inplace=True)

# Print the first few rows to verify
print(random_sample.head())


                       date                                             tokens
1019776 2018-07-14 06:57:24  [quote, from, byte, ##m, on, july, pm, ##ac, #...
1028637 2018-07-23 20:31:20  [quote, from, os, ##ei, ##boat, ##eng, on, jul...
326890  2016-02-25 23:36:04  [quote, from, question, ##au, ##thor, ##ity, o...
494656  2018-03-06 06:29:34  [quote, from, crypt, ##oj, ##oop, on, march, p...
592901  2018-01-11 02:13:43  [try, connect, ##jo, ##b, ic, ##o, bounty, joi...
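
The `##` prefixes in the output mark continuation pieces produced by WordPiece tokenization. The sketch below illustrates the idea with a greedy longest-match-first split over a tiny hypothetical vocabulary (`toy_vocab`) and then maps the pieces to integer IDs. It is a simplified illustration, not the `transformers` implementation; the real `bert-base-uncased` vocabulary has roughly 30,000 entries and different IDs.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: token -> integer ID
toy_vocab = {"[UNK]": 0, "quote": 1, "from": 2, "byte": 3, "##m": 4,
             "crypt": 5, "##oj": 6, "##oop": 7}

tokens = [p for w in "quote from bytem".split() for p in wordpiece(w, toy_vocab)]
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['quote', 'from', 'byte', '##m']
print(ids)     # [1, 2, 3, 4]
```

With the real tokenizer, the corresponding ID lookup is `tokenizer.convert_tokens_to_ids(tokens)`.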


# **Remove Stopwords**

---



## Count the number of stopwords

In [20]:
# Load the English stopwords
stop_words = set(stopwords.words('english'))

# Count the total number of stopwords in the entire 'tokens' column using a generator expression
total_stopwords = sum(sum(1 for word in tokens if word in stop_words) for tokens in random_sample['tokens'])

# Print the total number of stopwords
print(f"Total number of stopwords in the sample: {total_stopwords}")


Total number of stopwords in the sample: 7243030


In our sample, 7243030 tokens are stopwords and need to be removed.

## Remove The Stop Words

In [23]:
# Remove stopwords from each tokenized text
random_sample['tokens'] = random_sample['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
# Rename the 'tokens' column to 'clean_tokens'
random_sample.rename(columns={'tokens': 'clean_tokens'}, inplace=True)


## The Number of Stopwords After Removal


In [24]:
# Load the English stopwords
stop_words = set(stopwords.words('english'))

# Count the total number of stopwords in the entire 'clean_tokens' column using a generator expression
total_stopwords = sum(sum(1 for word in tokens if word in stop_words) for tokens in random_sample['clean_tokens'])

# Print the total number of stopwords
print(f"Total number of stopwords in the sample: {total_stopwords}")


Total number of stopwords in the sample: 0


The count confirms that no stopwords remain in our sample.
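
The counting and filtering logic can be checked on one short token list. The sketch below uses a tiny hypothetical stopword set instead of the NLTK list. It also illustrates a side effect of the step order: an entry containing an apostrophe, such as "don't", cannot match a token whose punctuation was removed earlier.

```python
# Tiny hypothetical subset standing in for stopwords.words('english')
stop_words = {"from", "on", "the", "is", "to", "don't"}

tokens = ["quote", "from", "byte", "##m", "on", "july", "pm", "dont"]

# Count matches, then keep only the tokens that are not stopwords
n_stop = sum(1 for t in tokens if t in stop_words)
kept = [t for t in tokens if t not in stop_words]

print(n_stop)  # 2
print(kept)    # ['quote', 'byte', '##m', 'july', 'pm', 'dont']
```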

# **Stemming and Lemmatization**

---


<div style="text-align: justify">
In this step, we apply stemming and lemmatization, two standard Natural Language Processing (NLP) techniques that reduce words to a root form. Stemming strips suffixes by rule, so the result need not be a dictionary word; lemmatization maps a word to its dictionary form (lemma). Normalizing inflected variants in this way shrinks the vocabulary seen by later stages.
</div>

## Stemming

In [25]:
# Initialize a new stemmer
stemmer = PorterStemmer()

# Stem each word in the 'clean_tokens' column
random_sample['clean_tokens'] = random_sample['clean_tokens'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])


## Lemmatization

In [26]:
# Initialize a new lemmatizer
lemmatizer = WordNetLemmatizer()

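# Lemmatize each word in the 'clean_tokens' column
# (the tokens were already stemmed by the previous cell, so the lemmatizer sees stems, not the original words)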
random_sample['clean_tokens'] = random_sample['clean_tokens'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])


## Rename the Column

In [27]:
# Rename the 'clean_tokens' column to a desired name, for instance 'processed_tokens'
random_sample.rename(columns={'clean_tokens': 'processed_tokens'}, inplace=True)


In [28]:
# Print the first few rows to verify
print(random_sample.head())

                       date                                   processed_tokens
1019776 2018-07-14 06:57:24  [quot, byte, ##m, juli, pm, ##ac, ##cor, ##din...
1028637 2018-07-23 20:31:20  [quot, o, ##ei, ##boat, ##eng, juli, pm, ##i, ...
326890  2016-02-25 23:36:04  [quot, question, ##au, ##thor, ##iti, februari...
494656  2018-03-06 06:29:34  [quot, crypt, ##oj, ##oop, march, pm, ##a, nic...
592901  2018-01-11 02:13:43  [tri, connect, ##jo, ##b, ic, ##o, bounti, joi...
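
Because tokenization ran first, the stemmer and lemmatizer above operate on WordPiece pieces such as `##oj` and not on whole words. If whole-word normalization is wanted, one option is to rejoin the pieces before stemming. The helper below is a hypothetical sketch, not part of the notebook.

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece continuation pieces ('##...') with the preceding piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the ## prefix and append to the previous word
        else:
            words.append(tok)
    return words


merged = merge_wordpieces(["quote", "from", "crypt", "##oj", "##oop", "on", "march"])
print(merged)  # ['quote', 'from', 'cryptojoop', 'on', 'march']
```

This works best on the token list as produced by the tokenizer, before stopword removal can separate a piece from the word it continues.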


# **Save the Prepared Dataset**


In [29]:
# Define the path where you want to save the file on your Google Drive
path_to_save = "/content/drive/My Drive/Dissertation/processed_bitcointalk.csv"

# Save the DataFrame to a CSV file
random_sample.to_csv(path_to_save, index=False)
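
One thing to keep in mind when reloading this file: CSV has no list type, so the `processed_tokens` column is written as the string representation of each list. The sketch below shows the round trip on a small hypothetical DataFrame (`toy`), using an in-memory buffer in place of the Drive path, and restores the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for random_sample
toy = pd.DataFrame({"processed_tokens": [["quot", "byte", "##m"], ["tri", "connect"]]})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, "processed_tokens"]))  # <class 'str'>

# Parse each string back into a Python list
loaded["processed_tokens"] = loaded["processed_tokens"].apply(ast.literal_eval)
print(loaded.loc[0, "processed_tokens"])  # ['quot', 'byte', '##m']
```

Note also that `index=False` drops the original row labels of the sample, so they cannot be matched back to `df` from the saved file.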

