## Data Cleaning and Preprocessing Notebook

This notebook is strictly for data cleaning and preprocessing. Steps:

1. Read the dataset.
2. Handle missing values (if any).
3. Produce visualizations as required.
4. Explore the data.
5. Save the cleaned and processed dataset as `data/final_dataset.csv`.
6. Split the dataset obtained in step 5 into `input/train.csv`, `input/test.csv`, and `input/validation.csv`.

NO MODELLING WILL BE DONE IN THIS NOTEBOOK!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

In [2]:
data=pd.read_csv("../data/data.csv")
data

Unnamed: 0,id,content
0,321712,Hey 👋 \n\nWe re using our bot:\n\nhttps://t.me...
1,321713,Good stuff \n\nI am surprised I took so long t...
2,321717,you are using a non-official one
3,321718,use the one that uniswap uses: https://thegrap...
4,321719,keep in mind this is a hot subgraph so it can ...
...,...,...
44131,374466,Can find it in many places\nAlso on Santiment:...
44132,374467,"guys, does anyone know if there is an applicat..."
44133,374468,Any Lobsters going to Kyiv Web3 Hackathon Sept...
44134,374469,whats funny is that no one complains about the...


In [3]:
data.iloc[0]["content"]

'Hey 👋 \n\nWe re using our bot:\n\nhttps://t.me/lobster_watcher\n\nAnd also filtering such recommendations to select only topics worth attention.\n\n~5 people are in duty every day.'

The first message already contains an emoji and a URL, so both have to be handled during cleaning. Judging from the preview, the overall tone is what one would expect from Telegram channels.
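To get a rough sense of how common URLs and emojis are, the messages can be flagged with two simple patterns. This is only a sketch: the three-row `sample` frame is a hypothetical stand-in for `data["content"]`, and both patterns are simplified assumptions, not the ones used for the actual cleaning.

```python
import pandas as pd

# Hypothetical sample standing in for data["content"].
sample = pd.DataFrame(
    {
        "content": [
            "Hey \U0001F44B see https://t.me/lobster_watcher",
            "you are using a non-official one",
            "Good stuff \U0001F680",
        ]
    }
)

# Simplified patterns (assumptions): a scheme-based URL check and one broad emoji range.
has_url = sample["content"].str.contains(r"https?://", regex=True)
has_emoji = sample["content"].str.contains("[\U0001F300-\U0001FAFF]", regex=True)

url_count = int(has_url.sum())
emoji_count = int(has_emoji.sum())
print(f"messages with a URL: {url_count}, with an emoji: {emoji_count}")
```

Running the same two checks on the full `content` column would show whether URL and emoji handling affects a handful of rows or a large share of the dataset.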

In [4]:
abb=pd.read_csv("../data/term_abb.csv")
abb

Unnamed: 0,terms,abbreviations
0,Auroracoin,AUR
1,BitConnect (inactive),BCC
2,Bitcoin Cash,BCH
3,Bitcoin,BTCorXBT
4,Dash,DASH
...,...,...
73,One Cancels the Other,OCO
74,Ask Me Anything,AMA
75,“Wrecked” (meaning major losses),REKT
76,The Onion Router (one who sends anonymous data),TOR


In [5]:
defn=pd.read_csv("../data/term_def.csv")
defn

Unnamed: 0,terms,definition1,definition2
0,51% attack,A hypothetical situation where more than half ...,
1,51% attack protection,A protection mechanism implemented by several ...,
2,AFK,Away From Keyboard; used on social media platf...,
3,Airdrop,An event where a blockchain project distribute...,
4,Altcoin,Any cryptocurrency that is an alternative to B...,
...,...,...,...
155,Vyper,A Python-like programming language for the Eth...,
156,Wallet (Cold),A wallet disconnected from the internet.,
157,Wallet (Hot),A wallet connected to the Internet.,
158,Wallet (Multisignature),A wallet that requires multiple digital signat...,


In [6]:
abb.replace({"BTCorXBT":"BTC"},inplace=True)
data.replace({"BTCorXBT":"BTC"},inplace=True)
defn.replace({"BTCorXBT":"BTC"},inplace=True)


The two files loaded above give us a glossary of crypto terms and abbreviations. Replacing each abbreviation with its full term (or the reverse) might help during vectorization.
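One possible replacement scheme is sketched below: every abbreviation is mapped to its full term, so that "btc" and "bitcoin" end up as the same token. The miniature `sample_abb` glossary and the `expand_abbreviations` helper are hypothetical and exist only for illustration; the notebook itself does not apply such a step.

```python
import re

import pandas as pd

# Hypothetical miniature glossary with the same columns as term_abb.csv.
sample_abb = pd.DataFrame(
    {
        "terms": ["bitcoin", "bitcoin cash", "ask me anything"],
        "abbreviations": ["btc", "bch", "ama"],
    }
)

# Lookup table: abbreviation -> full term.
abb_to_term = dict(zip(sample_abb["abbreviations"], sample_abb["terms"]))

# One alternation pattern with word boundaries, so "btc" matches as a whole
# word but not inside a longer token such as "wbtc".
abb_regex = re.compile(r"\b(" + "|".join(map(re.escape, abb_to_term)) + r")\b")


def expand_abbreviations(text):
    """Replace every known abbreviation in `text` with its full term."""
    return abb_regex.sub(lambda match: abb_to_term[match.group(1)], text)


expanded = expand_abbreviations("btc and bch ama tomorrow, not wbtc")
print(expanded)
```

Whether expanding abbreviations or collapsing terms into abbreviations works better depends on the vectorizer, so this direction is a design choice and not a recommendation.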

In [7]:
import re

In [8]:
emoji_patterns = re.compile("["
u"\U0001F600-\U0001F64F"  # emoticons
u"\U0001F300-\U0001F5FF"  # symbols & pictographs
u"\U0001F680-\U0001F6FF"  # transport & map symbols
u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
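u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows (not Chinese characters)
u"\U00002702-\U000027B0"  # dingbats
u"\U000024C2-\U0001F251"  # enclosed characters; very broad range that also covers CJK text
u"\U0001f926-\U0001f937"  # people and gesture emojis
u"\U00010000-\U0010ffff"  # all supplementary planes
u"\u2640-\u2642"  # gender symbols
u"\u2600-\u2B55"  # miscellaneous symbols
u"\u200d"  # zero-width joiner
u"\u23cf"  # eject symbol
u"\u23e9"  # fast-forward symbol
u"\u231a"  # watch
u"\ufe0f"  # variation selector-16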
u"\u3030"
                "]+", re.UNICODE)

url_pattern= re.compile(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")

In [9]:
from nltk.corpus import stopwords

In [10]:
stops=set(stopwords.words("english"))

In [11]:
stops

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [12]:
def remove_regex_pattern(regex_pattern,text,replacement=""):
    # substitute every match of the compiled pattern with `replacement`
    text=re.sub(regex_pattern,replacement,text)
    return text

def cleanup_sentences(text):
    text=text.strip()
    text=text.lower()
    text=text.replace("\n"," ")
    text=remove_regex_pattern(emoji_patterns,text) # remove emojis
    text=remove_regex_pattern(url_pattern,text) # remove urls
    text=remove_regex_pattern(re.compile(r"[^a-z0-9%\s']"),text) # keep only letters, digits, whitespace, % and apostrophes
    text=remove_regex_pattern(re.compile(r"\s\s+"),text,replacement=" ") # collapse runs of whitespace
    text=[word for word in text.split() if word not in stops] # stopword removal; split() also drops empty tokens
    text=" ".join(text)
    return text
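The effect of the cleaning steps is easiest to see on a single message. The sketch below is a simplified, self-contained stand-in for `cleanup_sentences`: `demo_stops`, `demo_url`, and `demo_cleanup` are hypothetical names, the stop-word set is a tiny hand-picked subset instead of the NLTK list, and the URL pattern is far simpler than `url_pattern`.

```python
import re

# Simplified stand-ins (assumptions), not the notebook's actual objects.
demo_stops = {"we", "re", "our", "are", "in"}
demo_url = re.compile(r"https?://\S+")
demo_non_text = re.compile(r"[^a-z0-9%\s']")


def demo_cleanup(text):
    """Lowercase, strip URLs and non-text characters, then drop stop words."""
    text = text.strip().lower().replace("\n", " ")
    text = demo_url.sub("", text)       # URLs go first, while "://" is still intact
    text = demo_non_text.sub("", text)  # removes emojis and punctuation as well
    return " ".join(word for word in text.split() if word not in demo_stops)


raw = (
    "Hey \U0001F44B \n\nWe re using our bot:\n\n"
    "https://t.me/lobster_watcher\n\n~5 people are in duty every day."
)
cleaned = demo_cleanup(raw)
print(cleaned)
```

The order matters: if the character filter ran before the URL removal, the punctuation inside the URL would be stripped and its remaining letters would stay in the text as ordinary words.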

In [13]:
data.dropna(axis=0,inplace=True)

In [14]:
data["cleantext"]=data["content"].apply(cleanup_sentences)
defn["terms"]=defn["terms"].apply(cleanup_sentences)
abb["terms"]=abb["terms"].apply(cleanup_sentences)
abb["abbreviations"]=abb["abbreviations"].apply(cleanup_sentences)

In [15]:
def remove_space_only(text):
    if text==" " or text=="":
        return np.nan
    return text

In [16]:
data["cleantext"]=data["cleantext"].apply(remove_space_only)

In [17]:
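data.dropna(axis=0,inplace=True)
defn.dropna(axis=0,subset=["terms"],inplace=True) # definition2 is empty in the preview above, so a plain dropna would discard those terms
abb.dropna(axis=0,inplace=True)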

In [18]:
def create_vocabulary(terms):
    text=""
    for i in terms:
        text+=" "+i
    text=text.split(" ")
    return text
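Because `create_vocabulary` starts from an empty string and puts a space in front of every term, the list it returns begins with an empty token. The sketch below reproduces that behaviour on a hypothetical three-term sample and compares it with an alternative, `create_vocabulary_no_blanks` (a name introduced here for illustration), that never produces empty tokens.

```python
def create_vocabulary(terms):
    """Same logic as the cell above: concatenate the terms, then split on spaces."""
    text = ""
    for term in terms:
        text += " " + term
    return text.split(" ")


def create_vocabulary_no_blanks(terms):
    """Hypothetical alternative: split each term on whitespace and flatten."""
    return [word for term in terms for word in term.split()]


# Hypothetical sample of already-cleaned terms.
sample_terms = ["bitcoin cash", "dash", "51% attack"]

original_tokens = create_vocabulary(sample_terms)
compact_tokens = create_vocabulary_no_blanks(sample_terms)

print(original_tokens)
print(compact_tokens)
```

Since the function is called once per glossary column, each call contributes its own empty token, which is worth keeping in mind when the combined list is cleaned up afterwards.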


In [19]:
crypto_vocab=abb["abbreviations"].tolist()
crypto_vocab.extend(create_vocabulary(abb["terms"]))
crypto_vocab.extend(create_vocabulary(defn["terms"]))

In [20]:
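crypto_vocab=[word for word in crypto_vocab if word] # drop every empty token; list.remove("") would only drop the first one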

In [21]:
crypto_vocab={word:index for index,word in enumerate(dict.fromkeys(crypto_vocab))} # de-duplicate first (order preserved) so indices are contiguous

In [22]:
import json

In [23]:
with open("../input/crypto_vocabulary.json", "w") as outfile:
    json.dump(crypto_vocab, outfile)

In [24]:
data.drop(columns=["content"]).to_csv("../input/cleaned_text.csv",index=False)
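Step 6 of the outline, the train/test/validation split, is not carried out in the cells above. A minimal sketch of one way to do it follows, assuming a plain random 80/10/10 split; the `split_dataset` helper, the fractions, the seed, and the `demo` frame are all assumptions and not choices made by this notebook.

```python
import pandas as pd


def split_dataset(frame, train_frac=0.8, validation_frac=0.1, seed=42):
    """Shuffle once, then cut the frame into train, validation, and test parts."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_end = int(len(shuffled) * train_frac)
    validation_end = train_end + int(len(shuffled) * validation_frac)
    train_part = shuffled.iloc[:train_end]
    validation_part = shuffled.iloc[train_end:validation_end]
    test_part = shuffled.iloc[validation_end:]  # whatever remains
    return train_part, validation_part, test_part


# Hypothetical stand-in for the cleaned dataset.
demo = pd.DataFrame(
    {"id": range(100), "cleantext": [f"message {i}" for i in range(100)]}
)
train, validation, test = split_dataset(demo)
print(len(train), len(validation), len(test))

# Writing the parts to the paths named in the outline would then look like:
# train.to_csv("../input/train.csv", index=False)
# validation.to_csv("../input/validation.csv", index=False)
# test.to_csv("../input/test.csv", index=False)
```

A fixed seed keeps the split reproducible across runs. If the messages are ordered in time and leakage between neighbouring messages is a concern, a chronological cut may be a better fit than a random shuffle.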