# This notebook performs data pre-processing

Data cleansing (or data cleaning) is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them.

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load the Data and Combine the Datasets into One

In [None]:
import pandas as pd

# Load the datasets
conflict_df = pd.read_json('/content/drive/MyDrive/dataset_latest/conflict.json')
kfc_df = pd.read_json('/content/drive/MyDrive/dataset_latest/kfc.json')
mcd_israel_df = pd.read_json('/content/drive/MyDrive/dataset_latest/mcd-israel.json')
mcdonald_df = pd.read_json('/content/drive/MyDrive/dataset_latest/mcdonald.json')
starbuck_df = pd.read_json('/content/drive/MyDrive/dataset_latest/starbuck.json')

# Combine the datasets into one
combined_df = pd.concat([conflict_df, kfc_df, mcd_israel_df, mcdonald_df, starbuck_df], ignore_index=True)
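
As a small illustration of what `pd.concat(..., ignore_index=True)` does, the sketch below uses two tiny in-memory frames instead of the JSON files. The frame names and values are made up for the example.

```python
import pandas as pd

# Hypothetical stand-ins for two of the loaded datasets
frame_a = pd.DataFrame({"id": [1, 2], "text": ["first", "second"]})
frame_b = pd.DataFrame({"id": [2, 3], "text": ["second", "third"]})

# ignore_index=True discards the per-file indexes and renumbers rows 0..n-1
combined = pd.concat([frame_a, frame_b], ignore_index=True)

print(combined.index.tolist())   # row labels after renumbering
print(combined["id"].tolist())   # id 2 appears twice: concat does not deduplicate
```

A tweet that matches more than one search term could appear in several files and would survive the concatenation, so rows with the same `id` still need to be dropped afterwards.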


In [None]:
# Display the column names
print(combined_df.columns)

Index(['type', 'id', 'url', 'twitterUrl', 'text', 'source', 'retweetCount',
       'replyCount', 'likeCount', 'quoteCount', 'viewCount', 'createdAt',
       'lang', 'bookmarkCount', 'isReply', 'inReplyToId', 'inReplyToUserId',
       'inReplyToUsername', 'author', 'extendedEntities', 'card', 'place',
       'entities', 'isRetweet', 'isQuote', 'media', 'isConversationControlled',
       'quoteId', 'quote'],
      dtype='object')


Perform Cleaning Process

*   Remove Duplicates
*   Remove Unwanted Characters
*   Normalize Text

In [None]:
import re

# Remove duplicate tweets based on the tweet ID (`id` column)
combined_df.drop_duplicates(subset='id', keep='first', inplace=True)

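def clean_text(text):
    """Strip URLs, mentions, '#' symbols, digits, and punctuation, then lowercase."""
    # Remove URLs (http..., https..., www...)
    text = re.sub(r"http\S+|www\S+", '', text)
    # Remove @mentions entirely; drop the '#' sign but keep the hashtag word
    text = re.sub(r'@\w+|#', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove anything that is not a word character or whitespace (underscores are kept)
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase the remaining text
    text = text.lower()
    return text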

# Apply the cleaning function to the combined dataset
combined_df['cleaned_text'] = combined_df['text'].apply(clean_text)
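
To see what each substitution leaves behind, the sketch below runs the same steps on a made-up tweet. Removing mentions, URLs, and digits leaves the surrounding spaces in place, so the result can contain leading and doubled spaces. `collapse_whitespace` is a hypothetical helper, not part of this notebook, showing one way to tidy that up if it matters downstream.

```python
import re

def clean_text(text):
    # Same steps as the cell above, repeated so this example runs on its own
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = re.sub(r'\@\w+|\#', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower()

def collapse_whitespace(text):
    # Hypothetical extra step (not in the notebook): squeeze runs of whitespace
    return " ".join(text.split())

sample = "@user Check https://t.co/abc #KFC is 100% great!"  # made-up tweet
cleaned = clean_text(sample)

print(repr(cleaned))                       # repr() makes the stray spaces visible
print(repr(collapse_whitespace(cleaned)))
```

The later tokenization step ignores extra whitespace, so this mostly affects how `cleaned_text` looks when it is previewed or saved.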


Preview the Combined Dataset

In [None]:
combined_df.head()

Unnamed: 0,type,id,url,twitterUrl,text,source,retweetCount,replyCount,likeCount,quoteCount,...,card,place,entities,isRetweet,isQuote,media,isConversationControlled,quoteId,quote,cleaned_text
0,tweet,1795243337038832128,https://x.com/Wren123222785/status/17952433370...,https://twitter.com/Wren123222785/status/17952...,@SunnyEdwards Britain created lots of conflict...,Twitter Web App,0,0,0,0,...,"{'rest_id': 'https://t.co/rMsrobTjW5', 'legacy...",{},"{'hashtags': [], 'symbols': [], 'timestamps': ...",False,False,[],False,,,britain created lots of conflict in the middl...
1,tweet,1795236229086110208,https://x.com/_WaXTeP/status/1795236229086110104,https://twitter.com/_WaXTeP/status/17952362290...,"Ireland, Norway and Spain admitted the Palesti...",Twitter for iPhone,0,1,0,0,...,{},{},"{'hashtags': [], 'media': [{'display_url': 'pi...",False,False,[https://pbs.twimg.com/media/GOn11WOXgAAALHo.jpg],False,,,ireland norway and spain admitted the palestin...
2,tweet,1795235160134164736,https://x.com/LevNitzan/status/179523516013416...,https://twitter.com/LevNitzan/status/179523516...,@osodimezz @aventurineology @Lai_core If you r...,Twitter Web App,0,1,0,0,...,{},{},"{'hashtags': [], 'symbols': [], 'timestamps': ...",False,False,[],False,,,if you really want to blame someone i sugge...
3,tweet,1795232029841273344,https://x.com/LesleyRHudson/status/17952320298...,https://twitter.com/LesleyRHudson/status/17952...,"@nspector4 Sounds nice, but at what cost to pu...",Twitter Web App,0,1,10,0,...,{},{},"{'hashtags': [], 'symbols': [], 'timestamps': ...",False,False,[],False,,,sounds nice but at what cost to public safety...
4,tweet,1795227659091214592,https://x.com/Garysparrow15/status/17952276590...,https://twitter.com/Garysparrow15/status/17952...,"@Sandyboots2020 @BladeoftheS They can, if they...",Twitter for Android,0,0,0,0,...,{},{},"{'hashtags': [], 'symbols': [], 'timestamps': ...",False,False,[],False,,,they can if they read the sun telegraph mail...
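
In row 1 of the preview, the `id` value ends in `110208` while the status number in the `url` column ends in `110104`. That difference is what rounding to a 64-bit float would produce, which suggests (as an assumption, not something verified in this notebook) that the IDs passed through a float at some point during loading. The sketch below checks the arithmetic and shows one possible workaround: recovering the ID as a string from the URL. `status_id_from_url` is a hypothetical helper.

```python
import re

url = "https://x.com/_WaXTeP/status/1795236229086110104"  # from row 1 above
id_in_frame = 1795236229086110208                          # `id` in the same row

# Round-tripping the URL's status number through a float gives the `id` value
rounded = int(float(1795236229086110104))
print(rounded == id_in_frame)

def status_id_from_url(tweet_url):
    # Hypothetical helper: take the digits after /status/ and keep them as text
    match = re.search(r"/status/(\d+)", tweet_url)
    return match.group(1) if match else None

print(status_id_from_url(url))
```

If exact IDs matter (for example, when de-duplicating on `id` or linking back to tweets), keeping them as strings avoids this kind of rounding. The helper assumes the `url` column holds full URLs; the preview only truncates them for display.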


Tokenization

Tokenization is the process of dividing text into a set of meaningful pieces, called tokens. Here, each token is a word.

In [None]:
from nltk.tokenize import word_tokenize
import nltk

# Download necessary NLTK data
nltk.download('punkt')

# Tokenize the cleaned text
combined_df['tokens'] = combined_df['cleaned_text'].apply(word_tokenize)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Remove Stop Words

Stop words are very common words (such as "the", "of", and "in") that carry little meaning on their own. Removing them is one of the most commonly used preprocessing steps across NLP applications.

In [None]:
from nltk.corpus import stopwords

# Download the stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Remove stop words from the tokens
combined_df['tokens'] = combined_df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
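
The filtering step can be illustrated without downloading anything. The sketch below uses a hypothetical three-word stop list in place of NLTK's English list, applied to the opening words of the first cleaned tweet.

```python
# Hypothetical three-word stop list; NLTK's English list is much longer
tiny_stop_words = {"of", "in", "the"}

# Opening words of the first cleaned tweet, split on whitespace
tokens = "britain created lots of conflict in the middle east".split()

# Keep only the tokens that are not in the stop list
kept = [word for word in tokens if word not in tiny_stop_words]
print(kept)
```

Storing the stop words in a `set`, as the cell above does, keeps each membership check fast even when the list is long.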


Apply Stemming/Lemmatization

Stemming and lemmatization are two text preprocessing techniques that reduce words to a base or root form. Stemming strips suffixes with heuristic rules and can produce non-words (for example, "created" becomes "creat"), while lemmatization maps each word to a dictionary form. Both reduce the number of unique words in a text, which makes it easier to analyze.

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Download WordNet data
nltk.download('wordnet')

# Apply stemming and lemmatization to the tokens
combined_df['stemmed_tokens'] = combined_df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
combined_df['lemmatized_tokens'] = combined_df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])


[nltk_data] Downloading package wordnet to /root/nltk_data...
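
As a quick check of what the stemmer does to individual words, the sketch below runs `PorterStemmer` on a few tokens that appear in this dataset. It assumes only that NLTK is installed; the Porter stemmer is rule-based and needs no corpus download.

```python
from nltk.stem import PorterStemmer  # rule-based, needs no corpus download

stemmer = PorterStemmer()

# Tokens taken from the tweets in this dataset
for word in ["created", "lots", "middle", "safety"]:
    print(word, "->", stemmer.stem(word))
```

The lemmatizer behaves differently on verbs: `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so a call like the one above would leave "created" unchanged while still reducing "lots" to "lot".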


Select Useful Columns

In [None]:
useful_columns = ['id', 'text', 'cleaned_text', 'tokens', 'stemmed_tokens', 'lemmatized_tokens', 'createdAt']
cleaned_combined_df = combined_df[useful_columns]

Preview the Final DataFrame Before Saving

In [None]:
cleaned_combined_df.head()

Unnamed: 0,id,text,cleaned_text,tokens,stemmed_tokens,lemmatized_tokens,createdAt
0,1795243337038832128,@SunnyEdwards Britain created lots of conflict...,britain created lots of conflict in the middl...,"[britain, created, lots, conflict, middle, eas...","[britain, creat, lot, conflict, middl, east, c...","[britain, created, lot, conflict, middle, east...",Mon May 27 23:59:00 +0000 2024
1,1795236229086110208,"Ireland, Norway and Spain admitted the Palesti...",ireland norway and spain admitted the palestin...,"[ireland, norway, spain, admitted, palestinian...","[ireland, norway, spain, admit, palestinian, s...","[ireland, norway, spain, admitted, palestinian...",Mon May 27 23:30:45 +0000 2024
2,1795235160134164736,@osodimezz @aventurineology @Lai_core If you r...,if you really want to blame someone i sugge...,"[really, want, blame, someone, suggest, look, ...","[realli, want, blame, someon, suggest, look, b...","[really, want, blame, someone, suggest, look, ...",Mon May 27 23:26:30 +0000 2024
3,1795232029841273344,"@nspector4 Sounds nice, but at what cost to pu...",sounds nice but at what cost to public safety...,"[sounds, nice, cost, public, safety, social, c...","[sound, nice, cost, public, safeti, social, co...","[sound, nice, cost, public, safety, social, co...",Mon May 27 23:14:04 +0000 2024
4,1795227659091214592,"@Sandyboots2020 @BladeoftheS They can, if they...",they can if they read the sun telegraph mail...,"[read, sun, telegraph, mail, times, watch, lit...","[read, sun, telegraph, mail, time, watch, lite...","[read, sun, telegraph, mail, time, watch, lite...",Mon May 27 22:56:42 +0000 2024


Save the Cleaned Dataset

In [None]:
cleaned_combined_df.to_csv('/content/drive/MyDrive/dataset_latest/cleaned_combined_data.csv', index=False)
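
One thing to keep in mind when reading this file back: CSV has no list type, so the `tokens`, `stemmed_tokens`, and `lemmatized_tokens` columns are written as text. The sketch below shows the round trip on a tiny made-up frame, using an in-memory buffer instead of Drive, and one way to turn the text back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Tiny made-up frame with a list-valued column, like `tokens`
frame = pd.DataFrame({"id": [1], "tokens": [["britain", "created"]]})

# Write to an in-memory buffer instead of Drive, then read it back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(type(reloaded.loc[0, "tokens"]))  # the list came back as a string

# ast.literal_eval parses the string form back into a Python list
reloaded["tokens"] = reloaded["tokens"].apply(ast.literal_eval)
print(reloaded.loc[0, "tokens"])
```

If the token lists are needed as real lists later, a format that preserves nested types may be a better fit than CSV; that choice is outside what this notebook does.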