Task 4: Classifying tweets as business or non-business

In [1]:
import logging, importlib, sys, tqdm
import pandas as pd
import os, re
import neattext.functions as nfx
from _pckle import save_pickle_object, load_pickle_object
from _logging import set_logging
from _utility import gl

set_logging(logging)
for folder in gl.output_folders:
    os.makedirs(folder, exist_ok=True)  # no-op if the folder already exists

Read the Twitter File

In [2]:
def read_tweets():
    folder = "Files"
    filename = "tweet_data.csv"
    path = os.path.join(folder, filename)
    df_tweets_orig = pd.read_csv(path)
    return df_tweets_orig


In [3]:
df_tweets_orig = read_tweets()
df_tweets_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785916 entries, 0 to 785915
Data columns (total 18 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   tweetID     785916 non-null  int64 
 1   crDate      785916 non-null  object
 2   edInput     785916 non-null  int64 
 3   editor      785916 non-null  int64 
 4   engages     785916 non-null  int64 
 5   isApproved  785916 non-null  bool  
 6   isEdNeed    785916 non-null  bool  
 7   isRT        785916 non-null  bool  
 8   likes       785916 non-null  int64 
 9   photoUrl    277896 non-null  object
 10  retweets    785916 non-null  int64 
 11  rtUsID      785916 non-null  int64 
 12  text        785916 non-null  object
 13  topicName   785916 non-null  object
 14  usFlwrs     785916 non-null  int64 
 15  usID        785916 non-null  int64 
 16  usName      785916 non-null  object
 17  videoUrl    140491 non-null  object
dtypes: bool(3), int64(9), object(6)
memory usage: 92.2+ MB
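
Only two of the 18 columns listed above, text and topicName, are used in the rest of this task. As a minimal sketch, `pd.read_csv` can be told to parse just those columns through its `usecols` parameter, which may reduce load time and memory. The in-memory CSV below is a made-up stand-in for Files/tweet_data.csv and is only there to keep the example self-contained.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Files/tweet_data.csv (made-up rows, same column names).
csv_text = io.StringIO(
    "tweetID,text,topicName,likes\n"
    "1,Lyft files for IPO,Business,10\n"
    "2,Whos your comic crush,Comics,3\n"
)

# usecols restricts parsing to the listed columns; the others are skipped.
df_small = pd.read_csv(csv_text, usecols=["text", "topicName"])
print(df_small.columns.tolist())
```

With the real file, the same `usecols` argument would be passed alongside the path built in read_tweets.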


Clean the text of the tweets
Some tweets may be identifiable as business tweets by their hashtags and numbers, so these are kept.
Stopwords are also kept for now, but this may change.

In [4]:
def clean_tweet(item):
    """Clean one row; `item` is a single-column row holding the tweet text."""
    # Matches runs of whitespace as well as a literal backslash followed by "n".
    NEW_LINE = re.compile(r'\s+|\\n')
    text = item.values[0]
    text = nfx.remove_emojis(text)
    text = nfx.remove_bad_quotes(text)
    text = nfx.remove_html_tags(text)
    text = nfx.remove_userhandles(text)
    text = nfx.remove_urls(text)
    text = nfx.remove_emails(text)
    text = nfx.remove_phone_numbers(text)
    text = nfx.remove_multiple_spaces(text)
    text = nfx.remove_dates(text)
    text = nfx.remove_punctuations(text, most_common=True)
    text = re.sub(NEW_LINE, " ", text)     # replace newlines and whitespace runs with a space
    text = nfx.remove_non_ascii(text)
    item.values[0] = text
    return text
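
To sanity-check the intent described above (hashtags and numbers survive, stopwords are optional), a small standard-library approximation can be run on a sample tweet. This is only a sketch: the regular expressions, the `clean_text_sketch` name, the sample tweet, and the tiny stopword list are all assumptions made for illustration, and they do not reproduce what the neattext functions do.

```python
import re

URL = re.compile(r"https?://\S+")
HANDLE = re.compile(r"@\w+")
PUNCT = re.compile(r"[^\w\s#]")       # drop punctuation but keep '#'
WHITESPACE = re.compile(r"\s+")

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "to", "of", "for", "is"}

def clean_text_sketch(text, drop_stopwords=False):
    text = URL.sub(" ", text)          # remove links
    text = HANDLE.sub(" ", text)       # remove @user handles
    text = PUNCT.sub("", text)         # remove punctuation except '#'
    text = WHITESPACE.sub(" ", text).strip()
    if drop_stopwords:
        text = " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
    return text

sample = "Lyft files for #IPO, valued at $20 billion! @user https://t.co/abc"
print(clean_text_sketch(sample))
# → Lyft files for #IPO valued at 20 billion
print(clean_text_sketch(sample, drop_stopwords=True))
# → Lyft files #IPO valued at 20 billion
```

The `drop_stopwords` flag shows one way the stopword decision could be switched later without touching the other steps.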

In [5]:
def get_one_hot_encoded_business_category(df_topicNames):
    # get_dummies prefixes each new column with the source column name,
    # so the Business category ends up in "topicName_Business".
    df_one_hot_topics = pd.get_dummies(df_topicNames)
    df_is_business = df_one_hot_topics["topicName_Business"].to_frame()
    df_is_business.rename(columns={'topicName_Business': gl.is_business}, inplace=True)
    return df_is_business
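
Since only the Business column of the one-hot encoding is kept, the same label can probably be built with a direct comparison. The sketch below assumes the label column is called "is_business" (the actual value of gl.is_business is not shown here) and uses made-up topic names. Casting with `astype(int)` gives 0/1 values regardless of pandas version, whereas recent pandas releases return boolean columns from `get_dummies` by default.

```python
import pandas as pd

# Made-up topics standing in for the topicName column.
df_topics = pd.DataFrame({"topicName": ["Business", "Comics", "Business", "Travel"]})

# Compare against the category name, then cast True/False to 1/0.
is_business = (df_topics["topicName"] == "Business").astype(int).to_frame("is_business")
print(is_business["is_business"].tolist())  # → [1, 0, 1, 0]
```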

Select only the relevant columns: text for the tweet and topicName for the label.

In [6]:
def clean_data(df_tweets_orig):
    df_text_orig = df_tweets_orig.get([gl.text])
    df_text = df_text_orig.apply(clean_tweet, axis=1).to_frame()
    df_text.columns = [gl.text] 
    df_topicNames = df_tweets_orig.get([gl.topic])
    df_is_business = get_one_hot_encoded_business_category(df_topicNames)
    return df_text, df_is_business
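
Applying a function row by row with `axis=1` tends to be slow on large frames. One possible alternative, sketched here under the assumption that the cleaning function is rewritten to take and return a plain string, is to map over the text column directly. `clean_string` below is a hypothetical stand-in that only normalises whitespace; it is not the notebook's clean_tweet.

```python
import re
import pandas as pd

WHITESPACE = re.compile(r"\s+")

def clean_string(text):
    # Hypothetical stand-in for the real cleaning steps; takes and returns a str.
    return WHITESPACE.sub(" ", text).strip()

df = pd.DataFrame({"text": ["Lyft  files\nfor IPO ", " Whos your   comic crush"]})

# Map over the column directly; to_frame() wraps the resulting Series
# back into a one-column dataframe that keeps the name "text".
df_text_sketch = df["text"].map(clean_string).to_frame()
print(df_text_sketch["text"].tolist())
# → ['Lyft files for IPO', 'Whos your comic crush']
```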


The df_text dataframe contains the cleaned tweet text.
The df_is_business dataframe contains the one-hot encoded label that indicates whether the tweet is classified as a business tweet: 1 if it is, 0 if it is not.

In [7]:
df_text, df_is_business = clean_data(df_tweets_orig)
df_text

Unnamed: 0,text
0,The immediate impulse for an alliance of the E...
1,Americas economy is flashing some warning sign...
2,Lyft files for what is expected to be one of t...
3,Exporters still waiting to get Rs 6000 crore w...
4,Ridehailing firm Lyft races to leave Uber behi...
...,...
785911,Relations are DIFFERENT not DIFFICULT
785912,to live a creative life we must lose our fear ...
785913,Whos your comic crush
785914,After a flight of 195 hours 18 minutes 35 seco...


This code takes a while to run, so the dataframes are stored in pickle files and the work continues in a new script.


In [8]:
save_pickle_object(df_text, gl.pkl_df_text)
save_pickle_object(df_is_business, gl.pkl_df_is_business)

2023-01-03 14:23:48,943 | INFO : Saving pickle file from: pickle\pkl_df_text.pkl
2023-01-03 14:23:49,299 | INFO : Saving pickle file from: pickle\pkl_df_is_business.pkl
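
The save_pickle_object and load_pickle_object helpers come from a local module that is not shown here. As a rough sketch of what such a save/load round trip might look like with the standard library, the hypothetical functions below write an object to disk and read it back. The function names, the temporary directory, and the sample object are assumptions for illustration. For dataframes, pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`.

```python
import os
import pickle
import tempfile

def save_object_sketch(obj, path):
    # Hypothetical helper; not the notebook's save_pickle_object.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object_sketch(path):
    # Hypothetical helper; not the notebook's load_pickle_object.
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip through a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pickle", "example.pkl")
    save_object_sketch({"text": ["Lyft files for IPO"]}, path)
    restored = load_object_sketch(path)

print(restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.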
