Task 4: Classifying tweets into business and non-business categories

In [1]:
import logging, importlib, sys, tqdm
import pandas as pd
import os, re
import neattext.functions as nfx
from _pckle import save_pickle_object, load_pickle_object
from _logging import set_logging
from _utility import gl

set_logging(logging)
# Create the output folders if they do not exist yet
for folder in gl.output_folders:
    os.makedirs(folder, exist_ok=True)

Read the Twitter File

In [2]:
def read_tweets():
    folder = "Files"
    filename = "tweet_data.csv"
    path = os.path.join(folder, filename)
    df_tweets_orig = pd.read_csv(path)
    return df_tweets_orig


In [3]:
df_tweets_orig = read_tweets()
df_tweets_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785916 entries, 0 to 785915
Data columns (total 18 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   tweetID     785916 non-null  int64 
 1   crDate      785916 non-null  object
 2   edInput     785916 non-null  int64 
 3   editor      785916 non-null  int64 
 4   engages     785916 non-null  int64 
 5   isApproved  785916 non-null  bool  
 6   isEdNeed    785916 non-null  bool  
 7   isRT        785916 non-null  bool  
 8   likes       785916 non-null  int64 
 9   photoUrl    277896 non-null  object
 10  retweets    785916 non-null  int64 
 11  rtUsID      785916 non-null  int64 
 12  text        785916 non-null  object
 13  topicName   785916 non-null  object
 14  usFlwrs     785916 non-null  int64 
 15  usID        785916 non-null  int64 
 16  usName      785916 non-null  object
 17  videoUrl    140491 non-null  object
dtypes: bool(3), int64(9), object(6)
memory usage: 92.2+ MB
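
Only a few of the 18 columns are used in the steps that follow. As a minimal sketch, and assuming the `gl` attributes refer to the columns `text`, `topicName`, `edInput` and `usID`, the file could be read with `usecols` to keep memory use down. The CSV content here is an invented stand-in for `tweet_data.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for Files/tweet_data.csv; the rows are invented for illustration.
csv_text = io.StringIO(
    "tweetID,edInput,text,topicName,usID,likes\n"
    "1,1,Stocks rally,Business,5402612,10\n"
    "2,0,Match report,Sport,25984418,3\n"
)

# Assumption: these are the column names behind gl.text, gl.topic, gl.edInput, gl.usID.
needed = ["text", "topicName", "edInput", "usID"]

# usecols makes read_csv parse only the listed columns.
df_small = pd.read_csv(csv_text, usecols=needed)
print(df_small.columns.tolist())
```

With the real file, the same `usecols` argument can be passed to the `pd.read_csv(path)` call inside `read_tweets`.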


Clean the text of the tweets
Some tweets may be identifiable as business tweets by their hashtags, so hashtags are kept in. Numbers are stripped from the tweet text by the code below.
For now stopwords are also kept in, but this may change.

In [4]:
# Matches runs of whitespace as well as literal "\n" sequences left in the text
NEW_LINE = re.compile(r'\s+|\\n')

def clean_tweet(item):
    # item is a one-column row holding the raw tweet text
    text = item.values[0]
    text = nfx.remove_emojis(text)
    text = nfx.remove_bad_quotes(text)
    text = nfx.remove_html_tags(text)
    text = nfx.remove_userhandles(text)
    text = nfx.remove_urls(text)
    text = nfx.remove_emails(text)
    text = nfx.remove_phone_numbers(text)
    text = nfx.remove_multiple_spaces(text)
    text = nfx.remove_dates(text)
    text = nfx.remove_punctuations(text, most_common=True)
    text = NEW_LINE.sub(" ", text)     # replace whitespace runs and literal \n with a space
    text = nfx.remove_non_ascii(text)
    text = nfx.remove_numbers(text)
    # Return the cleaned text without writing it back into the input row
    return text
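
To see the effect of this kind of cleaning without installing neattext, here is a rough stand-in built on the standard `re` module. It is a sketch only: the patterns and the name `clean_tweet_sketch` are assumptions made for illustration and do not reproduce neattext's exact behaviour (punctuation, dates and phone numbers are not handled).

```python
import re

# Hypothetical, simplified patterns; neattext's own patterns differ.
URL = re.compile(r"https?://\S+")
HANDLE = re.compile(r"@\w+")
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
DIGITS = re.compile(r"\d+")
WHITESPACE = re.compile(r"\s+")

def clean_tweet_sketch(text: str) -> str:
    """Strip URLs, user handles, non-ASCII characters and digits; keep hashtags."""
    text = URL.sub(" ", text)        # remove links first so their digits go with them
    text = HANDLE.sub(" ", text)     # remove @user mentions
    text = NON_ASCII.sub(" ", text)  # remove emojis and other non-ASCII characters
    text = DIGITS.sub(" ", text)     # remove numbers
    return WHITESPACE.sub(" ", text).strip()  # collapse whitespace, including newlines

sample = "@cnbc Stocks up 3% today 📈 #markets https://t.co/abc123\nMore soon"
print(clean_tweet_sketch(sample))
```

The hashtag `#markets` survives while the handle, link, emoji and number are dropped, which is the behaviour the note above the cell describes.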

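Select only the relevant columns: text for the tweet, edInput for the label and usID for the data source. Only tweets whose topicName is "Business" and whose edInput is 1 or 2 are kept.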

In [5]:
def clean_data(df_tweets_orig):
    # Keep Business-topic tweets whose edInput is 1 or 2. One boolean mask plus
    # copy() avoids the SettingWithCopyWarning that query(..., inplace=True)
    # raises when it is called on a slice.
    is_business = df_tweets_orig[gl.topic] == "Business"
    has_label = df_tweets_orig[gl.edInput].isin([1, 2])
    df_filtered = df_tweets_orig[is_business & has_label].copy()
    df_edInput = df_filtered[[gl.edInput]]
    df_text_cleaned = df_filtered[[gl.text]].apply(clean_tweet, axis=1)
    # The usID column contains the code of the data source, for example "Business insider".
    # Now that numbers have been cleaned from the text, prepend this numeric code to the text.
    # The result is a Series of strings, not a DataFrame.
    df_text = df_filtered[gl.usID].astype("str") + " " + df_text_cleaned
    return df_text, df_edInput


df_text holds the cleaned tweets, each prefixed with the usID of its source. It is a pandas Series rather than a dataframe.
The df_edInput dataframe contains the label that indicates whether the tweet is classified as a business tweet. After the filter in clean_data it holds the raw edInput values 1 and 2 rather than a one hot encoded 0/1 column.

In [6]:
df_text, df_edInput = clean_data(df_tweets_orig)
df_text

12252     5402612 UK Prime Minister Theresa May will fac...
14042     5402612 UK PM Theresa May wins confidence vote...
16954     705706292 The probe of the inaugural fund part...
18004     25984418 The week Brexit hit the brick wall : ...
18396     61183568 Have watched these kinds of pictures ...
                                ...                        
785779    4805771380 This bouquet of roses is completely...
785809    4805771380 This fancy McDonalds has a handwash...
785813    2401975454  Spoilers ahead Finished #StrangerT...
785829       4805771380 These cakes are topped with yogurt 
785854    4805771380 Are you team Shake Shack or team In...
Length: 30024, dtype: object
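
The filter-then-combine logic of clean_data can be tried on a toy frame. The sketch below assumes that the column names `text`, `topicName`, `edInput` and `usID` stand behind the `gl` attributes, and the rows are invented for illustration. It builds one boolean mask and takes an explicit copy, which is one way to filter without triggering pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Invented rows; column names are assumed to match the gl attributes.
df = pd.DataFrame({
    "text": ["Stocks rally", "Match report", "Rates held", "Merger news"],
    "topicName": ["Business", "Sport", "Business", "Business"],
    "edInput": [1, 1, 0, 2],
    "usID": [5402612, 25984418, 61183568, 705706292],
})

# Combine both conditions in one mask and copy the result,
# so later changes never touch a view of the original frame.
mask = (df["topicName"] == "Business") & df["edInput"].isin([1, 2])
df_filtered = df.loc[mask].copy()

# Prefix each tweet with the numeric code of its source.
text_with_source = df_filtered["usID"].astype(str) + " " + df_filtered["text"]
labels = df_filtered[["edInput"]]

print(text_with_source)
```

The original row index is preserved by the mask, which matches the non-contiguous index values visible in the output above.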

The cleaning step takes a while to run, so store the results in pickle files and continue in a new script.


In [7]:
save_pickle_object(df_text, gl.pkl_df_text)
save_pickle_object(df_edInput, gl.pkl_df_edInput)

2023-01-04 23:13:00,893 | INFO : Saving pickle file from: pickle\pkl_df_text.pkl
2023-01-04 23:13:00,911 | INFO : Saving pickle file from: pickle\pkl_df_edInput.pkl
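
The helpers `save_pickle_object` and `load_pickle_object` come from the local `_pckle` module, which is not shown here. As a rough idea of what such helpers might look like, here is a sketch using only the standard library. The names `save_object` and `load_object`, their signatures and the demo payload are assumptions, not the notebook's actual implementation.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Pickle obj to path, creating the parent folder if needed (hypothetical helper)."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_object(path):
    """Load a previously pickled object from path (hypothetical helper)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Round trip in a temporary folder so the demo leaves the project untouched.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "pickle", "demo.pkl")
save_object({"text": ["a", "b"]}, demo_path)
restored = load_object(demo_path)
print(restored)
```

For pandas objects, `to_pickle` and `pd.read_pickle` offer the same round trip. Pickle files should only be loaded from sources you trust.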
