Data extraction

The second step in the NLP pipeline is extracting the text from its native form (such as PDF, image, or HTML files).

Our dataset is a CSV (comma-separated values) file of tweets. Each row contains the text of a tweet and a sentiment label. We will use the pandas library to read the CSV file and load the data into a DataFrame.

A pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
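
As a small illustration of the structure described above, the sketch below loads a CSV string into a DataFrame. The two sample rows and the variable names are made up for the example; they only mimic the id, label, and tweet columns of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample with the same columns as train_tweets.csv
csv_text = "id,label,tweet\n1,0,good morning\n2,1,this is awful\n"

# read_csv accepts any file-like object, so a StringIO stands in for the file
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # column labels
```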

In [11]:
!pip install contractions



In [12]:
!pip install emoji



In [13]:
import requests

url = "https://raw.githubusercontent.com/ezzybala/NLP-pipeline-Tweeter-Sentiment-Analysis/refs/heads/main/dataset/train_tweets.csv"
file_name = "train_tweets.csv"

# download the dataset and save it next to the notebook
response = requests.get(url, timeout=30)
if response.status_code == 200:
    with open(file_name, "wb") as file:
        file.write(response.content)
    print(f"File {file_name} downloaded successfully.")
else:
    print(f"Failed to download the file (HTTP status {response.status_code}).")

File train_tweets.csv downloaded successfully.


In [14]:
import pandas as pd
#load the csv file in a dataframe
raw_data = pd.read_csv("train_tweets.csv")

raw_data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [15]:
print(raw_data.shape)

(31962, 3)


In [16]:
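# keep only the tweet text and its label (dropping the id column),
# and rename 'label' to 'sentiment'
# .copy() makes train_df independent of raw_data, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
train_df = raw_data[['tweet', 'label']].copy()
train_df.columns = ['tweet', 'sentiment']
train_df.head()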

Unnamed: 0,tweet,sentiment
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


In [17]:
train_df.sentiment.value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
0,29720
1,2242
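
The counts above are heavily skewed toward label 0. A quick sketch of the class shares, using only the two counts printed above (the variable names are illustrative):

```python
# Label counts copied from the value_counts() output above
counts = {0: 29720, 1: 2242}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
```

An imbalance of this size is worth keeping in mind later, since plain accuracy can look high for a model that mostly predicts the majority label.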


In [18]:
# define a dictionary to map numeric labels to sentiment names
# (note: the name `map` shadows Python's built-in map() from here on)
map = {0: 'Positive', 1: 'Negative'}
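
One possible way to use such a dictionary is to translate a column of numeric labels into readable names with `Series.map`. This is only a sketch: the name `sentiment_names` and the three sample labels are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

# Same mapping as above, under a hypothetical name that does not shadow built-in map()
sentiment_names = {0: "Positive", 1: "Negative"}

labels = pd.Series([0, 1, 0])           # made-up labels for illustration
readable = labels.map(sentiment_names)  # dictionary lookup applied element-wise

print(readable.tolist())
```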

In [19]:
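import emoji

# 1. Repair mojibake: text whose UTF-8 bytes were decoded as Latin-1
def fix_encoding(text):
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # not mojibake (or not recoverable this way): keep the text unchanged
        return text

# 2. Convert emoji to text meaning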
def emoji_to_words(text):
    return emoji.demojize(text, delimiters=(" ", " "))  # 😜 → face_with_stuck_out_tongue_winking_eye
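
To see what the encoding repair does, the sketch below repeats `fix_encoding` so that it runs on its own and applies it to three made-up strings: one garbled, one plain, and one that cannot be encoded as Latin-1.

```python
def fix_encoding(text):
    # Same round trip as above: reinterpret the characters as Latin-1 bytes,
    # then decode those bytes as UTF-8; fall back to the input on failure
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

mojibake = "cafÃ©"      # "café" whose UTF-8 bytes were misread as Latin-1
plain = "plain ascii"   # survives the round trip unchanged
emoji_text = "ok 😀"     # cannot be encoded as Latin-1, so it is returned as-is

print(fix_encoding(mojibake))
print(fix_encoding(plain))
print(fix_encoding(emoji_text))
```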


To clean the tweets, we first create a function that lowercases the text, expands contractions, removes text enclosed in square brackets, removes links, removes punctuation, removes new lines, and removes words containing numbers. The function also repairs encoding errors, strips HTML-like tags, and, by default, converts emoji to words.

For cleaning, we perform the following steps:
*   Make text lowercase
*   Expand contractions
*   Remove text in square brackets
*   Remove links
*   Remove punctuation
*   Remove new lines
*   Remove words containing numbers
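
The regex-based steps in this list can be sketched with the standard library alone. `basic_clean` is a hypothetical helper, not the notebook's function: it skips contraction expansion and emoji handling (which need third-party packages), and, unlike the notebook, it replaces new lines with a space and collapses repeated spaces so the result is easier to read.

```python
import re
from string import punctuation

def basic_clean(text):
    """Hypothetical helper covering the regex-based cleaning steps only."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                     # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)       # links
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)  # punctuation
    text = re.sub(r"\n", " ", text)                         # new lines
    text = re.sub(r"\w*\d\w*", "", text)                    # words containing digits
    return " ".join(text.split())                           # collapse leftover spaces

sample = "Check THIS out [ad] https://example.com now!\nRoom 4b is #great"
print(basic_clean(sample))
```

The order matters: links are removed before punctuation, because stripping `:`, `/`, and `.` first would leave fragments of the URL behind.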




In [20]:
import re
from string import punctuation
import contractions

def clean_text(text, keep_emojis=False):
    # cast to str first so non-string values (e.g. NaN) do not break fix_encoding
    text = fix_encoding(str(text))
    # make text lowercase
    text = text.lower()
    # expand contractions
    text = " ".join(contractions.fix(word) for word in text.split())
    # remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # remove links
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # remove HTML-like tags
    text = re.sub(r'<.*?>+', '', text)
    # remove punctuation
    text = re.sub('[%s]' % re.escape(punctuation), '', text)
    # remove new lines
    text = re.sub(r'\n', '', text)
    # remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    # convert emoji to words unless keep_emojis is set
    if not keep_emojis:
        text = emoji_to_words(text)
    return text

In [21]:
# apply the clean_text function to each tweet in the training dataset
train_df['clean_tweet'] = train_df['tweet'].apply(clean_text)

train_df.head()


Unnamed: 0,tweet,sentiment,clean_tweet
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...
2,bihday your majesty,0,bihday your majesty
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...
4,factsguide: society now #motivation,0,factsguide society now motivation


In [22]:
train_df.to_csv('clean_tweets.csv', index=False)

In [23]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [24]:
from nltk.tokenize import sent_tokenize

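# count the sentences in each cleaned tweet
# (despite its name, 'tokenized_sents' stores the count, not the sentences)
train_df['tokenized_sents'] = train_df['clean_tweet'].apply(lambda x: len(sent_tokenize(x)))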
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1
2,bihday your majesty,0,bihday your majesty,1
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1
4,factsguide: society now #motivation,0,factsguide society now motivation,1


In [25]:
from nltk.tokenize import word_tokenize

# split each cleaned tweet into a list of word tokens
train_df['word_list'] = train_df['clean_tweet'].apply(lambda x: word_tokenize(str(x)))
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents,word_list
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1,"[user, when, a, father, is, dysfunctional, and..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1,"[user, user, thanks, for, lyft, credit, i, can..."
2,bihday your majesty,0,bihday your majesty,1,"[bihday, your, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1,"[model, i, love, you, take, with, you, all, th..."
4,factsguide: society now #motivation,0,factsguide society now motivation,1,"[factsguide, society, now, motivation]"


In [26]:
from collections import Counter

top = Counter([item for sublist in train_df['word_list'] for item in sublist])
temp_df = pd.DataFrame(top.most_common(20))
temp_df.columns = ['Common_words','count']
temp_df.style.background_gradient(cmap = 'Blues')

Unnamed: 0,Common_words,count
0,user,17506
1,the,10182
2,to,10094
3,you,7560
4,i,7325
5,a,6422
6,is,6164
7,and,4875
8,in,4644
9,for,4484
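
The counting step above flattens the per-tweet token lists into one stream and tallies it. A minimal sketch of the same idea on made-up token lists, using `itertools.chain` as an alternative to the nested list comprehension:

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the 'word_list' column
token_lists = [
    ["user", "love", "you"],
    ["user", "happy", "day"],
    ["love", "day", "user"],
]

# chain.from_iterable flattens the nested lists without building an intermediate list
word_counts = Counter(chain.from_iterable(token_lists))

print(word_counts.most_common(3))
```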


In [27]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [28]:
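# build the stop word set once; calling stopwords.words('english') for every
# token rebuilds the list each time and makes the loop far slower
stop_words = set(stopwords.words('english'))

def remove_stopword(word_list):
    return [word for word in word_list if word not in stop_words]

train_df['word_list_without_sw'] = train_df['word_list'].apply(remove_stopword)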
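
The filtering idea can be tried without NLTK. In the sketch below the five-word stop list and the helper name `drop_stop_words` are assumptions for illustration; the notebook itself uses NLTK's English stop word list.

```python
# Small hypothetical stop list; the notebook uses NLTK's English list instead
stop_words = {"a", "is", "and", "the", "your"}

def drop_stop_words(tokens, stop_words):
    # Set membership is O(1) on average, so each token costs a single lookup
    return [tok for tok in tokens if tok not in stop_words]

print(drop_stop_words(["bihday", "your", "majesty"], stop_words))
```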


In [29]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,0,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, mobile_phone, kissin..."
4,factsguide: society now #motivation,0,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


In [30]:
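# count tokens again, this time without stop words
top = Counter([item for sublist in train_df['word_list_without_sw'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
# drop the first row, the placeholder token 'user' left over from @user mentions
temp = temp.iloc[1:,:]
temp.columns=['Common_words','count']
temp.style.background_gradient(cmap='Purples')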

Unnamed: 0,Common_words,count
1,love,2697
2,day,2244
3,happy,1690
4,amp,1586
5,time,1122
6,…,1113
7,life,1102
8,like,1044
9,today,1003
10,sweat_droplets,1000


In [31]:
train_df

Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,0,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, mobile_phone, kissin..."
4,factsguide: society now #motivation,0,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"
...,...,...,...,...,...,...
31957,ate @user isz that youuu?ðððððð...,0,ate user isz that youuu smiling_face_with_hear...,1,"[ate, user, isz, that, youuu, smiling_face_wit...","[ate, user, isz, youuu, smiling_face_with_hear..."
31958,to see nina turner on the airwaves trying to...,0,to see nina turner on the airwaves trying to w...,1,"[to, see, nina, turner, on, the, airwaves, try...","[see, nina, turner, airwaves, trying, wrap, ma..."
31959,listening to sad songs on a monday morning otw...,0,listening to sad songs on a monday morning otw...,1,"[listening, to, sad, songs, on, a, monday, mor...","[listening, sad, songs, monday, morning, otw, ..."
31960,"@user #sikh #temple vandalised in in #calgary,...",1,user sikh temple vandalised in in calgary wso ...,1,"[user, sikh, temple, vandalised, in, in, calga...","[user, sikh, temple, vandalised, calgary, wso,..."


In [32]:
positive_tweets = train_df[train_df['sentiment'] == 0]
negative_tweets = train_df[train_df['sentiment'] == 1]

In [33]:
# Most common positive words
top = Counter([item for sublist in positive_tweets['word_list_without_sw'] for item in sublist])
temp_positive = pd.DataFrame(top.most_common(20))
temp_positive.columns = ['Common_positive_words','count']
temp_positive.style.background_gradient(cmap='Greens')

Unnamed: 0,Common_positive_words,count
0,user,15644
1,love,2672
2,day,2236
3,happy,1678
4,amp,1318
5,time,1100
6,life,1096
7,sweat_droplets,1000
8,today,991
9,red_heart,931


In [34]:
# Most common negative words
top = Counter([item for sublist in negative_tweets['word_list_without_sw'] for item in sublist])
temp_negative = pd.DataFrame(top.most_common(20))
# drop the first row (the most frequent token) before displaying
temp_negative = temp_negative.iloc[1:,:]
temp_negative.columns = ['Common_negative_words','count']
temp_negative.style.background_gradient(cmap='Reds')

Unnamed: 0,Common_negative_words,count
1,amp,268
2,…,207
3,trump,201
4,libtard,149
5,like,137
6,white,137
7,black,131
8,people,105
9,racist,103
10,politics,96


In [35]:
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# lemmatize each token; without a part-of-speech argument,
# WordNetLemmatizer treats every word as a noun
train_df['word_list_without_sw'] = train_df['word_list_without_sw'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
train_df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drag, k..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,0,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, mobile_phone, kissin..."
4,factsguide: society now #motivation,0,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


In [36]:
# join the lemmatized tokens back into one string per tweet
train_df['final_tweet'] = train_df['word_list_without_sw'].apply(lambda x: ' '.join(x))
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,tokenized_sents,word_list,word_list_without_sw,final_tweet
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,1,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drag, k...",user father dysfunctional selfish drag kid dys...
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,1,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,...",user user thanks lyft credit use offer wheelch...
2,bihday your majesty,0,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]",bihday majesty
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,1,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, mobile_phone, kissin...",model love take time mobile_phone kissing_face...
4,factsguide: society now #motivation,0,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]",factsguide society motivation


In [37]:
train_df.to_csv('processed_train_tweets.csv', index=False)