# About this Notebook

The Twitter Sentiment Analysis project applies machine learning and natural language processing (NLP) to classify the sentiment expressed in tweets. It uses pandas to load the data, the Natural Language Toolkit (nltk) to tokenize and lemmatize the text, and scikit-learn to vectorize the tweets and train a linear support vector classifier. TensorFlow is used only to confirm that a GPU is available in Colab. The dataset carries a binary label (0 or 1), and the goal is to predict that label for unseen tweets, turning unstructured tweet text into a signal that can be acted on.

# Description

This project begins with setting up the working environment in Google Colab and preparing the data for analysis. It involves:

1. Data Collection: The training tweets, test tweets, and a sample submission file are loaded from CSV files stored on Google Drive.
2. Data Preprocessing: Each tweet is lowercased, stripped of punctuation, expanded from chat shortcuts (for example 'u' to 'you'), tokenized with nltk, and lemmatized. A stop-word list is built, but the stop-word filter is left commented out in the cleaning function.
3. Sentiment Analysis: The cleaned tweets are converted to token counts with scikit-learn's CountVectorizer, and a LinearSVC model is trained on them. The model is evaluated on a 20% hold-out split using accuracy and F1 score, then refit on the full training set to predict labels for the test tweets.

# Why Sentiment Analysis is Important

Sentiment analysis does more than sort opinions into categories; it helps make sense of the emotions and beliefs people express online. In this project it serves as a tool for identifying harmful or dangerous content on social media, such as racism or indicators of mental health issues like suicidal ideation. Detecting such content supports a safer online environment and enables timely intervention and support for individuals in distress. The resulting insights can also help policymakers, educators, and the platforms themselves devise strategies that promote inclusivity, safety, and mental well-being online.

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

In [3]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import f1_score, accuracy_score

In [4]:
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Twitter Sentiment/train_E6oV3lV.csv')
test = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Twitter Sentiment/test_tweets_anuFYb8.csv')
submission = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Twitter Sentiment/sample_submission_gfvA5FD.csv')

In [5]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [6]:
test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [7]:
submission.head()

Unnamed: 0,id,label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [8]:
train.label.value_counts()

0    29720
1     2242
Name: label, dtype: int64
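
The counts above show a heavily imbalanced label: roughly 7% of the training tweets carry label 1. The following is a minimal sketch of what that implies, using only the two counts printed above. The "always predict 0" baseline is a hypothetical classifier introduced for illustration, not something the notebook trains.

```python
# Counts copied from the value_counts() output above.
n_negative = 29720  # label 0
n_positive = 2242   # label 1
n_total = n_negative + n_positive

positive_share = n_positive / n_total

# Hypothetical baseline (not part of the notebook): always predict label 0.
baseline_accuracy = n_negative / n_total
baseline_recall = 0 / n_positive  # it never finds a positive tweet

print(f"total tweets:      {n_total}")
print(f"positive share:    {positive_share:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

Such a baseline would reach about 93% accuracy while detecting no positive tweets, which is why the F1 score reported during evaluation is the more informative of the two metrics.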

In [11]:
train['tweet'][786]

' @user what did you decide?   #fowoh #goldenretriever #lcck9comfodog #workingdog '

In [12]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [14]:
stop_words = set(stopwords.words('english'))
stop = [x.lower() for x in stop_words]
lemma = WordNetLemmatizer()

shortcuts = {'u': 'you', 'y': 'why', 'r': 'are', 'doin': 'doing', 'hw': 'how', 'k': 'okay', 'm': 'am', 'b4': 'before',
            'idc': "i do not care", 'ty': 'thankyou', 'wlcm': 'welcome', 'bc': 'because', '<3': 'love', 'xoxo': 'love',
            'ttyl': 'talk to you later', 'gr8': 'great', 'bday': 'birthday', 'awsm': 'awesome', 'gud': 'good', 'h8': 'hate',
            'lv': 'love', 'dm': 'direct message', 'rt': 'retweet', 'wtf': 'hate', 'idgaf': 'hate',
             'irl': 'in real life', 'yolo': 'you only live once'}

def clean(text):
    """Normalize a tweet: lowercase, strip punctuation, expand shortcuts, lemmatize."""
    text = text.lower()

    # Replace runs of non-word characters (punctuation, '@', '#') with a space.
    text = re.sub(r'\W+', ' ', text).strip()
    # Drop the anonymized mention placeholder (a plain substring replacement).
    text = text.replace('user', '')

    text_token = word_tokenize(text)

    # Expand chat shortcuts such as 'u' -> 'you'.
    full_words = [shortcuts.get(token, token) for token in text_token]

    # Stop-word filtering is left disabled:
    # words = [word for word in full_words if word not in stop]
    words_alpha = [re.sub(r'\d+', '', word) for word in full_words]  # strip digits
    words_big = [word for word in words_alpha if len(word) > 2]      # drop tokens of 1-2 characters
    lemmatized_words = [lemma.lemmatize(word) for word in words_big]

    clean_text = " ".join(lemmatized_words)
    clean_text = clean_text.replace('   ', ' ')
    clean_text = clean_text.replace('  ', ' ')
    return clean_text
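
Two details of `clean` are easy to miss. Because non-word characters are removed before the `shortcuts` lookup, a key such as `'<3'` can never match, and `text.replace('user', '')` also removes the substring inside longer words. The sketch below reproduces only the first steps of `clean` with the standard library (no tokenization or lemmatization). `first_steps_of_clean` and `strip_mentions` are hypothetical helper names, and `strip_mentions` is one possible alternative rather than the notebook's code.

```python
import re

def first_steps_of_clean(text):
    """Lowercase, collapse non-word runs to spaces, drop the substring 'user'."""
    text = text.lower()
    text = re.sub(r'\W+', ' ', text).strip()
    return text.replace('user', '')

def strip_mentions(text):
    """Hypothetical alternative: remove only the standalone token 'user'."""
    text = re.sub(r'\W+', ' ', text.lower()).strip()
    return re.sub(r'\buser\b', '', text)

# Toy tweet, invented for illustration.
sample = '@user i <3 these superusers'

print(first_steps_of_clean(sample).split())
print(strip_mentions(sample).split())
```

In this toy tweet, `<3` reaches the shortcut lookup as the bare token `3`, and `superusers` becomes `supers` under the substring replacement but survives intact with the word-boundary pattern.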

In [16]:
# Mark tweets that contain the word 'hypocrite' AND are labeled 1 as 'racist';
# every other tweet is marked 'good'.
# Note: this marker is derived from the training label, so it cannot be
# computed for the unlabeled test tweets.
is_flagged = train['tweet'].str.contains('hypocrite', regex=False) & (train['label'] == 1)
train['hypocrite'] = is_flagged.map({True: 'racist', False: 'good'})
print(train['hypocrite'].value_counts())

good      31957
racist        5
Name: hypocrite, dtype: int64


In [17]:
train['combined'] = train['tweet'].apply(str) + ' ' + train['hypocrite'].apply(str)
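
The `hypocrite` marker appended above depends on `train['label']`, so the same text cannot be built for `test`, which has no labels. The following is a minimal sketch of a label-free variant that looks only at the tweet text. `keyword_flag`, `add_flag`, and the marker tokens are hypothetical names, and whether such a feature improves the model is untested here.

```python
def keyword_flag(tweet, keyword='hypocrite'):
    """Return a marker token based only on the tweet text, never the label."""
    return 'has_keyword' if keyword in tweet.lower() else 'no_keyword'

def add_flag(tweet, keyword='hypocrite'):
    """Append the marker token so a bag-of-words vectorizer can pick it up."""
    return f"{tweet} {keyword_flag(tweet, keyword)}"

# Toy tweets, invented for illustration.
tweets = ['what a hypocrite', 'happy monday everyone']
flagged = [add_flag(t) for t in tweets]
print(flagged)
```

With pandas, the same function could be applied to both frames in the same way, for example `train['tweet'].apply(add_flag)` and `test['tweet'].apply(add_flag)`, so that training and test text are prepared identically.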

In [18]:
X_train = train.combined
y = train.label
X_test = test.tweet

In [19]:
clean_Xtrain = X_train.apply(clean)

In [20]:
clean_Xtrain[1531]

'last shot the hotel stayed this weekend back the grind grateful healthy funwreekend good'

In [21]:
clean_Xtest = X_test.apply(clean)

In [22]:
print(len(clean_Xtrain))
print(len(clean_Xtest))
print(len(y))

31962
17197
31962


In [23]:
vectorizer = CountVectorizer(max_df=0.5)
# vectorizer = TfidfVectorizer(ngram_range=(1,3), max_df=0.5)

X = vectorizer.fit_transform(clean_Xtrain)
X_test = vectorizer.transform(clean_Xtest)
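
To make the vectorization step concrete, the following is a simplified pure-Python sketch of the idea behind a count vectorizer with a document-frequency cap. It is not scikit-learn's implementation: tokenization here is a plain whitespace split, and `build_vocabulary` and `to_counts` are hypothetical helper names. It assumes that a float `max_df=0.5` means terms found in more than half of the documents are dropped.

```python
from collections import Counter

def build_vocabulary(docs, max_df=0.5):
    """Map each kept term to a column index, dropping terms found in more
    than `max_df` (a fraction) of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    limit = max_df * len(docs)
    kept = sorted(term for term, freq in doc_freq.items() if freq <= limit)
    return {term: index for index, term in enumerate(kept)}

def to_counts(doc, vocabulary):
    """Turn one document into a dense list of term counts."""
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in vocabulary]

# Toy documents, invented for illustration.
docs = ['love this love', 'hate this', 'this day', 'great day']
vocabulary = build_vocabulary(docs)
matrix = [to_counts(doc, vocabulary) for doc in docs]

print(vocabulary)
print(matrix)
```

The word `this` appears in three of the four toy documents, so it is excluded from the vocabulary. The vocabulary is built once and reused for any later text, which mirrors why the notebook calls `fit_transform` on the training tweets but only `transform` on the test tweets.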

In [24]:
print(X.shape)
print(X_test.shape)

(31962, 37184)
(17197, 37184)


In [25]:
model = LinearSVC(penalty='l2', C=0.5, dual=False, random_state=0, max_iter=1000)
print(model)

LinearSVC(C=0.5, dual=False, random_state=0)


In [26]:
# split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=0)

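# fit on the training split, then score on the validation split
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
# scikit-learn metrics take the true labels first, then the predictions
print('Accuracy:', accuracy_score(y_val, y_pred))
print("F1 Score: ", f1_score(y_val, y_pred))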

Accuracy: 0.9648052557484749
F1 Score:  0.6905089408528199
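
The gap between accuracy and F1 above is typical for an imbalanced label: accuracy is dominated by the many correct negatives, whereas F1 looks only at the positive class. The following is a minimal sketch of both formulas. The confusion counts are invented for illustration and are not the notebook's actual confusion matrix.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented counts for a 1,000-tweet validation set with 70 positives.
tp, fp, fn, tn = 40, 10, 30, 920

acc_value = accuracy(tp, fp, fn, tn)
f1_value = f1(tp, fp, fn)

print(f"accuracy: {acc_value:.3f}")
print(f"f1:       {f1_value:.3f}")
```

With these invented counts, accuracy is 0.960 while F1 is about 0.667, because 30 of the 70 positive tweets are missed even though nearly all negatives are classified correctly.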


In [27]:
df = pd.DataFrame()
df['y_pred'] = y_pred
df['y_pred'].value_counts()

0    6074
1     319
Name: y_pred, dtype: int64

In [29]:
model.fit(X, y)
y_pred = model.predict(X_test)

In [30]:
df = pd.DataFrame()
df['y_pred'] = y_pred
df['y_pred'].value_counts()

0    16205
1      992
Name: y_pred, dtype: int64

In [31]:
# save it to submission csv
submission['label'] = y_pred
submission.to_csv('/content/drive/My Drive/Colab Notebooks/Twitter Sentiment/submission.csv', index=False)

# Conclusion

The aim of this sentiment analysis project is not only to classify tweets but also to shed light on the social and emotional signals they carry. By focusing on tweets that express racist, suicidal, or other socially relevant sentiments, the project seeks to contribute to a better understanding of online discourse and its real-world effects. It illustrates the potential of sentiment analysis as a tool for social good, one that supports empathy, awareness, and action.