# Introduction
**Sentiment:** When someone feels good about something, it's called a positive feeling, and when they feel bad about something, it's called a negative feeling, those feelings are "sentiment."

# OSEMN
* Obtain
* Scrub
* Explore 
* Model

## 1. Obtain

The Twitter Sentiment Analysis Dataset is a corpus of 1,578,627 classified tweets, with each tweet marked as 1 for positive sentiment and 0 for negative sentiment. The dataset is based on data from the University of Michigan Sentiment Analysis competition on Kaggle and the Twitter Sentiment Corpus by Niek Sanders. It is recommended to use 1/10 of the dataset for testing and the rest for training. The dataset has been used to achieve a 75% accuracy rate with a simple Naive Bayesian classification algorithm. The use of natural language processing can be helpful in extracting context and identifying features that contribute towards sentiment deduction. However, it is important to note that social informal communication, such as tweets, may not conform to grammatical rules and contain shortened words and overuse of punctuation. Despite these limitations, the dataset provides a good starting point for sentiment analysis modeling.

#### Load neccesary libraries

In [None]:
#pip freeze >> requirements.text

In [2]:
#import relevant libraries
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
np.random.seed(0)

### Load the Data

In [3]:
# import relevant libraries
import pandas as pd

# read the csv file to table
df = pd.read_csv("/home/munyao/Desktop/flat_iron_school/Moringa/phase_4/NLP/Data/Sentiment Analysis Dataset.csv", on_bad_lines='skip', index_col=0)

# preview first 7 rows of dataset.
df.head(7)

Unnamed: 0_level_0,Sentiment,SentimentSource,SentimentText
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,Sentiment140,is so sad for my APL frie...
2,0,Sentiment140,I missed the New Moon trail...
3,1,Sentiment140,omg its already 7:30 :O
4,0,Sentiment140,.. Omgaga. Im sooo im gunna CRy. I'...
5,0,Sentiment140,i think mi bf is cheating on me!!! ...
6,0,Sentiment140,or i just worry too much?
7,1,Sentiment140,Juuuuuuuuuuuuuuuuussssst Chillin!!


## 2. Scrub
* Removing stop words (words that are very common and do not add much meaning to the text)
* Removing punctuation and special characters
* Tokenizing the text (splitting it into words or phrases)
* Stemming or lemmatizing the words (reducing them to their base form)
* Removing URLs, mentions, or hashtags if you are working with social media data

In [5]:
# data shape
df.shape

(1578612, 3)

In [8]:
# check missing
df.isnull().sum()

Sentiment          0
SentimentSource    0
SentimentText      0
dtype: int64

### Text Preprocessing.
>A function in python using the Natural Language Toolkit (NLTK) library to perform the text preprocessing.

In [19]:
# define the pre-processing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|@[^\s]+|#\S+', '', text)

    # remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # tokenize the text into words
    tokens = nltk.word_tokenize(text)

    # remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # perform lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # join the tokens back into a string
    text = ' '.join(tokens)

    return text

# apply pre-processing to the 'text' column
df['ProcessedSentimentText'] = df['SentimentText'].apply(preprocess_text)

Unnamed: 0_level_0,Sentiment,SentimentSource,SentimentText,ProcessedSentimentText,CleanSentiment
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1578621,1,Sentiment140,ZZZZZ time.. Tomorrow will be a busy day for s...,zzzzz time tomorrow busy day serving amp lovin...,z z z z z t i m e t o m o r r o w w ...
1578622,0,Sentiment140,Zzzzz want to sleep but at sister's in-laws's ...,zzzzz want sleep sister inlawss house,z z z z z w a n t t o s l e e p b u t ...
1578623,1,Sentiment140,Zzzzzz.... Finally! Night tweeters!,zzzzzz finally night tweeter,z z z z z z f i n a l l y ! n i g ...
1578624,1,Sentiment140,"Zzzzzzz, sleep well people",zzzzzzz sleep well people,z z z z z z z s l e e p w e l l p e o p...
1578625,0,Sentiment140,ZzzZzZzzzZ... wait no I have homework.,zzzzzzzzzz wait homework,z z z z z z z z z z w a i t n o i ...
1578626,0,Sentiment140,"ZzZzzzZZZZzzz meh, what am I doing up again?",zzzzzzzzzzzzz meh,z z z z z z z z z z z z z m e h w h a t ...
1578627,0,Sentiment140,"Zzzzzzzzzzzzzzzzzzz, I wish",zzzzzzzzzzzzzzzzzzz wish,z z z z z z z z z z z z z z z z z z z i w...


## tokenization
    Function to tokenize the data

In [20]:
# import relevant libraries
import re
def clean_bag(data):
    clean_lines = [line for line in data if "[" not in line and "]" not in line]
    clean_song = " ".join(clean_lines)
    for symbol in ",.'?!()'":
        clean_song = clean_song.replace(symbol, "")
        clean_song = clean_song.replace("\n", " ")
        clean_song = clean_song.replace(".", " ")
        pattern = r'[()]'
        clean_song = re.sub(pattern, '', clean_song)
        return clean_song.lower()

# apply pre-processing to the 'text' column
df['CleanSentiment'] = df['SentimentText'].apply(clean_bag)

### ntlk 

In [21]:
# nltk's word_tokenize() function on the song string to fully tokenized
df["ntlk_tokenized"] = df['SentimentText'].apply(word_tokenize)

# preview df
df.head(7)

Unnamed: 0_level_0,Sentiment,SentimentSource,SentimentText,ProcessedSentimentText,CleanSentiment,ntlk_tokenized
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0,Sentiment140,is so sad for my APL frie...,sad apl friend,i s ...,"[is, so, sad, for, my, APL, friend, ............."
2,0,Sentiment140,I missed the New Moon trail...,missed new moon trailer,i m i ...,"[I, missed, the, New, Moon, trailer, ...]"
3,1,Sentiment140,omg its already 7:30 :O,omg already 730,o m g i t s a ...,"[omg, its, already, 7:30, :, O]"
4,0,Sentiment140,.. Omgaga. Im sooo im gunna CRy. I'...,omgaga im sooo im gunna cry ive dentist since ...,o m g a g a i m ...,"[.., Omgaga, ., Im, sooo, im, gunna, CRy, ., I..."
5,0,Sentiment140,i think mi bf is cheating on me!!! ...,think mi bf cheating tt,i t h i n k m i b f ...,"[i, think, mi, bf, is, cheating, on, me, !, !,..."
6,0,Sentiment140,or i just worry too much?,worry much,o r i j u s t w o r r ...,"[or, i, just, worry, too, much, ?]"
7,1,Sentiment140,Juuuuuuuuuuuuuuuuussssst Chillin!!,juuuuuuuuuuuuuuuuussssst chillin,j u u u u u u u u u u u u u u u ...,"[Juuuuuuuuuuuuuuuuussssst, Chillin, !, !]"


### count vectorization
    Function to return each item count

In [27]:
# columns to compare
col1 = list(df["ProcessedSentimentText"])
col2 = list(df["CleanSentiment"])
col3 = list(df["ntlk_tokenized"])
col1

['sad apl friend',
 'missed new moon trailer',
 'omg already 730',
 'omgaga im sooo im gunna cry ive dentist since 11 suposed 2 get crown put 30mins',
 'think mi bf cheating tt',
 'worry much',
 'juuuuuuuuuuuuuuuuussssst chillin',
 'sunny work tomorrow tv tonight',
 'handed uniform today miss already',
 'hmmmm wonder number',
 'must think positive',
 'thanks hater face day 112102',
 'weekend sucked far',
 'jb isnt showing australia',
 'ok thats win',
 'lt way feel right',
 'awhhe man im completely useless rt funny twitter',
 'feeling strangely fine im gon na go listen semisonic celebrate',
 'huge roll thunder nowso scary',
 'cut beard growing well year im gon na start happy meantime',
 'sad iran',
 'wompppp wompp',
 'youre one see cause one else following youre pretty awesome',
 'ltsad level 3 writing massive blog tweet myspace comp shut lost lay fetal position',
 'headed hospitol pull golf tourny 3rd place think reripped something yeah',
 'boring whats wrong please tell',
 'cant bothe

In [25]:
# import neccesary library
from collections import Counter

# function to return count for the columns
def count_vectorize(col1, col2, col3):
    # count the frequency of each word in each column using Counter
    word_counts_1 = Counter(col1)
    word_counts_2 = Counter(col2)
    word_counts_3 = Counter(col3)
    
    # return the word counts for each column as a tuple
    return word_counts_1, word_counts_2, word_counts_3


# columns to compare
col1 = list(df["ProcessedSentimentText"])
col2 = list(df["CleanSentiment"])
col3 = list(df["ntlk_tokenized"])

# call function
count_vectorize(col1,col2,col3)

TypeError: unhashable type: 'list'

In [None]:
# drop the sentimentsource column
processed_df = df.drop(["SentimentSource"], axis=1)

processed_df

## vectorization
convert the preprocessed text data into a numerical representation using BoW.

In [None]:
# define X and y
X = df[['ProcessedSentimentText']]

# create a CountVectorizer object
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

y = df[['Sentiment']]

In [None]:
# import necessary libraries
from sklearn.model_selection import train_test_split

# split the pre-processed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['ProcessedSentimentText']], df[['Sentiment']], test_size=0.2, random_state=42)

In [None]:
# import relevant libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.sparse import csr_matrix

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X3, y, test_size=0.2, random_state=42)

# train a logistic regression model on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

I analyze the text data using a tool called the Fourier transform. This tool helps us understand the different patterns and frequencies in the text. I use this information to figure out how people feel in the text. For example, we might find that certain patterns are associated with happy or sad feelings.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft

# Convert the SentimentText column to a numpy array
text_data = np.array(processed_df['ProcessedSentimentText'])

# Apply the DFT to the text data
dft = fft(text_data)

# Calculate the power spectrum of the DFT
power_spectrum = np.abs(dft) ** 2

# Plot the power spectrum
freq = np.fft.fftfreq(len(power_spectrum))
plt.plot(freq, power_spectrum)
plt.xlabel('Frequency')
plt.ylabel('Power')
plt.title('Power Spectrum of SentimentText Before Processing')
plt.show()


In [None]:
from scipy.fft import fft

# apply Fourier transform to the BOW representation
fft_representation = fft(bow_sparse_matrix)

# print the resulting Fourier coefficients
print(fft_representation.toarray())


In [None]:
plt.hist(freqs[np.argmax(X_freq, axis=1)], bins=20) # plot most prominent frequency for each document
plt.xlabel('Frequency')
plt.ylabel('Count')
plt.show()