# Introduction
**Sentiment:** When someone feels good about something, it's called a positive feeling, and when they feel bad about something, it's called a negative feeling, those feelings are "sentiment."

# OSEMN
* Obtain
* Scrub
* Explore 
* Model

## 1. Obtain

The Twitter Sentiment Analysis Dataset is a corpus of 1,578,627 classified tweets, with each tweet marked as 1 for positive sentiment and 0 for negative sentiment. The dataset is based on data from the University of Michigan Sentiment Analysis competition on Kaggle and the Twitter Sentiment Corpus by Niek Sanders. It is recommended to use 1/10 of the dataset for testing and the rest for training. The dataset has been used to achieve a 75% accuracy rate with a simple Naive Bayesian classification algorithm. The use of natural language processing can be helpful in extracting context and identifying features that contribute towards sentiment deduction. However, it is important to note that social informal communication, such as tweets, may not conform to grammatical rules and contain shortened words and overuse of punctuation. Despite these limitations, the dataset provides a good starting point for sentiment analysis modeling.

### Load the Data

In [1]:
# import relevant libraries
import pandas as pd

# read the csv file to table
df = pd.read_csv("/home/munyao/Desktop/flat_iron_school/Moringa/phase_4/NLP/Data/Sentiment Analysis Dataset.csv", on_bad_lines='skip', index_col=0)

# preview first 7 rows of dataset.
df.head(7)

Unnamed: 0_level_0,Sentiment,SentimentSource,SentimentText
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,Sentiment140,is so sad for my APL frie...
2,0,Sentiment140,I missed the New Moon trail...
3,1,Sentiment140,omg its already 7:30 :O
4,0,Sentiment140,.. Omgaga. Im sooo im gunna CRy. I'...
5,0,Sentiment140,i think mi bf is cheating on me!!! ...
6,0,Sentiment140,or i just worry too much?
7,1,Sentiment140,Juuuuuuuuuuuuuuuuussssst Chillin!!


## Scrub
* Removing stop words (words that are very common and do not add much meaning to the text)
* Removing punctuation and special characters
* Tokenizing the text (splitting it into words or phrases)
* Stemming or lemmatizing the words (reducing them to their base form)
* Removing URLs, mentions, or hashtags if you are working with social media data

In [2]:
df[["SentimentText"]]

Unnamed: 0_level_0,SentimentText
ItemID,Unnamed: 1_level_1
1,is so sad for my APL frie...
2,I missed the New Moon trail...
3,omg its already 7:30 :O
4,.. Omgaga. Im sooo im gunna CRy. I'...
5,i think mi bf is cheating on me!!! ...
...,...
1578623,Zzzzzz.... Finally! Night tweeters!
1578624,"Zzzzzzz, sleep well people"
1578625,ZzzZzZzzzZ... wait no I have homework.
1578626,"ZzZzzzZZZZzzz meh, what am I doing up again?"


In [3]:
# check missing
df.isnull().sum()

Sentiment          0
SentimentSource    0
SentimentText      0
dtype: int64

### Text Preprocessing.
>A function in python using the Natural Language Toolkit (NLTK) library to perform the text preprocessing.

In [4]:
#import relevant libraries
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# download the stopwords and lemmatizer data
# nltk.download('stopwords')
# nltk.download('wordnet')

# define the pre-processing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|@[^\s]+|#\S+', '', text)

    # remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # tokenize the text into words
    tokens = nltk.word_tokenize(text)

    # remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # perform lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # join the tokens back into a string
    text = ' '.join(tokens)

    return text

# apply pre-processing to the 'text' column
df['ProcessedSentimentText'] = df['SentimentText'].apply(preprocess_text)

# show the processed DataFrame
df.tail(7)

In [6]:
# drop the sentimentsource column
processed_df = df.drop(["SentimentSource"], axis=1)

processed_df

Unnamed: 0_level_0,Sentiment,SentimentText,ProcessedSentimentText
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,is so sad for my APL frie...,sad apl friend
2,0,I missed the New Moon trail...,missed new moon trailer
3,1,omg its already 7:30 :O,omg already 730
4,0,.. Omgaga. Im sooo im gunna CRy. I'...,omgaga im sooo im gunna cry ive dentist since ...
5,0,i think mi bf is cheating on me!!! ...,think mi bf cheating tt
...,...,...,...
1578623,1,Zzzzzz.... Finally! Night tweeters!,zzzzzz finally night tweeter
1578624,1,"Zzzzzzz, sleep well people",zzzzzzz sleep well people
1578625,0,ZzzZzZzzzZ... wait no I have homework.,zzzzzzzzzz wait homework
1578626,0,"ZzZzzzZZZZzzz meh, what am I doing up again?",zzzzzzzzzzzzz meh


## vectorization
convert the preprocessed text data into a numerical representation using BoW.

In [36]:
# define X and y
X = df[['ProcessedSentimentText']]
# create a CountVectorizer object
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

y = df[['Sentiment']]

In [20]:
# import necessary libraries
from sklearn.model_selection import train_test_split

# split the pre-processed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['ProcessedSentimentText']], df[['Sentiment']], test_size=0.2, random_state=42)

In [37]:
# import relevant libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.sparse import csr_matrix

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X3, y, test_size=0.2, random_state=42)

# train a logistic regression model on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

ValueError: Found input variables with inconsistent numbers of samples: [1, 1578612]

I analyze the text data using a tool called the Fourier transform. This tool helps us understand the different patterns and frequencies in the text. I use this information to figure out how people feel in the text. For example, we might find that certain patterns are associated with happy or sad feelings.

In [38]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft

# Convert the SentimentText column to a numpy array
text_data = np.array(processed_df['ProcessedSentimentText'])

# Apply the DFT to the text data
dft = fft(text_data)

# Calculate the power spectrum of the DFT
power_spectrum = np.abs(dft) ** 2

# Plot the power spectrum
freq = np.fft.fftfreq(len(power_spectrum))
plt.plot(freq, power_spectrum)
plt.xlabel('Frequency')
plt.ylabel('Power')
plt.title('Power Spectrum of SentimentText Before Processing')
plt.show()


ValueError: could not convert string to float: 'sad apl friend'

In [None]:
from scipy.fft import fft

# apply Fourier transform to the BOW representation
fft_representation = fft(bow_sparse_matrix)

# print the resulting Fourier coefficients
print(fft_representation.toarray())


In [None]:
plt.hist(freqs[np.argmax(X_freq, axis=1)], bins=20) # plot most prominent frequency for each document
plt.xlabel('Frequency')
plt.ylabel('Count')
plt.show()