# Introduction
**Sentiment:** The feeling a person has toward something. Feeling good about it is a positive sentiment; feeling bad about it is a negative sentiment.

# OSEMN
* Obtain
* Scrub
* Explore
* Model
* iNterpret

## 1. Obtain

The Twitter Sentiment Analysis Dataset is a corpus of 1,578,627 classified tweets, each marked 1 for positive sentiment or 0 for negative sentiment. It is based on data from the University of Michigan Sentiment Analysis competition on Kaggle and the Twitter Sentiment Corpus by Niek Sanders. The recommended split is 1/10 of the dataset for testing and the rest for training. A simple Naive Bayes classifier has reached 75% accuracy on this dataset. Natural language processing can help extract context and identify the features that contribute to sentiment. Note, however, that informal social communication such as tweets often ignores grammatical rules and contains shortened words and excessive punctuation. Despite these limitations, the dataset is a good starting point for sentiment analysis modeling.
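To make the baseline idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The ten tweets and their labels are hypothetical toy data, not rows from the dataset, and the last 1/10 is held out by simple slicing; on real data you would shuffle first. It is an illustration of the technique, not the setup that produced the 75% figure.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy tweets: 1 = positive, 0 = negative
tweets = [
    "i love this movie", "what a great day", "so happy right now",
    "best concert ever", "i hate mondays", "this is awful",
    "so sad and tired", "worst service ever", "feeling terrible today",
    "i love this great day",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# Hold out the last 1/10 of the data for testing
split = len(tweets) - len(tweets) // 10
model = train_naive_bayes(tweets[:split], labels[:split])
held_out_prediction = predict(model, tweets[split])
print(held_out_prediction, labels[split])
```

With real data, the same split could be produced with `train_test_split(..., test_size=0.1)` from scikit-learn.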

### Load the Data

In [None]:
# import relevant libraries
import pandas as pd

# read the csv file to table
df = pd.read_csv("/home/munyao/Desktop/flat_iron_school/Moringa/phase_4/NLP/Data/Sentiment Analysis Dataset.csv", on_bad_lines='skip', index_col=0)
df.columns

## 2. Scrub
* Removing stop words (words that are very common and do not add much meaning to the text)
* Removing punctuation and special characters
* Tokenizing the text (splitting it into words or phrases)
* Stemming or lemmatizing the words (reducing them to their base form)
* Removing URLs, mentions, or hashtags if you are working with social media data
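The steps above can be traced on a single tweet. The sketch below uses only the standard library so each intermediate result is easy to inspect. The tweet and the tiny stop-word set are hypothetical, whitespace splitting stands in for a real tokenizer, and lemmatization is left out because it needs a lexical resource such as WordNet.

```python
import re
import string

# Hypothetical example tweet and a tiny illustrative stop-word set
# (a real stop-word list is much longer)
tweet = "Loving the new phone!!! Thanks @shop_help http://example.com #happy"
stop_words = {"the", "a", "an", "is", "and", "to"}

# 1. Lowercase so that "Loving" and "loving" count as the same word
lowered = tweet.lower()

# 2. Remove URLs, mentions, and hashtags
no_noise = re.sub(r"http\S+|www\S+|@\S+|#\S+", "", lowered)

# 3. Remove punctuation and special characters
no_punct = no_noise.translate(str.maketrans("", "", string.punctuation))

# 4. Tokenize (whitespace split as a stand-in for a real tokenizer)
tokens = no_punct.split()

# 5. Remove stop words
kept = [token for token in tokens if token not in stop_words]

print(kept)  # ['loving', 'new', 'phone', 'thanks']
```

The order matters here: URLs, mentions, and hashtags are removed before punctuation is stripped, because the pattern relies on characters such as `@`, `#`, and `:`.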

In [None]:
df

In [None]:
# check missing
df.isnull().sum()

>The function below uses the Natural Language Toolkit (NLTK) library in Python to perform the text preprocessing steps listed above.

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the tokenizer, stop-word, and lemmatizer data
# (newer NLTK releases may also need 'punkt_tab' for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once; recreating them on every call is slow on a large corpus
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the pre-processing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|@[^\s]+|#\S+', '', text)

    # Remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text into words
    tokens = nltk.word_tokenize(text)

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # Perform lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Join the tokens back into a string
    return ' '.join(tokens)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define the dataset names, sizes, and sentiment scores
dataset_names = ['Sentiment140', 'Kaggle']
dataset_sizes = [1577269, 1343]
dataset_scores = [0.2, 0.8] 

# Create a pandas DataFrame with the dataset information
data = pd.DataFrame({'Dataset': dataset_names, 'Size': dataset_sizes, 'Score': dataset_scores})

# Create a bubble chart using seaborn
sns.scatterplot(data=data, x='Score', y='Dataset', size='Size', sizes=(100, 1000), alpha=0.7)
plt.title('Sentiment and tweet count for each dataset')
plt.xlabel('Sentiment score')
plt.ylabel('Dataset')
plt.grid(True)
plt.show()

In [None]:
# Apply pre-processing to the 'SentimentText' column
df['ProcessedSentimentText'] = df['SentimentText'].apply(preprocess_text)

# Show the processed DataFrame
df

In [None]:
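# df.drop() with no arguments raises a ValueError.
# Assumption: the intent was to drop the raw text column now that it has been processed.
df = df.drop(columns=['SentimentText'])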

In [None]:
from sklearn.model_selection import train_test_split

# Split the pre-processed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['ProcessedSentimentText'], df['Sentiment'], test_size=0.2, random_state=42)


## Vectorization

In [None]:
# import relevant libraries
from sklearn.feature_extraction.text import CountVectorizer

# training documents
documents = X_train

# create a CountVectorizer object
vectorizer = CountVectorizer()

# fit the vocabulary and build the bag-of-words matrix
# (fit_transform already returns a SciPy sparse matrix in CSR format)
bow_representation = vectorizer.fit_transform(documents)

# print the shape of the sparse matrix
print(bow_representation.shape)

# display a summary of the sparse matrix
bow_representation

Next, I apply the Fourier transform to the bag-of-words representation. The transform decomposes each document's term-count vector into frequency components, and I explore whether those components relate to the sentiment of the text. For example, certain frequency patterns might turn out to be associated with positive or negative tweets.

**Fourier transform**
>A mathematical technique that decomposes a time-domain signal into its constituent frequencies. It is named after Joseph Fourier, who discovered that any periodic waveform can be expressed as a sum of sine and cosine waves of different frequencies.

>In mathematical terms, the Fourier transform of a continuous-time signal x(t) is defined as:

>>X(f) = ∫ x(t) e^(-j2πft) dt

>where X(f) is the frequency-domain representation of the signal x(t), f is the frequency in hertz, and the integral runs over all time t. The Fourier transform maps a function of time into a function of frequency.

>The inverse Fourier transform recovers the time-domain signal from its frequency-domain representation. It is defined as:

>>x(t) = ∫ X(f) e^(j2πft) df

>where the integral runs over all frequencies f. With frequency measured in hertz, no normalization factor appears in front of the integral.
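As a small numerical illustration of the transform pair, the sketch below uses NumPy's discrete FFT, which is the sampled counterpart of the continuous integrals. The 5 Hz sine wave and the 100 Hz sampling rate are arbitrary choices made for this example. It finds the dominant frequency of the signal and then recovers the signal with the inverse transform.

```python
import numpy as np

fs = 100                           # sampling rate in Hz (assumed for this example)
t = np.arange(0, 1, 1 / fs)        # one second of sample times
x = np.sin(2 * np.pi * 5 * t)      # a 5 Hz sine wave

# Forward transform: time domain -> frequency domain
X = np.fft.fft(x)
freqs = np.fft.fftfreq(len(x), d=1 / fs)  # frequency in Hz of each FFT bin

# Look for the strongest component among the non-negative frequencies
half = len(x) // 2
peak_freq = freqs[np.argmax(np.abs(X[:half]))]

# Inverse transform: frequency domain -> time domain
x_recovered = np.fft.ifft(X).real

print(peak_freq)                    # dominant frequency in Hz
print(np.allclose(x, x_recovered))  # True if the round trip recovers the signal
```

In the discrete case `np.fft.ifft` applies a 1/N scaling, where N is the number of samples; this is a property of the discrete transform and does not appear in the continuous definition.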

In [None]:
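from scipy.fft import fft

# scipy.fft.fft needs a dense array, and densifying the full matrix would
# likely exhaust memory, so work on a small sample of documents
n_sample = 100  # assumed sample size; adjust to the available memory
bow_sample = bow_representation[:n_sample].toarray()

# apply Fourier transform along each document's row of the BOW representation
fft_representation = fft(bow_sample, axis=1)

# print the resulting (complex) Fourier coefficients
print(fft_representation)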


In [None]:
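import numpy as np
from scipy.fft import fftfreq

X_freq = np.abs(fft_representation)  # magnitude spectrum of each document
freqs = fftfreq(X_freq.shape[1])     # frequency of each FFT bin (cycles per term index)
plt.hist(freqs[np.argmax(X_freq, axis=1)], bins=20) # plot most prominent frequency for each document
plt.xlabel('Frequency')
plt.ylabel('Count')
plt.show()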