# Introduction
**Sentiment:** The feeling a person expresses about something. Feeling good about it is a positive sentiment; feeling bad about it is a negative sentiment.

# OSEMN
* Obtain
* Scrub
* Explore
* Model
* iNterpret

# 1. Obtain

The Twitter Sentiment Analysis Dataset is a corpus of 1,578,627 labeled tweets, each marked 1 for positive sentiment or 0 for negative sentiment. It draws on data from the University of Michigan Sentiment Analysis competition on Kaggle and the Twitter Sentiment Corpus by Niek Sanders. The recommended split is to hold out 1/10 of the dataset for testing and train on the rest. A simple Naive Bayes classifier has reached 75% accuracy on this dataset.

Natural language processing can help extract context and identify features that contribute to sentiment. Note, however, that informal social communication such as tweets often ignores grammatical rules and contains shortened words and excessive punctuation. Despite these limitations, the dataset is a good starting point for sentiment analysis modeling.
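
The 1/10 hold-out mentioned above can be sketched with the standard library alone. This is only an illustration: the helper name `split_one_tenth` and the toy rows are assumptions, not part of the notebook.

```python
import random


def split_one_tenth(rows, seed=0):
    """Shuffle rows and hold out one tenth of them for testing."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_test = len(shuffled) // 10            # 1/10 of the rows go to the test set
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)


# toy stand-in for (tweet text, sentiment label) pairs
rows = [(f"tweet {i}", i % 2) for i in range(100)]
train_rows, test_rows = split_one_tenth(rows)
print(len(train_rows), len(test_rows))  # 90 10
```

With scikit-learn installed, `train_test_split(..., test_size=0.1, random_state=0)` from `sklearn.model_selection` does the same job and can also stratify by label.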

### Load necessary libraries

In [7]:
# import relevant libraries
import re
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, ENGLISH_STOP_WORDS
from sklearn.manifold import TSNE

%matplotlib inline

# download the NLTK resources used by the pre-processing step
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

# fix the random seed for reproducibility
np.random.seed(0)


### Load the Data

In [8]:
# read the CSV file into a DataFrame, skipping malformed lines; ItemID becomes the index
df = pd.read_csv("/home/munyao/Desktop/flat_iron_school/Moringa/phase_4/NLP/Data/Sentiment Analysis Dataset.csv", on_bad_lines='skip', index_col=0)

# preview first 7 rows of dataset.
df.head(7)

| ItemID | Sentiment | SentimentSource | SentimentText |
|---|---|---|---|
| 1 | 0 | Sentiment140 | is so sad for my APL frie... |
| 2 | 0 | Sentiment140 | I missed the New Moon trail... |
| 3 | 1 | Sentiment140 | omg its already 7:30 :O |
| 4 | 0 | Sentiment140 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 5 | 0 | Sentiment140 | i think mi bf is cheating on me!!! ... |
| 6 | 0 | Sentiment140 | or i just worry too much? |
| 7 | 1 | Sentiment140 | Juuuuuuuuuuuuuuuuussssst Chillin!! |


# 2. Scrub

In [9]:
# data shape
df.shape

(1578612, 3)

The frame holds 1,578,612 rows, 15 fewer than the 1,578,627 tweets cited for the corpus; the difference presumably comes from malformed lines dropped by `on_bad_lines='skip'`.

### 2.1 Cleaning and Normalizing
1. Convert the text to lowercase: This is done so that the analysis is not case-sensitive.
2. Remove URLs, mentions, and hashtags: These are typically not relevant to the analysis and can be removed.
3. Collapse repeated characters: Runs of the same character are shortened to two, so that elongated words such as "sooo" become "soo".
4. Remove punctuation and special characters: These are treated as noise here and removed.
5. Tokenize the text into words: This splits the text into individual words, which can then be analyzed separately.
6. Remove stop words: These are common words such as "the", "and", and "a" that do not typically carry much meaning and can be removed.
7. Perform lemmatization: This reduces words to their base form, so that variations of the same word are treated as the same (e.g. "walks" becomes "walk"). With the lemmatizer's default noun setting, verb forms such as "missed" are left unchanged.
8. Join the tokens back into a string: This reassembles the processed words into a single string that can be used for further analysis.



In [10]:
# build the stop-word set and the lemmatizer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# define the pre-processing function
def preprocess_text(text):
    # convert to lowercase
    text = text.lower()

    # remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|@\S+|#\S+', '', text)

    # collapse any run of a repeated character to two (e.g. 'sooo' -> 'soo')
    text = re.sub(r'(.)\1+', r'\1\1', text)

    # remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # tokenize the text into words
    tokens = nltk.word_tokenize(text)

    # remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # lemmatize each token (the default part of speech is noun)
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]

    # join the tokens back into a string
    return ' '.join(tokens)
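
The regex steps of `preprocess_text` can be tried on their own, without any NLTK data. The sketch below is an illustration under assumptions: the helper name `regex_clean` is hypothetical, and it skips tokenization, stop-word removal, and lemmatization. It also shows a side effect worth knowing: because `(.)` matches any character, runs of digits are shortened too.

```python
import re
import string


def regex_clean(text):
    """Apply only the regex and punctuation steps (hypothetical helper)."""
    text = text.lower()
    # drop URLs, @mentions, and #hashtags
    text = re.sub(r'http\S+|www\S+|@\S+|#\S+', '', text)
    # collapse any run of a repeated character down to two
    text = re.sub(r'(.)\1+', r'\1\1', text)
    # strip ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # normalize the whitespace left behind by the removals
    return ' '.join(text.split())


print(regex_clean("Juuuuuuuuuuuuuuuuussssst Chillin!!"))       # juusst chillin
print(regex_clean("@bob I won 1000 points!!! http://t.co/x"))  # i won 100 points
```

If numbers matter for the analysis, a letters-only pattern such as `r'([a-z])\1{2,}'` (an assumption, not what the notebook uses) would leave `1000` intact.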


### 2.2 Processed Data

In [11]:
# apply pre-processing to the 'SentimentText' column
df['ProcessedSentimentText'] = df['SentimentText'].apply(preprocess_text)

# preview the processed data
df.head(7)

| ItemID | Sentiment | SentimentSource | SentimentText | ProcessedSentimentText |
|---|---|---|---|---|
| 1 | 0 | Sentiment140 | is so sad for my APL frie... | sad apl friend |
| 2 | 0 | Sentiment140 | I missed the New Moon trail... | missed new moon trailer |
| 3 | 1 | Sentiment140 | omg its already 7:30 :O | omg already 730 |
| 4 | 0 | Sentiment140 | .. Omgaga. Im sooo im gunna CRy. I'... | omgaga im soo im gunna cry ive dentist since 1... |
| 5 | 0 | Sentiment140 | i think mi bf is cheating on me!!! ... | think mi bf cheating tt |
| 6 | 0 | Sentiment140 | or i just worry too much? | worry much |
| 7 | 1 | Sentiment140 | Juuuuuuuuuuuuuuuuussssst Chillin!! | juusst chillin |


# 3. Explore

I analyze the text data with the Fourier transform, which decomposes a numeric signal into its frequency components. Applied to a numeric representation of the tweets, it exposes recurring patterns, and I use those patterns to infer how people feel. For example, certain frequency patterns may turn out to be associated with positive or negative sentiment.
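
The Fourier analysis described above needs a numeric signal and a way to turn it into a power spectrum; this part of the notebook does not show that computation. A minimal sketch, assuming the power spectrum is the squared magnitude of the discrete Fourier transform and using a toy signal in place of real tweet features:

```python
import numpy as np

# toy signal: 32 samples containing exactly 4 cycles of a cosine
# (an assumed stand-in for a numeric sequence derived from text)
n_samples = 32
n = np.arange(n_samples)
signal = np.cos(2 * np.pi * 4 * n / n_samples)

# power spectrum = squared magnitude of the FFT coefficients
power = np.abs(np.fft.fft(signal)) ** 2

# frequencies (cycles per sample) matching each FFT bin
freq = np.fft.fftfreq(n_samples)

# the strongest non-negative frequency is bin 4, i.e. 4 / 32 = 0.125
peak_bin = int(np.argmax(power[: n_samples // 2]))
print(peak_bin, freq[peak_bin])
```

An array like `power` is the kind of input the `plot_power_spectrum` helper in the next cell takes; for this toy signal the plot would show two symmetric peaks at ±0.125.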

In [12]:
# import relevant libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, ENGLISH_STOP_WORDS

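# define a function to plot the power spectrum
def plot_power_spectrum(power_spectrum, title):
    # fftfreq returns frequencies in FFT order (zero, positive, then negative);
    # shift both arrays so the x-axis increases monotonically
    freq = np.fft.fftshift(np.fft.fftfreq(len(power_spectrum)))
    plt.plot(freq, np.fft.fftshift(power_spectrum))
    plt.xlabel('Frequency')
    plt.ylabel('Power')
    plt.title(title)
    plt.show()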