## Sentiment Analysis
To perform sentiment analysis, we used nltk, specifically the SentimentIntensityAnalyzer class from nltk.sentiment and the word_tokenize function from nltk.tokenize. We read the text files containing the tweets line by line, with one tweet per line.

The word_tokenize function breaks a sentence into individual words so that each word can be handled on its own. Note that polarity_scores expects a plain string and splits it into words internally, so the code below passes each raw line to it rather than a list of tokens.

SentimentIntensityAnalyzer relies on a pre-built lexicon called VADER (Valence Aware Dictionary and sEntiment Reasoner), which is specifically attuned to sentiment expressed in social media. The lexicon maps words to sentiment scores derived from human ratings. It also covers slang and emoticons that are common in social media but missing from traditional sentiment lexicons.

We created one SentimentIntensityAnalyzer object and called its polarity_scores method on each tweet. The method returns a dictionary with four scores: 'neg', 'neu', and 'pos', which give the proportions of the text rated negative, neutral, and positive, and 'compound', a single normalized score between -1 (most negative) and +1 (most positive). Finally, we looped over the tweets and appended the four scores to the end of each line as tab-separated columns.

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements, known as tokens. Tokenization is an important step in natural language processing (NLP) because it allows the model to work with individual words or phrases, rather than the entire text.
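To make the idea concrete, here is a minimal tokenization sketch that uses only a regular expression. The helper simple_tokenize is a hypothetical stand-in written for illustration; it is not an nltk function and is not what the notebook ran. nltk's word_tokenize follows Penn Treebank conventions and splits some text differently (for example, it breaks "don't" into "do" and "n't").

```python
import re


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper, not part of nltk)."""
    # \w+(?:'\w+)? keeps contractions such as "don't" together;
    # [^\w\s] emits every remaining punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


tokens = simple_tokenize("I don't like this phone, it's too slow!")
print(tokens)
```

Each element of the resulting list is one token, so later steps can count, filter, or score words individually.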

In [1]:
import os

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # tokenizer data, needed only if word_tokenize is called
# The VADER lexicon maps words, slang, and emoticons common in social media
# to sentiment scores derived from human ratings.
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/sarai/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
def read_text_file(filepath):
    """
    Read the text file located at the given file path and return its content as a list of strings.

    Parameters:
    filepath: A string representing the file path of the text file to read.

    Returns:
    data: A list of strings, where each string in the list represents a line in the text file.
    """

    # Tweets often contain emoji and other non-ASCII characters, so set the encoding explicitly
    with open(filepath, 'r', encoding='utf-8') as file:
        data = file.readlines()
    return data


def write_text_file(filepath, data):
    """
    Write the given list of strings to a text file located at the given file path.

    Parameters:
    filepath: A string representing the file path of the text file to write.
    data: A list of strings, where each string in the list should represent a line in the text file.

    Returns:
    None
    """

    # Use the same encoding as read_text_file so the round trip is lossless
    with open(filepath, 'w', encoding='utf-8') as file:
        file.writelines(data)


In [3]:
def add_sentiment_analysis(data):
    """
    Perform sentiment analysis on each line of text in the given list using the VADER sentiment analysis tool.
    For each line of text, the function appends the sentiment analysis scores to the end of the line as tab-separated values.
    Returns a new list of strings with the updated lines of text.
    
    Parameters:
    data: A list of strings representing each line of text to be analyzed.
    
    Returns:
    A new list of strings with the updated lines of text.
    """
    # Create a SentimentIntensityAnalyzer object from the nltk package
    sia = SentimentIntensityAnalyzer()
    new_data = []
    # Loop through each line of text in the input list
    for line in data:
        # Perform sentiment analysis on the line using the SentimentIntensityAnalyzer object
        sentiment = sia.polarity_scores(line)
        # Append the sentiment analysis scores to the end of the line as tab-separated values
        new_line = line.strip() + f"\t{sentiment['neg']}\t{sentiment['neu']}\t{sentiment['pos']}\t{sentiment['compound']}\n"
        # Append the updated line of text to the output list
        new_data.append(new_line)
    # Return the list of updated lines of text
    return new_data
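Downstream steps will usually need to read the appended scores back. The following is a minimal sketch of one way to do that, assuming the line format written by add_sentiment_analysis (text followed by four tab-separated numbers). The helper split_scored_line is hypothetical and not part of the original notebook, and the score values in the example line are made up for illustration.

```python
def split_scored_line(line):
    """Split an annotated line into (text, scores). Hypothetical helper for illustration."""
    # Split from the right so that tabs inside the tweet text itself are left alone
    text, neg, neu, pos, compound = line.rstrip("\n").rsplit("\t", 4)
    scores = {
        "neg": float(neg),
        "neu": float(neu),
        "pos": float(pos),
        "compound": float(compound),
    }
    return text, scores


# Example line in the assumed format; the numbers are invented, not real VADER output
example = "Great launch today!\t0.0\t0.4\t0.6\t0.66\n"
text, scores = split_scored_line(example)
print(text)
print(scores["compound"])
```

Splitting from the right with a limit of four keeps the original text intact even when a tweet contains tab characters.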


In [4]:
def process_text_files(directory):
    """
    Process all text files in the given directory.
    Each text file should contain one text (for example, one tweet) per line.
    The function performs sentiment analysis on each line of text, and writes the updated text back to the same file.
    
    Parameters:
    directory: The directory containing the text files to be processed.
    
    Returns:
    None. Each .txt file is overwritten in place with the sentiment scores appended to every line.
    """
    # Get a list of all text files in the directory
    files = [f for f in os.listdir(directory) if f.endswith('.txt')]

    # Loop through each file and process its contents
    for file in files:
        # Construct the full file path
        filepath = os.path.join(directory, file)

        # Read the data from the text file
        data = read_text_file(filepath)

        # Perform sentiment analysis on the data
        data_with_sentiment = add_sentiment_analysis(data)

        # Write the updated data back to the text file
        write_text_file(filepath, data_with_sentiment)
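
Because process_text_files overwrites each file in place, calling it a second time on the same directory appends a second set of four scores to every line. One way to avoid this, sketched below under stated assumptions, is to write the annotated copies to a separate output directory and leave the source files untouched. The function name process_to_output_dir, the dst_dir parameter, and the annotate callback are all hypothetical and not part of the original notebook; annotate stands in for add_sentiment_analysis.

```python
import os


def process_to_output_dir(src_dir, dst_dir, annotate):
    """Annotate every .txt file in src_dir and write the result to dst_dir.

    annotate is any function that maps a list of lines to a list of lines,
    for example add_sentiment_analysis. Source files are never modified.
    Returns the list of output file paths that were written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(src_dir)):
        # Skip anything that is not a text file
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(src_dir, name), "r", encoding="utf-8") as src:
            lines = src.readlines()
        out_path = os.path.join(dst_dir, name)
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(annotate(lines))
        written.append(out_path)
    return written
```

With this variant, a call such as process_to_output_dir(input_directory, output_directory, add_sentiment_analysis) can be repeated safely, since each run starts from the unmodified source files.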



In [6]:
# Run the functions to add the sentiment analysis to our data.
# Files are overwritten in place, so re-running this cell appends a second set of scores.
# news
process_text_files("/Users/sarai/Documents/DataAnalytics/_PROJECT/data/feb1March19/")
# ceos
process_text_files("/Users/sarai/Documents/DataAnalytics/_PROJECT/data/feb1March19/ceos/")
# companies
process_text_files("/Users/sarai/Documents/DataAnalytics/_PROJECT/data/feb1March19/companies/")


In [19]:
#errors
#process_text_files("/Users/sarai/Documents/DataAnalytics/_PROJECT/data/feb1March1/ceos/")