# Analysis of the instances of "yo" in NLTK's Twitter dataset


For my final project, I developed a program that uses the Twitter sample corpus from NLTK to identify instances of the word "yo" and examine how it is used in social media interactions.

The objective of the project is to examine the sentiment associated with "yo", identify the parts of speech most commonly assigned to "yo", and determine the typical environments in which it is used.





My project contains three main parts:

First, the program calculates the sentiment score of each instance of "yo" using the SentimentIntensityAnalyzer from NLTK and stores the scores in a list. The code then calculates the average sentiment score of all "yo" instances.

The program then uses the POS tagger from NLTK to tag the words in each tweet and determine the POS of each instance of "yo". It then creates a frequency distribution of the POS tags for all instances of "yo".

Finally, the program collects the bigrams formed by "yo" and the word immediately before or after it in the dataset, and builds a frequency distribution for each set of bigrams.





In [41]:
# Import NLTK and load the Twitter sample corpus as a list of tweet strings
import nltk
tweets = nltk.corpus.twitter_samples.strings()
# List to store all instances of "yo"
yo_instances = []

# Tokenize each tweet into individual words
for tweet in tweets:
    words = nltk.word_tokenize(tweet)

    # Store the tweet once for every token that matches "yo" (case-insensitive)
    for word in words:
        if word.lower() == "yo":
            yo_instances.append(tweet)

# Print number of instances of "yo"
print("Total instances of 'yo':", len(yo_instances))


Total instances of 'yo': 13



I was very disappointed in the number of instances of "yo" found in the corpus. 

I was expecting to find more and considered extracting data from the Twitter API, or even the Reddit API. In the end I decided to stick with my original plan, so that I would not overcomplicate the project beyond what I felt I could carry out successfully in the time allotted.

For future exploration, I would like to incorporate a larger dataset to get a more accurate conclusion of the ways in which "yo" is used in social media interaction.





However, I was still able to implement all parts of my project with the data available to me through the NLTK library. Also, because the dataset was so small, I was able to check the accuracy of the program's outputs: I could sort through the data myself and see how my own analysis compared to the way my program analyzed the corpus.


Part 1:

Part one consists of calculating the sentiment score of the tweets in which "yo" appears. This was the most difficult part of the project.

I initially planned to use Naïve Bayes, but ultimately decided to explore more options for sentiment analysis in NLTK. I used NLTK's SentimentIntensityAnalyzer (SIA) to determine the sentiment score of the instances of "yo" for my project. The SIA is a tool that uses a lexicon-based approach to determine the sentiment (positive, negative, neutral) of a text. The lexicon is used to calculate a compound sentiment score that ranges from -1 (the most negative) to +1 (the most positive), with 0 indicating a neutral sentiment.



In [5]:
# Download the VADER lexicon that the SentimentIntensityAnalyzer relies on
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")

# Create the analyzer once and reuse it for every tweet
analyzer = SentimentIntensityAnalyzer()

# Function to calculate the sentiment score
def get_sentiment_score(sentence):
    # Calculate the sentiment scores of the tweet text
    score = analyzer.polarity_scores(sentence)
    # Return the compound score, a value between -1 and 1
    return score["compound"]

# Calculate sentiment score of "yo" instances and store in list
sentiment_scores = []
for instance in yo_instances:
    sentiment_scores.append(get_sentiment_score(instance))

# Calculate average sentiment score of "yo" instances
avg_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)

# Print average sentiment score of "yo" instances
print("Average sentiment score of 'yo' instances:", avg_sentiment_score)


Average sentiment score of 'yo' instances: 0.11000769230769232


The sentiment score of 0.11000769230769232 indicates that "yo" is used in a *slightly* positive way.
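To make "slightly positive" a bit more concrete, here is a minimal sketch that turns a compound score into a label. It assumes the cutoff of ±0.05 that the VADER documentation suggests as a typical threshold; NLTK does not apply this cutoff itself, and `label_compound` is a hypothetical helper name, not part of NLTK.

```python
def label_compound(score, threshold=0.05):
    """Map a compound score to a coarse label.

    The +/-0.05 threshold is an assumed convention, not an NLTK setting.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Average compound score reported by the cell above
print(label_compound(0.11000769230769232))
```

Under this assumed cutoff, the average lands just inside the positive range.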

However, I also calculated my own sentiment score after reading the tweets and labeling each one as positive, negative, or neutral based on my own judgment.

My sentiment analysis of "yo":
Positive: 4 instances
Negative: 3 instances
Neutral: 6 instances

sentiment score = (4(pos)*0.5 + 3(neg)*(-0.5) + 6(neut)*0) / 13(total) 
sentiment score = (2 - 1.5) / 13 
sentiment score = 0.0385
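The same arithmetic can be written as a short sketch. The counts are my hand labels from above, and the ±0.5 weights are my own simplifying assumption rather than anything taken from NLTK.

```python
# Hand-labeled counts from my reading of the 13 tweets
counts = {"positive": 4, "negative": 3, "neutral": 6}
# Assumed weights: a moderate +/-0.5 instead of the extreme +/-1
weights = {"positive": 0.5, "negative": -0.5, "neutral": 0.0}

total = sum(counts.values())
manual_score = sum(counts[label] * weights[label] for label in counts) / total
print(round(manual_score, 4))  # 0.0385
```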



From what I understand, the SentimentIntensityAnalyzer uses a more sophisticated method for assigning sentiment scores to text than my simplified approach. I chose to weight every positive instance at 0.5 and every negative instance at -0.5 because -1 is *extremely* negative and 1 is *extremely* positive. Even so, both my own analysis and NLTK's analysis produced results indicating that "yo" is used in slightly more positive settings overall, based on the data.


Sources used to implement NLTK's SIA:
https://www.nltk.org/api/nltk.sentiment.html

https://www.nltk.org/_modules/nltk/sentiment/sentiment_analyzer.html#SentimentAnalyzer

https://www.nltk.org/howto/sentiment.html

https://realpython.com/python-nltk-sentiment-analysis/#using-nltks-pre-trained-sentiment-analyzer


Part 2:

Part two consists of counting the POS frequency of each instance of "yo". The POS tagger was very inaccurate and had a lot of trouble distinguishing the parts of speech of the instances in which "yo" is used in the tweets.

In [42]:
# POS tagging using the universal tagset on each instance of "yo"
for instance in yo_instances:
    tokens = nltk.word_tokenize(instance)
    yo_tags = [tag for (word, tag) in nltk.pos_tag(tokens, tagset='universal') if word.lower() == 'yo']
    if yo_tags:
        # Print the tag and the tweet that "yo" occurred in to check the accuracy of the POS tagger
        print(f"'yo' = {yo_tags[0]} in the sentence: {instance}")

'yo' = NOUN in the sentence: Sometimes it be's like that, yo. Follow someone and then a few days later realise they're problematic as fuck. Life :(
'yo' = NOUN in the sentence: @KEEMSTARx yo dude. Fancy helping a fan out? I wanna grow a yt channel but can't purchase an Elgato :(
'yo' = NOUN in the sentence: @jenniferseon yo stop bragging not everybody got the skrillah to travel :(
'yo' = ADJ in the sentence: @aminadujean oh no :( is there anything yo ucan do?
'yo' = NOUN in the sentence: @jiarpi20 minal aidzin yo pi :D
'yo' = ADJ in the sentence: Woke up yo @trentowers fav' my tweet this made my day :))) now just @luketurner89 😂😂 http://t.co/oGw5sVij7G
'yo' = NOUN in the sentence: Can't date someone white. sorry. Yo grandpa prolly wanna burn me. :)
'yo' = NOUN in the sentence: @HollyyLive yo yall should invite me for ranked :))))
'yo' = NOUN in the sentence: @wuthering_alice I like that it doesn't have to be about consumerism :) Treat Yo Soul
'yo' = NOUN in the sentence: @TMobile Yo gi

Because the dataset is so small, I was able to determine the POS of each instance myself and compare my own tag (most often an interjection) with the tag NLTK assigned to each instance (typically a noun).

My analysis:
Interjection: 7
Typo: 2
Pronoun: 2
Proper noun: 1
Different language: 1 

NLTK POS Tagger: 
Noun: 11
Adj: 2





Although the POS tagger does well with many words, I predict that it will produce inaccurate tags for words that are used less frequently, words that could be considered "slang", and words that can belong to several different parts of speech depending on their use.

However, with more data, a more accurate conclusion can be made.



Part 3:

Part three lists the bigrams before and after occurrences of "yo" to show the environments in which the word typically occurs.

My purpose for doing this is to see if there are any commonalities across the environments.



In [40]:
yo = 'yo'

# List of bigrams for words that occur before and after "yo"
before_yo_bigrams = []
after_yo_bigrams = []
for tweet in tweets:
    tweet_tokens = nltk.word_tokenize(tweet.lower())
    for i in range(1, len(tweet_tokens)-1):
        if tweet_tokens[i] == yo and tweet_tokens[i-1].isalnum() and tweet_tokens[i+1].isalnum():
            before_yo_bigrams.append((tweet_tokens[i-1], yo))
            after_yo_bigrams.append((yo, tweet_tokens[i+1]))

# Print bigrams and frequency distribution
before_yo_fd = nltk.FreqDist(before_yo_bigrams)
after_yo_fd = nltk.FreqDist(after_yo_bigrams)
print(f"Bigrams before '{yo}':")
print(before_yo_fd.most_common())
print(f"Bigrams after '{yo}':")
print(after_yo_fd.most_common())


Bigrams before 'yo':
[(('keemstarx', 'yo'), 1), (('jenniferseon', 'yo'), 1), (('anything', 'yo'), 1), (('aidzin', 'yo'), 1), (('hollyylive', 'yo'), 1), (('treat', 'yo'), 1), (('tmobile', 'yo'), 1)]
Bigrams after 'yo':
[(('yo', 'dude'), 1), (('yo', 'stop'), 1), (('yo', 'ucan'), 1), (('yo', 'pi'), 1), (('yo', 'yall'), 1), (('yo', 'soul'), 1), (('yo', 'gim'), 1)]



Based on the outputs, I found that "yo" often follows someone's username. On Twitter, you address someone by writing @theirusername, and this was often the case in this dataset. Therefore, based on this dataset, "yo" is commonly used as a greeting or interjection in social media interactions.
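As a rough check on the username pattern, the sketch below counts how many tweets have "yo" directly after an @username. It uses five of the tweets printed in Part 2 and, as a simplifying assumption, splits on whitespace instead of using `nltk.word_tokenize`, so that the "@" stays attached to the username. `yo_follows_mention` is a hypothetical helper written for this illustration.

```python
# Five of the "yo" tweets printed in Part 2 (copied from the output above)
sample_tweets = [
    "@KEEMSTARx yo dude. Fancy helping a fan out? I wanna grow a yt channel but can't purchase an Elgato :(",
    "@jenniferseon yo stop bragging not everybody got the skrillah to travel :(",
    "@aminadujean oh no :( is there anything yo ucan do?",
    "@HollyyLive yo yall should invite me for ranked :))))",
    "@wuthering_alice I like that it doesn't have to be about consumerism :) Treat Yo Soul",
]

def yo_follows_mention(tweet):
    """Return True if some 'yo' token comes directly after an @username."""
    # Whitespace split (an assumption) keeps "@" attached to the username
    tokens = tweet.lower().split()
    return any(
        tok.strip(".,!?") == "yo" and tokens[i - 1].startswith("@")
        for i, tok in enumerate(tokens)
        if i > 0
    )

after_mention = [t for t in sample_tweets if yo_follows_mention(t)]
print(f"{len(after_mention)} of {len(sample_tweets)} sample tweets have 'yo' right after an @username")
```

On these five tweets the check finds three matches; the other two ("anything yo" and "Treat Yo") are the typo and pronoun-like uses rather than greetings.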





Overall, this project showed me how valuable programming can be as a tool for analyzing datasets, and also how it can generate inaccuracies that lead to incorrect conclusions. Moving forward, I believe that incorporating larger datasets, accounting for differences in dialects, and updating the lexicon are instrumental for improving the accuracy and efficacy of data analysis.

