In [9]:
# Eri Osta
# ads829

## Lab 6B

<em>Lab 6 consists of two exercises on natural language processing through conducting a sentiment analysis on text data and social media.</em>

In this exercise, you will download a set of tweets from Twitter based on a trending hashtag, conduct a sentiment analysis on them, and plot the average sentiment over time.

You will download the tweets and save them to a CSV file for further analysis. Working from the CSV file saves you from having to connect to Twitter each time.

**Tasks**

1. Download 2000 original tweets that have a trending hashtag of your choice and save the tweets in a CSV file. Do not include retweets. You should select a popular hashtag that many people include repeatedly. <em>This may take several minutes to complete.</em>
2. Save the following information from the Tweet object into a CSV file: datetime, user's screen name, text, mentions (other users the tweet includes), and the hashtags in the text. For the mentions and hashtags, you may need to parse these out of the JSON object before saving it into the CSV file. The hashtag list will usually include the original hashtag you searched for, so do not include it in the CSV file. <em>HINT: Save all usernames, mentions, and hashtags as completely lower-cased strings. This will help you when you are counting the number of occurrences.</em>
3. Determine who was the most active Twitter user during this time frame in terms of the number of posts.
4. Calculate the average polarity and subjectivity for all tweets in your data set.
5. Conduct a social network analysis of the users in the data set. Users are connected by user mentions: a directed edge runs from a user to each user they mention in a tweet. Find the user with the highest in-degree centrality and the user with the highest out-degree centrality. There is no need to visualize the social network as there would be too many nodes.
6. Answer the questions below.
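
Tasks 1 and 2 ask for retweets to be excluded and for mentions and hashtags to be stored as lower-cased strings without the searched hashtag. The lab suggests reading these from the tweet's JSON object; the sketch below instead uses regular expressions on the tweet text, which is a simpler stand-in and not the method the lab prescribes. The helper names `parse_tweet` and `is_retweet` are hypothetical, and the `RT @` prefix check is only a heuristic.

```python
import re

# Assumption: a mention or hashtag is "@" or "#" followed by word characters
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def is_retweet(text):
    """Heuristic: classic retweets start with 'RT @'."""
    return text.startswith("RT @")


def parse_tweet(text, searched_hashtag):
    """Return (mentions, hashtags) found in the text, all lower-cased."""
    mentions = [m.lower() for m in MENTION_RE.findall(text)]
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    # Drop the hashtag that was used for the search itself
    hashtags = [h for h in hashtags if h != searched_hashtag.lower()]
    return mentions, hashtags


sample = "RT @RussianEmbassy: President #Putin: #Russia was doing everything possible"
print(is_retweet(sample))            # True
print(parse_tweet(sample, "Putin"))  # (['russianembassy'], ['russia'])
```

Rows for which `is_retweet` returns `True` could be skipped before writing the CSV file.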

**Important**

When working on this exercise, you will use your own Twitter authentication information. After you have completed the exercise, delete this information. The grader will replace it with their own keys for grading and testing. This information should not be shared with others.

**Submit CSV**

Submit your CSV file. It will be used for testing your lab and grading. 


In [10]:
import pandas as pd
import json
from textblob import TextBlob
import operator 
import networkx as nx
import matplotlib.pyplot as plt

# Load the CSV file into a pandas dataframe
df = pd.read_csv('tweets_hashtag_Putin.csv')

# Print the number of rows and columns in the dataset
print(f"Number of rows: {len(df)}, Number of columns: {len(df.columns)}")

Number of rows: 602, Number of columns: 4


In [11]:
# Group the dataframe by the 'user_name' column and count the number of rows for each group
user_counts = df.groupby('user_name').size().reset_index(name='counts')

# Sort the resulting dataframe in descending order by the count column
user_counts = user_counts.sort_values('counts', ascending=False)

# Print the user with the highest number of posts
most_active_user = user_counts.iloc[0]['user_name']
print(f"The most active Twitter user during this time frame is {most_active_user}.")

The most active Twitter user during this time frame is Asiyatu4.
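
Taking only the first row after sorting hides ties and treats differently cased spellings of one screen name as separate users. The following is a minimal sketch of a tie-aware count using only the standard library. The function name `most_active_users` and the sample list are illustrative assumptions, not data from the CSV file.

```python
from collections import Counter


def most_active_users(user_names):
    """Return (users, count) for every user tied at the highest post count."""
    # Lower-case the names so 'Asiyatu4' and 'asiyatu4' count as one user
    counts = Counter(name.lower() for name in user_names)
    top = max(counts.values())
    return sorted(u for u, c in counts.items() if c == top), top


# Hypothetical sample, not taken from the real data set
names = ["Asiyatu4", "asiyatu4", "Mousstach", "BHL"]
print(most_active_users(names))  # (['asiyatu4'], 2)
```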


In [12]:
# Initialize running totals for polarity and subjectivity
polarity_total = 0
subjectivity_total = 0

# Loop through each tweet in the dataframe
for index, row in df.iterrows():
    # Use TextBlob to perform sentiment analysis on the tweet text
    blob = TextBlob(row['text'])
    # Add the polarity and subjectivity values to running totals
    polarity_total += blob.sentiment.polarity
    subjectivity_total += blob.sentiment.subjectivity

# Calculate the average polarity and subjectivity for all tweets
num_tweets = len(df)
avg_polarity = polarity_total / num_tweets
avg_subjectivity = subjectivity_total / num_tweets

# Print the results
print(f"The average polarity for all tweets is {avg_polarity:.2f}.")
print(f"The average subjectivity for all tweets is {avg_subjectivity:.2f}.")

The average polarity for all tweets is 0.03.
The average subjectivity for all tweets is 0.29.
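
The exercise description also asks for the average sentiment over time, which the cells here do not compute. The sketch below shows one way to bucket polarity scores by hour. It assumes each tweet already has an ISO-formatted timestamp and a polarity score; the function name `average_polarity_by_hour` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_polarity_by_hour(rows):
    """rows: iterable of (iso_timestamp, polarity) pairs.

    Returns a dict mapping each hour (as a datetime) to the mean polarity
    of the tweets posted in that hour, in chronological order.
    """
    buckets = defaultdict(list)
    for stamp, polarity in rows:
        # Truncate the timestamp to the start of its hour
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(polarity)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical timestamps and scores, for illustration only
rows = [
    ("2023-02-21 10:05:00", 0.5),
    ("2023-02-21 10:40:00", 0.0),
    ("2023-02-21 11:15:00", -0.5),
]
hourly = average_polarity_by_hour(rows)
for hour, avg in hourly.items():
    print(hour, avg)
```

The keys and values of `hourly` could then be passed to `plt.plot` to draw the trend.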


In [13]:
# Loop through each tweet in the dataframe
for index, row in df.iterrows():
    # Use TextBlob to perform sentiment analysis on the tweet text
    blob = TextBlob(row['text'])
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    # Print the tweet text along with the polarity and subjectivity scores
    print(f"Tweet: {row['text']}")
    print(f"Polarity: {polarity:.2f}")
    print(f"Subjectivity: {subjectivity:.2f}")
    print('---')


Tweet: RT @RussianEmbassy: President #Putin: #Russia was doing everything possible to solve the Ukrainian crisis by peaceful means, patiently cond…
Polarity: 0.12
Subjectivity: 0.75
---
Tweet: RT @RussianEmbassy: President #Putin: #Russia was doing everything possible to solve the Ukrainian crisis by peaceful means, patiently cond…
Polarity: 0.12
Subjectivity: 0.75
---
Tweet: RT @RussianEmbassy: President #Putin: Those who plotted a new attack against #Donetsk and #Lugansk in #Donbass region understood that #Crim…
Polarity: 0.14
Subjectivity: 0.45
---
Tweet: RT @BHL: Thanks, Minister #Reznikov. Humbled by your words. You embody the resistance of #Ukraine against #Putin’s fascism. This film is yo…
Polarity: 0.20
Subjectivity: 0.20
---
Tweet: RT @RussianEmbassy: President #Putin: #Russia was doing everything possible to solve the Ukrainian crisis by peaceful means, patiently cond…
Polarity: 0.12
Subjectivity: 0.75
---
Tweet: US State Department Official Condemns #Putin's Suspending Russi

In [14]:
# Create an empty directed graph
graph = nx.DiGraph()

# Loop through each tweet in the dataframe
for index, row in df.iterrows():
    # Add the user and mention nodes to the graph if they don't already exist
    user = row['user_name']
    if not graph.has_node(user):
        graph.add_node(user)
    mentions = row['mentions']
    if mentions != '[]':
        # The mentions are stored as a stringified Python list; swap single quotes for double quotes so that json can parse it
        mentions = mentions.replace("'", '"')
        mentions = json.loads(mentions)
        for mention in mentions:
            if not graph.has_node(mention):
                graph.add_node(mention)
            # Add a directed edge from the user to the mention
            graph.add_edge(user, mention)


# Calculate the in-degree and out-degree centrality measures for each node
in_degree_centrality = nx.in_degree_centrality(graph)
out_degree_centrality = nx.out_degree_centrality(graph)

# Find the user with the highest in-degree centrality and the highest out-degree centrality
highest_in_degree = max(in_degree_centrality.items(), key=operator.itemgetter(1))[0]
highest_out_degree = max(out_degree_centrality.items(), key=operator.itemgetter(1))[0]

# Print the results
print(f"The user with the highest in-degree centrality is {highest_in_degree}.")
print(f"The user with the highest out-degree centrality is {highest_out_degree}.")

The user with the highest in-degree centrality is RussianEmbassy.
The user with the highest out-degree centrality is Mousstach.
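
To make the two measures concrete, here is a minimal standard-library sketch that computes them by hand. For a directed graph with n > 1 nodes, in-degree centrality is the number of distinct users who mention a node divided by n - 1, and out-degree centrality is the number of distinct users the node mentions divided by n - 1. The sketch assumes the `mentions` column holds a stringified Python list and parses it with `ast.literal_eval`, which also makes it easy to skip missing values. The function names and the sample rows are hypothetical.

```python
import ast
from collections import defaultdict


def parse_mentions(cell):
    """Parse a stringified list such as "['a', 'b']"; return [] if missing."""
    if not isinstance(cell, str):
        return []  # e.g. NaN read from an empty CSV cell
    return [str(m).lower() for m in ast.literal_eval(cell)]


def degree_centralities(rows):
    """rows: iterable of (user, mentions_cell). Returns (in_c, out_c) dicts."""
    successors = defaultdict(set)    # user -> users they mention
    predecessors = defaultdict(set)  # user -> users who mention them
    nodes = set()
    for user, cell in rows:
        user = user.lower()
        nodes.add(user)
        for mention in parse_mentions(cell):
            nodes.add(mention)
            successors[user].add(mention)
            predecessors[mention].add(user)
    # Normalize by n - 1, the largest possible degree without self-loops
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    in_c = {n: len(predecessors[n]) * scale for n in nodes}
    out_c = {n: len(successors[n]) * scale for n in nodes}
    return in_c, out_c


# Hypothetical rows, for illustration only
rows = [
    ("alice", "['bob', 'carol']"),
    ("bob", "['carol']"),
    ("carol", "[]"),
    ("dave", float("nan")),
]
in_c, out_c = degree_centralities(rows)
print(max(in_c, key=in_c.get))    # carol
print(max(out_c, key=out_c.get))  # alice
```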


## Questions

1. Why did you choose the hashtag you did?

2. Who was the most frequent user in terms of number of posts? What do you know about this user?

3. After conducting a sentiment analysis, what was the average polarity and subjectivity of the tweets that included your hashtags? What was the general sentiment about the posts associated with your hashtag? Taking a quick review of the tweets, what were some possible reasons for the sentiment analysis results?

4. Through a social network analysis, which user had the highest out-degree centrality (mentioning the most other users) and which user had the highest in-degree centrality (mentioned the most by other users)? What do you know about these two users?

## Answers

1. The hashtag "Putin" was chosen for me.
2. The most frequent user was @Asiyatu4, who seems to be an avid Putin supporter.
3. The average polarity for all tweets is 0.03, which is close to neutral on TextBlob's scale of -1 to 1, and the average subjectivity is 0.29, which is closer to objective than subjective on its scale of 0 to 1. With respect to the reasons, factors that influence sentiment analysis results include tone and language, context, sampling bias, and ambiguity. Positive language and a promotional context tend to yield positive sentiment scores, while negative language and critical contexts tend to yield negative scores. Sampling bias and ambiguous language can also affect sentiment analysis.
4. The user with the highest in-degree centrality is RussianEmbassy, and the user with the highest out-degree centrality is Mousstach. RussianEmbassy is the Russian Embassy in London, UK. Mousstach is a professor from Belgium who tweets about geopolitical issues involving Russia.