# Section 0: Preamble <a id=preamble></a>

## Project:
Description: This IPYNB script is part of my internship project. The project repository is available on GitHub. <br>
Data: Data is not available because of data sharing restrictions.<br>
Date: 2023-01-01 (start)<br>

## Versions:
Apple M2 macOS<br>
Pandas: 1.4.4<br>
Numpy: 1.21.5<br>
Matplotlib: 3.5.2<br>
Seaborn: 0.11.2<br>
scikit-learn: 1.1.1<br>
SciPy: 1.9.1<br>
Pingouin: 0.5.3<br>

## Author:
Name: Ekin Derdiyok<br>
GitHub: https://github.com/ekinderdiyok<br>
LinkedIn: https://www.linkedin.com/in/ekinderdiyok/<br>
Email: [ekin.derdiyok@icloud.com](mailto:ekin.derdiyok@icloud.com)<br>

## Table of Contents:
Section 0: [Preamble](#preamble)<br>
Section 1: [Import](#import-install)<br>
Section 2: [Explore](#explore)<br>
Section 3: [Data Wrangling](#data-wrangling)<br>
Section 4: [Tweet Analysis](#tweet-analysis)<br>
Section 5: [User Analysis](#user-analysis)<br>
Section 6: [Attrition Bias](#attrition-bias)<br>
Section 7: [Clustering](#clustering)<br>
&nbsp;&nbsp;&nbsp;&nbsp;Section 7.1: [K-Means Clustering](#k-means-clustering)<br>
&nbsp;&nbsp;&nbsp;&nbsp;Section 7.2: [t-SNE](#t-sne)<br>
Section 8: [Multiple Linear Regression](#multiple-linear-regression)<br>

# Section 1: Data import and package installation
<a id=import-install></a>

In [None]:
# Install pingouin (the %pip magic installs into the environment of the running kernel)
%pip install pingouin

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy.stats
import pingouin as pg
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram
from datetime import datetime


In [None]:
# Check that the imports succeeded and record the versions for documentation
print(pd.__version__)
print(np.__version__)
print(matplotlib.__version__)
print(sns.__version__)
print(sklearn.__version__)
print(scipy.__version__)
print(pg.__version__)

In [None]:
# Load the data
data = pd.read_csv("/Users/ekinderdiyok/Documents/MPI/Twitter/public_dataset.csv", low_memory = False)

In [None]:
# Subset the data to exclude tweets with 0 favs or RTs
data["engagement"] = data.favorite_count + data.retweet_count # Create a new variable
print((data.engagement == 0).sum()) # Count the number of tweets with 0 engagement

data = data.loc[data.engagement > 0, :].copy() # Do the subsetting; copy so later column assignments do not write to a view

# Section 2: Explore
<a id=explore></a>

In [None]:
# Distribution of sentiment in the data
# A very rough look at the sentiments in the data: sum the sentiment scores (the number of occurrences of sentiment words) across all tweets.

_=data.loc[:,"sentiment_anger":"sentiment_positive"].sum().sort_values(ascending=False).plot(kind="bar") # Draws a barplot

In [None]:
# Distribution of motives in the data
# A very rough look at the motives in the data: sum the motive scores, which are self-reported on a 6-point Likert scale.

_=data.loc[:,"motive_entertain":"motive_informothers"].sum().sort_values(ascending=False).plot(kind="bar")

In [None]:
# Distribution of engagement percentage
# Exploring the engagement percentage to better understand the data, not part of the report.
# Engagement percentage is defined as (retweet count + favorite count)/follower count of the sender * 100

data["pct_eng"] = data.engagement / data.followers_count * 100
data.loc[data["pct_eng"] > 100000,"pct_eng"] = np.nan # Treat extreme values as missing; this also catches inf from accounts with zero followers
data["pct_eng"].hist()
plt.yscale("log")
#plt.ylim(0,1000)
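
The engagement percentage defined in this cell can be sketched on a few made-up rows. This is a minimal illustration rather than the notebook's exact procedure: the `toy` dataframe and its values are hypothetical, and zero followers are turned into NaN before dividing (an assumption made here) so that the ratio comes out missing instead of infinite.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tweet table (values are made up)
toy = pd.DataFrame({
    "favorite_count": [3, 0, 10, 1],
    "retweet_count": [1, 2, 0, 0],
    "followers_count": [200, 50, 0, 4],
})

# Engagement is the sum of favorites and retweets
toy["engagement"] = toy["favorite_count"] + toy["retweet_count"]

# Treat zero followers as missing so the ratio is NaN rather than inf
followers = toy["followers_count"].replace(0, np.nan)
toy["pct_eng"] = toy["engagement"] / followers * 100

print(toy)
```

Masking the denominator first keeps the handling of zero-follower accounts explicit instead of relying on a large cutoff.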

In [None]:
# Correlation between sentiments and motives
# * One important finding: adding tweet length as a covariate reduces the inflated correlations between sentiments. Dictionary methods count word occurrences, and longer tweets tend to contain more words, some of which are sentiment words, so the correlations among sentiments were inflated.
# * Due to space restrictions, and because it is not directly related to the other analyses, this is not in the report.
# * Adding text width as a covariate reduces the correlation between sentiment_positive and sentiment_negative.

data["mood"] = data.sentiment_positive - data.sentiment_negative # Net sentiment: positive minus negative word count; 0 when a tweet has equally many of both
sents_mots = [['sentiment_anger', 'sentiment_disgust', 'sentiment_fear', # Create a list of cols for correlation table
       'sentiment_anticipation', 'sentiment_joy', 'sentiment_sadness',
       'sentiment_surprise', 'sentiment_trust', 'sentiment_negative',
       'sentiment_positive',"mood"],['motive_provoke', 'motive_savecontent', 'motive_showemotions',
       'motive_connectwothers', 'motive_showachievement',
       'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention',
       'motive_provepoint', 'motive_causechaos', 'motive_bringattention',
       'motive_influence', 'motive_surpriseothers', 'motive_informothers']]

corr = pg.pairwise_corr(data=data, columns=sents_mots) # Pairwise correlations between each sentiment and each motive
sig_corr_table = corr[corr["p-unc"] < 0.001].copy() # Filter for p < .001; copy so the new column is not written to a view
sig_corr_table["abs_r"] = sig_corr_table.r.abs() # Create abs value column for r
display(sig_corr_table.sort_values(by="abs_r",ascending=False)) # sort by abs_r so that biggest r values are on top
display(sig_corr_table[sig_corr_table.X.isin(["mood","sentiment_negative","sentiment_positive"])])
# sentiment_positive and sentiment_negative are correlated, so control for display_text_width as a covariate
display(pg.pairwise_corr(data=data, columns=["sentiment_positive","sentiment_negative"], covar="display_text_width"))
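
The effect of controlling for tweet length can be illustrated with the standard first-order partial correlation formula. The sketch below uses simulated data, not the project data: `length`, `positive`, and `negative` are hypothetical variables in which both sentiment counts depend only on length, and `partial_corr` is a helper written for this illustration.

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simulated tweets: both word counts grow with length and are otherwise unrelated
rng = np.random.default_rng(0)
length = rng.uniform(20, 280, size=2000)
positive = 0.05 * length + rng.normal(size=2000)
negative = 0.05 * length + rng.normal(size=2000)

raw_r = np.corrcoef(positive, negative)[0, 1]
partial_r = partial_corr(positive, negative, length)
print(f"raw r = {raw_r:.2f}, partial r = {partial_r:.2f}")
```

In this simulation the raw correlation is large only because both counts share the length component; once length is partialled out, the remaining correlation is close to zero.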

# Section 3: Data Wrangling <a id=data-wrangling></a>

# Create a new categorical variable `dom_mot`
* `dom_mot` is the motive with the highest score for a tweet.
* It was created to predict the dominant motive of a tweet from other information about that tweet. This was abandoned because I could not get the analysis to work.
* The participant has to rate the motive above the scale midpoint (4, 5, or 6, but not 1, 2, or 3). Otherwise `dom_mot` is assigned NaN, i.e., no dominant motive.
* Analyses using `dom_mot` did not make it into the report.

In [None]:
# create a variable that holds the names of mot variables
mots = ['motive_entertain', 'motive_expressopinion', 'motive_provoke', 'motive_savecontent', 'motive_showemotions', 'motive_connectwothers', 'motive_showachievement', 'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention', 'motive_provepoint', 'motive_causechaos', 'motive_bringattention', 'motive_influence', 'motive_surpriseothers', 'motive_informothers'] 

# create a variable that holds the names of sent variables
sents =  ['sentiment_anger', 'sentiment_disgust', 'sentiment_fear',      'sentiment_anticipation', 'sentiment_joy', 'sentiment_sadness','sentiment_surprise', 'sentiment_trust', 'sentiment_negative','sentiment_positive']

data["dom_mot"] = data[mots].idxmax(axis=1).str[7:] # name of the motive with the highest score
data.loc[data[mots].max(axis=1) < 4, "dom_mot"] = np.nan # assign NaN to dom_mot whenever a tweet has less than 4 for the max motive.
print(data.dom_mot.value_counts())
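
The dominant-motive rule can be checked on a small made-up table. This is a sketch under stated assumptions: the `toy` dataframe is hypothetical and only three of the motive columns are used. Note that `idxmax` returns the first column when two motives tie for the highest score.

```python
import pandas as pd

# Hypothetical motive ratings on the 1-6 scale
toy = pd.DataFrame({
    "motive_entertain": [5, 2, 1],
    "motive_provoke": [3, 2, 6],
    "motive_informothers": [4, 3, 6],
})
cols = list(toy.columns)

# Name of the highest-rated motive, with the "motive_" prefix stripped
dom = toy[cols].idxmax(axis=1).str[len("motive_"):]

# Keep the label only when the highest rating is 4 or more; otherwise NaN
dom = dom.where(toy[cols].max(axis=1) >= 4)
print(dom)
```

The second row has no rating of 4 or more, so it gets no dominant motive; the third row is a tie and is resolved by column order.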

# Section 4: Tweet Analysis <a id=tweet-analysis></a>
Analysis of tweet sentiments, engagement, and motives, investigating whether tweets with different sentiments receive different levels of engagement.


In [None]:
data["dom_sent"] = data[sents].idxmax(axis=1).str[10:] # name of the sentiment with the highest score
data.loc[data[sents].max(axis=1) < 2, "dom_sent"] = np.nan # assign NaN to dom_sent whenever the highest sentiment score of a tweet is below 2
print(data.dom_sent.value_counts())
sns.boxplot(data=data,x="pct_eng",y="dom_sent")
plt.xscale("log")

# Engagement percentage and popularity of a tweet by negative vs positive
* t-test and Mann-Whitney U (MWU) test comparing positive and negative tweets with regard to their popularity
* The MWU test is the one reported in the internship report.

In [None]:
feats = ["pct_eng","favorite_count","retweet_count"]
pd.options.display.max_columns = None
display(data.groupby("dom_sent")[feats].describe().loc[["negative","positive"],:])

# Run both tests for every feature, comparing positive and negative tweets
is_pos = data.dom_sent == "positive"
is_neg = data.dom_sent == "negative"
for test_name, test_func in [("t-test", pg.ttest), ("Mann-Whitney U test", pg.mwu)]:
    for feat in feats:
        print(f"Two-sample independent {test_name} for {feat} between positive and negative tweets")
        display(test_func(x=data.loc[is_pos, feat], y=data.loc[is_neg, feat]))
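
Why a rank-based test can disagree with a comparison of means is easiest to see on a tiny example. The sketch below computes the Mann-Whitney U statistic directly from its pairwise definition; the two samples are made up, and `mann_whitney_u` is a helper written for this illustration, not a pingouin function.

```python
from statistics import mean

def mann_whitney_u(x, y):
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Hypothetical retweet counts: one viral positive tweet dominates the mean
positive = [1, 2, 2, 3, 5000]
negative = [4, 5, 6, 7, 8]

u_pos = mann_whitney_u(positive, negative)
u_neg = mann_whitney_u(negative, positive)

print("mean positive:", mean(positive), "mean negative:", mean(negative))
print("U positive:", u_pos, "U negative:", u_neg)
```

The positive sample has the far larger mean because of a single outlier, yet it wins only 5 of the 25 pairwise comparisons. With heavily skewed counts, the rank-based statistic is less sensitive to such outliers.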



# How does the sentiment of a tweet differ between tweet types?
* Number of sentiment_positive words: retweet > original > (reply > quote)
* Number of sentiment_negative words: retweet > original > (quote > reply)
* The conclusion is that retweets tend to contain more sentiment words.

In [None]:
# Create tweet_type variable
for index, row in data.iterrows():
    if data.loc[index, "is_retweet"] == True:
        data.loc[index, "tweet_type"] = "retweet"
    elif (data.loc[index, "is_reply"] == True) & (data.loc[index, "is_quote"] == True):
        data.loc[index, "tweet_type"] = "quote_reply"
    elif (data.loc[index, "is_reply"] == False) & (data.loc[index, "is_quote"] == True): 
        data.loc[index, "tweet_type"] = "quote"
    elif (data.loc[index, "is_reply"] == True) & (data.loc[index, "is_quote"] == False):
        data.loc[index, "tweet_type"] = "reply"
    elif (data.loc[index, "is_reply"] == False) & (data.loc[index, "is_quote"] == False) & (data.loc[index, "is_retweet"] == False):
        data.loc[index, "tweet_type"] = "original"
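
The same labelling can be written without a row-by-row loop. The sketch below is one possible vectorized version using `np.select`, which picks the first matching condition for each row. It assumes the three flag columns are plain booleans without missing values; the `toy` dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical flags, one row per tweet
toy = pd.DataFrame({
    "is_retweet": [True, False, False, False, False],
    "is_reply":   [False, True, False, True, False],
    "is_quote":   [False, True, True, False, False],
})

# Conditions are checked in order, mirroring the if/elif chain
conditions = [
    toy["is_retweet"],
    toy["is_reply"] & toy["is_quote"],
    toy["is_quote"],
    toy["is_reply"],
]
labels = ["retweet", "quote_reply", "quote", "reply"]

# Rows matching none of the conditions are original tweets
toy["tweet_type"] = np.select(conditions, labels, default="original")
print(toy)
```

On a large table this avoids the per-row `.loc` lookups, which are typically the slow part of the loop.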

In [None]:
# Visualize the distribution, test for equality of variances, run Welch's ANOVA, then Games-Howell post-hoc tests

subset = data.loc[data.tweet_type != "quote_reply",:] # quote replies are excluded from the comparison
sns.violinplot(data=subset,x="sentiment_positive",y="tweet_type",alpha=0.1)
plt.xlim(-1,5)
plt.show()
display(subset.groupby("tweet_type").sentiment_positive.describe())
display(pg.homoscedasticity(data=subset,group="tweet_type",dv="sentiment_positive"))
display(pg.welch_anova(data=subset,between="tweet_type",dv="sentiment_positive"))
display(pg.pairwise_gameshowell(data=subset,between="tweet_type",dv="sentiment_positive").round(4))
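
Welch's ANOVA is used here because it does not assume equal variances across tweet types. As a rough sketch of what the F statistic measures, the helper below implements the textbook Welch formula with NumPy. The three samples are made up, `welch_anova_f` is a name chosen for this illustration, and the degrees of freedom and p-value are omitted.

```python
import numpy as np

def welch_anova_f(groups):
    """Welch's F statistic for k groups with possibly unequal variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                          # precision weight of each group
    grand_mean = np.sum(w * means) / np.sum(w)
    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    return numerator / denominator

# Hypothetical counts of positive words per tweet type
retweet = [0, 1, 1, 2, 4, 6]
original = [0, 0, 1, 1, 2]
reply = [0, 0, 0, 1, 1]

f_stat = welch_anova_f([retweet, original, reply])
print(f"Welch F = {f_stat:.3f}")
```

Groups with small variance relative to their size get larger weights, so a noisy group cannot dominate the comparison. With only two groups the statistic reduces to the squared Welch t statistic.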

In [None]:
# Same thing but this time for sentiment_negative
data = data.loc[data.tweet_type != "quote_reply",:].copy() # from here on, quote replies are dropped from data
sns.violinplot(data=data,x="sentiment_negative",y="tweet_type",alpha=0.1)
plt.xlim(-1,5)
plt.show()
display(data.groupby("tweet_type").sentiment_negative.describe())
display(pg.homoscedasticity(data=data,group="tweet_type",dv="sentiment_negative"))
display(pg.welch_anova(data=data,between="tweet_type",dv="sentiment_negative"))
display(pg.pairwise_gameshowell(data=data,between="tweet_type",dv="sentiment_negative").round(4))

# How does a tweet's motive differ between different tweet types?
* visualizing different tweet types and distribution of motive scores. 

In [None]:
mots = ['motive_provoke', 'motive_savecontent', 'motive_showemotions',
       'motive_connectwothers', 'motive_showachievement',
       'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention',
       'motive_provepoint', 'motive_causechaos', 'motive_bringattention',
       'motive_influence', 'motive_surpriseothers', 'motive_informothers']
fig, axs = plt.subplots(len(mots),1,figsize=(10,10*len(mots)))
for i, mot in enumerate(mots):
    sns.kdeplot(data=data, x=mot, hue="tweet_type",multiple="fill",common_norm=False,ax=axs[i])
    axs[i].set_xlim([1,6])

## Logistic Regression
Predict the dominant motive as a category label using sentiment scores. This analysis was abandoned because I could not make it work.

In [None]:
# Create dom_mot variable
data["dom_mot"] = data[mots].idxmax(axis=1).str[7:] # name of the motive with the highest score
data.loc[data[mots].max(axis=1) < 4, "dom_mot"] = np.nan # assign NaN to dom_mot whenever a tweet has less than 4 for the max motive.

In [None]:
# Create a dictionary that maps each motive label to an integer code, to prepare for the logistic regression
# A separate name is used so that the `mots` list of column names is not overwritten
mot_labels = ['showemotions', 'informothers', 'gainattention', 'showattitude','connectwothers', 'bringattention', 'provoke', 'surpriseothers','savecontent', 'showachievement', 'provepoint', 'causechaos','influence', 'deceiveothers']
mydict = {val: i + 1 for i, val in enumerate(mot_labels)}
#data.dom_mot =  data.dropna(subset="dom_mot").dom_mot.replace(mydict)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sents =  ['sentiment_anger', 'sentiment_disgust', 'sentiment_fear',      'sentiment_anticipation', 'sentiment_joy', 'sentiment_sadness','sentiment_surprise', 'sentiment_trust', 'sentiment_negative','sentiment_positive']
X = data.loc[data["dom_mot"].notna(),sents]
y = data.loc[data["dom_mot"].notna(),"dom_mot"]

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=3606,train_size=0.7)

lr = LogisticRegression(max_iter=200) # multi_class="auto" was the default, so it is left out
lr.fit(X_train,y_train)
lr.predict(X_test)
lr.score(X_test,y_test)

# Section 5: User Analysis <a id=user-analysis></a>
Create the `users` dataframe by aggregating each user's tweets for user-level analysis. Until now, `data` has been a dataframe of individual tweets.
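
The idea of going from one row per tweet to one row per user can be sketched with a single `groupby` and named aggregation. This is an illustration on a hypothetical `toy` table with only a few of the columns; it does not reproduce every feature built for `users`.

```python
import pandas as pd

# Hypothetical tweet-level table: "session" identifies the user
toy = pd.DataFrame({
    "session": ["a", "a", "b", "b", "b"],
    "is_retweet": [True, False, False, False, True],
    "sentiment_anger": [0, 2, 1, 0, 5],
    "display_text_width": [100, 140, 60, 80, 120],
})

# One row per user, each output column named explicitly
users_toy = toy.groupby("session").agg(
    post_count=("is_retweet", "size"),
    retweets_sent=("is_retweet", "sum"),
    mean_anger=("sentiment_anger", "mean"),
    mean_length=("display_text_width", "mean"),
)
print(users_toy)
```

Because every aggregate comes from the same `groupby`, the result is automatically indexed by `session` and the columns stay aligned.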

In [None]:
# Some rows hold a 10-character non-date value in account_created_at; replace it with a dummy date so the rest of the calculation works
is_placeholder = data["account_created_at"].str.len() == 10
data.loc[is_placeholder, "account_created_at"] = "2000-01-01 00:00:00"

last_day = pd.Timestamp("2022-06-12 23:59:59") # last day in the dataset

# Account age in days; rows with the dummy date get 8198 days, which is replaced with NaN further below
created = pd.to_datetime(data["account_created_at"], format="%Y-%m-%d %H:%M:%S")
data["account_age"] = (last_day - created).dt.days


    
# For each sentiment_xxx column, add a boolean column is_xxx (e.g., is_anger) that is True where the score is above 2
for col_name in sents:
    data["is_" + col_name[10:]] = data[col_name] > 2 # col_name[10:] strips the "sentiment_" prefix
    
# Create a new dataframe called users
users = data.groupby("session").created_at.count().to_frame(name="post_count").reset_index()
users["replies_sent"] = data.groupby("session").is_reply.sum().reset_index(name="reply_count")["reply_count"]
users["retweets_sent"] = data.groupby("session").is_retweet.sum().reset_index(name="retweet_count")["retweet_count"]
users["quotes_sent"] = data.groupby("session").is_quote.sum().reset_index(name="quote_count")["quote_count"]
users.set_index("session", inplace = True)
users["quote_replies_sent"] = data.loc[(data.is_reply == 1) & (data.is_quote == 1),:].groupby("session").created_at.count()
users["quote_replies_sent"] = users["quote_replies_sent"].fillna(0) # users without quote replies get 0
users["tweets_sent"] = users.post_count - users.replies_sent - users.quotes_sent - users.retweets_sent + users.quote_replies_sent
users["mean_anger"] = data.groupby("session").sentiment_anger.mean()
users["mean_disgust"] = data.groupby("session").sentiment_disgust.mean()
users["mean_fear"] = data.groupby("session").sentiment_fear.mean()
users["mean_anticipation"] = data.groupby("session").sentiment_anticipation.mean()
users["mean_joy"] = data.groupby("session").sentiment_joy.mean()
users["mean_sadness"] = data.groupby("session").sentiment_sadness.mean()
users["mean_surprise"] = data.groupby("session").sentiment_surprise.mean()
users["mean_trust"] = data.groupby("session").sentiment_trust.mean()
users["mean_negative"] = data.groupby("session").sentiment_negative.mean()
users["mean_positive"] = data.groupby("session").sentiment_positive.mean()
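# Mean account age per user; 8198 is the age in days of the dummy date 2000-01-01, so it is treated as missing
users["account_age_days"] = data[["account_age","session"]].replace(8198,np.nan).set_index("session").dropna().groupby("session").mean("account_age")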
users["followers_count"] = data.set_index("session")["followers_count"][~data.set_index("session")["followers_count"].index.duplicated()] # number of followers a user had when they sent their first tweet

users["mean_favorite_count"] = data.loc[data.tweet_type != "retweet",["favorite_count","session"]].groupby("session").mean() # does not include RTs since they all have 0 favs
users["mean_favorite_count"] = users["mean_favorite_count"].fillna(0) # users who only sent RTs had NaN values, they are replaced with 0
users["mean_retweet_count"] = data.loc[data.tweet_type != "retweet",["retweet_count","session"]].groupby("session").mean() # RTs are excluded here as well
users["mean_retweet_count"] = users["mean_retweet_count"].fillna(0) # users who only sent RTs had NaN values, they are replaced with 0

users["popularity"] = users["mean_favorite_count"] + users["mean_retweet_count"]
users["engagement_percentage"] = users["popularity"] / users["followers_count"] * 100 # popularity as a percentage of the follower count
users["engagement_percentage"] = users["engagement_percentage"].replace(np.inf, np.nan) # users with zero followers get NaN instead of inf, which makes no sense
users["engagement_percentage"] = users["engagement_percentage"].fillna(0) # users with NaN engagement (e.g., zero followers) are assigned 0 engagement
users["mean_length"] = data.groupby("session").display_text_width.mean()
#users["motive_causechaos"] = data.loc[~data.motive_causechaos.isna(),["session","motive_causechaos"]].groupby("session").mean()
#users["angry_tweets"] = data.loc[data.is_anger == 1,["session","is_anger"]].groupby("session").count()

# Create mean motive of the users' tweets
mots = ['motive_entertain', 'motive_expressopinion', 'motive_provoke', 'motive_savecontent', 'motive_showemotions', 'motive_connectwothers', 'motive_showachievement', 'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention', 'motive_provepoint', 'motive_causechaos', 'motive_bringattention', 'motive_influence', 'motive_surpriseothers', 'motive_informothers']
for col in mots:
    users[col] = data.loc[~data[col].isna(),["session",col]].groupby("session").mean()

# How many tweets the person has broken down by each emotion
is_sent = ['is_anger', 'is_disgust', 'is_fear', 'is_anticipation', 'is_joy', 'is_sadness', 'is_surprise', 'is_trust','is_positive','is_negative']
count_sent =  ['count_anger', 'count_disgust', 'count_fear', 'count_anticipation', 'count_joy', 'count_sadness', 'count_surprise', 'count_trust','count_positive','count_negative']
for col in is_sent:
    users["count_" + col[3:]] = data.loc[data[col] == 1,["session",col]].groupby("session").count() # number of sentimental tweets per user
users[count_sent] = users[count_sent].fillna(0) # fill NaN values with zero

# automatize creation of new columns that show percentage of tweets that correspond to the given sentiment
pct_sent = ['pct_anger', 'pct_disgust', 'pct_fear', 'pct_anticipation', 'pct_joy', 'pct_sadness', 'pct_surprise', 'pct_trust']
for col in count_sent:
    users["pct_" + col[6:]] = users[col] / users.post_count * 100
users[pct_sent] = users[pct_sent].fillna(0) # fill NaN values with zero

# pct_highest shows how dominant the dom_sent is.
users["pct_highest"] = users[pct_sent].max(axis=1) # highest percentage is saved to a new col
users["dom_sent"] = users[pct_sent].idxmax(axis=1).str[4:] # name of the sentiment with highest percentage

# Create is_xxx cols with motive_xxx more than 3
mots = ['motive_entertain', 'motive_expressopinion', 'motive_provoke', 'motive_savecontent', 'motive_showemotions', 'motive_connectwothers', 'motive_showachievement', 'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention', 'motive_provepoint', 'motive_causechaos', 'motive_bringattention', 'motive_influence', 'motive_surpriseothers', 'motive_informothers']
for col in mots:
    data["is_" + col[7:]] = data[col] > 3 # True where the motive score is above 3; missing scores compare as False

# Calculate count_motive for each user: number of tweets sent by the given motive.
is_mot = ['is_entertain', 'is_expressopinion', 'is_provoke', 'is_savecontent', 'is_showemotions', 'is_connectwothers', 'is_showachievement', 'is_showattitude', 'is_deceiveothers', 'is_gainattention', 'is_provepoint', 'is_causechaos', 'is_bringattention', 'is_influence', 'is_surpriseothers','is_informothers']
for col in is_mot:
    users["count_" + col[3:]] = data[is_mot + ["session"]].groupby("session").sum().loc[:,col]

# Create dom_mot variable that shows the most commonly occurring motive for each user
count_mot = ['count_entertain', 'count_expressopinion', 'count_provoke', 'count_savecontent', 'count_showemotions', 'count_connectwothers', 'count_showachievement', 'count_showattitude', 'count_deceiveothers','count_gainattention', 'count_provepoint', 'count_causechaos','count_bringattention', 'count_influence', 'count_surpriseothers','count_informothers']
users["dom_mot"] = users[count_mot].idxmax(axis=1).str[6:] # name of the motive with the highest count
users.loc[users[count_mot].max(axis=1) == 0, "dom_mot"] = np.nan # assign NaN to dom_mot whenever there are no tweets with motive info

# Calculate pct_mot: the percentage of each user's posts sent with the given motive
for col in count_mot:
    users["pct_" + col[6:]] = users[col] / users.post_count * 100
#users[pct_sent] = users[pct_sent].fillna(0) # fill NaN values with zero
pct_mot = ['pct_entertain', 'pct_expressopinion', 'pct_provoke', 'pct_savecontent', 'pct_showemotions', 'pct_connectwothers', 'pct_showachievement', 'pct_showattitude', 'pct_deceiveothers', 'pct_gainattention', 'pct_provepoint', 'pct_causechaos', 'pct_bringattention', 'pct_influence', 'pct_surpriseothers', 'pct_informothers']

# Calculate pct_positive - pct_negative
users["pct_mood"] = users.pct_positive - users.pct_negative

# Calculate pct_positive + pct_negative
users["pct_total_pos_neg"] = users.pct_positive + users.pct_negative

# assert that total number of tweets in the users table match the total number of tweets in the data table
#assert users["tweets_sent"].sum() == data.loc[(data.is_reply==0)&(data.is_retweet==0)&(data.is_quote==0),:].shape[0]
#assert users["quote_replies_sent"].sum() == data.loc[(data.is_reply==1)&(data.is_retweet==0)&(data.is_quote==1),:].shape[0]
#assert users["replies_sent"].sum() == data.loc[(data.is_reply==1),:].shape[0]
#assert users["retweets_sent"].sum() == data.loc[(data.is_retweet==1),:].shape[0]

## Subset users with at least 10 non-retweet posts

In [None]:
users.loc[:,"non_retweet_post_count"] = users.post_count - users.retweets_sent
users = users[users.non_retweet_post_count >= 10].copy() # copy so the columns added below are not written to a view

# Calculate percentage of nonretweet posts per user
users["pct_non_retweet"] = users.non_retweet_post_count / users.post_count * 100

users

# Section 6: Attrition Bias <a id=attrition-bias></a>
Testing attrition bias, i.e., comparing those who filled out the survey with those who did not. Not included in the report due to space limitations.

In [None]:
# Users without any motive_entertain score did not fill out the survey
nonresponder = users.motive_entertain.isna()
for col in ["account_age_days", "post_count", "followers_count", "engagement_percentage", "mean_length"]:
    print(f"Two-sample independent t-test for {col} between nonresponders and responders")
    display(pg.ttest(x=users.loc[nonresponder, col], y=users.loc[~nonresponder, col]))
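
Before running the tests it can help to look at the group sizes and means directly. The sketch below splits a hypothetical `toy` user table into responders and nonresponders by whether a motive score is present, which is the same criterion used in the cell above; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical user table: a missing motive score means the survey was not filled out
toy = pd.DataFrame({
    "motive_entertain": [np.nan, 4.0, np.nan, 2.0, 5.0],
    "post_count": [10, 30, 20, 50, 40],
})

responded = toy["motive_entertain"].notna()

# Group size and mean post count for each group
summary = toy.groupby(responded)["post_count"].agg(["count", "mean"])
summary = summary.rename(index={False: "nonresponder", True: "responder"})
print(summary)
```

A large imbalance in group sizes or an obvious gap in means is a cue to read the corresponding test result more carefully.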

## Some scatterplots to explore `users` by the following variables: followers_count, account_age, number_of_tweets, ratio of original tweets

In [None]:
sns.scatterplot(data=users,y="followers_count",x="pct_non_retweet")
plt.yscale("log")

In [None]:
sns.scatterplot(data=users,x="account_age_days",y="followers_count")
plt.yscale("log")

In [None]:
sns.scatterplot(data=users,x="post_count",y="followers_count")
plt.yscale("log")
plt.xscale("log")

In [None]:
sns.scatterplot(data=users,y="popularity",x="pct_non_retweet")
plt.yscale("log")

In [None]:
sns.scatterplot(data=users,x="followers_count",y="popularity")
plt.xscale("log")
plt.yscale("log")

# Section 7: Clustering <a id=clustering></a>
Clustering users based on followers_count, account_age, number_of_tweets, ratio of original tweets, mean_tweet_length

## Data wrangling for clustering

In [None]:
def optimize_kmeans(data, max_k): # Applies k-means clustering with increasingly many clusters and plots an elbow plot.
    """Applies k-means clustering to data for 1 cluster up to max_k clusters. Draws an elbow plot for you to
    manually determine the appropriate number of clusters. The point that corresponds to the elbow is the
    number of clusters you should use."""

    n_k = []
    inertias = []

    for k in range(1, max_k+1):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0) # n_init is set explicitly because its default differs across scikit-learn versions
        kmeans.fit(data)
        n_k.append(k)
        inertias.append(kmeans.inertia_)

    # elbow plot
    fig, ax = plt.subplots(figsize=(15,5))
    plt.plot(n_k,inertias,"o-")
    plt.xlabel("Number of clusters")
    plt.ylabel("Inertia")
    plt.grid(True)
    # The figure is left open (no plt.show()) so that the caller can still add a title and save it
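
The quantity on the y-axis of the elbow plot, inertia, is the sum of squared distances from each point to the center of its cluster. A minimal sketch with made-up points shows why it drops sharply once the number of clusters matches the structure in the data; the `inertia` helper and the labels are written by hand for this illustration and do not run k-means.

```python
import numpy as np

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for lab in np.unique(labels):
        cluster = points[labels == lab]
        total += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return total

# Two obvious groups of hypothetical users in a 2-D feature space
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

one_cluster = inertia(points, np.array([0, 0, 0, 0]))
two_clusters = inertia(points, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)
```

Going from one cluster to two removes almost all of the inertia here; adding further clusters could only remove the small remainder, which is what produces the elbow.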

In [None]:
feats = ["followers_count","account_age_days","post_count","pct_non_retweet","popularity","mean_length"] # features I would like to use for my clustering
from sklearn.preprocessing import StandardScaler

# Standardize the features (zero mean, unit variance) so that no single feature dominates the distance calculation
scaler = StandardScaler()
feats_t = [feat + "_t" for feat in feats] # names of the standardized columns

users[feats_t] = scaler.fit_transform(users[feats])
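
Standardization matters because k-means relies on Euclidean distances, so a feature measured in thousands (such as follower counts) would otherwise swamp one measured in single digits. The sketch below applies the same per-column transformation by hand with NumPy, using the population standard deviation as `StandardScaler` does by default; the `features` array is hypothetical.

```python
import numpy as np

# Hypothetical features on very different scales: followers and a small score
features = np.array([
    [100.0, 1.0],
    [1000.0, 2.0],
    [10000.0, 3.0],
])

# Subtract the column mean and divide by the column standard deviation (ddof=0)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
print(standardized)
```

After the transformation every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale.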

## Section 7.1: K-means clustering <a id=k-means-clustering></a>

In [None]:
optimize_kmeans(users[feats_t].dropna(),8)
_=plt.title("Elbow plot for determining the appropriate number of clusters")
plt.savefig("/Users/ekinderdiyok/Documents/MPI/Twitter/Visualizations/elbow.png",dpi=300)

## Section 7.2: t-SNE <a id=t-sne></a>

In [None]:
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(users[feats_t].dropna())
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys)
plt.show()


In [None]:
# Bartlett test comparing the variance of each motive between original posts and retweets
from scipy.stats import bartlett
stats = []
ps = []
for mot in mots:
    OPs = data.loc[data.tweet_type == "original",mot].dropna() # drop missing motive scores so they do not propagate into the test result
    rts = data.loc[data.tweet_type == "retweet",mot].dropna()
    stat, p = bartlett(OPs, rts)
    stats.append(stat)
    ps.append(p)
my_bartlett = pd.DataFrame({"motive": mots, "test statistics": stats, "p-values": ps})
pd.options.display.float_format = '{:,.4f}'.format
my_bartlett = my_bartlett.set_index("motive") # set_index returns a new dataframe, so the result has to be assigned
my_bartlett.to_csv("/Users/ekinderdiyok/Documents/MPI/Twitter/my_bartlett_v02.csv")
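
To make the Bartlett comparison less of a black box, the statistic can be computed by hand from the group variances. The sketch below uses only the standard library and the textbook formula. The samples are made up, `bartlett_statistic` is a helper written for this illustration, missing values are removed first, and the p-value shortcut is valid only for the two-group case (one degree of freedom).

```python
import math
from statistics import variance

def bartlett_statistic(groups):
    """Bartlett's test statistic for equal variances; NaN values are dropped."""
    groups = [[v for v in g if not math.isnan(v)] for g in groups]
    k = len(groups)
    n = [len(g) for g in groups]
    total = sum(n)
    var = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    pooled = sum((ni - 1) * vi for ni, vi in zip(n, var)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (ni - 1) * math.log(vi) for ni, vi in zip(n, var)
    )
    correction = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction

# Hypothetical motive ratings
original = [1.0, 2.0, 3.0, float("nan")]
retweet_same = [11.0, 12.0, 13.0]   # same spread as the originals
retweet_wide = [1.0, 6.0, 11.0]     # much larger spread

t_same = bartlett_statistic([original, retweet_same])
t_wide = bartlett_statistic([original, retweet_wide])

# For two groups the statistic follows a chi-square distribution with 1 degree of freedom
p_wide = math.erfc(math.sqrt(t_wide / 2))
print(t_same, t_wide, p_wide)
```

The statistic is zero when the group variances are equal, regardless of the group means, and grows as the variances diverge.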

In [None]:
display(data.loc[data.tweet_type=="original",mots].describe().loc["std"])
display(data.loc[data.tweet_type=="retweet",mots].describe().loc["std"])

In [None]:
mots = ['motive_provoke', 'motive_savecontent', 'motive_showemotions',
       'motive_connectwothers', 'motive_showachievement',
       'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention',
       'motive_provepoint', 'motive_causechaos', 'motive_bringattention',
       'motive_influence', 'motive_surpriseothers', 'motive_informothers']
fig, axs = plt.subplots(len(mots),1,figsize=(10,10*len(mots)))
for i, mot in enumerate(mots):
    sns.kdeplot(data=data, x=mot, hue="tweet_type",multiple="fill",common_norm=False,ax=axs[i])
    axs[i].set_xlim([1,6])

In [None]:
display(data.groupby("tweet_type")[mots].describe())

data_rt_ori = data.loc[data.tweet_type.isin(["retweet","original"])]
display(data_rt_ori.tweet_type.value_counts())
data_rt_sampled = data.loc[data.tweet_type == "retweet"].sample(4135) # downsample the retweets
data_ori = data.loc[data.tweet_type == "original"]
display(data_ori.tweet_type.value_counts())
data_ori_rt_sampled = pd.concat([data_ori, data_rt_sampled]) # stack the rows; axis=1 would place the two samples side by side
data_ori_rt_sampled

In [None]:
data.tweet_type.value_counts()
data_rt_sampled = data.loc[data.tweet_type == "retweet",:].sample(4135)
data_og = data.loc[data.tweet_type == "original",:]
data_rt_og = pd.concat([data_og,data_rt_sampled])
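
Downsampling the larger group to the size of the smaller one can be sketched as follows. The `toy` table is hypothetical, the sample size is taken from the data instead of being typed in, and `random_state` is an assumption added here so that the same rows are drawn on every run.

```python
import pandas as pd

# Hypothetical tweets: 3 originals and 7 retweets
toy = pd.DataFrame({
    "tweet_type": ["original"] * 3 + ["retweet"] * 7,
    "motive_provoke": [1, 4, 2, 3, 5, 6, 2, 1, 4, 3],
})

originals = toy[toy["tweet_type"] == "original"]
retweets = toy[toy["tweet_type"] == "retweet"]

# Draw as many retweets as there are originals, reproducibly
retweets_sampled = retweets.sample(n=len(originals), random_state=0)

# Stack the rows of both groups into one balanced table
balanced = pd.concat([originals, retweets_sampled])
print(balanced["tweet_type"].value_counts())
```

With equal group sizes, the filled density plots show the 0.5 line as the point where both tweet types are equally represented.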

In [None]:
mots = ['motive_provoke', 'motive_savecontent', 'motive_showemotions',
       'motive_connectwothers', 'motive_showachievement',
       'motive_showattitude', 'motive_deceiveothers', 'motive_gainattention',
       'motive_provepoint', 'motive_causechaos', 'motive_bringattention',
       'motive_influence', 'motive_surpriseothers', 'motive_informothers']
fig, axs = plt.subplots(len(mots),1,figsize=(10,10*len(mots)))
for i, mot in enumerate(mots):
    sns.kdeplot(data=data_rt_og, x=mot, hue="tweet_type",multiple="fill",common_norm=False,ax=axs[i])
    axs[i].set_xlim([1,6])
    axs[i].axhline(y=0.5)

In [None]:
# Bartlett test per motive, with the SD of each group and the percentage difference between the SDs
rows = []
for mot in mots:
    OPs = data.loc[data.tweet_type == "original",mot].dropna() # drop missing motive scores before testing
    rts = data.loc[data.tweet_type == "retweet",mot].dropna()
    stat, p = bartlett(OPs, rts)
    rows.append({"motive": mot, "test statistics": stat, "p-values": p,
                 "Original post SD": OPs.std(), "Retweet SD": rts.std(),
                 "Percentage difference SD": (OPs.std() - rts.std()) / rts.std() * 100})

my_bartlett = pd.DataFrame(rows).set_index("motive")
pd.options.display.float_format = '{:,.4f}'.format
my_bartlett.to_csv("/Users/ekinderdiyok/Documents/MPI/Twitter/my_bartlett_v03.csv")
my_bartlett

In [None]:
# Boxplot to visualize negativity bias. I decided to report the violin plot and abandoned this plot

x1 = data.loc[(data.dom_sent == "negative") & (data.retweet_count > 0),["retweet_count","dom_sent"]]
x2 = data.loc[(data.dom_sent == "positive") & (data.retweet_count > 0),["retweet_count","dom_sent"]]
xs = pd.concat([x1,x2])
sns.boxplot(data=xs,x="retweet_count",y="dom_sent")
plt.xscale("log")

In [None]:
# Violinplot to visualize negativity bias

x1 = data.loc[(data.dom_sent == "negative"),["retweet_count","dom_sent"]]
x2 = data.loc[(data.dom_sent == "positive"),["retweet_count","dom_sent"]]
xs = pd.concat([x1,x2])
sns.violinplot(data=xs,x="retweet_count",y="dom_sent",scale="area")
plt.xscale("log")
#plt.xlim([1,100000])
plt.xlabel("Retweet count")
plt.ylabel("Sentiment of the tweet")

In [None]:
x1 = data.loc[(data.dom_sent == "negative"),["retweet_count","dom_sent"]]
x2 = data.loc[(data.dom_sent == "positive"),["retweet_count","dom_sent"]]
xs = pd.concat([x1,x2])
#sns.kdeplot(x=x1.retweet_count)
sns.kdeplot(data=xs, x="retweet_count",hue="dom_sent",multiple="fill")
plt.xscale("log")
#plt.xlim([1,100000])
plt.xlabel("Retweet count")
#plt.ylabel("Sentiment of the tweet")

In [None]:
# ECDF plot to visualize negativity bias. I decided to report the violin plot and abandoned this plot

fig, axs = plt.subplots(1,2,figsize=(10,5))
sns.ecdfplot(data=xs, x="retweet_count", hue="dom_sent",ax=axs[0])
axs[0].set_xscale("log")
#plt.xlim([30000,1000000])
#plt.ylim([0.9,1])

sns.ecdfplot(data=xs, x="retweet_count", hue="dom_sent",ax=axs[1]) # same plot, zoomed in on the upper tail
axs[1].set_xscale("log")
axs[1].set_xlim([30000,1000000])
axs[1].set_ylim([0.98,1])

# Section 8: Multiple Linear Regression <a id=multiple-linear-regression></a>
Explaining a single motive with multiple sentiments.

In [None]:
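target = "motive_showemotions"
complete = data.dropna(subset=[target] + sents) # keep rows where the target and all predictors are present, so X and y stay aligned
X = complete[sents]
y = complete[target]
lm = pg.linear_regression(X, y)
lm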
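
The regression only makes sense when predictors and target come from the same rows. The sketch below shows the alignment step and an ordinary least squares fit with NumPy on a hypothetical `toy` table, constructed so that the target equals 1 + 2 * joy - 1 * anger exactly. It illustrates the idea and does not replace the pingouin output.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the second row has no motive score
toy = pd.DataFrame({
    "sentiment_joy": [0, 1, 2, 3, 4, 5],
    "sentiment_anger": [1, 0, 1, 0, 1, 0],
    "motive_showemotions": [0.0, np.nan, 4.0, 7.0, 8.0, 11.0],
})
predictors = ["sentiment_joy", "sentiment_anger"]
target = "motive_showemotions"

# Drop incomplete rows once, so X and y share the same rows
complete = toy.dropna(subset=predictors + [target])

# Design matrix with an intercept column, then a least squares fit
X = np.column_stack([np.ones(len(complete)), complete[predictors].to_numpy(dtype=float)])
y = complete[target].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept"] + predictors, coef.round(3))))
```

Dropping missing values separately for `X` and `y` would leave the two with different rows, which either raises an error or silently pairs predictors with the wrong targets.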

In [None]:
data.info(verbose=True)

In [None]:
# Normalize each sentiment count by tweet length (display_text_width)
test = pd.DataFrame(index=data.index)
for sent in sents:
    test[sent + "_n"] = data[sent] / data.display_text_width

In [None]:
data[sents].describe()

In [None]:
# End of the script.