The goal of this notebook is exploratory data analysis (EDA) to identify which features are available in the data.

1. Look at re-tweets and duplications to get a sense of how many unique textual statements exist within the corpus.
1. Look at how many unique tweets per sub-group to assess data balancing issues.
1. Look at keyword and topic proxies such as hashtags, mentions, and named entities or noun-phrases.
1. Run a basic sentiment analysis per sub-group.

# Load Data from `00_NST_DataPrep.ipynb`

In [None]:
import pandas as pd

offline_tweets_df = pd.read_pickle('/content/drive/MyDrive/Piper Gradient/Not-So-Twitterpated/cleaned_offline_tweets_df_large.pickle')

# Retweets and Duplications

In [None]:
# number of retweets
offline_tweets_df['is_retweet'].sum()

In [None]:
# number of unique retweets
offline_tweets_df.loc[offline_tweets_df['is_retweet']].text3.unique().size


In [None]:
# 20 most repeated tweets
offline_tweets_df.groupby(['text3']).size().reset_index(name='counts')\
  .sort_values('counts', ascending=False).head(20)
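
The same ranking can be obtained more directly with `Series.value_counts`, which counts and sorts in one step. A minimal sketch on a toy frame (the `toy_df` name and its rows are made up for illustration; only the `text3` and `is_retweet` column names come from this notebook):

```python
import pandas as pd

# Hypothetical toy data standing in for offline_tweets_df
toy_df = pd.DataFrame({
    'text3': ['a', 'b', 'a', 'c', 'a', 'b'],
    'is_retweet': [False, False, True, False, True, True],
})

# Most repeated texts: value_counts sorts by count, descending
top_repeated = toy_df['text3'].value_counts().head(2)

# Retweet rows versus distinct retweeted texts
n_retweets = int(toy_df['is_retweet'].sum())
n_unique_retweets = toy_df.loc[toy_df['is_retweet'], 'text3'].nunique()

# Share of rows whose text already appeared in an earlier row
duplicate_share = toy_df['text3'].duplicated().mean()
```

Comparing `n_retweets` with `n_unique_retweets` gives a quick sense of how much of the retweet volume is repetition of the same text.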



In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MaxNLocator

# number of times each tweet appears
counts = offline_tweets_df.groupby(['text3']).size()\
           .reset_index(name='counts')\
           .counts

# plot histogram of tweet counts
plt.figure()
plt.hist(counts, bins='auto')
# copies are whole numbers, so keep the x-axis ticks on integers
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))
plt.xlabel('Copies of each tweet')
plt.ylabel('Frequency')
plt.yscale('log', nonpositive='clip')
plt.title('Frequency of Tweet/Retweet Duplications')
plt.show()


# Unique Tweets and Tweets by Political Typology

In [None]:
# Number of unique tweets in data set 

offline_tweets_df.text3.unique().size

In [None]:
type_counts = offline_tweets_df['tweet category'].sort_values().value_counts(sort=False, dropna=True)

type_counts

In [None]:
792/1950
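
Rather than typing counts by hand, a ratio like the one above can be read off the counts themselves. A minimal sketch, assuming the quantity of interest is each category's share of the labelled tweets (the toy labels `A`, `B`, `C` are invented; only the `tweet category` column name comes from this notebook):

```python
import pandas as pd

# Hypothetical toy data; None stands in for an unlabelled tweet
toy_df = pd.DataFrame({
    'tweet category': ['A', 'B', 'A', None, 'C', 'A'],
})

# normalize=True returns proportions; missing labels are dropped by default
type_shares = toy_df['tweet category'].value_counts(normalize=True)

# Ratio of the largest to the smallest group as a rough imbalance measure
imbalance_ratio = type_shares.max() / type_shares.min()
```

A ratio far from 1 suggests the sub-groups may need rebalancing before any per-group comparison.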

In [None]:
offline_tweets_df['tweet category'].dropna().sort_values()\
                                   .astype('category')\
                                   .value_counts(sort=False).plot(kind='bar')

# Hashtags, Mentions, Retweets, and Links

In [None]:
hashtags_list_df = offline_tweets_df.loc[
                       offline_tweets_df.hashtags.apply(
                           lambda hashtags_list: hashtags_list != []
                       ),['hashtags']]

hashtags_df = pd.DataFrame(
    [hashtag for hashtags_list in hashtags_list_df.hashtags
    for hashtag in hashtags_list],
    columns=['hashtag'])

display(hashtags_df)
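
An equivalent way to flatten the list column is `Series.explode`, which emits one row per list element. A minimal sketch on invented data (only the `hashtags` column name is from this notebook); note that `explode` turns empty lists into NaN rows, hence the `dropna`:

```python
import pandas as pd

# Hypothetical toy data: one list of hashtags per tweet
toy_df = pd.DataFrame({
    'hashtags': [['x', 'y'], [], ['x'], ['z', 'x']],
})

hashtags_flat = (
    toy_df['hashtags']
    .explode()                 # one row per hashtag; [] becomes NaN
    .dropna()                  # drop tweets that had no hashtags
    .reset_index(drop=True)
    .to_frame(name='hashtag')
)
```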

In [None]:
hashtags_df['hashtag'].unique().size


In [None]:
popular_hashtags = hashtags_df.groupby('hashtag').size()\
                              .reset_index(name='counts')\
                              .sort_values('counts', ascending=False)\
                              .reset_index(drop=True)


In [None]:
popular_hashtags.head(10)

In [None]:
521+77

In [None]:
598/893
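
Sums and ratios like the two cells above can also be computed from a counts table directly. A minimal sketch, assuming the goal is the share of all hashtag uses covered by the `k` most frequent hashtags (the helper `top_k_share` and the toy counts are hypothetical, not taken from the data):

```python
import pandas as pd

def top_k_share(counts_df, k, count_col='counts'):
    """Fraction of the total count contributed by the k largest rows."""
    total = counts_df[count_col].sum()
    top = counts_df[count_col].nlargest(k).sum()
    return top / total

# Hypothetical toy table shaped like popular_hashtags
toy_counts = pd.DataFrame({
    'hashtag': ['x', 'y', 'z', 'w'],
    'counts': [5, 3, 1, 1],
})

share_top2 = top_k_share(toy_counts, k=2)
```

On the real data this would be called as `top_k_share(popular_hashtags, k=2)`, which keeps the result in sync if the underlying counts change.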

[Finding Linguistic Patterns using `spaCy`](https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/02_pattern_matching.html)