# Exploring Conditional Probability Concepts Using Trump's Tweets

## Setup

### Import tooling

In [60]:
import os
import pandas as pd

import plotly.express as px
import unicodedata
import nltk
nltk.download('wordnet') # if you've not downloaded these before
nltk.download('stopwords') # if you've not downloaded these before

import re

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/caitlynvanheest/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/caitlynvanheest/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Project Structure

In [2]:
# define the path for this project
project_folder = '/Users/caitlynvanheest/Documents/GitHub/trumpsTweets/'

# create the folder (and any missing parent folders) if it does not exist yet
os.makedirs(project_folder, exist_ok=True)

In [3]:
# use cookiecutterdatascience project template at that location

## Terminal / command line commands for setting up project structure
# pip install cookiecutter 
# change to directory 
# cd '/Users/caitlynvanheest/Documents/GitHub/'
# cookiecutter https://github.com/drivendata/cookiecutter-data-science

## Data Ingest and Cleaning

In [4]:
# read the JSON file into a dataframe

raw_tweets = pd.read_json("../data/raw/tweets_01-08-2021.json")

In [5]:
raw_tweets.sample(10)

| index | id | text | isRetweet | isDeleted | device | favorites | retweets | date | isFlagged |
|---|---|---|---|---|---|---|---|---|---|
| 40698 | 757775689525309400 | Elizabeth Warren, often referred to as Pocahon... | f | f | Twitter for Android | 34257 | 9174 | 2016-07-26 03:12:57 | f |
| 54690 | 1129339132373884900 | Will the Democrats give our Country a badly ne... | f | f | Twitter for iPhone | 56476 | 11209 | 2019-05-17 10:53:25 | f |
| 23147 | 322384862499729400 | @DohaBen I will be there soon! | f | f | Twitter Web Client | 1 | 3 | 2013-04-11 16:25:15 | f |
| 21247 | 352135419195961340 | .@Yankees are making a big mistake sending the... | f | f | Twitter Web Client | 68 | 104 | 2013-07-02 18:43:20 | f |
| 28574 | 492392722792468500 | ""@AceWill Just bought @realDonaldTrump's: Thi... | f | f | Twitter Web Client | 82 | 54 | 2014-07-24 19:35:47 | f |
| 49551 | 1206323905704710100 | .@seanhannity, who will be interviewed on @mar... | f | f | Twitter for iPhone | 44336 | 10874 | 2019-12-15 21:23:26 | f |
| 56252 | 1088430717611245600 | Without a Wall there cannot be safety and secu... | f | f | Twitter for iPhone | 133663 | 26748 | 2019-01-24 13:37:59 | f |
| 9217 | 1212734794762784800 | A lot of very good people were taken down by a... | f | f | Twitter for iPhone | 122813 | 29134 | 2020-01-02 13:58:01 | f |
| 45427 | 843779892776964100 | The real story that Congress, the FBI and all ... | f | f | Twitter for iPhone | 53873 | 13166 | 2017-03-20 11:02:57 | f |
| 34445 | 624935615449186300 | Isn't it funny that I am now #1 in the money l... | f | f | Twitter for Android | 1497 | 621 | 2015-07-25 13:33:56 | f |


In [6]:
# check completeness of fields

raw_tweets.isnull().sum()

id           0
text         0
isRetweet    0
isDeleted    0
device       0
favorites    0
retweets     0
date         0
isFlagged    0
dtype: int64

In [7]:
# remove retweets so we are only looking at Donald Trump's original content

tweets = raw_tweets[raw_tweets['isRetweet'] == 'f']

## Decisions, Decisions...

We need to decide:

1) What time period are we interested in investigating?

2) Are we going to include user mentions (@user), or analyze Trump's own text alone?

3) Do we want to combine individual records that belong to one larger sentiment / point that Trump spread over multiple tweets due to the character limit?

4) Should we group terms that are synonyms into one representation/theme (ex: *loser, moron, stupid,* and *idiot*)?
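As a rough illustration of what decisions 2 and 4 could look like in code, here is a minimal sketch. The theme map, the label `insult`, and the helper names `strip_mentions` and `group_synonyms` are all hypothetical and are not choices made in this notebook.

```python
import re

# Hypothetical theme map for decision 4: the grouping and the label "insult"
# are illustrative assumptions, not choices made in this notebook.
THEMES = {"loser": "insult", "moron": "insult", "stupid": "insult", "idiot": "insult"}

# Matches a Twitter-style mention such as @someone.
MENTION = re.compile(r"@\w+")


def strip_mentions(text):
    """Decision 2: drop @user mentions so only the remaining text is analyzed."""
    return MENTION.sub("", text)


def group_synonyms(words, themes=THEMES):
    """Decision 4: replace each word with its theme label when one is defined."""
    return [themes.get(word, word) for word in words]


sample = "@someone is a total loser and a moron"
words = strip_mentions(sample).lower().split()
print(group_synonyms(words))
```

Keeping the theme map as plain data makes it easy to revisit the grouping later without touching the cleaning logic.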

### Processing the Text

In [8]:
# add appropriate words that will be ignored in the analysis
# rt: retweet, amp: &
ADDITIONAL_STOPWORDS = ['rt','amp']


### Adapt some very helpful code from [this post](https://towardsdatascience.com/exploring-the-trump-twitter-archive-6242e5100a74)

In [78]:
def clean(text):
    """
    A simple function to clean up the data. After encoding
    normalization and basic regex parsing, every word that is
    not a stop word is lemmatized.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    # a set makes the membership test in the list comprehension fast
    stopwords = set(nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS)
    # drop accents and other non-ASCII characters, then lowercase
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # remove anything that is not a word character or whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]


def get_words(df, column):
    """
    Takes in a dataframe and a column name and returns a list of
    cleaned words from all values in the specified column.
    """
    # join the raw strings directly: str(list) would leak repr escapes
    # (for example a literal backslash-n) into the words
    return clean(' '.join(df[column].astype(str)))


def get_bigrams(df, column, count = 10):
    """
    Takes in a dataframe and a column name and returns a series of
    the `count` most frequent bigrams with their value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column),
                                  2)).value_counts())[:count]


def get_trigrams(df, column, count = 10):
    """
    Takes in a dataframe and a column name and returns a series of
    the `count` most frequent trigrams with their value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column),
                                  3)).value_counts())[:count]
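To make the n-gram helpers concrete, here is a minimal pure-Python sketch of what counting bigrams amounts to. The `ngrams` helper and the toy word lists are illustrative assumptions; the notebook itself relies on `nltk.ngrams`. Because `get_words` concatenates every row into one long word list, some of its n-grams straddle two tweets. The sketch counts within each tweet instead, which is one possible alternative and not what the functions above do.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-word tuples found in a list of words."""
    return list(zip(*(words[i:] for i in range(n))))


# Hypothetical, already-cleaned tweets (one list of words per tweet).
toy_tweets = [
    ["make", "america", "great"],
    ["make", "america", "safe"],
]

# Count bigrams tweet by tweet so no bigram spans a tweet boundary.
bigram_counts = Counter(
    gram for tweet in toy_tweets for gram in ngrams(tweet, 2)
)

print(bigram_counts.most_common(1))
```

With the concatenated approach, the pair `("great", "make")` would also be counted even though those words never appear together in a tweet.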

### Create a cleaned version of the tweets

### Lemmatization vs stemming vs tokenization

For why I chose lemmatization, see [this post on the difference between stemming and lemmatization](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/).

In [30]:
## Clean the tweets in df:

# tweets is a filtered slice of raw_tweets; work on an explicit copy so the
# column assignments below do not trigger pandas' SettingWithCopyWarning
tweets = tweets.copy()

# lemmatize and apply the regex replacement in a new column
tweets['cleanTweet'] = tweets['text'].apply(clean)

# create a new column for the cleaned tweet as a complete string for n-grams
tweets['cleanString'] = tweets['cleanTweet'].apply(lambda x: " ".join(x))


In [31]:
# create a new column in the dataframe with unique words per tweet

tweets['uniqueWords'] = tweets['cleanTweet'].apply(lambda x: list(set(x)))

# Unpack the unique words into a single string

tweets['dedupedTweets'] = tweets['uniqueWords'].apply(lambda x: " ".join(x))
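Deduplicating the words within each tweet matters because the step that follows counts how many tweets contain a word, not how often the word occurs overall. A minimal sketch with made-up word lists (not taken from the dataset) contrasts the two counts; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical, already-cleaned word lists standing in for tweets['cleanTweet'].
toy_tweets = [
    ["great", "great", "wall"],
    ["great", "news"],
    ["news", "news", "news"],
]

# Term frequency: every occurrence counts.
term_freq = Counter(word for tweet in toy_tweets for word in tweet)

# Tweet frequency: each word counts at most once per tweet.
tweet_freq = Counter(word for tweet in toy_tweets for word in set(tweet))

print(term_freq["news"], tweet_freq["news"])
```

Here "news" occurs four times but in only two tweets, so only the second count is a sensible numerator for "probability that a tweet contains the word".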



### Determine the Probability of Each Word Appearing in a Tweet

In [32]:
from collections import Counter as counter

In [33]:
# use counter to get the number of tweets that contain a given lemma/word

frequencies = counter(get_words(tweets,'dedupedTweets'))

# build a dictionary mapping each word to the probability that it appears in a tweet:
# (number of tweets containing the word) / (total number of tweets)

wordProbabilties = {k: v / len(tweets) for k, v in frequencies.items()}
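These per-word probabilities are the building block for conditional probabilities. As a minimal sketch on toy data, P(A | B) can be estimated as the number of tweets containing both words divided by the number of tweets containing B. The function names and the toy tweets are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical deduplicated tweets, each represented as a set of words.
toy_tweets = [
    {"fake", "news", "medium"},
    {"fake", "news"},
    {"great", "wall"},
    {"news", "great"},
]


def probability(word, tweets):
    """Estimate P(word): the share of tweets that contain the word."""
    return sum(word in tweet for tweet in tweets) / len(tweets)


def conditional_probability(word_a, word_b, tweets):
    """Estimate P(word_a | word_b) = count(a and b) / count(b)."""
    containing_b = [tweet for tweet in tweets if word_b in tweet]
    if not containing_b:
        raise ValueError(f"no tweet contains {word_b!r}")
    return sum(word_a in tweet for tweet in containing_b) / len(containing_b)


print(probability("news", toy_tweets))                      # 3 of 4 tweets
print(conditional_probability("fake", "news", toy_tweets))  # 2 of 3 "news" tweets
print(conditional_probability("news", "fake", toy_tweets))  # 2 of 2 "fake" tweets
```

Note that the two conditional probabilities differ: every toy tweet with "fake" also has "news", but not the other way around.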

### Get bi-gram & tri-gram frequencies from original Tweets

In [95]:
# name the count column explicitly so the plot does not depend on pandas' default column label
bigrams = get_bigrams(tweets, 'cleanString', 20).sort_values(ascending=True).reset_index(name='frequency')
bigrams['index'] = bigrams['index'].astype(str)

fig = px.bar(bigrams, y = 'index', x = 'frequency',  orientation='h')

fig.update_layout(title = 'Most Frequently Occurring Bigrams', xaxis_title = 'Frequency', yaxis_title = 'Bigram')

fig.show()

In [96]:
# name the count column explicitly so the plot does not depend on pandas' default column label
trigrams = get_trigrams(tweets, 'cleanString', 20).sort_values(ascending=True).reset_index(name='frequency')
trigrams['index'] = trigrams['index'].astype(str)

fig = px.bar(trigrams, y = 'index', x = 'frequency',  orientation='h')

fig.update_layout(title = 'Most Frequently Occurring Trigrams', xaxis_title = 'Frequency', yaxis_title = 'Trigram')

fig.show()