## S1.  General Business Context
The overall goal of Big Hedge Fund of America (BFOA) is to use unique data sources to identify market opportunities. One of the key problems the company is trying to solve is how to leverage publicly available data to make profitable trades.

The firm recently became aware of a study that found that the average cost of a food-related type 1 recall, measured over the 20 days after the announcement, is a 305 million reduction in market cap due to negative stock performance. Knowing the potential impact before a recall is self-reported to the FDA would allow financial analysts to better predict stock performance in the period following the announcement. Given the number of CPG companies and the size of the overall market, we project a prediction of this type to be worth at least $10m annually.


## S2.  Specific Questions

The first question we need to address is whether our data is in a usable enough state to make this prediction possible. If our data sets are not workable, or cannot be labeled in a scalable way, then we will never be able to run them through a predictive model that classifies products as potential recall risks. The second question is whether we can identify a signal in the review text that helps us predict when a recall is likely in the next 90 days.

We selected 90 days as the time frame for two reasons. First, it is the typical length of a business quarter, and many public companies must set quarterly targets for their investors. Second, the closer we get to the recall date, the more obvious it probably is that there is a problem. The insight becomes less valuable because the evidence is likely more substantial and there is less time to act on the information.
    

## S3.  Analysis Methods
Use at least one text analysis method other than term counts to help answer your question.

Because consumers complain in near real time, review data could provide an early signal of potential recalls. In the initial project sprint, we are using a database of 75 million publicly available Amazon reviews and official FDA recall data from 2012 to the present. The goal is to identify recalls in the wild in advance of the official FDA recall notices.

To generate this likelihood value, we will look at the words in each review and assign positive and negative sentiment scores to different words. More weight will be given to extreme words, such as those describing medical events. From these scores we will estimate the probability that the associated product is recalled within 90 days of the review.
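
The word-level scoring described above could look roughly like the sketch below. It is only an illustration: the lexicon entries, the medical terms, and the `MEDICAL_WEIGHT` multiplier are all hypothetical values chosen for the example, and the sketch covers the scoring step only, not the later step of turning scores into a recall probability.

```python
# Hypothetical lexicon: word -> sentiment score (values are illustrative only).
BASE_LEXICON = {"love": 2, "great": 2, "bad": -2}

# Hypothetical medical-event terms that should count for more than ordinary words.
MEDICAL_TERMS = {"vomiting": -3, "sick": -2, "seizure": -4}
MEDICAL_WEIGHT = 2.0  # assumed multiplier, would need tuning on labeled data


def review_score(text):
    """Sum word scores for one review, up-weighting medical-event terms."""
    score = 0.0
    for token in text.lower().split():
        word = token.strip("?:!.,;")  # drop punctuation attached to the word
        if word in MEDICAL_TERMS:
            score += MEDICAL_WEIGHT * MEDICAL_TERMS[word]
        else:
            score += BASE_LEXICON.get(word, 0)
    return score


reviews = [
    "My dog got sick and started vomiting after eating this.",
    "Great treats, my cats love them!",
]
scores = [review_score(r) for r in reviews]
print(scores)
```

Keeping the medical terms in a separate dictionary makes the extra weight easy to adjust without touching the general sentiment scores.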


In [None]:
# import modules
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 500)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer



In [None]:



reviewdf = pd.read_json(r'C:\Users\anaconda\downloads\meta_Pet_Supplies.json.gz',
                  compression='infer', lines = True)


In [None]:
reviewdf.head(20)

In [None]:



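# With chunksize set, read_json returns an iterator of DataFrames (a JsonReader), not a single DataFrame,
# so it is stored under its own name instead of overwriting reviewdf
review_chunks = pd.read_json(r'C:\Users\anaconda\downloads\All_Amazon_Review_5.json.gz',
                  compression='infer', lines = True, chunksize=1000000)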
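
A chunked read like the one above has to be consumed one chunk at a time. The sketch below shows one possible pattern, using a small in-memory sample in place of the real file. The column names `asin` and `reviewText` and the keyword filter are assumptions made for the illustration, not confirmed details of the data set.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the gzipped JSON-lines review file.
sample_lines = io.StringIO(
    '{"asin": "A1", "reviewText": "my dog got sick"}\n'
    '{"asin": "A2", "reviewText": "great toy"}\n'
    '{"asin": "A3", "reviewText": "cat was sick for days"}\n'
)

kept = []
# Each iteration yields a DataFrame with at most `chunksize` rows.
for chunk in pd.read_json(sample_lines, lines=True, chunksize=2):
    mask = chunk["reviewText"].str.contains("sick", case=False)
    kept.append(chunk[mask])  # keep only the rows of interest from this chunk

sick_reviews = pd.concat(kept, ignore_index=True)
print(sick_reviews)
```

Filtering inside the loop keeps memory use bounded by the chunk size rather than by the full file.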


In [None]:
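# `tweets` is assumed to hold the status objects returned by an earlier Twitter API search (that cell is not shown here)
userl = [[tweet.full_text, tweet.user.name, tweet.user.created_at, tweet.user.location,
          tweet.user.followers_count, tweet.user.friends_count] for tweet in tweets]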

In [None]:
tweetdf = pd.DataFrame(data=userl, columns=['full_text', 'username', 'userCreatedDate', "location", 'followers', 'friends'])
#tweetdf



### EDA

In [None]:
import seaborn as sns

In [None]:
tweetdf.describe()

In [None]:
#Where are longcovid tweets occurring?
location = tweetdf.groupby('location')
location.count().sort_values(by="full_text",ascending=False)

#Definitely a biased sample based on how we limited lang to EN


In [None]:
#any nulls?
tweetdf.isnull().sum()
#Nope

## Q1. 
Our business question is whether we can identify the most common #longcovid symptoms through text mining. The data we pulled consists of self-reported, anecdotal symptoms from tweets. For EDA, we summarized the numerical data, checked for nulls, and looked for the most popular locations from which users are tweeting these symptoms. The most frequent locations were in and around the UK. This is likely because we limited our search filter to the 'EN' language, and because the UK was exposed to COVID-19 before the US.

Based on what we want to know, we'll need to account for case; clean out hashtags, mentions and URLs; add custom stop words; and apply stemming to merge variations of the same symptom ('fatigue' vs. 'fatigued').
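
As a rough illustration of the cleaning steps listed above, the sketch below handles case, URLs, mentions and hashtags with plain regular expressions. The patterns and the sample tweet are assumptions made for the example; stop words and stemming are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def basic_clean(text):
    """Strip URLs and mentions, keep hashtag words without '#', lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # '#LongCovid' -> 'LongCovid'
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


# Hypothetical tweet text, made up for the example.
sample = "Still FATIGUED after 6 months @NHSuk #LongCovid https://t.co/abc123"
cleaned = basic_clean(sample)
print(cleaned)
```

Keeping the hashtag word rather than deleting it is a design choice: tags such as the symptom names themselves may carry useful signal.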





## T2. 
Perform the preprocessing steps you identified in Q1 and append the results to your original data frame.  Print some examples that help demonstrate the effects of your decisions.  Be sure to identify at least two successes and two ‘mishaps.’ 

In [None]:
# https://pypi.org/project/tweet-preprocessor/
# The tweet-preprocessor package looks really useful for this
import preprocessor as p


In [None]:
import re, string, unicodedata
#This is an interesting idea I found, builds a column for hashtags and stores in a list. Not relevant for this step, but still interesting...
tweetdf['hashtag'] = tweetdf['full_text'].apply(lambda x: re.findall(r"#(\w+)", x))



In [None]:
#function that applies the preprocessing from the tweet-preprocessing package
def preprocess_tweet(row):
    text = row['full_text']
    cleantext = p.clean(text)
    return cleantext


In [None]:
#We want to clean out URLs, emojis and mentions. Given more time, we may want to add some kind of override logic
#when an emoji is present, as that could give an easy indication of that particular person's sentiment.
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.MENTION)
tweetdf['cleantext'] = tweetdf.apply(preprocess_tweet, axis=1)


In [None]:
tweetdf.head(45)

In [None]:
#Make it lowercase
tweetdf['cleantextlower']=tweetdf['cleantext'].apply(lambda x: x.lower())


In [None]:
import nltk
from nltk.corpus import stopwords
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)

#create an object storing the default nltk stopwords
nltk_stopwords = stopwords.words("english") 


In [None]:
#create a custom stop words list by adding covid terms to the nltk list
my_stopwords = nltk_stopwords + ["covid", "long", "longhaulers", "still", "this", "longhaul", "get", "longcovid", "countlongcovid","symptoms", "people","covid19", "us", "also", "amp","many","like"]

cv1 = CountVectorizer(binary = False, stop_words = my_stopwords, ngram_range=(1,5))

In [None]:
#put it in a list
tlist = tweetdf['cleantext'].values.tolist()
#print(tlist)



In [None]:
#cv1
cv1_tlist = cv1.fit_transform(tlist)

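names_cv1 = cv1.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2; this needs 1.0+
count_cv1_review = np.asarray(cv1_tlist.sum(axis = 0)).ravel().tolist() #sum the sparse matrix by column (no dense copy) and convert to list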
count_cv1_review_df = pd.DataFrame(count_cv1_review, index = names_cv1, columns = ['count']) # create a dataframe from the list
sorted_count1 = count_cv1_review_df.sort_values(['count'], ascending = False)  #order by count




In [None]:
#This seems like an improvement, at least, as we've removed the highest covid terms.
sorted_count1

In [None]:
#Can we cut down on fragmented counts using stemming?

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer() 

def stem_text(row):
    text = str(row).split() #splits the text apart before stemming
    stemtext = [ps.stem(word) for word in text] #tells it which stemmer to apply and how
    stem2text = ' '.join(stemtext) #puts everything back together again
    return stem2text

tweetdf['cleantextlowerstemmed'] = tweetdf['cleantextlower'].apply(lambda x: stem_text(x)) #apply the above function to our text


In [None]:
cv1_tlist2 = cv1.fit_transform(tweetdf['cleantextlowerstemmed'])

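names_cv1 = cv1.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2; this needs 1.0+
count_cv1_review = np.asarray(cv1_tlist2.sum(axis = 0)).ravel().tolist() #sum the sparse matrix by column (no dense copy) and convert to list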
count_cv1_review_df = pd.DataFrame(count_cv1_review, index = names_cv1, columns = ['count']) # create a dataframe from the list
sorted_count2 = count_cv1_review_df.sort_values(['count'], ascending = False)  #order by count




In [None]:
#Doesn't appear to be helping reduce complexity...
sorted_count2

In [None]:

#try again with stem
cv1_tlist = cv1.fit_transform(tweetdf['cleantextlowerstemmed'])

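names_cv1 = cv1.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2; this needs 1.0+
count_cv1_review = np.asarray(cv1_tlist.sum(axis = 0)).ravel().tolist() #sum the sparse matrix by column (no dense copy) and convert to list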
count_cv1_review_df = pd.DataFrame(count_cv1_review, index = names_cv1, columns = ['count']) # create a dataframe from the list
sorted_count = count_cv1_review_df.sort_values(['count'], ascending = False)  #order by count



In [None]:
sorted_count[0:10]

## Q2. Explain the examples you selected in T2 and whether they reflect the expected results based on your preprocessing decisions.  

My decisions didn't necessarily yield desirable results. I was trying to consolidate fragmented counts using stemming, but this moved words like 'this' towards the top of my counts. I'm not 100% sure why this is occurring, since I used a custom stop word list that removes words like 'this'. I may be referencing the wrong stop word variable, and I need to go back and double-check the reference. Another likely cause is that the stop word list itself was never stemmed: the Porter stemmer reduces 'this' to 'thi', which no longer matches the stop word 'this'. I definitely think normalizing case and removing emojis and mentions was worthwhile, so I would keep those changes.
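
One possible explanation for the stop word behaviour, offered here as an assumption and not a confirmed diagnosis, is that the text was stemmed but the stop word list was not. The sketch below illustrates the mismatch and one way around it. `toy_stem` is a hypothetical stand-in for a real stemmer such as `PorterStemmer`, used only to keep the example self-contained.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


stop_words = ["this", "symptoms", "still"]
text = "this fatigue still gives me symptoms"

# Stem the text first, as the notebook does.
stemmed_tokens = [toy_stem(w) for w in text.split()]

# Filtering with the unstemmed stop words misses 'thi' and 'symptom'.
unstemmed_filter = [w for w in stemmed_tokens if w not in set(stop_words)]

# Passing the stop words through the same stemmer restores the match.
stemmed_stops = {toy_stem(w) for w in stop_words}
stemmed_filter = [w for w in stemmed_tokens if w not in stemmed_stops]

print(unstemmed_filter)
print(stemmed_filter)
```

The same idea would apply with the real stemmer: whatever transformation the text goes through, the stop word list should go through it too.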


## T3. 
Create a sentiment dictionary from one of the sources in class or find/create your own (potential bonus points for appropriate creativity). Using your dictionary, create sentiment labels for the text entries (raw and processed) in your corpus.  Provide output that demonstrates the class balance (or lack thereof).  

In [None]:
#class example
from afinn import Afinn
afinn = Afinn(language='en')

afinn.score("most day i feel like the uk government, the nhs, and univers upper manag are gaslight me. covid isn't that bad, they say. number go up isn't much of a concern, they say. carri on, they say. #longcovid")


In [None]:
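#class example
#Note: this version expects a plain dict lexicon (word -> score), not the Afinn object created above,
#and the next cell redefines afinn_sent, so it is not the function applied to tweetdf.

def afinn_sent(inputstring, lexicon):
    
    sentcount = 0
    for word in inputstring.split():
        token = word.rstrip('?:!.,;') #strip trailing punctuation before the lookup
        if token in lexicon:
            sentcount = sentcount + lexicon[token]
    
    if (sentcount < 0):
        sentiment = 'Negative'
    elif (sentcount > 0):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    return sentiment
    #return sentcount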

In [None]:
def afinn_sent(row):
    text = row['cleantextlowerstemmed']
    sentscore = afinn.score(text)
    return sentscore

def afinn_sent_lower(row):
    text = row['cleantextlower']
    sentscore = afinn.score(text)
    return sentscore


In [None]:
tweetdf['affin_score'] = tweetdf.apply(afinn_sent, axis=1)
tweetdf['affin_score_lower'] = tweetdf.apply(afinn_sent_lower, axis=1)

In [None]:
tweetdf.head()

In [None]:
tweetdf['affin_score'].describe()


In [None]:
tweetdf['affin_score_lower'].describe()

In [None]:
#class example to return sentiment category from afinn score

def afinn_sent_cat(inputstring):
    if (inputstring < 0):
        sentiment = 'Negative'
    elif (inputstring > 0):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    return sentiment
    


In [None]:
tweetdf['afinn_sentiment'] = tweetdf['affin_score'].apply(afinn_sent_cat)

In [None]:
tweetdf['afinn_sentiment_lower'] = tweetdf['affin_score_lower'].apply(afinn_sent_cat)

In [None]:
tweetdf.head(500)

In [None]:

afinn_sentiment = tweetdf.groupby('afinn_sentiment')
afinn_sentiment.count().sort_values(by="full_text",ascending=False)

In [None]:

afinn_sentiment_lower = tweetdf.groupby('afinn_sentiment_lower')
afinn_sentiment_lower.count().sort_values(by="full_text",ascending=False)


## Q3. 
We chose the AFINN method for measuring sentiment because it is a reasonable starting place and is easy to explain to stakeholders. AFINN looks up each word in our sample against a manually scored word list, then sums the matched values. I chose this because I like to start with simple methods and the least complex transformation.

The stemmed version almost acted like a regularization method: overall it ended with more neutral sentiment than the non-stemmed version. My guess is that stemming removed matches from the AFINN dictionary, since a stemmed token (for example 'wors' in place of 'worse') may no longer match a dictionary entry. I think I would code a manual dictionary if I were going to classify sentiment based on symptoms. There are obvious heart and breathing symptoms in the tweets I pulled down that are downright alarming, and those should be given more weight than a sunburn sensation. Although, to be honest, it's all very troubling to see.

Specific example pulled out below:
No stemming (-3 Negative)
'does any other #longhauler have phantom sunburn sensations? they keep getting worse and i'm looking for a way to manage them but like wtf do i do? it's not like aloe will help and i tried thc lotion but that was a bust. ibuprofen failed me too. #longcovid #countlongcovid'

vs. stemming (0 Neutral)
'doe ani other #longhaul have phantom sunburn sensations? they keep get wors and i'm look for a way to manag them but like wtf do i do? it' not like alo will help and i tri thc lotion but that wa a bust. ibuprofen fail me too. #longcovid #countlongcovid'	
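
The sketch below shows why a dictionary lookup can lose matches after stemming. The lexicon and its scores are hypothetical and much smaller than AFINN, so it only illustrates the mechanism and does not reproduce the scores reported above.

```python
# Hypothetical two-word lexicon; the scores are illustrative, not AFINN values.
toy_lexicon = {"worse": -3, "failed": -2}


def lexicon_score(text, lexicon):
    """Sum the scores of tokens that appear in the lexicon exactly."""
    return sum(
        lexicon.get(token.strip("?:!.,;"), 0)
        for token in text.lower().split()
    )


raw_text = "they keep getting worse. ibuprofen failed me too."
stemmed_text = "they keep get wors ibuprofen fail me too."

raw_score = lexicon_score(raw_text, toy_lexicon)
stemmed_score = lexicon_score(stemmed_text, toy_lexicon)
print(raw_score, stemmed_score)
```

Because the lookup is an exact string match, 'wors' and 'fail' contribute nothing unless the lexicon keys are stemmed the same way.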



