<a href="https://colab.research.google.com/github/diversifyguy/ML_Apps/blob/main/Woj_Shams_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [2]:
#hide
from fastbook import *
from IPython.display import display,HTML

# Woj vs. Shams - Twitter Analysis using NLP

Anyone who seriously follows the NBA knows where to go for breaking news: the Twitter accounts of ESPN's Adrian Wojnarowski and/or The Athletic's Shams Charania.

It's easy to forget that, like any Twitter account, Woj's and Shams' accounts both grew from humble beginnings.

Here's Woj's first-ever tweet from @wojespn (note: Woj's Twitter handle changed when he joined ESPN in 2017), dated 24-June-2009:

> twitter: https://twitter.com/wojespn/status/2311135902

Since then, Woj's follower count has grown to 4.7 million as of 6-July-2021.

Here's Shams' first-ever tweet, dated 15-August-2010:

> twitter: https://twitter.com/ShamsCharania/status/21240227201

Since then, Shams' follower count has grown to 1.2 million as of 6-July-2021.

What might these accounts have to teach us about the NBA and sports news media? The purpose of this exercise is to find out.

### Visualizing the Tweets



After using [twint](https://pypi.org/project/twint/) to extract all the tweets from Woj's and Shams' accounts, I cleaned the data and organized it into a pandas dataframe:

In [1]:
#hide
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# pandas to read our csv file
import pandas as pd

In [None]:
#hide
from google.colab import files
uploaded = files.upload()

In [None]:
df = pd.read_csv('/content/woj.csv', sep='\t', lineterminator='\r')
# To add Shams' tweets, load the second file and concatenate the two frames:
# df_shams = pd.read_csv('/content/shams.csv', sep='\t', lineterminator='\r')
# df = pd.concat([df, df_shams], ignore_index=True)




In [4]:
display(df.info())  # info() prints the summary itself and returns None, hence the trailing "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16539 entries, 0 to 16538
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               16539 non-null  object 
 1   conversation_id  16538 non-null  float64
 2   created_at       16538 non-null  object 
 3   date             16538 non-null  object 
 4   time             16538 non-null  object 
 5   timezone         16538 non-null  float64
 6   user_id          16538 non-null  float64
 7   username         16538 non-null  object 
 8   name             16538 non-null  object 
 9   place            0 non-null      float64
 10  tweet            16538 non-null  object 
 11  language         16538 non-null  object 
 12  mentions         16538 non-null  object 
 13  urls             16538 non-null  object 
 14  photos           16538 non-null  object 
 15  replies_count    16538 non-null  float64
 16  retweets_count   16538 non-null  float64
 17  likes_count 

None

In [5]:
# keep a deep copy of the raw data so that later changes to df do not affect it
df_copy = df.copy(deep=True)

In [6]:
# I don't need these columns, so dropping them. You can keep them if you want.
drop_list = ['place','near','geo','source','user_rt_id','user_rt','retweet_id','retweet_date','translate','trans_src','trans_dest']
df = df.drop(columns=drop_list)

In [7]:
# remove URLs
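# raw string, escaped dot, and an explicit regex=True (newer pandas treats the pattern as a literal otherwise)
df['tweet'] = df['tweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)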
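
To see what the URL pattern above actually strips, here is a minimal standard-library sketch applied to a single string. The helper name `strip_urls` and the sample tweet are made up for illustration; the extra whitespace tidy-up is an addition that the pandas call does not perform.

```python
import re

# Same idea as the pandas call: match anything starting with http or www.
URL_PATTERN = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)

def strip_urls(text):
    """Remove URL-like tokens, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

# Hypothetical sample text, not a real tweet
sample = "Sources: Deal is done. HTTPS://t.co/abc123 More at www.espn.com/nba"
print(strip_urls(sample))  # Sources: Deal is done. More at
```

Note that `http\S+` also removes any ordinary word that happens to start with "http", which is rarely a problem for tweets.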

In [8]:
#hide
#!pip install texthero
import texthero as hero

In [9]:
# text preprocessing
from texthero import preprocessing

# create a custom pipeline to preprocess the raw text we have
custom_pipeline = [preprocessing.fillna
                   , preprocessing.lowercase
                   , preprocessing.remove_digits
                   , preprocessing.remove_punctuation
                   , preprocessing.remove_diacritics
                   , preprocessing.remove_stopwords]
                  #  , preprocessing.remove_whitespace
                  #  , preprocessing.stem]

# call clean() method to clean the raw text in 'tweet' col and pass the custom_pipeline to pipeline argument
df['clean_tweet'] = hero.clean(df['tweet'], pipeline = custom_pipeline)
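
The steps in the custom pipeline above can be approximated with the standard library alone. This is a rough sketch of the idea, not TextHero's implementation: the function name `clean_text`, the tiny stopword set, and the sample sentence are all assumptions made for illustration.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword set (an assumption; TextHero ships a much larger list)
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "for", "with", "on"}

# Map every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_text(text):
    """Lowercase, drop digits, punctuation and diacritics, then remove stopwords."""
    text = text.lower()
    text = re.sub(r'\d+', ' ', text)
    text = text.translate(PUNCT_TABLE)
    # Decompose accented characters and drop the combining marks
    text = ''.join(ch for ch in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(ch))
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

# Hypothetical sample text, not a real tweet
print(clean_text("Jokić agrees to a 5-year extension with the Nuggets!"))
# jokic agrees year extension nuggets
```

The order of the steps matters: lowercasing first means the stopword check only has to handle lowercase forms.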

In [11]:
#hide
!pip3 install sweetviz
import sweetviz as sv


### Visualizations using TextHero

In [14]:
# df1 is used by the cells below, so it has to be created here
df1 = df.drop(columns=['tweet','username','link'])

In [None]:
df['pca'] = (
            df['clean_tweet']
            .pipe(hero.clean)
            .pipe(hero.tfidf)
            .pipe(hero.pca)
   )


In [None]:
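# the TF-IDF column has to exist before PCA can be applied to it
df['tfidf'] = hero.tfidf(df['clean_tweet'])
df['pca'] = hero.pca(df['tfidf'])
# color='topic' can be passed as well once the LDA topic labels have been assigned further down
hero.scatterplot(
    df, 
    col='pca', 
    title="PCA Woj & Shams"
)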

In [None]:
import matplotlib.pyplot as plt

# using top_words() method, get the top N words and make a bar plot.
hero.top_words(df1['clean_tweet']).head(20).plot.bar(figsize=(15,10))
plt.show()

In [None]:
# Want to add more stop words to your list? No problem. Follow the below steps.

from texthero import stopwords
default_stopwords = stopwords.DEFAULT
#add a list of stopwords to the stopwords
stop_w = ["co","https","http", "tell", "tells", "game", "season", "sports", "two"]
custom_stopwords = default_stopwords.union(set(stop_w))
#Call remove_stopwords and pass the custom_stopwords list
df1['clean_tweet'] = hero.remove_stopwords(df1['clean_tweet'], custom_stopwords)

In [None]:
# Let's visualize again.

hero.top_words(df1['clean_tweet']).head(20).plot.bar(figsize=(15,10))
plt.show()

In [None]:
# just checking for any null values
df1.clean_tweet.isna().sum()

In [None]:
# WordCloud with single line of code.

hero.visualization.wordcloud(df1['clean_tweet'],width = 400, height= 400,background_color='White')

In [None]:
#Add pca value to dataframe to use as visualization coordinates
df1['pca'] = (
            df1['clean_tweet']
            .pipe(hero.tfidf,max_features=300)
            .pipe(hero.pca)
   )
#Add k-means cluster to dataframe 
df1['kmeans'] = (
            df1['clean_tweet']
            .pipe(hero.tfidf,max_features=300)
            .pipe(hero.kmeans, n_clusters=5)
   )
df1.head()

In [None]:
# Generate scatter plot for pca and kmeans. Cool isn't it?
hero.scatterplot(df1, 'pca', color = 'kmeans', hover_data=['clean_tweet'] )

### Other Visualizations for Further Analysis

In [None]:
#hide
!pip3 install chart-studio

In [None]:
#hide
import seaborn as sns # visualization library
import chart_studio.plotly as py # visualization library
from plotly.offline import init_notebook_mode, iplot # plotly offline mode
init_notebook_mode(connected=True) 
import plotly.graph_objs as go # plotly graphical object

In [None]:
df2 = df.drop(columns=['username','tweet','link'])

In [None]:
df2.head()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['retweets_count'], dashes=False)
plt.title("Retweets over time")
plt.show()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['replies_count'], dashes=False)
plt.title("Replies over time")
plt.show()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['likes_count'], dashes=False)
plt.title("Likes over time")
plt.show()

In [None]:
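# data_count was never defined in this notebook; here it is assumed to be the word counts of the cleaned tweets
data_count = df['clean_tweet'].str.split().explode().value_counts()
data_count = data_count.head(20)
plt.figure(figsize=(10,5))
sns.barplot(x=data_count.values, y=data_count.index, alpha=0.8)
plt.title('Top Words Overall')
plt.ylabel('Word from Tweet', fontsize=12)
plt.xlabel('Count of Words', fontsize=12)
plt.show()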

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
temp=' '.join(df['clean_tweet'].tolist())  # the cleaned column is named 'clean_tweet'
wordcloud = WordCloud(width = 800, height = 500, 
                background_color ='white', 
                min_font_size = 10).generate(temp)
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
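from sklearn.feature_extraction.text import CountVectorizer
import plotly.express as px

def plot_topn(sentences, ngram_range=(1,3), top=20, firstword=''):
    # count every n-gram in the given range across all sentences
    c = CountVectorizer(ngram_range=ngram_range)
    X = c.fit_transform(sentences)
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    words = pd.DataFrame(X.sum(axis=0), columns=c.get_feature_names_out()).T.sort_values(0, ascending=False).reset_index()
    # optionally keep only the phrases that contain `firstword`
    res = words[words['index'].apply(lambda x: firstword in x)].head(top)
    pl = px.bar(res, x='index', y=0)
    pl.update_layout(yaxis_title='count', xaxis_title='Phrases')
    pl.show()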

In [None]:
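# tweet_list was never defined; the cleaned tweets are assumed to be the intended input
tweet_list = df['clean_tweet'].tolist()
plot_topn(tweet_list, ngram_range=(1,1))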

In [None]:
plot_topn(tweet_list, ngram_range=(2,2))

In [None]:
plot_topn(tweet_list, ngram_range=(3,3))
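
The three calls above count unigrams, bigrams, and trigrams. As a simplified sketch of what that counting amounts to, the snippet below does it with `collections.Counter` on whitespace-split text. The helper `top_ngrams` and the sample sentences are made up for illustration, and `CountVectorizer` tokenizes differently (by default it lowercases and ignores one-character tokens), so counts on real tweets can differ.

```python
from collections import Counter

def top_ngrams(sentences, n=2, top=3):
    """Count word n-grams across sentences and return the most frequent ones."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Slide a window of n tokens over the sentence
        counts.update(' '.join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(top)

# Hypothetical cleaned tweets
sample = [
    "free agent guard agrees deal",
    "free agent forward agrees deal",
    "free agent center signs",
]
print(top_ngrams(sample, n=2, top=2))
# [('free agent', 3), ('agrees deal', 2)]
```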

In [None]:
from textblob import TextBlob
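# TextBlob returns a (polarity, subjectivity) pair; fillna('') guards against missing tweets
sentiments = df['tweet'].fillna('').apply(lambda x: TextBlob(x).sentiment)
df['sentiment'] = sentiments.apply(lambda s: s.polarity)
df['subject'] = sentiments.apply(lambda s: s.subjectivity)
# note: a neutral polarity of exactly 0 is labelled 'pos'
df['polarity'] = df['sentiment'].apply(lambda x: 'pos' if x>=0 else 'neg')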

In [None]:
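import plotly.express as px
# only keep tweets that TextBlob scores as more subjective than objective
fig=px.histogram(df[df['subject']>0.5], x='polarity', color='polarity')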
fig.show()

In [None]:
# pre-process tweets to bag-of-words (BOW)
from gensim import corpora
# process_text() is not defined in this notebook, so the TextHero-cleaned column is tokenized instead
r = [x.split() for x in df['clean_tweet'].tolist()]
dictionary = corpora.Dictionary(r)
corpus = [dictionary.doc2bow(rev) for rev in r]
# initialize model and print topics
from gensim import models
model = models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
topics = model.print_topics(num_words=5)
for topic in topics:
    # each entry is a (topic_id, "weight*word + ...") pair
    print(topic[0], topic[1])

In [None]:
labels=[]
for x in model[corpus]:
    labels.append(sorted(x,key=lambda x: x[1],reverse=True)[0][0])
df['topic']=pd.Series(labels)
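
The loop above assigns each tweet the topic with the highest probability. Here is the same selection in isolation, as a small sketch: the helper name `dominant_topic` and the probabilities are invented, and each document is assumed to be a list of `(topic_id, probability)` pairs, which is the shape gensim returns for a bag-of-words document.

```python
def dominant_topic(topic_probs):
    """Return the id of the topic with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

# Made-up topic distributions for two documents
doc_topics = [
    [(0, 0.10), (3, 0.75), (7, 0.15)],
    [(2, 0.55), (5, 0.45)],
]
labels = [dominant_topic(doc) for doc in doc_topics]
print(labels)  # [3, 2]
```

Using `max` avoids sorting the whole list when only the top entry is needed.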

In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [None]:
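# the helper functions below read twint's pandas buffer, so the search has to run first
import twint
# Set up TWINT config
c = twint.Config()
c.Search = "Oneplus 9 pro"
# cap the number of tweets and store the results in twint's pandas buffer
c.Limit = 3000
c.Pandas = True
twint.run.Search(c)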

def column_names():
    return twint.output.panda.Tweets_df.columns
def twint_to_pd(columns):
    return twint.output.panda.Tweets_df[columns]

column_names()
tweet_df = twint_to_pd(["date", "username", "tweet", "hashtags", "likes_count"])
tweet_df.head(10)

print(len(tweet_df))

from transformers import pipeline
sentiment_classifier = pipeline('sentiment-analysis')

results = sentiment_classifier(tweet_df['tweet'].tolist())

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import re  
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:

from twitterscraper import query_tweets
from twitterscraper.query import query_tweets_from_user
import datetime as dt 
import pandas as pd 


begin_date = dt.date(2020,7,1)
end_date = dt.date(2020,7,13)


limit = 100
lang = 'english'

#Use this to search a specific user

user = 'realDonaldTrump'
tweets = query_tweets_from_user(user)
df = pd.DataFrame(t.__dict__ for t in tweets)

df = df.loc[df['screen_name'] == user]

df = df['text']

df

#Use this if you want to search for a specific phrase or word

#tweets = query_tweets('impeachment', begindate = begin_date, enddate = end_date, limit = limit, lang = lang)
#df = pd.DataFrame(t.__dict__ for t in tweets)

#df = df['text']

#df

In [None]:
#This splits all the sentences up which makes it easier for us to work with

all_sentences = []

for word in df:
    all_sentences.append(word)

all_sentences
#df1 = df.to_string()

#df_split = df1.split()

#df_split
lines = list()
for line in all_sentences:    
    words = line.split()
    for w in words: 
       lines.append(w)


print(lines)

In [None]:
#Removing Punctuation

lines = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in lines]

lines

lines2 = []

for word in lines:
    if word != '':
        lines2.append(word)

In [None]:
#This is stemming the words to their root
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

stem = []
for word in lines2:
    stem.append(s_stemmer.stem(word))
    
#stem

In [None]:
#Removing all Stop Words

stem2 = []

for word in stem:
    if word not in nlp.Defaults.stop_words:
        stem2.append(word)

#stem2

In [None]:
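from nltk.probability import FreqDist

df = pd.DataFrame(stem2)

df = df[0].value_counts()

# Build the frequency distribution from the tokens themselves; iterating over
# the value_counts() Series would yield the counts, not the words
freqdoctor = FreqDist(stem2)

freqdoctor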

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcdefaults()  # reset matplotlib settings to their defaults

In [None]:
df = df.head(20)  # keep the 20 most frequent words
plt.figure(figsize=(10,5))
# newer seaborn versions require x and y as keyword arguments
sns.barplot(x=df.values, y=df.index, alpha=0.8)
plt.title('Top Words Overall')
plt.ylabel('Word from Tweet', fontsize=12)
plt.xlabel('Count of Words', fontsize=12)
plt.show()

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))

In [None]:
str1 = " " 
stem2 = str1.join(lines2)

stem2 = nlp(stem2)

label = [(X.text, X.label_) for X in stem2.ents]

df6 = pd.DataFrame(label, columns = ['Word','Entity'])

df7 = df6.where(df6['Entity'] == 'ORG')

df7 = df7['Word'].value_counts()

In [None]:
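df = df7.head(20)  # keep the 20 most frequent organizations
plt.figure(figsize=(10,5))
# newer seaborn versions require x and y as keyword arguments
sns.barplot(x=df.values, y=df.index, alpha=0.8)
plt.title('Top Organizations Mentioned')
plt.ylabel('Word from Tweet', fontsize=12)
plt.xlabel('Count of Words', fontsize=12)
plt.show()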

In [None]:
str1 = " " 
stem2 = str1.join(lines2)

stem2 = nlp(stem2)

label = [(X.text, X.label_) for X in stem2.ents]

df10 = pd.DataFrame(label, columns = ['Word','Entity'])

df10 = df10.where(df10['Entity'] == 'PERSON')

df11 = df10['Word'].value_counts()

In [None]:
df = df11.head(20)  # keep the 20 most frequent people
plt.figure(figsize=(10,5))
# newer seaborn versions require x and y as keyword arguments
sns.barplot(x=df.values, y=df.index, alpha=0.8)
plt.title('Top People Mentioned')
plt.ylabel('Word from Tweet', fontsize=12)
plt.xlabel('Count of Words', fontsize=12)
plt.show()

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [None]:
txt = files[0].open().read(); txt[:75]

In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

In [None]:
defaults.text_proc_rules

In [None]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

### Subword Tokenization

In [None]:
txts = L(o.open().read() for o in files[:2000])

In [None]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [None]:
subword(1000)

In [None]:
subword(200)

In [None]:
subword(10000)

### Numericalization with fastai

In [None]:
toks = tkn(txt)
print(coll_repr(toks, 31))

In [None]:
toks200 = txts[:200].map(tkn)
toks200[0]

In [None]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

In [None]:
nums = num(toks)[:20]; nums

In [None]:
' '.join(num.vocab[o] for o in nums)
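
Numericalization boils down to building a vocabulary and replacing each token with its index in it. The sketch below shows that mapping in plain Python. It is a simplification, not fastai's `Numericalize`: the helper `build_vocab`, the three special tokens, and the toy documents are assumptions, and fastai additionally applies a minimum frequency and a maximum vocabulary size.

```python
from collections import Counter

def build_vocab(token_lists, specials=('xxunk', 'xxpad', 'xxbos')):
    """Special tokens first, then the remaining tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, _ in counts.most_common() if w not in specials]
    return list(specials) + words

# Toy tokenized documents
docs = [
    ['xxbos', 'the', 'trade', 'is', 'done'],
    ['xxbos', 'the', 'deal', 'is', 'done'],
]
vocab = build_vocab(docs)
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Tokens missing from the vocabulary fall back to index 0 ('xxunk')
ids = [token_to_id.get(tok, 0) for tok in ['xxbos', 'the', 'buyout', 'is', 'done']]
decoded = ' '.join(vocab[i] for i in ids)
print(ids)      # [2, 3, 0, 4, 5]
print(decoded)  # xxbos the xxunk is done
```

Decoding the ids does not restore "buyout": any word outside the vocabulary is lost to the unknown token.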

### Putting Our Texts into Batches for a Language Model

In [None]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

In [None]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

In [None]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

In [None]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

In [None]:
nums200 = toks200.map(num)

In [None]:
dl = LMDataLoader(nums200)

In [None]:
x,y = first(dl)
x.shape,y.shape

In [None]:
' '.join(num.vocab[o] for o in x[0][:20])

In [None]:
' '.join(num.vocab[o] for o in y[0][:20])
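
The two cells above show that the target `y` is the input `x` shifted by one token. Here is a simplified sketch of how such batches can be cut from one long token stream. It is not `LMDataLoader`'s implementation (which also shuffles and concatenates the documents each epoch); the function name `lm_batches` and the integer stream are made up for illustration.

```python
def lm_batches(tokens, bs, seq_len):
    """Yield (x, y) mini-batches where y is x shifted one token to the right."""
    # Each of the bs rows needs one extra token so its last input has a target
    row_len = (len(tokens) - 1) // bs
    rows = [tokens[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

# Integers stand in for numericalized tokens
stream = list(range(25))
batches = list(lm_batches(stream, bs=2, seq_len=4))
x0, y0 = batches[0]
print(x0)  # [[0, 1, 2, 3], [12, 13, 14, 15]]
print(y0)  # [[1, 2, 3, 4], [13, 14, 15, 16]]
```

Each row continues where it left off in the next batch, which is what lets a recurrent model carry its hidden state from one batch to the next.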

## Training a Text Classifier

### Language Model Using DataBlock

In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

In [None]:
dls_lm.show_batch(max_n=2)

### Fine-Tuning the Language Model

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

In [None]:
learn.fit_one_cycle(1, 2e-2)
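
The `Perplexity()` metric reported during training is the exponential of the average cross-entropy loss. A small worked example, with invented probabilities and a hypothetical helper `perplexity`, shows how to read it: a model that is always torn evenly between four candidate tokens has a perplexity of 4.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# The model gives the correct next token a probability of 0.25 every time
uniform_guess = perplexity([0.25] * 8)
# A more confident model yields a lower perplexity
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(round(uniform_guess, 6))  # 4.0
print(confident < uniform_guess)  # True
```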

### Saving and Loading Models

In [None]:
learn.save('1epoch')

In [None]:
learn = learn.load('1epoch')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, 2e-3)

In [None]:
learn.save_encoder('finetuned')

### Text Generation

In [None]:
TEXT = "Breaking news: LeBron James has a "
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))
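
The `temperature=0.75` argument controls how adventurous the sampled text is. Conceptually, the model's scores are divided by the temperature before being turned into probabilities, so values below 1 sharpen the distribution toward the most likely words. The sketch below illustrates that effect on made-up scores; it is not fastai's internal code, and the helper name is an assumption.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores for three candidate next words
logits = [2.0, 1.0, 0.1]
base = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.75)

# Lower temperature puts more probability on the top candidate
print(sharp[0] > base[0])  # True
```

A temperature above 1 has the opposite effect and flattens the distribution, producing more varied but less coherent text.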

### Creating the Classifier DataLoaders

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [None]:
dls_clas.show_batch(max_n=3)

In [None]:
nums_samp = toks200[:10].map(num)

In [None]:
nums_samp.map(len)

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

In [None]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

In [None]:
learn.fit_one_cycle(1, 2e-2)

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
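
The `slice(lr/(2.6**4), lr)` arguments above give the earliest layers the smallest learning rate and the final layers the largest. Assuming the classifier is split into five parameter groups and the rates are spaced geometrically between the two ends of the slice, each group trains 2.6 times faster than the one before it. The helper below is a hypothetical illustration of that arithmetic, not a fastai function.

```python
def discriminative_lrs(lr_max, n_groups=5, factor=2.6):
    """Learning rates per layer group, each `factor` times the previous one."""
    lr_min = lr_max / factor ** (n_groups - 1)
    return [lr_min * factor ** i for i in range(n_groups)]

# Matches slice(1e-2/(2.6**4), 1e-2) under the five-group assumption
lrs = discriminative_lrs(1e-2)
for group, lr in enumerate(lrs):
    print(f"group {group}: {lr:.6f}")
```

With a different number of groups the exponent 4 would no longer yield a step of exactly 2.6, so the per-group ratio depends on how the model is split.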