# Simple NLP run-through

This notebook will go through the stages of completing some NLP work as a demonstration.



### 1. Scraping

Scrapers largely fall into 3 categories:

1. It is a common site and a scraper is already built. This covers the majority of sites, and I can supply these scrapers.
2. It is an easy site to scrape, but you can't build a universal scraper for it, e.g. the Telegraph below.
3. It is an uncommon site and you have to build a scraper from scratch.  These are usually quite easy sites to scrape.
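
The three categories above amount to a dispatch-by-domain step plus a list of links that still need manual work. A minimal sketch of that idea, assuming a hypothetical `SCRAPERS` registry and placeholder scraper functions (none of these names come from `bupa_scraper_functions`, and the real `web_app_scraper` may well work differently):

```python
from urllib.parse import urlparse

# Hypothetical placeholder scrapers; real ones would return rows of comments
def scrape_guardian(url):
    return [{"source": "Guardian", "url": url}]

def scrape_csv_export(url):
    return [{"source": "CSV export", "url": url}]

# Hypothetical registry mapping a domain to its pre-built scraper
SCRAPERS = {
    "www.theguardian.com": scrape_guardian,
    "s3.eu-west-1.amazonaws.com": scrape_csv_export,
}

def run_known_scrapers(links):
    """Scrape links that have a known scraper; return (rows, unscraped_links)."""
    rows, unscraped = [], []
    for link in links:
        scraper = SCRAPERS.get(urlparse(link).netloc)
        if scraper is None:
            # No pre-built scraper: this link needs a manual or bespoke scrape
            unscraped.append(link)
        else:
            rows.extend(scraper(link))
    return rows, unscraped

rows, unscraped = run_known_scrapers([
    "https://www.theguardian.com/commentisfree/2021/jun/10/long-covid-hope-recovery-symptoms#comments",
    "https://www.telegraph.co.uk/news/2021/06/24/two-million-may-have-suffered-long-covid/",
])
print(len(rows), unscraped)
```

Returning the unscraped links alongside the data makes it obvious which sites fall into categories 2 and 3.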

### 2. Most Common Words

This analysis is done on pretty much everything.  Luckily I have written a program for this so you just have to run the program!!!!

### 3. Topic modelling

Although I haven't written a program for this, we can follow the steps of the process as it is quite easy...

### 4. Coefficient Analysis - Occasional

Occasionally they ask you to build a model and then share the coefficients. Although the process is easy, I will show you how to do this.


In [44]:
# Imports
import pandas as pd
import numpy as np

# for datetime
from datetime import datetime 
import dateparser


# Webscraping Imports
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from time import sleep

# My path for selenium chromedriver.  update to yours
chrome_path = '/Users/danielpayne/Desktop/Data Science/chromedriver'

# Other scraping tools
import requests
from bs4 import BeautifulSoup
import re
import urllib
import json

import os
from ast import literal_eval

# Reddit API Scraping
import praw
import pprint

# General scrapers already built
from bupa_scraper_functions import web_app_scraper

### Text Analysis

# Most Common Words Function - Will supply!!!!
from humantheory.text_analysis import MCW

# Get SpaCy for tokenisation
import spacy
nlp = spacy.load('en_core_web_sm')

# Import this package for selecting best topics
from operator import itemgetter

# Now gensim for Topic Modelling
import gensim
from gensim import corpora

from gensim.models import CoherenceModel

# Display Topic Modelling
import pyLDAvis.gensim

# To improve Topic Modelling - Will explain!!!!
os.environ.update({'MALLET_PATH':'/Users/danielpayne/Desktop/Data Science/Bassline/new_mallet/mallet-2.0.8/'})
mallet_path = r'../../../Bassline/new_mallet/mallet-2.0.8/bin/mallet'

# In case need to do Coefficients
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import stop_words
from sklearn.svm import SVC

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


# In case need to plot something
import seaborn as sns

## Helper Functions

In [2]:
# Extracts lemmas for Topic Modelling
def extract_pos_of_interest(tokens, additional_stop_words = []):
    
    add_stop_words = ENGLISH_STOP_WORDS.union(additional_stop_words)
    
    tok_list = []
    
    # Loop through the tokens
    for token in tokens:
        
        if token.text in add_stop_words:
            pass
        
        else:
        
            # Check if adjective, adverb, verb or noun
            if token.pos_ in ('ADJ', 'ADV', 'VERB', 'NOUN'):
            
                # Append the LEMMA of the token if it is one of these
                tok_list.append(token.lemma_)

    # Return the list
    return tok_list

# Helps add the keywords of the topics to each row in Dataframe
def topic_describer(topic_num, model):

    wp = model.show_topic(topic_num-1, topn=15)

    topic_keywords = "-".join([word for word, prop in wp])
    return topic_keywords

from gensim.models.ldamodel import LdaModel

# So we can display the "Mallet Model"
def convertldaMalletToldaGen(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha) 
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

# Find the optimal number of topics - Normal topic modelling
def compute_coherence_values(id2word, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    id2word : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        # Use the dictionary passed in as id2word rather than relying on a global
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Find the optimal number of topics - Mallet topic modelling - Much more accurate.  Takes longer though
def compute_mallet_coherence_values(id2word, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    id2word : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        # Use the dictionary passed in as id2word rather than relying on a global
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Function to get the dominant topic for each piece of text
def format_topics_sentences(ldamodel, corpus, texts):
    #Function to find the dominant topic in each review
    sent_topics_df = pd.DataFrame() 

    # Get main topic in each review
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each review
        for j, (topic_num, prop_topic) in enumerate(row):
            
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)

## 1. Scraping

Here I will start the scrape.  

I am starting with the **Long Covid** tab because I can see it is the easiest to do.  Basically I know that we don't have a scraper for the business forum site.

It is best to start with the easiest cohort / subject first.  This is because it takes the strategists 3-4 hours to do each one.  Therefore start with the easiest one so you can return some data as quickly as possible.

It saves them chasing you!!!!!

In [3]:
# define urls
links_long_covid = ['https://www.theguardian.com/commentisfree/2021/jun/10/long-covid-hope-recovery-symptoms#comments',
'https://s3.eu-west-1.amazonaws.com/spreadsheet-csv-files.production/5f215244c0268d0011acd66c/Bupa_bi-weekly%20-%20Jun%2024%2C%202021%20-%203%2045%2010%20PM.csv?AWSAccessKeyId=AKIAZTRMAZ4VVVXQZGHS&Expires=1627137926&Signature=%2BVILaTylzAlhFVAn6bSY7leYT6I%3D',
'https://www.telegraph.co.uk/news/2021/06/24/two-million-may-have-suffered-long-covid/']

# First run through the general scraper and see which ones it does.....
long_covid_df, crap = web_app_scraper(links_long_covid)

print(crap)
print(long_covid_df.shape)
long_covid_df.head()

['https://www.telegraph.co.uk/news/2021/06/24/two-million-may-have-suffered-long-covid/']
(3398, 5)


Unnamed: 0,author,content,datetime,url,source
0,CommunityMod,Comments on this piece are premoderated to ens...,2021-06-10T11:53:22Z,https://www.theguardian.com/commentisfree/2021...,Guardian
1,dwatsuts,"As you said, this is not going to help everyon...",2021-06-10T11:59:56Z,https://www.theguardian.com/commentisfree/2021...,Guardian
2,Mozartkugeln,An interesting but rather sad read; I'm glad t...,2021-06-10T12:07:01Z,https://www.theguardian.com/commentisfree/2021...,Guardian
3,OhIAmSuchASadGit,One of my mates had this. He turned into a rig...,2021-06-10T12:11:11Z,https://www.theguardian.com/commentisfree/2021...,Guardian
4,GaryFenton77,"One of the lads at work has had long covid , h...",2021-06-10T12:17:28Z,https://www.theguardian.com/commentisfree/2021...,Guardian


### Scraping output

Above we can see that the general scraper managed to scrape all the links bar the Telegraph link.

Therefore we will have to scrape this one manually.  For the Telegraph this is quite easy and can be done with the following steps:

<img src="https://raw.githubusercontent.com/danoscarpayne/most_common_words/main/Telegraph%20Scraping.png" width="800" />


1. Open the link to the article.
2. Scroll down to the comments and click on "Show Comments".
3. Right click on the text and click "Inspect".
4. In the developer tab on the right-hand side, click on "Network".
5. Below this, filter to "XHR" requests.
6. Now, in the main section of the site, sort the comments by "Oldest".
7. In the developer console you should now see the API delivering the comment data. It will show something like "0.json" or "1.json" (see the screenshot above).
8. Right click on this request and click "Copy as cURL".
9. Go to https://curl.trillworks.com/ and paste this into the left-hand box.
10. The right-hand box will give you the Python to retrieve this data.
11. Hit the API to get the data.

In [12]:
# From the process above
headers = {
    'authority': 'data.livefyre.com',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    'accept': '*/*',
    'origin': 'https://www.telegraph.co.uk',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.telegraph.co.uk/',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get('https://data.livefyre.com/bs3/v3.1/telegraphmedia.fyre.co/384104/QXhXTjlnRjJsQnps/0.json', headers=headers)

# show output
response.json()

{'content': [{'vis': 0,
   'collectionId': '260593174',
   'content': {'id': '996351354', 'createdAt': 1624511640, 'parentId': ''},
   'source': 5,
   'type': 0,
   'event': 1624511660051999},
  {'vis': 0,
   'collectionId': '260593174',
   'content': {'id': '996351372', 'createdAt': 1624511709, 'parentId': ''},
   'source': 5,
   'type': 0,
   'event': 1624552019136380},
  {'vis': 1,
   'collectionId': '260593174',
   'content': {'generator': {'id': 'livefyre.com'},
    'updatedBy': 'mvrtqz3coeydg2dgmy4ggzrygy4wymlu@telegraphmedia.fyre.co',
    'bodyHtml': '<p>Utter tosh. Ask people who have had proper flu, and they’ll tell you 12 weeks later they are not quite yet over it.\xa0</p><p>Everyone I know who has had covid (maybe 50+ people both professionally and personally plus friends of my family), all \xa0made 100% recovery and that is a lot of people. This doesn’t include all the people who didn’t know they had it. Even my 78 year old mother and father had it- no issues.</p><p>Yes, so

In [13]:
# Look at the length
len(response.json()['content'])

58

We only need to run through 2 pages, as there are only 79 comments overall.

To paginate, we just change the page number in the URL.  Example below.

In [14]:
# Output container
tele_output = []

# template of url above; the page number goes in the placeholder
graph_template = 'https://data.livefyre.com/bs3/v3.1/telegraphmedia.fyre.co/384104/QXhXTjlnRjJsQnps/{}.json'

# paginate
for i in range(2):
    
    # pause between requests so we don't hammer the API
    sleep(1)
    
    # Change number to paginate
    response = requests.get(graph_template.format(i), headers=headers)

    # loop through
    for comment in response.json()['content']:
        
        # dictionary object
        t_dict = {}
        
        # filter to make it easy to code....
        article = comment['content']
        
        # Author
        t_dict['author'] = article.get('authorId')
        
        # likes - entries without annotations count as 0 likes
        annotations = article.get('annotations') or {}
        t_dict['likes'] = len(annotations.get('likedBy') or [])
        
        # datetime
        t_dict['datetime'] = article.get('createdAt')
        
        # content - name the parser explicitly to avoid a BeautifulSoup warning
        raw_text = article.get('bodyHtml', '')
        soup_tele = BeautifulSoup(raw_text, 'html.parser')
        t_dict['content'] = soup_tele.get_text()

        tele_output.append(t_dict)
        
telegraph_long_covid = pd.DataFrame(tele_output)
print(telegraph_long_covid.shape)
telegraph_long_covid.head()

(79, 4)


Unnamed: 0,author,content,datetime,likes
0,,,1624511640,0
1,,,1624511709,0
2,mvrtqz3coeydg2dgmy4ggzrygy4wymlu@telegraphmedi...,Utter tosh. Ask people who have had proper flu...,1624512584,24
3,gnvgy3tsnjyxmzdsnyzxcytvozyde2lt@telegraphmedi...,@Jonny Dangerous I know no-one who had more th...,1624527967,3
4,mrwgc2jtmrywm4lmmzrtg5tlonttmytk@telegraphmedi...,"""The most common symptoms were tiredness and m...",1624519258,16
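
The loop above hard-codes two pages because we already know how many comments there are. When the total is not known in advance, one option is to keep requesting pages until one comes back missing or empty. A minimal sketch of that pattern, assuming a hypothetical `fetch_page` callable (here a stand-in dictionary so the example runs offline; how the real API signals a missing page is an assumption to check):

```python
def fetch_all_pages(fetch_page, max_pages=50):
    """Collect items page by page until a page is missing or empty.

    fetch_page is a caller-supplied function: page number -> list of items,
    or None when the page does not exist.
    """
    items = []
    for page in range(max_pages):
        page_items = fetch_page(page)
        if not page_items:
            # Nothing more to collect
            break
        items.extend(page_items)
    return items

# Stand-in for the real API call: page 0 holds 58 items and page 1 holds 21,
# matching the 58 and 79 seen in the outputs above
fake_pages = {0: list(range(58)), 1: list(range(58, 79))}
all_items = fetch_all_pages(fake_pages.get)
print(len(all_items))
```

In the notebook, `fetch_page` could wrap the `requests.get(graph_template.format(page), headers=headers)` call and return `response.json()['content']`. The `max_pages` cap is only there to stop a runaway loop.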


In [15]:
# Drop blank entries
telegraph_long_covid = telegraph_long_covid.dropna()

# Add url and source columns
telegraph_long_covid['url'] = 'https://www.telegraph.co.uk/politics/2021/06/25/england-current-lockdown-rules-covid-restrictions-roadmap-when-end/#comment'
telegraph_long_covid['source'] = 'Telegraph'

# convert to datetime
telegraph_long_covid.datetime = pd.to_datetime(telegraph_long_covid.datetime, unit='s')

telegraph_long_covid.head()

Unnamed: 0,author,content,datetime,likes,url,source
2,mvrtqz3coeydg2dgmy4ggzrygy4wymlu@telegraphmedi...,Utter tosh. Ask people who have had proper flu...,2021-06-24 05:29:44,24,https://www.telegraph.co.uk/politics/2021/06/2...,Telegraph
3,gnvgy3tsnjyxmzdsnyzxcytvozyde2lt@telegraphmedi...,@Jonny Dangerous I know no-one who had more th...,2021-06-24 09:46:07,3,https://www.telegraph.co.uk/politics/2021/06/2...,Telegraph
4,mrwgc2jtmrywm4lmmzrtg5tlonttmytk@telegraphmedi...,"""The most common symptoms were tiredness and m...",2021-06-24 07:20:58,16,https://www.telegraph.co.uk/politics/2021/06/2...,Telegraph
5,n5vg4ytqgrww4ndcmjswm3lkn5rgmmlg@telegraphmedi...,"Not peer reviewed yet, so should not be publis...",2021-06-24 08:07:20,8,https://www.telegraph.co.uk/politics/2021/06/2...,Telegraph
7,oq2heytnoezxa5dlnfqxg2rrmq4dczlc@telegraphmedi...,Yet another excuse for the work shy and hypoch...,2021-06-24 08:31:23,7,https://www.telegraph.co.uk/politics/2021/06/2...,Telegraph
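
The `unit='s'` argument matters because `createdAt` is a Unix timestamp in seconds. As a quick sanity check, the same conversion can be done for a single value with the standard library (a sketch using the `createdAt` of the first real comment above):

```python
from datetime import datetime, timezone

created_at = 1624512584  # createdAt of the first real comment above

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
as_datetime = datetime.fromtimestamp(created_at, tz=timezone.utc)
print(as_datetime.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-06-24 05:29:44
```

This matches the first row of the table above, so the times in the dataframe are in UTC.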


In [16]:
## Now add to original dataframe
long_covid_df = long_covid_df.append(telegraph_long_covid, ignore_index=True)
print(long_covid_df.shape)
long_covid_df.tail()

(3472, 6)




Unnamed: 0,author,content,datetime,likes,source,url
3467,om3gm3ljof3ggoldnjxdm3lhmrshm3dd@telegraphmedi...,"'The study, yet to be peer-reviewed, indicates...",2021-06-24 11:42:16,1.0,Telegraph,https://www.telegraph.co.uk/politics/2021/06/2...
3468,ozqtkndqmy4wq4zuobtgo2tcnuywwm3n@telegraphmedi...,People wanting to stay on furlough and long te...,2021-06-24 11:49:09,2.0,Telegraph,https://www.telegraph.co.uk/politics/2021/06/2...
3469,nvwtgmzrguzgmzlthfztoz3lojtg22rz@telegraphmedi...,More scaremongering nonsense from the DT.It's ...,2021-06-24 12:47:02,1.0,Telegraph,https://www.telegraph.co.uk/politics/2021/06/2...
3470,gy4xiy3pg44dq2jsn44wymjwojwwkyrq@telegraphmedi...,Well at least it only lasts 12 weeks. Try havi...,2021-06-24 19:42:17,2.0,Telegraph,https://www.telegraph.co.uk/politics/2021/06/2...
3471,hb2te4dpozwxa23enzvxg3lqmnqtinrw@telegraphmedi...,The quality of the research is er... so questi...,2021-06-24 22:57:41,0.0,Telegraph,https://www.telegraph.co.uk/politics/2021/06/2...


## Most Common Words

Generating the most common words is reasonably straightforward, as I have made a pre-built function that carries this out for you.

There are a number of things you can adjust:

- `text_col`: this must be filled in and denotes which column contains your text
- `output_stem`: only used for saving the file at the end, if you want to save in another folder
- `output_number`: the number of words / phrases to output.  Default set to 100
- `max_word_phrases`: the maximum phrase length to consider.  Default set to 3 words
- `additional_stop_words`: any stop words to add
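
To give a feel for what these settings control, here is a rough sketch of phrase counting with stop words. It is only an illustration of the idea and not the actual `MCW` code: the function name `count_phrases` and its arguments are made up for this example.

```python
from collections import Counter

def count_phrases(texts, max_words=3, stop_words=(), top_n=100):
    """Count phrases of 1..max_words consecutive non-stop words in each text."""
    stop = {w.lower() for w in stop_words}
    counts = {n: Counter() for n in range(1, max_words + 1)}
    for text in texts:
        # Lower-case, split on whitespace and drop stop words
        words = [w for w in text.lower().split() if w not in stop]
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[n][" ".join(words[i:i + n])] += 1
    # Keep only the top_n phrases for each phrase length
    return {n: c.most_common(top_n) for n, c in counts.items()}

sample = ["long covid is real", "suffering long covid symptoms"]
result = count_phrases(sample, max_words=2, stop_words=["is"])
print(result[2][0])  # ('long covid', 2)
```

Here `max_words` plays the role of `max_word_phrases` and `top_n` the role of `output_number`.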

In [17]:
# Will set up word counter with some stop words
stop_twit = ['https', 'http', 'pic', 'twitter', 'ly','ciomvhbxsn', 'varytjfcfl',
                                                         'rt', 'qt', 'bit']

# set text_col to 'content' as this column has our text in it
word_bot = MCW(text_col='content', additional_stop_words=stop_twit)

### Now we add the data

Here the word counter will pick out the most common nouns, verbs, adjectives and adverbs.  This makes it take longer, but the team like to have these included.

If you just want a straight "word count" without these, you can pass `nlp_transform = False` next to the dataframe.

In [18]:
# Add the dataframe.  Don't need nlp_transform = True as default set but just to show where the control is
word_bot.fit_transform(long_covid_df, nlp_transform=True)

# Do this step so it makes the word counts 
word_bot.make_word_counts()

# Show the table for main text column with 2 word phrases to check no errors or words we need to remove
word_bot.show_word_count_table('content', 2)

Unnamed: 0,word,count
0,long covid,3441
1,suffering long,172
2,post viral,172
3,long term,147
4,covid long,110
5,viral fatigue,97
6,covid symptoms,90
7,people long,76
8,young people,71
9,covid 19,64


In [19]:
# To Save the data.  Note this will create a sheet for each word type, e.g. original text, nouns, verbs etc
word_bot.export_to_excel('MCW_Niru_example.xlsx')

## Topic Modelling

For the topic modelling, I largely follow the process below.

Note that I use the `Mallet Model` for topic modelling.  MALLET is a separate implementation of LDA topic modelling which, in my experience, makes the topics easier to distinguish.

It takes a bit longer but gives results that are easier to decipher.

Here are the steps.

Note that our original dataframe now has more columns added to it....



In [20]:
# Take a look
long_covid_df.head()

Unnamed: 0,author,content,datetime,likes,source,url,tokenized,nouns,adjectives,verbs,adverbs
0,CommunityMod,Comments on this piece are premoderated to ens...,2021-06-10T11:53:22Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(Comments, on, this, piece, are, premoderated,...",Comments piece discussion topics writer d...,aware short,are premoderated ensure remains raised be...,there
1,dwatsuts,"As you said, this is not going to help everyon...",2021-06-10T11:59:56Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(As, you, said, ,, this, is, not, going, to, h...",everyone person article concern doctors p...,other worthwhile main medical,said is going help suppose helps 's is ...,not at least then however far too often...
2,Mozartkugeln,An interesting but rather sad read; I'm glad t...,2021-06-10T12:07:01Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(An, interesting, but, rather, sad, read, ;, I...",author others recovery condition others t...,interesting sad glad long important long ...,read 'm have made debilitating will be ...,rather very there not very perhaps even ...
3,OhIAmSuchASadGit,One of my mates had this. He turned into a rig...,2021-06-10T12:11:11Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(One, of, my, mates, had, this, ., He, turned,...",mates misery layabout docs bit piss hind...,right total ashamed awful,had turned worked was 're taking says k...,Now mercilessly even also fully now now ...
4,GaryFenton77,"One of the lads at work has had long covid , h...",2021-06-10T12:17:28Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(One, of, the, lads, at, work, has, had, long,...",lads work covid months month months work...,long better easier light shorter last fu...,has had has had says suffers is getting...,now still off back now back off So the...


### Extract Lemmas and make dictionary and corpus

First we must extract the lemmas for topic modelling.  As we already have the tokenised text, we can run a simple function over it to get the lemmas, which I store in a column called `parse_data` here.
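
If the dictionary and corpus are unfamiliar: the dictionary maps each distinct lemma to an integer id, and the corpus stores each document as a list of `(id, count)` pairs. A plain-Python sketch of that structure (an illustration only; gensim's own id assignment and internals may differ, and the helper names are made up):

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["long", "covid", "fatigue"], ["covid", "vaccine", "covid"]]
token2id = build_vocabulary(docs)
bow = [to_bag_of_words(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

The second document contains "covid" twice, so its pair for that id carries a count of 2.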

In [21]:
# To extract the Lemmas
long_covid_df['parse_data'] = long_covid_df.tokenized.map(lambda x: extract_pos_of_interest(x))

# Build our dictionary
dictionary = corpora.Dictionary(long_covid_df.parse_data)

# Build the corpus
corpus = [dictionary.doc2bow(text) for text in long_covid_df.parse_data]

### Determine the ideal number of topics

We run this program so we can work out statistically the ideal number of topics.

We look for the highest score in the `coherence_m_values` list.  The higher the coherence score, the more likely that number of topics is the ideal one.

Once we find the highest score, we select the model in the same position in `model_m_list`.

E.g. if the highest score is the second value in the list, the best model is the second model in the model list.

In [25]:
%%time
# Can take a long time to run.
model_m_list, coherence_m_values = compute_mallet_coherence_values(dictionary, corpus=corpus, texts=long_covid_df.parse_data, start=2, limit=10, step=1)

CPU times: user 4.13 s, sys: 806 ms, total: 4.94 s
Wall time: 5min 5s


In [26]:
coherence_m_values

[0.4810699072097745,
 0.42068753871732795,
 0.41505095396150127,
 0.39399141261363646,
 0.425454128780812,
 0.39454661705774213,
 0.3842254666858421,
 0.37950888566233193]

In [24]:
# To computationally find the index of highest value do this if you want....!
np.argmax(coherence_m_values)

0
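
The index is a position in the list, not a topic count. Because the sweep used `start=2` and `step=1`, the two are related through `range(start, limit, step)`. A small sketch of that mapping, with the scores copied from the output above and rounded to four decimal places:

```python
start, limit, step = 2, 10, 1
topic_counts = list(range(start, limit, step))

# Scores copied from the output above, rounded to four decimal places
scores = [0.4811, 0.4207, 0.4151, 0.3940, 0.4255, 0.3945, 0.3842, 0.3795]

# Position of the highest score, then the topic count at that position
best_index = max(range(len(scores)), key=scores.__getitem__)
best_num_topics = topic_counts[best_index]
print(best_index, best_num_topics)  # 0 2
```

With a different `start` or `step` the same lookup still gives the right number of topics.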

In [27]:
# Note that position 0 is 2 topics, position 1 is 3 topics, position 2 is 4 topics etc....

# select best model
optimal_model = model_m_list[0]

# Make table showing dominant topic for each comment
topic_dist_df = format_topics_sentences(optimal_model, corpus, long_covid_df.parse_data)
print(long_covid_df.shape, topic_dist_df.shape)
topic_dist_df.head()

(3472, 12) (3472, 4)


Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,parse_data
0,0.0,0.5156,"people, suffer, symptom, year, fatigue, month,...","[comment, piece, premoderate, ensure, discussi..."
1,0.0,0.5821,"people, suffer, symptom, year, fatigue, month,...","[say, go, help, suppose, help, person, be, wor..."
2,0.0,0.53,"people, suffer, symptom, year, fatigue, month,...","[interesting, sad, read, be, glad, author, rec..."
3,0.0,0.5196,"people, suffer, symptom, year, fatigue, month,...","[mate, turn, right, misery, total, layabout, d..."
4,0.0,0.5625,"people, suffer, symptom, year, fatigue, month,...","[lad, work, long, covid, month, say, suffer, g..."


In [28]:
# Add this to the Dataframe.  Add 1 to each topic so it is nicer to read and matches the graphic below
long_covid_df['top_topic'] = topic_dist_df.Dominant_Topic + 1
long_covid_df['topic_probability'] = topic_dist_df.Perc_Contribution
long_covid_df.head()

Unnamed: 0,author,content,datetime,likes,source,url,tokenized,nouns,adjectives,verbs,adverbs,parse_data,top_topic,topic_probability
0,CommunityMod,Comments on this piece are premoderated to ens...,2021-06-10T11:53:22Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(Comments, on, this, piece, are, premoderated,...",Comments piece discussion topics writer d...,aware short,are premoderated ensure remains raised be...,there,"[comment, piece, premoderate, ensure, discussi...",1.0,0.5156
1,dwatsuts,"As you said, this is not going to help everyon...",2021-06-10T11:59:56Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(As, you, said, ,, this, is, not, going, to, h...",everyone person article concern doctors p...,other worthwhile main medical,said is going help suppose helps 's is ...,not at least then however far too often...,"[say, go, help, suppose, help, person, be, wor...",1.0,0.5821
2,Mozartkugeln,An interesting but rather sad read; I'm glad t...,2021-06-10T12:07:01Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(An, interesting, but, rather, sad, read, ;, I...",author others recovery condition others t...,interesting sad glad long important long ...,read 'm have made debilitating will be ...,rather very there not very perhaps even ...,"[interesting, sad, read, be, glad, author, rec...",1.0,0.53
3,OhIAmSuchASadGit,One of my mates had this. He turned into a rig...,2021-06-10T12:11:11Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(One, of, my, mates, had, this, ., He, turned,...",mates misery layabout docs bit piss hind...,right total ashamed awful,had turned worked was 're taking says k...,Now mercilessly even also fully now now ...,"[mate, turn, right, misery, total, layabout, d...",1.0,0.5196
4,GaryFenton77,"One of the lads at work has had long covid , h...",2021-06-10T12:17:28Z,,Guardian,https://www.theguardian.com/commentisfree/2021...,"(One, of, the, lads, at, work, has, had, long,...",lads work covid months month months work...,long better easier light shorter last fu...,has had has had says suffers is getting...,now still off back now back off So the...,"[lad, work, long, covid, month, say, suffer, g...",1.0,0.5625


In [29]:
# Cannot visualise Mallet model so must convert this first
ldagensim = convertldaMalletToldaGen(optimal_model)

# Produce the graph
final_vis = pyLDAvis.gensim.prepare(ldagensim, corpus, dictionary, sort_topics = False)
# To display
pyLDAvis.display(final_vis)



In [30]:

# Save the image
pyLDAvis.save_html(final_vis, 'Niru Example Long Covid 2 Mallet Topics.html')

In [32]:
# Make an overview table of topics and percent proportion
overview = pd.concat([long_covid_df.top_topic.value_counts(), long_covid_df.top_topic.value_counts(normalize=True)], axis=1)

# Change column names
overview.columns = ['count', 'percent']

overview.index = overview.index.map(lambda x: int(x))

# Get topic words
overview['topic_name'] = overview.index.map(lambda x: topic_describer(x, optimal_model))

overview

Unnamed: 0,count,percent,topic_name
2,1753,0.504896,long-covid-vaccine-child-death-risk-case-die-y...
1,1719,0.495104,people-suffer-symptom-year-fatigue-month-work-...
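
The `percent` column is simply each topic's count divided by the total number of comments. A quick standard-library check of the figures above (the list of topics is rebuilt from the counts in the table, purely for illustration):

```python
from collections import Counter

# Dominant topic per comment, rebuilt from the counts in the table above
top_topics = [2] * 1753 + [1] * 1719

counts = Counter(top_topics)
total = sum(counts.values())

# One row per topic: (topic, count, share of all comments)
overview_rows = [
    (topic, count, round(count / total, 6))
    for topic, count in counts.most_common()
]
print(overview_rows)  # [(2, 1753, 0.504896), (1, 1719, 0.495104)]
```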


In [33]:
# Now save overview and data
# The context manager saves and closes the file on exit
with pd.ExcelWriter('Niru Example Long Covid 2 Mallet Topics.xlsx') as writer:
    overview.to_excel(writer, sheet_name='overview')
    long_covid_df.to_excel(writer, sheet_name='data')

## Coefficients

Sometimes the team ask for the coefficients of a model to see the differences in attitude between sets of data.

Here I am using data from a similar job we did recently for a whiskey client called Smokehead.

I have attached the brief sheet here: https://docs.google.com/spreadsheets/d/1f-7EK37OfLqb8wFVILfLRwIt5-GLlZ8Ak-jUP0bog9w/edit#gid=0

Here we will reproduce the work from the `Drinking Whiskey Generally` tab.  We will do the comparison between UK sources and US sources.
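
Before working through the real data, here is the general shape of a coefficient analysis: vectorise the text, fit a logistic regression on the target column, then rank the words by coefficient. This is a minimal sketch on a made-up six-line corpus (the texts and labels are invented for illustration and are not from the Smokehead data), and the model settings are defaults rather than anything tuned:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up example texts; 1 = UK source, 0 = US source
texts = [
    "a wee dram of peated whisky",
    "lovely dram by the fire",
    "peated whisky from islay",
    "bourbon on the rocks",
    "rye and bourbon whiskey",
    "whiskey sour with bourbon",
]
is_uk = [1, 1, 1, 0, 0, 0]

# Turn the texts into TF-IDF features, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, is_uk)

# Pair each word with its coefficient via the fitted vocabulary
coefs = {term: model.coef_[0][idx] for term, idx in vectorizer.vocabulary_.items()}

# Most negative coefficients lean US (class 0), most positive lean UK (class 1)
ranked = sorted(coefs.items(), key=lambda kv: kv[1])
print("Leans US:", [term for term, _ in ranked[:3]])
print("Leans UK:", [term for term, _ in ranked[-3:]])
```

Positive coefficients push a comment towards `is_uk = 1` and negative ones towards `is_uk = 0`, which is what the team usually want to see side by side.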

In [34]:
# data path - where data is stored currently
data_path = '/Users/danielpayne/Desktop/Data Science/Dan HT Notebooks/Smokehead/Outputs/'

uk_whiskey = pd.read_excel(data_path + 'general_whiskey 5 Mallet topics Topics.xlsx', sheet_name='data')
print(uk_whiskey.shape)
uk_whiskey.head()

(7009, 13)


Unnamed: 0,author,content,datetime,source,url,tokenized,nouns,adjectives,verbs,adverbs,parse_data,top_topic,topic_probability
0,Razzafrachen,"yes, definitely. \n\n100-110 proof bourbon nea...",2015-08-12T04:15:33,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,"yes, definitely. \n\n100-110 proof bourbon nea...",proof bourbon preference whiskies occasion...,neat little lower higher proof neat thic...,is do mind going will drink tends offer...,definitely n't also much as well there ...,"['definitely', 'proof', 'bourbon', 'neat', 'pr...",1.0,0.3895
1,jmmdc,"I do, but not always. I find different whiskie...",2015-08-12T04:27:27,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,"I do, but not always. I find different whiskie...",whiskies work strengths bottle % buckets ...,different different undrinkable shocked ex...,do find found was tried was did add ge...,not always best when n't at all awhile ...,"['different', 'whisky', 'work', 'best', 'diffe...",1.0,0.3699
2,jo8edogawa,"When I buy cask-strenght, I drink cask-strengh...",2015-08-12T06:57:15,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,"When I buy cask-strenght, I drink cask-strenght.",cask strenght cask strenght,,buy drink,When,"['when', 'buy', 'cask', 'strenght', 'drink', '...",2.0,0.2242
3,reeepicheeep,I used to do it because I convinced myself tha...,2015-08-12T10:31:24,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,I used to do it because I convinced myself tha...,water,enough,used do convinced enjoyed did add can d...,really n't Now comfortably obviously,"['use', 'convince', 'enjoy', 'really', 'do', '...",1.0,0.2361
4,dustlesswalnut,"Yes, all the time. And the benefit to me is th...",2015-08-12T04:27:17,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,"Yes, all the time. And the benefit to me is th...",time benefit,,is tastes,better,"['time', 'benefit', 'taste', 'better']",1.0,0.2201


In [35]:
# Reduce down and add a target column
uk_whiskey = uk_whiskey.iloc[:, :5].copy()

# target col
uk_whiskey['is_uk'] = 1

uk_whiskey.head()

Unnamed: 0,author,content,datetime,source,url,is_uk
0,Razzafrachen,"yes, definitely. \n\n100-110 proof bourbon nea...",2015-08-12T04:15:33,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,1
1,jmmdc,"I do, but not always. I find different whiskie...",2015-08-12T04:27:27,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,1
2,jo8edogawa,"When I buy cask-strenght, I drink cask-strengh...",2015-08-12T06:57:15,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,1
3,reeepicheeep,I used to do it because I convinced myself tha...,2015-08-12T10:31:24,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,1
4,dustlesswalnut,"Yes, all the time. And the benefit to me is th...",2015-08-12T04:27:17,Reddit,/r/whisky/comments/3goej9/do_you_drink_whisky_...,1


In [36]:
links_us = ['https://s3.eu-west-1.amazonaws.com/spreadsheet-csv-files.production/5f215244c0268d0011acd66c/Whisky_drinking_-_general%20-%20Jun%2017%2C%202021%20-%201%2019%2040%20PM.csv?AWSAccessKeyId=AKIAZTRMAZ4VVVXQZGHS&Expires=1626524496&Signature=xl5o%2FX9DyPlani7kINpbN%2FOVG0Y%3D']

us_whiskey, crap = web_app_scraper(links_us)

print(us_whiskey.shape, len(crap))

us_whiskey.head()

(18953, 5) 0


Unnamed: 0,author,content,datetime,url,source
0,Adam Jensen,"Did you drink all my fucking whiskey, Francis?",17-Jun-2021 01:03PM,https://twitter.com/AdamJensenBot/statuses/140...,Twitter
1,Éadweard,@aelfred_D Oh yes; nuncheon. From noon schench...,17-Jun-2021 12:00PM,https://twitter.com/Ed35761527/statuses/140548...,Twitter
2,Susan Buck Jordan,"@MeghanMcCain In a cup of hot black tea, add a...",17-Jun-2021 10:52AM,http://twitter.com/SusanBJordan/statuses/14054...,Twitter
3,RoundballRoundup,@JoelEmbiid ...Pissin the nights away; Pissin ...,17-Jun-2021 09:55AM,https://twitter.com/RoundBallsDeep/statuses/14...,Twitter
4,Katelynn,My boyfriend is snoring so hard I can smell th...,17-Jun-2021 09:39AM,https://twitter.com/husbandaz/statuses/1405445...,Twitter


In [37]:
# Add target column
us_whiskey['is_uk'] = 0

# check shapes
print(uk_whiskey.shape, us_whiskey.shape)

(7009, 6) (18953, 6)


#### Big class imbalance

The US set has 18,953 rows against 7,009 for the UK, so we must rebalance the classes before modelling.

In [38]:
# Get a sample of same size as uk dataset
american_whiskey = us_whiskey.sample(len(uk_whiskey))

print(american_whiskey.shape, uk_whiskey.shape)

(7009, 6) (7009, 6)
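
The sample above is drawn at random, so different rows come back on each run. A minimal sketch of a reproducible downsample, assuming a hypothetical helper `downsample_to_match` and made-up toy frames in place of the real data:

```python
import pandas as pd

def downsample_to_match(majority, minority, seed=42):
    """Randomly sample the majority frame down to the size of the minority frame."""
    # random_state fixes the draw so the same rows come back on every run
    return majority.sample(n=len(minority), random_state=seed)

# Toy stand-ins (made up) for us_whiskey (larger) and uk_whiskey (smaller)
toy_us = pd.DataFrame({'content': [f'us post {i}' for i in range(10)], 'is_uk': 0})
toy_uk = pd.DataFrame({'content': [f'uk post {i}' for i in range(4)], 'is_uk': 1})

toy_us_small = downsample_to_match(toy_us, toy_uk)
print(toy_us_small.shape, toy_uk.shape)
```

Fixing the seed is a design choice: it makes the later coefficients repeatable, at the cost of always modelling the same subset of US posts.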


In [39]:
# Now put together in main dataframe
# sort=True keeps the alphabetical column order seen in the output below
atlantic_whiskey = pd.concat([uk_whiskey, american_whiskey], ignore_index=True, sort=True)

# Shuffle the rows, as all UK rows are on top and all US rows on the bottom,
# then reset the index without keeping the old one
atlantic_whiskey = atlantic_whiskey.sample(frac=1).reset_index(drop=True)

# Take a look
atlantic_whiskey.head()


Unnamed: 0,author,content,datetime,is_uk,source,url
0,blake,@BrettKollmann @ThiccFlairDrip just to be clea...,01-Mar-2021 12:56AM,0,Twitter,https://twitter.com/blksheep_92x/statuses/1366...
1,Peter 🐉🦈🎮,Apparently you can learn a lot about a person ...,19-Nov-2020 01:43PM,1,Twitter,http://twitter.com/PeteeyGaming/statuses/13294...
2,Terry Wilson,"@br_nning Yep, the hunger creates the overwhel...",17-Jun-2021 10:38AM,1,Twitter,https://twitter.com/writetjw/statuses/14054597...
3,Jackmerius Tacktheratrix,this comment thread made my day. cheers MBs,,1,Tiktok,https://www.tiktok.com/@barchemistry/video/692...
4,OliF2018,@LibertarianRed1 @ProperlyZuri It’s more of a ...,01-Apr-2021 05:58AM,1,Twitter,http://twitter.com/F2018Oli/statuses/137748556...


#### Lots of handles and URLs in the data

These add noise to the text features, so I will strip them out before modelling.

In [41]:
# Both regexes were looked up on Stack Overflow

# Remove author / handle names (an @ followed by word characters)
atlantic_whiskey.content = atlantic_whiskey.content.map(lambda x: re.sub(r'@\w+', '', x))

# Remove URLs (http:// or https:// up to the next whitespace)
atlantic_whiskey.content = atlantic_whiskey.content.map(lambda x: re.sub(r'http[s]?://\S+', '', x))

atlantic_whiskey.head()

Unnamed: 0,author,content,datetime,is_uk,source,url
0,blake,just to be clear im not suggesting you shoul...,01-Mar-2021 12:56AM,0,Twitter,https://twitter.com/blksheep_92x/statuses/1366...
1,Peter 🐉🦈🎮,Apparently you can learn a lot about a person ...,19-Nov-2020 01:43PM,1,Twitter,http://twitter.com/PeteeyGaming/statuses/13294...
2,Terry Wilson,"Yep, the hunger creates the overwhelming desi...",17-Jun-2021 10:38AM,1,Twitter,https://twitter.com/writetjw/statuses/14054597...
3,Jackmerius Tacktheratrix,this comment thread made my day. cheers MBs,,1,Tiktok,https://www.tiktok.com/@barchemistry/video/692...
4,OliF2018,It’s more of a fun drink for most I think. M...,01-Apr-2021 05:58AM,1,Twitter,http://twitter.com/F2018Oli/statuses/137748556...
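
The two substitutions above leave stray spaces behind and would fail on a missing comment. A minimal sketch of one way to wrap them, assuming a hypothetical helper `clean_post` (not part of the original notebook) and a made-up example string:

```python
import re

HANDLE_RE = re.compile(r'@\w+')       # same handle pattern as above
URL_RE = re.compile(r'https?://\S+')  # equivalent to http[s]?://\S+

def clean_post(text):
    """Strip @handles and URLs, then collapse leftover whitespace."""
    if not isinstance(text, str):     # e.g. NaN for a missing comment
        return ''
    text = HANDLE_RE.sub('', text)
    text = URL_RE.sub('', text)
    return ' '.join(text.split())

example = '@BrettKollmann @ThiccFlairDrip just to be clear https://example.com/x'
print(clean_post(example))
```

It could then be applied with `atlantic_whiskey.content.map(clean_post)` in place of the two lambdas.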


## Do Modelling

In [42]:
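from sklearn.feature_extraction.text import TfidfVectorizer

# Define features: the raw text of each post
X = atlantic_whiskey.content.values

# Set up the TF-IDF vectoriser: phrases of 2 to 4 words, English stop words removed,
# keeping only the 5000 most frequent phrases
vect = TfidfVectorizer(stop_words='english', ngram_range=(2, 4), max_features=5000)

# Fit the vectoriser and transform the text into a sparse TF-IDF matrix
X = vect.fit_transform(X)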

# This is what we are trying to predict
y = atlantic_whiskey.is_uk
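
To see what `ngram_range=(2, 4)` with `stop_words='english'` actually produces, here is a small sketch on a made-up sentence. The names `toy_docs` and `toy_vect` are illustrative only. Note that stop words are dropped before the phrases are built, so words that were not adjacent in the original text can end up in the same phrase.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up document; "I" and "with" are removed as stop words
toy_docs = ['I drink whisky with water']

toy_vect = TfidfVectorizer(stop_words='english', ngram_range=(2, 4))
toy_vect.fit(toy_docs)

# get_feature_names_out exists on scikit-learn >= 1.0; older versions use get_feature_names
if hasattr(toy_vect, 'get_feature_names_out'):
    phrases = list(toy_vect.get_feature_names_out())
else:
    phrases = list(toy_vect.get_feature_names())

print(phrases)
```

Here "whisky water" appears as a phrase even though "with" sat between the two words, which is worth remembering when reading the coefficient table later.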

In [45]:
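from sklearn.svm import SVC

# Set up the model: a linear kernel gives one coefficient per phrase
cos = SVC(kernel = 'linear')

# Fit data
cos.fit(X, y)

# Make predictions
pred = cos.predict(X)

# Check accuracy to see if the model works ok
# (this is scored on the same data it was fitted on, so it is training accuracy)
score = cos.score(X, y)
score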

0.8268654586959623
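
The score above is measured on the same rows the model was fitted on. If an estimate on unseen posts is wanted, one option is to hold some rows back. This is a minimal sketch under assumptions, not the notebook's method: the toy corpus is made up, and the default single-word features are used because the toy texts are too short for 2 to 4 word phrases to be useful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus (made up) standing in for atlantic_whiskey.content and is_uk
uk_texts = ['single malt scotch dram', 'islay peat single malt', 'speyside scotch dram neat',
            'peaty islay scotch', 'highland single malt dram', 'scotch neat with water',
            'sherry cask single malt', 'smoky islay dram']
us_texts = ['bourbon mixed drinks tonight', 'tequila and bourbon shots', 'drinking bourbon tonight',
            'whiskey and beer tonight', 'rye whiskey mixed drinks', 'bourbon and coke shots',
            'whiskey tequila beer', 'kentucky bourbon tonight']
texts = uk_texts + us_texts
labels = [1] * len(uk_texts) + [0] * len(us_texts)

# Hold back 25% of the rows; stratify keeps the UK/US ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# The pipeline fits the vectoriser on the training text only,
# so no vocabulary from the held-out rows leaks into the model
model = make_pipeline(TfidfVectorizer(stop_words='english'), SVC(kernel='linear'))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}')
```

For coefficient analysis the model is usually refitted on all rows afterwards; the held-out score only indicates whether the coefficients are worth trusting.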

In [46]:
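# Make the table: one row per phrase, one column of coefficients
# (on scikit-learn 1.0 and later, use vect.get_feature_names_out() instead)
coeff_df = pd.DataFrame(cos.coef_.todense(), columns=vect.get_feature_names(), index = ['coeffs']).T

# Sort values in ascending order (most negative first)
coeff_df = coeff_df.sort_values('coeffs')

# Take a look
coeff_df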

Unnamed: 0,coeffs
whiskey beer,-2.683516
whiskey tequila,-2.673012
15 year,-2.594168
drinking tonight,-2.496878
drink vodka whiskey,-2.481900
tequila whiskey,-2.353224
drink bourbon,-2.325267
mixed drinks,-2.308179
drinking bad,-2.267150
whiskey sip,-2.234300


### Reminder

More negative coefficients here will be American, i.e. is_uk = 0.

More positive coefficients here will be UK, i.e. is_uk = 1
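
As a sketch of how the two ends of the table can be pulled out, assuming a hypothetical helper `top_terms` and a toy table: the two negative values are rounded from the output above, while the positive phrases and values are made up for illustration.

```python
import pandas as pd

def top_terms(coeff_df, n=3, column='coeffs'):
    """Return the n most negative (US) and n most positive (UK) phrases."""
    ordered = coeff_df.sort_values(column)
    us_terms = ordered.head(n)             # most negative -> is_uk = 0
    uk_terms = ordered.tail(n).iloc[::-1]  # most positive -> is_uk = 1, largest first
    return us_terms, uk_terms

# Toy coefficient table: the positive rows are invented, not real results
toy_coeffs = pd.DataFrame(
    {'coeffs': [-2.68, -2.59, 0.10, 1.90, 2.40]},
    index=['whiskey beer', '15 year', 'good whisky', 'single malt', 'wee dram'])

us_terms, uk_terms = top_terms(toy_coeffs, n=2)
print(us_terms)
print(uk_terms)
```

On the real table this would be called as `top_terms(coeff_df, n=10)`.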

In [47]:
# Save the coefficient table to Excel
coeff_df.to_excel('Niru Example Coeffificents.xlsx')