<h1><center>Web Harvesting & NLP Analysis with RSelenium & Gensim</center></h1>
<h2><center>Uncovering Click-Bait in 2022's Most Popular Medium Posts in Data Science</center></h2>

What makes a title & subtitle so appealing to click? Which buzzwords attract data scientists? <br>
In this project I sought answers to these questions, and more:
- How do the titles & subtitles of articles with above-mean claps differ from those below the mean?
- How many claps does it take to make an article above average?
- How many articles account for the top 80% of claps?

***

<h2><center>Applying data science to data scientists</center></h2>

Many data scientists analyze human behavior, which introduces a layer of psychology to our field. I've always loved analyzing reactions and how predictable we can be, even though we have free will. This was a fun project, and I was surprised by some of the answers.

<h3>Ethical Web-harvesting</h3>

The web-harvesting was done in R with the RSelenium package. In order to avoid IP flagging and burdening the servers, harvesting the data was spread out over time. While this data is public, it's important to approach this topic ethically.
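
One way to spread requests out is to pause for a randomized interval between page loads. The sketch below is in Python rather than the R used for the actual harvest, and the function names, delay values, and `fetch` callback are all illustrative assumptions, not the settings used in this project.

```python
import random
import time


def polite_delays(n_requests, base_seconds=5.0, jitter_seconds=3.0, seed=None):
    """Return one wait time per request: a fixed base plus random jitter."""
    rng = random.Random(seed)
    return [base_seconds + rng.uniform(0, jitter_seconds) for _ in range(n_requests)]


def harvest(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing after every request.

    `fetch` is a hypothetical callback standing in for the browser automation.
    """
    results = []
    for url, delay in zip(urls, polite_delays(len(urls))):
        results.append(fetch(url))
        sleep(delay)  # wait before the next request to keep server load low
    return results
```

Passing `sleep` as a parameter makes it possible to dry-run the loop without actually waiting.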

<h3>R Code with RSelenium</h3>

If you would like to see my R code, see [01_web_harvesting.Rmd](https://github.com/dstephens179/nlp-web-harvesting/blob/main/01_web_harvesting.Rmd). I mainly used RSelenium to automate the data harvesting, selecting page elements by XPath. XPath gave me greater flexibility to target only the data that was necessary.
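
As a small illustration of how an XPath expression picks elements out of a page, here is a Python sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The markup is made up for the example and is not Medium's actual page structure; the real harvesting code is in the linked Rmd file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; Medium's real archive pages are structured differently.
PAGE = """
<div>
  <article><h2>Time series in pandas</h2><span class="claps">120</span></article>
  <article><h2>Intro to big data</h2><span class="claps">15</span></article>
</div>
"""

root = ET.fromstring(PAGE)

rows = []
# ".//article" selects every <article> at any depth below the root.
for article in root.findall(".//article"):
    title = article.find("./h2").text
    # "[@class='claps']" keeps only the <span> whose class attribute is "claps".
    claps = int(article.find("./span[@class='claps']").text)
    rows.append((title, claps))

print(rows)
```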



<h3>Data Details</h3>

- Data was harvested from medium.com archives with the tag "data science"
- Articles between Jan 1, 2022 and Dec 31, 2022
- Key information:
  - Author
  - Date
  - Title
  - Subtitle
  - Claps received

<h4>Legal</h4>

Medium's terms of use mention web scraping twice:
1. <b>Scraping</b> and reposting content from other sources for the primary purpose of generating revenue or other personal gains.
2. You may not… interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, <b>scraping</b>, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.

This project is only for personal interest, so neither of the above applies.

***

<h2><center>Load Data Cleaned in R</center></h2>

In [3]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import gensim

from nltk.corpus import stopwords
from IPython.display import Markdown as md
from IPython.display import HTML

medium = pd.read_csv("00_data/2022_data_cleaned_for_nlp.csv")

## fill NaN/blank claps with 0
medium['claps'] = medium['claps'].fillna(0)

## concat title+content; fill missing content first so the result is never NaN
medium['combined'] = medium['title'].astype(str) + " " + medium['content'].fillna("").astype(str)
medium = medium[['post_number', 'name', 'title', 'content', 'combined', 'claps']]

medium.head(3)


| | post_number | name | title | content | combined | claps |
|---|---|---|---|---|---|---|
| 0 | 0 | georgia deaconu | 3 ways to deal with large datasets in python | as a data scientist, i find myself more and mo... | 3 ways to deal with large datasets in python a... | 295.0 |
| 1 | 1 | aakriti sharma | a 22-week curriculum to learn data analytics i... | it’s always day one in tech! as the entire wor... | a 22-week curriculum to learn data analytics i... | 678.0 |
| 2 | 2 | michael zabolocki | top python libraries for visualization: a star... | the guide to plotting scatter plots, heat maps... | top python libraries for visualization: a star... | 64.0 |


***

<h2><center>High-Level Analysis</center></h2>
<h4>After cleaning, 33.4k articles by 12.5k writers were left to analyze</h4>

In [5]:
print("Total Articles: ", medium.shape[0])
print("Total Writers: ", medium['name'].value_counts().shape[0])

Total Articles:  33419
Total Writers:  12456


<h4>Claps ranged from 0-6,600 per article, with 66 as the mean</h4>

In [6]:
print("Total claps across all articles: ", int(medium['claps'].sum()))
print("Maximum claps received in an article: ", int(medium['claps'].max()))
print("Minimum claps received in an article: ", int(medium['claps'].min()))
print("Mean # claps received in an article: ", int(medium['claps'].mean()))
print("Median # claps received in an article: ", int(medium['claps'].median()))

Total claps across all articles:  2236214
Maximum claps received in an article:  6600
Minimum claps received in an article:  0
Mean # claps received in an article:  66
Median # claps received in an article:  21
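
The gap between the mean (66) and the median (21) points to a right-skewed distribution: a small number of heavily clapped articles pull the mean up. Note also that `int()` truncates rather than rounds, so the reported mean is slightly lower than the exact one. Using only the totals printed in this notebook:

```python
total_claps = 2236214   # total claps, from the output above
total_articles = 33419  # article count reported earlier

exact_mean = total_claps / total_articles
truncated_mean = int(exact_mean)  # int() drops the fractional part

print(round(exact_mean, 2))  # 66.91
print(truncated_mean)        # 66
```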


<h4>Out of 33.4k articles, 24.5k (73% of the total) received claps below the mean</h4>

In [8]:
## Mean Summary
articles_total = medium['title'].shape[0]
mean_claps = int(medium['claps'].mean())
max_claps = int(medium['claps'].max())
articles_under_mean = medium[medium['claps'] < mean_claps].shape[0]
under_mean_pct = "{:.0%}".format(articles_under_mean/articles_total)
over_mean = articles_total - articles_under_mean
over_mean_pct = "{:.0%}".format(over_mean/articles_total)

print(f"Out of {articles_total} articles, {articles_under_mean} ({under_mean_pct} of the total) received fewer than {mean_claps} claps.")

x = [f'0-{mean_claps-1}', f'{mean_claps}-{max_claps}']
height = [articles_under_mean, over_mean]

px.bar(
    x = x,
    y = height,
    orientation = 'v',
    title = f'Only {over_mean_pct} of Articles Received Claps Above the Mean',
    labels={'x': 'Range of Claps', 'y':'Count of Articles'}
)


Out of 33419 articles, 24537 (73% of the total) received fewer than 66 claps.


![Alt text](above-mean-claps.PNG)

***

<h2><center>Writer Summary</center></h2>

<h4>125 writers, or 1%, wrote 7.4k articles, or 22%, out of 33.4k in 2022.</h4>

In [10]:
writers_total = medium['name'] \
    .value_counts() \
    .shape[0]

one_pct_of_writers = int(round(writers_total * 0.01, 0))

print("Total number of writers: ", writers_total)
print("Top 1% of writers: ", one_pct_of_writers)


top_percent_most_content = medium['name'].value_counts().nlargest(one_pct_of_writers)
writers_pct_of_total = "{:.0%}".format(one_pct_of_writers / writers_total)
articles_by_top_percent = top_percent_most_content.sum()
articles_pct_of_total = "{:.0%}".format(articles_by_top_percent / articles_total)

print(f"{one_pct_of_writers} writers, or {writers_pct_of_total}, wrote {articles_by_top_percent} ({articles_pct_of_total}) articles in 2022.")


Total number of writers:  12456
Top 1% of writers:  125
125 writers, or 1%, wrote 7383 (22%) articles in 2022.


<h4>The top 3 writers with the most content wrote 1,341 (4%) of 33,419 articles, an average of 1.2 articles per writer per day in 2022.</h4>

In [11]:
### Top 3 Writers
top_3_writers = medium['name'].value_counts().nlargest(3)
articles_top_3 = top_3_writers.sum()
articles_top_3_pct_of_total = "{:.0%}".format(articles_top_3 / articles_total)
# average per writer per day: total articles / 3 writers / 365 days
avg_per_day_top_3 = round((articles_top_3/3) / 365, 1)

print(f"The top 3 writers out of {writers_total} wrote {articles_top_3} ({articles_top_3_pct_of_total}) of {articles_total} articles. An average of {avg_per_day_top_3} articles per day.")

The top 3 writers out of 12456 wrote 1341 (4%) of 33419 articles. An average of 1.2 articles per day.


<h4>Pareto Principle: 80/20</h4>

The 80/20 rule says '80% of the results come from 20% of the workers.'<br>
However, it doesn't hold in this analysis: <br>
<b>21k articles (or 64%) out of 33k were written by the top 20% (or 2.5k) of writers.</b>

In [12]:
# Number of writers in the top 20%, ranked by article count
twenty_pct_of_writers = int(round(writers_total * 0.2, 0))
twenty_pct_writers = medium['name'].value_counts().nlargest(twenty_pct_of_writers)
twenty_pct_articles = twenty_pct_writers.sum()

twenty_pct_result = "{:.0%}".format(twenty_pct_articles / articles_total)

print(f"{twenty_pct_articles} ({twenty_pct_result}) out of {articles_total} articles are written by the top 20% of writers, or {twenty_pct_of_writers}.")

21296 (64%) out of 33419 articles, are written by the top 20% of writers, or 2491.
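
The opening question about how many articles account for 80% of claps can be approached the same way, by ranking articles instead of writers. A minimal sketch in plain Python, using made-up clap counts rather than the Medium data (the function name is hypothetical):

```python
def articles_for_share(claps, share=0.8):
    """Smallest number of top articles whose claps reach `share` of the total."""
    target = share * sum(claps)
    running = 0
    # Walk through articles from most to least clapped, accumulating claps.
    for count, value in enumerate(sorted(claps, reverse=True), start=1):
        running += value
        if running >= target:
            return count
    return len(claps)


# Made-up clap counts, not the Medium data.
example_claps = [500, 250, 100, 80, 40, 20, 10, 0, 0, 0]
n = articles_for_share(example_claps)
print(n, "of", len(example_claps), "articles hold 80% of the claps")
```

On the real data, the list passed in would be the `claps` column.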


***

<h2><center>NLP with Gensim</center></h2>

<h3>NLP Process</h3>

We want to know what makes a data scientist click, read through, and engage with an article. I analyzed the title+subtitle (i.e. click-bait) of the articles with claps at or above the mean, and compared those results with articles below the mean.

<b>Please note:</b> The code for the articles below the mean is not included here. It's the same as below, except that the "greater than or equal to" sign in the first line of code is replaced with "less than".
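
To show how the two groups relate, here is a small plain-Python sketch with made-up rows standing in for the `medium` DataFrame. The two filters are complements of each other, so every article falls into exactly one group.

```python
# Made-up rows standing in for the `medium` DataFrame.
articles = [
    {"title": "a", "claps": 10},
    {"title": "b", "claps": 66},
    {"title": "c", "claps": 300},
]
mean_claps = 66  # same cutoff as in the analysis

# The above-mean group uses >=, the below-mean group uses the complementary <.
# With pandas the second filter would be: medium[medium['claps'] < mean_claps]
above_mean = [row for row in articles if row["claps"] >= mean_claps]
below_mean = [row for row in articles if row["claps"] < mean_claps]

# Every article lands in exactly one group.
assert len(above_mean) + len(below_mean) == len(articles)
```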

Each code chunk below walks you through the steps, so you don't have to figure out what I was doing.

#### 1. Tokenize the title+subtitle into a list of tokens

In [16]:
# 5.0 TITLE+SUBTITLE PROCESSING (GENSIM): RESULTS FOR ARTICLES WITH CLAPS ABOVE THE MEAN (27%) ----
medium_above_mean = medium[medium['claps'] >= mean_claps]

# simple_preprocess lowercases the text and splits it into word tokens
tokens = []
for text in medium_above_mean['combined']:
    tokens.append(gensim.utils.simple_preprocess(str(text)))

pd.Series(tokens)



0       [ways, to, deal, with, large, datasets, in, py...
1       [week, curriculum, to, learn, data, analytics,...
2       [is, coursera, ibm, data, science, professiona...
3       [plagiarism, detection, in, online, exams, usi...
4       [shooting, star, problem, simple, solution, an...
                              ...                        
8877    [days, of, pytorch, with, projects, series, ve...
8878    [days, of, tensorflow, and, keras, with, proje...
8879    [top, ai, predictions, for, from, forbes, inde...
8880    [data, science, my, most, successful, year, on...
8881    [what, chatgpt, chatgpt, is, neural, network, ...
Length: 8882, dtype: object
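
Notice in the output that numbers and one-letter words are gone ("3 ways" becomes "ways", "a 22-week" becomes "week"). The sketch below approximates that behavior with a regular expression. It is a rough stand-in written for illustration, not Gensim's own code, and the function name is hypothetical; the length limits mirror `simple_preprocess`'s default `min_len=2` and `max_len=15`.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (not Gensim's own code)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [w for w in words if min_len <= len(w) <= max_len]


print(approx_simple_preprocess("3 ways to deal with large datasets in python"))
```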

#### 2. Remove Stopwords (with, the, and, from, etc.) with nltk.corpus stopwords

In [17]:

## Remove stopwords
## (requires the NLTK stopword corpus: run nltk.download('stopwords') once)
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

### Loop through each token list and keep only the words that are not stopwords.
tokens_stopwords_removed = []
for doc_tokens in tokens:
    kept = []
    for word in doc_tokens:
        if word not in stop_words:
            kept.append(word)
    tokens_stopwords_removed.append(kept)

pd.Series(tokens_stopwords_removed)

0       [ways, deal, large, datasets, python, data, sc...
1       [week, curriculum, learn, data, analytics, fre...
2       [coursera, ibm, data, science, professional, c...
3       [plagiarism, detection, online, exams, using, ...
4       [shooting, star, problem, simple, solution, po...
                              ...                        
8877    [days, pytorch, projects, series, vertical, se...
8878    [days, tensorflow, keras, projects, series, ve...
8879    [top, ai, predictions, forbes, index, ventures...
8880    [data, science, successful, year, medium, than...
8881    [chatgpt, chatgpt, neural, network, model, dev...
Length: 8882, dtype: object
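
The same filtering can be written more compactly as a nested list comprehension. This sketch uses a tiny hand-picked stopword set and two sample token lists purely for illustration; the analysis itself uses NLTK's full English list.

```python
# A tiny hand-picked stopword set for illustration; the analysis uses NLTK's English list.
STOP_WORDS = {"to", "with", "in", "the", "and", "from", "is"}

tokens = [
    ["ways", "to", "deal", "with", "large", "datasets", "in", "python"],
    ["is", "coursera", "ibm", "data", "science"],
]

# Nested comprehension: outer loop over documents, inner filter over words.
# Set membership is O(1) on average, versus a linear scan for a list.
filtered = [[word for word in doc if word not in STOP_WORDS] for doc in tokens]

print(filtered)
```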

#### 3. Frequency Analysis: Uni-/Bi-grams

In [21]:
# Uni-grams
medium_tokenized = medium_above_mean.copy()
medium_tokenized['tokens'] = tokens_stopwords_removed


### explode creates a new row for each word; value_counts counts each word and to_frame turns the counts into a dataframe.
title_term_frequency = medium_tokenized[['post_number', 'tokens']] \
    .explode('tokens') \
    ['tokens'] \
    .value_counts() \
    .to_frame()


## Bi-grams
### Note: delimiter is bytes (b"-") in Gensim 3.x; Gensim 4+ expects a str ("-").
bigram = gensim.models.Phrases(
    tokens_stopwords_removed,
    min_count = 1,
    threshold = 0.1,
    delimiter = b"-"
)

bigram_model = gensim.models.phrases.Phraser(bigram)



### Apply the bigram model to each token list: detected pairs are joined with "-", other words stay as uni-grams.
bigram_list = []
for i in tokens_stopwords_removed:
    bigram_list.append(bigram_model[i])



## Frequency Analysis: Uni-/Bi-grams together
medium_bigram = medium_above_mean.copy()
medium_bigram['tokens'] = bigram_list


### explode, count and to dataframe (the column is named 'count' explicitly so the code behaves the same across pandas versions)
medium_bigram_term_frequency = medium_bigram[['post_number', 'tokens']] \
    .explode('tokens') \
    ['tokens'] \
    .value_counts() \
    .to_frame('count')


## Plot Uni-/Bi-grams together
px.bar(
    medium_bigram_term_frequency.head(50).sort_values('count'),
    orientation = 'h',
    title = f'Most-repeated words in posts above mean (>= {mean_claps} claps)'
)

![Alt text](combined-top-25.PNG)
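
To make the `min_count` and `threshold` arguments more concrete: by default, Gensim's `Phrases` scores each adjacent word pair with a formula based on Mikolov et al. (2013) and joins pairs whose score exceeds `threshold`. The function below is a plain-Python sketch of that formula on a made-up corpus. It is not Gensim's implementation, and the vocabulary size here is simply the number of distinct words, which is an assumption and may differ from how Gensim counts it internally.

```python
from collections import Counter


def bigram_scores(docs, min_count=1):
    """Score adjacent pairs: (pair_count - min_count) * vocab_size / (count_a * count_b)."""
    unigrams = Counter(word for doc in docs for word in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: distinct single words only
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in pairs.items()
    }


# Made-up token lists, not the Medium data.
docs = [
    ["time", "series", "forecasting"],
    ["time", "series", "pandas"],
    ["pandas", "tips"],
]
scores = bigram_scores(docs)
threshold = 0.1
joined = ["-".join(pair) for pair, score in scores.items() if score > threshold]
print(joined)
```

Under this formula, with `min_count = 1` a pair seen only once scores 0, so only repeated pairs such as "time series" clear the threshold.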

***

<h2><center>Conclusion</center></h2>

<h3>Articles mentioning "Time-Series" and "Pandas" are more likely to receive clicks & claps</h3>

![Alt text](comparison.png)

It shouldn't surprise us that keywords in titles+subtitles (click-bait) are similar across all articles. Those similarities suggest that most click-bait keywords make little difference when it comes to claps. Two pairs of keywords stood out:

- **Time-series** and **Pandas** were more popular in articles above the clap mean, but weren't used often in articles below it.
- **Big-data** and **deep-learning** were written about more, but were not as popular.
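
A comparison like this can be made by computing each term's share of all tokens within a group and then taking the difference between groups. The sketch below uses made-up token lists, not the Medium data, and the function name is hypothetical.

```python
from collections import Counter


def relative_frequency(token_lists):
    """Share of all tokens that each term accounts for within one group."""
    counts = Counter(tok for doc in token_lists for tok in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Made-up token lists for the two groups, not the Medium data.
above = [["time-series", "pandas"], ["pandas", "python"]]
below = [["big-data", "python"], ["deep-learning", "big-data"]]

freq_above = relative_frequency(above)
freq_below = relative_frequency(below)

# Positive values: the term is relatively more common in the above-mean group.
terms = set(freq_above) | set(freq_below)
gap = {t: freq_above.get(t, 0.0) - freq_below.get(t, 0.0) for t in terms}

for term, diff in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:15s} {diff:+.2f}")
```

Using shares rather than raw counts keeps the comparison fair when the two groups differ in size, as they do here (27% versus 73% of articles).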

If I wanted to be an applauded data science writer on Medium, I would write about time-series in pandas. That would give me the best chance of being clicked and applauded.<br>
I would stay away from "big-data" and "deep-learning."

"Big-data" and "deep-learning" are not bad topics, but maybe the "big-data" keyword is overused and doesn't feel as cutting-edge as it used to. Maybe "deep-learning" is still a black box, or it's intimidating to learn. That would be a different analysis.

<h2><center>Thanks for reading!</center></h2>