# Medium Data Cleaning
Clean the data scraped from Medium archive pages with the scrape_master.py file.

Each archive was <b>scraped for each day between Jan 2016 and Sep 2020</b>.

#### The Following Tags Were Scraped
['r', 'python', 'data-science', 'machine-learning', 'artificial-intelligence', 'deep-learning', 'data-engineering', 'data-analytics', 'statistics', 'reinforcement-learning']

## Purpose of the Data
 1. To <b>create a performance metric for Medium's authors</b>, so they can compare their work to the rest of Medium.
 2. To <b>compare the performance of authors and publications</b> on Medium.
 3. To <b>create a leaderboard</b> of the top-performing authors and publications in each tag.
 4. To <b>find the differences that distinguish well-received articles</b>.

## Structure of the data
- Title
- Subtitle 
- Image (yes/no)
- Author
- Publication
- Year - Month - Day
- Tag
- Reading Time
- Claps
- Comment (yes/no)
- Story Url
- Author URL

<img src="img/card.png" width=500>

<hr>
# Load the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import glob

# medium = pd.read_csv('Medium_Scrapermedium_artificial-intelligence_2009-2016.csv')
scraped_files = glob.glob("scraped_tags/*.csv")

frames = []
for file in scraped_files:
    # all of the separate scrapes from the different tags
    df = pd.read_csv(file)
    frames.append(df)
medium = pd.concat(frames)
medium.head()

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Tag,Reading_Time,Claps,Comment,url,Author_url
0,,,1,Per Harald Borgen,Learning New Stuff,2016,1,1,artificial-intelligence,7,1.6K,0,https://medium.com/learning-new-stuff/how-to-l...,https://medium.com/@perborgen?source=tag_archi...
1,2016: The Future of User Experience,Whats happened with UX?,1,Krish Ramineni,"Technology, Invention, App, and More",2016,1,1,artificial-intelligence,5,42,0,https://medium.com/technology-invention-and-mo...,https://medium.com/@krishramineni?source=tag_a...
2,Computing all the Feels,#PostsFromTheNearFuture,1,Clayton d'Arnault,Digital Culturist,2016,1,1,artificial-intelligence,7,30,0,https://digitalculturist.com/computing-all-the...,https://digitalculturist.com/@cjdarnault?sourc...
3,Trying to Muse Rationally About the Singularit...,,1,/r/21dotco,,2016,1,1,artificial-intelligence,17,12,0,https://medium.com/@emergingtechnology/trying-...,https://medium.com/@emergingtechnology?source=...
4,Being Good Enough,"November 2nd, 2008",1,/r/21dotco,,2016,1,1,artificial-intelligence,16,0,0,https://medium.com/@emergingtechnology/being-g...,https://medium.com/@emergingtechnology?source=...
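
As a minimal sketch of the same loading pattern, the snippet below stacks several CSV files into one frame. The temporary folder and the two tiny files are hypothetical stand-ins for `scraped_tags/`, and `ignore_index=True` is an optional choice (not used in the cell above) that renumbers the rows from 0 instead of repeating each file's own index.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the scraped_tags/ folder: two tiny CSV files.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"Title": ["a", "b"], "Tag": ["r", "r"]}).to_csv(
    os.path.join(tmp_dir, "r.csv"), index=False
)
pd.DataFrame({"Title": ["c"], "Tag": ["python"]}).to_csv(
    os.path.join(tmp_dir, "python.csv"), index=False
)

# Read every CSV in the folder and stack the frames, renumbering rows from 0.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(combined.shape)  # (3, 2)
```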


In [2]:
print("Number of articles scraped (before cleaning): ",medium.shape[0])

Number of articles scraped (before cleaning):  508028


<hr>
# Converting Strings to Floats

Before we can work with the data, we need to <b>convert the "Claps" column from string to float values</b>. Pandas stores strings under the non-numeric object dtype. In addition, <b>large clap counts are abbreviated, e.g. "5.5K" rather than "5500".</b>

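### Preview of Data Types

The output below comes from a later run of this cell (In [15]), after the clap conversion and the one-hot encoding of tags, so Claps already appears as float64.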

In [15]:
medium.dtypes

Title                           object
Subtitle                        object
Image                            int64
Author                          object
Publication                     object
Year                             int64
Month                            int64
Day                              int64
Reading_Time                     int64
Claps                          float64
url                             object
Author_url                      object
Tag_artificial-intelligence      uint8
Tag_data-analytics               uint8
Tag_data-engineering             uint8
Tag_data-science                 uint8
Tag_deep-learning                uint8
Tag_machine-learning             uint8
Tag_nlp                          uint8
Tag_python                       uint8
Tag_r                            uint8
Tag_reinforcement-learning       uint8
Tag_statistics                   uint8
dtype: object

### Reformatting Clap Information to Floats

In [4]:
# Clap counts higher than 999 are abbreviated, e.g. "5.5K".
# Here we remove the "K", convert the string to float, then multiply by 1000.
numeric_claps = []
for x in medium.Claps:
    if "K" in str(x):
        numeric_claps.append(float(x[:-1])*1000)
    else:
        numeric_claps.append(x)
medium["Claps"] = numeric_claps
medium["Claps"] = pd.to_numeric(medium["Claps"])
print("Clap dtype: ", medium.dtypes["Claps"])

Clap dtype:  float64
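
The same conversion can be sketched as a small standalone function. `parse_claps` is a hypothetical helper name, and it assumes that "K" is the only suffix in the data and that it always comes last, which is also what the loop above assumes.

```python
def parse_claps(value):
    """Convert a clap count such as "5.5K" or "42" to a float."""
    text = str(value).strip()
    if text.endswith("K"):
        # "5.5K" -> 5.5 * 1000
        return float(text[:-1]) * 1000
    return float(text)


print(parse_claps("5.5K"))  # 5500.0
print(parse_claps("42"))    # 42.0
```

Under these assumptions it could be applied to the whole column with `medium["Claps"].map(parse_claps)`.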


<hr>
# Removing Comment Entries
Comment entries are flagged in the data by the Comment column (1 for a comment, 0 for an article). Since these entries are not articles, I remove them in the following cell.

In [5]:
no_comm = medium[medium.Comment==0]
no_comm = no_comm.drop(["Comment"], axis=1)
print("Number of Entries to be removed: ", medium.shape[0]-no_comm.shape[0])
print("Percentage of data removed: ", round(((medium.shape[0]-no_comm.shape[0])/medium.shape[0])*100,2), "%")
medium = no_comm

Number of Entries to be removed:  18229
Percentage of data removed:  3.59 %


# Cleaning up URLs


In [6]:
#before
for i in range(3):
    print(medium.url.values[i])

https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e?source=tag_archive---------0-----------------------
https://medium.com/technology-invention-and-more/2016-the-future-of-user-experience-6d1b7ef3481f?source=tag_archive---------1-----------------------
https://digitalculturist.com/computing-all-the-feels-ed567049a4?source=tag_archive---------2-----------------------


In [7]:
# keep only the part of each URL before the "?" (drops the ?source=... query string)
medium["url"] = medium["url"].str.split("?").str[0]
medium["Author_url"] = medium["Author_url"].str.split("?").str[0]

In [8]:
#after
for i in range(3):
    print(medium.url.values[i])
    print(medium.Author_url.values[i])

https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e
https://medium.com/@perborgen
https://medium.com/technology-invention-and-more/2016-the-future-of-user-experience-6d1b7ef3481f
https://medium.com/@krishramineni
https://digitalculturist.com/computing-all-the-feels-ed567049a4
https://digitalculturist.com/@cjdarnault
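
An alternative to splitting on "?" is to parse the URL with the standard library, which also drops any fragment. This is only a sketch: `strip_query` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


raw = (
    "https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e"
    "?source=tag_archive---------0-----------------------"
)
print(strip_query(raw))
# https://medium.com/learning-new-stuff/how-to-learn-neural-networks-758b78f2736e
```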



# Checking for Missing Entries in the Data


### All NaNs in Each Column
Missing values appear in Title, Subtitle, Author, Publication, and Author_url. <b>NaNs in the Publication column are expected because not every article belongs to a publication.</b>

In [9]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                15758
Subtitle            185827
Image                    0
Author                1149
Publication         248750
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url            1149

Total Entries:   489799
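
The per-column count above can also be obtained without a loop over column positions, using `DataFrame.isna().sum()`. The toy frame below is made up for illustration and is not taken from the scraped data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped data.
toy = pd.DataFrame({
    "Title": ["A", np.nan, "C"],
    "Subtitle": [np.nan, np.nan, "c"],
    "Claps": [1.0, 2.0, 3.0],
})

# isna() marks missing cells; sum() counts them per column.
nan_counts = toy.isna().sum()
print(nan_counts.to_string())
print("Total Entries:", len(toy))
```

This version does not depend on a hard-coded number of columns, so it keeps working if columns are added or dropped.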


## Remove NaN Authors
Medium appears to add existing articles from sites like pcmag.com to its archives. The resulting cards on the archive timeline have neither author nor publication. Since only 1,149 entries (about 0.2% of the data) lack an author, I choose to remove these from the data. Note that the filter in the cell below is commented out, so these rows are still present in the counts that follow.

In [10]:
# medium = medium[medium.Author.notnull()]

## NaN Title and Subtitle Entries
When scraping the archive pages, some titles appear in unusual formats. As a result, <b>some article titles are scraped as subtitles</b>.

Here is a breakdown of the missing entries in the Title and Subtitle columns. I choose to keep these in the data.

In [11]:
#Total entries with no Title
print("Total NaN Title Entries: ", medium[medium.Title.isnull()].shape[0])

#Entries with no title but with a subtitle
print("Entries with NaN Title but existing SubTitle: ",medium[(medium.Title.isnull() & medium.Subtitle.notnull())].shape[0])

#Entries with neither title nor subtitle. Possible explanations?
print("Entries with neither title nor subtitle: ", medium[(medium.Title.isnull() & medium.Subtitle.isnull())].shape[0])

Total NaN Title Entries:  15758
Entries with NaN Title but existing SubTitle:  7035
Entries with neither title nor subtitle:  8723


## Final NaNs

In [12]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                15758
Subtitle            185827
Image                    0
Author                1149
Publication         248750
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url            1149

Total Entries:   489799


# Removing Duplicate and Multi-tagged Articles
Medium allows an author to add up to 5 tags to each story.

When we scraped the archive pages, we scraped each tag separately. <b>As a result, stories appear multiple times in our data (once for each scraped tag they carry).</b>



In [13]:
#multi_urls is all entries in the dataset that have duplicates (includes all duplicates)
multi_urls = medium[medium.duplicated(subset=["url"], keep=False)]
print("There are: ", multi_urls.shape[0], "Duplicated URL entries.")
print("Unique posts with duplicate urls: ", multi_urls.shape[0]- medium[medium.duplicated(subset=["url"], keep="last")].shape[0])
print("Total unique urls: ", medium.shape[0]- medium[medium.duplicated(subset=["url"], keep="last")].shape[0])

There are:  283166 Duplicated URL entries.
Unique posts with duplicate urls:  110214
Total unique urls:  316847
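
To make the three counts above easier to follow, here is a small sketch on made-up data showing how `keep=False` and `keep="last"` differ in `DataFrame.duplicated`. The toy URLs and tags are hypothetical.

```python
import pandas as pd

# Toy data: URL "a" was scraped under 2 tags, "b" under 1, "c" under 3.
toy = pd.DataFrame({
    "url": ["a", "a", "b", "c", "c", "c"],
    "Tag": ["python", "r", "python", "statistics", "r", "python"],
})

# keep=False flags every row whose URL occurs more than once.
all_dupes = int(toy.duplicated(subset=["url"], keep=False).sum())
# keep="last" flags every repeated row except the last one per URL.
extra_rows = int(toy.duplicated(subset=["url"], keep="last").sum())

print("Duplicated URL entries:", all_dupes)                        # 5
print("Unique posts with duplicate urls:", all_dupes - extra_rows)  # 2
print("Total unique urls:", len(toy) - extra_rows)                 # 3
```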


In [14]:
#one hot encode the tags 
medium = pd.get_dummies(medium, columns = ["Tag"])

#multi_tags is all entries in the dataset that have duplicates (includes all duplicates)
multi_tags = medium[medium.duplicated(subset=["url", "Year", "Month","Day"], keep=False)]
print("There are: ", multi_tags.shape[0], "Duplicated tag entries.")
print("Unique posts with multiple tags: ", multi_tags.shape[0]- medium[medium.duplicated(subset=["url", "Year", "Month","Day"], keep="last")].shape[0])

There are:  260230 Duplicated tag entries.
Unique posts with multiple tags:  109508


#### Remove all but one of each duplicate entry, then sort by URL and date

In [16]:
#keep only one entry (the last) of each duplicated article
sort_url = medium[~medium.duplicated(subset=["url"], keep="last")]

#sort by URL, then by date, and renumber the rows from 0
medium_clean = sort_url.sort_values(["url","Year","Month","Day"]).reset_index(drop=True)

medium_clean.shape[0]
# medium_clean.head()

316847
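
Because the cell above keeps only the last row for each URL, only that row's tag column is set in the one-hot encoding. If the goal is to retain every tag of a multi-tagged story, one possible approach is to take the per-URL maximum of the tag columns before dropping duplicates. The sketch below uses made-up data and is an assumption about the intent, not the notebook's own method.

```python
import pandas as pd

# Toy data: URL "a" appears under two tags, URL "b" under one.
toy = pd.DataFrame({
    "url": ["a", "a", "b"],
    "Claps": [10.0, 10.0, 3.0],
    "Tag": ["python", "r", "python"],
})

encoded = pd.get_dummies(toy, columns=["Tag"], dtype=int)
tag_cols = [c for c in encoded.columns if c.startswith("Tag_")]

# Non-tag fields: keep the last row per URL, as in the cell above.
meta = encoded.drop(columns=tag_cols).drop_duplicates(subset=["url"], keep="last")
# Tag flags: the max over all rows of a URL is 1 if any row carried that tag.
tags = encoded.groupby("url")[tag_cols].max().reset_index()

merged = meta.merge(tags, on="url")
print(merged)
```

In this toy result, URL "a" has both `Tag_python` and `Tag_r` set to 1, while the row count still equals the number of unique URLs.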

## Conclusion
How much data do we have after cleaning?

In [17]:
print("Number of articles after cleaning: ", medium_clean.shape[0])

Number of articles after cleaning:  316847


In [18]:
medium_clean.to_csv("Medium_scrape_urls_multi-tag _clean_2016-2020.csv")