![Joe Rogan embraces the almighty Python](LOLROGAN.jpg "I ain't apologizing for this work of art.")

---
### Rogan Guest Stats Notebook

Download, scrape, compile and view various data about the many guests that Joe Rogan has had on his show.

The notebook digs through the guest data with the pandas library. Simply run the appropriate cells to collect and load the data, then see which facts and statistics tickle your fancy. Modify and expand if you want.

**_Requirements_**: Needs my CNW Scraper module which can be found [here](https://github.com/hexadeci-male/CNW_Scraper). Other modules needed are listed inside the notebook as imports.

**_Fun Fact_**: I'm not even a fan of Joe Rogan or anything. Never listened to a single one of his podcasts. I only made this because it was good experience for collection and scraping of web data and because some guy on reddit said it would be interesting to see what kind of data any of his guests would have. Thought it was a neat idea.

---

In [None]:
%config Completer.use_jedi = False

import cnw_scraper as cnw
import pandas
import numpy
import bs4
import requests
import json
import re
import os

file_path = "guest_data.json"

---
### Data Collection
Run these cells to get the latest guest data and write it to a JSON file, then load the cleaned-up data into pandas.
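
For orientation, here is a rough sketch of the record layout the collection cell writes, and of the reshaping the loading cell does afterwards. The guest names, episode numbers and dates are made up, and the real file also carries the extra celebritynetworth fields, which are left out here.

```python
import json

# Hypothetical records (names and numbers are made up) in the layout that
# collect_guest_data() writes: one dict per guest with a list of appearances.
sample_json = """
[
    {"Name": "Guest A",
     "Appearances": [{"Episode": "1500", "Date": "June 30, 2020"},
                     {"Episode": "1200", "Date": "November 14, 2018"}]},
    {"Name": "Guest B",
     "Appearances": [{"Episode": "1400", "Date": "December 17, 2019"}]}
]
"""

guests = json.loads(sample_json)

# Split each guest's appearances into parallel episode and date lists,
# the same reshaping the loading cell applies before building its columns.
flattened = [
    {
        "Name": g["Name"],
        "Episodes": [a["Episode"] for a in g["Appearances"]],
        "Dates": [a["Date"] for a in g["Appearances"]],
    }
    for g in guests
]

for row in flattened:
    print(row["Name"], row["Episodes"], row["Dates"])
```

Episode numbers stay as strings because they come straight out of a regex match.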

---

In [None]:
# Modify as needed.

def collect_guest_data(file_path:str,update_logs:bool=False,cnw_logs:bool=False):
    """
    Collect the newest Joe Rogan guest data from podcast data sites, add extra details with my CNW_Scraper tool's scrape_names function, and write it all to file. Data is written out as JSON, ordered from the guests on the latest episodes to those on the earliest.

    Data came from jrelibrary.com and DataWrapper. There is so much junk in here - unicode identifiers for non-ascii guest names, extra backslashes, inconsistent naming and multiple guest conventions used, missing names, hyperlinks sprinkled everywhere, extra quotes and other characters, junk html, ugh...

    The data is from the general podcasts ONLY - no MMA, fight companion, specials. Name and which episodes a guest appeared on are collected from jrelibrary, and the extra stuff (if available) is from celebritynetworth. It's not perfect, but the data that gets collected is even less perfect. So...Enjoy.

    :file_path: String to where you want to save the guest data (file type should be saved as JSON, but whatever floats yer boat.) Will be overwritten if file already exists.

    :update_logs: Print to terminal what this function is doing? False by default.

    :cnw_logs: Set CNW verbose, console printing, and log file writing to true. Log file is written next to the file you set at file_path. Name is 'cnw.log' with date/time logging active. False by default.

    :return: None.
    """

    # Got a valid path for that file?
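    try:
        # Opening in write mode also empties any existing file, as the docstring warns.
        with open(file_path,"w") as f:
            pass
    except OSError as err:
        raise ValueError("Ya dun goofed - file path invalid or inaccessible.") from err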

    if update_logs: print("Updating guest data ...\n")

    # If Rogan has any more guests on with funny names that only unicode can handle, add those chars here.
    uni_chars = {r"\u2019":"'",r"\u00E9":"e",r"\u00F1":"n"}
    url = "https://datawrapper.dwcdn.net/eoqPA/"
    usr_agt = "Young Jamie" # LOL

    # Perform html request for data
    if update_logs: print("Getting raw data ...")
    with requests.get(url=url,headers={"user-agent":usr_agt},timeout=10) as response:
        response.raise_for_status()
        # Find latest data url link.
        url = re.search(r'(?<=url=).+?(?=")',response.text).group(0)
        with requests.get(url=url,headers={"user-agent":usr_agt},timeout=10) as response:
            response.raise_for_status()
            html = response.content

    # *cries in regex...and in unicode...and in bytes...and in backslashes*
    if update_logs: print("Parsing raw data ...")
    raw_script = bs4.BeautifulSoup(html,"html.parser").find_all("script")[1].contents[0]
    for k,v in uni_chars.items():
        raw_script = raw_script.replace(k,v)
    clean_script = raw_script.replace("\\","").replace("\"\"","\"")
    raw_entries = [l[0] for l in re.findall(r'((rn|">)#.+?\d{4}")',clean_script)]
    # Entries are three parts: episode number, name(s) of guests, date of episode.
    entries = list(map(lambda x: x[3:].replace("</a>\"",""),raw_entries))

    # Create basic guest data from jrelibrary.com/datawrapper.
    if update_logs: print("Setting up data objects ...")
    guest_data = []
    fix_exceptions = ["Dr. Phil","Mr. T"] # Add more if needed.
    fix_removal = ["Dr. ","Mr. ","Mrs. ","Ms. ","Cmdr. "] # Ditto.
    fix = lambda x,r: x.replace(r, "") if x not in fix_exceptions else x
    for e in entries:
        ep_num = re.match(r'\d+',e).group(0)
        date = re.search(r'"\w+\s\d+,\s\d+"$',e).group(0)[1:-1]
        name_data = re.search(r'(?<=\d,)"?"?\w.+(?=,")',e)
        if not name_data: continue
        name_data = name_data.group(0)
        # Get rid of extra junk from name data.
        name_data = name_data[1:-1] if name_data[0] == "\"" else name_data
        if ": " in name_data:
            name_data = name_data[name_data.find(": ")+1:]
        if "- " in name_data:
            name_data = name_data[name_data.find("- ")+1:]
        name_data = name_data.strip()
        # Split up multiple guests if any.
        names = list(map(lambda x: x.strip(),re.split(r',|&',name_data)))
        ap = {"Episode": ep_num,"Date": date}
        for n in names:
            for f in fix_removal:
                n = fix(n,f).strip()
            for i,d in enumerate(guest_data):
                if n == d["Name"]:
                    # This person already exists - add appearance.
                    guest_data[i]["Appearances"].append(ap)
                    break
            else:
                guest_data.append({"Name":n,"Appearances":[ap]})

    # Get remaining data from celebritynetworth.com using my handy scraper.
    if update_logs: print("Collecting extra data from CNW (this may take a bit) ...")
    cnw.Options.custom_user_agent = usr_agt
    if cnw_logs:
        cnw.Logs.print_to_console = True
        cnw.Logs.verbose = True
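        # Log next to the data file; abspath keeps a bare file name from resolving to "/cnw".
        cnw.Logs.write_to_file(os.path.join(os.path.dirname(os.path.abspath(file_path)),"cnw"))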
    profiles = cnw.scrape_names([d["Name"] for d in guest_data])

    # Add extra data to the guests.
    if update_logs: print("Parsing and adding extra data ...")
    valid_chars = lambda c: c.isalnum() or any([x in c for x in [" ","-","'"]])
    parse_name = lambda n: "".join(filter(valid_chars, n)).strip()
    for i in range(len(guest_data)):
        for field in cnw.Profile.fields:
            if field == "Name": continue
            guest_data[i][field] = None
        guest_name = parse_name(guest_data[i]["Name"])
        for p in profiles:
            t = p.description.lower()[:400]
            if all([x in t for x in guest_name.lower().split()]):
                for k in guest_data[i].keys():
                    if k not in p.stats:continue
                    if k == "Name": continue
                    guest_data[i][k] = p.stats[k]
                break

    # Write data and done.
    if update_logs: print("Writing data to file ...")
    with open(file_path,"w") as f:
        json.dump(guest_data,f,indent=4)
    if update_logs: print("\nGuest updates done.\n")

In [None]:
# Get the latest data, if you don't have it already.
collect_guest_data(file_path,True,True)

In [None]:
# Load dataframe and tidy up data
df = pandas.read_json(file_path)
def clean_date(x):
    r = re.search(r'\w{3}\s\d{1,2},\s\d{4}',x)
    if r: return r.group()
    else: return x
clean_dates = df['Date of Birth'].map(clean_date,na_action="ignore")
df['Date of Birth'] = pandas.to_datetime(clean_dates,errors="coerce")
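def clean_heights(x):
    # Pull the number out of the parentheses; give None if the format is unexpected.
    m = re.search(r'(?<=\().*?(?=\s)',x)
    return m.group() if m else None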
df["Height"] = df["Height"].map(clean_heights,na_action="ignore").astype(float)
episodes,dates = [],[]
for x in df["Appearances"]:
    eps,dts = [],[]
    for d in x:
        eps.append(d["Episode"])
        dts.append(d["Date"])
    episodes.append(eps)
    dates.append(dts)
df.drop(["Appearances"],axis=1,inplace=True)
df.insert(loc=1,column="Appearances.Episodes",value=pandas.Series(episodes))
df.insert(loc=2,column="Appearances.Dates",value=pandas.Series(dates))

---
### FUN!
Run these cells to see various data facts and statistics about the guests - add your own, if you want!
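
As a starting point for your own cells, here is a small sketch that counts appearances per year. It builds a tiny stand-in frame with made-up guests instead of using the real `df`, and it assumes the dates look like `June 30, 2020`; anything that doesn't parse is silently dropped from the count.

```python
import pandas

# Hypothetical stand-in for the notebook's df (made-up guests and dates).
demo = pandas.DataFrame({
    "Name": ["Guest A", "Guest B"],
    "Appearances.Dates": [
        ["June 30, 2020", "November 14, 2018"],
        ["December 17, 2019"],
    ],
})

# One row per appearance, then parse the date strings.
# Assumption: full month names; unparseable dates become NaT.
dates = pandas.to_datetime(
    demo["Appearances.Dates"].explode(),
    format="%B %d, %Y",
    errors="coerce",
)

# Count appearances by calendar year, oldest year first.
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)
```

Swap `demo` for `df` once the data is loaded.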

---

In [None]:
# See guests alphabetically
alpha = df.sort_values("Name")
alpha["Name"]

In [None]:
# How many guests have extra data from celebritynetworth?
print(f"Total Guests      {df.shape[0]}")
print(f"Have extra data   {df.dropna(axis=0,thresh=4).shape[0]}")

In [None]:
# What's the male/female ratio?
mf = df["Gender"].value_counts()
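# Assumes there are exactly two values and that "Male" is the most common, so it comes first.
ratio = pandas.Series([f"{mf.iloc[0]/mf.iloc[1]:.1F} : 1"])
# Series.append was removed in pandas 2.0, so use concat instead.
pandas.concat([mf.astype(str),ratio],ignore_index=True).set_axis(["Male","Female","Ratio(M/F)"])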

In [None]:
# Who's the richest/poorest?
f = (df["Net Worth"].notna())
by_wealth = df.loc[f,["Name","Net Worth"]].sort_values("Net Worth",ascending=False)
by_wealth["Net Worth"] = by_wealth["Net Worth"].apply(lambda x: "{:,.0F}".format(x))
by_wealth

In [None]:
# Who's the youngest/oldest?
f = (df["Date of Birth"].notna())
by_age = df.loc[f,["Name","Date of Birth"]].sort_values("Date of Birth",ascending=False)
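# Newer pandas rejects the ambiguous "Y" timedelta unit, so count days and divide by the average year length.
by_age["Age"] = ((pandas.to_datetime("today") - by_age["Date of Birth"]).dt.days // 365.25).astype(int)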
by_age

In [None]:
# Who's the tallest/shortest?
f = (df["Height"].notna())
by_height = df.loc[f,["Name","Height","Gender","Date of Birth"]].sort_values("Height",ascending=False)
by_height["Height(Imp)"] = by_height["Height"].apply(lambda x: f"{int(x*3.2808399)}ft {x*3.2808399%1*12:0.1F} in")
by_height.rename(columns={"Height":"Height(Met)"},inplace=True)
by_height[["Name","Height(Met)","Height(Imp)","Gender","Date of Birth"]]

In [None]:
# What salary information do the guests have?
f = (df["Salary"].notna())
by_salary = df.loc[f,["Name","Salary"]]
by_salary

In [None]:
# Who appeared the most/least on the show?
count = df["Appearances.Episodes"].apply(len)
count.name = "Appearances"
by_app = pandas.concat([df["Name"],count],axis=1)
by_app.sort_values(["Appearances","Name"],ascending=[False,True])

In [None]:
# How many different nationalities do the guests encompass?
nat_count = df["Nationality"].value_counts(dropna=False).rename(index={numpy.nan:"Unlisted"})
nat_count

In [None]:
# Which episodes did a guest appear in?
eps = df.rename(columns={"Appearances.Episodes":"Episodes"})
eps[["Name","Episodes"]]

In [None]:
# What is the wealth and age correlation?
w_a = df.loc[:,["Net Worth","Date of Birth"]]
# Newer pandas rejects the ambiguous "Y" timedelta unit, so count days instead. Missing birthdays stay NaN.
w_a["Age"] = (pandas.to_datetime("today") - w_a["Date of Birth"]).dt.days // 365.25
w_a["Net Worth"].corr(w_a["Age"])

In [None]:
# How many guests held a title of each profession?
profs = df["Profession"].str.split(", ",expand=True).stack().str.title()
profs.value_counts()

In [None]:
# What is the relation between a guest's wealth and how many times that guest was on?
aps = df["Appearances.Episodes"].apply(len)
worth = df["Net Worth"]
guest_df = pandas.DataFrame({"Appearances":aps,"Wealth":worth}).sort_values("Wealth",ascending=False).dropna()

# With/without billionaires...
# guest_df = guest_df[guest_df["Wealth"] < 1_000_000_000]

guest_df.plot(kind="scatter",x="Appearances",y="Wealth")

---
### ETC ...
I'm sure you can do better than me. I had basically no experience with pandas or data science/analysis prior to making this notebook. Put some cells down below and see what else you can dig up from the data.
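
One idea to get you going: group the guests and summarize a column per group. This sketch uses a made-up frame, and the nationality strings and net worth numbers are assumptions rather than what celebritynetworth actually reports.

```python
import pandas

# Hypothetical stand-in for the notebook's df (all values are made up).
demo = pandas.DataFrame({
    "Name": ["Guest A", "Guest B", "Guest C", "Guest D"],
    "Nationality": ["United States", "United States", "Canada", None],
    "Net Worth": [1_000_000, 5_000_000, 2_000_000, None],
})

# Drop guests missing either value, then get the guest count and
# median net worth for each nationality, richest group first.
summary = (
    demo.dropna(subset=["Nationality", "Net Worth"])
        .groupby("Nationality")["Net Worth"]
        .agg(["count", "median"])
        .sort_values("median", ascending=False)
)
print(summary)
```

The median is used instead of the mean so a single billionaire doesn't drag a whole group upward.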

---