# CS 410 Usage Tutorial


This tutorial covers loading the model, analysing the dataset, and running inference on the trained model. The training script is provided in a separate notebook for modularity.

### Important Links:

* Location for saved model: https://drive.google.com/drive/folders/11gDult7SE5hz2EhFwyqAz6_cFx0Y-vEE?usp=sharing
* Location for preprocessed dataset: https://drive.google.com/file/d/1G8j9lX0mPtoUSVv4k_VSl5uUzR8F3fEF/view?usp=sharing

In [1]:
!pip install rouge-score
!python3 -m pip install pytextrank
!python3 -m spacy download en_core_web_sm
!pip install transformers
!pip install sentencepiece

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
### Basic Imports
import numpy as np
import pandas as pd
import torch
import sklearn
import matplotlib.pyplot as plt
import spacy
import pytextrank

from google.colab import auth, drive
import os

from transformers import pipeline
from transformers import T5Tokenizer, T5ForConditionalGeneration


In [3]:
# Google Drive Authentication 
auth.authenticate_user()
drive.mount('/content/gdrive/')


Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


## Part 1: Data Loading

Note that the dataset is loaded from Google Drive.

In [4]:
# Google Drive Folder Update

BASE_DATASET_FOLDER = "TIS/"
ROOT_DATA_PATH = "/content/gdrive/My Drive/" + BASE_DATASET_FOLDER

# Changing directory path 
os.chdir(ROOT_DATA_PATH)


In [5]:
ls

 merged_prefix_transcript_description_summ.tsv  'TIS Presentation.gslides'
 summary-model                                   tisproj/


Next, we load the preprocessed transcript file, `merged_prefix_transcript_description_summ.tsv`.
To run this notebook, download this file from the preprocessed-dataset link listed under Important Links above and upload it to Colab.

In [6]:
# Loading the dataset from colab's mounted drive

TSV_NAME = "merged_prefix_transcript_description_summ.tsv"

podcast_df = pd.read_csv(TSV_NAME,sep='\t',header=0)

In [7]:
# Removing extra index column added while loading
podcast_df = podcast_df.drop(columns='Unnamed: 0')

# Head
podcast_df.head()

Unnamed: 0,episode_prefix,episode_transcript,episode_description
0,6mKHxeVEUzsN9zd0qQZ1Vw,Hello and welcome back to Lab Rats. I'm France...,"A and P of cranial nerves. Reference: Martini,..."
1,0oVZ7OHBw3R6op5FioXQrQ,Hey guys. Thank you so much for listening to t...,Welcome back to the Hop-up Today we have spec...
2,2wRMKnRo5V3QW4Kny6tOvm,"Hello, my name is David and this is a new seri...",Welcome to Our very 1st Podcast on all matters...
3,6sjqAv6M3mlKfY796hFIt3,I've got some exciting news and that's that. I...,If you’re ready to take action toward your goa...
4,6wQ70kw0apzaQhZae1NRdU,Welcome to the third season and episode 351 of...,Seneca uses a colorful analogy between life an...


**Dataset Statistics**

In [8]:
# Dataset Size

print("Total number of podcast episodes: {}".format(len(podcast_df)))

Total number of podcast episodes: 105360


In [9]:
# Podcast Length Summary

def get_word_len(row):
  return len(row.split(' '))

podcast_df['episode_len'] = podcast_df.apply(lambda row: get_word_len(row['episode_transcript']), axis=1)

In [10]:
# Describe

podcast_df.describe()

Unnamed: 0,episode_len
count,105360.0
mean,5806.834804
std,4200.227461
min,12.0
25%,2078.75
50%,5274.0
75%,8785.0
max,44001.0


We can see that there are `105360` data points, i.e. unique episodes, to train our models on. On average, an episode transcript is about `5800` words long (median about `5300`). Assuming a speaking rate of `150 words per minute`, this gives an average podcast length of roughly `39 minutes`.
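
The duration estimate is the word count divided by the speaking rate. A minimal sketch of that arithmetic, using the exact mean and median from the `describe()` output above and the 150 words-per-minute assumption (the helper name `estimate_minutes` is illustrative and not part of the notebook):

```python
# Illustrative helper (not part of the original notebook).
WORDS_PER_MINUTE = 150  # speaking-rate assumption stated in the text

def estimate_minutes(word_count, wpm=WORDS_PER_MINUTE):
    """Convert a transcript word count into an approximate duration in minutes."""
    return word_count / wpm

mean_words = 5806.834804   # mean of episode_len from describe()
median_words = 5274.0      # 50% row of episode_len from describe()

print("mean:   {:.1f} min".format(estimate_minutes(mean_words)))
print("median: {:.1f} min".format(estimate_minutes(median_words)))
```

The mean sits above the median because a few very long episodes (up to 44001 words) pull it upward, so the median is arguably the better description of a typical episode.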

**Dataset Manipulation for Compute Concerns**

Note that, to stay within the limited GPU compute time available, we cap each `episode_transcript` at 7000 space-separated words, since longer inputs would crash the code.

The following cells use lambda functions with `apply` on the pandas DataFrame to carry out this computation over the entire dataset.

In [11]:
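def cap_word_len(row, max_words=7000):
  # Split once on single spaces, matching get_word_len
  words = row.split(' ')

  # Return the text untouched when it is already short enough
  if len(words) <= max_words:
    return row

  # Otherwise keep only the first max_words words
  return ' '.join(words[:max_words])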


In [12]:
podcast_df['capped_episode_transcript'] = podcast_df.apply(lambda row: cap_word_len(row['episode_transcript']), axis=1)
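
To see the capping logic in isolation, here is a small self-contained sketch on a toy string. The function `cap_words` and the sample text are illustrative stand-ins: it follows the same split, slice, and join approach as `cap_word_len`, but takes the limit as a parameter so the effect is visible on a short input.

```python
def cap_words(text, max_words):
    """Keep at most `max_words` space-separated words of `text`."""
    words = text.split(' ')
    if len(words) <= max_words:
        return text  # already short enough, return unchanged
    return ' '.join(words[:max_words])

sample = "welcome back to the show everyone"
capped = cap_words(sample, max_words=3)

print(capped)                  # the first three words only
print(len(capped.split(' ')))  # word count after capping
```

Because the cap counts space-separated words rather than model tokens, the tokenized length of a capped transcript can still differ from 7000.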

Let us confirm the clipping by checking the length.

In [13]:
podcast_df['capped_episode_len'] = podcast_df.apply(lambda row: get_word_len(row['capped_episode_transcript']), axis=1)

In [14]:
podcast_df.describe()

Unnamed: 0,episode_len,capped_episode_len
count,105360.0,105360.0
mean,5806.834804,4539.273738
std,4200.227461,2496.514871
min,12.0,12.0
25%,2078.75,2078.75
50%,5274.0,5274.0
75%,8785.0,7000.0
max,44001.0,7000.0


# Part 2: Evaluating model performance

In [15]:
# Defining the ROUGE score objective
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
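
For intuition about what the scorer reports: ROUGE-1 counts the unigrams shared by a reference and a candidate summary and turns that overlap into precision, recall, and an F-measure. The sketch below is a simplified stand-in, not the `rouge_score` implementation: it lowercases and splits on whitespace only, with no stemming, so its numbers will generally differ from the library's. The function name `rouge1_sketch` and the example sentences are illustrative.

```python
from collections import Counter

def rouge1_sketch(reference, candidate):
    """Simplified ROUGE-1: clipped unigram overlap -> (precision, recall, f1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "traci shares advice on breaking into hr"
candidate = "advice on breaking into an hr career"
precision, recall, f1 = rouge1_sketch(reference, candidate)
print(precision, recall, f1)
```

In the notebook itself, the equivalent call would be `scorer.score(target, prediction)`, which returns a dictionary keyed by `'rouge1'` and `'rougeL'` whose values carry `precision`, `recall`, and `fmeasure` fields.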


## Qualitative Analysis 

We now compare the two methods, TextRank and T5, qualitatively on an example episode.


In [16]:
transcript = "Welcome to the bringing the human back to Human Resources Podcast. The podcast. All about the delicate balance between people and business, and quite literally, reconnecting the two. My name is Tracy Rubin and I've spent nearly my entire professional career in HR. Join me as I share stories, opinions and words of advice with you each week. Hey everyone, this is Tracy Rubin. Welcome back to the bringing the human back to Human Resources Podcast. It's officially week six, so I've been doing this podcast now for almost two months, which. I'm proud of it's exciting. It's an exciting opportunity for me to, you know, talk about whatever is on my mind and I think it gives me an opportunity also to interact with all of you. I really invite you to reach out if there's something that you want to talk about specifically or hear me talk about specifically and actually it leads me into the point of this week's podcast. But before I go into it, I want to give you a quick update on how I'm doing with my commitments. Two and improved or achieving a better work life balance. And on last week's podcast episode I said that I would be waking up at 5:00 AM everyday. I really meant Monday to Friday, which I clarified on Instagram. But waking up everyday at 5:00 AM. Getting myself ready to go for the day. Kind of giving back to myself. Creating routine, having that pseudo commute if you will. And I am here to tell you that I did just that. So I'm recording right now while I'm still in that week where I've committed to making those changes to achieve a better work life balance. And I actually I have to say like waking up at 5 which I used to do regularly. Feels so good again. Like I I feel actually more well rested, which could be because I'm going to sleep a little bit earlier. I'm giving myself time to do anything and everything that's not work related for like 4 hours before I even touch work. And so it's just a really nice improvement. So far. 
So I'm looking forward to keeping it up and I will continue to check back in and keep you guys updated. I hope that we all can as a little community. Hold each other accountable to these things so we can have better just experiences, especially while we're mostly, I would imagine most of us are working remotely if you're not working remotely. All the more reason to establish a really good routine because. It's not easy to be working in this environment or this climate right now, and so taking space for yourself, I think is still super critical. So to get into today's topic, I've had a lot of outreach around this topic for weeks now. Pretty much since the beginning of the podcast, and I've touched on it a little bit here and there, but I figured, why not dedicate a full episode to this, and it's probably going to result in my bringing on a guest and. A future episode to really dive into some actionable tips and tricks, but I wanted to talk about breaking into our and I've engaged with some of you over DMS on LinkedIn Messenger, and I've been able to give you some of my advice and thoughts individually, but I know that there are so many people who don't reach out and so hopefully if you're listening, this will help you. If you are considering a career in HR, I strongly recommend a career in HR. If you like the following things people business strategy, having an impact on people so. Great you like those things check right however you have to also know what you don't like because HR definitely will challenge you if you don't like the following things. Being resourceful and solutions oriented, feeling comfortable wearing many different hats and being extremely agile not only day-to-day and week to week but also minute to minute things. Change constantly and so that agility and adaptability is really really. Important you have to also have an appetite for some of the administration piece. 
I mean, even in my current role as a director, there's so much admin work still and we have an admin so you have to be comfortable with systems and operations. And I recognize that a lot of just roles and jobs call on all of these things, but I think the biggest thing is that if you don't like people and if you don't like helping people. Find a different career if you didn't want to hear the truth today, I'm sorry you got it, but that is the truth. If you don't like people and you don't like helping people, then this is the sign you've been waiting for that you should not pursue a career in HR. Certainly there are parts of HR that have less and less to do with people like you know some of the admin stuff you you could be doing a lot of manual entry and things like that. However, once you are a part of an HR team, people will see you as a resource. And so if you want to be an individual contributor in the sense that you have no impact or no involvement with the human element, don't come into this career. In the first episode, right? I talked to you guys about how my undying focus and goal is to destigmatize HR, and that's because there are people in HR who shouldn't be in HR who have stigmatized the career who don't care about people and don't want to help people and so. If you are not being honest with yourself, maybe you think you like people. Maybe you think you want to help people, but every time someone reaches out for help or support and you get irritated or you can't find the the motivation behind that. You at least should not be in a business partner role. Maybe there's an element of HR that you could get into, like maybe you really really like to do manual work and enter in data. Like maybe it's HR analytics that could be totally possible, but again, I really think and I don't know. Maybe I'm completely spoiled by the job that I'm currently in and the company that I work for, but I think that if you don't like people or you don't. 
Have an appetite for helping them that it's just not going to work out and actually think that's the case for management too. I don't know what jobs are out there where you don't have to work with people. Maybe it's like software development engineering, maybe I don't know. Are there jobs out there that don't deal with people? Hey, maybe not. If there aren't, then you better figure out a way to like people. That's all I have to say, but it's a fundamental part of this role. HR is a super super broad field. It's kind of like saying. Retail, right, like retail, could mean anything. Retail is literally any consumer product, consumer, brand, company. That's where retail is, right? The ability for someone to purchase what a company is selling, and so it's kind of the same with HR. Like what is HR? You can find HR anywhere in education, its personnel in retail. It's HR in tech, it's people, operations in medicine. It's HR. Two like there. Basically it's all in one in the same, and yet everything is completely different. So step one, find the industry that you feel you're best suited for and this can change overtime. When I graduated college, my mom was like this is your first job. It's not your last job and that is really good advice because I think if anyone can relate to the experience that I had when I was in college, I felt an incredible amount of pressure not only to get a job. But also almost like thinking in my mind that it was like the end all be all, and it's not. This year has been really challenging. I'm sure for recent graduates because they maybe have secured or had a secured their first job and then you know the pandemic hit and they either lost their jobs or they were furloughed. Or you know something potentially threw wrench into their plan and I couldn't imagine knowing the amount of pressure that I put on myself when I. Had my first job which was target out of college and I didn't like think past that. I almost and I don't know why. 
I don't know if I just. Thought that you go to a job and you stay there for 30 years because people who work at Johnson and Johnson do that. Like I, I think that that was my mentality. Now I mean, studies have shown that millennials are more likely to jump around between jobs within a year or two years of tenure in an organization. I wonder how the pandemic will change that people might see that now there's a bit more added value if you spend a longer time with the company. I personally, I mean now I've I've almost four years of my current company. I spent almost four years at Target. I haven't had an experience where I spent. Less than four years at this point, with an organization. But I know that it's real. It's a real thing and recruiters have actually had to change the way that they look at candidates. Knowing that millennials are much more likely to jump around, and that that's not necessarily reflection of their willingness to be loyal or their loyalty, I should say to a brand. So anyway, I digress. First thing is really again, figure out the industry for me. I had an experience in retail. So I knew I liked it when I was in college and in high school I had a lot of leadership experience and I knew that anything in business or management would really interest me because it's very similar. Of course, to a leadership experience you don't necessarily need to have all like all of those years of experience in order to even figure out what you like or dislike if you are considering a career in HR, first start with like the industry. It helps narrow it down, and I think. It's a lot easier than just looking for entry level roles across all industries because there actually are a lot, but it's really overwhelming and I don't know that it's necessarily time well spent to apply to every single entry level job. If you don't have HR experience just because it's a job, OK, so now you've narrowed down the industry. 
Maybe you have select 3 industries that you would really be interested in working in. Now. You should really think about OK, what area? Of HR interests me most, so maybe its payroll. Maybe its benefits? Maybe it's compensation, analytics people or business partnering administration recruiting. So when you take a look at all of those different facets of our and certainly there are more and there are subsets and we can go on and on. Think about what it is that you like and what you don't like. So for example, recruiting requires a lot of talking, a lot of patients. A lot of listening if you don't really like to meet tons and tons of people that are going to be really good candidates. Really bad candidates and everything in between. Maybe you don't even appreciate so much speaking on the phone or face timing. It's probably not going to be the right facet of HR for you. Conversely, if you really like numbers and you like strategy and analytics, payroll benefits compensation. Analytics those things actually really might interest you payroll. I have met people who love paper payroll and I have met people who hate payroll. And actually I am one of those naysayers of payroll. I knew after having my experience at Target where I was a generalist, really like an HR manager generalist, I had that payroll experience and I knew that it did not motivate me and so for me I was like, OK, I know I'm not going to do that. But I also knew that I loved. Interacting with people I loved connecting with employees, helping them solve their problems. You know, applying different strategies on the people and operations side of things. So it was very easy for me to realize that being a business partner or you know, being like a head of HR one day would be what I really wanted to go after. So kind of like think about what it is that your strengths are. What are your opportunities? What don't you like? What motivates you? That certainly should help. And then so now you have your industries. 
You have your likes and dislikes. Now you can get to applying to jobs. How do you do that? The first is reach out to your networks. I'm sure you've heard this a million times over. It's all about who you know, right? Like we hear that probably from a very young age and it's really true, like there's no denying it. That people and those connections, and who you know, really matter and and it might not necessarily mean that you get. A job because of who you know, but it's certainly can mean that you avoid the paper pushing and bureaucratic process of job application referrals. For example, if you have someone who works at a company that you really want to work for or that has a job posted that you're really interested in, reach out to someone that you know. Maybe you spoke with them ten years ago. Maybe you spoke with them two days ago. Reach out and say, hey, do you have an internal referral process? Because this is the job I'm really interested in. Can you refer me? I would love and really appreciate an opportunity at this, you know organization. If you're not that close with the person you're reaching out to, one of the ways that I would suggest reaching out if you don't really know the person, but you're familiar with whom they are. Or maybe it's a friend connecting you to one of their friends, have an authentic and real conversation first you know. Reach out, say, hey, do you? Can we grab a coffee in a non COVID world? Maybe stay safe? Can we FaceTime? I'd love to pick your brain on your experience. That is a really, really great way of building a network, but also not just saying. Hey, I'm hitting you up because I want to refer. OK thanks bye. This is an opportunity for you to actually like build a network or a connection on your own through the connection that you have or previously had with this person and it makes them feel good, right appeal to them. Be strategic, be authentic, but also use that as a way to ultimately get what you want. 
So now you have some of those tools in your tool belt, right? Like picking out the industry. Picking out the. Likes and dislikes or the things that you really want for a career, and then the third is leveraging your network, but the 4th is that your resume needs to reflect the skillset that the role requires. If you're applying to a recruiting position, you need to be able to reflect on your resume that you have some kind of skill set or experience with interpersonal communication. Communication skills in general. How you bring people together? How you keep networks and connections alive. If you're in a sorority or fraternity that's networking, right? So, highlight that. Really. Focus on the skill set. Because if you have even one year of establishing or generating, maintaining a skill set that can be applicable or transferable to the job that you're applying to, knowing how to talk to it is really important. OK, so let's take it way back. Let's say I'm. Back in college I went to Binghamton University in New York and I had internships every summer and usually during the winter also. And those internships I didn't necessarily have as much strategy for. I thought I was going to go into some sort of. Government work or environmental work like I studied English and I got a minor in environmental science. So clearly like that's not HR. That's not business, right? But I ultimately was able to translate my skillset. From what I gathered as a student and in my education and basically transfer that into a career with target in management. Like how do you do that, right? You do it in talking about your skill set, making sure your resume reflects the experience that you have had that's relevant to the job that you're applying to. When I was a senior in college, I went to a job fair that was on campus and historically the the vendors or the recruiters who were at the college fair were really looking for like finance and accounting students. 
Engineering students, prelaw Poli SCI and so. It could have easily deterred me from going, but I was like, hey, there could be someone interesting there. Let me go and I went, and certainly there was someone interesting there, and her name is target and they were speaking to me about my leadership experience. I would in a sorority, so I was. I held a few chair positions and leadership positions in that sorority. I was an RA. I was a tour guide like I was very involved and I was a student leader on campus. And so I was able to take all of that experience and translated into what ended up becoming my first real out of school manager experience. And I had an interview the next week. The following week the following week and then I think I went through like maybe six or seven interviews and then by Halloween of my senior year I had a job lined up. It was the best Halloween weekend. Probably of my life. I won't go into too much detail because this is an HR podcast, not a party podcast, but. Nevertheless, it was a really wonderful way to feel secure and stable, and it doesn't always happen that way. I get that my point is really that I didn't have manager experience. I had leadership experience, I had experience. Running events bringing people together, networking. I had all of those transferable skill sets from my RA job on campus where I was managing people in a different way, but also in a way that translates for the role that I was applying to. So if your resume doesn't currently speak to the job that you want and it only speaks to the job that you have, you are missing opportunities. So take this as a piece of advice that you should go into your resume. Get yourself prepared for interviews with your resume with your stories. 
Getting everything going so that you can really talk about what it is that you are able to do or what you will be able to do based on the experiences that you have retained and achieved over the course of however many years that you've either been a student or in the career that you might currently be in. And so if you are not a student and you're looking to make a switch. Into the HR industry, it's really all kind of the same advice in that your resume should reflect the job that you want or the job that you're applying for and not the job that you're currently in. The key there is that of course your resume is going to say all of the things that you currently do, but you have to frame it in a way that doesn't necessarily speak to the past or the present, but speaks to the future. So if you're in a current role or current industry that is an HR and you want to kind of get. Into or break into the industry. Take a look at your resume first. Take a look at your networks. It's all of the same messaging, right? All of the same advice that I have for you as I do for a recent graduate or a current student. Think about how you can interview and get those stories out. So if someone asks you well, why do you want to make the switch from X to HR? You have to be able to really speak to the wise and also talk about the challenges that you know will be there. So one of the questions that I usually ask people on interviews is what is an opportunity that you have that we need to be aware of so we can support you in your first like thirty 6090 days for example, or maybe even a little bit different? What is it that you are going to struggle with that we need to be aware of? And it's a question that can throw some people off, but really what I'm looking for is an awareness of the challenge of the role and transparency into opportunities. Just stop. With the my biggest opportunity is that I care too much. That's not an opportunity. That's a strength. Like, let's be real here. 
Get yourself prepared for a real interview and a real window, or providing a real window into your interests. And I think that it's all in that authenticity and the transparency and awareness that can make the interview just be that much better. Obviously there's a lot to break down there, and I I'm going to. I know I keep promising this, but I'm going to spend time on an episode talking about interview skills and how you can best set yourself up for that change in industry, but it really, it all comes back to that. For those of you who are already in your careers, switching into HR is the hardest thing for you. It's going to be hardest for those of you who are already in your careers. It's not going to be as hard for someone who doesn't have experience, and it's because when you're when you get into HR. When you finally break in. You are starting from the ground up, and while yes, I've been talking about transferable skill sets, that's definitely going to help you, but HR has its own acumen. Industry specific verbiage and knowledge, and things that can make it more challenging for someone who has not had their career in HR and isn't just starting out. So I caution you there. However, I think the most important thing for someone who's already in their career is to leverage the networks first. Anyone you know at any company is going to be a better way for you to break into the role they're going to be your best bet. My other piece of advice is that you talk to your current manager and your current our partner and express your interest. It's one of the best things that you can do, and actually I think it's a lot easier to break into HR in the company or for the company that you're currently at because they know who you are. They know how you perform. No risk or there's less risk when you apply for an external opportunity. 
Not only are they taking a risk and not knowing what your performance is like and how you work there, also taking a risk because you don't know the industry and you don't know the role. My suggestion will always be break into HR at the current company or the current organization that you're with and then make a move. I hope that those pieces of advice were helpful for you. I know that 30 minutes is not enough to really like breakdown. Every single thing that can be helpful, or every single thing that you should keep in mind, but I'm I'm always here for outreach. If you guys have any questions or you want to dig in a little bit more on any of the points that I have drawn out on this week's episode, you also can actually email me. I set up an email, it's podcast at HR tracy.com, so if you have a lengthier question or something that you'd really like to dive into, whether it's on this topic or another topic, please. Shoot me an email and I'm happy to engage in a lengthier conversation. So in the coming weeks I am going to have probably one or two guests on. I'm still working on it. If there's someone in this hour space or business space, or maybe a completely different industry that you'd really like to hear from, please let me know whether it's through that email. Like I said, podcasts at hrtracy.com or through my social networks. Please let me know. I'm excited about the people that I'll be interviewing this coming week. I don't know when the podcasts will launch with their interviews, but I know that the conversations are going to be really exciting and kind of touch on some of these things that we talked about today and previous on previous episodes. So nevertheless, I thought I would share a story that I think is really sweet. When I traveled to Japan with one of my friends back in 2016, first of all, it's the most amazing, beautiful, incredible country. So if you get the chance. 
Once this pandemic is over and it's safe to travel again, you really should make make moves get to Japan. It's incredible, but when we were there we went to a like a an American style diner. So it was basically like American food, but it was Japanese and it was called bills and I think it was in her ajuku and when we were sitting there and we were eating breakfast, I saw these two. Young women who were taking like I guess, like a selfie or a picture with their phones, and they had this incredible like attachment on the front facing camera or no. I'm sorry the back facing camera and I used Google Translate to say and I think I even said to my friend like Oh my gosh that's so cool. They were basically able to use a fisheye lens on their phone for their pictures and I used Google Translate to tell them that I really liked. Their phone camera and it sparked a friendship and actually they traveled to New York and we all stayed in touch and we went to Chelsea market and we showed them around there and we got pizza and they did not speak English and we did not speak Japanese and I don't know if anyone from Google is listening give us a commercial. It was such an incredible experience to connect with them without speaking their language. And without them speaking hours through Google Translate just to have like a little chat, we ended up taking a picture. And yeah, like I said, they came to New York and we all got together and we still speak to this day. I actually recently I think it was probably over the summer. Had like a little FaceTime with one of our friends that we met during that whole experience. I call her my pen pal and she learned English. And it was actually really easy to speak with her and I share this because she reached out to me on Instagram, which we we are connected on on Instagram already like on my personal account. 
But she also follows my HR Tracy Instagram and I followed her back which I did for pretty much everyone because I realized I actually have to follow someone in order to see if they tag me in a story if they're private. So I pretty much followed everybody that followed me, but she messaged me on Instagram and she said. I'm so excited that you followed me from this Instagram account. I listen every week and actually I know that she listens because I see in the metrics for every episode, like the location and I see Japan and actually there are like three or four listeners in Japan, which is really cool and I just thought it was an exciting way to share that the the world is so much smaller than we realize. And she learned how to speak English and she's still learning and. She has hopefully learned some new words and idioms, but she listens every week and it just warmed my heart to see her message and know that I'm reaching her on a topic that might not even like matter so much, even though she has her own embroidery business and that's super cool, but it was just an exciting story that I thought I'd share and hopefully also warms your heart. So, Rin, I hope that we can all see each other soon. As always, I remind you to please rate, review and subscribe if you haven't already. Wherever you're listening to this podcast, you can send me an email at podcast at hrtracy.com. You can follow me on Instagram at HR Tracy. Of course that's Tracy with an I. You can also connect with me on LinkedIn or you can go to my website and do all of that without even having having to think about it by going to our tracy.com. Thank you so much. Speak to you next week."
description = "On this episode of Bringing the Human back to Human Resources, Traci breaks down the stigmas of HR as a field, entity, and career. Sharing her experience, opinions, and questions posed to her, this inaugural episode aims to introduce individuals in all functions of a business to HR, whether you're an employee, employer, or HR professional. Episodes will be published every Tuesday--don't forget to follow Traci on Instagram at @HRTraci to get involved in the discussion!  Disclaimer: Thoughts, opinions, and statements made on this podcast are not a reflection of the thoughts, opinions, and statements of the Company Traci Rubin is actively employed by"

### TextRank 

In [27]:
import spacy
import pytextrank
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp(transcript)

textrank_summary = ""
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)
    textrank_summary += str(sent)



Certainly there are parts of HR that have less and less to do with people like you know some of the admin stuff you you could be doing a lot of manual entry and things like that.
If you like the following things people business strategy, having an impact on people so.
If you don't like people and you don't like helping people, then this is the sign you've been waiting for that you should not pursue a career in HR.
I talked to you guys about how my undying focus and goal is to destigmatize HR, and that's because there are people in HR who shouldn't be in HR who have stigmatized the career who don't care about people and don't want to help people and so.
It's HR in tech, it's people, operations in medicine.


In [28]:
textrank_summary

"Certainly there are parts of HR that have less and less to do with people like you know some of the admin stuff you you could be doing a lot of manual entry and things like that.If you like the following things people business strategy, having an impact on people so.If you don't like people and you don't like helping people, then this is the sign you've been waiting for that you should not pursue a career in HR.I talked to you guys about how my undying focus and goal is to destigmatize HR, and that's because there are people in HR who shouldn't be in HR who have stigmatized the career who don't care about people and don't want to help people and so.It's HR in tech, it's people, operations in medicine."

In [29]:
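# Note: RougeScorer.score expects (target, prediction). With the summary passed
# first, the reported precision and recall are swapped relative to the usual
# convention; fmeasure is unaffected.
textrank_score = scorer.score(textrank_summary, description)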
print('Rouge score for TextRank', textrank_score)

Rouge score for TextRank {'rouge1': Score(precision=0.32710280373831774, recall=0.23972602739726026, fmeasure=0.27667984189723316), 'rougeL': Score(precision=0.14953271028037382, recall=0.1095890410958904, fmeasure=0.12648221343873514)}
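
To make the numbers above easier to read, here is a minimal sketch of how a ROUGE-1 score can be computed by hand. It is a simplification, not the `rouge-score` implementation: it assumes plain lowercase whitespace tokenization and no stemming, and the function name `rouge1` is hypothetical. It follows the `(target, prediction)` argument order used by `RougeScorer.score`, so swapping the two arguments swaps precision and recall but leaves the F-measure unchanged.

```python
from collections import Counter


def rouge1(target: str, prediction: str):
    """Simplified ROUGE-1: unigram overlap between a target and a prediction.

    Assumption: whitespace tokenization on lowercased text, no stemming.
    Returns a (precision, recall, fmeasure) tuple.
    """
    target_tokens = target.lower().split()
    prediction_tokens = prediction.lower().split()

    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((Counter(target_tokens) & Counter(prediction_tokens)).values())

    precision = overlap / len(prediction_tokens) if prediction_tokens else 0.0
    recall = overlap / len(target_tokens) if target_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat sat"
    print(rouge1(reference, candidate))  # target first, prediction second
    print(rouge1(candidate, reference))  # swapped: precision and recall trade places
```

Because only the F-measure is used in the later dataset-level comparison, the argument order does not affect those averages.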


### Pretrained Text-To-Text Transformer (T5) (Off the shelf)

In [30]:
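# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the
# replacement for encoder-decoder models such as T5.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("deep-learning-analytics/wikihow-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("deep-learning-analytics/wikihow-t5-small")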


In [31]:
preprocess_text = transcript.strip().replace("\n","")
tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt")

summary_ids = model.generate(
            tokenized_text,
            max_length=150, 
            num_beams=2,
            repetition_penalty=2.5, 
            length_penalty=1.0, 
            early_stopping=True
        )

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)
pretrained_t5_summary = output

Token indices sequence length is longer than the specified maximum sequence length for this model (6251 > 512). Running this sequence through the model will result in indexing errors




Summarized text: 
 Get to know the HR industry.Get to know the HR industry.Make sure you have a good resume.Ask people who are interested in HR.Seek out some of your advice.
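
The tokenizer warning above reports 6251 tokens against a 512-token limit. One possible way to respect that limit, shown here only as a sketch and not as what this notebook does, is to split the token ids into fixed-size windows and summarize each window separately. The helper name `split_into_windows` is hypothetical, and the plain integer list stands in for real token ids. A simpler alternative is to pass `truncation=True, max_length=512` to `tokenizer.encode`, at the cost of discarding everything after the first 512 tokens.

```python
def split_into_windows(token_ids, max_len=512):
    """Split a flat list of token ids into consecutive windows of at most max_len.

    Hypothetical helper for illustration; the last window may be shorter.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


if __name__ == "__main__":
    # Stand-in for the 6251 token ids reported in the warning.
    fake_token_ids = list(range(6251))
    windows = split_into_windows(fake_token_ids, max_len=512)
    print(len(windows), "windows; last window has", len(windows[-1]), "tokens")
```

Each window could then be passed to `model.generate` on its own and the partial summaries concatenated, although summary quality with this approach has not been evaluated here.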


In [32]:
pretrained_t5_score = scorer.score(pretrained_t5_summary, description)
print('Rouge score for pretrained T5 (without tuning)', pretrained_t5_score)

Rouge score for pretrained T5 (without tuning) {'rouge1': Score(precision=0.12149532710280374, recall=0.40625, fmeasure=0.18705035971223022), 'rougeL': Score(precision=0.07476635514018691, recall=0.25, fmeasure=0.11510791366906474)}


### Fine-Tuned T5 on the Spotify Dataset

We load the fine-tuned T5 from the model saved after training and use it for evaluation.



In [17]:
# Google Drive Folder Update

BASE_DATASET_FOLDER = "TIS/"
ROOT_DATA_PATH = "/content/gdrive/My Drive/" + BASE_DATASET_FOLDER

# Changing directory path 
os.chdir(ROOT_DATA_PATH)

The fine-tuned model is stored in the `./tisproj` folder.

In [18]:
model_final = T5ForConditionalGeneration.from_pretrained('./tisproj',local_files_only = True)

In [19]:
tokenizer_final = T5Tokenizer.from_pretrained('./tisproj',local_files_only = True)

In [20]:
from transformers import pipeline
from transformers import T5Tokenizer, T5ForConditionalGeneration

summarizer = pipeline("summarization", model=model_final, tokenizer=tokenizer_final)


In [22]:
t5_summary = summarizer(transcript, min_length=5, max_length=100)[0]['summary_text']

In [23]:
t5_summary

"this week's episode is all about the importance of having a good resume and a strong network. --- Send in a voice message: https://anchor.fm/travis-rubin7/message Support this podcast: http://anterior-refrigerant/support/support This podcast is sponsored by  Anchor: The easiest way to make a podcast. https://answers.fandom.com/app Support"

ROUGE score for the fine-tuned T5:

In [13]:
description = "On this episode of Bringing the Human back to Human Resources, Traci breaks down the stigmas of HR as a field, entity, and career. Sharing her experience, opinions, and questions posed to her, this inaugural episode aims to introduce individuals in all functions of a business to HR, whether you're an employee, employer, or HR professional. Episodes will be published every Tuesday--don't forget to follow Traci on Instagram at @HRTraci to get involved in the discussion!  Disclaimer: Thoughts, opinions, and statements made on this podcast are not a reflection of the thoughts, opinions, and statements of the Company Traci Rubin is actively employed by"

In [33]:
t5_score = scorer.score(t5_summary, description)
print('Rouge score for T5', t5_score)

Rouge score for T5 {'rouge1': Score(precision=0.1588785046728972, recall=0.30357142857142855, fmeasure=0.20858895705521474), 'rougeL': Score(precision=0.11214953271028037, recall=0.21428571428571427, fmeasure=0.14723926380368096)}


### Comparison of the three scores 

In [None]:
print(textrank_score)
print(pretrained_t5_score)
print(t5_score)
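
The three raw `Score` dictionaries are hard to compare at a glance. The sketch below lines up the F-measures for the single example side by side. The values are copied from the outputs printed earlier and rounded to four decimals; the helper names `format_comparison` and `best_method` are hypothetical.

```python
# F-measures for the single example, rounded from the printed outputs.
results = {
    "TextRank": {"rouge1": 0.2767, "rougeL": 0.1265},
    "Pretrained T5": {"rouge1": 0.1871, "rougeL": 0.1151},
    "Fine-tuned T5": {"rouge1": 0.2086, "rougeL": 0.1472},
}


def format_comparison(results, metrics=("rouge1", "rougeL")):
    """Return a plain-text table with one row per method and one column per metric."""
    header = f"{'method':<16}" + "".join(f"{m:>10}" for m in metrics)
    rows = [header]
    for name, scores in results.items():
        rows.append(f"{name:<16}" + "".join(f"{scores[m]:>10.4f}" for m in metrics))
    return "\n".join(rows)


def best_method(results, metric):
    """Return the name of the method with the highest F-measure for a metric."""
    return max(results, key=lambda name: results[name][metric])


if __name__ == "__main__":
    print(format_comparison(results))
    for metric in ("rouge1", "rougeL"):
        print("best on", metric, "->", best_method(results, metric))
```

On this one transcript the ranking depends on the metric, which is one reason the dataset-level analysis is more informative than a single example.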

### Quantitative analysis

Having looked at a single example, we now run the analysis on a subset of the dataset. These are also the final results presented in the README.

In [26]:
# Running on 10 randomly chosen rows of the validation set

val_subset = podcast_df.sample(n=10)
val_subset

| Unnamed: 0 | episode_prefix | episode_transcript | episode_description | episode_len | capped_episode_transcript | capped_episode_len |
|---|---|---|---|---|---|---|
| 41731 | 6Cq1XugcVrGUqqH3BcJSrO | Hello everyone. This is Adam Meister the bitco... | 25 Bitcoin minutes of Caitlin Long! Play it at... | 4940 | Hello everyone. This is Adam Meister the bitco... | 4940 |
| 33733 | 1COfFbdyQ9VtfwTgyP1vkH | Hello, tribe this meditation. Wednesday is spo... | The power and energy to vibrates into you, upg... | 3207 | Hello, tribe this meditation. Wednesday is spo... | 3207 |
| 20007 | 6WEDAaVOA9a9Px1EiA7ymX | What up, Spencer big-time thoughts. We've done... | This week we cover the first week of B1G Footb... | 6518 | What up, Spencer big-time thoughts. We've done... | 6518 |
| 91108 | 0CUaPFoDpZvVVqgtkq4tSN | Before you aspire to be someone before you loo... | How did a call center agent make it all the wa... | 12005 | Before you aspire to be someone before you loo... | 7000 |
| 89708 | 2eFx2INWt84g0GkVAxM0Bc | Welcome to the body image health and fitness p... | Today I speak about social media and how it ca... | 3852 | Welcome to the body image health and fitness p... | 3852 |
| 72277 | 03CJYfZ2FCR6r8olTFR93I | Asian boss girl is brought to you by anchor an... | May 28th is Menstrual Hygiene Day. We celebrat... | 12776 | Asian boss girl is brought to you by anchor an... | 7000 |
| 95333 | 3G3JRWueMO1AqecCXwX5U4 | This episode is brought to you by TaxACT. This... | The Facts Surprisingly Awesome’s Theme Music i... | 6928 | This episode is brought to you by TaxACT. This... | 6928 |
| 24953 | 6n3qq0N44E81KyhR5kXKo3 | Are you an HR department of one trying to figu... | What do you get when you mix a young ambitions... | 14604 | Are you an HR department of one trying to figu... | 7000 |
| 9221 | 6Dh8GFSEKOrY2EkNplLzUc | This episode of the ortho bullets podcast will... | In this episode, we will review the high-yield... | 1408 | This episode of the ortho bullets podcast will... | 1408 |
| 102946 | 0JDt85aS46MIzVgQnvfCgF | Before I continue one of the ways, we keep all... | Breanna Chianne, Kittie Kaboom, and Alexis Bro... | 8114 | Before I continue one of the ways, we keep all... | 7000 |


### TextRank on Val Dataset

In [90]:
# TextRank 

def get_textrank_scores(val_subset):
  textrank_scores = []
  for i in range(len(val_subset)):
    transcript = val_subset.iloc[i]['capped_episode_transcript']
    description = val_subset.iloc[i]['episode_description']

    doc = nlp(transcript)

    textrank_summary = ""
    for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
        textrank_summary += str(sent)

    textrank_score = scorer.score(textrank_summary, description)
    avg_score = textrank_score['rouge1'].fmeasure

    textrank_scores.append(avg_score)

    print("Finished sample {}, {}".format(i,avg_score))

  final_textrank_score = sum(textrank_scores)/len(textrank_scores)
  print("TextRank performance on val dataset: {}".format(final_textrank_score))

In [86]:
get_textrank_scores(val_subset)

Finished sample 0, 0.11382113821138211
Finished sample 1, 0.06542056074766354
Finished sample 2, 0.126984126984127
Finished sample 3, 0.13201320132013203
Finished sample 4, 0.19745222929936304
Finished sample 5, 0.012499999999999999
Finished sample 6, 0.24
Finished sample 7, 0.08648648648648648
Finished sample 8, 0.048
Finished sample 9, 0.07352941176470588
TextRank performance on val dataset: 0.109620715481386


### Off the shelf T5 on Val Dataset

In [89]:
# T5 (Off the shelf) 

def get_t5_scores(val_subset, model, tokenizer):
  t5_scores = []
  for i in range(len(val_subset)):
    transcript = val_subset.iloc[i]['capped_episode_transcript']
    description = val_subset.iloc[i]['episode_description']

    preprocess_text = transcript.strip().replace("\n","")
    tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt")

    summary_ids = model.generate(
                tokenized_text,
                max_length=100, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
            )

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    pretrained_t5_summary = output

    t5_score = scorer.score(pretrained_t5_summary, description)
    avg_score = t5_score['rouge1'].fmeasure

    t5_scores.append(avg_score)

    print("Finished sample {}, {}".format(i,avg_score))

  final_t5_score = sum(t5_scores)/len(t5_scores)
  print("T5 (Off the shelf) performance on val dataset: {}".format(final_t5_score))

In [88]:
get_t5_scores(val_subset, model, tokenizer)

Finished sample 0, 0.29032258064516125
Finished sample 1, 0.27906976744186046
Finished sample 2, 0.16901408450704222
Finished sample 3, 0.06756756756756757
Finished sample 4, 0.2676056338028169
Finished sample 5, 0.031746031746031744
Finished sample 6, 0.09615384615384616
Finished sample 7, 0.15730337078651685
Finished sample 8, 0.08791208791208792
Finished sample 9, 0.0
T5 (Off the shelf) performance on val dataset: 0.14466949705629312


### Fine-Tuned T5 on Val Dataset

In [27]:
# T5 (Fine Tuned on Spotify Dataset) 

def get_t5_finetuned_scores(val_subset, model_final, tokenizer_final):
  # Build the summarization pipeline once instead of on every iteration
  summarizer = pipeline("summarization", model=model_final, tokenizer=tokenizer_final)

  t5_finetuned_scores = []
  for i in range(len(val_subset)):
    transcript = val_subset.iloc[i]['capped_episode_transcript']
    description = val_subset.iloc[i]['episode_description']

    t5_summary = summarizer(transcript, min_length=5, max_length=100)[0]['summary_text']

    t5_score = scorer.score(t5_summary, description)
    avg_score = t5_score['rouge1'].fmeasure  # ROUGE-1 F-measure only

    t5_finetuned_scores.append(avg_score)

    print("Finished sample {}, {}".format(i,avg_score))

  final_t5_score = sum(t5_finetuned_scores)/len(t5_finetuned_scores)
  print("T5 (Fine Tuned on Spotify Dataset) performance on val dataset: {}".format(final_t5_score))

In [28]:
get_t5_finetuned_scores(val_subset, model_final, tokenizer_final)

Finished sample 0, 0.3764705882352941
Finished sample 1, 0.20779220779220778
Finished sample 2, 0.20253164556962025
Finished sample 3, 0.3575418994413408
Finished sample 4, 0.2574257425742574
Finished sample 5, 0.3246073298429319
Finished sample 6, 0.1728395061728395
Finished sample 7, 0.3684210526315789
Finished sample 8, 0.6410256410256411
Finished sample 9, 0.2131979695431472
T5 (Fine Tuned on Spotify Dataset) performance on val dataset: 0.31218535828288585


#### Insights from Automatic Evaluation

* Comparing mean ROUGE-1 F-measure on the 10-row subset, the fine-tuned model (0.312) clearly outperforms both baselines, the off-the-shelf T5 (0.145) and TextRank (0.110). This supports our expectation that domain adaptation on the Spotify dataset is needed to reach a higher score.
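
With only 10 samples per method, the spread of the scores matters as much as the mean. The sketch below recomputes the means from the per-sample ROUGE-1 F-measures printed above and adds the sample standard deviation. The numbers are copied from the cell outputs; the assumption that all three runs scored the same 10 rows is not verified here.

```python
import statistics

# Per-sample ROUGE-1 F-measures copied from the cell outputs above.
per_sample = {
    "TextRank": [
        0.11382113821138211, 0.06542056074766354, 0.126984126984127,
        0.13201320132013203, 0.19745222929936304, 0.012499999999999999,
        0.24, 0.08648648648648648, 0.048, 0.07352941176470588,
    ],
    "Off-the-shelf T5": [
        0.29032258064516125, 0.27906976744186046, 0.16901408450704222,
        0.06756756756756757, 0.2676056338028169, 0.031746031746031744,
        0.09615384615384616, 0.15730337078651685, 0.08791208791208792, 0.0,
    ],
    "Fine-tuned T5": [
        0.3764705882352941, 0.20779220779220778, 0.20253164556962025,
        0.3575418994413408, 0.2574257425742574, 0.3246073298429319,
        0.1728395061728395, 0.3684210526315789, 0.6410256410256411,
        0.2131979695431472,
    ],
}


def summarize_scores(per_sample):
    """Return {method: (mean, sample standard deviation)} for each score list."""
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in per_sample.items()
    }


if __name__ == "__main__":
    for name, (mean, stdev) in summarize_scores(per_sample).items():
        print(f"{name:<18} mean={mean:.4f} stdev={stdev:.4f}")
```

A larger evaluation subset would be needed before reading much into differences between individual samples.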


### Manual Evaluation

We asked five English-speaking volunteers to rate the summaries on the scale from Bad (B) to Excellent (E) defined in the original paper. Human ratings let us capture the subjectivity of summary quality, that is, how relevant a summary is to a human reader.

#### Insights from Manual Evaluation

* 3 out of 5 evaluators felt that the summaries generated by TextRank and the fine-tuned T5 were comparable, and rated both Fair (F).
* 1 evaluator felt that TextRank was clearly better, and 1 evaluator felt that, given enough data, the fine-tuned T5 produces a much better abstraction of the podcast transcript.
* 5 out of 5 evaluators agreed that the fine-tuned T5 generated better summaries than the off-the-shelf pretrained T5. This supports our assumption that domain adaptation is needed.


### Room for improvement and Error Analysis

* Dataset - The episode description may not be an ideal ground truth for a summary: it often contains promotional material, which the model learns to append to every summary and which then has to be removed in post-processing.
* Compute - Even on the best Colab settings, the T5 model can only take a limited number of tokens; with more compute, T5 could potentially generate even better summaries. Nevertheless, deep-learning-based techniques seem to be infeasible for simple use cases.
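
As an illustration of the post-processing step mentioned under Dataset, the sketch below cuts a generated summary at the first promotional marker. The marker list is an assumption drawn from the single fine-tuned output shown earlier, not an exhaustive set, and the helper name `strip_promotional_text` is hypothetical.

```python
# Hypothetical markers of promotional text, taken from the sample output
# of the fine-tuned model; a real list would need to be curated.
PROMO_MARKERS = (
    "---",
    "Send in a voice message",
    "Support this podcast",
    "This podcast is sponsored by",
)


def strip_promotional_text(summary: str, markers=PROMO_MARKERS) -> str:
    """Keep only the text before the earliest promotional marker, if any."""
    cut = len(summary)
    for marker in markers:
        idx = summary.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return summary[:cut].strip()


if __name__ == "__main__":
    generated = (
        "this week's episode is all about the importance of having a good "
        "resume and a strong network. --- Send in a voice message: "
        "https://anchor.fm/travis-rubin7/message Support this podcast"
    )
    print(strip_promotional_text(generated))
```

A cleaner alternative would be to strip such text from the episode descriptions before fine-tuning, so the model never learns to produce it.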
